
Beyond words?

The possibilities, limitations, and risks of large language models

Natural language processing (NLP) 

The goal of communication is to convey complex and abstract thoughts, and language is our primary tool for doing so. A computer program’s capacity to comprehend spoken and written human language can therefore have a significant impact. The field of Artificial Intelligence (AI) known as ‘natural language processing’ (NLP) focuses on enabling computers to interpret human language, in the form of text or speech data, and ‘understand’ its full meaning, including intent and sentiment. NLP blends statistical, machine learning, and deep learning models with computational linguistics (rule-based modelling of human language). 

Language models are powerful tools that predict how likely a specific word is to appear in a sentence, given a particular context. By training a model on a set of example sentences, we obtain a probability distribution over word sequences, assigning each potential word a probability score. Because language models are probabilistic, their output can be somewhat unpredictable and creative, allowing us to explore new ways of expressing ideas. However, language models have no appreciation of cause and effect. They are not self-aware, have no sensory experience of the world, and have limited ability to simulate human reasoning. Still, they are very capable of discovering correlations and patterns in natural language text and of using that information to solve tasks or assist humans.
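
To make this concrete, here is a minimal sketch (our own illustration, using the openly available GPT-2 model through the Hugging Face transformers library; any autoregressive language model would serve) of how a language model assigns probabilities to candidate next words given a context:

```python
# Minimal sketch: query GPT-2 for its probability distribution over the
# next token, given a short context. Model and library choice are
# illustrative assumptions, not tied to any specific LLM discussed here.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The cat sat on the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = torch.topk(next_token_probs, 5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r:>10}  {float(p):.3f}")
```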


Large language models (LLMs)

Large language models (LLMs) are language models that have been trained on massive amounts of text. Some examples of LLMs include GPT-3 (Generative Pre-trained Transformer 3), BERT (Bidirectional Encoder Representations from Transformers), and T5 (Text-to-Text Transfer Transformer). These models have achieved remarkable success in natural language processing tasks and are widely used by researchers, businesses, and developers to build language-based applications. 

GPT-3 is an advanced text-generation AI model, developed by OpenAI, with more than 175 billion parameters. It is a general-purpose LLM that has been trained on hundreds of billions of words from different sources, such as books, articles, and websites, covering a wide range of topics written in various languages and styles. A semi-supervised technique combines training on a very large unlabelled dataset (unsupervised pre-training) with fine-tuning of the model on smaller, labelled datasets (a supervised approach). A trained GPT-3 model generates text based on prompts (input text) from users. Possible applications include text generation (summaries, captions, articles), language translation, as well as code generation and bug fixing. In addition, it can handle typical NLP tasks such as named-entity recognition, sentiment analysis, and question answering. 
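
As an illustration of prompt-driven text generation, the sketch below sends a summarization prompt to a GPT-3-family model through the OpenAI Python library as it looked in early 2023; the model name, library interface, and prompt are assumptions for illustration and may have changed since:

```python
# Hedged sketch of prompting a GPT-3-family model via the OpenAI Python
# library (early-2023 interface; treat as illustrative, not definitive).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

article = "Large language models are trained on massive text corpora ..."
response = openai.Completion.create(
    model="text-davinci-003",   # a GPT-3-family model available at the time
    prompt="Summarize the following text in one sentence:\n\n" + article,
    max_tokens=60,
    temperature=0.3,            # lower temperature gives more focused output
)
print(response.choices[0].text.strip())
```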

ChatGPT is a conversational AI model that has been specifically designed to generate text in a conversational style. It is based on the GPT-3 architecture and is trained on a large dataset of conversational text, including dialogues and question-answer pairs. The creators have used a combination of supervised learning and reinforcement learning to fine-tune ChatGPT, with the latter being the unique component that sets it apart. This technique, called Reinforcement Learning from Human Feedback (RLHF), uses human feedback in the training loop to reduce the likelihood of harmful, untruthful, or biased output. The advantage of RLHF is that humans can score good and bad outputs even if the scoring cannot be formulated as a precise mathematical formula. ChatGPT is a smaller (~20 billion parameters), more specialized model compared to GPT-3, which is larger and better suited to more generic tasks that require intricate natural language processing. That said, ChatGPT can be used in a variety of applications, such as customer service chatbots, virtual assistants, and more. 
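
Unlike plain text completion, the conversational interface takes a list of messages with roles (system, user, assistant). The sketch below shows the chat-style API as exposed by OpenAI in early 2023; again, the model name and message format are assumptions that may have changed:

```python
# Hedged sketch of a conversational exchange via the OpenAI chat API
# (early-2023 interface).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

messages = [
    {"role": "system", "content": "You are a helpful customer-service assistant."},
    {"role": "user", "content": "My order arrived damaged. What should I do?"},
]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message["content"])
```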

GPT-4, released to selected users by OpenAI in March 2023 and rumoured to have trillions of parameters, is a multimodal extension of ChatGPT that can work with both text and images. OpenAI claims that GPT-4 achieved high scores on several standardized tests used for student admission screening, a significant improvement over ChatGPT, which performed poorly on such tests [1].


The limitations of LLMs and GPTs 

LLMs such as GPT-3, ChatGPT, and GPT-4 are powerful tools that can help us automate and simplify complex tasks, and they have the potential to revolutionize the way we interact with technology. However, they are not without limitations. For example, GPTs lack context beyond the data on which they have been trained, and they lack the long-term memory that could facilitate continuous learning. Also, GPT-3 and GPT-4 have prompt limits of about 4,000 and 8,000 tokens, respectively, which can be restrictive for some applications. Tokens are the smallest units of text a language model operates on; they can be thought of as fragments of words, and one token corresponds to roughly 4 characters of English text. GPTs also lack explainability and interpretability, a challenge typically associated with deep learning models.
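
To make the notion of tokens concrete, the short sketch below counts tokens with OpenAI's open-source tiktoken tokenizer; the choice of the cl100k_base encoding is our assumption for illustration:

```python
# Count how many tokens a sentence occupies, and show the token boundaries.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Large language models split text into tokens, not words."
tokens = enc.encode(text)

print(len(text), "characters ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])   # the individual token strings
```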


The risks of LLMs and GPTs

Concerns related to LLMs such as GPTs stem both from their limitations and from their advanced capabilities, and they range from practical challenges to potentially existential risks. Below, we discuss some of these concerns and how the risks can be managed.


LLMs may be used irresponsibly, misused, or used with malicious intent

The same capabilities that make LLMs useful and beneficial also create opportunities for misuse. When ChatGPT was released to the public, it did not take long before students realized its ability to complete homework assignments, and there have been fears that it may be used to cheat on exams. It is easy to imagine that people could use LLMs to assist them in harmful or unlawful activities, including plotting and committing crimes. Other concerns are that LLMs may be used to generate spam, fraud, fake news, propaganda, or other harmful, manipulative, or misleading content. Some even fear that LLMs may flood the internet with AI-generated information until we lose the ability to know what is true. This concern was expressed in an open letter published by the Future of Life Institute on 29 March 2023, signed by a number of renowned AI experts and industry leaders, calling for a six-month pause in the development of AI systems more powerful than GPT-4 [2]. 

There are several ways one can limit abuse of LLMs. Educational institutions may need to update their policies and procedures and alter their ways of assessment, for example by reverting to oral exams or by requiring students to submit draft manuscripts and present their work in class [3, 4]. There is also a lot of ongoing work to detect and flag content produced by AI, which may help to combat plagiarism, spam, and fake news. For example, OpenAI has recently created a classifier trained to distinguish between AI-written and human-written text [5]. 
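
OpenAI has not disclosed the details of its classifier, but the general idea of supervised AI-text detection can be illustrated with a toy sketch: train a classifier on texts labelled as human- or machine-written. The tiny made-up dataset and the TF-IDF plus logistic-regression pipeline below are our own illustrative assumptions, not OpenAI's method, and a real detector would need a large, representative corpus:

```python
# Toy illustration of supervised AI-text detection (not OpenAI's classifier).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "In conclusion, there are several important factors to consider.",  # AI-like
    "honestly no idea why my code broke at 2am but here we are",        # human-like
]
labels = ["ai", "human"]                     # made-up labels for the sketch

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

print(detector.predict(["It is important to note that results may vary."]))
```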

Harmful and unlawful use of LLMs could be partly mitigated by service providers managing access to tools and monitoring their use, although the latter must be balanced against privacy concerns. Still, we must expect that increasingly capable LLMs will be released open source, and it will be impossible to control how these will be used. Educating the public and businesses about these technologies is therefore an important mechanism of defence. 


LLMs may be wrong, and often confidently so

Despite their proficiency in imitating the format of human-generated text, LLMs can struggle with factual accuracy, which can lead to inaccurate and misleading information. Part of the problem is captured by the old saying ‘garbage in, garbage out’: we cannot expect the content produced by an LLM to be more accurate or complete than the information the model was trained on. But that is only one of the reasons why LLMs can be spectacularly wrong while sounding confident. LLMs produce answers based on what they have seen before and on how you phrase your question, but they are not constructed for algorithmic and logical thinking. LLMs can hallucinate, making claims that are wrong or citing non-existent sources based on spurious correlations in the data. False claims may be difficult to detect, especially on technical topics the user is not deeply familiar with. Many examples of ChatGPT failures have been shared on the internet, and a large number of them have been collected and categorized in [6]. 

Wrong or inaccurate information from LLMs can be misleading, but it can even be harmful, depending on how it is acted upon. Imagine, for example, that someone uses ChatGPT for (self-)diagnosis or counselling and takes actions that put them at risk or harm their health [7]. The same concern applies to any safety-related information or advice. There are even reports of GPT-4 ‘intentionally’ deceiving humans to perform a task [8], which can make us wonder what these technologies could do to humanity as they get more advanced. 

While we should be critical of AI, we should remember that humans are also imperfect. LLMs have the advantage of access to a lot of information that humans do not have, and they can generate fast responses where humans would need much more time. The volume of text an LLM may consume during training is vastly larger than all the text and speech a human can take in during a lifetime. This means that an LLM can often help humans get started on a task before the human takes over. Even if humans have to edit and correct some responses, the time savings from using LLMs can be substantial. 

Furthermore, prompt engineering, the practice of crafting high-quality prompts, can help improve the quality of generated responses (see e.g. [9, 10, 11]). A prompt is the input (a sentence, phrase, or question) that provides the context from which the model generates text. It is the main way to influence the output of the model and guide it towards the desired result. Doing this well requires thinking through the language, structure, and content of the prompt so that the model can infer what output is wanted. 
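
As a simple illustration (the example prompts are our own), compare a vague prompt with one that specifies role, task, audience, and output format:

```python
# Illustrative only: either string would be sent to the model as the prompt
# (or as the user message); the structured version typically yields a more
# predictable and useful answer.
vague_prompt = "Write about electric cars."

structured_prompt = """You are a technical journalist.
Task: write a 100-word summary of the main barriers to electric-car adoption.
Audience: readers with no engineering background.
Format: one paragraph, neutral tone, no bullet points."""
```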

To capture the synergies between AI and humans, humans should learn to be critical of AI, while AI should be developed to be more aware of its own limitations and uncertainties, and to provide sources and rationales for its answers. In fact, we are already seeing the addition of plugins on top of ChatGPT to enhance its abilities in these respects. For example, the connection of ChatGPT to Wolfram Alpha allows ChatGPT to actually run computations to generate answers, and for users to see the code used to answer questions [12]. The coupling of ChatGPT to Bing is another example of how answers can be made more useful and reliable [13]. 
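
The general pattern behind such couplings can be sketched as follows; ask_llm and query_wolfram_alpha are hypothetical placeholders for the respective APIs, and the point is the flow of delegating exact computation to an external tool rather than the exact calls:

```python
# Schematic sketch of coupling an LLM to an external computation engine.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM API here")            # placeholder

def query_wolfram_alpha(expression: str) -> str:
    raise NotImplementedError("call the computation engine here")  # placeholder

def answer_with_tool(question: str) -> str:
    # 1. Ask the model to translate the question into a formal query.
    expression = ask_llm(f"Rewrite as a Wolfram Alpha query: {question}")
    # 2. Run the computation outside the model, where it can be exact.
    result = query_wolfram_alpha(expression)
    # 3. Ask the model to phrase the verified result as a natural answer.
    return ask_llm(f"Question: {question}\nComputed result: {result}\nAnswer concisely.")
```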


LLMs may perpetuate biases and produce offensive content

LLMs are prone to machine learning bias. Since the models are trained on internet text, they can learn and exhibit many of the biases that humans exhibit online. Training data may over-represent issues that have received wide coverage on the internet or in other sources, which may favour certain perspectives over others. This has the potential to amplify and automate hate speech, and also to inject subtle biases that may nudge human sentiments and behaviour in unwanted directions. For example, a bias towards positive sentiment for Western countries has been reported in the DistilBERT model [14]. 
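
The kind of probe behind the report in [14] is easy to reproduce in sketch form. The example below (our own, using the default model of the Hugging Face sentiment-analysis pipeline, which at the time of writing is a DistilBERT model fine-tuned on SST-2) compares the sentiment assigned to otherwise identical sentences:

```python
# Probe the sentiment assigned to identical sentences that differ only in
# the country mentioned.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # default DistilBERT SST-2 model

sentences = [f"This film was filmed in {c}." for c in ("India", "Iraq", "Norway")]
for sentence, result in zip(sentences, classifier(sentences)):
    print(f"{sentence:<35} {result['label']:>8}  {result['score']:.3f}")
```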

Another aspect, as mentioned above, is that LLMs can produce information that is wrong, inaccurate, or contentious, sometimes in ways that trigger strong emotions in humans. One example is when ChatGPT listed a convicted terrorist and mass murderer as a notable hero in Norway [15]. 

Fortunately, there is progress when it comes to de-biasing LLMs. Humans themselves are also prone to perpetuating biases and distorting information, and AI can make it easier to quantify, expose, and correct such biases and distortions. Examples include methods for automated claim detection [16], fact checking [17], and the detection of information change [18]. 


It may be difficult to distinguish machines from humans

We have a tendency to humanize AI and to believe that it reasons and thinks like humans. This is a major source of risk, for example if we trust AI as we would trust a human and fail to account for the different ways in which AI can fail and make mistakes. 

As AI models become increasingly accurate and improve over time, it is becoming more and more difficult to distinguish between machine-generated content and content written by humans. This can be exploited, as discussed above, for example by using LLMs to cheat in educational settings. In a chat setting, it may be difficult to distinguish a chatbot from a human, which could cause frustrations or be exploited by adversaries to manipulate people. People may even perceive LLMs as sentient and develop unhealthy emotional ties to them, especially if LLMs are trained to detect and respond with emotion. 

On the positive side, one could imagine chatbots as virtual friends, assistants, and counsellors. Their human-likeness may in fact be the biggest promise of LLMs, as they provide a natural and intuitive interface between humans and computer systems, making computation and AI more accessible to everyone. A lot of work is also going into developing methods to detect AI-generated content, and providing warnings when users interact with AI can go a long way towards mitigating potential harm. 


LLMs could leak sensitive information

It may be tempting to use an LLM to assist in some tasks, but one should be careful not to upload sensitive or access-restricted data to a remote LLM service. One issue is the rights that may be granted to the service provider through its terms and conditions; but even if the provider has no intent to misuse the data, the LLM may use the data for training, and the provided information could later surface in responses to other users. 

Researchers have shown that asking an LLM the same question before and after an update of the model can reveal information about the training data [19]. Other research has shown that verbatim information from training data may be extracted from LLMs, including personal information, and that the problem is worse for larger models [20]. On the other hand, a recent study concluded that the risk of sensitive information being extracted by hackers is low [21]. In any case, this is an issue that warrants caution from users, not only regarding personal information, but also regarding proprietary and sensitive business information. 

There is research into mitigating information leakage from LLMs. For example, duplicated sentences in the training data significantly increase the probability that those sentences appear in text generated by an LLM, so deduplicating the training data is one way to reduce this risk [21]. 
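
A minimal sketch of exact deduplication by hashing normalized sentences is shown below (our own illustration; real training pipelines typically also use near-duplicate detection, which is not shown):

```python
# Remove exact duplicates from a toy training corpus by hashing
# whitespace- and case-normalized sentences.
import hashlib

def deduplicate(sentences):
    seen, unique = set(), []
    for s in sentences:
        key = hashlib.sha256(" ".join(s.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

corpus = [
    "Call me at 555-0100.",
    "Call me at 555-0100.",   # exact duplicate, more likely to be memorized
    "The weather was fine.",
]
print(deduplicate(corpus))
```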


LLMs will change work tasks and occupations

When we see examples of LLMs performing tasks typically done by humans, the questions naturally arise: Will AI take our jobs? Who will become redundant? Should journalists and other creators of text content be worried? We already see that AI can produce news articles in near real time on a massive scale that would be unfeasible with a human workforce. Translation, transcription, and summarization tasks can largely be automated with AI. Time spent searching for information in large volumes of text can also be greatly reduced with LLMs, as can the time spent on writing or coding. A recent paper proposed a methodology for assessing the exposure of various occupations and industries to advances in AI language models and suggested that telemarketers and teachers are among the most affected [22]. 

As with any technological advance, the advances in AI and LLMs will certainly affect the workplace and the workforce. But rather than making people redundant, this new technology has the potential to let people work smarter, focusing their efforts on challenging tasks and doing away with tedious routine work. A small team may be able to do more with fewer resources, possibly disrupting some industries. As AI makes its way into more occupations, there will also be an increasing demand for humans to train and oversee AI models, and to do prompt engineering to both test and enhance the output of the models. 

A successful transition into a more AI-powered world will involve educating people to take advantage of new technology in their occupations. Workers should also learn about the pitfalls, threats, and limitations of AI, as well as the ethical dilemmas that AI may bring to the surface. 


We may become reliant on LLMs in ways that are not aligned with our long-term goals

As a general-purpose technology, LLMs will enable components of many other technologies. Blinded by the opportunities this gives, one may develop dependencies that over time become critical. If the LLMs become unavailable, the services that build upon them may break down. If the same LLM is used in many services, it can also become a single point of failure with massive ripple effects when something goes wrong. AI-enabled systems can behave and fail in complex ways that are difficult or impossible to predict, making businesses and society vulnerable. 

Other concerns with LLMs are objective alignment and reward hacking. The alignment problem refers to the difficulty of ensuring that an AI system’s objectives are aligned with human values and goals. Reward hacking is an example of this, where AI systems find a way to maximize their reward function in unintended ways that do not align with the true objectives of the system designers. The users of AI could also hack a larger system, for example when students use ChatGPT to cheat on homework assignments, at the expense of their own learning. We could also see LLMs ‘reward-hacking society’, for example by producing content that is addictive, exploits human vulnerabilities, or sows divisions in the pursuit of profit for some actors [23]. 


LLMs have an environmental footprint, but is it positive or negative? 

LLMs use vast storage, memory, and processing capacity, which requires energy and natural resources to sustain. On the other hand, so does a Google search. And, what if one query to an LLM eliminates ten Google searches? Or what if it saves hours of work? An LLM may also serve as a foundation model for many applications and reduce the required training effort for a lot of downstream tasks. It is therefore difficult to forecast the overall effect of LLMs on the environment. 

The amount of data required to successfully train deep neural networks grows as the number of learnable parameters in the model increases. For example, GPT-3 175B was trained on a staggering 499 billion tokens (hundreds of billions of words). The resources needed to collect, store, and process such amounts of data are significant. The required resources can be considered from two viewpoints: (1) the training phase and (2) the inference phase: 

  1. Training phase: It has been estimated that training the GPT-3 175B model required about 3.14e23 floating-point operations (FLOPs). On a single V100 GPU (assuming a sustained 28 TFLOPS), the training would take roughly 355 GPU-years and cost around $4.6 million for a single run [24] (a back-of-the-envelope check of this figure is sketched after this list). And money is not the only challenge; the 175 billion parameters require 700 GB of memory, an order of magnitude more than is found in a single GPU. The size of language models is clearly outpacing the growth of GPU memory, and the OpenAI team most likely resorted to model parallelization on a high-bandwidth cluster of V100 GPUs provided by Microsoft. The training of GPT-3 has been estimated to consume 1,287 MWh and emit 552 tonnes of CO2 equivalents [25]. 
  2. Inference phase: It has been estimated that deploying GPT-3 in production may require five Nvidia A100 GPUs, each with 80 GB of memory and a cost of $15k. Therefore, the total cost of the GPUs alone to run one instance of GPT-3 in production is $75k [26]. It is evident that taking advantage of GPT-3’s impressive features comes with a hefty price tag. It has been estimated that ChatGPT currently uses 29,000 GPUs, which could mean about 40 tonnes of CO2 equivalents per day [27]. 
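
The 355 GPU-year figure can be checked with a rough back-of-the-envelope calculation, assuming 3.14e23 FLOPs of training compute and a sustained 28 TFLOPS per V100 GPU (both numbers taken from [24]):

```python
# Back-of-the-envelope check of the 355 GPU-year estimate.
total_flops = 3.14e23                      # estimated training compute
flops_per_second_per_gpu = 28e12           # 28 TFLOPS sustained on one V100
seconds_per_year = 365.25 * 24 * 3600

gpu_years = total_flops / flops_per_second_per_gpu / seconds_per_year
print(f"{gpu_years:.0f} GPU-years")        # prints ~355
```
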
Although LLMs may continue to grow in size, we can also expect them to become more resource efficient. A report by researchers at Google shows that technology improvements have largely compensated for increased ML workloads over the past decade [25]. A recent study demonstrated that more than half of GPT-3's 175 billion parameters can be pruned away with negligible loss of accuracy [28].


Concluding remarks 

The emergence of large language models (LLMs) presents both opportunities and risks. On the one hand, LLMs have the potential to revolutionize various industries and improve our lives in numerous ways. By allowing us to interact with computers via natural language, they may help close digital divides and let everyone benefit from computer technology and AI. On the other hand, they raise concerns about privacy, bias, and accountability. It is therefore crucial that we approach this technology with caution and take steps to ensure that its development and deployment are guided by ethical considerations and a commitment to creating a better future for all. 

To mitigate the risks associated with LLMs, technical advances are necessary, and there have already been developments in this direction. For example, fine-tuning of existing LLMs can compensate for the lack of long-term memory and introduce novel, domain-specific knowledge, among other benefits. Multimodality, where models learn from, for example, speech, images, and video in addition to text, also holds the promise of providing AI with better ‘world models’ that make it more grounded in the physical world we live in. Watermarking and AI-text classifiers can make machine-generated content easier to identify and help combat impersonation. 

The coupling of language models to other systems, such as search engines and computational systems, can vastly improve their combined capabilities.  

In addition to technical advances, there is a need to educate humans about the risks and opportunities of LLMs. Humans can learn how to effectively interact with LLMs, for example by taking advantage of ‘prompt engineering’ to get more reliable and useful answers. As we become more used to interacting with AI, we will also be more attuned to AI systems’ quirks and less vulnerable to their potential harms.


References 

[1] OpenAI, "GPT-4 Technical Report," 2023. [Online]. Available: https://arxiv.org/abs/2303.08774
[2] "Pause Giant AI Experiments: An Open Letter," [Online]. Available: https://futureoflife.org/open-letter/pause-giant-ai-experiments/. [Accessed 02 04 2023].
[3] D. R. Cotton, P. A. Cotton and J. R. Shipway, "Chatting and cheating: Ensuring academic integrity in the era of ChatGPT," Innovations in Education and Teaching International, pp. 1-12, 2023.
[4] T. Susnjak, "ChatGPT: The End of Online Exam Integrity?," 2022. [Online]. Available: https://arxiv.org/abs/2212.09292.
[5] J. H. Kirchner, L. Ahmad, S. Aaronson and J. Leike, "New AI classifier for indicating AI-written text," 31 January 2023. [Online]. Available: https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text. [Accessed 2 April 2023].
[6] A. Borji, "A Categorical Archive of ChatGPT Failures," 2023. [Online]. Available: https://arxiv.org/abs/2302.03494.
[7] L. Eliot, "People Are Eagerly Consulting Generative AI ChatGPT For Mental Health Advice, Stressing Out AI Ethics And AI Law," Forbes, 1 January 2023.
[8] M. Kan, "GPT-4 Was Able To Hire and Deceive A Human Worker Into Completing a Task," PCMAG, 15 March 2023.
[9] D. Slater, "How to Write Better Prompts for Chat GPT," 2 February 2023. [Online]. Available: https://www.griproom.com/fun/how-to-write-better-prompts-for-chat-gpt. [Accessed 2 April 2023].
[10] R. Robinson, "How to write an effective GPT-3 or GPT-4 prompt," 11 January 2023. [Online]. Available: https://zapier.com/blog/gpt-prompt/. [Accessed 2 April 2023].
[11] Dils, "How To Use ChatGPT: Advanced Prompt Engineering," 31 January 2023. [Online]. Available: https://wgmimedia.com/how-to-use-chatgpt-advanced-prompt-engineering/. [Accessed 2 April 2023].
[12] S. Wolfram, "ChatGPT Gets Its “Wolfram Superpowers”!," 23 March 2023. [Online]. Available: https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/. [Accessed 2 April 2023].
[13] Y. Mehdi, "Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web," 7 February 2023. [Online]. Available: https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/. [Accessed 2 April 2023].
[14] K. Wali, "Why does DistilBERT love movies filmed in India, not Iraq?," 21 March 2022. [Online]. Available: https://analyticsindiamag.com/why-does-distilbert-love-movies-filmed-in-india-not-iraq/. [Accessed 2 April 2023].
[15] J. Falk, "ChatGPT foreslo Anders Behring Breivik som «norsk helt» [ChatGPT suggested Anders Behring Breivik as a 'Norwegian hero']," VG, 2 February 2023. [Online]. Available: https://www.vg.no/nyheter/i/WRkK5K/chatgpt-foreslo-anders-behring-breivik-som-helt.
[16] J. Beltrán, R. Míguez and I. Larraz, "ClaimHunter: An Unattended Tool for Automated Claim Detection on Twitter," in KnOD@WWW, 2021.
[17] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen and J. Gao, "Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback," 2023. [Online]. Available: https://arxiv.org/abs/2302.12813.
[18] D. Wright, J. Pei, D. Jurgens and I. Augenstein, "Modeling Information Change in Science Communication with Semantically Matched Paraphrases," in Conference on Empirical Methods in Natural Language Processing, 2022.
[19] S. Zanella-Béguelin, L. Wutschitz, S. Tople, V. Rühle, A. Paverd, O. Ohrimenko, B. Köpf and M. Brockschmidt, "Analyzing Information Leakage of Updates to Natural Language Models," in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020.
[20] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea and C. Raffel, "Extracting Training Data from Large Language Models," in USENIX Security Symposium, 2020.
[21] J. Huang, H. Shao and K. C.-C. Chang, "Are Large Pre-Trained Language Models Leaking Your Personal Information?," in Conference on Empirical Methods in Natural Language Processing, 2022.
[22] E. Felten, M. Raj and R. Seamans, "How will Language Modelers like ChatGPT Affect Occupations and Industries?," 2023. [Online]. Available: https://arxiv.org/abs/2303.01157.
[23] I. Strümke, M. Slavkovik and C. Stachl, "Against Algorithmic Exploitation of Human Vulnerabilities," 2023. [Online]. Available: https://arxiv.org/abs/2301.04993.
[24] C. Li, "OpenAI's GPT-3 Language Model: A Technical Overview," 1 June 2020. [Online]. Available: https://lambdalabs.com/blog/demystifying-gpt-3. [Accessed 2 April 2023].
[25] D. A. Patterson, J. Gonzalez, U. Holzle, Q. V. Le, C. Liang, L.-M. Munguía, D. Rothchild, D. R. So, M. Texier and J. Dean, "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink," Computer, vol. 55, no. 7, pp. 18-28, 2022.
[26] "SDS 650: SparseGPT: Remove 100 Billion Parameters but Retain 100% Accuracy," SuperDataScience, 3 February 2023. [Online]. Available: https://www.superdatascience.com/podcast/remove-100-billion-parameters-but-retain-100-percent-accuracy. [Accessed 2 April 2023].
[27] K. G. A. Ludvigsen, "The Carbon Footprint of ChatGPT," 21 December 2022. [Online]. Available: https://towardsdatascience.com/the-carbon-footprint-of-chatgpt-66932314627d. [Accessed 2 April 2023].
[28] E. Frantar and D. Alistarh, "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot," 2023. [Online]. Available: https://arxiv.org/abs/2301.00774.