A major concern of the past few years in the context of Large Language Models has been “hallucination”. An LLM is said to hallucinate when it generates a response that is grammatically correct and fluent but factually wrong or logically inconsistent.
AI chatbots such as GPT-4, the newly released Gemini 1.5 Pro, and Claude 3 have all been subject to hallucinations, and the AI enthusiasts and developers who depend on these tools are having a hard time diagnosing and working around the issue.
Users rely on AI-generated replies to support their work and decisions without realizing they may be acting on inaccurate information. This can lead to unfavourable results, especially in fact-driven fields like banking, law, and healthcare.
What is hallucination in Large Language Models, and how can we reduce it to get more reliable responses? In this article, we are going to explore these questions in depth. So, let’s get right into it!
What is Hallucination in LLMs?
When ML models, especially large language models (LLMs) such as GPT-3 or GPT-4, generate outputs that are grammatically correct and coherent but factually erroneous or nonsensical, the phenomenon is known as an LLM hallucination.
In this sense, the creation of inaccurate or misleading information is referred to as “hallucinations.” Several things, including the intrinsic complexity of language, biases in the model, and restrictions in the training data, can cause these hallucinations.
When a user requests something from a generative AI tool, they usually want a response that answers the prompt correctly. But occasionally the output is not grounded in the training data, the transformer decodes it incorrectly, or the model follows no discernible pattern. Such output is said to be “hallucinated.”
LLMs do not provide citations for their responses because they are not search engines or databases. These models extrapolate from the prompt you provide to produce text. Although the output is the continuation most strongly correlated with the prompt, it is not always supported by any training data.
Some well-known instances of LLMs hallucinating are as follows:
- The erroneous claim made by Google’s Bard chatbot that photographs of a planet outside of our solar system were taken by the James Webb Space Telescope.
- Radio host Mark Walters was accused of embezzling money by ChatGPT when asked to summarize the Second Amendment Foundation v. Ferguson lawsuit. This led to OpenAI getting sued.
- In 2022, Meta withdrew its Galactica LLM demo after it gave users false, and occasionally biased, information.
- ChatGPT cited nonexistent medical references on general brain-related topics.
- Sydney, Microsoft’s conversational AI, claimed that it had spied on Bing employees and fallen in love with users.
LLMs continue to hallucinate to this day, and it remains a serious issue for content generation and citation.
Why do LLMs hallucinate?
There are several reasons why LLMs produce hallucinated responses. Some lie in the LLM’s architecture, and some can be attributed to the way users structure their prompts and the information they demand. Let’s look at these reasons one by one:
1) Noisy or incomplete training data
Incomplete, irrelevant, inaccurate, outdated, or otherwise incorrect data in the training set leads to gaps and errors in the model’s comprehension. The outputs produced from such data are correspondingly inaccurate.
2) Incorrect Encoding
The encoder converts the input text (training data that an embedding model turns into vectors) into vector representations; if it misinterprets the text, those representations are faulty. Consequently, the decoder produces misleading text outputs.
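As a rough illustration of this failure mode (using tiny, made-up 3-dimensional vectors instead of a real embedding model), the sketch below shows how a distorted vector representation changes which stored text is treated as the closest match, which is exactly the kind of error that propagates into misleading generated text:

```python
import numpy as np

# Toy "embeddings": hand-made 3-d vectors standing in for what a real embedding model produces.
knowledge = {
    "The Eiffel Tower is in Paris.": np.array([0.9, 0.1, 0.0]),
    "The Colosseum is in Rome.":     np.array([0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_match(query_vec):
    # Return the stored text whose vector is most similar to the query vector.
    return max(knowledge, key=lambda text: cosine(query_vec, knowledge[text]))

faithful_encoding = np.array([0.85, 0.15, 0.0])  # a faithful encoding of "Where is the Eiffel Tower?"
garbled_encoding  = np.array([0.15, 0.85, 0.0])  # the same question, mis-encoded

print(closest_match(faithful_encoding))  # -> the Paris sentence
print(closest_match(garbled_encoding))   # -> the Rome sentence: downstream text becomes misleading
```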
3) Lack of Foundation
Unlike humans, these models lack real-world experience and access to real-time data, which restricts their comprehension and increases the possibility of errors. They have no grounding in reality of the kind humans draw on when generating content.
4) Inaccurate Decoding
Decoders in LLM-powered chatbots such as Gemini and ChatGPT generate a ranked list of candidate tokens with their corresponding probabilities. However, they don’t always choose the highest-probability token, because the resulting text would read as extremely flat.
To produce more imaginative answers, they occasionally sample lower-ranked tokens at random. The more randomness is added, the more creative the model’s outputs become, but the greater the risk that they are hallucinations.
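Here is a minimal sketch of that trade-off, using a made-up next-token distribution rather than a real model (the token list, logits, and temperature value are all illustrative): greedy decoding always picks the top-probability token, while temperature sampling occasionally picks lower-ranked, potentially hallucinated ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up candidate next tokens and logits after the prompt "The capital of France is".
tokens = ["Paris", "Lyon", "London", "Mars"]
logits = np.array([4.0, 1.5, 0.5, -1.0])

def sample(logits, temperature):
    # Softmax with temperature: a higher temperature flattens the distribution,
    # giving low-probability (potentially hallucinated) tokens a bigger chance.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return tokens[rng.choice(len(tokens), p=probs)]

greedy = tokens[int(np.argmax(logits))]                          # always "Paris"
creative = [sample(logits, temperature=1.5) for _ in range(10)]  # may include "London" or "Mars"
print(greedy, creative)
```

Real decoders add further tricks such as top-k and nucleus (top-p) sampling, but the underlying tension between diversity and factuality is the same.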
5) Uncertain Questions
If the input question or prompt is unclear, the model falls back on what it judges to be the most likely interpretation, so its response may deviate from the user’s intended meaning. In such situations LLMs sometimes keep hallucinating in iterative loops without ever giving a useful answer.
6) Pre-existing bias
Models may inherit biases from their training data, leading them to draw conclusions that result in hallucinations. Bias is still widespread in LLMs today, stemming both from the data they are trained on and from the choices of the developers who design them, whose own biases can be reflected in the models’ outputs.
7) Lack of common sense and reasoning
Hallucinations may also result from LLMs’ frequent lack of “common sense” reasoning skills. They are strong at pattern detection, but pattern matching without genuine reasoning is exactly what leads them astray here.
Examples of LLMs Hallucinating
Here we have curated some examples of LLMs hallucinating, shared by tech enthusiasts and developers. Let’s explore them one by one:
Visual Hallucinations
Ethan Mollick, a professor at The Wharton School, shared on X the results of giving a set of four graphs to Claude 3, Gemini 1.5, and GPT-4. GPT-4 had trouble with the visual details and, as a result, produced some hallucinations when interpreting the information.
Interestingly, all three models hallucinated later on when asked to compute the means for the graphs.
Multimodal & graphs: Looking at a set of four graphs from a study of law students & AI, Claude does by far the best, Gemini 1.5 does well, and GPT-4 has trouble with visual details, leading to some hallucinations.
— Ethan Mollick (@emollick) March 5, 2024
However, when asked to provide the means for the graph, all fail. pic.twitter.com/4ipLKxauDP
Multiple Interpretations
Dave Schukin, CEO & co-founder of Observant AI, asked Claude 3 about himself, and the hallucinations in the results were shocking. In one response Claude wrongly stated that Dave was the co-founder and CEO of Anthropic, and in another it stated that Dave is an American entrepreneur!
Claude 3: very impressive so far, but the hallucinations… 👀 pic.twitter.com/5EM3BKuP5A
— Dave Schukin (@schukin) March 5, 2024
Imagining Conversations
Jad El Jerdy, an AI/ML enthusiast, witnessed some hallucinations from Claude 3 Haiku where the chatbot surprisingly started imagining a conversation between a human and an assistant. It carried the imagined exchange so far that it ended up talking to itself.
You can see the hallucinated conversations in the blue highlighted portions in the tweet below:
Anyone else getting weird hallucinations with Claude 3 Haiku where it imagines a conversation on its own and starts talking with itself ?
— Jad El Jerdy (@jad_jerdy) March 13, 2024
That's so weird. pic.twitter.com/ET3BTPfXd4
Image-to-text Hallucinations
Nikita Sokolsky, a user on X, asked both Claude 3 and GPT-4 to transcribe an image of handwritten Russian text by Stalin. GPT-4 responded that it couldn’t do it, but Claude 3 went ahead and produced a transcription that contained hallucinations.
I've found that it's more prone to hallucinations than GPT-4 when it comes to image processing. Tried a _very_ hard image-to-text problem, GPT-4 said it can't do it, Claude 3 hallucinated a fake text. pic.twitter.com/dNCVlbTLOI
— Nikita Sokolsky (@nsokolsky) March 6, 2024
Misinterpreting Images
Fuxiao Liu, an AI enthusiast with research interests in LLMs, multimodal models, and computer vision, witnessed a high level of hallucination from Gemini Pro. Gemini was given several images that it interpreted wrongly. For one image showing a green traffic light, Gemini responded that the green signal was only for “Turning Vehicles”.
It continued to hallucinate on the other images, which posed a math problem and a visual interpretation problem.
🔥Big shoutout to @Google for launching their groundbreaking large multimodal model #Gemini.
— Fuxiao Liu (@FuxiaoL) December 8, 2023
🚩However, there are still obvious hallucinations with Gemini. Here are a few examples with Gemini Pro.
Want to learn more? Look at our HallusionBench Paper: https://t.co/3qjsR3Drh7 pic.twitter.com/5obk5SjF6H
How to mitigate hallucinations in LLMs?
Leading tech companies are concerned about AI hallucinations. Removing random or irrelevant hallucinations from GPT-4 or Claude 3 is a difficult task, given the complexity of deep neural networks and their inability to distinguish truth from falsehood.
However, below are some ways that can help reduce the level of hallucinations in LLMs:
1) Feeding Contexts
The process of giving the model enough data to work with as a framework is known as context feeding. This approach is useful when you want the model to respond to a straightforward query or directive. Contextualizing the question eliminates uncertainty and lowers the possibility of irrelevant or erroneous responses.
For example, you can provide an LLM with the fact that Real Madrid CF won the UEFA Champions League in 2022. After that, it can successfully answer the question “Who won the UEFA Champions League in the year 2022?”.
Many LLMs are not up to date with the latest information available on the internet, which often results in either hallucinated answers or no useful answer at all.
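Below is a hedged sketch of context feeding using the football example above. It assumes the OpenAI Python client (openai>=1.0) purely for illustration; any chat-completion API would work the same way, and the model name is a placeholder.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client; any chat API works similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Context we feed the model so it doesn't have to rely on possibly stale training data.
context = "Real Madrid CF won the UEFA Champions League in 2022, beating Liverpool 1-0 in the final."
question = "Who won the UEFA Champions League in the year 2022?"

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context. "
                                      "If the context does not contain the answer, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```

Note how the system message also instructs the model to admit when the context is insufficient, which further reduces the temptation to fabricate an answer.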
2) Prompt Engineering
The technique of creating input questions or instructions to elicit more accurate and desired outputs from large language models (LLMs) is known as prompt engineering. It is an essential ability for working with applications of artificial intelligence (AI), assisting developers in getting better output from language models.
To optimize model output and correct for potential biases, prompt engineering entails carefully crafting input prompts, investigating linguistic subtleties, and testing a variety of prompts. AI systems become more dependable and user-friendly with this sophisticated methodology, which guarantees more accurate and contextually relevant outcomes.
Users input text as prompts when using LLMs. Prompts are a way to give GPT-4 instructions on what to do. It’s similar to coding, but instead of using programming languages, you use plain English. As a result, you must be very clear about your goals and convey them without using technical jargon.
For example, we first gave GPT-4 the vague prompt “What is rag?”. Unable to pin down the intended meaning, it showed a degree of hallucination.
We then provided a proper, more structured prompt, “What is Retrieval Augmented Generation?”, and it gave us an accurate response.
This is merely a basic illustration of how vague prompts can induce hallucinations and how adding context helps prevent them. More sophisticated prompt engineering techniques can guide the LLM even more precisely and further reduce hallucinations.
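As a small illustrative sketch (again assuming the OpenAI Python client; the model name is a placeholder), here is how the vague and the structured prompt from the example above could be compared side by side:

```python
from openai import OpenAI  # same assumption as before: the openai>=1.0 Python client

client = OpenAI()

vague_prompt = "What is rag?"  # ambiguous: a piece of cloth? ragtime music? the AI technique?
clear_prompt = (
    "In the context of large language models, explain what Retrieval Augmented Generation (RAG) is "
    "in two sentences and why it helps reduce hallucinations."
)

for prompt in (vague_prompt, clear_prompt):
    reply = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"PROMPT: {prompt}\nREPLY:  {reply.choices[0].message.content}\n")
```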
3) Retrieval Augmented Generation
Once trained on a given data distribution, generic models cannot reliably infer facts outside it, and they frequently need extra context when producing text for downstream applications. Retrieval-Augmented Generation (RAG) is a technique that retrieves relevant information from a domain-specific knowledge source, typically via embeddings, and adds it to the prompt.
By providing extra information to the prompt, RAG helps the model generate responses that are more pertinent and less prone to hallucinations.
For example, if you ask for guidance on how to perform CPR on a patient, RAG first retrieves the relevant CPR information from a knowledge base and then adds it to the original prompt before the language model processes it.
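The sketch below shows the overall RAG flow with a tiny, made-up knowledge base. Real systems retrieve with embedding-based vector search; the word-overlap scoring here is only a stand-in so the example stays self-contained.

```python
import re

# A tiny, made-up knowledge base; in practice this would be a vector database of domain documents.
KNOWLEDGE_BASE = [
    "CPR for an adult patient: push hard and fast in the centre of the chest, 100-120 compressions per minute.",
    "The Heimlich manoeuvre is used to relieve choking caused by an obstructed airway.",
    "Defibrillators deliver an electric shock to restore a normal heart rhythm.",
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str]) -> str:
    # Score each document by shared words (a stand-in for cosine similarity over embeddings).
    query_words = words(query)
    return max(documents, key=lambda doc: len(query_words & words(doc)))

def build_rag_prompt(question: str) -> str:
    context = retrieve(question, KNOWLEDGE_BASE)
    return f"Use only the context below to answer.\n\nContext: {context}\n\nQuestion: {question}"

print(build_rag_prompt("How do I perform CPR on a patient?"))
# The augmented prompt is then sent to the LLM, which can ground its answer in the retrieved text.
```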
4) Model Fine-Tuning
ML teams often employ a technique called fine-tuning to impart new knowledge to a model while preserving its current skills, especially when it comes to tasks involving the processing of natural language.
Additionally, by using this strategy, the model’s response can be shaped, and unwanted behaviors can be avoided. The model is less likely to invent believable but false responses as it is retrained with domain-specific knowledge.
During fine-tuning, the model is further trained on a new dataset, typically smaller and more task-specific than the original training data. The goal is to adapt the pre-trained model’s general knowledge to the particular features of the new dataset and thereby improve its performance on the associated tasks.
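For a rough idea of what this looks like in code, here is a hedged sketch using the Hugging Face transformers and datasets libraries. The base model ("gpt2"), the two toy domain sentences, and the hyperparameters are all placeholders; a real fine-tune would use a much larger, curated domain dataset.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Toy domain-specific texts standing in for a real fine-tuning corpus.
domain_texts = [
    "Real Madrid CF won the UEFA Champions League in 2022.",
    "Retrieval Augmented Generation adds retrieved context to the prompt before generation.",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the corpus; the collator builds the causal language-modelling labels for us.
dataset = Dataset.from_dict({"text": domain_texts})
tokenized = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()   # further trains the pre-trained model on the domain data
```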
Conclusion
Hallucination is a common yet serious issue in the field of LLMs and generative AI. But with better training, grounding techniques such as RAG, and structured, precise prompts, it can be substantially reduced. Let’s hope we can say goodbye to LLM hallucinations in the days to come.