Retrieval-Augmented Generation (RAG) in Real-World Applications
Retrieval-augmented generation (RAG) is at the core of many large language model (LLM) applications, from headline-making products at large companies to tools developers build for small businesses. Evaluating RAG systems is critical to their development and deployment: trust in AI cannot be achieved without proof that the AI can be trusted. One innovative approach to this evaluation is the “Needle in a Haystack” test, introduced by Greg Kamradt. The test assesses an LLM’s ability to identify and use specific information (the “needle”) embedded within a larger, complex body of text (the “haystack”).
In RAG systems, context windows often teem with information: large pieces of context retrieved from a vector database are combined with instructions, templating, and other elements in the prompt. The Needle in a Haystack test evaluates how well an LLM can pinpoint specific details within this clutter. Even if a RAG system retrieves the relevant context, it is ineffective if the model overlooks crucial specifics.
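To make that clutter concrete, here is a minimal, self-contained sketch of how a RAG prompt is typically assembled from retrieved chunks. The toy retriever, document list, and template wording below are illustrative assumptions, not code from the tests discussed in this post; a real system would query a vector database instead.

```python
# Minimal sketch of RAG prompt assembly (illustrative only; a real system
# would call a vector database rather than this toy keyword retriever).

DOCUMENTS = [
    "Dolores Park is a popular spot in San Francisco on sunny days.",
    "The Golden Gate Bridge opened in 1937.",
    "Sourdough sandwiches are a local favorite.",
]

def retrieve_top_k(query: str, k: int = 2) -> list[str]:
    """Toy stand-in for a vector-database similarity search: rank documents
    by naive word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: -len(query_words & set(d.lower().split())))
    return ranked[:k]

def build_rag_prompt(query: str) -> str:
    """Combine retrieved chunks with instructions and a template, as described above."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieve_top_k(query)))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

print(build_rag_prompt("What is the best thing to do in San Francisco?"))
```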
Conducting the Needle in a Haystack Test
Aparna Dhinakaran conducted this test multiple times across several major language models. Here’s an overview of her process and findings:
Test Setup
- Embedding the Needle: A specific statement, “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day,” was embedded at various depths within text snippets of varying lengths.
- Prompting the Models: The models were asked to identify the best thing to do in San Francisco using only the provided context.
- Depth and Length Variations: The test was repeated with the needle placed at different depths (0% to 100% of the way through the context) and with context lengths ranging from 1K tokens up to each model’s token limit (a minimal sketch of this setup follows the list).
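The following is a minimal sketch of that setup, assuming sentence-level insertion and a plain question prompt; it illustrates the procedure described above rather than reproducing the authors’ actual test harness.

```python
# Minimal sketch of the Needle in a Haystack setup (illustrative only;
# sentence-level insertion and the filler text are simplifications).

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")

def insert_needle(haystack: str, needle: str, depth_pct: float) -> str:
    """Place the needle roughly depth_pct percent of the way through the haystack."""
    sentences = [s for s in haystack.split(". ") if s]
    position = round(len(sentences) * depth_pct / 100)
    sentences.insert(position, needle.rstrip("."))
    return ". ".join(sentences) + "."

def build_prompt(context: str) -> str:
    """Ask the model to answer using only the provided context."""
    return (
        f"Context:\n{context}\n\n"
        "Using only the context above, answer: "
        "What is the best thing to do in San Francisco?"
    )

# Sweep needle depths for one haystack; in the real test this is repeated
# across context lengths from 1K tokens up to each model's limit.
haystack = "Background filler sentence that has nothing to do with the question. " * 200
for depth in (0, 25, 50, 75, 100):
    prompt = build_prompt(insert_needle(haystack, NEEDLE, depth))
    # send `prompt` to the model under test and check whether the reply
    # mentions eating a sandwich in Dolores Park
```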
Key Findings
- Model Performance Variations:
  - ChatGPT-4: Performance declined at context lengths over 64K tokens and fell sharply at 100K tokens. The model often missed the needle when it appeared near the beginning of the context, but performed well when the needle was placed towards the end or was the very first sentence.
  - Claude 2.1: Initial testing showed only 27% retrieval accuracy. Performance declined as context length increased but improved when the needle was near the bottom of the document or was the first sentence.
- Anthropic’s Response and Adjustments:
  - Topic Alignment: Changing the needle to match the haystack’s topic improved outcomes.
  - Prompt Template Change: Adding the directive “Here is the most relevant sentence in the context” to the prompt raised Claude’s accuracy from 27% to 98% (a sketch of this adjustment follows the list).
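As a rough illustration of that prompt change, here is a hedged sketch in which the directive is supplied by prefilling the start of the assistant’s reply; the exact message structure is an assumption made for this example, not Anthropic’s published template.

```python
# Hedged sketch of the prompt adjustment described above. Placing the
# directive as a prefilled assistant turn is an assumption about how the
# change can be applied; it nudges the model to quote from the context.

def claude_messages(context: str, question: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": f"{context}\n\nUsing only the context above, answer: {question}",
        },
        {
            # Prefilled start of the reply containing the directive credited
            # with raising retrieval accuracy from 27% to 98%.
            "role": "assistant",
            "content": "Here is the most relevant sentence in the context:",
        },
    ]
```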
Further Experiments
We extended our tests to include additional models and configurations:
- Random Needle: We used a random number as the needle to rule out cached or memorized answers (see the sketch after this list).
- Prompt Templates: Various prompt templates were used so that each model was compared at its best.
- Negative Case Testing: We also assessed whether the models recognized when the requested data could not be found in the context.
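Below is a minimal, illustrative sketch of the random-needle and negative-case checks; the needle wording and refusal markers are assumptions made for this example, not the exact evaluation code.

```python
# Illustrative sketch of the random-needle and negative-case scoring
# (a reconstruction for this post; wording and markers are assumptions).
import random

def make_random_needle() -> tuple[str, str]:
    """Return a needle sentence containing a fresh random number, plus the expected answer."""
    secret = str(random.randint(10_000_000, 99_999_999))
    return f"The magic number mentioned in this document is {secret}.", secret

def score_response(response: str, secret: str, needle_present: bool) -> bool:
    """Positive case: the secret must appear in the reply.
    Negative case (needle never inserted): the model should say it cannot find it."""
    if needle_present:
        return secret in response
    refusal_markers = ("not mentioned", "cannot find", "no magic number", "not provided")
    return any(marker in response.lower() for marker in refusal_markers)
```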
Models Tested:
- ChatGPT-4
- Claude 2.1 (with and without the revised prompt)
- Mistral AI’s Mixtral-8x7B-v0.1 and Mistral-7B-Instruct
Similar Tests by Lars Wiik Included:
- ChatGPT-4o
- Google Gemini
Results
- Comparison with Initial Research:
  - ChatGPT-4 and Claude 2.1 (without prompt guidance): Our results were consistent with the initial findings, showing the same pattern of declining performance with longer contexts and with needles placed in positions the models tend to miss.
  - Claude 2.1 with Prompt Guidance: The updated prompt significantly reduced misses, although we could not replicate the 98% accuracy reported by Anthropic.
- ChatGPT-4o Outperformed Google Gemini:
  - Google Gemini: Performed worse than ChatGPT-4o at extracting the needle from the context.
- Mistral AI Models:
  - Performance: Despite being much smaller, the Mistral AI models performed exceptionally well, particularly the Mixture-of-Experts (MoE) Mixtral-8x7B.
Evaluating RAG With Needle in Haystack Test
The Needle in a Haystack test effectively measures an LLM’s ability to retrieve specific information from dense contexts. Our key takeaways include:
- ChatGPT-4’s Superiority: ChatGPT-4 remains a leader in information retrieval.
- Claude 2.1’s Improvement: With prompt adjustments, Claude showed significant performance improvements.
- Mixtral’s Unexpected Success: The Mixtral MoE model exceeded expectations in retrieval tasks.
The test highlights the importance of tailored prompting and continuous evaluation in developing and deploying LLMs, especially when connected to private data. Small changes in prompt structure can lead to significant performance differences, underscoring the need for precise tuning and testing.