Retrieval-Augmented Generation (RAG) in Real-World Applications
Retrieval-augmented generation (RAG) is at the core of many large language model (LLM) applications, from companies making headlines to developers solving problems for small businesses.
Evaluating RAG With the Needle in a Haystack Test
Evaluating RAG systems is critical to their development and deployment; trust in AI cannot be achieved without evidence that the AI can be trusted. One innovative approach to this evaluation is the “Needle in a Haystack” test, introduced by Greg Kamradt. The test assesses an LLM’s ability to identify and use specific information (the “needle”) embedded within a larger, complex body of text (the “haystack”).
In RAG systems, context windows often teem with information. Large pieces of context from a vector database are combined with instructions, templating, and other elements in the prompt. The Needle in a Haystack test evaluates how well an LLM can pinpoint specific details within this clutter. Even if a RAG system retrieves relevant context, it is ineffective if it overlooks crucial specifics.
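To make that clutter concrete, here is a minimal sketch of how such a prompt is typically assembled. The `build_rag_prompt` function and the template wording are illustrative assumptions, not code from any of the systems discussed here.

```python
# Minimal sketch of assembling a RAG prompt from retrieved chunks.
# The function name and template are illustrative placeholders.

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context, instructions, and the user question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example usage with placeholder chunks pulled from a vector database:
chunks = ["Chunk about topic A ...", "Chunk about topic B ..."]
prompt = build_rag_prompt("What is the best thing to do in San Francisco?", chunks)
```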
Conducting the Needle in a Haystack Test
Aparna Dhinakaran conducted this test multiple times across several major language models. Here’s an overview of her process and findings:
Test Setup
- Embedding the Needle: A specific statement, “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day,” was embedded at various depths within text snippets of varying lengths.
- Prompting the Models: The models were asked to identify the best thing to do in San Francisco using only the provided context.
- Depth and Length Variations: The test was repeated at different depths (0% to 100%) and context lengths (1K tokens to the models’ token limits); a sketch of this setup follows the list.
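The following sketch illustrates the setup described above. It approximates token counts with whitespace-split words rather than a real tokenizer, and the function names and the commented-out model call are placeholders.

```python
# Illustrative sketch of the needle-in-a-haystack sweep: insert the needle
# at a given depth within a context of a given length, then prompt the model.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")

def build_haystack(filler_text: str, context_len: int, depth_pct: float) -> str:
    """Insert the needle at depth_pct (0-100) within context_len words of filler."""
    words = filler_text.split()[:context_len]
    insert_at = int(len(words) * depth_pct / 100)
    return " ".join(words[:insert_at] + NEEDLE.split() + words[insert_at:])

def build_prompt(haystack: str) -> str:
    return (
        f"{haystack}\n\n"
        "Using only the context above, what is the best thing to do in San Francisco?"
    )

# Sweep depths and context lengths, as in the test:
for depth in range(0, 101, 10):
    for length in (1_000, 16_000, 64_000, 100_000):
        prompt = build_prompt(build_haystack("filler " * length, length, depth))
        # response = call_model(prompt)  # hypothetical model call
```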
Key Findings
- Model Performance Variations:
- ChatGPT-4: Performance declined with context lengths over 64k tokens and fell sharply at 100k tokens. The model often missed the needle when it appeared early in the context, but performed well when the needle was placed toward the end or was the very first sentence.
- Claude 2.1: Initial testing showed a 27% retrieval accuracy. Performance declined with increased context length but improved if the needle was near the bottom of the document or the first sentence.
- Anthropic’s Response and Adjustments:
- Topic Alignment: Changing the needle to match the haystack’s topic improved outcomes.
- Prompt Template Change: Adding a directive to the prompt (“Here is the most relevant sentence in the context”) significantly increased Claude’s accuracy from 27% to 98%; a sketch of this template appears below.
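Below is a rough sketch of that prompt-template change; the exact wording and placement Anthropic used may differ, and the function name is an assumption for illustration.

```python
# Sketch of the prompt-template change described above: the directive is
# appended after the question so the model begins its answer by quoting
# the relevant sentence before responding.

def build_prompt_with_directive(haystack: str, question: str) -> str:
    return (
        f"{haystack}\n\n"
        f"{question}\n\n"
        "Here is the most relevant sentence in the context:"
    )
```

Ending the prompt this way effectively starts the model’s answer for it, nudging Claude to surface the relevant sentence before elaborating.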
Further Experiments
We extended our tests to include additional models and configurations:
- Random Needle: We used a random number as the needle to avoid caching.
- Prompt Templates: Various prompt templates were used to compare models at their best.
- Negative Case Testing: We assessed how well the models recognized when they couldn’t retrieve the data; a sketch of the random-needle and negative-case checks follows this list.
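The sketch below shows one way the random-needle and negative-case checks could be wired up. The needle wording, the scoring heuristic, and all names are assumptions for illustration, not the exact code used in the experiments.

```python
# Sketch of the random-needle and negative-case checks.
import random

def make_random_needle() -> tuple[str, str]:
    """Return a needle sentence and the random number that must be retrieved."""
    magic = str(random.randint(100_000, 999_999))
    return f"The magic number mentioned in this document is {magic}.", magic

def score_response(response: str, magic: str, needle_present: bool) -> bool:
    """Positive case: the magic number must appear in the response.
    Negative case (needle absent): the model should say it cannot find it.
    The 'cannot' check is a rough heuristic standing in for a real grader."""
    if needle_present:
        return magic in response
    return magic not in response and "cannot" in response.lower()
```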
Models Tested:
- ChatGPT-4
- Claude 2.1 (with and without the revised prompt)
- Mistral AI’s Mixtral-8X7B-v0.1 and 7B Instruct
Similar tests by Lars Wiik included:
- ChatGPT-4o
- Google Gemini
Results
- Comparison with Initial Research:
- ChatGPT-4 and Claude (without prompt guidance): Our results were consistent with initial findings, showing similar patterns of performance decline with long context and misplaced needles.
- Claude with Prompt Guidance: The updated prompt significantly reduced misses, although we couldn’t replicate the 98% accuracy achieved by Anthropic.
- ChatGPT-4o outperformed Google Gemini:
- Google Gemini: Performed worse than ChatGPT-4o at extracting the embedded data.
- Mixtral Models:
- Performance: Despite being smaller, the Mixtral models performed exceptionally well, particularly the Mixture of Experts (MoE) model.
Evaluating RAG With the Needle in a Haystack Test: Key Takeaways
The Needle in a Haystack test effectively measures an LLM’s ability to retrieve specific information from dense contexts. Our key takeaways include:
- ChatGPT-4’s Superiority: ChatGPT-4 remains a leader in information retrieval.
- Claude 2.1’s Improvement: With prompt adjustments, Claude showed significant performance improvements.
- Mixtral’s Unexpected Success: Mixtral MoE models exceeded expectations in retrieval tasks.
The test highlights the importance of tailored prompting and continuous evaluation in developing and deploying LLMs, especially when connected to private data. Small changes in prompt structure can lead to significant performance differences, underscoring the need for precise tuning and testing.