Retrieval-Augmented Generation (RAG) in Real-World Applications

Retrieval-augmented generation (RAG) is at the core of many large language model (LLM) applications, from companies creating headlines to developers solving problems for small businesses. Evaluating RAG With Needle in Haystack Test. Evaluating RAG systems is critical for their development and deployment. Trust in AI cannot be achieved without proof AI can be trusted. One innovative approach to this trust evaluation is the “Needle in a Haystack” test, introduced by Greg Kamradt. This test assesses an LLM’s ability to identify and utilize specific information (the “needle”) embedded within a larger, complex body of text (the “haystack”).

In RAG systems, context windows often teem with information. Large pieces of context from a vector database are combined with instructions, templating, and other elements in the prompt. The Needle in a Haystack test evaluates how well an LLM can pinpoint specific details within this clutter. Even if a RAG system retrieves relevant context, it is ineffective if it overlooks crucial specifics.

Conducting the Needle in a Haystack Test

Aparna Dhinakaran conducted this test multiple times across several major language models. Here’s an overview of her process and findings:

Test Setup

  • Embedding the Needle: A specific statement, “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day,” was embedded at various depths within text snippets of varying lengths.
  • Prompting the Models: The models were asked to identify the best thing to do in San Francisco using only the provided context.
  • Depth and Length Variations: The test was repeated at different depths (0% to 100%) and context lengths (1K tokens to the models’ token limits).

Key Findings

  1. Model Performance Variations:
    • ChatGPT-4: Performance declined with context lengths over 64k tokens and sharply fell at 100k tokens. The model often missed the needle if it was at the beginning of the context but performed well if the needle was placed towards the end or as the first sentence.
    • Claude 2.1: Initial testing showed a 27% retrieval accuracy. Performance declined with increased context length but improved if the needle was near the bottom of the document or the first sentence.
  2. Anthropic’s Response and Adjustments:
    • Topic Alignment: Changing the needle to match the haystack’s topic improved outcomes.
    • Prompt Template Change: Adding a directive in the prompt (“Here is the most relevant sentence in the context”) significantly increased Claude’s accuracy from 27% to 98%.

Further Experiments

We extended our tests to include additional models and configurations:

  • Random Needle: We used a random number as the needle to avoid caching.
  • Prompt Templates: Various prompt templates were used to compare models at their best.
  • Negative Case Testing: We assessed how well the models recognized when they couldn’t retrieve the data.

Models Tested:

  • ChatGPT-4
  • Claude 2.1 (with and without the revised prompt)
  • Mistral AI’s Mixtral-8X7B-v0.1 and 7B Instruct

Lars Wiik Similar Tests Included:

  • ChatGPT 4-o
  • Google Gemini

Result

  1. Comparison with Initial Research:
    • ChatGPT-4 and Claude (without prompt guidance): Our results were consistent with initial findings, showing similar patterns of performance decline with long context and misplaced needles.
    • Claude with Prompt Guidance: The updated prompt significantly reduced misses, although we couldn’t replicate the 98% accuracy achieved by Anthropic.
    • ChatGPT4-o outperformed Google:
    • Google Gemini: Performed more poorly than ChatGPT4-o at extracting data.
  2. Mixtral Models:
    • Performance: Despite being smaller, the Mixtral models performed exceptionally well, particularly the Mixture of Experts (MOE) model.

Evaluating RAG With Needle in Haystack Test

The Needle in a Haystack test effectively measures an LLM’s ability to retrieve specific information from dense contexts. Our key takeaways include:

  • ChatGPT-4’s Superiority: ChatGPT-4 remains a leader in information retrieval.
  • Claude 2.1’s Improvement: With prompt adjustments, Claude showed significant performance improvements.
  • Mixtral’s Unexpected Success: Mixtral MOE models exceeded expectations in retrieval tasks.

The test highlights the importance of tailored prompting and continuous evaluation in developing and deploying LLMs, especially when connected to private data. Small changes in prompt structure can lead to significant performance differences, underscoring the need for precise tuning and testing.

Related Posts
Salesforce OEM AppExchange
Salesforce OEM AppExchange

Expanding its reach beyond CRM, Salesforce.com has launched a new service called AppExchange OEM Edition, aimed at non-CRM service providers. Read more

Salesforce Jigsaw
Salesforce Jigsaw

Salesforce.com, a prominent figure in cloud computing, has finalized a deal to acquire Jigsaw, a wiki-style business contact database, for Read more

Health Cloud Brings Healthcare Transformation
Health Cloud Brings Healthcare Transformation

Following swiftly after last week's successful launch of Financial Services Cloud, Salesforce has announced the second installment in its series Read more

Salesforce Artificial Intelligence
Salesforce CRM for AI driven transformation

Is artificial intelligence integrated into Salesforce? Salesforce Einstein stands as an intelligent layer embedded within the Lightning Platform, bringing robust Read more