Guide to RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation (RAG) has become increasingly popular, and while it's not yet as common as seeing it on a toaster oven manual, its use is expected to keep growing. Despite this rising popularity, comprehensive guides that address all its nuances, such as relevance assessment and hallucination prevention, are still scarce. Drawing from practical experience, this article offers an in-depth overview of RAG.
Why is RAG Important?
Large Language Models (LLMs) like ChatGPT can be employed for a wide range of tasks, from crafting horoscopes to more business-centric applications. However, there’s a notable challenge: most LLMs, including ChatGPT, do not inherently understand the specific rules, documents, or processes that companies rely on.
There are two ways to address this gap:
- Retrain the model on the company's data, which is time-consuming, costly, and not guaranteed to succeed.
- Use Retrieval-Augmented Generation (RAG), which leverages existing models and augments them by connecting them to relevant company information via a retriever mechanism.
How RAG Works
RAG consists of two primary components:
- Retriever: This searches for information relevant to the query. A vector database (like Qdrant) is often used, storing the indexed documents of a company.
- Generator: This processes the retrieved data and synthesizes an answer using an LLM. The Generator takes the relevant information and distills it into a coherent response.
While the system is straightforward, the effectiveness of the output heavily depends on the quality of the documents retrieved and how well the Retriever performs. Corporate documents are often unstructured, conflicting, or context-dependent, making the process challenging.
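To make the division of labor concrete, here is a minimal sketch of the two components. Note that vector_search and llm_complete are hypothetical placeholders for a document-index lookup and an LLM call, not real library functions; a full working example follows later in this guide.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Retriever: find the k document chunks most relevant to the query
    return vector_search(query, top_k=k)  # hypothetical index lookup

def generate(query: str, chunks: list[str]) -> str:
    # Generator: distill the retrieved chunks into one coherent answer
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm_complete(prompt)  # hypothetical LLM call

def rag_answer(query: str) -> str:
    return generate(query, retrieve(query))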
Search Optimization in RAG
To enhance RAG’s performance, optimization techniques are used across various stages of information retrieval and processing:
- Initial Processing of User Queries: Users often submit questions in confusing formats. The model must first rephrase and clean the query before searching. One method for this is RAG Fusion, where the model generates multiple versions of the user's question, conducts a search for each, and ranks the combined results using algorithms such as Cross-Encoders, which are slower but more accurate for ranking (a sketch of the fusion step follows this list).
- Data Searching in Repositories: The Retriever searches for data from various company repositories. Vector databases, like Qdrant or Pinecone, are commonly used for semantic searches, offering high accuracy for complex queries by embedding documents and queries into vector space.
- Ranking and Combining Results: Once the Retriever returns documents, they must be ranked for relevance before being passed to the Generator. Techniques such as Cross-Encoders or Reciprocal Rank Fusion (RRF) are used to prioritize results that are most likely to be relevant to the user’s query.
- Evaluating and Formatting Responses: After generating a response, additional layers of evaluation help ensure quality and coherence. This could involve reranking the answers or applying stylistic adjustments for tone and clarity.
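To illustrate the query-rewriting and fusion steps, below is a hedged sketch of RAG Fusion combined with Reciprocal Rank Fusion (RRF). Here generate_query_variants and vector_search are hypothetical stand-ins for an LLM rewriting prompt and a vector-database query; the RRF scoring itself is self-contained.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # RRF: each document earns 1 / (k + rank) for every ranked list it
    # appears in; k=60 is the constant from the original RRF paper.
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rag_fusion_search(question, top_k=5):
    variants = generate_query_variants(question)          # hypothetical LLM rewrite step
    ranked_lists = [vector_search(v) for v in variants]   # hypothetical vector search
    return reciprocal_rank_fusion(ranked_lists)[:top_k]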
Python and LangChain Implementation Example
Below is a simple implementation of RAG using Python and LangChain:
import wget
from langchain.vectorstores import Qdrant
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain_community.document_loaders import BSHTMLLoader
from langchain.chains import RetrievalQA
# Download 'War and Peace' by Tolstoy
wget.download("http://az.lib.ru/t/tolstoj_lew_nikolaewich/text_0073.shtml")
# Load text from html
loader = BSHTMLLoader("text_0073.shtml", open_encoding='ISO-8859-1')
war_and_peace = loader.load()
# Initialize Vector Database
embeddings = OpenAIEmbeddings()
doc_store = Qdrant.from_documents(
    war_and_peace,
    embeddings,
    location=":memory:",
    collection_name="docs",
)
llm = OpenAI()
# Build the QA chain once, then answer questions interactively
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=doc_store.as_retriever(),
    return_source_documents=False,
)

while True:
    question = input('Your question: ')
    result = qa(question)
    print(f"Answer: {result['result']}")
Considerations for Effective RAG
- Structured vs. Unstructured Data: Company documents often vary widely in structure. This inconsistency can affect the performance of the retriever. Preprocessing the documents to normalize formats can significantly enhance the quality of retrieved information.
- Vector Database Performance: The accuracy of RAG systems heavily depends on the vector database. Techniques like ensembling retrievers (combining dense and sparse retrievers) can improve overall search results by balancing speed and relevance.
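For illustration, here is one way to ensemble a sparse BM25 retriever with the dense Qdrant retriever from the example above, using LangChain's EnsembleRetriever (which fuses the two result lists with weighted reciprocal-rank scoring). This is a sketch, not a definitive recipe: it reuses the war_and_peace documents and doc_store from earlier, and exact import paths vary across LangChain versions.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires `pip install rank_bm25`

sparse = BM25Retriever.from_documents(war_and_peace)  # keyword-based, fast
dense = doc_store.as_retriever()                      # semantic, embedding-based
ensemble = EnsembleRetriever(retrievers=[sparse, dense], weights=[0.4, 0.6])

docs = ensemble.get_relevant_documents("Who is Pierre Bezukhov?")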
Ranking Techniques in RAG
- Cross-Encoders: These provide detailed relevance assessments between a query and documents, although they are slower and more resource-intensive than Bi-Encoders.
- Reciprocal Rank Fusion (RRF): This method scores documents based on their rank in multiple retriever outputs. It’s particularly useful for combining results from different systems, ensuring more consistently relevant results.
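As a concrete example of the first technique, below is a short reranking sketch using the sentence-transformers library and a publicly available MS MARCO cross-encoder model; the rerank helper and its parameters are illustrative choices, not a fixed API.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, passages, top_k=5):
    # The cross-encoder reads the query and each passage together, which is
    # more accurate than comparing precomputed embeddings, but much slower.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]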
Dynamic Learning with RELP
An advanced technique within RAG is Retrieval-Augmented Language Model-based Prediction (RELP). In this method, information retrieved from vector storage is used to generate example answers, which the LLM can then use to dynamically learn and respond. This allows for adaptive learning without the need for expensive retraining.
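One way this can look in practice is sketched below, reusing the doc_store and llm objects from the earlier example: retrieve related passages, have the LLM turn them into worked example answers, and prepend those examples to the final prompt. The prompt wording here is an illustrative assumption, not a canonical RELP recipe.

def relp_answer(question, n_examples=3):
    # 1. Retrieve passages related to the question from the vector store
    hits = doc_store.similarity_search(question, k=n_examples)
    # 2. Have the LLM turn each passage into a short worked example
    examples = [
        llm(f"Context: {hit.page_content}\n"
            "Write one short question-and-answer pair grounded in this context.")
        for hit in hits
    ]
    # 3. Answer the real question with those examples as in-context demonstrations
    prompt = "\n\n".join(examples) + f"\n\nUsing the examples above as a guide, answer: {question}"
    return llm(prompt)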
Conclusion
RAG offers a powerful alternative to retraining large language models, allowing businesses to leverage their proprietary knowledge for practical applications. While setting up and optimizing RAG systems involves navigating various complexities, including document structure, query processing, and ranking, the results are highly effective for most business use cases.