File Directory for Large Document RAG Systems: Improving Information Retrieval
Problem Statement:
How can we efficiently retrieve the right document from thousands in a Retrieval-Augmented Generation (RAG) system?
In this insight, we’ll show you how to create an automated file directory through a Knowledge Table. By running preset queries on your documents and saving the results as a Knowledge Graph, you can direct natural language queries to a filtered set of documents. This method not only reduces the search space but also enhances retrieval accuracy.
When building RAG systems for enterprises, the challenge lies in managing collections of tens of thousands of documents. These collections are often messy and unstructured, and they need to be organized before AI-powered retrieval can work effectively.
RAG systems can be broken down into two key steps:
- Identifying the most relevant document(s) for a given query.
- Parsing the content within the filtered documents to retrieve the required information.
Step 1: Intelligent Document Parsing and Tagging
To identify relevant documents, we first need intelligent document parsing and tagging systems that can assign contextual labels to each document. Fortunately, Large Language Models (LLMs) excel at this task.
Consider a scenario where we use a set of 10 Samsung product catalogs. We run predefined queries, whose answers form the schema for the document metadata. These queries include:
- “What product is discussed?”
- “What is the model discussed?”
- “What category does this product belong to?”
- “What is the product description?”
Each query generates a metadata value tied to the specific document, which helps classify the content accurately. These tags and metadata become the foundation of our Knowledge Table.
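To make this tagging step concrete, here is a minimal sketch of how preset queries build one Knowledge Table row per document. The `answer_query` stub is a placeholder assumption: in practice an LLM answers each query against the document text.

```python
# Hypothetical stub: in practice an LLM answers each preset query
# against the document text. The lookup table is illustrative only.
def answer_query(document_text: str, query: str) -> str:
    lookup = {
        "What product is discussed?": "Galaxy S23",
        "What is the model discussed?": "SM-S911B",
        "What category does this product belong to?": "Smartphone",
        "What is the product description?": "Flagship phone with biometric unlock.",
    }
    return lookup[query]

PRESET_QUERIES = [
    "What product is discussed?",
    "What is the model discussed?",
    "What category does this product belong to?",
    "What is the product description?",
]

def build_knowledge_table_row(doc_id: str, document_text: str) -> dict:
    # One row of the Knowledge Table: the document ID plus one
    # metadata value per preset query.
    row = {"doc_id": doc_id}
    for query in PRESET_QUERIES:
        row[query] = answer_query(document_text, query)
    return row

row = build_knowledge_table_row("catalog_s23.pdf", "...catalog text...")
print(row["What category does this product belong to?"])  # Smartphone
```

Each row becomes one entry in the Knowledge Table, keyed by the source document so answers remain traceable.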
Step 2: Programmatic Metadata Generation
This entire metadata-generation process can be automated programmatically, which is particularly useful for developers who need to handle large document sets without relying on a UI.
Here’s an example of creating a Knowledge Graph using a set of pre-defined metadata:
```python
from whyhow import WhyHow

client = WhyHow(api_key="<whyhow-api-key>", base_url="https://api.whyhow.ai")

graph_id = client.create_graph_from_knowledge_table(
    file_path="./path/to/triples.json",
    workspace_name="manuals",
    graph_name="manual graph",
)

# Print the created graph ID
print(f"Created graph with ID: {graph_id}")
```
This code snippet creates a “Manual Graph” with document metadata, and the graph can now be queried to extract high-level context about the documents. The metadata is also stored as properties of each node, allowing us to trace information back to the original document.
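The exact schema of `triples.json` depends on the WhyHow version in use; as an illustration only, one plausible shape turns each Knowledge Table cell into a document-to-metadata triple. The field names below are assumptions, not the official schema.

```python
import json

# Hypothetical triples derived from the Knowledge Table rows.
# Field names ("head", "relation", "tail") are illustrative only.
triples = [
    {
        "head": {"name": "catalog_s23.pdf", "type": "document"},
        "relation": {"name": "has_category"},
        "tail": {"name": "Smartphone", "type": "category"},
    },
    {
        "head": {"name": "catalog_s23.pdf", "type": "document"},
        "relation": {"name": "has_model"},
        "tail": {"name": "SM-S911B", "type": "model"},
    },
]

# Serialize in the shape a triples file might take.
payload = json.dumps(triples, indent=2)
```

Because every triple's head is the source document, any node reached at query time can be traced back to the file it came from.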
Step 3: Filtering Relevant Documents
Once the metadata is created, we can run queries to identify the most relevant documents for answering specific questions.
For example, if we need to answer, “Which 2023 devices have biometric capabilities?” we first query the general manual graph to find all relevant documents from 2023. Then, we query the individual product-specific graphs to extract the precise information.
Here’s the process:
```python
def get_docs(q: str):
    # Query the general manual graph to find the relevant documents
    query_response = client.graphs.query_unstructured(
        graph_id="<manual-graph-id>",
        query=q,
    )
    return query_response

def query_specific_graph(graph: str, query: str):
    # Query a product-specific graph for the detailed answer
    answer = client.graphs.query_unstructured(graph_id=graph, query=query)
    return answer.answer
```
Step 4: Aggregating and Summarizing Answers
Once the relevant documents are identified, we collect a response from each product-specific graph, aggregate the answers, and run a final summarization step to generate the response.
```python
from openai import OpenAI

openai_client = OpenAI()

def get_final_answer(answers: list, query: str):
    # Combine the per-graph answers into a single prompt and summarize
    prompt = f"""
    Answer the following query using the information provided
    in the data below.
    Explain your answer.
    Query: {query}
    Data: {answers}
    """
    print(f"Prompt: {prompt}")
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content
```
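Putting the steps together, a sketch of the full flow looks like the following. The stub functions are assumptions standing in for the WhyHow graph queries and the LLM summarization call, so the pipeline shape is visible without live API access.

```python
# Stubs standing in for client.graphs.query_unstructured and the LLM call.
# All names and return values here are illustrative.
def get_docs_stub(q: str) -> list:
    # Step 3a: the general manual graph returns matching product graphs.
    return ["<s21-graph-id>", "<s22-graph-id>", "<s23-graph-id>"]

def query_specific_graph_stub(graph_id: str, query: str) -> str:
    # Step 3b: each product-specific graph answers the detailed question.
    return f"{graph_id}: supports fingerprint and face unlock"

def summarize_stub(answers: list, query: str) -> str:
    # Step 4: stands in for the LLM summarization step.
    return f"{len(answers)} devices match: " + "; ".join(answers)

def answer_pipeline(query: str) -> str:
    # Filter to relevant graphs, query each, then summarize.
    graphs = get_docs_stub(query)
    answers = [query_specific_graph_stub(g, query) for g in graphs]
    return summarize_stub(answers, query)

print(answer_pipeline("Which 2023 devices have biometric capabilities?"))
```

Swapping the stubs for the real `get_docs`, `query_specific_graph`, and `get_final_answer` functions yields the production pipeline.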
Step 5: Extending to Other Metadata Types
We can extend this process to query for different metadata types. For instance, instead of querying by year, you can query by product type, such as “phones” or “laptops.” In this case, querying the system for “Which biometric capabilities do phones have?” will return information from product-specific graphs for devices like the Samsung Galaxy S21, S22, and S23.
This approach can be extended to classify documents based on their type—product catalogs, maintenance manuals, academic papers, or any other type of document. By organizing and filtering documents appropriately, enterprises can ensure that they only search through relevant documents, thus improving retrieval efficiency.
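As a sketch of this document-type filtering, the Knowledge Table can be narrowed by any metadata field before retrieval. The table rows below are illustrative; in practice they come from the LLM tagging step.

```python
# Illustrative Knowledge Table rows; real rows come from the tagging step.
knowledge_table = [
    {"doc_id": "s23_catalog.pdf", "category": "Smartphone", "doc_type": "product catalog"},
    {"doc_id": "tv_manual.pdf", "category": "Television", "doc_type": "maintenance manual"},
    {"doc_id": "laptop_catalog.pdf", "category": "Laptop", "doc_type": "product catalog"},
]

def filter_docs(table: list, **criteria) -> list:
    # Keep only documents whose metadata matches every criterion,
    # shrinking the search space before any content retrieval runs.
    return [
        row["doc_id"]
        for row in table
        if all(row.get(k) == v for k, v in criteria.items())
    ]

print(filter_docs(knowledge_table, doc_type="product catalog"))
# ['s23_catalog.pdf', 'laptop_catalog.pdf']
```

The same filter works for any tag in the schema, so "phones only" or "maintenance manuals only" is a one-line change in the criteria.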
Conclusion
By using a Knowledge Table to categorize and tag documents and then converting that information into a Knowledge Graph, enterprises can streamline the search process in RAG systems. This method enables more accurate document retrieval, reduces unnecessary search space, and improves overall efficiency in accessing enterprise knowledge.