File Directory for Large Document RAG Systems: Improving Information Retrieval

Problem Statement:
How can we efficiently retrieve the right document from thousands in a Retrieval-Augmented Generation (RAG) system?

In this insight, we’ll show you how to create an automated file directory through a Knowledge Table. By running preset queries on your documents and saving the results as a Knowledge Graph, you can direct natural language queries to a filtered set of documents. This method not only reduces the search space but also enhances retrieval accuracy.

When building RAG systems for enterprises, a core challenge is managing collections of tens of thousands of documents. These collections are often messy and unstructured, and they need to be organized before AI-powered retrieval can work effectively.

RAG systems can be broken down into two key steps:

  1. Identifying the most relevant document(s) for a given query.
  2. Parsing the content within the filtered documents to retrieve the required information.

Step 1: Intelligent Document Parsing and Tagging

To identify relevant documents, we first need intelligent document parsing and tagging systems that can assign contextual labels to each document. Fortunately, Large Language Models (LLMs) excel at this task.

Consider a scenario with a set of 10 Samsung product catalogs. We run predefined queries against each catalog, and these queries form the schema for the document metadata. They include:

  • “What product is discussed?”
  • “What is the model discussed?”
  • “What category does this product belong to?”
  • “What is the product description?”

Each query generates a metadata value tied to the specific document, which helps classify the content accurately. These tags and metadata become the foundation of our Knowledge Table.
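The tagging step above can be sketched in a few lines. Here `ask_llm` is a hypothetical stand-in for whatever LLM call you use; it is stubbed out so the flow is runnable without credentials, and the file name is invented for the example.

```python
# A minimal sketch of the tagging step: run each preset query against each
# document and collect the answers as that document's metadata row.
# `ask_llm` is a hypothetical placeholder, stubbed so the flow is runnable.

METADATA_QUERIES = {
    "product": "What product is discussed?",
    "model": "What is the model discussed?",
    "category": "What category does this product belong to?",
    "description": "What is the product description?",
}

def ask_llm(document_text: str, question: str) -> str:
    # Stub: a real implementation would prompt an LLM with the document text.
    return f"answer to: {question}"

def build_metadata_table(documents: dict) -> dict:
    """Run every preset query against every document and collect the answers."""
    return {
        doc_id: {field: ask_llm(text, q) for field, q in METADATA_QUERIES.items()}
        for doc_id, text in documents.items()
    }

table = build_metadata_table({"galaxy_s23_catalog.pdf": "(document text)"})
```

Each row of `table` is exactly the metadata schema defined by the preset queries, keyed by document.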

Step 2: Programmatic Metadata Generation

This entire metadata-generation process can be automated programmatically, which is particularly useful for developers who need to handle large document sets without relying on a UI.

Here’s an example of creating a Knowledge Graph using a set of pre-defined metadata:

from whyhow import WhyHow

client = WhyHow(api_key="<whyhow-api-key>", base_url="https://api.whyhow.ai")

graph_id = client.create_graph_from_knowledge_table(
    file_path="./path/to/triples.json",
    workspace_name="manuals",
    graph_name="manual graph"
)

# Print the created graph ID
print(f"Created graph with ID: {graph_id}")

This code snippet creates a “Manual Graph” with document metadata, and the graph can now be queried to extract high-level context about the documents. The metadata is also stored as properties of each node, allowing us to trace information back to the original document.
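For orientation, a triples file expresses subject–relation–object statements about the documents. The schema below is purely illustrative (the field names are assumptions, not the confirmed WhyHow format — check the WhyHow documentation for the shape your SDK version expects):

```json
{
  "triples": [
    {
      "head": {"name": "Galaxy S23", "label": "Product"},
      "relation": "BELONGS_TO_CATEGORY",
      "tail": {"name": "Smartphone", "label": "Category"}
    }
  ]
}
```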

Step 3: Filtering Relevant Documents

Once the metadata is created, we can run queries to identify the most relevant documents for answering specific questions.

For example, if we need to answer, “Which 2023 devices have biometric capabilities?” we first query the general manual graph to find all relevant documents from 2023. Then, we query the individual product-specific graphs to extract the precise information.

Here’s the process:

def get_docs(q: str):
    # Query the high-level manual graph to find the relevant documents.
    query_response = client.graphs.query_unstructured(
        graph_id="<manual-graph-id>",
        query=q
    )
    return query_response

def query_specific_graph(graph: str, query: str):
    # Query a single product-specific graph for the detailed answer.
    answer = client.graphs.query_unstructured(graph_id=graph, query=query)
    return answer.answer
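Gluing the two stages together requires a mapping from each document to its product-specific graph ID, which you maintain yourself. The `doc_to_graph` dict, file names, and `fan_out` helper below are illustrative, with the query call injected as a plain function so the routing logic runs on its own:

```python
# Illustrative fan-out: route one question to each product-specific graph.
# `doc_to_graph` (document id -> graph id) is a mapping you maintain yourself;
# `query_fn` is injected so the routing logic can run without an API.
def fan_out(question, doc_ids, doc_to_graph, query_fn):
    return [
        query_fn(doc_to_graph[doc_id], question)
        for doc_id in doc_ids
        if doc_id in doc_to_graph  # skip documents with no known graph
    ]

doc_to_graph = {"s23_catalog.pdf": "graph-123", "tv_catalog.pdf": "graph-456"}
answers = fan_out(
    "Which biometric capabilities are supported?",
    ["s23_catalog.pdf", "unknown.pdf"],
    doc_to_graph,
    query_fn=lambda graph_id, q: f"{graph_id}: stub answer",
)
```

In a real run, `query_fn` would be `query_specific_graph` from the snippet above.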

Step 4: Aggregating and Summarizing Answers

Once relevant documents are identified, we collect responses from each specific product graph. These answers are then aggregated, and a summarization step combines them into the final response.

from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
openai_client = OpenAI()

def get_final_answer(answers: list, query: str):
    prompt = f"""
    Answer the following query using the information provided
    in the data below.
    Explain your answer.

    Query: {query}
    Data: {answers}
    """

    print(f"Prompt: {prompt}")

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )

    return response.choices[0].message.content
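Putting Steps 3 through 5 together, the whole flow can be sketched as a small driver. The retrieval and summarization calls are passed in as plain functions and stubbed below (with invented names and IDs), so the control flow runs without any API credentials:

```python
# End-to-end sketch of Steps 3-5 with the external calls injected as
# functions, so the control flow itself is runnable.
def rag_pipeline(query, get_doc_ids, query_graph, summarize, doc_to_graph):
    doc_ids = get_doc_ids(query)                    # Step 3: filter documents
    answers = [query_graph(doc_to_graph[d], query)  # Step 4: per-graph answers
               for d in doc_ids if d in doc_to_graph]
    return summarize(answers, query)                # Step 5: final summary

result = rag_pipeline(
    "Which 2023 devices have biometric capabilities?",
    get_doc_ids=lambda q: ["s23_catalog.pdf"],
    query_graph=lambda graph_id, q: f"answer from {graph_id}",
    summarize=lambda answers, q: " | ".join(answers),
    doc_to_graph={"s23_catalog.pdf": "graph-123"},
)
```

In production, `get_doc_ids` would wrap the manual-graph query, `query_graph` the product-graph query, and `summarize` the `get_final_answer` call above.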

Step 5: Extending to Other Metadata Types

We can extend this process to query for different metadata types. For instance, instead of querying by year, you can query by product type, such as “phones” or “laptops.” In this case, querying the system for “Which biometric capabilities do phones have?” will return information from product-specific graphs for devices like the Samsung Galaxy S21, S22, and S23.

This approach can be extended to classify documents based on their type—product catalogs, maintenance manuals, academic papers, or any other type of document. By organizing and filtering documents appropriately, enterprises can ensure that they only search through relevant documents, thus improving retrieval efficiency.
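As a concrete illustration of type-based filtering (the `doc_type` field and file names are invented for this example), selecting only one document class from the metadata table might look like:

```python
# Illustrative type-based filtering over the metadata table; the "doc_type"
# field and file names are invented for this example.
def filter_by_type(metadata_table, doc_type):
    return [doc_id for doc_id, meta in metadata_table.items()
            if meta.get("doc_type") == doc_type]

catalogs = filter_by_type(
    {
        "s23_catalog.pdf": {"doc_type": "product catalog"},
        "hvac_manual.pdf": {"doc_type": "maintenance manual"},
    },
    "product catalog",
)
```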

Conclusion

By using a Knowledge Table to categorize and tag documents and then converting that information into a Knowledge Graph, enterprises can streamline the search process in RAG systems. This method enables more accurate document retrieval, reduces unnecessary search space, and improves overall efficiency in accessing enterprise knowledge.
