File Directory for Large Document RAG Systems: Improving Information Retrieval

Problem Statement:
How can we efficiently retrieve the right document from thousands in a Retrieval-Augmented Generation (RAG) system?

In this insight, we’ll show you how to create an automated file directory through a Knowledge Table. By running preset queries on your documents and saving the results as a Knowledge Graph, you can direct natural language queries to a filtered set of documents. This method not only reduces the search space but also enhances retrieval accuracy.

When building RAG systems for enterprises, the challenge lies in managing collections of tens of thousands of documents. These collections are often messy and unstructured, and must be streamlined before AI-powered retrieval can work effectively.

RAG systems can be broken down into two key steps:

  1. Identifying the most relevant document(s) for a given query.
  2. Parsing the content within the filtered documents to retrieve the required information.

Step 1: Intelligent Document Parsing and Tagging

To identify relevant documents, we first need intelligent document parsing and tagging systems that can assign contextual labels to each document. Fortunately, Large Language Models (LLMs) excel at this task.

Consider a scenario where we use a set of 10 Samsung product catalogs. We run a set of predefined queries against each document; together, these queries form the schema for the document metadata. They include:

  • “What product is discussed?”
  • “What is the model discussed?”
  • “What category does this product belong to?”
  • “What is the product description?”

Each query generates a metadata value tied to the specific document, which helps classify the content accurately. These tags and metadata become the foundation of our Knowledge Table.
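As a sketch of what this tagging step produces, the answers for each document can be collected into a metadata record and flattened into subject–predicate–object triples for the Knowledge Table. The field names, document ID, and example answers below are illustrative, not output from an actual LLM run:

```python
# Preset queries that define the metadata schema (mirrors the list above)
SCHEMA = ["product", "model", "category", "description"]

def to_triples(doc_id: str, metadata: dict) -> list:
    """Flatten one document's metadata record into (subject, predicate, object)
    triples suitable for building a Knowledge Table."""
    return [(doc_id, field, value)
            for field, value in metadata.items()
            if field in SCHEMA]

# Hypothetical answers an LLM might return for one Samsung catalog
record = {
    "product": "Galaxy S23",
    "model": "SM-S911",
    "category": "phone",
    "description": "Flagship smartphone with biometric unlock",
}

triples = to_triples("catalog_s23.pdf", record)
```

Each triple ties a metadata value back to its source document, which is what later lets a query be routed to the right file.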

Step 2: Programmatic Metadata Generation

This entire metadata-generation process can be automated programmatically, which is particularly useful for developers who need to handle large document sets without relying on a UI.

Here’s an example of creating a Knowledge Graph using a set of pre-defined metadata:

```python
from whyhow import WhyHow

client = WhyHow(api_key="<whyhow-api-key>", base_url="https://api.whyhow.ai")

# Build a Knowledge Graph from the exported Knowledge Table triples
graph_id = client.create_graph_from_knowledge_table(
    file_path="./path/to/triples.json",
    workspace_name="manuals",
    graph_name="manual graph"
)

# Print the created graph ID
print(f"Created graph with ID: {graph_id}")
```

This code snippet creates a “Manual Graph” with document metadata, and the graph can now be queried to extract high-level context about the documents. The metadata is also stored as properties of each node, allowing us to trace information back to the original document.

Step 3: Filtering Relevant Documents

Once the metadata is created, we can run queries to identify the most relevant documents for answering specific questions.

For example, if we need to answer, “Which 2023 devices have biometric capabilities?” we first query the general manual graph to find all relevant documents from 2023. Then, we query the individual product-specific graphs to extract the precise information.

Here’s the process:

```python
def get_docs(q: str):
    # Stage one: query the high-level manual graph to find relevant documents
    query_response = client.graphs.query_unstructured(
        graph_id="<manual-graph-id>",
        query=q
    )
    return query_response

def query_specific_graph(graph_id: str, query: str):
    # Stage two: query an individual product-specific graph for the details
    answer = client.graphs.query_unstructured(graph_id=graph_id, query=query)
    return answer.answer
```
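Conceptually, the first stage is just a metadata filter. Setting the graph API aside, the same routing logic can be sketched over plain metadata records; the field names and document IDs here are illustrative:

```python
def filter_docs(records: list, **criteria) -> list:
    """Return IDs of documents whose metadata matches every criterion,
    mirroring the first-stage query against the manual graph."""
    return [
        r["doc_id"]
        for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]

# Hypothetical metadata records produced in Step 1
catalog = [
    {"doc_id": "s23.pdf", "category": "phone", "year": "2023"},
    {"doc_id": "s21.pdf", "category": "phone", "year": "2021"},
    {"doc_id": "book3.pdf", "category": "laptop", "year": "2023"},
]

# Which 2023 documents should the second stage search?
relevant = filter_docs(catalog, year="2023")
```

Only the documents in `relevant` are queried in detail, which is what shrinks the search space.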

Step 4: Aggregating and Summarizing Answers

Once relevant documents are identified, we collect responses from each specific product graph. These answers are then aggregated, and a final summarization step is performed to generate the final response.

```python
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_final_answer(answers: list, query: str):
    # Combine the per-graph answers into a single summarization prompt
    prompt = f"""
    Answer the following query using the information provided
    in the data below.
    Explain your answer.

    Query: {query}
    Data: {answers}
    """

    print(f"Prompt: {prompt}")

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )

    return response.choices[0].message.content
```

Step 5: Extending to Other Metadata Types

We can extend this process to query for different metadata types. For instance, instead of querying by year, you can query by product type, such as “phones” or “laptops.” In this case, querying the system for “Which biometric capabilities do phones have?” will return information from product-specific graphs for devices like the Samsung Galaxy S21, S22, and S23.

This approach can be extended to classify documents based on their type—product catalogs, maintenance manuals, academic papers, or any other type of document. By organizing and filtering documents appropriately, enterprises can ensure that they only search through relevant documents, thus improving retrieval efficiency.
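Routing by document type works the same way as routing by year or category. A minimal sketch, assuming each metadata record carries a `doc_type` tag produced during Step 1 (the tag values and file names are illustrative):

```python
from collections import defaultdict

def group_by_type(records: list) -> dict:
    """Bucket document IDs by their tagged type, so each query only
    searches the bucket it needs."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["doc_type"]].append(r["doc_id"])
    return dict(buckets)

# Hypothetical records spanning several document types
docs = [
    {"doc_id": "s23_catalog.pdf", "doc_type": "product catalog"},
    {"doc_id": "tv_repair.pdf", "doc_type": "maintenance manual"},
    {"doc_id": "s21_catalog.pdf", "doc_type": "product catalog"},
]

buckets = group_by_type(docs)
```

A question about repair procedures would then be routed only to the "maintenance manual" bucket, leaving product catalogs untouched.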

Conclusion

By using a Knowledge Table to categorize and tag documents and then converting that information into a Knowledge Graph, enterprises can streamline the search process in RAG systems. This method enables more accurate document retrieval, reduces unnecessary search space, and improves overall efficiency in accessing enterprise knowledge.
