LLMs Turn CSVs into Knowledge Graphs

Neo4j Runway and Healthcare Knowledge Graphs

Neo4j Runway was recently introduced as a tool to simplify the migration of relational data into graph structures. According to its GitHub page, “Neo4j Runway is a Python library that simplifies the process of migrating your relational data into a graph. It provides tools that abstract communication with OpenAI to run discovery on your data and generate a data model, as well as tools to generate ingestion code and load your data into a Neo4j instance.” In essence, you supply a CSV file, and the LLM identifies the nodes and relationships and generates a knowledge graph from them.

Knowledge Graphs in healthcare are powerful tools for organizing and analyzing complex medical data. These graphs structure information to elucidate relationships between different entities, such as diseases, treatments, patients, and healthcare providers.
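
As a rough illustration of the structure described above, a knowledge graph can be reduced to a list of (subject, relationship, object) triples. The sketch below is only an illustration: the entities, the relationship names and the helper functions `build_index` and `subjects_with` are invented for this example and are not a clinical vocabulary or part of any library.

```python
from collections import defaultdict

# Hypothetical triples: (subject, relationship, object).
TRIPLES = [
    ("Influenza", "HAS_SYMPTOM", "Fever"),
    ("Influenza", "HAS_SYMPTOM", "Cough"),
    ("Asthma", "HAS_SYMPTOM", "Cough"),
    ("Asthma", "TREATED_WITH", "Inhaled corticosteroid"),
]


def build_index(triples):
    """Index triples by subject so outgoing edges can be looked up directly."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def subjects_with(triples, relation, obj):
    """Return every subject connected to `obj` through `relation`, sorted."""
    return sorted(s for s, r, o in triples if r == relation and o == obj)


index = build_index(TRIPLES)
print(index["Asthma"])
print(subjects_with(TRIPLES, "HAS_SYMPTOM", "Cough"))
```

A graph database such as Neo4j stores the same kind of information, but with typed nodes, properties and indexed traversal instead of a flat list.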

Applications of Knowledge Graphs in Healthcare

Integration of Diverse Data Sources

Knowledge graphs can integrate data from various sources such as electronic health records (EHRs), medical research papers, clinical trial results, genomic data, and patient histories.

Improving Clinical Decision Support

By linking symptoms, diagnoses, treatments, and outcomes, knowledge graphs can enhance clinical decision support systems (CDSS). They provide a comprehensive view of interconnected medical knowledge, potentially improving diagnostic accuracy and treatment effectiveness.

Personalized Medicine

Knowledge graphs enable the development of personalized treatment plans by correlating patient-specific data with broader medical knowledge. This includes understanding relationships between genetic information, disease mechanisms, and therapeutic responses, leading to more tailored healthcare interventions.

Drug Discovery and Development

In pharmaceutical research, knowledge graphs can accelerate drug discovery by identifying potential drug targets and understanding the biological pathways involved in diseases.

Public Health and Epidemiology

Knowledge graphs are useful in public health for tracking disease outbreaks, understanding epidemiological trends, and planning interventions. They integrate data from various public health databases, social media, and other sources to provide real-time insights into public health threats.

Neo4j Runway Library

Neo4j Runway is an open-source library created by Alex Gilmore; its GitHub repository and an accompanying blog post describe its features and capabilities. At the time of writing, the library supports OpenAI LLMs for parsing CSVs and offers the following features:

Data Discovery: Leveraging OpenAI LLMs to extract meaningful insights from data.
Graph Data Modeling: Using OpenAI and the Instructor Python library to develop accurate graph data models.
Code Generation: Creating ingestion code tailored to the preferred data loading method.
Data Ingestion: Utilizing Runway’s built-in PyIngest implementation, a widely-used Neo4j ingestion tool, to load data.

The library removes most of the need to write Cypher by hand, since the LLM drives the CSV-to-knowledge-graph conversion. In addition, LangChain’s GraphCypherQAChain can generate Cypher queries from natural-language prompts, so the graph can be queried without hand-written Cypher.
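
To give a sense of what generated ingestion code amounts to, here is a minimal sketch that turns a node description into a parameterized Cypher `MERGE` statement. The function `node_merge_statement` and its arguments are hypothetical; this is not Runway's code generator, only an illustration of the general idea under the assumption that rows are passed in as a `$rows` parameter.

```python
def node_merge_statement(label, key, properties=()):
    """Build a Cypher statement that upserts one node per CSV row.

    label      -- node label, e.g. "Disease"
    key        -- CSV column that uniquely identifies the node
    properties -- other CSV columns to copy onto the node
    """
    lines = [
        "UNWIND $rows AS row",
        f"MERGE (n:`{label}` {{`{key}`: row.`{key}`}})",
    ]
    if properties:
        assignments = ", ".join(f"n.`{p}` = row.`{p}`" for p in properties)
        lines.append(f"SET {assignments}")
    return "\n".join(lines)


print(node_merge_statement("Disease", "name"))
print(node_merge_statement("Demographics", "age", ["gender"]))
```

Using `MERGE` rather than `CREATE` keeps the load idempotent: running it twice over the same rows does not duplicate nodes.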

Practical Implementation in Healthcare

To test Neo4j Runway in a healthcare context, a simple dataset from Kaggle (Disease Symptoms and Patient Profile Dataset) was used. This dataset includes columns such as Disease, Fever, Cough, Fatigue, Difficulty Breathing, Age, Gender, Blood Pressure, Cholesterol Level, and Outcome Variable. The goal was to provide a medical report to the LLM to get diagnostic hypotheses.

Libraries and Environment Setup

# Install necessary packages (run these two commands in a shell, not in Python)
sudo apt install python3-pydot graphviz
pip install neo4j-runway

# Import necessary libraries
import numpy as np
import pandas as pd
from neo4j_runway import Discovery, GraphDataModeler, IngestionGenerator, LLM, PyIngest
from IPython.display import display, Markdown, Image

Load Environment Variables

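import os
from dotenv import load_dotenv

# Read credentials from a .env file. os.getenv takes the variable NAME, not its value;
# the names below are assumptions and must match the entries in your .env file.
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
NEO4J_URL = os.getenv('NEO4J_URL')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')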
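
`os.getenv` looks up a variable by name and returns `None` when it is unset, so a missing credential can go unnoticed until a connection fails much later. A minimal sketch of a fail-fast check follows; the helper `require_env` and the variable name `EXAMPLE_NEO4J_URL` are hypothetical and exist only for this illustration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`; raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demonstration with a placeholder variable set in-process.
os.environ["EXAMPLE_NEO4J_URL"] = "neo4j+s://example.databases.neo4j.io"
print(require_env("EXAMPLE_NEO4J_URL"))
```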

Load and Prepare Medical Data

# Load the raw CSV and strip stray whitespace from the column names
disease_df = pd.read_csv('/home/user/Disease_symptom.csv')
disease_df.columns = disease_df.columns.str.strip()
# Cast every column to string
disease_df = disease_df.astype(str)
# Save the cleaned copy that the ingestion step reads later
disease_df.to_csv('/home/user/disease_prepared.csv', index=False)

Data Description for the LLM

DATA_DESCRIPTION = {
    'Disease': 'The name of the disease or medical condition.',
    'Fever': 'Indicates whether the patient has a fever (Yes/No).',
    'Cough': 'Indicates whether the patient has a cough (Yes/No).',
    'Fatigue': 'Indicates whether the patient experiences fatigue (Yes/No).',
    'Difficulty Breathing': 'Indicates whether the patient has difficulty breathing (Yes/No).',
    'Age': 'The age of the patient in years.',
    'Gender': 'The gender of the patient (Male/Female).',
    'Blood Pressure': 'The blood pressure level of the patient (Normal/High).',
    'Cholesterol Level': 'The cholesterol level of the patient (Normal/High).',
    'Outcome Variable': 'The outcome variable indicating the result of the diagnosis or assessment for the specific disease (Positive/Negative).'
}
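
Because the description dictionary is keyed by column name, a typo or a stray space in either place silently leaves a column undescribed. The sketch below shows one way to check coverage before running discovery; `check_descriptions` is a hypothetical helper, and the sample columns are chosen only to demonstrate the two kinds of mismatch.

```python
def check_descriptions(columns, description):
    """Compare CSV column names with the keys of a description dict.

    Returns (missing, extra): columns that lack a description, and
    described names that do not correspond to any column.
    """
    columns = [c.strip() for c in columns]
    missing = [c for c in columns if c not in description]
    extra = [k for k in description if k not in columns]
    return missing, extra


# Sample input: "Fever " carries a trailing space, "Age" is undescribed,
# and "Gender" is described but absent from the columns.
columns = ["Disease", "Fever ", "Age"]
description = {"Disease": "...", "Fever": "...", "Gender": "..."}
print(check_descriptions(columns, description))
```

With a pandas DataFrame, the first argument could simply be `disease_df.columns`.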

Data Analysis and Model Creation

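# Instantiate Runway's LLM wrapper (assumed here to pick up OPENAI_API_KEY from the environment)
llm = LLM()

# Run discovery on the data using the column descriptions
disc = Discovery(llm=llm, user_input=DATA_DESCRIPTION, data=disease_df)
disc.run()

# Instantiate and create initial graph data model
gdm = GraphDataModeler(llm=llm, discovery=disc)
gdm.create_initial_model()
gdm.current_model.visualize()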

Adjust Relationships

gdm.iterate_model(user_corrections='''
Let's think step by step. Please make the following updates to the data model:
1. Remove the relationships between Patient and Disease, between Patient and Symptom and between Patient and Outcome.
2. Change the Patient node into Demographics.
3. Create a relationship HAS_DEMOGRAPHICS from Disease to Demographics.
4. Create a relationship HAS_SYMPTOM from Disease to Symptom. If the Symptom value is No, remove this relationship.
5. Create a relationship HAS_LAB from Disease to HealthIndicator.
6. Create a relationship HAS_OUTCOME from Disease to Outcome.
''')

# Visualize the updated model
gdm.current_model.visualize().render('output', format='png')
img = Image('output.png', width=1200)
display(img)
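
Correction 4 above asks for a HAS_SYMPTOM relationship only when the symptom value is Yes. The sketch below spells out that rule for a single CSV row in plain Python. It is an illustration of the intended outcome, not of how Runway or the LLM implements it; `symptom_edges` is a hypothetical helper and the sample row is invented.

```python
SYMPTOM_COLUMNS = ["Fever", "Cough", "Fatigue", "Difficulty Breathing"]


def symptom_edges(row):
    """Return (disease, 'HAS_SYMPTOM', symptom) triples for symptoms marked 'Yes'."""
    return [
        (row["Disease"], "HAS_SYMPTOM", symptom)
        for symptom in SYMPTOM_COLUMNS
        if row.get(symptom, "").strip().lower() == "yes"
    ]


# Invented row in the shape of the prepared CSV (all values are strings).
row = {"Disease": "Asthma", "Fever": "No", "Cough": "Yes",
       "Fatigue": "Yes", "Difficulty Breathing": "Yes"}
print(symptom_edges(row))
```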

Generate Cypher Code and YAML File

# Instantiate ingestion generator
gen = IngestionGenerator(data_model=gdm.current_model, 
                         username="neo4j", 
                         password='yourneo4jpasswordhere',
                         uri='neo4j+s://123654888.databases.neo4j.io',
                         database="neo4j", 
                         csv_dir="/home/user/", 
                         csv_name="disease_prepared.csv")

# Create ingestion YAML 
pyingest_yaml = gen.generate_pyingest_yaml_string()
gen.generate_pyingest_yaml_file(file_name="disease_prepared")

# Load data into Neo4j instance
PyIngest(yaml_string=pyingest_yaml, dataframe=disease_df)

Querying the Graph Database

MATCH (n)
WHERE n:Demographics OR n:Disease OR n:Symptom OR n:Outcome OR n:HealthIndicator
OPTIONAL MATCH (n)-[r]->(m)
RETURN n, r, m

Visualizing Specific Nodes and Relationships

// Query 1: a single disease and everything it points to
// (n is already matched as :Disease, so no extra label filter is needed)
MATCH (n:Disease {name: 'Diabetes'})
OPTIONAL MATCH (n)-[r]->(m)
RETURN n, r, m;

// Query 2: diseases linked to high blood pressure and a positive outcome
MATCH (d:Disease)
MATCH (d)-[r:HAS_LAB]->(l)
MATCH (d)-[r2:HAS_OUTCOME]->(o)
WHERE l.bloodPressure = 'High' AND o.result = 'Positive'
RETURN d, properties(d) AS disease_properties, r, properties(r) AS relationship_properties, l, properties(l) AS lab_properties
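
When a query like the blood-pressure lookup is issued from Python, the filter values are usually better passed as query parameters than pasted into the query text. The sketch below only builds the query string and the parameter dictionary; `disease_query` is a hypothetical helper, and the property names `bloodPressure` and `result` are taken from the query above and depend on the data model the LLM actually produced.

```python
def disease_query(blood_pressure, outcome):
    """Return a Cypher query and its parameters for the HAS_LAB / HAS_OUTCOME lookup."""
    query = (
        "MATCH (d:Disease)-[:HAS_LAB]->(l), (d)-[:HAS_OUTCOME]->(o)\n"
        "WHERE l.bloodPressure = $blood_pressure AND o.result = $outcome\n"
        "RETURN DISTINCT d"
    )
    params = {"blood_pressure": blood_pressure, "outcome": outcome}
    return query, params


query, params = disease_query("High", "Positive")
print(query)
print(params)
```

With the official `neo4j` Python driver, the pair could then be passed to `session.run(query, params)`.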

Automated Cypher Query Generation with Gemini-1.5-Flash

To automatically generate a Cypher query via LangChain (GraphCypherQAChain) and retrieve possible diseases based on a patient’s symptoms and health indicators, the following setup was used:

Connect to the Neo4j Instance

import textwrap
import warnings

# Silence warnings only while the graph class is imported;
# an empty catch_warnings block would have no effect on later code
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    from langchain_community.graphs import Neo4jGraph

NEO4J_USERNAME = "neo4j"
NEO4J_DATABASE = 'neo4j'
NEO4J_URI = 'neo4j+s://1236547.databases.neo4j.io'
NEO4J_PASSWORD = 'yourneo4jdatabasepasswordhere'

# Get the Knowledge Graph from the instance and the schema
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)
kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))
schema = kg.schema

Initialize Vertex AI

import vertexai
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain.llms import VertexAI

# Point the Vertex AI SDK at your project, then create the LangChain LLM wrapper
vertexai.init(project="your-project", location="us-west4")
llm = VertexAI(model_name="gemini-1.5-flash")

Create the Prompt Template

prompt_template = """
Let's think step by step. You are an expert doctor with knowledge of the symptoms and lab test results that patients have. You will take the following information to return a list of possible diseases the patient could have. The information you will receive is in the format of questions and answers, where the questions are the symptoms or lab tests, and the answers are the values or results of those tests.

Graph schema:
{schema}

Given the following information:
{question}

Write a Cypher query that queries the Neo4j graph database to find possible diseases for the patient based on the provided information. Include relationships from Disease to Symptom, HealthIndicator, and Outcome if applicable.
"""

# The chain fills {schema} and {question} when it generates Cypher
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["schema", "question"]
)
# This prompt writes Cypher, so it is passed as cypher_prompt (qa_prompt shapes the final answer).
# The graph must be a keyword argument; newer LangChain releases may also require allow_dangerous_requests=True.
chain = GraphCypherQAChain.from_llm(llm, graph=kg, cypher_prompt=PROMPT)

Query for Diagnostic Hypotheses

output = chain.run('''{
    "Fever": "No",
    "Cough": "Yes",
    "Fatigue": "Yes",
    "Difficulty Breathing": "No",
    "Age": "32",
    "Gender": "Female",
    "Blood Pressure": "Normal",
    "Cholesterol Level": "High",
    "Outcome Variable": "Positive"
}''')
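
The JSON text passed to the chain above is written by hand. When the answers come from a form or another program, it may be more convenient to build that text from a dictionary. The sketch below is one possible way to do it; `build_question` is a hypothetical helper, and dropping empty answers is a design choice made here, not something the original workflow specifies.

```python
import json


def build_question(answers):
    """Serialize patient answers as JSON text, dropping empty values."""
    cleaned = {k: str(v) for k, v in answers.items() if v not in (None, "")}
    return json.dumps(cleaned, indent=4)


# Invented answers; values are converted to strings to match the prepared CSV.
answers = {"Fever": "No", "Cough": "Yes", "Age": 32, "Gender": None}
question = build_question(answers)
print(question)
```

The resulting string could then be passed to the chain in place of the literal shown above.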

The Cypher query returned by the model is:

MATCH (d:Disease)-[:HAS_SYMPTOM]->(s:Symptom {name: "Cough", value: "Yes"}),
(d)-[:HAS_SYMPTOM]->(s2:Symptom {name: "Fatigue", value: "Yes"}),
(d)-[:HAS_LAB]->(l:HealthIndicator {name: "Cholesterol Level", value: "High"}),
(d)-[:HAS_OUTCOME]->(o:Outcome {result: "Positive"})
RETURN d
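
Since the query text comes from an LLM, it can be worth checking that it only reads data before it is run against a real database. The sketch below is a simple keyword check under stated assumptions: `looks_read_only` and the clause list are hypothetical, the list is not exhaustive, and a check like this complements rather than replaces connecting with a read-only database user.

```python
import re

# Clauses that modify data; an assumed list covering common cases only.
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP", "LOAD CSV")


def looks_read_only(cypher):
    """Return True if no write clause appears as a whole word outside string literals."""
    # Drop quoted strings so a value such as "Set" is not mistaken for a clause.
    stripped = re.sub(r"'[^']*'|\"[^\"]*\"", "", cypher)
    for clause in WRITE_CLAUSES:
        pattern = r"\b" + clause.replace(" ", r"\s+") + r"\b"
        if re.search(pattern, stripped, flags=re.IGNORECASE):
            return False
    return True


generated = 'MATCH (d:Disease)-[:HAS_SYMPTOM]->(s:Symptom {name: "Cough", value: "Yes"}) RETURN d'
print(looks_read_only(generated))
print(looks_read_only("MATCH (d:Disease) DETACH DELETE d"))
```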
