Rakesh-Seenu/NeoGraphQA


NeoGraphQA - Automated Knowledge Graph Builder & QA System

This project is a Streamlit application that leverages Large Language Models (OpenAI) and LangChain to automatically convert structured flat data (CSV/Excel) into a Neo4j Knowledge Graph. Once the graph is built, it provides a "Chat with your Data" interface, allowing users to ask natural language questions about the relationships and entities within the dataset.

🚀 Features

  • Automated Data Modeling: Uses GPT-4o-mini to analyze your data columns and automatically design a Neo4j schema (Nodes, Relationships, Properties).
  • Intelligent Import Generation: Generates robust Cypher queries to ingest data, handling unique constraints, data cleaning, and ID generation (apoc.text.slug).
  • Dynamic Knowledge Graph Creation: Clears the existing database, applies constraints, and populates the graph with your specific dataset on startup.
  • Natural Language Q&A: Translates user questions into Cypher queries to retrieve answers directly from the graph.
  • Streamlit UI: A simple, interactive web interface for building the graph and chatting with it.

Simple Flow

sequenceDiagram
    autonumber
    actor User as 👤 You
    participant App as 💻 The App
    participant AI as 🧠 AI Brain
    participant DB as 🗄️ Graph Database

    Note over App, DB: Phase 1: Building the Knowledge (Automatic)
    
    App->>App: Reads your CSV/Excel file
    App->>AI: "Look at this data. How should I structure it?"
    AI-->>App: "Create 'Student', 'University', and 'Department' nodes."
    App->>DB: Saves all the data into the Graph structure
    
    Note over User, DB: Phase 2: Chatting with Data
    
    User->>App: "Which students are in the IT department?"
    App->>AI: "Translate this English question into a Database Query"
    AI-->>App: Generates Code (MATCH (s:Student)-[:ENROLLED_IN]->(d:Department)...)
    App->>DB: Runs the code to find the data
    DB-->>App: Returns "Martin Rodriguez"
    App-->>User: "Martin Rodriguez is in the IT department."

📂 Project Structure

├── src/
│   ├── app.py                  # Main Streamlit application entry point
│   ├── data_ingestion.py       # Handles loading CSV/Excel files
│   ├── df_source.py            # Singleton instance of the loaded dataframe
│   ├── model_builder.py        # Logic for LLM-based schema extraction and Cypher generation
│   ├── neo4j_graph_handler.py  # Manages Neo4j connection, data import, and QA chain
│   └── utils/
│       └── prompts.yaml        # YAML file containing system prompts for the LLM
├── requirements.txt            # Python dependencies
└── README.md

🛠️ Prerequisites

  • Python 3.10+
  • Neo4j Database: You need a running Neo4j instance (AuraDB or local).
  • OpenAI API Key: Required for the LLM reasoning and generation.

⚙️ Installation & Setup

  1. Clone the repository
git clone <repository-url>
cd <repository-directory>
  2. Install Dependencies: Using a virtual environment is recommended.
pip install -r requirements.txt
  3. Environment Configuration: Create a .env file in the root directory and add your credentials:
# OpenAI
OPENAI_API_KEY=sk-your-openai-api-key

# Neo4j
NEO4J_URI=bolt://localhost:7687  # or your AuraDB URI
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password

# Data Source
CSV_FILE_PATH=path/to/your/data.csv # or data.xlsx

Note: Ensure your data file (data.csv or data.xlsx) is accessible via the path defined in CSV_FILE_PATH.

  4. Prompts Configuration: Ensure src/utils/prompts.yaml exists. The code expects utils/prompts.yaml relative to the execution context, so you may need to adjust the path in model_builder.py depending on where you run the script from.
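The two configuration steps above can be sketched in Python. This is a minimal illustration, not the project's actual code: the variable names come from the .env example, while `Settings`, `load_settings`, and `prompts_path` are hypothetical helpers. It also assumes the .env values have already been loaded into the process environment, for example by a dotenv loader.

```python
import os
from dataclasses import dataclass
from pathlib import Path

# Names taken from the .env example; all helpers below are hypothetical.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "CSV_FILE_PATH",
)


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    csv_file_path: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ), failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(**{key.lower(): env[key] for key in REQUIRED_KEYS})


def prompts_path(module_file):
    """Resolve utils/prompts.yaml next to the given module, independent of the cwd."""
    return Path(module_file).resolve().parent / "utils" / "prompts.yaml"
```

Resolving the prompts file relative to the module (for instance `prompts_path(__file__)` inside model_builder.py) is one way to avoid the working-directory dependency described in step 4.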

▶️ Usage

Run the Streamlit application:

streamlit run src/app.py

How it works at Runtime:

  • Initialization: The app reads your CSV/Excel file.
  • Modeling: The LLM analyzes the columns to propose a Node/Relationship schema.
  • Constraint Generation: Unique constraints are generated to ensure data integrity.
  • Import: Cypher scripts are generated and executed to populate the Neo4j database.
  • Chat: You can now type questions like "Which students are enrolled in the IT department?" and get answers based on the graph data.
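To make the Initialization and Modeling steps more concrete, here is a small sketch of how column samples might be collected before they are sent to the LLM. The project itself loads data with Pandas; this version uses only the standard library, and `sample_columns` is a hypothetical name, so treat it as an illustration of the idea rather than the app's implementation.

```python
import csv
import io


def sample_columns(csv_text, max_rows=5):
    """Collect distinct non-empty values per column from the first max_rows rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    samples = {name: [] for name in (reader.fieldnames or [])}
    for row_number, row in enumerate(reader):
        if row_number >= max_rows:
            break
        for name, value in row.items():
            # Skip overflow fields (key None), empty cells, and repeated values.
            if name in samples and value and value not in samples[name]:
                samples[name].append(value)
    return samples
```

Sending a handful of values per column instead of the full file keeps the modeling prompt small, which matters for token cost on large datasets.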

🧩 How It Works (Under the Hood)

  • Data Ingestion: data_ingestion.py loads the raw file into a Pandas DataFrame.
  • Schema Inference: model_builder.py iterates through dataframe columns. It sends samples to the LLM (using prompts from prompts.yaml) to determine if a column represents an Entity (Node), a Relationship, or a Property.
  • Refinement: The LLM reviews the schema to handle nested objects or arrays, aiming for a normalized graph structure comparable to third normal form.
  • Cypher Generation: The system generates specific MERGE statements to create nodes and relationships idempotently.
  • Graph Construction: neo4j_graph_handler.py executes these statements against your Neo4j instance.
  • Q&A: The GraphCypherQAChain from LangChain converts English to Cypher to answer user queries.
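The Cypher Generation step can be illustrated with a few string-building helpers. In the app the statements are produced by the LLM, so the functions below (`slugify`, `constraint_statement`, `merge_relationship`) are hypothetical and only show the general shape of idempotent output. `slugify` is a rough stand-in for slug-style ID generation, not a port of apoc.text.slug, and may differ from it in casing and character handling. The constraint uses the `FOR ... REQUIRE` syntax of recent Neo4j versions.

```python
import re


def slugify(value):
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", value.strip().lower()).strip("-")


def constraint_statement(label, key):
    """Uniqueness constraint; IF NOT EXISTS makes it safe to re-run."""
    return (
        f"CREATE CONSTRAINT {label.lower()}_{key}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{key} IS UNIQUE"
    )


def merge_relationship(start_label, rel_type, end_label, key="id"):
    """Parameterised MERGE for two nodes and the relationship between them."""
    return (
        f"MERGE (a:{start_label} {{{key}: $start_{key}}}) "
        f"MERGE (b:{end_label} {{{key}: $end_{key}}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
```

Because MERGE matches an existing node or relationship before creating one, running the same import twice should not produce duplicates, which is what "idempotently" refers to above.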

In-Depth Flow

sequenceDiagram
    actor User
    participant StreamlitUI as Streamlit App (app.py)
    participant DataLoader as Data Ingestion (data_ingestion.py)
    participant Builder as Model Builder (model_builder.py)
    participant LLM as OpenAI (GPT-4o)
    participant GraphHandler as Graph Handler (neo4j_graph_handler.py)
    participant Neo4jDB as Neo4j Database

    Note over StreamlitUI, Builder: Initialization Phase
    StreamlitUI ->> DataLoader: Load CSV/Excel from Environment
    DataLoader -->> StreamlitUI: Return DataFrame
    
    StreamlitUI ->> Builder: Import Schema & Cypher Logic
    activate Builder
    Builder ->> LLM: Analyze DataFrame Columns (model_prompt)
    LLM -->> Builder: Proposed Data Model (Nodes/Rels)
    Builder ->> LLM: Generate Unique ID Logic (uid_prompt)
    LLM -->> Builder: ID Schema
    Builder ->> LLM: Generate Import Cypher (import_prompt)
    LLM -->> Builder: Cypher Merge Statements
    deactivate Builder

    Note over StreamlitUI, Neo4jDB: Graph Construction Phase
    StreamlitUI ->> GraphHandler: Initialize Handler
    StreamlitUI ->> GraphHandler: clear_db()
    GraphHandler ->> Neo4jDB: DELETE ALL nodes
    StreamlitUI ->> GraphHandler: apply_constraints()
    GraphHandler ->> Neo4jDB: CREATE CONSTRAINTS
    StreamlitUI ->> GraphHandler: import_data()
    GraphHandler ->> Neo4jDB: Execute MERGE Statements (Batch Import)
    
    Note over User, Neo4jDB: Q&A Phase
    StreamlitUI ->> GraphHandler: build_qa_chain()
    User ->> StreamlitUI: Ask Question ("Who is enrolled in IT?")
    StreamlitUI ->> GraphHandler: Run QA Chain
    GraphHandler ->> LLM: Translate Question to Cypher
    LLM -->> GraphHandler: Generated Cypher Query
    GraphHandler ->> Neo4jDB: Execute Query
    Neo4jDB -->> GraphHandler: Graph Results
    GraphHandler ->> LLM: Generate Natural Language Answer
    LLM -->> GraphHandler: Final Answer
    GraphHandler -->> StreamlitUI: Display Answer
    StreamlitUI -->> User: Show Response

⚠️ Important Notes

  • Data Overwrite: The current setup includes handler.clear_db() in app.py. This will wipe your Neo4j database every time the app initializes. Comment this out in src/app.py if you want to persist data across runs.
  • Cost: This application makes heavy use of the OpenAI API for schema generation and querying. Be mindful of token usage with large datasets.
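As an alternative to commenting out the call, the wipe could be gated behind an explicit opt-in. The sketch below assumes a hypothetical RESET_GRAPH environment variable, which is not part of the project; only handler.clear_db() comes from the existing code.

```python
import os

# Values treated as "enabled"; anything else, including an unset variable, is off.
TRUTHY = {"1", "true", "yes", "on"}


def should_clear_db(env=None):
    """Return True only when RESET_GRAPH (a hypothetical variable) is enabled."""
    env = os.environ if env is None else env
    return env.get("RESET_GRAPH", "").strip().lower() in TRUTHY


# Possible use in src/app.py, where `handler` is the existing graph handler:
#     if should_clear_db():
#         handler.clear_db()
```

With this guard the default is to keep existing data, and a rebuild requires setting RESET_GRAPH=true for that run.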

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
