This project is a Streamlit application that leverages Large Language Models (OpenAI) and LangChain to automatically convert structured flat data (CSV/Excel) into a Neo4j Knowledge Graph. Once the graph is built, it provides a "Chat with your Data" interface, allowing users to ask natural language questions about the relationships and entities within the dataset.
- Automated Data Modeling: Uses GPT-4o-mini to analyze your data columns and automatically design a Neo4j schema (Nodes, Relationships, Properties).
- Intelligent Import Generation: Generates robust Cypher queries to ingest data, handling unique constraints, data cleaning, and ID generation (apoc.text.slug).
- Dynamic Knowledge Graph Creation: Clears the existing database, applies constraints, and populates the graph with your specific dataset on startup.
- Natural Language Q&A: Translates user questions into Cypher queries to retrieve answers directly from the graph.
- Streamlit UI: A simple, interactive web interface for building the graph and chatting with it.
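The `apoc.text.slug`-based ID generation mentioned above happens in Cypher via APOC, but it can be approximated in plain Python for sanity-checking IDs before import. This is an illustrative sketch, not the app's actual code:

```python
import re
import unicodedata

def slug(text: str, delimiter: str = "-") -> str:
    """Approximate apoc.text.slug: strip accents, collapse runs of
    non-alphanumeric characters into a delimiter, and lowercase."""
    # Normalize accents (e.g. "é" -> "e") before slugging
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Replace every run of non-alphanumeric characters with the delimiter
    text = re.sub(r"[^a-zA-Z0-9]+", delimiter, text).strip(delimiter)
    return text.lower()
```

For example, `slug("Martin Rodríguez")` yields `martin-rodriguez`, a stable ID for a `Student` node.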
```mermaid
sequenceDiagram
    autonumber
    actor User as 👤 You
    participant App as 💻 The App
    participant AI as 🧠 AI Brain
    participant DB as 🗄️ Graph Database

    Note over App, DB: Phase 1: Building the Knowledge (Automatic)
    App->>App: Reads your CSV/Excel file
    App->>AI: "Look at this data. How should I structure it?"
    AI-->>App: "Create 'Student', 'University', and 'Department' nodes."
    App->>DB: Saves all the data into the Graph structure

    Note over User, DB: Phase 2: Chatting with Data
    User->>App: "Which students are in the IT department?"
    App->>AI: "Translate this English question into a Database Query"
    AI-->>App: Generates Code (MATCH (s:Student)-[:ENROLLED_IN]->(d:Department)...)
    App->>DB: Runs the code to find the data
    DB-->>App: Returns "Martin Rodriguez"
    App-->>User: "Martin Rodriguez is in the IT department."
```
```
├── src/
│   ├── app.py                    # Main Streamlit application entry point
│   ├── data_ingestion.py         # Handles loading CSV/Excel files
│   ├── df_source.py              # Singleton instance of the loaded dataframe
│   ├── model_builder.py          # Logic for LLM-based schema extraction and Cypher generation
│   ├── neo4j_graph_handler.py    # Manages Neo4j connection, data import, and QA chain
│   └── utils/
│       └── prompts.yaml          # YAML file containing system prompts for the LLM
├── requirements.txt              # Python dependencies
└── README.md
```
- Python 3.10+
- Neo4j Database: a running Neo4j instance (AuraDB or local)
- OpenAI API Key: for LLM reasoning and generation
- Clone the repository:

```bash
git clone <repository-url>
cd <repository-directory>
```

- Install dependencies (using a virtual environment is recommended):

```bash
pip install -r requirements.txt
```
- Environment Configuration: Create a `.env` file in the root directory and add your credentials:

```env
# OpenAI
OPENAI_API_KEY=sk-your-openai-api-key

# Neo4j
NEO4J_URI=bolt://localhost:7687   # or your AuraDB URI
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password

# Data Source
CSV_FILE_PATH=path/to/your/data.csv   # or data.xlsx
```
Note: Ensure your data file (data.csv or data.xlsx) is accessible via the path defined in CSV_FILE_PATH.
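A minimal sketch of how these variables might be read and validated at startup. The `load_config` helper below is illustrative, not the app's actual API; in practice, `python-dotenv` would populate `os.environ` from the `.env` file first:

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME",
                 "NEO4J_PASSWORD", "CSV_FILE_PATH")

def load_config() -> dict:
    """Read required settings from the environment and fail fast
    on anything missing, instead of erroring halfway through the import."""
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast here gives a clear error message before any LLM calls or database connections are attempted.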
- Prompts Configuration: Ensure `src/utils/prompts.yaml` exists. The code expects `utils/prompts.yaml` relative to the execution context, so you may need to adjust the path in `model_builder.py` depending on where you run the script from.
```bash
streamlit run src/app.py
```
- Initialization: The app reads your CSV/Excel file.
- Modeling: The LLM analyzes the columns to propose a Node/Relationship schema.
- Constraint Generation: Unique constraints are generated to ensure data integrity.
- Import: Cypher scripts are generated and executed to populate the Neo4j database.
- Chat: You can now type questions like "Which students are enrolled in the IT department?" and get answers based on the graph data.
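The uniqueness constraints generated in step 3 follow a standard Cypher pattern. A simplified sketch of how such a statement might be assembled (the helper below is illustrative, not the app's code):

```python
def uniqueness_constraint(label: str, prop: str) -> str:
    """Build a Cypher uniqueness constraint for one node label/property,
    using IF NOT EXISTS so re-running the statement is safe."""
    return (
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )
```

For example, `uniqueness_constraint("Student", "student_id")` produces a statement that prevents duplicate `Student` nodes with the same ID during import.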
- Data Ingestion: data_ingestion.py loads the raw file into a Pandas DataFrame.
- Schema Inference: model_builder.py iterates through dataframe columns. It sends samples to the LLM (using prompts from prompts.yaml) to determine if a column represents an Entity (Node), a Relationship, or a Property.
- Refinement: The schema is reviewed by the LLM to handle nested objects or arrays, ensuring 3rd normal form-like graph structures.
- Cypher Generation: The system generates specific MERGE statements to create nodes and relationships idempotently.
- Graph Construction: neo4j_graph_handler.py executes these statements against your Neo4j instance.
- Q&A: The GraphCypherQAChain from LangChain converts English to Cypher to answer user queries.
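Per-row import statements along these lines could be produced by the Cypher-generation step. This is a simplified sketch with a hypothetical helper name; the real generated output also handles relationships, data cleaning, and slugged IDs:

```python
def merge_node(label: str, id_prop: str, props: dict) -> str:
    """Build an idempotent MERGE for one node: match on the unique ID
    property, then SET the remaining properties via query parameters."""
    extra = ", ".join(f"n.{k} = ${k}" for k in props if k != id_prop)
    stmt = f"MERGE (n:{label} {{{id_prop}: ${id_prop}}})"
    if extra:
        stmt += f" SET {extra}"
    return stmt
```

Because MERGE matches on the unique ID before creating, running the import twice updates existing nodes instead of duplicating them, which is what makes the pipeline safe to re-run.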
```mermaid
sequenceDiagram
    actor User
    participant StreamlitUI as Streamlit App (app.py)
    participant DataLoader as Data Ingestion (data_ingestion.py)
    participant Builder as Model Builder (model_builder.py)
    participant LLM as OpenAI (GPT-4o)
    participant GraphHandler as Graph Handler (neo4j_graph_handler.py)
    participant Neo4jDB as Neo4j Database

    Note over StreamlitUI, Builder: Initialization Phase
    StreamlitUI ->> DataLoader: Load CSV/Excel from Environment
    DataLoader -->> StreamlitUI: Return DataFrame
    StreamlitUI ->> Builder: Import Schema & Cypher Logic
    activate Builder
    Builder ->> LLM: Analyze DataFrame Columns (model_prompt)
    LLM -->> Builder: Proposed Data Model (Nodes/Rels)
    Builder ->> LLM: Generate Unique ID Logic (uid_prompt)
    LLM -->> Builder: ID Schema
    Builder ->> LLM: Generate Import Cypher (import_prompt)
    LLM -->> Builder: Cypher Merge Statements
    deactivate Builder

    Note over StreamlitUI, Neo4jDB: Graph Construction Phase
    StreamlitUI ->> GraphHandler: Initialize Handler
    StreamlitUI ->> GraphHandler: clear_db()
    GraphHandler ->> Neo4jDB: DELETE ALL nodes
    StreamlitUI ->> GraphHandler: apply_constraints()
    GraphHandler ->> Neo4jDB: CREATE CONSTRAINTS
    StreamlitUI ->> GraphHandler: import_data()
    GraphHandler ->> Neo4jDB: Execute MERGE Statements (Batch Import)

    Note over User, Neo4jDB: Q&A Phase
    StreamlitUI ->> GraphHandler: build_qa_chain()
    User ->> StreamlitUI: Ask Question ("Who is enrolled in IT?")
    StreamlitUI ->> GraphHandler: Run QA Chain
    GraphHandler ->> LLM: Translate Question to Cypher
    LLM -->> GraphHandler: Generated Cypher Query
    GraphHandler ->> Neo4jDB: Execute Query
    Neo4jDB -->> GraphHandler: Graph Results
    GraphHandler ->> LLM: Generate Natural Language Answer
    LLM -->> GraphHandler: Final Answer
    GraphHandler -->> StreamlitUI: Display Answer
    StreamlitUI -->> User: Show Response
```
- Data Overwrite: The current setup calls `handler.clear_db()` in `app.py`, which wipes your Neo4j database every time the app initializes. Comment this out in `src/app.py` if you want to persist data across runs.
- Cost: This application makes heavy use of the OpenAI API for schema generation and querying. Be mindful of token usage with large datasets.
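Rather than commenting the call out, one option is to gate it behind an environment flag. This is a hypothetical sketch; `RESET_DB` is not an existing setting of this project:

```python
import os

def should_reset_db() -> bool:
    """Wipe the database on startup only when RESET_DB is explicitly
    set to a truthy value, so data persists across runs by default."""
    return os.getenv("RESET_DB", "false").strip().lower() in ("1", "true", "yes")

# In app.py, the destructive call could then be guarded:
# if should_reset_db():
#     handler.clear_db()
```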
Contributions are welcome! Please feel free to submit a Pull Request.