This project presents an advanced system for performing multi-document relation extraction and synthesizing the findings into a queryable Knowledge Graph. By leveraging a Large Language Model (microsoft/phi-3-mini-4k-instruct), the system ingests unstructured text from various sources, identifies key entities and their relationships, and constructs a graph-based model of the information.
A key feature of this project is its adaptive pipeline, which automatically detects the capabilities of the host environment. If a compatible GPU environment is detected, it fine-tunes the language model with LoRA (Low-Rank Adaptation) to improve accuracy on the relation-extraction task. Otherwise, it falls back to an inference-only mode, so the system remains operational even without GPU support.
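As a rough illustration, the capability check might look like the sketch below. The helper name `fine_tuning_supported` and the exact conditions are hypothetical; the notebook's actual logic may differ.

```python
# Hypothetical capability check: fine-tuning requires a CUDA GPU plus
# importable triton and bitsandbytes; anything less means inference-only.
import importlib.util

import torch

def fine_tuning_supported() -> bool:
    has_gpu = torch.cuda.is_available()
    has_triton = importlib.util.find_spec("triton") is not None
    has_bnb = importlib.util.find_spec("bitsandbytes") is not None
    return has_gpu and has_triton and has_bnb

mode = "fine-tune" if fine_tuning_supported() else "inference-only"
print(f"Selected pipeline mode: {mode}")
```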
- Multi-Source Data Ingestion: Capable of processing unstructured text from a variety of formats, including PDFs, DOCX files, TXT files, and live website URLs.
- Advanced Relation & Entity Extraction: Goes beyond simple keyword matching to identify complex relationships and classify entities by type (e.g., [Person], [Organization]).
- Adaptive Fine-Tuning Pipeline: Automatically checks the environment for triton and bitsandbytes compatibility. If they are available, it fine-tunes the LLM to specialize it for the extraction task, dramatically improving accuracy; if not, it uses a safe fallback so the system never crashes.
- Knowledge Graph Synthesis: Aggregates all extracted facts into a directed graph using NetworkX, creating a unified view of the information scattered across multiple documents.
- Complex Querying & Analytics: The knowledge graph can be queried to find multi-hop, inferred relationships (e.g., a connection between a CEO and a product, even if they are never mentioned in the same sentence). It also performs centrality analysis to identify the most influential entities (a worked example follows this list).
- Rich Visualizations: Generates a visual representation of the final knowledge graph using Matplotlib, making the connections easy to understand.
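To make querying, centrality, and visualization concrete, here is a minimal NetworkX sketch; the entities and relations are invented for illustration and are not part of the project's sample data.

```python
# Toy knowledge graph with invented entities, for illustration only.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.DiGraph()
G.add_edge("Jane Doe", "Acme Corp", relation="CEO_of")
G.add_edge("Acme Corp", "WidgetX", relation="produces")

# Multi-hop query: infer a connection between a CEO and a product that
# never co-occur in a single sentence.
path = nx.shortest_path(G, source="Jane Doe", target="WidgetX")
print(" -> ".join(path))  # Jane Doe -> Acme Corp -> WidgetX

# Centrality analysis: rank entities by connectedness.
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True))

# Visualization in the spirit of the project's Matplotlib output.
nx.draw_networkx(G, with_labels=True, node_color="lightblue")
plt.show()
```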
- Data Ingestion & Pre-processing: Sources are read and their text is extracted. A co-reference resolution step then cleans the text, replacing pronouns with the entities they refer to.
- Adaptive Pipeline Check: The system determines whether the environment supports fine-tuning (see the capability check sketched above).
- Model Inference:
  - Fine-Tuning Path: If supported, a custom training dataset is used to fine-tune the Phi-3 model via LoRA. Inference is then run using this specialized model.
  - Fallback Path: If not supported, inference is run using the base pre-trained Phi-3 model.
- Information Parsing: The structured text output from the LLM is parsed into distinct entities and relations (a sketch of this step follows this list).
- Knowledge Graph Module: The parsed relations are used to build, analyze, visualize, and save a NetworkX graph.
- Evaluation: The model's output on a test document is compared against a ground-truth dataset to calculate performance metrics.
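A minimal sketch of the parsing, graph-building, and evaluation steps, assuming the LLM is prompted to emit one `head | relation | tail` triple per line; the delimiter, example strings, and gold set are assumptions rather than the notebook's actual prompt contract:

```python
import networkx as nx

# Assumed LLM output format: one "head | relation | tail" triple per line.
llm_output = """Jane Doe | CEO_of | Acme Corp
Acme Corp | produces | WidgetX"""

def parse_relations(text):
    """Parse structured LLM output into (head, relation, tail) triples."""
    triples = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:  # skip malformed lines rather than crash
            triples.append(tuple(parts))
    return triples

# Build the directed knowledge graph from the parsed triples.
G = nx.DiGraph()
for head, relation, tail in parse_relations(llm_output):
    G.add_edge(head, tail, relation=relation)

# Evaluation: compare predicted triples against a ground-truth set.
gold = {("Jane Doe", "CEO_of", "Acme Corp"), ("Acme Corp", "produces", "WidgetX")}
pred = set(parse_relations(llm_output))
tp = len(pred & gold)
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```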
- Language Model: microsoft/phi-3-mini-4k-instruct
- Fine-Tuning: Hugging Face PEFT (LoRA) & bitsandbytes (4-bit quantization); a configuration sketch follows this list
- Core ML/NLP: PyTorch, Hugging Face Transformers, spaCy
- Data Handling: PyMuPDF (PDFs), python-docx (Word), BeautifulSoup (Web); a loader sketch also follows this list
- Graph Analytics: NetworkX, Matplotlib
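The fine-tuning setup with this stack might be configured roughly as follows; the hyperparameters and target modules are illustrative guesses, not the notebook's exact values.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit via bitsandbytes (float16 compute suits a T4).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; rank and target modules here are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # Phi-3 attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```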
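And a multi-format loader built on those data-handling libraries could look like this sketch; dispatch-by-extension is an assumption, and the notebook may organize ingestion differently.

```python
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup
from docx import Document

def extract_text(source: str) -> str:
    """Return plain text from a PDF, DOCX, TXT file, or web URL."""
    if source.startswith("http"):
        html = requests.get(source, timeout=30).text
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    if source.endswith(".pdf"):
        with fitz.open(source) as doc:
            return "\n".join(page.get_text() for page in doc)
    if source.endswith(".docx"):
        return "\n".join(p.text for p in Document(source).paragraphs)
    with open(source, encoding="utf-8") as f:  # assume plain text otherwise
        return f.read()
```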
- Environment: This project is designed to run in Google Colab with a T4 GPU runtime.
- Installation: The first cell in the notebook installs all required dependencies.
- Data: The notebook is self-contained; it automatically creates sample training_data.json, evaluation_data.jsonl, and various document files (.pdf, .docx, .txt) for a full demonstration. A hypothetical record format is sketched below.
- Execution: Simply run the cells in the notebook from top to bottom; the adaptive pipeline will automatically choose the best execution path.
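For orientation only, a record in the generated training_data.json might resemble the following; this schema is a guess for illustration, so inspect the file the notebook actually writes.

```python
# Hypothetical shape of one training record; the generated files may
# use a different schema than this instruction/input/output layout.
import json

sample = {
    "instruction": "Extract all relations from the text as 'head | relation | tail' lines.",
    "input": "Jane Doe is the CEO of Acme Corp.",
    "output": "Jane Doe [Person] | CEO_of | Acme Corp [Organization]",
}
print(json.dumps(sample, indent=2))
```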