This project presents an advanced system for performing multi-document relation extraction and synthesizing the findings into a queryable Knowledge Graph. By leveraging a Large Language Model (microsoft/phi-3-mini-4k-instruct), the system ingests unstructured text from various sources, identifies key entities and their relationships, and constructs a graph-based model of the information.
A key feature of this project is its adaptive pipeline, which automatically detects the capabilities of the host environment. If a compatible GPU environment is detected, it fine-tunes the language model with LoRA (Low-Rank Adaptation) to improve accuracy on the relation-extraction task. Otherwise, it falls back to an inference-only mode, so the system remains operational even without GPU support.
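As a rough illustration, the capability check might look like the sketch below. The helper name `fine_tuning_supported` and the exact conditions are hypothetical; the notebook's actual logic may differ.

```python
# Hypothetical capability check: fine-tuning requires a CUDA GPU plus
# importable triton and bitsandbytes; anything less means inference-only.
import importlib.util

import torch

def fine_tuning_supported() -> bool:
    has_gpu = torch.cuda.is_available()
    has_triton = importlib.util.find_spec("triton") is not None
    has_bnb = importlib.util.find_spec("bitsandbytes") is not None
    return has_gpu and has_triton and has_bnb

mode = "fine-tune" if fine_tuning_supported() else "inference-only"
print(f"Selected pipeline mode: {mode}")
```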
- Multi-Source Data Ingestion: Capable of processing unstructured text from a variety of formats, including PDFs, DOCX files, TXT files, and live website URLs.
- Advanced Relation & Entity Extraction: Goes beyond simple keyword matching to identify complex relationships and classify entities by type (e.g., [Person], [Organization]).
- Adaptive Fine-Tuning Pipeline: Automatically checks the environment for triton and bitsandbytes compatibility. If they are available, it fine-tunes the LLM to specialize it for the extraction task, dramatically improving accuracy; if not, it uses a safe fallback so the system never crashes.
- Knowledge Graph Synthesis: Aggregates all extracted facts into a directed graph using NetworkX, creating a unified view of the information scattered across multiple documents.
- Complex Querying & Analytics: The knowledge graph can be queried to find multi-hop, inferred relationships (e.g., a connection between a CEO and a product, even if they are never mentioned in the same sentence). It also performs centrality analysis to identify the most influential entities (a worked example follows this list).
- Rich Visualizations: Generates a visual representation of the final knowledge graph using Matplotlib, making the connections easy to understand.
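To make querying, centrality, and visualization concrete, here is a minimal NetworkX sketch; the entities and relations are invented for illustration and are not part of the project's sample data.

```python
# Toy knowledge graph with invented entities, for illustration only.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.DiGraph()
G.add_edge("Jane Doe", "Acme Corp", relation="CEO_of")
G.add_edge("Acme Corp", "WidgetX", relation="produces")

# Multi-hop query: infer a connection between a CEO and a product that
# never co-occur in a single sentence.
path = nx.shortest_path(G, source="Jane Doe", target="WidgetX")
print(" -> ".join(path))  # Jane Doe -> Acme Corp -> WidgetX

# Centrality analysis: rank entities by connectedness.
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True))

# Visualization in the spirit of the project's Matplotlib output.
nx.draw_networkx(G, with_labels=True, node_color="lightblue")
plt.show()
```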
- Data Ingestion & Pre-processing: Sources are read and their text is extracted. A co-reference resolution step then cleans the text, replacing pronouns with the entities they refer to.
- Adaptive Pipeline Check: The system determines whether the environment supports fine-tuning (see the capability check sketched above).
- Model Inference:
  - Fine-Tuning Path: If supported, a custom training dataset is used to fine-tune the Phi-3 model via LoRA. Inference is then run using this specialized model.
  - Fallback Path: If not supported, inference is run using the base pre-trained Phi-3 model.
- Information Parsing: The structured text output from the LLM is parsed into distinct entities and relations (a sketch of this step follows this list).
- Knowledge Graph Module: The parsed relations are used to build, analyze, visualize, and save a NetworkX graph.
- Evaluation: The model's output on a test document is compared against a ground-truth dataset to calculate performance metrics.
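A minimal sketch of the parsing, graph-building, and evaluation steps, assuming the LLM is prompted to emit one `head | relation | tail` triple per line; the delimiter, example strings, and gold set are assumptions rather than the notebook's actual prompt contract:

```python
import networkx as nx

# Assumed LLM output format: one "head | relation | tail" triple per line.
llm_output = """Jane Doe | CEO_of | Acme Corp
Acme Corp | produces | WidgetX"""

def parse_relations(text):
    """Parse structured LLM output into (head, relation, tail) triples."""
    triples = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:  # skip malformed lines rather than crash
            triples.append(tuple(parts))
    return triples

# Build the directed knowledge graph from the parsed triples.
G = nx.DiGraph()
for head, relation, tail in parse_relations(llm_output):
    G.add_edge(head, tail, relation=relation)

# Evaluation: compare predicted triples against a ground-truth set.
gold = {("Jane Doe", "CEO_of", "Acme Corp"), ("Acme Corp", "produces", "WidgetX")}
pred = set(parse_relations(llm_output))
tp = len(pred & gold)
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```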
- Language Model: microsoft/phi-3-mini-4k-instruct
- Fine-Tuning: Hugging Face PEFT (LoRA) & bitsandbytes (4-bit quantization); a configuration sketch follows this list
- Core ML/NLP: PyTorch, Hugging Face Transformers, spaCy
- Data Handling: PyMuPDF (PDFs), python-docx (Word), BeautifulSoup (Web); a loader sketch also follows this list
- Graph Analytics: NetworkX, Matplotlib
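The fine-tuning setup with this stack might be configured roughly as follows; the hyperparameters and target modules are illustrative guesses, not the notebook's exact values.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit via bitsandbytes (float16 compute suits a T4).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; rank and target modules here are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # Phi-3 attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```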
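And a multi-format loader built on those data-handling libraries could look like this sketch; dispatch-by-extension is an assumption, and the notebook may organize ingestion differently.

```python
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup
from docx import Document

def extract_text(source: str) -> str:
    """Return plain text from a PDF, DOCX, TXT file, or web URL."""
    if source.startswith("http"):
        html = requests.get(source, timeout=30).text
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    if source.endswith(".pdf"):
        with fitz.open(source) as doc:
            return "\n".join(page.get_text() for page in doc)
    if source.endswith(".docx"):
        return "\n".join(p.text for p in Document(source).paragraphs)
    with open(source, encoding="utf-8") as f:  # assume plain text otherwise
        return f.read()
```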
- Environment: This project is designed to run in Google Colab with a T4 GPU runtime.
- Installation: The first cell in the notebook installs all required dependencies.
- Data: The notebook is self-contained; it automatically creates sample training_data.json, evaluation_data.jsonl, and various document files (.pdf, .docx, .txt) for a full demonstration. A hypothetical record format is sketched below.
- Execution: Simply run the cells in the notebook from top to bottom; the adaptive pipeline will automatically choose the best execution path.
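For orientation only, a record in the generated training_data.json might resemble the following; this schema is a guess for illustration, so inspect the file the notebook actually writes.

```python
# Hypothetical shape of one training record; the generated files may
# use a different schema than this instruction/input/output layout.
import json

sample = {
    "instruction": "Extract all relations from the text as 'head | relation | tail' lines.",
    "input": "Jane Doe is the CEO of Acme Corp.",
    "output": "Jane Doe [Person] | CEO_of | Acme Corp [Organization]",
}
print(json.dumps(sample, indent=2))
```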