Intelligent Document Analysis and Knowledge Graph Construction

Destroyer795/Multi-Doc-AutoRE

Intelligent Document Analysis & Knowledge Graph Construction

This project performs multi-document relation extraction and synthesizes the findings into a queryable Knowledge Graph. Using a Large Language Model (microsoft/phi-3-mini-4k-instruct), the system ingests unstructured text from several source types, identifies key entities and their relationships, and constructs a graph-based model of the information.

A key feature of this project is its adaptive pipeline, which automatically detects the capabilities of the host environment. If a compatible GPU environment is detected, it fine-tunes the language model using LoRA (Low-Rank Adaptation) to improve accuracy on the specific task of relation extraction. Otherwise, it falls back to an inference-only mode that uses the base pre-trained model, so the pipeline still runs without fine-tuning support.
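The environment check described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function names `module_available`, `cuda_available`, and `choose_mode` are hypothetical, and the rule "fine-tune only when CUDA, triton, and bitsandbytes are all present" is an assumption based on the description in this README.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be found on the import path, without importing it."""
    return importlib.util.find_spec(name) is not None


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if not module_available("torch"):
        return False
    import torch  # deferred so this file still loads on machines without PyTorch

    return torch.cuda.is_available()


def choose_mode(has_cuda: bool, has_triton: bool, has_bitsandbytes: bool) -> str:
    """Pick the execution path: fine-tune only when every requirement is met."""
    if has_cuda and has_triton and has_bitsandbytes:
        return "fine-tune"
    return "inference-only"


if __name__ == "__main__":
    mode = choose_mode(
        cuda_available(),
        module_available("triton"),
        module_available("bitsandbytes"),
    )
    print(f"Selected mode: {mode}")
```

Keeping the decision in a pure function (`choose_mode`) separates it from the probing code, so the fallback logic can be exercised on a machine that has no GPU.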


1. Features

  • Multi-Source Data Ingestion: Capable of processing unstructured text from a variety of formats, including PDFs, DOCX files, TXT files, and live website URLs.

  • Advanced Relation & Entity Extraction: Goes beyond simple keyword matching to identify complex relationships and classify entities by type (e.g., [Person], [Organization]).

  • Adaptive Fine-Tuning Pipeline: Automatically checks the environment for triton and bitsandbytes compatibility. If both are available, it fine-tunes the LLM to specialize it for the extraction task, with the goal of improving accuracy. If not, it falls back to inference with the base model, so a missing dependency does not stop the pipeline.

  • Knowledge Graph Synthesis: Aggregates all extracted facts into a directed graph using NetworkX, creating a unified view of the information scattered across multiple documents.

  • Complex Querying & Analytics: The knowledge graph can be queried to find multi-hop, inferred relationships (e.g., find a connection between a CEO and a product, even if they aren't mentioned in the same sentence). It also performs centrality analysis to identify the most influential entities.

  • Rich Visualizations: Generates a visual representation of the final knowledge graph using Matplotlib, making the connections easy to understand.
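The graph synthesis, multi-hop querying, and centrality features listed above could look roughly like the following sketch. The triples are invented sample data, the helper `explain_path` is a hypothetical name, and storing the relation label in an edge attribute called `relation` is an assumption, not something confirmed by the notebook. Degree centrality is used here as one possible centrality measure.

```python
import networkx as nx

# Hypothetical (subject, relation, object) triples, as if extracted from
# several documents. They are illustrative only.
triples = [
    ("Alice Chen", "is_ceo_of", "Acme Corp"),
    ("Acme Corp", "develops", "RoboArm"),
    ("Acme Corp", "headquartered_in", "Berlin"),
    ("Bob Li", "works_for", "Acme Corp"),
]

# A DiGraph keeps one edge per ordered node pair; the relation label is
# stored as an edge attribute (assumed layout).
graph = nx.DiGraph()
for subject, relation, obj in triples:
    graph.add_edge(subject, obj, relation=relation)


def explain_path(g, source, target):
    """Return the shortest directed path as (subject, relation, object) hops, or [] if none exists."""
    try:
        nodes = nx.shortest_path(g, source=source, target=target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return []
    return [(u, g[u][v]["relation"], v) for u, v in zip(nodes, nodes[1:])]


# Multi-hop query: the CEO and the product never share a triple,
# but the graph connects them through the company.
hops = explain_path(graph, "Alice Chen", "RoboArm")

# Centrality analysis: the entity with the most connections scores highest.
centrality = nx.degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
```

With this sample data, `hops` contains two steps (CEO to company, company to product), and the company node is the most central one because every other entity connects through it.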


2. System Architecture

  1. Data Ingestion & Pre-processing: Sources are read and their text is extracted. A co-reference resolution step cleans the text, replacing pronouns with the entities they refer to.

  2. Adaptive Pipeline Check: The system determines if the environment supports fine-tuning.

  3. Model Inference:

    • Fine-Tuning Path: If supported, a custom training dataset is used to fine-tune the Phi-3 model via LoRA. Inference is then run using this specialized model.

    • Fallback Path: If not supported, inference is run using the base pre-trained Phi-3 model.

  4. Information Parsing: The structured text output from the LLM is parsed into distinct entities and relations.

  5. Knowledge Graph Module: The parsed relations are used to build, analyze, visualize, and save a NetworkX graph.

  6. Evaluation: The model's output on a test document is compared against a ground-truth dataset to calculate performance metrics.
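Step 4 above (Information Parsing) can be illustrated with a small parser. The README does not specify the LLM's output format, so the one-relation-per-line layout shown in the comment is purely an assumption, and the names `Relation`, `parse_line`, and `parse_output` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Relation(NamedTuple):
    subject: str
    subject_type: str
    predicate: str
    obj: str
    obj_type: str


# Assumed line format (hypothetical, not taken from the notebook):
#   [Person] Alice Chen | is_ceo_of | [Organization] Acme Corp
LINE_PATTERN = re.compile(
    r"^\s*\[(?P<stype>[^\]]+)\]\s*(?P<subj>[^|]+?)\s*\|"
    r"\s*(?P<pred>[^|]+?)\s*\|"
    r"\s*\[(?P<otype>[^\]]+)\]\s*(?P<obj>[^|]+?)\s*$"
)


def parse_line(line: str) -> Optional[Relation]:
    """Parse one line into a Relation, or return None if it does not match the assumed format."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return Relation(
        match["subj"], match["stype"], match["pred"], match["obj"], match["otype"]
    )


def parse_output(text: str) -> List[Relation]:
    """Parse every well-formed line and drop exact duplicates, preserving order."""
    seen = set()
    relations = []
    for line in text.splitlines():
        relation = parse_line(line)
        if relation is not None and relation not in seen:
            seen.add(relation)
            relations.append(relation)
    return relations
```

Lines that do not match are skipped rather than raising an error, since free-form model output often includes extra commentary around the structured part.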


3. Technologies

  • Language Model: microsoft/phi-3-mini-4k-instruct

  • Fine-Tuning: Hugging Face PEFT (LoRA) & bitsandbytes (4-bit quantization)

  • Core ML/NLP: PyTorch, Hugging Face Transformers, spaCy

  • Data Handling: PyMuPDF (PDFs), python-docx (Word), BeautifulSoup (Web)

  • Graph Analytics: NetworkX, Matplotlib
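For orientation, a LoRA setup with 4-bit quantization using the libraries listed above might be configured along these lines. Every hyperparameter value is an illustrative assumption, as are the `target_modules` names (they should be checked against the loaded model), and `build_configs` is a hypothetical helper, not the notebook's code.

```python
# Plain-Python hyperparameters. All values are illustrative assumptions.
LORA_HPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    # Assumed projection-layer names for Phi-3; verify against the loaded model.
    "target_modules": ["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
}

QUANT_HPARAMS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}


def build_configs():
    """Create the bitsandbytes and PEFT config objects.

    Imports are deferred so this file still loads where the GPU stack is missing.
    """
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        # float16 is chosen because the T4 has no native bfloat16 support.
        bnb_4bit_compute_dtype=torch.float16,
        **QUANT_HPARAMS,
    )
    lora_config = LoraConfig(**LORA_HPARAMS)
    return quant_config, lora_config
```

In a typical PEFT workflow, `quant_config` is passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and the loaded model is then wrapped with `peft.get_peft_model(model, lora_config)` before training.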


4. How to Use

  1. Environment: This project is designed to run in a Google Colab environment with a T4 GPU runtime.

  2. Installation: The first cell in the notebook installs all required dependencies.

  3. Data: The notebook is self-contained. It will automatically create sample training_data.json, evaluation_data.jsonl, and various document files (.pdf, .docx, .txt) for a full demonstration.

  4. Execution: Simply run the cells in the notebook from top to bottom. The adaptive pipeline will automatically choose the best execution path.
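The evaluation against ground truth can be sketched as a set comparison of triples. This README does not state which metrics are computed or how evaluation_data.jsonl is structured, so precision, recall, and F1 are assumed here, and the field names `subject`, `relation`, and `object` are hypothetical.

```python
import json
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def load_gold_triples(jsonl_text: str) -> Set[Triple]:
    """Read one JSON object per line; the field names are assumed, not confirmed."""
    gold = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        gold.add((record["subject"], record["relation"], record["object"]))
    return gold


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each element so trivial formatting differences do not count as errors."""
    subject, relation, obj = triple
    return (subject.strip().lower(), relation.strip().lower(), obj.strip().lower())


def score(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Dict[str, float]:
    """Compute exact-match precision, recall, and F1 over normalized triples."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Exact matching is strict: a prediction with a paraphrased relation name counts as wrong, so scores from a sketch like this are a lower bound on how useful the extracted facts are.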
