
Document Analysis using NLP, Neo4j and Elastic. For example clustering of clients into sectors based on shared documents, scoring them on Cyber Security.


Smart Documents and Analysis

Analyze a set of documents (from companies) to give a useful snapshot or summary. Original documents are left unmodified; the only files written are the reports in the 'output' folder.

Table of Contents

  1. Expected folder structure
  2. Scripts
  3. Script Structure
  4. What can you do with this project?
  5. Keeping confidential information confidential by default
  6. Key sections in this guide
  7. Underlying technologies
  8. Configuration
  9. First time Setup
  10. Running the application

Expected folder structure

Most scripts have a parameter to point to the client documents (e.g. '..'). There are also parameters to exclude folders (such as the z_scripts folder itself) from the analysis.

__todo: check__

  • CLIENT DOCS
      • Client 1 Folder
          • sample word doc
          • sub folder
              • sample xl doc
      • Client 2 Folder
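The folder parameters described above can be sketched as a simple directory walk. This is a minimal illustration, not the project's actual code; the function name, default root and the excluded folder names are assumptions:

```python
import os

def find_client_docs(root="..", exclude_dirs=("z_scripts", "output")):
    """Walk the client document tree, skipping excluded folders.

    `root` and `exclude_dirs` mirror the script parameters described
    above; the defaults here are assumptions, not the project's values.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded folders in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found
```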

Scripts

The key user scripts are in the main app directory. The top of each script contains parameters to modify its behaviour.

  • one_client_pdf.py - gathers all client data into one PDF
  • snapshot.py - makes a summary of client documents (e.g. last updated, size, number, sentiment)
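The kind of summary snapshot.py produces can be sketched as below. This is a minimal illustration under assumed names; the real script also adds NLP measures such as sentiment:

```python
import os
from datetime import datetime

def snapshot_stats(paths):
    """Summarise a set of documents: count, total size, most recent update.

    A hypothetical sketch of the snapshot described above, covering only
    the filesystem-derived fields (last updated, size, number).
    """
    stats = [os.stat(p) for p in paths]
    return {
        "number": len(paths),
        "total_size_bytes": sum(s.st_size for s in stats),
        "last_updated": (
            datetime.fromtimestamp(max(s.st_mtime for s in stats)).isoformat()
            if stats else None
        ),
    }
```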

Script Structure

add text ..

What can you do with this project?

This project allows you to:

  • Quickly summarize key information from a collection of documents.
  • Identify trends and patterns in client communications.
  • Automate the process of answering common business questions.
  • Integrate with other tools like Power Automate for automated workflows.

add text ..

Other key features:

add text ..

Keeping confidential information confidential by default

For obvious reasons, only generic code is shared in this GitHub project; no client information or knowledge is included. This has the benefit that you add only your own documents when you run your secure local copy. Other key project features with confidentiality in mind:

  • Information is stored locally by default in Elastic Search
  • The LLM (Llama) runs locally (with options to use other remote LLMs)
  • Confidential information is redacted before being sent to the LLM (see the setting in the config file if you wish to turn this off)

To further limit exposure, we recommend ingesting only emails and documents that have already been sent externally. Since the project is open source, you can fully audit the code before use.
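The redaction step mentioned above could look something like the following. This is a minimal regex-based sketch under assumed patterns; the project's actual redaction (configurable in config.conf) may work differently, e.g. via NLP entity recognition:

```python
import re

# Hypothetical patterns for confidential tokens; the real project's
# redaction rules may be broader (names, account numbers, etc.).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d \-]{7,}\d"),
}

def redact(text):
    """Replace confidential tokens with placeholders before any LLM call."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```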

Key sections in this guide

add text ..

Underlying technologies

add text ..

Configuration

The main configuration file is located at app/config/config.conf. This file controls various aspects of the application, including:

  • Data source locations
  • LLM selection (local or remote)
  • Redaction settings
  • API keys

Refer to the comments within the app/config/config.conf file for detailed explanations of each setting.
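A config file covering the aspects listed above might look like the sketch below. Every section and key name here is hypothetical; the authoritative reference is the commented app/config/config.conf file itself:

```ini
; Hypothetical sketch of app/config/config.conf -- real keys may differ
[data]
client_docs_dir = ..
exclude_folders = z_scripts, output

[llm]
provider = llama_local      ; or a remote LLM endpoint
redact_before_send = true   ; see "Keeping confidential information confidential"

[keys]
; API keys/tokens are requested on first run and stored locally
```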

__todo: implement this__

  • The main configuration file is in app/config/config.conf. This config file is shared by the ingest script, the Bot and the Application. Please edit it using the notes in the app/config folder.
  • Some APIs (Copilot, OpenAI, Teamworks helpdesk) require tokens the first time they are run. Please consult the documentation of these tools to retrieve these.
  • The script will ask you for these tokens and store them locally in a plain-text JSON file (token-storage-local.json). While this file is excluded from git, you may wish to review who has local access to it, as it stores sensitive information.
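The first-run token flow described above can be sketched as follows. This is an illustration only, assuming a flat JSON mapping of service name to token; the project's actual storage format may differ:

```python
import json
import os

TOKEN_FILE = "token-storage-local.json"  # excluded from git; guard local access

def get_token(service):
    """Return a stored API token, prompting for it on first use.

    A hypothetical sketch of the first-run flow described above.
    """
    tokens = {}
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            tokens = json.load(f)
    if service not in tokens:
        # First run for this service: ask the user and persist the token
        tokens[service] = input(f"Enter API token for {service}: ")
        with open(TOKEN_FILE, "w") as f:
            json.dump(tokens, f)
    return tokens[service]
```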

First time Setup

To set up the project on your local machine and run it for the first time:

  1. Check out or download the project as a folder onto the host computer from the source __todo: update link__ https://github.com/paulbrowne-irl/smart-document-analysis

  2. Install Python (3.12 or higher) in the usual way. The pip and virtualenv tools are also needed.

  3. Install Python dependencies - in a terminal window, at the project root:

```shell
# Create the virtual environment
virtualenv venv
# Activate the virtual environment
source venv/bin/activate
# Install Python dependencies for this environment
pip install -r requirements.txt
```

Running the application

add text ..


Contributing

Contributions to this project are welcome! Please see the CONTRIBUTING.md file for guidelines on how to contribute.
