Analyze a set of documents (from companies) to give a useful snapshot or summary. Original documents are left unmodified; the only modified reports are in the 'output' folder.
- Expected folder structure
- Scripts
- Script Structure
- What can you do with this project?
- Keeping confidential information confidential by default
- Key sections in this guide
- Underlying technologies
- Configuration
- First time Setup
- Running the application
Most scripts have a parameter to point to the client documents (e.g. '..'). There are also parameters to exclude folders (e.g. the z_scripts folder itself) from the analysis.
__ todo check __
- CLIENT DOCS
  - Client 1 Folder
    - sample word doc
    - sub folder
      - sample xl doc
  - Client 2 Folder
The key user scripts are in the main app directory.
The top of each script contains parameters to modify the script behaviour
- one_client_pdf.py - gather all client data into one PDF
- snapshot.py - make a summary of client documents (e.g. last updated, size, number, sentiment etc.)
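As an illustration of the parameter block described above, a script might begin like this. The variable names and helper function here are hypothetical, a minimal sketch rather than the actual scripts' code:

```python
import os

# Hypothetical parameter block in the style described above; edit these
# values at the top of the script to change its behaviour.
CLIENT_DOCS_DIR = ".."                      # folder containing the client documents
EXCLUDE_FOLDERS = {"z_scripts", "output"}   # folders skipped during analysis

def find_documents(root=CLIENT_DOCS_DIR, exclude=EXCLUDE_FOLDERS):
    """Yield paths of documents under root, skipping excluded folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded folders in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in exclude]
        for name in filenames:
            yield os.path.join(dirpath, name)
```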
add text ..
This project allows you to:
- Quickly summarize key information from a collection of documents.
- Identify trends and patterns in client communications.
- Automate the process of answering common business questions.
- Integrate with other tools like Power Automate for automated workflows.
add text ..
Other key features:
add text ..
For obvious reasons, only generic code, and no client information or knowledge, is shared in this GitHub project. This has the benefit that you add only your own documents when you run your secure local copy. Other key project features with confidentiality in mind:
- Information is stored locally by default in Elastic Search
- LLM (Llama) runs locally (with options to use other remote LLMs)
- Confidential information is redacted before being sent to the LLM (see the setting in the config file if you wish to turn this off)
To further limit exposure, we recommend taking care to ingest only emails and documents that have already been sent externally. Since the project is open source, you can fully audit the code before use.
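As a sketch of what redact-before-send can look like: the patterns and placeholder format below are illustrative assumptions, not the project's actual redaction rules, which are controlled via the config file.

```python
import re

# Illustrative redaction patterns; the real rules are configurable.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d \-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace confidential tokens with placeholders before any LLM call."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text
```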
add text ..
add text ..
The main configuration file is located at app/config/config.conf. This file controls various aspects of the application, including:
- Data source locations
- LLM selection (local or remote)
- Redaction settings
- API keys
Refer to the comments within the app/config/config.conf file for detailed explanations of each setting.
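For orientation, a config file covering the aspects listed above might look like the fragment below. The section and key names here are hypothetical; consult the comments in app/config/config.conf for the real setting names.

```ini
# Illustrative fragment only - see app/config/config.conf for actual keys
[data]
client_docs_dir = ..
exclude_folders = z_scripts, output

[llm]
provider = local        ; "local" (Llama) or a remote provider
redact_before_send = true

[api]
openai_key = <set on first run>
```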
__ todo - implement this __
- The main configuration file is in app/config/config.conf. This config file is shared by the ingest script, the Bot and the Application. Please edit it using the notes in the app/config folder.
- Some APIs (Copilot, OpenAI, Teamworks helpdesk) require tokens the first time they are run. Please consult the documentation of these tools to retrieve them.
- The script will ask you for these tokens and store them locally in a plain-text JSON file (token-storage-local.json). While it is excluded from storage in git, you may wish to review who has access to it locally, as it will store sensitive information.
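A minimal sketch of how a script might load and store these tokens, assuming a flat service-to-token layout in token-storage-local.json (the actual file layout and key names may differ):

```python
import json
import os

TOKEN_FILE = "token-storage-local.json"

def load_tokens(path=TOKEN_FILE):
    """Return the stored tokens, or an empty dict if none exist yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_token(service, token, path=TOKEN_FILE):
    """Store a token for a service, preserving tokens already on disk."""
    tokens = load_tokens(path)
    tokens[service] = token
    with open(path, "w") as f:
        json.dump(tokens, f, indent=2)
```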
To set up the project on your local machine, then run it for the first time:

- Checkout / download the project as a folder onto the host computer from the source __ todo update link __ https://github.com/paulbrowne-irl/smart-document-analysis
- Install Python (3.12 or higher) in the usual way. The Python pip and virtualenv tools are also needed.
- Install the Python dependencies - in a terminal window, at the project root:
  - Create a virtual environment: virtualenv venv
  - Activate the virtual environment: source venv/bin/activate
  - Install the Python dependencies for this environment: pip install -r requirements.txt
add text ..
add text ..
Contributions to this project are welcome! Please see the CONTRIBUTING.md file for guidelines on how to contribute.