This project demonstrates how to process and import a large dataset of Excel files into MongoDB so that data for the Sikorsky Archives can be easily searched, updated, inserted, and retrieved.
- Project Overview
- Setup Instructions
- Script Details
- Data Validation
- Usage
- File Link Update Script
- PDF Processing Scripts
- Contributing
- License
This project processes a dataset of approximately 40 GB, stored in various Excel files, and imports it into MongoDB. Each main folder in the dataset corresponds to a collection in MongoDB. Subfolders within these main folders are skipped.
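For orientation, the sketch below shows one way the top-level folders could be mapped to collection names while nested subfolders are ignored; the root path and variable names are illustrative, not the script's actual identifiers.

```python
import os

root_directory = "/path/to/root_directory"

for entry in os.scandir(root_directory):
    if entry.is_dir():
        collection_name = entry.name  # each main folder becomes a MongoDB collection
        # Only Excel files directly inside the main folder are picked up;
        # nested subfolders are intentionally not traversed.
        excel_files = [name for name in os.listdir(entry.path)
                       if name.endswith((".xls", ".xlsx"))]
        print(collection_name, excel_files)
```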
- Python 3.6+
- MongoDB Community Edition
- pandas library
- pymongo library
- openpyxl and xlrd libraries for handling Excel files
- Clone the Repository:
git clone https://github.com/yourusername/MongoDB-Sikorsky-Archives-Demo.git
cd MongoDB-Sikorsky-Archives-Demo
- Install the Required Python Libraries:
pip install -r requirements.txt
- Set Up MongoDB:
- Ensure MongoDB is installed and running on your local machine.
- The default connection string is mongodb://localhost:27017. Update this in the script if necessary.
Create a file named requirements.txt with the following content:
pymongo
pandas
openpyxl
xlrd
The script setup_database.py processes and imports Excel files into MongoDB, handling both the .xls and .xlsx file formats. Its main functions are listed below, followed by an illustrative sketch of a few helpers.
- normalize_date(date_str): Normalizes date formats to month/day/year.
- process_aircraft_model(model_str): Splits aircraft model strings into lists.
- preprocess_comments(comments): Strips excess whitespace from comments.
- handle_nan_values(row): Replaces NaN values with None.
- validate_data(row): Validates and cleans each data row.
- process_excel_file(file_path, collection_name): Reads and processes an Excel file, then inserts data into MongoDB.
- main(root_directory): Traverses the root directory, processing each main folder and its Excel files.
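The exact internals of these helpers are not shown here; the following is a minimal sketch of how a few of them could be implemented, assuming simple date formats and a comma-delimited aircraft model field.

```python
import math
from datetime import datetime

def normalize_date(date_str):
    """Best-effort conversion of a date string to month/day/year."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"):
        try:
            return datetime.strptime(str(date_str), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return date_str  # leave unrecognized formats unchanged

def process_aircraft_model(model_str):
    """Split a string such as "S-58, S-61" into ["S-58", "S-61"]."""
    return [m.strip() for m in str(model_str).split(",") if m.strip()]

def preprocess_comments(comments):
    """Collapse runs of whitespace and trim the ends of a comment."""
    return " ".join(str(comments).split())

def handle_nan_values(row):
    """Replace NaN values with None so the document is valid for MongoDB."""
    return {key: (None if isinstance(value, float) and math.isnan(value) else value)
            for key, value in row.items()}
```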
- Connect to MongoDB:
- The script establishes a connection to the local MongoDB instance using MongoClient from the pymongo library (a condensed sketch of the full flow follows this list).
- Normalize Data:
- Date Normalization: The normalize_date function converts dates to the month/day/year format.
- NaN Handling: The handle_nan_values function replaces NaN values with None to ensure MongoDB compatibility.
- Aircraft Model Processing: The process_aircraft_model function splits strings with multiple aircraft models into lists.
- Comment Preprocessing: The preprocess_comments function strips excess whitespace from comments.
- Process Excel Files:
- Read Excel Files: The script uses pandas to read .xls and .xlsx files.
- Validate Data: The validate_data function is used to ensure data integrity and consistency for each row.
- Prepare Documents: The script constructs MongoDB documents from the cleaned and validated data.
- Insert Data:
- The process_excel_file function inserts the prepared documents into the corresponding MongoDB collection.
- Each main folder in the dataset corresponds to a separate MongoDB collection.
- Logging:
- The script provides detailed logs for each file and directory processed, indicating success or any errors encountered.
- It prints messages for successful data insertion and any issues that arise during processing.
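Putting these steps together, a condensed sketch of the flow might look like the following; the database name, validation details, and log messages are assumptions rather than the script's exact code.

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["sikorsky_archives"]  # database name is illustrative

def validate_data(row):
    # Minimal stand-in: replace NaN with None; the real script also normalizes
    # dates, splits aircraft model lists, and trims comments.
    return {key: (None if pd.isna(value) else value) for key, value in row.items()}

def process_excel_file(file_path, collection_name):
    df = pd.read_excel(file_path)  # pandas selects openpyxl or xlrd as needed
    documents = [validate_data(row.to_dict()) for _, row in df.iterrows()]
    if documents:
        db[collection_name].insert_many(documents)
        print(f"Inserted {len(documents)} documents into '{collection_name}'")
```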
The script includes data validation to ensure consistency (a worked example follows this list):
- Dates are normalized to month/day/year.
- NaN values are replaced with None.
- Aircraft models with multiple values are split into lists.
- Comments are preprocessed to remove excess whitespace.
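As a concrete illustration of these rules (with example field names, not the archive's actual columns), a raw spreadsheet row and its cleaned counterpart might look like this:

```python
raw_row = {
    "Date": "1957-10-08",
    "Aircraft Model": "S-58, S-61",
    "Comments": "  first flight  ",
    "Pilot": float("nan"),
}

cleaned_row = {
    "Date": "10/08/1957",                # normalized to month/day/year
    "Aircraft Model": ["S-58", "S-61"],  # multiple models split into a list
    "Comments": "first flight",          # excess whitespace removed
    "Pilot": None,                       # NaN replaced with None
}
```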
- Ensure MongoDB is running:
mongod
- Run the Script:
python setup_database.py /path/to/root_directory
- Script Output: The script logs each directory and file it processes. It prints messages indicating successful insertion of data or any errors encountered.
The filelink.py script updates the "fileLink" field in all collections to include the actual file path of the Excel files.
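The script itself is not reproduced here; a hedged sketch of how such an update could be written with pymongo is shown below, assuming the Excel files live in folders named after their collections and that only a file name is currently stored in fileLink.

```python
import os
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["sikorsky_archives"]       # database name is illustrative
root_directory = "/path/to/root_directory"

for collection_name in db.list_collection_names():
    folder = os.path.join(root_directory, collection_name)
    for doc in db[collection_name].find({}, {"fileLink": 1}):
        file_name = os.path.basename(str(doc.get("fileLink", "")))
        full_path = os.path.join(folder, file_name)
        db[collection_name].update_one(
            {"_id": doc["_id"]},
            {"$set": {"fileLink": full_path}},
        )
```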
This script takes the PDFs from a collection, extracts the text, and uploads the extracted text as a new field in the collection.
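A minimal sketch of that extraction step is shown below; the pypdf library, the collection name, and the extractedText field name are assumptions, not the script's actual choices.

```python
from pypdf import PdfReader      # PDF library choice is an assumption
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["sikorsky_archives"]["pdf_documents"]  # names are illustrative

for doc in collection.find({"fileLink": {"$regex": r"\.pdf$", "$options": "i"}}):
    reader = PdfReader(doc["fileLink"])
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"extractedText": text}},  # field name is an assumption
    )
```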
This script allows you to enter keywords or sentences that you want to find from the extracted text.
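Such a search can be expressed as a case-insensitive regular-expression query against the extracted-text field; the collection and field names below are the same assumptions as above.

```python
import re
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["sikorsky_archives"]["pdf_documents"]  # names are illustrative

keyword = input("Enter a keyword or sentence to search for: ")
pattern = re.escape(keyword)  # treat the input as literal text, not a regex

for doc in collection.find({"extractedText": {"$regex": pattern, "$options": "i"}}):
    print(doc.get("fileLink"))
```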
This script takes a keyword and searches the PDFs directly, without extracting the text first. Both search options are possible because the file path of each PDF is stored in the database in the fileLink field.
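A direct search along those lines could open each PDF at its stored fileLink path and scan the pages in memory; again, the library and names are assumptions.

```python
from pypdf import PdfReader      # PDF library choice is an assumption
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["sikorsky_archives"]["pdf_documents"]  # names are illustrative

keyword = "first flight"  # example search term

for doc in collection.find({"fileLink": {"$regex": r"\.pdf$", "$options": "i"}}):
    reader = PdfReader(doc["fileLink"])
    if any(keyword.lower() in (page.extract_text() or "").lower()
           for page in reader.pages):
        print("Match:", doc["fileLink"])
```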
Please follow these steps to contribute:
- Fork the Repository:
- Click the "Fork" button at the top right of the repository page.
- Clone Your Fork:
git clone https://github.com/yourusername/MongoDB-Sikorsky-Archives-Demo.git
cd MongoDB-Sikorsky-Archives-Demo
- Create a New Branch:
git checkout -b feature/your-feature-name
- Make Your Changes and Commit:
git add .
git commit -m "Add your commit message here"
- Push to Your Fork and Submit a Pull Request:
git push origin feature/your-feature-name
This project is licensed under the MIT License. See the LICENSE file for details.