The Mini NLP Toolkit is a lightweight package designed for basic natural language processing (NLP) tasks. It provides tools for tokenization, normalization, stopword removal, stemming, lemmatization, and file preprocessing. It also supports processing files in several formats, including `.txt`, `.csv`, `.json`, `.docx`, and `.pdf`.
Directory structure:

```
└── CodeHive-by-Jay-Mini-NLP-Tool-for-CodeHive/
    ├── README.md
    ├── LICENSE
    ├── requirements.txt
    ├── setup.py
    ├── tests/
    │   └── test_preprocess.py
    └── toolkit/
        ├── __init__.py
        ├── preprocess.py
        └── utils.py
```
- `toolkit/`: Contains the core functionality of the toolkit.
- `README.md`: Provides an overview of the package.
- `requirements.txt`: Lists the dependencies required for the project.
- `setup.py`: Contains metadata for package distribution.
- `tests/`: Contains unit tests for validating the functionality.
The `MiniNLPToolkit` class, defined in `preprocess.py`, exposes the following methods (a usage sketch follows the list):

- `__init__(self)`
  - Initializes the NLP toolkit by loading NLTK resources (stopwords, stemmer, lemmatizer).
- `tokenize(self, text)`
  - Splits text into sentences and words.
  - Input: Raw text (string).
  - Output: Tuple of sentences (list) and words (list).
- `normalize(self, text)`
  - Converts text to lowercase and removes punctuation.
  - Input: Raw text (string).
  - Output: Normalized text (string).
- `remove_stopwords(self, words)`
  - Removes common stopwords from a list of words.
  - Input: List of words.
  - Output: List of filtered words.
- `stem(self, words)`
  - Applies stemming to reduce words to their root forms.
  - Input: List of words.
  - Output: List of stemmed words.
- `lemmatize(self, words)`
  - Applies lemmatization for better semantic understanding.
  - Input: List of words.
  - Output: List of lemmatized words.
- `preprocess(self, text)`
  - Combines tokenization, normalization, stopword removal, and lemmatization into a single pipeline.
  - Input: Raw text (string).
  - Output: Processed text (string).
- `read_file(self, file_path)`
  - Reads and extracts content from supported file formats.
  - Input: File path (string).
  - Output: File content (string).
- `process_file(self, input_folder, output_folder, file_name)`
  - Processes a file, applies preprocessing, and saves the result to an output folder.
  - Input: Input folder path (string), output folder path (string), file name (string).
  - Output: Preprocessed file saved to `output_folder`.
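Each method can also be called on its own. The snippet below is a minimal sketch based on the signatures above; it assumes the class is importable from the `toolkit` package (as in the demo that follows) and that the required NLTK data is installed.

```python
from toolkit import MiniNLPToolkit

toolkit = MiniNLPToolkit()
text = "This is a sample text file. It contains several sentences, words, and punctuation."

# Split into sentences and words
sentences, words = toolkit.tokenize(text)

# Lowercase and strip punctuation
normalized = toolkit.normalize(text)

# Drop stopwords, then reduce the remaining words to base forms
filtered = toolkit.remove_stopwords(words)
stemmed = toolkit.stem(filtered)
lemmatized = toolkit.lemmatize(filtered)

# Or run the full pipeline in one call
print(toolkit.preprocess(text))
```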
The following script demonstrates how to use the Mini NLP Toolkit to preprocess text files.
```python
from toolkit import MiniNLPToolkit

toolkit = MiniNLPToolkit()

# Define input and output folders
input_folder = "data/sample_texts"
output_folder = "data/output"

# Specify the file to process
file_name = "sample1.txt"

# Process the file
toolkit.process_file(input_folder, output_folder, file_name)
```
Sample input (`data/sample_texts/sample1.txt`):

```
This is a sample text file. It contains several sentences, words, and punctuation.
```

Sample output (`data/output/processed_sample1.txt`):

```
sample text file contain several sentence word punctuation
```
The `TextUtils` class, defined in `utils.py`, provides additional helper functions:

- `expand_contractions(text, contractions_dict)`
  - Expands contractions in text based on the provided dictionary.
  - Input: Text (string), contractions dictionary (dict).
  - Output: Text with expanded contractions (string).
- `clean_text(text)`
  - Cleans text by removing numbers and special characters.
  - Input: Text (string).
  - Output: Cleaned text (string).
- `read_file(file_path)`
  - Reads a text file and returns its content.
  - Input: File path (string).
  - Output: File content (string).
- `write_file(file_path, content)`
  - Writes content to a text file.
  - Input: File path (string), content (string).
  - Output: None.
The following script demonstrates how to combine the `MiniNLPToolkit` preprocessing pipeline with the `TextUtils` helpers.

```python
from toolkit import MiniNLPToolkit, TextUtils

toolkit = MiniNLPToolkit()
text_utils = TextUtils()

# Define input and output folders
input_folder = "data/sample_texts"
output_folder = "data/output"

# Specify the file to process
file_name = "sample1.txt"

# Process the file
toolkit.process_file(input_folder, output_folder, file_name)

# Example usage of TextUtils
text = "I'm learning NLP. It's fun!"
contractions_dict = {"i'm": "I am", "it's": "it is"}
expanded_text = text_utils.expand_contractions(text, contractions_dict)
print("Expanded Text:", expanded_text)

cleaned_text = text_utils.clean_text(expanded_text)
print("Cleaned Text:", cleaned_text)
```
Sample input (`data/sample_texts/sample1.txt`):

```
This is a sample text file. It contains several sentences, words, and punctuation.
```

Sample output (`data/output/processed_sample1.txt`):

```
sample text file contain several sentence word punctuation
```
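The `read_file` and `write_file` helpers are not exercised above. The following sketch shows a plain-text round trip with them; the output file name is hypothetical.

```python
from toolkit import TextUtils

text_utils = TextUtils()

# Read a plain-text file, clean it, and write the result back out.
# The paths follow the data/ layout used in the demos above.
content = text_utils.read_file("data/sample_texts/sample1.txt")
cleaned = text_utils.clean_text(content)
text_utils.write_file("data/output/cleaned_sample1.txt", cleaned)
```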
To set up the project locally:

- Clone the repository:

  ```bash
  git clone https://github.com/CodeHive-by-Jay/Mini-NLP-Tool-for-CodeHive
  ```

- Navigate to the project directory:

  ```bash
  cd Mini-NLP-Tool-for-CodeHive
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
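The toolkit's initializer loads NLTK stopwords, a stemmer, and a lemmatizer, so the corresponding NLTK data must be available. Whether the package downloads it automatically is not documented here; if initialization fails, the resources can be fetched manually with the standard NLTK calls below (the resource names are the usual ones for these features, not confirmed from this project's source).

```python
import nltk

# Corpora/models backing the toolkit's features: stopword removal,
# sentence/word tokenization, and WordNet-based lemmatization.
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
```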
Supported file formats:

- `.txt`: Plain text files.
- `.csv`: Comma-separated values files.
- `.json`: JSON files.
- `.docx`: Word documents.
- `.pdf`: PDF files.
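Since `read_file` accepts any of these formats, the same call presumably selects the right parser from the file extension. A minimal sketch, with illustrative file names:

```python
from toolkit import MiniNLPToolkit

toolkit = MiniNLPToolkit()

# read_file() extracts text from any supported format; extraction fidelity
# depends on the underlying parsing libraries. File names are hypothetical.
for name in ["notes.txt", "records.csv", "config.json", "report.docx", "paper.pdf"]:
    content = toolkit.read_file(f"data/sample_texts/{name}")
    print(name, "->", len(content), "characters extracted")
```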
Planned enhancements:

- Add support for additional file formats (e.g., `.xml`).
- Implement advanced NLP features such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
- Provide a web interface for easier access.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch.
- Commit your changes.
- Submit a pull request.
This project is licensed under the MIT License.
For questions or suggestions, please contact the developer: