The Mini NLP Toolkit is a lightweight package designed for basic natural language processing (NLP) tasks. It provides tools for tokenization, normalization, stopword removal, stemming, lemmatization, and file preprocessing. It also supports processing files in several formats, including `.txt`, `.csv`, `.json`, `.docx`, and `.pdf`.
Directory structure:

```
└── CodeHive-by-Jay-Mini-NLP-Tool-for-CodeHive/
    ├── README.md
    ├── LICENSE
    ├── requirements.txt
    ├── setup.py
    ├── tests/
    │   └── test_preprocess.py
    └── toolkit/
        ├── __init__.py
        ├── preprocess.py
        └── utils.py
```
- `toolkit/`: Contains the core functionality of the toolkit.
- `README.md`: Provides an overview of the package.
- `requirements.txt`: Lists the dependencies required for the project.
- `setup.py`: Contains metadata for package distribution.
- `tests/`: Contains unit tests for validating the functionality.
The `MiniNLPToolkit` class, defined in `preprocess.py`, exposes the following methods (a usage sketch follows the list):

- `__init__(self)`
  - Initializes the NLP toolkit by loading NLTK resources (stopwords, stemmer, lemmatizer).
- `tokenize(self, text)`
  - Splits text into sentences and words.
  - Input: Raw text (string).
  - Output: Tuple of sentences (list) and words (list).
- `normalize(self, text)`
  - Converts text to lowercase and removes punctuation.
  - Input: Raw text (string).
  - Output: Normalized text (string).
- `remove_stopwords(self, words)`
  - Removes common stopwords from a list of words.
  - Input: List of words.
  - Output: List of filtered words.
- `stem(self, words)`
  - Applies stemming to reduce words to their root forms.
  - Input: List of words.
  - Output: List of stemmed words.
- `lemmatize(self, words)`
  - Applies lemmatization for better semantic understanding.
  - Input: List of words.
  - Output: List of lemmatized words.
- `preprocess(self, text)`
  - Combines tokenization, normalization, stopword removal, and lemmatization into a single pipeline.
  - Input: Raw text (string).
  - Output: Processed text (string).
- `read_file(self, file_path)`
  - Reads and extracts content from supported file formats.
  - Input: File path (string).
  - Output: File content (string).
- `process_file(self, input_folder, output_folder, file_name)`
  - Processes a file, applies preprocessing, and saves the result to an output folder.
  - Input: Input folder path (string), output folder path (string), file name (string).
  - Output: Preprocessed file saved to `output_folder`.
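Each method can also be called on its own. The snippet below is a minimal sketch based on the signatures above; it assumes the class is importable from the `toolkit` package (as in the demo that follows) and that the required NLTK data is installed.

```python
from toolkit import MiniNLPToolkit

toolkit = MiniNLPToolkit()
text = "This is a sample text file. It contains several sentences, words, and punctuation."

# Split into sentences and words
sentences, words = toolkit.tokenize(text)

# Lowercase and strip punctuation
normalized = toolkit.normalize(text)

# Drop stopwords, then reduce the remaining words to base forms
filtered = toolkit.remove_stopwords(words)
stemmed = toolkit.stem(filtered)
lemmatized = toolkit.lemmatize(filtered)

# Or run the full pipeline in one call
print(toolkit.preprocess(text))
```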
The following script demonstrates how to use the Mini NLP Toolkit to preprocess text files.
```python
from toolkit import MiniNLPToolkit

toolkit = MiniNLPToolkit()

# Define input and output folders
input_folder = "data/sample_texts"
output_folder = "data/output"

# Specify the file to process
file_name = "sample1.txt"

# Process the file
toolkit.process_file(input_folder, output_folder, file_name)
```
Sample input (`data/sample_texts/sample1.txt`):

```
This is a sample text file. It contains several sentences, words, and punctuation.
```

Sample output (`data/output/processed_sample1.txt`):

```
sample text file contain several sentence word punctuation
```
The `TextUtils` class, defined in `utils.py`, provides additional helper functions:

- `expand_contractions(text, contractions_dict)`
  - Expands contractions in text based on the provided dictionary.
  - Input: Text (string), contractions dictionary (dict).
  - Output: Text with expanded contractions (string).
- `clean_text(text)`
  - Cleans text by removing numbers and special characters.
  - Input: Text (string).
  - Output: Cleaned text (string).
- `read_file(file_path)`
  - Reads a text file and returns its content.
  - Input: File path (string).
  - Output: File content (string).
- `write_file(file_path, content)`
  - Writes content to a text file.
  - Input: File path (string), content (string).
  - Output: None.
The following script demonstrates how to combine the `MiniNLPToolkit` preprocessing pipeline with the `TextUtils` helpers.

```python
from toolkit import MiniNLPToolkit, TextUtils

toolkit = MiniNLPToolkit()
text_utils = TextUtils()

# Define input and output folders
input_folder = "data/sample_texts"
output_folder = "data/output"

# Specify the file to process
file_name = "sample1.txt"

# Process the file
toolkit.process_file(input_folder, output_folder, file_name)

# Example usage of TextUtils
text = "I'm learning NLP. It's fun!"
contractions_dict = {"i'm": "I am", "it's": "it is"}
expanded_text = text_utils.expand_contractions(text, contractions_dict)
print("Expanded Text:", expanded_text)

cleaned_text = text_utils.clean_text(expanded_text)
print("Cleaned Text:", cleaned_text)
```
Sample input (`data/sample_texts/sample1.txt`):

```
This is a sample text file. It contains several sentences, words, and punctuation.
```

Sample output (`data/output/processed_sample1.txt`):

```
sample text file contain several sentence word punctuation
```
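The `read_file` and `write_file` helpers are not exercised above. The following sketch shows a plain-text round trip with them; the output file name is hypothetical.

```python
from toolkit import TextUtils

text_utils = TextUtils()

# Read a plain-text file, clean it, and write the result back out.
# The paths follow the data/ layout used in the demos above.
content = text_utils.read_file("data/sample_texts/sample1.txt")
cleaned = text_utils.clean_text(content)
text_utils.write_file("data/output/cleaned_sample1.txt", cleaned)
```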
To set up the project locally:

- Clone the repository:

  ```bash
  git clone https://github.com/CodeHive-by-Jay/Mini-NLP-Tool-for-CodeHive
  ```

- Navigate to the project directory:

  ```bash
  cd Mini-NLP-Tool-for-CodeHive
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
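The toolkit's initializer loads NLTK stopwords, a stemmer, and a lemmatizer, so the corresponding NLTK data must be available. Whether the package downloads it automatically is not documented here; if initialization fails, the resources can be fetched manually with the standard NLTK calls below (the resource names are the usual ones for these features, not confirmed from this project's source).

```python
import nltk

# Corpora/models backing the toolkit's features: stopword removal,
# sentence/word tokenization, and WordNet-based lemmatization.
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
```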
Supported file formats:

- `.txt`: Plain text files.
- `.csv`: Comma-separated values files.
- `.json`: JSON files.
- `.docx`: Word documents.
- `.pdf`: PDF files.
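Since `read_file` accepts any of these formats, the same call presumably selects the right parser from the file extension. A minimal sketch, with illustrative file names:

```python
from toolkit import MiniNLPToolkit

toolkit = MiniNLPToolkit()

# read_file() extracts text from any supported format; extraction fidelity
# depends on the underlying parsing libraries. File names are hypothetical.
for name in ["notes.txt", "records.csv", "config.json", "report.docx", "paper.pdf"]:
    content = toolkit.read_file(f"data/sample_texts/{name}")
    print(name, "->", len(content), "characters extracted")
```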
Planned enhancements:

- Add support for additional file formats (e.g., `.xml`).
- Implement advanced NLP features such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
- Provide a web interface for easier access.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch.
- Commit your changes.
- Submit a pull request.
This project is licensed under the MIT License.
For questions or suggestions, please contact the developer: