A Python-based implementation designed to calculate semantic similarity between textual documents using Vector Space Modeling and Cosine Similarity.
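To make the approach concrete, here is a minimal sketch of cosine similarity over bag-of-words term-count vectors. The function names `term_vector` and `cosine_similarity` and the whitespace tokenization are assumptions made for illustration. They are not necessarily what `similarity.py` does, and the project may use a different weighting such as TF-IDF.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to a sparse term-count vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    # An empty document has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "the cat sat on the mat"
doc_b = "the cat lay on the rug"
score = cosine_similarity(term_vector(doc_a), term_vector(doc_b))
print(f"{score:.3f}")  # 0.750
```

The two sample documents share "the" (twice each), "cat", and "on", giving a dot product of 6. Each vector has norm sqrt(8), so the score is 6 / 8 = 0.75. Identical documents score 1.0 and documents with no shared terms score 0.0.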
The repository is organized to demonstrate both the initial logic and the improved modular version:
- main.py: The primary entry point for executing the analysis.
- preprocess.py: Dedicated module for text normalization and tokenization.
- similarity.py: Core logic for vector generation and cosine calculations.
- file_loader.py: Utility module for handling document input operations.
- utils.py: General-purpose helper functions.
- old_version.py: The initial single-script implementation (Version 1.0).
- Language: Python 3.x
- Methodology: Vector Space Modeling (VSM)
- Metric: Cosine Similarity
- Domain: Natural Language Processing (NLP)
- Modular Architecture: Code is decoupled into specific modules for better readability and maintenance.
- Preprocessing Pipeline: Normalizes and tokenizes raw text before vectorization, so that similarity scores reflect content rather than differences in formatting.
- Clean Code: Follows structured programming principles to make the logic easy to follow.
- Clone the repository to your local environment.
- Run the program using:
python main.py
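
As an illustration of the kind of normalization and tokenization a module like `preprocess.py` might perform, here is a minimal sketch. The function name `preprocess`, the stop-word list, and the punctuation regex are all assumptions and are not taken from the repository.

```python
import re

# Hypothetical stop-word list for illustration; the project's own list may differ.
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "to"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stop words."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


tokens = preprocess("The Cat sat on the mat, and purred!")
print(tokens)  # ['cat', 'sat', 'mat', 'purred']
```

Applying the same preprocessing to every document before building vectors keeps case and punctuation from lowering the similarity of otherwise matching text.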