rasdharidisha-280306/DocSim

DocSim: Document Similarity Analysis Tool

A Python implementation that scores the similarity between text documents using Vector Space Modeling and cosine similarity.


Project Structure

The repository is organized to demonstrate both the initial logic and the improved modular version:

  • main.py: The primary entry point for executing the analysis.
  • preprocess.py: Dedicated module for text normalization and tokenization.
  • similarity.py: Core logic for vector generation and cosine calculations.
  • file_loader.py: Utility module for handling document input operations.
  • utils.py: General-purpose helper functions.
  • old_version.py: The initial single-script implementation (Version 1.0).

Technical Stack

  • Language: Python 3.x
  • Methodology: Vector Space Modeling (VSM)
  • Metric: Cosine Similarity
  • Domain: Natural Language Processing (NLP)
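
As an illustration of the metric named above, here is a minimal sketch of cosine similarity over raw term-count vectors. It is not taken from similarity.py: the function name `cosine_similarity`, the use of `collections.Counter` as the vector representation, and the sample sentences are all assumptions made for this example, and the actual module may weight terms differently.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts.

    Hypothetical helper for illustration; not the repository's own API.
    """
    vec_a = Counter(tokens_a)
    vec_b = Counter(tokens_b)

    # Only terms present in both documents contribute to the dot product.
    shared_terms = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[term] * vec_b[term] for term in shared_terms)

    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))

    # An empty document has no direction, so report a similarity of 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.75
```

With count vectors the score ranges from 0 (no shared terms) to 1 (identical term proportions), so document length alone does not change the result.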

Key Features

  • Modular Architecture: Code is decoupled into specific modules for better readability and maintenance.
  • Preprocessing Pipeline: Normalizes and tokenizes raw text before vectorization, so that differences in formatting do not distort the similarity score.
  • Clean Code: Follows structured programming principles to make the logic easy to follow.
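
The preprocessing step listed above could look roughly like the sketch below. This is an assumption about what normalization and tokenization might involve (lowercasing, stripping punctuation, splitting on whitespace, dropping a few stop words), not the contents of preprocess.py. The name `preprocess` and the `STOP_WORDS` set are hypothetical.

```python
import string

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    normalized = text.lower().translate(_PUNCT_TABLE)
    tokens = normalized.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The Cat, the HAT, and the bat!"))
# ['cat', 'hat', 'bat']
```

Normalizing first means that "Cat," and "cat" map to the same term, which is what keeps surface differences from lowering the score.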

How to Use

  1. Clone the repository to your local environment.
  2. Run the program using:
    python main.py

About

A Python project that measures similarity between two documents using vector operations and cosine similarity.
