chandu-atina/DataMining

DataMining
=============

This project runs in several phases, carried out in the following sequence.

  1. Web Crawling

Crawl web forums to collect the relevant data and store it in flat files.

  1. Pre-Processing the data

Data pre-processing cleans up noisy and malformed data and organizes it into a format suitable for PoS tagging.
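As a rough illustration of this step, the sketch below cleans one crawled forum post. It is written in Python purely for illustration, and it assumes the noise is mostly leftover HTML tags, HTML entities, and irregular whitespace; the README does not list the actual cleaning rules, and `clean_post` is a hypothetical name.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, and newlines


def clean_post(raw: str) -> str:
    """Strip markup and normalize whitespace in one crawled post (assumed rules)."""
    text = TAG_RE.sub(" ", raw)            # replace tags with a space
    text = html.unescape(text)             # decode entities such as &amp;
    text = WHITESPACE_RE.sub(" ", text)    # collapse whitespace runs
    return text.strip()


print(clean_post("<p>Hello &amp; welcome!</p>\n\n<br/>  Bye"))
# Hello & welcome! Bye
```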

  1. PoS Tagging

Part-of-speech (PoS) tagging is applied to the processed data using standard PoS taggers such as the Stanford PoS Tagger, the OpenNLP Tagger, and LTAG-Spinal.

  1. Stop-Word Removal

Stop-word removal drops words that carry little meaning on their own, such as very common words.
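A minimal sketch of stop-word removal in illustrative Python. The stop-word list here is a tiny hypothetical sample, not the list the project uses, and matching is assumed to be case-insensitive.

```python
# Hypothetical, deliberately tiny stop-word list; a real run would load a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def remove_stop_words(tokens):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words(["The", "crawler", "is", "fetching", "the", "forum"]))
# ['crawler', 'fetching', 'forum']
```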

  1. Stemming & Lemmatization

Stemming reduces related word forms to a single base form. For example, "running" and "run" can both be stemmed to "run". Lemmatisation (or lemmatization) is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, "running", "ran", and "run" can all be lemmatized to "run".
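The difference between the two can be seen in a toy Python sketch. This is not a real stemmer or lemmatizer: the suffix list and the irregular-form table are hypothetical, and the doubled-consonant rule will mangle some words (for instance, those that legitimately end in a double letter). It only shows that suffix stripping handles "running" but needs a lookup to handle "ran".

```python
SUFFIXES = ("ing", "ed", "s")    # toy suffix list, checked in this order
IRREGULAR = {"ran": "run"}       # hypothetical lookup table for irregular forms


def stem(word: str) -> str:
    """Strip one known suffix, then undo a doubled final consonant ('runn' -> 'run')."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            base = word[: -len(suffix)]
            if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeiou":
                base = base[:-1]
            return base
    return word


def lemmatize(word: str) -> str:
    """Use the lookup table for irregular forms, otherwise fall back to the stemmer."""
    return IRREGULAR.get(word, stem(word))


print([stem(w) for w in ["running", "ran", "run"]])       # ['run', 'ran', 'run']
print([lemmatize(w) for w in ["running", "ran", "run"]])  # ['run', 'run', 'run']
```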

  1. Pruning

Low-frequency words are removed from the word list.
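A minimal Python sketch of pruning, assuming "low frequency" means a total count across all documents below some cutoff. The README does not state the cutoff or how frequency is counted, so `min_count` and its default are assumptions.

```python
from collections import Counter


def prune(documents, min_count=2):
    """Drop every term whose total count across all documents is below min_count."""
    counts = Counter(term for doc in documents for term in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in documents]


docs = [["forum", "post", "crawler"], ["forum", "thread"], ["post", "forum"]]
print(prune(docs))
# [['forum', 'post'], ['forum'], ['post', 'forum']]
```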

  1. Weighting

Each term in a document is weighted by its "tfidf" score, the product of term frequency and inverse document frequency: $\mathrm{tfidf}_t = \mathrm{tf} \cdot (\log_2 n - \log_2 \mathrm{df}_t + 1)$

  tf   - term frequency
  df_t - the number of documents in which term 't' appears
  n    - number of documents
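The weighting formula above can be sketched in Python as follows. It assumes that term frequency is the raw count of the term in the document, which the README does not specify; the function name `tfidf` and the dict-per-document output are illustrative choices.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using
    weight = tf * (log2(n) - log2(df_t) + 1)."""
    n = len(documents)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw counts (assumed definition of tf)
        weights.append(
            {t: tf[t] * (math.log2(n) - math.log2(df[t]) + 1) for t in tf}
        )
    return weights


docs = [["forum", "forum", "post"], ["forum", "thread"]]
print(tfidf(docs))
# [{'forum': 2.0, 'post': 2.0}, {'forum': 1.0, 'thread': 2.0}]
```

With two documents, "forum" appears in both, so its inverse-document-frequency factor is 1; "post" and "thread" each appear in one document, so their factor is 2.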
  1. Cosine Similarity

The cosine similarity between two document vectors is $s(d_i, d_j) = \cos(\angle(d_i, d_j)) = \dfrac{d_i \cdot d_j}{|d_i| \, |d_j|}$

  Cosine Similarity(Doc1,Doc2) = Dot product(Doc1,Doc2) / (||Doc1|| * ||Doc2||)
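A short Python sketch of the formula above, assuming each document vector is stored sparsely as a `{term: weight}` dict. Returning 0.0 for an empty vector is a convention chosen here, not something the README specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


doc1 = {"forum": 2.0, "post": 2.0}
doc2 = {"forum": 1.0, "thread": 2.0}
print(round(cosine_similarity(doc1, doc2), 4))  # 0.3162
```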
  1. Clustering

Apply a clustering algorithm to group the documents into clusters.
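The README does not name the clustering algorithm, so the Python sketch below uses single-pass "leader" clustering as a stand-in, only to show how cosine similarity can drive the grouping. The function names and the `threshold` value are assumptions, not the project's settings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors (0.0 if either is empty)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def leader_cluster(vectors, threshold=0.5):
    """Each vector joins the first cluster whose leader (its first member) is at
    least `threshold` similar; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `vectors`."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no leader was similar enough
    return clusters


vectors = [
    {"forum": 1.0, "post": 1.0},
    {"forum": 1.0, "post": 2.0},
    {"java": 1.0, "maven": 1.0},
]
print(leader_cluster(vectors))  # [[0, 1], [2]]
```

The result of this stand-in depends on input order and on the threshold; an algorithm such as k-means or hierarchical clustering would behave differently.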

Note

You also need to check out the web-crawler project in order to work with DataMining, because the DataMining project depends internally on the web-crawler project.

Build the web-crawler project first, then build the DataMining project.
