Jhex-AI/Document-Similarity-Ranking-Enhanced

Document Similarity using Word2Vec

Calculate similarity scores between documents using a pre-trained word2vec model.
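The DocSim source is not reproduced in this README, so the following is only a minimal sketch of one common way to score documents with word vectors: average the vectors of the words in each document, then take the cosine similarity of the averages. The toy vectors and the helper names `doc_vector` and `cosine_similarity` are assumptions made for illustration, and plain Python is used instead of numpy so the sketch runs without any dependencies.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
# A real word2vec model would supply 300-dimensional vectors instead.
TOY_VECTORS = {
    "delete": [0.9, 0.1, 0.0],
    "remove": [0.8, 0.2, 0.1],
    "invoice": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.1, 0.9],
}


def doc_vector(doc, vectors):
    """Average the vectors of the in-vocabulary words of a document."""
    words = [w for w in doc.lower().split() if w in vectors]
    if not words:
        return None  # no known words, so no vector can be built
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


source = doc_vector("delete invoice", TOY_VECTORS)
close = cosine_similarity(source, doc_vector("remove the invoice", TOY_VECTORS))
far = cosine_similarity(source, doc_vector("weather", TOY_VECTORS))
print(close > far)
```

Words missing from the vocabulary ("the" in this toy example) are simply skipped, which is one reasonable choice among several.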

Usage

  • Load a pre-trained word2vec model. Note: If you don't have one, you can use Google's pre-trained word2vec model.

    from gensim.models.keyedvectors import KeyedVectors
    from gensim.models.doc2vec import TaggedDocument
    model_path = './data/GoogleNews-vectors-negative300.bin'
    w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
  • Once the model is loaded, pass it to the DocSim class to calculate document similarities.

    from DocSim import DocSim
    ds = DocSim(w2v_model)
  • Calculate the similarity scores between a source document and a list of target documents.

    source_doc = 'how to delete an invoice'
    target_docs = ['delete a invoice', 'how do i remove an invoice', 'purge an invoice']
    
    # Returns every target doc together with its similarity score against the source doc
    sim_scores = ds.calculate_similarity(source_doc, target_docs)
    
    print(sim_scores)
  • The output is as follows:

      [ {'score': 0.99999994, 'tag': ['0'], 'doc': 'delete a invoice'},
      {'score': 0.79869318, 'tag': ['1'], 'doc': 'how do i remove an invoice'},
      {'score': 0.71488398, 'tag': ['2'], 'doc': 'purge an invoice'} ]
  • When the documents contain a lot of text, the tag attribute is useful for identifying them without printing the full text. To return only the score and tag, remove the "doc" entry from the following line in the calculate_similarity function:

    results.append({"score": sim_score,"tag":tagged_data[i].tags,"doc":tagged_data[i].words})

  • The output is then as follows:

      [ {'score': 0.99999994, 'tag':['0']}, 
      {'score': 0.79869318, 'tag':['1']}, 
      {'score': 0.71488398, 'tag':['2']}]
  • Note: You can optionally pass a topn argument to the calculate_similarity() method to return only the n highest-scoring target documents.

    sim_scores = ds.calculate_similarity(source_doc, target_docs, topn=5)
  • Note: You can optionally pass a threshold argument to the calculate_similarity() method to return only the target documents with a similarity score above the threshold.

    sim_scores = ds.calculate_similarity(source_doc, target_docs, threshold=0.7)
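To make the effect of topn, threshold, and tags concrete, here is a small sketch that post-processes a result list shaped like the output above. It is not the DocSim implementation: `filter_results` and `attach_docs` are hypothetical helpers, the strict "above the threshold" comparison is an assumption, and tags are assumed to be string indices into target_docs, as the sample output suggests.

```python
def filter_results(results, topn=None, threshold=0.0):
    """Keep results scoring above threshold, best first, at most topn of them."""
    kept = [r for r in results if r["score"] > threshold]  # strict ">" is an assumption
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept if topn is None else kept[:topn]


def attach_docs(results, target_docs):
    """Resolve tag-only results back to their text.

    Assumes each tag is the string index of the document in target_docs.
    """
    return [dict(r, doc=target_docs[int(r["tag"][0])]) for r in results]


target_docs = ['delete a invoice', 'how do i remove an invoice', 'purge an invoice']

# Tag-only results, copied from the sample output above.
results = [
    {'score': 0.99999994, 'tag': ['0']},
    {'score': 0.79869318, 'tag': ['1']},
    {'score': 0.71488398, 'tag': ['2']},
]

best_two = filter_results(results, threshold=0.75)
top_one = filter_results(results, topn=1)
with_docs = attach_docs(results, target_docs)
print(with_docs[2]['doc'])
```

Keeping only tags in the results and resolving them on demand avoids copying long document texts into every result list.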

Requirements

  • Python 3 only
  • gensim: to load the word2vec model and build tagged documents
  • numpy: to calculate similarity scores

License

The MIT License
