A software designed to calculate the semantic similarity between 2 sentences

Papich23691/S.S-Similarity

Semantic-similarity

A tool for computing the semantic similarity between two sentences.
You can view this tool's accuracy in the table in the Stats section.


Table of contents

Baby Steps
Sentence Semantic Similarity
spaCy Tools
Reference and License

Requirements

This tool requires python3-dev as well as spaCy.

Installation

$ pip3 install spacy
$ python3 -m spacy download en

For the more accurate English model:
$ python3 -m spacy download en_core_web_lg

For further downloads, follow the instructions on spaCy's website.

MacOS

On macOS, use python3 to install spaCy, because the python2 that ships with
macOS is not a wide build.

Languages

For other languages, install the corresponding language model listed on spaCy's website.
Afterwards, replace the model's name in nlp_util.py (line 9).
For further info, see spaCy's website.

Compiling

The makefile is located in the src directory, so run:

$ cd src && make

The program should compile with no errors.

Usage

To use the sentence semantic similarity function, create an nlp object:

#include "nlp.h"

.
.
. 

nlp [object name];
[object name].semantic_similarity(Sentence_1,Sentence_2);

The function returns a number between 0 and 1 (the closer to 1, the more similar the sentences).

The algorithm in a nutshell

POS tagging

First, the words of both sentences are tagged with part-of-speech (POS) tags. This reduces time
complexity and makes the algorithm more accurate, because only words that have the same tag are compared.

The final result is calculated from a vector representation of each sentence, and to build that
we need a vector space basis.
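
The effect of the POS filter can be sketched as follows. This is only an illustration under stated assumptions, not the project's code: `TaggedWord`, `group_by_pos` and `comparable_pairs` are hypothetical names, and the tags are assumed to come from spaCy's tagger.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type: a word plus the POS tag assigned by the tagger.
struct TaggedWord {
    std::string text;
    std::string pos;  // e.g. "NOUN", "VERB"
};

// Group the words of one sentence by their POS tag.
std::map<std::string, std::vector<std::string>>
group_by_pos(const std::vector<TaggedWord>& sentence) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const TaggedWord& w : sentence) {
        groups[w.pos].push_back(w.text);
    }
    return groups;
}

// List every cross-sentence word pair that shares a POS tag. Only these
// pairs would need a word-similarity computation.
std::vector<std::pair<std::string, std::string>>
comparable_pairs(const std::vector<TaggedWord>& s1,
                 const std::vector<TaggedWord>& s2) {
    const auto g1 = group_by_pos(s1);
    const auto g2 = group_by_pos(s2);
    std::vector<std::pair<std::string, std::string>> pairs;
    for (const auto& entry : g1) {
        const auto it = g2.find(entry.first);
        if (it == g2.end()) continue;  // tag absent from the second sentence
        for (const std::string& a : entry.second) {
            for (const std::string& b : it->second) {
                pairs.emplace_back(a, b);
            }
        }
    }
    return pairs;
}
```

With three words in the first sentence and two in the second, an unfiltered comparison needs six similarity calls, while the filter keeps only the pairs whose tags match.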

Linear space basis

Both sentences can be represented by the union of the words in the first sentence and the words in the second sentence.
However, words with similar semantic fields overlap: a word in one sentence can lie in the span of a word in the other sentence (in the linear space).
This means that a sentence can be represented using words that are not part of it, as long
as certain words in the sentence are linearly dependent on words in the other sentence.

Two words are considered linearly dependent if they are "similar".
Only words with the same POS tag are compared, and the similarity of two such words is defined by this function:

equation of similarity

The basis is the union of the words in the first sentence and the second sentence, minus the words from the
second sentence that are similar to words in the first sentence. A word can be obtained from a
similar word by multiplying that word by their similarity.
In the program, similar words are stored together in the list representing the basis. This avoids
calculating the similarity twice and re-checking which words are similar.
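
The basis construction described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `BasisEntry`, `build_basis`, the caller-supplied similarity function and the `threshold` parameter are all hypothetical, and the rule "a second-sentence word is absorbed when its best similarity reaches the threshold" is an assumption standing in for the project's own similarity function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical basis element: a basis word, plus the similar word from the
// second sentence that it absorbed (if any) and their similarity.
struct BasisEntry {
    std::string word;
    std::string similar_word;  // empty when nothing was absorbed
    double similarity;         // 0.0 when nothing was absorbed
};

// Caller-supplied word similarity in [0, 1] (assumption, e.g. backed by spaCy).
using SimilarityFn =
    std::function<double(const std::string&, const std::string&)>;

// Basis = words of s1, plus the words of s2 that are not similar to any
// word of s1. Assumes each sentence is already deduplicated and that the
// caller only passes words sharing one POS tag.
std::vector<BasisEntry> build_basis(const std::vector<std::string>& s1,
                                    const std::vector<std::string>& s2,
                                    const SimilarityFn& sim,
                                    double threshold) {
    assert(threshold > 0.0);
    std::vector<BasisEntry> basis;
    for (const std::string& w : s1) basis.push_back({w, "", 0.0});

    for (const std::string& w2 : s2) {
        // Find the first-sentence word most similar to w2.
        std::size_t best = 0;
        double best_sim = -1.0;
        for (std::size_t i = 0; i < s1.size(); ++i) {
            const double s = sim(s1[i], w2);
            if (s > best_sim) { best_sim = s; best = i; }
        }
        if (best_sim >= threshold) {
            // w2 is treated as dependent on basis[best]: keep the pair and
            // its similarity together instead of adding w2 to the basis.
            if (best_sim > basis[best].similarity) {
                basis[best].similar_word = w2;
                basis[best].similarity = best_sim;
            }
        } else {
            basis.push_back({w2, "", 0.0});  // independent word: new element
        }
    }
    return basis;
}
```

Storing the similar word and its similarity inside the basis entry mirrors the point made above: the similarity is computed once while the basis is built and can be reused when the sentence vectors are filled in.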

Vectors

Let the basis be
basis
The vector for the second sentence is determined as

vector sentence2
and the vector for the first sentence as
first

Result

The result is calculated using the cosine similarity of the two sentence vectors and the ratio between the vector norms:

calculation
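
The two quantities named above can be computed as in the sketch below. The function names are hypothetical, and because the exact formula that combines them is not recoverable from this text, the sketch returns the cosine similarity and the norm ratio separately and does not claim how the project merges them. Taking the ratio as smaller norm over larger norm is also an assumption, chosen so the value stays in [0, 1].

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dot product of two vectors expressed in the same basis.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Euclidean norm.
double norm(const std::vector<double>& v) { return std::sqrt(dot(v, v)); }

// Cosine of the angle between a and b; defined here as 0 if either is zero.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    const double na = norm(a);
    const double nb = norm(b);
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot(a, b) / (na * nb);
}

// Ratio of the smaller norm to the larger one (assumed orientation).
double norm_ratio(const std::vector<double>& a, const std::vector<double>& b) {
    const double na = norm(a);
    const double nb = norm(b);
    if (na == 0.0 || nb == 0.0) return 0.0;
    return na < nb ? na / nb : nb / na;
}
```

Cosine similarity only measures the angle between the sentence vectors, so the norm ratio adds information about how much their lengths differ.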

Stats

Using the benchmark data set by James O’Shea et al. [1],
I took multiple examples, which are shown in this table (using multiple spaCy models).

| Sentence 1 | Sentence 2 | Human benchmark | Similarity algorithm - en_core_web_sm | Similarity algorithm - en_core_web_md | Similarity algorithm - en_core_web_lg |
| --- | --- | --- | --- | --- | --- |
| Midday is 12 oclock in the middle of the day. | Noon is 12 oclock in the middle of the day. | 95.50 | 96.9431 | 85.7143 | 85.7143 |
| A boy is a child who will grow up to be a man. | A lad is a young man or boy. | 58 | 66.6667 | 66.6667 | 66.6667 |
| The coast is an area of land that is next to the sea. | The shores or shore of a sea, lake, or wide river is the land along the edge of it. | 58.75 | 73.1581 | 59.1532 | 59.1532 |
| A rooster is an adult male chicken. | A voyage is a long journey on a ship or in a spacecraft. | 0.5 | 55.9171 | 0 | 0 |
| A furnace is a container or enclosed space in which a very hot fire is made, for example, to melt metal, burn rubbish, or produce steam. | A stove is a piece of equipment which provides heat, either for cooking or for heating a room. | 34.75 | 31.1785 | 31.8863 | 31.8863 |
| A bird is a creature with feathers and wings, females lay eggs, and most birds can fly. | A cock is an adult male chicken. | 16.25 | 10.2824 | 15.7335 | 20.5201 |
| In former times, serfs were a class of people who had to work on a particular persons land and could not leave without that persons permission. | A slave is someone who is the property of another person and has to work for that person. | 48.25 | 42.2456 | 42.3675 | 42.3675 |
| An autograph is the signature of someone famous which is specially written for a fan to keep. | The shores or shore of a sea, lake, or wide river is the land along the edge of it. | 0.50 | 57.8627 | 22.1932 | 22.1932 |
| An automobile is a car. | A car is a motor vehicle with room for a small number of passengers. | 55.75 | 92.9903 | 50 | 50 |
| A crane is a large machine that moves heavy things by lifting them in the air. | An implement is a tool or other piece of equipment. | 18.50 | 46.7726 | 20.0479 | 20.0479 |
| A forest is a large area where trees grow close together. | Woodland is land with a lot of trees. | 62.75 | 59.4877 | 83.341 | 33.3333 |
| An automobile is a car. | In legends and fairy stories, a wizard is a man who has magic powers. | 2.00 | 72.951 | 0 | 0 |
| A cock is an adult male chicken. | A rooster is an adult male chicken. | 86.25 | 75 | 75 | 75 |
| A magician is a person who entertains people by doing magic tricks. | In legends and fairy stories, a wizard is a man who has magic powers. | 35.50 | 59.9677 | 35.0177 | 34.3168 |
| When you make a journey, you travel from one place to another. | A voyage is a long journey on a ship or in a spacecraft. | 36 | 35.174 | 20 | 11.1111 |
| A boy is a child who will grow up to be a man. | A rooster is an adult male chicken. | 10.75 | 26.9036 | 18.4459 | 18.4459 |
| Glass is a hard transparent substance that is used to make things such as windows and bottles. | A tumbler is a drinking glass with straight sides. | 13.75 | 26.3125 | 12.7719 | 12.7719 |
| Cord is strong, thick string. | A smile is the expression that you have on your face when you are pleased or amused, or when you are being friendly. | 1 | 34.0262 | 0 | 0 |
| An autograph is the signature of someone famous which is specially written for a fan to keep. | Your signature is your name, written in your own characteristic way, often at the end of a document to indicate that you wrote the document or that you agree with what it says. | 40.50 | 69.4671 | 47.734 | 47.734 |

en_core_web_sm

Mean Absolute Error(MAE) - 22.0958
Pearson correlation coefficient - 0.6009

en_core_web_md

Mean Absolute Error(MAE) - 6.57057
Pearson correlation coefficient - 0.9442

en_core_web_lg

Mean Absolute Error(MAE) - 7.73736
Pearson correlation coefficient - 0.9188
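
For readers who want to recompute these statistics from the table, the two metrics can be sketched as below. The function names are hypothetical and this is not taken from the project's code; it only applies the standard definitions of mean absolute error and the Pearson correlation coefficient.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute error between algorithm scores and human scores.
double mean_absolute_error(const std::vector<double>& predicted,
                           const std::vector<double>& reference) {
    assert(predicted.size() == reference.size() && !predicted.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        sum += std::fabs(predicted[i] - reference[i]);
    }
    return sum / static_cast<double>(predicted.size());
}

// Pearson correlation coefficient of two equally long samples.
// The result is NaN if either sample is constant (zero variance).
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());
    double mean_x = 0.0;
    double mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;

    double cov = 0.0;
    double var_x = 0.0;
    double var_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;
        cov += dx * dy;
        var_x += dx * dx;
        var_y += dy * dy;
    }
    return cov / std::sqrt(var_x * var_y);
}
```

Passing the en_core_web_sm column and the human benchmark column of the table to `mean_absolute_error` gives approximately 22.0958, which matches the first MAE value reported above.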

graph

Because of the lack of NLP tools in C++, I used spaCy.
spaCy is an open-source NLP library, and this project relies on it.
I chose spaCy over other libraries such as NLTK because of its accuracy and efficiency.
To use spaCy's POS tagger and similarity tool from C++, simply create an nlp object as shown in the Usage section.

[1] J. O’Shea, Z. Bandar, K. Crockett, and D. McLean, “Pilot short text semantic similarity benchmark data set: Full listing and description,” Computing, 2008.

This project is licensed under the GNU General Public License v2.0. See License for more info.
