A tool for computing the semantic similarity between two sentences.
You can view this tool's accuracy in the results table below.
Baby Steps
Sentence Semantic Similarity
spaCy Tools
Reference and License
This tool requires python3-dev
as well as spaCy.
$ pip3 install spacy
$ python3 -m spacy download en
For a more accurate English model:
$ python3 -m spacy download en_core_web_lg
For further downloads, follow the instructions on spaCy's website.
On macOS, it is important to use python3 to install spaCy, because
python2 on macOS is not a wide Unicode build.
For other languages, install the corresponding language model listed on spaCy's website.
Afterwards, replace the model's name in nlp_util.py (line 9).
The makefile is located in the src directory, so type
$ cd src && make
The program should compile with no errors.
To use the sentence semantic similarity function, create an nlp object:
#include "nlp.h"
...
nlp [object name];
[object name].semantic_similarity(Sentence_1, Sentence_2);
The function returns a number between 0 and 1 (the closer to 1, the more similar the sentences).
First, the words of the two sentences are grouped by part-of-speech (POS) tag. This reduces time
complexity and makes the algorithm more accurate, since only words with the same tag are compared.
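The effect of this grouping can be sketched in a few lines of Python. This is only an illustration: the sentences are tagged by hand, and the helper names `group_by_pos` and `same_tag_pairs` are hypothetical, not part of this project. In the tool itself the tags come from spaCy.

```python
from collections import defaultdict
from itertools import product


def group_by_pos(tagged_words):
    """Map each POS tag to the list of words that carry it."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return groups


def same_tag_pairs(tagged_1, tagged_2):
    """Yield (word_1, word_2) pairs whose POS tags match."""
    groups_1 = group_by_pos(tagged_1)
    groups_2 = group_by_pos(tagged_2)
    # Only tags present in both sentences can produce comparable pairs.
    for tag in groups_1.keys() & groups_2.keys():
        yield from product(groups_1[tag], groups_2[tag])


# Hand-tagged toy input (assumption: a real run would take tags from spaCy).
sentence_1 = [("cock", "NOUN"), ("is", "AUX"), ("adult", "ADJ"),
              ("male", "ADJ"), ("chicken", "NOUN")]
sentence_2 = [("rooster", "NOUN"), ("is", "AUX"), ("adult", "ADJ"),
              ("male", "ADJ"), ("chicken", "NOUN")]

pairs = sorted(same_tag_pairs(sentence_1, sentence_2))
all_pairs = len(sentence_1) * len(sentence_2)
print(f"{len(pairs)} same-tag comparisons instead of {all_pairs}")
```

With five words per sentence, comparing every word against every other word would take 25 comparisons, while the same-tag restriction leaves 9.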
The final result is calculated from a vector representation of each sentence, which requires
a basis for the vector space.
Both sentences can be represented using the union of the words in the first sentence and the second sentence.
However, because words can share a semantic field, some words in one sentence may lie in the span of a word in the other sentence (in the linear space).
This means that a sentence can be represented using words that are not part of it, as long
as certain words in the sentence are linearly dependent on words in the other sentence.
Two words are considered linearly dependent if they are "similar".
Only words with the same POS tag are compared, and whether two words are similar to one another is defined by this function.
The basis is the union of the words in the first sentence and the second sentence, minus the words from the
second sentence that are similar to words in the first sentence. To express one word in terms of a
similar word, multiply the similar word by their similarity.
In the program, similar words are stored together in the list representing the basis, which avoids
calculating the similarity twice and re-checking which words are similar.
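A minimal sketch of this basis construction follows. Several things here are assumptions made for illustration: the function name `build_basis`, the toy similarity table, and the fixed `threshold` that stands in for the similarity criterion, which is not reproduced in this text. POS filtering is left out to keep the example short.

```python
def build_basis(words_1, words_2, similarity, threshold=0.5):
    """Return a list of (basis_word, matches) entries.

    Every word of the first sentence becomes a basis word. A word of the
    second sentence is stored next to its most similar basis word, as a
    (word, similarity) pair, when the score reaches the threshold;
    otherwise it is appended as a new basis word.
    """
    first = list(dict.fromkeys(words_1))  # drop duplicates, keep order
    basis = [(word, []) for word in first]
    for w2 in dict.fromkeys(words_2):
        scored = [(similarity(w1, w2), i) for i, w1 in enumerate(first)]
        best_score, best_index = max(scored, default=(0.0, -1))
        if best_score >= threshold:
            # Keep the score so it never has to be computed again.
            basis[best_index][1].append((w2, best_score))
        else:
            basis.append((w2, []))
    return basis


# Toy similarity table (hypothetical scores; the tool uses spaCy instead).
TOY_SCORES = {frozenset(("cock", "rooster")): 0.9}


def toy_similarity(a, b):
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


basis = build_basis(["cock", "chicken"],
                    ["rooster", "chicken", "farm"],
                    toy_similarity)
for word, matches in basis:
    print(word, matches)
```

Here "rooster" is folded into the "cock" entry together with its score, "chicken" matches itself, and "farm" has no similar word in the first sentence, so it extends the basis.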
Given the basis, a vector is determined for the second sentence and another for the first sentence (the defining formulas are not reproduced here).
The result is calculated using the cosine similarity of the two vectors together with the ratio of their norms.
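As a rough illustration of this last step, the sketch below computes the cosine similarity and a norm ratio for two hand-made vectors. The vector coordinates, the helper names, and the choice to divide the smaller norm by the larger one are assumptions. The text does not spell out how the two quantities are combined, so the sketch reports them separately.

```python
import math


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    norm_u, norm_v = norm(u), norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm_u * norm_v)


def norm_ratio(u, v):
    """Smaller norm divided by larger norm, so the value stays in [0, 1].

    Assumption: the direction of the division is not given in the text.
    """
    smaller, larger = sorted((norm(u), norm(v)))
    return smaller / larger if larger else 0.0


# Hypothetical coordinates over a three-word basis.
sentence_1_vector = [1.0, 1.0, 0.0]
sentence_2_vector = [0.9, 1.0, 1.0]

print("cosine:", cosine_similarity(sentence_1_vector, sentence_2_vector))
print("norm ratio:", norm_ratio(sentence_1_vector, sentence_2_vector))
```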
Using the benchmark data set by James O’Shea [1],
I took multiple examples, which are shown in the table below with results for several spaCy models.
Sentence 1 | Sentence 2 | Human benchmark | Similarity algorithm - en_core_web_sm | Similarity algorithm - en_core_web_md | Similarity algorithm - en_core_web_lg |
---|---|---|---|---|---|
Midday is 12 oclock in the middle of the day. | Noon is 12 oclock in the middle of the day. | 95.50 | 96.9431 | 85.7143 | 85.7143 |
A boy is a child who will grow up to be a man. | A lad is a young man or boy. | 58 | 66.6667 | 66.6667 | 66.6667 |
The coast is an area of land that is next to the sea. | The shores or shore of a sea, lake, or wide river is the land along the edge of it. | 58.75 | 73.1581 | 59.1532 | 59.1532 |
A rooster is an adult male chicken. | A voyage is a long journey on a ship or in a spacecraft. | 0.5 | 55.9171 | 0 | 0 |
A furnace is a container or enclosed space in which a very hot fire is made, for example, to melt metal, burn rubbish, or produce steam. | A stove is a piece of equipment which provides heat, either for cooking or for heating a room. | 34.75 | 31.1785 | 31.8863 | 31.8863 |
A bird is a creature with feathers and wings, females lay eggs, and most birds can fly. | A cock is an adult male chicken. | 16.25 | 10.2824 | 15.7335 | 20.5201 |
In former times, serfs were a class of people who had to work on a particular persons land and could not leave without that persons permission. | A slave is someone who is the property of another person and has to work for that person. | 48.25 | 42.2456 | 42.3675 | 42.3675 |
An autograph is the signature of someone famous which is specially written for a fan to keep. | The shores or shore of a sea, lake, or wide river is the land along the edge of it. | 0.50 | 57.8627 | 22.1932 | 22.1932 |
An automobile is a car. | A car is a motor vehicle with room for a small number of passengers. | 55.75 | 92.9903 | 50 | 50 |
A crane is a large machine that moves heavy things by lifting them in the air. | An implement is a tool or other piece of equipment. | 18.50 | 46.7726 | 20.0479 | 20.0479 |
A forest is a large area where trees grow close together. | Woodland is land with a lot of trees. | 62.75 | 59.4877 | 83.341 | 33.3333 |
An automobile is a car. | In legends and fairy stories, a wizard is a man who has magic powers. | 2.00 | 72.951 | 0 | 0 |
A cock is an adult male chicken. | A rooster is an adult male chicken. | 86.25 | 75 | 75 | 75 |
A magician is a person who entertains people by doing magic tricks. | In legends and fairy stories, a wizard is a man who has magic powers. | 35.50 | 59.9677 | 35.0177 | 34.3168 |
When you make a journey, you travel from one place to another. | A voyage is a long journey on a ship or in a spacecraft. | 36 | 35.174 | 20 | 11.1111 |
A boy is a child who will grow up to be a man. | A rooster is an adult male chicken. | 10.75 | 26.9036 | 18.4459 | 18.4459 |
Glass is a hard transparent substance that is used to make things such as windows and bottles. | A tumbler is a drinking glass with straight sides. | 13.75 | 26.3125 | 12.7719 | 12.7719 |
Cord is strong, thick string. | A smile is the expression that you have on your face when you are pleased or amused, or when you are being friendly. | 1 | 34.0262 | 0 | 0 |
An autograph is the signature of someone famous which is specially written for a fan to keep. | Your signature is your name, written in your own characteristic way, often at the end of a document to indicate that you wrote the document or that you agree with what it says. | 40.50 | 69.4671 | 47.734 | 47.734 |
en_core_web_sm
Mean Absolute Error (MAE) - 22.0958
Pearson correlation coefficient - 0.6009
en_core_web_md
Mean Absolute Error (MAE) - 6.57057
Pearson correlation coefficient - 0.9442
en_core_web_lg
Mean Absolute Error (MAE) - 7.73736
Pearson correlation coefficient - 0.9188
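These two metrics can be recomputed from the table. The sketch below does so for the en_core_web_md column, using the values copied from the rows above in table order. The function names are illustrative and not part of this project. The output should land close to the reported MAE of 6.57057 and Pearson coefficient of 0.9442.

```python
import math

# Human benchmark column, in table order.
HUMAN = [95.50, 58, 58.75, 0.5, 34.75, 16.25, 48.25, 0.50, 55.75, 18.50,
         62.75, 2.00, 86.25, 35.50, 36, 10.75, 13.75, 1, 40.50]

# en_core_web_md column, in table order.
MODEL_MD = [85.7143, 66.6667, 59.1532, 0, 31.8863, 15.7335, 42.3675,
            22.1932, 50, 20.0479, 83.341, 0, 75, 35.0177, 20, 18.4459,
            12.7719, 0, 47.734]


def mean_absolute_error(expected, predicted):
    """Average of the absolute differences between paired values."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


mae = mean_absolute_error(HUMAN, MODEL_MD)
r = pearson(HUMAN, MODEL_MD)
print(f"MAE: {mae:.4f}")
print(f"Pearson: {r:.4f}")
```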
Because of the lack of NLP tools in C++, I used spaCy.
spaCy is an open source NLP library which is used in this project.
I chose spaCy over other libraries such as NLTK because of its accuracy and efficiency.
To use spaCy's POS tagger and similarity tool in C++, simply create an nlp object as shown before.
[1] J. O’Shea, Z. Bandar, K. Crockett, and D. McLean, “Pilot short text semantic similarity benchmark data set: Full listing and description,” Computing, 2008.
This project is licensed under the GNU General Public License v2.0. See License for more info.