Feature extraction is an important step in any machine learning problem. No matter how good the modeling algorithm, if we feed in poor features we will get poor results. This is often called **garbage in, garbage out**.
The Vector Space Model (VSM) represents text as vectors of numbers.
The most common way to calculate the similarity between two texts represented in the VSM is cosine similarity (or, equivalently, cosine distance).
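As a quick sketch (the vectors below are made-up toy counts, not from any particular dataset), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: a.b / (|a| * |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two toy document vectors (e.g., word counts over a shared vocabulary)
doc1 = np.array([1, 1, 1, 0, 0])
doc2 = np.array([1, 1, 1, 1, 1])

print(cosine_similarity(doc1, doc2))  # ~0.775 -- closer to 1 means more similar
```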
The different vectorization approaches fall into a few broad categories, described below.
Distributional representations use high-dimensional, sparse vectors to represent words. The simplest example is one-hot encoding, where each word in the vocabulary gets its own dimension:
text = ["I love NLP"]
"I"    = [1 0 0]
"love" = [0 1 0]
"NLP"  = [0 0 1]
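A minimal one-hot encoding sketch in plain Python (the helper names are illustrative, not a library API):

```python
text = "I love NLP"
tokens = text.split()

# Build the vocabulary: one index per unique word
vocab = {word: idx for idx, word in enumerate(tokens)}

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

for word in tokens:
    print(word, one_hot(word, vocab))
# I    [1, 0, 0]
# love [0, 1, 0]
# NLP  [0, 0, 1]
```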
- Create a vocabulary of all unique words across the documents.
- For each document, count how many times each vocabulary word appears, as in the example below.
text1 = "I Love programming."
text2 = "I Love coding and programming."
vocabulary = ["I","Love", "programming", "and", "coding"]
text1_vectorized = [1 1 1 0 0]
"I"=1 "Love"=1 "programming"=1 "and"=0 "coding"=0
text2_vectorized = [1 1 1 1 1]
"I"=1 "Love"=1 "programming"=1 "and"=1 "coding"=1
An n-gram is a contiguous sequence of n words and can help capture some context.
- 2 contiguous words: bigram
- 3 contiguous words: trigram
- n contiguous words: n-gram
text_a= "dog loves to eat meat"
text_b = "cat loves to eat milk"
bigram_vocabulary = ["dog loves", "loves to", "to eat", "eat meat", "cat loves", "eat milk"]  // 6-dimensional vector
text_a_vectorized = [1 1 1 1 0 0]
text_b_vectorized = [0 1 1 0 1 1]
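A bag-of-n-grams sketch with scikit-learn's `CountVectorizer` (the learned vocabulary is sorted alphabetically, so the column order differs from the hand-built list above):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog loves to eat meat", "cat loves to eat milk"]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams as well.
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigrams = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
# ['cat loves' 'dog loves' 'eat meat' 'eat milk' 'loves to' 'to eat']
print(bigrams.toarray())
# [[0 1 1 0 1 1]
#  [1 0 0 1 1 1]]
```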
TF-IDF is a statistical measure used in text mining and natural language processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two metrics:
- TF (Term Frequency): Measures how frequently a term appears in a document.
- IDF (Inverse Document Frequency): Measures how unique or rare a term is across all documents.
The TF-IDF of a term $t$ in a document $d$ is calculated as:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Where:

Term Frequency (TF):

$$\text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$$

Inverse Document Frequency (IDF):

$$\text{IDF}(t) = \log\left(\frac{N}{\text{number of documents containing } t}\right)$$

where $N$ is the total number of documents in the corpus.
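A minimal TF-IDF sketch with scikit-learn's `TfidfVectorizer`; note that scikit-learn uses a smoothed IDF and L2-normalizes each row, so the exact numbers differ from the textbook formula above, though the relative weighting is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["dog loves to eat meat", "cat loves to eat milk"]

# Words shared by every document (e.g. "loves", "eat") receive lower
# weights than words that appear in only one document (e.g. "meat", "milk").
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```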
The vectors in distributional representations are very high-dimensional and sparse, which makes them computationally inefficient and hampers learning.
To overcome these limitations, methods to learn low-dimensional representations were devised. These methods use neural networks to create **dense**, **low-dimensional representations** of words and text.
- CBOW (Continuous Bag of Words): predicts a word from its surrounding context words
- Skip-gram: predicts the surrounding context words from a given word
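CBOW and Skip-gram are the two training objectives of Word2Vec. A minimal sketch with gensim on a made-up toy corpus (`sg=0` selects CBOW, `sg=1` selects Skip-gram):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["dog", "loves", "to", "eat", "meat"],
    ["cat", "loves", "to", "eat", "milk"],
    ["dog", "and", "cat", "are", "pets"],
]

# sg=0 -> CBOW (predict a word from its context)
# sg=1 -> Skip-gram (predict the context from a word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["dog"].shape)         # (50,) dense vector
print(model.wv.most_similar("dog"))  # nearest neighbours in embedding space
```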
The idea behind universal text representations is contextual word representation: the same word gets a different vector depending on the words around it. Modern NLP models use attention mechanisms to achieve this.
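As an illustrative sketch (assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint), an attention-based model gives the same word different vectors in different sentences:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited money at the bank", "We sat on the bank of the river"]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # Each token's vector depends on the whole sentence, so "bank"
        # receives a different representation in each sentence.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_index = tokens.index("bank")
        bank_vector = outputs.last_hidden_state[0, bank_index]
        print(sentence, bank_vector[:3])
```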