update text with figure and more refs
florian-huber committed Jun 24, 2024
1 parent ae35387 commit f08d2b2
Showing 1 changed file with 22 additions and 19 deletions.
41 changes: 22 additions & 19 deletions notebooks/live_coding_12_NLP_4_ngrams_word_vectors.ipynb
@@ -1920,9 +1920,9 @@
"## Word Vectors: Word2Vec and Co\n",
"**Tfidf vectors** are a rather basic, but still often used, technique. Arguably, this is because they are based on relatively simple statistics and easy to compute. They typically do a good job in weighing words according to their importance in a larger corpus and allow us to ignore words with low *distriminative power* (for instance so-called *stopwords* such as \"a\", \"the\", \"that\", ...).\n",
"\n",
"With **n-grams** we can even go one step further and also count sentence pieces longer than one word, which allows to also take some grammar or negations into account. The price, however, is that we have to restrict the number of n-grams to avoid exploding vector sizes.\n",
"With **n-grams** we can even go one step further and also count sentence pieces longer than one word. With n-grams our models can identify important word combinations such as negations (\"do not like\"), comparatives, or specific expressions (\"the best\") into account. The price, however, is that we have to restrict the number of n-grams to avoid exploding vector sizes.\n",
"\n",
"While TF-IDF vectors and n-grams serve as powerful techniques to represent and manipulate text data, they have limitations. These methods essentially treat words as individual, isolated units, devoid of any context or relation to other words. In other words, they lack the ability to capture the semantic meanings of words and the linguistic context in which they are used.\n",
"TF-IDF vectors and n-grams serve as powerful techniques to represent and manipulate text data, but they have limitations. These methods treat words, or tiny groups of words, as individual, isolated units, devoid of any context or relation to other words. In other words, they cannot capture the semantic meanings of words and the linguistic context in which they are used.\n",
"\n",
"Take these two sentences as an example:\n",
"\n",
@@ -1933,7 +1933,21 @@
"\n",
"This is where we come to **word vectors**. Word vectors, also known as **word embeddings**, are mathematical representations of words in a high-dimensional space where the semantic similarity between words corresponds to the geometric distance in the embedding space. Simply put, similar words are close together, and dissimilar words are farther apart. If done well, this should show that *\"cookie\"* and *\"cake\"* are not the same word, but mean something very related.\n",
"\n",
"The most prominent example of such a technique is Word2Vec {cite}`mikolov_distributed_2013`{cite}`mikolov_efficient_2013`.\n"
"The most prominent example of such a technique is **Word2Vec** {cite}`mikolov_distributed_2013`{cite}`mikolov_efficient_2013`.\n",
"\n",
"### Word2Vec\n",
"\n",
"The fundamental idea behind Word2Vec is to use the context in which words appear to learn their meanings. As shown in the {numref}`fig_word2vec_sliding_window`, a sliding window of a fixed size (in this case, 5) moves across the sentence \"The customer likes cake with a cappuccino.\" At each step, the algorithm selects a target word and its surrounding context words. The goal is to predict the target word based on its context or vice versa.\n",
"\n",
"For example, in the phrase \"the customer likes,\" the target word is \"the,\" and the context words are \"customer\" and \"likes.\" This process is repeated for each possible position in the sentence. These word-context pairs are fed into the Word2Vec model, which learns to map each word to a unique vector in such a way that words appearing in similar contexts have similar vectors. This vector representation captures semantic similarities, meaning that words with similar meanings or usages are positioned closer together in the vector space. Word2Vec thus enables various applications such as sentiment analysis, machine translation, and recommendation systems by providing a mathematical representation of words that reflects their meanings and relationships.\n",
"\n",
"Word2Vec models can be trained using two main methods: Continuous Bag of Words (CBOW) and Skip-Gram. In CBOW, the model predicts a target word based on its surrounding context words, focusing on understanding the word's context to infer its meaning. Conversely, the Skip-Gram model predicts the surrounding context words given a target word, emphasizing the ability to generate context from a single word {cite}`mikolov_distributed_2013`{cite}`mikolov_efficient_2013`.\n",
"\n",
"```{figure} ../images/fig_word2vec_sliding_window.png\n",
":name: fig_word2vec_sliding_window\n",
"\n",
"Techniques such as Word2Vec learn vector representations of individual words based on their \"context\", which is given by the neighboring words. \n",
"```\n"
]
},
{
@@ -2531,26 +2545,15 @@
"id": "1b84acf6",
"metadata": {},
"source": [
"### Limitations and Extensions\n",
"\n",
"While Word2Vec is a powerful tool, it's not without limitations. The main issue with Word2Vec (and similar models that derive their semantics based on the local usage context) is that they assign one vector per word. This becomes a problem for words with multiple meanings based on their context (homonyms and polysemes). To tackle such limitations, extensions like FastText and advanced methods such as GloVe (Global Vectors for Word Representation) and transformers like BERT (Bidirectional Encoder Representations from Transformers) have been proposed.\n",
"\n",
"### FastText\n",
"\n",
"FastText, also developed by Facebook, extends Word2Vec by treating each word as composed of character n-grams. So the vector for a word is made of the sum of these character n-grams. This allows the embeddings to capture the meaning of shorter words and suffixes/prefixes and understand new words once the character n-grams are learned.\n",
"\n",
"### GloVe\n",
"\n",
"GloVe, developed by Stanford, is another method to create word embeddings. While Word2Vec is a predictive model — a model that predicts the context given a word, GloVe is a count-based model. It leverages matrix factorization techniques on the word-word co-occurrence matrix.\n",
"### Limitations and more Powerful Alternatives\n",
"\n",
"### Transformers: BERT & GPT\n",
"While Word2Vec is a powerful tool, it has limitations. One significant issue is that Word2Vec assigns one vector per word, which poses a problem for words with multiple meanings based on their context (homonyms and polysemes, such as \"apple\" the fruit vs. \"apple\" the company).\n",
"\n",
"One of the most fundamental limitations of Word2Vec and similar algorithms is rooted in the underlying bag-of-words approach. This removes a lot of information that lies in the **order** of words. Even constructs such as n-grams can only compensate for extremely local pattern such as *\"do not like\"* vs *\"do like\"*.\n",
"A more fundamental limitation of Word2Vec and similar algorithms lies in the underlying bag-of-words approach, which removes information related to the order of words. Even constructs like n-grams can only compensate for extremely local patterns, such as differentiating \"do not like\" from \"do like\".\n",
"\n",
"In contrast, deep learning techniques related to recurrent neural networks or -even more powerful- transformers, have the ability to learn pattern across many more words. In the case of transformers even across pages of text which is precisely why ChatGPT and following large language models were able to use natural language with an formerly unknown degree of subtlety. \n",
"BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer) models ushered in the era of transformers that not only consider local context but also take the entire sentence context into account to create word embeddings.\n",
"In contrast, deep learning techniques like recurrent neural networks and, more powerfully, **transformers**, can learn patterns across many more words. Transformers, in particular, can learn patterns across entire pages of text {cite}`vaswani2017attention`, enabling models like ChatGPT and other large language models to use natural language with unprecedented subtlety. Models such as BERT (Bidirectional Encoder Representations from Transformers) {cite}`devlin2018bert` and GPT (Generative Pretrained Transformer) {cite}`radford2018improvin`g produce contextualized representations of words within a given context, taking the entire sentence or paragraph into account rather than generating static word embeddings.\n",
"\n",
"In conclusion, while TF-IDF and n-grams offer a solid start, word embeddings and transformers take the representation of words to the next level. By considering context and semantic meaning, they offer a more complete and robust method to work with text."
"In conclusion, while TF-IDF and n-grams offer a solid start, word embeddings like those produced by Word2Vec and contextualized representations from transformers provide more advanced methods for working with text by considering context and semantic meaning."
]
},
{