Regarding the pre-trained vectors for some of the corpora (on the HistWords website):
For specific decades, there appear to be a handful of word vectors that are 0.0 across all 300 dimensions, even though the corresponding words are still present in the corpus for that decade.
These words do not seem to get any representation at all and have been assigned zero values throughout. For example, the vector for the word 'autism' in the 1800s decade of the Google n-grams eng-all vectors is [0.0 ... 0.0] across all 300 dimensions.
Would it be apt to treat these words as simply 'missing' from the corpus in that particular decade?
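For reference, a minimal sketch of how I'm checking this, assuming the released sgns archives store the embedding matrix as {decade}-w.npy next to {decade}-vocab.pkl and that the vocab pickle holds a plain list of words:

import pickle
import numpy as np

decade = 1800
# Assumed file layout of the released sgns archives: {decade}-vocab.pkl holds the
# word list, {decade}-w.npy holds the corresponding embedding matrix.
with open(f"./sgns/{decade}-vocab.pkl", "rb") as f:
    vocab = pickle.load(f)
mat = np.load(f"./sgns/{decade}-w.npy")

vec = mat[vocab.index("autism")]
print(vec.shape)             # (300,)
print(np.all(vec == 0.0))    # True: 'autism' has no representation in this decade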
I was recently confused by this myself, and think I may have found an answer.
Appendix A of the paper states (emphasis mine):
For the Google datasets we built models using the top-100000 words by their average frequency over the entire historical time-periods, [...]
My interpretation is that the vocabularies across all decades contain the same 100,000 words, and that zero-valued vectors indicate no embedding was learned for those words because they don't appear in the corpus of that decade.
That first assumption is quickly confirmed:
import pickle

vocabs = []
for decade in range(1800, 2000, 10):
    with open(f"./sgns/{decade}-vocab.pkl", 'rb') as f:
        # set semantics enable comparisons without having to sort lists manually
        vocabs.append(set(pickle.load(f)))

# union of the first set with all others, serves as a point of reference in the comparison below
u = vocabs[0].union(*vocabs[1:])
all([u == v for v in vocabs])
Confirming the second assumption may be a little more involved, but my take-away is that if you're viewing the data synchronically, it should be safe to drop the zero-valued vectors. If you need a diachronic view, as in the paper, you should not drop anything.
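Here is a small sketch of how the synchronic case could be handled, reusing the assumed {decade}-vocab.pkl / {decade}-w.npy layout from above and simply masking out all-zero rows:

import pickle
import numpy as np

decade = 1800
with open(f"./sgns/{decade}-vocab.pkl", "rb") as f:
    vocab = pickle.load(f)
mat = np.load(f"./sgns/{decade}-w.npy")

# True for every row that carries an actual embedding (i.e. is not all zeros)
has_embedding = ~np.all(mat == 0.0, axis=1)
kept_words = [w for w, keep in zip(vocab, has_embedding) if keep]
kept_vectors = mat[has_embedding]

For a diachronic analysis you would instead keep the full matrix and use the mask only to flag words as unattested in that decade.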