Zero-valued vectors? #12

Open
nishanthsanjeev opened this issue Nov 13, 2020 · 1 comment

Comments

@nishanthsanjeev

nishanthsanjeev commented Nov 13, 2020

Regarding the pre-trained vectors for some of the corpora on the HistWords website:

For specific decades, a handful of word vectors are 0.0 across all 300 dimensions. Note that the corresponding words are still present in the corpus for that decade.

However, these words do not seem to get any representation: they have been assigned zero values throughout. For example, the vector for the word 'autism' in the 1800s decade of the Google n-grams eng-all vectors is [0.0 ... 0.0] across all 300 dimensions.

Would it be apt to treat these words as simply 'missing' from the corpus for that decade?
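To make the question concrete, here is a minimal sketch of how the all-zero vectors could be listed for one decade. It assumes the vocabulary is a list of words aligned row by row with the embedding matrix; the helper name `zero_vector_words` and the toy values are hypothetical and not taken from the HistWords data. It uses only built-ins, so `vectors` can be a list of lists or any array that iterates row by row.

```python
def zero_vector_words(vocab, vectors):
    """Return the words whose vector has no non-zero component.

    Assumption: vocab[i] is the word for row vectors[i].
    """
    return [word for word, row in zip(vocab, vectors) if not any(row)]


# Toy data for illustration only (3 dimensions instead of 300).
toy_vocab = ["autism", "the", "house"]
toy_vectors = [
    [0.0, 0.0, 0.0],   # all zeros: no embedding for this word
    [0.3, -0.1, 0.2],
    [0.0, 0.4, 0.0],   # partly zero, but still a real embedding
]

print(zero_vector_words(toy_vocab, toy_vectors))
```

Only rows that are zero in every dimension are reported; a vector with some zero components still counts as an embedding.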

@baumanno

baumanno commented May 4, 2022

I was recently confused by this myself, and think I may have found an answer.
Appendix A of the paper states (emphasis mine):

For the Google datasets we built models using the top-100000 words by their average frequency over the entire historical time-periods, [...]

My interpretation is that the vocabularies across all decades contain the same 100,000 words, and that a zero-valued vector indicates that no embedding was found for that word because the word doesn't appear in the corpus of that decade.

That first assumption is quickly confirmed:

import pickle

vocabs = []
for decade in range(1800, 2000, 10):
    with open(f"./sgns/{decade}-vocab.pkl", 'rb') as f:
        # set semantics enable comparisons without having to sort lists manually
        vocabs.append(set(pickle.load(f)))

# union of first set with all others, serves as a point of reference in the comparison below
u = vocabs[0].union(*vocabs[1:])

# True only if every decade's vocabulary equals the union, i.e. all vocabularies are identical
all(u == v for v in vocabs)

Confirming the second assumption may be a little more involved, but my take-away is that if you're viewing the data synchronically (one decade at a time), it should be safe to drop the zero-valued vectors. If you need a diachronic view across decades, as in the paper, you should not drop anything.
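A rough sketch of the two ways of handling the data described above, under the assumption that each decade is available as a (vocabulary list, row-aligned vectors) pair. The function names, the dictionary layout, and the toy values are all hypothetical; nothing here is taken from the HistWords code.

```python
def drop_zero_vectors(vocab, vectors):
    """Synchronic view: keep only words with at least one non-zero component."""
    kept = [(word, row) for word, row in zip(vocab, vectors) if any(row)]
    return [word for word, _ in kept], [row for _, row in kept]


def decades_with_embedding(word, embeddings_by_decade):
    """Diachronic view: leave the data intact and report the decades
    in which `word` has a non-zero vector."""
    decades = []
    for decade, (vocab, vectors) in sorted(embeddings_by_decade.items()):
        index = vocab.index(word)  # same vocabulary in every decade
        if any(vectors[index]):
            decades.append(decade)
    return decades


# Toy data for illustration only (2 dimensions, 2 decades, made-up values).
toy = {
    1800: (["autism", "the"], [[0.0, 0.0], [0.3, 0.1]]),
    1990: (["autism", "the"], [[0.2, 0.5], [0.3, 0.2]]),
}

print(drop_zero_vectors(*toy[1800]))
print(decades_with_embedding("autism", toy))
```

Keeping the zero rows in the diachronic case preserves the shared row order across decades, so the same index refers to the same word everywhere.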
