
Check if the word exists in English language #1

Open
aguschin opened this issue Aug 30, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@aguschin
Contributor

Stop using wordnet.synsets to filter words in guessing. It is currently a workaround to remove non-existent words (otherwise, for example, you could explain "hatter" with "hattter").

One option is to use a large English dictionary. Other suggestions are welcome.

The code line where this happens:
https://gitlab.com/production-ml/the-hat-game/-/blob/master/the_hat_game/game.py#L59

@aguschin aguschin added the enhancement New feature or request label Aug 30, 2021
@naidenovaleksei
Contributor

It may be a good idea to collect all possible words from the dataset and filter words against them. If the dataset is not defined, the default check would filter words against NLTK's words corpus (nltk.corpus.words rather than the wordnet.synsets lookup).
P.S. A similar problem is discussed here
P.P.S. Adding yet another third-party library (such as pyenchant) may be inconvenient, so NLTK's corpus is a good choice.
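
A minimal sketch of that idea, not the project's actual implementation. The function names (build_vocabulary, load_default_vocabulary, filter_known_words) and the tokenization rule are assumptions made for illustration; the fallback assumes the NLTK words corpus has already been downloaded.

```python
import re
from typing import Iterable, List, Optional, Set


def build_vocabulary(texts: Iterable[str]) -> Set[str]:
    """Collect every lowercase alphabetic token that occurs in the dataset."""
    vocab: Set[str] = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z]+", text.lower()))
    return vocab


def load_default_vocabulary() -> Set[str]:
    """Fallback vocabulary: NLTK's words corpus (assumes it is already downloaded)."""
    # Imported lazily so the dataset-based path works without NLTK data.
    from nltk.corpus import words as nltk_words
    return {w.lower() for w in nltk_words.words()}


def filter_known_words(candidates: Iterable[str],
                       vocabulary: Optional[Set[str]] = None) -> List[str]:
    """Keep only candidates found in the vocabulary (case-insensitive)."""
    if vocabulary is None:
        vocabulary = load_default_vocabulary()
    return [w for w in candidates if w.lower() in vocabulary]


# Hypothetical one-sentence "dataset", used only to show the behaviour.
dataset = ["Rearchitecting the GTK toolkits meant backporting and rewriting code."]
vocab = build_vocabulary(dataset)
print(filter_known_words(["backporting", "hattter", "gtk"], vocab))
```

With a dataset-derived vocabulary, domain words such as "backporting" survive the filter while the misspelling "hattter" is dropped, because it never occurs in the dataset.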

@aguschin
Contributor Author

aguschin commented Oct 10, 2021

Good point, @naidenovaleksei!
One question: are there any benefits to using pyenchant for this simple task? If not, I think we can use NLTK when the dataset is not defined, because we already use it for other text processing tasks here.

@aguschin
Contributor Author

aguschin commented Nov 13, 2021

Found this in the logs. Could it be connected with wordnet not recognizing some words?

2021-07-03 13:16:10,361 - the_hat_game.loggers - INFO - EXPLAINING PLAYER (Make Hat Game Again) to HOST: my wordlist is ['opengl', 'compatibility', 'rearchitecting', 'upgrading', 'backporting', 'reimplemented', 'gtk', 'toolkits', 'rewriting', 'directx']
2021-07-03 13:16:10,361 - the_hat_game.loggers - INFO - HOST TO EXPLAINING PLAYER (Make Hat Game Again): cleaning your word list. Now the list is ['compatibility', 'upgrading', 'rewriting']

2021-07-03 13:16:14,925 - the_hat_game.loggers - INFO - HOST to EXPLAINING PLAYER (LAZY ILON): the word is "usenet"
2021-07-03 13:16:15,207 - the_hat_game.loggers - INFO - EXPLAINING PLAYER (LAZY ILON) to HOST: my wordlist is ['newsgroups', 'nntp', 'crossposted', 'pcboard', 'bbses', 'crossposting', 'newsgroup', 'cypherpunks', 'crosspost', 'funet']
2021-07-03 13:16:15,207 - the_hat_game.loggers - INFO - HOST TO EXPLAINING PLAYER (LAZY ILON): cleaning your word list. Now the list is ['bbses']

@naidenovaleksei
Contributor

Looks like it is related.

As you can see below, neither the WordNet word list nor the synset lookup covers all English words, and the same goes for nltk.corpus.words.

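from nltk.corpus import wordnet
from nltk.corpus import words as nltk_words

# Build both vocabularies once; calling .words() inside the loop rescans the corpus for every word.
wordnet_vocab = set(wordnet.words())
nltk_vocab = set(nltk_words.words())

wordlist = ['compatibility', 'rewriting', 'upgrading', 'backporting']
print("word" + " " * 4, "synsets", "wordnet", "nltk_words", sep="\t")
for word in wordlist:
    word_in_wordnet = word in wordnet_vocab
    # synsets() also matches inflected forms through WordNet's morphological lookup
    word_in_synsets = len(wordnet.synsets(word)) > 0
    word_in_nltk_words = word in nltk_vocab
    print(word, word_in_synsets, word_in_wordnet, word_in_nltk_words, sep="\t")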
> word    	synsets	wordnet	nltk_words
> compatibility	True	True	True
> rewriting	True	True	False
> upgrading	True	False	False
> backporting	False	False	False

So I think this will fix it.
