3 entries in Eng raw data have no pos #1017
Comments
These are extracted from pages in the "Thesaurus" namespace, for example Thesaurus:erectile dysfunction. The JSON object has a "source" field with the value "thesaurus".
I have seen many other Wiktionary entries that refer to the Thesaurus as well, but those referenced Thesaurus entries get added to a sense entry or to a top-level synonym/antonym (etc.) entry for that word. These are the only 3 cases of a Thesaurus page appearing in the extract file all by itself: of the almost 10 million entries in the raw English extract, only 3 are like this. Looking at the wiki pages for them, I can't really tell whether the way the data is organized on the page is causing these 3 entries to be written separately. (Note that the Thesaurus words are also being added to the word JSON entries.) The reference to the Thesaurus doesn't seem different from other words I have looked up. Maybe there's some subtlety I am missing because this data structure is still very new to me.
These 3 thesaurus pages don't have a POS title, and thesaurus data is added as its own entry if we haven't already added a word entry with the same word, language code and POS. Normally, a thesaurus page has a POS title and won't be added as a separate JSON object. Here is the code: wiktextract/src/wiktextract/thesaurus.py Lines 253 to 254 in 0ee4b87
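To make the rule described above concrete, here is a minimal sketch of the matching logic. It is not the code at the lines referenced in thesaurus.py; the function name `is_orphan`, the `emitted` set and the sample keys are all hypothetical, and the sketch assumes a missing POS is represented as `None`.

```python
from typing import Optional, Set, Tuple

# (word, language code, POS); POS is None when the thesaurus page has no POS title.
Key = Tuple[str, str, Optional[str]]


def is_orphan(word: str, lang_code: str, pos: Optional[str], emitted: Set[Key]) -> bool:
    """Return True if no word entry with the same word, language code and POS exists.

    Hypothetical helper illustrating the rule from the comment above: thesaurus
    data becomes a separate JSON object only when this returns True.
    """
    return (word, lang_code, pos) not in emitted


# Illustrative keys for word entries that were already written (made-up data).
emitted: Set[Key] = {("sleep", "en", "noun"), ("sleep", "en", "verb")}

with_pos = is_orphan("sleep", "en", "noun", emitted)    # thesaurus page with a POS title
without_pos = is_orphan("sleep", "en", None, emitted)   # thesaurus page without one
print(with_pos, without_pos)
```

Under these assumptions, a thesaurus page with no POS title can never match an existing word entry, which would explain why it ends up as a standalone object.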
I added a Noun header to the Wiktionary entry for Thesaurus:sleep. After the next Wiktionary dump and extract release, the orphan sleep entry should no longer be there. Or so I believe.
That should do it, thanks! I'll leave this open, and hopefully you can update us once the Wiktextract data is correct. It is possible that other, similar cases can be fixed on our side, but in this case there wasn't any way other than making the Wiktionary entries more complete.
I noticed this earlier, and I just downloaded the latest raw data file (raw-wiktextract-data.jsonl) dated 2025-01-31, and the same "issue" exists.
There are 9,955,900 lines (JSON objects) in this file, and only 3 of them have no "pos" field. As far as I can tell, the Wiktionary pages for these words do include a POS.
Line 9,955,610, word: sleep
Line 9,955,623, word: underwear
Line 9,955,682, word: erectile dysfunction
I just thought I should mention it.
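For anyone who wants to reproduce the check, here is a minimal sketch of how entries with no "pos" field can be found in a JSONL file. The function name `entries_without_pos` and the two sample lines are illustrative assumptions; only the "word", "pos" and "source" field names come from this thread.

```python
import json
from typing import Iterable, Iterator, Tuple


def entries_without_pos(lines: Iterable[str]) -> Iterator[Tuple[int, dict]]:
    """Yield (1-based line number, entry) for JSON objects with a missing or empty "pos"."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if not entry.get("pos"):
            yield line_number, entry


# Made-up sample lines standing in for the real extract file.
sample = [
    '{"word": "sleep", "lang_code": "en", "pos": "verb"}',
    '{"word": "sleep", "lang_code": "en", "source": "thesaurus"}',
]

for line_number, entry in entries_without_pos(sample):
    print(line_number, entry.get("word"), entry.get("source"))
```

To run it against the real data, pass an open file handle instead of `sample`, for example `open("raw-wiktextract-data.jsonl", encoding="utf-8")`. Iterating over the handle reads one line at a time, so the whole file is never loaded into memory.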