3 entries in Eng raw data have no pos #1017

Open
rob-ross opened this issue Feb 4, 2025 · 5 comments

rob-ross commented Feb 4, 2025

I noticed this earlier, and I just downloaded the latest raw data file (raw-wiktextract-data.jsonl) dated 2025-01-31, and the same "issue" exists.

There are 9,955,900 lines/json objects in this file, and only 3 of them have no pos entry. As far as I can tell, the wikt pages for these words include a pos.

Line 9,955,610, word: sleep
Line 9,955,623, word: underwear
Line 9,955,682, word: erectile dysfunction
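
A scan like the one that produced these line numbers can be reproduced with a short script. This is only a minimal sketch, assuming the file holds one JSON object per line and that each object has a "word" key and usually a "pos" key; the function names `find_entries_without_pos` and `report` are hypothetical and not part of Wiktextract.

```python
import json
from typing import Iterable, Iterator


def find_entries_without_pos(lines: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Yield (line_number, word) for every JSON line with no "pos" value."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines without losing the line count
        entry = json.loads(line)
        if not entry.get("pos"):
            yield line_number, entry.get("word", "")


def report(path: str) -> None:
    """Print the pos-less entries of a JSONL file (the path is caller-supplied)."""
    with open(path, encoding="utf-8") as f:
        for line_number, word in find_entries_without_pos(f):
            print(f"Line {line_number:,}, word: {word}")
```

Calling `report("raw-wiktextract-data.jsonl")` streams the file line by line, so the full extract never has to fit in memory.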

I just thought I should mention it.

  • Rob

Collaborator

xxyzz commented Feb 4, 2025

These are extracted from "Thesaurus" namespace pages, for example: Thesaurus:erectile dysfunction. Each of these JSON objects has a "source" field with the value "thesaurus".
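
To check that every pos-less entry really is one of these thesaurus objects, the "source" field can be tested directly. A minimal sketch, assuming the field name "source" and the value "thesaurus" are exactly as described in this comment; `is_thesaurus_only` and `count_thesaurus_only` are hypothetical helper names.

```python
import json
from typing import Iterable


def is_thesaurus_only(entry: dict) -> bool:
    """True if the entry has no "pos" and is marked as thesaurus-sourced."""
    return not entry.get("pos") and entry.get("source") == "thesaurus"


def count_thesaurus_only(lines: Iterable[str]) -> int:
    """Count JSONL lines that are pos-less, thesaurus-sourced entries."""
    return sum(
        1
        for line in lines
        if line.strip() and is_thesaurus_only(json.loads(line))
    )
```

If this count matches the number of entries with no pos, all of them come from Thesaurus pages.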

Author

rob-ross commented Feb 5, 2025

I have seen many other wikt entries that refer to the Thesaurus as well but those referenced Thesaurus entries get added to a sense entry or to a top-level synonym/antonym (etc.) entry for that word. These are the only 3 cases of a Thesaurus page being in the extract file all by themselves. Of the almost 10 M entries in the raw English extract there are only 3 entries like these. Looking at the wiki pages for them, I can't really tell if the way the data is organized on the webpage is causing these 3 entries to be written separately. (Note the Thesaurus words are being added to the word json entries.) The reference to Thesaurus doesn't seem different than other words I have looked up. Maybe there's some subtlety I am missing because this data structure is still very new to me.

Collaborator

xxyzz commented Feb 6, 2025

These 3 thesaurus pages don't have a POS title, and thesaurus data is added as its own entry if we haven't already added a word entry with the same word, language code, and POS. Normally, a thesaurus page has a POS title, so it isn't added as a separate JSON object.

Here is the code:

# Skip thesaurus data whose (word, language code, POS) was already emitted
if (entry, lang_code, pos) in emitted:
    continue
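
To illustrate why a missing POS title leads to a separate object, here is a simplified sketch built around that check. It is not Wiktextract's actual implementation: the function `emit_orphan_thesaurus_entries`, the key layout, and the shape of the output dictionary are assumptions made for illustration. Only the `(entry, lang_code, pos) in emitted` test comes from the snippet above.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical key layout: (entry, lang_code, pos); pos is None without a POS title.
Key = tuple[str, str, Optional[str]]


def emit_orphan_thesaurus_entries(
    thesaurus_keys: Iterable[Key], emitted: set[Key]
) -> Iterator[dict]:
    """Yield a standalone object for each thesaurus key with no matching word entry."""
    for entry, lang_code, pos in thesaurus_keys:
        if (entry, lang_code, pos) in emitted:
            continue  # a word entry with this word, language, and POS already exists
        out = {"word": entry, "lang_code": lang_code, "source": "thesaurus"}
        if pos is not None:
            out["pos"] = pos  # omitted entirely when the page had no POS title
        yield out
```

Under these assumptions, a word entry emitted as ("sleep", "en", "noun") never matches a thesaurus key of ("sleep", "en", None), so the thesaurus data is written out on its own with no pos. Once the page has a Noun header, the key matches and nothing separate is emitted.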

Author

rob-ross commented Feb 6, 2025

I added a Noun header to the wikt entry for Thesaurus:sleep

After the next wiktionary dump and extract release, the orphan sleep entry should no longer be there. Or so I believe.

  • Rob

Collaborator

kristian-clausal commented

> I added a Noun header to the wikt entry for Thesaurus:sleep
>
> After the next wiktionary dump and extract release, the orphan sleep entry should no longer be there. Or so I believe.
>
> * Rob

That should do it, thanks! I'll leave this open, and hopefully you can update us when the Wiktextract data is correct. It is possible that there are other, similar cases that can be fixed on our side, but in this case there wasn't any way to fix it except by making the Wiktionary entries more complete.
