wndb2lmf.py script does not handle pre-3.0 versions #38

Open
goodmami opened this issue Oct 1, 2024 · 6 comments

@goodmami
Collaborator

goodmami commented Oct 1, 2024

For context: goodmami/wn#199

There are two main issues:

  • The CILI mapping format in https://github.com/globalwordnet/cili/ in older-wn-mappings/ has 3 columns instead of 2
  • WordNet versions prior to 2.1 do not include verb.Framestext, so we'd need to hard-code the frames into the script
@goodmami goodmami added the bug Something isn't working label Oct 1, 2024
@goodmami
Collaborator Author

goodmami commented Oct 1, 2024

A third and fourth issue for PWN 1.5:

  • The Unix database files are not available, only a Windows .zip and an old Mac .bin file. In the Windows archive, the filenames are all-caps and slightly different (e.g., DICT/NOUN.DAT instead of dict/data.noun), so we'd need to do some special file-loading. Luckily, there don't seem to be any strange encoding or line-ending problems.
  • The data files don't just have the license text and data lines, but in between these two blocks there is also a list of files used to create the database. The code for reading data files needs to be made aware of this.
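
One way to make the data-file reader indifferent to whatever sits between the license text and the data is to accept only lines that look like synset records. The sketch below assumes that every data line begins with an 8-digit synset offset followed by a space, and that no header or file-list line does; `iter_data_lines` is a hypothetical helper name, not a function from wndb2lmf.py.

```python
import re
from typing import Iterable, Iterator

# Assumption: a data line starts with an 8-digit synset offset and a space.
# License lines and the 1.5 file list are assumed never to match this.
DATA_LINE = re.compile(r"^\d{8} ")


def iter_data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that look like synset records."""
    for line in lines:
        if DATA_LINE.match(line):
            yield line.rstrip("\r\n")
```

Filtering on the shape of a data line, rather than counting header lines, means the same reader would work whether or not the extra block is present.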

@goodmami
Collaborator Author

goodmami commented Oct 1, 2024

The 3rd column of the older cili mappings is a confidence score. We should probably parse that and use it for thresholding out low-quality mappings.
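
A minimal sketch of what that thresholding could look like. Everything about the file layout here is an assumption: tab-separated fields, the column order (source ID, target ID, confidence), a confidence where higher means better, and the default threshold of 0.9. `load_mapping` is a hypothetical name, and two-column rows are treated as fully confident so the same loader handles both formats.

```python
from typing import Dict, Iterable


def load_mapping(lines: Iterable[str], threshold: float = 0.9) -> Dict[str, str]:
    """Read 2- or 3-column mapping rows, dropping low-confidence ones."""
    mapping: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        fields = line.split("\t")
        if len(fields) == 2:
            source, target = fields
            confidence = 1.0  # no score column: assume full confidence
        elif len(fields) == 3:
            source, target, raw_score = fields
            confidence = float(raw_score)
        else:
            raise ValueError(f"unexpected number of columns: {line!r}")
        if confidence >= threshold:
            mapping[source] = target
    return mapping
```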

@goodmami
Collaborator Author

goodmami commented Oct 7, 2024

@jmccrae and @fcbond, a question for you at the end.

For WNDB to WN-LMF conversion, I use the index.sense file to get the sense keys (cntlist has sense keys, too, but only those with nonzero counts). The index.sense file also includes tag_cnt, so it obviates the need to look at cntlist for counts. At least, that's what I had assumed.

But since WordNet-1.5 does not include index.sense, I need to recreate the sense keys from scratch. It took me a few evenings, but I got a script that rebuilds index.sense with zero diffs when testing against WordNet-1.7, 1.7.1, 2.0, and 2.1.

The (hopefully) last remaining issue relates to the presence of the adjposition markers (a) or (p) (for some reason, I haven't seen (ip)) in the head word of a satellite adjective sense key. For example:

upward%5:00:00:ascending(a):00

It never seems to happen in the primary lemma of a sense key, only the head word for satellite adjectives. They can appear in the cntlist file:

$ grep -c '([ap])' WordNet-*/dict/cntlist
WordNet-1.5/DICT/CNTLIST:93
WordNet-1.6/dict/cntlist:131
WordNet-1.7.1/dict/cntlist:0
WordNet-1.7/dict/cntlist:0
WordNet-2.0/dict/cntlist:3
WordNet-2.1/dict/cntlist:0
WordNet-3.0/dict/cntlist:130
WordNet-3.1/dict/cntlist:130

... or the index.sense file (only for WN-1.6):

$ grep -c '([ap])' WordNet-*/dict/index.sense
WordNet-1.6/dict/index.sense:374
WordNet-1.7.1/dict/index.sense:0
WordNet-1.7/dict/index.sense:0
WordNet-2.0/dict/index.sense:0
WordNet-2.1/dict/index.sense:0
WordNet-3.0/dict/index.sense:0
WordNet-3.1/dict/index.sense:0

These differences seem to be the source of incorrect counts in the original index.sense files. The sense key only%5:00:00:single(a):00 has a tag_cnt of 118 in WN-1.6's cntlist file (first field), but 0 in index.sense (last field):

$ grep "only%5:00:00:single" WordNet-1.6/dict/{cntlist,index.sense}
WordNet-1.6/dict/cntlist:118 only%5:00:00:single(a):00 1
WordNet-1.6/dict/index.sense:only%5:00:00:single(a):00 02111616 1 0

WN-1.7 does not have the adjposition markers in the cntlist or index.sense file, and it gets the full count in both:

$ grep "only%5:00:00:single" WordNet-1.7/dict/{cntlist,index.sense}
WordNet-1.7/dict/cntlist:118 only%5:00:00:single:00 1
WordNet-1.7/dict/index.sense:only%5:00:00:single:00 02148121 1 118

WN-3.0 has the markers in cntlist but not index.sense, and again the counts are wrong:

$ grep "only%5:00:00:single" WordNet-3.0/dict/{cntlist,index.sense}
WordNet-3.0/dict/cntlist:118 only%5:00:00:single(a):00 1
WordNet-3.0/dict/index.sense:only%5:00:00:single:05 02214736 1 0

Since I assume we are faithfully converting the PWN versions to WN-LMF and not fixing bugs, I'm wondering what I should be including in the WN-LMF files. I'm thinking of taking cntlist as the source of truth for counts with a more robust lookup, but should I normalize the sense keys in older versions? That is, for WN 1.5 and 1.6, should we record the sense key as only%5:00:00:single(a):00 or only%5:00:00:single:00?
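
For reference, the normalization being asked about could be sketched as follows. It assumes the sense key layout lemma%ss_type:lex_filenum:lex_id:head_word:head_id and touches only the head word field, since the markers have only been observed there; `strip_adjposition` is a hypothetical helper name.

```python
import re

# Matches a trailing adjposition marker: (a), (p), or (ip).
_MARKER = re.compile(r"\((?:a|p|ip)\)$")


def strip_adjposition(sense_key: str) -> str:
    """Remove an adjposition marker from the head word of a sense key."""
    lemma, sep, lex_sense = sense_key.partition("%")
    fields = lex_sense.split(":")
    if len(fields) != 5:
        return sense_key  # not the expected shape; leave it untouched
    fields[3] = _MARKER.sub("", fields[3])  # fields[3] is the head word
    return lemma + sep + ":".join(fields)
```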

@goodmami goodmami added this to the Release 1.5 milestone Oct 18, 2024
@goodmami
Collaborator Author

@jmccrae and @fcbond, I didn't get any response to my question above, so here is my plan:

  1. Sense keys will be used as they are in index.sense (1.6 through 3.1)
  2. Sense keys for 1.5 will be generated like 1.6 (with adjposition markers)
  3. Counts will be used from cntlist with a flexible search

The following table illustrates what that looks like:

| WordNet | Sense Key                 | Count |
|---------|---------------------------|-------|
| 1.5     | only%5:00:00:single(a):00 | 102   |
| 1.6     | only%5:00:00:single(a):00 | 118   |
| 1.7     | only%5:00:00:single:00    | 118   |
| 1.7.1   | only%5:00:00:single:00    | 0 [1] |
| 2.0     | only%5:00:00:single:00    | 0 [1] |
| 2.1     | only%5:00:00:single:05    | 0 [1] |
| 3.0     | only%5:00:00:single:05    | 118   |
| 3.1     | only%5:00:00:single:05    | 118   |

The flexible search for counts means that the counts in my rebuilt index.sense file will differ from the existing index.sense files for WordNets 1.6, 3.0, and 3.1.
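
The thread does not spell out how the flexible search works, so the following is only one possible reading: try the exact sense key first, then fall back to a comparison that ignores the adjposition marker and the head_id. Ignoring the head_id is an assumption made here because the 3.0 example differs in that field between cntlist (`:00`) and index.sense (`:05`); it could conflate satellites that share a head word. The cntlist line layout (count, sense key, sense number) follows the grep output shown earlier; `loose_key`, `load_counts`, and `lookup_count` are hypothetical names.

```python
import re
from typing import Dict, Iterable, Tuple

_MARKER = re.compile(r"\((?:a|p|ip)\)$")


def loose_key(sense_key: str) -> str:
    """Sense key with adjposition marker and head_id blanked (assumption)."""
    lemma, sep, lex_sense = sense_key.partition("%")
    fields = lex_sense.split(":")
    if len(fields) != 5:
        return sense_key
    fields[3] = _MARKER.sub("", fields[3])  # drop (a)/(p)/(ip) from head word
    fields[4] = ""                          # ignore head_id when comparing
    return lemma + sep + ":".join(fields)


def load_counts(lines: Iterable[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Index cntlist lines both by exact key and by loose key."""
    exact: Dict[str, int] = {}
    loose: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "count key sense_number"
        count, key, _sense_number = parts
        exact[key] = int(count)
        loose.setdefault(loose_key(key), int(count))
    return exact, loose


def lookup_count(key: str, exact: Dict[str, int], loose: Dict[str, int]) -> int:
    """Exact match first, then the loose match; 0 if the key is absent."""
    if key in exact:
        return exact[key]
    return loose.get(loose_key(key), 0)
```

With the WN-3.0 cntlist line from above, looking up `only%5:00:00:single:05` would resolve to the count recorded under `only%5:00:00:single(a):00`.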

Footnotes

  1. This sense key, with or without adjposition markers, was missing from the cntlist file entirely.

@jmccrae

jmccrae commented Oct 24, 2024

I agree we should stick primarily with index.sense.

In SemCor the sense is annotated as only%5:00:00:single(a):00, and this is aligned to WN 1.6.

Filename: brown2/tagfiles/br-h13.xml

<wf cmd="ignore" pos="DT">the</wf>
<wf cmd="done" pos="JJ" lemma="only" wnsn="0" lexsn="5:00:00:single(a):00">only</wf>
<wf cmd="done" pos="JJ" lemma="effective" wnsn="1" lexsn="3:00:00::">effective</wf>
<wf cmd="done" pos="NN" lemma="method" wnsn="1" lexsn="1:09:00::">method</wf>
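
To compare these annotations against sense keys, the key can be reassembled from the `lemma` and `lexsn` attributes as lemma%lexsn. This is a small sketch that parses a single `<wf>` element of the form shown above; `sense_key_from_wf` is a hypothetical helper, and it assumes `lexsn` holds exactly one sense.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def sense_key_from_wf(wf_xml: str) -> Optional[str]:
    """Build lemma%lexsn from one <wf> element, or None if untagged."""
    elem = ET.fromstring(wf_xml)
    lemma = elem.get("lemma")
    lexsn = elem.get("lexsn")
    if lemma is None or lexsn is None:
        return None  # e.g. cmd="ignore" tokens carry no sense annotation
    return f"{lemma}%{lexsn}"
```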

@goodmami
Collaborator Author

@jmccrae thank you, that's the kind of confirmation I needed.
