wndb2lmf.py script does not handle pre-3.0 versions #38

goodmami · 2024-10-01T05:24:14Z

For context: goodmami/wn#199

There are two main issues:

1. The CILI mapping files under older-wn-mappings/ in https://github.com/globalwordnet/cili/ have 3 columns instead of 2.
2. WordNet versions prior to 2.1 do not include verb.Framestext, so we'd need to hard-code the frames into the script.


goodmami · 2024-10-01T05:48:30Z

A third and fourth issue for PWN 1.5:

3. The Unix database files are not available, only a Windows .zip and an old Mac .bin file. In the Windows archive, the filenames are all-caps and slightly different (e.g., DICT/NOUN.DAT instead of dict/data.noun), so we'd need some special file-loading logic. Luckily, there don't seem to be any strange encoding or line-ending problems.
4. The data files contain more than the license text and the data lines: between these two blocks there is also a list of the files used to create the database. The code for reading data files needs to account for this.
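
A minimal sketch of how both points might be handled. The function names are hypothetical and not taken from wndb2lmf.py. It assumes that only data lines begin with an 8-digit synset offset followed by a space, so license lines and the file list can be skipped with one pattern, and it only knows the one alternative filename pattern mentioned above (DICT/NOUN.DAT).

```python
import re
from pathlib import Path
from typing import Iterable, Iterator

# Assumption: every data line starts with an 8-digit synset offset and a
# space; license lines and the WordNet 1.5 file list do not.
_DATA_LINE = re.compile(r"^\d{8} ")


def find_data_file(root: Path, pos: str) -> Path:
    """Locate the data file for *pos* ('noun', 'verb', 'adj', 'adv')."""
    candidates = [
        root / "dict" / f"data.{pos}",         # Unix layout
        root / "DICT" / f"{pos.upper()}.DAT",  # WordNet 1.5 Windows layout
    ]
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(f"no data file for {pos!r} under {root}")


def iter_data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the data lines, without their line endings."""
    for line in lines:
        if _DATA_LINE.match(line):
            yield line.rstrip("\r\n")
```

Filtering by pattern rather than by counting header lines avoids having to know how long the license block or the file list is in each version.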

goodmami · 2024-10-01T16:33:45Z

The 3rd column of the older cili mappings is a confidence score. We should probably parse that and use it for thresholding out low-quality mappings.
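
As a rough sketch of that thresholding, assuming the mapping files are tab-separated, that the first two columns mean the same thing in both formats, and that the third column parses as a float. The names `Mapping`, `read_mappings`, and `min_confidence` are hypothetical, and treating two-column rows as fully confident is a choice made here for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Mapping(NamedTuple):
    ili: str
    synset: str
    confidence: float


def read_mappings(
    lines: Iterable[str], min_confidence: float = 0.0
) -> Iterator[Mapping]:
    """Yield mappings whose confidence is at least *min_confidence*."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == 2:
            # Two-column format: no score, so assume full confidence.
            ili, synset = fields
            confidence = 1.0
        elif len(fields) == 3:
            # Three-column format: the last column is the confidence score.
            ili, synset = fields[:2]
            confidence = float(fields[2])
        else:
            continue  # blank or malformed line
        if confidence >= min_confidence:
            yield Mapping(ili, synset, confidence)
```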

goodmami · 2024-10-07T07:58:31Z

@jmccrae and @fcbond, a question for you at the end.

For WNDB to WN-LMF conversion, I use the index.sense file to get the sense keys (cntlist has sense keys, too, but only those with nonzero counts). The index.sense file also includes tag_cnt, so it obviates the need to look at cntlist for counts. At least, that's what I had assumed.

But since WordNet-1.5 does not include index.sense, I need to recreate the sense keys from scratch. It took me a few evenings, but I got a script that rebuilds index.sense with zero diffs when testing against WordNet-1.7, 1.7.1, 2.0, and 2.1.
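
For reference, the final formatting step of a sense key can be sketched as below. This is only the easy part: it assumes the caller has already derived every field (in particular lex_id and the satellite's head word and head id) from the data files, which is where the real work lies. The function name is hypothetical, and the layout is lemma%ss_type:lex_filenum:lex_id:head_word:head_id.

```python
from typing import Optional


def make_sense_key(
    lemma: str,
    ss_type: int,
    lex_filenum: int,
    lex_id: int,
    head_word: str = "",
    head_id: Optional[int] = None,
) -> str:
    """Format the fields of a sense key; numeric fields use two digits.

    head_word and head_id are only filled for satellite adjectives
    (ss_type 5); otherwise both are left empty.
    """
    head = f"{head_id:02d}" if head_id is not None else ""
    return f"{lemma}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}:{head_word}:{head}"


print(make_sense_key("only", 5, 0, 0, "single(a)", 0))
print(make_sense_key("effective", 3, 0, 0))
```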

The (hopefully) last remaining issue relates to the presence of the adjposition markers (a) or (p) (for some reason, I haven't seen (ip)) in the head word of a satellite adjective sense key. For example:

upward%5:00:00:ascending(a):00
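
To make the comparison between versions concrete, here is a small sketch that removes such a marker from the head word only. The helper name is hypothetical, and it also accepts (ip) even though that marker has not been observed in sense keys.

```python
import re

# An adjposition marker at the very end of the text before the final colon,
# i.e. at the end of the head_word field.
_MARKER_AT_END = re.compile(r"\((?:a|p|ip)\)$")


def strip_adjposition(sense_key: str) -> str:
    """Return *sense_key* without an adjposition marker on its head word."""
    # Split off head_id; what remains ends with the head_word field.
    prefix, head_id = sense_key.rsplit(":", 1)
    return f"{_MARKER_AT_END.sub('', prefix)}:{head_id}"


print(strip_adjposition("upward%5:00:00:ascending(a):00"))
```

Keys without a head word, such as effective%3:00:00::, pass through unchanged because the text before the final colon ends in a colon rather than a marker.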

It never seems to happen in the primary lemma of a sense key, only the head word for satellite adjectives. They can appear in the cntlist file:

$ grep -c '([ap])' WordNet-*/dict/cntlist
WordNet-1.5/DICT/CNTLIST:93
WordNet-1.6/dict/cntlist:131
WordNet-1.7.1/dict/cntlist:0
WordNet-1.7/dict/cntlist:0
WordNet-2.0/dict/cntlist:3
WordNet-2.1/dict/cntlist:0
WordNet-3.0/dict/cntlist:130
WordNet-3.1/dict/cntlist:130

... or the index.sense file (only for WN-1.6):

$ grep -c '([ap])' WordNet-*/dict/index.sense
WordNet-1.6/dict/index.sense:374
WordNet-1.7.1/dict/index.sense:0
WordNet-1.7/dict/index.sense:0
WordNet-2.0/dict/index.sense:0
WordNet-2.1/dict/index.sense:0
WordNet-3.0/dict/index.sense:0
WordNet-3.1/dict/index.sense:0

These differences seem to be the source of incorrect counts in the original index.sense files. The sense key only%5:00:00:single(a):00 has a tag_cnt of 118 in WN-1.6's cntlist file (first field), but 0 in index.sense (last field):

$ grep "only%5:00:00:single" WordNet-1.6/dict/{cntlist,index.sense}
WordNet-1.6/dict/cntlist:118 only%5:00:00:single(a):00 1
WordNet-1.6/dict/index.sense:only%5:00:00:single(a):00 02111616 1 0

WN-1.7 does not have the adjposition markers in the cntlist or index.sense file, and it gets the full count in both:

$ grep "only%5:00:00:single" WordNet-1.7/dict/{cntlist,index.sense}
WordNet-1.7/dict/cntlist:118 only%5:00:00:single:00 1
WordNet-1.7/dict/index.sense:only%5:00:00:single:00 02148121 1 118

WN-3.0 has the markers in cntlist but not index.sense, and again the counts are wrong:

$ grep "only%5:00:00:single" WordNet-3.0/dict/{cntlist,index.sense}
WordNet-3.0/dict/cntlist:118 only%5:00:00:single(a):00 1
WordNet-3.0/dict/index.sense:only%5:00:00:single:05 02214736 1 0

Since I assume we are faithfully converting the PWN versions to WN-LMF and not fixing bugs, I'm wondering what I should be including in the WN-LMF files. I'm thinking of taking cntlist as the source of truth for counts with a more robust lookup, but should I normalize the sense keys in older versions? That is, for WN 1.5 and 1.6, should we record the sense key as only%5:00:00:single(a):00 or only%5:00:00:single:00?

goodmami · 2024-10-23T04:57:32Z

@jmccrae and @fcbond, I didn't get any response to my question above, so here is my plan:

- Sense keys will be used as they are in index.sense (1.6 through 3.1)
- Sense keys for 1.5 will be generated like 1.6 (with adjposition markers)
- Counts will be used from cntlist with a flexible search

The following table illustrates what that looks like:

| WordNet | Sense Key | Count |
| --- | --- | --- |
| 1.5 | `only%5:00:00:single(a):00` | 102 |
| 1.6 | `only%5:00:00:single(a):00` | 118 |
| 1.7 | `only%5:00:00:single:00` | 118 |
| 1.7.1 | `only%5:00:00:single:00` | 0¹ |
| 2.0 | `only%5:00:00:single:00` | 0¹ |
| 2.1 | `only%5:00:00:single:05` | 0¹ |
| 3.0 | `only%5:00:00:single:05` | 118 |
| 3.1 | `only%5:00:00:single:05` | 118 |

The flexible search for counts means that the counts in my rebuilt index.sense file will differ from the existing index.sense files for WordNets 1.6, 3.0, and 3.1.
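
One way such a flexible search could work is sketched below. This is an illustration, not necessarily what the script does: it assumes cntlist lines have the form `count sense_key sense_number`, and it falls back to a comparison that ignores both the adjposition marker and the head_id, since the examples above differ in both (single(a):00 versus single:05). All names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

_MARKER_AT_END = re.compile(r"\((?:a|p|ip)\)$")


def loose_form(sense_key: str) -> str:
    """Drop the head_id and any adjposition marker on the head word."""
    body, _head_id = sense_key.rsplit(":", 1)
    return _MARKER_AT_END.sub("", body)


def load_counts(
    lines: Iterable[str],
) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Index cntlist lines by exact sense key and by loose form."""
    exact: Dict[str, int] = {}
    loose: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip license text and anything unexpected
        count, key = int(fields[0]), fields[1]
        exact[key] = count
        loose[loose_form(key)].append(count)
    return exact, loose


def lookup_count(
    key: str, exact: Dict[str, int], loose: Dict[str, List[int]]
) -> int:
    """Prefer an exact match; otherwise accept a unique loose match."""
    if key in exact:
        return exact[key]
    candidates = loose.get(loose_form(key), [])
    return candidates[0] if len(candidates) == 1 else 0


exact, loose = load_counts(["118 only%5:00:00:single(a):00 1"])
print(lookup_count("only%5:00:00:single:05", exact, loose))
```

Returning 0 when several cntlist entries share the same loose form is a conservative choice; ignoring head_id could otherwise attribute a count to the wrong sense.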

¹ This sense key, with or without adjposition markers, was missing from the cntlist file entirely.

jmccrae · 2024-10-24T10:09:06Z

I agree we should stick with index.sense primarily.

In SemCor the sense is annotated as only%5:00:00:single(a):00, and this is aligned to WN 1.6.

Filename: brown2/tagfiles/br-h13.xml

<wf cmd="ignore" pos="DT">the</wf>
<wf cmd="done" pos="JJ" lemma="only" wnsn="0" lexsn="5:00:00:single(a):00">only</wf>
<wf cmd="done" pos="JJ" lemma="effective" wnsn="1" lexsn="3:00:00::">effective</wf>
<wf cmd="done" pos="NN" lemma="method" wnsn="1" lexsn="1:09:00::">method</wf>

goodmami · 2024-10-25T04:07:55Z

@jmccrae thank you, that's the kind of confirmation I needed.

goodmami added the bug label Oct 1, 2024

goodmami mentioned this issue Oct 1, 2024

Support for old WordNet Versions < 3.0 goodmami/wn#199

Open

goodmami added this to the Release 1.5 milestone Oct 18, 2024

This was referenced Oct 28, 2024

Update wndb2lmf to build Pre-3.0 WordNets #42

Draft

Make this work with the newest version of wn (0.9.5) #41

Open
