Support for old WordNet Versions < 3.0 #199

jmccrae · 2024-09-24T10:51:44Z

Is your feature request related to a problem? Please describe.
Many datasets use much older releases of WordNet, and it would be good to be able to work with them in this modern library.

Describe the solution you'd like
Incorporate all the previous versions listed here:

https://wordnet.princeton.edu/download/old-versions

Additional context
I can generate WNLMF files for them using my API, so I can send them on to you to include.

goodmami · 2024-09-24T17:00:16Z

Thanks, @jmccrae, this seems like a good idea. I didn't know anyone still used pre-3.0 versions.

Some concerns:

- Would you host the WNLMF files yourself? Another option is through the OMW since it already distributes WNLMF-formatted PWN-3.0 (omw-en) and PWN-3.1 (omw-en31).
- If you distribute it, what would you call it? I believe Christiane was saying that only the exact version distributed by Princeton should be called the Princeton WordNet (see the comment on "PWN 3.0 and 3.1", omwn/omw-data#5).
- If OMW distributes it, we already have a script for converting WNDB to WNLMF (see wndb2lmf.py at https://github.com/omwn/omw-data/tree/main/scripts), but maybe your API is better? At least, I see a note that the grammar I use for parsing examples out of glosses may not work on non-3.0 versions (depending on the glossing convention used).

goodmami · 2024-09-26T00:34:02Z

Following up... with the wndb2lmf.py script of omw-data I was able to produce a WNLMF file for WordNet 2.1 with minimal effort, but for 2.0 and previous the lack of syntactic behavior data (verb.Framestext) causes some errors that require a bit more work. In theory this shouldn't be an issue, but I think there are some assumptions baked into the script that should be relaxed.

jmccrae · 2024-09-26T10:43:20Z

We could make these releases from the Open English Wordnet project. We have done this previously:

https://github.com/globalwordnet/english-wordnet/releases/tag/3.1

Then they could be OEWN 2.0 etc, but also happy to see it come from OMW.

I can check if my script works better... it has some issues with ILI numbers but I think it is easy to fix.

goodmami · 2024-09-27T04:09:28Z

Oh I didn't realize you published that one. First, I noticed that only the sources are published and I had to run the scripts/merge.py command to produce the single XML file. If these are going to be indexed by Wn, the WNLMF XML file needs to be hosted. I also want to get @fcbond's opinion about where to host.

We can compare the english-wordnet 3.1 (ewn31) to the OMW English Wordnet 3.1 (omw-en31).

| | ewn31 | omw-en31 |
| --- | --- | --- |
| WN-LMF version | 1.0 | 1.1 |
| `wc -l` | 1674110 | 1574295 |
| `grep -c "<LexicalEntry"` | 159015 | 156762 |
| `grep -c "<Sense "` | 207272 | 207272 |
| `grep -c "<Synset"` | 403459 | 403459 |
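
The grep figures above count matching lines, and the pattern `<Synset` also matches `<SynsetRelation` lines, so an element-level count can give different numbers. A minimal sketch of such a count using only the Python standard library follows. The `count_elements` helper and the miniature sample document are assumptions for illustration, not part of either project's tooling.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source):
    """Count elements by tag name in an XML file (path or binary file object)."""
    counts = Counter()
    # iterparse reads the document incrementally and reports each closed element
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's attributes and children once counted
    return counts


# Invented miniature document, modeled on the entries quoted in this thread.
SAMPLE = b"""<LexicalResource xmlns:dc="http://purl.org/dc/elements/1.1/">
  <Lexicon id="demo">
    <LexicalEntry id="demo-Aurora-n">
      <Lemma writtenForm="Aurora" partOfSpeech="n" />
      <Sense id="demo-Aurora-09595291-n" synset="demo-09595291-n" dc:identifier="aurora%1:18:00::" />
    </LexicalEntry>
    <Synset id="demo-09595291-n" partOfSpeech="n">
      <SynsetRelation relType="hypernym" target="demo-00000000-n" />
    </Synset>
  </Lexicon>
</LexicalResource>"""

counts = count_elements(io.BytesIO(SAMPLE))
for tag in ("LexicalEntry", "Sense", "Synset", "SynsetRelation"):
    print(tag, counts[tag])
```

Passing a file path instead of the `io.BytesIO` object should work the same way for a full WN-LMF file.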

The difference in lexical entries is a bit worrying. I notice that omw-en31 does not use s as a part of speech for lexical entries, only synsets, whereas ewn31 uses s on both, and this causes different numbers of entries. For example, ewn31 has separate entries for ablative for the s and a parts of speech, where omw-en31 combines them into one entry.
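
To make the counting difference concrete, here is a small sketch, assuming entries are keyed by (lemma, part of speech). The `count_entries` helper and the input pairs are invented for illustration; neither conversion script is claimed to work this way.

```python
def count_entries(lemma_pos_pairs, merge_satellites):
    """Count distinct lexical entries keyed by (lemma, part of speech)."""
    keys = set()
    for lemma, pos in lemma_pos_pairs:
        if merge_satellites and pos == "s":
            pos = "a"  # treat satellite adjectives as plain adjectives
        keys.add((lemma, pos))
    return len(keys)


# Invented input: "ablative" appears once as an adjective and once as a satellite
pairs = [("ablative", "a"), ("ablative", "s"), ("Aurora", "n")]

separate = count_entries(pairs, merge_satellites=False)  # s and a kept apart
merged = count_entries(pairs, merge_satellites=True)     # s folded into a
print(separate, merged)
```

With `s` kept as its own part of speech, "ablative" yields two entries; folding `s` into `a` yields one, which is the direction of the difference seen between the two files.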

There are also differences in how the respective files escape characters in IDs (-apos- vs -ap-, for example).

Here's a sample entry from each:

ewn31

    <LexicalEntry id="ewn-Aurora-n">
      <Lemma writtenForm="Aurora" partOfSpeech="n" />
      <Sense id="ewn-Aurora-n-09595291-01" synset="ewn-09595291-n" dc:identifier="aurora%1:18:00::" />
    </LexicalEntry>

omw-en31

    <LexicalEntry id="omw-en31-Aurora-n">
      <Lemma writtenForm="Aurora" partOfSpeech="n" />
      <Form writtenForm="aurorae" />
      <Sense id="omw-en31-Aurora-09595291-n" synset="omw-en31-09595291-n" dc:identifier="aurora%1:18:00::" />
    </LexicalEntry>

Diffs:

- The IDs of ewn31 are shorter
- omw-en31 includes alternative forms from the exceptions lists (unmatching capitalization; note that there is a second entry for "aurora", so these may be spurious)

Here's a sample synset for each:

ewn31

    <Synset id="ewn-00751800-v" ili="i25447" partOfSpeech="v" dc:subject="verb.communication">
      <Definition>indicate the right path or direction</Definition>
      <SynsetRelation relType="hypernym" target="ewn-00751382-v" /> 
      <Example>"The sign pointed the way to London"</Example>
    </Synset>

omw-en31

    <Synset id="omw-en31-00751800-v" ili="i25447" partOfSpeech="v" members="omw-en31-point_the_way-00751800-v" lexfile="verb.communication" dc:identifier="point_the_way.v.01">
      <Definition>indicate the right path or direction</Definition>
      <SynsetRelation target="omw-en31-00751382-v" relType="hypernym" />
      <Example>The sign pointed the way to London</Example>
    </Synset>

Diffs:

- WN-LMF 1.1 allows the members list in omw-en31
- omw-en31 also includes the dc:identifier, which matches NLTK's synset IDs
- omw-en31 removes quotes around examples

jmccrae · 2024-09-27T08:41:45Z

I am pretty sure the difference in entry count is that we counted 's' adjectives as different lexical entries. Later versions of OEWN have merged these.

One big difference is that we use dc:identifier for the sense keys. This is actually the motivation for this issue, as the older resources use these sense keys (I was looking at Rada Mihalcea's original SemLink dataset). Unlike the NLTK identifiers, they cannot be reliably inferred from the data, so it is important to include them explicitly.
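
For readers less familiar with sense keys: in the WordNet documentation a key has the form `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, where the last two fields are only filled for satellite adjectives. A minimal sketch that splits the key from the Aurora example above; `parse_sense_key` is a hypothetical helper, and it assumes the lemma contains no `%` and the fields contain no extra `:`.

```python
# Numeric ss_type codes used in sense keys
SS_TYPES = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}


def parse_sense_key(sense_key):
    """Split a sense key into its named fields."""
    lemma, _, rest = sense_key.partition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "ss_type": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # head_word and head_id are empty unless the sense is a satellite
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }


parsed = parse_sense_key("aurora%1:18:00::")
print(parsed)
```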

arademaker · 2024-09-27T15:41:55Z

> I am pretty sure the difference in entry count is that we counted 's' adjectives as different lexical entries

Same problem over and over again… amazing! I don’t like the s vs. a decision in PWN, but it is better to follow it.

goodmami · 2024-09-27T23:58:51Z

> I am pretty sure the difference in entry count is that we counted 's' adjectives as different lexical entries.

I think you're right. The fact that there are the same number of senses in both is reassuring.

> same problem over and over again…

@arademaker The PWN data is the PWN data; nothing has changed. It's just that @jmccrae and I processed it differently when converting WNDB to WN-LMF.

Note that omw-en uses s and a for the pos of synsets (where "pos" is arguably a misnomer), since it comes from the ss_type field in the data.adj file, but only a for the pos of lexical entries, since the pos field is in the index.adj file, where they are all a. Compare:

data.adj (both entries use s)

    00003552 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02631097 v 0102 + 00051513 n 0101 | coming into existence; "an emergent republic"
    [...]
    01147340 00 s 01 emergent 0 002 & 01146764 a 0000 + 07432005 n 0101 | occurring unexpectedly and requiring urgent action; "emergent repair of an aorta"

index.adj (only a)

    emergent a 2 2 & + 2 0 01147340 00003552

At least, that's how I understood the WNDB documentation.
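
The two fields can be read off the quoted lines with a short sketch like the one below. It follows the field order given in the WNDB documentation (`synset_offset lex_filenum ss_type w_cnt word lex_id ...` for data files, `lemma pos synset_cnt p_cnt ...` for index files). The function names are hypothetical and this is not the code used in wndb2lmf.py.

```python
def parse_data_line(line):
    """Extract the offset, ss_type, and member words from one data.* line."""
    fields = line.split("|", 1)[0].split()  # drop the gloss after the bar
    w_cnt = int(fields[3], 16)  # word count is a two-digit hexadecimal number
    words = fields[4 : 4 + 2 * w_cnt : 2]  # words alternate with their lex_ids
    return {"offset": fields[0], "ss_type": fields[2], "words": words}


def parse_index_line(line):
    """Extract the lemma, pos, and synset offsets from one index.* line."""
    fields = line.split()
    p_cnt = int(fields[3])  # number of pointer symbols that follow
    rest = fields[4 + p_cnt :]  # sense_cnt, tagsense_cnt, then the offsets
    return {"lemma": fields[0], "pos": fields[1], "offsets": rest[2:]}


DATA = ('00003552 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 '
        '+ 02631097 v 0102 + 00051513 n 0101 | coming into existence; '
        '"an emergent republic"')
INDEX = "emergent a 2 2 & + 2 0 01147340 00003552"

data = parse_data_line(DATA)
index = parse_index_line(INDEX)
print(data["ss_type"], index["pos"])
```

The synset line reports `s` while the index line for the same lemma reports `a`, which matches the description above.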

One big difference is that we use dc:identifier for the sense keys.

Both files use dc:identifier for the sense keys on senses. My comment about NLTK identifiers is for synsets (you may have to use the horizontal scroll bar above to see the end of the omw-en31 synset).

goodmami · 2024-10-01T05:49:40Z

Another follow-up... At first I thought that pre-2.1 versions did not include syntactic frames in the data files, but on closer inspection they do; it's just that the frame descriptions are implied and not defined in a verb.Framestext file, which I had been using to build a mapping of frame numbers to frame text. Without the file I get lookup errors. Since the frames do not seem to change across versions, I just hard-coded them into the script and I was able to load 2.0, 1.7.1, 1.7, and 1.6. Version 1.5 had other issues.
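
A rough sketch of the fallback described above, not the actual change to wndb2lmf.py: read the frame table from verb.Framestext when the file exists, otherwise use a hard-coded table. The `load_frames` helper is hypothetical, the line format (frame number, whitespace, frame text) is an assumption, and the fallback table is abbreviated to three sample entries that should be checked against the WordNet documentation.

```python
import os
import tempfile

# Abbreviated sample table; a real script would list every frame.
FALLBACK_FRAMES = {
    1: "Something ----s",
    2: "Somebody ----s",
    8: "Somebody ----s something",
}


def load_frames(path, fallback=FALLBACK_FRAMES):
    """Map frame numbers to frame text, using `fallback` if the file is absent."""
    if not os.path.exists(path):
        return dict(fallback)
    frames = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(None, 1)  # assumed format: "<number> <text>"
            if len(parts) == 2 and parts[0].isdigit():
                frames[int(parts[0])] = parts[1].strip()
    return frames


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "verb.Framestext")
    with open(present, "w", encoding="utf-8") as f:
        f.write("1 Something ----s\n2 Somebody ----s\n")
    from_file = load_frames(present)
    from_fallback = load_frames(os.path.join(tmp, "missing"))

print(from_file)
print(from_fallback[8])
```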

I created omwn/omw-data#38 for fixing the problems with the omw-data script.

fcbond · 2024-10-01T08:13:56Z

Hi, thanks Michael. I agree that it would be good to host them with omw-data. Do you think it is worth trying to propagate the ILIs back (for example with sense keys)?

-- Francis Bond <https://fcbond.github.io/>

jmccrae · 2024-10-01T11:31:12Z

There are some mappings for old versions to ILIs here:

https://github.com/globalwordnet/cili/tree/master/older-wn-mappings

But they were automatically constructed, so we may only want to take the high-confidence ones.

goodmami · 2024-10-01T16:38:05Z

I used the older mappings when converting to WN-LMF, but I ignored the confidence score. I'll use it for filtering and make the confidence threshold an option.
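
As an illustration of the filtering idea only: the sketch below assumes each mapping row carries a synset key, an ILI identifier, and a numeric confidence. The row layout, the sample values, the `filter_mappings` helper, and the `--min-confidence` flag name are all hypothetical; the real mapping files and script options may differ.

```python
import argparse


def filter_mappings(rows, threshold=0.0):
    """Keep mappings whose confidence is at least `threshold`."""
    return {synset: ili for synset, ili, confidence in rows if confidence >= threshold}


# Invented rows: (synset key, ILI identifier, confidence)
rows = [
    ("00001740-n", "i35545", 1.0),
    ("00002137-n", "i35546", 0.4),
]

parser = argparse.ArgumentParser()
parser.add_argument("--min-confidence", type=float, default=0.0)
# Arguments are passed explicitly here so the example runs without a command line
args = parser.parse_args(["--min-confidence", "0.9"])

kept = filter_mappings(rows, args.min_confidence)
print(kept)
```

Leaving the default at 0.0 keeps every mapping, so the earlier behavior of ignoring the score would remain available.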

jmccrae added the enhancement New feature or request label Sep 24, 2024

This was referenced Oct 1, 2024

Create a new release with some improvements (1.5) omwn/omw-data#31

Open

wndb2lmf.py script does not handle pre-3.0 versions omwn/omw-data#38

Open

goodmami mentioned this issue Oct 28, 2024

Misapplied exceptional forms? omwn/omw-data#43

Open
