Support for old WordNet Versions < 3.0 #199

Open
jmccrae opened this issue Sep 24, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@jmccrae

jmccrae commented Sep 24, 2024

Is your feature request related to a problem? Please describe.
A lot of datasets use much older releases of WordNet, and it would be good to be able to work with them using this modern library.

Describe the solution you'd like
Incorporate all the previous versions listed here:

https://wordnet.princeton.edu/download/old-versions

Additional context
I can generate WNLMF files for them using my API, so I can send them on to you to include.

@jmccrae jmccrae added the enhancement New feature or request label Sep 24, 2024
@goodmami
Owner

Thanks, @jmccrae, this seems like a good idea. I didn't know anyone still used pre-3.0 versions.

Some concerns:

  1. Would you host the WNLMF files yourself? Another option is through the OMW since it already distributes WNLMF-formatted PWN-3.0 (omw-en) and PWN-3.1 (omw-en31).
  2. If you distribute it, what would you call it? I believe Christiane was saying that only the exact version distributed by Princeton should be called the Princeton WordNet (see the comment on omwn/omw-data#5, "PWN 3.0 and 3.1").
  3. If OMW distributes it, we already have a script for converting WNDB to WNLMF (see wndb2lmf.py at https://github.com/omwn/omw-data/tree/main/scripts), but maybe your API is better? At least, I see a note that the grammar I use for parsing examples out of glosses may not work on non-3.0 versions (depending on the glossing convention used).
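
To make the gloss concern in point 3 concrete: in the 3.0-style data shown later in this thread, a gloss is a definition followed by double-quoted examples, separated by semicolons. The following is a minimal sketch of splitting such a gloss under that assumption. It is not the grammar used in wndb2lmf.py, and `split_gloss` is a hypothetical helper name.

```python
from __future__ import annotations

import re

# Hypothetical helper, assuming the 3.0-style convention:
# definition text first, then zero or more double-quoted examples,
# all separated by semicolons.
_EXAMPLE = re.compile(r'"([^"]*)"')


def split_gloss(gloss: str) -> tuple[str, list[str]]:
    """Split a gloss into (definition, [examples])."""
    gloss = gloss.strip()
    first_quote = gloss.find('"')
    if first_quote == -1:
        # No quoted material: the whole gloss is the definition.
        return gloss.rstrip("; "), []
    definition = gloss[:first_quote].rstrip("; ")
    examples = _EXAMPLE.findall(gloss[first_quote:])
    return definition, examples
```

A gloss that follows a different convention (for example, unquoted examples) would end up entirely in the definition, which is the kind of silent failure worth checking for on older versions.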

@goodmami
Owner

Following up... with the wndb2lmf.py script of omw-data I was able to produce a WNLMF file for WordNet 2.1 with minimal effort, but for 2.0 and earlier the lack of syntactic behavior data (verb.Framestext) causes some errors that require a bit more work. In theory this shouldn't be an issue, but I think there are some assumptions baked into the script that should be relaxed.

@jmccrae
Author

jmccrae commented Sep 26, 2024

We could make these releases from the Open English Wordnet project. We have done this previously:

https://github.com/globalwordnet/english-wordnet/releases/tag/3.1

Then they could be OEWN 2.0 etc, but also happy to see it come from OMW.

I can check if my script works better... it has some issues with ILI numbers but I think it is easy to fix.

@goodmami
Owner

Oh I didn't realize you published that one. First, I noticed that only the sources are published and I had to run the scripts/merge.py command to produce the single XML file. If these are going to be indexed by Wn, the WNLMF XML file needs to be hosted. I also want to get @fcbond's opinion about where to host.

We can compare the english-wordnet 3.1 (ewn31) to the OMW English Wordnet 3.1 (omw-en31).

| | ewn31 | omw-en31 |
|---|---|---|
| WN-LMF version | 1.0 | 1.1 |
| wc -l | 1674110 | 1574295 |
| grep -c "<LexicalEntry" | 159015 | 156762 |
| grep -c "<Sense " | 207272 | 207272 |
| grep -c "<Synset" | 403459 | 403459 |
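
The figures above are line counts from grep, and a substring pattern such as `<Synset` also matches `<SynsetRelation` lines. As a cross-check, here is a minimal sketch that counts elements by exact tag name using the standard library's streaming parser. `count_elements` is a hypothetical helper, and the file you pass in is whichever WN-LMF XML file you have locally.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_elements(source, names=("LexicalEntry", "Sense", "Synset")):
    """Count elements by exact tag name while streaming through the file.

    `source` is a filename or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in names:
            counts[elem.tag] += 1
        elem.clear()  # drop attributes and children we no longer need
    return counts
```

Because the comparison is on the full tag name, `SynsetRelation` elements are not counted as `Synset`.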

The difference in lexical entries is a bit worrying. I notice that omw-en31 uses s as a part of speech only for synsets, not for lexical entries, whereas ewn31 uses s on both, and this causes different numbers of entries. For example, ewn31 has separate entries for ablative for the s and a parts of speech, whereas omw-en31 combines them into one entry.
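
A minimal sketch of how that choice changes the entry count, assuming entries are keyed by (lemma, part of speech). The `entry_keys` function and the toy data are illustrative only and are not taken from either conversion script.

```python
def entry_keys(pairs, merge_satellites):
    """Return the set of (lemma, pos) keys that would become lexical entries."""
    keys = set()
    for lemma, pos in pairs:
        if merge_satellites and pos == "s":
            pos = "a"  # fold satellite adjectives into plain adjectives
        keys.add((lemma, pos))
    return keys


# Toy data: one lemma appearing with both adjective types, plus a noun.
pairs = [("ablative", "a"), ("ablative", "s"), ("Aurora", "n")]
print(len(entry_keys(pairs, merge_satellites=False)))  # 3
print(len(entry_keys(pairs, merge_satellites=True)))   # 2
```

The number of senses is unaffected by this choice, since each sense still attaches to exactly one entry either way.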

There are also differences in how the respective files escape characters in IDs (-apos- vs -ap-, for example).

Here's a sample entry from each:

ewn31

    <LexicalEntry id="ewn-Aurora-n">
      <Lemma writtenForm="Aurora" partOfSpeech="n" />
      <Sense id="ewn-Aurora-n-09595291-01" synset="ewn-09595291-n" dc:identifier="aurora%1:18:00::" />
    </LexicalEntry>

omw-en31

    <LexicalEntry id="omw-en31-Aurora-n">
      <Lemma writtenForm="Aurora" partOfSpeech="n" />
      <Form writtenForm="aurorae" />
      <Sense id="omw-en31-Aurora-09595291-n" synset="omw-en31-09595291-n" dc:identifier="aurora%1:18:00::" />
    </LexicalEntry>

Diffs:

  • The IDs of ewn31 are shorter
  • omw-en31 includes alternative forms from the exceptions lists (with mismatched capitalization; note that there is a second entry for "aurora", so these may be spurious)

Here's a sample synset for each:

ewn31

    <Synset id="ewn-00751800-v" ili="i25447" partOfSpeech="v" dc:subject="verb.communication">
      <Definition>indicate the right path or direction</Definition>
      <SynsetRelation relType="hypernym" target="ewn-00751382-v" /> 
      <Example>"The sign pointed the way to London"</Example>
    </Synset>

omw-en31

    <Synset id="omw-en31-00751800-v" ili="i25447" partOfSpeech="v" members="omw-en31-point_the_way-00751800-v" lexfile="verb.communication" dc:identifier="point_the_way.v.01">
      <Definition>indicate the right path or direction</Definition>
      <SynsetRelation target="omw-en31-00751382-v" relType="hypernym" />
      <Example>The sign pointed the way to London</Example>
    </Synset>

Diffs:

  • WN-LMF 1.1 allows the members list in omw-en31
  • omw-en31 also includes the dc:identifier, which matches NLTK's synset IDs
  • omw-en31 removes quotes around examples

@jmccrae
Author

jmccrae commented Sep 27, 2024

I am pretty sure the difference in entry count is that we counted 's' adjectives as different lexical entries. Later versions of OEWN have merged these.

One big difference is that we use dc:identifier for the sense keys. This is actually the motivation for this issue, as the older resources use these sense keys (I was looking at Rada Mihalcea's original SemLink dataset). Unlike the NLTK identifiers, they cannot be reliably inferred from the data, so it is important to include them explicitly.

@arademaker

> I am pretty sure the difference in entry count is that we counted 's' adjectives as different lexical entries

The same problem over and over again… amazing! I don’t like the s vs. a decision in PWN, but it is better to follow it.

@goodmami
Owner

> I am pretty sure the difference in entry count is that we counted 's' adjectives as different lexical entries.

I think you're right. The fact that there are the same number of senses in both is reassuring.

> same problem over and over again…

@arademaker The PWN data is the PWN data; nothing has changed. It's just that @jmccrae and I processed it differently when converting WNDB to WN-LMF.

Note that omw-en uses s and a for the pos of synsets (where "pos" is arguably a misnomer), since it comes from the ss_type field in the data.adj file, but only a for the pos of lexical entries, since the pos field is in the index.adj file, where they are all a. Compare:

data.adj (both entries use s)

00003552 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02631097 v 0102 + 00051513 n 0101 | coming into existence; "an emergent republic"  
[...]
01147340 00 s 01 emergent 0 002 & 01146764 a 0000 + 07432005 n 0101 | occurring unexpectedly and requiring urgent action; "emergent repair of an aorta"

index.adj (only a)

emergent a 2 2 & + 2 0 01147340 00003552  

At least, that's how I understood the WNDB documentation.
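
To illustrate where the two pos values come from, here is a minimal sketch that pulls the relevant fields out of one data.adj line and one index.adj line. It follows my reading of the WNDB layout (the word count is a two-digit hexadecimal number, and the index line ends with its synset offsets). The function names are hypothetical and the parsing ignores pointers and frames.

```python
def parse_data_line(line):
    """Extract offset, ss_type, words, and gloss from a WNDB data line."""
    fields, _, gloss = line.partition("|")
    parts = fields.split()
    offset, ss_type = parts[0], parts[2]
    w_cnt = int(parts[3], 16)  # word count is hexadecimal
    # Words alternate with their lex_id values, starting at index 4.
    words = [parts[4 + 2 * i] for i in range(w_cnt)]
    return {"offset": offset, "ss_type": ss_type,
            "words": words, "gloss": gloss.strip()}


def parse_index_line(line):
    """Extract lemma, pos, and synset offsets from a WNDB index line."""
    parts = line.split()
    lemma, pos, synset_cnt = parts[0], parts[1], int(parts[2])
    return {"lemma": lemma, "pos": pos, "offsets": parts[-synset_cnt:]}


data = parse_data_line(
    '01147340 00 s 01 emergent 0 002 & 01146764 a 0000 + 07432005 n 0101 '
    '| occurring unexpectedly and requiring urgent action; '
    '"emergent repair of an aorta"'
)
index = parse_index_line("emergent a 2 2 & + 2 0 01147340 00003552")
print(data["ss_type"], index["pos"])  # s a
```

The synset-level value comes from ss_type in the data file, while the entry-level value comes from the index file, which is why the same lemma can be s in one place and a in the other.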

> One big difference is that we use dc:identifier for the sense keys.

Both files use dc:identifier for the sense keys on senses. My comment about NLTK identifiers is for synsets (you may have to use the horizontal scroll bar above to see the end of the omw-en31 synset).

@goodmami
Owner

goodmami commented Oct 1, 2024

Another follow-up... At first I thought that pre-2.1 versions did not include syntactic frames in the data files, but on closer inspection they do; it's just that the frame descriptions are implied and not defined in a verb.Framestext file, which I had been using to build a mapping of frame numbers to frame text. Without the file I get lookup errors. Since the frames do not seem to change across versions, I just hard-coded them into the script and I was able to load 2.0, 1.7.1, 1.7, and 1.6. Version 1.5 had other issues.
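
The fallback described above could look roughly like the following sketch. It is not the code in wndb2lmf.py. `load_frames` is a hypothetical helper, the file is assumed to hold one "number text" pair per line, and the fallback table is a short illustrative excerpt that should be checked against the frame list in the WordNet documentation before use.

```python
from pathlib import Path

# Illustrative excerpt only; a real table would list every frame number.
FALLBACK_FRAMES = {
    1: "Something ----s",
    2: "Somebody ----s",
    8: "Somebody ----s something",
    9: "Somebody ----s somebody",
}


def load_frames(dict_dir, fallback=FALLBACK_FRAMES):
    """Read verb.Framestext if present, else return the hard-coded table."""
    path = Path(dict_dir) / "verb.Framestext"
    if not path.is_file():
        return dict(fallback)
    frames = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        # Assumed layout: frame number, whitespace, frame text.
        number, _, text = line.strip().partition(" ")
        if number.isdigit():
            frames[int(number)] = text.strip()
    return frames
```

Preferring the file when it exists keeps the behavior unchanged for versions that ship it, and the hard-coded table only takes over for the older releases.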

I created omwn/omw-data#38 for fixing the problems with the omw-data script.

@fcbond
Collaborator

fcbond commented Oct 1, 2024 via email

@jmccrae
Author

jmccrae commented Oct 1, 2024

There are some mappings for old versions to ILIs here:

https://github.com/globalwordnet/cili/tree/master/older-wn-mappings

But they were automatically constructed, so we may only want to take the high-confidence ones.

@goodmami
Owner

goodmami commented Oct 1, 2024

I used the older mappings when converting to WN-LMF, but I ignored the confidence score. I'll use it for filtering and make the confidence threshold an option.
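
A minimal sketch of such a filter, assuming each mapping has already been read into an (old synset ID, ILI, confidence) tuple. The actual layout of the files in older-wn-mappings is not described in this thread, so the row shape, the `filter_mappings` name, and the sample values are all hypothetical.

```python
def filter_mappings(rows, min_confidence=0.0):
    """Keep only mappings whose confidence is at least min_confidence.

    `rows` is an iterable of (old_synset_id, ili, confidence) tuples.
    """
    return {
        old_id: ili
        for old_id, ili, confidence in rows
        if confidence >= min_confidence
    }


# Made-up rows for illustration only.
rows = [
    ("00000001-n", "i1", 1.0),
    ("00000002-n", "i2", 0.4),
]
print(filter_mappings(rows, min_confidence=0.9))  # {'00000001-n': 'i1'}
```

With the default threshold of 0.0 every mapping is kept, which matches the earlier behavior of ignoring the score.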
