Addition to NLTK migration guide w.r.t. offsets #183

Open
BramVanroy opened this issue Mar 24, 2023 · 4 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@BramVanroy

Is your feature request related to a problem? Please describe.
Hello

I have access to WordNet synset offset IDs that I retrieve from an API (key: wnSynsetOffset). They look like this: wn:00981304a. It is relatively straightforward to look these up through NLTK:

from nltk.corpus import wordnet as nltk_wn

offset = "wn:00981304a"
offset_id = int(offset.split(":")[-1][:-1])
pos = offset[-1]
syns = nltk_wn.synset_from_pos_and_offset(pos, offset_id)

However, it is not clear to me how I can convert this approach to wn. I like the API of wn more and I would like to make use of the translate feature specifically, so that is why I want to make the transition.

Describe the solution you'd like
Perhaps a description in the documentation? I think that this section is relevant, but it is not clear to me how to apply it to a concrete use case, so a real-world example would be helpful.

Describe alternatives you've considered

I have tried the following manipulations but none of them work (yielding empty synset lists):

wn.synsets("wn:00981304a")
wn.synsets("00981304a")
wn.synsets("981304a")
wn.synsets("981304", pos="a")
@BramVanroy BramVanroy added the enhancement New feature or request label Mar 24, 2023
@fcbond
Collaborator

fcbond commented Mar 24, 2023

Hi,

If you have a wordnet derived from PWN 3.0 with the same offsets, then it can be done as follows:

>>> import wn
>>> ewn = wn.Wordnet('omw-en:1.4')
>>> ewn.synset('omw-en-00981304-s')
Synset('omw-en-00981304-s')

Many people (including omw 1.0) treat all satellite adjectives (pos 's') as adjectives (pos 'a').
wn does not, so if you look up something with pos 'a' and it doesn't work, then it is worth also looking up 's'. So something like the following should get you what you want.

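def offset2synset(wordnet, offset):
    # 'wn:00981304a' -> digits '00981304', pos 'a'
    digits, pos = offset[3:-1], offset[-1]
    # Try the given pos first; for adjectives, fall back to satellite 's'
    candidates = [pos, 's'] if pos == 'a' else [pos]
    for p in candidates:
        try:
            return wordnet.synset(f'omw-en-{digits}-{p}')
        except wn.Error:  # raised when no synset has this ID
            continue
    return None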
>>> print(offset2synset(ewn, 'wn:00981304a'))
Synset('omw-en-00981304-s')
>>> print(offset2synset(ewn, 'wn:02001858v'))
Synset('omw-en-02001858-v')

@goodmami goodmami added the documentation Improvements or additions to documentation label Mar 29, 2023
@goodmami
Owner

@BramVanroy thanks for the good questions (here and on the https://github.com/goodmami/penman project, too 👋). I agree that the documentation could be improved in this area, possibly in the NLTK migration guide.

And thanks, @fcbond, for the good description and solution.

The basic problem is that synset offsets (which are specific to each wordnet version) are not an inherent part of the WN-LMF formatted lexicons that are used by Wn. For some lexicons (mainly the omw- ones), however, the WordNet 3.0 offsets are conventionally used in the synset identifiers, so you just need to reformat the identifier appropriately, as @fcbond demonstrated.
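As a rough illustration of that reformatting step, here is a minimal sketch that builds an identifier from a part of speech and a numeric offset. The function name `offset_to_synset_id` is hypothetical, and the `omw-en` prefix and 8-digit zero padding are assumptions taken from the identifiers shown in this thread; other lexicons may use a different scheme.

```python
def offset_to_synset_id(pos: str, offset: int, prefix: str = "omw-en") -> str:
    """Build a synset ID such as 'omw-en-00981304-a' (hypothetical helper)."""
    # Assumption: the lexicon embeds the WordNet 3.0 offset as 8 digits,
    # so the integer offset is padded with leading zeros.
    return f"{prefix}-{offset:08d}-{pos}"


print(offset_to_synset_id("a", 981304))  # omw-en-00981304-a
```

The resulting string is what would be passed to a synset-by-ID lookup, not to a word-form lookup.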

Note that I also have an unmerged nltk branch that tries to implement NLTK's API as a shim on top of Wn, and its of2ss() function is implemented using the same wn.util.synset_id_formatter() function you linked to above:

wn/wn/nltk_api.py

Lines 329 to 342 in 5092e62

_ssid_from_pos_and_offset = _synset_id_formatter(prefix='omw-en')


def of2ss(of: str) -> Synset:
    pos = of[-1]
    offset = int(of[:8])
    ssid = _ssid_from_pos_and_offset(pos=pos, offset=offset)
    try:
        synset = Synset(_wn30.synset(ssid))
    except _wn.Error:
        raise _wn.Error(
            f'No WordNet synset found for pos={pos} at offset={offset}.'
        )
    return synset

@fcbond said:

Many people (including omw 1.0) treat all satellite adjectives (pos 's') as adjectives (pos 'a').
wn does not

This is not entirely true. Wn does conflate s and a in the wn.ic, wn.morphy, wn.similarity, and wn.taxonomy modules, but it's true that it does not do so on the standard synset-lookup functions.
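Since the standard lookup functions keep s and a apart, one way to handle offsets from sources that merge them is to generate both candidate identifiers and try them in order. The following is only a sketch: `candidate_synset_ids` is a hypothetical helper, and the `omw-en` prefix and the `wn:` input format are assumptions based on the examples in this thread.

```python
from typing import Iterator


def candidate_synset_ids(offset: str, prefix: str = "omw-en") -> Iterator[str]:
    """Yield synset IDs to try for an offset string such as 'wn:00981304a'."""
    body = offset.split(":")[-1]  # drop an optional 'wn:' prefix
    digits, pos = body[:-1].zfill(8), body[-1]
    yield f"{prefix}-{digits}-{pos}"
    if pos == "a":
        # The source may have recorded a satellite adjective as a plain 'a'.
        yield f"{prefix}-{digits}-s"


print(list(candidate_synset_ids("wn:00981304a")))
# ['omw-en-00981304-a', 'omw-en-00981304-s']
```

Each candidate can then be passed to the synset-by-ID lookup until one succeeds.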

@BramVanroy
Author

Hello @fcbond and @goodmami

First, thanks for the help! I settled for this:

import logging
from typing import Optional

import wn


def offset2omw_synset(wnet: wn.Wordnet, offset: str) -> Optional[wn.Synset]:
    # "wn:981304a" -> "00981304a": strip the prefix, zero-pad to 8 digits + POS
    offset = offset.replace("wn:", "").zfill(9)
    wnid = f"omw-en-{offset[:-1]}-{offset[-1]}"
    wnid_s = None

    try:
        return wnet.synset(wnid)
    except wn.Error:
        if wnid[-1] == "a":
            # Retry as a satellite adjective. wnid already carries the
            # "omw-en-" prefix, so only the POS suffix is swapped.
            wnid_s = f"{wnid[:-2]}-s"
            try:
                return wnet.synset(wnid_s)
            except wn.Error:
                pass

    logging.warning(f"Could not find offset {offset} ({wnid}{' or ' + wnid_s if wnid_s else ''}) in {wnet.lexicons()}")
    return None

I looked at the NLTK branch, @goodmami, and while I think it would be very useful, I just needed a quick function that I could easily plug into my code (without having to install from GitHub). I think it'd be a useful API to have, although I can imagine it is a lot of work!

And thank you for your work. It seems a coincidence that you are providing exactly the tools that I need for my work. I am very thankful and motivated that you created these libraries - and that they work so well and are well-documented! I've also peeked at the internals/API and documentation to inspire my own work, so a big thank you!

@goodmami
Owner

goodmami commented Apr 8, 2023

Thanks for the kind words, @BramVanroy! And I'm glad you were able to find a solution. I'm going to keep the issue open because, as the issue title states, I think this sort of information would be useful in the documentation, so the issue should be closed when that happens.

@goodmami goodmami reopened this Apr 8, 2023