Feature to modify wordnets in the database #17

Open
fcbond opened this issue Oct 16, 2020 · 28 comments
Labels
enhancement New feature or request

Comments

@fcbond
Collaborator

fcbond commented Oct 16, 2020

Updated:

This issue is for tracking the feature for modifying wordnets in the database through Wn. Currently the feature has low priority and won't be implemented unless there's a need.

Anyone who wants this feature please read the following:

If you have a use case where the lack of modifiable wordnets in Wn is holding you back, please:

  1. Explain your situation in a comment
  2. Indicate if you are a wordnet author/lexicographer or have some other role

Original issue text:

For example, add, modify or delete words, senses, synsets or relations, ...

@goodmami
Owner

goodmami commented Feb 5, 2021

I'm not adding support just yet, but I'm thinking of adding a modified column to the lexicons table which gets set to a true value when someone makes such changes. This is so we know when to warn the user if they remove a lexicon or upgrade the database (#62) such that those changes would be lost.

I want to add this column to the next release because I want to group as many schema-related changes as possible into one release, in order to reduce the number of times people have to rebuild.

@goodmami goodmami added the enhancement New feature or request label Feb 5, 2021
@goodmami goodmami mentioned this issue Feb 5, 2021
8 tasks
@goodmami
Owner

My first thought was to set up some triggers so that any changes to the database, whether using Wn or not, would set the modified flag, but that means creating 3 triggers (INSERT, UPDATE, and DELETE) for nearly every table, and it would slow down adding lexicons (and we'd have to unset the flag after adding a lexicon). Instead I think it will just be the responsibility of the Python code to set the flag when post-add changes are made.
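
A minimal sketch of that approach, assuming a simplified stand-in for the lexicons table (the real Wn schema has more columns) and a hypothetical helper named mark_modified:

```python
import sqlite3

# Simplified stand-in for Wn's lexicons table; only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicons ("
    " rowid INTEGER PRIMARY KEY,"
    " id TEXT NOT NULL,"
    " version TEXT NOT NULL,"
    " modified BOOLEAN CHECK( modified IN (0, 1) ) DEFAULT 0 NOT NULL)"
)
conn.execute("INSERT INTO lexicons (id, version) VALUES ('odenet', '1.4')")


def mark_modified(conn: sqlite3.Connection, lexicon_rowid: int) -> None:
    """Hypothetical helper: flag a lexicon as changed after it was added."""
    conn.execute(
        "UPDATE lexicons SET modified = 1 WHERE rowid = ?", (lexicon_rowid,)
    )


# Any post-add edit made from Python would call the helper in its transaction.
with conn:
    mark_modified(conn, 1)

flag = conn.execute(
    "SELECT modified FROM lexicons WHERE id = 'odenet'"
).fetchone()[0]
```

Because the flag is set by application code and not by triggers, changes made to the database outside of this code path would go unnoticed, which is the trade-off described above.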

@Hypercookie
Contributor

Is there any progress on this? Thanks for your work btw :)

@goodmami
Owner

@Hypercookie No, not yet. This feature won't be implemented until it has some higher priority, because new features -> more code to maintain, and, for the moment, I'm the only maintainer. I'm going to edit this issue to make more clear how to increase the priority (and feel free to respond accordingly).

@goodmami goodmami changed the title from "It would be good if a user could modify the wordnet" to "Feature to modify wordnets in the database" Mar 20, 2022
@goodmami goodmami pinned this issue Mar 20, 2022
@Hypercookie
Contributor

Hypercookie commented Mar 21, 2022

Thanks for your response @goodmami
I work for a research project in Natural Language Processing. We want to enable our users to add their own relations between words/synsets and modify existing ones. Since this feature has relatively high priority for us, I have forked this project and will try to implement the needed features myself. I may open a pull request once this is done and cleaned up.

@fcbond
Collaborator Author

fcbond commented Mar 22, 2022 via email

@goodmami
Owner

Thanks, @Hypercookie, for explaining its importance, and for taking the initiative to implement it. I have some further thoughts regarding the implementation if this were to be merged into this repository:

  • Perhaps the most straightforward method is to modify the LMF data in-memory, then to add it to the database as normal, because this can be done now. The challenge is ensuring the lexicon is well-formed as it is created.

  • If you want to modify the database directly, be sure to set the modified flag to true (1) on the entry for the lexicon:

    modified BOOLEAN CHECK( modified IN (0, 1) ) DEFAULT 0 NOT NULL,

    This flag is not yet used, but the idea is it could be used for warnings and other indicators so users are not surprised (or worse, unaware) of query results that differ from the original lexicon.

  • The API should be clear that modifications are happening so the user does not accidentally make a change. E.g., if synset.ili = 'i123' actually modified the database, it would cause problems for users who just had a typo and meant synset.ili == 'i123'. I think it would be best as a separate module (e.g., wn.editor), or even a separate utility.

  • The database schema was not created with small modifications in mind, so some design decisions may create friction. For instance, foreign key constraints may cause integrity issues.

  • Wn cannot currently export lexicon extensions (Exporting of extensions #103) so edits to an extension could not be dumped to a new WN-LMF XML file. In practice this won't be a big issue, as the existence of extensions is mostly theoretical now. That said, if the edits you wish to make are primarily adding new things rather than modifying/deleting old things, you might look into creating an extension rather than modifying the original lexicon.
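
The point about accidental edits can be illustrated with a small sketch. The classes below are toy stand-ins, not Wn's actual classes, and the SynsetEditor and set_ili names are hypothetical: the read-only object rejects assignment, and changes only happen through an explicit editor.

```python
from typing import Optional


class Synset:
    """Toy read-only synset; not Wn's actual class."""

    def __init__(self, id: str, ili: Optional[str] = None) -> None:
        self._id = id
        self._ili = ili

    @property
    def ili(self) -> Optional[str]:
        # No setter is defined, so `synset.ili = ...` raises AttributeError.
        return self._ili


class SynsetEditor:
    """Hypothetical editor: the only place where changes are allowed."""

    def __init__(self, synset: Synset) -> None:
        self._synset = synset

    def set_ili(self, ili: str) -> "SynsetEditor":
        self._synset._ili = ili
        return self  # returning self allows chained calls


synset = Synset("odenet-1-n", ili="i123")
try:
    synset.ili = "i456"  # a typo for `==` fails loudly instead of editing
    typo_was_blocked = False
except AttributeError:
    typo_was_blocked = True

SynsetEditor(synset).set_ili("i456")  # an explicit, intentional edit
```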

@Hypercookie
Contributor

Thanks for your input :) I have already started on this.

  • I'm not completely sure what you mean by this point.
  • Implemented
  • Currently the API resides in a separate module (mod.py -> Working Title).
    I had the same thought as you and wanted to make modification as efficient and "obvious" as possible, so currently changes to a synset look like this:

reset_all_wordnets()  # Resets all modified wordnets
w = wn.Wordnet("odenet")
t1 = w.synset("odenet-1-n")
t2 = w.synset("odenet-10-n")

# Deleting

print(t2.hypernyms()) # -> [Synset('odenet-4866-n')]
SynsetEditor(t2.hypernyms()[0]).delete()
print(t2.hypernyms()) # -> []

# Creating / Modifying
# Since there is no lexid and no Synset passed this will create a new 
# Synset in the Lexicon with id 'odenet' 
# Calls to modifications can be chained to make the interface more fluent
e = SynsetEditor("odenet").set_hypernym_of(t1).set_meronym_of(t2) 
print(t1.hypernyms()) #[Synset('odenet-5437-n'), Synset('odenet-362443-mod')]
                      #                                   \> That's the one we created

Note that this obviously needs some more work (for example, the -mod part of the id makes absolutely no sense). I also plan to provide a simple MappingDict that enables something like this:

e = SynsetEditor("odenet")
e["definition"] = "Fancy Definition"
e["hypernyms"] = [fancy_synset,fancy_synset_2]

But I think that wrapping the editor in a separate class ensures that no accidents occur either way.

  • Yeah I noticed that :D ... I currently rely a lot on ON DELETE CASCADE for deletion, and try to reuse the methods that initially import the lexicon to also import synsets or other things later.
  • We thought about that ... but we think that the use case involves deletion just as much as insertion, so this sadly wasn't an option.

I will try to make this as nice as possible, but time is also an issue for us ( as it is for everybody )
so we will see what comes out of it.
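
For readers unfamiliar with the ON DELETE CASCADE behaviour mentioned above, here is a minimal sketch with Python's sqlite3 module. The two tables are a toy schema assumed for illustration, not Wn's real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE synsets (rowid INTEGER PRIMARY KEY, id TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE synset_relations ("
    " source_rowid INTEGER NOT NULL REFERENCES synsets (rowid) ON DELETE CASCADE,"
    " target_rowid INTEGER NOT NULL REFERENCES synsets (rowid) ON DELETE CASCADE,"
    " type TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO synsets (rowid, id) VALUES (?, ?)",
    [(1, "odenet-10-n"), (2, "odenet-4866-n")],
)
conn.execute("INSERT INTO synset_relations VALUES (1, 2, 'hypernym')")

# Deleting the target synset also removes the relation row that points at it.
conn.execute("DELETE FROM synsets WHERE rowid = 2")
remaining = conn.execute("SELECT COUNT(*) FROM synset_relations").fetchone()[0]
```

Without the PRAGMA line the delete would succeed but leave a dangling relation row behind, so an editor that relies on cascading deletes has to enable foreign keys on every connection.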

@fcbond
Collaborator Author

fcbond commented Mar 28, 2022

Hi,

just a comment on the ids. I think the original odenet ids (and the same for most wordnets) are wordnet-offset-pos, where offset is the offset of the corresponding synset in Princeton WordNet 3.0, and pos is one of the small set of wordnet pos codes (originally n, v, a, r, but now extended to a few more; see below, from constants.py):

# Parts of Speech

NOUN = 'n'  #:
VERB = 'v'  #:
ADJ = ADJECTIVE = 'a'  #:
ADV = ADVERB = 'r'  #:
ADJ_SAT = ADJECTIVE_SATELLITE = 's'  #:
PHRASE = 't'  #:
CONJ = CONJUNCTION = 'c'  #:
ADP = ADPOSITION = 'p'  #:
OTHER = 'x'  #:
UNKNOWN = 'u'  #:

PARTS_OF_SPEECH = frozenset((
    NOUN,
    VERB,
    ADJECTIVE,
    ADVERB,
    ADJECTIVE_SATELLITE,
    PHRASE,
    CONJUNCTION,
    ADPOSITION,
    OTHER,
    UNKNOWN,
))

If you keep to this convention and just generate new integers from, say, 20000000 upwards (and I guess check for a clash if multiple people are adding things), it may make it easier for people to debug.

PWN offsets are all below 20,000,000, ...
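
A minimal sketch of that convention, assuming ids of the form prefix-offset-pos without zero padding (as in odenet-10-n). The helper name next_synset_id is hypothetical.

```python
from typing import Iterable

NEW_OFFSET_START = 20_000_000  # PWN 3.0 offsets are all below this value


def next_synset_id(prefix: str, pos: str, existing_ids: Iterable[str]) -> str:
    """Hypothetical helper: return an unused '<prefix>-<offset>-<pos>' id."""
    taken = set(existing_ids)
    offset = NEW_OFFSET_START
    while f"{prefix}-{offset}-{pos}" in taken:
        offset += 1  # skip ids that someone else already created
    return f"{prefix}-{offset}-{pos}"


existing = ["odenet-1-n", "odenet-10-n", "odenet-20000000-n"]
new_id = next_synset_id("odenet", "n", existing)
```

Note that this only checks for a clash among ids with the same part of speech; whether offsets should be unique across parts of speech is left open here.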

@Hypercookie
Contributor

Thanks very much :) I will do that!

@Hypercookie
Contributor

Maybe some updates :

  1. I managed to write editors for most of the wordnet. So it is now possible to edit a form or a sense for example.
  2. As far as the integration into existing modules goes, I would propose something like this:
    wn.synset(id).editor() This would then return an object of type SynsetEditor. Such editors exist in a similar way for Senses / Entries and Forms. This keeps the editor functionality clearly distinct from all other functionality so no accidents happen, but editing is still easy if needed.
  3. I will write editors for most of the SQL schema. Those are basically just wrappers around the database. But we can extend them as we like; for example, I have added an add_word(string) function to the SynsetEditor which creates a new Sense, a new Entry and a new Form, essentially adding the word directly to the Synset. I will try to limit these shortcuts as much as possible and later provide some documentation on how to extend those editors to add such functionality for your own projects.

Maybe you guys have some more input on this :)

@goodmami
Owner

(recreating what I think was the context)

Perhaps the most straightforward method is to modify the LMF data in-memory, then to add it to the database as normal, because this can be done now. The challenge is ensuring the lexicon is well-formed as it is created.

I'm not completely sure what you mean by this point.

The wn.lmf module loads the WN-LMF .xml files into memory. Wn does this prior to inserting the data into the database. The wn.lmf module is not well-documented, but the in-memory representation is basic Python dictionaries and lists, so it's easy to read and modify.
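
As a rough illustration of editing such a structure and checking that it stays well-formed, here is a sketch. The dictionary layout below is a simplified assumption for illustration and is not guaranteed to match what wn.lmf actually produces; the helper dangling_synset_refs is hypothetical.

```python
def dangling_synset_refs(lexicon: dict) -> list:
    """Return ids of senses whose synset is missing from the lexicon."""
    synset_ids = {ss["id"] for ss in lexicon["synsets"]}
    return [
        sense["id"]
        for entry in lexicon["entries"]
        for sense in entry["senses"]
        if sense["synset"] not in synset_ids
    ]


# Assumed, simplified in-memory shape: plain dictionaries and lists.
lexicon = {
    "id": "odenet",
    "entries": [
        {"id": "w1", "senses": [{"id": "w1-s1", "synset": "odenet-1-n"}]},
    ],
    "synsets": [{"id": "odenet-1-n"}],
}

# Edit in memory: add a sense that points at a synset not yet created.
lexicon["entries"][0]["senses"].append(
    {"id": "w1-s2", "synset": "odenet-20000000-n"}
)
before = dangling_synset_refs(lexicon)

# Add the missing synset; the check then finds nothing to report.
lexicon["synsets"].append({"id": "odenet-20000000-n"})
after = dangling_synset_refs(lexicon)
```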

  1. I managed to write editors for most of the wordnet. So it is now possible to edit a form or a sense for example.

Great, it's nice to hear you're making steady progress.

As far as the integration into existing modules goes I would propose to do something like this:
wn.synset(id).editor()

Actually I would rather not have that API, for some reasons:

  • if it's a method on every synset, sense, etc., then the editor code is loaded regardless of whether the user wants to edit things
  • wn.synset(id) just returns the first synset with the id, with no guarantees about the lexicon it came from (imagine having 2 lexicons in the database with some overlapping ids)

Instead I'd prefer something like this:

>>> from wn.editor import LexiconEditor
>>> es_editor = LexiconEditor('omw-es:1.4')  # unlike wn.Wordnet, this works with exactly 1 lexicon at a time

Then I'm a bit less particular about how that object is used, but maybe:

>>> es_editor.add_synset(id=ssid, ili=..., ...)
>>> es_editor.add_word(id=wid, pos=..., ...)
>>> es_editor.add_sense(id=sid, synset=ssid, word=wid, ...)
>>> es_editor.add_sense_relation(id=sid, target=..., relation=...)
>>> es_editor.commit()  # commit transaction in DB

A set of methods like this instead of a SynsetEditor class, a SenseEditor class, etc., might reduce the amount of code to maintain, but there are benefits to the other way, too.
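
To make the shape of such an object concrete, here is a minimal, self-contained sketch. It uses a toy two-table schema and only one add_* method; the class body, table layout, and rollback method are assumptions for illustration, not Wn code.

```python
import sqlite3


class LexiconEditor:
    """Hypothetical editor bound to exactly one lexicon (toy schema)."""

    def __init__(self, conn: sqlite3.Connection, lexicon: str) -> None:
        self._conn = conn
        self._lexicon = lexicon

    def add_synset(self, id: str, pos: str) -> None:
        self._conn.execute(
            "INSERT INTO synsets (lexicon, id, pos) VALUES (?, ?, ?)",
            (self._lexicon, id, pos),
        )

    def commit(self) -> None:
        # Record that the lexicon no longer matches its original source.
        self._conn.execute(
            "UPDATE lexicons SET modified = 1 WHERE id = ?", (self._lexicon,)
        )
        self._conn.commit()

    def rollback(self) -> None:
        self._conn.rollback()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicons (id TEXT PRIMARY KEY, modified INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE synsets (lexicon TEXT, id TEXT, pos TEXT)")
conn.execute("INSERT INTO lexicons (id) VALUES ('omw-es:1.4')")
conn.commit()

es_editor = LexiconEditor(conn, "omw-es:1.4")
es_editor.add_synset(id="omw-es-20000000-n", pos="n")
es_editor.rollback()  # discard the uncommitted synset
es_editor.add_synset(id="omw-es-20000001-n", pos="n")
es_editor.commit()

rows = conn.execute("SELECT id FROM synsets").fetchall()
modified = conn.execute("SELECT modified FROM lexicons").fetchone()[0]
```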

  1. I will write editors for most of the SQL schema...

The shortcuts are a nice idea. You might allow a function similar to wn.util.synset_id_formatter() for auto-creating word and sense IDs with a reasonable default (e.g., see _make_entry_id() and escape_lemma() in the omw-data code).
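
A sketch of what such a default could look like. The two functions below are hypothetical and are not the omw-data implementations; the escaping rule (replace every non-word character with an underscore) is an assumption chosen only to keep ids free of spaces and punctuation.

```python
import re


def escape_for_id(lemma: str) -> str:
    """Hypothetical escaping: keep word characters, replace the rest with '_'."""
    return re.sub(r"\W", "_", lemma.strip())


def default_entry_id(prefix: str, lemma: str, pos: str) -> str:
    """Hypothetical default id: '<prefix>-<escaped lemma>-<pos>'."""
    return f"{prefix}-{escape_for_id(lemma)}-{pos}"


entry_id = default_entry_id("odenet", "fahrbarer Untersatz", "n")
```

Two different lemmas can collapse to the same escaped form under this rule, so a real implementation would still need a clash check before inserting.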

@Hypercookie
Contributor

Hypercookie commented Mar 30, 2022

Hi ;)

The wn.lmf module loads the WN-LMF .xml files into memory. Wn does this prior to inserting the data into the database. The wn.lmf module is not well-documented, but the in-memory representation is basic Python dictionaries and lists, so it's easy to read and modify.

I see now what you mean. That was also my first approach because it looked like an easy thing to do, but then I noticed that if we want to modify something that already exists, we will (at least I believe so) not get around modifying the SQLite db itself; the same goes for deletion. And if we are already modifying the db, it would actually mean more code to create entries via the lmf module than to just create the corresponding row and then edit the object with the methods that modify existing objects. (See below for some more info on what I mean.)

• if it's a method on every synset, sense, etc., then the editor code is loaded regardless of whether the user wants to edit things
• wn.synset(id) just returns the first synset with the id, with no guarantees about the lexicon it came from (imagine having 2 lexicons in the database with some overlapping ids)

Agreed!

Instead I'd prefer something like this:

I do too! That's why I tried this at first, and I quickly came to realize that if we want to modify existing synsets, this approach, where only one class manages everything, will become very complicated and totally messy. We would need a method for everything a user might want to do, for example modifying the pronunciation of a word. We would then not have a clear interface but a collection of methods. That's why I decided to let the user choose the granularity they want.
There will be a LexiconEditor with all the functions you wrote in your reply (and probably many more), but I think we would benefit from having the option to modify a single synset exactly to our liking.
Also, as already mentioned at the top of this reply, this approach lets us reuse a lot of code: we simply say, here is a method that modifies the definition of a synset. If we now want to create a new synset with a specific definition, we can say:

SynsetEditor('omw-es:1.4').definition('fancy synset')

# Since the passed string is a lexicon id, this will create a new synset and return its editor instance. 
# But we can also do:

SynsetEditor(wn.synsets('Car')[0]).definition('this is not a car')

In fact a real world scenario (debatable) would look like this:

import wn
from wn.mod import *

reset_all_wordnets()  # Reset all Wordnets

# Create a new Synset in the 'odenet' lexicon.
# Add the words "Audi" and "Mercedes" to it (without caring about Senses or Forms) and retrieve the Synset
syn: wn.Synset = SynsetEditor("odenet").add_word("Audi").add_word("Mercedes").synset #This will become a clean method.

# Spawn a new Editor for the Auto synset (which already exists in the 'odenet' lexicon) and make it a hypernym of the
# previous synset.
SynsetEditor(wn.synsets("Auto")[0]).set_hypernym_of(syn)


print("words in synset:\n ")
for i in syn.words():
    print(i.lemma())

print("\n\nhypernyms: \n")
for i in syn.hypernyms():
    for word in i.words():
        print(word.lemma())

# Don't want the synset anymore?
SynsetEditor(syn).delete()

Output:

words in synset:
 
Audi
Mercedes


hypernyms: 

Wagen
fahrbarer Untersatz
Motorwagen
Personenwagen
Personenkraftwagen
Auto
Automobil
Pkw
Blechbüchse
PKW

This reuses so much code that in fact I mostly only wrote code to modify existing entries.
If we have the code like this, we can easily write methods to create a synset given an id, ili, etc., which will just be a chained call to an editor. But we can also get the editor itself and modify only parts of the synset (or sense or form or whatever). The challenge here is to keep the data consistent. But there (unlike for deletion ^^) the foreign key constraints in the db help a lot.

Edit:

Maybe something about the formatter functions. This is probably the only point where this approach bothers me... If we want a user to create an entry, they will just pass the lexicon id to specify the lexicon. If we now want to actually create the entry in the db, we have no idea which position or which forms it will have ... so we have to either set a fixed id without much meaning (the position can be modified) or modify the id as soon as the forms are set and we have a lemma. I dislike both approaches. The third one would be to wait before actually creating an object and use something like .commit(), but this would make the code much more complicated (we would need to preserve all changes somehow... maybe storing all queries to be executed and adding callbacks or something?)
As always: Open for input :)
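
One way the third option could look, as a rough sketch: stage the fields in a plain object and derive the id only at commit time, once a lemma exists. Everything here (the PendingEntry class, the id format, treating the first form as the lemma, requiring a part of speech) is an assumption for illustration. The commit step returns a record instead of writing to a database to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PendingEntry:
    """Hypothetical staged entry: nothing is written until commit()."""

    lexicon: str
    pos: Optional[str] = None
    forms: List[str] = field(default_factory=list)

    def set_pos(self, pos: str) -> "PendingEntry":
        self.pos = pos
        return self

    def add_form(self, form: str) -> "PendingEntry":
        self.forms.append(form)
        return self

    def commit(self, make_id: Callable[[str, str, str], str]) -> dict:
        # The id can only be derived once a lemma and part of speech exist.
        if not self.forms or self.pos is None:
            raise ValueError("an entry needs a lemma and a part of speech")
        lemma = self.forms[0]  # assumption: the first form is the lemma
        entry_id = make_id(self.lexicon, lemma, self.pos)
        return {"id": entry_id, "pos": self.pos, "forms": list(self.forms)}


record = (
    PendingEntry("odenet")
    .add_form("Audi")
    .set_pos("n")
    .commit(lambda lex, lemma, pos: f"{lex}-{lemma}-{pos}")
)
```

This avoids both the placeholder id and the later rename, at the cost of holding unsaved state in memory until commit is called.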

@Hypercookie
Contributor

https://hyper-wn.readthedocs.io/en/latest/api/wn.editor.html
If you want an overview of what is happening.

@Hypercookie
Contributor

https://github.com/Hypercookie/wn/projects
nearly done

@goodmami
Owner

goodmami commented Apr 6, 2022

@Hypercookie thanks for sharing. It looks like you've put in a ton of work, and it's nice to see it shaping up.

As it currently is, however, I don't think I'm prepared to accept a PR with a nearly 1800-line module, as that would greatly increase the maintenance burden in Wn for a feature that, as useful as it is, would still be used by a minority of users. So it might make more sense to distribute it as a separate package. To that end, I'm happy to find a way to expose certain internals in Wn's public API (such as wn._db.connect). The package (e.g., wn_editor) would be installed separately, but to help with visibility I can create an 'extra' for the installer so, e.g., pip install wn[editor] will install the editor package. Alternatively I could try creating a namespace package, such as wn.contrib, so that the editor package is accessible from wn.contrib.editor.

@Hypercookie
Contributor

@goodmami No problem! I absolutely see your point. I think distributing it as an 'extra' package would be the nicest way. I'm happy to maintain the editor in my own repository as long as it is needed. Maybe we could link it in the documentation somewhere? But I have to be honest with you: I have no idea what steps are necessary to create such an 'extra' package or prepare my repository for that. (Will google this later)

As to the internal Wn APIs, I (as you saw) basically only need access to wn._db.connect and, equally important, access to the internal xyz._id attributes which yield the rowid, which makes sure that I modify the object the user expects.
The only other things I use are the logger (which I could probably get elsewhere, I was just lazy) and the get_modified query, which I would just copy if that is easier. I also use the _insert_synsets method from wn._add to add synsets. But this is inconsistent with the approach of not using the lmf data structures, so it will be replaced by a direct database call.

@goodmami
Owner

goodmami commented Apr 6, 2022

Great, I'm glad that makes sense to you, too.

Regarding the internals, logger should be created by your module (logger = logging.getLogger(__name__) is a common idiom). I don't have any immediate plans of removing the xyz._id (but see #84), but I don't think it's necessary as you should be able to use the combination of the lexicon id and object id (the regular string one) to look it up. I'll also consider if making the _insert_synsets public is a good idea.

@Hypercookie
Contributor

I only need xyz._id and wn._db.connect; the rest is already fixed. Can you confirm that it is guaranteed that using the lexicon id and the object id only ever identifies one object? (I know that this should be the case, but are there unique constraints in place to enforce this?)

@goodmami
Owner

goodmami commented Apr 7, 2022

Can you confirm that it is guaranteed that using the lexicon id and the object id only ever identifies one object (I know that this should be the case, but are there unique constraints in place to enforce this?)

Yes, as long as you are working with only one lexicon, there will only be one Word, Sense, and Synset with the same ID. That is what the UNIQUE constraint on the entries table is ensuring (I don't recall why there aren't such constraints on synsets and senses; could have been an oversight). The code in wn._queries does not codify this constraint because those functions may look at more than one lexicon at a time.
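
For illustration, a sketch of how such a constraint behaves in SQLite. The table is a toy stand-in with assumed column names, not the actual Wn schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE synsets ("
    " rowid INTEGER PRIMARY KEY,"
    " lexicon_rowid INTEGER NOT NULL,"
    " id TEXT NOT NULL,"
    " UNIQUE (id, lexicon_rowid))"
)
conn.execute("INSERT INTO synsets (lexicon_rowid, id) VALUES (1, 'odenet-1-n')")
# The same id in a different lexicon is allowed.
conn.execute("INSERT INTO synsets (lexicon_rowid, id) VALUES (2, 'odenet-1-n')")

try:
    # The same id in the same lexicon violates the constraint.
    conn.execute("INSERT INTO synsets (lexicon_rowid, id) VALUES (1, 'odenet-1-n')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM synsets").fetchone()[0]
```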

@Hypercookie
Contributor

The problem is basically that when users add stuff, they could mess with those ids (by adding one with the same id), so if there are no unique constraints on synsets and senses, the editor could become unpredictable. That's why I used the rowids so much: since they are primary keys, it is a bit cleaner to use them to identify rows (in my opinion). But I can understand if you don't want to expose internal ids of wn, so I would try to enforce the constraints in code. It would also be possible to add those constraints in the database, but I don't know if you want to change the schema.

@goodmami
Owner

goodmami commented Apr 9, 2022

I definitely do not expect or encourage multiple synsets or senses with the same id and from the same lexicon. I'll see if I can get those constraints added.

And I don't have a problem with people using the rowids, especially when working directly with the database, but I don't suggest using them from the public API classes (e.g., Synset._id). The leading underscore indicates they are non-public members and I therefore reserve the right to remove them from the classes or rename them without any warning or changelog entry.

@Hypercookie
Contributor

All right, that sounds reasonable. I will adapt the constructors of the editors to take an id and a lexicon, transform them into a rowid, and then continue as normal. I will log a warning or something if multiple rowids are found (i.e., the id is duplicated in this lexicon) and then simply take the first. If you get to adding the constraints (no pressure here) this will not occur anymore, but better safe than sorry. I will probably need until Monday or Thursday for that.
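
A minimal sketch of that lookup, assuming a toy synsets table with a lexicon column (the real Wn schema differs) and a hypothetical function name resolve_rowid:

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def resolve_rowid(conn: sqlite3.Connection, lexicon: str, synset_id: str) -> int:
    """Hypothetical lookup: map (lexicon, id) to a rowid, warning on duplicates."""
    rows = conn.execute(
        "SELECT rowid FROM synsets WHERE lexicon = ? AND id = ? ORDER BY rowid",
        (lexicon, synset_id),
    ).fetchall()
    if not rows:
        raise LookupError(f"no synset {synset_id!r} in lexicon {lexicon!r}")
    if len(rows) > 1:
        logger.warning(
            "%d synsets share id %r in %r; using the first",
            len(rows), synset_id, lexicon,
        )
    return rows[0][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synsets (rowid INTEGER PRIMARY KEY, lexicon TEXT, id TEXT)")
conn.executemany(
    "INSERT INTO synsets (lexicon, id) VALUES (?, ?)",
    [("odenet", "odenet-1-n"), ("odenet", "odenet-1-n"), ("odenet", "odenet-10-n")],
)
first = resolve_rowid(conn, "odenet", "odenet-1-n")   # duplicated id: warns
other = resolve_rowid(conn, "odenet", "odenet-10-n")  # unique id: no warning
```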

@goodmami
Owner

goodmami commented Apr 9, 2022

@Hypercookie that sounds like a good plan. I also don't know when I'll find a few hours to try and code it up and test it, so don't hold your breath, but if your proposed change works then at least it will be more robust to unannounced changes to Wn's non-public API.

@Hypercookie
Contributor

Yes, that sounds good. I will just finish this up, release it as a pre-release version, and when you are finished I will make the required changes and release version 1.0.0.

@Hypercookie
Contributor

Hypercookie commented Apr 12, 2022

https://pypi.org/project/wn-editor/

whenever you are ready to include this as an extra package :)

It is not very detailed yet (and there is no documentation). This will all come with version 1.0.0.

@goodmami
Owner

Thanks! I'm able to download the package, but the link to the project homepage on GitHub gives a 404, so I think it might be a private repo?

@Hypercookie
Contributor

My bad! Should be public now...
