Skip to content

Conversation

@PouyaMohseni
Copy link
Contributor

Introduces a multi-source data model for instrument names:

  • Added new Source and InstrumentNameSource models
  • Added uniqueness constraint enforcing a single main source per instrument name
  • Updated import pipeline to store Wikidata as a Source and create link objects
  • Updated detail templates to display main source and its verification status
  • Updated name creation and deletion workflows for UMIL DB
  • Updated views
  • Updated admin interfaces to support source links and reviewer permissions

resolved #436

Copy link
Contributor

@dchiller dchiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two data modeling-related questions:

  1. Can you explain more about the decision to model the Instrument Name - Source relationship as a many-to-many field (as opposed to, in particular, a foreign key from Source to Instrument Name), with attributes like contributor and verification status adhering to the relationship table rather than to the name or to the source itself? A few things that come up:
    a. We’re introducing here some concept of identity of a “source” that we don’t have any other machinery to manage and/or enforce (and I’m not sure that we necessarily want one). Multiple names might be sourced from Wikidata, and in this PR you make use of a single source with the name “Wikidata” — this makes sense. But what else counts as a unique source? User-generated sources are just text fields — how do we disambiguate or link sources that are provided free-form? Is it important to do that?
    b. Is the name-source relationship really what we are verifying? Or are we verifying a name based on some source(s)? Same goes for the other categories of verification: what does it mean for a name - source relationship to be rejected.
  2. What is the role of the is_main field? Is it simply a way to set whether a source is displayed (this is what the last paragraph of #436 suggests to me…)? Or does it have additional meaning (for example, is it connected to whether a source is verifiable or not? does it have something to do with “source quality”) If the former, a simple flag makes sense (although perhaps a clearer name would be something like is_display). If the latter, there is probably a name that would be more meaningful than “main”

# If user is a superuser or contributor to the main source, allow deletion
if (
request.user.is_superuser
or instrument_name.main_source.contributor == request.user
Copy link
Contributor

@dchiller dchiller Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dangerous... (In fact, the previous version seems dangerous, too, but...)

Suppose both you and I provide the name guitar for an instrument based on the same source...and you just happened to provide that name - source pair first. Why do you get to delete the name and not me? Or maybe more importantly, why do you get to delete the name when I may still want it to exist? And maybe more likely, what if we both provided the same name but you provided the source that (for whatever reason) got chosen as the "main" source: why do you get to delete the instrument and I don't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to stick with the previous workflow. I think it is reasonable to let people delete their contribution, but what makes more sense is letting them removing thei contribution (intrumentName-Source links). Yet, that's also dangerous because when a source for instrument name is verified we do not accept any new sources. Also, we do not record duplicate intrumentName-Source links by different users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, totally understandable to stick with the previous workflow. I guess there's an open question then about whether some form of soft delete would be preferable? In any case, I think it makes sense to leave it with the current delete philosophy as before.

what makes more sense is letting them removing thei contribution (intrumentName-Source links).

Agreed.

Yet, that's also dangerous because when a source for instrument name is verified we do not accept any new sources.

Yeah, I didn't comment on that yet, because it kind of depends on the points about data modeling. But I'm not sure I understand why that would be the case. If we keep this PR as is, then verification is of the name - source relationship...why would I then need or want to restrict new sources from being added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. There should be no restrictions for adding more sources or more instrument-source relations by different contributors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the deletion workflow would be better implemented if we introduce a Contributor model and link it to the InstrumentNameSource model. This is more appropriate because the verification status is not bond to any contributor of a source. If a source has multiple contributors, it only needs to be verified once by a reviewer, regardless of how many contributors it has.

The InstrumentNameSource relation should be deleted only when all contributors associated with that relation have deleted it. This creates a hierarchy: an instrument name is deleted only when all of its sources are deleted, and a source is deleted only when all of its contributors remove it.

The downside, on the other hand, is that sources should not be a user specific input. That, putting home town is problematic because of it's relation to the contributor, which breaks no contributor-source relation logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the deletion workflow would be better implemented if we introduce a Contributor model and link it to the InstrumentNameSource model. This is more appropriate because the verification status is not bond to any contributor of a source. If a source has multiple contributors, it only needs to be verified once by a reviewer, regardless of how many contributors it has.

What differentiates a Contributor model from the User model?

The InstrumentNameSource relation should be deleted only when all contributors associated with that relation have deleted it. This creates a hierarchy: an instrument name is deleted only when all of its sources are deleted, and a source is deleted only when all of its contributors remove it.

I think the problem here would be that it then becomes hard to communicate to the user what deleting a name means, and what effect the "delete" action should have. When I delete something in Wikidata, it is clear and obvious what the result of that action was: it now no longer shows up on the page. If we want to give users some sort of delete permission, how do we communicate the effect of that with the procedure you're suggesting.

@PouyaMohseni
Copy link
Contributor Author

Thank you for your review.

  1. The relationship between a source and an instrument name has its own properties, such as contributor and verification status (and added time, verifier, etc., in the future). These values do not belong to the source itself or to the instrument name, but only to their relationship. I believe an instrument name can be supported by multiple verified name–source relationships. A reviewer may verify certain sources for a given name, and a source can be considered verified if at least one of its relationships is verified.

1.b. Yes, sources currently have minimal identity, with only Wikidata being well defined. However, the current model (in addition to objectifying Wikidata) allows future expansion for data entry (URL, citation, etc.) that can be linked. Additionally, I believe that websites, books, and other references should have their own identities. especially since we are now adding many instrument names from MIMO.

  1. Yes, is_main currently serves only as the default source to display. I think it is necessary to have this attribute, and it may be useful to distinguish between verifiable and primary. Yes, other names might be more descriptive, but I think is_display would be too restrictive.

@dchiller
Copy link
Contributor

I'm not really convinced by this model for both conceptual and practical reasons.

First, some conceptual elements:

Verifying a relationship between a name and a source is distinct from verifying a relationship between a name and an instrument. With these changes, a name - source relationship is what can be verified or not: what is actually being verified here? That a name exists in a particular source? That a source in some way communicates that a given name is correct for a given instrument? That distinction seems to be muddied here: why would I expect the validity (or not) of a name - instrument relationship to be dependent on the validity (or not) of a single name - source relationship? (If 50 sources say a piano is a "piano", why does validation depend on a single source? And what source? Maybe you say that's easy: as long as one source says a piano is a "piano" then we could the name - instrument relationship as valid. But what if my source is "my grandmother called it that" -- maybe 30 grandmothers say a piano is a "piano" -- which one is verified?) The concept of verification has been problematic for similar reasons since it has been introduced, but at least when verification adhered to an instrument - name relationship, one could imagine the verification process taking into account a range of criteria for verification in a process that was at least somewhat subjective.

If verification is still, even under the model proposed here, meant to be a description of an instrument - name relationship, then I think we need to have a good reason for the de facto model to deviate from the concept of verification as we mean it.

Now, for practical matters: in the attached issue, I read the core point being this: "keeping different sources for an entered instrument name allows the reviewer to verify the label based on the full array of available sources." Or, to put it another way: "We currently model instrument names as having a single source. In fact, instrument names can have multiple sources and we want our data model to reflect that." So the question is the best way to model that reality? It isn't clear how we got from that question to modifying anything about how we model validity.

@PouyaMohseni
Copy link
Contributor Author

PouyaMohseni commented Nov 28, 2025

You are bringing up many important ideas; Thank you.

You are right, and as we discussed today in #436, verification is determined at the instrument-name level with respect to all the provided sources. This makes the verification status a property of the instrument name itself, as you mentioned. Another point is that there should be a way to exclude or hide certain tags, so I believe the Source / Source–InstrumentName (which model with decide to stick with) relation should have an attribute such as is_valid or is_displayed.

That said, I still think creating a separate table for sources is relevant. This is not only because we may have many identical sources, but also because we need to consider weighting contributors who provide identical sources for a given label, as we discussed today (yes, source is currently an unprocessed free-text).

So, if you agree, I can go ahead and implement this.

- Add new `Source` and `InstrumentNameSource` models
- Many-to-many relationship between InstrumentName and Source
- Move contributor status onto the relationship table
- Update name creation and deletion workflows to operate on relationships
- Update Wikidata import to create Source objects and link Name–Source relations
- Update admin, views, and templates to display main source and status
- Remove restriction on adding new sources after verification

Resolves  #436
name="unique_umil_label_per_instrument_language",
)
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep this?

contributor = models.ForeignKey(
"auth.User",
on_delete=models.SET_NULL,
help_text="Users who contributed this name",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this change?

Comment on lines 26 to 30
context["instrument_names"] = [
name
for name in instrument_names
if name.verification_status == "verified"
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need to be changed? The QuerySet version is better.

"""
Display all associated source names for this instrument name
"""
return ", ".join(source.name for source in obj.sources.all())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ", ".join(source.name for source in obj.sources.all())
return ", ".join(obj.sources.all.iterator())

- used `iterator()` in admin.py
- changed `help_text` of contributor
@ahankinson
Copy link
Member

How many queries does this functionality generate? Have you benchmarked it with Django Debug Toolbar and checked?

@PouyaMohseni
Copy link
Contributor Author

Yes, the number of queries used to be 9 for an instrument detail page, but with this new model it would become almost double the number of instrument labels, O(len(InstrumentName)). However, this only happens if we display the sources on the instrument detail page. As discussed in the meeting, we should have a dedicated page for each instrument name where all sources contributing to that name are shown. If we do not show sources on the detail page, the required queries would be reduced to 7.

PouyaMohseni added a commit that referenced this pull request Dec 10, 2025
- display aliases using nested rows
- show verification status and action options for aliases
- remove source information as part of improvements related to #443

resolves #423, resolves #420
@ahankinson
Copy link
Member

Do you know why it doubles? That's the more important question. If you have an O(len(Something)) increase with Django, it generally means that a loop through some results is hitting the database for every iteration.

This can usually be fixed with either a select_related() or a prefetch_related() so that it's O(1) (That is, the list of related results is also fetched with the initial query).

My experience with writing Django is that you always keep an eye on how many queries any new development is adding to your page load, and fix it immediately. It's a lot harder debugging these things when you have to work through relationships in models, views, and templates to find the place where it's triggering the results. (Although you can also expand the query line in DDT and it will usually show you the line.)

PouyaMohseni added a commit that referenced this pull request Dec 11, 2025
- display aliases using nested rows
- show verification status and action options for aliases
- remove source information as part of improvements related to #443

resolves #423, resolves #420
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Source Handling for Instrument Labels

4 participants