feat: support multiple sources for a single instrument name #443

PouyaMohseni · 2025-11-27T04:04:48Z

Introduces a multi-source data model for instrument names:

Added new Source and InstrumentNameSource models
Added uniqueness constraint enforcing a single main source per instrument name
Updated import pipeline to store Wikidata as a Source and create link objects
Updated detail templates to display main source and its verification status
Updated name creation and deletion workflows for UMIL DB
Updated views
Updated admin interfaces to support source links and reviewer permissions

resolved #436

dchiller

Two data modeling-related questions:

Can you explain more about the decision to model the Instrument Name - Source relationship as a many-to-many field (as opposed to, in particular, a foreign key from Source to Instrument Name), with attributes like contributor and verification status adhering to the relationship table rather than to the name or to the source itself? A few things that come up:
a. We’re introducing here some concept of identity of a “source” that we don’t have any other machinery to manage and/or enforce (and I’m not sure that we necessarily want one). Multiple names might be sourced from Wikidata, and in this PR you make use of a single source with the name “Wikidata” — this makes sense. But what else counts as a unique source? User-generated sources are just text fields — how do we disambiguate or link sources that are provided free-form? Is it important to do that?
b. Is the name-source relationship really what we are verifying? Or are we verifying a name based on some source(s)? Same goes for the other categories of verification: what does it mean for a name - source relationship to be rejected.
What is the role of the is_main field? Is it simply a way to set whether a source is displayed (this is what the last paragraph of #436 suggests to me…)? Or does it have additional meaning (for example, is it connected to whether a source is verifiable or not? does it have something to do with “source quality”) If the former, a simple flag makes sense (although perhaps a clearer name would be something like is_display). If the latter, there is probably a name that would be more meaningful than “main”

dchiller · 2025-11-27T15:55:34Z

web-app/django/VIM/apps/instruments/views/update_umil_db.py

+    # If user is a superuser or contributor to the main source, allow deletion
+    if (
+        request.user.is_superuser
+        or instrument_name.main_source.contributor == request.user


This seems dangerous... (In fact, the previous version seems dangerous, too, but...)

Suppose both you and I provide the name guitar for an instrument based on the same source...and you just happened to provide that name - source pair first. Why do you get to delete the name and not me? Or maybe more importantly, why do you get to delete the name when I may still want it to exist? And maybe more likely, what if we both provided the same name but you provided the source that (for whatever reason) got chosen as the "main" source: why do you get to delete the instrument and I don't?

I tried to stick with the previous workflow. I think it is reasonable to let people delete their contribution, but what makes more sense is letting them removing thei contribution (intrumentName-Source links). Yet, that's also dangerous because when a source for instrument name is verified we do not accept any new sources. Also, we do not record duplicate intrumentName-Source links by different users.

Yeah, totally understandable to stick with the previous workflow. I guess there's an open question then about whether some form of soft delete would be preferable? In any case, I think it makes sense to leave it with the current delete philosophy as before.

what makes more sense is letting them removing thei contribution (intrumentName-Source links).

Agreed.

Yet, that's also dangerous because when a source for instrument name is verified we do not accept any new sources.

Yeah, I didn't comment on that yet, because it kind of depends on the points about data modeling. But I'm not sure I understand why that would be the case. If we keep this PR as is, then verification is of the name - source relationship...why would I then need or want to restrict new sources from being added?

Yes, I agree. There should be no restrictions for adding more sources or more instrument-source relations by different contributors.

I believe the deletion workflow would be better implemented if we introduce a Contributor model and link it to the InstrumentNameSource model. This is more appropriate because the verification status is not bond to any contributor of a source. If a source has multiple contributors, it only needs to be verified once by a reviewer, regardless of how many contributors it has.

The InstrumentNameSource relation should be deleted only when all contributors associated with that relation have deleted it. This creates a hierarchy: an instrument name is deleted only when all of its sources are deleted, and a source is deleted only when all of its contributors remove it.

The downside, on the other hand, is that sources should not be a user specific input. That, putting home town is problematic because of it's relation to the contributor, which breaks no contributor-source relation logic.

I believe the deletion workflow would be better implemented if we introduce a Contributor model and link it to the InstrumentNameSource model. This is more appropriate because the verification status is not bond to any contributor of a source. If a source has multiple contributors, it only needs to be verified once by a reviewer, regardless of how many contributors it has.

What differentiates a Contributor model from the User model?

The InstrumentNameSource relation should be deleted only when all contributors associated with that relation have deleted it. This creates a hierarchy: an instrument name is deleted only when all of its sources are deleted, and a source is deleted only when all of its contributors remove it.

I think the problem here would be that it then becomes hard to communicate to the user what deleting a name means, and what effect the "delete" action should have. When I delete something in Wikidata, it is clear and obvious what the result of that action was: it now no longer shows up on the page. If we want to give users some sort of delete permission, how do we communicate the effect of that with the procedure you're suggesting.

PouyaMohseni · 2025-11-27T16:47:24Z

Thank you for your review.

The relationship between a source and an instrument name has its own properties, such as contributor and verification status (and added time, verifier, etc., in the future). These values do not belong to the source itself or to the instrument name, but only to their relationship. I believe an instrument name can be supported by multiple verified name–source relationships. A reviewer may verify certain sources for a given name, and a source can be considered verified if at least one of its relationships is verified.

1.b. Yes, sources currently have minimal identity, with only Wikidata being well defined. However, the current model (in addition to objectifying Wikidata) allows future expansion for data entry (URL, citation, etc.) that can be linked. Additionally, I believe that websites, books, and other references should have their own identities. especially since we are now adding many instrument names from MIMO.

Yes, is_main currently serves only as the default source to display. I think it is necessary to have this attribute, and it may be useful to distinguish between verifiable and primary. Yes, other names might be more descriptive, but I think is_display would be too restrictive.

dchiller · 2025-11-28T12:45:11Z

I'm not really convinced by this model for both conceptual and practical reasons.

First, some conceptual elements:

Verifying a relationship between a name and a source is distinct from verifying a relationship between a name and an instrument. With these changes, a name - source relationship is what can be verified or not: what is actually being verified here? That a name exists in a particular source? That a source in some way communicates that a given name is correct for a given instrument? That distinction seems to be muddied here: why would I expect the validity (or not) of a name - instrument relationship to be dependent on the validity (or not) of a single name - source relationship? (If 50 sources say a piano is a "piano", why does validation depend on a single source? And what source? Maybe you say that's easy: as long as one source says a piano is a "piano" then we could the name - instrument relationship as valid. But what if my source is "my grandmother called it that" -- maybe 30 grandmothers say a piano is a "piano" -- which one is verified?) The concept of verification has been problematic for similar reasons since it has been introduced, but at least when verification adhered to an instrument - name relationship, one could imagine the verification process taking into account a range of criteria for verification in a process that was at least somewhat subjective.

If verification is still, even under the model proposed here, meant to be a description of an instrument - name relationship, then I think we need to have a good reason for the de facto model to deviate from the concept of verification as we mean it.

Now, for practical matters: in the attached issue, I read the core point being this: "keeping different sources for an entered instrument name allows the reviewer to verify the label based on the full array of available sources." Or, to put it another way: "We currently model instrument names as having a single source. In fact, instrument names can have multiple sources and we want our data model to reflect that." So the question is the best way to model that reality? It isn't clear how we got from that question to modifying anything about how we model validity.

PouyaMohseni · 2025-11-28T21:25:33Z

You are bringing up many important ideas; Thank you.

You are right, and as we discussed today in #436, verification is determined at the instrument-name level with respect to all the provided sources. This makes the verification status a property of the instrument name itself, as you mentioned. Another point is that there should be a way to exclude or hide certain tags, so I believe the Source / Source–InstrumentName (which model with decide to stick with) relation should have an attribute such as is_valid or is_displayed.

That said, I still think creating a separate table for sources is relevant. This is not only because we may have many identical sources, but also because we need to consider weighting contributors who provide identical sources for a given label, as we discussed today (yes, source is currently an unprocessed free-text).

So, if you agree, I can go ahead and implement this.

- Add new `Source` and `InstrumentNameSource` models - Many-to-many relationship between InstrumentName and Source - Move contributor status onto the relationship table - Update name creation and deletion workflows to operate on relationships - Update Wikidata import to create Source objects and link Name–Source relations - Update admin, views, and templates to display main source and status - Remove restriction on adding new sources after verification Resolves #436

dchiller · 2025-12-09T18:50:46Z

web-app/django/VIM/apps/instruments/models/instrument_name.py

                name="unique_umil_label_per_instrument_language",
            )
        ]
-


Why not keep this?

dchiller · 2025-12-09T18:51:56Z

web-app/django/VIM/apps/instruments/models/instrument_name_source.py

+    contributor = models.ForeignKey(
+        "auth.User",
+        on_delete=models.SET_NULL,
+        help_text="Users who contributed this name",


Should this change?

dchiller · 2025-12-09T19:44:55Z

web-app/django/VIM/apps/instruments/views/instrument_detail.py

+            context["instrument_names"] = [
+                name
+                for name in instrument_names
+                if name.verification_status == "verified"
+            ]


Why did this need to be changed? The QuerySet version is better.

dchiller · 2025-12-09T19:49:32Z

web-app/django/VIM/apps/instruments/admin.py

+        """
+        Display all associated source names for this instrument name
+        """
+        return ", ".join(source.name for source in obj.sources.all())


Suggested change

return ", ".join(source.name for source in obj.sources.all())

return ", ".join(obj.sources.all.iterator())

- used `iterator()` in admin.py - changed `help_text` of contributor

ahankinson · 2025-12-09T23:00:56Z

How many queries does this functionality generate? Have you benchmarked it with Django Debug Toolbar and checked?

PouyaMohseni · 2025-12-10T17:48:15Z

Yes, the number of queries used to be 9 for an instrument detail page, but with this new model it would become almost double the number of instrument labels, O(len(InstrumentName)). However, this only happens if we display the sources on the instrument detail page. As discussed in the meeting, we should have a dedicated page for each instrument name where all sources contributing to that name are shown. If we do not show sources on the detail page, the required queries would be reduced to 7.

- display aliases using nested rows - show verification status and action options for aliases - remove source information as part of improvements related to #443 resolves #423, resolves #420

ahankinson · 2025-12-10T23:03:37Z

Do you know why it doubles? That's the more important question. If you have an O(len(Something)) increase with Django, it generally means that a loop through some results is hitting the database for every iteration.

This can usually be fixed with either a select_related() or a prefetch_related() so that it's O(1) (That is, the list of related results is also fetched with the initial query).

My experience with writing Django is that you always keep an eye on how many queries any new development is adding to your page load, and fix it immediately. It's a lot harder debugging these things when you have to work through relationships in models, views, and templates to find the place where it's triggering the results. (Although you can also expand the query line in DDT and it will usually show you the line.)

- display aliases using nested rows - show verification status and action options for aliases - remove source information as part of improvements related to #443 resolves #423, resolves #420

…he dataset request hits

dchiller reviewed Nov 27, 2025

View reviewed changes

PouyaMohseni force-pushed the multiple-source branch from 5512b20 to 9131999 Compare December 5, 2025 20:16

PouyaMohseni requested a review from dchiller December 8, 2025 14:10

dchiller reviewed Dec 9, 2025

View reviewed changes

fix: address minor issues from review

e9d75b5

- used `iterator()` in admin.py - changed `help_text` of contributor

PouyaMohseni mentioned this pull request Dec 10, 2025

Formating Aliases in the instrument page #426

Open

refator: use prefetch_related() for prefetching sources to reduce t…

447de83

…he dataset request hits

PouyaMohseni requested a review from dchiller December 17, 2025 17:13

	return ", ".join(source.name for source in obj.sources.all())
	return ", ".join(obj.sources.all.iterator())

feat: support multiple sources for a single instrument name #443

Are you sure you want to change the base?

feat: support multiple sources for a single instrument name #443

Uh oh!

Conversation

PouyaMohseni commented Nov 27, 2025

Uh oh!

dchiller left a comment

Choose a reason for hiding this comment

Uh oh!

dchiller Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PouyaMohseni commented Nov 27, 2025

Uh oh!

dchiller commented Nov 28, 2025

Uh oh!

PouyaMohseni commented Nov 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahankinson commented Dec 9, 2025

Uh oh!

PouyaMohseni commented Dec 10, 2025

Uh oh!

ahankinson commented Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dchiller Nov 27, 2025 •

edited

Loading

PouyaMohseni commented Nov 28, 2025 •

edited

Loading