Check that a candidate for deduplication is in a source that is configured for deduplication #156

Merged
merged 4 commits into from
Mar 27, 2024

jschultze
Contributor

We experienced the following behaviour with deduplication:

Sources configured:

  • Source 1 (dedup = true)
  • Source 2 (dedup = true)
  • Source 3 (dedup = true)
  • Source 4 (dedup = false)

When running deduplication (with or without explicitly stating the sources to be deduplicated with --source), records from sources 1 to 3 were not only deduplicated within this group, but also against source 4. We were expecting only the records from sources that are configured for deduplication to be deduplicated.

RecordManager seems to get candidates for deduplication from the whole database. The code added in this PR checks whether the source of a deduplication candidate is configured for deduplication.
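
To illustrate the idea of the check, here is a minimal Python sketch. It is not RecordManager's actual code (which is PHP); the `Record` class, the `SOURCE_SETTINGS` mapping, and the function names are all hypothetical and only mirror the four-source setup described above. It assumes each candidate carries a source identifier and a dedup key, and it drops candidates whose source does not have dedup enabled, even if a stale dedup key still makes them match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    record_id: str
    source_id: str
    dedup_key: str


# Hypothetical per-source settings mirroring the four sources described above.
SOURCE_SETTINGS = {
    "source1": {"dedup": True},
    "source2": {"dedup": True},
    "source3": {"dedup": True},
    "source4": {"dedup": False},
}


def is_dedup_enabled(source_id: str) -> bool:
    """Return True only if the source is known and has dedup switched on."""
    return bool(SOURCE_SETTINGS.get(source_id, {}).get("dedup", False))


def filter_candidates(record: Record, candidates: List[Record]) -> List[Record]:
    """Keep candidates that share the dedup key and come from a dedup-enabled source."""
    return [
        c for c in candidates
        if c.record_id != record.record_id
        and c.dedup_key == record.dedup_key
        and is_dedup_enabled(c.source_id)
    ]


record = Record("source1.1", "source1", "title-key")
candidates = [
    Record("source2.7", "source2", "title-key"),
    # Stale key left over from a time when source4 still had dedup enabled.
    Record("source4.3", "source4", "title-key"),
]
kept = filter_candidates(record, candidates)
print([c.record_id for c in kept])  # prints ['source2.7']
```

In this sketch the record from source 4 is skipped even though its key matches, which is the behaviour we expected from the configuration.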

@EreMaijala
Contributor

@jschultze Has source 4 had dedup = true at some point? What the docs fail to explain properly is that if you turn dedup on or off, you need to run renormalize on the source to update the dedup keys. Sources that have dedup disabled shouldn't have dedup keys, so the records should not be found in deduplication. Regardless, the check here makes sense, but I just wanted to get to the bottom of the issue.

@EreMaijala
Contributor

(Wiki updated with a note to run renormalize)

@jschultze
Contributor Author

jschultze commented Mar 27, 2024

@EreMaijala Thanks for the explanation! Yes, I think that source 4 had the dedup flag set to true at first and I have not run the renormalize command since, so that is probably the reason.

I will execute the renormalization to clean the database.

@EreMaijala
Contributor

Oops, there's a style problem. Can you fix that too?

@jschultze
Contributor Author

The whitespace has been removed.

@EreMaijala EreMaijala merged commit dded08f into NatLibFi:dev Mar 27, 2024
4 checks passed