Clear bad production data #4007
We should review #3742 after this ticket is resolved, as it may be resolved as well.
The following SQL script picks up the duplicates (from #3567).
The following SQL updates the package state:
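The scripts themselves are not preserved in this copy of the thread. A minimal sketch of what a duplicate-finding query and state update could look like, run here against an in-memory SQLite stand-in for CKAN's `package` table (the schema subset and the MIN(id) keeper rule are assumptions, not the actual production script):

```python
import sqlite3

# In-memory stand-in for CKAN's package table (hypothetical schema subset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (id TEXT PRIMARY KEY, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO package VALUES (?, ?, ?)",
    [("a1", "dataset-x", "active"),
     ("a2", "dataset-x", "active"),   # duplicate of dataset-x
     ("b1", "dataset-y", "active")],
)

# Pick up active packages whose name appears more than once.
dupes = conn.execute("""
    SELECT name, COUNT(*) AS count
    FROM package
    WHERE state = 'active'
    GROUP BY name
    HAVING COUNT(*) > 1
    ORDER BY count DESC
""").fetchall()
print(dupes)  # [('dataset-x', 2)]

# Mark every duplicate copy except one per name as 'to_delete'
# (keeping MIN(id) is an arbitrary choice for this sketch; the real
# script may pick the keeper differently).
conn.execute("""
    UPDATE package SET state = 'to_delete'
    WHERE state = 'active'
      AND id NOT IN (SELECT MIN(id) FROM package
                     WHERE state = 'active' GROUP BY name)
""")
```

After the update, exactly one `active` row remains per package name; the `to_delete` state is what the later cleanup steps act on.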
Clear the harvest source in the sandbox test org.
Will do the following cleanup today:
10-24-2022: There is one new duplicate in dhs-gov today.
Just cleaned up duplicates for ca-gov; it only took about 4 minutes for 12,455 records with the new deletion method (defer the commit to the end).
That is ~50/sec on deletion, faster than the ~10/sec we see for adding/updating. What makes the speed difference?
The new delete function uses only one Solr connection for all deletions.
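A rough sketch of why a single connection with a deferred commit is faster. The `SolrConnection` class below is a hypothetical stand-in that just counts expensive commit round-trips, not the real client used by the catalog:

```python
class SolrConnection:
    """Hypothetical stand-in for a Solr client; counts commit round-trips."""
    def __init__(self):
        self.pending = []
        self.commits = 0

    def delete(self, doc_id):
        self.pending.append(doc_id)  # buffered locally, cheap

    def commit(self):
        self.commits += 1            # expensive: flush + reopen searcher
        self.pending.clear()

def delete_per_record(ids):
    # Old approach: a fresh connection and commit per record.
    total_commits = 0
    for doc_id in ids:
        conn = SolrConnection()
        conn.delete(doc_id)
        conn.commit()
        total_commits += conn.commits
    return total_commits

def delete_batched(ids):
    # New approach: one connection, commit deferred to the end.
    conn = SolrConnection()
    for doc_id in ids:
        conn.delete(doc_id)
    conn.commit()
    return conn.commits

ids = [f"pkg-{i}" for i in range(12455)]
print(delete_per_record(ids), delete_batched(ids))  # 12455 1
```

The batched version pays the commit cost once instead of 12,455 times, which is consistent with the observed jump from ~10/sec to ~50/sec.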
The following duplicates have also been cleared, so there are no duplicates in the DB as of today. Will continue to monitor for a couple of days to see if we get new duplicates.
Checked for duplicates today; the query returned no new items (an empty name | count result).
To eliminate packages that have no current harvest_object, we can use this query.
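The query itself is not preserved in this copy. One plausible shape is a `NOT EXISTS` check between `package` and `harvest_object`, sketched here against an in-memory SQLite copy of the two tables (the table and column names follow CKAN/ckanext-harvest conventions, but treat the exact query as an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (id TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT, current BOOLEAN);
    INSERT INTO package VALUES ('p1', 'active'), ('p2', 'active');
    INSERT INTO harvest_object VALUES ('h1', 'p1', 1);  -- p2 has no current harvest_object
""")

# Active packages with no current harvest_object pointing at them.
orphans = conn.execute("""
    SELECT p.id
    FROM package p
    WHERE p.state = 'active'
      AND NOT EXISTS (
          SELECT 1 FROM harvest_object ho
          WHERE ho.package_id = p.id AND ho.current
      )
""").fetchall()
print(orphans)  # [('p2',)]
```

After cleanup this query should return 0 rows, which is the check the later comments use to validate the ticket.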
The query result returns 0 rows. This ticket can be closed.
I might have confused #3567 with this ticket. Or maybe the same thing happened twice? Even with 1.6 years of experience on data.gov, I wouldn't touch the production DB without Fuhu around; @Jin-Sun-tts did it twice! This ticket removed the bad data from being searchable or discoverable by users, but it is still in the system (hence #3999). As @FuhuXia mentioned above, the validation query shows success for this ticket.
Since we re-implemented the db-solr sync, we have found data in a bad state sitting in prod (not indexed in Solr, but still valid in the DB).
We need to clear this bad data.
How to reproduce
Expected behavior
The above query should return 0 datasets.
Actual behavior
Thousands of datasets are returned.
Sketch
The following organizations have duplicates (this may be affecting all of them, or just some); they are sorted in descending order by count:
The process to follow for each organization:
SELECT id FROM "group" WHERE name = 'doc-gov';
Replace org-id and harvest-source-id with the values above. Please note that the above to_delete marking is to match ckanext-harvest clearing. We also need to delete the records in Solr, so we need to do a full harvest clear. An alternative approach would be manually running the db-solr-sync job after deleting the records and validating that the job removed the records from Solr.
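The per-organization steps above can be sketched end to end. The org name, ids, and the MIN(id) keeper rule below are placeholders, and the final set difference only simulates what a db-solr-sync run would do, run against an in-memory SQLite stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "group" (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE package (id TEXT PRIMARY KEY, name TEXT, owner_org TEXT, state TEXT);
    INSERT INTO "group" VALUES ('g-123', 'doc-gov');
    INSERT INTO package VALUES
        ('p1', 'dataset-x', 'g-123', 'active'),
        ('p2', 'dataset-x', 'g-123', 'active');
""")

# Step 1: look up the organization id.
org_id = conn.execute(
    """SELECT id FROM "group" WHERE name = 'doc-gov'"""
).fetchone()[0]

# Step 2: mark duplicate packages in that org as 'to_delete', keeping
# one copy per name (MIN(id) is an arbitrary keeper for this sketch).
conn.execute("""
    UPDATE package SET state = 'to_delete'
    WHERE owner_org = ?
      AND id NOT IN (SELECT MIN(id) FROM package
                     WHERE owner_org = ? GROUP BY name)
""", (org_id, org_id))

# Step 3 (the alternative approach): db-solr-sync would then drop any
# Solr document whose package is no longer active; simulated as a set diff.
solr_index = {"p1", "p2"}
active = {row[0] for row in conn.execute("SELECT id FROM package WHERE state = 'active'")}
solr_index &= active  # de-indexed records are removed
print(sorted(solr_index))  # ['p1']
```

The validation step at the end mirrors the thread's check: after the sync, only packages still active in the DB should remain in the Solr index.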