Clear bad production data #4007

jbrown-xentity · 2022-10-12T21:52:02Z

Since we re-implemented the db-solr sync, we found that we have data in a bad state sitting in prod (not indexed on solr, but still valid).
We need to clear this bad data

How to reproduce

Examine duplicate records that don't have a correct/corresponding record in harvest_object table. A full table scan doesn't seem to complete, but you can look by harvest source:

SELECT COUNT(*), max(package.metadata_created) 
FROM package 
LEFT JOIN "group" ON package.owner_org = "group".id 
WHERE "group".name = 'dhs-gov' 
AND package.type = 'dataset' 
AND package.id NOT IN (SELECT package_id FROM harvest_object WHERE harvest_source_id = '803bdba9-bfcb-453c-ae2a-ed81f240ff5a' AND current);

Expected behavior

The above query should result in 0 datasets

Actual behavior

Thousands

Sketch

The following organizations have duplicates, this may be affecting all of them (or just some), and they are sorted in highest value order:

The process to follow for each organization:

Find the offending harvest source (if more than one, will involve more analysis)
Get the harvest source id (via UI)
Get the organization id: SELECT id FROM "group" WHERE name = 'doc-gov';
If more than one harvest source for the org, need to update the following query
Replace the following query values org-id and harvest-source-id with the values above

UPDATE package 
SET state = 'to_delete' 
WHERE package.owner_org = 'org-id' 
AND package.type = 'dataset' 
AND package.id NOT IN (
  SELECT package_id 
  FROM harvest_object 
  WHERE harvest_source_id = 'harvest-source-id' 
  AND current
);

Clear the harvest source (either via UI or via run-task)
Re-Harvest the source

Please note that the above marking to_delete is to match ckanext-harvest clearing.
We also need to delete the records in SOLR, so we need to do a full harvest clear. An alternative approach would be manually running the db-solr-sync job after deleting the records, and validating that the job removed the records from solr...

The text was updated successfully, but these errors were encountered:

jbrown-xentity · 2022-10-13T18:02:46Z

Should review #3742 after resolution of this ticket, as that may be resolved as well.

Jin-Sun-tts · 2022-10-18T13:28:15Z

The following SQL script picks up the duplicates (from #3567)

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

10-17-2022

doc-gov                                        | 23425
ca-gov                                         | 12455
noaa-gov                                       | 10869
city-of-ferndale-michigan                      |   795
usaid-gov                                      |   511
city-of-seattle                                |   494
state-of-connecticut                           |   474
lake-county-illinois                           |   348
vcgi-org                                       |   186
loudoun-county-virginia                        |   147
hhs-gov                                        |   138
city-of-baltimore                              |   111
state-of-oklahoma                              |   100
ed-gov                                         |    90
doe-gov                                        |    64
usgs-gov                                       |    54
city-of-new-york                               |    40
sba-gov                                        |    36
national-institute-of-standards-and-technology |    35
wake-county                                    |    33
epa-gov                                        |    33
doi-gov                                        |    30
federal-laboratory-consortium                  |    29
city-of-sioux-falls                            |    14
dot-gov                                        |    13
city-of-austin                                 |    11
usitc-gov                                      |     6
city-of-tempe                                  |     5
king-county-washington                         |     5
state-of-new-york                              |     5
centers-for-disease-control-and-prevention     |     4
fema-gov                                       |     3
usda-gov                                       |     3
city-of-los-angeles                            |     2
state-of-maryland                              |     2
city-of-bloomington                            |     2
va-gov                                         |     2
state-gov                                      |     1
rrb-gov                                        |     1
census-gov                                     |     1
doj-gov                                        |     1
city-of-baton-rouge                            |     1
fcc-gov                                        |     1

Following sql to update the package state:

UPDATE package 
SET state = 'to_delete' 
where id in (
    SELECT package.id
    FROM package
    JOIN "group" ON package.owner_org = "group".id 
    LEFT JOIN harvest_object ON package.id = harvest_object.package_id 
    WHERE package.state='active' AND package.type='dataset'   
    AND harvest_object.package_id IS NULL  
    AND "group".name in (<group_names>)
);

Clear the harvest source in the sandbox test org.
Run db-sync-solr job manually to clear solr.

Jin-Sun-tts · 2022-10-18T13:43:56Z

Will do following cleanup today:

10-18-2022

 doc-gov                                        | 23425
 ca-gov                                         | 12455
 noaa-gov                                       | 10869

Jin-Sun-tts · 2022-10-24T19:24:38Z

10-24-2022 There is new one duplicate in dhs-gov today.

Jin-Sun-tts · 2022-10-24T19:30:50Z

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

FuhuXia · 2022-10-24T19:38:05Z

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

That is 50/sec on deleting, faster than 10/sec adding/updating. What makes the speed difference?

Jin-Sun-tts · 2022-10-24T19:46:43Z

The new delete function, only has one solr connection for all deletions.
And the adding/updating has new connection for each call.

Jin-Sun-tts · 2022-10-24T21:45:13Z

The following duplicates are also be cleared :
doc-gov 23425
noaa-gov 10869

So there is no duplicates in DB as of today. Will continue monitor for couple days to see if we get new duplicates.

Jin-Sun-tts · 2022-10-26T17:07:47Z

checked the duplicate today, no new item returned

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

name | count
------+-------
(0 rows)

FuhuXia · 2022-11-16T17:10:35Z

To eliminate packages that have no current harvest_object, we can use this query.

SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id 
LEFT JOIN harvest_object ON package.id = harvest_object.package_id AND harvest_object.current
WHERE package.state='active'
AND package.type='dataset'
AND harvest_object.package_id IS NULL
GROUP BY 1 ORDER BY 2 DESC;

   name    | count
-----------+-------
 noaa-gov  | 37695
 usda-gov  |    27
 epa-gov   |    25
 usaid-gov |     8
 fgdc-gov  |     1
 dot-gov   |     1
 doi-gov   |     1
(7 rows)

FuhuXia · 2022-11-21T20:07:11Z

query result returns 0. This ticket can be closed.

nickumia-reisys · 2023-02-02T22:30:40Z

I might have confused #3567 with this ticket. Or maybe the same thing happened twice? Even having 1.6 years of experience on data.gov, I wouldn't touch the production DB without Fuhu around. @Jin-Sun-tts did it twice!

This ticket removed the bad data from being searchable or discoverable by users. But it is still in the system (hence #3999). As @FuhuXia mentioned above, the validating query above shows success for this ticket.

jbrown-xentity added the bug Software defect or bug label Oct 12, 2022

jbrown-xentity added this to data.gov team board Oct 12, 2022

hkdctol moved this to 📔 Product Backlog in data.gov team board Oct 13, 2022

hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Oct 13, 2022

hkdctol mentioned this issue Oct 13, 2022

NUCA harvest job created duplicated datasets #3567

Closed

Jin-Sun-tts assigned FuhuXia and Jin-Sun-tts Oct 14, 2022

Jin-Sun-tts moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 14, 2022

Jin-Sun-tts mentioned this issue Oct 20, 2022

Optimize db solr sync to speed up the deletion of large amount solr indexes GSA/ckanext-geodatagov#228

Merged

4 tasks

jbrown-xentity mentioned this issue Oct 20, 2022

CKAN duplicate dataset (same harvest object) #3944

Closed

Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 26, 2022

Jin-Sun-tts moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Oct 27, 2022

FuhuXia mentioned this issue Nov 16, 2022

CKAN duplicate dataset (different harvest object) #3968

Closed

FuhuXia mentioned this issue Dec 15, 2022

Purge deleted datasets #3999

Open

1 task

nickumia-reisys closed this as completed Feb 2, 2023

FuhuXia mentioned this issue May 10, 2023

Dataset harvest object not in Solr #4079

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clear bad production data #4007

Clear bad production data #4007

jbrown-xentity commented Oct 12, 2022 •

edited by btylerburton

Loading

jbrown-xentity commented Oct 13, 2022

Jin-Sun-tts commented Oct 18, 2022

Jin-Sun-tts commented Oct 18, 2022

Jin-Sun-tts commented Oct 24, 2022

Jin-Sun-tts commented Oct 24, 2022

FuhuXia commented Oct 24, 2022

Jin-Sun-tts commented Oct 24, 2022

Jin-Sun-tts commented Oct 24, 2022

Jin-Sun-tts commented Oct 26, 2022

FuhuXia commented Nov 16, 2022

FuhuXia commented Nov 21, 2022

nickumia-reisys commented Feb 2, 2023 •

edited

Loading

Clear bad production data #4007

Clear bad production data #4007

Comments

jbrown-xentity commented Oct 12, 2022 • edited by btylerburton Loading

How to reproduce

Expected behavior

Actual behavior

Sketch

jbrown-xentity commented Oct 13, 2022

Jin-Sun-tts commented Oct 18, 2022

Jin-Sun-tts commented Oct 18, 2022

Jin-Sun-tts commented Oct 24, 2022

Jin-Sun-tts commented Oct 24, 2022

FuhuXia commented Oct 24, 2022

Jin-Sun-tts commented Oct 24, 2022

Jin-Sun-tts commented Oct 24, 2022

Jin-Sun-tts commented Oct 26, 2022

FuhuXia commented Nov 16, 2022

FuhuXia commented Nov 21, 2022

nickumia-reisys commented Feb 2, 2023 • edited Loading

jbrown-xentity commented Oct 12, 2022 •

edited by btylerburton

Loading

nickumia-reisys commented Feb 2, 2023 •

edited

Loading