Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear bad production data #4007

Closed
35 tasks done
jbrown-xentity opened this issue Oct 12, 2022 · 12 comments
Closed
35 tasks done

Clear bad production data #4007

jbrown-xentity opened this issue Oct 12, 2022 · 12 comments
Assignees
Labels
bug Software defect or bug component/catalog Related to catalog component playbooks/roles support Issues from agency requests or affecting users

Comments

@jbrown-xentity
Copy link
Contributor

jbrown-xentity commented Oct 12, 2022

Since we re-implemented the db-solr sync, we found that we have data in a bad state sitting in prod (not indexed on solr, but still valid).
We need to clear this bad data

How to reproduce

  1. Examine duplicate records that don't have a correct/corresponding record in harvest_object table. A full table scan doesn't seem to complete, but you can look by harvest source:
SELECT COUNT(*), max(package.metadata_created) 
FROM package 
LEFT JOIN "group" ON package.owner_org = "group".id 
WHERE "group".name = 'dhs-gov' 
AND package.type = 'dataset' 
AND package.id NOT IN (SELECT package_id FROM harvest_object WHERE harvest_source_id = '803bdba9-bfcb-453c-ae2a-ed81f240ff5a' AND current);

Expected behavior

The above query should result in 0 datasets

Actual behavior

Thousands

Sketch

The following organizations have duplicates, this may be affecting all of them (or just some), and they are sorted in highest value order:

  • noaa-gov
  • doi-gov
  • doc-gov
  • ca-gov
  • dhs-gov
  • city-of-ferndale-michigan
  • lake-county-illinois
  • hhs-gov
  • state-of-oklahoma
  • ed-gov
  • usaid-gov
  • city-of-new-york
  • doe-gov
  • epa-gov
  • federal-laboratory-consortium
  • state-of-connecticut
  • usitc-gov
  • king-county-washington
  • city-of-tempe
  • state-of-new-york
  • national-institute-of-standards-and-technology
  • centers-for-disease-control-and-prevention
  • dot-gov
  • city-of-baltimore
  • city-of-austin
  • usda-gov
  • city-of-bloomington
  • city-of-los-angeles
  • va-gov
  • state-of-maryland
  • fcc-gov
  • rrb-gov
  • city-of-baton-rouge
  • fgdc-gov
  • doj-gov

The process to follow for each organization:

  • Find the offending harvest source (if more than one, will involve more analysis)
  • Get the harvest source id (via UI)
  • Get the organization id: SELECT id FROM "group" WHERE name = 'doc-gov';
  • If more than one harvest source for the org, need to update the following query
  • Replace the following query values org-id and harvest-source-id with the values above
UPDATE package 
SET state = 'to_delete' 
WHERE package.owner_org = 'org-id' 
AND package.type = 'dataset' 
AND package.id NOT IN (
  SELECT package_id 
  FROM harvest_object 
  WHERE harvest_source_id = 'harvest-source-id' 
  AND current
);
  • Clear the harvest source (either via UI or via run-task)
  • Re-Harvest the source

Please note that the above marking to_delete is to match ckanext-harvest clearing.
We also need to delete the records in SOLR, so we need to do a full harvest clear. An alternative approach would be manually running the db-solr-sync job after deleting the records, and validating that the job removed the records from solr...

@jbrown-xentity jbrown-xentity added the bug Software defect or bug label Oct 12, 2022
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Oct 13, 2022
@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Oct 13, 2022
@jbrown-xentity
Copy link
Contributor Author

Should review #3742 after resolution of this ticket, as that may be resolved as well.

@Jin-Sun-tts Jin-Sun-tts moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 14, 2022
@Jin-Sun-tts
Copy link
Contributor

The following SQL script picks up the duplicates (from #3567)

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

10-17-2022

doc-gov                                        | 23425
ca-gov                                         | 12455
noaa-gov                                       | 10869
city-of-ferndale-michigan                      |   795
usaid-gov                                      |   511
city-of-seattle                                |   494
state-of-connecticut                           |   474
lake-county-illinois                           |   348
vcgi-org                                       |   186
loudoun-county-virginia                        |   147
hhs-gov                                        |   138
city-of-baltimore                              |   111
state-of-oklahoma                              |   100
ed-gov                                         |    90
doe-gov                                        |    64
usgs-gov                                       |    54
city-of-new-york                               |    40
sba-gov                                        |    36
national-institute-of-standards-and-technology |    35
wake-county                                    |    33
epa-gov                                        |    33
doi-gov                                        |    30
federal-laboratory-consortium                  |    29
city-of-sioux-falls                            |    14
dot-gov                                        |    13
city-of-austin                                 |    11
usitc-gov                                      |     6
city-of-tempe                                  |     5
king-county-washington                         |     5
state-of-new-york                              |     5
centers-for-disease-control-and-prevention     |     4
fema-gov                                       |     3
usda-gov                                       |     3
city-of-los-angeles                            |     2
state-of-maryland                              |     2
city-of-bloomington                            |     2
va-gov                                         |     2
state-gov                                      |     1
rrb-gov                                        |     1
census-gov                                     |     1
doj-gov                                        |     1
city-of-baton-rouge                            |     1
fcc-gov                                        |     1 

Following sql to update the package state:

UPDATE package 
SET state = 'to_delete' 
where id in (
    SELECT package.id
    FROM package
    JOIN "group" ON package.owner_org = "group".id 
    LEFT JOIN harvest_object ON package.id = harvest_object.package_id 
    WHERE package.state='active' AND package.type='dataset'   
    AND harvest_object.package_id IS NULL  
    AND "group".name in (<group_names>)
);

Clear the harvest source in the sandbox test org.
Run db-sync-solr job manually to clear solr.

@Jin-Sun-tts
Copy link
Contributor

Will do following cleanup today:

10-18-2022

 doc-gov                                        | 23425
 ca-gov                                         | 12455
 noaa-gov                                       | 10869

@Jin-Sun-tts
Copy link
Contributor

10-24-2022 There is new one duplicate in dhs-gov today.

@Jin-Sun-tts
Copy link
Contributor

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

@FuhuXia
Copy link
Member

FuhuXia commented Oct 24, 2022

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

That is 50/sec on deleting, faster than 10/sec adding/updating. What makes the speed difference?

@Jin-Sun-tts
Copy link
Contributor

The new delete function, only has one solr connection for all deletions.
And the adding/updating has new connection for each call.

@Jin-Sun-tts
Copy link
Contributor

The following duplicates are also be cleared :
doc-gov 23425
noaa-gov 10869

So there is no duplicates in DB as of today. Will continue monitor for couple days to see if we get new duplicates.

@Jin-Sun-tts
Copy link
Contributor

checked the duplicate today, no new item returned

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

name | count
------+-------
(0 rows)

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 26, 2022
@Jin-Sun-tts Jin-Sun-tts moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Oct 27, 2022
@FuhuXia
Copy link
Member

FuhuXia commented Nov 16, 2022

To eliminate packages that have no current harvest_object, we can use this query.

SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id 
LEFT JOIN harvest_object ON package.id = harvest_object.package_id AND harvest_object.current
WHERE package.state='active'
AND package.type='dataset'
AND harvest_object.package_id IS NULL
GROUP BY 1 ORDER BY 2 DESC;

   name    | count
-----------+-------
 noaa-gov  | 37695
 usda-gov  |    27
 epa-gov   |    25
 usaid-gov |     8
 fgdc-gov  |     1
 dot-gov   |     1
 doi-gov   |     1
(7 rows)

@FuhuXia
Copy link
Member

FuhuXia commented Nov 21, 2022

query result returns 0. This ticket can be closed.

@nickumia-reisys
Copy link
Contributor

nickumia-reisys commented Feb 2, 2023

I might have confused #3567 with this ticket. Or maybe the same thing happened twice? Even having 1.6 years of experience on data.gov, I wouldn't touch the production DB without Fuhu around. @Jin-Sun-tts did it twice!

This ticket removed the bad data from being searchable or discoverable by users. But it is still in the system (hence #3999). As @FuhuXia mentioned above, the validating query above shows success for this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Software defect or bug component/catalog Related to catalog component playbooks/roles support Issues from agency requests or affecting users
Projects
Archived in project
Development

No branches or pull requests

4 participants