Feature/organizations overview #26

Merged: 4 commits, Aug 29, 2022

24 changes: 24 additions & 0 deletions .github/workflows/check_org_duplicates.yml
@@ -0,0 +1,24 @@
---
name: Check org duplicates

on:
  workflow_dispatch:

jobs:
  check_org_duplicates:
    name: Check Org Duplicates
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@v2
      - name: Install pipenv
        run: pip install pipenv
      - name: build
        run: pipenv sync
      - name: run check
        run: pipenv run python duplicate-packages-organization.py --run-id=test
      - name: save file
        uses: actions/upload-artifact@v2
        with:
          name: org_overview
          path: org-duplicates-test.csv
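
Because this workflow declares only a `workflow_dispatch` trigger, it never runs automatically. As a sketch of one way to start it and fetch the resulting report (assuming the GitHub CLI is installed and authenticated against this repository, and the workflow exists on the default branch):

    $ gh workflow run check_org_duplicates.yml
    $ gh run download --name org_overview

The second command downloads the `org_overview` artifact saved by the upload step.
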
14 changes: 14 additions & 0 deletions .github/workflows/test.yml
@@ -5,6 +5,20 @@ on:
  push:

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@v2
      - name: Install pipenv
        run: pip install pipenv
      - name: build
        run: pipenv install --dev
      - name: lint stats
        run: pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
      - name: lint
        run: pipenv run flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --ignore=C901
  test:
    name: Build and Test
    runs-on: ubuntu-latest
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,3 +2,6 @@
*.egg-info
duplicate-packages-*.csv
removed-packages-*.log
org-duplicates-*.csv
*.bak
output.log
4 changes: 2 additions & 2 deletions Pipfile
@@ -5,11 +5,11 @@ verify_ssl = true

[dev-packages]
mock = "*"
pylint = "*"
flake8 = "*"

[packages]
requests = "*"
unicodecsv = "*"

[requires]
python_version = "2.7"
python_version = "3.8"
185 changes: 44 additions & 141 deletions Pipfile.lock

(Generated file; diff not rendered.)

11 changes: 11 additions & 0 deletions README.md
@@ -9,6 +9,8 @@ Install the dependencies.

$ pipenv sync

### De-duplicate

Deduplicate packages for a specific organization.

$ pipenv run python duplicates-identifier-api.py [organization-name]
@@ -56,6 +58,15 @@ optional arguments:
--verbose, -v Include verbose log output.
```

### Check for duplicates
To evaluate how many duplicates exist across organizations, use the
`duplicate-packages-organization.py` script:

$ pipenv run python duplicate-packages-organization.py

See `--help` for the latest options. This script is much lighter-weight than the de-duplication scripts above and takes less than a minute to run.

The output summarizes each organization and highlights duplication problems system-wide.
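
The `--run-id` argument (the CI workflow passes `--run-id=test`) appears to label the output file, so a run like the following would presumably write its report to `org-duplicates-nightly.csv` (`nightly` here is only an illustrative id):

    $ pipenv run python duplicate-packages-organization.py --run-id=nightly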

## Development

33 changes: 17 additions & 16 deletions dedupe/audit.py
@@ -1,4 +1,5 @@

from __future__ import absolute_import
import codecs
import unicodecsv as csv
from datetime import datetime
@@ -10,6 +11,7 @@

log = logging.getLogger(__name__)


class RemovedPackageLog(object):
    def __init__(self, filename=None, run_id=None):
        if not run_id:
@@ -33,21 +35,21 @@ def add(self, package):
class DuplicatePackageLog(object):
    # Order matters here for the report
    fieldnames = [
        'organization',  # Organization name
        'duplicate_id',  # Duplicate id (CKAN ID)
        'duplicate_title',  # Duplicate title (CKAN title)
        'duplicate_name',  # Duplicate name (CKAN name)
        'duplicate_url',  # Duplicate URL (site URL + CKAN name)
        'duplicate_metadata_created',  # Duplicate CKAN metadata_created from CKAN
        'duplicate_identifier',  # Duplicate POD metadata identifier (in CKAN extra)
        'duplicate_source_hash',  # Duplicate source_hash (in CKAN extra)
        'duplicate_is_collection',  # Duplicate is a collection dataset
        'duplicate_is_collection_member',  # Duplicate is a collection member
        'duplicate_harvest_source',  # Duplicate harvest_source_id (in CKAN extra)
        'retained_id',  # Retained id (CKAN id)
        'retained_url',  # Retained URL (site URL + CKAN name)
        'retained_metadata_created',  # Retained metadata_created
        'retained_harvest_source',  # Retained harvest_source_id (in CKAN extra)
        'organization',  # Organization name
        'duplicate_id',  # Duplicate id (CKAN ID)
        'duplicate_title',  # Duplicate title (CKAN title)
        'duplicate_name',  # Duplicate name (CKAN name)
        'duplicate_url',  # Duplicate URL (site URL + CKAN name)
        'duplicate_metadata_created',  # Duplicate CKAN metadata_created from CKAN
        'duplicate_identifier',  # Duplicate POD metadata identifier (in CKAN extra)
        'duplicate_source_hash',  # Duplicate source_hash (in CKAN extra)
        'duplicate_is_collection',  # Duplicate is a collection dataset
        'duplicate_is_collection_member',  # Duplicate is a collection member
        'duplicate_harvest_source',  # Duplicate harvest_source_id (in CKAN extra)
        'retained_id',  # Retained id (CKAN id)
        'retained_url',  # Retained URL (site URL + CKAN name)
        'retained_metadata_created',  # Retained metadata_created
        'retained_harvest_source',  # Retained harvest_source_id (in CKAN extra)
    ]

    def __init__(self, filename=None, api_url=None, run_id=None):
@@ -65,7 +67,6 @@ def __init__(self, filename=None, api_url=None, run_id=None):
            encoding='utf-8', fieldnames=DuplicatePackageLog.fieldnames)
        self.log.writeheader()


    def add(self, duplicate_package, retained_package):
        log.debug('Recording duplicate package to report package=%s', duplicate_package['id'])
        self.log.writerow({