Feature/compare #28

Merged 27 commits on Jan 4, 2024
6336bc7
add functionality to compare func.
rshewitt Dec 27, 2023
636ded5
add compare test
rshewitt Dec 27, 2023
bc4cfb9
refactor to one line
rshewitt Dec 27, 2023
661225b
move data to fixture and refactor.
rshewitt Dec 27, 2023
5e2dbe7
add assert on `do nothing`
rshewitt Dec 27, 2023
a544ec0
ignore integration tests
rshewitt Jan 3, 2024
9911661
add sansjson
rshewitt Jan 3, 2024
cb92509
add section on comparison.
rshewitt Jan 3, 2024
d19bfb5
general fixes, var updates, add more extras, and add ckan search.
rshewitt Jan 3, 2024
6dc9787
add util module
rshewitt Jan 3, 2024
88eafc8
change dcatus catalog
rshewitt Jan 3, 2024
8e43022
add ckan mock return
rshewitt Jan 3, 2024
7bc029f
add dcatus comparison delta file.
rshewitt Jan 3, 2024
068a866
add integration fixtures and test for comparison
rshewitt Jan 3, 2024
64c7d16
add real comparison fixture and test
rshewitt Jan 3, 2024
851382f
move fixture to integration test
rshewitt Jan 3, 2024
67f0592
update expected result.
rshewitt Jan 3, 2024
8961449
fix lint issues
rshewitt Jan 3, 2024
d419c50
bump pypi version to next minor.
rshewitt Jan 3, 2024
9661590
add back ckan entrypoint fixture
rshewitt Jan 3, 2024
f2dcc75
reorder keys in all datasets.
rshewitt Jan 3, 2024
09de3ed
add raw source fixture and with/without sort test.
rshewitt Jan 3, 2024
c259e33
add notes
rshewitt Jan 3, 2024
8cfe987
limit field returns from query
rshewitt Jan 3, 2024
b61562c
isort lint
robert-bryson Jan 3, 2024
c66c4c2
Readme.md fixup
robert-bryson Jan 3, 2024
c31a32e
add more load tests and fixtures
rshewitt Jan 4, 2024
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ clean-dist: ## Cleans dist dir
rm -rf dist/*

test: up ## Runs poetry tests, ignores ckan load
poetry run pytest --ignore=./tests/load/ckan
poetry run pytest --ignore=./tests/integration

up: ## Sets up local docker environment
docker compose up -d
Expand Down
73 changes: 62 additions & 11 deletions README.md
Expand Up @@ -5,35 +5,86 @@ transformation, and loading into the data.gov catalog.

## Features

The datagov-harvesting-logic offers the following features:

- Extract
- general purpose fetching and downloading of web resources.
- catered extraction to the following data formats:
- General purpose fetching and downloading of web resources.
- Catered extraction to the following data formats:
- DCAT-US
- Validation
- DCAT-US
- jsonschema validation using draft 2020-12.
- `jsonschema` validation using draft 2020-12.
- Load
- DCAT-US
- conversion of dcatu-us catalog into ckan dataset schema
- create, delete, update, and patch of ckan package/dataset
- Conversion of dcat-us catalog into ckan dataset schema
- Create, delete, update, and patch of ckan package/dataset

## Requirements

This project is using poetry to manage this project. Install [here](https://python-poetry.org/docs/#installation).
This project uses `poetry` for dependency management. Install it [here](https://python-poetry.org/docs/#installation).

Once installed, `poetry install` installs dependencies into a local virtual environment.

## Testing

### CKAN load testing

- CKAN load testing doesn't require the services provided in the `docker-compose.yml`.
- [catalog-dev](https://catalog-dev.data.gov/) is used for ckan load testing.
- Create an api-key by signing into catalog-dev.
- Create an api-key by signing into catalog-dev.
- Create a `credentials.py` file at the root of the project containing the variable `ckan_catalog_dev_api_key` assigned to the api-key.
- run tests with the command `poetry run pytest ./tests/load/ckan`
- Run tests with the command `poetry run pytest ./tests/load/ckan`

### Harvester testing
- These tests are found in `extract`, and `validate`. Some of them rely on services in the `docker-compose.yml`. run using docker `docker compose up -d` and with the command `poetry run pytest --ignore=./tests/load/ckan`.

- These tests are found in `extract` and `validate`. Some of them rely on services in the `docker-compose.yml`. Start the services with `docker compose up -d`, then run the tests with `poetry run pytest --ignore=./tests/load/ckan`.

If you followed the instructions for `CKAN load testing` and `Harvester testing` you can simply run `poetry run pytest` to run all tests.

## Comparison

- `./tests/harvest_sources/ckan_datasets_resp.json`
- Represents what ckan would respond with after querying for the harvest source name
- `./tests/harvest_sources/dcatus_compare.json`
- Represents a changed harvest source
- Created:
- datasets[0]

```diff
+ "identifier" = "cftc-dc10"
```

- Deleted:
- datasets[0]

```diff
- "identifier" = "cftc-dc1"
```

- Updated:
- datasets[1]

```diff
- "modified": "R/P1M"
+ "modified": "R/P1M Update"
```

- datasets[2]

```diff
- "keyword": ["cotton on call", "cotton on-call"]
+ "keyword": ["cotton on call", "cotton on-call", "update keyword"]
```

- datasets[3]

```diff
"publisher": {
"name": "U.S. Commodity Futures Trading Commission",
"subOrganizationOf": {
- "name": "U.S. Government"
+ "name": "Changed Value"
}
}
```

- `./tests/harvest_sources/dcatus.json`
- Represents an original harvest source prior to the change occurring.
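
The field-level changes listed above can be illustrated with a small helper. This is only a sketch: `changed_fields` is a hypothetical function that is not part of this PR, the identifier `example-1` is made up, and the two edits that the README applies to separate datasets (`modified` and `keyword`) are combined into one record for brevity.

```python
def changed_fields(before, after):
    """Return {key: (old, new)} for top-level keys whose values differ.

    Hypothetical helper for illustration; not part of the harvester package.
    """
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(keys)
        if before.get(key) != after.get(key)
    }


# Values mirror the README diff snippets; the identifier is invented.
before = {
    "identifier": "example-1",
    "modified": "R/P1M",
    "keyword": ["cotton on call", "cotton on-call"],
}
after = {
    "identifier": "example-1",
    "modified": "R/P1M Update",
    "keyword": ["cotton on call", "cotton on-call", "update keyword"],
}

delta = changed_fields(before, after)
for key, (old, new) in delta.items():
    print(f"{key}: {old!r} -> {new!r}")
```

The comparison added in this PR only reports which identifiers changed; a helper like this would be one way to inspect what changed inside an updated record.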
10 changes: 2 additions & 8 deletions harvester/__init__.py
Expand Up @@ -22,14 +22,8 @@
# TODO these imports will need to be updated to ensure a consistent api
from .compare import compare
from .extract import download_waf, extract, traverse_waf
from .load import (
create_ckan_package,
dcatus_to_ckan,
load,
patch_ckan_package,
purge_ckan_package,
update_ckan_package,
)
from .load import (create_ckan_package, dcatus_to_ckan, load,
patch_ckan_package, purge_ckan_package, update_ckan_package)
from .transform import transform
from .utils import *
from .validate import *
Expand Down
19 changes: 16 additions & 3 deletions harvester/compare.py
Expand Up @@ -3,9 +3,22 @@
logger = logging.getLogger("harvester")


# stub, TODO complete
def compare(compare_obj):
def compare(harvest_source, ckan_source):
"""Compares records"""
logger.info("Hello from harvester.compare()")

return compare_obj
output = {
"create": [],
"update": [],
"delete": [],
}

harvest_ids = set(harvest_source.keys())
ckan_ids = set(ckan_source.keys())
same_ids = harvest_ids & ckan_ids

output["create"] += list(harvest_ids - ckan_ids)
output["delete"] += list(ckan_ids - harvest_ids)
output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]

return output
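
Because the diff view drops indentation, here is the new `compare` function restored as runnable Python, followed by a small usage sketch. The function body follows the added lines above (logging omitted). The input dictionaries are assumptions: they map identifiers to placeholder hash strings, and the identifiers `cftc-dc2` and `cftc-dc3` are invented for illustration.

```python
def compare(harvest_source, ckan_source):
    """Compare two {identifier: value} mappings and bucket the identifiers."""
    output = {
        "create": [],
        "update": [],
        "delete": [],
    }

    harvest_ids = set(harvest_source.keys())
    ckan_ids = set(ckan_source.keys())
    same_ids = harvest_ids & ckan_ids

    # In the harvest source but not in CKAN: needs to be created.
    output["create"] += list(harvest_ids - ckan_ids)
    # In CKAN but no longer in the harvest source: needs to be deleted.
    output["delete"] += list(ckan_ids - harvest_ids)
    # Present in both, but the stored values differ: needs to be updated.
    output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]

    return output


# Placeholder values stand in for whatever is compared (for example, hashes).
harvest_source = {"cftc-dc10": "hash-a", "cftc-dc2": "hash-b-new", "cftc-dc3": "hash-c"}
ckan_source = {"cftc-dc1": "hash-z", "cftc-dc2": "hash-b-old", "cftc-dc3": "hash-c"}

result = compare(harvest_source, ckan_source)
print(result)
```

Since the buckets are built from sets, the order of identifiers within each list is not guaranteed when a bucket holds more than one entry.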
64 changes: 47 additions & 17 deletions harvester/load.py
Expand Up @@ -3,6 +3,8 @@

import ckanapi

from harvester.utils.util import sort_dataset

logger = logging.getLogger("harvester")


Expand All @@ -21,7 +23,7 @@ def create_ckan_extra_base(*args):
return [{"key": d[0], "value": d[1]} for d in data]


def create_ckan_extras_additions(dcatus_catalog, additions):
def create_ckan_extras_additions(dcatus_dataset, additions):
extras = [
"accessLevel",
"bureauCode",
Expand All @@ -35,10 +37,13 @@ def create_ckan_extras_additions(dcatus_catalog, additions):

for extra in extras:
data = {"key": extra, "value": None}
val = dcatus_dataset[extra]
if extra == "publisher":
data["value"] = dcatus_catalog[extra]["name"]
data["value"] = val["name"]
else:
data["value"] = dcatus_catalog[extra]
if isinstance(val, list): # TODO: confirm this is what we want.
val = val[0]
data["value"] = val
output.append(data)

return output + additions
Expand Down Expand Up @@ -70,21 +75,28 @@ def get_email_from_str(in_str):
return res.group(0)


def create_ckan_resources(dists):
def create_ckan_resources(dcatus_dataset):
output = []

for dist in dists:
if "distribution" not in dcatus_dataset:
return output

for dist in dcatus_dataset["distribution"]:
url_key = "downloadURL" if "downloadURL" in dist else "accessURL"
resource = {"url": dist[url_key], "mimetype": dist["mediaType"]}
resource = {"url": dist[url_key]}
if "mediaType" in dist:  # check the DCAT-US key that is read on the next line
resource["mimetype"] = dist["mediaType"]

output.append(resource)

return output


def simple_transform(dcatus_catalog):
def simple_transform(dcatus_dataset):
output = {
"name": "-".join(dcatus_catalog["title"].lower().split()),
"owner_org": "test",
"name": "-".join(dcatus_dataset["title"].lower().split()),
"owner_org": "test", # TODO: CHANGE THIS!
"identifier": dcatus_dataset["identifier"],
}

mapping = {
Expand All @@ -93,14 +105,17 @@ def simple_transform(dcatus_catalog):
"title": "title",
}

for k, v in dcatus_catalog.items():
for k, v in dcatus_dataset.items():
if k not in mapping:
continue
if isinstance(mapping[k], dict):
temp = {}
to_skip = ["@type"]
for k2, v2 in v.items():
if k2 == "hasEmail":
v2 = get_email_from_str(v2)
if k2 in to_skip:
continue
temp[mapping[k][k2]] = v2
output = {**output, **temp}
else:
Expand All @@ -116,7 +131,7 @@ def create_defaults():
}


def dcatus_to_ckan(dcatus_catalog):
def dcatus_to_ckan(dcatus_dataset, harvest_source_name):
"""
example:
- from this:
Expand All @@ -126,23 +141,34 @@ def dcatus_to_ckan(dcatus_catalog):

"""

output = simple_transform(dcatus_catalog)
output = simple_transform(dcatus_dataset)

resources = create_ckan_resources(dcatus_catalog["distribution"])
tags = create_ckan_tags(dcatus_catalog["keyword"])
pubisher_hierarchy = create_ckan_publisher_hierarchy(dcatus_catalog["publisher"])
resources = create_ckan_resources(dcatus_dataset)
tags = create_ckan_tags(dcatus_dataset["keyword"])
pubisher_hierarchy = create_ckan_publisher_hierarchy(
dcatus_dataset["publisher"], []
)

extras_base = create_ckan_extra_base(
pubisher_hierarchy, "Dataset", dcatus_catalog["publisher"]["name"]
pubisher_hierarchy, "Dataset", dcatus_dataset["publisher"]["name"]
)
extras = create_ckan_extras_additions(dcatus_catalog, extras_base)
extras = create_ckan_extras_additions(dcatus_dataset, extras_base)

defaults = create_defaults()

output["resources"] = resources
output["tags"] = tags

output["extras"] = extras_base
output["extras"] += extras
output["extras"] += [
{
"key": "dcat_metadata",
"value": str(sort_dataset(dcatus_dataset)),
}
]

output["extras"] += [{"key": "harvest_source_name", "value": harvest_source_name}]

return {**output, **defaults}

Expand All @@ -167,3 +193,7 @@ def update_ckan_package(ckan, update_data):

def purge_ckan_package(ckan, package_data):
return ckan.action.dataset_purge(**package_data)


def search_ckan(ckan, query):
return ckan.action.package_search(**query)
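
A rough sketch of how `search_ckan` might be called. Everything except `search_ckan` itself is an assumption: `FakeCKAN` and `FakeAction` are hypothetical stand-ins for a `ckanapi` client so the example runs without a CKAN instance, and the query string is only one plausible way to search by harvest source name.

```python
class FakeAction:
    """Hypothetical stand-in for the `action` attribute of a ckanapi client."""

    def package_search(self, **kwargs):
        # Record the keyword arguments so the call can be inspected.
        self.last_query = kwargs
        return {"count": 0, "results": []}


class FakeCKAN:
    """Hypothetical stand-in for a ckanapi client object."""

    def __init__(self):
        self.action = FakeAction()


def search_ckan(ckan, query):
    # Same body as in the diff: the dict is unpacked into keyword arguments.
    return ckan.action.package_search(**query)


ckan = FakeCKAN()
# Assumed query shape; the field name and source name are illustrative only.
query = {"q": 'harvest_source_name:"example-source"', "rows": 10}

response = search_ckan(ckan, query)
print(response["count"], len(response["results"]))
```

Passing the query as a dict keeps the wrapper thin: any parameter that `package_search` accepts can be supplied without changing `search_ckan`.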
4 changes: 2 additions & 2 deletions harvester/utils/__init__.py
@@ -1,3 +1,3 @@
from . import json
from . import json, util

__all__ = ["json"]
__all__ = ["json", "util"]
12 changes: 12 additions & 0 deletions harvester/utils/util.py
@@ -0,0 +1,12 @@
import hashlib
import json

import sansjson


def sort_dataset(d):
return sansjson.sort_pyobject(d)


def dataset_to_hash(d):
return hashlib.sha256(json.dumps(d, sort_keys=True).encode("utf-8")).hexdigest()
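
A short sketch of what `dataset_to_hash` does and does not normalize, using only the standard library. The sample records are invented. The closing remark about `sort_dataset` is an assumption about intent, not something this diff states.

```python
import hashlib
import json


def dataset_to_hash(d):
    # Same body as in the diff: serialize with sorted keys, then hash.
    return hashlib.sha256(json.dumps(d, sort_keys=True).encode("utf-8")).hexdigest()


a = {"title": "Cotton On-Call", "keyword": ["cotton on call", "cotton on-call"]}
# Same content, keys in a different order.
b = {"keyword": ["cotton on call", "cotton on-call"], "title": "Cotton On-Call"}
# Same content, list items in a different order.
c = {"title": "Cotton On-Call", "keyword": ["cotton on-call", "cotton on call"]}

same_despite_key_order = dataset_to_hash(a) == dataset_to_hash(b)
same_despite_list_order = dataset_to_hash(a) == dataset_to_hash(c)

print(same_despite_key_order)
print(same_despite_list_order)
```

`sort_keys=True` makes the hash independent of dictionary key order, but list order still affects the result. Normalizing a record with `sort_dataset` before hashing may be intended to cover that case.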
15 changes: 13 additions & 2 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "datagov-harvesting-logic"
version = "0.0.4"
version = "0.1.0"
description = ""
# authors = [
# {name = "Jin Sun", email = "jin.sun@gsa.gov"},
Expand All @@ -25,6 +25,7 @@ deepdiff = ">=6"
pytest = ">=7.3.2"
ckanapi = ">=4.7"
beautifulsoup4 = "^4.12.2"
sansjson = "^0.3.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.3.0"
Expand Down