- Linux / Windows 10 or higher with WSL installed
- Nodejs v19.8.1 or higher
- NPM v9.6.2 or higher
- Python v3.8
- PyPI v20.0.2 or higher

- Nodejs
- NPM
- TypeScript execution engine (ts-node)
- Python
- PyPI
We're generating the following data cubes:
- Care providers (Poskytovatelé zdravotních služeb)
- Population 2021 (Obyvatelé okresy 2021)
To install required project dependencies, run the following commands:
cd data-cubes
npm ci

Scripts for data cube generation can be launched from the data-cubes directory with:

npm start

To generate both data cubes and test their integrity constraints, run:

npm test

Input files are stored in the data-cubes/input directory. There are 3 source CSV files:
- care-providers-registry.csv (data source for the care providers data cube)
- population-cs-2021.csv (data source for the mean population data cube)
- county-codes.csv (mapping for translating county codes from population-cs-2021.csv into standardized NUTS codes)
The remaining files contain RDF schemas for the generated data cubes.
Output files are stored in the data-cubes/output directory. The following files will be generated into the output directory:
- care-providers.ttl (Care providers data cube with metadata)
- population.ttl (Mean population data cube with metadata)
- datasets.ttl (merged data cubes from care-providers.ttl and population.ttl)
The following scripts need to be run to generate both data cubes and to validate their integrity constraints:
- care-providers.ts
- population.ts
- constraints-validation.ts
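If you want to run one of the entry scripts on its own, ts-node (listed among the prerequisites) can execute it directly. The invocation below is only a sketch; the exact script paths depend on the repository layout:

```sh
# Paths are assumptions about the repository layout
cd data-cubes
npx ts-node care-providers.ts
```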
We're generating the same data cubes as in the previous section:
- Care providers (Poskytovatelé zdravotních služeb)
- Population 2021 (Obyvatelé okresy 2021)
This project needs to be run on a Linux machine or in the WSL environment.
First, open the project's directory:

cd airflow

Then install the dependencies for the Nodejs scripts from package.json:

npm ci

The next step is to configure a virtual environment for Python. You can install Apache Airflow from the exported dependencies in the requirements.txt file or follow the official installation guide.
To set up a Python virtual environment called venv with all required dependencies, run the following commands:
# Create Python venv
python3 -m venv venv
# Activate venv
. venv/bin/activate
# Install Python dependencies from requirements.txt
pip install -r requirements.txt

You'll also need to define a home directory for Apache Airflow:

export AIRFLOW_HOME=~/airflow

To finish the Airflow setup, run the following:
# Initialize a database
airflow db init
# Create an administrator
airflow users create --username "admin" --firstname "Harry" --lastname "Potter" --role "Admin" --email "harry.potter@gmail.com"
# Check the existing users
airflow users list

You should also update the generated airflow.cfg file:
- dags_folder needs to point to the dags directory, e.g. /home/kristyna/data-engineering/airflow/dags
- load_examples should be set to False if you want to avoid seeing tons of example DAGs
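For reference, the corresponding part of airflow.cfg (both options live in the [core] section) would then look roughly like this:

```ini
[core]
dags_folder = /home/kristyna/data-engineering/airflow/dags
load_examples = False
```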
You'll need 2 commands to launch Airflow. Each command needs to be triggered in its own terminal.
- airflow scheduler
- airflow webserver --port 8080
Now you can visit http://localhost:8080/home in your web browser. If you set load_examples = False in airflow.cfg,
you should only see one DAG - data-cubes.
DAGs can be easily triggered using the web interface. You need to provide a JSON configuration with an output_path field.
An example of the right configuration can be seen in dag-configuration.json. The generated data cubes will be stored in the directory specified by the output_path parameter. If you forget to pass this parameter, {dags_folder}/output will be used as the default location for output files.
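The repository file dag-configuration.json shows the real configuration; as a minimal illustration, a payload with just the output_path field could look like this (the path is a placeholder):

```json
{
  "output_path": "/home/kristyna/data-engineering/airflow/output"
}
```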
Input files are gathered in the airflow/dags/input directory. Three CSV files will be downloaded (same as in the Data Cubes project):
- care-providers-registry.csv (data source for the care providers data cube)
- population-cs-2021.csv (data source for the mean population data cube)
- county-codes.csv (mapping for translating county codes from population-cs-2021.csv into standardized NUTS codes)
The remaining files contain RDF schemas for the generated data cubes.
Output files will be stored into the output_path directory or {dags_folder}/output by default.
The following files will be generated:
- health_care.ttl (Care providers data cube with metadata)
- population.ttl (Mean population data cube with metadata)
- datasets.ttl (merged data cubes from health_care.ttl and population.ttl)
There's exactly one DAG defined in data-cubes.py. It specifies 6 tasks for the Airflow pipeline:
1. health_care_providers_download
2. population_2021_download
3. county_codes_download
4. care_providers_data_cube
5. population_data_cube
6. integrity_constraints_validation
Tasks 1-3 take care of downloading the source CSV files into the input directory.
Tasks 4 and 5 are responsible for generating the target data cubes health_care.ttl and population.ttl.
The last task merges the output datasets and validates them against a set of integrity constraints.
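As a rough illustration of how such a pipeline can be wired, here is a minimal DAG sketch. Only the dag_id and the six task ids come from the description above; the operator choice, the placeholder commands, and the exact dependency layout are assumptions, not a copy of data-cubes.py.

```python
# Minimal, illustrative sketch only -- not the project's actual data-cubes.py.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data-cubes",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually from the web UI
    catchup=False,
) as dag:
    # Tasks 1-3: download the source CSV files into the input directory.
    downloads = [
        BashOperator(task_id=task_id, bash_command="echo 'download a source CSV file'")
        for task_id in (
            "health_care_providers_download",
            "population_2021_download",
            "county_codes_download",
        )
    ]

    # Tasks 4-5: generate the target data cubes.
    care_providers = BashOperator(
        task_id="care_providers_data_cube",
        bash_command="echo 'generate health_care.ttl'",
    )
    population = BashOperator(
        task_id="population_data_cube",
        bash_command="echo 'generate population.ttl'",
    )

    # Task 6: merge the datasets and validate the integrity constraints.
    validation = BashOperator(
        task_id="integrity_constraints_validation",
        bash_command="echo 'merge datasets.ttl and validate constraints'",
    )

    downloads >> care_providers
    downloads >> population
    [care_providers, population] >> validation
```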
For a better idea of how the tasks are connected, see the DAG's graph visualization:
Scripts operating on CSV and RDF files were adapted from the Data Cubes project. They're stored in airflow/dags/scripts and are triggered by data-cubes.py. The most important (entry) scripts are again the following:
- care-providers.ts
- population.ts
- constraints-validation.ts
A provenance document describing the process of generating the datasets from the Data Cubes project can be found in the provenance directory. It is stored as a provenance.trig file in the RDF TriG format. It follows the PROV-O specification and its attached examples.
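As an illustration of the vocabulary involved (the concrete resources and IRIs in provenance.trig will differ), a PROV-O description of a generated dataset could look roughly like this in TriG:

```trig
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/provenance/> .   # hypothetical namespace

ex:provenance {
    ex:DataCubesDataset a prov:Entity ;
        prov:wasGeneratedBy ex:DataCubesGeneration .

    ex:DataCubesGeneration a prov:Activity ;
        prov:used ex:CareProvidersRegistryCsv ;
        prov:wasAssociatedWith ex:ScriptAuthor .

    ex:ScriptAuthor a prov:Agent .
}
```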
The datasets from the Data Cubes project are extended with a SKOS hierarchy and DCAT-AP metadata in the skos-and-dcat project.
For project installation & usage instructions, refer to the original Data Cubes section. The project structure is the same as in the data-cubes project, except for 2 changes:
Compared to the original Data Cubes project, metadata were removed from the population dataset completely and moved from the care-providers dataset into a separate dataset, care-providers-metadata. The metadata are described in the file skos-and-dcat/input/care-providers-metadata.ttl. The file's content is loaded into an RDF store, normalized, and dumped into skos-and-dcat/output/care-providers-metadata.ttl.
A SKOS hierarchy was employed for regions and counties in both data cubes, care-providers and population. Regions and counties are defined as separate SKOS concepts (skos:Concept) and connected into the hierarchy through broader and narrower relationships (skos:broader, skos:narrower).
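For illustration, such a region/county pair could be expressed in Turtle as follows; the IRIs and labels here are hypothetical, not the ones used in the generated cubes:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/codelist/> .   # hypothetical namespace

ex:PragueRegion a skos:Concept ;
    skos:prefLabel "Hlavní město Praha"@cs ;
    skos:narrower ex:PragueCounty .

ex:PragueCounty a skos:Concept ;
    skos:prefLabel "Praha"@cs ;
    skos:broader ex:PragueRegion .
```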
