NACC Form Validator

Data quality rules validation module for NACC form data.

The validator is based on the Cerberus Python library, which validates data against quality rules that are themselves defined as data. See the Cerberus usage examples for more detail.

See the Usage doc for a quick-start usage guide with examples. In general, all documentation outside this README lives under docs.

Setup

Before getting started, we recommend doing your installation and work in a virtual environment. You can set one up with the following commands:

# create; use a Python version that matches the interpreter specified in pants.toml (currently Python 3.11)
python3.11 -m venv path/to/your/venv

# activate
source path/to/your/venv/bin/activate

# deactivate
deactivate

Next, you'll need a distribution of this package. You can either

  1. clone the repository and build a distribution locally, or
  2. reference a distribution attached to a release on GitHub.

Once you have the distribution, install it with the following command, replacing VERSION with the version of the distribution you obtained:

pip3 install dist/nacc_form_validator-VERSION-py3-none-any.whl
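If you built the distribution locally and are unsure which version ended up in the dist directory, a small helper along these lines can list the matching wheel files. This is only a sketch: the `find_wheels` function is hypothetical and simply assumes the wheel naming pattern shown in the command above.

```python
from pathlib import Path


def find_wheels(dist_dir, package="nacc_form_validator"):
    """Return wheel files for `package` found in `dist_dir`, sorted by name."""
    dist_path = Path(dist_dir)
    if not dist_path.is_dir():
        return []
    # Assumes the naming pattern nacc_form_validator-VERSION-py3-none-any.whl.
    return sorted(dist_path.glob(f"{package}-*-py3-none-any.whl"))


if __name__ == "__main__":
    wheels = find_wheels("dist")
    if wheels:
        for wheel in wheels:
            # Print the install command for each wheel that was found.
            print(f"pip3 install {wheel}")
    else:
        print("No wheel found in dist/; build or download a distribution first.")
```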

Installing Pants

This repository uses pants for developing and building the distributions.

Install pants with one of the following. See Installing Pants for more information.

For Linux:

bash get-pants.sh

For macOS:

brew install pantsbuild/tap/pants

You will need to make sure that you have a Python version compatible with the interpreter set in the pants.toml file.

The repo has a VSCode devcontainer configuration that ensures a compatible Python is available. You need Docker installed, and VSCode with Dev Containers enabled. To set this up, follow the Dev Containers tutorial up to the "Check Installation" step.

Formatting and Linting

To format and lint with pants, run:

pants fmt nacc_form_validator::   # fix formatting
pants lint nacc_form_validator::  # run the linter

Testing

To test with pants, run:

# use the --test-force flag to ignore the cache and force all tests to run
pants test ::

Building a Distribution

To package the distribution with pants, run:

pants package nacc_form_validator:dist

This builds sdist and wheel distributions in the dist directory.

The version number on the distribution files is set in the validator/BUILD file.

Common Build Issues

Incompatible Python Interpreter

If you do not have a Python version compatible with the interpreter set in the pants.toml file, building the distribution will fail with an error similar to the following:

Examined the following interpreters:
1.)    /opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/bin/python3.12 CPython==3.12.5
2.) /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/bin/python3.9 CPython==3.9.6

No interpreter compatible with the requested constraints was found:

  Version matches CPython==3.11.*

As mentioned earlier, you can use the VSCode devcontainer configuration. Otherwise, you need to set up an environment (preferably a virtual one) with the correct Python version (in this case, Python 3.11).
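To confirm that the interpreter in your active environment satisfies the constraint before building, a quick check like the following may help. It is a sketch only: the `REQUIRED` value and the `matches_required` helper are assumptions that mirror the `CPython==3.11.*` constraint in the error above, and they are not read from pants.toml.

```python
import sys

# Assumption: mirrors the CPython==3.11.* constraint shown in the error above.
REQUIRED = (3, 11)


def matches_required(version_info, required=REQUIRED):
    """Return True when the major and minor version equal the required pair."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_required(sys.version_info):
        print(f"Python {current} satisfies the 3.11 constraint")
    else:
        print(f"Python {current} does not satisfy the 3.11 constraint")
```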

macOS Incompatible Architecture

On macOS, if you see a long error that ends with the following when trying to build the distribution:

(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64'))

make sure that the pants_version in pants.toml is >=2.22.0.