diff --git a/docs/api_design.md b/docs/api_design.md
new file mode 100644
index 0000000..abda6c2
--- /dev/null
+++ b/docs/api_design.md
@@ -0,0 +1,229 @@
+# DANDER API Design
+
+DANDER (Data Analytics & Data Engineering Resources) gathers a wide range of
+small utilities that make a data practitioner's day a bit easier. The project
+offers a Python API and a Typer-powered CLI that wrap common patterns found in
+the data ecosystem. Each module is intentionally lightweight so that teams can
+mix and match only the pieces they need.
+
+## Package layout
+
+```text
+dander/
+├── cli.py                # Command line interface
+├── io/                   # File input/output helpers
+│   ├── json_utils.py     # JSON formatting, splitting, validation
+│   ├── csv_utils.py      # CSV helpers
+│   ├── excel_utils.py    # Spreadsheet helpers (xls/xlsx/ods)
+│   └── convert.py        # Convert between tabular formats
+├── transform/            # Data transformation helpers
+│   ├── dataframe.py      # Pandas / Polars helpers
+│   └── dates.py          # Calendar arithmetic and parsing
+├── profile/              # Dataset profiling
+│   └── pandas_profile.py
+├── db/                   # Database helpers
+│   ├── connection.py     # Connection helpers (SQLAlchemy, etc.)
+│   ├── query.py          # Simplified query execution
+│   ├── duckdb.py         # DuckDB convenience wrappers
+│   └── sqlite.py         # Thin layer over Python's sqlite3
+├── datapack/             # Frictionless data package helpers
+│   └── frictionless.py
+├── dbt/                  # Light wrappers around dbtRunner
+│   ├── runner.py         # run/test/seed/compile/ls
+│   └── artifacts.py      # Read manifest, run results, catalog
+├── validation/           # Data quality checks
+│   └── checks.py
+├── stdlib/               # Ergonomic wrappers over stdlib modules
+│   ├── path.py           # Pathlib convenience helpers
+│   ├── compression.py    # gzip/zip/tar helpers
+│   └── csvlib.py         # High-level wrappers for the csv module
+├── ids/                  # Unique identifier utilities
+│   ├── uuid_utils.py     # UUID1/4/6/7
+│   └── ulid_utils.py     # ULID generation and parsing
+└── hashing/              # Hashing and checksums
+    └── digest.py         # sha256/md5/xxhash/crc32 helpers
+```
+
+## Key modules and ideas
+
+### I/O helpers and file conversion
+
+The `io` package focuses on mundane file operations:
+
+- `json_utils` formats JSON, splits documents, and performs schema
+  validation using `jsonschema`.
+- `csv_utils` wraps the standard library `csv` module and pandas for
+  quick CSV inspection.
+- `excel_utils` uses `openpyxl`/`odfpy` to read and write spreadsheet
+  files, normalising column names and data types.
+- `convert` provides one-liners to convert between JSON, CSV, Parquet and
+  Arrow by leaning on `pandas` and `pyarrow`.
+
+```bash
+# Convert CSV to Parquet
+$ dander convert input.csv output.parquet
+
+# Preview a CSV file
+$ dander csv head data.csv --rows 20
+```
+
+### Data transformations
+
+`transform` groups helpers for small but frequent operations:
+
+- `dataframe.py` offers column renaming, flattening of nested structures
+  and lightweight schema enforcement for pandas/Polars DataFrames.
+- `dates.py` standardises date parsing, timezone handling, business day
+  offsets and rolling window calculations by wrapping `datetime`,
+  `dateutil`, and `zoneinfo`.
+
+```python
+from dander.transform import dates
+
+dates.parse_iso("2024-03-02T10:15:00Z")
+dates.month_range("2024-01-01", months=3)
+```
+
+### Data profiling
+
+Inspired by tools like
+[ydata-profiling](https://github.com/ydataai/ydata-profiling), the
+`profile` module produces quick HTML or Markdown reports that summarise a
+dataset.
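+A minimal sketch of the Python entry point, mirroring the call used in the
+Example Python API section below:
+
+```python
+from dander.profile.pandas_profile import profile
+
+# Produce a single-file HTML report summarising the dataset on disk
+profile("df.parquet", output="report.html")
+```
+
+The same report from the command line: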
+
+```bash
+$ dander profile df.parquet --output report.html
+```
+
+### Database helpers and lightweight engines
+
+The `db` package abstracts common SQL workflows:
+
+- `connection.py` standardises creating SQLAlchemy engines and managing
+  credentials.
+- `query.py` executes parametrised SQL and returns DataFrames.
+- `duckdb.py` exposes in-process DuckDB for ad-hoc analytics and easy
+  integration with Parquet/CSV/Arrow sources.
+- `sqlite.py` leans on the Python stdlib's `sqlite3` for tiny
+  reproducible datasets.
+
+```python
+from dander.db import duckdb
+
+with duckdb.session() as con:
+    con.sql("CREATE TABLE numbers AS SELECT range AS id FROM range(10)")
+    df = con.sql("SELECT * FROM numbers").df()
+```
+
+### Data package integrations
+
+To encourage reproducibility, `datapack` wraps the
+[Frictionless](https://github.com/frictionlessdata/frictionless-py)
+library. Utilities allow quick creation, validation and publishing of
+Data Packages.
+
+```bash
+# Validate a datapackage.json
+$ dander datapack validate datapackage.json
+```
+
+### dbt integrations
+
+DANDER bundles thin wrappers around `dbtRunner` so Python scripts and CLI
+commands can orchestrate dbt without invoking the shell. The `runner`
+module exposes `run`, `test`, `seed`, `compile` and `ls` functions that
+return structured results. `artifacts.py` provides helpers to read the
+`manifest.json`, `run_results.json` and `catalog.json` files and expose
+them as DataFrames for further analysis.
+
+```python
+from dander.dbt.runner import run
+
+res = run(project_dir="my_project", select="stg_users")
+print(res.status, res.execution_time)
+```
+
+### Standard-library conveniences
+
+The `stdlib` package offers ergonomic wrappers over modules that data
+teams touch regularly but find clunky to use directly:
+
+- `path.py` adds helpers for temporary directories and atomic file
+  writes.
+- `compression.py` unifies the `gzip`, `zipfile` and `tarfile` interfaces to
+  make archive creation and extraction trivial.
+- `csvlib.py` layers a pandas-like interface over the built-in `csv`
+  reader for quick scripting.
+
+### Date and time utilities
+
+While many projects end up writing their own date helpers, DANDER's
+`transform.dates` module consolidates common needs: ISO 8601 parsing,
+timezone conversions, calendar arithmetic, relative date ranges and
+business day calculations. The aim is to surface the most useful bits of
+`datetime`, `dateutil`, and `pandas` behind one small, consistent
+interface.
+
+### Unique identifiers and hashing
+
+The `ids` and `hashing` packages provide predictable interfaces for
+generating identifiers and verifying data integrity:
+
+- `ids.uuid_utils` exposes UUID1/4/6/7 helpers.
+- `ids.ulid_utils` generates and parses ULIDs for lexicographically
+  sortable IDs.
+- `hashing.digest` computes md5/sha256/xxhash digests and file checksums
+  and includes helpers for comparing directories.
+
+```bash
+$ dander ids uuid --version 7 --count 3
+$ dander hash sha256 path/to/file.csv
+```
+
+### Validation
+
+`validation.checks` allows developers to declare simple row and column
+rules and run them against DataFrames or CSV files, returning structured
+error reports that can feed into CI pipelines.
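+A sketch of what declaring checks could look like; the rule helpers and
+report fields here are illustrative, not a settled API:
+
+```python
+from dander.validation import checks
+
+# Hypothetical rule declarations -- names are illustrative
+report = checks.run(
+    "customers.csv",
+    rules=[
+        checks.not_null("customer_id"),
+        checks.unique("customer_id"),
+        checks.in_range("age", minimum=0, maximum=120),
+    ],
+)
+print(report.ok, len(report.errors))
+```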
+
+## Example Python API
+
+```python
+from dander.io import convert
+from dander.profile.pandas_profile import profile
+from dander.ids import ulid_utils
+
+# Convert a CSV file and profile the result
+convert.to_parquet("customers.csv", "customers.parquet")
+profile("customers.parquet", output="profile.html")
+
+# Generate a sortable identifier
+order_id = ulid_utils.new()
+```
+
+## CLI overview
+
+Each top-level package registers a CLI subcommand:
+
+```bash
+$ dander json format data.json --write
+$ dander csv to-json data.csv out.json
+$ dander db duckdb-query "select count(*) from 'data.parquet'"
+$ dander dbt run --project-dir my_dbt_project
+$ dander ids ulid --count 5
+$ dander hash md5 customers.parquet
+```
+
+## Design principles
+
+* **Simple abstractions** – prefer a thin layer over existing libraries.
+* **Composable** – functions return data structures that can be piped into the
+  next utility.
+* **Typed** – use type hints throughout for discoverability and tooling.
+* **Observable** – rely on `loguru` for logging and keep it opt-in from the CLI.
+* **Cross-platform** – keep dependencies light and avoid OS-specific features.
+* **Batteries included** – expose common stdlib features behind ergonomic APIs so
+  analysts don't have to remember incidental details.
+
+This outline gives DANDER space to grow into a convenient toolkit that
+mirrors the daily workflow of modern data professionals.
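+
+As a closing sketch of the **Composable** principle, several helpers can be
+chained in a few lines. `convert.to_parquet` and `duckdb.session` match the
+examples above; `digest.sha256_file` is a hypothetical helper name:
+
+```python
+from dander.io import convert
+from dander.db import duckdb
+from dander.hashing import digest
+
+# Convert the raw file, query it in-process, then checksum the artefact
+convert.to_parquet("events.csv", "events.parquet")
+with duckdb.session() as con:
+    total = con.sql("SELECT count(*) AS n FROM 'events.parquet'").df()
+print(total)
+print(digest.sha256_file("events.parquet"))  # hypothetical helper
+```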