⚡️(app) improve CSV and Parquet generation
Using Pandas allows blazing-fast data conversion through a DataFrame pivot
(from SQL to CSV in particular). We have re-implemented all streamers to
use a Pandas/PyArrow combination.
jmaupetit committed Jul 15, 2024
1 parent 59c95fd commit b2f1c8c
Showing 14 changed files with 1,199 additions and 263 deletions.
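
In short, the streamers now pull SQL query results through Pandas in chunks instead of converting rows one by one. The snippet below is a minimal sketch of that idea, not the project's actual code; the `stream_csv` helper, the example database URL and the query are illustrative assumptions.

```python
# Hedged sketch of chunked SQL-to-CSV streaming with Pandas; the helper name,
# database URL and query are illustrative, not taken from the repository.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///chinook.db")  # example database


def stream_csv(query: str, chunk_size: int = 5000):
    """Yield CSV text chunk by chunk instead of materializing all rows."""
    header = True
    with engine.connect() as connection:
        for chunk in pd.read_sql(query, connection, chunksize=chunk_size):
            # Only the first chunk emits the CSV header row.
            yield chunk.to_csv(index=False, header=header)
            header = False


for part in stream_csv("SELECT * FROM Invoice"):
    ...  # e.g. write each part to an HTTP streaming response
```

A Parquet equivalent would feed each chunk to a `pyarrow.parquet.ParquetWriter` instead of serializing it to CSV text.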
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,10 @@ and this project adheres to

- Integrate pyinstrument profiler

### Changed

- Improve performance using Pandas

## [0.5.0] - 2024-07-08

### Changed
78 changes: 67 additions & 11 deletions docs/configuration.md
@@ -107,11 +107,55 @@ Default: `/d`

---

#### `PARQUET_BATCH_SIZE`
#### `CHUNK_SIZE`

Default batch size to process for Parquet files.
Size of batches to process, _i.e._ the number of SQL query result rows to
process at each iteration.

Default: 100
Default: `5000`

---

#### `SCHEMA_SNIFFER_SIZE`

The number of SQL query result rows used to infer a table schema (data types).

Default: `1000`

---

#### `DEFAULT_DTYPE_BACKEND`

The backend used to infer data types while fetching data from the database.
Possible values are: `numpy_nullable` or `pyarrow` (see
[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html)).

Default: `pyarrow`
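
Taken together, `CHUNK_SIZE`, `SCHEMA_SNIFFER_SIZE` and `DEFAULT_DTYPE_BACKEND` map naturally onto Pandas calls. The sketch below shows one plausible mapping; it is an assumption for illustration, not the project's source:

```python
# Hypothetical mapping of the settings onto pandas.read_sql / convert_dtypes;
# not project code.
import pandas as pd

CHUNK_SIZE = 5000
SCHEMA_SNIFFER_SIZE = 1000
DEFAULT_DTYPE_BACKEND = "pyarrow"


def sniff_schema(query, connection):
    """Infer column dtypes from the first SCHEMA_SNIFFER_SIZE rows only."""
    sample = next(pd.read_sql(query, connection, chunksize=SCHEMA_SNIFFER_SIZE))
    return sample.convert_dtypes(dtype_backend=DEFAULT_DTYPE_BACKEND).dtypes


def iter_chunks(query, connection):
    """Yield DataFrames of at most CHUNK_SIZE rows per iteration."""
    yield from pd.read_sql(query, connection, chunksize=CHUNK_SIZE)
```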

---

#### `PROFILER_INTERVAL`

From
[pyinstrument's](https://pyinstrument.readthedocs.io/en/latest/reference.html#pyinstrument.Profiler.interval)
documentation:

> The minimum time, in seconds, between each stack sample. This translates into
> the resolution of the sampling.

Default: `0.001`

---

#### `PROFILER_ASYNC_MODE`

From
[pyinstrument's](https://pyinstrument.readthedocs.io/en/latest/reference.html#pyinstrument.Profiler.async_mode)
documentation:

> Configures how this Profiler tracks time in a program that uses async/await.

Default: `enabled`

---

@@ -128,6 +172,19 @@ Default: `false`

---

#### `PROFILING`

Activate or deactivate server request profiling. If set to `true`, adding the
`?profile=1` query parameter to an HTTP request returns the profiling analysis
instead of the requested dataset.

Example query:
[http://localhost:8000/d/invoices.csv?profile=1](http://localhost:8000/d/invoices.csv?profile=1)

Default: `false`
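
The sketch below shows how such a toggle is commonly wired with pyinstrument; it assumes a FastAPI-style middleware and is an illustration, not the project's actual implementation:

```python
# Hedged sketch (assumes FastAPI; not the project's middleware) of a
# ?profile=1 toggle backed by pyinstrument.
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from pyinstrument import Profiler

app = FastAPI()


@app.middleware("http")
async def profile_request(request: Request, call_next):
    if request.query_params.get("profile"):
        profiler = Profiler(interval=0.001, async_mode="enabled")
        profiler.start()
        await call_next(request)
        profiler.stop()
        # Return the profiling report instead of the requested dataset.
        return HTMLResponse(profiler.output_html())
    return await call_next(request)
```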

---

#### `HOST`

This is the host the socket will be bound to. It can be an IPv4 or IPv6 address, or a
@@ -157,7 +214,8 @@ Default: `None`

#### `SENTRY_DSN`

The DSN of your Sentry project, _e.g._ `https://account@sentry.io/project_id`. When not set, Sentry integration is not active.
The DSN of your Sentry project, _e.g._ `https://account@sentry.io/project_id`.
When not set, Sentry integration is not active.

Default: `None`
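
As a hedged illustration of what this setting typically feeds (placeholder values, not the project's wiring), the Sentry SDK is usually initialized as follows:

```python
# Minimal sketch of Sentry initialization from a DSN; placeholder values only.
import sentry_sdk

sentry_sdk.init(
    dsn="https://account@sentry.io/project_id",
    traces_sample_rate=1.0,
)
```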

@@ -194,13 +252,11 @@ database name you will query is `chinook`, depending on the database engine and
driver you want to use, here is a table that summarizes the dependencies you need to
install and example `DATABASE_URL` values (a connection sketch follows the table).

| Database | Dependency | Example value |
| ---------- | ---------------------- | ---------------------------------------------------------- |
| PostgreSQL | `databases[asyncpg]` | `postgresql+asyncpg://data7:secret@localhost:5432/chinook` |
| | `databases[aiopg]` | `postgresql+aiopg://data7:secret@localhost:5432/chinook` |
| MySQL | `databases[aiomysql]` | `mysql+aiomysql://data7:secret@localhost:3306/chinook` |
| | `databases[asyncmy]` | `mysql+asyncmy://data7:secret@localhost:3306/chinook` |
| SQLite | `databases[aiosqlite]` | `sqlite+aiosqlite:///chinook.db` |
| Database | Dependency | Example value |
| ---------- | ------------------- | -------------------------------------------------- |
| PostgreSQL | `psycopg2-binary` | `postgresql://data7:secret@localhost:5432/chinook` |
| MySQL | `mariadb-connector` | `mysql://data7:secret@localhost:3306/chinook` |
| SQLite | - | `sqlite:///chinook.db` |
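
Given a `DATABASE_URL` like the examples above, a minimal connection check could look like the sketch below (assuming SQLAlchemy as the connection layer and `psycopg2-binary` installed; the query is illustrative):

```python
# Hedged sketch of a connection check; assumes SQLAlchemy + psycopg2-binary,
# and the table name is illustrative.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql://data7:secret@localhost:5432/chinook"
)

with engine.connect() as connection:
    df = pd.read_sql("SELECT * FROM Invoice LIMIT 5", connection)
    print(df.dtypes)
```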

---

42 changes: 34 additions & 8 deletions docs/tutorial.md
@@ -168,8 +168,14 @@ Three configuration files should have been created:
# The base url path for dataset urls
datasets_root_url: "/d"

# Parquet
parquet_batch_size: 100
# Pandas chunks
chunk_size: 5000
schema_sniffer_size: 1000
default_dtype_backend: pyarrow

# Pyinstrument
profiler_interval: 0.001
profiler_async_mode: enabled

# ---- DEFAULT ---------------------------------
default:
@@ -180,29 +186,49 @@
# host:
# port:

# Sentry
sentry_dsn: null
sentry_traces_sample_rate: 1.0

# Pyinstrument
profiling: false

# ---- PRODUCTION ------------------------------
production:
execution_environment: production

# Set debug to true for development, never for production!
debug: false

# Server
# host: data7.example.com
# port: 8080

# Sentry
# sentry_dsn:
# sentry_traces_sample_rate: 1.0

# Pyinstrument
profiling: false

#
# /!\ FEEL FREE TO REMOVE ENVIRONMENTS BELOW /!\
#
# ---- DEVELOPMENT -----------------------------
development:
execution_environment: development
debug: true

# Server
host: "127.0.0.1"
port: 8000

# Pyinstrument
profiling: true

# ---- TESTING ---------------------------------
testing:

execution_environment: testing
```

=== ".secrets.yaml"
@@ -216,22 +242,22 @@ Three configuration files should have been created:

# ---- DEFAULT ---------------------------------
default:
# DATABASE_URL: "sqlite+aiosqlite:///example.db"
# DATABASE_URL: "sqlite:///example.db"

# ---- PRODUCTION ------------------------------
production:
# DATABASE_URL: "sqlite+aiosqlite:///example.db"
# DATABASE_URL: "sqlite:///example.db"

#
# /!\ FEEL FREE TO REMOVE ENVIRONMENTS BELOW /!\
#
# ---- DEVELOPMENT -----------------------------
development:
DATABASE_URL: "sqlite+aiosqlite:///db/development.db"
DATABASE_URL: "sqlite:///db/development.db"

# ---- TESTING ---------------------------------
testing:
DATABASE_URL: "sqlite+aiosqlite:///db/tests.db"
DATABASE_URL: "sqlite:///db/tests.db"
```

=== "data7.yaml"
@@ -295,7 +321,7 @@ file:
```yaml
# .secrets.yaml
development:
DATABASE_URL: "sqlite+aiosqlite:///chinook.db"
DATABASE_URL: "sqlite:///chinook.db"
```
And check what datasets are defined in the `data7.yaml` file: