⚡️(app) improve CSV and Parquet generation
Using Pandas allows blazing-fast data conversion through a DataFrame pivot
(from SQL to CSV in particular). We have re-implemented all streamers to
use a Pandas/PyArrow combination.
jmaupetit committed Jul 15, 2024
1 parent 59c95fd commit b2f1c8c
Showing 14 changed files with 1,199 additions and 263 deletions.
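
In short, the streamers now pull SQL query results through Pandas in chunks instead of converting rows one by one. The snippet below is a minimal sketch of that idea, not the project's actual code; the `stream_csv` helper, the example database URL and the query are illustrative assumptions.

```python
# Hedged sketch of chunked SQL-to-CSV streaming with Pandas; the helper name,
# database URL and query are illustrative, not taken from the repository.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///chinook.db")  # example database


def stream_csv(query: str, chunk_size: int = 5000):
    """Yield CSV text chunk by chunk instead of materializing all rows."""
    header = True
    with engine.connect() as connection:
        for chunk in pd.read_sql(query, connection, chunksize=chunk_size):
            # Only the first chunk emits the CSV header row.
            yield chunk.to_csv(index=False, header=header)
            header = False


for part in stream_csv("SELECT * FROM Invoice"):
    ...  # e.g. write each part to an HTTP streaming response
```

A Parquet equivalent would feed each chunk to a `pyarrow.parquet.ParquetWriter` instead of serializing it to CSV text.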
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,10 @@ and this project adheres to

- Integrate pyinstrument profiler

### Changed

- Improve performance using Pandas

## [0.5.0] - 2024-07-08

### Changed
78 changes: 67 additions & 11 deletions docs/configuration.md
@@ -107,11 +107,55 @@ Default: `/d`

---

#### `PARQUET_BATCH_SIZE`
#### `CHUNK_SIZE`

Default batch size to process for Parquet files.
Size of batches to process, _i.e._ the number of SQL query result rows to
process at each iteration.

Default: 100
Default: `5000`

---

#### `SCHEMA_SNIFFER_SIZE`

The number of SQL query result rows used to infer a table schema (data types).

Default: `1000`

---

#### `DEFAULT_DTYPE_BACKEND`

The backend used to infer data types while fetching data from the database.
Possible values are: `numpy_nullable` or `pyarrow` (see
[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html)).

Default: `pyarrow`
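
Taken together, `CHUNK_SIZE`, `SCHEMA_SNIFFER_SIZE` and `DEFAULT_DTYPE_BACKEND` map naturally onto Pandas calls. The sketch below shows one plausible mapping; it is an assumption for illustration, not the project's source:

```python
# Hypothetical mapping of the settings onto pandas.read_sql / convert_dtypes;
# not project code.
import pandas as pd

CHUNK_SIZE = 5000
SCHEMA_SNIFFER_SIZE = 1000
DEFAULT_DTYPE_BACKEND = "pyarrow"


def sniff_schema(query, connection):
    """Infer column dtypes from the first SCHEMA_SNIFFER_SIZE rows only."""
    sample = next(pd.read_sql(query, connection, chunksize=SCHEMA_SNIFFER_SIZE))
    return sample.convert_dtypes(dtype_backend=DEFAULT_DTYPE_BACKEND).dtypes


def iter_chunks(query, connection):
    """Yield DataFrames of at most CHUNK_SIZE rows per iteration."""
    yield from pd.read_sql(query, connection, chunksize=CHUNK_SIZE)
```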

---

#### `PROFILER_INTERVAL`

From
[pyinstrument's](https://pyinstrument.readthedocs.io/en/latest/reference.html#pyinstrument.Profiler.interval)
documentation:

> The minimum time, in seconds, between each stack sample. This translates into
> the resolution of the sampling.

Default: `0.001`

---

#### `PROFILER_ASYNC_MODE`

From
[pyinstrument's](https://pyinstrument.readthedocs.io/en/latest/reference.html#pyinstrument.Profiler.async_mode)
documentation:

> Configures how this Profiler tracks time in a program that uses async/await.

Default: `enabled`

---

@@ -128,6 +172,19 @@ Default: `false`

---

#### `PROFILING`

Activate or deactivate server request profiling. If set to `true`, adding the
`?profile=1` query parameter to an HTTP request returns the profiling analysis
instead of the requested dataset.

Example query:
[http://localhost:8000/d/invoices.csv?profile=1](http://localhost:8000/d/invoices.csv?profile=1)

Default: `false`
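
The sketch below shows how such a toggle is commonly wired with pyinstrument; it assumes a FastAPI-style middleware and is an illustration, not the project's actual implementation:

```python
# Hedged sketch (assumes FastAPI; not the project's middleware) of a
# ?profile=1 toggle backed by pyinstrument.
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from pyinstrument import Profiler

app = FastAPI()


@app.middleware("http")
async def profile_request(request: Request, call_next):
    if request.query_params.get("profile"):
        profiler = Profiler(interval=0.001, async_mode="enabled")
        profiler.start()
        await call_next(request)
        profiler.stop()
        # Return the profiling report instead of the requested dataset.
        return HTMLResponse(profiler.output_html())
    return await call_next(request)
```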

---

#### `HOST`

This is the host the socket will be bound to. It can be an IPv4 or IPv6 address, or a
@@ -157,7 +214,8 @@ Default: `None`

#### `SENTRY_DSN`

The DSN of your Sentry project, _e.g._ `https://account@sentry.io/project_id`. When not set, Sentry integration is not active.
The DSN of your Sentry project, _e.g._ `https://account@sentry.io/project_id`.
When not set, Sentry integration is not active.

Default: `None`
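
As a hedged illustration of what this setting typically feeds (placeholder values, not the project's wiring), the Sentry SDK is usually initialized as follows:

```python
# Minimal sketch of Sentry initialization from a DSN; placeholder values only.
import sentry_sdk

sentry_sdk.init(
    dsn="https://account@sentry.io/project_id",
    traces_sample_rate=1.0,
)
```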

@@ -194,13 +252,11 @@ database name you will query is `chinook`, depending on the database engine and
driver you want to use, here is a table that summarizes the dependencies you need to
install and example `DATABASE_URL` values (a connection sketch follows the table).

| Database | Dependency | Example value |
| ---------- | ---------------------- | ---------------------------------------------------------- |
| PostgreSQL | `databases[asyncpg]` | `postgresql+asyncpg://data7:secret@localhost:5432/chinook` |
| | `databases[aiopg]` | `postgresql+aiopg://data7:secret@localhost:5432/chinook` |
| MySQL | `databases[aiomysql]` | `mysql+aiomysql://data7:secret@localhost:3306/chinook` |
| | `databases[asyncmy]` | `mysql+asyncmy://data7:secret@localhost:3306/chinook` |
| SQLite | `databases[aiosqlite]` | `sqlite+aiosqlite:///chinook.db` |
| Database | Dependency | Example value |
| ---------- | ------------------- | -------------------------------------------------- |
| PostgreSQL | `psycopg2-binary` | `postgresql://data7:secret@localhost:5432/chinook` |
| MySQL | `mariadb-connector` | `mysql://data7:secret@localhost:3306/chinook` |
| SQLite | - | `sqlite:///chinook.db` |
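
Given a `DATABASE_URL` like the examples above, a minimal connection check could look like the sketch below (assuming SQLAlchemy as the connection layer and `psycopg2-binary` installed; the query is illustrative):

```python
# Hedged sketch of a connection check; assumes SQLAlchemy + psycopg2-binary,
# and the table name is illustrative.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql://data7:secret@localhost:5432/chinook"
)

with engine.connect() as connection:
    df = pd.read_sql("SELECT * FROM Invoice LIMIT 5", connection)
    print(df.dtypes)
```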

---

42 changes: 34 additions & 8 deletions docs/tutorial.md
@@ -168,8 +168,14 @@ Three configuration files should have been created:
# The base url path for dataset urls
datasets_root_url: "/d"

# Parquet
parquet_batch_size: 100
# Pandas chunks
chunk_size: 5000
schema_sniffer_size: 1000
default_dtype_backend: pyarrow

# Pyinstrument
profiler_interval: 0.001
profiler_async_mode: enabled

# ---- DEFAULT ---------------------------------
default:
@@ -180,29 +186,49 @@
# host:
# port:

# Sentry
sentry_dsn: null
sentry_traces_sample_rate: 1.0

# Pyinstrument
profiling: false

# ---- PRODUCTION ------------------------------
production:
execution_environment: production

# Set debug to true for development, never for production!
debug: false

# Server
# host: data7.example.com
# port: 8080

# Sentry
# sentry_dsn:
# sentry_traces_sample_rate: 1.0

# Pyinstrument
profiling: false

#
# /!\ FEEL FREE TO REMOVE ENVIRONMENTS BELOW /!\
#
# ---- DEVELOPMENT -----------------------------
development:
execution_environment: development
debug: true

# Server
host: "127.0.0.1"
port: 8000

# Pyinstrument
profiling: true

# ---- TESTING ---------------------------------
testing:

execution_environment: testing
```

=== ".secrets.yaml"
@@ -216,22 +242,22 @@ Three configuration files should have been created:

# ---- DEFAULT ---------------------------------
default:
# DATABASE_URL: "sqlite+aiosqlite:///example.db"
# DATABASE_URL: "sqlite:///example.db"

# ---- PRODUCTION ------------------------------
production:
# DATABASE_URL: "sqlite+aiosqlite:///example.db"
# DATABASE_URL: "sqlite:///example.db"

#
# /!\ FEEL FREE TO REMOVE ENVIRONMENTS BELOW /!\
#
# ---- DEVELOPMENT -----------------------------
development:
DATABASE_URL: "sqlite+aiosqlite:///db/development.db"
DATABASE_URL: "sqlite:///db/development.db"

# ---- TESTING ---------------------------------
testing:
DATABASE_URL: "sqlite+aiosqlite:///db/tests.db"
DATABASE_URL: "sqlite:///db/tests.db"
```

=== "data7.yaml"
@@ -295,7 +321,7 @@ file:
```yaml
# .secrets.yaml
development:
DATABASE_URL: "sqlite+aiosqlite:///chinook.db"
DATABASE_URL: "sqlite:///chinook.db"
```
And check what datasets are defined in the `data7.yaml` file: