Generate Airlines Demo DB (Big Data) and Build a Flight Timetable to make a benchmark for PostgreSQL vs Doublets #11

@konard

Description

Summary

Set up the PostgresPro Airlines demo database, generate a large (“big data”) instance with the official generator, and implement flight timetable queries (departures/arrivals by airport/date/route). Use these queries to establish a repeatable benchmark and compare PostgreSQL 18 performance and ergonomics with a Doublets implementation on equivalent operations. The environment must be containerized, with two modes: default (durable) and embedded-like (WAL-light).

Scope

  • PostgreSQL side (Docker): data generation, schema inspection, timetable queries (with validity checks), basic indexes/constraints, timing runs in both modes.
  • Doublets side: functionally equivalent data model and queries (or pipelines), plus timing runs.
  • Deliver reproducible scripts, Docker assets, and result artifacts (logs/CSVs).

Deliverables

  • [docker]/docker-compose.yml pinned to postgres:18.
  • [docker]/compose.embedded.yml (override) for embedded-like mode (WAL-light).
  • [docker]/init/99_unlogged.sql (optional) to switch data tables to UNLOGGED in embedded-like runs.
  • [docs]/HOWTO.md — runbook for Docker, generation parameters, and how to run both benchmark modes.
  • [sql]/10_timetable_queries.sql — queries below.
  • [bench]/pg/run.sh — timings (EXPLAIN (ANALYZE, BUFFERS) + wall-clock), CSV output.
  • [bench]/doublets/run.* — equivalent Doublets operations, same CSV schema.
  • [bench]/schema-mapping.md — mapping Airlines entities/fields to Doublets structures.
  • README update with a “Benchmark: Timetable” section linking to the above.

Prerequisites (PostgreSQL)

  • PostgreSQL 18 (pinned) via Docker image postgres:18.
  • Disk space per chosen dataset size (3m/6m/1y/2y).
  • Extensions typically used by the generator: btree_gist, cube, earthdistance, dblink.

Containerized Environment

Durable (default) compose

# [docker]/docker-compose.yml
version: "3.9"
services:
  pg:
    image: postgres:18
    container_name: airlines-pg18
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 20
volumes:
  pgdata:

Embedded-like (WAL-light) override

This mode trades durability for speed to better match embedded DB behavior.

# [docker]/compose.embedded.yml
services:
  pg:
    command:
      - "postgres"
      - "-c"
      - "fsync=off"
      - "-c"
      - "synchronous_commit=off"
      - "-c"
      - "full_page_writes=off"
      - "-c"
      - "wal_level=minimal"
      - "-c"
      - "max_wal_senders=0"   # wal_level=minimal requires max_wal_senders=0

Optional (strongly recommended for WAL-free data writes): switch large tables to UNLOGGED in embedded mode.

-- [docker]/init/99_unlogged.sql (only applied when you want embedded-like runs)
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT format('%I.%I', n.nspname, c.relname) AS fqname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'bookings' AND c.relkind = 'r'
  LOOP
    EXECUTE 'ALTER TABLE '|| r.fqname || ' SET UNLOGGED';
  END LOOP;
END$$;

Revert to durable: replace SET UNLOGGED with SET LOGGED.

Run:

# Durable
docker compose -f docker/docker-compose.yml up -d

# Embedded-like (WAL-light)
docker compose -f docker/docker-compose.yml -f docker/compose.embedded.yml up -d

Setup Options

Option A — Load a ready snapshot

gunzip -c demo-YYYYMMDD-1y.sql.gz \
| docker compose -f docker/docker-compose.yml exec -T pg psql -U postgres

This recreates the demo database inside the container.

Option B — Generate your own “big data” (recommended)

From psql in the container (any DB, not demo yet):

-- creates the demo database, schemas, and extensions
\i install

-- Example: generate 1 year with 4 workers
CALL generate(now(), now() + interval '1 year', 4);

-- Progress
SELECT busy();

-- Post-checks
\i check.sql

What to implement (PostgreSQL)

A. Timetable queries (using the provided view)

bookings.timetable already handles airport-local times and the temporal validity join.

-- Departures from SVO on 2025-10-07
SELECT *
FROM bookings.timetable
WHERE departure_airport = 'SVO'
  AND (scheduled_departure AT TIME ZONE 'UTC')::date = DATE '2025-10-07'
ORDER BY scheduled_departure;

-- Arrivals to SVO on 2025-10-07
SELECT *
FROM bookings.timetable
WHERE arrival_airport = 'SVO'
  AND (scheduled_arrival AT TIME ZONE 'UTC')::date = DATE '2025-10-07'
ORDER BY scheduled_arrival;

-- Next available flight after model-time “now” for a route
SELECT *
FROM bookings.timetable t
WHERE t.departure_airport = 'SVX'
  AND t.arrival_airport   = 'WUH'
  AND t.scheduled_departure > bookings.now()
ORDER BY t.scheduled_departure
LIMIT 1;

B. Manual timetable (without the view, explicit validity check)

SELECT f.flight_id,
       r.route_no,
       r.departure_airport,
       r.arrival_airport,
       f.status,
       f.scheduled_departure,
       f.scheduled_arrival
FROM   bookings.flights  AS f
JOIN   bookings.routes   AS r
  ON   r.route_no = f.route_no
 AND   r.validity @> f.scheduled_departure  -- validity at composition time
WHERE  r.departure_airport = 'SVO'
  AND  (f.scheduled_departure AT TIME ZONE 'UTC')::date = DATE '2025-10-07'  -- same UTC bucketing as the view queries
ORDER BY f.scheduled_departure;

Benchmark plan

Metrics

  • Wall-clock time (median over N runs).
  • EXPLAIN (ANALYZE, BUFFERS) for PostgreSQL queries (saved to logs).
  • Result row counts and checksum (simple hash) for equivalence.
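
The equivalence checksum can be computed identically on both sides. A minimal sketch in Python (the function name and `|`-joined row rendering are illustrative choices, not part of any deliverable; both systems must stringify values the same way, e.g. ISO-8601 timestamps, for checksums to match):

```python
import hashlib

def rows_checksum(rows):
    """Order-insensitive checksum over result rows.

    `rows` is an iterable of tuples/lists as returned by a DB driver.
    Each row is hashed individually, the per-row digests are sorted,
    and the sorted digests are hashed again, so row order is irrelevant.
    """
    digests = sorted(
        hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

# Same logical result set in a different order yields the same checksum
a = [(1, "SVO", "2025-10-07"), (2, "LED", "2025-10-07")]
b = [(2, "LED", "2025-10-07"), (1, "SVO", "2025-10-07")]
assert rows_checksum(a) == rows_checksum(b)
```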

Datasets

  • Two scales (e.g., 6 months and 1 year).

Procedure

  1. Warm-up: run each query once (discard).
  2. Execute each query 10 times, recording wall-clock milliseconds.
  3. Store results in bench/results/*.csv with columns:
    system,durability_mode,dataset,query_id,run,rows,ms.
  4. Compute summary stats (min/median/p95) and check consistency.
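
The summary step can be sketched with the Python standard library; the grouping key follows the CSV schema above (the function name is illustrative):

```python
import csv
import statistics
from collections import defaultdict

def summarize(csv_path):
    """Aggregate bench CSVs (columns: system,durability_mode,dataset,
    query_id,run,rows,ms) into min/median/p95 per query."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["system"], row["durability_mode"],
                   row["dataset"], row["query_id"])
            groups[key].append(float(row["ms"]))
    summary = {}
    for key, ms in groups.items():
        ms.sort()
        summary[key] = {
            "min": ms[0],
            "median": statistics.median(ms),
            # 95th percentile: last of 19 cut points (inclusive method)
            "p95": statistics.quantiles(ms, n=20, method="inclusive")[-1],
        }
    return summary
```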

Durability modes

  • durable: default compose (baseline, production-like).
  • embedded: compose override + UNLOGGED tables to eliminate WAL where possible.
    • Caveats: not crash-safe; UNLOGGED tables are truncated after a crash or unclean shutdown; changes to them are not replicated.

Doublets parity

  • Model the same entities/time attributes.
  • Return the same logical result sets as the SQL above.
  • Export CSV with the same columns (include durability_mode = embedded).
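
Both runners can share one timing loop. A hedged Python sketch, assuming `run_query` is a placeholder callable that executes one query on either system and returns its result rows (`bench/pg/run.sh` and `bench/doublets/run.*` may implement the same loop natively; only the CSV columns must match):

```python
import csv
import time

FIELDS = ["system", "durability_mode", "dataset", "query_id", "run", "rows", "ms"]

def bench(writer, system, mode, dataset, query_id, run_query, runs=10):
    """Time `run_query` and append one CSV row per measured run.

    `writer` is a csv.DictWriter opened on the shared FIELDS schema.
    The first execution is a warm-up and is discarded, per the procedure.
    """
    run_query()  # warm-up, discarded
    for run in range(1, runs + 1):
        t0 = time.perf_counter()
        rows = run_query()
        ms = (time.perf_counter() - t0) * 1000.0
        writer.writerow({"system": system, "durability_mode": mode,
                         "dataset": dataset, "query_id": query_id,
                         "run": run, "rows": len(rows), "ms": round(ms, 3)})
```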

Validation

  • SELECT bookings.now(); returns a timestamp (model time).
  • Sanity: row counts by status, min/max scheduled times, airports present.
  • Validity sanity: joining routes to flights without the view must use r.validity @> f.scheduled_departure.
  • Embedded mode check: verify relpersistence='u' for bookings tables:
    SELECT relname, relpersistence
    FROM pg_class
    WHERE relnamespace = 'bookings'::regnamespace AND relkind='r';

Acceptance Criteria

  • Dockerized environment with PostgreSQL 18 (postgres:18) and reproducible startup.
  • Big demo DB available (document exact method & parameters).
  • 10_timetable_queries.sql runnable on the generated dataset.
  • Manual queries that bypass the view include the validity check.
  • Benchmarks run in both durability modes and produce CSVs at two scales.
  • README updated with a short how-to and links to scripts & results.

Nice to have (optional)

  • Timing plots from CSVs.
  • Notes on indexes that materially change plans.
