Skip to content

PeerDB-io/peerdb

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3,095 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PeerDB Banner

Frustratingly simple ETL for Postgres

Workflow Status ElV2 License Slack Community

PeerDB

At PeerDB, we are building a fast, simple and the most cost effective way to stream data from Postgres to Data Warehouses, Queues and Storage engines. If you are running Postgres at the heart of your data-stack and move data at scale from Postgres to any of the above targets, PeerDB can provide value.

We support different modes of streaming - log based (CDC), cursor based (timestamp or integer) and XMIN based. Performance wise, we are 10x faster than existing tools. Features wise, we support native Postgres features such as comprehensive set of data-types incl. jsonb/arrays/geospatial, efficiently streaming TOAST columns, schema changes and so on.

Get started

git clone git@github.com:PeerDB-io/peerdb.git
cd peerdb

# Run docker containers: postgres as catalog, temporal, PeerDB server, PeerDB flow API + workers, PeerDB UI
# Requires docker and docker-compose installed: https://docs.docker.com/engine/install/
bash ./run-peerdb.sh
# OR for local development, images will be built locally.
# Requires docker, docker-compose as well as the buf compiler for protobuf generation
# https://buf.build/docs/installation
bash ./generate-protos.sh
bash ./dev-peerdb.sh

# connect to peerdb and query away (Use psql version >=14.0)
psql "port=9900 host=localhost password=peerdb"

IMPORTANT: Ensuring ClickHouse Access to MinIO

If your ClickHouse DB runs outside Docker (e.g., on VMs or ClickHouse Cloud), it may not have access to MinIO, which is used by PeerDB internally to stage files before loading them. Ensure ClickHouse has network access to MinIO.

PeerDB stages PostgreSQL data in MinIO within the Docker stack. Since ClickHouse is outside Docker, it needs a resolvable hostname for MinIO.

Update docker-compose.yml and set AWS_ENDPOINT_URL_S3 to MinIO's accessible IP (from both PeerDB and ClickHouse):

AWS_ENDPOINT_URL_S3: http://172.31.26.57:9001 # Change this to IP/host which is accessible by both PeerDB and ClickHouse

Rerun Docker Compose to apply changes. On AWS/GCP/Azure, also ensure the security group allows inbound access to MinIO.

Follow this 5-minute Quickstart Guide to see PeerDB in action i.e. streaming data in real-time across stores.

Why PeerDB

Current data tools prioritize a wide range of connectors, often neglecting to optimize for Postgres users. This can be problematic for those storing large amounts of data in Postgres and frequently transferring it. As a result, many resort to building custom pipelines when existing tools don't meet their needs. We've developed this project to provide a straightforward and reliable solution specifically for Postgres.

Postgres-first Approach

PeerDB is an ETL/ELT tool built for PostgreSQL. We implement multiple Postgres native and infrastructural optimizations to provide a fast, reliable and a feature-rich experience for moving data in/out of PostgreSQL.

For performance - we can parallelize initial load for a large table, still ensuring consistency. Syncing 100s of GB reduces from days to minutes. Our architecture is designed for real-time syncs and implements multiple logical replication related optimizations (e.g., tuning Postgres configs, parallel reading of slot). This enables 10x faster Change Data Capture with data-freshness of a few 10s of seconds even at large throughputs (10k+ tps).

For reliability, we have mechanisms in place for fault tolerance - state management, automatic retries, handling idempotency and consistency and so on (https://blog.peerdb.io/using-temporal-to-scale-data-synchronization-at-peerdb). Configurable batching and parallelism prevent out of memory (OOMs) and crashes.

From a feature richness standpoint, we support efficient syncing of tables with large (TOAST) columns. We support multiple streaming modes - log based (CDC), cursor based (timestamp or integer) and XMIN based. We provide rich data-type mapping and plan to support every possible type (incl. Custom types) that Postgres supports to the best extent possible on the target data-store.

Now available natively in ClickHouse Cloud (Generally Available)

PeerDB is now available natively in ClickHouse Cloud (Generally Available). Learn more about it here.

Postgres-compatible SQL interface to do ETL

The Postgres-compatible SQL interface for ETL is unique to PeerDB and enables you to operate in a language you are familiar with. You can do ETL the same way you work with your databases.

You can use Postgres’ ecosystem to manage your ETL —

  1. Client tools like pgAdmin, psql to run SQL commands.
  2. BI tools like Grafana, Tableau to visually monitor syncs and transforms.
  3. Database migration and versioning tools like Flyway to manage your ETL.
  4. Any language (e.g., Python, Go, Node.js) and Scheduler (Airflow) for development.
  5. And many more.

Status

We have expanded our connector ecosystem to support multiple source connectors beyond Postgres, including MySQL and MongoDB. You can check the status of connectors here

Local End to End testing

You can run locally the same end-to-end tests that our CI uses to validate changes, enabling fast iteration cycles during development.

For example:

cd flow
go clean -cache
env -f ../.env go test -v -run TestGenericCH_MySQL ./e2e/

Or local debugging sessions.

These tests require both PeerDB services, source and destination stores to be running. We provide a local environment with all the necessary services and dependencies to run these tests.

This is done through Tilt orchestrated Docker compose.

To get the environment up you first need to specify the shared environment variables for both the test and the test environment in your local .env file. You can use the provided .env.example as a template: cp .env.example .env .

📝 In the template, services URLs are set to host.docker.internal, which is the name for the default Docker gateway in Docker Desktop set-ups such as macOS and Windows. Using the default gateway address allows both test processes and services running inside Docker to access services on the host machine. In native Docker (Linux) this name is not resolved by default, you might replace it with the default gateway IP (e.g., 172.18.0.1) or add a custom entry to your /etc/hosts file to resolve host.docker.internal to the appropriate IP address. e.g:

echo "172.18.0.1 host.docker.internal" | sudo tee -a /etc/hosts

Then you can just run:

./tilt.sh

And follow the status of the services and access logs through the Tilt UI at http://localhost:10352/. Dozzle is also included at http://localhost:8118/, providing real-time container resource utilization metrics (CPU, memory) and log streaming for all running Docker containers.

image

Since .env is the environment configuration source of truth, it can be used directly to inject the required variables to the test execution processes. e.g:

go clean -cache; env -f ../.env go test -v -run TestGenericCH_MySQL ./e2e/ # Some MySQL generic tests

Running tests from Tilt

The Tilt setup includes pre-configured test launcher resources under the e2e label. These resources do not start automatically; instead, you can trigger them on demand from the Tilt UI at http://localhost:10352/.

image

Available test launchers:

  • e2e_postgres -- Postgres to ClickHouse generic tests (TestGenericCH_PG)
  • e2e_mysql -- MySQL to ClickHouse generic tests (TestGenericCH_MySQL)
  • e2e_mongodb -- MongoDB to ClickHouse test suite (TestMongoClickhouseSuite)

Each launcher automatically depends on the required services and provisioning steps, so Tilt will ensure all prerequisites are running before executing the tests. To trigger a test, click the resource in the Tilt UI and press the trigger (play) button.

Environment services versions

Data stores versions are extracted from .github/workflows/flow.yml, select the last row of the test matrix except for MySQL version which defaults to 9.5. This automatic extraction relies on the yq CLI; install the Go-based mikefarah/yq version 4 or later so that local environment generation works correctly. You can specify different versions in the local .env file to override them as follows:

MYSQL_VERSION=8.0
MONGODB_VERSION=4.4
CLICKHOUSE_VERSION=21.8

Support

Our docs can be found here. If you have any questions, feel free to drop by our Slack!

License

PeerDB is licensed under GNU Affero General Public License v3.0 (AGPLv3). Please see the LICENSE file for additional information. If you have any licensing questions please email db-integrations-support@clickhouse.com