Compute the minimum, mean, and maximum across 1 trillion rows.

kubox-ai/1trc

🚀 Processing One Trillion Rows with Kubox

Talos Pulumi AWS ClickHouse Ray Daft

Kubox

Kubox empowers you to build your own data infrastructure from scratch, offering the freedom to deploy tools of your choice for large-scale analytics. It combines the simplicity of SaaS with the flexibility of PaaS, creating a vendor-neutral platform ready for the future of AI/ML workloads.

🌟 Key Features

  • 🛠 Effortless Kubernetes Cluster Creation: Quickly spin up clusters using the free Kubox CLI.
  • Cloud Agnostic: Deploy on any cloud or on-premises infrastructure.
  • 📊 Optimised for Large Workloads: Infrastructure to process 1 trillion rows with ease.
  • 🧩 Choice of Data and Analytics Tools: Choose from high-performance databases or distributed computing frameworks.

The One Trillion Row Challenge (1TRC)

The One Trillion Row Challenge originated as an ambitious benchmark task:

  • Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
  • Dataset:
    • Format: Parquet
    • Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
    • Location: s3://coiled-datasets-rp/1trc (AWS S3 Requester Pays Bucket)
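
The goal above is a grouped aggregation that can be computed in a single streaming pass. The following is a minimal pure-Python sketch on a handful of made-up readings; the station names and values are illustrative only and are not taken from the dataset.

```python
from collections import defaultdict


def summarize(readings):
    """Return [(station, min, mean, max)] sorted alphabetically by station.

    `readings` is any iterable of (station, temperature) pairs, so it can be
    consumed as a stream without holding every row in memory.
    """
    # Per station: [min, max, running sum, count]
    stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    for station, temp in readings:
        s = stats[station]
        s[0] = min(s[0], temp)
        s[1] = max(s[1], temp)
        s[2] += temp
        s[3] += 1
    return [
        (name, lo, total / count, hi)
        for name, (lo, hi, total, count) in sorted(stats.items())
    ]


# Illustrative sample data only.
sample = [("Oslo", 4.0), ("Cairo", 30.0), ("Oslo", -2.0), ("Cairo", 34.0)]
for row in summarize(sample):
    print(row)
```

Because min, max, sum, and count can all be merged, partial results computed per file can be combined afterwards, which is what allows the work to be split across many files and workers.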

Here is how to download one of the 24 MiB files:

aws s3 cp s3://coiled-datasets-rp/1trc/measurements-0.parquet . --request-payer requester
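
The same download can be sketched in Python. This is not part of the repository: it assumes the third-party boto3 package is installed, and it assumes the files are numbered consecutively in the style of measurements-0.parquet. The helper names `object_key` and `download` are hypothetical.

```python
BUCKET = "coiled-datasets-rp"
PREFIX = "1trc"


def object_key(index: int) -> str:
    """Key for one file, ASSUMING files are named measurements-<index>.parquet."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return f"{PREFIX}/measurements-{index}.parquet"


def download(index: int, dest_dir: str = ".") -> str:
    """Download one file; the requester (you) pays for the transfer."""
    import os

    import boto3  # third-party: pip install boto3

    key = object_key(index)
    dest = os.path.join(dest_dir, os.path.basename(key))
    # RequestPayer mirrors the CLI's --request-payer requester flag.
    boto3.client("s3").download_file(
        BUCKET, key, dest, ExtraArgs={"RequestPayer": "requester"}
    )
    return dest


# The object fetched by the aws s3 cp command above.
print(object_key(0))
```

Calling `download(0)` performs the actual transfer and needs valid AWS credentials.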

This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS us-east-1 and process 1 trillion rows of data. It gives you two options for tackling this challenge:

  1. ClickHouse: a powerful, high-performance analytics database.
  2. Daft and Ray: a dynamic duo for distributed computing and cutting-edge data processing.

Tip

You're not restricted to just the two options mentioned above. At its core, Kubox offers a free CLI that empowers you to effortlessly spin up Kubernetes clusters in your own cloud. It combines the ease of SaaS with the flexibility of PaaS, delivering a vendor-neutral data platform that can run anywhere.

We're excited to have you explore and experiment with Kubox. Feel free to dive in and share your feedback as we continue to enhance this project!

Installation

1. Preliminary steps

You should have the AWS CLI installed and configured. If not, follow the AWS CLI installation instructions.

2. Install kubox

Download and install kubox, a single binary required to create a Kubernetes cluster.

curl https://kubox.sh | sh

3. Set up AWS CLI

aws configure
aws sts get-caller-identity

Example output:

{
    "UserId": "AIDAIEXAMPLEID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/example-user"
}
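
To check the configured identity from a script rather than by eye, the JSON output can be parsed. This is a minimal sketch; the helper names `parse_identity` and `current_identity` are hypothetical and not part of this repository.

```python
import json
import subprocess


def parse_identity(raw: str) -> dict:
    """Extract the account ID and ARN from `aws sts get-caller-identity` output."""
    data = json.loads(raw)
    return {"account": data["Account"], "arn": data["Arn"]}


def current_identity() -> dict:
    """Run the AWS CLI and parse its output; raises if the call fails."""
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_identity(result.stdout)


# Parsing the example output shown above (no AWS call is made here).
example = (
    '{"UserId": "AIDAIEXAMPLEID", "Account": "123456789012", '
    '"Arn": "arn:aws:iam::123456789012:user/example-user"}'
)
print(parse_identity(example))
```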

Running 1TRC in Kubox

We currently support running the 1TRC notebook in Kubox using Daft and Ray or ClickHouse. Follow the steps below to run the notebook:

Clone this repository

git clone https://github.com/Kubox-ai/1trc.git
cd 1trc

Using ClickHouse

1. Create a ClickHouse cluster using Kubox

Warning

Create the Kubox cluster in the us-east-1 AWS Region to avoid data transfer costs when running 1TRC. The region is configurable in cluster-daft.yaml and cluster-clickhouse.yaml.

kubox create -f cluster-clickhouse.yaml

Tip

kubox create is idempotent, so you can run it again if you run into issues. For more help, see the troubleshooting guide.
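
Since the command can be safely re-run, a retry loop is one way to automate cluster creation. The sketch below is an illustration only: the helper `run_with_retries`, the attempt count, and the delay are assumptions, not something the Kubox CLI provides.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_s=30.0):
    """Run `cmd` until it exits with status 0 or `attempts` is exhausted.

    Returns the number of attempts used; raises RuntimeError if all fail.
    Re-running is only safe because the command is idempotent.
    """
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # give transient failures time to clear
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")


# Example call (commented out so that nothing is created by accident):
# run_with_retries(["kubox", "create", "-f", "cluster-clickhouse.yaml"])
```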

2. Export kubeconfig for ClickHouse

A kubeconfig file will be generated as part of the cluster creation process. Set it as an environment variable:

export KUBECONFIG=./clickhouse/cluster/config/kubeconfig
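
The path above is relative, so it only resolves when kubectl is run from the repository root. If you drive kubectl from a script, one option is to resolve the path once and pass it through the environment. This is a sketch, and the helper name `kubeconfig_env` is hypothetical.

```python
import os
from pathlib import Path


def kubeconfig_env(relative_path: str, base_dir: str = ".") -> dict:
    """Return a copy of the environment with KUBECONFIG set to an absolute path."""
    path = (Path(base_dir) / relative_path).resolve()
    env = dict(os.environ)
    env["KUBECONFIG"] = str(path)
    return env


env = kubeconfig_env("clickhouse/cluster/config/kubeconfig")
print(env["KUBECONFIG"])
# Pass env=env to subprocess.run(["kubectl", ...]) to target this cluster
# regardless of the current working directory.
```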

3. Expose ClickHouse and Notebook

Expose the notebook:

kubectl port-forward -n kubox svc/notebook 8888:80

Now you can access the notebook at http://localhost:8888

Expose the ClickHouse service:

kubectl port-forward -n kubox svc/clickhouse 8123:8123

You can access the ClickHouse dashboard at http://localhost:8123/dashboard. The default username is default and the default password is 123456.
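
The forwarded port also serves ClickHouse's HTTP interface, so a quick query confirms that the port-forward and the credentials work. This is a minimal sketch using only the Python standard library; the helper names `build_request` and `run_query` are hypothetical.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8123"  # local end of the port-forward above


def build_request(query: str, user: str = "default", password: str = "123456"):
    """Build a request for ClickHouse's HTTP interface."""
    url = f"{BASE_URL}/?{urllib.parse.urlencode({'query': query})}"
    return urllib.request.Request(
        url,
        headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
    )


def run_query(query: str) -> str:
    """Send the query and return the response body as text."""
    with urllib.request.urlopen(build_request(query), timeout=10) as resp:
        return resp.read().decode()


# With the port-forward running, run_query("SELECT version()") returns the
# server version as text.
```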

4. Add AWS credentials in 1trc-clickhouse.ipynb

Add your AWS keys in 1trc-clickhouse.ipynb to access the dataset:

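import os

# Prefer credentials from the environment; otherwise fall back to the
# placeholders, which must be replaced with real keys before running.
AWS_ACCESS_KEY_ID = (
    os.getenv("AWS_ACCESS_KEY_ID") or "<YOUR_AWS_ACCESS_KEY_ID>"
)
AWS_SECRET_ACCESS_KEY = (
    os.getenv("AWS_SECRET_ACCESS_KEY") or "<YOUR_AWS_SECRET_ACCESS_KEY>"
)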

Now you can run the notebook.
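
For orientation, the kind of query such a notebook might send can be sketched as follows. This is an assumption-laden illustration, not the notebook's actual code: it assumes the Parquet columns are named `station` and `measure`, that the bucket is reachable through the virtual-hosted URL shown, and that ClickHouse's `s3` table function is used with a `headers(...)` argument for the Requester Pays header. The helper name `build_1trc_query` is hypothetical.

```python
def build_1trc_query(access_key: str, secret_key: str) -> str:
    """Return a ClickHouse SQL string for the 1TRC aggregation.

    ASSUMPTIONS (not confirmed by this README): column names `station` and
    `measure`, the URL form below, and the s3(...) argument layout.
    """
    url = (
        "https://coiled-datasets-rp.s3.us-east-1.amazonaws.com"
        "/1trc/measurements-*.parquet"
    )
    return (
        "SELECT station, min(measure), avg(measure), max(measure) "
        f"FROM s3('{url}', '{access_key}', '{secret_key}', 'Parquet', "
        "headers('x-amz-request-payer' = 'requester')) "
        "GROUP BY station ORDER BY station"
    )


# Placeholders only; substitute the values set in the notebook cell above.
print(build_1trc_query("<YOUR_AWS_ACCESS_KEY_ID>", "<YOUR_AWS_SECRET_ACCESS_KEY>"))
```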

5. Delete the ClickHouse cluster

kubox delete -f cluster-clickhouse.yaml

Tip

If you run into issues, see the troubleshooting guide.

Using Daft and Ray

1. Create a Ray cluster using Kubox

kubox create -f cluster-daft.yaml

2. Export kubeconfig for Daft

A kubeconfig file will be generated as part of the cluster creation process. Set it as an environment variable:

export KUBECONFIG=./daft/cluster/config/kubeconfig

3. Expose Ray Dashboard and Notebook

Expose the notebook:

kubectl port-forward -n kubox svc/notebook 8888:80

Now you can access the notebook at http://localhost:8888

Expose the Ray Dashboard service:

kubectl port-forward -n kubox svc/ray-cluster-kuberay-head-svc 8265:8265

You can access Ray Dashboard at http://localhost:8265.
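
Before opening a browser, you can confirm that both port-forwards are listening. This is a minimal sketch using only the standard library; the helper name `port_open` is hypothetical.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Local ports used by the two port-forward commands above.
for name, port in {"notebook": 8888, "ray-dashboard": 8265}.items():
    state = "reachable" if port_open("localhost", port) else "not reachable"
    print(f"{name} (port {port}): {state}")
```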

4. Add AWS credentials in 1trc-daft.ipynb

Add your AWS keys in 1trc-daft.ipynb to access the dataset:

runtime_env = {
    "working_dir": "./",
    "pip": ["getdaft[all]"],
    "env_vars": {
        "AWS_ACCESS_KEY_ID": "<YOUR_AWS_ACCESS_KEY_ID>",
        "AWS_SECRET_ACCESS_KEY": "<YOUR_AWS_SECRET_ACCESS_KEY>",
    },
}
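
To keep keys out of the notebook file, the same dictionary can be built from environment variables, mirroring the ClickHouse notebook's use of os.getenv. This is a sketch; the helper name `build_runtime_env` is hypothetical.

```python
import os


def build_runtime_env(environ=os.environ) -> dict:
    """Build the runtime_env shown above, reading AWS keys from `environ`."""
    names = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "working_dir": "./",
        "pip": ["getdaft[all]"],
        "env_vars": {name: environ[name] for name in names},
    }


# Example with placeholder values (not real credentials).
demo = build_runtime_env(
    {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
print(sorted(demo["env_vars"]))
```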

Now you can run the notebook.

5. Delete the cluster

kubox delete -f cluster-daft.yaml

Tip

If you run into issues, see the troubleshooting guide.

Comparison

Note

The cost estimates below are based on AWS EC2 spot instance prices.

Metric          Daft + Ray   ClickHouse
Startup time    320s         313s
Running time    1189s        527s
Delete time     122s         123s
Estimated cost  $2.75        $1.37
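
The table can be turned into end-to-end times and throughput with a few lines of arithmetic. The sketch below uses only the numbers reported above plus the 1 trillion row count.

```python
ROWS = 1_000_000_000_000  # one trillion

# Values copied from the comparison table (seconds and US dollars).
results = {
    "Daft + Ray": {"startup": 320, "running": 1189, "delete": 122, "cost": 2.75},
    "ClickHouse": {"startup": 313, "running": 527, "delete": 123, "cost": 1.37},
}


def total_seconds(r: dict) -> int:
    """End-to-end time: cluster startup + query run + cluster deletion."""
    return r["startup"] + r["running"] + r["delete"]


for name, r in results.items():
    rows_per_s = ROWS / r["running"]
    print(f"{name}: total {total_seconds(r)}s, {rows_per_s / 1e9:.2f} billion rows/s")

ratio = results["Daft + Ray"]["running"] / results["ClickHouse"]["running"]
print(f"Running-time ratio (Daft + Ray / ClickHouse): {ratio:.2f}")
```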

Work in progress.

Contributing

We welcome contributions! If you find a bug, have a feature request, or want to improve the performance, feel free to open an issue or submit a pull request.

License

This repository is licensed under the Apache License 2.0. You are free to use, modify, and distribute this project under the terms of the license. See the LICENSE file for more details.