Kubox empowers you to build your own data infrastructure from scratch, offering the freedom to deploy tools of your choice for large-scale analytics. It combines the simplicity of SaaS with the flexibility of PaaS, creating a vendor-neutral platform ready for the future of AI/ML workloads.
- Effortless Kubernetes Cluster Creation: Quickly spin up clusters using the free Kubox CLI.
- Cloud Agnostic: Deploy on any cloud or on-premises infrastructure.
- Optimised for Large Workloads: Infrastructure to process 1 trillion rows with ease.
- Choice of Data and Analytic Tools: Choose from high-performance databases or distributed computing frameworks.
- The One Trillion Row Challenge (1TRC)
- Installation
- Running 1TRC in Kubox
- Comparison
- Contributing
- License
The One Trillion Row Challenge originated as an ambitious benchmark task:
- Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
- Dataset:
  - Format: Parquet
  - Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
  - Location: s3://coiled-datasets-rp/1trc (AWS S3 Requester Pays bucket)
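As a quick sanity check on the figures above, the following sketch multiplies out the file count, rows per file, and file size. It uses only the numbers stated in the dataset description and treats every file as exactly 24 MiB, which is an approximation.

```python
# Figures taken from the dataset description above.
FILES = 100_000
ROWS_PER_FILE = 10_000_000
MIB = 1024 ** 2                  # bytes in one MiB
FILE_SIZE_BYTES = 24 * MIB       # assumes every file is exactly 24 MiB

total_rows = FILES * ROWS_PER_FILE
total_bytes = FILES * FILE_SIZE_BYTES

print(f"rows:  {total_rows:,}")
print(f"bytes: {total_bytes:,} (~{total_bytes / 1e12:.2f} TB)")
```

The row count comes out to exactly one trillion, and the byte count lands close to the 2.5 TB quoted.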
Here is how to download one of the 24 MiB files:
aws s3 cp s3://coiled-datasets-rp/1trc/measurements-0.parquet . --request-payer requester
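To make the goal concrete, here is a minimal pure-Python sketch of the aggregation the challenge asks for: minimum, mean, and maximum per station, sorted alphabetically. It assumes the rows have already been read into (station, temperature) pairs; reading Parquet is left out, and the sample rows and the `summarize` helper are hypothetical. It illustrates the computation only and is not how the notebooks in this repository process the data.

```python
def summarize(rows):
    """Return (station, min, mean, max) tuples sorted by station name.

    rows: iterable of (station, temperature) pairs.
    """
    stats = {}  # station -> [min, max, sum, count]
    for station, temp in rows:
        entry = stats.get(station)
        if entry is None:
            stats[station] = [temp, temp, temp, 1]
        else:
            entry[0] = min(entry[0], temp)
            entry[1] = max(entry[1], temp)
            entry[2] += temp
            entry[3] += 1
    return [
        (station, s[0], s[2] / s[3], s[1])
        for station, s in sorted(stats.items())
    ]


# Hypothetical sample data, not taken from the real dataset.
sample_rows = [
    ("Oslo", 4.0),
    ("Abha", 18.0),
    ("Oslo", -2.0),
    ("Abha", 22.0),
    ("Cairo", 30.5),
]

for station, lo, mean, hi in summarize(sample_rows):
    print(f"{station}: min={lo} mean={mean} max={hi}")
```

A single pass with running min, max, sum, and count per station keeps memory proportional to the number of stations rather than the number of rows, which is the property distributed engines rely on when they merge partial results.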
This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS us-east-1 and process 1 trillion rows of data. It gives you two options for tackling this challenge:
- ClickHouse: A powerful, high-performance analytics database.
- Daft and Ray: A dynamic duo for distributed computing and cutting-edge data processing.
Tip
You're not restricted to just the two options mentioned above. At its core, Kubox offers a free CLI that empowers you to effortlessly spin up Kubernetes clusters in your own cloud. It combines the ease of SaaS with the flexibility of PaaS, delivering a vendor-neutral data platform that can run anywhere.
We're excited to have you explore and experiment with Kubox. Feel free to dive in and share your feedback as we continue to enhance this project!
You should have AWS CLI installed and configured. If not, please follow the instructions here
Download and install kubox, a single binary required to create a Kubernetes cluster.
curl https://kubox.sh | sh
aws configure
aws sts get-caller-identity
Example output:
{
"UserId": "AIDAIEXAMPLEID",
"Account": "123456789012",
"Arn": "arn:aws:iam::123456789012:user/example-user"
}
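If you want to check the identity output programmatically before creating a cluster, a minimal sketch might look like the following. The `parse_caller_identity` helper is hypothetical and not part of Kubox or the AWS CLI; it only parses JSON shaped like the example output above and checks that the account ID embedded in the ARN matches the `Account` field.

```python
import json


def parse_caller_identity(raw):
    """Parse `aws sts get-caller-identity` JSON and cross-check the account ID."""
    identity = json.loads(raw)
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = identity["Arn"].split(":", 5)
    if len(parts) != 6 or parts[4] != identity["Account"]:
        raise ValueError("ARN does not match the reported account")
    return {"account": identity["Account"], "principal": parts[5]}


# The example output shown above, reproduced as a string.
example_output = """
{
  "UserId": "AIDAIEXAMPLEID",
  "Account": "123456789012",
  "Arn": "arn:aws:iam::123456789012:user/example-user"
}
"""

info = parse_caller_identity(example_output)
print(info["account"], info["principal"])
```

In practice you could feed it the real command output, for example by capturing it with `subprocess.run(["aws", "sts", "get-caller-identity"], capture_output=True, text=True).stdout`.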
We currently support running the 1TRC notebook in Kubox using Daft and Ray or ClickHouse. Follow the steps below to run the notebook:
git clone https://github.com/Kubox-ai/1trc.git
cd 1trc
Warning
Create the Kubox cluster in the AWS Region us-east-1 to avoid data transfer costs when running 1TRC. The Region is configurable in cluster-daft.yaml and cluster-clickhouse.yaml.
kubox create -f cluster-clickhouse.yaml
Tip
kubox create is idempotent, so you can run it again if you run into issues. For the troubleshooting guide, see here
A kubeconfig file will be generated as part of the cluster creation process. Set it as an environment variable:
export KUBECONFIG=./clickhouse/cluster/config/kubeconfig
Expose the notebook:
kubectl port-forward -n kubox svc/notebook 8888:80
Now you can access the notebook at http://localhost:8888
Expose the ClickHouse service:
kubectl port-forward -n kubox svc/clickhouse 8123:8123
You can access the ClickHouse dashboard at http://localhost:8123/dashboard. The default username and password are default and 123456.
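Besides the dashboard, the forwarded port also serves ClickHouse's HTTP interface, so you can send a query from a script to confirm the service is reachable. The sketch below is one way to do that with the standard library only; the helper names are hypothetical, and the host, port, and credentials are simply the defaults mentioned above. It is a connectivity check under those assumptions, not the method the notebook uses.

```python
import urllib.request


def build_clickhouse_request(sql, host="localhost", port=8123,
                             user="default", password="123456"):
    """Build an HTTP POST request carrying a SQL query for ClickHouse."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/",
        data=sql.encode("utf-8"),  # a request body makes this a POST
        headers={
            "X-ClickHouse-User": user,
            "X-ClickHouse-Key": password,
        },
    )


def run_query(sql):
    """Send the query and return the raw text response."""
    request = build_clickhouse_request(sql)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


# Usage, while the port-forward above is running:
# print(run_query("SELECT version()"))
```

Passing credentials in headers rather than in the URL keeps them out of shell history and proxy logs.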
Add your AWS keys in 1trc-clickhouse.ipynb to access the dataset:
import os

# Prefer keys from the environment; otherwise replace the placeholders.
AWS_ACCESS_KEY_ID = (
    os.getenv("AWS_ACCESS_KEY_ID") or "<YOUR_AWS_ACCESS_KEY_ID>"
)
AWS_SECRET_ACCESS_KEY = (
    os.getenv("AWS_SECRET_ACCESS_KEY") or "<YOUR_AWS_SECRET_ACCESS_KEY>"
)
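Because the snippet above falls back to a placeholder string, a forgotten key only surfaces later as an S3 access error. A small guard can fail earlier with a clearer message. This is a hypothetical helper, not part of the notebook; it assumes placeholders always start with `<YOUR_`, as they do above.

```python
import os

PLACEHOLDER_PREFIX = "<YOUR_"  # matches the placeholder style used in the notebook


def require_credential(name, environ=os.environ):
    """Return the credential named `name`, or raise if it is missing or a placeholder."""
    value = environ.get(name, "")
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Usage inside the notebook:
# AWS_ACCESS_KEY_ID = require_credential("AWS_ACCESS_KEY_ID")
# AWS_SECRET_ACCESS_KEY = require_credential("AWS_SECRET_ACCESS_KEY")
```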
Now you can run the notebook.
kubox delete -f cluster-clickhouse.yaml
Tip
If you have issues creating the cluster, see the troubleshooting guide here
kubox create -f cluster-daft.yaml
A kubeconfig file will be generated as part of the cluster creation process. Set it as an environment variable:
export KUBECONFIG=./daft/cluster/config/kubeconfig
Expose the notebook:
kubectl port-forward -n kubox svc/notebook 8888:80
Now you can access the notebook at http://localhost:8888
Expose the Ray Dashboard service:
kubectl port-forward -n kubox svc/ray-cluster-kuberay-head-svc 8265:8265
You can access the Ray Dashboard at http://localhost:8265.
Add your AWS keys in 1trc-daft.ipynb to access the dataset:
runtime_env = {
    "working_dir": "./",
    "pip": ["getdaft[all]"],
    "env_vars": {
        "AWS_ACCESS_KEY_ID": "<YOUR_AWS_ACCESS_KEY_ID>",
        "AWS_SECRET_ACCESS_KEY": "<YOUR_AWS_SECRET_ACCESS_KEY>",
    },
}
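If you would rather not paste keys into the notebook, one option is to build the same dictionary from your shell environment. The sketch below assumes the keys are exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; `build_runtime_env` is a hypothetical helper, and the other fields simply mirror the snippet above.

```python
import os

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def build_runtime_env(environ=os.environ):
    """Build a Ray runtime_env dict, copying AWS keys from `environ`."""
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "working_dir": "./",
        "pip": ["getdaft[all]"],
        "env_vars": {key: environ[key] for key in REQUIRED_KEYS},
    }


# Usage inside the notebook, in place of the hard-coded dictionary:
# runtime_env = build_runtime_env()
```

The resulting dictionary has the same shape as the one above, so the rest of the notebook should not need to change.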
Now you can run the notebook.
kubox delete -f cluster-daft.yaml
Tip
If you have issues creating the cluster, see the troubleshooting guide here
Note
The cost estimates below are based on AWS EC2 spot instance prices.
| Metric/Framework | Daft + Ray | ClickHouse |
|---|---|---|
| Startup time | 320s | 313s |
| Running time | 1189s | 527s |
| Delete time | 122s | 123s |
| Estimated cost | $2.75 | $1.37 |
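To put the table in perspective, the sketch below derives a few figures from it: end-to-end time (startup plus running plus delete) and average rows processed per second of running time. It uses only the numbers in the table and assumes the full one trillion rows were processed in each run.

```python
ROWS = 1_000_000_000_000  # one trillion rows, per the challenge definition

# Values copied from the comparison table above (seconds and US dollars).
results = {
    "Daft + Ray": {"startup": 320, "running": 1189, "delete": 122, "cost_usd": 2.75},
    "ClickHouse": {"startup": 313, "running": 527, "delete": 123, "cost_usd": 1.37},
}

summary = {}
for name, r in results.items():
    total_s = r["startup"] + r["running"] + r["delete"]
    summary[name] = {
        "total_s": total_s,
        "rows_per_s": ROWS / r["running"],
    }
    print(
        f"{name}: total {total_s}s, "
        f"{summary[name]['rows_per_s'] / 1e9:.2f} billion rows/s while running, "
        f"${r['cost_usd']:.2f}"
    )
```

Startup and delete times are nearly identical for the two setups, so the difference in total time and cost comes almost entirely from the running phase.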
Work in progress.
We welcome contributions! If you find a bug, have a feature request, or want to improve the performance, feel free to open an issue or submit a pull request.
This repository is licensed under the Apache License 2.0. You are free to use, modify, and distribute this project under the terms of the license. See the LICENSE file for more details.