Blog/clickhouse-benchmarking #61
Conversation
✅ Deploy Preview for infraspec ready!
weight: 1
---

"Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car."
Do we need to quote the whole setup of the blog?
- **Throughput:** Finally, we measured how many queries could be executed per second under sustained load conditions.

**🔍 For detailed performance metrics and benchmarks, please refer to the full report [**here**](https://infraspec.getoutline.com/doc/clickhouse-deployment-and-performance-benchmarking-on-ecs-Stsim2Uoz1).**
Why do we have an internal doc link in a public blog?
Do you think we can make this data public?
## Configuration Changes for ClickHouse Deployment

### Node Descriptions
Do you think a diagram can do a better job here?
### Installation Steps

- **ClickHouse Server**: We deployed ClickHouse Server and Client on the data nodes, clickhouse-01 and clickhouse-02, using Docker images, specifically `clickhouse/clickhouse-server` for installation.
Highlight the instance name, please.
- **clickhouse-keeper-02**: Responsible for distributed coordination.
- **clickhouse-keeper-03**: Responsible for distributed coordination.

### Installation Steps
Can we share an easier way to set up the ClickHouse cluster the same way we did, so that the post is less descriptive?
@@ -0,0 +1,374 @@
---
title: "ClickHouse Deployment and Performance Benchmarking on ECS"
I do not think installation or installation configuration is worth blogging about, especially when we do not have an easy mechanism to do the same.
IMHO, we should focus on a performance benchmark for ClickHouse or a comparison of performance between ClickHouse and StarRocks. I would like to see more numbers and visuals on how performance is impacted by queries, etc.
<p align="center">
  <img width="480" height="600" src="/images/blog/clickhouse-benchmarking/clickhouse-storage.jpeg" alt="ClickHouse Storage">
This image looks out of place.
  <img width="480" height="600" src="/images/blog/clickhouse-benchmarking/clickhouse-storage.jpeg" alt="ClickHouse Storage">
</p>

In this blog, we'll dive into the process of deploying ClickHouse on AWS Elastic Container Service (ECS). We’ll also look at performance benchmarking to evaluate ClickHouse as a high-performance log storage backend. Our focus will be on its ingestion rates, query performance, scalability, and resource utilization.
We should have a TF module for this.
Force-pushed from 4ef42e2 to c3c1db8.
Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car.

In this blog, we’ll focus on the performance benchmarking of ClickHouse on AWS ECS during the ingestion of different data volumes. We’ll analyze key system metrics such as CPU usage, memory consumption, disk I/O, and row insertion rates across varying data ingestion sizes.
We're missing context in this blog: this performance benchmarking of ClickHouse is for the use case of storing and querying logs. Maybe we can add that to the heading as well, somehow.
Is this file used?
Can we show the setup at the start?
<p align="center">
  <img src="/images/blog/clickhouse-benchmarking/clickhouse-write-operations.png" alt="clickhouse-benchmarking" style="border-radius: 10px; width: 300; height: 500;">
</p>
<!-- markdownlint-enable MD033 -->
Maybe we can disable this globally for all blogs using `.markdownlint.json`?
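If we go that route, a minimal sketch of what a repo-level `.markdownlint.json` could look like (keeping all other rules at their defaults is an assumption, not something decided in this thread):

```json
{
  "default": true,
  "MD033": false
}
```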
For setting up the ClickHouse cluster, we followed the [ClickHouse replication architecture guide](https://clickhouse.com/docs/en/architecture/replication) and the [AWS CloudFormation ClickHouse cluster setup](https://aws-ia.github.io/cfn-ps-clickhouse-cluster/). Using these resources, we replicated the setup on ECS, allowing us to run performance benchmarking tests on the environment.

By examining performance metrics during the ingestion of 1 million (10 lakh), 5 million (50 lakh), 10 million (1 crore), and 66 million (6.6 crore) logs, we aim to provide a quantitative analysis of how system behavior changes as the load increases.
Can we set up all the scenarios you're looking at in this blog upfront?

- Ingestion
  - Metrics you want to look at
- Querying
  - Metrics you want to look at
- Effect of node count
  - Metrics you want to look at
Also, there is no need to convert millions to the Indian numbering system (e.g. `1 million (10 lakh)`). I think people understand the international numbering system.
The metrics collected for all three scenarios are the same ones from the default ClickHouse dashboard, and we already mention them in the blog under each scenario. So I think listing them again at the start would be repetitive and would make the blog quite lengthy.
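For readers who want to pull the same numbers themselves, the dashboard metrics ultimately come from ClickHouse's system tables; a minimal sketch of the kind of queries involved (exact metric and event names vary by ClickHouse version):

```sql
-- Point-in-time gauges (memory, connections, background tasks, ...)
SELECT metric, value
FROM system.metrics
ORDER BY metric;

-- Cumulative counters such as rows/bytes inserted and selected since startup
SELECT event, value
FROM system.events
WHERE event IN ('InsertedRows', 'InsertedBytes', 'SelectedRows', 'SelectedBytes');
```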
</p>
<!-- markdownlint-enable MD033 -->

### Key Insights from the Data
Can we add what queries we ran here?
This is useful for other people to reproduce what you folks did here.
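Purely as an illustration, the kind of log-analytics queries such a benchmark typically runs might look like the following; the table and column names are assumptions, not the statements actually used in the post:

```sql
-- Hypothetical log table; names are illustrative only
SELECT count()
FROM logs
WHERE level = 'ERROR';

SELECT service, count() AS events
FROM logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY events DESC
LIMIT 10;
```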
## Performance Comparison of Key Metrics Across Ingestion Volumes

| **Logs Ingested** | **CPU Usage (Cores)** | **Selected Bytes per Second (B/s)** | **IO Wait (s)** | **CPU Wait (s)** | **Read from Disk (B)** | **Read from Filesystem (B)** | **Memory Tracked (Bytes)** | **Selected Rows per Second** | **Inserted Rows per Second** |
To be honest, this is okayishly readable. I don't have any specific suggestions, but let's see if you can find something to make it more readable.
### Key Insights from the Data

#### 1. **CPU Usage (avg_cpu_usage_cores)**
I think metric result images and insights would go well together, rather than bundling all of them in one place. Like:
- Metrics
- Metric image
- Insight
We did consider this at the start, but we collected quite a lot of metrics and plotted their graphs. If we follow this format, the blog will become too lengthy and verbose, so we decided to plot all the graphs together and mention the insights below them.
I understand that, but now they seem disconnected and difficult to correlate. The text itself is verbose and difficult to understand without looking up the diagram.
static/images/blog/clickhouse-benchmarking/clickhouse-read-operations.png (outdated; resolved)
#### 1. **CPU Usage (avg_cpu_usage_cores)**

- **At 1 Million logs**, the CPU usage was minimal at **0.103 cores**, indicating a low load on the system.
For values, you can use code blocks to highlight: `0.103 cores` over **0.103 cores**. I think bold is overused in the blog, from headings to data like **At 1 Million logs** and results like **0.103 cores**.
I think we can improve readability.
## Performance in Read-Heavy Operations

ClickHouse’s performance during read-heavy operations, including `SELECT`, aggregate, and `JOIN` queries, is critical for applications relying on fast data retrieval. Here, we analyze key system metrics across different configurations: two-node replicas under load balancing and a single-node configuration due to failover.
Can we add a diagram to show that we're reading/querying the ClickHouse cluster?
We can do the same with ingestion. A very simple diagram with the nodes, an LB in front, and the client (the person querying, or the source in the case of ingestion) should suffice.
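To make the read-heavy scenario concrete, aggregate and `JOIN` queries of roughly this shape are what the section describes; a sketch with hypothetical table names, not the exact benchmark statements:

```sql
-- Aggregate over the hypothetical log table
SELECT service, count() AS requests, countIf(level = 'ERROR') AS errors
FROM logs
GROUP BY service;

-- JOIN against a hypothetical dimension table with service metadata
SELECT s.team, count() AS requests
FROM logs AS l
INNER JOIN services AS s ON l.service = s.name
GROUP BY s.team;
```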
<!-- markdownlint-enable MD024 -->

### Incremental Comparison of Key Metrics Across Configurations
None of these setups are production grade (1 or 2 nodes), so I do not think these numbers are worth comparing. What cluster sizing did we have for the ingestion and read-throughput benchmarks?
Also, it would help to understand the type and size of the EC2 machine to determine whether the instance itself limits these benchmarks.
## Performance Comparison of Key Metrics Across Ingestion Volumes

| `Logs Ingested` | `CPU Usage (Cores)` | `Selected Bytes per Second (B/s)` | `IO Wait (s)` | `CPU Wait (s)` | `Read from Disk (B)` | `Read from Filesystem (B)` | `Memory Tracked (Bytes)` | `Selected Rows per Second` | `Inserted Rows per Second` |
Can we make these metrics more readable? Like, use MB/s instead of B/s?
#### 2. `Selected Bytes per Second (avg_selected_bytes_per_second)`

- `For 1 million logs`, the system processed `27,118 bytes/sec`, and this grew to `37,546 bytes/sec` for `5 million logs`.
I don't see much insight here. These are just values you already have shown through the image.
Insights are easy to understand and are easily consumable versions of what data shows/infers. We should try to present that.
For example, a `6x` increase in ingestion rate results in a `30%` increase in CPU.
#### 7. `Memory Usage (avg_memory_tracked)`

- `Memory tracked` for `Node-1` ranged from `727,494,761.07 to 956,479,931 bytes`, and `Node-2` from `819,671,565.67 to 725,970,944 bytes` in the two-node setup.
It would be great to use measurements that are more user-friendly and easier to understand. Why not consider using MB or GB instead?
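For instance, the `956,479,931 bytes` peak quoted above works out to 956,479,931 ÷ 1024² ≈ 912 MiB (about 956 MB in decimal units), which is far easier to scan in a table than the raw byte count.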
- Memory usage was `714 MB` for `1 million logs`, increasing to `738 MB` for `5 million logs`, a `3%` rise.
- At `10 million logs`, memory usage reached `983 MB`, a `36%` increase.
- At `66 million logs`, it peaked at `1.9 GB`, a doubling from `10 million logs`.
Grammar issue: gigabyte
Suggestion: GB
  <img src="/images/blog/clickhouse-benchmarking/read-op-cpu-usage.png" alt="clickhouse-benchmarking" style="border-radius: 10px; width: 650px; height: 300px;">
</p>

- `Two-Node Setup`: Node-1 utilized between `574 to 875 milli-cores` during query processing, handling most of the workload. Node-2 had lower CPU usage, ranging from `122 to 493 milli-cores`, indicating that load distribution wasn’t entirely balanced across nodes.
Grammar issue: wasn’t
Suggestion: was not