
Blog/clickhouse-benchmarking #61

Merged: 13 commits merged into main from blog/clickhouse-benchmarking on Nov 20, 2024

Conversation

rohit-ng
Collaborator

No description provided.


netlify bot commented Oct 21, 2024

Deploy Preview for infraspec ready!

- 🔨 Latest commit: 6d60a32
- 🔍 Latest deploy log: https://app.netlify.com/sites/infraspec/deploys/6735d0db782e3e0008a68d01
- 😎 Deploy Preview: https://deploy-preview-61--infraspec.netlify.app

weight: 1
---

"Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car."
Contributor

Do we need to quote the whole setup of the blog?


- **Throughput:** Finally, we measured how many queries could be executed per second under sustained load conditions.
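
As a rough illustration of how a sustained-load throughput number can be produced, here is a minimal sketch using the stock `clickhouse-benchmark` tool; the host, concurrency, table, and query below are assumptions for illustration, not the exact invocation from this benchmark run.

```bash
# Hypothetical sketch: keep 8 concurrent clients issuing the same query and let
# clickhouse-benchmark report queries-per-second and latency percentiles.
# Host, port, table, and query are illustrative assumptions only.
clickhouse-benchmark \
  --host clickhouse-01 \
  --port 9000 \
  --concurrency 8 \
  --iterations 10000 \
  --query "SELECT count() FROM logs WHERE level = 'ERROR'"
```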

**🔍 For detailed performance metrics and benchmarks, please refer to the full report [**here**](https://infraspec.getoutline.com/doc/clickhouse-deployment-and-performance-benchmarking-on-ecs-Stsim2Uoz1).**
Contributor

Why do we have an internal doc link in a public blog?

Do you think we can make this data public?


## Configuration Changes for ClickHouse Deployment

### Node Descriptions
Contributor

Do you think a diagram can do a better job here?


### Installation Steps

- **ClickHouse Server**: We deployed ClickHouse Server and Client on the data nodes, clickhouse-01 and clickhouse-02, using Docker images, specifically `clickhouse/clickhouse-server` for installation.
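
As a rough sketch of what that looks like in practice (container name, ports, and mount paths below are assumptions for illustration, not the exact commands used on ECS):

```bash
# Hypothetical sketch: start one data node (clickhouse-01) from the official
# image; the second node (clickhouse-02) would be started the same way.
docker run -d --name clickhouse-01 \
  -p 8123:8123 -p 9000:9000 \
  -v /opt/clickhouse/config:/etc/clickhouse-server/config.d \
  -v /opt/clickhouse/data:/var/lib/clickhouse \
  clickhouse/clickhouse-server

# Verify the node answers queries using the bundled client.
docker exec -it clickhouse-01 clickhouse-client --query "SELECT version()"
```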
Contributor

Highlight the instance name, please.

- **clickhouse-keeper-02**: Responsible for distributed coordination.
- **clickhouse-keeper-03**: Responsible for distributed coordination.

### Installation Steps
Contributor

Can we share an easier way to set up the ClickHouse cluster the way we did, so that the post is less descriptive?

@@ -0,0 +1,374 @@
---
title: "ClickHouse Deployment and Performance Benchmarking on ECS"
Contributor

I do not think installation or installation configuration is worth blogging about, especially when we do not have an easy mechanism to reproduce it.

IMHO, we should focus on a performance benchmark for ClickHouse or a comparison of performance between ClickHouse and StarRocks. I would like to see more numbers and visuals on how performance is impacted by queries, etc.

"Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car."

<p align="center">
<img width="480" height="600" src="/images/blog/clickhouse-benchmarking/clickhouse-storage.jpeg" alt="ClickHouse Storage">
Contributor

This image looks out of place.

<img width="480" height="600" src="/images/blog/clickhouse-benchmarking/clickhouse-storage.jpeg" alt="ClickHouse Storage">
</p>

In this blog, we'll dive into the process of deploying ClickHouse on AWS Elastic Container Service (ECS). We’ll also look at performance benchmarking to evaluate ClickHouse as a high-performance log storage backend. Our focus will be on its ingestion rates, query performance, scalability, and resource utilization.
Contributor

We should have a TF module for this.

@Rahul-4480 force-pushed the blog/clickhouse-benchmarking branch from 4ef42e2 to c3c1db8 on October 25, 2024 07:17
@infraspecdev deleted a comment from github-actions bot on Oct 25, 2024

Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car.

In this blog, we’ll focus on the performance benchmarking of ClickHouse on AWS ECS during the ingestion of different data volumes. We’ll analyze key system metrics such as CPU usage, memory consumption, disk I/O, and row insertion rates across varying data ingestion sizes.
Contributor

We're missing the context in this blog: this performance benchmarking of ClickHouse is for the use case of storing and querying logs. Maybe we can add that to the heading as well, somehow.

Contributor

Is this file used?

Can we show the setup at the start?

<p align="center">
<img src="/images/blog/clickhouse-benchmarking/clickhouse-write-operations.png" alt="clickhouse-benchmarking" style="border-radius: 10px; width: 300px; height: 500px;">
</p>
<!-- markdownlint-enable MD033 -->
Contributor

Maybe we can disable this globally for all blogs using .markdownlint.json?
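
If we go that route, a minimal sketch of the repo-level override (assuming markdownlint picks up a `.markdownlint.json` at the repository root; MD033 is the inline-HTML rule):

```bash
# Hypothetical sketch: disable the inline-HTML rule (MD033) for every blog post
# by writing a repo-level markdownlint config.
cat > .markdownlint.json <<'EOF'
{
  "default": true,
  "MD033": false
}
EOF
```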


For setting up the ClickHouse cluster, we followed the [ClickHouse replication architecture guide](https://clickhouse.com/docs/en/architecture/replication) and the [AWS CloudFormation ClickHouse cluster setup](https://aws-ia.github.io/cfn-ps-clickhouse-cluster/). Using these resources, we replicated the setup on ECS, allowing us to run performance benchmarking tests on the environment.

By examining performance metrics during the ingestion of 1 million (10 lakh), 5 million (50 lakh), 10 million (1 crore), and 66 million (6.6 crore) logs, we aim to provide a quantitative analysis of how system behavior changes as the load increases.
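
For readers who want to reproduce such an ingestion run, a minimal sketch is shown below; the table schema, database, and host are illustrative assumptions, and a plain single-node MergeTree table is used to keep the sketch short, whereas the post's cluster uses replicated tables per the guides above.

```bash
# Hypothetical sketch: create a MergeTree-backed logs table and bulk-insert
# synthetic rows, stepping through the volumes measured in the post.
# Table name, schema, and host are assumptions for illustration only.
clickhouse-client --host clickhouse-01 --query "
  CREATE TABLE IF NOT EXISTS logs (
    ts     DateTime,
    level  LowCardinality(String),
    msg    String
  ) ENGINE = MergeTree
  ORDER BY ts"

for rows in 1000000 5000000 10000000 66000000; do
  clickhouse-client --host clickhouse-01 --query "
    INSERT INTO logs
    SELECT now() - number,
           ['INFO','WARN','ERROR'][number % 3 + 1],
           concat('log line ', toString(number))
    FROM numbers($rows)"
done
```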
Contributor

Can we set up all the scenarios you're looking at in this blog upfront?

  1. Ingestion
    1. Metrics you want to look at
  2. Querying
    1. Metrics you want to look at
  3. Effect of node count
    1. Metrics you want to look at

Contributor

Also, there is no need to convert millions to the Indian numbering system (e.g., 1 million (10 lakh)). I think people understand the international numbering system.

Collaborator

The metrics collected for all three scenarios are the same ones from the default ClickHouse dashboard, and we already mention them in the blog under each scenario. So I think listing them again at the start would be repetitive and would make the blog quite lengthy.

</p>
<!-- markdownlint-enable MD033 -->

### Key Insights from the Data
Contributor

Can we add what queries we ran here?

This is useful for other people to reproduce what you folks did here.


## Performance Comparison of Key Metrics Across Ingestion Volumes

| **Logs Ingested** | **CPU Usage (Cores)** | **Selected Bytes per Second (B/s)** | **IO Wait (s)** | **CPU Wait (s)** | **Read from Disk (B)** | **Read from Filesystem (B)** | **Memory Tracked (Bytes)** | **Selected Rows per Second** | **Inserted Rows per Second** |
Contributor

To be honest, this is okayishly readable. I don't have any specific suggestions, but let's see if you can find something to make it more readable.


### Key Insights from the Data

#### 1. **CPU Usage (avg_cpu_usage_cores)**
Contributor

I think metric result images and insights would go well together, rather than bundling all of them together. Like:

  • Metrics
  • Metric image
  • Insight

Collaborator

We considered this point at the start, but we collected quite a lot of metrics and plotted their graphs, and following that structure would make the blog too long and verbose. So we decided to plot all the graphs together and list the insights below them.

Contributor

I understand that, but now they seem disconnected and difficult to correlate. The text itself is verbose and difficult to understand without looking up the diagram.


#### 1. **CPU Usage (avg_cpu_usage_cores)**

- **At 1 Million logs**, the CPU usage was minimal at **0.103 cores**, indicating a low load on the system.
Contributor

For values, you can use code blocks to highlight: `0.103 cores` rather than **0.103 cores**. I think bold is overused in the blog, from headings to data like "At 1 Million logs" and results like "0.103 cores".

I think we can improve readability.


## Performance in Read-Heavy Operations

ClickHouse’s performance during read-heavy operations, including `SELECT`, aggregate, and `JOIN` queries, is critical for applications relying on fast data retrieval. Here, we analyze key system metrics across different configurations: two-node replicas under load balancing and a single-node configuration due to failover.
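
As a rough illustration of the query shapes exercised here, a sketch follows; the load-balancer hostname, table, and columns are assumptions for the example, not the exact queries from the benchmark.

```bash
# Hypothetical sketch of the three read-heavy shapes: point SELECT, aggregation,
# and a JOIN. Host, table, and columns are illustrative assumptions only.
clickhouse-client --host clickhouse-lb --query "
  SELECT * FROM logs WHERE level = 'ERROR' ORDER BY ts DESC LIMIT 100"

clickhouse-client --host clickhouse-lb --query "
  SELECT level, count() AS events, max(ts) AS last_seen
  FROM logs GROUP BY level"

clickhouse-client --host clickhouse-lb --query "
  SELECT l.level, count() AS matches
  FROM logs AS l
  INNER JOIN (SELECT DISTINCT msg FROM logs WHERE level = 'ERROR') AS e
    ON l.msg = e.msg
  GROUP BY l.level"
```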
Contributor

Can we add a diagram to show that we're reading/querying the Clickhouse cluster?

We can do the same with ingestion. A very simple diagram with the nodes, an LB in front, and the client (the person querying, or the source in the case of ingestion) should suffice.


<!-- markdownlint-enable MD024 -->

### Incremental Comparison of Key Metrics Across Configurations
Contributor

None of these setups are production grade (1 or 2 nodes), so I do not think these numbers are worth comparing. What cluster sizing did we have for the ingestion and read-throughput benchmarks?

Also, it would help to understand the type and size of the EC2 machine to determine whether the instance itself limits these benchmarks.


## Performance Comparison of Key Metrics Across Ingestion Volumes

| `Logs Ingested` | `CPU Usage (Cores)` | `Selected Bytes per Second (B/s)` | `IO Wait (s)` | `CPU Wait (s)` | `Read from Disk (B)` | `Read from Filesystem (B)` | `Memory Tracked (Bytes)` | `Selected Rows per Second` | `Inserted Rows per Second` |
Contributor

Can we make these metrics more readable? Like, use MB/s instead of B/s?


#### 2. `Selected Bytes per Second (avg_selected_bytes_per_second)`

- `For 1 million logs`, the system processed `27,118 bytes/sec`, and this grew to `37,546 bytes/sec` for `5 million logs`.
Contributor

I don't see much insight here. These are just values you already have shown through the image.

Insights should be easy-to-understand, easily consumable takeaways from what the data shows or infers. We should try to present that.

For example, a 6x increase in ingestion rate results in a 30% increase in CPU.


#### 7. `Memory Usage (avg_memory_tracked)`

- `Memory tracked` for `Node-1` ranged from `727,494,761.07 to 956,479,931 bytes`, and `Node-2` from `819,671,565.67 to 725,970,944 bytes` in the two-node setup.
Contributor

It would be great to use measurements that are more user-friendly and easier to understand. Why not consider using MB or GB instead?


- Memory usage was `714 MB` for `1 million logs`, increasing to `738 MB` for `5 million logs`, a `3%` rise.
- At `10 million logs`, memory usage reached `983 MB`, a `36%` increase.
- At `66 million logs`, it peaked at `1.9 GB`, a doubling from `10 million logs`.


Grammar issue: gigabyte
Suggestion: GB

<img src="/images/blog/clickhouse-benchmarking/read-op-cpu-usage.png" alt="clickhouse-benchmarking" style="border-radius: 10px; width: 650px; height: 300px;">
</p>

- `Two-Node Setup`: Node-1 utilized between `574 to 875 milli-cores` during query processing, handling most of the workload. Node-2 had lower CPU usage, ranging from `122 to 493 milli-cores`, indicating that load distribution wasn’t entirely balanced across nodes.


Grammar issue: wasn’t
Suggestion: was not

@vjdhama merged commit 6f35009 into main on Nov 20, 2024
6 checks passed