
Blog/clickhouse-benchmarking #61

Merged: 13 commits merged into main from blog/clickhouse-benchmarking on Nov 20, 2024

Conversation

rohit-ng
Collaborator

No description provided.


netlify bot commented Oct 21, 2024

Deploy Preview for infraspec ready!

- 🔨 Latest commit: 6d60a32
- 🔍 Latest deploy log: https://app.netlify.com/sites/infraspec/deploys/6735d0db782e3e0008a68d01
- 😎 Deploy Preview: https://deploy-preview-61--infraspec.netlify.app

weight: 1
---

"Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car."
Contributor

Do we need to quote the whole setup of the blog?


- **Throughput:** Finally, we measured how many queries could be executed per second under sustained load conditions.
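
As a rough illustration of how a sustained-load throughput number can be produced, here is a minimal sketch using the stock `clickhouse-benchmark` tool; the host, concurrency, table, and query below are assumptions for illustration, not the exact invocation from this benchmark run.

```bash
# Hypothetical sketch: keep 8 concurrent clients issuing the same query and let
# clickhouse-benchmark report queries-per-second and latency percentiles.
# Host, port, table, and query are illustrative assumptions only.
clickhouse-benchmark \
  --host clickhouse-01 \
  --port 9000 \
  --concurrency 8 \
  --iterations 10000 \
  --query "SELECT count() FROM logs WHERE level = 'ERROR'"
```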

**🔍 For detailed performance metrics and benchmarks, please refer to the full report [**here**](https://infraspec.getoutline.com/doc/clickhouse-deployment-and-performance-benchmarking-on-ecs-Stsim2Uoz1).**
Contributor

Why do we have an internal doc link in a public blog?

Do you think we can make this data public?


## Configuration Changes for ClickHouse Deployment

### Node Descriptions
Contributor

Do you think a diagram can do a better job here?


### Installation Steps

- **ClickHouse Server**: We deployed ClickHouse Server and Client on the data nodes, clickhouse-01 and clickhouse-02, using Docker images, specifically `clickhouse/clickhouse-server` for installation.
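
As a rough sketch of what that looks like in practice (container name, ports, and mount paths below are assumptions for illustration, not the exact commands used on ECS):

```bash
# Hypothetical sketch: start one data node (clickhouse-01) from the official
# image; the second node (clickhouse-02) would be started the same way.
docker run -d --name clickhouse-01 \
  -p 8123:8123 -p 9000:9000 \
  -v /opt/clickhouse/config:/etc/clickhouse-server/config.d \
  -v /opt/clickhouse/data:/var/lib/clickhouse \
  clickhouse/clickhouse-server

# Verify the node answers queries using the bundled client.
docker exec -it clickhouse-01 clickhouse-client --query "SELECT version()"
```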
Contributor

Highlight the instance name, please.

- **clickhouse-keeper-02**: Responsible for distributed coordination.
- **clickhouse-keeper-03**: Responsible for distributed coordination.

### Installation Steps
Contributor

Can we share an easier way to set up the ClickHouse cluster the way we did, so that the post is less descriptive?

@@ -0,0 +1,374 @@
---
title: "ClickHouse Deployment and Performance Benchmarking on ECS"
Contributor

I do not think installation or installation configuration is worth blogging about, especially when we do not have an easy mechanism to reproduce it.

IMHO, we should focus on a performance benchmark for ClickHouse or a comparison of performance between ClickHouse and StarRocks. I would like to see more numbers and visuals on how performance is impacted by queries, etc.

"Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car."

<p align="center">
<img width="480" height="600" src="/images/blog/clickhouse-benchmarking/clickhouse-storage.jpeg" alt="ClickHouse Storage">
Contributor

This image looks out of place.

<img width="480" height="600" src="/images/blog/clickhouse-benchmarking/clickhouse-storage.jpeg" alt="ClickHouse Storage">
</p>

In this blog, we'll dive into the process of deploying ClickHouse on AWS Elastic Container Service (ECS). We’ll also look at performance benchmarking to evaluate ClickHouse as a high-performance log storage backend. Our focus will be on its ingestion rates, query performance, scalability, and resource utilization.
Contributor

We should have a TF module for this.

@Rahul-4480 force-pushed the blog/clickhouse-benchmarking branch from 4ef42e2 to c3c1db8 on October 25, 2024 07:17
@infraspecdev deleted a comment from github-actions bot on Oct 25, 2024

Imagine being a Formula One driver, racing at breakneck speeds, but without any telemetry data to guide you. It’s a thrilling ride, but one wrong turn or overheating engine could lead to disaster. Just like a pit crew relies on performance metrics to optimize the car's speed and handling, we use observability in ClickHouse to monitor our data system's health. These metrics provide crucial insights, allowing us to identify bottlenecks, prevent outages, and fine-tune performance, ensuring our data engine runs as smoothly and efficiently as a championship-winning race car.

In this blog, we’ll focus on the performance benchmarking of ClickHouse on AWS ECS during the ingestion of different data volumes. We’ll analyze key system metrics such as CPU usage, memory consumption, disk I/O, and row insertion rates across varying data ingestion sizes.
Contributor

We're missing the context in this blog: this performance benchmarking of ClickHouse is for the use case of storing and querying logs. Maybe we can add that to the heading as well, somehow.

Contributor

Is this file used?

Can we show the setup at the start?

<p align="center">
<img src="/images/blog/clickhouse-benchmarking/clickhouse-write-operations.png" alt="clickhouse-benchmarking" style="border-radius: 10px; width: 300px; height: 500px;">
</p>
<!-- markdownlint-enable MD033 -->
Contributor

Maybe we can disable this globally for all blogs using .markdownlint.json?
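
If we go that route, a minimal sketch of the repo-level override (assuming markdownlint picks up a `.markdownlint.json` at the repository root; MD033 is the inline-HTML rule):

```bash
# Hypothetical sketch: disable the inline-HTML rule (MD033) for every blog post
# by writing a repo-level markdownlint config.
cat > .markdownlint.json <<'EOF'
{
  "default": true,
  "MD033": false
}
EOF
```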


For setting up the ClickHouse cluster, we followed the [ClickHouse replication architecture guide](https://clickhouse.com/docs/en/architecture/replication) and the [AWS CloudFormation ClickHouse cluster setup](https://aws-ia.github.io/cfn-ps-clickhouse-cluster/). Using these resources, we replicated the setup on ECS, allowing us to run performance benchmarking tests on the environment.

By examining performance metrics during the ingestion of 1 million (10 lakh), 5 million (50 lakh), 10 million (1 crore), and 66 million (6.6 crore) logs, we aim to provide a quantitative analysis of how system behavior changes as the load increases.
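
For readers who want to reproduce such an ingestion run, a minimal sketch is shown below; the table schema, database, and host are illustrative assumptions, and a plain single-node MergeTree table is used to keep the sketch short, whereas the post's cluster uses replicated tables per the guides above.

```bash
# Hypothetical sketch: create a MergeTree-backed logs table and bulk-insert
# synthetic rows, stepping through the volumes measured in the post.
# Table name, schema, and host are assumptions for illustration only.
clickhouse-client --host clickhouse-01 --query "
  CREATE TABLE IF NOT EXISTS logs (
    ts     DateTime,
    level  LowCardinality(String),
    msg    String
  ) ENGINE = MergeTree
  ORDER BY ts"

for rows in 1000000 5000000 10000000 66000000; do
  clickhouse-client --host clickhouse-01 --query "
    INSERT INTO logs
    SELECT now() - number,
           ['INFO','WARN','ERROR'][number % 3 + 1],
           concat('log line ', toString(number))
    FROM numbers($rows)"
done
```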
Contributor

Can we set up all the scenarios you're looking at in this blog upfront?

  1. Ingestion
    1. Metrics you want to look at
  2. Querying
    1. Metrics you want to look at
  3. Effect of node count
    1. Metrics you want to look at

Contributor

Also, there is no need to convert millions to the Indian numbering system (e.g., 1 million (10 lakh)). I think people understand the international numbering system.

Collaborator

The metrics collected for all three scenarios are the same ones from the default ClickHouse dashboard, and we already mention them in the blog under each scenario. So I think listing them again at the start would be repetitive and would make the blog quite lengthy.

</p>
<!-- markdownlint-enable MD033 -->

### Key Insights from the Data
Contributor

Can we add what queries we ran here?

This is useful for other people to reproduce what you folks did here.


## Performance Comparison of Key Metrics Across Ingestion Volumes

| **Logs Ingested** | **CPU Usage (Cores)** | **Selected Bytes per Second (B/s)** | **IO Wait (s)** | **CPU Wait (s)** | **Read from Disk (B)** | **Read from Filesystem (B)** | **Memory Tracked (Bytes)** | **Selected Rows per Second** | **Inserted Rows per Second** |
Contributor

To be honest, this is okayishly readable. I don't have any specific suggestions, but let's see if you can find something to make it more readable.


### Key Insights from the Data

#### 1. **CPU Usage (avg_cpu_usage_cores)**
Contributor

I think metric result images and insights would go well together, rather than bundling all of them together. Like:

  • Metrics
  • Metric image
  • Insight

Collaborator

We considered this point at the start, but we collected quite a lot of metrics and plotted their graphs, and following that structure would make the blog too long and verbose. So we decided to plot all the graphs together and list the insights below them.

Contributor

I understand that, but now they seem disconnected and difficult to correlate. The text itself is verbose and difficult to understand without looking up the diagram.


#### 1. **CPU Usage (avg_cpu_usage_cores)**

- **At 1 Million logs**, the CPU usage was minimal at **0.103 cores**, indicating a low load on the system.
Contributor

For values, you can use code blocks to highlight: `0.103 cores` rather than **0.103 cores**. I think bold is overused in the blog, from headings to data like "At 1 Million logs" and results like "0.103 cores".

I think we can improve readability.


## Performance in Read-Heavy Operations

ClickHouse’s performance during read-heavy operations, including `SELECT`, aggregate, and `JOIN` queries, is critical for applications relying on fast data retrieval. Here, we analyze key system metrics across different configurations: two-node replicas under load balancing and a single-node configuration due to failover.
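
As a rough illustration of the query shapes exercised here, a sketch follows; the load-balancer hostname, table, and columns are assumptions for the example, not the exact queries from the benchmark.

```bash
# Hypothetical sketch of the three read-heavy shapes: point SELECT, aggregation,
# and a JOIN. Host, table, and columns are illustrative assumptions only.
clickhouse-client --host clickhouse-lb --query "
  SELECT * FROM logs WHERE level = 'ERROR' ORDER BY ts DESC LIMIT 100"

clickhouse-client --host clickhouse-lb --query "
  SELECT level, count() AS events, max(ts) AS last_seen
  FROM logs GROUP BY level"

clickhouse-client --host clickhouse-lb --query "
  SELECT l.level, count() AS matches
  FROM logs AS l
  INNER JOIN (SELECT DISTINCT msg FROM logs WHERE level = 'ERROR') AS e
    ON l.msg = e.msg
  GROUP BY l.level"
```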
Contributor

Can we add a diagram to show that we're reading/querying the Clickhouse cluster?

We can do the same with ingestion. A very simple diagram with the nodes, an LB in front, and the client (the person querying, or the source in the case of ingestion) should suffice.


<!-- markdownlint-enable MD024 -->

### Incremental Comparison of Key Metrics Across Configurations
Contributor

None of these setups are production grade (1 or 2 nodes), so I do not think these numbers are worth comparing. What cluster sizing did we have for the ingestion and read-throughput benchmarks?

Also, it would help to understand the type and size of the EC2 machine to determine whether the instance itself limits these benchmarks.


## Performance Comparison of Key Metrics Across Ingestion Volumes

| `Logs Ingested` | `CPU Usage (Cores)` | `Selected Bytes per Second (B/s)` | `IO Wait (s)` | `CPU Wait (s)` | `Read from Disk (B)` | `Read from Filesystem (B)` | `Memory Tracked (Bytes)` | `Selected Rows per Second` | `Inserted Rows per Second` |
Contributor

Can we make these metrics more readable? Like, use MB/s instead of B/s?


#### 2. `Selected Bytes per Second (avg_selected_bytes_per_second)`

- `For 1 million logs`, the system processed `27,118 bytes/sec`, and this grew to `37,546 bytes/sec` for `5 million logs`.
Contributor

I don't see much insight here. These are just values you already have shown through the image.

Insights should be easy-to-understand, easily consumable takeaways from what the data shows or infers. We should try to present that.

For example, a 6x increase in ingestion rate results in a 30% increase in CPU.


#### 7. `Memory Usage (avg_memory_tracked)`

- `Memory tracked` for `Node-1` ranged from `727,494,761.07 to 956,479,931 bytes`, and `Node-2` from `819,671,565.67 to 725,970,944 bytes` in the two-node setup.
Contributor

It would be great to use measurements that are more user-friendly and easier to understand. Why not consider using MB or GB instead?


- Memory usage was `714 MB` for `1 million logs`, increasing to `738 MB` for `5 million logs`, a `3%` rise.
- At `10 million logs`, memory usage reached `983 MB`, a `36%` increase.
- At `66 million logs`, it peaked at `1.9 GB`, a doubling from `10 million logs`.


Grammar issue: gigabyte
Suggestion: GB

<img src="/images/blog/clickhouse-benchmarking/read-op-cpu-usage.png" alt="clickhouse-benchmarking" style="border-radius: 10px; width: 650px; height: 300px;">
</p>

- `Two-Node Setup`: Node-1 utilized between `574 to 875 milli-cores` during query processing, handling most of the workload. Node-2 had lower CPU usage, ranging from `122 to 493 milli-cores`, indicating that load distribution wasn’t entirely balanced across nodes.


Grammar issue: wasn’t
Suggestion: was not

@vjdhama merged commit 6f35009 into main on Nov 20, 2024
6 checks passed