From d84377668996c5f2c285d60d4cf81658c275de4e Mon Sep 17 00:00:00 2001 From: Andres Taylor Date: Thu, 31 Oct 2024 18:26:36 +0100 Subject: [PATCH 1/2] wip - blog post about selecting sharding keys Signed-off-by: Andres Taylor --- .../2024-11-14-how-to-choose-sharding-keys.md | 193 ++++++++++++++ content/en/blog/keys-log.json | 251 ++++++++++++++++++ content/en/blog/querylog.sql | 115 ++++++++ 3 files changed, 559 insertions(+) create mode 100644 content/en/blog/2024-11-14-how-to-choose-sharding-keys.md create mode 100644 content/en/blog/keys-log.json create mode 100644 content/en/blog/querylog.sql diff --git a/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md b/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md new file mode 100644 index 000000000..6960ecde1 --- /dev/null +++ b/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md @@ -0,0 +1,193 @@ +--- +author: 'Andrés Taylor' +date: 2024-11-05 +slug: '2024-11-05-optimizing-query-planning-in-vitess-a-step-by-step-approach' +tags: ['Vitess', 'PlanetScale', 'MySQL', 'Query Serving', 'Vindex', 'plan', 'execution plan', 'explain', 'optimizer', "aggregate", "group by", "order by"] +title: 'Choosing sharding keys in Vitess: Tools and best practices' +description: "See how Vitess acts as a database proxy that creates an illusion of a single database when technically, the query is sent to multiple MySQL instances." +--- + +Vitess can split the content in tables across multiple MySQL instances. +This is called sharding the data. +Sharding is the process of splitting a large database into smaller, more manageable parts called shards. +Each shard is stored on a separate MySQL instance. +Vitess acts as a database proxy that creates an illusion of a single database when technically, the query is sent to multiple MySQL instances. + +When sharding a table, you need to choose a sharding key. +The sharding key is a column or set of columns that determine how the data is split across the shards. +The sharding key is used to route queries to the correct shard. +It's similar to a primary key, but it's used for sharding instead of indexing. +Choosing the right sharding keys is crucial for the performance of your application. + +When inspecting a query, if Vitess can see that a join is being performed on columns that are sharded with the same rules, it knows that the join between these tables can be pushed down to the shard and solved there. +This is the best case scenario, as it minimizes the amount of data that needs to be transferred between shards. + +Say we have two tables, `orders` and `customer`, and both are sharded by their primary keys (order_id and customer_id respectively). + +```sql +select * +from orders o + join customers c on o.customer_id = c.id +``` + +Since the join is not being done on the sharding key, Vitess will need to perform the join in the vtgate layer, which is the query router that sits between the application and the MySQL instances. +This is not ideal, as it means that all the data from both tables will need to be transferred to the vtgate layer, and the join will be performed there. + +If we were to shard the `orders` table by `customer_id` instead of `order_id`, the join could be pushed down to the shard, and the join would be performed there. +This would be much more efficient, as only the data that is needed for the join would need to be transferred between shards. + +## Analyzing How Queries Execute + +When you're choosing sharding keys, it's important to understand how your queries will execute. +Vitess provides a tool called `vexplain` that can help you analyze how your queries will be executed. +It's similar to mysql's `explain` command, showing the query plan that Vitess will use to execute the query. + +```sql +vexplain plan select * + from orders o + join customers c on o.customer_id = c.id +``` + +This will output the query plan that Vitess will use to execute the query. It's represented as a JSON tree: + +```json +{ + "OperatorType": "Join", + "Variant": "Join", + "JoinColumnIndexes": "L:0,L:1,L:2,L:3,L:4,R:0,R:1,R:2,R:3", + "JoinVars": { + "o_customer_id": 1 + }, + "TableName": "orders_customers", + "Inputs": [ + { + "OperatorType": "Route", + "Variant": "Scatter", + "Keyspace": { + "Name": "ks_derived", + "Sharded": true + }, + "FieldQuery": "select o.id, o.customer_id, o.`status`, o.total_amount, o.created_at from orders as o where 1 != 1", + "Query": "select o.id, o.customer_id, o.`status`, o.total_amount, o.created_at from orders as o", + "Table": "orders" + }, + { + "OperatorType": "Route", + "Variant": "EqualUnique", + "Keyspace": { + "Name": "ks_derived", + "Sharded": true + }, + "FieldQuery": "select c.id, c.email, c.`name`, c.created_at from customers as c where 1 != 1", + "Query": "select c.id, c.email, c.`name`, c.created_at from customers as c where c.id = :o_customer_id /* INT32 */", + "Table": "customers", + "Values": [ + ":o_customer_id" + ], + "Vindex": "user_index" + } + ] +} +``` + +This is showing the execution plan that Vitess will use for the query. We can see the join being the root of the query plan. +The join is a nested loop join, so for every row returned from the first input to the join, Vitess will issue a query to the second input to the join. +This is a so called "EqualUnique" Route, which means that Vitess knows that the second query only needs to be sent to a single shard. + +This is what Vitess will do because the sharding keys we've selected for these tables are the same as the primary keys. +For the second query, since we are querying by its sharding key, it's easy to figure out where to send the query. + +If we instead were to shard the `orders` table by `customer_id`, the query plan would look different. + +```json +{ + "OperatorType": "Route", + "Variant": "Scatter", + "Keyspace": { + "Name": "ks_derived", + "Sharded": true + }, + "FieldQuery": "select o.id, o.customer_id, o.`status`, o.total_amount, o.created_at, c.id, c.email, c.`name`, c.created_at from orders as o, customers as c where 1 != 1", + "Query": "select o.id, o.customer_id, o.`status`, o.total_amount, o.created_at, c.id, c.email, c.`name`, c.created_at from orders as o, customers as c where o.customer_id = c.id", + "Table": "customers, orders" +} +``` + +Here we can see that the join has been pushed down to the shard, and the join is being performed there. +Vtgate only has to concatenate the results from the shards. + +## Analyze Queries With VExplain Keys + +Vitess also provides a tool called `vexplain keys` that can help you analyze your queries so you can design your schema. + +```sql +vexplain keys select * + from orders o + join customers c on o.customer_id = c.id +``` + +This will output columns used by the query that might be interesting to test as sharding keys. + +```json +{ + "statementType": "SELECT", + "joinColumns": [ + "customers.id =", + "orders.customer_id =" + ], + "selectColumns": [ + "customers.`name`", + "customers.created_at", + "customers.email", + "customers.id", + "orders.`status`", + "orders.created_at", + "orders.customer_id", + "orders.id", + "orders.total_amount" + ] +} +``` + +This output shows the columns that are being used in the query. The join columns are the columns that are being used to join the two tables. The equality sign after the column name indicates which comparison is being performed against the column. +The tool will also show columns used for filtering or grouping, which can be useful when choosing sharding keys. For a more complex query, the output can look like this: + +```json +{ + "statementType": "SELECT", + "groupingColumns": [ + "customers.`name`", + "customers.id" + ], + "joinColumns": [ + "customers.id =", + "order_items.order_id =", + "orders.customer_id =", + "orders.id =" + ], + "filterColumns": [ + "orders.`status` =" + ], + "selectColumns": [ + "customers.`name`", + "customers.id", + "order_items.quantity", + "order_items.unit_price" + ] +} +``` + +This tool is very useful when you're designing your schema and trying to figure out which columns to use as sharding keys. + +## Analyzing Query Patterns at Scale + +So far we've looked at individual queries, but real applications have hundreds or thousands of queries. +Let's look at a more realistic example. We've collected a [sample query log](link-to-query-log.sql) from our e-commerce application running in production. + +Using the `vt` command line tool, we can analyze all these queries at once: + +```bash +vt keys query-log.sql > keys-log.json +vt benchstat keys-log.json +``` + diff --git a/content/en/blog/keys-log.json b/content/en/blog/keys-log.json new file mode 100644 index 000000000..756aee9e5 --- /dev/null +++ b/content/en/blog/keys-log.json @@ -0,0 +1,251 @@ +[ + { + "queryStructure": "SELECT `c`.`id` AS `customer_id`, `c`.`name` AS `customer_name`, sum(`oi`.`quantity` * `oi`.`unit_price`) AS `total_spent` FROM `customers` AS `c` JOIN `orders` AS `o` ON `c`.`id` = `o`.`customer_id` JOIN `order_items` AS `oi` ON `o`.`id` = `oi`.`order_id` WHERE `o`.`status` = :_o_status /* VARCHAR */ GROUP BY `c`.`id`, `c`.`name` ORDER BY sum(`oi`.`quantity` * `oi`.`unit_price`) DESC", + "usageCount": 1, + "lineNumbers": [ + 1 + ], + "tableName": [ + "customers", + "orders", + "order_items" + ], + "groupingColumns": [ + "customers.id", + "customers.name" + ], + "joinColumns": [ + "customers.id", + "order_items.order_id", + "orders.customer_id", + "orders.id" + ], + "filterColumns": [ + "orders.status" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `p`.`category`, avg(`oi`.`quantity` * `oi`.`unit_price`) AS `avg_order_amount` FROM `products` AS `p` JOIN `order_items` AS `oi` ON `p`.`id` = `oi`.`product_id` JOIN `orders` AS `o` ON `oi`.`order_id` = `o`.`id` WHERE `o`.`status` IN ::1 GROUP BY `p`.`category` ORDER BY avg(`oi`.`quantity` * `oi`.`unit_price`) DESC", + "usageCount": 1, + "lineNumbers": [ + 13 + ], + "tableName": [ + "products", + "order_items", + "orders" + ], + "groupingColumns": [ + "products.category" + ], + "joinColumns": [ + "order_items.order_id", + "order_items.product_id", + "orders.id", + "products.id" + ], + "filterColumns": [ + "orders.status" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `c`.`id` AS `customer_id`, `c`.`name` AS `customer_name`, COUNT(`o`.`id`) AS `total_orders` FROM `customers` AS `c` JOIN `orders` AS `o` ON `c`.`id` = `o`.`customer_id` GROUP BY `c`.`id`, `c`.`name` HAVING COUNT(`o`.`id`) \u003e :1 /* INT64 */ ORDER BY COUNT(`o`.`id`) DESC", + "usageCount": 1, + "lineNumbers": [ + 24 + ], + "tableName": [ + "customers", + "orders" + ], + "groupingColumns": [ + "customers.id", + "customers.name" + ], + "joinColumns": [ + "customers.id", + "orders.customer_id" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `p`.`id` AS `product_id`, `p`.`name` AS `product_name`, sum(`oi`.`quantity` * `oi`.`unit_price`) AS `total_revenue` FROM `products` AS `p` JOIN `order_items` AS `oi` ON `p`.`id` = `oi`.`product_id` GROUP BY `p`.`id`, `p`.`name` ORDER BY sum(`oi`.`quantity` * `oi`.`unit_price`) DESC LIMIT :1 /* INT64 */", + "usageCount": 1, + "lineNumbers": [ + 34 + ], + "tableName": [ + "products", + "order_items" + ], + "groupingColumns": [ + "products.id", + "products.name" + ], + "joinColumns": [ + "order_items.product_id", + "products.id" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `customer_id`, `customer_name`, `total_spent` FROM (SELECT `c`.`id` AS `customer_id`, `c`.`name` AS `customer_name`, sum(`oi`.`quantity` * `oi`.`unit_price`) AS `total_spent` FROM `customers` AS `c` JOIN `orders` AS `o` ON `c`.`id` = `o`.`customer_id` JOIN `order_items` AS `oi` ON `o`.`id` = `oi`.`order_id` WHERE `o`.`status` = :_o_status /* VARCHAR */ GROUP BY `c`.`id`, `c`.`name`) AS `spending_summary` ORDER BY `spending_summary`.`total_spent` DESC LIMIT :1 /* INT64 */", + "usageCount": 1, + "lineNumbers": [ + 43 + ], + "tableName": [ + "customers", + "orders", + "order_items" + ], + "groupingColumns": [ + "customers.id", + "customers.name" + ], + "joinColumns": [ + "customers.id", + "order_items.order_id", + "orders.customer_id", + "orders.id" + ], + "filterColumns": [ + "orders.status" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `p`.`category`, COALESCE(sum(`oi`.`quantity`), :1 /* INT64 */) AS `total_quantity_sold` FROM `products` AS `p` LEFT JOIN `order_items` AS `oi` ON `p`.`id` = `oi`.`product_id` GROUP BY `p`.`category` ORDER BY COALESCE(sum(`oi`.`quantity`), :1 /* INT64 */) DESC", + "usageCount": 1, + "lineNumbers": [ + 58 + ], + "tableName": [ + "products", + "order_items" + ], + "groupingColumns": [ + "products.category" + ], + "joinColumns": [ + "order_items.product_id", + "products.id" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `o`.`id` AS `order_id`, COUNT(DISTINCT `oi`.`product_id`) AS `distinct_product_count` FROM `orders` AS `o` JOIN `order_items` AS `oi` ON `o`.`id` = `oi`.`order_id` GROUP BY `o`.`id` ORDER BY COUNT(DISTINCT `oi`.`product_id`) DESC LIMIT :1 /* INT64 */", + "usageCount": 1, + "lineNumbers": [ + 66 + ], + "tableName": [ + "orders", + "order_items" + ], + "groupingColumns": [ + "orders.id" + ], + "joinColumns": [ + "order_items.order_id", + "orders.id" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT DISTINCT `c`.`id` AS `customer_id`, `c`.`name` AS `customer_name` FROM `customers` AS `c` JOIN `orders` AS `o` ON `c`.`id` = `o`.`customer_id` JOIN `order_items` AS `oi` ON `o`.`id` = `oi`.`order_id` JOIN `products` AS `p` ON `oi`.`product_id` = `p`.`id` WHERE `p`.`price` \u003e (SELECT avg(`price`) FROM `products`) ORDER BY `c`.`name` ASC", + "usageCount": 1, + "lineNumbers": [ + 74 + ], + "tableName": [ + "customers", + "orders", + "order_items", + "products", + "products" + ], + "joinColumns": [ + "customers.id", + "order_items.order_id", + "order_items.product_id", + "orders.customer_id", + "orders.id", + "products.id" + ], + "filterColumns": [ + "products.price" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `c`.`id` AS `customer_id`, `c`.`name` AS `customer_name`, COALESCE(sum(`oi`.`quantity` * `oi`.`unit_price`), :1 /* INT64 */) AS `total_revenue` FROM `customers` AS `c` LEFT JOIN `orders` AS `o` ON `c`.`id` = `o`.`customer_id` AND `o`.`status` = :_o_status /* VARCHAR */ LEFT JOIN `order_items` AS `oi` ON `o`.`id` = `oi`.`order_id` GROUP BY `c`.`id`, `c`.`name` ORDER BY COALESCE(sum(`oi`.`quantity` * `oi`.`unit_price`), :1 /* INT64 */) DESC", + "usageCount": 1, + "lineNumbers": [ + 86 + ], + "tableName": [ + "customers", + "orders", + "order_items" + ], + "groupingColumns": [ + "customers.id", + "customers.name" + ], + "joinColumns": [ + "customers.id", + "order_items.order_id", + "orders.customer_id", + "orders.id" + ], + "filterColumns": [ + "orders.status" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `o`.`id` AS `order_id` FROM `orders` AS `o` JOIN `order_items` AS `oi` ON `o`.`id` = `oi`.`order_id` JOIN `products` AS `p` ON `oi`.`product_id` = `p`.`id` GROUP BY `o`.`id` HAVING COUNT(DISTINCT `p`.`category`) = :1 /* INT64 */ AND max(`p`.`category`) = :2 /* VARCHAR */", + "usageCount": 1, + "lineNumbers": [ + 97 + ], + "tableName": [ + "orders", + "order_items", + "products" + ], + "groupingColumns": [ + "orders.id" + ], + "joinColumns": [ + "order_items.order_id", + "order_items.product_id", + "orders.id", + "products.id" + ], + "statementType": "SELECT" + }, + { + "queryStructure": "SELECT `c`.`id` AS `customer_id`, `c`.`name` AS `customer_name`, sum(CASE WHEN `o`.`status` = :_o_status /* VARCHAR */ THEN :1 /* INT64 */ ELSE :2 /* INT64 */ END) AS `completed_orders`, sum(CASE WHEN `o`.`status` = :_o_status1 /* VARCHAR */ THEN :1 /* INT64 */ ELSE :2 /* INT64 */ END) AS `pending_orders` FROM `customers` AS `c` LEFT JOIN `orders` AS `o` ON `c`.`id` = `o`.`customer_id` GROUP BY `c`.`id`, `c`.`name` ORDER BY sum(CASE WHEN `o`.`status` = :_o_status /* VARCHAR */ THEN :1 /* INT64 */ ELSE :2 /* INT64 */ END) DESC, sum(CASE WHEN `o`.`status` = :_o_status1 /* VARCHAR */ THEN :1 /* INT64 */ ELSE :2 /* INT64 */ END) DESC", + "usageCount": 1, + "lineNumbers": [ + 107 + ], + "tableName": [ + "customers", + "orders" + ], + "groupingColumns": [ + "customers.id", + "customers.name" + ], + "joinColumns": [ + "customers.id", + "orders.customer_id" + ], + "statementType": "SELECT" + } +] \ No newline at end of file diff --git a/content/en/blog/querylog.sql b/content/en/blog/querylog.sql new file mode 100644 index 000000000..465ceb773 --- /dev/null +++ b/content/en/blog/querylog.sql @@ -0,0 +1,115 @@ +SELECT c.id AS customer_id, + c.name AS customer_name, + SUM(oi.quantity * oi.unit_price) AS total_spent +FROM customers c + JOIN + orders o ON c.id = o.customer_id + JOIN + order_items oi ON o.id = oi.order_id +WHERE o.status = 'Completed' +GROUP BY c.id, c.name +ORDER BY total_spent DESC; + +SELECT p.category, + AVG(oi.quantity * oi.unit_price) AS avg_order_amount +FROM products p + JOIN + order_items oi ON p.id = oi.product_id + JOIN + orders o ON oi.order_id = o.id +WHERE o.status IN ('Completed', 'Shipped') +GROUP BY p.category +ORDER BY avg_order_amount DESC; + +SELECT c.id AS customer_id, + c.name AS customer_name, + COUNT(o.id) AS total_orders +FROM customers c + JOIN + orders o ON c.id = o.customer_id +GROUP BY c.id, c.name +HAVING COUNT(o.id) > 1 +ORDER BY total_orders DESC; + +SELECT p.id AS product_id, + p.name AS product_name, + SUM(oi.quantity * oi.unit_price) AS total_revenue +FROM products p + JOIN + order_items oi ON p.id = oi.product_id +GROUP BY p.id, p.name +ORDER BY total_revenue DESC LIMIT 1; + +SELECT customer_id, + customer_name, + total_spent +FROM (SELECT c.id AS customer_id, + c.name AS customer_name, + SUM(oi.quantity * oi.unit_price) AS total_spent + FROM customers c + JOIN + orders o ON c.id = o.customer_id + JOIN + order_items oi ON o.id = oi.order_id + WHERE o.status = 'Completed' + GROUP BY c.id, c.name) AS spending_summary +ORDER BY total_spent DESC LIMIT 3; + +SELECT p.category, + COALESCE(SUM(oi.quantity), 0) AS total_quantity_sold +FROM products p + LEFT JOIN + order_items oi ON p.id = oi.product_id +GROUP BY p.category +ORDER BY total_quantity_sold DESC; + +SELECT o.id AS order_id, + COUNT(DISTINCT oi.product_id) AS distinct_product_count +FROM orders o + JOIN + order_items oi ON o.id = oi.order_id +GROUP BY o.id +ORDER BY distinct_product_count DESC LIMIT 1; + +SELECT DISTINCT c.id AS customer_id, + c.name AS customer_name +FROM customers c + JOIN + orders o ON c.id = o.customer_id + JOIN + order_items oi ON o.id = oi.order_id + JOIN + products p ON oi.product_id = p.id +WHERE p.price > (SELECT AVG(price) FROM products) +ORDER BY c.name; + +SELECT c.id AS customer_id, + c.name AS customer_name, + COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_revenue +FROM customers c + LEFT JOIN + orders o ON c.id = o.customer_id AND o.status = 'Completed' + LEFT JOIN + order_items oi ON o.id = oi.order_id +GROUP BY c.id, c.name +ORDER BY total_revenue DESC; + +SELECT o.id AS order_id +FROM orders o + JOIN + order_items oi ON o.id = oi.order_id + JOIN + products p ON oi.product_id = p.id +GROUP BY o.id +HAVING COUNT(DISTINCT p.category) = 1 + AND MAX(p.category) = 'Gadgets'; + +SELECT c.id AS customer_id, + c.name AS customer_name, + SUM(CASE WHEN o.status = 'Completed' THEN 1 ELSE 0 END) AS completed_orders, + SUM(CASE WHEN o.status = 'Pending' THEN 1 ELSE 0 END) AS pending_orders +FROM customers c + LEFT JOIN + orders o ON c.id = o.customer_id +GROUP BY c.id, c.name +ORDER BY completed_orders DESC, pending_orders DESC; From 735802078be9ce69bbcdce3b0cfa2a05de3d4307 Mon Sep 17 00:00:00 2001 From: Andres Taylor Date: Thu, 31 Oct 2024 18:36:08 +0100 Subject: [PATCH 2/2] add intro and tweak header Signed-off-by: Andres Taylor --- .../2024-11-14-how-to-choose-sharding-keys.md | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md b/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md index 6960ecde1..fa566da82 100644 --- a/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md +++ b/content/en/blog/2024-11-14-how-to-choose-sharding-keys.md @@ -1,12 +1,20 @@ --- author: 'Andrés Taylor' date: 2024-11-05 -slug: '2024-11-05-optimizing-query-planning-in-vitess-a-step-by-step-approach' -tags: ['Vitess', 'PlanetScale', 'MySQL', 'Query Serving', 'Vindex', 'plan', 'execution plan', 'explain', 'optimizer', "aggregate", "group by", "order by"] -title: 'Choosing sharding keys in Vitess: Tools and best practices' -description: "See how Vitess acts as a database proxy that creates an illusion of a single database when technically, the query is sent to multiple MySQL instances." +slug: 'optimizing-sharding-strategies-vitess' +tags: ['Vitess', 'Sharding', 'MySQL', 'Query Optimization', 'Database Scaling', 'Vindex', 'VExplain', 'Performance Analysis', 'SQL Planning'] +title: 'Mastering Sharding in Vitess: Tools, Strategies, and Best Practices' +description: "Explore how to optimize sharding strategies in Vitess for scalable query performance, leveraging tools like `vexplain` and `vt` for deep analysis and schema design." --- +## Introduction to Sharding in Vitess + +Sharding data effectively is a cornerstone of database scalability, especially when using a powerful orchestration layer like Vitess. +By understanding how to analyze and refine your sharding strategy, you can minimize data transfer, optimize query plans, and enhance application performance. +This guide dives into practical methodologies and introduces key tools in Vitess, including `vexplain` and the `vt` CLI, to help you design efficient sharding schemes and analyze query behavior at scale. + +## The Importance of Choosing the Right Sharding Key + Vitess can split the content in tables across multiple MySQL instances. This is called sharding the data. Sharding is the process of splitting a large database into smaller, more manageable parts called shards.