diff --git a/docs/en/introduction/Architecture.md b/docs/en/introduction/Architecture.md
index 2dd28b5..f43627c 100644
--- a/docs/en/introduction/Architecture.md
+++ b/docs/en/introduction/Architecture.md
@@ -5,7 +5,7 @@ import QSOverview from '../_assets/commonMarkdown/quickstart-overview-tip.mdx'
# Architecture
-StarRocks has a simple architecture. The entire system consists of only two types of components; frontends and backends. The frontend nodes are called **FE**s. There are two types of backend nodes, **BE**s, and **CN**s (Compute Nodes). BEs are deployed when local storage for data is used, and CNs are deployed when data is stored on object storage or HDFS. StarRocks does not rely on any external components, simplifying deployment and maintenance. Nodes can be horizontally scaled without service downtime. In addition, StarRocks has a replica mechanism for metadata and service data, which increases data reliability and efficiently prevents single points of failure (SPOFs).
+StarRocks has a robust architecture. The entire system consists of only two types of components: frontends and backends. The frontend nodes are called **FE**s. There are two types of backend nodes: **BE**s and **CN**s (Compute Nodes). BEs are deployed when local storage is used for data, and CNs are deployed when data is stored on object storage or HDFS. StarRocks does not rely on any external components, which simplifies deployment and maintenance. Nodes can be horizontally scaled without service downtime. In addition, StarRocks has a replica mechanism for metadata and service data, which increases data reliability and efficiently prevents single points of failure (SPOFs).
StarRocks is compatible with MySQL protocols and supports standard SQL. Users can easily connect to StarRocks from MySQL clients to gain instant and valuable insights.
diff --git a/docs/en/loading/Etl_in_loading.md b/docs/en/loading/Etl_in_loading.md
deleted file mode 100644
index df52141..0000000
--- a/docs/en/loading/Etl_in_loading.md
+++ /dev/null
@@ -1,452 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Transform data at loading
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-StarRocks supports data transformation at loading.
-
-This feature supports [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), and [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) but does not support [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md).
-
-
-
-This topic uses CSV data as an example to describe how to extract and transform data at loading. The data file formats that are supported vary depending on the loading method of your choice.
-
-> **NOTE**
->
-> For CSV data, you can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes, as a text delimiter.
-
-## Scenarios
-
-When you load a data file into a StarRocks table, the data in the file may not map completely onto the columns of the StarRocks table. In this situation, you do not need to extract or transform the data before you load it into the StarRocks table. StarRocks can extract and transform the data during loading:
-
-- Skip columns that do not need to be loaded.
-
- You can skip the columns that do not need to be loaded. Additionally, if the columns of the data file are in a different order than the columns of the StarRocks table, you can create a column mapping between the data file and the StarRocks table.
-
-- Filter out rows you do not want to load.
-
- You can specify filter conditions based on which StarRocks filters out the rows that you do not want to load.
-
-- Generate new columns from original columns.
-
- Generated columns are special columns that are computed from the original columns of the data file. You can map the generated columns onto the columns of the StarRocks table.
-
-- Extract partition field values from a file path.
-
- If the data file is generated from Apache Hive™, you can extract partition field values from the file path.
-
-## Data examples
-
-1. Create data files in your local file system.
-
- a. Create a data file named `file1.csv`. The file consists of four columns, which represent user ID, user gender, event date, and event type in sequence.
-
- ```Plain
- 354,female,2020-05-20,1
- 465,male,2020-05-21,2
- 576,female,2020-05-22,1
- 687,male,2020-05-23,2
- ```
-
- b. Create a data file named `file2.csv`. The file consists of only one column, which represents date.
-
- ```Plain
- 2020-05-20
- 2020-05-21
- 2020-05-22
- 2020-05-23
- ```
-
-2. Create tables in your StarRocks database `test_db`.
-
- > **NOTE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
- a. Create a table named `table1`, which consists of three columns: `event_date`, `event_type`, and `user_id`.
-
- ```SQL
- MySQL [test_db]> CREATE TABLE table1
- (
- `event_date` DATE COMMENT "event date",
- `event_type` TINYINT COMMENT "event type",
- `user_id` BIGINT COMMENT "user ID"
- )
- DISTRIBUTED BY HASH(user_id);
- ```
-
- b. Create a table named `table2`, which consists of four columns: `date`, `year`, `month`, and `day`.
-
- ```SQL
- MySQL [test_db]> CREATE TABLE table2
- (
- `date` DATE COMMENT "date",
- `year` INT COMMENT "year",
- `month` TINYINT COMMENT "month",
- `day` TINYINT COMMENT "day"
- )
- DISTRIBUTED BY HASH(date);
- ```
-
-3. Upload `file1.csv` and `file2.csv` to the `/user/starrocks/data/input/` path of your HDFS cluster, publish the data of `file1.csv` to `topic1` of your Kafka cluster, and publish the data of `file2.csv` to `topic2` of your Kafka cluster.
-
-## Skip columns that do not need to be loaded
-
-The data file that you want to load into a StarRocks table may contain some columns that cannot be mapped to any columns of the StarRocks table. In this situation, StarRocks supports loading only the columns that can be mapped from the data file onto the columns of the StarRocks table.
-
-This feature supports loading data from the following data sources:
-
-- Local file system
-
-- HDFS and cloud storage
-
- > **NOTE**
- >
- > This section uses HDFS as an example.
-
-- Kafka
-
-In most cases, the columns of a CSV file are not named. For some CSV files, the first row is composed of column names, but StarRocks processes the content of the first row as common data rather than column names. Therefore, when you load a CSV file, you must temporarily name the columns of the CSV file **in sequence** in the job creation statement or command. These temporarily named columns are mapped **by name** onto the columns of the StarRocks table. Take note of the following points about the columns of the data file:
-
-- The data of the columns that can be mapped onto the columns of the StarRocks table and are temporarily named by using the names of those columns is loaded directly.
-
-- The columns that cannot be mapped onto the columns of the StarRocks table are ignored, and the data of these columns is not loaded.
-
-- If some columns can be mapped onto the columns of the StarRocks table but are not temporarily named in the job creation statement or command, the load job reports errors.
-
-This section uses `file1.csv` and `table1` as an example. The four columns of `file1.csv` are temporarily named as `user_id`, `user_gender`, `event_date`, and `event_type` in sequence. Among the temporarily named columns of `file1.csv`, `user_id`, `event_date`, and `event_type` can be mapped onto specific columns of `table1`, whereas `user_gender` cannot be mapped onto any column of `table1`. Therefore, `user_id`, `event_date`, and `event_type` are loaded into `table1`, but `user_gender` is not.
-
-### Load data
-
-#### Load data from a local file system
-
-If `file1.csv` is stored in your local file system, run the following command to create a [Stream Load](../loading/StreamLoad.md) job:
-
-```Bash
-curl --location-trusted -u <username>:<password> \
-    -H "Expect:100-continue" \
-    -H "column_separator:," \
-    -H "columns: user_id, user_gender, event_date, event_type" \
-    -T file1.csv -XPUT \
-    http://<fe_host>:<fe_http_port>/api/test_db/table1/_stream_load
-```
-
-> **NOTE**
->
-> If you choose Stream Load, you must use the `columns` parameter to temporarily name the columns of the data file to create a column mapping between the data file and the StarRocks table.
-
-For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-#### Load data from an HDFS cluster
-
-If `file1.csv` is stored in your HDFS cluster, execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job:
-
-```SQL
-LOAD LABEL test_db.label1
-(
-    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/data/input/file1.csv")
- INTO TABLE `table1`
- FORMAT AS "csv"
- COLUMNS TERMINATED BY ","
- (user_id, user_gender, event_date, event_type)
-)
-WITH BROKER;
-```
-
-> **NOTE**
->
-> If you choose Broker Load, you must use the `column_list` parameter to temporarily name the columns of the data file to create a column mapping between the data file and the StarRocks table.
-
-For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
-
-#### Load data from a Kafka cluster
-
-If the data of `file1.csv` is published to `topic1` of your Kafka cluster, execute the following statement to create a [Routine Load](../loading/RoutineLoad.md) job:
-
-```SQL
-CREATE ROUTINE LOAD test_db.table101 ON table1
- COLUMNS TERMINATED BY ",",
- COLUMNS(user_id, user_gender, event_date, event_type)
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker_host>:<broker_port>",
- "kafka_topic" = "topic1",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```
-
-> **NOTE**
->
-> If you choose Routine Load, you must use the `COLUMNS` parameter to temporarily name the columns of the data file to create a column mapping between the data file and the StarRocks table.
-
-For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-### Query data
-
-After the load of data from your local file system, HDFS cluster, or Kafka cluster is complete, query the data of `table1` to verify that the load is successful:
-
-```SQL
-MySQL [test_db]> SELECT * FROM table1;
-+------------+------------+---------+
-| event_date | event_type | user_id |
-+------------+------------+---------+
-| 2020-05-22 | 1 | 576 |
-| 2020-05-20 | 1 | 354 |
-| 2020-05-21 | 2 | 465 |
-| 2020-05-23 | 2 | 687 |
-+------------+------------+---------+
-4 rows in set (0.01 sec)
-```
-
-## Filter out rows that you do not want to load
-
-When you load a data file into a StarRocks table, you may not want to load specific rows of the data file. In this situation, you can use the WHERE clause to specify the rows that you want to load. StarRocks filters out all rows that do not meet the filter conditions specified in the WHERE clause.
-
-This feature supports loading data from the following data sources:
-
-- Local file system
-
-- HDFS and cloud storage
- > **NOTE**
- >
- > This section uses HDFS as an example.
-
-- Kafka
-
-This section uses `file1.csv` and `table1` as an example. If you want to load only the rows whose event type is `1` from `file1.csv` into `table1`, you can use the WHERE clause to specify a filter condition `event_type = 1`.
-
-### Load data
-
-#### Load data from a local file system
-
-If `file1.csv` is stored in your local file system, run the following command to create a [Stream Load](../loading/StreamLoad.md) job:
-
-```Bash
-curl --location-trusted -u <username>:<password> \
-    -H "Expect:100-continue" \
-    -H "column_separator:," \
-    -H "columns: user_id, user_gender, event_date, event_type" \
-    -H "where: event_type=1" \
-    -T file1.csv -XPUT \
-    http://<fe_host>:<fe_http_port>/api/test_db/table1/_stream_load
-```
-
-For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-#### Load data from an HDFS cluster
-
-If `file1.csv` is stored in your HDFS cluster, execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job:
-
-```SQL
-LOAD LABEL test_db.label2
-(
-    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/data/input/file1.csv")
- INTO TABLE `table1`
- FORMAT AS "csv"
- COLUMNS TERMINATED BY ","
- (user_id, user_gender, event_date, event_type)
- WHERE event_type = 1
-)
-WITH BROKER;
-```
-
-For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
-
-#### Load data from a Kafka cluster
-
-If the data of `file1.csv` is published to `topic1` of your Kafka cluster, execute the following statement to create a [Routine Load](../loading/RoutineLoad.md) job:
-
-```SQL
-CREATE ROUTINE LOAD test_db.table102 ON table1
-COLUMNS TERMINATED BY ",",
-COLUMNS (user_id, user_gender, event_date, event_type),
-WHERE event_type = 1
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker_host>:<broker_port>",
- "kafka_topic" = "topic1",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```
-
-For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-### Query data
-
-After the load of data from your local file system, HDFS cluster, or Kafka cluster is complete, query the data of `table1` to verify that the load is successful:
-
-```SQL
-MySQL [test_db]> SELECT * FROM table1;
-+------------+------------+---------+
-| event_date | event_type | user_id |
-+------------+------------+---------+
-| 2020-05-20 | 1 | 354 |
-| 2020-05-22 | 1 | 576 |
-+------------+------------+---------+
-2 rows in set (0.01 sec)
-```
-
-## Generate new columns from original columns
-
-When you load a data file into a StarRocks table, some data of the data file may require conversions before the data can be loaded into the StarRocks table. In this situation, you can use functions or expressions in the job creation command or statement to implement data conversions.
-
-This feature supports loading data from the following data sources:
-
-- Local file system
-
-- HDFS and cloud storage
- > **NOTE**
- >
- > This section uses HDFS as an example.
-
-- Kafka
-
-This section uses `file2.csv` and `table2` as an example. `file2.csv` consists of only one column that represents date. You can use the [year](../sql-reference/sql-functions/date-time-functions/year.md), [month](../sql-reference/sql-functions/date-time-functions/month.md), and [day](../sql-reference/sql-functions/date-time-functions/day.md) functions to extract the year, month, and day in each date from `file2.csv` and load the extracted data into the `year`, `month`, and `day` columns of `table2`.
-
-### Load data
-
-#### Load data from a local file system
-
-If `file2.csv` is stored in your local file system, run the following command to create a [Stream Load](../loading/StreamLoad.md) job:
-
-```Bash
-curl --location-trusted -u <username>:<password> \
-    -H "Expect:100-continue" \
-    -H "column_separator:," \
-    -H "columns:date,year=year(date),month=month(date),day=day(date)" \
-    -T file2.csv -XPUT \
-    http://<fe_host>:<fe_http_port>/api/test_db/table2/_stream_load
-```
-
-> **NOTE**
->
-> - In the `columns` parameter, you must first temporarily name **all columns** of the data file, and then temporarily name the new columns that you want to generate from the original columns of the data file. As shown in the preceding example, the only column of `file2.csv` is temporarily named as `date`, and then the `year=year(date)`, `month=month(date)`, and `day=day(date)` functions are invoked to generate three new columns, which are temporarily named as `year`, `month`, and `day`.
->
-> - Stream Load does not support `column_name = function(column_name)`; that is, a generated column cannot directly reuse the name of its source column. It supports `column_name = function(tmp_column_name)`, where `tmp_column_name` is the temporary name assigned to the source column of the data file.
-
-For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-#### Load data from an HDFS cluster
-
-If `file2.csv` is stored in your HDFS cluster, execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job:
-
-```SQL
-LOAD LABEL test_db.label3
-(
-    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/data/input/file2.csv")
- INTO TABLE `table2`
- FORMAT AS "csv"
- COLUMNS TERMINATED BY ","
- (date)
- SET(year=year(date), month=month(date), day=day(date))
-)
-WITH BROKER;
-```
-
-> **NOTE**
->
-> You must first use the `column_list` parameter to temporarily name **all columns** of the data file, and then use the SET clause to temporarily name the new columns that you want to generate from the original columns of the data file. As shown in the preceding example, the only column of `file2.csv` is temporarily named as `date` in the `column_list` parameter, and then the `year=year(date)`, `month=month(date)`, and `day=day(date)` functions are invoked in the SET clause to generate three new columns, which are temporarily named as `year`, `month`, and `day`.
-
-For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
-
-#### Load data from a Kafka cluster
-
-If the data of `file2.csv` is published to `topic2` of your Kafka cluster, execute the following statement to create a [Routine Load](../loading/RoutineLoad.md) job:
-
-```SQL
-CREATE ROUTINE LOAD test_db.table201 ON table2
- COLUMNS TERMINATED BY ",",
- COLUMNS(date,year=year(date),month=month(date),day=day(date))
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker_host>:<broker_port>",
- "kafka_topic" = "topic2",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```
-
-> **NOTE**
->
-> In the `COLUMNS` parameter, you must first temporarily name **all columns** of the data file, and then temporarily name the new columns that you want to generate from the original columns of the data file. As shown in the preceding example, the only column of `file2.csv` is temporarily named as `date`, and then the `year=year(date)`, `month=month(date)`, and `day=day(date)` functions are invoked to generate three new columns, which are temporarily named as `year`, `month`, and `day`.
-
-For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-### Query data
-
-After the load of data from your local file system, HDFS cluster, or Kafka cluster is complete, query the data of `table2` to verify that the load is successful:
-
-```SQL
-MySQL [test_db]> SELECT * FROM table2;
-+------------+------+-------+------+
-| date | year | month | day |
-+------------+------+-------+------+
-| 2020-05-20 | 2020 | 5 | 20 |
-| 2020-05-21 | 2020 | 5 | 21 |
-| 2020-05-22 | 2020 | 5 | 22 |
-| 2020-05-23 | 2020 | 5 | 23 |
-+------------+------+-------+------+
-4 rows in set (0.01 sec)
-```
-
-## Extract partition field values from a file path
-
-If the file path that you specify contains partition fields, you can use the `COLUMNS FROM PATH AS` parameter to specify the partition fields you want to extract from the file paths. The partition fields in file paths are equivalent to the columns in data files. The `COLUMNS FROM PATH AS` parameter is supported only when you load data from an HDFS cluster.
-
-For example, you want to load the following four data files generated from Hive:
-
-```Plain
-/user/starrocks/data/input/date=2020-05-20/data
-1,354
-/user/starrocks/data/input/date=2020-05-21/data
-2,465
-/user/starrocks/data/input/date=2020-05-22/data
-1,576
-/user/starrocks/data/input/date=2020-05-23/data
-2,687
-```
-
-The four data files are stored in the `/user/starrocks/data/input/` path of your HDFS cluster. Each of these data files is partitioned by partition field `date` and consists of two columns, which represent event type and user ID in sequence.
-
-### Load data from an HDFS cluster
-
-Execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job, which enables you to extract the `date` partition field values from the `/user/starrocks/data/input/` file path and use a wildcard (*) to specify that you want to load all data files in the file path to `table1`:
-
-```SQL
-LOAD LABEL test_db.label4
-(
-    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/data/input/date=*/*")
- INTO TABLE `table1`
- FORMAT AS "csv"
- COLUMNS TERMINATED BY ","
- (event_type, user_id)
- COLUMNS FROM PATH AS (date)
- SET(event_date = date)
-)
-WITH BROKER;
-```
-
-> **NOTE**
->
-> In the preceding example, the `date` partition field in the specified file path is equivalent to the `event_date` column of `table1`. Therefore, you need to use the SET clause to map the `date` partition field onto the `event_date` column. If the partition field in the specified file path has the same name as a column of the StarRocks table, you do not need to use the SET clause to create a mapping.
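-
-For contrast, the following is a minimal sketch of the case described in the note where no SET clause is needed. It assumes a hypothetical table `table3` whose columns are `event_type`, `user_id`, and `date`, so that the partition field name matches a table column:
-
-```SQL
-LOAD LABEL test_db.label5
-(
-    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/starrocks/data/input/date=*/*")
-    -- `table3` is hypothetical and has the columns (event_type, user_id, date).
-    INTO TABLE `table3`
-    FORMAT AS "csv"
-    COLUMNS TERMINATED BY ","
-    (event_type, user_id)
-    -- The partition field `date` has the same name as a column of `table3`,
-    -- so no SET clause is required to create the mapping.
-    COLUMNS FROM PATH AS (date)
-)
-WITH BROKER;
-```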
-
-For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
-
-### Query data
-
-After the load of data from your HDFS cluster is complete, query the data of `table1` to verify that the load is successful:
-
-```SQL
-MySQL [test_db]> SELECT * FROM table1;
-+------------+------------+---------+
-| event_date | event_type | user_id |
-+------------+------------+---------+
-| 2020-05-22 | 1 | 576 |
-| 2020-05-20 | 1 | 354 |
-| 2020-05-21 | 2 | 465 |
-| 2020-05-23 | 2 | 687 |
-+------------+------------+---------+
-4 rows in set (0.01 sec)
-```
diff --git a/docs/en/loading/Flink-connector-starrocks.md b/docs/en/loading/Flink-connector-starrocks.md
deleted file mode 100644
index cbbfe5b..0000000
--- a/docs/en/loading/Flink-connector-starrocks.md
+++ /dev/null
@@ -1,901 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Continuously load data from Apache Flink®
-
-StarRocks provides a self-developed connector named StarRocks Connector for Apache Flink® (Flink connector for short) to help you load data into a StarRocks table by using Flink. The basic principle is to accumulate the data and then load it into StarRocks all at once through [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-The Flink connector supports DataStream API, Table API & SQL, and Python API. It delivers higher and more stable performance than [flink-connector-jdbc](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/jdbc/) provided by Apache Flink®.
-
-> **NOTICE**
->
-> Loading data into StarRocks tables with the Flink connector requires SELECT and INSERT privileges on the target StarRocks table. If you do not have these privileges, follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant these privileges to the user that you use to connect to your StarRocks cluster.
-
-## Version requirements
-
-| Connector | Flink | StarRocks | Java | Scala |
-|-----------|-------------------------------|---------------| ---- |-----------|
-| 1.2.11 | 1.15,1.16,1.17,1.18,1.19,1.20 | 2.1 and later | 8 | 2.11,2.12 |
-| 1.2.10 | 1.15,1.16,1.17,1.18,1.19 | 2.1 and later | 8 | 2.11,2.12 |
-| 1.2.9 | 1.15,1.16,1.17,1.18 | 2.1 and later | 8 | 2.11,2.12 |
-| 1.2.8 | 1.13,1.14,1.15,1.16,1.17 | 2.1 and later | 8 | 2.11,2.12 |
-| 1.2.7 | 1.11,1.12,1.13,1.14,1.15 | 2.1 and later | 8 | 2.11,2.12 |
-
-## Obtain Flink connector
-
-You can obtain the Flink connector JAR file in the following ways:
-
-- Directly download the compiled Flink connector JAR file.
-- Add the Flink connector as a dependency in your Maven project and then download the JAR file.
-- Compile the source code of the Flink connector into a JAR file by yourself.
-
-The naming format of the Flink connector JAR file is as follows:
-
-- Since Flink 1.15, it's `flink-connector-starrocks-${connector_version}_flink-${flink_version}.jar`. For example, if you install Flink 1.15 and you want to use Flink connector 1.2.7, you can use `flink-connector-starrocks-1.2.7_flink-1.15.jar`.
-
-- Prior to Flink 1.15, it's `flink-connector-starrocks-${connector_version}_flink-${flink_version}_${scala_version}.jar`. For example, if you install Flink 1.14 and Scala 2.12 in your environment, and you want to use Flink connector 1.2.7, you can use `flink-connector-starrocks-1.2.7_flink-1.14_2.12.jar`.
-
-> **NOTICE**
->
-> In general, the latest version of the Flink connector only maintains compatibility with the three most recent versions of Flink.
-
-### Download the compiled JAR file
-
-Directly download the corresponding version of the Flink connector JAR file from the [Maven Central Repository](https://repo1.maven.org/maven2/com/starrocks).
-
-### Maven Dependency
-
-In your Maven project's `pom.xml` file, add the Flink connector as a dependency according to the following format. Replace `flink_version`, `scala_version`, and `connector_version` with the respective versions.
-
-- In Flink 1.15 and later
-
- ```xml
-    <dependency>
-        <groupId>com.starrocks</groupId>
-        <artifactId>flink-connector-starrocks</artifactId>
-        <version>${connector_version}_flink-${flink_version}</version>
-    </dependency>
- ```
-
-- In versions earlier than Flink 1.15
-
- ```xml
-    <dependency>
-        <groupId>com.starrocks</groupId>
-        <artifactId>flink-connector-starrocks</artifactId>
-        <version>${connector_version}_flink-${flink_version}_${scala_version}</version>
-    </dependency>
- ```
-
-### Compile by yourself
-
-1. Download the [Flink connector source code](https://github.com/StarRocks/starrocks-connector-for-apache-flink).
-2. Execute the following command to compile the source code of the Flink connector into a JAR file. Replace `flink_version` with the corresponding Flink version.
-
- ```bash
-    sh build.sh <flink_version>
- ```
-
- For example, if the Flink version in your environment is 1.15, you need to execute the following command:
-
- ```bash
- sh build.sh 1.15
- ```
-
-3. Go to the `target/` directory to find the Flink connector JAR file, such as `flink-connector-starrocks-1.2.7_flink-1.15-SNAPSHOT.jar`, generated upon compilation.
-
-> **NOTE**
->
-> The name of a Flink connector that has not been formally released contains the `SNAPSHOT` suffix.
-
-## Options
-
-### connector
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The connector that you want to use. The value must be "starrocks".
-
-### jdbc-url
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The address that is used to connect to the MySQL server of the FE. You can specify multiple addresses, which must be separated by a comma (,). Format: `jdbc:mysql://<fe_host1>:<fe_query_port1>,<fe_host2>:<fe_query_port2>,<fe_host3>:<fe_query_port3>`.
-
-### load-url
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The address that is used to connect to the HTTP server of the FE. You can specify multiple addresses, which must be separated by a semicolon (;). Format: `<fe_host1>:<fe_http_port1>;<fe_host2>:<fe_http_port2>`.
-
-### database-name
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The name of the StarRocks database into which you want to load data.
-
-### table-name
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The name of the StarRocks table into which you want to load data.
-
-### username
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The username of the account that you want to use to load data into StarRocks. The account needs [SELECT and INSERT privileges](../sql-reference/sql-statements/account-management/GRANT.md) on the target StarRocks table.
-
-### password
-
-**Required**: Yes
-**Default value**: NONE
-**Description**: The password of the preceding account.
-
-### sink.version
-
-**Required**: No
-**Default value**: AUTO
-**Description**: The interface used to load data. This parameter is supported from Flink connector version 1.2.4 onwards.
-`V1`: Uses the [Stream Load](../loading/StreamLoad.md) interface to load data. Connectors earlier than 1.2.4 support only this mode.
-`V2`: Uses the [Stream Load transaction](./Stream_Load_transaction_interface.md) interface to load data. It requires StarRocks 2.4 or later. `V2` is recommended because it optimizes memory usage and provides a more stable exactly-once implementation.
-`AUTO`: The connector automatically chooses `V2` if the StarRocks version supports transaction Stream Load, and `V1` otherwise.
-
-### sink.label-prefix
-
-**Required**: No
-**Default value**: NONE
-**Description**: The label prefix used by Stream Load. We recommend configuring it if you use exactly-once semantics with connector 1.2.8 and later. See [exactly-once usage notes](#exactly-once).
-
-### sink.semantic
-
-**Required**: No
-**Default value**: at-least-once
-**Description**: The semantic guaranteed by sink. Valid values: **at-least-once** and **exactly-once**.
-
-### sink.buffer-flush.max-bytes
-
-**Required**: No
-**Default value**: 94371840 (90 MB)
-**Description**: The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. The maximum value ranges from 64 MB to 10 GB. Setting this parameter to a larger value can improve loading performance but may increase loading latency. This parameter only takes effect when `sink.semantic` is set to `at-least-once`. If `sink.semantic` is set to `exactly-once`, the data in memory is flushed when a Flink checkpoint is triggered. In this circumstance, this parameter does not take effect.
-
-### sink.buffer-flush.max-rows
-
-**Required**: No
-**Default value**: 500000
-**Description**: The maximum number of rows that can be accumulated in memory before being sent to StarRocks at a time. This parameter is available only when `sink.version` is `V1` and `sink.semantic` is `at-least-once`. Valid values: 64000 to 5000000.
-
-### sink.buffer-flush.interval-ms
-
-**Required**: No
-**Default value**: 300000
-**Description**: The interval at which data is flushed. This parameter is available only when `sink.semantic` is `at-least-once`. Valid values: 1000 to 3600000. Unit: ms.
-
-### sink.max-retries
-
-**Required**: No
-**Default value**: 3
-**Description**: The number of times that the system retries to perform the Stream Load job. This parameter is available only when you set `sink.version` to `V1`. Valid values: 0 to 10.
-
-### sink.connect.timeout-ms
-
-**Required**: No
-**Default value**: 30000
-**Description**: The timeout for establishing an HTTP connection. Valid values: 100 to 60000. Unit: ms. Before Flink connector v1.2.9, the default value is `1000`.
-
-### sink.socket.timeout-ms
-
-**Required**: No
-**Default value**: -1
-**Description**: Supported since 1.2.10. The time duration for which the HTTP client waits for data. Unit: ms. The default value `-1` means there is no timeout.
-
-### sink.sanitize-error-log
-
-**Required**: No
-**Default value**: false
-**Description**: Supported since 1.2.12. Whether to sanitize sensitive data in the error log for production security. When this item is set to `true`, sensitive row data and column values in Stream Load error logs are redacted in both the connector and SDK logs. The value defaults to `false` for backward compatibility.
-
-### sink.wait-for-continue.timeout-ms
-
-**Required**: No
-**Default value**: 10000
-**Description**: Supported since 1.2.7. The timeout for waiting for an HTTP 100-continue response from the FE. Valid values: `3000` to `60000`. Unit: ms.
-
-### sink.ignore.update-before
-
-**Required**: No
-**Default value**: true
-**Description**: Supported since version 1.2.8. Whether to ignore `UPDATE_BEFORE` records from Flink when loading data to Primary Key tables. If this parameter is set to `false`, the record is treated as a delete operation on the StarRocks table.
-
-### sink.parallelism
-
-**Required**: No
-**Default value**: NONE
-**Description**: The parallelism of loading. Only available for Flink SQL. If this parameter is not specified, the Flink planner decides the parallelism. **In the scenario of multi-parallelism, users need to guarantee data is written in the correct order.**
-
-### sink.properties.*
-
-**Required**: No
-**Default value**: NONE
-**Description**: The parameters that are used to control Stream Load behavior. For example, the parameter `sink.properties.format` specifies the format used for Stream Load, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
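-
-The following is a minimal Flink SQL sketch showing how pass-through properties are added to the `WITH` clause with the `sink.properties.` prefix. It assumes a locally reachable FE (as in the examples later in this topic) and an illustrative Flink table name `score_board_sink`; the property values are only examples:
-
-```SQL
-CREATE TABLE `score_board_sink` (
-    `id` INT,
-    `name` STRING,
-    `score` INT,
-    PRIMARY KEY (id) NOT ENFORCED
-) WITH (
-    'connector' = 'starrocks',
-    'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
-    'load-url' = '127.0.0.1:8030',
-    'database-name' = 'test',
-    'table-name' = 'score_board',
-    'username' = 'root',
-    'password' = '',
-    -- Each `sink.properties.*` option is forwarded to Stream Load as-is.
-    'sink.properties.format' = 'csv',
-    'sink.properties.column_separator' = ',',
-    'sink.properties.max_filter_ratio' = '0.2'
-);
-```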
-
-### sink.properties.format
-
-**Required**: No
-**Default value**: csv
-**Description**: The format used for Stream Load. The Flink connector transforms each batch of data to this format before sending it to StarRocks. Valid values: `csv` and `json`.
-
-### sink.properties.column_separator
-
-**Required**: No
-**Default value**: \t
-**Description**: The column separator for CSV-formatted data.
-
-### sink.properties.row_delimiter
-
-**Required**: No
-**Default value**: \n
-**Description**: The row delimiter for CSV-formatted data.
-
-### sink.properties.max_filter_ratio
-
-**Required**: No
-**Default value**: 0
-**Description**: The maximum error tolerance of the Stream Load. It's the maximum percentage of data records that can be filtered out due to inadequate data quality. Valid values: `0` to `1`. Default value: `0`. See [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) for details.
-
-### sink.properties.partial_update
-
-**Required**: No
-**Default value**: `FALSE`
-**Description**: Whether to use partial updates. Valid values: `TRUE` and `FALSE`. The default value `FALSE` disables this feature.
-
-### sink.properties.partial_update_mode
-
-**Required**: No
-**Default value**: `row`
-**Description**: Specifies the mode for partial updates. Valid values: `row` and `column`.
-The value `row` (default) means partial updates in row mode, which is more suitable for real-time updates with many columns and small batches.
-The value `column` means partial updates in column mode, which is more suitable for batch updates with few columns and many rows. In such scenarios, enabling the column mode offers faster update speeds. For example, in a table with 100 columns, if only 10 columns (10% of the total) are updated for all rows, the update speed of the column mode is 10 times faster.
-
-### sink.properties.strict_mode
-
-**Required**: No
-**Default value**: false
-**Description**: Specifies whether to enable the strict mode for Stream Load. It affects the loading behavior when there are unqualified rows, such as inconsistent column values. Valid values: `true` and `false`. Default value: `false`. See [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) for details.
-
-### sink.properties.compression
-
-**Required**: No
-**Default value**: NONE
-**Description**: The compression algorithm used for Stream Load. Valid values: `lz4_frame`. Compression for the JSON format requires Flink connector 1.2.10+ and StarRocks v3.2.7+. Compression for the CSV format only requires Flink connector 1.2.11+.
-
-### sink.properties.prepared_timeout
-
-**Required**: No
-**Default value**: NONE
-**Description**: Supported since 1.2.12 and only effective when `sink.version` is set to `V2`. Requires StarRocks 3.5.4 or later. Sets the timeout in seconds for the Transaction Stream Load phase from `PREPARED` to `COMMITTED`. Typically, only needed for exactly-once; at-least-once usually does not require setting this (the connector defaults to 300s). If not set in exactly-once, StarRocks FE configuration `prepared_transaction_default_timeout_second` (default 86400s) applies. See [StarRocks Transaction timeout management](./Stream_Load_transaction_interface.md#transaction-timeout-management).
-
-## Data type mapping between Flink and StarRocks
-
-| Flink data type | StarRocks data type |
-|-----------------------------------|-----------------------|
-| BOOLEAN | BOOLEAN |
-| TINYINT | TINYINT |
-| SMALLINT | SMALLINT |
-| INTEGER | INTEGER |
-| BIGINT | BIGINT |
-| FLOAT | FLOAT |
-| DOUBLE | DOUBLE |
-| DECIMAL | DECIMAL |
-| BINARY | INT |
-| CHAR | STRING |
-| VARCHAR | STRING |
-| STRING | STRING |
-| DATE | DATE |
-| TIMESTAMP_WITHOUT_TIME_ZONE(N) | DATETIME |
-| TIMESTAMP_WITH_LOCAL_TIME_ZONE(N) | DATETIME |
-| ARRAY<T> | ARRAY<T> |
-| MAP<KT,VT> | JSON STRING |
-| ROW<arg T...> | JSON STRING |
-
-## Usage notes
-
-### Exactly Once
-
-- If you want the sink to guarantee exactly-once semantics, we recommend that you upgrade StarRocks to 2.5 or later and the Flink connector to 1.2.4 or later.
-  - Since Flink connector 1.2.4, exactly-once is redesigned based on the [Stream Load transaction interface](./Stream_Load_transaction_interface.md)
-    provided by StarRocks since 2.4. Compared to the previous implementation based on the non-transactional Stream Load interface,
-    the new implementation reduces memory usage and checkpoint overhead, thereby enhancing the real-time performance and
-    stability of loading.
-
-  - If the version of StarRocks is earlier than 2.4 or the version of the Flink connector is earlier than 1.2.4, the sink
-    automatically chooses the implementation based on the non-transactional Stream Load interface.
-
-- Configurations to guarantee exactly-once (a consolidated Flink SQL sketch follows at the end of this section)
-
- - The value of `sink.semantic` needs to be `exactly-once`.
-
-  - If the Flink connector version is 1.2.8 or later, it is recommended to specify the value of `sink.label-prefix`. Note that the label prefix must be unique among all types of loading in StarRocks, such as Flink jobs, Routine Load, and Broker Load.
-
-  - If the label prefix is specified, the Flink connector will use the label prefix to clean up lingering transactions that may be generated in some Flink
-    failure scenarios, such as when the Flink job fails while a checkpoint is still in progress. These lingering transactions
-    are generally in the `PREPARED` status if you use `SHOW PROC '/transactions/<db_id>/running';` to view them in StarRocks. When the Flink job restores from a checkpoint,
-    the Flink connector will find these lingering transactions according to the label prefix and some information in the
-    checkpoint, and abort them. The Flink connector cannot abort them when the Flink job exits because of the two-phase commit
-    mechanism used to implement exactly-once. When the Flink job exits, the Flink connector has not yet received the notification from the
-    Flink checkpoint coordinator about whether the transactions should be included in a successful checkpoint, and it may
-    lead to data loss if these transactions are aborted anyway. You can get an overview of how to achieve end-to-end exactly-once
-    in Flink in this [blogpost](https://flink.apache.org/2018/02/28/an-overview-of-end-to-end-exactly-once-processing-in-apache-flink-with-apache-kafka-too/).
-
-  - If the label prefix is not specified, lingering transactions will be cleaned up by StarRocks only after they time out. However, the number of running transactions can reach the limit of the StarRocks configuration `max_running_txn_num_per_db` if
-    Flink jobs fail frequently before transactions time out. You can set a smaller timeout for `PREPARED` transactions
-    to make them expire faster when the label prefix is not specified. See the following about how to set the prepared timeout.
-
-- If you are certain that the Flink job will eventually recover from a checkpoint or savepoint after a long downtime because of a stop or continuous failover,
-  adjust the following StarRocks configurations accordingly to avoid data loss.
-
- - Adjust `PREPARED` transaction timeout. See the following about how to set the timeout.
-
- The timeout needs to be larger than the downtime of the Flink job. Otherwise, the lingering transactions that are included in a successful checkpoint may be aborted because of timeout before you restart the Flink job, which leads to data loss.
-
-    Note that when you set a larger value for this configuration, it is better to specify the value of `sink.label-prefix` so that the lingering transactions can be cleaned up according to the label prefix and some information in the
-    checkpoint, instead of due to timeout (which may cause data loss).
-
-  - `label_keep_max_second` and `label_keep_max_num`: StarRocks FE configurations, default values are `259200` and `1000`
-    respectively. For details, see [FE configurations](./loading_introduction/loading_considerations.md#fe-configurations). The value of `label_keep_max_second` needs to be larger than the downtime of the Flink job. Otherwise, the Flink connector cannot check the state of transactions in StarRocks by using the transaction labels saved in Flink's savepoint or checkpoint to figure out whether these transactions are committed, which may eventually lead to data loss.
-
-- How to set the timeout for PREPARED transactions
-
- - For Connector 1.2.12+ and StarRocks 3.5.4+, you can set the timeout by configuring the connector parameter `sink.properties.prepared_timeout`. By default, the value is not set, and it falls back to the StarRocks FE's global configuration `prepared_transaction_default_timeout_second` (default value is `86400`).
-
- - For other versions of Connector or StarRocks, you can set the timeout by configuring the StarRocks FE's global configuration `prepared_transaction_default_timeout_second` (default value is `86400`).
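-
-The following is a minimal Flink SQL sketch that pulls together the exactly-once options described above. It assumes a locally reachable FE as in the examples elsewhere in this topic, an illustrative Flink table name `score_board_sink`, and a placeholder label prefix; Flink checkpointing must be enabled for exactly-once to take effect:
-
-```SQL
-CREATE TABLE `score_board_sink` (
-    `id` INT,
-    `name` STRING,
-    `score` INT,
-    PRIMARY KEY (id) NOT ENFORCED
-) WITH (
-    'connector' = 'starrocks',
-    'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
-    'load-url' = '127.0.0.1:8030',
-    'database-name' = 'test',
-    'table-name' = 'score_board',
-    'username' = 'root',
-    'password' = '',
-    -- Exactly-once semantics; data is flushed only when a checkpoint is triggered.
-    'sink.semantic' = 'exactly-once',
-    -- A label prefix that is unique among all load jobs in the StarRocks cluster,
-    -- so that lingering transactions can be cleaned up when the job restores from a checkpoint.
-    'sink.label-prefix' = 'flink_score_board'
-);
-```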
-
-### Flush Policy
-
-The Flink connector buffers the data in memory and flushes it in batches to StarRocks via Stream Load. How the flush
-is triggered differs between at-least-once and exactly-once.
-
-For at-least-once, the flush is triggered when any of the following conditions is met:
-
-- the size of the buffered rows reaches the limit `sink.buffer-flush.max-bytes`
-- the number of buffered rows reaches the limit `sink.buffer-flush.max-rows` (only valid for sink version V1)
-- the elapsed time since the last flush reaches the limit `sink.buffer-flush.interval-ms`
-- a checkpoint is triggered
-
-For exactly-once, the flush only happens when a checkpoint is triggered.
-
-### Monitoring load metrics
-
-The Flink connector provides the following metrics to monitor loading.
-
-| Metric | Type | Description |
-|--------------------------|---------|-----------------------------------------------------------------|
-| totalFlushBytes | counter | successfully flushed bytes. |
-| totalFlushRows | counter | number of rows successfully flushed. |
-| totalFlushSucceededTimes | counter | number of times that the data is successfully flushed. |
-| totalFlushFailedTimes | counter | number of times that the data fails to be flushed. |
-| totalFilteredRows | counter | number of rows filtered, which is also included in totalFlushRows. |
-
-## Examples
-
-The following examples show how to use the Flink connector to load data into a StarRocks table with Flink SQL or Flink DataStream.
-
-### Preparations
-
-#### Create a StarRocks table
-
-Create a database `test` and create a Primary Key table `score_board`.
-
-```sql
-CREATE DATABASE `test`;
-
-CREATE TABLE `test`.`score_board`
-(
- `id` int(11) NOT NULL COMMENT "",
- `name` varchar(65533) NULL DEFAULT "" COMMENT "",
- `score` int(11) NOT NULL DEFAULT "0" COMMENT ""
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-COMMENT "OLAP"
-DISTRIBUTED BY HASH(`id`);
-```
-
-#### Set up Flink environment
-
-- Download the Flink binary [Flink 1.15.2](https://archive.apache.org/dist/flink/flink-1.15.2/flink-1.15.2-bin-scala_2.12.tgz), and extract it to the directory `flink-1.15.2`.
-- Download [Flink connector 1.2.7](https://repo1.maven.org/maven2/com/starrocks/flink-connector-starrocks/1.2.7_flink-1.15/flink-connector-starrocks-1.2.7_flink-1.15.jar), and put it into the directory `flink-1.15.2/lib`.
-- Run the following commands to start a Flink cluster:
-
- ```shell
- cd flink-1.15.2
- ./bin/start-cluster.sh
- ```
-
-#### Network configuration
-
-Ensure that the machine where Flink is located can access the FE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`query_port`](../administration/management/FE_configuration.md#query_port) (default: `9030`), and the BE nodes via the [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`).
-
-### Run with Flink SQL
-
-- Run the following command to start a Flink SQL client.
-
- ```shell
- ./bin/sql-client.sh
- ```
-
-- Create a Flink table `score_board`, and insert values into the table via Flink SQL Client.
-Note that you must define the primary key in the Flink DDL if you want to load data into a Primary Key table of StarRocks. It is optional for other types of StarRocks tables.
-
- ```SQL
- CREATE TABLE `score_board` (
- `id` INT,
- `name` STRING,
- `score` INT,
- PRIMARY KEY (id) NOT ENFORCED
- ) WITH (
- 'connector' = 'starrocks',
- 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
- 'load-url' = '127.0.0.1:8030',
- 'database-name' = 'test',
- 'table-name' = 'score_board',
- 'username' = 'root',
- 'password' = ''
- );
-
- INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100);
- ```
-
-### Run with Flink DataStream
-
-There are several ways to implement a Flink DataStream job according to the type of the input records, such as a CSV Java `String`, a JSON Java `String` or a custom Java object.
-
-- The input records are CSV-format `String`. See [LoadCsvRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream) for a complete example.
-
- ```java
- /**
- * Generate CSV-format records. Each record has three values separated by "\t".
- * These values will be loaded to the columns `id`, `name`, and `score` in the StarRocks table.
- */
- String[] records = new String[]{
- "1\tstarrocks-csv\t100",
- "2\tflink-csv\t100"
- };
-    DataStream<String> source = env.fromElements(records);
-
- /**
- * Configure the connector with the required properties.
- * You also need to add properties "sink.properties.format" and "sink.properties.column_separator"
- * to tell the connector the input records are CSV-format, and the column separator is "\t".
- * You can also use other column separators in the CSV-format records,
- * but remember to modify the "sink.properties.column_separator" correspondingly.
- */
- StarRocksSinkOptions options = StarRocksSinkOptions.builder()
- .withProperty("jdbc-url", jdbcUrl)
- .withProperty("load-url", loadUrl)
- .withProperty("database-name", "test")
- .withProperty("table-name", "score_board")
- .withProperty("username", "root")
- .withProperty("password", "")
- .withProperty("sink.properties.format", "csv")
- .withProperty("sink.properties.column_separator", "\t")
- .build();
- // Create the sink with the options.
-    SinkFunction<String> starRockSink = StarRocksSink.sink(options);
- source.addSink(starRockSink);
- ```
-
-- The input records are JSON-format `String`. See [LoadJsonRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream) for a complete example.
-
- ```java
- /**
- * Generate JSON-format records.
- * Each record has three key-value pairs corresponding to the columns `id`, `name`, and `score` in the StarRocks table.
- */
- String[] records = new String[]{
- "{\"id\":1, \"name\":\"starrocks-json\", \"score\":100}",
- "{\"id\":2, \"name\":\"flink-json\", \"score\":100}",
- };
-    DataStream<String> source = env.fromElements(records);
-
- /**
- * Configure the connector with the required properties.
- * You also need to add properties "sink.properties.format" and "sink.properties.strip_outer_array"
- * to tell the connector the input records are JSON-format and to strip the outermost array structure.
- */
- StarRocksSinkOptions options = StarRocksSinkOptions.builder()
- .withProperty("jdbc-url", jdbcUrl)
- .withProperty("load-url", loadUrl)
- .withProperty("database-name", "test")
- .withProperty("table-name", "score_board")
- .withProperty("username", "root")
- .withProperty("password", "")
- .withProperty("sink.properties.format", "json")
- .withProperty("sink.properties.strip_outer_array", "true")
- .build();
- // Create the sink with the options.
-    SinkFunction<String> starRockSink = StarRocksSink.sink(options);
- source.addSink(starRockSink);
- ```
-
-- The input records are custom Java objects. See [LoadCustomJavaRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream) for a complete example.
-
- - In this example, the input record is a simple POJO `RowData`.
-
- ```java
- public static class RowData {
- public int id;
- public String name;
- public int score;
-
- public RowData() {}
-
- public RowData(int id, String name, int score) {
- this.id = id;
- this.name = name;
- this.score = score;
- }
- }
- ```
-
- - The main program is as follows:
-
- ```java
- // Generate records which use RowData as the container.
- RowData[] records = new RowData[]{
- new RowData(1, "starrocks-rowdata", 100),
- new RowData(2, "flink-rowdata", 100),
- };
-    DataStream<RowData> source = env.fromElements(records);
-
- // Configure the connector with the required properties.
- StarRocksSinkOptions options = StarRocksSinkOptions.builder()
- .withProperty("jdbc-url", jdbcUrl)
- .withProperty("load-url", loadUrl)
- .withProperty("database-name", "test")
- .withProperty("table-name", "score_board")
- .withProperty("username", "root")
- .withProperty("password", "")
- .build();
-
- /**
- * The Flink connector will use a Java object array (Object[]) to represent a row to be loaded into the StarRocks table,
- * and each element is the value for a column.
- * You need to define the schema of the Object[] which matches that of the StarRocks table.
- */
- TableSchema schema = TableSchema.builder()
- .field("id", DataTypes.INT().notNull())
- .field("name", DataTypes.STRING())
- .field("score", DataTypes.INT())
- // When the StarRocks table is a Primary Key table, you must specify notNull(), for example, DataTypes.INT().notNull(), for the primary key `id`.
- .primaryKey("id")
- .build();
- // Transform the RowData to the Object[] according to the schema.
- RowDataTransformer transformer = new RowDataTransformer();
- // Create the sink with the schema, options, and transformer.
-    SinkFunction<RowData> starRockSink = StarRocksSink.sink(schema, options, transformer);
- source.addSink(starRockSink);
- ```
-
- - The `RowDataTransformer` in the main program is defined as follows:
-
- ```java
-    private static class RowDataTransformer implements StarRocksSinkRowBuilder<RowData> {
-
- /**
- * Set each element of the object array according to the input RowData.
- * The schema of the array matches that of the StarRocks table.
- */
- @Override
- public void accept(Object[] internalRow, RowData rowData) {
- internalRow[0] = rowData.id;
- internalRow[1] = rowData.name;
- internalRow[2] = rowData.score;
- // When the StarRocks table is a Primary Key table, you need to set the last element to indicate whether the data loading is an UPSERT or DELETE operation.
- internalRow[internalRow.length - 1] = StarRocksSinkOP.UPSERT.ordinal();
- }
- }
- ```
-
-### Synchronize data with Flink CDC 3.0 (with schema change supported)
-
-The [Flink CDC 3.0](https://nightlies.apache.org/flink/flink-cdc-docs-stable) framework can be used
-to easily build a streaming ELT pipeline from CDC sources (such as MySQL and Kafka) to StarRocks. The pipeline can synchronize whole databases, merged sharding tables, and schema changes from the sources to StarRocks.
-
-Since v1.2.9, the Flink connector for StarRocks is integrated into this framework as [StarRocks Pipeline Connector](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/pipeline-connectors/starrocks/). The StarRocks Pipeline Connector supports:
-
-- Automatic creation of databases and tables
-- Schema change synchronization
-- Full and incremental data synchronization
-
-For quick start, see [Streaming ELT from MySQL to StarRocks using Flink CDC 3.0 with StarRocks Pipeline Connector](https://nightlies.apache.org/flink/flink-cdc-docs-stable/docs/get-started/quickstart/mysql-to-starrocks).
-
-It is advised to use StarRocks v3.2.1 and later versions to enable [fast_schema_evolution](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE.md#set-fast-schema-evolution). It will improve the speed of adding or dropping columns and reduce resource usage.
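-
-As a minimal sketch (assuming you create the target table manually rather than letting the pipeline connector create it, and reusing the `score_board` schema from the examples above under the hypothetical name `score_board_cdc`), the property can be enabled in the `PROPERTIES` clause of `CREATE TABLE`:
-
-```SQL
-CREATE TABLE `test`.`score_board_cdc`
-(
-    `id` int(11) NOT NULL COMMENT "",
-    `name` varchar(65533) NULL DEFAULT "" COMMENT "",
-    `score` int(11) NOT NULL DEFAULT "0" COMMENT ""
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-DISTRIBUTED BY HASH(`id`)
-PROPERTIES (
-    -- Speeds up adding and dropping columns; available in StarRocks v3.2.1 and later.
-    "fast_schema_evolution" = "true"
-);
-```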
-
-## Best practices
-
-### Load data to a Primary Key table
-
-This section shows how to load data into a StarRocks Primary Key table to achieve partial updates and conditional updates.
-See [Change data through loading](./Load_to_Primary_Key_tables.md) for an introduction to these features.
-These examples use Flink SQL.
-
-#### Preparations
-
-Create a database `test` and create a Primary Key table `score_board` in StarRocks.
-
-```SQL
-CREATE DATABASE `test`;
-
-CREATE TABLE `test`.`score_board`
-(
- `id` int(11) NOT NULL COMMENT "",
- `name` varchar(65533) NULL DEFAULT "" COMMENT "",
- `score` int(11) NOT NULL DEFAULT "0" COMMENT ""
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-COMMENT "OLAP"
-DISTRIBUTED BY HASH(`id`);
-```
-
-#### Partial update
-
-This example shows how to load data only into the columns `id` and `name`.
-
-1. Insert two data rows into the StarRocks table `score_board` in MySQL client.
-
- ```SQL
- mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100);
-
- mysql> select * from score_board;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 1 | starrocks | 100 |
- | 2 | flink | 100 |
- +------+-----------+-------+
- 2 rows in set (0.02 sec)
- ```
-
-2. Create a Flink table `score_board` in Flink SQL client.
-
- - Define the DDL which only includes the columns `id` and `name`.
- - Set the option `sink.properties.partial_update` to `true` which tells the Flink connector to perform partial updates.
-    - If the Flink connector version is `<=` 1.2.7, you also need to set the option `sink.properties.columns` to `id,name,__op` to tell the Flink connector which columns need to be updated. Note that you need to append the field `__op` at the end. The field `__op` indicates whether the data loading is an UPSERT or DELETE operation, and its values are set by the connector automatically.
-
- ```SQL
- CREATE TABLE `score_board` (
- `id` INT,
- `name` STRING,
- PRIMARY KEY (id) NOT ENFORCED
- ) WITH (
- 'connector' = 'starrocks',
- 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
- 'load-url' = '127.0.0.1:8030',
- 'database-name' = 'test',
- 'table-name' = 'score_board',
- 'username' = 'root',
- 'password' = '',
- 'sink.properties.partial_update' = 'true',
- -- only for Flink connector version <= 1.2.7
- 'sink.properties.columns' = 'id,name,__op'
- );
- ```
-
-3. Insert two data rows into the Flink table. The primary keys of the data rows are the same as those of the rows in the StarRocks table, but the values in the column `name` are modified.
-
- ```SQL
- INSERT INTO `score_board` VALUES (1, 'starrocks-update'), (2, 'flink-update');
- ```
-
-4. Query the StarRocks table in MySQL client.
-
- ```SQL
- mysql> select * from score_board;
- +------+------------------+-------+
- | id | name | score |
- +------+------------------+-------+
- | 1 | starrocks-update | 100 |
- | 2 | flink-update | 100 |
- +------+------------------+-------+
- 2 rows in set (0.02 sec)
- ```
-
- You can see that only values for `name` change, and the values for `score` do not change.
-
-#### Conditional update
-
-This example shows how to perform a conditional update according to the value of the column `score`. The update for an `id`
-takes effect only when the new value for `score` is greater than or equal to the old value.
-
-1. Insert two data rows into the StarRocks table in MySQL client.
-
- ```SQL
- mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100);
-
- mysql> select * from score_board;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 1 | starrocks | 100 |
- | 2 | flink | 100 |
- +------+-----------+-------+
- 2 rows in set (0.02 sec)
- ```
-
-2. Create a Flink table `score_board` in the following ways:
-
-    - Define the DDL including all of the columns.
-    - Set the option `sink.properties.merge_condition` to `score` to tell the connector to use the column `score`
-      as the condition.
-    - Set the option `sink.version` to `V1`, which tells the connector to use Stream Load.
-
- ```SQL
- CREATE TABLE `score_board` (
- `id` INT,
- `name` STRING,
- `score` INT,
- PRIMARY KEY (id) NOT ENFORCED
- ) WITH (
- 'connector' = 'starrocks',
- 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
- 'load-url' = '127.0.0.1:8030',
- 'database-name' = 'test',
- 'table-name' = 'score_board',
- 'username' = 'root',
- 'password' = '',
- 'sink.properties.merge_condition' = 'score',
- 'sink.version' = 'V1'
- );
- ```
-
-3. Insert two data rows into the Flink table. The primary keys of the data rows are the same as those of the rows in the StarRocks table. The first data row has a smaller value in the column `score`, and the second data row has a larger value in the column `score`.
-
- ```SQL
- INSERT INTO `score_board` VALUES (1, 'starrocks-update', 99), (2, 'flink-update', 101);
- ```
-
-4. Query the StarRocks table in MySQL client.
-
- ```SQL
- mysql> select * from score_board;
- +------+--------------+-------+
- | id | name | score |
- +------+--------------+-------+
- | 1 | starrocks | 100 |
- | 2 | flink-update | 101 |
- +------+--------------+-------+
- 2 rows in set (0.03 sec)
- ```
-
- You can see that only the values of the second data row change, and the values of the first data row do not change.
-
-### Load data into columns of BITMAP type
-
-[`BITMAP`](../sql-reference/data-types/other-data-types/BITMAP.md) is often used to accelerate exact count distinct, for example, counting UVs. See [Use Bitmap for exact Count Distinct](../using_starrocks/distinct_values/Using_bitmap.md).
-Here we take UV counting as an example to show how to load data into columns of the `BITMAP` type.
-
-1. Create a StarRocks Aggregate table in MySQL client.
-
- In the database `test`, create an Aggregate table `page_uv` where the column `visit_users` is defined as the `BITMAP` type and configured with the aggregate function `BITMAP_UNION`.
-
- ```SQL
- CREATE TABLE `test`.`page_uv` (
- `page_id` INT NOT NULL COMMENT 'page ID',
- `visit_date` datetime NOT NULL COMMENT 'access time',
- `visit_users` BITMAP BITMAP_UNION NOT NULL COMMENT 'user ID'
- ) ENGINE=OLAP
- AGGREGATE KEY(`page_id`, `visit_date`)
- DISTRIBUTED BY HASH(`page_id`);
- ```
-
-2. Create a Flink table in Flink SQL client.
-
- The column `visit_user_id` in the Flink table is of `BIGINT` type, and we want to load this column to the column `visit_users` of `BITMAP` type in the StarRocks table. So when defining the DDL of the Flink table, note that:
- - Because Flink does not support `BITMAP`, you need to define a column `visit_user_id` as `BIGINT` type to represent the column `visit_users` of `BITMAP` type in the StarRocks table.
-   - You need to set the option `sink.properties.columns` to `page_id,visit_date,visit_user_id,visit_users=to_bitmap(visit_user_id)`, which tells the connector the column mapping between the Flink table and the StarRocks table. You also need to use the [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md) function to tell the connector to convert data of the `BIGINT` type into the `BITMAP` type.
-
- ```SQL
- CREATE TABLE `page_uv` (
- `page_id` INT,
- `visit_date` TIMESTAMP,
- `visit_user_id` BIGINT
- ) WITH (
- 'connector' = 'starrocks',
- 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
- 'load-url' = '127.0.0.1:8030',
- 'database-name' = 'test',
- 'table-name' = 'page_uv',
- 'username' = 'root',
- 'password' = '',
- 'sink.properties.columns' = 'page_id,visit_date,visit_user_id,visit_users=to_bitmap(visit_user_id)'
- );
- ```
-
-3. Load data into Flink table in Flink SQL client.
-
- ```SQL
- INSERT INTO `page_uv` VALUES
- (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 13),
- (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23),
- (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 33),
- (1, CAST('2020-06-23 02:30:30' AS TIMESTAMP), 13),
- (2, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23);
- ```
-
-4. Calculate page UVs from the StarRocks table in MySQL client.
-
- ```SQL
- MySQL [test]> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `page_uv` GROUP BY `page_id`;
- +---------+-----------------------------+
- | page_id | count(DISTINCT visit_users) |
- +---------+-----------------------------+
- | 2 | 1 |
- | 1 | 3 |
- +---------+-----------------------------+
- 2 rows in set (0.05 sec)
- ```
-
-### Load data into columns of HLL type
-
-[`HLL`](../sql-reference/data-types/other-data-types/HLL.md) can be used for approximate count distinct, see [Use HLL for approximate count distinct](../using_starrocks/distinct_values/Using_HLL.md).
-
-Here we take the counting of UV as an example to show how to load data into columns of the `HLL` type.
-
-1. Create a StarRocks Aggregate table
-
- In the database `test`, create an Aggregate table `hll_uv` where the column `visit_users` is defined as the `HLL` type and configured with the aggregate function `HLL_UNION`.
-
- ```SQL
- CREATE TABLE `hll_uv` (
- `page_id` INT NOT NULL COMMENT 'page ID',
- `visit_date` datetime NOT NULL COMMENT 'access time',
- `visit_users` HLL HLL_UNION NOT NULL COMMENT 'user ID'
- ) ENGINE=OLAP
- AGGREGATE KEY(`page_id`, `visit_date`)
- DISTRIBUTED BY HASH(`page_id`);
- ```
-
-2. Create a Flink table in Flink SQL client.
-
- The column `visit_user_id` in the Flink table is of `BIGINT` type, and we want to load this column to the column `visit_users` of `HLL` type in the StarRocks table. So when defining the DDL of the Flink table, note that:
-   - Because Flink does not support `HLL`, you need to define a column `visit_user_id` as `BIGINT` type to represent the column `visit_users` of `HLL` type in the StarRocks table.
-   - You need to set the option `sink.properties.columns` to `page_id,visit_date,visit_user_id,visit_users=hll_hash(visit_user_id)`, which tells the connector the column mapping between the Flink table and the StarRocks table. You also need to use the [`hll_hash`](../sql-reference/sql-functions/scalar-functions/hll_hash.md) function to tell the connector to convert data of the `BIGINT` type into the `HLL` type.
-
- ```SQL
- CREATE TABLE `hll_uv` (
- `page_id` INT,
- `visit_date` TIMESTAMP,
- `visit_user_id` BIGINT
- ) WITH (
- 'connector' = 'starrocks',
- 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
- 'load-url' = '127.0.0.1:8030',
- 'database-name' = 'test',
- 'table-name' = 'hll_uv',
- 'username' = 'root',
- 'password' = '',
- 'sink.properties.columns' = 'page_id,visit_date,visit_user_id,visit_users=hll_hash(visit_user_id)'
- );
- ```
-
-3. Load data into Flink table in Flink SQL client.
-
- ```SQL
- INSERT INTO `hll_uv` VALUES
- (3, CAST('2023-07-24 12:00:00' AS TIMESTAMP), 78),
- (4, CAST('2023-07-24 13:20:10' AS TIMESTAMP), 2),
- (3, CAST('2023-07-24 12:30:00' AS TIMESTAMP), 674);
- ```
-
-4. Calculate page UVs from the StarRocks table in MySQL client.
-
- ```SQL
- mysql> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `hll_uv` GROUP BY `page_id`;
-   +---------+-----------------------------+
- | page_id | count(DISTINCT visit_users) |
- +---------+-----------------------------+
- | 3 | 2 |
- | 4 | 1 |
- +---------+-----------------------------+
- 2 rows in set (0.04 sec)
- ```
diff --git a/docs/en/loading/Flink_cdc_load.md b/docs/en/loading/Flink_cdc_load.md
deleted file mode 100644
index dbd0a24..0000000
--- a/docs/en/loading/Flink_cdc_load.md
+++ /dev/null
@@ -1,532 +0,0 @@
----
-displayed_sidebar: docs
-keywords:
- - MySql
- - mysql
- - sync
- - Flink CDC
----
-
-# Realtime synchronization from MySQL
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-StarRocks supports multiple methods to synchronize data from MySQL to StarRocks in real time, delivering low latency real-time analytics of massive data.
-
-This topic describes how to synchronize data from MySQL to StarRocks in real-time (within seconds) through Apache Flink®.
-
-
-
-## How it works
-
-:::tip
-
-Flink CDC is used in the synchronization from MySQL to Flink. This topic uses Flink CDC whose version is less than 3.0, so SMT is used to synchronize table schemas. However, if Flink CDC 3.0 is used, it is not necessary to use SMT to synchronize table schemas to StarRocks. Flink CDC 3.0 can even synchronize the schemas of the entire MySQL database, the sharded databases and tables, and also supports schema changes synchronization. For detailed usage, see [Streaming ELT from MySQL to StarRocks](https://nightlies.apache.org/flink/flink-cdc-docs-stable/docs/get-started/quickstart/mysql-to-starrocks).
-
-:::
-
-The following figure illustrates the entire synchronization process.
-
-
-
-Real-time synchronization from MySQL through Flink to StarRocks is implemented in two stages: synchronizing database & table schema and synchronizing data. First, the SMT converts MySQL database & table schema into table creation statements for StarRocks. Then, the Flink cluster runs Flink jobs to synchronize full and incremental MySQL data to StarRocks.
-
-:::info
-
-The synchronization process guarantees exactly-once semantics.
-
-:::
-
-**Synchronization process**:
-
-1. Synchronize database & table schema.
-
- The SMT reads the schema of the MySQL database & table to be synchronized and generates SQL files for creating a destination database & table in StarRocks. This operation is based on the MySQL and StarRocks information in SMT's configuration file.
-
-2. Synchronize data.
-
- a. The Flink SQL client executes the data loading statement `INSERT INTO SELECT` to submit one or more Flink jobs to the Flink cluster.
-
- b. The Flink cluster runs the Flink jobs to obtain data. The Flink CDC connector first reads full historical data from the source database, then seamlessly switches to incremental reading, and sends the data to flink-connector-starrocks.
-
- c. flink-connector-starrocks accumulates data in mini-batches, and synchronizes each batch of data to StarRocks.
-
- :::info
-
- Only data manipulation language (DML) operations in MySQL can be synchronized to StarRocks. Data definition language (DDL) operations cannot be synchronized.
-
- :::
-
-## Scenarios
-
-Real-time synchronization from MySQL has a broad range of use cases where data is constantly changed. Take a real-world use case "real-time ranking of commodity sales" as an example.
-
-Flink calculates the real-time ranking of commodity sales based on the original order table in MySQL and synchronizes the ranking to StarRocks' Primary Key table in real time. Users can connect a visualization tool to StarRocks to view the ranking in real time to gain on-demand operational insights.
-
-## Preparations
-
-### Download and install synchronization tools
-
-To synchronize data from MySQL, you need to install the following tools: SMT, Flink, Flink CDC connector, and flink-connector-starrocks.
-
-1. Download and install Flink, and start the Flink cluster. You can also perform this step by following the instructions in [Flink official documentation](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/try-flink/local_installation/).
-
- a. Install Java 8 or Java 11 in your operating system before you run Flink. You can run the following command to check the installed Java version.
-
- ```Bash
- # View the Java version.
- java -version
-
- # Java 8 is installed if the following output is returned.
- java version "1.8.0_301"
- Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
- Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)
- ```
-
- b. Download the [Flink installation package](https://flink.apache.org/downloads.html) and decompress it. We recommend that you use Flink 1.14 or later. The minimum allowed version is Flink 1.11. This topic uses Flink 1.14.5.
-
- ```Bash
- # Download Flink.
- wget https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.11.tgz
- # Decompress Flink.
- tar -xzf flink-1.14.5-bin-scala_2.11.tgz
- # Go to the Flink directory.
- cd flink-1.14.5
- ```
-
- c. Start the Flink cluster.
-
- ```Bash
- # Start the Flink cluster.
- ./bin/start-cluster.sh
-
- # The Flink cluster is started if the following output is returned.
- Starting cluster.
- Starting standalonesession daemon on host.
- Starting taskexecutor daemon on host.
- ```
-
-2. Download [Flink CDC connector](https://github.com/ververica/flink-cdc-connectors/releases). This topic uses MySQL as the data source and therefore, `flink-sql-connector-mysql-cdc-x.x.x.jar` is downloaded. The connector version must match the [Flink](https://github.com/ververica/flink-cdc-connectors/releases) version. This topic uses Flink 1.14.5 and you can download `flink-sql-connector-mysql-cdc-2.2.0.jar`.
-
- ```Bash
-   wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.2.0/flink-sql-connector-mysql-cdc-2.2.0.jar
- ```
-
-3. Download [flink-connector-starrocks](https://search.maven.org/artifact/com.starrocks/flink-connector-starrocks). The version must match the Flink version.
-
-   > The flink-connector-starrocks package `x.x.x_flink-y.yy_z.zz.jar` contains three version numbers:
-   >
-   > - `x.x.x` is the version number of flink-connector-starrocks.
-   > - `y.yy` is the supported Flink version.
-   > - `z.zz` is the Scala version supported by Flink. If the Flink version is 1.14.x or earlier, you must download a package that has the Scala version.
-   >
-   > This topic uses Flink 1.14.5 and Scala 2.11. Therefore, you can download the following package: `1.2.3_flink-1.14_2.11.jar`.
-
-4. Move the JAR packages of Flink CDC connector (`flink-sql-connector-mysql-cdc-2.2.0.jar`) and flink-connector-starrocks (`1.2.3_flink-1.14_2.11.jar`) to the `lib` directory of Flink.
-
- > **Note**
- >
- > If a Flink cluster is already running in your system, you must stop the Flink cluster and restart it to load and validate the JAR packages.
- >
- > ```Bash
- > $ ./bin/stop-cluster.sh
- > $ ./bin/start-cluster.sh
- > ```
-
-5. Download and decompress the [SMT package](https://www.starrocks.io/download/community) and place it in the `flink-1.14.5` directory. StarRocks provides SMT packages for Linux x86 and macOS ARM64. You can choose one based on your operating system and CPU.
-
- ```Bash
- # for Linux x86
- wget https://releases.starrocks.io/resources/smt.tar.gz
- # for macOS ARM64
- wget https://releases.starrocks.io/resources/smt_darwin_arm64.tar.gz
- ```
-
-### Enable MySQL binary log
-
-To synchronize data from MySQL in real time, the system needs to read data from MySQL binary log (binlog), parse the data, and then synchronize the data to StarRocks. Make sure that MySQL binary log is enabled.
-
-1. Edit the MySQL configuration file `my.cnf` (default path: `/etc/my.cnf`) to enable MySQL binary log.
-
- ```Bash
- # Enable MySQL Binlog.
- log_bin = ON
- # Configure the save path for the Binlog.
-   log_bin = /var/lib/mysql/mysql-bin
- # Configure server_id.
- # If server_id is not configured for MySQL 5.7.3 or later, the MySQL service cannot be used.
- server_id = 1
- # Set the Binlog format to ROW.
- binlog_format = ROW
- # The base name of the Binlog file. An identifier is appended to identify each Binlog file.
-   log_bin_basename = /var/lib/mysql/mysql-bin
-   # The index file of Binlog files, which manages the directory of all Binlog files.
-   log_bin_index = /var/lib/mysql/mysql-bin.index
- ```
-
-2. Run one of the following commands to restart MySQL for the modified configuration file to take effect.
-
- ```Bash
- # Use service to restart MySQL.
- service mysqld restart
- # Use mysqld script to restart MySQL.
- /etc/init.d/mysqld restart
- ```
-
-3. Connect to MySQL and check whether MySQL binary log is enabled.
-
- ```Plain
- -- Connect to MySQL.
- mysql -h xxx.xx.xxx.xx -P 3306 -u root -pxxxxxx
-
- -- Check whether MySQL binary log is enabled.
- mysql> SHOW VARIABLES LIKE 'log_bin';
- +---------------+-------+
- | Variable_name | Value |
- +---------------+-------+
- | log_bin | ON |
- +---------------+-------+
- 1 row in set (0.00 sec)
- ```
-
-## Synchronize database & table schema
-
-1. Edit the SMT configuration file.
-   Go to the SMT `conf` directory and edit the configuration file `config_prod.conf` to set the MySQL connection information, the matching rules of the database & table to be synchronized, and the configuration information of flink-connector-starrocks.
-
- ```Bash
- [db]
- type = mysql
- host = xxx.xx.xxx.xx
- port = 3306
- user = user1
- password = xxxxxx
-
- [other]
- # Number of BEs in StarRocks
- be_num = 3
- # `decimal_v3` is supported since StarRocks-1.18.1.
- use_decimal_v3 = true
- # File to save the converted DDL SQL
- output_dir = ./result
-
- [table-rule.1]
- # Pattern to match databases for setting properties
- database = ^demo.*$
- # Pattern to match tables for setting properties
- table = ^.*$
-
- ############################################
- ### Flink sink configurations
- ### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated.
- ############################################
- flink.starrocks.jdbc-url=jdbc:mysql://:
- flink.starrocks.load-url= :
- flink.starrocks.username=user2
- flink.starrocks.password=xxxxxx
- flink.starrocks.sink.properties.format=csv
- flink.starrocks.sink.properties.column_separator=\x01
- flink.starrocks.sink.properties.row_delimiter=\x02
- flink.starrocks.sink.buffer-flush.interval-ms=15000
- ```
-
- - `[db]`: information used to access the source database.
- - `type`: type of the source database. In this topic, the source database is `mysql`.
- - `host`: IP address of the MySQL server.
-     - `port`: port number of the MySQL database. Defaults to `3306`.
-     - `user`: username for accessing the MySQL database.
-     - `password`: password of the username.
-
- - `[table-rule]`: database & table matching rules and the corresponding flink-connector-starrocks configuration.
-
-     - `database`, `table`: the names of the database & table in MySQL. Regular expressions are supported.
- - `flink.starrocks.*`: configuration information of flink-connector-starrocks. For more configurations and information, see [flink-connector-starrocks](../loading/Flink-connector-starrocks.md).
-
-   > If you need to use different flink-connector-starrocks configurations for different tables, for example, if some tables are frequently updated and you need to accelerate data loading, see [Use different flink-connector-starrocks configurations for different tables](#use-different-flink-connector-starrocks-configurations-for-different-tables). If you need to load multiple tables obtained from MySQL sharding into the same StarRocks table, see [Synchronize multiple tables after MySQL sharding to one table in StarRocks](#synchronize-multiple-tables-after-mysql-sharding-to-one-table-in-starrocks).
-
- - `[other]`: other information
- - `be_num`: The number of BEs in your StarRocks cluster (This parameter will be used for setting a reasonable number of tablets in subsequent StarRocks table creation).
- - `use_decimal_v3`: Whether to enable [Decimal V3](../sql-reference/data-types/numeric/DECIMAL.md). After Decimal V3 is enabled, MySQL decimal data will be converted into Decimal V3 data when data is synchronized to StarRocks.
- - `output_dir`: The path to save the SQL files to be generated. The SQL files will be used to create a database & table in StarRocks and submit a Flink job to the Flink cluster. The default path is `./result` and we recommend that you retain the default settings.
-
-2. Run the SMT to read the database & table schema in MySQL and generate SQL files in the `./result` directory based on the configuration file. The `starrocks-create.all.sql` file is used to create a database & table in StarRocks and the `flink-create.all.sql` file is used to submit a Flink job to the Flink cluster.
-
- ```Bash
- # Run the SMT.
- ./starrocks-migrate-tool
-
- # Go to the result directory and check the files in this directory.
- cd result
- ls result
- flink-create.1.sql smt.tar.gz starrocks-create.all.sql
- flink-create.all.sql starrocks-create.1.sql
- ```
-
-3. Run the following command to connect to StarRocks and execute the `starrocks-create.all.sql` file to create a database and table in StarRocks. We recommend that you use the default table creation statement in the SQL file to create a [Primary Key table](../table_design/table_types/primary_key_table.md).
-
- > **Note**
- >
-   > You can also modify the table creation statement based on your business needs and create a table that does not use the Primary Key table. However, DELETE operations in the source MySQL database cannot be synchronized to a non-Primary Key table. Exercise caution when you create such a table.
-
- ```Bash
- mysql -h -P -u user2 -pxxxxxx < starrocks-create.all.sql
- ```
-
-   If the data needs to be processed by Flink before it is written to the destination StarRocks table, the table schema will be different between the source and destination tables. In this case, you must modify the table creation statement. In this example, the destination table requires only the `product_id` and `product_name` columns plus the real-time sales count for ranking commodity sales. You can use the following table creation statement.
-
-   ```SQL
- CREATE DATABASE IF NOT EXISTS `demo`;
-
- CREATE TABLE IF NOT EXISTS `demo`.`orders` (
- `product_id` INT(11) NOT NULL COMMENT "",
- `product_name` STRING NOT NULL COMMENT "",
- `sales_cnt` BIGINT NOT NULL COMMENT ""
- ) ENGINE=olap
- PRIMARY KEY(`product_id`)
- DISTRIBUTED BY HASH(`product_id`)
- PROPERTIES (
- "replication_num" = "3"
- );
- ```
-
- > **NOTICE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-## Synchronize data
-
-Run the Flink cluster and submit a Flink job to continuously synchronize full and incremental data from MySQL to StarRocks.
-
-1. Go to the Flink directory and run the following command to run the `flink-create.all.sql` file on your Flink SQL client.
-
- ```Bash
- ./bin/sql-client.sh -f flink-create.all.sql
- ```
-
-   This SQL file defines the dynamic tables `source table` and `sink table` and the query statement `INSERT INTO SELECT`, and specifies the connector, the source database, and the destination database.
-
- > **Note**
- >
- > - Make sure that the Flink cluster has been started. You can start the Flink cluster by running `flink/bin/start-cluster.sh`.
- > - If your Flink version is earlier than 1.13, you may not be able to directly run the SQL file `flink-create.all.sql`. You need to execute SQL statements one by one in this file in the command line interface (CLI) of the SQL client. You also need to escape the `\` character.
- >
- > ```Bash
- > 'sink.properties.column_separator' = '\\x01'
- > 'sink.properties.row_delimiter' = '\\x02'
- > ```
-
- **Process data during synchronization**:
-
- If you need to process data during synchronization, such as performing GROUP BY or JOIN on the data, you can modify the `flink-create.all.sql` file. The following example calculates real-time ranking of commodity sales by executing COUNT (*) and GROUP BY.
-
- ```Bash
- $ ./bin/sql-client.sh -f flink-create.all.sql
- No default environment is specified.
- Searching for '/home/disk1/flink-1.13.6/conf/sql-client-defaults.yaml'...not found.
- [INFO] Executing SQL from file.
-
- Flink SQL> CREATE DATABASE IF NOT EXISTS `default_catalog`.`demo`;
- [INFO] Execute statement succeed.
-
- -- Create a dynamic table `source table` based on the order table in MySQL.
- Flink SQL>
- CREATE TABLE IF NOT EXISTS `default_catalog`.`demo`.`orders_src` (`order_id` BIGINT NOT NULL,
- `product_id` INT NULL,
- `order_date` TIMESTAMP NOT NULL,
- `customer_name` STRING NOT NULL,
- `product_name` STRING NOT NULL,
- `price` DECIMAL(10, 5) NULL,
- PRIMARY KEY(`order_id`)
- NOT ENFORCED
- ) with ('connector' = 'mysql-cdc',
- 'hostname' = 'xxx.xx.xxx.xxx',
- 'port' = '3306',
- 'username' = 'root',
- 'password' = '',
- 'database-name' = 'demo',
- 'table-name' = 'orders'
- );
- [INFO] Execute statement succeed.
-
- -- Create a dynamic table `sink table`.
- Flink SQL>
- CREATE TABLE IF NOT EXISTS `default_catalog`.`demo`.`orders_sink` (`product_id` INT NOT NULL,
- `product_name` STRING NOT NULL,
- `sales_cnt` BIGINT NOT NULL,
- PRIMARY KEY(`product_id`)
- NOT ENFORCED
- ) with ('sink.max-retries' = '10',
- 'jdbc-url' = 'jdbc:mysql://:',
- 'password' = '',
- 'sink.properties.strip_outer_array' = 'true',
- 'sink.properties.format' = 'json',
- 'load-url' = ':',
- 'username' = 'root',
- 'sink.buffer-flush.interval-ms' = '15000',
- 'connector' = 'starrocks',
- 'database-name' = 'demo',
- 'table-name' = 'orders'
- );
- [INFO] Execute statement succeed.
-
- -- Implement real-time ranking of commodity sales, where `sink table` is dynamically updated to reflect data changes in `source table`.
- Flink SQL>
- INSERT INTO `default_catalog`.`demo`.`orders_sink` select product_id,product_name, count(*) as cnt from `default_catalog`.`demo`.`orders_src` group by product_id,product_name;
- [INFO] Submitting SQL update statement to the cluster...
- [INFO] SQL update statement has been successfully submitted to the cluster:
- Job ID: 5ae005c4b3425d8bb13fe660260a35da
- ```
-
-   If you need to synchronize only a portion of the data, such as data whose payment time is later than December 21, 2021, you can use the `WHERE` clause in `INSERT INTO SELECT` to set a filter condition, such as `WHERE pay_dt > '2021-12-21'`. Data that does not meet this condition will not be synchronized to StarRocks, as shown in the sketch below.
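-
-   For example, assuming the source table contains a `pay_dt` column (it is not part of the `orders_src` schema shown above), a minimal sketch of such a filtered statement looks like this:
-
-   ```SQL
-   -- Synchronize only orders paid after December 21, 2021; other rows are not written to the sink table.
-   INSERT INTO `default_catalog`.`demo`.`orders_sink`
-   SELECT product_id, product_name, count(*) AS cnt
-   FROM `default_catalog`.`demo`.`orders_src`
-   WHERE pay_dt > '2021-12-21'
-   GROUP BY product_id, product_name;
-   ```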
-
- If the following result is returned, the Flink job has been submitted for full and incremental synchronization.
-
- ```SQL
- [INFO] Submitting SQL update statement to the cluster...
- [INFO] SQL update statement has been successfully submitted to the cluster:
- Job ID: 5ae005c4b3425d8bb13fe660260a35da
- ```
-
-2. You can use the [Flink WebUI](https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/flink-operations-playground/#flink-webui) or run the `bin/flink list -running` command on your Flink SQL client to view Flink jobs that are running in the Flink cluster and the job IDs.
-
- - Flink WebUI
- 
-
- - `bin/flink list -running`
-
- ```Bash
- $ bin/flink list -running
- Waiting for response...
- ------------------ Running/Restarting Jobs -------------------
- 13.10.2022 15:03:54 : 040a846f8b58e82eb99c8663424294d5 : insert-into_default_catalog.lily.example_tbl1_sink (RUNNING)
- --------------------------------------------------------------
- ```
-
- > **Note**
- >
- > If the job is abnormal, you can perform troubleshooting by using Flink WebUI or by viewing the log file in the `/log` directory of Flink 1.14.5.
-
-## FAQ
-
-### Use different flink-connector-starrocks configurations for different tables
-
-If some tables in the data source are frequently updated and you want to accelerate the loading speed of flink-connector-starrocks, you must set a separate flink-connector-starrocks configuration for each table in the SMT configuration file `config_prod.conf`.
-
-```Bash
-[table-rule.1]
-# Pattern to match databases for setting properties
-database = ^order.*$
-# Pattern to match tables for setting properties
-table = ^.*$
-
-############################################
-### Flink sink configurations
-### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated
-############################################
-flink.starrocks.jdbc-url=jdbc:mysql://:
-flink.starrocks.load-url= :
-flink.starrocks.username=user2
-flink.starrocks.password=xxxxxx
-flink.starrocks.sink.properties.format=csv
-flink.starrocks.sink.properties.column_separator=\x01
-flink.starrocks.sink.properties.row_delimiter=\x02
-flink.starrocks.sink.buffer-flush.interval-ms=15000
-
-[table-rule.2]
-# Pattern to match databases for setting properties
-database = ^order2.*$
-# Pattern to match tables for setting properties
-table = ^.*$
-
-############################################
-### Flink sink configurations
-### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated
-############################################
-flink.starrocks.jdbc-url=jdbc:mysql://:
-flink.starrocks.load-url= :
-flink.starrocks.username=user2
-flink.starrocks.password=xxxxxx
-flink.starrocks.sink.properties.format=csv
-flink.starrocks.sink.properties.column_separator=\x01
-flink.starrocks.sink.properties.row_delimiter=\x02
-flink.starrocks.sink.buffer-flush.interval-ms=10000
-```
-
-### Synchronize multiple tables after MySQL sharding to one table in StarRocks
-
-After sharding is performed, data in one MySQL table may be split into multiple tables or even distributed to multiple databases. All the tables have the same schema. In this case, you can set `[table-rule]` to synchronize these tables to one StarRocks table. For example, MySQL has two databases `edu_db_1` and `edu_db_2`, each of which has two tables `course_1` and `course_2`, and the schema of all tables is the same. You can use the following `[table-rule]` configuration to synchronize all the tables to one StarRocks table.
-
-> **Note**
->
-> The name of the StarRocks table defaults to `course__auto_shard`. If you need to use a different name, you can modify it in the SQL files `starrocks-create.all.sql` and `flink-create.all.sql`.
-
-```Bash
-[table-rule.1]
-# Pattern to match databases for setting properties
-database = ^edu_db_[0-9]*$
-# Pattern to match tables for setting properties
-table = ^course_[0-9]*$
-
-############################################
-### Flink sink configurations
-### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated
-############################################
-flink.starrocks.jdbc-url = jdbc:mysql://xxx.xxx.x.x:xxxx
-flink.starrocks.load-url = xxx.xxx.x.x:xxxx
-flink.starrocks.username = user2
-flink.starrocks.password = xxxxxx
-flink.starrocks.sink.properties.format=csv
-flink.starrocks.sink.properties.column_separator =\x01
-flink.starrocks.sink.properties.row_delimiter =\x02
-flink.starrocks.sink.buffer-flush.interval-ms = 5000
-```
-
-### Import data in JSON format
-
-Data in the preceding example is imported in CSV format. If you cannot choose suitable delimiters, you need to replace the following parameters of `flink.starrocks.*` in `[table-rule]`.
-
-```Plain
-flink.starrocks.sink.properties.format=csv
-flink.starrocks.sink.properties.column_separator =\x01
-flink.starrocks.sink.properties.row_delimiter =\x02
-```
-
-Data is imported in JSON format after the following parameters are passed in.
-
-```Plain
-flink.starrocks.sink.properties.format=json
-flink.starrocks.sink.properties.strip_outer_array=true
-```
-
-> **Note**
->
-> This method slightly slows down the loading speed.
-
-### Execute multiple INSERT INTO statements as one Flink job
-
-You can use the [STATEMENT SET](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sqlclient/#execute-a-set-of-sql-statements) syntax in the `flink-create.all.sql` file to execute multiple INSERT INTO statements as one Flink job, which prevents multiple statements from taking up too many Flink job resources and improves the efficiency of executing multiple queries.
-
-> **Note**
->
-> Flink supports the STATEMENT SET syntax from 1.13 onwards.
-
-1. Open the `result/flink-create.all.sql` file.
-
-2. Modify the SQL statements in the file. Move all the INSERT INTO statements to the end of the file. Place `EXECUTE STATEMENT SET BEGIN` before the first INSERT INTO statement and place `END;` after the last INSERT INTO statement.
-
-> **Note**
->
-> The positions of CREATE DATABASE and CREATE TABLE remain unchanged.
-
-```SQL
-CREATE DATABASE IF NOT EXISTS db;
-CREATE TABLE IF NOT EXISTS db.a1;
-CREATE TABLE IF NOT EXISTS db.b1;
-CREATE TABLE IF NOT EXISTS db.a2;
-CREATE TABLE IF NOT EXISTS db.b2;
-EXECUTE STATEMENT SET
-BEGIN
--- one or more INSERT INTO statements
-INSERT INTO db.a1 SELECT * FROM db.b1;
-INSERT INTO db.a2 SELECT * FROM db.b2;
-END;
-```
diff --git a/docs/en/loading/InsertInto.md b/docs/en/loading/InsertInto.md
deleted file mode 100644
index 469ae0b..0000000
--- a/docs/en/loading/InsertInto.md
+++ /dev/null
@@ -1,711 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Load data using INSERT
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-This topic describes how to load data into StarRocks by using a SQL statement - INSERT.
-
-Similar to MySQL and many other database management systems, StarRocks supports loading data into an internal table with INSERT. You can insert one or more rows directly with the VALUES clause to test a function or build a demo. You can also insert into an internal table the data defined by the results of a query on an [external table](../data_source/External_table.md). From StarRocks v3.1 onwards, you can directly load data from files on cloud storage using the INSERT command and the table function [FILES()](../sql-reference/sql-functions/table-functions/files.md).
-
-StarRocks v2.4 further supports overwriting data into a table by using INSERT OVERWRITE. The INSERT OVERWRITE statement integrates the following operations to implement the overwriting function:
-
-1. Creates temporary partitions according to the partitions that store the original data.
-2. Inserts data into the temporary partitions.
-3. Swaps the original partitions with the temporary partitions.
-
-> **NOTE**
->
-> If you need to verify the data before overwriting it, instead of using INSERT OVERWRITE, you can follow the above procedures to overwrite your data and verify it before swapping the partitions.
-
-From v3.4.0 onwards, StarRocks supports a new semantic - Dynamic Overwrite for INSERT OVERWRITE with partitioned tables. For more information, see [Dynamic Overwrite](#dynamic-overwrite).
-
-## Precautions
-
-- You can cancel a synchronous INSERT transaction only by pressing the **Ctrl** and **C** keys from your MySQL client.
-- You can submit an asynchronous INSERT task using [SUBMIT TASK](../sql-reference/sql-statements/loading_unloading/ETL/SUBMIT_TASK.md).
-- In the current version of StarRocks, the INSERT transaction fails by default if the data of any row does not comply with the schema of the table. For example, the INSERT transaction fails if the length of a field in any row exceeds the length limit of the mapping field in the table. You can set the session variable `enable_insert_strict` to `false` to allow the transaction to continue by filtering out the rows that do not match the table (see the example after this list).
-- If you execute the INSERT statement frequently to load small batches of data into StarRocks, excessive data versions are generated, which severely affects query performance. We recommend that, in production, you do not load data with the INSERT command too often or use it as a routine for daily data loading. If your application or analytic scenario requires separate solutions for loading streaming data or small data batches, we recommend that you use Apache Kafka® as your data source and load the data via Routine Load.
-- If you execute the INSERT OVERWRITE statement, StarRocks creates temporary partitions for the partitions which store the original data, inserts new data into the temporary partitions, and [swaps the original partitions with the temporary partitions](../sql-reference/sql-statements/table_bucket_part_index/ALTER_TABLE.md#use-a-temporary-partition-to-replace-the-current-partition). All these operations are executed in the FE Leader node. Hence, if the FE Leader node crashes while executing INSERT OVERWRITE command, the whole load transaction will fail, and the temporary partitions will be truncated.
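-
-For example, the following statement disables strict mode for the current session so that INSERT filters out mismatching rows instead of failing (a minimal sketch using the session variable named above):
-
-```SQL
--- Allow INSERT transactions to continue by filtering out rows that do not match the table schema.
-SET enable_insert_strict = false;
-```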
-
-## Preparation
-
-### Check privileges
-
-
-
-### Create objects
-
-Create a database named `load_test`, and create a table `insert_wiki_edit` as the destination table and a table `source_wiki_edit` as the source table.
-
-> **NOTE**
->
-> Examples demonstrated in this topic are based on the table `insert_wiki_edit` and the table `source_wiki_edit`. If you prefer working with your own tables and data, you can skip the preparation and move on to the next step.
-
-```SQL
-CREATE DATABASE IF NOT EXISTS load_test;
-USE load_test;
-CREATE TABLE insert_wiki_edit
-(
- event_time DATETIME,
- channel VARCHAR(32) DEFAULT '',
- user VARCHAR(128) DEFAULT '',
- is_anonymous TINYINT DEFAULT '0',
- is_minor TINYINT DEFAULT '0',
- is_new TINYINT DEFAULT '0',
- is_robot TINYINT DEFAULT '0',
- is_unpatrolled TINYINT DEFAULT '0',
- delta INT DEFAULT '0',
- added INT DEFAULT '0',
- deleted INT DEFAULT '0'
-)
-DUPLICATE KEY(
- event_time,
- channel,
- user,
- is_anonymous,
- is_minor,
- is_new,
- is_robot,
- is_unpatrolled
-)
-PARTITION BY RANGE(event_time)(
- PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'),
- PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'),
- PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'),
- PARTITION p24 VALUES LESS THAN ('2015-09-13 00:00:00')
-)
-DISTRIBUTED BY HASH(user);
-
-CREATE TABLE source_wiki_edit
-(
- event_time DATETIME,
- channel VARCHAR(32) DEFAULT '',
- user VARCHAR(128) DEFAULT '',
- is_anonymous TINYINT DEFAULT '0',
- is_minor TINYINT DEFAULT '0',
- is_new TINYINT DEFAULT '0',
- is_robot TINYINT DEFAULT '0',
- is_unpatrolled TINYINT DEFAULT '0',
- delta INT DEFAULT '0',
- added INT DEFAULT '0',
- deleted INT DEFAULT '0'
-)
-DUPLICATE KEY(
- event_time,
- channel,user,
- is_anonymous,
- is_minor,
- is_new,
- is_robot,
- is_unpatrolled
-)
-PARTITION BY RANGE(event_time)(
- PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'),
- PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'),
- PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'),
- PARTITION p24 VALUES LESS THAN ('2015-09-13 00:00:00')
-)
-DISTRIBUTED BY HASH(user);
-```
-
-> **NOTICE**
->
-> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-## Insert data via INSERT INTO VALUES
-
-You can append one or more rows to a specific table by using INSERT INTO VALUES command. Multiple rows are separated by comma (,). For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
-
-> **CAUTION**
->
-> Inserting data via INSERT INTO VALUES applies only to the scenario where you need to verify a demo with a small dataset. It is not recommended for large-scale testing or production environments. To load massive data into StarRocks, see [Loading options](Loading_intro.md) for other options that suit your scenarios.
-
-The following example inserts two rows into the data source table `source_wiki_edit` with the label `insert_load_wikipedia`. A label is the unique identifier of a data load transaction within the database.
-
-```SQL
-INSERT INTO source_wiki_edit
-WITH LABEL insert_load_wikipedia
-VALUES
- ("2015-09-12 00:00:00","#en.wikipedia","AustinFF",0,0,0,0,0,21,5,0),
- ("2015-09-12 00:00:00","#ca.wikipedia","helloSR",0,1,0,1,0,3,23,0);
-```
-
-## Insert data via INSERT INTO SELECT
-
-You can load the result of a query on a data source table into the target table via INSERT INTO SELECT command. INSERT INTO SELECT command performs ETL operations on the data from the data source table, and loads the data into an internal table in StarRocks. The data source can be one or more internal or external tables, or even data files on cloud storage. The target table MUST be an internal table in StarRocks. For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
-
-### Insert data from an internal or external table into an internal table
-
-> **NOTE**
->
-> Inserting data from an external table is identical to inserting data from an internal table. For simplicity, we only demonstrate how to insert data from an internal table in the following examples.
-
-- The following example inserts the data from the source table to the target table `insert_wiki_edit`.
-
-```SQL
-INSERT INTO insert_wiki_edit
-WITH LABEL insert_load_wikipedia_1
-SELECT * FROM source_wiki_edit;
-```
-
-- The following example inserts the data from the source table to the `p06` and `p12` partitions of the target table `insert_wiki_edit`. If no partition is specified, the data will be inserted into all partitions. Otherwise, the data will be inserted only into the specified partition(s).
-
-```SQL
-INSERT INTO insert_wiki_edit PARTITION(p06, p12)
-WITH LABEL insert_load_wikipedia_2
-SELECT * FROM source_wiki_edit;
-```
-
-Query the target table to make sure there is data in it.
-
-```Plain text
-MySQL > select * from insert_wiki_edit;
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 |
-| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-2 rows in set (0.00 sec)
-```
-
-If you truncate the `p06` and `p12` partitions, the data will not be returned in a query.
-
-```Plain
-MySQL > TRUNCATE TABLE insert_wiki_edit PARTITION(p06, p12);
-Query OK, 0 rows affected (0.01 sec)
-
-MySQL > select * from insert_wiki_edit;
-Empty set (0.00 sec)
-```
-
-- The following example inserts the `event_time` and `channel` columns from the source table to the target table `insert_wiki_edit`. Default values are used in the columns that are not specified here.
-
-```SQL
-INSERT INTO insert_wiki_edit
-WITH LABEL insert_load_wikipedia_3
-(
- event_time,
- channel
-)
-SELECT event_time, channel FROM source_wiki_edit;
-```
-
-:::note
-From v3.3.1, specifying a column list in the INSERT INTO statement on a Primary Key table will perform Partial Updates (instead of Full Upsert in earlier versions). If the column list is not specified, the system will perform Full Upsert. See the sketch after this note.
-:::
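-
-The following is a minimal sketch of this behavior, assuming StarRocks v3.3.1 or later and a hypothetical Primary Key table `pk_demo` that is not part of the preceding examples:
-
-```SQL
--- Hypothetical Primary Key table used only to illustrate partial updates.
-CREATE TABLE pk_demo (
-    id INT NOT NULL,
-    name VARCHAR(32),
-    score INT
-) ENGINE=OLAP
-PRIMARY KEY(id)
-DISTRIBUTED BY HASH(id);
-
-INSERT INTO pk_demo VALUES (1, 'starrocks', 100);
-
--- Specifying a column list performs a partial update:
--- only `name` is updated, and `score` keeps its original value.
-INSERT INTO pk_demo (id, name) VALUES (1, 'starrocks-update');
-```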
-
-### Insert data directly from files in an external source using FILES()
-
-From v3.1 onwards, StarRocks supports directly loading data from files on cloud storage using the INSERT command and the [FILES()](../sql-reference/sql-functions/table-functions/files.md) function, so you do not need to create an external catalog or file external table first. In addition, FILES() can automatically infer the table schema of the files, greatly simplifying the process of data loading.
-
-The following example inserts data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
-
-```Plain
-INSERT INTO insert_wiki_edit
- SELECT * FROM FILES(
- "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
- "format" = "parquet",
- "aws.s3.access_key" = "XXXXXXXXXX",
- "aws.s3.secret_key" = "YYYYYYYYYY",
- "aws.s3.region" = "us-west-2"
-);
-```
-
-## Overwrite data via INSERT OVERWRITE VALUES
-
-You can overwrite a specific table with one or more rows by using INSERT OVERWRITE VALUES command. Multiple rows are separated by comma (,). For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
-
-> **CAUTION**
->
-> Overwriting data via INSERT OVERWRITE VALUES applies only to the scenario where you need to verify a demo with a small dataset. It is not recommended for large-scale testing or production environments. To load massive data into StarRocks, see [Loading options](Loading_intro.md) for other options that suit your scenarios.
-
-Query the source table and the target table to make sure there is data in them.
-
-```Plain
-MySQL > SELECT * FROM source_wiki_edit;
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 |
-| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-2 rows in set (0.02 sec)
-
-MySQL > SELECT * FROM insert_wiki_edit;
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 |
-| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-2 rows in set (0.01 sec)
-```
-
-The following example overwrites the source table `source_wiki_edit` with two new rows.
-
-```SQL
-INSERT OVERWRITE source_wiki_edit
-WITH LABEL insert_load_wikipedia_ow
-VALUES
- ("2015-09-12 00:00:00","#cn.wikipedia","GELongstreet",0,0,0,0,0,36,36,0),
- ("2015-09-12 00:00:00","#fr.wikipedia","PereBot",0,1,0,1,0,17,17,0);
-```
-
-## Overwrite data via INSERT OVERWRITE SELECT
-
-You can overwrite a table with the result of a query on a data source table via the INSERT OVERWRITE SELECT command. The INSERT OVERWRITE SELECT statement performs ETL operations on the data from one or more internal or external tables, and overwrites an internal table with the data. For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
-
-> **NOTE**
->
-> Loading data from an external table is identical to loading data from an internal table. For simplicity, we only demonstrate how to overwrite the target table with the data from an internal table in the following examples.
-
-Query the source table and the target table to make sure that they hold different rows of data.
-
-```Plain
-MySQL > SELECT * FROM source_wiki_edit;
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted |
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| 2015-09-12 00:00:00 | #cn.wikipedia | GELongstreet | 0 | 0 | 0 | 0 | 0 | 36 | 36 | 0 |
-| 2015-09-12 00:00:00 | #fr.wikipedia | PereBot | 0 | 1 | 0 | 1 | 0 | 17 | 17 | 0 |
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-2 rows in set (0.02 sec)
-
-MySQL > SELECT * FROM insert_wiki_edit;
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 |
-| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 |
-+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-2 rows in set (0.01 sec)
-```
-
-- The following example overwrites the table `insert_wiki_edit` with the data from the source table.
-
-```SQL
-INSERT OVERWRITE insert_wiki_edit
-WITH LABEL insert_load_wikipedia_ow_1
-SELECT * FROM source_wiki_edit;
-```
-
-- The following example overwrites the `p06` and `p12` partitions of the table `insert_wiki_edit` with the data from the source table.
-
-```SQL
-INSERT OVERWRITE insert_wiki_edit PARTITION(p06, p12)
-WITH LABEL insert_load_wikipedia_ow_2
-SELECT * FROM source_wiki_edit;
-```
-
-Query the target table to make sure there is data in it.
-
-```plain text
-MySQL > select * from insert_wiki_edit;
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted |
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-| 2015-09-12 00:00:00 | #fr.wikipedia | PereBot | 0 | 1 | 0 | 1 | 0 | 17 | 17 | 0 |
-| 2015-09-12 00:00:00 | #cn.wikipedia | GELongstreet | 0 | 0 | 0 | 0 | 0 | 36 | 36 | 0 |
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+
-2 rows in set (0.01 sec)
-```
-
-If you truncate the `p06` and `p12` partitions, the data will not be returned in a query.
-
-```Plain
-MySQL > TRUNCATE TABLE insert_wiki_edit PARTITION(p06, p12);
-Query OK, 0 rows affected (0.01 sec)
-
-MySQL > select * from insert_wiki_edit;
-Empty set (0.00 sec)
-```
-
-:::note
-For tables that use the `PARTITION BY column` strategy, INSERT OVERWRITE supports creating new partitions in the destination table by specifying the value of the partition key. Existing partitions are overwritten as usual.
-
-The following example creates the partitioned table `activity`, and creates a new partition in the table while inserting data into it:
-
-```SQL
-CREATE TABLE activity (
-id INT NOT NULL,
-dt VARCHAR(10) NOT NULL
-) ENGINE=OLAP
-DUPLICATE KEY(`id`)
-PARTITION BY (`id`, `dt`)
-DISTRIBUTED BY HASH(`id`);
-
-INSERT OVERWRITE activity
-PARTITION(id='4', dt='2022-01-01')
-WITH LABEL insert_activity_auto_partition
-VALUES ('4', '2022-01-01');
-```
-
-:::
-
-- The following example overwrites the target table `insert_wiki_edit` with the `event_time` and `channel` columns from the source table. The default value is assigned to the columns into which no data is overwritten.
-
-```SQL
-INSERT OVERWRITE insert_wiki_edit
-WITH LABEL insert_load_wikipedia_ow_3
-(
- event_time,
- channel
-)
-SELECT event_time, channel FROM source_wiki_edit;
-```
-
-### Dynamic Overwrite
-
-From v3.4.0 onwards, StarRocks supports a new semantic - Dynamic Overwrite for INSERT OVERWRITE with partitioned tables.
-
-Currently, the default behavior of INSERT OVERWRITE is as follows:
-
-- When overwriting a partitioned table as a whole (that is, without specifying the PARTITION clause), new data records will replace the data in their corresponding partitions. If there are partitions that are not involved, they will be truncated while the others are overwritten.
-- When overwriting an empty partitioned table (that is, with no partitions in it) and specifying the PARTITION clause, the system returns an error `ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'xxx' in table 'yyy'`.
-- When overwriting a partitioned table and specifying a non-existent partition in the PARTITION clause, the system returns an error `ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'xxx' in table 'yyy'`.
-- When overwriting a partitioned table with data records that do not match any of the specified partitions in the PARTITION clause, the system either returns an error `ERROR 1064 (HY000): Insert has filtered data in strict mode` (if the strict mode is enabled) or filters the unqualified data records (if the strict mode is disabled).
-
-The behavior of the new Dynamic Overwrite semantic is quite different:
-
-When overwriting a partitioned table as a whole, new data records replace the data in their corresponding partitions. Partitions that are not involved are left untouched, instead of being truncated or deleted. If new data records correspond to a non-existent partition, the system creates that partition.
-
-The Dynamic Overwrite semantic is disabled by default. To enable it, you need to set the system variable `dynamic_overwrite` to `true`.
-
-Enable Dynamic Overwrite in the current session:
-
-```SQL
-SET dynamic_overwrite = true;
-```
-
-You can also set it in a hint of the INSERT OVERWRITE statement so that it takes effect for that statement only.
-
-Example:
-
-```SQL
-INSERT /*+set_var(dynamic_overwrite = true)*/ OVERWRITE insert_wiki_edit
-SELECT * FROM source_wiki_edit;
-```
-
-## Insert data into a table with generated columns
-
-A generated column is a special column whose value is derived from a pre-defined expression or evaluation based on other columns. Generated columns are especially useful when your query requests involve evaluations of expensive expressions, for example, querying a certain field from a JSON value, or calculating ARRAY data. StarRocks evaluates the expression and stores the results in the generated columns while data is being loaded into the table, thereby avoiding the expression evaluation during queries and improving the query performance.
-
-You can load data into a table with generated columns using INSERT.
-
-The following example creates a table `insert_generated_columns` and inserts a row into it. The table contains two generated columns: `avg_array` and `get_string`. `avg_array` calculates the average value of ARRAY data in `data_array`, and `get_string` extracts the strings from the JSON path `a` in `data_json`.
-
-```SQL
-CREATE TABLE insert_generated_columns (
- id INT(11) NOT NULL COMMENT "ID",
-    data_array ARRAY<INT> NOT NULL COMMENT "ARRAY",
- data_json JSON NOT NULL COMMENT "JSON",
- avg_array DOUBLE NULL
- AS array_avg(data_array) COMMENT "Get the average of ARRAY",
- get_string VARCHAR(65533) NULL
- AS get_json_string(json_string(data_json), '$.a') COMMENT "Extract JSON string"
-) ENGINE=OLAP
-PRIMARY KEY(id)
-DISTRIBUTED BY HASH(id);
-
-INSERT INTO insert_generated_columns
-VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
-```
-
-> **NOTE**
->
-> Directly loading data into generated columns is not supported.
-
-You can query the table to check the data within it.
-
-```Plain
-mysql> SELECT * FROM insert_generated_columns;
-+------+------------+------------------+-----------+------------+
-| id | data_array | data_json | avg_array | get_string |
-+------+------------+------------------+-----------+------------+
-| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 |
-+------+------------+------------------+-----------+------------+
-1 row in set (0.02 sec)
-```
-
-## INSERT data with PROPERTIES
-
-From v3.4.0 onwards, INSERT statements support configuring PROPERTIES, which can serve a wide variety of purposes. Properties override their corresponding variables.
-
-### Enable strict mode
-
-From v3.4.0 onwards, you can enable strict mode and set `max_filter_ratio` for INSERT from FILES(). Strict mode for INSERT from FILES() has the same behavior as that of other loading methods.
-
-If you want to load a dataset that contains some unqualified rows, you can either filter out these unqualified rows, or load them with NULL values assigned to the unqualified columns. You can achieve this by using the properties `strict_mode` and `max_filter_ratio`.
-
-- To filter the unqualified rows: set `strict_mode` to `true`, and `max_filter_ratio` to a desired value.
-- To load all unqualified rows with NULL values: set `strict_mode` to `false`.
-
-The following example inserts data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`, enables strict mode to filter the unqualified data records, and tolerates at most 10% of error data:
-
-```SQL
-INSERT INTO insert_wiki_edit
-PROPERTIES(
- "strict_mode" = "true",
- "max_filter_ratio" = "0.1"
-)
-SELECT * FROM FILES(
- "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
- "format" = "parquet",
- "aws.s3.access_key" = "XXXXXXXXXX",
- "aws.s3.secret_key" = "YYYYYYYYYY",
- "aws.s3.region" = "us-west-2"
-);
-```
-
-:::note
-
-`strict_mode` and `max_filter_ratio` are supported only for INSERT from FILES(). INSERT from tables does not support these properties.
-
-:::
-
-### Set timeout duration
-
-From v3.4.0 onwards, you can set the timeout duration for INSERT statements with properties.
-
-The following example inserts the data from the source table `source_wiki_edit` to the target table `insert_wiki_edit` with the timeout duration set to `2` seconds.
-
-```SQL
-INSERT INTO insert_wiki_edit
-PROPERTIES(
- "timeout" = "2"
-)
-SELECT * FROM source_wiki_edit;
-```
-
-:::note
-
-From v3.4.0 onwards, you can also set the INSERT timeout duration using the system variable `insert_timeout`, which applies to operations involving INSERT (for example, UPDATE, DELETE, CTAS, materialized view refresh, statistics collection, and PIPE). In versions earlier than v3.4.0, the corresponding variable is `query_timeout`.
-
-:::
-
-### Match column by name
-
-By default, INSERT matches the columns of the source and target tables by position, that is, by the order in which the columns are listed in the statement.
-
-The following example explicitly matches each column in the source and target tables by their positions:
-
-```SQL
-INSERT INTO insert_wiki_edit (
- event_time,
- channel,
- user
-)
-SELECT event_time, channel, user FROM source_wiki_edit;
-```
-
-The column mapping changes if you change the order of `channel` and `user` in either the column list or the SELECT statement.
-
-```SQL
-INSERT INTO insert_wiki_edit (
- event_time,
- channel,
- user
-)
-SELECT event_time, user, channel FROM source_wiki_edit;
-```
-
-Here, the ingested data is probably not what you want, because the `channel` column in the target table `insert_wiki_edit` will be filled with data from the `user` column in the source table `source_wiki_edit`.
-
-If you add the `BY NAME` clause to the INSERT statement, the system detects the column names in the source and target tables and matches the columns that have the same name.
-
-:::note
-
-- You cannot specify the column list if `BY NAME` is specified.
-- If `BY NAME` is not specified, the system matches the columns by the position of the columns in the column list and the SELECT statement.
-
-:::
-
-The following example matches each column in the source and target tables by their names:
-
-```SQL
-INSERT INTO insert_wiki_edit BY NAME
-SELECT event_time, user, channel FROM source_wiki_edit;
-```
-
-In this case, changing the order of `channel` and `user` will not change the column mapping.
-
-## Load data asynchronously using INSERT
-
-Loading data with INSERT submits a synchronous transaction, which may fail because of session interruption or timeout. You can submit an asynchronous INSERT transaction using [SUBMIT TASK](../sql-reference/sql-statements/loading_unloading/ETL/SUBMIT_TASK.md). This feature is supported since StarRocks v2.5.
-
-- The following example asynchronously inserts the data from the source table to the target table `insert_wiki_edit`.
-
-```SQL
-SUBMIT TASK AS INSERT INTO insert_wiki_edit
-SELECT * FROM source_wiki_edit;
-```
-
-- The following example asynchronously overwrites the table `insert_wiki_edit` with the data from the source table.
-
-```SQL
-SUBMIT TASK AS INSERT OVERWRITE insert_wiki_edit
-SELECT * FROM source_wiki_edit;
-```
-
-- The following example asynchronously overwrites the table `insert_wiki_edit` with the data from the source table, and extends the timeout to `100000` seconds by using a hint.
-
-```SQL
-SUBMIT /*+set_var(insert_timeout=100000)*/ TASK AS
-INSERT OVERWRITE insert_wiki_edit
-SELECT * FROM source_wiki_edit;
-```
-
-- The following example asynchronously overwrites the table `insert_wiki_edit` with the data from the source table, and specifies the task name as `async`.
-
-```SQL
-SUBMIT TASK async
-AS INSERT OVERWRITE insert_wiki_edit
-SELECT * FROM source_wiki_edit;
-```
-
-You can check the status of an asynchronous INSERT task by querying the metadata view `task_runs` in Information Schema.
-
-The following example checks the status of the INSERT task `async`.
-
-```SQL
-SELECT * FROM information_schema.task_runs WHERE task_name = 'async';
-```
-
-## Check the INSERT job status
-
-### Check via the result
-
-A synchronous INSERT transaction returns a different status depending on the result of the transaction.
-
-- **Transaction succeeds**
-
-StarRocks returns the following if the transaction succeeds:
-
-```Plain
-Query OK, 2 rows affected (0.05 sec)
-{'label':'insert_load_wikipedia', 'status':'VISIBLE', 'txnId':'1006'}
-```
-
-- **Transaction fails**
-
-If all rows of data fail to be loaded into the target table, the INSERT transaction fails. StarRocks returns the following if the transaction fails:
-
-```Plain
-ERROR 1064 (HY000): Insert has filtered data in strict mode, tracking_url=http://x.x.x.x:yyyy/api/_load_error_log?file=error_log_9f0a4fd0b64e11ec_906bbede076e9d08
-```
-
-You can locate the problem by checking the error log at the `tracking_url` provided in the return.
-
-### Check via Information Schema
-
-You can use the [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) statement to query the results of one or more load jobs from the `loads` table in the `information_schema` database. This feature is supported from v3.1 onwards.
-
-Example 1: Query the results of load jobs executed on the `load_test` database, sort the results by creation time (`CREATE_TIME`) in descending order, and only return the top result.
-
-```SQL
-SELECT * FROM information_schema.loads
-WHERE database_name = 'load_test'
-ORDER BY create_time DESC
-LIMIT 1\G
-```
-
-Example 2: Query the result of the load job (whose label is `insert_load_wikipedia`) executed on the `load_test` database:
-
-```SQL
-SELECT * FROM information_schema.loads
-WHERE database_name = 'load_test' and label = 'insert_load_wikipedia'\G
-```
-
-The return is as follows:
-
-```Plain
-*************************** 1. row ***************************
- JOB_ID: 21319
- LABEL: insert_load_wikipedia
- DATABASE_NAME: load_test
- STATE: FINISHED
- PROGRESS: ETL:100%; LOAD:100%
- TYPE: INSERT
- PRIORITY: NORMAL
- SCAN_ROWS: 0
- FILTERED_ROWS: 0
- UNSELECTED_ROWS: 0
- SINK_ROWS: 2
- ETL_INFO:
- TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0
- CREATE_TIME: 2023-08-09 10:42:23
- ETL_START_TIME: 2023-08-09 10:42:23
- ETL_FINISH_TIME: 2023-08-09 10:42:23
- LOAD_START_TIME: 2023-08-09 10:42:23
- LOAD_FINISH_TIME: 2023-08-09 10:42:24
- JOB_DETAILS: {"All backends":{"5ebf11b5-365e-11ee-9e4a-7a563fb695da":[10006]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":175,"InternalTableLoadRows":2,"ScanBytes":0,"ScanRows":0,"TaskNumber":1,"Unfinished backends":{"5ebf11b5-365e-11ee-9e4a-7a563fb695da":[]}}
- ERROR_MSG: NULL
- TRACKING_URL: NULL
- TRACKING_SQL: NULL
-REJECTED_RECORD_PATH: NULL
-1 row in set (0.01 sec)
-```
-
-For information about the fields in the return results, see [Information Schema > loads](../sql-reference/information_schema/loads.md).
-
-### Check via curl command
-
-You can check the INSERT transaction status by using the curl command.
-
-Launch a terminal, and execute the following command:
-
-```Bash
-curl --location-trusted -u <username>:<password> \
- http://<fe_host>:<fe_http_port>/api/<db_name>/_load_info?label=<label_name>
-```
-
-The following example checks the status of the transaction with label `insert_load_wikipedia`.
-
-```Bash
-curl --location-trusted -u <username>:<password> \
- http://x.x.x.x:8030/api/load_test/_load_info?label=insert_load_wikipedia
-```
-
-> **NOTE**
->
-> If you use an account for which no password is set, you need to input only `<username>:`.
-
-The return is as follows:
-
-```Plain
-{
- "jobInfo":{
- "dbName":"load_test",
- "tblNames":[
- "source_wiki_edit"
- ],
- "label":"insert_load_wikipedia",
- "state":"FINISHED",
- "failMsg":"",
- "trackingUrl":""
- },
- "status":"OK",
- "msg":"Success"
-}
-```
-
-## Configuration
-
-You can set the following configuration items for INSERT transactions:
-
-- **FE configuration**
-
-| FE configuration | Description |
-| ---------------------------------- | ------------------------------------------------------------ |
-| insert_load_default_timeout_second | Default timeout for INSERT transactions. Unit: second. If an INSERT transaction is not completed within the time set by this parameter, it is canceled by the system and its status becomes CANCELLED. In the current version of StarRocks, you can only specify a uniform timeout for all INSERT transactions using this parameter; you cannot set a different timeout for a specific INSERT transaction. The default is 3600 seconds (1 hour). If an INSERT transaction cannot be completed within the specified time, you can extend the timeout by adjusting this parameter. |
-
-- **Session variables**
-
-| Session variable | Description |
-| -------------------- | ------------------------------------------------------------ |
-| enable_insert_strict | Boolean switch that controls whether the INSERT transaction tolerates invalid data rows. When it is set to `true`, the transaction fails if any data row is invalid. When it is set to `false`, the transaction succeeds as long as at least one data row has been loaded correctly, and the label is returned. The default is `true`. You can set this variable with the `SET enable_insert_strict = {true or false};` command. |
-| insert_timeout | Timeout for the INSERT statement. Unit: second. You can set this variable with the `SET insert_timeout = xxx;` command. |
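-
-For instance, a quick sketch of adjusting these settings at runtime (the `ADMIN SET FRONTEND CONFIG` statement assumes the FE item can be changed dynamically and that your account holds the required administrator privilege):
-
-```SQL
--- Session-level switches for the current connection.
-SET enable_insert_strict = false;
-SET insert_timeout = 7200;
-
--- FE-level default timeout for INSERT transactions, in seconds.
-ADMIN SET FRONTEND CONFIG ("insert_load_default_timeout_second" = "7200");
-```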
diff --git a/docs/en/loading/Json_loading.md b/docs/en/loading/Json_loading.md
deleted file mode 100644
index d2dcffb..0000000
--- a/docs/en/loading/Json_loading.md
+++ /dev/null
@@ -1,363 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Introduction
-
-You can import semi-structured data (for example, JSON) by using stream load or routine load.
-
-## Use Scenarios
-
-* Stream Load: For JSON data stored in text files, use stream load to import.
-* Routine Load: For JSON data in Kafka, use routine load to import.
-
-### Stream Load Import
-
-Sample data:
-
-~~~json
-{ "id": 123, "city" : "beijing"},
-{ "id": 456, "city" : "shanghai"},
- ...
-~~~
-
-Example:
-
-~~~shell
-curl -v --location-trusted -u : \
- -H "format: json" -H "jsonpaths: [\"$.id\", \"$.city\"]" \
- -T example.json \
- http://FE_HOST:HTTP_PORT/api/DATABASE/TABLE/_stream_load
-~~~
-
-The `format: json` parameter specifies the format of the imported data. `jsonpaths` specifies the JSON paths from which the corresponding fields are extracted.
-
-Related parameters:
-
-* jsonpaths: Select the JSON path for each column
-* json\_root: Select the column where the JSON starts to be parsed
-* strip\_outer\_array: Crop the outermost array field
-* strict\_mode: Strictly filter for column type conversion during import
-
-When the JSON data schema and the StarRocks table schema are not exactly the same, modify `jsonpaths`.
-
-Sample data:
-
-~~~json
-{"k1": 1, "k2": 2}
-~~~
-
-Import example:
-
-~~~bash
-curl -v --location-trusted -u : \
- -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \
- -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" \
- -T example.json \
- http://127.0.0.1:8030/api/db1/tbl1/_stream_load
-~~~
-
-The ETL operation of multiplying `k1` by 100 is performed during the import, and the columns are matched to the original data by `jsonpaths`.
-
-The import results are as follows:
-
-~~~plain text
-+------+------+
-| k1 | k2 |
-+------+------+
-| 100 | 2 |
-+------+------+
-~~~
-
-If a column is missing and its definition is nullable, `NULL` is loaded. Alternatively, you can supply a default value by using `ifnull`.
-
-Sample data:
-
-~~~json
-[
- {"k1": 1, "k2": "a"},
- {"k1": 2},
- {"k1": 3, "k2": "c"},
-]
-~~~
-
-Import Example-1:
-
-~~~shell
-curl -v --location-trusted -u : \
- -H "format: json" -H "strip_outer_array: true" \
- -T example.json \
- http://127.0.0.1:8030/api/db1/tbl1/_stream_load
-~~~
-
-The import results are as follows:
-
-~~~plain text
-+------+------+
-| k1 | k2 |
-+------+------+
-| 1 | a |
-+------+------+
-| 2 | NULL |
-+------+------+
-| 3 | c |
-+------+------+
-~~~
-
-Import Example-2:
-
-~~~shell
-curl -v --location-trusted -u : \
- -H "format: json" -H "strip_outer_array: true" \
- -H "jsonpaths: [\"$.k1\", \"$.k2\"]" \
- -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" \
- -T example.json \
- http://127.0.0.1:8030/api/db1/tbl1/_stream_load
-~~~
-
-The import results are as follows:
-
-~~~plain text
-+------+------+
-| k1 | k2 |
-+------+------+
-| 1 | a |
-+------+------+
-| 2 | x |
-+------+------+
-| 3 | c |
-+------+------+
-~~~
-
-### Routine Load Import
-
-Similar to Stream Load, the message content from Kafka data sources is treated as a complete JSON object.
-
-1. If a message contains multiple rows of data in array format, all rows will be imported and Kafka's offset will only be incremented by 1.
-2. If a JSON in Array format represents multiple rows of data, but the JSON fails to be parsed due to a format error, the error row count will only be incremented by 1 (because the parsing fails, StarRocks cannot determine how many rows of data the message contains and can only record the error data as one row).
-
-### Use Canal to synchronize incremental binlog data from MySQL to StarRocks
-
-[Canal](https://github.com/alibaba/canal) is an open-source MySQL binlog synchronization tool from Alibaba, through which MySQL data can be synchronized to Kafka. The data is generated in Kafka in JSON format. The following demonstrates how to use Routine Load to consume the data in Kafka and keep StarRocks incrementally synchronized with MySQL.
-
-* In MySQL we have a data table with the following table creation statement.
-
-~~~sql
-CREATE TABLE `query_record` (
- `query_id` varchar(64) NOT NULL,
- `conn_id` int(11) DEFAULT NULL,
- `fe_host` varchar(32) DEFAULT NULL,
- `user` varchar(32) DEFAULT NULL,
- `start_time` datetime NOT NULL,
- `end_time` datetime DEFAULT NULL,
- `time_used` double DEFAULT NULL,
- `state` varchar(16) NOT NULL,
- `error_message` text,
- `sql` text NOT NULL,
- `database` varchar(128) NOT NULL,
- `profile` longtext,
- `plan` longtext,
- PRIMARY KEY (`query_id`),
- KEY `idx_start_time` (`start_time`) USING BTREE
-) ENGINE=InnoDB DEFAULT CHARSET=utf8
-~~~
-
-* Prerequisite: Make sure MySQL has binlog enabled and the format is ROW.
-
-~~~bash
-[mysqld]
-log-bin=mysql-bin # Enable binlog
-binlog-format=ROW # Select ROW mode
-server_id=1 # A server_id must be defined for MySQL replication and must not duplicate Canal's slaveId
-~~~
-
-* Create a Canal account and grant it the privileges required to act as a MySQL replica:
-
-~~~sql
-CREATE USER canal IDENTIFIED BY 'canal';
-GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
--- GRANT ALL PRIVILEGES ON *.* TO 'canal'@'%';
-FLUSH PRIVILEGES;
-~~~
-
-* Then download and install Canal.
-
-~~~bash
-wget https://github.com/alibaba/canal/releases/download/canal-1.0.17/canal.deployer-1.0.17.tar.gz
-
-mkdir /tmp/canal
-tar zxvf canal.deployer-1.0.17.tar.gz -C /tmp/canal
-~~~
-
-* Modify the configuration (MySQL related).
-
-`$ vi conf/example/instance.properties`
-
-~~~bash
-## mysql serverId
-canal.instance.mysql.slaveId = 1234
-#position info, need to change to your own database information
-canal.instance.master.address = 127.0.0.1:3306
-canal.instance.master.journal.name =
-canal.instance.master.position =
-canal.instance.master.timestamp =
-#canal.instance.standby.address =
-#canal.instance.standby.journal.name =
-#canal.instance.standby.position =
-#canal.instance.standby.timestamp =
-#username/password, need to change to your own database information
-canal.instance.dbUsername = canal
-canal.instance.dbPassword = canal
-canal.instance.defaultDatabaseName =
-canal.instance.connectionCharset = UTF-8
-#table regex
-canal.instance.filter.regex = .*\\..*
-# Map the table to be synchronized to the target Kafka topic, and specify the column used as the partition hash key.
-canal.mq.dynamicTopic=databasename.query_record
-canal.mq.partitionHash= databasename.query_record:query_id
-~~~
-
-* Modify the configuration (Kafka related).
-
-`$ vi /usr/local/canal/conf/canal.properties`
-
-~~~bash
-# Available options: tcp(by default), kafka, RocketMQ
-canal.serverMode = kafka
-# ...
-# kafka/rocketmq Cluster Configuration: 192.168.1.117:9092,192.168.1.118:9092,192.168.1.119:9092
-canal.mq.servers = 127.0.0.1:6667
-canal.mq.retries = 0
-# This value can be increased in flatMessage mode, but do not exceed the maximum size of the MQ message.
-canal.mq.batchSize = 16384
-canal.mq.maxRequestSize = 1048576
-# In flatMessage mode, please change this value to a larger value, 50-200 is recommended.
-canal.mq.lingerMs = 1
-canal.mq.bufferMemory = 33554432
-# Canal's batch size with a default value of 50K. Please do not exceed 1M due to Kafka's maximum message size limit (under 900K)
-canal.mq.canalBatchSize = 50
-# Timeout of `Canal get`, in milliseconds. Empty indicates unlimited timeout.
-canal.mq.canalGetTimeout = 100
-# Whether the object is in flat json format
-canal.mq.flatMessage = false
-canal.mq.compressionType = none
-canal.mq.acks = all
-# Whether Kafka message delivery uses transactions
-canal.mq.transaction = false
-~~~
-
-* Start Canal.
-
-`bin/startup.sh`
-
-The corresponding synchronization log is written to `logs/example/example.log`, and the messages in Kafka have the following format:
-
-~~~json
-{
- "data": [{
- "query_id": "3c7ebee321e94773-b4d79cc3f08ca2ac",
- "conn_id": "34434",
- "fe_host": "172.26.34.139",
- "user": "zhaoheng",
- "start_time": "2020-10-19 20:40:10.578",
- "end_time": "2020-10-19 20:40:10",
- "time_used": "1.0",
- "state": "FINISHED",
- "error_message": "",
- "sql": "COMMIT",
- "database": "",
- "profile": "",
- "plan": ""
- }, {
- "query_id": "7ff2df7551d64f8e-804004341bfa63ad",
- "conn_id": "34432",
- "fe_host": "172.26.34.139",
- "user": "zhaoheng",
- "start_time": "2020-10-19 20:40:10.566",
- "end_time": "2020-10-19 20:40:10",
- "time_used": "0.0",
- "state": "FINISHED",
- "error_message": "",
- "sql": "COMMIT",
- "database": "",
- "profile": "",
- "plan": ""
- }, {
- "query_id": "3a4b35d1c1914748-be385f5067759134",
- "conn_id": "34440",
- "fe_host": "172.26.34.139",
- "user": "zhaoheng",
- "start_time": "2020-10-19 20:40:10.601",
- "end_time": "1970-01-01 08:00:00",
- "time_used": "-1.0",
- "state": "RUNNING",
- "error_message": "",
- "sql": " SELECT SUM(length(lo_custkey)), SUM(length(c_custkey)) FROM lineorder_str INNER JOIN customer_str ON lo_custkey=c_custkey;",
- "database": "ssb",
- "profile": "",
- "plan": ""
- }],
- "database": "center_service_lihailei",
- "es": 1603111211000,
- "id": 122,
- "isDdl": false,
- "mysqlType": {
- "query_id": "varchar(64)",
- "conn_id": "int(11)",
- "fe_host": "varchar(32)",
- "user": "varchar(32)",
- "start_time": "datetime(3)",
- "end_time": "datetime",
- "time_used": "double",
- "state": "varchar(16)",
- "error_message": "text",
- "sql": "text",
- "database": "varchar(128)",
- "profile": "longtext",
- "plan": "longtext"
- },
- "old": null,
- "pkNames": ["query_id"],
- "sql": "",
- "sqlType": {
- "query_id": 12,
- "conn_id": 4,
- "fe_host": 12,
- "user": 12,
- "start_time": 93,
- "end_time": 93,
- "time_used": 8,
- "state": 12,
- "error_message": 2005,
- "sql": 2005,
- "database": 12,
- "profile": 2005,
- "plan": 2005
- },
- "table": "query_record",
- "ts": 1603111212015,
- "type": "INSERT"
-}
-~~~
-
-Add `json_root` and `strip_outer_array = true` to import data from `data`.
-
-~~~sql
-create routine load manual.query_job on query_record
-columns (query_id,conn_id,fe_host,user,start_time,end_time,time_used,state,error_message,`sql`,`database`,profile,plan)
-PROPERTIES (
- "format"="json",
- "json_root"="$.data",
- "desired_concurrent_number"="1",
- "strip_outer_array" ="true",
- "max_error_number"="1000"
-)
-FROM KAFKA (
- "kafka_broker_list"= "172.26.92.141:9092",
- "kafka_topic" = "databasename.query_record"
-);
-~~~
-
-This completes the near real-time synchronization of data from MySQL to StarRocks.
-
-You can view the status and error messages of the import job with `SHOW ROUTINE LOAD`.
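-
-For example, a quick status check on the job created above (a sketch assuming the job still exists in the `manual` database):
-
-~~~sql
-SHOW ROUTINE LOAD FOR manual.query_job;
-~~~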
diff --git a/docs/en/loading/Kafka-connector-starrocks.md b/docs/en/loading/Kafka-connector-starrocks.md
deleted file mode 100644
index 605e0bb..0000000
--- a/docs/en/loading/Kafka-connector-starrocks.md
+++ /dev/null
@@ -1,833 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Load data using Kafka connector
-
-StarRocks provides a self-developed connector named Apache Kafka® connector (StarRocks Connector for Apache Kafka®, Kafka connector for short), as a sink connector, that continuously consumes messages from Kafka and loads them into StarRocks. The Kafka connector guarantees at-least-once semantics.
-
-The Kafka connector can seamlessly integrate with Kafka Connect, which allows StarRocks to integrate better with the Kafka ecosystem. It is a wise choice if you want to load real-time data into StarRocks. Compared with Routine Load, the Kafka connector is recommended in the following scenarios:
-
-- Load data in formats that Routine Load does not support, such as Protobuf (Routine Load only supports CSV, JSON, and Avro). As long as the data can be converted into JSON or CSV format using Kafka Connect's converters, it can be loaded into StarRocks via the Kafka connector.
-- Customize data transformation, such as Debezium-formatted CDC data.
-- Load data from multiple Kafka topics.
-- Load data from Confluent Cloud.
-- Need finer control over load batch sizes, parallelism, and other parameters to achieve a balance between load speed and resource utilization.
-
-## Preparations
-
-### Version requirements
-
-| Connector | Kafka | StarRocks | Java |
-| --------- | --------- | ------------- | ---- |
-| 1.0.6 | 3.4+/4.0+ | 2.5 and later | 8 |
-| 1.0.5 | 3.4 | 2.5 and later | 8 |
-| 1.0.4 | 3.4 | 2.5 and later | 8 |
-| 1.0.3 | 3.4 | 2.5 and later | 8 |
-
-### Set up Kafka environment
-
-Both self-managed Apache Kafka clusters and Confluent Cloud are supported.
-
-- For a self-managed Apache Kafka cluster, you can refer to [Apache Kafka quickstart](https://kafka.apache.org/quickstart) to quickly deploy a Kafka cluster. Kafka Connect is already integrated into Kafka.
-- For Confluent Cloud, make sure that you have a Confluent account and have created a cluster.
-
-### Download Kafka connector
-
-Download the Kafka connector and deploy it into Kafka Connect:
-
-- Self-managed Kafka cluster:
-
- Download [starrocks-connector-for-kafka-x.y.z-with-dependencies.jar](https://github.com/StarRocks/starrocks-connector-for-kafka/releases).
-
-- Confluent Cloud:
-
- Currently, the Kafka connector is not uploaded to Confluent Hub. You need to download [starrocks-connector-for-kafka-x.y.z-with-dependencies.jar](https://github.com/StarRocks/starrocks-connector-for-kafka/releases), package it into a ZIP file and upload the ZIP file to Confluent Cloud.
-
-### Network configuration
-
-Ensure that the machine where Kafka is located can access the FE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`query_port`](../administration/management/FE_configuration.md#query_port) (default: `9030`), and the BE nodes via the [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`).
-
-## Usage
-
-This section uses a self-managed Kafka cluster as an example to explain how to configure the Kafka connector and Kafka Connect, and then run Kafka Connect to load data into StarRocks.
-
-### Prepare a dataset
-
-Suppose that JSON-format data exists in the topic `test` in a Kafka cluster.
-
-```JSON
-{"id":1,"city":"New York"}
-{"id":2,"city":"Los Angeles"}
-{"id":3,"city":"Chicago"}
-```
-
-### Create a table
-
-Create the table `test_tbl` in the database `example_db` in the StarRocks cluster according to the keys of the JSON-format data.
-
-```SQL
-CREATE DATABASE example_db;
-USE example_db;
-CREATE TABLE test_tbl (id INT, city STRING);
-```
-
-### Configure Kafka connector and Kafka Connect, and then run Kafka Connect to load data
-
-#### Run Kafka Connect in standalone mode
-
-1. Configure the Kafka connector. In the **config** directory under the Kafka installation directory, create the configuration file **connect-StarRocks-sink.properties** for the Kafka connector, and configure the following parameters. For more parameters and descriptions, see [Parameters](#Parameters).
-
- :::info
-
- - In this example, the Kafka connector provided by StarRocks is a sink connector that can continuously consume data from Kafka and load data into StarRocks.
- - If the source data is CDC data, such as data in Debezium format, and the StarRocks table is a Primary Key table, you also need to [configure `transform`](#load-debezium-formatted-cdc-data) in the configuration file **connect-StarRocks-sink.properties** for the Kafka connector provided by StarRocks, to synchronize the source data changes to the Primary Key table.
-
- :::
-
- ```yaml
- name=starrocks-kafka-connector
- connector.class=com.starrocks.connector.kafka.StarRocksSinkConnector
- topics=test
- key.converter=org.apache.kafka.connect.json.JsonConverter
- value.converter=org.apache.kafka.connect.json.JsonConverter
- key.converter.schemas.enable=true
- value.converter.schemas.enable=false
- # The HTTP URL of the FE in your StarRocks cluster. The default port is 8030.
- starrocks.http.url=192.168.xxx.xxx:8030
- # If the Kafka topic name is different from the StarRocks table name, you need to configure the mapping relationship between them.
- starrocks.topic2table.map=test:test_tbl
- # Enter the StarRocks username.
- starrocks.username=user1
- # Enter the StarRocks password.
- starrocks.password=123456
- starrocks.database.name=example_db
- sink.properties.strip_outer_array=true
- ```
-
-2. Configure and run the Kafka Connect.
-
- 1. Configure the Kafka Connect. In the configuration file **config/connect-standalone.properties** in the **config** directory, configure the following parameters. For more parameters and descriptions, see [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running).
-
- ```yaml
- # The addresses of Kafka brokers. Multiple addresses of Kafka brokers need to be separated by commas (,).
- # Note that this example uses PLAINTEXT as the security protocol to access the Kafka cluster. If you are using other security protocol to access the Kafka cluster, you need to configure the relevant information in this file.
- bootstrap.servers=<kafka_broker_ip>:9092
- offset.storage.file.filename=/tmp/connect.offsets
- offset.flush.interval.ms=10000
- key.converter=org.apache.kafka.connect.json.JsonConverter
- value.converter=org.apache.kafka.connect.json.JsonConverter
- key.converter.schemas.enable=true
- value.converter.schemas.enable=false
- # The absolute path of starrocks-connector-for-kafka-x.y.z-with-dependencies.jar.
- plugin.path=/home/kafka-connect/starrocks-kafka-connector
- ```
-
- 2. Run the Kafka Connect.
-
- ```Bash
- CLASSPATH=/home/kafka-connect/starrocks-kafka-connector/* bin/connect-standalone.sh config/connect-standalone.properties config/connect-starrocks-sink.properties
- ```
-
-#### Run Kafka Connect in distributed mode
-
-1. Configure and run the Kafka Connect.
-
- 1. Configure the Kafka Connect. In the configuration file `config/connect-distributed.properties` in the **config** directory, configure the following parameters. For more parameters and descriptions, refer to [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running).
-
- ```yaml
- # The addresses of Kafka brokers. Multiple addresses of Kafka brokers need to be separated by commas (,).
- # Note that this example uses PLAINTEXT as the security protocol to access the Kafka cluster. If you are using other security protocol to access the Kafka cluster, you need to configure the relevant information in this file.
- bootstrap.servers=<kafka_broker_ip>:9092
- offset.storage.file.filename=/tmp/connect.offsets
- offset.flush.interval.ms=10000
- key.converter=org.apache.kafka.connect.json.JsonConverter
- value.converter=org.apache.kafka.connect.json.JsonConverter
- key.converter.schemas.enable=true
- value.converter.schemas.enable=false
- # The absolute path of starrocks-connector-for-kafka-x.y.z-with-dependencies.jar.
- plugin.path=/home/kafka-connect/starrocks-kafka-connector
- ```
-
- 2. Run the Kafka Connect.
-
- ```BASH
- CLASSPATH=/home/kafka-connect/starrocks-kafka-connector/* bin/connect-distributed.sh config/connect-distributed.properties
- ```
-
-2. Configure and create the Kafka connector. Note that in distributed mode, you need to configure and create the Kafka connector through the REST API. For parameters and descriptions, see [Parameters](#Parameters).
-
- :::info
-
- - In this example, the Kafka connector provided by StarRocks is a sink connector that can continuously consume data from Kafka and load data into StarRocks.
- - If the source data is CDC data, such as data in Debezium format, and the StarRocks table is a Primary Key table, you also need to [configure `transform`](#load-debezium-formatted-cdc-data) in the configuration file **connect-StarRocks-sink.properties** for the Kafka connector provided by StarRocks, to synchronize the source data changes to the Primary Key table.
-
- :::
-
- ```Shell
- curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{
- "name":"starrocks-kafka-connector",
- "config":{
- "connector.class":"com.starrocks.connector.kafka.StarRocksSinkConnector",
- "topics":"test",
- "key.converter":"org.apache.kafka.connect.json.JsonConverter",
- "value.converter":"org.apache.kafka.connect.json.JsonConverter",
- "key.converter.schemas.enable":"true",
- "value.converter.schemas.enable":"false",
- "starrocks.http.url":"192.168.xxx.xxx:8030",
- "starrocks.topic2table.map":"test:test_tbl",
- "starrocks.username":"user1",
- "starrocks.password":"123456",
- "starrocks.database.name":"example_db",
- "sink.properties.strip_outer_array":"true"
- }
- }'
- ```
-
-#### Query StarRocks table
-
-Query the target StarRocks table `test_tbl`.
-
-```mysql
-MySQL [example_db]> select * from test_tbl;
-
-+------+-------------+
-| id | city |
-+------+-------------+
-| 1 | New York |
-| 2 | Los Angeles |
-| 3 | Chicago |
-+------+-------------+
-3 rows in set (0.01 sec)
-```
-
-The data is successfully loaded when the above result is returned.
-
-## Parameters
-
-### name
-
-**Required**: YES
-**Default value**:
-**Description**: Name for this Kafka connector. It must be globally unique among all Kafka connectors within this Kafka Connect cluster. For example, starrocks-kafka-connector.
-
-### connector.class
-
-**Required**: YES
-**Default value**:
-**Description**: Class used by this Kafka connector's sink. Set the value to `com.starrocks.connector.kafka.StarRocksSinkConnector`.
-
-### topics
-
-**Required**:
-**Default value**:
-**Description**: One or more topics to subscribe to, where each topic corresponds to a StarRocks table. By default, StarRocks assumes that the topic name matches the name of the StarRocks table. So StarRocks determines the target StarRocks table by using the topic name. Please choose either to fill in `topics` or `topics.regex` (below), but not both. However, if the StarRocks table name is not the same as the topic name, then use the optional `starrocks.topic2table.map` parameter (below) to specify the mapping from topic name to table name.
-
-### topics.regex
-
-**Required**:
-**Default value**:
-**Description**: Regular expression to match the one or more topics to subscribe to. For more description, see `topics`. Please choose either to fill in `topics.regex` or `topics` (above), but not both.
-
-### starrocks.topic2table.map
-
-**Required**: NO
-**Default value**:
-**Description**: The mapping between topic names and StarRocks table names when the topic name is different from the StarRocks table name. The format is `<topic_1>:<table_1>,<topic_2>:<table_2>,...`.
-
-### starrocks.http.url
-
-**Required**: YES
-**Default value**:
-**Description**: The HTTP URL of the FE in your StarRocks cluster. The format is `<fe_host1>:<fe_http_port1>,<fe_host2>:<fe_http_port2>,...`. Multiple addresses are separated by commas (,). For example, `192.168.xxx.xxx:8030,192.168.xxx.xxx:8030`.
-
-### starrocks.database.name
-
-**Required**: YES
-**Default value**:
-**Description**: The name of StarRocks database.
-
-### starrocks.username
-
-**Required**: YES
-**Default value**:
-**Description**: The username of your StarRocks cluster account. The user needs the [INSERT](../sql-reference/sql-statements/account-management/GRANT.md) privilege on the StarRocks table.
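-
-For reference, a minimal sketch of granting that privilege for the table used in the example above (run while `example_db` is the current database; the user identity here is illustrative):
-
-```SQL
-GRANT INSERT ON TABLE test_tbl TO USER 'user1'@'%';
-```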
-
-### starrocks.password
-
-**Required**: YES
-**Default value**:
-**Description**: The password of your StarRocks cluster account.
-
-### key.converter
-
-**Required**: NO
-**Default value**: Key converter used by Kafka Connect cluster
-**Description**: This parameter specifies the key converter for the sink connector (Kafka-connector-starrocks), which is used to deserialize the keys of Kafka data. The default key converter is the one used by Kafka Connect cluster.
-
-### value.converter
-
-**Required**: NO
-**Default value**: Value converter used by Kafka Connect cluster
-**Description**: This parameter specifies the value converter for the sink connector (Kafka-connector-starrocks), which is used to deserialize the values of Kafka data. The default value converter is the one used by Kafka Connect cluster.
-
-### key.converter.schema.registry.url
-
-**Required**: NO
-**Default value**:
-**Description**: Schema registry URL for the key converter.
-
-### value.converter.schema.registry.url
-
-**Required**: NO
-**Default value**:
-**Description**: Schema registry URL for the value converter.
-
-### tasks.max
-
-**Required**: NO
-**Default value**: 1
-**Description**: The upper limit for the number of task threads that the Kafka connector can create, which is usually the same as the number of CPU cores on the worker nodes in the Kafka Connect cluster. You can tune this parameter to control load performance.
-
-### bufferflush.maxbytes
-
-**Required**: NO
-**Default value**: 94371840(90M)
-**Description**: The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. The maximum value ranges from 64 MB to 10 GB. Keep in mind that the Stream Load SDK buffer may create multiple Stream Load jobs to buffer data. Therefore, the threshold mentioned here refers to the total data size.
-
-### bufferflush.intervalms
-
-**Required**: NO
-**Default value**: 1000
-**Description**: Interval for sending a batch of data which controls the load latency. Range: [1000, 3600000].
-
-### connect.timeoutms
-
-**Required**: NO
-**Default value**: 1000
-**Description**: Timeout for connecting to the HTTP URL. Range: [100, 60000].
-
-### sink.properties.*
-
-**Required**:
-**Default value**:
-**Description**: Stream Load parameters to control load behavior. For example, the parameter `sink.properties.format` specifies the format used for Stream Load, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-### sink.properties.format
-
-**Required**: NO
-**Default value**: json
-**Description**: The format used for Stream Load. The Kafka connector will transform each batch of data to the format before sending them to StarRocks. Valid values: `csv` and `json`. For more information, see [CSV parameters](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#csv-parameters) and [JSON parameters](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#json-parameters).
-
-### sink.properties.partial_update
-
-**Required**: NO
-**Default value**: `FALSE`
-**Description**: Whether to use partial updates. Valid values: `TRUE` and `FALSE`. Default value: `FALSE`, indicating to disable this feature.
-
-### sink.properties.partial_update_mode
-
-**Required**: NO
-**Default value**: `row`
-**Description**: Specifies the mode for partial updates. Valid values: `row` and `column`.
-The value `row` (default) means partial updates in row mode, which is more suitable for real-time updates with many columns and small batches.
-The value `column` means partial updates in column mode, which is more suitable for batch updates with few columns and many rows. In such scenarios, enabling the column mode offers faster update speeds. For example, in a table with 100 columns, if only 10 columns (10% of the total) are updated for all rows, the update speed of the column mode is 10 times faster.
-
-## Usage Notes
-
-### Flush Policy
-
-The Kafka connector buffers the data in memory and flushes it to StarRocks in batches via Stream Load. A flush is triggered when any of the following conditions is met:
-
-- The bytes of buffered rows reaches the limit `bufferflush.maxbytes`.
-- The elapsed time since the last flush reaches the limit `bufferflush.intervalms`.
-- The interval at which the connector tries committing offsets for tasks is reached. The interval is controlled by the Kafka Connect configuration [`offset.flush.interval.ms`](https://docs.confluent.io/platform/current/connect/references/allconfigs.html), and the default value is `60000`.
-
-For lower data latency, adjust these configurations in the Kafka connector settings. However, more frequent flushes will increase CPU and I/O usage.
-
-### Limits
-
-- Flattening a single message from a Kafka topic into multiple data rows and loading them into StarRocks is not supported.
-- The sink of the Kafka connector provided by StarRocks guarantees at-least-once semantics.
-
-## Best practices
-
-### Load Debezium-formatted CDC data
-
-Debezium is a popular Change Data Capture (CDC) tool that supports monitoring data changes in various database systems and streaming these changes to Kafka. The following example demonstrates how to configure and use the Kafka connector to write PostgreSQL changes to a **Primary Key table** in StarRocks.
-
-#### Step 1: Install and start Kafka
-
-> **NOTE**
->
-> You can skip this step if you have your own Kafka environment.
-
-1. [Download](https://dlcdn.apache.org/kafka/) the latest Kafka release from the official site and extract the package.
-
- ```Bash
- tar -xzf kafka_2.13-3.7.0.tgz
- cd kafka_2.13-3.7.0
- ```
-
-2. Start the Kafka environment.
-
- Generate a Kafka cluster UUID.
-
- ```Bash
- KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
- ```
-
- Format the log directories.
-
- ```Bash
- bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
- ```
-
- Start the Kafka server.
-
- ```Bash
- bin/kafka-server-start.sh config/kraft/server.properties
- ```
-
-#### Step 2: Configure PostgreSQL
-
-1. Make sure the PostgreSQL user is granted `REPLICATION` privileges.
-
-2. Adjust PostgreSQL configuration.
-
- Set `wal_level` to `logical` in **postgresql.conf**.
-
- ```Properties
- wal_level = logical
- ```
-
- Restart the PostgreSQL server to apply changes.
-
- ```Bash
- pg_ctl restart
- ```
-
-3. Prepare the dataset.
-
- Create a table and insert test data.
-
- ```SQL
- CREATE TABLE customers (
- id int primary key ,
- first_name varchar(65533) NULL,
- last_name varchar(65533) NULL ,
- email varchar(65533) NULL
- );
-
- INSERT INTO customers VALUES (1,'a','a','a@a.com');
- ```
-
-4. Verify the CDC log messages in Kafka.
-
- ```Json
- {
- "schema": {
- "type": "struct",
- "fields": [
- {
- "type": "struct",
- "fields": [
- {
- "type": "int32",
- "optional": false,
- "field": "id"
- },
- {
- "type": "string",
- "optional": true,
- "field": "first_name"
- },
- {
- "type": "string",
- "optional": true,
- "field": "last_name"
- },
- {
- "type": "string",
- "optional": true,
- "field": "email"
- }
- ],
- "optional": true,
- "name": "test.public.customers.Value",
- "field": "before"
- },
- {
- "type": "struct",
- "fields": [
- {
- "type": "int32",
- "optional": false,
- "field": "id"
- },
- {
- "type": "string",
- "optional": true,
- "field": "first_name"
- },
- {
- "type": "string",
- "optional": true,
- "field": "last_name"
- },
- {
- "type": "string",
- "optional": true,
- "field": "email"
- }
- ],
- "optional": true,
- "name": "test.public.customers.Value",
- "field": "after"
- },
- {
- "type": "struct",
- "fields": [
- {
- "type": "string",
- "optional": false,
- "field": "version"
- },
- {
- "type": "string",
- "optional": false,
- "field": "connector"
- },
- {
- "type": "string",
- "optional": false,
- "field": "name"
- },
- {
- "type": "int64",
- "optional": false,
- "field": "ts_ms"
- },
- {
- "type": "string",
- "optional": true,
- "name": "io.debezium.data.Enum",
- "version": 1,
- "parameters": {
- "allowed": "true,last,false,incremental"
- },
- "default": "false",
- "field": "snapshot"
- },
- {
- "type": "string",
- "optional": false,
- "field": "db"
- },
- {
- "type": "string",
- "optional": true,
- "field": "sequence"
- },
- {
- "type": "string",
- "optional": false,
- "field": "schema"
- },
- {
- "type": "string",
- "optional": false,
- "field": "table"
- },
- {
- "type": "int64",
- "optional": true,
- "field": "txId"
- },
- {
- "type": "int64",
- "optional": true,
- "field": "lsn"
- },
- {
- "type": "int64",
- "optional": true,
- "field": "xmin"
- }
- ],
- "optional": false,
- "name": "io.debezium.connector.postgresql.Source",
- "field": "source"
- },
- {
- "type": "string",
- "optional": false,
- "field": "op"
- },
- {
- "type": "int64",
- "optional": true,
- "field": "ts_ms"
- },
- {
- "type": "struct",
- "fields": [
- {
- "type": "string",
- "optional": false,
- "field": "id"
- },
- {
- "type": "int64",
- "optional": false,
- "field": "total_order"
- },
- {
- "type": "int64",
- "optional": false,
- "field": "data_collection_order"
- }
- ],
- "optional": true,
- "name": "event.block",
- "version": 1,
- "field": "transaction"
- }
- ],
- "optional": false,
- "name": "test.public.customers.Envelope",
- "version": 1
- },
- "payload": {
- "before": null,
- "after": {
- "id": 1,
- "first_name": "a",
- "last_name": "a",
- "email": "a@a.com"
- },
- "source": {
- "version": "2.5.3.Final",
- "connector": "postgresql",
- "name": "test",
- "ts_ms": 1714283798721,
- "snapshot": "false",
- "db": "postgres",
- "sequence": "[\"22910216\",\"22910504\"]",
- "schema": "public",
- "table": "customers",
- "txId": 756,
- "lsn": 22910504,
- "xmin": null
- },
- "op": "c",
- "ts_ms": 1714283798790,
- "transaction": null
- }
- }
- ```
-
-#### Step 3: Configure StarRocks
-
-Create a Primary Key table in StarRocks with the same schema as the source table in PostgreSQL.
-
-```SQL
-CREATE TABLE `customers` (
- `id` int(11) COMMENT "",
- `first_name` varchar(65533) NULL COMMENT "",
- `last_name` varchar(65533) NULL COMMENT "",
- `email` varchar(65533) NULL COMMENT ""
-) ENGINE=OLAP
-PRIMARY KEY(`id`)
-DISTRIBUTED BY hash(id) buckets 1
-PROPERTIES (
-"bucket_size" = "4294967296",
-"in_memory" = "false",
-"enable_persistent_index" = "true",
-"replicated_storage" = "true",
-"fast_schema_evolution" = "true"
-);
-```
-
-#### Step 4: Install connector
-
-1. Download the connectors and extract the packages in the **plugins** directory.
-
- ```Bash
- mkdir plugins
- tar -zxvf debezium-debezium-connector-postgresql-2.5.3.zip -C plugins
- mv starrocks-connector-for-kafka-x.y.z-with-dependencies.jar plugins
- ```
-
- This directory is the value of the configuration item `plugin.path` in **config/connect-standalone.properties**.
-
- ```Properties
- plugin.path=/path/to/kafka_2.13-3.7.0/plugins
- ```
-
-2. Configure the PostgreSQL source connector in **pg-source.properties**.
-
- ```Json
- {
- "name": "inventory-connector",
- "config": {
- "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
- "plugin.name": "pgoutput",
- "database.hostname": "localhost",
- "database.port": "5432",
- "database.user": "postgres",
- "database.password": "",
- "database.dbname" : "postgres",
- "topic.prefix": "test"
- }
- }
- ```
-
-3. Configure the StarRocks sink connector in **sr-sink.properties**.
-
- ```Json
- {
- "name": "starrocks-kafka-connector",
- "config": {
- "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector",
- "tasks.max": "1",
- "topics": "test.public.customers",
- "starrocks.http.url": "172.26.195.69:28030",
- "starrocks.database.name": "test",
- "starrocks.username": "root",
- "starrocks.password": "StarRocks@123",
- "sink.properties.strip_outer_array": "true",
- "connect.timeoutms": "3000",
- "starrocks.topic2table.map": "test.public.customers:customers",
- "transforms": "addfield,unwrap",
- "transforms.addfield.type": "com.starrocks.connector.kafka.transforms.AddOpFieldForDebeziumRecord",
- "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
- "transforms.unwrap.drop.tombstones": "true",
- "transforms.unwrap.delete.handling.mode": "rewrite"
- }
- }
- ```
-
- > **NOTE**
- >
- > - If the StarRocks table is not a Primary Key table, you do not need to specify the `addfield` transform.
- > - The unwrap transform is provided by Debezium and is used to unwrap Debezium's complex data structure based on the operation type. For more information, see [New Record State Extraction](https://debezium.io/documentation/reference/stable/transformations/event-flattening.html).
-
-4. Configure Kafka Connect.
-
- Configure the following configuration items in the Kafka Connect configuration file **config/connect-standalone.properties**.
-
- ```Properties
- # The addresses of Kafka brokers. Multiple addresses of Kafka brokers need to be separated by commas (,).
- # Note that this example uses PLAINTEXT as the security protocol to access the Kafka cluster.
- # If you use other security protocol to access the Kafka cluster, configure the relevant information in this part.
-
- bootstrap.servers=<kafka_broker_ip>:9092
- offset.storage.file.filename=/tmp/connect.offsets
- key.converter=org.apache.kafka.connect.json.JsonConverter
- value.converter=org.apache.kafka.connect.json.JsonConverter
- key.converter.schemas.enable=true
- value.converter.schemas.enable=false
-
- # The absolute path of starrocks-connector-for-kafka-x.y.z-with-dependencies.jar.
- plugin.path=/home/kafka-connect/starrocks-kafka-connector
-
- # Parameters that control the flush policy. For more information, see the Usage Note section.
- offset.flush.interval.ms=10000
- bufferflush.maxbytes = xxx
- bufferflush.intervalms = xxx
- ```
-
- For descriptions of more parameters, see [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running).
-
-#### Step 5: Start Kafka Connect in Standalone Mode
-
-Run Kafka Connect in standalone mode to initiate the connectors.
-
-```Bash
-bin/connect-standalone.sh config/connect-standalone.properties config/pg-source.properties config/sr-sink.properties
-```
-
-#### Step 6: Verify data ingestion
-
-Test the following operations and ensure the data is correctly ingested into StarRocks.
-
-##### INSERT
-
-- In PostgreSQL:
-
-```Plain
-postgres=# insert into customers values (2,'b','b','b@b.com');
-INSERT 0 1
-postgres=# select * from customers;
- id | first_name | last_name | email
-----+------------+-----------+---------
- 1 | a | a | a@a.com
- 2 | b | b | b@b.com
-(2 rows)
-```
-
-- In StarRocks:
-
-```Plain
-MySQL [test]> select * from customers;
-+------+------------+-----------+---------+
-| id | first_name | last_name | email |
-+------+------------+-----------+---------+
-| 1 | a | a | a@a.com |
-| 2 | b | b | b@b.com |
-+------+------------+-----------+---------+
-2 rows in set (0.01 sec)
-```
-
-##### UPDATE
-
-- In PostgreSQL:
-
-```Plain
-postgres=# update customers set email='c@c.com';
-UPDATE 2
-postgres=# select * from customers;
- id | first_name | last_name | email
-----+------------+-----------+---------
- 1 | a | a | c@c.com
- 2 | b | b | c@c.com
-(2 rows)
-```
-
-- In StarRocks:
-
-```Plain
-MySQL [test]> select * from customers;
-+------+------------+-----------+---------+
-| id | first_name | last_name | email |
-+------+------------+-----------+---------+
-| 1 | a | a | c@c.com |
-| 2 | b | b | c@c.com |
-+------+------------+-----------+---------+
-2 rows in set (0.00 sec)
-```
-
-##### DELETE
-
-- In PostgreSQL:
-
-```Plain
-postgres=# delete from customers where id=1;
-DELETE 1
-postgres=# select * from customers;
- id | first_name | last_name | email
-----+------------+-----------+---------
- 2 | b | b | c@c.com
-(1 row)
-```
-
-- In StarRocks:
-
-```Plain
-MySQL [test]> select * from customers;
-+------+------------+-----------+---------+
-| id | first_name | last_name | email |
-+------+------------+-----------+---------+
-| 2 | b | b | c@c.com |
-+------+------------+-----------+---------+
-1 row in set (0.00 sec)
-```
diff --git a/docs/en/loading/Load_to_Primary_Key_tables.md b/docs/en/loading/Load_to_Primary_Key_tables.md
deleted file mode 100644
index 5c40759..0000000
--- a/docs/en/loading/Load_to_Primary_Key_tables.md
+++ /dev/null
@@ -1,709 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Change data through loading
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-[Primary Key tables](../table_design/table_types/primary_key_table.md) provided by StarRocks allow you to make data changes to StarRocks tables by running [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), or [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) jobs. These data changes include inserts, updates, and deletions. However, Primary Key tables do not support changing data by using [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) or [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md).
-
-StarRocks also supports partial updates and conditional updates.
-
-
-
-This topic uses CSV data as an example to describe how to make data changes to a StarRocks table through loading. The data file formats that are supported vary depending on the loading method of your choice.
-
-> **NOTE**
->
-> For CSV data, you can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter.
-
-## Implementation
-
-Primary Key tables provided by StarRocks support UPSERT and DELETE operations and do not distinguish INSERT operations from UPDATE operations.
-
-When you create a load job, StarRocks supports adding a field named `__op` to the job creation statement or command. The `__op` field is used to specify the type of operation you want to perform.
-
-> **NOTE**
->
-> When you create a table, you do not need to add a column named `__op` to that table.
-
-The method of defining the `__op` field varies depending on the loading method of your choice:
-
-- If you choose Stream Load, define the `__op` field by using the `columns` parameter.
-
-- If you choose Broker Load, define the `__op` field by using the SET clause.
-
-- If you choose Routine Load, define the `__op` field by using the `COLUMNS` column.
-
-You can decide whether to add the `__op` field based on the data changes you want to make. If you do not add the `__op` field, the operation type defaults to UPSERT. The major data change scenarios are as follows:
-
-- If the data file you want to load involves only UPSERT operations, you do not need to add the `__op` field.
-
-- If the data file you want to load involves only DELETE operations, you must add the `__op` field and specify the operation type as DELETE.
-
-- If the data file you want to load involves both UPSERT and DELETE operations, you must add the `__op` field and make sure that the data file contains a column whose values are `0` or `1`. A value of `0` indicates an UPSERT operation, and a value of `1` indicates a DELETE operation.
-
-## Usage notes
-
-- Make sure that each row in your data file has the same number of columns.
-
-- The columns that involve data changes must include the primary key column.
-
-## Basic operations
-
-This section provides examples of how to make data changes to a StarRocks table through loading. For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), and [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-### UPSERT
-
-If the data file you want to load involves only UPSERT operations, you do not need to add the `__op` field.
-
-> **NOTE**
->
-> If you add the `__op` field:
->
-> - You can specify the operation type as UPSERT.
->
-> - You can leave the `__op` field empty, because the operation type defaults to UPSERT.
-
-#### Data examples
-
-1. Prepare a data file.
-
- a. Create a CSV file named `example1.csv` in your local file system. The file consists of three columns, which represent user ID, user name, and user score in sequence.
-
- ```Plain
- 101,Lily,100
- 102,Rose,100
- ```
-
- b. Publish the data of `example1.csv` to `topic1` of your Kafka cluster.
-
-2. Prepare a StarRocks table.
-
- a. Create a Primary Key table named `table1` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
- ```SQL
- CREATE TABLE `table1`
- (
- `id` int(11) NOT NULL COMMENT "user ID",
- `name` varchar(65533) NOT NULL COMMENT "user name",
- `score` int(11) NOT NULL COMMENT "user score"
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`)
- DISTRIBUTED BY HASH(`id`);
- ```
-
- > **NOTE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
- b. Insert a record into `table1`.
-
- ```SQL
- INSERT INTO table1 VALUES
- (101, 'Lily',80);
- ```
-
-#### Load data
-
-Run a load job to update the record whose `id` is `101` in `example1.csv` to `table1` and insert the record whose `id` is `102` in `example1.csv` into `table1`.
-
-- Run a Stream Load job.
-
- - If you do not want to include the `__op` field, run the following command:
-
- ```Bash
- curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "label:label1" \
- -H "column_separator:," \
- -T example1.csv -XPUT \
- http://<fe_host>:<fe_http_port>/api/test_db/table1/_stream_load
- ```
-
- - If you want to include the `__op` field, run the following command:
-
- ```Bash
- curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "label:label2" \
- -H "column_separator:," \
- -H "columns:__op ='upsert'" \
- -T example1.csv -XPUT \
- http://<fe_host>:<fe_http_port>/api/test_db/table1/_stream_load
- ```
-
-- Run a Broker Load job.
-
- - If you do not want to include the `__op` field, run the following command:
-
- ```SQL
- LOAD LABEL test_db.label1
- (
- data infile("hdfs://:/example1.csv")
- into table table1
- columns terminated by ","
- format as "csv"
- )
- WITH BROKER;
- ```
-
- - If you want to include the `__op` field, run the following command:
-
- ```SQL
- LOAD LABEL test_db.label2
- (
- data infile("hdfs://:/example1.csv")
- into table table1
- columns terminated by ","
- format as "csv"
- set (__op = 'upsert')
- )
- WITH BROKER;
- ```
-
-- Run a Routine Load job.
-
- - If you do not want to include the `__op` field, run the following command:
-
- ```SQL
- CREATE ROUTINE LOAD test_db.table1 ON table1
- COLUMNS TERMINATED BY ",",
- COLUMNS (id, name, score)
- PROPERTIES
- (
- "desired_concurrent_number" = "3",
- "max_batch_interval" = "20",
- "max_batch_rows"= "250000",
- "max_error_number" = "1000"
- )
- FROM KAFKA
- (
- "kafka_broker_list" =":",
- "kafka_topic" = "test1",
- "property.kafka_default_offsets" ="OFFSET_BEGINNING"
- );
- ```
-
- - If you want to include the `__op` field, run the following command:
-
- ```SQL
- CREATE ROUTINE LOAD test_db.table1 ON table1
- COLUMNS TERMINATED BY ",",
- COLUMNS (id, name, score, __op ='upsert')
- PROPERTIES
- (
- "desired_concurrent_number" = "3",
- "max_batch_interval" = "20",
- "max_batch_rows"= "250000",
- "max_error_number" = "1000"
- )
- FROM KAFKA
- (
- "kafka_broker_list" =":",
- "kafka_topic" = "test1",
- "property.kafka_default_offsets" ="OFFSET_BEGINNING"
- );
- ```
-
-#### Query data
-
-After the load is complete, query the data of `table1` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table1;
-+------+------+-------+
-| id | name | score |
-+------+------+-------+
-| 101 | Lily | 100 |
-| 102 | Rose | 100 |
-+------+------+-------+
-2 rows in set (0.02 sec)
-```
-
-As shown in the preceding query result, the record whose `id` is `101` in `example1.csv` has been updated to `table1`, and the record whose `id` is `102` in `example1.csv` has been inserted into `table1`.
-
-### DELETE
-
-If the data file you want to load involves only DELETE operations, you must add the `__op` field and specify the operation type as DELETE.
-
-#### Data examples
-
-1. Prepare a data file.
-
- a. Create a CSV file named `example2.csv` in your local file system. The file consists of three columns, which represent user ID, user name, and user score in sequence.
-
- ```Plain
- 101,Jack,100
- ```
-
- b. Publish the data of `example2.csv` to `topic2` of your Kafka cluster.
-
-2. Prepare a StarRocks table.
-
-    a. Create a Primary Key table named `table2` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
- ```SQL
- CREATE TABLE `table2`
- (
- `id` int(11) NOT NULL COMMENT "user ID",
- `name` varchar(65533) NOT NULL COMMENT "user name",
- `score` int(11) NOT NULL COMMENT "user score"
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`)
- DISTRIBUTED BY HASH(`id`);
- ```
-
- > **NOTE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
- b. Insert two records into `table2`.
-
- ```SQL
- INSERT INTO table2 VALUES
- (101, 'Jack', 100),
- (102, 'Bob', 90);
- ```
-
-#### Load data
-
-Run a load job to delete the record whose `id` is `101` in `example2.csv` from `table2`.
-
-- Run a Stream Load job.
-
- ```Bash
-  curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "label:label3" \
- -H "column_separator:," \
- -H "columns:__op='delete'" \
- -T example2.csv -XPUT \
-  http://<fe_host>:<fe_http_port>/api/test_db/table2/_stream_load
- ```
-
-- Run a Broker Load job.
-
- ```SQL
- LOAD LABEL test_db.label3
- (
-      data infile("hdfs://<hdfs_host>:<hdfs_port>/example2.csv")
- into table table2
- columns terminated by ","
- format as "csv"
- set (__op = 'delete')
- )
- WITH BROKER;
- ```
-
-- Run a Routine Load job.
-
- ```SQL
- CREATE ROUTINE LOAD test_db.table2 ON table2
- COLUMNS(id, name, score, __op = 'delete')
- PROPERTIES
- (
- "desired_concurrent_number" = "3",
- "max_batch_interval" = "20",
- "max_batch_rows"= "250000",
- "max_error_number" = "1000"
- )
- FROM KAFKA
- (
-      "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>",
- "kafka_topic" = "test2",
- "property.kafka_default_offsets" ="OFFSET_BEGINNING"
- );
- ```
-
-#### Query data
-
-After the load is complete, query the data of `table2` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table2;
-+------+------+-------+
-| id | name | score |
-+------+------+-------+
-| 102 | Bob | 90 |
-+------+------+-------+
-1 row in set (0.00 sec)
-```
-
-As shown in the preceding query result, the record whose `id` is `101` in `example2.csv` has been deleted from `table2`.
-
-### UPSERT and DELETE
-
-If the data file you want to load involves both UPSERT and DELETE operations, you must add the `__op` field and make sure that the data file contains a column whose values are `0` or `1`. A value of `0` indicates an UPSERT operation, and a value of `1` indicates a DELETE operation.
-
-#### Data examples
-
-1. Prepare a data file.
-
- a. Create a CSV file named `example3.csv` in your local file system. The file consists of four columns, which represent user ID, user name, user score, and operation type in sequence.
-
- ```Plain
- 101,Tom,100,1
- 102,Sam,70,0
- 103,Stan,80,0
- ```
-
- b. Publish the data of `example3.csv` to `topic3` of your Kafka cluster.
-
-2. Prepare a StarRocks table.
-
- a. Create a Primary Key table named `table3` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
- ```SQL
- CREATE TABLE `table3`
- (
- `id` int(11) NOT NULL COMMENT "user ID",
- `name` varchar(65533) NOT NULL COMMENT "user name",
- `score` int(11) NOT NULL COMMENT "user score"
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`)
- DISTRIBUTED BY HASH(`id`);
- ```
-
- > **NOTE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
- b. Insert two records into `table3`.
-
- ```SQL
- INSERT INTO table3 VALUES
- (101, 'Tom', 100),
- (102, 'Sam', 90);
- ```
-
-#### Load data
-
-Run a load job to delete the record whose `id` is `101` in `example3.csv` from `table3`, update the record whose `id` is `102` in `example3.csv` to `table3`, and insert the record whose `id` is `103` in `example3.csv` into `table3`.
-
-- Run a Stream Load job:
-
- ```Bash
-  curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "label:label4" \
- -H "column_separator:," \
- -H "columns: id, name, score, temp, __op = temp" \
- -T example3.csv -XPUT \
-  http://<fe_host>:<fe_http_port>/api/test_db/table3/_stream_load
- ```
-
- > **NOTE**
- >
-  > In the preceding example, the fourth column, which represents the operation type in `example3.csv`, is temporarily named `temp`, and the `__op` field is mapped onto the `temp` column by using the `columns` parameter. This way, StarRocks can decide whether to perform an UPSERT or DELETE operation depending on whether the value in the fourth column of `example3.csv` is `0` or `1`.
-
-- Run a Broker Load job:
-
-  ```SQL
- LOAD LABEL test_db.label4
- (
-      data infile("hdfs://<hdfs_host>:<hdfs_port>/example3.csv")
-      into table table3
- columns terminated by ","
- format as "csv"
- (id, name, score, temp)
- set (__op=temp)
- )
- WITH BROKER;
- ```
-
-- Run a Routine Load job:
-
- ```SQL
- CREATE ROUTINE LOAD test_db.table3 ON table3
- COLUMNS(id, name, score, temp, __op = temp)
- PROPERTIES
- (
- "desired_concurrent_number" = "3",
- "max_batch_interval" = "20",
- "max_batch_rows"= "250000",
- "max_error_number" = "1000"
- )
- FROM KAFKA
- (
-      "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>",
- "kafka_topic" = "test3",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
- );
- ```
-
-#### Query data
-
-After the load is complete, query the data of `table3` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table3;
-+------+------+-------+
-| id | name | score |
-+------+------+-------+
-| 102 | Sam | 70 |
-| 103 | Stan | 80 |
-+------+------+-------+
-2 rows in set (0.01 sec)
-```
-
-As shown in the preceding query result, the record whose `id` is `101` in `example3.csv` has been deleted from `table3`, the record whose `id` is `102` in `example3.csv` has been updated to `table3`, and the record whose `id` is `103` in `example3.csv` has been inserted into `table3`.
-
-## Partial updates
-
-Primary Key tables support partial updates in two modes, row mode and column mode, for different data update scenarios. Both modes minimize the overhead of partial updates as much as possible while guaranteeing query performance and real-time updates. Row mode is more suitable for real-time update scenarios involving many columns and small batches. Column mode is suitable for batch update scenarios involving a few columns and a large number of rows.
-
-> **NOTICE**
->
-> When you perform a partial update, if the row to be updated does not exist, StarRocks inserts a new row and fills the columns that receive no data with their default values.
-
-This section uses CSV as an example to describe how to perform partial updates.
-
-### Data examples
-
-1. Prepare a data file.
-
- a. Create a CSV file named `example4.csv` in your local file system. The file consists of two columns, which represent user ID and user name in sequence.
-
- ```Plain
- 101,Lily
- 102,Rose
- 103,Alice
- ```
-
- b. Publish the data of `example4.csv` to `topic4` of your Kafka cluster.
-
-2. Prepare a StarRocks table.
-
- a. Create a Primary Key table named `table4` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
- ```SQL
- CREATE TABLE `table4`
- (
- `id` int(11) NOT NULL COMMENT "user ID",
- `name` varchar(65533) NOT NULL COMMENT "user name",
- `score` int(11) NOT NULL COMMENT "user score"
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`)
- DISTRIBUTED BY HASH(`id`);
- ```
-
- > **NOTE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
- b. Insert a record into `table4`.
-
- ```SQL
- INSERT INTO table4 VALUES
- (101, 'Tom',80);
- ```
-
-### Load data
-
-Run a load to update the data in the two columns of `example4.csv` to the `id` and `name` columns of `table4`.
-
-- Run a Stream Load job:
-
- ```Bash
-  curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "label:label7" -H "column_separator:," \
- -H "partial_update:true" \
- -H "columns:id,name" \
- -T example4.csv -XPUT \
-  http://<fe_host>:<fe_http_port>/api/test_db/table4/_stream_load
- ```
-
- > **NOTE**
- >
- > If you choose Stream Load, you must set the `partial_update` parameter to `true` to enable the partial update feature. The default is partial updates in row mode. If you need to perform partial updates in column mode, you need to set `partial_update_mode` to `column`. Additionally, you must use the `columns` parameter to specify the columns you want to update.
-
-- Run a Broker Load job:
-
- ```SQL
- LOAD LABEL test_db.table4
- (
-      data infile("hdfs://<hdfs_host>:<hdfs_port>/example4.csv")
- into table table4
- format as "csv"
- (id, name)
- )
- WITH BROKER
- PROPERTIES
- (
- "partial_update" = "true"
- );
- ```
-
- > **NOTE**
- >
- > If you choose Broker Load, you must set the `partial_update` parameter to `true` to enable the partial update feature. The default is partial updates in row mode. If you need to perform partial updates in column mode, you need to set `partial_update_mode` to `column`. Additionally, you must use the `column_list` parameter to specify the columns you want to update.
-
-- Run a Routine Load job:
-
- ```SQL
- CREATE ROUTINE LOAD test_db.table4 on table4
- COLUMNS (id, name),
- COLUMNS TERMINATED BY ','
- PROPERTIES
- (
- "partial_update" = "true"
- )
- FROM KAFKA
- (
-      "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>",
- "kafka_topic" = "test4",
- "property.kafka_default_offsets" ="OFFSET_BEGINNING"
- );
- ```
-
- > **NOTE**
- >
- > - If you choose Routine Load, you must set the `partial_update` parameter to `true` to enable the partial update feature. Additionally, you must use the `COLUMNS` parameter to specify the columns you want to update.
-  > - Routine Load supports only partial updates in row mode; it does not support partial updates in column mode.
-
-### Query data
-
-After the load is complete, query the data of `table4` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table4;
-+------+-------+-------+
-| id | name | score |
-+------+-------+-------+
-| 102 | Rose | 0 |
-| 101 | Lily | 80 |
-| 103 | Alice | 0 |
-+------+-------+-------+
-3 rows in set (0.01 sec)
-```
-
-As shown in the preceding query result, the record whose `id` is `101` in `example4.csv` has been updated to `table4`, and the records whose `id` values are `102` and `103` in `example4.csv` have been inserted into `table4`.
-
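-The preceding examples all use row mode, which is the default. Based on the notes above about setting `partial_update_mode` to `column`, a column-mode variant of the Broker Load job might look like the following sketch. The label name is made up for illustration, the HDFS host and port are placeholders, and passing `partial_update_mode` as a job property alongside `partial_update` is an assumption drawn from those notes rather than a verified example.
-
-```SQL
-LOAD LABEL test_db.label_partial_column
-(
-    data infile("hdfs://<hdfs_host>:<hdfs_port>/example4.csv")
-    into table table4
-    columns terminated by ","
-    format as "csv"
-    (id, name)
-)
-WITH BROKER
-PROPERTIES
-(
-    "partial_update" = "true",
-    "partial_update_mode" = "column"
-);
-```
-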
-## Conditional updates
-
-From StarRocks v2.5 onwards, Primary Key tables support conditional updates. You can specify a non-primary key column as the condition to determine whether updates can take effect. As such, an update from a source record to a destination record takes effect only when the value of the specified column in the source record is greater than or equal to the value in the destination record.
-
-The conditional update feature is designed to handle out-of-order source data. If the source data arrives out of order, you can use this feature to ensure that newer data is not overwritten by older data.
-
-> **NOTICE**
->
-> - You cannot specify different columns as update conditions for the same batch of data.
-> - DELETE operations do not support conditional updates.
-> - In versions earlier than v3.1.3, partial updates and conditional updates cannot be used simultaneously. From v3.1.3 onwards, StarRocks supports using partial updates with conditional updates.
-
-### Data examples
-
-1. Prepare a data file.
-
- a. Create a CSV file named `example5.csv` in your local file system. The file consists of three columns, which represent user ID, version, and user score in sequence.
-
- ```Plain
- 101,1,100
- 102,3,100
- ```
-
- b. Publish the data of `example5.csv` to `topic5` of your Kafka cluster.
-
-2. Prepare a StarRocks table.
-
- a. Create a Primary Key table named `table5` in your StarRocks database `test_db`. The table consists of three columns: `id`, `version`, and `score`, of which `id` is the primary key.
-
- ```SQL
- CREATE TABLE `table5`
- (
- `id` int(11) NOT NULL COMMENT "user ID",
- `version` int NOT NULL COMMENT "version",
- `score` int(11) NOT NULL COMMENT "user score"
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`) DISTRIBUTED BY HASH(`id`);
- ```
-
- > **NOTE**
- >
- > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
- b. Insert a record into `table5`.
-
- ```SQL
- INSERT INTO table5 VALUES
- (101, 2, 80),
- (102, 2, 90);
- ```
-
-### Load data
-
-Run a load job to update the records whose `id` values are `101` and `102` from `example5.csv` into `table5`, and specify that each update takes effect only when the new `version` value is greater than or equal to the current `version` value.
-
-- Run a Stream Load job:
-
- ```Bash
-  curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "label:label10" \
- -H "column_separator:," \
- -H "merge_condition:version" \
- -T example5.csv -XPUT \
-  http://<fe_host>:<fe_http_port>/api/test_db/table5/_stream_load
- ```
-
-- Run an INSERT INTO job:
-
-  ```SQL
- INSERT INTO test_db.table5 properties("merge_condition" = "version")
- VALUES (101, 2, 70), (102, 3, 100);
- ```
-
-- Run a Routine Load job:
-
- ```SQL
- CREATE ROUTINE LOAD test_db.table5 on table5
- COLUMNS (id, version, score),
- COLUMNS TERMINATED BY ','
- PROPERTIES
- (
- "merge_condition" = "version"
- )
- FROM KAFKA
- (
-      "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>",
- "kafka_topic" = "topic5",
- "property.kafka_default_offsets" ="OFFSET_BEGINNING"
- );
- ```
-
-- Run a Broker Load job:
-
- ```SQL
- LOAD LABEL test_db.table5
- ( DATA INFILE ("s3://xxx.csv")
- INTO TABLE table5 COLUMNS TERMINATED BY "," FORMAT AS "CSV"
- )
- WITH BROKER
- PROPERTIES
- (
- "merge_condition" = "version"
- );
- ```
-
-### Query data
-
-After the load is complete, query the data of `table5` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table5;
-+------+---------+-------+
-| id   | version | score |
-+------+---------+-------+
-|  101 |       2 |    80 |
-|  102 |       3 |   100 |
-+------+---------+-------+
-2 rows in set (0.02 sec)
-```
-
-As shown in the preceding query result, the record whose `id` is `101` in `example5.csv` is not updated in `table5`, and the record whose `id` is `102` in `example5.csv` has been updated in `table5`.
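-
-From v3.1.3 onwards, as stated in the notice above, partial updates and conditional updates can be combined. A rough Broker Load sketch of that combination follows; the data file `example5_partial.csv` (containing only the `id` and `version` columns), the label name, and the HDFS host and port are hypothetical, and combining the two properties in one job is an assumption based on the notice rather than a verified example.
-
-```SQL
-LOAD LABEL test_db.label_partial_condition
-(
-    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/example5_partial.csv")
-    INTO TABLE table5
-    COLUMNS TERMINATED BY ","
-    FORMAT AS "csv"
-    (id, version)
-)
-WITH BROKER
-PROPERTIES
-(
-    "partial_update" = "true",
-    "merge_condition" = "version"
-);
-```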
diff --git a/docs/en/loading/Loading_data_template.md b/docs/en/loading/Loading_data_template.md
deleted file mode 100644
index 1dd5871..0000000
--- a/docs/en/loading/Loading_data_template.md
+++ /dev/null
@@ -1,400 +0,0 @@
----
-displayed_sidebar: docs
-unlisted: true
----
-
-# Load data from \<data source\> TEMPLATE
-
-## Template instructions
-
-### A note about style
-
-Technical documentation typically has links to other documents all over the place. When you look at this document you may notice that there are few links out from the page, and that almost all of the links are at the bottom of the doc in the **more information** section. Not every keyword needs to be linked out to another page; please assume that the reader knows what `CREATE TABLE` means, and that if they do not, they can click in the search bar and find out. It is fine to pop a note in the docs to tell the reader that there are other options and that the details are described in the **more information** section; this lets the people who need the information know that they can read it ***later***, after they accomplish the task at hand.
-
-### The template
-
-This template is based on the process to load data from Amazon S3; some
-parts of it will not be applicable to loading from other sources. Please concentrate on the flow of this template and do not worry about including every section; the flow is meant to be:
-
-#### Introduction
-
-Introductory text that lets the reader know what the end result will be if they follow this guide. In the case of the S3 doc, the end result is "Getting data loaded from S3 in either an asynchronous manner, or a synchronous manner."
-
-#### Why?
-
-- A description of the business problem solved with the technique
-- Advantages and the disadvantages (if any) of the method(s) described
-
-#### Data flow or other diagram
-
-Diagrams or images can be helpful. If you are describing a technique that is complex and an image helps, then use one. If you are describing a technique that produces something visual (for example, the use of Superset to analyze data), then definitely include an image of the end product.
-
-Use a data flow diagram if the flow is non-obvious. When a command causes StarRocks to run several processes and combine the output of those processes and then manipulate the data it is probably time for a description of the data flow. In this template there are two methods for loading data described. One of them is simple, and has no data flow section; the other is more complicated (StarRocks is handling the complex work, not the user!), and the complex option includes a data flow section.
-
-#### Examples with verification section
-
-Note that examples should come before syntax details and other deep technical details. Many readers will be coming to the docs to find a particular technique that they can copy, paste, and modify.
-
-If possible give an example that will work and includes a dataset to use. The example in this template uses a dataset stored in S3 that anyone who has an AWS account and can authenticate with a key and secret can use. By providing a dataset the examples are more valuable to the reader because they can fully experience the described technique.
-
-Make sure that the example works as written. This implies two things:
-
-1. you have run the commands in the order presented
-2. you have included the necessary prerequisites. For example, if your example refers to database `foo`, then probably you need to preface it with `CREATE DATABASE foo;`, `USE foo;`.
-
-Verification is so important. If the process that you are describing includes several steps, then include a verification step whenever something should have been accomplished; this helps the reader avoid getting to the end and realizing that they had a typo in step 10. In this template, the **Check progress** and `DESCRIBE user_behavior_inferred;` steps are for verification.
-
-#### More information
-
-At the end of the template there is a spot to put links to related information, including links to the optional information that you mentioned in the main body.
-
-### Notes embedded in the template
-
-The template notes are intentionally formatted differently than the way we format documentation notes to bring them to your attention when you are working through the template. Please remove the bold italic notes as you go along:
-
-```markdown
-***Note: descriptive text***
-```
-
-## Finally, the start of the template
-
-***Note: If there are multiple recommended choices, tell the
-reader this in the intro. For example, when loading from S3,
-there are options for synchronous loading and asynchronous loading:***
-
-StarRocks provides two options for loading data from S3:
-
-1. Asynchronous loading using Broker Load
-2. Synchronous loading using the `FILES()` table function
-
-***Note: Tell the reader WHY they would choose one choice over the other:***
-
-Small datasets are often loaded synchronously using the `FILES()` table function, and large datasets are often loaded asynchronously using Broker Load. The two methods have different advantages and are described below.
-
-> **NOTE**
->
-> You can load data into StarRocks tables only as a user who has the INSERT privilege on those StarRocks tables. If you do not have the INSERT privilege, follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant the INSERT privilege to the user that you use to connect to your StarRocks cluster.
-
-## Using Broker Load
-
-An asynchronous Broker Load process handles making the connection to S3, pulling the data, and storing the data in StarRocks.
-
-### Advantages of Broker Load
-
-- Broker Load supports data transformation, UPSERT, and DELETE operations during loading.
-- Broker Load runs in the background and clients don't need to stay connected for the job to continue.
-- Broker Load is preferred for long-running jobs; the default timeout is 4 hours.
-- In addition to the Parquet and ORC file formats, Broker Load supports CSV files.
-
-### Data flow
-
-***Note: Processes that involve multiple components or steps may be easier to understand with a diagram. This example includes a diagram that helps describe the steps that happen when a user chooses the Broker Load option.***
-
-
-
-1. The user creates a load job.
-2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BE).
-3. The backend (BE) nodes pull the data from the source and load the data into StarRocks.
-
-### Typical example
-
-Create a table, start a load process that pulls a Parquet file from S3, and verify the progress and success of the data loading.
-
-> **NOTE**
->
-> The examples use a sample dataset in Parquet format. If you want to load a CSV or ORC file, that information is linked at the bottom of this page.
-
-#### Create a table
-
-Create a database for your table:
-
-```SQL
-CREATE DATABASE IF NOT EXISTS project;
-USE project;
-```
-
-Create a table. This schema matches a sample dataset in an S3 bucket hosted in a StarRocks account.
-
-```SQL
-DROP TABLE IF EXISTS user_behavior;
-
-CREATE TABLE `user_behavior` (
- `UserID` int(11),
- `ItemID` int(11),
- `CategoryID` int(11),
- `BehaviorType` varchar(65533),
- `Timestamp` datetime
-) ENGINE=OLAP
-DUPLICATE KEY(`UserID`)
-DISTRIBUTED BY HASH(`UserID`)
-PROPERTIES (
- "replication_num" = "1"
-);
-```
-
-#### Gather connection details
-
-> **NOTE**
->
-> The examples use IAM user-based authentication. Other authentication methods are available and linked at the bottom of this page.
-
-Loading data from S3 requires having the:
-
-- S3 bucket
-- S3 object keys (object names) if accessing a specific object in the bucket. Note that the object key can include a prefix if your S3 objects are stored in sub-folders. The full syntax is linked in **more information**.
-- S3 region
-- Access key and secret
-
-#### Start a Broker Load
-
-This job has four main sections:
-
-- `LABEL`: A string used when querying the state of a `LOAD` job.
-- `LOAD` declaration: The source URI, destination table, and the source data format.
-- `BROKER`: The connection details for the source.
-- `PROPERTIES`: Timeout value and any other properties to apply to this job.
-
-> **NOTE**
->
-> The dataset used in these examples is hosted in an S3 bucket in a StarRocks account. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. Substitute your credentials for `AAA` and `BBB` in the commands below.
-
-```SQL
-LOAD LABEL user_behavior
-(
- DATA INFILE("s3://starrocks-examples/user_behavior_sample_data.parquet")
- INTO TABLE user_behavior
- FORMAT AS "parquet"
-)
-WITH BROKER
-(
-    "aws.s3.enable_ssl" = "true",
-    "aws.s3.use_instance_profile" = "false",
-    "aws.s3.region" = "us-east-1",
-    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
-    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
-)
-PROPERTIES
-(
- "timeout" = "72000"
-);
-```
-
-#### Check progress
-
-Query the `information_schema.loads` table to track progress. If you have multiple `LOAD` jobs running you can filter on the `LABEL` associated with the job. In the output below there are two entries for the load job `user_behavior`. The first record shows a state of `CANCELLED`; scroll to the end of the output, and you see that `listPath failed`. The second record shows success with a valid AWS IAM access key and secret.
-
-```SQL
-SELECT * FROM information_schema.loads;
-```
-
-```SQL
-SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior';
-```
-
-```plaintext
-JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH|
-------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+
- 10121|user_behavior |project |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | |
- 10106|user_behavior |project |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | |
-```
-
-You can also check a subset of the data at this point.
-
-```SQL
-SELECT * from user_behavior LIMIT 10;
-```
-
-```plaintext
-UserID|ItemID|CategoryID|BehaviorType|Timestamp |
-------+------+----------+------------+-------------------+
-171146| 68873| 3002561|pv |2017-11-30 07:11:14|
-171146|146539| 4672807|pv |2017-11-27 09:51:41|
-171146|146539| 4672807|pv |2017-11-27 14:08:33|
-171146|214198| 1320293|pv |2017-11-25 22:38:27|
-171146|260659| 4756105|pv |2017-11-30 05:11:25|
-171146|267617| 4565874|pv |2017-11-27 14:01:25|
-171146|329115| 2858794|pv |2017-12-01 02:10:51|
-171146|458604| 1349561|pv |2017-11-25 22:49:39|
-171146|458604| 1349561|pv |2017-11-27 14:03:44|
-171146|478802| 541347|pv |2017-12-02 04:52:39|
-```
-
-## Using the `FILES()` table function
-
-### `FILES()` advantages
-
-`FILES()` can infer the data types of the columns of the Parquet data and generate the schema for a StarRocks table. This provides the ability to query the file directly from S3 with a `SELECT` or to have StarRocks automatically create a table for you based on the Parquet file schema.
-
-> **NOTE**
->
-> Schema inference is a new feature in version 3.1. It is provided for the Parquet format only, and nested types are not yet supported.
-
-### Typical examples
-
-There are three examples using the `FILES()` table function:
-
-- Querying the data directly from S3
-- Creating and loading the table using schema inference
-- Creating a table by hand and then loading the data
-
-> **NOTE**
->
-> The dataset used in these examples is hosted in an S3 bucket in a StarRocks account. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. Substitute your credentials for `AAA` and `BBB` in the commands below.
-
-#### Querying directly from S3
-
-Querying directly from S3 using `FILES()` can give a good preview of the content of a dataset before you create a table. For example:
-
-- Get a preview of the dataset without storing the data.
-- Query for the min and max values and decide what data types to use (a sketch of such a query follows the preview below).
-- Check for nulls.
-
-```sql
-SELECT * FROM FILES(
- "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
- "format" = "parquet",
- "aws.s3.region" = "us-east-1",
- "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
- "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
-) LIMIT 10;
-```
-
-> **NOTE**
->
-> Notice that the column names are provided by the Parquet file.
-
-```plaintext
-UserID|ItemID |CategoryID|BehaviorType|Timestamp |
-------+-------+----------+------------+-------------------+
- 1|2576651| 149192|pv |2017-11-25 01:21:25|
- 1|3830808| 4181361|pv |2017-11-25 07:04:53|
- 1|4365585| 2520377|pv |2017-11-25 07:49:06|
- 1|4606018| 2735466|pv |2017-11-25 13:28:01|
- 1| 230380| 411153|pv |2017-11-25 21:22:22|
- 1|3827899| 2920476|pv |2017-11-26 16:24:33|
- 1|3745169| 2891509|pv |2017-11-26 19:44:31|
- 1|1531036| 2920476|pv |2017-11-26 22:02:12|
- 1|2266567| 4145813|pv |2017-11-27 00:11:11|
- 1|2951368| 1080785|pv |2017-11-27 02:47:08|
-```
-
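-As mentioned in the list above, the same `FILES()` call can be used to check value ranges and null counts before you decide on column types. The following is a minimal sketch using standard aggregate functions against the same sample file (credentials are placeholders, as before):
-
-```sql
--- Value ranges and a null count, computed directly on the Parquet file in S3.
-SELECT
-    MIN(UserID)    AS min_user_id,
-    MAX(UserID)    AS max_user_id,
-    MIN(Timestamp) AS first_event,
-    MAX(Timestamp) AS last_event,
-    COUNT(*) - COUNT(BehaviorType) AS null_behavior_rows
-FROM FILES(
-    "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
-    "format" = "parquet",
-    "aws.s3.region" = "us-east-1",
-    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
-    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
-);
-```
-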
-#### Creating a table with schema inference
-
-This is a continuation of the previous example; the previous query is wrapped in `CREATE TABLE` to automate the table creation using schema inference. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files, because the Parquet format includes the column names and types, and StarRocks infers the schema for you.
-
-> **NOTE**
->
-> The syntax of `CREATE TABLE` when using schema inference does not allow setting the number of replicas, so set it before creating the table. The example below is for a system with a single replica:
->
-> `ADMIN SET FRONTEND CONFIG ('default_replication_num' ="1");`
-
-```sql
-CREATE DATABASE IF NOT EXISTS project;
-USE project;
-
-CREATE TABLE `user_behavior_inferred` AS
-SELECT * FROM FILES(
- "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
- "format" = "parquet",
- "aws.s3.region" = "us-east-1",
- "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
- "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
-);
-```
-
-```SQL
-DESCRIBE user_behavior_inferred;
-```
-
-```plaintext
-Field |Type |Null|Key |Default|Extra|
-------------+----------------+----+-----+-------+-----+
-UserID |bigint |YES |true | | |
-ItemID |bigint |YES |true | | |
-CategoryID |bigint |YES |true | | |
-BehaviorType|varchar(1048576)|YES |false| | |
-Timestamp |varchar(1048576)|YES |false| | |
-```
-
-> **NOTE**
->
-> Compare the inferred schema with the schema created by hand:
->
-> - data types
-> - nullable
-> - key fields
-
-```SQL
-SELECT * from user_behavior_inferred LIMIT 10;
-```
-
-```plaintext
-UserID|ItemID|CategoryID|BehaviorType|Timestamp |
-------+------+----------+------------+-------------------+
-171146| 68873| 3002561|pv |2017-11-30 07:11:14|
-171146|146539| 4672807|pv |2017-11-27 09:51:41|
-171146|146539| 4672807|pv |2017-11-27 14:08:33|
-171146|214198| 1320293|pv |2017-11-25 22:38:27|
-171146|260659| 4756105|pv |2017-11-30 05:11:25|
-171146|267617| 4565874|pv |2017-11-27 14:01:25|
-171146|329115| 2858794|pv |2017-12-01 02:10:51|
-171146|458604| 1349561|pv |2017-11-25 22:49:39|
-171146|458604| 1349561|pv |2017-11-27 14:03:44|
-171146|478802| 541347|pv |2017-12-02 04:52:39|
-```
-
-#### Loading into an existing table
-
-You may want to customize the table that you are inserting into, for example the:
-
-- column data type, nullable setting, or default values
-- key types and columns
-- distribution
-- etc.
-
-> **NOTE**
->
-> Creating the most efficient table structure requires knowledge of how the data will be used and of the content of the columns. This document does not cover table design; there is a link in **more information** at the end of the page.
-
-In this example we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in S3.
-
-- Since a query of the file in S3 indicates that the `Timestamp` column contains data that matches a `datetime` data type, the column type is specified in the following DDL.
-- By querying the data in S3 you can find that there are no null values in the dataset, so the DDL does not set any columns as nullable.
-- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID` (your use case might be different for this data; you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key):
-
-```SQL
-CREATE TABLE `user_behavior_declared` (
- `UserID` int(11),
- `ItemID` int(11),
- `CategoryID` int(11),
- `BehaviorType` varchar(65533),
- `Timestamp` datetime
-) ENGINE=OLAP
-DUPLICATE KEY(`UserID`)
-DISTRIBUTED BY HASH(`UserID`)
-PROPERTIES (
- "replication_num" = "1"
-);
-```
-
-After creating the table, you can load it with `INSERT INTO` … `SELECT FROM FILES()`:
-
-```SQL
-INSERT INTO user_behavior_declared
- SELECT * FROM FILES(
- "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
- "format" = "parquet",
- "aws.s3.region" = "us-east-1",
- "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
- "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
-);
-```
-
-## More information
-
-- For more details on synchronous and asynchronous data loading, see [Loading concepts](./loading_introduction/loading_concepts.md).
-- Learn about how Broker Load supports data transformation during loading at [Transform data at loading](../loading/Etl_in_loading.md) and [Change data through loading](../loading/Load_to_Primary_Key_tables.md).
-- This document only covered IAM user-based authentication. For other options please see [authenticate to AWS resources](../integrations/authenticate_to_aws_resources.md).
-- The [AWS CLI Command Reference](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html) covers the S3 URI in detail.
-- Learn more about [table design](../table_design/StarRocks_table_design.md).
-- Broker Load provides many more configuration and use options than those in the above examples; the details are in [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
diff --git a/docs/en/loading/Loading_intro.md b/docs/en/loading/Loading_intro.md
deleted file mode 100644
index ad0f9f0..0000000
--- a/docs/en/loading/Loading_intro.md
+++ /dev/null
@@ -1,209 +0,0 @@
----
-displayed_sidebar: docs
-toc_max_heading_level: 3
-keywords:
- - load
- - Insert
- - Stream Load
- - Broker Load
- - Pipe
- - Routine Load
- - Spark Load
----
-
-
-
-# Loading options
-
-Data loading is the process of cleansing and transforming raw data from various data sources based on your business requirements and loading the resulting data into StarRocks to facilitate analysis.
-
-StarRocks provides a variety of options for data loading:
-
-- Loading methods: Insert, Stream Load, Broker Load, Pipe, Routine Load, and Spark Load
-- Ecosystem tools: StarRocks Connector for Apache Kafka® (Kafka connector for short), StarRocks Connector for Apache Spark™ (Spark connector for short), StarRocks Connector for Apache Flink® (Flink connector for short), and other tools such as SMT, DataX, CloudCanal, and Kettle Connector
-- API: Stream Load transaction interface
-
-Each of these options has its own advantages and supports its own set of data source systems to pull from.
-
-This topic provides an overview of these options, along with comparisons between them to help you determine the loading option of your choice based on your data source, business scenario, data volume, data file format, and loading frequency.
-
-## Introduction to loading options
-
-This section mainly describes the characteristics and business scenarios of the loading options available in StarRocks.
-
-
-
-:::note
-
-In the following sections, "batch" or "batch loading" refers to the loading of a large amount of data from a specified source all at a time into StarRocks, whereas "stream" or "streaming" refers to the continuous loading of data in real time.
-
-:::
-
-## Loading methods
-
-### [Insert](InsertInto.md)
-
-**Business scenario:**
-
-- INSERT INTO VALUES: Append to an internal table with small amounts of data.
-- INSERT INTO SELECT:
-  - INSERT INTO SELECT FROM `<table_name>`: Append to a table with the result of a query on an internal or external table.
- - INSERT INTO SELECT FROM FILES(): Append to a table with the result of a query on data files in remote storage.
-
- :::note
-
- For AWS S3, this feature is supported from v3.1 onwards. For HDFS, Microsoft Azure Storage, Google GCS, and S3-compatible storage (such as MinIO), this feature is supported from v3.2 onwards.
-
- :::
-
-**File format:**
-
-- INSERT INTO VALUES: SQL
-- INSERT INTO SELECT:
-  - INSERT INTO SELECT FROM `<table_name>`: StarRocks tables
- - INSERT INTO SELECT FROM FILES(): Parquet and ORC
-
-**Data volume:** Not fixed (The data volume varies based on the memory size.)
-
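-For orientation, the three INSERT variants listed above look roughly like the following sketch. The table names, the S3 path, and the credentials are placeholders, not a runnable example against a real bucket.
-
-```SQL
--- Append a few rows to an internal table.
-INSERT INTO my_table VALUES (1, 'a'), (2, 'b');
-
--- Append the result of a query on an internal or external table.
-INSERT INTO my_table SELECT * FROM other_table;
-
--- Append the result of a query on data files in remote storage.
-INSERT INTO my_table
-SELECT * FROM FILES(
-    "path" = "s3://<bucket>/<prefix>/data.parquet",
-    "format" = "parquet",
-    "aws.s3.region" = "<region>",
-    "aws.s3.access_key" = "<access_key>",
-    "aws.s3.secret_key" = "<secret_key>"
-);
-```
-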
-### [Stream Load](StreamLoad.md)
-
-**Business scenario:** Batch load data from a local file system.
-
-**File format:** CSV and JSON
-
-**Data volume:** 10 GB or less
-
-### [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)
-
-**Business scenario:**
-
-- Batch load data from HDFS or cloud storage like AWS S3, Microsoft Azure Storage, Google GCS, and S3-compatible storage (such as MinIO).
-- Batch load data from a local file system or NAS.
-
-**File format:** CSV, Parquet, ORC, and JSON (supported since v3.2.3)
-
-**Data volume:** Dozens of GB to hundreds of GB
-
-### [Pipe](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md)
-
-**Business scenario:** Batch load or stream data from HDFS or AWS S3.
-
-:::note
-
-This loading method is supported from v3.2 onwards.
-
-:::
-
-**File format:** Parquet and ORC
-
-**Data volume:** 100 GB to 1 TB or more
-
-### [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)
-
-**Business scenario:** Stream data from Kafka.
-
-**File format:** CSV, JSON, and Avro (supported since v3.0.1)
-
-**Data volume:** MBs to GBs of data as mini-batches
-
-### [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md)
-
-**Business scenario:** Batch load data of Apache Hive™ tables stored in HDFS by using Spark clusters.
-
-**File format:** CSV, Parquet (supported since v2.0), and ORC (supported since v2.0)
-
-**Data volume:** Dozens of GB to TBs
-
-## Ecosystem tools
-
-### [Kafka connector](Kafka-connector-starrocks.md)
-
-**Business scenario:** Stream data from Kafka.
-
-### [Spark connector](Spark-connector-starrocks.md)
-
-**Business scenario:** Batch load data from Spark.
-
-### [Flink connector](Flink-connector-starrocks.md)
-
-**Business scenario:** Stream data from Flink.
-
-### [SMT](../integrations/loading_tools/SMT.md)
-
-**Business scenario:** Load data from data sources such as MySQL, PostgreSQL, SQL Server, Oracle, Hive, ClickHouse, and TiDB through Flink.
-
-### [DataX](../integrations/loading_tools/DataX-starrocks-writer.md)
-
-**Business scenario:** Synchronize data between various heterogeneous data sources, including relational databases (for example, MySQL and Oracle), HDFS, and Hive.
-
-### [CloudCanal](../integrations/loading_tools/CloudCanal.md)
-
-**Business scenario:** Migrate or synchronize data from source databases (for example, MySQL, Oracle, and PostgreSQL) to StarRocks.
-
-### [Kettle Connector](https://github.com/StarRocks/starrocks-connector-for-kettle)
-
-**Business scenario:** Integrate with Kettle. By combining Kettle's robust data processing and transformation capabilities with StarRocks's high-performance data storage and analytical abilities, more flexible and efficient data processing workflows can be achieved.
-
-## API
-
-### [Stream Load transaction interface](Stream_Load_transaction_interface.md)
-
-**Business scenario:** Implement two-phase commit (2PC) for transactions that are run to load data from external systems such as Flink and Kafka, while improving the performance of highly concurrent stream loads. This feature is supported from v2.4 onwards.
-
-**File format:** CSV and JSON
-
-**Data volume:** 10 GB or less
-
-## Choice of loading options
-
-This section lists the loading options available for common data sources, helping you choose the option that best suits your situation.
-
-### Object storage
-
-| **Data source** | **Available loading options** |
-| ------------------------------------- | ------------------------------------------------------------ |
-| AWS S3                                | (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.1)<br />(Batch) Broker Load<br />(Batch or streaming) Pipe (supported since v3.2)<br />See [Load data from AWS S3](s3.md). |
-| Microsoft Azure Storage               | (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)<br />(Batch) Broker Load<br />See [Load data from Microsoft Azure Storage](azure.md). |
-| Google GCS                            | (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)<br />(Batch) Broker Load<br />See [Load data from GCS](gcs.md). |
-| S3-compatible storage (such as MinIO) | (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)<br />(Batch) Broker Load<br />See [Load data from MinIO](minio.md). |
-
-### Local file system (including NAS)
-
-| **Data source** | **Available loading options** |
-| --------------------------------- | ------------------------------------------------------------ |
-| Local file system (including NAS) | (Batch) Stream Load<br />(Batch) Broker Load<br />See [Load data from a local file system](StreamLoad.md). |
-
-### HDFS
-
-| **Data source** | **Available loading options** |
-| --------------- | ------------------------------------------------------------ |
-| HDFS            | (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)<br />(Batch) Broker Load<br />(Batch or streaming) Pipe (supported since v3.2)<br />See [Load data from HDFS](hdfs_load.md). |
-
-### Flink, Kafka, and Spark
-
-| **Data source** | **Available loading options** |
-| --------------- | ------------------------------------------------------------ |
-| Apache Flink®   | (Streaming) [Flink connector](Flink-connector-starrocks.md)<br />**NOTE**<br />If the source data requires multi-table joins and extract, transform and load (ETL) operations, you can use Flink to read and pre-process the data and then use [Flink connector](Flink-connector-starrocks.md) to load the data into StarRocks. |
-| Apache Spark™   | (Batch) Create a [Hive catalog](../data_source/catalog/hive_catalog.md) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). |
-| Apache Iceberg  | (Batch) Create an [Iceberg catalog](../data_source/catalog/iceberg/iceberg_catalog.md) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). |
-| Apache Hudi     | (Batch) Create a [Hudi catalog](../data_source/catalog/hudi_catalog.md) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). |
-| Delta Lake      | (Batch) Create a [Delta Lake catalog](../data_source/catalog/deltalake_catalog.md) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). |
-| Elasticsearch   | (Batch) Create an [Elasticsearch catalog](../data_source/catalog/elasticsearch_catalog.md) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). |
-| Apache Paimon   | (Batch) Create a [Paimon catalog](../data_source/catalog/paimon_catalog.md) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). |
-
-Note that StarRocks provides [unified catalogs](https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/) from v3.2 onwards to help you handle tables from Hive, Iceberg, Hudi, and Delta Lake data sources as a unified data source without ingestion.
-
-### Internal and external databases
-
-| **Data source** | **Available loading options** |
-| ------------------------------------------------------------ | ------------------------------------------------------------ |
-| StarRocks | (Batch) Create a [StarRocks external table](../data_source/External_table.md#starrocks-external-table) and then use [INSERT INTO VALUES](InsertInto.md#insert-data-via-insert-into-values) to insert a few data records or [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table) to insert the data of a table. **NOTE** StarRocks external tables only support data writes. They do not support data reads. |
-| MySQL | (Batch) Create a [JDBC catalog](../data_source/catalog/jdbc_catalog.md) (recommended) or a [MySQL external table](../data_source/External_table.md#deprecated-mysql-external-table) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table).<br />(Streaming) Use [SMT, Flink CDC connector, Flink, and Flink connector](Flink_cdc_load.md). |
-| Other databases such as Oracle, PostgreSQL, SQL Server, ClickHouse, and TiDB | (Batch) Create a [JDBC catalog](../data_source/catalog/jdbc_catalog.md) (recommended) or a [JDBC external table](../data_source/External_table.md#external-table-for-a-jdbc-compatible-database) and then use [INSERT INTO SELECT FROM `<table_name>`](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table).<br />(Streaming) Use [SMT, Flink CDC connector, Flink, and Flink connector](loading_tools.md). |
diff --git a/docs/en/loading/RoutineLoad.md b/docs/en/loading/RoutineLoad.md
deleted file mode 100644
index 3f1170d..0000000
--- a/docs/en/loading/RoutineLoad.md
+++ /dev/null
@@ -1,553 +0,0 @@
----
-displayed_sidebar: docs
-keywords: ['Routine Load']
----
-
-# Load data using Routine Load
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-import QSTip from '../_assets/commonMarkdown/quickstart-routine-load-tip.mdx'
-
-
-
-This topic introduces how to create a Routine Load job to stream Kafka messages (events) into StarRocks, and familiarizes you with some basic concepts about Routine Load.
-
-To continuously load messages of a stream into StarRocks, you can store the message stream in a Kafka topic, and create a Routine Load job to consume the messages. The Routine Load job persists in StarRocks, generates a series of load tasks to consume the messages in all or part of the partitions in the topic, and loads the messages into StarRocks.
-
-A Routine Load job supports exactly-once delivery semantics to guarantee the data loaded into StarRocks is neither lost nor duplicated.
-
-Routine Load supports data transformation at data loading and supports data changes made by UPSERT and DELETE operations during data loading. For more information, see [Transform data at loading](../loading/Etl_in_loading.md) and [Change data through loading](../loading/Load_to_Primary_Key_tables.md).
-
-
-
-## Supported data formats
-
-Routine Load now supports consuming CSV, JSON, and Avro (supported since v3.0.1) formatted data from a Kafka cluster.
-
-> **NOTE**
->
-> For CSV data, take note of the following points:
->
-> - You can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter.
-> - Null values are denoted by using `\N`. For example, a data file consists of three columns, and a record from that data file holds data in the first and third columns but no data in the second column. In this situation, you need to use `\N` in the second column to denote a null value. This means the record must be compiled as `a,\N,b` instead of `a,,b`. `a,,b` denotes that the second column of the record holds an empty string.
-
-## Basic concepts
-
-
-
-### Terminology
-
-- **Load job**
-
- A Routine Load job is a long-running job. As long as its status is RUNNING, a load job continuously generates one or multiple concurrent load tasks which consume the messages in a topic of a Kafka cluster and load the data into StarRocks.
-
-- **Load task**
-
- A load job is split into multiple load tasks by certain rules. A load task is the basic unit of data loading. As an individual event, a load task implements the load mechanism based on [Stream Load](../loading/StreamLoad.md). Multiple load tasks concurrently consume the messages from different partitions of a topic, and load the data into StarRocks.
-
-### Workflow
-
-1. **Create a Routine Load job.**
- To load data from Kafka, you need to create a Routine Load job by running the [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) statement. The FE parses the statement, and creates the job according to the properties you have specified.
-
-2. **The FE splits the job into multiple load tasks.**
-
-   The FE splits the job into multiple load tasks based on certain rules. Each load task is an individual transaction.
- The splitting rules are as follows:
- - The FE calculates the actual concurrent number of the load tasks according to the desired concurrent number `desired_concurrent_number`, the partition number in the Kafka topic, and the number of the BE nodes that are alive.
- - The FE splits the job into load tasks based on the actual concurrent number calculated, and arranges the tasks in the task queue.
-
- Each Kafka topic consists of multiple partitions. The relation between the topic partition and the load task is as follows:
- - A partition is uniquely assigned to a load task, and all messages from the partition are consumed by the load task.
- - A load task can consume messages from one or more partitions.
- - All partitions are distributed evenly among load tasks.
-
-3. **Multiple load tasks run concurrently to consume the messages from multiple Kafka topic partitions, and load the data into StarRocks**
-
- 1. **The FE schedules and submits load tasks**: the FE schedules the load tasks in the queue on a timely basis, and assigns them to selected Coordinator BE nodes. The interval between load tasks is defined by the configuration item `max_batch_interval`. The FE distributes the load tasks evenly to all BE nodes. See [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md#examples) for more information about `max_batch_interval`.
-
- 2. The Coordinator BE starts the load task, consumes messages in partitions, parses and filters the data. A load task lasts until the pre-defined amount of messages are consumed or the pre-defined time limit is reached. The message batch size and time limit are defined in the FE configurations `max_routine_load_batch_size` and `routine_load_task_consume_second`. For detailed information, see [FE Configuration](../administration/management/FE_configuration.md). The Coordinator BE then distributes the messages to the Executor BEs. The Executor BEs write the messages to disks.
-
- > **NOTE**
- >
-    > StarRocks supports access to Kafka via security protocols including SASL_SSL, SASL_PLAINTEXT, SSL, and PLAINTEXT. This topic uses connecting to Kafka via PLAINTEXT as an example. If you need to connect to Kafka via other security protocols, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-4. **The FE generates new load tasks to load data continuously.**
-   After the Executor BEs have written the data to disks, the Coordinator BE reports the result of the load task to the FE. Based on the result, the FE then generates new load tasks to load the data continuously, or retries the failed tasks to make sure the data loaded into StarRocks is neither lost nor duplicated.
-
-## Create a Routine Load job
-
-The following three examples describe how to consume CSV-format, JSON-format and Avro-format data in Kafka, and load the data into StarRocks by creating a Routine Load job. For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-### Load CSV-format data
-
-This section describes how to create a Routine Load job to consume CSV-format data in a Kafka cluster, and load the data into StarRocks.
-
-#### Prepare a dataset
-
-Suppose there is a CSV-format dataset in the topic `ordertest1` in a Kafka cluster. Every message in the dataset includes six fields: order ID, payment date, customer name, nationality, gender, and price.
-
-```Plain
-2020050802,2020-05-08,Johann Georg Faust,Deutschland,male,895
-2020050802,2020-05-08,Julien Sorel,France,male,893
-2020050803,2020-05-08,Dorian Grey,UK,male,1262
-2020050901,2020-05-09,Anna Karenina,Russia,female,175
-2020051001,2020-05-10,Tess Durbeyfield,US,female,986
-2020051101,2020-05-11,Edogawa Conan,japan,male,8924
-```
-
-#### Create a table
-
-According to the fields of the CSV-format data, create the table `example_tbl1` in the database `example_db`. The following example creates a table with five columns, excluding the customer gender field in the CSV-format data.
-
-```SQL
-CREATE TABLE example_db.example_tbl1 (
- `order_id` bigint NOT NULL COMMENT "Order ID",
- `pay_dt` date NOT NULL COMMENT "Payment date",
- `customer_name` varchar(26) NULL COMMENT "Customer name",
- `nationality` varchar(26) NULL COMMENT "Nationality",
-    `price` double NULL COMMENT "Price"
-)
-ENGINE=OLAP
-DUPLICATE KEY (order_id,pay_dt)
-DISTRIBUTED BY HASH(`order_id`);
-```
-
-> **NOTICE**
->
-> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-#### Submit a Routine Load job
-
-Execute the following statement to submit a Routine Load job named `example_tbl1_ordertest1` to consume the messages in the topic `ordertest1` and load the data into the table `example_tbl1`. The load task consumes the messages from the initial offset in the specified partitions of the topic.
-
-```SQL
-CREATE ROUTINE LOAD example_db.example_tbl1_ordertest1 ON example_tbl1
-COLUMNS TERMINATED BY ",",
-COLUMNS (order_id, pay_dt, customer_name, nationality, temp_gender, price)
-PROPERTIES
-(
- "desired_concurrent_number" = "5"
-)
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>,<kafka_broker2_ip>:<kafka_broker2_port>",
- "kafka_topic" = "ordertest1",
- "kafka_partitions" = "0,1,2,3,4",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```
-
-After submitting the load job, you can execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job.
-
-- **load job name**
-
-  There could be multiple load jobs on a table. Therefore, we recommend that you name a load job after the corresponding Kafka topic and the time when the load job is submitted. This helps you distinguish the load jobs on each table.
-
-- **Column separator**
-
-  The property `COLUMNS TERMINATED BY` defines the column separator of the CSV-format data. The default is `\t`.
-
-- **Kafka topic partition and offset**
-
-  You can specify the properties `kafka_partitions` and `kafka_offsets` to define the partitions and the offsets from which the messages are consumed. For example, if you want the load job to consume messages from the Kafka partitions `"0,1,2,3,4"` of the topic `ordertest1` and you need to specify a separate starting offset for each partition, you can configure the properties as follows:
-
- ```SQL
- "kafka_partitions" ="0,1,2,3,4",
- "kafka_offsets" = "OFFSET_BEGINNING, OFFSET_END, 1000, 2000, 3000"
- ```
-
- You can also set the default offsets of all partitions with the property `property.kafka_default_offsets`.
-
- ```SQL
- "kafka_partitions" ="0,1,2,3,4",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
- ```
-
- For detailed information, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
-
-- **Data mapping and transformation**
-
-  To specify the mapping and transformation relationship between the CSV-format data and the StarRocks table, you need to use the `COLUMNS` parameter.
-
- **Data mapping:**
-
- - StarRocks extracts the columns in the CSV-format data and maps them **in sequence** onto the fields declared in the `COLUMNS` parameter.
-
- - StarRocks extracts the fields declared in the `COLUMNS` parameter and maps them **by name** onto the columns of StarRocks table.
-
- **Data transformation:**
-
-  Because the example does not load the customer gender column of the CSV-format data, the field `temp_gender` in the `COLUMNS` parameter is used as a placeholder for this column. The other fields are mapped directly onto the columns of the StarRocks table `example_tbl1`.
-
- For more information about data transformation, see [Transform data at loading](./Etl_in_loading.md).
-
- > **NOTE**
- >
- > You do not need to specify the `COLUMNS` parameter if the names, number, and order of the columns in the CSV-format data completely correspond to those of the StarRocks table.
-
-- **Task concurrency**
-
- When there are many Kafka topic partitions and enough BE nodes, you can accelerate the loading by increasing the task concurrency.
-
-  To increase the actual load task concurrency, you can increase the desired load task concurrency `desired_concurrent_number` when you create a Routine Load job. You can also set the FE dynamic configuration item `max_routine_load_task_concurrent_num` (the default maximum load task concurrency) to a larger value. For more information about `max_routine_load_task_concurrent_num`, see [FE configuration items](../administration/management/FE_configuration.md).
-
- The actual task concurrency is defined by the minimum value among the number of BE nodes that are alive, the number of the pre-specified Kafka topic partitions, and the values of `desired_concurrent_number` and `max_routine_load_task_concurrent_num`.
-
- In the example, the number of BE nodes that are alive is `5`, the number of the pre-specified Kafka topic partitions is `5`, and the value of `max_routine_load_task_concurrent_num` is `5`. To increase the actual load task concurrency, you can increase the `desired_concurrent_number` from the default value `3` to `5`.
-
- For more about the properties, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
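-
-  To make the calculation above concrete, the following minimal sketch restates it with the numbers used in this example and shows one way to check the FE limit. Treat it as an illustration; the limit is assumed to be at its default value of `5` here.
-
-  ```SQL
-  -- Actual task concurrency
-  --   = min(alive BE nodes, pre-specified Kafka topic partitions,
-  --         desired_concurrent_number, max_routine_load_task_concurrent_num)
-  --   = min(5, 5, 5, 5) = 5 in this example.
-
-  -- Check the current FE limit:
-  ADMIN SHOW FRONTEND CONFIG LIKE "%max_routine_load_task_concurrent_num%";
-  ```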
-
-### Load JSON-format data
-
-This section describes how to create a Routine Load job to consume JSON-format data in a Kafka cluster, and load the data into StarRocks.
-
-#### Prepare a dataset
-
-Suppose there is a JSON-format dataset in the topic `ordertest2` in a Kafka cluster. The dataset includes five keys: commodity ID, customer name, country, payment time, and price. Besides, you want to transform the payment time column into the DATE type and load it into the `pay_dt` column in the StarRocks table.
-
-```JSON
-{"commodity_id": "1", "customer_name": "Mark Twain", "country": "US","pay_time": 1589191487,"price": 875}
-{"commodity_id": "2", "customer_name": "Oscar Wilde", "country": "UK","pay_time": 1589191487,"price": 895}
-{"commodity_id": "3", "customer_name": "Antoine de Saint-Exupéry","country": "France","pay_time": 1589191487,"price": 895}
-```
-
-> **CAUTION**
->
-> Each JSON object in a row must be in one Kafka message. Otherwise, a JSON parsing error is returned.
-
-#### Create a table
-
-According to the keys of the JSON-format data, create the table `example_tbl2` in the database `example_db`.
-
-```SQL
-CREATE TABLE `example_tbl2` (
- `commodity_id` varchar(26) NULL COMMENT "Commodity ID",
- `customer_name` varchar(26) NULL COMMENT "Customer name",
- `country` varchar(26) NULL COMMENT "Country",
- `pay_time` bigint(20) NULL COMMENT "Payment time",
- `pay_dt` date NULL COMMENT "Payment date",
-    `price` double SUM NULL COMMENT "Price"
-)
-ENGINE=OLAP
-AGGREGATE KEY(`commodity_id`,`customer_name`,`country`,`pay_time`,`pay_dt`)
-DISTRIBUTED BY HASH(`commodity_id`);
-```
-
-> **NOTICE**
->
-> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-#### Submit a Routine Load job
-
-Execute the following statement to submit a Routine Load job named `example_tbl2_ordertest2` to consume the messages in the topic `ordertest2` and load the data into the table `example_tbl2`. The load task consumes the messages from the initial offset in the specified partitions of the topic.
-
-```SQL
-CREATE ROUTINE LOAD example_db.example_tbl2_ordertest2 ON example_tbl2
-COLUMNS(commodity_id, customer_name, country, pay_time, price, pay_dt=from_unixtime(pay_time, '%Y%m%d'))
-PROPERTIES
-(
- "desired_concurrent_number" = "5",
- "format" = "json",
- "jsonpaths" = "[\"$.commodity_id\",\"$.customer_name\",\"$.country\",\"$.pay_time\",\"$.price\"]"
- )
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>,<kafka_broker2_ip>:<kafka_broker2_port>",
- "kafka_topic" = "ordertest2",
- "kafka_partitions" ="0,1,2,3,4",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```
-
-After submitting the load job, you can execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job.
-
-- **Data format**
-
- You need to specify `"format" = "json"` in the clause `PROPERTIES` to define that the data format is JSON.
-
-- **Data mapping and transformation**
-
-  To specify the mapping and transformation relationship between the JSON-format data and the StarRocks table, you need to specify the `COLUMNS` parameter and the `jsonpaths` property. The order of the fields declared in the `COLUMNS` parameter must match that of the JSON-format data, and the names of the fields must match those of the StarRocks table. The `jsonpaths` property is used to extract the required fields from the JSON data. These fields are then named by the `COLUMNS` parameter.
-
- **Data mapping:**
-
-  - StarRocks extracts the `commodity_id`, `customer_name`, `country`, `pay_time`, and `price` keys of the JSON-format data and maps them onto the keys declared in the `jsonpaths` property.
-
- - StarRocks extracts the keys declared in the `jsonpaths` property and maps them **in sequence** onto the fields declared in the `COLUMNS` parameter.
-
- - StarRocks extracts the fields declared in the `COLUMNS` parameter and maps them **by name** onto the columns of StarRocks table.
-
-  **Data transformation**:
-
-  Because the example needs to transform the key `pay_time` to the DATE data type and load the data into the `pay_dt` column in the StarRocks table, the from_unixtime function is used in the `COLUMNS` parameter (see the sketch after this list). The other fields are mapped directly onto the columns of the table `example_tbl2`.
-
- For more information about data transformation, see [Transform data at loading](./Etl_in_loading.md).
-
- > **NOTE**
- >
- > You do not need to specify the `COLUMNS` parameter if the names and number of the keys in the JSON object completely match those of fields in the StarRocks table.
-
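-The following minimal sketch traces one sample message through the mapping chain described above. The standalone SELECT at the end only shows what the derived expression computes (the result depends on the session time zone) and is not part of the load job.
-
-```SQL
--- For the message {"commodity_id": "1", "customer_name": "Mark Twain", "country": "US", "pay_time": 1589191487, "price": 875}:
---   1. jsonpaths extracts, in order:  "1", "Mark Twain", "US", 1589191487, 875
---   2. COLUMNS names these fields:    commodity_id, customer_name, country, pay_time, price
---      and derives:                   pay_dt = from_unixtime(pay_time, '%Y%m%d')
---   3. The fields are then loaded by name into the columns of example_tbl2.
-
--- What the derived expression computes for this message:
-SELECT from_unixtime(1589191487, '%Y%m%d');
-```
-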
-### Load Avro-format data
-
-Since v3.0.1, StarRocks supports loading Avro data by using Routine Load.
-
-#### Prepare a dataset
-
-##### Avro schema
-
-1. Create the following Avro schema file `avro_schema.avsc`:
-
- ```JSON
- {
- "type": "record",
- "name": "sensor_log",
- "fields" : [
- {"name": "id", "type": "long"},
- {"name": "name", "type": "string"},
- {"name": "checked", "type" : "boolean"},
- {"name": "data", "type": "double"},
- {"name": "sensor_type", "type": {"type": "enum", "name": "sensor_type_enum", "symbols" : ["TEMPERATURE", "HUMIDITY", "AIR-PRESSURE"]}}
- ]
- }
- ```
-
-2. Register the Avro schema in the [Schema Registry](https://docs.confluent.io/cloud/current/get-started/schema-registry.html#create-a-schema).
-
-##### Avro data
-
-Prepare the Avro data and send it to the Kafka topic `topic_0`.
-
-#### Create a table
-
-According to the fields of Avro data, create a table `sensor_log` in the target database `example_db` in the StarRocks cluster. The column names of the table must match the field names in the Avro data. For the data type mapping between the table columns and the Avro data fields, see [Data types mapping](#data-types-mapping).
-
-```SQL
-CREATE TABLE example_db.sensor_log (
- `id` bigint NOT NULL COMMENT "sensor id",
- `name` varchar(26) NOT NULL COMMENT "sensor name",
- `checked` boolean NOT NULL COMMENT "checked",
- `data` double NULL COMMENT "sensor data",
- `sensor_type` varchar(26) NOT NULL COMMENT "sensor type"
-)
-ENGINE=OLAP
-DUPLICATE KEY (id)
-DISTRIBUTED BY HASH(`id`);
-```
-
-> **NOTICE**
->
-> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-#### Submit a Routine Load job
-
-Execute the following statement to submit a Routine Load job named `sensor_log_load_job` to consume the Avro messages in the Kafka topic `topic_0` and load the data into the table `sensor_log` in the database `example_db`. The load job consumes the messages from the initial offset in the specified partitions of the topic.
-
-```SQL
-CREATE ROUTINE LOAD example_db.sensor_log_load_job ON sensor_log
-PROPERTIES
-(
- "format" = "avro"
-)
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>,<kafka_broker2_ip>:<kafka_broker2_port>,...",
- "confluent.schema.registry.url" = "http://172.xx.xxx.xxx:8081",
- "kafka_topic" = "topic_0",
- "kafka_partitions" = "0,1,2,3,4,5",
- "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```
-
-- Data Format
-
-  You need to specify `"format" = "avro"` in the clause `PROPERTIES` to define that the data format is Avro.
-
-- Schema Registry
-
- You need to configure `confluent.schema.registry.url` to specify the URL of the Schema Registry where the Avro schema is registered. StarRocks retrieves the Avro schema by using this URL. The format is as follows:
-
- ```Plaintext
-    confluent.schema.registry.url = http[s]://[<schema-registry-api-key>:<schema-registry-api-secret>@]<hostname or ip address>[:<port>]
- ```
-
-- Data mapping and transformation
-
-  To specify the mapping and transformation relationship between the Avro-format data and the StarRocks table, you need to specify the `COLUMNS` parameter and the `jsonpaths` property. The order of the fields declared in the `COLUMNS` parameter must match that of the fields in the `jsonpaths` property, and the names of the fields must match those of the StarRocks table. The `jsonpaths` property is used to extract the required fields from the Avro data. These fields are then named by the `COLUMNS` parameter (an illustrative example follows at the end of this section).
-
- For more information about data transformation, see [Transform data at loading](./Etl_in_loading.md).
-
-  > **NOTE**
- >
- > You do not need to specify the `COLUMNS` parameter if the names and number of the fields in the Avro record completely match those of columns in the StarRocks table.
-
-After submitting the load job, you can execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job.
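-
-The job above relies on the Avro field names matching the columns of the StarRocks table. The following sketch is purely illustrative (the job name and the broker placeholder are hypothetical, and this explicit mapping is equivalent to the by-name mapping used above); it shows what the `COLUMNS` parameter and `jsonpaths` property described earlier would look like for this dataset:
-
-```SQL
-CREATE ROUTINE LOAD example_db.sensor_log_load_job_explicit ON sensor_log
-COLUMNS (id, name, checked, data, sensor_type)
-PROPERTIES
-(
-    "format" = "avro",
-    "jsonpaths" = "[\"$.id\",\"$.name\",\"$.checked\",\"$.data\",\"$.sensor_type\"]"
-)
-FROM KAFKA
-(
-    "kafka_broker_list" = "<kafka_broker1_ip>:<kafka_broker1_port>",
-    "confluent.schema.registry.url" = "http://172.xx.xxx.xxx:8081",
-    "kafka_topic" = "topic_0",
-    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
-);
-```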
-
-#### Data types mapping
-
-The data type mapping between the Avro data fields you want to load and the StarRocks table columns is as follows:
-
-##### Primitive types
-
-| Avro | StarRocks |
-| ------- | --------- |
-| null    | NULL      |
-| boolean | BOOLEAN |
-| int | INT |
-| long | BIGINT |
-| float | FLOAT |
-| double | DOUBLE |
-| bytes | STRING |
-| string | STRING |
-
-##### Complex types
-
-| Avro | StarRocks |
-| -------------- | ------------------------------------------------------------ |
-| record | Load the entire RECORD or its subfields into StarRocks as JSON. |
-| enums | STRING |
-| arrays | ARRAY |
-| maps | JSON |
-| union(T, null) | NULLABLE(T) |
-| fixed | STRING |
-
-#### Limits
-
-- Currently, StarRocks does not support schema evolution.
-- Each Kafka message must only contain a single Avro data record.
-
-## Check a load job and task
-
-### Check a load job
-
-Execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job `example_tbl2_ordertest2`. StarRocks returns the execution state `State`, the statistical information (including the total rows consumed and the total rows loaded) `Statistic`, and the progress of the load job `Progress`.
-
-If the state of the load job is automatically changed to **PAUSED**, it is possibly because the number of error rows has exceeded the threshold. For detailed instructions on setting this threshold, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). You can check the fields `ReasonOfStateChanged` and `ErrorLogUrls` to identify and troubleshoot the problem. After you have fixed the problem, you can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume the **PAUSED** load job.
-
-If the state of the load job is **CANCELLED**, it is possibly because the load job encountered an exception (for example, the table has been dropped). You can check the fields `ReasonOfStateChanged` and `ErrorLogUrls` to identify and troubleshoot the problem. However, you cannot resume a **CANCELLED** load job.
-
-```SQL
-MySQL [example_db]> SHOW ROUTINE LOAD FOR example_tbl2_ordertest2 \G
-*************************** 1. row ***************************
- Id: 63013
- Name: example_tbl2_ordertest2
- CreateTime: 2022-08-10 17:09:00
- PauseTime: NULL
- EndTime: NULL
- DbName: default_cluster:example_db
- TableName: example_tbl2
- State: RUNNING
- DataSourceType: KAFKA
- CurrentTaskNum: 3
- JobProperties: {"partitions":"*","partial_update":"false","columnToColumnExpr":"commodity_id,customer_name,country,pay_time,pay_dt=from_unixtime(`pay_time`, '%Y%m%d'),price","maxBatchIntervalS":"20","whereExpr":"*","dataFormat":"json","timezone":"Asia/Shanghai","format":"json","json_root":"","strict_mode":"false","jsonpaths":"[\"$.commodity_id\",\"$.customer_name\",\"$.country\",\"$.pay_time\",\"$.price\"]","desireTaskConcurrentNum":"3","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"3","maxBatchRows":"200000"}
-DataSourceProperties: {"topic":"ordertest2","currentKafkaPartitions":"0,1,2,3,4","brokerList":":,:"}
- CustomProperties: {"kafka_default_offsets":"OFFSET_BEGINNING"}
- Statistic: {"receivedBytes":230,"errorRows":0,"committedTaskNum":1,"loadedRows":2,"loadRowsRate":0,"abortedTaskNum":0,"totalRows":2,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":522}
- Progress: {"0":"1","1":"OFFSET_ZERO","2":"OFFSET_ZERO","3":"OFFSET_ZERO","4":"OFFSET_ZERO"}
-ReasonOfStateChanged:
- ErrorLogUrls:
- OtherMsg:
-```
-
-> **CAUTION**
->
-> You cannot check a load job that has stopped or has not yet started.
-
-### Check a load task
-
-Execute the [SHOW ROUTINE LOAD TASK](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md) statement to check the load tasks of the load job `example_tbl2_ordertest2`, such as how many tasks are currently running, the Kafka topic partitions that are consumed and the consumption progress `DataSourceProperties`, and the corresponding Coordinator BE node `BeId`.
-
-```SQL
-MySQL [example_db]> SHOW ROUTINE LOAD TASK WHERE JobName = "example_tbl2_ordertest2" \G
-*************************** 1. row ***************************
- TaskId: 18c3a823-d73e-4a64-b9cb-b9eced026753
- TxnId: -1
- TxnStatus: UNKNOWN
- JobId: 63013
- CreateTime: 2022-08-10 17:09:05
- LastScheduledTime: 2022-08-10 17:47:27
- ExecuteStartTime: NULL
- Timeout: 60
- BeId: -1
-DataSourceProperties: {"1":0,"4":0}
- Message: there is no new data in kafka, wait for 20 seconds to schedule again
-*************************** 2. row ***************************
- TaskId: f76c97ac-26aa-4b41-8194-a8ba2063eb00
- TxnId: -1
- TxnStatus: UNKNOWN
- JobId: 63013
- CreateTime: 2022-08-10 17:09:05
- LastScheduledTime: 2022-08-10 17:47:26
- ExecuteStartTime: NULL
- Timeout: 60
- BeId: -1
-DataSourceProperties: {"2":0}
- Message: there is no new data in kafka, wait for 20 seconds to schedule again
-*************************** 3. row ***************************
- TaskId: 1a327a34-99f4-4f8d-8014-3cd38db99ec6
- TxnId: -1
- TxnStatus: UNKNOWN
- JobId: 63013
- CreateTime: 2022-08-10 17:09:26
- LastScheduledTime: 2022-08-10 17:47:27
- ExecuteStartTime: NULL
- Timeout: 60
- BeId: -1
-DataSourceProperties: {"0":2,"3":0}
- Message: there is no new data in kafka, wait for 20 seconds to schedule again
-```
-
-## Pause a load job
-
-You can execute the [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) statement to pause a load job. The state of the load job will be **PAUSED** after the statement is executed. However, the load job is not stopped. You can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume it. You can also check its status with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement.
-
-The following example pauses the load job `example_tbl2_ordertest2`:
-
-```SQL
-PAUSE ROUTINE LOAD FOR example_tbl2_ordertest2;
-```
-
-## Resume a load job
-
-You can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume a paused load job. The state of the load job will be **NEED_SCHEDULE** temporarily (because the load job is being re-scheduled), and then become **RUNNING**. You can check its status with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement.
-
-The following example resumes the paused load job `example_tbl2_ordertest2`:
-
-```SQL
-RESUME ROUTINE LOAD FOR example_tbl2_ordertest2;
-```
-
-## Alter a load job
-
-Before altering a load job, you must pause it with the [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) statement. Then you can execute the [ALTER ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/ALTER_ROUTINE_LOAD.md). After altering it, you can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume it, and check its status with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement.
-
-Suppose the number of BE nodes that are alive increases to `6` and the Kafka topic partitions to be consumed are `"0,1,2,3,4,5,6,7"`. If you want to increase the actual load task concurrency, you can execute the following statement to increase the desired task concurrency `desired_concurrent_number` to `6` (greater than or equal to the number of BE nodes that are alive), and specify the Kafka topic partitions and initial offsets.
-
-> **NOTE**
->
-> Because the actual task concurrency is determined by the minimum value of multiple parameters, you must make sure that the value of the FE dynamic parameter `max_routine_load_task_concurrent_num` is greater than or equal to `6`.
-
-```SQL
-ALTER ROUTINE LOAD FOR example_tbl2_ordertest2
-PROPERTIES
-(
- "desired_concurrent_number" = "6"
-)
-FROM kafka
-(
- "kafka_partitions" = "0,1,2,3,4,5,6,7",
- "kafka_offsets" = "OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_END,OFFSET_END,OFFSET_END,OFFSET_END"
-);
-```
-
-## Stop a load job
-
-You can execute the [STOP ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/STOP_ROUTINE_LOAD.md) statement to stop a load job. The state of the load job will be **STOPPED** after the statement is executed, and you cannot resume a stopped load job. You cannot check the status of a stopped load job with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement.
-
-The following example stops the load job `example_tbl2_ordertest2`:
-
-```SQL
-STOP ROUTINE LOAD FOR example_tbl2_ordertest2;
-```
diff --git a/docs/en/loading/SQL_transaction.md b/docs/en/loading/SQL_transaction.md
deleted file mode 100644
index 6ea4540..0000000
--- a/docs/en/loading/SQL_transaction.md
+++ /dev/null
@@ -1,156 +0,0 @@
----
-displayed_sidebar: docs
----
-
-import Beta from '../_assets/commonMarkdown/_beta.mdx'
-
-# SQL Transaction
-
-
-
-Start a simple SQL transaction to commit multiple DML statements in a batch.
-
-## Overview
-
-From v3.5.0, StarRocks supports SQL transactions to ensure the atomicity of updates when you manipulate data across multiple tables.
-
-A transaction consists of multiple SQL statements that are processed within the same atomic unit. The statements in the transaction are either applied or undone together, thus guaranteeing the ACID (atomicity, consistency, isolation, and durability) properties of the transaction.
-
-Currently, the SQL transaction in StarRocks supports the following operations:
-- INSERT INTO
-- UPDATE
-- DELETE
-
-:::note
-
-- INSERT OVERWRITE is not supported currently.
-- Multiple INSERT statements against the same table within a transaction are supported only in shared-data clusters from v4.0 onwards.
-- UPDATE and DELETE are supported only in shared-data clusters from v4.0 onwards.
-
-:::
-
-From v4.0 onwards, within one SQL transaction:
-- **Multiple INSERT statements** against the same table are supported.
-- **Only one UPDATE *OR* DELETE** statement against one table is allowed.
-- **An UPDATE *OR* DELETE** statement **after** INSERT statements against the same table is **not allowed**.
-
-The ACID properties of the transaction are guaranteed only on the limited READ COMMITTED isolation level, that is:
-- A statement operates only on data that was committed before the statement began.
-- Two successive statements within the same transaction may operate on different data if another transaction is committed between the execution of the first and the second statements.
-- Data changes brought by preceding DML statements are invisible to subsequent statements within the same transaction.
-
-A transaction is associated with a single session. Multiple sessions cannot share the same transaction.
-
-## Usage
-
-1. A transaction must be started by executing a START TRANSACTION statement. StarRocks also supports the synonym BEGIN.
-
- ```SQL
- { START TRANSACTION | BEGIN [ WORK ] }
- ```
-
-2. After starting the transaction, you can define multiple DML statements in the transaction. For detailed information, see [Usage notes](#usage-notes).
-
-3. A transaction must be ended explicitly by executing `COMMIT` or `ROLLBACK`.
-
- - To apply (commit) the transaction, use the following syntax:
-
- ```SQL
- COMMIT [ WORK ]
- ```
-
- - To undo (roll back) the transaction, use the following syntax:
-
- ```SQL
- ROLLBACK [ WORK ]
- ```
-
-## Example
-
-1. Create the demo table `desT` in a shared-data cluster, and load data into it.
-
- :::note
- If you want to try this example in a shared-nothing cluster, you must skip Step 3 and define only one INSERT statement in Step 4.
- :::
-
- ```SQL
- CREATE TABLE desT (
- k int,
- v int
- ) PRIMARY KEY(k);
-
- INSERT INTO desT VALUES
- (1,1),
- (2,2),
- (3,3);
- ```
-
-2. Start a transaction.
-
- ```SQL
- START TRANSACTION;
- ```
-
- Or
-
- ```SQL
- BEGIN WORK;
- ```
-
-3. Define an UPDATE or DELETE statement.
-
- ```SQL
-    UPDATE desT SET v = v + 1 WHERE k = 1;
- ```
-
- Or
-
- ```SQL
- DELETE FROM desT WHERE k = 1;
- ```
-
-4. Define multiple INSERT statements.
-
- ```SQL
- -- Insert data with specified values.
- INSERT INTO desT VALUES (4,4);
- -- Insert data from a native table to another.
- INSERT INTO desT SELECT * FROM srcT;
- -- Insert data from remote storage.
- INSERT INTO desT
- SELECT * FROM FILES(
- "path" = "s3://inserttest/parquet/srcT.parquet",
- "format" = "parquet",
- "aws.s3.access_key" = "XXXXXXXXXX",
- "aws.s3.secret_key" = "YYYYYYYYYY",
- "aws.s3.region" = "us-west-2"
- );
- ```
-
-5. Apply or undo the transaction.
-
- - To apply the SQL statements in the transaction.
-
- ```SQL
- COMMIT WORK;
- ```
-
- - To undo the SQL statements in the transaction.
-
- ```SQL
- ROLLBACK WORK;
- ```
-
-## Usage notes
-
-- Currently, StarRocks supports SELECT, INSERT, UPDATE, and DELETE statements in SQL transactions. UPDATE and DELETE are supported only in shared-data clusters from v4.0 onwards.
-- SELECT statements against the tables whose data have been changed in the same transaction are not allowed.
-- Multiple INSERT statements against the same table within a transaction are supported only in shared-data clusters from v4.0 onwards.
-- Within a transaction, you can only define one UPDATE or DELETE statement against each table, and it must precede the INSERT statements.
-- Subsequent DML statements cannot read the uncommitted changes brought by preceding statements within the same transaction. For example, the target table of the preceding INSERT statement cannot be the source table of subsequent statements. Otherwise, the system returns an error.
-- All target tables of the DML statements in a transaction must be within the same database. Cross-database operations are not allowed.
-- Currently, INSERT OVERWRITE is not supported.
-- Nested transactions are not allowed. You cannot specify BEGIN WORK within a BEGIN-COMMIT/ROLLBACK pair.
-- If the session to which an ongoing transaction belongs is terminated or closed, the transaction is automatically rolled back.
-- StarRocks only supports the limited READ COMMITTED transaction isolation level, as described above.
-- Write conflict checks are not supported. When two transactions write to the same table simultaneously, both transactions can be committed successfully. The visibility (order) of the data changes depends on the execution order of the COMMIT WORK statements.
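-
-The following minimal sketch illustrates the statement-ordering rules above, reusing the `desT` table from the example (a shared-data cluster of v4.0 or later is assumed). The statement marked "not allowed" is expected to be rejected:
-
-```SQL
-BEGIN WORK;
-UPDATE desT SET v = v + 1 WHERE k = 2;  -- allowed: at most one UPDATE or DELETE per table, placed before any INSERT
-INSERT INTO desT VALUES (5, 5);         -- allowed: INSERT statements follow the UPDATE
-INSERT INTO desT VALUES (6, 6);         -- allowed: multiple INSERT statements into the same table
-UPDATE desT SET v = 0 WHERE k = 3;      -- not allowed: UPDATE after an INSERT against the same table
-ROLLBACK WORK;
-```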
diff --git a/docs/en/loading/Spark-connector-starrocks.md b/docs/en/loading/Spark-connector-starrocks.md
deleted file mode 100644
index 89dd483..0000000
--- a/docs/en/loading/Spark-connector-starrocks.md
+++ /dev/null
@@ -1,842 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Load data using Spark connector (recommended)
-
-StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you load data into a StarRocks table by using Spark. The basic principle is to accumulate the data and then load it into StarRocks all at once through [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). The Spark connector is implemented based on Spark DataSource V2. A DataSource can be created by using Spark DataFrames or Spark SQL, and both batch and structured streaming modes are supported.
-
-> **NOTICE**
->
-> Only users with the SELECT and INSERT privileges on a StarRocks table can load data into this table. You can follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant these privileges to a user.
-
-## Version requirements
-
-| Spark connector | Spark | StarRocks | Java | Scala |
-| --------------- | ---------------- | ------------- | ---- | ----- |
-| 1.1.2 | 3.2, 3.3, 3.4, 3.5 | 2.5 and later | 8 | 2.12 |
-| 1.1.1 | 3.2, 3.3, or 3.4 | 2.5 and later | 8 | 2.12 |
-| 1.1.0 | 3.2, 3.3, or 3.4 | 2.5 and later | 8 | 2.12 |
-
-> **NOTICE**
->
-> - Please see [Upgrade Spark connector](#upgrade-spark-connector) for behavior changes among different versions of the Spark connector.
-> - The Spark connector does not provide the MySQL JDBC driver since version 1.1.1, and you need to import the driver into the Spark classpath manually. You can find the driver on the [MySQL site](https://dev.mysql.com/downloads/connector/j/) or in [Maven Central](https://repo1.maven.org/maven2/mysql/mysql-connector-java/).
-
-## Obtain Spark connector
-
-You can obtain the Spark connector JAR file in the following ways:
-
-- Directly download the compiled Spark Connector JAR file.
-- Add the Spark connector as a dependency in your Maven project and then download the JAR file.
-- Compile the source code of the Spark Connector into a JAR file by yourself.
-
-The naming format of the Spark connector JAR file is `starrocks-spark-connector-${spark_version}_${scala_version}-${connector_version}.jar`.
-
-For example, if you install Spark 3.2 and Scala 2.12 in your environment and you want to use Spark connector 1.1.0, you can use `starrocks-spark-connector-3.2_2.12-1.1.0.jar`.
-
-> **NOTICE**
->
-> In general, the latest version of the Spark connector only maintains compatibility with the three most recent versions of Spark.
-
-### Download the compiled Jar file
-
-Directly download the corresponding version of the Spark connector JAR from the [Maven Central Repository](https://repo1.maven.org/maven2/com/starrocks).
-
-### Maven Dependency
-
-1. In your Maven project's `pom.xml` file, add the Spark connector as a dependency according to the following format. Replace `spark_version`, `scala_version`, and `connector_version` with the respective versions.
-
-    ```xml
-    <dependency>
-        <groupId>com.starrocks</groupId>
-        <artifactId>starrocks-spark-connector-${spark_version}_${scala_version}</artifactId>
-        <version>${connector_version}</version>
-    </dependency>
-    ```
-
-2. For example, if the version of Spark in your environment is 3.2, the version of Scala is 2.12, and you choose Spark connector 1.1.0, you need to add the following dependency:
-
-    ```xml
-    <dependency>
-        <groupId>com.starrocks</groupId>
-        <artifactId>starrocks-spark-connector-3.2_2.12</artifactId>
-        <version>1.1.0</version>
-    </dependency>
-    ```
-
-### Compile by yourself
-
-1. Download the [Spark connector package](https://github.com/StarRocks/starrocks-connector-for-apache-spark).
-2. Execute the following command to compile the source code of Spark connector into a JAR file. Note that `spark_version` is replaced with the corresponding Spark version.
-
- ```bash
-    sh build.sh <spark_version>
- ```
-
- For example, if the Spark version in your environment is 3.2, you need to execute the following command:
-
- ```bash
- sh build.sh 3.2
- ```
-
-3. Go to the `target/` directory to find the Spark connector JAR file, such as `starrocks-spark-connector-3.2_2.12-1.1.0-SNAPSHOT.jar` , generated upon compilation.
-
-> **NOTE**
->
-> The name of a Spark connector version that is not formally released contains the `SNAPSHOT` suffix.
-
-## Parameters
-
-### starrocks.fe.http.url
-
-**Required**: YES
-**Default value**: None
-**Description**: The HTTP URL of the FE in your StarRocks cluster. You can specify multiple URLs, which must be separated by a comma (,). Format: `<fe_host1>:<fe_http_port1>,<fe_host2>:<fe_http_port2>`. Since version 1.1.1, you can also add the `http://` prefix to the URL, such as `http://<fe_host1>:<fe_http_port1>,http://<fe_host2>:<fe_http_port2>`.
-
-### starrocks.fe.jdbc.url
-
-**Required**: YES
-**Default value**: None
-**Description**: The address that is used to connect to the MySQL server of the FE. Format: `jdbc:mysql://<fe_host>:<fe_query_port>`.
-
-### starrocks.table.identifier
-
-**Required**: YES
-**Default value**: None
-**Description**: The name of the StarRocks table. Format: `<database_name>.<table_name>`.
-
-### starrocks.user
-
-**Required**: YES
-**Default value**: None
-**Description**: The username of your StarRocks cluster account. The user needs the [SELECT and INSERT privileges](../sql-reference/sql-statements/account-management/GRANT.md) on the StarRocks table.
-
-### starrocks.password
-
-**Required**: YES
-**Default value**: None
-**Description**: The password of your StarRocks cluster account.
-
-### starrocks.write.label.prefix
-
-**Required**: NO
-**Default value**: spark-
-**Description**: The label prefix used by Stream Load.
-
-### starrocks.write.enable.transaction-stream-load
-
-**Required**: NO
-**Default value**: TRUE
-**Description**: Whether to use [Stream Load transaction interface](../loading/Stream_Load_transaction_interface.md) to load data. It requires StarRocks v2.5 or later. This feature can load more data in a transaction with less memory usage, and improve performance. **NOTICE:** Since 1.1.1, this parameter takes effect only when the value of `starrocks.write.max.retries` is non-positive because Stream Load transaction interface does not support retry.
-
-### starrocks.write.buffer.size
-
-**Required**: NO
-**Default value**: 104857600
-**Description**: The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. Setting this parameter to a larger value can improve loading performance but may increase loading latency.
-
-### starrocks.write.buffer.rows
-
-**Required**: NO
-**Default value**: Integer.MAX_VALUE
-**Description**: Supported since version 1.1.1. The maximum number of rows that can be accumulated in memory before being sent to StarRocks at a time.
-
-### starrocks.write.flush.interval.ms
-
-**Required**: NO
-**Default value**: 300000
-**Description**: The interval at which data is sent to StarRocks. This parameter is used to control the loading latency.
-
-### starrocks.write.max.retries
-
-**Required**: NO
-**Default value**: 3
-**Description**: Supported since version 1.1.1. The number of times that the connector retries to perform the Stream Load for the same batch of data if the load fails. **NOTICE:** Because the Stream Load transaction interface does not support retries, if this parameter is positive, the connector always uses the Stream Load interface and ignores the value of `starrocks.write.enable.transaction-stream-load`.
-
-### starrocks.write.retry.interval.ms
-
-**Required**: NO
-**Default value**: 10000
-**Description**: Supported since version 1.1.1. The interval to retry the Stream Load for the same batch of data if the load fails.
-
-### starrocks.columns
-
-**Required**: NO
-**Default value**: None
-**Description**: The StarRocks table columns into which you want to load data. You can specify multiple columns, which must be separated by commas (,), for example, `"col0,col1,col2"`.
-
-### starrocks.column.types
-
-**Required**: NO
-**Default value**: None
-**Description**: Supported since version 1.1.1. Customize the column data types for Spark instead of using the defaults inferred from the StarRocks table and the [default mapping](#data-type-mapping-between-spark-and-starrocks). The parameter value is a schema in DDL format same as the output of Spark [StructType#toDDL](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/StructType.scala#L449) , such as `col0 INT, col1 STRING, col2 BIGINT`. Note that you only need to specify columns that need customization. One use case is to load data into columns of [BITMAP](#load-data-into-columns-of-bitmap-type) or [HLL](#load-data-into-columns-of-hll-type) type.
-
-### starrocks.write.properties.*
-
-**Required**: NO
-**Default value**: None
-**Description**: The parameters that are used to control Stream Load behavior. For example, the parameter `starrocks.write.properties.format` specifies the format of the data to be loaded, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
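-
-As an illustrative sketch only (the connection values mirror those used in the examples later in this topic, and the two properties shown are just examples of what STREAM LOAD accepts), Stream Load properties can be passed through this prefix in a Spark SQL table definition as follows:
-
-```SQL
--- The starrocks.write.properties.* options are forwarded to Stream Load
--- as the "format" and "max_filter_ratio" properties.
-CREATE TABLE `score_board`
-USING starrocks
-OPTIONS(
-   "starrocks.fe.http.url"="127.0.0.1:8030",
-   "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030",
-   "starrocks.table.identifier"="test.score_board",
-   "starrocks.user"="root",
-   "starrocks.password"="",
-   "starrocks.write.properties.format"="json",
-   "starrocks.write.properties.max_filter_ratio"="0.2"
-);
-```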
-
-### starrocks.write.properties.format
-
-**Required**: NO
-**Default value**: CSV
-**Description**: The file format based on which the Spark connector transforms each batch of data before the data is sent to StarRocks. Valid values: CSV and JSON.
-
-### starrocks.write.properties.row_delimiter
-
-**Required**: NO
-**Default value**: \n
-**Description**: The row delimiter for CSV-formatted data.
-
-### starrocks.write.properties.column_separator
-
-**Required**: NO
-**Default value**: \t
-**Description**: The column separator for CSV-formatted data.
-
-### starrocks.write.properties.partial_update
-
-**Required**: NO
-**Default value**: `FALSE`
-**Description**: Whether to use partial updates. Valid values: `TRUE` and `FALSE`. Default value: `FALSE`, indicating to disable this feature.
-
-### starrocks.write.properties.partial_update_mode
-
-**Required**: NO
-**Default value**: `row`
-**Description**: Specifies the mode for partial updates. Valid values: `row` and `column`. The value `row` (default) means partial updates in row mode, which is more suitable for real-time updates with many columns and small batches. The value `column` means partial updates in column mode, which is more suitable for batch updates with few columns and many rows. In such scenarios, enabling the column mode offers faster update speeds. For example, in a table with 100 columns, if only 10 columns (10% of the total) are updated for all rows, the update speed of the column mode is 10 times faster.
-
-### starrocks.write.num.partitions
-
-**Required**: NO
-**Default value**: None
-**Description**: The number of partitions into which Spark can write data in parallel. When the data volume is small, you can reduce the number of partitions to lower the loading concurrency and frequency. The default value for this parameter is determined by Spark. However, using this method may incur Spark Shuffle costs.
-
-### starrocks.write.partition.columns
-
-**Required**: NO
-**Default value**: None
-**Description**: The partitioning columns in Spark. The parameter takes effect only when `starrocks.write.num.partitions` is specified. If this parameter is not specified, all columns being written are used for partitioning.
-
-### starrocks.timezone
-
-**Required**: NO
-**Default value**: Default timezone of JVM
-**Description**: Supported since 1.1.1. The timezone used to convert Spark `TimestampType` to StarRocks `DATETIME`. The default is the timezone of JVM returned by `ZoneId#systemDefault()`. The format can be a timezone name such as `Asia/Shanghai`, or a zone offset such as `+08:00`.
-
-## Data type mapping between Spark and StarRocks
-
-- The default data type mapping is as follows:
-
- | Spark data type | StarRocks data type |
- | --------------- | ------------------------------------------------------------ |
- | BooleanType | BOOLEAN |
- | ByteType | TINYINT |
- | ShortType | SMALLINT |
- | IntegerType | INT |
- | LongType | BIGINT |
- | StringType | LARGEINT |
- | FloatType | FLOAT |
- | DoubleType | DOUBLE |
- | DecimalType | DECIMAL |
- | StringType | CHAR |
- | StringType | VARCHAR |
- | StringType | STRING |
- | StringType | JSON |
- | DateType | DATE |
- | TimestampType | DATETIME |
- | ArrayType | ARRAY **NOTE:** **Supported since version 1.1.1**. For detailed steps, see [Load data into columns of ARRAY type](#load-data-into-columns-of-array-type). |
-
-- You can also customize the data type mapping.
-
- For example, a StarRocks table contains BITMAP and HLL columns, but Spark does not support the two data types. You need to customize the corresponding data types in Spark. For detailed steps, see load data into [BITMAP](#load-data-into-columns-of-bitmap-type) and [HLL](#load-data-into-columns-of-hll-type) columns. **BITMAP and HLL are supported since version 1.1.1**.
-
-## Upgrade Spark connector
-
-### Upgrade from version 1.1.0 to 1.1.1
-
-- Since 1.1.1, the Spark connector does not provide `mysql-connector-java` which is the official JDBC driver for MySQL, because of the limitations of the GPL license used by `mysql-connector-java`.
- However, the Spark connector still needs the MySQL JDBC driver to connect to StarRocks for the table metadata, so you need to add the driver to the Spark classpath manually. You can find the
- driver on [MySQL site](https://dev.mysql.com/downloads/connector/j/) or [Maven Central](https://repo1.maven.org/maven2/mysql/mysql-connector-java/).
-- Since 1.1.1, the connector uses Stream Load interface by default rather than Stream Load transaction interface in version 1.1.0. If you still want to use Stream Load transaction interface, you
- can set the option `starrocks.write.max.retries` to `0`. Please see the description of `starrocks.write.enable.transaction-stream-load` and `starrocks.write.max.retries`
- for details.
-
-## Examples
-
-The following examples show how to use the Spark connector to load data into a StarRocks table with Spark DataFrames or Spark SQL. Spark DataFrames support both the batch and Structured Streaming modes.
-
-For more examples, see [Spark Connector Examples](https://github.com/StarRocks/starrocks-connector-for-apache-spark/tree/main/src/test/java/com/starrocks/connector/spark/examples).
-
-### Preparations
-
-#### Create a StarRocks table
-
-Create a database `test` and create a Primary Key table `score_board`.
-
-```sql
-CREATE DATABASE `test`;
-
-CREATE TABLE `test`.`score_board`
-(
- `id` int(11) NOT NULL COMMENT "",
- `name` varchar(65533) NULL DEFAULT "" COMMENT "",
- `score` int(11) NOT NULL DEFAULT "0" COMMENT ""
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-COMMENT "OLAP"
-DISTRIBUTED BY HASH(`id`);
-```
-
-#### Network configuration
-
-Ensure that the machine where Spark is located can access the FE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`query_port`](../administration/management/FE_configuration.md#query_port) (default: `9030`), and the BE nodes via the [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`).
-
-#### Set up your Spark environment
-
-Note that the following examples are run in Spark 3.2.4 and use `spark-shell`, `pyspark` and `spark-sql`. Before running the examples, make sure to place the Spark connector JAR file in the `$SPARK_HOME/jars` directory.
-
-### Load data with Spark DataFrames
-
-The following two examples explain how to load data with Spark DataFrames Batch or Structured Streaming mode.
-
-#### Batch
-
-Construct data in memory and load data into the StarRocks table.
-
-1. You can write the Spark application using Scala or Python.
-
- For Scala, run the following code snippet in `spark-shell`:
-
- ```Scala
- // 1. Create a DataFrame from a sequence.
- val data = Seq((1, "starrocks", 100), (2, "spark", 100))
- val df = data.toDF("id", "name", "score")
-
- // 2. Write to StarRocks by configuring the format as "starrocks" and the following options.
-    // You need to modify the options according to your own environment.
- df.write.format("starrocks")
- .option("starrocks.fe.http.url", "127.0.0.1:8030")
- .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030")
- .option("starrocks.table.identifier", "test.score_board")
- .option("starrocks.user", "root")
- .option("starrocks.password", "")
- .mode("append")
- .save()
- ```
-
- For Python, run the following code snippet in `pyspark`:
-
- ```python
- from pyspark.sql import SparkSession
-
- spark = SparkSession \
- .builder \
- .appName("StarRocks Example") \
- .getOrCreate()
-
- # 1. Create a DataFrame from a sequence.
- data = [(1, "starrocks", 100), (2, "spark", 100)]
- df = spark.sparkContext.parallelize(data) \
- .toDF(["id", "name", "score"])
-
- # 2. Write to StarRocks by configuring the format as "starrocks" and the following options.
-    # You need to modify the options according to your own environment.
- df.write.format("starrocks") \
- .option("starrocks.fe.http.url", "127.0.0.1:8030") \
- .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") \
- .option("starrocks.table.identifier", "test.score_board") \
- .option("starrocks.user", "root") \
- .option("starrocks.password", "") \
- .mode("append") \
- .save()
- ```
-
-2. Query data in the StarRocks table.
-
- ```sql
- MySQL [test]> SELECT * FROM `score_board`;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 1 | starrocks | 100 |
- | 2 | spark | 100 |
- +------+-----------+-------+
- 2 rows in set (0.00 sec)
- ```
-
-#### Structured Streaming
-
-Construct a streaming read of data from a CSV file and load data into the StarRocks table.
-
-1. In the directory `csv-data`, create a CSV file `test.csv` with the following data:
-
- ```csv
- 3,starrocks,100
- 4,spark,100
- ```
-
-2. You can write the Spark application using Scala or Python.
-
- For Scala, run the following code snippet in `spark-shell`:
-
- ```Scala
- import org.apache.spark.sql.types.StructType
-
- // 1. Create a DataFrame from CSV.
- val schema = (new StructType()
- .add("id", "integer")
- .add("name", "string")
- .add("score", "integer")
- )
- val df = (spark.readStream
- .option("sep", ",")
- .schema(schema)
- .format("csv")
- // Replace it with your path to the directory "csv-data".
- .load("/path/to/csv-data")
- )
-
- // 2. Write to StarRocks by configuring the format as "starrocks" and the following options.
-    // You need to modify the options according to your own environment.
- val query = (df.writeStream.format("starrocks")
- .option("starrocks.fe.http.url", "127.0.0.1:8030")
- .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030")
- .option("starrocks.table.identifier", "test.score_board")
- .option("starrocks.user", "root")
- .option("starrocks.password", "")
- // replace it with your checkpoint directory
- .option("checkpointLocation", "/path/to/checkpoint")
- .outputMode("append")
- .start()
- )
- ```
-
- For Python, run the following code snippet in `pyspark`:
-
- ```python
- from pyspark.sql import SparkSession
- from pyspark.sql.types import IntegerType, StringType, StructType, StructField
-
- spark = SparkSession \
- .builder \
- .appName("StarRocks SS Example") \
- .getOrCreate()
-
- # 1. Create a DataFrame from CSV.
- schema = StructType([
- StructField("id", IntegerType()),
- StructField("name", StringType()),
- StructField("score", IntegerType())
- ])
- df = (
- spark.readStream
- .option("sep", ",")
- .schema(schema)
- .format("csv")
- # Replace it with your path to the directory "csv-data".
- .load("/path/to/csv-data")
- )
-
- # 2. Write to StarRocks by configuring the format as "starrocks" and the following options.
-    # You need to modify the options according to your own environment.
- query = (
- df.writeStream.format("starrocks")
- .option("starrocks.fe.http.url", "127.0.0.1:8030")
- .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030")
- .option("starrocks.table.identifier", "test.score_board")
- .option("starrocks.user", "root")
- .option("starrocks.password", "")
- # replace it with your checkpoint directory
- .option("checkpointLocation", "/path/to/checkpoint")
- .outputMode("append")
- .start()
- )
- ```
-
-3. Query data in the StarRocks table.
-
- ```SQL
- MySQL [test]> select * from score_board;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 4 | spark | 100 |
- | 3 | starrocks | 100 |
- +------+-----------+-------+
- 2 rows in set (0.67 sec)
- ```
-
-### Load data with Spark SQL
-
-The following example explains how to load data with Spark SQL by using the `INSERT INTO` statement in the [Spark SQL CLI](https://spark.apache.org/docs/latest/sql-distributed-sql-engine-spark-sql-cli.html).
-
-1. Execute the following SQL statement in the `spark-sql`:
-
- ```SQL
- -- 1. Create a table by configuring the data source as `starrocks` and the following options.
-    -- You need to modify the options according to your own environment.
- CREATE TABLE `score_board`
- USING starrocks
- OPTIONS(
- "starrocks.fe.http.url"="127.0.0.1:8030",
- "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030",
- "starrocks.table.identifier"="test.score_board",
- "starrocks.user"="root",
- "starrocks.password"=""
- );
-
- -- 2. Insert two rows into the table.
- INSERT INTO `score_board` VALUES (5, "starrocks", 100), (6, "spark", 100);
- ```
-
-2. Query data in the StarRocks table.
-
- ```SQL
- MySQL [test]> select * from score_board;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 6 | spark | 100 |
- | 5 | starrocks | 100 |
- +------+-----------+-------+
- 2 rows in set (0.00 sec)
- ```
-
-## Best Practices
-
-### Load data to Primary Key table
-
-This section will show how to load data into a StarRocks Primary Key table to achieve partial updates and conditional updates.
-You can see [Change data through loading](../loading/Load_to_Primary_Key_tables.md) for the detailed introduction of these features.
-These examples use Spark SQL.
-
-#### Preparations
-
-Create a database `test` and create a Primary Key table `score_board` in StarRocks.
-
-```SQL
-CREATE DATABASE `test`;
-
-CREATE TABLE `test`.`score_board`
-(
- `id` int(11) NOT NULL COMMENT "",
- `name` varchar(65533) NULL DEFAULT "" COMMENT "",
- `score` int(11) NOT NULL DEFAULT "0" COMMENT ""
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-COMMENT "OLAP"
-DISTRIBUTED BY HASH(`id`);
-```
-
-#### Partial updates
-
-This example will show how to update only the data in the column `name` through loading:
-
-1. Insert initial data to StarRocks table in MySQL client.
-
- ```sql
- mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'spark', 100);
-
- mysql> select * from score_board;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 1 | starrocks | 100 |
- | 2 | spark | 100 |
- +------+-----------+-------+
- 2 rows in set (0.02 sec)
- ```
-
-2. Create a Spark table `score_board` in Spark SQL client.
-
- - Set the option `starrocks.write.properties.partial_update` to `true` which tells the connector to do partial update.
- - Set the option `starrocks.columns` to `"id,name"` to tell the connector which columns to write.
-
- ```SQL
- CREATE TABLE `score_board`
- USING starrocks
- OPTIONS(
- "starrocks.fe.http.url"="127.0.0.1:8030",
- "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030",
- "starrocks.table.identifier"="test.score_board",
- "starrocks.user"="root",
- "starrocks.password"="",
- "starrocks.write.properties.partial_update"="true",
- "starrocks.columns"="id,name"
- );
- ```
-
-3. Insert data into the table in Spark SQL client, and only update the column `name`.
-
- ```SQL
- INSERT INTO `score_board` VALUES (1, 'starrocks-update'), (2, 'spark-update');
- ```
-
-4. Query the StarRocks table in MySQL client.
-
-   You can see that only the values for `name` change, and the values for `score` do not change.
-
- ```SQL
- mysql> select * from score_board;
- +------+------------------+-------+
- | id | name | score |
- +------+------------------+-------+
- | 1 | starrocks-update | 100 |
- | 2 | spark-update | 100 |
- +------+------------------+-------+
- 2 rows in set (0.02 sec)
- ```
-
-#### Conditional updates
-
-This example will show how to do conditional updates according to the values of the column `score`. The update for an `id`
-takes effect only when the new value for `score` is greater than or equal to the old value.
-
-1. Insert initial data to StarRocks table in MySQL client.
-
- ```SQL
- mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'spark', 100);
-
- mysql> select * from score_board;
- +------+-----------+-------+
- | id | name | score |
- +------+-----------+-------+
- | 1 | starrocks | 100 |
- | 2 | spark | 100 |
- +------+-----------+-------+
- 2 rows in set (0.02 sec)
- ```
-
-2. Create a Spark table `score_board` in the following ways.
-
- - Set the option `starrocks.write.properties.merge_condition` to `score` which tells the connector to use the column `score` as the condition.
-   - Make sure that the Spark connector uses the Stream Load interface to load data, rather than the Stream Load transaction interface, because the latter does not support this feature.
-
- ```SQL
- CREATE TABLE `score_board`
- USING starrocks
- OPTIONS(
- "starrocks.fe.http.url"="127.0.0.1:8030",
- "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030",
- "starrocks.table.identifier"="test.score_board",
- "starrocks.user"="root",
- "starrocks.password"="",
- "starrocks.write.properties.merge_condition"="score"
- );
- ```
-
-3. Insert data to the table in Spark SQL client, and update the row whose `id` is 1 with a smaller score value, and the row whose `id` is 2 with a larger score value.
-
- ```SQL
- INSERT INTO `score_board` VALUES (1, 'starrocks-update', 99), (2, 'spark-update', 101);
- ```
-
-4. Query the StarRocks table in MySQL client.
-
- You can see that only the row whose `id` is 2 changes, and the row whose `id` is 1 does not change.
-
- ```SQL
- mysql> select * from score_board;
- +------+--------------+-------+
- | id | name | score |
- +------+--------------+-------+
- | 1 | starrocks | 100 |
- | 2 | spark-update | 101 |
- +------+--------------+-------+
- 2 rows in set (0.03 sec)
- ```
-
-### Load data into columns of BITMAP type
-
-[`BITMAP`](../sql-reference/data-types/other-data-types/BITMAP.md) is often used to accelerate count distinct, such as counting UV, see [Use Bitmap for exact Count Distinct](../using_starrocks/distinct_values/Using_bitmap.md).
-Here we take the counting of UV as an example to show how to load data into columns of the `BITMAP` type. **`BITMAP` is supported since version 1.1.1**.
-
-1. Create a StarRocks Aggregate table.
-
- In the database `test`, create an Aggregate table `page_uv` where the column `visit_users` is defined as the `BITMAP` type and configured with the aggregate function `BITMAP_UNION`.
-
- ```SQL
- CREATE TABLE `test`.`page_uv` (
- `page_id` INT NOT NULL COMMENT 'page ID',
- `visit_date` datetime NOT NULL COMMENT 'access time',
- `visit_users` BITMAP BITMAP_UNION NOT NULL COMMENT 'user ID'
- ) ENGINE=OLAP
- AGGREGATE KEY(`page_id`, `visit_date`)
- DISTRIBUTED BY HASH(`page_id`);
- ```
-
-2. Create a Spark table.
-
-   The schema of the Spark table is inferred from the StarRocks table, and Spark does not support the `BITMAP` type. So you need to customize the corresponding column data type in Spark, for example, as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md) function to convert the data of the `BIGINT` type into the `BITMAP` type.
-
- Run the following DDL in `spark-sql`:
-
- ```SQL
- CREATE TABLE `page_uv`
- USING starrocks
- OPTIONS(
- "starrocks.fe.http.url"="127.0.0.1:8030",
- "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030",
- "starrocks.table.identifier"="test.page_uv",
- "starrocks.user"="root",
- "starrocks.password"="",
- "starrocks.column.types"="visit_users BIGINT"
- );
- ```
-
-3. Load data into StarRocks table.
-
- Run the following DML in `spark-sql`:
-
- ```SQL
- INSERT INTO `page_uv` VALUES
- (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 13),
- (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23),
- (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 33),
- (1, CAST('2020-06-23 02:30:30' AS TIMESTAMP), 13),
- (2, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23);
- ```
-
-4. Calculate page UVs from the StarRocks table.
-
- ```SQL
- MySQL [test]> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `page_uv` GROUP BY `page_id`;
- +---------+-----------------------------+
- | page_id | count(DISTINCT visit_users) |
- +---------+-----------------------------+
- | 2 | 1 |
- | 1 | 3 |
- +---------+-----------------------------+
- 2 rows in set (0.01 sec)
- ```
-
-> **NOTICE:**
->
-> The connector uses [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md)
-> function to convert data of the `TINYINT`, `SMALLINT`, `INTEGER`, and `BIGINT` types in Spark to the `BITMAP` type in StarRocks, and uses
-> [`bitmap_hash`](../sql-reference/sql-functions/bitmap-functions/bitmap_hash.md) or [`bitmap_hash64`](../sql-reference/sql-functions/bitmap-functions/bitmap_hash64.md) function for other Spark data types.
-
-### Load data into columns of HLL type
-
-[`HLL`](../sql-reference/data-types/other-data-types/HLL.md) can be used for approximate count distinct, see [Use HLL for approximate count distinct](../using_starrocks/distinct_values/Using_HLL.md).
-
-Here we take the counting of UV as an example to show how to load data into columns of the `HLL` type. **`HLL` is supported since version 1.1.1**.
-
-1. Create a StarRocks Aggregate table.
-
- In the database `test`, create an Aggregate table `hll_uv` where the column `visit_users` is defined as the `HLL` type and configured with the aggregate function `HLL_UNION`.
-
- ```SQL
- CREATE TABLE `hll_uv` (
- `page_id` INT NOT NULL COMMENT 'page ID',
- `visit_date` datetime NOT NULL COMMENT 'access time',
- `visit_users` HLL HLL_UNION NOT NULL COMMENT 'user ID'
- ) ENGINE=OLAP
- AGGREGATE KEY(`page_id`, `visit_date`)
- DISTRIBUTED BY HASH(`page_id`);
- ```
-
-2. Create a Spark table.
-
-   The schema of the Spark table is inferred from the StarRocks table, but Spark does not support the `HLL` type. Therefore, you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`hll_hash`](../sql-reference/sql-functions/scalar-functions/hll_hash.md) function to convert the data of `BIGINT` type into `HLL` type.
-
- Run the following DDL in `spark-sql`:
-
- ```SQL
- CREATE TABLE `hll_uv`
- USING starrocks
- OPTIONS(
- "starrocks.fe.http.url"="127.0.0.1:8030",
- "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030",
- "starrocks.table.identifier"="test.hll_uv",
- "starrocks.user"="root",
- "starrocks.password"="",
- "starrocks.column.types"="visit_users BIGINT"
- );
- ```
-
-3. Load data into StarRocks table.
-
- Run the following DML in `spark-sql`:
-
- ```SQL
- INSERT INTO `hll_uv` VALUES
- (3, CAST('2023-07-24 12:00:00' AS TIMESTAMP), 78),
- (4, CAST('2023-07-24 13:20:10' AS TIMESTAMP), 2),
- (3, CAST('2023-07-24 12:30:00' AS TIMESTAMP), 674);
- ```
-
-4. Calculate page UVs from the StarRocks table.
-
- ```SQL
- MySQL [test]> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `hll_uv` GROUP BY `page_id`;
- +---------+-----------------------------+
- | page_id | count(DISTINCT visit_users) |
- +---------+-----------------------------+
- | 4 | 1 |
- | 3 | 2 |
- +---------+-----------------------------+
- 2 rows in set (0.01 sec)
- ```
-
-### Load data into columns of ARRAY type
-
-The following example explains how to load data into columns of the [`ARRAY`](../sql-reference/data-types/semi_structured/Array.md) type.
-
-1. Create a StarRocks table.
-
- In the database `test`, create a Primary Key table `array_tbl` that includes one `INT` column and two `ARRAY` columns.
-
- ```SQL
- CREATE TABLE `array_tbl` (
- `id` INT NOT NULL,
-      `a0` ARRAY<STRING>,
-      `a1` ARRAY<ARRAY<INT>>
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`)
- DISTRIBUTED BY HASH(`id`)
- ;
- ```
-
-2. Write data to StarRocks.
-
-   Because some versions of StarRocks do not provide the metadata of `ARRAY` columns, the connector cannot infer the corresponding Spark data type for these columns. However, you can explicitly specify the corresponding Spark data type of a column in the option `starrocks.column.types`. In this example, you can configure the option as `a0 ARRAY<STRING>,a1 ARRAY<ARRAY<INT>>`.
-
-   Run the following code in `spark-shell`:
-
- ```scala
- val data = Seq(
-      (1, Seq("hello", "starrocks"), Seq(Seq(1, 2), Seq(3, 4))),
-      (2, Seq("hello", "spark"), Seq(Seq(5, 6, 7), Seq(8, 9, 10)))
-    )
- val df = data.toDF("id", "a0", "a1")
- df.write
- .format("starrocks")
- .option("starrocks.fe.http.url", "127.0.0.1:8030")
- .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030")
- .option("starrocks.table.identifier", "test.array_tbl")
- .option("starrocks.user", "root")
- .option("starrocks.password", "")
- .option("starrocks.column.types", "a0 ARRAY,a1 ARRAY>")
- .mode("append")
- .save()
- ```
-
-3. Query data in the StarRocks table.
-
- ```SQL
- MySQL [test]> SELECT * FROM `array_tbl`;
- +------+-----------------------+--------------------+
- | id | a0 | a1 |
- +------+-----------------------+--------------------+
- | 1 | ["hello","starrocks"] | [[1,2],[3,4]] |
- | 2 | ["hello","spark"] | [[5,6,7],[8,9,10]] |
- +------+-----------------------+--------------------+
- 2 rows in set (0.01 sec)
- ```
diff --git a/docs/en/loading/SparkLoad.md b/docs/en/loading/SparkLoad.md
deleted file mode 100644
index 8794ce3..0000000
--- a/docs/en/loading/SparkLoad.md
+++ /dev/null
@@ -1,537 +0,0 @@
----
-displayed_sidebar: docs
----
-
-# Load data in bulk using Spark Load
-
-This load uses external Apache Spark™ resources to pre-process imported data, which improves import performance and saves compute resources. It is mainly used for **initial migration** and **large data import** into StarRocks (data volume up to TB level).
-
-Spark load is an **asynchronous** import method that requires users to create Spark-type import jobs via the MySQL protocol and view the import results using `SHOW LOAD`.
-
-> **NOTICE**
->
-> - Only users with the INSERT privilege on a StarRocks table can load data into this table. You can follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant the required privilege.
-> - Spark Load can not be used to load data into a Primary Key table.
-
-## Terminology explanation
-
-- **Spark ETL**: Mainly responsible for ETL of data in the import process, including global dictionary construction (BITMAP type), partitioning, sorting, aggregation, etc.
-- **Broker**: Broker is an independent stateless process. It encapsulates the file system interface and provides StarRocks with the ability to read files from remote storage systems.
-- **Global Dictionary**: Saves the data structure that maps data from the original value to the encoded value. The original value can be any data type, while the encoded value is an integer. The global dictionary is mainly used in scenarios where exact count distinct is precomputed.
-
-## Fundamentals
-
-The user submits a Spark-type import job through the MySQL client; the FE records the metadata and returns the submission result.
-
-The execution of the spark load task is divided into the following main phases.
-
-1. The user submits the spark load job to the FE.
-2. The FE schedules the submission of the ETL task to the Apache Spark™ cluster for execution.
-3. The Apache Spark™ cluster executes the ETL task that includes global dictionary construction (BITMAP type), partitioning, sorting, aggregation, etc.
-4. After the ETL task is completed, the FE gets the data path of each preprocessed slice and schedules the relevant BE to execute the Push task.
-5. The BE reads data through Broker process from HDFS and converts it into StarRocks storage format.
- > If you choose not to use Broker process, the BE reads data from HDFS directly.
-6. The FE schedules the effective version and completes the import job.
-
-The following diagram illustrates the main flow of spark load.
-
-
-
----
-
-## Global Dictionary
-
-### Applicable Scenarios
-
-Currently, the BITMAP column in StarRocks is implemented using RoaringBitmap, which only accepts integers as the input data type. So if you want to precompute the BITMAP column during the import process, you need to convert the input data type to integer.
-
-In the existing import process of StarRocks, the data structure of the global dictionary is implemented based on the Hive table, which saves the mapping from the original value to the encoded value.
-
-### Build Process
-
-1. Read the data from the upstream data source and generate a temporary Hive table, named `hive-table`.
-2. Extract the distinct values of the fields to be deduplicated from `hive-table` and generate a new Hive table named `distinct-value-table`.
-3. Create a new global dictionary table named `dict-table` with one column for the original values and one column for the encoded values.
-4. Left join `distinct-value-table` with `dict-table`, and then use a window function to encode the values that are not yet in the dictionary (see the sketch after this list). Finally, both the original values and the encoded values of the deduplicated column are written back to `dict-table`.
-5. Join between `dict-table` and `hive-table` to finish the job of replacing the original value in `hive-table` with the integer encoded value.
-6. `hive-table` is read in the subsequent data pre-processing step and, after calculation, imported into StarRocks.
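-
-The following is a minimal SQL sketch (not part of the actual Spark ETL code) of step 4, assuming the hypothetical table names `dict_table(orig_value, encoded_value)` for the existing global dictionary and `distinct_value_table(orig_value)` for the deduplicated values of the current load. It shows how values that are not yet in the dictionary can be assigned new integer codes after the current maximum code by using a window function:
-
-~~~sql
--- Assign new codes only to the values that are missing from the dictionary.
-SELECT
-    d.orig_value,
-    ROW_NUMBER() OVER (ORDER BY d.orig_value)
-        + (SELECT COALESCE(MAX(encoded_value), 0) FROM dict_table) AS encoded_value
-FROM distinct_value_table d
-LEFT JOIN dict_table t
-    ON d.orig_value = t.orig_value
-WHERE t.orig_value IS NULL;
-~~~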
-
-## Data Pre-processing
-
-The basic process of data pre-processing is as follows:
-
-1. Read data from the upstream data source (HDFS file or Hive table).
-2. Complete field mapping and calculation for the read data, then generate `bucket-id` based on the partition information.
-3. Generate RollupTree based on the Rollup metadata of StarRocks table.
-4. Iterate through the RollupTree and perform hierarchical aggregation operations. The Rollup of the next hierarchy can be calculated from the Rollup of the previous hierarchy.
-5. Each time the aggregation calculation is completed, the data is bucketed according to `bucket-id` and then written to HDFS.
-6. The subsequent Broker process will pull the files from HDFS and import them into the StarRocks BE node.
-
-## Basic Operations
-
-### Configuring ETL Clusters
-
-Apache Spark™ is used as an external computational resource in StarRocks for ETL work. There may be other external resources added to StarRocks, such as Spark/GPU for query, HDFS/S3 for external storage, MapReduce for ETL, etc. Therefore, we introduce `Resource Management` to manage these external resources used by StarRocks.
-
-Before submitting an Apache Spark™ import job, configure the Apache Spark™ cluster for performing ETL tasks. The syntax for operation is as follows:
-
-~~~sql
--- create Apache Spark™ resource
-CREATE EXTERNAL RESOURCE resource_name
-PROPERTIES
-(
- type = spark,
- spark_conf_key = spark_conf_value,
- working_dir = path,
- broker = broker_name,
- broker.property_key = property_value
-);
-
--- drop Apache Spark™ resource
-DROP RESOURCE resource_name;
-
--- show resources
-SHOW RESOURCES
-SHOW PROC "/resources";
-
--- privileges
-GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity;
-GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name;
-REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity;
-REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name;
-~~~
-
-- Create resource
-
-**For example**:
-
-~~~sql
--- yarn cluster mode
-CREATE EXTERNAL RESOURCE "spark0"
-PROPERTIES
-(
- "type" = "spark",
- "spark.master" = "yarn",
- "spark.submit.deployMode" = "cluster",
- "spark.jars" = "xxx.jar,yyy.jar",
- "spark.files" = "/tmp/aaa,/tmp/bbb",
- "spark.executor.memory" = "1g",
- "spark.yarn.queue" = "queue0",
- "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
- "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
- "working_dir" = "hdfs://127.0.0.1:10000/tmp/starrocks",
- "broker" = "broker0",
- "broker.username" = "user0",
- "broker.password" = "password0"
-);
-
--- yarn HA cluster mode
-CREATE EXTERNAL RESOURCE "spark1"
-PROPERTIES
-(
- "type" = "spark",
- "spark.master" = "yarn",
- "spark.submit.deployMode" = "cluster",
- "spark.hadoop.yarn.resourcemanager.ha.enabled" = "true",
- "spark.hadoop.yarn.resourcemanager.ha.rm-ids" = "rm1,rm2",
- "spark.hadoop.yarn.resourcemanager.hostname.rm1" = "host1",
- "spark.hadoop.yarn.resourcemanager.hostname.rm2" = "host2",
- "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
- "working_dir" = "hdfs://127.0.0.1:10000/tmp/starrocks",
- "broker" = "broker1"
-);
-~~~
-
-`resource_name` is the name of the Apache Spark™ resource configured in StarRocks.
-
-`PROPERTIES` includes parameters relating to the Apache Spark™ resource, as follows:
-> **Note**
->
-> For detailed description of Apache Spark™ resource PROPERTIES, please see [CREATE RESOURCE](../sql-reference/sql-statements/Resource/CREATE_RESOURCE.md)
-
-- Spark related parameters:
- - `type`: Resource type, required, currently only supports `spark`.
- - `spark.master`: Required, currently only supports `yarn`.
- - `spark.submit.deployMode`: The deployment mode of the Apache Spark™ program, required, currently supports both `cluster` and `client`.
- - `spark.hadoop.fs.defaultFS`: Required if master is yarn.
- - Parameters related to yarn resource manager, required.
- - one ResourceManager on a single node
- `spark.hadoop.yarn.resourcemanager.address`: Address of the single point resource manager.
- - ResourceManager HA
- > You can choose to specify ResourceManager's hostname or address.
- - `spark.hadoop.yarn.resourcemanager.ha.enabled`: Enable the resource manager HA, set to `true`.
- - `spark.hadoop.yarn.resourcemanager.ha.rm-ids`: list of resource manager logical ids.
- - `spark.hadoop.yarn.resourcemanager.hostname.rm-id`: For each rm-id, specify the hostname corresponding to the resource manager.
- - `spark.hadoop.yarn.resourcemanager.address.rm-id`: For each rm-id, specify `host:port` for the client to submit jobs to.
-
-- `working_dir`: The directory used by ETL. Required if Apache Spark™ is used as an ETL resource. For example: `hdfs://host:port/tmp/starrocks`.
-
-- Broker related parameters:
- - `broker`: Broker name. Required if Apache Spark™ is used as an ETL resource. You need to use the `ALTER SYSTEM ADD BROKER` command to complete the configuration in advance.
-  - `broker.property_key`: Information (for example, authentication information) to be specified when the Broker process reads the intermediate files generated by the ETL job.
-
-**Precaution**:
-
-The above describes the parameters for loading through the Broker process. If you intend to load data without the Broker process, note the following.
-
-- You do not need to specify `broker`.
-- If you need to configure user authentication and HA for NameNode nodes, you need to configure the parameters in the hdfs-site.xml file of the HDFS cluster. See [broker_properties](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#hdfs) for descriptions of the parameters. You also need to place the **hdfs-site.xml** file under **$FE_HOME/conf** for each FE and **$BE_HOME/conf** for each BE.
-
-> Note
->
-> If the HDFS file can only be accessed by a specific user, you still need to specify the HDFS username in `broker.username` and the user password in `broker.password`.
-
-- View resources
-
-Regular accounts can only view resources to which they have `USAGE_PRIV` access. The root and admin accounts can view all resources.
-
-- Resource Permissions
-
-Resource permissions are managed through `GRANT` and `REVOKE` statements, which currently only support the `USAGE_PRIV` permission. You can grant the `USAGE_PRIV` permission to a user or a role.
-
-~~~sql
--- Grant access to spark0 resources to user0
-GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
-
--- Grant access to spark0 resources to role0
-GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
-
--- Grant access to all resources to user0
-GRANT USAGE_PRIV ON RESOURCE* TO "user0"@"%";
-
--- Grant access to all resources to role0
-GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
-
--- Revoke the use privileges of spark0 resources from user user0
-REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
-~~~
-
-### Configuring Spark Client
-
-Configure the Spark client for the FE so that the latter can submit Spark tasks by executing the `spark-submit` command. It is recommended to use the official Spark 2.4.5 version or later ([Spark download address](https://archive.apache.org/dist/spark/)). After downloading, use the following steps to complete the configuration.
-
-- Configure `SPARK_HOME`
-
-Place the Spark client in a directory on the same machine as the FE, and configure `spark_home_default_dir` in the FE configuration file to this directory, which by default is the `lib/spark2x` path in the FE root directory, and cannot be empty.
-
-- **Configure SPARK dependency package**
-
-To configure the dependency package, zip and archive all jar files in the jars folder under the Spark client, and configure the `spark_resource_path` item in the FE configuration to this zip file. If this configuration is empty, the FE will try to find the `lib/spark2x/jars/spark-2x.zip` file in the FE root directory. If the FE fails to find it, it will report an error.
-
-When the spark load job is submitted, the archived dependency files will be uploaded to the remote repository. The default repository path is under the `working_dir/{cluster_id}` directory named with `--spark-repository--{resource-name}`, which means that a resource in the cluster corresponds to a remote repository. The directory structure is referenced as follows:
-
-~~~bash
----spark-repository--spark0/
-   |---archive-1.0.0/
-   |   |---lib-990325d2c0d1d5e45bf675e54e44fb16-spark-dpp-1.0.0-jar-with-dependencies.jar
-   |   |---lib-7670c29daf535efe3c9b923f778f61fc-spark-2x.zip
-   |---archive-1.1.0/
-   |   |---lib-64d5696f99c379af2bee28c1c84271d5-spark-dpp-1.1.0-jar-with-dependencies.jar
-   |   |---lib-1bbb74bb6b264a270bc7fca3e964160f-spark-2x.zip
-   |---archive-1.2.0/
-   |   |---...
-~~~
-
-In addition to the spark dependencies (named `spark-2x.zip` by default), the FE also uploads the DPP dependencies to the remote repository. If all the dependencies submitted by the spark load already exist in the remote repository, then there is no need to upload the dependencies again, saving the time of repeatedly uploading a large number of files each time.
-
-### Configuring YARN Client
-
-Configure the yarn client for the FE so that the FE can execute yarn commands to get the status of a running application or kill it. It is recommended to use the official Hadoop 2.5.2 version or later ([Hadoop download address](https://archive.apache.org/dist/hadoop/common/)). After downloading, use the following steps to complete the configuration:
-
-- **Configure the YARN executable path**
-
-Place the downloaded yarn client in a directory on the same machine as the FE, and configure the `yarn_client_path` item in the FE configuration file to the binary executable file of yarn, which by default is the `lib/yarn-client/hadoop/bin/yarn` path in the FE root directory.
-
-- **Configure the path for generating the YARN configuration files (optional)**
-
-When the FE uses the yarn client to get the status of an application or to kill it, by default StarRocks generates the configuration files required to execute the yarn command in the `lib/yarn-config` path of the FE root directory. This path can be modified by configuring the `yarn_config_dir` entry in the FE configuration file. The generated configuration files currently include `core-site.xml` and `yarn-site.xml`.
-
-### Create Import Job
-
-**Syntax:**
-
-~~~sql
-LOAD LABEL load_label
- (data_desc, ...)
-WITH RESOURCE resource_name
-[resource_properties]
-[PROPERTIES (key1=value1, ... )]
-
-* load_label:
- db_name.label_name
-
-* data_desc:
- DATA INFILE ('file_path', ...)
- [NEGATIVE]
- INTO TABLE tbl_name
- [PARTITION (p1, p2)]
- [COLUMNS TERMINATED BY separator ]
- [(col1, ...)]
- [COLUMNS FROM PATH AS (col2, ...)]
- [SET (k1=f1(xx), k2=f2(xx))]
- [WHERE predicate]
-
- DATA FROM TABLE hive_external_tbl
- [NEGATIVE]
- INTO TABLE tbl_name
- [PARTITION (p1, p2)]
- [SET (k1=f1(xx), k2=f2(xx))]
- [WHERE predicate]
-
-* resource_properties:
- (key2=value2, ...)
-~~~
-
-**Example 1**: The case where the upstream data source is HDFS
-
-~~~sql
-LOAD LABEL db1.label1
-(
- DATA INFILE("hdfs://abc.com:8888/user/starrocks/test/ml/file1")
- INTO TABLE tbl1
- COLUMNS TERMINATED BY ","
- (tmp_c1,tmp_c2)
- SET
- (
- id=tmp_c2,
- name=tmp_c1
- ),
- DATA INFILE("hdfs://abc.com:8888/user/starrocks/test/ml/file2")
- INTO TABLE tbl2
- COLUMNS TERMINATED BY ","
- (col1, col2)
- where col1 > 1
-)
-WITH RESOURCE 'spark0'
-(
- "spark.executor.memory" = "2g",
- "spark.shuffle.compress" = "true"
-)
-PROPERTIES
-(
- "timeout" = "3600"
-);
-~~~
-
-**Example 2**: The case where the upstream data source is Hive.
-
-- Step 1: Create a new hive resource
-
-~~~sql
-CREATE EXTERNAL RESOURCE hive0
-PROPERTIES
-(
- "type" = "hive",
- "hive.metastore.uris" = "thrift://xx.xx.xx.xx:8080"
-);
- ~~~
-
-- Step 2: Create a new hive external table
-
-~~~sql
-CREATE EXTERNAL TABLE hive_t1
-(
- k1 INT,
-    k2 SMALLINT,
- k3 varchar(50),
- uuid varchar(100)
-)
-ENGINE=hive
-PROPERTIES
-(
- "resource" = "hive0",
- "database" = "tmp",
- "table" = "t1"
-);
- ~~~
-
-- Step 3: Submit the load command, requiring that the columns in the imported StarRocks table exist in the hive external table.
-
-~~~sql
-LOAD LABEL db1.label1
-(
- DATA FROM TABLE hive_t1
- INTO TABLE tbl1
- SET
- (
- uuid=bitmap_dict(uuid)
- )
-)
-WITH RESOURCE 'spark0'
-(
- "spark.executor.memory" = "2g",
- "spark.shuffle.compress" = "true"
-)
-PROPERTIES
-(
- "timeout" = "3600"
-);
- ~~~
-
-Introduction to the parameters in the Spark load:
-
-- **Label**
-
-Label of the import job. Each import job has a Label that is unique within the database, following the same rules as broker load.
-
-- **Data description parameters**
-
-Currently, the supported data sources are CSV files and Hive tables. Other rules are the same as for broker load.
-
-- **Import Job Parameters**
-
-Import job parameters refer to the parameters belonging to the `opt_properties` section of the import statement. These parameters are applicable to the entire import job. The rules are the same as broker load.
-
-- **Spark Resource Parameters**
-
-Spark resources need to be configured in StarRocks in advance, and users need to be granted the `USAGE_PRIV` permission before they can apply the resources to Spark load.
-Spark resource parameters can be set when the user has a temporary need, such as adding resources for a job and modifying Spark configs. The setting only takes effect on this job and does not affect the existing configurations in the StarRocks cluster.
-
-~~~sql
-WITH RESOURCE 'spark0'
-(
- "spark.driver.memory" = "1g",
- "spark.executor.memory" = "3g"
-)
-~~~
-
-- **Import when the data source is Hive**
-
-Currently, to use a Hive table in the import process, you need to create an external table of the `Hive` type and then specify its name when submitting the import command.
-
-- **Import process to build a global dictionary**
-
-In the load command, you can specify the required fields for building the global dictionary in the following format: `StarRocks field name=bitmap_dict(hive table field name)` Note that currently **the global dictionary is only supported when the upstream data source is a Hive table**.
-
-- **Load binary type data**
-
-Since v2.5.17, Spark Load supports the bitmap_from_binary function, which can convert binary data into bitmap data. If the column type of the Hive table or HDFS file is binary and the corresponding column in the StarRocks table is a bitmap-type aggregate column, you can specify the fields in the load command in the following format, `StarRocks field name=bitmap_from_binary(Hive table field name)`. This eliminates the need for building a global dictionary.
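-
-For example, the following is a hedged sketch that assumes a Hive external table `hive_t2` with a binary column `uid_bin` and a StarRocks table `tbl2` whose `uid` column is a BITMAP aggregate column (all object names here are illustrative, not from the original example):
-
-~~~sql
-LOAD LABEL db1.label2
-(
-    DATA FROM TABLE hive_t2
-    INTO TABLE tbl2
-    SET
-    (
-        uid=bitmap_from_binary(uid_bin)
-    )
-)
-WITH RESOURCE 'spark0'
-(
-    "spark.executor.memory" = "2g"
-)
-PROPERTIES
-(
-    "timeout" = "3600"
-);
-~~~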
-
-## Viewing Import Jobs
-
-The Spark load import is asynchronous, as is the broker load. The user must record the label of the import job and use it in the `SHOW LOAD` command to view the import results. The command to view the import is common to all import methods. The example is as follows.
-
-Refer to Broker Load for a detailed explanation of the returned parameters. The differences are as follows.
-
-~~~sql
-mysql> show load order by createtime desc limit 1\G
-*************************** 1. row ***************************
- JobId: 76391
- Label: label1
- State: FINISHED
- Progress: ETL:100%; LOAD:100%
- Type: SPARK
- EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
- TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
- ErrorMsg: N/A
- CreateTime: 2019-07-27 11:46:42
- EtlStartTime: 2019-07-27 11:46:44
- EtlFinishTime: 2019-07-27 11:49:44
- LoadStartTime: 2019-07-27 11:49:44
-LoadFinishTime: 2019-07-27 11:50:16
- URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
- JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
-~~~
-
-- **State**
-
-The current stage of the import job:
-
-- `PENDING`: The job is submitted.
-- `ETL`: The Spark ETL task is submitted.
-- `LOADING`: The FE schedules BEs to execute the push operations.
-- `FINISHED`: The push is completed and the version takes effect.
-
-The import job has two final states, `CANCELLED` and `FINISHED`, both of which indicate that the load job is completed. `CANCELLED` indicates import failure and `FINISHED` indicates import success.
-
-- **Progress**
-
-Description of the import job progress. There are two types of progress, ETL and LOAD, which correspond to the two phases of the import process, ETL and LOADING.
-
-- The range of progress for LOAD is 0~100%.
-
-`LOAD progress = (the number of tablets that have completed loading on all replicas / the total number of tablets in this import job) * 100%`.
-
-- If all tablets have been loaded, the LOAD progress is 99%, and it changes to 100% when the import enters the final validation phase.
-
-- The import progress is not linear. If there is no change in progress for a period of time, it does not mean that the import is not executing.
-
-- **Type**
-
-  The type of the import job. The value is SPARK for Spark Load.
-
-- **CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime**
-
-These values represent the time when the import was created, when the ETL phase started, when the ETL phase completed, when the LOADING phase started, and when the entire import job was completed.
-
-- **JobDetails**
-
-Displays the detailed running status of the job, including the number of imported files, total size (in bytes), number of subtasks, number of raw rows being processed, etc. For example:
-
-~~~json
- {"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}
-~~~
-
-- **URL**
-
-You can copy this URL into your browser to access the web interface of the corresponding application.
-
-### View Apache Spark™ Launcher commit logs
-
-Sometimes users need to view the detailed logs generated during an Apache Spark™ job submission. By default, the logs are saved in the path `log/spark_launcher_log` in the FE root directory and named `spark-launcher-{load-job-id}-{label}.log`. The logs are kept in this directory for a period of time and are erased when the import information in the FE metadata is cleaned up. The default retention time is 3 days.
-
-### Cancel Import
-
-When the Spark load job status is not `CANCELLED` or `FINISHED`, it can be cancelled manually by the user by specifying the Label of the import job.
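-
-For example, assuming a Spark load job with the label `label1` in the database `db1` (illustrative names), you can cancel it with [CANCEL LOAD](../sql-reference/sql-statements/loading_unloading/CANCEL_LOAD.md):
-
-~~~sql
-CANCEL LOAD
-FROM db1
-WHERE LABEL = "label1";
-~~~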
-
----
-
-## Related System Configurations
-
-**FE Configuration:** The following configuration is the system-level configuration of Spark load, which applies to all Spark load import jobs. The configuration values can be adjusted mainly by modifying `fe.conf`.
-
-- `enable_spark_load`: Enables Spark load and resource creation. The default value is false.
-- `spark_load_default_timeout_second`: The default timeout for the job is 259200 seconds (3 days).
-- `spark_home_default_dir`: The Spark client path (`fe/lib/spark2x`).
-- `spark_resource_path`: The path to the packaged Spark dependency file (empty by default).
-- `spark_launcher_log_dir`: The directory where the commit log of the Spark client is stored (`fe/log/spark_launcher_log`).
-- `yarn_client_path`: The path to the yarn binary executable (`fe/lib/yarn-client/hadoop/bin/yarn`).
-- `yarn_config_dir`: The path to the yarn configuration files (`fe/lib/yarn-config`).
-
----
-
-## Best Practices
-
-The most suitable scenario for using Spark load is when the raw data is in the file system (HDFS) and the data volume is in the tens of GB to TB level. Use Stream Load or Broker Load for smaller data volumes.
-
-For the full spark load import example, refer to the demo on github: [https://github.com/StarRocks/demo/blob/master/docs/03_sparkLoad2StarRocks.md](https://github.com/StarRocks/demo/blob/master/docs/03_sparkLoad2StarRocks.md)
-
-## FAQs
-
-- `Error: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.`
-
-  This error occurs when Spark Load is used without configuring the `HADOOP_CONF_DIR` environment variable in `spark-env.sh` of the Spark client.
-
-- `Error: Cannot run program "xxx/bin/spark-submit": error=2, No such file or directory`
-
-  This error occurs when the `spark_home_default_dir` configuration item does not specify the Spark client root directory.
-
-- `Error: File xxx/jars/spark-2x.zip does not exist.`
-
-  This error occurs when the `spark_resource_path` configuration item does not point to the packaged zip file.
-
-- `Error: yarn client does not exist in path: xxx/yarn-client/hadoop/bin/yarn`
-
-  This error occurs when the `yarn_client_path` configuration item does not specify the yarn executable.
-
-- `ERROR: Cannot execute hadoop-yarn/bin/... /libexec/yarn-config.sh`
-
-  When using Hadoop with CDH, you need to configure the `HADOOP_LIBEXEC_DIR` environment variable.
-  Because the `hadoop-yarn` and hadoop directories are different, the default `libexec` directory is looked up under `hadoop-yarn/bin/... /libexec`, while `libexec` actually resides in the hadoop directory.
-  Otherwise, the `yarn application status` command used to get the Spark task status reports an error, causing the import job to fail.
diff --git a/docs/en/loading/StreamLoad.md b/docs/en/loading/StreamLoad.md
deleted file mode 100644
index 2095258..0000000
--- a/docs/en/loading/StreamLoad.md
+++ /dev/null
@@ -1,545 +0,0 @@
----
-displayed_sidebar: docs
-keywords: ['Stream Load']
----
-
-# Load data from a local file system
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-StarRocks provides two methods of loading data from a local file system:
-
-- Synchronous loading using [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)
-- Asynchronous loading using [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)
-
-Each of these options has its own advantages:
-
-- Stream Load supports CSV and JSON file formats. This method is recommended if you want to load data from a small number of files whose individual sizes do not exceed 10 GB.
-- Broker Load supports Parquet, ORC, CSV, and JSON file formats (JSON file format is supported from v3.2.3 onwards). This method is recommended if you want to load data from a large number of files whose individual sizes exceed 10 GB, or if the files are stored in a network attached storage (NAS) device. **Using Broker Load to load data from a local file system is supported from v2.5 onwards.**
-
-For CSV data, take note of the following points:
-
-- You can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter.
-- Null values are denoted by using `\N`. For example, a data file consists of three columns, and a record from that data file holds data in the first and third columns but no data in the second column. In this situation, you need to use `\N` in the second column to denote a null value. This means the record must be compiled as `a,\N,b` instead of `a,,b`. `a,,b` denotes that the second column of the record holds an empty string.
-
-Stream Load and Broker Load both support data transformation during loading and support data changes made by UPSERT and DELETE operations during loading. For more information, see [Transform data at loading](../loading/Etl_in_loading.md) and [Change data through loading](../loading/Load_to_Primary_Key_tables.md).
-
-## Before you begin
-
-### Check privileges
-
-<InsertPrivNote />
-
-### Check network configuration
-
-Make sure that the machine on which the data you want to load resides can access the FE and BE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`), respectively.
-
-## Loading from a local file system via Stream Load
-
-Stream Load is an HTTP PUT-based synchronous loading method. After you submit a load job, StarRocks synchronously runs the job, and returns the result of the job after the job finishes. You can determine whether the job is successful based on the job result.
-
-> **NOTICE**
->
-> After you load data into a StarRocks table by using Stream Load, the data of the materialized views that are created on that table is also updated.
-
-### How it works
-
-You can submit a load request from your client to an FE over HTTP, and the FE then uses an HTTP redirect to forward the load request to a specific BE or CN. You can also directly submit a load request from your client to a BE or CN of your choice.
-
-:::note
-
-If you submit load requests to an FE, the FE uses a polling mechanism to decide which BE or CN will serve as a coordinator to receive and process the load requests. The polling mechanism helps achieve load balancing within your StarRocks cluster. Therefore, we recommend that you send load requests to an FE.
-
-:::
-
-The BE or CN that receives the load request runs as the Coordinator BE or CN to split data based on the used schema into portions and assign each portion of the data to the other involved BEs or CNs. After the load finishes, the Coordinator BE or CN returns the result of the load job to your client. Note that if you stop the Coordinator BE or CN during the load, the load job fails.
-
-The following figure shows the workflow of a Stream Load job.
-
-
-
-### Limits
-
-Stream Load does not support loading the data of a CSV file that contains a JSON-formatted column.
-
-### Typical example
-
-This section uses curl as an example to describe how to load the data of a CSV or JSON file from your local file system into StarRocks. For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-Note that in StarRocks some literals are used as reserved keywords by the SQL language. Do not directly use these keywords in SQL statements. If you want to use such a keyword in an SQL statement, enclose it in a pair of backticks (`). See [Keywords](../sql-reference/sql-statements/keywords.md).
-
-#### Load CSV data
-
-##### Prepare datasets
-
-In your local file system, create a CSV file named `example1.csv`. The file consists of three columns, which represent the user ID, user name, and user score in sequence.
-
-```Plain
-1,Lily,23
-2,Rose,23
-3,Alice,24
-4,Julia,25
-```
-
-##### Create a database and a table
-
-Create a database and switch to it:
-
-```SQL
-CREATE DATABASE IF NOT EXISTS mydatabase;
-USE mydatabase;
-```
-
-Create a Primary Key table named `table1`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
-```SQL
-CREATE TABLE `table1`
-(
- `id` int(11) NOT NULL COMMENT "user ID",
- `name` varchar(65533) NULL COMMENT "user name",
- `score` int(11) NOT NULL COMMENT "user score"
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-DISTRIBUTED BY HASH(`id`);
-```
-
-:::note
-
-Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-:::
-
-##### Start a Stream Load
-
-Run the following command to load the data of `example1.csv` into `table1`:
-
-```Bash
-curl --location-trusted -u : -H "label:123" \
- -H "Expect:100-continue" \
- -H "column_separator:," \
- -H "columns: id, name, score" \
- -T example1.csv -XPUT \
-    http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
-```
-
-:::note
-
-- If you use an account for which no password is set, you need to input only `<username>:`.
-- You can use [SHOW FRONTENDS](../sql-reference/sql-statements/cluster-management/nodes_processes/SHOW_FRONTENDS.md) to view the IP address and HTTP port of the FE node.
-
-:::
-
-`example1.csv` consists of three columns, which are separated by commas (,) and can be mapped in sequence onto the `id`, `name`, and `score` columns of `table1`. Therefore, you need to use the `column_separator` parameter to specify the comma (,) as the column separator. You also need to use the `columns` parameter to temporarily name the three columns of `example1.csv` as `id`, `name`, and `score`, which are mapped in sequence onto the three columns of `table1`.
-
-After the load is complete, you can query `table1` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table1;
-+------+-------+-------+
-| id | name | score |
-+------+-------+-------+
-| 1 | Lily | 23 |
-| 2 | Rose | 23 |
-| 3 | Alice | 24 |
-| 4 | Julia | 25 |
-+------+-------+-------+
-4 rows in set (0.00 sec)
-```
-
-#### Load JSON data
-
-Since v3.2.7, Stream Load supports compressing JSON data during transmission, reducing network bandwidth overhead. Users can specify different compression algorithms using the parameters `compression` and `Content-Encoding`. Supported compression algorithms include GZIP, BZIP2, LZ4_FRAME, and ZSTD. For the syntax, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-##### Prepare datasets
-
-In your local file system, create a JSON file named `example2.json`. The file consists of two columns, which represent city ID and city name in sequence.
-
-```JSON
-{"name": "Beijing", "code": 2}
-```
-
-##### Create a database and a table
-
-Create a database and switch to it:
-
-```SQL
-CREATE DATABASE IF NOT EXISTS mydatabase;
-USE mydatabase;
-```
-
-Create a Primary Key table named `table2`. The table consists of two columns: `id` and `city`, of which `id` is the primary key.
-
-```SQL
-CREATE TABLE `table2`
-(
- `id` int(11) NOT NULL COMMENT "city ID",
- `city` varchar(65533) NULL COMMENT "city name"
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-DISTRIBUTED BY HASH(`id`);
-```
-
-:::note
-
-Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-:::
-
-##### Start a Stream Load
-
-Run the following command to load the data of `example2.json` into `table2`:
-
-```Bash
-curl -v --location-trusted -u <username>:<password> -H "strict_mode: true" \
- -H "Expect:100-continue" \
- -H "format: json" -H "jsonpaths: [\"$.name\", \"$.code\"]" \
- -H "columns: city,tmp_id, id = tmp_id * 100" \
- -T example2.json -XPUT \
-    http://<fe_host>:<fe_http_port>/api/mydatabase/table2/_stream_load
-```
-
-:::note
-
-- If you use an account for which no password is set, you need to input only `<username>:`.
-- You can use [SHOW FRONTENDS](../sql-reference/sql-statements/cluster-management/nodes_processes/SHOW_FRONTENDS.md) to view the IP address and HTTP port of the FE node.
-
-:::
-
-`example2.json` consists of two keys, `name` and `code`, which are mapped onto the `id` and `city` columns of `table2`, as shown in the following figure.
-
-
-
-The mappings shown in the preceding figure are described as follows:
-
-- StarRocks extracts the `name` and `code` keys of `example2.json` and maps them onto the `name` and `code` fields declared in the `jsonpaths` parameter.
-
-- StarRocks extracts the `name` and `code` fields declared in the `jsonpaths` parameter and **maps them in sequence** onto the `city` and `tmp_id` fields declared in the `columns` parameter.
-
-- StarRocks extracts the `city` and `tmp_id` fields declared in the `columns` parameter and **maps them by name** onto the `city` and `id` columns of `table2`.
-
-:::note
-
-In the preceding example, the value of `code` in `example2.json` is multiplied by 100 before it is loaded into the `id` column of `table2`.
-
-:::
-
-For detailed mappings between `jsonpaths`, `columns`, and the columns of the StarRocks table, see the "Column mappings" section in [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-After the load is complete, you can query `table2` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table2;
-+------+--------+
-| id | city |
-+------+--------+
-|  200 | Beijing|
-+------+--------+
-1 row in set (0.01 sec)
-```
-
-import Beta from '../_assets/commonMarkdown/_beta.mdx'
-
-#### Merge Stream Load requests
-
-<Beta />
-
-From v3.4.0, the system supports merging multiple Stream Load requests.
-
-:::warning
-
-Note that the Merge Commit optimization is suitable for the scenario with **concurrent** Stream Load jobs on a single table. It is not recommended if the concurrency is one. Meanwhile, think twice before setting `merge_commit_async` to `false` and `merge_commit_interval_ms` to a large value because they may cause load performance degradation.
-
-:::
-
-Merge Commit is an optimization for Stream Load, designed for high-concurrency, small-batch (from KB to tens of MB) real-time loading scenarios. In earlier versions, each Stream Load request would generate a transaction and a data version, which led to the following issues in high-concurrency loading scenarios:
-
-- Excessive data versions impact query performance, and limiting the number of versions may cause `too many versions` errors.
-- Data version merging through Compaction increases resource consumption.
-- It generates small files, increasing IOPS and I/O latency. In shared-data clusters, this also raises cloud object storage costs.
-- The Leader FE node, acting as the transaction manager, may become a performance bottleneck.
-
-Merge Commit mitigates these issues by merging multiple concurrent Stream Load requests within a time window into a single transaction. This reduces the number of transactions and versions generated by high-concurrency requests, thereby improving loading performance.
-
-Merge Commit supports both synchronous and asynchronous modes. Each mode has advantages and disadvantages. You can choose based on your use cases.
-
-- **Synchronous mode**
-
- The server returns only after the merged transaction is committed, ensuring the loading is successful and visible.
-
-- **Asynchronous mode**
-
- The server returns immediately after receiving the data. This mode does not ensure the loading is successful.
-
-| **Mode** | **Advantages** | **Disadvantages** |
-| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
-| Synchronous | <ul><li>Ensures data persistence and visibility upon request return.</li><li>Guarantees that multiple sequential loading requests from the same client are executed in order.</li></ul> | Each loading request from the client is blocked until the server closes the merge window. It may reduce the data processing capability of a single client if the window is excessively large. |
-| Asynchronous | Allows a single client to send subsequent loading requests without waiting for the server to close the merge window, improving loading throughput. | <ul><li>Does not guarantee data persistence or visibility upon return. The client must later verify the transaction status.</li><li>Does not guarantee that multiple sequential loading requests from the same client are executed in order.</li></ul> |
-
-##### Start a Stream Load
-
-- Run the following command to start a Stream Load job with Merge Commit enabled in synchronous mode, and set the merging window to `5000` milliseconds and degree of parallelism to `2`:
-
- ```Bash
-  curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "column_separator:," \
- -H "columns: id, name, score" \
- -H "enable_merge_commit:true" \
- -H "merge_commit_interval_ms:5000" \
- -H "merge_commit_parallel:2" \
- -T example1.csv -XPUT \
-      http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
- ```
-
-- Run the following command to start a Stream Load job with Merge Commit enabled in asynchronous mode, and set the merging window to `60000` milliseconds and degree of parallelism to `2`:
-
- ```Bash
-  curl --location-trusted -u <username>:<password> \
- -H "Expect:100-continue" \
- -H "column_separator:," \
- -H "columns: id, name, score" \
- -H "enable_merge_commit:true" \
- -H "merge_commit_async:true" \
- -H "merge_commit_interval_ms:60000" \
- -H "merge_commit_parallel:2" \
- -T example1.csv -XPUT \
-      http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
- ```
-
-:::note
-
-- Merge Commit only supports merging **homogeneous** loading requests into a single database and table. "Homogeneous" indicates that the Stream Load parameters are identical, including: common parameters, JSON format parameters, CSV format parameters, `opt_properties`, and Merge Commit parameters.
-- For loading CSV-formatted data, you must ensure that each row ends with a line separator. `skip_header` is not supported.
-- The server automatically generates labels for transactions. They will be ignored if specified.
-- Merge Commit merges multiple loading requests into a single transaction. If one request contains data quality issues, all requests in the transaction will fail.
-
-:::
-
-#### Check Stream Load progress
-
-After a load job is complete, StarRocks returns the result of the job in JSON format. For more information, see the "Return value" section in [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-Stream Load does not allow you to query the result of a load job by using the SHOW LOAD statement.
-
-#### Cancel a Stream Load job
-
-Stream Load does not allow you to cancel a load job. If a load job times out or encounters errors, StarRocks automatically cancels the job.
-
-### Parameter configurations
-
-This section describes some system parameters that you need to configure if you choose the loading method Stream Load. These parameter configurations take effect on all Stream Load jobs.
-
-- `streaming_load_max_mb`: the maximum size of each data file you want to load. The default maximum size is 10 GB. For more information, see [Configure BE or CN dynamic parameters](../administration/management/BE_configuration.md).
-
- We recommend that you do not load more than 10 GB of data at a time. If the size of a data file exceeds 10 GB, we recommend that you split the data file into small files that each are less than 10 GB in size and then load these files one by one. If you cannot split a data file greater than 10 GB, you can increase the value of this parameter based on the file size.
-
- After you increase the value of this parameter, the new value can take effect only after you restart the BEs or CNs of your StarRocks cluster. Additionally, system performance may deteriorate, and the costs of retries in the event of load failures also increase.
-
- :::note
-
- When you load the data of a JSON file, take note of the following points:
-
- - The size of each JSON object in the file cannot exceed 4 GB. If any JSON object in the file exceeds 4 GB, StarRocks throws an error "This parser can't support a document that big."
-
- - By default, the JSON body in an HTTP request cannot exceed 100 MB. If the JSON body exceeds 100 MB, StarRocks throws an error "The size of this batch exceed the max size [104857600] of json type data data [8617627793]. Set ignore_json_size to skip check, although it may lead huge memory consuming." To prevent this error, you can add `"ignore_json_size:true"` in the HTTP request header to ignore the check on the JSON body size.
-
- :::
-
-- `stream_load_default_timeout_second`: the timeout period of each load job. The default timeout period is 600 seconds. For more information, see [Configure FE dynamic parameters](../administration/management/FE_configuration.md#configure-fe-dynamic-parameters).
-
- If many of the load jobs that you create time out, you can increase the value of this parameter based on the calculation result that you obtain from the following formula:
-
- **Timeout period of each load job > Amount of data to be loaded/Average loading speed**
-
- For example, if the size of the data file that you want to load is 10 GB and the average loading speed of your StarRocks cluster is 100 MB/s, set the timeout period to more than 100 seconds.
-
- :::note
-
- **Average loading speed** in the preceding formula is the average loading speed of your StarRocks cluster. It varies depending on the disk I/O and the number of BEs or CNs in your StarRocks cluster.
-
- :::
-
-  Stream Load also provides the `timeout` parameter, which allows you to specify the timeout period of an individual load job. For more information, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). An example of adjusting the FE-level default follows this list.
-
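-For example, if your load jobs need roughly 20 minutes to finish, you can raise the FE dynamic parameter `stream_load_default_timeout_second` at runtime as follows (the value `1200` is only illustrative):
-
-```SQL
-ADMIN SET FRONTEND CONFIG ("stream_load_default_timeout_second" = "1200");
-```
-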
-### Usage notes
-
-If a field is missing for a record in the data file you want to load and the column onto which the field is mapped in your StarRocks table is defined as `NOT NULL`, StarRocks automatically fills a `NULL` value in the mapping column of your StarRocks table during the load of the record. You can also use the `ifnull()` function to specify the default value that you want to fill.
-
-For example, if the field that represents city ID in the preceding `example2.json` file is missing and you want to fill an `x` value in the mapping column of `table2`, you can specify `"columns: city, tmp_id, id = ifnull(tmp_id, 'x')"`.
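-
-If you want to check what the `ifnull()` expression evaluates to when the field is missing, you can run a quick test from a MySQL client (illustrative only):
-
-```SQL
-SELECT ifnull(NULL, 'x');
-```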
-
-## Loading from a local file system via Broker Load
-
-In addition to Stream Load, you can also use Broker Load to load data from a local file system. This feature is supported from v2.5 onwards.
-
-Broker Load is an asynchronous loading method. After you submit a load job, StarRocks asynchronously runs the job and does not immediately return the job result. You need to query the job result by hand. See [Check Broker Load progress](#check-broker-load-progress).
-
-### Limits
-
-- Currently Broker Load supports loading from a local file system only through a single broker whose version is v2.5 or later.
-- Highly concurrent queries against a single broker may cause issues such as timeout and OOM. To mitigate the impact, you can use the `pipeline_dop` variable (see [System variable](../sql-reference/System_variable.md#pipeline_dop)) to set the query parallelism for Broker Load. For queries against a single broker, we recommend that you set `pipeline_dop` to a value smaller than `16`.
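-
-For example, to lower the query parallelism for the current session before submitting a Broker Load job against a single broker (the value `8` is only illustrative):
-
-```SQL
-SET pipeline_dop = 8;
-```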
-
-### Typical example
-
-Broker Load supports loading from a single data file to a single table, loading from multiple data files to a single table, and loading from multiple data files to multiple tables. This section uses loading from multiple data files to a single table as an example.
-
-Note that in StarRocks some literals are used as reserved keywords by the SQL language. Do not directly use these keywords in SQL statements. If you want to use such a keyword in an SQL statement, enclose it in a pair of backticks (`). See [Keywords](../sql-reference/sql-statements/keywords.md).
-
-#### Prepare datasets
-
-Use the CSV file format as an example. Log in to your local file system, and create two CSV files, `file1.csv` and `file2.csv`, in a specific storage location (for example, `/home/disk1/business/`). Both files consist of three columns, which represent the user ID, user name, and user score in sequence.
-
-- `file1.csv`
-
- ```Plain
- 1,Lily,21
- 2,Rose,22
- 3,Alice,23
- 4,Julia,24
- ```
-
-- `file2.csv`
-
- ```Plain
- 5,Tony,25
- 6,Adam,26
- 7,Allen,27
- 8,Jacky,28
- ```
-
-#### Create a database and a table
-
-Create a database and switch to it:
-
-```SQL
-CREATE DATABASE IF NOT EXISTS mydatabase;
-USE mydatabase;
-```
-
-Create a Primary Key table named `mytable`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
-```SQL
-CREATE TABLE `mytable`
-(
- `id` int(11) NOT NULL COMMENT "User ID",
- `name` varchar(65533) NULL DEFAULT "" COMMENT "User name",
- `score` int(11) NOT NULL DEFAULT "0" COMMENT "User score"
-)
-ENGINE=OLAP
-PRIMARY KEY(`id`)
-DISTRIBUTED BY HASH(`id`)
-PROPERTIES("replication_num"="1");
-```
-
-#### Start a Broker Load
-
-Run the following command to start a Broker Load job that loads data from all data files (`file1.csv` and `file2.csv`) stored in the `/home/disk1/business/` path of your local file system to the StarRocks table `mytable`:
-
-```SQL
-LOAD LABEL mydatabase.label_local
-(
- DATA INFILE("file:///home/disk1/business/csv/*")
- INTO TABLE mytable
- COLUMNS TERMINATED BY ","
- (id, name, score)
-)
-WITH BROKER "sole_broker"
-PROPERTIES
-(
- "timeout" = "3600"
-);
-```
-
-This job has four main sections:
-
-- `LABEL`: A string used when querying the state of the load job.
-- `LOAD` declaration: The source URI, source data format, and destination table name.
-- `BROKER`: The name of the broker used for the load job (`sole_broker` in this example).
-- `PROPERTIES`: The timeout value and any other properties to apply to the load job.
-
-For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
-
-#### Check Broker Load progress
-
-In v3.0 and earlier, use the [SHOW LOAD](../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md) statement or the curl command to view the progress of Broker Load jobs.
-
-In v3.1 and later, you can view the progress of Broker Load jobs from the [`information_schema.loads`](../sql-reference/information_schema/loads.md) view:
-
-```SQL
-SELECT * FROM information_schema.loads;
-```
-
-If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example:
-
-```SQL
-SELECT * FROM information_schema.loads WHERE LABEL = 'label_local';
-```
-
-After you confirm that the load job has finished, you can query the table to see whether the data has been successfully loaded. Example:
-
-```SQL
-SELECT * FROM mytable;
-+------+-------+-------+
-| id | name | score |
-+------+-------+-------+
-| 3 | Alice | 23 |
-| 5 | Tony | 25 |
-| 6 | Adam | 26 |
-| 1 | Lily | 21 |
-| 2 | Rose | 22 |
-| 4 | Julia | 24 |
-| 7 | Allen | 27 |
-| 8 | Jacky | 28 |
-+------+-------+-------+
-8 rows in set (0.07 sec)
-```
-
-#### Cancel a Broker Load job
-
-When a load job is not in the **CANCELLED** or **FINISHED** stage, you can use the [CANCEL LOAD](../sql-reference/sql-statements/loading_unloading/CANCEL_LOAD.md) statement to cancel the job.
-
-For example, you can execute the following statement to cancel a load job, whose label is `label_local`, in the database `mydatabase`:
-
-```SQL
-CANCEL LOAD
-FROM mydatabase
-WHERE LABEL = "label_local";
-```
-
-## Loading from NAS via Broker Load
-
-There are two ways to load data from NAS by using Broker Load:
-
-- Consider NAS as a local file system, and run a load job with a broker. See the previous section "[Loading from a local system via Broker Load](#loading-from-a-local-file-system-via-broker-load)".
-- (Recommended) Consider NAS as a cloud storage system, and run a load job without a broker.
-
-This section introduces the second way. Detailed operations are as follows:
-
-1. Mount your NAS device to the same path on all the BE or CN nodes and FE nodes of your StarRocks cluster. As such, all BEs or CNs can access the NAS device like they access their own locally stored files.
-
-2. Use Broker Load to load data from the NAS device to the destination StarRocks table. Example:
-
- ```SQL
- LOAD LABEL test_db.label_nas
- (
- DATA INFILE("file:///home/disk1/sr/*")
- INTO TABLE mytable
- COLUMNS TERMINATED BY ","
- )
- WITH BROKER
- PROPERTIES
- (
- "timeout" = "3600"
- );
- ```
-
- This job has four main sections:
-
- - `LABEL`: A string used when querying the state of the load job.
- - `LOAD` declaration: The source URI, source data format, and destination table name. Note that `DATA INFILE` in the declaration is used to specify the mount point folder path of the NAS device, as shown in the above example in which `file:///` is the prefix and `/home/disk1/sr` is the mount point folder path.
- - `BROKER`: You do not need to specify the broker name.
- - `PROPERTIES`: The timeout value and any other properties to apply to the load job.
-
- For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md).
-
-After you submit a job, you can view the load progress or cancel the job as needed. For detailed operations, see "[Check Broker Load progress](#check-broker-load-progress)" and "[Cancel a Broker Load job](#cancel-a-broker-load-job)" in this topic.
diff --git a/docs/en/loading/Stream_Load_transaction_interface.md b/docs/en/loading/Stream_Load_transaction_interface.md
deleted file mode 100644
index db5a6a2..0000000
--- a/docs/en/loading/Stream_Load_transaction_interface.md
+++ /dev/null
@@ -1,547 +0,0 @@
----
-displayed_sidebar: docs
-keywords: ['Stream Load']
----
-
-# Load data using Stream Load transaction interface
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-From v2.4 onwards, StarRocks provides a Stream Load transaction interface to implement two-phase commit (2PC) for transactions that are run to load data from external systems such as Apache Flink® and Apache Kafka®. The Stream Load transaction interface helps improve the performance of highly concurrent stream loads.
-
-From v4.0 onwards, the Stream Load transaction interface supports multi-table transactions, that is, loading data into multiple tables within the same database.
-
-This topic describes the Stream Load transaction interface and how to load data into StarRocks by using this interface.
-
-## Description
-
-The Stream Load transaction interface supports using an HTTP protocol-compatible tool or language to call API operations. This topic uses curl as an example to explain how to use this interface. This interface provides various features, such as transaction management, data write, transaction pre-commit, transaction deduplication, and transaction timeout management.
-
-:::note
-Stream Load supports CSV and JSON file formats. This method is recommended if you want to load data from a small number of files whose individual sizes do not exceed 10 GB. Stream Load does not support Parquet file format. If you need to load data from Parquet files, use [INSERT+files()](../loading/InsertInto.md#insert-data-directly-from-files-in-an-external-source-using-files).
-:::
-
-### Transaction management
-
-The Stream Load transaction interface provides the following API operations, which are used to manage transactions:
-
-- `/api/transaction/begin`: starts a new transaction.
-
-- `/api/transaction/prepare`: pre-commits the current transaction and makes data changes temporarily persistent. After you pre-commit a transaction, you can proceed to commit or roll back the transaction. If your cluster crashes after a transaction is pre-committed, you can still proceed to commit the transaction after the cluster is restored.
-
-- `/api/transaction/commit`: commits the current transaction to make data changes persistent.
-
-- `/api/transaction/rollback`: rolls back the current transaction to abort data changes.
-
-> **NOTE**
->
-> After a transaction is pre-committed, do not continue to write data using it. Any further write requests against the transaction return errors.
-
-
-The following diagram shows the relationship between transaction states and operations:
-
-```mermaid
-stateDiagram-v2
- direction LR
- [*] --> PREPARE : begin
- PREPARE --> PREPARED : prepare
- PREPARE --> ABORTED : rollback
- PREPARED --> COMMITTED : commit
- PREPARED --> ABORTED : rollback
-```
-
-### Data write
-
-The Stream Load transaction interface provides the `/api/transaction/load` operation, which is used to write data. You can call this operation multiple times within one transaction.
-
-From v4.0 onwards, you can call `/api/transaction/load` operations on different tables to load data into multiple tables within the same database.
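-
-For illustration, a minimal sketch of two write calls within one multi-table transaction is shown below. The label `txn_multi_example`, the table `table2`, the file paths, and the `<username>`, `<password>`, `<fe_host>`, and `<fe_http_port>` placeholders are hypothetical values, not taken from the examples later in this topic:
-
-```Bash
-# Both calls reuse the label of the transaction started via /api/transaction/begin,
-# and both carry "transaction_type:multi" because they target different tables.
-curl --location-trusted -u <username>:<password> -H "label:txn_multi_example" \
-    -H "Expect:100-continue" \
-    -H "transaction_type:multi" \
-    -H "db:test_db" -H "table:table1" \
-    -T /path/to/table1_data.csv \
-    -XPUT http://<fe_host>:<fe_http_port>/api/transaction/load
-
-curl --location-trusted -u <username>:<password> -H "label:txn_multi_example" \
-    -H "Expect:100-continue" \
-    -H "transaction_type:multi" \
-    -H "db:test_db" -H "table:table2" \
-    -T /path/to/table2_data.csv \
-    -XPUT http://<fe_host>:<fe_http_port>/api/transaction/load
-```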
-
-### Transaction deduplication
-
-The Stream Load transaction interface carries over the labeling mechanism of StarRocks. You can bind a unique label to each transaction to achieve at-most-once guarantees for transactions.
-
-### Transaction timeout management
-
-When you begin a transaction, you can use the `timeout` field in the HTTP request header to specify a timeout period (in seconds) for the transaction from `PREPARE` to `PREPARED` state. If the transaction has not been prepared after this period, it will be automatically aborted. If this field is not specified, the default value is determined by the FE configuration [`stream_load_default_timeout_second`](../administration/management/FE_configuration.md#stream_load_default_timeout_second) (Default: 600 seconds).
-
-When you begin a transaction, you can also use the `idle_transaction_timeout` field in the HTTP request header to specify a timeout period (in seconds) within which the transaction can stay idle. If no data is written within this period, the transaction will be automatically rolled back.
-
-When you prepare a transaction, you can use the `prepared_timeout` field in the HTTP request header to specify a timeout period (in seconds) for the transaction from `PREPARED` to `COMMITTED` state. If the transaction has not been committed after this period, it will be automatically aborted. If this field is not specified, the default value is determined by the FE configuration [`prepared_transaction_default_timeout_second`](../administration/management/FE_configuration.md#prepared_transaction_default_timeout_second) (Default: 86400 seconds). `prepared_timeout` is supported from v3.5.4 onwards.
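-
-As a sketch of how these timeouts are set, the `timeout` and `idle_transaction_timeout` fields are passed as HTTP request headers when starting a transaction. The label, credentials, and FE address below are hypothetical placeholders:
-
-```Bash
-# Start a transaction that must reach the PREPARED state within 600 seconds
-# and is rolled back automatically if no data is written for 300 seconds.
-curl --location-trusted -u <username>:<password> -H "label:txn_timeout_example" \
-    -H "Expect:100-continue" \
-    -H "timeout:600" \
-    -H "idle_transaction_timeout:300" \
-    -H "db:test_db" -H "table:table1" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/begin
-```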
-
-## Benefits
-
-The Stream Load transaction interface brings the following benefits:
-
-- **Exactly-once semantics**
-
- A transaction is split into two phases, pre-commit and commit, which make it easy to load data across systems. For example, this interface can guarantee exactly-once semantics for data loads from Flink.
-
-- **Improved load performance**
-
- If you run a load job by using a program, the Stream Load transaction interface allows you to merge multiple mini-batches of data on demand and then send them all at once within one transaction by calling the `/api/transaction/commit` operation. As such, fewer data versions need to be loaded, and load performance is improved.
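-
-To make this concrete, the following sketch strings the operations described in this topic together: one transaction, several mini-batch writes, and a single commit that publishes one data version. The label `txn_batch_example`, the batch file names, and the `<username>`, `<password>`, `<fe_host>`, and `<fe_http_port>` placeholders are hypothetical:
-
-```Bash
-# Start one transaction for the destination table.
-curl --location-trusted -u <username>:<password> -H "label:txn_batch_example" \
-    -H "Expect:100-continue" -H "db:test_db" -H "table:table1" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/begin
-
-# Write several mini-batches under the same label. The column_separator header
-# assumes comma-separated batch files.
-for batch in batch1.csv batch2.csv batch3.csv; do
-    curl --location-trusted -u <username>:<password> -H "label:txn_batch_example" \
-        -H "Expect:100-continue" -H "db:test_db" -H "table:table1" \
-        -H "column_separator: ," \
-        -T "${batch}" \
-        -XPUT http://<fe_host>:<fe_http_port>/api/transaction/load
-done
-
-# Pre-commit, then commit once. All batches are published as a single data version.
-curl --location-trusted -u <username>:<password> -H "label:txn_batch_example" \
-    -H "Expect:100-continue" -H "db:test_db" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/prepare
-
-curl --location-trusted -u <username>:<password> -H "label:txn_batch_example" \
-    -H "Expect:100-continue" -H "db:test_db" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/commit
-```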
-
-## Limits
-
-The Stream Load transaction interface has the following limits:
-
-- **Single-database multi-table** transactions are supported from v4.0 onwards. Support for **multi-database multi-table** transactions is in development.
-
-- Only **concurrent data writes from one client** are supported. Support for **concurrent data writes from multiple clients** is in development.
-
-- The `/api/transaction/load` operation can be called multiple times within one transaction. In this case, the parameter settings (except `table`) specified for all of the `/api/transaction/load` operations that are called must be the same.
-
-- When you load CSV-formatted data by using the Stream Load transaction interface, make sure that each data record in your data file ends with a row delimiter.
-
-## Precautions
-
-- If the `/api/transaction/begin`, `/api/transaction/load`, or `/api/transaction/prepare` operation that you have called returns errors, the transaction fails and is automatically rolled back.
-- When calling the `/api/transaction/begin` operation to start a new transaction, you must specify a label. Note that the subsequent `/api/transaction/load`, `/api/transaction/prepare`, and `/api/transaction/commit` operations must use the same label as the `/api/transaction/begin` operation.
-- If you use the label of an ongoing transaction to call the `/api/transaction/begin` operation to start a new transaction, the previous transaction will fail and be rolled back.
-- If you use a multi-table transaction to load data into different tables, you must specify the parameter `-H "transaction_type:multi"` for all operations involved in the transaction.
-- The default column separator and row delimiter that StarRocks supports for CSV-formatted data are `\t` and `\n`. If your data file does not use the default column separator or row delimiter, you must use `"column_separator: <column_separator>"` or `"row_delimiter: <row_delimiter>"` to specify the column separator or row delimiter that is actually used in your data file when calling the `/api/transaction/load` operation.
-
-## Before you begin
-
-### Check privileges
-
-<InsertPrivNote />
-
-### Check network configuration
-
-Make sure that the machine on which the data you want to load resides can access the FE and BE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`), respectively.
-
-## Basic operations
-
-### Prepare sample data
-
-This topic uses CSV-formatted data as an example.
-
-1. In the `/home/disk1/` path of your local file system, create a CSV file named `example1.csv`. The file consists of three columns, which represent the user ID, user name, and user score in sequence.
-
- ```Plain
- 1,Lily,23
- 2,Rose,23
- 3,Alice,24
- 4,Julia,25
- ```
-
-2. In your StarRocks database `test_db`, create a Primary Key table named `table1`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key.
-
- ```SQL
- CREATE TABLE `table1`
- (
- `id` int(11) NOT NULL COMMENT "user ID",
- `name` varchar(65533) NULL COMMENT "user name",
- `score` int(11) NOT NULL COMMENT "user score"
- )
- ENGINE=OLAP
- PRIMARY KEY(`id`)
- DISTRIBUTED BY HASH(`id`) BUCKETS 10;
- ```
-
-### Start a transaction
-
-#### Syntax
-
-```Bash
-curl --location-trusted -u <username>:<password> -H "label:<label_name>" \
-    -H "Expect:100-continue" \
-    [-H "transaction_type:multi"]\ # Optional. Initiates a multi-table transaction.
-    -H "db:<database_name>" -H "table:<table_name>" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/begin
-```
-
-> **NOTE**
->
-> Specify `-H "transaction_type:multi"` in the command if you want to load data into different tables within the transaction.
-
-#### Example
-
-```Bash
-curl --location-trusted -u <username>:<123456> -H "label:streamload_txn_example1_table1" \
-    -H "Expect:100-continue" \
-    -H "db:test_db" -H "table:table1" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/begin
-```
-
-> **NOTE**
->
-> For this example, `streamload_txn_example1_table1` is specified as the label of the transaction.
-
-#### Return result
-
-- If the transaction is successfully started, the following result is returned:
-
- ```Bash
- {
- "Status": "OK",
- "Message": "",
- "Label": "streamload_txn_example1_table1",
- "TxnId": 9032,
- "BeginTxnTimeMs": 0
- }
- ```
-
-- If the transaction is bound to a duplicate label, the following result is returned:
-
- ```Bash
- {
- "Status": "LABEL_ALREADY_EXISTS",
- "ExistingJobStatus": "RUNNING",
- "Message": "Label [streamload_txn_example1_table1] has already been used."
- }
- ```
-
-- If errors other than duplicate label occur, the following result is returned:
-
- ```Bash
- {
- "Status": "FAILED",
- "Message": ""
- }
- ```
-
-### Write data
-
-#### Syntax
-
-```Bash
-curl --location-trusted -u <username>:<password> -H "label:<label_name>" \
-    -H "Expect:100-continue" \
-    [-H "transaction_type:multi"]\ # Optional. Loads data via a multi-table transaction.
-    -H "db:<database_name>" -H "table:<table_name>" \
-    -T <file_path> \
-    -XPUT http://<fe_host>:<fe_http_port>/api/transaction/load
-```
-
-> **NOTE**
->
-> - When calling the `/api/transaction/load` operation, you must use `<file_path>` to specify the save path of the data file you want to load.
-> - You can call `/api/transaction/load` operations with different `table` parameter values to load data into different tables within the same database. In this case, you must specify `-H "transaction_type:multi"` in the command.
-
-#### Example
-
-```Bash
-curl --location-trusted -u <username>:<123456> -H "label:streamload_txn_example1_table1" \
-    -H "Expect:100-continue" \
-    -H "db:test_db" -H "table:table1" \
-    -T /home/disk1/example1.csv \
-    -H "column_separator: ," \
-    -XPUT http://<fe_host>:<fe_http_port>/api/transaction/load
-```
-
-> **NOTE**
->
-> For this example, the column separator used in the data file `example1.csv` is a comma (`,`) instead of StarRocks's default column separator (`\t`). Therefore, when calling the `/api/transaction/load` operation, you must use `"column_separator: <column_separator>"` to specify the comma (`,`) as the column separator.
-
-#### Return result
-
-- If the data write is successful, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Seq": 0,
- "Label": "streamload_txn_example1_table1",
- "Status": "OK",
- "Message": "",
- "NumberTotalRows": 5265644,
- "NumberLoadedRows": 5265644,
- "NumberFilteredRows": 0,
- "NumberUnselectedRows": 0,
- "LoadBytes": 10737418067,
- "LoadTimeMs": 418778,
- "StreamLoadPutTimeMs": 68,
-      "ReceivedDataTimeMs": 38964
- }
- ```
-
-- If the transaction is considered unknown, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": "TXN_NOT_EXISTS"
- }
- ```
-
-- If the transaction is considered in an invalid state, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": "Transcation State Invalid"
- }
- ```
-
-- If errors other than unknown transaction and invalid status occur, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": ""
- }
- ```
-
-### Pre-commit a transaction
-
-#### Syntax
-
-```Bash
-curl --location-trusted -u <username>:<password> -H "label:<label_name>" \
-    -H "Expect:100-continue" \
-    [-H "transaction_type:multi"]\ # Optional. Pre-commits a multi-table transaction.
-    -H "db:<database_name>" \
-    [-H "prepared_timeout:<prepared_timeout>"] \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/prepare
-```
-
-> **NOTE**
->
-> Specify `-H "transaction_type:multi"` in the command if the transaction you want to pre-commit is a multi-table transaction.
-
-#### Example
-
-```Bash
-curl --location-trusted -u <username>:<123456> -H "label:streamload_txn_example1_table1" \
-    -H "Expect:100-continue" \
-    -H "db:test_db" \
-    -H "prepared_timeout:300" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/prepare
-```
-
-> **NOTE**
->
-> The `prepared_timeout` field is optional. If it is not specified, the default value is determined by the FE configuration [`prepared_transaction_default_timeout_second`](../administration/management/FE_configuration.md#prepared_transaction_default_timeout_second) (Default: 86400 seconds). `prepared_timeout` is supported from v3.5.4 onwards.
-
-#### Return result
-
-- If the pre-commit is successful, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "OK",
- "Message": "",
- "NumberTotalRows": 5265644,
- "NumberLoadedRows": 5265644,
- "NumberFilteredRows": 0,
- "NumberUnselectedRows": 0,
- "LoadBytes": 10737418067,
- "LoadTimeMs": 418778,
- "StreamLoadPutTimeMs": 68,
- "ReceivedDataTimeMs": 38964,
-      "WriteDataTimeMs": 417851,
- "CommitAndPublishTimeMs": 1393
- }
- ```
-
-- If the transaction is considered not existent, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": "Transcation Not Exist"
- }
- ```
-
-- If the pre-commit times out, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
-      "Message": "commit timeout"
- }
- ```
-
-- If errors other than non-existent transaction and pre-commit timeout occur, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": "publish timeout"
- }
- ```
-
-### Commit a transaction
-
-#### Syntax
-
-```Bash
-curl --location-trusted -u : -H "label:" \
- -H "Expect:100-continue" \
- [-H "transaction_type:multi"]\ # Optional. Commits a multi-table transaction.
- -H "db:" \
- -XPOST http://:/api/transaction/commit
-```
-
-> **NOTE**
->
-> Specify `-H "transaction_type:multi"` in the command if the transaction you want to commit is a multi-table transaction.
-
-#### Example
-
-```Bash
-curl --location-trusted -u <username>:<123456> -H "label:streamload_txn_example1_table1" \
-    -H "Expect:100-continue" \
-    -H "db:test_db" \
-    -XPOST http://<fe_host>:<fe_http_port>/api/transaction/commit
-```
-
-#### Return result
-
-- If the commit is successful, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "OK",
- "Message": "",
- "NumberTotalRows": 5265644,
- "NumberLoadedRows": 5265644,
- "NumberFilteredRows": 0,
- "NumberUnselectedRows": 0,
- "LoadBytes": 10737418067,
- "LoadTimeMs": 418778,
- "StreamLoadPutTimeMs": 68,
- "ReceivedDataTimeMs": 38964,
-      "WriteDataTimeMs": 417851,
- "CommitAndPublishTimeMs": 1393
- }
- ```
-
-- If the transaction has already been committed, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "OK",
-      "Message": "Transaction already commited"
- }
- ```
-
-- If the transaction is considered not existent, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": "Transcation Not Exist"
- }
- ```
-
-- If the commit times out, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
-      "Message": "commit timeout"
- }
- ```
-
-- If the data publish times out, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": "publish timeout",
- "CommitAndPublishTimeMs": 1393
- }
- ```
-
-- If errors other than non-existent transaction and timeout occur, the following result is returned:
-
- ```Bash
- {
- "TxnId": 1,
- "Label": "streamload_txn_example1_table1",
- "Status": "FAILED",
- "Message": ""
- }
- ```
-
-### Roll back a transaction
-
-#### Syntax
-
-```Bash
-curl --location-trusted -u : -H "label:" \
- -H "Expect:100-continue" \
- [-H "transaction_type:multi"]\ # Optional. Rolls back a multi-table transaction.
- -H "db:" \
- -XPOST http://:/api/transaction/rollback
-```
-
-> **NOTE**
->
-> Specify `-H "transaction_type:multi"` in the command if the transaction you want to roll back is a multi-table transaction.
-
-#### Example
-
-```Bash
-curl --location-trusted -u