diff --git a/docs/en/introduction/Architecture.md b/docs/en/introduction/Architecture.md index 2dd28b5..f43627c 100644 --- a/docs/en/introduction/Architecture.md +++ b/docs/en/introduction/Architecture.md @@ -5,7 +5,7 @@ import QSOverview from '../_assets/commonMarkdown/quickstart-overview-tip.mdx' # Architecture -StarRocks has a simple architecture. The entire system consists of only two types of components; frontends and backends. The frontend nodes are called **FE**s. There are two types of backend nodes, **BE**s, and **CN**s (Compute Nodes). BEs are deployed when local storage for data is used, and CNs are deployed when data is stored on object storage or HDFS. StarRocks does not rely on any external components, simplifying deployment and maintenance. Nodes can be horizontally scaled without service downtime. In addition, StarRocks has a replica mechanism for metadata and service data, which increases data reliability and efficiently prevents single points of failure (SPOFs). +StarRocks has a robust architecture. The entire system consists of only two types of components; frontends and backends. The frontend nodes are called **FE**s. There are two types of backend nodes, **BE**s, and **CN**s (Compute Nodes). BEs are deployed when local storage for data is used, and CNs are deployed when data is stored on object storage or HDFS. StarRocks does not rely on any external components, simplifying deployment and maintenance. Nodes can be horizontally scaled without service downtime. In addition, StarRocks has a replica mechanism for metadata and service data, which increases data reliability and efficiently prevents single points of failure (SPOFs). StarRocks is compatible with MySQL protocols and supports standard SQL. Users can easily connect to StarRocks from MySQL clients to gain instant and valuable insights. diff --git a/docs/en/loading/Etl_in_loading.md b/docs/en/loading/Etl_in_loading.md deleted file mode 100644 index df52141..0000000 --- a/docs/en/loading/Etl_in_loading.md +++ /dev/null @@ -1,452 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Transform data at loading - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks supports data transformation at loading. - -This feature supports [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), and [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) but does not support [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md). - - - -This topic uses CSV data as an example to describe how to extract and transform data at loading. The data file formats that are supported vary depending on the loading method of your choice. - -> **NOTE** -> -> For CSV data, you can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter. - -## Scenarios - -When you load a data file into a StarRocks table, the data of the data file may not be completely mapped onto the data of the StarRocks table. In this situation, you do not need to extract or transform the data before you load it into the StarRocks table. StarRocks can help you extract and transform the data during loading: - -- Skip columns that do not need to be loaded. - - You can skip the columns that do not need to be loaded. 
Additionally, if the columns of the data file are in a different order than the columns of the StarRocks table, you can create a column mapping between the data file and the StarRocks table. - -- Filter out rows you do not want to load. - - You can specify filter conditions based on which StarRocks filters out the rows that you do not want to load. - -- Generate new columns from original columns. - - Generated columns are special columns that are computed from the original columns of the data file. You can map the generated columns onto the columns of the StarRocks table. - -- Extract partition field values from a file path. - - If the data file is generated from Apache Hive™, you can extract partition field values from the file path. - -## Data examples - -1. Create data files in your local file system. - - a. Create a data file named `file1.csv`. The file consists of four columns, which represent user ID, user gender, event date, and event type in sequence. - - ```Plain - 354,female,2020-05-20,1 - 465,male,2020-05-21,2 - 576,female,2020-05-22,1 - 687,male,2020-05-23,2 - ``` - - b. Create a data file named `file2.csv`. The file consists of only one column, which represents date. - - ```Plain - 2020-05-20 - 2020-05-21 - 2020-05-22 - 2020-05-23 - ``` - -2. Create tables in your StarRocks database `test_db`. - - > **NOTE** - > - > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - - a. Create a table named `table1`, which consists of three columns: `event_date`, `event_type`, and `user_id`. - - ```SQL - MySQL [test_db]> CREATE TABLE table1 - ( - `event_date` DATE COMMENT "event date", - `event_type` TINYINT COMMENT "event type", - `user_id` BIGINT COMMENT "user ID" - ) - DISTRIBUTED BY HASH(user_id); - ``` - - b. Create a table named `table2`, which consists of four columns: `date`, `year`, `month`, and `day`. - - ```SQL - MySQL [test_db]> CREATE TABLE table2 - ( - `date` DATE COMMENT "date", - `year` INT COMMENT "year", - `month` TINYINT COMMENT "month", - `day` TINYINT COMMENT "day" - ) - DISTRIBUTED BY HASH(date); - ``` - -3. Upload `file1.csv` and `file2.csv` to the `/user/starrocks/data/input/` path of your HDFS cluster, publish the data of `file1.csv` to `topic1` of your Kafka cluster, and publish the data of `file2.csv` to `topic2` of your Kafka cluster. - -## Skip columns that do not need to be loaded - -The data file that you want to load into a StarRocks table may contain some columns that cannot be mapped to any columns of the StarRocks table. In this situation, StarRocks supports loading only the columns that can be mapped from the data file onto the columns of the StarRocks table. - -This feature supports loading data from the following data sources: - -- Local file system - -- HDFS and cloud storage - - > **NOTE** - > - > This section uses HDFS as an example. - -- Kafka - -In most cases, the columns of a CSV file are not named. For some CSV files, the first row is composed of column names, but StarRocks processes the content of the first row as common data rather than column names. Therefore, when you load a CSV file, you must temporarily name the columns of the CSV file **in sequence** in the job creation statement or command. 
These temporarily named columns are mapped **by name** onto the columns of the StarRocks table. Take note of the following points about the columns of the data file: - -- The data of the columns that can be mapped onto and are temporarily named by using the names of the columns in the StarRocks table is directly loaded. - -- The columns that cannot be mapped onto the columns of the StarRocks table are ignored, the data of these columns are not loaded. - -- If some columns can be mapped onto the columns of the StarRocks table but are not temporarily named in the job creation statement or command, the load job reports errors. - -This section uses `file1.csv` and `table1` as an example. The four columns of `file1.csv` are temporarily named as `user_id`, `user_gender`, `event_date`, and `event_type` in sequence. Among the temporarily named columns of `file1.csv`, `user_id`, `event_date`, and `event_type` can be mapped onto specific columns of `table1`, whereas `user_gender` cannot be mapped onto any column of `table1`. Therefore, `user_id`, `event_date`, and `event_type` are loaded into `table1`, but `user_gender` is not. - -### Load data - -#### Load data from a local file system - -If `file1.csv` is stored in your local file system, run the following command to create a [Stream Load](../loading/StreamLoad.md) job: - -```Bash -curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: user_id, user_gender, event_date, event_type" \ - -T file1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load -``` - -> **NOTE** -> -> If you choose Stream Load, you must use the `columns` parameter to temporarily name the columns of the data file to create a column mapping between the data file and the StarRocks table. - -For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -#### Load data from an HDFS cluster - -If `file1.csv` is stored in your HDFS cluster, execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job: - -```SQL -LOAD LABEL test_db.label1 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/file1.csv") - INTO TABLE `table1` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (user_id, user_gender, event_date, event_type) -) -WITH BROKER; -``` - -> **NOTE** -> -> If you choose Broker Load, you must use the `column_list` parameter to temporarily name the columns of the data file to create a column mapping between the data file and the StarRocks table. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Load data from a Kafka cluster - -If the data of `file1.csv` is published to `topic1` of your Kafka cluster, execute the following statement to create a [Routine Load](../loading/RoutineLoad.md) job: - -```SQL -CREATE ROUTINE LOAD test_db.table101 ON table1 - COLUMNS TERMINATED BY ",", - COLUMNS(user_id, user_gender, event_date, event_type) -FROM KAFKA -( - "kafka_broker_list" = ":", - "kafka_topic" = "topic1", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -> **NOTE** -> -> If you choose Routine Load, you must use the `COLUMNS` parameter to temporarily name the columns of the data file to create a column mapping between the data file and the StarRocks table. - -For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). 
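
Optionally, before you query the destination table, you can check the status of the Routine Load job to confirm that it is consuming from Kafka. A minimal sketch, reusing the job name `test_db.table101` from the preceding example:

```SQL
-- Check the status of the Routine Load job created above.
-- The State column should show RUNNING once the job starts consuming from topic1.
SHOW ROUTINE LOAD FOR test_db.table101;
```
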
- -### Query data - -After the load of data from your local file system, HDFS cluster, or Kafka cluster is complete, query the data of `table1` to verify that the load is successful: - -```SQL -MySQL [test_db]> SELECT * FROM table1; -+------------+------------+---------+ -| event_date | event_type | user_id | -+------------+------------+---------+ -| 2020-05-22 | 1 | 576 | -| 2020-05-20 | 1 | 354 | -| 2020-05-21 | 2 | 465 | -| 2020-05-23 | 2 | 687 | -+------------+------------+---------+ -4 rows in set (0.01 sec) -``` - -## Filter out rows that you do not want to load - -When you load a data file into a StarRocks table, you may not want to load specific rows of the data file. In this situation, you can use the WHERE clause to specify the rows that you want to load. StarRocks filters out all rows that do not meet the filter conditions specified in the WHERE clause. - -This feature supports loading data from the following data sources: - -- Local file system - -- HDFS and cloud storage - > **NOTE** - > - > This section uses HDFS as an example. - -- Kafka - -This section uses `file1.csv` and `table1` as an example. If you want to load only the rows whose event type is `1` from `file1.csv` into `table1`, you can use the WHERE clause to specify a filter condition `event_type = 1`. - -### Load data - -#### Load data from a local file system - -If `file1.csv` is stored in your local file system, run the following command to create a [Stream Load](../loading/StreamLoad.md) job: - -```Bash -curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: user_id, user_gender, event_date, event_type" \ - -H "where: event_type=1" \ - -T file1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load -``` - -For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -#### Load data from an HDFS cluster - -If `file1.csv` is stored in your HDFS cluster, execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job: - -```SQL -LOAD LABEL test_db.label2 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/file1.csv") - INTO TABLE `table1` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (user_id, user_gender, event_date, event_type) - WHERE event_type = 1 -) -WITH BROKER; -``` - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Load data from a Kafka cluster - -If the data of `file1.csv` is published to `topic1` of your Kafka cluster, execute the following statement to create a [Routine Load](../loading/RoutineLoad.md) job: - -```SQL -CREATE ROUTINE LOAD test_db.table102 ON table1 -COLUMNS TERMINATED BY ",", -COLUMNS (user_id, user_gender, event_date, event_type), -WHERE event_type = 1 -FROM KAFKA -( - "kafka_broker_list" = ":", - "kafka_topic" = "topic1", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). 
- -### Query data - -After the load of data from your local file system, HDFS cluster, or Kafka cluster is complete, query the data of `table1` to verify that the load is successful: - -```SQL -MySQL [test_db]> SELECT * FROM table1; -+------------+------------+---------+ -| event_date | event_type | user_id | -+------------+------------+---------+ -| 2020-05-20 | 1 | 354 | -| 2020-05-22 | 1 | 576 | -+------------+------------+---------+ -2 rows in set (0.01 sec) -``` - -## Generate new columns from original columns - -When you load a data file into a StarRocks table, some data of the data file may require conversions before the data can be loaded into the StarRocks table. In this situation, you can use functions or expressions in the job creation command or statement to implement data conversions. - -This feature supports loading data from the following data sources: - -- Local file system - -- HDFS and cloud storage - > **NOTE** - > - > This section uses HDFS as an example. - -- Kafka - -This section uses `file2.csv` and `table2` as an example. `file2.csv` consists of only one column that represents date. You can use the [year](../sql-reference/sql-functions/date-time-functions/year.md), [month](../sql-reference/sql-functions/date-time-functions/month.md), and [day](../sql-reference/sql-functions/date-time-functions/day.md) functions to extract the year, month, and day in each date from `file2.csv` and load the extracted data into the `year`, `month`, and `day` columns of `table2`. - -### Load data - -#### Load data from a local file system - -If `file2.csv` is stored in your local file system, run the following command to create a [Stream Load](../loading/StreamLoad.md) job: - -```Bash -curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns:date,year=year(date),month=month(date),day=day(date)" \ - -T file2.csv -XPUT \ - http://:/api/test_db/table2/_stream_load -``` - -> **NOTE** -> -> - In the `columns` parameter, you must first temporarily name **all columns** of the data file, and then temporarily name the new columns that you want to generate from the original columns of the data file. As shown in the preceding example, the only column of `file2.csv` is temporarily named as `date`, and then the `year=year(date)`, `month=month(date)`, and `day=day(date)` functions are invoked to generate three new columns, which are temporarily named as `year`, `month`, and `day`. -> -> - Stream Load does not support `column_name = function(column_name)` but supports `column_name = function(column_name)`. - -For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -#### Load data from an HDFS cluster - -If `file2.csv` is stored in your HDFS cluster, execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job: - -```SQL -LOAD LABEL test_db.label3 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/file2.csv") - INTO TABLE `table2` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (date) - SET(year=year(date), month=month(date), day=day(date)) -) -WITH BROKER; -``` - -> **NOTE** -> -> You must first use the `column_list` parameter to temporarily name **all columns** of the data file, and then use the SET clause to temporarily name the new columns that you want to generate from the original columns of the data file. 
As shown in the preceding example, the only column of `file2.csv` is temporarily named as `date` in the `column_list` parameter, and then the `year=year(date)`, `month=month(date)`, and `day=day(date)` functions are invoked in the SET clause to generate three new columns, which are temporarily named as `year`, `month`, and `day`. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Load data from a Kafka cluster - -If the data of `file2.csv` is published to `topic2` of your Kafka cluster, execute the following statement to create a [Routine Load](../loading/RoutineLoad.md) job: - -```SQL -CREATE ROUTINE LOAD test_db.table201 ON table2 - COLUMNS TERMINATED BY ",", - COLUMNS(date,year=year(date),month=month(date),day=day(date)) -FROM KAFKA -( - "kafka_broker_list" = ":", - "kafka_topic" = "topic2", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -> **NOTE** -> -> In the `COLUMNS` parameter, you must first temporarily name **all columns** of the data file, and then temporarily name the new columns that you want to generate from the original columns of the data file. As shown in the preceding example, the only column of `file2.csv` is temporarily named as `date`, and then the `year=year(date)`, `month=month(date)`, and `day=day(date)` functions are invoked to generate three new columns, which are temporarily named as `year`, `month`, and `day`. - -For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -### Query data - -After the load of data from your local file system, HDFS cluster, or Kafka cluster is complete, query the data of `table2` to verify that the load is successful: - -```SQL -MySQL [test_db]> SELECT * FROM table2; -+------------+------+-------+------+ -| date | year | month | day | -+------------+------+-------+------+ -| 2020-05-20 | 2020 | 5 | 20 | -| 2020-05-21 | 2020 | 5 | 21 | -| 2020-05-22 | 2020 | 5 | 22 | -| 2020-05-23 | 2020 | 5 | 23 | -+------------+------+-------+------+ -4 rows in set (0.01 sec) -``` - -## Extract partition field values from a file path - -If the file path that you specify contains partition fields, you can use the `COLUMNS FROM PATH AS` parameter to specify the partition fields you want to extract from the file paths. The partition fields in file paths are equivalent to the columns in data files. The `COLUMNS FROM PATH AS` parameter is supported only when you load data from an HDFS cluster. - -For example, you want to load the following four data files generated from Hive: - -```Plain -/user/starrocks/data/input/date=2020-05-20/data -1,354 -/user/starrocks/data/input/date=2020-05-21/data -2,465 -/user/starrocks/data/input/date=2020-05-22/data -1,576 -/user/starrocks/data/input/date=2020-05-23/data -2,687 -``` - -The four data files are stored in the `/user/starrocks/data/input/` path of your HDFS cluster. Each of these data files is partitioned by partition field `date` and consists of two columns, which represent event type and user ID in sequence. 
- -### Load data from an HDFS cluster - -Execute the following statement to create a [Broker Load](../loading/hdfs_load.md) job, which enables you to extract the `date` partition field values from the `/user/starrocks/data/input/` file path and use a wildcard (*) to specify that you want to load all data files in the file path to `table1`: - -```SQL -LOAD LABEL test_db.label4 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/date=*/*") - INTO TABLE `table1` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (event_type, user_id) - COLUMNS FROM PATH AS (date) - SET(event_date = date) -) -WITH BROKER; -``` - -> **NOTE** -> -> In the preceding example, the `date` partition field in the specified file path is equivalent to the `event_date` column of `table1`. Therefore, you need to use the SET clause to map the `date` partition field onto the `event_date` column. If the partition field in the specified file path has the same name as a column of the StarRocks table, you do not need to use the SET clause to create a mapping. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -### Query data - -After the load of data from your HDFS cluster is complete, query the data of `table1` to verify that the load is successful: - -```SQL -MySQL [test_db]> SELECT * FROM table1; -+------------+------------+---------+ -| event_date | event_type | user_id | -+------------+------------+---------+ -| 2020-05-22 | 1 | 576 | -| 2020-05-20 | 1 | 354 | -| 2020-05-21 | 2 | 465 | -| 2020-05-23 | 2 | 687 | -+------------+------------+---------+ -4 rows in set (0.01 sec) -``` diff --git a/docs/en/loading/Flink-connector-starrocks.md b/docs/en/loading/Flink-connector-starrocks.md deleted file mode 100644 index cbbfe5b..0000000 --- a/docs/en/loading/Flink-connector-starrocks.md +++ /dev/null @@ -1,901 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Continuously load data from Apache Flink® - -StarRocks provides a self-developed connector named StarRocks Connector for Apache Flink® (Flink connector for short) to help you load data into a StarRocks table by using Flink. The basic principle is to accumulate the data and then load it all at a time into StarRocks through [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -The Flink connector supports DataStream API, Table API & SQL, and Python API. It has a higher and more stable performance than [flink-connector-jdbc](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/jdbc/) provided by Apache Flink®. - -> **NOTICE** -> -> Loading data into StarRocks tables with Flink connector needs SELECT and INSERT privileges on the target StarRocks table. If you do not have these privileges, follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant these privileges to the user that you use to connect to your StarRocks cluster. 
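
As a minimal sketch of granting these privileges, assuming a hypothetical loading user `flink_user` and the example table `test`.`score_board` that is used later in this topic:

```SQL
-- Grant the privileges that the Flink connector needs on the target table.
-- Replace the database, table, and user identity with your own.
GRANT SELECT, INSERT ON TABLE test.score_board TO USER 'flink_user'@'%';
```
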
- -## Version requirements - -| Connector | Flink | StarRocks | Java | Scala | -|-----------|-------------------------------|---------------| ---- |-----------| -| 1.2.11 | 1.15,1.16,1.17,1.18,1.19,1.20 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.10 | 1.15,1.16,1.17,1.18,1.19 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.9 | 1.15,1.16,1.17,1.18 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.8 | 1.13,1.14,1.15,1.16,1.17 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.7 | 1.11,1.12,1.13,1.14,1.15 | 2.1 and later | 8 | 2.11,2.12 | - -## Obtain Flink connector - -You can obtain the Flink connector JAR file in the following ways: - -- Directly download the compiled Flink connector JAR file. -- Add the Flink connector as a dependency in your Maven project and then download the JAR file. -- Compile the source code of the Flink connector into a JAR file by yourself. - -The naming format of the Flink connector JAR file is as follows: - -- Since Flink 1.15, it's `flink-connector-starrocks-${connector_version}_flink-${flink_version}.jar`. For example, if you install Flink 1.15 and you want to use Flink connector 1.2.7, you can use `flink-connector-starrocks-1.2.7_flink-1.15.jar`. - -- Prior to Flink 1.15, it's `flink-connector-starrocks-${connector_version}_flink-${flink_version}_${scala_version}.jar`. For example, if you install Flink 1.14 and Scala 2.12 in your environment, and you want to use Flink connector 1.2.7, you can use `flink-connector-starrocks-1.2.7_flink-1.14_2.12.jar`. - -> **NOTICE** -> -> In general, the latest version of the Flink connector only maintains compatibility with the three most recent versions of Flink. - -### Download the compiled Jar file - -Directly download the corresponding version of the Flink connector Jar file from the [Maven Central Repository](https://repo1.maven.org/maven2/com/starrocks). - -### Maven Dependency - -In your Maven project's `pom.xml` file, add the Flink connector as a dependency according to the following format. Replace `flink_version`, `scala_version`, and `connector_version` with the respective versions. - -- In Flink 1.15 and later - - ```xml - - com.starrocks - flink-connector-starrocks - ${connector_version}_flink-${flink_version} - - ``` - -- In versions earlier than Flink 1.15 - - ```xml - - com.starrocks - flink-connector-starrocks - ${connector_version}_flink-${flink_version}_${scala_version} - - ``` - -### Compile by yourself - -1. Download the [Flink connector source code](https://github.com/StarRocks/starrocks-connector-for-apache-flink). -2. Execute the following command to compile the source code of Flink connector into a JAR file. Note that `flink_version` is replaced with the corresponding Flink version. - - ```bash - sh build.sh - ``` - - For example, if the Flink version in your environment is 1.15, you need to execute the following command: - - ```bash - sh build.sh 1.15 - ``` - -3. Go to the `target/` directory to find the Flink connector JAR file, such as `flink-connector-starrocks-1.2.7_flink-1.15-SNAPSHOT.jar`, generated upon compilation. - -> **NOTE** -> -> The name of Flink connector which is not formally released contains the `SNAPSHOT` suffix. - -## Options - -### connector - -**Required**: Yes
-**Default value**: NONE
-**Description**: The connector that you want to use. The value must be "starrocks". - -### jdbc-url - -**Required**: Yes
-**Default value**: NONE
-**Description**: The address that is used to connect to the MySQL server of the FE. You can specify multiple addresses, which must be separated by a comma (,). Format: `jdbc:mysql://:,:,:`. - -### load-url - -**Required**: Yes
-**Default value**: NONE
-**Description**: The address that is used to connect to the HTTP server of the FE. You can specify multiple addresses, which must be separated by a semicolon (;). Format: `:;:`. - -### database-name - -**Required**: Yes
-**Default value**: NONE
-**Description**: The name of the StarRocks database into which you want to load data. - -### table-name - -**Required**: Yes
-**Default value**: NONE
-**Description**: The name of the table that you want to use to load data into StarRocks. - -### username - -**Required**: Yes
-**Default value**: NONE
-**Description**: The username of the account that you want to use to load data into StarRocks. The account needs [SELECT and INSERT privileges](../sql-reference/sql-statements/account-management/GRANT.md) on the target StarRocks table. - -### password - -**Required**: Yes
-**Default value**: NONE
-**Description**: The password of the preceding account. - -### sink.version - -**Required**: No
-**Default value**: AUTO
-**Description**: The interface used to load data. This parameter is supported from Flink connector version 1.2.4 onwards.
  • `V1`: Use the [Stream Load](../loading/StreamLoad.md) interface to load data. Connectors earlier than 1.2.4 support only this mode.
  • `V2`: Use the [Stream Load transaction](./Stream_Load_transaction_interface.md) interface to load data. It requires StarRocks v2.4 or later. `V2` is recommended because it optimizes memory usage and provides a more stable exactly-once implementation.
  • `AUTO`: If the StarRocks version supports transaction Stream Load, the connector automatically chooses `V2`; otherwise, it chooses `V1`.
- -### sink.label-prefix - -**Required**: No
-**Default value**: NONE
-**Description**: The label prefix used by Stream Load. It is recommended to configure this parameter if you use exactly-once with connector 1.2.8 and later. See [exactly-once usage notes](#exactly-once). - -### sink.semantic - -**Required**: No
-**Default value**: at-least-once
-**Description**: The semantics that the sink guarantees. Valid values: **at-least-once** and **exactly-once**. - -### sink.buffer-flush.max-bytes - -**Required**: No
-**Default value**: 94371840(90M)
-**Description**: The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. The maximum value ranges from 64 MB to 10 GB. Setting this parameter to a larger value can improve loading performance but may increase loading latency. This parameter only takes effect when `sink.semantic` is set to `at-least-once`. If `sink.semantic` is set to `exactly-once`, the data in memory is flushed when a Flink checkpoint is triggered. In this circumstance, this parameter does not take effect. - -### sink.buffer-flush.max-rows - -**Required**: No
-**Default value**: 500000
-**Description**: The maximum number of rows that can be accumulated in memory before being sent to StarRocks at a time. This parameter is available only when `sink.version` is `V1` and `sink.semantic` is `at-least-once`. Valid values: 64000 to 5000000. - -### sink.buffer-flush.interval-ms - -**Required**: No
-**Default value**: 300000
-**Description**: The interval at which data is flushed. This parameter is available only when `sink.semantic` is `at-least-once`. Valid values: 1000 to 3600000. Unit: ms. - -### sink.max-retries - -**Required**: No
-**Default value**: 3
-**Description**: The number of times that the system retries to perform the Stream Load job. This parameter is available only when you set `sink.version` to `V1`. Valid values: 0 to 10. - -### sink.connect.timeout-ms - -**Required**: No
-**Default value**: 30000
-**Description**: The timeout for establishing an HTTP connection. Valid values: 100 to 60000. Unit: ms. Before Flink connector v1.2.9, the default value is `1000`. - -### sink.socket.timeout-ms - -**Required**: No
-**Default value**: -1
-**Description**: Supported since 1.2.10. The time duration for which the HTTP client waits for data. Unit: ms. The default value `-1` means there is no timeout. - -### sink.sanitize-error-log - -**Required**: No
-**Default value**: false
-**Description**: Supported since 1.2.12. Whether to sanitize sensitive data in the error log for production security. When this item is set to `true`, sensitive row data and column values in Stream Load error logs are redacted in both the connector and SDK logs. The value defaults to `false` for backward compatibility. - -### sink.wait-for-continue.timeout-ms - -**Required**: No
-**Default value**: 10000
-**Description**: Supported since 1.2.7. The timeout for waiting for the HTTP 100-continue response from the FE. Valid values: `3000` to `60000`. Unit: ms. - -### sink.ignore.update-before - -**Required**: No
-**Default value**: true
-**Description**: Supported since version 1.2.8. Whether to ignore `UPDATE_BEFORE` records from Flink when loading data into Primary Key tables. If this parameter is set to `false`, the record is treated as a delete operation on the StarRocks table. - -### sink.parallelism - -**Required**: No
-**Default value**: NONE
-**Description**: The parallelism of loading. Only available for Flink SQL. If this parameter is not specified, the Flink planner decides the parallelism. **In the scenario of multi-parallelism, users need to guarantee that data is written in the correct order.** - -### sink.properties.* - -**Required**: No
-**Default value**: NONE
-**Description**: The parameters that are used to control Stream Load behavior. For example, the parameter `sink.properties.format` specifies the format used for Stream Load, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -### sink.properties.format - -**Required**: No
-**Default value**: csv
-**Description**: The format used for Stream Load. The Flink connector will transform each batch of data to the format before sending them to StarRocks. Valid values: `csv` and `json`. - -### sink.properties.column_separator - -**Required**: No
-**Default value**: \t
-**Description**: The column separator for CSV-formatted data. - -### sink.properties.row_delimiter - -**Required**: No
-**Default value**: \n
-**Description**: The row delimiter for CSV-formatted data. - -### sink.properties.max_filter_ratio - -**Required**: No
-**Default value**: 0
-**Description**: The maximum error tolerance of the Stream Load. It's the maximum percentage of data records that can be filtered out due to inadequate data quality. Valid values: `0` to `1`. Default value: `0`. See [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) for details. - -### sink.properties.partial_update - -**Required**: NO
-**Default value**: `FALSE`
-**Description**: Whether to use partial updates. Valid values: `TRUE` and `FALSE`. Default value: `FALSE`, which means this feature is disabled. - -### sink.properties.partial_update_mode - -**Required**: NO
-**Default value**: `row`
-**Description**: Specifies the mode for partial updates. Valid values: `row` and `column`.
  • The value `row` (default) means partial updates in row mode, which is more suitable for real-time updates with many columns and small batches.
  • The value `column` means partial updates in column mode, which is more suitable for batch updates with few columns and many rows. In such scenarios, enabling the column mode offers faster update speeds. For example, in a table with 100 columns, if only 10 columns (10% of the total) are updated for all rows, the update speed of the column mode is 10 times faster.
- -### sink.properties.strict_mode - -**Required**: No
-**Default value**: false
-**Description**: Specifies whether to enable the strict mode for Stream Load. It affects the loading behavior when there are unqualified rows, such as inconsistent column values. Valid values: `true` and `false`. Default value: `false`. See [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) for details. - -### sink.properties.compression - -**Required**: No
-**Default value**: NONE
-**Description**: The compression algorithm used for Stream Load. Valid values: `lz4_frame`. Compression for the JSON format requires Flink connector 1.2.10+ and StarRocks v3.2.7+. Compression for the CSV format only requires Flink connector 1.2.11+. - -### sink.properties.prepared_timeout - -**Required**: No
-**Default value**: NONE
-**Description**: Supported since 1.2.12 and only effective when `sink.version` is set to `V2`. Requires StarRocks 3.5.4 or later. Sets the timeout in seconds for the Transaction Stream Load phase from `PREPARED` to `COMMITTED`. Typically, only needed for exactly-once; at-least-once usually does not require setting this (the connector defaults to 300s). If not set in exactly-once, StarRocks FE configuration `prepared_transaction_default_timeout_second` (default 86400s) applies. See [StarRocks Transaction timeout management](./Stream_Load_transaction_interface.md#transaction-timeout-management). - -## Data type mapping between Flink and StarRocks - -| Flink data type | StarRocks data type | -|-----------------------------------|-----------------------| -| BOOLEAN | BOOLEAN | -| TINYINT | TINYINT | -| SMALLINT | SMALLINT | -| INTEGER | INTEGER | -| BIGINT | BIGINT | -| FLOAT | FLOAT | -| DOUBLE | DOUBLE | -| DECIMAL | DECIMAL | -| BINARY | INT | -| CHAR | STRING | -| VARCHAR | STRING | -| STRING | STRING | -| DATE | DATE | -| TIMESTAMP_WITHOUT_TIME_ZONE(N) | DATETIME | -| TIMESTAMP_WITH_LOCAL_TIME_ZONE(N) | DATETIME | -| ARRAY<T> | ARRAY<T> | -| MAP<KT,VT> | JSON STRING | -| ROW<arg T...> | JSON STRING | - -## Usage notes - -### Exactly Once - -- If you want sink to guarantee exactly-once semantics, we recommend you to upgrade StarRocks to 2.5 or later, and Flink connector to 1.2.4 or later - - Since Flink connector 1.2.4, the exactly-once is redesigned based on [Stream Load transaction interface](./Stream_Load_transaction_interface.md) - provided by StarRocks since 2.4. Compared to the previous implementation based on non-transactional Stream Load non-transactional interface, - the new implementation reduces memory usage and checkpoint overhead, thereby enhancing real-time performance and - stability of loading. - - - If the version of StarRocks is earlier than 2.4 or the version of Flink connector is earlier than 1.2.4, the sink - will automatically choose the implementation based on Stream Load non-transactional interface. - -- Configurations to guarantee exactly-once - - - The value of `sink.semantic` needs to be `exactly-once`. - - - If the version of Flink connector is 1.2.8 and later, it is recommended to specify the value of `sink.label-prefix`. Note that the label prefix must be unique among all types of loading in StarRocks, such as Flink jobs, Routine Load, and Broker Load. - - - If the label prefix is specified, the Flink connector will use the label prefix to clean up lingering transactions that may be generated in some Flink - failure scenarios, such as the Flink job fails when a checkpoint is still in progress. These lingering transactions - are generally in `PREPARED` status if you use `SHOW PROC '/transactions//running';` to view them in StarRocks. When the Flink job restores from checkpoint, - the Flink connector will find these lingering transactions according to the label prefix and some information in - checkpoint, and abort them. The Flink connector can not abort them when the Flink job exits because of the two-phase-commit - mechanism to implement the exactly-once. When the Flink job exits, the Flink connector has not received the notification from - Flink checkpoint coordinator whether the transactions should be included in a successful checkpoint, and it may - lead to data loss if these transactions are aborted anyway. 
You can have an overview about how to achieve end-to-end exactly-once - in Flink in this [blogpost](https://flink.apache.org/2018/02/28/an-overview-of-end-to-end-exactly-once-processing-in-apache-flink-with-apache-kafka-too/). - - - If the label prefix is not specified, lingering transactions will be cleaned up by StarRocks only after they time out. However the number of running transactions can reach the limitation of StarRocks `max_running_txn_num_per_db` if - Flink jobs fail frequently before transactions time out. You can set a smaller timeout for `PREPARED` transactions - to make them expired faster when the label prefix is not specified. See the following about how to set the prepared timeout. - -- If you are certain that the Flink job will eventually recover from checkpoint or savepoint after a long downtime because of stop or continuous failover, - please adjust the following StarRocks configurations accordingly, to avoid data loss. - - - Adjust `PREPARED` transaction timeout. See the following about how to set the timeout. - - The timeout needs to be larger than the downtime of the Flink job. Otherwise, the lingering transactions that are included in a successful checkpoint may be aborted because of timeout before you restart the Flink job, which leads to data loss. - - Note that when you set a larger value to this configuration, it is better to specify the value of `sink.label-prefix` so that the lingering transactions can be cleaned according to the label prefix and some information in - checkpoint, instead of due to timeout (which may cause data loss). - - - `label_keep_max_second` and `label_keep_max_num`: StarRocks FE configurations, default values are `259200` and `1000` - respectively. For details, see [FE configurations](./loading_introduction/loading_considerations.md#fe-configurations). The value of `label_keep_max_second` needs to be larger than the downtime of the Flink job. Otherwise, the Flink connector can not check the state of transactions in StarRocks by using the transaction labels saved in the Flink's savepoint or checkpoint and figure out whether these transactions are committed or not, which may eventually lead to data loss. - -- How to set the timeout for PREPARED transactions - - - For Connector 1.2.12+ and StarRocks 3.5.4+, you can set the timeout by configuring the connector parameter `sink.properties.prepared_timeout`. By default, the value is not set, and it falls back to the StarRocks FE's global configuration `prepared_transaction_default_timeout_second` (default value is `86400`). - - - For other versions of Connector or StarRocks, you can set the timeout by configuring the StarRocks FE's global configuration `prepared_transaction_default_timeout_second` (default value is `86400`). - -### Flush Policy - -The Flink connector will buffer the data in memory, and flush them in batch to StarRocks via Stream Load. How the flush -is triggered is different between at-least-once and exactly-once. - -For at-least-once, the flush will be triggered when any of the following conditions are met: - -- the bytes of buffered rows reaches the limit `sink.buffer-flush.max-bytes` -- the number of buffered rows reaches the limit `sink.buffer-flush.max-rows`. (Only valid for sink version V1) -- the elapsed time since the last flush reaches the limit `sink.buffer-flush.interval-ms` -- a checkpoint is triggered - -For exactly-once, the flush only happens when a checkpoint is triggered. 
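
As a rough sketch of the exactly-once configuration described above, assuming a Flink SQL job (the checkpoint interval, label prefix, addresses, and credentials are placeholder values):

```SQL
-- With exactly-once, buffered data is flushed only when a checkpoint is triggered,
-- so periodic checkpoints must be enabled.
SET 'execution.checkpointing.interval' = '10s';

CREATE TABLE `score_board_sink` (
    `id` INT,
    `name` STRING,
    `score` INT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://<fe_host>:9030',
    'load-url' = '<fe_host>:8030',
    'database-name' = 'test',
    'table-name' = 'score_board',
    'username' = 'root',
    'password' = '',
    'sink.semantic' = 'exactly-once',
    -- The label prefix must be unique among all load jobs in the StarRocks cluster.
    'sink.label-prefix' = 'flink_score_board'
);
```
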
- -### Monitoring load metrics - -The Flink connector provides the following metrics to monitor loading. - -| Metric | Type | Description | -|--------------------------|---------|-----------------------------------------------------------------| -| totalFlushBytes | counter | successfully flushed bytes. | -| totalFlushRows | counter | number of rows successfully flushed. | -| totalFlushSucceededTimes | counter | number of times that the data is successfully flushed. | -| totalFlushFailedTimes | counter | number of times that the data fails to be flushed. | -| totalFilteredRows | counter | number of rows filtered, which is also included in totalFlushRows. | - -## Examples - -The following examples show how to use the Flink connector to load data into a StarRocks table with Flink SQL or Flink DataStream. - -### Preparations - -#### Create a StarRocks table - -Create a database `test` and create a Primary Key table `score_board`. - -```sql -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### Set up Flink environment - -- Download Flink binary [Flink 1.15.2](https://archive.apache.org/dist/flink/flink-1.15.2/flink-1.15.2-bin-scala_2.12.tgz), and unzip it to directory `flink-1.15.2`. -- Download [Flink connector 1.2.7](https://repo1.maven.org/maven2/com/starrocks/flink-connector-starrocks/1.2.7_flink-1.15/flink-connector-starrocks-1.2.7_flink-1.15.jar), and put it into the directory `flink-1.15.2/lib`. -- Run the following commands to start a Flink cluster: - - ```shell - cd flink-1.15.2 - ./bin/start-cluster.sh - ``` - -#### Network configuration - -Ensure that the machine where Flink is located can access the FE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`query_port`](../administration/management/FE_configuration.md#query_port) (default: `9030`), and the BE nodes via the [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`). - -### Run with Flink SQL - -- Run the following command to start a Flink SQL client. - - ```shell - ./bin/sql-client.sh - ``` - -- Create a Flink table `score_board`, and insert values into the table via Flink SQL Client. -Note you must define the primary key in the Flink DDL if you want to load data into a Primary Key table of StarRocks. It's optional for other types of StarRocks tables. - - ```SQL - CREATE TABLE `score_board` ( - `id` INT, - `name` STRING, - `score` INT, - PRIMARY KEY (id) NOT ENFORCED - ) WITH ( - 'connector' = 'starrocks', - 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', - 'load-url' = '127.0.0.1:8030', - 'database-name' = 'test', - - 'table-name' = 'score_board', - 'username' = 'root', - 'password' = '' - ); - - INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100); - ``` - -### Run with Flink DataStream - -There are several ways to implement a Flink DataStream job according to the type of the input records, such as a CSV Java `String`, a JSON Java `String` or a custom Java object. - -- The input records are CSV-format `String`. See [LoadCsvRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream) for a complete example. 
- - ```java - /** - * Generate CSV-format records. Each record has three values separated by "\t". - * These values will be loaded to the columns `id`, `name`, and `score` in the StarRocks table. - */ - String[] records = new String[]{ - "1\tstarrocks-csv\t100", - "2\tflink-csv\t100" - }; - DataStream source = env.fromElements(records); - - /** - * Configure the connector with the required properties. - * You also need to add properties "sink.properties.format" and "sink.properties.column_separator" - * to tell the connector the input records are CSV-format, and the column separator is "\t". - * You can also use other column separators in the CSV-format records, - * but remember to modify the "sink.properties.column_separator" correspondingly. - */ - StarRocksSinkOptions options = StarRocksSinkOptions.builder() - .withProperty("jdbc-url", jdbcUrl) - .withProperty("load-url", loadUrl) - .withProperty("database-name", "test") - .withProperty("table-name", "score_board") - .withProperty("username", "root") - .withProperty("password", "") - .withProperty("sink.properties.format", "csv") - .withProperty("sink.properties.column_separator", "\t") - .build(); - // Create the sink with the options. - SinkFunction starRockSink = StarRocksSink.sink(options); - source.addSink(starRockSink); - ``` - -- The input records are JSON-format `String`. See [LoadJsonRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream) for a complete example. - - ```java - /** - * Generate JSON-format records. - * Each record has three key-value pairs corresponding to the columns `id`, `name`, and `score` in the StarRocks table. - */ - String[] records = new String[]{ - "{\"id\":1, \"name\":\"starrocks-json\", \"score\":100}", - "{\"id\":2, \"name\":\"flink-json\", \"score\":100}", - }; - DataStream source = env.fromElements(records); - - /** - * Configure the connector with the required properties. - * You also need to add properties "sink.properties.format" and "sink.properties.strip_outer_array" - * to tell the connector the input records are JSON-format and to strip the outermost array structure. - */ - StarRocksSinkOptions options = StarRocksSinkOptions.builder() - .withProperty("jdbc-url", jdbcUrl) - .withProperty("load-url", loadUrl) - .withProperty("database-name", "test") - .withProperty("table-name", "score_board") - .withProperty("username", "root") - .withProperty("password", "") - .withProperty("sink.properties.format", "json") - .withProperty("sink.properties.strip_outer_array", "true") - .build(); - // Create the sink with the options. - SinkFunction starRockSink = StarRocksSink.sink(options); - source.addSink(starRockSink); - ``` - -- The input records are custom Java objects. See [LoadCustomJavaRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream) for a complete example. - - - In this example, the input record is a simple POJO `RowData`. - - ```java - public static class RowData { - public int id; - public String name; - public int score; - - public RowData() {} - - public RowData(int id, String name, int score) { - this.id = id; - this.name = name; - this.score = score; - } - } - ``` - - - The main program is as follows: - - ```java - // Generate records which use RowData as the container. 
- RowData[] records = new RowData[]{ - new RowData(1, "starrocks-rowdata", 100), - new RowData(2, "flink-rowdata", 100), - }; - DataStream source = env.fromElements(records); - - // Configure the connector with the required properties. - StarRocksSinkOptions options = StarRocksSinkOptions.builder() - .withProperty("jdbc-url", jdbcUrl) - .withProperty("load-url", loadUrl) - .withProperty("database-name", "test") - .withProperty("table-name", "score_board") - .withProperty("username", "root") - .withProperty("password", "") - .build(); - - /** - * The Flink connector will use a Java object array (Object[]) to represent a row to be loaded into the StarRocks table, - * and each element is the value for a column. - * You need to define the schema of the Object[] which matches that of the StarRocks table. - */ - TableSchema schema = TableSchema.builder() - .field("id", DataTypes.INT().notNull()) - .field("name", DataTypes.STRING()) - .field("score", DataTypes.INT()) - // When the StarRocks table is a Primary Key table, you must specify notNull(), for example, DataTypes.INT().notNull(), for the primary key `id`. - .primaryKey("id") - .build(); - // Transform the RowData to the Object[] according to the schema. - RowDataTransformer transformer = new RowDataTransformer(); - // Create the sink with the schema, options, and transformer. - SinkFunction starRockSink = StarRocksSink.sink(schema, options, transformer); - source.addSink(starRockSink); - ``` - - - The `RowDataTransformer` in the main program is defined as follows: - - ```java - private static class RowDataTransformer implements StarRocksSinkRowBuilder { - - /** - * Set each element of the object array according to the input RowData. - * The schema of the array matches that of the StarRocks table. - */ - @Override - public void accept(Object[] internalRow, RowData rowData) { - internalRow[0] = rowData.id; - internalRow[1] = rowData.name; - internalRow[2] = rowData.score; - // When the StarRocks table is a Primary Key table, you need to set the last element to indicate whether the data loading is an UPSERT or DELETE operation. - internalRow[internalRow.length - 1] = StarRocksSinkOP.UPSERT.ordinal(); - } - } - ``` - -### Synchronize data with Flink CDC 3.0 (with schema change supported) - -[Flink CDC 3.0](https://nightlies.apache.org/flink/flink-cdc-docs-stable) framework can be used -to easily build a streaming ELT pipeline from CDC sources (such as MySQL and Kafka) to StarRocks. The pipeline can synchronize whole database, merged sharding tables, and schema changes from sources to StarRocks. - -Since v1.2.9, the Flink connector for StarRocks is integrated into this framework as [StarRocks Pipeline Connector](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/pipeline-connectors/starrocks/). The StarRocks Pipeline Connector supports: - -- Automatic creation of databases and tables -- Schema change synchronization -- Full and incremental data synchronization - -For quick start, see [Streaming ELT from MySQL to StarRocks using Flink CDC 3.0 with StarRocks Pipeline Connector](https://nightlies.apache.org/flink/flink-cdc-docs-stable/docs/get-started/quickstart/mysql-to-starrocks). - -It is advised to use StarRocks v3.2.1 and later versions to enable [fast_schema_evolution](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE.md#set-fast-schema-evolution). It will improve the speed of adding or dropping columns and reduce resource usage. 
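
As a sketch, for StarRocks tables that you create manually, fast schema evolution is enabled through a table property at creation time (assuming StarRocks v3.2.1 or later; the table below is a placeholder):

```SQL
CREATE TABLE `test`.`demo_cdc_target`
(
    `id` BIGINT NOT NULL COMMENT "primary key",
    `name` VARCHAR(64) COMMENT "example column"
)
PRIMARY KEY (`id`)
DISTRIBUTED BY HASH(`id`)
PROPERTIES ("fast_schema_evolution" = "true");
```
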
- -## Best practices - -### Load data to a Primary Key table - -This section will show how to load data to a StarRocks Primary Key table to achieve partial updates and conditional updates. -You can see [Change data through loading](./Load_to_Primary_Key_tables.md) for the introduction of those features. -These examples use Flink SQL. - -#### Preparations - -Create a database `test` and create a Primary Key table `score_board` in StarRocks. - -```SQL -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### Partial update - -This example will show how to load data only to columns `id` and `name`. - -1. Insert two data rows into the StarRocks table `score_board` in MySQL client. - - ```SQL - mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100); - - mysql> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | flink | 100 | - +------+-----------+-------+ - 2 rows in set (0.02 sec) - ``` - -2. Create a Flink table `score_board` in Flink SQL client. - - - Define the DDL which only includes the columns `id` and `name`. - - Set the option `sink.properties.partial_update` to `true` which tells the Flink connector to perform partial updates. - - If the Flink connector version `<=` 1.2.7, you also need to set the option `sink.properties.columns` to `id,name,__op` to tells the Flink connector which columns need to be updated. Note that you need to append the field `__op` at the end. The field `__op` indicates that the data loading is an UPSERT or DELETE operation, and its values are set by the connector automatically. - - ```SQL - CREATE TABLE `score_board` ( - `id` INT, - `name` STRING, - PRIMARY KEY (id) NOT ENFORCED - ) WITH ( - 'connector' = 'starrocks', - 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', - 'load-url' = '127.0.0.1:8030', - 'database-name' = 'test', - 'table-name' = 'score_board', - 'username' = 'root', - 'password' = '', - 'sink.properties.partial_update' = 'true', - -- only for Flink connector version <= 1.2.7 - 'sink.properties.columns' = 'id,name,__op' - ); - ``` - -3. Insert two data rows into the Flink table. The primary keys of the data rows are as same as these of rows in the StarRocks table. but the values in the column `name` are modified. - - ```SQL - INSERT INTO `score_board` VALUES (1, 'starrocks-update'), (2, 'flink-update'); - ``` - -4. Query the StarRocks table in MySQL client. - - ```SQL - mysql> select * from score_board; - +------+------------------+-------+ - | id | name | score | - +------+------------------+-------+ - | 1 | starrocks-update | 100 | - | 2 | flink-update | 100 | - +------+------------------+-------+ - 2 rows in set (0.02 sec) - ``` - - You can see that only values for `name` change, and the values for `score` do not change. - -#### Conditional update - -This example will show how to do conditional update according to the value of column `score`. The update for an `id` -takes effect only when the new value for `score` is has a greater or equal to the old value. - -1. Insert two data rows into the StarRocks table in MySQL client. 
- - ```SQL - mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100); - - mysql> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | flink | 100 | - +------+-----------+-------+ - 2 rows in set (0.02 sec) - ``` - -2. Create a Flink table `score_board` in the following ways: - - - Define the DDL including all of columns. - - Set the option `sink.properties.merge_condition` to `score` to tell the connector to use the column `score` - as the condition. - - Set the option `sink.version` to `V1` which tells the connector to use Stream Load. - - ```SQL - CREATE TABLE `score_board` ( - `id` INT, - `name` STRING, - `score` INT, - PRIMARY KEY (id) NOT ENFORCED - ) WITH ( - 'connector' = 'starrocks', - 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', - 'load-url' = '127.0.0.1:8030', - 'database-name' = 'test', - 'table-name' = 'score_board', - 'username' = 'root', - 'password' = '', - 'sink.properties.merge_condition' = 'score', - 'sink.version' = 'V1' - ); - ``` - -3. Insert two data rows into the Flink table. The primary keys of the data rows are as same as these of rows in the StarRocks table. The first data row has a smaller value in the column `score`, and the second data row has a larger value in the column `score`. - - ```SQL - INSERT INTO `score_board` VALUES (1, 'starrocks-update', 99), (2, 'flink-update', 101); - ``` - -4. Query the StarRocks table in MySQL client. - - ```SQL - mysql> select * from score_board; - +------+--------------+-------+ - | id | name | score | - +------+--------------+-------+ - | 1 | starrocks | 100 | - | 2 | flink-update | 101 | - +------+--------------+-------+ - 2 rows in set (0.03 sec) - ``` - - You can see that only the values of the second data row change, and the values of the first data row do not change. - -### Load data into columns of BITMAP type - -[`BITMAP`](../sql-reference/data-types/other-data-types/BITMAP.md) is often used to accelerate count distinct, such as counting UV, see [Use Bitmap for exact Count Distinct](../using_starrocks/distinct_values/Using_bitmap.md). -Here we take the counting of UV as an example to show how to load data into columns of the `BITMAP` type. - -1. Create a StarRocks Aggregate table in MySQL client. - - In the database `test`, create an Aggregate table `page_uv` where the column `visit_users` is defined as the `BITMAP` type and configured with the aggregate function `BITMAP_UNION`. - - ```SQL - CREATE TABLE `test`.`page_uv` ( - `page_id` INT NOT NULL COMMENT 'page ID', - `visit_date` datetime NOT NULL COMMENT 'access time', - `visit_users` BITMAP BITMAP_UNION NOT NULL COMMENT 'user ID' - ) ENGINE=OLAP - AGGREGATE KEY(`page_id`, `visit_date`) - DISTRIBUTED BY HASH(`page_id`); - ``` - -2. Create a Flink table in Flink SQL client. - - The column `visit_user_id` in the Flink table is of `BIGINT` type, and we want to load this column to the column `visit_users` of `BITMAP` type in the StarRocks table. So when defining the DDL of the Flink table, note that: - - Because Flink does not support `BITMAP`, you need to define a column `visit_user_id` as `BIGINT` type to represent the column `visit_users` of `BITMAP` type in the StarRocks table. - - You need to set the option `sink.properties.columns` to `page_id,visit_date,user_id,visit_users=to_bitmap(visit_user_id)`, which tells the connector the column mapping between the Flink table and StarRocks table. 
Also you need to use [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md) - function to tell the connector to convert the data of `BIGINT` type into `BITMAP` type. - - ```SQL - CREATE TABLE `page_uv` ( - `page_id` INT, - `visit_date` TIMESTAMP, - `visit_user_id` BIGINT - ) WITH ( - 'connector' = 'starrocks', - 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', - 'load-url' = '127.0.0.1:8030', - 'database-name' = 'test', - 'table-name' = 'page_uv', - 'username' = 'root', - 'password' = '', - 'sink.properties.columns' = 'page_id,visit_date,visit_user_id,visit_users=to_bitmap(visit_user_id)' - ); - ``` - -3. Load data into Flink table in Flink SQL client. - - ```SQL - INSERT INTO `page_uv` VALUES - (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 13), - (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23), - (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 33), - (1, CAST('2020-06-23 02:30:30' AS TIMESTAMP), 13), - (2, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23); - ``` - -4. Calculate page UVs from the StarRocks table in MySQL client. - - ```SQL - MySQL [test]> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `page_uv` GROUP BY `page_id`; - +---------+-----------------------------+ - | page_id | count(DISTINCT visit_users) | - +---------+-----------------------------+ - | 2 | 1 | - | 1 | 3 | - +---------+-----------------------------+ - 2 rows in set (0.05 sec) - ``` - -### Load data into columns of HLL type - -[`HLL`](../sql-reference/data-types/other-data-types/HLL.md) can be used for approximate count distinct, see [Use HLL for approximate count distinct](../using_starrocks/distinct_values/Using_HLL.md). - -Here we take the counting of UV as an example to show how to load data into columns of the `HLL` type. - -1. Create a StarRocks Aggregate table - - In the database `test`, create an Aggregate table `hll_uv` where the column `visit_users` is defined as the `HLL` type and configured with the aggregate function `HLL_UNION`. - - ```SQL - CREATE TABLE `hll_uv` ( - `page_id` INT NOT NULL COMMENT 'page ID', - `visit_date` datetime NOT NULL COMMENT 'access time', - `visit_users` HLL HLL_UNION NOT NULL COMMENT 'user ID' - ) ENGINE=OLAP - AGGREGATE KEY(`page_id`, `visit_date`) - DISTRIBUTED BY HASH(`page_id`); - ``` - -2. Create a Flink table in Flink SQL client. - - The column `visit_user_id` in the Flink table is of `BIGINT` type, and we want to load this column to the column `visit_users` of `HLL` type in the StarRocks table. So when defining the DDL of the Flink table, note that: - - Because Flink does not support `BITMAP`, you need to define a column `visit_user_id` as `BIGINT` type to represent the column `visit_users` of `HLL` type in the StarRocks table. - - You need to set the option `sink.properties.columns` to `page_id,visit_date,user_id,visit_users=hll_hash(visit_user_id)` which tells the connector the column mapping between Flink table and StarRocks table. Also you need to use [`hll_hash`](../sql-reference/sql-functions/scalar-functions/hll_hash.md) function to tell the connector to convert the data of `BIGINT` type into `HLL` type. - - ```SQL - CREATE TABLE `hll_uv` ( - `page_id` INT, - `visit_date` TIMESTAMP, - `visit_user_id` BIGINT - ) WITH ( - 'connector' = 'starrocks', - 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', - 'load-url' = '127.0.0.1:8030', - 'database-name' = 'test', - 'table-name' = 'hll_uv', - 'username' = 'root', - 'password' = '', - 'sink.properties.columns' = 'page_id,visit_date,visit_user_id,visit_users=hll_hash(visit_user_id)' - ); - ``` - -3. 
Load data into Flink table in Flink SQL client. - - ```SQL - INSERT INTO `hll_uv` VALUES - (3, CAST('2023-07-24 12:00:00' AS TIMESTAMP), 78), - (4, CAST('2023-07-24 13:20:10' AS TIMESTAMP), 2), - (3, CAST('2023-07-24 12:30:00' AS TIMESTAMP), 674); - ``` - -4. Calculate page UVs from the StarRocks table in MySQL client. - - ```SQL - mysql> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `hll_uv` GROUP BY `page_id`; - **+---------+-----------------------------+ - | page_id | count(DISTINCT visit_users) | - +---------+-----------------------------+ - | 3 | 2 | - | 4 | 1 | - +---------+-----------------------------+ - 2 rows in set (0.04 sec) - ``` diff --git a/docs/en/loading/Flink_cdc_load.md b/docs/en/loading/Flink_cdc_load.md deleted file mode 100644 index dbd0a24..0000000 --- a/docs/en/loading/Flink_cdc_load.md +++ /dev/null @@ -1,532 +0,0 @@ ---- -displayed_sidebar: docs -keywords: - - MySql - - mysql - - sync - - Flink CDC ---- - -# Realtime synchronization from MySQL - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks supports multiple methods to synchronize data from MySQL to StarRocks in real time, delivering low latency real-time analytics of massive data. - -This topic describes how to synchronize data from MySQL to StarRocks in real-time (within seconds) through Apache Flink®. - - - -## How it works - -:::tip - -Flink CDC is used in the synchronization from MySQL to Flink. This topic uses Flink CDC whose version is less than 3.0, so SMT is used to synchronize table schemas. However, if Flink CDC 3.0 is used, it is not necessary to use SMT to synchronize table schemas to StarRocks. Flink CDC 3.0 can even synchronize the schemas of the entire MySQL database, the sharded databases and tables, and also supports schema changes synchronization. For detailed usage, see [Streaming ELT from MySQL to StarRocks](https://nightlies.apache.org/flink/flink-cdc-docs-stable/docs/get-started/quickstart/mysql-to-starrocks). - -::: - -The following figure illustrates the entire synchronization process. - -![img](../_assets/4.9.2.png) - -Real-time synchronization from MySQL through Flink to StarRocks is implemented in two stages: synchronizing database & table schema and synchronizing data. First, the SMT converts MySQL database & table schema into table creation statements for StarRocks. Then, the Flink cluster runs Flink jobs to synchronize full and incremental MySQL data to StarRocks. - -:::info - -The synchronization process guarantees exactly-once semantics. - -::: - -**Synchronization process**: - -1. Synchronize database & table schema. - - The SMT reads the schema of the MySQL database & table to be synchronized and generates SQL files for creating a destination database & table in StarRocks. This operation is based on the MySQL and StarRocks information in SMT's configuration file. - -2. Synchronize data. - - a. The Flink SQL client executes the data loading statement `INSERT INTO SELECT` to submit one or more Flink jobs to the Flink cluster. - - b. The Flink cluster runs the Flink jobs to obtain data. The Flink CDC connector first reads full historical data from the source database, then seamlessly switches to incremental reading, and sends the data to flink-connector-starrocks. - - c. flink-connector-starrocks accumulates data in mini-batches, and synchronizes each batch of data to StarRocks. - - :::info - - Only data manipulation language (DML) operations in MySQL can be synchronized to StarRocks. 
Data definition language (DDL) operations cannot be synchronized. - - ::: - -## Scenarios - -Real-time synchronization from MySQL has a broad range of use cases where data is constantly changed. Take a real-world use case "real-time ranking of commodity sales" as an example. - -Flink calculates the real-time ranking of commodity sales based on the original order table in MySQL and synchronizes the ranking to StarRocks' Primary Key table in real time. Users can connect a visualization tool to StarRocks to view the ranking in real time to gain on-demand operational insights. - -## Preparations - -### Download and install synchronization tools - -To synchronize data from MySQL, you need to install the following tools: SMT, Flink, Flink CDC connector, and flink-connector-starrocks. - -1. Download and install Flink, and start the Flink cluster. You can also perform this step by following the instructions in [Flink official documentation](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/try-flink/local_installation/). - - a. Install Java 8 or Java 11 in your operating system before you run Flink. You can run the following command to check the installed Java version. - - ```Bash - # View the Java version. - java -version - - # Java 8 is installed if the following output is returned. - java version "1.8.0_301" - Java(TM) SE Runtime Environment (build 1.8.0_301-b09) - Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode) - ``` - - b. Download the [Flink installation package](https://flink.apache.org/downloads.html) and decompress it. We recommend that you use Flink 1.14 or later. The minimum allowed version is Flink 1.11. This topic uses Flink 1.14.5. - - ```Bash - # Download Flink. - wget https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.11.tgz - # Decompress Flink. - tar -xzf flink-1.14.5-bin-scala_2.11.tgz - # Go to the Flink directory. - cd flink-1.14.5 - ``` - - c. Start the Flink cluster. - - ```Bash - # Start the Flink cluster. - ./bin/start-cluster.sh - - # The Flink cluster is started if the following output is returned. - Starting cluster. - Starting standalonesession daemon on host. - Starting taskexecutor daemon on host. - ``` - -2. Download [Flink CDC connector](https://github.com/ververica/flink-cdc-connectors/releases). This topic uses MySQL as the data source and therefore, `flink-sql-connector-mysql-cdc-x.x.x.jar` is downloaded. The connector version must match the [Flink](https://github.com/ververica/flink-cdc-connectors/releases) version. This topic uses Flink 1.14.5 and you can download `flink-sql-connector-mysql-cdc-2.2.0.jar`. - - ```Bash - wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.1.1/flink-sql-connector-mysql-cdc-2.2.0.jar - ``` - -3. Download [flink-connector-starrocks](https://search.maven.org/artifact/com.starrocks/flink-connector-starrocks). The version must match the Flink version. - - > The flink-connector-starrocks package `x.x.x_flink-y.yy _ z.zz.jar` contains three version numbers: - > - > - `x.x.x` is the version number of flink-connector-starrocks. - > - `y.yy` is the supported Flink version. - > - `z.zz` is the Scala version supported by Flink. If the Flink version is 1.14.x or earlier, you must download a package that has the Scala version. - > - > This topic uses Flink 1.14.5 and Scala 2.11. Therefore, you can download the following package: `1.2.3_flink-14_2.11.jar`. - -4. 
Move the JAR packages of Flink CDC connector (`flink-sql-connector-mysql-cdc-2.2.0.jar`) and flink-connector-starrocks (`1.2.3_flink-1.14_2.11.jar`) to the `lib` directory of Flink. - - > **Note** - > - > If a Flink cluster is already running in your system, you must stop the Flink cluster and restart it to load and validate the JAR packages. - > - > ```Bash - > $ ./bin/stop-cluster.sh - > $ ./bin/start-cluster.sh - > ``` - -5. Download and decompress the [SMT package](https://www.starrocks.io/download/community) and place it in the `flink-1.14.5` directory. StarRocks provides SMT packages for Linux x86 and macos ARM64. You can choose one based on your operating system and CPU. - - ```Bash - # for Linux x86 - wget https://releases.starrocks.io/resources/smt.tar.gz - # for macOS ARM64 - wget https://releases.starrocks.io/resources/smt_darwin_arm64.tar.gz - ``` - -### Enable MySQL binary log - -To synchronize data from MySQL in real time, the system needs to read data from MySQL binary log (binlog), parse the data, and then synchronize the data to StarRocks. Make sure that MySQL binary log is enabled. - -1. Edit the MySQL configuration file `my.cnf` (default path: `/etc/my.cnf`) to enable MySQL binary log. - - ```Bash - # Enable MySQL Binlog. - log_bin = ON - # Configure the save path for the Binlog. - log_bin =/var/lib/mysql/mysql-bin - # Configure server_id. - # If server_id is not configured for MySQL 5.7.3 or later, the MySQL service cannot be used. - server_id = 1 - # Set the Binlog format to ROW. - binlog_format = ROW - # The base name of the Binlog file. An identifier is appended to identify each Binlog file. - log_bin_basename =/var/lib/mysql/mysql-bin - # The index file of Binlog files, which manages the directory of all Binlog files. - log_bin_index =/var/lib/mysql/mysql-bin.index - ``` - -2. Run one of the following commands to restart MySQL for the modified configuration file to take effect. - - ```Bash - # Use service to restart MySQL. - service mysqld restart - # Use mysqld script to restart MySQL. - /etc/init.d/mysqld restart - ``` - -3. Connect to MySQL and check whether MySQL binary log is enabled. - - ```Plain - -- Connect to MySQL. - mysql -h xxx.xx.xxx.xx -P 3306 -u root -pxxxxxx - - -- Check whether MySQL binary log is enabled. - mysql> SHOW VARIABLES LIKE 'log_bin'; - +---------------+-------+ - | Variable_name | Value | - +---------------+-------+ - | log_bin | ON | - +---------------+-------+ - 1 row in set (0.00 sec) - ``` - -## Synchronize database & table schema - -1. Edit the SMT configuration file. - Go to the SMT `conf` directory and edit the configuration file `config_prod.conf`, such as MySQL connection information, the matching rules of the database & table to be synchronized, and configuration information of flink-connector-starrocks. - - ```Bash - [db] - type = mysql - host = xxx.xx.xxx.xx - port = 3306 - user = user1 - password = xxxxxx - - [other] - # Number of BEs in StarRocks - be_num = 3 - # `decimal_v3` is supported since StarRocks-1.18.1. - use_decimal_v3 = true - # File to save the converted DDL SQL - output_dir = ./result - - [table-rule.1] - # Pattern to match databases for setting properties - database = ^demo.*$ - # Pattern to match tables for setting properties - table = ^.*$ - - ############################################ - ### Flink sink configurations - ### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated. 
- ############################################ - flink.starrocks.jdbc-url=jdbc:mysql://: - flink.starrocks.load-url= : - flink.starrocks.username=user2 - flink.starrocks.password=xxxxxx - flink.starrocks.sink.properties.format=csv - flink.starrocks.sink.properties.column_separator=\x01 - flink.starrocks.sink.properties.row_delimiter=\x02 - flink.starrocks.sink.buffer-flush.interval-ms=15000 - ``` - - - `[db]`: information used to access the source database. - - `type`: type of the source database. In this topic, the source database is `mysql`. - - `host`: IP address of the MySQL server. - - `port`: port number of the MySQL database, defaults to `3306` - - `user`: username for accessing the MySQL database - - `password`: password of the username - - - `[table-rule]`: database & table matching rules and the corresponding flink-connector-starrocks configuration. - - - `Database`, `table`: the names of the database & table in MySQL. Regular expressions are supported. - - `flink.starrocks.*`: configuration information of flink-connector-starrocks. For more configurations and information, see [flink-connector-starrocks](../loading/Flink-connector-starrocks.md). - - > If you need to use different flink-connector-starrocks configurations for different tables. For example, if some tables are frequently updated and you need to accelerate data loading, see [Use different flink-connector-starrocks configurations for different tables](#use-different-flink-connector-starrocks-configurations-for-different-tables). If you need to load multiple tables obtained from MySQL sharding into the same StarRocks table, see [Synchronize multiple tables after MySQL sharding to one table in StarRocks](#synchronize-multiple-tables-after-mysql-sharding-to-one-table-in-starrocks). - - - `[other]`: other information - - `be_num`: The number of BEs in your StarRocks cluster (This parameter will be used for setting a reasonable number of tablets in subsequent StarRocks table creation). - - `use_decimal_v3`: Whether to enable [Decimal V3](../sql-reference/data-types/numeric/DECIMAL.md). After Decimal V3 is enabled, MySQL decimal data will be converted into Decimal V3 data when data is synchronized to StarRocks. - - `output_dir`: The path to save the SQL files to be generated. The SQL files will be used to create a database & table in StarRocks and submit a Flink job to the Flink cluster. The default path is `./result` and we recommend that you retain the default settings. - -2. Run the SMT to read the database & table schema in MySQL and generate SQL files in the `./result` directory based on the configuration file. The `starrocks-create.all.sql` file is used to create a database & table in StarRocks and the `flink-create.all.sql` file is used to submit a Flink job to the Flink cluster. - - ```Bash - # Run the SMT. - ./starrocks-migrate-tool - - # Go to the result directory and check the files in this directory. - cd result - ls result - flink-create.1.sql smt.tar.gz starrocks-create.all.sql - flink-create.all.sql starrocks-create.1.sql - ``` - -3. Run the following command to connect to StarRocks and execute the `starrocks-create.all.sql` file to create a database and table in StarRocks. We recommend that you use the default table creation statement in the SQL file to create a table of the [Primary Key table](../table_design/table_types/primary_key_table.md). - - > **Note** - > - > You can also modify the table creation statement based on your business needs and create a table that does not use the Primary Key table. 
However, the DELETE operation in the source MySQL database cannot be synchronized to the non- Primary Key table. Exercise caution when you create such a table. - - ```Bash - mysql -h -P -u user2 -pxxxxxx < starrocks-create.all.sql - ``` - - If the data needs to be processed by Flink before it is written to the destination StarRocks table, the table schema will be different between the source and destination tables. In this case, you must modify the table creation statement. In this example, the destination table requires only the `product_id` and `product_name` columns and real-time ranking of commodity sales. You can use the following table creation statement. - - ```Bash - CREATE DATABASE IF NOT EXISTS `demo`; - - CREATE TABLE IF NOT EXISTS `demo`.`orders` ( - `product_id` INT(11) NOT NULL COMMENT "", - `product_name` STRING NOT NULL COMMENT "", - `sales_cnt` BIGINT NOT NULL COMMENT "" - ) ENGINE=olap - PRIMARY KEY(`product_id`) - DISTRIBUTED BY HASH(`product_id`) - PROPERTIES ( - "replication_num" = "3" - ); - ``` - - > **NOTICE** - > - > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -## Synchronize data - -Run the Flink cluster and submit a Flink job to continuously synchronize full and incremental data from MySQL to StarRocks. - -1. Go to the Flink directory and run the following command to run the `flink-create.all.sql` file on your Flink SQL client. - - ```Bash - ./bin/sql-client.sh -f flink-create.all.sql - ``` - - This SQL file defines dynamic tables `source table` and `sink table`, query statement `INSERT INTO SELECT`, and specifies the connector, source database, and destination database. After this file is executed, a Flink job is submitted to the Flink cluster to start data synchronization. - - > **Note** - > - > - Make sure that the Flink cluster has been started. You can start the Flink cluster by running `flink/bin/start-cluster.sh`. - > - If your Flink version is earlier than 1.13, you may not be able to directly run the SQL file `flink-create.all.sql`. You need to execute SQL statements one by one in this file in the command line interface (CLI) of the SQL client. You also need to escape the `\` character. - > - > ```Bash - > 'sink.properties.column_separator' = '\\x01' - > 'sink.properties.row_delimiter' = '\\x02' - > ``` - - **Process data during synchronization**: - - If you need to process data during synchronization, such as performing GROUP BY or JOIN on the data, you can modify the `flink-create.all.sql` file. The following example calculates real-time ranking of commodity sales by executing COUNT (*) and GROUP BY. - - ```Bash - $ ./bin/sql-client.sh -f flink-create.all.sql - No default environment is specified. - Searching for '/home/disk1/flink-1.13.6/conf/sql-client-defaults.yaml'...not found. - [INFO] Executing SQL from file. - - Flink SQL> CREATE DATABASE IF NOT EXISTS `default_catalog`.`demo`; - [INFO] Execute statement succeed. - - -- Create a dynamic table `source table` based on the order table in MySQL. 
- Flink SQL> - CREATE TABLE IF NOT EXISTS `default_catalog`.`demo`.`orders_src` (`order_id` BIGINT NOT NULL, - `product_id` INT NULL, - `order_date` TIMESTAMP NOT NULL, - `customer_name` STRING NOT NULL, - `product_name` STRING NOT NULL, - `price` DECIMAL(10, 5) NULL, - PRIMARY KEY(`order_id`) - NOT ENFORCED - ) with ('connector' = 'mysql-cdc', - 'hostname' = 'xxx.xx.xxx.xxx', - 'port' = '3306', - 'username' = 'root', - 'password' = '', - 'database-name' = 'demo', - 'table-name' = 'orders' - ); - [INFO] Execute statement succeed. - - -- Create a dynamic table `sink table`. - Flink SQL> - CREATE TABLE IF NOT EXISTS `default_catalog`.`demo`.`orders_sink` (`product_id` INT NOT NULL, - `product_name` STRING NOT NULL, - `sales_cnt` BIGINT NOT NULL, - PRIMARY KEY(`product_id`) - NOT ENFORCED - ) with ('sink.max-retries' = '10', - 'jdbc-url' = 'jdbc:mysql://:', - 'password' = '', - 'sink.properties.strip_outer_array' = 'true', - 'sink.properties.format' = 'json', - 'load-url' = ':', - 'username' = 'root', - 'sink.buffer-flush.interval-ms' = '15000', - 'connector' = 'starrocks', - 'database-name' = 'demo', - 'table-name' = 'orders' - ); - [INFO] Execute statement succeed. - - -- Implement real-time ranking of commodity sales, where `sink table` is dynamically updated to reflect data changes in `source table`. - Flink SQL> - INSERT INTO `default_catalog`.`demo`.`orders_sink` select product_id,product_name, count(*) as cnt from `default_catalog`.`demo`.`orders_src` group by product_id,product_name; - [INFO] Submitting SQL update statement to the cluster... - [INFO] SQL update statement has been successfully submitted to the cluster: - Job ID: 5ae005c4b3425d8bb13fe660260a35da - ``` - - If you only need to synchronize only a portion of the data, such as data whose payment time is later than December 21, 2021, you can use the `WHERE` clause in `INSERT INTO SELECT` to set a filter condition, such as `WHERE pay_dt > '2021-12-21'`. Data that does not meet this condition will not be synchronized to StarRocks. - - If the following result is returned, the Flink job has been submitted for full and incremental synchronization. - - ```SQL - [INFO] Submitting SQL update statement to the cluster... - [INFO] SQL update statement has been successfully submitted to the cluster: - Job ID: 5ae005c4b3425d8bb13fe660260a35da - ``` - -2. You can use the [Flink WebUI](https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/flink-operations-playground/#flink-webui) or run the `bin/flink list -running` command on your Flink SQL client to view Flink jobs that are running in the Flink cluster and the job IDs. - - - Flink WebUI - ![img](../_assets/4.9.3.png) - - - `bin/flink list -running` - - ```Bash - $ bin/flink list -running - Waiting for response... - ------------------ Running/Restarting Jobs ------------------- - 13.10.2022 15:03:54 : 040a846f8b58e82eb99c8663424294d5 : insert-into_default_catalog.lily.example_tbl1_sink (RUNNING) - -------------------------------------------------------------- - ``` - - > **Note** - > - > If the job is abnormal, you can perform troubleshooting by using Flink WebUI or by viewing the log file in the `/log` directory of Flink 1.14.5. 
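As mentioned above, you can restrict the synchronization to a subset of rows with a `WHERE` clause in the `INSERT INTO SELECT` statement. The following is a minimal sketch that reuses the `orders_src` and `orders_sink` tables defined in the preceding example; it assumes the filter column is `order_date` (substitute the column, such as `pay_dt`, that actually carries your condition):

```SQL
-- Synchronize only orders placed after December 21, 2021, aggregated into the sink table.
INSERT INTO `default_catalog`.`demo`.`orders_sink`
SELECT product_id, product_name, COUNT(*) AS sales_cnt
FROM `default_catalog`.`demo`.`orders_src`
WHERE order_date > TIMESTAMP '2021-12-21 00:00:00'
GROUP BY product_id, product_name;
```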
- -## FAQ - -### Use different flink-connector-starrocks configurations for different tables - -If some tables in the data source are frequently updated and you want to accelerate the loading speed of flink-connector-starrocks, you must set a separate flink-connector-starrocks configuration for each table in the SMT configuration file `config_prod.conf`. - -```Bash -[table-rule.1] -# Pattern to match databases for setting properties -database = ^order.*$ -# Pattern to match tables for setting properties -table = ^.*$ - -############################################ -### Flink sink configurations -### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated -############################################ -flink.starrocks.jdbc-url=jdbc:mysql://: -flink.starrocks.load-url= : -flink.starrocks.username=user2 -flink.starrocks.password=xxxxxx -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator=\x01 -flink.starrocks.sink.properties.row_delimiter=\x02 -flink.starrocks.sink.buffer-flush.interval-ms=15000 - -[table-rule.2] -# Pattern to match databases for setting properties -database = ^order2.*$ -# Pattern to match tables for setting properties -table = ^.*$ - -############################################ -### Flink sink configurations -### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated -############################################ -flink.starrocks.jdbc-url=jdbc:mysql://: -flink.starrocks.load-url= : -flink.starrocks.username=user2 -flink.starrocks.password=xxxxxx -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator=\x01 -flink.starrocks.sink.properties.row_delimiter=\x02 -flink.starrocks.sink.buffer-flush.interval-ms=10000 -``` - -### Synchronize multiple tables after MySQL sharding to one table in StarRocks - -After sharding is performed, data in one MySQL table may be split into multiple tables or even distributed to multiple databases. All the tables have the same schema. In this case, you can set `[table-rule]` to synchronize these tables to one StarRocks table. For example, MySQL has two databases `edu_db_1` and `edu_db_2`, each of which has two tables `course_1 and course_2`, and the schema of all tables is the same. You can use the following `[table-rule]` configuration to synchronize all the tables to one StarRocks table. - -> **Note** -> -> The name of the StarRocks table defaults to `course__auto_shard`. If you need to use a different name, you can modify it in the SQL files `starrocks-create.all.sql` and `flink-create.all.sql` - -```Bash -[table-rule.1] -# Pattern to match databases for setting properties -database = ^edu_db_[0-9]*$ -# Pattern to match tables for setting properties -table = ^course_[0-9]*$ - -############################################ -### Flink sink configurations -### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated -############################################ -flink.starrocks.jdbc-url = jdbc: mysql://xxx.xxx.x.x:xxxx -flink.starrocks.load-url = xxx.xxx.x.x:xxxx -flink.starrocks.username = user2 -flink.starrocks.password = xxxxxx -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator =\x01 -flink.starrocks.sink.properties.row_delimiter =\x02 -flink.starrocks.sink.buffer-flush.interval-ms = 5000 -``` - -### Import data in JSON format - -Data in the preceding example is imported in CSV format. 
If you are unable to choose a suitable delimiter, you need to replace the following parameters of `flink.starrocks.*` in `[table-rule]`. - -```Plain -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator =\x01 -flink.starrocks.sink.properties.row_delimiter =\x02 -``` - -Data is imported in JSON format after the following parameters are passed in. - -```Plain -flink.starrocks.sink.properties.format=json -flink.starrocks.sink.properties.strip_outer_array=true -``` - -> **Note** -> -> This method slightly slows down the loading speed. - -### Execute multiple INSERT INTO statements as one Flink job - -You can use the [STATEMENT SET](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sqlclient/#execute-a-set-of-sql-statements) syntax in the `flink-create.all.sql` file to execute multiple INSERT INTO statements as one Flink job, which prevents multiple statements from taking up too many Flink job resources and improves the efficiency of executing multiple queries. - -> **Note** -> -> Flink supports the STATEMENT SET syntax from 1.13 onwards. - -1. Open the `result/flink-create.all.sql` file. - -2. Modify the SQL statements in the file. Move all the INSERT INTO statements to the end of the file. Place `EXECUTE STATEMENT SET BEGIN` before the first INSERT INTO statement and place `END;` after the last INSERT INTO statement. - -> **Note** -> -> The positions of CREATE DATABASE and CREATE TABLE remain unchanged. - -```SQL -CREATE DATABASE IF NOT EXISTS db; -CREATE TABLE IF NOT EXISTS db.a1; -CREATE TABLE IF NOT EXISTS db.b1; -CREATE TABLE IF NOT EXISTS db.a2; -CREATE TABLE IF NOT EXISTS db.b2; -EXECUTE STATEMENT SET -BEGIN-- one or more INSERT INTO statements -INSERT INTO db.a1 SELECT * FROM db.b1; -INSERT INTO db.a2 SELECT * FROM db.b2; -END; -``` diff --git a/docs/en/loading/InsertInto.md b/docs/en/loading/InsertInto.md deleted file mode 100644 index 469ae0b..0000000 --- a/docs/en/loading/InsertInto.md +++ /dev/null @@ -1,711 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Load data using INSERT - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -This topic describes how to load data into StarRocks by using a SQL statement - INSERT. - -Similar to MySQL and many other database management systems, StarRocks supports loading data to an internal table with INSERT. You can insert one or more rows directly with the VALUES clause to test a function or a DEMO. You can also insert data defined by the results of a query into an internal table from an [external table](../data_source/External_table.md). From StarRocks v3.1 onwards, you can directly load data from files on cloud storage using the INSERT command and the table function [FILES()](../sql-reference/sql-functions/table-functions/files.md). - -StarRocks v2.4 further supports overwriting data into a table by using INSERT OVERWRITE. The INSERT OVERWRITE statement integrates the following operations to implement the overwriting function: - -1. Creates temporary partitions according to the partitions that store the original data. -2. Inserts data into the temporary partitions. -3. Swaps the original partitions with the temporary partitions. - -> **NOTE** -> -> If you need to verify the data before overwriting it, instead of using INSERT OVERWRITE, you can follow the above procedures to overwrite your data and verify it before swapping the partitions. - -From v3.4.0 onwards, StarRocks supports a new semantic - Dynamic Overwrite for INSERT OVERWRITE with partitioned tables. 
For more information, see [Dynamic Overwrite](#dynamic-overwrite). - -## Precautions - -- You can cancel a synchronous INSERT transaction only by pressing the **Ctrl** and **C** keys from your MySQL client. -- You can submit an asynchronous INSERT task using [SUBMIT TASK](../sql-reference/sql-statements/loading_unloading/ETL/SUBMIT_TASK.md). -- As for the current version of StarRocks, the INSERT transaction fails by default if the data of any rows does not comply with the schema of the table. For example, the INSERT transaction fails if the length of a field in any row exceeds the length limit for the mapping field in the table. You can set the session variable `enable_insert_strict` to `false` to allow the transaction to continue by filtering out the rows that mismatch the table. -- If you execute the INSERT statement frequently to load small batches of data into StarRocks, excessive data versions are generated. It severely affects query performance. We recommend that, in production, you should not load data with the INSERT command too often or use it as a routine for data loading on a daily basis. If your application or analytic scenario demand solutions to loading streaming data or small data batches separately, we recommend you use Apache Kafka® as your data source and load the data via Routine Load. -- If you execute the INSERT OVERWRITE statement, StarRocks creates temporary partitions for the partitions which store the original data, inserts new data into the temporary partitions, and [swaps the original partitions with the temporary partitions](../sql-reference/sql-statements/table_bucket_part_index/ALTER_TABLE.md#use-a-temporary-partition-to-replace-the-current-partition). All these operations are executed in the FE Leader node. Hence, if the FE Leader node crashes while executing INSERT OVERWRITE command, the whole load transaction will fail, and the temporary partitions will be truncated. - -## Preparation - -### Check privileges - - - -### Create objects - -Create a database named `load_test`, and create a table `insert_wiki_edit` as the destination table and a table `source_wiki_edit` as the source table. - -> **NOTE** -> -> Examples demonstrated in this topic are based on the table `insert_wiki_edit` and the table `source_wiki_edit`. If you prefer working with your own tables and data, you can skip the preparation and move on to the next step. 
- -```SQL -CREATE DATABASE IF NOT EXISTS load_test; -USE load_test; -CREATE TABLE insert_wiki_edit -( - event_time DATETIME, - channel VARCHAR(32) DEFAULT '', - user VARCHAR(128) DEFAULT '', - is_anonymous TINYINT DEFAULT '0', - is_minor TINYINT DEFAULT '0', - is_new TINYINT DEFAULT '0', - is_robot TINYINT DEFAULT '0', - is_unpatrolled TINYINT DEFAULT '0', - delta INT DEFAULT '0', - added INT DEFAULT '0', - deleted INT DEFAULT '0' -) -DUPLICATE KEY( - event_time, - channel, - user, - is_anonymous, - is_minor, - is_new, - is_robot, - is_unpatrolled -) -PARTITION BY RANGE(event_time)( - PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'), - PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'), - PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'), - PARTITION p24 VALUES LESS THAN ('2015-09-13 00:00:00') -) -DISTRIBUTED BY HASH(user); - -CREATE TABLE source_wiki_edit -( - event_time DATETIME, - channel VARCHAR(32) DEFAULT '', - user VARCHAR(128) DEFAULT '', - is_anonymous TINYINT DEFAULT '0', - is_minor TINYINT DEFAULT '0', - is_new TINYINT DEFAULT '0', - is_robot TINYINT DEFAULT '0', - is_unpatrolled TINYINT DEFAULT '0', - delta INT DEFAULT '0', - added INT DEFAULT '0', - deleted INT DEFAULT '0' -) -DUPLICATE KEY( - event_time, - channel,user, - is_anonymous, - is_minor, - is_new, - is_robot, - is_unpatrolled -) -PARTITION BY RANGE(event_time)( - PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'), - PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'), - PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'), - PARTITION p24 VALUES LESS THAN ('2015-09-13 00:00:00') -) -DISTRIBUTED BY HASH(user); -``` - -> **NOTICE** -> -> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -## Insert data via INSERT INTO VALUES - -You can append one or more rows to a specific table by using INSERT INTO VALUES command. Multiple rows are separated by comma (,). For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -> **CAUTION** -> -> Inserting data via INSERT INTO VALUES merely applies to the situation when you need to verify a DEMO with a small dataset. It is not recommended for a massive testing or production environment. To load mass data into StarRocks, see [Loading options](Loading_intro.md) for other options that suit your scenarios. - -The following example inserts two rows into the data source table `source_wiki_edit` with the label `insert_load_wikipedia`. Label is the unique identification label for each data load transaction within the database. - -```SQL -INSERT INTO source_wiki_edit -WITH LABEL insert_load_wikipedia -VALUES - ("2015-09-12 00:00:00","#en.wikipedia","AustinFF",0,0,0,0,0,21,5,0), - ("2015-09-12 00:00:00","#ca.wikipedia","helloSR",0,1,0,1,0,3,23,0); -``` - -## Insert data via INSERT INTO SELECT - -You can load the result of a query on a data source table into the target table via INSERT INTO SELECT command. INSERT INTO SELECT command performs ETL operations on the data from the data source table, and loads the data into an internal table in StarRocks. The data source can be one or more internal or external tables, or even data files on cloud storage. The target table MUST be an internal table in StarRocks. 
For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -### Insert data from an internal or external table into an internal table - -> **NOTE** -> -> Inserting data from an external table is identical to inserting data from an internal table. For simplicity, we only demonstrate how to insert data from an internal table in the following examples. - -- The following example inserts the data from the source table to the target table `insert_wiki_edit`. - -```SQL -INSERT INTO insert_wiki_edit -WITH LABEL insert_load_wikipedia_1 -SELECT * FROM source_wiki_edit; -``` - -- The following example inserts the data from the source table to the `p06` and `p12` partitions of the target table `insert_wiki_edit`. If no partition is specified, the data will be inserted into all partitions. Otherwise, the data will be inserted only into the specified partition(s). - -```SQL -INSERT INTO insert_wiki_edit PARTITION(p06, p12) -WITH LABEL insert_load_wikipedia_2 -SELECT * FROM source_wiki_edit; -``` - -Query the target table to make sure there is data in them. - -```Plain text -MySQL > select * from insert_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.00 sec) -``` - -If you truncate the `p06` and `p12` partitions, the data will not be returned in a query. - -```Plain -MySQL > TRUNCATE TABLE insert_wiki_edit PARTITION(p06, p12); -Query OK, 0 rows affected (0.01 sec) - -MySQL > select * from insert_wiki_edit; -Empty set (0.00 sec) -``` - -- The following example inserts the `event_time` and `channel` columns from the source table to the target table `insert_wiki_edit`. Default values are used in the columns that are not specified here. - -```SQL -INSERT INTO insert_wiki_edit -WITH LABEL insert_load_wikipedia_3 -( - event_time, - channel -) -SELECT event_time, channel FROM source_wiki_edit; -``` - -:::note -From v3.3.1, specifying a column list in the INSERT INTO statement on a Primary Key table will perform Partial Updates (instead of Full Upsert in earlier versions). If the column list is not specified, the system will perform Full Upsert. -::: - -### Insert data directly from files in an external source using FILES() - -From v3.1 onwards, StarRocks supports directly loading data from files on cloud storage using the INSERT command and the [FILES()](../sql-reference/sql-functions/table-functions/files.md) function, thereby you do not need to create an external catalog or file external table first. Besides, FILES() can automatically infer the table schema of the files, greatly simplifying the process of data loading. 
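Because FILES() is a table function, you can also query it directly to preview the file data and the schema that StarRocks infers before running the actual INSERT. The following is a minimal sketch that uses the same bucket path and placeholder credentials as the load example below:

```SQL
SELECT * FROM FILES(
    "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "XXXXXXXXXX",
    "aws.s3.secret_key" = "YYYYYYYYYY",
    "aws.s3.region" = "us-west-2"
)
LIMIT 3;
```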
- -The following example inserts data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`: - -```Plain -INSERT INTO insert_wiki_edit - SELECT * FROM FILES( - "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet", - "format" = "parquet", - "aws.s3.access_key" = "XXXXXXXXXX", - "aws.s3.secret_key" = "YYYYYYYYYY", - "aws.s3.region" = "us-west-2" -); -``` - -## Overwrite data via INSERT OVERWRITE VALUES - -You can overwrite a specific table with one or more rows by using INSERT OVERWRITE VALUES command. Multiple rows are separated by comma (,). For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -> **CAUTION** -> -> Overwriting data via INSERT OVERWRITE VALUES merely applies to the situation when you need to verify a DEMO with a small dataset. It is not recommended for a massive testing or production environment. To load mass data into StarRocks, see [Loading options](Loading_intro.md) for other options that suit your scenarios. - -Query the source table and the target table to make sure there is data in them. - -```Plain -MySQL > SELECT * FROM source_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.02 sec) - -MySQL > SELECT * FROM insert_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.01 sec) -``` - -The following example overwrites the source table `source_wiki_edit` with two new rows. - -```SQL -INSERT OVERWRITE source_wiki_edit -WITH LABEL insert_load_wikipedia_ow -VALUES - ("2015-09-12 00:00:00","#cn.wikipedia","GELongstreet",0,0,0,0,0,36,36,0), - ("2015-09-12 00:00:00","#fr.wikipedia","PereBot",0,1,0,1,0,17,17,0); -``` - -## Overwrite data via INSERT OVERWRITE SELECT - -You can overwrite a table with the result of a query on a data source table via INSERT OVERWRITE SELECT command. 
INSERT OVERWRITE SELECT statement performs ETL operations on the data from one or more internal or external tables, and overwrites an internal table with the data For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -> **NOTE** -> -> Loading data from an external table is identical to loading data from an internal table. For simplicity, we only demonstrate how to overwrite the target table with the data from an internal table in the following examples. - -Query the source table and the target table to make sure that they hold different rows of data. - -```Plain -MySQL > SELECT * FROM source_wiki_edit; -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #cn.wikipedia | GELongstreet | 0 | 0 | 0 | 0 | 0 | 36 | 36 | 0 | -| 2015-09-12 00:00:00 | #fr.wikipedia | PereBot | 0 | 1 | 0 | 1 | 0 | 17 | 17 | 0 | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.02 sec) - -MySQL > SELECT * FROM insert_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.01 sec) -``` - -- The following example overwrites the table `insert_wiki_edit` with the data from the source table. - -```SQL -INSERT OVERWRITE insert_wiki_edit -WITH LABEL insert_load_wikipedia_ow_1 -SELECT * FROM source_wiki_edit; -``` - -- The following example overwrites the `p06` and `p12` partitions of the table `insert_wiki_edit` with the data from the source table. - -```SQL -INSERT OVERWRITE insert_wiki_edit PARTITION(p06, p12) -WITH LABEL insert_load_wikipedia_ow_2 -SELECT * FROM source_wiki_edit; -``` - -Query the target table to make sure there is data in them. 
- -```plain text -MySQL > select * from insert_wiki_edit; -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #fr.wikipedia | PereBot | 0 | 1 | 0 | 1 | 0 | 17 | 17 | 0 | -| 2015-09-12 00:00:00 | #cn.wikipedia | GELongstreet | 0 | 0 | 0 | 0 | 0 | 36 | 36 | 0 | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.01 sec) -``` - -If you truncate the `p06` and `p12` partitions, the data will not be returned in a query. - -```Plain -MySQL > TRUNCATE TABLE insert_wiki_edit PARTITION(p06, p12); -Query OK, 0 rows affected (0.01 sec) - -MySQL > select * from insert_wiki_edit; -Empty set (0.00 sec) -``` - -:::note -For tables that use the `PARTITION BY column` strategy, INSERT OVERWRITE supports creating new partitions in the destination table by specifying the value of the partition key. Existing partitions are overwritten as usual. - -The following example creates the partitioned table `activity`, and creates a new partition in the table while inserting data into it: - -```SQL -CREATE TABLE activity ( -id INT NOT NULL, -dt VARCHAR(10) NOT NULL -) ENGINE=OLAP -DUPLICATE KEY(`id`) -PARTITION BY (`id`, `dt`) -DISTRIBUTED BY HASH(`id`); - -INSERT OVERWRITE activity -PARTITION(id='4', dt='2022-01-01') -WITH LABEL insert_activity_auto_partition -VALUES ('4', '2022-01-01'); -``` - -::: - -- The following example overwrites the target table `insert_wiki_edit` with the `event_time` and `channel` columns from the source table. The default value is assigned to the columns into which no data is overwritten. - -```SQL -INSERT OVERWRITE insert_wiki_edit -WITH LABEL insert_load_wikipedia_ow_3 -( - event_time, - channel -) -SELECT event_time, channel FROM source_wiki_edit; -``` - -### Dynamic Overwrite - -From v3.4.0 onwards, StarRocks supports a new semantic - Dynamic Overwrite for INSERT OVERWRITE with partitioned tables. - -Currently, the default behavior of INSERT OVERWRITE is as follows: - -- When overwriting a partitioned table as a whole (that is, without specifying the PARTITION clause), new data records will replace the data in their corresponding partitions. If there are partitions that are not involved, they will be truncated while the others are overwritten. -- When overwriting an empty partitioned table (that is, with no partitions in it) and specifying the PARTITION clause, the system returns an error `ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'xxx' in table 'yyy'`. -- When overwriting a partitioned table and specifying a non-existent partition in the PARTITION clause, the system returns an error `ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'xxx' in table 'yyy'`. -- When overwriting a partitioned table with data records that do not match any of the specified partitions in the PARTITION clause, the system either returns an error `ERROR 1064 (HY000): Insert has filtered data in strict mode` (if the strict mode is enabled) or filters the unqualified data records (if the strict mode is disabled). 
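For example, with the default semantic and the `insert_wiki_edit` table used throughout this topic, the first and third behaviors above can be illustrated by the following sketch (`p99` is a hypothetical partition name that does not exist in the table):

```SQL
-- Without a PARTITION clause, partitions that receive no new rows are truncated,
-- while the partitions that match the new data are overwritten.
INSERT OVERWRITE insert_wiki_edit
SELECT * FROM source_wiki_edit;

-- Specifying a non-existent partition returns
-- "ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'p99' in table 'insert_wiki_edit'".
INSERT OVERWRITE insert_wiki_edit PARTITION(p99)
SELECT * FROM source_wiki_edit;
```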
- -The behavior of the new Dynamic Overwrite semantic is much different: - -When overwriting a partitioned table as a whole, new data records will replace the data in their corresponding partitions. If there are partitions that are not involved, they will be left alone, instead of being truncated or deleted. And if there are new data records correspond to a non-existent partition, the system will create the partition. - -The Dynamic Overwrite semantic is disabled by default. To enable it, you need to set the system variable `dynamic_overwrite` to `true`. - -Enable Dynamic Overwrite in the current session: - -```SQL -SET dynamic_overwrite = true; -``` - -You can also set it in the hint of the INSERT OVERWRITE statement to allow it take effect for the statement only:. - -Example: - -```SQL -INSERT /*+set_var(dynamic_overwrite = true)*/ OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -## Insert data into a table with generated columns - -A generated column is a special column whose value is derived from a pre-defined expression or evaluation based on other columns. Generated columns are especially useful when your query requests involve evaluations of expensive expressions, for example, querying a certain field from a JSON value, or calculating ARRAY data. StarRocks evaluates the expression and stores the results in the generated columns while data is being loaded into the table, thereby avoiding the expression evaluation during queries and improving the query performance. - -You can load data into a table with generated columns using INSERT. - -The following example creates a table `insert_generated_columns` and inserts a row into it. The table contains two generated columns: `avg_array` and `get_string`. `avg_array` calculates the average value of ARRAY data in `data_array`, and `get_string` extracts the strings from the JSON path `a` in `data_json`. - -```SQL -CREATE TABLE insert_generated_columns ( - id INT(11) NOT NULL COMMENT "ID", - data_array ARRAY NOT NULL COMMENT "ARRAY", - data_json JSON NOT NULL COMMENT "JSON", - avg_array DOUBLE NULL - AS array_avg(data_array) COMMENT "Get the average of ARRAY", - get_string VARCHAR(65533) NULL - AS get_json_string(json_string(data_json), '$.a') COMMENT "Extract JSON string" -) ENGINE=OLAP -PRIMARY KEY(id) -DISTRIBUTED BY HASH(id); - -INSERT INTO insert_generated_columns -VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}')); -``` - -> **NOTE** -> -> Directly loading data into generated columns is not supported. - -You can query the table to check the data within it. - -```Plain -mysql> SELECT * FROM insert_generated_columns; -+------+------------+------------------+-----------+------------+ -| id | data_array | data_json | avg_array | get_string | -+------+------------+------------------+-----------+------------+ -| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 | -+------+------------+------------------+-----------+------------+ -1 row in set (0.02 sec) -``` - -## INSERT data with PROPERTIES - -From v3.4.0 onwards, INSERT statements support configuring PROPERTIES, which can serve a wide variety of purposes. PROPERTIES overrides their corresponding variables. - -### Enable strict mode - -From v3.4.0 onwards, you can enable strict mode and set `max_filter_ratio` for INSERT from FILES(). Strict mode for INSERT from FILES() has the same behavior as that of other loading methods. 
- -If you want to load a dataset with some unqualified rows, you either filter these unqualified rows or load them and assign NULL values to the unqualified columns. You can achieve them by using the properties `strict_mode` and `max_filter_ratio`. - -- To filter the unqualified rows: set `strict_mode` to `true`, and `max_filter_ratio` to a desired value. -- To load all unqualified rows with NULL values: set `strict_mode` to `false`. - -The following example inserts data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`, enables strict mode to filter the unqualified data records, and tolerates at most 10% of error data: - -```SQL -INSERT INTO insert_wiki_edit -PROPERTIES( - "strict_mode" = "true", - "max_filter_ratio" = "0.1" -) -SELECT * FROM FILES( - "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet", - "format" = "parquet", - "aws.s3.access_key" = "XXXXXXXXXX", - "aws.s3.secret_key" = "YYYYYYYYYY", - "aws.s3.region" = "us-west-2" -); -``` - -:::note - -`strict_mode` and `max_filter_ratio` are supported only for INSERT from FILES(). INSERT from tables does not support these properties. - -::: - -### Set timeout duration - -From v3.4.0 onwards, you can set the timeout duration for INSERT statements with properties. - -The following example inserts the data from the source table `source_wiki_edit` to the target table `insert_wiki_edit` with the timeout duration set to `2` seconds. - -```SQL -INSERT INTO insert_wiki_edit -PROPERTIES( - "timeout" = "2" -) -SELECT * FROM source_wiki_edit; -``` - -:::note - -From v3.4.0 onwards, you can also set the INSERT timeout duration using the system variable `insert_timeout`, which applies to operations involving INSERT (for example, UPDATE, DELETE, CTAS, materialized view refresh, statistics collection, and PIPE). In versions earlier than v3.4.0, the corresponding variable is `query_timeout`. - -::: - -### Match column by name - -By default, INSERT matches the columns in the source and the target tables by their positions, that is, the mapping of the columns in the statement. - -The following example explicitly matches each column in the source and target tables by their positions: - -```SQL -INSERT INTO insert_wiki_edit ( - event_time, - channel, - user -) -SELECT event_time, channel, user FROM source_wiki_edit; -``` - -The column mapping will change if you changed the order of `channel` and `user` in either the column list or the SELECT statement. - -```SQL -INSERT INTO insert_wiki_edit ( - event_time, - channel, - user -) -SELECT event_time, user, channel FROM source_wiki_edit; -``` - -Here, the ingested data are probably not what you want, because `channel` in the target table `insert_wiki_edit` will be filled with data from `user` in the source table `source_wiki_edit`. - -By adding `BY NAME` clause in the INSERT statement, the system will detect the column names in the source and the target tables, and match the columns with the same name. - -:::note - -- You cannot specify the column list if `BY NAME` is specified. -- If `BY NAME` is not specified, the system matches the columns by the position of the columns in the column list and the SELECT statement. 
- -::: - -The following example matches each column in the source and target tables by their names: - -```SQL -INSERT INTO insert_wiki_edit BY NAME -SELECT event_time, user, channel FROM source_wiki_edit; -``` - -In this case, changing the order of `channel` and `user` will not change the column mapping. - -## Load data asynchronously using INSERT - -Loading data with INSERT submits a synchronous transaction, which may fail because of session interruption or timeout. You can submit an asynchronous INSERT transaction using [SUBMIT TASK](../sql-reference/sql-statements/loading_unloading/ETL/SUBMIT_TASK.md). This feature is supported since StarRocks v2.5. - -- The following example asynchronously inserts the data from the source table to the target table `insert_wiki_edit`. - -```SQL -SUBMIT TASK AS INSERT INTO insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -- The following example asynchronously overwrites the table `insert_wiki_edit` with the data from the source table. - -```SQL -SUBMIT TASK AS INSERT OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -- The following example asynchronously overwrites the table `insert_wiki_edit` with the data from the source table, and extends the query timeout to `100000` seconds using hint. - -```SQL -SUBMIT /*+set_var(insert_timeout=100000)*/ TASK AS -INSERT OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -- The following example asynchronously overwrites the table `insert_wiki_edit` with the data from the source table, and specifies the task name as `async`. - -```SQL -SUBMIT TASK async -AS INSERT OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -You can check the status of an asynchronous INSERT task by querying the metadata view `task_runs` in Information Schema. - -The following example checks the status of the INSERT task `async`. - -```SQL -SELECT * FROM information_schema.task_runs WHERE task_name = 'async'; -``` - -## Check the INSERT job status - -### Check via the result - -A synchronous INSERT transaction returns different status in accordance with the result of the transaction. - -- **Transaction succeeds** - -StarRocks returns the following if the transaction succeeds: - -```Plain -Query OK, 2 rows affected (0.05 sec) -{'label':'insert_load_wikipedia', 'status':'VISIBLE', 'txnId':'1006'} -``` - -- **Transaction fails** - -If all rows of data fail to be loaded into the target table, the INSERT transaction fails. StarRocks returns the following if the transaction fails: - -```Plain -ERROR 1064 (HY000): Insert has filtered data in strict mode, tracking_url=http://x.x.x.x:yyyy/api/_load_error_log?file=error_log_9f0a4fd0b64e11ec_906bbede076e9d08 -``` - -You can locate the problem by checking the log with `tracking_url`. - -### Check via Information Schema - -You can use the [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) statement to query the results of one or more load jobs from the `loads` table in the `information_schema` database. This feature is supported from v3.1 onwards. - -Example 1: Query the results of load jobs executed on the `load_test` database, sort the results by creation time (`CREATE_TIME`) in descending order, and only return the top result. 
- -```SQL -SELECT * FROM information_schema.loads -WHERE database_name = 'load_test' -ORDER BY create_time DESC -LIMIT 1\G -``` - -Example 2: Query the result of the load job (whose label is `insert_load_wikipedia`) executed on the `load_test` database: - -```SQL -SELECT * FROM information_schema.loads -WHERE database_name = 'load_test' and label = 'insert_load_wikipedia'\G -``` - -The return is as follows: - -```Plain -*************************** 1. row *************************** - JOB_ID: 21319 - LABEL: insert_load_wikipedia - DATABASE_NAME: load_test - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 0 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 2 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-08-09 10:42:23 - ETL_START_TIME: 2023-08-09 10:42:23 - ETL_FINISH_TIME: 2023-08-09 10:42:23 - LOAD_START_TIME: 2023-08-09 10:42:23 - LOAD_FINISH_TIME: 2023-08-09 10:42:24 - JOB_DETAILS: {"All backends":{"5ebf11b5-365e-11ee-9e4a-7a563fb695da":[10006]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":175,"InternalTableLoadRows":2,"ScanBytes":0,"ScanRows":0,"TaskNumber":1,"Unfinished backends":{"5ebf11b5-365e-11ee-9e4a-7a563fb695da":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -1 row in set (0.01 sec) -``` - -For information about the fields in the return results, see [Information Schema > loads](../sql-reference/information_schema/loads.md). - -### Check via curl command - -You can check the INSERT transaction status by using curl command. - -Launch a terminal, and execute the following command: - -```Bash -curl --location-trusted -u : \ - http://:/api//_load_info?label= -``` - -The following example checks the status of the transaction with label `insert_load_wikipedia`. - -```Bash -curl --location-trusted -u : \ - http://x.x.x.x:8030/api/load_test/_load_info?label=insert_load_wikipedia -``` - -> **NOTE** -> -> If you use an account for which no password is set, you need to input only `:`. - -The return is as follows: - -```Plain -{ - "jobInfo":{ - "dbName":"load_test", - "tblNames":[ - "source_wiki_edit" - ], - "label":"insert_load_wikipedia", - "state":"FINISHED", - "failMsg":"", - "trackingUrl":"" - }, - "status":"OK", - "msg":"Success" -} -``` - -## Configuration - -You can set the following configuration items for INSERT transaction: - -- **FE configuration** - -| FE configuration | Description | -| ---------------------------------- | ------------------------------------------------------------ | -| insert_load_default_timeout_second | Default timeout for INSERT transaction. Unit: second. If the current INSERT transaction is not completed within the time set by this parameter, it will be canceled by the system and the status will be CANCELLED. As for current version of StarRocks, you can only specify a uniform timeout for all INSERT transactions using this parameter, and you cannot set a different timeout for a specific INSERT transaction. The default is 3600 seconds (1 hour). If the INSERT transaction cannot be completed within the specified time, you can extend the timeout by adjusting this parameter. | - -- **Session variables** - -| Session variable | Description | -| -------------------- | ------------------------------------------------------------ | -| enable_insert_strict | Switch value to control if the INSERT transaction is tolerant of invalid data rows. 
When it is set to `true`, the transaction fails if any of the data rows is invalid. When it is set to `false`, the transaction succeeds when at least one row of data has been loaded correctly, and the label will be returned. The default is `true`. You can set this variable with `SET enable_insert_strict = {true or false};` command. | -| insert_timeout | Timeout for the INSERT statement. Unit: second. You can set this variable with the `SET insert_timeout = xxx;` command. | diff --git a/docs/en/loading/Json_loading.md b/docs/en/loading/Json_loading.md deleted file mode 100644 index d2dcffb..0000000 --- a/docs/en/loading/Json_loading.md +++ /dev/null @@ -1,363 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Introduction - -You can import semi-structured data (for example, JSON) by using stream load or routine load. - -## Use Scenarios - -* Stream Load: For JSON data stored in text files, use stream load to import. -* Routine Load: For JSON data in Kafka, use routine load to import. - -### Stream Load Import - -Sample data: - -~~~json -{ "id": 123, "city" : "beijing"}, -{ "id": 456, "city" : "shanghai"}, - ... -~~~ - -Example: - -~~~shell -curl -v --location-trusted -u : \ - -H "format: json" -H "jsonpaths: [\"$.id\", \"$.city\"]" \ - -T example.json \ - http://FE_HOST:HTTP_PORT/api/DATABASE/TABLE/_stream_load -~~~ - -The `format: json` parameter allows you to execute the format of the imported data. `jsonpaths` is used to execute the corresponding data import path. - -Related parameters: - -* jsonpaths: Select the JSON path for each column -* json\_root: Select the column where the JSON starts to be parsed -* strip\_outer\_array: Crop the outermost array field -* strict\_mode: Strictly filter for column type conversion during import - -When the JSON data schema and StarRocks data schema are not exactly the same, modify the `Jsonpath`. - -Sample data: - -~~~json -{"k1": 1, "k2": 2} -~~~ - -Import example: - -~~~bash -curl -v --location-trusted -u : \ - -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \ - -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" \ - -T example.json \ - http://127.0.0.1:8030/api/db1/tbl1/_stream_load -~~~ - -The ETL operation of multiplying k1 by 100 is performed during the import, and the column is matched with the original data by `Jsonpath`. - -The import results are as follows: - -~~~plain text -+------+------+ -| k1 | k2 | -+------+------+ -| 100 | 2 | -+------+------+ -~~~ - -For missing columns, if the column definition is nullable, then `NULL` will be added, or the default value can be added by `ifnull`. 
- -Sample data: - -~~~json -[ - {"k1": 1, "k2": "a"}, - {"k1": 2}, - {"k1": 3, "k2": "c"}, -] -~~~ - -Import Example-1: - -~~~shell -curl -v --location-trusted -u : \ - -H "format: json" -H "strip_outer_array: true" \ - -T example.json \ - http://127.0.0.1:8030/api/db1/tbl1/_stream_load -~~~ - -The import results are as follows: - -~~~plain text -+------+------+ -| k1 | k2 | -+------+------+ -| 1 | a | -+------+------+ -| 2 | NULL | -+------+------+ -| 3 | c | -+------+------+ -~~~ - -Import Example-2: - -~~~shell -curl -v --location-trusted -u : \ - -H "format: json" -H "strip_outer_array: true" \ - -H "jsonpaths: [\"$.k1\", \"$.k2\"]" \ - -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" \ - -T example.json \ - http://127.0.0.1:8030/api/db1/tbl1/_stream_load -~~~ - -The import results are as follows: - -~~~plain text -+------+------+ -| k1 | k2 | -+------+------+ -| 1 | a | -+------+------+ -| 2 | x | -+------+------+ -| 3 | c | -+------+------+ -~~~ - -### Routine Load Import - -Similar to stream load, the message content of Kafka data sources is treated as a complete JSON data. - -1. If a message contains multiple rows of data in array format, all rows will be imported and Kafka's offset will only be incremented by 1. -2. If a JSON in Array format represents multiple rows of data, but the parsing of the JSON fails due to a JSON format error, the error row will only be incremented by 1 (given that the parsing fails, StarRocks cannot actually determine how many rows of data it contains, and can only record the error data as one row). - -### Use Canal to import StarRocks from MySQL with incremental sync binlogs - -[Canal](https://github.com/alibaba/canal) is an open-source MySQL binlog synchronization tool from Alibaba, through which we can synchronize MySQL data to Kafka. The data is generated in JSON format in Kafka. Here is a demonstration of how to use routine load to synchronize data in Kafka for incremental data synchronization with MySQL. - -* In MySQL we have a data table with the following table creation statement. - -~~~sql -CREATE TABLE `query_record` ( - `query_id` varchar(64) NOT NULL, - `conn_id` int(11) DEFAULT NULL, - `fe_host` varchar(32) DEFAULT NULL, - `user` varchar(32) DEFAULT NULL, - `start_time` datetime NOT NULL, - `end_time` datetime DEFAULT NULL, - `time_used` double DEFAULT NULL, - `state` varchar(16) NOT NULL, - `error_message` text, - `sql` text NOT NULL, - `database` varchar(128) NOT NULL, - `profile` longtext, - `plan` longtext, - PRIMARY KEY (`query_id`), - KEY `idx_start_time` (`start_time`) USING BTREE -) ENGINE=InnoDB DEFAULT CHARSET=utf8 -~~~ - -* Prerequisite: Make sure MySQL has binlog enabled and the format is ROW. - -~~~bash -[mysqld] -log-bin=mysql-bin # Enable binlog -binlog-format=ROW # Select ROW mode -server_id=1 # MySQL replication need to be defined, and do not duplicate canal's slaveId -~~~ - -* Create an account and grant privileges to the secondary MySQL server: - -~~~sql -CREATE USER canal IDENTIFIED BY 'canal'; -GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%'; --- GRANT ALL PRIVILEGES ON *.* TO 'canal'@'%'; -FLUSH PRIVILEGES; -~~~ - -* Then download and install Canal. - -~~~bash -wget https://github.com/alibaba/canal/releases/download/canal-1.0.17/canal.deployer-1.0.17.tar.gz - -mkdir /tmp/canal -tar zxvf canal.deployer-$version.tar.gz -C /tmp/canal -~~~ - -* Modify the configuration (MySQL related). 
- -`$ vi conf/example/instance.properties` - -~~~bash -## mysql serverId -canal.instance.mysql.slaveId = 1234 -#position info, need to change to your own database information -canal.instance.master.address = 127.0.0.1:3306 -canal.instance.master.journal.name = -canal.instance.master.position = -canal.instance.master.timestamp = -#canal.instance.standby.address = -#canal.instance.standby.journal.name = -#canal.instance.standby.position = -#canal.instance.standby.timestamp = -#username/password, need to change to your own database information -canal.instance.dbUsername = canal -canal.instance.dbPassword = canal -canal.instance.defaultDatabaseName = -canal.instance.connectionCharset = UTF-8 -#table regex -canal.instance.filter.regex = .\*\\\\..\* -# Select the name of the table to be synchronized and the partition name of the kafka target. -canal.mq.dynamicTopic=databasename.query_record -canal.mq.partitionHash= databasename.query_record:query_id -~~~ - -* Modify the configuration (Kafka related). - -`$ vi /usr/local/canal/conf/canal.properties` - -~~~bash -# Available options: tcp(by default), kafka, RocketMQ -canal.serverMode = kafka -# ... -# kafka/rocketmq Cluster Configuration: 192.168.1.117:9092,192.168.1.118:9092,192.168.1.119:9092 -canal.mq.servers = 127.0.0.1:6667 -canal.mq.retries = 0 -# This value can be increased in flagMessage mode, but do not exceed the maximum size of the MQ message. -canal.mq.batchSize = 16384 -canal.mq.maxRequestSize = 1048576 -# In flatMessage mode, please change this value to a larger value, 50-200 is recommended. -canal.mq.lingerMs = 1 -canal.mq.bufferMemory = 33554432 -# Canal's batch size with a default value of 50K. Please do not exceed 1M due to Kafka's maximum message size limit (under 900K) -canal.mq.canalBatchSize = 50 -# Timeout of `Canal get`, in milliseconds. Empty indicates unlimited timeout. 
-canal.mq.canalGetTimeout = 100 -# Whether the object is in flat json format -canal.mq.flatMessage = false -canal.mq.compressionType = none -canal.mq.acks = all -# Whether Kafka message delivery uses transactions -canal.mq.transaction = false -~~~ - -* Initiation - -`bin/startup.sh` - -The corresponding synchronization log is shown in `logs/example/example.log` and in Kafka, with the following format: - -~~~json -{ - "data": [{ - "query_id": "3c7ebee321e94773-b4d79cc3f08ca2ac", - "conn_id": "34434", - "fe_host": "172.26.34.139", - "user": "zhaoheng", - "start_time": "2020-10-19 20:40:10.578", - "end_time": "2020-10-19 20:40:10", - "time_used": "1.0", - "state": "FINISHED", - "error_message": "", - "sql": "COMMIT", - "database": "", - "profile": "", - "plan": "" - }, { - "query_id": "7ff2df7551d64f8e-804004341bfa63ad", - "conn_id": "34432", - "fe_host": "172.26.34.139", - "user": "zhaoheng", - "start_time": "2020-10-19 20:40:10.566", - "end_time": "2020-10-19 20:40:10", - "time_used": "0.0", - "state": "FINISHED", - "error_message": "", - "sql": "COMMIT", - "database": "", - "profile": "", - "plan": "" - }, { - "query_id": "3a4b35d1c1914748-be385f5067759134", - "conn_id": "34440", - "fe_host": "172.26.34.139", - "user": "zhaoheng", - "start_time": "2020-10-19 20:40:10.601", - "end_time": "1970-01-01 08:00:00", - "time_used": "-1.0", - "state": "RUNNING", - "error_message": "", - "sql": " SELECT SUM(length(lo_custkey)), SUM(length(c_custkey)) FROM lineorder_str INNER JOIN customer_str ON lo_custkey=c_custkey;", - "database": "ssb", - "profile": "", - "plan": "" - }], - "database": "center_service_lihailei", - "es": 1603111211000, - "id": 122, - "isDdl": false, - "mysqlType": { - "query_id": "varchar(64)", - "conn_id": "int(11)", - "fe_host": "varchar(32)", - "user": "varchar(32)", - "start_time": "datetime(3)", - "end_time": "datetime", - "time_used": "double", - "state": "varchar(16)", - "error_message": "text", - "sql": "text", - "database": "varchar(128)", - "profile": "longtext", - "plan": "longtext" - }, - "old": null, - "pkNames": ["query_id"], - "sql": "", - "sqlType": { - "query_id": 12, - "conn_id": 4, - "fe_host": 12, - "user": 12, - "start_time": 93, - "end_time": 93, - "time_used": 8, - "state": 12, - "error_message": 2005, - "sql": 2005, - "database": 12, - "profile": 2005, - "plan": 2005 - }, - "table": "query_record", - "ts": 1603111212015, - "type": "INSERT" -} -~~~ - -Add `json_root` and `strip_outer_array = true` to import data from `data`. - -~~~sql -create routine load manual.query_job on query_record -columns (query_id,conn_id,fe_host,user,start_time,end_time,time_used,state,error_message,`sql`,`database`,profile,plan) -PROPERTIES ( - "format"="json", - "json_root"="$.data", - "desired_concurrent_number"="1", - "strip_outer_array" ="true", - "max_error_number"="1000" -) -FROM KAFKA ( - "kafka_broker_list"= "172.26.92.141:9092", - "kafka_topic" = "databasename.query_record" -); -~~~ - -This completes the near real-time synchronization of data from MySQL to StarRocks. - -View status and error messages of the import job by `show routine load`. 
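-
-For example, the following statement (a minimal sketch that assumes the `manual.query_job` job created above) displays the job's state, progress, and any error reason:
-
-~~~sql
-SHOW ROUTINE LOAD FOR manual.query_job\G
-~~~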
diff --git a/docs/en/loading/Kafka-connector-starrocks.md b/docs/en/loading/Kafka-connector-starrocks.md deleted file mode 100644 index 605e0bb..0000000 --- a/docs/en/loading/Kafka-connector-starrocks.md +++ /dev/null @@ -1,833 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Load data using Kafka connector - -StarRocks provides a self-developed connector named Apache Kafka® connector (StarRocks Connector for Apache Kafka®, Kafka connector for short), as a sink connector, that continuously consumes messages from Kafka and loads them into StarRocks. The Kafka connector guarantees at-least-once semantics. - -The Kafka connector can seamlessly integrate with Kafka Connect, which allows StarRocks better integrated with the Kafka ecosystem. It is a wise choice if you want to load real-time data into StarRocks. Compared with Routine Load, it is recommended to use the Kafka connector in the following scenarios: - -- Compared with Routine Load which only supports loading data in CSV, JSON, and Avro formats, Kafka connector can load data in more formats, such as Protobuf. As long as data can be converted into JSON and CSV formats using Kafka Connect's converters, data can be loaded into StarRocks via the Kafka connector. -- Customize data transformation, such as Debezium-formatted CDC data. -- Load data from multiple Kafka topics. -- Load data from Confluent Cloud. -- Need finer control over load batch sizes, parallelism, and other parameters to achieve a balance between load speed and resource utilization. - -## Preparations - -### Version requirements - -| Connector | Kafka | StarRocks | Java | -| --------- | --------- | ------------- | ---- | -| 1.0.6 | 3.4+/4.0+ | 2.5 and later | 8 | -| 1.0.5 | 3.4 | 2.5 and later | 8 | -| 1.0.4 | 3.4 | 2.5 and later | 8 | -| 1.0.3 | 3.4 | 2.5 and later | 8 | - -### Set up Kafka environment - -Both self-managed Apache Kafka clusters and Confluent Cloud are supported. - -- For a self-managed Apache Kafka cluster, you can refer to [Apache Kafka quickstart](https://kafka.apache.org/quickstart) to quickly deploy a Kafka cluster. Kafka Connect is already integrated into Kafka. -- For Confluent Cloud, make sure that you have a Confluent account and have created a cluster. - -### Download Kafka connector - -Submit the Kafka connector into Kafka Connect: - -- Self-managed Kafka cluster: - - Download [starrocks-connector-for-kafka-x.y.z-with-dependencies.jar](https://github.com/StarRocks/starrocks-connector-for-kafka/releases). - -- Confluent Cloud: - - Currently, the Kafka connector is not uploaded to Confluent Hub. You need to download [starrocks-connector-for-kafka-x.y.z-with-dependencies.jar](https://github.com/StarRocks/starrocks-connector-for-kafka/releases), package it into a ZIP file and upload the ZIP file to Confluent Cloud. - -### Network configuration - -Ensure that the machine where Kafka is located can access the FE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`query_port`](../administration/management/FE_configuration.md#query_port) (default: `9030`), and the BE nodes via the [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`). - -## Usage - -This section uses a self-managed Kafka cluster as an example to explain how to configure the Kafka connector and the Kafka Connect, and then run the Kafka Connect to load data into StarRocks. 
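-
-Before you continue, you can optionally verify that the ports listed in the network configuration above are reachable from the machine that runs Kafka Connect. The following is only a quick sketch using `nc`; replace the placeholder hosts with your own FE and BE addresses:
-
-```Bash
-# FE HTTP port (default 8030), FE query port (default 9030), and BE HTTP port (default 8040).
-nc -zv <fe_host> 8030
-nc -zv <fe_host> 9030
-nc -zv <be_host> 8040
-```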
- -### Prepare a dataset - -Suppose that JSON-format data exists in the topic `test` in a Kafka cluster. - -```JSON -{"id":1,"city":"New York"} -{"id":2,"city":"Los Angeles"} -{"id":3,"city":"Chicago"} -``` - -### Create a table - -Create the table `test_tbl` in the database `example_db` in the StarRocks cluster according to the keys of the JSON-format data. - -```SQL -CREATE DATABASE example_db; -USE example_db; -CREATE TABLE test_tbl (id INT, city STRING); -``` - -### Configure Kafka connector and Kafka Connect, and then run Kafka Connect to load data - -#### Run Kafka Connect in standalone mode - -1. Configure the Kafka connector. In the **config** directory under the Kafka installation directory, create the configuration file **connect-StarRocks-sink.properties** for the Kafka connector, and configure the following parameters. For more parameters and descriptions, see [Parameters](#Parameters). - - :::info - - - In this example, the Kafka connector provided by StarRocks is a sink connector that can continuously consume data from Kafka and load data into StarRocks. - - If the source data is CDC data, such as data in Debezium format, and the StarRocks table is a Primary Key table, you also need to [configure `transform`](#load-debezium-formatted-cdc-data) in the configuration file **connect-StarRocks-sink.properties** for the Kafka connector provided by StarRocks, to synchronize the source data changes to the Primary Key table. - - ::: - - ```yaml - name=starrocks-kafka-connector - connector.class=com.starrocks.connector.kafka.StarRocksSinkConnector - topics=test - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - # The HTTP URL of the FE in your StarRocks cluster. The default port is 8030. - starrocks.http.url=192.168.xxx.xxx:8030 - # If the Kafka topic name is different from the StarRocks table name, you need to configure the mapping relationship between them. - starrocks.topic2table.map=test:test_tbl - # Enter the StarRocks username. - starrocks.username=user1 - # Enter the StarRocks password. - starrocks.password=123456 - starrocks.database.name=example_db - sink.properties.strip_outer_array=true - ``` - -2. Configure and run the Kafka Connect. - - 1. Configure the Kafka Connect. In the configuration file **config/connect-standalone.properties** in the **config** directory, configure the following parameters. For more parameters and descriptions, see [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running). - - ```yaml - # The addresses of Kafka brokers. Multiple addresses of Kafka brokers need to be separated by commas (,). - # Note that this example uses PLAINTEXT as the security protocol to access the Kafka cluster. If you are using other security protocol to access the Kafka cluster, you need to configure the relevant information in this file. - bootstrap.servers=:9092 - offset.storage.file.filename=/tmp/connect.offsets - offset.flush.interval.ms=10000 - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - # The absolute path of starrocks-connector-for-kafka-x.y.z-with-dependencies.jar. - plugin.path=/home/kafka-connect/starrocks-kafka-connector - ``` - - 2. Run the Kafka Connect. 
- - ```Bash - CLASSPATH=/home/kafka-connect/starrocks-kafka-connector/* bin/connect-standalone.sh config/connect-standalone.properties config/connect-starrocks-sink.properties - ``` - -#### Run Kafka Connect in distributed mode - -1. Configure and run the Kafka Connect. - - 1. Configure the Kafka Connect. In the configuration file `config/connect-distributed.properties` in the **config** directory, configure the following parameters. For more parameters and descriptions, refer to [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running). - - ```yaml - # The addresses of Kafka brokers. Multiple addresses of Kafka brokers need to be separated by commas (,). - # Note that this example uses PLAINTEXT as the security protocol to access the Kafka cluster. If you are using other security protocol to access the Kafka cluster, you need to configure the relevant information in this file. - bootstrap.servers=:9092 - offset.storage.file.filename=/tmp/connect.offsets - offset.flush.interval.ms=10000 - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - # The absolute path of starrocks-connector-for-kafka-x.y.z-with-dependencies.jar. - plugin.path=/home/kafka-connect/starrocks-kafka-connector - ``` - - 2. Run the Kafka Connect. - - ```BASH - CLASSPATH=/home/kafka-connect/starrocks-kafka-connector/* bin/connect-distributed.sh config/connect-distributed.properties - ``` - -2. Configure and create the Kafka connector. Note that in distributed mode, you need to configure and create the Kafka connector through the REST API. For parameters and descriptions, see [Parameters](#Parameters). - - :::info - - - In this example, the Kafka connector provided by StarRocks is a sink connector that can continuously consume data from Kafka and load data into StarRocks. - - If the source data is CDC data, such as data in Debezium format, and the StarRocks table is a Primary Key table, you also need to [configure `transform`](#load-debezium-formatted-cdc-data) in the configuration file **connect-StarRocks-sink.properties** for the Kafka connector provided by StarRocks, to synchronize the source data changes to the Primary Key table. - - ::: - - ```Shell - curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{ - "name":"starrocks-kafka-connector", - "config":{ - "connector.class":"com.starrocks.connector.kafka.StarRocksSinkConnector", - "topics":"test", - "key.converter":"org.apache.kafka.connect.json.JsonConverter", - "value.converter":"org.apache.kafka.connect.json.JsonConverter", - "key.converter.schemas.enable":"true", - "value.converter.schemas.enable":"false", - "starrocks.http.url":"192.168.xxx.xxx:8030", - "starrocks.topic2table.map":"test:test_tbl", - "starrocks.username":"user1", - "starrocks.password":"123456", - "starrocks.database.name":"example_db", - "sink.properties.strip_outer_array":"true" - } - }' - ``` - -#### Query StarRocks table - -Query the target StarRocks table `test_tbl`. - -```mysql -MySQL [example_db]> select * from test_tbl; - -+------+-------------+ -| id | city | -+------+-------------+ -| 1 | New York | -| 2 | Los Angeles | -| 3 | Chicago | -+------+-------------+ -3 rows in set (0.01 sec) -``` - -The data is successfully loaded when the above result is returned. - -## Parameters - -### name - -**Required**: YES
-**Default value**:
-**Description**: Name for this Kafka connector. It must be globally unique among all Kafka connectors within this Kafka Connect cluster. For example, starrocks-kafka-connector. - -### connector.class - -**Required**: YES
-**Default value**:
-**Description**: Class used by this Kafka connector's sink. Set the value to `com.starrocks.connector.kafka.StarRocksSinkConnector`. - -### topics - -**Required**:
-**Default value**:
-**Description**: One or more topics to subscribe to, where each topic corresponds to a StarRocks table. By default, StarRocks assumes that the topic name matches the StarRocks table name and uses the topic name to determine the target table. Specify either `topics` or `topics.regex` (below), but not both. If the StarRocks table name differs from the topic name, use the optional `starrocks.topic2table.map` parameter (below) to specify the mapping from topic name to table name.
-
-### topics.regex
-
-**Required**:
-**Default value**:
-**Description**: Regular expression that matches the topics to subscribe to. For details, see `topics`. Specify either `topics.regex` or `topics` (above), but not both.
- -### starrocks.topic2table.map - -**Required**: NO
-**Default value**:
-**Description**: The mapping between topic names and StarRocks table names when the topic name differs from the StarRocks table name. The format is `<topic_1>:<table_1>,<topic_2>:<table_2>,...`.
-
-### starrocks.http.url
-
-**Required**: YES
-**Default value**:
-**Description**: The HTTP URL of the FE in your StarRocks cluster. The format is `<fe_host1>:<http_port1>,<fe_host2>:<http_port2>,...`. Multiple addresses are separated by commas (,). For example, `192.168.xxx.xxx:8030,192.168.xxx.xxx:8030`.
-
-### starrocks.database.name
-
-**Required**: YES
-**Default value**:
-**Description**: The name of StarRocks database. - -### starrocks.username - -**Required**: YES
-**Default value**:
-**Description**: The username of your StarRocks cluster account. The user needs the [INSERT](../sql-reference/sql-statements/account-management/GRANT.md) privilege on the StarRocks table. - -### starrocks.password - -**Required**: YES
-**Default value**:
-**Description**: The password of your StarRocks cluster account. - -### key.converter - -**Required**: NO
-**Default value**: Key converter used by Kafka Connect cluster
-**Description**: This parameter specifies the key converter for the sink connector (Kafka-connector-starrocks), which is used to deserialize the keys of Kafka data. The default key converter is the one used by Kafka Connect cluster. - -### value.converter - -**Required**: NO
-**Default value**: Value converter used by Kafka Connect cluster
-**Description**: This parameter specifies the value converter for the sink connector (Kafka-connector-starrocks), which is used to deserialize the values of Kafka data. The default value converter is the one used by Kafka Connect cluster. - -### key.converter.schema.registry.url - -**Required**: NO
-**Default value**:
-**Description**: Schema registry URL for the key converter. - -### value.converter.schema.registry.url - -**Required**: NO
-**Default value**:
-**Description**: Schema registry URL for the value converter. - -### tasks.max - -**Required**: NO
-**Default value**: 1
-**Description**: The upper limit for the number of task threads that the Kafka connector can create, which is usually the same as the number of CPU cores on the worker nodes in the Kafka Connect cluster. You can tune this parameter to control load performance. - -### bufferflush.maxbytes - -**Required**: NO
-**Default value**: 94371840(90M)
-**Description**: The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. The maximum value ranges from 64 MB to 10 GB. Keep in mind that the Stream Load SDK buffer may create multiple Stream Load jobs to buffer data. Therefore, the threshold mentioned here refers to the total data size. - -### bufferflush.intervalms - -**Required**: NO
-**Default value**: 1000
-**Description**: Interval for sending a batch of data which controls the load latency. Range: [1000, 3600000]. - -### connect.timeoutms - -**Required**: NO
-**Default value**: 1000
-**Description**: Timeout for connecting to the HTTP URL. Range: [100, 60000]. - -### sink.properties.* - -**Required**:
-**Default value**:
-**Description**: Stream Load parameters that control load behavior. For example, the parameter `sink.properties.format` specifies the format used for Stream Load, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
-
-### sink.properties.format
-
-**Required**: NO
-**Default value**: json
-**Description**: The format used for Stream Load. The Kafka connector will transform each batch of data to the format before sending them to StarRocks. Valid values: `csv` and `json`. For more information, see [CSV parameters](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#csv-parameters) and [JSON parameters](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#json-parameters). - -### sink.properties.partial_update - -**Required**: NO
-**Default value**: `FALSE`
-**Description**: Whether to use partial updates. Valid values: `TRUE` and `FALSE`. The default value `FALSE` disables this feature.
-
-### sink.properties.partial_update_mode
-
-**Required**: NO
-**Default value**: `row`
-**Description**: Specifies the mode for partial updates. Valid values: `row` and `column`.
  • The value `row` (default) means partial updates in row mode, which is more suitable for real-time updates with many columns and small batches.
  • The value `column` means partial updates in column mode, which is more suitable for batch updates with few columns and many rows. In such scenarios, enabling the column mode offers faster update speeds. For example, in a table with 100 columns, if only 10 columns (10% of the total) are updated for all rows, the update speed of the column mode is 10 times faster.
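-
-As a quick illustration of how these parameters fit together, the following sketch combines several of them in a standalone-mode connector properties file. The values are placeholders chosen within the documented ranges, not recommendations; adjust them to your own cluster and workload:
-
-```yaml
-name=starrocks-kafka-connector
-connector.class=com.starrocks.connector.kafka.StarRocksSinkConnector
-topics=test
-starrocks.http.url=192.168.xxx.xxx:8030
-starrocks.database.name=example_db
-starrocks.username=user1
-starrocks.password=123456
-# Tune parallelism and the flush policy.
-tasks.max=2
-bufferflush.maxbytes=67108864
-bufferflush.intervalms=5000
-connect.timeoutms=3000
-# Pass Stream Load options through sink.properties.*.
-sink.properties.format=json
-sink.properties.strip_outer_array=true
-```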
- -## Usage Notes - -### Flush Policy - -The Kafka connector will buffer the data in memory, and flush them in batch to StarRocks via Stream Load. The flush will be triggered when any of the following conditions are met: - -- The bytes of buffered rows reaches the limit `bufferflush.maxbytes`. -- The elapsed time since the last flush reaches the limit `bufferflush.intervalms`. -- The interval at which the connector tries committing offsets for tasks is reached. The interval is controlled by the Kafka Connect configuration [`offset.flush.interval.ms`](https://docs.confluent.io/platform/current/connect/references/allconfigs.html), and the default values is `60000`. - -For lower data latency, adjust these configurations in the Kafka connector settings. However, more frequent flushes will increase CPU and I/O usage. - -### Limits - -- It is not supported to flatten a single message from a Kafka topic into multiple data rows and load into StarRocks. -- The sink of the Kafka connector provided by StarRocks guarantees at-least-once semantics. - -## Best practices - -### Load Debezium-formatted CDC data - -Debezium is a popular Change Data Capture (CDC) tool that supports monitoring data changes in various database systems and streaming these changes to Kafka. The following example demonstrates how to configure and use the Kafka connector to write PostgreSQL changes to a **Primary Key table** in StarRocks. - -#### Step 1: Install and start Kafka - -> **NOTE** -> -> You can skip this step if you have your own Kafka environment. - -1. [Download](https://dlcdn.apache.org/kafka/) the latest Kafka release from the official site and extract the package. - - ```Bash - tar -xzf kafka_2.13-3.7.0.tgz - cd kafka_2.13-3.7.0 - ``` - -2. Start the Kafka environment. - - Generate a Kafka cluster UUID. - - ```Bash - KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)" - ``` - - Format the log directories. - - ```Bash - bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties - ``` - - Start the Kafka server. - - ```Bash - bin/kafka-server-start.sh config/kraft/server.properties - ``` - -#### Step 2: Configure PostgreSQL - -1. Make sure the PostgreSQL user is granted `REPLICATION` privileges. - -2. Adjust PostgreSQL configuration. - - Set `wal_level` to `logical` in **postgresql.conf**. - - ```Properties - wal_level = logical - ``` - - Restart the PostgreSQL server to apply changes. - - ```Bash - pg_ctl restart - ``` - -3. Prepare the dataset. - - Create a table and insert test data. - - ```SQL - CREATE TABLE customers ( - id int primary key , - first_name varchar(65533) NULL, - last_name varchar(65533) NULL , - email varchar(65533) NULL - ); - - INSERT INTO customers VALUES (1,'a','a','a@a.com'); - ``` - -4. Verify the CDC log messages in Kafka. 
- - ```Json - { - "schema": { - "type": "struct", - "fields": [ - { - "type": "struct", - "fields": [ - { - "type": "int32", - "optional": false, - "field": "id" - }, - { - "type": "string", - "optional": true, - "field": "first_name" - }, - { - "type": "string", - "optional": true, - "field": "last_name" - }, - { - "type": "string", - "optional": true, - "field": "email" - } - ], - "optional": true, - "name": "test.public.customers.Value", - "field": "before" - }, - { - "type": "struct", - "fields": [ - { - "type": "int32", - "optional": false, - "field": "id" - }, - { - "type": "string", - "optional": true, - "field": "first_name" - }, - { - "type": "string", - "optional": true, - "field": "last_name" - }, - { - "type": "string", - "optional": true, - "field": "email" - } - ], - "optional": true, - "name": "test.public.customers.Value", - "field": "after" - }, - { - "type": "struct", - "fields": [ - { - "type": "string", - "optional": false, - "field": "version" - }, - { - "type": "string", - "optional": false, - "field": "connector" - }, - { - "type": "string", - "optional": false, - "field": "name" - }, - { - "type": "int64", - "optional": false, - "field": "ts_ms" - }, - { - "type": "string", - "optional": true, - "name": "io.debezium.data.Enum", - "version": 1, - "parameters": { - "allowed": "true,last,false,incremental" - }, - "default": "false", - "field": "snapshot" - }, - { - "type": "string", - "optional": false, - "field": "db" - }, - { - "type": "string", - "optional": true, - "field": "sequence" - }, - { - "type": "string", - "optional": false, - "field": "schema" - }, - { - "type": "string", - "optional": false, - "field": "table" - }, - { - "type": "int64", - "optional": true, - "field": "txId" - }, - { - "type": "int64", - "optional": true, - "field": "lsn" - }, - { - "type": "int64", - "optional": true, - "field": "xmin" - } - ], - "optional": false, - "name": "io.debezium.connector.postgresql.Source", - "field": "source" - }, - { - "type": "string", - "optional": false, - "field": "op" - }, - { - "type": "int64", - "optional": true, - "field": "ts_ms" - }, - { - "type": "struct", - "fields": [ - { - "type": "string", - "optional": false, - "field": "id" - }, - { - "type": "int64", - "optional": false, - "field": "total_order" - }, - { - "type": "int64", - "optional": false, - "field": "data_collection_order" - } - ], - "optional": true, - "name": "event.block", - "version": 1, - "field": "transaction" - } - ], - "optional": false, - "name": "test.public.customers.Envelope", - "version": 1 - }, - "payload": { - "before": null, - "after": { - "id": 1, - "first_name": "a", - "last_name": "a", - "email": "a@a.com" - }, - "source": { - "version": "2.5.3.Final", - "connector": "postgresql", - "name": "test", - "ts_ms": 1714283798721, - "snapshot": "false", - "db": "postgres", - "sequence": "[\"22910216\",\"22910504\"]", - "schema": "public", - "table": "customers", - "txId": 756, - "lsn": 22910504, - "xmin": null - }, - "op": "c", - "ts_ms": 1714283798790, - "transaction": null - } - } - ``` - -#### Step 3: Configure StarRocks - -Create a Primary Key table in StarRocks with the same schema as the source table in PostgreSQL. 
- -```SQL -CREATE TABLE `customers` ( - `id` int(11) COMMENT "", - `first_name` varchar(65533) NULL COMMENT "", - `last_name` varchar(65533) NULL COMMENT "", - `email` varchar(65533) NULL COMMENT "" -) ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY hash(id) buckets 1 -PROPERTIES ( -"bucket_size" = "4294967296", -"in_memory" = "false", -"enable_persistent_index" = "true", -"replicated_storage" = "true", -"fast_schema_evolution" = "true" -); -``` - -#### Step 4: Install connector - -1. Download the connectors and extract the packages in the **plugins** directory. - - ```Bash - mkdir plugins - tar -zxvf debezium-debezium-connector-postgresql-2.5.3.zip -C plugins - mv starrocks-connector-for-kafka-x.y.z-with-dependencies.jar plugins - ``` - - This directory is the value of the configuration item `plugin.path` in **config/connect-standalone.properties**. - - ```Properties - plugin.path=/path/to/kafka_2.13-3.7.0/plugins - ``` - -2. Configure the PostgreSQL source connector in **pg-source.properties**. - - ```Json - { - "name": "inventory-connector", - "config": { - "connector.class": "io.debezium.connector.postgresql.PostgresConnector", - "plugin.name": "pgoutput", - "database.hostname": "localhost", - "database.port": "5432", - "database.user": "postgres", - "database.password": "", - "database.dbname" : "postgres", - "topic.prefix": "test" - } - } - ``` - -3. Configure the StarRocks sink connector in **sr-sink.properties**. - - ```Json - { - "name": "starrocks-kafka-connector", - "config": { - "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector", - "tasks.max": "1", - "topics": "test.public.customers", - "starrocks.http.url": "172.26.195.69:28030", - "starrocks.database.name": "test", - "starrocks.username": "root", - "starrocks.password": "StarRocks@123", - "sink.properties.strip_outer_array": "true", - "connect.timeoutms": "3000", - "starrocks.topic2table.map": "test.public.customers:customers", - "transforms": "addfield,unwrap", - "transforms.addfield.type": "com.starrocks.connector.kafka.transforms.AddOpFieldForDebeziumRecord", - "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState", - "transforms.unwrap.drop.tombstones": "true", - "transforms.unwrap.delete.handling.mode": "rewrite" - } - } - ``` - - > **NOTE** - > - > - If the StarRocks table is not a Primary Key table, you do not need to specify the `addfield` transform. - > - The unwrap transform is provided by Debezium and is used to unwrap Debezium's complex data structure based on the operation type. For more information, see [New Record State Extraction](https://debezium.io/documentation/reference/stable/transformations/event-flattening.html). - -4. Configure Kafka Connect. - - Configure the following configuration items in the Kafka Connect configuration file **config/connect-standalone.properties**. - - ```Properties - # The addresses of Kafka brokers. Multiple addresses of Kafka brokers need to be separated by commas (,). - # Note that this example uses PLAINTEXT as the security protocol to access the Kafka cluster. - # If you use other security protocol to access the Kafka cluster, configure the relevant information in this part. - - bootstrap.servers=:9092 - offset.storage.file.filename=/tmp/connect.offsets - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - - # The absolute path of starrocks-connector-for-kafka-x.y.z-with-dependencies.jar. 
- plugin.path=/home/kafka-connect/starrocks-kafka-connector - - # Parameters that control the flush policy. For more information, see the Usage Note section. - offset.flush.interval.ms=10000 - bufferflush.maxbytes = xxx - bufferflush.intervalms = xxx - ``` - - For descriptions of more parameters, see [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running). - -#### Step 5: Start Kafka Connect in Standalone Mode - -Run Kafka Connect in standalone mode to initiate the connectors. - -```Bash -bin/connect-standalone.sh config/connect-standalone.properties config/pg-source.properties config/sr-sink.properties -``` - -#### Step 6: Verify data ingestion - -Test the following operations and ensure the data is correctly ingested into StarRocks. - -##### INSERT - -- In PostgreSQL: - -```Plain -postgres=# insert into customers values (2,'b','b','b@b.com'); -INSERT 0 1 -postgres=# select * from customers; - id | first_name | last_name | email -----+------------+-----------+--------- - 1 | a | a | a@a.com - 2 | b | b | b@b.com -(2 rows) -``` - -- In StarRocks: - -```Plain -MySQL [test]> select * from customers; -+------+------------+-----------+---------+ -| id | first_name | last_name | email | -+------+------------+-----------+---------+ -| 1 | a | a | a@a.com | -| 2 | b | b | b@b.com | -+------+------------+-----------+---------+ -2 rows in set (0.01 sec) -``` - -##### UPDATE - -- In PostgreSQL: - -```Plain -postgres=# update customers set email='c@c.com'; -UPDATE 2 -postgres=# select * from customers; - id | first_name | last_name | email -----+------------+-----------+--------- - 1 | a | a | c@c.com - 2 | b | b | c@c.com -(2 rows) -``` - -- In StarRocks: - -```Plain -MySQL [test]> select * from customers; -+------+------------+-----------+---------+ -| id | first_name | last_name | email | -+------+------------+-----------+---------+ -| 1 | a | a | c@c.com | -| 2 | b | b | c@c.com | -+------+------------+-----------+---------+ -2 rows in set (0.00 sec) -``` - -##### DELETE - -- In PostgreSQL: - -```Plain -postgres=# delete from customers where id=1; -DELETE 1 -postgres=# select * from customers; - id | first_name | last_name | email -----+------------+-----------+--------- - 2 | b | b | c@c.com -(1 row) -``` - -- In StarRocks: - -```Plain -MySQL [test]> select * from customers; -+------+------------+-----------+---------+ -| id | first_name | last_name | email | -+------+------------+-----------+---------+ -| 2 | b | b | c@c.com | -+------+------------+-----------+---------+ -1 row in set (0.00 sec) -``` diff --git a/docs/en/loading/Load_to_Primary_Key_tables.md b/docs/en/loading/Load_to_Primary_Key_tables.md deleted file mode 100644 index 5c40759..0000000 --- a/docs/en/loading/Load_to_Primary_Key_tables.md +++ /dev/null @@ -1,709 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Change data through loading - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -[Primary Key tables](../table_design/table_types/primary_key_table.md) provided by StarRocks allow you to make data changes to StarRocks tables by running [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), or [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) jobs. These data changes include inserts, updates, and deletions. 
However, Primary Key tables do not support changing data by using [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) or [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -StarRocks also supports partial updates and conditional updates. - - - -This topic uses CSV data as an example to describe how to make data changes to a StarRocks table through loading. The data file formats that are supported vary depending on the loading method of your choice. - -> **NOTE** -> -> For CSV data, you can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter. - -## Implementation - -Primary Key tables provided by StarRocks support UPSERT and DELETE operations and does not distinguish INSERT operations from UPDATE operations. - -When you create a load job, StarRocks supports adding a field named `__op` to the job creation statement or command. The `__op` field is used to specify the type of operation you want to perform. - -> **NOTE** -> -> When you create a table, you do not need to add a column named `__op` to that table. - -The method of defining the `__op` field varies depending on the loading method of your choice: - -- If you choose Stream Load, define the `__op` field by using the `columns` parameter. - -- If you choose Broker Load, define the `__op` field by using the SET clause. - -- If you choose Routine Load, define the `__op` field by using the `COLUMNS` column. - -You can decide whether to add the `__op` field based on the data changes you want to make. If you do not add the `__op` field, the operation type defaults to UPSERT. The major data change scenarios are as follows: - -- If the data file you want to load involves only UPSERT operations, you do not need to add the `__op` field. - -- If the data file you want to load involves only DELETE operations, you must add the `__op` field and specify the operation type as DELETE. - -- If the data file you want to load involves both UPSERT and DELETE operations, you must add the `__op` field and make sure that the data file contains a column whose values are `0` or `1`. A value of `0` indicates an UPSERT operation, and a value of `1` indicates a DELETE operation. - -## Usage notes - -- Make sure that each row in your data file has the same number of columns. - -- The columns that involve data changes must include the primary key column. - -## Basic operations - -This section provides examples of how to make data changes to a StarRocks table through loading. For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), and [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -### UPSERT - -If the data file you want to load involves only UPSERT operations, you do not need to add the `__op` field. - -> **NOTE** -> -> If you add the `__op` field: -> -> - You can specify the operation type as UPSERT. -> -> - You can leave the `__op` field empty, because the operation type defaults to UPSERT. - -#### Data examples - -1. Prepare a data file. - - a. Create a CSV file named `example1.csv` in your local file system. The file consists of three columns, which represent user ID, user name, and user score in sequence. - - ```Plain - 101,Lily,100 - 102,Rose,100 - ``` - - b. Publish the data of `example1.csv` to `topic1` of your Kafka cluster. 
- -2. Prepare a StarRocks table. - - a. Create a Primary Key table named `table1` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - - ```SQL - CREATE TABLE `table1` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - - b. Insert a record into `table1`. - - ```SQL - INSERT INTO table1 VALUES - (101, 'Lily',80); - ``` - -#### Load data - -Run a load job to update the record whose `id` is `101` in `example1.csv` to `table1` and insert the record whose `id` is `102` in `example1.csv` into `table1`. - -- Run a Stream Load job. - - - If you do not want to include the `__op` field, run the following command: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label1" \ - -H "column_separator:," \ - -T example1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load - ``` - - - If you want to include the `__op` field, run the following command: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label2" \ - -H "column_separator:," \ - -H "columns:__op ='upsert'" \ - -T example1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load - ``` - -- Run a Broker Load job. - - - If you do not want to include the `__op` field, run the following command: - - ```SQL - LOAD LABEL test_db.label1 - ( - data infile("hdfs://:/example1.csv") - into table table1 - columns terminated by "," - format as "csv" - ) - WITH BROKER; - ``` - - - If you want to include the `__op` field, run the following command: - - ```SQL - LOAD LABEL test_db.label2 - ( - data infile("hdfs://:/example1.csv") - into table table1 - columns terminated by "," - format as "csv" - set (__op = 'upsert') - ) - WITH BROKER; - ``` - -- Run a Routine Load job. 
- - - If you do not want to include the `__op` field, run the following command: - - ```SQL - CREATE ROUTINE LOAD test_db.table1 ON table1 - COLUMNS TERMINATED BY ",", - COLUMNS (id, name, score) - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test1", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - - - If you want to include the `__op` field, run the following command: - - ```SQL - CREATE ROUTINE LOAD test_db.table1 ON table1 - COLUMNS TERMINATED BY ",", - COLUMNS (id, name, score, __op ='upsert') - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test1", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - -#### Query data - -After the load is complete, query the data of `table1` to verify that the load is successful: - -```SQL -SELECT * FROM table1; -+------+------+-------+ -| id | name | score | -+------+------+-------+ -| 101 | Lily | 100 | -| 102 | Rose | 100 | -+------+------+-------+ -2 rows in set (0.02 sec) -``` - -As shown in the preceding query result, the record whose `id` is `101` in `example1.csv` has been updated to `table1`, and the record whose `id` is `102` in `example1.csv` has been inserted into `table1`. - -### DELETE - -If the data file you want to load involves only DELETE operations, you must add the `__op` field and specify the operation type as DELETE. - -#### Data examples - -1. Prepare a data file. - - a. Create a CSV file named `example2.csv` in your local file system. The file consists of three columns, which represent user ID, user name, and user score in sequence. - - ```Plain - 101,Jack,100 - ``` - - b. Publish the data of `example2.csv` to `topic2` of your Kafka cluster. - -2. Prepare a StarRocks table. - - a. Create a Primary Key table named `table2` in your StarRocks table `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - - ```SQL - CREATE TABLE `table2` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - - b. Insert two records into `table2`. - - ```SQL - INSERT INTO table2 VALUES - (101, 'Jack', 100), - (102, 'Bob', 90); - ``` - -#### Load data - -Run a load job to delete the record whose `id` is `101` in `example2.csv` from `table2`. - -- Run a Stream Load job. - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label3" \ - -H "column_separator:," \ - -H "columns:__op='delete'" \ - -T example2.csv -XPUT \ - http://:/api/test_db/table2/_stream_load - ``` - -- Run a Broker Load job. 
- - ```SQL - LOAD LABEL test_db.label3 - ( - data infile("hdfs://:/example2.csv") - into table table2 - columns terminated by "," - format as "csv" - set (__op = 'delete') - ) - WITH BROKER; - ``` - -- Run a Routine Load job. - - ```SQL - CREATE ROUTINE LOAD test_db.table2 ON table2 - COLUMNS(id, name, score, __op = 'delete') - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test2", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - -#### Query data - -After the load is complete, query the data of `table2` to verify that the load is successful: - -```SQL -SELECT * FROM table2; -+------+------+-------+ -| id | name | score | -+------+------+-------+ -| 102 | Bob | 90 | -+------+------+-------+ -1 row in set (0.00 sec) -``` - -As shown in the preceding query result, the record whose `id` is `101` in `example2.csv` has been deleted from `table2`. - -### UPSERT and DELETE - -If the data file you want to load involves both UPSERT and DELETE operations, you must add the `__op` field and make sure that the data file contains a column whose values are `0` or `1`. A value of `0` indicates an UPSERT operation, and a value of `1` indicates a DELETE operation. - -#### Data examples - -1. Prepare a data file. - - a. Create a CSV file named `example3.csv` in your local file system. The file consists of four columns, which represent user ID, user name, user score, and operation type in sequence. - - ```Plain - 101,Tom,100,1 - 102,Sam,70,0 - 103,Stan,80,0 - ``` - - b. Publish the data of `example3.csv` to `topic3` of your Kafka cluster. - -2. Prepare a StarRocks table. - - a. Create a Primary Key table named `table3` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - - ```SQL - CREATE TABLE `table3` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - - b. Insert two records into `table3`. - - ```SQL - INSERT INTO table3 VALUES - (101, 'Tom', 100), - (102, 'Sam', 90); - ``` - -#### Load data - -Run a load job to delete the record whose `id` is `101` in `example3.csv` from `table3`, update the record whose `id` is `102` in `example3.csv` to `table3`, and insert the record whose `id` is `103` in `example3.csv` into `table3`. - -- Run a Stream Load job: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label4" \ - -H "column_separator:," \ - -H "columns: id, name, score, temp, __op = temp" \ - -T example3.csv -XPUT \ - http://:/api/test_db/table3/_stream_load - ``` - - > **NOTE** - > - > In the preceding example, the fourth column that represents the operation type in `example3.csv` is temporarily named as `temp` and the `__op` field is mapped onto the `temp` column by using the `columns` parameter. 
As such, StarRocks can decide whether to perform an UPSERT or DELETE operation depending on the value in the fourth column of `example3.csv` is `0` or `1`. - -- Run a Broker Load job: - - ```Bash - LOAD LABEL test_db.label4 - ( - data infile("hdfs://:/example1.csv") - into table table1 - columns terminated by "," - format as "csv" - (id, name, score, temp) - set (__op=temp) - ) - WITH BROKER; - ``` - -- Run a Routine Load job: - - ```SQL - CREATE ROUTINE LOAD test_db.table3 ON table3 - COLUMNS(id, name, score, temp, __op = temp) - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" = ":", - "kafka_topic" = "test3", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ); - ``` - -#### Query data - -After the load is complete, query the data of `table3` to verify that the load is successful: - -```SQL -SELECT * FROM table3; -+------+------+-------+ -| id | name | score | -+------+------+-------+ -| 102 | Sam | 70 | -| 103 | Stan | 80 | -+------+------+-------+ -2 rows in set (0.01 sec) -``` - -As shown in the preceding query result, the record whose `id` is `101` in `example3.csv` has been deleted from `table3`, the record whose `id` is `102` in `example3.csv` has been updated to `table3`, and the record whose `id` is `103` in `example3.csv` has been inserted into `table3`. - -## Partial updates - -Primary Key tables also support partial updates, and provide two modes of partial updates, row mode and column mode, for different data update scenarios. These two modes of partial updates can minimize the overhead of partial updates as much as possible while guaranteeing query performance, ensuring real-time updates. Row mode is more suitable for real-time update scenarios involving many columns and small batches. Column mode is suitable for batch processing update scenarios involving a few columns and a large number of rows. - -> **NOTICE** -> -> When you perform a partial update, if the row to be updated does not exist, StarRocks inserts a new row, and fills default values in fields that are empty because no data updates are inserted into them. - -This section uses CSV as an example to describe how to perform partial updates. - -### Data examples - -1. Prepare a data file. - - a. Create a CSV file named `example4.csv` in your local file system. The file consists of two columns, which represent user ID and user name in sequence. - - ```Plain - 101,Lily - 102,Rose - 103,Alice - ``` - - b. Publish the data of `example4.csv` to `topic4` of your Kafka cluster. - -2. Prepare a StarRocks table. - - a. Create a Primary Key table named `table4` in your StarRocks database `test_db`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - - ```SQL - CREATE TABLE `table4` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - - b. Insert a record into `table4`. 
- - ```SQL - INSERT INTO table4 VALUES - (101, 'Tom',80); - ``` - -### Load data - -Run a load to update the data in the two columns of `example4.csv` to the `id` and `name` columns of `table4`. - -- Run a Stream Load job: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label7" -H "column_separator:," \ - -H "partial_update:true" \ - -H "columns:id,name" \ - -T example4.csv -XPUT \ - http://:/api/test_db/table4/_stream_load - ``` - - > **NOTE** - > - > If you choose Stream Load, you must set the `partial_update` parameter to `true` to enable the partial update feature. The default is partial updates in row mode. If you need to perform partial updates in column mode, you need to set `partial_update_mode` to `column`. Additionally, you must use the `columns` parameter to specify the columns you want to update. - -- Run a Broker Load job: - - ```SQL - LOAD LABEL test_db.table4 - ( - data infile("hdfs://:/example4.csv") - into table table4 - format as "csv" - (id, name) - ) - WITH BROKER - PROPERTIES - ( - "partial_update" = "true" - ); - ``` - - > **NOTE** - > - > If you choose Broker Load, you must set the `partial_update` parameter to `true` to enable the partial update feature. The default is partial updates in row mode. If you need to perform partial updates in column mode, you need to set `partial_update_mode` to `column`. Additionally, you must use the `column_list` parameter to specify the columns you want to update. - -- Run a Routine Load job: - - ```SQL - CREATE ROUTINE LOAD test_db.table4 on table4 - COLUMNS (id, name), - COLUMNS TERMINATED BY ',' - PROPERTIES - ( - "partial_update" = "true" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test4", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - - > **NOTE** - > - > - If you choose Routine Load, you must set the `partial_update` parameter to `true` to enable the partial update feature. Additionally, you must use the `COLUMNS` parameter to specify the columns you want to update. - > - Routine Load only supports partial updates in row modes and does not support partial updates in column mode. - -### Query data - -After the load is complete, query the data of `table4` to verify that the load is successful: - -```SQL -SELECT * FROM table4; -+------+-------+-------+ -| id | name | score | -+------+-------+-------+ -| 102 | Rose | 0 | -| 101 | Lily | 80 | -| 103 | Alice | 0 | -+------+-------+-------+ -3 rows in set (0.01 sec) -``` - -As shown in the preceding query result, the record whose `id` is `101` in `example4.csv` has been updated to `table4`, and the records whose `id` are `102` and `103` in `example4.csv` have been Inserted into `table4`. - -## Conditional updates - -From StarRocks v2.5 onwards, Primary Key tables support conditional updates. You can specify a non-primary key column as the condition to determine whether updates can take effect. As such, the update from a source record to a destination record takes effect only when the source data record has a greater or equal value than the destination data record in the specified column. - -The conditional update feature is designed to resolve data disorder. If the source data is disordered, you can use this feature to ensure that new data will not be overwritten by old data. - -> **NOTICE** -> -> - You cannot specify different columns as update conditions for the same batch of data. -> - DELETE operations do not support conditional updates. 
-> - In versions earlier than v3.1.3, partial updates and conditional updates cannot be used simultaneously. From v3.1.3 onwards, StarRocks supports using partial updates with conditional updates.
-
-### Data examples
-
-1. Prepare a data file.
-
-   a. Create a CSV file named `example5.csv` in your local file system. The file consists of three columns, which represent user ID, version, and user score in sequence.
-
-      ```Plain
-      101,1,100
-      102,3,100
-      ```
-
-   b. Publish the data of `example5.csv` to `topic5` of your Kafka cluster.
-
-2. Prepare a StarRocks table.
-
-   a. Create a Primary Key table named `table5` in your StarRocks database `test_db`. The table consists of three columns: `id`, `version`, and `score`, of which `id` is the primary key.
-
-      ```SQL
-      CREATE TABLE `table5`
-      (
-          `id` int(11) NOT NULL COMMENT "user ID",
-          `version` int NOT NULL COMMENT "version",
-          `score` int(11) NOT NULL COMMENT "user score"
-      )
-      ENGINE=OLAP
-      PRIMARY KEY(`id`) DISTRIBUTED BY HASH(`id`);
-      ```
-
-      > **NOTE**
-      >
-      > Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets).
-
-   b. Insert a record into `table5`.
-
-      ```SQL
-      INSERT INTO table5 VALUES
-          (101, 2, 80),
-          (102, 2, 90);
-      ```
-
-### Load data
-
-Run a load job to update the records whose `id` values are `101` and `102` from `example5.csv` into `table5`, and specify that the updates take effect only when the `version` value in each of the two records is greater than or equal to the current `version` value of the corresponding record.
-
-- Run a Stream Load job:
-
-    ```Bash
-    curl --location-trusted -u : \
-        -H "Expect:100-continue" \
-        -H "label:label10" \
-        -H "column_separator:," \
-        -H "merge_condition:version" \
-        -T example5.csv -XPUT \
-        http://:/api/test_db/table5/_stream_load
-    ```
-
-- Run an INSERT INTO job:
-
-    ```SQL
-    INSERT INTO test_db.table5 properties("merge_condition" = "version")
-    VALUES (101, 2, 70), (102, 3, 100);
-    ```
-
-- Run a Routine Load job:
-
-    ```SQL
-    CREATE ROUTINE LOAD test_db.table5 on table5
-    COLUMNS (id, version, score),
-    COLUMNS TERMINATED BY ','
-    PROPERTIES
-    (
-        "merge_condition" = "version"
-    )
-    FROM KAFKA
-    (
-        "kafka_broker_list" =":",
-        "kafka_topic" = "topic5",
-        "property.kafka_default_offsets" ="OFFSET_BEGINNING"
-    );
-    ```
-
-- Run a Broker Load job:
-
-    ```SQL
-    LOAD LABEL test_db.table5
-    ( DATA INFILE ("s3://xxx.csv")
-      INTO TABLE table5 COLUMNS TERMINATED BY "," FORMAT AS "CSV"
-    )
-    WITH BROKER
-    PROPERTIES
-    (
-        "merge_condition" = "version"
-    );
-    ```
-
-### Query data
-
-After the load is complete, query the data of `table5` to verify that the load is successful:
-
-```SQL
-SELECT * FROM table5;
-+------+---------+-------+
-| id   | version | score |
-+------+---------+-------+
-|  101 |       2 |    80 |
-|  102 |       3 |   100 |
-+------+---------+-------+
-2 rows in set (0.02 sec)
-```
-
-As shown in the preceding query result, the record whose `id` is `101` in `example5.csv` is not updated to `table5` because its `version` value is smaller than the current one, and the record whose `id` is `102` in `example5.csv` has been updated to `table5`.
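-
-To see the condition at work, one more hedged check can help: issue an INSERT that carries an older `version` value for the record whose `id` is `102`. Because `1` is smaller than the current `version` value `3`, the update should be ignored. The values below are only illustrative and simply reuse the `merge_condition` property shown above.
-
-```SQL
--- Attempt an update that carries an outdated version (1 < 3); it should not take effect.
-INSERT INTO test_db.table5 properties("merge_condition" = "version")
-VALUES (102, 1, 55);
-
-SELECT * FROM table5 WHERE id = 102;
-```
-
-If the condition behaves as described, the query should still return `version` `3` and `score` `100` for the record whose `id` is `102`.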
diff --git a/docs/en/loading/Loading_data_template.md b/docs/en/loading/Loading_data_template.md deleted file mode 100644 index 1dd5871..0000000 --- a/docs/en/loading/Loading_data_template.md +++ /dev/null @@ -1,400 +0,0 @@ ---- -displayed_sidebar: docs -unlisted: true ---- - -# Load data from \ TEMPLATE - -## Template instructions - -### A note about style - -Technical documentation typically has links to other documents all over the place. When you look at this document you may notice that there are few links out from the page, and that almost all of the links are at the bottom of the doc in the **more information** section. Not every keyword needs to be linked out to another page, please assume that the reader knows what `CREATE TABLE` means and that is they do not know they can click in the search bar and find out. It is fine to pop a note in the docs to tell the reader that there are other options and the details are described in the **more information** section; this allows the people who need the information know that they can read it ***later***, after they accomplish the task at hand. - -### The template - -This template is based on the process to load data from Amazon S3, some -parts of it will not be applicable to loading from other sources. Please concentrate on the flow of this template and do not worry about including every section; the flow is meant to be: - -#### Introduction - -Introductory text that lets the reader know what the end result will be if they follow this guide. In the case of the S3 doc, the end result is "Getting data loaded from S3 in either an asynchronous manner, or a synchronous manner." - -#### Why? - -- A description of the business problem solved with the technique -- Advantages and the disadvantages (if any) of the method(s) described - -#### Data flow or other diagram - -Diagrams or images can be helpful. If you are describing a technique that is complex and an image helps, then use one. If you are describing a technique that produces something visual (for example, the use of Superset to analyze data), then definitely include an image of the end product. - -Use a data flow diagram if the flow is non-obvious. When a command causes StarRocks to run several processes and combine the output of those processes and then manipulate the data it is probably time for a description of the data flow. In this template there are two methods for loading data described. One of them is simple, and has no data flow section; the other is more complicated (StarRocks is handling the complex work, not the user!), and the complex option includes a data flow section. - -#### Examples with verification section - -Note that examples should come before syntax details and other deep technical details. Many readers will be coming to the docs to find a particular technique that they can copy, paste, and modify. - -If possible give an example that will work and includes a dataset to use. The example in this template uses a dataset stored in S3 that anyone who has an AWS account and can authenticate with a key and secret can use. By providing a dataset the examples are more valuable to the reader because they can fully experience the described technique. - -Make sure that the example works as written. This implies two things: - -1. you have run the commands in the order presented -2. you have included the necessary prerequisites. For example, if your example refers to database `foo`, then probably you need to preface it with `CREATE DATABASE foo;`, `USE foo;`. 
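-
-For instance, a short preamble along the following lines, reusing the `foo` placeholder from the item above, keeps an example self-contained and runnable from top to bottom:
-
-```SQL
--- Prerequisites for any example that references database `foo`
-CREATE DATABASE IF NOT EXISTS foo;
-USE foo;
-```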
- -Verification is so important. If the process that you are describing includes several steps, then include a verification step whenever something should have been accomplished; this helps avoid having the reader get to the end and realizing that they had a typo in step 10. In this example **Check progress** and `DESCRIBE user_behavior_inferred;` steps are for verification. - -#### More information - -At the end of the template there is a spot to put links to related information including the ones to optional information that you mentioned in the main body. - -### Notes embedded in the template - -The template notes are intentionally formatted differently than the way we format documentation notes to bring them to your attention when you are working through the template. Please remove the bold italic notes as you go along: - -```markdown -***Note: descriptive text*** -``` - -## Finally, the start of the template - -***Note: If there are multiple recommended choices, tell the -reader this in the intro. For example, when loading from S3, -there is an option for synchronous loading, and asynchronous loading:*** - -StarRocks provides two options for loading data from S3: - -1. Asynchronous loading using Broker Load -2. Synchronous loading using the `FILES()` table function - -***Note: Tell the reader WHY they would choose one choice over the other:*** - -Small datasets are often loaded synchronously using the `FILES()` table function, and large datasets are often loaded asynchronously using Broker Load. The two methods have different advantages and are described below. - -> **NOTE** -> -> You can load data into StarRocks tables only as a user who has the INSERT privilege on those StarRocks tables. If you do not have the INSERT privilege, follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant the INSERT privilege to the user that you use to connect to your StarRocks cluster. - -## Using Broker Load - -An asynchronous Broker Load process handles making the connection to S3, pulling the data, and storing the data in StarRocks. - -### Advantages of Broker Load - -- Broker Load supports data transformation, UPSERT, and DELETE operations during loading. -- Broker Load runs in the background and clients don't need to stay connected for the job to continue. -- Broker Load is preferred for long running jobs, the default timeout is 4 hours. -- In addition to Parquet and ORC file format, Broker Load supports CSV files. - -### Data flow - -***Note: Processes that involve multiple components or steps may be easier to understand with a diagram. This example includes a diagram that helps describe the steps that happen when a user chooses the Broker Load option.*** - -![Workflow of Broker Load](../_assets/broker_load_how-to-work_en.png) - -1. The user creates a load job. -2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BE). -3. The backend (BE) nodes pull the data from the source and load the data into StarRocks. - -### Typical example - -Create a table, start a load process that pulls a Parquet file from S3, and verify the progress and success of the data loading. - -> **NOTE** -> -> The examples use a sample dataset in Parquet format, if you want to load a CSV or ORC file, that information is linked at the bottom of this page. - -#### Create a table - -Create a database for your table: - -```SQL -CREATE DATABASE IF NOT EXISTS project; -USE project; -``` - -Create a table. 
This schema matches a sample dataset in an S3 bucket hosted in a StarRocks account. - -```SQL -DROP TABLE IF EXISTS user_behavior; - -CREATE TABLE `user_behavior` ( - `UserID` int(11), - `ItemID` int(11), - `CategoryID` int(11), - `BehaviorType` varchar(65533), - `Timestamp` datetime -) ENGINE=OLAP -DUPLICATE KEY(`UserID`) -DISTRIBUTED BY HASH(`UserID`) -PROPERTIES ( - "replication_num" = "1" -); -``` - -#### Gather connection details - -> **NOTE** -> -> The examples use IAM user-based authentication. Other authentication methods are available and linked at the bottom of this page. - -Loading data from S3 requires having the: - -- S3 bucket -- S3 object keys (object names) if accessing a specific object in the bucket. Note that the object key can include a prefix if your S3 objects are stored in sub-folders. The full syntax is linked in **more information**. -- S3 region -- Access key and secret - -#### Start a Broker Load - -This job has four main sections: - -- `LABEL`: A string used when querying the state of a `LOAD` job. -- `LOAD` declaration: The source URI, destination table, and the source data format. -- `BROKER`: The connection details for the source. -- `PROPERTIES`: Timeout value and any other properties to apply to this job. - -> **NOTE** -> -> The dataset used in these examples is hosted in an S3 bucket in a StarRocks account. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. Substitute your credentials for `AAA` and `BBB` in the commands below. - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("s3://starrocks-examples/user_behavior_sample_data.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER - ( - "aws.s3.enable_ssl" = "true", - "aws.s3.use_instance_profile" = "false", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -#### Check progress - -Query the `information_schema.loads` table to track progress. If you have multiple `LOAD` jobs running you can filter on the `LABEL` associated with the job. In the output below there are two entries for the load job `user_behavior`. The first record shows a state of `CANCELLED`; scroll to the end of the output, and you see that `listPath failed`. The second record shows success with a valid AWS IAM access key and secret. 
- -```SQL -SELECT * FROM information_schema.loads; -``` - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -```plaintext -JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH| -------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+ - 10121|user_behavior |project |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | | - 10106|user_behavior |project |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | | -``` - -You can also check a subset of the data at this point. - -```SQL -SELECT * from user_behavior LIMIT 10; -``` - -```plaintext -UserID|ItemID|CategoryID|BehaviorType|Timestamp | -------+------+----------+------------+-------------------+ -171146| 68873| 3002561|pv |2017-11-30 07:11:14| -171146|146539| 4672807|pv |2017-11-27 09:51:41| -171146|146539| 4672807|pv |2017-11-27 14:08:33| -171146|214198| 1320293|pv |2017-11-25 22:38:27| -171146|260659| 4756105|pv |2017-11-30 05:11:25| -171146|267617| 4565874|pv |2017-11-27 14:01:25| -171146|329115| 2858794|pv |2017-12-01 02:10:51| -171146|458604| 1349561|pv |2017-11-25 22:49:39| -171146|458604| 1349561|pv |2017-11-27 14:03:44| -171146|478802| 541347|pv |2017-12-02 04:52:39| -``` - -## Using the `FILES()` table function - -### `FILES()` advantages - -`FILES()` can infer the data types of the columns of the Parquet data and generate the schema for a StarRocks table. This provides the ability to query the file directly from S3 with a `SELECT` or to have StarRocks automatically create a table for you based on the Parquet file schema. - -> **NOTE** -> -> Schema inference is a new feature in version 3.1 and is provided for Parquet format only and nested types are not yet supported. 
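-
-Whichever of the two loading routes on this page you follow, a row count is another quick sanity check: the number returned should line up with the `SINK_ROWS` value reported in `information_schema.loads` (86,953,525 for the `FINISHED` Broker Load job shown earlier). A hedged sketch against the `user_behavior` table loaded above:
-
-```SQL
--- Should match SINK_ROWS for the finished load job (86953525 in the example output)
-SELECT COUNT(*) FROM user_behavior;
-```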
- -### Typical examples - -There are three examples using the `FILES()` table function: - -- Querying the data directly from S3 -- Creating and loading the table using schema inference -- Creating a table by hand and then loading the data - -> **NOTE** -> -> The dataset used in these examples is hosted in an S3 bucket in a StarRocks account. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. Substitute your credentials for `AAA` and `BBB` in the commands below. - -#### Querying directly from S3 - -Querying directly from S3 using `FILES()` can gives a good preview of the content of a dataset before you create a table. For example: - -- Get a preview of the dataset without storing the data. -- Query for the min and max values and decide what data types to use. -- Check for nulls. - -```sql -SELECT * FROM FILES( - "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -) LIMIT 10; -``` - -> **NOTE** -> -> Notice that the column names are provided by the Parquet file. - -```plaintext -UserID|ItemID |CategoryID|BehaviorType|Timestamp | -------+-------+----------+------------+-------------------+ - 1|2576651| 149192|pv |2017-11-25 01:21:25| - 1|3830808| 4181361|pv |2017-11-25 07:04:53| - 1|4365585| 2520377|pv |2017-11-25 07:49:06| - 1|4606018| 2735466|pv |2017-11-25 13:28:01| - 1| 230380| 411153|pv |2017-11-25 21:22:22| - 1|3827899| 2920476|pv |2017-11-26 16:24:33| - 1|3745169| 2891509|pv |2017-11-26 19:44:31| - 1|1531036| 2920476|pv |2017-11-26 22:02:12| - 1|2266567| 4145813|pv |2017-11-27 00:11:11| - 1|2951368| 1080785|pv |2017-11-27 02:47:08| -``` - -#### Creating a table with schema inference - -This is a continuation of the previous example; the previous query is wrapped in `CREATE TABLE` to automate the table creation using schema inference. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files as the Parquet format includes the column names and types and StarRocks will infer the schema. - -> **NOTE** -> -> The syntax of `CREATE TABLE` when using schema inference does not allow setting the number of replicas, so set it before creating the table. 
The example below is for a system with a single replica: -> -> `ADMIN SET FRONTEND CONFIG ('default_replication_num' ="1");` - -```sql -CREATE DATABASE IF NOT EXISTS project; -USE project; - -CREATE TABLE `user_behavior_inferred` AS -SELECT * FROM FILES( - "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -```SQL -DESCRIBE user_behavior_inferred; -``` - -```plaintext -Field |Type |Null|Key |Default|Extra| -------------+----------------+----+-----+-------+-----+ -UserID |bigint |YES |true | | | -ItemID |bigint |YES |true | | | -CategoryID |bigint |YES |true | | | -BehaviorType|varchar(1048576)|YES |false| | | -Timestamp |varchar(1048576)|YES |false| | | -``` - -> **NOTE** -> -> Compare the inferred schema with the schema created by hand: -> -> - data types -> - nullable -> - key fields - -```SQL -SELECT * from user_behavior_inferred LIMIT 10; -``` - -```plaintext -UserID|ItemID|CategoryID|BehaviorType|Timestamp | -------+------+----------+------------+-------------------+ -171146| 68873| 3002561|pv |2017-11-30 07:11:14| -171146|146539| 4672807|pv |2017-11-27 09:51:41| -171146|146539| 4672807|pv |2017-11-27 14:08:33| -171146|214198| 1320293|pv |2017-11-25 22:38:27| -171146|260659| 4756105|pv |2017-11-30 05:11:25| -171146|267617| 4565874|pv |2017-11-27 14:01:25| -171146|329115| 2858794|pv |2017-12-01 02:10:51| -171146|458604| 1349561|pv |2017-11-25 22:49:39| -171146|458604| 1349561|pv |2017-11-27 14:03:44| -171146|478802| 541347|pv |2017-12-02 04:52:39| -``` - -#### Loading into an existing table - -You may want to customize the table that you are inserting into, for example the: - -- column data type, nullable setting, or default values -- key types and columns -- distribution -- etc. - -> **NOTE** -> -> Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This document does not cover table design, there is a link in **more information** at the end of the page. - -In this example we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in S3. - -- Since a query of the file in S3 indicates that the `Timestamp` column contains data that matches a `datetime` data type the column type is specified in the following DDL. -- By querying the data in S3 you can find that there are no null values in the dataset, so the DDL does not set any columns as nullable. 
-- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID` (your use case might be different for this data, you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key: - -```SQL -CREATE TABLE `user_behavior_declared` ( - `UserID` int(11), - `ItemID` int(11), - `CategoryID` int(11), - `BehaviorType` varchar(65533), - `Timestamp` datetime -) ENGINE=OLAP -DUPLICATE KEY(`UserID`) -DISTRIBUTED BY HASH(`UserID`) -PROPERTIES ( - "replication_num" = "1" -); -``` - -After creating the table, you can load it with `INSERT INTO` … `SELECT FROM FILES()`: - -```SQL -INSERT INTO user_behavior_declared - SELECT * FROM FILES( - "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -## More information - -- For more details on synchronous and asynchronous data loading, see [Loading concepts](./loading_introduction/loading_concepts.md). -- Learn about how Broker Load supports data transformation during loading at [Transform data at loading](../loading/Etl_in_loading.md) and [Change data through loading](../loading/Load_to_Primary_Key_tables.md). -- This document only covered IAM user-based authentication. For other options please see [authenticate to AWS resources](../integrations/authenticate_to_aws_resources.md). -- The [AWS CLI Command Reference](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html) covers the S3 URI in detail. -- Learn more about [table design](../table_design/StarRocks_table_design.md). -- Broker Load provides many more configuration and use options than those in the above examples, the details are in [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) diff --git a/docs/en/loading/Loading_intro.md b/docs/en/loading/Loading_intro.md deleted file mode 100644 index ad0f9f0..0000000 --- a/docs/en/loading/Loading_intro.md +++ /dev/null @@ -1,209 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 3 -keywords: - - load - - Insert - - Stream Load - - Broker Load - - Pipe - - Routine Load - - Spark Load ---- - - - -# Loading options - -Data loading is the process of cleansing and transforming raw data from various data sources based on your business requirements and loading the resulting data into StarRocks to facilitate analysis. - -StarRocks provides a variety of options for data loading: - -- Loading methods: Insert, Stream Load, Broker Load, Pipe, Routine Load, and Spark Load -- Ecosystem tools: StarRocks Connector for Apache Kafka® (Kafka connector for short), StarRocks Connector for Apache Spark™ (Spark connector for short), StarRocks Connector for Apache Flink® (Flink connector for short), and other tools such as SMT, DataX, CloudCanal, and Kettle Connector -- API: Stream Load transaction interface - -These options each have its own advantages and support its own set of data source systems to pull from. - -This topic provides an overview of these options, along with comparisons between them to help you determine the loading option of your choice based on your data source, business scenario, data volume, data file format, and loading frequency. - -## Introduction to loading options - -This section mainly describes the characteristics and business scenarios of the loading options available in StarRocks. 
- -![Loading options overview](../_assets/loading_intro_overview.png) - -:::note - -In the following sections, "batch" or "batch loading" refers to the loading of a large amount of data from a specified source all at a time into StarRocks, whereas "stream" or "streaming" refers to the continuous loading of data in real time. - -::: - -## Loading methods - -### [Insert](InsertInto.md) - -**Business scenario:** - -- INSERT INTO VALUES: Append to an internal table with small amounts of data. -- INSERT INTO SELECT: - - INSERT INTO SELECT FROM ``: Append to a table with the result of a query on an internal or external table. - - INSERT INTO SELECT FROM FILES(): Append to a table with the result of a query on data files in remote storage. - - :::note - - For AWS S3, this feature is supported from v3.1 onwards. For HDFS, Microsoft Azure Storage, Google GCS, and S3-compatible storage (such as MinIO), this feature is supported from v3.2 onwards. - - ::: - -**File format:** - -- INSERT INTO VALUES: SQL -- INSERT INTO SELECT: - - INSERT INTO SELECT FROM ``: StarRocks tables - - INSERT INTO SELECT FROM FILES(): Parquet and ORC - -**Data volume:** Not fixed (The data volume varies based on the memory size.) - -### [Stream Load](StreamLoad.md) - -**Business scenario:** Batch load data from a local file system. - -**File format:** CSV and JSON - -**Data volume:** 10 GB or less - -### [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) - -**Business scenario:** - -- Batch load data from HDFS or cloud storage like AWS S3, Microsoft Azure Storage, Google GCS, and S3-compatible storage (such as MinIO). -- Batch load data from a local file system or NAS. - -**File format:** CSV, Parquet, ORC, and JSON (supported since v3.2.3) - -**Data volume:** Dozens of GB to hundreds of GB - -### [Pipe](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md) - -**Business scenario:** Batch load or stream data from HDFS or AWS S3. - -:::note - -This loading method is supported from v3.2 onwards. - -::: - -**File format:** Parquet and ORC - -**Data volume:** 100 GB to 1 TB or more - -### [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) - -**Business scenario:** Stream data from Kafka. - -**File format:** CSV, JSON, and Avro (supported since v3.0.1) - -**Data volume:** MBs to GBs of data as mini-batches - -### [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) - -**Business scenario:** Batch load data of Apache Hive™ tables stored in HDFS by using Spark clusters. - -**File format:** CSV, Parquet (supported since v2.0), and ORC (supported since v2.0) - -**Data volume:** Dozens of GB to TBs - -## Ecosystem tools - -### [Kafka connector](Kafka-connector-starrocks.md) - -**Business scenario:** Stream data from Kafka. - -### [Spark connector](Spark-connector-starrocks.md) - -**Business scenario:** Batch load data from Spark. - -### [Flink connector](Flink-connector-starrocks.md) - -**Business scenario:** Stream data from Flink. - -### [SMT](../integrations/loading_tools/SMT.md) - -**Business scenario:** Load data from data sources such as MySQL, PostgreSQL, SQL Server, Oracle, Hive, ClickHouse, and TiDB through Flink. - -### [DataX](../integrations/loading_tools/DataX-starrocks-writer.md) - -**Business scenario:** Synchronize data between various heterogeneous data sources, including relational databases (for example, MySQL and Oracle), HDFS, and Hive. 
- -### [CloudCanal](../integrations/loading_tools/CloudCanal.md) - -**Business scenario:** Migrate or synchronize data from source databases (for example, MySQL, Oracle, and PostgreSQL) to StarRocks. - -### [Kettle Connector](https://github.com/StarRocks/starrocks-connector-for-kettle) - -**Business scenario:** Integrate with Kettle. By combining Kettle's robust data processing and transformation capabilities with StarRocks's high-performance data storage and analytical abilities, more flexible and efficient data processing workflows can be achieved. - -## API - -### [Stream Load transaction interface](Stream_Load_transaction_interface.md) - -**Business scenario:** Implement two-phase commit (2PC) for transactions that are run to load data from external systems such as Flink and Kafka, while improving the performance of highly concurrent stream loads. This feature is supported from v2.4 onwards. - -**File format:** CSV and JSON - -**Data volume:** 10 GB or less - -## Choice of loading options - -This section lists the loading options available for common data sources, helping you choose the option that best suits your situation. - -### Object storage - -| **Data source** | **Available loading options** | -| ------------------------------------- | ------------------------------------------------------------ | -| AWS S3 |
  • (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.1)
  • (Batch) Broker Load
  • (Batch or streaming) Pipe (supported since v3.2)
See [Load data from AWS S3](s3.md). | -| Microsoft Azure Storage |
  • (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)
  • (Batch) Broker Load
See [Load data from Microsoft Azure Storage](azure.md). | -| Google GCS |
  • (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)
  • (Batch) Broker Load
See [Load data from GCS](gcs.md). | -| S3-compatible storage (such as MinIO) |
  • (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)
  • (Batch) Broker Load
See [Load data from MinIO](minio.md). | - -### Local file system (including NAS) - -| **Data source** | **Available loading options** | -| --------------------------------- | ------------------------------------------------------------ | -| Local file system (including NAS) |
  • (Batch) Stream Load
  • (Batch) Broker Load
See [Load data from a local file system](StreamLoad.md). | - -### HDFS - -| **Data source** | **Available loading options** | -| --------------- | ------------------------------------------------------------ | -| HDFS |
  • (Batch) INSERT INTO SELECT FROM FILES() (supported since v3.2)
  • (Batch) Broker Load
  • (Batch or streaming) Pipe (supported since v3.2)
See [Load data from HDFS](hdfs_load.md). | - -### Flink, Kafka, and Spark - -| **Data source** | **Available loading options** | -| --------------- | ------------------------------------------------------------ | -| Apache Flink® |
  • [Flink connector](Flink-connector-starrocks.md)
  • [Stream Load transaction interface](Stream_Load_transaction_interface.md)
| -| Apache Kafka® |
  • (Streaming) [Kafka connector](Kafka-connector-starrocks.md)
  • (Streaming) [Routine Load](RoutineLoad.md)
  • [Stream Load transaction interface](Stream_Load_transaction_interface.md)
**NOTE**
If the source data requires multi-table joins and extract, transform and load (ETL) operations, you can use Flink to read and pre-process the data and then use [Flink connector](Flink-connector-starrocks.md) to load the data into StarRocks. | -| Apache Spark™ |
  • [Spark connector](Spark-connector-starrocks.md)
  • [Spark Load](SparkLoad.md)
| - -### Data lakes - -| **Data source** | **Available loading options** | -| --------------- | ------------------------------------------------------------ | -| Apache Hive™ |
  • (Batch) Create a [Hive catalog](../data_source/catalog/hive_catalog.md) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table).
  • (Batch) [Spark Load](https://docs.starrocks.io/docs/loading/SparkLoad/).
| -| Apache Iceberg | (Batch) Create an [Iceberg catalog](../data_source/catalog/iceberg/iceberg_catalog.md) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). | -| Apache Hudi | (Batch) Create a [Hudi catalog](../data_source/catalog/hudi_catalog.md) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). | -| Delta Lake | (Batch) Create a [Delta Lake catalog](../data_source/catalog/deltalake_catalog.md) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). | -| Elasticsearch | (Batch) Create an [Elasticsearch catalog](../data_source/catalog/elasticsearch_catalog.md) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). | -| Apache Paimon | (Batch) Create a [Paimon catalog](../data_source/catalog/paimon_catalog.md) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table). | - -Note that StarRocks provides [unified catalogs](https://docs.starrocks.io/docs/data_source/catalog/unified_catalog/) from v3.2 onwards to help you handle tables from Hive, Iceberg, Hudi, and Delta Lake data sources as a unified data source without ingestion. - -### Internal and external databases - -| **Data source** | **Available loading options** | -| ------------------------------------------------------------ | ------------------------------------------------------------ | -| StarRocks | (Batch) Create a [StarRocks external table](../data_source/External_table.md#starrocks-external-table) and then use [INSERT INTO VALUES](InsertInto.md#insert-data-via-insert-into-values) to insert a few data records or [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table) to insert the data of a table.
**NOTE**
StarRocks external tables only support data writes. They do not support data reads. | -| MySQL |
  • (Batch) Create a [JDBC catalog](../data_source/catalog/jdbc_catalog.md) (recommended) or a [MySQL external table](../data_source/External_table.md#deprecated-mysql-external-table) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table).
  • (Streaming) Use [SMT, Flink CDC connector, Flink, and Flink connector](Flink_cdc_load.md).
| -| Other databases such as Oracle, PostgreSQL, SQL Server, ClickHouse, and TiDB |
  • (Batch) Create a [JDBC catalog](../data_source/catalog/jdbc_catalog.md) (recommended) or a [JDBC external table](../data_source/External_table.md#external-table-for-a-jdbc-compatible-database) and then use [INSERT INTO SELECT FROM ``](InsertInto.md#insert-data-from-an-internal-or-external-table-into-an-internal-table).
  • (Streaming) Use [SMT, Flink CDC connector, Flink, and Flink connector](loading_tools.md).
| diff --git a/docs/en/loading/RoutineLoad.md b/docs/en/loading/RoutineLoad.md deleted file mode 100644 index 3f1170d..0000000 --- a/docs/en/loading/RoutineLoad.md +++ /dev/null @@ -1,553 +0,0 @@ ---- -displayed_sidebar: docs -keywords: ['Routine Load'] ---- - -# Load data using Routine Load - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' -import QSTip from '../_assets/commonMarkdown/quickstart-routine-load-tip.mdx' - - - -This topic introduces how to create a Routine Load job to stream Kafka messages (events) into StarRocks, and familiarizes you with some basic concepts about Routine Load. - -To continuously load messages of a stream into StarRocks, you can store the message stream in a Kafka topic, and create a Routine Load job to consume the messages. The Routine Load job persists in StarRocks, generates a series of load tasks to consume the messages in all or part of the partitions in the topic, and loads the messages into StarRocks. - -A Routine Load job supports exactly-once delivery semantics to guarantee the data loaded into StarRocks is neither lost nor duplicated. - -Routine Load supports data transformation at data loading and supports data changes made by UPSERT and DELETE operations during data loading. For more information, see [Transform data at loading](../loading/Etl_in_loading.md) and [Change data through loading](../loading/Load_to_Primary_Key_tables.md). - - - -## Supported data formats - -Routine Load now supports consuming CSV, JSON, and Avro (supported since v3.0.1) formatted data from a Kafka cluster. - -> **NOTE** -> -> For CSV data, take note of the following points: -> -> - You can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter. -> - Null values are denoted by using `\N`. For example, a data file consists of three columns, and a record from that data file holds data in the first and third columns but no data in the second column. In this situation, you need to use `\N` in the second column to denote a null value. This means the record must be compiled as `a,\N,b` instead of `a,,b`. `a,,b` denotes that the second column of the record holds an empty string. - -## Basic concepts - -![routine load](../_assets/4.5.2-1.png) - -### Terminology - -- **Load job** - - A Routine Load job is a long-running job. As long as its status is RUNNING, a load job continuously generates one or multiple concurrent load tasks which consume the messages in a topic of a Kafka cluster and load the data into StarRocks. - -- **Load task** - - A load job is split into multiple load tasks by certain rules. A load task is the basic unit of data loading. As an individual event, a load task implements the load mechanism based on [Stream Load](../loading/StreamLoad.md). Multiple load tasks concurrently consume the messages from different partitions of a topic, and load the data into StarRocks. - -### Workflow - -1. **Create a Routine Load job.** - To load data from Kafka, you need to create a Routine Load job by running the [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) statement. The FE parses the statement, and creates the job according to the properties you have specified. - -2. **The FE splits the job into multiple load tasks.** - - The FE split the job into multiple load tasks based on certain rules. Each load task is an individual transaction. 
- The splitting rules are as follows: - - The FE calculates the actual concurrent number of the load tasks according to the desired concurrent number `desired_concurrent_number`, the partition number in the Kafka topic, and the number of the BE nodes that are alive. - - The FE splits the job into load tasks based on the actual concurrent number calculated, and arranges the tasks in the task queue. - - Each Kafka topic consists of multiple partitions. The relation between the topic partition and the load task is as follows: - - A partition is uniquely assigned to a load task, and all messages from the partition are consumed by the load task. - - A load task can consume messages from one or more partitions. - - All partitions are distributed evenly among load tasks. - -3. **Multiple load tasks run concurrently to consume the messages from multiple Kafka topic partitions, and load the data into StarRocks** - - 1. **The FE schedules and submits load tasks**: the FE schedules the load tasks in the queue on a timely basis, and assigns them to selected Coordinator BE nodes. The interval between load tasks is defined by the configuration item `max_batch_interval`. The FE distributes the load tasks evenly to all BE nodes. See [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md#examples) for more information about `max_batch_interval`. - - 2. The Coordinator BE starts the load task, consumes messages in partitions, parses and filters the data. A load task lasts until the pre-defined amount of messages are consumed or the pre-defined time limit is reached. The message batch size and time limit are defined in the FE configurations `max_routine_load_batch_size` and `routine_load_task_consume_second`. For detailed information, see [FE Configuration](../administration/management/FE_configuration.md). The Coordinator BE then distributes the messages to the Executor BEs. The Executor BEs write the messages to disks. - - > **NOTE** - > - > StarRocks supports access to Kafka via security protocols including SASL_SSL, SAS_PLAINTEXT, SSL, and PLAINTEXT. This topic uses connecting to Kafka via PLAINTEXT as an example. If you need to connect to Kafka via other security protocols, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -4. **The FE generates new load tasks to load data continuously.** - After the Executor BEs has written the data to disks, the Coordinator BE reports the result of the load task to the FE. Based on the result, the FE then generates new load tasks to load the data continuously. Or the FE retries the failed tasks to make sure the data loaded into StarRocks is neither lost nor duplicated. - -## Create a Routine Load job - -The following three examples describe how to consume CSV-format, JSON-format and Avro-format data in Kafka, and load the data into StarRocks by creating a Routine Load job. For detailed syntax and parameter descriptions, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -### Load CSV-format data - -This section describes how to create a Routine Load job to consume CSV-format data in a Kafka cluster, and load the data into StarRocks. - -#### Prepare a dataset - -Suppose there is a CSV-format dataset in the topic `ordertest1` in a Kafka cluster. Every message in the dataset includes six fields: order ID, payment date, customer name, nationality, gender, and price. 
- -```Plain -2020050802,2020-05-08,Johann Georg Faust,Deutschland,male,895 -2020050802,2020-05-08,Julien Sorel,France,male,893 -2020050803,2020-05-08,Dorian Grey,UK,male,1262 -2020050901,2020-05-09,Anna Karenina",Russia,female,175 -2020051001,2020-05-10,Tess Durbeyfield,US,female,986 -2020051101,2020-05-11,Edogawa Conan,japan,male,8924 -``` - -#### Create a table - -According to the fields of CSV-format data, create the table `example_tbl1` in the database `example_db`. The following example creates a table with 5 fields excluding the field of customer gender in the CSV-format data. - -```SQL -CREATE TABLE example_db.example_tbl1 ( - `order_id` bigint NOT NULL COMMENT "Order ID", - `pay_dt` date NOT NULL COMMENT "Payment date", - `customer_name` varchar(26) NULL COMMENT "Customer name", - `nationality` varchar(26) NULL COMMENT "Nationality", - `price`double NULL COMMENT "Price" -) -ENGINE=OLAP -DUPLICATE KEY (order_id,pay_dt) -DISTRIBUTED BY HASH(`order_id`); -``` - -> **NOTICE** -> -> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -#### Submit a Routine Load job - -Execute the following statement to submit a Routine Load job named `example_tbl1_ordertest1` to consume the messages in the topic `ordertest1` and load the data into the table `example_tbl1`. The load task consumes the messages from the initial offset in the specified partitions of the topic. - -```SQL -CREATE ROUTINE LOAD example_db.example_tbl1_ordertest1 ON example_tbl1 -COLUMNS TERMINATED BY ",", -COLUMNS (order_id, pay_dt, customer_name, nationality, temp_gender, price) -PROPERTIES -( - "desired_concurrent_number" = "5" -) -FROM KAFKA -( - "kafka_broker_list" = ":,:", - "kafka_topic" = "ordertest1", - "kafka_partitions" = "0,1,2,3,4", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -After submitting the load job, you can execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job. - -- **load job name** - - There could be multiple load job on a table. Therefore, we recommend you name a load job with the corresponding Kafka topic and the time when the load job is submitted. It helps you distinguish the load job on each table. - -- **Column separator** - - The property `COLUMN TERMINATED BY` defines the column separator of the CSV-format data. The default is `\t`. - -- **Kafka topic partition and offset** - - You can specify the properties `kafka_partitions` and `kafka_offsets` to specify the partitions and offsets to consume the messages. For example, if you want the load job to consume messages from the Kafka partitions `"0,1,2,3,4"` of the topic `ordertest1` all with the initial offsets, you can specify the properties as follows: If you want the load job to consume messages from the Kafka partitions `"0,1,2,3,4"`and you need to specify a separate starting offset for each partition, you can configure as follows: - - ```SQL - "kafka_partitions" ="0,1,2,3,4", - "kafka_offsets" = "OFFSET_BEGINNING, OFFSET_END, 1000, 2000, 3000" - ``` - - You can also set the default offsets of all partitions with the property `property.kafka_default_offsets`. 
- - ```SQL - "kafka_partitions" ="0,1,2,3,4", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` - - For detailed information, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -- **Data mapping and transformation** - - To specify the mapping and transformation relationship between the CSV-format data, and the StarRocks table, you need to use the `COLUMNS` parameter. - - **Data mapping:** - - - StarRocks extracts the columns in the CSV-format data and maps them **in sequence** onto the fields declared in the `COLUMNS` parameter. - - - StarRocks extracts the fields declared in the `COLUMNS` parameter and maps them **by name** onto the columns of StarRocks table. - - **Data transformation:** - - And because the example excludes the column of customer gender from the CSV-format data, the field `temp_gender` in `COLUMNS` parameter is used as a placeholder for this field. The other fields are mapped to columns of the StarRocks table `example_tbl1` directly. - - For more information about data transformation, see [Transform data at loading](./Etl_in_loading.md). - - > **NOTE** - > - > You do not need to specify the `COLUMNS` parameter if the names, number, and order of the columns in the CSV-format data completely correspond to those of the StarRocks table. - -- **Task concurrency** - - When there are many Kafka topic partitions and enough BE nodes, you can accelerate the loading by increasing the task concurrency. - - To increase the actual load task concurrency, you can increase the desired load task concurrency `desired_concurrent_number` when you create a routine load job. You can also set the dynamic configuration item of FE `max_routine_load_task_concurrent_num` ( default maximum load task currency ) to a larger value. For more information about `max_routine_load_task_concurrent_num`, please see [FE configuration items](../administration/management/FE_configuration.md). - - The actual task concurrency is defined by the minimum value among the number of BE nodes that are alive, the number of the pre-specified Kafka topic partitions, and the values of `desired_concurrent_number` and `max_routine_load_task_concurrent_num`. - - In the example, the number of BE nodes that are alive is `5`, the number of the pre-specified Kafka topic partitions is `5`, and the value of `max_routine_load_task_concurrent_num` is `5`. To increase the actual load task concurrency, you can increase the `desired_concurrent_number` from the default value `3` to `5`. - - For more about the properties, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -### Load JSON-format data - -This section describes how to create a Routine Load job to consume JSON-format data in a Kafka cluster, and load the data into StarRocks. - -#### Prepare a dataset - -Suppose there is a JSON-format dataset in the topic `ordertest2` in a Kafka cluster. The dataset includes six keys: commodity ID, customer name, nationality, payment time, and price. Besides, you want to transform the payment time column into the DATE type, and load it into the `pay_dt` column in the StarRocks table. 
- -```JSON -{"commodity_id": "1", "customer_name": "Mark Twain", "country": "US","pay_time": 1589191487,"price": 875} -{"commodity_id": "2", "customer_name": "Oscar Wilde", "country": "UK","pay_time": 1589191487,"price": 895} -{"commodity_id": "3", "customer_name": "Antoine de Saint-Exupéry","country": "France","pay_time": 1589191487,"price": 895} -``` - -> **CAUTION** Each JSON object in a row must be in one Kafka message, otherwise a JSON parsing error is returned. - -#### Create a table - -According to the keys of the JSON-format data, create the table `example_tbl2` in the database `example_db`. - -```SQL -CREATE TABLE `example_tbl2` ( - `commodity_id` varchar(26) NULL COMMENT "Commodity ID", - `customer_name` varchar(26) NULL COMMENT "Customer name", - `country` varchar(26) NULL COMMENT "Country", - `pay_time` bigint(20) NULL COMMENT "Payment time", - `pay_dt` date NULL COMMENT "Payment date", - `price`double SUM NULL COMMENT "Price" -) -ENGINE=OLAP -AGGREGATE KEY(`commodity_id`,`customer_name`,`country`,`pay_time`,`pay_dt`) -DISTRIBUTED BY HASH(`commodity_id`); -``` - -> **NOTICE** -> -> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -#### Submit a Routine Load job - -Execute the following statement to submit a Routine Load job named `example_tbl2_ordertest2` to consume the messages in the topic `ordertest2` and load the data into the table `example_tbl2`. The load task consumes the messages from the initial offset in the specified partitions of the topic. - -```SQL -CREATE ROUTINE LOAD example_db.example_tbl2_ordertest2 ON example_tbl2 -COLUMNS(commodity_id, customer_name, country, pay_time, price, pay_dt=from_unixtime(pay_time, '%Y%m%d')) -PROPERTIES -( - "desired_concurrent_number" = "5", - "format" = "json", - "jsonpaths" = "[\"$.commodity_id\",\"$.customer_name\",\"$.country\",\"$.pay_time\",\"$.price\"]" - ) -FROM KAFKA -( - "kafka_broker_list" =":,:", - "kafka_topic" = "ordertest2", - "kafka_partitions" ="0,1,2,3,4", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -After submitting the load job, you can execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job. - -- **Data format** - - You need to specify `"format" = "json"` in the clause `PROPERTIES` to define that the data format is JSON. - -- **Data mapping and transformation** - - To specify the mapping and transformation relationship between the JSON-format data, and the StarRocks table, you need to specify the parameter `COLUMNS` and property`jsonpaths`. The order of fields specified in the `COLUMNS` parameter must match that of the JSON-format data, and the name of fields must match that of the StarRocks table. The property `jsonpaths` is used to extract the required fields from the JSON data. These fields are then named by the property `COLUMNS`. - - Because the example needs to transform the payment time field to the DATE data type, and load the data into the `pay_dt` column in the StarRocks table, you need to use the from_unixtime function. The other fields are mapped to fields of the table `example_tbl2` directly. 
- - **Data mapping:** - - - StarRocks extracts the `name` and `code` keys of JSON-format data and maps them onto the keys declared in the `jsonpaths` property. - - - StarRocks extracts the keys declared in the `jsonpaths` property and maps them **in sequence** onto the fields declared in the `COLUMNS` parameter. - - - StarRocks extracts the fields declared in the `COLUMNS` parameter and maps them **by name** onto the columns of StarRocks table. - - **Data transformation**: - - - Because the example needs to transform the key `pay_time` to the DATE data type, and load the data into the `pay_dt` column in the StarRocks table, you need to use the from_unixtime function in `COLUMNS` parameter. The other fields are mapped to fields of the table `example_tbl2` directly. - - - And because the example excludes the column of customer gender from the JSON-format data, the field `temp_gender` in `COLUMNS` parameter is used as a placeholder for this field. The other fields are mapped to columns of the StarRocks table `example_tbl1` directly. - - For more information about data transformation, see [Transform data at loading](./Etl_in_loading.md). - - > **NOTE** - > - > You do not need to specify the `COLUMNS` parameter if the names and number of the keys in the JSON object completely match those of fields in the StarRocks table. - -### Load Avro-format data - -Since v3.0.1, StarRocks supports loading Avro data by using Routine Load. - -#### Prepare a dataset - -##### Avro schema - -1. Create the following Avro schema file `avro_schema.avsc`: - - ```JSON - { - "type": "record", - "name": "sensor_log", - "fields" : [ - {"name": "id", "type": "long"}, - {"name": "name", "type": "string"}, - {"name": "checked", "type" : "boolean"}, - {"name": "data", "type": "double"}, - {"name": "sensor_type", "type": {"type": "enum", "name": "sensor_type_enum", "symbols" : ["TEMPERATURE", "HUMIDITY", "AIR-PRESSURE"]}} - ] - } - ``` - -2. Register the Avro schema in the [Schema Registry](https://docs.confluent.io/cloud/current/get-started/schema-registry.html#create-a-schema). - -##### Avro data - -Prepare the Avro data and send it to the Kafka topic `topic_0`. - -#### Create a table - -According to the fields of Avro data, create a table `sensor_log` in the target database `example_db` in the StarRocks cluster. The column names of the table must match the field names in the Avro data. For the data type mapping between the table columns and the Avro data fields, see [Data types mapping](#Data types mapping). - -```SQL -CREATE TABLE example_db.sensor_log ( - `id` bigint NOT NULL COMMENT "sensor id", - `name` varchar(26) NOT NULL COMMENT "sensor name", - `checked` boolean NOT NULL COMMENT "checked", - `data` double NULL COMMENT "sensor data", - `sensor_type` varchar(26) NOT NULL COMMENT "sensor type" -) -ENGINE=OLAP -DUPLICATE KEY (id) -DISTRIBUTED BY HASH(`id`); -``` - -> **NOTICE** -> -> Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -#### Submit a Routine Load job - -Execute the following statement to submit a Routine Load job named `sensor_log_load_job` to consume the Avro messages in the Kafka topic `topic_0` and load the data into the table `sensor_log` in the database `sensor`. 
The load job consumes the messages from the initial offset in the specified partitions of the topic. - -```SQL -CREATE ROUTINE LOAD example_db.sensor_log_load_job ON sensor_log -PROPERTIES -( - "format" = "avro" -) -FROM KAFKA -( - "kafka_broker_list" = ":,:,...", - "confluent.schema.registry.url" = "http://172.xx.xxx.xxx:8081", - "kafka_topic" = "topic_0", - "kafka_partitions" = "0,1,2,3,4,5", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -- Data Format - - You need to specify `"format = "avro"` in the clause `PROPERTIES` to define that the data format is Avro. - -- Schema Registry - - You need to configure `confluent.schema.registry.url` to specify the URL of the Schema Registry where the Avro schema is registered. StarRocks retrieves the Avro schema by using this URL. The format is as follows: - - ```Plaintext - confluent.schema.registry.url = http[s]://[:@][:] - ``` - -- Data mapping and transformation - - To specify the mapping and transformation relationship between the Avro-format data and the StarRocks table, you need to specify the parameter `COLUMNS` and property `jsonpaths`. The order of fields specified in the `COLUMNS` parameter must match that of the fields in the property `jsonpaths`, and the names of fields must match these of the StarRocks table. The property `jsonpaths` is used to extract the required fields from the Avro data. These fields are then named by the property `COLUMNS`. - - For more information about data transformation, see [Transform data at loading](./Etl_in_loading.md). - - > NOTE - > - > You do not need to specify the `COLUMNS` parameter if the names and number of the fields in the Avro record completely match those of columns in the StarRocks table. - -After submitting the load job, you can execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job. - -#### Data types mapping - -The data type mapping between the Avro data fields you want to load and the StarRocks table columns is as follows: - -##### Primitive types - -| Avro | StarRocks | -| ------- | --------- | -| nul | NULL | -| boolean | BOOLEAN | -| int | INT | -| long | BIGINT | -| float | FLOAT | -| double | DOUBLE | -| bytes | STRING | -| string | STRING | - -##### Complex types - -| Avro | StarRocks | -| -------------- | ------------------------------------------------------------ | -| record | Load the entire RECORD or its subfields into StarRocks as JSON. | -| enums | STRING | -| arrays | ARRAY | -| maps | JSON | -| union(T, null) | NULLABLE(T) | -| fixed | STRING | - -#### Limits - -- Currently, StarRocks does not support schema evolution. -- Each Kafka message must only contain a single Avro data record. - -## Check a load job and task - -### Check a load job - -Execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job `example_tbl2_ordertest2`. StarRocks returns the execution state `State`, the statistical information (including the total rows consumed and the total rows loaded) `Statistics`, and the progress of the load job `progress`. - -If the state of the load job is automatically changed to **PAUSED**, it is possibly because the number of error rows has exceeded the threshold. For detailed instructions on setting this threshold, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). 
You can check the files `ReasonOfStateChanged` and `ErrorLogUrls` to identify and troubleshoot the problem. Having fixed the problem, you can then execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume the **PAUSED** load job. - -If the state of the load job is **CANCELLED**, it is possibly because the load job encounters an exception (such as the table has been dropped). You can check the files `ReasonOfStateChanged` and `ErrorLogUrls` to identify and troubleshoot the problem. However, you cannot resume a **CANCELLED** load job. - -```SQL -MySQL [example_db]> SHOW ROUTINE LOAD FOR example_tbl2_ordertest2 \G -*************************** 1. row *************************** - Id: 63013 - Name: example_tbl2_ordertest2 - CreateTime: 2022-08-10 17:09:00 - PauseTime: NULL - EndTime: NULL - DbName: default_cluster:example_db - TableName: example_tbl2 - State: RUNNING - DataSourceType: KAFKA - CurrentTaskNum: 3 - JobProperties: {"partitions":"*","partial_update":"false","columnToColumnExpr":"commodity_id,customer_name,country,pay_time,pay_dt=from_unixtime(`pay_time`, '%Y%m%d'),price","maxBatchIntervalS":"20","whereExpr":"*","dataFormat":"json","timezone":"Asia/Shanghai","format":"json","json_root":"","strict_mode":"false","jsonpaths":"[\"$.commodity_id\",\"$.customer_name\",\"$.country\",\"$.pay_time\",\"$.price\"]","desireTaskConcurrentNum":"3","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"3","maxBatchRows":"200000"} -DataSourceProperties: {"topic":"ordertest2","currentKafkaPartitions":"0,1,2,3,4","brokerList":":,:"} - CustomProperties: {"kafka_default_offsets":"OFFSET_BEGINNING"} - Statistic: {"receivedBytes":230,"errorRows":0,"committedTaskNum":1,"loadedRows":2,"loadRowsRate":0,"abortedTaskNum":0,"totalRows":2,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":522} - Progress: {"0":"1","1":"OFFSET_ZERO","2":"OFFSET_ZERO","3":"OFFSET_ZERO","4":"OFFSET_ZERO"} -ReasonOfStateChanged: - ErrorLogUrls: - OtherMsg: -``` - -> **CAUTION** -> -> You cannot check a load job that has stopped or has not yet started. - -### Check a load task - -Execute the [SHOW ROUTINE LOAD TASK](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md) statement to check the load tasks of the load job `example_tbl2_ordertest2`, such as how many tasks are currently running, the Kafka topic partitions that are consumed and the consumption progress `DataSourceProperties`, and the corresponding Coordinator BE node `BeId`. - -```SQL -MySQL [example_db]> SHOW ROUTINE LOAD TASK WHERE JobName = "example_tbl2_ordertest2" \G -*************************** 1. row *************************** - TaskId: 18c3a823-d73e-4a64-b9cb-b9eced026753 - TxnId: -1 - TxnStatus: UNKNOWN - JobId: 63013 - CreateTime: 2022-08-10 17:09:05 - LastScheduledTime: 2022-08-10 17:47:27 - ExecuteStartTime: NULL - Timeout: 60 - BeId: -1 -DataSourceProperties: {"1":0,"4":0} - Message: there is no new data in kafka, wait for 20 seconds to schedule again -*************************** 2. row *************************** - TaskId: f76c97ac-26aa-4b41-8194-a8ba2063eb00 - TxnId: -1 - TxnStatus: UNKNOWN - JobId: 63013 - CreateTime: 2022-08-10 17:09:05 - LastScheduledTime: 2022-08-10 17:47:26 - ExecuteStartTime: NULL - Timeout: 60 - BeId: -1 -DataSourceProperties: {"2":0} - Message: there is no new data in kafka, wait for 20 seconds to schedule again -*************************** 3. 
row *************************** - TaskId: 1a327a34-99f4-4f8d-8014-3cd38db99ec6 - TxnId: -1 - TxnStatus: UNKNOWN - JobId: 63013 - CreateTime: 2022-08-10 17:09:26 - LastScheduledTime: 2022-08-10 17:47:27 - ExecuteStartTime: NULL - Timeout: 60 - BeId: -1 -DataSourceProperties: {"0":2,"3":0} - Message: there is no new data in kafka, wait for 20 seconds to schedule again -``` - -## Pause a load job - -You can execute the [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) statement to pause a load job. The state of the load job will be **PAUSED** after the statement is executed. However, it has not stopped. You can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume it. You can also check its status with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement. - -The following example pauses the load job `example_tbl2_ordertest2`: - -```SQL -PAUSE ROUTINE LOAD FOR example_tbl2_ordertest2; -``` - -## Resume a load job - -You can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume a paused load job. The state of the load job will be **NEED_SCHEDULE** temporarily (because the load job is being re-scheduled), and then become **RUNNING**. You can check its status with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement. - -The following example resumes the paused load job `example_tbl2_ordertest2`: - -```SQL -RESUME ROUTINE LOAD FOR example_tbl2_ordertest2; -``` - -## Alter a load job - -Before altering a load job, you must pause it with the [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) statement. Then you can execute the [ALTER ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/ALTER_ROUTINE_LOAD.md). After altering it, you can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume it, and check its status with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement. - -Suppose the number of the BE nodes that are alive increases to `6` and the Kafka topic partitions to be consumed is `"0,1,2,3,4,5,6,7"`. If you want to increase the actual load task concurrency, you can execute the following statement to increase the number of desired task concurrency `desired_concurrent_number` to `6` (greater than or equal to the number of BE nodes that are alive), and specify the Kafka topic partitions and initial offsets. - -> **NOTE** -> -> Because the actual task concurrency is determined by the minimum value of multiple parameters, you must make sure that the value of the FE dynamic parameter `max_routine_load_task_concurrent_num` is greater than or equal to `6`. 
- -```SQL -ALTER ROUTINE LOAD FOR example_tbl2_ordertest2 -PROPERTIES -( - "desired_concurrent_number" = "6" -) -FROM kafka -( - "kafka_partitions" = "0,1,2,3,4,5,6,7", - "kafka_offsets" = "OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_END,OFFSET_END,OFFSET_END,OFFSET_END" -); -``` - -## Stop a load job - -You can execute the [STOP ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/STOP_ROUTINE_LOAD.md) statement to stop a load job. The state of the load job will be **STOPPED** after the statement is executed, and you cannot resume a stopped load job. You cannot check the status of a stopped load job with the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement. - -The following example stops the load job `example_tbl2_ordertest2`: - -```SQL -STOP ROUTINE LOAD FOR example_tbl2_ordertest2; -``` diff --git a/docs/en/loading/SQL_transaction.md b/docs/en/loading/SQL_transaction.md deleted file mode 100644 index 6ea4540..0000000 --- a/docs/en/loading/SQL_transaction.md +++ /dev/null @@ -1,156 +0,0 @@ ---- -displayed_sidebar: docs ---- - -import Beta from '../_assets/commonMarkdown/_beta.mdx' - -# SQL Transaction - - - -Start a simple SQL transaction to commit multiple DML statements in a batch. - -## Overview - -From v3.5.0, StarRocks supports SQL transactions to assure the atomicity of the updated tables when manipulating data within multiple tables. - -A transaction consists of multiple SQL statements that are processed within the same atomic unit. The statements in the transaction are either applied or undone together, thus guaranteeing the ACID (atomicity, consistency, isolation, and durability) properties of the transaction. - -Currently, the SQL transaction in StarRocks supports the following operations: -- INSERT INTO -- UPDATE -- DELETE - -:::note - -- INSERT OVERWRITE is not supported currently. -- Multiple INSERT statements against the same table within a transaction are supported only in shared-data clusters from v4.0 onwards. -- UPDATE and DELETE are supported only in shared-data clusters from v4.0 onwards. - -::: - -From v4.0 onwards, within one SQL transaction: -- **Multiple INSERT statements** against the one table are supported. -- **Only one UPDATE *OR* DELETE** statement against one table is allowed. -- **An UPDATE *OR* DELETE** statement **after** INSERT statements against the same table is **not allowed**. - -The ACID properties of the transaction are guaranteed only on the limited READ COMMITTED isolation level, that is: -- A statement operates only on data that was committed before the statement began. -- Two successive statements within the same transaction may operate on different data if another transaction is committed between the execution of the first and the second statements. -- Data changes brought by preceding DML statements are invisible to subsequent statements within the same transaction. - -A transaction is associated with a single session. Multiple sessions cannot share the same transaction. - -## Usage - -1. A transaction must be started by executing a START TRANSACTION statement. StarRocks also supports the synonym BEGIN. - - ```SQL - { START TRANSACTION | BEGIN [ WORK ] } - ``` - -2. After starting the transaction, you can define multiple DML statements in the transaction. For detailed information, see [Usage notes](#usage-notes). - -3. A transaction must be ended explicitly by executing `COMMIT` or `ROLLBACK`. 
- - - To apply (commit) the transaction, use the following syntax: - - ```SQL - COMMIT [ WORK ] - ``` - - - To undo (roll back) the transaction, use the following syntax: - - ```SQL - ROLLBACK [ WORK ] - ``` - -## Example - -1. Create the demo table `desT` in a shared-data cluster, and load data into it. - - :::note - If you want to try this example in a shared-nothing cluster, you must skip Step 3 and define only one INSERT statement in Step 4. - ::: - - ```SQL - CREATE TABLE desT ( - k int, - v int - ) PRIMARY KEY(k); - - INSERT INTO desT VALUES - (1,1), - (2,2), - (3,3); - ``` - -2. Start a transaction. - - ```SQL - START TRANSACTION; - ``` - - Or - - ```SQL - BEGIN WORK; - ``` - -3. Define an UPDATE or DELETE statement. - - ```SQL - UPDATE desT SET v = v + 1 WHERE k = 1, - ``` - - Or - - ```SQL - DELETE FROM desT WHERE k = 1; - ``` - -4. Define multiple INSERT statements. - - ```SQL - -- Insert data with specified values. - INSERT INTO desT VALUES (4,4); - -- Insert data from a native table to another. - INSERT INTO desT SELECT * FROM srcT; - -- Insert data from remote storage. - INSERT INTO desT - SELECT * FROM FILES( - "path" = "s3://inserttest/parquet/srcT.parquet", - "format" = "parquet", - "aws.s3.access_key" = "XXXXXXXXXX", - "aws.s3.secret_key" = "YYYYYYYYYY", - "aws.s3.region" = "us-west-2" - ); - ``` - -5. Apply or undo the transaction. - - - To apply the SQL statements in the transaction. - - ```SQL - COMMIT WORK; - ``` - - - To undo the SQL statements in the transaction. - - ```SQL - ROLLBACK WORK; - ``` - -## Usage notes - -- Currently, StarRocks supports SELECT, INSERT, UPDATE, and DELETE statements in SQL transactions. UPDATE and DELETE are supported only in shared-data clusters from v4.0 onwards. -- SELECT statements against the tables whose data have been changed in the same transaction are not allowed. -- Multiple INSERT statements against the same table within a transaction are supported only in shared-data clusters from v4.0 onwards. -- Within a transaction, you can only define one UPDATE or DELETE statement against each table, and it must precede the INSERT statements. -- Subsequent DML statements cannot read the uncommitted changes brought by preceding statements within the same transaction. For example, the target table of the preceding INSERT statement cannot be the source table of subsequent statements. Otherwise, the system returns an error. -- All target tables of the DML statements in a transaction must be within the same database. Cross-database operations are not allowed. -- Currently, INSERT OVERWRITE is not supported. -- Nesting transactions are not allowed. You cannot specify BEGIN WORK within a BEGIN-COMMIT/ROLLBACK pair. -- If the session where an on-going transaction belongs is terminated or closed, the transaction is automatically rolled back. -- StarRock only supports limited READ COMMITTED for Transaction Isolation Level as described above. -- Write conflict checks are not supported. When two transactions write to the same table simultaneously, both transactions can be committed successfully. The visibility (order) of the data changes depends on the execution order of the COMMIT WORK statements. 
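-
-For instance, a minimal sketch of how the read-visibility rule above applies (assuming the demo table `desT` from the example and a hypothetical second table `otherT` in the same database):
-
-```SQL
-BEGIN WORK;
-INSERT INTO desT VALUES (5,5);
--- Not allowed in the same transaction: desT has uncommitted changes from the
--- INSERT above, so using it as the source table of a subsequent statement
--- returns an error.
--- INSERT INTO otherT SELECT * FROM desT;
-COMMIT WORK;
-```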
diff --git a/docs/en/loading/Spark-connector-starrocks.md b/docs/en/loading/Spark-connector-starrocks.md deleted file mode 100644 index 89dd483..0000000 --- a/docs/en/loading/Spark-connector-starrocks.md +++ /dev/null @@ -1,842 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Load data using Spark connector (recommended) - -StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you load data into a StarRocks table by using Spark. The basic principle is to accumulate the data and then load it all at a time into StarRocks through [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). The Spark connector is implemented based on Spark DataSource V2. A DataSource can be created by using Spark DataFrames or Spark SQL. And both batch and structured streaming modes are supported. - -> **NOTICE** -> -> Only users with the SELECT and INSERT privileges on a StarRocks table can load data into this table. You can follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant these privileges to a user. - -## Version requirements - -| Spark connector | Spark | StarRocks | Java | Scala | -| --------------- | ---------------- | ------------- | ---- | ----- | -| 1.1.2 | 3.2, 3.3, 3.4, 3.5 | 2.5 and later | 8 | 2.12 | -| 1.1.1 | 3.2, 3.3, or 3.4 | 2.5 and later | 8 | 2.12 | -| 1.1.0 | 3.2, 3.3, or 3.4 | 2.5 and later | 8 | 2.12 | - -> **NOTICE** -> -> - Please see [Upgrade Spark connector](#upgrade-spark-connector) for behavior changes among different versions of the Spark connector. -> - The Spark connector does not provide MySQL JDBC driver since version 1.1.1, and you need import the driver to the spark classpath manually. You can find the driver on [MySQL site](https://dev.mysql.com/downloads/connector/j/) or [Maven Central](https://repo1.maven.org/maven2/mysql/mysql-connector-java/). - -## Obtain Spark connector - -You can obtain the Spark connector JAR file in the following ways: - -- Directly download the compiled Spark Connector JAR file. -- Add the Spark connector as a dependency in your Maven project and then download the JAR file. -- Compile the source code of the Spark Connector into a JAR file by yourself. - -The naming format of the Spark connector JAR file is `starrocks-spark-connector-${spark_version}_${scala_version}-${connector_version}.jar`. - -For example, if you install Spark 3.2 and Scala 2.12 in your environment and you want to use Spark connector 1.1.0, you can use `starrocks-spark-connector-3.2_2.12-1.1.0.jar`. - -> **NOTICE** -> -> In general, the latest version of the Spark connector only maintains compatibility with the three most recent versions of Spark. - -### Download the compiled Jar file - -Directly download the corresponding version of the Spark connector JAR from the [Maven Central Repository](https://repo1.maven.org/maven2/com/starrocks). - -### Maven Dependency - -1. In your Maven project's `pom.xml` file, add the Spark connector as a dependency according to the following format. Replace `spark_version`, `scala_version`, and `connector_version` with the respective versions. - - ```xml - - com.starrocks - starrocks-spark-connector-${spark_version}_${scala_version} - ${connector_version} - - ``` - -2. 
For example, if the version of Spark in your environment is 3.2, the version of Scala is 2.12, and you choose Spark connector 1.1.0, you need to add the following dependency:
-
-    ```xml
-    <dependency>
-        <groupId>com.starrocks</groupId>
-        <artifactId>starrocks-spark-connector-3.2_2.12</artifactId>
-        <version>1.1.0</version>
-    </dependency>
-    ```
-
-### Compile by yourself
-
-1. Download the [Spark connector package](https://github.com/StarRocks/starrocks-connector-for-apache-spark).
-2. Execute the following command to compile the source code of the Spark connector into a JAR file. Note that `spark_version` is replaced with the corresponding Spark version.
-
-    ```bash
-    sh build.sh <spark_version>
-    ```
-
-    For example, if the Spark version in your environment is 3.2, you need to execute the following command:
-
-    ```bash
-    sh build.sh 3.2
-    ```
-
-3. Go to the `target/` directory to find the Spark connector JAR file, such as `starrocks-spark-connector-3.2_2.12-1.1.0-SNAPSHOT.jar`, generated upon compilation.
-
-> **NOTE**
->
-> The name of a Spark connector that has not been formally released contains the `SNAPSHOT` suffix.
-
-## Parameters
-
-### starrocks.fe.http.url
-
-**Required**: YES
-**Default value**: None
-**Description**: The HTTP URL of the FE in your StarRocks cluster. You can specify multiple URLs, which must be separated by a comma (,). Format: `<fe_host1>:<fe_http_port1>,<fe_host2>:<fe_http_port2>`. Since version 1.1.1, you can also add the `http://` prefix to the URL, such as `http://<fe_host1>:<fe_http_port1>,http://<fe_host2>:<fe_http_port2>`.
-
-### starrocks.fe.jdbc.url
-
-**Required**: YES
-**Default value**: None
-**Description**: The address that is used to connect to the MySQL server of the FE. Format: `jdbc:mysql://<fe_host>:<fe_query_port>`.
-
-### starrocks.table.identifier
-
-**Required**: YES
-**Default value**: None
-**Description**: The name of the StarRocks table. Format: `<database_name>.<table_name>`.
-
-### starrocks.user
-
-**Required**: YES
-**Default value**: None
-**Description**: The username of your StarRocks cluster account. The user needs the [SELECT and INSERT privileges](../sql-reference/sql-statements/account-management/GRANT.md) on the StarRocks table. - -### starrocks.password - -**Required**: YES
-**Default value**: None
-**Description**: The password of your StarRocks cluster account. - -### starrocks.write.label.prefix - -**Required**: NO
-**Default value**: spark-
-**Description**: The label prefix used by Stream Load. - -### starrocks.write.enable.transaction-stream-load - -**Required**: NO
-**Default value**: TRUE
-**Description**: Whether to use [Stream Load transaction interface](../loading/Stream_Load_transaction_interface.md) to load data. It requires StarRocks v2.5 or later. This feature can load more data in a transaction with less memory usage, and improve performance.
**NOTICE:** Since 1.1.1, this parameter takes effect only when the value of `starrocks.write.max.retries` is non-positive, because the Stream Load transaction interface does not support retries.
-
-### starrocks.write.buffer.size
-
-**Required**: NO
-**Default value**: 104857600
-**Description**: The maximum size of data that can be accumulated in memory before being sent to StarRocks at a time. Setting this parameter to a larger value can improve loading performance but may increase loading latency. - -### starrocks.write.buffer.rows - -**Required**: NO
-**Default value**: Integer.MAX_VALUE
-**Description**: Supported since version 1.1.1. The maximum number of rows that can be accumulated in memory before being sent to StarRocks at a time. - -### starrocks.write.flush.interval.ms - -**Required**: NO
-**Default value**: 300000
-**Description**: The interval at which data is sent to StarRocks. This parameter is used to control the loading latency. - -### starrocks.write.max.retries - -**Required**: NO
-**Default value**: 3
-**Description**: Supported since version 1.1.1. The number of times that the connector retries to perform the Stream Load for the same batch of data if the load fails.
**NOTICE:** Because the Stream Load transaction interface does not support retries, if this parameter is positive, the connector always uses the Stream Load interface and ignores the value of `starrocks.write.enable.transaction-stream-load`.
-
-### starrocks.write.retry.interval.ms
-
-**Required**: NO
-**Default value**: 10000
-**Description**: Supported since version 1.1.1. The interval to retry the Stream Load for the same batch of data if the load fails. - -### starrocks.columns - -**Required**: NO
-**Default value**: None
-**Description**: The StarRocks table columns into which you want to load data. You can specify multiple columns, which must be separated by commas (,), for example, `"col0,col1,col2"`.
-
-### starrocks.column.types
-
-**Required**: NO
-**Default value**: None
-**Description**: Supported since version 1.1.1. Customize the column data types for Spark instead of using the defaults inferred from the StarRocks table and the [default mapping](#data-type-mapping-between-spark-and-starrocks). The parameter value is a schema in DDL format, the same as the output of Spark [StructType#toDDL](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/StructType.scala#L449), such as `col0 INT, col1 STRING, col2 BIGINT`. Note that you only need to specify columns that need customization. One use case is to load data into columns of [BITMAP](#load-data-into-columns-of-bitmap-type) or [HLL](#load-data-into-columns-of-hll-type) type.
-
-### starrocks.write.properties.*
-
-**Required**: NO
-**Default value**: None
-**Description**: The parameters that are used to control Stream Load behavior. For example, the parameter `starrocks.write.properties.format` specifies the format of the data to be loaded, such as CSV or JSON. For a list of supported parameters and their descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -### starrocks.write.properties.format - -**Required**: NO
-**Default value**: CSV
-**Description**: The file format based on which the Spark connector transforms each batch of data before the data is sent to StarRocks. Valid values: CSV and JSON. - -### starrocks.write.properties.row_delimiter - -**Required**: NO
-**Default value**: \n
-**Description**: The row delimiter for CSV-formatted data. - -### starrocks.write.properties.column_separator - -**Required**: NO
-**Default value**: \t
-**Description**: The column separator for CSV-formatted data. - -### starrocks.write.properties.partial_update - -**Required**: NO
-**Default value**: `FALSE`
-**Description**: Whether to use partial updates. Valid values: `TRUE` and `FALSE`. The default value `FALSE` disables this feature.
-
-### starrocks.write.properties.partial_update_mode
-
-**Required**: NO
-**Default value**: `row`
-**Description**: Specifies the mode for partial updates. Valid values: `row` and `column`.
  • The value `row` (default) means partial updates in row mode, which is more suitable for real-time updates with many columns and small batches.
  • The value `column` means partial updates in column mode, which is more suitable for batch updates with few columns and many rows. In such scenarios, enabling the column mode offers faster update speeds. For example, in a table with 100 columns, if only 10 columns (10% of the total) are updated for all rows, the update speed of the column mode is 10 times faster.
- -### starrocks.write.num.partitions - -**Required**: NO
-**Default value**: None
-**Description**: The number of partitions into which Spark can write data in parallel. When the data volume is small, you can reduce the number of partitions to lower the loading concurrency and frequency. The default value is determined by Spark. Note that specifying this parameter may introduce Spark shuffle cost.
-
-### starrocks.write.partition.columns
-
-**Required**: NO
-**Default value**: None
-**Description**: The partitioning columns in Spark. The parameter takes effect only when `starrocks.write.num.partitions` is specified. If this parameter is not specified, all columns being written are used for partitioning. - -### starrocks.timezone - -**Required**: NO
-**Default value**: Default timezone of JVM
-**Description**: Supported since 1.1.1. The timezone used to convert Spark `TimestampType` to StarRocks `DATETIME`. The default is the timezone of JVM returned by `ZoneId#systemDefault()`. The format can be a timezone name such as `Asia/Shanghai`, or a zone offset such as `+08:00`. - -## Data type mapping between Spark and StarRocks - -- The default data type mapping is as follows: - - | Spark data type | StarRocks data type | - | --------------- | ------------------------------------------------------------ | - | BooleanType | BOOLEAN | - | ByteType | TINYINT | - | ShortType | SMALLINT | - | IntegerType | INT | - | LongType | BIGINT | - | StringType | LARGEINT | - | FloatType | FLOAT | - | DoubleType | DOUBLE | - | DecimalType | DECIMAL | - | StringType | CHAR | - | StringType | VARCHAR | - | StringType | STRING | - | StringType | JSON | - | DateType | DATE | - | TimestampType | DATETIME | - | ArrayType | ARRAY
**NOTE:**
**Supported since version 1.1.1**. For detailed steps, see [Load data into columns of ARRAY type](#load-data-into-columns-of-array-type). | - -- You can also customize the data type mapping. - - For example, a StarRocks table contains BITMAP and HLL columns, but Spark does not support the two data types. You need to customize the corresponding data types in Spark. For detailed steps, see load data into [BITMAP](#load-data-into-columns-of-bitmap-type) and [HLL](#load-data-into-columns-of-hll-type) columns. **BITMAP and HLL are supported since version 1.1.1**. - -## Upgrade Spark connector - -### Upgrade from version 1.1.0 to 1.1.1 - -- Since 1.1.1, the Spark connector does not provide `mysql-connector-java` which is the official JDBC driver for MySQL, because of the limitations of the GPL license used by `mysql-connector-java`. - However, the Spark connector still needs the MySQL JDBC driver to connect to StarRocks for the table metadata, so you need to add the driver to the Spark classpath manually. You can find the - driver on [MySQL site](https://dev.mysql.com/downloads/connector/j/) or [Maven Central](https://repo1.maven.org/maven2/mysql/mysql-connector-java/). -- Since 1.1.1, the connector uses Stream Load interface by default rather than Stream Load transaction interface in version 1.1.0. If you still want to use Stream Load transaction interface, you - can set the option `starrocks.write.max.retries` to `0`. Please see the description of `starrocks.write.enable.transaction-stream-load` and `starrocks.write.max.retries` - for details. - -## Examples - -The following examples show how to use the Spark connector to load data into a StarRocks table with Spark DataFrames or Spark SQL. The Spark DataFrames supports both Batch and Structured Streaming modes. - -For more examples, see [Spark Connector Examples](https://github.com/StarRocks/starrocks-connector-for-apache-spark/tree/main/src/test/java/com/starrocks/connector/spark/examples). - -### Preparations - -#### Create a StarRocks table - -Create a database `test` and create a Primary Key table `score_board`. - -```sql -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### Network configuration - -Ensure that the machine where Spark is located can access the FE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`query_port`](../administration/management/FE_configuration.md#query_port) (default: `9030`), and the BE nodes via the [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`). - -#### Set up your Spark environment - -Note that the following examples are run in Spark 3.2.4 and use `spark-shell`, `pyspark` and `spark-sql`. Before running the examples, make sure to place the Spark connector JAR file in the `$SPARK_HOME/jars` directory. - -### Load data with Spark DataFrames - -The following two examples explain how to load data with Spark DataFrames Batch or Structured Streaming mode. - -#### Batch - -Construct data in memory and load data into the StarRocks table. - -1. You can write the spark application using Scala or Python. - - For Scala, run the following code snippet in `spark-shell`: - - ```Scala - // 1. Create a DataFrame from a sequence. 
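-    // (In spark-shell, spark.implicits._ is pre-imported, which is what makes toDF on a
-    //  local Seq work below; in a standalone application, add `import spark.implicits._`
-    //  after creating the SparkSession.)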
- val data = Seq((1, "starrocks", 100), (2, "spark", 100)) - val df = data.toDF("id", "name", "score") - - // 2. Write to StarRocks by configuring the format as "starrocks" and the following options. - // You need to modify the options according your own environment. - df.write.format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.score_board") - .option("starrocks.user", "root") - .option("starrocks.password", "") - .mode("append") - .save() - ``` - - For Python, run the following code snippet in `pyspark`: - - ```python - from pyspark.sql import SparkSession - - spark = SparkSession \ - .builder \ - .appName("StarRocks Example") \ - .getOrCreate() - - # 1. Create a DataFrame from a sequence. - data = [(1, "starrocks", 100), (2, "spark", 100)] - df = spark.sparkContext.parallelize(data) \ - .toDF(["id", "name", "score"]) - - # 2. Write to StarRocks by configuring the format as "starrocks" and the following options. - # You need to modify the options according your own environment. - df.write.format("starrocks") \ - .option("starrocks.fe.http.url", "127.0.0.1:8030") \ - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") \ - .option("starrocks.table.identifier", "test.score_board") \ - .option("starrocks.user", "root") \ - .option("starrocks.password", "") \ - .mode("append") \ - .save() - ``` - -2. Query data in the StarRocks table. - - ```sql - MySQL [test]> SELECT * FROM `score_board`; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | spark | 100 | - +------+-----------+-------+ - 2 rows in set (0.00 sec) - ``` - -#### Structured Streaming - -Construct a streaming read of data from a CSV file and load data into the StarRocks table. - -1. In the directory `csv-data`, create a CSV file `test.csv` with the following data: - - ```csv - 3,starrocks,100 - 4,spark,100 - ``` - -2. You can write the Spark application using Scala or Python. - - For Scala, run the following code snippet in `spark-shell`: - - ```Scala - import org.apache.spark.sql.types.StructType - - // 1. Create a DataFrame from CSV. - val schema = (new StructType() - .add("id", "integer") - .add("name", "string") - .add("score", "integer") - ) - val df = (spark.readStream - .option("sep", ",") - .schema(schema) - .format("csv") - // Replace it with your path to the directory "csv-data". - .load("/path/to/csv-data") - ) - - // 2. Write to StarRocks by configuring the format as "starrocks" and the following options. - // You need to modify the options according your own environment. - val query = (df.writeStream.format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.score_board") - .option("starrocks.user", "root") - .option("starrocks.password", "") - // replace it with your checkpoint directory - .option("checkpointLocation", "/path/to/checkpoint") - .outputMode("append") - .start() - ) - ``` - - For Python, run the following code snippet in `pyspark`: - - ```python - from pyspark.sql import SparkSession - from pyspark.sql.types import IntegerType, StringType, StructType, StructField - - spark = SparkSession \ - .builder \ - .appName("StarRocks SS Example") \ - .getOrCreate() - - # 1. Create a DataFrame from CSV. 
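-    # (readStream treats the "csv-data" directory as a streaming source: the running query
-    #  below also picks up and loads any new CSV files added to the directory later.)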
- schema = StructType([ - StructField("id", IntegerType()), - StructField("name", StringType()), - StructField("score", IntegerType()) - ]) - df = ( - spark.readStream - .option("sep", ",") - .schema(schema) - .format("csv") - # Replace it with your path to the directory "csv-data". - .load("/path/to/csv-data") - ) - - # 2. Write to StarRocks by configuring the format as "starrocks" and the following options. - # You need to modify the options according your own environment. - query = ( - df.writeStream.format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.score_board") - .option("starrocks.user", "root") - .option("starrocks.password", "") - # replace it with your checkpoint directory - .option("checkpointLocation", "/path/to/checkpoint") - .outputMode("append") - .start() - ) - ``` - -3. Query data in the StarRocks table. - - ```SQL - MySQL [test]> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 4 | spark | 100 | - | 3 | starrocks | 100 | - +------+-----------+-------+ - 2 rows in set (0.67 sec) - ``` - -### Load data with Spark SQL - -The following example explains how to load data with Spark SQL by using the `INSERT INTO` statement in the [Spark SQL CLI](https://spark.apache.org/docs/latest/sql-distributed-sql-engine-spark-sql-cli.html). - -1. Execute the following SQL statement in the `spark-sql`: - - ```SQL - -- 1. Create a table by configuring the data source as `starrocks` and the following options. - -- You need to modify the options according your own environment. - CREATE TABLE `score_board` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.score_board", - "starrocks.user"="root", - "starrocks.password"="" - ); - - -- 2. Insert two rows into the table. - INSERT INTO `score_board` VALUES (5, "starrocks", 100), (6, "spark", 100); - ``` - -2. Query data in the StarRocks table. - - ```SQL - MySQL [test]> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 6 | spark | 100 | - | 5 | starrocks | 100 | - +------+-----------+-------+ - 2 rows in set (0.00 sec) - ``` - -## Best Practices - -### Load data to Primary Key table - -This section will show how to load data to StarRocks Primary Key table to achieve partial updates, and conditional updates. -You can see [Change data through loading](../loading/Load_to_Primary_Key_tables.md) for the detailed introduction of these features. -These examples use Spark SQL. - -#### Preparations - -Create a database `test` and create a Primary Key table `score_board` in StarRocks. - -```SQL -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### Partial updates - -This example will show how to only update data in the column `name` through loading: - -1. Insert initial data to StarRocks table in MySQL client. 
- - ```sql - mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'spark', 100); - - mysql> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | spark | 100 | - +------+-----------+-------+ - 2 rows in set (0.02 sec) - ``` - -2. Create a Spark table `score_board` in Spark SQL client. - - - Set the option `starrocks.write.properties.partial_update` to `true` which tells the connector to do partial update. - - Set the option `starrocks.columns` to `"id,name"` to tell the connector which columns to write. - - ```SQL - CREATE TABLE `score_board` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.score_board", - "starrocks.user"="root", - "starrocks.password"="", - "starrocks.write.properties.partial_update"="true", - "starrocks.columns"="id,name" - ); - ``` - -3. Insert data into the table in Spark SQL client, and only update the column `name`. - - ```SQL - INSERT INTO `score_board` VALUES (1, 'starrocks-update'), (2, 'spark-update'); - ``` - -4. Query the StarRocks table in MySQL client. - - You can see that only values for `name` change, and the values for `score` does not change. - - ```SQL - mysql> select * from score_board; - +------+------------------+-------+ - | id | name | score | - +------+------------------+-------+ - | 1 | starrocks-update | 100 | - | 2 | spark-update | 100 | - +------+------------------+-------+ - 2 rows in set (0.02 sec) - ``` - -#### Conditional updates - -This example will show how to do conditional updates according to the values of column `score`. The update for an `id` -takes effect only when the new value for `score` is has a greater or equal to the old value. - -1. Insert initial data to StarRocks table in MySQL client. - - ```SQL - mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'spark', 100); - - mysql> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | spark | 100 | - +------+-----------+-------+ - 2 rows in set (0.02 sec) - ``` - -2. Create a Spark table `score_board` in the following ways. - - - Set the option `starrocks.write.properties.merge_condition` to `score` which tells the connector to use the column `score` as the condition. - - Make sure that the Spark connector use Stream Load interface to load data, rather than Stream Load transaction interface, because the latter does not support this feature. - - ```SQL - CREATE TABLE `score_board` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.score_board", - "starrocks.user"="root", - "starrocks.password"="", - "starrocks.write.properties.merge_condition"="score" - ); - ``` - -3. Insert data to the table in Spark SQL client, and update the row whose `id` is 1 with a smaller score value, and the row whose `id` is 2 with a larger score value. - - ```SQL - INSERT INTO `score_board` VALUES (1, 'starrocks-update', 99), (2, 'spark-update', 101); - ``` - -4. Query the StarRocks table in MySQL client. - - You can see that only the row whose `id` is 2 changes, and the row whose `id` is 1 does not change. 
- - ```SQL - mysql> select * from score_board; - +------+--------------+-------+ - | id | name | score | - +------+--------------+-------+ - | 1 | starrocks | 100 | - | 2 | spark-update | 101 | - +------+--------------+-------+ - 2 rows in set (0.03 sec) - ``` - -### Load data into columns of BITMAP type - -[`BITMAP`](../sql-reference/data-types/other-data-types/BITMAP.md) is often used to accelerate count distinct, such as counting UV, see [Use Bitmap for exact Count Distinct](../using_starrocks/distinct_values/Using_bitmap.md). -Here we take the counting of UV as an example to show how to load data into columns of the `BITMAP` type. **`BITMAP` is supported since version 1.1.1**. - -1. Create a StarRocks Aggregate table. - - In the database `test`, create an Aggregate table `page_uv` where the column `visit_users` is defined as the `BITMAP` type and configured with the aggregate function `BITMAP_UNION`. - - ```SQL - CREATE TABLE `test`.`page_uv` ( - `page_id` INT NOT NULL COMMENT 'page ID', - `visit_date` datetime NOT NULL COMMENT 'access time', - `visit_users` BITMAP BITMAP_UNION NOT NULL COMMENT 'user ID' - ) ENGINE=OLAP - AGGREGATE KEY(`page_id`, `visit_date`) - DISTRIBUTED BY HASH(`page_id`); - ``` - -2. Create a Spark table. - - The schema of the Spark table is inferred from the StarRocks table, and the Spark does not support the `BITMAP` type. So you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md) function to convert the data of `BIGINT` type into `BITMAP` type. - - Run the following DDL in `spark-sql`: - - ```SQL - CREATE TABLE `page_uv` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.page_uv", - "starrocks.user"="root", - "starrocks.password"="", - "starrocks.column.types"="visit_users BIGINT" - ); - ``` - -3. Load data into StarRocks table. - - Run the following DML in `spark-sql`: - - ```SQL - INSERT INTO `page_uv` VALUES - (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 13), - (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23), - (1, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 33), - (1, CAST('2020-06-23 02:30:30' AS TIMESTAMP), 13), - (2, CAST('2020-06-23 01:30:30' AS TIMESTAMP), 23); - ``` - -4. Calculate page UVs from the StarRocks table. - - ```SQL - MySQL [test]> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `page_uv` GROUP BY `page_id`; - +---------+-----------------------------+ - | page_id | count(DISTINCT visit_users) | - +---------+-----------------------------+ - | 2 | 1 | - | 1 | 3 | - +---------+-----------------------------+ - 2 rows in set (0.01 sec) - ``` - -> **NOTICE:** -> -> The connector uses [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md) -> function to convert data of the `TINYINT`, `SMALLINT`, `INTEGER`, and `BIGINT` types in Spark to the `BITMAP` type in StarRocks, and uses -> [`bitmap_hash`](../sql-reference/sql-functions/bitmap-functions/bitmap_hash.md) or [`bitmap_hash64`](../sql-reference/sql-functions/bitmap-functions/bitmap_hash64.md) function for other Spark data types. 
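-
-To make the conversion above concrete, the following sketch shows roughly what the connector's load amounts to on the StarRocks side for a single row of the example. This is an illustrative manual equivalent, not the statement the connector actually executes:
-
-```SQL
--- Hypothetical manual equivalent of loading user ID 13 for page 1:
--- the BIGINT value is wrapped with to_bitmap() before it is stored
--- in the BITMAP column `visit_users`.
-INSERT INTO `test`.`page_uv` VALUES
-(1, '2020-06-23 01:30:30', to_bitmap(13));
-```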
- -### Load data into columns of HLL type - -[`HLL`](../sql-reference/data-types/other-data-types/HLL.md) can be used for approximate count distinct, see [Use HLL for approximate count distinct](../using_starrocks/distinct_values/Using_HLL.md). - -Here we take the counting of UV as an example to show how to load data into columns of the `HLL` type. **`HLL` is supported since version 1.1.1**. - -1. Create a StarRocks Aggregate table. - - In the database `test`, create an Aggregate table `hll_uv` where the column `visit_users` is defined as the `HLL` type and configured with the aggregate function `HLL_UNION`. - - ```SQL - CREATE TABLE `hll_uv` ( - `page_id` INT NOT NULL COMMENT 'page ID', - `visit_date` datetime NOT NULL COMMENT 'access time', - `visit_users` HLL HLL_UNION NOT NULL COMMENT 'user ID' - ) ENGINE=OLAP - AGGREGATE KEY(`page_id`, `visit_date`) - DISTRIBUTED BY HASH(`page_id`); - ``` - -2. Create a Spark table. - - The schema of the Spark table is inferred from the StarRocks table, and the Spark does not support the `HLL` type. So you need to customize the corresponding column data type in Spark, for example as `BIGINT`, by configuring the option `"starrocks.column.types"="visit_users BIGINT"`. When using Stream Load to ingest data, the connector uses the [`hll_hash`](../sql-reference/sql-functions/scalar-functions/hll_hash.md) function to convert the data of `BIGINT` type into `HLL` type. - - Run the following DDL in `spark-sql`: - - ```SQL - CREATE TABLE `hll_uv` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.hll_uv", - "starrocks.user"="root", - "starrocks.password"="", - "starrocks.column.types"="visit_users BIGINT" - ); - ``` - -3. Load data into StarRocks table. - - Run the following DML in `spark-sql`: - - ```SQL - INSERT INTO `hll_uv` VALUES - (3, CAST('2023-07-24 12:00:00' AS TIMESTAMP), 78), - (4, CAST('2023-07-24 13:20:10' AS TIMESTAMP), 2), - (3, CAST('2023-07-24 12:30:00' AS TIMESTAMP), 674); - ``` - -4. Calculate page UVs from the StarRocks table. - - ```SQL - MySQL [test]> SELECT `page_id`, COUNT(DISTINCT `visit_users`) FROM `hll_uv` GROUP BY `page_id`; - +---------+-----------------------------+ - | page_id | count(DISTINCT visit_users) | - +---------+-----------------------------+ - | 4 | 1 | - | 3 | 2 | - +---------+-----------------------------+ - 2 rows in set (0.01 sec) - ``` - -### Load data into columns of ARRAY type - -The following example explains how to load data into columns of the [`ARRAY`](../sql-reference/data-types/semi_structured/Array.md) type. - -1. Create a StarRocks table. - - In the database `test`, create a Primary Key table `array_tbl` that includes one `INT` column and two `ARRAY` columns. - - ```SQL - CREATE TABLE `array_tbl` ( - `id` INT NOT NULL, - `a0` ARRAY, - `a1` ARRAY> - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`) - ; - ``` - -2. Write data to StarRocks. - - Because some versions of StarRocks does not provide the metadata of `ARRAY` column, the connector can not infer the corresponding Spark data type for this column. However, you can explicitly specify the corresponding Spark data type of the column in the option `starrocks.column.types`. In this example, you can configure the option as `a0 ARRAY,a1 ARRAY>`. 
- - Run the following codes in `spark-shell`: - - ```scala - val data = Seq( - | (1, Seq("hello", "starrocks"), Seq(Seq(1, 2), Seq(3, 4))), - | (2, Seq("hello", "spark"), Seq(Seq(5, 6, 7), Seq(8, 9, 10))) - | ) - val df = data.toDF("id", "a0", "a1") - df.write - .format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.array_tbl") - .option("starrocks.user", "root") - .option("starrocks.password", "") - .option("starrocks.column.types", "a0 ARRAY,a1 ARRAY>") - .mode("append") - .save() - ``` - -3. Query data in the StarRocks table. - - ```SQL - MySQL [test]> SELECT * FROM `array_tbl`; - +------+-----------------------+--------------------+ - | id | a0 | a1 | - +------+-----------------------+--------------------+ - | 1 | ["hello","starrocks"] | [[1,2],[3,4]] | - | 2 | ["hello","spark"] | [[5,6,7],[8,9,10]] | - +------+-----------------------+--------------------+ - 2 rows in set (0.01 sec) - ``` diff --git a/docs/en/loading/SparkLoad.md b/docs/en/loading/SparkLoad.md deleted file mode 100644 index 8794ce3..0000000 --- a/docs/en/loading/SparkLoad.md +++ /dev/null @@ -1,537 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Load data in bulk using Spark Load - -This load uses external Apache Spark™ resources to pre-process imported data, which improves import performance and saves compute resources. It is mainly used for **initial migration** and **large data import** into StarRocks (data volume up to TB level). - -Spark load is an **asynchronous** import method that requires users to create Spark-type import jobs via the MySQL protocol and view the import results using `SHOW LOAD`. - -> **NOTICE** -> -> - Only users with the INSERT privilege on a StarRocks table can load data into this table. You can follow the instructions provided in [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) to grant the required privilege. -> - Spark Load can not be used to load data into a Primary Key table. - -## Terminology explanation - -- **Spark ETL**: Mainly responsible for ETL of data in the import process, including global dictionary construction (BITMAP type), partitioning, sorting, aggregation, etc. -- **Broker**: Broker is an independent stateless process. It encapsulates the file system interface and provides StarRocks with the ability to read files from remote storage systems. -- **Global Dictionary**: Saves the data structure that maps data from the original value to the encoded value. The original value can be any data type, while the encoded value is an integer. The global dictionary is mainly used in scenarios where exact count distinct is precomputed. - -## Fundamentals - -The user submits a Spark type import job through the MySQL client;the FE records the metadata and returns the submission result. - -The execution of the spark load task is divided into the following main phases. - -1. The user submits the spark load job to the FE. -2. The FE schedules the submission of the ETL task to the Apache Spark™ cluster for execution. -3. The Apache Spark™ cluster executes the ETL task that includes global dictionary construction (BITMAP type), partitioning, sorting, aggregation, etc. -4. After the ETL task is completed, the FE gets the data path of each preprocessed slice and schedules the relevant BE to execute the Push task. -5. The BE reads data through Broker process from HDFS and converts it into StarRocks storage format. 
> If you choose not to use the Broker process, the BE reads data from HDFS directly.
-6. The FE schedules the new data version to take effect and completes the import job.
-
-The following diagram illustrates the main flow of Spark Load.
-
-![Spark load](../_assets/4.3.2-1.png)
-
----
-
-## Global Dictionary
-
-### Applicable Scenarios
-
-Currently, the BITMAP column in StarRocks is implemented using RoaringBitmap, which accepts only integers as the input data type. So if you want to precompute the BITMAP column during the import process, you need to convert the input data type to integer.
-
-In the existing import process of StarRocks, the data structure of the global dictionary is implemented based on a Hive table, which saves the mapping from the original values to the encoded values.
-
-### Build Process
-
-1. Read the data from the upstream data source and generate a temporary Hive table, named `hive-table`.
-2. Extract the distinct values of the fields to be deduplicated from `hive-table` to generate a new Hive table named `distinct-value-table`.
-3. Create a new global dictionary table named `dict-table` with one column for the original values and one column for the encoded values.
-4. Left join `distinct-value-table` with `dict-table`, and then use a window function to encode the new values. Finally, both the original values and the encoded values of the deduplicated column are written back to `dict-table`.
-5. Join `dict-table` with `hive-table` to replace the original values in `hive-table` with the integer encoded values.
-6. `hive-table` is read again in the next round of data pre-processing and then imported into StarRocks after calculation.
-
-## Data Pre-processing
-
-The basic process of data pre-processing is as follows:
-
-1. Read data from the upstream data source (HDFS file or Hive table).
-2. Complete field mapping and calculation for the read data, then generate `bucket-id` based on the partition information.
-3. Generate a RollupTree based on the Rollup metadata of the StarRocks table.
-4. Iterate through the RollupTree and perform hierarchical aggregation operations. The Rollup of the next hierarchy can be calculated from the Rollup of the previous hierarchy.
-5. Each time the aggregation calculation is completed, the data is bucketed according to `bucket-id` and then written to HDFS.
-6. The subsequent Broker process pulls the files from HDFS and imports them into the StarRocks BE nodes.
-
-## Basic Operations
-
-### Configuring ETL Clusters
-
-Apache Spark™ is used as an external computational resource in StarRocks for ETL work. There may be other external resources added to StarRocks, such as Spark/GPU for query, HDFS/S3 for external storage, MapReduce for ETL, etc. Therefore, we introduce `Resource Management` to manage these external resources used by StarRocks.
-
-Before submitting an Apache Spark™ import job, configure the Apache Spark™ cluster for performing ETL tasks.
The syntax for operation is as follows: - -~~~sql --- create Apache Spark™ resource -CREATE EXTERNAL RESOURCE resource_name -PROPERTIES -( - type = spark, - spark_conf_key = spark_conf_value, - working_dir = path, - broker = broker_name, - broker.property_key = property_value -); - --- drop Apache Spark™ resource -DROP RESOURCE resource_name; - --- show resources -SHOW RESOURCES -SHOW PROC "/resources"; - --- privileges -GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identityGRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name; -REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identityREVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name; -~~~ - -- Create resource - -**For example**: - -~~~sql --- yarn cluster mode -CREATE EXTERNAL RESOURCE "spark0" -PROPERTIES -( - "type" = "spark", - "spark.master" = "yarn", - "spark.submit.deployMode" = "cluster", - "spark.jars" = "xxx.jar,yyy.jar", - "spark.files" = "/tmp/aaa,/tmp/bbb", - "spark.executor.memory" = "1g", - "spark.yarn.queue" = "queue0", - "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999", - "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000", - "working_dir" = "hdfs://127.0.0.1:10000/tmp/starrocks", - "broker" = "broker0", - "broker.username" = "user0", - "broker.password" = "password0" -); - --- yarn HA cluster mode -CREATE EXTERNAL RESOURCE "spark1" -PROPERTIES -( - "type" = "spark", - "spark.master" = "yarn", - "spark.submit.deployMode" = "cluster", - "spark.hadoop.yarn.resourcemanager.ha.enabled" = "true", - "spark.hadoop.yarn.resourcemanager.ha.rm-ids" = "rm1,rm2", - "spark.hadoop.yarn.resourcemanager.hostname.rm1" = "host1", - "spark.hadoop.yarn.resourcemanager.hostname.rm2" = "host2", - "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000", - "working_dir" = "hdfs://127.0.0.1:10000/tmp/starrocks", - "broker" = "broker1" -); -~~~ - -`resource-name` is the name of the Apache Spark™ resource configured in StarRocks. - -`PROPERTIES` includes parameters relating to the Apache Spark™ resource, as follows: -> **Note** -> -> For detailed description of Apache Spark™ resource PROPERTIES, please see [CREATE RESOURCE](../sql-reference/sql-statements/Resource/CREATE_RESOURCE.md) - -- Spark related parameters: - - `type`: Resource type, required, currently only supports `spark`. - - `spark.master`: Required, currently only supports `yarn`. - - `spark.submit.deployMode`: The deployment mode of the Apache Spark™ program, required, currently supports both `cluster` and `client`. - - `spark.hadoop.fs.defaultFS`: Required if master is yarn. - - Parameters related to yarn resource manager, required. - - one ResourceManager on a single node - `spark.hadoop.yarn.resourcemanager.address`: Address of the single point resource manager. - - ResourceManager HA - > You can choose to specify ResourceManager's hostname or address. - - `spark.hadoop.yarn.resourcemanager.ha.enabled`: Enable the resource manager HA, set to `true`. - - `spark.hadoop.yarn.resourcemanager.ha.rm-ids`: list of resource manager logical ids. - - `spark.hadoop.yarn.resourcemanager.hostname.rm-id`: For each rm-id, specify the hostname corresponding to the resource manager. - - `spark.hadoop.yarn.resourcemanager.address.rm-id`: For each rm-id, specify `host:port` for the client to submit jobs to. - -- `*working_dir`: The directory used by ETL. Required if Apache Spark™ is used as an ETL resource. For example: `hdfs://host:port/tmp/starrocks`. - -- Broker related parameters: - - `broker`: Broker name. 
-
-**Precaution**:
-
-The above describes the parameters for loading through a Broker process. If you intend to load data without a Broker process, note the following:
-
-- You do not need to specify `broker`.
-- If you need to configure user authentication and NameNode HA, configure the parameters in the **hdfs-site.xml** file of the HDFS cluster (see [broker_properties](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#hdfs) for parameter descriptions), and copy the **hdfs-site.xml** file to **$FE_HOME/conf** on each FE and **$BE_HOME/conf** on each BE.
-
-> Note
->
-> If the HDFS file can only be accessed by a specific user, you still need to specify the HDFS username in `broker.username` and the user password in `broker.password`.
-
-- View resources
-
-Regular accounts can only view the resources to which they have `USAGE_PRIV` access. The root and admin accounts can view all resources.
-
-- Resource permissions
-
-Resource permissions are managed through `GRANT` and `REVOKE`, which currently only support the `USAGE_PRIV` privilege. You can grant `USAGE_PRIV` to a user or a role.
-
-~~~sql
--- Grant access to the spark0 resource to user0
-GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
-
--- Grant access to the spark0 resource to role0
-GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
-
--- Grant access to all resources to user0
-GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
-
--- Grant access to all resources to role0
-GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
-
--- Revoke the usage privilege on the spark0 resource from user0
-REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
-~~~
-
-### Configuring Spark Client
-
-Configure the Spark client for the FE so that the FE can submit Spark tasks by executing the `spark-submit` command. It is recommended to use the official Spark2 release, version 2.4.5 or later ([Spark download address](https://archive.apache.org/dist/spark/)). After downloading, use the following steps to complete the configuration.
-
-- **Configure `SPARK_HOME`**
-
-Place the Spark client in a directory on the same machine as the FE, and set `spark_home_default_dir` in the FE configuration file to this directory. The default value is the `lib/spark2x` path in the FE root directory, and it cannot be empty.
-
-- **Configure the Spark dependency package**
-
-To configure the dependency package, zip and archive all jar files in the `jars` folder under the Spark client, and set the `spark_resource_path` item in the FE configuration to this zip file. If this configuration is empty, the FE tries to find the `lib/spark2x/jars/spark-2x.zip` file in the FE root directory. If the FE fails to find it, it reports an error.
-
-When a Spark Load job is submitted, the archived dependency files are uploaded to the remote repository. The default repository path is under the `working_dir/{cluster_id}` directory and is named `--spark-repository--{resource-name}`, which means that a resource in the cluster corresponds to one remote repository.
The directory structure is referenced as follows: - -~~~bash ----spark-repository--spark0/ - - |---archive-1.0.0/ - - | |\---lib-990325d2c0d1d5e45bf675e54e44fb16-spark-dpp-1.0.0\-jar-with-dependencies.jar - - | |\---lib-7670c29daf535efe3c9b923f778f61fc-spark-2x.zip - - |---archive-1.1.0/ - - | |\---lib-64d5696f99c379af2bee28c1c84271d5-spark-dpp-1.1.0\-jar-with-dependencies.jar - - | |\---lib-1bbb74bb6b264a270bc7fca3e964160f-spark-2x.zip - - |---archive-1.2.0/ - - | |-... - -~~~ - -In addition to the spark dependencies (named `spark-2x.zip` by default), the FE also uploads the DPP dependencies to the remote repository. If all the dependencies submitted by the spark load already exist in the remote repository, then there is no need to upload the dependencies again, saving the time of repeatedly uploading a large number of files each time. - -### Configuring YARN Client - -Configure the yarn client for FE so that the FE can execute yarn commands to get the status of the running application or kill it.It is recommended to use the official version of Hadoop2 2.5.2 or above ([hadoop download address](https://archive.apache.org/dist/hadoop/common/)). After downloading, please use the following steps to complete the configuration: - -- **Configure the YARN executable path** - -Place the downloaded yarn client in a directory on the same machine as the FE, and configure the `yarn_client_path` item in the FE configuration file to the binary executable file of yarn, which by default is the `lib/yarn-client/hadoop/bin/yarn` path in the FE root directory. - -- **Configure the path to the configuration file needed to generate YARN (optional)** - -When the FE goes through the yarn client to get the status of the application, or to kill the application, by default StarRocks generates the configuration file required to execute the yarn command in the `lib/yarn-config` path of the FE root directory This path can be modified by configuring the `yarn_config_dir` entry in the FE configuration file, which currently includes `core-site.xml` and `yarn-site.xml`. - -### Create Import Job - -**Syntax:** - -~~~sql -LOAD LABEL load_label - (data_desc, ...) -WITH RESOURCE resource_name -[resource_properties] -[PROPERTIES (key1=value1, ... )] - -* load_label: - db_name.label_name - -* data_desc: - DATA INFILE ('file_path', ...) - [NEGATIVE] - INTO TABLE tbl_name - [PARTITION (p1, p2)] - [COLUMNS TERMINATED BY separator ] - [(col1, ...)] - [COLUMNS FROM PATH AS (col2, ...)] - [SET (k1=f1(xx), k2=f2(xx))] - [WHERE predicate] - - DATA FROM TABLE hive_external_tbl - [NEGATIVE] - INTO TABLE tbl_name - [PARTITION (p1, p2)] - [SET (k1=f1(xx), k2=f2(xx))] - [WHERE predicate] - -* resource_properties: - (key2=value2, ...) -~~~ - -**Example 1**: The case where the upstream data source is HDFS - -~~~sql -LOAD LABEL db1.label1 -( - DATA INFILE("hdfs://abc.com:8888/user/starrocks/test/ml/file1") - INTO TABLE tbl1 - COLUMNS TERMINATED BY "," - (tmp_c1,tmp_c2) - SET - ( - id=tmp_c2, - name=tmp_c1 - ), - DATA INFILE("hdfs://abc.com:8888/user/starrocks/test/ml/file2") - INTO TABLE tbl2 - COLUMNS TERMINATED BY "," - (col1, col2) - where col1 > 1 -) -WITH RESOURCE 'spark0' -( - "spark.executor.memory" = "2g", - "spark.shuffle.compress" = "true" -) -PROPERTIES -( - "timeout" = "3600" -); -~~~ - -**Example 2**: The case where the upstream data source is Hive. 
- -- Step 1: Create a new hive resource - -~~~sql -CREATE EXTERNAL RESOURCE hive0 -PROPERTIES -( - "type" = "hive", - "hive.metastore.uris" = "thrift://xx.xx.xx.xx:8080" -); - ~~~ - -- Step 2: Create a new hive external table - -~~~sql -CREATE EXTERNAL TABLE hive_t1 -( - k1 INT, - K2 SMALLINT, - k3 varchar(50), - uuid varchar(100) -) -ENGINE=hive -PROPERTIES -( - "resource" = "hive0", - "database" = "tmp", - "table" = "t1" -); - ~~~ - -- Step 3: Submit the load command, requiring that the columns in the imported StarRocks table exist in the hive external table. - -~~~sql -LOAD LABEL db1.label1 -( - DATA FROM TABLE hive_t1 - INTO TABLE tbl1 - SET - ( - uuid=bitmap_dict(uuid) - ) -) -WITH RESOURCE 'spark0' -( - "spark.executor.memory" = "2g", - "spark.shuffle.compress" = "true" -) -PROPERTIES -( - "timeout" = "3600" -); - ~~~ - -Introduction to the parameters in the Spark load: - -- **Label** - -Label of the import job. Each import job has a Label that is unique within the database, following the same rules as broker load. - -- **Data description class parameters** - -Currently, supported data sources are CSV and Hive table. Other rules are the same as broker load. - -- **Import Job Parameters** - -Import job parameters refer to the parameters belonging to the `opt_properties` section of the import statement. These parameters are applicable to the entire import job. The rules are the same as broker load. - -- **Spark Resource Parameters** - -Spark resources need to be configured into StarRocks in advance and users need to be given USAGE-PRIV permissions before they can apply the resources to Spark load. -Spark resource parameters can be set when the user has a temporary need, such as adding resources for a job and modifying Spark configs. The setting only takes effect on this job and does not affect the existing configurations in the StarRocks cluster. - -~~~sql -WITH RESOURCE 'spark0' -( - "spark.driver.memory" = "1g", - "spark.executor.memory" = "3g" -) -~~~ - -- **Import when the data source is Hive** - -Currently, to use a Hive table in the import process, you need to create an external table of the `Hive` type and then specify its name when submitting the import command. - -- **Import process to build a global dictionary** - -In the load command, you can specify the required fields for building the global dictionary in the following format: `StarRocks field name=bitmap_dict(hive table field name)` Note that currently **the global dictionary is only supported when the upstream data source is a Hive table**. - -- **Load binary type data** - -Since v2.5.17, Spark Load supports the bitmap_from_binary function, which can convert binary data into bitmap data. If the column type of the Hive table or HDFS file is binary and the corresponding column in the StarRocks table is a bitmap-type aggregate column, you can specify the fields in the load command in the following format, `StarRocks field name=bitmap_from_binary(Hive table field name)`. This eliminates the need for building a global dictionary. - -## Viewing Import Jobs - -The Spark load import is asynchronous, as is the broker load. The user must record the label of the import job and use it in the `SHOW LOAD` command to view the import results. The command to view the import is common to all import methods. The example is as follows. - -Refer to Broker Load for a detailed explanation of returned parameters.The differences are as follows. - -~~~sql -mysql> show load order by createtime desc limit 1\G -*************************** 1. 
row ***************************
-         JobId: 76391
-         Label: label1
-         State: FINISHED
-      Progress: ETL:100%; LOAD:100%
-          Type: SPARK
-       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
-      TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
-      ErrorMsg: N/A
-    CreateTime: 2019-07-27 11:46:42
-  EtlStartTime: 2019-07-27 11:46:44
- EtlFinishTime: 2019-07-27 11:49:44
- LoadStartTime: 2019-07-27 11:49:44
-LoadFinishTime: 2019-07-27 11:50:16
-           URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
-    JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
-~~~
-
-- **State**
-
-The current stage of the import job:
-
-- PENDING: The job is committed.
-- ETL: The Spark ETL job is committed.
-- LOADING: The FE schedules a BE to execute the push operation.
-- FINISHED: The push is completed and the version takes effect.
-
-There are two final states of an import job, `CANCELLED` and `FINISHED`, both indicating that the load job is completed. `CANCELLED` indicates import failure and `FINISHED` indicates import success.
-
-- **Progress**
-
-A description of the import job progress. There are two types of progress, ETL and LOAD, which correspond to the two phases of the import process, ETL and LOADING.
-
-- The range of progress for LOAD is 0~100%.
-
-  `LOAD progress = the number of tablets that have completed the import on all replicas / the total number of tablets of this import job * 100%`.
-
-- If all tables have been imported, the LOAD progress is 99%, and it changes to 100% when the import enters the final validation phase.
-
-- The import progress is not linear. If the progress does not change for a period of time, it does not mean that the import is not executing.
-
-- **Type**
-
-  The type of the import job. SPARK for Spark Load.
-
-- **CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime**
-
-These values represent the time when the import was created, when the ETL phase started, when the ETL phase completed, when the LOADING phase started, and when the entire import job was completed.
-
-- **JobDetails**
-
-Displays the detailed running status of the job, including the number of imported files, the total size (in bytes), the number of subtasks, the number of raw rows being processed, and so on. For example:
-
-~~~json
- {"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}
-~~~
-
-- **URL**
-
-You can copy this value into your browser to access the web interface of the corresponding application.
-
-### View Apache Spark™ Launcher commit logs
-
-Sometimes users need to view the detailed logs generated during an Apache Spark™ job commit. By default, the logs are saved in the `log/spark_launcher_log` path in the FE root directory and are named `spark-launcher-{load-job-id}-{label}.log`. The logs are kept in this directory for a period of time and are erased when the import information in the FE metadata is cleaned up. The default retention time is 3 days.
-
-### Cancel Import
-
-When the Spark Load job status is not `CANCELLED` or `FINISHED`, it can be cancelled manually by specifying the label of the import job.
-
----
-
-## Related System Configurations
-
-**FE Configuration:** The following configurations are the system-level configurations of Spark Load, which apply to all Spark Load import jobs. The values can be adjusted mainly by modifying `fe.conf` (see the example after this list).
-
-- `enable_spark_load`: Enables Spark Load and resource creation. The default value is `false`.
-- `spark_load_default_timeout_second`: The default timeout for the job is 259200 seconds (3 days).
-- `spark_home_default_dir`: The Spark client path (`fe/lib/spark2x`).
-- `spark_resource_path`: The path to the packaged Spark dependency file (empty by default).
-- `spark_launcher_log_dir`: The directory where the commit log of the Spark client is stored (`fe/log/spark_launcher_log`).
-- `yarn_client_path`: The path to the yarn binary executable (`fe/lib/yarn-client/hadoop/bin/yarn`).
-- `yarn_config_dir`: The path where yarn configuration files are generated (`fe/lib/yarn-config`).
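-
-As an illustration only, these items would appear in `fe.conf` as follows. The values and the absolute paths (assuming an FE installed under `/opt/starrocks/fe`) are assumptions, not recommendations:
-
-~~~properties
-# fe.conf  (illustrative values only)
-enable_spark_load = true
-spark_load_default_timeout_second = 259200
-spark_home_default_dir = /opt/starrocks/fe/lib/spark2x
-spark_resource_path = /opt/starrocks/fe/lib/spark2x/jars/spark-2x.zip
-spark_launcher_log_dir = /opt/starrocks/fe/log/spark_launcher_log
-yarn_client_path = /opt/starrocks/fe/lib/yarn-client/hadoop/bin/yarn
-yarn_config_dir = /opt/starrocks/fe/lib/yarn-config
-~~~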
-
----
-
-## Best Practices
-
-The most suitable scenario for Spark Load is when the raw data is in a file system (HDFS) and the data volume is at the level of tens of GB to TB. Use Stream Load or Broker Load for smaller data volumes.
-
-For a full Spark Load import example, refer to the demo on GitHub: [https://github.com/StarRocks/demo/blob/master/docs/03_sparkLoad2StarRocks.md](https://github.com/StarRocks/demo/blob/master/docs/03_sparkLoad2StarRocks.md)
-
-## FAQs
-
-- `Error: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.`
-
-  Spark Load was used without the `HADOOP_CONF_DIR` environment variable configured in `spark-env.sh` of the Spark client.
-
-- `Error: Cannot run program "xxx/bin/spark-submit": error=2, No such file or directory`
-
-  The `spark_home_default_dir` configuration item does not specify the Spark client root directory.
-
-- `Error: File xxx/jars/spark-2x.zip does not exist.`
-
-  The `spark_resource_path` configuration item does not point to the packaged zip file.
-
-- `Error: yarn client does not exist in path: xxx/yarn-client/hadoop/bin/yarn`
-
-  The `yarn_client_path` configuration item does not specify the yarn executable.
-
-- `ERROR: Cannot execute hadoop-yarn/bin/... /libexec/yarn-config.sh`
-
-  When using Hadoop with CDH, you need to configure the `HADOOP_LIBEXEC_DIR` environment variable.
-  Because the `hadoop-yarn` and `hadoop` directories are different, `libexec` is by default looked up under `hadoop-yarn/bin/... /libexec`, while `libexec` actually resides in the `hadoop` directory. As a result, the `yarn application status` command used to get the Spark task status reports an error, which causes the import job to fail.
diff --git a/docs/en/loading/StreamLoad.md b/docs/en/loading/StreamLoad.md
deleted file mode 100644
index 2095258..0000000
--- a/docs/en/loading/StreamLoad.md
+++ /dev/null
@@ -1,545 +0,0 @@
----
-displayed_sidebar: docs
-keywords: ['Stream Load']
----
-
-# Load data from a local file system
-
-import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx'
-
-StarRocks provides two methods of loading data from a local file system:
-
-- Synchronous loading using [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)
-- Asynchronous loading using [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)
-
-Each of these options has its own advantages:
-
-- Stream Load supports CSV and JSON file formats. This method is recommended if you want to load data from a small number of files whose individual sizes do not exceed 10 GB.
-- Broker Load supports Parquet, ORC, CSV, and JSON file formats (JSON file format is supported from v3.2.3 onwards). This method is recommended if you want to load data from a large number of files whose individual sizes exceed 10 GB, or if the files are stored in a network attached storage (NAS) device.
**Using Broker Load to load data from a local file system is supported from v2.5 onwards.** - -For CSV data, take note of the following points: - -- You can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes as a text delimiter. -- Null values are denoted by using `\N`. For example, a data file consists of three columns, and a record from that data file holds data in the first and third columns but no data in the second column. In this situation, you need to use `\N` in the second column to denote a null value. This means the record must be compiled as `a,\N,b` instead of `a,,b`. `a,,b` denotes that the second column of the record holds an empty string. - -Stream Load and Broker Load both support data transformation at data loading and supports data changes made by UPSERT and DELETE operations during data loading. For more information, see [Transform data at loading](../loading/Etl_in_loading.md) and [Change data through loading](../loading/Load_to_Primary_Key_tables.md). - -## Before you begin - -### Check privileges - - - -#### Check network configuration - -Make sure that the machine on which the data you want to load resides can access the FE and BE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`) , respectively. - -## Loading from a local file system via Stream Load - -Stream Load is an HTTP PUT-based synchronous loading method. After you submit a load job, StarRocks synchronously runs the job, and returns the result of the job after the job finishes. You can determine whether the job is successful based on the job result. - -> **NOTICE** -> -> After you load data into a StarRocks table by using Stream Load, the data of the materialized views that are created on that table is also updated. - -### How it works - -You can submit a load request on your client to an FE according to HTTP, and the FE then uses an HTTP redirect to forward the load request to a specific BE or CN. You can also directly submit a load request on your client to a BE or CN of your choice. - -:::note - -If you submit load requests to an FE, the FE uses a polling mechanism to decide which BE or CN will serve as a coordinator to receive and process the load requests. The polling mechanism helps achieve load balancing within your StarRocks cluster. Therefore, we recommend that you send load requests to an FE. - -::: - -The BE or CN that receives the load request runs as the Coordinator BE or CN to split data based on the used schema into portions and assign each portion of the data to the other involved BEs or CNs. After the load finishes, the Coordinator BE or CN returns the result of the load job to your client. Note that if you stop the Coordinator BE or CN during the load, the load job fails. - -The following figure shows the workflow of a Stream Load job. - -![Workflow of Stream Load](../_assets/4.2-1.png) - -### Limits - -Stream Load does not support loading the data of a CSV file that contains a JSON-formatted column. - -### Typical example - -This section uses curl as an example to describe how to load the data of a CSV or JSON file from your local file system into StarRocks. For detailed syntax and parameter descriptions, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). 
- -Note that in StarRocks some literals are used as reserved keywords by the SQL language. Do not directly use these keywords in SQL statements. If you want to use such a keyword in an SQL statement, enclose it in a pair of backticks (`). See [Keywords](../sql-reference/sql-statements/keywords.md). - -#### Load CSV data - -##### Prepare datasets - -In your local file system, create a CSV file named `example1.csv`. The file consists of three columns, which represent the user ID, user name, and user score in sequence. - -```Plain -1,Lily,23 -2,Rose,23 -3,Alice,24 -4,Julia,25 -``` - -##### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a Primary Key table named `table1`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - -```SQL -CREATE TABLE `table1` -( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY HASH(`id`); -``` - -:::note - -Since v2.5.7, StarRocks can automatically set the number of buckets (BUCKETS) when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -::: - -##### Start a Stream Load - -Run the following command to load the data of `example1.csv` into `table1`: - -```Bash -curl --location-trusted -u : -H "label:123" \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: id, name, score" \ - -T example1.csv -XPUT \ - http://:/api/mydatabase/table1/_stream_load -``` - -:::note - -- If you use an account for which no password is set, you need to input only `:`. -- You can use [SHOW FRONTENDS](../sql-reference/sql-statements/cluster-management/nodes_processes/SHOW_FRONTENDS.md) to view the IP address and HTTP port of the FE node. - -::: - -`example1.csv` consists of three columns, which are separated by commas (,) and can be mapped in sequence onto the `id`, `name`, and `score` columns of `table1`. Therefore, you need to use the `column_separator` parameter to specify the comma (,) as the column separator. You also need to use the `columns` parameter to temporarily name the three columns of `example1.csv` as `id`, `name`, and `score`, which are mapped in sequence onto the three columns of `table1`. - -After the load is complete, you can query `table1` to verify that the load is successful: - -```SQL -SELECT * FROM table1; -+------+-------+-------+ -| id | name | score | -+------+-------+-------+ -| 1 | Lily | 23 | -| 2 | Rose | 23 | -| 3 | Alice | 24 | -| 4 | Julia | 25 | -+------+-------+-------+ -4 rows in set (0.00 sec) -``` - -#### Load JSON data - -Since v3.2.7, Stream Load supports compressing JSON data during transmission, reducing network bandwidth overhead. Users can specify different compression algorithms using parameters `compression` and `Content-Encoding`. Supported compression algorithms including GZIP, BZIP2, LZ4_FRAME, and ZSTD. For the syntax, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -##### Prepare datasets - -In your local file system, create a JSON file named `example2.json`. The file consists of two columns, which represent city ID and city name in sequence. 
- -```JSON -{"name": "Beijing", "code": 2} -``` - -##### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a Primary Key table named `table2`. The table consists of two columns: `id` and `city`, of which `id` is the primary key. - -```SQL -CREATE TABLE `table2` -( - `id` int(11) NOT NULL COMMENT "city ID", - `city` varchar(65533) NULL COMMENT "city name" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY HASH(`id`); -``` - -:::note - -Since v2.5.7, StarRocks can set the number of(BUCKETS) automatically when you create a table or add a partition. You no longer need to manually set the number of buckets. For detailed information, see [set the number of buckets](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets). - -::: - -##### Start a Stream Load - -Run the following command to load the data of `example2.json` into `table2`: - -```Bash -curl -v --location-trusted -u : -H "strict_mode: true" \ - -H "Expect:100-continue" \ - -H "format: json" -H "jsonpaths: [\"$.name\", \"$.code\"]" \ - -H "columns: city,tmp_id, id = tmp_id * 100" \ - -T example2.json -XPUT \ - http://:/api/mydatabase/table2/_stream_load -``` - -:::note - -- If you use an account for which no password is set, you need to input only `:`. -- You can use [SHOW FRONTENDS](../sql-reference/sql-statements/cluster-management/nodes_processes/SHOW_FRONTENDS.md) to view the IP address and HTTP port of the FE node. - -::: - -`example2.json` consists of two keys, `name` and `code`, which are mapped onto the `id` and `city` columns of `table2`, as shown in the following figure. - -![JSON - Column Mapping](../_assets/4.2-2.png) - -The mappings shown in the preceding figure are described as follows: - -- StarRocks extracts the `name` and `code` keys of `example2.json` and maps them onto the `name` and `code` fields declared in the `jsonpaths` parameter. - -- StarRocks extracts the `name` and `code` fields declared in the `jsonpaths` parameter and **maps them in sequence** onto the `city` and `tmp_id` fields declared in the `columns` parameter. - -- StarRocks extracts the `city` and `tmp_id` fields declared in the `columns` parameter and **maps them by name** onto the `city` and `id` columns of `table2`. - -:::note - -In the preceding example, the value of `code` in `example2.json` is multiplied by 100 before it is loaded into the `id` column of `table2`. - -::: - -For detailed mappings between `jsonpaths`, `columns`, and the columns of the StarRocks table, see the "Column mappings" section in [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -After the load is complete, you can query `table2` to verify that the load is successful: - -```SQL -SELECT * FROM table2; -+------+--------+ -| id | city | -+------+--------+ -| 200 | Beijing| -+------+--------+ -4 rows in set (0.01 sec) -``` - -import Beta from '../_assets/commonMarkdown/_beta.mdx' - -#### Merge Stream Load requests - - - -From v3.4.0, the system supports merging multiple Stream Load requests. - -:::warning - -Note that the Merge Commit optimization is suitable for the scenario with **concurrent** Stream Load jobs on a single table. It is not recommended if the concurrency is one. Meanwhile, think twice before setting `merge_commit_async` to `false` and `merge_commit_interval_ms` to a large value because they may cause load performance degradation. 
- -::: - -Merge Commit is an optimization for Stream Load, designed for high-concurrency, small-batch (from KB to tens of MB) real-time loading scenarios. In earlier versions, each Stream Load request would generate a transaction and a data version, which led to the following issues in high-concurrency loading scenarios: - -- Excessive data versions impact query performance, and limiting the number of versions may cause `too many versions` errors. -- Data version merging through Compaction increases resource consumption. -- It generates small files, increasing IOPS and I/O latency. And in shared-data clusters, this also raises cloud object storage costs. -- Leader FE node, as the transaction manager, may become a single point of bottleneck. - -Merge Commit mitigates these issues by merging multiple concurrent Stream Load requests within a time window into a single transaction. This reduces the number of transactions and versions generated by high-concurrency requests, thereby improving loading performance. - -Merge Commit supports both synchronous and asynchronous modes. Each mode has advantages and disadvantages. You can choose based on your use cases. - -- **Synchronous mode** - - The server returns only after the merged transaction is committed, ensuring the loading is successful and visible. - -- **Asynchronous mode** - - The server returns immediately after receiving the data. This mode does not ensure the loading is successful. - -| **Mode** | **Advantages** | **Disadvantages** | -| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | -| Synchronous |
<ul><li>Ensures data persistence and visibility upon request return.</li><li>Guarantees that multiple sequential loading requests from the same client are executed in order.</li></ul> | Each loading request from the client is blocked until the server closes the merge window. It may reduce the data processing capability of a single client if the window is excessively large. |
| Asynchronous | Allows a single client to send subsequent loading requests without waiting for the server to close the merge window, improving loading throughput. | <ul><li>Does not guarantee data persistence or visibility upon return. The client must later verify the transaction status.</li><li>Does not guarantee that multiple sequential loading requests from the same client are executed in order.</li></ul>
| - -##### Start a Stream Load - -- Run the following command to start a Stream Load job with Merge Commit enabled in synchronous mode, and set the merging window to `5000` milliseconds and degree of parallelism to `2`: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: id, name, score" \ - -H "enable_merge_commit:true" \ - -H "merge_commit_interval_ms:5000" \ - -H "merge_commit_parallel:2" \ - -T example1.csv -XPUT \ - http://:/api/mydatabase/table1/_stream_load - ``` - -- Run the following command to start a Stream Load job with Merge Commit enabled in asynchronous mode, and set the merging window to `60000` milliseconds and degree of parallelism to `2`: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: id, name, score" \ - -H "enable_merge_commit:true" \ - -H "merge_commit_async:true" \ - -H "merge_commit_interval_ms:60000" \ - -H "merge_commit_parallel:2" \ - -T example1.csv -XPUT \ - http://:/api/mydatabase/table1/_stream_load - ``` - -:::note - -- Merge Commit only supports merging **homogeneous** loading requests into a single database and table. "Homogeneous" indicates that the Stream Load parameters are identical, including: common parameters, JSON format parameters, CSV format parameters, `opt_properties`, and Merge Commit parameters. -- For loading CSV-formatted data, you must ensure that each row ends with a line separator. `skip_header` is not supported. -- The server automatically generates labels for transactions. They will be ignored if specified. -- Merge Commit merges multiple loading requests into a single transaction. If one request contains data quality issues, all requests in the transaction will fail. - -::: - -#### Check Stream Load progress - -After a load job is complete, StarRocks returns the result of the job in JSON format. For more information, see the "Return value" section in [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -Stream Load does not allow you to query the result of a load job by using the SHOW LOAD statement. - -#### Cancel a Stream Load job - -Stream Load does not allow you to cancel a load job. If a load job times out or encounters errors, StarRocks automatically cancels the job. - -### Parameter configurations - -This section describes some system parameters that you need to configure if you choose the loading method Stream Load. These parameter configurations take effect on all Stream Load jobs. - -- `streaming_load_max_mb`: the maximum size of each data file you want to load. The default maximum size is 10 GB. For more information, see [Configure BE or CN dynamic parameters](../administration/management/BE_configuration.md). - - We recommend that you do not load more than 10 GB of data at a time. If the size of a data file exceeds 10 GB, we recommend that you split the data file into small files that each are less than 10 GB in size and then load these files one by one. If you cannot split a data file greater than 10 GB, you can increase the value of this parameter based on the file size. - - After you increase the value of this parameter, the new value can take effect only after you restart the BEs or CNs of your StarRocks cluster. Additionally, system performance may deteriorate, and the costs of retries in the event of load failures also increase. 
- - :::note - - When you load the data of a JSON file, take note of the following points: - - - The size of each JSON object in the file cannot exceed 4 GB. If any JSON object in the file exceeds 4 GB, StarRocks throws an error "This parser can't support a document that big." - - - By default, the JSON body in an HTTP request cannot exceed 100 MB. If the JSON body exceeds 100 MB, StarRocks throws an error "The size of this batch exceed the max size [104857600] of json type data data [8617627793]. Set ignore_json_size to skip check, although it may lead huge memory consuming." To prevent this error, you can add `"ignore_json_size:true"` in the HTTP request header to ignore the check on the JSON body size. - - ::: - -- `stream_load_default_timeout_second`: the timeout period of each load job. The default timeout period is 600 seconds. For more information, see [Configure FE dynamic parameters](../administration/management/FE_configuration.md#configure-fe-dynamic-parameters). - - If many of the load jobs that you create time out, you can increase the value of this parameter based on the calculation result that you obtain from the following formula: - - **Timeout period of each load job > Amount of data to be loaded/Average loading speed** - - For example, if the size of the data file that you want to load is 10 GB and the average loading speed of your StarRocks cluster is 100 MB/s, set the timeout period to more than 100 seconds. - - :::note - - **Average loading speed** in the preceding formula is the average loading speed of your StarRocks cluster. It varies depending on the disk I/O and the number of BEs or CNs in your StarRocks cluster. - - ::: - - Stream Load also provides the `timeout` parameter, which allows you to specify the timeout period of an individual load job. For more information, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -### Usage notes - -If a field is missing for a record in the data file you want to load and the column onto which the field is mapped in your StarRocks table is defined as `NOT NULL`, StarRocks automatically fills a `NULL` value in the mapping column of your StarRocks table during the load of the record. You can also use the `ifnull()` function to specify the default value that you want to fill. - -For example, if the field that represents city ID in the preceding `example2.json` file is missing and you want to fill an `x` value in the mapping column of `table2`, you can specify `"columns: city, tmp_id, id = ifnull(tmp_id, 'x')"`. - -## Loading from a local file system via Broker Load - -In addition to Stream Load, you can also use Broker Load to load data from a local file system. This feature is supported from v2.5 onwards. - -Broker Load is an asynchronous loading method. After you submit a load job, StarRocks asynchronously runs the job and does not immediately return the job result. You need to query the job result by hand. See [Check Broker Load progress](#check-broker-load-progress). - -### Limits - -- Currently Broker Load supports loading from a local file system only through a single broker whose version is v2.5 or later. -- Highly concurrent queries against a single broker may cause issues such as timeout and OOM. To mitigate the impact, you can use the `pipeline_dop` variable (see [System variable](../sql-reference/System_variable.md#pipeline_dop)) to set the query parallelism for Broker Load. For queries against a single broker, we recommend that you set `pipeline_dop` to a value smaller than `16`. 
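-
-For the `pipeline_dop` recommendation above, a minimal sketch of setting the session variable before submitting the load job (the value `8` is an assumption; tune it for your broker and cluster):
-
-```SQL
--- Lower the query parallelism for the current session before submitting the Broker Load job.
-SET pipeline_dop = 8;
-```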
- -### Typical example - -Broker Load supports loading from a single data file to a single table, loading from multiple data files to a single table, and loading from multiple data files to multiple tables. This section uses loading from multiple data files to a single table as an example. - -Note that in StarRocks some literals are used as reserved keywords by the SQL language. Do not directly use these keywords in SQL statements. If you want to use such a keyword in an SQL statement, enclose it in a pair of backticks (`). See [Keywords](../sql-reference/sql-statements/keywords.md). - -#### Prepare datasets - -Use the CSV file format as an example. Log in to your local file system, and create two CSV files, `file1.csv` and `file2.csv`, in a specific storage location (for example, `/home/disk1/business/`). Both files consist of three columns, which represent the user ID, user name, and user score in sequence. - -- `file1.csv` - - ```Plain - 1,Lily,21 - 2,Rose,22 - 3,Alice,23 - 4,Julia,24 - ``` - -- `file2.csv` - - ```Plain - 5,Tony,25 - 6,Adam,26 - 7,Allen,27 - 8,Jacky,28 - ``` - -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a Primary Key table named `mytable`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - -```SQL -CREATE TABLE `mytable` -( - `id` int(11) NOT NULL COMMENT "User ID", - `name` varchar(65533) NULL DEFAULT "" COMMENT "User name", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "User score" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY HASH(`id`) -PROPERTIES("replication_num"="1"); -``` - -#### Start a Broker Load - -Run the following command to start a Broker Load job that loads data from all data files (`file1.csv` and `file2.csv`) stored in the `/home/disk1/business/` path of your local file system to the StarRocks table `mytable`: - -```SQL -LOAD LABEL mydatabase.label_local -( - DATA INFILE("file:///home/disk1/business/csv/*") - INTO TABLE mytable - COLUMNS TERMINATED BY "," - (id, name, score) -) -WITH BROKER "sole_broker" -PROPERTIES -( - "timeout" = "3600" -); -``` - -This job has four main sections: - -- `LABEL`: A string used when querying the state of the load job. -- `LOAD` declaration: The source URI, source data format, and destination table name. -- `PROPERTIES`: The timeout value and any other properties to apply to the load job. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Check Broker Load progress - -In v3.0 and earlier, use the [SHOW LOAD](../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md) statement or the curl command to view the progress of Broker Load jobs. - -In v3.1 and later, you can view the progress of Broker Load jobs from the [`information_schema.loads`](../sql-reference/information_schema/loads.md) view: - -```SQL -SELECT * FROM information_schema.loads; -``` - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'label_local'; -``` - -After you confirm that the load job has finished, you can query table to see if the data has been successfully loaded. 
Example: - -```SQL -SELECT * FROM mytable; -+------+-------+-------+ -| id | name | score | -+------+-------+-------+ -| 3 | Alice | 23 | -| 5 | Tony | 25 | -| 6 | Adam | 26 | -| 1 | Lily | 21 | -| 2 | Rose | 22 | -| 4 | Julia | 24 | -| 7 | Allen | 27 | -| 8 | Jacky | 28 | -+------+-------+-------+ -8 rows in set (0.07 sec) -``` - -#### Cancel a Broker Load job - -When a load job is not in the **CANCELLED** or **FINISHED** stage, you can use the [CANCEL LOAD](../sql-reference/sql-statements/loading_unloading/CANCEL_LOAD.md) statement to cancel the job. - -For example, you can execute the following statement to cancel a load job, whose label is `label_local`, in the database `mydatabase`: - -```SQL -CANCEL LOAD -FROM mydatabase -WHERE LABEL = "label_local"; -``` - -## Loading from NAS via Broker Load - -There are two ways to load data from NAS by using Broker Load: - -- Consider NAS as a local file system, and run a load job with a broker. See the previous section "[Loading from a local system via Broker Load](#loading-from-a-local-file-system-via-broker-load)". -- (Recommended) Consider NAS as a cloud storage system, and run a load job without a broker. - -This section introduces the second way. Detailed operations are as follows: - -1. Mount your NAS device to the same path on all the BE or CN nodes and FE nodes of your StarRocks cluster. As such, all BEs or CNs can access the NAS device like they access their own locally stored files. - -2. Use Broker Load to load data from the NAS device to the destination StarRocks table. Example: - - ```SQL - LOAD LABEL test_db.label_nas - ( - DATA INFILE("file:///home/disk1/sr/*") - INTO TABLE mytable - COLUMNS TERMINATED BY "," - ) - WITH BROKER - PROPERTIES - ( - "timeout" = "3600" - ); - ``` - - This job has four main sections: - - - `LABEL`: A string used when querying the state of the load job. - - `LOAD` declaration: The source URI, source data format, and destination table name. Note that `DATA INFILE` in the declaration is used to specify the mount point folder path of the NAS device, as shown in the above example in which `file:///` is the prefix and `/home/disk1/sr` is the mount point folder path. - - `BROKER`: You do not need to specify the broker name. - - `PROPERTIES`: The timeout value and any other properties to apply to the load job. - - For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -After you submit a job, you can view the load progress or cancel the job as needed. For detailed operations, see "[Check Broker Load progress](#check-broker-load-progress)" and "[Cancel a Broker Load job](#cancel-a-broker-load-job) in this topic. diff --git a/docs/en/loading/Stream_Load_transaction_interface.md b/docs/en/loading/Stream_Load_transaction_interface.md deleted file mode 100644 index db5a6a2..0000000 --- a/docs/en/loading/Stream_Load_transaction_interface.md +++ /dev/null @@ -1,547 +0,0 @@ ---- -displayed_sidebar: docs -keywords: ['Stream Load'] ---- - -# Load data using Stream Load transaction interface - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -From v2.4 onwards, StarRocks provides a Stream Load transaction interface to implement two-phase commit (2PC) for transactions that are run to load data from external systems such as Apache Flink® and Apache Kafka®. The Stream Load transaction interface helps improve the performance of highly concurrent stream loads. 
- -From v4.0 onwards, the Stream Load transaction interface supports Multi-table Transaction, that is, loading data into multiple tables within the same database. - -This topic describes the Stream Load transaction interface and how to load data into StarRocks by using this interface. - -## Description - -The Stream Load transaction interface supports using an HTTP protocol-compatible tool or language to call API operations. This topic uses curl as an example to explain how to use this interface. This interface provides various features, such as transaction management, data write, transaction pre-commit, transaction deduplication, and transaction timeout management. - -:::note -Stream Load supports CSV and JSON file formats. This method is recommended if you want to load data from a small number of files whose individual sizes do not exceed 10 GB. Stream Load does not support Parquet file format. If you need to load data from Parquet files, use [INSERT+files()](../loading/InsertInto.md#insert-data-directly-from-files-in-an-external-source-using-files). -::: - -### Transaction management - -The Stream Load transaction interface provides the following API operations, which are used to manage transactions: - -- `/api/transaction/begin`: starts a new transaction. - -- `/api/transaction/prepare`: pre-commits the current transaction and make data changes temporarily persistent. After you pre-commit a transaction, you can proceed to commit or roll back the transaction. If your cluster crashes after a transaction is pre-committed, you can still proceed to commit the transaction after the cluster is restored. - -- `/api/transaction/commit`: commits the current transaction to make data changes persistent. - -- `/api/transaction/rollback`: rolls back the current transaction to abort data changes. - -> **NOTE** -> -> After the transaction is pre-committed, do not continue to write data using the transaction. If you continue to write data using the transaction, your write request returns errors. - - -The following diagram shows the relationship between transaction states and operations: - -```mermaid -stateDiagram-v2 - direction LR - [*] --> PREPARE : begin - PREPARE --> PREPARED : prepare - PREPARE --> ABORTED : rollback - PREPARED --> COMMITTED : commit - PREPARED --> ABORTED : rollback -``` - -### Data write - -The Stream Load transaction interface provides the `/api/transaction/load` operation, which is used to write data. You can call this operation multiple times within one transaction. - -From v4.0 onwards, you can call `/api/transaction/load` operations on different tables to load data into multiple tables within the same database. - -### Transaction deduplication - -The Stream Load transaction interface carries over the labeling mechanism of StarRocks. You can bind a unique label to each transaction to achieve at-most-once guarantees for transactions. - -### Transaction timeout management - -When you begin a transaction, you can use the `timeout` field in the HTTP request header to specify a timeout period (in seconds) for the transaction from `PREPARE` to `PREPARED` state. If the transaction has not been prepared after this period, it will be automatically aborted. If this field is not specified, the default value is determined by the FE configuration [`stream_load_default_timeout_second`](../administration/management/FE_configuration.md#stream_load_default_timeout_second) (Default: 600 seconds). 
- -When you begin a transaction, you can also use the `idle_transaction_timeout` field in the HTTP request header to specify a timeout period (in seconds) within which the transaction can stay idle. If no data is written within this period, the transaction will be automatically rolled back. - -When you prepare a transaction, you can use the `prepared_timeout` field in the HTTP request header to specify a timeout period (in seconds) for the transaction from `PREPARED` to `COMMITTED` state. If the transaction has not been committed after this period, it will be automatically aborted. If this field is not specified, the default value is determined by the FE configuration [`prepared_transaction_default_timeout_second`](../administration/management/FE_configuration.md#prepared_transaction_default_timeout_second) (Default: 86400 seconds). `prepared_timeout` is supported from v3.5.4 onwards. - -## Benefits - -The Stream Load transaction interface brings the following benefits: - -- **Exactly-once semantics** - - A transaction is split into two phases, pre-commit and commit, which make it easy to load data across systems. For example, this interface can guarantee exactly-once semantics for data loads from Flink. - -- **Improved load performance** - - If you run a load job by using a program, the Stream Load transaction interface allows you to merge multiple mini-batches of data on demand and then send them all at once within one transaction by calling the `/api/transaction/commit` operation. As such, fewer data versions need to be loaded, and load performance is improved. - -## Limits - -The Stream Load transaction interface has the following limits: - -- **Single-database multi-table** transactions are supported from v4.0 onwards. Support for **multi-database multi-table** transactions is in development. - -- Only **concurrent data writes from one client** are supported. Support for **concurrent data writes from multiple clients** is in development. - -- The `/api/transaction/load` operation can be called multiple times within one transaction. In this case, the parameter settings (except `table`) specified for all of the `/api/transaction/load` operations that are called must be the same. - -- When you load CSV-formatted data by using the Stream Load transaction interface, make sure that each data record in your data file ends with a row delimiter. - -## Precautions - -- If the `/api/transaction/begin`, `/api/transaction/load`, or `/api/transaction/prepare` operation that you have called returns errors, the transaction fails and is automatically rolled back. -- When calling the `/api/transaction/begin` operation to start a new transaction, you must specify a label. Note that the subsequent `/api/transaction/load`, `/api/transaction/prepare`, and `/api/transaction/commit` operations must use the same label as the `/api/transaction/begin` operation. -- If you use the label of an ongoing transaction to call the `/api/transaction/begin` operation to start a new transaction, the previous transaction will fail and be rolled back. -- If you use a multi-table transaction to load data into different tables, you must specify the parameter `-H "transaction_type:multi"` for all operations involved in the transaction. -- The default column separator and row delimiter that StarRocks supports for CSV-formatted data are `\t` and `\n`. 
If your data file does not use the default column separator or row delimiter, you must use `"column_separator: "` or `"row_delimiter: "` to specify the column separator or row delimiter that is actually used in your data file when calling the `/api/transaction/load` operation. - -## Before you begin - -### Check privileges - - - -#### Check network configuration - -Make sure that the machine on which the data you want to load resides can access the FE and BE nodes of the StarRocks cluster via the [`http_port`](../administration/management/FE_configuration.md#http_port) (default: `8030`) and [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (default: `8040`) , respectively. - -## Basic operations - -### Prepare sample data - -This topic uses CSV-formatted data as an example. - -1. In the `/home/disk1/` path of your local file system, create a CSV file named `example1.csv`. The file consists of three columns, which represent the user ID, user name, and user score in sequence. - - ```Plain - 1,Lily,23 - 2,Rose,23 - 3,Alice,24 - 4,Julia,25 - ``` - -2. In your StarRocks database `test_db`, create a Primary Key table named `table1`. The table consists of three columns: `id`, `name`, and `score`, of which `id` is the primary key. - - ```SQL - CREATE TABLE `table1` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`) BUCKETS 10; - ``` - -### Start a transaction - -#### Syntax - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # Optional. Initiates a multi-table transaction. - -H "db:" -H "table:" \ - -XPOST http://:/api/transaction/begin -``` - -> **NOTE** -> -> Specify `-H "transaction_type:multi"` in the command if you want to load data into different tables within the transaction. - -#### Example - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" -H "table:table1" \ - -XPOST http://:/api/transaction/begin -``` - -> **NOTE** -> -> For this example, `streamload_txn_example1_table1` is specified as the label of the transaction. - -#### Return result - -- If the transaction is successfully started, the following result is returned: - - ```Bash - { - "Status": "OK", - "Message": "", - "Label": "streamload_txn_example1_table1", - "TxnId": 9032, - "BeginTxnTimeMs": 0 - } - ``` - -- If the transaction is bound to a duplicate label, the following result is returned: - - ```Bash - { - "Status": "LABEL_ALREADY_EXISTS", - "ExistingJobStatus": "RUNNING", - "Message": "Label [streamload_txn_example1_table1] has already been used." - } - ``` - -- If errors other than duplicate label occur, the following result is returned: - - ```Bash - { - "Status": "FAILED", - "Message": "" - } - ``` - -### Write data - -#### Syntax - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # Optional. Loads data via a multi-table transaction. - -H "db:" -H "table:" \ - -T \ - -XPUT http://:/api/transaction/load -``` - -> **NOTE** -> -> - When calling the `/api/transaction/load` operation, you must use `` to specify the save path of the data file you want to load. -> - You can call `/api/transaction/load` operations with different `table` parameter values to load data into different tables within the same database. 
In this case, you must specify `-H "transaction_type:multi"` in the command. - -#### Example - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" -H "table:table1" \ - -T /home/disk1/example1.csv \ - -H "column_separator: ," \ - -XPUT http://:/api/transaction/load -``` - -> **NOTE** -> -> For this example, the column separator used in the data file `example1.csv` is commas (`,`) instead of StarRocks‘s default column separator (`\t`). Therefore, when calling the `/api/transaction/load` operation, you must use `"column_separator: "` to specify commas (`,`) as the column separator. - -#### Return result - -- If the data write is successful, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Seq": 0, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "", - "NumberTotalRows": 5265644, - "NumberLoadedRows": 5265644, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 10737418067, - "LoadTimeMs": 418778, - "StreamLoadPutTimeMs": 68, - "ReceivedDataTimeMs": 38964, - } - ``` - -- If the transaction is considered unknown, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "TXN_NOT_EXISTS" - } - ``` - -- If the transaction is considered in an invalid state, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation State Invalid" - } - ``` - -- If errors other than unknown transaction and invalid status occur, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "" - } - ``` - -### Pre-commit a transaction - -#### Syntax - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # Optional. Pre-commits a multi-table transaction. - -H "db:" \ - [-H "prepared_timeout:"] \ - -XPOST http://:/api/transaction/prepare -``` - -> **NOTE** -> -> Specify `-H "transaction_type:multi"` in the command if the transaction you want to pre-commit is a multi-table transaction. - -#### Example - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" \ - -H "prepared_timeout:300" \ - -XPOST http://:/api/transaction/prepare -``` - -> **NOTE** -> -> The `prepared_timeout` field is optional. If it is not specified, the default value is determined by the FE configuration [`prepared_transaction_default_timeout_second`](../administration/management/FE_configuration.md#prepared_transaction_default_timeout_second) (Default: 86400 seconds). `prepared_timeout` is supported from v3.5.4 onwards. 
- -#### Return result - -- If the pre-commit is successful, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "", - "NumberTotalRows": 5265644, - "NumberLoadedRows": 5265644, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 10737418067, - "LoadTimeMs": 418778, - "StreamLoadPutTimeMs": 68, - "ReceivedDataTimeMs": 38964, - "WriteDataTimeMs": 417851 - "CommitAndPublishTimeMs": 1393 - } - ``` - -- If the transaction is considered not existent, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation Not Exist" - } - ``` - -- If the pre-commit times out, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "commit timeout", - } - ``` - -- If errors other than non-existent transaction and pre-commit timeout occur, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "publish timeout" - } - ``` - -### Commit a transaction - -#### Syntax - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # Optional. Commits a multi-table transaction. - -H "db:" \ - -XPOST http://:/api/transaction/commit -``` - -> **NOTE** -> -> Specify `-H "transaction_type:multi"` in the command if the transaction you want to commit is a multi-table transaction. - -#### Example - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" \ - -XPOST http://:/api/transaction/commit -``` - -#### Return result - -- If the commit is successful, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "", - "NumberTotalRows": 5265644, - "NumberLoadedRows": 5265644, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 10737418067, - "LoadTimeMs": 418778, - "StreamLoadPutTimeMs": 68, - "ReceivedDataTimeMs": 38964, - "WriteDataTimeMs": 417851 - "CommitAndPublishTimeMs": 1393 - } - ``` - -- If the transaction has already been committed, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "Transaction already commited", - } - ``` - -- If the transaction is considered not existent, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation Not Exist" - } - ``` - -- If the commit times out, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "commit timeout", - } - ``` - -- If the data publish times out, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "publish timeout", - "CommitAndPublishTimeMs": 1393 - } - ``` - -- If errors other than non-existent transaction and timeout occur, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "" - } - ``` - -### Roll back a transaction - -#### Syntax - -```Bash -curl --location-trusted -u : -H "label:" \ - -H 
"Expect:100-continue" \ - [-H "transaction_type:multi"]\ # Optional. Rolls back a multi-table transaction. - -H "db:" \ - -XPOST http://:/api/transaction/rollback -``` - -> **NOTE** -> -> Specify `-H "transaction_type:multi"` in the command if the transaction you want to roll back is a multi-table transaction. - -#### Example - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" \ - -XPOST http://:/api/transaction/rollback -``` - -#### Return result - -- If the rollback is successful, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "" - } - ``` - -- If the transaction is considered not existent, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation Not Exist" - } - ``` - -- If errors other than not existent transaction occur, the following result is returned: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "" - } - ``` - -## References - -For information about the suitable application scenarios and supported data file formats of Stream Load and about how Stream Load works, see [Loading from a local file system via Stream Load](../loading/StreamLoad.md#loading-from-a-local-file-system-via-stream-load). - -For information about the syntax and parameters for creating Stream Load jobs, see [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). diff --git a/docs/en/loading/alibaba.md b/docs/en/loading/alibaba.md deleted file mode 100644 index 25867d5..0000000 --- a/docs/en/loading/alibaba.md +++ /dev/null @@ -1,3 +0,0 @@ ---- -unlisted: true ---- diff --git a/docs/en/loading/automq-routine-load.md b/docs/en/loading/automq-routine-load.md deleted file mode 100644 index add7045..0000000 --- a/docs/en/loading/automq-routine-load.md +++ /dev/null @@ -1,147 +0,0 @@ ---- -displayed_sidebar: docs -description: Cloud based Kafka from AutoMQ ---- - -# AutoMQ Kafka - -import Replicanum from '../_assets/commonMarkdown/replicanum.mdx' - -[AutoMQ for Kafka](https://www.automq.com/docs) is a cloud-native version of Kafka redesigned for cloud environments. -AutoMQ Kafka is [open source](https://github.com/AutoMQ/automq-for-kafka) and fully compatible with the Kafka protocol, fully leveraging cloud benefits. -Compared to self-managed Apache Kafka, AutoMQ Kafka, with its cloud-native architecture, offers features like capacity auto scaling, self-balancing of network traffic, move partition in seconds. These features contribute to a significantly lower Total Cost of Ownership (TCO) for users. - -This article will guide you through importing data into AutoMQ Kafka using StarRocks Routine Load. -For an understanding of the basic principles of Routine Load, refer to the section on Routine Load Fundamentals. - -## Prepare Environment - -### Prepare StarRocks and test data - -Ensure you have a running StarRocks cluster. 
- -Creating a database and a Primary Key table for testing: - -```sql -create database automq_db; -create table users ( - id bigint NOT NULL, - name string NOT NULL, - timestamp string NULL, - status string NULL -) PRIMARY KEY (id) -DISTRIBUTED BY HASH(id) -PROPERTIES ( - "enable_persistent_index" = "true" -); -``` - - - -## Prepare AutoMQ Kafka and test data - -To prepare your AutoMQ Kafka environment and test data, follow the AutoMQ [Quick Start](https://www.automq.com/docs) guide to deploy your AutoMQ Kafka cluster. Ensure that StarRocks can directly connect to your AutoMQ Kafka server. - -To quickly create a topic named `example_topic` in AutoMQ Kafka and write a test JSON data into it, follow these steps: - -### Create a topic - -Use Kafka’s command-line tools to create a topic. Ensure you have access to the Kafka environment and the Kafka service is running. -Here is the command to create a topic: - -```shell -./kafka-topics.sh --create --topic example_topic --bootstrap-server 10.0.96.4:9092 --partitions 1 --replication-factor 1 -``` - -> Note: Replace `topic` and `bootstrap-server` with your Kafka server address. - -To check the result of the topic creation, use this command: - -```shell -./kafka-topics.sh --describe example_topic --bootstrap-server 10.0.96.4:9092 -``` - -### Generate test data - -Generate a simple JSON format test data - -```json -{ - "id": 1, - "name": "testuser", - "timestamp": "2023-11-10T12:00:00", - "status": "active" -} -``` - -### Write Test Data - -Use Kafka’s command-line tools or programming methods to write test data into example_topic. Here is an example using command-line tools: - -```shell -echo '{"id": 1, "name": "testuser", "timestamp": "2023-11-10T12:00:00", "status": "active"}' | sh kafka-console-producer.sh --broker-list 10.0.96.4:9092 --topic example_topic -``` - -> Note: Replace `topic` and `bootstrap-server` with your Kafka server address. - -To view the recently written topic data, use the following command: - -```shell -sh kafka-console-consumer.sh --bootstrap-server 10.0.96.4:9092 --topic example_topic --from-beginning -``` - -## Creating a Routine Load Task - -In the StarRocks command line, create a Routine Load task to continuously import data from the AutoMQ Kafka topic: - -```sql -CREATE ROUTINE LOAD automq_example_load ON users -COLUMNS(id, name, timestamp, status) -PROPERTIES -( - "desired_concurrent_number" = "5", - "format" = "json", - "jsonpaths" = "[\"$.id\",\"$.name\",\"$.timestamp\",\"$.status\"]" -) -FROM KAFKA -( - "kafka_broker_list" = "10.0.96.4:9092", - "kafka_topic" = "example_topic", - "kafka_partitions" = "0", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -> Note: Replace `kafka_broker_list` with your Kafka server address. - -### Explanation of Parameters - -#### Data Format - -Specify the data format as JSON in the "format" = "json" of the PROPERTIES clause. - -#### Data Extraction and Transformation - -To specify the mapping and transformation relationship between the source data and the target table, configure the COLUMNS and jsonpaths parameters. The column names in COLUMNS correspond to the column names of the target table, and their order corresponds to the column order in the source data. The jsonpaths parameter is used to extract the required field data from JSON data, similar to newly generated CSV data. Then the COLUMNS parameter temporarily names the fields in jsonpaths in order. 
For more explanations on data transformation, please see [Data Transformation during Import](./Etl_in_loading.md). -> Note: If each JSON object per line has key names and quantities (order is not required) that correspond to the columns of the target table, there is no need to configure COLUMNS. - -## Verifying Data Import - -First, we check the Routine Load import job and confirm the Routine Load import task status is in RUNNING status. - -```sql -show routine load\G -``` - -Then, querying the corresponding table in the StarRocks database, we can observe that the data has been successfully imported. - -```sql -StarRocks > select * from users; -+------+--------------+---------------------+--------+ -| id | name | timestamp | status | -+------+--------------+---------------------+--------+ -| 1 | testuser | 2023-11-10T12:00:00 | active | -| 2 | testuser | 2023-11-10T12:00:00 | active | -+------+--------------+---------------------+--------+ -2 rows in set (0.01 sec) -``` diff --git a/docs/en/loading/azure.md b/docs/en/loading/azure.md deleted file mode 100644 index 82978bd..0000000 --- a/docs/en/loading/azure.md +++ /dev/null @@ -1,456 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# Load data from Microsoft Azure Storage - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks provides the following options for loading data from Azure: - -- Synchronous loading using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)+[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) -- Asynchronous loading using [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) - -Each of these options has its own advantages, which are detailed in the following sections. - -In most cases, we recommend that you use the INSERT+`FILES()` method, which is much easier to use. - -However, the INSERT+`FILES()` method currently supports only the Parquet, ORC, and CSV file formats. Therefore, if you need to load data of other file formats such as JSON, or [perform data changes such as DELETE during data loading](../loading/Load_to_Primary_Key_tables.md), you can resort to Broker Load. - -## Before you begin - -### Make source data ready - -Make sure that the source data you want to load into StarRocks is properly stored in a container within your Azure storage account. - -In this topic, suppose you want to load the data of a Parquet-formatted sample dataset (`user_behavior_ten_million_rows.parquet`) stored in the root directory of a container (`starrocks-container`) within an Azure Data Lake Storage Gen2 (ADLS Gen2) storage account (`starrocks`). - -### Check privileges - - - -### Gather authentication details - -The examples in this topic use the Shared Key authentication method. To ensure that you have permission to read data from ADLS Gen2, we recommend that you read [Azure Data Lake Storage Gen2 > Shared Key (access key of storage account)](../integrations/authenticate_to_azure_storage.md#service-principal-1) to understand the authentication parameters that you need to configure. - -In a nutshell, if you practice Shared Key authentication, you need to gather the following information: - -- The username of your ADLS Gen2 storage account -- The shared key of your ADLS Gen2 storage account - -For information about all the authentication methods available, see [Authenticate to Azure cloud storage](../integrations/authenticate_to_azure_storage.md). 
- -## Use INSERT+FILES() - -This method is available from v3.2 onwards and currently supports only the Parquet, ORC, and CSV (from v3.3.0 onwards) file formats. - -### Advantages of INSERT+FILES() - -`FILES()` can read the file stored in cloud storage based on the path-related properties you specify, infer the table schema of the data in the file, and then return the data from the file as data rows. - -With `FILES()`, you can: - -- Query the data directly from Azure using [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md). -- Create and load a table using [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS). -- Load the data into an existing table using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -### Typical examples - -#### Querying directly from Azure using SELECT - -Querying directly from Azure using SELECT+`FILES()` can give a good preview of the content of a dataset before you create a table. For example: - -- Get a preview of the dataset without storing the data. -- Query for the min and max values and decide what data types to use. -- Check for `NULL` values. - -The following example queries the sample dataset `user_behavior_ten_million_rows.parquet` stored in the container `starrocks-container` within your storage account `starrocks`: - -```SQL -SELECT * FROM FILES -( - "path" = "abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -) -LIMIT 3; -``` - -The system returns a query result similar to the following: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> Notice that the column names as returned above are provided by the Parquet file. - -#### Creating and loading a table using CTAS - -This is a continuation of the previous example. The previous query is wrapped in CREATE TABLE AS SELECT (CTAS) to automate the table creation using schema inference. This means StarRocks will infer the table schema, create the table you want, and then load the data into the table. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files as the Parquet format includes the column names. - -> **NOTE** -> -> The syntax of CREATE TABLE when using schema inference does not allow setting the number of replicas. If you are using a StarRocks shared-nothing cluster, set the number of replicas before creating the table. 
The example below is for a system with three replicas: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "3"); -> ``` - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Use CTAS to create a table and load the data of the sample dataset `user_behavior_ten_million_rows.parquet`, which is stored in the container `starrocks-container` within your storage account `starrocks`, into the table: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -); -``` - -After creating the table, you can view its schema by using [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md): - -```SQL -DESCRIBE user_behavior_inferred; -``` - -The system returns the following query result: - -```Plain -+--------------+-----------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+-----------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varbinary | YES | false | NULL | | -| Timestamp | varbinary | YES | false | NULL | | -+--------------+-----------+------+-------+---------+-------+ -``` - -Query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plain -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 84 | 162325 | 2939262 | pv | 2017-12-02 05:41:41 | -| 84 | 232622 | 4148053 | pv | 2017-11-27 04:36:10 | -| 84 | 595303 | 903809 | pv | 2017-11-26 08:03:59 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### Loading into an existing table using INSERT - -You may want to customize the table that you are inserting into, for example, the: - -- column data type, nullable setting, or default values -- key types and columns -- data partitioning and bucketing - -> **NOTE** -> -> Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This topic does not cover table design. For information about table design, see [Table types](../table_design/StarRocks_table_design.md). - -In this example, we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in Azure. - -- Since a query of the dataset in Azure indicates that the `Timestamp` column contains data that matches a VARBINARY data type, the column type is specified in the following DDL. -- By querying the data in Azure, you can find that there are no `NULL` values in the dataset, so the DDL does not set any columns as nullable. -- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID`. 
Your use case might be different for this data, so you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key. - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from Azure): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -Display the schema so that you can compare it with the inferred schema produced by the `FILES()` table function: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | varbinary | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - -Compare the schema you just created with the schema inferred earlier using the `FILES()` table function. Look at: - -- data types -- nullable -- key fields - -To better control the schema of the destination table and for better query performance, we recommend that you specify the table schema by hand in production environments. - -::: - -After creating the table, you can load it with INSERT INTO SELECT FROM FILES(): - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - "path" = "abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -); -``` - -After the load is complete, you can query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -The system returns a query result similar to the following, indicating that the data has been successfully loaded: - -```Plain - +--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### Check load progress - -You can query the progress of INSERT jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. Example: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_f3fc2298-a553-11ee-92f4-00163e0842bd' \G -*************************** 1. 
row *************************** - JOB_ID: 10193 - LABEL: insert_f3fc2298-a553-11ee-92f4-00163e0842bd - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-28 15:37:38 - ETL_START_TIME: 2023-12-28 15:37:38 - ETL_FINISH_TIME: 2023-12-28 15:37:38 - LOAD_START_TIME: 2023-12-28 15:37:38 - LOAD_FINISH_TIME: 2023-12-28 15:39:35 - JOB_DETAILS: {"All backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[10120]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":581730322,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT is a synchronous command. If an INSERT job is still running, you need to open another session to check its execution status. - -## Use Broker Load - -An asynchronous Broker Load process handles making the connection to Azure, pulling the data, and storing the data in StarRocks. - -This method supports the following file formats: - -- Parquet -- ORC -- CSV -- JSON (supported from v3.2.3 onwards) - -### Advantages of Broker Load - -- Broker Load runs in the background and clients don't need to stay connected for the job to continue. -- Broker Load is preferred for long running jobs, the default timeout is 4 hours. -- In addition to Parquet and ORC file format, Broker Load supports CSV file format and JSON file format (JSON file format is supported from v3.2.3 onwards). - -### Data flow - -![Workflow of Broker Load](../_assets/broker_load_how-to-work_en.png) - -1. The user creates a load job. -2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BEs) or compute nodes (CNs). -3. The BEs or CNs pull the data from the source and load the data into StarRocks. - -### Typical example - -Create a table, start a load process that pulls the sample dataset `user_behavior_ten_million_rows.parquet` from Azure, and verify the progress and success of the data loading. - -#### Create a database and a table - -Connect to your StarRocks cluster. Then, create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from Azure): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### Start a Broker Load - -Run the following command to start a Broker Load job that loads data from the sample dataset `user_behavior_ten_million_rows.parquet` to the `user_behavior` table: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" -) -WITH BROKER -( - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -) -PROPERTIES -( - "timeout" = "3600" -); -``` - -This job has four main sections: - -- `LABEL`: A string used when querying the state of the load job. 
-- `LOAD` declaration: The source URI, source data format, and destination table name. -- `BROKER`: The connection details for the source. -- `PROPERTIES`: The timeout value and any other properties to apply to the load job. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Check load progress - -You can query the progress of Broker Load jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. - -```SQL -SELECT * FROM information_schema.loads \G -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior' \G -*************************** 1. row *************************** - JOB_ID: 10250 - LABEL: user_behavior - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: BROKER - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):3600; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-28 16:15:19 - ETL_START_TIME: 2023-12-28 16:15:25 - ETL_FINISH_TIME: 2023-12-28 16:15:25 - LOAD_START_TIME: 2023-12-28 16:15:25 - LOAD_FINISH_TIME: 2023-12-28 16:16:31 - JOB_DETAILS: {"All backends":{"6a8ef4c0-1009-48c9-8d18-c4061d2255bf":[10121]},"FileNumber":1,"FileSize":132251298,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":132251298,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"6a8ef4c0-1009-48c9-8d18-c4061d2255bf":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. Example: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -The system returns a query result similar to the following, indicating that the data has been successfully loaded: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` diff --git a/docs/en/loading/gcs.md b/docs/en/loading/gcs.md deleted file mode 100644 index 77725ea..0000000 --- a/docs/en/loading/gcs.md +++ /dev/null @@ -1,465 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# Load data from GCS - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks provides the following options for loading data from GCS: - -- Synchronous loading using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)+[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) -- Asynchronous loading using [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) - -Each of these options has its own advantages, which are detailed in the following sections. 
- -In most cases, we recommend that you use the INSERT+`FILES()` method, which is much easier to use. - -However, the INSERT+`FILES()` method currently supports only the Parquet, ORC, and CSV file formats. Therefore, if you need to load data of other file formats such as JSON, or [perform data changes such as DELETE during data loading](../loading/Load_to_Primary_Key_tables.md), you can resort to Broker Load. - -## Before you begin - -### Make source data ready - -Make sure the source data you want to load into StarRocks is properly stored in a GCS bucket. You may also consider where the data and the database are located, because data transfer costs are much lower when your bucket and your StarRocks cluster are located in the same region. - -In this topic, we provide you with a sample dataset in a GCS bucket, `gs://starrocks-samples/user_behavior_ten_million_rows.parquet`. You can access that dataset with any valid credentials as the object is readable by any GCP user. - -### Check privileges - - - -### Gather authentication details - -The examples in this topic use service account-based authentication. To practice IAM user-based authentication, you need to gather information about the following GCS resources: - -- The GCS bucket that stores your data. -- The GCS object key (object name) if accessing a specific object in the bucket. Note that the object key can include a prefix if your GCS objects are stored in sub-folders. -- The GCS region to which the GCS bucket belongs. -- The `private_ key_id`, `private_key`, and `client_email` of your Google Cloud service account - -For information about all the authentication methods available, see [Authenticate to Google Cloud Storage](../integrations/authenticate_to_gcs.md). - -## Use INSERT+FILES() - -This method is available from v3.2 onwards and currently supports only the Parquet, ORC, and CSV (from v3.3.0 onwards) file formats. - -### Advantages of INSERT+FILES() - -`FILES()` can read the file stored in cloud storage based on the path-related properties you specify, infer the table schema of the data in the file, and then return the data from the file as data rows. - -With `FILES()`, you can: - -- Query the data directly from GCS using [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md). -- Create and load a table using [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS). -- Load the data into an existing table using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -### Typical examples - -#### Querying directly from GCS using SELECT - -Querying directly from GCS using SELECT+`FILES()` can give a good preview of the content of a dataset before you create a table. For example: - -- Get a preview of the dataset without storing the data. -- Query for the min and max values and decide what data types to use. -- Check for `NULL` values. 
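To illustrate the last two points, the hedged sketch below aggregates over the same sample file to inspect the value range of a column and to count NULLs before the table schema is decided. It reuses the placeholder credentials from this section; substitute your own:

```SQL
SELECT
    MIN(UserID) AS min_user_id,                        -- helps choose an integer width
    MAX(UserID) AS max_user_id,
    COUNT(*) - COUNT(Timestamp) AS null_timestamps     -- 0 suggests the column can be NOT NULL
FROM FILES
(
    "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet",
    "format" = "parquet",
    "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com",
    "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
    "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----"
);
```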
- -The following example queries the sample dataset `gs://starrocks-samples/user_behavior_ten_million_rows.parquet`: - -```SQL -SELECT * FROM FILES -( - "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" -) -LIMIT 3; -``` - -> **NOTE** -> -> Substitute the credentials in the above command with your own credentials. Any valid service account email, key, and secret can be used, as the object is readable by any GCP authenticated user. - -The system returns a query result similar to the following: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> Notice that the column names as returned above are provided by the Parquet file. - -#### Creating and loading a table using CTAS - -This is a continuation of the previous example. The previous query is wrapped in CREATE TABLE AS SELECT (CTAS) to automate the table creation using schema inference. This means StarRocks will infer the table schema, create the table you want, and then load the data into the table. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files as the Parquet format includes the column names. - -> **NOTE** -> -> The syntax of CREATE TABLE when using schema inference does not allow setting the number of replicas. If you are using a StarRocks shared-nothing cluster, set the number of replicas before creating the table. The example below is for a system with three replicas: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "3"); -> ``` - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Use CTAS to create a table and load the data of the sample dataset `gs://starrocks-samples/user_behavior_ten_million_rows.parquet` into the table: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" -); -``` - -> **NOTE** -> -> Substitute the credentials in the above command with your own credentials. Any valid service account email, key, and secret can be used, as the object is readable by any GCP authenticated user. 
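Before inspecting the inferred schema, a quick row count can confirm that the CTAS statement pulled in the full dataset. This is an optional verification sketch; the sample file is expected to hold ten million rows:

```SQL
-- A count of 10000000 indicates the whole sample file was loaded.
SELECT COUNT(*) FROM user_behavior_inferred;
```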
- -After creating the table, you can view its schema by using [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md): - -```SQL -DESCRIBE user_behavior_inferred; -``` - -The system returns a query result similar to the following: - -```Plain -+--------------+-----------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+-----------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varbinary | YES | false | NULL | | -| Timestamp | varbinary | YES | false | NULL | | -+--------------+-----------+------+-------+---------+-------+ -``` - -Query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plain -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 84 | 162325 | 2939262 | pv | 2017-12-02 05:41:41 | -| 84 | 232622 | 4148053 | pv | 2017-11-27 04:36:10 | -| 84 | 595303 | 903809 | pv | 2017-11-26 08:03:59 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### Loading into an existing table using INSERT - -You may want to customize the table that you are inserting into, for example, the: - -- column data type, nullable setting, or default values -- key types and columns -- data partitioning and bucketing - -> **NOTE** -> -> Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This topic does not cover table design. For information about table design, see [Table types](../table_design/StarRocks_table_design.md). - -In this example, we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in GCS. - -- Since a query of the dataset in GCS indicates that the `Timestamp` column contains data that matches a VARBINARY data type, the column type is specified in the following DDL. -- By querying the data in GCS, you can find that there are no `NULL` values in the dataset, so the DDL does not set any columns as nullable. -- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID`. Your use case might be different for this data, so you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key. 
- -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from GCS): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -Display the schema so that you can compare it with the inferred schema produced by the `FILES()` table function: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | varbinary | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - -Compare the schema you just created with the schema inferred earlier using the `FILES()` table function. Look at: - -- data types -- nullable -- key fields - -To better control the schema of the destination table and for better query performance, we recommend that you specify the table schema by hand in production environments. - -::: - -After creating the table, you can load it with INSERT INTO SELECT FROM FILES(): - -```SQL -INSERT INTO user_behavior_declared - SELECT * FROM FILES - ( - "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" -); -``` - -> **NOTE** -> -> Substitute the credentials in the above command with your own credentials. Any valid service account email, key, and secret can be used, as the object is readable by any GCP authenticated user. - -After the load is complete, you can query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -The system returns a query result similar to the following, indicating that the data has been successfully loaded: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### Check load progress - -You can query the progress of INSERT jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. Example: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). 
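If the label is not at hand, the view can also be narrowed by database or job state. The following is only a suggested sketch that assumes the `mydatabase` database used throughout this topic:

```SQL
SELECT JOB_ID, LABEL, STATE, PROGRESS, CREATE_TIME
FROM information_schema.loads
WHERE DATABASE_NAME = 'mydatabase'
ORDER BY CREATE_TIME DESC;
```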
- -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_f3fc2298-a553-11ee-92f4-00163e0842bd' \G -*************************** 1. row *************************** - JOB_ID: 10193 - LABEL: insert_f3fc2298-a553-11ee-92f4-00163e0842bd - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-28 15:37:38 - ETL_START_TIME: 2023-12-28 15:37:38 - ETL_FINISH_TIME: 2023-12-28 15:37:38 - LOAD_START_TIME: 2023-12-28 15:37:38 - LOAD_FINISH_TIME: 2023-12-28 15:39:35 - JOB_DETAILS: {"All backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[10120]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":581730322,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT is a synchronous command. If an INSERT job is still running, you need to open another session to check its execution status. - -## Use Broker Load - -An asynchronous Broker Load process handles making the connection to GCS, pulling the data, and storing the data in StarRocks. - -This method supports the following file formats: - -- Parquet -- ORC -- CSV -- JSON (supported from v3.2.3 onwards) - -### Advantages of Broker Load - -- Broker Load runs in the background and clients don't need to stay connected for the job to continue. -- Broker Load is preferred for long running jobs, the default timeout is 4 hours. -- In addition to Parquet and ORC file format, Broker Load supports CSV file format and JSON file format (JSON file format is supported from v3.2.3 onwards). - -### Data flow - -![Workflow of Broker Load](../_assets/broker_load_how-to-work_en.png) - -1. The user creates a load job. -2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BEs) or compute nodes (CNs). -3. The BEs or CNs pull the data from the source and load the data into StarRocks. - -### Typical example - -Create a table, start a load process that pulls the sample dataset `gs://starrocks-samples/user_behavior_ten_million_rows.parquet` from GCS, and verify the progress and success of the data loading. 
- -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table has the same schema as the Parquet file that you want to load from GCS): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### Start a Broker Load - -Run the following command to start a Broker Load job that loads data from the sample dataset `gs://starrocks-samples/user_behavior_ten_million_rows.parquet` to the `user_behavior` table: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("gs://starrocks-samples/user_behavior_ten_million_rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER - ( - - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -> **NOTE** -> -> Substitute the credentials in the above command with your own credentials. Any valid service account email, key, and secret can be used, as the object is readable by any GCP authenticated user. - -This job has four main sections: - -- `LABEL`: A string used when querying the state of the load job. -- `LOAD` declaration: The source URI, source data format, and destination table name. -- `BROKER`: The connection details for the source. -- `PROPERTIES`: The timeout value and any other properties to apply to the load job. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Check load progress - -You can query the progress of INSERT jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. - -```SQL -SELECT * FROM information_schema.loads; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -In the output below there are two entries for the load job `user_behavior`: - -- The first record shows a state of `CANCELLED`. Scroll to `ERROR_MSG`, and you can see that the job has failed due to `listPath failed`. -- The second record shows a state of `FINISHED`, which means that the job has succeeded. 
- -```Plain -JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH| -------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+ - 10121|user_behavior |mydatabase |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | | - 10106|user_behavior |mydatabase |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | | -``` - -After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. Example: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -The system returns a query result similar to the following, indicating that the data has been successfully loaded: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` diff --git a/docs/en/loading/hdfs_load.md b/docs/en/loading/hdfs_load.md deleted file mode 100644 index cff9aaa..0000000 --- a/docs/en/loading/hdfs_load.md +++ /dev/null @@ -1,605 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# Load data from HDFS - -import LoadMethodIntro from '../_assets/commonMarkdown/loadMethodIntro.mdx' - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -import PipeAdvantages from '../_assets/commonMarkdown/pipeAdvantages.mdx' - -StarRocks provides the following options for loading data from HDFS: - - - -## Before you begin - -### Make source data ready - -Make sure the source data you want to load into StarRocks is properly stored in your HDFS cluster. 
This topic assumes that you want to load `/user/amber/user_behavior_ten_million_rows.parquet` from HDFS into StarRocks. - -### Check privileges - - - -### Gather authentication details - -You can use the simple authentication method to establish connections with your HDFS cluster. To use simple authentication, you need to gather the username and password of the account that you can use to access the NameNode of the HDFS cluster. - -## Use INSERT+FILES() - -This method is available from v3.1 onwards and currently supports only the Parquet, ORC, and CSV (from v3.3.0 onwards) file formats. - -### Advantages of INSERT+FILES() - -[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) can read the file stored in cloud storage based on the path-related properties you specify, infer the table schema of the data in the file, and then return the data from the file as data rows. - -With `FILES()`, you can: - -- Query the data directly from HDFS using [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md). -- Create and load a table using [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS). -- Load the data into an existing table using [INSERT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md). - -### Typical examples - -#### Querying directly from HDFS using SELECT - -Querying directly from HDFS using SELECT+`FILES()` can give a good preview of the content of a dataset before you create a table. For example: - -- Get a preview of the dataset without storing the data. -- Query for the min and max values and decide what data types to use. -- Check for `NULL` values. - -The following example queries the data file `/user/amber/user_behavior_ten_million_rows.parquet` stored in the HDFS cluster: - -```SQL -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -) -LIMIT 3; -``` - -The system returns the following query result: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> Notice that the column names as returned above are provided by the Parquet file. - -#### Creating and loading a table using CTAS - -This is a continuation of the previous example. The previous query is wrapped in CREATE TABLE AS SELECT (CTAS) to automate the table creation using schema inference. This means StarRocks will infer the table schema, create the table you want, and then load the data into the table. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files as the Parquet format includes the column names. - -> **NOTE** -> -> The syntax of CREATE TABLE when using schema inference does not allow setting the number of replicas, so set it before creating the table. 
The example below is for a system with three replicas: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "3"); -> ``` - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Use CTAS to create a table and load the data of the data file `/user/amber/user_behavior_ten_million_rows.parquet` into the table: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -); -``` - -After creating the table, you can view its schema by using [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md): - -```SQL -DESCRIBE user_behavior_inferred; -``` - -The system returns the following query result: - -```Plain -+--------------+-----------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+-----------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varbinary | YES | false | NULL | | -| Timestamp | varbinary | YES | false | NULL | | -+--------------+-----------+------+-------+---------+-------+ -``` - -Query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 84 | 56257 | 1879194 | pv | 2017-11-26 05:56:23 | -| 84 | 108021 | 2982027 | pv | 2017-12-02 05:43:00 | -| 84 | 390657 | 1879194 | pv | 2017-11-28 11:20:30 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### Loading into an existing table using INSERT - -You may want to customize the table that you are inserting into, for example, the: - -- column data type, nullable setting, or default values -- key types and columns -- data partitioning and bucketing - -> **NOTE** -> -> Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This topic does not cover table design. For information about table design, see [Table types](../table_design/StarRocks_table_design.md). - -In this example, we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in HDFS. - -- Since a query of the dataset in HDFS indicates that the `Timestamp` column contains data that matches a VARBINARY data type, the column type is specified in the following DDL. -- By querying the data in HDFS, you can find that there are no `NULL` values in the dataset, so the DDL does not set any columns as nullable. -- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID`. Your use case might be different for this data, so you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key. 
- -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from HDFS): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -Display the schema so that you can compare it with the inferred schema produced by the `FILES()` table function: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | varbinary | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - -Compare the schema you just created with the schema inferred earlier using the `FILES()` table function. Look at: - -- data types -- nullable -- key fields - -To better control the schema of the destination table and for better query performance, we recommend that you specify the table schema by hand in production environments. - -::: - -After creating the table, you can load it with INSERT INTO SELECT FROM FILES(): - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -); -``` - -After the load is complete, you can query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 107 | 1568743 | 4476428 | pv | 2017-11-25 14:29:53 | -| 107 | 470767 | 1020087 | pv | 2017-11-25 14:32:31 | -| 107 | 358238 | 1817004 | pv | 2017-11-25 14:43:23 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### Check load progress - -You can query the progress of INSERT jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. Example: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_0d86c3f9-851f-11ee-9c3e-00163e044958' \G -*************************** 1. 
row *************************** - JOB_ID: 10214 - LABEL: insert_0d86c3f9-851f-11ee-9c3e-00163e044958 - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-11-17 15:58:14 - ETL_START_TIME: 2023-11-17 15:58:14 - ETL_FINISH_TIME: 2023-11-17 15:58:14 - LOAD_START_TIME: 2023-11-17 15:58:14 - LOAD_FINISH_TIME: 2023-11-17 15:58:18 - JOB_DETAILS: {"All backends":{"0d86c3f9-851f-11ee-9c3e-00163e044958":[10120]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"0d86c3f9-851f-11ee-9c3e-00163e044958":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT is a synchronous command. If an INSERT job is still running, you need to open another session to check its execution status. - -## Use Broker Load - -An asynchronous Broker Load process handles making the connection to HDFS, pulling the data, and storing the data in StarRocks. - -This method supports the following file formats: - -- Parquet -- ORC -- CSV -- JSON (supported from v3.2.3 onwards) - -### Advantages of Broker Load - -- Broker Load runs in the background and clients do not need to stay connected for the job to continue. -- Broker Load is preferred for long-running jobs, with the default timeout spanning 4 hours. -- In addition to Parquet and ORC file format, Broker Load supports CSV file format and JSON file format (JSON file format is supported from v3.2.3 onwards). - -### Data flow - -![Workflow of Broker Load](../_assets/broker_load_how-to-work_en.png) - -1. The user creates a load job. -2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BEs) or compute nodes (CNs). -3. The BEs or CNs pull the data from the source and load the data into StarRocks. - -### Typical example - -Create a table, start a load process that pulls the data file `/user/amber/user_behavior_ten_million_rows.parquet` from HDFS, and verify the progress and success of the data loading. - -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table has the same schema as the Parquet file that you want to load from HDFS): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### Start a Broker Load - -Run the following command to start a Broker Load job that loads data from the data file `/user/amber/user_behavior_ten_million_rows.parquet` to the `user_behavior` table: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("hdfs://:/user/amber/user_behavior_ten_million_rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER -( - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -) -PROPERTIES -( - "timeout" = "72000" -); -``` - -This job has four main sections: - -- `LABEL`: A string used when querying the state of the load job. -- `LOAD` declaration: The source URI, source data format, and destination table name. 
-- `BROKER`: The connection details for the source. -- `PROPERTIES`: The timeout value and any other properties to apply to the load job. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Check load progress - -You can query the progress of Broker Load jobs from the `information_schema.loads` view. This feature is supported from v3.1 onwards. - -```SQL -SELECT * FROM information_schema.loads; -``` - -For information about the fields provided in the `loads` view, see [Information Schema](../sql-reference/information_schema/loads.md)). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -In the output below there are two entries for the load job `user_behavior`: - -- The first record shows a state of `CANCELLED`. Scroll to `ERROR_MSG`, and you can see that the job has failed due to `listPath failed`. -- The second record shows a state of `FINISHED`, which means that the job has succeeded. - -```Plaintext -JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH| -------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+ - 10121|user_behavior |mydatabase |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | | - 10106|user_behavior |mydatabase |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | | -``` - -After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. 
Example: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - -## Use Pipe - -Starting from v3.2, StarRocks provides the Pipe loading method, which currently supports only the Parquet and ORC file formats. - -### Advantages of Pipe - - - -Pipe is ideal for continuous data loading and large-scale data loading: - -- **Large-scale data loading in micro-batches helps reduce the cost of retries caused by data errors.** - - With the help of Pipe, StarRocks enables the efficient loading of a large number of data files with a significant data volume in total. Pipe automatically splits the files based on their number or size, breaking down the load job into smaller, sequential tasks. This approach ensures that errors in one file do not impact the entire load job. The load status of each file is recorded by Pipe, allowing you to easily identify and fix files that contain errors. By minimizing the need for retries due to data errors, this approach helps to reduce costs. - -- **Continuous data loading helps reduce manpower.** - - Pipe helps you write new or updated data files to a specific location and continuously load the new data from these files into StarRocks. After you create a Pipe job with `"AUTO_INGEST" = "TRUE"` specified, it will constantly monitor changes to the data files stored in the specified path and automatically load new or updated data from the data files into the destination StarRocks table. - -Additionally, Pipe performs file uniqueness checks to help prevent duplicate data loading. During the loading process, Pipe checks the uniqueness of each data file based on the file name and digest. If a file with a specific file name and digest has already been processed by a Pipe job, the Pipe job will skip all subsequent files with the same file name and digest. Note that HDFS uses `LastModifiedTime` as file digest. - -The load status of each data file is recorded and saved to the `information_schema.pipe_files` view. After a Pipe job associated with the view is deleted, the records about the files loaded in that job will also be deleted. - -### Data flow - -![Pipe data flow](../_assets/pipe_data_flow.png) - -### Differences between Pipe and INSERT+FILES() - -A Pipe job is split into one or more transactions based on the size and number of rows in each data file. Users can query the intermediate results during the loading process. In contrast, an INSERT+`FILES()` job is processed as a single transaction, and users are unable to view the data during the loading process. - -### File loading sequence - -For each Pipe job, StarRocks maintains a file queue, from which it fetches and loads data files as micro-batches. Pipe does not ensure that the data files are loaded in the same order as they are uploaded. Therefore, newer data may be loaded prior to older data. 
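If you need to verify the order in which files were actually ingested, one option is to sort the per-file records by their load start time. This is only a sketch: the pipe name is a placeholder, and the column names are those exposed by the `pipe_files` view described later in this topic.

```SQL
-- 'my_pipe' is a placeholder; substitute the name of your own pipe.
SELECT file_name, load_state, start_load_time, finish_load_time
FROM information_schema.pipe_files
WHERE pipe_name = 'my_pipe'
ORDER BY start_load_time;
```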
- -### Typical example - -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from HDFS): - -```SQL -CREATE TABLE user_behavior_replica -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### Start a Pipe job - -Run the following command to start a Pipe job that loads data from the data file `/user/amber/user_behavior_ten_million_rows.parquet` to the `user_behavior_replica` table: - -```SQL -CREATE PIPE user_behavior_replica -PROPERTIES -( - "AUTO_INGEST" = "TRUE" -) -AS -INSERT INTO user_behavior_replica -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -); -``` - -This job has four main sections: - -- `pipe_name`: The name of the pipe. The pipe name must be unique within the database to which the pipe belongs. -- `INSERT_SQL`: The INSERT INTO SELECT FROM FILES statement that is used to load data from the specified source data file to the destination table. -- `PROPERTIES`: A set of optional parameters that specify how to execute the pipe. These include `AUTO_INGEST`, `POLL_INTERVAL`, `BATCH_SIZE`, and `BATCH_FILES`. Specify these properties in the `"key" = "value"` format. - -For detailed syntax and parameter descriptions, see [CREATE PIPE](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md). - -#### Check load progress - -- Query the progress of Pipe jobs by using [SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md). - - ```SQL - SHOW PIPES; - ``` - - If you have submitted multiple load jobs, you can filter on the `NAME` associated with the job. Example: - - ```SQL - SHOW PIPES WHERE NAME = 'user_behavior_replica' \G - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10252 - PIPE_NAME: user_behavior_replica - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_replica - LOAD_STATUS: {"loadedFiles":1,"loadedBytes":132251298,"loadingFiles":0,"lastLoadedTime":"2023-11-17 16:13:22"} - LAST_ERROR: NULL - CREATED_TIME: 2023-11-17 16:13:15 - 1 row in set (0.00 sec) - ``` - -- Query the progress of Pipe jobs from the [`pipes`](../sql-reference/information_schema/pipes.md) view in the StarRocks Information Schema. - - ```SQL - SELECT * FROM information_schema.pipes; - ``` - - If you have submitted multiple load jobs, you can filter on the `PIPE_NAME` associated with the job. Example: - - ```SQL - SELECT * FROM information_schema.pipes WHERE pipe_name = 'user_behavior_replica' \G - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10252 - PIPE_NAME: user_behavior_replica - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_replica - LOAD_STATUS: {"loadedFiles":1,"loadedBytes":132251298,"loadingFiles":0,"lastLoadedTime":"2023-11-17 16:13:22"} - LAST_ERROR: - CREATED_TIME: 2023-11-17 16:13:15 - 1 row in set (0.00 sec) - ``` - -#### Check file status - -You can query the load status of the files loaded from the [`pipe_files`](../sql-reference/information_schema/pipe_files.md) view in the StarRocks Information Schema. 
- -```SQL -SELECT * FROM information_schema.pipe_files; -``` - -If you have submitted multiple load jobs, you can filter on the `PIPE_NAME` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.pipe_files WHERE pipe_name = 'user_behavior_replica' \G -*************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10252 - PIPE_NAME: user_behavior_replica - FILE_NAME: hdfs://172.26.195.67:9000/user/amber/user_behavior_ten_million_rows.parquet - FILE_VERSION: 1700035418838 - FILE_SIZE: 132251298 - LAST_MODIFIED: 2023-11-15 08:03:38 - LOAD_STATE: FINISHED - STAGED_TIME: 2023-11-17 16:13:16 - START_LOAD_TIME: 2023-11-17 16:13:17 -FINISH_LOAD_TIME: 2023-11-17 16:13:22 - ERROR_MSG: -1 row in set (0.02 sec) -``` - -#### Manage Pipes - -You can alter, suspend or resume, drop, or query the pipes you have created and retry to load specific data files. For more information, see [ALTER PIPE](../sql-reference/sql-statements/loading_unloading/pipe/ALTER_PIPE.md), [SUSPEND or RESUME PIPE](../sql-reference/sql-statements/loading_unloading/pipe/SUSPEND_or_RESUME_PIPE.md), [DROP PIPE](../sql-reference/sql-statements/loading_unloading/pipe/DROP_PIPE.md), [SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md), and [RETRY FILE](../sql-reference/sql-statements/loading_unloading/pipe/RETRY_FILE.md). diff --git a/docs/en/loading/huawei.md b/docs/en/loading/huawei.md deleted file mode 100644 index a55a3f1..0000000 --- a/docs/en/loading/huawei.md +++ /dev/null @@ -1,4 +0,0 @@ ---- -unlisted: true ---- - diff --git a/docs/en/loading/load_concept/strict_mode.md b/docs/en/loading/load_concept/strict_mode.md deleted file mode 100644 index c583869..0000000 --- a/docs/en/loading/load_concept/strict_mode.md +++ /dev/null @@ -1,163 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Strict mode - -Strict mode is an optional property that you can configure for data loads. It affects the loading behavior and the final loaded data. - -This topic introduces what strict mode is and how to set strict mode. - -## Understand strict mode - -During data loading, the data types of the source columns may not be completely consistent with the data types of the destination columns. In such cases, StarRocks performs conversions on the source column values that have inconsistent data types. Data conversions may fail due to various issues such as unmatched field data types and field length overflows. Source column values that fail to be properly converted are unqualified column values, and source rows that contain unqualified column values are referred to as "unqualified rows". Strict mode is used to control whether to filter out unqualified rows during data loading. - -Strict mode works as follows: - -- If strict mode is enabled, StarRocks loads only qualified rows. It filters out unqualified rows and returns details about the unqualified rows. -- If strict mode is disabled, StarRocks converts unqualified column values into `NULL` and loads unqualified rows that contain these `NULL` values together with qualified rows. - -Note the following points: - -- In actual business scenarios, both qualified and unqualified rows may contain `NULL` values. If the destination columns do not allow `NULL` values, StarRocks reports errors and filters out the rows that contain `NULL` values. - -- The maximum percentage of unqualified rows that can be filtered out for a load job is controlled by an optional job property `max_filter_ratio`. 
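For instance, a hypothetical Broker Load job that enables strict mode while tolerating up to 10% unqualified rows might look like the following sketch. The label, file path, table name, and credentials are placeholders only.

```SQL
LOAD LABEL example_db.example_label
(
    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/path/to/file.csv")
    INTO TABLE example_table
)
WITH BROKER
(
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
)
PROPERTIES
(
    "strict_mode" = "true",
    -- The job fails if more than 10% of the rows are filtered out due to data quality issues.
    "max_filter_ratio" = "0.1"
);
```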
:::note

The `max_filter_ratio` property for INSERT is supported from v3.4.0.

:::

For example, you want to load four rows that hold `\N` (`\N` denotes a `NULL` value), `abc`, `2000`, and `1` values respectively in a column from a CSV-formatted data file into a StarRocks table, and the data type of the destination StarRocks table column is TINYINT [-128, 127].

- The source column value `\N` is processed into `NULL` upon conversion to TINYINT.

  > **NOTE**
  >
  > `\N` is always processed into `NULL` upon conversion regardless of the destination data type.

- The source column value `abc` is processed into `NULL`, because its data type is not TINYINT and the conversion fails.

- The source column value `2000` is processed into `NULL`, because it is beyond the range supported by TINYINT and the conversion fails.

- The source column value `1` can be properly converted to a TINYINT-type value `1`.

If strict mode is disabled, StarRocks loads all four rows.

If strict mode is enabled, StarRocks loads only the rows that hold `\N` or `1` and filters out the rows that hold `abc` or `2000`. The rows filtered out are counted against the maximum percentage of rows that can be filtered out due to inadequate data quality, as specified by the `max_filter_ratio` parameter.

### Final loaded data with strict mode disabled

| Source column value | Column value upon conversion to TINYINT | Load result when destination column allows NULL values | Load result when destination column does not allow NULL values |
| ------------------- | --------------------------------------- | ------------------------------------------------------ | --------------------------------------------------------------- |
| \N                  | NULL                                     | The value `NULL` is loaded.                             | An error is reported.                                            |
| abc                 | NULL                                     | The value `NULL` is loaded.                             | An error is reported.                                            |
| 2000                | NULL                                     | The value `NULL` is loaded.                             | An error is reported.                                            |
| 1                   | 1                                        | The value `1` is loaded.                                | The value `1` is loaded.                                         |

### Final loaded data with strict mode enabled

| Source column value | Column value upon conversion to TINYINT | Load result when destination column allows NULL values        | Load result when destination column does not allow NULL values |
| ------------------- | --------------------------------------- | -------------------------------------------------------------- | --------------------------------------------------------------- |
| \N                  | NULL                                     | The value `NULL` is loaded.                                     | An error is reported.                                            |
| abc                 | NULL                                     | The value `NULL` is not allowed and therefore is filtered out.  | An error is reported.                                            |
| 2000                | NULL                                     | The value `NULL` is not allowed and therefore is filtered out.  | An error is reported.                                            |
| 1                   | 1                                        | The value `1` is loaded.                                        | The value `1` is loaded.                                         |

## Set strict mode

You can use the `strict_mode` parameter to set strict mode for the load job. Valid values are `true` and `false`. The default value is `false`. The value `true` enables strict mode, and the value `false` disables strict mode. Note that the `strict_mode` parameter is supported for INSERT from v3.4.0, with the default value `true`. For all loading methods except Stream Load, `strict_mode` is set in the same way, in the PROPERTIES clause; for Stream Load, it is set as an HTTP header, as shown in the examples below.

You can also use the `enable_insert_strict` session variable to set strict mode. Valid values are `true` and `false`. The default value is `true`. The value `true` enables strict mode, and the value `false` disables strict mode.
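For example, a minimal sketch of toggling strict mode at the session level, using only the `enable_insert_strict` variable described above:

```SQL
-- Disable strict mode for INSERT jobs in the current session.
SET enable_insert_strict = false;

-- Restore the default behavior.
SET enable_insert_strict = true;
```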
- -:::note - -From v3.4.0 onwards, when `enable_insert_strict` is set to `true`, the system loads only qualified rows. It filters out unqualified rows and returns details about the unqualified rows. Instead, in versions earlier than v3.4.0, when `enable_insert_strict` is set to `true`, the INSERT jobs fails when there is an unqualified rows. - -::: - -Examples are as follows: - -### Stream Load - -```Bash -curl --location-trusted -u : \ - -H "strict_mode: {true | false}" \ - -T -XPUT \ - http://:/api///_stream_load -``` - -For detailed syntax and parameters about Stream Load, see [STREAM LOAD](../../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md). - -### Broker Load - -```SQL -LOAD LABEL [.] -( - DATA INFILE (""[, "" ...]) - INTO TABLE -) -WITH BROKER -( - "username" = "", - "password" = "" -) -PROPERTIES -( - "strict_mode" = "{true | false}" -) -``` - -The preceding code snippet uses HDFS as an example. For detailed syntax and parameters about Broker Load, see [BROKER LOAD](../../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -### Routine Load - -```SQL -CREATE ROUTINE LOAD [.] ON -PROPERTIES -( - "strict_mode" = "{true | false}" -) -FROM KAFKA -( - "kafka_broker_list" =":[,:...]", - "kafka_topic" = "" -) -``` - -The preceding code snippet uses Apache Kafka® as an example. For detailed syntax and parameters about Routine Load, see [CREATE ROUTINE LOAD](../../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -### Spark Load - -```SQL -LOAD LABEL [.] -( - DATA INFILE (""[, "" ...]) - INTO TABLE -) -WITH RESOURCE -( - "spark.executor.memory" = "3g", - "broker.username" = "", - "broker.password" = "" -) -PROPERTIES -( - "strict_mode" = "{true | false}" -) -``` - -The preceding code snippet uses HDFS as an example. For detailed syntax and parameters about Spark Load, see [SPARK LOAD](../../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md). - -### INSERT - -```SQL -INSERT INTO [.] -PROPERTIES( - "strict_mode" = "{true | false}" -) - -``` - -For detailed syntax and parameters about INSERT, see [INSERT](../../sql-reference/sql-statements/loading_unloading/INSERT.md). diff --git a/docs/en/loading/load_from_pulsar.md b/docs/en/loading/load_from_pulsar.md deleted file mode 100644 index de29715..0000000 --- a/docs/en/loading/load_from_pulsar.md +++ /dev/null @@ -1,147 +0,0 @@ ---- -displayed_sidebar: docs ---- -import Experimental from '../_assets/commonMarkdown/_experimental.mdx' - -# Continuously load data from Apache® Pulsar™ - - - -As of StarRocks version 2.5, Routine Load supports continuously loading data from Apache® Pulsar™. Pulsar is distributed, open source pub-sub messaging and streaming platform with a store-compute separation architecture. Loading data from Pulsar via Routine Load is similar to loading data from Apache Kafka. This topic uses CSV-formatted data as an example to introduce how to load data from Apache Pulsar via Routine Load. - -## Supported data file formats - -Routine Load supports consuming CSV and JSON formatted data from a Pulsar cluster. - -> NOTE -> -> As for data in CSV format, StarRocks supports UTF-8 encoded strings within 50 bytes as column separators. Commonly used column separators include comma (,), tab and pipe (|). - -## Pulsar-related concepts - -**[Topic](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#topics)** - -Topics in Pulsar are named channels for transmitting messages from producers to consumers. 
Topics in Pulsar are divided into partitioned topics and non-partitioned topics.

- **[Partitioned topics](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#partitioned-topics)** are a special type of topic that is handled by multiple brokers, thus allowing for higher throughput. A partitioned topic is actually implemented as N internal topics, where N is the number of partitions.
- **Non-partitioned topics** are a normal type of topic that is served by only a single broker, which limits the maximum throughput of the topic.

**[Message ID](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#messages)**

The message ID of a message is assigned by [BookKeeper instances](https://pulsar.apache.org/docs/2.10.x/concepts-architecture-overview/#apache-bookkeeper) as soon as the message is persistently stored. A message ID indicates a message's specific position in a ledger and is unique within a Pulsar cluster.

Pulsar allows consumers to specify the initial position through consumer.*seek*(*messageId*). However, unlike the Kafka consumer offset, which is a long integer value, the message ID consists of four parts: `ledgerId:entryID:partition-index:batch-index`.

Therefore, you cannot get the message ID directly from a message. As a result, at present, **Routine Load does not support specifying an initial position when loading data from Pulsar; it only supports consuming data from the beginning or end of a partition.**

**[Subscription](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#subscriptions)**

A subscription is a named configuration rule that determines how messages are delivered to consumers. Pulsar also supports consumers simultaneously subscribing to multiple topics. A topic can have multiple subscriptions.

The type of a subscription is defined when a consumer connects to it, and the type can be changed by restarting all consumers with a different configuration. Four subscription types are available in Pulsar:

- `exclusive` (default): Only a single consumer is allowed to attach to the subscription, and only that consumer can consume messages.
- `shared`: Multiple consumers can attach to the same subscription. Messages are delivered in a round-robin distribution across consumers, and any given message is delivered to only one consumer.
- `failover`: Multiple consumers can attach to the same subscription. A master consumer is picked for a non-partitioned topic or for each partition of a partitioned topic and receives messages. When the master consumer disconnects, all (non-acknowledged and subsequent) messages are delivered to the next consumer in line.
- `key_shared`: Multiple consumers can attach to the same subscription. Messages are delivered in a distribution across consumers, and messages with the same key or the same ordering key are delivered to only one consumer.

> Note:
>
> Currently, Routine Load uses the exclusive type.

## Create a Routine Load job

The following examples describe how to consume CSV-formatted messages from Pulsar and load the data into StarRocks by creating a Routine Load job. For detailed instructions and reference, see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md).
- -```SQL -CREATE ROUTINE LOAD load_test.routine_wiki_edit_1 ON routine_wiki_edit -COLUMNS TERMINATED BY ",", -ROWS TERMINATED BY "\n", -COLUMNS (order_id, pay_dt, customer_name, nationality, temp_gender, price) -WHERE event_time > "2022-01-01 00:00:00", -PROPERTIES -( - "desired_concurrent_number" = "1", - "max_batch_interval" = "15000", - "max_error_number" = "1000" -) -FROM PULSAR -( - "pulsar_service_url" = "pulsar://localhost:6650", - "pulsar_topic" = "persistent://tenant/namespace/topic-name", - "pulsar_subscription" = "load-test", - "pulsar_partitions" = "load-partition-0,load-partition-1", - "pulsar_initial_positions" = "POSITION_EARLIEST,POSITION_LATEST", - "property.auth.token" = "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUJzdWIiOiJqaXV0aWFuY2hlbiJ9.lulGngOC72vE70OW54zcbyw7XdKSOxET94WT_hIqD5Y" -); -``` - -When Routine Load is created to consume data from Pulsar, most input parameters except for `data_source_properties` are the same as consuming data from Kafka . For descriptions about parameters except data_source_properties `data_source_properties` , see [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md). - -The parameters related to `data_source_properties` and their descriptions are as follows: - -| **Parameter** | **Required** | **Description** | -| ------------------------------------------- | ------------ | ------------------------------------------------------------ | -| pulsar_service_url | Yes | The URL that is used to connect to the Pulsar cluster. Format: `"pulsar://ip:port"` or `"pulsar://service:port"`.Example: `"pulsar_service_url" = "pulsar://``localhost:6650``"` | -| pulsar_topic | Yes | Subscribed topic. Example: "pulsar_topic" = "persistent://tenant/namespace/topic-name" | -| pulsar_subscription | Yes | Subscription configured for the topic.Example: `"pulsar_subscription" = "my_subscription"` | -| pulsar_partitions, pulsar_initial_positions | No | `pulsar_partitions` : Subscribed partitions in the topic.`pulsar_initial_positions`: initial positions of partitions specified by `pulsar_partitions`. The initial positions must correspond to the partitions in `pulsar_partitions`. Valid values:`POSITION_EARLIEST` (Default value): Subscription starts from the earliest available message in the partition. `POSITION_LATEST`: Subscription starts from the latest available message in the partition.Note:If `pulsar_partitions` is not specified, the topic's all partitions are subscribed.If both `pulsar_partitions` and `property.pulsar_default_initial_position` are specified, the `pulsar_partitions` value overrides `property.pulsar_default_initial_position` value.If neither `pulsar_partitions` nor `property.pulsar_default_initial_position` is specified, subscription starts from the latest available message in the partition.Example:`"pulsar_partitions" = "my-partition-0,my-partition-1,my-partition-2,my-partition-3", "pulsar_initial_positions" = "POSITION_EARLIEST,POSITION_EARLIEST,POSITION_LATEST,POSITION_LATEST"` | - -Routine Load supports the following custom parameters for Pulsar. - -| Parameter | Required | Description | -| ---------------------------------------- | -------- | ------------------------------------------------------------ | -| property.pulsar_default_initial_position | No | The default initial positions when the topic's partitions are subscribed. The parameter takes effect when `pulsar_initial_positions` is not specified. 
Its valid values are the same as the valid values of `pulsar_initial_positions`.Example: `"``property.pulsar_default_initial_position" = "POSITION_EARLIEST"` | -| property.auth.token | No | If Pulsar enables authenticating clients using security tokens, you need the token string to verify your identity.Example: `"p``roperty.auth.token" = "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUJzdWIiOiJqaXV0aWFuY2hlbiJ9.lulGngOC72vE70OW54zcbyw7XdKSOxET94WT_hIqD"` | - -## Check a load job and task - -### Check a load job - -Execute the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement to check the status of the load job `routine_wiki_edit_1`. StarRocks returns the execution state `State`, the statistical information (including the total rows consumed and the total rows loaded) `Statistics`, and the progress of the load job `progress`. - -When you check a Routine Load job that consumes data from Pulsar, most returned parameters except for `progress` are the same as consuming data from Kafka. `progress` refers to backlog, that is the number of unacked messages in a partition. - -```Plaintext -MySQL [load_test] > SHOW ROUTINE LOAD for routine_wiki_edit_1 \G -*************************** 1. row *************************** - Id: 10142 - Name: routine_wiki_edit_1 - CreateTime: 2022-06-29 14:52:55 - PauseTime: 2022-06-29 17:33:53 - EndTime: NULL - DbName: default_cluster:test_pulsar - TableName: test1 - State: PAUSED - DataSourceType: PULSAR - CurrentTaskNum: 0 - JobProperties: {"partitions":"*","rowDelimiter":"'\n'","partial_update":"false","columnToColumnExpr":"*","maxBatchIntervalS":"10","whereExpr":"*","timezone":"Asia/Shanghai","format":"csv","columnSeparator":"','","json_root":"","strict_mode":"false","jsonpaths":"","desireTaskConcurrentNum":"3","maxErrorNum":"10","strip_outer_array":"false","currentTaskConcurrentNum":"0","maxBatchRows":"200000"} -DataSourceProperties: {"serviceUrl":"pulsar://localhost:6650","currentPulsarPartitions":"my-partition-0,my-partition-1","topic":"persistent://tenant/namespace/topic-name","subscription":"load-test"} - CustomProperties: {"auth.token":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUJzdWIiOiJqaXV0aWFuY2hlbiJ9.lulGngOC72vE70OW54zcbyw7XdKSOxET94WT_hIqD"} - Statistic: {"receivedBytes":5480943882,"errorRows":0,"committedTaskNum":696,"loadedRows":66243440,"loadRowsRate":29000,"abortedTaskNum":0,"totalRows":66243440,"unselectedRows":0,"receivedBytesRate":2400000,"taskExecuteTimeMs":2283166} - Progress: {"my-partition-0(backlog): 100","my-partition-1(backlog): 0"} -ReasonOfStateChanged: - ErrorLogUrls: - OtherMsg: -1 row in set (0.00 sec) -``` - -### Check a load task - -Execute the [SHOW ROUTINE LOAD TASK](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md) statement to check the load tasks of the load job `routine_wiki_edit_1`, such as how many tasks are running, the Kafka topic partitions that are consumed and the consumption progress `DataSourceProperties`, and the corresponding Coordinator BE node `BeId`. - -```SQL -MySQL [example_db]> SHOW ROUTINE LOAD TASK WHERE JobName = "routine_wiki_edit_1" \G -``` - -## Alter a load job - -Before altering a load job, you must pause it by using the [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) statement. Then you can execute the [ALTER ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/ALTER_ROUTINE_LOAD.md). 
After altering it, you can execute the [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) statement to resume it, and check its status by using the [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) statement. - -When Routine Load is used to consume data from Pulsar, most returned parameters except for `data_source_properties` are the same as consuming data from Kafka. - -**Take note of the following points**: - -- Among the `data_source_properties` related parameters, only `pulsar_partitions`, `pulsar_initial_positions`, and custom Pulsar parameters `property.pulsar_default_initial_position` and `property.auth.token` are currently supported to be modified. The parameters `pulsar_service_url`, `pulsar_topic`, and `pulsar_subscription` cannot be modified. -- If you need to modify the partition to be consumed and the matched initilal position, you need to make sure that you specify the partition using `pulsar_partitions` when you create the Routine Load job, and only the intial position `pulsar_initial_positions` of the specified partition can be modified. -- If you specify only Topic `pulsar_topic` when creating a Routine Load job, but not partitions `pulsar_partitions`, you can modify the starting position of all partitions under topic via `pulsar_default_initial_position`. diff --git a/docs/en/loading/loading.mdx b/docs/en/loading/loading.mdx deleted file mode 100644 index 373d3aa..0000000 --- a/docs/en/loading/loading.mdx +++ /dev/null @@ -1,9 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Data loading - -import DocCardList from '@theme/DocCardList'; - - \ No newline at end of file diff --git a/docs/en/loading/loading_introduction/feature-support-loading-and-unloading.md b/docs/en/loading/loading_introduction/feature-support-loading-and-unloading.md deleted file mode 100644 index e72d214..0000000 --- a/docs/en/loading/loading_introduction/feature-support-loading-and-unloading.md +++ /dev/null @@ -1,632 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_label: "Feature Support" ---- - -# Feature Support: Data Loading and Unloading - -This document outlines the features of various data loading and unloading methods supported by StarRocks. - -## File format - -### Loading file formats - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Data SourceFile Format
CSVJSON [3]ParquetORCAvroProtoBufThrift
Stream LoadLocal file systems, applications, connectorsYesYesTo be supportedTo be supportedTo be supported
INSERT from FILESHDFS, S3, OSS, Azure, GCS, NFS(NAS) [5]Yes (v3.3+)To be supportedYes (v3.1+)Yes (v3.1+)Yes (v3.4.4+)To be supported
Broker LoadYesYes (v3.2.3+)YesYesTo be supported
Routine LoadKafkaYesYesTo be supportedTo be supportedYes (v3.0+) [1]To be supportedTo be supported
Spark LoadYesTo be supportedYesYesTo be supported
ConnectorsFlink, SparkYesYesTo be supportedTo be supportedTo be supported
Kafka Connector [2]KafkaYes (v3.0+)To be supportedTo be supportedYes (v3.0+)To be supported
PIPE [4]Consistent with INSERT from FILES
- -:::note - -[1], [2]\: Schema Registry is required. - -[3]\: JSON supports a variety of CDC formats. For details about the JSON CDC formats supported by StarRocks, see [JSON CDC format](#json-cdc-formats). - -[4]\: Currently, only INSERT from FILES is supported for loading with PIPE. - -[5]\: You need to mount a NAS device as NFS under the same directory of each BE or CN node to access the files in NFS via the `file://` protocol. - -::: - -#### JSON CDC formats - - - - - - - - - - - - - - - - - - - - - - - - - -
Stream LoadRoutine LoadBroker LoadINSERT from FILESKafka Connector [1]
DebeziumTo be supportedTo be supportedTo be supportedTo be supportedYes (v3.0+)
CanalTo be supported
Maxwell
- -:::note - -[1]\: You must configure the `transforms` parameter while loading Debezium CDC format data into Primary Key tables in StarRocks. - -::: - -### Unloading file formats - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TargetFile format
Table formatRemote storageCSVJSONParquetORC
INSERT INTO FILESN/AHDFS, S3, OSS, Azure, GCS, NFS(NAS) [3]Yes (v3.3+)To be supportedYes (v3.2+)Yes (v3.3+)
INSERT INTO CatalogHiveHDFS, S3, OSS, Azure, GCSYes (v3.3+)To be supportedYes (v3.2+)Yes (v3.3+)
IcebergHDFS, S3, OSS, Azure, GCSTo be supportedTo be supportedYes (v3.2+)To be supported
Hudi/DeltaTo be supported
EXPORTN/AHDFS, S3, OSS, Azure, GCSYes [1]To be supportedTo be supportedTo be supported
PIPETo be supported [2]
- -:::note - -[1]\: Configuring Broker process is supported. - -[2]\: Currently, unloading data using PIPE is not supported. - -[3]\: You need to mount a NAS device as NFS under the same directory of each BE or CN node to access the files in NFS via the `file://` protocol. - -::: - -## File format-related parameters - -### Loading file format-related parameters - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
File formatParameterLoading method
Stream LoadINSERT from FILESBroker LoadRoutine LoadSpark Load
CSVcolumn_separatorYesYes (v3.3+)Yes [1]
row_delimiterYesYes [2] (v3.1+)Yes [3] (v2.2+)To be supported
encloseYes (v3.0+)Yes (v3.0+)Yes (v3.0+)To be supported
escape
skip_headerTo be supported
trim_spaceYes (v3.0+)
JSONjsonpathsYesTo be supportedYes (v3.2.3+)YesTo be supported
strip_outer_array
json_root
ignore_json_sizeTo be supported
- -:::note - -[1]\: The corresponding parameter is `COLUMNS TERMINATED BY`. - -[2]\: The corresponding parameter is `ROWS TERMINATED BY`. - -[3]\: The corresponding parameter is `ROWS TERMINATED BY`. - -::: - -### Unloading file format-related parameters - - - - - - - - - - - - - - - - - - - - -
File formatParameterUnloading method
INSERT INTO FILESEXPORT
CSVcolumn_separatorYes (v3.3+)Yes
line_delimiter [1]
- -:::note - -[1]\: The corresponding parameter in data loading is `row_delimiter`. - -::: - -## Compression formats - -### Loading compression formats - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
File formatCompression formatLoading method
Stream LoadBroker LoadINSERT from FILESRoutine LoadSpark Load
CSV -
    -
  • deflate
  • -
  • bzip2
  • -
  • gzip
  • -
  • lz4_frame
  • -
  • zstd
  • -
-
Yes [1]Yes [2]To be supportedTo be supportedTo be supported
JSONYes (v3.2.7+) [3]To be supportedN/ATo be supportedN/A
Parquet -
    -
  • gzip
  • -
  • lz4
  • -
  • snappy
  • -
  • zlib
  • -
  • zstd
  • -
-
N/AYes [4]To be supportedYes [4]
ORC
- -:::note - -[1]\: Currently, only when loading CSV files with Stream Load can you specify the compression format by using `format=gzip`, indicating gzip-compressed CSV files. `deflate` and `bzip2` formats are also supported. - -[2]\: Broker Load does not support specifying the compression format of CSV files by using the parameter `format`. Broker Load identifies the compression format by using the suffix of the file. The suffix of gzip-compressed files is `.gz`, and that of the zstd-compressed files is `.zst`. Besides, other `format`-related parameters, such as `trim_space` and `enclose`, are also not supported. - -[3]\: Supports specifying the compression format by using `compression = gzip`. - -[4]\: Supported by Arrow Library. You do not need to configure the `compression` parameter. - -::: - -### Unloading compression formats - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
File formatCompression formatUnloading method
INSERT INTO FILESINSERT INTO CatalogEXPORT
HiveIcebergHudi/Delta
CSV -
    -
  • deflate
  • -
  • bzip2
  • -
  • gzip
  • -
  • lz4_frame
  • -
  • zstd
  • -
-
To be supportedTo be supportedTo be supportedTo be supportedTo be supported
JSONN/AN/AN/AN/AN/AN/A
Parquet -
    -
  • gzip
  • -
  • lz4
  • -
  • snappy
  • -
  • zstd
  • -
-
Yes (v3.2+)Yes (v3.2+)Yes (v3.2+)To be supportedN/A
ORC
- -## Credentials - -### Loading - Authentication - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
AuthenticationLoading method
Stream LoadINSERT from FILESBroker LoadRoutine LoadExternal Catalog
Single KerberosN/AYes (v3.1+)Yes [1] (versions earlier than v2.5)Yes [2] (v3.1.4+)Yes
Kerberos Ticket Granting Ticket (TGT)N/ATo be supportedYes (v3.1.10+/v3.2.1+)
Single KDC Multiple KerberosN/A
Basic access authentications (Access Key pair, IAM Role)N/AYes (HDFS and S3-compatible object storage)Yes [3]Yes
- -:::note - -[1]\: For HDFS, StarRocks supports both simple authentication and Kerberos authentication. - -[2]\: When the security protocol is set to `sasl_plaintext` or `sasl_ssl`, both SASL and GSSAPI (Kerberos) authentications are supported. - -[3]\: When the security protocol is set to `sasl_plaintext` or `sasl_ssl`, both SASL and PLAIN authentications are supported. - -::: - -### Unloading - Authentication - -| | INSERT INTO FILES | EXPORT | -| :-------------- | :----------------: | :-------------: | -| Single Kerberos | To be supported | To be supported | - -## Loading - Other parameters and features - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Parameter and featureLoading method
Stream LoadINSERT from FILESINSERT from SELECT/VALUESBroker LoadPIPERoutine LoadSpark Load
partial_updateYes (v3.0+)Yes [1] (v3.3+)Yes (v3.0+)N/AYes (v3.0+)To be supported
partial_update_modeYes (v3.1+)To be supportedYes (v3.1+)N/ATo be supportedTo be supported
COLUMNS FROM PATHN/AYes (v3.2+)N/AYesN/AN/AYes
timezone or session variable time_zone [2]Yes [3]Yes [4]Yes [4]Yes [4]To be supportedYes [4]To be supported
Time accuracy - MicrosecondYesYesYesYes (v3.1.11+/v3.2.6+)To be supportedYesYes
- -:::note - -[1]\: From v3.3 onwards, StarRocks supports Partial Updates in Row mode for INSERT INTO by specifying the column list. - -[2]\: Setting the time zone by the parameter or the session variable will affect the results returned by functions such as strftime(), alignment_timestamp(), and from_unixtime(). - -[3]\: Only the parameter `timezone` is supported. - -[4]\: Only the session variable `time_zone` is supported. - -::: - -## Unloading - Other parameters and features - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Parameter and featureINSERT INTO FILESEXPORT
target_max_file_sizeYes (v3.2+)To be supported
single
Partitioned_by
Session variable time_zoneTo be supported
Time accuracy - MicrosecondTo be supportedTo be supported
diff --git a/docs/en/loading/loading_introduction/loading_concepts.md b/docs/en/loading/loading_introduction/loading_concepts.md deleted file mode 100644 index 271b57d..0000000 --- a/docs/en/loading/loading_introduction/loading_concepts.md +++ /dev/null @@ -1,146 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 ---- - -# Loading concepts - -import InsertPrivNote from '../../_assets/commonMarkdown/insertPrivNote.mdx' - -This topic introduces common concepts and information about data loading. - -## Privileges - - - -## Labeling - -You can load data into StarRocks by running load jobs. Each load job has a unique label that is specified by the user or automatically generated by StarRocks to identify the job. Each label can be used only for one load job. After a load job is complete, its label cannot be reused for any other load jobs. Only the labels of failed load jobs can be reused. - -## Atomicity - -All the loading methods provided by StarRocks guarantee atomicity. Atomicity means that the qualified data within a load job must be all successfully loaded or none of the qualified data is successfully loaded. It never happens that some of the qualified data is loaded while the other data is not. Note that the qualified data does not include the data that is filtered out due to quality issues such as data type conversion errors. - -## Protocols - -StarRocks supports two communication protocols that can be used to submit load jobs: MySQL and HTTP. Of all the loading methods supported by StarRocks, only Stream Load uses HTTP, whereas all the others use MySQL. - -## Data types - -StarRocks supports loading data of all data types. You only need to take note of the limits on the loading of a few specific data types. For more information, see [Data types](../../sql-reference/data-types/README.md). - -## Strict mode - -Strict mode is an optional property that you can configure for data loads. It affects the loading behavior and the final loaded data. For details, see [Strict mode](../load_concept/strict_mode.md). - -## Loading modes - -StarRocks supports two loading modes: synchronous loading mode and asynchronous loading mode. - -:::note - -If you load data by using external programs, you must choose a loading mode that best suits your business requirements before you decide the loading method of your choice. - -::: - -### Synchronous loading - -In synchronous loading mode, after you submit a load job, StarRocks synchronously runs the job to load data, and returns the result of the job after the job finishes. You can check whether the job is successful based on the job result. - -StarRocks provides two loading methods that support synchronous loading: [Stream Load](../StreamLoad.md) and [INSERT](../InsertInto.md). - -The process of synchronous loading is as follows: - -1. Create a load job. - -2. View the job result returned by StarRocks. - -3. Check whether the job is successful based on the job result. If the job result indicates a load failure, you can retry the job. - -### Asynchronous loading - -In asynchronous loading mode, after you submit a load job, StarRocks immediately returns the job creation result. - -- If the result indicates a job creation success, StarRocks asynchronously runs the job. However, that does not mean that the data has been successfully loaded. You must use statements or commands to check the status of the job. Then, you can determine whether the data is successfully loaded based on the job status. 
- -- If the result indicates a job creation failure, you can determine whether you need to retry the job based on the failure information. - -:::tip - -You can set different write quorum for tables, that is, how many replicas are required to return loading success before StarRocks can determine the loading task is successful. You can specify write quorum by adding the property `write_quorum` when you [CREATE TABLE](../../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE.md), or add this property to an existing table using [ALTER TABLE](../../sql-reference/sql-statements/table_bucket_part_index/ALTER_TABLE.md). - -::: - -StarRocks provides four loading methods that support asynchronous loading: [Broker Load](../../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), [Pipe](../../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md), [Routine Load](../../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md), and [Spark Load](../../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md). - -The process of asynchronous loading is as follows: - -1. Create a load job. - -2. View the job creation result returned by StarRocks and determine whether the job is successfully created. - - - If the job creation succeeds, go to Step 3. - - - If the job creation fails, return to Step 1. - -3. Use statements or commands to check the status of the job until the job status shows **FINISHED** or **CANCELLED**. - -#### Workflow of Broker Load or Spark Load - -The workflow of a Broker Load or Spark Load job consists of five stages, as shown in the following figure. - -![Broker Load or Spark Load overflow](../../_assets/4.1-1.png) - -The workflow is described as follows: - -1. **PENDING** - - The job is in queue waiting to be scheduled by an FE. - -2. **ETL** - - The FE pre-processes the data, including cleansing, partitioning, sorting, and aggregation. - - Only a Spark Load job has the ETL stage. A Broker Load job skips this stage. - -3. **LOADING** - - The FE cleanses and transforms the data, and then sends the data to the BEs or CNs. After all data is loaded, the data is in queue waiting to take effect. At this time, the status of the job remains **LOADING**. - -4. **FINISHED** - - When loading finishes and all data involved takes effect, the status of the job becomes **FINISHED**. At this time, the data can be queried. **FINISHED** is a final job state. - -5. **CANCELLED** - - Before the status of the job becomes **FINISHED**, you can cancel the job at any time. Additionally, StarRocks can automatically cancel the job in case of load errors. After the job is canceled, the status of the job becomes **CANCELLED**, and all data updates made before the cancellation are reverted. **CANCELLED** is also a final job state. - -#### Workflow of Pipe - -The workflow of a Pipe job is described as follows: - -1. The job is submitted to an FE from a MySQL client. - -2. The FE splits the data files stored in the specified path based on their number or size, breaking down job into smaller, sequential tasks. The tasks enter a queue, waiting to be scheduled, after they are created. - -3. The FE obtains the tasks from the queue, and invokes the INSERT INTO SELECT FROM FILES statement to execute each task. - -4. The data loading finishes: - - - If `"AUTO_INGEST" = "FALSE"` is specified for the job at job creation, the job finishes after the data of all the data files stored in the specified path is loaded. 
- - - If `"AUTO_INGEST" = "TRUE"` is specified for the job at job creation, the FE will continue to monitor changes to the data files and automatically loads new or updated data from the data files into the destination StarRocks table. - -#### Workflow of Routine Load - -The workflow of a Routine Load job is described as follows: - -1. The job is submitted to an FE from a MySQL client. - -2. The FE splits the job into multiple tasks. Each task is engineered to load data from multiple partitions. - -3. The FE distributes the tasks to specified BEs or CNs. - -4. The BEs or CNs execute the tasks, and report to the FE after they finish the tasks. - -5. The FE generates subsequent tasks, retries failed tasks if there are any, or suspends task scheduling based on the reports from the BEs. diff --git a/docs/en/loading/loading_introduction/loading_considerations.md b/docs/en/loading/loading_introduction/loading_considerations.md deleted file mode 100644 index 01a9d13..0000000 --- a/docs/en/loading/loading_introduction/loading_considerations.md +++ /dev/null @@ -1,78 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 ---- - -# Considerations - -This topic describes some system limits and configurations that you need to consider before you run data loads. - -## Memory limits - -StarRocks provides parameters for you to limit the memory usage for each load job, thereby reducing memory consumption, especially in high concurrency scenarios. However, do not specify an excessively low memory usage limit. If the memory usage limit is excessively low, data may be frequently flushed from memory to disk because the memory usage for load jobs reaches the specified limit. We recommend that you specify a proper memory usage limit based on your business scenario. - -The parameters that are used to limit memory usage vary for each loading method. For more information, see [Stream Load](../../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md), [Broker Load](../../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md), [Routine Load](../../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md), [Spark Load](../../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md), and [INSERT](../../sql-reference/sql-statements/loading_unloading/INSERT.md). Note that a load job usually runs on multiple BEs or CNs. Therefore, the parameters limit the memory usage of each load job on each involved BE or CN rather than the total memory usage of the load job on all involved BEs or CNs. - -StarRocks also provides parameters for you to limit the total memory usage of all load jobs that run on each individual BE or CN. For more information, see the "[System configurations](#system-configurations)" section below. - -## System configurations - -This section describes some parameter configurations that are applicable to all of the loading methods provided by StarRocks. - -### FE configurations - -You can configure the following parameters in the configuration file **fe.conf** of each FE: - -- `max_load_timeout_second` and `min_load_timeout_second` - - These parameters specify the maximum timeout period and minimum timeout period of each load job. The timeout periods are measured in seconds. The default maximum timeout period spans 3 days, and the default minimum timeout period spans 1 second. The maximum timeout period and minimum timeout period that you specify must fall within the range of 1 second to 3 days. 
These parameters are valid for both synchronous load jobs and asynchronous load jobs. - -- `desired_max_waiting_jobs` - - This parameter specifies the maximum number of load jobs that can be held waiting in queue. The default value is **1024** (100 in v2.4 and earlier, and 1024 in v2.5 and later). When the number of load jobs in the **PENDING** state on an FE reaches the maximum number that you specify, the FE rejects new load requests. This parameter is valid only for asynchronous load jobs. - -- `max_running_txn_num_per_db` - - This parameter specifies the maximum number of ongoing load transactions that are allowed in each database of your StarRocks cluster. A load job can contain one or more transactions. The default value is **100**. When the number of load transactions running in a database reaches the maximum number that you specify, the subsequent load jobs that you submit are not scheduled. In this situation, if you submit a synchronous load job, the job is rejected. If you submit an asynchronous load job, the job is held waiting in queue. - - :::note - - StarRocks counts all load jobs together and does not distinguish between synchronous load jobs and asynchronous load jobs. - - ::: - -- `label_keep_max_second` - - This parameter specifies the retention period of the history records for load jobs that have finished and are in the **FINISHED** or **CANCELLED** state. The default retention period spans 3 days. This parameter is valid for both synchronous load jobs and asynchronous load jobs. - -### BE/CN configurations - -You can configure the following parameters in the configuration file **be.conf** of each BE or the configuration file **cn.conf** of each CN: - -- `write_buffer_size` - - This parameter specifies the maximum memory block size. The default size is 100 MB. The loaded data is first written to a memory block on the BE or CN. When the amount of data that is loaded reaches the maximum memory block size that you specify, the data is flushed to disk. You must specify a proper maximum memory block size based on your business scenario. - - - If the maximum memory block size is exceedingly small, a large number of small files may be generated on the BE or CN. In this case, query performance degrades. You can increase the maximum memory block size to reduce the number of files generated. - - If the maximum memory block size is exceedingly large, remote procedure calls (RPCs) may time out. In this case, you can adjust the value of this parameter based on your business needs. - -- `streaming_load_rpc_max_alive_time_sec` - - The waiting timeout period for each Writer process. The default value is 1200 seconds. During the data loading process, StarRocks starts a Writer process to receive data from and write data to each tablet. If a Writer process does not receive any data within the waiting timeout period that you specify, StarRocks stops the Writer process. When your StarRocks cluster processes data at low speeds, a Writer process may not receive the next batch of data within a long period of time and therefore reports a "TabletWriter add batch with unknown id" error. In this case, you can increase the value of this parameter. - -- `load_process_max_memory_limit_bytes` and `load_process_max_memory_limit_percent` - - These parameters specify the maximum amount of memory that can be consumed for all load jobs on each individual BE or CN. StarRocks identifies the smaller memory consumption among the values of the two parameters as the final memory consumption that is allowed. 
- - - `load_process_max_memory_limit_bytes`: specifies the maximum memory size. The default maximum memory size is 100 GB. - - `load_process_max_memory_limit_percent`: specifies the maximum memory usage. The default value is 30%. This parameter differs from the `mem_limit` parameter. The `mem_limit` parameter specifies the total maximum memory usage of your StarRocks cluster, and the default value is 90% x 90%. - - If the memory capacity of the machine on which the BE or CN resides is M, the maximum amount of memory that can be consumed for load jobs is calculated as follows: `M x 90% x 90% x 30%`. - -### System variable configurations - -You can configure the following [system variable](../../sql-reference/System_variable.md): - -- `insert_timeout` - - The INSERT timeout duration. Unit: seconds. Value range: `1` to `259200`. Default value: `14400`. This variable will act on all operations involving INSERT jobs (for example, UPDATE, DELETE, CTAS, materialized view refresh, statistics collection, and PIPE) in the current connection. diff --git a/docs/en/loading/loading_introduction/loading_overview.mdx b/docs/en/loading/loading_introduction/loading_overview.mdx deleted file mode 100644 index e6ea999..0000000 --- a/docs/en/loading/loading_introduction/loading_overview.mdx +++ /dev/null @@ -1,9 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Loading overview - -import DocCardList from '@theme/DocCardList'; - - \ No newline at end of file diff --git a/docs/en/loading/loading_introduction/troubleshooting_loading.md b/docs/en/loading/loading_introduction/troubleshooting_loading.md deleted file mode 100644 index 69279b8..0000000 --- a/docs/en/loading/loading_introduction/troubleshooting_loading.md +++ /dev/null @@ -1,601 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_label: "Troubleshooting" ---- - -# Troubleshooting Data Loading - -This guide is designed to help DBAs and operation engineers monitor the status of data load jobs through SQL interfaces—without relying on external monitoring systems. It also provides guidance on identifying performance bottlenecks and troubleshooting anomalies during load operations. - -## Terminology - -**Load Job:** A continuous data load process, such as a **Routine Load Job** or **Pipe Job**. - -**Load Task:** A one-time data load process, usually corresponding to a single load transaction. Examples include **Broker Load**, **Stream Load**, **Spark Load**, and **INSERT INTO**. Routine Load jobs and Pipe jobs continuously generate tasks to perform data ingestion. - -## Observe load jobs - -There are two ways to observe load jobs: - -- Using SQL statements **[SHOW ROUTINE LOAD](../../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md)** and **[SHOW PIPES](../../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md)**. -- Using system views **[information_schema.routine_load_jobs](../../sql-reference/information_schema/routine_load_jobs.md)** and **[information_schema.pipes](../../sql-reference/information_schema/pipes.md)**. - -## Observe load tasks - -Load tasks can also be monitored in two ways: - -- Using SQL statements **[SHOW LOAD](../../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md)** and **[SHOW ROUTINE LOAD TASK](../../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md)**. -- Using system views **[information_schema.loads](../../sql-reference/information_schema/loads.md)** and **statistics.loads_history**. 
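-
-For example, you might check the most recent load tasks directly from the system view, or list the tasks generated by a specific Routine Load job (the job name below is only a placeholder):
-
-```SQL
--- The ten most recently created load tasks in the cluster
-SELECT LABEL, STATE, TYPE, SINK_ROWS, ERROR_MSG
-FROM information_schema.loads
-ORDER BY CREATE_TIME DESC
-LIMIT 10;
-
--- Tasks generated by a specific Routine Load job
-SHOW ROUTINE LOAD TASK WHERE JobName = "example_routine_load_job";
-```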
- -### SQL statements - -The **SHOW** statements display both ongoing and recently completed load tasks for the current database, providing a quick overview of task status. The information retrieved is a subset of the **statistics.loads_history** system view. - -SHOW LOAD statements return information of Broker Load, Insert Into, and Spark Load tasks, and SHOW ROUTINE LOAD TASK statements return Routine Load task information. - -### System views - -#### information_schema.loads - -The **information_schema.loads** system view stores information about recent load tasks, including active and recently completed ones. StarRocks periodically synchronizes the data to the **statistics.loads_history** system table for persistent storage. - -**information_schema.loads** provides the following fields: - -| Field | Description | -| -------------------- | ------------------------------------------------------------ | -| ID | Globally unique identifier. | -| LABEL | Label of the load job. | -| PROFILE_ID | The ID of the Profile, which can be analyzed via `ANALYZE PROFILE`. | -| DB_NAME | The database to which the target table belongs. | -| TABLE_NAME | The target table. | -| USER | The user who initiates the load job. | -| WAREHOUSE | The warehouse to which the load job belongs. | -| STATE | The state of the load job. Valid values:
  • `PENDING`/`BEGIN`: The load job is created.
  • `QUEUEING`/`BEFORE_LOAD`: The load job is in the queue waiting to be scheduled.
  • `LOADING`: The load job is running.
  • `PREPARING`: The transaction is being pre-committed.
  • `PREPARED`: The transaction has been pre-committed.
  • `COMMITTED`: The transaction has been committed.
  • `FINISHED`: The load job succeeded.
  • `CANCELLED`: The load job failed.
| -| PROGRESS | The progress of the ETL stage and LOADING stage of the load job. | -| TYPE | The type of the load job. For Broker Load, the return value is `BROKER`. For INSERT, the return value is `INSERT`. For Stream Load, the return value is `STREAM`. For Routine Load Load, the return value is `ROUTINE`. | -| PRIORITY | The priority of the load job. Valid values: `HIGHEST`, `HIGH`, `NORMAL`, `LOW`, and `LOWEST`. | -| SCAN_ROWS | The number of data rows that are scanned. | -| SCAN_BYTES | The number of bytes that are scanned. | -| FILTERED_ROWS | The number of data rows that are filtered out due to inadequate data quality. | -| UNSELECTED_ROWS | The number of data rows that are filtered out due to the conditions specified in the WHERE clause. | -| SINK_ROWS | The number of data rows that are loaded. | -| RUNTIME_DETAILS | Load runtime metadata. For details, see [RUNTIME_DETAILS](#runtime_details). | -| CREATE_TIME | The time at which the load job was created. Format: `yyyy-MM-dd HH:mm:ss`. Example: `2023-07-24 14:58:58`. | -| LOAD_START_TIME | The start time of the LOADING stage of the load job. Format: `yyyy-MM-dd HH:mm:ss`. Example: `2023-07-24 14:58:58`. | -| LOAD_COMMIT_TIME | The time at which the loading transaction was committed. Format: `yyyy-MM-dd HH:mm:ss`. Example: `2023-07-24 14:58:58`. | -| LOAD_FINISH_TIME | The end time of the LOADING stage of the load job. Format: `yyyy-MM-dd HH:mm:ss`. Example: `2023-07-24 14:58:58`. | -| PROPERTIES | The static properties of the load job. For details, see [PROPERTIES](#properties). | -| ERROR_MSG | The error message of the load job. If the load job did not encounter any error, `NULL` is returned. | -| TRACKING_SQL | The SQL statement that can be used to query the tracking log of the load job. A SQL statement is returned only when the load job involves unqualified data rows. If the load job does not involve any unqualified data rows, `NULL` is returned. | -| REJECTED_RECORD_PATH | The path from which you can access all the unqualified data rows that are filtered out in the load job. The number of unqualified data rows logged is determined by the `log_rejected_record_num` parameter configured in the load job. You can use the `wget` command to access the path. If the load job does not involve any unqualified data rows, `NULL` is returned. | - -##### RUNTIME_DETAILS - -- Universal metrics: - -| Metric | Description | -| -------------------- | ------------------------------------------------------------ | -| load_id | Globally unique ID of the load execution plan. | -| txn_id | Load transaction ID. | - -- Specific metrics for Broker Load, INSERT INTO, and Spark Load: - -| Metric | Description | -| -------------------- | ------------------------------------------------------------ | -| etl_info | ETL Details. This field is only valid for Spark Load jobs. For other types of load jobs, the value will be empty. | -| etl_start_time | The start time of the ETL stage of the load job. Format: `yyyy-MM-dd HH:mm:ss`. Example: `2023-07-24 14:58:58`. | -| etl_start_time | The end time of the ETL stage of the load job. Format: `yyyy-MM-dd HH:mm:ss`. Example: `2023-07-24 14:58:58`. | -| unfinished_backends | List of BEs with unfinished executions. | -| backends | List of BEs participating in execution. | -| file_num | Number of files read. | -| file_size | Total size of files read. | -| task_num | Number of subtasks. 
| - -- Specific metrics for Routine Load: - -| Metric | Description | -| -------------------- | ------------------------------------------------------------ | -| schedule_interval | The interval for Routine Load to be scheduled. | -| wait_slot_time | Time elapsed while the Routine Load task waits for execution slots. | -| check_offset_time | Time consumed when checking offset information during Routine Load task scheduling. | -| consume_time | Time consumed by the Routine Load task to read upstream data. | -| plan_time | Time for generating the execution plan. | -| commit_publish_time | Time consumed to execute the COMMIT RPC. | - -- Specific metrics for Stream Load: - -| Metric | Description | -| ---------------------- | ---------------------------------------------------------- | -| timeout | Timeout for load tasks. | -| begin_txn_ms | Time consumed to begin the transaction. | -| plan_time_ms | Time for generating the execution plan. | -| receive_data_time_ms | Time for receiving data. | -| commit_publish_time_ms | Time consumed to execute the COMMIT RPC. | -| client_ip | Client IP address. | - -##### PROPERTIES - -- Specific properties for Broker Load, INSERT INTO, and Spark Load: - -| Property | Description | -| ---------------------- | ---------------------------------------------------------- | -| timeout | Timeout for load tasks. | -| max_filter_ratio | Maximum ratio of data rows that are filtered out due to inadequate data quality. | - -- Specific properties for Routine Load: - -| Property | Description | -| ---------------------- | ---------------------------------------------------------- | -| job_name | Routine Load job name. | -| task_num | Number of subtasks actually executed in parallel. | -| timeout | Timeout for load tasks. | - -#### statistics.loads_history - -The **statistics.loads_history** system view stores load records for the last three months by default. DBAs can adjust the retention period by modifying the `partition_ttl` of the view. **statistics.loads_history** has the consistent schema with **information_schema.loads**. - -## Identify loading performance issues with Load Profiles - -A **Load Profile** records execution details of all worker nodes involved in a data load. It helps you quickly pinpoint performance bottlenecks in the StarRocks cluster. - -### Enable Load Profiles - -StarRocks provides multiple methods to enable Load Profiles, depending on the type of load: - -#### For Broker Load and INSERT INTO - -Enable Load Profiles for Broker Load and INSERT INTO at session level: - -```sql -SET enable_profile = true; -``` - -By default, profiles are automatically enabled for long-running jobs (longer than 300 seconds). You can customize this threshold by: - -```sql -SET big_query_profile_threshold = 60s; -``` - -:::note -When `big_query_profile_threshold` is set to its default value `0`, the default behavior is to disable Query Profiling for queries. However, for load tasks, profiles are automatically recorded for tasks with execution times exceeding 300 seconds. -::: - -StarRocks also supports **Runtime Profiles**, which periodically (every 30 seconds) report execution metrics of long-running load jobs. You can customize the report interval by: - -```sql -SET runtime_profile_report_interval = 60; -``` - -:::note -`runtime_profile_report_interval` specifies only the minimum report interval for load tasks. The actual report interval is dynamically adjusted and may exceed this value. 
-::: - -#### For Stream Load and Routine Load - -Enable Load Profiles for Stream Load and Routine Load at table level: - -```sql -ALTER TABLE SET ("enable_load_profile" = "true"); -``` - -Stream Load typically has high QPS, so StarRocks allows sampling for Load Profile collection to avoid performance degradation from extensive profiling. You can adjust the collection interval by configuring the FE parameter `load_profile_collect_interval_second`. This setting only applies to Load Profiles enabled via table properties. The default value is `0`. - -```SQL -ADMIN SET FRONTEND CONFIG ("load_profile_collect_interval_second"="30"); -``` - -StarRocks also allows collecting profiles only from load jobs that exceed a certain time threshold. You can adjust this threshold by configuring the FE parameter `stream_load_profile_collect_threshold_second`. The default value is `0`. - -```SQL -ADMIN SET FRONTEND CONFIG ("stream_load_profile_collect_threshold_second"="10"); -``` - -### Analyze Load Profiles - -The structure of Load Profiles is identical to that of Query Profiles. For detailed instructions, see [Query Tuning Recipes](../../best_practices/query_tuning/query_profile_tuning_recipes.md). - -You can analyze Load Profiles by executing [ANALYZE PROFILE](../../sql-reference/sql-statements/cluster-management/plan_profile/ANALYZE_PROFILE.md). For detailed instructions, see [Analyze text-based Profiles](../../best_practices/query_tuning/query_profile_text_based_analysis.md). - -Profiles provide detailed operator metrics. Key components include the `OlapTableSink` operator and the `LoadChannel` operator. - -#### OlapTableSink operator - -| Metric | Description | -| ----------------- | ------------------------------------------------------------ | -| IndexNum | Number of synchronous materialized views of the target table. | -| ReplicatedStorage | Whether single leader replication is enabled. | -| TxnID | Load transaction ID. | -| RowsRead | Number of data rows read from the upstream operator. | -| RowsFiltered | The number of data rows that are filtered out due to inadequate data quality. | -| RowsReturned | The number of data rows that are loaded. | -| RpcClientSideTime | Total time consumed for data writing RPC from client-side statistics. | -| RpcServerSideTime | Total time consumed for data writing RPC from server-side statistics. | -| PrepareDataTime | Time consumed for data format conversion and data quality checks. | -| SendDataTime | Local time consumed for sending data, including data serialization, compression, and writing to the send queue. | - -:::tip -- The significant variance between the maximum and minimum values of `PushChunkNum` in `OLAP_TABLE_SINK` indicates data skew in the upstream operator, which may cause write performance bottlenecks. -- `RpcClientSideTime` equals the sum of `RpcServerSideTime`, Network transmission time, and RPC framework processing time. If the difference between `RpcClientSideTime` and `RpcServerSideTime` is significant, consider to enable data compression to reduce transmission time. -- If `RpcServerSideTime` accounts for a significant portion of the time spent, further analysis can be conducted using `LoadChannel` Profile. -::: - -#### LoadChannel operator - -| Metric | Description | -| ------------------- | ------------------------------------------------------------ | -| Address | IP address or FQDN of the BE node. | -| LoadMemoryLimit | Memory Limit for loading. | -| PeakMemoryUsage | Peak memory usage for loading. 
| -| OpenCount | The number of times the channel is opened, reflecting the sink's total concurrency. | -| OpenTime | Total time consumed for the opening channel. | -| AddChunkCount | Number of loading chunks, that is, the number of calls to `TabletsChannel::add_chunk`. | -| AddRowNum | The number of data rows that are loaded. | -| AddChunkTime | Total time consumed by loading chunks, that is, the total execution time of `TabletsChannel::add_chunk`. | -| WaitFlushTime | Total time spent by `TabletsChannel::add_chunk` waiting for MemTable flush. | -| WaitWriterTime | Total time spent by `TabletsChannel::add_chunk` waiting for Async Delta Writer execution. | -| WaitReplicaTime | Total time spent by `TabletsChannel::add_chunk` waiting for synchronization from replicas. | -| PrimaryTabletsNum | Number of primary tablets. | -| SecondaryTabletsNum | Number of secondary tablets. | - -:::tip -If `WaitFlushTime` takes an extended period, it may indicate insufficient resources for the flush thread. Consider adjusting the BE configuration `flush_thread_num_per_store`. -::: - -## Best practices - -### Diagnose Broker Load performance bottleneck - -1. Load data using Broker Load: - - ```SQL - LOAD LABEL click_bench.hits_1713874468 - ( - DATA INFILE ("s3://test-data/benchmark_data/query_data/click_bench/hits.tbl*") - INTO TABLE hits COLUMNS TERMINATED BY "\t" (WatchID,JavaEnable,Title,GoodEvent,EventTime,EventDate,CounterID,ClientIP,RegionID,UserID,CounterClass,OS,UserAgent,URL,Referer,IsRefresh,RefererCategoryID,RefererRegionID,URLCategoryID,URLRegionID,ResolutionWidth,ResolutionHeight,ResolutionDepth,FlashMajor,FlashMinor,FlashMinor2,NetMajor,NetMinor,UserAgentMajor,UserAgentMinor,CookieEnable,JavascriptEnable,IsMobile,MobilePhone,MobilePhoneModel,Params,IPNetworkID,TraficSourceID,SearchEngineID,SearchPhrase,AdvEngineID,IsArtifical,WindowClientWidth,WindowClientHeight,ClientTimeZone,ClientEventTime,SilverlightVersion1,SilverlightVersion2,SilverlightVersion3,SilverlightVersion4,PageCharset,CodeVersion,IsLink,IsDownload,IsNotBounce,FUniqID,OriginalURL,HID,IsOldCounter,IsEvent,IsParameter,DontCountHits,WithHash,HitColor,LocalEventTime,Age,Sex,Income,Interests,Robotness,RemoteIP,WindowName,OpenerName,HistoryLength,BrowserLanguage,BrowserCountry,SocialNetwork,SocialAction,HTTPError,SendTiming,DNSTiming,ConnectTiming,ResponseStartTiming,ResponseEndTiming,FetchTiming,SocialSourceNetworkID,SocialSourcePage,ParamPrice,ParamOrderID,ParamCurrency,ParamCurrencyID,OpenstatServiceName,OpenstatCampaignID,OpenstatAdID,OpenstatSourceID,UTMSource,UTMMedium,UTMCampaign,UTMContent,UTMTerm,FromTag,HasGCLID,RefererHash,URLHash,CLID) - ) - WITH BROKER - ( - "aws.s3.access_key" = "", - "aws.s3.secret_key" = "", - "aws.s3.region" = "" - ) - ``` - -2. Use **SHOW PROFILELIST** to retrieve the list of runtime profiles. 
- - ```SQL - MySQL [click_bench]> SHOW PROFILELIST; - +--------------------------------------+---------------------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------+ - | QueryId | StartTime | Time | State | Statement | - +--------------------------------------+---------------------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------+ - | 3df61627-f82b-4776-b16a-6810279a79a3 | 2024-04-23 20:28:26 | 11s850ms | Running | LOAD LABEL click_bench.hits_1713875306 (DATA INFILE ("s3://test-data/benchmark_data/query_data/click_bench/hits.tbl*" ... | - +--------------------------------------+---------------------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------+ - 1 row in set (0.00 sec) - ``` - -3. Use **ANALYZE PROFILE** to view the Runtime Profile. - - ```SQL - MySQL [click_bench]> ANALYZE PROFILE FROM '3df61627-f82b-4776-b16a-6810279a79a3'; - +-------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Explain String | - +-------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Summary | - | Attention: The transaction of the statement will be aborted, and no data will be actually inserted!!! | - | Attention: Profile is not identical!!! | - | QueryId: 3df61627-f82b-4776-b16a-6810279a79a3 | - | Version: default_profile-70fe819 | - | State: Running | - | Legend: ⏳ for blocked; 🚀 for running; ✅ for finished | - | TotalTime: 31s832ms | - | ExecutionTime: 30s1ms [Scan: 28s885ms (96.28%), Network: 0ns (0.00%), ResultDeliverTime: 7s613ms (25.38%), ScheduleTime: 145.701ms (0.49%)] | - | FrontendProfileMergeTime: 3.838ms | - | QueryPeakMemoryUsage: 141.367 MB, QueryAllocatedMemoryUsage: 82.422 GB | - | Top Most Time-consuming Nodes: | - | 1. FILE_SCAN (id=0) 🚀 : 28s902ms (85.43%) | - | 2. OLAP_TABLE_SINK 🚀 : 4s930ms (14.57%) | - | Top Most Memory-consuming Nodes: | - | Progress (finished operator/all operator): 0.00% | - | NonDefaultVariables: | - | big_query_profile_threshold: 0s -> 60s | - | enable_adaptive_sink_dop: false -> true | - | enable_profile: false -> true | - | sql_mode_v2: 32 -> 34 | - | use_compute_nodes: -1 -> 0 | - | Fragment 0 | - | │ BackendNum: 3 | - | │ InstancePeakMemoryUsage: 128.541 MB, InstanceAllocatedMemoryUsage: 82.422 GB | - | │ PrepareTime: 2.304ms | - | └──OLAP_TABLE_SINK | - | │ TotalTime: 4s930ms (14.57%) [CPUTime: 4s930ms] | - | │ OutputRows: 14.823M (14823424) | - | │ PartitionType: RANDOM | - | │ Table: hits | - | └──FILE_SCAN (id=0) 🚀 | - | Estimates: [row: ?, cpu: ?, memory: ?, network: ?, cost: ?] | - | TotalTime: 28s902ms (85.43%) [CPUTime: 17.038ms, ScanTime: 28s885ms] | - | OutputRows: 14.823M (14823424) | - | Progress (processed rows/total rows): ? 
| - | Detail Timers: [ScanTime = IOTaskExecTime + IOTaskWaitTime] | - | IOTaskExecTime: 25s612ms [min=19s376ms, max=28s804ms] | - | IOTaskWaitTime: 63.192ms [min=20.946ms, max=91.668ms] | - | | - +-------------------------------------------------------------------------------------------------------------------------------------------------------------+ - 40 rows in set (0.04 sec) - ``` - -The profile shows that the `FILE_SCAN` section took nearly 29 seconds, accounting for approximately 90% of the total 32-second duration. This indicates that reading data from object storage is currently the bottleneck in the loading process. - -### Diagnose Stream Load Performance - -1. Enable Load Profile for the target table. - - ```SQL - mysql> ALTER TABLE duplicate_200_column_sCH SET('enable_load_profile'='true'); - Query OK, 0 rows affected (0.00 sec) - ``` - -2. Use **SHOW PROFILELIST** to retrieve the list of profiles. - - ```SQL - mysql> SHOW PROFILELIST; - +--------------------------------------+---------------------+----------+----------+-----------+ - | QueryId | StartTime | Time | State | Statement | - +--------------------------------------+---------------------+----------+----------+-----------+ - | 90481df8-afaf-c0fd-8e91-a7889c1746b6 | 2024-09-19 10:43:38 | 9s571ms | Finished | | - | 9c41a13f-4d7b-2c18-4eaf-cdeea3facba5 | 2024-09-19 10:43:37 | 10s664ms | Finished | | - | 5641cf37-0af4-f116-46c6-ca7cce149886 | 2024-09-19 10:43:20 | 13s88ms | Finished | | - | 4446c8b3-4dc5-9faa-dccb-e1a71ab3519e | 2024-09-19 10:43:20 | 13s64ms | Finished | | - | 48469b66-3866-1cd9-9f3b-17d786bb4fa7 | 2024-09-19 10:43:20 | 13s85ms | Finished | | - | bc441907-e779-bc5a-be8e-992757e4d992 | 2024-09-19 10:43:19 | 845ms | Finished | | - +--------------------------------------+---------------------+----------+----------+-----------+ - ``` - -3. Use **ANALYZE PROFILE** to view the Profile. 
- - ```SQL - mysql> ANALYZE PROFILE FROM '90481df8-afaf-c0fd-8e91-a7889c1746b6'; - +-----------------------------------------------------------+ - | Explain String | - +-----------------------------------------------------------+ - | Load: | - | Summary: | - | - Query ID: 90481df8-afaf-c0fd-8e91-a7889c1746b6 | - | - Start Time: 2024-09-19 10:43:38 | - | - End Time: 2024-09-19 10:43:48 | - | - Query Type: Load | - | - Load Type: STREAM_LOAD | - | - Query State: Finished | - | - StarRocks Version: main-d49cb08 | - | - Sql Statement | - | - Default Db: ingestion_db | - | - NumLoadBytesTotal: 799008 | - | - NumRowsAbnormal: 0 | - | - NumRowsNormal: 280 | - | - Total: 9s571ms | - | - numRowsUnselected: 0 | - | Execution: | - | Fragment 0: | - | - Address: 172.26.93.218:59498 | - | - InstanceId: 90481df8-afaf-c0fd-8e91-a7889c1746b7 | - | - TxnID: 1367 | - | - ReplicatedStorage: true | - | - AutomaticPartition: false | - | - InstanceAllocatedMemoryUsage: 12.478 MB | - | - InstanceDeallocatedMemoryUsage: 10.745 MB | - | - InstancePeakMemoryUsage: 9.422 MB | - | - MemoryLimit: -1.000 B | - | - RowsProduced: 280 | - | - AllocAutoIncrementTime: 348ns | - | - AutomaticBucketSize: 0 | - | - BytesRead: 0.000 B | - | - CloseWaitTime: 9s504ms | - | - IOTaskExecTime: 0ns | - | - IOTaskWaitTime: 0ns | - | - IndexNum: 1 | - | - NumDiskAccess: 0 | - | - OpenTime: 15.639ms | - | - PeakMemoryUsage: 0.000 B | - | - PrepareDataTime: 583.480us | - | - ConvertChunkTime: 44.670us | - | - ValidateDataTime: 109.333us | - | - RowsFiltered: 0 | - | - RowsRead: 0 | - | - RowsReturned: 280 | - | - RowsReturnedRate: 12.049K (12049) /sec | - | - RpcClientSideTime: 28s396ms | - | - RpcServerSideTime: 28s385ms | - | - RpcServerWaitFlushTime: 0ns | - | - ScanTime: 9.841ms | - | - ScannerQueueCounter: 1 | - | - ScannerQueueTime: 3.272us | - | - ScannerThreadsInvoluntaryContextSwitches: 0 | - | - ScannerThreadsTotalWallClockTime: 0ns | - | - MaterializeTupleTime(*): 0ns | - | - ScannerThreadsSysTime: 0ns | - | - ScannerThreadsUserTime: 0ns | - | - ScannerThreadsVoluntaryContextSwitches: 0 | - | - SendDataTime: 2.452ms | - | - PackChunkTime: 1.475ms | - | - SendRpcTime: 1.617ms | - | - CompressTime: 0ns | - | - SerializeChunkTime: 880.424us | - | - WaitResponseTime: 0ns | - | - TotalRawReadTime(*): 0ns | - | - TotalReadThroughput: 0.000 B/sec | - | DataSource: | - | - DataSourceType: FileDataSource | - | - FileScanner: | - | - CastChunkTime: 0ns | - | - CreateChunkTime: 227.100us | - | - FileReadCount: 3 | - | - FileReadTime: 253.765us | - | - FillTime: 6.892ms | - | - MaterializeTime: 133.637us | - | - ReadTime: 0ns | - | - ScannerTotalTime: 9.292ms | - +-----------------------------------------------------------+ - 76 rows in set (0.00 sec) - ``` - -## Appendix - -### Useful SQL for Operations - -:::note -This section only applies to shared-nothing clusters. 
-::: - -#### query the throughput per minute - -```SQL --- overall -select date_trunc('minute', load_finish_time) as t,count(*) as tpm,sum(SCAN_BYTES) as scan_bytes,sum(sink_rows) as sink_rows from _statistics_.loads_history group by t order by t desc limit 10; - --- table -select date_trunc('minute', load_finish_time) as t,count(*) as tpm,sum(SCAN_BYTES) as scan_bytes,sum(sink_rows) as sink_rows from _statistics_.loads_history where table_name = 't' group by t order by t desc limit 10; -``` - -#### Query RowsetNum and SegmentNum of a table - -```SQL --- overall -select * from information_schema.be_tablets t, information_schema.tables_config c where t.table_id = c.table_id order by num_segment desc limit 5; -select * from information_schema.be_tablets t, information_schema.tables_config c where t.table_id = c.table_id order by num_rowset desc limit 5; - --- table -select * from information_schema.be_tablets t, information_schema.tables_config c where t.table_id = c.table_id and table_name = 't' order by num_segment desc limit 5; -select * from information_schema.be_tablets t, information_schema.tables_config c where t.table_id = c.table_id and table_name = 't' order by num_rowset desc limit 5; -``` - -- High RowsetNum (>100) indicates too frequent loads. You may consider to reduce frequency or increase Compaction threads. -- High SegmentNum (>100) indicates excessive segments per load. You may consider increase Compaction threads or adopt the random distribution strategy for the table. - -#### Check data skew - -##### Data skew across nodes - -```SQL --- overall -SELECT tbt.be_id, sum(tbt.DATA_SIZE) FROM information_schema.tables_config tb JOIN information_schema.be_tablets tbt ON tb.TABLE_ID = tbt.TABLE_ID group by be_id; - --- table -SELECT tbt.be_id, sum(tbt.DATA_SIZE) FROM information_schema.tables_config tb JOIN information_schema.be_tablets tbt ON tb.TABLE_ID = tbt.TABLE_ID WHERE tb.table_name = 't' group by be_id; -``` - -If you detected node-level skew, you may consider to use a higher-cardinality column as the distribution key or adopt the random distribution strategy for the table. - -##### Data skew across tablets - -```SQL -select tablet_id,t.data_size,num_row,visible_version,num_version,num_rowset,num_segment,PARTITION_NAME from information_schema.partitions_meta m, information_schema.be_tablets t where t.partition_id = m.partition_id and m.partition_name = 'att' and m.table_name='att' order by t.data_size desc; -``` - -### Common monitoring metrics for loading - -#### BE Load - -These metrics are available under the **BE Load** category in Grafana. If you cannot find this category, verify that you are using the [latest Grafana dashboard template](../../administration/management/monitoring/Monitor_and_Alert.md#125-configure-dashboard). - -##### ThreadPool - -These metrics help analyze the status of thread pools — for example, whether tasks are being backlogged, or how long they spend pending. Currently, there are four monitored thread pools: - -- `async_delta_writer` -- `memtable_flush` -- `segment_replicate_sync` -- `segment_flush` - -Each thread pool includes the following metrics: - -| Name | Description | -| ----------- | --------------------------------------------------------------------------------------------------------- | -| **rate** | Task processing rate. | -| **pending** | Time tasks spend waiting in the queue. | -| **execute** | Task execution time. | -| **total** | Maximum number of threads available in the pool. 
| -| **util** | Pool utilization over a given period; due to sampling inaccuracy, it may exceed 100% when heavily loaded. | -| **count** | Instantaneous number of tasks in the queue. | - -:::note -- A reliable indicator for backlog is whether **pending duration** keeps increasing. **workers util** and **queue count** are necessary but not sufficient indicators. -- If a backlog occurs, use **rate** and **execute duration** to determine whether it is due to increased load or slower processing. -- **workers util** helps assess how busy the pool is, which can guide tuning efforts. -::: - -##### LoadChannel::add_chunks - -These metrics help analyze the behavior of `LoadChannel::add_chunks` after receiving a `BRPC tablet_writer_add_chunks` request. - -| Name | Description | -| ----------------- | --------------------------------------------------------------------------------------- | -| **rate** | Processing rate of `add_chunks` requests. | -| **execute** | Average execution time of `add_chunks`. | -| **wait_memtable** | Average wait time for the primary replica’s MemTable flush. | -| **wait_writer** | Average wait time for the primary replica’s async delta writer to perform write/commit. | -| **wait_replica** | Average wait time for secondary replicas to complete segment flush. | - -:::note -- The **latency** metric equals the sum of `wait_memtable`, `wait_writer`, and `wait_replica`. -- A high waiting ratio indicates downstream bottlenecks, which should be further analyzed. -::: - -##### Async Delta Writer - -These metrics help analyze the behavior of the **async delta writer**. - -| Name | Description | -| ----------------- | ------------------------------------------------- | -| **rate** | Processing rate of write/commit tasks. | -| **pending** | Time spent waiting in the thread pool queue. | -| **execute** | Average time to process a single task. | -| **wait_memtable** | Average time waiting for MemTable flush. | -| **wait_replica** | Average time waiting for segment synchronization. | - -:::note -- The total time per task (from the upstream perspective) equals **pending** plus **execute**. -- **execute** further includes **wait_memtable** plus **wait_replica**. -- A high **pending** time may indicate that **execute** is slow or the thread pool is undersized. -- If **wait** occupies a large portion of **execute**, downstream stages are the bottleneck; otherwise, the bottleneck is likely within the writer’s logic itself. -::: - -##### MemTable Flush - -These metrics analyze **MemTable flush** performance. - -| Name | Description | -| --------------- | -------------------------------------------- | -| **rate** | Flush rate of MemTables. | -| **memory-size** | Amount of in-memory data flushed per second. | -| **disk-size** | Amount of disk data written per second. | -| **execute** | Task execution time. | -| **io** | I/O time of the flush task. | - -:::note -- By comparing **rate** and **size**, you can determine whether the workload is changing or if massive imports are occurring — for example, a small **rate** but large **size** indicates a massive import. -- The compression ratio can be estimated using `memory-size / disk-size`. -- You can also assess if I/O is the bottleneck by checking the proportion of **io** time in **execute**. -::: - -##### Segment Replicate Sync - -| Name | Description | -| ----------- | -------------------------------------------- | -| **rate** | Rate of segment synchronization. | -| **execute** | Time to synchronize a single tablet replica. 
| - -##### Segment Flush - -These metrics analyze **segment flush** performance. - -| Name | Description | -| ----------- | --------------------------------------- | -| **rate** | Segment flush rate. | -| **size** | Amount of disk data flushed per second. | -| **execute** | Task execution time. | -| **io** | I/O time of the flush task. | - -:::note -- By comparing **rate** and **size**, you can determine whether the workload is changing or if large imports are occurring — for example, a small **rate** but large **size** indicates a massive import. -- You can also assess if I/O is the bottleneck by checking the proportion of **io** time in **execute**. -::: diff --git a/docs/en/loading/loading_tools.md b/docs/en/loading/loading_tools.md deleted file mode 100644 index 07ddf94..0000000 --- a/docs/en/loading/loading_tools.md +++ /dev/null @@ -1,32 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Load data using tools - -StarRocks and its ecosystem partners offer the following tools to help you seamlessly integrate StarRocks with external databases. - -## [SMT](../integrations/loading_tools/SMT.md) - -SMT (StarRocks Migration Tool) is a data migration tool provided by StarRocks, designed to optimize complex data loading pipelines: source databases (such as MySQL, Oracle, PostgreSQL) ---> Flink ---> destination StarRocks clusters. Its main functions are as follows: - -- Simplifies table creation in StarRocks: Generates statements to create tables in StarRocks based on information from external databases and the target StarRocks cluster. -- Simplifies the full or incremental data synchronization process in the data pipeline: Generates SQL statements that can be run in Flink's SQL client to submit Flink jobs for synchronizing data. - -The following flowchart illustrates the process of loading data from the source database MySQL through Flink into StarRocks. - -![img](../_assets/load_tools.png) - -## [DataX](../integrations/loading_tools/DataX-starrocks-writer.md) - -DataX is a tool for offline data synchronization, and is open-sourced by Alibaba. DataX can synchronize data between various heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, and Hive. DataX provides the StarRocks Writer plugin to synchronize data from data sources supported by DataX to StarRocks. - -## [CloudCanal](../integrations/loading_tools/CloudCanal.md) - -CloudCanal Community Edition is a free data migration and synchronization platform published by [ClouGence Co., Ltd](https://www.cloudcanalx.com/) that integrates Schema Migration, Full Data Migration, verification, Correction, and real-time Incremental Synchronization. You can directly add StarRocks as a data source in CloudCanal's visual interface and create tasks to automatically migrate or synchronize data from source databases (e.g., MySQL, Oracle, PostgreSQL) to StarRocks. - -## [Kettle connector](https://github.com/StarRocks/starrocks-connector-for-kettle) - -Kettle is an ETL (Extract, Transform, Load) tool with a visual graphical interface, which allows users to build data processing workflows by dragging components and configuring parameters. This intuitive method greatly simplifies the process of data processing and loading, enabling users to handle data more conveniently. Additionally, Kettle provides a rich library of components, allowing users to select suitable components according to their needs and perform various complex data processing tasks. - -StarRocks offers the Kettle Connector to integrate with Kettle. 
By combining Kettle's robust data processing and transformation capabilities with StarRocks's high-performance data storage and analytical abilities, more flexible and efficient data processing workflows can be achieved. diff --git a/docs/en/loading/minio.md b/docs/en/loading/minio.md deleted file mode 100644 index 5be4b04..0000000 --- a/docs/en/loading/minio.md +++ /dev/null @@ -1,720 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 ---- - -# Load data from MinIO - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks provides the following options for loading data from MinIO: - -- Synchronous loading using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)+[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) -- Asynchronous loading using [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) - -Each of these options has its own advantages, which are detailed in the following sections. - -In most cases, we recommend that you use the INSERT+`FILES()` method, which is much easier to use. - -However, the INSERT+`FILES()` method currently supports only the Parquet, ORC, and CSV file formats. Therefore, if you need to load data of other file formats such as JSON, or [perform data changes such as DELETE during data loading](../loading/Load_to_Primary_Key_tables.md), you can resort to Broker Load. - -## Before you begin - -### Make source data ready - -Make sure the source data you want to load into StarRocks is properly stored in a MinIO bucket. You may also consider where the data and the database are located, because data transfer costs are much lower when your bucket and your StarRocks cluster are located in the same region. - -In this topic, we provide you with a sample dataset. You can download this with `curl`: - -```bash -curl -O https://starrocks-examples.s3.amazonaws.com/user_behavior_ten_million_rows.parquet -``` - -Load the Parquet file into your MinIO system and note the bucket name. The examples in this guide -use a bucket name of `/starrocks`. - -### Check privileges - - - -### Gather connection details - -In a nutshell, to use MinIO Access Key authentication you need to gather the following information: - -- The bucket that stores your data -- The object key (object name) if accessing a specific object in the bucket -- The MinIO endpoint -- The access key and secret key used as access credentials. - -![MinIO access key](../_assets/quick-start/MinIO-create.png) - -## Use INSERT+FILES() - -This method is available from v3.1 onwards and currently supports only the Parquet, ORC, and CSV (from v3.3.0 onwards) file formats. - -### Advantages of INSERT+FILES() - -[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) can read the file stored in cloud storage based on the path-related properties you specify, infer the table schema of the data in the file, and then return the data from the file as data rows. - -With `FILES()`, you can: - -- Query the data directly from MinIO using [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md). -- Create and load a table using [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS). -- Load the data into an existing table using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). 
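-
-If your StarRocks version supports it, you can also ask `FILES()` to report the schema it infers before creating any table. The following is a minimal sketch that reuses the MinIO endpoint and the credential placeholders from the examples below:
-
-```sql
--- Preview the inferred column names and types; no data is loaded
-DESC FILES
-(
-    "aws.s3.endpoint" = "http://minio:9000",
-    "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet",
-    "aws.s3.enable_ssl" = "false",
-    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
-    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
-    "format" = "parquet",
-    "aws.s3.use_aws_sdk_default_behavior" = "false",
-    "aws.s3.use_instance_profile" = "false",
-    "aws.s3.enable_path_style_access" = "true"
-);
-```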
- -### Typical examples - -#### Querying directly from MinIO using SELECT - -Querying directly from MinIO using SELECT+`FILES()` can give a good preview of the content of a dataset before you create a table. For example: - -- Get a preview of the dataset without storing the data. -- Query for the min and max values and decide what data types to use. -- Check for `NULL` values. - -The following example queries the sample dataset previously added to your MinIO system. - -:::tip - -The highlighted section of the command includes the settings that you may need to change: - -- Set the `endpoint` and `path` to match your MinIO system. -- If your MinIO system uses SSL set `enable_ssl` to `true`. -- Substitute your MinIO access key and secret for `AAA` and `BBB`. - -::: - -```sql -SELECT * FROM FILES -( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "format" = "parquet", - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" -) -LIMIT 3; -``` - -The system returns the following query result: - -```plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -3 rows in set (0.41 sec) -``` - -:::info - -Notice that the column names returned above are provided by the Parquet file. - -::: - -#### Creating and loading a table using CTAS - -This is a continuation of the previous example. The previous query is wrapped in CREATE TABLE AS SELECT (CTAS) to automate the table creation using schema inference. This means StarRocks will infer the table schema, create the table you want, and then load the data into the table. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files as the Parquet format includes the column names. - -:::note - -The syntax of CREATE TABLE when using schema inference does not allow setting the number of replicas, so set it before creating the table. The example below is for a system with a single replica: - -```SQL -ADMIN SET FRONTEND CONFIG ('default_replication_num' = '1'); -``` - -::: - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Use CTAS to create a table and load the data of the sample dataset previously added to your MinIO system. - -:::tip - -The highlighted section of the command includes the settings that you may need to change: - -- Set the `endpoint` and `path` to match your MinIO system. -- If your MinIO system uses SSL set `enable_ssl` to `true`. -- Substitute your MinIO access key and secret key for `AAA` and `BBB`. 
- -::: - -```sql -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "format" = "parquet", - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" -); -``` - -```plaintext -Query OK, 10000000 rows affected (3.17 sec) -{'label':'insert_a5da3ff5-9ee4-11ee-90b0-02420a060004', 'status':'VISIBLE', 'txnId':'17'} -``` - -After creating the table, you can view its schema by using [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md): - -```SQL -DESCRIBE user_behavior_inferred; -``` - -The system returns the following query result: - -```Plaintext -+--------------+------------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+------------------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varchar(1048576) | YES | false | NULL | | -| Timestamp | varchar(1048576) | YES | false | NULL | | -+--------------+------------------+------+-------+---------+-------+ -``` - -Query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 58 | 158350 | 2355072 | pv | 2017-11-27 13:06:51 | -| 58 | 158590 | 3194735 | pv | 2017-11-27 02:21:04 | -| 58 | 215073 | 3002561 | pv | 2017-11-30 10:55:42 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### Loading into an existing table using INSERT - -You may want to customize the table that you are inserting into, for example, the: - -- column data type, nullable setting, or default values -- key types and columns -- data partitioning and bucketing - -:::tip - -Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This topic does not cover table design. For information about table design, see [Table types](../table_design/StarRocks_table_design.md). - -::: - -In this example, we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in MinIO. - -- Since a query of the dataset in MinIO indicates that the `Timestamp` column contains data that matches a `datetime` data type, the column type is specified in the following DDL. -- By querying the data in MinIO, you can find that there are no `NULL` values in the dataset, so the DDL does not set any columns as nullable. -- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID`. Your use case might be different for this data, so you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key. 
- -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from MinIO): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11) NOT NULL, - ItemID int(11) NOT NULL, - CategoryID int(11) NOT NULL, - BehaviorType varchar(65533) NOT NULL, - Timestamp datetime NOT NULL -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID) -PROPERTIES -( - 'replication_num' = '1' -); -``` - -Display the schema so that you can compare it with the inferred schema produced by the `FILES()` table function: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | datetime | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - -Compare the schema you just created with the schema inferred earlier using the `FILES()` table function. Look at: - -- data types -- nullable -- key fields - -To better control the schema of the destination table and for better query performance, we recommend that you specify the table schema by hand in production environments. Having a `datetime` data type for the timestamp field is more efficient than using a `varchar`. - -::: - -After creating the table, you can load it with INSERT INTO SELECT FROM FILES(): - -:::tip - -The highlighted section of the command includes the settings that you may need to change: - -- Set the `endpoint` and `path` to match your MinIO system. -- If your MinIO system uses SSL set `enable_ssl` to `true`. -- Substitute your MinIO access key and secret key for `AAA` and `BBB`. - -::: - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "format" = "parquet", - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" -); -``` - -After the load is complete, you can query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 58 | 4309692 | 1165503 | pv | 2017-11-25 14:06:52 | -| 58 | 181489 | 1165503 | pv | 2017-11-25 14:07:22 | -| 58 | 3722956 | 1165503 | pv | 2017-11-25 14:09:28 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### Check load progress - -You can query the progress of INSERT jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. 
This feature is supported from v3.1 onwards. Example: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_e3b882f5-7eb3-11ee-ae77-00163e267b60' \G -*************************** 1. row *************************** - JOB_ID: 10243 - LABEL: insert_e3b882f5-7eb3-11ee-ae77-00163e267b60 - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-11-09 11:56:01 - ETL_START_TIME: 2023-11-09 11:56:01 - ETL_FINISH_TIME: 2023-11-09 11:56:01 - LOAD_START_TIME: 2023-11-09 11:56:01 - LOAD_FINISH_TIME: 2023-11-09 11:56:44 - JOB_DETAILS: {"All backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[10142]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -:::tip - -INSERT is a synchronous command. If an INSERT job is still running, you need to open another session to check its execution status. - -::: - -### Compare the table sizes on disk - -This query compares the table with the inferred schema and the one where the schema -is declared. Because the inferred schema has nullable columns and a varchar for the -timestamp the data length is larger: - -```sql -SELECT TABLE_NAME, - TABLE_ROWS, - AVG_ROW_LENGTH, - DATA_LENGTH -FROM information_schema.tables -WHERE TABLE_NAME like 'user_behavior%'\G -``` - -```plaintext -*************************** 1. row *************************** - TABLE_NAME: user_behavior_declared - TABLE_ROWS: 10000000 -AVG_ROW_LENGTH: 10 - DATA_LENGTH: 102562516 -*************************** 2. row *************************** - TABLE_NAME: user_behavior_inferred - TABLE_ROWS: 10000000 -AVG_ROW_LENGTH: 17 - DATA_LENGTH: 176803880 -2 rows in set (0.04 sec) -``` - -## Use Broker Load - -An asynchronous Broker Load process handles making the connection to MinIO, pulling the data, and storing the data in StarRocks. - -This method supports the following file formats: - -- Parquet -- ORC -- CSV -- JSON (supported from v3.2.3 onwards) - -### Advantages of Broker Load - -- Broker Load runs in the background and clients do not need to stay connected for the job to continue. -- Broker Load is preferred for long-running jobs, with the default timeout spanning 4 hours. -- In addition to Parquet and ORC file format, Broker Load supports CSV file format and JSON file format (JSON file format is supported from v3.2.3 onwards). - -### Data flow - -![Workflow of Broker Load](../_assets/broker_load_how-to-work_en.png) - -1. The user creates a load job. -2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BEs) or compute nodes (CNs). -3. The BEs or CNs pull the data from the source and load the data into StarRocks. - -### Typical example - -Create a table, start a load process that pulls the sample dataset previously loaded to your MinIO system. 
- -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table has the same schema as the Parquet file that you want to load from MinIO): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11) NOT NULL, - ItemID int(11) NOT NULL, - CategoryID int(11) NOT NULL, - BehaviorType varchar(65533) NOT NULL, - Timestamp datetime NOT NULL -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID) -PROPERTIES -( - 'replication_num' = '1' -); -``` - -#### Start a Broker Load - -Run the following command to start a Broker Load job that loads data from the sample dataset `user_behavior_ten_million_rows.parquet` to the `user_behavior` table: - -:::tip - -The highlighted section of the command includes the settings that you may need to change: - -- Set the `endpoint` and `DATA INFILE` to match your MinIO system. -- If your MinIO system uses SSL set `enable_ssl` to `true`. -- Substitute your MinIO access key and secret for `AAA` and `BBB`. - -::: - -```sql -LOAD LABEL UserBehavior -( - -- highlight-start - DATA INFILE("s3://starrocks/user_behavior_ten_million_rows.parquet") - -- highlight-end - INTO TABLE user_behavior - ) - WITH BROKER - ( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -This job has four main sections: - -- `LABEL`: A string used when querying the state of the load job. -- `LOAD` declaration: The source URI, source data format, and destination table name. -- `BROKER`: The connection details for the source. -- `PROPERTIES`: The timeout value and any other properties to apply to the load job. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Check load progress - -You can query the progress of Broker Load jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. - -```SQL -SELECT * FROM information_schema.loads; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```sql -SELECT * FROM information_schema.loads -WHERE LABEL = 'UserBehavior'\G -``` - -```plaintext -*************************** 1. 
row *************************** - JOB_ID: 10176 - LABEL: userbehavior - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: BROKER - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):72000; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-19 23:02:41 - ETL_START_TIME: 2023-12-19 23:02:44 - ETL_FINISH_TIME: 2023-12-19 23:02:44 - LOAD_START_TIME: 2023-12-19 23:02:44 - LOAD_FINISH_TIME: 2023-12-19 23:02:46 - JOB_DETAILS: {"All backends":{"4aeec563-a91e-4c1e-b169-977b660950d1":[10004]},"FileNumber":1,"FileSize":132251298,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":132251298,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"4aeec563-a91e-4c1e-b169-977b660950d1":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -1 row in set (0.02 sec) -``` - -After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. Example: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - - diff --git a/docs/en/loading/objectstorage.mdx b/docs/en/loading/objectstorage.mdx deleted file mode 100644 index 289ae91..0000000 --- a/docs/en/loading/objectstorage.mdx +++ /dev/null @@ -1,10 +0,0 @@ ---- -displayed_sidebar: docs -description: "Load from S3, GCS, Azure, and MinIO" ---- - -# Loading data from Object Storage - -import DocCardList from '@theme/DocCardList'; - - diff --git a/docs/en/loading/s3.md b/docs/en/loading/s3.md deleted file mode 100644 index c8bad2d..0000000 --- a/docs/en/loading/s3.md +++ /dev/null @@ -1,659 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# Load data from AWS S3 - -import LoadMethodIntro from '../_assets/commonMarkdown/loadMethodIntro.mdx' - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -import PipeAdvantages from '../_assets/commonMarkdown/pipeAdvantages.mdx' - -StarRocks provides the following options for loading data from AWS S3: - - - -## Before you begin - -### Make source data ready - -Make sure the source data you want to load into StarRocks is properly stored in an S3 bucket. You may also consider where the data and the database are located, because data transfer costs are much lower when your bucket and your StarRocks cluster are located in the same region. - -In this topic, we provide you with a sample dataset in an S3 bucket, `s3://starrocks-examples/user-behavior-10-million-rows.parquet`. You can access that dataset with any valid credentials as the object is readable by any AWS authenticated user. - -### Check privileges - - - -### Gather authentication details - -The examples in this topic use IAM user-based authentication. 
To ensure that you have permission to read data from AWS S3, we recommend that you read [Preparation for IAM user-based authentication](../integrations/authenticate_to_aws_resources.md) and follow the instructions to create an IAM user with proper [IAM policies](../sql-reference/aws_iam_policies.md) configured. - -In a nutshell, if you practice IAM user-based authentication, you need to gather information about the following AWS resources: - -- The S3 bucket that stores your data. -- The S3 object key (object name) if accessing a specific object in the bucket. Note that the object key can include a prefix if your S3 objects are stored in sub-folders. -- The AWS region to which the S3 bucket belongs. -- The access key and secret key used as access credentials. - -For information about all the authentication methods available, see [Authenticate to AWS resources](../integrations/authenticate_to_aws_resources.md). - -## Use INSERT+FILES() - -This method is available from v3.1 onwards and currently supports only the Parquet, ORC, and CSV (from v3.3.0 onwards) file formats. - -### Advantages of INSERT+FILES() - -[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) can read the file stored in cloud storage based on the path-related properties you specify, infer the table schema of the data in the file, and then return the data from the file as data rows. - -With `FILES()`, you can: - -- Query the data directly from S3 using [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md). -- Create and load a table using [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS). -- Load the data into an existing table using [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md). - -### Typical examples - -#### Querying directly from S3 using SELECT - -Querying directly from S3 using SELECT+`FILES()` can give a good preview of the content of a dataset before you create a table. For example: - -- Get a preview of the dataset without storing the data. -- Query for the min and max values and decide what data types to use. -- Check for `NULL` values. - -The following example queries the sample dataset `s3://starrocks-examples/user-behavior-10-million-rows.parquet`: - -```SQL -SELECT * FROM FILES -( - "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -) -LIMIT 3; -``` - -> **NOTE** -> -> Substitute your credentials for `AAA` and `BBB` in the above command. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. - -The system returns the following query result: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 1 | 2576651 | 149192 | pv | 2017-11-25 01:21:25 | -| 1 | 3830808 | 4181361 | pv | 2017-11-25 07:04:53 | -| 1 | 4365585 | 2520377 | pv | 2017-11-25 07:49:06 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> Notice that the column names as returned above are provided by the Parquet file. - -#### Creating and loading a table using CTAS - -This is a continuation of the previous example. 
The previous query is wrapped in CREATE TABLE AS SELECT (CTAS) to automate the table creation using schema inference. This means StarRocks will infer the table schema, create the table you want, and then load the data into the table. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files as the Parquet format includes the column names. - -> **NOTE** -> -> The syntax of CREATE TABLE when using schema inference does not allow setting the number of replicas, so set it before creating the table. The example below is for a system with one replica: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "1"); -> ``` - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Use CTAS to create a table and load the data of the sample dataset `s3://starrocks-examples/user-behavior-10-million-rows.parquet` into the table: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -> **NOTE** -> -> Substitute your credentials for `AAA` and `BBB` in the above command. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. - -After creating the table, you can view its schema by using [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md): - -```SQL -DESCRIBE user_behavior_inferred; -``` - -The system returns the following query result: - -```Plain -+--------------+------------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+------------------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varchar(1048576) | YES | false | NULL | | -| Timestamp | varchar(1048576) | YES | false | NULL | | -+--------------+------------------+------+-------+---------+-------+ -``` - -Query the table to verify that the data has been loaded into it. Example: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 225586 | 3694958 | 1040727 | pv | 2017-12-01 00:58:40 | -| 225586 | 3726324 | 965809 | pv | 2017-12-01 02:16:02 | -| 225586 | 3732495 | 1488813 | pv | 2017-12-01 00:59:46 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### Loading into an existing table using INSERT - -You may want to customize the table that you are inserting into, for example, the: - -- column data type, nullable setting, or default values -- key types and columns -- data partitioning and bucketing - -> **NOTE** -> -> Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This topic does not cover table design. For information about table design, see [Table types](../table_design/StarRocks_table_design.md). 
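Before you commit to a DDL, it can be worth profiling the file itself. The following query is only a sketch (it reuses the sample Parquet file and the placeholder credentials from this topic, and the column names returned by the file) that checks the value range of `Timestamp` and counts `NULL` values directly in S3:

```SQL
SELECT MIN(Timestamp) AS min_ts,
       MAX(Timestamp) AS max_ts,
       COUNT(*) AS total_rows,
       -- COUNT(column) skips NULLs, so this difference is the NULL count
       COUNT(*) - COUNT(Timestamp) AS null_timestamps
FROM FILES
(
    "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
);
```

The results of a profiling query like this are what inform the design decisions described next, such as casting `Timestamp` to DATETIME and marking columns as non-nullable.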
- -In this example, we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in S3. - -- Since a query of the dataset in S3 indicates that the `Timestamp` column contains data that matches a VARCHAR data type, and StarRocks can cast from VARCHAR to DATETIME, the data type is changed to DATETIME in the following DDL. -- By querying the data in S3, you can find that there are no `NULL` values in the dataset, so the DDL could also set all columns as non-nullable. -- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID`. Your use case might be different for this data, so you might decide to use `ItemID` in addition to, or instead of, `UserID` for the sort key. - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand: - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp datetime -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -Display the schema so that you can compare it with the inferred schema produced by the `FILES()` table function: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | YES | true | NULL | | -| ItemID | int | YES | false | NULL | | -| CategoryID | int | YES | false | NULL | | -| BehaviorType | varchar(65533) | YES | false | NULL | | -| Timestamp | datetime | YES | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -``` - -:::tip - -Compare the schema you just created with the schema inferred earlier using the `FILES()` table function. Look at: - -- data types -- nullable -- key fields - -To better control the schema of the destination table and for better query performance, we recommend that you specify the table schema by hand in production environments. - -::: - -After creating the table, you can load it with INSERT INTO SELECT FROM FILES(): - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -> **NOTE** -> -> Substitute your credentials for `AAA` and `BBB` in the above command. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. - -After the load is complete, you can query the table to verify that the data has been loaded into it. 
Example: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 393529 | 3715112 | 883960 | pv | 2017-12-02 02:45:44 | -| 393529 | 2650583 | 883960 | pv | 2017-12-02 02:45:59 | -| 393529 | 3715112 | 883960 | pv | 2017-12-02 03:00:56 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### Check load progress - -You can query the progress of INSERT jobs from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. Example: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -If you have submitted multiple load jobs, you can filter on the `LABEL` associated with the job. Example: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_e3b882f5-7eb3-11ee-ae77-00163e267b60' \G -*************************** 1. row *************************** - JOB_ID: 10243 - LABEL: insert_e3b882f5-7eb3-11ee-ae77-00163e267b60 - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-11-09 11:56:01 - ETL_START_TIME: 2023-11-09 11:56:01 - ETL_FINISH_TIME: 2023-11-09 11:56:01 - LOAD_START_TIME: 2023-11-09 11:56:01 - LOAD_FINISH_TIME: 2023-11-09 11:56:44 - JOB_DETAILS: {"All backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[10142]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT is a synchronous command. If an INSERT job is still running, you need to open another session to check its execution status. - -## Use Broker Load - -An asynchronous Broker Load process handles making the connection to S3, pulling the data, and storing the data in StarRocks. - -This method supports the following file formats: - -- Parquet -- ORC -- CSV -- JSON (supported from v3.2.3 onwards) - -### Advantages of Broker Load - -- Broker Load runs in the background and clients do not need to stay connected for the job to continue. -- Broker Load is preferred for long-running jobs, with the default timeout spanning 4 hours. -- In addition to Parquet and ORC file format, Broker Load supports CSV file format and JSON file format (JSON file format is supported from v3.2.3 onwards). - -### Data flow - -![Workflow of Broker Load](../_assets/broker_load_how-to-work_en.png) - -1. The user creates a load job. -2. The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BEs) or compute nodes (CNs). -3. The BEs or CNs pull the data from the source and load the data into StarRocks. 
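Because the load runs in the background, you can watch it, or cancel it, from any other session while the BEs or CNs are still pulling data. The following is a minimal sketch, assuming the label `user_behavior` and the database `mydatabase` used in the example that follows:

```SQL
-- Check the state of the asynchronous job by its label
SELECT STATE, PROGRESS, ERROR_MSG
FROM information_schema.loads
WHERE LABEL = 'user_behavior';

-- Cancel the job if it has not finished yet
CANCEL LOAD FROM mydatabase WHERE LABEL = 'user_behavior';
```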
- -### Typical example - -Create a table, start a load process that pulls the sample dataset `s3://starrocks-examples/user-behavior-10-million-rows.parquet` from S3, and verify the progress and success of the data loading. - -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table has the same schema as the Parquet file that you want to load from AWS S3): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp datetime -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### Start a Broker Load - -Run the following command to start a Broker Load job that loads data from the sample dataset `s3://starrocks-examples/user-behavior-10-million-rows.parquet` to the `user_behavior` table: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("s3://starrocks-examples/user-behavior-10-million-rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER - ( - "aws.s3.enable_ssl" = "true", - "aws.s3.use_instance_profile" = "false", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -> **NOTE** -> -> Substitute your credentials for `AAA` and `BBB` in the above command. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. - -This job has four main sections: - -- `LABEL`: A string used when querying the state of the load job. -- `LOAD` declaration: The source URI, source data format, and destination table name. -- `BROKER`: The connection details for the source. -- `PROPERTIES`: The timeout value and any other properties to apply to the load job. - -For detailed syntax and parameter descriptions, see [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md). - -#### Check load progress - -You can query the progress of the Broker Load job from the [`loads`](../sql-reference/information_schema/loads.md) view in the StarRocks Information Schema. This feature is supported from v3.1 onwards. - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -For information about the fields provided in the `loads` view, see [`loads`](../sql-reference/information_schema/loads.md). - -This record shows a state of `LOADING`, and the progress is 39%. If you see something similar, then run the command again until you see a state of `FINISHED`. 
- -```Plaintext - JOB_ID: 10466 - LABEL: user_behavior - DATABASE_NAME: mydatabase - # highlight-start - STATE: LOADING - PROGRESS: ETL:100%; LOAD:39% - # highlight-end - TYPE: BROKER - PRIORITY: NORMAL - SCAN_ROWS: 4620288 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 4620288 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):72000; max_filter_ratio:0.0 - CREATE_TIME: 2024-02-28 22:11:36 - ETL_START_TIME: 2024-02-28 22:11:41 - ETL_FINISH_TIME: 2024-02-28 22:11:41 - LOAD_START_TIME: 2024-02-28 22:11:41 - LOAD_FINISH_TIME: NULL - JOB_DETAILS: {"All backends":{"2fb97223-b14c-404b-9be1-83aa9b3a7715":[10004]},"FileNumber":1,"FileSize":136901706,"InternalTableLoadBytes":144032784,"InternalTableLoadRows":4620288,"ScanBytes":143969616,"ScanRows":4620288,"TaskNumber":1,"Unfinished backends":{"2fb97223-b14c-404b-9be1-83aa9b3a7715":[10004]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. Example: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -The following query result is returned, indicating that the data has been successfully loaded: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 34 | 856384 | 1029459 | pv | 2017-11-27 14:43:27 | -| 34 | 5079705 | 1029459 | pv | 2017-11-27 14:44:13 | -| 34 | 4451615 | 1029459 | pv | 2017-11-27 14:45:52 | -+--------+---------+------------+--------------+---------------------+ -``` - -## Use Pipe - -Starting from v3.2, StarRocks provides the Pipe loading method, which currently supports only the Parquet and ORC file formats. - -### Advantages of Pipe - - - -Pipe is ideal for continuous data loading and large-scale data loading: - -- **Large-scale data loading in micro-batches helps reduce the cost of retries caused by data errors.** - - With the help of Pipe, StarRocks enables the efficient loading of a large number of data files with a significant data volume in total. Pipe automatically splits the files based on their number or size, breaking down the load job into smaller, sequential tasks. This approach ensures that errors in one file do not impact the entire load job. The load status of each file is recorded by Pipe, allowing you to easily identify and fix files that contain errors. By minimizing the need for retries due to data errors, this approach helps to reduce costs. - -- **Continuous data loading helps reduce manpower.** - - Pipe helps you write new or updated data files to a specific location and continuously load the new data from these files into StarRocks. After you create a Pipe job with `"AUTO_INGEST" = "TRUE"` specified, it will constantly monitor changes to the data files stored in the specified path and automatically load new or updated data from the data files into the destination StarRocks table. - -Additionally, Pipe performs file uniqueness checks to help prevent duplicate data loading.During the loading process, Pipe checks the uniqueness of each data file based on the file name and digest. If a file with a specific file name and digest has already been processed by a Pipe job, the Pipe job will skip all subsequent files with the same file name and digest. Note that object storage like AWS S3 uses `ETag` as file digest. 
- -The load status of each data file is recorded and saved to the `information_schema.pipe_files` view. After a Pipe job associated with the view is deleted, the records about the files loaded in that job will also be deleted. - -### Differences between Pipe and INSERT+FILES() - -A Pipe job is split into one or more transactions based on the size and number of rows in each data file. Users can query the intermediate results during the loading process. In contrast, an INSERT+`FILES()` job is processed as a single transaction, and users are unable to view the data during the loading process. - -### File loading sequence - -For each Pipe job, StarRocks maintains a file queue, from which it fetches and loads data files as micro-batches. Pipe does not ensure that the data files are loaded in the same order as they are uploaded. Therefore, newer data may be loaded prior to older data. - -### Typical example - -#### Create a database and a table - -Create a database and switch to it: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from AWS S3): - -```SQL -CREATE TABLE user_behavior_from_pipe -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp datetime -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### Start a Pipe job - -Run the following command to start a Pipe job that loads data from the sample dataset `s3://starrocks-examples/user-behavior-10-million-rows/` to the `user_behavior_from_pipe` table. This pipe job uses both micro batches, and continuous loading (described above) pipe-specific features. - -The other examples in this guide load a single Parquet file with 10 million rows. For the pipe example, the same dataset is split into 57 separate files, and these are all stored in one S3 folder. Note in the `CREATE PIPE` command below the `path` is the URI for an S3 folder and rather than providing a filename the URI ends in `/*`. By setting `AUTO_INGEST` and specifying a folder rather than an individual file the pipe job will poll the S3 folder for new files and ingest them as they are added to the folder. - -```SQL -CREATE PIPE user_behavior_pipe -PROPERTIES -( --- highlight-start - "AUTO_INGEST" = "TRUE" --- highlight-end -) -AS -INSERT INTO user_behavior_from_pipe -SELECT * FROM FILES -( --- highlight-start - "path" = "s3://starrocks-examples/user-behavior-10-million-rows/*", --- highlight-end - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -> **NOTE** -> -> Substitute your credentials for `AAA` and `BBB` in the above command. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. - -This job has four main sections: - -- `pipe_name`: The name of the pipe. The pipe name must be unique within the database to which the pipe belongs. -- `INSERT_SQL`: The INSERT INTO SELECT FROM FILES statement that is used to load data from the specified source data file to the destination table. -- `PROPERTIES`: A set of optional parameters that specify how to execute the pipe. These include `AUTO_INGEST`, `POLL_INTERVAL`, `BATCH_SIZE`, and `BATCH_FILES`. Specify these properties in the `"key" = "value"` format. 
- -For detailed syntax and parameter descriptions, see [CREATE PIPE](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md). - -#### Check load progress - -- Query the progress of the Pipe job by using [SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md) in the current database to which the Pipe job belongs. - - ```SQL - SHOW PIPES WHERE NAME = 'user_behavior_pipe' \G - ``` - - The following result is returned: - - :::tip - In the output shown below the pipe is in the `RUNNING` state. A pipe will stay in the `RUNNING` state until you manually stop it. The output also shows the number of files loaded (57) and the last time that a file was loaded. - ::: - - ```SQL - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10476 - PIPE_NAME: user_behavior_pipe - -- highlight-start - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_from_pipe - LOAD_STATUS: {"loadedFiles":57,"loadedBytes":295345637,"loadingFiles":0,"lastLoadedTime":"2024-02-28 22:14:19"} - -- highlight-end - LAST_ERROR: NULL - CREATED_TIME: 2024-02-28 22:13:41 - 1 row in set (0.02 sec) - ``` - -- Query the progress of the Pipe job from the [`pipes`](../sql-reference/information_schema/pipes.md) view in the StarRocks Information Schema. - - ```SQL - SELECT * FROM information_schema.pipes WHERE pipe_name = 'user_behavior_replica' \G - ``` - - The following result is returned: - - :::tip - Some of the queries in this guide end in `\G` instead of a semicolon (`;`). This causes the MySQL client to output the results in vertical format. If you are using DBeaver or another client you may need to use a semicolon (`;`) rather - than `\G`. - ::: - - ```SQL - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10217 - PIPE_NAME: user_behavior_replica - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_replica - LOAD_STATUS: {"loadedFiles":1,"loadedBytes":132251298,"loadingFiles":0,"lastLoadedTime":"2023-11-09 15:35:42"} - LAST_ERROR: - CREATED_TIME: 9891-01-15 07:51:45 - 1 row in set (0.01 sec) - ``` - -#### Check file status - -You can query the load status of the files loaded from the [`pipe_files`](../sql-reference/information_schema/pipe_files.md) view in the StarRocks Information Schema. - -```SQL -SELECT * FROM information_schema.pipe_files WHERE pipe_name = 'user_behavior_replica' \G -``` - -The following result is returned: - -```SQL -*************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10217 - PIPE_NAME: user_behavior_replica - FILE_NAME: s3://starrocks-examples/user-behavior-10-million-rows.parquet - FILE_VERSION: e29daa86b1120fea58ad0d047e671787-8 - FILE_SIZE: 132251298 - LAST_MODIFIED: 2023-11-06 13:25:17 - LOAD_STATE: FINISHED - STAGED_TIME: 2023-11-09 15:35:02 - START_LOAD_TIME: 2023-11-09 15:35:03 -FINISH_LOAD_TIME: 2023-11-09 15:35:42 - ERROR_MSG: -1 row in set (0.03 sec) -``` - -#### Manage Pipe jobs - -You can alter, suspend or resume, drop, or query the pipes you have created and retry to load specific data files. 
For more information, see [ALTER PIPE](../sql-reference/sql-statements/loading_unloading/pipe/ALTER_PIPE.md), [SUSPEND or RESUME PIPE](../sql-reference/sql-statements/loading_unloading/pipe/SUSPEND_or_RESUME_PIPE.md), [DROP PIPE](../sql-reference/sql-statements/loading_unloading/pipe/DROP_PIPE.md), [SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md), and [RETRY FILE](../sql-reference/sql-statements/loading_unloading/pipe/RETRY_FILE.md).
diff --git a/docs/en/loading/s3_compatible.md b/docs/en/loading/s3_compatible.md
deleted file mode 100644
index a55a3f1..0000000
--- a/docs/en/loading/s3_compatible.md
+++ /dev/null
@@ -1,4 +0,0 @@
---
unlisted: true
---

diff --git a/docs/en/loading/tencent.md b/docs/en/loading/tencent.md
deleted file mode 100644
index a55a3f1..0000000
--- a/docs/en/loading/tencent.md
+++ /dev/null
@@ -1,4 +0,0 @@
---
unlisted: true
---

diff --git a/docs/en/loading/BrokerLoad.md b/docs/en/loading/test.md
similarity index 100%
rename from docs/en/loading/BrokerLoad.md
rename to docs/en/loading/test.md
diff --git a/docs/en/quick_start/helm.md b/docs/en/quick_start/helm.md
deleted file mode 100644
index 7fc8b8b..0000000
--- a/docs/en/quick_start/helm.md
+++ /dev/null
@@ -1,620 +0,0 @@
---
displayed_sidebar: docs
description: Use Helm to deploy StarRocks
toc_max_heading_level: 2
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import OperatorPrereqs from '../_assets/deployment/_OperatorPrereqs.mdx'
import DDL from '../_assets/quick-start/_DDL.mdx'
import Clients from '../_assets/quick-start/_clientsAllin1.mdx'
import SQL from '../_assets/quick-start/_SQL.mdx'
import Curl from '../_assets/quick-start/_curl.mdx'

# StarRocks with Helm

## Goals

The goals of this quickstart are:

- Deploy the StarRocks Kubernetes Operator and a StarRocks cluster with Helm
- Configure a password for the StarRocks database user `root`
- Provide for high availability with three FEs and three BEs
- Store metadata in persistent storage
- Store data in persistent storage
- Allow MySQL clients to connect from outside the Kubernetes cluster
- Allow loading data from outside the Kubernetes cluster using Stream Load
- Load some public datasets
- Query the data

:::tip
The datasets and queries are the same as the ones used in the Basic Quick Start. The main difference here is deploying with Helm and the StarRocks Operator.
:::

The data used is provided by NYC OpenData and the National Centers for Environmental Information.

Both of these datasets are large, and because this tutorial is intended to introduce you to working with StarRocks, we are not going to load data for the past 120 years. You can run this with a GKE Kubernetes cluster built on three e2-standard-4 machines (or similar) with 80GB disk. For larger deployments, we have other documentation and will provide that later.

There is a lot of information in this document; it is presented with the step-by-step content at the beginning and the technical details at the end. This is done to serve these purposes in this order:

1. Get the system deployed with Helm.
2. Allow the reader to load data in StarRocks and analyze that data.
3. Explain the basics of data transformation during loading.

---

## Prerequisites

### SQL client

You can use the SQL client provided in the Kubernetes environment, or use one on your system. This guide uses the `mysql` CLI. Many MySQL-compatible clients will work.

### curl

`curl` is used to issue the data load job to StarRocks, and to download the datasets. Check to see if you have it installed by running `curl` or `curl.exe` at your OS prompt. If curl is not installed, [get curl here](https://curl.se/dlwiz/?type=bin).

---

## Terminology

### FE

Frontend nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of metadata in its memory, which guarantees indiscriminate services among the FEs.

### BE

Backend nodes are responsible for both data storage and executing query plans.

---

## Add the StarRocks Helm chart repo

The Helm Chart contains the definitions of the StarRocks Operator and the custom resource StarRocksCluster.

1. Add the Helm Chart Repo.

   ```Bash
   helm repo add starrocks https://starrocks.github.io/starrocks-kubernetes-operator
   ```

2. Update the Helm Chart Repo to the latest version.

   ```Bash
   helm repo update
   ```

3. View the Helm Chart Repo that you added.

   ```Bash
   helm search repo starrocks
   ```

   ```
   NAME                        CHART VERSION    APP VERSION    DESCRIPTION
   starrocks/kube-starrocks    1.9.7            3.2-latest     kube-starrocks includes two subcharts, operator...
   starrocks/operator          1.9.7            1.9.7          A Helm chart for StarRocks operator
   starrocks/starrocks         1.9.7            3.2-latest     A Helm chart for StarRocks cluster
   starrocks/warehouse         1.9.7            3.2-latest     Warehouse is currently a feature of the StarRoc...
   ```

---

## Download the data

Download these two datasets to your machine.

### New York City crash data

```bash
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv
```

### Weather data

```bash
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/72505394728.csv
```

---

## Create a Helm values file

The goals for this quick start are:

1. Configure a password for the StarRocks database user `root`
2. Provide for high availability with three FEs and three BEs
3. Store metadata in persistent storage
4. Store data in persistent storage
5. Allow MySQL clients to connect from outside the Kubernetes cluster
6. Allow loading data from outside the Kubernetes cluster using Stream Load

The Helm chart provides options to satisfy all of these goals, but they are not configured by default. The rest of this section covers the configuration needed to meet all of these goals. A complete values spec will be provided, but first read the details for each of the six sections and then copy the full spec.

### 1. Password for the database user

This bit of YAML instructs the StarRocks operator to set the password for the database user `root` to the value of the `password` key of the Kubernetes secret `starrocks-root-pass`.

```yaml
starrocks:
  initPassword:
    enabled: true
    # Set a password secret, for example:
    # kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
    passwordSecret: starrocks-root-pass
```

- Task: Create the Kubernetes secret

  ```bash
  kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
  ```

### 2. High Availability with 3 FEs and 3 BEs

By setting `starrocks.starrocksFESpec.replicas` to 3 and `starrocks.starrocksBeSpec.replicas` to 3, you will have enough FEs and BEs for high availability.
Setting the CPU and memory requests low allows the pods to be created in a small Kubernetes environment.

```yaml
starrocks:
  starrocksFESpec:
    replicas: 3
    resources:
      requests:
        cpu: 1
        memory: 1Gi

  starrocksBeSpec:
    replicas: 3
    resources:
      requests:
        cpu: 1
        memory: 2Gi
```

### 3. Store metadata in persistent storage

Setting a value for `starrocks.starrocksFESpec.storageSpec.name` to anything other than `""` causes:
- Persistent storage to be used
- The value of `starrocks.starrocksFESpec.storageSpec.name` to be used as the prefix for all storage volumes for the service

By setting the value to `fe`, these PVs will be created for FE 0:

- `fe-meta-kube-starrocks-fe-0`
- `fe-log-kube-starrocks-fe-0`

```yaml
starrocks:
  starrocksFESpec:
    storageSpec:
      name: fe
```

### 4. Store data in persistent storage

Setting a value for `starrocks.starrocksBeSpec.storageSpec.name` to anything other than `""` causes:
- Persistent storage to be used
- The value of `starrocks.starrocksBeSpec.storageSpec.name` to be used as the prefix for all storage volumes for the service

By setting the value to `be`, these PVs will be created for BE 0:

- `be-data-kube-starrocks-be-0`
- `be-log-kube-starrocks-be-0`

Setting `storageSize` to 15Gi reduces the storage from the default of 1Ti to fit smaller storage quotas.

```yaml
starrocks:
  starrocksBeSpec:
    storageSpec:
      name: be
      storageSize: 15Gi
```

### 5. LoadBalancer for MySQL clients

By default, access to the FE service is through cluster IPs. To allow external access, `service.type` is set to `LoadBalancer`.

```yaml
starrocks:
  starrocksFESpec:
    service:
      type: LoadBalancer
```

### 6. LoadBalancer for external data loading

Stream Load requires external access to both FEs and BEs. The requests are sent to the FE, and then the FE assigns a BE to process the upload. To allow the `curl` command to be redirected to the BE, `starrocksFeProxySpec` needs to be enabled and set to type `LoadBalancer`.

```yaml
starrocks:
  starrocksFeProxySpec:
    enabled: true
    service:
      type: LoadBalancer
```

### The complete values file

The above snippets combined provide a full values file. Save this to `my-values.yaml`:

```yaml
starrocks:
  initPassword:
    enabled: true
    # Set a password secret, for example:
    # kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
    passwordSecret: starrocks-root-pass

  starrocksFESpec:
    replicas: 3
    service:
      type: LoadBalancer
    resources:
      requests:
        cpu: 1
        memory: 1Gi
    storageSpec:
      name: fe

  starrocksBeSpec:
    replicas: 3
    resources:
      requests:
        cpu: 1
        memory: 2Gi
    storageSpec:
      name: be
      storageSize: 15Gi

  starrocksFeProxySpec:
    enabled: true
    service:
      type: LoadBalancer
```

## Set the StarRocks root database user password

To load data from outside of the Kubernetes cluster, the StarRocks database will be exposed externally. You should set a password for the StarRocks database user `root`; the operator will apply the password to the FE and BE nodes.
- -```bash -kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word' -``` - -``` -secret/starrocks-root-pass created -``` ---- - -## Deploy the operator and StarRocks cluster - -```bash -helm install -f my-values.yaml starrocks starrocks/kube-starrocks -``` - -``` -NAME: starrocks -LAST DEPLOYED: Wed Jun 26 20:25:09 2024 -NAMESPACE: default -STATUS: deployed -REVISION: 1 -TEST SUITE: None -NOTES: -Thank you for installing kube-starrocks-1.9.7 kube-starrocks chart. -It will install both operator and starrocks cluster, please wait for a few minutes for the cluster to be ready. - -Please see the values.yaml for more operation information: https://github.com/StarRocks/starrocks-kubernetes-operator/blob/main/helm-charts/charts/kube-starrocks/values.yaml -``` - -## Check the status of the StarRocks cluster - -You can check the progress with these commands: - -```bash -kubectl --namespace default get starrockscluster -l "cluster=kube-starrocks" -``` - -``` -NAME PHASE FESTATUS BESTATUS CNSTATUS FEPROXYSTATUS -kube-starrocks reconciling reconciling reconciling reconciling -``` - -```bash -kubectl get pods -``` - -:::note -The `kube-starrocks-initpwd` pod will go through `error` and `CrashLoopBackOff` states as it attempts to connect to the FE and BE pods to set the StarRocks root password. You should ignore these errors and wait for a status of `Completed` for this pod. -::: - -``` -NAME READY STATUS RESTARTS AGE -kube-starrocks-be-0 0/1 Running 0 20s -kube-starrocks-be-1 0/1 Running 0 20s -kube-starrocks-be-2 0/1 Running 0 20s -kube-starrocks-fe-0 1/1 Running 0 66s -kube-starrocks-fe-1 0/1 Running 0 65s -kube-starrocks-fe-2 0/1 Running 0 66s -kube-starrocks-fe-proxy-56f8998799-d4qmt 1/1 Running 0 20s -kube-starrocks-initpwd-m84br 0/1 CrashLoopBackOff 3 (50s ago) 92s -kube-starrocks-operator-54ffcf8c5c-xsjc8 1/1 Running 0 92s -``` - -```bash -kubectl get pvc -``` - -``` -NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE -be-data-kube-starrocks-be-0 Bound pvc-4ae0c9d8-7f9a-4147-ad74-b22569165448 15Gi RWO standard-rwo 82s -be-data-kube-starrocks-be-1 Bound pvc-28b4dbd1-0c8f-4b06-87e8-edec616cabbc 15Gi RWO standard-rwo 82s -be-data-kube-starrocks-be-2 Bound pvc-c7232ea6-d3d9-42f1-bfc1-024205a17656 15Gi RWO standard-rwo 82s -be-log-kube-starrocks-be-0 Bound pvc-6193c43d-c74f-4d12-afcc-c41ace3d5408 1Gi RWO standard-rwo 82s -be-log-kube-starrocks-be-1 Bound pvc-c01f124a-014a-439a-99a6-6afe95215bf0 1Gi RWO standard-rwo 82s -be-log-kube-starrocks-be-2 Bound pvc-136df15f-4d2e-43bc-a1c0-17227ce3fe6b 1Gi RWO standard-rwo 82s -fe-log-kube-starrocks-fe-0 Bound pvc-7eac524e-d286-4760-b21c-d9b6261d976f 5Gi RWO standard-rwo 2m23s -fe-log-kube-starrocks-fe-1 Bound pvc-38076b78-71e8-4659-b8e7-6751bec663f6 5Gi RWO standard-rwo 2m23s -fe-log-kube-starrocks-fe-2 Bound pvc-4ccfee60-02b7-40ba-a22e-861ea29dac74 5Gi RWO standard-rwo 2m23s -fe-meta-kube-starrocks-fe-0 Bound pvc-5130c9ff-b797-4f79-a1d2-4214af860d70 10Gi RWO standard-rwo 2m23s -fe-meta-kube-starrocks-fe-1 Bound pvc-13545330-63be-42cf-b1ca-3ed6f96a8c98 10Gi RWO standard-rwo 2m23s -fe-meta-kube-starrocks-fe-2 Bound pvc-609cadd4-c7b7-4cf9-84b0-a75678bb3c4d 10Gi RWO standard-rwo 2m23s -``` -### Verify that the cluster is healthy - -:::tip -These are the same commands as above, but show the desired state. 
-::: - -```bash -kubectl --namespace default get starrockscluster -l "cluster=kube-starrocks" -``` - -``` -NAME PHASE FESTATUS BESTATUS CNSTATUS FEPROXYSTATUS -kube-starrocks running running running running -``` - -```bash -kubectl get pods -``` - -:::tip -The system is ready when all of the pods except for `kube-starrocks-initpwd` show `1/1` in the `READY` column. The `kube-starrocks-initpwd` pod should show `0/1` and a `STATUS` of `Completed`. -::: - -``` -NAME READY STATUS RESTARTS AGE -kube-starrocks-be-0 1/1 Running 0 57s -kube-starrocks-be-1 1/1 Running 0 57s -kube-starrocks-be-2 1/1 Running 0 57s -kube-starrocks-fe-0 1/1 Running 0 103s -kube-starrocks-fe-1 1/1 Running 0 102s -kube-starrocks-fe-2 1/1 Running 0 103s -kube-starrocks-fe-proxy-56f8998799-d4qmt 1/1 Running 0 57s -kube-starrocks-initpwd-m84br 0/1 Completed 4 2m9s -kube-starrocks-operator-54ffcf8c5c-xsjc8 1/1 Running 0 2m9s -``` - -The `EXTERNAL-IP` addresses in the highlighted lines will be used to provide SQL client and Stream Load access from outside the Kubernetes cluster. - -```bash -kubectl get services -``` - -```bash -NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE -kube-starrocks-be-search ClusterIP None 9050/TCP 78s -kube-starrocks-be-service ClusterIP 34.118.228.231 9060/TCP,8040/TCP,9050/TCP,8060/TCP 78s -# highlight-next-line -kube-starrocks-fe-proxy-service LoadBalancer 34.118.230.176 34.176.12.205 8080:30241/TCP 78s -kube-starrocks-fe-search ClusterIP None 9030/TCP 2m4s -# highlight-next-line -kube-starrocks-fe-service LoadBalancer 34.118.226.82 34.176.215.97 8030:30620/TCP,9020:32461/TCP,9030:32749/TCP,9010:30911/TCP 2m4s -kubernetes ClusterIP 34.118.224.1 443/TCP 8h -``` - -:::tip -Store the `EXTERNAL-IP` addresses from the highlighted lines in environment variables so that you have them handy: - -``` -export MYSQL_IP=`kubectl get services kube-starrocks-fe-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'` -``` -``` -export FE_PROXY=`kubectl get services kube-starrocks-fe-proxy-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`:8080 -``` -::: - - - ---- - -### Connect to StarRocks with a SQL client - -:::tip - -If you are using a client other than the mysql CLI, open that now. -::: - -This command will run the `mysql` command in a Kubernetes pod: - -```sql -kubectl exec --stdin --tty kube-starrocks-fe-0 -- \ - mysql -P9030 -h127.0.0.1 -u root --prompt="StarRocks > " -``` - -If you have the mysql CLI installed locally, you can use it instead of the one in the Kubernetes cluster: - -```sql -mysql -P9030 -h $MYSQL_IP -u root --prompt="StarRocks > " -p -``` - ---- -## Create some tables - -```bash -mysql -P9030 -h $MYSQL_IP -u root --prompt="StarRocks > " -p -``` - - - - -Exit from the MySQL client, or open a new shell to run commands at the command line to upload data. - -```sql -exit -``` - - - -## Upload data - -There are many ways to load data into StarRocks. For this tutorial, the simplest way is to use curl and StarRocks Stream Load. - -Upload the two datasets that you downloaded earlier. - -:::tip -Open a new shell as these curl commands are run at the operating system prompt, not in the `mysql` client. The commands refer to the datasets that you downloaded, so run them from the directory where you downloaded the files. 
- -Since this is a new shell, run the export commands again: - -```bash - -export MYSQL_IP=`kubectl get services kube-starrocks-fe-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'` - -export FE_PROXY=`kubectl get services kube-starrocks-fe-proxy-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`:8080 -``` - -You will be prompted for a password. Use the password that you added to the Kubernetes secret `starrocks-root-pass`. If you used the command provided, the password is `g()()dpa$$word`. -::: - -The `curl` commands look complex, but they are explained in detail at the end of the tutorial. For now, we recommend running the commands and running some SQL to analyze the data, and then reading about the data loading details at the end. - - -```bash -curl --location-trusted -u root \ - -T ./NYPD_Crash_Data.csv \ - -H "label:crashdata-0" \ - -H "column_separator:," \ - -H "skip_header:1" \ - -H "enclose:\"" \ - -H "max_filter_ratio:1" \ - -H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \ - # highlight-next-line - -XPUT http://$FE_PROXY/api/quickstart/crashdata/_stream_load -``` - -``` -Enter host password for user 'root': -{ - "TxnId": 2, - "Label": "crashdata-0", - "Status": "Success", - "Message": "OK", - "NumberTotalRows": 423726, - "NumberLoadedRows": 423725, - "NumberFilteredRows": 1, - "NumberUnselectedRows": 0, - "LoadBytes": 96227746, - "LoadTimeMs": 2483, - "BeginTxnTimeMs": 42, - "StreamLoadPlanTimeMs": 122, - "ReadDataTimeMs": 1610, - "WriteDataTimeMs": 2253, - "CommitAndPublishTimeMs": 65, - "ErrorURL": "http://kube-starrocks-be-2.kube-starrocks-be-search.default.svc.cluster.local:8040/api/_load_error_log?file=error_log_5149e6f80de42bcb_eab2ea77276de4ba" -} -``` - -```bash -curl --location-trusted -u root \ - -T ./72505394728.csv \ - -H "label:weather-0" \ - -H "column_separator:," \ - -H "skip_header:1" \ - -H "enclose:\"" \ - -H "max_filter_ratio:1" \ - -H "columns: STATION, DATE, LATITUDE, LONGITUDE, ELEVATION, NAME, REPORT_TYPE, SOURCE, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySkyConditions, HourlySeaLevelPressure, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, Sunrise, Sunset, DailyAverageDewPointTemperature, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageSeaLevelPressure, DailyAverageStationPressure, DailyAverageWetBulbTemperature, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, 
DailySustainedWindSpeed, DailyWeather, MonthlyAverageRH, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyDewpointTemperature, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyGreatestSnowDepth, MonthlyGreatestSnowDepthDate, MonthlyGreatestSnowfall, MonthlyGreatestSnowfallDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, MonthlyTotalSnowfall, MonthlyWetBulb, AWND, CDSD, CLDD, DSNW, HDSD, HTDD, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, REM, BackupDirection, BackupDistance, BackupDistanceUnit, BackupElements, BackupElevation, BackupEquipment, BackupLatitude, BackupLongitude, BackupName, WindEquipmentChangeDate" \ - # highlight-next-line - -XPUT http://$FE_PROXY/api/quickstart/weatherdata/_stream_load -``` - -``` -Enter host password for user 'root': -{ - "TxnId": 4, - "Label": "weather-0", - "Status": "Success", - "Message": "OK", - "NumberTotalRows": 22931, - "NumberLoadedRows": 22931, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 15558550, - "LoadTimeMs": 404, - "BeginTxnTimeMs": 1, - "StreamLoadPlanTimeMs": 7, - "ReadDataTimeMs": 157, - "WriteDataTimeMs": 372, - "CommitAndPublishTimeMs": 23 -} -``` - -## Connect with a MySQL client - -Connect with a MySQL client if you are not connected. Remember to use the external IP address of the `kube-starrocks-fe-service` service and the password that you configured in the Kubernetes secret `starrocks-root-pass`. - -```bash -mysql -P9030 -h $MYSQL_IP -u root --prompt="StarRocks > " -p -``` - -## Answer some questions - - - -```sql -exit -``` - -## Cleanup - -Run this command if you are finished and would like to remove the StarRocks cluster and the StarRocks operator. 
- -```bash -helm delete starrocks -``` - - ---- - -## Summary - -In this tutorial you: - -- Deployed StarRocks with Helm and the StarRocks Operator -- Loaded crash data provided by New York City and weather data provided by NOAA -- Analyzed the data using SQL JOINs to find out that driving in low visibility or icy streets is a bad idea - -There is more to learn; we intentionally glossed over the data transformation done during the Stream Load. The details on that are in the notes on the curl commands below. - ---- - -## Notes on the curl commands - - - ---- - -## More information - -Default [`values.yaml`](https://github.com/StarRocks/starrocks-kubernetes-operator/blob/main/helm-charts/charts/kube-starrocks/values.yaml) - -[Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) - -The [Motor Vehicle Collisions - Crashes](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset is provided by New York City subject to these [terms of use](https://www.nyc.gov/home/terms-of-use.page) and [privacy policy](https://www.nyc.gov/home/privacy-policy.page). - -The [Local Climatological Data](https://www.ncdc.noaa.gov/cdo-web/datatools/lcd)(LCD) is provided by NOAA with this [disclaimer](https://www.noaa.gov/disclaimer) and this [privacy policy](https://www.noaa.gov/protecting-your-privacy). - -[Helm](https://helm.sh/) is a package manager for Kubernetes. A [Helm Chart](https://helm.sh/docs/topics/charts/) is a Helm package and contains all of the resource definitions necessary to run an application on a Kubernetes cluster. - -[`starrocks-kubernetes-operator` and `kube-starrocks` Helm Chart](https://github.com/StarRocks/starrocks-kubernetes-operator). diff --git a/docs/en/quick_start/hudi.md b/docs/en/quick_start/hudi.md deleted file mode 100644 index a22672e..0000000 --- a/docs/en/quick_start/hudi.md +++ /dev/null @@ -1,414 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_position: 4 -description: Data Lakehouse with Apache Hudi -toc_max_heading_level: 3 ---- -import DataLakeIntro from '../_assets/commonMarkdown/datalakeIntro.mdx' -import Clients from '../_assets/quick-start/_clientsCompose.mdx' -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; - -# Apache Hudi Lakehouse - -## Overview - -- Deploy Object Storage, Apache Spark, Hudi, and StarRocks using Docker compose -- Load a tiny dataset into Hudi with Apache Spark -- Configure StarRocks to access the Hive Metastore using an external catalog -- Query the data with StarRocks where the data sits - - - -## Prerequisites - -### StarRocks `demo` repository - -Clone the [StarRocks demo repository](https://github.com/StarRocks/demo/) to your local machine. - -All the steps in this guide will be run from the `demo/documentation-samples/hudi/` directory in the directory where you cloned the `demo` GitHub repo. - -### Docker - -- Docker Setup: For Mac, Please follow the steps as defined in [Install Docker Desktop on Mac](https://docs.docker.com/desktop/install/mac-install/). For running Spark-SQL queries, please ensure at least 5 GB memory and 4 CPUs are allocated to Docker (See Docker → Preferences → Advanced). Otherwise, spark-SQL queries could be killed because of memory issues. -- 20 GB free disk space assigned to Docker - -### SQL client - -You can use the SQL client provided in the Docker environment, or use one on your system. Many MySQL compatible clients will work. - -## Configuration - -Change directory into `demo/documentation-samples/hudi` and look at the files. 
This is not a tutorial on Hudi, so not every configuration file will be described; but it is important for the reader to know where to look to see how things are configured. In the `hudi/` directory you will find the `docker-compose.yml` file which is used to launch and configure the services in Docker. Here is a list of those services and a brief description: - -### Docker services - -| Service | Responsibilities | -|--------------------------|---------------------------------------------------------------------| -| **`starrocks-fe`** | Metadata management, client connections, query plans and scheduling | -| **`starrocks-be`** | Running query plans | -| **`metastore_db`** | Postgres DB used to store the Hive metadata | -| **`hive_metastore`** | Provides the Apache Hive metastore | -| **`minio`** and **`mc`** | MinIO Object Storage and MinIO command line client | -| **`spark-hudi`** | Distributed computing and Transactional data lake platform | - -### Configuration files - -In the `hudi/conf/` directory you will find configuration files that get mounted in the `spark-hudi` -container. - -##### `core-site.xml` - -This file contains the object storage related settings. Links for this and other items in More information at the end of this document. - -##### `spark-defaults.conf` - -Settings for Hive, MinIO, and Spark SQL. - -##### `hudi-defaults.conf` - -Default file used to silence warnings in the `spark-shell`. - -##### `hadoop-metrics2-hbase.properties` - -Empty file used to silence warnings in the `spark-shell`. - -##### `hadoop-metrics2-s3a-file-system.properties` - -Empty file used to silence warnings in the `spark-shell`. - -## Bringing up Demo Cluster - -This demo system consists of StarRocks, Hudi, MinIO, and Spark services. Run Docker compose to bring up the cluster: - -```bash -docker compose up --detach --wait --wait-timeout 60 -``` - -```plaintext -[+] Running 8/8 - ✔ Network hudi Created 0.0s - ✔ Container hudi-starrocks-fe-1 Healthy 0.1s - ✔ Container hudi-minio-1 Healthy 0.1s - ✔ Container hudi-metastore_db-1 Healthy 0.1s - ✔ Container hudi-starrocks-be-1 Healthy 0.0s - ✔ Container hudi-mc-1 Healthy 0.0s - ✔ Container hudi-hive-metastore-1 Healthy 0.0s - ✔ Container hudi-spark-hudi-1 Healthy 0.1s - ``` - -:::tip - -With many containers running, `docker compose ps` output is easier to read if you pipe it to `jq`: - -```bash -docker compose ps --format json | \ -jq '{Service: .Service, State: .State, Status: .Status}' -``` - -```json -{ - "Service": "hive-metastore", - "State": "running", - "Status": "Up About a minute (healthy)" -} -{ - "Service": "mc", - "State": "running", - "Status": "Up About a minute" -} -{ - "Service": "metastore_db", - "State": "running", - "Status": "Up About a minute" -} -{ - "Service": "minio", - "State": "running", - "Status": "Up About a minute" -} -{ - "Service": "spark-hudi", - "State": "running", - "Status": "Up 33 seconds (healthy)" -} -{ - "Service": "starrocks-be", - "State": "running", - "Status": "Up About a minute (healthy)" -} -{ - "Service": "starrocks-fe", - "State": "running", - "Status": "Up About a minute (healthy)" -} -``` - -::: - -## Configure MinIO - -When you run the Spark commands you will set the basepath for the table being created to an `s3a` URI: - -```java -val basePath = "s3a://huditest/hudi_coders" -``` - -In this step you will create the bucket `huditest` in MinIO. The MinIO console is running on port `9000`. 
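The next two subsections create the bucket through the MinIO console. If you prefer the command line, a sketch along these lines may also work; it assumes the `mc` container defined in `docker-compose.yml` is still running and uses the `admin`/`password` credentials noted in the next subsection (the alias name `hudiminio` is made up):

```bash
# Sketch only: create the huditest bucket using the MinIO client (mc) container.
# The alias name "hudiminio" is arbitrary; credentials come from docker-compose.yml.
docker compose exec mc mc alias set hudiminio http://minio:9000 admin password
docker compose exec mc mc mb hudiminio/huditest
```

Whichever method you use, the bucket name must match the `basePath` set in the Spark session (`s3a://huditest/...`).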
- -### Authenticate to MinIO - -Open a browser to [http://localhost:9000/](http://localhost:9000/) and authenticate. The username and password are specified in `docker-compose.yml`; they are `admin` and `password`. - -### Create a bucket - -In the left navigation select **Buckets**, and then **Create Bucket +**. Name the bucket `huditest` and select **Create Bucket** - -![Create bucket huditest](../_assets/quick-start/hudi-test-bucket.png) - -## Create and populate a table, then sync it to Hive - -:::tip - -Run this command, and any other `docker compose` commands, from the directory containing the `docker-compose.yml` file. -::: - -Open `spark-shell` in the `spark-hudi` service - -```bash -docker compose exec spark-hudi spark-shell -``` - -:::note -There will be warnings when `spark-shell` starts about illegal reflective access. You can ignore these warnings. -::: - -Run these commands at the `scala>` prompt to: - -- Configure this Spark session to load, process, and write data -- Create a dataframe and write that to a Hudi table -- Sync to the Hive Metastore - -```scala -import org.apache.spark.sql.functions._ -import org.apache.spark.sql.types._ -import org.apache.spark.sql.Row -import org.apache.spark.sql.SaveMode._ -import org.apache.hudi.DataSourceReadOptions._ -import org.apache.hudi.DataSourceWriteOptions._ -import org.apache.hudi.config.HoodieWriteConfig._ -import scala.collection.JavaConversions._ - -val schema = StructType( Array( - StructField("language", StringType, true), - StructField("users", StringType, true), - StructField("id", StringType, true) - )) - -val rowData= Seq(Row("Java", "20000", "a"), - Row("Python", "100000", "b"), - Row("Scala", "3000", "c")) - - -val df = spark.createDataFrame(rowData,schema) - -val databaseName = "hudi_sample" -val tableName = "hudi_coders_hive" -val basePath = "s3a://huditest/hudi_coders" - -df.write.format("hudi"). - option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName). - option(RECORDKEY_FIELD_OPT_KEY, "id"). - option(PARTITIONPATH_FIELD_OPT_KEY, "language"). - option(PRECOMBINE_FIELD_OPT_KEY, "users"). - option("hoodie.datasource.write.hive_style_partitioning", "true"). - option("hoodie.datasource.hive_sync.enable", "true"). - option("hoodie.datasource.hive_sync.mode", "hms"). - option("hoodie.datasource.hive_sync.database", databaseName). - option("hoodie.datasource.hive_sync.table", tableName). - option("hoodie.datasource.hive_sync.partition_fields", "language"). - option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor"). - option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083"). - mode(Overwrite). - save(basePath) -System.exit(0) -``` - -:::note -You will see a warning: - -```java -WARN -org.apache.hudi.metadata.HoodieBackedTableMetadata - -Metadata table was not found at path -s3a://huditest/hudi_coders/.hoodie/metadata -``` - -This can be ignored, the file will be created automatically during this `spark-shell` session. - -There will also be a warning: - -```bash -78184 [main] WARN org.apache.hadoop.fs.s3a.S3ABlockOutputStream - -Application invoked the Syncable API against stream writing to -hudi_coders/.hoodie/metadata/files/.files-0000_00000000000000.log.1_0-0-0. -This is unsupported -``` - -This warning informs you that syncing a log file that is open for writes is not supported when using object storage. The file will only be synced when it is closed. See [Stack Overflow](https://stackoverflow.com/a/74886836/10424890). 
-::: - -The final command in the above spark-shell session should exit the container, if it doesn't press enter and it will exit. - -## Configure StarRocks - -### Connect to StarRocks - -Connect to StarRocks with the provided MySQL client provided by the `starrocks-fe` service, or use your favorite SQL client and configure it to connect using the MySQL protocol on `localhost:9030`. - -```bash -docker compose exec starrocks-fe \ - mysql -P 9030 -h 127.0.0.1 -u root --prompt="StarRocks > " -``` - -### Create the linkage between StarRocks and Hudi - -There is a link at the end of this guide with more information on external catalogs. The external catalog created in this step acts as the linkage to the Hive Metastore (HMS) running in Docker. - -```sql -CREATE EXTERNAL CATALOG hudi_catalog_hms -PROPERTIES -( - "type" = "hudi", - "hive.metastore.type" = "hive", - "hive.metastore.uris" = "thrift://hive-metastore:9083", - "aws.s3.use_instance_profile" = "false", - "aws.s3.access_key" = "admin", - "aws.s3.secret_key" = "password", - "aws.s3.enable_ssl" = "false", - "aws.s3.enable_path_style_access" = "true", - "aws.s3.endpoint" = "http://minio:9000" -); -``` - -```plaintext -Query OK, 0 rows affected (0.59 sec) -``` - -### Use the new catalog - -```sql -SET CATALOG hudi_catalog_hms; -``` - -```plaintext -Query OK, 0 rows affected (0.01 sec) -``` - -### Navigate to the data inserted with Spark - -```sql -SHOW DATABASES; -``` - -```plaintext -+--------------------+ -| Database | -+--------------------+ -| default | -| hudi_sample | -| information_schema | -+--------------------+ -2 rows in set (0.40 sec) -``` - -```sql -USE hudi_sample; -``` - -```plaintext -Reading table information for completion of table and column names -You can turn off this feature to get a quicker startup with -A - -Database changed -``` - -```sql -SHOW TABLES; -``` - -```plaintext -+-----------------------+ -| Tables_in_hudi_sample | -+-----------------------+ -| hudi_coders_hive | -+-----------------------+ -1 row in set (0.07 sec) -``` - -### Query the data in Hudi with StarRocks - -Run this query twice, the first time may take around five seconds to complete as data is not yet cached in StarRocks. The second query will be very quick. - -```sql -SELECT * from hudi_coders_hive\G -``` - -:::tip -Some of the SQL queries in the StarRocks documentation end with `\G` instead -of a semicolon. The `\G` causes the mysql CLI to render the query results vertically. - -Many SQL clients do not interpret vertical formatting output, so you should replace `\G` with `;` if you are not using the mysql CLI. -::: - -```plaintext -*************************** 1. row *************************** - _hoodie_commit_time: 20240208165522561 - _hoodie_commit_seqno: 20240208165522561_0_0 - _hoodie_record_key: c -_hoodie_partition_path: language=Scala - _hoodie_file_name: bb29249a-b69d-4c32-843b-b7142d8dc51c-0_0-27-1221_20240208165522561.parquet - language: Scala - users: 3000 - id: c -*************************** 2. row *************************** - _hoodie_commit_time: 20240208165522561 - _hoodie_commit_seqno: 20240208165522561_2_0 - _hoodie_record_key: a -_hoodie_partition_path: language=Java - _hoodie_file_name: 12fc14aa-7dc4-454c-b710-1ad0556c9386-0_2-27-1223_20240208165522561.parquet - language: Java - users: 20000 - id: a -*************************** 3. 
row *************************** - _hoodie_commit_time: 20240208165522561 - _hoodie_commit_seqno: 20240208165522561_1_0 - _hoodie_record_key: b -_hoodie_partition_path: language=Python - _hoodie_file_name: 51977039-d71e-4dd6-90d4-0c93656dafcf-0_1-27-1222_20240208165522561.parquet - language: Python - users: 100000 - id: b -3 rows in set (0.15 sec) -``` - -## Summary - -This tutorial exposed you to the use of a StarRocks external catalog to show you that you can query your data where it sits using the Hudi external catalog. Many other integrations are available using Iceberg, Delta Lake, and JDBC catalogs. - -In this tutorial you: - -- Deployed StarRocks and a Hudi/Spark/MinIO environment in Docker -- Loaded a tiny dataset into Hudi with Apache Spark -- Configured a StarRocks external catalog to provide access to the Hudi catalog -- Queried the data with SQL in StarRocks without copying the data from the data lake - -## More information - -[StarRocks Catalogs](../data_source/catalog/catalog_overview.md) - -[Apache Hudi quickstart](https://hudi.apache.org/docs/quick-start-guide/) (includes Spark) - -[Apache Hudi S3 configuration](https://hudi.apache.org/docs/s3_hoodie/) - -[Apache Spark configuration docs](https://spark.apache.org/docs/latest/configuration.html) diff --git a/docs/en/quick_start/iceberg.md b/docs/en/quick_start/iceberg.md deleted file mode 100644 index c190c44..0000000 --- a/docs/en/quick_start/iceberg.md +++ /dev/null @@ -1,252 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_position: 3 -description: Data Lakehouse with Apache Iceberg -toc_max_heading_level: 2 -keywords: [ 'iceberg' ] ---- - -import DataLakeIntro from '../_assets/commonMarkdown/datalakeIntro.mdx' -import Clients from '../_assets/quick-start/_clientsCompose.mdx' - -# Apache Iceberg Lakehouse - -This guide will get you up and running with Apache Iceberg™ using StarRocks™, including sample code to highlight some -powerful features. - -### Docker-Compose - -The fastest way to get started is to use a docker-compose file that uses the `starrocks/fe-ubuntu` and `starrocks/be-ubuntu` -images which contain a local StarRocks cluster with a configured Iceberg catalog. To use this, you'll need to install -the Docker CLI. 
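If you are unsure whether the Docker CLI and the Compose plugin are available on your machine, a quick check like this may help:

```bash
# Verify that the Docker CLI and the Compose plugin are installed
docker --version
docker compose version
```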
- -Once you have Docker installed, save the yaml below into a file named docker-compose.yml: - -```yml -services: - - starrocks-fe: - image: starrocks/fe-ubuntu:4.0-latest - hostname: starrocks-fe - container_name: starrocks-fe - user: root - command: | - bash /opt/starrocks/fe/bin/start_fe.sh --host_type FQDN - ports: - - 8030:8030 - - 9020:9020 - - 9030:9030 - networks: - iceberg_net: - environment: - - AWS_ACCESS_KEY_ID=admin - - AWS_SECRET_ACCESS_KEY=password - - AWS_REGION=us-east-1 - healthcheck: - test: 'mysql -u root -h starrocks-fe -P 9030 -e "SHOW FRONTENDS\G" |grep "Alive: true"' - interval: 10s - timeout: 5s - retries: 3 - - starrocks-be: - image: starrocks/be-ubuntu:4.0-latest - command: - - /bin/bash - - -c - - | - ulimit -n 65535; - echo "# Enable data cache" >> /opt/starrocks/be/conf/be.conf - echo "block_cache_enable = true" >> /opt/starrocks/be/conf/be.conf - echo "block_cache_mem_size = 536870912" >> /opt/starrocks/be/conf/be.conf - echo "block_cache_disk_size = 1073741824" >> /opt/starrocks/be/conf/be.conf - sleep 15s - mysql --connect-timeout 2 -h starrocks-fe -P 9030 -u root -e "ALTER SYSTEM ADD BACKEND \"starrocks-be:9050\";" - bash /opt/starrocks/be/bin/start_be.sh - ports: - - 8040:8040 - hostname: starrocks-be - container_name: starrocks-be - user: root - depends_on: - starrocks-fe: - condition: service_healthy - healthcheck: - test: 'mysql -u root -h starrocks-fe -P 9030 -e "SHOW BACKENDS\G" |grep "Alive: true"' - interval: 10s - timeout: 5s - retries: 3 - networks: - iceberg_net: - environment: - - HOST_TYPE=FQDN - - rest: - image: apache/iceberg-rest-fixture - container_name: iceberg-rest - networks: - iceberg_net: - aliases: - - iceberg-rest.minio - ports: - - 8181:8181 - environment: - - AWS_ACCESS_KEY_ID=admin - - AWS_SECRET_ACCESS_KEY=password - - AWS_REGION=us-east-1 - - CATALOG_WAREHOUSE=s3://warehouse/ - - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO - - CATALOG_S3_ENDPOINT=http://minio:9000 - - minio: - image: minio/minio:RELEASE.2024-10-29T16-01-48Z - container_name: minio - environment: - - MINIO_ROOT_USER=admin - - MINIO_ROOT_PASSWORD=password - - MINIO_DOMAIN=minio - networks: - iceberg_net: - aliases: - - warehouse.minio - ports: - - 9001:9001 - - 9000:9000 - command: ["server", "/data", "--console-address", ":9001"] - mc: - depends_on: - - minio - image: minio/mc:RELEASE.2024-10-29T15-34-59Z - container_name: mc - networks: - iceberg_net: - environment: - - AWS_ACCESS_KEY_ID=admin - - AWS_SECRET_ACCESS_KEY=password - - AWS_REGION=us-east-1 - entrypoint: > - /bin/sh -c " - until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done; - /usr/bin/mc rm -r --force minio/warehouse; - /usr/bin/mc mb minio/warehouse; - /usr/bin/mc policy set public minio/warehouse; - tail -f /dev/null - " -networks: - iceberg_net: -``` - -Next, start up the docker containers with this command: - -```Plain -docker compose up --detach --wait --wait-timeout 400 -``` - -You can then run any of the following commands to start a StarRocks session. 
- -```bash -docker exec -it starrocks-fe \ -mysql -P 9030 -h 127.0.0.1 -u root --prompt="StarRocks > " -``` - -### Adding and Using a Catalog - -```SQL -CREATE EXTERNAL CATALOG 'demo' -COMMENT "External catalog to Apache Iceberg on MinIO" -PROPERTIES -( - "type"="iceberg", - "iceberg.catalog.type"="rest", - "iceberg.catalog.uri"="http://iceberg-rest:8181", - "iceberg.catalog.warehouse"="warehouse", - "aws.s3.access_key"="admin", - "aws.s3.secret_key"="password", - "aws.s3.endpoint"="http://minio:9000", - "aws.s3.enable_path_style_access"="true" -); -``` - -```SQL -SHOW CATALOGS\G -``` - -```SQL -*************************** 1. row *************************** -Catalog: default_catalog - Type: Internal -Comment: An internal catalog contains this cluster's self-managed tables. -*************************** 2. row *************************** -Catalog: demo - Type: Iceberg -Comment: External catalog to Apache Iceberg on MinIO -2 rows in set (0.00 sec) -``` - -```SQL -SET CATALOG demo; -``` - -### Creating and using a database - -```SQL -CREATE DATABASE nyc; -``` - -```SQL -USE nyc; -``` - -### Creating a table - -```SQL -CREATE TABLE demo.nyc.taxis -( - trip_id bigint, - trip_distance float, - fare_amount double, - store_and_fwd_flag string, - vendor_id bigint -) PARTITION BY (vendor_id); -``` - -### Writing Data to a Table - -```SQL -INSERT INTO demo.nyc.taxis -VALUES (1000371, 1.8, 15.32, 'N', 1), - (1000372, 2.5, 22.15, 'N', 2), - (1000373, 0.9, 9.01, 'N', 2), - (1000374, 8.4, 42.13, 'Y', 1); -``` - -### Reading Data from a Table - -```SQL -SELECT * -FROM demo.nyc.taxis; -``` - -### Verify that the data is stored in object storage - -When you added and used the external catalog, Starrocks started using MinIO as the object store for the `demo.nyc.taxis` -table. If you navigate to http://localhost:9001 and then navigate through the Object Browser menu to -`warehouse/nyc/taxis/` you can confirm that StarRocks is using MinIO for the storage. - -:::tip - -The username and password for MinIO are in the docker-compose.yml file. You will be prompted to change the password to something better, just ignore this advice for the tutorial. - -![img](../_assets/quick-start/MinIO-Iceberg-data.png) -::: - -### Next Steps - -#### Adding Iceberg to StarRocks - -If you already have a StarRocks 3.2.0, or later, environment, it comes with the Iceberg 1.6.0 included. No additional -downloads or jars are needed. - -#### Learn More - -Now that you're up and running with Iceberg and StarRocks, check out -the [StarRocks-Iceberg docs](../data_source/catalog/iceberg/iceberg_catalog.md) to learn more! diff --git a/docs/en/quick_start/quick_start.mdx b/docs/en/quick_start/quick_start.mdx deleted file mode 100644 index 6d2a3af..0000000 --- a/docs/en/quick_start/quick_start.mdx +++ /dev/null @@ -1,14 +0,0 @@ ---- -displayed_sidebar: docs -description: Deploy StarRocks with Docker Compose, Helm, and StarRocks Operator ---- - -# Quick Start - -These Quick Start guides will help you get going with a small StarRocks environment. The clusters that you will launch will be suitable for learning how StarRocks works, but are not meant for analyzing large datasets or performance testing. 
- -For scalable deployments please see [deploying StarRocks](../deployment/deployment_overview.md) - -import DocCardList from '@theme/DocCardList'; - - diff --git a/docs/en/quick_start/routine-load.md b/docs/en/quick_start/routine-load.md deleted file mode 100644 index 3a4b17f..0000000 --- a/docs/en/quick_start/routine-load.md +++ /dev/null @@ -1,748 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_position: 2 -toc_max_heading_level: 2 -description: Kafka routine load with shared-data storage ---- - -# Kafka routine load StarRocks using shared-data storage - -import Clients from '../_assets/quick-start/_clientsCompose.mdx' -import SQL from '../_assets/quick-start/_SQL.mdx' - -## About Routine Load - -Routine load is a method using Apache Kafka, or in this lab, Redpanda, to continuously stream data into StarRocks. The data is streamed into a Kafka topic, and a Routine Load job consumes the data into StarRocks. More details on Routine Load are provided at the end of the lab. - -## About shared-data - -In systems that separate storage from compute, data is stored in low-cost reliable remote storage systems such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and other S3-compatible storage like MinIO. Hot data is cached locally and when the cache is hit, the query performance is comparable to that of storage-compute coupled architecture. Compute nodes (CN) can be added or removed on demand within seconds. This architecture reduces storage costs, ensures better resource isolation, and provides elasticity and scalability. - -This tutorial covers: - -- Running StarRocks, Redpanda, and MinIO with Docker Compose -- Using MinIO as the StarRocks storage layer -- Configuring StarRocks for shared-data -- Adding a Routine Load job to consume data from Redpanda - -The data used is synthetic. - -There is a lot of information in this document, and it is presented with step-by-step content at the beginning, and the technical details at the end. This is done to serve these purposes in this order: - -1. Configure Routine Load. -2. Allow the reader to load data in a shared-data deployment and analyze that data. -3. Provide the configuration details for shared-data deployments. - ---- - -## Prerequisites - -### Docker - -- [Docker](https://docs.docker.com/engine/install/) -- 4 GB RAM assigned to Docker -- 10 GB free disk space assigned to Docker - -### SQL client - -You can use the SQL client provided in the Docker environment, or use one on your system. Many MySQL-compatible clients will work, and this guide covers the configuration of DBeaver and MySQL Workbench. - -### curl - -`curl` is used to download the Compose file and the script to generate the data. Check to see if you have it installed by running `curl` or `curl.exe` at your OS prompt. If curl is not installed, [get curl here](https://curl.se/dlwiz/?type=bin). - -### Python - -Python 3 and the Python client for Apache Kafka, `kafka-python`, are required. - -- [Python](https://www.python.org/) -- [`kafka-python`](https://pypi.org/project/kafka-python/) - ---- - -## Terminology - -### FE - -Frontend nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of metadata in its memory, which guarantees indiscriminate services among the FEs. - -### CN - -Compute Nodes are responsible for executing query plans in shared-data deployments. 
- -### BE - -Backend nodes are responsible for both data storage and executing query plans in shared-nothing deployments. - -:::note -This guide does not use BEs, this information is included here so that you understand the difference between BEs and CNs. -::: - ---- - -## Launch StarRocks - -To run StarRocks with shared-data using Object Storage you need: - -- A frontend engine (FE) -- A compute node (CN) -- Object Storage - -This guide uses MinIO, which is S3 compatible Object Storage provider. MinIO is provided under the GNU Affero General Public License. - -### Download the lab files - -#### `docker-compose.yml` - -```bash -mkdir routineload -cd routineload -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/routine-load-shared-data/docker-compose.yml -``` - -#### `gen.py` - -`gen.py` is a script that uses the Python client for Apache Kafka to publish (produce) data to a Kafka topic. The script has been written with the address and port of the Redpanda container. - -```bash -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/routine-load-shared-data/gen.py -``` - -## Start StarRocks, MinIO, and Redpanda - -```bash -docker compose up --detach --wait --wait-timeout 120 -``` - -Check the progress of the services. It should take 30 seconds or more for the containers to become healthy. The `routineload-minio_mc-1` container will not show a health indicator, and it will exit once it is done configuring MinIO with the access key that StarRocks will use. Wait for `routineload-minio_mc-1` to exit with a `0` code and the rest of the services to be `Healthy`. - -Run `docker compose ps` until the services are healthy: - -```bash -docker compose ps -``` - -```plaintext -WARN[0000] /Users/droscign/routineload/docker-compose.yml: `version` is obsolete -[+] Running 6/7 - ✔ Network routineload_default Crea... 0.0s - ✔ Container minio Healthy 5.6s - ✔ Container redpanda Healthy 3.6s - ✔ Container redpanda-console Healt... 1.1s - ⠧ Container routineload-minio_mc-1 Waiting 23.1s - ✔ Container starrocks-fe Healthy 11.1s - ✔ Container starrocks-cn Healthy 23.0s -container routineload-minio_mc-1 exited (0) -``` - ---- - -## Examine MinIO credentials - -In order to use MinIO for Object Storage with StarRocks, StarRocks needs a MinIO access key. The access key was generated during the startup of the Docker services. To help you better understand the way that StarRocks connects to MinIO you should verify that the key exists. - -### Open the MinIO web UI - -Browse to http://localhost:9001/access-keys The username and password are specified in the Docker compose file, and are `miniouser` and `miniopassword`. You should see that there is one access key. The Key is `AAAAAAAAAAAAAAAAAAAA`, you cannot see the secret in the MinIO Console, but it is in the Docker compose file and is `BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB`: - -![View the MinIO access key](../_assets/quick-start/MinIO-view-key.png) - ---- - -### Create a bucket for your data - -When you create a storage volume in StarRocks you will specify the `LOCATION` for the data: - -```sh - LOCATIONS = ("s3://my-starrocks-bucket/") -``` - -Open [http://localhost:9001/buckets](http://localhost:9001/buckets) and add a bucket for the storage volume. Name the bucket `my-starrocks-bucket`. Accept the defaults for the three listed options. - ---- - -## SQL Clients - - - ---- - -## StarRocks configuration for shared-data - -At this point you have StarRocks running, and you have MinIO running. 
The MinIO access key is used to connect StarRocks and Minio. - -This is the part of the `FE` configuration that specifies that the StarRocks deployment will use shared data. This was added to the file `fe.conf` when Docker Compose created the deployment. - -```sh -# enable the shared data run mode -run_mode = shared_data -cloud_native_storage_type = S3 -``` - -:::info -You can verify these settings by running this command from the `quickstart` directory and looking at the end of the file: -::: - -```sh -docker compose exec starrocks-fe \ - cat /opt/starrocks/fe/conf/fe.conf -``` -::: - -### Connect to StarRocks with a SQL client - -:::tip - -Run this command from the directory containing the `docker-compose.yml` file. - -If you are using a client other than the mysql CLI, open that now. -::: - -```sql -docker compose exec starrocks-fe \ -mysql -P9030 -h127.0.0.1 -uroot --prompt="StarRocks > " -``` - -#### Examine the storage volumes - - -```sql -SHOW STORAGE VOLUMES; -``` - -:::tip -There should be no storage volumes, you will create one next. -::: - -```sh -Empty set (0.04 sec) -``` - -#### Create a shared-data storage volume - -Earlier you created a bucket in MinIO named `my-starrocks-volume`, and you verified that MinIO has an access key named `AAAAAAAAAAAAAAAAAAAA`. The following SQL will create a storage volume in the MionIO bucket using the access key and secret. - -```sql -CREATE STORAGE VOLUME s3_volume - TYPE = S3 - LOCATIONS = ("s3://my-starrocks-bucket/") - PROPERTIES - ( - "enabled" = "true", - "aws.s3.endpoint" = "minio:9000", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - "aws.s3.use_instance_profile" = "false", - "aws.s3.use_aws_sdk_default_behavior" = "false" - ); -``` - -Now you should see a storage volume listed, earlier it was an empty set: - -``` -SHOW STORAGE VOLUMES; -``` - -``` -+----------------+ -| Storage Volume | -+----------------+ -| s3_volume | -+----------------+ -1 row in set (0.02 sec) -``` - -View the details of the storage volume and note that this is nott yet the default volume, and that it is configured to use your bucket: - -``` -DESC STORAGE VOLUME s3_volume\G -``` - -:::tip -Some of the SQL in this document, and many other documents in the StarRocks documentation, and with `\G` instead -of a semicolon. The `\G` causes the mysql CLI to render the query results vertically. - -Many SQL clients do not interpret vertical formatting output, so you should replace `\G` with `;`. -::: - -```sh -*************************** 1. row *************************** - Name: s3_volume - Type: S3 -# highlight-start -IsDefault: false - Location: s3://my-starrocks-bucket/ -# highlight-end - Params: {"aws.s3.access_key":"******","aws.s3.secret_key":"******","aws.s3.endpoint":"minio:9000","aws.s3.region":"us-east-1","aws.s3.use_instance_profile":"false","aws.s3.use_web_identity_token_file":"false","aws.s3.use_aws_sdk_default_behavior":"false"} - Enabled: true - Comment: -1 row in set (0.02 sec) -``` - -## Set the default storage volume - -``` -SET s3_volume AS DEFAULT STORAGE VOLUME; -``` - -``` -DESC STORAGE VOLUME s3_volume\G -``` - -```sh -*************************** 1. 
row *************************** - Name: s3_volume - Type: S3 -# highlight-next-line -IsDefault: true - Location: s3://my-starrocks-bucket/ - Params: {"aws.s3.access_key":"******","aws.s3.secret_key":"******","aws.s3.endpoint":"minio:9000","aws.s3.region":"us-east-1","aws.s3.use_instance_profile":"false","aws.s3.use_web_identity_token_file":"false","aws.s3.use_aws_sdk_default_behavior":"false"} - Enabled: true - Comment: -1 row in set (0.02 sec) -``` - ---- - -## Create a table - -These SQL commands are run in your SQL client. - -```SQL -CREATE DATABASE IF NOT EXISTS quickstart; -``` - -Verify that the database `quickstart` is using the storage volume `s3_volume`: - -``` -SHOW CREATE DATABASE quickstart \G -``` - -```sh -*************************** 1. row *************************** - Database: quickstart -Create Database: CREATE DATABASE `quickstart` -# highlight-next-line -PROPERTIES ("storage_volume" = "s3_volume") -``` - -```SQL -USE quickstart; -``` - -```SQL -CREATE TABLE site_clicks ( - `uid` bigint NOT NULL COMMENT "uid", - `site` string NOT NULL COMMENT "site url", - `vtime` bigint NOT NULL COMMENT "vtime" -) -DISTRIBUTED BY HASH(`uid`) -PROPERTIES("replication_num"="1"); -``` - ---- - -### Open the Redpanda Console - -There will be no topics yet, a topic will be created in the next step. - -http://localhost:8080/overview - -### Publish data to a Redpanda topic - -From a command shell in the `routineload/` folder run this command to generate data: - -```python -python gen.py 5 -``` - -:::tip - -On your system, you might need to use `python3` in place of `python` in the command. - -If you are missing `kafka-python` try: - -``` -pip install kafka-python -``` - or - -``` -pip3 install kafka-python -``` - -::: - -```plaintext -b'{ "uid": 6926, "site": "https://docs.starrocks.io/", "vtime": 1718034793 } ' -b'{ "uid": 3303, "site": "https://www.starrocks.io/product/community", "vtime": 1718034793 } ' -b'{ "uid": 227, "site": "https://docs.starrocks.io/", "vtime": 1718034243 } ' -b'{ "uid": 7273, "site": "https://docs.starrocks.io/", "vtime": 1718034794 } ' -b'{ "uid": 4666, "site": "https://www.starrocks.io/", "vtime": 1718034794 } ' -``` - -### Verify in the Redpanda Console - -Navigate to http://localhost:8080/topics in the Redpanda Console, and you will see one topic named `test2`. Select that topic and then the **Messages** tab and you will see five messages matching the output of `gen.py`. - -## Consume the messages - -In StarRocks you will create a Routine Load job to: - -1. Consume the messages from the Redpanda topic `test2` -2. Load those messages into the table `site_clicks` - -StarRocks is configured to use MinIO for storage, so the data inserted into the `site_clicks` table will be stored in MinIO. - -### Create a Routine Load job - -Run this command in the SQL client to create the Routine Load job, the command will be explained in detail at the end of the lab. - -```SQL -CREATE ROUTINE LOAD quickstart.clicks ON site_clicks -PROPERTIES -( - "format" = "JSON", - "jsonpaths" ="[\"$.uid\",\"$.site\",\"$.vtime\"]" -) -FROM KAFKA -( - "kafka_broker_list" = "redpanda:29092", - "kafka_topic" = "test2", - "kafka_partitions" = "0", - "kafka_offsets" = "OFFSET_BEGINNING" -); -``` - -### Verify the Routine Load job - -```SQL -SHOW ROUTINE LOAD\G -``` - -Verify the three highlighted lines: - -1. The state should be `RUNNING` -2. The topic should be `test2` and the broker should be `redpanda:2092` -3. 
The statistics should show either 0 or 5 loaded rows depending on how soon you ran the `SHOW ROUTINE LOAD` command. If there are 0 loaded rows run it again. - -```SQL -*************************** 1. row *************************** - Id: 10078 - Name: clicks - CreateTime: 2024-06-12 15:51:12 - PauseTime: NULL - EndTime: NULL - DbName: quickstart - TableName: site_clicks - -- highlight-next-line - State: RUNNING - DataSourceType: KAFKA - CurrentTaskNum: 1 - JobProperties: {"partitions":"*","partial_update":"false","columnToColumnExpr":"*","maxBatchIntervalS":"10","partial_update_mode":"null","whereExpr":"*","dataFormat":"json","timezone":"Etc/UTC","format":"json","log_rejected_record_num":"0","taskTimeoutSecond":"60","json_root":"","maxFilterRatio":"1.0","strict_mode":"false","jsonpaths":"[\"$.uid\",\"$.site\",\"$.vtime\"]","taskConsumeSecond":"15","desireTaskConcurrentNum":"5","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"1","maxBatchRows":"200000"} - -- highlight-next-line -DataSourceProperties: {"topic":"test2","currentKafkaPartitions":"0","brokerList":"redpanda:29092"} - CustomProperties: {"group.id":"clicks_ea38a713-5a0f-4abe-9b11-ff4a241ccbbd"} - -- highlight-next-line - Statistic: {"receivedBytes":0,"errorRows":0,"committedTaskNum":0,"loadedRows":0,"loadRowsRate":0,"abortedTaskNum":0,"totalRows":0,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":1} - Progress: {"0":"OFFSET_ZERO"} - TimestampProgress: {} -ReasonOfStateChanged: - ErrorLogUrls: - TrackingSQL: - OtherMsg: -LatestSourcePosition: {} -1 row in set (0.00 sec) -``` - -```SQL -SHOW ROUTINE LOAD\G -``` - -```SQL -*************************** 1. row *************************** - Id: 10076 - Name: clicks - CreateTime: 2024-06-12 18:40:53 - PauseTime: NULL - EndTime: NULL - DbName: quickstart - TableName: site_clicks - State: RUNNING - DataSourceType: KAFKA - CurrentTaskNum: 1 - JobProperties: {"partitions":"*","partial_update":"false","columnToColumnExpr":"*","maxBatchIntervalS":"10","partial_update_mode":"null","whereExpr":"*","dataFormat":"json","timezone":"Etc/UTC","format":"json","log_rejected_record_num":"0","taskTimeoutSecond":"60","json_root":"","maxFilterRatio":"1.0","strict_mode":"false","jsonpaths":"[\"$.uid\",\"$.site\",\"$.vtime\"]","taskConsumeSecond":"15","desireTaskConcurrentNum":"5","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"1","maxBatchRows":"200000"} -DataSourceProperties: {"topic":"test2","currentKafkaPartitions":"0","brokerList":"redpanda:29092"} - CustomProperties: {"group.id":"clicks_a9426fee-45bb-403a-a1a3-b3bc6c7aa685"} - -- highlight-next-line - Statistic: {"receivedBytes":372,"errorRows":0,"committedTaskNum":1,"loadedRows":5,"loadRowsRate":0,"abortedTaskNum":0,"totalRows":5,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":519} - Progress: {"0":"4"} - TimestampProgress: {"0":"1718217035111"} -ReasonOfStateChanged: - ErrorLogUrls: - TrackingSQL: - OtherMsg: - -- highlight-next-line -LatestSourcePosition: {"0":"5"} -1 row in set (0.00 sec) -``` - ---- - -## Verify that data is stored in MinIO - -Open MinIO [http://localhost:9001/browser/](http://localhost:9001/browser/) and verify that there are objects stored under `my-starrocks-bucket`. 
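If you prefer the command line to the web console, a sketch like the following may also work. It assumes the MinIO client (`mc`) is installed on your host, that Docker Compose publishes MinIO's API port `9000`, and it uses the `miniouser`/`miniopassword` credentials from the compose file (the alias name `local` is arbitrary):

```bash
# Sketch only: list the objects that StarRocks wrote to the storage volume.
# Assumes mc is installed locally and MinIO's API port 9000 is published to the host.
mc alias set local http://localhost:9000 miniouser miniopassword
mc ls --recursive local/my-starrocks-bucket | head
```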
- ---- - -## Query the data from StarRocks - -```SQL -USE quickstart; -SELECT * FROM site_clicks; -``` - -```SQL -+------+--------------------------------------------+------------+ -| uid | site | vtime | -+------+--------------------------------------------+------------+ -| 4607 | https://www.starrocks.io/blog | 1718031441 | -| 1575 | https://www.starrocks.io/ | 1718031523 | -| 2398 | https://docs.starrocks.io/ | 1718033630 | -| 3741 | https://www.starrocks.io/product/community | 1718030845 | -| 4792 | https://www.starrocks.io/ | 1718033413 | -+------+--------------------------------------------+------------+ -5 rows in set (0.07 sec) -``` - -## Publish additional data - -Running `gen.py` again will publish another five records to Redpanda. - -```bash -python gen.py 5 -``` - -### Verify that data is added - -Since the Routine Load job runs on a schedule (every 10 seconds by default), the data will be loaded within a few seconds. - -```SQL -SELECT * FROM site_clicks; -```` - -``` -+------+--------------------------------------------+------------+ -| uid | site | vtime | -+------+--------------------------------------------+------------+ -| 6648 | https://www.starrocks.io/blog | 1718205970 | -| 7914 | https://www.starrocks.io/ | 1718206760 | -| 9854 | https://www.starrocks.io/blog | 1718205676 | -| 1186 | https://www.starrocks.io/ | 1718209083 | -| 3305 | https://docs.starrocks.io/ | 1718209083 | -| 2288 | https://www.starrocks.io/blog | 1718206759 | -| 7879 | https://www.starrocks.io/product/community | 1718204280 | -| 2666 | https://www.starrocks.io/ | 1718208842 | -| 5801 | https://www.starrocks.io/ | 1718208783 | -| 8409 | https://www.starrocks.io/ | 1718206889 | -+------+--------------------------------------------+------------+ -10 rows in set (0.02 sec) -``` - ---- - -## Configuration details - -Now that you have experienced using StarRocks with shared-data it is important to understand the configuration. - -### CN configuration - -The CN configuration used here is the default, as the CN is designed for shared-data use. The default configuration is shown below. You do not need to make any changes. - -```bash -sys_log_level = INFO - -# ports for admin, web, heartbeat service -be_port = 9060 -be_http_port = 8040 -heartbeat_service_port = 9050 -brpc_port = 8060 -starlet_port = 9070 -``` - -### FE configuration - -The FE configuration is slightly different from the default as the FE must be configured to expect that data is stored in Object Storage rather than on local disks on BE nodes. - -The `docker-compose.yml` file generates the FE configuration in the `command`. - -```plaintext -# enable shared data, set storage type, set endpoint -run_mode = shared_data -cloud_native_storage_type = S3 -``` - -:::note -This config file does not contain the default entries for an FE, only the shared-data configuration is shown. -::: - -The non-default FE configuration settings: - -:::note -Many configuration parameters are prefixed with `s3_`. This prefix is used for all Amazon S3 compatible storage types (for example: S3, GCS, and MinIO). When using Azure Blob Storage the prefix is `azure_`. -::: - -#### `run_mode=shared_data` - -This enables shared-data use. - -#### `cloud_native_storage_type=S3` - -This specifies whether S3 compatible storage or Azure Blob Storage is used. For MinIO this is always S3. 
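Besides reading `fe.conf` inside the container, you may also be able to confirm these settings from your SQL client. For example, if the configuration item is exposed, a statement like this should report the current run mode:

```sql
-- Sketch: check the FE run mode from SQL (output format may vary by version)
ADMIN SHOW FRONTEND CONFIG LIKE 'run_mode';
```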
- -### Details of `CREATE storage volume` - -```sql -CREATE STORAGE VOLUME s3_volume - TYPE = S3 - LOCATIONS = ("s3://my-starrocks-bucket/") - PROPERTIES - ( - "enabled" = "true", - "aws.s3.endpoint" = "minio:9000", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - "aws.s3.use_instance_profile" = "false", - "aws.s3.use_aws_sdk_default_behavior" = "false" - ); -``` - -#### `aws_s3_endpoint=minio:9000` - -The MinIO endpoint, including port number. - -#### `aws_s3_path=starrocks` - -The bucket name. - -#### `aws_s3_access_key=AAAAAAAAAAAAAAAAAAAA` - -The MinIO access key. - -#### `aws_s3_secret_key=BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB` - -The MinIO access key secret. - -#### `aws_s3_use_instance_profile=false` - -When using MinIO an access key is used, and so instance profiles are not used with MinIO. - -#### `aws_s3_use_aws_sdk_default_behavior=false` - -When using MinIO this parameter is always set to false. - ---- - -## Notes on the Routine Load command - -StarRocks Routine Load takes many arguments. Only the ones used in this tutorial are described here, the rest will be linked to in the more information section. - -```SQL -CREATE ROUTINE LOAD quickstart.clicks ON site_clicks -PROPERTIES -( - "format" = "JSON", - "jsonpaths" ="[\"$.uid\",\"$.site\",\"$.vtime\"]" -) -FROM KAFKA -( - "kafka_broker_list" = "redpanda:29092", - "kafka_topic" = "test2", - "kafka_partitions" = "0", - "kafka_offsets" = "OFFSET_BEGINNING" -); -``` - -### Parameters - -``` -CREATE ROUTINE LOAD quickstart.clicks ON site_clicks -``` - -The parameters for `CREATE ROUTINE LOAD ON` are: -- database_name.job_name -- table_name - -`database_name` is optional. In this lab, it is `quickstart` and is specified. - -`job_name` is required, and is `clicks` - -`table_name` is required, and is `site_clicks` - -### Job properties - -#### Property `format` - -``` -"format" = "JSON", -``` - -In this case, the data is in JSON format, so the property is set to `JSON`. The other valid formats are: `CSV`, `JSON`, and `Avro`. `CSV` is the default. - -#### Property `jsonpaths` - -``` -"jsonpaths" ="[\"$.uid\",\"$.site\",\"$.vtime\"]" -``` - -The names of the fields that you want to load from JSON-formatted data. The value of this parameter is a valid JsonPath expression. More information is available at the end of this page. - - -### Data source properties - -#### `kafka_broker_list` - -``` -"kafka_broker_list" = "redpanda:29092", -``` - -Kafka's broker connection information. The format is `:`. Multiple brokers are separated by commas. - -#### `kafka_topic` - -``` -"kafka_topic" = "test2", -``` - -The Kafka topic to consume from. - -#### `kafka_partitions` and `kafka_offsets` - -``` -"kafka_partitions" = "0", -"kafka_offsets" = "OFFSET_BEGINNING" -``` - -These properties are presented together as there is one `kafka_offset` required for each `kafka_partitions` entry. - -`kafka_partitions` is a list of one or more partitions to consume. If this property is not set, then all partitions are consumed. - -`kafka_offsets` is a list of offsets, one for each partition listed in `kafka_partitions`. In this case the value is `OFFSET_BEGINNING` which causes all of the data to be consumed. The default is to only consume new data. 
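As a purely hypothetical illustration (the `test2` topic in this lab has a single partition), a job reading a three-partition topic could pair one offset with each listed partition. The topic name, job name, and offsets below are made up:

```SQL
-- Hypothetical sketch: one kafka_offsets entry per kafka_partitions entry
CREATE ROUTINE LOAD quickstart.clicks_multi ON site_clicks
PROPERTIES
(
    "format" = "JSON",
    "jsonpaths" ="[\"$.uid\",\"$.site\",\"$.vtime\"]"
)
FROM KAFKA
(
    "kafka_broker_list" = "redpanda:29092",
    "kafka_topic" = "test_multi",
    "kafka_partitions" = "0,1,2",
    "kafka_offsets" = "OFFSET_BEGINNING,OFFSET_BEGINNING,1000"
);
```

Here partitions `0` and `1` are consumed from the beginning, while partition `2` starts at offset `1000`.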
- ---- - -## Summary - -In this tutorial you: - -- Deployed StarRocks, Reedpanda, and Minio in Docker -- Created a Routine Load job to consume data from a Kafka topic -- Learned how to configure a StarRocks Storage Volume that uses MinIO - -## More information - -[StarRocks Architecture](../introduction/Architecture.md) - -The sample used for this lab is very simple. Routine Load has many more options and capabilities. [learn more](../loading/RoutineLoad.md). - -[JSONPath](https://goessner.net/articles/JsonPath/) diff --git a/docs/en/quick_start/shared-data.md b/docs/en/quick_start/shared-data.md deleted file mode 100644 index b386914..0000000 --- a/docs/en/quick_start/shared-data.md +++ /dev/null @@ -1,595 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_position: 2 -description: Separate compute and storage ---- - -# Separate storage and compute - -import DDL from '../_assets/quick-start/_DDL.mdx' -import Clients from '../_assets/quick-start/_clientsCompose.mdx' -import SQL from '../_assets/quick-start/_SQL.mdx' -import Curl from '../_assets/quick-start/_curl.mdx' - -In systems that separate storage from compute data is stored in low-cost reliable remote storage systems such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and other S3-compatible storage like MinIO. Hot data is cached locally and When the cache is hit, the query performance is comparable to that of storage-compute coupled architecture. Compute nodes (CN) can be added or removed on demand within seconds. This architecture reduces storage cost, ensures better resource isolation, and provides elasticity and scalability. - -This tutorial covers: - -- Running StarRocks in Docker containers -- Using MinIO for Object Storage -- Configuring StarRocks for shared-data -- Loading two public datasets -- Analyzing the data with SELECT and JOIN -- Basic data transformation (the **T** in ETL) - -The data used is provided by NYC OpenData and the National Centers for Environmental Information at NOAA. - -Both of these datasets are very large, and because this tutorial is intended to help you get exposed to working with StarRocks we are not going to load data for the past 120 years. You can run the Docker image and load this data on a machine with 4 GB RAM assigned to Docker. For larger fault-tolerant and scalable deployments we have other documentation and will provide that later. - -There is a lot of information in this document, and it is presented with the step by step content at the beginning, and the technical details at the end. This is done to serve these purposes in this order: - -1. Allow the reader to load data in a shared-data deployment and analyze that data. -2. Provide the configuration details for shared-data deployments. -3. Explain the basics of data transformation during loading. - ---- - -## Prerequisites - -### Docker - -- [Docker](https://docs.docker.com/engine/install/) -- 4 GB RAM assigned to Docker -- 10 GB free disk space assigned to Docker - -### SQL client - -You can use the SQL client provided in the Docker environment, or use one on your system. Many MySQL compatible clients will work, and this guide covers the configuration of DBeaver and MySQL Workbench. - -### curl - -`curl` is used to issue the data load job to StarRocks, and to download the datasets. Check to see if you have it installed by running `curl` or `curl.exe` at your OS prompt. If curl is not installed, [get curl here](https://curl.se/dlwiz/?type=bin). - -### `/etc/hosts` - -The ingest method used in this guide is Stream Load. 
Stream Load connects to the FE service to start the ingest job. The FE then assigns the job to a backend node, the CN in this guide. In order for the ingest job to connect to the CN the name of the CN must be available to your operating system. Add this line to `/etc/hosts`: - -```bash -127.0.0.1 starrocks-cn -``` - ---- - -## Terminology - -### FE - -Frontend nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of metadata in its memory, which guarantees indiscriminate services among the FEs. - -### CN - -Compute Nodes are responsible for executing query plans in shared-data deployments. - -### BE - -Backend nodes are responsible for both data storage and executing query plans in shared-nothing deployments. - -:::note -This guide does not use BEs, this information is included here so that you understand the difference between BEs and CNs. -::: - ---- - -## Edit your hosts file - -The ingest method used in this guide is Stream Load. Stream Load connects to the FE service to start the ingest job. The FE then assigns the job to a backend node—the CN in this guide. In order for the ingest job to connect to the CN, the name of the CN must be available to your operating system. Add this line to `/etc/hosts`: - -```bash -127.0.0.1 starrocks-cn -``` - -## Download the lab files - -There are three files to download: - -- The Docker Compose file that deploys the StarRocks and MinIO environment -- New York City crash data -- Weather data - -This guide uses MinIO, which is S3 compatible Object Storage provided under the GNU Affero General Public License. - -### Create a directory to store the lab files - -```bash -mkdir quickstart -cd quickstart -``` - -### Download the Docker Compose file - -```bash -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/docker-compose.yml -``` - -### Download the data - -Download these two datasets: - -#### New York City crash data - -```bash -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv -``` - -#### Weather data - -```bash -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/72505394728.csv -``` - ---- - -## Deploy StarRocks and MinIO - -```bash -docker compose up --detach --wait --wait-timeout 120 -``` - -It should take around 30 seconds for the FE, CN, and MinIO services to become healthy. The `quickstart-minio_mc-1` container will show a status of `Waiting` and also an exit code. An exit code of `0` indicates success. - -```bash -[+] Running 4/5 - ✔ Network quickstart_default Created 0.0s - ✔ Container minio Healthy 6.8s - ✔ Container starrocks-fe Healthy 29.3s - ⠼ Container quickstart-minio_mc-1 Waiting 29.3s - ✔ Container starrocks-cn Healthy 29.2s -container quickstart-minio_mc-1 exited (0) -``` - ---- - -## MinIO - -This quick start uses MinIO for shared storage. - -### Verify the MinIO credentials - -To use MinIO for Object Storage with StarRocks, StarRocks needs a MinIO access key. The access key was generated during the startup of the Docker services. To help you better understand the way that StarRocks connects to MinIO you should verify that the key exists. - -Browse to [http://localhost:9001/access-keys](http://localhost:9001/access-keys) The username and password are specified in the Docker compose file, and are `miniouser` and `miniopassword`. 
You should see that there is one access key. The Key is `AAAAAAAAAAAAAAAAAAAA`, you cannot see the secret in the MinIO Console, but it is in the Docker compose file and is `BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB`: - -![View the MinIO access key](../_assets/quick-start/MinIO-view-key.png) - -:::tip -If there are no access keys showing in the MinIO web UI, check the logs of the `minio_mc` service: - -```bash -docker compose logs minio_mc -``` - -Try rerunning the `minio_mc` pod: - -```bash -docker compose run minio_mc -``` -::: - -### Create a bucket for your data - -When you create a storage volume in StarRocks you will specify the `LOCATION` for the data: - -```sh - LOCATIONS = ("s3://my-starrocks-bucket/") -``` - -Open [http://localhost:9001/buckets](http://localhost:9001/buckets) and add a bucket for the storage volume. Name the bucket `my-starrocks-bucket`. Accept the defaults for the three listed options. - ---- - -## SQL Clients - - - ---- - - -## StarRocks configuration for shared-data - -At this point you have StarRocks running, and you have MinIO running. The MinIO access key is used to connect StarRocks and Minio. - -This is the part of the `FE` configuration that specifies that the StarRocks deployment will use shared data. This was added to the file `fe.conf` when Docker Compose created the deployment. - -```sh -# enable the shared data run mode -run_mode = shared_data -cloud_native_storage_type = S3 -``` - -:::info -You can verify these settings by running this command from the `quickstart` directory and looking at the end of the file: -::: - -```sh -docker compose exec starrocks-fe \ - cat /opt/starrocks/fe/conf/fe.conf -``` -::: - -### Connect to StarRocks with a SQL client - -:::tip - -Run this command from the directory containing the `docker-compose.yml` file. - -If you are using a client other than the MySQL Command-Line Client, open that now. -::: - -```sql -docker compose exec starrocks-fe \ -mysql -P9030 -h127.0.0.1 -uroot --prompt="StarRocks > " -``` - -#### Examine the storage volumes - - -```sql -SHOW STORAGE VOLUMES; -``` - -:::tip -There should be no storage volumes, you will create one next. -::: - -```sh -Empty set (0.04 sec) -``` - -#### Create a shared-data storage volume - -Earlier you created a bucket in MinIO named `my-starrocks-volume`, and you verified that MinIO has an access key named `AAAAAAAAAAAAAAAAAAAA`. The following SQL will create a storage volume in the MionIO bucket using the access key and secret. - -```sql -CREATE STORAGE VOLUME s3_volume - TYPE = S3 - LOCATIONS = ("s3://my-starrocks-bucket/") - PROPERTIES - ( - "enabled" = "true", - "aws.s3.endpoint" = "minio:9000", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - "aws.s3.use_instance_profile" = "false", - "aws.s3.use_aws_sdk_default_behavior" = "false" - ); -``` - -Now you should see a storage volume listed, earlier it was an empty set: - -``` -SHOW STORAGE VOLUMES; -``` - -``` -+----------------+ -| Storage Volume | -+----------------+ -| s3_volume | -+----------------+ -1 row in set (0.02 sec) -``` - -View the details of the storage volume and note that this is nott yet the default volume, and that it is configured to use your bucket: - -``` -DESC STORAGE VOLUME s3_volume\G -``` - -:::tip -Some of the SQL in this document, and many other documents in the StarRocks documentation, and with `\G` instead -of a semicolon. The `\G` causes the mysql CLI to render the query results vertically. 
- -Many SQL clients do not interpret vertical formatting output, so you should replace `\G` with `;`. -::: - -```sh -*************************** 1. row *************************** - Name: s3_volume - Type: S3 -# highlight-start -IsDefault: false - Location: s3://my-starrocks-bucket/ -# highlight-end - Params: {"aws.s3.access_key":"******","aws.s3.secret_key":"******","aws.s3.endpoint":"minio:9000","aws.s3.region":"us-east-1","aws.s3.use_instance_profile":"false","aws.s3.use_web_identity_token_file":"false","aws.s3.use_aws_sdk_default_behavior":"false"} - Enabled: true - Comment: -1 row in set (0.02 sec) -``` - -## Set the default storage volume - -``` -SET s3_volume AS DEFAULT STORAGE VOLUME; -``` - -``` -DESC STORAGE VOLUME s3_volume\G -``` - -```sh -*************************** 1. row *************************** - Name: s3_volume - Type: S3 -# highlight-next-line -IsDefault: true - Location: s3://my-starrocks-bucket/ - Params: {"aws.s3.access_key":"******","aws.s3.secret_key":"******","aws.s3.endpoint":"minio:9000","aws.s3.region":"us-east-1","aws.s3.use_instance_profile":"false","aws.s3.use_web_identity_token_file":"false","aws.s3.use_aws_sdk_default_behavior":"false"} - Enabled: true - Comment: -1 row in set (0.02 sec) -``` - -## Create a database - -``` -CREATE DATABASE IF NOT EXISTS quickstart; -``` - -Verify that the database `quickstart` is using the storage volume `s3_volume`: - -``` -SHOW CREATE DATABASE quickstart \G -``` - -```sh -*************************** 1. row *************************** - Database: quickstart -Create Database: CREATE DATABASE `quickstart` -# highlight-next-line -PROPERTIES ("storage_volume" = "s3_volume") -``` - ---- - -## Create some tables - - - ---- - -## Load two datasets - -There are many ways to load data into StarRocks. For this tutorial the simplest way is to use curl and StarRocks Stream Load. - -:::tip - -Run these curl commands from the directory where you downloaded the dataset. - -You will be prompted for a password. You probably have not assigned a password to the MySQL `root` user, so just hit enter. - -::: - -The `curl` commands look complex, but they are explained in detail at the end of the tutorial. For now, we recommend running the commands and running some SQL to analyze the data, and then reading about the data loading details at the end. - -### New York City collision data - Crashes - -```bash -curl --location-trusted -u root \ - -T ./NYPD_Crash_Data.csv \ - -H "label:crashdata-0" \ - -H "column_separator:," \ - -H "skip_header:1" \ - -H "enclose:\"" \ - -H "max_filter_ratio:1" \ - -H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \ - -XPUT http://localhost:8030/api/quickstart/crashdata/_stream_load -``` - -Here is the output of the above command. The first highlighted section shown what you should expect to see (OK and all but one row inserted). 
One row was filtered out because it does not contain the correct number of columns. - -```bash -Enter host password for user 'root': -{ - "TxnId": 2, - "Label": "crashdata-0", - "Status": "Success", - # highlight-start - "Message": "OK", - "NumberTotalRows": 423726, - "NumberLoadedRows": 423725, - # highlight-end - "NumberFilteredRows": 1, - "NumberUnselectedRows": 0, - "LoadBytes": 96227746, - "LoadTimeMs": 1013, - "BeginTxnTimeMs": 21, - "StreamLoadPlanTimeMs": 63, - "ReadDataTimeMs": 563, - "WriteDataTimeMs": 870, - "CommitAndPublishTimeMs": 57, - # highlight-start - "ErrorURL": "http://starrocks-cn:8040/api/_load_error_log?file=error_log_da41dd88276a7bfc_739087c94262ae9f" - # highlight-end -}% -``` - -If there was an error the output provides a URL to see the error messages. The error message also contains the backend node that the Stream Load job was assigned to (`starrocks-cn`). Because you added an entry for `starrocks-cn` to the `/etc/hosts` file, you should be able to navigate to it and read the error message. - -Expand the summary for the content seen while developing this tutorial: - -
- -Reading error messages in the browser - -```bash -Error: Value count does not match column count. Expect 29, but got 32. - -Column delimiter: 44,Row delimiter: 10.. Row: 09/06/2015,14:15,,,40.6722269,-74.0110059,"(40.6722269, -74.0110059)",,,"R/O 1 BEARD ST. ( IKEA'S -09/14/2015,5:30,BRONX,10473,40.814551,-73.8490955,"(40.814551, -73.8490955)",TORRY AVENUE ,NORTON AVENUE ,,0,0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,3297457,PASSENGER VEHICLE,PASSENGER VEHICLE,,, -``` - -
- -### Weather data - -Load the weather dataset in the same manner as you loaded the crash data. - -```bash -curl --location-trusted -u root \ - -T ./72505394728.csv \ - -H "label:weather-0" \ - -H "column_separator:," \ - -H "skip_header:1" \ - -H "enclose:\"" \ - -H "max_filter_ratio:1" \ - -H "columns: STATION, DATE, LATITUDE, LONGITUDE, ELEVATION, NAME, REPORT_TYPE, SOURCE, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySkyConditions, HourlySeaLevelPressure, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, Sunrise, Sunset, DailyAverageDewPointTemperature, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageSeaLevelPressure, DailyAverageStationPressure, DailyAverageWetBulbTemperature, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, DailySustainedWindSpeed, DailyWeather, MonthlyAverageRH, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyDewpointTemperature, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyGreatestSnowDepth, MonthlyGreatestSnowDepthDate, MonthlyGreatestSnowfall, MonthlyGreatestSnowfallDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, MonthlyTotalSnowfall, MonthlyWetBulb, AWND, CDSD, CLDD, DSNW, HDSD, HTDD, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, REM, BackupDirection, BackupDistance, BackupDistanceUnit, BackupElements, BackupElevation, BackupEquipment, BackupLatitude, BackupLongitude, BackupName, WindEquipmentChangeDate" \ - -XPUT http://localhost:8030/api/quickstart/weatherdata/_stream_load -``` - ---- - -## Verify that data is stored in MinIO - -Open MinIO 
[http://localhost:9001/browser/my-starrocks-bucket](http://localhost:9001/browser/my-starrocks-bucket) and verify that you have entries below `my-starrocks-bucket/` - -:::tip -The folder names below `my-starrocks-bucket/` are generated when you load the data. You should see a single directory below `my-starrocks-bucket`, and then two more below that. In those directories you will find the data, metadata, or schema entries. - -![MinIO object browser](../_assets/quick-start/MinIO-data.png) -::: - ---- - -## Answer some questions - - - ---- - -## Configuring StarRocks for shared-data - -Now that you have experienced using StarRocks with shared-data it is important to understand the configuration. - -### CN configuration - -The CN configuration used here is the default, as the CN is designed for shared-data use. The default configuration is shown below. You do not need to make any changes. - -```bash -sys_log_level = INFO - -# ports for admin, web, heartbeat service -be_port = 9060 -be_http_port = 8040 -heartbeat_service_port = 9050 -brpc_port = 8060 -starlet_port = 9070 -``` - -### FE configuration - -The FE configuration is slightly different from the default as the FE must be configured to expect that data is stored in Object Storage rather than on local disks on BE nodes. - -The `docker-compose.yml` file generates the FE configuration in the `command`. - -```plaintext -# enable shared data, set storage type, set endpoint -run_mode = shared_data -cloud_native_storage_type = S3 -``` - -:::note -This config file does not contain the default entries for an FE, only the shared-data configuration is shown. -::: - -The non-default FE configuration settings: - -:::note -Many configuration parameters are prefixed with `s3_`. This prefix is used for all Amazon S3 compatible storage types (for example: S3, GCS, and MinIO). When using Azure Blob Storage the prefix is `azure_`. -::: - -#### `run_mode=shared_data` - -This enables shared-data use. - -#### `cloud_native_storage_type=S3` - -This specifies whether S3 compatible storage or Azure Blob Storage is used. For MinIO this is always S3. - -### Details of `CREATE storage volume` - -```sql -CREATE STORAGE VOLUME s3_volume - TYPE = S3 - LOCATIONS = ("s3://my-starrocks-bucket/") - PROPERTIES - ( - "enabled" = "true", - "aws.s3.endpoint" = "minio:9000", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - "aws.s3.use_instance_profile" = "false", - "aws.s3.use_aws_sdk_default_behavior" = "false" - ); -``` - -#### `aws_s3_endpoint=minio:9000` - -The MinIO endpoint, including port number. - -#### `aws_s3_path=starrocks` - -The bucket name. - -#### `aws_s3_access_key=AAAAAAAAAAAAAAAAAAAA` - -The MinIO access key. - -#### `aws_s3_secret_key=BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB` - -The MinIO access key secret. - -#### `aws_s3_use_instance_profile=false` - -When using MinIO an access key is used, and so instance profiles are not used with MinIO. - -#### `aws_s3_use_aws_sdk_default_behavior=false` - -When using MinIO this parameter is always set to false. - -### Configuring FQDN mode - -The command to start the FE is also changed. The FE service command in the Docker Compose file has the option `--host_type FQDN` added. By setting `host_type` to `FQDN` the Stream Load job is forwarded to the fully qualified domain name of the CN pod, rather than the IP address. 
This is done because the IP address is in a range assigned to the Docker environment, and is not typically available from the host machine. - -These three changes allow traffic between the host network and the CN: - -- setting `--host_type` to `FQDN` -- exposing the CN port 8040 to the host network -- adding an entry to the hosts file for `starrocks-cn` pointing to `127.0.0.1` - ---- - -## Summary - -In this tutorial you: - -- Deployed StarRocks and Minio in Docker -- Created a MinIO access key -- Configured a StarRocks Storage Volume that uses MinIO -- Loaded crash data provided by New York City and weather data provided by NOAA -- Analyzed the data using SQL JOINs to find out that driving in low visibility or icy streets is a bad idea - -There is more to learn; we intentionally glossed over the data transform done during the Stream Load. The details on that are in the notes on the curl commands below. - -## Notes on the curl commands - - - -## More information - -[StarRocks table design](../table_design/StarRocks_table_design.md) - -[Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) - -The [Motor Vehicle Collisions - Crashes](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset is provided by New York City subject to these [terms of use](https://www.nyc.gov/home/terms-of-use.page) and [privacy policy](https://www.nyc.gov/home/privacy-policy.page). - -The [Local Climatological Data](https://www.ncdc.noaa.gov/cdo-web/datatools/lcd)(LCD) is provided by NOAA with this [disclaimer](https://www.noaa.gov/disclaimer) and this [privacy policy](https://www.noaa.gov/protecting-your-privacy). diff --git a/docs/en/quick_start/shared-nothing.md b/docs/en/quick_start/shared-nothing.md deleted file mode 100644 index 5cb9486..0000000 --- a/docs/en/quick_start/shared-nothing.md +++ /dev/null @@ -1,246 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_position: 1 -description: "StarRocks in Docker: Query real data with JOINs" ---- -import DDL from '../_assets/quick-start/_DDL.mdx' -import Clients from '../_assets/quick-start/_clientsAllin1.mdx' -import SQL from '../_assets/quick-start/_SQL.mdx' -import Curl from '../_assets/quick-start/_curl.mdx' - -# Deploy StarRocks with Docker - -This tutorial covers: - -- Running StarRocks in a single Docker container -- Loading two public datasets including basic transformation of the data -- Analyzing the data with SELECT and JOIN -- Basic data transformation (the **T** in ETL) - -## Follow along with the video if you prefer - - - -The data used is provided by NYC OpenData and the National Centers for Environmental Information. - -Both of these datasets are very large, and because this tutorial is intended to help you get exposed to working with StarRocks we are not going to load data for the past 120 years. You can run the Docker image and load this data on a machine with 4 GB RAM assigned to Docker. For larger fault-tolerant and scalable deployments we have other documentation and will provide that later. - -There is a lot of information in this document, and it is presented with the step by step content at the beginning, and the technical details at the end. This is done to serve these purposes in this order: - -1. Allow the reader to load data in StarRocks and analyze that data. -2. Explain the basics of data transformation during loading. 
- ---- - -## Prerequisites - -### Docker - -- [Docker](https://docs.docker.com/engine/install/) -- 4 GB RAM assigned to Docker -- 10 GB free disk space assigned to Docker - -### SQL client - -You can use the SQL client provided in the Docker environment, or use one on your system. Many MySQL compatible clients will work, and this guide covers the configuration of DBeaver and MySQL Workbench. - -### curl - -`curl` is used to issue the data load job to StarRocks, and to download the datasets. Check to see if you have it installed by running `curl` or `curl.exe` at your OS prompt. If curl is not installed, [get curl here](https://curl.se/dlwiz/?type=bin). - ---- - -## Terminology - -### FE - -Frontend nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of metadata in its memory, which guarantees indiscriminate services among the FEs. - -### BE - -Backend nodes are responsible for both data storage and executing query plans. - ---- - -## Launch StarRocks - -```bash -docker pull starrocks/allin1-ubuntu -docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd \ ---name quickstart starrocks/allin1-ubuntu -``` - ---- - -## SQL clients - - - ---- - -## Download the data - -Download these two datasets to your machine. You can download them to the host machine where you are running Docker, they do not need to be downloaded inside the container. - -### New York City crash data - -```bash -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv -``` - -### Weather data - -```bash -curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/72505394728.csv -``` - ---- - -### Connect to StarRocks with a SQL client - -:::tip - -If you are using a client other than the mysql CLI, open that now. -::: - -This command will run the `mysql` command in the Docker container: - -```sql -docker exec -it quickstart \ -mysql -P 9030 -h 127.0.0.1 -u root --prompt="StarRocks > " -``` - ---- - -## Create some tables - - - ---- - -## Load two datasets - -There are many ways to load data into StarRocks. For this tutorial the simplest way is to use curl and StarRocks Stream Load. - -:::tip -Open a new shell as these curl commands are run at the operating system prompt, not in the `mysql` client. The commands refer to the datasets that you downloaded, so run them from the directory where you downloaded the files. - -You will be prompted for a password. You probably have not assigned a password to the MySQL `root` user, so just hit enter. -::: - -The `curl` commands look complex, but they are explained in detail at the end of the tutorial. For now, we recommend running the commands and running some SQL to analyze the data, and then reading about the data loading details at the end. 
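Optionally, before running the loads you can do a quick sanity check from the SQL client to confirm that the `quickstart` database and the two target tables created by the DDL above exist; the names used here (`crashdata` and `weatherdata`) are simply the ones referenced by the Stream Load URLs below:

```sql
-- Run in the mysql client before issuing the Stream Load jobs:
-- confirm the target database and tables are in place.
SHOW DATABASES;
USE quickstart;
SHOW TABLES;
DESCRIBE crashdata;
DESCRIBE weatherdata;
```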
- -### New York City collision data - Crashes - -```bash -curl --location-trusted -u root \ - -T ./NYPD_Crash_Data.csv \ - -H "label:crashdata-0" \ - -H "column_separator:," \ - -H "skip_header:1" \ - -H "enclose:\"" \ - -H "max_filter_ratio:1" \ - -H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \ - -XPUT http://localhost:8030/api/quickstart/crashdata/_stream_load -``` - -Here is the output of the preceding command. The first highlighted section shows what you should expect to see (OK and all but one row inserted). One row was filtered out because it does not contain the correct number of columns. - -```bash -Enter host password for user 'root': -{ - "TxnId": 2, - "Label": "crashdata-0", - "Status": "Success", - # highlight-start - "Message": "OK", - "NumberTotalRows": 423726, - "NumberLoadedRows": 423725, - # highlight-end - "NumberFilteredRows": 1, - "NumberUnselectedRows": 0, - "LoadBytes": 96227746, - "LoadTimeMs": 1013, - "BeginTxnTimeMs": 21, - "StreamLoadPlanTimeMs": 63, - "ReadDataTimeMs": 563, - "WriteDataTimeMs": 870, - "CommitAndPublishTimeMs": 57, - # highlight-start - "ErrorURL": "http://127.0.0.1:8040/api/_load_error_log?file=error_log_da41dd88276a7bfc_739087c94262ae9f" - # highlight-end -}% -``` - -If there was an error the output provides a URL to see the error messages. Open this in a browser to find out what happened. Expand the detail to see a sample error message: - -
- -Reading error messages in the browser - -```bash -Error: Target column count: 29 doesn't match source value column count: 32. Column separator: ',', Row delimiter: '\n'. Row: 09/06/2015,14:15,,,40.6722269,-74.0110059,"(40.6722269, -74.0110059)",,,"R/O 1 BEARD ST. ( IKEA'S -09/14/2015,5:30,BRONX,10473,40.814551,-73.8490955,"(40.814551, -73.8490955)",TORRY AVENUE ,NORTON AVENUE ,,0,0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,3297457,PASSENGER VEHICLE,PASSENGER VEHICLE,,, -``` - -
- -### Weather data - -Load the weather dataset in the same manner as you loaded the crash data. - -```bash -curl --location-trusted -u root \ - -T ./72505394728.csv \ - -H "label:weather-0" \ - -H "column_separator:," \ - -H "skip_header:1" \ - -H "enclose:\"" \ - -H "max_filter_ratio:1" \ - -H "columns: STATION, DATE, LATITUDE, LONGITUDE, ELEVATION, NAME, REPORT_TYPE, SOURCE, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySkyConditions, HourlySeaLevelPressure, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, Sunrise, Sunset, DailyAverageDewPointTemperature, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageSeaLevelPressure, DailyAverageStationPressure, DailyAverageWetBulbTemperature, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, DailySustainedWindSpeed, DailyWeather, MonthlyAverageRH, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyDewpointTemperature, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyGreatestSnowDepth, MonthlyGreatestSnowDepthDate, MonthlyGreatestSnowfall, MonthlyGreatestSnowfallDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, MonthlyTotalSnowfall, MonthlyWetBulb, AWND, CDSD, CLDD, DSNW, HDSD, HTDD, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, REM, BackupDirection, BackupDistance, BackupDistanceUnit, BackupElements, BackupElevation, BackupEquipment, BackupLatitude, BackupLongitude, BackupName, WindEquipmentChangeDate" \ - -XPUT http://localhost:8030/api/quickstart/weatherdata/_stream_load -``` - ---- - -## Answer some questions - - - ---- - -## Summary - -In this tutorial you: - -- Deployed StarRocks in Docker -- Loaded crash data provided 
by New York City and weather data provided by NOAA -- Analyzed the data using SQL JOINs to find out that driving in low visibility or icy streets is a bad idea - -There is more to learn; we intentionally glossed over the data transformation done during the Stream Load. The details on that are in the notes on the curl commands below. - ---- - -## Notes on the curl commands - - - ---- - -## More information - -[StarRocks table design](../table_design/StarRocks_table_design.md) - -[Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) - -The [Motor Vehicle Collisions - Crashes](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset is provided by New York City subject to these [terms of use](https://www.nyc.gov/home/terms-of-use.page) and [privacy policy](https://www.nyc.gov/home/privacy-policy.page). - -The [Local Climatological Data](https://www.ncdc.noaa.gov/cdo-web/datatools/lcd)(LCD) is provided by NOAA with this [disclaimer](https://www.noaa.gov/disclaimer) and this [privacy policy](https://www.noaa.gov/protecting-your-privacy). diff --git a/docs/en/sql-reference/sql-functions/dict-functions/dict_mapping.md b/docs/en/sql-reference/sql-functions/dict-functions/dict_mapping.md index 58e409e..ebd0f2c 100644 --- a/docs/en/sql-reference/sql-functions/dict-functions/dict_mapping.md +++ b/docs/en/sql-reference/sql-functions/dict-functions/dict_mapping.md @@ -4,7 +4,7 @@ displayed_sidebar: docs # dict_mapping -Returns the value mapped to the specified key in a dictionary table. +Returns the dictionary table value mapped to the specified key. This function is mainly used to simplify the application of a global dictionary table. During data loading into a target table, StarRocks automatically obtains the value mapped to the specified key from the dictionary table by using the input parameters in this function, and then loads the value into the target table. 
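As a minimal sketch of the pattern this enables (the table and column names below are illustrative only, not part of this reference): a dictionary table maps a string key to an auto-generated integer ID, and the target table declares a generated column that calls `dict_mapping`, so each load fills in the mapped value automatically.

```sql
-- Hypothetical dictionary table: maps each order_uuid key to an auto-generated integer ID.
CREATE TABLE dict (
    order_uuid   STRING NOT NULL,
    order_id_int BIGINT NOT NULL AUTO_INCREMENT
)
PRIMARY KEY (order_uuid)
DISTRIBUTED BY HASH (order_uuid);

-- Hypothetical target table: the generated column calls dict_mapping, so rows loaded
-- into dest_table automatically pick up order_id_int from the dictionary table.
CREATE TABLE dest_table (
    id           BIGINT NOT NULL,
    order_uuid   STRING,
    order_id_int BIGINT AS dict_mapping('dict', order_uuid)
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id);
```

With tables like these, a load job into `dest_table` only needs to supply `id` and `order_uuid`; the integer ID is looked up from the dictionary table at load time.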
diff --git a/docs/zh/introduction/Architecture.md b/docs/zh/introduction/Architecture.md deleted file mode 100644 index 14b7899..0000000 --- a/docs/zh/introduction/Architecture.md +++ /dev/null @@ -1,83 +0,0 @@ ---- -displayed_sidebar: docs ---- -import QSOverview from '../_assets/commonMarkdown/quickstart-overview-tip.mdx' - -# 架构 - -StarRocks 具有简单的架构。整个系统仅由两种类型的组件组成:前端和后端。前端节点称为 **FE**。后端节点有两种类型,**BE** 和 **CN** (计算节点)。当使用数据的本地存储时,将部署 BE;当数据存储在对象存储或 HDFS 上时,将部署 CN。StarRocks 不依赖于任何外部组件,从而简化了部署和维护。节点可以水平扩展,而不会导致服务中断。此外,StarRocks 具有元数据和服务数据的副本机制,从而提高了数据可靠性,并有效地防止了单点故障 (SPOF)。 - -StarRocks 兼容 MySQL 协议,并支持标准 SQL。用户可以轻松地从 MySQL 客户端连接到 StarRocks,以获得即时且有价值的见解。 - -## 架构选择 - -StarRocks 支持 shared-nothing (每个 BE 在其本地存储上都拥有一部分数据) 和 shared-data (所有数据都在对象存储或 HDFS 上,每个 CN 仅在本地存储上具有缓存)。您可以根据自己的需求决定数据的存储位置。 - -![Architecture choices](../_assets/architecture_choices.png) - -### Shared-nothing - -本地存储为实时查询提供了更高的查询延迟。 - -作为一种典型的海量并行处理 (MPP) 数据库,StarRocks 支持 shared-nothing 架构。在此架构中,BE 负责数据存储和计算。直接访问 BE 模式下的本地数据可以进行本地计算,避免了数据传输和数据复制,并提供了超快的查询和分析性能。此架构支持多副本数据存储,从而增强了集群处理高并发查询的能力并确保数据可靠性。它非常适合追求最佳查询性能的场景。 - -![shared-data-arch](../_assets/shared-nothing.png) - -#### 节点 - -在 shared-nothing 架构中,StarRocks 由两种类型的节点组成:FE 和 BE。 - -- FE 负责元数据管理和构建执行计划。 -- BE 执行查询计划并存储数据。BE 利用本地存储来加速查询,并利用多副本机制来确保高数据可用性。 - -##### FE - -FE 负责元数据管理、客户端连接管理、查询规划和查询调度。每个 FE 使用 BDB JE (Berkeley DB Java Edition) 来存储和维护其内存中元数据的完整副本,从而确保所有 FE 之间的一致服务。FE 可以充当 leader、follower 和 observer。如果 leader 节点崩溃,则 follower 基于 Raft 协议选举 leader。 - -| **FE 角色** | **元数据管理** | **Leader 选举** | -| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ---------------------------------- | -| Leader | Leader FE 读取和写入元数据。Follower 和 observer FE 只能读取元数据。它们将元数据写入请求路由到 leader FE。Leader FE 更新元数据,然后使用 Raft 协议将元数据更改同步到 follower 和 observer FE。仅当元数据更改同步到超过一半的 follower FE 后,数据写入才被视为成功。 | 从技术上讲,leader FE 也是一个 follower 节点,并且是从 follower FE 中选出的。要执行 leader 选举,集群中必须有超过一半的 follower FE 处于活动状态。当 leader FE 发生故障时,follower FE 将启动另一轮 leader 选举。 | -| Follower | Follower 只能读取元数据。它们从 leader FE 同步和重放日志以更新元数据。 | Follower 参与 leader 选举,这要求集群中超过一半的 follower 处于活动状态。 | -| Observer | Observer 从 leader FE 同步和重放日志以更新元数据。 | Observer 主要用于增加集群的查询并发性。Observer 不参与 leader 选举,因此不会给集群增加 leader 选择压力。| - -##### BE - -BE 负责数据存储和 SQL 执行。 - -- 数据存储:BE 具有等效的数据存储能力。FE 根据预定义的规则将数据分发到 BE。BE 转换摄取的数据,将数据写入所需的格式,并为数据生成索引。 - -- SQL 执行:FE 根据查询的语义将每个 SQL 查询解析为逻辑执行计划,然后将逻辑计划转换为可以在 BE 上执行的物理执行计划。存储目标数据的 BE 执行查询。这样就无需数据传输和复制,从而实现高查询性能。 - -### Shared-data - -对象存储和 HDFS 提供了成本、可靠性和可扩展性优势。除了存储的可扩展性之外,由于存储和计算是分开的,因此可以添加和删除 CN 节点,而无需重新平衡数据。 - -在 shared-data 架构中,BE 被“计算节点 (CN)”取代,后者仅负责数据计算任务和缓存热数据。数据存储在低成本且可靠的远程存储系统中,例如 Amazon S3、Google Cloud Storage、Azure Blob Storage、MinIO 等。当缓存命中时,查询性能与 shared-nothing 架构的查询性能相当。可以根据需要在几秒钟内添加或删除 CN 节点。此架构降低了存储成本,确保了更好的资源隔离以及高弹性和可扩展性。 - -shared-data 架构与其 shared-nothing 架构一样,保持了简单的架构。它仅由两种类型的节点组成:FE 和 CN。唯一的区别是用户必须配置后端对象存储。 - -![shared-data-arch](../_assets/shared-data.png) - -#### 节点 - -shared-data 架构中的 FE 提供与 shared-nothing 架构中相同的功能。 - -BE 被 CN (计算节点) 取代,并且存储功能被卸载到对象存储或 HDFS。CN 是无状态计算节点,可执行 BE 的所有功能,但存储数据除外。 - -#### 存储 - -StarRocks shared-data 集群支持两种存储解决方案:对象存储 (例如,AWS S3、Google GCS、Azure Blob Storage 或 MinIO) 和 HDFS。 - -在 shared-data 
集群中,数据文件格式与 shared-nothing 集群 (具有耦合的存储和计算) 的数据文件格式保持一致。数据被组织成 Segment 文件,并且各种索引技术在云原生表中被重用,云原生表是专门在 shared-data 集群中使用的表。 - -#### 缓存 - -StarRocks shared-data 集群将数据存储和计算分离,从而允许每个组件独立扩展,从而降低了成本并提高了弹性。但是,此架构可能会影响查询性能。 - -为了减轻这种影响,StarRocks 建立了一个多层数据访问系统,包括内存、本地磁盘和远程存储,以更好地满足各种业务需求。 - -针对热数据的查询直接扫描缓存,然后扫描本地磁盘,而冷数据需要从对象存储加载到本地缓存中,以加速后续查询。通过使热数据靠近计算单元,StarRocks 实现了真正的高性能计算和经济高效的存储。此外,通过数据预取策略优化了对冷数据的访问,从而有效地消除了查询的性能限制。 - -创建表时可以启用缓存。如果启用了缓存,则数据将被写入本地磁盘和后端对象存储。在查询期间,CN 节点首先从本地磁盘读取数据。如果未找到数据,则将从后端对象存储中检索数据,并同时缓存在本地磁盘上。 - - \ No newline at end of file diff --git a/docs/zh/loading/BrokerLoad.md b/docs/zh/loading/BrokerLoad.md deleted file mode 100644 index 0cab179..0000000 --- a/docs/zh/loading/BrokerLoad.md +++ /dev/null @@ -1,433 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 从 HDFS 或云存储加载数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 提供了基于 MySQL 的 Broker Load 导入方法,可帮助您将大量数据从 HDFS 或云存储导入到 StarRocks 中。 - -Broker Load 在异步导入模式下运行。提交导入作业后,StarRocks 会异步运行该作业。您需要使用 [SHOW LOAD](../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md) 语句或 `curl` 命令来检查作业结果。 - -Broker Load 支持单表导入和多表导入。您可以通过运行一个 Broker Load 作业将一个或多个数据文件导入到一个或多个目标表中。Broker Load 确保每个运行的导入作业在导入多个数据文件时的事务原子性。原子性意味着一个导入作业中多个数据文件的导入必须全部成功或全部失败。不会发生某些数据文件导入成功而其他文件导入失败的情况。 - -Broker Load 支持在数据导入时进行数据转换,并支持在数据导入期间通过 UPSERT 和 DELETE 操作进行数据更改。有关详细信息,请参见 [在导入时转换数据](../loading/Etl_in_loading.md) 和 [通过导入更改数据](../loading/Load_to_Primary_Key_tables.md)。 - - - -## 背景信息 - -在 v2.4 及更早版本中,StarRocks 依赖 Broker 在 StarRocks 集群和外部存储系统之间建立连接,以运行 Broker Load 作业。因此,您需要在导入语句中输入 `WITH BROKER ""` 来指定要使用的 Broker。这被称为“基于 Broker 的导入”。Broker 是一种独立的无状态服务,与文件系统接口集成。借助 Broker,StarRocks 可以访问和读取存储在外部存储系统中的数据文件,并可以使用自己的计算资源来预处理和导入这些数据文件的数据。 - -从 v2.5 开始,StarRocks 在运行 Broker Load 作业时,不再依赖 Broker 在 StarRocks 集群和外部存储系统之间建立连接。因此,您不再需要在导入语句中指定 Broker,但仍需要保留 `WITH BROKER` 关键字。这被称为“无 Broker 导入”。 - -当您的数据存储在 HDFS 中时,您可能会遇到无 Broker 导入不起作用的情况。当您的数据存储在多个 HDFS 集群中或您配置了多个 Kerberos 用户时,可能会发生这种情况。在这些情况下,您可以改为使用基于 Broker 的导入。要成功执行此操作,请确保至少部署了一个独立的 Broker 组。有关如何在这些情况下指定身份验证配置和 HA 配置的信息,请参见 [HDFS](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#hdfs)。 - -## 支持的数据文件格式 - -Broker Load 支持以下数据文件格式: - -- CSV - -- Parquet - -- ORC - -> **NOTE** -> -> 对于 CSV 数据,请注意以下几点: -> -> - 您可以使用 UTF-8 字符串(例如逗号 (,)、制表符或管道 (|)),其长度不超过 50 字节作为文本分隔符。 -> - 空值用 `\N` 表示。例如,一个数据文件包含三列,并且该数据文件中的一条记录在第一列和第三列中包含数据,但在第二列中没有数据。在这种情况下,您需要在第二列中使用 `\N` 来表示空值。这意味着该记录必须编译为 `a,\N,b` 而不是 `a,,b`。`a,,b` 表示该记录的第二列包含一个空字符串。 - -## 支持的存储系统 - -Broker Load 支持以下存储系统: - -- HDFS - -- AWS S3 - -- Google GCS - -- 其他 S3 兼容的存储系统,例如 MinIO - -- Microsoft Azure Storage - -## 工作原理 - -在您向 FE 提交导入作业后,FE 会生成一个查询计划,根据可用 BE 的数量和要导入的数据文件的大小将查询计划拆分为多个部分,然后将查询计划的每个部分分配给一个可用的 BE。在导入期间,每个涉及的 BE 从您的 HDFS 或云存储系统中提取数据文件的数据,预处理数据,然后将数据导入到您的 StarRocks 集群中。在所有 BE 完成其查询计划部分后,FE 确定导入作业是否成功。 - -下图显示了 Broker Load 作业的工作流程。 - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -## 基本操作 - -### 创建多表导入作业 - -本主题以 CSV 为例,介绍如何将多个数据文件导入到多个表中。有关如何导入其他文件格式的数据以及 Broker Load 的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -请注意,在 StarRocks 中,某些字面量被 SQL 语言用作保留关键字。请勿在 SQL 语句中直接使用这些关键字。如果要在 SQL 语句中使用此类关键字,请将其括在一对反引号 (`) 中。请参见 [关键字](../sql-reference/sql-statements/keywords.md)。 - -#### 数据示例 - -1. 在本地文件系统中创建 CSV 文件。 - - a. 创建一个名为 `file1.csv` 的 CSV 文件。该文件包含三列,依次表示用户 ID、用户名和用户分数。 - - ```Plain - 1,Lily,23 - 2,Rose,23 - 3,Alice,24 - 4,Julia,25 - ``` - - b. 创建一个名为 `file2.csv` 的 CSV 文件。该文件包含两列,依次表示城市 ID 和城市名称。 - - ```Plain - 200,'Beijing' - ``` - -2. 
在 StarRocks 数据库 `test_db` 中创建 StarRocks 表。 - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - a. 创建一个名为 `table1` 的主键表。该表包含三列:`id`、`name` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table1` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NULL DEFAULT "" COMMENT "user name", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - b. 创建一个名为 `table2` 的主键表。该表包含两列:`id` 和 `city`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table2` - ( - `id` int(11) NOT NULL COMMENT "city ID", - `city` varchar(65533) NULL DEFAULT "" COMMENT "city name" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - -3. 将 `file1.csv` 和 `file2.csv` 上传到 HDFS 集群的 `/user/starrocks/` 路径、AWS S3 bucket `bucket_s3` 的 `input` 文件夹、Google GCS bucket `bucket_gcs` 的 `input` 文件夹、MinIO bucket `bucket_minio` 的 `input` 文件夹以及 Azure Storage 的指定路径。 - -#### 从 HDFS 加载数据 - -执行以下语句,将 `file1.csv` 和 `file2.csv` 从 HDFS 集群的 `/user/starrocks` 路径分别加载到 `table1` 和 `table2` 中: - -```SQL -LOAD LABEL test_db.label1 -( - DATA INFILE("hdfs://:/user/starrocks/file1.csv") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - (id, name, score) - , - DATA INFILE("hdfs://:/user/starrocks/file2.csv") - INTO TABLE table2 - COLUMNS TERMINATED BY "," - (id, city) -) -WITH BROKER -( - StorageCredentialParams -) -PROPERTIES -( - "timeout" = "3600" -); -``` - -在上面的示例中,`StorageCredentialParams` 表示一组身份验证参数,这些参数因您选择的身份验证方法而异。有关详细信息,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#hdfs)。 - -#### 从 AWS S3 加载数据 - -执行以下语句,将 `file1.csv` 和 `file2.csv` 从 AWS S3 bucket `bucket_s3` 的 `input` 文件夹分别加载到 `table1` 和 `table2` 中: - -```SQL -LOAD LABEL test_db.label2 -( - DATA INFILE("s3a://bucket_s3/input/file1.csv") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - (id, name, score) - , - DATA INFILE("s3a://bucket_s3/input/file2.csv") - INTO TABLE table2 - COLUMNS TERMINATED BY "," - (id, city) -) -WITH BROKER -( - StorageCredentialParams -); -``` - -> **NOTE** -> -> Broker Load 仅支持根据 S3A 协议访问 AWS S3。因此,当您从 AWS S3 加载数据时,必须将您作为文件路径传递的 S3 URI 中的 `s3://` 替换为 `s3a://`。 - -在上面的示例中,`StorageCredentialParams` 表示一组身份验证参数,这些参数因您选择的身份验证方法而异。有关详细信息,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#aws-s3)。 - -从 v3.1 开始,StarRocks 支持通过使用 INSERT 命令和 TABLE 关键字直接从 AWS S3 加载 Parquet 格式或 ORC 格式的文件数据,从而省去了首先创建外部表的麻烦。有关详细信息,请参见 [使用 INSERT 加载数据 > 使用 TABLE 关键字直接从外部源的文件中插入数据](../loading/InsertInto.md#insert-data-directly-from-files-in-an-external-source-using-files)。 - -#### 从 Google GCS 加载数据 - -执行以下语句,将 `file1.csv` 和 `file2.csv` 从 Google GCS bucket `bucket_gcs` 的 `input` 文件夹分别加载到 `table1` 和 `table2` 中: - -```SQL -LOAD LABEL test_db.label3 -( - DATA INFILE("gs://bucket_gcs/input/file1.csv") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - (id, name, score) - , - DATA INFILE("gs://bucket_gcs/input/file2.csv") - INTO TABLE table2 - COLUMNS TERMINATED BY "," - (id, city) -) -WITH BROKER -( - StorageCredentialParams -); -``` - -> **NOTE** -> -> Broker Load 仅支持根据 gs 协议访问 Google GCS。因此,当您从 Google GCS 加载数据时,必须在您作为文件路径传递的 GCS URI 中包含 `gs://` 作为前缀。 - -在上面的示例中,`StorageCredentialParams` 表示一组身份验证参数,这些参数因您选择的身份验证方法而异。有关详细信息,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#google-gcs)。 - -#### 从其他 S3 兼容的存储系统加载数据 - -以 MinIO 为例。您可以执行以下语句,将 
`file1.csv` 和 `file2.csv` 从 MinIO bucket `bucket_minio` 的 `input` 文件夹分别加载到 `table1` 和 `table2` 中: - -```SQL -LOAD LABEL test_db.label7 -( - DATA INFILE("s3://bucket_minio/input/file1.csv") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - (id, name, score) - , - DATA INFILE("s3://bucket_minio/input/file2.csv") - INTO TABLE table2 - COLUMNS TERMINATED BY "," - (id, city) -) -WITH BROKER -( - StorageCredentialParams -); -``` - -在上面的示例中,`StorageCredentialParams` 表示一组身份验证参数,这些参数因您选择的身份验证方法而异。有关详细信息,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#other-s3-compatible-storage-system)。 - -#### 从 Microsoft Azure Storage 加载数据 - -执行以下语句,将 `file1.csv` 和 `file2.csv` 从 Azure Storage 的指定路径加载: - -```SQL -LOAD LABEL test_db.label8 -( - DATA INFILE("wasb[s]://@.blob.core.windows.net//file1.csv") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - (id, name, score) - , - DATA INFILE("wasb[s]://@.blob.core.windows.net//file2.csv") - INTO TABLE table2 - COLUMNS TERMINATED BY "," - (id, city) -) -WITH BROKER -( - StorageCredentialParams -); -``` - -> **NOTICE** - > - > 从 Azure Storage 加载数据时,您需要根据您使用的访问协议和特定存储服务来确定要使用的前缀。以下示例使用 Blob Storage 作为示例。 - > - > - 当您从 Blob Storage 加载数据时,您必须根据用于访问存储帐户的协议在文件路径中包含 `wasb://` 或 `wasbs://` 作为前缀: - > - 如果您的 Blob Storage 仅允许通过 HTTP 进行访问,请使用 `wasb://` 作为前缀,例如 `wasb://@.blob.core.windows.net///*`。 - > - 如果您的 Blob Storage 仅允许通过 HTTPS 进行访问,请使用 `wasbs://` 作为前缀,例如 `wasbs://@.blob.core.windows.net///*` - > - 当您从 Data Lake Storage Gen1 加载数据时,您必须在文件路径中包含 `adl://` 作为前缀,例如 `adl://.azuredatalakestore.net//`。 - > - 当您从 Data Lake Storage Gen2 加载数据时,您必须根据用于访问存储帐户的协议在文件路径中包含 `abfs://` 或 `abfss://` 作为前缀: - > - 如果您的 Data Lake Storage Gen2 仅允许通过 HTTP 进行访问,请使用 `abfs://` 作为前缀,例如 `abfs://@.dfs.core.windows.net/`。 - > - 如果您的 Data Lake Storage Gen2 仅允许通过 HTTPS 进行访问,请使用 `abfss://` 作为前缀,例如 `abfss://@.dfs.core.windows.net/`。 - -在上面的示例中,`StorageCredentialParams` 表示一组身份验证参数,这些参数因您选择的身份验证方法而异。有关详细信息,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#microsoft-azure-storage)。 - -#### 查询数据 - -从 HDFS 集群、AWS S3 bucket 或 Google GCS bucket 加载数据完成后,您可以使用 SELECT 语句查询 StarRocks 表的数据,以验证加载是否成功。 - -1. 执行以下语句以查询 `table1` 的数据: - - ```SQL - MySQL [test_db]> SELECT * FROM table1; - +------+-------+-------+ - | id | name | score | - +------+-------+-------+ - | 1 | Lily | 23 | - | 2 | Rose | 23 | - | 3 | Alice | 24 | - | 4 | Julia | 25 | - +------+-------+-------+ - 4 rows in set (0.00 sec) - ``` - -2. 
执行以下语句以查询 `table2` 的数据: - - ```SQL - MySQL [test_db]> SELECT * FROM table2; - +------+--------+ - | id | city | - +------+--------+ - | 200 | Beijing| - +------+--------+ - 4 rows in set (0.01 sec) - ``` - -### 创建单表导入作业 - -您还可以将单个数据文件或指定路径中的所有数据文件加载到单个目标表中。假设您的 AWS S3 bucket `bucket_s3` 包含一个名为 `input` 的文件夹。`input` 文件夹包含多个数据文件,其中一个名为 `file1.csv`。这些数据文件包含与 `table1` 相同的列数,并且来自每个数据文件的列可以按顺序一一映射到来自 `table1` 的列。 - -要将 `file1.csv` 加载到 `table1` 中,请执行以下语句: - -```SQL -LOAD LABEL test_db.label_7 -( - DATA INFILE("s3a://bucket_s3/input/file1.csv") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - FORMAT AS "CSV" -) -WITH BROKER -( - StorageCredentialParams -); -``` - -要将 `input` 文件夹中的所有数据文件加载到 `table1` 中,请执行以下语句: - -```SQL -LOAD LABEL test_db.label_8 -( - DATA INFILE("s3a://bucket_s3/input/*") - INTO TABLE table1 - COLUMNS TERMINATED BY "," - FORMAT AS "CSV" -) -WITH BROKER -( - StorageCredentialParams -); -``` - -在上面的示例中,`StorageCredentialParams` 表示一组身份验证参数,这些参数因您选择的身份验证方法而异。有关详细信息,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#aws-s3)。 - -### 查看导入作业 - -Broker Load 允许您使用 SHOW LOAD 语句或 `curl` 命令查看 lob 作业。 - -#### 使用 SHOW LOAD - -有关详细信息,请参见 [SHOW LOAD](../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md)。 - -#### 使用 curl - -语法如下: - -```Bash -curl --location-trusted -u : \ - 'http://:/api//_load_info?label=' -``` - -> **NOTE** -> -> 如果您使用的帐户未设置密码,则只需输入 `:`。 - -例如,您可以运行以下命令来查看 `test_db` 数据库中标签为 `label1` 的导入作业的信息: - -```Bash -curl --location-trusted -u : \ - 'http://:/api/test_db/_load_info?label=label1' -``` - -`curl` 命令以 JSON 对象 `jobInfo` 的形式返回有关具有指定标签的最近执行的导入作业的信息: - -```JSON -{"jobInfo":{"dbName":"default_cluster:test_db","tblNames":["table1_simple"],"label":"label1","state":"FINISHED","failMsg":"","trackingUrl":""},"status":"OK","msg":"Success"}% -``` - -下表描述了 `jobInfo` 中的参数。 - -| **参数** | **描述** | -| ------------- | ------------------------------------------------------------ | -| dbName | 要将数据加载到的数据库的名称。 | -| tblNames | 要将数据加载到的表的名称。 | -| label | 导入作业的标签。 | -| state | 导入作业的状态。有效值:
  • `PENDING`:导入作业正在队列中等待调度。
  • `QUEUEING`:导入作业正在队列中等待调度。
  • `LOADING`:导入作业正在运行。
  • `PREPARED`:事务已提交。
  • `FINISHED`:导入作业成功。
  • `CANCELLED`:导入作业失败。
有关详细信息,请参见 [导入概念](./loading_introduction/loading_concepts.md) 中的“异步导入”部分。 | -| failMsg | 导入作业失败的原因。如果导入作业的 `state` 值为 `PENDING`、`LOADING` 或 `FINISHED`,则为 `failMsg` 参数返回 `NULL`。如果导入作业的 `state` 值为 `CANCELLED`,则为 `failMsg` 参数返回的值由两部分组成:`type` 和 `msg`。
  • `type` 部分可以是以下任何值:
    • `USER_CANCEL`:导入作业已手动取消。
    • `ETL_SUBMIT_FAIL`:导入作业未能提交。
    • `ETL_QUALITY_UNSATISFIED`:导入作业失败,因为不合格数据的百分比超过了 `max_filter_ratio` 参数的值。
    • `LOAD_RUN_FAIL`:导入作业在 `LOADING` 阶段失败。
    • `TIMEOUT`:导入作业未在指定的超时时间内完成。
    • `UNKNOWN`:导入作业因未知错误而失败。
  • `msg` 部分提供了导入失败的详细原因。
| -| trackingUrl | 用于访问在导入作业中检测到的不合格数据的 URL。您可以使用 `curl` 或 `wget` 命令访问该 URL 并获取不合格数据。如果未检测到不合格数据,则为 `trackingUrl` 参数返回 `NULL`。 | -| status | 导入作业的 HTTP 请求的状态。有效值为:`OK` 和 `Fail`。 | -| msg | 导入作业的 HTTP 请求的错误信息。 | - -### 取消导入作业 - -当导入作业未处于 **CANCELLED** 或 **FINISHED** 阶段时,您可以使用 [CANCEL LOAD](../sql-reference/sql-statements/loading_unloading/CANCEL_LOAD.md) 语句取消该作业。 - -例如,您可以执行以下语句以取消数据库 `test_db` 中标签为 `label1` 的导入作业: - -```SQL -CANCEL LOAD -FROM test_db -WHERE LABEL = "label"; -``` - -## 作业拆分和并发运行 - -一个 Broker Load 作业可以拆分为一个或多个并发运行的任务。一个导入作业中的所有任务都在一个事务中运行。它们必须全部成功或全部失败。StarRocks 根据您在 `LOAD` 语句中如何声明 `data_desc` 来拆分每个导入作业: - -- 如果您声明了多个 `data_desc` 参数,每个参数指定一个不同的表,则会生成一个任务来加载每个表的数据。 - -- 如果您声明了多个 `data_desc` 参数,每个参数指定同一表的不同分区,则会生成一个任务来加载每个分区的数据。 - -此外,每个任务可以进一步拆分为一个或多个实例,这些实例均匀分布到 StarRocks 集群的 BE 上并并发运行。StarRocks 根据以下 [FE 配置](../administration/management/FE_configuration.md) 拆分每个任务: - -- `min_bytes_per_broker_scanner`:每个实例处理的最小数据量。默认量为 64 MB。 - -- `load_parallel_instance_num`:每个 BE 上每个导入作业中允许的并发实例数。默认数量为 1。 - - 您可以使用以下公式计算单个任务中的实例数: - - **单个任务中的实例数 = min(单个任务要加载的数据量/`min_bytes_per_broker_scanner`,`load_parallel_instance_num` x BE 数量)** - -在大多数情况下,每个导入作业只声明一个 `data_desc`,每个导入作业只拆分为一个任务,并且该任务拆分的实例数与 BE 的数量相同。 - -## 相关配置项 - -[FE 配置项](../administration/management/FE_configuration.md) `max_broker_load_job_concurrency` 指定了 StarRocks 集群中可以并发运行的最大 Broker Load 作业数。 - -在 StarRocks v2.4 及更早版本中,如果在特定时间段内提交的 Broker Load 作业总数超过最大数量,则会将过多的作业排队并根据其提交时间进行调度。 - -自 StarRocks v2.5 以来,如果在特定时间段内提交的 Broker Load 作业总数超过最大数量,则会将过多的作业排队并根据其优先级进行调度。您可以在创建作业时使用 `priority` 参数为作业指定优先级。请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#opt_properties)。您还可以使用 [ALTER LOAD](../sql-reference/sql-statements/loading_unloading/ALTER_LOAD.md) 修改处于 **QUEUEING** 或 **LOADING** 状态的现有作业的优先级。 \ No newline at end of file diff --git a/docs/zh/loading/Etl_in_loading.md b/docs/zh/loading/Etl_in_loading.md deleted file mode 100644 index 1baded5..0000000 --- a/docs/zh/loading/Etl_in_loading.md +++ /dev/null @@ -1,452 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 数据导入时转换数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 支持在数据导入时转换数据。 - -该功能支持 [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 、[Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 和 [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) ,但不支持 [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) 。 - - - -本文档以 CSV 数据为例,介绍如何在数据导入时提取和转换数据。支持的数据文件格式因您选择的导入方式而异。 - -> **NOTE** -> -> 对于 CSV 数据,您可以使用 UTF-8 字符串(例如逗号 (,)、制表符或管道 (|))作为文本分隔符,其长度不超过 50 字节。 - -## 适用场景 - -当您将数据文件导入到 StarRocks 表中时,数据文件的数据可能无法完全映射到 StarRocks 表的数据。在这种情况下,您无需在将数据加载到 StarRocks 表之前提取或转换数据。StarRocks 可以帮助您在加载期间提取和转换数据: - -- 跳过不需要加载的列。 - - 您可以跳过不需要加载的列。此外,如果数据文件的列与 StarRocks 表的列顺序不同,则可以在数据文件和 StarRocks 表之间创建列映射。 - -- 过滤掉您不想加载的行。 - - 您可以指定过滤条件,StarRocks 会根据这些条件过滤掉您不想加载的行。 - -- 从原始列生成新列。 - - 生成的列是从数据文件的原始列计算得出的特殊列。您可以将生成的列映射到 StarRocks 表的列。 - -- 从文件路径提取分区字段值。 - - 如果数据文件是从 Apache Hive™ 生成的,则可以从文件路径提取分区字段值。 - -## 数据示例 - -1. 在本地文件系统中创建数据文件。 - - a. 创建一个名为 `file1.csv` 的数据文件。该文件由四列组成,依次表示用户 ID、用户性别、事件日期和事件类型。 - - ```Plain - 354,female,2020-05-20,1 - 465,male,2020-05-21,2 - 576,female,2020-05-22,1 - 687,male,2020-05-23,2 - ``` - - b. 创建一个名为 `file2.csv` 的数据文件。该文件仅由一列组成,表示日期。 - - ```Plain - 2020-05-20 - 2020-05-21 - 2020-05-22 - 2020-05-23 - ``` - -2. 
在 StarRocks 数据库 `test_db` 中创建表。 - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - a. 创建一个名为 `table1` 的表,该表由三列组成:`event_date`、`event_type` 和 `user_id`。 - - ```SQL - MySQL [test_db]> CREATE TABLE table1 - ( - `event_date` DATE COMMENT "event date", - `event_type` TINYINT COMMENT "event type", - `user_id` BIGINT COMMENT "user ID" - ) - DISTRIBUTED BY HASH(user_id); - ``` - - b. 创建一个名为 `table2` 的表,该表由四列组成:`date`、`year`、`month` 和 `day`。 - - ```SQL - MySQL [test_db]> CREATE TABLE table2 - ( - `date` DATE COMMENT "date", - `year` INT COMMENT "year", - `month` TINYINT COMMENT "month", - `day` TINYINT COMMENT "day" - ) - DISTRIBUTED BY HASH(date); - ``` - -3. 将 `file1.csv` 和 `file2.csv` 上传到 HDFS 集群的 `/user/starrocks/data/input/` 路径,将 `file1.csv` 的数据发布到 Kafka 集群的 `topic1`,并将 `file2.csv` 的数据发布到 Kafka 集群的 `topic2`。 - -## 跳过不需要加载的列 - -您要加载到 StarRocks 表中的数据文件可能包含一些无法映射到 StarRocks 表的任何列的列。在这种情况下,StarRocks 支持仅加载可以从数据文件映射到 StarRocks 表的列。 - -此功能支持从以下数据源加载数据: - -- 本地文件系统 - -- HDFS 和云存储 - - > **NOTE** - > - > 本节以 HDFS 为例。 - -- Kafka - -在大多数情况下,CSV 文件的列未命名。对于某些 CSV 文件,第一行由列名组成,但 StarRocks 将第一行的内容处理为普通数据,而不是列名。因此,当您加载 CSV 文件时,必须在作业创建语句或命令中**按顺序**临时命名 CSV 文件的列。这些临时命名的列**按名称**映射到 StarRocks 表的列。请注意以下关于数据文件列的几点: - -- 可以映射到 StarRocks 表的列,并通过使用 StarRocks 表中列的名称临时命名,这些列的数据将被直接加载。 - -- 无法映射到 StarRocks 表的列将被忽略,这些列的数据不会被加载。 - -- 如果某些列可以映射到 StarRocks 表的列,但在作业创建语句或命令中未临时命名,则加载作业会报告错误。 - -本节以 `file1.csv` 和 `table1` 为例。`file1.csv` 的四列依次临时命名为 `user_id`、`user_gender`、`event_date` 和 `event_type`。在 `file1.csv` 的临时命名列中,`user_id`、`event_date` 和 `event_type` 可以映射到 `table1` 的特定列,而 `user_gender` 无法映射到 `table1` 的任何列。因此,`user_id`、`event_date` 和 `event_type` 将被加载到 `table1` 中,但 `user_gender` 不会被加载。 - -### 导入数据 - -#### 从本地文件系统加载数据 - -如果 `file1.csv` 存储在您的本地文件系统中,请运行以下命令来创建 [Stream Load](../loading/StreamLoad.md) 作业: - -```Bash -curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: user_id, user_gender, event_date, event_type" \ - -T file1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load -``` - -> **NOTE** -> -> 如果您选择 Stream Load,则必须使用 `columns` 参数临时命名数据文件的列,以在数据文件和 StarRocks 表之间创建列映射。 - -有关详细的语法和参数说明,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -#### 从 HDFS 集群加载数据 - -如果 `file1.csv` 存储在您的 HDFS 集群中,请执行以下语句来创建 [Broker Load](../loading/hdfs_load.md) 作业: - -```SQL -LOAD LABEL test_db.label1 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/file1.csv") - INTO TABLE `table1` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (user_id, user_gender, event_date, event_type) -) -WITH BROKER; -``` - -> **NOTE** -> -> 如果您选择 Broker Load,则必须使用 `column_list` 参数临时命名数据文件的列,以在数据文件和 StarRocks 表之间创建列映射。 - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 从 Kafka 集群加载数据 - -如果 `file1.csv` 的数据发布到 Kafka 集群的 `topic1`,请执行以下语句来创建 [Routine Load](../loading/RoutineLoad.md) 作业: - -```SQL -CREATE ROUTINE LOAD test_db.table101 ON table1 - COLUMNS TERMINATED BY ",", - COLUMNS(user_id, user_gender, event_date, event_type) -FROM KAFKA -( - "kafka_broker_list" = ":", - "kafka_topic" = "topic1", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -> **NOTE** -> -> 如果您选择 Routine Load,则必须使用 `COLUMNS` 参数临时命名数据文件的列,以在数据文件和 StarRocks 表之间创建列映射。 - -有关详细的语法和参数说明,请参见 [CREATE ROUTINE 
LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### 查询数据 - -从本地文件系统、HDFS 集群或 Kafka 集群加载数据完成后,查询 `table1` 的数据以验证加载是否成功: - -```SQL -MySQL [test_db]> SELECT * FROM table1; -+------------+------------+---------+ -| event_date | event_type | user_id | -+------------+------------+---------+ -| 2020-05-22 | 1 | 576 | -| 2020-05-20 | 1 | 354 | -| 2020-05-21 | 2 | 465 | -| 2020-05-23 | 2 | 687 | -+------------+------------+---------+ -4 rows in set (0.01 sec) -``` - -## 过滤掉您不想加载的行 - -当您将数据文件加载到 StarRocks 表中时,您可能不想加载数据文件的特定行。在这种情况下,您可以使用 WHERE 子句来指定要加载的行。StarRocks 会过滤掉所有不满足 WHERE 子句中指定的过滤条件的行。 - -此功能支持从以下数据源加载数据: - -- 本地文件系统 - -- HDFS 和云存储 - > **NOTE** - > - > 本节以 HDFS 为例。 - -- Kafka - -本节以 `file1.csv` 和 `table1` 为例。如果您只想将 `file1.csv` 中事件类型为 `1` 的行加载到 `table1` 中,则可以使用 WHERE 子句来指定过滤条件 `event_type = 1`。 - -### 导入数据 - -#### 从本地文件系统加载数据 - -如果 `file1.csv` 存储在您的本地文件系统中,请运行以下命令来创建 [Stream Load](../loading/StreamLoad.md) 作业: - -```Bash -curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: user_id, user_gender, event_date, event_type" \ - -H "where: event_type=1" \ - -T file1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load -``` - -有关详细的语法和参数说明,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -#### 从 HDFS 集群加载数据 - -如果 `file1.csv` 存储在您的 HDFS 集群中,请执行以下语句来创建 [Broker Load](../loading/hdfs_load.md) 作业: - -```SQL -LOAD LABEL test_db.label2 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/file1.csv") - INTO TABLE `table1` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (user_id, user_gender, event_date, event_type) - WHERE event_type = 1 -) -WITH BROKER; -``` - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 从 Kafka 集群加载数据 - -如果 `file1.csv` 的数据发布到 Kafka 集群的 `topic1`,请执行以下语句来创建 [Routine Load](../loading/RoutineLoad.md) 作业: - -```SQL -CREATE ROUTINE LOAD test_db.table102 ON table1 -COLUMNS TERMINATED BY ",", -COLUMNS (user_id, user_gender, event_date, event_type), -WHERE event_type = 1 -FROM KAFKA -( - "kafka_broker_list" = ":", - "kafka_topic" = "topic1", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -有关详细的语法和参数说明,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### 查询数据 - -从本地文件系统、HDFS 集群或 Kafka 集群加载数据完成后,查询 `table1` 的数据以验证加载是否成功: - -```SQL -MySQL [test_db]> SELECT * FROM table1; -+------------+------------+---------+ -| event_date | event_type | user_id | -+------------+------------+---------+ -| 2020-05-20 | 1 | 354 | -| 2020-05-22 | 1 | 576 | -+------------+------------+---------+ -2 rows in set (0.01 sec) -``` - -## 从原始列生成新列 - -当您将数据文件加载到 StarRocks 表中时,数据文件的某些数据可能需要转换后才能加载到 StarRocks 表中。在这种情况下,您可以使用作业创建命令或语句中的函数或表达式来实现数据转换。 - -此功能支持从以下数据源加载数据: - -- 本地文件系统 - -- HDFS 和云存储 - > **NOTE** - > - > 本节以 HDFS 为例。 - -- Kafka - -本节以 `file2.csv` 和 `table2` 为例。`file2.csv` 仅由一列组成,表示日期。您可以使用 [year](../sql-reference/sql-functions/date-time-functions/year.md) 、[month](../sql-reference/sql-functions/date-time-functions/month.md) 和 [day](../sql-reference/sql-functions/date-time-functions/day.md) 函数从 `file2.csv` 中的每个日期提取年、月和日,并将提取的数据加载到 `table2` 的 `year`、`month` 和 `day` 列中。 - -### 导入数据 - -#### 从本地文件系统加载数据 - -如果 `file2.csv` 存储在您的本地文件系统中,请运行以下命令来创建 [Stream Load](../loading/StreamLoad.md) 作业: - -```Bash -curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H 
"columns:date,year=year(date),month=month(date),day=day(date)" \ - -T file2.csv -XPUT \ - http://:/api/test_db/table2/_stream_load -``` - -> **NOTE** -> -> - 在 `columns` 参数中,您必须首先临时命名数据文件的**所有列**,然后临时命名要从数据文件的原始列生成的新列。如前面的示例所示,`file2.csv` 的唯一列临时命名为 `date`,然后调用 `year=year(date)`、`month=month(date)` 和 `day=day(date)` 函数来生成三个新列,这些新列临时命名为 `year`、`month` 和 `day`。 -> -> - Stream Load 不支持 `column_name = function(column_name)`,但支持 `column_name = function(column_name)`。 - -有关详细的语法和参数说明,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -#### 从 HDFS 集群加载数据 - -如果 `file2.csv` 存储在您的 HDFS 集群中,请执行以下语句来创建 [Broker Load](../loading/hdfs_load.md) 作业: - -```SQL -LOAD LABEL test_db.label3 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/file2.csv") - INTO TABLE `table2` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (date) - SET(year=year(date), month=month(date), day=day(date)) -) -WITH BROKER; -``` - -> **NOTE** -> -> 您必须首先使用 `column_list` 参数临时命名数据文件的**所有列**,然后使用 SET 子句临时命名要从数据文件的原始列生成的新列。如前面的示例所示,`file2.csv` 的唯一列在 `column_list` 参数中临时命名为 `date`,然后在 SET 子句中调用 `year=year(date)`、`month=month(date)` 和 `day=day(date)` 函数来生成三个新列,这些新列临时命名为 `year`、`month` 和 `day`。 - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 从 Kafka 集群加载数据 - -如果 `file2.csv` 的数据发布到 Kafka 集群的 `topic2`,请执行以下语句来创建 [Routine Load](../loading/RoutineLoad.md) 作业: - -```SQL -CREATE ROUTINE LOAD test_db.table201 ON table2 - COLUMNS TERMINATED BY ",", - COLUMNS(date,year=year(date),month=month(date),day=day(date)) -FROM KAFKA -( - "kafka_broker_list" = ":", - "kafka_topic" = "topic2", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -> **NOTE** -> -> 在 `COLUMNS` 参数中,您必须首先临时命名数据文件的**所有列**,然后临时命名要从数据文件的原始列生成的新列。如前面的示例所示,`file2.csv` 的唯一列临时命名为 `date`,然后调用 `year=year(date)`、`month=month(date)` 和 `day=day(date)` 函数来生成三个新列,这些新列临时命名为 `year`、`month` 和 `day`。 - -有关详细的语法和参数说明,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### 查询数据 - -从本地文件系统、HDFS 集群或 Kafka 集群加载数据完成后,查询 `table2` 的数据以验证加载是否成功: - -```SQL -MySQL [test_db]> SELECT * FROM table2; -+------------+------+-------+------+ -| date | year | month | day | -+------------+------+-------+------+ -| 2020-05-20 | 2020 | 5 | 20 | -| 2020-05-21 | 2020 | 5 | 21 | -| 2020-05-22 | 2020 | 5 | 22 | -| 2020-05-23 | 2020 | 5 | 23 | -+------------+------+-------+------+ -4 rows in set (0.01 sec) -``` - -## 从文件路径提取分区字段值 - -如果您指定的文件路径包含分区字段,则可以使用 `COLUMNS FROM PATH AS` 参数来指定要从文件路径提取的分区字段。文件路径中的分区字段等同于数据文件中的列。仅当您从 HDFS 集群加载数据时,才支持 `COLUMNS FROM PATH AS` 参数。 - -例如,您要加载以下四个从 Hive 生成的数据文件: - -```Plain -/user/starrocks/data/input/date=2020-05-20/data -1,354 -/user/starrocks/data/input/date=2020-05-21/data -2,465 -/user/starrocks/data/input/date=2020-05-22/data -1,576 -/user/starrocks/data/input/date=2020-05-23/data -2,687 -``` - -这四个数据文件存储在 HDFS 集群的 `/user/starrocks/data/input/` 路径中。这些数据文件中的每一个都按分区字段 `date` 分区,并由两列组成,依次表示事件类型和用户 ID。 - -### 从 HDFS 集群加载数据 - -执行以下语句以创建 [Broker Load](../loading/hdfs_load.md) 作业,该作业使您可以从 `/user/starrocks/data/input/` 文件路径提取 `date` 分区字段值,并使用通配符 (*) 来指定您要将文件路径中的所有数据文件加载到 `table1`: - -```SQL -LOAD LABEL test_db.label4 -( - DATA INFILE("hdfs://:/user/starrocks/data/input/date=*/*") - INTO TABLE `table1` - FORMAT AS "csv" - COLUMNS TERMINATED BY "," - (event_type, user_id) - COLUMNS FROM PATH AS (date) - SET(event_date = date) -) -WITH BROKER; -``` - -> **NOTE** -> -> 在前面的示例中,指定文件路径中的 `date` 分区字段等同于 `table1` 的 `event_date` 
列。因此,您需要使用 SET 子句将 `date` 分区字段映射到 `event_date` 列。如果指定文件路径中的分区字段与 StarRocks 表的列同名,则无需使用 SET 子句创建映射。 - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -### 查询数据 - -从 HDFS 集群加载数据完成后,查询 `table1` 的数据以验证加载是否成功: - -```SQL -MySQL [test_db]> SELECT * FROM table1; -+------------+------------+---------+ -| event_date | event_type | user_id | -+------------+------------+---------+ -| 2020-05-22 | 1 | 576 | -| 2020-05-20 | 1 | 354 | -| 2020-05-21 | 2 | 465 | -| 2020-05-23 | 2 | 687 | -+------------+------------+---------+ -4 rows in set (0.01 sec) -``` \ No newline at end of file diff --git a/docs/zh/loading/Flink-connector-starrocks.md b/docs/zh/loading/Flink-connector-starrocks.md deleted file mode 100644 index f058908..0000000 --- a/docs/zh/loading/Flink-connector-starrocks.md +++ /dev/null @@ -1,572 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 从 Apache Flink® 持续导入数据 - -StarRocks 提供了一个自主开发的连接器,名为 StarRocks Connector for Apache Flink® (简称为 Flink connector),以帮助您使用 Flink 将数据导入到 StarRocks 表中。其基本原理是先积累数据,然后通过 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 将数据一次性导入到 StarRocks 中。 - -Flink connector 支持 DataStream API、Table API & SQL 和 Python API。与 Apache Flink® 提供的 [flink-connector-jdbc](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/jdbc/) 相比,它具有更高且更稳定的性能。 - -> **注意** -> -> 使用 Flink connector 将数据导入到 StarRocks 表中,需要对目标 StarRocks 表具有 SELECT 和 INSERT 权限。如果您没有这些权限,请按照 [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) 中提供的说明,将这些权限授予用于连接 StarRocks 集群的用户。 - -## 版本要求 - -| Connector | Flink | StarRocks | Java | Scala | -|-----------|-------------------------------|---------------| ---- |-----------| -| 1.2.11 | 1.15,1.16,1.17,1.18,1.19,1.20 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.10 | 1.15,1.16,1.17,1.18,1.19 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.9 | 1.15,1.16,1.17,1.18 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.8 | 1.13,1.14,1.15,1.16,1.17 | 2.1 and later | 8 | 2.11,2.12 | -| 1.2.7 | 1.11,1.12,1.13,1.14,1.15 | 2.1 and later | 8 | 2.11,2.12 | - -## 获取 Flink connector - -您可以通过以下方式获取 Flink connector JAR 文件: - -- 直接下载已编译的 Flink connector JAR 文件。 -- 在您的 Maven 项目中添加 Flink connector 作为依赖项,然后下载 JAR 文件。 -- 自己将 Flink connector 的源代码编译为 JAR 文件。 - -Flink connector JAR 文件的命名格式如下: - -- 从 Flink 1.15 开始,命名格式为 `flink-connector-starrocks-${connector_version}_flink-${flink_version}.jar`。例如,如果您安装了 Flink 1.15 并且想要使用 Flink connector 1.2.7,则可以使用 `flink-connector-starrocks-1.2.7_flink-1.15.jar`。 - -- 在 Flink 1.15 之前,命名格式为 `flink-connector-starrocks-${connector_version}_flink-${flink_version}_${scala_version}.jar`。例如,如果您的环境中安装了 Flink 1.14 和 Scala 2.12,并且想要使用 Flink connector 1.2.7,则可以使用 `flink-connector-starrocks-1.2.7_flink-1.14_2.12.jar`。 - -> **注意** -> -> 通常,最新版本的 Flink connector 仅保持与 Flink 的三个最新版本的兼容性。 - -### 下载已编译的 Jar 文件 - -从 [Maven Central Repository](https://repo1.maven.org/maven2/com/starrocks) 直接下载相应版本的 Flink connector Jar 文件。 - -### Maven 依赖 - -在您的 Maven 项目的 `pom.xml` 文件中,按照以下格式添加 Flink connector 作为依赖项。将 `flink_version`、`scala_version` 和 `connector_version` 替换为相应的版本。 - -- 在 Flink 1.15 及更高版本中 - - ```xml - - com.starrocks - flink-connector-starrocks - ${connector_version}_flink-${flink_version} - - ``` - -- 在低于 Flink 1.15 的版本中 - - ```xml - - com.starrocks - flink-connector-starrocks - ${connector_version}_flink-${flink_version}_${scala_version} - - ``` - -### 自行编译 - -1. 下载 [Flink connector 源代码](https://github.com/StarRocks/starrocks-connector-for-apache-flink)。 -2. 
执行以下命令,将 Flink connector 的源代码编译为 JAR 文件。请注意,`flink_version` 替换为相应的 Flink 版本。 - - ```bash - sh build.sh - ``` - - 例如,如果您的环境中的 Flink 版本为 1.15,则需要执行以下命令: - - ```bash - sh build.sh 1.15 - ``` - -3. 进入 `target/` 目录以查找 Flink connector JAR 文件,例如编译后生成的 `flink-connector-starrocks-1.2.7_flink-1.15-SNAPSHOT.jar`。 - -> **注意** -> -> 非正式发布的 Flink connector 的名称包含 `SNAPSHOT` 后缀。 - -## Options - -### connector - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 您要使用的连接器。该值必须为 "starrocks"。 - -### jdbc-url - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 用于连接 FE 的 MySQL 服务器的地址。您可以指定多个地址,这些地址必须用逗号 (,) 分隔。格式:`jdbc:mysql://<fe_host1>:<fe_query_port1>,<fe_host2>:<fe_query_port2>,<fe_host3>:<fe_query_port3>`。 - -### load-url - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 用于连接 FE 的 HTTP 服务器的地址。您可以指定多个地址,这些地址必须用分号 (;) 分隔。格式:`<fe_host1>:<fe_http_port1>;<fe_host2>:<fe_http_port2>`。 - -### database-name - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 您要将数据加载到的 StarRocks 数据库的名称。 - -### table-name - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 您要用于将数据加载到 StarRocks 中的表的名称。 - -### username - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 您要用于将数据加载到 StarRocks 中的帐户的用户名。该帐户需要对目标 StarRocks 表具有 [SELECT 和 INSERT 权限](../sql-reference/sql-statements/account-management/GRANT.md)。 - -### password - -**是否必须**: 是
-**默认值**: NONE
-**描述**: 上述帐户的密码。 - -### sink.version - -**是否必须**: 否
-**默认值**: AUTO
-**描述**: 用于加载数据的接口。从 Flink connector 1.2.4 版本开始支持此参数。
  • `V1`: 使用 [Stream Load](../loading/StreamLoad.md) 接口加载数据。1.2.4 之前的连接器仅支持此模式。
  • `V2`: 使用 [Stream Load 事务](./Stream_Load_transaction_interface.md) 接口加载数据。它要求 StarRocks 至少为 2.4 版本。推荐使用 `V2`,因为它优化了内存使用并提供了更稳定的 exactly-once 实现。
  • `AUTO`: 如果 StarRocks 的版本支持事务 Stream Load,将自动选择 `V2`,否则选择 `V1`。
- -### sink.label-prefix - -**是否必须**: 否
-**默认值**: NONE
-**描述**: Stream Load 使用的标签前缀。如果您正在使用 1.2.8 及更高版本的连接器进行 exactly-once,建议配置此参数。请参阅 [exactly-once 使用说明](#exactly-once)。 - -### sink.semantic - -**是否必须**: 否
-**默认值**: at-least-once
-**描述**: sink 保证的语义。有效值:**at-least-once** 和 **exactly-once**。 - -### sink.buffer-flush.max-bytes - -**是否必须**: 否
-**默认值**: 94371840(90M)
-**描述**: 在一次发送到 StarRocks 之前,可以在内存中累积的最大数据量。该阈值的取值范围为 64 MB 到 10 GB。将此参数设置为较大的值可以提高导入性能,但可能会增加导入延迟。此参数仅在 `sink.semantic` 设置为 `at-least-once` 时生效。如果 `sink.semantic` 设置为 `exactly-once`,则在触发 Flink checkpoint 时刷新内存中的数据。在这种情况下,此参数不生效。 - -### sink.buffer-flush.max-rows - -**是否必须**: 否
-**默认值**: 500000
-**描述**: 在一次发送到 StarRocks 之前,可以在内存中累积的最大行数。此参数仅在 `sink.version` 为 `V1` 且 `sink.semantic` 为 `at-least-once` 时可用。有效值:64000 到 5000000。 - -### sink.buffer-flush.interval-ms - -**是否必须**: 否
-**默认值**: 300000
-**描述**: 刷新数据的间隔。此参数仅在 `sink.semantic` 为 `at-least-once` 时可用。有效值:1000 到 3600000。单位:毫秒。 - -### sink.max-retries - -**是否必须**: 否
-**默认值**: 3
-**描述**: 系统重试执行 Stream Load 作业的次数。仅当您将 `sink.version` 设置为 `V1` 时,此参数才可用。有效值:0 到 10。 - -### sink.connect.timeout-ms - -**是否必须**: 否
-**默认值**: 30000
-**描述**: 建立 HTTP 连接的超时时间。有效值:100 到 60000。单位:毫秒。在 Flink connector v1.2.9 之前,默认值为 `1000`。 - -### sink.socket.timeout-ms - -**是否必须**: 否
-**默认值**: -1
-**描述**: 自 1.2.10 起支持。HTTP 客户端等待数据的时间。单位:毫秒。默认值 `-1` 表示没有超时。 - -### sink.sanitize-error-log - -**是否必须**: 否
-**默认值**: false
-**描述**: 自 1.2.12 起支持。是否对错误日志中的敏感数据进行脱敏,以满足生产环境的安全要求。当此项设置为 `true` 时,连接器和 SDK 日志中 Stream Load 错误信息里的敏感行数据和列值将被遮盖。为保持向后兼容,该值默认为 `false`。 - -### sink.wait-for-continue.timeout-ms - -**是否必须**: 否
-**默认值**: 10000
-**描述**: 自 1.2.7 起支持。等待来自 FE 的 HTTP 100-continue 响应的超时时间。有效值:`3000` 到 `60000`。单位:毫秒。 - -### sink.ignore.update-before - -**是否必须**: 否
-**默认值**: true
-**描述**: 自 1.2.8 版本起支持。将数据加载到主键表时,是否忽略来自 Flink 的 `UPDATE_BEFORE` 记录。如果此参数设置为 false,则该记录将被视为对 StarRocks 表的删除操作。 - -### sink.parallelism - -**是否必须**: 否
-**默认值**: NONE
-**描述**: 导入的并行度。仅适用于 Flink SQL。如果未指定此参数,则由 Flink planner 决定并行度。**在多并行度的情况下,用户需要保证数据以正确的顺序写入。** - -### sink.properties.* - -**是否必须**: 否
-**默认值**: NONE
-**描述**: 用于控制 Stream Load 行为的参数。例如,参数 `sink.properties.format` 指定用于 Stream Load 的格式,例如 CSV 或 JSON。有关支持的参数及其描述的列表,请参阅 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -### sink.properties.format - -**是否必须**: 否
-**默认值**: csv
-**描述**: 用于 Stream Load 的格式。Flink connector 将在将每批数据发送到 StarRocks 之前将其转换为该格式。有效值:`csv` 和 `json`。 - -### sink.properties.column_separator - -**是否必须**: 否
-**默认值**: \t
-**描述**: CSV 格式数据的列分隔符。 - -### sink.properties.row_delimiter - -**是否必须**: 否
-**默认值**: \n
-**描述**: CSV 格式数据的行分隔符。 - -### sink.properties.max_filter_ratio - -**是否必须**: 否
-**默认值**: 0
-**描述**: Stream Load 的最大错误容忍度。它是由于数据质量不足而被过滤掉的数据记录的最大百分比。有效值:`0` 到 `1`。默认值:`0`。有关详细信息,请参阅 [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -### sink.properties.partial_update - -**是否必须**: 否
-**默认值**: `FALSE`
-**描述**: 是否使用部分更新。有效值:`TRUE` 和 `FALSE`。默认值为 `FALSE`,表示禁用此功能。 - -### sink.properties.partial_update_mode - -**是否必须**: 否
-**默认值**: `row`
-**描述**: 指定部分更新的模式。有效值:`row` 和 `column`。
  • 值 `row` (默认) 表示行模式下的部分更新,更适合于具有许多列和小批量的实时更新。
  • 值 `column` 表示列模式下的部分更新,更适合于具有少量列和许多行的批量更新。在这种情况下,启用列模式可以提供更快的更新速度。例如,在一个具有 100 列的表中,如果仅更新所有行的 10 列(总数的 10%),则列模式的更新速度比行模式快 10 倍。
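
下面给出一个示意性的 Flink SQL 片段,演示如何开启列模式的部分更新(连接信息均为假设值;`sink.properties.columns` 属于透传给 Stream Load 的参数,用于声明本次导入涉及的列,此处的列名仅作示例,请结合实际表结构和版本要求调整):

```SQL
CREATE TABLE `score_board_partial_sink` (
    `id` INT,
    `score` INT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',
    'load-url' = '127.0.0.1:8030',
    'database-name' = 'test',
    'table-name' = 'score_board',
    'username' = 'root',
    'password' = '',
    -- 开启部分更新,并选择列模式
    'sink.properties.partial_update' = 'TRUE',
    'sink.properties.partial_update_mode' = 'column',
    -- 假设目标表还包含其他列,此处仅更新主键列 id 和 score 列
    'sink.properties.columns' = 'id,score'
);
```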
- -### sink.properties.strict_mode - -**是否必须**: 否
-**默认值**: false
-**描述**: 指定是否为 Stream Load 启用严格模式。它会影响存在不合格行(例如,列值不一致)时的加载行为。有效值:`true` 和 `false`。默认值:`false`。有关详细信息,请参阅 [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -### sink.properties.compression - -**是否必须**: 否
-**默认值**: NONE
-**描述**: 用于 Stream Load 的压缩算法。有效值:`lz4_frame`。JSON 格式的压缩需要 Flink connector 1.2.10+ 和 StarRocks v3.2.7+。CSV 格式的压缩仅需要 Flink connector 1.2.11+。 - -### sink.properties.prepared_timeout - -**是否必须**: 否
-**默认值**: NONE
-**描述**: 自 1.2.12 起支持,并且仅在 `sink.version` 设置为 `V2` 时有效。需要 StarRocks 3.5.4 或更高版本。设置从 `PREPARED` 到 `COMMITTED` 的事务 Stream Load 阶段的超时时间(以秒为单位)。通常,仅 exactly-once 需要此设置;at-least-once 通常不需要设置此项(连接器默认为 300 秒)。如果未在 exactly-once 中设置,则应用 StarRocks FE 配置 `prepared_transaction_default_timeout_second`(默认为 86400 秒)。请参阅 [StarRocks 事务超时管理](./Stream_Load_transaction_interface.md#transaction-timeout-management)。 - -## Flink 和 StarRocks 之间的数据类型映射 - -| Flink 数据类型 | StarRocks 数据类型 | -|-----------------------------------|-----------------------| -| BOOLEAN | BOOLEAN | -| TINYINT | TINYINT | -| SMALLINT | SMALLINT | -| INTEGER | INTEGER | -| BIGINT | BIGINT | -| FLOAT | FLOAT | -| DOUBLE | DOUBLE | -| DECIMAL | DECIMAL | -| BINARY | INT | -| CHAR | STRING | -| VARCHAR | STRING | -| STRING | STRING | -| DATE | DATE | -| TIMESTAMP_WITHOUT_TIME_ZONE(N) | DATETIME | -| TIMESTAMP_WITH_LOCAL_TIME_ZONE(N) | DATETIME | -| ARRAY<T> | ARRAY<T> | -| MAP<KT,VT> | JSON STRING | -| ROW<arg T...> | JSON STRING | - -## 使用说明 - -### Exactly Once - -- 如果您希望 sink 保证 exactly-once 语义,我们建议您将 StarRocks 升级到 2.5 或更高版本,并将 Flink connector 升级到 1.2.4 或更高版本。 - - 自 Flink connector 1.2.4 起,exactly-once 基于 StarRocks 自 2.4 起提供的 [Stream Load 事务接口](./Stream_Load_transaction_interface.md) 重新设计。与之前基于非事务 Stream Load 接口的实现相比,新实现减少了内存使用和 checkpoint 开销,从而提高了实时性能和加载稳定性。 - - - 如果 StarRocks 的版本早于 2.4 或 Flink connector 的版本早于 1.2.4,则 sink 将自动选择基于 Stream Load 非事务接口的实现。 - -- 保证 exactly-once 的配置 - - - `sink.semantic` 的值需要为 `exactly-once`。 - - - 如果 Flink connector 的版本为 1.2.8 及更高版本,建议指定 `sink.label-prefix` 的值。请注意,标签前缀在 StarRocks 中的所有类型的加载(例如 Flink 作业、Routine Load 和 Broker Load)中必须是唯一的。 - - - 如果指定了标签前缀,Flink connector 将使用该标签前缀来清理在某些 Flink 故障场景中可能生成的长期存在的事务,例如当 checkpoint 仍在进行中时 Flink 作业失败。如果您使用 `SHOW PROC '/transactions//running';` 在 StarRocks 中查看这些长期存在的事务,它们通常处于 `PREPARED` 状态。当 Flink 作业从 checkpoint 恢复时,Flink connector 将根据标签前缀和 checkpoint 中的一些信息找到这些长期存在的事务,并中止它们。由于用于实现 exactly-once 的两阶段提交机制,Flink connector 无法在 Flink 作业退出时中止它们。当 Flink 作业退出时,Flink connector 尚未收到来自 Flink checkpoint 协调器的通知,说明是否应将事务包含在成功的 checkpoint 中,如果无论如何都中止这些事务,可能会导致数据丢失。您可以在此 [博客文章](https://flink.apache.org/2018/02/28/an-overview-of-end-to-end-exactly-once-processing-in-apache-flink-with-apache-kafka-too/) 中大致了解如何在 Flink 中实现端到端 exactly-once。 - - - 如果未指定标签前缀,则仅在长期存在的事务超时后,StarRocks 才会清理它们。但是,如果在事务超时之前 Flink 作业频繁失败,则正在运行的事务数可能会达到 StarRocks `max_running_txn_num_per_db` 的限制。您可以为 `PREPARED` 事务设置较小的超时时间,以便在未指定标签前缀时更快地过期。请参阅以下有关如何设置 prepared 超时的信息。 - -- 如果您确定 Flink 作业在因停止或持续故障转移而长时间停机后最终将从 checkpoint 或 savepoint 恢复,请相应地调整以下 StarRocks 配置,以避免数据丢失。 - - - 调整 `PREPARED` 事务超时。请参阅以下有关如何设置超时的信息。 - - 超时时间需要大于 Flink 作业的停机时间。否则,包含在成功的 checkpoint 中的长期存在的事务可能会因超时而在您重新启动 Flink 作业之前中止,从而导致数据丢失。 - - 请注意,当您为此配置设置较大的值时,最好指定 `sink.label-prefix` 的值,以便可以根据标签前缀和 checkpoint 中的一些信息清理长期存在的事务,而不是由于超时(这可能会导致数据丢失)。 - - - `label_keep_max_second` 和 `label_keep_max_num`:StarRocks FE 配置,默认值分别为 `259200` 和 `1000`。有关详细信息,请参阅 [FE 配置](./loading_introduction/loading_considerations.md#fe-configurations)。`label_keep_max_second` 的值需要大于 Flink 作业的停机时间。否则,Flink connector 无法通过使用保存在 Flink 的 savepoint 或 checkpoint 中的事务标签来检查 StarRocks 中事务的状态,并确定这些事务是否已提交,这最终可能会导致数据丢失。 - -- 如何设置 PREPARED 事务的超时时间 - - - 对于 Connector 1.2.12+ 和 StarRocks 3.5.4+,您可以通过配置连接器参数 `sink.properties.prepared_timeout` 来设置超时时间。默认情况下,该值未设置,并且会回退到 StarRocks FE 的全局配置 `prepared_transaction_default_timeout_second`(默认值为 `86400`)。 - - - 对于其他版本的 Connector 或 StarRocks,您可以通过配置 StarRocks FE 的全局配置 `prepared_transaction_default_timeout_second`(默认值为 `86400`)来设置超时时间。 - -### Flush 策略 - -Flink 
connector 将在内存中缓冲数据,并通过 Stream Load 将它们批量刷新到 StarRocks。在 at-least-once 和 exactly-once 之间,触发刷新的方式有所不同。 - -对于 at-least-once,当满足以下任何条件时,将触发刷新: - -- 缓冲行的字节数达到限制 `sink.buffer-flush.max-bytes` -- 缓冲行的数量达到限制 `sink.buffer-flush.max-rows`。(仅对 sink 版本 V1 有效) -- 自上次刷新以来经过的时间达到限制 `sink.buffer-flush.interval-ms` -- 触发 checkpoint - -对于 exactly-once,仅当触发 checkpoint 时才会发生刷新。 - -### 监控导入指标 - -Flink connector 提供了以下指标来监控导入。 - -| 指标 | 类型 | 描述 | -|--------------------------|---------|------------------------------------------------------------| -| totalFlushBytes | counter | 成功刷新的字节数。 | -| totalFlushRows | counter | 成功刷新的行数。 | -| totalFlushSucceededTimes | counter | 成功刷新数据的次数。 | -| totalFlushFailedTimes | counter | 数据刷新失败的次数。 | -| totalFilteredRows | counter | 过滤的行数,也包含在 totalFlushRows 中。 | - -## 示例 - -以下示例展示了如何使用 Flink connector 通过 Flink SQL 或 Flink DataStream 将数据加载到 StarRocks 表中。 - -### 准备工作 - -#### 创建 StarRocks 表 - -创建一个数据库 `test` 并创建一个主键表 `score_board`。 - -```sql -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### 设置 Flink 环境 - -- 下载 Flink 二进制文件 [Flink 1.15.2](https://archive.apache.org/dist/flink/flink-1.15.2/flink-1.15.2-bin-scala_2.12.tgz),并将其解压缩到目录 `flink-1.15.2`。 -- 下载 [Flink connector 1.2.7](https://repo1.maven.org/maven2/com/starrocks/flink-connector-starrocks/1.2.7_flink-1.15/flink-connector-starrocks-1.2.7_flink-1.15.jar),并将其放入目录 `flink-1.15.2/lib`。 -- 运行以下命令以启动 Flink 集群: - - ```shell - cd flink-1.15.2 - ./bin/start-cluster.sh - ``` - -#### 网络配置 - -确保 Flink 所在的机器可以通过 [`http_port`](../administration/management/FE_configuration.md#http_port)(默认:`8030`)和 [`query_port`](../administration/management/FE_configuration.md#query_port)(默认:`9030`)访问 StarRocks 集群的 FE 节点,并通过 [`be_http_port`](../administration/management/BE_configuration.md#be_http_port)(默认:`8040`)访问 BE 节点。 - -### 使用 Flink SQL 运行 - -- 运行以下命令以启动 Flink SQL 客户端。 - - ```shell - ./bin/sql-client.sh - ``` - -- 创建一个 Flink 表 `score_board`,并通过 Flink SQL 客户端将值插入到该表中。请注意,如果要将数据加载到 StarRocks 的主键表中,则必须在 Flink DDL 中定义主键。对于其他类型的 StarRocks 表,这是可选的。 - - ```SQL - CREATE TABLE `score_board` ( - `id` INT, - `name` STRING, - `score` INT, - PRIMARY KEY (id) NOT ENFORCED - ) WITH ( - 'connector' = 'starrocks', - 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030', - 'load-url' = '127.0.0.1:8030', - 'database-name' = 'test', - - 'table-name' = 'score_board', - 'username' = 'root', - 'password' = '' - ); - - INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'flink', 100); - ``` - -### 使用 Flink DataStream 运行 - -根据输入记录的类型(例如 CSV Java `String`、JSON Java `String` 或自定义 Java 对象),有几种方法可以实现 Flink DataStream 作业。 - -- 输入记录是 CSV 格式的 `String`。有关完整示例,请参阅 [LoadCsvRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream)。 - - ```java - /** - * Generate CSV-format records. Each record has three values separated by "\t". - * These values will be loaded to the columns `id`, `name`, and `score` in the StarRocks table. - */ - String[] records = new String[]{ - "1\tstarrocks-csv\t100", - "2\tflink-csv\t100" - }; - DataStream source = env.fromElements(records); - - /** - * Configure the connector with the required properties. 
- * You also need to add properties "sink.properties.format" and "sink.properties.column_separator" - * to tell the connector the input records are CSV-format, and the column separator is "\t". - * You can also use other column separators in the CSV-format records, - * but remember to modify the "sink.properties.column_separator" correspondingly. - */ - StarRocksSinkOptions options = StarRocksSinkOptions.builder() - .withProperty("jdbc-url", jdbcUrl) - .withProperty("load-url", loadUrl) - .withProperty("database-name", "test") - .withProperty("table-name", "score_board") - .withProperty("username", "root") - .withProperty("password", "") - .withProperty("sink.properties.format", "csv") - .withProperty("sink.properties.column_separator", "\t") - .build(); - // Create the sink with the options. - SinkFunction starRockSink = StarRocksSink.sink(options); - source.addSink(starRockSink); - ``` - -- 输入记录是 JSON 格式的 `String`。有关完整示例,请参阅 [LoadJsonRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream)。 - - ```java - /** - * Generate JSON-format records. - * Each record has three key-value pairs corresponding to the columns `id`, `name`, and `score` in the StarRocks table. - */ - String[] records = new String[]{ - "{\"id\":1, \"name\":\"starrocks-json\", \"score\":100}", - "{\"id\":2, \"name\":\"flink-json\", \"score\":100}", - }; - DataStream source = env.fromElements(records); - - /** - * Configure the connector with the required properties. - * You also need to add properties "sink.properties.format" and "sink.properties.strip_outer_array" - * to tell the connector the input records are JSON-format and to strip the outermost array structure. - */ - StarRocksSinkOptions options = StarRocksSinkOptions.builder() - .withProperty("jdbc-url", jdbcUrl) - .withProperty("load-url", loadUrl) - .withProperty("database-name", "test") - .withProperty("table-name", "score_board") - .withProperty("username", "root") - .withProperty("password", "") - .withProperty("sink.properties.format", "json") - .withProperty("sink.properties.strip_outer_array", "true") - .build(); - // Create the sink with the options. - SinkFunction starRockSink = StarRocksSink.sink(options); - source.addSink(starRockSink); - ``` - -- 输入记录是自定义 Java 对象。有关完整示例,请参阅 [LoadCustomJavaRecords](https://github.com/StarRocks/starrocks-connector-for-apache-flink/tree/cd8086cfedc64d5181785bdf5e89a847dc294c1d/examples/src/main/java/com/starrocks/connector/flink/examples/datastream)。 - - - 在此示例中,输入记录是一个简单的 POJO `RowData`。 - - ```java - public static class RowData { - public int id; - public String name; - public int score; - - public RowData() {} - - public RowData(int id, String name, int score) { - this.id = id; - this.name = name; - this.score = score; - } - } - ``` - - - 主程序如下: - - ```java - // Generate records which use RowData as the container. - RowData[] records = new RowData[]{ - new RowData(1, "starrocks-rowdata", 100), - new RowData(2, "flink-rowdata", 100), - }; - DataStream source = env.fromElements(records); - - // Configure the connector with the required properties. 
- StarRocksSinkOptions options = StarRocksSinkOptions.builder() - .withProperty("jdbc-url", jdbcUrl) - .withProperty("load-url", loadUrl) - .withProperty("database-name", "test") - .withProperty("table-name", "score_board") - .withProperty("username", "root") - .withProperty("password", "") - .build(); - - /** - * The Flink connector will use a Java object array (Object[]) to represent a row to be loaded into the StarRocks table, - * and each element is the value for a column. - * You need to define the schema of the Object[] which matches that of the StarRocks table. - */ - TableSchema schema = TableSchema.builder() - .field("id", DataTypes.INT().notNull()) - .field("name", DataTypes.STRING()) - .field("score", DataTypes.INT()) - // When the StarRocks table is a Primary Key table, you must specify notNull(), for example, DataTypes.INT().notNull(), for the primary key `id`. - .primaryKey("id") - .build(); - // Transform the RowData to the Object[] according to the schema. - RowDataTransformer transformer = new RowDataTransformer(); - // Create the sink with the schema, options, and transformer. - SinkFunction starRockSink = StarRocksSink.sink(schema, options, transformer); - source.addSink(starRockSink); - ``` - - - 主程序中的 `RowDataTransformer` 定义如下: - - ```java - private static class RowDataTransformer implements StarRocksSinkRowBuilder { - - /** - * Set each element of the object array according to the input RowData. - * The schema of the array matches that of the StarRocks table. - */ - @Override - public void accept(Object[] internalRow, RowData rowData) { - internalRow[0] = rowData.id; - internalRow[1] = rowData.name; - internalRow[2] = rowData.score; - // When the StarRocks table is a Primary Key table, you need to set the last element to indicate whether \ No newline at end of file diff --git a/docs/zh/loading/Flink_cdc_load.md b/docs/zh/loading/Flink_cdc_load.md deleted file mode 100644 index ced3f4f..0000000 --- a/docs/zh/loading/Flink_cdc_load.md +++ /dev/null @@ -1,532 +0,0 @@ ---- -displayed_sidebar: docs -keywords: - - MySql - - mysql - - sync - - Flink CDC ---- - -# 从 MySQL 实时同步 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 支持多种方法将数据从 MySQL 实时同步到 StarRocks,从而实现海量数据的低延迟实时分析。 - -本主题介绍如何通过 Apache Flink® 将数据从 MySQL 实时同步到 StarRocks(在几秒钟内)。 - - - -## 原理 - -:::tip - -Flink CDC 用于从 MySQL 到 Flink 的同步。本主题使用版本低于 3.0 的 Flink CDC,因此使用 SMT 来同步表结构。但是,如果使用 Flink CDC 3.0,则无需使用 SMT 将表结构同步到 StarRocks。Flink CDC 3.0 甚至可以同步整个 MySQL 数据库的 schema、分片数据库和表,并且还支持 schema 变更同步。有关详细用法,请参见 [Streaming ELT from MySQL to StarRocks](https://nightlies.apache.org/flink/flink-cdc-docs-stable/docs/get-started/quickstart/mysql-to-starrocks)。 - -::: - -下图说明了整个同步过程。 - -![img](../_assets/4.9.2.png) - -通过 Flink 将 MySQL 实时同步到 StarRocks 分为两个阶段实现:同步数据库和表结构以及同步数据。首先,SMT 将 MySQL 数据库和表结构转换为 StarRocks 的表创建语句。然后,Flink 集群运行 Flink 作业,将 MySQL 的完整数据和增量数据同步到 StarRocks。 - -:::info - -同步过程保证了 exactly-once 语义。 - -::: - -**同步过程**: - -1. 同步数据库和表结构。 - - SMT 读取要同步的 MySQL 数据库和表的 schema,并生成用于在 StarRocks 中创建目标数据库和表的 SQL 文件。此操作基于 SMT 配置文件中的 MySQL 和 StarRocks 信息。 - -2. 同步数据。 - - a. Flink SQL 客户端执行数据导入语句 `INSERT INTO SELECT`,以将一个或多个 Flink 作业提交到 Flink 集群。 - - b. Flink 集群运行 Flink 作业以获取数据。Flink CDC 连接器首先从源数据库读取完整的历史数据,然后无缝切换到增量读取,并将数据发送到 flink-connector-starrocks。 - - c. 
flink-connector-starrocks 在 mini-batch 中累积数据,并将每批数据同步到 StarRocks。 - - :::info - - 只有 MySQL 中的数据操作语言 (DML) 操作可以同步到 StarRocks。数据定义语言 (DDL) 操作无法同步。 - - ::: - -## 适用场景 - -从 MySQL 实时同步具有广泛的用例,其中数据不断变化。以“商品销售实时排名”的实际用例为例。 - -Flink 基于 MySQL 中原始订单表计算商品销售的实时排名,并将排名实时同步到 StarRocks 的主键表。用户可以将可视化工具连接到 StarRocks,以实时查看排名,从而获得按需运营洞察。 - -## 前提条件 - -### 下载并安装同步工具 - -要从 MySQL 同步数据,您需要安装以下工具:SMT、Flink、Flink CDC 连接器和 flink-connector-starrocks。 - -1. 下载并安装 Flink,然后启动 Flink 集群。您也可以按照 [Flink 官方文档](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/try-flink/local_installation/) 中的说明执行此步骤。 - - a. 在运行 Flink 之前,请在操作系统中安装 Java 8 或 Java 11。您可以运行以下命令来检查已安装的 Java 版本。 - - ```Bash - # 查看 Java 版本。 - java -version - - # 如果返回以下输出,则表示已安装 Java 8。 - java version "1.8.0_301" - Java(TM) SE Runtime Environment (build 1.8.0_301-b09) - Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode) - ``` - - b. 下载 [Flink 安装包](https://flink.apache.org/downloads.html) 并解压缩。建议您使用 Flink 1.14 或更高版本。允许的最低版本为 Flink 1.11。本主题使用 Flink 1.14.5。 - - ```Bash - # 下载 Flink。 - wget https://archive.apache.org/dist/flink/flink-1.14.5/flink-1.14.5-bin-scala_2.11.tgz - # 解压缩 Flink。 - tar -xzf flink-1.14.5-bin-scala_2.11.tgz - # 进入 Flink 目录。 - cd flink-1.14.5 - ``` - - c. 启动 Flink 集群。 - - ```Bash - # 启动 Flink 集群。 - ./bin/start-cluster.sh - - # 如果返回以下输出,则表示已启动 Flink 集群。 - Starting cluster. - Starting standalonesession daemon on host. - Starting taskexecutor daemon on host. - ``` - -2. 下载 [Flink CDC connector](https://github.com/ververica/flink-cdc-connectors/releases)。本主题使用 MySQL 作为数据源,因此下载 `flink-sql-connector-mysql-cdc-x.x.x.jar`。连接器版本必须与 [Flink](https://github.com/ververica/flink-cdc-connectors/releases) 版本匹配。本主题使用 Flink 1.14.5,您可以下载 `flink-sql-connector-mysql-cdc-2.2.0.jar`。 - - ```Bash - wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.1.1/flink-sql-connector-mysql-cdc-2.2.0.jar - ``` - -3. 下载 [flink-connector-starrocks](https://search.maven.org/artifact/com.starrocks/flink-connector-starrocks)。该版本必须与 Flink 版本匹配。 - - > flink-connector-starrocks 包 `x.x.x_flink-y.yy _ z.zz.jar` 包含三个版本号: - > - > - `x.x.x` 是 flink-connector-starrocks 的版本号。 - > - `y.yy` 是支持的 Flink 版本。 - > - `z.zz` 是 Flink 支持的 Scala 版本。如果 Flink 版本为 1.14.x 或更早版本,则必须下载具有 Scala 版本的包。 - > - > 本主题使用 Flink 1.14.5 和 Scala 2.11。因此,您可以下载以下包:`1.2.3_flink-14_2.11.jar`。 - -4. 将 Flink CDC 连接器 (`flink-sql-connector-mysql-cdc-2.2.0.jar`) 和 flink-connector-starrocks (`1.2.3_flink-1.14_2.11.jar`) 的 JAR 包移动到 Flink 的 `lib` 目录。 - - > **注意** - > - > 如果您的系统中已在运行 Flink 集群,则必须停止 Flink 集群并重新启动它才能加载和验证 JAR 包。 - > - > ```Bash - > $ ./bin/stop-cluster.sh - > $ ./bin/start-cluster.sh - > ``` - -5. 下载并解压缩 [SMT 包](https://www.starrocks.io/download/community),并将其放置在 `flink-1.14.5` 目录中。StarRocks 提供适用于 Linux x86 和 macos ARM64 的 SMT 包。您可以根据您的操作系统和 CPU 选择一个。 - - ```Bash - # for Linux x86 - wget https://releases.starrocks.io/resources/smt.tar.gz - # for macOS ARM64 - wget https://releases.starrocks.io/resources/smt_darwin_arm64.tar.gz - ``` - -### 启用 MySQL 二进制日志 - -要从 MySQL 实时同步数据,系统需要从 MySQL 二进制日志 (binlog) 中读取数据,解析数据,然后将数据同步到 StarRocks。确保已启用 MySQL 二进制日志。 - -1. 
编辑 MySQL 配置文件 `my.cnf`(默认路径:`/etc/my.cnf`)以启用 MySQL 二进制日志。 - - ```Bash - # 启用 MySQL Binlog。 - log_bin = ON - # 配置 Binlog 的保存路径。 - log_bin =/var/lib/mysql/mysql-bin - # 配置 server_id。 - # 如果未为 MySQL 5.7.3 或更高版本配置 server_id,则无法使用 MySQL 服务。 - server_id = 1 - # 将 Binlog 格式设置为 ROW。 - binlog_format = ROW - # Binlog 文件的基本名称。附加一个标识符以标识每个 Binlog 文件。 - log_bin_basename =/var/lib/mysql/mysql-bin - # Binlog 文件的索引文件,用于管理所有 Binlog 文件的目录。 - log_bin_index =/var/lib/mysql/mysql-bin.index - ``` - -2. 运行以下命令之一以重新启动 MySQL,以使修改后的配置文件生效。 - - ```Bash - # 使用 service 重新启动 MySQL。 - service mysqld restart - # 使用 mysqld 脚本重新启动 MySQL。 - /etc/init.d/mysqld restart - ``` - -3. 连接到 MySQL 并检查是否已启用 MySQL 二进制日志。 - - ```Plain - -- 连接到 MySQL。 - mysql -h xxx.xx.xxx.xx -P 3306 -u root -pxxxxxx - - -- 检查是否已启用 MySQL 二进制日志。 - mysql> SHOW VARIABLES LIKE 'log_bin'; - +---------------+-------+ - | Variable_name | Value | - +---------------+-------+ - | log_bin | ON | - +---------------+-------+ - 1 row in set (0.00 sec) - ``` - -## 同步数据库和表结构 - -1. 编辑 SMT 配置文件。 - 转到 SMT `conf` 目录并编辑配置文件 `config_prod.conf`,例如 MySQL 连接信息、要同步的数据库和表的匹配规则以及 flink-connector-starrocks 的配置信息。 - - ```Bash - [db] - type = mysql - host = xxx.xx.xxx.xx - port = 3306 - user = user1 - password = xxxxxx - - [other] - # StarRocks 中 BE 的数量 - be_num = 3 - # StarRocks-1.18.1 及更高版本支持 `decimal_v3`。 - use_decimal_v3 = true - # 用于保存转换后的 DDL SQL 的文件 - output_dir = ./result - - [table-rule.1] - # 用于匹配数据库以设置属性的模式 - database = ^demo.*$ - # 用于匹配表以设置属性的模式 - table = ^.*$ - - ############################################ - ### Flink sink configurations - ### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated. - ############################################ - flink.starrocks.jdbc-url=jdbc:mysql://: - flink.starrocks.load-url= : - flink.starrocks.username=user2 - flink.starrocks.password=xxxxxx - flink.starrocks.sink.properties.format=csv - flink.starrocks.sink.properties.column_separator=\x01 - flink.starrocks.sink.properties.row_delimiter=\x02 - flink.starrocks.sink.buffer-flush.interval-ms=15000 - ``` - - - `[db]`: 用于访问源数据库的信息。 - - `type`: 源数据库的类型。在本主题中,源数据库为 `mysql`。 - - `host`: MySQL 服务器的 IP 地址。 - - `port`: MySQL 数据库的端口号,默认为 `3306` - - `user`: 用于访问 MySQL 数据库的用户名 - - `password`: 用户名的密码 - - - `[table-rule]`: 数据库和表匹配规则以及相应的 flink-connector-starrocks 配置。 - - - `Database`、`table`: MySQL 中数据库和表的名称。支持正则表达式。 - - `flink.starrocks.*`: flink-connector-starrocks 的配置信息。有关更多配置和信息,请参见 [flink-connector-starrocks](../loading/Flink-connector-starrocks.md)。 - - > 如果您需要为不同的表使用不同的 flink-connector-starrocks 配置。例如,如果某些表经常更新,并且您需要加速数据导入,请参见 [为不同的表使用不同的 flink-connector-starrocks 配置](#use-different-flink-connector-starrocks-configurations-for-different-tables)。如果您需要将从 MySQL 分片获得的多个表加载到同一个 StarRocks 表中,请参见 [将 MySQL 分片后的多个表同步到一个 StarRocks 表中](#synchronize-multiple-tables-after-mysql-sharding-to-one-table-in-starrocks)。 - - - `[other]`: 其他信息 - - `be_num`: StarRocks 集群中 BE 的数量(此参数将用于在后续 StarRocks 表创建中设置合理的 tablet 数量)。 - - `use_decimal_v3`: 是否启用 [Decimal V3](../sql-reference/data-types/numeric/DECIMAL.md)。启用 Decimal V3 后,将数据同步到 StarRocks 时,MySQL decimal 数据将转换为 Decimal V3 数据。 - - `output_dir`: 用于保存要生成的 SQL 文件的路径。SQL 文件将用于在 StarRocks 中创建数据库和表,并将 Flink 作业提交到 Flink 集群。默认路径为 `./result`,建议您保留默认设置。 - -2. 
运行 SMT 以读取 MySQL 中的数据库和表结构,并根据配置文件在 `./result` 目录中生成 SQL 文件。`starrocks-create.all.sql` 文件用于在 StarRocks 中创建数据库和表,`flink-create.all.sql` 文件用于将 Flink 作业提交到 Flink 集群。 - - ```Bash - # 运行 SMT。 - ./starrocks-migrate-tool - - # 转到 result 目录并检查此目录中的文件。 - cd result - ls result - flink-create.1.sql smt.tar.gz starrocks-create.all.sql - flink-create.all.sql starrocks-create.1.sql - ``` - -3. 运行以下命令以连接到 StarRocks 并执行 `starrocks-create.all.sql` 文件,以在 StarRocks 中创建数据库和表。建议您使用 SQL 文件中的默认表创建语句来创建 [主键表](../table_design/table_types/primary_key_table.md)。 - - > **注意** - > - > 您还可以根据您的业务需求修改表创建语句,并创建一个不使用主键表的表。但是,源 MySQL 数据库中的 DELETE 操作无法同步到非主键表。创建此类表时请谨慎。 - - ```Bash - mysql -h -P -u user2 -pxxxxxx < starrocks-create.all.sql - ``` - - 如果数据需要在写入目标 StarRocks 表之前由 Flink 处理,则源表和目标表之间的表结构将不同。在这种情况下,您必须修改表创建语句。在此示例中,目标表仅需要 `product_id` 和 `product_name` 列以及商品销售的实时排名。您可以使用以下表创建语句。 - - ```Bash - CREATE DATABASE IF NOT EXISTS `demo`; - - CREATE TABLE IF NOT EXISTS `demo`.`orders` ( - `product_id` INT(11) NOT NULL COMMENT "", - `product_name` STRING NOT NULL COMMENT "", - `sales_cnt` BIGINT NOT NULL COMMENT "" - ) ENGINE=olap - PRIMARY KEY(`product_id`) - DISTRIBUTED BY HASH(`product_id`) - PROPERTIES ( - "replication_num" = "3" - ); - ``` - - > **注意** - > - > 从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -## 同步数据 - -运行 Flink 集群并提交 Flink 作业,以将 MySQL 中的完整数据和增量数据持续同步到 StarRocks。 - -1. 转到 Flink 目录并运行以下命令,以在 Flink SQL 客户端上运行 `flink-create.all.sql` 文件。 - - ```Bash - ./bin/sql-client.sh -f flink-create.all.sql - ``` - - 此 SQL 文件定义了动态表 `source table` 和 `sink table`、查询语句 `INSERT INTO SELECT`,并指定了连接器、源数据库和目标数据库。执行此文件后,会将 Flink 作业提交到 Flink 集群以启动数据同步。 - - > **注意** - > - > - 确保已启动 Flink 集群。您可以通过运行 `flink/bin/start-cluster.sh` 来启动 Flink 集群。 - > - 如果您的 Flink 版本早于 1.13,您可能无法直接运行 SQL 文件 `flink-create.all.sql`。您需要在 SQL 客户端的命令行界面 (CLI) 中逐个执行此文件中的 SQL 语句。您还需要转义 `\` 字符。 - > - > ```Bash - > 'sink.properties.column_separator' = '\\x01' - > 'sink.properties.row_delimiter' = '\\x02' - > ``` - - **在同步期间处理数据**: - - 如果您需要在同步期间处理数据,例如对数据执行 GROUP BY 或 JOIN,您可以修改 `flink-create.all.sql` 文件。以下示例通过执行 COUNT (*) 和 GROUP BY 来计算商品销售的实时排名。 - - ```Bash - $ ./bin/sql-client.sh -f flink-create.all.sql - No default environment is specified. - Searching for '/home/disk1/flink-1.13.6/conf/sql-client-defaults.yaml'...not found. - [INFO] Executing SQL from file. - - Flink SQL> CREATE DATABASE IF NOT EXISTS `default_catalog`.`demo`; - [INFO] Execute statement succeed. - - -- 基于 MySQL 中的 order 表创建一个动态表 `source table`。 - Flink SQL> - CREATE TABLE IF NOT EXISTS `default_catalog`.`demo`.`orders_src` (`order_id` BIGINT NOT NULL, - `product_id` INT NULL, - `order_date` TIMESTAMP NOT NULL, - `customer_name` STRING NOT NULL, - `product_name` STRING NOT NULL, - `price` DECIMAL(10, 5) NULL, - PRIMARY KEY(`order_id`) - NOT ENFORCED - ) with ('connector' = 'mysql-cdc', - 'hostname' = 'xxx.xx.xxx.xxx', - 'port' = '3306', - 'username' = 'root', - 'password' = '', - 'database-name' = 'demo', - 'table-name' = 'orders' - ); - [INFO] Execute statement succeed. 
- - -- 创建一个动态表 `sink table`。 - Flink SQL> - CREATE TABLE IF NOT EXISTS `default_catalog`.`demo`.`orders_sink` (`product_id` INT NOT NULL, - `product_name` STRING NOT NULL, - `sales_cnt` BIGINT NOT NULL, - PRIMARY KEY(`product_id`) - NOT ENFORCED - ) with ('sink.max-retries' = '10', - 'jdbc-url' = 'jdbc:mysql://:', - 'password' = '', - 'sink.properties.strip_outer_array' = 'true', - 'sink.properties.format' = 'json', - 'load-url' = ':', - 'username' = 'root', - 'sink.buffer-flush.interval-ms' = '15000', - 'connector' = 'starrocks', - 'database-name' = 'demo', - 'table-name' = 'orders' - ); - [INFO] Execute statement succeed. - - -- 实现商品销售的实时排名,其中 `sink table` 动态更新以反映 `source table` 中的数据更改。 - Flink SQL> - INSERT INTO `default_catalog`.`demo`.`orders_sink` select product_id,product_name, count(*) as cnt from `default_catalog`.`demo`.`orders_src` group by product_id,product_name; - [INFO] Submitting SQL update statement to the cluster... - [INFO] SQL update statement has been successfully submitted to the cluster: - Job ID: 5ae005c4b3425d8bb13fe660260a35da - ``` - - 如果您只需要同步一部分数据,例如付款时间晚于 2021 年 12 月 21 日的数据,您可以使用 `INSERT INTO SELECT` 中的 `WHERE` 子句来设置筛选条件,例如 `WHERE pay_dt > '2021-12-21'`。不满足此条件的数据将不会同步到 StarRocks。 - - 如果返回以下结果,则表示已提交 Flink 作业以进行完整和增量同步。 - - ```SQL - [INFO] Submitting SQL update statement to the cluster... - [INFO] SQL update statement has been successfully submitted to the cluster: - Job ID: 5ae005c4b3425d8bb13fe660260a35da - ``` - -2. 您可以使用 [Flink WebUI](https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/flink-operations-playground/#flink-webui) 或在 Flink SQL 客户端上运行 `bin/flink list -running` 命令,以查看 Flink 集群中正在运行的 Flink 作业和作业 ID。 - - - Flink WebUI - ![img](../_assets/4.9.3.png) - - - `bin/flink list -running` - - ```Bash - $ bin/flink list -running - Waiting for response... - ------------------ Running/Restarting Jobs ------------------- - 13.10.2022 15:03:54 : 040a846f8b58e82eb99c8663424294d5 : insert-into_default_catalog.lily.example_tbl1_sink (RUNNING) - -------------------------------------------------------------- - ``` - - > **注意** - > - > 如果作业异常,您可以使用 Flink WebUI 或通过查看 Flink 1.14.5 的 `/log` 目录中的日志文件来执行故障排除。 - -## 常见问题 - -### 为不同的表使用不同的 flink-connector-starrocks 配置 - -如果数据源中的某些表经常更新,并且您想要加速 flink-connector-starrocks 的加载速度,则必须在 SMT 配置文件 `config_prod.conf` 中为每个表设置单独的 flink-connector-starrocks 配置。 - -```Bash -[table-rule.1] -# 用于匹配数据库以设置属性的模式 -database = ^order.*$ -# 用于匹配表以设置属性的模式 -table = ^.*$ - -############################################ -### Flink sink configurations -### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated -############################################ -flink.starrocks.jdbc-url=jdbc:mysql://: -flink.starrocks.load-url= : -flink.starrocks.username=user2 -flink.starrocks.password=xxxxxx -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator=\x01 -flink.starrocks.sink.properties.row_delimiter=\x02 -flink.starrocks.sink.buffer-flush.interval-ms=15000 - -[table-rule.2] -# 用于匹配数据库以设置属性的模式 -database = ^order2.*$ -# 用于匹配表以设置属性的模式 -table = ^.*$ - -############################################ -### Flink sink configurations -### DO NOT set `connector`, `table-name`, `database-name`. 
They are auto-generated -############################################ -flink.starrocks.jdbc-url=jdbc:mysql://: -flink.starrocks.load-url= : -flink.starrocks.username=user2 -flink.starrocks.password=xxxxxx -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator=\x01 -flink.starrocks.sink.properties.row_delimiter=\x02 -flink.starrocks.sink.buffer-flush.interval-ms=10000 -``` - -### 将 MySQL 分片后的多个表同步到一个 StarRocks 表中 - -执行分片后,一个 MySQL 表中的数据可能会拆分为多个表,甚至分布到多个数据库中。所有表都具有相同的 schema。在这种情况下,您可以设置 `[table-rule]` 以将这些表同步到一个 StarRocks 表中。例如,MySQL 有两个数据库 `edu_db_1` 和 `edu_db_2`,每个数据库都有两个表 `course_1 和 course_2`,并且所有表的 schema 都相同。您可以使用以下 `[table-rule]` 配置将所有表同步到一个 StarRocks 表中。 - -> **注意** -> -> StarRocks 表的名称默认为 `course__auto_shard`。如果您需要使用其他名称,您可以在 SQL 文件 `starrocks-create.all.sql` 和 `flink-create.all.sql` 中修改它。 - -```Bash -[table-rule.1] -# 用于匹配数据库以设置属性的模式 -database = ^edu_db_[0-9]*$ -# 用于匹配表以设置属性的模式 -table = ^course_[0-9]*$ - -############################################ -### Flink sink configurations -### DO NOT set `connector`, `table-name`, `database-name`. They are auto-generated -############################################ -flink.starrocks.jdbc-url = jdbc: mysql://xxx.xxx.x.x:xxxx -flink.starrocks.load-url = xxx.xxx.x.x:xxxx -flink.starrocks.username = user2 -flink.starrocks.password = xxxxxx -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator =\x01 -flink.starrocks.sink.properties.row_delimiter =\x02 -flink.starrocks.sink.buffer-flush.interval-ms = 5000 -``` - -### 导入 JSON 格式的数据 - -在前面的示例中,数据以 CSV 格式导入。如果您无法选择合适的分隔符,则需要替换 `[table-rule]` 中 `flink.starrocks.*` 的以下参数。 - -```Plain -flink.starrocks.sink.properties.format=csv -flink.starrocks.sink.properties.column_separator =\x01 -flink.starrocks.sink.properties.row_delimiter =\x02 -``` - -传入以下参数后,将以 JSON 格式导入数据。 - -```Plain -flink.starrocks.sink.properties.format=json -flink.starrocks.sink.properties.strip_outer_array=true -``` - -> **注意** -> -> 此方法会稍微降低加载速度。 - -### 将多个 INSERT INTO 语句作为一个 Flink 作业执行 - -您可以使用 `flink-create.all.sql` 文件中的 [STATEMENT SET](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sqlclient/#execute-a-set-of-sql-statements) 语法将多个 INSERT INTO 语句作为一个 Flink 作业执行,这可以防止多个语句占用过多的 Flink 作业资源,并提高执行多个查询的效率。 - -> **注意** -> -> Flink 从 1.13 开始支持 STATEMENT SET 语法。 - -1. 打开 `result/flink-create.all.sql` 文件。 - -2. 
修改文件中的 SQL 语句。将所有 INSERT INTO 语句移动到文件末尾。将 `EXECUTE STATEMENT SET BEGIN` 放在第一个 INSERT INTO 语句之前,并将 `END;` 放在最后一个 INSERT INTO 语句之后。 - -> **注意** -> -> CREATE DATABASE 和 CREATE TABLE 的位置保持不变。 - -```SQL -CREATE DATABASE IF NOT EXISTS db; -CREATE TABLE IF NOT EXISTS db.a1; -CREATE TABLE IF NOT EXISTS db.b1; -CREATE TABLE IF NOT EXISTS db.a2; -CREATE TABLE IF NOT EXISTS db.b2; -EXECUTE STATEMENT SET -BEGIN-- one or more INSERT INTO statements -INSERT INTO db.a1 SELECT * FROM db.b1; -INSERT INTO db.a2 SELECT * FROM db.b2; -END; -``` \ No newline at end of file diff --git a/docs/zh/loading/InsertInto.md b/docs/zh/loading/InsertInto.md deleted file mode 100644 index a50ae35..0000000 --- a/docs/zh/loading/InsertInto.md +++ /dev/null @@ -1,642 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 使用 INSERT 导入数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -本主题介绍如何使用 SQL 语句 INSERT 将数据导入到 StarRocks 中。 - -与 MySQL 和许多其他数据库管理系统类似,StarRocks 支持使用 INSERT 将数据导入到内表。您可以使用 VALUES 子句直接插入一行或多行,以测试函数或 DEMO。您还可以通过查询将数据从 [外部表](../data_source/External_table.md) 导入到内表。从 StarRocks v3.1 开始,您可以使用 INSERT 命令和表函数 [FILES()](../sql-reference/sql-functions/table-functions/files.md) 直接从云存储上的文件导入数据。 - -StarRocks v2.4 进一步支持使用 INSERT OVERWRITE 将数据覆盖到表中。INSERT OVERWRITE 语句集成了以下操作来实现覆盖功能: - -1. 根据存储原始数据的分区创建临时分区。 -2. 将数据插入到临时分区中。 -3. 将原始分区与临时分区交换。 - -> **NOTE** -> -> 如果您需要在覆盖数据之前验证数据,可以使用上述步骤覆盖数据,并在交换分区之前对其进行验证,而不是使用 INSERT OVERWRITE。 - -从 v3.4.0 开始,StarRocks 支持一种新的语义 - 用于分区表的 INSERT OVERWRITE 的动态覆盖。有关更多信息,请参见 [动态覆盖](#dynamic-overwrite)。 - -## 注意事项 - -- 您只能通过在 MySQL 客户端中按 **Ctrl** 和 **C** 键来取消同步 INSERT 事务。 -- 您可以使用 [SUBMIT TASK](../sql-reference/sql-statements/loading_unloading/ETL/SUBMIT_TASK.md) 提交异步 INSERT 任务。 -- 对于当前版本的 StarRocks,如果任何行的数据不符合表的 schema,则默认情况下 INSERT 事务将失败。例如,如果任何行中的字段长度超过表中映射字段的长度限制,则 INSERT 事务将失败。您可以将会话变量 `enable_insert_strict` 设置为 `false`,以允许事务通过过滤掉与表不匹配的行来继续执行。 -- 如果您频繁执行 INSERT 语句以将小批量数据导入到 StarRocks 中,则会生成过多的数据版本,严重影响查询性能。我们建议您在生产环境中不要过于频繁地使用 INSERT 命令导入数据,也不要将其用作日常数据导入的例程。如果您的应用程序或分析场景需要单独加载流式数据或小批量数据的解决方案,我们建议您使用 Apache Kafka® 作为数据源,并通过 Routine Load 导入数据。 -- 如果您执行 INSERT OVERWRITE 语句,StarRocks 会为存储原始数据的分区创建临时分区,将新数据插入到临时分区中,并将 [原始分区与临时分区交换](../sql-reference/sql-statements/table_bucket_part_index/ALTER_TABLE.md#use-a-temporary-partition-to-replace-the-current-partition)。所有这些操作都在 FE Leader 节点中执行。因此,如果在执行 INSERT OVERWRITE 命令时 FE Leader 节点崩溃,则整个导入事务将失败,并且临时分区将被截断。 - -## 准备工作 - -### 检查权限 - - - -### 创建对象 - -创建一个名为 `load_test` 的数据库,并创建一个表 `insert_wiki_edit` 作为目标表,并创建一个表 `source_wiki_edit` 作为源表。 - -> **NOTE** -> -> 本主题中演示的示例基于表 `insert_wiki_edit` 和表 `source_wiki_edit`。如果您喜欢使用自己的表和数据,则可以跳过准备工作并继续执行下一步。 - -```SQL -CREATE DATABASE IF NOT EXISTS load_test; -USE load_test; -CREATE TABLE insert_wiki_edit -( - event_time DATETIME, - channel VARCHAR(32) DEFAULT '', - user VARCHAR(128) DEFAULT '', - is_anonymous TINYINT DEFAULT '0', - is_minor TINYINT DEFAULT '0', - is_new TINYINT DEFAULT '0', - is_robot TINYINT DEFAULT '0', - is_unpatrolled TINYINT DEFAULT '0', - delta INT DEFAULT '0', - added INT DEFAULT '0', - deleted INT DEFAULT '0' -) -DUPLICATE KEY( - event_time, - channel, - user, - is_anonymous, - is_minor, - is_new, - is_robot, - is_unpatrolled -) -PARTITION BY RANGE(event_time)( - PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'), - PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'), - PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'), - PARTITION p24 VALUES LESS THAN ('2015-09-13 00:00:00') -) -DISTRIBUTED BY HASH(user); - -CREATE TABLE source_wiki_edit -( - event_time 
DATETIME, - channel VARCHAR(32) DEFAULT '', - user VARCHAR(128) DEFAULT '', - is_anonymous TINYINT DEFAULT '0', - is_minor TINYINT DEFAULT '0', - is_new TINYINT DEFAULT '0', - is_robot TINYINT DEFAULT '0', - is_unpatrolled TINYINT DEFAULT '0', - delta INT DEFAULT '0', - added INT DEFAULT '0', - deleted INT DEFAULT '0' -) -DUPLICATE KEY( - event_time, - channel,user, - is_anonymous, - is_minor, - is_new, - is_robot, - is_unpatrolled -) -PARTITION BY RANGE(event_time)( - PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'), - PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'), - PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'), - PARTITION p24 VALUES LESS THAN ('2015-09-13 00:00:00') -) -DISTRIBUTED BY HASH(user); -``` - -> **NOTICE** -> -> 从 v2.5.7 开始,当您创建表或添加分区时,StarRocks 可以自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -## 通过 INSERT INTO VALUES 导入数据 - -您可以使用 INSERT INTO VALUES 命令将一行或多行追加到特定表中。多行用逗号 (,) 分隔。有关详细说明和参数参考,请参见 [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)。 - -> **CAUTION** -> -> 通过 INSERT INTO VALUES 导入数据仅适用于需要使用小型数据集验证 DEMO 的情况。不建议用于大规模测试或生产环境。要将海量数据导入到 StarRocks 中,请参见 [导入选项](Loading_intro.md),以获取适合您场景的其他选项。 - -以下示例使用标签 `insert_load_wikipedia` 将两行数据插入到数据源表 `source_wiki_edit` 中。标签是数据库中每个数据导入事务的唯一标识标签。 - -```SQL -INSERT INTO source_wiki_edit -WITH LABEL insert_load_wikipedia -VALUES - ("2015-09-12 00:00:00","#en.wikipedia","AustinFF",0,0,0,0,0,21,5,0), - ("2015-09-12 00:00:00","#ca.wikipedia","helloSR",0,1,0,1,0,3,23,0); -``` - -## 通过 INSERT INTO SELECT 导入数据 - -您可以通过 INSERT INTO SELECT 命令将对数据源表执行查询的结果加载到目标表中。INSERT INTO SELECT 命令对来自数据源表的数据执行 ETL 操作,并将数据加载到 StarRocks 的内表中。数据源可以是一个或多个内部或外部表,甚至是云存储上的数据文件。目标表必须是 StarRocks 中的内表。有关详细说明和参数参考,请参见 [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)。 - -### 将数据从内部表或外部表导入到内部表 - -> **NOTE** -> -> 从外部表导入数据与从内部表导入数据相同。为简单起见,我们仅在以下示例中演示如何从内部表导入数据。 - -- 以下示例将数据从源表导入到目标表 `insert_wiki_edit`。 - -```SQL -INSERT INTO insert_wiki_edit -WITH LABEL insert_load_wikipedia_1 -SELECT * FROM source_wiki_edit; -``` - -- 以下示例将数据从源表导入到目标表 `insert_wiki_edit` 的 `p06` 和 `p12` 分区。如果未指定分区,则数据将导入到所有分区。否则,数据将仅导入到指定的分区中。 - -```SQL -INSERT INTO insert_wiki_edit PARTITION(p06, p12) -WITH LABEL insert_load_wikipedia_2 -SELECT * FROM source_wiki_edit; -``` - -查询目标表以确保其中有数据。 - -```Plain text -MySQL > select * from insert_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.00 sec) -``` - -如果截断 `p06` 和 `p12` 分区,则查询中不会返回数据。 - -```Plain -MySQL > TRUNCATE TABLE insert_wiki_edit PARTITION(p06, p12); -Query OK, 0 rows affected (0.01 sec) - -MySQL > select * from insert_wiki_edit; -Empty set (0.00 sec) -``` - -- 以下示例将源表中的 `event_time` 和 `channel` 列导入到目标表 `insert_wiki_edit`。默认值用于此处未指定的列。 - -```SQL -INSERT INTO 
insert_wiki_edit -WITH LABEL insert_load_wikipedia_3 -( - event_time, - channel -) -SELECT event_time, channel FROM source_wiki_edit; -``` - -:::note -从 v3.3.1 开始,在主键表上使用 INSERT INTO 语句指定列列表将执行部分更新(而不是早期版本中的完全 Upsert)。如果未指定列列表,系统将执行完全 Upsert。 -::: - -### 使用 FILES() 直接从外部源的文件导入数据 - -从 v3.1 开始,StarRocks 支持使用 INSERT 命令和 [FILES()](../sql-reference/sql-functions/table-functions/files.md) 函数直接从云存储上的文件导入数据,因此您无需先创建 external catalog 或文件外部表。此外,FILES() 可以自动推断文件的表 schema,从而大大简化了数据导入过程。 - -以下示例将 AWS S3 bucket `inserttest` 中的 Parquet 文件 **parquet/insert_wiki_edit_append.parquet** 中的数据行插入到表 `insert_wiki_edit` 中: - -```Plain -INSERT INTO insert_wiki_edit - SELECT * FROM FILES( - "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet", - "format" = "parquet", - "aws.s3.access_key" = "XXXXXXXXXX", - "aws.s3.secret_key" = "YYYYYYYYYY", - "aws.s3.region" = "us-west-2" -); -``` - -## 通过 INSERT OVERWRITE VALUES 覆盖数据 - -您可以使用 INSERT OVERWRITE VALUES 命令使用一行或多行覆盖特定表。多行用逗号 (,) 分隔。有关详细说明和参数参考,请参见 [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)。 - -> **CAUTION** -> -> 通过 INSERT OVERWRITE VALUES 覆盖数据仅适用于需要使用小型数据集验证 DEMO 的情况。不建议用于大规模测试或生产环境。要将海量数据导入到 StarRocks 中,请参见 [导入选项](Loading_intro.md),以获取适合您场景的其他选项。 - -查询源表和目标表以确保其中有数据。 - -```Plain -MySQL > SELECT * FROM source_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.02 sec) - -MySQL > SELECT * FROM insert_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.01 sec) -``` - -以下示例使用两个新行覆盖源表 `source_wiki_edit`。 - -```SQL -INSERT OVERWRITE source_wiki_edit -WITH LABEL insert_load_wikipedia_ow -VALUES - ("2015-09-12 00:00:00","#cn.wikipedia","GELongstreet",0,0,0,0,0,36,36,0), - ("2015-09-12 00:00:00","#fr.wikipedia","PereBot",0,1,0,1,0,17,17,0); -``` - -## 通过 INSERT OVERWRITE SELECT 覆盖数据 - -您可以使用 INSERT OVERWRITE SELECT 命令使用对数据源表执行查询的结果覆盖表。INSERT OVERWRITE SELECT 语句对来自一个或多个内部或外部表的数据执行 ETL 操作,并使用该数据覆盖内部表。有关详细说明和参数参考,请参见 [SQL Reference - INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)。 - -> **NOTE** -> -> 从外部表导入数据与从内部表导入数据相同。为简单起见,我们仅在以下示例中演示如何使用来自内部表的数据覆盖目标表。 - -查询源表和目标表以确保它们包含不同的数据行。 - -```Plain -MySQL > SELECT * FROM source_wiki_edit; 
-+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #cn.wikipedia | GELongstreet | 0 | 0 | 0 | 0 | 0 | 36 | 36 | 0 | -| 2015-09-12 00:00:00 | #fr.wikipedia | PereBot | 0 | 1 | 0 | 1 | 0 | 17 | 17 | 0 | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.02 sec) - -MySQL > SELECT * FROM insert_wiki_edit; -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #en.wikipedia | AustinFF | 0 | 0 | 0 | 0 | 0 | 21 | 5 | 0 | -| 2015-09-12 00:00:00 | #ca.wikipedia | helloSR | 0 | 1 | 0 | 1 | 0 | 3 | 23 | 0 | -+---------------------+---------------+----------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.01 sec) -``` - -- 以下示例使用来自源表的数据覆盖表 `insert_wiki_edit`。 - -```SQL -INSERT OVERWRITE insert_wiki_edit -WITH LABEL insert_load_wikipedia_ow_1 -SELECT * FROM source_wiki_edit; -``` - -- 以下示例使用来自源表的数据覆盖表 `insert_wiki_edit` 的 `p06` 和 `p12` 分区。 - -```SQL -INSERT OVERWRITE insert_wiki_edit PARTITION(p06, p12) -WITH LABEL insert_load_wikipedia_ow_2 -SELECT * FROM source_wiki_edit; -``` - -查询目标表以确保其中有数据。 - -```plain text -MySQL > select * from insert_wiki_edit; -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| event_time | channel | user | is_anonymous | is_minor | is_new | is_robot | is_unpatrolled | delta | added | deleted | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -| 2015-09-12 00:00:00 | #fr.wikipedia | PereBot | 0 | 1 | 0 | 1 | 0 | 17 | 17 | 0 | -| 2015-09-12 00:00:00 | #cn.wikipedia | GELongstreet | 0 | 0 | 0 | 0 | 0 | 36 | 36 | 0 | -+---------------------+---------------+--------------+--------------+----------+--------+----------+----------------+-------+-------+---------+ -2 rows in set (0.01 sec) -``` - -如果截断 `p06` 和 `p12` 分区,则查询中不会返回数据。 - -```Plain -MySQL > TRUNCATE TABLE insert_wiki_edit PARTITION(p06, p12); -Query OK, 0 rows affected (0.01 sec) - -MySQL > select * from insert_wiki_edit; -Empty set (0.00 sec) -``` - -:::note -对于使用 `PARTITION BY column` 策略的表,INSERT OVERWRITE 支持通过指定分区键的值在目标表中创建新分区。现有分区照常覆盖。 - -以下示例创建分区表 `activity`,并在表中创建新分区,同时将数据插入到该分区中: - -```SQL -CREATE TABLE activity ( -id INT NOT NULL, -dt VARCHAR(10) NOT NULL -) ENGINE=OLAP -DUPLICATE KEY(`id`) -PARTITION BY (`id`, `dt`) -DISTRIBUTED BY HASH(`id`); - -INSERT OVERWRITE activity -PARTITION(id='4', dt='2022-01-01') -WITH LABEL insert_activity_auto_partition -VALUES ('4', '2022-01-01'); -``` - -::: - -- 以下示例使用来自源表的 `event_time` 和 `channel` 列覆盖目标表 `insert_wiki_edit`。默认值分配给未覆盖数据的列。 - -```SQL -INSERT OVERWRITE 
insert_wiki_edit -WITH LABEL insert_load_wikipedia_ow_3 -( - event_time, - channel -) -SELECT event_time, channel FROM source_wiki_edit; -``` - -### 动态覆盖 - -从 v3.4.0 开始,StarRocks 支持一种新的语义 - 用于分区表的 INSERT OVERWRITE 的动态覆盖。 - -目前,INSERT OVERWRITE 的默认行为如下: - -- 当覆盖整个分区表时(即,不指定 PARTITION 子句),新的数据记录将替换其相应分区中的数据。如果存在未涉及的分区,则这些分区将被截断,而其他分区将被覆盖。 -- 当覆盖空分区表(即,其中没有分区)并指定 PARTITION 子句时,系统会返回错误 `ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'xxx' in table 'yyy'`。 -- 当覆盖分区表并在 PARTITION 子句中指定不存在的分区时,系统会返回错误 `ERROR 1064 (HY000): Getting analyzing error. Detail message: Unknown partition 'xxx' in table 'yyy'`。 -- 当使用与 PARTITION 子句中指定的任何分区都不匹配的数据记录覆盖分区表时,系统要么返回错误 `ERROR 1064 (HY000): Insert has filtered data in strict mode`(如果启用了严格模式),要么过滤掉不合格的数据记录(如果禁用了严格模式)。 - -新的动态覆盖语义的行为大不相同: - -当覆盖整个分区表时,新的数据记录将替换其相应分区中的数据。如果存在未涉及的分区,则这些分区将被保留,而不是被截断或删除。并且如果存在与不存在的分区对应的新数据记录,系统将创建该分区。 - -默认情况下禁用动态覆盖语义。要启用它,您需要将系统变量 `dynamic_overwrite` 设置为 `true`。 - -在当前会话中启用动态覆盖: - -```SQL -SET dynamic_overwrite = true; -``` - -您也可以在 INSERT OVERWRITE 语句的 hint 中设置它,以使其仅对该语句生效: - -示例: - -```SQL -INSERT /*+set_var(dynamic_overwrite = true)*/ OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -## 将数据导入到具有生成列的表中 - -生成列是一种特殊的列,其值源自基于其他列的预定义表达式或评估。当您的查询请求涉及对昂贵表达式的评估时,生成列特别有用,例如,从 JSON 值查询某个字段或计算 ARRAY 数据。StarRocks 在将数据加载到表中的同时评估表达式并将结果存储在生成列中,从而避免了查询期间的表达式评估并提高了查询性能。 - -您可以使用 INSERT 将数据导入到具有生成列的表中。 - -以下示例创建一个表 `insert_generated_columns` 并向其中插入一行。该表包含两个生成列:`avg_array` 和 `get_string`。`avg_array` 计算 `data_array` 中 ARRAY 数据的平均值,`get_string` 从 `data_json` 中的 JSON 路径 `a` 提取字符串。 - -```SQL -CREATE TABLE insert_generated_columns ( - id INT(11) NOT NULL COMMENT "ID", - data_array ARRAY NOT NULL COMMENT "ARRAY", - data_json JSON NOT NULL COMMENT "JSON", - avg_array DOUBLE NULL - AS array_avg(data_array) COMMENT "Get the average of ARRAY", - get_string VARCHAR(65533) NULL - AS get_json_string(json_string(data_json), '$.a') COMMENT "Extract JSON string" -) ENGINE=OLAP -PRIMARY KEY(id) -DISTRIBUTED BY HASH(id); - -INSERT INTO insert_generated_columns -VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}')); -``` - -> **NOTE** -> -> 不支持直接将数据加载到生成列中。 - -您可以查询该表以检查其中的数据。 - -```Plain -mysql> SELECT * FROM insert_generated_columns; -+------+------------+------------------+-----------+------------+ -| id | data_array | data_json | avg_array | get_string | -+------+------------+------------------+-----------+------------+ -| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 | -+------+------------+------------------+-----------+------------+ -1 row in set (0.02 sec) -``` - -## 使用 PROPERTIES 的 INSERT 数据 - -从 v3.4.0 开始,INSERT 语句支持配置 PROPERTIES,它可以用于各种目的。PROPERTIES 会覆盖其相应的变量。 - -### 启用严格模式 - -从 v3.4.0 开始,您可以启用严格模式并为来自 FILES() 的 INSERT 设置 `max_filter_ratio`。来自 FILES() 的 INSERT 的严格模式与其他导入方法的行为相同。 - -如果要加载包含一些不合格行的数据集,您可以过滤掉这些不合格行,也可以加载它们并将 NULL 值分配给不合格的列。您可以使用属性 `strict_mode` 和 `max_filter_ratio` 来实现这些目的。 - -- 要过滤掉不合格的行:将 `strict_mode` 设置为 `true`,并将 `max_filter_ratio` 设置为所需的值。 -- 要加载所有具有 NULL 值的不合格行:将 `strict_mode` 设置为 `false`。 - -以下示例将 AWS S3 bucket `inserttest` 中的 Parquet 文件 **parquet/insert_wiki_edit_append.parquet** 中的数据行插入到表 `insert_wiki_edit` 中,启用严格模式以过滤掉不合格的数据记录,并容忍最多 10% 的错误数据: - -```SQL -INSERT INTO insert_wiki_edit -PROPERTIES( - "strict_mode" = "true", - "max_filter_ratio" = "0.1" -) -SELECT * FROM FILES( - "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet", - "format" = "parquet", - "aws.s3.access_key" = "XXXXXXXXXX", - "aws.s3.secret_key" = "YYYYYYYYYY", - "aws.s3.region" = "us-west-2" -); -``` - -:::note 
- -`strict_mode` 和 `max_filter_ratio` 仅支持来自 FILES() 的 INSERT。来自表的 INSERT 不支持这些属性。 - -::: - -### 设置超时时长 - -从 v3.4.0 开始,您可以使用属性设置 INSERT 语句的超时时长。 - -以下示例将源表 `source_wiki_edit` 中的数据插入到目标表 `insert_wiki_edit` 中,并将超时时长设置为 `2` 秒。 - -```SQL -INSERT INTO insert_wiki_edit -PROPERTIES( - "timeout" = "2" -) -SELECT * FROM source_wiki_edit; -``` - -:::note - -从 v3.4.0 开始,您还可以使用系统变量 `insert_timeout` 设置 INSERT 超时时长,该变量适用于涉及 INSERT 的操作(例如,UPDATE、DELETE、CTAS、物化视图刷新、统计信息收集和 PIPE)。在早于 v3.4.0 的版本中,相应的变量是 `query_timeout`。 - -::: - -### 按名称匹配列 - -默认情况下,INSERT 按位置匹配源表和目标表中的列,即语句中列的映射。 - -以下示例通过位置显式匹配源表和目标表中的每个列: - -```SQL -INSERT INTO insert_wiki_edit ( - event_time, - channel, - user -) -SELECT event_time, channel, user FROM source_wiki_edit; -``` - -如果您更改列列表或 SELECT 语句中 `channel` 和 `user` 的顺序,则列映射将更改。 - -```SQL -INSERT INTO insert_wiki_edit ( - event_time, - channel, - user -) -SELECT event_time, user, channel FROM source_wiki_edit; -``` - -在这里,提取的数据可能不是您想要的,因为目标表 `insert_wiki_edit` 中的 `channel` 将填充来自源表 `source_wiki_edit` 中的 `user` 的数据。 - -通过在 INSERT 语句中添加 `BY NAME` 子句,系统将检测源表和目标表中的列名,并匹配具有相同名称的列。 - -:::note - -- 如果指定了 `BY NAME`,则不能指定列列表。 -- 如果未指定 `BY NAME`,则系统会按列列表和 SELECT 语句中列的位置匹配列。 - -::: - -以下示例按名称匹配源表和目标表中的每个列: - -```SQL -INSERT INTO insert_wiki_edit BY NAME -SELECT event_time, user, channel FROM source_wiki_edit; -``` - -在这种情况下,更改 `channel` 和 `user` 的顺序不会更改列映射。 - -## 使用 INSERT 异步导入数据 - -使用 INSERT 导入数据会提交一个同步事务,该事务可能会因会话中断或超时而失败。您可以使用 [SUBMIT TASK](../sql-reference/sql-statements/loading_unloading/ETL/SUBMIT_TASK.md) 提交异步 INSERT 事务。此功能自 StarRocks v2.5 起受支持。 - -- 以下示例异步地将数据从源表插入到目标表 `insert_wiki_edit`。 - -```SQL -SUBMIT TASK AS INSERT INTO insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -- 以下示例使用源表中的数据异步覆盖表 `insert_wiki_edit`。 - -```SQL -SUBMIT TASK AS INSERT OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -- 以下示例使用源表中的数据异步覆盖表 `insert_wiki_edit`,并使用 hint 将查询超时延长至 `100000` 秒。 - -```SQL -SUBMIT /*+set_var(insert_timeout=100000)*/ TASK AS -INSERT OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -- 以下示例使用源表中的数据异步覆盖表 `insert_wiki_edit`,并将任务名称指定为 `async`。 - -```SQL -SUBMIT TASK async -AS INSERT OVERWRITE insert_wiki_edit -SELECT * FROM source_wiki_edit; -``` - -您可以通过查询 Information Schema 中的元数据视图 `task_runs` 来检查异步 INSERT 任务的状态。 - -以下示例检查 INSERT 任务 `async` 的状态。 - -```SQL -SELECT * FROM information_schema.task_runs WHERE task_name = 'async'; -``` - -## 检查 INSERT 作业状态 - -### 通过结果检查 - -同步 INSERT 事务根据事务的结果返回不同的状态。 - -- **事务成功** - -如果事务成功,StarRocks 将返回以下内容: - -```Plain -Query OK, 2 rows affected (0.05 sec) -{'label':'insert_load_wikipedia', 'status':'VISIBLE', 'txnId':'1006'} -``` - -- **事务失败** - -如果所有数据行都无法加载到目标表中,则 INSERT 事务将失败。如果事务失败,StarRocks 将返回以下内容: - -```Plain -ERROR 1064 (HY000): Insert has filtered data in strict mode, tracking_url=http://x.x.x.x:yyyy/api/_load_error_log?file=error_log_9f0a4fd0b64e11ec_906bbede076e9d08 -``` - -您可以通过使用 `tracking_url` 检查日志来找到问题。 - -### 通过 Information Schema 检查 - -您可以使用 [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 语句从 `information_schema` 数据库中的 `loads` 表中查询一个或多个导入作业的结果。此功能自 v3.1 起受支持。 - -示例 1:查询在 `load_test` 数据库上执行的导入作业的结果,按创建时间 (`CREATE_TIME`) 降序对结果进行排序,并且仅返回最上面的结果。 - -```SQL -SELECT * FROM information_schema.loads -WHERE database_name = 'load_test' -ORDER BY create_time DESC -LIMIT 1\G -``` - -示例 2:查询在 `load_test` 数据库上执行的导入作业(其标签为 `insert_load_wikipedia`)的结果: - -```SQL -SELECT * FROM information_schema.loads -WHERE database_name = 'load_test' and label = 'insert_load_wikipedia'\G -``` - -返回结果如下: - 
-```Plain -*************************** 1. row *************************** - JOB_ID: 21319 - LABEL: insert_load_wikipedia - DATABASE_NAME: load_test - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 0 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 2 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-08-09 10:42:23 - ETL_START_TIME: 2023-08-09 10:42:23 - ETL_FINISH_TIME: 2023-08-09 \ No newline at end of file diff --git a/docs/zh/loading/Json_loading.md b/docs/zh/loading/Json_loading.md deleted file mode 100644 index 4915fcf..0000000 --- a/docs/zh/loading/Json_loading.md +++ /dev/null @@ -1,363 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 介绍 - -您可以使用 Stream Load 或 Routine Load 导入半结构化数据(例如 JSON)。 - -## 使用场景 - -* Stream Load:对于存储在文本文件中的 JSON 数据,使用 Stream Load 进行导入。 -* Routine Load:对于 Kafka 中的 JSON 数据,使用 Routine Load 进行导入。 - -### Stream Load 导入 - -示例数据: - -~~~json -{ "id": 123, "city" : "beijing"}, -{ "id": 456, "city" : "shanghai"}, - ... -~~~ - -示例: - -~~~shell -curl -v --location-trusted -u : \ - -H "format: json" -H "jsonpaths: [\"$.id\", \"$.city\"]" \ - -T example.json \ - http://FE_HOST:HTTP_PORT/api/DATABASE/TABLE/_stream_load -~~~ - -`format: json` 参数允许您执行导入数据的格式。`jsonpaths` 用于执行相应的数据导入路径。 - -相关参数: - -* jsonpaths: 选择每列的 JSON 路径 -* json\_root: 选择 JSON 开始解析的列 -* strip\_outer\_array: 裁剪最外层数组字段 -* strict\_mode: 在导入期间严格过滤列类型转换 - -当 JSON 数据模式和 StarRocks 数据模式不完全相同时,修改 `Jsonpath`。 - -示例数据: - -~~~json -{"k1": 1, "k2": 2} -~~~ - -导入示例: - -~~~bash -curl -v --location-trusted -u : \ - -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \ - -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" \ - -T example.json \ - http://127.0.0.1:8030/api/db1/tbl1/_stream_load -~~~ - -在导入期间执行将 k1 乘以 100 的 ETL 操作,并通过 `Jsonpath` 将列与原始数据匹配。 - -导入结果如下: - -~~~plain text -+------+------+ -| k1 | k2 | -+------+------+ -| 100 | 2 | -+------+------+ -~~~ - -对于缺少的列,如果列定义允许为空,则将添加 `NULL`,或者可以通过 `ifnull` 添加默认值。 - -示例数据: - -~~~json -[ - {"k1": 1, "k2": "a"}, - {"k1": 2}, - {"k1": 3, "k2": "c"}, -] -~~~ - -导入示例-1: - -~~~shell -curl -v --location-trusted -u : \ - -H "format: json" -H "strip_outer_array: true" \ - -T example.json \ - http://127.0.0.1:8030/api/db1/tbl1/_stream_load -~~~ - -导入结果如下: - -~~~plain text -+------+------+ -| k1 | k2 | -+------+------+ -| 1 | a | -+------+------+ -| 2 | NULL | -+------+------+ -| 3 | c | -+------+------+ -~~~ - -导入示例-2: - -~~~shell -curl -v --location-trusted -u : \ - -H "format: json" -H "strip_outer_array: true" \ - -H "jsonpaths: [\"$.k1\", \"$.k2\"]" \ - -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" \ - -T example.json \ - http://127.0.0.1:8030/api/db1/tbl1/_stream_load -~~~ - -导入结果如下: - -~~~plain text -+------+------+ -| k1 | k2 | -+------+------+ -| 1 | a | -+------+------+ -| 2 | x | -+------+------+ -| 3 | c | -+------+------+ -~~~ - -### Routine Load 导入 - -与 Stream Load 类似,Kafka 数据源的消息内容被视为完整的 JSON 数据。 - -1. 如果一条消息包含数组格式的多行数据,则将导入所有行,并且 Kafka 的 offset 将仅递增 1。 -2. 
如果 Array 格式的 JSON 表示多行数据,但由于 JSON 格式错误导致 JSON 解析失败,则错误行将仅递增 1(鉴于解析失败,StarRocks 实际上无法确定它包含多少行数据,并且只能将错误数据记录为一行)。 - -### 使用 Canal 通过增量同步 binlog 将 MySQL 导入 StarRocks - -[Canal](https://github.com/alibaba/canal) 是阿里巴巴的开源 MySQL binlog 同步工具,通过它可以将 MySQL 数据同步到 Kafka。数据在 Kafka 中以 JSON 格式生成。以下演示如何使用 Routine Load 同步 Kafka 中的数据,以实现与 MySQL 的增量数据同步。 - -* 在 MySQL 中,我们有一个数据表,其表创建语句如下。 - -~~~sql -CREATE TABLE `query_record` ( - `query_id` varchar(64) NOT NULL, - `conn_id` int(11) DEFAULT NULL, - `fe_host` varchar(32) DEFAULT NULL, - `user` varchar(32) DEFAULT NULL, - `start_time` datetime NOT NULL, - `end_time` datetime DEFAULT NULL, - `time_used` double DEFAULT NULL, - `state` varchar(16) NOT NULL, - `error_message` text, - `sql` text NOT NULL, - `database` varchar(128) NOT NULL, - `profile` longtext, - `plan` longtext, - PRIMARY KEY (`query_id`), - KEY `idx_start_time` (`start_time`) USING BTREE -) ENGINE=InnoDB DEFAULT CHARSET=utf8 -~~~ - -* 前提条件:确保 MySQL 已启用 binlog 并且格式为 ROW。 - -~~~bash -[mysqld] -log-bin=mysql-bin # 启用 binlog -binlog-format=ROW # 选择 ROW 模式 -server_id=1 # 需要定义 MySQL 复制,并且不要复制 Canal 的 slaveId -~~~ - -* 创建一个帐户并授予辅助 MySQL 服务器权限: - -~~~sql -CREATE USER canal IDENTIFIED BY 'canal'; -GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%'; --- GRANT ALL PRIVILEGES ON *.* TO 'canal'@'%'; -FLUSH PRIVILEGES; -~~~ - -* 然后下载并安装 Canal。 - -~~~bash -wget https://github.com/alibaba/canal/releases/download/canal-1.0.17/canal.deployer-1.0.17.tar.gz - -mkdir /tmp/canal -tar zxvf canal.deployer-$version.tar.gz -C /tmp/canal -~~~ - -* 修改配置(MySQL 相关)。 - -`$ vi conf/example/instance.properties` - -~~~bash -## mysql serverId -canal.instance.mysql.slaveId = 1234 -#position info, need to change to your own database information -canal.instance.master.address = 127.0.0.1:3306 -canal.instance.master.journal.name = -canal.instance.master.position = -canal.instance.master.timestamp = -#canal.instance.standby.address = -#canal.instance.standby.journal.name = -#canal.instance.standby.position = -#canal.instance.standby.timestamp = -#username/password, need to change to your own database information -canal.instance.dbUsername = canal -canal.instance.dbPassword = canal -canal.instance.defaultDatabaseName = -canal.instance.connectionCharset = UTF-8 -#table regex -canal.instance.filter.regex = .\*\\\\..\* -# Select the name of the table to be synchronized and the partition name of the kafka target. -canal.mq.dynamicTopic=databasename.query_record -canal.mq.partitionHash= databasename.query_record:query_id -~~~ - -* 修改配置(Kafka 相关)。 - -`$ vi /usr/local/canal/conf/canal.properties` - -~~~bash -# Available options: tcp(by default), kafka, RocketMQ -canal.serverMode = kafka -# ... -# kafka/rocketmq Cluster Configuration: 192.168.1.117:9092,192.168.1.118:9092,192.168.1.119:9092 -canal.mq.servers = 127.0.0.1:6667 -canal.mq.retries = 0 -# This value can be increased in flagMessage mode, but do not exceed the maximum size of the MQ message. -canal.mq.batchSize = 16384 -canal.mq.maxRequestSize = 1048576 -# In flatMessage mode, please change this value to a larger value, 50-200 is recommended. -canal.mq.lingerMs = 1 -canal.mq.bufferMemory = 33554432 -# Canal's batch size with a default value of 50K. Please do not exceed 1M due to Kafka's maximum message size limit (under 900K) -canal.mq.canalBatchSize = 50 -# Timeout of `Canal get`, in milliseconds. Empty indicates unlimited timeout. 
-canal.mq.canalGetTimeout = 100 -# Whether the object is in flat json format -canal.mq.flatMessage = false -canal.mq.compressionType = none -canal.mq.acks = all -# Whether Kafka message delivery uses transactions -canal.mq.transaction = false -~~~ - -* 启动 - -`bin/startup.sh` - -相应的同步日志显示在 `logs/example/example.log` 和 Kafka 中,格式如下: - -~~~json -{ - "data": [{ - "query_id": "3c7ebee321e94773-b4d79cc3f08ca2ac", - "conn_id": "34434", - "fe_host": "172.26.34.139", - "user": "zhaoheng", - "start_time": "2020-10-19 20:40:10.578", - "end_time": "2020-10-19 20:40:10", - "time_used": "1.0", - "state": "FINISHED", - "error_message": "", - "sql": "COMMIT", - "database": "", - "profile": "", - "plan": "" - }, { - "query_id": "7ff2df7551d64f8e-804004341bfa63ad", - "conn_id": "34432", - "fe_host": "172.26.34.139", - "user": "zhaoheng", - "start_time": "2020-10-19 20:40:10.566", - "end_time": "2020-10-19 20:40:10", - "time_used": "0.0", - "state": "FINISHED", - "error_message": "", - "sql": "COMMIT", - "database": "", - "profile": "", - "plan": "" - }, { - "query_id": "3a4b35d1c1914748-be385f5067759134", - "conn_id": "34440", - "fe_host": "172.26.34.139", - "user": "zhaoheng", - "start_time": "2020-10-19 20:40:10.601", - "end_time": "1970-01-01 08:00:00", - "time_used": "-1.0", - "state": "RUNNING", - "error_message": "", - "sql": " SELECT SUM(length(lo_custkey)), SUM(length(c_custkey)) FROM lineorder_str INNER JOIN customer_str ON lo_custkey=c_custkey;", - "database": "ssb", - "profile": "", - "plan": "" - }], - "database": "center_service_lihailei", - "es": 1603111211000, - "id": 122, - "isDdl": false, - "mysqlType": { - "query_id": "varchar(64)", - "conn_id": "int(11)", - "fe_host": "varchar(32)", - "user": "varchar(32)", - "start_time": "datetime(3)", - "end_time": "datetime", - "time_used": "double", - "state": "varchar(16)", - "error_message": "text", - "sql": "text", - "database": "varchar(128)", - "profile": "longtext", - "plan": "longtext" - }, - "old": null, - "pkNames": ["query_id"], - "sql": "", - "sqlType": { - "query_id": 12, - "conn_id": 4, - "fe_host": 12, - "user": 12, - "start_time": 93, - "end_time": 93, - "time_used": 8, - "state": 12, - "error_message": 2005, - "sql": 2005, - "database": 12, - "profile": 2005, - "plan": 2005 - }, - "table": "query_record", - "ts": 1603111212015, - "type": "INSERT" -} -~~~ - -添加 `json_root` 和 `strip_outer_array = true` 以从 `data` 导入数据。 - -~~~sql -create routine load manual.query_job on query_record -columns (query_id,conn_id,fe_host,user,start_time,end_time,time_used,state,error_message,`sql`,`database`,profile,plan) -PROPERTIES ( - "format"="json", - "json_root"="$.data", - "desired_concurrent_number"="1", - "strip_outer_array" ="true", - "max_error_number"="1000" -) -FROM KAFKA ( - "kafka_broker_list"= "172.26.92.141:9092", - "kafka_topic" = "databasename.query_record" -); -~~~ - -这样就完成了从 MySQL 到 StarRocks 的近实时数据同步。 - -通过 `show routine load` 查看导入作业的状态和错误消息。 \ No newline at end of file diff --git a/docs/zh/loading/Kafka-connector-starrocks.md b/docs/zh/loading/Kafka-connector-starrocks.md deleted file mode 100644 index fc030a9..0000000 --- a/docs/zh/loading/Kafka-connector-starrocks.md +++ /dev/null @@ -1,829 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 使用 Kafka connector 导入数据 - -StarRocks 提供了一个自研的 connector,名为 Apache Kafka® connector (StarRocks Connector for Apache Kafka®,简称 Kafka connector)。作为一个 sink connector,它可以持续地从 Kafka 消费消息,并将消息导入到 StarRocks 中。 Kafka connector 保证至少一次 (at-least-once) 的语义。 - -Kafka connector 可以无缝集成到 Kafka Connect 中,这使得 StarRocks 
能够更好地与 Kafka 生态系统集成。如果您想将实时数据导入到 StarRocks 中,这是一个明智的选择。与 Routine Load 相比,建议在以下场景中使用 Kafka connector: - -- Routine Load 仅支持导入 CSV、JSON 和 Avro 格式的数据,而 Kafka connector 可以导入更多格式的数据,例如 Protobuf。只要可以使用 Kafka Connect 的转换器将数据转换为 JSON 和 CSV 格式,就可以通过 Kafka connector 将数据导入到 StarRocks 中。 -- 自定义数据转换,例如 Debezium 格式的 CDC 数据。 -- 从多个 Kafka topic 导入数据。 -- 从 Confluent Cloud 导入数据。 -- 需要更精细地控制导入的批次大小、并行度和其他参数,以在导入速度和资源利用率之间取得平衡。 - -## 准备工作 - -### 版本要求 - -| Connector | Kafka | StarRocks | Java | -| --------- | --------- | ------------- | ---- | -| 1.0.6 | 3.4+/4.0+ | 2.5 及更高版本 | 8 | -| 1.0.5 | 3.4 | 2.5 及更高版本 | 8 | -| 1.0.4 | 3.4 | 2.5 及更高版本 | 8 | -| 1.0.3 | 3.4 | 2.5 及更高版本 | 8 | - -### 搭建 Kafka 环境 - -支持自管理的 Apache Kafka 集群和 Confluent Cloud。 - -- 对于自管理的 Apache Kafka 集群,您可以参考 [Apache Kafka quickstart](https://kafka.apache.org/quickstart) 快速部署 Kafka 集群。 Kafka Connect 已经集成到 Kafka 中。 -- 对于 Confluent Cloud,请确保您拥有 Confluent 帐户并已创建集群。 - -### 下载 Kafka connector - -将 Kafka connector 提交到 Kafka Connect: - -- 自管理的 Kafka 集群: - - 下载 [starrocks-connector-for-kafka-x.y.z-with-dependencies.jar](https://github.com/StarRocks/starrocks-connector-for-kafka/releases)。 - -- Confluent Cloud: - - 目前, Kafka connector 尚未上传到 Confluent Hub。您需要下载 [starrocks-connector-for-kafka-x.y.z-with-dependencies.jar](https://github.com/StarRocks/starrocks-connector-for-kafka/releases),将其打包成 ZIP 文件,然后将 ZIP 文件上传到 Confluent Cloud。 - -### 网络配置 - -确保 Kafka 所在的机器可以通过 [`http_port`](../administration/management/FE_configuration.md#http_port) (默认值:`8030`) 和 [`query_port`](../administration/management/FE_configuration.md#query_port) (默认值:`9030`) 访问 StarRocks 集群的 FE 节点,并通过 [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (默认值:`8040`) 访问 BE 节点。 - -## 使用方法 - -本节以自管理的 Kafka 集群为例,介绍如何配置 Kafka connector 和 Kafka Connect,然后运行 Kafka Connect 将数据导入到 StarRocks 中。 - -### 准备数据集 - -假设 Kafka 集群的 topic `test` 中存在 JSON 格式的数据。 - -```JSON -{"id":1,"city":"New York"} -{"id":2,"city":"Los Angeles"} -{"id":3,"city":"Chicago"} -``` - -### 创建表 - -根据 JSON 格式数据的键,在 StarRocks 集群的数据库 `example_db` 中创建表 `test_tbl`。 - -```SQL -CREATE DATABASE example_db; -USE example_db; -CREATE TABLE test_tbl (id INT, city STRING); -``` - -### 配置 Kafka connector 和 Kafka Connect,然后运行 Kafka Connect 导入数据 - -#### 在独立模式下运行 Kafka Connect - -1. 配置 Kafka connector。在 Kafka 安装目录下的 **config** 目录中,为 Kafka connector 创建配置文件 **connect-StarRocks-sink.properties**,并配置以下参数。有关更多参数和说明,请参见 [参数](#参数)。 - - :::info - - - 在本示例中,StarRocks 提供的 Kafka connector 是一个 sink connector,可以持续地从 Kafka 消费数据,并将数据导入到 StarRocks 中。 - - 如果源数据是 CDC 数据,例如 Debezium 格式的数据,并且 StarRocks 表是 Primary Key 表,则还需要在 StarRocks 提供的 Kafka connector 的配置文件 **connect-StarRocks-sink.properties** 中[配置 `transform`](#load-debezium-formatted-cdc-data),以将源数据的更改同步到 Primary Key 表。 - - ::: - - ```yaml - name=starrocks-kafka-connector - connector.class=com.starrocks.connector.kafka.StarRocksSinkConnector - topics=test - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - # StarRocks 集群中 FE 的 HTTP URL。默认端口为 8030。 - starrocks.http.url=192.168.xxx.xxx:8030 - # 如果 Kafka topic 名称与 StarRocks 表名不同,则需要配置它们之间的映射关系。 - starrocks.topic2table.map=test:test_tbl - # 输入 StarRocks 用户名。 - starrocks.username=user1 - # 输入 StarRocks 密码。 - starrocks.password=123456 - starrocks.database.name=example_db - sink.properties.strip_outer_array=true - ``` - -2. 配置并运行 Kafka Connect。 - - 1. 
配置 Kafka Connect。在 **config** 目录下的配置文件 **config/connect-standalone.properties** 中,配置以下参数。有关更多参数和说明,请参见 [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running)。 - - ```yaml - # Kafka brokers 的地址。多个 Kafka brokers 的地址需要用逗号 (,) 分隔。 - # 请注意,本示例使用 PLAINTEXT 作为访问 Kafka 集群的安全协议。如果您使用其他安全协议访问 Kafka 集群,则需要在本文件中配置相关信息。 - bootstrap.servers=:9092 - offset.storage.file.filename=/tmp/connect.offsets - offset.flush.interval.ms=10000 - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - # starrocks-connector-for-kafka-x.y.z-with-dependencies.jar 的绝对路径。 - plugin.path=/home/kafka-connect/starrocks-kafka-connector - ``` - - 2. 运行 Kafka Connect。 - - ```Bash - CLASSPATH=/home/kafka-connect/starrocks-kafka-connector/* bin/connect-standalone.sh config/connect-standalone.properties config/connect-starrocks-sink.properties - ``` - -#### 在分布式模式下运行 Kafka Connect - -1. 配置并运行 Kafka Connect。 - - 1. 配置 Kafka Connect。在 **config** 目录下的配置文件 `config/connect-distributed.properties` 中,配置以下参数。有关更多参数和说明,请参考 [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running)。 - - ```yaml - # Kafka brokers 的地址。多个 Kafka brokers 的地址需要用逗号 (,) 分隔。 - # 请注意,本示例使用 PLAINTEXT 作为访问 Kafka 集群的安全协议。 - # 如果您使用其他安全协议访问 Kafka 集群,请在本文件中配置相关信息。 - bootstrap.servers=:9092 - offset.storage.file.filename=/tmp/connect.offsets - offset.flush.interval.ms=10000 - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - # starrocks-connector-for-kafka-x.y.z-with-dependencies.jar 的绝对路径。 - plugin.path=/home/kafka-connect/starrocks-kafka-connector - ``` - - 2. 运行 Kafka Connect。 - - ```BASH - CLASSPATH=/home/kafka-connect/starrocks-kafka-connector/* bin/connect-distributed.sh config/connect-distributed.properties - ``` - -2. 配置并创建 Kafka connector。请注意,在分布式模式下,您需要通过 REST API 配置和创建 Kafka connector。有关参数和说明,请参见 [参数](#参数)。 - - :::info - - - 在本示例中,StarRocks 提供的 Kafka connector 是一个 sink connector,可以持续地从 Kafka 消费数据,并将数据导入到 StarRocks 中。 - - 如果源数据是 CDC 数据,例如 Debezium 格式的数据,并且 StarRocks 表是 Primary Key 表,则还需要在 StarRocks 提供的 Kafka connector 的配置文件 **connect-StarRocks-sink.properties** 中[配置 `transform`](#load-debezium-formatted-cdc-data),以将源数据的更改同步到 Primary Key 表。 - - ::: - - ```Shell - curl -i http://127.0.0.1:8083/connectors -H "Content-Type: application/json" -X POST -d '{ - "name":"starrocks-kafka-connector", - "config":{ - "connector.class":"com.starrocks.connector.kafka.StarRocksSinkConnector", - "topics":"test", - "key.converter":"org.apache.kafka.connect.json.JsonConverter", - "value.converter":"org.apache.kafka.connect.json.JsonConverter", - "key.converter.schemas.enable":"true", - "value.converter.schemas.enable":"false", - "starrocks.http.url":"192.168.xxx.xxx:8030", - "starrocks.topic2table.map":"test:test_tbl", - "starrocks.username":"user1", - "starrocks.password":"123456", - "starrocks.database.name":"example_db", - "sink.properties.strip_outer_array":"true" - } - }' - ``` - -#### 查询 StarRocks 表 - -查询目标 StarRocks 表 `test_tbl`。 - -```mysql -MySQL [example_db]> select * from test_tbl; - -+------+-------------+ -| id | city | -+------+-------------+ -| 1 | New York | -| 2 | Los Angeles | -| 3 | Chicago | -+------+-------------+ -3 rows in set (0.01 sec) -``` - -如果返回以上结果,则表示数据已成功导入。 - -## 参数 - -### name - -**是否必须**:是
-**默认值**:
-**描述**:此 Kafka connector 的名称。在 Kafka Connect 集群中的所有 Kafka connector 中,它必须是全局唯一的。例如,starrocks-kafka-connector。
-
-### connector.class
-
-**是否必须**:是
-**默认值**:
-**描述**:此 Kafka connector 的 sink 使用的类。将值设置为 `com.starrocks.connector.kafka.StarRocksSinkConnector`。
-
-### topics
-
-**是否必须**:
-**默认值**:
-**描述**:要订阅的一个或多个 topic,其中每个 topic 对应一个 StarRocks 表。默认情况下,StarRocks 假定 topic 名称与 StarRocks 表的名称匹配。因此,StarRocks 通过使用 topic 名称来确定目标 StarRocks 表。请选择填写 `topics` 或 `topics.regex` (如下),但不能同时填写。但是,如果 StarRocks 表名与 topic 名称不同,则可以使用可选的 `starrocks.topic2table.map` 参数 (如下) 来指定从 topic 名称到表名称的映射。
-
-### topics.regex
-
-**是否必须**:
-**默认值**:
-**描述**:用于匹配要订阅的一个或多个 topic 的正则表达式。有关更多描述,请参见 `topics`。请选择填写 `topics.regex` 或 `topics` (如上),但不能同时填写。
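
以下是一个示意性的片段(仅供参考;其中以 `orders_` 为前缀的 topic 命名为假设),展示如何在前文的 **connect-StarRocks-sink.properties** 配置文件中使用 `topics.regex` 一次订阅多个 topic:

```yaml
# 示意配置:使用正则表达式订阅所有以 orders_ 开头的 topic(topic 命名仅为假设)
# 注意:topics.regex 与 topics 只能二选一,不能同时配置
topics.regex=orders_.*
```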
-
-### starrocks.topic2table.map
-
-**是否必须**:否
-**默认值**:
-**描述**:当 topic 名称与 StarRocks 表名不同时,StarRocks 表名与 topic 名称的映射。格式为 `<topic-1>:<table-1>,<topic-2>:<table-2>,...`。
-
-### starrocks.http.url
-
-**是否必须**:是
-**默认值**:
-**描述**:StarRocks 集群中 FE 的 HTTP URL。格式为 `<fe_host1>:<fe_http_port1>,<fe_host2>:<fe_http_port2>,...`。多个地址用逗号 (,) 分隔。例如,`192.168.xxx.xxx:8030,192.168.xxx.xxx:8030`。
-
-### starrocks.database.name
-
-**是否必须**:是
-**默认值**:
-**描述**:StarRocks 数据库的名称。
-
-### starrocks.username
-
-**是否必须**:是
-**默认值**:
-**描述**:您的 StarRocks 集群帐户的用户名。该用户需要对 StarRocks 表具有 [INSERT](../sql-reference/sql-statements/account-management/GRANT.md) 权限。
-
-### starrocks.password
-
-**是否必须**:是
-**默认值**:
-**描述**:您的 StarRocks 集群帐户的密码。
-
-### key.converter
-
-**是否必须**:否
-**默认值**:Kafka Connect 集群使用的 Key converter
-**描述**:此参数指定 sink connector (Kafka-connector-starrocks) 的 key converter,用于反序列化 Kafka 数据的键。默认的 key converter 是 Kafka Connect 集群使用的 key converter。
-
-### value.converter
-
-**是否必须**:否
-**默认值**:Kafka Connect 集群使用的 Value converter
-**描述**:此参数指定 sink connector (Kafka-connector-starrocks) 的 value converter,用于反序列化 Kafka 数据的值。默认的 value converter 是 Kafka Connect 集群使用的 value converter。
-
-### key.converter.schema.registry.url
-
-**是否必须**:否
-**默认值**:
-**描述**:Key converter 的 Schema registry URL。
-
-### value.converter.schema.registry.url
-
-**是否必须**:否
-**默认值**:
-**描述**:Value converter 的 Schema registry URL。
-
-### tasks.max
-
-**是否必须**:否
-**默认值**:1
-**描述**:Kafka connector 可以创建的任务线程数的上限,通常与 Kafka Connect 集群中 worker 节点上的 CPU 核心数相同。您可以调整此参数以控制导入性能。
-
-### bufferflush.maxbytes
-
-**是否必须**:否
-**默认值**:94371840 (90 MB)
-**描述**:在一次发送到 StarRocks 之前,可以在内存中累积的最大数据量。最大值范围为 64 MB 到 10 GB。请记住,Stream Load SDK 缓冲区可能会创建多个 Stream Load 作业来缓冲数据。因此,此处提到的阈值是指总数据大小。
-
-### bufferflush.intervalms
-
-**是否必须**:否
-**默认值**:1000
-**描述**:发送一批数据的间隔,用于控制导入延迟。范围:[1000, 3600000]。
-
-### connect.timeoutms
-
-**是否必须**:否
-**默认值**:1000
-**描述**:连接到 HTTP URL 的超时时间。范围:[100, 60000]。
-
-### sink.properties.*
-
-**是否必须**:
-**默认值**:
-**描述**:用于控制导入行为的 Stream Load 参数。例如,参数 `sink.properties.format` 指定用于 Stream Load 的格式,例如 CSV 或 JSON。有关支持的参数及其描述的列表,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。
-
-### sink.properties.format
-
-**是否必须**:否
-**默认值**:json
-**描述**:用于 Stream Load 的格式。Kafka connector 会在将每批数据发送到 StarRocks 之前将其转换为该格式。有效值:`csv` 和 `json`。有关更多信息,请参见 [CSV 参数](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#csv-parameters) 和 [JSON 参数](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#json-parameters)。
-
-### sink.properties.partial_update
-
-**是否必须**:否
-**默认值**:`FALSE`
-**描述**:是否使用部分更新。有效值:`TRUE` 和 `FALSE`。默认值:`FALSE`,表示禁用此功能。
-
-### sink.properties.partial_update_mode
-
-**是否必须**:否
-**默认值**:`row`
-**描述**:指定部分更新的模式。有效值:`row` 和 `column`。
-  - 值 `row` (默认) 表示行模式下的部分更新,更适合于具有许多列和小批量的实时更新。
-  - 值 `column` 表示列模式下的部分更新,更适合于具有少量列和许多行的批量更新。在这种情况下,启用列模式可以提供更快的更新速度。例如,在一个具有 100 列的表中,如果仅更新所有行的 10 列 (占总数的 10%),则列模式的更新速度将快 10 倍。
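
以下是一个示意性的配置片段(仅供参考;假设目标表为主键表,其余参数沿用前文 **connect-StarRocks-sink.properties** 示例中的配置,列名 `id`、`city` 为假设),展示如何结合上述 `sink.properties.*` 参数开启列模式的部分更新:

```yaml
# 示意配置:通过透传 Stream Load 参数开启部分更新,并使用列模式
sink.properties.partial_update=true
sink.properties.partial_update_mode=column
# 是否需要显式指定要更新的列取决于源数据格式,此处仅作示意(列名为假设)
sink.properties.columns=id,city
```

在分布式模式下通过 REST API 创建 connector 时,也可以将上述键值对加入请求 JSON 的 `config` 部分。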
- -## 使用注意事项 - -### Flush 策略 - -Kafka connector 会将数据缓存在内存中,并通过 Stream Load 将它们批量刷新到 StarRocks。当满足以下任何条件时,将触发刷新: - -- 缓冲行的字节数达到限制 `bufferflush.maxbytes`。 -- 自上次刷新以来经过的时间达到限制 `bufferflush.intervalms`。 -- 达到 connector 尝试提交任务偏移量的间隔。该间隔由 Kafka Connect 配置 [`offset.flush.interval.ms`](https://docs.confluent.io/platform/current/connect/references/allconfigs.html) 控制,默认值为 `60000`。 - -为了降低数据延迟,请调整 Kafka connector 设置中的这些配置。但是,更频繁的刷新会增加 CPU 和 I/O 使用率。 - -### 限制 - -- 不支持将来自 Kafka topic 的单个消息展平为多个数据行并加载到 StarRocks 中。 -- StarRocks 提供的 Kafka connector 的 sink 保证至少一次 (at-least-once) 的语义。 - -## 最佳实践 - -### 导入 Debezium 格式的 CDC 数据 - -Debezium 是一种流行的变更数据捕获 (Change Data Capture, CDC) 工具,支持监视各种数据库系统中的数据更改,并将这些更改流式传输到 Kafka。以下示例演示了如何配置和使用 Kafka connector 将 PostgreSQL 更改写入 StarRocks 中的 **Primary Key 表**。 - -#### 步骤 1:安装并启动 Kafka - -> **注意** -> -> 如果您有自己的 Kafka 环境,则可以跳过此步骤。 - -1. 从官方网站 [下载](https://dlcdn.apache.org/kafka/) 最新的 Kafka 版本并解压该软件包。 - - ```Bash - tar -xzf kafka_2.13-3.7.0.tgz - cd kafka_2.13-3.7.0 - ``` - -2. 启动 Kafka 环境。 - - 生成 Kafka 集群 UUID。 - - ```Bash - KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)" - ``` - - 格式化日志目录。 - - ```Bash - bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties - ``` - - 启动 Kafka 服务器。 - - ```Bash - bin/kafka-server-start.sh config/kraft/server.properties - ``` - -#### 步骤 2:配置 PostgreSQL - -1. 确保 PostgreSQL 用户被授予 `REPLICATION` 权限。 - -2. 调整 PostgreSQL 配置。 - - 在 **postgresql.conf** 中将 `wal_level` 设置为 `logical`。 - - ```Properties - wal_level = logical - ``` - - 重新启动 PostgreSQL 服务器以应用更改。 - - ```Bash - pg_ctl restart - ``` - -3. 准备数据集。 - - 创建一个表并插入测试数据。 - - ```SQL - CREATE TABLE customers ( - id int primary key , - first_name varchar(65533) NULL, - last_name varchar(65533) NULL , - email varchar(65533) NULL - ); - - INSERT INTO customers VALUES (1,'a','a','a@a.com'); - ``` - -4. 
验证 Kafka 中的 CDC 日志消息。 - - ```Json - { - "schema": { - "type": "struct", - "fields": [ - { - "type": "struct", - "fields": [ - { - "type": "int32", - "optional": false, - "field": "id" - }, - { - "type": "string", - "optional": true, - "field": "first_name" - }, - { - "type": "string", - "optional": true, - "field": "last_name" - }, - { - "type": "string", - "optional": true, - "field": "email" - } - ], - "optional": true, - "name": "test.public.customers.Value", - "field": "before" - }, - { - "type": "struct", - "fields": [ - { - "type": "int32", - "optional": false, - "field": "id" - }, - { - "type": "string", - "optional": true, - "field": "first_name" - }, - { - "type": "string", - "optional": true, - "field": "last_name" - }, - { - "type": "string", - "optional": true, - "field": "email" - } - ], - "optional": true, - "name": "test.public.customers.Value", - "field": "after" - }, - { - "type": "struct", - "fields": [ - { - "type": "string", - "optional": false, - "field": "version" - }, - { - "type": "string", - "optional": false, - "field": "connector" - }, - { - "type": "string", - "optional": false, - "field": "name" - }, - { - "type": "int64", - "optional": false, - "field": "ts_ms" - }, - { - "type": "string", - "optional": true, - "name": "io.debezium.data.Enum", - "version": 1, - "parameters": { - "allowed": "true,last,false,incremental" - }, - "default": "false", - "field": "snapshot" - }, - { - "type": "string", - "optional": false, - "field": "db" - }, - { - "type": "string", - "optional": true, - "field": "sequence" - }, - { - "type": "string", - "optional": false, - "field": "schema" - }, - { - "type": "string", - "optional": false, - "field": "table" - }, - { - "type": "int64", - "optional": true, - "field": "txId" - }, - { - "type": "int64", - "optional": true, - "field": "lsn" - }, - { - "type": "int64", - "optional": true, - "field": "xmin" - } - ], - "optional": false, - "name": "io.debezium.connector.postgresql.Source", - "field": "source" - }, - { - "type": "string", - "optional": false, - "field": "op" - }, - { - "type": "int64", - "optional": true, - "field": "ts_ms" - }, - { - "type": "struct", - "fields": [ - { - "type": "string", - "optional": false, - "field": "id" - }, - { - "type": "int64", - "optional": false, - "field": "total_order" - }, - { - "type": "int64", - "optional": false, - "field": "data_collection_order" - } - ], - "optional": true, - "name": "event.block", - "version": 1, - "field": "transaction" - } - ], - "optional": false, - "name": "test.public.customers.Envelope", - "version": 1 - }, - "payload": { - "before": null, - "after": { - "id": 1, - "first_name": "a", - "last_name": "a", - "email": "a@a.com" - }, - "source": { - "version": "2.5.3.Final", - "connector": "postgresql", - "name": "test", - "ts_ms": 1714283798721, - "snapshot": "false", - "db": "postgres", - "sequence": "[\"22910216\",\"22910504\"]", - "schema": "public", - "table": "customers", - "txId": 756, - "lsn": 22910504, - "xmin": null - }, - "op": "c", - "ts_ms": 1714283798790, - "transaction": null - } - } - ``` - -#### 步骤 3:配置 StarRocks - -在 StarRocks 中创建一个 Primary Key 表,其 schema 与 PostgreSQL 中的源表相同。 - -```SQL -CREATE TABLE `customers` ( - `id` int(11) COMMENT "", - `first_name` varchar(65533) NULL COMMENT "", - `last_name` varchar(65533) NULL COMMENT "", - `email` varchar(65533) NULL COMMENT "" -) ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY hash(id) buckets 1 -PROPERTIES ( -"bucket_size" = "4294967296", -"in_memory" = "false", -"enable_persistent_index" = "true", 
-"replicated_storage" = "true", -"fast_schema_evolution" = "true" -); -``` - -#### 步骤 4:安装 connector - -1. 下载 connectors 并在 **plugins** 目录中解压软件包。 - - ```Bash - mkdir plugins - tar -zxvf debezium-debezium-connector-postgresql-2.5.3.zip -C plugins - mv starrocks-connector-for-kafka-x.y.z-with-dependencies.jar plugins - ``` - - 此目录是 **config/connect-standalone.properties** 中配置项 `plugin.path` 的值。 - - ```Properties - plugin.path=/path/to/kafka_2.13-3.7.0/plugins - ``` - -2. 在 **pg-source.properties** 中配置 PostgreSQL 源 connector。 - - ```Json - { - "name": "inventory-connector", - "config": { - "connector.class": "io.debezium.connector.postgresql.PostgresConnector", - "plugin.name": "pgoutput", - "database.hostname": "localhost", - "database.port": "5432", - "database.user": "postgres", - "database.password": "", - "database.dbname" : "postgres", - "topic.prefix": "test" - } - } - ``` - -3. 在 **sr-sink.properties** 中配置 StarRocks sink connector。 - - ```Json - { - "name": "starrocks-kafka-connector", - "config": { - "connector.class": "com.starrocks.connector.kafka.StarRocksSinkConnector", - "tasks.max": "1", - "topics": "test.public.customers", - "starrocks.http.url": "172.26.195.69:28030", - "starrocks.database.name": "test", - "starrocks.username": "root", - "starrocks.password": "StarRocks@123", - "sink.properties.strip_outer_array": "true", - "connect.timeoutms": "3000", - "starrocks.topic2table.map": "test.public.customers:customers", - "transforms": "addfield,unwrap", - "transforms.addfield.type": "com.starrocks.connector.kafka.transforms.AddOpFieldForDebeziumRecord", - "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState", - "transforms.unwrap.drop.tombstones": "true", - "transforms.unwrap.delete.handling.mode": "rewrite" - } - } - ``` - - > **注意** - > - > - 如果 StarRocks 表不是 Primary Key 表,则无需指定 `addfield` 转换。 - > - unwrap 转换由 Debezium 提供,用于根据操作类型解包 Debezium 的复杂数据结构。有关更多信息,请参见 [New Record State Extraction](https://debezium.io/documentation/reference/stable/transformations/event-flattening.html)。 - -4. 
配置 Kafka Connect。 - - 在 Kafka Connect 配置文件 **config/connect-standalone.properties** 中配置以下配置项。 - - ```Properties - # Kafka brokers 的地址。多个 Kafka brokers 的地址需要用逗号 (,) 分隔。 - # 请注意,本示例使用 PLAINTEXT 作为访问 Kafka 集群的安全协议。 - # 如果您使用其他安全协议访问 Kafka 集群,请在此部分配置相关信息。 - - bootstrap.servers=:9092 - offset.storage.file.filename=/tmp/connect.offsets - key.converter=org.apache.kafka.connect.json.JsonConverter - value.converter=org.apache.kafka.connect.json.JsonConverter - key.converter.schemas.enable=true - value.converter.schemas.enable=false - - # starrocks-connector-for-kafka-x.y.z-with-dependencies.jar 的绝对路径。 - plugin.path=/home/kafka-connect/starrocks-kafka-connector - - # 控制刷新策略的参数。有关更多信息,请参见“使用注意事项”部分。 - offset.flush.interval.ms=10000 - bufferflush.maxbytes = xxx - bufferflush.intervalms = xxx - ``` - - 有关更多参数的描述,请参见 [Running Kafka Connect](https://kafka.apache.org/documentation.html#connect_running)。 - -#### 步骤 5:在独立模式下启动 Kafka Connect - -在独立模式下运行 Kafka Connect 以启动 connectors。 - -```Bash -bin/connect-standalone.sh config/connect-standalone.properties config/pg-source.properties config/sr-sink.properties -``` - -#### 步骤 6:验证数据摄取 - -测试以下操作,并确保数据已正确摄取到 StarRocks 中。 - -##### INSERT - -- 在 PostgreSQL 中: - -```Plain -postgres=# insert into customers values (2,'b','b','b@b.com'); -INSERT 0 1 -postgres=# select * from customers; - id | first_name | last_name | email -----+------------+-----------+--------- - 1 | a | a | a@a.com - 2 | b | b | b@b.com -(2 rows) -``` - -- 在 StarRocks 中: - -```Plain -MySQL [test]> select * from customers; -+------+------------+-----------+---------+ -| id | first_name | last_name | email | -+------+------------+-----------+---------+ -| 1 | a | a | a@a.com | -| 2 | b | b | b@b.com | -+------+------------+-----------+---------+ -2 rows in set (0.01 sec) -``` - -##### UPDATE - -- 在 PostgreSQL 中: - -```Plain -postgres=# update customers set email='c@c.com'; -UPDATE 2 -postgres=# select * from customers; - id | first_name | last_name | email -----+------------+-----------+--------- - 1 | a | a | c@c.com - 2 | b | b | c@c.com -(2 rows) -``` - -- 在 StarRocks 中: - -```Plain -MySQL [test]> select * from customers; -+------+------------+-----------+---------+ -| id | first_name | last_name | email | -+------+------------+-----------+---------+ -| 1 | a | a | c@c.com | -| 2 | b | b | c@c.com | -+------+------------+-----------+---------+ -2 rows in set (0.00 sec) -``` - -##### DELETE - -- 在 PostgreSQL 中: - -```Plain -postgres=# delete from customers where id=1; -DELETE 1 -postgres=# select * from customers; - id | first_name | last_name | email -----+------------+-----------+--------- - 2 | b | b | c@c.com -(1 row) -``` - -- 在 StarRocks 中: - -```Plain -MySQL [test]> select * from customers; -+------+------------+-----------+---------+ -| id | first_name | last_ \ No newline at end of file diff --git a/docs/zh/loading/Load_to_Primary_Key_tables.md b/docs/zh/loading/Load_to_Primary_Key_tables.md deleted file mode 100644 index 152e81d..0000000 --- a/docs/zh/loading/Load_to_Primary_Key_tables.md +++ /dev/null @@ -1,709 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 通过导入更改数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 提供的 [主键表](../table_design/table_types/primary_key_table.md) 允许您通过运行 [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 、[Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 或 [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) 
作业来更改 StarRocks 表中的数据。这些数据更改包括插入、更新和删除。但是,主键表不支持使用 [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) 或 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) 更改数据。 - -StarRocks 还支持部分更新和条件更新。 - - - -本主题以 CSV 数据为例,介绍如何通过导入更改 StarRocks 表中的数据。支持的数据文件格式因您选择的导入方法而异。 - -> **NOTE** -> -> 对于 CSV 数据,您可以使用 UTF-8 字符串(例如逗号 (,)、制表符或管道 (|)),其长度不超过 50 字节作为文本分隔符。 - -## 实现 - -StarRocks 提供的主键表支持 UPSERT 和 DELETE 操作,并且不区分 INSERT 操作和 UPDATE 操作。 - -创建导入作业时,StarRocks 支持向作业创建语句或命令添加名为 `__op` 的字段。 `__op` 字段用于指定要执行的操作类型。 - -> **NOTE** -> -> 创建表时,无需向该表添加名为 `__op` 的列。 - -定义 `__op` 字段的方法因您选择的导入方法而异: - -- 如果选择 Stream Load,请使用 `columns` 参数定义 `__op` 字段。 - -- 如果选择 Broker Load,请使用 SET 子句定义 `__op` 字段。 - -- 如果选择 Routine Load,请使用 `COLUMNS` 列定义 `__op` 字段。 - -您可以根据要进行的数据更改来决定是否添加 `__op` 字段。如果未添加 `__op` 字段,则操作类型默认为 UPSERT。主要的数据更改场景如下: - -- 如果要导入的数据文件仅涉及 UPSERT 操作,则无需添加 `__op` 字段。 - -- 如果要导入的数据文件仅涉及 DELETE 操作,则必须添加 `__op` 字段并将操作类型指定为 DELETE。 - -- 如果要导入的数据文件同时涉及 UPSERT 和 DELETE 操作,则必须添加 `__op` 字段,并确保数据文件包含一个列,其值为 `0` 或 `1`。值 `0` 表示 UPSERT 操作,值 `1` 表示 DELETE 操作。 - -## 使用说明 - -- 确保数据文件中的每一行都具有相同数量的列。 - -- 涉及数据更改的列必须包含主键列。 - -## 基本操作 - -本节提供有关如何通过导入更改 StarRocks 表中的数据的示例。有关详细的语法和参数说明,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 、[BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 和 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### UPSERT - -如果要导入的数据文件仅涉及 UPSERT 操作,则无需添加 `__op` 字段。 - -> **NOTE** -> -> 如果添加 `__op` 字段: -> -> - 可以将操作类型指定为 UPSERT。 -> -> - 可以将 `__op` 字段留空,因为操作类型默认为 UPSERT。 - -#### 数据示例 - -1. 准备数据文件。 - - a. 在本地文件系统中创建一个名为 `example1.csv` 的 CSV 文件。该文件由三列组成,依次表示用户 ID、用户名和用户分数。 - - ```Plain - 101,Lily,100 - 102,Rose,100 - ``` - - b. 将 `example1.csv` 的数据发布到 Kafka 集群的 `topic1`。 - -2. 准备 StarRocks 表。 - - a. 在 StarRocks 数据库 `test_db` 中创建一个名为 `table1` 的主键表。该表由三列组成:`id`、`name` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table1` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - b. 
将一条记录插入到 `table1` 中。 - - ```SQL - INSERT INTO table1 VALUES - (101, 'Lily',80); - ``` - -#### 导入数据 - -运行一个导入作业,以将 `example1.csv` 中 `id` 为 `101` 的记录更新到 `table1`,并将 `example1.csv` 中 `id` 为 `102` 的记录插入到 `table1`。 - -- 运行 Stream Load 作业。 - - - 如果不想包含 `__op` 字段,请运行以下命令: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label1" \ - -H "column_separator:," \ - -T example1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load - ``` - - - 如果想包含 `__op` 字段,请运行以下命令: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label2" \ - -H "column_separator:," \ - -H "columns:__op ='upsert'" \ - -T example1.csv -XPUT \ - http://:/api/test_db/table1/_stream_load - ``` - -- 运行 Broker Load 作业。 - - - 如果不想包含 `__op` 字段,请运行以下命令: - - ```SQL - LOAD LABEL test_db.label1 - ( - data infile("hdfs://:/example1.csv") - into table table1 - columns terminated by "," - format as "csv" - ) - WITH BROKER; - ``` - - - 如果想包含 `__op` 字段,请运行以下命令: - - ```SQL - LOAD LABEL test_db.label2 - ( - data infile("hdfs://:/example1.csv") - into table table1 - columns terminated by "," - format as "csv" - set (__op = 'upsert') - ) - WITH BROKER; - ``` - -- 运行 Routine Load 作业。 - - - 如果不想包含 `__op` 字段,请运行以下命令: - - ```SQL - CREATE ROUTINE LOAD test_db.table1 ON table1 - COLUMNS TERMINATED BY ",", - COLUMNS (id, name, score) - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test1", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - - - 如果想包含 `__op` 字段,请运行以下命令: - - ```SQL - CREATE ROUTINE LOAD test_db.table1 ON table1 - COLUMNS TERMINATED BY ",", - COLUMNS (id, name, score, __op ='upsert') - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test1", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - -#### 查询数据 - -导入完成后,查询 `table1` 的数据以验证导入是否成功: - -```SQL -SELECT * FROM table1; -+------+------+-------+ -| id | name | score | -+------+------+-------+ -| 101 | Lily | 100 | -| 102 | Rose | 100 | -+------+------+-------+ -2 rows in set (0.02 sec) -``` - -如上述查询结果所示,`example1.csv` 中 `id` 为 `101` 的记录已更新到 `table1`,并且 `example1.csv` 中 `id` 为 `102` 的记录已插入到 `table1` 中。 - -### DELETE - -如果要导入的数据文件仅涉及 DELETE 操作,则必须添加 `__op` 字段并将操作类型指定为 DELETE。 - -#### 数据示例 - -1. 准备数据文件。 - - a. 在本地文件系统中创建一个名为 `example2.csv` 的 CSV 文件。该文件由三列组成,依次表示用户 ID、用户名和用户分数。 - - ```Plain - 101,Jack,100 - ``` - - b. 将 `example2.csv` 的数据发布到 Kafka 集群的 `topic2`。 - -2. 准备 StarRocks 表。 - - a. 在 StarRocks 数据库 `test_db` 中创建一个名为 `table2` 的主键表。该表由三列组成:`id`、`name` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table2` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - b. 
将两条记录插入到 `table2` 中。 - - ```SQL - INSERT INTO table2 VALUES - (101, 'Jack', 100), - (102, 'Bob', 90); - ``` - -#### 导入数据 - -运行一个导入作业,以从 `table2` 中删除 `example2.csv` 中 `id` 为 `101` 的记录。 - -- 运行 Stream Load 作业。 - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label3" \ - -H "column_separator:," \ - -H "columns:__op='delete'" \ - -T example2.csv -XPUT \ - http://:/api/test_db/table2/_stream_load - ``` - -- 运行 Broker Load 作业。 - - ```SQL - LOAD LABEL test_db.label3 - ( - data infile("hdfs://:/example2.csv") - into table table2 - columns terminated by "," - format as "csv" - set (__op = 'delete') - ) - WITH BROKER; - ``` - -- 运行 Routine Load 作业。 - - ```SQL - CREATE ROUTINE LOAD test_db.table2 ON table2 - COLUMNS(id, name, score, __op = 'delete') - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test2", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - -#### 查询数据 - -导入完成后,查询 `table2` 的数据以验证导入是否成功: - -```SQL -SELECT * FROM table2; -+------+------+-------+ -| id | name | score | -+------+------+-------+ -| 102 | Bob | 90 | -+------+------+-------+ -1 row in set (0.00 sec) -``` - -如上述查询结果所示,`example2.csv` 中 `id` 为 `101` 的记录已从 `table2` 中删除。 - -### UPSERT 和 DELETE - -如果要导入的数据文件同时涉及 UPSERT 和 DELETE 操作,则必须添加 `__op` 字段,并确保数据文件包含一个列,其值为 `0` 或 `1`。值 `0` 表示 UPSERT 操作,值 `1` 表示 DELETE 操作。 - -#### 数据示例 - -1. 准备数据文件。 - - a. 在本地文件系统中创建一个名为 `example3.csv` 的 CSV 文件。该文件由四列组成,依次表示用户 ID、用户名、用户分数和操作类型。 - - ```Plain - 101,Tom,100,1 - 102,Sam,70,0 - 103,Stan,80,0 - ``` - - b. 将 `example3.csv` 的数据发布到 Kafka 集群的 `topic3`。 - -2. 准备 StarRocks 表。 - - a. 在 StarRocks 数据库 `test_db` 中创建一个名为 `table3` 的主键表。该表由三列组成:`id`、`name` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table3` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - b. 
将两条记录插入到 `table3` 中。 - - ```SQL - INSERT INTO table3 VALUES - (101, 'Tom', 100), - (102, 'Sam', 90); - ``` - -#### 导入数据 - -运行一个导入作业,以从 `table3` 中删除 `example3.csv` 中 `id` 为 `101` 的记录,将 `example3.csv` 中 `id` 为 `102` 的记录更新到 `table3`,并将 `example3.csv` 中 `id` 为 `103` 的记录插入到 `table3`。 - -- 运行 Stream Load 作业: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label4" \ - -H "column_separator:," \ - -H "columns: id, name, score, temp, __op = temp" \ - -T example3.csv -XPUT \ - http://:/api/test_db/table3/_stream_load - ``` - - > **NOTE** - > - > 在上面的示例中,`example3.csv` 中表示操作类型的第四列暂时命名为 `temp`,并且 `__op` 字段通过 `columns` 参数映射到 `temp` 列。这样,StarRocks 可以根据 `example3.csv` 的第四列中的值是 `0` 还是 `1` 来决定是执行 UPSERT 还是 DELETE 操作。 - -- 运行 Broker Load 作业: - - ```Bash - LOAD LABEL test_db.label4 - ( - data infile("hdfs://:/example1.csv") - into table table1 - columns terminated by "," - format as "csv" - (id, name, score, temp) - set (__op=temp) - ) - WITH BROKER; - ``` - -- 运行 Routine Load 作业: - - ```SQL - CREATE ROUTINE LOAD test_db.table3 ON table3 - COLUMNS(id, name, score, temp, __op = temp) - PROPERTIES - ( - "desired_concurrent_number" = "3", - "max_batch_interval" = "20", - "max_batch_rows"= "250000", - "max_error_number" = "1000" - ) - FROM KAFKA - ( - "kafka_broker_list" = ":", - "kafka_topic" = "test3", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ); - ``` - -#### 查询数据 - -导入完成后,查询 `table3` 的数据以验证导入是否成功: - -```SQL -SELECT * FROM table3; -+------+------+-------+ -| id | name | score | -+------+------+-------+ -| 102 | Sam | 70 | -| 103 | Stan | 80 | -+------+------+-------+ -2 rows in set (0.01 sec) -``` - -如上述查询结果所示,`example3.csv` 中 `id` 为 `101` 的记录已从 `table3` 中删除,`example3.csv` 中 `id` 为 `102` 的记录已更新到 `table3`,并且 `example3.csv` 中 `id` 为 `103` 的记录已插入到 `table3` 中。 - -## 部分更新 - -主键表还支持部分更新,并为不同的数据更新场景提供两种部分更新模式:行模式和列模式。这两种部分更新模式可以在保证查询性能的同时,尽可能地减少部分更新的开销,从而确保实时更新。行模式更适合涉及许多列和小批量的实时更新场景。列模式适用于涉及少量列和大量行的批量处理更新场景。 - -> **NOTICE** -> -> 执行部分更新时,如果要更新的行不存在,StarRocks 会插入一个新行,并在由于没有数据更新插入而为空的字段中填充默认值。 - -本节以 CSV 为例,介绍如何执行部分更新。 - -### 数据示例 - -1. 准备数据文件。 - - a. 在本地文件系统中创建一个名为 `example4.csv` 的 CSV 文件。该文件由两列组成,依次表示用户 ID 和用户名。 - - ```Plain - 101,Lily - 102,Rose - 103,Alice - ``` - - b. 将 `example4.csv` 的数据发布到 Kafka 集群的 `topic4`。 - -2. 准备 StarRocks 表。 - - a. 在 StarRocks 数据库 `test_db` 中创建一个名为 `table4` 的主键表。该表由三列组成:`id`、`name` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table4` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NOT NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - b. 
将一条记录插入到 `table4` 中。 - - ```SQL - INSERT INTO table4 VALUES - (101, 'Tom',80); - ``` - -### 导入数据 - -运行一个导入作业,以将 `example4.csv` 中两列的数据更新到 `table4` 的 `id` 和 `name` 列。 - -- 运行 Stream Load 作业: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label7" -H "column_separator:," \ - -H "partial_update:true" \ - -H "columns:id,name" \ - -T example4.csv -XPUT \ - http://:/api/test_db/table4/_stream_load - ``` - - > **NOTE** - > - > 如果选择 Stream Load,则必须将 `partial_update` 参数设置为 `true` 才能启用部分更新功能。默认情况下,是行模式下的部分更新。如果需要执行列模式下的部分更新,则需要将 `partial_update_mode` 设置为 `column`。此外,必须使用 `columns` 参数指定要更新的列。 - -- 运行 Broker Load 作业: - - ```SQL - LOAD LABEL test_db.table4 - ( - data infile("hdfs://:/example4.csv") - into table table4 - format as "csv" - (id, name) - ) - WITH BROKER - PROPERTIES - ( - "partial_update" = "true" - ); - ``` - - > **NOTE** - > - > 如果选择 Broker Load,则必须将 `partial_update` 参数设置为 `true` 才能启用部分更新功能。默认情况下,是行模式下的部分更新。如果需要执行列模式下的部分更新,则需要将 `partial_update_mode` 设置为 `column`。此外,必须使用 `column_list` 参数指定要更新的列。 - -- 运行 Routine Load 作业: - - ```SQL - CREATE ROUTINE LOAD test_db.table4 on table4 - COLUMNS (id, name), - COLUMNS TERMINATED BY ',' - PROPERTIES - ( - "partial_update" = "true" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "test4", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - - > **NOTE** - > - > - 如果选择 Routine Load,则必须将 `partial_update` 参数设置为 `true` 才能启用部分更新功能。此外,必须使用 `COLUMNS` 参数指定要更新的列。 - > - Routine Load 仅支持行模式下的部分更新,不支持列模式下的部分更新。 - -### 查询数据 - -导入完成后,查询 `table4` 的数据以验证导入是否成功: - -```SQL -SELECT * FROM table4; -+------+-------+-------+ -| id | name | score | -+------+-------+-------+ -| 102 | Rose | 0 | -| 101 | Lily | 80 | -| 103 | Alice | 0 | -+------+-------+-------+ -3 rows in set (0.01 sec) -``` - -如上述查询结果所示,`example4.csv` 中 `id` 为 `101` 的记录已更新到 `table4`,并且 `example4.csv` 中 `id` 为 `102` 和 `103` 的记录已插入到 `table4`。 - -## 条件更新 - -从 StarRocks v2.5 开始,主键表支持条件更新。您可以指定一个非主键列作为条件,以确定更新是否可以生效。这样,仅当源数据记录在指定列中具有大于或等于目标数据记录的值时,从源记录到目标记录的更新才会生效。 - -条件更新功能旨在解决数据无序问题。如果源数据是无序的,则可以使用此功能来确保新数据不会被旧数据覆盖。 - -> **NOTICE** -> -> - 不能为同一批数据指定不同的列作为更新条件。 -> - DELETE 操作不支持条件更新。 -> - 在低于 v3.1.3 的版本中,部分更新和条件更新不能同时使用。从 v3.1.3 开始,StarRocks 支持将部分更新与条件更新一起使用。 - -### 数据示例 - -1. 准备数据文件。 - - a. 在本地文件系统中创建一个名为 `example5.csv` 的 CSV 文件。该文件由三列组成,依次表示用户 ID、版本和用户分数。 - - ```Plain - 101,1,100 - 102,3,100 - ``` - - b. 将 `example5.csv` 的数据发布到 Kafka 集群的 `topic5`。 - -2. 准备 StarRocks 表。 - - a. 在 StarRocks 数据库 `test_db` 中创建一个名为 `table5` 的主键表。该表由三列组成:`id`、`version` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table5` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `version` int NOT NULL COMMENT "version", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) DISTRIBUTED BY HASH(`id`); - ``` - - > **NOTE** - > - > 从 v2.5.7 开始,StarRocks 可以在创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - - b. 
将一条记录插入到 `table5` 中。 - - ```SQL - INSERT INTO table5 VALUES - (101, 2, 80), - (102, 2, 90); - ``` - -### 导入数据 - -运行一个导入作业,以将 `example5.csv` 中 `id` 值分别为 `101` 和 `102` 的记录更新到 `table5`,并指定仅当两条记录中的 `version` 值大于或等于其当前的 `version` 值时,更新才会生效。 - -- 运行 Stream Load 作业: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "label:label10" \ - -H "column_separator:," \ - -H "merge_condition:version" \ - -T example5.csv -XPUT \ - http://:/api/test_db/table5/_stream_load - ``` -- 运行 Insert Load 作业: - ```SQL - INSERT INTO test_db.table5 properties("merge_condition" = "version") - VALUES (101, 2, 70), (102, 3, 100); - ``` - -- 运行 Routine Load 作业: - - ```SQL - CREATE ROUTINE LOAD test_db.table5 on table5 - COLUMNS (id, version, score), - COLUMNS TERMINATED BY ',' - PROPERTIES - ( - "merge_condition" = "version" - ) - FROM KAFKA - ( - "kafka_broker_list" =":", - "kafka_topic" = "topic5", - "property.kafka_default_offsets" ="OFFSET_BEGINNING" - ); - ``` - -- 运行 Broker Load 作业: - - ```SQL - LOAD LABEL test_db.table5 - ( DATA INFILE ("s3://xxx.csv") - INTO TABLE table5 COLUMNS TERMINATED BY "," FORMAT AS "CSV" - ) - WITH BROKER - PROPERTIES - ( - "merge_condition" = "version" - ); - ``` - -### 查询数据 - -导入完成后,查询 `table5` 的数据以验证导入是否成功: - -```SQL -SELECT * FROM table5; -+------+------+-------+ -| id | version | score | -+------+------+-------+ -| 101 | 2 | 80 | -| 102 | 3 | 100 | -+------+------+-------+ -2 rows in set (0.02 sec) -``` - -如上述查询结果所示,`example5.csv` 中 `id` 为 `101` 的记录未更新到 `table5`,并且 `example5.csv` 中 `id` 为 `102` 的记录已插入到 `table5`。 \ No newline at end of file diff --git a/docs/zh/loading/Loading_data_template.md b/docs/zh/loading/Loading_data_template.md deleted file mode 100644 index 1e51707..0000000 --- a/docs/zh/loading/Loading_data_template.md +++ /dev/null @@ -1,399 +0,0 @@ ---- -displayed_sidebar: docs -unlisted: true ---- - -# 从 \ 模板加载数据 - -## 模板说明 - -### 关于样式的说明 - -技术文档通常包含指向各处其他文档的链接。查看此文档时,您可能会注意到页面上的链接很少,几乎所有链接都位于文档底部的**更多信息**部分。并非每个关键字都需要链接到另一个页面,请假定读者知道 `CREATE TABLE` 的含义,如果他们不知道,他们可以点击搜索栏来查找。可以在文档中添加注释,告诉读者还有其他选项,详细信息在**更多信息**部分中进行了描述;这可以让需要信息的人知道他们可以在完成手头的任务后***稍后***阅读它。 - -### 模板 - -此模板基于从 Amazon S3 加载数据的过程,其中的某些部分不适用于从其他来源加载数据。请专注于此模板的流程,不要担心包含每个部分;流程应为: - -#### 简介 - -介绍性文字,让读者知道如果他们遵循本指南,最终结果会是什么。对于 S3 文档,最终结果是“以异步或同步方式从 S3 加载数据”。 - -#### 为什么? - -- 对使用该技术解决的业务问题的描述 -- 所描述方法(如果存在)的优点和缺点 - -#### 数据流或其他图表 - -图表或图像可能会有所帮助。如果您描述的技术很复杂并且图像有帮助,请使用一个。如果您描述的技术会产生一些可视化的东西(例如,使用 Superset 分析数据),那么一定要包含最终产品的图像。 - -如果流程不明显,请使用数据流图。当命令导致 StarRocks 运行多个进程并组合这些进程的输出,然后操作数据时,可能需要描述数据流。在此模板中,描述了两种加载数据的方法。其中一种很简单,没有数据流部分;另一种更复杂(StarRocks 正在处理复杂的工作,而不是用户!),并且复杂的选项包括数据流部分。 - -#### 带有验证部分的示例 - -请注意,示例应位于语法详细信息和其他深入的技术详细信息之前。许多读者会来阅读文档以查找他们可以复制、粘贴和修改的特定技术。 - -如果可能,请提供一个可以工作的示例,并包含要使用的数据集。此模板中的示例使用存储在 S3 中的数据集,任何拥有 AWS 账户并且可以使用密钥和密码进行身份验证的人都可以使用该数据集。通过提供数据集,示例对读者更有价值,因为他们可以充分体验所描述的技术。 - -确保示例按编写方式工作。这意味着两件事: - -1. 您已按呈现的顺序运行命令 -2. 您已包含必要的先决条件。例如,如果您的示例引用数据库 `foo`,那么您可能需要以 `CREATE DATABASE foo;`、`USE foo;` 作为前缀。 - -验证非常重要。如果您描述的过程包括多个步骤,那么每当应该完成某些事情时,都包含一个验证步骤;这有助于避免读者到达终点并意识到他们在第 10 步中存在拼写错误。在此示例中,“检查进度”和 `DESCRIBE user_behavior_inferred;` 步骤用于验证。 - -#### 更多信息 - -在模板的末尾,有一个位置可以放置指向相关信息的链接,包括您在正文中提到的可选信息的链接。 - -### 嵌入在模板中的注释 - -模板注释的格式与我们格式化文档注释的方式有意不同,以便在您处理模板时引起您的注意。请删除粗体斜体注释: - -```markdown -***Note: descriptive text*** -``` - -## 最后,模板的开始 - -***Note: If there are multiple recommended choices, tell the -reader this in the intro. For example, when loading from S3, -there is an option for synchronous loading, and asynchronous loading:*** - -StarRocks 提供了两种从 S3 加载数据的选项: - -1. 
使用 Broker Load 进行异步加载 -2. 使用 `FILES()` 表函数进行同步加载 - -***Note: Tell the reader WHY they would choose one choice over the other:*** - -小型数据集通常使用 `FILES()` 表函数同步加载,大型数据集通常使用 Broker Load 异步加载。这两种方法各有优点,如下所述。 - -> **NOTE** -> -> 只有对 StarRocks 表具有 INSERT 权限的用户才能将数据加载到 StarRocks 表中。如果您没有 INSERT 权限,请按照 [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) 中提供的说明,将 INSERT 权限授予用于连接到 StarRocks 集群的用户。 - -## 使用 Broker Load - -异步 Broker Load 进程处理与 S3 的连接、提取数据以及将数据存储在 StarRocks 中。 - -### Broker Load 的优点 - -- Broker Load 支持在加载期间进行数据转换、UPSERT 和 DELETE 操作。 -- Broker Load 在后台运行,客户端无需保持连接即可继续作业。 -- Broker Load 是长时间运行作业的首选,默认超时时间为 4 小时。 -- 除了 Parquet 和 ORC 文件格式外,Broker Load 还支持 CSV 文件。 - -### 数据流 - -***Note: Processes that involve multiple components or steps may be easier to understand with a diagram. This example includes a diagram that helps describe the steps that happen when a user chooses the Broker Load option.*** - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -1. 用户创建一个 load job。 -2. 前端 (FE) 创建一个查询计划并将该计划分发到后端节点 (BE)。 -3. 后端 (BE) 节点从源提取数据并将数据加载到 StarRocks 中。 - -### 典型示例 - -创建一个表,启动一个从 S3 提取 Parquet 文件的 load 进程,并验证数据加载的进度和成功。 - -> **NOTE** -> -> 这些示例使用 Parquet 格式的示例数据集,如果您想加载 CSV 或 ORC 文件,该信息链接在此页面的底部。 - -#### 创建表 - -为您的表创建一个数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS project; -USE project; -``` - -创建表。此 schema 匹配 StarRocks 账户中托管的 S3 bucket 中的示例数据集。 - -```SQL -DROP TABLE IF EXISTS user_behavior; - -CREATE TABLE `user_behavior` ( - `UserID` int(11), - `ItemID` int(11), - `CategoryID` int(11), - `BehaviorType` varchar(65533), - `Timestamp` datetime -) ENGINE=OLAP -DUPLICATE KEY(`UserID`) -DISTRIBUTED BY HASH(`UserID`) -PROPERTIES ( - "replication_num" = "1" -); -``` - -#### 收集连接详细信息 - -> **NOTE** -> -> 这些示例使用基于 IAM 用户的身份验证。其他身份验证方法可用,并链接在此页面的底部。 - -从 S3 加载数据需要具有: - -- S3 bucket -- S3 对象键(对象名称),如果访问 bucket 中的特定对象。请注意,如果您的 S3 对象存储在子文件夹中,则对象键可以包含前缀。完整语法链接在**更多信息**中。 -- S3 区域 -- 访问密钥和密码 - -#### 启动 Broker Load - -此作业有四个主要部分: - -- `LABEL`:查询 `LOAD` 作业状态时使用的字符串。 -- `LOAD` 声明:源 URI、目标表和源数据格式。 -- `BROKER`:源的连接详细信息。 -- `PROPERTIES`:超时值和应用于此作业的任何其他属性。 - -> **NOTE** -> -> 这些示例中使用的数据集托管在 StarRocks 账户的 S3 bucket 中。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何 AWS 身份验证用户都可以读取该对象。在下面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。 - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("s3://starrocks-examples/user_behavior_sample_data.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER - ( - "aws.s3.enable_ssl" = "true", - "aws.s3.use_instance_profile" = "false", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -#### 检查进度 - -查询 `information_schema.loads` 表以跟踪进度。如果您有多个 `LOAD` 作业正在运行,则可以按与该作业关联的 `LABEL` 进行过滤。在下面的输出中,load job `user_behavior` 有两个条目。第一个记录显示 `CANCELLED` 状态;滚动到输出的末尾,您会看到 `listPath failed`。第二个记录显示使用有效的 AWS IAM 访问密钥和密码成功。 - -```SQL -SELECT * FROM information_schema.loads; -``` - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -```plaintext -JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH| 
-------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+ - 10121|user_behavior |project |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | | - 10106|user_behavior |project |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | | -``` - -您也可以在此处检查数据的子集。 - -```SQL -SELECT * from user_behavior LIMIT 10; -``` - -```plaintext -UserID|ItemID|CategoryID|BehaviorType|Timestamp | -------+------+----------+------------+-------------------+ -171146| 68873| 3002561|pv |2017-11-30 07:11:14| -171146|146539| 4672807|pv |2017-11-27 09:51:41| -171146|146539| 4672807|pv |2017-11-27 14:08:33| -171146|214198| 1320293|pv |2017-11-25 22:38:27| -171146|260659| 4756105|pv |2017-11-30 05:11:25| -171146|267617| 4565874|pv |2017-11-27 14:01:25| -171146|329115| 2858794|pv |2017-12-01 02:10:51| -171146|458604| 1349561|pv |2017-11-25 22:49:39| -171146|458604| 1349561|pv |2017-11-27 14:03:44| -171146|478802| 541347|pv |2017-12-02 04:52:39| -``` - -## 使用 `FILES()` 表函数 - -### `FILES()` 的优点 - -`FILES()` 可以推断 Parquet 数据列的数据类型,并为 StarRocks 表生成 schema。这提供了使用 `SELECT` 直接从 S3 查询文件,或者让 StarRocks 根据 Parquet 文件 schema 自动为您创建表的能力。 - -> **NOTE** -> -> Schema 推断是 3.1 版本中的一项新功能,仅适用于 Parquet 格式,并且尚不支持嵌套类型。 - -### 典型示例 - -有三个使用 `FILES()` 表函数的示例: - -- 直接从 S3 查询数据 -- 使用 schema 推断创建和加载表 -- 手动创建表,然后加载数据 - -> **NOTE** -> -> 这些示例中使用的数据集托管在 StarRocks 账户的 S3 bucket 中。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何 AWS 身份验证用户都可以读取该对象。在下面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。 - -#### 直接从 S3 查询 - -使用 `FILES()` 直接从 S3 查询可以在创建表之前很好地预览数据集的内容。例如: - -- 获取数据集的预览,而无需存储数据。 -- 查询最小值和最大值,并确定要使用的数据类型。 -- 检查 null 值。 - -```sql -SELECT * FROM FILES( - "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -) LIMIT 10; -``` - -> **NOTE** -> -> 请注意,列名由 Parquet 文件提供。 - -```plaintext -UserID|ItemID |CategoryID|BehaviorType|Timestamp | -------+-------+----------+------------+-------------------+ - 1|2576651| 149192|pv |2017-11-25 01:21:25| - 1|3830808| 4181361|pv |2017-11-25 07:04:53| - 1|4365585| 2520377|pv |2017-11-25 07:49:06| - 1|4606018| 
2735466|pv |2017-11-25 13:28:01| - 1| 230380| 411153|pv |2017-11-25 21:22:22| - 1|3827899| 2920476|pv |2017-11-26 16:24:33| - 1|3745169| 2891509|pv |2017-11-26 19:44:31| - 1|1531036| 2920476|pv |2017-11-26 22:02:12| - 1|2266567| 4145813|pv |2017-11-27 00:11:11| - 1|2951368| 1080785|pv |2017-11-27 02:47:08| -``` - -#### 使用 schema 推断创建表 - -这是前一个示例的延续;之前的查询包装在 `CREATE TABLE` 中,以使用 schema 推断自动创建表。使用带有 Parquet 文件的 `FILES()` 表函数时,不需要列名和类型来创建表,因为 Parquet 格式包括列名和类型,StarRocks 将推断 schema。 - -> **NOTE** -> -> 使用 schema 推断时,`CREATE TABLE` 的语法不允许设置副本数,因此请在创建表之前设置它。以下示例适用于具有单个副本的系统: -> -> `ADMIN SET FRONTEND CONFIG ('default_replication_num' ="1");` - -```sql -CREATE DATABASE IF NOT EXISTS project; -USE project; - -CREATE TABLE `user_behavior_inferred` AS -SELECT * FROM FILES( - "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -```SQL -DESCRIBE user_behavior_inferred; -``` - -```plaintext -Field |Type |Null|Key |Default|Extra| -------------+----------------+----+-----+-------+-----+ -UserID |bigint |YES |true | | | -ItemID |bigint |YES |true | | | -CategoryID |bigint |YES |true | | | -BehaviorType|varchar(1048576)|YES |false| | | -Timestamp |varchar(1048576)|YES |false| | | -``` - -> **NOTE** -> -> 将推断的 schema 与手动创建的 schema 进行比较: -> -> - 数据类型 -> - 可为空 -> - 键字段 - -```SQL -SELECT * from user_behavior_inferred LIMIT 10; -``` - -```plaintext -UserID|ItemID|CategoryID|BehaviorType|Timestamp | -------+------+----------+------------+-------------------+ -171146| 68873| 3002561|pv |2017-11-30 07:11:14| -171146|146539| 4672807|pv |2017-11-27 09:51:41| -171146|146539| 4672807|pv |2017-11-27 14:08:33| -171146|214198| 1320293|pv |2017-11-25 22:38:27| -171146|260659| 4756105|pv |2017-11-30 05:11:25| -171146|267617| 4565874|pv |2017-11-27 14:01:25| -171146|329115| 2858794|pv |2017-12-01 02:10:51| -171146|458604| 1349561|pv |2017-11-25 22:49:39| -171146|458604| 1349561|pv |2017-11-27 14:03:44| -171146|478802| 541347|pv |2017-12-02 04:52:39| -``` - -#### 加载到现有表 - -您可能想要自定义要插入的表,例如: - -- 列数据类型、可为空设置或默认值 -- 键类型和列 -- 分布 -- 等等。 - -> **NOTE** -> -> 创建最有效的表结构需要了解数据的使用方式和列的内容。本文档不涵盖表设计,在页面末尾的**更多信息**中有一个链接。 - -在此示例中,我们基于对表查询方式和 Parquet 文件中数据的了解来创建表。可以通过直接在 S3 中查询文件来获得对 Parquet 文件中数据的了解。 - -- 由于 S3 中文件的查询表明 `Timestamp` 列包含与 `datetime` 数据类型匹配的数据,因此在以下 DDL 中指定了列类型。 -- 通过查询 S3 中的数据,您可以发现数据集中没有空值,因此 DDL 不会将任何列设置为可为空。 -- 根据对预期查询类型的了解,排序键和分桶列设置为列 `UserID`(您的用例可能对此数据有所不同,您可能会决定除了 `UserID` 之外或代替 `UserID` 使用 `ItemID` 作为排序键: - -```SQL -CREATE TABLE `user_behavior_declared` ( - `UserID` int(11), - `ItemID` int(11), - `CategoryID` int(11), - `BehaviorType` varchar(65533), - `Timestamp` datetime -) ENGINE=OLAP -DUPLICATE KEY(`UserID`) -DISTRIBUTED BY HASH(`UserID`) -PROPERTIES ( - "replication_num" = "1" -); -``` - -创建表后,您可以使用 `INSERT INTO` … `SELECT FROM FILES()` 加载它: - -```SQL -INSERT INTO user_behavior_declared - SELECT * FROM FILES( - "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -## 更多信息 - -- 有关同步和异步数据加载的更多详细信息,请参见 [加载概念](./loading_introduction/loading_concepts.md)。 -- 了解 Broker Load 如何在加载期间支持数据转换,请参见 [加载时转换数据](../loading/Etl_in_loading.md) 和 [通过加载更改数据](../loading/Load_to_Primary_Key_tables.md)。 -- 本文档仅涵盖基于 IAM 用户的身份验证。有关其他选项,请参见 
[验证到 AWS 资源的身份](../integrations/authenticate_to_aws_resources.md)。 -- [AWS CLI 命令参考](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html) 详细介绍了 S3 URI。 -- 了解有关 [表设计](../table_design/StarRocks_table_design.md) 的更多信息。 -- Broker Load 提供了比上述示例更多的配置和使用选项,详细信息请参见 [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) \ No newline at end of file diff --git a/docs/zh/loading/Loading_intro.md b/docs/zh/loading/Loading_intro.md deleted file mode 100644 index 0acc5c2..0000000 --- a/docs/zh/loading/Loading_intro.md +++ /dev/null @@ -1,186 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 3 -keywords: - - load - - Insert - - Stream Load - - Broker Load - - Pipe - - Routine Load - - Spark Load ---- - - - -# 数据导入选项 - -数据导入是从各种数据源中清理和转换原始数据,并根据您的业务需求将结果数据加载到 StarRocks 中,以方便分析的过程。 - -StarRocks 提供了多种数据导入选项: - -- 导入方法:Insert、Stream Load、Broker Load、Pipe、Routine Load 和 Spark Load -- 生态系统工具:StarRocks Connector for Apache Kafka®(简称 Kafka connector)、StarRocks Connector for Apache Spark™(简称 Spark connector)、StarRocks Connector for Apache Flink®(简称 Flink connector)以及其他工具,如 SMT、DataX、CloudCanal 和 Kettle Connector -- API:Stream Load 事务接口 - -这些选项各有优势,并支持各自的数据源系统。 - -本主题概述了这些选项,并对它们进行了比较,以帮助您根据数据源、业务场景、数据量、数据文件格式和导入频率来确定您选择的导入选项。 - -## 数据导入选项简介 - -本节主要介绍 StarRocks 中可用的数据导入选项的特性和业务场景。 - -![Loading options overview](../_assets/loading_intro_overview.png) - -:::note - -在以下各节中,“批量”或“批量导入”指的是一次性将大量数据从指定源加载到 StarRocks 中,而“流式”或“流式导入”指的是连续实时地加载数据。 - -::: - -## 导入方法 - -### [Insert](InsertInto.md) - -**业务场景:** - -- INSERT INTO VALUES:向内表追加少量数据。 -- INSERT INTO SELECT: - - INSERT INTO SELECT FROM ``:将对内表或外部表的查询结果追加到表中。 - - INSERT INTO SELECT FROM FILES():将对远端存储中的数据文件的查询结果追加到表中。 - - :::note - - 对于 AWS S3,从 v3.1 开始支持此功能。对于 HDFS、Microsoft Azure Storage、Google GCS 和 S3 兼容存储(如 MinIO),从 v3.2 开始支持此功能。 - - ::: - -**文件格式:** - -- INSERT INTO VALUES:SQL -- INSERT INTO SELECT: - - INSERT INTO SELECT FROM ``:StarRocks 表 - - INSERT INTO SELECT FROM FILES():Parquet 和 ORC - -**数据量:** 不固定(数据量根据内存大小而变化。) - -### [Stream Load](StreamLoad.md) - -**业务场景:** 从本地文件系统批量导入数据。 - -**文件格式:** CSV 和 JSON - -**数据量:** 10 GB 或更少 - -### [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) - -**业务场景:** - -- 从 HDFS 或云存储(如 AWS S3、Microsoft Azure Storage、Google GCS 和 S3 兼容存储(如 MinIO))批量导入数据。 -- 从本地文件系统或 NAS 批量导入数据。 - -**文件格式:** CSV、Parquet、ORC 和 JSON(自 v3.2.3 起支持) - -**数据量:** 几十 GB 到数百 GB - -### [Pipe](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md) - -**业务场景:** 从 HDFS 或 AWS S3 批量导入或流式导入数据。 - -:::note - -从 v3.2 开始支持此导入方法。 - -::: - -**文件格式:** Parquet 和 ORC - -**数据量:** 100 GB 到 1 TB 或更多 - -### [Routine Load](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) - -**业务场景:** 从 Kafka 流式导入数据。 - -**文件格式:** CSV、JSON 和 Avro(自 v3.0.1 起支持) - -**数据量:** 以小批量形式导入 MB 到 GB 级别的数据 - -### [Spark Load](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) - -**业务场景:** 使用 Spark 集群批量导入存储在 HDFS 中的 Apache Hive™ 表的数据。 - -**文件格式:** CSV、Parquet(自 v2.0 起支持)和 ORC(自 v2.0 起支持) - -**数据量:** 几十 GB 到 TB - -## 生态系统工具 - -### [Kafka connector](Kafka-connector-starrocks.md) - -**业务场景:** 从 Kafka 流式导入数据。 - -### [Spark connector](Spark-connector-starrocks.md) - -**业务场景:** 从 Spark 批量导入数据。 - -### [Flink connector](Flink-connector-starrocks.md) - -**业务场景:** 从 Flink 流式导入数据。 - -### [SMT](../integrations/loading_tools/SMT.md) - -**业务场景:** 通过 Flink 从 MySQL、PostgreSQL、SQL Server、Oracle、Hive、ClickHouse 和 TiDB 等数据源加载数据。 - -### 
[DataX](../integrations/loading_tools/DataX-starrocks-writer.md) - -**业务场景:** 在各种异构数据源(包括关系数据库(例如 MySQL 和 Oracle)、HDFS 和 Hive)之间同步数据。 - -### [CloudCanal](../integrations/loading_tools/CloudCanal.md) - -**业务场景:** 将数据从源数据库(例如 MySQL、Oracle 和 PostgreSQL)迁移或同步到 StarRocks。 - -### [Kettle Connector](https://github.com/StarRocks/starrocks-connector-for-kettle) - -**业务场景:** 与 Kettle 集成。通过将 Kettle 强大的数据处理和转换能力与 StarRocks 的高性能数据存储和分析能力相结合,可以实现更灵活高效的数据处理工作流程。 - -## API - -### [Stream Load transaction interface](Stream_Load_transaction_interface.md) - -**业务场景:** 为从 Flink 和 Kafka 等外部系统加载数据而运行的事务实现两阶段提交 (2PC),同时提高高并发流式导入的性能。从 v2.4 开始支持此功能。 - -**文件格式:** CSV 和 JSON - -**数据量:** 10 GB 或更少 - -## 数据导入选项的选择 - -本节列出了可用于常见数据源的导入选项,帮助您选择最适合您情况的选项。 - -### 对象存储 - -| **数据源** | **可用的数据导入选项** that is displayed on the left navigation bar. - -Data loading is the process of cleansing and transforming raw data from various data sources based on your business requirements and loading the resulting data into StarRocks to facilitate analysis. - -StarRocks provides a variety of options for data loading: - -1. [**Insert**](InsertInto.md): Append to an internal table with small amounts of data. -2. [**Stream Load**](StreamLoad.md): Batch load data from a local file system. -3. [**Broker Load**](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md): Batch load data from HDFS or cloud storage like AWS S3, Microsoft Azure Storage, Google GCS, and S3-compatible storage (such as MinIO). -4. [**Pipe**](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md): Batch load or stream data from HDFS or AWS S3. -5. [**Routine Load**](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md): Stream data from Kafka. -6. [**Spark Load**](../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md): Batch load data of Apache Hive™ tables stored in HDFS by using Spark clusters. -7. [**Kafka connector**](Kafka-connector-starrocks.md): Stream data from Kafka. -8. [**Spark connector**](Spark-connector-starrocks.md): Batch load data from Spark. -9. [**Flink connector**](Flink-connector-starrocks.md): Stream data from Flink. -10. [**SMT**](../integrations/loading_tools/SMT.md): Load data from data sources such as MySQL, PostgreSQL, SQL Server, Oracle, Hive, ClickHouse, and TiDB through Flink. -11. [**DataX**](../integrations/loading_tools/DataX-starrocks-writer.md): Synchronize data between various heterogeneous data sources, including relational databases (for example, MySQL and Oracle), HDFS, and Hive. -12. [**CloudCanal**](../integrations/loading_tools/CloudCanal.md): Migrate or synchronize data from source databases (for example, MySQL, Oracle, and PostgreSQL) to StarRocks. -13. [**Kettle Connector**](https://github.com/StarRocks/starrocks-connector-for-kettle): Integrate with Kettle. By combining Kettle's robust data processing and transformation capabilities with StarRocks's high-performance data storage and analytical abilities, more flexible and efficient data processing workflows can be achieved. -14. [**Stream Load transaction interface**](Stream_Load_transaction_interface.md): Implement two-phase commit (2PC) for transactions that are run to load data from external systems such as Flink and Kafka, while improving the performance of highly concurrent stream loads. - -These options each have its own advantages and support its own set of data source systems to pull from. 
- -This topic provides an overview of these options, along with comparisons between them to help you determine the loading option of your choice based on your data source, business scenario, data volume, data file format, and loading frequency. \ No newline at end of file diff --git a/docs/zh/loading/RoutineLoad.md b/docs/zh/loading/RoutineLoad.md deleted file mode 100644 index 5a1644e..0000000 --- a/docs/zh/loading/RoutineLoad.md +++ /dev/null @@ -1,530 +0,0 @@ ---- -displayed_sidebar: docs -keywords: ['Routine Load'] ---- - -# 使用 Routine Load 导入数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' -import QSTip from '../_assets/commonMarkdown/quickstart-routine-load-tip.mdx' - - - -本文介绍如何创建 Routine Load 作业,将 Kafka 消息(事件)流式传输到 StarRocks 中,并帮助您熟悉 Routine Load 的一些基本概念。 - -要将消息流持续不断地导入到 StarRocks 中,您可以将消息流存储在 Kafka topic 中,并创建一个 Routine Load 作业来消费这些消息。Routine Load 作业在 StarRocks 中持久存在,生成一系列导入任务,以消费 topic 中全部或部分分区中的消息,并将消息导入到 StarRocks 中。 - -Routine Load 作业支持精确一次 (exactly-once) 的交付语义,以保证导入到 StarRocks 中的数据既不会丢失也不会重复。 - -Routine Load 支持在数据导入时进行数据转换,并支持在数据导入期间通过 UPSERT 和 DELETE 操作进行数据更改。更多信息,请参见 [在导入时转换数据](../loading/Etl_in_loading.md) 和 [通过导入更改数据](../loading/Load_to_Primary_Key_tables.md)。 - - - -## 支持的数据格式 - -Routine Load 现在支持从 Kafka 集群消费 CSV、JSON 和 Avro (v3.0.1 起支持) 格式的数据。 - -> **NOTE** -> -> 对于 CSV 数据,请注意以下几点: -> -> - 您可以使用 UTF-8 字符串,例如逗号 (,)、制表符或管道 (|),其长度不超过 50 字节作为文本分隔符。 -> - 空值用 `\N` 表示。例如,一个数据文件包含三列,该数据文件中的一条记录在第一列和第三列中包含数据,但在第二列中没有数据。在这种情况下,您需要在第二列中使用 `\N` 来表示空值。这意味着该记录必须编译为 `a,\N,b` 而不是 `a,,b`。`a,,b` 表示该记录的第二列包含一个空字符串。 - -## 基本概念 - -![routine load](../_assets/4.5.2-1.png) - -### 术语 - -- **导入作业** - - Routine Load 作业是一个长时间运行的作业。只要其状态为 RUNNING,导入作业就会持续生成一个或多个并发导入任务,这些任务消费 Kafka 集群 topic 中的消息并将数据导入到 StarRocks 中。 - -- **导入任务** - - 一个导入作业通过一定的规则被拆分为多个导入任务。导入任务是数据导入的基本单元。作为一个独立的事件,一个导入任务基于 [Stream Load](../loading/StreamLoad.md) 实现导入机制。多个导入任务并发地消费来自 topic 不同分区的消息,并将数据导入到 StarRocks 中。 - -### 工作流程 - -1. **创建一个 Routine Load 作业。** - 要从 Kafka 导入数据,您需要通过运行 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) 语句来创建一个 Routine Load 作业。FE 解析该语句,并根据您指定的属性创建该作业。 - -2. **FE 将作业拆分为多个导入任务。** - - FE 基于一定的规则将作业拆分为多个导入任务。每个导入任务都是一个独立的事务。 - 拆分规则如下: - - FE 根据所需的并发数 `desired_concurrent_number`、Kafka topic 中的分区数以及处于活动状态的 BE 节点数来计算导入任务的实际并发数。 - - FE 基于计算出的实际并发数将作业拆分为导入任务,并将这些任务排列在任务队列中。 - - 每个 Kafka topic 由多个分区组成。Topic 分区和导入任务之间的关系如下: - - 一个分区唯一地分配给一个导入任务,并且来自该分区的所有消息都由该导入任务消费。 - - 一个导入任务可以消费来自一个或多个分区的消息。 - - 所有分区均匀地分配给导入任务。 - -3. **多个导入任务并发运行,以消费来自多个 Kafka topic 分区的消息,并将数据导入到 StarRocks 中。** - - 1. **FE 调度和提交导入任务**:FE 及时调度队列中的导入任务,并将它们分配给选定的 Coordinator BE 节点。导入任务之间的时间间隔由配置项 `max_batch_interval` 定义。FE 将导入任务均匀地分配给所有 BE 节点。有关 `max_batch_interval` 的更多信息,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md#examples)。 - - 2. Coordinator BE 启动导入任务,消费分区中的消息,解析和过滤数据。一个导入任务持续到消费了预定义数量的消息或达到预定义的时间限制为止。消息批处理大小和时间限制在 FE 配置 `max_routine_load_batch_size` 和 `routine_load_task_consume_second` 中定义。有关详细信息,请参见 [FE 配置](../administration/management/FE_configuration.md)。然后,Coordinator BE 将消息分发给 Executor BE。Executor BE 将消息写入磁盘。 - - > **NOTE** - > - > StarRocks 支持通过包括 SASL_SSL、SAS_PLAINTEXT、SSL 和 PLAINTEXT 在内的安全协议访问 Kafka。本主题以通过 PLAINTEXT 连接到 Kafka 为例。如果您需要通过其他安全协议连接到 Kafka,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -4. 
**FE 生成新的导入任务以持续导入数据。** - 在 Executor BE 将数据写入磁盘后,Coordinator BE 将导入任务的结果报告给 FE。基于该结果,FE 然后生成新的导入任务以持续导入数据。或者,FE 重试失败的任务以确保导入到 StarRocks 中的数据既不会丢失也不会重复。 - -## 创建 Routine Load 作业 - -以下三个示例描述了如何消费 Kafka 中的 CSV 格式、JSON 格式和 Avro 格式的数据,并通过创建 Routine Load 作业将数据导入到 StarRocks 中。有关详细的语法和参数描述,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### 导入 CSV 格式的数据 - -本节介绍如何创建一个 Routine Load 作业,以消费 Kafka 集群中的 CSV 格式数据,并将数据导入到 StarRocks 中。 - -#### 准备数据集 - -假设 Kafka 集群的 topic `ordertest1` 中有一个 CSV 格式的数据集。数据集中的每条消息都包含六个字段:订单 ID、支付日期、客户姓名、国籍、性别和价格。 - -```Plain -2020050802,2020-05-08,Johann Georg Faust,Deutschland,male,895 -2020050802,2020-05-08,Julien Sorel,France,male,893 -2020050803,2020-05-08,Dorian Grey,UK,male,1262 -2020050901,2020-05-09,Anna Karenina",Russia,female,175 -2020051001,2020-05-10,Tess Durbeyfield,US,female,986 -2020051101,2020-05-11,Edogawa Conan,japan,male,8924 -``` - -#### 创建表 - -根据 CSV 格式数据的字段,在数据库 `example_db` 中创建表 `example_tbl1`。以下示例创建一个包含 5 个字段的表,不包括 CSV 格式数据中的客户性别字段。 - -```SQL -CREATE TABLE example_db.example_tbl1 ( - `order_id` bigint NOT NULL COMMENT "Order ID", - `pay_dt` date NOT NULL COMMENT "Payment date", - `customer_name` varchar(26) NULL COMMENT "Customer name", - `nationality` varchar(26) NULL COMMENT "Nationality", - `price`double NULL COMMENT "Price" -) -ENGINE=OLAP -DUPLICATE KEY (order_id,pay_dt) -DISTRIBUTED BY HASH(`order_id`); -``` - -> **NOTICE** -> -> 从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -#### 提交 Routine Load 作业 - -执行以下语句以提交一个名为 `example_tbl1_ordertest1` 的 Routine Load 作业,以消费 topic `ordertest1` 中的消息并将数据导入到表 `example_tbl1` 中。导入任务从 topic 的指定分区中的初始 offset 消费消息。 - -```SQL -CREATE ROUTINE LOAD example_db.example_tbl1_ordertest1 ON example_tbl1 -COLUMNS TERMINATED BY ",", -COLUMNS (order_id, pay_dt, customer_name, nationality, temp_gender, price) -PROPERTIES -( - "desired_concurrent_number" = "5" -) -FROM KAFKA -( - "kafka_broker_list" = ":,:", - "kafka_topic" = "ordertest1", - "kafka_partitions" = "0,1,2,3,4", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -提交导入作业后,您可以执行 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句来检查导入作业的状态。 - -- **导入作业名称** - - 一个表上可能有多个导入作业。因此,我们建议您使用相应的 Kafka topic 和提交导入作业的时间来命名导入作业。这有助于您区分每个表上的导入作业。 - -- **列分隔符** - - 属性 `COLUMN TERMINATED BY` 定义 CSV 格式数据的列分隔符。默认为 `\t`。 - -- **Kafka topic 分区和 offset** - - 您可以指定属性 `kafka_partitions` 和 `kafka_offsets` 来指定要消费消息的分区和 offset。例如,如果您希望导入作业消费 topic `ordertest1` 的 Kafka 分区 `"0,1,2,3,4"` 中的消息,并且所有消息都具有初始 offset,则可以按如下方式指定属性:如果您希望导入作业消费 Kafka 分区 `"0,1,2,3,4"` 中的消息,并且需要为每个分区指定单独的起始 offset,则可以按如下方式配置: - - ```SQL - "kafka_partitions" ="0,1,2,3,4", - "kafka_offsets" = "OFFSET_BEGINNING, OFFSET_END, 1000, 2000, 3000" - ``` - - 您还可以使用属性 `property.kafka_default_offsets` 设置所有分区的默认 offset。 - - ```SQL - "kafka_partitions" ="0,1,2,3,4", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" - ``` - - 有关详细信息,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -- **数据映射和转换** - - 要指定 CSV 格式数据和 StarRocks 表之间的映射和转换关系,您需要使用 `COLUMNS` 参数。 - - **数据映射:** - - - StarRocks 提取 CSV 格式数据中的列,并按照**顺序**将它们映射到 `COLUMNS` 参数中声明的字段。 - - - StarRocks 提取 `COLUMNS` 参数中声明的字段,并按照**名称**将它们映射到 StarRocks 表的列。 - - **数据转换:** - - 并且由于该示例排除了 CSV 格式数据中的客户性别列,因此 `COLUMNS` 参数中的字段 `temp_gender` 
用作此字段的占位符。其他字段直接映射到 StarRocks 表 `example_tbl1` 的列。 - - 有关数据转换的更多信息,请参见 [在导入时转换数据](./Etl_in_loading.md)。 - - > **NOTE** - > - > 如果 CSV 格式数据中列的名称、数量和顺序与 StarRocks 表中的列完全对应,则无需指定 `COLUMNS` 参数。 - -- **任务并发** - - 当 Kafka topic 分区很多且 BE 节点足够时,您可以通过增加任务并发来加速导入。 - - 要增加实际的导入任务并发,您可以在创建 Routine Load 作业时增加所需的导入任务并发 `desired_concurrent_number`。您还可以将 FE 的动态配置项 `max_routine_load_task_concurrent_num`(默认的最大导入任务并发数)设置为更大的值。有关 `max_routine_load_task_concurrent_num` 的更多信息,请参见 [FE 配置项](../administration/management/FE_configuration.md)。 - - 实际的任务并发由处于活动状态的 BE 节点数、预先指定的 Kafka topic 分区数以及 `desired_concurrent_number` 和 `max_routine_load_task_concurrent_num` 的值中的最小值定义。 - - 在该示例中,处于活动状态的 BE 节点数为 `5`,预先指定的 Kafka topic 分区数为 `5`,并且 `max_routine_load_task_concurrent_num` 的值为 `5`。要增加实际的导入任务并发,您可以将 `desired_concurrent_number` 从默认值 `3` 增加到 `5`。 - - 有关属性的更多信息,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### 导入 JSON 格式的数据 - -本节介绍如何创建一个 Routine Load 作业,以消费 Kafka 集群中的 JSON 格式数据,并将数据导入到 StarRocks 中。 - -#### 准备数据集 - -假设 Kafka 集群的 topic `ordertest2` 中有一个 JSON 格式的数据集。该数据集包括六个键:商品 ID、客户姓名、国籍、支付时间和价格。此外,您希望将支付时间列转换为 DATE 类型,并将其导入到 StarRocks 表中的 `pay_dt` 列。 - -```JSON -{"commodity_id": "1", "customer_name": "Mark Twain", "country": "US","pay_time": 1589191487,"price": 875} -{"commodity_id": "2", "customer_name": "Oscar Wilde", "country": "UK","pay_time": 1589191487,"price": 895} -{"commodity_id": "3", "customer_name": "Antoine de Saint-Exupéry","country": "France","pay_time": 1589191487,"price": 895} -``` - -> **CAUTION** 一行中的每个 JSON 对象必须位于一条 Kafka 消息中,否则将返回 JSON 解析错误。 - -#### 创建表 - -根据 JSON 格式数据的键,在数据库 `example_db` 中创建表 `example_tbl2`。 - -```SQL -CREATE TABLE `example_tbl2` ( - `commodity_id` varchar(26) NULL COMMENT "Commodity ID", - `customer_name` varchar(26) NULL COMMENT "Customer name", - `country` varchar(26) NULL COMMENT "Country", - `pay_time` bigint(20) NULL COMMENT "Payment time", - `pay_dt` date NULL COMMENT "Payment date", - `price`double SUM NULL COMMENT "Price" -) -ENGINE=OLAP -AGGREGATE KEY(`commodity_id`,`customer_name`,`country`,`pay_time`,`pay_dt`) -DISTRIBUTED BY HASH(`commodity_id`); -``` - -> **NOTICE** -> -> 从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -#### 提交 Routine Load 作业 - -执行以下语句以提交一个名为 `example_tbl2_ordertest2` 的 Routine Load 作业,以消费 topic `ordertest2` 中的消息并将数据导入到表 `example_tbl2` 中。导入任务从 topic 的指定分区中的初始 offset 消费消息。 - -```SQL -CREATE ROUTINE LOAD example_db.example_tbl2_ordertest2 ON example_tbl2 -COLUMNS(commodity_id, customer_name, country, pay_time, price, pay_dt=from_unixtime(pay_time, '%Y%m%d')) -PROPERTIES -( - "desired_concurrent_number" = "5", - "format" = "json", - "jsonpaths" = "[\"$.commodity_id\",\"$.customer_name\",\"$.country\",\"$.pay_time\",\"$.price\"]" - ) -FROM KAFKA -( - "kafka_broker_list" =":,:", - "kafka_topic" = "ordertest2", - "kafka_partitions" ="0,1,2,3,4", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -提交导入作业后,您可以执行 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句来检查导入作业的状态。 - -- **数据格式** - - 您需要在 `PROPERTIES` 子句中指定 `"format" = "json"`,以定义数据格式为 JSON。 - -- **数据映射和转换** - - 要指定 JSON 格式数据和 StarRocks 表之间的映射和转换关系,您需要指定参数 `COLUMNS` 和属性 `jsonpaths`。`COLUMNS` 参数中指定的字段顺序必须与 JSON 格式数据的顺序匹配,并且字段名称必须与 StarRocks 表的名称匹配。属性 `jsonpaths` 用于从 JSON 数据中提取所需的字段。然后,这些字段由属性 `COLUMNS` 命名。 - - 
因为该示例需要将支付时间字段转换为 DATE 数据类型,并将数据导入到 StarRocks 表中的 `pay_dt` 列,所以您需要使用 from_unixtime 函数。其他字段直接映射到表 `example_tbl2` 的字段。 - - **数据映射:** - - - StarRocks 提取 JSON 格式数据的 `name` 和 `code` 键,并将它们映射到 `jsonpaths` 属性中声明的键。 - - - StarRocks 提取 `jsonpaths` 属性中声明的键,并按照**顺序**将它们映射到 `COLUMNS` 参数中声明的字段。 - - - StarRocks 提取 `COLUMNS` 参数中声明的字段,并按照**名称**将它们映射到 StarRocks 表的列。 - - **数据转换**: - - - 因为该示例需要将键 `pay_time` 转换为 DATE 数据类型,并将数据导入到 StarRocks 表中的 `pay_dt` 列,所以您需要在 `COLUMNS` 参数中使用 from_unixtime 函数。其他字段直接映射到表 `example_tbl2` 的字段。 - - - 并且由于该示例排除了 JSON 格式数据中的客户性别列,因此 `COLUMNS` 参数中的字段 `temp_gender` 用作此字段的占位符。其他字段直接映射到 StarRocks 表 `example_tbl1` 的列。 - - 有关数据转换的更多信息,请参见 [在导入时转换数据](./Etl_in_loading.md)。 - - > **NOTE** - > - > 如果 JSON 对象中键的名称和数量与 StarRocks 表中字段的名称和数量完全匹配,则无需指定 `COLUMNS` 参数。 - -### 导入 Avro 格式的数据 - -从 v3.0.1 开始,StarRocks 支持使用 Routine Load 导入 Avro 数据。 - -#### 准备数据集 - -##### Avro schema - -1. 创建以下 Avro schema 文件 `avro_schema.avsc`: - - ```JSON - { - "type": "record", - "name": "sensor_log", - "fields" : [ - {"name": "id", "type": "long"}, - {"name": "name", "type": "string"}, - {"name": "checked", "type" : "boolean"}, - {"name": "data", "type": "double"}, - {"name": "sensor_type", "type": {"type": "enum", "name": "sensor_type_enum", "symbols" : ["TEMPERATURE", "HUMIDITY", "AIR-PRESSURE"]}} - ] - } - ``` - -2. 在 [Schema Registry](https://docs.confluent.io/cloud/current/get-started/schema-registry.html#create-a-schema) 中注册 Avro schema。 - -##### Avro 数据 - -准备 Avro 数据并将其发送到 Kafka topic `topic_0`。 - -#### 创建表 - -根据 Avro 数据的字段,在 StarRocks 集群的目标数据库 `example_db` 中创建一个表 `sensor_log`。表的列名必须与 Avro 数据中的字段名匹配。有关表列和 Avro 数据字段之间的数据类型映射,请参见 [数据类型映射](#Data types mapping)。 - -```SQL -CREATE TABLE example_db.sensor_log ( - `id` bigint NOT NULL COMMENT "sensor id", - `name` varchar(26) NOT NULL COMMENT "sensor name", - `checked` boolean NOT NULL COMMENT "checked", - `data` double NULL COMMENT "sensor data", - `sensor_type` varchar(26) NOT NULL COMMENT "sensor type" -) -ENGINE=OLAP -DUPLICATE KEY (id) -DISTRIBUTED BY HASH(`id`); -``` - -> **NOTICE** -> -> 从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -#### 提交 Routine Load 作业 - -执行以下语句以提交一个名为 `sensor_log_load_job` 的 Routine Load 作业,以消费 Kafka topic `topic_0` 中的 Avro 消息并将数据导入到数据库 `sensor` 中的表 `sensor_log` 中。导入作业从 topic 的指定分区中的初始 offset 消费消息。 - -```SQL -CREATE ROUTINE LOAD example_db.sensor_log_load_job ON sensor_log -PROPERTIES -( - "format" = "avro" -) -FROM KAFKA -( - "kafka_broker_list" = ":,:,...", - "confluent.schema.registry.url" = "http://172.xx.xxx.xxx:8081", - "kafka_topic" = "topic_0", - "kafka_partitions" = "0,1,2,3,4,5", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -- 数据格式 - - 您需要在 `PROPERTIES` 子句中指定 `"format = "avro"`,以定义数据格式为 Avro。 - -- Schema Registry - - 您需要配置 `confluent.schema.registry.url` 以指定注册 Avro schema 的 Schema Registry 的 URL。StarRocks 使用此 URL 检索 Avro schema。格式如下: - - ```Plaintext - confluent.schema.registry.url = http[s]://[:@][:] - ``` - -- 数据映射和转换 - - 要指定 Avro 格式数据和 StarRocks 表之间的映射和转换关系,您需要指定参数 `COLUMNS` 和属性 `jsonpaths`。`COLUMNS` 参数中指定的字段顺序必须与属性 `jsonpaths` 中字段的顺序匹配,并且字段名称必须与 StarRocks 表的名称匹配。属性 `jsonpaths` 用于从 Avro 数据中提取所需的字段。然后,这些字段由属性 `COLUMNS` 命名。 - - 有关数据转换的更多信息,请参见 [在导入时转换数据](./Etl_in_loading.md)。 - - > NOTE - > - > 如果 Avro 记录中字段的名称和数量与 StarRocks 表中列的名称和数量完全匹配,则无需指定 `COLUMNS` 参数。 - -提交导入作业后,您可以执行 [SHOW ROUTINE 
LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句来检查导入作业的状态。 - -#### 数据类型映射 - -您要导入的 Avro 数据字段与 StarRocks 表列之间的数据类型映射如下: - -##### 原始类型 - -| Avro | StarRocks | -| ------- | --------- | -| nul | NULL | -| boolean | BOOLEAN | -| int | INT | -| long | BIGINT | -| float | FLOAT | -| double | DOUBLE | -| bytes | STRING | -| string | STRING | - -##### 复杂类型 - -| Avro | StarRocks | -| -------------- | ------------------------------------------------------------ | -| record | 将整个 RECORD 或其子字段作为 JSON 导入到 StarRocks 中。 | -| enums | STRING | -| arrays | ARRAY | -| maps | JSON | -| union(T, null) | NULLABLE(T) | -| fixed | STRING | - -#### 限制 - -- 目前,StarRocks 不支持 schema evolution。 -- 每条 Kafka 消息必须仅包含一条 Avro 数据记录。 - -## 检查导入作业和任务 - -### 检查导入作业 - -执行 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句以检查导入作业 `example_tbl2_ordertest2` 的状态。StarRocks 返回执行状态 `State`、统计信息(包括消费的总行数和导入的总行数)`Statistics` 以及导入作业的进度 `progress`。 - -如果导入作业的状态自动更改为 **PAUSED**,则可能是因为错误行数已超过阈值。有关设置此阈值的详细说明,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。您可以检查文件 `ReasonOfStateChanged` 和 `ErrorLogUrls` 以识别和解决问题。解决问题后,您可以执行 [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) 语句以恢复 **PAUSED** 导入作业。 - -如果导入作业的状态为 **CANCELLED**,则可能是因为导入作业遇到异常(例如,表已被删除)。您可以检查文件 `ReasonOfStateChanged` 和 `ErrorLogUrls` 以识别和解决问题。但是,您无法恢复 **CANCELLED** 导入作业。 - -```SQL -MySQL [example_db]> SHOW ROUTINE LOAD FOR example_tbl2_ordertest2 \G -*************************** 1. row *************************** - Id: 63013 - Name: example_tbl2_ordertest2 - CreateTime: 2022-08-10 17:09:00 - PauseTime: NULL - EndTime: NULL - DbName: default_cluster:example_db - TableName: example_tbl2 - State: RUNNING - DataSourceType: KAFKA - CurrentTaskNum: 3 - JobProperties: {"partitions":"*","partial_update":"false","columnToColumnExpr":"commodity_id,customer_name,country,pay_time,pay_dt=from_unixtime(`pay_time`, '%Y%m%d'),price","maxBatchIntervalS":"20","whereExpr":"*","dataFormat":"json","timezone":"Asia/Shanghai","format":"json","json_root":"","strict_mode":"false","jsonpaths":"[\"$.commodity_id\",\"$.customer_name\",\"$.country\",\"$.pay_time\",\"$.price\"]","desireTaskConcurrentNum":"3","maxErrorNum":"0","strip_outer_array":"false","currentTaskConcurrentNum":"3","maxBatchRows":"200000"} -DataSourceProperties: {"topic":"ordertest2","currentKafkaPartitions":"0,1,2,3,4","brokerList":":,:"} - CustomProperties: {"kafka_default_offsets":"OFFSET_BEGINNING"} - Statistic: {"receivedBytes":230,"errorRows":0,"committedTaskNum":1,"loadedRows":2,"loadRowsRate":0,"abortedTaskNum":0,"totalRows":2,"unselectedRows":0,"receivedBytesRate":0,"taskExecuteTimeMs":522} - Progress: {"0":"1","1":"OFFSET_ZERO","2":"OFFSET_ZERO","3":"OFFSET_ZERO","4":"OFFSET_ZERO"} -ReasonOfStateChanged: - ErrorLogUrls: - OtherMsg: -``` - -> **CAUTION** -> -> 您无法检查已停止或尚未启动的导入作业。 - -### 检查导入任务 - -执行 [SHOW ROUTINE LOAD TASK](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md) 语句以检查导入作业 `example_tbl2_ordertest2` 的导入任务,例如当前正在运行的任务数、正在消费的 Kafka topic 分区和消费进度 `DataSourceProperties` 以及相应的 Coordinator BE 节点 `BeId`。 - -```SQL -MySQL [example_db]> SHOW ROUTINE LOAD TASK WHERE JobName = "example_tbl2_ordertest2" \G -*************************** 1. 
row *************************** - TaskId: 18c3a823-d73e-4a64-b9cb-b9eced026753 - TxnId: -1 - TxnStatus: UNKNOWN - JobId: 63013 - CreateTime: 2022-08-10 17:09:05 - LastScheduledTime: 2022-08-10 17:47:27 - ExecuteStartTime: NULL - Timeout: 60 - BeId: -1 -DataSourceProperties: {"1":0,"4":0} - Message: there is no new data in kafka, wait for 20 seconds to schedule again -*************************** 2. row *************************** - TaskId: f76c97ac-26aa-4b41-8194-a8ba2063eb00 - TxnId: -1 - TxnStatus: UNKNOWN - JobId: 63013 - CreateTime: 2022-08-10 17:09:05 - LastScheduledTime: 2022-08-10 17:47:26 - ExecuteStartTime: NULL - Timeout: 60 - BeId: -1 -DataSourceProperties: {"2":0} - Message: there is no new data in kafka, wait for 20 seconds to schedule again -*************************** 3. row *************************** - TaskId: 1a327a34-99f4-4f8d-8014-3cd38db99ec6 - TxnId: -1 - TxnStatus: UNKNOWN - JobId: 63013 - CreateTime: 2022-08-10 17:09:26 - LastScheduledTime: 2022-08-10 17:47:27 - ExecuteStartTime: NULL - Timeout: 60 - BeId: -1 -DataSourceProperties: {"0":2,"3":0} - Message: there is no new data in kafka, wait for 20 seconds to schedule again -``` - -## 暂停导入作业 - -您可以执行 [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) 语句以暂停导入作业。执行该语句后,导入作业的状态将为 **PAUSED**。但是,它尚未停止。您可以执行 [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) 语句以恢复它。您还可以使用 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句检查其状态。 - -以下示例暂停导入作业 `example_tbl2_ordertest2`: - -```SQL -PAUSE ROUTINE LOAD FOR example_tbl2_ordertest2; -``` - -## 恢复导入作业 - -您可以执行 [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) 语句以恢复已暂停的导入作业。导入作业的状态将暂时为 **NEED_SCHEDULE**(因为正在重新调度导入作业),然后变为 **RUNNING**。您可以使用 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句检查其状态。 - -以下示例恢复已暂停的导入作业 `example_tbl2_ordertest2`: - -```SQL -RESUME ROUTINE LOAD FOR example_tbl2_ordertest2; -``` - -## 更改导入作业 - -在更改导入作业之前,您必须使用 [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) 语句暂停它。然后,您可以执行 [ALTER ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/ALTER_ROUTINE_LOAD.md)。更改后,您可以执行 [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) 语句以恢复它,并使用 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句检查其状态。 - -假设处于活动状态的 BE 节点数增加到 `6`,并且要消费的 Kafka topic 分区为 `"0,1,2,3,4,5,6,7"`。如果您想增加实际的导入任务并发,您可以执行以下语句来将所需的任务并发数 `desired_concurrent_number` 增加到 `6`(大于或等于处于活动状态的 BE 节点数),并指定 Kafka topic 分区和初始 offset。 - -> **NOTE** -> -> 因为实际的任务并发由多个参数的最小值决定,所以您必须确保 FE 动态参数 `max_routine_load_task_concurrent_num` 的值大于或 \ No newline at end of file diff --git a/docs/zh/loading/SQL_transaction.md b/docs/zh/loading/SQL_transaction.md deleted file mode 100644 index d1be8ee..0000000 --- a/docs/zh/loading/SQL_transaction.md +++ /dev/null @@ -1,156 +0,0 @@ ---- -displayed_sidebar: docs ---- - -import Beta from '../_assets/commonMarkdown/_beta.mdx' - -# SQL 事务 - - - -启动一个简单的 SQL 事务,以批量提交多个 DML 语句。 - -## 概述 - -自 v3.5.0 起,StarRocks 支持 SQL 事务,以确保在多个表中操作数据时,更新表的原子性。 - -事务由在同一原子单元中处理的多个 SQL 语句组成。 事务中的语句要么一起应用,要么一起撤消,从而保证了事务的 ACID(原子性、一致性、隔离性和持久性)属性。 - -目前,StarRocks 中的 SQL 事务支持以下操作: -- INSERT INTO -- UPDATE -- DELETE - -:::note - -- 目前不支持 INSERT 
OVERWRITE。 -- 从 v4.0 开始,只有在存算分离集群中才支持在事务中对同一表执行多个 INSERT 语句。 -- 从 v4.0 开始,只有在存算分离集群中才支持 UPDATE 和 DELETE。 - -::: - -从 v4.0 开始,在一个 SQL 事务中: -- 支持对一个表执行**多个 INSERT 语句**。 -- 允许对一个表执行**仅一个 UPDATE *或* DELETE** 语句。 -- **不允许**在对同一表执行 INSERT 语句**之后**执行 **UPDATE *或* DELETE** 语句。 - -事务的 ACID 属性仅在有限的 READ COMMITTED 隔离级别上得到保证,即: -- 语句仅对在该语句开始之前已提交的数据进行操作。 -- 如果在第一个语句和第二个语句的执行之间提交了另一个事务,则同一事务中的两个连续语句可以对不同的数据进行操作。 -- 前面的 DML 语句带来的数据更改对于同一事务中的后续语句是不可见的。 - -一个事务与一个会话相关联。 多个会话不能共享同一个事务。 - -## 用法 - -1. 必须通过执行 START TRANSACTION 语句来启动事务。 StarRocks 还支持同义词 BEGIN。 - - ```SQL - { START TRANSACTION | BEGIN [ WORK ] } - ``` - -2. 启动事务后,您可以在事务中定义多个 DML 语句。 有关详细信息,请参见 [使用说明](#usage-notes)。 - -3. 必须通过执行 `COMMIT` 或 `ROLLBACK` 显式结束事务。 - - - 要应用(提交)事务,请使用以下语法: - - ```SQL - COMMIT [ WORK ] - ``` - - - 要撤消(回滚)事务,请使用以下语法: - - ```SQL - ROLLBACK [ WORK ] - ``` - -## 示例 - -1. 在存算分离集群中创建演示表 `desT`,并将数据加载到其中。 - - :::note - 如果您想在存算一体集群中尝试此示例,则必须跳过步骤 3,并且在步骤 4 中仅定义一个 INSERT 语句。 - ::: - - ```SQL - CREATE TABLE desT ( - k int, - v int - ) PRIMARY KEY(k); - - INSERT INTO desT VALUES - (1,1), - (2,2), - (3,3); - ``` - -2. 启动一个事务。 - - ```SQL - START TRANSACTION; - ``` - - 或者 - - ```SQL - BEGIN WORK; - ``` - -3. 定义一个 UPDATE 或 DELETE 语句。 - - ```SQL - UPDATE desT SET v = v + 1 WHERE k = 1, - ``` - - 或者 - - ```SQL - DELETE FROM desT WHERE k = 1; - ``` - -4. 定义多个 INSERT 语句。 - - ```SQL - -- 插入具有指定值的数据。 - INSERT INTO desT VALUES (4,4); - -- 将数据从内表插入到另一个内表。 - INSERT INTO desT SELECT * FROM srcT; - -- 从远端存储插入数据。 - INSERT INTO desT - SELECT * FROM FILES( - "path" = "s3://inserttest/parquet/srcT.parquet", - "format" = "parquet", - "aws.s3.access_key" = "XXXXXXXXXX", - "aws.s3.secret_key" = "YYYYYYYYYY", - "aws.s3.region" = "us-west-2" - ); - ``` - -5. 应用或撤消事务。 - - - 要应用事务中的 SQL 语句。 - - ```SQL - COMMIT WORK; - ``` - - - 要撤消事务中的 SQL 语句。 - - ```SQL - ROLLBACK WORK; - ``` - -## 使用说明 - -- 目前,StarRocks 在 SQL 事务中支持 SELECT、INSERT、UPDATE 和 DELETE 语句。 从 v4.0 开始,只有在存算分离集群中才支持 UPDATE 和 DELETE。 -- 不允许对在同一事务中数据已更改的表执行 SELECT 语句。 -- 从 v4.0 开始,只有在存算分离集群中才支持在事务中对同一表执行多个 INSERT 语句。 -- 在一个事务中,您只能对每个表定义一个 UPDATE 或 DELETE 语句,并且它必须位于 INSERT 语句之前。 -- 后续 DML 语句无法读取同一事务中前面的语句带来的未提交更改。 例如,前面的 INSERT 语句的目标表不能是后续语句的源表。 否则,系统将返回错误。 -- 事务中 DML 语句的所有目标表必须位于同一数据库中。 不允许跨数据库操作。 -- 目前,不支持 INSERT OVERWRITE。 -- 不允许嵌套事务。 您不能在 BEGIN-COMMIT/ROLLBACK 对中指定 BEGIN WORK。 -- 如果正在进行的事务所属的会话终止或关闭,则该事务将自动回滚。 -- 如上所述,StarRock 仅支持事务隔离级别的有限 READ COMMITTED。 -- 不支持写入冲突检查。 当两个事务同时写入同一表时,两个事务都可以成功提交。 数据更改的可见性(顺序)取决于 COMMIT WORK 语句的执行顺序。 \ No newline at end of file diff --git a/docs/zh/loading/Spark-connector-starrocks.md b/docs/zh/loading/Spark-connector-starrocks.md deleted file mode 100644 index f2b66d6..0000000 --- a/docs/zh/loading/Spark-connector-starrocks.md +++ /dev/null @@ -1,682 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 使用 Spark Connector 导入数据 (推荐) - -StarRocks 提供了一个自主开发的 Connector,名为 StarRocks Connector for Apache Spark™ (简称 Spark Connector),以帮助您使用 Spark 将数据导入到 StarRocks 表中。其基本原理是先积累数据,然后通过 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 一次性将所有数据导入到 StarRocks 中。Spark Connector 基于 Spark DataSource V2 实现。DataSource 可以使用 Spark DataFrames 或 Spark SQL 创建。同时支持批量和结构化流模式。 - -> **注意** -> -> 只有对 StarRocks 表具有 SELECT 和 INSERT 权限的用户才能将数据导入到该表。您可以按照 [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) 中提供的说明,将这些权限授予用户。 - -## 版本要求 - -| Spark connector | Spark | StarRocks | Java | Scala | -| --------------- | ---------------- | ------------- | ---- | ----- | -| 1.1.2 | 3.2, 3.3, 3.4, 3.5 | 2.5 及更高版本 | 8 | 2.12 | -| 1.1.1 | 3.2, 3.3, or 3.4 | 2.5 及更高版本 | 
8 | 2.12 | -| 1.1.0 | 3.2, 3.3, or 3.4 | 2.5 及更高版本 | 8 | 2.12 | - -> **注意** -> -> - 请参阅 [升级 Spark connector](#upgrade-spark-connector) 以了解不同版本的 Spark Connector 之间的行为变更。 -> - 从 1.1.1 版本开始,Spark Connector 不提供 MySQL JDBC 驱动程序,您需要手动将驱动程序导入到 Spark classpath 中。您可以在 [MySQL site](https://dev.mysql.com/downloads/connector/j/) 或 [Maven Central](https://repo1.maven.org/maven2/mysql/mysql-connector-java/) 上找到该驱动程序。 - -## 获取 Spark connector - -您可以通过以下方式获取 Spark Connector JAR 文件: - -- 直接下载已编译的 Spark Connector JAR 文件。 -- 在您的 Maven 项目中添加 Spark Connector 作为依赖项,然后下载 JAR 文件。 -- 自行将 Spark Connector 的源代码编译为 JAR 文件。 - -Spark Connector JAR 文件的命名格式为 `starrocks-spark-connector-${spark_version}_${scala_version}-${connector_version}.jar`。 - -例如,如果您在您的环境中安装了 Spark 3.2 和 Scala 2.12,并且您想使用 Spark Connector 1.1.0,您可以使用 `starrocks-spark-connector-3.2_2.12-1.1.0.jar`。 - -> **注意** -> -> 通常,最新版本的 Spark Connector 仅保持与最近三个版本的 Spark 的兼容性。 - -### 下载已编译的 Jar 文件 - -直接从 [Maven Central Repository](https://repo1.maven.org/maven2/com/starrocks) 下载相应版本的 Spark Connector JAR。 - -### Maven 依赖 - -1. 在您的 Maven 项目的 `pom.xml` 文件中,按照以下格式添加 Spark Connector 作为依赖项。将 `spark_version`、`scala_version` 和 `connector_version` 替换为相应的版本。 - - ```xml - - com.starrocks - starrocks-spark-connector-${spark_version}_${scala_version} - ${connector_version} - - ``` - -2. 例如,如果您的环境中的 Spark 版本是 3.2,Scala 版本是 2.12,并且您选择 Spark Connector 1.1.0,您需要添加以下依赖项: - - ```xml - - com.starrocks - starrocks-spark-connector-3.2_2.12 - 1.1.0 - - ``` - -### 自行编译 - -1. 下载 [Spark connector package](https://github.com/StarRocks/starrocks-connector-for-apache-spark)。 -2. 执行以下命令将 Spark Connector 的源代码编译为 JAR 文件。请注意,`spark_version` 替换为相应的 Spark 版本。 - - ```bash - sh build.sh - ``` - - 例如,如果您的环境中的 Spark 版本是 3.2,您需要执行以下命令: - - ```bash - sh build.sh 3.2 - ``` - -3. 转到 `target/` 目录以查找 Spark Connector JAR 文件,例如编译后生成的 `starrocks-spark-connector-3.2_2.12-1.1.0-SNAPSHOT.jar`。 - -> **注意** -> -> 非正式发布的 Spark Connector 的名称包含 `SNAPSHOT` 后缀。 - -## 参数 - -### starrocks.fe.http.url - -**是否必须**: 是
-**默认值**: 无
-**描述**: StarRocks 集群中 FE 的 HTTP URL。您可以指定多个 URL,这些 URL 必须用逗号 (,) 分隔。格式:`<fe_host1>:<fe_http_port1>,<fe_host2>:<fe_http_port2>`。从 1.1.1 版本开始,您还可以向 URL 添加 `http://` 前缀,例如 `http://<fe_host1>:<fe_http_port1>,http://<fe_host2>:<fe_http_port2>`。 - -### starrocks.fe.jdbc.url - -**是否必须**: 是
-**默认值**: 无
-**描述**: 用于连接到 FE 的 MySQL 服务器的地址。格式:`jdbc:mysql://<fe_host>:<fe_query_port>`。 - -### starrocks.table.identifier - -**是否必须**: 是
-**默认值**: 无
-**描述**: StarRocks 表的名称。格式:`<database_name>.<table_name>`。 - -### starrocks.user - -**是否必须**: 是
-**默认值**: 无
-**描述**: StarRocks 集群帐户的用户名。该用户需要对 StarRocks 表具有 [SELECT 和 INSERT 权限](../sql-reference/sql-statements/account-management/GRANT.md)。 - -### starrocks.password - -**是否必须**: 是
-**默认值**: 无
-**描述**: StarRocks 集群帐户的密码。 - -### starrocks.write.label.prefix - -**是否必须**: 否
-**默认值**: spark-
-**描述**: Stream Load 使用的 Label 前缀。 - -### starrocks.write.enable.transaction-stream-load - -**是否必须**: 否
-**默认值**: TRUE
-**描述**: 是否使用 [Stream Load 事务接口](../loading/Stream_Load_transaction_interface.md) 加载数据。它需要 StarRocks v2.5 或更高版本。此功能可以在事务中加载更多数据,同时减少内存使用,并提高性能。
**注意:** 从 1.1.1 开始,此参数仅当 `starrocks.write.max.retries` 的值为非正数时才生效,因为 Stream Load 事务接口不支持重试。 - -### starrocks.write.buffer.size - -**是否必须**: 否
-**默认值**: 104857600
-**描述**: 在一次发送到 StarRocks 之前可以在内存中累积的最大数据量。将此参数设置为较大的值可以提高加载性能,但可能会增加加载延迟。 - -### starrocks.write.buffer.rows - -**是否必须**: 否
-**默认值**: Integer.MAX_VALUE
-**描述**: 自 1.1.1 版本起支持。在一次发送到 StarRocks 之前可以在内存中累积的最大行数。 - -### starrocks.write.flush.interval.ms - -**是否必须**: 否
-**默认值**: 300000
-**描述**: 将数据发送到 StarRocks 的间隔。此参数用于控制加载延迟。 - -### starrocks.write.max.retries - -**是否必须**: 否
-**默认值**: 3
-**描述**: 自 1.1.1 版本起支持。如果加载失败,Connector 重试对同一批数据执行 Stream Load 的次数。
**注意:** 由于 Stream Load 事务接口不支持重试,如果此参数为正数,Connector 将始终使用 Stream Load 接口,并忽略 `starrocks.write.enable.transaction-stream-load` 的值。 - -### starrocks.write.retry.interval.ms - -**是否必须**: 否
-**默认值**: 10000
-**描述**: 自 1.1.1 版本起支持。如果加载失败,重试对同一批数据执行 Stream Load 的间隔。 - -### starrocks.columns - -**是否必须**: 否
-**默认值**: 无
-**描述**: 您要将数据加载到的 StarRocks 表列。您可以指定多个列,这些列必须用逗号 (,) 分隔,例如 `"col0,col1,col2"`。 - -### starrocks.column.types - -**是否必须**: 否
-**默认值**: 无
-**描述**: 自 1.1.1 版本起支持。自定义 Spark 的列数据类型,而不是使用从 StarRocks 表和 [默认映射](#data-type-mapping-between-spark-and-starrocks) 推断的默认值。参数值是 DDL 格式的 Schema,与 Spark [StructType#toDDL](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/StructType.scala#L449) 的输出相同,例如 `col0 INT, col1 STRING, col2 BIGINT`。请注意,您只需要指定需要自定义的列。一个用例是将数据加载到 [BITMAP](#load-data-into-columns-of-bitmap-type) 或 [HLL](#load-data-into-columns-of-hll-type) 类型的列中。 - -### starrocks.write.properties.* - -**是否必须**: 否
-**默认值**: 无
-**描述**: 用于控制 Stream Load 行为的参数。例如,参数 `starrocks.write.properties.format` 指定要加载的数据的格式,例如 CSV 或 JSON。有关支持的参数及其描述的列表,请参阅 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -### starrocks.write.properties.format - -**是否必须**: 否
-**默认值**: CSV
-**描述**: Spark Connector 在将每批数据发送到 StarRocks 之前,用于转换数据的文件格式。有效值:CSV 和 JSON。 - -### starrocks.write.properties.row_delimiter - -**是否必须**: 否
-**默认值**: \n
-**描述**: CSV 格式数据的行分隔符。 - -### starrocks.write.properties.column_separator - -**是否必须**: 否
-**默认值**: \t
-**描述**: CSV 格式数据的列分隔符。 - -### starrocks.write.properties.partial_update - -**是否必须**: 否
-**默认值**: `FALSE`
-**描述**: 是否使用部分更新。有效值:`TRUE` 和 `FALSE`。默认值:`FALSE`,表示禁用此功能。 - -### starrocks.write.properties.partial_update_mode - -**是否必须**: 否
-**默认值**: `row`
-**描述**: 指定部分更新的模式。有效值:`row` 和 `column`。
  • 值 `row` (默认) 表示行模式下的部分更新,更适合于具有许多列和小批量的实时更新。
  • 值 `column` 表示列模式下的部分更新,更适合于具有少量列和许多行的批量更新。在这种情况下,启用列模式可以提供更快的更新速度。例如,在一个具有 100 列的表中,如果仅更新所有行的 10 列(总数的 10%),则列模式的更新速度快 10 倍。
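作为参考,下面给出一个最小化的 Spark SQL 配置示意,展示如何在开启部分更新的同时将该参数设置为 `column`(列模式)。示例沿用本文后面“部分更新”小节中的表 `test.score_board` 以及占位用的 FE 地址和账号,实际使用时请根据您的环境修改;这只是基于上述参数说明的一种示意写法,并非唯一配置方式。

```SQL
-- 仅为示意:在部分更新 (partial_update) 的基础上,通过 partial_update_mode 指定列模式。
-- 其中的地址、账号和表名均为占位值,请根据实际环境修改。
CREATE TABLE `score_board`
USING starrocks
OPTIONS(
   "starrocks.fe.http.url" = "127.0.0.1:8030",
   "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
   "starrocks.table.identifier" = "test.score_board",
   "starrocks.user" = "root",
   "starrocks.password" = "",
   "starrocks.write.properties.partial_update" = "true",
   "starrocks.write.properties.partial_update_mode" = "column",
   "starrocks.columns" = "id,name"
);
```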
- -### starrocks.write.num.partitions - -**是否必须**: 否
-**默认值**: 无
-**描述**: Spark 可以并行写入数据的分区数。当数据量较小时,您可以减少分区数以降低加载并发和频率。此参数的默认值由 Spark 确定。但请注意,重新分区可能会带来 Spark Shuffle 开销。 - -### starrocks.write.partition.columns - -**是否必须**: 否
-**默认值**: 无
-**描述**: Spark 中的分区列。该参数仅当指定了 `starrocks.write.num.partitions` 时才生效。如果未指定此参数,则所有正在写入的列都用于分区。 - -### starrocks.timezone - -**是否必须**: 否
-**默认值**: JVM 的默认时区
-**描述**: 自 1.1.1 起支持。用于将 Spark `TimestampType` 转换为 StarRocks `DATETIME` 的时区。默认值是由 `ZoneId#systemDefault()` 返回的 JVM 时区。格式可以是时区名称,例如 `Asia/Shanghai`,或时区偏移量,例如 `+08:00`。 - -## Spark 和 StarRocks 之间的数据类型映射 - -- 默认数据类型映射如下: - - | Spark 数据类型 | StarRocks 数据类型 | - | --------------- | ------------------------------------------------------------ | - | BooleanType | BOOLEAN | - | ByteType | TINYINT | - | ShortType | SMALLINT | - | IntegerType | INT | - | LongType | BIGINT | - | StringType | LARGEINT | - | FloatType | FLOAT | - | DoubleType | DOUBLE | - | DecimalType | DECIMAL | - | StringType | CHAR | - | StringType | VARCHAR | - | StringType | STRING | - | StringType | JSON | - | DateType | DATE | - | TimestampType | DATETIME | - | ArrayType | ARRAY
**注意:**
**自 1.1.1 版本起支持**。有关详细步骤,请参阅 [将数据加载到 ARRAY 类型的列中](#load-data-into-columns-of-array-type)。 | - -- 您还可以自定义数据类型映射。 - - 例如,StarRocks 表包含 BITMAP 和 HLL 列,但 Spark 不支持这两种数据类型。您需要在 Spark 中自定义相应的数据类型。有关详细步骤,请参阅将数据加载到 [BITMAP](#load-data-into-columns-of-bitmap-type) 和 [HLL](#load-data-into-columns-of-hll-type) 列中。**自 1.1.1 版本起支持 BITMAP 和 HLL**。 - -## 升级 Spark connector - -### 从 1.1.0 版本升级到 1.1.1 版本 - -- 从 1.1.1 版本开始,Spark Connector 不提供 `mysql-connector-java`,它是 MySQL 的官方 JDBC 驱动程序,因为 `mysql-connector-java` 使用的 GPL 许可存在限制。 - 但是,Spark Connector 仍然需要 MySQL JDBC 驱动程序来连接到 StarRocks 以获取表元数据,因此您需要手动将驱动程序添加到 Spark classpath 中。您可以在 [MySQL site](https://dev.mysql.com/downloads/connector/j/) 或 [Maven Central](https://repo1.maven.org/maven2/mysql/mysql-connector-java/) 上找到该驱动程序。 -- 从 1.1.1 版本开始,Connector 默认使用 Stream Load 接口,而不是 1.1.0 版本中的 Stream Load 事务接口。如果您仍然想使用 Stream Load 事务接口,您可以 - 将选项 `starrocks.write.max.retries` 设置为 `0`。有关详细信息,请参阅 `starrocks.write.enable.transaction-stream-load` 和 `starrocks.write.max.retries` 的描述。 - -## 示例 - -以下示例展示了如何使用 Spark Connector 通过 Spark DataFrames 或 Spark SQL 将数据加载到 StarRocks 表中。Spark DataFrames 支持批量和结构化流模式。 - -有关更多示例,请参阅 [Spark Connector Examples](https://github.com/StarRocks/starrocks-connector-for-apache-spark/tree/main/src/test/java/com/starrocks/connector/spark/examples)。 - -### 准备工作 - -#### 创建 StarRocks 表 - -创建一个数据库 `test` 并创建一个主键表 `score_board`。 - -```sql -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### 网络配置 - -确保 Spark 所在的机器可以通过 [`http_port`](../administration/management/FE_configuration.md#http_port) (默认值:`8030`) 和 [`query_port`](../administration/management/FE_configuration.md#query_port) (默认值:`9030`) 访问 StarRocks 集群的 FE 节点,并通过 [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (默认值:`8040`) 访问 BE 节点。 - -#### 设置您的 Spark 环境 - -请注意,以下示例在 Spark 3.2.4 中运行,并使用 `spark-shell`、`pyspark` 和 `spark-sql`。在运行示例之前,请确保将 Spark Connector JAR 文件放在 `$SPARK_HOME/jars` 目录中。 - -### 使用 Spark DataFrames 加载数据 - -以下两个示例说明了如何使用 Spark DataFrames 批量或结构化流模式加载数据。 - -#### 批量 - -在内存中构造数据并将数据加载到 StarRocks 表中。 - -1. 您可以使用 Scala 或 Python 编写 Spark 应用程序。 - - 对于 Scala,在 `spark-shell` 中运行以下代码片段: - - ```Scala - // 1. 从序列创建 DataFrame。 - val data = Seq((1, "starrocks", 100), (2, "spark", 100)) - val df = data.toDF("id", "name", "score") - - // 2. 通过将格式配置为 "starrocks" 和以下选项来写入 StarRocks。 - // 您需要根据自己的环境修改选项。 - df.write.format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.score_board") - .option("starrocks.user", "root") - .option("starrocks.password", "") - .mode("append") - .save() - ``` - - 对于 Python,在 `pyspark` 中运行以下代码片段: - - ```python - from pyspark.sql import SparkSession - - spark = SparkSession \ - .builder \ - .appName("StarRocks Example") \ - .getOrCreate() - - # 1. 从序列创建 DataFrame。 - data = [(1, "starrocks", 100), (2, "spark", 100)] - df = spark.sparkContext.parallelize(data) \ - .toDF(["id", "name", "score"]) - - # 2. 
通过将格式配置为 "starrocks" 和以下选项来写入 StarRocks。 - # 您需要根据自己的环境修改选项。 - df.write.format("starrocks") \ - .option("starrocks.fe.http.url", "127.0.0.1:8030") \ - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") \ - .option("starrocks.table.identifier", "test.score_board") \ - .option("starrocks.user", "root") \ - .option("starrocks.password", "") \ - .mode("append") \ - .save() - ``` - -2. 在 StarRocks 表中查询数据。 - - ```sql - MySQL [test]> SELECT * FROM `score_board`; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | spark | 100 | - +------+-----------+-------+ - 2 rows in set (0.00 sec) - ``` - -#### 结构化流 - -构造从 CSV 文件读取数据的流,并将数据加载到 StarRocks 表中。 - -1. 在目录 `csv-data` 中,创建一个包含以下数据的 CSV 文件 `test.csv`: - - ```csv - 3,starrocks,100 - 4,spark,100 - ``` - -2. 您可以使用 Scala 或 Python 编写 Spark 应用程序。 - - 对于 Scala,在 `spark-shell` 中运行以下代码片段: - - ```Scala - import org.apache.spark.sql.types.StructType - - // 1. 从 CSV 创建 DataFrame。 - val schema = (new StructType() - .add("id", "integer") - .add("name", "string") - .add("score", "integer") - ) - val df = (spark.readStream - .option("sep", ",") - .schema(schema) - .format("csv") - // 将其替换为您的目录 "csv-data" 的路径。 - .load("/path/to/csv-data") - ) - - // 2. 通过将格式配置为 "starrocks" 和以下选项来写入 StarRocks。 - // 您需要根据自己的环境修改选项。 - val query = (df.writeStream.format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.score_board") - .option("starrocks.user", "root") - .option("starrocks.password", "") - // 将其替换为您的检查点目录 - .option("checkpointLocation", "/path/to/checkpoint") - .outputMode("append") - .start() - ) - ``` - - 对于 Python,在 `pyspark` 中运行以下代码片段: - - ```python - from pyspark.sql import SparkSession - from pyspark.sql.types import IntegerType, StringType, StructType, StructField - - spark = SparkSession \ - .builder \ - .appName("StarRocks SS Example") \ - .getOrCreate() - - # 1. 从 CSV 创建 DataFrame。 - schema = StructType([ - StructField("id", IntegerType()), - StructField("name", StringType()), - StructField("score", IntegerType()) - ]) - df = ( - spark.readStream - .option("sep", ",") - .schema(schema) - .format("csv") - # 将其替换为您的目录 "csv-data" 的路径。 - .load("/path/to/csv-data") - ) - - # 2. 通过将格式配置为 "starrocks" 和以下选项来写入 StarRocks。 - # 您需要根据自己的环境修改选项。 - query = ( - df.writeStream.format("starrocks") - .option("starrocks.fe.http.url", "127.0.0.1:8030") - .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030") - .option("starrocks.table.identifier", "test.score_board") - .option("starrocks.user", "root") - .option("starrocks.password", "") - # 将其替换为您的检查点目录 - .option("checkpointLocation", "/path/to/checkpoint") - .outputMode("append") - .start() - ) - ``` - -3. 在 StarRocks 表中查询数据。 - - ```SQL - MySQL [test]> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 4 | spark | 100 | - | 3 | starrocks | 100 | - +------+-----------+-------+ - 2 rows in set (0.67 sec) - ``` - -### 使用 Spark SQL 加载数据 - -以下示例说明了如何通过使用 [Spark SQL CLI](https://spark.apache.org/docs/latest/sql-distributed-sql-engine-spark-sql-cli.html) 中的 `INSERT INTO` 语句使用 Spark SQL 加载数据。 - -1. 在 `spark-sql` 中执行以下 SQL 语句: - - ```SQL - -- 1. 
通过将数据源配置为 `starrocks` 和以下选项来创建表。 - -- 您需要根据自己的环境修改选项。 - CREATE TABLE `score_board` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.score_board", - "starrocks.user"="root", - "starrocks.password"="" - ); - - -- 2. 将两行插入到表中。 - INSERT INTO `score_board` VALUES (5, "starrocks", 100), (6, "spark", 100); - ``` - -2. 在 StarRocks 表中查询数据。 - - ```SQL - MySQL [test]> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 6 | spark | 100 | - | 5 | starrocks | 100 | - +------+-----------+-------+ - 2 rows in set (0.00 sec) - ``` - -## 最佳实践 - -### 导入数据到主键表 - -本节将展示如何将数据导入到 StarRocks 主键表以实现部分更新和条件更新。 -您可以参阅 [通过导入更改数据](../loading/Load_to_Primary_Key_tables.md) 以获取这些功能的详细介绍。 -这些示例使用 Spark SQL。 - -#### 准备工作 - -在 StarRocks 中创建一个数据库 `test` 并创建一个主键表 `score_board`。 - -```SQL -CREATE DATABASE `test`; - -CREATE TABLE `test`.`score_board` -( - `id` int(11) NOT NULL COMMENT "", - `name` varchar(65533) NULL DEFAULT "" COMMENT "", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -COMMENT "OLAP" -DISTRIBUTED BY HASH(`id`); -``` - -#### 部分更新 - -本示例将展示如何仅通过导入更新列 `name` 中的数据: - -1. 在 MySQL 客户端中将初始数据插入到 StarRocks 表中。 - - ```sql - mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'spark', 100); - - mysql> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | spark | 100 | - +------+-----------+-------+ - 2 rows in set (0.02 sec) - ``` - -2. 在 Spark SQL 客户端中创建一个 Spark 表 `score_board`。 - - - 将选项 `starrocks.write.properties.partial_update` 设置为 `true`,这会告诉 Connector 执行部分更新。 - - 将选项 `starrocks.columns` 设置为 `"id,name"`,以告诉 Connector 要写入哪些列。 - - ```SQL - CREATE TABLE `score_board` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.score_board", - "starrocks.user"="root", - "starrocks.password"="", - "starrocks.write.properties.partial_update"="true", - "starrocks.columns"="id,name" - ); - ``` - -3. 在 Spark SQL 客户端中将数据插入到表中,并且仅更新列 `name`。 - - ```SQL - INSERT INTO `score_board` VALUES (1, 'starrocks-update'), (2, 'spark-update'); - ``` - -4. 在 MySQL 客户端中查询 StarRocks 表。 - - 您可以看到只有 `name` 的值发生了变化,而 `score` 的值没有发生变化。 - - ```SQL - mysql> select * from score_board; - +------+------------------+-------+ - | id | name | score | - +------+------------------+-------+ - | 1 | starrocks-update | 100 | - | 2 | spark-update | 100 | - +------+------------------+-------+ - 2 rows in set (0.02 sec) - ``` - -#### 条件更新 - -本示例将展示如何根据列 `score` 的值执行条件更新。仅当 `score` 的新值大于或等于旧值时,对 `id` 的更新才会生效。 - -1. 在 MySQL 客户端中将初始数据插入到 StarRocks 表中。 - - ```SQL - mysql> INSERT INTO `score_board` VALUES (1, 'starrocks', 100), (2, 'spark', 100); - - mysql> select * from score_board; - +------+-----------+-------+ - | id | name | score | - +------+-----------+-------+ - | 1 | starrocks | 100 | - | 2 | spark | 100 | - +------+-----------+-------+ - 2 rows in set (0.02 sec) - ``` - -2. 
通过以下方式创建一个 Spark 表 `score_board`。 - - - 将选项 `starrocks.write.properties.merge_condition` 设置为 `score`,这会告诉 Connector 使用列 `score` 作为条件。 - - 确保 Spark Connector 使用 Stream Load 接口加载数据,而不是 Stream Load 事务接口,因为后者不支持此功能。 - - ```SQL - CREATE TABLE `score_board` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table.identifier"="test.score_board", - "starrocks.user"="root", - "starrocks.password"="", - "starrocks.write.properties.merge_condition"="score" - ); - ``` - -3. 将数据插入到 Spark SQL 客户端中的表中,并使用较小的 Score 值更新 `id` 为 1 的行,并使用较大的 Score 值更新 `id` 为 2 的行。 - - ```SQL - INSERT INTO `score_board` VALUES (1, 'starrocks-update', 99), (2, 'spark-update', 101); - ``` - -4. 在 MySQL 客户端中查询 StarRocks 表。 - - 您可以看到只有 `id` 为 2 的行发生了变化,而 `id` 为 1 的行没有发生变化。 - - ```SQL - mysql> select * from score_board; - +------+--------------+-------+ - | id | name | score | - +------+--------------+-------+ - | 1 | starrocks | 100 | - | 2 | spark-update | 101 | - +------+--------------+-------+ - 2 rows in set (0.03 sec) - ``` - -### 将数据加载到 BITMAP 类型的列中 - -[`BITMAP`](../sql-reference/data-types/other-data-types/BITMAP.md) 通常用于加速 Count Distinct,例如计算 UV,请参阅 [使用 Bitmap 进行精确的 Count Distinct](../using_starrocks/distinct_values/Using_bitmap.md)。 -这里我们以 UV 的计数为例,展示如何将数据加载到 `BITMAP` 类型的列中。**自 1.1.1 版本起支持 `BITMAP`**。 - -1. 创建一个 StarRocks 聚合表。 - - 在数据库 `test` 中,创建一个聚合表 `page_uv`,其中列 `visit_users` 定义为 `BITMAP` 类型,并配置了聚合函数 `BITMAP_UNION`。 - - ```SQL - CREATE TABLE `test`.`page_uv` ( - `page_id` INT NOT NULL COMMENT '页面 ID', - `visit_date` datetime NOT NULL COMMENT '访问时间', - `visit_users` BITMAP BITMAP_UNION NOT NULL COMMENT '用户 ID' - ) ENGINE=OLAP - AGGREGATE KEY(`page_id`, `visit_date`) - DISTRIBUTED BY HASH(`page_id`); - ``` - -2. 创建一个 Spark 表。 - - Spark 表的 Schema 是从 StarRocks 表推断出来的,并且 Spark 不支持 `BITMAP` 类型。因此,您需要在 Spark 中自定义相应的列数据类型,例如作为 `BIGINT`,通过配置选项 `"starrocks.column.types"="visit_users BIGINT"`。当使用 Stream Load 摄取数据时,Connector 使用 [`to_bitmap`](../sql-reference/sql-functions/bitmap-functions/to_bitmap.md) 函数将 `BIGINT` 类型的数据转换为 `BITMAP` 类型。 - - 在 `spark-sql` 中运行以下 DDL: - - ```SQL - CREATE TABLE `page_uv` - USING starrocks - OPTIONS( - "starrocks.fe.http.url"="127.0.0.1:8030", - "starrocks.fe.jdbc.url"="jdbc:mysql://127.0.0.1:9030", - "starrocks.table \ No newline at end of file diff --git a/docs/zh/loading/SparkLoad.md b/docs/zh/loading/SparkLoad.md deleted file mode 100644 index ca661d1..0000000 --- a/docs/zh/loading/SparkLoad.md +++ /dev/null @@ -1,537 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 使用 Spark Load 批量导入数据 - -Spark Load 使用外部 Apache Spark™ 资源来预处理导入的数据,从而提高导入性能并节省计算资源。它主要用于 **初始迁移** 和将 **大量数据导入** 到 StarRocks 中(数据量高达 TB 级别)。 - -Spark Load 是一种**异步**导入方法,用户需要通过 MySQL 协议创建 Spark 类型的导入作业,并使用 `SHOW LOAD` 查看导入结果。 - -> **注意** -> -> - 只有对 StarRocks 表具有 INSERT 权限的用户才能将数据导入到该表中。您可以按照 [GRANT](../sql-reference/sql-statements/account-management/GRANT.md) 中提供的说明授予所需的权限。 -> - Spark Load 不能用于将数据导入到主键表。 - -## 术语解释 - -- **Spark ETL**:主要负责导入过程中的数据 ETL,包括全局字典构建(BITMAP 类型)、分区、排序、聚合等。 -- **Broker**:Broker 是一个独立的无状态进程。它封装了文件系统接口,并为 StarRocks 提供了从远端存储系统读取文件的能力。 -- **Global Dictionary**:保存将数据从原始值映射到编码值的数据结构。原始值可以是任何数据类型,而编码值是整数。全局字典主要用于预计算精确去重的场景。 - -## 原理 - -用户通过 MySQL 客户端提交 Spark 类型的导入作业;FE 记录元数据并返回提交结果。 - -spark load 任务的执行分为以下几个主要阶段。 - -1. 用户将 spark load 作业提交给 FE。 -2. FE 调度提交 ETL 任务到 Apache Spark™ 集群执行。 -3. Apache Spark™ 集群执行 ETL 任务,包括全局字典构建(BITMAP 类型)、分区、排序、聚合等。 -4. ETL 任务完成后,FE 获取每个预处理切片的数据路径,并调度相关的 BE 执行 Push 任务。 -5. 
BE 通过 Broker 进程从 HDFS 读取数据,并将其转换为 StarRocks 存储格式。 - > 如果您选择不使用 Broker 进程,BE 将直接从 HDFS 读取数据。 -6. FE 调度生效版本并完成导入作业。 - -下图说明了 spark load 的主要流程。 - -![Spark load](../_assets/4.3.2-1.png) - ---- - -## Global Dictionary - -### 适用场景 - -目前,StarRocks 中的 BITMAP 列是使用 Roaringbitmap 实现的,它只接受整数作为输入数据类型。因此,如果您想在导入过程中为 BITMAP 列实现预计算,则需要将输入数据类型转换为整数。 - -在 StarRocks 现有的导入过程中,全局字典的数据结构是基于 Hive 表实现的,它保存了从原始值到编码值的映射。 - -### 构建过程 - -1. 从上游数据源读取数据,并生成一个临时 Hive 表,命名为 `hive-table`。 -2. 提取 `hive-table` 的非强调字段的值,以生成一个新的 Hive 表,命名为 `distinct-value-table`。 -3. 创建一个新的全局字典表,命名为 `dict-table`,其中一列用于原始值,一列用于编码值。 -4. 在 `distinct-value-table` 和 `dict-table` 之间进行左连接,然后使用窗口函数对该集合进行编码。最后,将去重列的原始值和编码值都写回 `dict-table`。 -5. 在 `dict-table` 和 `hive-table` 之间进行连接,以完成将 `hive-table` 中的原始值替换为整数编码值的工作。 -6. `hive-table` 将被下次数据预处理读取,然后在计算后导入到 StarRocks 中。 - -## 数据预处理 - -数据预处理的基本过程如下: - -1. 从上游数据源(HDFS 文件或 Hive 表)读取数据。 -2. 完成读取数据的字段映射和计算,然后基于分区信息生成 `bucket-id`。 -3. 基于 StarRocks 表的 Rollup 元数据生成 RollupTree。 -4. 迭代 RollupTree 并执行分层聚合操作。下一层级的 Rollup 可以从上一层级的 Rollup 计算得出。 -5. 每次完成聚合计算后,数据将根据 `bucket-id` 进行分桶,然后写入 HDFS。 -6. 后续的 Broker 进程将从 HDFS 拉取文件并将其导入到 StarRocks BE 节点。 - -## 基本操作 - -### 配置 ETL 集群 - -Apache Spark™ 在 StarRocks 中用作外部计算资源,用于 ETL 工作。可能还有其他外部资源添加到 StarRocks 中,例如用于查询的 Spark/GPU、用于外部存储的 HDFS/S3、用于 ETL 的 MapReduce 等。因此,我们引入了 `Resource Management` 来管理 StarRocks 使用的这些外部资源。 - -在提交 Apache Spark™ 导入作业之前,请配置 Apache Spark™ 集群以执行 ETL 任务。操作语法如下: - -~~~sql --- 创建 Apache Spark™ 资源 -CREATE EXTERNAL RESOURCE resource_name -PROPERTIES -( - type = spark, - spark_conf_key = spark_conf_value, - working_dir = path, - broker = broker_name, - broker.property_key = property_value -); - --- 删除 Apache Spark™ 资源 -DROP RESOURCE resource_name; - --- 显示资源 -SHOW RESOURCES -SHOW PROC "/resources"; - --- 权限 -GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identityGRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name; -REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identityREVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name; -~~~ - -- 创建资源 - -**例如**: - -~~~sql --- yarn 集群模式 -CREATE EXTERNAL RESOURCE "spark0" -PROPERTIES -( - "type" = "spark", - "spark.master" = "yarn", - "spark.submit.deployMode" = "cluster", - "spark.jars" = "xxx.jar,yyy.jar", - "spark.files" = "/tmp/aaa,/tmp/bbb", - "spark.executor.memory" = "1g", - "spark.yarn.queue" = "queue0", - "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999", - "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000", - "working_dir" = "hdfs://127.0.0.1:10000/tmp/starrocks", - "broker" = "broker0", - "broker.username" = "user0", - "broker.password" = "password0" -); - --- yarn HA 集群模式 -CREATE EXTERNAL RESOURCE "spark1" -PROPERTIES -( - "type" = "spark", - "spark.master" = "yarn", - "spark.submit.deployMode" = "cluster", - "spark.hadoop.yarn.resourcemanager.ha.enabled" = "true", - "spark.hadoop.yarn.resourcemanager.ha.rm-ids" = "rm1,rm2", - "spark.hadoop.yarn.resourcemanager.hostname.rm1" = "host1", - "spark.hadoop.yarn.resourcemanager.hostname.rm2" = "host2", - "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000", - "working_dir" = "hdfs://127.0.0.1:10000/tmp/starrocks", - "broker" = "broker1" -); -~~~ - -`resource-name` 是在 StarRocks 中配置的 Apache Spark™ 资源的名称。 - -`PROPERTIES` 包括与 Apache Spark™ 资源相关的参数,如下所示: -> **注意** -> -> 有关 Apache Spark™ 资源 PROPERTIES 的详细说明,请参见 [CREATE RESOURCE](../sql-reference/sql-statements/Resource/CREATE_RESOURCE.md)。 - -- Spark 相关参数: - - `type`:资源类型,必需,目前仅支持 `spark`。 - - `spark.master`:必需,目前仅支持 `yarn`。 - - `spark.submit.deployMode`:Apache Spark™ 程序的部署模式,必需,目前支持 
`cluster` 和 `client`。 - - `spark.hadoop.fs.defaultFS`:如果 master 是 yarn,则为必需。 - - 与 yarn 资源管理器相关的参数,必需。 - - 单个节点上的一个 ResourceManager - `spark.hadoop.yarn.resourcemanager.address`:单点资源管理器的地址。 - - ResourceManager HA - > 您可以选择指定 ResourceManager 的主机名或地址。 - - `spark.hadoop.yarn.resourcemanager.ha.enabled`:启用资源管理器 HA,设置为 `true`。 - - `spark.hadoop.yarn.resourcemanager.ha.rm-ids`:资源管理器逻辑 ID 列表。 - - `spark.hadoop.yarn.resourcemanager.hostname.rm-id`:对于每个 rm-id,指定与资源管理器对应的主机名。 - - `spark.hadoop.yarn.resourcemanager.address.rm-id`:对于每个 rm-id,指定客户端提交作业的 `host:port`。 - -- `*working_dir`:ETL 使用的目录。如果 Apache Spark™ 用作 ETL 资源,则为必需。例如:`hdfs://host:port/tmp/starrocks`。 - -- Broker 相关参数: - - `broker`:Broker 名称。如果 Apache Spark™ 用作 ETL 资源,则为必需。您需要提前使用 `ALTER SYSTEM ADD BROKER` 命令完成配置。 - - `broker.property_key`:Broker 进程读取 ETL 生成的中间文件时要指定的信息(例如,身份验证信息)。 - -**注意**: - -以上是通过 Broker 进程加载的参数说明。如果您打算在没有 Broker 进程的情况下加载数据,则应注意以下事项。 - -- 您无需指定 `broker`。 -- 如果您需要配置用户身份验证和 NameNode 节点的 HA,则需要在 HDFS 集群中的 hdfs-site.xml 文件中配置参数,有关参数的说明,请参见 [broker_properties](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md#hdfs)。并且您需要将 **hdfs-site.xml** 文件移动到每个 FE 的 **$FE_HOME/conf** 下和每个 BE 的 **$BE_HOME/conf** 下。 - -> 注意 -> -> 如果 HDFS 文件只能由特定用户访问,您仍然需要在 `broker.name` 中指定 HDFS 用户名,并在 `broker.password` 中指定用户密码。 - -- 查看资源 - -常规帐户只能查看他们具有 `USAGE-PRIV` 访问权限的资源。root 和 admin 帐户可以查看所有资源。 - -- 资源权限 - -资源权限通过 `GRANT REVOKE` 进行管理,目前仅支持 `USAGE-PRIV` 权限。您可以将 `USAGE-PRIV` 权限授予用户或角色。 - -~~~sql --- 授予 user0 访问 spark0 资源的权限 -GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%"; - --- 授予 role0 访问 spark0 资源的权限 -GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0"; - --- 授予 user0 访问所有资源的权限 -GRANT USAGE_PRIV ON RESOURCE* TO "user0"@"%"; - --- 授予 role0 访问所有资源的权限 -GRANT USAGE_PRIV ON RESOURCE* TO ROLE "role0"; - --- 撤销 user0 对 spark0 资源的使用权限 -REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%"; -~~~ - -### 配置 Spark Client - -为 FE 配置 Spark 客户端,以便后者可以通过执行 `spark-submit` 命令来提交 Spark 任务。建议使用官方版本的 Spark2 2.4.5 或更高版本 [spark download address](https://archive.apache.org/dist/spark/)。下载后,请按照以下步骤完成配置。 - -- 配置 `SPARK-HOME` - -将 Spark 客户端放置在与 FE 相同的机器上的目录中,并在 FE 配置文件中将 `spark_home_default_dir` 配置为此目录,默认情况下,该目录是 FE 根目录中的 `lib/spark2x` 路径,并且不能为空。 - -- **配置 SPARK 依赖包** - -要配置依赖包,请压缩并归档 Spark 客户端下 jars 文件夹中的所有 jar 文件,并将 FE 配置中的 `spark_resource_path` 项配置为此 zip 文件。如果此配置为空,FE 将尝试在 FE 根目录中查找 `lib/spark2x/jars/spark-2x.zip` 文件。如果 FE 无法找到它,它将报告错误。 - -提交 spark load 作业时,归档的依赖文件将上传到远程存储库。默认存储库路径位于 `working_dir/{cluster_id}` 目录下,并以 `--spark-repository--{resource-name}` 命名,这意味着集群中的一个资源对应于一个远程存储库。目录结构引用如下: - -~~~bash ----spark-repository--spark0/ - - |---archive-1.0.0/ - - | |\---lib-990325d2c0d1d5e45bf675e54e44fb16-spark-dpp-1.0.0\-jar-with-dependencies.jar - - | |\---lib-7670c29daf535efe3c9b923f778f61fc-spark-2x.zip - - |---archive-1.1.0/ - - | |\---lib-64d5696f99c379af2bee28c1c84271d5-spark-dpp-1.1.0\-jar-with-dependencies.jar - - | |\---lib-1bbb74bb6b264a270bc7fca3e964160f-spark-2x.zip - - |---archive-1.2.0/ - - | |-... 
- -~~~ - -除了 spark 依赖项(默认情况下命名为 `spark-2x.zip`)之外,FE 还会将 DPP 依赖项上传到远程存储库。如果 spark load 提交的所有依赖项已经存在于远程存储库中,则无需再次上传依赖项,从而节省了每次重复上传大量文件的时间。 - -### 配置 YARN Client - -为 FE 配置 yarn 客户端,以便 FE 可以执行 yarn 命令来获取正在运行的应用程序的状态或终止它。建议使用官方版本的 Hadoop2 2.5.2 或更高版本 ([hadoop download address](https://archive.apache.org/dist/hadoop/common/))。下载后,请按照以下步骤完成配置: - -- **配置 YARN 可执行路径** - -将下载的 yarn 客户端放置在与 FE 相同的机器上的目录中,并在 FE 配置文件中将 `yarn_client_path` 项配置为 yarn 的二进制可执行文件,默认情况下,该文件是 FE 根目录中的 `lib/yarn-client/hadoop/bin/yarn` 路径。 - -- **配置生成 YARN 所需的配置文件的路径(可选)** - -当 FE 通过 yarn 客户端获取应用程序的状态或终止应用程序时,默认情况下,StarRocks 会在 FE 根目录的 `lib/yarn-config` 路径中生成执行 yarn 命令所需的配置文件。可以通过配置 FE 配置文件中的 `yarn_config_dir` 条目来修改此路径,该条目当前包括 `core-site.xml` 和 `yarn-site.xml`。 - -### 创建导入作业 - -**语法:** - -~~~sql -LOAD LABEL load_label - (data_desc, ...) -WITH RESOURCE resource_name -[resource_properties] -[PROPERTIES (key1=value1, ... )] - -* load_label: - db_name.label_name - -* data_desc: - DATA INFILE ('file_path', ...) - [NEGATIVE] - INTO TABLE tbl_name - [PARTITION (p1, p2)] - [COLUMNS TERMINATED BY separator ] - [(col1, ...)] - [COLUMNS FROM PATH AS (col2, ...)] - [SET (k1=f1(xx), k2=f2(xx))] - [WHERE predicate] - - DATA FROM TABLE hive_external_tbl - [NEGATIVE] - INTO TABLE tbl_name - [PARTITION (p1, p2)] - [SET (k1=f1(xx), k2=f2(xx))] - [WHERE predicate] - -* resource_properties: - (key2=value2, ...) -~~~ - -**示例 1**:上游数据源为 HDFS 的情况 - -~~~sql -LOAD LABEL db1.label1 -( - DATA INFILE("hdfs://abc.com:8888/user/starrocks/test/ml/file1") - INTO TABLE tbl1 - COLUMNS TERMINATED BY "," - (tmp_c1,tmp_c2) - SET - ( - id=tmp_c2, - name=tmp_c1 - ), - DATA INFILE("hdfs://abc.com:8888/user/starrocks/test/ml/file2") - INTO TABLE tbl2 - COLUMNS TERMINATED BY "," - (col1, col2) - where col1 > 1 -) -WITH RESOURCE 'spark0' -( - "spark.executor.memory" = "2g", - "spark.shuffle.compress" = "true" -) -PROPERTIES -( - "timeout" = "3600" -); -~~~ - -**示例 2**:上游数据源为 Hive 的情况。 - -- 步骤 1:创建一个新的 Hive 资源 - -~~~sql -CREATE EXTERNAL RESOURCE hive0 -PROPERTIES -( - "type" = "hive", - "hive.metastore.uris" = "thrift://xx.xx.xx.xx:8080" -); - ~~~ - -- 步骤 2:创建一个新的 Hive 外部表 - -~~~sql -CREATE EXTERNAL TABLE hive_t1 -( - k1 INT, - K2 SMALLINT, - k3 varchar(50), - uuid varchar(100) -) -ENGINE=hive -PROPERTIES -( - "resource" = "hive0", - "database" = "tmp", - "table" = "t1" -); - ~~~ - -- 步骤 3:提交 load 命令,要求导入的 StarRocks 表中的列存在于 Hive 外部表中。 - -~~~sql -LOAD LABEL db1.label1 -( - DATA FROM TABLE hive_t1 - INTO TABLE tbl1 - SET - ( - uuid=bitmap_dict(uuid) - ) -) -WITH RESOURCE 'spark0' -( - "spark.executor.memory" = "2g", - "spark.shuffle.compress" = "true" -) -PROPERTIES -( - "timeout" = "3600" -); - ~~~ - -Spark load 中参数的介绍: - -- **Label** - -导入作业的 Label。每个导入作业都有一个 Label,该 Label 在数据库中是唯一的,遵循与 broker load 相同的规则。 - -- **数据描述类参数** - -目前,支持的数据源是 CSV 和 Hive 表。其他规则与 broker load 相同。 - -- **导入作业参数** - -导入作业参数是指属于导入语句的 `opt_properties` 部分的参数。这些参数适用于整个导入作业。规则与 broker load 相同。 - -- **Spark 资源参数** - -Spark 资源需要提前配置到 StarRocks 中,并且用户需要被授予 USAGE-PRIV 权限,然后才能将资源应用于 Spark load。 -当用户有临时需求时,可以设置 Spark 资源参数,例如为作业添加资源和修改 Spark 配置。该设置仅对该作业生效,不影响 StarRocks 集群中的现有配置。 - -~~~sql -WITH RESOURCE 'spark0' -( - "spark.driver.memory" = "1g", - "spark.executor.memory" = "3g" -) -~~~ - -- **当数据源为 Hive 时导入** - -目前,要在导入过程中使用 Hive 表,您需要创建 `Hive` 类型的外部表,然后在提交导入命令时指定其名称。 - -- **导入过程以构建全局字典** - -在 load 命令中,您可以按以下格式指定构建全局字典所需的字段:`StarRocks 字段名称=bitmap_dict(hive 表字段名称)` 请注意,目前**仅当上游数据源是 Hive 表时才支持全局字典**。 - -- **加载二进制类型数据** - -从 v2.5.17 开始,Spark Load 支持 bitmap_from_binary 函数,该函数可以将二进制数据转换为 bitmap 数据。如果 Hive 表或 HDFS 
文件的列类型为二进制,并且 StarRocks 表中对应的列是 bitmap 类型的聚合列,则可以在 load 命令中按以下格式指定字段,`StarRocks 字段名称=bitmap_from_binary(Hive 表字段名称)`。这样就不需要构建全局字典。 - -## 查看导入作业 - -Spark load 导入是异步的,broker load 也是如此。用户必须记录导入作业的 label,并在 `SHOW LOAD` 命令中使用它来查看导入结果。查看导入的命令对于所有导入方法都是通用的。示例如下。 - -有关返回参数的详细说明,请参阅 Broker Load。不同之处如下。 - -~~~sql -mysql> show load order by createtime desc limit 1\G -*************************** 1. row *************************** - JobId: 76391 - Label: label1 - State: FINISHED - Progress: ETL:100%; LOAD:100% - Type: SPARK - EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376 - TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5 - ErrorMsg: N/A - CreateTime: 2019-07-27 11:46:42 - EtlStartTime: 2019-07-27 11:46:44 - EtlFinishTime: 2019-07-27 11:49:44 - LoadStartTime: 2019-07-27 11:49:44 -LoadFinishTime: 2019-07-27 11:50:16 - URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/ - JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000} -~~~ - -- **State** - -导入作业的当前阶段。 -PENDING:已提交作业。 -ETL:已提交 Spark ETL。 -LOADING:FE 调度 BE 执行 push 操作。 -FINISHED:push 已完成,版本已生效。 - -导入作业有两个最终阶段 – `CANCELLED` 和 `FINISHED`,都表示 load 作业已完成。`CANCELLED` 表示导入失败,`FINISHED` 表示导入成功。 - -- **Progress** - -导入作业进度的描述。有两种类型的进度 – ETL 和 LOAD,它们对应于导入过程的两个阶段,ETL 和 LOADING。 - -- LOAD 的进度范围为 0~100%。 - -`LOAD 进度 = 当前已完成的所有副本导入的 tablet 数量 / 此导入作业的 tablet 总数 * 100%`。 - -- 如果所有表都已导入,则 LOAD 进度为 99%,当导入进入最终验证阶段时,将更改为 100%。 - -- 导入进度不是线性的。如果在一段时间内进度没有变化,并不意味着导入没有执行。 - -- **Type** - - 导入作业的类型。SPARK 表示 spark load。 - -- **CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime** - -这些值表示创建导入的时间、ETL 阶段开始的时间、ETL 阶段完成的时间、LOADING 阶段开始的时间以及整个导入作业完成的时间。 - -- **JobDetails** - -显示作业的详细运行状态,包括导入的文件数、总大小(以字节为单位)、子任务数、正在处理的原始行数等。例如: - -~~~json - {"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064} -~~~ - -- **URL** - -您可以将输入复制到浏览器以访问相应应用程序的 Web 界面。 - -### 查看 Apache Spark™ Launcher 提交日志 - -有时,用户需要查看在 Apache Spark™ 作业提交期间生成的详细日志。默认情况下,日志保存在 FE 根目录中的 `log/spark_launcher_log` 路径中,命名为 `spark-launcher-{load-job-id}-{label}.log`。日志在此目录中保存一段时间,并在 FE 元数据中的导入信息被清除时删除。默认保留时间为 3 天。 - -### 取消导入 - -当 Spark load 作业状态不是 `CANCELLED` 或 `FINISHED` 时,用户可以通过指定导入作业的 Label 手动取消它。 - ---- - -## 相关系统配置 - -**FE 配置:** 以下配置是 Spark load 的系统级配置,适用于所有 Spark load 导入作业。可以通过修改 `fe.conf` 来调整配置值。 - -- enable-spark-load:启用 Spark load 和资源创建,默认值为 false。 -- spark-load-default-timeout-second:作业的默认超时时间为 259200 秒(3 天)。 -- spark-home-default-dir:Spark 客户端路径 (`fe/lib/spark2x`)。 -- spark-resource-path:打包的 S park 依赖项文件的路径(默认为空)。 -- spark-launcher-log-dir:Spark 客户端的提交日志存储的目录 (`fe/log/spark-launcher-log`)。 -- yarn-client-path:yarn 二进制可执行文件的路径 (`fe/lib/yarn-client/hadoop/bin/yarn`)。 -- yarn-config-dir:Yarn 的配置文件路径 (`fe/lib/yarn-config`)。 - ---- - -## 最佳实践 - -使用 Spark load 最合适的场景是原始数据位于文件系统 (HDFS) 中,并且数据量在数十 GB 到 TB 级别。对于较小的数据量,请使用 Stream Load 或 Broker Load。 - -有关完整的 spark load 导入示例,请参阅 github 上的演示:[https://github.com/StarRocks/demo/blob/master/docs/03_sparkLoad2StarRocks.md](https://github.com/StarRocks/demo/blob/master/docs/03_sparkLoad2StarRocks.md) - -## 常见问题解答 - -- `Error: When running with master 'yarn' either HADOOP-CONF-DIR or YARN-CONF-DIR must be set in the environment.` - - 使用 Spark Load 时,未在 Spark 客户端的 `spark-env.sh` 中配置 `HADOOP-CONF-DIR` 环境变量。 - -- `Error: Cannot run program "xxx/bin/spark-submit": error=2, No such file or directory` - - 使用 Spark Load 时,`spark_home_default_dir` 配置项未指定 Spark 客户端根目录。 - -- `Error: File xxx/jars/spark-2x.zip does not exist.` - - 使用 Spark load 时,`spark-resource-path` 
配置项未指向打包的 zip 文件。 - -- `Error: yarn client does not exist in path: xxx/yarn-client/hadoop/bin/yarn` - - 使用 Spark load 时,yarn-client-path 配置项未指定 yarn 可执行文件。 - -- `ERROR: Cannot execute hadoop-yarn/bin/... /libexec/yarn-config.sh` - - 将 Hadoop 与 CDH 结合使用时,需要配置 `HADOOP_LIBEXEC_DIR` 环境变量。 - 由于 `hadoop-yarn` 和 hadoop 目录不同,因此默认的 `libexec` 目录将查找 `hadoop-yarn/bin/... /libexec`,而 `libexec` 位于 hadoop 目录中。 - ```yarn application status`` 命令获取 Spark 任务状态报告错误,导致导入作业失败。 \ No newline at end of file diff --git a/docs/zh/loading/StreamLoad.md b/docs/zh/loading/StreamLoad.md deleted file mode 100644 index 88047e4..0000000 --- a/docs/zh/loading/StreamLoad.md +++ /dev/null @@ -1,545 +0,0 @@ ---- -displayed_sidebar: docs -keywords: ['Stream Load'] ---- - -# 从本地文件系统加载数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 提供了两种从本地文件系统加载数据的方法: - -- 使用 [Stream Load](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 进行同步导入 -- 使用 [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 进行异步导入 - -每种方式都有其优点: - -- Stream Load 支持 CSV 和 JSON 文件格式。如果您想从少量文件(每个文件大小不超过 10 GB)加载数据,建议使用此方法。 -- Broker Load 支持 Parquet、ORC、CSV 和 JSON 文件格式(从 v3.2.3 版本开始支持 JSON 文件格式)。如果您想从大量文件(每个文件大小超过 10 GB)加载数据,或者文件存储在网络附加存储 (NAS) 设备中,建议使用此方法。**从 v2.5 版本开始,支持使用 Broker Load 从本地文件系统加载数据。** - -对于 CSV 数据,请注意以下几点: - -- 您可以使用 UTF-8 字符串(例如逗号 (,)、制表符或管道符 (|)),其长度不超过 50 字节作为文本分隔符。 -- 空值用 `\N` 表示。例如,一个数据文件包含三列,该数据文件中的一条记录在第一列和第三列中包含数据,但在第二列中不包含数据。在这种情况下,您需要在第二列中使用 `\N` 来表示空值。这意味着该记录必须编译为 `a,\N,b` 而不是 `a,,b`。`a,,b` 表示该记录的第二列包含一个空字符串。 - -Stream Load 和 Broker Load 都支持在数据导入时进行数据转换,并支持在数据导入期间通过 UPSERT 和 DELETE 操作进行数据更改。更多信息,请参见 [在导入时转换数据](../loading/Etl_in_loading.md) 和 [通过导入更改数据](../loading/Load_to_Primary_Key_tables.md)。 - -## 前提条件 - -### 检查权限 - - - -#### 检查网络配置 - -确保您要加载的数据所在的机器可以通过 [`http_port`](../administration/management/FE_configuration.md#http_port) (默认值:`8030`)和 [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (默认值:`8040`)访问 StarRocks 集群的 FE 和 BE 节点。 - -## 通过 Stream Load 从本地文件系统加载 - -Stream Load 是一种基于 HTTP PUT 的同步导入方法。提交导入作业后,StarRocks 会同步运行该作业,并在作业完成后返回作业结果。您可以根据作业结果判断作业是否成功。 - -> **注意** -> -> 通过 Stream Load 将数据导入到 StarRocks 表后,也会更新基于该表创建的物化视图的数据。 - -### 原理 - -您可以在客户端上根据 HTTP 协议向 FE 提交导入请求,然后 FE 使用 HTTP 重定向将导入请求转发到特定的 BE 或 CN。您也可以直接在客户端上向您选择的 BE 或 CN 提交导入请求。 - -:::note - -如果您向 FE 提交导入请求,FE 会使用轮询机制来决定哪个 BE 或 CN 将作为协调器来接收和处理导入请求。轮询机制有助于在 StarRocks 集群中实现负载均衡。因此,我们建议您将导入请求发送到 FE。 - -::: - -接收导入请求的 BE 或 CN 作为协调器 BE 或 CN 运行,以根据使用的 schema 将数据拆分为多个部分,并将每个部分的数据分配给其他涉及的 BE 或 CN。导入完成后,协调器 BE 或 CN 将导入作业的结果返回给您的客户端。请注意,如果在导入期间停止协调器 BE 或 CN,则导入作业将失败。 - -下图显示了 Stream Load 作业的工作流程。 - -![Stream Load 工作流程](../_assets/4.2-1.png) - -### 限制 - -Stream Load 不支持加载包含 JSON 格式列的 CSV 文件的数据。 - -### 典型示例 - -本节以 curl 为例,介绍如何将 CSV 或 JSON 文件中的数据从本地文件系统加载到 StarRocks 中。有关详细的语法和参数说明,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -请注意,在 StarRocks 中,一些字面量被 SQL 语言用作保留关键字。不要在 SQL 语句中直接使用这些关键字。如果您想在 SQL 语句中使用这样的关键字,请将其用一对反引号 (`) 括起来。请参见 [关键字](../sql-reference/sql-statements/keywords.md)。 - -#### 加载 CSV 数据 - -##### 准备数据集 - -在您的本地文件系统中,创建一个名为 `example1.csv` 的 CSV 文件。该文件包含三列,依次表示用户 ID、用户名和用户分数。 - -```Plain -1,Lily,23 -2,Rose,23 -3,Alice,24 -4,Julia,25 -``` - -##### 创建数据库和表 - -创建一个数据库并切换到该数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -创建一个名为 `table1` 的主键表。该表包含三列:`id`、`name` 和 `score`,其中 `id` 是主键。 - -```SQL -CREATE TABLE `table1` -( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NULL COMMENT 
"user name", - `score` int(11) NOT NULL COMMENT "user score" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY HASH(`id`); -``` - -:::note - -从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -::: - -##### 启动 Stream Load - -运行以下命令将 `example1.csv` 的数据加载到 `table1` 中: - -```Bash -curl --location-trusted -u : -H "label:123" \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: id, name, score" \ - -T example1.csv -XPUT \ - http://:/api/mydatabase/table1/_stream_load -``` - -:::note - -- 如果您使用的帐户未设置密码,则只需输入 `:`。 -- 您可以使用 [SHOW FRONTENDS](../sql-reference/sql-statements/cluster-management/nodes_processes/SHOW_FRONTENDS.md) 查看 FE 节点的 IP 地址和 HTTP 端口。 - -::: - -`example1.csv` 包含三列,这些列由逗号 (,) 分隔,并且可以按顺序映射到 `table1` 的 `id`、`name` 和 `score` 列。因此,您需要使用 `column_separator` 参数将逗号 (,) 指定为列分隔符。您还需要使用 `columns` 参数将 `example1.csv` 的三列临时命名为 `id`、`name` 和 `score`,这些列按顺序映射到 `table1` 的三列。 - -导入完成后,您可以查询 `table1` 以验证导入是否成功: - -```SQL -SELECT * FROM table1; -+------+-------+-------+ -| id | name | score | -+------+-------+-------+ -| 1 | Lily | 23 | -| 2 | Rose | 23 | -| 3 | Alice | 24 | -| 4 | Julia | 25 | -+------+-------+-------+ -4 rows in set (0.00 sec) -``` - -#### 加载 JSON 数据 - -从 v3.2.7 开始,Stream Load 支持在传输过程中压缩 JSON 数据,从而减少网络带宽开销。用户可以使用参数 `compression` 和 `Content-Encoding` 指定不同的压缩算法。支持的压缩算法包括 GZIP、BZIP2、LZ4_FRAME 和 ZSTD。有关语法,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -##### 准备数据集 - -在您的本地文件系统中,创建一个名为 `example2.json` 的 JSON 文件。该文件包含两列,依次表示城市 ID 和城市名称。 - -```JSON -{"name": "Beijing", "code": 2} -``` - -##### 创建数据库和表 - -创建一个数据库并切换到该数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -创建一个名为 `table2` 的主键表。该表包含两列:`id` 和 `city`,其中 `id` 是主键。 - -```SQL -CREATE TABLE `table2` -( - `id` int(11) NOT NULL COMMENT "city ID", - `city` varchar(65533) NULL COMMENT "city name" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY HASH(`id`); -``` - -:::note - -从 v2.5.7 开始,StarRocks 可以在您创建表或添加分区时自动设置 bucket 数量 (BUCKETS)。您不再需要手动设置 bucket 数量。有关详细信息,请参见 [设置 bucket 数量](../table_design/data_distribution/Data_distribution.md#set-the-number-of-buckets)。 - -::: - -##### 启动 Stream Load - -运行以下命令将 `example2.json` 的数据加载到 `table2` 中: - -```Bash -curl -v --location-trusted -u : -H "strict_mode: true" \ - -H "Expect:100-continue" \ - -H "format: json" -H "jsonpaths: [\"$.name\", \"$.code\"]" \ - -H "columns: city,tmp_id, id = tmp_id * 100" \ - -T example2.json -XPUT \ - http://:/api/mydatabase/table2/_stream_load -``` - -:::note - -- 如果您使用的帐户未设置密码,则只需输入 `:`。 -- 您可以使用 [SHOW FRONTENDS](../sql-reference/sql-statements/cluster-management/nodes_processes/SHOW_FRONTENDS.md) 查看 FE 节点的 IP 地址和 HTTP 端口。 - -::: - -`example2.json` 包含两个键 `name` 和 `code`,它们映射到 `table2` 的 `id` 和 `city` 列,如下图所示。 - -![JSON - 列映射](../_assets/4.2-2.png) - -上图中显示的映射描述如下: - -- StarRocks 提取 `example2.json` 的 `name` 和 `code` 键,并将它们映射到 `jsonpaths` 参数中声明的 `name` 和 `code` 字段。 - -- StarRocks 提取 `jsonpaths` 参数中声明的 `name` 和 `code` 字段,并将它们**按顺序映射**到 `columns` 参数中声明的 `city` 和 `tmp_id` 字段。 - -- StarRocks 提取 `columns` 参数中声明的 `city` 和 `tmp_id` 字段,并将它们**按名称映射**到 `table2` 的 `city` 和 `id` 列。 - -:::note - -在前面的示例中,`example2.json` 中 `code` 的值在加载到 `table2` 的 `id` 列之前乘以 100。 - -::: - -有关 `jsonpaths`、`columns` 和 StarRocks 表的列之间的详细映射,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 中的“列映射”部分。 - -导入完成后,您可以查询 `table2` 以验证导入是否成功: - 
-```SQL -SELECT * FROM table2; -+------+--------+ -| id | city | -+------+--------+ -| 200 | Beijing| -+------+--------+ -4 rows in set (0.01 sec) -``` - -import Beta from '../_assets/commonMarkdown/_beta.mdx' - -#### 合并 Stream Load 请求 - - - -从 v3.4.0 开始,系统支持合并多个 Stream Load 请求。 - -:::warning - -请注意,Merge Commit 优化适用于在单个表上具有**并发** Stream Load 作业的场景。如果并发度为 1,则不建议使用此优化。同时,在将 `merge_commit_async` 设置为 `false` 并将 `merge_commit_interval_ms` 设置为较大的值之前,请三思,因为它们可能会导致导入性能下降。 - -::: - -Merge Commit 是一种针对 Stream Load 的优化,专为高并发、小批量(从 KB 到数十 MB)实时导入场景而设计。在早期版本中,每个 Stream Load 请求都会生成一个事务和一个数据版本,这导致在高并发导入场景中出现以下问题: - -- 过多的数据版本会影响查询性能,并且限制版本数量可能会导致 `too many versions` 错误。 -- 通过 Compaction 合并数据版本会增加资源消耗。 -- 它会生成小文件,从而增加 IOPS 和 I/O 延迟。在存算分离集群中,这也会增加云对象存储成本。 -- 作为事务管理器的 Leader FE 节点可能会成为单点瓶颈。 - -Merge Commit 通过将时间窗口内的多个并发 Stream Load 请求合并到单个事务中来缓解这些问题。这减少了高并发请求生成的事务和版本数量,从而提高了导入性能。 - -Merge Commit 支持同步和异步模式。每种模式都有其优点和缺点。您可以根据您的用例进行选择。 - -- **同步模式** - - 服务器仅在合并的事务提交后才返回,从而确保导入成功且可见。 - -- **异步模式** - - 服务器在收到数据后立即返回。此模式不保证导入成功。 - -| **模式** | **优点** | **缺点** | -| -------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -| 同步 |
• 确保请求返回时数据的持久性和可见性。<br />• 保证来自同一客户端的多个顺序导入请求按顺序执行。 | 来自客户端的每个导入请求都会被阻塞，直到服务器关闭合并窗口。如果窗口过大，可能会降低单个客户端的数据处理能力。 |
-| 异步 | 允许单个客户端发送后续导入请求，而无需等待服务器关闭合并窗口，从而提高导入吞吐量。 | • 不保证返回时数据的持久性或可见性。客户端必须稍后验证事务状态。<br />• 不保证来自同一客户端的多个顺序导入请求按顺序执行。
| - -##### 启动 Stream Load - -- 运行以下命令以启动启用 Merge Commit 的 Stream Load 作业(同步模式),并将合并窗口设置为 `5000` 毫秒,并行度设置为 `2`: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: id, name, score" \ - -H "enable_merge_commit:true" \ - -H "merge_commit_interval_ms:5000" \ - -H "merge_commit_parallel:2" \ - -T example1.csv -XPUT \ - http://:/api/mydatabase/table1/_stream_load - ``` - -- 运行以下命令以启动启用 Merge Commit 的 Stream Load 作业(异步模式),并将合并窗口设置为 `60000` 毫秒,并行度设置为 `2`: - - ```Bash - curl --location-trusted -u : \ - -H "Expect:100-continue" \ - -H "column_separator:," \ - -H "columns: id, name, score" \ - -H "enable_merge_commit:true" \ - -H "merge_commit_async:true" \ - -H "merge_commit_interval_ms:60000" \ - -H "merge_commit_parallel:2" \ - -T example1.csv -XPUT \ - http://:/api/mydatabase/table1/_stream_load - ``` - -:::note - -- Merge Commit 仅支持将**同构**导入请求合并到单个数据库和表中。“同构”表示 Stream Load 参数相同,包括:通用参数、JSON 格式参数、CSV 格式参数、`opt_properties` 和 Merge Commit 参数。 -- 对于加载 CSV 格式的数据,您必须确保每行都以行分隔符结尾。不支持 `skip_header`。 -- 服务器会自动为事务生成 label。如果指定了 label,则会被忽略。 -- Merge Commit 将多个导入请求合并到单个事务中。如果一个请求包含数据质量问题,则事务中的所有请求都将失败。 - -::: - -#### 检查 Stream Load 进度 - -导入作业完成后,StarRocks 以 JSON 格式返回作业结果。有关更多信息,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md) 中的“返回值”部分。 - -Stream Load 不允许您使用 SHOW LOAD 语句查询导入作业的结果。 - -#### 取消 Stream Load 作业 - -Stream Load 不允许您取消导入作业。如果导入作业超时或遇到错误,StarRocks 会自动取消该作业。 - -### 参数配置 - -本节介绍如果您选择导入方法 Stream Load,则需要配置的一些系统参数。这些参数配置对所有 Stream Load 作业生效。 - -- `streaming_load_max_mb`:您要加载的每个数据文件的最大大小。默认最大大小为 10 GB。有关更多信息,请参见 [配置 BE 或 CN 动态参数](../administration/management/BE_configuration.md)。 - - 我们建议您一次不要加载超过 10 GB 的数据。如果数据文件的大小超过 10 GB,我们建议您将数据文件拆分为每个小于 10 GB 的小文件,然后逐个加载这些文件。如果您无法拆分大于 10 GB 的数据文件,您可以根据文件大小增加此参数的值。 - - 增加此参数的值后,只有在您重新启动 StarRocks 集群的 BE 或 CN 后,新值才能生效。此外,系统性能可能会下降,并且在导入失败时重试的成本也会增加。 - - :::note - - 当您加载 JSON 文件的数据时,请注意以下几点: - - - 文件中每个 JSON 对象的大小不能超过 4 GB。如果文件中的任何 JSON 对象超过 4 GB,StarRocks 将抛出错误“This parser can't support a document that big.”。 - - - 默认情况下,HTTP 请求中的 JSON body 不能超过 100 MB。如果 JSON body 超过 100 MB,StarRocks 将抛出错误“The size of this batch exceed the max size [104857600] of json type data data [8617627793]. Set ignore_json_size to skip check, although it may lead huge memory consuming.”。要防止此错误,您可以在 HTTP 请求头中添加 `"ignore_json_size:true"` 以忽略对 JSON body 大小的检查。 - - ::: - -- `stream_load_default_timeout_second`:每个导入作业的超时时间。默认超时时间为 600 秒。有关更多信息,请参见 [配置 FE 动态参数](../administration/management/FE_configuration.md#configure-fe-dynamic-parameters)。 - - 如果您创建的许多导入作业超时,您可以根据从以下公式获得的计算结果来增加此参数的值: - - **每个导入作业的超时时间 > 要加载的数据量/平均加载速度** - - 例如,如果您要加载的数据文件的大小为 10 GB,并且 StarRocks 集群的平均加载速度为 100 MB/s,则将超时时间设置为超过 100 秒。 - - :::note - - 上述公式中的**平均加载速度**是 StarRocks 集群的平均加载速度。它因磁盘 I/O 和 StarRocks 集群中 BE 或 CN 的数量而异。 - - ::: - - Stream Load 还提供了 `timeout` 参数,允许您指定单个导入作业的超时时间。有关更多信息,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -### 使用注意事项 - -如果要在加载的数据文件中缺少某个记录的字段,并且 StarRocks 表中映射该字段的列定义为 `NOT NULL`,则 StarRocks 会在加载该记录期间自动在 StarRocks 表的映射列中填充 `NULL` 值。您还可以使用 `ifnull()` 函数来指定要填充的默认值。 - -例如,如果上述 `example2.json` 文件中缺少表示城市 ID 的字段,并且您想在 `table2` 的映射列中填充 `x` 值,则可以指定 `"columns: city, tmp_id, id = ifnull(tmp_id, 'x')"`. 
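-
-下面给出一个仅供参考的示例（沿用上文的 `example2.json` 与 `table2`；其中 `<fe_host>`、`<fe_http_port>` 为示意用的占位符，请替换为实际的 FE IP 地址和 HTTP 端口），演示如何在 `columns` 参数中通过 `ifnull()` 为缺失字段填充默认值：
-
-```Bash
-# 示意命令：当 example2.json 中缺少 code 字段时，将映射到 table2 的 id 列填充为 'x'。
-# <fe_host>、<fe_http_port> 为占位符，需替换为实际值。
-curl -v --location-trusted -u : \
-    -H "Expect:100-continue" \
-    -H "format: json" -H "jsonpaths: [\"$.name\", \"$.code\"]" \
-    -H "columns: city, tmp_id, id = ifnull(tmp_id, 'x')" \
-    -T example2.json -XPUT \
-    http://<fe_host>:<fe_http_port>/api/mydatabase/table2/_stream_load
-```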
- -## 通过 Broker Load 从本地文件系统加载 - -除了 Stream Load 之外,您还可以使用 Broker Load 从本地文件系统加载数据。从 v2.5 版本开始支持此功能。 - -Broker Load 是一种异步导入方法。提交导入作业后,StarRocks 会异步运行该作业,并且不会立即返回作业结果。您需要手动查询作业结果。请参见 [检查 Broker Load 进度](#check-broker-load-progress)。 - -### 限制 - -- 目前,Broker Load 仅支持通过版本为 v2.5 或更高版本的单个 broker 从本地文件系统加载。 -- 针对单个 broker 的高并发查询可能会导致超时和 OOM 等问题。为了减轻影响,您可以使用 `pipeline_dop` 变量(请参见 [系统变量](../sql-reference/System_variable.md#pipeline_dop))来设置 Broker Load 的查询并行度。对于针对单个 broker 的查询,我们建议您将 `pipeline_dop` 设置为小于 `16` 的值。 - -### 典型示例 - -Broker Load 支持从单个数据文件加载到单个表,从多个数据文件加载到单个表,以及从多个数据文件加载到多个表。本节以从多个数据文件加载到单个表为例。 - -请注意,在 StarRocks 中,一些字面量被 SQL 语言用作保留关键字。不要在 SQL 语句中直接使用这些关键字。如果您想在 SQL 语句中使用这样的关键字,请将其用一对反引号 (`) 括起来。请参见 [关键字](../sql-reference/sql-statements/keywords.md)。 - -#### 准备数据集 - -以 CSV 文件格式为例。登录到您的本地文件系统,并在特定存储位置(例如,`/home/disk1/business/`)中创建两个 CSV 文件 `file1.csv` 和 `file2.csv`。这两个文件都包含三列,依次表示用户 ID、用户名和用户分数。 - -- `file1.csv` - - ```Plain - 1,Lily,21 - 2,Rose,22 - 3,Alice,23 - 4,Julia,24 - ``` - -- `file2.csv` - - ```Plain - 5,Tony,25 - 6,Adam,26 - 7,Allen,27 - 8,Jacky,28 - ``` - -#### 创建数据库和表 - -创建一个数据库并切换到该数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -创建一个名为 `mytable` 的主键表。该表包含三列:`id`、`name` 和 `score`,其中 `id` 是主键。 - -```SQL -CREATE TABLE `mytable` -( - `id` int(11) NOT NULL COMMENT "User ID", - `name` varchar(65533) NULL DEFAULT "" COMMENT "User name", - `score` int(11) NOT NULL DEFAULT "0" COMMENT "User score" -) -ENGINE=OLAP -PRIMARY KEY(`id`) -DISTRIBUTED BY HASH(`id`) -PROPERTIES("replication_num"="1"); -``` - -#### 启动 Broker Load - -运行以下命令以启动一个 Broker Load 作业,该作业将存储在本地文件系统的 `/home/disk1/business/` 路径中的所有数据文件(`file1.csv` 和 `file2.csv`)中的数据加载到 StarRocks 表 `mytable` 中: - -```SQL -LOAD LABEL mydatabase.label_local -( - DATA INFILE("file:///home/disk1/business/csv/*") - INTO TABLE mytable - COLUMNS TERMINATED BY "," - (id, name, score) -) -WITH BROKER "sole_broker" -PROPERTIES -( - "timeout" = "3600" -); -``` - -此作业有四个主要部分: - -- `LABEL`:一个字符串,用于查询导入作业的状态。 -- `LOAD` 声明:源 URI、源数据格式和目标表名称。 -- `PROPERTIES`:超时值和要应用于导入作业的任何其他属性。 - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 检查 Broker Load 进度 - -在 v3.0 及更早版本中,使用 [SHOW LOAD](../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md) 语句或 curl 命令来查看 Broker Load 作业的进度。 - -在 v3.1 及更高版本中,您可以从 [`information_schema.loads`](../sql-reference/information_schema/loads.md) 视图中查看 Broker Load 作业的进度: - -```SQL -SELECT * FROM information_schema.loads; -``` - -如果您提交了多个导入作业,则可以按与该作业关联的 `LABEL` 进行过滤。示例: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'label_local'; -``` - -确认导入作业已完成后,您可以查询表以查看数据是否已成功加载。示例: - -```SQL -SELECT * FROM mytable; -+------+-------+-------+ -| id | name | score | -+------+-------+-------+ -| 3 | Alice | 23 | -| 5 | Tony | 25 | -| 6 | Adam | 26 | -| 1 | Lily | 21 | -| 2 | Rose | 22 | -| 4 | Julia | 24 | -| 7 | Allen | 27 | -| 8 | Jacky | 28 | -+------+-------+-------+ -8 rows in set (0.07 sec) -``` - -#### 取消 Broker Load 作业 - -当导入作业未处于 **CANCELLED** 或 **FINISHED** 阶段时,您可以使用 [CANCEL LOAD](../sql-reference/sql-statements/loading_unloading/CANCEL_LOAD.md) 语句来取消该作业。 - -例如,您可以执行以下语句来取消数据库 `mydatabase` 中 label 为 `label_local` 的导入作业: - -```SQL -CANCEL LOAD -FROM mydatabase -WHERE LABEL = "label_local"; -``` - -## 通过 Broker Load 从 NAS 加载 - -有两种方法可以使用 Broker Load 从 NAS 加载数据: - -- 将 NAS 视为本地文件系统,并使用 broker 运行导入作业。请参见上一节“[通过 Broker Load 从本地系统加载](#loading-from-a-local-file-system-via-broker-load)”。 -- (推荐)将 NAS 视为云存储系统,并在没有 broker 的情况下运行导入作业。 - 
-本节介绍第二种方法。详细操作如下: - -1. 将 NAS 设备挂载到 StarRocks 集群的所有 BE 或 CN 节点和 FE 节点上的相同路径。这样,所有 BE 或 CN 都可以像访问自己本地存储的文件一样访问 NAS 设备。 - -2. 使用 Broker Load 将数据从 NAS 设备加载到目标 StarRocks 表。示例: - - ```SQL - LOAD LABEL test_db.label_nas - ( - DATA INFILE("file:///home/disk1/sr/*") - INTO TABLE mytable - COLUMNS TERMINATED BY "," - ) - WITH BROKER - PROPERTIES - ( - "timeout" = "3600" - ); - ``` - - 此作业有四个主要部分: - - - `LABEL`:一个字符串,用于查询导入作业的状态。 - - `LOAD` 声明:源 URI、源数据格式和目标表名称。请注意,声明中的 `DATA INFILE` 用于指定 NAS 设备的挂载点文件夹路径,如上例所示,其中 `file:///` 是前缀,`/home/disk1/sr` 是挂载点文件夹路径。 - - `BROKER`:您无需指定 broker 名称。 - - `PROPERTIES`:超时值和要应用于导入作业的任何其他属性。 - - 有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -提交作业后,您可以根据需要查看导入进度或取消作业。有关详细操作,请参见本主题中的“[检查 Broker Load 进度](#check-broker-load-progress)”和“[取消 Broker Load 作业](#cancel-a-broker-load-job)”。 \ No newline at end of file diff --git a/docs/zh/loading/Stream_Load_transaction_interface.md b/docs/zh/loading/Stream_Load_transaction_interface.md deleted file mode 100644 index f348822..0000000 --- a/docs/zh/loading/Stream_Load_transaction_interface.md +++ /dev/null @@ -1,546 +0,0 @@ ---- -displayed_sidebar: docs -keywords: ['Stream Load'] ---- - -# 使用 Stream Load 事务接口导入数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -从 v2.4 版本开始,StarRocks 提供了一个 Stream Load 事务接口,用于为从外部系统(如 Apache Flink® 和 Apache Kafka®)加载数据的事务实现两阶段提交 (2PC)。Stream Load 事务接口有助于提高高并发流式导入的性能。 - -从 v4.0 版本开始,Stream Load 事务接口支持多表事务,即在同一数据库中将数据导入到多个表中。 - -本主题介绍 Stream Load 事务接口,以及如何使用此接口将数据导入到 StarRocks 中。 - -## 描述 - -Stream Load 事务接口支持使用与 HTTP 协议兼容的工具或语言来调用 API 操作。本主题以 curl 为例,说明如何使用此接口。此接口提供各种功能,例如事务管理、数据写入、事务预提交、事务去重和事务超时管理。 - -:::note -Stream Load 支持 CSV 和 JSON 文件格式。如果要从少量文件(单个文件大小不超过 10 GB)加载数据,建议使用此方法。Stream Load 不支持 Parquet 文件格式。如果需要从 Parquet 文件加载数据,请使用 [INSERT+files()](../loading/InsertInto.md#insert-data-directly-from-files-in-an-external-source-using-files) 。 -::: - -### 事务管理 - -Stream Load 事务接口提供以下 API 操作,用于管理事务: - -- `/api/transaction/begin`: 启动一个新事务。 - -- `/api/transaction/prepare`: 预提交当前事务,并使数据更改暂时持久化。预提交事务后,您可以继续提交或回滚事务。如果您的集群在事务预提交后崩溃,您仍然可以在集群恢复后继续提交事务。 - -- `/api/transaction/commit`: 提交当前事务以使数据更改持久化。 - -- `/api/transaction/rollback`: 回滚当前事务以中止数据更改。 - -> **NOTE** -> -> 事务预提交后,请勿继续使用该事务写入数据。如果继续使用该事务写入数据,您的写入请求将返回错误。 - -下图显示了事务状态和操作之间的关系: - -```mermaid -stateDiagram-v2 - direction LR - [*] --> PREPARE : begin - PREPARE --> PREPARED : prepare - PREPARE --> ABORTED : rollback - PREPARED --> COMMITTED : commit - PREPARED --> ABORTED : rollback -``` - -### 数据写入 - -Stream Load 事务接口提供 `/api/transaction/load` 操作,用于写入数据。您可以在一个事务中多次调用此操作。 - -从 v4.0 版本开始,您可以对不同的表调用 `/api/transaction/load` 操作,以将数据加载到同一数据库中的多个表中。 - -### 事务去重 - -Stream Load 事务接口沿用了 StarRocks 的标签机制。您可以将唯一标签绑定到每个事务,以实现事务的至多一次保证。 - -### 事务超时管理 - -启动事务时,可以使用 HTTP 请求头中的 `timeout` 字段指定事务从 `PREPARE` 到 `PREPARED` 状态的超时时间(以秒为单位)。如果事务在此时间段后仍未准备好,则会自动中止。如果未指定此字段,则默认值由 FE 配置 [`stream_load_default_timeout_second`](../administration/management/FE_configuration.md#stream_load_default_timeout_second) 确定(默认值:600 秒)。 - -启动事务时,您还可以使用 HTTP 请求头中的 `idle_transaction_timeout` 字段指定事务可以保持空闲的超时时间(以秒为单位)。如果在此期间未写入任何数据,则事务将自动回滚。 - -准备事务时,可以使用 HTTP 请求头中的 `prepared_timeout` 字段指定事务从 `PREPARED` 到 `COMMITTED` 状态的超时时间(以秒为单位)。如果事务在此时间段后仍未提交,则会自动中止。如果未指定此字段,则默认值由 FE 配置 [`prepared_transaction_default_timeout_second`](../administration/management/FE_configuration.md#prepared_transaction_default_timeout_second) 确定(默认值:86400 秒)。v3.5.4 及更高版本支持 `prepared_timeout`。 - -## 优势 - -Stream Load 事务接口具有以下优势: - -- 
**Exactly-once 语义** - - 事务分为两个阶段:预提交和提交,这使得跨系统加载数据变得容易。例如,此接口可以保证从 Flink 加载数据的 exactly-once 语义。 - -- **提高导入性能** - - 如果您使用程序运行导入作业,则 Stream Load 事务接口允许您根据需要合并多个小批量数据,然后通过调用 `/api/transaction/commit` 操作在一个事务中一次性发送所有数据。这样,需要加载的数据版本更少,并且导入性能得到提高。 - -## 限制 - -Stream Load 事务接口具有以下限制: - -- 从 v4.0 版本开始,支持**单数据库多表**事务。对**多数据库多表**事务的支持正在开发中。 - -- 仅支持**来自一个客户端的并发数据写入**。对**来自多个客户端的并发数据写入**的支持正在开发中。 - -- `/api/transaction/load` 操作可以在一个事务中多次调用。在这种情况下,为调用的所有 `/api/transaction/load` 操作指定的参数设置(`table` 除外)必须相同。 - -- 当您使用 Stream Load 事务接口加载 CSV 格式的数据时,请确保数据文件中的每个数据记录都以行分隔符结尾。 - -## 注意事项 - -- 如果您调用的 `/api/transaction/begin`、`/api/transaction/load` 或 `/api/transaction/prepare` 操作返回错误,则事务失败并自动回滚。 -- 调用 `/api/transaction/begin` 操作以启动新事务时,必须指定标签。请注意,后续的 `/api/transaction/load`、`/api/transaction/prepare` 和 `/api/transaction/commit` 操作必须使用与 `/api/transaction/begin` 操作相同的标签。 -- 如果您使用正在进行的事务的标签来调用 `/api/transaction/begin` 操作以启动新事务,则先前的事务将失败并回滚。 -- 如果您使用多表事务将数据加载到不同的表中,则必须为事务中涉及的所有操作指定参数 `-H "transaction_type:multi"`。 -- StarRocks 支持的 CSV 格式数据的默认列分隔符和行分隔符是 `\t` 和 `\n`。如果您的数据文件未使用默认列分隔符或行分隔符,则必须使用 `"column_separator: "` 或 `"row_delimiter: "` 在调用 `/api/transaction/load` 操作时指定数据文件中实际使用的列分隔符或行分隔符。 - -## 前提条件 - -### 检查权限 - - - -#### 检查网络配置 - -确保您要加载的数据所在的机器可以通过 [`http_port`](../administration/management/FE_configuration.md#http_port) (默认值:`8030`)和 [`be_http_port`](../administration/management/BE_configuration.md#be_http_port) (默认值:`8040`)访问 StarRocks 集群的 FE 和 BE 节点。 - -## 基本操作 - -### 准备示例数据 - -本主题以 CSV 格式的数据为例。 - -1. 在本地文件系统的 `/home/disk1/` 路径中,创建一个名为 `example1.csv` 的 CSV 文件。该文件由三列组成,依次表示用户 ID、用户名和用户分数。 - - ```Plain - 1,Lily,23 - 2,Rose,23 - 3,Alice,24 - 4,Julia,25 - ``` - -2. 在您的 StarRocks 数据库 `test_db` 中,创建一个名为 `table1` 的主键表。该表由三列组成:`id`、`name` 和 `score`,其中 `id` 是主键。 - - ```SQL - CREATE TABLE `table1` - ( - `id` int(11) NOT NULL COMMENT "user ID", - `name` varchar(65533) NULL COMMENT "user name", - `score` int(11) NOT NULL COMMENT "user score" - ) - ENGINE=OLAP - PRIMARY KEY(`id`) - DISTRIBUTED BY HASH(`id`) BUCKETS 10; - ``` - -### 启动事务 - -#### 语法 - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # 可选。启动多表事务。 - -H "db:" -H "table:" \ - -XPOST http://:/api/transaction/begin -``` - -> **NOTE** -> -> 如果要将数据加载到事务中的不同表中,请在命令中指定 `-H "transaction_type:multi"`。 - -#### 示例 - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" -H "table:table1" \ - -XPOST http://:/api/transaction/begin -``` - -> **NOTE** -> -> 在此示例中,`streamload_txn_example1_table1` 被指定为事务的标签。 - -#### 返回结果 - -- 如果事务成功启动,则返回以下结果: - - ```Bash - { - "Status": "OK", - "Message": "", - "Label": "streamload_txn_example1_table1", - "TxnId": 9032, - "BeginTxnTimeMs": 0 - } - ``` - -- 如果事务绑定到重复标签,则返回以下结果: - - ```Bash - { - "Status": "LABEL_ALREADY_EXISTS", - "ExistingJobStatus": "RUNNING", - "Message": "Label [streamload_txn_example1_table1] has already been used." 
- } - ``` - -- 如果发生除重复标签之外的错误,则返回以下结果: - - ```Bash - { - "Status": "FAILED", - "Message": "" - } - ``` - -### 写入数据 - -#### 语法 - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # 可选。通过多表事务加载数据。 - -H "db:" -H "table:" \ - -T \ - -XPUT http://:/api/transaction/load -``` - -> **NOTE** -> -> - 调用 `/api/transaction/load` 操作时,必须使用 `` 指定要加载的数据文件的保存路径。 -> - 您可以调用具有不同 `table` 参数值的 `/api/transaction/load` 操作,以将数据加载到同一数据库中的不同表中。在这种情况下,您必须在命令中指定 `-H "transaction_type:multi"`。 - -#### 示例 - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" -H "table:table1" \ - -T /home/disk1/example1.csv \ - -H "column_separator: ," \ - -XPUT http://:/api/transaction/load -``` - -> **NOTE** -> -> 在此示例中,数据文件 `example1.csv` 中使用的列分隔符是逗号 (`,`),而不是 StarRocks 的默认列分隔符 (`\t`)。因此,在调用 `/api/transaction/load` 操作时,必须使用 `"column_separator: "` 指定逗号 (`,`) 作为列分隔符。 - -#### 返回结果 - -- 如果数据写入成功,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Seq": 0, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "", - "NumberTotalRows": 5265644, - "NumberLoadedRows": 5265644, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 10737418067, - "LoadTimeMs": 418778, - "StreamLoadPutTimeMs": 68, - "ReceivedDataTimeMs": 38964, - } - ``` - -- 如果事务被认为是未知的,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "TXN_NOT_EXISTS" - } - ``` - -- 如果事务被认为处于无效状态,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation State Invalid" - } - ``` - -- 如果发生除未知事务和无效状态之外的错误,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "" - } - ``` - -### 预提交事务 - -#### 语法 - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # 可选。预提交多表事务。 - -H "db:" \ - [-H "prepared_timeout:"] \ - -XPOST http://:/api/transaction/prepare -``` - -> **NOTE** -> -> 如果要预提交的事务是多表事务,请在命令中指定 `-H "transaction_type:multi"`。 - -#### 示例 - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" \ - -H "prepared_timeout:300" \ - -XPOST http://:/api/transaction/prepare -``` - -> **NOTE** -> -> `prepared_timeout` 字段是可选的。如果未指定,则默认值由 FE 配置 [`prepared_transaction_default_timeout_second`](../administration/management/FE_configuration.md#prepared_transaction_default_timeout_second) 确定(默认值:86400 秒)。v3.5.4 及更高版本支持 `prepared_timeout`。 - -#### 返回结果 - -- 如果预提交成功,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "", - "NumberTotalRows": 5265644, - "NumberLoadedRows": 5265644, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 10737418067, - "LoadTimeMs": 418778, - "StreamLoadPutTimeMs": 68, - "ReceivedDataTimeMs": 38964, - "WriteDataTimeMs": 417851 - "CommitAndPublishTimeMs": 1393 - } - ``` - -- 如果事务被认为不存在,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation Not Exist" - } - ``` - -- 如果预提交超时,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "commit timeout", - } - ``` - -- 如果发生除不存在的事务和预提交超时之外的错误,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": 
"streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "publish timeout" - } - ``` - -### 提交事务 - -#### 语法 - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # 可选。提交多表事务。 - -H "db:" \ - -XPOST http://:/api/transaction/commit -``` - -> **NOTE** -> -> 如果要提交的事务是多表事务,请在命令中指定 `-H "transaction_type:multi"`。 - -#### 示例 - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" \ - -XPOST http://:/api/transaction/commit -``` - -#### 返回结果 - -- 如果提交成功,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "", - "NumberTotalRows": 5265644, - "NumberLoadedRows": 5265644, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 10737418067, - "LoadTimeMs": 418778, - "StreamLoadPutTimeMs": 68, - "ReceivedDataTimeMs": 38964, - "WriteDataTimeMs": 417851 - "CommitAndPublishTimeMs": 1393 - } - ``` - -- 如果事务已提交,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "Transaction already commited", - } - ``` - -- 如果事务被认为不存在,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation Not Exist" - } - ``` - -- 如果提交超时,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "commit timeout", - } - ``` - -- 如果数据发布超时,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "publish timeout", - "CommitAndPublishTimeMs": 1393 - } - ``` - -- 如果发生除不存在的事务和超时之外的错误,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "" - } - ``` - -### 回滚事务 - -#### 语法 - -```Bash -curl --location-trusted -u : -H "label:" \ - -H "Expect:100-continue" \ - [-H "transaction_type:multi"]\ # 可选。回滚多表事务。 - -H "db:" \ - -XPOST http://:/api/transaction/rollback -``` - -> **NOTE** -> -> 如果要回滚的事务是多表事务,请在命令中指定 `-H "transaction_type:multi"`。 - -#### 示例 - -```Bash -curl --location-trusted -u :<123456> -H "label:streamload_txn_example1_table1" \ - -H "Expect:100-continue" \ - -H "db:test_db" \ - -XPOST http://:/api/transaction/rollback -``` - -#### 返回结果 - -- 如果回滚成功,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "OK", - "Message": "" - } - ``` - -- 如果事务被认为不存在,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "Transcation Not Exist" - } - ``` - -- 如果发生除不存在的事务之外的错误,则返回以下结果: - - ```Bash - { - "TxnId": 1, - "Label": "streamload_txn_example1_table1", - "Status": "FAILED", - "Message": "" - } - ``` - -## 相关参考 - -有关 Stream Load 的适用场景和支持的数据文件格式,以及 Stream Load 的工作原理的信息,请参见 [通过 Stream Load 从本地文件系统加载](../loading/StreamLoad.md#loading-from-a-local-file-system-via-stream-load)。 - -有关创建 Stream Load 作业的语法和参数的信息,请参见 [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 \ No newline at end of file diff --git a/docs/zh/loading/alibaba.md b/docs/zh/loading/alibaba.md deleted file mode 100644 index db92470..0000000 --- a/docs/zh/loading/alibaba.md +++ /dev/null @@ -1 +0,0 @@ -unlisted: true \ No newline at end of file diff --git a/docs/zh/loading/automq-routine-load.md b/docs/zh/loading/automq-routine-load.md deleted file mode 100644 index 4765159..0000000 --- 
a/docs/zh/loading/automq-routine-load.md +++ /dev/null @@ -1,147 +0,0 @@ ---- -displayed_sidebar: docs -description: Cloud based Kafka from AutoMQ ---- - -# AutoMQ Kafka - -import Replicanum from '../_assets/commonMarkdown/replicanum.mdx' - -[AutoMQ for Kafka](https://www.automq.com/docs) 是一个为云环境重新设计的 Kafka 云原生版本。 -AutoMQ Kafka 是 [开源](https://github.com/AutoMQ/automq-for-kafka) 的,并且完全兼容 Kafka 协议,充分利用云的优势。 -与自管理的 Apache Kafka 相比,AutoMQ Kafka 凭借其云原生架构,提供了诸如容量自动伸缩、网络流量的自我平衡、秒级移动分区等功能。这些功能大大降低了用户的总拥有成本 (TCO)。 - -本文将指导您使用 StarRocks Routine Load 将数据导入到 AutoMQ Kafka 中。 -要了解 Routine Load 的基本原理,请参阅 Routine Load 原理部分。 - -## 准备环境 - -### 准备 StarRocks 和测试数据 - -确保您有一个正在运行的 StarRocks 集群。 - -创建一个数据库和一个 Primary Key 表用于测试: - -```sql -create database automq_db; -create table users ( - id bigint NOT NULL, - name string NOT NULL, - timestamp string NULL, - status string NULL -) PRIMARY KEY (id) -DISTRIBUTED BY HASH(id) -PROPERTIES ( - "enable_persistent_index" = "true" -); -``` - - - -## 准备 AutoMQ Kafka 和测试数据 - -要准备您的 AutoMQ Kafka 环境和测试数据,请按照 AutoMQ [Quick Start](https://www.automq.com/docs) 指南部署您的 AutoMQ Kafka 集群。确保 StarRocks 可以直接连接到您的 AutoMQ Kafka 服务器。 - -要在 AutoMQ Kafka 中快速创建一个名为 `example_topic` 的 topic 并写入测试 JSON 数据,请按照以下步骤操作: - -### 创建一个 topic - -使用 Kafka 的命令行工具创建一个 topic。确保您可以访问 Kafka 环境并且 Kafka 服务正在运行。 -以下是创建 topic 的命令: - -```shell -./kafka-topics.sh --create --topic example_topic --bootstrap-server 10.0.96.4:9092 --partitions 1 --replication-factor 1 -``` - -> Note: 将 `topic` 和 `bootstrap-server` 替换为您的 Kafka 服务器地址。 - -要检查 topic 创建的结果,请使用以下命令: - -```shell -./kafka-topics.sh --describe example_topic --bootstrap-server 10.0.96.4:9092 -``` - -### 生成测试数据 - -生成一个简单的 JSON 格式测试数据 - -```json -{ - "id": 1, - "name": "testuser", - "timestamp": "2023-11-10T12:00:00", - "status": "active" -} -``` - -### 写入测试数据 - -使用 Kafka 的命令行工具或编程方法将测试数据写入 example_topic。以下是使用命令行工具的示例: - -```shell -echo '{"id": 1, "name": "testuser", "timestamp": "2023-11-10T12:00:00", "status": "active"}' | sh kafka-console-producer.sh --broker-list 10.0.96.4:9092 --topic example_topic -``` - -> Note: 将 `topic` 和 `bootstrap-server` 替换为您的 Kafka 服务器地址。 - -要查看最近写入的 topic 数据,请使用以下命令: - -```shell -sh kafka-console-consumer.sh --bootstrap-server 10.0.96.4:9092 --topic example_topic --from-beginning -``` - -## 创建 Routine Load 任务 - -在 StarRocks 命令行中,创建一个 Routine Load 任务以持续从 AutoMQ Kafka topic 导入数据: - -```sql -CREATE ROUTINE LOAD automq_example_load ON users -COLUMNS(id, name, timestamp, status) -PROPERTIES -( - "desired_concurrent_number" = "5", - "format" = "json", - "jsonpaths" = "[\"$.id\",\"$.name\",\"$.timestamp\",\"$.status\"]" -) -FROM KAFKA -( - "kafka_broker_list" = "10.0.96.4:9092", - "kafka_topic" = "example_topic", - "kafka_partitions" = "0", - "property.kafka_default_offsets" = "OFFSET_BEGINNING" -); -``` - -> Note: 将 `kafka_broker_list` 替换为您的 Kafka 服务器地址。 - -### 参数说明 - -#### 数据格式 - -在 PROPERTIES 子句的 "format" = "json" 中将数据格式指定为 JSON。 - -#### 数据提取和转换 - -要指定源数据和目标表之间的映射和转换关系,请配置 COLUMNS 和 jsonpaths 参数。COLUMNS 中的列名对应于目标表的列名,它们的顺序对应于源数据中的列顺序。jsonpaths 参数用于从 JSON 数据中提取所需的字段数据,类似于新生成的 CSV 数据。然后,COLUMNS 参数临时命名 jsonpaths 中的字段,以便排序。有关数据转换的更多说明,请参见 [数据导入转换](./Etl_in_loading.md)。 -> Note: 如果每行 JSON 对象都具有与目标表的列相对应的键名和数量(不需要顺序),则无需配置 COLUMNS。 - -## 验证数据导入 - -首先,我们检查 Routine Load 导入作业,并确认 Routine Load 导入任务状态为 RUNNING 状态。 - -```sql -show routine load\G -``` - -然后,查询 StarRocks 数据库中的相应表,我们可以观察到数据已成功导入。 - -```sql -StarRocks > select * from users; -+------+--------------+---------------------+--------+ -| id | name | timestamp | status | 
-+------+--------------+---------------------+--------+ -| 1 | testuser | 2023-11-10T12:00:00 | active | -| 2 | testuser | 2023-11-10T12:00:00 | active | -+------+--------------+---------------------+--------+ -2 rows in set (0.01 sec) -``` \ No newline at end of file diff --git a/docs/zh/loading/azure.md b/docs/zh/loading/azure.md deleted file mode 100644 index 23b0d0b..0000000 --- a/docs/zh/loading/azure.md +++ /dev/null @@ -1,455 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# 从 Microsoft Azure Storage 导入数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 提供了以下从 Azure 导入数据的选项: - -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md)+[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) 进行同步导入 -- 使用 [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 进行异步导入 - -每个选项都有其自身的优势,以下各节将详细介绍。 - -在大多数情况下,我们建议您使用 INSERT+`FILES()` 方法,该方法更易于使用。 - -但是,INSERT+`FILES()` 方法目前仅支持 Parquet、ORC 和 CSV 文件格式。因此,如果您需要导入其他文件格式(如 JSON)的数据,或者[在数据导入期间执行数据更改(如 DELETE)](../loading/Load_to_Primary_Key_tables.md),则可以求助于 Broker Load。 - -## 前提条件 - -### 准备源数据 - -确保要导入到 StarRocks 中的源数据已正确存储在 Azure 存储帐户中的容器中。 - -在本主题中,假设您要导入存储在 Azure Data Lake Storage Gen2 (ADLS Gen2) 存储帐户 (`starrocks`) 中的容器 (`starrocks-container`) 的根目录下的 Parquet 格式的示例数据集 (`user_behavior_ten_million_rows.parquet`) 的数据。 - -### 检查权限 - - - -### 收集身份验证详细信息 - -本主题中的示例使用共享密钥身份验证方法。为确保您有权从 ADLS Gen2 读取数据,我们建议您阅读 [Azure Data Lake Storage Gen2 > 共享密钥(存储帐户的访问密钥)](../integrations/authenticate_to_azure_storage.md#service-principal-1),以了解您需要配置的身份验证参数。 - -简而言之,如果您使用共享密钥身份验证,则需要收集以下信息: - -- 您的 ADLS Gen2 存储帐户的用户名 -- 您的 ADLS Gen2 存储帐户的共享密钥 - -有关所有可用身份验证方法的信息,请参阅 [Azure 云存储身份验证](../integrations/authenticate_to_azure_storage.md)。 - -## 使用 INSERT+FILES() - -此方法从 v3.2 开始可用,目前仅支持 Parquet、ORC 和 CSV(从 v3.3.0 开始)文件格式。 - -### INSERT+FILES() 的优势 - -`FILES()` 可以读取存储在云存储中的文件,基于您指定的路径相关属性,推断文件中数据的表结构,然后将文件中的数据作为数据行返回。 - -使用 `FILES()`,您可以: - -- 使用 [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 直接从 Azure 查询数据。 -- 使用 [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS) 创建和导入表。 -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) 将数据导入到现有表中。 - -### 典型示例 - -#### 使用 SELECT 直接从 Azure 查询 - -使用 SELECT+`FILES()` 直接从 Azure 查询可以在创建表之前很好地预览数据集的内容。例如: - -- 在不存储数据的情况下获取数据集的预览。 -- 查询最小值和最大值,并确定要使用的数据类型。 -- 检查 `NULL` 值。 - -以下示例查询存储在您的存储帐户 `starrocks` 中的容器 `starrocks-container` 中的示例数据集 `user_behavior_ten_million_rows.parquet`: - -```SQL -SELECT * FROM FILES -( - "path" = "abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -) -LIMIT 3; -``` - -系统返回类似于以下的查询结果: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> 请注意,上面返回的列名由 Parquet 文件提供。 - -#### 使用 CTAS 创建和导入表 - -这是前一个示例的延续。先前的查询包含在 CREATE TABLE AS SELECT (CTAS) 中,以使用模式推断自动执行表创建。这意味着 StarRocks 
将推断表结构,创建您想要的表,然后将数据导入到表中。当使用 `FILES()` 表函数与 Parquet 文件时,不需要列名和类型来创建表,因为 Parquet 格式包含列名。 - -> **NOTE** -> -> 使用模式推断时,CREATE TABLE 的语法不允许设置副本数。如果您使用的是 StarRocks 存算一体集群,请在创建表之前设置副本数。以下示例适用于具有三个副本的系统: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "3"); -> ``` - -创建一个数据库并切换到该数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -使用 CTAS 创建一个表,并将示例数据集 `user_behavior_ten_million_rows.parquet` 的数据导入到该表中,该数据集存储在您的存储帐户 `starrocks` 中的容器 `starrocks-container` 中: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -); -``` - -创建表后,您可以使用 [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md) 查看其结构: - -```SQL -DESCRIBE user_behavior_inferred; -``` - -系统返回以下查询结果: - -```Plain -+--------------+-----------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+-----------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varbinary | YES | false | NULL | | -| Timestamp | varbinary | YES | false | NULL | | -+--------------+-----------+------+-------+---------+-------+ -``` - -查询表以验证数据是否已导入到其中。例如: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -返回以下查询结果,表明数据已成功导入: - -```Plain -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 84 | 162325 | 2939262 | pv | 2017-12-02 05:41:41 | -| 84 | 232622 | 4148053 | pv | 2017-11-27 04:36:10 | -| 84 | 595303 | 903809 | pv | 2017-11-26 08:03:59 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### 使用 INSERT 导入到现有表中 - -您可能想要自定义要插入的表,例如: - -- 列数据类型、可空设置或默认值 -- 键类型和列 -- 数据分区和分桶 - -> **NOTE** -> -> 创建最有效的表结构需要了解数据的使用方式和列的内容。本主题不涵盖表设计。有关表设计的信息,请参阅 [表类型](../table_design/StarRocks_table_design.md)。 - -在此示例中,我们基于对表查询方式和 Parquet 文件中数据的了解来创建表。对 Parquet 文件中数据的了解可以通过直接在 Azure 中查询文件来获得。 - -- 由于对 Azure 中数据集的查询表明 `Timestamp` 列包含与 VARBINARY 数据类型匹配的数据,因此在以下 DDL 中指定了列类型。 -- 通过查询 Azure 中的数据,您可以发现数据集中没有 `NULL` 值,因此 DDL 不会将任何列设置为可空。 -- 基于对预期查询类型的了解,排序键和分桶列设置为 `UserID` 列。您的用例可能与此数据不同,因此您可能会决定将 `ItemID` 除了 `UserID` 之外或代替 `UserID` 用于排序键。 - -创建一个数据库并切换到该数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与您要从 Azure 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -显示结构,以便您可以将其与 `FILES()` 表函数生成的推断结构进行比较: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | varbinary | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 
rows in set (0.00 sec) -``` - -:::tip - -将您刚刚创建的结构与之前使用 `FILES()` 表函数推断的结构进行比较。查看: - -- 数据类型 -- 可空性 -- 键字段 - -为了更好地控制目标表的结构并获得更好的查询性能,我们建议您在生产环境中手动指定表结构。 - -::: - -创建表后,您可以使用 INSERT INTO SELECT FROM FILES() 导入它: - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - "path" = "abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -); -``` - -导入完成后,您可以查询表以验证数据是否已导入到其中。例如: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -系统返回类似于以下的查询结果,表明数据已成功导入: - -```Plain - +--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 INSERT 作业的进度。此功能从 v3.1 开始支持。例如: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -有关 `loads` 视图中提供的字段的信息,请参阅 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与作业关联的 `LABEL` 进行过滤。例如: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_f3fc2298-a553-11ee-92f4-00163e0842bd' \G -*************************** 1. row *************************** - JOB_ID: 10193 - LABEL: insert_f3fc2298-a553-11ee-92f4-00163e0842bd - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-28 15:37:38 - ETL_START_TIME: 2023-12-28 15:37:38 - ETL_FINISH_TIME: 2023-12-28 15:37:38 - LOAD_START_TIME: 2023-12-28 15:37:38 - LOAD_FINISH_TIME: 2023-12-28 15:39:35 - JOB_DETAILS: {"All backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[10120]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":581730322,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT 是一个同步命令。如果 INSERT 作业仍在运行,您需要打开另一个会话来检查其执行状态。 - -## 使用 Broker Load - -异步 Broker Load 进程处理与 Azure 的连接、提取数据以及将数据存储在 StarRocks 中。 - -此方法支持以下文件格式: - -- Parquet -- ORC -- CSV -- JSON(从 v3.2.3 开始支持) - -### Broker Load 的优势 - -- Broker Load 在后台运行,客户端无需保持连接即可继续作业。 -- Broker Load 是长时间运行作业的首选,默认超时时间为 4 小时。 -- 除了 Parquet 和 ORC 文件格式外,Broker Load 还支持 CSV 文件格式和 JSON 文件格式(从 v3.2.3 开始支持 JSON 文件格式)。 - -### 数据流 - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -1. 用户创建一个导入作业。 -2. 前端 (FE) 创建一个查询计划,并将该计划分发到后端节点 (BE) 或计算节点 (CN)。 -3. 
BE 或 CN 从源提取数据,并将数据导入到 StarRocks 中。 - -### 典型示例 - -创建一个表,启动一个从 Azure 提取示例数据集 `user_behavior_ten_million_rows.parquet` 的导入进程,并验证数据导入的进度和成功。 - -#### 创建数据库和表 - -连接到您的 StarRocks 集群。然后,创建一个数据库并切换到该数据库: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与您要从 Azure 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### 启动 Broker Load - -运行以下命令以启动一个 Broker Load 作业,该作业将数据从示例数据集 `user_behavior_ten_million_rows.parquet` 导入到 `user_behavior` 表: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" -) -WITH BROKER -( - "azure.adls2.storage_account" = "starrocks", - "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx" -) -PROPERTIES -( - "timeout" = "3600" -); -``` - -此作业有四个主要部分: - -- `LABEL`: 查询导入作业状态时使用的字符串。 -- `LOAD` 声明:源 URI、源数据格式和目标表名。 -- `BROKER`: 源的连接详细信息。 -- `PROPERTIES`: 超时值和要应用于导入作业的任何其他属性。 - -有关详细的语法和参数说明,请参阅 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 Broker Load 作业的进度。此功能从 v3.1 开始支持。 - -```SQL -SELECT * FROM information_schema.loads \G -``` - -有关 `loads` 视图中提供的字段的信息,请参阅 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与作业关联的 `LABEL` 进行过滤: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior' \G -*************************** 1. row *************************** - JOB_ID: 10250 - LABEL: user_behavior - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: BROKER - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):3600; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-28 16:15:19 - ETL_START_TIME: 2023-12-28 16:15:25 - ETL_FINISH_TIME: 2023-12-28 16:15:25 - LOAD_START_TIME: 2023-12-28 16:15:25 - LOAD_FINISH_TIME: 2023-12-28 16:16:31 - JOB_DETAILS: {"All backends":{"6a8ef4c0-1009-48c9-8d18-c4061d2255bf":[10121]},"FileNumber":1,"FileSize":132251298,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":132251298,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"6a8ef4c0-1009-48c9-8d18-c4061d2255bf":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -确认导入作业已完成后,您可以检查目标表的一个子集,以查看数据是否已成功导入。例如: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -系统返回类似于以下的查询结果,表明数据已成功导入: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ \ No newline at end of file diff --git a/docs/zh/loading/gcs.md b/docs/zh/loading/gcs.md deleted file mode 100644 index c6fd228..0000000 --- a/docs/zh/loading/gcs.md +++ /dev/null @@ -1,464 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# 从 
GCS 导入数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 提供了以下从 GCS 导入数据的选项: - -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) + [`FILES()`](../sql-reference/sql-functions/table-functions/files.md) 进行同步导入 -- 使用 [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 进行异步导入 - -每个选项都有其自身的优势,以下各节将详细介绍。 - -在大多数情况下,我们建议您使用 INSERT+`FILES()` 方法,该方法更易于使用。 - -但是,INSERT+`FILES()` 方法目前仅支持 Parquet、ORC 和 CSV 文件格式。因此,如果您需要导入其他文件格式的数据(如 JSON),或者[在数据导入期间执行数据更改(如 DELETE)](../loading/Load_to_Primary_Key_tables.md),则可以采用 Broker Load。 - -## 前提条件 - -### 准备源数据 - -确保要加载到 StarRocks 中的源数据已正确存储在 GCS bucket 中。您还可以考虑数据和数据库的位置,因为当您的 bucket 和 StarRocks 集群位于同一区域时,数据传输成本会低得多。 - -在本主题中,我们为您提供了一个 GCS bucket 中的示例数据集 `gs://starrocks-samples/user_behavior_ten_million_rows.parquet`。您可以使用任何有效的凭据访问该数据集,因为任何 GCP 用户都可以读取该对象。 - -### 检查权限 - - - -### 收集身份验证详细信息 - -本主题中的示例使用基于服务帐户的身份验证。要实践基于 IAM 用户的身份验证,您需要收集有关以下 GCS 资源的信息: - -- 存储数据的 GCS bucket。 -- GCS 对象键(对象名称),如果访问 bucket 中的特定对象。请注意,如果您的 GCS 对象存储在子文件夹中,则对象键可以包含前缀。 -- GCS bucket 所属的 GCS 区域。 -- 您的 Google Cloud 服务帐户的 `private_ key_id`、`private_key` 和 `client_email`。 - -有关所有可用身份验证方法的信息,请参阅 [验证到 Google Cloud Storage 的身份](../integrations/authenticate_to_gcs.md)。 - -## 使用 INSERT+FILES() - -此方法从 v3.2 开始可用,目前仅支持 Parquet、ORC 和 CSV(从 v3.3.0 开始)文件格式。 - -### INSERT+FILES() 的优势 - -`FILES()` 可以读取存储在云存储中的文件,基于您指定的路径相关属性,推断文件中数据的表结构,然后将文件中的数据作为数据行返回。 - -使用 `FILES()`,您可以: - -- 使用 [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 直接从 GCS 查询数据。 -- 使用 [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS) 创建和加载表。 -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) 将数据加载到现有表中。 - -### 典型示例 - -#### 使用 SELECT 直接从 GCS 查询 - -使用 SELECT+`FILES()` 直接从 GCS 查询可以在创建表之前很好地预览数据集的内容。例如: - -- 在不存储数据的情况下获取数据集的预览。 -- 查询最小值和最大值,并确定要使用的数据类型。 -- 检查 `NULL` 值。 - -以下示例查询示例数据集 `gs://starrocks-samples/user_behavior_ten_million_rows.parquet`: - -```SQL -SELECT * FROM FILES -( - "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" -) -LIMIT 3; -``` - -> **NOTE** -> -> 将上述命令中的凭据替换为您自己的凭据。可以使用任何有效的服务帐户电子邮件、密钥和密码,因为任何经过 GCP 身份验证的用户都可以读取该对象。 - -系统返回类似于以下的查询结果: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> 请注意,上面返回的列名由 Parquet 文件提供。 - -#### 使用 CTAS 创建和加载表 - -这是前一个示例的延续。先前的查询包装在 CREATE TABLE AS SELECT (CTAS) 中,以使用模式推断自动执行表创建。这意味着 StarRocks 将推断表模式,创建您想要的表,然后将数据加载到表中。当使用带有 Parquet 文件的 `FILES()` 表函数时,由于 Parquet 格式包含列名,因此不需要列名和类型来创建表。 - -> **NOTE** -> -> 使用模式推断时,CREATE TABLE 的语法不允许设置副本数。如果您使用的是 StarRocks 存算一体集群,请在创建表之前设置副本数。以下示例适用于具有三个副本的系统: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "3"); -> ``` - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF 
NOT EXISTS mydatabase; -USE mydatabase; -``` - -使用 CTAS 创建一个表,并将示例数据集 `gs://starrocks-samples/user_behavior_ten_million_rows.parquet` 的数据加载到该表中: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" -); -``` - -> **NOTE** -> -> 将上述命令中的凭据替换为您自己的凭据。可以使用任何有效的服务帐户电子邮件、密钥和密码,因为任何经过 GCP 身份验证的用户都可以读取该对象。 - -创建表后,您可以使用 [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md) 查看其模式: - -```SQL -DESCRIBE user_behavior_inferred; -``` - -系统返回类似于以下的查询结果: - -```Plain -+--------------+-----------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+-----------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varbinary | YES | false | NULL | | -| Timestamp | varbinary | YES | false | NULL | | -+--------------+-----------+------+-------+---------+-------+ -``` - -查询表以验证数据是否已加载到其中。示例: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plain -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 84 | 162325 | 2939262 | pv | 2017-12-02 05:41:41 | -| 84 | 232622 | 4148053 | pv | 2017-11-27 04:36:10 | -| 84 | 595303 | 903809 | pv | 2017-11-26 08:03:59 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### 使用 INSERT 加载到现有表中 - -您可能想要自定义要插入的表,例如: - -- 列数据类型、可为空设置或默认值 -- 键类型和列 -- 数据分区和分桶 - -> **NOTE** -> -> 创建最有效的表结构需要了解数据的使用方式和列的内容。本主题不涵盖表设计。有关表设计的信息,请参阅 [表类型](../table_design/StarRocks_table_design.md)。 - -在此示例中,我们基于对表的查询方式和 Parquet 文件中的数据的了解来创建表。对 Parquet 文件中的数据的了解可以通过直接在 GCS 中查询文件来获得。 - -- 由于 GCS 中数据集的查询表明 `Timestamp` 列包含与 VARBINARY 数据类型匹配的数据,因此在以下 DDL 中指定了列类型。 -- 通过查询 GCS 中的数据,您可以发现数据集中没有 `NULL` 值,因此 DDL 不会将任何列设置为可为空。 -- 根据对预期查询类型的了解,排序键和分桶列设置为列 `UserID`。您的用例可能与此数据不同,因此您可能会决定除了 `UserID` 之外或代替 `UserID` 使用 `ItemID` 作为排序键。 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 GCS 加载的 Parquet 文件相同的模式): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -显示模式,以便您可以将其与 `FILES()` 表函数生成的推断模式进行比较: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | varbinary | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - -将您刚刚创建的模式与之前使用 `FILES()` 表函数推断的模式进行比较。查看: - -- 数据类型 -- 可为空性 -- 键字段 - 
-为了更好地控制目标表的模式并获得更好的查询性能,我们建议您在生产环境中手动指定表模式。 - -::: - -创建表后,您可以使用 INSERT INTO SELECT FROM FILES() 加载它: - -```SQL -INSERT INTO user_behavior_declared - SELECT * FROM FILES - ( - "path" = "gs://starrocks-samples/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" -); -``` - -> **NOTE** -> -> 将上述命令中的凭据替换为您自己的凭据。可以使用任何有效的服务帐户电子邮件、密钥和密码,因为任何经过 GCP 身份验证的用户都可以读取该对象。 - -加载完成后,您可以查询表以验证数据是否已加载到其中。示例: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -系统返回类似于以下的查询结果,表明数据已成功加载: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 INSERT 作业的进度。此功能从 v3.1 开始受支持。示例: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -有关 `loads` 视图中提供的字段的信息,请参阅 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与该作业关联的 `LABEL` 进行筛选。示例: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_f3fc2298-a553-11ee-92f4-00163e0842bd' \G -*************************** 1. row *************************** - JOB_ID: 10193 - LABEL: insert_f3fc2298-a553-11ee-92f4-00163e0842bd - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-28 15:37:38 - ETL_START_TIME: 2023-12-28 15:37:38 - ETL_FINISH_TIME: 2023-12-28 15:37:38 - LOAD_START_TIME: 2023-12-28 15:37:38 - LOAD_FINISH_TIME: 2023-12-28 15:39:35 - JOB_DETAILS: {"All backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[10120]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":581730322,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"f3fc2298-a553-11ee-92f4-00163e0842bd":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT 是一个同步命令。如果 INSERT 作业仍在运行,您需要打开另一个会话来检查其执行状态。 - -## 使用 Broker Load - -异步 Broker Load 进程处理与 GCS 的连接、提取数据以及将数据存储在 StarRocks 中。 - -此方法支持以下文件格式: - -- Parquet -- ORC -- CSV -- JSON(从 v3.2.3 开始支持) - -### Broker Load 的优势 - -- Broker Load 在后台运行,客户端无需保持连接即可继续作业。 -- Broker Load 是长时间运行作业的首选,默认超时时间为 4 小时。 -- 除了 Parquet 和 ORC 文件格式外,Broker Load 还支持 CSV 文件格式和 JSON 文件格式(从 v3.2.3 开始支持 JSON 文件格式)。 - -### 数据流 - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -1. 用户创建一个导入作业。 -2. 前端 (FE) 创建一个查询计划,并将该计划分发到后端节点 (BEs) 或计算节点 (CNs)。 -3. 
BE 或 CN 从源中提取数据,并将数据加载到 StarRocks 中。 - -### 典型示例 - -创建一个表,启动一个从 GCS 提取示例数据集 `gs://starrocks-samples/user_behavior_ten_million_rows.parquet` 的导入进程,并验证数据加载的进度和成功。 - -#### 创建数据库和表 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 GCS 加载的 Parquet 文件相同的模式): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### 启动 Broker Load - -运行以下命令以启动一个 Broker Load 作业,该作业将数据从示例数据集 `gs://starrocks-samples/user_behavior_ten_million_rows.parquet` 加载到 `user_behavior` 表: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("gs://starrocks-samples/user_behavior_ten_million_rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER - ( - - "gcp.gcs.service_account_email" = "sampledatareader@xxxxx-xxxxxx-000000.iam.gserviceaccount.com", - "gcp.gcs.service_account_private_key_id" = "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", - "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -> **NOTE** -> -> 将上述命令中的凭据替换为您自己的凭据。可以使用任何有效的服务帐户电子邮件、密钥和密码,因为任何经过 GCP 身份验证的用户都可以读取该对象。 - -此作业有四个主要部分: - -- `LABEL`: 查询导入作业状态时使用的字符串。 -- `LOAD` 声明:源 URI、源数据格式和目标表名称。 -- `BROKER`: 源的连接详细信息。 -- `PROPERTIES`: 超时值和要应用于导入作业的任何其他属性。 - -有关详细的语法和参数说明,请参阅 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 INSERT 作业的进度。此功能从 v3.1 开始受支持。 - -```SQL -SELECT * FROM information_schema.loads; -``` - -有关 `loads` 视图中提供的字段的信息,请参阅 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与该作业关联的 `LABEL` 进行筛选。示例: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -在下面的输出中,`user_behavior` 导入作业有两个条目: - -- 第一个记录显示 `CANCELLED` 状态。滚动到 `ERROR_MSG`,您可以看到该作业由于 `listPath failed` 而失败。 -- 第二个记录显示 `FINISHED` 状态,这意味着该作业已成功。 - -```Plain -JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH| -------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+ - 10121|user_behavior |mydatabase |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | | - 10106|user_behavior |mydatabase |FINISHED 
|ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | | -``` - -确认导入作业已完成后,您可以检查目标表的一个子集,以查看数据是否已成功加载。示例: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -系统返回类似于以下的查询结果,表明数据已成功加载: - -```Plain -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ \ No newline at end of file diff --git a/docs/zh/loading/hdfs_load.md b/docs/zh/loading/hdfs_load.md deleted file mode 100644 index 4674d0b..0000000 --- a/docs/zh/loading/hdfs_load.md +++ /dev/null @@ -1,605 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# 从 HDFS 导入数据 - -import LoadMethodIntro from '../_assets/commonMarkdown/loadMethodIntro.mdx' - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -import PipeAdvantages from '../_assets/commonMarkdown/pipeAdvantages.mdx' - -StarRocks 提供了以下从 HDFS 导入数据的选项: - - - -## 前提条件 - -### 准备源数据 - -确保要导入到 StarRocks 中的源数据已正确存储在 HDFS 集群中。本文档假设您要将 `/user/amber/user_behavior_ten_million_rows.parquet` 从 HDFS 导入到 StarRocks 中。 - -### 检查权限 - - - -### 收集认证信息 - -您可以使用简单的身份验证方法来建立与 HDFS 集群的连接。要使用简单的身份验证,您需要收集可用于访问 HDFS 集群的 NameNode 的帐户的用户名和密码。 - -## 使用 INSERT+FILES() - -此方法从 v3.1 开始可用,目前仅支持 Parquet、ORC 和 CSV (从 v3.3.0 开始支持) 文件格式。 - -### INSERT+FILES() 的优势 - -[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) 可以读取存储在云存储中的文件,根据您指定的路径相关属性,推断文件中数据的表结构,然后将文件中的数据作为数据行返回。 - -使用 `FILES()`,您可以: - -- 使用 [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 直接从 HDFS 查询数据。 -- 使用 [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS) 创建和导入表。 -- 使用 [INSERT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 将数据导入到现有表中。 - -### 典型示例 - -#### 使用 SELECT 直接从 HDFS 查询 - -使用 SELECT+`FILES()` 直接从 HDFS 查询可以在创建表之前很好地预览数据集的内容。例如: - -- 在不存储数据的情况下获取数据集的预览。 -- 查询最小值和最大值,并确定要使用的数据类型。 -- 检查 `NULL` 值。 - -以下示例查询存储在 HDFS 集群中的数据文件 `/user/amber/user_behavior_ten_million_rows.parquet`: - -```SQL -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -) -LIMIT 3; -``` - -系统返回以下查询结果: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> 请注意,上面返回的列名由 Parquet 文件提供。 - -#### 使用 CTAS 
创建和导入表 - -这是前一个示例的延续。前面的查询包含在 CREATE TABLE AS SELECT (CTAS) 中,以使用模式推断自动执行表创建。这意味着 StarRocks 将推断表结构,创建您想要的表,然后将数据导入到表中。使用 `FILES()` 表函数时,Parquet 格式包含列名,因此创建表时不需要列名和类型。 - -> **NOTE** -> -> 使用模式推断时,CREATE TABLE 的语法不允许设置副本数,因此请在创建表之前设置它。以下示例适用于具有三个副本的系统: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "3"); -> ``` - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -使用 CTAS 创建一个表,并将数据文件 `/user/amber/user_behavior_ten_million_rows.parquet` 的数据导入到该表中: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -); -``` - -创建表后,您可以使用 [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md) 查看其结构: - -```SQL -DESCRIBE user_behavior_inferred; -``` - -系统返回以下查询结果: - -```Plain -+--------------+-----------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+-----------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varbinary | YES | false | NULL | | -| Timestamp | varbinary | YES | false | NULL | | -+--------------+-----------+------+-------+---------+-------+ -``` - -查询表以验证数据是否已导入到其中。示例: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -返回以下查询结果,表明数据已成功导入: - -```Plaintext -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 84 | 56257 | 1879194 | pv | 2017-11-26 05:56:23 | -| 84 | 108021 | 2982027 | pv | 2017-12-02 05:43:00 | -| 84 | 390657 | 1879194 | pv | 2017-11-28 11:20:30 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### 使用 INSERT 导入到现有表 - -您可能想要自定义要导入的表,例如: - -- 列数据类型、nullable 设置或默认值 -- 键类型和列 -- 数据分区和分桶 - -> **NOTE** -> -> 创建最有效的表结构需要了解数据的使用方式和列的内容。本主题不包括表设计。有关表设计的信息,请参见 [表类型](../table_design/StarRocks_table_design.md)。 - -在此示例中,我们基于对表查询方式和 Parquet 文件中数据的了解来创建表。对 Parquet 文件中数据的了解可以通过直接在 HDFS 中查询文件来获得。 - -- 由于 HDFS 中数据集的查询表明 `Timestamp` 列包含与 VARBINARY 数据类型匹配的数据,因此列类型在以下 DDL 中指定。 -- 通过查询 HDFS 中的数据,您可以发现数据集中没有 `NULL` 值,因此 DDL 不会将任何列设置为 nullable。 -- 根据对预期查询类型的了解,排序键和分桶列设置为列 `UserID`。您的用例可能与此数据不同,因此您可能会决定将 `ItemID` 除了 `UserID` 之外或代替 `UserID` 用于排序键。 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 HDFS 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -显示结构,以便您可以将其与 `FILES()` 表函数生成的推断结构进行比较: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | varbinary | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - 
-将您刚刚创建的结构与之前使用 `FILES()` 表函数推断的结构进行比较。查看: - -- 数据类型 -- nullable -- 键字段 - -为了更好地控制目标表的结构并获得更好的查询性能,我们建议您在生产环境中手动指定表结构。 - -::: - -创建表后,您可以使用 INSERT INTO SELECT FROM FILES() 导入它: - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -); -``` - -导入完成后,您可以查询表以验证数据是否已导入到其中。示例: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -返回以下查询结果,表明数据已成功导入: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 107 | 1568743 | 4476428 | pv | 2017-11-25 14:29:53 | -| 107 | 470767 | 1020087 | pv | 2017-11-25 14:32:31 | -| 107 | 358238 | 1817004 | pv | 2017-11-25 14:43:23 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 INSERT 作业的进度。此功能从 v3.1 开始支持。示例: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -有关 `loads` 视图中提供的字段的信息,请参见 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与该作业关联的 `LABEL` 进行过滤。示例: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_0d86c3f9-851f-11ee-9c3e-00163e044958' \G -*************************** 1. row *************************** - JOB_ID: 10214 - LABEL: insert_0d86c3f9-851f-11ee-9c3e-00163e044958 - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-11-17 15:58:14 - ETL_START_TIME: 2023-11-17 15:58:14 - ETL_FINISH_TIME: 2023-11-17 15:58:14 - LOAD_START_TIME: 2023-11-17 15:58:14 - LOAD_FINISH_TIME: 2023-11-17 15:58:18 - JOB_DETAILS: {"All backends":{"0d86c3f9-851f-11ee-9c3e-00163e044958":[10120]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"0d86c3f9-851f-11ee-9c3e-00163e044958":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT 是一个同步命令。如果 INSERT 作业仍在运行,您需要打开另一个会话来检查其执行状态。 - -## 使用 Broker Load - -异步 Broker Load 进程处理与 HDFS 的连接、拉取数据以及将数据存储在 StarRocks 中。 - -此方法支持以下文件格式: - -- Parquet -- ORC -- CSV -- JSON (从 v3.2.3 开始支持) - -### Broker Load 的优势 - -- Broker Load 在后台运行,客户端无需保持连接即可继续作业。 -- 对于长时间运行的作业,Broker Load 是首选,默认超时时间为 4 小时。 -- 除了 Parquet 和 ORC 文件格式外,Broker Load 还支持 CSV 文件格式和 JSON 文件格式 (从 v3.2.3 开始支持 JSON 文件格式)。 - -### 数据流 - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -1. 用户创建一个导入作业。 -2. 前端 (FE) 创建一个查询计划,并将该计划分发到后端节点 (BEs) 或计算节点 (CNs)。 -3. 
BE 或 CN 从源拉取数据,并将数据导入到 StarRocks 中。 - -### 典型示例 - -创建一个表,启动一个从 HDFS 拉取数据文件 `/user/amber/user_behavior_ten_million_rows.parquet` 的导入进程,并验证数据导入的进度和成功。 - -#### 创建数据库和表 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 HDFS 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### 启动 Broker Load - -运行以下命令以启动一个 Broker Load 作业,该作业将数据从数据文件 `/user/amber/user_behavior_ten_million_rows.parquet` 导入到 `user_behavior` 表: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("hdfs://:/user/amber/user_behavior_ten_million_rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER -( - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -) -PROPERTIES -( - "timeout" = "72000" -); -``` - -此作业有四个主要部分: - -- `LABEL`: 查询导入作业状态时使用的字符串。 -- `LOAD` 声明:源 URI、源数据格式和目标表名称。 -- `BROKER`: 源的连接详细信息。 -- `PROPERTIES`: 超时值和要应用于导入作业的任何其他属性。 - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 检查导入进度 - -您可以从 `information_schema.loads` 视图查询 Broker Load 作业的进度。此功能从 v3.1 开始支持。 - -```SQL -SELECT * FROM information_schema.loads; -``` - -有关 `loads` 视图中提供的字段的信息,请参见 [Information Schema](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与该作业关联的 `LABEL` 进行过滤。示例: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -在下面的输出中,`user_behavior` 导入作业有两个条目: - -- 第一个记录显示 `CANCELLED` 状态。滚动到 `ERROR_MSG`,您可以看到该作业由于 `listPath failed` 而失败。 -- 第二个记录显示 `FINISHED` 状态,这意味着该作业已成功。 - -```Plaintext -JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH| -------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+ - 10121|user_behavior |mydatabase |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | | - 10106|user_behavior |mydatabase |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All 
backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | | -``` - -确认导入作业已完成后,您可以检查目标表的子集,以查看数据是否已成功导入。示例: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -返回以下查询结果,表明数据已成功导入: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - -## 使用 Pipe - -从 v3.2 开始,StarRocks 提供了 Pipe 导入方法,目前仅支持 Parquet 和 ORC 文件格式。 - -### Pipe 的优势 - - - -Pipe 非常适合连续数据导入和大规模数据导入: - -- **微批次中的大规模数据导入有助于降低由数据错误引起的重试成本。** - - 借助 Pipe,StarRocks 能够高效地导入大量数据文件,这些文件总共有很大的数据量。Pipe 根据文件的数量或大小自动拆分文件,将导入作业分解为较小的顺序任务。这种方法确保一个文件中的错误不会影响整个导入作业。每个文件的导入状态由 Pipe 记录,使您可以轻松识别和修复包含错误的文件。通过最大限度地减少由于数据错误而需要重试的次数,这种方法有助于降低成本。 - -- **连续数据导入有助于减少人力。** - - Pipe 帮助您将新的或更新的数据文件写入特定位置,并不断地将这些文件中的新数据导入到 StarRocks 中。在使用 `"AUTO_INGEST" = "TRUE"` 指定创建 Pipe 作业后,它将不断监视存储在指定路径中的数据文件的更改,并自动将数据文件中的新的或更新的数据导入到目标 StarRocks 表中。 - -此外,Pipe 执行文件唯一性检查,以帮助防止重复数据导入。在导入过程中,Pipe 根据文件名和摘要检查每个数据文件的唯一性。如果具有特定文件名和摘要的文件已经由 Pipe 作业处理,则 Pipe 作业将跳过所有后续具有相同文件名和摘要的文件。请注意,HDFS 使用 `LastModifiedTime` 作为文件摘要。 - -每个数据文件的导入状态都会被记录并保存到 `information_schema.pipe_files` 视图中。在删除与该视图关联的 Pipe 作业后,有关在该作业中导入的文件的记录也将被删除。 - -### 数据流 - -![Pipe 数据流](../_assets/pipe_data_flow.png) - -### Pipe 和 INSERT+FILES() 之间的区别 - -Pipe 作业根据每个数据文件的大小和行数拆分为一个或多个事务。用户可以在导入过程中查询中间结果。相比之下,INSERT+`FILES()` 作业作为一个事务处理,用户无法在导入过程中查看数据。 - -### 文件导入顺序 - -对于每个 Pipe 作业,StarRocks 维护一个文件队列,从中提取数据文件并将其作为微批次导入。Pipe 不保证数据文件以与上传顺序相同的顺序导入。因此,较新的数据可能在较旧的数据之前导入。 - -### 典型示例 - -#### 创建数据库和表 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 HDFS 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior_replica -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp varbinary -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### 启动 Pipe 作业 - -运行以下命令以启动一个 Pipe 作业,该作业将数据从数据文件 `/user/amber/user_behavior_ten_million_rows.parquet` 导入到 `user_behavior_replica` 表: - -```SQL -CREATE PIPE user_behavior_replica -PROPERTIES -( - "AUTO_INGEST" = "TRUE" -) -AS -INSERT INTO user_behavior_replica -SELECT * FROM FILES -( - "path" = "hdfs://:/user/amber/user_behavior_ten_million_rows.parquet", - "format" = "parquet", - "hadoop.security.authentication" = "simple", - "username" = "", - "password" = "" -); -``` - -此作业有四个主要部分: - -- `pipe_name`: Pipe 的名称。Pipe 名称在 Pipe 所属的数据库中必须是唯一的。 -- `INSERT_SQL`: 用于将数据从指定的源数据文件导入到目标表的 INSERT INTO SELECT FROM FILES 语句。 -- `PROPERTIES`: 一组可选参数,用于指定如何执行 Pipe。这些参数包括 `AUTO_INGEST`、`POLL_INTERVAL`、`BATCH_SIZE` 和 `BATCH_FILES`。以 `"key" = "value"` 格式指定这些属性。 - -有关详细的语法和参数说明,请参见 [CREATE PIPE](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md)。 - -#### 检查导入进度 - -- 使用 [SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md) 查询 Pipe 作业的进度。 - - ```SQL - SHOW PIPES; - ``` - - 如果您提交了多个导入作业,则可以按与该作业关联的 `NAME` 进行过滤。示例: - - ```SQL - SHOW PIPES WHERE NAME = 'user_behavior_replica' \G - *************************** 1. 
row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10252 - PIPE_NAME: user_behavior_replica - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_replica - LOAD_STATUS: {"loadedFiles":1,"loadedBytes":132251298,"loadingFiles":0,"lastLoadedTime":"2023-11-17 16:13:22"} - LAST_ERROR: NULL - CREATED_TIME: 2023-11-17 16:13:15 - 1 row in set (0.00 sec) - ``` - -- 从 StarRocks Information Schema 中的 [`pipes`](../sql-reference/information_schema/pipes.md) 视图查询 Pipe 作业的进度。 - - ```SQL - SELECT * FROM information_schema.pipes; - ``` - - 如果您提交了多个导入作业,则可以按与该作业关联的 `PIPE_NAME` 进行过滤。示例: - - ```SQL - SELECT * FROM information_schema.pipes WHERE pipe_name = 'user_behavior_replica' \G - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10252 - PIPE_NAME: user_behavior_replica - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_replica - LOAD_STATUS: {"loadedFiles":1,"loadedBytes":132251298,"loadingFiles":0,"lastLoadedTime":"2023-11-17 16:13:22"} - LAST_ERROR: - CREATED_TIME: 2023-11-17 16:13:15 - 1 row in set (0.00 sec) - ``` - -#### 检查文件状态 - -您可以从 StarRocks Information Schema 中的 [`pipe_files`](../sql-reference/information_schema/pipe_files.md) 视图查询从导入的文件的导入状态。 - -```SQL -SELECT * FROM information_schema.pipe_files; -``` - -如果您提交了多个导入作业,则可以按与该作业关联的 `PIPE_NAME` 进行过滤。示例: - -```SQL -SELECT * FROM information_schema.pipe_files WHERE pipe_name = 'user_behavior_replica' \G -*************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10252 - PIPE_NAME: user_behavior_replica - FILE_NAME: hdfs://172.26.195.67:9000/user/amber/user_behavior_ten_million_rows.parquet - FILE_VERSION: 1700035418838 - FILE_SIZE: 132251298 - LAST_MODIFIED: 2023-11-15 08:03:38 - LOAD_STATE: FINISHED - STAGED_TIME: 2023-11-17 16:13:16 - START_LOAD_TIME: 2023-11-17 16:13:17 -FINISH_LOAD_TIME: 2023-11-17 16:13:22 - ERROR_MSG: -1 row in set (0.02 sec) -``` - -#### 管理 Pipes - -您可以更改、暂停或恢复、删除或查询您创建的 Pipes,并重试导入特定的数据文件。有关更多信息,请参见 [ALTER PIPE](../sql-reference/sql-statements/loading_unloading/pipe/ALTER_PIPE.md)、[SUSPEND or RESUME PIPE](../sql-reference/sql-statements/loading_unloading/pipe/SUSPEND_or_RESUME_PIPE.md)、[DROP PIPE](../sql-reference/sql-statements/loading_unloading/pipe/DROP_PIPE.md)、[SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md) 和 [RETRY FILE](../sql-reference/sql-statements/loading_unloading/pipe/RETRY_FILE.md)。 \ No newline at end of file diff --git a/docs/zh/loading/huawei.md b/docs/zh/loading/huawei.md deleted file mode 100644 index db92470..0000000 --- a/docs/zh/loading/huawei.md +++ /dev/null @@ -1 +0,0 @@ -unlisted: true \ No newline at end of file diff --git a/docs/zh/loading/load_concept/strict_mode.md b/docs/zh/loading/load_concept/strict_mode.md deleted file mode 100644 index c6949b9..0000000 --- a/docs/zh/loading/load_concept/strict_mode.md +++ /dev/null @@ -1,163 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# Strict mode - -Strict mode 是一个可选属性,您可以为数据导入配置它。它会影响导入行为和最终导入的数据。 - -本主题介绍 strict mode 是什么以及如何设置 strict mode。 - -## 了解 strict mode - -在数据导入期间,源列的数据类型可能与目标列的数据类型不完全一致。在这种情况下,StarRocks 会对数据类型不一致的源列值执行转换。由于各种问题,例如不匹配的字段数据类型和字段长度溢出,数据转换可能会失败。未能正确转换的源列值是不合格的列值,包含不合格列值的源行被称为“不合格行”。Strict mode 用于控制在数据导入期间是否过滤掉不合格的行。 - -Strict mode 的工作方式如下: - -- 如果启用了 strict mode,StarRocks 仅导入合格的行。它会过滤掉不合格的行,并返回有关不合格行的详细信息。 -- 如果禁用了 strict mode,StarRocks 会将不合格的列值转换为 `NULL`,并将包含这些 `NULL` 值的不合格行与合格行一起导入。 - -请注意以下几点: - -- 在实际业务场景中,合格行和不合格行都可能包含 `NULL` 值。如果目标列不允许 `NULL` 值,StarRocks 会报告错误并过滤掉包含 
`NULL` 值的行。 - -- 可以为导入作业过滤掉的不合格行的最大百分比由可选作业属性 `max_filter_ratio` 控制。 - -:::note - -从 v3.4.0 开始支持 INSERT 的 `max_filter_ratio` 属性。 - -::: - -例如,您想要将 CSV 格式的数据文件中的四行数据(分别包含 `\N`(`\N` 表示 `NULL` 值)、`abc`、`2000` 和 `1` 值)导入到 StarRocks 表的某一列中,并且目标 StarRocks 表列的数据类型为 TINYINT [-128, 127]。 - -- 源列值 `\N` 在转换为 TINYINT 时会被处理为 `NULL`。 - - > **NOTE** - > - > 无论目标数据类型如何,`\N` 在转换时始终会被处理为 `NULL`。 - -- 源列值 `abc` 会被处理为 `NULL`,因为其数据类型不是 TINYINT 并且转换失败。 - -- 源列值 `2000` 会被处理为 `NULL`,因为它超出了 TINYINT 支持的范围并且转换失败。 - -- 源列值 `1` 可以被正确转换为 TINYINT 类型的值 `1`。 - -如果禁用了 strict mode,StarRocks 会导入所有这四行数据。 - -如果启用了 strict mode,StarRocks 仅导入包含 `\N` 或 `1` 的行,并过滤掉包含 `abc` 或 `2000` 的行。过滤掉的行会计入 `max_filter_ratio` 参数指定的最大行百分比,这些行由于数据质量不足而被过滤掉。 - -### 禁用 strict mode 后最终导入的数据 - -| 源列值 | 转换为 TINYINT 后的列值 | 目标列允许 NULL 值时的导入结果 | 目标列不允许 NULL 值时的导入结果 | -| ------------------- | --------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------ | -| \N | NULL | 导入 `NULL` 值。 | 报告错误。 | -| abc | NULL | 导入 `NULL` 值。 | 报告错误。 | -| 2000 | NULL | 导入 `NULL` 值。 | 报告错误。 | -| 1 | 1 | 导入 `1` 值。 | 导入 `1` 值。 | - -### 启用 strict mode 后最终导入的数据 - -| 源列值 | 转换为 TINYINT 后的列值 | 目标列允许 NULL 值时的导入结果 | 目标列不允许 NULL 值时的导入结果 | -| ------------------- | --------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -| \N | NULL | 导入 `NULL` 值。 | 报告错误。 | -| abc | NULL | 不允许 `NULL` 值,因此被过滤掉。 | 报告错误。 | -| 2000 | NULL | 不允许 `NULL` 值,因此被过滤掉。 | 报告错误。 | -| 1 | 1 | 导入 `1` 值。 | 导入 `1` 值。 | - -## 设置 strict mode - -您可以使用 `strict_mode` 参数为导入作业设置 strict mode。有效值为 `true` 和 `false`。默认值为 `false`。值 `true` 启用 strict mode,值 `false` 禁用 strict mode。请注意,从 v3.4.0 开始支持 INSERT 的 `strict_mode` 参数,默认值为 `true`。现在,除了 Stream Load 之外,对于所有其他导入方法,`strict_mode` 在 PROPERTIES 子句中以相同的方式设置。 - -您还可以使用 `enable_insert_strict` 会话变量来设置 strict mode。有效值为 `true` 和 `false`。默认值为 `true`。值 `true` 启用 strict mode,值 `false` 禁用 strict mode。 - -:::note - -从 v3.4.0 开始,当 `enable_insert_strict` 设置为 `true` 时,系统仅导入合格的行。它会过滤掉不合格的行,并返回有关不合格行的详细信息。相反,在早于 v3.4.0 的版本中,当 `enable_insert_strict` 设置为 `true` 时,如果存在不合格的行,则 INSERT 作业会失败。 - -::: - -示例如下: - -### Stream Load - -```Bash -curl --location-trusted -u : \ - -H "strict_mode: {true | false}" \ - -T -XPUT \ - http://:/api///_stream_load -``` - -有关 Stream Load 的详细语法和参数,请参见 [STREAM LOAD](../../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)。 - -### Broker Load - -```SQL -LOAD LABEL [.] -( - DATA INFILE (""[, "" ...]) - INTO TABLE -) -WITH BROKER -( - "username" = "", - "password" = "" -) -PROPERTIES -( - "strict_mode" = "{true | false}" -) -``` - -上面的代码片段使用 HDFS 作为示例。有关 Broker Load 的详细语法和参数,请参见 [BROKER LOAD](../../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -### Routine Load - -```SQL -CREATE ROUTINE LOAD [.] ON -PROPERTIES -( - "strict_mode" = "{true | false}" -) -FROM KAFKA -( - "kafka_broker_list" =":[,:...]", - "kafka_topic" = "" -) -``` - -上面的代码片段使用 Apache Kafka® 作为示例。有关 Routine Load 的详细语法和参数,请参见 [CREATE ROUTINE LOAD](../../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -### Spark Load - -```SQL -LOAD LABEL [.] 
-( - DATA INFILE (""[, "" ...]) - INTO TABLE -) -WITH RESOURCE -( - "spark.executor.memory" = "3g", - "broker.username" = "", - "broker.password" = "" -) -PROPERTIES -( - "strict_mode" = "{true | false}" -) -``` - -上面的代码片段使用 HDFS 作为示例。有关 Spark Load 的详细语法和参数,请参见 [SPARK LOAD](../../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md)。 - -### INSERT - -```SQL -INSERT INTO [.] -PROPERTIES( - "strict_mode" = "{true | false}" -) - -``` - -有关 INSERT 的详细语法和参数,请参见 [INSERT](../../sql-reference/sql-statements/loading_unloading/INSERT.md)。 \ No newline at end of file diff --git a/docs/zh/loading/load_from_pulsar.md b/docs/zh/loading/load_from_pulsar.md deleted file mode 100644 index 009023b..0000000 --- a/docs/zh/loading/load_from_pulsar.md +++ /dev/null @@ -1,147 +0,0 @@ ---- -displayed_sidebar: docs ---- -import Experimental from '../_assets/commonMarkdown/_experimental.mdx' - -# 从 Apache® Pulsar™ 持续导入数据 - - - -从 StarRocks 2.5 版本开始,Routine Load 支持从 Apache® Pulsar™ 持续导入数据。Pulsar 是一个分布式的开源发布-订阅消息和流平台,采用存储计算分离架构。通过 Routine Load 从 Pulsar 导入数据类似于从 Apache Kafka 导入数据。本文以 CSV 格式的数据为例,介绍如何通过 Routine Load 从 Apache Pulsar 导入数据。 - -## 支持的数据文件格式 - -Routine Load 支持从 Pulsar 集群消费 CSV 和 JSON 格式的数据。 - -> NOTE -> -> 对于 CSV 格式的数据,StarRocks 支持将 UTF-8 编码的、长度不超过 50 字节的字符串用作列分隔符。常用的列分隔符包括逗号 (,)、Tab 和竖线 (|)。 - -## Pulsar 相关概念 - -**[Topic](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#topics)** - -Topic 是 Pulsar 中用于将消息从生产者传输到消费者的命名通道。Pulsar 中的 Topic 分为分区 Topic 和非分区 Topic。 - -- **[Partitioned topics](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#partitioned-topics)** 是一种特殊类型的 Topic,由多个 Broker 处理,从而实现更高的吞吐量。分区 Topic 实际上是作为 N 个内部 Topic 实现的,其中 N 是分区的数量。 -- **Non-partitioned topics** 是一种普通类型的 Topic,仅由单个 Broker 提供服务,这限制了 Topic 的最大吞吐量。 - -**[Message ID](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#messages)** - -消息的 Message ID 由 [BookKeeper instances](https://pulsar.apache.org/docs/2.10.x/concepts-architecture-overview/#apache-bookkeeper) 在消息被持久化存储后立即分配。Message ID 指示消息在账本中的特定位置,并且在 Pulsar 集群中是唯一的。 - -Pulsar 支持消费者通过 consumer.*seek*(*messageId*) 指定初始位置。但与 Kafka 消费者 offset(一个长整型值)相比,Message ID 由四个部分组成:`ledgerId:entryID:partition-index:batch-index`。 - -因此,您无法直接从消息中获取 Message ID。因此,目前,**Routine Load 不支持在从 Pulsar 导入数据时指定初始位置,而仅支持从分区的开头或结尾消费数据。** - -**[Subscription](https://pulsar.apache.org/docs/2.10.x/concepts-messaging/#subscriptions)** - -Subscription 是一种命名配置规则,用于确定如何将消息传递给消费者。Pulsar 还支持消费者同时订阅多个 Topic。一个 Topic 可以有多个 Subscription。 - -Subscription 的类型在消费者连接到它时定义,并且可以通过使用不同的配置重新启动所有消费者来更改类型。Pulsar 中有四种 Subscription 类型可用: - -- `exclusive` (默认)*:* 只允许单个消费者连接到 Subscription。只允许一个客户消费消息。 -- `shared`: 多个消费者可以连接到同一个 Subscription。消息以轮询方式在消费者之间分发,并且任何给定的消息仅传递给一个消费者。 -- `failover`: 多个消费者可以连接到同一个 Subscription。为非分区 Topic 或分区 Topic 的每个分区选择一个主消费者并接收消息。当主消费者断开连接时,所有(未确认的和后续的)消息都会传递给队列中的下一个消费者。 -- `key_shared`: 多个消费者可以连接到同一个 Subscription。消息在消费者之间分发,并且具有相同 key 或相同排序 key 的消息仅传递给一个消费者。 - -> Note: -> -> 目前 Routine Load 使用 exclusive 类型。 - -## 创建 Routine Load 作业 - -以下示例描述了如何消费 Pulsar 中 CSV 格式的消息,并通过创建 Routine Load 作业将数据加载到 StarRocks 中。有关详细说明和参考,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -```SQL -CREATE ROUTINE LOAD load_test.routine_wiki_edit_1 ON routine_wiki_edit -COLUMNS TERMINATED BY ",", -ROWS TERMINATED BY "\n", -COLUMNS (order_id, pay_dt, customer_name, nationality, temp_gender, price) -WHERE event_time > "2022-01-01 00:00:00", -PROPERTIES -( - "desired_concurrent_number" = "1", - "max_batch_interval" = "15000", - "max_error_number" = 
"1000" -) -FROM PULSAR -( - "pulsar_service_url" = "pulsar://localhost:6650", - "pulsar_topic" = "persistent://tenant/namespace/topic-name", - "pulsar_subscription" = "load-test", - "pulsar_partitions" = "load-partition-0,load-partition-1", - "pulsar_initial_positions" = "POSITION_EARLIEST,POSITION_LATEST", - "property.auth.token" = "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUJzdWIiOiJqaXV0aWFuY2hlbiJ9.lulGngOC72vE70OW54zcbyw7XdKSOxET94WT_hIqD5Y" -); -``` - -当创建 Routine Load 以从 Pulsar 消费数据时,除了 `data_source_properties` 之外,大多数输入参数与从 Kafka 消费数据时相同。有关除了 `data_source_properties` 之外的参数的描述,请参见 [CREATE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)。 - -与 `data_source_properties` 相关的参数及其描述如下: - -| **Parameter** | **Required** | **Description** | -| ------------------------------------------- | ------------ | ------------------------------------------------------------ | -| pulsar_service_url | Yes | 用于连接到 Pulsar 集群的 URL。格式:`"pulsar://ip:port"` 或 `"pulsar://service:port"`。示例:`"pulsar_service_url" = "pulsar://``localhost:6650``"` | -| pulsar_topic | Yes | 订阅的 Topic。示例:`"pulsar_topic" = "persistent://tenant/namespace/topic-name"` | -| pulsar_subscription | Yes | 为 Topic 配置的 Subscription。示例:`"pulsar_subscription" = "my_subscription"` | -| pulsar_partitions, pulsar_initial_positions | No | `pulsar_partitions` : Topic 中订阅的分区。`pulsar_initial_positions`: 由 `pulsar_partitions` 指定的分区的初始位置。初始位置必须与 `pulsar_partitions` 中的分区相对应。有效值:`POSITION_EARLIEST`(默认值):Subscription 从分区中最早可用的消息开始。`POSITION_LATEST`: Subscription 从分区中最新的可用消息开始。注意:如果未指定 `pulsar_partitions`,则订阅 Topic 的所有分区。如果同时指定了 `pulsar_partitions` 和 `property.pulsar_default_initial_position`,则 `pulsar_partitions` 值将覆盖 `property.pulsar_default_initial_position` 值。如果既未指定 `pulsar_partitions` 也未指定 `property.pulsar_default_initial_position`,则订阅从分区中最新的可用消息开始。示例:`"pulsar_partitions" = "my-partition-0,my-partition-1,my-partition-2,my-partition-3", "pulsar_initial_positions" = "POSITION_EARLIEST,POSITION_EARLIEST,POSITION_LATEST,POSITION_LATEST"` | - -Routine Load 支持以下 Pulsar 的自定义参数。 - -| Parameter | Required | Description | -| ---------------------------------------- | -------- | ------------------------------------------------------------ | -| property.pulsar_default_initial_position | No | 订阅 Topic 的分区时的默认初始位置。该参数在未指定 `pulsar_initial_positions` 时生效。其有效值与 `pulsar_initial_positions` 的有效值相同。示例:`"``property.pulsar_default_initial_position" = "POSITION_EARLIEST"` | -| property.auth.token | No | 如果 Pulsar 启用了使用安全令牌对客户端进行身份验证,则您需要令牌字符串来验证您的身份。示例:`"p``roperty.auth.token" = "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUJzdWIiOiJqaXV0aWFuY2hlbiJ9.lulGngOC72vE70OW54zcbyw7XdKSOxET94WT_hIqD"` | - -## 检查导入作业和任务 - -### 检查导入作业 - -执行 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句以检查导入作业 `routine_wiki_edit_1` 的状态。StarRocks 返回执行状态 `State`、统计信息(包括已消费的总行数和已加载的总行数)`Statistics` 以及导入作业的进度 `progress`。 - -当您检查从 Pulsar 消费数据的 Routine Load 作业时,除了 `progress` 之外,大多数返回的参数与从 Kafka 消费数据时相同。`progress` 指的是积压,即分区中未确认的消息数。 - -```Plaintext -MySQL [load_test] > SHOW ROUTINE LOAD for routine_wiki_edit_1 \G -*************************** 1. 
row *************************** - Id: 10142 - Name: routine_wiki_edit_1 - CreateTime: 2022-06-29 14:52:55 - PauseTime: 2022-06-29 17:33:53 - EndTime: NULL - DbName: default_cluster:test_pulsar - TableName: test1 - State: PAUSED - DataSourceType: PULSAR - CurrentTaskNum: 0 - JobProperties: {"partitions":"*","rowDelimiter":"'\n'","partial_update":"false","columnToColumnExpr":"*","maxBatchIntervalS":"10","whereExpr":"*","timezone":"Asia/Shanghai","format":"csv","columnSeparator":"','","json_root":"","strict_mode":"false","jsonpaths":"","desireTaskConcurrentNum":"3","maxErrorNum":"10","strip_outer_array":"false","currentTaskConcurrentNum":"0","maxBatchRows":"200000"} -DataSourceProperties: {"serviceUrl":"pulsar://localhost:6650","currentPulsarPartitions":"my-partition-0,my-partition-1","topic":"persistent://tenant/namespace/topic-name","subscription":"load-test"} - CustomProperties: {"auth.token":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUJzdWIiOiJqaXV0aWFuY2hlbiJ9.lulGngOC72vE70OW54zcbyw7XdKSOxET94WT_hIqD"} - Statistic: {"receivedBytes":5480943882,"errorRows":0,"committedTaskNum":696,"loadedRows":66243440,"loadRowsRate":29000,"abortedTaskNum":0,"totalRows":66243440,"unselectedRows":0,"receivedBytesRate":2400000,"taskExecuteTimeMs":2283166} - Progress: {"my-partition-0(backlog): 100","my-partition-1(backlog): 0"} -ReasonOfStateChanged: - ErrorLogUrls: - OtherMsg: -1 row in set (0.00 sec) -``` - -### 检查导入任务 - -执行 [SHOW ROUTINE LOAD TASK](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md) 语句以检查导入作业 `routine_wiki_edit_1` 的导入任务,例如有多少任务正在运行、正在消费的 Kafka Topic 分区以及消费进度 `DataSourceProperties`,以及相应的 Coordinator BE 节点 `BeId`。 - -```SQL -MySQL [example_db]> SHOW ROUTINE LOAD TASK WHERE JobName = "routine_wiki_edit_1" \G -``` - -## 修改导入作业 - -在修改导入作业之前,必须使用 [PAUSE ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/PAUSE_ROUTINE_LOAD.md) 语句暂停它。然后,您可以执行 [ALTER ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/ALTER_ROUTINE_LOAD.md)。修改后,您可以执行 [RESUME ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/RESUME_ROUTINE_LOAD.md) 语句以恢复它,并使用 [SHOW ROUTINE LOAD](../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md) 语句检查其状态。 - -当 Routine Load 用于从 Pulsar 消费数据时,除了 `data_source_properties` 之外,大多数返回的参数与从 Kafka 消费数据时相同。 - -**请注意以下几点**: - -- 在与 `data_source_properties` 相关的参数中,目前仅支持修改 `pulsar_partitions`、`pulsar_initial_positions` 和自定义 Pulsar 参数 `property.pulsar_default_initial_position` 和 `property.auth.token`。参数 `pulsar_service_url`、`pulsar_topic` 和 `pulsar_subscription` 无法修改。 -- 如果您需要修改要消费的分区和匹配的初始位置,则需要确保在创建 Routine Load 作业时使用 `pulsar_partitions` 指定分区,并且只能修改指定分区的初始位置 `pulsar_initial_positions`。 -- 如果您在创建 Routine Load 作业时仅指定 Topic `pulsar_topic`,而不指定分区 `pulsar_partitions`,则可以通过 `pulsar_default_initial_position` 修改 Topic 下所有分区的起始位置。 \ No newline at end of file diff --git a/docs/zh/loading/loading.mdx b/docs/zh/loading/loading.mdx deleted file mode 100644 index 6a837cd..0000000 --- a/docs/zh/loading/loading.mdx +++ /dev/null @@ -1,9 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 数据导入 - -import DocCardList from '@theme/DocCardList'; - - \ No newline at end of file diff --git a/docs/zh/loading/loading_introduction/feature-support-loading-and-unloading.md b/docs/zh/loading/loading_introduction/feature-support-loading-and-unloading.md deleted file mode 100644 index d8320e9..0000000 --- a/docs/zh/loading/loading_introduction/feature-support-loading-and-unloading.md +++ /dev/null @@ -1,632 +0,0 
@@ ---- -displayed_sidebar: docs -sidebar_label: "Feature Support" ---- - -# 功能支持:数据导入和导出 - -本文档概述了 StarRocks 支持的各种数据导入和导出方法的功能。 - -## 文件格式 - -### 导入文件格式 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Data SourceFile Format
CSVJSON [3]ParquetORCAvroProtoBufThrift
Stream LoadLocal file systems, applications, connectorsYesYesTo be supportedTo be supportedTo be supported
INSERT from FILESHDFS, S3, OSS, Azure, GCS, NFS(NAS) [5]Yes (v3.3+)To be supportedYes (v3.1+)Yes (v3.1+)Yes (v3.4.4+)To be supported
Broker LoadYesYes (v3.2.3+)YesYesTo be supported
Routine LoadKafkaYesYesTo be supportedTo be supportedYes (v3.0+) [1]To be supportedTo be supported
Spark LoadYesTo be supportedYesYesTo be supported
ConnectorsFlink, SparkYesYesTo be supportedTo be supportedTo be supported
Kafka Connector [2]KafkaYes (v3.0+)To be supportedTo be supportedYes (v3.0+)To be supported
PIPE [4]Consistent with INSERT from FILES
- -:::note - -[1], [2]\: Schema Registry is required. - -[3]\: JSON 支持多种 CDC 格式。有关 StarRocks 支持的 JSON CDC 格式的详细信息,请参见 [JSON CDC format](#json-cdc-formats)。 - -[4]\: 目前,仅 INSERT from FILES 支持使用 PIPE 进行导入。 - -[5]\: 您需要将 NAS 设备作为 NFS 挂载在每个 BE 或 CN 节点的相同目录下,才能通过 `file://` 协议访问 NFS 中的文件。 - -::: - -#### JSON CDC formats - - - - - - - - - - - - - - - - - - - - - - - - - -
| | Stream Load | Routine Load | Broker Load | INSERT from FILES | Kafka Connector [1] |
| :-- | :-- | :-- | :-- | :-- | :-- |
| Debezium | To be supported | To be supported | To be supported | To be supported | Yes (v3.0+) |
| Canal | To be supported | To be supported | To be supported | To be supported | To be supported |
| Maxwell | To be supported | To be supported | To be supported | To be supported | To be supported |
- -:::note - -[1]\: 将 Debezium CDC 格式数据导入到 StarRocks 的主键表时,必须配置 `transforms` 参数。 - -::: - -### 导出文件格式 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Unloading method | Table format | Remote storage | CSV | JSON | Parquet | ORC |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| INSERT INTO FILES | N/A | HDFS, S3, OSS, Azure, GCS, NFS(NAS) [3] | Yes (v3.3+) | To be supported | Yes (v3.2+) | Yes (v3.3+) |
| INSERT INTO Catalog | Hive | HDFS, S3, OSS, Azure, GCS | Yes (v3.3+) | To be supported | Yes (v3.2+) | Yes (v3.3+) |
| INSERT INTO Catalog | Iceberg | HDFS, S3, OSS, Azure, GCS | To be supported | To be supported | Yes (v3.2+) | To be supported |
| INSERT INTO Catalog | Hudi/Delta | To be supported | To be supported | To be supported | To be supported | To be supported |
| EXPORT | N/A | HDFS, S3, OSS, Azure, GCS | Yes [1] | To be supported | To be supported | To be supported |
| PIPE | To be supported [2] | To be supported [2] | To be supported [2] | To be supported [2] | To be supported [2] | To be supported [2] |
- -:::note - -[1]\: Configuring Broker process is supported. - -[2]\: Currently, unloading data using PIPE is not supported. - -[3]\: 您需要将 NAS 设备作为 NFS 挂载在每个 BE 或 CN 节点的相同目录下,才能通过 `file://` 协议访问 NFS 中的文件。 - -::: - -## 文件格式相关参数 - -### 导入文件格式相关参数 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
File formatParameterLoading method
Stream LoadINSERT from FILESBroker LoadRoutine LoadSpark Load
CSVcolumn_separatorYesYes (v3.3+)Yes [1]
row_delimiterYesYes [2] (v3.1+)Yes [3] (v2.2+)To be supported
encloseYes (v3.0+)Yes (v3.0+)Yes (v3.0+)To be supported
escape
skip_headerTo be supported
trim_spaceYes (v3.0+)
JSONjsonpathsYesTo be supportedYes (v3.2.3+)YesTo be supported
strip_outer_array
json_root
ignore_json_sizeTo be supported
- -:::note - -[1]\: 对应的参数是 `COLUMNS TERMINATED BY`。 - -[2]\: 对应的参数是 `ROWS TERMINATED BY`。 - -[3]\: 对应的参数是 `ROWS TERMINATED BY`。 - -::: - -### 导出文件格式相关参数 - - - - - - - - - - - - - - - - - - - - -
| File format | Parameter | INSERT INTO FILES | EXPORT |
| :-- | :-- | :-- | :-- |
| CSV | column_separator | Yes (v3.3+) | Yes |
| CSV | line_delimiter [1] | Yes (v3.3+) | Yes |
- -:::note - -[1]\: 数据导入中对应的参数是 `row_delimiter`。 - -::: - -## 压缩格式 - -### 导入压缩格式 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
File formatCompression formatLoading method
Stream LoadBroker LoadINSERT from FILESRoutine LoadSpark Load
CSV -
    -
  • deflate
  • -
  • bzip2
  • -
  • gzip
  • -
  • lz4_frame
  • -
  • zstd
  • -
-
Yes [1]Yes [2]To be supportedTo be supportedTo be supported
JSONYes (v3.2.7+) [3]To be supportedN/ATo be supportedN/A
Parquet -
    -
  • gzip
  • -
  • lz4
  • -
  • snappy
  • -
  • zlib
  • -
  • zstd
  • -
-
N/AYes [4]To be supportedYes [4]
ORC
- -:::note - -[1]\: 目前,仅当使用 Stream Load 导入 CSV 文件时,才能使用 `format=gzip` 指定压缩格式,表示 gzip 压缩的 CSV 文件。`deflate` 和 `bzip2` 格式也支持。 - -[2]\: Broker Load 不支持使用参数 `format` 指定 CSV 文件的压缩格式。Broker Load 通过文件的后缀名来识别压缩格式。gzip 压缩文件的后缀名为 `.gz`,zstd 压缩文件的后缀名为 `.zst`。此外,不支持其他与 `format` 相关的参数,例如 `trim_space` 和 `enclose`。 - -[3]\: 支持使用 `compression = gzip` 指定压缩格式。 - -[4]\: 由 Arrow Library 支持。您无需配置 `compression` 参数。 - -::: - -### 导出压缩格式 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| File format | Compression format | INSERT INTO FILES | INSERT INTO Catalog (Hive) | INSERT INTO Catalog (Iceberg) | INSERT INTO Catalog (Hudi/Delta) | EXPORT |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| CSV | deflate, bzip2, gzip, lz4_frame, zstd | To be supported | To be supported | To be supported | To be supported | To be supported |
| JSON | N/A | N/A | N/A | N/A | N/A | N/A |
| Parquet | gzip, lz4, snappy, zstd | Yes (v3.2+) | Yes (v3.2+) | Yes (v3.2+) | To be supported | N/A |
| ORC | gzip, lz4, snappy, zstd | Yes (v3.2+) | Yes (v3.2+) | Yes (v3.2+) | To be supported | N/A |
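
例如,下面是一个通过 INSERT INTO FILES 以 zstd 压缩的 Parquet 格式导出数据的示意性示例。示例中的路径、认证信息和表名均为假设的占位符,`compression` 等属性的取值请以 FILES() 的文档为准:

```SQL
INSERT INTO FILES
(
    "path" = "s3://mybucket/unload/",             -- 假设的目标路径
    "format" = "parquet",
    "compression" = "zstd",                       -- 导出文件使用 zstd 压缩
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", -- 假设的访问凭据
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBB",
    "aws.s3.region" = "us-west-2"
)
SELECT * FROM sales_records;                      -- 假设的源表
```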
- -## 凭据 - -### 导入 - 身份验证 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
AuthenticationLoading method
Stream LoadINSERT from FILESBroker LoadRoutine LoadExternal Catalog
Single KerberosN/AYes (v3.1+)Yes [1] (versions earlier than v2.5)Yes [2] (v3.1.4+)Yes
Kerberos Ticket Granting Ticket (TGT)N/ATo be supportedYes (v3.1.10+/v3.2.1+)
Single KDC Multiple KerberosN/A
Basic access authentications (Access Key pair, IAM Role)N/AYes (HDFS and S3-compatible object storage)Yes [3]Yes
- -:::note - -[1]\: 对于 HDFS,StarRocks 支持简单身份验证和 Kerberos 身份验证。 - -[2]\: 当安全协议设置为 `sasl_plaintext` 或 `sasl_ssl` 时,支持 SASL 和 GSSAPI (Kerberos) 身份验证。 - -[3]\: 当安全协议设置为 `sasl_plaintext` 或 `sasl_ssl` 时,支持 SASL 和 PLAIN 身份验证。 - -::: - -### 导出 - 身份验证 - -| | INSERT INTO FILES | EXPORT | -| :-------------- | :----------------: | :-------------: | -| Single Kerberos | To be supported | To be supported | - -## 导入 - 其他参数和功能 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Parameter and feature | Stream Load | INSERT from FILES | INSERT from SELECT/VALUES | Broker Load | PIPE | Routine Load | Spark Load |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| partial_update | Yes (v3.0+) | Yes [1] (v3.3+) | Yes [1] (v3.3+) | Yes (v3.0+) | N/A | Yes (v3.0+) | To be supported |
| partial_update_mode | Yes (v3.1+) | To be supported | To be supported | Yes (v3.1+) | N/A | To be supported | To be supported |
| COLUMNS FROM PATH | N/A | Yes (v3.2+) | N/A | Yes | N/A | N/A | Yes |
| timezone or session variable time_zone [2] | Yes [3] | Yes [4] | Yes [4] | Yes [4] | To be supported | Yes [4] | To be supported |
| Time accuracy - Microsecond | Yes | Yes | Yes | Yes (v3.1.11+/v3.2.6+) | To be supported | Yes | Yes |
- -:::note - -[1]\: 从 v3.3 开始,StarRocks 支持在行模式下通过指定列列表对 INSERT INTO 进行部分更新。 - -[2]\: 通过参数或会话变量设置时区会影响 strftime()、alignment_timestamp() 和 from_unixtime() 等函数返回的结果。 - -[3]\: 仅支持参数 `timezone`。 - -[4]\: 仅支持会话变量 `time_zone`。 - -::: - -## 导出 - 其他参数和功能 - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Parameter and feature | INSERT INTO FILES | EXPORT |
| :-- | :-- | :-- |
| target_max_file_size | Yes (v3.2+) | To be supported |
| single | Yes (v3.2+) | To be supported |
| Partitioned_by | Yes (v3.2+) | To be supported |
| Session variable time_zone | To be supported | To be supported |
| Time accuracy - Microsecond | To be supported | To be supported |
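
作为参考,下面给出一个使用上述导出参数的示意性示例。示例中的路径、认证信息、表名和列名均为假设的占位符,`partition_by`、`single`、`target_max_file_size` 等属性名及取值请以 FILES() 的文档为准:

```SQL
INSERT INTO FILES
(
    "path" = "s3://mybucket/unload/",             -- 假设的目标路径
    "format" = "parquet",
    "partition_by" = "dt",                        -- 按列 dt 将数据写入不同子路径(对应表中的 Partitioned_by)
    "target_max_file_size" = "1073741824",        -- 单个文件上限约 1 GB
    -- 如需导出为单个文件,可改用 "single" = "true"
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", -- 假设的访问凭据
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBB",
    "aws.s3.region" = "us-west-2"
)
SELECT * FROM sales_records;                      -- 假设的源表
```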
\ No newline at end of file diff --git a/docs/zh/loading/loading_introduction/loading_concepts.md b/docs/zh/loading/loading_introduction/loading_concepts.md deleted file mode 100644 index 0bbd53d..0000000 --- a/docs/zh/loading/loading_introduction/loading_concepts.md +++ /dev/null @@ -1,146 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 ---- - -# 数据导入概念 - -import InsertPrivNote from '../../_assets/commonMarkdown/insertPrivNote.mdx' - -本文介绍数据导入相关的常用概念和信息。 - -## 权限 - - - -## Label - -您可以通过运行数据导入作业将数据导入到 StarRocks 中。每个数据导入作业都有一个唯一的 label。该 label 由用户指定或由 StarRocks 自动生成,用于标识该作业。每个 label 只能用于一个数据导入作业。数据导入作业完成后,其 label 不能用于任何其他数据导入作业。只有失败的数据导入作业的 label 才能被重复使用。 - -## 原子性 - -StarRocks 提供的所有数据导入方法都保证原子性。原子性是指一个数据导入作业中,所有符合条件的数据必须全部成功导入,或者全部不成功导入。不会出现部分符合条件的数据被导入,而其他数据没有被导入的情况。请注意,符合条件的数据不包括由于数据质量问题(如数据类型转换错误)而被过滤掉的数据。 - -## 协议 - -StarRocks 支持两种可用于提交数据导入作业的通信协议:MySQL 和 HTTP。在 StarRocks 支持的所有数据导入方法中,只有 Stream Load 使用 HTTP,而所有其他方法都使用 MySQL。 - -## 数据类型 - -StarRocks 支持导入所有数据类型的数据。您只需要注意一些特定数据类型导入的限制。更多信息,请参见 [数据类型](../../sql-reference/data-types/README.md)。 - -## 严格模式 - -严格模式是您可以为数据导入配置的可选属性。它会影响数据导入的行为和最终导入的数据。详情请参见 [严格模式](../load_concept/strict_mode.md)。 - -## 导入模式 - -StarRocks 支持两种数据导入模式:同步导入模式和异步导入模式。 - -:::note - -如果您使用外部程序导入数据,您必须在选择数据导入方法之前,选择最适合您业务需求的导入模式。 - -::: - -### 同步导入 - -在同步导入模式下,提交数据导入作业后,StarRocks 会同步运行该作业以导入数据,并在作业完成后返回作业结果。您可以根据作业结果检查作业是否成功。 - -StarRocks 提供了两种支持同步导入的数据导入方法:[Stream Load](../StreamLoad.md) 和 [INSERT](../InsertInto.md)。 - -同步导入的流程如下: - -1. 创建一个数据导入作业。 - -2. 查看 StarRocks 返回的作业结果。 - -3. 根据作业结果检查作业是否成功。如果作业结果表明导入失败,您可以重试该作业。 - -### 异步导入 - -在异步导入模式下,提交数据导入作业后,StarRocks 会立即返回作业创建结果。 - -- 如果结果表明作业创建成功,StarRocks 会异步运行该作业。但这并不意味着数据已成功导入。您必须使用语句或命令来检查作业的状态。然后,您可以根据作业状态确定数据是否已成功导入。 - -- 如果结果表明作业创建失败,您可以根据失败信息确定是否需要重试该作业。 - -:::tip - -您可以为表设置不同的写入仲裁,即在 StarRocks 确定数据导入任务成功之前,需要有多少个副本返回数据导入成功。您可以通过在 [CREATE TABLE](../../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE.md) 时添加属性 `write_quorum` 来指定写入仲裁,或者使用 [ALTER TABLE](../../sql-reference/sql-statements/table_bucket_part_index/ALTER_TABLE.md) 将此属性添加到现有表中。 - -::: - -StarRocks 提供了四种支持异步导入的数据导入方法:[Broker Load](../../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)、[Pipe](../../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md)、[Routine Load](../../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md) 和 [Spark Load](../../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md)。 - -异步导入的流程如下: - -1. 创建一个数据导入作业。 - -2. 查看 StarRocks 返回的作业创建结果,并确定作业是否成功创建。 - - - 如果作业创建成功,请转到步骤 3。 - - - 如果作业创建失败,请返回到步骤 1。 - -3. 使用语句或命令检查作业的状态,直到作业状态显示为 **FINISHED** 或 **CANCELLED**。 - -#### Broker Load 或 Spark Load 的工作流程 - -Broker Load 或 Spark Load 作业的工作流程包括五个阶段,如下图所示。 - -![Broker Load or Spark Load overflow](../../_assets/4.1-1.png) - -工作流程描述如下: - -1. **PENDING** - - 作业在队列中等待 FE 调度。 - -2. **ETL** - - FE 预处理数据,包括数据清洗、分区、排序和聚合。 - - 只有 Spark Load 作业有 ETL 阶段。Broker Load 作业跳过此阶段。 - -3. **LOADING** - - FE 清洗和转换数据,然后将数据发送到 BE 或 CN。加载完所有数据后,数据在队列中等待生效。此时,作业的状态保持为 **LOADING**。 - -4. **FINISHED** - - 当数据导入完成并且所有涉及的数据生效时,作业的状态变为 **FINISHED**。此时,可以查询数据。**FINISHED** 是最终作业状态。 - -5. **CANCELLED** - - 在作业的状态变为 **FINISHED** 之前,您可以随时取消该作业。此外,如果发生数据导入错误,StarRocks 可以自动取消该作业。作业取消后,作业的状态变为 **CANCELLED**,并且在取消之前所做的所有数据更新都将被还原。**CANCELLED** 也是最终作业状态。 - -#### Pipe 的工作流程 - -Pipe 作业的工作流程描述如下: - -1. 从 MySQL 客户端将作业提交到 FE。 - -2. FE 根据指定路径中存储的数据文件的数量或大小拆分数据文件,将作业分解为更小的顺序任务。任务进入队列,等待调度,创建后。 - -3. FE 从队列中获取任务,并调用 INSERT INTO SELECT FROM FILES 语句来执行每个任务。 - -4. 
数据导入完成: - - - 如果在作业创建时为作业指定了 `"AUTO_INGEST" = "FALSE"`,则在加载完指定路径中存储的所有数据文件的数据后,作业将完成。 - - - 如果在作业创建时为作业指定了 `"AUTO_INGEST" = "TRUE"`,则 FE 将继续监视数据文件的更改,并自动将数据文件中的新数据或更新数据加载到目标 StarRocks 表中。 - -#### Routine Load 的工作流程 - -Routine Load 作业的工作流程描述如下: - -1. 从 MySQL 客户端将作业提交到 FE。 - -2. FE 将作业拆分为多个任务。每个任务都设计为从多个分区加载数据。 - -3. FE 将任务分发到指定的 BE 或 CN。 - -4. BE 或 CN 执行任务,并在完成任务后向 FE 报告。 - -5. FE 生成后续任务,重试失败的任务(如果有),或根据 BE 的报告暂停任务调度。 \ No newline at end of file diff --git a/docs/zh/loading/loading_introduction/loading_considerations.md b/docs/zh/loading/loading_introduction/loading_considerations.md deleted file mode 100644 index 022aa34..0000000 --- a/docs/zh/loading/loading_introduction/loading_considerations.md +++ /dev/null @@ -1,73 +0,0 @@ -### 注意事项 - -本文档介绍在运行数据导入作业之前需要考虑的一些系统限制和配置。 - -## 内存限制 - -StarRocks 提供了参数来限制每个导入作业的内存使用量,从而减少内存消耗,尤其是在高并发场景下。但是,不要指定过低的内存使用量限制。如果内存使用量限制过低,则可能因为导入作业的内存使用量达到指定限制,导致数据频繁地从内存刷新到磁盘。建议您根据您的业务场景指定合适的内存使用量限制。 - -用于限制内存使用量的参数因每种导入方式而异。更多信息,请参见 [Stream Load](../../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md)、[Broker Load](../../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)、[Routine Load](../../sql-reference/sql-statements/loading_unloading/routine_load/CREATE_ROUTINE_LOAD.md)、[Spark Load](../../sql-reference/sql-statements/loading_unloading/SPARK_LOAD.md) 和 [INSERT](../../sql-reference/sql-statements/loading_unloading/INSERT.md)。请注意,一个导入作业通常在多个 BE 或 CN 上运行。因此,这些参数限制的是每个导入作业在每个涉及的 BE 或 CN 上的内存使用量,而不是导入作业在所有涉及的 BE 或 CN 上的总内存使用量。 - -StarRocks 还提供了参数来限制每个 BE 或 CN 上运行的所有导入作业的总内存使用量。更多信息,请参见下面的“[系统配置](#system-configurations)”部分。 - -## 系统配置 - -本节介绍一些适用于 StarRocks 提供的所有导入方法的参数配置。 - -### FE 配置 - -您可以在每个 FE 的配置文件 **fe.conf** 中配置以下参数: - -- `max_load_timeout_second` 和 `min_load_timeout_second` - - 这些参数指定每个导入作业的最大超时时间和最小超时时间。超时时间以秒为单位。默认的最大超时时间为 3 天,默认的最小超时时间为 1 秒。您指定的最大超时时间和最小超时时间必须在 1 秒到 3 天的范围内。这些参数对同步导入作业和异步导入作业都有效。 - -- `desired_max_waiting_jobs` - - 此参数指定队列中可以等待的最大导入作业数。默认值为 **1024**(v2.4 及更早版本中为 100,v2.5 及更高版本中为 1024)。当 FE 上处于 **PENDING** 状态的导入作业数达到您指定的最大数量时,FE 将拒绝新的导入请求。此参数仅对异步导入作业有效。 - -- `max_running_txn_num_per_db` - - 此参数指定您的 StarRocks 集群的每个数据库中允许的最大并发导入事务数。一个导入作业可以包含一个或多个事务。默认值为 **100**。当数据库中运行的导入事务数达到您指定的最大数量时,您提交的后续导入作业将不会被调度。在这种情况下,如果您提交的是同步导入作业,则该作业将被拒绝。如果您提交的是异步导入作业,则该作业将在队列中等待。 - - :::note - - StarRocks 将所有导入作业一起计数,不区分同步导入作业和异步导入作业。 - - ::: - -- `label_keep_max_second` - - 此参数指定已完成且处于 **FINISHED** 或 **CANCELLED** 状态的导入作业的历史记录的保留期限。默认保留期限为 3 天。此参数对同步导入作业和异步导入作业都有效。 - -### BE/CN 配置 - -您可以在每个 BE 的配置文件 **be.conf** 或每个 CN 的配置文件 **cn.conf** 中配置以下参数: - -- `write_buffer_size` - - 此参数指定最大内存块大小。默认大小为 100 MB。导入的数据首先写入 BE 或 CN 上的内存块。当导入的数据量达到您指定的最大内存块大小时,数据将刷新到磁盘。您必须根据您的业务场景指定合适的内存块大小。 - - - 如果最大内存块大小过小,则可能在 BE 或 CN 上生成大量小文件。在这种情况下,[查询性能] 会下降。您可以增加最大内存块大小以减少生成的文件数。 - - 如果最大内存块大小过大,则远程过程调用 (RPC) 可能会超时。在这种情况下,您可以根据您的业务需求调整此参数的值。 - -- `streaming_load_rpc_max_alive_time_sec` - - 每个 Writer 进程的等待超时时间。默认值为 1200 秒。在数据导入过程中,StarRocks 启动一个 Writer 进程来接收数据并将数据写入每个 tablet。如果在您指定的等待超时时间内,Writer 进程没有收到任何数据,StarRocks 将停止该 Writer 进程。当您的 StarRocks 集群以低速处理数据时,Writer 进程可能在很长一段时间内没有收到下一批数据,因此报告“TabletWriter add batch with unknown id”错误。在这种情况下,您可以增加此参数的值。 - -- `load_process_max_memory_limit_bytes` 和 `load_process_max_memory_limit_percent` - - 这些参数指定每个 BE 或 CN 上所有导入作业可以消耗的最大内存量。StarRocks 将这两个参数值中较小的内存消耗量确定为允许的最终内存消耗量。 - - - `load_process_max_memory_limit_bytes`: 指定最大内存大小。默认最大内存大小为 100 GB。 - - `load_process_max_memory_limit_percent`: 指定最大内存使用率。默认值为 30%。此参数与 `mem_limit` 参数不同。`mem_limit` 参数指定您的 StarRocks 集群的总最大内存使用量,默认值为 90% x 90%。 - - 如果 BE 或 CN 所在的机器的内存容量为 
M,则可以为导入作业消耗的最大内存量计算如下:`M x 90% x 90% x 30%`。 - -### 系统变量配置 - -您可以配置以下 [系统变量](../../sql-reference/System_variable.md): - -- `insert_timeout` - - INSERT 超时时间。单位:秒。取值范围:`1` 到 `259200`。默认值:`14400`。此变量将作用于当前连接中所有涉及 INSERT 作业的操作(例如,UPDATE、DELETE、CTAS、物化视图刷新、统计信息收集和 PIPE)。 \ No newline at end of file diff --git a/docs/zh/loading/loading_introduction/loading_overview.mdx b/docs/zh/loading/loading_introduction/loading_overview.mdx deleted file mode 100644 index 551ea6c..0000000 --- a/docs/zh/loading/loading_introduction/loading_overview.mdx +++ /dev/null @@ -1,9 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 数据导入概览 - -import DocCardList from '@theme/DocCardList'; - - \ No newline at end of file diff --git a/docs/zh/loading/loading_introduction/troubleshooting_loading.md b/docs/zh/loading/loading_introduction/troubleshooting_loading.md deleted file mode 100644 index 93d173b..0000000 --- a/docs/zh/loading/loading_introduction/troubleshooting_loading.md +++ /dev/null @@ -1,44 +0,0 @@ ---- -displayed_sidebar: docs -sidebar_label: "Troubleshooting" ---- - -# 数据导入问题排查 - -本文档旨在帮助DBA和运维工程师通过SQL界面监控数据导入作业的状态,而无需依赖外部监控系统。同时,本文档还提供了在导入操作期间识别性能瓶颈和排除异常情况的指导。 - -## 术语 - -**导入作业 (Load Job):** 连续的数据导入过程,例如 **Routine Load Job** 或 **Pipe Job**。 - -**导入任务 (Load Task):** 一次性的数据导入过程,通常对应于单个导入事务。例如 **Broker Load**、**Stream Load**、**Spark Load** 和 **INSERT INTO**。Routine Load 作业和 Pipe 作业会持续生成任务以执行数据摄取。 - -## 观察导入作业 - -有两种方法可以观察导入作业: - -- 使用 SQL 语句 **[SHOW ROUTINE LOAD](../../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD.md)** 和 **[SHOW PIPES](../../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md)**。 -- 使用系统视图 **[information_schema.routine_load_jobs](../../sql-reference/information_schema/routine_load_jobs.md)** 和 **[information_schema.pipes](../../sql-reference/information_schema/pipes.md)**。 - -## 观察导入任务 - -也可以通过两种方式监控导入任务: - -- 使用 SQL 语句 **[SHOW LOAD](../../sql-reference/sql-statements/loading_unloading/SHOW_LOAD.md)** 和 **[SHOW ROUTINE LOAD TASK](../../sql-reference/sql-statements/loading_unloading/routine_load/SHOW_ROUTINE_LOAD_TASK.md)**。 -- 使用系统视图 **[information_schema.loads](../../sql-reference/information_schema/loads.md)** 和 statistics.loads_history。 - -### SQL 语句 - -**SHOW** 语句显示当前数据库正在进行和最近完成的导入任务,从而快速了解任务状态。检索到的信息是 **statistics.loads_history** 系统视图的子集。 - -SHOW LOAD 语句返回 Broker Load、Insert Into 和 Spark Load 任务的信息,SHOW ROUTINE LOAD TASK 语句返回 Routine Load 任务的信息。 - -### 系统视图 - -#### information_schema.loads - -**information_schema.loads** 系统视图存储最近的导入任务的信息,包括正在进行和最近完成的任务。StarRocks 定期将数据同步到 **statistics.loads_history** 系统表以进行持久存储。 - -**information_schema.loads** 提供以下字段: - -| 字段 (Field) | 描述 (Description) \ No newline at end of file diff --git a/docs/zh/loading/loading_tools.md b/docs/zh/loading/loading_tools.md deleted file mode 100644 index 8613965..0000000 --- a/docs/zh/loading/loading_tools.md +++ /dev/null @@ -1,32 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# 使用工具导入数据 - -StarRocks 及其生态系统合作伙伴提供以下工具,以帮助您将 StarRocks 与外部数据库无缝集成。 - -## [SMT](../integrations/loading_tools/SMT.md) - -SMT (StarRocks Migration Tool) 是 StarRocks 提供的数据迁移工具,旨在优化复杂的数据导入管道:源数据库(如 MySQL、Oracle、PostgreSQL)---> Flink ---> 目标 StarRocks 集群。其主要功能如下: - -- 简化 StarRocks 中的表创建:基于外部数据库和目标 StarRocks 集群的信息,生成在 StarRocks 中创建表的语句。 -- 简化数据管道中的全量或增量数据同步过程:生成可在 Flink 的 SQL 客户端中运行的 SQL 语句,以提交 Flink 作业来同步数据。 - -下图说明了通过 Flink 将数据从源数据库 MySQL 导入到 StarRocks 的过程。 - -![img](../_assets/load_tools.png) - -## [DataX](../integrations/loading_tools/DataX-starrocks-writer.md) - -DataX 是一款离线数据同步工具,由阿里巴巴开源。 DataX 
可以同步各种异构数据源之间的数据,包括关系数据库(MySQL、Oracle 等)、HDFS 和 Hive。 DataX 提供了 StarRocks Writer 插件,用于将 DataX 支持的数据源中的数据同步到 StarRocks。 - -## [CloudCanal](../integrations/loading_tools/CloudCanal.md) - -CloudCanal 社区版是由 [ClouGence Co., Ltd](https://www.cloudcanalx.com/) 发布的免费数据迁移和同步平台,集成了 Schema Migration、全量数据迁移、验证、纠正和实时增量同步。 您可以直接在 CloudCanal 的可视化界面中添加 StarRocks 作为数据源,并创建任务以自动将数据从源数据库(例如,MySQL、Oracle、PostgreSQL)迁移或同步到 StarRocks。 - -## [Kettle connector](https://github.com/StarRocks/starrocks-connector-for-kettle) - -Kettle 是一款具有可视化图形界面的 ETL (Extract, Transform, Load) 工具,允许用户通过拖动组件和配置参数来构建数据处理工作流程。 这种直观的方法大大简化了数据处理和导入的过程,使用户能够更方便地处理数据。 此外,Kettle 提供了丰富的组件库,允许用户根据需要选择合适的组件并执行各种复杂的数据处理任务。 - -StarRocks 提供 Kettle Connector 以与 Kettle 集成。 通过将 Kettle 强大的数据处理和转换能力与 StarRocks 的高性能数据存储和分析能力相结合,可以实现更灵活、更高效的数据处理工作流程。 \ No newline at end of file diff --git a/docs/zh/loading/minio.md b/docs/zh/loading/minio.md deleted file mode 100644 index 94887e6..0000000 --- a/docs/zh/loading/minio.md +++ /dev/null @@ -1,717 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 ---- - -# 从 MinIO 导入数据 - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -StarRocks 提供了以下从 MinIO 导入数据的选项: - -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) + [`FILES()`](../sql-reference/sql-functions/table-functions/files.md) 进行同步导入 -- 使用 [Broker Load](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md) 进行异步导入 - -这些选项各有优势,以下各节将详细介绍。 - -在大多数情况下,我们建议您使用 INSERT+`FILES()` 方法,因为它更易于使用。 - -但是,INSERT+`FILES()` 方法目前仅支持 Parquet、ORC 和 CSV 文件格式。因此,如果您需要导入其他文件格式(如 JSON)的数据,或者[在数据导入期间执行数据更改(如 DELETE)](../loading/Load_to_Primary_Key_tables.md),您可以选择使用 Broker Load。 - -## 前提条件 - -### 准备源数据 - -确保要导入到 StarRocks 中的源数据已正确存储在 MinIO bucket 中。您还可以考虑数据和数据库的位置,因为当您的 bucket 和 StarRocks 集群位于同一区域时,数据传输成本会低得多。 - -在本主题中,我们为您提供了一个示例数据集。您可以使用 `curl` 下载它: - -```bash -curl -O https://starrocks-examples.s3.amazonaws.com/user_behavior_ten_million_rows.parquet -``` - -将 Parquet 文件导入到您的 MinIO 系统中,并记下 bucket 名称。本指南中的示例使用 `/starrocks` 作为 bucket 名称。 - -### 检查权限 - - - -### 收集连接详细信息 - -简而言之,要使用 MinIO Access Key 身份验证,您需要收集以下信息: - -- 存储数据的 bucket -- 对象 key(对象名称)(如果访问 bucket 中的特定对象) -- MinIO endpoint -- 用作访问凭据的 access key 和 secret key。 - -![MinIO access key](../_assets/quick-start/MinIO-create.png) - -## 使用 INSERT+FILES() - -此方法从 v3.1 版本开始可用,目前仅支持 Parquet、ORC 和 CSV(从 v3.3.0 版本开始)文件格式。 - -### INSERT+FILES() 的优势 - -[`FILES()`](../sql-reference/sql-functions/table-functions/files.md) 可以读取存储在云存储中的文件,基于您指定的路径相关属性,推断文件中数据的表结构,然后将文件中的数据作为数据行返回。 - -使用 `FILES()`,您可以: - -- 使用 [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 直接从 MinIO 查询数据。 -- 使用 [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS) 创建和导入表。 -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) 将数据导入到现有表中。 - -### 典型示例 - -#### 使用 SELECT 直接从 MinIO 查询 - -使用 SELECT+`FILES()` 直接从 MinIO 查询可以在创建表之前很好地预览数据集的内容。例如: - -- 在不存储数据的情况下获取数据集的预览。 -- 查询最小值和最大值,并确定要使用的数据类型。 -- 检查 `NULL` 值。 - -以下示例查询先前添加到您的 MinIO 系统的示例数据集。 - -:::tip - -命令中突出显示的部分包含您可能需要更改的设置: - -- 设置 `endpoint` 和 `path` 以匹配您的 MinIO 系统。 -- 如果您的 MinIO 系统使用 SSL,请将 `enable_ssl` 设置为 `true`。 -- 将您的 MinIO access key 和 secret key 替换为 `AAA` 和 `BBB`。 - -::: - -```sql -SELECT * FROM FILES -( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = 
"BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "format" = "parquet", - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" -) -LIMIT 3; -``` - -系统返回以下查询结果: - -```plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 543711 | 829192 | 2355072 | pv | 2017-11-27 08:22:37 | -| 543711 | 2056618 | 3645362 | pv | 2017-11-27 10:16:46 | -| 543711 | 1165492 | 3645362 | pv | 2017-11-27 10:17:00 | -+--------+---------+------------+--------------+---------------------+ -3 rows in set (0.41 sec) -``` - -:::info - -请注意,上面返回的列名由 Parquet 文件提供。 - -::: - -#### 使用 CTAS 创建和导入表 - -这是前一个示例的延续。先前的查询包装在 CREATE TABLE AS SELECT (CTAS) 中,以使用模式推断自动执行表创建。这意味着 StarRocks 将推断表结构,创建您想要的表,然后将数据加载到表中。使用 `FILES()` 表函数和 Parquet 文件时,不需要列名和类型来创建表,因为 Parquet 格式包含列名。 - -:::note - -使用模式推断时,CREATE TABLE 的语法不允许设置副本数,因此请在创建表之前设置它。以下示例适用于具有单个副本的系统: - -```SQL -ADMIN SET FRONTEND CONFIG ('default_replication_num' = '1'); -``` - -::: - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -使用 CTAS 创建一个表,并加载先前添加到您的 MinIO 系统的示例数据集的数据。 - -:::tip - -命令中突出显示的部分包含您可能需要更改的设置: - -- 设置 `endpoint` 和 `path` 以匹配您的 MinIO 系统。 -- 如果您的 MinIO 系统使用 SSL,请将 `enable_ssl` 设置为 `true`。 -- 将您的 MinIO access key 和 secret key 替换为 `AAA` 和 `BBB`。 - -::: - -```sql -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "format" = "parquet", - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" -); -``` - -```plaintext -Query OK, 10000000 rows affected (3.17 sec) -{'label':'insert_a5da3ff5-9ee4-11ee-90b0-02420a060004', 'status':'VISIBLE', 'txnId':'17'} -``` - -创建表后,您可以使用 [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md) 查看其结构: - -```SQL -DESCRIBE user_behavior_inferred; -``` - -系统返回以下查询结果: - -```Plaintext -+--------------+------------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+------------------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varchar(1048576) | YES | false | NULL | | -| Timestamp | varchar(1048576) | YES | false | NULL | | -+--------------+------------------+------+-------+---------+-------+ -``` - -查询表以验证数据是否已加载到其中。示例: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plaintext -+--------+--------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+--------+------------+--------------+---------------------+ -| 58 | 158350 | 2355072 | pv | 2017-11-27 13:06:51 | -| 58 | 158590 | 3194735 | pv | 2017-11-27 02:21:04 | -| 58 | 215073 | 3002561 | pv | 2017-11-30 10:55:42 | -+--------+--------+------------+--------------+---------------------+ -``` - -#### 使用 INSERT 导入到现有表 - -您可能想要自定义要导入的表,例如: - -- 列数据类型、nullable 设置或默认值 -- key 类型和列 -- 
数据分区和分桶 - -:::tip - -创建最有效的表结构需要了解如何使用数据以及列的内容。本主题不包括表设计。有关表设计的信息,请参见 [表类型](../table_design/StarRocks_table_design.md)。 - -::: - -在此示例中,我们基于对如何查询表以及 Parquet 文件中的数据的了解来创建表。对 Parquet 文件中数据的了解可以通过直接在 MinIO 中查询文件来获得。 - -- 由于在 MinIO 中查询数据集表明 `Timestamp` 列包含与 `datetime` 数据类型匹配的数据,因此在以下 DDL 中指定了列类型。 -- 通过查询 MinIO 中的数据,您可以发现数据集中没有 `NULL` 值,因此 DDL 不会将任何列设置为 nullable。 -- 根据对预期查询类型的了解,排序键和分桶列设置为列 `UserID`。您的用例可能与此数据不同,因此您可能会决定除了 `UserID` 之外或代替 `UserID` 使用 `ItemID` 作为排序键。 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 MinIO 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11) NOT NULL, - ItemID int(11) NOT NULL, - CategoryID int(11) NOT NULL, - BehaviorType varchar(65533) NOT NULL, - Timestamp datetime NOT NULL -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID) -PROPERTIES -( - 'replication_num' = '1' -); -``` - -显示结构,以便您可以将其与 `FILES()` 表函数生成的推断结构进行比较: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | NO | true | NULL | | -| ItemID | int | NO | false | NULL | | -| CategoryID | int | NO | false | NULL | | -| BehaviorType | varchar(65533) | NO | false | NULL | | -| Timestamp | datetime | NO | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -5 rows in set (0.00 sec) -``` - -:::tip - -将您刚刚创建的结构与之前使用 `FILES()` 表函数推断的结构进行比较。查看: - -- 数据类型 -- nullable -- key 字段 - -为了更好地控制目标表的结构并获得更好的查询性能,我们建议您在生产环境中手动指定表结构。对于时间戳字段,使用 `datetime` 数据类型比使用 `varchar` 更有效。 - -::: - -创建表后,您可以使用 INSERT INTO SELECT FROM FILES() 加载它: - -:::tip - -命令中突出显示的部分包含您可能需要更改的设置: - -- 设置 `endpoint` 和 `path` 以匹配您的 MinIO 系统。 -- 如果您的 MinIO 系统使用 SSL,请将 `enable_ssl` 设置为 `true`。 -- 将您的 MinIO access key 和 secret key 替换为 `AAA` 和 `BBB`。 - -::: - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "path" = "s3://starrocks/user_behavior_ten_million_rows.parquet", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "format" = "parquet", - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" -); -``` - -加载完成后,您可以查询表以验证数据是否已加载到其中。示例: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 58 | 4309692 | 1165503 | pv | 2017-11-25 14:06:52 | -| 58 | 181489 | 1165503 | pv | 2017-11-25 14:07:22 | -| 58 | 3722956 | 1165503 | pv | 2017-11-25 14:09:28 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 INSERT 作业的进度。此功能从 v3.1 版本开始支持。示例: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -有关 `loads` 视图中提供的字段的信息,请参见 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与作业关联的 `LABEL` 进行过滤。示例: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 
'insert_e3b882f5-7eb3-11ee-ae77-00163e267b60' \G -*************************** 1. row *************************** - JOB_ID: 10243 - LABEL: insert_e3b882f5-7eb3-11ee-ae77-00163e267b60 - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-11-09 11:56:01 - ETL_START_TIME: 2023-11-09 11:56:01 - ETL_FINISH_TIME: 2023-11-09 11:56:01 - LOAD_START_TIME: 2023-11-09 11:56:01 - LOAD_FINISH_TIME: 2023-11-09 11:56:44 - JOB_DETAILS: {"All backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[10142]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -:::tip - -INSERT 是一个同步命令。如果 INSERT 作业仍在运行,您需要打开另一个会话来检查其执行状态。 - -::: - -### 比较磁盘上的表大小 - -此查询比较具有推断结构的表和声明了结构的表。由于推断的结构具有 nullable 列和时间戳的 varchar,因此数据长度更大: - -```sql -SELECT TABLE_NAME, - TABLE_ROWS, - AVG_ROW_LENGTH, - DATA_LENGTH -FROM information_schema.tables -WHERE TABLE_NAME like 'user_behavior%'\G -``` - -```plaintext -*************************** 1. row *************************** - TABLE_NAME: user_behavior_declared - TABLE_ROWS: 10000000 -AVG_ROW_LENGTH: 10 - DATA_LENGTH: 102562516 -*************************** 2. row *************************** - TABLE_NAME: user_behavior_inferred - TABLE_ROWS: 10000000 -AVG_ROW_LENGTH: 17 - DATA_LENGTH: 176803880 -2 rows in set (0.04 sec) -``` - -## 使用 Broker Load - -异步 Broker Load 进程处理与 MinIO 的连接、提取数据以及将数据存储在 StarRocks 中。 - -此方法支持以下文件格式: - -- Parquet -- ORC -- CSV -- JSON(从 v3.2.3 版本开始支持) - -### Broker Load 的优势 - -- Broker Load 在后台运行,客户端无需保持连接即可继续作业。 -- Broker Load 更适合长时间运行的作业,默认超时时间为 4 小时。 -- 除了 Parquet 和 ORC 文件格式外,Broker Load 还支持 CSV 文件格式和 JSON 文件格式(JSON 文件格式从 v3.2.3 版本开始支持)。 - -### 数据流 - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -1. 用户创建一个导入作业。 -2. 前端 (FE) 创建一个查询计划,并将该计划分发到后端节点 (BE) 或计算节点 (CN)。 -3. 
BE 或 CN 从源提取数据,并将数据加载到 StarRocks 中。 - -### 典型示例 - -创建一个表,启动一个导入进程,该进程提取先前加载到您的 MinIO 系统的示例数据集。 - -#### 创建数据库和表 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与要从 MinIO 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11) NOT NULL, - ItemID int(11) NOT NULL, - CategoryID int(11) NOT NULL, - BehaviorType varchar(65533) NOT NULL, - Timestamp datetime NOT NULL -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID) -PROPERTIES -( - 'replication_num' = '1' -); -``` - -#### 启动 Broker Load - -运行以下命令以启动一个 Broker Load 作业,该作业将数据从示例数据集 `user_behavior_ten_million_rows.parquet` 加载到 `user_behavior` 表: - -:::tip - -命令中突出显示的部分包含您可能需要更改的设置: - -- 设置 `endpoint` 和 `DATA INFILE` 以匹配您的 MinIO 系统。 -- 如果您的 MinIO 系统使用 SSL,请将 `enable_ssl` 设置为 `true`。 -- 将您的 MinIO access key 和 secret key 替换为 `AAA` 和 `BBB`。 - -::: - -```sql -LOAD LABEL UserBehavior -( - -- highlight-start - DATA INFILE("s3://starrocks/user_behavior_ten_million_rows.parquet") - -- highlight-end - INTO TABLE user_behavior - ) - WITH BROKER - ( - -- highlight-start - "aws.s3.endpoint" = "http://minio:9000", - "aws.s3.enable_ssl" = "false", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB", - -- highlight-end - "aws.s3.use_aws_sdk_default_behavior" = "false", - "aws.s3.use_instance_profile" = "false", - "aws.s3.enable_path_style_access" = "true" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -此作业有四个主要部分: - -- `LABEL`: 用于查询导入作业状态的字符串。 -- `LOAD` 声明:源 URI、源数据格式和目标表名称。 -- `BROKER`: 源的连接详细信息。 -- `PROPERTIES`: 超时值和要应用于导入作业的任何其他属性。 - -有关详细的语法和参数说明,请参见 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 Broker Load 作业的进度。此功能从 v3.1 版本开始支持。 - -```SQL -SELECT * FROM information_schema.loads; -``` - -有关 `loads` 视图中提供的字段的信息,请参见 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与作业关联的 `LABEL` 进行过滤。示例: - -```sql -SELECT * FROM information_schema.loads -WHERE LABEL = 'UserBehavior'\G -``` - -```plaintext -*************************** 1. 
row *************************** - JOB_ID: 10176 - LABEL: userbehavior - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: BROKER - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):72000; max_filter_ratio:0.0 - CREATE_TIME: 2023-12-19 23:02:41 - ETL_START_TIME: 2023-12-19 23:02:44 - ETL_FINISH_TIME: 2023-12-19 23:02:44 - LOAD_START_TIME: 2023-12-19 23:02:44 - LOAD_FINISH_TIME: 2023-12-19 23:02:46 - JOB_DETAILS: {"All backends":{"4aeec563-a91e-4c1e-b169-977b660950d1":[10004]},"FileNumber":1,"FileSize":132251298,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":132251298,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"4aeec563-a91e-4c1e-b169-977b660950d1":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -1 row in set (0.02 sec) -``` - -确认导入作业已完成后,您可以检查目标表的一个子集,以查看数据是否已成功加载。示例: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 142 | 2869980 | 2939262 | pv | 2017-11-25 03:43:22 | -| 142 | 2522236 | 1669167 | pv | 2017-11-25 15:14:12 | -| 142 | 3031639 | 3607361 | pv | 2017-11-25 15:19:25 | -+--------+---------+------------+--------------+---------------------+ -``` - - \ No newline at end of file diff --git a/docs/zh/loading/objectstorage.mdx b/docs/zh/loading/objectstorage.mdx deleted file mode 100644 index b33b7d9..0000000 --- a/docs/zh/loading/objectstorage.mdx +++ /dev/null @@ -1,10 +0,0 @@ ---- -displayed_sidebar: docs -description: "从 S3、GCS、Azure 和 MinIO 导入数据" ---- - -# 从对象存储导入数据 - -import DocCardList from '@theme/DocCardList'; - - \ No newline at end of file diff --git a/docs/zh/loading/s3.md b/docs/zh/loading/s3.md deleted file mode 100644 index 1878479..0000000 --- a/docs/zh/loading/s3.md +++ /dev/null @@ -1,658 +0,0 @@ ---- -displayed_sidebar: docs -toc_max_heading_level: 4 -keywords: ['Broker Load'] ---- - -# 从 AWS S3 导入数据 - -import LoadMethodIntro from '../_assets/commonMarkdown/loadMethodIntro.mdx' - -import InsertPrivNote from '../_assets/commonMarkdown/insertPrivNote.mdx' - -import PipeAdvantages from '../_assets/commonMarkdown/pipeAdvantages.mdx' - -StarRocks 提供了以下从 AWS S3 导入数据的选项: - - - -## 前提条件 - -### 准备源数据 - -确保要导入到 StarRocks 中的源数据已正确存储在 S3 bucket 中。您还可以考虑数据和数据库的位置,因为当您的 bucket 和 StarRocks 集群位于同一区域时,数据传输成本会低得多。 - -在本主题中,我们为您提供了一个 S3 bucket 中的示例数据集 `s3://starrocks-examples/user-behavior-10-million-rows.parquet`。您可以使用任何有效的凭据访问该数据集,因为任何经过 AWS 身份验证的用户都可以读取该对象。 - -### 检查权限 - - - -### 收集身份验证详细信息 - -本主题中的示例使用基于 IAM 用户的身份验证。为确保您有权从 AWS S3 读取数据,我们建议您阅读 [IAM 用户身份验证准备](../integrations/authenticate_to_aws_resources.md) 并按照说明创建一个配置了正确 [IAM policies](../sql-reference/aws_iam_policies.md) 的 IAM 用户。 - -简而言之,如果您使用基于 IAM 用户的身份验证,则需要收集有关以下 AWS 资源的信息: - -- 存储数据的 S3 bucket。 -- S3 对象键(对象名称),如果访问 bucket 中的特定对象。请注意,如果您的 S3 对象存储在子文件夹中,则对象键可以包含前缀。 -- S3 bucket 所属的 AWS 区域。 -- 用作访问凭证的访问密钥和密钥。 - -有关所有可用的身份验证方法的信息,请参阅 [向 AWS 资源进行身份验证](../integrations/authenticate_to_aws_resources.md)。 - -## 使用 INSERT+FILES() - -此方法从 v3.1 开始可用,目前仅支持 Parquet、ORC 和 CSV(从 v3.3.0 开始)文件格式。 - -### INSERT+FILES() 的优势 - -[`FILES()`](../sql-reference/sql-functions/table_functions/files.md) 可以读取存储在云存储中的文件,基于您指定的路径相关属性,推断文件中数据的表结构,然后将文件中的数据作为数据行返回。 - -使用 `FILES()`,您可以: - 
-- 使用 [SELECT](../sql-reference/sql-statements/table_bucket_part_index/SELECT.md) 直接从 S3 查询数据。 -- 使用 [CREATE TABLE AS SELECT](../sql-reference/sql-statements/table_bucket_part_index/CREATE_TABLE_AS_SELECT.md) (CTAS) 创建和导入表。 -- 使用 [INSERT](../sql-reference/sql-statements/loading_unloading/INSERT.md) 将数据导入到现有表中。 - -### 典型示例 - -#### 使用 SELECT 直接从 S3 查询 - -使用 SELECT+`FILES()` 直接从 S3 查询可以在创建表之前很好地预览数据集的内容。例如: - -- 在不存储数据的情况下获取数据集的预览。 -- 查询最小值和最大值,并确定要使用的数据类型。 -- 检查 `NULL` 值。 - -以下示例查询示例数据集 `s3://starrocks-examples/user-behavior-10-million-rows.parquet`: - -```SQL -SELECT * FROM FILES -( - "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -) -LIMIT 3; -``` - -> **NOTE** -> -> 在上面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何经过 AWS 身份验证的用户都可以读取该对象。 - -系统返回以下查询结果: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 1 | 2576651 | 149192 | pv | 2017-11-25 01:21:25 | -| 1 | 3830808 | 4181361 | pv | 2017-11-25 07:04:53 | -| 1 | 4365585 | 2520377 | pv | 2017-11-25 07:49:06 | -+--------+---------+------------+--------------+---------------------+ -``` - -> **NOTE** -> -> 请注意,上面返回的列名由 Parquet 文件提供。 - -#### 使用 CTAS 创建和导入表 - -这是前一个示例的延续。先前的查询包装在 CREATE TABLE AS SELECT (CTAS) 中,以使用模式推断自动执行表创建。这意味着 StarRocks 将推断表结构,创建您想要的表,然后将数据加载到表中。使用 `FILES()` 表函数和 Parquet 文件时,不需要列名和类型来创建表,因为 Parquet 格式包含列名。 - -> **NOTE** -> -> 使用模式推断时,CREATE TABLE 的语法不允许设置副本数,因此请在创建表之前设置它。以下示例适用于具有一个副本的系统: -> -> ```SQL -> ADMIN SET FRONTEND CONFIG ('default_replication_num' = "1"); -> ``` - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -使用 CTAS 创建一个表,并将示例数据集 `s3://starrocks-examples/user-behavior-10-million-rows.parquet` 的数据加载到该表中: - -```SQL -CREATE TABLE user_behavior_inferred AS -SELECT * FROM FILES -( - "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -> **NOTE** -> -> 在上面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何经过 AWS 身份验证的用户都可以读取该对象。 - -创建表后,您可以使用 [DESCRIBE](../sql-reference/sql-statements/table_bucket_part_index/DESCRIBE.md) 查看其结构: - -```SQL -DESCRIBE user_behavior_inferred; -``` - -系统返回以下查询结果: - -```Plain -+--------------+------------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+------------------+------+-------+---------+-------+ -| UserID | bigint | YES | true | NULL | | -| ItemID | bigint | YES | true | NULL | | -| CategoryID | bigint | YES | true | NULL | | -| BehaviorType | varchar(1048576) | YES | false | NULL | | -| Timestamp | varchar(1048576) | YES | false | NULL | | -+--------------+------------------+------+-------+---------+-------+ -``` - -查询表以验证数据是否已加载到其中。例如: - -```SQL -SELECT * from user_behavior_inferred LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 
225586 | 3694958 | 1040727 | pv | 2017-12-01 00:58:40 | -| 225586 | 3726324 | 965809 | pv | 2017-12-01 02:16:02 | -| 225586 | 3732495 | 1488813 | pv | 2017-12-01 00:59:46 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### 使用 INSERT 导入到现有表 - -您可能想要自定义要插入的表,例如: - -- 列数据类型、可空设置或默认值 -- 键类型和列 -- 数据分区和分桶 - -> **NOTE** -> -> 创建最有效的表结构需要了解数据的使用方式和列的内容。本主题不涵盖表设计。有关表设计的信息,请参阅 [表类型](../table_design/StarRocks_table_design.md)。 - -在此示例中,我们基于对表将被如何查询以及 Parquet 文件中的数据的了解来创建表。对 Parquet 文件中的数据的了解可以通过直接在 S3 中查询文件来获得。 - -- 由于对 S3 中数据集的查询表明 `Timestamp` 列包含与 VARCHAR 数据类型匹配的数据,并且 StarRocks 可以从 VARCHAR 转换为 DATETIME,因此在以下 DDL 中,数据类型更改为 DATETIME。 -- 通过查询 S3 中的数据,您可以发现数据集中没有 `NULL` 值,因此 DDL 也可以将所有列设置为不可为空。 -- 根据对预期查询类型的了解,排序键和分桶列设置为列 `UserID`。您的用例可能与此数据不同,因此您可能会决定除了 `UserID` 之外或代替 `UserID` 使用 `ItemID` 作为排序键。 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表: - -```SQL -CREATE TABLE user_behavior_declared -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp datetime -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -显示结构,以便您可以将其与 `FILES()` 表函数生成的推断结构进行比较: - -```sql -DESCRIBE user_behavior_declared; -``` - -```plaintext -+--------------+----------------+------+-------+---------+-------+ -| Field | Type | Null | Key | Default | Extra | -+--------------+----------------+------+-------+---------+-------+ -| UserID | int | YES | true | NULL | | -| ItemID | int | YES | false | NULL | | -| CategoryID | int | YES | false | NULL | | -| BehaviorType | varchar(65533) | YES | false | NULL | | -| Timestamp | datetime | YES | false | NULL | | -+--------------+----------------+------+-------+---------+-------+ -``` - -:::tip - -将您刚刚创建的结构与之前使用 `FILES()` 表函数推断的结构进行比较。查看: - -- 数据类型 -- 可空性 -- 键字段 - -为了更好地控制目标表的结构并获得更好的查询性能,我们建议您在生产环境中手动指定表结构。 - -::: - -创建表后,您可以使用 INSERT INTO SELECT FROM FILES() 加载它: - -```SQL -INSERT INTO user_behavior_declared -SELECT * FROM FILES -( - "path" = "s3://starrocks-examples/user-behavior-10-million-rows.parquet", - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -> **NOTE** -> -> 在上面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何经过 AWS 身份验证的用户都可以读取该对象。 - -加载完成后,您可以查询表以验证数据是否已加载到其中。例如: - -```SQL -SELECT * from user_behavior_declared LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 393529 | 3715112 | 883960 | pv | 2017-12-02 02:45:44 | -| 393529 | 2650583 | 883960 | pv | 2017-12-02 02:45:59 | -| 393529 | 3715112 | 883960 | pv | 2017-12-02 03:00:56 | -+--------+---------+------------+--------------+---------------------+ -``` - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 INSERT 作业的进度。此功能从 v3.1 开始支持。例如: - -```SQL -SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC; -``` - -有关 `loads` 视图中提供的字段的信息,请参阅 [`loads`](../sql-reference/information_schema/loads.md)。 - -如果您提交了多个导入作业,则可以按与该作业关联的 `LABEL` 进行过滤。例如: - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'insert_e3b882f5-7eb3-11ee-ae77-00163e267b60' \G -*************************** 1. 
row *************************** - JOB_ID: 10243 - LABEL: insert_e3b882f5-7eb3-11ee-ae77-00163e267b60 - DATABASE_NAME: mydatabase - STATE: FINISHED - PROGRESS: ETL:100%; LOAD:100% - TYPE: INSERT - PRIORITY: NORMAL - SCAN_ROWS: 10000000 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 10000000 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):300; max_filter_ratio:0.0 - CREATE_TIME: 2023-11-09 11:56:01 - ETL_START_TIME: 2023-11-09 11:56:01 - ETL_FINISH_TIME: 2023-11-09 11:56:01 - LOAD_START_TIME: 2023-11-09 11:56:01 - LOAD_FINISH_TIME: 2023-11-09 11:56:44 - JOB_DETAILS: {"All backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[10142]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":581574034,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"e3b882f5-7eb3-11ee-ae77-00163e267b60":[]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -> **NOTE** -> -> INSERT 是一个同步命令。如果 INSERT 作业仍在运行,您需要打开另一个会话来检查其执行状态。 - -## 使用 Broker Load - -异步 Broker Load 进程处理与 S3 的连接、提取数据以及将数据存储在 StarRocks 中。 - -此方法支持以下文件格式: - -- Parquet -- ORC -- CSV -- JSON(从 v3.2.3 开始支持) - -### Broker Load 的优势 - -- Broker Load 在后台运行,客户端无需保持连接即可继续作业。 -- Broker Load 是长时间运行的作业的首选,默认超时时间为 4 小时。 -- 除了 Parquet 和 ORC 文件格式外,Broker Load 还支持 CSV 文件格式和 JSON 文件格式(从 v3.2.3 开始支持 JSON 文件格式)。 - -### 数据流 - -![Broker Load 的工作流程](../_assets/broker_load_how-to-work_en.png) - -1. 用户创建一个导入作业。 -2. 前端 (FE) 创建一个查询计划,并将该计划分发到后端节点 (BE) 或计算节点 (CN)。 -3. BE 或 CN 从源提取数据,并将数据加载到 StarRocks 中。 - -### 典型示例 - -创建一个表,启动一个从 S3 提取示例数据集 `s3://starrocks-examples/user-behavior-10-million-rows.parquet` 的导入进程,并验证数据导入的进度和成功。 - -#### 创建数据库和表 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与您要从 AWS S3 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp datetime -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### 启动 Broker Load - -运行以下命令以启动一个 Broker Load 作业,该作业将数据从示例数据集 `s3://starrocks-examples/user-behavior-10-million-rows.parquet` 加载到 `user_behavior` 表: - -```SQL -LOAD LABEL user_behavior -( - DATA INFILE("s3://starrocks-examples/user-behavior-10-million-rows.parquet") - INTO TABLE user_behavior - FORMAT AS "parquet" - ) - WITH BROKER - ( - "aws.s3.enable_ssl" = "true", - "aws.s3.use_instance_profile" = "false", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" - ) -PROPERTIES -( - "timeout" = "72000" -); -``` - -> **NOTE** -> -> 在上面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何经过 AWS 身份验证的用户都可以读取该对象。 - -此作业有四个主要部分: - -- `LABEL`:一个字符串,用于查询导入作业的状态。 -- `LOAD` 声明:源 URI、源数据格式和目标表名称。 -- `BROKER`:源的连接详细信息。 -- `PROPERTIES`:超时值和要应用于导入作业的任何其他属性。 - -有关详细的语法和参数说明,请参阅 [BROKER LOAD](../sql-reference/sql-statements/loading_unloading/BROKER_LOAD.md)。 - -#### 检查导入进度 - -您可以从 StarRocks Information Schema 中的 [`loads`](../sql-reference/information_schema/loads.md) 视图查询 Broker Load 作业的进度。此功能从 v3.1 开始支持。 - -```SQL -SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior'; -``` - -有关 `loads` 视图中提供的字段的信息,请参阅 [`loads`](../sql-reference/information_schema/loads.md)。 - -此记录显示 `LOADING` 状态,进度为 39%。如果您看到类似的内容,请再次运行该命令,直到看到 `FINISHED` 状态。 - -```Plaintext - JOB_ID: 10466 - LABEL: user_behavior - DATABASE_NAME: mydatabase - # highlight-start - STATE: 
LOADING - PROGRESS: ETL:100%; LOAD:39% - # highlight-end - TYPE: BROKER - PRIORITY: NORMAL - SCAN_ROWS: 4620288 - FILTERED_ROWS: 0 - UNSELECTED_ROWS: 0 - SINK_ROWS: 4620288 - ETL_INFO: - TASK_INFO: resource:N/A; timeout(s):72000; max_filter_ratio:0.0 - CREATE_TIME: 2024-02-28 22:11:36 - ETL_START_TIME: 2024-02-28 22:11:41 - ETL_FINISH_TIME: 2024-02-28 22:11:41 - LOAD_START_TIME: 2024-02-28 22:11:41 - LOAD_FINISH_TIME: NULL - JOB_DETAILS: {"All backends":{"2fb97223-b14c-404b-9be1-83aa9b3a7715":[10004]},"FileNumber":1,"FileSize":136901706,"InternalTableLoadBytes":144032784,"InternalTableLoadRows":4620288,"ScanBytes":143969616,"ScanRows":4620288,"TaskNumber":1,"Unfinished backends":{"2fb97223-b14c-404b-9be1-83aa9b3a7715":[10004]}} - ERROR_MSG: NULL - TRACKING_URL: NULL - TRACKING_SQL: NULL -REJECTED_RECORD_PATH: NULL -``` - -在确认导入作业已完成后,您可以检查目标表的子集,以查看数据是否已成功加载。例如: - -```SQL -SELECT * from user_behavior LIMIT 3; -``` - -返回以下查询结果,表明数据已成功加载: - -```Plaintext -+--------+---------+------------+--------------+---------------------+ -| UserID | ItemID | CategoryID | BehaviorType | Timestamp | -+--------+---------+------------+--------------+---------------------+ -| 34 | 856384 | 1029459 | pv | 2017-11-27 14:43:27 | -| 34 | 5079705 | 1029459 | pv | 2017-11-27 14:44:13 | -| 34 | 4451615 | 1029459 | pv | 2017-11-27 14:45:52 | -+--------+---------+------------+--------------+---------------------+ -``` - -## 使用 Pipe - -从 v3.2 开始,StarRocks 提供了 Pipe 导入方法,该方法目前仅支持 Parquet 和 ORC 文件格式。 - -### Pipe 的优势 - - - -Pipe 非常适合连续数据导入和大规模数据导入: - -- **微批次中的大规模数据导入有助于降低由数据错误引起的重试成本。** - - 借助 Pipe,StarRocks 能够高效地导入大量数据文件,这些文件总数据量很大。Pipe 根据文件的数量或大小自动拆分文件,将导入作业分解为更小的顺序任务。这种方法确保一个文件中的错误不会影响整个导入作业。每个文件的导入状态都由 Pipe 记录,使您可以轻松识别和修复包含错误的文件。通过最大限度地减少因数据错误而需要重试的情况,这种方法有助于降低成本。 - -- **连续数据导入有助于减少人力。** - - Pipe 帮助您将新的或更新的数据文件写入特定位置,并不断地将这些文件中的新数据加载到 StarRocks 中。在使用 `"AUTO_INGEST" = "TRUE"` 指定创建 Pipe 作业后,它将不断监视存储在指定路径中的数据文件的更改,并自动将数据文件中的新数据或更新的数据加载到目标 StarRocks 表中。 - -此外,Pipe 执行文件唯一性检查,以帮助防止重复数据导入。在导入过程中,Pipe 根据文件名和摘要检查每个数据文件的唯一性。如果具有特定文件名和摘要的文件已由 Pipe 作业处理,则 Pipe 作业将跳过所有后续具有相同文件名和摘要的文件。请注意,对象存储(如 AWS S3)使用 `ETag` 作为文件摘要。 - -每个数据文件的导入状态都会被记录并保存到 `information_schema.pipe_files` 视图中。在删除与该视图关联的 Pipe 作业后,有关在该作业中加载的文件的记录也将被删除。 - -### Pipe 和 INSERT+FILES() 之间的区别 - -Pipe 作业根据每个数据文件的大小和行数拆分为一个或多个事务。用户可以在导入过程中查询中间结果。相比之下,INSERT+`FILES()` 作业作为单个事务处理,用户无法在导入过程中查看数据。 - -### 文件导入顺序 - -对于每个 Pipe 作业,StarRocks 维护一个文件队列,从中提取和加载数据文件作为微批次。Pipe 不保证数据文件以与上传顺序相同的顺序加载。因此,较新的数据可能在较旧的数据之前加载。 - -### 典型示例 - -#### 创建数据库和表 - -创建一个数据库并切换到它: - -```SQL -CREATE DATABASE IF NOT EXISTS mydatabase; -USE mydatabase; -``` - -手动创建一个表(我们建议该表具有与您要从 AWS S3 导入的 Parquet 文件相同的结构): - -```SQL -CREATE TABLE user_behavior_from_pipe -( - UserID int(11), - ItemID int(11), - CategoryID int(11), - BehaviorType varchar(65533), - Timestamp datetime -) -ENGINE = OLAP -DUPLICATE KEY(UserID) -DISTRIBUTED BY HASH(UserID); -``` - -#### 启动 Pipe 作业 - -运行以下命令以启动一个 Pipe 作业,该作业将数据从示例数据集 `s3://starrocks-examples/user-behavior-10-million-rows/` 加载到 `user_behavior_from_pipe` 表。此 pipe 作业使用微批处理和连续加载(如上所述)pipe 特定的功能。 - -本指南中的其他示例加载包含 1000 万行的单个 Parquet 文件。对于 pipe 示例,相同的数据集被拆分为 57 个单独的文件,这些文件都存储在一个 S3 文件夹中。请注意,在下面的 `CREATE PIPE` 命令中,`path` 是 S3 文件夹的 URI,而不是提供文件名,URI 以 `/*` 结尾。通过设置 `AUTO_INGEST` 并指定文件夹而不是单个文件,pipe 作业将轮询 S3 文件夹中的新文件,并在将它们添加到文件夹时导入它们。 - -```SQL -CREATE PIPE user_behavior_pipe -PROPERTIES -( --- highlight-start - "AUTO_INGEST" = "TRUE" --- highlight-end -) -AS -INSERT INTO user_behavior_from_pipe -SELECT * FROM FILES -( --- highlight-start - "path" = 
"s3://starrocks-examples/user-behavior-10-million-rows/*", --- highlight-end - "format" = "parquet", - "aws.s3.region" = "us-east-1", - "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA", - "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" -); -``` - -> **NOTE** -> -> 在上面的命令中,将您的凭据替换为 `AAA` 和 `BBB`。可以使用任何有效的 `aws.s3.access_key` 和 `aws.s3.secret_key`,因为任何经过 AWS 身份验证的用户都可以读取该对象。 - -此作业有四个主要部分: - -- `pipe_name`:pipe 的名称。pipe 名称在 pipe 所属的数据库中必须是唯一的。 -- `INSERT_SQL`:INSERT INTO SELECT FROM FILES 语句,用于将数据从指定的源数据文件加载到目标表。 -- `PROPERTIES`:一组可选参数,用于指定如何执行 pipe。这些参数包括 `AUTO_INGEST`、`POLL_INTERVAL`、`BATCH_SIZE` 和 `BATCH_FILES`。以 `"key" = "value"` 格式指定这些属性。 - -有关详细的语法和参数说明,请参阅 [CREATE PIPE](../sql-reference/sql-statements/loading_unloading/pipe/CREATE_PIPE.md)。 - -#### 检查导入进度 - -- 通过在 Pipe 作业所属的当前数据库中使用 [SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md) 来查询 Pipe 作业的进度。 - - ```SQL - SHOW PIPES WHERE NAME = 'user_behavior_pipe' \G - ``` - - 返回以下结果: - - :::tip - 在下面显示的输出中,pipe 处于 `RUNNING` 状态。pipe 将保持 `RUNNING` 状态,直到您手动停止它。输出还显示了加载的文件数 (57) 和上次加载文件的时间。 - ::: - - ```SQL - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10476 - PIPE_NAME: user_behavior_pipe - -- highlight-start - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_from_pipe - LOAD_STATUS: {"loadedFiles":57,"loadedBytes":295345637,"loadingFiles":0,"lastLoadedTime":"2024-02-28 22:14:19"} - -- highlight-end - LAST_ERROR: NULL - CREATED_TIME: 2024-02-28 22:13:41 - 1 row in set (0.02 sec) - ``` - -- 从 StarRocks Information Schema 中的 [`pipes`](../sql-reference/information_schema/pipes.md) 视图查询 Pipe 作业的进度。 - - ```SQL - SELECT * FROM information_schema.pipes WHERE pipe_name = 'user_behavior_replica' \G - ``` - - 返回以下结果: - - :::tip - 本指南中的某些查询以 `\G` 而不是分号 (`;`) 结尾。这会导致 MySQL 客户端以垂直格式输出结果。如果您使用的是 DBeaver 或其他客户端,则可能需要使用分号 (`;`) 而不是 `\G`。 - ::: - - ```SQL - *************************** 1. row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10217 - PIPE_NAME: user_behavior_replica - STATE: RUNNING - TABLE_NAME: mydatabase.user_behavior_replica - LOAD_STATUS: {"loadedFiles":1,"loadedBytes":132251298,"loadingFiles":0,"lastLoadedTime":"2023-11-09 15:35:42"} - LAST_ERROR: - CREATED_TIME: 9891-01-15 07:51:45 - 1 row in set (0.01 sec) - ``` - -#### 检查文件状态 - -您可以从 StarRocks Information Schema 中的 [`pipe_files`](../sql-reference/information_schema/pipe_files.md) 视图查询加载的文件的加载状态。 - -```SQL -SELECT * FROM information_schema.pipe_files WHERE pipe_name = 'user_behavior_replica' \G -``` - -返回以下结果: - -```SQL -*************************** 1. 
row *************************** - DATABASE_NAME: mydatabase - PIPE_ID: 10217 - PIPE_NAME: user_behavior_replica - FILE_NAME: s3://starrocks-examples/user-behavior-10-million-rows.parquet - FILE_VERSION: e29daa86b1120fea58ad0d047e671787-8 - FILE_SIZE: 132251298 - LAST_MODIFIED: 2023-11-06 13:25:17 - LOAD_STATE: FINISHED - STAGED_TIME: 2023-11-09 15:35:02 - START_LOAD_TIME: 2023-11-09 15:35:03 -FINISH_LOAD_TIME: 2023-11-09 15:35:42 - ERROR_MSG: -1 row in set (0.03 sec) -``` - -#### 管理 Pipe 作业 - -您可以更改、暂停或恢复、删除或查询您已创建的 pipe,并重试加载特定的数据文件。有关更多信息,请参阅 [ALTER PIPE](../sql-reference/sql-statements/loading_unloading/pipe/ALTER_PIPE.md)、[SUSPEND or RESUME PIPE](../sql-reference/sql-statements/loading_unloading/pipe/SUSPEND_or_RESUME_PIPE.md)、[DROP PIPE](../sql-reference/sql-statements/loading_unloading/pipe/DROP_PIPE.md)、[SHOW PIPES](../sql-reference/sql-statements/loading_unloading/pipe/SHOW_PIPES.md) 和 [RETRY FILE](../sql-reference/sql-statements/loading_unloading/pipe/RETRY_FILE.md)。 \ No newline at end of file diff --git a/docs/zh/loading/s3_compatible.md b/docs/zh/loading/s3_compatible.md deleted file mode 100644 index db92470..0000000 --- a/docs/zh/loading/s3_compatible.md +++ /dev/null @@ -1 +0,0 @@ -unlisted: true \ No newline at end of file diff --git a/docs/zh/loading/tencent.md b/docs/zh/loading/tencent.md deleted file mode 100644 index db92470..0000000 --- a/docs/zh/loading/tencent.md +++ /dev/null @@ -1 +0,0 @@ -unlisted: true \ No newline at end of file diff --git a/docs/zh/sql-reference/sql-functions/dict-functions/dict_mapping.md b/docs/zh/sql-reference/sql-functions/dict-functions/dict_mapping.md deleted file mode 100644 index 105f7c2..0000000 --- a/docs/zh/sql-reference/sql-functions/dict-functions/dict_mapping.md +++ /dev/null @@ -1,215 +0,0 @@ ---- -displayed_sidebar: docs ---- - -# dict_mapping - -返回字典表中与指定键映射的值。 - -此函数主要用于简化全局字典表的应用。在将数据加载到目标表期间,StarRocks 会通过使用此函数中的输入参数,自动从字典表中获取与指定键映射的值,然后将该值加载到目标表中。 - -自 v3.2.5 起,StarRocks 支持此功能。另请注意,目前 StarRocks 的存算分离模式不支持此功能。 - -## 语法 - -```SQL -dict_mapping("[.]", key_column_expr_list [, ] [, ] ) - -key_column_expr_list ::= key_column_expr [, key_column_expr ... ] - -key_column_expr ::= | -``` - -## 参数 - -- 必需参数: - - `[.]`:字典表的名称,需要是 Primary Key table。支持的数据类型为 VARCHAR。 - - `key_column_expr_list`:字典表中键列的表达式列表,包括一个或多个 `key_column_exprs`。`key_column_expr` 可以是字典表中键列的名称,也可以是特定的键或键表达式。 - - 此表达式列表需要包括字典表的所有 Primary Key 列,这意味着表达式的总数需要与字典表中 Primary Key 列的总数匹配。因此,当字典表使用组合主键时,此列表中的表达式需要按顺序与表结构中定义的 Primary Key 列相对应。此列表中的多个表达式用逗号 (`,`) 分隔。如果 `key_column_expr` 是特定的键或键表达式,则其类型必须与字典表中相应 Primary Key 列的类型匹配。 - -- 可选参数: - - ``:值列的名称,也就是映射列。如果未指定值列,则默认值列是字典表的 AUTO_INCREMENT 列。值列也可以定义为字典表中的任何列,不包括自增列和主键。该列的数据类型没有限制。 - - ``(可选):如果键在字典表中不存在,是否返回 Null。有效值: - - `true`:如果键不存在,则返回 Null。 - - `false`(默认):如果键不存在,则抛出异常。 - -## 返回值 - -返回的数据类型与值列的数据类型保持一致。如果值列是字典表的自增列,则返回的数据类型为 BIGINT。 - -但是,当未找到与指定键映射的值时,如果 `` 参数设置为 `true`,则返回 `NULL`。如果参数设置为 `false`(默认),则返回错误 `query failed if record not exist in dict table`。 - -## 示例 - -**示例 1:直接查询字典表中与键映射的值。** - -1. 
创建一个字典表并加载模拟数据。 - - ```SQL - MySQL [test]> CREATE TABLE dict ( - order_uuid STRING, - order_id_int BIGINT AUTO_INCREMENT - ) - PRIMARY KEY (order_uuid) - DISTRIBUTED BY HASH (order_uuid); - Query OK, 0 rows affected (0.02 sec) - - MySQL [test]> INSERT INTO dict (order_uuid) VALUES ('a1'), ('a2'), ('a3'); - Query OK, 3 rows affected (0.12 sec) - {'label':'insert_9e60b0e4-89fa-11ee-a41f-b22a2c00f66b', 'status':'VISIBLE', 'txnId':'15029'} - - MySQL [test]> SELECT * FROM dict; - +------------+--------------+ - | order_uuid | order_id_int | - +------------+--------------+ - | a1 | 1 | - | a3 | 3 | - | a2 | 2 | - +------------+--------------+ - 3 rows in set (0.01 sec) - ``` - - > **注意** - > - > 目前,`INSERT INTO` 语句不支持部分更新。因此,请确保插入到 `dict` 的键列中的值不重复。否则,在字典表中多次插入相同的键列值会导致其在值列中映射的值发生更改。 - -2. 查询字典表中与键 `a1` 映射的值。 - - ```SQL - MySQL [test]> SELECT dict_mapping('dict', 'a1'); - +----------------------------+ - | dict_mapping('dict', 'a1') | - +----------------------------+ - | 1 | - +----------------------------+ - 1 row in set (0.01 sec) - ``` - -**示例 2:表中的映射列配置为使用 `dict_mapping` 函数生成的列。因此,在将数据加载到此表时,StarRocks 可以自动获取与键映射的值。** - -1. 创建一个数据表,并通过使用 `dict_mapping('dict', order_uuid)` 将映射列配置为生成列。 - - ```SQL - CREATE TABLE dest_table1 ( - id BIGINT, - -- 此列记录 STRING 类型的订单号,对应于示例 1 中 dict 表中的 order_uuid 列。 - order_uuid STRING, - batch int comment 'used to distinguish different batch loading', - -- 此列记录与 order_uuid 列映射的 BIGINT 类型的订单号。 - -- 因为此列是使用 dict_mapping 配置的生成列,所以此列中的值在数据加载期间会自动从示例 1 中的 dict 表中获取。 - -- 随后,此列可以直接用于去重和 JOIN 查询。 - order_id_int BIGINT AS dict_mapping('dict', order_uuid) - ) - DUPLICATE KEY (id, order_uuid) - DISTRIBUTED BY HASH(id); - ``` - -2. 当将模拟数据加载到此表中时,其中 `order_id_int` 列配置为 `dict_mapping('dict', 'order_uuid')`,StarRocks 会根据 `dict` 表中键和值之间的映射关系自动将值加载到 `order_id_int` 列中。 - - ```SQL - MySQL [test]> INSERT INTO dest_table1(id, order_uuid, batch) VALUES (1, 'a1', 1), (2, 'a1', 1), (3, 'a3', 1), (4, 'a3', 1); - Query OK, 4 rows affected (0.05 sec) - {'label':'insert_e191b9e4-8a98-11ee-b29c-00163e03897d', 'status':'VISIBLE', 'txnId':'72'} - - MySQL [test]> SELECT * FROM dest_table1; - +------+------------+-------+--------------+ - | id | order_uuid | batch | order_id_int | - +------+------------+-------+--------------+ - | 1 | a1 | 1 | 1 | - | 4 | a3 | 1 | 3 | - | 2 | a1 | 1 | 1 | - | 3 | a3 | 1 | 3 | - +------+------------+-------+--------------+ - 4 rows in set (0.02 sec) - ``` - - 在此示例中使用 `dict_mapping` 可以加速 [去重计算和 JOIN 查询](../../../using_starrocks/query_acceleration_with_auto_increment.md)。与之前构建全局字典以加速精确去重的解决方案相比,使用 `dict_mapping` 的解决方案更加灵活和用户友好。因为映射值是在“将键和值之间的映射关系加载到表”的阶段直接从字典表中获取的。您无需编写语句来连接字典表以获取映射值。此外,此解决方案支持各种数据导入方法。 - -**示例 3:如果表中的映射列未配置为生成列,则在将数据加载到表时,您需要为映射列显式配置 `dict_mapping` 函数,以获取与键映射的值。** - -> **注意** -> -> 示例 3 和示例 2 之间的区别在于,在导入到数据表时,您需要修改导入命令,以便为映射列显式配置 `dict_mapping` 表达式。 - -1. 创建一个表。 - - ```SQL - CREATE TABLE dest_table2 ( - id BIGINT, - order_uuid STRING, - order_id_int BIGINT NULL, - batch int comment 'used to distinguish different batch loading' - ) - DUPLICATE KEY (id, order_uuid, order_id_int) - DISTRIBUTED BY HASH(id); - ``` - -2. 
当模拟数据加载到此表中时,您可以通过配置 `dict_mapping` 从字典表中获取映射的值。 - - ```SQL - MySQL [test]> INSERT INTO dest_table2 VALUES (1, 'a1', dict_mapping('dict', 'a1'), 1); - Query OK, 1 row affected (0.35 sec) - {'label':'insert_19872ab6-8a96-11ee-b29c-00163e03897d', 'status':'VISIBLE', 'txnId':'42'} - - MySQL [test]> SELECT * FROM dest_table2; - +------+------------+--------------+-------+ - | id | order_uuid | order_id_int | batch | - +------+------------+--------------+-------+ - | 1 | a1 | 1 | 1 | - +------+------------+--------------+-------+ - 1 row in set (0.02 sec) - ``` - -**示例 4:启用 null_if_not_exist 模式** - -当 `` 模式被禁用,并且查询了字典表中不存在的键所映射的值时,将返回一个错误,而不是 `NULL`。它确保数据行的键首先被加载到字典表中,并且在将该数据行加载到目标表之前,会生成其映射的值(字典 ID)。 - -```SQL -MySQL [test]> SELECT dict_mapping('dict', 'b1', true); -ERROR 1064 (HY000): Query failed if record not exist in dict table. -``` - -**示例 5:如果字典表使用组合主键,则在查询时必须指定所有主键。** - -1. 创建一个具有组合主键的字典表,并将模拟数据加载到其中。 - - ```SQL - MySQL [test]> CREATE TABLE dict2 ( - order_uuid STRING, - order_date DATE, - order_id_int BIGINT AUTO_INCREMENT - ) - PRIMARY KEY (order_uuid,order_date) -- 组合主键 - DISTRIBUTED BY HASH (order_uuid,order_date) - ; - Query OK, 0 rows affected (0.02 sec) - - MySQL [test]> INSERT INTO dict2 VALUES ('a1','2023-11-22',default), ('a2','2023-11-22',default), ('a3','2023-11-22',default); - Query OK, 3 rows affected (0.12 sec) - {'label':'insert_9e60b0e4-89fa-11ee-a41f-b22a2c00f66b', 'status':'VISIBLE', 'txnId':'15029'} - - - MySQL [test]> select * from dict2; - +------------+------------+--------------+ - | order_uuid | order_date | order_id_int | - +------------+------------+--------------+ - | a1 | 2023-11-22 | 1 | - | a3 | 2023-11-22 | 3 | - | a2 | 2023-11-22 | 2 | - +------------+------------+--------------+ - 3 rows in set (0.01 sec) - ``` - -2. 查询字典表中与键映射的值。由于字典表具有组合主键,因此需要在 `dict_mapping` 中指定所有主键。 - - ```SQL - SELECT dict_mapping('dict2', 'a1', cast('2023-11-22' as DATE)); - ``` - - 请注意,如果仅指定一个主键,则会发生错误。 - - ```SQL - MySQL [test]> SELECT dict_mapping('dict2', 'a1'); - ERROR 1064 (HY000): Getting analyzing error. Detail message: dict_mapping function param size should be 3 - 5. - ``` \ No newline at end of file
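
The examples above focus on how mapped values are produced at load time. As a minimal follow-up sketch, and assuming the `dest_table1` table and the `dict` dictionary table from Example 2 are already in place, the BIGINT `order_id_int` column generated by `dict_mapping` can be queried directly for exact deduplication instead of the original STRING `order_uuid` column, which is the acceleration scenario Example 2 refers to (the `distinct_orders` alias below is illustrative only):

```SQL
-- Minimal sketch based on Example 2: dest_table1 contains the generated column
-- order_id_int, populated from the dict table during loading. Counting distinct
-- BIGINT dictionary IDs avoids deduplicating on the original STRING order numbers.
SELECT batch, COUNT(DISTINCT order_id_int) AS distinct_orders
FROM dest_table1
GROUP BY batch;
```

Because `order_id_int` is filled in automatically from the dictionary table while the data is being loaded, the query itself does not need to join against `dict` to resolve the mapping.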