Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions docs/configs/docsdev.js
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,10 @@ export default {
title: 'Amazon EMR',
link: '/en-us/docs/dev/user_doc/guide/task/emr.html',
},
{
title: 'Amazon EMR Serverless',
link: '/en-us/docs/dev/user_doc/guide/task/emr-serverless.html',
},
{
title: 'Apache Zeppelin',
link: '/en-us/docs/dev/user_doc/guide/task/zeppelin.html',
Expand Down Expand Up @@ -881,6 +885,10 @@ export default {
title: 'Amazon EMR',
link: '/zh-cn/docs/dev/user_doc/guide/task/emr.html',
},
{
title: 'Amazon EMR Serverless',
link: '/zh-cn/docs/dev/user_doc/guide/task/emr-serverless.html',
},
{
title: 'Apache Zeppelin',
link: '/zh-cn/docs/dev/user_doc/guide/task/zeppelin.html',
Expand Down
144 changes: 144 additions & 0 deletions docs/docs/en/guide/task/emr-serverless.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Amazon EMR Serverless

## Overview

Amazon EMR Serverless task type, for submitting and monitoring job runs on [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) applications.
Unlike traditional EMR on EC2, EMR Serverless requires no cluster infrastructure management and automatically scales compute resources on demand, suitable for Spark and Hive workloads.

Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to a [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) object
and submit it to AWS via the [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html), then poll job status via the [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) until completion.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, click the `Create Workflow` button to enter the DAG editing page.
- Drag `AmazonEMRServerless` task from the toolbar to the artboard to complete the creation.

## Task Parameters

[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
[//]: # (- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) `Default Task Parameters` section for default parameters.)

- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md) `Default Task Parameters` section for default parameters.

| **Parameter** | **Description** |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application Id | EMR Serverless application ID (e.g. `00fkht2eodujab09`), obtainable from the [EMR Serverless Console](https://console.aws.amazon.com/emr/home#/serverless) |
| Execution Role Arn | ARN of the IAM role for job execution (e.g. `arn:aws:iam::123456789012:role/EMRServerlessRole`), this role needs permissions to access S3, Glue, and other services |
| Job Name | Job name (optional), used to identify the job in the EMR Serverless console |
| StartJobRunRequest JSON | JSON corresponding to the `JobDriver` and `ConfigurationOverrides` portions of the [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html), see examples below. **Note**: `ApplicationId` and `ExecutionRoleArn` do not need to be included in the JSON as they are automatically injected from the form parameters above |

![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)

## Task Example

### Submit a Spark Job

This example shows how to create an `EMR_SERVERLESS` task node to submit a Spark job to an EMR Serverless application.

StartJobRunRequest JSON example (Spark):

```json
{
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
"EntryPointArguments": [
"s3://my-bucket/input/",
"s3://my-bucket/output/"
],
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
}
}
}
```

### Submit a Hive Job

This example shows how to create an `EMR_SERVERLESS` task node to submit a Hive query job.

StartJobRunRequest JSON example (Hive):

```json
{
"JobDriver": {
"HiveSQL": {
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
},
"ApplicationConfiguration": [
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
]
}
}
```

## AWS Authentication Configuration

The EMR Serverless task reads AWS credentials from the DolphinScheduler `aws.yaml` configuration file, under the `aws.emr` section at `conf/aws.yaml`.

### Using IAM Role (Recommended)

If the DolphinScheduler Worker node runs on an EC2 instance with an attached IAM Role:

```yaml
aws:
emr:
credentials.provider.type: InstanceProfileCredentialsProvider
region: us-east-1
```

### Using Access Key

If you need to authenticate using AK/SK:

```yaml
aws:
emr:
credentials.provider.type: AWSStaticCredentialsProvider
access.key.id: your-access-key-id
access.key.secret: your-secret-access-key
region: us-east-1
```

> **Note**: The `aws.emr` section configuration is shared by both EMR on EC2 and EMR Serverless task types.

## Job State Transitions

After an EMR Serverless job is submitted, DolphinScheduler polls the job status every 10 seconds:

```
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
→ FAILED
→ CANCELLED
```

- When a job reaches `SUCCESS` state, the task is marked as successful
- When a job reaches `FAILED` or `CANCELLED` state, the task is marked as failed
- If a DolphinScheduler task is killed, it automatically calls the [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) to cancel the running job

## Notice

- The **Application Id** must correspond to a pre-existing EMR Serverless application (created via the AWS Console or API) in `STARTED` or `CREATED` state
- The **Execution Role** requires the following minimum permissions: `emr-serverless:StartJobRun`, `emr-serverless:GetJobRun`, `emr-serverless:CancelJobRun`, plus S3, Glue and other data access permissions required by the job
- `StartJobRunRequest JSON` should NOT include `ApplicationId` or `ExecutionRoleArn` fields — they are automatically injected from the form parameters
- EMR Serverless task supports failover: when a Worker node fails, a new Worker can recover tracking of running jobs through `appIds` (the `jobRunId`)

144 changes: 144 additions & 0 deletions docs/docs/zh/guide/task/emr-serverless.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Amazon EMR Serverless

## 综述

Amazon EMR Serverless 任务类型,用于向 [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) 应用程序提交并监控作业运行。
与传统的 EMR on EC2 不同,EMR Serverless 无需管理集群基础设施,按需自动扩缩容计算资源,适用于 Spark 和 Hive 工作负载。

后台使用 [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) 将 JSON 参数转换为 [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 对象,
通过 [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) 提交到 AWS,并通过 [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) 轮询作业状态直到完成。

## 创建任务

- 点击 `项目管理 -> 项目名称 -> 工作流定义`,点击 `创建工作流` 按钮进入 DAG 编辑页面。
- 从工具栏中拖拽 `AmazonEMRServerless` 任务到画布中完成创建。

## 任务参数

[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
[//]: # (- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)`默认任务参数`一栏。)

- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md)`默认任务参数`一栏。

| **任务参数** | **描述** |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application Id | EMR Serverless 应用程序 ID(格式如 `00fkht2eodujab09`),可在 [EMR Serverless 控制台](https://console.aws.amazon.com/emr/home#/serverless) 获取 |
| Execution Role Arn | 作业执行 IAM 角色的 ARN(格式如 `arn:aws:iam::123456789012:role/EMRServerlessRole`),该角色需要有访问 S3、Glue 等服务的权限 |
| Job Name | 作业名称(可选),用于在 EMR Serverless 控制台中标识作业 |
| StartJobRunRequest JSON | [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 中 `JobDriver` 和 `ConfigurationOverrides` 部分对应的 JSON,详细定义见下方示例。**注意**:`ApplicationId` 和 `ExecutionRoleArn` 无需在 JSON 中重复填写,系统会自动从上方参数注入 |

![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)

## 任务样例

### 提交 Spark 作业

该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Spark 作业到 EMR Serverless 应用程序。

StartJobRunRequest JSON 参数样例(Spark):

```json
{
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
"EntryPointArguments": [
"s3://my-bucket/input/",
"s3://my-bucket/output/"
],
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
}
}
}
```

### 提交 Hive 作业

该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Hive 查询作业。

StartJobRunRequest JSON 参数样例(Hive):

```json
{
"JobDriver": {
"HiveSQL": {
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
},
"ApplicationConfiguration": [
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
]
}
}
```

## AWS 认证配置

EMR Serverless 任务通过 DolphinScheduler 的 `aws.yaml` 配置文件读取 AWS 认证信息,配置路径为 `conf/aws.yaml` 中的 `aws.emr` 段。

### 使用 IAM Role(推荐)

如果 DolphinScheduler Worker 节点运行在 EC2 实例上并已绑定 IAM Role,配置如下:

```yaml
aws:
emr:
credentials.provider.type: InstanceProfileCredentialsProvider
region: us-east-1
```

### 使用 Access Key

如果需要使用 AK/SK 方式认证:

```yaml
aws:
emr:
credentials.provider.type: AWSStaticCredentialsProvider
access.key.id: your-access-key-id
access.key.secret: your-secret-access-key
region: us-east-1
```

> **注意**:`aws.emr` 段的配置同时被 EMR on EC2 和 EMR Serverless 任务类型共享。

## 作业状态流转

EMR Serverless 作业提交后,DolphinScheduler 会每 10 秒轮询一次作业状态:

```
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
→ FAILED
→ CANCELLED
```

- 作业进入 `SUCCESS` 状态时,任务标记为成功
- 作业进入 `FAILED` 或 `CANCELLED` 状态时,任务标记为失败
- 如果 DolphinScheduler 任务被终止,会自动调用 [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) 取消正在运行的作业

## 注意事项

- **Application Id** 对应的 EMR Serverless 应用程序需要预先在 AWS 控制台或通过 API 创建,并确保处于 `STARTED` 或 `CREATED` 状态
- **Execution Role** 需要有以下最小权限:`emr-serverless:StartJobRun`、`emr-serverless:GetJobRun`、`emr-serverless:CancelJobRun`,以及作业所需的 S3、Glue 等数据访问权限
- `StartJobRunRequest JSON` 中无需填写 `ApplicationId` 和 `ExecutionRoleArn` 字段,系统会自动从表单参数注入
- EMR Serverless 任务支持故障转移(Failover):当 Worker 节点发生故障时,新的 Worker 可以通过 `appIds`(即 `jobRunId`)恢复对正在运行作业的跟踪

Binary file added docs/img/tasks/demo/emr_serverless_create.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ task:
- 'REMOTESHELL'
cloud:
- 'EMR'
- 'EMR_SERVERLESS'
- 'K8S'
- 'DMS'
- 'DATA_FACTORY'
Expand Down
5 changes: 5 additions & 0 deletions dolphinscheduler-bom/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -752,6 +752,11 @@
<artifactId>aws-java-sdk-emr</artifactId>
<version>${aws-sdk.version}</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-emrserverless</artifactId>
<version>${aws-sdk.version}</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,12 @@
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-task-emr-serverless</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-task-zeppelin</artifactId>
Expand Down
Loading
Loading