Spark is a popular, general-purpose engine for big data analytics that features high performance and ease of use. You can use Spark to perform complex in-memory analysis and build large, low-latency data analysis applications. DataWorks provides E-MapReduce (EMR) Spark nodes that you can use to develop and periodically schedule Spark tasks. This topic describes how to create an EMR Spark node and provides examples of its features.
Preparations
Before developing nodes, if you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the custom image in DataStudio. For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when creating the custom image.
An EMR cluster must be registered with DataWorks. For more information, see Legacy Data Development: Attach an EMR computing resource.
(Optional) If you use a RAM user to develop tasks, you must add the user to the workspace and assign the user the Development or Workspace Manager role. The Workspace Manager role has more permissions than required, so assign this role with caution. For more information about how to add members, see Add members to a workspace.
A resource group must be purchased and configured. The configuration includes associating the resource group with a workspace and configuring the network. For more information, see Create and use a serverless resource group.
A workflow must be created. In DataStudio, development operations for different compute engines are performed based on workflows. Therefore, you must create a workflow before you create a node. For more information, see Create a workflow.
If you want to use a specific development environment to develop a task, you can create a custom image in the DataWorks console. For more information, see Manage images.
Limits
This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group. If you need to use an image in DataStudio, use a serverless computing resource group.
If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK in the cluster. If EMR-HOOK is not configured, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, the related EMR governance tasks cannot be run. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Spark SQL.
You cannot view data lineages of a Spark cluster that is created on the EMR on ACK page. You can view data lineages of an EMR Serverless Spark cluster.
For Spark clusters that are created on the EMR on ACK page and EMR Serverless Spark clusters, you can use only the Object Storage Service (OSS) REF method to reference OSS resources and upload resources to OSS. You cannot upload resources to the Hadoop Distributed File System (HDFS).
For DataLake and custom clusters, you can use the OSS REF method to reference OSS resources and upload resources to OSS or HDFS.
Notes
If you have enabled Ranger access control for Spark in the EMR cluster that is attached to your workspace, note the following:
When you use the default image to run Spark tasks, this feature is available by default.
To use a custom image to run Spark tasks, submit a ticket to contact technical support to upgrade the image to support this feature.
Preparations: Develop a Spark task and get the JAR package
Before you use DataWorks to schedule an EMR Spark task, you must develop Spark task code in EMR and compile the code to generate a JAR package. For more information about how to develop EMR Spark tasks, see Spark Overview.
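For example, if the Spark project is built with Maven, the JAR package can typically be produced with a command similar to the following. This is only a minimal sketch: the project layout, the artifact name, and whether you need a shaded or assembly JAR depend on your own build configuration.
mvn clean package
By default, the generated JAR package is placed in the target directory of the project.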
You must upload the JAR package to DataWorks to periodically schedule the EMR Spark task.
Step 1: Create an EMR Spark node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark node.
Right-click the target workflow and select .
Note: Alternatively, you can hover over Create and select .
In the Create Node dialog box, enter a Name and select the Engine Instance, Node Type, and Path. Click Confirm. The EMR Spark node editing page appears.
Note: The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
Step 2: Develop a Spark task
Double-click the created EMR Spark node to open the task development page. Choose an operation based on your scenario:
(Recommended) Upload a resource from your on-premises machine to DataStudio and then reference the resource. For more information, see Method 1: Upload and then reference an EMR JAR resource.
Use the OSS REF method to reference an OSS resource. For more information, see Method 2: Directly reference an OSS resource.
Method 1: Upload and then reference an EMR JAR resource
DataWorks lets you upload a resource from your on-premises machine to DataStudio and then reference the resource. After the EMR Spark task is compiled, you must obtain the compiled JAR package. We recommend that you choose a storage method for the JAR package resource based on its size.
You can upload the JAR package resource, create it as a DataWorks EMR resource, and then submit it. Alternatively, you can store it directly in the HDFS of the EMR cluster. You cannot upload resources to HDFS for EMR on ACK Spark clusters or EMR Serverless Spark clusters.
If the JAR package is smaller than 200 MB
Create an EMR JAR resource.
If a JAR package is smaller than 200 MB, you can upload the package from your on-premises machine as an EMR JAR resource to DataWorks. This lets you visually manage the resource in the DataWorks console. After you create the resource, you must submit it. For more information, see Create and use EMR resources.
Note: When you create an EMR resource for the first time, if you want the JAR package to be stored in OSS after it is uploaded, you must first perform authorization as prompted on the page.
Reference the EMR JAR resource.
Double-click the created EMR Spark node to open its code editing page.
In the resource list, find the EMR JAR resource that you uploaded, right-click the resource, and select Reference Resource.
After you select Reference Resource, the resource reference code is automatically added to the editing page of the current EMR Spark node. The following code provides an example.
##@resource_reference{"spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar"}
spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar
If the preceding reference code is automatically added, the resource is successfully referenced. spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar is the name of the EMR JAR resource that you uploaded.
Rewrite the EMR Spark node code and add the spark-submit command. The following code provides an example of the rewritten code.
Note: Comments are not supported when you edit code for an EMR Spark node. Rewrite the task code by referring to the following example and do not add comments. Otherwise, an error is reported when you run the node.
##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100
Description:
org.apache.spark.examples.SparkPi: The main class of the task in your compiled JAR package.
spark-examples_2.11-2.4.0.jar: The name of the EMR JAR resource that you uploaded.
You can use the values from the preceding example for the other parameters. You can also run the following command to view the help documentation for the spark-submit command and modify the spark-submit command as needed.
Note: To use a parameter for the spark-submit command in a Spark node, you must add the parameter to your code, for example, --executor-memory 2G. Spark nodes support submitting jobs only by using YARN in cluster mode. If you submit jobs by using spark-submit, we recommend that you set deploy-mode to cluster instead of client.
spark-submit --help
If the JAR package is 200 MB or larger
Create an EMR JAR resource.
If a JAR package is 200 MB or larger, you cannot upload it from your on-premises machine as a DataWorks resource. We recommend that you store the JAR package directly in the HDFS of the EMR cluster and record its storage path. You can then reference this path when you schedule the Spark task in DataWorks.
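For example, assuming that you can log on to a node of the EMR cluster on which the Hadoop client is available, you can upload the JAR package to HDFS with standard hadoop fs commands similar to the following. The /tmp/jars path is only an example and matches the path that is used in the sample command later in this section.
hadoop fs -mkdir -p /tmp/jars        # create the target directory in HDFS
hadoop fs -put spark-examples_2.11-2.4.8.jar /tmp/jars/        # upload the local JAR package
hadoop fs -ls /tmp/jars/        # confirm the upload and record the storage path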
Reference the EMR JAR resource.
If the JAR package is stored in HDFS, you can reference it by specifying its path in the code of the EMR Spark node.
Double-click the created EMR Spark node to open its code editing page.
Write the spark-submit command. The following code provides an example.
spark-submit --master yarn --deploy-mode cluster --name SparkPi --driver-memory 4G --driver-cores 1 --num-executors 5 --executor-memory 4G --executor-cores 1 --class org.apache.spark.examples.JavaSparkPi hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
Where:
hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar: The actual path of the JAR package in HDFS.
org.apache.spark.examples.JavaSparkPi: The main class of the task in your compiled JAR package.
Set the other parameters based on the actual configuration of your EMR cluster. You can also run the following command to view the help information for the spark-submit command and modify the command as needed.
Important: To use a parameter for the spark-submit command in a Spark node, you must add the parameter to your code, for example, --executor-memory 2G. Spark nodes support submitting jobs only by using YARN in cluster mode.
If you submit a task using spark-submit, we recommend that you set deploy-mode to cluster instead of client.
spark-submit --help
Method 2: Directly reference an OSS resource
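With this method, the node code references a JAR package that is stored in OSS through an ossref path instead of an uploaded DataStudio resource. The following spark-submit command is only a hedged sketch: the bucket name examplebucket and the object path are placeholders, and you should confirm the exact ossref path format in the OSS REF reference documentation for your cluster type.
spark-submit --class org.apache.spark.examples.SparkPi --master yarn ossref://examplebucket/path/to/spark-examples_2.11-2.4.0.jar 100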
(Optional) Configure advanced parameters
You can configure Spark-specific properties on the Advanced Settings tab for a node. For more information about how to configure Spark properties, see Spark Configuration. The following table describes the advanced parameters that can be configured for different types of EMR clusters.
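As a rough illustration of the kind of Spark properties involved, the following spark-submit command passes standard Spark configuration keys through --conf flags in the node code. The property names and values are illustrative examples rather than settings taken from this topic; where supported, the same properties can instead be configured as advanced parameters.
spark-submit --master yarn --deploy-mode cluster --conf spark.executor.memory=4g --conf spark.executor.cores=2 --conf spark.executor.instances=5 --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar 100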
DataLake cluster/custom cluster: EMR on ECS
Advanced parameter | Description
queue | The scheduling queue to which jobs are submitted. The default queue is default. If you have configured a workspace-level YARN Resource Queue when you register an EMR cluster to a DataWorks workspace: For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue.
priority | The priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
Other |
Hadoop cluster: EMR on ECS
Advanced parameter | Description
queue | The scheduling queue to which jobs are submitted. The default queue is default. If you have configured a workspace-level YARN Resource Queue when you register an EMR cluster to a DataWorks workspace: For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue.
priority | The priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
USE_GATEWAY | Specifies whether to submit jobs through a Gateway cluster when you submit jobs on this node. Valid values: Note: If the cluster that contains this node is not associated with a Gateway cluster and you manually set this parameter to
Other |
Spark cluster: EMR on ACK
Advanced parameter | Description
queue | Not supported.
priority | Not supported.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
Other |
EMR Serverless Spark cluster
For more information about parameter settings, see Submit a Spark job.
Advanced parameter | Description
queue | The scheduling queue to which jobs are submitted. The default queue is dev_queue.
priority | The priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
SERVERLESS_RELEASE_VERSION | The version of the Spark engine. By default, the Default Engine Version that is configured for the cluster on the Cluster Management page in Management Center is used. You can configure this parameter to specify different engine versions for different types of tasks.
SERVERLESS_QUEUE_NAME | The resource queue. By default, the Default Resource Queue that is configured for the cluster on the Cluster Management page in Management Center is used. You can add queues to meet resource isolation and management requirements. For more information, see Manage resource queues.
Other |
Execute the task
In the toolbar, click the Run icon. In the Parameters dialog box, select the scheduling resource group that you created and click Run.
Note: If you want to access a computing resource over the internet or in a VPC, use a scheduling resource group that is connected to the computing resource. For more information, see Network connectivity solutions.
To change the resource group for a task, you can click the Run With Parameters icon and select a different scheduling resource group.
When you use an EMR Spark node to query data, a maximum of 10,000 records can be returned, and the total data size cannot exceed 10 MB.
Click the save icon in the toolbar to save the node code.
(Optional) Perform smoke testing.
You can perform smoke testing in the development environment when you submit the node or after the node is submitted. For more information, see Perform smoke testing.
Step 3: Configure node scheduling
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must set the node's Rerun Properties and Dependent Ancestor Nodes before you can submit the node.
Step 4: Publish the node task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the save icon in the top toolbar to save the task.
Click the submit icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.