Spark is a popular, general-purpose engine for big data analytics that features high performance and ease of use. You can use Spark to perform complex in-memory analysis and build large, low-latency data analysis applications. DataWorks provides E-MapReduce (EMR) Spark nodes that you can use to develop and periodically schedule Spark tasks. This topic describes how to create an EMR Spark node and provides examples of its features.
Preparations
Before developing nodes, if you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the custom image in DataStudio. For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when creating the custom image.
An EMR cluster must be registered with DataWorks. For more information, see Legacy Data Development: Attach an EMR computing resource.
(Optional) If you use a RAM user to develop tasks, you must add the user to the workspace and assign the user the Development or Workspace Manager role. The Workspace Manager role has more permissions than required, so assign this role with caution. For more information about how to add members, see Add members to a workspace.
A resource group must be purchased and configured. The configuration includes associating the resource group with a workspace and configuring the network. For more information, see Create and use a serverless resource group.
A workflow must be created. In DataStudio, development operations for different compute engines are performed based on workflows. Therefore, you must create a workflow before you create a node. For more information, see Create a workflow.
If you want to use a specific development environment to develop a task, you can create a custom image in the DataWorks console. For more information, see Manage images.
Limits
This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group. If you need to use an image in DataStudio, use a serverless computing resource group.
If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK in the cluster. If EMR-HOOK is not configured, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, the related EMR governance tasks cannot be run. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Spark SQL.
You cannot view data lineages of a Spark cluster that is created on the EMR on ACK page. You can view data lineages of an EMR Serverless Spark cluster.
For Spark clusters that are created on the EMR on ACK page and EMR Serverless Spark clusters, you can use only the Object Storage Service (OSS) REF method to reference OSS resources and upload resources to OSS. You cannot upload resources to the Hadoop Distributed File System (HDFS).
For DataLake and custom clusters, you can use the OSS REF method to reference OSS resources and upload resources to OSS or HDFS.
Notes
If you have enabled Ranger access control for Spark in the EMR cluster that is attached to your workspace, note the following:
When you use the default image to run Spark tasks, this feature is available by default.
To use a custom image to run Spark tasks, submit a ticket to contact technical support to upgrade the image to support this feature.
Preparations: Develop a Spark task and get the JAR package
Before you use DataWorks to schedule an EMR Spark task, you must develop Spark task code in EMR and compile the code to generate a JAR package. For more information about how to develop EMR Spark tasks, see Spark Overview.
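For example, if the Spark project is built with Maven, the JAR package can typically be produced with a command similar to the following. This is only a minimal sketch: the project layout, the artifact name, and whether you need a shaded or assembly JAR depend on your own build configuration.
mvn clean package
By default, the generated JAR package is placed in the target directory of the project.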
You must upload the JAR package to DataWorks to periodically schedule the EMR Spark task.
Step 1: Create an EMR Spark node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark node.
Right-click the target workflow and select .
Note: Alternatively, you can hover over Create and select .
In the Create Node dialog box, enter a Name and select the Engine Instance, Node Type, and Path. Click Confirm. The EMR Spark node editing page appears.
Note: The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
Step 2: Develop a Spark task
Double-click the created EMR Spark node to open the task development page. Choose an operation based on your scenario:
(Recommended) Upload a resource from your on-premises machine to DataStudio and then reference the resource. For more information, see Method 1: Upload and then reference an EMR JAR resource.
Use the OSS REF method to reference an OSS resource. For more information, see Method 2: Directly reference an OSS resource.
Method 1: Upload and then reference an EMR JAR resource
DataWorks lets you upload a resource from your on-premises machine to DataStudio and then reference the resource. After the EMR Spark task is compiled, you must obtain the compiled JAR package. We recommend that you choose a storage method for the JAR package resource based on its size.
You can upload the JAR package resource, create it as a DataWorks EMR resource, and then submit it. Alternatively, you can store it directly in the HDFS of the EMR cluster. You cannot upload resources to HDFS for EMR on ACK Spark clusters or EMR Serverless Spark clusters.
If the JAR package is smaller than 200 MB
Create an EMR JAR resource.
If a JAR package is smaller than 200 MB, you can upload the package from your on-premises machine as an EMR JAR resource to DataWorks. This lets you visually manage the resource in the DataWorks console. After you create the resource, you must submit it. For more information, see Create and use EMR resources.
Note: When you create an EMR resource for the first time, if you want the JAR package to be stored in OSS after it is uploaded, you must first perform authorization as prompted on the page.
Reference the EMR JAR resource.
Double-click the created EMR Spark node to open its code editing page.
In the resource list, find the EMR JAR resource that you uploaded, right-click the resource, and select Reference Resource.
After you select Reference Resource, the resource reference code is automatically added to the editing page of the current EMR Spark node. The following code provides an example.
##@resource_reference{"spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar"}
spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar
If the preceding reference code is automatically added, the resource is successfully referenced. spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar is the name of the EMR JAR resource that you uploaded.
Rewrite the EMR Spark node code and add the spark-submit command. The following code provides an example of the rewritten code.
Note: Comments are not supported when you edit code for an EMR Spark node. Rewrite the task code by referring to the following example and do not add comments. Otherwise, an error is reported when you run the node.
##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100
Description:
org.apache.spark.examples.SparkPi: The main class of the task in your compiled JAR package.
spark-examples_2.11-2.4.0.jar: The name of the EMR JAR resource that you uploaded.
You can use the values from the preceding example for the other parameters. You can also run the following command to view the help documentation for the spark-submit command and modify the spark-submit command as needed.
Note: To use a parameter for the spark-submit command in a Spark node, you must add the parameter to your code, for example, --executor-memory 2G. Spark nodes support submitting jobs only by using YARN in cluster mode. If you submit jobs by using spark-submit, we recommend that you set deploy-mode to cluster instead of client.
spark-submit --help
If the JAR package is 200 MB or larger
Create an EMR JAR resource.
If a JAR package is 200 MB or larger, you cannot upload it from your on-premises machine as a DataWorks resource. We recommend that you store the JAR package directly in the HDFS of the EMR cluster and record its storage path. You can then reference this path when you schedule the Spark task in DataWorks.
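For example, assuming that you can log on to a node of the EMR cluster on which the Hadoop client is available, you can upload the JAR package to HDFS with standard hadoop fs commands similar to the following. The /tmp/jars path is only an example and matches the path that is used in the sample command later in this section.
hadoop fs -mkdir -p /tmp/jars        # create the target directory in HDFS
hadoop fs -put spark-examples_2.11-2.4.8.jar /tmp/jars/        # upload the local JAR package
hadoop fs -ls /tmp/jars/        # confirm the upload and record the storage path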
Reference the EMR JAR resource.
If the JAR package is stored in HDFS, you can reference it by specifying its path in the code of the EMR Spark node.
Double-click the created EMR Spark node to open its code editing page.
Write the spark-submit command. The following code provides an example.
spark-submit --master yarn --deploy-mode cluster --name SparkPi --driver-memory 4G --driver-cores 1 --num-executors 5 --executor-memory 4G --executor-cores 1 --class org.apache.spark.examples.JavaSparkPi hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
Where:
hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar: The actual path of the JAR package in HDFS.
org.apache.spark.examples.JavaSparkPi: The main class of the task in your compiled JAR package.
Set the other parameters based on the actual configuration of your EMR cluster. You can also run the following command to view the help information for the spark-submit command and modify the command as needed.
Important: To use a parameter for the spark-submit command in a Spark node, you must add the parameter to your code, for example, --executor-memory 2G. Spark nodes support submitting jobs only by using YARN in cluster mode.
If you submit a task using spark-submit, we recommend that you set deploy-mode to cluster instead of client.
spark-submit --help
Method 2: Directly reference an OSS resource
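With this method, the node code references a JAR package that is stored in OSS through an ossref path instead of an uploaded DataStudio resource. The following spark-submit command is only a hedged sketch: the bucket name examplebucket and the object path are placeholders, and you should confirm the exact ossref path format in the OSS REF reference documentation for your cluster type.
spark-submit --class org.apache.spark.examples.SparkPi --master yarn ossref://examplebucket/path/to/spark-examples_2.11-2.4.0.jar 100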
(Optional) Configure advanced parameters
You can configure Spark-specific properties on the Advanced Settings tab for a node. For more information about how to configure Spark properties, see Spark Configuration. The following table describes the advanced parameters that can be configured for different types of EMR clusters.
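As a rough illustration of the kind of Spark properties involved, the following spark-submit command passes standard Spark configuration keys through --conf flags in the node code. The property names and values are illustrative examples rather than settings taken from this topic; where supported, the same properties can instead be configured as advanced parameters.
spark-submit --master yarn --deploy-mode cluster --conf spark.executor.memory=4g --conf spark.executor.cores=2 --conf spark.executor.instances=5 --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar 100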
DataLake cluster/custom cluster: EMR on ECS
Advanced parameter | Description
queue | The scheduling queue to which jobs are submitted. The default queue is default. If you have configured a workspace-level YARN Resource Queue when you register an EMR cluster to a DataWorks workspace: For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue.
priority | The priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
Other |
Hadoop cluster: EMR on ECS
Advanced parameter | Description
queue | The scheduling queue to which jobs are submitted. The default queue is default. If you have configured a workspace-level YARN Resource Queue when you register an EMR cluster to a DataWorks workspace: For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue.
priority | The priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
USE_GATEWAY | Specifies whether to submit jobs through a Gateway cluster when you submit jobs on this node. Valid values: Note: If the cluster that contains this node is not associated with a Gateway cluster and you manually set this parameter to
Other |
Spark cluster: EMR on ACK
Advanced parameter | Description
queue | Not supported.
priority | Not supported.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
Other |
EMR Serverless Spark cluster
For more information about parameter settings, see Submit a Spark job.
Advanced parameter | Description
queue | The scheduling queue to which jobs are submitted. The default queue is dev_queue.
priority | The priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution method. Valid values: Note: This parameter is supported only for testing workflows in the data development environment.
SERVERLESS_RELEASE_VERSION | The version of the Spark engine. By default, the Default Engine Version that is configured for the cluster on the Cluster Management page in Management Center is used. You can configure this parameter to specify different engine versions for different types of tasks.
SERVERLESS_QUEUE_NAME | The resource queue. By default, the Default Resource Queue that is configured for the cluster on the Cluster Management page in Management Center is used. You can add queues to meet resource isolation and management requirements. For more information, see Manage resource queues.
Other |
Execute the task
In the toolbar, click the Run icon. In the Parameters dialog box, select the scheduling resource group that you created and click Run.
Note: If you want to access a computing resource over the internet or in a VPC, use a scheduling resource group that is connected to the computing resource. For more information, see Network connectivity solutions.
To change the resource group for a task, you can click the Run With Parameters icon and select a different scheduling resource group.
When you use an EMR Spark node to query data, a maximum of 10,000 records can be returned, and the total data size cannot exceed 10 MB.
Click the save icon in the toolbar to save the node code.
(Optional) Perform smoke testing.
You can perform smoke testing in the development environment when you submit the node or after the node is submitted. For more information, see Perform smoke testing.
Step 3: Configure node scheduling
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must set the node's Rerun Properties and Dependent Ancestor Nodes before you can submit the node.
Step 4: Publish the node task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the save icon in the top toolbar to save the task.
Click the submit icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.