Nodes and workflows in the project directory often require recurring scheduling. You can configure scheduling properties, such as the scheduling period, dependencies, and parameters, on the scheduling configuration panel for a node or workflow. This topic describes how to configure scheduling properties.
Prerequisites
A node must be created. In DataWorks, tasks are developed based on nodes. Tasks for different engine types are encapsulated as different node types. You can select a node type based on your requirements. For more information, see Node development.
The periodic scheduling switch must be turned on. Tasks in a DataWorks workspace are automatically scheduled based on their configurations only if the Enable Periodic Scheduling switch is turned on. You can turn on this switch on the Scheduling Settings page for the workspace. For more information, see System Settings.
Precautions
The scheduling configurations of a task only define its properties at runtime. The task is scheduled based on these configurations only after it is published to the production environment.
The scheduling time specifies only the expected running time of a task. The actual running time depends on the running status of the ancestor nodes. For more information about the conditions for running a task, see Diagnose a running task.
DataWorks lets you create dependencies between different types of tasks. Before you proceed, we recommend that you read the Principles and examples of scheduling configurations for complex dependencies document to understand the preset dependencies in DataWorks for this scenario.
In DataWorks, a recurring instance is generated for a scheduling node based on the scheduling type and period that you specify. For example, if you configure a node to run hourly, a corresponding number of hourly instances are generated for the node each day. The node runs automatically using these recurring instances. For more information, see View recurring instances.
If you use scheduling parameters, the parameter values in the code for each cycle of a DataWorks scheduling node are determined by the scheduled time of the cycle and the scheduling parameter expressions that you specify. For more information about how parameter values relate to the configuration and replacement of scheduling parameters, see Supported formats of scheduling parameters.
A workflow includes the workflow node and inner nodes. Their dependencies are complex. This topic describes only the dependencies and scheduling of individual nodes. For more information about the scheduling dependencies of a workflow, see Recurring workflow.
Go to the scheduling configuration page
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose the DataStudio entry in the Actions column.
Go to the scheduling configuration page.
On the DataStudio page, find the node and open its configuration tab.
In the right-side navigation bar of the node configuration tab, click Scheduling Configuration to open the node scheduling configuration page.
Configure scheduling properties for a node
On the scheduling configuration page of a node, you must configure Scheduling Parameters, Scheduling Policy, Scheduling Time, Scheduling Dependencies, and Node Output Parameters for the node.
(Optional) Scheduling parameters
If you define variables in the code when you edit the node, you must assign values to the variables in this section.
Scheduling parameters are automatically replaced with specific values based on the business time of the scheduled task and the value format of the scheduling parameters. This allows for the dynamic replacement of parameters within the scheduling time of the task.
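To illustrate how this replacement works, the following Python sketch resolves a date-offset expression, in the spirit of the DataWorks `$bizdate` convention (the day before the scheduled time, formatted as yyyymmdd), against an instance's scheduled time. The resolver itself is a simplified, hypothetical illustration, not platform code.

```python
from datetime import datetime, timedelta

def resolve_bizdate(scheduled_time: datetime, offset_days: int = -1) -> str:
    """Resolve a date expression against the scheduled time of an instance.

    By convention, bizdate is the day before the scheduled time, formatted
    as yyyymmdd. This resolver is a simplified illustration only.
    """
    return (scheduled_time + timedelta(days=offset_days)).strftime("%Y%m%d")

# An instance scheduled for 2024-06-15 00:00 sees bizdate = 20240614.
print(resolve_bizdate(datetime(2024, 6, 15)))  # 20240614
```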
Configure scheduling parameters
You can define scheduling parameters in one of the following ways.
Method | Description |
Add a parameter | You can configure multiple scheduling parameters for a scheduling node. To use multiple scheduling parameters, click Add Parameter once for each parameter. |
Load parameters from code | This method automatically detects the variable names that are defined in the code of the current node and adds them as scheduling parameters for subsequent use. Note: In most cases, a variable name is defined in the code in the ${variable name} format. The method for defining variable names for PyODPS and general Shell nodes is different from that for other types of nodes. For more information about the formats of scheduling parameters for different types of nodes, see Examples of scheduling parameter configurations for different types of nodes. |
Supported formats of scheduling parameters
For more information, see Supported formats of scheduling parameters.
Check the scheduling parameter configurations of the task in the production environment
To prevent issues caused by unexpected scheduling parameters when a recurring task runs, we recommend that you go to the Auto Triggered Task page in Operation Center to check the scheduling parameter configurations for the recurring task in the production environment after the task is published. For more information about how to view a recurring task, see Manage auto triggered tasks.
Scheduling policy
The scheduling policy defines the instance generation mode, scheduling type, computing resources, and resource groups for a recurring task.
Parameter | Description |
Instance generation mode | After a node is submitted and published to the scheduling system in the production environment, the platform generates recurring instances for automatic scheduling based on the Instance Generation Mode configured for the node. |
Scheduling type | Specifies how generated instances are handled: a normally scheduled task runs as scheduled, a dry-run task directly returns a successful state at its scheduled time without running the code, and a frozen (paused) task does not run. |
Timeout period | If you set a timeout period, the task is automatically terminated when its running time exceeds the specified period. |
Rerun property | Specifies whether the node can be rerun in specific situations. The rerun property cannot be empty. Typical options are to allow a rerun regardless of the running result, to allow a rerun only upon failure, or to disallow reruns. |
Automatic rerun upon failure | If you enable this feature, when a task fails to run (excluding cases in which the user actively stops the task), the scheduling system automatically reruns the task based on the specified number of retries and the retry interval. |
Computing resource | Specifies the computing engine resources required to run the task. To create new resources, use computing resource management. |
Computing quota | For MaxCompute SQL and MaxCompute Script nodes, you can configure the computing quota that provides computing resources (CPU and memory) for the computing job. |
Schedule resource group | Specifies the resource group for scheduling that is used to run the task. Select a resource group based on your requirements. |
Dataset | Click the add icon to add a dataset for the task. |
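The "automatic rerun upon failure" behavior described above can be pictured with a small sketch: rerun a failing task up to a configured number of retries with a fixed interval, and never retry a task that was actively stopped by a user. The function and exception names are illustrative, not part of any DataWorks API.

```python
import time

class ManuallyTerminated(Exception):
    """Raised when a user actively stops the task; such tasks are never retried."""

def run_with_retries(task, max_retries: int = 3, interval_seconds: float = 0.0) -> bool:
    """Run `task` and retry on failure, mimicking 'automatic rerun upon failure'."""
    for attempt in range(1 + max_retries):
        try:
            task()
            return True                      # success: no further reruns
        except ManuallyTerminated:
            raise                            # user-initiated stop: do not retry
        except Exception:
            if attempt == max_retries:
                return False                 # retries exhausted
            time.sleep(interval_seconds)     # wait for the retry interval
    return False

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")

print(run_with_retries(flaky))  # True (succeeds on the third attempt)
```

The key design point mirrored here is that a manual termination propagates instead of being swallowed, because the platform treats user-initiated stops differently from runtime failures.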
Scheduling time
The scheduling time is used to configure the period, time, and other information for the automatic execution of a scheduling node.
If the node is in a workflow, parameters related to Scheduling Time are set in the Scheduling Configuration on the workflow page. If the node is not in a workflow, the Scheduling Time is set in the Scheduling Configuration for each node.
Precautions
The scheduling frequency of a task is independent of the scheduling period of its ancestor tasks
A task's scheduling frequency depends on its own scheduling period, not the scheduling period of its ancestor tasks.
DataWorks supports dependencies between tasks with different scheduling periods
In DataWorks, a recurring instance is generated for a scheduling node based on the scheduling type and period that you specify. For example, if you configure a node to run hourly, a corresponding number of hourly instances are generated for the node each day. The node runs using these instances. Dependencies set for a recurring task are essentially dependencies between the instances that the tasks generate. If the scheduling types of ancestor and descendant nodes are different, the number of recurring instances generated and their dependencies will also be different. For more information about dependencies between ancestor and descendant nodes with different scheduling periods, see Select a dependency type (cross-cycle dependency).
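As a rough illustration (hypothetical helper, not a DataWorks API), generating one instance per cycle for a single day looks like the following. An hourly task yields 24 instances while a daily task yields 1, which is why a cross-period dependency is really a dependency between two unequal sets of instances.

```python
from datetime import datetime, timedelta

def generate_instances(day: datetime, period_hours: int) -> list[datetime]:
    """Generate the scheduled times of the recurring instances for one day.

    Simplified sketch: one instance per cycle, starting at 00:00.
    """
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return [start + timedelta(hours=h) for h in range(0, 24, period_hours)]

hourly = generate_instances(datetime(2024, 6, 15), period_hours=1)
daily = generate_instances(datetime(2024, 6, 15), period_hours=24)
print(len(hourly), len(daily))  # 24 1
```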
Tasks that are not scheduled on a daily basis perform a dry-run
In DataWorks, tasks that are not scheduled on a daily basis, such as weekly or monthly tasks, still generate instances every day. On days outside the scheduled cycle, these instances perform a dry run: when the scheduled time is reached, the instance immediately returns a successful state without running the task code. If a daily descendant task exists, this successful state triggers the descendant task to run as scheduled, even though the ancestor only performed a dry run.
Task running time description
This setting only defines the expected scheduling time for the task. The actual running time of the task is affected by multiple factors, such as the scheduled time of the ancestor node, the availability of task execution resources, and the task's actual running conditions. For more information, see Conditions for running a task.
Configure the scheduling time
Parameter | Description |
Scheduling period | The scheduling period is the interval at which a task is automatically run in a scheduling scenario. It defines how often the code logic of a node actually runs in the scheduling system of the production environment. A recurring instance is generated for a scheduling node based on the scheduling type and period that you specify. For example, if you configure a node to run on an hourly basis, the specified number of hourly instances are generated for the node every day, and the recurring task runs automatically using these instances. Important: For weekly, monthly, and yearly scheduling, instances are still generated every day during non-scheduling periods. These instances show a successful state, but they only perform a dry run and do not actually run the task. |
Effective date | The scheduling node takes effect and is automatically scheduled within the effective date range. Tasks that exceed the effective date will not be automatically scheduled. These tasks are expired tasks. You can view the number of expired tasks on the O&M dashboard and unpublish them as needed. |
Cron expression | This expression is automatically generated based on the time property configuration and does not need to be configured. |
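Although the cron expression is generated automatically, it can help to see how the time properties map onto it. The following sketch (a simplified illustration, not platform code) builds a Quartz-style expression, with an assumed field order of seconds, minutes, hours, day-of-month, month, and day-of-week, for a task that runs every hour at a given minute within a time range:

```python
def build_cron(minute: int = 0, start_hour: int = 0, end_hour: int = 23) -> str:
    """Build a Quartz-style cron expression for an hourly task.

    Assumed field order: seconds minutes hours day-of-month month day-of-week.
    Illustrative only; the platform generates its cron expressions for you.
    """
    return f"00 {minute:02d} {start_hour:02d}-{end_hour:02d}/1 * * ?"

print(build_cron())  # 00 00 00-23/1 * * ?
```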
Scheduling dependencies
The scheduling dependencies of a task in DataWorks refer to the Directed Acyclic Graph (DAG) between nodes in a scheduling scenario. A descendant node task starts to run only after its ancestor node tasks run successfully. Configuring scheduling dependencies ensures that the scheduling task can obtain the correct data when it runs. After an ancestor node runs successfully, DataWorks detects that the latest data for the ancestor table is generated, which allows the descendant node to retrieve the data. This prevents the descendant node from failing to retrieve data because the ancestor table data has not been generated.
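The "ancestors run before descendants" rule of a DAG can be sketched with the standard-library topological sorter. The task names below are hypothetical; the point is only that any valid run order executes every ancestor before the nodes that depend on it.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of ancestor tasks it depends on.
dependencies = {
    "ods_log": set(),              # source-table task, no ancestors
    "dwd_clean": {"ods_log"},      # runs only after ods_log succeeds
    "ads_report": {"dwd_clean"},   # runs only after dwd_clean succeeds
}

# static_order yields a run order in which every ancestor precedes
# its descendants (and raises CycleError if the graph is not acyclic).
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['ods_log', 'dwd_clean', 'ads_report']
```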
Precautions
After node dependencies are configured, a descendant node runs only after all of the ancestor nodes that it depends on have run successfully. This prevents the current task from encountering data quality issues caused by reading data that has not been fully generated.
The actual running time of a task depends on both its own scheduled time and the completion time of its ancestor tasks. If an ancestor task has not finished running, the descendant task will not run, even if its scheduled time is earlier than that of the ancestor task. For more information about the conditions for running a task, see Diagnose a running task.
Configure scheduling dependencies
The primary purpose of task dependencies in DataWorks is to ensure that descendant nodes can retrieve data correctly. This is essentially a data lineage dependency between ancestor and descendant tables. You can choose whether to configure scheduling dependencies based on the data lineage of the tables according to your business needs. The process for configuring node scheduling dependencies is as follows.
After a node dependency is configured, a strong dependency exists by default between the output tables of the ancestor and descendant nodes. Therefore, when you configure scheduling dependencies for a task, you must confirm if a strong data lineage dependency exists. A strong data lineage dependency exists if the data output of the descendant node depends on the data output of the ancestor node. This confirmation prevents issues where the current task cannot retrieve data because the ancestor data has not been generated.
Ordinal number | Description |
① | To prevent the current task from running at an unexpected time, you can first assess whether there is a strong dependency between the tables and confirm whether you need to configure scheduling dependencies based on data lineage. |
② | Confirm whether the current scenario involves table data generated by a recurring task. If table data is not generated by a recurring scheduling task in DataWorks, DataWorks cannot monitor data output through the task running status, so such tables do not support scheduling dependencies. Tables whose data is not generated by recurring scheduling in DataWorks include but are not limited to tables whose data is uploaded from an on-premises machine and tables whose data is written by real-time synchronization tasks or by systems outside DataWorks. |
③④ | Choose to depend on the same cycle or the previous cycle of the ancestor node based on whether you need to depend on yesterday's or today's data from the ancestor node, and whether an hourly or minute-based task needs to depend on its own previous hour or minute instance.
Note For details on configuring scheduling dependency scenarios based on data lineage, see Select a dependency type (same-cycle dependency). |
⑤⑥⑦ | After the dependency is configured and published to the production environment, you can check whether the task dependency meets expectations in the Auto Triggered Task section of Operation Center. |
Configure custom node dependencies
If no strong data lineage dependency exists between tasks in DataWorks, or if the dependent data is not from a table generated by a recurring scheduling node, you can customize the node's dependencies. For example, a task may not strongly depend on a specific partition of the ancestor data but only retrieves data from the latest partition at the current time. Another example is when data is from a locally uploaded table. You can configure custom dependencies in the following ways:
Depend on the root node of the workspace
In scenarios where the input data for a synchronization task originates from other business databases, or where an SQL-type task processes table data generated by a real-time synchronization task, you can attach the dependency directly to the root node of the workspace.
Depend on a zero load node
If a workspace contains many or complex business processes, you can use a zero load node to manage them. You can attach the dependencies of nodes that require central management to a specific zero load node to clarify the data forwarding path in the workspace. For example, you can control the overall scheduling time of a business process or control its overall scheduling, including freezing it (disabling scheduling).
Node output parameters
After you define an output parameter and its value for an ancestor node, you can define an input parameter for a descendant node whose value references the output parameter of the ancestor node. This allows the descendant node to use this parameter to obtain the value passed from the ancestor node.
Precautions
The Output Parameter of a node is used only as an input parameter for a descendant node. You can add a parameter in the scheduling parameter section of the descendant node and associate it with the ancestor parameter by clicking the corresponding icon in the Actions column. Some nodes cannot directly pass the query results of an ancestor node to a descendant node. To pass the query results of an ancestor node to a descendant node, use an assignment node. For more information, see Assignment node.
The following node types support node output parameters: EMR Hive, EMR Spark SQL, ODPS Script, Hologres SQL, AnalyticDB for PostgreSQL, and MySQL.
Configure node output parameters
The value of a Node Output Parameter can be a Constant or a Variable.
After you define the output parameter and submit the current node, you can select Bind The Output Parameter Of The Ancestor Node to use it as an input parameter for the descendant node when you configure scheduling parameters for the descendant node.

Parameter name: The name of the defined output parameter.
Parameter value: The value of the output parameter. The value type can be a constant or a variable:
A constant is a fixed string.
A variable can be a system-supported global variable, a built-in scheduling parameter, or a custom parameter.
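The data flow described above, where an ancestor publishes an output parameter and a bound input parameter carries the value into the descendant, can be mirrored in a short sketch. The node functions and the parameter name `region_list` are hypothetical; in DataWorks the binding itself is done on the scheduling configuration panel, not in code.

```python
def run_ancestor() -> dict:
    """The ancestor node finishes and publishes its output parameters
    as name -> value pairs."""
    return {"region_list": "cn-hangzhou,cn-shanghai"}

def run_descendant(inputs: dict) -> str:
    """The descendant reads the value through its bound input parameter."""
    return f"processing regions: {inputs['region_list']}"

outputs = run_ancestor()          # ancestor succeeds and exposes its outputs
print(run_descendant(outputs))    # descendant consumes the passed value
```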
References
Scheduling parameters: For more information, see Formats of scheduling parameters.
Scheduling policy:
For more information, see Generate instances immediately after publishing.
For more information, see Dry-run a task.
Scheduling time: For more information, see Scheduling time.
Scheduling dependencies:
For more information, see Cross-cycle dependencies.
For more information, see Same-cycle dependencies.
For more information, see Dependencies in complex scenarios.
For more information, see Dependencies in special scenarios.
Other references:
For more information, see Impact of daylight saving time switch on the running of scheduling tasks.