Data development process

This topic describes the basic development process in the new version of DataStudio.

Prerequisites

A DataWorks workspace is created, and DataStudio is enabled.
DataStudio must be enabled for the workspace before you can follow this guide. You can enable it in one of the following ways:
- When you create a workspace, select Use The New Version Of Data Development (DataStudio).
- You can upgrade an existing workspace from Data Development to DataStudio. On the Data Development page, click the Upgrade button at the top and follow the on-screen instructions to complete the process.
- After February 18, 2025, the new version of DataStudio is enabled by default when an Alibaba Cloud account enables DataWorks and creates a workspace for the first time in one of the following regions:
  China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
A computing resource is bound to the DataWorks workspace. You can select a computing resource based on your needs. For more information, see Bind a computing resource.

Enter the DataStudio interface

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

Development folder planning

DataStudio supports data development in different folders. The differences between the folders are described below. You can choose a folder based on your business needs.

Folder type	Permission scope	Features	Scenarios
Personal Folder	Personal account level	Visible only to the current account. Supports code debugging. Does not support creating scheduled tasks. Supports a limited number of file types, such as `.ipynb` (Notebook files), `.sh` files, `.py` files, and `.sql` files. Files in the Personal Folder can be submitted to the Project Folder.	Personal development and testing
Project Folder	Workspace level	Supports collaborative development for teams. You can create various types of nodes and recurring workflows.	Production tasks that require recurring scheduling
Manual Folder	Workspace level	Supports one-time tasks and manually triggered workflows. Independent of the recurring scheduling system. After being published to the production environment, tasks must be manually run in the Operation Center.	Temporary, manually run tasks
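
The choice described in the table can be read as a small decision rule. The following is only an illustrative sketch in Python, not part of DataWorks: the `Folder` enum, the `choose_folder` helper, and its two parameters are hypothetical names that mirror the Scenarios column.

```python
from enum import Enum


class Folder(Enum):
    PERSONAL = "Personal Folder"
    PROJECT = "Project Folder"
    MANUAL = "Manual Folder"


def choose_folder(recurring: bool, runs_in_production: bool) -> Folder:
    """Hypothetical helper that mirrors the Scenarios column of the table."""
    if recurring:
        # Production tasks that require recurring scheduling.
        return Folder.PROJECT
    if runs_in_production:
        # One-time or manually triggered tasks that are published
        # and then run by hand in the Operation Center.
        return Folder.MANUAL
    # Personal development and testing; no scheduled tasks.
    return Folder.PERSONAL


print(choose_folder(recurring=False, runs_in_production=False).value)
```

A file that starts in the Personal Folder can later be submitted to the Project Folder, so the result of such a rule is a starting point rather than a permanent choice.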

Data development

After you decide which folder type fits your scenario, create a development folder of that type and develop your code in it.

Personal Folder development (for personal testing, temporary queries, and cross-project code synchronization)

Files in the Personal Folder are visible only to the current account. You can use these files for personal testing or temporary queries, but you cannot configure scheduling for them or publish them to the production environment. The Personal Folder is visible across all your workspaces and supports cross-workspace synchronization. To schedule and publish a file, you must first submit it from the Personal Folder to a Project Folder. You can then configure scheduling and publish the file from that Project Folder. For more information, see Personal Folder.

In the navigation pane on the left of DataStudio, click the Data Development icon to go to the Data Development folder.
In the Personal Folder section, click the create icon to create a folder, and then create files in it as needed.
To submit a file from the Personal Folder to the Project Folder in the workspace, click Submit To Project Folder at the top of the editing window. For the next steps, see Project Folder development (for production environments).

Project Folder development (for production environments)

Files in the Project Folder support collaborative team development. You can create different types of nodes and orchestrate their upstream and downstream dependencies. For more information, see Project Folder.

In the navigation pane on the left of DataStudio, click the Data Development icon to go to the Data Development folder.
Create a project folder, nodes, and workflows.
In the Project Folder section, click the create icon to create a folder, node, or workflow.
- Folder: You can use folders to manage nodes and workflows.
- Node: DataStudio supports a wide range of node types, such as Data Integration, Notebook, and MaxCompute SQL. For more information about the functions and differences of various nodes, see Node development.
- Workflow: A workflow is a tool for automating the management of data processing. It provides a visual canvas that lets you integrate various types of subtask nodes by dragging and dropping them. This makes it easy to establish dependencies between tasks, accelerate the creation of data processing flows, and improve development efficiency. For more information, see Recurring workflow.
Node orchestration.
- Node: For a standalone node, you must configure its upstream and downstream dependencies in the scheduling dependency settings.
  On the node editing page, click Scheduling Configuration in the right pane. Configure the Node Scheduling parameters to define the upstream and downstream dependencies for the node. Dependencies ensure that nodes run in the correct order. A descendant node runs only after its ancestor nodes run successfully. This ensures that the current node can retrieve the correct data at the right time.
- Workflow: A workflow lets you visually orchestrate the upstream and downstream dependencies of nodes on a canvas. You can plan the orchestration as required.
Node development.
DataStudio supports a wide range of node types, and the configurable content varies by node type. To complete the configuration for a specific node type, see Node development.
Note
During node development, you can define variables using the ${Variable_Name} format. You can then assign constant values to the variables during the testing phase and dynamically assign values during scheduling configuration.
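
The idea of replacing `${Variable_Name}` placeholders with constant values during testing can be illustrated locally. The sketch below uses Python's `string.Template`, which happens to accept the same `${name}` syntax. It is not how DataWorks resolves variables internally, and the SQL text, table name, and parameter names are hypothetical.

```python
from string import Template

# Hypothetical node code that uses the ${Variable_Name} placeholder format.
node_sql = "SELECT * FROM sales WHERE ds = '${bizdate}' AND region = '${region}';"


def render(code: str, params: dict[str, str]) -> str:
    """Replace every ${name} placeholder with the constant assigned to it.

    Template.substitute raises KeyError if a placeholder has no value,
    which surfaces a forgotten assignment early.
    """
    return Template(code).substitute(params)


# Constant values, as you would assign them for a test run.
debug_sql = render(node_sql, {"bizdate": "20250218", "region": "cn-hangzhou"})
print(debug_sql)
# SELECT * FROM sales WHERE ds = '20250218' AND region = 'cn-hangzhou';
```

The same code text stays unchanged between testing and scheduling; only the source of the values differs.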

Manual Folder development (for one-time tasks)

You can create one-time tasks or manually triggered workflows in the Manual Folder for one-time data processing scenarios that do not require recurring scheduling.

In the navigation pane on the left of DataStudio, click the Manual Folder icon to go to the Manual Folder.
Create development folders and nodes under One-time Task or Manually Triggered Workflow as needed. For more information, see One-time Task and Manually Triggered Workflow.

Test

After you finish developing a node, click Debug Configuration on the right side of the node editing page to set the debug parameters. Then, click Run in the toolbar to execute the node.

When you configure debug settings, you can set the following parameters:

In Computing Resource, specify the computing resource for submitting the task for debugging.
In DataWorks Configuration, specify the resource group for DataWorks task execution.
If you defined variables in your code using the ${Variable_Name} format, you can assign constant values to them in Script Parameters.
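
Before a test run, it can help to confirm that every variable in the code has a value in Script Parameters. The following Python sketch shows one way to check this outside DataWorks. The `unassigned_variables` helper is hypothetical, and the regular expression assumes that variable names consist of letters, digits, and underscores, which the original text does not specify.

```python
import re

# Assumption: variable names contain only letters, digits, and underscores.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unassigned_variables(code: str, script_params: dict[str, str]) -> list[str]:
    """Return the variables used in code that have no value in script_params.

    Names are reported once each, in order of first appearance.
    """
    missing: list[str] = []
    for name in PLACEHOLDER.findall(code):
        if name not in script_params and name not in missing:
            missing.append(name)
    return missing


code = "SELECT * FROM t WHERE ds = '${bizdate}' AND hh = '${hour}' AND d2 = '${bizdate}';"
print(unassigned_variables(code, {"bizdate": "20250218"}))
# ['hour']
```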

Note

Recurring workflows do not support debugging the entire workflow. You must debug each inner node individually.
Manually triggered workflows support running the entire workflow.

Scheduling configuration and publishing

Scheduling configuration

After you debug a node that must run on a recurring schedule in the production environment, click Scheduling Configuration on the right side of the node editing page to configure its scheduling properties.

Scheduling Parameters: Define the parameters used for node scheduling. DataWorks provides multiple assignment formats. If you defined variables using the ${Variable_Name} format during node development, you can use scheduling parameters to dynamically assign values to variables in scheduling scenarios.
Scheduling Policy: Define scheduling properties for the node in the scheduling environment, other than the execution frequency and specific execution time.
Scheduling Time: Define the execution frequency and specific execution time for the node in the scheduling environment.
Scheduling Dependencies: Define the upstream and downstream dependencies for the task. Dependencies ensure that nodes run in the correct order. A descendant node starts only after its ancestor nodes run successfully. This process ensures that the current node retrieves data correctly and at the right time.
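
The dependency rule (a descendant node starts only after its ancestor nodes run successfully) can be sketched as a topological traversal. The example below uses Python's standard `graphlib` module. It is a simplified model, not the DataWorks scheduler, and the node names, the `simulate` function, and the `blocked` status label are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: each key maps to the set of ancestor nodes it depends on.
dependencies = {
    "load_orders": set(),
    "clean_orders": {"load_orders"},
    "daily_report": {"clean_orders"},
}


def simulate(deps: dict[str, set[str]], failing: frozenset[str] = frozenset()) -> dict[str, str]:
    """Visit nodes in dependency order and record a status for each one.

    A node runs only if all of its ancestors succeeded; otherwise it is
    marked as blocked and never starts.
    """
    status: dict[str, str] = {}
    for node in TopologicalSorter(deps).static_order():
        if any(status[ancestor] != "success" for ancestor in deps.get(node, ())):
            status[node] = "blocked"
        elif node in failing:
            status[node] = "failed"
        else:
            status[node] = "success"
    return status


print(simulate(dependencies))
print(simulate(dependencies, failing=frozenset({"clean_orders"})))
```

In the second call, `daily_report` never starts because its ancestor `clean_orders` did not succeed, which is the behavior that keeps a node from reading incomplete upstream data.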

Note

The scheduling configuration for a recurring workflow is different from that of a standalone node. For more information, see Recurring workflow.

Node publishing

After you configure the scheduling properties of the node, click the Publish button in the toolbar at the top of the node editing page, and then click Start Publishing To Production. The task then goes through the publishing check process. After it is published to the production environment, the node is scheduled to run periodically. For more information, see Publish a node or workflow.

Note

Publishing is subject to the checkers that are enabled, so the operation can fail. After the publishing process is complete, confirm the final publishing status of the task in the production environment.

Task O&M

After a node is published, an auto triggered task is generated in the production environment of the Operation Center. You can go to the Operation Center to view or adjust the properties and status of the auto triggered task and perform a data backfill for a specific data timestamp.

Quick start

When you open DataStudio, the Welcome page is displayed by default. You can follow the on-screen instructions to try a classic Notebook example or complete the DataStudio introductory tutorial.