
DataWorks: Develop with Notebooks

Last Updated: Oct 29, 2025

DataWorks Notebooks support multiple cell types and provide an interactive, modular analysis environment to help you efficiently process and analyze data, create visualizations, and build models.

Features

In DataWorks, you can use Notebook nodes to build an interactive, modular, and reusable analysis environment.

  • Multi-engine development: DataWorks Notebooks include a SQL Cell feature that supports SQL development and analysis on multiple big data engines.

  • Interactive analysis:

    • Interactive SQL queries: You can write widgets in Python to visually select or set parameter values. You can then reference these parameters and their values in SQL to enable interactive queries between Python and SQL.

    • Write SQL query results to a DataFrame: You can store SQL query results directly in a Pandas DataFrame or MaxFrame DataFrame object and pass these results as variables to subsequent cells.

    • Generate visual charts: You can read the DataFrame variable in a Python cell to plot charts based on the data. This creates an efficient interaction between Python and SQL.

  • Integrated big data and AI development: In a DataWorks Notebook, you can use libraries such as Pandas for data cleaning and preparation to ensure that the data meets the input requirements of your algorithm models. You can then use the cleaned data to easily develop, train, and evaluate your models. This provides a seamless connection between big data and AI.

  • Intelligent code generation: DataWorks Notebooks have a built-in intelligent programming assistant that supports generating SQL and Python code with DataWorks Copilot to improve development efficiency.

  • Attach datasets: In DataWorks Notebooks, on the Scheduling Configuration > Scheduling Policy tab, you can add a dataset to a Notebook. This allows the node to read data from OSS or NAS, or write files to OSS or NAS during runtime.

Prerequisites

Notes

When you run this task by using a Serverless resource group, a single task supports a maximum configuration of 64 CUs. However, we recommend that you do not exceed 16 CUs. This prevents resource shortages caused by excessive CU usage, which can affect task startup.

Supported cell types

  • SQL cell:

    • Supported SQL types: MaxCompute SQL, Hologres SQL, EMR Spark SQL, StarRocks SQL, Flink SQL Batch, and Flink SQL Streaming.

    • Supported computing resources: MaxCompute, Hologres, EMR Serverless Spark, EMR Serverless StarRocks, and Fully Managed Flink.

  • Python cell.

  • Markdown cell.

Create a personal development environment instance

Notebooks run on personal development environment instances. Before you start, you must create and switch to a target instance. You can install dependencies for Notebook node development, such as third-party Python libraries, in a personal development environment instance.

Create a Notebook node

  1. Go to the Data Studio (New Version) page.

    Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. Create a Notebook.

    In DataWorks, you can create a Notebook in the Project Folder, My Folder, or under One-Time Tasks.

    • In the navigation pane on the left, click the image icon to go to the Data Development page. Create a Notebook in the Project Folder or My Folder.

      • Create a Notebook in the Project Folder:

        • Click the image icon and select Notebook to create a new Notebook.

        • If you have already created a working directory, hover over the directory name, right-click, and choose New > New Node > Notebook to create a new Notebook.

        • If you have already created a workflow, you can add a Notebook node when editing the workflow.

      • Create a Notebook in My Folder:

        • Click the image icon to create a new Notebook file.

        • Click the image icon and add a file in .ipynb format to create a new Notebook.

        • If you have already created a folder, hover over the folder name, right-click, and choose New Notebook to create a new Notebook.

    • In the navigation pane on the left, click the image icon to go to the One-Time page. Under One-Time Tasks, click the image icon and choose New Node > Notebook to create a new Notebook.

Develop a Notebook node


1. Add a cell

In the Notebook node toolbar, you can click the SQL, Python, or Markdown button to quickly create the corresponding cell type. You can also quickly add a new cell above or below a specific cell in the code editor.

  • Add a cell above the current cell: Hover over the top edge of a cell to display the add button. Click the button to insert a new cell above the current one.

  • Add a cell below the current cell: Hover over the bottom edge of a cell to display the add button. Click the button to insert a new cell below the current one.

Note

To reorder cells, hover over the blue line in front of a cell, and then drag it to a new position.

2. (Optional) Switch the cell type

In a cell, you can click the Cell Type button in the lower-right corner to switch between cell types. For more information about cell types, see Supported cell types.

  • You can switch a SQL cell between SQL types, for example, from MaxCompute SQL to Hologres SQL or another SQL type.

  • You can change a SQL cell to a Python or Markdown cell, or switch a Python or Markdown cell to a SQL cell.

Note

When you switch the cell type, the content is retained. You must manually adjust the code in the cell to match the new type.

3. Develop cell code

You can edit SQL, Python, and Markdown code in the corresponding cells. When you develop code in a SQL cell, make sure that the SQL syntax matches the selected SQL cell type, which corresponds to the computing resource type. You can ask DataWorks Copilot for programming assistance. You can access the intelligent assistant in the following ways:

  • From the cell toolbar: Click the image icon in the upper-right corner of the cell to open the Copilot chat box in the editor for programming assistance.

  • From the cell's context menu: Right-click the cell and choose Copilot > Chat In Editor for programming assistance.

  • Using a keyboard shortcut:

    • macOS: Press Command+I to open the intelligent assistant chat box.

    • Windows: Press Ctrl+I to open the intelligent assistant chat box.

Run a Notebook

1. Select a personal development environment

When you run a Notebook directly in DataStudio, the Python cells in the Notebook run in a personal development environment. Therefore, at the top of the page, select an existing personal development environment instance as the runtime environment for the Notebook.

2. Confirm or switch the Python kernel

Click the image icon in the upper-right corner of the Notebook node to confirm the Python kernel version for the current Python cell, or to switch to another Python kernel version.

3. (Optional) Select a computing resource

If the Notebook contains SQL cells, select a computing resource for each SQL cell. The selected computing resource must match the SQL cell type. For more information, see Supported cell types.

4. Run Notebook cells

After you finish developing the Notebook cells, you can test all cells or run a single cell.

  • Run all cells: After editing the Notebook, click the image icon at the top to test and run all cells in the Notebook node.

  • Run a single cell: After editing a cell within the Notebook, click the image icon to the left of the cell to test and run it.

5. View the results

SQL cell

You can write various types of SQL scripts in a cell. After you run a SQL script, the results are printed below the cell.

  • Scenario 1: If the SQL does not contain a SELECT statement, only the run log is displayed by default after the cell is executed.

    CREATE TABLE IF NOT EXISTS product (
        product_id BIGINT,
        product_name STRING,
        product_type STRING,
        price DECIMAL(10, 2)
    )
    LIFECYCLE 30; -- The data lifecycle is 30 days. Data is automatically deleted after this period. This setting is optional.
  • Scenario 2: If the SQL contains a SELECT statement, the run log is displayed, and the results can be viewed in two ways: as a table or as a visual chart. The system also automatically generates a DataFrame variable from the query results.

    SELECT
        product_id,
        product_name,
        product_type,
        price
    FROM product;
    • Generate a DataFrame data object:

      The SQL cell automatically generates a return variable. You can click the df_* variable name in the lower-left corner of the SQL cell to rename the generated DataFrame variable.


    • View the SQL query table: After the SQL query runs, the results are displayed in a table in the log area by default.


    • View the visual chart for the SQL query

      After the SQL query runs, click the image icon on the left of the log area to view a visual chart of the data generated by the query.


Python cell

You can write Python scripts in a cell. After you run a Python script, the results are printed below the cell.

  • Scenario 1: Print only text output.

    print("Hello World")
  • Scenario 2: Use a Pandas DataFrame.

    import pandas as pd
    
    # Define product data, including details: product name, region, and login frequency.
    product_data = {
        'Product_Name': ['DataWorks', 'RDS MySQL', 'EMR Spark', 'MaxCompute'],
        'Product_Region': ['East China 2 (Shanghai)', 'North China 2 (Beijing)', 'South China 1 (Shenzhen)', 'Hong Kong'],
        'Login_Frequency': [33, 22, 11, 44]
    }
    
    # Create a DataFrame from the given data.
    df_products = pd.DataFrame(product_data)
    
    # Print the DataFrame to show the product information.
    print(df_products)


  • Scenario 3: Plot a chart.

    import matplotlib.pyplot as plt
    
    # Data
    categories = ['DataWorks', 'RDS MySQL', 'MaxCompute', 'EMR Spark', 'Hologres']
    values = [23, 45, 56, 78, 30]
    
    # Create a bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(categories, values, color=['blue', 'green', 'red', 'purple', 'orange'])
    
    # Add a title and labels
    plt.title('Example Bar Chart')
    plt.xlabel('category')
    plt.ylabel('value')
    
    # Show the chart
    plt.show()


Markdown cell

  • After you finish writing, click the image icon to display the formatted Markdown text.

    # DataWorks Notebook

Note

In a Markdown cell that is already displaying formatted text, click the image icon to continue editing the cell.

What to do next: Publish the node

  • Configure scheduling: If a Notebook in the Project Folder needs to run on a recurring schedule in the production environment, you must configure its scheduling properties. For example, you can specify a recurring schedule time.

    By default, Notebooks in the Project Folder, My Folder, or under One-Time Tasks run on the kernel of your personal development environment. When you publish a Notebook to the production environment, the system uses the image environment that you selected in the scheduling configuration. Before you publish the Notebook, ensure that the selected image contains the necessary dependencies for the Notebook node to run. You can create a DataWorks image from a personal development environment to use for scheduling.

  • Publish the node: A Notebook node runs according to its scheduling configuration only after it is published to the production environment. You can publish a node to the production environment in the following ways.

    • Publish a Notebook from the Project Folder: Save the Notebook, and then click image to publish it. After publishing, you can view the Notebook task on the Task O&M > Recurring Task O&M > Auto Triggered Tasks page in the Operation Center.

    • Publish a Notebook from My Folder: Save the Notebook. Click the image icon to submit the Notebook from My Folder to the Project Folder. Then, click image to publish the Notebook. After publishing, you can view the Notebook task on the Task O&M > Recurring Task O&M > Auto Triggered Tasks page in the Operation Center.

    • Publish a Notebook from One-Time Tasks: Save the Notebook, and then click image to publish it. After publishing, you can view the Notebook task on the Task O&M > One-time Task O&M > One-time Tasks page in the Operation Center.

  • Unpublish a task: To unpublish a Notebook, right-click the node, select Delete, and follow the on-screen instructions to unpublish or delete the Notebook.

Scenarios and practices

Use built-in Magic Commands to connect to a MaxCompute computing resource

In a Python cell, you can use built-in Magic Commands to connect to a MaxCompute computing resource. This avoids the need to repeatedly define connection information and plaintext AccessKey information in Python.

Note

Before you connect to a MaxCompute computing resource, ensure that you have attached a MaxCompute (ODPS) computing resource.

Scenario 1: Establish a MaxCompute MaxFrame Session connection

When developing in a Python cell, you can use the following built-in Magic Command to open the MaxCompute computing resource selector and access the MaxCompute MaxFrame service.

  • Use a Magic Command to connect to and access a MaxCompute MaxFrame Session.

    mf_session = %maxframe
  • Use a Magic Command in a Python cell to release the MaxCompute MaxFrame connection:

    mf_session.destroy()
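
After the session is created (and before you release it), you can use the MaxFrame DataFrame API through that session. The following is a minimal, hypothetical sketch; the table name is illustrative, and the exact calls may vary with your MaxFrame version:

    import maxframe.dataframe as md

    # Read a MaxCompute table into a MaxFrame DataFrame (the table name is a placeholder).
    df = md.read_odps_table('product')

    # Trigger distributed execution in the session created above and fetch a small sample locally.
    print(df.head(10).execute().to_pandas())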

Scenario 2: Connect to a MaxCompute computing resource

When developing in a Python cell, you can use the following built-in Magic Command to open the MaxCompute computing resource selector. This lets you interact with MaxCompute using Python for operations such as data loading, queries, and DDL operations.

  1. Use a Magic Command to create a MaxCompute connection.

    Entering the following command in a cell opens the MaxCompute computing resource selector.

    o = %odps
  2. Use the obtained MaxCompute computing resource to run a PyODPS script.

    For example, to retrieve all tables in the current project:

    with o.execute_sql('show tables').open_reader() as reader:
        print(reader.raw)
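
    Beyond listing tables, the same connection supports DDL statements and loading query results into pandas. The following is a minimal, hypothetical sketch; the table names are illustrative, and the availability of reader.to_pandas() depends on your PyODPS version:

    # DDL through the PyODPS connection obtained above (illustrative table names).
    o.execute_sql('CREATE TABLE IF NOT EXISTS product_copy LIKE product')

    # Load query results into a pandas DataFrame for further analysis.
    with o.execute_sql('SELECT * FROM product LIMIT 10').open_reader() as reader:
        pdf = reader.to_pandas()
    print(pdf)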

Write data from a dataset to a MaxCompute table

DataWorks supports creating NAS-type datasets. You can then use the dataset in Notebook development to read and write data in NAS storage.

The following example shows how to write test data (testfile.csv) from a dataset attached to a personal development environment instance (mount path: /mnt/data/dataset02) to a MaxCompute table (mc_testtb).

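The following is a minimal sketch of this workflow. It assumes a MaxCompute connection `o` obtained with the %odps Magic Command (see the previous section) and that the mc_testtb table already exists with a schema that matches the CSV columns:

    import pandas as pd

    # Read the test file from the dataset mount path of the personal development environment instance.
    df = pd.read_csv('/mnt/data/dataset02/testfile.csv')

    # Write the rows to the MaxCompute table through the PyODPS connection (assumes mc_testtb exists).
    o.write_table('mc_testtb', df.values.tolist())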

Pass SQL cell results to a Python cell

When a SQL cell produces output, a DataFrame variable is automatically generated. This variable can be accessed by a Python cell, enabling interaction between SQL and Python cells.

  1. Run the SQL cell to generate a DataFrame.

    • If the SQL cell contains one query, the result of that query is automatically captured as a DataFrame variable.

    • If the SQL cell contains multiple queries, the DataFrame variable will be the result of the last query.

    Note
    • The DataFrame variable name defaults to df_**. You can click the variable name in the lower-left corner of the cell to customize it.


  2. Retrieve the DataFrame variable in a Python cell.

    In a Python cell, you can retrieve the DataFrame variable by directly referencing its name.

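For example, if the DataFrame variable generated by the SQL cell is named df_products (a placeholder; use the name shown in the lower-left corner of your SQL cell), a Python cell can work with it directly. The sketch below assumes the result is a pandas DataFrame:

    # df_products is the DataFrame variable generated by the SQL cell above (placeholder name).
    print(df_products.head())

    # Basic summary statistics of the query result.
    print(df_products.describe())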

Reference a Python resource in a Notebook

During Notebook development, you can reference a MaxCompute resource using the format ##@resource_reference{"custom_name.py"}. The following is a simple example of how to reference a Python resource:

Note

Referencing a Python resource in a Notebook only works in the production environment. It does not work in the development environment. You must publish the Notebook to the production environment and execute it in the Operation Center.

Create a new Python resource

  1. Add a Python resource file.

    1. Go to the DataWorks Workspaces page. In the top navigation bar, switch to the destination region. Find the created workspace and click Quick Access > Data Studio in the Actions column to go to DataStudio.

    2. In the navigation pane on the left, click image to go to Resource Management.

    3. On the Resource Management page, click the New button or the image icon. You can also first create a folder to organize your resources, and then right-click the folder and choose New to select the specific resource or function type to create.

    4. Create a MaxCompute Python resource.

      In this example, the Python resource is named hello.py.

  2. Edit the content of the Python resource file. The following is sample code:

    # hello.py
    def greet(name):
        print(f"Hello, {name}!")

    After editing, click Save to save the Python code.

  3. After you edit and save the code, click the image icon to commit the hello.py resource.

  4. After the resource is committed, click the image icon to publish the hello.py resource to the development and production environments.

Reference the Python resource

  1. Add a Notebook node. For more information, see Create a Notebook node.

  2. Add a Python cell to the Notebook. For more information, see Add a cell.

    In the Python cell, write ##@resource_reference{"hello.py"} to reference the new MaxCompute Python resource. The following is sample code:

    # This comment references a Python resource named hello.py during scheduling.
    ##@resource_reference{"hello.py"}
    
    import sys
    import os
    
    # Add the directory that contains the downloaded hello.py resource to the module search path.
    sys.path.append(os.path.abspath('.'))  # Adjust the path if the resource is downloaded to a different directory.
    from hello import greet  # Replace with the actual function name.
    greet('DataWorks')
    
  3. After you write the code in the Python cell and configure the node scheduling, save and publish the Notebook node.

  4. Go to the Operation Center (Workflow). On the Recurring Task O&M > Auto Triggered Tasks page, find the published Notebook node. In the Actions column, click Backfill Data to perform a data backfill for the Notebook node. For more information about data backfill, see Perform data backfill and view the data backfill instance (new version).

  5. After the data backfill is complete, you can view the run log of the Notebook node to confirm whether the Python cell was executed successfully.

Reference workspace parameters in a Notebook

During Notebook development, you can reference workspace parameters in SQL and Python cells using the format ${workspace.param}. The following is a simple example of how to reference a workspace parameter.

Note
  • Before you reference a workspace parameter in a cell, you must create the workspace parameter.

  • In the example, param is the name of the workspace parameter you created. Replace it with the name of your desired workspace parameter during development.

  • Reference a workspace parameter in a SQL cell.

    SELECT '${workspace.param}';

    This queries the workspace parameter. After a successful run, the specific value of the workspace parameter is printed.

  • Reference a workspace parameter in a Python cell.

    print('${workspace.param}')

    This outputs the workspace parameter. After a successful run, the specific value of the workspace parameter is printed.

Use PySpark with Magic Commands

During Notebook development, you can use Magic Commands in a Python cell to quickly create and start a Livy service. This connects to MaxCompute Spark and EMR Serverless Spark computing resources for efficient development and debugging.

Connect to a computing resource using Python

In a Notebook's Python cell, you can use the following commands to quickly create, connect to, or release a Livy service on the target computing resource.

MaxCompute commands

  • %maxcompute_spark

    Description: Running this command performs the following operations:

      • Creates a Livy service

      • Starts the Livy service

      • Creates a Spark Session

    Note: You cannot view Livy and Spark Session information in the MaxCompute console.

    Notes:

      • Running a Notebook in DataStudio: When you run a Notebook in DataStudio, you must select the name of a personal development environment instance. The first time you run this command in a Notebook within the selected instance, a new Livy service is created. If the Livy service is not deleted, subsequent runs of the %maxcompute_spark command in the same instance will reuse the existing Livy service.

      • Running a Notebook after publishing to production: When a Notebook runs in the production environment, each task instance creates a new Livy service. The Livy service is automatically stopped and deleted when the task instance finishes running.

  • %maxcompute_spark stop

    Description: Running this command cleans up the Spark Session and stops the Livy service.

    Notes: To publish the Notebook task to the production environment, the task code does not need to include this Magic Command.

  • %maxcompute_spark delete

    Description: Running this command deletes the Livy service.

    Notes: When a Notebook task instance runs in the production environment, the system automatically appends the %close_session command to the end of the code. This stops and deletes the Livy service for the current task instance.

    Note: The system-appended %close_session command actually executes the %maxcompute_spark delete command to clean up the Spark Session and delete the Livy service.

EMR Serverless Spark commands

  • %emr_serverless_spark

    Description: Running this command performs the following operations:

      • Creates a Livy service

      • Starts the Livy service

      • Creates a Spark Session

    Note:

      • After you run the command, you can go to the E-MapReduce console to view and manage the Livy Gateway and Spark Session of the EMR Serverless Spark engine.

      • A Livy service created through a DataWorks Notebook has a name prefixed with dw_AlibabaCloudAccountID.

    Notes:

      • Running a Notebook in DataStudio: When you run a Notebook in DataStudio, you must select the name of a personal development environment instance. The first time you run this command in a Notebook within the selected instance, a new Livy service is created. If the Livy service is not deleted, subsequent runs of the %emr_serverless_spark command in the same instance will reuse the existing Livy service.

      • Running a Notebook after publishing to production: When a Notebook runs in the production environment, each task instance creates a new Livy service. The Livy service is automatically stopped and deleted when the task instance finishes running.

  • %emr_serverless_spark stop

    Description: Running this command cleans up the Spark Session and stops the Livy service.

    Notes: To publish the Notebook task to the production environment, the task code does not need to include this Magic Command.

  • %emr_serverless_spark delete

    Description: Running this command deletes the Livy service.

    Notes: When a Notebook task instance runs in the production environment, the system automatically appends the %close_session command to the end of the code. This actively cleans up the Spark Session and deletes the Livy service.

    Note: The system-appended %close_session command actually executes the %emr_serverless_spark delete command to clean up the Spark Session and delete the Livy service.

Submit and execute Spark code using Python

You can add a Python cell in a Notebook to edit and execute PySpark code.

  1. Ensure that you are connected to the target computing resource. In a preceding Python cell, you must have already used a Magic Command (such as %emr_serverless_spark or %maxcompute_spark) to connect to the target computing resource. For more information, see Connect to a computing resource using Python.

  2. Write PySpark code.

    In a new Python cell, add the %%spark command to use the Spark computing resource connected in the previous step, and then edit your PySpark code. For example:

    %%spark
    spark.sql("DROP TABLE IF EXISTS dwd_user_info_d")
    spark.sql("CREATE TABLE dwd_user_info_d(id STRING, name STRING, age BIGINT, city STRING)")
    spark.sql("INSERT INTO dwd_user_info_d SELECT '001', 'Jack', 30, 'Beijing'")
    spark.sql("SELECT * FROM dwd_user_info_d").show()
    spark.sql("SELECT COUNT(*) FROM dwd_user_info_d").show()

    Note
    • If a Python cell includes the %%spark command, the code in the cell runs on the Spark engine of the connected computing resource.

    • If a Python cell does not include the %%spark command, the code runs only in the local environment.

Appendix: General operations

DataWorks Notebook operations are based on VS Code's Jupyter Notebook. The following are some general operations for cells:

Cell toolbar operations


  • Add a tag to a cell:

    • First time: Click the image icon in the cell toolbar, select Add Cell Tag, and add a tag in the pop-up window.

    • Subsequent times: Click the image icon below the cell to quickly add more tags.

  • Edit cell tags: Click the image icon in the cell toolbar and select Edit Cell Tags (JSON) to go to the JSON editor page and edit the tags.

  • Mark a cell as a parameter: Click the image icon in the cell toolbar and select Mark Cell as Parameters to add a parameter tag to the cell.

General node operations


  • View Notebook variables: In the Notebook toolbar at the top, click the image icon to view all variable parameters in the Notebook. This includes the Name, Type, Size, and Value of the variables.

  • View Notebook outline: In the Notebook toolbar at the top, click the image icon to view the text outline of the Notebook formed by Markdown cells.

  • Switch the Python runtime kernel: Click the image icon in the upper-right corner of the Notebook node to confirm the Python kernel version for the current Python cell, or to switch to another Python kernel version.