The MaxCompute SQL node in DataWorks lets you periodically schedule MaxCompute SQL tasks. You can integrate this node with other node types for combined scheduling. MaxCompute SQL tasks use a SQL-like syntax and are ideal for the distributed processing of massive data (terabyte-scale) in scenarios that do not require high real-time performance. This topic describes how to develop a MaxCompute SQL node task in DataWorks.
Introduction
MaxCompute SQL is used to process and query data in MaxCompute. It supports common SQL operations, such as SELECT, INSERT, UPDATE, and DELETE, along with specific MaxCompute syntax and functions. This lets you write SQL-like statements to query and process data without writing complex data processing logic. For more information about the SQL syntax, see SQL overview.
Prerequisites
A MaxCompute computing resource is attached to the DataWorks workspace.
(Optional) If you are a Resource Access Management (RAM) user, you must be a member of the workspace with the Developer or Workspace Admin role. Because the Workspace Admin role includes extensive permissions, it must be granted with caution. For more information about adding a member to a workspace, see Add a member to a workspace.
Note: If you use an Alibaba Cloud account, you can skip this step.
Limits
The following limits apply when you use a MaxCompute SQL node:
| Category | Description |
| --- | --- |
| Comments | Single-line comments start with --. Additional limits apply to comments. For more information, see MaxCompute SQL comments. |
| SQL submission | In ODPS SQL, you cannot run a statement that contains only SET or USE. These keywords must be used together with a specific SQL statement (see the sketch after this table). |
| SQL development | The SQL code size cannot exceed 128 KB. The number of SQL commands cannot exceed 200. |
| Query results | Only SQL statements that start with SELECT or WITH can output a formatted result set. Additional limits apply to query results. Note: If you encounter these limits, you can download the query results to your local computer for viewing. |
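For example, the following minimal sketch (using a hypothetical table named demo_table) shows the difference between an invalid submission and a valid one:
-- Invalid: the submission contains only a SET statement.
-- SET odps.sql.allow.fullscan=true;
-- Valid: the SET statement is submitted together with a SQL statement.
SET odps.sql.allow.fullscan=true;
SELECT * FROM demo_table LIMIT 10;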
Notes
Make sure that the account used to run the MaxCompute SQL task has the required permissions for the corresponding MaxCompute project. For more information, see Access control for DataWorks on MaxCompute and MaxCompute permissions.
The execution of MaxCompute SQL tasks depends on quota resources. If a task runs for a long time, go to the MaxCompute console to check the quota consumption and ensure that sufficient resources are available to run the task. For more information, see Computing resources - Quota management.
When you develop a MaxCompute SQL node task, if the task involves special parameters such as an OSS address, you must enclose them in double quotation marks (""). Otherwise, the task may fail to be parsed and run.
The execution order of statements that contain keywords such as SET and USE varies depending on the environment in DataWorks. For more information, see Appendix: SQL execution order in different environments.
In some extreme cases, such as an unexpected server power outage or an active/standby switchover, DataWorks may fail to completely stop the associated MaxCompute task processes. In this case, go to the corresponding project in the MaxCompute computing resource to stop the job.
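If you need to stop such a job from the MaxCompute side, one option is the MaxCompute kill command, run in the corresponding project (for example, in the odpscmd client). The following is a hedged sketch that assumes you have obtained the instance ID of the job from LogView or the MaxCompute console:
-- Stop the MaxCompute instance that corresponds to the task. The instance ID below is a placeholder.
kill 20240101000000000gxxxxxxxx;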
Create a MaxCompute SQL node
For information about how to create a MaxCompute SQL node, see Create a MaxCompute SQL node.
Develop a MaxCompute SQL node
On the node editing page, follow these steps to develop the node.
Develop SQL code
MaxCompute SQL commands are similar to standard SQL syntax. They support Data Definition Language (DDL), Data Manipulation Language (DML), and Data Query Language (DQL) statements, along with specific MaxCompute syntax. For more information about the syntax and examples, see SQL overview.
DataWorks also provides scheduling parameters that allow you to dynamically pass values to variables in your code. In a MaxCompute SQL node, you can define variables in the ${variable_name} format and then assign values to them in the Scheduling Parameters section of the Schedule tab. For more information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.
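For example, the following minimal sketch (assuming a hypothetical partitioned table named sales_log with a ds partition column, and a variable named bizdate that is assigned the built-in $bizdate scheduling parameter on the Schedule tab) references the variable in the node code:
-- The ${bizdate} variable is replaced with the value of the scheduling parameter at run time.
-- On the Schedule tab, assign the variable as: bizdate=$bizdate
SELECT * FROM sales_log WHERE ds = '${bizdate}' LIMIT 100;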
The following sections provide examples for different scenarios:
Note:
If you use a new data type in a MaxCompute V2.0 extension function, you must add SET odps.sql.type.system.odps2=true; before the SQL statement of the function and execute them together, as shown in the sketch after this note. This ensures that the new data type can be used. For more information about MaxCompute V2.0 data types, see Data type editions.
The execution order of MaxCompute SQL statements varies between the Data Development environment and Operation Center. For more information, see Appendix: SQL execution order in different environments.
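For example, the following minimal sketch (using a hypothetical non-partitioned table named demo_table with a score column) executes the SET statement together with a query that uses the TINYINT type, which belongs to the MaxCompute V2.0 data type edition:
-- Enable the MaxCompute V2.0 data type edition for the current session.
SET odps.sql.type.system.odps2=true;
-- TINYINT is a MaxCompute V2.0 data type; demo_table is a hypothetical table.
SELECT CAST(score AS TINYINT) AS score_tiny FROM demo_table;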
Create a table
You can use the CREATE TABLE statement to create non-partitioned tables, partitioned tables, foreign tables, and clustered tables. For more information, see CREATE TABLE. The following code provides an example:
-- Create a partitioned table named students.
CREATE TABLE IF NOT EXISTS students
(
    id BIGINT,
    name STRING,
    age BIGINT,
    birth DATE
)
PARTITIONED BY (gender STRING);

Insert data
You can use the INSERT INTO or INSERT OVERWRITE statement to insert or update data in a destination table. For more information, see Insert or overwrite data (INSERT INTO | INSERT OVERWRITE).
Use INSERT OVERWRITE instead of INSERT INTO to prevent unexpected data duplication. For more information, see Insert or overwrite data.
The following code provides an example:
-- Insert data.
INSERT OVERWRITE TABLE students PARTITION (gender='boy') VALUES (1, 'ZhangSan', 15, DATE '2008-05-15');

The INSERT statement can trigger the Compare DDL Fields feature. This feature checks and compares the fields in the SELECT clause with the fields of the destination table.
Note: This feature is not supported if the three-layer schema model is enabled for the MaxCompute project but not for the tenant.

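For example, in the following hedged sketch (assuming a hypothetical source table named students_tmp that contains the id, name, age, and birth columns), the columns in the SELECT clause are compared with the columns of the destination students table:
-- The fields in the SELECT clause are checked against the fields of the destination table.
INSERT OVERWRITE TABLE students PARTITION (gender='boy')
SELECT id, name, age, birth
FROM students_tmp;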
Query data
You can use the SELECT statement to perform operations such as nested queries, group queries, and sorting. For more information, see SELECT syntax. The following code provides an example:
-- Enable a full table scan. This setting is valid only for the current session.
SET odps.sql.allow.fullscan=true;
-- Query the information of all male students and sort the results by ID in ascending order.
SELECT * FROM students WHERE gender='boy' ORDER BY id;

By default, a RAM user does not have permissions to query data from production tables. To query data from production tables, you must apply for the required permissions in Security Center. For more information about the preset data permissions and access control for MaxCompute in DataWorks, see Details about data permission control for MaxCompute. For more information about how to grant permissions using commands in MaxCompute, see Manage user permissions using commands.
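For example, a project administrator can grant query permissions with MaxCompute ACL commands. The following is a hedged sketch that assumes a RAM user named data_dev under the Alibaba Cloud account your_account:
-- Add the RAM user to the MaxCompute project (required before granting permissions).
ADD USER RAM$your_account:data_dev;
-- Grant the Describe and Select permissions on the students table to the RAM user.
GRANT Describe, Select ON TABLE students TO USER RAM$your_account:data_dev;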
Use SQL functions
MaxCompute supports built-in functions and user-defined functions (UDFs) for data development and analysis. You can create and use SQL functions as needed. For more information about built-in functions, see Built-in functions. For more information about UDFs, see MaxCompute UDFs. The following examples describe how to use SQL functions.
Built-in functions: MaxCompute provides built-in functions that you can call directly. Building on the preceding examples, the following command uses the dateadd function to modify the values in the birth column by adding a specified time interval:
-- Enable a full table scan. This setting is valid only for the current session.
SET odps.sql.allow.fullscan=true;
SELECT id, name, age, birth, dateadd(birth, 1, 'mm') AS birth_dateadd FROM students;
User-defined functions (UDFs): To use a UDF, you must write the function code, upload the code as a resource, and register the function. For more information, see Resource Management.
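For example, the following hedged sketch assumes a hypothetical Java UDF class com.example.LowerUdf packaged in a JAR file that has already been uploaded as a resource named udf_demo.jar. After registration, the UDF can be called like a built-in function:
-- Register the UDF by using the uploaded JAR resource.
CREATE FUNCTION lower_udf AS 'com.example.LowerUdf' USING 'udf_demo.jar';
-- Call the registered UDF in a query.
SELECT lower_udf(name) AS lower_name FROM students WHERE gender = 'boy';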
Debug a MaxCompute SQL node
On the right side of the node editing page, configure the parameters in the Debug Configuration section.
| Parameter | Description |
| --- | --- |
| Computing resource | Select the MaxCompute computing resource that you attached. |
| Computing quota | Select a quota that you created. The quota provides the computing resources (CPU and memory) required for the computing job. If no computing quota is available, click Create Computing Quota in the drop-down list to go to the MaxCompute console and configure a quota. |
| Resource group | Select a scheduling resource group that has passed the connectivity test with the computing resource. For more information, see Network connection solutions. |
In the parameter dialog box on the toolbar, select the MaxCompute data source you created and click Run to execute the MaxCompute SQL task.
View the results
The execution results are displayed in a workbook. You can open the results in the workbook, perform other operations in DataWorks, or copy and paste the content to a local Excel file.
Note: Due to an adjustment in the China time zone information released by the International Organization for Standardization, you may encounter a time difference for specific periods when you run related SQL statements in DataWorks. The difference is 5 minutes and 52 seconds for dates and times from 1900 to 1928, and 9 seconds for dates and times before 1900.
To view the running logs, click the LogView link in the running results. For more information, see Use Logview V2.0 to view job running information.
To sort the results on the results page, click the drop-down arrow for a field, select ascending or descending order from the Sort list, and then click OK.
What to do next
Schedule the node: To schedule a node in the project folder, set the Scheduling Policy and configure its scheduling properties in the Schedule section on the right side of the node editing page.
Publish the node: If the task needs to be run in the production environment, click the publish icon to publish the task to the production environment. A node in the project folder can be run on a schedule only after it is published to the production environment.
Appendix: SQL execution order in different environments
The execution order of statements that contain keywords such as SET and USE in a MaxCompute SQL node varies depending on the execution environment in DataWorks.
In DataStudio: All statements that contain keywords such as SET and USE in the current task code are merged and executed before all other SQL statements.
In the scheduling environment: The statements are executed in the order in which they are written.
Assume that the following code is defined in the node.
SET a=b;
CREATE TABLE name1(id string);
SET c=d;
CREATE TABLE name2(id string);

The execution order in different environments is as follows:
| SQL statement to execute | DataStudio | Scheduling O&M |
| --- | --- | --- |
| First SQL statement | SET a=b; SET c=d; CREATE TABLE name1(id string); | SET a=b; CREATE TABLE name1(id string); |
| Second SQL statement | SET a=b; SET c=d; CREATE TABLE name2(id string); | SET c=d; CREATE TABLE name2(id string); |
References
For more examples of MaxCompute SQL tasks, see the following topics: