The DataWorks data upload feature lets you upload data from local files, DataAnalysis workbooks, Object Storage Service (OSS) files, and HTTP files to engines such as MaxCompute, EMR Hive, Hologres, and StarRocks for analysis and management. This feature provides a convenient data transmission service to help you quickly use data to drive your business. This topic describes how to use the data upload feature.
Precautions
If you perform cross-border data uploads, such as transferring data from mainland China to outside mainland China or between different countries or regions, read the related compliance statement in advance. Otherwise, the data upload may fail, and you will be held legally responsible.
Before you upload data, set the table headers to English. If the table headers are in Chinese, parsing may fail and cause an upload error.
Limits
Resource group limits: The data upload feature requires you to specify a resource group for scheduling and a Data Integration resource group.
Only Serverless resource groups (recommended), exclusive resource groups for scheduling, and exclusive resource groups for Data Integration are supported. You must configure a resource group for scheduling and a Data Integration resource group for the corresponding engine in System Administration.
The selected resource group must be attached to the DataWorks workspace where the destination table resides. Ensure that the data source used by the data upload task can connect to the selected resource group over the network.
Note: To configure resource groups for an engine in DataAnalysis, see System administration.
To establish a network connection between a data source and a resource group, see Network connection solutions.
To attach an exclusive resource group to a workspace, see Use an exclusive resource group for scheduling and Use an exclusive resource group for Data Integration.
Table limits: You can upload data only to tables that you own. You own a table in either of the following scenarios:
The table details page in Data Map shows that you are the Table Owner. For more information about how to view table details, see View table details.
The table is a new table that you created when you uploaded data using the data upload feature.
Billing
Data upload incurs the following fees:
Data transmission fees.
If you create a new table, computing and storage fees are charged.
The preceding fees are charged by the respective engines. For specific fees, see the billing documentation for the corresponding engine: MaxCompute billing, Hologres billing, E-MapReduce billing, and EMR Serverless StarRocks product billing.
Go to the data upload page
Go to the Upload and Download page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose DataAnalysis > Upload and Download. On the page that appears, click Go to Data Upload and Download.
In the left-side navigation pane, click the upload icon to go to the Data Upload page. Click Data Upload and follow the on-screen instructions to upload the data.
Select the file data to upload
You can upload data from local files, workbooks, OSS, and HTTP files. Select a data source as needed.
When you upload a file, specify whether to filter out dirty data as needed. The two behaviors are illustrated in the sketch after the following options.
Yes: If dirty data is encountered, the platform automatically ignores it and continues to upload the data.
No: If dirty data is encountered, the platform does not ignore it, and the data upload is interrupted.
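The two options behave like the following Python sketch. This is a conceptual illustration only, not DataWorks code; parse_row is a hypothetical parser that raises an error for a dirty row.

```python
# Conceptual sketch of the two dirty-data behaviors (not DataWorks code).
def parse_row(raw):
    # Hypothetical parser: raises ValueError if the row cannot be converted.
    cells = raw.split(",")
    return [int(cells[0]), cells[1]]

def upload(rows, filter_dirty):
    uploaded = []
    for raw in rows:
        try:
            uploaded.append(parse_row(raw))
        except ValueError:
            if filter_dirty:
                continue  # "Yes": ignore the dirty row and keep uploading
            raise         # "No": interrupt the whole upload
    return uploaded

rows = ["1,a", "oops,b", "3,c"]
print(upload(rows, filter_dirty=True))   # [[1, 'a'], [3, 'c']]
# upload(rows, filter_dirty=False) would raise and stop the upload
```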
Local file
If the data that you want to upload is in a local file, select this method.
Set Data Source to Local File.
Specify Data To Upload: Drag your local files to the Select File area.
Note: The supported file formats are CSV, XLS, XLSX, and JSON. The maximum file size is 5 GB for CSV files and 100 MB for the other file formats.
By default, only the first sheet of a file is uploaded. To upload multiple sheets from a file, you must create a table for each sheet and make that sheet the first sheet of the file before you upload it.
Uploading files in SQL format is not supported.
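Because format and size violations otherwise surface only after the upload starts, a quick local pre-check can save time. A minimal sketch, assuming the limits quoted in the note above; the file name is hypothetical.

```python
import os

# Size limits quoted in the note above (illustrative client-side check only).
LIMITS = {
    ".csv": 5 * 1024**3,    # 5 GB for CSV files
    ".xls": 100 * 1024**2,  # 100 MB for the other supported formats
    ".xlsx": 100 * 1024**2,
    ".json": 100 * 1024**2,
}

def check_local_file(path: str) -> None:
    ext = os.path.splitext(path)[1].lower()
    if ext not in LIMITS:  # for example, .sql files are not supported
        raise ValueError(f"unsupported file format: {ext}")
    if os.path.getsize(path) > LIMITS[ext]:
        raise ValueError(f"{path} exceeds the {ext} size limit")

# check_local_file("sales_2024.csv")  # hypothetical file name
```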
Workbook
If the data that you want to upload is in a DataWorks DataAnalysis workbook, select this method.
Set Data Source to Workbook.
Specify Data To Upload:
From the drop-down list next to Select File, select the workbook file to upload.
If the workbook does not exist, click the New button next to it to create one. You can also go to the DataAnalysis module to create a workbook and import data.
Object Storage Service (OSS)
If the data that you want to upload is in Object Storage Service (OSS), select this method.
Prerequisites:
You have created an OSS bucket and stored the data to be uploaded in it. You can then upload the data from OSS to the corresponding destination engine.
To avoid permission issues, use Resource Access Management (RAM) to grant the Alibaba Cloud account that you use to upload data the permissions to access the destination bucket before you upload the data.
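To confirm that the account actually has access before you start the upload, you can list a few objects with the OSS Python SDK (oss2). A sketch with placeholder credentials, endpoint, and bucket name:

```python
import oss2  # Alibaba Cloud OSS Python SDK

# Placeholders: substitute your own credentials, endpoint, and bucket name.
auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<bucket_name>")

# Listing a few objects fails fast if the RAM grant is missing or incomplete.
for obj in oss2.ObjectIterator(bucket, max_keys=5):
    print(obj.key, obj.size)
```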
Steps:
Set Data Source to Object Storage OSS.
Specify The Data To Upload:
From the Select Bucket drop-down list, select the destination OSS bucket that stores the data to upload.
Note: You can upload data only from a bucket that is in the same region as the current DataWorks workspace.
In the Select File area, select the file data that you want to upload.
Note: Only files in CSV, XLS, XLSX, and JSON formats are supported.
HTTP file
If the data that you want to upload is in an HTTP file, select this method.
Set Data Source to HTTP File.
Specify Data To Upload:
| Parameter | Configuration description |
| --- | --- |
| File Address | The address where the file data is stored. Note: Addresses in HTTP and HTTPS formats are supported. |
| File Type | The file type is automatically detected based on the file that you upload. CSV, XLS, and XLSX formats are supported. The maximum size of a CSV file is 5 GB; the maximum size of other files is 50 MB. |
| Request Method | GET, POST, and PUT are supported. GET is recommended, but the method to use depends on the request methods that your file server allows. |
| Advanced Parameters | You can also set the Request Header and Request Body in the Advanced Parameters section as needed. |
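Before you submit the task, you can verify that the file address is reachable and that the file fits the size limit. A sketch using the Python requests library; the URL is a placeholder, and servers that do not allow HEAD can be probed with a streamed GET instead.

```python
import requests

url = "https://example.com/exports/data.csv"  # placeholder file address

# HEAD tests reachability and size without downloading the body.
resp = requests.head(url, allow_redirects=True, timeout=10)
resp.raise_for_status()

size = int(resp.headers.get("Content-Length", "0"))
print(f"reachable, Content-Length: {size} bytes")
if url.endswith(".csv") and size > 5 * 1024**3:
    print("warning: over the 5 GB limit for CSV files")
```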
Set the destination table
In the Set Destination Table section, select a Destination Engine for the data upload and configure the related parameters for the selected engine.
When you set the destination table, distinguish between the production (PROD) and development (DEV) environments when you select a data source. If you select the wrong environment, the data is uploaded to the other environment.
MaxCompute
To upload data to a MaxCompute table, configure the following parameters.
| Parameter | Configuration description |
| --- | --- |
| MaxCompute project name | Select a MaxCompute data source that is attached to the current region. If the data source that you want to use is not found, you can attach a MaxCompute compute resource to the current workspace to generate a data source with the same name. |
| Destination table | Select Existing Table or New Table. |
| Select destination table | The table where the data is stored. You can search for the table by keyword. Note: You can upload data only to tables that you own. For more information, see Limits. |
| Upload mode | Select the method that is used to add the data to the destination table. |
| Table name | Enter a custom name for the new table. Note: When a new table is created for the MaxCompute engine, the MaxCompute account information configured for the DataWorks compute resource is used, and the table is created in the corresponding MaxCompute project. |
| Table type | Select Non-partitioned Table or Partitioned Table as needed. If you select Partitioned Table, specify the partition fields and their values. |
| Lifecycle | Specify the lifecycle of the table. After the table expires, it may become unavailable. For more information about table lifecycles, see Lifecycle and Lifecycle action. |
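When you select New Table, DataWorks issues the table creation for you. As a point of reference, the result is conceptually similar to creating the table yourself with the PyODPS SDK; the credentials, project, endpoint, table name, and schema below are all placeholders.

```python
from odps import ODPS  # PyODPS, the MaxCompute Python SDK

# Placeholders: substitute your own credentials, project, and endpoint.
o = ODPS("<access_key_id>", "<access_key_secret>",
         project="<project_name>",
         endpoint="<maxcompute_endpoint>")

# A partitioned table with a 30-day lifecycle, mirroring the options above.
o.create_table(
    "uploaded_sales",                           # hypothetical table name
    ("id bigint, amount double", "pt string"),  # (columns, partition fields)
    if_not_exists=True,
    lifecycle=30,                               # days before the table expires
)
```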
EMR Hive
To upload data to an EMR Hive table, configure the following parameters.
| Parameter | Configuration description |
| --- | --- |
| Data source | Select an EMR Hive data source (Alibaba Cloud instance mode) that is attached to the workspace in the current region. |
| Destination table | You can upload data only to an Existing Table. |
| Select destination table | The table where the data is stored. You can search for the table by keyword. Note: You can upload data only to tables that you own. For more information, see Limits. |
| Upload mode | Select the method that is used to add the data to the destination table. |
Hologres
To upload data to a Hologres table, configure the following parameters.
| Parameter | Configuration description |
| --- | --- |
| Data source | Select a Hologres data source that is attached to the workspace in the current region. If the data source that you want to use is not found, you can attach a Hologres compute resource to the current workspace to generate a data source with the same name. |
| Destination table | You can upload data only to an Existing Table. |
| Select destination table | The table where the data is stored. You can search for the table by keyword. Note: You can upload data only to tables that you own. For more information, see Limits. |
| Upload mode | Select the method that is used to add the data to the destination table. |
| Primary key conflict policy | If a data upload causes a primary key conflict in the destination table, select the policy to apply to the conflicting rows. |
StarRocks
To upload data to a StarRocks table, configure the following parameters.
| Parameter | Configuration description |
| --- | --- |
| Data source | Select a StarRocks data source that is attached to the workspace in the current region. |
| Destination table | You can upload data only to an Existing Table. |
| Select destination table | The table where the data is stored. You can search for the table by keyword. Note: You can upload data only to tables that you own. For more information, see Limits. |
| Upload mode | Select the method that is used to add the data to the destination table. |
| Advanced parameters | You can configure Stream Load request parameters. A hedged example follows this table. |
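The Advanced Parameters correspond to StarRocks Stream Load request parameters, which are passed as HTTP headers. The following are commonly used Stream Load parameters; whether each one is exposed here is an assumption, so check the StarRocks Stream Load documentation and the on-screen options for what is actually supported.

```python
# Commonly used StarRocks Stream Load parameters (passed as HTTP headers).
# Which of these the Advanced Parameters section exposes is an assumption.
stream_load_params = {
    "format": "csv",            # input data format
    "column_separator": ",",    # field delimiter for CSV input
    "max_filter_ratio": "0.1",  # tolerate up to 10% filtered (dirty) rows
    "timeout": "600",           # load timeout, in seconds
}
print(stream_load_params)
```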
Preview the data to upload
After you set the destination table, you can adjust the file encoding and data mapping based on the data preview.
You can preview only the first 20 rows of data.
File Encoding: If the data contains garbled text, you can switch the encoding format. UTF-8, GB18030, Big5, UTF-16LE, and UTF-16BE are supported.
Preview data and set destination table fields:
Upload data to an existing table: You must configure the mapping between the columns in the source file and the fields in the destination table before the data can be uploaded. You can select Map By Column Name or Map By Position; the difference is illustrated in the sketch after this list. After the mapping is complete, you can also customize the field names in the destination table.
Note: If a column in the source data is not mapped to a field in the destination table, the data in that column is grayed out and is not uploaded.
A column in the source data cannot be mapped to multiple fields in the destination table.
The field name and field type cannot be empty. Otherwise, the data cannot be uploaded.
Upload data to a new table: You can use Smart Field Generation to automatically fill in field information, or you can manually modify the field information.
Note: The field name and field type cannot be empty. Otherwise, the data cannot be uploaded.
The EMR Hive, Hologres, and StarRocks engines do not support creating a new table during data upload.
Ignore First Row: Specify whether to upload the first row of the file data, which is usually the column names, to the destination table.
Selected: If the first row of the file contains column names, the first row is not uploaded to the destination table.
Not selected: If the first row of the file contains data, the first row is uploaded to the destination table.
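The two mapping modes, together with the File Encoding and Ignore First Row options, behave like the following self-contained Python sketch; the sample data and field names are illustrative.

```python
import csv
import io

# Illustrative source data; in practice this is your uploaded file, read
# with the encoding chosen under File Encoding (for example, gb18030).
source = io.StringIO("name,id,amount\nwidget,1,9.5\ngadget,2,3.0\n")

reader = csv.reader(source)
header = next(reader)  # "Ignore First Row": the header row is not uploaded
rows = list(reader)

dest_fields = ["id", "name", "amount"]  # destination table fields (illustrative)

# Map By Column Name: match source columns to destination fields by name.
by_name = [{f: row[header.index(f)] for f in dest_fields if f in header}
           for row in rows]

# Map By Position: the i-th source column feeds the i-th destination field.
by_position = [dict(zip(dest_fields, row)) for row in rows]

print(by_name[0])      # {'id': '1', 'name': 'widget', 'amount': '9.5'}
print(by_position[0])  # {'id': 'widget', 'name': '1', 'amount': '9.5'}
```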
Upload the data
After you preview the data, click the Data Upload button in the lower-left corner to upload the data.
What to do next
After the data is uploaded, you can click the Data Upload icon in the left-side navigation pane to return to the Data Upload page. Find the data upload task that you created and perform the following operations as needed:
Continue upload: In the Actions column, click Continue Upload to upload the data again.
Query data: In the Actions column, click Query Data to query and analyze the data.
View upload data details: Click the destination Table Name to go to Data Map and view the detailed information of the destination table. For more information, see General data query and management.
Appendix: Compliance statement for cross-border data upload
Cross-border data operations will cause your business data in the cloud to be transferred to the region or product deployment area that you select. You must ensure that such operations comply with the following requirements:
You have the right to process the relevant business data in the cloud.
You have adopted sufficient data security protection technologies and policies.
The data transfer complies with the requirements of relevant laws and regulations. For example, the transferred data does not contain any content that is restricted or prohibited from being transferred or disclosed by applicable laws.
Alibaba Cloud reminds you that if your data upload operation may result in cross-border data transfer, you should consult with professional legal or compliance personnel before you perform the operation. Ensure that the cross-border data transfer complies with the requirements of applicable laws, regulations, and regulatory policies. For example, you must obtain valid authorization from personal information subjects, complete the signing and filing of relevant contract clauses, and complete relevant security assessments and other legal obligations.
If you perform cross-border data operations without complying with this statement, you will bear the corresponding legal consequences. You are also liable for any losses incurred by Alibaba Cloud and its affiliates.
References
DataStudio also supports uploading data from local CSV or text files to MaxCompute tables. For more information, see Upload data.
For more information about operations on MaxCompute tables, see Create and use a MaxCompute table.
For more information about operations on Hologres tables, see Create a Hologres table.
For more information about operations on EMR tables, see Create an EMR table.
FAQ
Resource group configuration issue.
Error message: The current file source or destination engine requires a resource group to be configured for data upload. Contact the workspace administrator to configure a resource group.
Solution: To configure resource groups for an engine in DataAnalysis, see System administration.
Resource group attachment issue.
Error message: The global data upload resource group configured for your current workspace is not attached to the workspace to which the upload table belongs. Contact the workspace administrator to attach it.
Solution: You can attach the resource group that you set in System Administration to the workspace.