Data Integration provides a codeless UI that lets you periodically synchronize full or incremental data from a source table, including sharded tables, to a destination table without writing any code. You can configure a synchronization task by selecting the source and destination in the UI and configuring scheduling parameters in DataWorks. This topic describes the general configurations for a batch synchronization task in the codeless UI. The configurations may vary for different data sources. For more information, see Supported data sources and synchronization solutions.
Preparations
Configure the data sources. Before you configure a Data Integration sync task, ensure that you have configured the source and destination databases in Data Source Management in DataWorks. For more information about data source configuration, see Data source list.
Note: For more information about the data sources supported by batch synchronization and their configurations, see Supported data sources and synchronization solutions.
For more information about data source features, see Data Source Management.
Purchase a resource group with a suitable specification and attach it to the workspace. For more information, see Use a Serverless resource group for Data Integration and Use an exclusive resource group for Data Integration.
Establish a network connection between the resource group and the data source. For more information, see Configure network connections.
Step 1: Create a batch synchronization node
New Data Development
Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, choose . Select the desired workspace from the drop-down list and click Enter DataStudio.
Create a workflow. For more information, see Orchestrate workflows.
Create a batch synchronization node. You can use one of the following methods:
Method 1: Click the icon in the upper-right corner of the workflow list and choose .
Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow editor on the right.
Configure the basic information, source, and destination for the node. Then, click Confirm.
Previous Data Development
Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, click . Select the desired workspace from the drop-down list and click Enter Data Development.
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node. You can use one of the following methods:
Method 1: Expand the workflow, right-click Data Integration, and select .
Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow editor on the right.
Create a batch synchronization node as prompted.
Step 2: Configure the data source and resource group
Select the source and destination data sources for the batch synchronization task.
Select the resource group and resource quota for running the task. For recommended resource quota configurations, see Data Integration performance metrics.
Test the network connectivity between the data source and the resource group. If the connection fails, configure the network connection as prompted or as described in the documentation. For more information, see Configure network connections.
If you created a resource group but it is not displayed, check whether the resource group is attached to the workspace. For more information, see Use a Serverless resource group for Data Integration and Use an exclusive resource group for Data Integration.
Serverless resource groups allow you to specify an upper limit for the computing units (CUs) of a sync task. If your sync task fails due to an out-of-memory (OOM) error because of insufficient resources, you can adjust the CU usage for the resource group.
Step 3: Configure the source and destination
In the source and destination sections, configure the tables from which to read data and to which to write data. You can also specify the data range to synchronize.
Plugin configurations can vary. The following section provides examples of common configurations. To check whether a plugin supports a specific configuration and how to implement it, see the documentation for that plugin. For more information, see Data source list.
Source
Some source types support data filtering. You can specify a condition (a WHERE clause without the `where` keyword) to filter source data. At runtime, the task synchronizes only the data that meets the condition. For more information, see Scenario: Configure a batch synchronization task for incremental data.
To perform incremental synchronization, you can combine this filter condition with scheduling parameters to make it dynamic. For example, with `gmt_create >= '${bizdate}'`, the task synchronizes only the data generated on the current day each time it runs. You must also assign a value to the variable defined here when you configure the scheduling properties. For more information, see Supported formats of scheduling parameters. The method for configuring incremental synchronization varies by data source (plugin).
If you do not configure a filter condition, the task synchronizes all data from the table by default.
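For reference, in the script mode of the code editor this filter maps to the reader's `where` parameter. The following is a minimal sketch for a MySQL source; the data source, table, and column names are hypothetical:

```json
{
  "stepType": "mysql",
  "parameter": {
    "datasource": "my_source_ds",
    "table": ["orders"],
    "column": ["id", "gmt_create", "amount"],
    "where": "gmt_create >= '${bizdate}'"
  }
}
```

The `where` value omits the `WHERE` keyword, and `${bizdate}` must be assigned a value when you configure the scheduling properties.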
We recommend using the table's primary key for `splitPk` because primary keys are usually distributed evenly. This helps prevent data hot spots in the created shards.
Currently, `splitPk` only supports integer data for sharding. It does not support strings, floating-point numbers, dates, or other types. If you specify an unsupported type, the `splitPk` feature is ignored, and the task uses a single channel for synchronization.
If you do not specify `splitPk`, or if the value is empty, the data synchronization task uses a single channel to sync the table data.
Not all plugins support specifying a shard key to configure the task sharding logic. The preceding information is only an example. See the documentation for your specific plugin. For more information, see Supported data sources and synchronization solutions.
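For plugins that support sharding, the shard key corresponds to the `splitPk` parameter in script mode. A minimal sketch, again with hypothetical names:

```json
{
  "stepType": "mysql",
  "parameter": {
    "table": ["orders"],
    "column": ["id", "gmt_create", "amount"],
    "splitPk": "id"
  }
}
```

Here `id` is assumed to be an integer primary key, so the data can be split into evenly distributed shards that are read concurrently.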
Data processing
Important: Data processing is a feature available in the new Data Development. If you are using a previous version, you must upgrade your workspace to use this feature. For information about how to upgrade, see DataStudio upgrade guide.
Data processing lets you process data from the source table using methods such as string replacement, AI-assisted processing, and data vectorization before you write the processed data to the destination table.

Click the switch to turn on data processing.
In the Data Processing List, click Add Node and select a data processing type: String Replacement, AI-assisted Processing, or Data Vectorization. You can add multiple data processing nodes, which DataWorks will process sequentially.
Configure the data processing rules as prompted. For AI-assisted processing and data vectorization, see Intelligent data processing.
Note: Data processing requires additional computing resources, which increases the resource overhead and runtime of the data synchronization task. To avoid affecting synchronization efficiency, keep the processing logic as simple as possible.
Destination
| Operation | Description |
| --- | --- |
| Configure statements to execute before and after synchronization | Some data sources support executing SQL statements on the destination before data is written (pre-SQL) and after data is written (post-SQL). For example, MySQL Writer provides the `preSql` and `postSql` configuration items, which let you execute MySQL commands before or after data is written to MySQL. You can configure `truncate table tablename` in the Pre-SQL Statement (preSql) configuration item to clear existing data from the table before synchronization. |
| Define the write mode for conflicts | Define how data is written to the destination when conflicts occur, such as path or primary key conflicts. This configuration varies based on the data source attributes and writer plugin support. For configuration details, see the documentation for the specific writer plugin. |
The source-side operations described in the preceding Source section can be summarized as follows:

| Operation | Description |
| --- | --- |
| Configure the synchronization range | Specify the range of source data to synchronize, for example, by using a filter condition for incremental synchronization. |
| Configure a shard key for a relational database | Define the field in the source data that is used as the shard key. The synchronization task splits the data into multiple shards based on this key for concurrent, batched data reading. |
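In script mode, the pre- and post-synchronization statements map to the writer's `preSql` and `postSql` parameters. The following sketch assumes a MySQL destination with hypothetical data source and table names; it is an illustration, not a definitive configuration:

```json
{
  "stepType": "mysql",
  "parameter": {
    "datasource": "my_dest_ds",
    "table": "orders_copy",
    "preSql": ["truncate table orders_copy"],
    "postSql": ["select count(*) from orders_copy"],
    "writeMode": "insert"
  }
}
```

`preSql` and `postSql` accept lists of statements that run once before and once after the write phase, respectively.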
Step 4: Configure field mappings
After you select the source and destination, you must specify the mapping between the source and destination columns. The task writes data from the source fields to the corresponding destination fields based on these mappings.
During synchronization, mismatched field types between the source and destination can generate dirty data and cause write failures. To set the tolerance for dirty data, refer to the Channel Control settings in the next step.
If a source field is not mapped to a destination field, its data is not synchronized.
If the automatic mapping is not what you expect, you can adjust the mappings manually.
If you do not need a mapping for a specific field, you can manually delete the line that connects the source and destination fields. The data from that source field is not synchronized.
Mapping by name and mapping by row are supported. You can also perform the following operations:
Assign values to destination fields: You can use Add a row to add constants, scheduling parameters, or built-in variables to the destination table, such as '123', '${scheduling_parameter}', or '#{built_in_variable}#'.
Note: When you configure scheduling in the next step, you can assign values to scheduling parameters. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
Add built-in variables: You can manually add built-in variables and map them to destination fields to output them to a downstream node.
The available built-in variables for each plugin are as follows:
| Built-in variable | Description | Supported plugins |
| --- | --- | --- |
| `'#{DATASOURCE_NAME_SRC}#'` | Source data source name | MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader |
| `'#{DB_NAME_SRC}#'` | Name of the database where the source table is located | MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader |
| `'#{SCHEMA_NAME_SRC}#'` | Name of the schema where the source table is located | PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader |
| `'#{TABLE_NAME_SRC}#'` | Source table name | MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader |
Edit Source Fields: Click Manually Edit Mapping to perform the following operations:
Use functions that are supported by the source database to process fields. For example, you can use `Max(id)` to synchronize only the maximum value.
Manually edit source fields if not all fields were pulled during the field mapping process.
Note: MaxCompute Reader does not support the use of functions.
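In script-mode terms, the mappings correspond to the reader and writer `column` lists, which are matched by position. The sketch below also assigns a constant, a scheduling parameter, and a built-in variable to destination fields; all field names are hypothetical, and the quoting convention follows the UI examples above:

```json
{
  "parameter": {
    "column": [
      "id",
      "gmt_create",
      "'123'",
      "'${bizdate}'",
      "'#{TABLE_NAME_SRC}#'"
    ]
  }
}
```

Each entry is written to the destination field in the same position of the writer's `column` list.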
Step 5: Configure the channel
In the new Data Development, the Configure Channel feature is in the Advanced Configuration section on the right side of the task configuration interface.
You can use channel control to configure properties related to the data synchronization process. For more information about the parameters, see Relationship between concurrency and throttling for batch synchronization.
| Parameter | Description |
| --- | --- |
| Maximum Concurrency | Defines the maximum number of threads that the current task uses to concurrently read from the source and write to the destination. |
| Synchronization Rate | Controls the synchronization rate. Note: The traffic measure is a metric of Data Integration itself and does not represent actual network interface card (NIC) traffic. Typically, NIC traffic is 1 to 2 times the channel traffic. The actual traffic inflation depends on the serialization used by the data storage system. |
| Dirty Data Policy | Dirty data refers to records that fail to be written to the destination due to exceptions such as type conflicts or constraint violations. Batch synchronization lets you define a dirty data policy that sets a tolerance for dirty data and its impact on the task. Important: An excessive amount of dirty data can affect the overall speed of the synchronization task. |
| Distributed Processing Capability | Controls whether the current task runs in distributed mode. If you have high requirements for synchronization performance, you can use distributed mode. Distributed mode can also use fragmented machine resources, which improves resource utilization. |
| Time Zone | If the source and destination require cross-time zone synchronization, you can set the source time zone to perform time zone conversion. |
In addition to the preceding configurations, the overall synchronization speed is also affected by factors such as source data source performance and the synchronization network environment. For more information about synchronization speed and optimization, see Speed up or limit the speed of batch synchronization tasks.
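In script mode, these channel settings live in the task's `setting` section. The following is a hedged sketch that caps concurrency at 4 threads, throttles the rate to 10 MB/s, and tolerates up to 100 dirty records; the exact keys may vary by plugin and DataWorks version:

```json
{
  "setting": {
    "speed": {
      "concurrent": 4,
      "throttle": true,
      "mbps": 10
    },
    "errorLimit": {
      "record": 100
    }
  }
}
```

If `errorLimit.record` is omitted, dirty data is typically not limited; setting it to 0 fails the task on the first dirty record.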
Step 6: Configure scheduling properties
For a periodically scheduled batch synchronization task, you need to configure its scheduling properties. On the node's edit page, click Scheduling Configuration on the right to configure them.
You must configure scheduling parameters, a scheduling policy, a scheduling time, and scheduling dependencies for the sync task. The configuration process is the same as for other data development nodes and is not described in this topic.
For information about scheduling configuration in the new Data Development, see Node scheduling (new version).
For information about scheduling configuration in the previous Data Development, see Node scheduling configuration (previous version).
For more information about how to use scheduling parameters, see Common scenarios of scheduling parameters in Data Integration.
Step 7: Test and publish the task
Configure test parameters.
On the batch synchronization task configuration page, you can click Test Configuration on the right and configure the following parameters to run a test.
| Configuration item | Description |
| --- | --- |
| Resource Group | Select a resource group that is connected to the data source. |
| Script Parameters | Assign values to placeholder parameters in the data synchronization task. For example, if the task is configured with the `${bizdate}` parameter, you must specify a date in the `yyyymmdd` format. |

Run the task.
Click the Run icon in the toolbar to run and test the task in Data Development. After the task runs, you can create a node of the destination table type to query the destination table data and check whether the synchronized data meets your expectations.
Publish the task.
After the task runs successfully, if it needs to be scheduled periodically, click the icon in the toolbar of the node configuration page to publish the task to the production environment. For more information about how to publish tasks, see Publish tasks.
Limits
Some data sources do not support the configuration of batch synchronization tasks in the codeless UI.
After you select a data source, if a message is displayed indicating that the codeless UI is not supported, click the icon in the toolbar to switch to the code editor and continue to configure the task. For more information, see Configure a task in the code editor.
The codeless UI is easy to use but does not support some advanced features. If you require more fine-grained configuration management, you can click the convert to script icon in the toolbar to switch to the code editor to configure the batch synchronization task.
What to do next
After the task is published to the production environment, you can go to Operation Center in the production environment to view the scheduled task. For more information about how to run and manage batch synchronization tasks, monitor their status, and perform O&M on resource groups, see O&M for batch synchronization tasks.