Synchronize sharded MySQL databases and tables to MaxCompute - DataWorks

Data Integration supports real-time synchronization of sharded data from sources such as MySQL and PolarDB to MaxCompute. This topic describes how to synchronize sharded data from MySQL to MaxCompute using Data Integration.

Background information

In real-world business scenarios, data synchronization often cannot be completed using one or more simple offline or real-time sync tasks. Instead, it requires a combination of multiple offline sync, real-time sync, and data processing tasks, which leads to high configuration complexity. This is especially true in scenarios with sharded MySQL databases and tables, where many upstream databases and tables must be written to a single MaxCompute table simultaneously. Configuring multiple tasks for this scenario would make the configuration complex and difficult to maintain.

To address these pain points, the one-click synchronization solution in DataWorks Data Integration provides a configurable sync task solution for business scenarios. This solution supports one-click synchronization for different data sources, which makes it easy and fast for businesses to synchronize data.

The solution for synchronizing sharded data to MaxCompute is a real-time solution that is based on binary logging (Binlog) and a T+1 merge process that generates offline tables. Real-time data is written to a Log table. The Base table stores the full data in a daily partition. Each day, a merge task combines the previous day's Base table partition with the real-time data from the Log table to create a new full partition. When you query data, you typically query the latest partition of the Base table.

Prerequisites

You have purchased a Serverless resource group or an exclusive resource group for Data Integration.
You have created a MySQL data source and a MaxCompute data source. For more information, see Data Source Configuration.
- MySQL: If the source is a MySQL database, you must enable the binary logging feature. For more information, see Prepare a MySQL environment.
- MaxCompute: When you read data from or write data to a MaxCompute table, you can enable related properties as needed. For more information, see Prepare a MaxCompute environment.
You have established a network connection between the resource group and the data source. For more information, see Network connection solutions.

Limits

You cannot synchronize source data to MaxCompute foreign tables.

Procedure

1. Choose a sync task type

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
In the navigation pane on the left, click Sync Task. On the page that appears, click Create Sync Task to go to the task creation page. Configure the following basic information.
- Source and Destination: MySQL → MaxCompute
- New Task Name: Enter a custom name for the sync task.
- Sync Type: Real-time synchronization of sharded databases and tables to MaxCompute.

2. Configure network and resources

In the Network and Resource Configuration section, select the Data Source (multiple sharded data sources), Resource Group, and Destination (MaxCompute) for the sync task. Then, click Test All Connectivity.
After you confirm that the connections to the source and destination data sources are successful, click Next.

3. Basic configuration

You can customize the Solution Name, Owner, and Target Task Location for the sync task. Click Next to go to the Configure Source Sharded Tables step.

If you select Automatically Create Workflow, the system automatically creates a business workflow in the workspace. The workflow is named in the `oneclick` + `Source Data Source Name` + `to` + `Destination Data Source Name` format.

4. Configure source sharded tables

The smallest unit in a sharded data synchronization link is a logical table. One logical table corresponds to a set of physical database and table rules and one destination base table. The feature that automatically generates logical tables can greatly simplify the configuration steps for logical tables. In most cases, you can use this feature to scan the source data source and simplify most configuration operations. If the automatic scan results do not meet your expectations, you need to make only minor corrections. For complex physical database and table rules, you can manually add logical tables.

Automatically generate logical tables

Click Auto-generate Logical Tables. In the data source list, select the data sources to scan for generating logical tables.
In the Configure Logical Table Generation Rules section, click Add Rule. You can select a Preset Rule or select Manually Enter to enter a custom rule.
Click Start Execution.
After the merge is complete, click Next to go to the Set Destination Table step.

Manually add logical tables

If the results from Auto-generate Logical Tables do not meet your expectations, you can edit the logical table matching rules. You can also click Manually Add Logical Table to add logical tables that were not scanned.

Click Manually Add Logical Table. Add the logical tables that you need to synchronize. Data from the same logical table is synchronized to the same destination table.
Set the conditions for the source database or source table.
The conditions for each sync table are divided into database conditions and table conditions. In the Source Table Selection Conditions and Source Database Selection Conditions sections, click Add Condition to add the respective conditions. When the sync task runs, it searches for and integrates source databases and tables based on these rules. These databases and tables serve as the sharded source for the destination table.
Click Apply.

Batch edit database rules

Database sharding rules can limit the source database conditions. You can apply these conditions to all sync tables.

For example, if all data comes from databases such as xiaobo1, xiaobo2, and xiaobo3, you can add the database sharding rule condition shown in the following figure:

5. Set destination table

Set the Write Mode.
Incremental data is written in real time to a Log table in MaxCompute. The incremental data in the Log table is then periodically merged with the full data in the destination Base table, and the final result is written to the Base table.
Configure time-based automatic partitioning.
In the Time-based Automatic Partitioning Settings section, you can configure the task to write data to a partitioned or non-partitioned table in MaxCompute. You can also define the name of the partition field. In this example, data is written to a partitioned table, and the partition field is ds.
Note
If you choose to write to a partitioned table, you can click the icon to define the name of the destination table's partition field.

Refresh the mappings between source and destination tables.

Click Refresh Source and MaxCompute Table Mapping to display the destination table list.

Feature	Description
Select a primary key for a table without one	Because the current solution does not support the synchronization of tables without primary keys, you need to click the button in the Sync Primary Key column to set a Custom Primary Key. This involves selecting one or more fields from the table to act as the primary key, which is then used for deduplication when data is written to the destination. In this example, the `id` column is selected to act as the primary key for the table without a primary key.
Edit the destination table schema	Click the MaxCompute Table Name to preview the MaxCompute table schema. When creating the MaxCompute table, Data Integration automatically adds a column named `_src_info_` to identify which child table the row of data comes from. The `_src_info_` column, combined with the source table's primary key, can also be used to uniquely identify a row of data. Note By default, the lifecycle of a MaxCompute table created automatically is only 30 days. There might also be field type mappings. This means if the destination database does not have a data type that matches the source, the sync task will automatically match the source field with a writable field type in the destination when creating the destination table. To modify the MaxCompute table's lifecycle or the destination table's field type mappings, click the destination table name in the MaxCompute Table Name column to make changes.
Select a table creation method	Supports Auto-create Table and Use Existing Table. When you set Table Creation Method to Use Existing Table, the MaxCompute Base Table Name column displays the automatically created table name. You can also select the name of the table you want to use from the drop-down list. When you set Table Creation Method to Auto-create Table, the automatically created table name is displayed. You can click the table name to view and modify the table creation statement. In this example, Auto-create Table is selected.
Full synchronization	In the Full Synchronization column, you can choose whether to synchronize the full data to the destination before starting real-time synchronization. If you turn off full synchronization, the corresponding table will not undergo offline full data synchronization. This is suitable for scenarios where the full data has already been synchronized to the destination by other means. In this example, full synchronization is enabled for all tables.

After you confirm that the destination tables, fields, and data read and write relationships are correct, click Next.

6. Set table-level synchronization rules

When you synchronize sharded MySQL data to MaxCompute, you can configure Data Manipulation Language (DML) synchronization rules. For example, to retain data that is deleted from the source database in the offline table, you can configure the rule as shown in the following figure: . This lets you define the processing policy for insert, update, or delete operations that occur in the source table.

Normal Processing: The source DML message is passed to the destination data source for processing.
Ignore: The message is discarded, and the corresponding DML message is not sent to the destination data source. The corresponding data does not change.
Conditional Normal Processing: If you select this option, you can configure a filter condition. The sync task filters the source data based on the filter expression that you configure. Data that meets the condition is processed normally. Data that does not meet the condition is ignored.

Note

If you do not set a rule, Normal Processing is used by default.

Click Next to go to the DDL Message Processing Rules step.

7. Set DDL message processing rules

DDL operations may be performed on the source. You can click Configure DDL Capability in the upper-right corner of the page to configure rules to process DDL messages from the source based on your business requirements.

Note

For more information, see Configure rules to process DDL messages.

Click Next to go to the Runtime Resource Settings step.

8. Runtime resource settings

After the sync link is created, separate subtasks are generated for offline full data synchronization and real-time incremental data synchronization. In the Runtime Resource Settings step, you must configure the properties for both the offline and real-time sync tasks.

These properties include the resource groups used for real-time incremental synchronization and offline full synchronization, and the scheduling resource group used for offline full synchronization. You can also click Advanced Configuration to configure parameters such as Retry On Subquery Failure.

Note

DataWorks offline sync tasks are submitted to the exclusive resource group for Data Integration by the scheduling resource group. Therefore, offline sync tasks use scheduling resource group resources in addition to the resources of the exclusive resource group for Data Integration. This incurs scheduling instance fees.

9. Execute the sync task

After you complete all configurations, click Complete Configuration at the bottom of the page.
On the Data Integration > Sync Task page, find the sync task that you created. In the Actions column, click Submit and Execute.
In the Task List, click the Name/ID of the task to view the detailed execution process.

What to do next

After you configure the task, you can manage the task that you created, add or remove tables, configure monitoring and alerts, and view key operational metrics. For more information, see O&M for full and incremental synchronization tasks.