A MySQL data source provides a bidirectional channel for reading data from and writing data to MySQL. This topic describes the data synchronization capabilities that DataWorks provides for MySQL.
Supported MySQL versions
Offline read and write operations:
Supports MySQL 5.5.x, MySQL 5.6.x, MySQL 5.7.x, and MySQL 8.0.x. It is also compatible with Amazon RDS for MySQL, Azure Database for MySQL, and Amazon Aurora MySQL.
Offline synchronization supports reading data from views.
Real-time read operations:
Data Integration reads MySQL data in real time by subscribing to MySQL binary logs. Real-time synchronization is supported only for MySQL 5.5.x, MySQL 5.6.x, MySQL 5.7.x, and MySQL 8.0.x. For version 8.0.x, existing features are compatible, but new features such as functional indexes are not supported. It is also compatible with Amazon RDS for MySQL, Azure Database for MySQL, and Amazon Aurora MySQL.
Important: To synchronize data from a DRDS for MySQL database, do not configure it as a MySQL data source. Instead, configure it directly as a DRDS data source. For more information, see Configure a DRDS data source.
Limits
Real-time read
Data synchronization from MySQL read-only instances is not supported.
Tables that contain functional indexes cannot be synchronized.
XA ROLLBACK is not supported.
For transactional data in the XA PREPARE state, real-time synchronization writes the data to the destination. If an XA ROLLBACK occurs, real-time synchronization does not roll back the prepared data. To handle an XA ROLLBACK scenario, you must manually remove the table from the real-time sync task and then add the table back to restart the synchronization.
Only the ROW format for MySQL server binary logging is supported.
Real-time synchronization does not synchronize records in associated tables that are deleted by cascade delete operations.
For Amazon Aurora MySQL databases, you must connect to your primary or writer database. AWS does not allow binary logging to be activated on read replicas of Aurora MySQL. Real-time sync tasks require binary logging to perform incremental updates.
Real-time synchronization for online DDL operations supports only the addition of columns (Add Column) to MySQL tables using Data Management (DMS).
Reading stored procedures from MySQL is not supported.
Offline read
When the MySQL Reader plugin synchronizes multiple tables, such as in a sharding scenario, a single table can be split into multiple tasks only if the task concurrency is greater than the number of tables. Otherwise, the number of tasks is equal to the number of tables.
Reading stored procedures from MySQL is not supported.
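The table-splitting rule above can be illustrated with a small sketch. The helper below is hypothetical (it is not part of DataWorks, and the function name is invented); it only restates the documented rule that individual tables are subdivided only when concurrency exceeds the table count.

```python
def plan_offline_task_count(num_tables, concurrency):
    """Estimate how many read tasks the MySQL Reader creates when it
    synchronizes several tables (illustrative sketch, not the actual
    DataWorks scheduler logic)."""
    if concurrency > num_tables:
        # Tables are subdivided further, so roughly `concurrency` tasks exist.
        return concurrency
    # Otherwise each table maps to exactly one task.
    return num_tables
```

For example, reading 4 sharded tables with a concurrency of 2 yields 4 tasks, while reading 2 tables with a concurrency of 6 allows single tables to be split further.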
Supported field types
For a complete list of field types for each MySQL version, see the official MySQL documentation. The following table lists the support status for major field types in MySQL 8.0.x as an example.
Field type | Offline read (MySQL Reader) | Offline write (MySQL Writer) | Real-time read | Real-time write |
TINYINT | | | | |
SMALLINT | | | | |
INTEGER | | | | |
BIGINT | | | | |
FLOAT | | | | |
DOUBLE | | | | |
DECIMAL/NUMERIC | | | | |
REAL | | | | |
VARCHAR | | | | |
JSON | | | | |
TEXT | | | | |
MEDIUMTEXT | | | | |
LONGTEXT | | | | |
VARBINARY | | | | |
BINARY | | | | |
TINYBLOB | | | | |
MEDIUMBLOB | | | | |
LONGBLOB | | | | |
ENUM | | | | |
SET | | | | |
BOOLEAN | | | | |
BIT | | | | |
DATE | | | | |
DATETIME | | | | |
TIMESTAMP | | | | |
TIME | | | | |
YEAR | | | | |
LINESTRING | | | | |
POLYGON | | | | |
MULTIPOINT | | | | |
MULTILINESTRING | | | | |
MULTIPOLYGON | | | | |
GEOMETRYCOLLECTION | | | | |
Preparations
Before you configure a MySQL data source in DataWorks, complete the following preparations to ensure that subsequent tasks run as expected.
Confirm the MySQL version
Data Integration has specific version requirements for MySQL. Refer to the Supported MySQL versions section to check whether your MySQL version is supported. You can run the following statement in your MySQL database to check its version.
SELECT version();
Configure account permissions
Plan and create a dedicated MySQL account for DataWorks to access the data source.
Optional: Create an account.
For more information, see Create a MySQL account.
Grant permissions.
Offline
In an offline synchronization scenario:
For offline reading of MySQL data, the account requires the read (SELECT) permission on the tables to be synchronized.
For offline writing of MySQL data, the account requires the write (INSERT, DELETE, and UPDATE) permissions on the tables to be synchronized.
Real-time
In a real-time synchronization scenario, the account requires the SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions on the database.
You can use the following statements to grant permissions to the account, or grant the SUPER permission directly. Replace 'sync_account' with the account that you created.
-- Create a synchronization account, set a password, and allow it to log on from any host. The percent sign (%) indicates any host.
CREATE USER 'sync_account'@'%' IDENTIFIED BY 'password';
-- Grant the SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions to the synchronization account.
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'sync_account'@'%';
*.* grants the specified permissions on all tables in all databases to the synchronization account. You can also grant permissions on a specific table in a specific database. For example, to grant the permissions on the user table in the test database, use the statement GRANT SELECT, REPLICATION CLIENT ON test.user TO 'sync_account'@'%';.
Note: The REPLICATION SLAVE permission is a global permission. You cannot grant it for a specific table in a specific database.
Enable MySQL binary logging (for real-time synchronization only)
Data Integration performs real-time synchronization of incremental data by subscribing to MySQL binary logs (binlogs). Before you configure a synchronization task in DataWorks, you must enable the MySQL binary logging service.
If a binary log is being consumed, the database cannot delete it. If a real-time sync task experiences high latency, the source binary log may be retained for a long time. Configure latency alerts for your tasks and monitor the disk space of your database.
Retain binary logs for at least 72 hours. This prevents data loss if a task fails and cannot be reset to an offset that precedes the failure because the required logs have been deleted. In such cases, you must perform a full offline synchronization to backfill the data.
Check if binary logging is enabled.
You can run the following statement to check if binary logging is enabled.
SHOW VARIABLES LIKE 'log_bin';
If the result is ON, binary logging is enabled.
If you synchronize data from a secondary database, you can also run the following statement to check if binary logging is enabled.
SHOW VARIABLES LIKE 'log_slave_updates';
If the result is ON, binary logging is enabled on the secondary database.
If the result is not as expected:
For open source MySQL, refer to the official MySQL documentation to enable binary logging.
For Alibaba Cloud ApsaraDB RDS for MySQL, see RDS for MySQL log backup to enable binary logging.
For Alibaba Cloud PolarDB for MySQL, see Enable binary logging to enable binary logging.
Query the binary log format.
You can run the following statement to query the binary log format.
SHOW VARIABLES LIKE 'binlog_format';
Result description:
ROW: The binary log format is ROW.
STATEMENT: The binary log format is STATEMENT.
MIXED: The binary log format is MIXED.
Important: DataWorks real-time synchronization supports only the ROW format for MySQL server binary logging. If the format is not ROW, you must change the binlog_format setting to ROW.
Check if full binary logging is enabled.
You can run the following statement to check if full binary logging is enabled.
SHOW VARIABLES LIKE 'binlog_row_image';
Result description:
FULL: Full binary logging is enabled.
MINIMAL: Minimal binary logging is enabled. Full logging is not enabled.
Important: DataWorks real-time synchronization supports only MySQL servers with full binary logging enabled. If the query result is not FULL, you must modify the binlog_row_image configuration.
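The three checks above can be bundled into a single validation step. The following sketch is a hypothetical helper (not part of DataWorks): it takes the variable values returned by the SHOW VARIABLES statements and reports anything that would block real-time synchronization.

```python
def check_binlog_prerequisites(variables):
    """Given a mapping of MySQL variable name -> value (as returned by
    SHOW VARIABLES), return a list of problems that would block
    DataWorks real-time synchronization. Illustrative sketch only;
    the variable names are the real MySQL system variables."""
    problems = []
    if variables.get("log_bin", "OFF").upper() != "ON":
        problems.append("binary logging is disabled (log_bin != ON)")
    if variables.get("binlog_format", "").upper() != "ROW":
        problems.append("binlog_format must be ROW")
    if variables.get("binlog_row_image", "").upper() != "FULL":
        problems.append("binlog_row_image must be FULL")
    return problems
```

An empty result means all three prerequisites are satisfied.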
OSS binary log read authorization
When adding a MySQL data source, you can enable Enable OSS Binary Log Reading if you set Configuration Mode to Alibaba Cloud Instance Mode and the RDS for MySQL instance is in the same region as the DataWorks workspace. With this feature enabled, if the RDS binary log is inaccessible, the system will attempt to retrieve the binary log from OSS, preventing interruptions to the real-time sync task.
If you set OSS Binary Log Access Identity to Alibaba Cloud RAM User or Alibaba Cloud RAM Role, you must also configure authorization for the account as follows.
Alibaba Cloud RAM user
Log on to the RAM console and navigate to the Users page. Find the RAM user to which you want to grant permissions.
In the Actions column, click Add Permissions.
Configure the key parameters and click OK.
Set Resource Scope to Account Level.
Set the Access Policy to System Policy.
For Policy Name, enter AliyunDataWorksAccessingRdsOSSBinlogPolicy.

Alibaba Cloud RAM role
Log on to the RAM console and navigate to the Roles page. Create a RAM role. For more information, see Create a RAM role for a trusted Alibaba Cloud account.
Key parameters:
Set Trusted Entity Type to Alibaba Cloud Account.
For Trusted Entity Name, select Other Alibaba Cloud Account, and enter the ID of the Alibaba Cloud account that owns the DataWorks workspace.
For Role Name, enter a custom name.
Grant specific permissions to the created RAM role. For more information, see Grant permissions to a RAM role.
Key parameters:
Set the Access Policy to System Policy.
Set Policy Name to AliyunDataWorksAccessingRdsOSSBinlogPolicy.
Modify the trust policy of the created RAM role. For more information, see Modify the trust policy of a RAM role.
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "<Alibaba Cloud account ID of the DataWorks user>@di.dataworks.aliyuncs.com",
          "<Alibaba Cloud account ID of the DataWorks user>@dataworks.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}
Add a data source
Alibaba Cloud instance mode
If your MySQL database is an Alibaba Cloud RDS instance, we recommend that you create the data source in Alibaba Cloud instance mode. The parameters are described as follows:
Parameter | Description |
Data source name | The data source name must be unique within the workspace. We recommend a name that clearly identifies the business and environment, such as |
Configuration mode | Select Alibaba Cloud instance mode. For more information about configuration modes, see Scenario 1: Instance mode (current Alibaba Cloud account) and Scenario 2: Instance mode (other Alibaba Cloud account). |
Alibaba Cloud account | Select the account to which the instance belongs. If you select Other Alibaba Cloud Account, you must configure cross-account permissions. For more information, see Cross-account authorization (RDS, Hive, or Kafka). If you select Other Alibaba Cloud account, you must provide the following information: |
Region | The region where the instance is located. |
Instance | Select the name of the instance to connect to. |
Read replica settings | If this data source has a read-only instance (read replica), you can choose to read from the replica when you configure a task. The benefit is that it prevents interference with the primary database and does not affect its performance. |
Instance endpoint | After you select the correct instance, click Get Latest Endpoint to view the public or private endpoint, VPC, and vSwitch information for the instance. |
Database | The name of the database that the data source will access. Ensure that the specified user has access permissions for this database. |
Username/Password | The username and password for the MySQL database. If you use an RDS instance, you can create and maintain accounts in the Account Management section of the instance. |
Enable OSS binary log reading | When enabled, if the system cannot access the RDS binary log, it will try to get the binary log from OSS to prevent the real-time sync task from being interrupted. For configuration details, see OSS binary log read authorization. Set the OSS Binary Log Access Identity based on the authorization configuration. |
Authentication options | You can choose no authentication or SSL authentication. If you choose SSL authentication, the instance itself must have SSL authentication enabled. Prepare the certificate file and upload it to Certificate File Management. |
Version | You can log on to the MySQL server and run `SELECT VERSION()` to check the version number. |
Connection string mode
You can also create a data source in connection string mode, which is more flexible. The parameters for connection string mode are described as follows:
Parameter | Description |
Data source name | The data source name must be unique within the workspace. We recommend a name that clearly identifies the business and environment, such as |
Configuration mode | Select Connection String Mode. This connects to the database using a JDBC URL. |
Connection string preview | After you enter the connection address and database name, DataWorks automatically combines them into a JDBC URL for you to preview. |
Connection address | Host address IP: Enter the IP address or domain name of the database host. If the database is an Alibaba Cloud RDS instance, you can find the address in the instance details by clicking Database Connection. Port: The database port. The default is 3306. |
Database name | The name of the database that the data source will access. Ensure that the specified user has access permissions for this database. |
Username/Password | The username and password for the MySQL database. If you use an RDS instance, you can create and maintain accounts in the Account Management section of the instance. |
Version | You can log on to the MySQL server and run `SELECT VERSION()` to check the version number. |
Authentication options | You can choose no authentication or SSL authentication. If you choose SSL authentication, the instance itself must have SSL authentication enabled. Prepare the certificate file and upload it to Certificate File Management. |
Advanced parameters | Parameter: Click the parameter dropdown list and select a supported parameter name, such as connectTimeout. Value: Enter a suitable value for the selected parameter, such as 3000. The URL is then automatically updated, for example: jdbc:mysql://192.168.90.28:3306/test?connectTimeout=3000 |
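The previewed URL is assembled mechanically from the connection address, database name, and advanced parameters. The following sketch shows the assumed assembly logic (the function is invented for illustration; DataWorks builds the URL for you):

```python
def build_jdbc_url(host, port, database, params=None):
    """Assemble a MySQL JDBC URL of the form
    jdbc:mysql://<host>:<port>/<database>?<key>=<value>&...
    Illustrative sketch of the preview shown in connection string mode."""
    url = "jdbc:mysql://{}:{}/{}".format(host, port, database)
    if params:
        # Advanced parameters are appended as a query string.
        query = "&".join("{}={}".format(k, v) for k, v in params.items())
        url = "{}?{}".format(url, query)
    return url
```

For example, `build_jdbc_url("192.168.90.28", 3306, "test", {"connectTimeout": 3000})` produces the URL shown in the table above.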
Ensure connectivity between the data source and the resource group. Otherwise, subsequent tasks will fail. Configure the network connection based on the data source's network environment and connection mode. For more information, see Test connectivity.
Data synchronization task development: MySQL synchronization process guide
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configuration guide for single-table offline synchronization tasks
For the procedure, refer to Configure a task in the codeless UI and Configure a task in the code editor.
For all parameters and a script example for the code editor, refer to Appendix: MySQL script examples and parameter descriptions.
Configuration guide for single-table real-time synchronization tasks
For the procedure, refer to Configure a real-time synchronization task in DataStudio.
Configuration guide for full database synchronization
For the procedure, refer to Configure a real-time synchronization task for a full database.
FAQ
For more frequently asked questions about Data Integration, see Data Integration FAQ.
Appendix: MySQL script examples and parameter descriptions
Configure a batch synchronization task by using the code editor
To configure a batch synchronization task in the code editor, you must set the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following sections describe the data source parameters for such a task.
Reader script example
This topic provides configuration examples for a single table in a single database and for sharded databases and tables.
The comments in the JSON examples in this topic are for explanation only. You must remove them from your actual configuration.
Configure a single table in a single database
{
  "type": "job",
  "version": "2.0", // Version number.
  "steps": [
    {
      "stepType": "mysql", // Plugin name.
      "parameter": {
        "column": [ // Column names.
          "id"
        ],
        "connection": [
          {
            "querySql": [
              "select a,b from join1 c join join2 d on c.id = d.id;"
            ],
            "datasource": "" // Data source name.
          }
        ],
        "where": "", // Filter condition.
        "splitPk": "", // Shard key.
        "encoding": "UTF-8" // Encoding format.
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0" // Number of error records.
    },
    "speed": {
      "throttle": true, // If throttle is false, the mbps parameter is ignored and the rate is not limited. If throttle is true, the rate is limited.
      "concurrent": 1, // Job concurrency.
      "mbps": "12" // Rate limit. 1 mbps = 1 MB/s.
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
Configure sharded databases and tables
Note: Sharding allows the MySQL Reader to process multiple MySQL tables that have the same schema. This type of sharding involves writing data from multiple MySQL source tables to a single destination table. To configure sharding for an entire database, you can create a task in the Data Integration console and select the full database sharding feature.
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "mysql",
      "parameter": {
        "indexes": [
          {
            "type": "unique",
            "column": ["id"]
          }
        ],
        "envType": 0,
        "useSpecialSecret": false,
        "column": [
          "id",
          "buyer_name",
          "seller_name",
          "item_id",
          "city",
          "zone"
        ],
        "tableComment": "Test order table",
        "connection": [
          {
            "datasource": "rds_dataservice",
            "table": ["rds_table"]
          },
          {
            "datasource": "rds_workshop_log",
            "table": ["rds_table"]
          }
        ],
        "where": "",
        "splitPk": "id",
        "encoding": "UTF-8"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    },
    {
      "name": "Processor",
      "stepType": null,
      "category": "processor",
      "copies": 1,
      "parameter": {
        "nodes": [],
        "edges": [],
        "groups": [],
        "version": "2.0"
      }
    }
  ],
  "setting": {
    "executeMode": null,
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
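Because each shard contributes one entry to the reader's connection array, generating the configuration for many same-schema sources is mechanical. A minimal sketch (a hypothetical helper, not a DataWorks API):

```python
def sharded_connections(datasources, table):
    """Build the MySQL Reader `connection` list for a sharded setup in
    which a table with the same schema exists in several data sources.
    Illustrative sketch of the JSON structure shown above."""
    return [{"datasource": ds, "table": [table]} for ds in datasources]
```

For example, calling it with `["rds_dataservice", "rds_workshop_log"]` and `"rds_table"` reproduces the connection list in the sharded example.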
Reader script parameters
Script parameter | Description | Required | Default value |
datasource | The name of the data source. In the code editor, you can add a data source. The value of this parameter must be the same as the name of the added data source. | Yes | None |
table | The name of the table to synchronize. A data integration task can read data from only one table. Advanced usage examples for the table parameter to configure a range:
Note The task reads all matched tables. It specifically reads the columns specified in the column parameter. If a table or a specified column does not exist, the task fails. | Yes | None |
column | A collection of column names to be synchronized from the configured table, described as a JSON array. By default, all columns are used, for example, ["*"].
| Yes | None |
splitPk | When you use MySQL Reader to extract data, if you specify the splitPk parameter, the data is sharded based on the field that splitPk represents. This allows data synchronization to start concurrent tasks, which improves efficiency.
| No | None |
splitFactor | The sharding factor. This determines the number of shards for data synchronization. If you configure concurrency, the data is split into a number of shards equal to concurrency × splitFactor. For example, if concurrency is 5 and splitFactor is 5, the data is split into 25 shards and processed by 5 concurrent threads. Note We recommend a value between 1 and 100. A value that is too large may cause an out-of-memory error. | No | 5 |
where | A filter condition. In many business scenarios, you might synchronize only the current day's data. To do this, set the where condition to
| No | None |
querySql (Advanced mode. This parameter is not supported in the codeless UI.) | In some business scenarios, the `where` parameter is not sufficient to describe the filter conditions. You can use this parameter to define a custom filter SQL statement. When this parameter is configured, the data synchronization system ignores the `tables`, `columns`, and `splitPk` parameters and uses this SQL statement to filter data. For example, to synchronize data after a multi-table join, use Note The querySql parameter is case-sensitive. For example, querysql will not take effect. | No | None |
useSpecialSecret | For multi-source data sources, this specifies whether to use the password of each individual data source. Valid values:
If you have configured multiple source data sources with different usernames and passwords, you can set this parameter to true to use the password of each respective data source. | No | false |
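To make the interplay of splitPk, concurrency, and splitFactor concrete, the following sketch divides a numeric primary-key range into concurrency × splitFactor slices. This mirrors the documented shard count but is not the plugin's exact internal splitting algorithm:

```python
def split_pk_ranges(min_pk, max_pk, concurrency, split_factor=5):
    """Divide the [min_pk, max_pk] range of a numeric splitPk column into
    roughly concurrency * split_factor contiguous slices that concurrent
    reader threads can consume. Illustrative sketch only."""
    shards = concurrency * split_factor
    step = max(1, (max_pk - min_pk + 1) // shards)
    ranges, lo = [], min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))  # inclusive slice boundaries
        lo = hi + 1
    return ranges
```

With concurrency 5 and the default splitFactor of 5, the data is split into about 25 slices, matching the 25-shard example in the splitFactor description.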
Writer script example
{
"type": "job",
"version": "2.0",// Version number.
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "mysql",// Plugin name.
"parameter": {
"postSql": [],// Post-import SQL statements.
"datasource": "",// Data source.
"column": [// Column names.
"id",
"value"
],
"writeMode": "insert",// Write mode. You can set this to insert, replace, or update.
"batchSize": 1024,// Number of records to submit in a single batch.
"table": "",// Table name.
"nullMode": "skipNull",// NULL value handling policy.
"skipNullColumn": [// Columns for which to skip NULL values.
"id",
"value"
],
"preSql": [
"delete from XXX;"// Pre-import SQL statements.
]
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {// Number of error records.
"record": "0"
},
"speed": {
"throttle": true,// If throttle is false, the mbps parameter is ignored and the rate is not limited. If throttle is true, the rate is limited.
"concurrent": 1,// Job concurrency.
"mbps": "12"// Rate limit. Controls the maximum synchronization rate to prevent excessive read/write pressure on the upstream/downstream database. 1 mbps = 1 MB/s.
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Writer script parameters
Script parameter | Description | Required | Default value |
datasource | The name of the data source. In the code editor, you can add a data source. The value of this parameter must be the same as the name of the added data source. | Yes | None |
table | The name of the table to synchronize. | Yes | None |
writeMode | The import mode. Valid values: insert (statements use INSERT INTO), update (INSERT INTO ... ON DUPLICATE KEY UPDATE), and replace (REPLACE INTO).
| No | insert |
nullMode | The NULL value handling policy. Valid values:
Important If you set this to skipNull, the task dynamically constructs the write SQL statement to support default values in the destination. This increases the number of FLUSH operations and reduces synchronization speed. In the worst case, it may flush for every row. | No | writeNull |
skipNullColumn | When nullMode is set to skipNull, the columns configured in this parameter are not forced to be written as Format: | No | All columns configured for the task. |
column | The fields in the destination table to which data should be written. Separate fields with commas, for example, | Yes | None |
preSql | An SQL statement that is executed before the data synchronization task starts. The codeless UI allows only one SQL statement. The code editor supports multiple SQL statements. For example, you can clear old data from a table before execution (`truncate table tablename`). Note Transactions are not supported for multiple SQL statements. | No | None |
postSql | An SQL statement that is executed after the data synchronization task finishes. The codeless UI allows only one SQL statement. The code editor supports multiple SQL statements. For example, you can add a timestamp column: Note Transactions are not supported for multiple SQL statements. | No | None |
batchSize | The number of records to submit in a single batch. A larger value can significantly reduce network interactions between the data synchronization system and MySQL and improve overall throughput. If this value is too large, it may cause an out-of-memory (OOM) error in the data synchronization process. | No | 256 |
updateColumn | When writeMode is set to update, these are the fields to update when a primary key or unique index conflict occurs. Separate fields with commas, for example, | No | None |
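The three writeMode values map to three SQL statement shapes. The following sketch is illustrative only (the plugin builds its own statements internally); it shows the shapes with ? placeholders:

```python
def build_write_sql(table, columns, write_mode="insert"):
    """Sketch of the SQL shapes behind the writeMode values:
    insert  -> INSERT INTO (fails or is dropped on key conflict,
               depending on errorLimit handling)
    replace -> REPLACE INTO (deletes the conflicting row, then inserts)
    update  -> INSERT ... ON DUPLICATE KEY UPDATE (updates in place)."""
    cols = ", ".join(columns)
    marks = ", ".join(["?"] * len(columns))
    if write_mode == "replace":
        return "REPLACE INTO {} ({}) VALUES ({})".format(table, cols, marks)
    if write_mode == "update":
        updates = ", ".join("{0} = VALUES({0})".format(c) for c in columns)
        return "INSERT INTO {} ({}) VALUES ({}) ON DUPLICATE KEY UPDATE {}".format(
            table, cols, marks, updates)
    return "INSERT INTO {} ({}) VALUES ({})".format(table, cols, marks)
```

When writeMode is update, the updateColumn parameter narrows the updated fields to the listed columns instead of all of them.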