A MySQL data source provides a bidirectional channel for reading data from and writing data to MySQL. This topic describes the data synchronization capabilities that DataWorks provides for MySQL.
Supported MySQL versions
Offline read and write operations:
Supports MySQL 5.5.x, MySQL 5.6.x, MySQL 5.7.x, and MySQL 8.0.x. It is also compatible with Amazon RDS for MySQL, Azure Database for MySQL, and Amazon Aurora MySQL.
Offline synchronization supports reading data from views.
Real-time read operations:
Data Integration reads MySQL data in real time by subscribing to MySQL binary logs. Real-time synchronization is supported only for MySQL 5.5.x, MySQL 5.6.x, MySQL 5.7.x, and MySQL 8.0.x. For version 8.0.x, existing features are compatible, but new features such as functional indexes are not supported. It is also compatible with Amazon RDS for MySQL, Azure Database for MySQL, and Amazon Aurora MySQL.
Important: To synchronize data from a DRDS for MySQL database, do not configure it as a MySQL data source. Instead, configure it directly as a DRDS data source. For more information, see Configure a DRDS data source.
Limits
Real-time read
Data synchronization from MySQL read-only instances is not supported.
Tables that contain functional indexes cannot be synchronized.
XA ROLLBACK is not supported.
For transactional data in the XA PREPARE state, real-time synchronization writes the data to the destination. If an XA ROLLBACK occurs, real-time synchronization does not roll back the prepared data. To handle an XA ROLLBACK scenario, you must manually remove the table from the real-time sync task and then add the table back to restart the synchronization.
Only the ROW format for MySQL server binary logging is supported.
Real-time synchronization does not synchronize records in associated tables that are deleted by cascade delete operations.
For Amazon Aurora MySQL databases, you must connect to your primary or writer database. AWS does not allow binary logging to be activated on read replicas of Aurora MySQL. Real-time sync tasks require binary logging to perform incremental updates.
Real-time synchronization for online DDL operations supports only the addition of columns (Add Column) to MySQL tables using Data Management (DMS).
Reading stored procedures from MySQL is not supported.
Offline read
When the MySQL Reader plugin synchronizes multiple tables, such as in a sharding scenario, a single table can be split into multiple tasks only if the task concurrency is greater than the number of tables. Otherwise, the number of tasks is equal to the number of tables.
Reading stored procedures from MySQL is not supported.
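The table-splitting rule above can be illustrated with a small sketch. The helper below is hypothetical (it is not part of DataWorks, and the function name is invented); it only restates the documented rule that individual tables are subdivided only when concurrency exceeds the table count.

```python
def plan_offline_task_count(num_tables, concurrency):
    """Estimate how many read tasks the MySQL Reader creates when it
    synchronizes several tables (illustrative sketch, not the actual
    DataWorks scheduler logic)."""
    if concurrency > num_tables:
        # Tables are subdivided further, so roughly `concurrency` tasks exist.
        return concurrency
    # Otherwise each table maps to exactly one task.
    return num_tables
```

For example, reading 4 sharded tables with a concurrency of 2 yields 4 tasks, while reading 2 tables with a concurrency of 6 allows single tables to be split further.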
Supported field types
For a complete list of field types for each MySQL version, see the official MySQL documentation. The following table lists the support status for major field types in MySQL 8.0.x as an example.
Field type | Offline read (MySQL Reader) | Offline write (MySQL Writer) | Real-time read | Real-time write |
TINYINT | | | | |
SMALLINT | | | | |
INTEGER | | | | |
BIGINT | | | | |
FLOAT | | | | |
DOUBLE | | | | |
DECIMAL/NUMERIC | | | | |
REAL | | | | |
VARCHAR | | | | |
JSON | | | | |
TEXT | | | | |
MEDIUMTEXT | | | | |
LONGTEXT | | | | |
VARBINARY | | | | |
BINARY | | | | |
TINYBLOB | | | | |
MEDIUMBLOB | | | | |
LONGBLOB | | | | |
ENUM | | | | |
SET | | | | |
BOOLEAN | | | | |
BIT | | | | |
DATE | | | | |
DATETIME | | | | |
TIMESTAMP | | | | |
TIME | | | | |
YEAR | | | | |
LINESTRING | | | | |
POLYGON | | | | |
MULTIPOINT | | | | |
MULTILINESTRING | | | | |
MULTIPOLYGON | | | | |
GEOMETRYCOLLECTION | | | | |
Preparations
Before you configure a MySQL data source in DataWorks, complete the following preparations to ensure that subsequent tasks run as expected.
Confirm the MySQL version
Data Integration has specific version requirements for MySQL. Refer to the Supported MySQL versions section to check whether your MySQL version is supported. You can run the following statement in your MySQL database to check its version.
SELECT version();
Configure account permissions
Plan and create a dedicated MySQL account for DataWorks to access the data source.
Optional: Create an account.
For more information, see Create a MySQL account.
Grant permissions.
Offline
In an offline synchronization scenario:
For offline reading of MySQL data, the account requires the read (SELECT) permission on the tables to be synchronized.
For offline writing of MySQL data, the account requires the write (INSERT, DELETE, and UPDATE) permissions on the tables to be synchronized.
Real-time
In a real-time synchronization scenario, the account requires the SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions on the database.
You can use the following statements to grant permissions to the account, or grant the SUPER permission directly. Replace 'sync_account' with the account that you created.
-- Create a synchronization account, set a password, and allow it to log on from any host. The percent sign (%) indicates any host.
CREATE USER 'sync_account'@'%' IDENTIFIED BY 'password';
-- Grant the SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions to the synchronization account.
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'sync_account'@'%';
*.* grants the specified permissions on all tables in all databases to the synchronization account. You can also grant permissions on a specific table in a specific database. For example, to grant the permissions on the user table in the test database, use the statement GRANT SELECT, REPLICATION CLIENT ON test.user TO 'sync_account'@'%';.
Note: The REPLICATION SLAVE permission is a global permission. You cannot grant it for a specific table in a specific database.
Enable MySQL binary logging (for real-time synchronization only)
Data Integration performs real-time synchronization of incremental data by subscribing to MySQL binary logs (binlogs). Before you configure a synchronization task in DataWorks, you must enable the MySQL binary logging service.
If a binary log is being consumed, the database cannot delete it. If a real-time sync task experiences high latency, the source binary log may be retained for a long time. Configure latency alerts for your tasks and monitor the disk space of your database.
Retain binary logs for at least 72 hours. This prevents data loss if a task fails and cannot be reset to an offset that precedes the failure because the required logs have been deleted. In such cases, you must perform a full offline synchronization to backfill the data.
Check if binary logging is enabled.
You can run the following statement to check if binary logging is enabled.
SHOW VARIABLES LIKE 'log_bin';
If the result is ON, binary logging is enabled.
If you synchronize data from a secondary database, you can also run the following statement to check if binary logging is enabled.
SHOW VARIABLES LIKE 'log_slave_updates';
If the result is ON, binary logging is enabled on the secondary database.
If the result is not as expected:
For open source MySQL, refer to the official MySQL documentation to enable binary logging.
For Alibaba Cloud ApsaraDB RDS for MySQL, see RDS for MySQL log backup to enable binary logging.
For Alibaba Cloud PolarDB for MySQL, see Enable binary logging to enable binary logging.
Query the binary log format.
You can run the following statement to query the binary log format.
SHOW VARIABLES LIKE 'binlog_format';
Result description:
ROW: The binary log format is ROW.
STATEMENT: The binary log format is STATEMENT.
MIXED: The binary log format is MIXED.
Important: DataWorks real-time synchronization supports only the ROW format for MySQL server binary logging. If the format is not ROW, you must change the binlog_format setting to ROW.
Check if full binary logging is enabled.
You can run the following statement to check if full binary logging is enabled.
SHOW VARIABLES LIKE 'binlog_row_image';
Result description:
FULL: Full binary logging is enabled.
MINIMAL: Minimal binary logging is enabled. Full logging is not enabled.
Important: DataWorks real-time synchronization supports only MySQL servers with full binary logging enabled. If the query result is not FULL, you must modify the binlog_row_image configuration.
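The three checks above can be bundled into a single validation step. The following sketch is a hypothetical helper (not part of DataWorks): it takes the variable values returned by the SHOW VARIABLES statements and reports anything that would block real-time synchronization.

```python
def check_binlog_prerequisites(variables):
    """Given a mapping of MySQL variable name -> value (as returned by
    SHOW VARIABLES), return a list of problems that would block
    DataWorks real-time synchronization. Illustrative sketch only;
    the variable names are the real MySQL system variables."""
    problems = []
    if variables.get("log_bin", "OFF").upper() != "ON":
        problems.append("binary logging is disabled (log_bin != ON)")
    if variables.get("binlog_format", "").upper() != "ROW":
        problems.append("binlog_format must be ROW")
    if variables.get("binlog_row_image", "").upper() != "FULL":
        problems.append("binlog_row_image must be FULL")
    return problems
```

An empty result means all three prerequisites are satisfied.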
OSS binary log read authorization
When adding a MySQL data source, you can enable Enable OSS Binary Log Reading if you set Configuration Mode to Alibaba Cloud Instance Mode and the RDS for MySQL instance is in the same region as the DataWorks workspace. With this feature enabled, if the RDS binary log is inaccessible, the system will attempt to retrieve the binary log from OSS, preventing interruptions to the real-time sync task.
If you set OSS Binary Log Access Identity to Alibaba Cloud RAM User or Alibaba Cloud RAM Role, you must also configure authorization for the account as follows.
Alibaba Cloud RAM user
Log on to the RAM console and navigate to the Users page. Find the RAM user to which you want to grant permissions.
In the Actions column, click Add Permissions.
Configure the key parameters and click OK.
Set Resource Scope to Account Level.
Set the Access Policy to System Policy.
For Policy Name, enter AliyunDataWorksAccessingRdsOSSBinlogPolicy.

Alibaba Cloud RAM role
Log on to the RAM console and navigate to the Roles page. Create a RAM role. For more information, see Create a RAM role for a trusted Alibaba Cloud account.
Key parameters:
Set Trusted Entity Type to Alibaba Cloud Account.
For Trusted Entity Name, select Other Alibaba Cloud Account, and enter the ID of the Alibaba Cloud account that owns the DataWorks workspace.
For Role Name, enter a custom name.
Grant specific permissions to the created RAM role. For more information, see Grant permissions to a RAM role.
Key parameters:
Set the Access Policy to System Policy.
Set Policy Name to AliyunDataWorksAccessingRdsOSSBinlogPolicy.
Modify the trust policy of the created RAM role. For more information, see Modify the trust policy of a RAM role.
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "<Alibaba Cloud account ID of the DataWorks user>@di.dataworks.aliyuncs.com",
          "<Alibaba Cloud account ID of the DataWorks user>@dataworks.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}
Add a data source
Alibaba Cloud instance mode
If your MySQL database is an Alibaba Cloud RDS instance, we recommend that you create the data source in Alibaba Cloud instance mode. The parameters are described as follows:
Parameter | Description |
Data source name | The data source name must be unique within the workspace. We recommend a name that clearly identifies the business and environment, such as |
Configuration mode | Select Alibaba Cloud instance mode. For more information about configuration modes, see Scenario 1: Instance mode (current Alibaba Cloud account) and Scenario 2: Instance mode (other Alibaba Cloud account). |
Alibaba Cloud account | Select the account to which the instance belongs. If you select Other Alibaba Cloud Account, you must configure cross-account permissions. For more information, see Cross-account authorization (RDS, Hive, or Kafka). If you select Other Alibaba Cloud account, you must provide the following information: |
Region | The region where the instance is located. |
Instance | Select the name of the instance to connect to. |
Read replica settings | If this data source has a read-only instance (read replica), you can choose to read from the replica when you configure a task. The benefit is that it prevents interference with the primary database and does not affect its performance. |
Instance endpoint | After you select the correct instance, click Get Latest Endpoint to view the public or private endpoint, VPC, and vSwitch information for the instance. |
Database | The name of the database that the data source will access. Ensure that the specified user has access permissions for this database. |
Username/Password | The username and password for the MySQL database. If you use an RDS instance, you can create and maintain accounts in the Account Management section of the instance. |
Enable OSS binary log reading | When enabled, if the system cannot access the RDS binary log, it will try to get the binary log from OSS to prevent the real-time sync task from being interrupted. For configuration details, see OSS binary log read authorization. Set the OSS Binary Log Access Identity based on the authorization configuration. |
Authentication options | You can choose no authentication or SSL authentication. If you choose SSL authentication, the instance itself must have SSL authentication enabled. Prepare the certificate file and upload it to Certificate File Management. |
Version | You can log on to the MySQL server and run `SELECT VERSION()` to check the version number. |
Connection string mode
You can also create a data source in connection string mode, which is more flexible. The parameters for connection string mode are described as follows:
Parameter | Description |
Data source name | The data source name must be unique within the workspace. We recommend a name that clearly identifies the business and environment, such as |
Configuration mode | Select Connection String Mode. This connects to the database using a JDBC URL. |
Connection string preview | After you enter the connection address and database name, DataWorks automatically combines them into a JDBC URL for you to preview. |
Connection address | Host address IP: Enter the IP address or domain name of the database host. If the database is an Alibaba Cloud RDS instance, you can find the address in the instance details by clicking Database Connection. Port: The database port. The default is 3306. |
Database name | The name of the database that the data source will access. Ensure that the specified user has access permissions for this database. |
Username/Password | The username and password for the MySQL database. If you use an RDS instance, you can create and maintain accounts in the Account Management section of the instance. |
Version | You can log on to the MySQL server and run `SELECT VERSION()` to check the version number. |
Authentication options | You can choose no authentication or SSL authentication. If you choose SSL authentication, the instance itself must have SSL authentication enabled. Prepare the certificate file and upload it to Certificate File Management. |
Advanced parameters | Parameter: Click the parameter dropdown list and select a supported parameter name, such as connectTimeout. Value: Enter a suitable value for the selected parameter, such as 3000. The URL is then automatically updated, for example: jdbc:mysql://192.168.90.28:3306/test?connectTimeout=3000 |
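The previewed URL is assembled mechanically from the connection address, database name, and advanced parameters. The following sketch shows the assumed assembly logic (the function is invented for illustration; DataWorks builds the URL for you):

```python
def build_jdbc_url(host, port, database, params=None):
    """Assemble a MySQL JDBC URL of the form
    jdbc:mysql://<host>:<port>/<database>?<key>=<value>&...
    Illustrative sketch of the preview shown in connection string mode."""
    url = "jdbc:mysql://{}:{}/{}".format(host, port, database)
    if params:
        # Advanced parameters are appended as a query string.
        query = "&".join("{}={}".format(k, v) for k, v in params.items())
        url = "{}?{}".format(url, query)
    return url
```

For example, `build_jdbc_url("192.168.90.28", 3306, "test", {"connectTimeout": 3000})` produces the URL shown in the table above.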
Ensure connectivity between the data source and the resource group. Otherwise, subsequent tasks will fail. Configure the network connection based on the data source's network environment and connection mode. For more information, see Test connectivity.
Data synchronization task development: MySQL synchronization process guide
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configuration guide for single-table offline synchronization tasks
For the procedure, refer to Configure a task in the codeless UI and Configure a task in the code editor.
For all parameters and a script example for the code editor, refer to Appendix: MySQL script examples and parameter descriptions.
Configuration guide for single-table real-time synchronization tasks
For the procedure, refer to Configure a real-time synchronization task in DataStudio.
Configuration guide for full database synchronization
For the procedure, refer to Configure a real-time synchronization task for a full database.
FAQ
For more frequently asked questions about Data Integration, see Data Integration FAQ.
Appendix: MySQL script examples and parameter descriptions
Configure a batch synchronization task by using the code editor
To configure a batch synchronization task in the code editor, you must set the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following sections describe the data source parameters for such a task.
Reader script example
This topic provides configuration examples for a single table in a single database and for sharded databases and tables.
The comments in the JSON examples in this topic are for explanation only. You must remove them from your actual configuration.
Configure a single table in a single database
{
  "type": "job",
  "version": "2.0", // Version number.
  "steps": [
    {
      "stepType": "mysql", // Plugin name.
      "parameter": {
        "column": [ // Column names.
          "id"
        ],
        "connection": [
          {
            "querySql": [
              "select a,b from join1 c join join2 d on c.id = d.id;"
            ],
            "datasource": "" // Data source name.
          }
        ],
        "where": "", // Filter condition.
        "splitPk": "", // Shard key.
        "encoding": "UTF-8" // Encoding format.
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0" // Number of error records.
    },
    "speed": {
      "throttle": true, // If throttle is false, the mbps parameter is ignored and the rate is not limited. If throttle is true, the rate is limited.
      "concurrent": 1, // Job concurrency.
      "mbps": "12" // Rate limit. 1 mbps = 1 MB/s.
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
Configure sharded databases and tables
Note: Sharding allows the MySQL Reader to process multiple MySQL tables that have the same schema. This type of sharding involves writing data from multiple MySQL source tables to a single destination table. To configure sharding for an entire database, you can create a task in the Data Integration console and select the full database sharding feature.
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "mysql",
      "parameter": {
        "indexes": [
          {
            "type": "unique",
            "column": ["id"]
          }
        ],
        "envType": 0,
        "useSpecialSecret": false,
        "column": [
          "id",
          "buyer_name",
          "seller_name",
          "item_id",
          "city",
          "zone"
        ],
        "tableComment": "Test order table",
        "connection": [
          {
            "datasource": "rds_dataservice",
            "table": ["rds_table"]
          },
          {
            "datasource": "rds_workshop_log",
            "table": ["rds_table"]
          }
        ],
        "where": "",
        "splitPk": "id",
        "encoding": "UTF-8"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    },
    {
      "name": "Processor",
      "stepType": null,
      "category": "processor",
      "copies": 1,
      "parameter": {
        "nodes": [],
        "edges": [],
        "groups": [],
        "version": "2.0"
      }
    }
  ],
  "setting": {
    "executeMode": null,
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
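Because each shard contributes one entry to the reader's connection array, generating the configuration for many same-schema sources is mechanical. A minimal sketch (a hypothetical helper, not a DataWorks API):

```python
def sharded_connections(datasources, table):
    """Build the MySQL Reader `connection` list for a sharded setup in
    which a table with the same schema exists in several data sources.
    Illustrative sketch of the JSON structure shown above."""
    return [{"datasource": ds, "table": [table]} for ds in datasources]
```

For example, calling it with `["rds_dataservice", "rds_workshop_log"]` and `"rds_table"` reproduces the connection list in the sharded example.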
Reader script parameters
Script parameter | Description | Required | Default value |
datasource | The name of the data source. In the code editor, you can add a data source. The value of this parameter must be the same as the name of the added data source. | Yes | None |
table | The name of the table to synchronize. A data integration task can read data from only one table. Advanced usage examples for the table parameter to configure a range:
Note The task reads all matched tables. It specifically reads the columns specified in the column parameter. If a table or a specified column does not exist, the task fails. | Yes | None |
column | A collection of column names to be synchronized from the configured table, described as a JSON array. By default, all columns are used, for example, ["*"].
| Yes | None |
splitPk | When you use MySQL Reader to extract data, if you specify the splitPk parameter, the data is sharded based on the field that splitPk represents. This allows data synchronization to start concurrent tasks, which improves efficiency.
| No | None |
splitFactor | The sharding factor. This determines the number of shards for data synchronization. If you configure concurrency, the data is split into a number of shards equal to concurrency × splitFactor. For example, if concurrency is 5 and splitFactor is 5, the data is split into 25 shards and processed by 5 concurrent threads. Note We recommend a value between 1 and 100. A value that is too large may cause an out-of-memory error. | No | 5 |
where | A filter condition. In many business scenarios, you might synchronize only the current day's data. To do this, set the where condition to
| No | None |
querySql (Advanced mode. This parameter is not supported in the codeless UI.) | In some business scenarios, the `where` parameter is not sufficient to describe the filter conditions. You can use this parameter to define a custom filter SQL statement. When this parameter is configured, the data synchronization system ignores the `tables`, `columns`, and `splitPk` parameters and uses this SQL statement to filter data. For example, to synchronize data after a multi-table join, use Note The querySql parameter is case-sensitive. For example, querysql will not take effect. | No | None |
useSpecialSecret | For multi-source data sources, this specifies whether to use the password of each individual data source. Valid values:
If you have configured multiple source data sources with different usernames and passwords, you can set this parameter to true to use the password of each respective data source. | No | false |
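To make the interplay of splitPk, concurrency, and splitFactor concrete, the following sketch divides a numeric primary-key range into concurrency × splitFactor slices. This mirrors the documented shard count but is not the plugin's exact internal splitting algorithm:

```python
def split_pk_ranges(min_pk, max_pk, concurrency, split_factor=5):
    """Divide the [min_pk, max_pk] range of a numeric splitPk column into
    roughly concurrency * split_factor contiguous slices that concurrent
    reader threads can consume. Illustrative sketch only."""
    shards = concurrency * split_factor
    step = max(1, (max_pk - min_pk + 1) // shards)
    ranges, lo = [], min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))  # inclusive slice boundaries
        lo = hi + 1
    return ranges
```

With concurrency 5 and the default splitFactor of 5, the data is split into about 25 slices, matching the 25-shard example in the splitFactor description.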
Writer script example
{
"type": "job",
"version": "2.0",// Version number.
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "mysql",// Plugin name.
"parameter": {
"postSql": [],// Post-import SQL statements.
"datasource": "",// Data source.
"column": [// Column names.
"id",
"value"
],
"writeMode": "insert",// Write mode. You can set this to insert, replace, or update.
"batchSize": 1024,// Number of records to submit in a single batch.
"table": "",// Table name.
"nullMode": "skipNull",// NULL value handling policy.
"skipNullColumn": [// Columns for which to skip NULL values.
"id",
"value"
],
"preSql": [
"delete from XXX;"// Pre-import SQL statements.
]
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {// Number of error records.
"record": "0"
},
"speed": {
"throttle": true,// If throttle is false, the mbps parameter is ignored and the rate is not limited. If throttle is true, the rate is limited.
"concurrent": 1,// Job concurrency.
"mbps": "12"// Rate limit. Controls the maximum synchronization rate to prevent excessive read/write pressure on the upstream/downstream database. 1 mbps = 1 MB/s.
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Writer script parameters
Script parameter | Description | Required | Default value |
datasource | The name of the data source. In the code editor, you can add a data source. The value of this parameter must be the same as the name of the added data source. | Yes | None |
table | The name of the table to synchronize. | Yes | None |
writeMode | The import mode. Valid values: insert (statements use INSERT INTO), update (INSERT INTO ... ON DUPLICATE KEY UPDATE), and replace (REPLACE INTO).
| No | insert |
nullMode | The NULL value handling policy. Valid values:
Important If you set this to skipNull, the task dynamically constructs the write SQL statement to support default values in the destination. This increases the number of FLUSH operations and reduces synchronization speed. In the worst case, it may flush for every row. | No | writeNull |
skipNullColumn | When nullMode is set to skipNull, the columns configured in this parameter are not forced to be written as Format: | No | All columns configured for the task. |
column | The fields in the destination table to which data should be written. Separate fields with commas, for example, | Yes | None |
preSql | An SQL statement that is executed before the data synchronization task starts. The codeless UI allows only one SQL statement. The code editor supports multiple SQL statements. For example, you can clear old data from a table before execution (`truncate table tablename`). Note Transactions are not supported for multiple SQL statements. | No | None |
postSql | An SQL statement that is executed after the data synchronization task finishes. The codeless UI allows only one SQL statement. The code editor supports multiple SQL statements. For example, you can add a timestamp column: Note Transactions are not supported for multiple SQL statements. | No | None |
batchSize | The number of records to submit in a single batch. A larger value can significantly reduce network interactions between the data synchronization system and MySQL and improve overall throughput. If this value is too large, it may cause an out-of-memory (OOM) error in the data synchronization process. | No | 256 |
updateColumn | When writeMode is set to update, these are the fields to update when a primary key or unique index conflict occurs. Separate fields with commas, for example, | No | None |
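The three writeMode values map to three SQL statement shapes. The following sketch is illustrative only (the plugin builds its own statements internally); it shows the shapes with ? placeholders:

```python
def build_write_sql(table, columns, write_mode="insert"):
    """Sketch of the SQL shapes behind the writeMode values:
    insert  -> INSERT INTO (fails or is dropped on key conflict,
               depending on errorLimit handling)
    replace -> REPLACE INTO (deletes the conflicting row, then inserts)
    update  -> INSERT ... ON DUPLICATE KEY UPDATE (updates in place)."""
    cols = ", ".join(columns)
    marks = ", ".join(["?"] * len(columns))
    if write_mode == "replace":
        return "REPLACE INTO {} ({}) VALUES ({})".format(table, cols, marks)
    if write_mode == "update":
        updates = ", ".join("{0} = VALUES({0})".format(c) for c in columns)
        return "INSERT INTO {} ({}) VALUES ({}) ON DUPLICATE KEY UPDATE {}".format(
            table, cols, marks, updates)
    return "INSERT INTO {} ({}) VALUES ({})".format(table, cols, marks)
```

When writeMode is update, the updateColumn parameter narrows the updated fields to the listed columns instead of all of them.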