Offline synchronization of the entire MySQL database to Elasticsearch - DataWorks

Data Integration supports offline synchronization of entire databases from sources such as MySQL, PolarDB, and SQL Server to Elasticsearch. This topic uses MySQL as the source and Elasticsearch as the destination to describe how to synchronize an entire MySQL database to Elasticsearch in offline mode.

Prerequisites

You have purchased a Serverless resource group or an exclusive resource group for Data Integration.
You have created MySQL and Elasticsearch data sources. For more information, see Create a data source for Data Integration.
You have established network connectivity between the resource group and data sources. For more information, see Network connectivity solutions.

Procedure

Step 1: Select the synchronization task type

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
In the left-side navigation pane, click Sync Task, and then click Create Sync Task at the top of the page to go to the sync task creation page. Configure the following basic information:
- Data Source And Destination: MySQL→Elasticsearch
- New Task Name: Customize a name for the sync task.
- Sync Type: Offline synchronization of the entire database to Elasticsearch.

Step 2: Configure network and resources

In the Network And Resources section, select a Resource Group for the sync task. You can allocate the number of compute units (CUs) for Task Resource Usage.
For Source Data Source, select the added MySQL data source. For Destination Data Source, select the added Elasticsearch data source. Then, click Test Connectivity.
After you confirm that both the source and destination data sources are connected, click Next.

Step 3: Configure the synchronization source and rules

Select the table that you want to synchronize.
In this step, you can select the tables that you want to synchronize from the source database in the Source Database Tables section, and click the icon to move them to the Selected Tables section on the right.
Set the mapping rules from table names to index names.
After you select a source database and a table, data is written to the destination schema and table that are named the same as the source database and table by default. If no such destination schema or table exists, the system automatically creates the schema or table in the destination. You can also customize the schema or index names that are written to the destination by using Set Mapping Rules From Table Names To Index Names.
Configuration description:
- Conversion Rules For Source Database Names And Destination Schema Names: All conversion rules are applied to the original database names. After the conversion is complete, you can use the result as a variable ${db_name_src_transed} in Destination Index Name Rule.
  Important
  - If you do not use Destination Index Name Rule, this rule directly affects the final actual destination schema name.
  - If you use Destination Index Name Rule, this rule not only affects the value of the variable ${db_name_src_transed}, but also affects the final actual destination schema name.
- Conversion Rules For Source Table Names And Destination Index Names: All conversion rules are applied to the original table names. After the conversion is complete, you can use the result as a variable ${db_table_name_src_transed} in Destination Index Name Rule.
  Important
  - If you do not use Destination Index Name Rule, this rule directly affects the final actual destination index name.
  - If you use Destination Index Name Rule, this rule only affects the value of the variable ${db_table_name_src_transed} and does not directly affect the final actual destination index name. The final destination index name is determined by Destination Index Name Rule.
- Destination Index Name Rule: You can use built-in variables to name the destination index.
  The following built-in variables are available:
  - ${db_table_name_src_transed}: The index name after the conversion in "Conversion Rules for Source Table Names and Destination Index Names".
  - ${db_name_src_transed}: The destination schema name after the conversion in "Conversion Rules for Source Database Names and Destination Schema Names".
  - ${ds_name_src}: The source data source name.
For example, to replace the doc_ prefix in the source database name with the pre_ prefix, and synchronize the source tables table_01, table_02, and table_03 to an index named my_table, and finally add the suffix _post to this index, you should configure the following:
Click Next to configure the destination index.

Step 4: Configure the destination index

Click Refresh Source Tables And Elasticsearch Index Mapping to generate the destination index based on the Set Mapping Rules From Table Names To Index Names that you configured in Step 3. If you did not configure mapping rules in Step 3, data is written to the destination index that has the same name as the source table by default. If no such destination index exists, the system creates one by default. You can also modify the synchronization primary key and index creation method.

Note

The destination table name is automatically converted based on the table name conversion rules that you configured in Set Mapping Rules From Table Names To Index Names.

In the Sync Primary Key column, select a primary key replacement solution.
- If the tables in the source database have primary keys, the system removes duplicate data based on the primary keys during data synchronization.
- If the tables in the source database do not have primary keys, you need to click the icon to customize a primary key. This means using one or more non-primary key fields as a composite key to replace the primary key for removing duplicate data during synchronization.
In the Index Creation Method column, select whether to automatically create an index or use an existing index.
- When Index Creation Method is set to Automatically Create Index, the Elasticsearch Index Name column displays the automatically created Elasticsearch index name. You can click the name of the index to modify the values of the parameters that are related to the index.
- When Index Creation Method is set to Use Existing Index, you can select the index name that you want to use from the drop-down list in the Elasticsearch Index Name column.
Click Next to configure synchronization rules.

Step 5: Configure synchronization rules

The following offline synchronization solutions for entire databases are supported. You can select a solution based on your needs. Different solutions require different parameters.

Synchronization Solution	Description
Full One-time Synchronization Followed By Periodic Incremental Synchronization	If you select this method, the system first synchronizes all data to Elasticsearch. Then, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
Full One-time Synchronization Only	If you select this method, you need to perform the synchronization operations only once to synchronize all data in the source to Elasticsearch.
Incremental One-time Synchronization Only	If you select this method, you need to perform synchronization operations only once to synchronize incremental data in the source to Elasticsearch based on the specified filter conditions.
Periodic Full Synchronization	If you select this method, you must specify a scheduling cycle for the batch synchronization solution. Then, the system synchronizes all data in the source to Elasticsearch each time the system runs the solution based on the specified scheduling cycle.
Periodic Incremental Synchronization	If you select this method, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.

Full One-time Synchronization Followed by Periodic Incremental Synchronization

Full Synchronization
Parameter	Description
Clear The Original Index Before Writing	Yes: The system clears the original data in the index before writing new data. No: The system does not clear the original data in the index before writing new data. Important If you set this parameter to Yes, all data in the destination index is deleted before new data is written. Exercise caution when you select this option.
Write Type	Insert: This is the default value. The system directly inserts data into the Elasticsearch index. Update: If a record with the same primary key exists, the system updates the record. Otherwise, the system inserts the record. Note When the system updates a record, it first deletes the entire original record and then inserts a new record.
Batch Size	The number of data records that can be written to Elasticsearch at a time. The default value is `1000`. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
Incremental Synchronization
Write Type	Insert: This is the default value. The system directly inserts data into the Elasticsearch index. Update: If a record with the same primary key exists, the system updates the record. Otherwise, the system inserts the record. Note When the system updates a record, it first deletes the entire original record and then inserts a new record.
Batch Size	The number of data records that can be written to Elasticsearch at a time. The default value is `1000`. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
Incremental Condition	You can use a WHERE clause to filter the tables that you want to synchronize. You only need to enter the WHERE clause in the Incremental Condition field without the WHERE keyword. When you write the WHERE clause, you can use built-in system variables, such as `${bdp.system.bizdate}` to represent the data timestamp and `${bdp.system.cyctime}` to represent the scheduled time.
Scheduling Settings
Because periodic scheduling is required, you need to define the related properties for the periodic scheduling task, including Scheduling Cycle, Effective Date, and Pause Scheduling. The scheduling configuration for the current synchronization task is the same as the scheduling configuration for nodes in Data Development. For more information about the parameters, see Node scheduling.

Full one-time synchronization only

Full Synchronization
Parameter	Description
Clear The Original Index Before Writing	Yes: The system clears the original data in the index before writing new data. No: The system does not clear the original data in the index before writing new data. Important If you set this parameter to Yes, all data in the destination index is deleted before new data is written. Exercise caution when you select this option.
Write Type	Insert: This is the default value. The system directly inserts data into the Elasticsearch index. Update: If a record with the same primary key exists, the system updates the record. Otherwise, the system inserts the record. Note When the system updates a record, it first deletes the entire original record and then inserts a new record.
Batch Size	The number of data records that can be written to Elasticsearch at a time. The default value is `1000`. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.

Incremental one-time synchronization only

Incremental Synchronization
Parameter	Description
Write Type	Insert: This is the default value. The system directly inserts data into the Elasticsearch index. Update: If a record with the same primary key exists, the system updates the record. Otherwise, the system inserts the record. Note When the system updates a record, it first deletes the entire original record and then inserts a new record.
Batch Size	The number of data records that can be written to Elasticsearch at a time. The default value is `1000`. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
Incremental Condition	You can use a WHERE clause to filter the tables that you want to synchronize. You only need to enter the WHERE clause in the Incremental Condition field without the WHERE keyword. When you write the WHERE clause, you can use built-in system variables, such as `${bdp.system.bizdate}` to represent the data timestamp and `${bdp.system.cyctime}` to represent the scheduled time.

Periodic full synchronization

Full Synchronization
Parameter	Description
Clear The Original Index Before Writing	Yes: The system clears the original data in the index before writing new data. No: The system does not clear the original data in the index before writing new data. Important If you set this parameter to Yes, all data in the destination index is deleted before new data is written. Exercise caution when you select this option.
Write Type	Insert: This is the default value. The system directly inserts data into the Elasticsearch index. Update: If a record with the same primary key exists, the system updates the record. Otherwise, the system inserts the record. Note When the system updates a record, it first deletes the entire original record and then inserts a new record.
Batch Size	The number of data records that can be written to Elasticsearch at a time. The default value is `1000`. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
Scheduling Settings
Because periodic scheduling is required, you need to define the related properties for the periodic scheduling task, including Scheduling Cycle, Effective Date, and Pause Scheduling. The scheduling configuration for the current synchronization task is the same as the scheduling configuration for nodes in Data Development. For more information about the parameters, see Node scheduling.

Periodic incremental synchronization

Incremental Synchronization
Parameter	Description
Write Type	Insert: This is the default value. The system directly inserts data into the Elasticsearch index. Update: If a record with the same primary key exists, the system updates the record. Otherwise, the system inserts the record. Note When the system updates a record, it first deletes the entire original record and then inserts a new record.
Batch Size	The number of data records that can be written to Elasticsearch at a time. The default value is `1000`. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
Incremental Condition	You can use a WHERE clause to filter the tables that you want to synchronize. You only need to enter the WHERE clause in the Incremental Condition field without the WHERE keyword. When you write the WHERE clause, you can use built-in system variables, such as `${bdp.system.bizdate}` to represent the data timestamp and `${bdp.system.cyctime}` to represent the scheduled time.
Scheduling Settings
Because periodic scheduling is required, you need to define the related properties for the periodic scheduling task, including Scheduling Cycle, Effective Date, and Pause Scheduling. The scheduling configuration for the current synchronization task is the same as the scheduling configuration for nodes in Data Development. For more information about the parameters, see Node scheduling.

After you complete the configuration, click Next to configure runtime resources.

Step 6: Configure runtime resources

Based on the synchronization solution that you selected in the previous step, different runtime resources need to be configured in this step. The synchronization task creates both full offline synchronization tasks and incremental offline synchronization tasks. You can configure the task names and resource groups (full offline task resource group, incremental offline task resource group, and scheduling resource group) that are used to execute the tasks. You can also modify advanced parameters provided by Data Integration, such as Advanced Configuration, Expected Maximum Concurrency, Synchronization Rate, Dirty Data Tolerance, and Maximum Number Of Connections Supported By The Source.

Step 7: Execute the synchronization task

After you complete all configurations, click Complete Configuration at the bottom of the page.
On the Data Integration > Sync Task page, find the created synchronization task and click Submit And Execute in the Operation column.
Click Task List and then click the Name/ID of the corresponding task to view the detailed execution procedure of the task.

Perform O&M operations on the synchronization task

View the status of the synchronization task

After you create a synchronization task, you can view the list of created synchronization tasks and the basic information of each task on the Sync Task page.

You can click Execution Details in the Operation column to go to the task details page and view the execution status of the task.