DataWorks: Accelerate or throttle offline synchronization

Last Updated: Oct 21, 2025

This topic describes the factors that affect data synchronization speed, how to adjust concurrency to maximize synchronization speed, job throttling options, and solutions for slow data synchronization scenarios.

Overview

  • Data synchronization speed is affected by many factors, such as task configurations, database performance, and network conditions. For more information, see Factors that affect data synchronization speed.

  • Slow data synchronization can occur at different stages of the process. This topic describes solutions for slow performance at each stage. For more information, see Scenarios and solutions for slow data synchronization.

  • If the database performance is limited, a faster synchronization speed is not always better. A high speed may overstress the database and affect other production services. Data Integration provides throttling options that you can configure as needed. For more information, see Limit the synchronization speed.

Factors that affect data synchronization speed

Data synchronization speed is affected by factors such as the source and destination database environments and task configurations. You are primarily responsible for monitoring and tuning the performance, load, and network conditions of the source and destination databases.

The following factors affect data synchronization speed:

Factor

Description

Source data source

  • Database performance: CPU, memory, SSD, network, and disk performance.

  • Concurrency: A higher concurrency on the data source results in a heavier database load. A database with better performance can handle higher concurrency. This lets you configure more concurrent data extractions for the data synchronization job.

  • Network: The bandwidth (throughput) and speed of the network.

Resource group for scheduling used by the offline sync task

Offline sync tasks are dispatched by scheduling resources to run on Data Integration execution resources. The usage of scheduling resources also affects the overall synchronization efficiency.

Offline sync task configuration

  • Transfer speed: Whether an upper limit is set for the task synchronization speed.

  • Concurrency: The maximum number of threads that can read from the source or write to the destination data storage in parallel.

  • WAIT resources: Whether the task has to wait for resources before it can run.

  • The Bytes setting: A single thread uses Bytes=1048576. If the task is sensitive to network speed, a timeout may occur. In this case, set Bytes to a smaller value.

  • Whether the fields used in query statements are indexed.

Destination data source

  • Performance: CPU, memory, SSD, network, and disk performance.

  • Load: An excessive load on the destination database affects the data write efficiency of the sync task.

  • Network: The bandwidth (throughput) and speed of the network.

Scenarios and solutions for slow data synchronization

Note

For more information about offline sync task logs, see Analyze offline sync logs.

Scenario of slow data synchronization

Phenomenon

Possible cause

Solution

Waiting for scheduling resources

  • Phenomenon 1: The sync task log shows the task is waiting for the gateway.

  • Phenomenon 2: The instance properties page shows a long wait time for resources.

Offline tasks are dispatched by a scheduling resource group to an engine for execution. If the number of tasks running on the scheduling resource group reaches its upper limit, new tasks must wait for running tasks to complete and release resources.

On the Operation Center page, you can view which tasks are occupying resources while the current task is waiting.

Note

If you use a shared resource group for scheduling, migrate the task to an exclusive or Serverless resource group for execution.

Waiting for execution resources

The sync task log shows `wait`.

The remaining resources in the Data Integration resource group are insufficient to run the current task.

For example, a resource group supports a maximum of eight concurrent threads. Three tasks are configured, each requiring three concurrent threads. If two tasks run at the same time, they use six threads. The resource group has only two threads left. The third task, which requires three threads, must wait because of insufficient resources. The log for this task shows `wait`.
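The arithmetic in this example can be sketched as a small admission check. This is an illustration only, not the actual Data Integration scheduler: the function name `admit` and the task names are hypothetical, and the sketch assumes that tasks are admitted strictly in submission order.

```python
def admit(capacity, tasks):
    """Admit tasks in submission order until one does not fit.

    capacity: maximum concurrent threads of the resource group.
    tasks: list of (name, threads_required) in submission order.
    Returns (running, waiting, free_threads).
    Assumption: once a task has to wait, later tasks queue behind it.
    """
    free = capacity
    running, waiting = [], []
    for name, threads in tasks:
        if not waiting and threads <= free:
            running.append(name)
            free -= threads
        else:
            waiting.append(name)
    return running, waiting, free


# Three tasks that each need three threads, in a group that supports eight.
running, waiting, free = admit(8, [("task_a", 3), ("task_b", 3), ("task_c", 3)])
print(running, waiting, free)
```

With these inputs, the first two tasks run, two threads stay free, and the third task waits, which matches the `wait` state described above.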

Check if other tasks are running and using many resources in the resource group. You can use the following solutions to resolve this issue:

Note
  • On the Operation Center page, you can view the resource usage and the information about the tasks that are using resources while the current task is waiting.

  • The maximum number of concurrent threads that a resource group can run varies based on its specifications. For more information, see Performance metrics and billing standards.

  1. Check if the tasks that occupy resources are stuck or have slowed down significantly. If so, resolve these issues first or stop some of the tasks.

  2. If the tasks are not stuck, wait for them to complete and release the resources. Then, start the current task.

  3. Find the tasks that are using the resources and their owners, and coordinate with the owners to reduce the concurrency of those tasks.

  4. Reduce the concurrency of the current sync task, and then resubmit and publish it.

  5. Scale out the resource group for execution. For more information, see Scale-out and scale-in operations.

Sync task runs too slowly

The sync task log shows run, but the speed is 0. The task is running. If this state persists, click Detail log to view the execution details. If the Detail log shows a large value for the WaitReaderTime parameter, it indicates that the task is waiting a long time for the source to return data.

  • The source shard key is not configured properly.

    The SQL statements generated based on the shard key to read data from the source database execute slowly.

  • The SQL statements used to read data from the source take a long time to execute (for example, the `where` or `querySql` parameters in some plugins).

    Scenario example: A data synchronization task slows down because of a full table scan. This happens when the fields used in the `WHERE` clause are not indexed.

  • The database load is high at the time of synchronization.

  • Network issues exist, such as limited bandwidth (throughput) or slow network speed.

Note

The data synchronization speed cannot be guaranteed over the Internet.

  • To resolve slow statement execution:

    • When you configure pre- or post-SQL statements:

      • Ensure that an index is added to the fields used for data filtering. This prevents the sync task from performing a full table scan.

      • Avoid or reduce complex processing, such as using functions. If necessary, perform these operations in the database before synchronization.

    • Check if the source data table contains too much data. If so, split the data into multiple tasks.

    • Query the logs to find the blocked SQL statements and consult a database administrator for a solution.

  • Check the database load at the time of synchronization.

The sync task log shows run, but the speed is 0. The task is running. If this state persists, click Detail log to view the execution details. If the Detail log shows a large value for the WaitWriterTime parameter, it indicates that the task is taking a long time to write data to the destination.

  • The pre- or post-SQL statements configured in the writer plugin execute slowly (for example, the SQL statements configured in the `preSql` or `postSql` parameters in some plugins).

  • The database load is high at the time of synchronization.

  • Network issues exist, such as limited bandwidth (throughput) or slow network speed.

Note

The data synchronization speed cannot be guaranteed over the Internet.

The log shows run and a non-zero speed, but the synchronization process is slow.

  • The shard key for a relational database task is not configured properly. This causes the concurrency setting to be ineffective, and the task runs with a single thread.

  • The concurrency is set too low.

  • A large amount of dirty data is generated during synchronization, which affects the speed.

  • Database performance issues exist.

    Note

    A database with better performance can handle higher concurrency. This lets you configure a higher concurrency for the data synchronization job.

  • Network issues exist, such as limited bandwidth (throughput) or slow network speed.

Note

The data synchronization speed cannot be guaranteed over the Internet.

  1. Configure the shard key properly. For more information about configuring a task shard key, see Configure a task in the codeless UI.

  2. Within the maximum concurrency supported by the resource group, plan the concurrency for each task and increase the concurrency for the current task as needed.

    In the codeless UI, configure the concurrency to specify the degree of parallelism for the task. In the code editor, set the concurrency in the task's JSON configuration.

    Note

    The maximum number of concurrent threads that a resource group can run varies based on its specifications. For more information, see Performance metrics and billing standards.

  3. Handle dirty data. For more information about dirty data, see Data Integration.

  4. When you set the concurrency for distributed tasks, the number of machines in the resource group cannot exceed the maximum concurrency of a single machine in that group.

  5. When you synchronize data across clouds or regions, establish a network connection and use an internal network for synchronization. For more information about network connectivity solutions, see Network connectivity solutions.

  6. Check the database load.
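As an illustration of step 2, the following sketch builds the speed settings for the code editor and checks the requested concurrency against the resource group limit. The key name `concurrent`, and its placement next to `throttle` under `setting.speed`, are assumptions that this topic does not show, so verify them against the task configuration reference before use. The helper name `build_speed_setting` is hypothetical.

```python
import json


def build_speed_setting(concurrent, max_group_concurrency):
    """Return a settings fragment with the requested concurrency.

    Assumption: the code editor reads the concurrency from
    setting.speed.concurrent. The key name is not taken from this topic.
    """
    if not 1 <= concurrent <= max_group_concurrency:
        raise ValueError(
            f"concurrency {concurrent} is outside 1..{max_group_concurrency}"
        )
    return {"setting": {"speed": {"concurrent": concurrent, "throttle": False}}}


# Request 4 concurrent threads in a resource group that supports 8.
print(json.dumps(build_speed_setting(4, 8), indent=2))
```

Rejecting a value above the resource group limit up front avoids submitting a task that can never obtain enough resources.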

Limit the synchronization speed

By default, Data Integration sync tasks are not throttled. A task runs at the highest possible speed within the configured concurrency limit. However, a high speed may overstress the database and affect other production services. Data Integration provides a throttling option that you can configure as needed. After you enable throttling, we recommend that you set the maximum speed to no more than 30 MB/s. The following code shows how to configure throttling in the code editor to set a bandwidth limit of 1 MB/s.

"setting": {
      "speed": {
         "throttle": true, // Enables throttling.
         "mbps": 1 // The specific rate, in MB/s.
      }
    }
  • The throttle parameter can be set to true or false:

    • When throttle is set to true, throttling is enabled and you must set mbps to a specific value. If mbps is not set, the task may report an error or run at an abnormal rate.

    • When throttle is set to false, throttling is disabled, and the mbps configuration is ignored.

  • The traffic measure is a Data Integration metric and does not represent the actual network interface card (NIC) traffic. Typically, the NIC traffic is one to two times the channel traffic. The actual traffic overhead depends on the data serialization of the data storage system.

  • A single semi-structured file does not have a shard key. For multiple files, you can set a job speed limit. However, the effective speed limit is also related to the number of files.

    For example, for n files, the effective speed limit is n MB/s:

    • If you set the speed limit to n+1 MB/s, the data is synchronized at n MB/s.

    • If you set the speed limit to n-1 MB/s, the data is synchronized at n-1 MB/s.

  • For a relational database, you must configure a shard key for the speed limit to be effective across multiple threads. Relational databases typically support only numeric shard keys. However, Oracle databases support both numeric and string shard keys.
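The throttle rules and the file-count example above can be sketched in a few lines. This is a minimal illustration, not Data Integration's own validation logic: the function names are hypothetical, and the file rule assumes that each file contributes at most 1 MB/s, as the n-file example implies.

```python
def effective_throttle(speed):
    """Return the configured limit in MB/s, or None when throttling is off.

    speed: the dict found under setting.speed in the task configuration.
    """
    if not speed.get("throttle", False):
        return None  # mbps is ignored when throttle is false
    if "mbps" not in speed:
        raise ValueError("throttle is true, so mbps must be set")
    mbps = float(speed["mbps"])
    if mbps <= 0:
        raise ValueError("mbps must be greater than 0")
    return mbps


def effective_file_limit(limit_mbps, file_count):
    """Effective limit for semi-structured files: capped by the file count."""
    return min(limit_mbps, file_count)


print(effective_throttle({"throttle": True, "mbps": 1}))
print(effective_file_limit(6, 5))  # limit of n+1 with n = 5 files
print(effective_file_limit(4, 5))  # limit of n-1 with n = 5 files
```

For five files, a configured limit of 6 MB/s is capped at 5 MB/s, while a limit of 4 MB/s applies unchanged.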

FAQ

  • FAQ for offline synchronization.

  • The `BatchSize` or `maxfilesize` parameter controls the number of records in a batch submission. A suitable value can reduce network interactions between Data Integration and the database and increase throughput. However, if this value is too large, an out-of-memory (OOM) error may occur in the synchronization process. If this error occurs, see FAQ for offline synchronization.

Appendix: Check the actual parallelism

On the log details page of a data sync task, find a log entry in the format `JobContainer - Job set Channel-Number to 2 channels.` The value of channels is the actual degree of parallelism for the task.
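If you have the log text saved locally, a short script can extract this value. The sketch below matches only the log format quoted above. The function name `actual_parallelism` and the sample log text are illustrative.

```python
import re

# Matches the entry "JobContainer - Job set Channel-Number to 2 channels."
CHANNEL_PATTERN = re.compile(r"Job set Channel-Number to (\d+) channels")


def actual_parallelism(log_text):
    """Return the channel count from the first matching entry, or None."""
    match = CHANNEL_PATTERN.search(log_text)
    return int(match.group(1)) if match else None


sample_log = "JobContainer - Job set Channel-Number to 2 channels."
print(actual_parallelism(sample_log))
```

A return value of None means that the log text contains no such entry, for example because the task has not started yet.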

Appendix: Relationship between parallelism and resource usage

In an exclusive resource group, resource usage is determined by the relationship between concurrency and CPU, and between concurrency and memory:

  • Relationship between concurrency and CPU

    In an exclusive resource group, the ratio of vCPUs to concurrency is 1:2. For example, an ECS machine with 4 vCPUs and 8 GiB of memory provides a concurrency quota of 8 for its exclusive resource group. It can run a maximum of eight offline sync tasks with a concurrency of 1, or four offline sync tasks with a concurrency of 2.

    If the concurrency required by a new task submitted to an exclusive resource group is greater than the remaining concurrency quota of the group, the new task must wait. It runs after the running tasks in the group are complete and the remaining concurrency quota is sufficient for the new task.

    Note

    If the concurrency set for a new task exceeds the maximum concurrency quota of the exclusive resource group, the task will be permanently stuck in the waiting state. For example, this occurs if you submit a task with a concurrency of 10 to an exclusive resource group on an ECS machine with 4 vCPUs and 8 GiB of memory. Because the resource group allocates resources based on the submission order, subsequent tasks will also be blocked.

  • Relationship between concurrency and memory

    In an exclusive resource group, the memory occupied by a single task is calculated as Min{768 + (Concurrency - 1) × 256, 8029} MB. However, you can override this calculation in the task settings. In the code editor, set the JSON path $.setting.jvmOption.jvm

    Ensure that the total memory used by all running tasks is at least 1 GB less than the total memory of all machines in the exclusive resource group. This allows the tasks to run smoothly. If this condition is not met, the Linux OOM Killer mechanism may forcibly stop the tasks.

    Note

    If you do not use the code editor to increase the task's memory, you only need to consider the concurrency quota limit of the exclusive resource group when you submit tasks.
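The two relationships above can be checked with a short calculation. This sketch uses the 1:2 ratio and the memory formula from this section. The function names are hypothetical, and treating "1 GB" as 1024 MB is an assumption.

```python
def concurrency_quota(vcpus):
    """Concurrency quota of an exclusive resource group (vCPU:concurrency = 1:2)."""
    return vcpus * 2


def task_memory_mb(concurrency):
    """Default memory of one task: Min{768 + (Concurrency - 1) x 256, 8029} MB."""
    return min(768 + (concurrency - 1) * 256, 8029)


def memory_fits(total_memory_mb, task_concurrencies, headroom_mb=1024):
    """Check that running tasks leave at least headroom_mb of memory free.

    Assumption: the "1 GB" headroom is taken as 1024 MB.
    """
    used = sum(task_memory_mb(c) for c in task_concurrencies)
    return used <= total_memory_mb - headroom_mb


# An ECS machine with 4 vCPUs and 8 GiB (8192 MB) of memory.
print(concurrency_quota(4))
print(task_memory_mb(1), task_memory_mb(2))
print(memory_fits(8192, [1] * 8), memory_fits(8192, [2] * 4))
```

For the 4 vCPU, 8 GiB machine, both example workloads (eight tasks with a concurrency of 1, or four tasks with a concurrency of 2) stay within the concurrency quota of 8 and leave more than 1024 MB of memory free under the default formula.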

Appendix: Synchronization speed

Read and write speeds vary greatly among different data sources. The following sections describe the average single-thread synchronization speed for typical data sources in an exclusive resource group:

  • Average single-thread speed for different Writer plugins

    Writer                     Average single-thread speed (KB/s)
    AnalyticDB for PostgreSQL  147.8
    AnalyticDB for MySQL       181.3
    ClickHouse                 5259.3
    DataHub                    45.8
    DRDS                       93.1
    Elasticsearch              74.0
    FTP                        565.6
    GDB                        17.1
    HBase                      2395.0
    hbase20xsql                37.8
    HDFS                       1301.3
    Hive                       1960.4
    HybridDB for MySQL         323.0
    HybridDB for PostgreSQL    116.0
    Kafka                      0.9
    LogHub                     788.5
    MongoDB                    51.6
    MySQL                      54.9
    ODPS                       660.6
    Oracle                     66.7
    OSS                        3718.4
    OTS                        138.5
    PolarDB                    45.6
    PostgreSQL                 168.4
    Redis                      7846.7
    SQLServer                  8.3
    Stream                     116.1
    TSDB                       2.3
    Vertica                    272.0

  • Average single-thread speed for different Reader plugins

    Reader                     Average single-thread speed (KB/s)
    AnalyticDB for PostgreSQL  220.3
    AnalyticDB for MySQL       248.6
    DRDS                       146.4
    Elasticsearch              215.8
    FTP                        279.4
    HBase                      1605.6
    hbase20xsql                465.3
    HDFS                       2202.9
    Hologres                   741.0
    HybridDB for MySQL         111.3
    HybridDB for PostgreSQL    496.9
    Kafka                      3117.2
    LogHub                     1014.1
    MongoDB                    361.3
    MySQL                      459.5
    ODPS                       207.2
    Oracle                     133.5
    OSS                        665.3
    OTS                        229.3
    OTSStream                  661.7
    PolarDB                    238.2
    PostgreSQL                 165.6
    RDBMS                      845.6
    SQLServer                  143.7
    Stream                     85.0
    Vertica                    454.3