The SQL Server data source provides a bidirectional channel to read data from and write data to SQL Server. This topic describes the data synchronization capabilities that DataWorks provides for SQL Server.
Supported versions
SQL Server Reader uses the com.microsoft.sqlserver sqljdbc4 4.0 driver. For more information about the driver capabilities, see the official documentation. The following table lists the SQL Server versions that the driver supports.
| Version | Supported |
| --- | --- |
| SQL Server 2016 | Yes |
| SQL Server 2014 | Yes |
| SQL Server 2012 | Yes |
| PDW 2008R2 AU34 | Yes |
| SQL Server 2008 R2 | Yes |
| SQL Server 2008 | Yes |
| SQL Server 2019 | No |
| SQL Server 2018 | No |
| Azure SQL Managed Instance | No |
| Azure Synapse Analytics | No |
| Azure SQL Database | Yes |
Limits
Offline synchronization supports reading data from views.
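For example, a view can be referenced in the table parameter in the same way as a regular table. The following is a minimal Reader sketch; the data source, view, and column names are hypothetical.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_source", // Hypothetical data source name.
        "table": "dbo.v_orders",           // A view can be referenced like a table.
        "column": ["order_id", "amount"],  // Hypothetical view columns.
        "where": ""
    },
    "name": "Reader",
    "category": "reader"
}
```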
Supported field types
For a complete list of SQL Server field types, see the SQL Server Help documentation. The following table lists the support for common field types in SQL Server 2016 as an example.
| SQL Server 2016 field type | SQL Server Reader | SQL Server Writer |
| --- | --- | --- |
| bigint | Supported | Supported |
| bit | Supported | Supported |
| decimal | Supported | Supported |
| int | Supported | Supported |
| money | Supported | Supported |
| numeric | Supported | Supported |
| smallint | Supported | Supported |
| smallmoney | Supported | Supported |
| tinyint | Supported | Supported |
| float | Supported | Supported |
| real | Supported | Supported |
| date | Supported | Supported |
| datetime2 | Supported | Supported |
| datetime | Supported | Supported |
| datetimeoffset | Not supported | Not supported |
| smalldatetime | Supported | Supported |
| time | Supported | Supported |
| char | Supported | Supported |
| text | Supported | Supported |
| varchar | Supported | Supported |
| nchar | Supported | Supported |
| ntext | Supported | Supported |
| nvarchar | Supported | Supported |
| binary | Supported | Supported |
| image | Supported | Supported |
| varbinary | Supported | Supported |
| cursor | Not supported | Not supported |
| hierarchyid | Not supported | Not supported |
| sql_variant | Supported | Supported |
| Spatial geometry types | Not supported | Not supported |
| table | Not supported | Not supported |
| rowversion | Not supported | Not supported |
| uniqueidentifier | Supported | Supported |
| xml | Supported | Supported |
| Spatial geography types | Not supported | Not supported |
The following table lists the mappings between SQL Server data types and the data types that SQL Server Reader and SQL Server Writer support.
| Category | SQL Server data type |
| --- | --- |
| Integer | BIGINT, INT, SMALLINT, and TINYINT |
| Floating-point | FLOAT, DECIMAL, REAL, and NUMERIC |
| String | CHAR, NCHAR, NTEXT, NVARCHAR, TEXT, VARCHAR, NVARCHAR(MAX), and VARCHAR(MAX) |
| Date and time | DATE, DATETIME, and TIME |
| Boolean | BIT |
| Binary | BINARY, VARBINARY, VARBINARY(MAX), and TIMESTAMP |
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. When you add a data source, you can view the infotips of parameters in the DataWorks console to understand their meanings.
Develop a data synchronization task
For information about where to start and how to configure a synchronization task, see the following configuration guides.
Configure a batch synchronization task for a single table
For more information about this procedure, see Configure a task in the codeless UI and Configure a task in the code editor.
For all parameters and a script example for the code editor, see Appendix: Script examples and parameters.
Configure a batch read synchronization task for an entire database
For more information about this procedure, see Configure a real-time synchronization task for an entire database.
FAQ
Data restoration for active/standby synchronization
You can use an active/standby disaster recovery architecture for SQL Server, in which the secondary database continuously restores data from the primary database by replaying its transaction logs. Data synchronization between the primary and secondary databases involves a certain latency. In some cases, such as when network latency occurs, the data restored to the secondary database can differ significantly from the data in the primary database. As a result, the data synchronized from the secondary database is not a complete, real-time snapshot.
Consistency constraints
SQL Server is a relational database management system (RDBMS) and provides interfaces for strongly consistent data queries. For example, during a synchronization task, SQL Server Reader does not retrieve any data updates that are written during the task because of the database's snapshot feature.
The preceding description applies only when SQL Server Reader runs in a single-threaded model. Depending on your configuration, SQL Server Reader can retrieve data concurrently, in which case strong data consistency cannot be guaranteed.
When SQL Server Reader shards data based on the splitPk parameter, it starts multiple concurrent tasks to complete the data synchronization. These concurrent tasks do not belong to the same read transaction and have time intervals between them. Therefore, the synchronized data is not a complete and consistent data snapshot.
Consistent snapshots for multi-threaded tasks are not technically feasible, so the issue can be addressed only from an engineering perspective, and the engineering methods involve trade-offs. The following solutions are provided; choose one based on your requirements. A configuration sketch for the first solution follows this list.

- Use single-threaded synchronization, that is, do not shard the data. The disadvantage is that synchronization is relatively slow, but consistency is guaranteed.
- Stop other data writers so that the current data is static. For example, you can lock tables or stop synchronization to the secondary database. The disadvantage is that this may affect online services.
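The following is a minimal sketch of the single-threaded option in the code editor. Leaving splitPk empty disables sharding, so the whole table is read in a single query within one read transaction. The data source and table names are hypothetical.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_source", // Hypothetical data source name.
        "table": "dbo.test_table",         // Hypothetical source table.
        "column": ["id", "name"],
        "splitPk": ""                      // Empty: no sharding, single-threaded, consistent read.
    },
    "name": "Reader",
    "category": "reader"
}
```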
Database encoding
SQL Server Reader uses Java Database Connectivity (JDBC) to retrieve data. JDBC is compatible with various encodings and performs encoding conversion at the underlying layer. Therefore, you do not need to specify an encoding for SQL Server Reader. It automatically retrieves and converts the encoding.
Methods for incremental data synchronization
SQL Server Reader uses JDBC SELECT statements to retrieve data. You can use `SELECT ... WHERE ...` statements to retrieve incremental data in the following ways. A configuration sketch for the first way follows this list.

- When an online application writes data to the database, it can populate a modify field with the change timestamp for every change, including inserts, updates, and logical deletions. For this type of application, SQL Server Reader needs only a WHERE clause that filters on the timestamp of the previous synchronization run.
- For new, append-only streaming data, SQL Server Reader can add a WHERE clause that filters on the maximum auto-increment ID from the previous run.
- If no business field distinguishes new and modified data, SQL Server Reader cannot perform incremental synchronization and can synchronize only full data.
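As a sketch of the timestamp-based approach, the following Reader configuration filters on an assumed gmt_modified column that the application maintains; ${bizdate} is a DataWorks scheduling parameter that resolves to the data timestamp of the task instance.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_source",        // Hypothetical data source name.
        "table": "dbo.test_table",                // Hypothetical source table.
        "column": ["id", "name", "gmt_modified"], // gmt_modified is an assumed timestamp column.
        "where": "gmt_modified >= '${bizdate}'"   // Read only rows changed since the previous run.
    },
    "name": "Reader",
    "category": "reader"
}
```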
SQL security
SQL Server Reader provides the querySql parameter for you to implement custom SELECT statements. SQL Server Reader does not perform any security checks on the querySql statement.
Appendix: Script examples and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following sections describe the parameters that you must configure for this data source in the code editor.
Reader script example
```json
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"sqlserver",// The plug-in name.
"parameter":{
"datasource":"",// The data source.
"column":[// The fields.
"id",
"name"
],
"where":"",// The filter condition.
"splitPk":"",// If you specify splitPk, the data is sharded based on the specified field.
"table":""// The data table.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The number of error records.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. If you set throttle to false, the mbps parameter does not take effect, and no rate limit is imposed. If you set throttle to true, a rate limit is imposed.
"concurrent":1, // The number of concurrent jobs.
"mbps":"12"// The rate limit. 1 mbps is equal to 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
```

If you want to use a querySql query, the following code provides an example of the Reader part of the script. In this example, the SQL Server data source is sql_server_source, the table to query is dbo.test_table, and the column to query is name.

```json
{
"stepType": "sqlserver",
"parameter": {
"connection": [
{
"querySql": ["select name from dbo.test_table"],
"datasource": "sql_server_source"
}
],
"datasource": "sql_server_source",
"column": ["name"],
"where": "",
"splitPk": "id"
},
"name": "Reader",
"category": "reader"
}
```

Reader script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. The value must be the same as the name of the data source that you add in the code editor. | Yes | None |
| table | The name of the source table. A task can synchronize data from only one table. | Yes | None |
| column | The columns from which you want to synchronize data. Specify the columns in a JSON array. To synchronize all columns, set this parameter to `["*"]`. | Yes | None |
| splitFactor | The split factor, which specifies the number of shards for data synchronization. If multiple concurrent threads are configured, data is sharded into Concurrency × splitFactor shards. For example, if the concurrency is 5 and splitFactor is 5, data is sharded into 25 shards and executed on 5 concurrent threads. Note: The recommended value is an integer from 1 to 100. An excessively large value may cause an out-of-memory (OOM) error. | No | 5 |
| splitPk | The field based on which data is sharded when SQL Server Reader extracts data. If you specify splitPk, the data synchronization system starts concurrent tasks to synchronize the data, which can improve efficiency. | No | None |
| where | The filter condition. SQL Server Reader combines the specified column, table, and where parameters into an SQL statement to retrieve data. For example, during testing, you can set the where condition to `limit 10`. To synchronize only the data of the current day, you can set the where condition to a date-based filter such as `gmt_create > $bizdate` (the column name is illustrative). | No | None |
| querySql | The custom SQL statement for refined data filtering. If you specify this parameter, data is retrieved by using this statement, and the table, column, where, and splitPk parameters are ignored. | No | None |
| fetchSize | The number of records to retrieve in each batch. This parameter determines the number of network interactions between Data Integration and the database server and can improve data retrieval performance. Note: An excessively large fetchSize value (> 2048) may cause an OOM error during data synchronization. | No | 1024 |
SQL Server Reader combines the information that you configure for table, column, and where into an SQL statement and sends the statement to the SQL Server database.
SQL Server Reader sends the statement that you configure for querySql directly to the SQL Server database.
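To make the assembly rule concrete, the following sketch shows a hypothetical parameter combination and, in comments, the statement that it is equivalent to.

```json
{
    "parameter": {
        "table": "dbo.test_table",   // Hypothetical source table.
        "column": ["id", "name"],
        "where": "id > 100"
        // The assembled statement sent to SQL Server is equivalent to:
        //   SELECT id, name FROM dbo.test_table WHERE id > 100;
        // If querySql is configured instead, that statement is sent as-is,
        // and the table, column, and where settings do not apply.
    }
}
```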
Writer script example
```json
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"sqlserver",// The plug-in name.
"parameter":{
"postSql":[],// The SQL statement to execute after the data synchronization task.
"datasource":"",// The data source.
"column":[// The fields.
"id",
"name"
],
"table":"",// The table name.
"preSql":[]// The SQL statement to execute before the data synchronization task.
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The number of error records.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. If you set throttle to false, the mbps parameter does not take effect, and no rate limit is imposed. If you set throttle to true, a rate limit is imposed.
"concurrent":1, // The number of concurrent jobs.
"mbps":"12"// The rate limit. 1 mbps is equal to 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
```

Writer script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. The value must be the same as the name of the data source that you add in the code editor. | Yes | None |
| table | The name of the destination table. | Yes | None |
| column | The destination columns to which you want to write data. Separate the column names with commas, for example, `"column": ["id", "name", "age"]`. To write data to all columns, set this parameter to `["*"]`. | Yes | None |
| preSql | The SQL statement that is executed before the data synchronization task starts. In the codeless UI, you can execute only one SQL statement. In the code editor, you can execute multiple SQL statements. For example, you can use this parameter to clear previous data. | No | None |
| postSql | The SQL statement that is executed after the data synchronization task is complete. In the codeless UI, you can execute only one SQL statement. In the code editor, you can execute multiple SQL statements. For example, you can use this parameter to add a timestamp. | No | None |
| writeMode | The write mode. Only insert is supported. If a primary key or unique index conflict occurs, Data Integration treats the conflicting record as dirty data but retains the original data. | No | insert |
| batchSize | The number of records to submit in each batch. A proper value can greatly reduce the number of network interactions between Data Integration and SQL Server and improve overall throughput. An excessively large value may cause an OOM error during data synchronization. | No | 1024 |
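The following is a minimal sketch of a Writer parameter block that combines these settings. The data source and table names are hypothetical, and the preSql statement clears the destination table so that reruns of the task stay idempotent.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_target",        // Hypothetical data source name.
        "table": "dbo.test_table",                // Hypothetical destination table.
        "column": ["id", "name"],
        "preSql": ["delete from dbo.test_table"], // Clear old data before writing so reruns stay idempotent.
        "postSql": [],
        "writeMode": "insert",
        "batchSize": 1024                         // Records committed per batch.
    },
    "name": "Writer",
    "category": "writer"
}
```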