The SQL Server data source provides a bidirectional channel to read data from and write data to SQL Server. This topic describes the data synchronization capabilities that DataWorks provides for SQL Server.
Supported versions
SQL Server Reader uses the com.microsoft.sqlserver sqljdbc4 4.0 driver. For more information about the driver capabilities, see the official documentation. The following table lists the SQL Server versions that the driver supports.
| Version | Supported |
| --- | --- |
| SQL Server 2016 | Yes |
| SQL Server 2014 | Yes |
| SQL Server 2012 | Yes |
| PDW 2008R2 AU34 | Yes |
| SQL Server 2008 R2 | Yes |
| SQL Server 2008 | Yes |
| SQL Server 2019 | No |
| SQL Server 2018 | No |
| Azure SQL Managed Instance | No |
| Azure Synapse Analytics | No |
| Azure SQL Database | Yes |
Limits
Offline synchronization supports reading data from views.
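For example, a view can be referenced in the table parameter in the same way as a regular table. The following is a minimal Reader sketch; the data source, view, and column names are hypothetical.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_source", // Hypothetical data source name.
        "table": "dbo.v_orders",           // A view can be referenced like a table.
        "column": ["order_id", "amount"],  // Hypothetical view columns.
        "where": ""
    },
    "name": "Reader",
    "category": "reader"
}
```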
Supported field types
For a complete list of SQL Server field types, see the SQL Server Help documentation. The following table lists the support for common field types in SQL Server 2016 as an example.
| SQL Server 2016 field type | SQL Server Reader | SQL Server Writer |
| --- | --- | --- |
| bigint | Supported | Supported |
| bit | Supported | Supported |
| decimal | Supported | Supported |
| int | Supported | Supported |
| money | Supported | Supported |
| numeric | Supported | Supported |
| smallint | Supported | Supported |
| smallmoney | Supported | Supported |
| tinyint | Supported | Supported |
| float | Supported | Supported |
| real | Supported | Supported |
| date | Supported | Supported |
| datetime2 | Supported | Supported |
| datetime | Supported | Supported |
| datetimeoffset | Not supported | Not supported |
| smalldatetime | Supported | Supported |
| time | Supported | Supported |
| char | Supported | Supported |
| text | Supported | Supported |
| varchar | Supported | Supported |
| nchar | Supported | Supported |
| ntext | Supported | Supported |
| nvarchar | Supported | Supported |
| binary | Supported | Supported |
| image | Supported | Supported |
| varbinary | Supported | Supported |
| cursor | Not supported | Not supported |
| hierarchyid | Not supported | Not supported |
| sql_variant | Supported | Supported |
| Spatial geometry types | Not supported | Not supported |
| table | Not supported | Not supported |
| rowversion | Not supported | Not supported |
| uniqueidentifier | Supported | Supported |
| xml | Supported | Supported |
| Spatial geography types | Not supported | Not supported |
The following table lists the mappings between SQL Server data types and the data types that SQL Server Reader and SQL Server Writer support.
| Category | SQL Server data type |
| --- | --- |
| Integer | BIGINT, INT, SMALLINT, and TINYINT |
| Floating-point | FLOAT, DECIMAL, REAL, and NUMERIC |
| String | CHAR, NCHAR, NTEXT, NVARCHAR, TEXT, VARCHAR, NVARCHAR(MAX), and VARCHAR(MAX) |
| Date and time | DATE, DATETIME, and TIME |
| Boolean | BIT |
| Binary | BINARY, VARBINARY, VARBINARY(MAX), and TIMESTAMP |
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. When you add a data source, you can view the infotips of parameters in the DataWorks console to understand their meanings.
Develop a data synchronization task
For information about where to start and how to configure a synchronization task, see the following configuration guides.
Configure a batch synchronization task for a single table
For more information about this procedure, see Configure a task in the codeless UI and Configure a task in the code editor.
For all parameters and a script example for the code editor, see Appendix: Script examples and parameters.
Configure a batch read synchronization task for an entire database
For more information about this procedure, see Configure a real-time synchronization task for an entire database.
FAQ
Data restoration for active/standby synchronization
You can use an active/standby disaster recovery architecture for SQL Server, in which the secondary database continuously restores data from the primary database by replaying its transaction logs. Data synchronization between the primary and secondary databases involves a certain latency. In some cases, such as when network latency occurs, the data restored to the secondary database can differ significantly from the data in the primary database. As a result, the data synchronized from the secondary database is not a complete, real-time snapshot.
Consistency constraints
SQL Server is a relational database management system (RDBMS) and provides interfaces for strongly consistent data queries. For example, during a synchronization task, SQL Server Reader does not retrieve any data updates that are written during the task because of the database's snapshot feature.
The preceding description applies only when SQL Server Reader runs in a single-threaded model. Depending on your configuration, SQL Server Reader can retrieve data concurrently, in which case strong data consistency cannot be guaranteed.
When SQL Server Reader shards data based on the splitPk parameter, it starts multiple concurrent tasks to complete the data synchronization. These concurrent tasks do not belong to the same read transaction and have time intervals between them. Therefore, the synchronized data is not a complete and consistent data snapshot.
Consistent snapshots for multi-threaded tasks are not technically feasible, so the issue can be addressed only from an engineering perspective, and the engineering methods involve trade-offs. The following solutions are provided; choose one based on your requirements. A configuration sketch for the first solution follows this list.

- Use single-threaded synchronization, that is, do not shard the data. The disadvantage is that synchronization is relatively slow, but consistency is guaranteed.
- Stop other data writers so that the current data is static. For example, you can lock tables or stop synchronization to the secondary database. The disadvantage is that this may affect online services.
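The following is a minimal sketch of the single-threaded option in the code editor. Leaving splitPk empty disables sharding, so the whole table is read in a single query within one read transaction. The data source and table names are hypothetical.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_source", // Hypothetical data source name.
        "table": "dbo.test_table",         // Hypothetical source table.
        "column": ["id", "name"],
        "splitPk": ""                      // Empty: no sharding, single-threaded, consistent read.
    },
    "name": "Reader",
    "category": "reader"
}
```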
Database encoding
SQL Server Reader uses Java Database Connectivity (JDBC) to retrieve data. JDBC is compatible with various encodings and performs encoding conversion at the underlying layer. Therefore, you do not need to specify an encoding for SQL Server Reader. It automatically retrieves and converts the encoding.
Methods for incremental data synchronization
SQL Server Reader uses JDBC SELECT statements to retrieve data. You can use `SELECT ... WHERE ...` statements to retrieve incremental data in the following ways. A configuration sketch for the first way follows this list.

- When an online application writes data to the database, it can populate a modify field with the change timestamp for every change, including inserts, updates, and logical deletions. For this type of application, SQL Server Reader needs only a WHERE clause that filters on the timestamp of the previous synchronization run.
- For new, append-only streaming data, SQL Server Reader can add a WHERE clause that filters on the maximum auto-increment ID from the previous run.
- If no business field distinguishes new and modified data, SQL Server Reader cannot perform incremental synchronization and can synchronize only full data.
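As a sketch of the timestamp-based approach, the following Reader configuration filters on an assumed gmt_modified column that the application maintains; ${bizdate} is a DataWorks scheduling parameter that resolves to the data timestamp of the task instance.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_source",        // Hypothetical data source name.
        "table": "dbo.test_table",                // Hypothetical source table.
        "column": ["id", "name", "gmt_modified"], // gmt_modified is an assumed timestamp column.
        "where": "gmt_modified >= '${bizdate}'"   // Read only rows changed since the previous run.
    },
    "name": "Reader",
    "category": "reader"
}
```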
SQL security
SQL Server Reader provides the querySql parameter for you to implement custom SELECT statements. SQL Server Reader does not perform any security checks on the querySql statement.
Appendix: Script examples and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following sections describe the parameters that you must configure for this data source in the code editor.
Reader script example
```json
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"sqlserver",// The plug-in name.
"parameter":{
"datasource":"",// The data source.
"column":[// The fields.
"id",
"name"
],
"where":"",// The filter condition.
"splitPk":"",// If you specify splitPk, the data is sharded based on the specified field.
"table":""// The data table.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The number of error records.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. If you set throttle to false, the mbps parameter does not take effect, and no rate limit is imposed. If you set throttle to true, a rate limit is imposed.
"concurrent":1, // The number of concurrent jobs.
"mbps":"12"// The rate limit. 1 mbps is equal to 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
```

If you want to use a querySql query, the following code provides an example of the Reader part of the script. In this example, the SQL Server data source is sql_server_source, the table to query is dbo.test_table, and the column to query is name.

```json
{
"stepType": "sqlserver",
"parameter": {
"connection": [
{
"querySql": ["select name from dbo.test_table"],
"datasource": "sql_server_source"
}
],
"datasource": "sql_server_source",
"column": ["name"],
"where": "",
"splitPk": "id"
},
"name": "Reader",
"category": "reader"
}
```

Reader script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. The value must be the same as the name of the data source that you add in the code editor. | Yes | None |
| table | The name of the source table. A task can synchronize data from only one table. | Yes | None |
| column | The columns from which you want to synchronize data. Specify the columns in a JSON array. To synchronize all columns, set this parameter to `["*"]`. | Yes | None |
| splitFactor | The split factor, which specifies the number of shards for data synchronization. If multiple concurrent threads are configured, data is sharded into Concurrency × splitFactor shards. For example, if the concurrency is 5 and splitFactor is 5, data is sharded into 25 shards and executed on 5 concurrent threads. Note: The recommended value is an integer from 1 to 100. An excessively large value may cause an out-of-memory (OOM) error. | No | 5 |
| splitPk | The field based on which data is sharded when SQL Server Reader extracts data. If you specify splitPk, the data synchronization system starts concurrent tasks to synchronize the data, which can improve efficiency. | No | None |
| where | The filter condition. SQL Server Reader combines the specified column, table, and where parameters into an SQL statement to retrieve data. For example, during testing, you can set the where condition to `limit 10`. To synchronize only the data of the current day, you can set the where condition to a date-based filter such as `gmt_create > $bizdate` (the column name is illustrative). | No | None |
| querySql | The custom SQL statement for refined data filtering. If you specify this parameter, data is retrieved by using this statement, and the table, column, where, and splitPk parameters are ignored. | No | None |
| fetchSize | The number of records to retrieve in each batch. This parameter determines the number of network interactions between Data Integration and the database server and can improve data retrieval performance. Note: An excessively large fetchSize value (> 2048) may cause an OOM error during data synchronization. | No | 1024 |
SQL Server Reader combines the information that you configure for table, column, and where into an SQL statement and sends the statement to the SQL Server database.
SQL Server Reader sends the statement that you configure for querySql directly to the SQL Server database.
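To make the assembly rule concrete, the following sketch shows a hypothetical parameter combination and, in comments, the statement that it is equivalent to.

```json
{
    "parameter": {
        "table": "dbo.test_table",   // Hypothetical source table.
        "column": ["id", "name"],
        "where": "id > 100"
        // The assembled statement sent to SQL Server is equivalent to:
        //   SELECT id, name FROM dbo.test_table WHERE id > 100;
        // If querySql is configured instead, that statement is sent as-is,
        // and the table, column, and where settings do not apply.
    }
}
```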
Writer script example
```json
{
"type":"job",
"version":"2.0",// The version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"sqlserver",// The plug-in name.
"parameter":{
"postSql":[],// The SQL statement to execute after the data synchronization task.
"datasource":"",// The data source.
"column":[// The fields.
"id",
"name"
],
"table":"",// The table name.
"preSql":[]// The SQL statement to execute before the data synchronization task.
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// The number of error records.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. If you set throttle to false, the mbps parameter does not take effect, and no rate limit is imposed. If you set throttle to true, a rate limit is imposed.
"concurrent":1, // The number of concurrent jobs.
"mbps":"12"// The rate limit. 1 mbps is equal to 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
```

Writer script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. The value must be the same as the name of the data source that you add in the code editor. | Yes | None |
| table | The name of the destination table. | Yes | None |
| column | The destination columns to which you want to write data. Separate the column names with commas, for example, `"column": ["id", "name", "age"]`. To write data to all columns, set this parameter to `["*"]`. | Yes | None |
| preSql | The SQL statement that is executed before the data synchronization task starts. In the codeless UI, you can execute only one SQL statement. In the code editor, you can execute multiple SQL statements. For example, you can use this parameter to clear previous data. | No | None |
| postSql | The SQL statement that is executed after the data synchronization task is complete. In the codeless UI, you can execute only one SQL statement. In the code editor, you can execute multiple SQL statements. For example, you can use this parameter to add a timestamp. | No | None |
| writeMode | The write mode. Only insert is supported. If a primary key or unique index conflict occurs, Data Integration treats the conflicting record as dirty data but retains the original data. | No | insert |
| batchSize | The number of records to submit in each batch. A proper value can greatly reduce the number of network interactions between Data Integration and SQL Server and improve overall throughput. An excessively large value may cause an OOM error during data synchronization. | No | 1024 |
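The following is a minimal sketch of a Writer parameter block that combines these settings. The data source and table names are hypothetical, and the preSql statement clears the destination table so that reruns of the task stay idempotent.

```json
{
    "stepType": "sqlserver",
    "parameter": {
        "datasource": "sql_server_target",        // Hypothetical data source name.
        "table": "dbo.test_table",                // Hypothetical destination table.
        "column": ["id", "name"],
        "preSql": ["delete from dbo.test_table"], // Clear old data before writing so reruns stay idempotent.
        "postSql": [],
        "writeMode": "insert",
        "batchSize": 1024                         // Records committed per batch.
    },
    "name": "Writer",
    "category": "writer"
}
```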