You can create a RestAPI data source to read JSON data from a RESTful API and write it to another data source, such as MaxCompute, by using a data synchronization task. A RestAPI data source can also be used as a destination to receive data from other data sources. This topic describes the data synchronization capabilities of the RestAPI data source in DataWorks.
Limits
Currently, this data source supports only Serverless resource groups and exclusive resource groups for Data Integration.
You cannot set a timeout parameter. The built-in request timeout period in DataWorks is 60 s. If an API query takes longer than 60 s to respond, the task fails.
Supported field types
When data is synchronized to a destination, only a single-layer table schema is supported. Nested field structures are not supported. For example, if an API returns the structure `{data: {user: { id: 1, name:'lily'}, value: 123}}`, the fields must be processed as parallel fields such as `user_id`, `user_name`, and `value` at the destination.
Type classification | Data Integration column configuration type |
Integer | LONG, INT |
String | STRING |
Floating-point | DOUBLE, FLOAT |
Boolean | BOOLEAN |
Date and time | DATE |
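The single-layer requirement described above can be illustrated with a small Python sketch. This is not part of DataWorks: the helper name `flatten` and the underscore separator are assumptions chosen only to reproduce the `user_id`, `user_name`, and `value` field names from the example.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single-layer dict with underscore-joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects and carry the parent key as a prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# The structure from the example above; the object under "data" is flattened.
response = {"data": {"user": {"id": 1, "name": "lily"}, "value": 123}}
row = flatten(response["data"])
print(row)
```

Each key of the resulting dictionary corresponds to one parallel field that the destination table would need.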
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Develop a data synchronization task
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configuration guide for a single-table offline sync task
For instructions, see Codeless UI configuration or Code editor configuration.
For a list of all parameters for the code editor configuration and sample code, see Appendix: Script demo and parameter description.
Examples
FAQ
Is specifying the number of pages the only way to page through data requests?
Answer: Yes.
Is automatic paging supported? For example, can paging stop when no more data is returned for the request parameters?
Answer: No. If paging stopped automatically, the data could not be split.
If I must specify the number of pages, but the specified number is greater than the actual number of pages, what happens when the subsequent pages are empty?
Answer: If a subsequent page is empty, it is treated as an empty result from a SQL query. The system proceeds to the next query.
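The behavior in the answer above can be pictured with a short Python simulation. It is only a sketch: `fetch_page` is a hypothetical stand-in for a paged API that holds three pages of data, and `read_pages` is an assumed name, not a DataWorks function.

```python
def fetch_page(page_number):
    """Hypothetical stand-in for a paged API that has only three pages of data."""
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: [{"id": 4}]}
    return pages.get(page_number, [])


def read_pages(first_page, last_page):
    """Request every configured page; an empty page contributes no records."""
    records = []
    for page_number in range(first_page, last_page + 1):
        # An empty list adds nothing, and the loop moves on to the next page.
        records.extend(fetch_page(page_number))
    return records


# Five pages are configured, but only the first three contain data.
records = read_pages(1, 5)
print(len(records))
```

Pages 4 and 5 are still requested in this sketch; they simply add no records to the result.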
Is only single-layer JSON parsing supported?
Answer: Yes, it is. Deep parsing is not performed.
How do I configure a non-array type for a RestAPI in DataWorks Data Integration?
Answer: In the `reader` section, within the `parameter` section, set the `dataPath` parameter to the path of the non-array data. For example, `dataPath: "data.list"`. This allows the plugin to locate the data field that you want to read. Then, set the `dataMode` parameter to `multiData`. This setting instructs DataWorks to process the data as multiple separate records, even if the data is not in an array format in the source data.
Note: In `multiData` mode, the `column` configuration is not applicable. You must specify the data path to read directly in the `dataPath` parameter.
The following code shows a configuration example for a non-array type for a RestAPI in DataWorks Data Integration:
reader: {
    name: "restapi",
    parameter: {
        dataPath: "data.list",
        dataMode: "multiData",
        // Other parameters
    }
}
Appendix: Script demo and parameter description
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script demo
The following code provides a script example:
{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"restapi",
            "parameter":{
                "url":"http://127.0.0.1:5000/get_array5",
                "dataMode":"oneData",
                "responseType":"json",
                "column":[
                    {
                        "type":"long",
                        "name":"a.b" // Find data from the a.b path.
                    },
                    {
                        "type":"string", // Find data from the a.c path.
                        "name":"a.c"
                    }
                ],
                "dirtyData":"null",
                "method":"get",
                "socketTimeout":"60000",
                "defaultHeader":{
                    "X-Custom-Header":"test header"
                },
                "customHeader":{
                    "X-Custom-Header2":"test header2"
                },
                "parameters":"abc=1&def=1"
            },
            "name":"restapireader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""
        },
        "speed":{
            "throttle":true, // If throttle is set to false, the mbps parameter does not take effect and the data rate is not limited. If throttle is set to true, the data rate is limited.
            "concurrent":1, // The concurrency of the job.
            "mbps":"12" // The maximum data rate. 1 mbps is equal to 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
After the RestAPI plugin sends an HTTP or HTTPS request, it receives a response body in JSON format. The dataPath parameter specifies the JSONPath used to extract data from the body. The following two examples show how to configure the parameters:
Example 1: The API returns the following body. The business data is in the DATA field, and the API returns multiple rows of data at a time. DATA is an array.
{
    "HEADER": {
        "BUSID": "bid1",
        "RECID": "uuid",
        "SENDER": "dc",
        "RECEIVER": "pre",
        "DTSEND": "202201250000"
    },
    "DATA": [
        {
            "SERNR": "sernr1"
        },
        {
            "SERNR": "sernr2"
        }
    ]
}
If you want to extract multiple rows of data from DATA as multiple synchronization records, set column to "column": [ "SERNR" ], dataMode to "dataMode": "multiData", and dataPath to "dataPath": "DATA".
Example 2: The API returns the following body. The business data is in the content.DATA field, and the API returns one row of data at a time. DATA is an object.
{
    "HEADER": {
        "BUSID": "bid1",
        "RECID": "uuid",
        "SENDER": "dc",
        "RECEIVER": "pre",
        "DTSEND": "202201250000"
    },
    "content": {
        "DATA": {
            "SERNR": "sernr2"
        }
    }
}
If you want to extract one row of data from content.DATA as one synchronization record, set column to "column": [ "SERNR" ], dataMode to "dataMode": "oneData", and dataPath to "dataPath": "content.DATA".
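The two examples can be traced with a short Python sketch. It only approximates how dataPath, dataMode, and column interact. The function names `lookup` and `extract_records` are assumptions made for illustration, paths are treated as simple dot-separated keys, and the HEADER objects are abbreviated. The plugin's actual implementation may differ.

```python
def lookup(obj, path):
    """Follow a dot-separated path such as "content.DATA" through nested dicts."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def extract_records(body, data_path, data_mode, columns):
    """Return a list of rows, each row holding the values of the requested columns."""
    data = lookup(body, data_path) if data_path else body
    # multiData: the located value is a list of rows; oneData: it is a single row.
    items = data if data_mode == "multiData" else [data]
    return [[lookup(item, column) for column in columns] for item in items]


# Example 1: DATA is an array, so each element becomes one record.
body1 = {"HEADER": {"BUSID": "bid1"}, "DATA": [{"SERNR": "sernr1"}, {"SERNR": "sernr2"}]}
# Example 2: content.DATA is an object, so it becomes a single record.
body2 = {"HEADER": {"BUSID": "bid1"}, "content": {"DATA": {"SERNR": "sernr2"}}}

rows1 = extract_records(body1, "DATA", "multiData", ["SERNR"])
rows2 = extract_records(body2, "content.DATA", "oneData", ["SERNR"])
print(rows1)
print(rows2)
```

In this sketch, Example 1 yields two records and Example 2 yields one, which matches the behavior described for the two dataMode values.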
Reader script parameters
The following parameters are used when you add a data source and configure a Data Integration node.
The plugin does not support scheduling parameters.
Parameter | Description | Required | Default value |
url | The address of the RESTful API. | Yes | None |
dataMode | The format of the JSON data returned for a RESTful request. Valid values: oneData (one record per response) and multiData (multiple records per response). | Yes | None |
responseType | The format of the returned data. Only JSON is supported. | Yes | JSON |
column | The list of fields to read. The type parameter specifies the type of the source data, and the name parameter specifies the JSON path from which to retrieve the data for the current column. You can specify the column fields. Example: "column":[{"type":"long","name":"a.b" // Find data from the a.b path.}, {"type":"string","name":"a.c" // Find data from the a.c path.}] You must specify the type and name parameters for each column. | Yes | None |
dataPath | The path used to query a single JSON object or a JSON array from the returned result. | No | None |
method | The request method. Valid values: get and post. | Yes | None |
socketTimeout | The socket timeout period for accessing data from the RESTful API. Unit: milliseconds. | No | 60000 |
customHeader | The header information passed to the RESTful API. | No | None |
parameters | The parameter information passed to the RESTful API. Example: abc=1&def=1. | No | None |
dirtyData | The method for handling a column whose data cannot be found in the specified JSON path. | Yes | dirty |
requestTimes | The number of times to request data from the RESTful address. Valid values: single and multiple. | Yes | single |
requestParam | If you set requestTimes to multiple, you must specify the parameter for the loop, such as pageNumber. The plugin passes the pageNumber parameter to the RESTful API in a loop based on the specified startIndex, endIndex, and step parameters to send multiple requests. | No | None |
startIndex | The start index for the loop request. The start index is included in the loop. | No | None |
endIndex | The end index for the loop request. The end index is included in the loop. | No | None |
step | The step size for the loop request. | No | None |
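authType | The authentication method. The credentials for Basic Auth, Token Auth, and Aliyun API signature authentication are set by the parameters that follow. | No | None |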
authUsername/authPassword | The username and password for Basic Auth. | No | None |
authToken | The token for Token Auth. | No | None |
accessKey/accessSecret | The account information for Aliyun API signature authentication. | No | None |
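The loop described for requestParam, startIndex, endIndex, and step can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `build_request_urls` is hypothetical, and the sketch assumes a get request in which the loop parameter is appended to the query string together with the value of parameters. How the plugin actually assembles its requests is not stated in this topic.

```python
from urllib.parse import parse_qsl, urlencode


def build_request_urls(url, parameters, request_param, start_index, end_index, step):
    """Build one URL per loop value; both start_index and end_index are included."""
    base = parse_qsl(parameters)  # e.g. "abc=1&def=1" -> [("abc", "1"), ("def", "1")]
    urls = []
    for value in range(start_index, end_index + 1, step):
        query = urlencode(base + [(request_param, value)])
        urls.append(f"{url}?{query}")
    return urls


# pageNumber takes the values 1, 3, and 5 when startIndex=1, endIndex=5, step=2.
urls = build_request_urls(
    "http://127.0.0.1:5000/get_array5", "abc=1&def=1", "pageNumber", 1, 5, 2
)
for request_url in urls:
    print(request_url)
```

Because the end index is inclusive, the last request in this sketch carries pageNumber=5.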
Writer script demo
{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"stream",
            "parameter":{
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"restapi",
            "parameter":{
                "url":"http://127.0.0.1:5000/writer1",
                "dataMode":"oneData",
                "responseType":"json",
                "column":[
                    {
                        "type":"long", // Place column data in the a.b path.
                        "name":"a.b"
                    },
                    {
                        "type":"string", // Place column data in the a.c path.
                        "name":"a.c"
                    }
                ],
                "method":"post",
                "defaultHeader":{
                    "X-Custom-Header":"test header"
                },
                "customHeader":{
                    "X-Custom-Header2":"test header2"
                },
                "parameters":"abc=1&def=1",
                "batchSize":256
            },
            "name":"restapiwriter",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0" // The number of error records.
        },
        "speed":{
            "throttle":true, // If throttle is set to false, the mbps parameter does not take effect and the data rate is not limited. If throttle is set to true, the data rate is limited.
            "concurrent":1, // The concurrency of the job.
            "mbps":"12" // The maximum data rate. 1 mbps is equal to 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
Writer script parameters
Parameter | Description | Required | Default value |
url | The address of the RESTful API. | Yes | None |
dataMode | The format of the JSON data passed in a RESTful request. | Yes | None |
column | The list of field paths corresponding to the generated JSON data. The type parameter specifies the type of the source data, and the name parameter specifies the JSON path where the data for the current column is placed. You can specify the column fields. Example: "column":[{"type":"long","name":"a.b" // Place column data in the a.b path.}, {"type":"string","name":"a.c" // Place column data in the a.c path.}] Note You must specify the type and name parameters for each column. | Yes | None |
dataPath | The path of the JSON object where the data result is placed. | No | None |
method | The request method. Valid values: post and put. | Yes | None |
customHeader | The header information passed to the RESTful API. | No | None |
authType | The authentication method. | No | None |
authUsername/authPassword | The username and password for Basic Auth. | No | None |
authToken | The token for Token Auth. | No | None |
accessKey/accessSecret | The account information for Aliyun API signature authentication. | No | None |
batchSize | The maximum number of records in a single request when dataMode is set to multiData. | Yes | 512 |
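The way column paths and batchSize shape the outgoing JSON can be pictured with the Python sketch below. It is a rough illustration only: the function names `build_record` and `build_batches` are hypothetical, and the sketch assumes that each request body in multiData mode is a JSON array of records. This topic does not specify the exact body layout.

```python
import json


def build_record(columns, row):
    """Place each value of a row at the dotted path named by its column."""
    record = {}
    for column, value in zip(columns, row):
        *parents, leaf = column["name"].split(".")
        node = record
        for key in parents:
            # Create intermediate objects on demand, e.g. "a" for the path "a.b".
            node = node.setdefault(key, {})
        node[leaf] = value
    return record


def build_batches(columns, rows, batch_size):
    """Group records into request bodies of at most batch_size records each."""
    records = [build_record(columns, row) for row in rows]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


columns = [{"type": "long", "name": "a.b"}, {"type": "string", "name": "a.c"}]
rows = [[1, "x"], [2, "y"], [3, "z"]]
batches = build_batches(columns, rows, batch_size=2)
print(json.dumps(batches[0]))
```

With three rows and a batch size of 2, the sketch produces two request bodies: the first holds two records and the second holds the remaining one.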