DataWorks API data source

You can create a RestAPI data source to read JSON data from a RESTful API and write it to another data source, such as MaxCompute, by using a data synchronization task. A RestAPI data source can also serve as a destination that receives data from other data sources. This topic describes the data synchronization capabilities of the RestAPI data source in DataWorks.

Limits

Currently, this data source supports only Serverless resource groups and exclusive resource groups for Data Integration.
You cannot set a timeout parameter. The built-in request timeout period in DataWorks is 60 s. If an API query takes longer than 60 s to respond, the task fails.

Supported field types

Important

When data is synchronized to a destination, only a single-layer table schema is supported. Nested field structures are not supported. For example, if an API returns the structure `{data: {user: { id: 1, name:'lily'}, value: 123}}`, the fields must be processed as parallel fields such as `user_id`, `user_name`, and `value` at the destination.
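The single-layer requirement can be illustrated with a minimal Python sketch. It is not part of DataWorks. The `flatten` helper and the underscore separator are assumptions chosen only to reproduce the `user_id`, `user_name`, and `value` field names from the example above.

```python
def flatten(obj, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and prefix their keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Response body taken from the example above.
response = {"data": {"user": {"id": 1, "name": "lily"}, "value": 123}}
row = flatten(response["data"])
print(row)  # {'user_id': 1, 'user_name': 'lily', 'value': 123}
```

Each key of the flattened dictionary corresponds to one parallel column at the destination.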

Type classification	Data Integration column configuration type
Integer	LONG, INT
String	STRING
Floating-point	DOUBLE, FLOAT
Boolean	BOOLEAN
Date and time	DATE

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Configuration guide for a single-table offline sync task

For instructions, see Codeless UI configuration or Code editor configuration.
For a list of all parameters for the code editor configuration and sample code, see Appendix: Script demo and parameter description.

Examples

FAQ

Do I have to specify the number of pages for data requests?
- Answer: Yes, you do.
Is automatic paging supported? For example, can paging stop when a request returns no more data?
- Answer: No, it is not. The page range must be fixed in advance. Otherwise, the data cannot be split.
If I must specify the number of pages, but the specified number is greater than the actual number of pages, what happens when the subsequent pages are empty?
- Answer: If a subsequent page is empty, it is treated as an empty result from a SQL query. The system proceeds to the next query.
Is only single-layer JSON parsing supported?
- Answer: Yes, it is. Deep parsing is not performed.
How do I configure a non-array type for a RestAPI in DataWorks Data Integration?
- Answer: In the reader section, within the parameter section, set the dataPath parameter to the path of the non-array data. For example, dataPath:"data.list". This allows the plugin to locate the data field that you want to read. Then, set the dataMode parameter to multiData. This setting instructs DataWorks to process the data as multiple separate records, even if the data is not in an array format in the source data.
  Note
  In multiData mode, the column configuration does not apply. Specify the path of the data to read directly in the dataPath parameter.
  The following code shows a configuration example for a non-array type for a RestAPI in DataWorks Data Integration:
```
"reader": {
  "name": "restapi",
  "parameter": {
    "dataPath": "data.list",
    "dataMode": "multiData"
    // Other parameters
  }
}
```

Appendix: Script demo and parameter description

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script demo

The following code provides a script example:

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"restapi",
            "parameter":{
                "url":"http://127.0.0.1:5000/get_array5",
                "dataMode":"oneData",
                "responseType":"json",
                "column":[
                    {
                        "type":"long",
                        "name":"a.b"  // Find data from the a.b path.
                    },
                    {
                        "type":"string",
                        "name":"a.c"  // Find data from the a.c path.
                    }
                ],
                "dirtyData":"null",
                "method":"get",
                "socketTimeout":"60000",
                "defaultHeader":{
                    "X-Custom-Header":"test header"
                },
                "customHeader":{
                    "X-Custom-Header2":"test header2"
                },
                "parameters":"abc=1&def=1"
            },
            "name":"restapireader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{

            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":""
        },
        "speed":{
            "throttle":true,  // If throttle is set to false, the mbps parameter does not take effect and the data rate is not limited. If throttle is set to true, the data rate is limited.
            "concurrent":1,  // The concurrency of the job. 
            "mbps":"12"// The maximum data rate. 1 mbps is equal to 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
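The way the column names and the dirtyData setting in the script interact can be sketched in Python. This is a simplified illustration under assumptions, not the plugin's implementation. The `lookup` and `read_record` helpers and the `DirtyRecord` exception are hypothetical names, and type conversion based on the `type` field is left out.

```python
class DirtyRecord(Exception):
    """Raised when a column path is missing and dirtyData is 'dirty'."""


_MISSING = object()  # sentinel that distinguishes "not found" from a JSON null


def lookup(body, path):
    """Walk a dotted path such as 'a.b' through nested dicts."""
    current = body
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def read_record(body, columns, dirty_data="dirty"):
    """Return one value per configured column, applying the dirtyData policy."""
    values = []
    for column in columns:
        value = lookup(body, column["name"])
        if value is _MISSING:
            if dirty_data == "null":
                value = None  # missing path becomes a null column value
            else:
                raise DirtyRecord(f"path not found: {column['name']}")
        values.append(value)
    return values


columns = [{"type": "long", "name": "a.b"}, {"type": "string", "name": "a.c"}]
print(read_record({"a": {"b": 1, "c": "x"}}, columns, "null"))  # [1, 'x']
print(read_record({"a": {"b": 1}}, columns, "null"))            # [1, None]
```

With `dirty_data="dirty"`, the second call raises `DirtyRecord` instead of returning a null value, which mirrors the record being marked as dirty data.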

After the RestAPI plugin sends an HTTP or HTTPS request, it receives a response body in JSON format. The dataPath parameter specifies the JSON path used to extract data from the body. The following two examples show how to configure the parameters in the code editor:


Example 1: The API returns the following body. The business data is in the DATA field, and the API returns multiple rows of data at a time. DATA is an array.
{
    "HEADER": {
        "BUSID": "bid1",
        "RECID": "uuid",
        "SENDER": "dc",
        "RECEIVER": "pre",
        "DTSEND": "202201250000"
    },
    "DATA": [
        {
            "SERNR": "sernr1"
        },
        {
            "SERNR": "sernr2"
        }
    ]
}

If you want to extract multiple rows of data from DATA as multiple synchronization records, set column to "column": [ "SERNR" ], dataMode to "dataMode": "multiData", and dataPath to "dataPath": "DATA".


Example 2: The API returns the following body. The business data is in the content.DATA field, and the API returns one row of data at a time. DATA is an object.
{
    "HEADER": {
        "BUSID": "bid1",
        "RECID": "uuid",
        "SENDER": "dc",
        "RECEIVER": "pre",
        "DTSEND": "202201250000"
    },
    "content": {
        "DATA": {
            "SERNR": "sernr2"
        }
    }
}

If you want to extract one row of data from content.DATA as one synchronization record, set column to "column": [ "SERNR" ], dataMode to "dataMode": "oneData", and dataPath to "dataPath": "content.DATA".
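The two examples can be summarized in a short Python sketch of how dataPath and dataMode select records. The `extract_records` helper is hypothetical and only mirrors the behavior described above. The plugin's actual parsing logic may differ.

```python
def extract_records(body, data_path, data_mode):
    """Follow the dotted dataPath, then return a list of records."""
    node = body
    for key in data_path.split("."):
        node = node[key]
    if data_mode == "multiData" and isinstance(node, list):
        return list(node)  # one synchronization record per array element
    return [node]          # the object itself is a single record


example_1 = {
    "HEADER": {"BUSID": "bid1"},
    "DATA": [{"SERNR": "sernr1"}, {"SERNR": "sernr2"}],
}
example_2 = {
    "HEADER": {"BUSID": "bid1"},
    "content": {"DATA": {"SERNR": "sernr2"}},
}

print(extract_records(example_1, "DATA", "multiData"))
# [{'SERNR': 'sernr1'}, {'SERNR': 'sernr2'}]
print(extract_records(example_2, "content.DATA", "oneData"))
# [{'SERNR': 'sernr2'}]
```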

Reader script parameters

Note

The following parameters are used when you add a data source and configure a Data Integration node.

The plugin does not support scheduling parameters.

Parameter	Description	Required	Default value
url	The address of the RESTful API.	Yes	None
dataMode	The format of the JSON data returned for a RESTful request. oneData: retrieves one piece of data from the returned JSON data. multiData: retrieves a JSON array from the returned JSON data and passes multiple pieces of data to the writer.	Yes	None
responseType	The format of the returned data. Only JSON is supported.	Yes	JSON
column	The list of fields to read. The type parameter specifies the type of the source data, and the name parameter specifies the JSON path from which to retrieve the data for the current column. Example: "column":[{"type":"long","name":"a.b"}, {"type":"string","name":"a.c"}], which reads data from the a.b and a.c paths. You must specify the type and name parameters for each column.	Yes	None
dataPath	The path used to query a single JSON object or a JSON array from the returned result.	No	None
method	The request method. Valid values: get and post.	Yes	None
socketTimeout	The socket timeout period for accessing data from the RESTful API. Unit: milliseconds.	No	60000
customHeader	The header information passed to the RESTful API.	No	None
parameters	The parameter information passed to the RESTful API. For a GET request, enter parameters in the `abc=1&def=1` format. For a POST request, enter parameters in JSON format.	No	None
dirtyData	The method for handling a situation where data cannot be found in the specified JSON path of a column. dirty: If a column cannot be found when a record is parsed, the record is marked as dirty data. null: If a column cannot be found when a record is parsed, the value of the column is set to null.	Yes	dirty
requestTimes	The number of times to request data from the RESTful address. single: sends only one request. multiple: sends multiple requests.	Yes	single
requestParam	If you set requestTimes to multiple, you must specify the parameter for the loop, such as pageNumber. The plugin passes the pageNumber parameter to the RESTful API in a loop based on the specified startIndex, endIndex, and step parameters to send multiple requests.	No	None
startIndex	The start index for the loop request. The start index is included in the loop.	No	None
endIndex	The end index for the loop request. The end index is included in the loop.	No	None
step	The step size for the loop request.	No	None
authType	The authentication method. Valid values: Basic Authentication and Token Authentication. Basic Authentication: If the data source supports username- and password-based authentication, select Basic Authentication and configure the username and password used for authentication. During data integration, the username and password are transferred to the RESTful API URL for authentication. The data source is connected only after authentication succeeds. Token Authentication: If the data source supports token-based authentication, select Token Authentication and configure a fixed token value used for authentication. During data integration, the token is contained in the request header, such as {"Authorization":"Bearer TokenXXXXXX"}, and transferred to the RESTful API URL for authentication. The data source is connected only after authentication succeeds. Note: To use a custom authentication method, select Token Authentication and configure a fixed token value in the `Token` field. The token value can be used for authentication after it is encrypted.	No	None
authUsername/authPassword	The username and password for Basic Auth.	No	None
authToken	The token for Token Auth.	No	None
accessKey/accessSecret	The account information for Aliyun API signature authentication.	No	None
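The loop behavior of requestTimes, requestParam, startIndex, endIndex, and step can be sketched as follows. This is a rough illustration, assuming that the loop value is simply appended to the GET parameters. How the plugin actually merges the loop parameter into the request is not stated in this topic, and `loop_requests` is a hypothetical helper.

```python
from urllib.parse import urlencode


def loop_requests(base_params, request_param, start_index, end_index, step):
    """Build one query string per loop value; both indexes are inclusive."""
    queries = []
    for value in range(start_index, end_index + 1, step):
        params = dict(base_params)      # copy so the base parameters stay unchanged
        params[request_param] = value   # for example, pageNumber=1, 2, 3
        queries.append(urlencode(params))
    return queries


for query in loop_requests({"abc": 1, "def": 1}, "pageNumber", 1, 3, 1):
    print(query)
# abc=1&def=1&pageNumber=1
# abc=1&def=1&pageNumber=2
# abc=1&def=1&pageNumber=3
```

Because the range is fixed in advance, a page that returns no data simply contributes no records, which matches the FAQ answer about empty pages.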

Writer script demo

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"stream",
            "parameter":{

            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"restapi",
            "parameter":{
                "url":"http://127.0.0.1:5000/writer1",
                "dataMode":"oneData",
                "responseType":"json",
                "column":[
                    {
                        "type":"long", // Place column data in the a.b path.
                        "name":"a.b"
                    },
                    {
                        "type":"string", // Place column data in the a.c path.
                        "name":"a.c"
                    }
                ],
                "method":"post",
                "defaultHeader":{
                    "X-Custom-Header":"test header"
                },
                "customHeader":{
                    "X-Custom-Header2":"test header2"
                },
                "parameters":"abc=1&def=1",
                "batchSize":256
            },
            "name":"restapiwriter",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0" // The number of error records.
        },
        "speed":{
            "throttle":true,// If throttle is set to false, the mbps parameter does not take effect and the data rate is not limited. If throttle is set to true, the data rate is limited.
            "concurrent":1, // The concurrency of the job.
            "mbps":"12"// The maximum data rate. 1 mbps is equal to 1 MB/s.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
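On the writer side, each column name is a JSON path at which the column value is placed. The following Python sketch shows one way such a request body could be assembled from a record. The `build_body` helper is an assumption for illustration, and the body format that the plugin actually sends may differ.

```python
import json


def build_body(values, columns):
    """Place each column value at the dotted path given by the column name."""
    body = {}
    for value, column in zip(values, columns):
        keys = column["name"].split(".")
        node = body
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create intermediate objects as needed
        node[keys[-1]] = value
    return body


columns = [{"type": "long", "name": "a.b"}, {"type": "string", "name": "a.c"}]
print(json.dumps(build_body([1, "x"], columns)))  # {"a": {"b": 1, "c": "x"}}
```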

Writer script parameters

Parameter	Description	Required	Default value
url	The address of the RESTful API.	Yes	None
dataMode	The format of the JSON data passed in a RESTful request. oneData: passes one record per request. The number of requests is the same as the number of records. multiData: passes a batch of records per request. The number of requests is determined by the number of tasks split on the reader side.	Yes	None
column	The list of field paths corresponding to the generated JSON data. The type parameter specifies the type of the source data, and the name parameter specifies the JSON path where the data for the current column is placed. Example: "column":[{"type":"long","name":"a.b"}, {"type":"string","name":"a.c"}], which places column data in the a.b and a.c paths. Note: You must specify the type and name parameters for each column.	Yes	None
dataPath	The path of the JSON object where the data result is placed.	No	None
method	The request method. Valid values: post and put.	Yes	None
customHeader	The header information passed to the RESTful API.	No	None
authType	The authentication method. Valid values: Basic Authentication and Token Authentication. Basic Authentication: If the data source supports username- and password-based authentication, select Basic Authentication and configure the username and password used for authentication. During data integration, the username and password are transferred to the RESTful API URL for authentication. The data source is connected only after authentication succeeds. Token Authentication: If the data source supports token-based authentication, select Token Authentication and configure a fixed token value used for authentication. During data integration, the token is contained in the request header, such as {"Authorization":"Bearer TokenXXXXXX"}, and transferred to the RESTful API URL for authentication. The data source is connected only after authentication succeeds. Note: To use a custom authentication method, select Token Authentication and configure a fixed token value in the `Token` field. The token value can be used for authentication after it is encrypted.	No	None
authUsername/authPassword	The username and password for Basic Auth.	No	None
authToken	The token for Token Auth.	No	None
accessKey/accessSecret	The account information for Aliyun API signature authentication.	No	None
batchSize	The maximum number of records in a single request when dataMode is set to multiData.	Yes	512
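The effect of dataMode and batchSize on the number of write requests can be approximated with a small Python sketch. It deliberately ignores the reader-side task split mentioned in the dataMode description, so the counts are only indicative, and `plan_requests` is a hypothetical helper.

```python
def plan_requests(records, data_mode, batch_size=512):
    """Group records into the payloads of individual write requests."""
    if data_mode == "oneData":
        return [[record] for record in records]  # one request per record
    # multiData: at most batch_size records per request
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


records = list(range(600))
print(len(plan_requests(records, "oneData")))                            # 600
print([len(batch) for batch in plan_requests(records, "multiData", 256)])  # [256, 256, 88]
```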