The OSS data source provides a bidirectional channel to read data from and write data to OSS. This topic describes the data synchronization capabilities that DataWorks provides for OSS.
Supported field types and limits
Offline read
OSS Reader reads data from OSS and converts it into the Data Integration protocol format. OSS is a storage service for unstructured data. OSS Reader supports the following features.
Supported | Unsupported |
--- | --- |
If your data is in a CSV file, ensure that the file is in standard CSV format. For example, if a column contains a double quotation mark ("), you must escape it using two double quotation marks (""). Otherwise, the file may be parsed incorrectly. If the file contains multiple separators, we recommend that you read the data as the text type.
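For example, the following record (with hypothetical values) shows how a field that contains both the delimiter and a double quotation mark is written in standard CSV: the field is enclosed in double quotation marks, and the embedded quotation mark is doubled.
1001,"He said ""hello"", then left",2024-01-01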
Because OSS is an unstructured data source that stores file-based data, you must confirm that the field structure is as expected before you synchronize data. Similarly, if the data structure in the source file changes, you must update the field structure in the task configuration. Otherwise, data corruption may occur during synchronization.
Offline write
OSS Writer converts data from the data synchronization protocol format into text files and writes the files to OSS. OSS is a storage service for unstructured data. OSS Writer supports the following features.
Supported | Unsupported |
--- | --- |
Type classification | Data Integration column configuration type |
--- | --- |
Integer | LONG |
String | STRING |
Floating-point | DOUBLE |
Boolean | BOOLEAN |
Date/Time | DATE |
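For reference, a reader column configuration that uses each of these type categories might look like the following sketch. The column indexes and the date format are illustrative, not values required by your source file.
"column":[
    { "index":0, "type":"long" },// Integer column.
    { "index":1, "type":"string" },// String column.
    { "index":2, "type":"double" },// Floating-point column.
    { "index":3, "type":"boolean" },// Boolean column.
    { "format":"yyyy-MM-dd HH:mm:ss", "index":4, "type":"date" }// Date/time column.
]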
Real-time write
You can write data to OSS in real time.
You can write data in real time from a single table to data lakes, such as Hudi (0.12.x), Paimon, and Iceberg.
Create a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
If you create an OSS data source for a different account, you must grant the required permissions to that account. For more information, see Use a bucket policy for cross-account access to OSS.
For information about how to configure the data source using a RAM role, see Configure a data source using a RAM role.
If you create an OSS data source in a different region, you must use a public endpoint to connect. For more information, see Overview of endpoints and network connectivity.
Develop a data synchronization task
For the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure an offline synchronization task for a single table
For more information, see Configure a task in the codeless UI and Configure a task in the code editor.
For information about all parameters and a script demo for the code editor, see Appendix: Script demos and parameter descriptions in this topic.
Configure a real-time synchronization task for a single table
For more information, see Configure a real-time synchronization task in Data Integration and Configure a real-time synchronization task in DataStudio.
Configure a full database synchronization task
For more information, see Offline synchronization task for a full database and Real-time synchronization task for a full database.
FAQ
Is there a limit on the number of OSS files that can be read?
How do I handle dirty data when reading a CSV file with multiple separators?
Appendix: Script demos and parameter descriptions
Configure a batch synchronization task by using the code editor
To configure a batch synchronization task in the code editor, set the related parameters in the script based on the unified script format requirements. For more information, see Configuration in the code editor. The following sections describe the data source parameters that you must configure in the script.
Reader script demo: General example
{
"type":"job",
"version":"2.0",// Version number.
"steps":[
{
"stepType":"oss",// Plug-in name.
"parameter":{
"nullFormat":"",// Defines the string that can be interpreted as null.
"compress":"",// Text compression type.
"datasource":"",// Data source.
"column":[// Fields.
{
"index":0,// Column ordinal number.
"type":"string"// Data type.
},
{
"index":1,
"type":"long"
},
{
"index":2,
"type":"double"
},
{
"index":3,
"type":"boolean"
},
{
"format":"yyyy-MM-dd HH:mm:ss", // Time format.
"index":4,
"type":"date"
}
],
"skipHeader":"",// For CSV-like files, the header might be a title and needs to be skipped.
"encoding":"",// Encoding format.
"fieldDelimiter":",",// Column delimiter.
"fileFormat": "",// Text type.
"object":[]// Object prefix.
},
"name":"Reader",
"category":"reader"
},
{
"stepType":"stream",
"parameter":{},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":""// Number of error records.
},
"speed":{
"throttle":true,// If throttle is set to false, the mbps parameter does not take effect, which means no rate limiting. If throttle is set to true, rate limiting is enabled.
"concurrent":1, // Job concurrency.
"mbps":"12"// Rate limit. Here, 1 mbps = 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Reader script demo: Read ORC or Parquet files from OSS
DataWorks reuses HDFS Reader to read ORC or Parquet files from OSS. In addition to the existing OSS Reader parameters, you must configure extended parameters, such as path (for ORC files) and fileFormat (for ORC and Parquet files).
The following example shows how to read an ORC file from OSS.
{ "stepType": "oss", "parameter": { "datasource": "", "fileFormat": "orc", "path": "/tests/case61/orc__691b6815_9260_4037_9899_****", "column": [ { "index": 0, "type": "long" }, { "index": "1", "type": "string" }, { "index": "2", "type": "string" } ] } }The following example shows how to read a Parquet file from OSS.
{ "type":"job", "version":"2.0", "steps":[ { "stepType":"oss", "parameter":{ "nullFormat":"", "compress":"", "fileFormat":"parquet", "path":"/*", "parquetSchema":"message m { optional BINARY registration_dttm (UTF8); optional Int64 id; optional BINARY first_name (UTF8); optional BINARY last_name (UTF8); optional BINARY email (UTF8); optional BINARY gender (UTF8); optional BINARY ip_address (UTF8); optional BINARY cc (UTF8); optional BINARY country (UTF8); optional BINARY birthdate (UTF8); optional DOUBLE salary; optional BINARY title (UTF8); optional BINARY comments (UTF8); }", "column":[ { "index":"0", "type":"string" }, { "index":"1", "type":"long" }, { "index":"2", "type":"string" }, { "index":"3", "type":"string" }, { "index":"4", "type":"string" }, { "index":"5", "type":"string" }, { "index":"6", "type":"string" }, { "index":"7", "type":"string" }, { "index":"8", "type":"string" }, { "index":"9", "type":"string" }, { "index":"10", "type":"double" }, { "index":"11", "type":"string" }, { "index":"12", "type":"string" } ], "skipHeader":"false", "encoding":"UTF-8", "fieldDelimiter":",", "fieldDelimiterOrigin":",", "datasource":"wpw_demotest_oss", "envType":0, "object":[ "wpw_demo/userdata1.parquet" ] }, "name":"Reader", "category":"reader" }, { "stepType":"odps", "parameter":{ "partition":"dt=${bizdate}", "truncate":true, "datasource":"0_odps_wpw_demotest", "envType":0, "column":[ "id" ], "emptyAsNull":false, "table":"wpw_0827" }, "name":"Writer", "category":"writer" } ], "setting":{ "errorLimit":{ "record":"" }, "locale":"zh_CN", "speed":{ "throttle":false, "concurrent":2 } }, "order":{ "hops":[ { "from":"Reader", "to":"Writer" } ] } }
Reader script parameters
Parameter | Description | Required | Default value |
datasource | The name of the data source. The code editor supports adding data sources. The value of this parameter must be the same as the name of the added data source. | Yes | None |
object | Specifies one or more objects to synchronize from OSS. You can configure this parameter in three ways: as an explicit path, as a wildcard path, or as a dynamic parameter path. The configuration method directly affects how concurrently the data can be extracted, and therefore the read performance. For configuration sketches, see the example after this table. | Yes | None |
parquetSchema | Configure this parameter when you read Parquet data from OSS. It takes effect only when fileFormat is set to parquet and specifies the data types stored in the Parquet file. The value uses the Parquet message schema syntax; after you specify parquetSchema, make sure that the overall configuration still complies with JSON syntax. For a complete configuration, see the Parquet read example above. | No | None |
column | The list of fields to read. `type` specifies the data type of the source data. `index` specifies the column number in the text file, starting from 0. `value` specifies a constant column: the data for that column is not read from the source file but is generated from the configured value. By default, you can read all data as the STRING type. You can also specify the field information column by column; in that case, `type` is required and you must specify either `index` or `value`. For configuration sketches, see the example after this table. | Yes | All data is read as the STRING type. |
fileFormat | The format of the source OSS file. For example, csv or text. Both formats support custom separators. | Yes | csv |
fieldDelimiter | The column delimiter used for reading data. Note OSS Reader requires a column delimiter. If you do not specify one, a comma (,) is used by default, which is also the default value on the configuration page. If the delimiter is not a visible character, enter its Unicode escape, such as \u001b or \u007c. | Yes | , |
lineDelimiter | The row delimiter for reading data. Note This parameter is effective only when fileFormat is set to text. | No | None |
compress | The compression type of the text file. By default, this parameter is not specified, which means no compression. Supported compression types are gzip, bzip2, and zip. | No | No compression |
encoding | The encoding format of the files to read. | No | utf-8 |
nullFormat | A text file has no standard string that represents a null value. Data synchronization provides nullFormat to define which string is interpreted as null. For example, if you set nullFormat to "null", the string "null" in the source data is treated as a null field. | No | None |
skipHeader | Specifies whether to skip the header of a CSV-like file, which is typically a title row. By default, the header is not skipped. skipHeader is not supported for compressed files. | No | false |
csvReaderConfig | The settings for reading CSV files, specified as a map. CSV files are read by CsvReader, which provides many configuration options; if you do not configure this parameter, the default values are used. For a configuration sketch, see the example after this table. | No | None |
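The following snippet is a minimal sketch that combines several of the reader parameters described above: an object list with an explicit path and a wildcard path, a column list that mixes index-based columns with a constant column, nullFormat, skipHeader, and csvReaderConfig. The data source name, object paths, constant value, and CsvReader option keys are illustrative assumptions, not values taken from your environment.
"parameter":{
    "datasource":"my_oss_datasource",// Hypothetical data source name.
    "object":[
        "upload/data_20240101.csv",// Explicit object path (illustrative).
        "upload/2024/*.csv"// Wildcard path (illustrative).
    ],
    "column":[
        { "index":0, "type":"long" },// Read column 0 as LONG.
        { "index":1, "type":"string" },// Read column 1 as STRING.
        { "type":"string", "value":"constant_value" }// Constant column; the value is illustrative.
    ],
    "fieldDelimiter":"\u001b",// An invisible delimiter expressed as its Unicode escape.
    "nullFormat":"null",// Interpret the string "null" as a null field.
    "skipHeader":"true",// Skip the header row of the CSV-like file.
    "csvReaderConfig":{// CsvReader options; these keys are assumptions.
        "safetySwitch":false,
        "skipEmptyRecords":false,
        "useTextQualifier":false
    }
}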
Writer script demo: General example
{
"type":"job",
"version":"2.0",
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"oss",// Plug-in name.
"parameter":{
"nullFormat":"",// Data Integration provides nullFormat to define which strings can be interpreted as null.
"dateFormat":"",// Date format.
"datasource":"",// Data source.
"writeMode":"",// Write mode.
"writeSingleObject":"false", // Specifies whether to synchronize data to a single OSS file.
"encoding":"",// Encoding format.
"fieldDelimiter":",",// Column delimiter.
"fileFormat":"",// Text type.
"object":""// Object prefix.
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// Number of error records.
},
"speed":{
"throttle":true,// If throttle is set to false, the mbps parameter does not take effect, which means no rate limiting. If throttle is set to true, rate limiting is enabled.
"concurrent":1, // Job concurrency.
"mbps":"12"// Rate limit. Here, 1 mbps = 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Writer script demo: Write ORC or Parquet files to OSS
DataWorks reuses HDFS Writer to write ORC or Parquet files to OSS. In addition to the existing OSS Writer parameters, you must configure extended parameters, such as path and fileFormat. For more information about these parameters, see HDFS Writer.
The following are examples of writing ORC or Parquet files to OSS:
The following code provides examples for reference only. You must modify the parameters based on your actual column names and data types; do not copy the code as-is.
Write to OSS in ORC file format
To write ORC files, you must use the code editor. In the code editor, set the fileFormat parameter to orc, set the path parameter to the destination file path, and configure the column parameter in the following format: {"name": "your column name", "type": "your column type"}.
The following ORC data types are supported for write operations:
Field type | Offline write to OSS (ORC format) |
--- | --- |
TINYINT | Supported |
SMALLINT | Supported |
INT | Supported |
BIGINT | Supported |
FLOAT | Supported |
DOUBLE | Supported |
TIMESTAMP | Supported |
DATE | Supported |
VARCHAR | Supported |
STRING | Supported |
CHAR | Supported |
BOOLEAN | Supported |
DECIMAL | Supported |
BINARY | Supported |
{ "stepType": "oss", "parameter": { "datasource": "", "fileFormat": "orc", "path": "/tests/case61", "fileName": "orc", "writeMode": "append", "column": [ { "name": "col1", "type": "BIGINT" }, { "name": "col2", "type": "DOUBLE" }, { "name": "col3", "type": "STRING" } ], "writeMode": "append", "fieldDelimiter": "\t", "compress": "NONE", "encoding": "UTF-8" } }Write to OSS in Parquet file format
{ "stepType": "oss", "parameter": { "datasource": "", "fileFormat": "parquet", "path": "/tests/case61", "fileName": "test", "writeMode": "append", "fieldDelimiter": "\t", "compress": "SNAPPY", "encoding": "UTF-8", "parquetSchema": "message test { required int64 int64_col;\n required binary str_col (UTF8);\nrequired group params (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired binary value (UTF8);\n}\n}\nrequired group params_arr (LIST) {\nrepeated group list {\nrequired binary element (UTF8);\n}\n}\nrequired group params_struct {\nrequired int64 id;\n required binary name (UTF8);\n }\nrequired group params_arr_complex (LIST) {\nrepeated group list {\nrequired group element {\n required int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_complex (MAP) {\nrepeated group key_value {\nrequired binary key (UTF8);\nrequired group value {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}\nrequired group params_struct_complex {\nrequired int64 id;\n required group detail {\nrequired int64 id;\n required binary name (UTF8);\n}\n}\n}", "dataxParquetMode": "fields" } }
Writer script parameters
Parameter | Description | Required | Default value |
datasource | The name of the data source. The code editor supports adding data sources. The value of this parameter must be the same as the name of the added data source. | Yes | None |
object | The name of the object that OSS Writer writes. OSS uses object names to simulate directories, and the name must comply with the OSS object naming rules. By default, OSS Writer appends a random UUID suffix to the object name.
If you do not want a random UUID suffix, you can set writeSingleObject to true (see the writeSingleObject parameter below). | Yes | None |
ossBlockSize | The OSS block size, in MB. The default block size is 16 MB. When files are written in Parquet or ORC format, you can add this parameter at the same level as the object parameter. Because OSS multipart upload supports at most 10,000 parts, the default limit for a single file is about 160 GB (16 MB × 10,000 parts). To upload larger files, increase the block size. | No | 16 (MB) |
writeMode | The action that OSS Writer performs on existing data before it writes:
truncate: clears all objects whose names start with the configured object prefix before writing.
append: writes data directly without processing existing objects; new objects are named with the object prefix plus a random UUID suffix.
nonConflict: reports an error if objects whose names start with the configured object prefix already exist.
| Yes | None |
writeSingleObject | Specifies whether to write data to a single file in OSS:
true: writes a single file.
false: writes multiple files, each named with the object prefix plus a random UUID suffix.
Note When writing data in ORC or Parquet format, the writeSingleObject parameter does not take effect. This means you cannot use this parameter to write to a single ORC or Parquet file in a multi-concurrency scenario. To write to a single file, set the concurrency to 1, but a random suffix is still added to the file name, and a concurrency of 1 slows down the synchronization task. | No | false |
fileFormat | The format of the file to be written. The following formats are supported:
text
csv
orc (can be configured only in the code editor)
parquet (can be configured only in the code editor)
| No | text |
compress | The compression format of the data file written to OSS. This parameter can be configured only in the code editor. Note Compression is not supported for csv and text files. Parquet and ORC files support compression formats such as gzip and snappy. | No | None |
fieldDelimiter | The column delimiter for writing data. | No | , |
encoding | The encoding format of the file to be written. | No | utf-8 |
parquetSchema | Required when you write Parquet files to OSS. This parameter describes the structure of the object file and takes effect only when fileFormat is set to parquet. The value uses the Parquet message schema syntax: a message block in which each field is declared as repetition type fieldName;, where repetition is required or optional, type is a Parquet data type, and fieldName is the column name.
Note Each field declaration must end with a semicolon, including the last one. For a complete example, see the Parquet write example above. | No | None |
nullFormat | A text file has no standard string that represents a null value. The data synchronization system provides nullFormat to define which string represents null. For example, if you set nullFormat to "null", null values in the source data are written to the file as the string "null". | No | None |
header | The header written at the beginning of the file in OSS. For example, ["id", "name", "age"]. | No | None |
maxFileSize (Advanced configuration. Not supported in the codeless UI) | The maximum size of a single object file written to OSS, in MB. The default value is 100,000 MB (10,000 parts × 10 MB, about 100 GB). This is similar to controlling the size of a rotated log file in log4j. OSS multipart upload uses 10 MB parts, which is also the minimum rotation granularity: a maxFileSize smaller than 10 MB is treated as 10 MB. Each OSS InitiateMultipartUploadRequest supports at most 10,000 parts. When rotation occurs, suffixes such as _1, _2, and _3 are appended to the object name, which already consists of the object prefix and a random UUID. | No | 100,000 (MB) |
suffix (Advanced configuration. Not supported in the codeless UI) | The suffix of the file name generated when data synchronization writes data. For example, if you set suffix to .csv, the final file name is fileName****.csv. | No | None |
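The following snippet is a minimal sketch that combines several of the writer parameters described above for a CSV write. The data source name, object prefix, header columns, and maxFileSize value are illustrative assumptions, not values taken from your environment.
"parameter":{
    "datasource":"my_oss_datasource",// Hypothetical data source name.
    "object":"upload/result/out",// Object prefix; a random UUID suffix is appended by default.
    "writeSingleObject":"false",// Write multiple files named with the prefix plus a UUID suffix.
    "writeMode":"append",// Write without processing existing objects, as in the demos above.
    "fileFormat":"csv",
    "fieldDelimiter":",",
    "encoding":"utf-8",
    "nullFormat":"null",// Represent null values as the string "null".
    "header":["id", "name", "age"],// Illustrative header columns.
    "suffix":".csv",// Generated files end with .csv.
    "maxFileSize":"10000"// Rotate files at about 10,000 MB (illustrative).
}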
Appendix: Conversion policy for Parquet data types
If you do not configure the parquetSchema parameter, DataWorks converts the data types based on the source field types. The following table describes the conversion policy.
Source data type | Parquet physical type | Parquet logical type |
--- | --- | --- |
CHAR / VARCHAR / STRING | BINARY | UTF8 |
BOOLEAN | BOOLEAN | Not applicable |
BINARY / VARBINARY | BINARY | Not applicable |
DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
TINYINT | INT32 | INT_8 |
SMALLINT | INT32 | INT_16 |
INT / INTEGER | INT32 | Not applicable |
BIGINT | INT64 | Not applicable |
FLOAT | FLOAT | Not applicable |
DOUBLE | DOUBLE | Not applicable |
DATE | INT32 | DATE |
TIME | INT32 | TIME_MILLIS |
TIMESTAMP / DATETIME | INT96 | Not applicable |
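As an illustration of this conversion policy, a source table with STRING, BIGINT, DOUBLE, and DATE columns corresponds to a Parquet schema roughly like the following. This is a hypothetical sketch for comparison only; when parquetSchema is not configured, the conversion is applied automatically and you do not write such a schema yourself. The field names and the optional repetition are assumptions.
message example {
    optional binary name (UTF8);
    optional int64 id;
    optional double salary;
    optional int32 birthday (DATE);
}
Here, the STRING column maps to BINARY with the UTF8 logical type, the BIGINT column to INT64, the DOUBLE column to DOUBLE, and the DATE column to INT32 with the DATE logical type, matching the rows in the table above.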