DataWorks is Alibaba Cloud's all-in-one big data development and governance platform. It manages your entire data lifecycle from ingestion and processing to governance and service delivery. Through highly integrated modules, DataWorks streamlines and visualizes complex data engineering workflows, significantly lowering the barrier to data development. This guide introduces the core modules of DataWorks and explains their primary purposes, key capabilities, and applicable scenarios.
Workflow
Set up resources: Configure your environment in Management Center. Define data source connections, allocate resource groups, bind compute engines such as MaxCompute or Hologres, and manage member permissions.
Ingest and integrate data: Use Data Integration to ingest data from source business systems into your big data platform. Data Integration supports batch (offline), real-time (streaming), full, and incremental synchronization.
Design data models: Before large-scale development begins, design standardized models to ensure an organized and maintainable data architecture. This stage covers Data Warehouse Planning, Data Standard definition, Dimensional Modeling (designing dimension and fact tables), and core business Data Metric definition.
Process and transform data:
Write code such as SQL or Python in the Data Studio WebIDE or Notebook. Use workflow orchestration to organize independent task nodes into a directed acyclic graph (DAG). A minimal example of a single SQL task node appears at the end of this section.
Configure scheduling policies, then publish the workflow to Operation Center. Operation Center handles periodic scheduling, task monitoring, alerts, and operations tasks like data backfill. Configure Data Quality monitoring rules for output tables to ensure accuracy.
DataWorks Copilot, an AI assistant, helps generate and optimize code, troubleshoot issues, and streamline development and operations.
Analyze data: Provide analysts and operations teams with SQL queries, data insights, and workbooks through DataAnalysis. This enables ad hoc queries and self-service BI analysis.
Share and exchange data: Use DataService Studio to wrap data into standard API services for programmatic access by downstream applications, or use data push to deliver data proactively.
End-to-end data governance: Data governance capabilities span the entire data flow, ensuring data is trustworthy, controllable, and usable. Metadata syncs automatically to Data Map, helping users discover data and trace lineage. Data Asset Governance identifies and resolves development and data issues through governance plans. Security Center protects sensitive data throughout.
DataWorks orchestrates the entire workflow while underlying compute engines such as MaxCompute, Hologres, Realtime Compute for Apache Flink, and E-MapReduce handle computation and storage.
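To make the "process and transform" step concrete, the following is a minimal sketch of a single MaxCompute SQL task node. The table and column names (ods_order_log, dws_order_daily) are hypothetical, and the sketch assumes a scheduling parameter named bizdate is defined in the node's scheduling configuration.

```sql
-- Minimal MaxCompute SQL node: aggregate one day of raw orders into a daily
-- summary partition. Table and column names are illustrative.
-- ${bizdate} is assumed to be defined as a scheduling parameter on the node.
CREATE TABLE IF NOT EXISTS dws_order_daily
(
    shop_id      STRING,
    order_cnt    BIGINT,
    order_amount DOUBLE
)
PARTITIONED BY (ds STRING);

INSERT OVERWRITE TABLE dws_order_daily PARTITION (ds = '${bizdate}')
SELECT  shop_id,
        COUNT(1)          AS order_cnt,
        SUM(order_amount) AS order_amount
FROM    ods_order_log
WHERE   ds = '${bizdate}'
GROUP BY shop_id;
```

In a real workflow, this node is one vertex in the DAG, with its input table produced by an upstream Data Integration or SQL node.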
Combined use cases
Flexibly combine DataWorks modules to meet different data processing and application requirements. The following sections describe several typical combination patterns.
Pattern 1: Batch data warehouse construction
This is the most common pattern for building enterprise data warehouses and performing periodic batch processing with BI analysis.
Objective: Build a stable, reliable, and traceable batch data warehouse.
Module combination: Data Integration + Data Modeling + Data Studio + Data Quality + Operation Center + Data Map
Implementation:
Data Integration: Synchronize incremental data daily from business systems such as RDS to the Operational Data Store (ODS) layer in MaxCompute.
Data Modeling: Plan data warehouse layers and design models in advance. Layers include Detail (DWD), Summary (DWS), Dimension (DIM), and Application (ADS).
Data Studio: Write MaxCompute SQL tasks to clean, transform, and load ODS data into model tables (see the sketch after this list). Use Copilot to generate and optimize code during development.
Data Quality: Configure monitoring rules for core DWD and DWS tables. Examples: "Daily partition row count must not be zero" or "Key amount field values must stay within normal ranges."
Operation Center: Configure all tasks as a dependency-based DAG in Data Studio. Set the scheduling cycle to daily, then publish the workflow to Operation Center. Configure baselines and Data Quality rules for monitoring and operations.
Data Map: Analysts and business users search Data Map to understand metric definitions and view complete upstream processing lineage.
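The Data Studio and Data Quality steps above might look like the following MaxCompute SQL sketch. The ods_trade_order and dwd_trade_order_di tables and their columns are hypothetical, and the final query only illustrates the kind of check that a "partition row count must not be zero" rule performs; in practice such rules are configured in the Data Quality console rather than written by hand.

```sql
-- Cleanse and load one day of ODS data into a DWD model table.
-- All table and column names are hypothetical.
INSERT OVERWRITE TABLE dwd_trade_order_di PARTITION (ds = '${bizdate}')
SELECT  order_id,
        buyer_id,
        CAST(pay_amount AS DECIMAL(18, 2)) AS pay_amount,
        gmt_create
FROM    ods_trade_order
WHERE   ds = '${bizdate}'
  AND   order_id IS NOT NULL        -- drop dirty records
  AND   pay_amount >= 0;            -- keep amounts in a valid range

-- A "daily partition row count must not be zero" rule corresponds to a
-- check like this, run against the partition after the task finishes:
SELECT COUNT(1) AS row_cnt
FROM   dwd_trade_order_di
WHERE  ds = '${bizdate}';
```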
Roles: Data engineers and data architects.
Pattern 2: Real-time data development
This pattern suits low-latency scenarios like real-time dashboards, recommendations, and risk control.
Objective: Process and analyze streaming data in real time for second-level or minute-level business insights.
Module combination: Data Integration + Data Studio + DataAnalysis + DataService Studio
Implementation:
Data Integration: Configure real-time sync tasks to stream data from behavioral logs or message queues (Kafka) to data lakes or middleware.
Data Studio: Create Flink SQL tasks for windowing, aggregation, and other stream calculations. Example: "Count product clicks over the last minute" (see the Flink SQL sketch after this list).
Result output: Flink tasks write results in real time to high-performance interactive analytics engines like Hologres.
Build reports or dashboards using:
DataAnalysis: Add Hologres as a data source, then generate cards from SQL queries or data insights. Combine cards into dynamically updating reports.
DataService Studio: Generate APIs with Hologres as the data source and provide data to tools like DataV or Quick BI to build real-time analytics dashboards.
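The windowed aggregation in this pattern might look like the following Flink SQL sketch. It assumes a source table user_click_log (for example, declared on a Kafka topic with an event-time attribute) and a Hologres result table product_click_1min; the table names, columns, and omitted connector DDL are illustrative.

```sql
-- Count clicks per product over 1-minute tumbling windows and write the
-- result to a Hologres sink table. Source and sink DDL (Kafka and Hologres
-- connectors) is omitted; table and column names are illustrative.
INSERT INTO product_click_1min
SELECT  window_start,
        window_end,
        product_id,
        COUNT(*) AS click_cnt
FROM TABLE(
        TUMBLE(TABLE user_click_log, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, product_id;
```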
Roles: Real-time development engineers and data analysts.
Pattern 3: Data exploration and analysis
This pattern serves analysts and operations personnel who need to quickly validate ideas and perform ad hoc data exploration.
Objective: Provide a self-service, efficient query and analysis environment that lowers the barrier to data access.
Module combination: Data Map + Security Center + DataAnalysis
Implementation:
Data Map: Analysts search for keywords such as "revenue" or "active users" to find relevant metrics and data tables, then view table metadata and lineage to confirm the data meets analysis requirements.
Security Center: Use data access control, classification, and masking to ensure analysts use data in a compliant and secure manner.
DataAnalysis: After confirming the target table, use SQL Query and Analysis or Data Insight to write exploratory queries. Example: "Query sales distribution by product category in Singapore last quarter" (see the sketch after this list).
Result presentation: Export query results directly or generate charts quickly in DataAnalysis for sharing or report creation.
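The example query above might be written as follows. The sales_order_detail table, its columns, and the date range standing in for "last quarter" are hypothetical.

```sql
-- Sales distribution by product category in Singapore for one quarter.
-- Table name, column names, and filter values are hypothetical.
SELECT  product_category,
        SUM(sales_amount)        AS total_sales,
        COUNT(DISTINCT order_id) AS order_cnt
FROM    sales_order_detail
WHERE   region = 'Singapore'
  AND   order_date BETWEEN '2024-10-01' AND '2024-12-31'
GROUP BY product_category
ORDER BY total_sales DESC;
```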
Roles: Data analysts, business operations personnel, and data product managers.
Pattern 4: Data service encapsulation
This pattern applies when online business systems, such as web applications or mini-programs, need to access processed data directly.
Objective: Quickly and securely wrap data warehouse tables or complex queries into standard API operations.
Module combination: Data Studio + DataService Studio
Implementation:
Data preparation: Use the batch data warehouse construction pattern (Pattern 1) to process a result table in Data Studio, such as a "user persona tag table."
DataService Studio: Enter DataService Studio and create a new API operation.
API configuration: Point the API's query logic to the "user persona tag table." Set the request parameter to "User ID" and select the tag fields to return (a script-mode SQL sketch follows this list).
Performance and security: Configure caching policies for the API to improve high-frequency query performance. Manage the API through grouping and authorization.
Publish and call: After publishing the API and granting required permissions, backend engineers obtain the API's endpoint and authentication information. Integrate the API into business code to retrieve user persona tags in real time based on user ID.
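If the API is created in script mode, its query logic might be a single SQL statement like the sketch below, where the request parameter is referenced as ${user_id}. The user_profile_tag table and its columns are hypothetical; wizard mode can express the same mapping without writing SQL.

```sql
-- DataService Studio script-mode query for the "user persona tag table".
-- ${user_id} is the API request parameter; table and columns are hypothetical.
SELECT  user_id,
        age_range,
        city_tier,
        preferred_category
FROM    user_profile_tag
WHERE   user_id = ${user_id};
```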
Roles: Data engineers and backend developers.
What to do next
After understanding these usage patterns, start using DataWorks by following these practical examples: