Cloud Composer is Google Cloud’s managed Apache Airflow service. It uses Airflow DAGs to orchestrate tasks and interacts with GCP through a set of Airflow operators, sensors, and hooks provided by the Airflow GCP provider package.
Quick summary table
| GCP Service | Typical Airflow operators / sensors | Typical use case |
|---|---|---|
| Cloud Functions | CloudFunctionInvokeFunctionOperator, or a PythonOperator that calls the Cloud Functions API through a client library | Run lightweight, on-demand compute for event-based tasks (e.g., validate a landed file, update metadata) |
| Pub/Sub | PubSubPublishMessageOperator to publish; PubSubPullSensor or an external trigger to react to incoming messages | Event-driven pipelines and decoupling producers and consumers |
| Cloud Storage (GCS) | GCSHook, GCSListObjectsOperator, GCSToGCSOperator, GCSToBigQueryOperator | Staging raw data, intermediate artifacts, and outputs |
| BigQuery | BigQueryInsertJobOperator, BigQueryExecuteQueryOperator | Analytical storage and query engine for batch/interactive analytics |
| Dataflow | DataflowCreatePythonJobOperator, DataflowCreateJavaJobOperator | Scalable streaming or batch ETL/ELT pipelines |
| Dataproc | DataprocCreateClusterOperator, DataprocSubmitJobOperator | Run Spark/Hadoop jobs with on-demand clusters to save costs |
Integration patterns and examples
Cloud Functions
- Use case: Run a small, single-purpose function when an event occurs (e.g., validate or enrich a file after it lands in Cloud Storage).
- How Composer invokes it: From a DAG task, either by using a dedicated Cloud Functions operator or by calling the Cloud Functions REST API / client library from a PythonOperator.
- Why use it: Keeps light business logic out of DAG definitions and provides fast, event-driven execution.
- Example (invoke via Python client inside a PythonOperator):
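
The invocation described above might look like the following sketch. It assumes a 1st gen function that can be called through the `google-cloud-functions` client library, and an Airflow 2 import path for `PythonOperator`. The project, region, function name (`validate-file`), bucket, and payload fields are hypothetical placeholders, not values from this guide. Third-party imports are deferred into the functions so the pure helpers can be exercised without GCP access.

```python
import json


def function_resource_name(project_id: str, location: str, function_name: str) -> str:
    """Build the fully qualified resource name the Cloud Functions API expects."""
    return f"projects/{project_id}/locations/{location}/functions/{function_name}"


def build_payload(bucket: str, object_name: str) -> str:
    """Serialize the file reference as the JSON string sent to the function."""
    return json.dumps({"bucket": bucket, "name": object_name})


def invoke_validation_function(
    bucket,
    object_name,
    project_id="my-project",        # hypothetical project
    location="us-central1",         # hypothetical region
    function_name="validate-file",  # hypothetical function
):
    """Callable for a PythonOperator: call the function, fail the task on error."""
    # Imported lazily so this module parses without the client library installed.
    from google.cloud import functions_v1

    client = functions_v1.CloudFunctionsServiceClient()
    response = client.call_function(
        request={
            "name": function_resource_name(project_id, location, function_name),
            "data": build_payload(bucket, object_name),
        }
    )
    if response.error:
        # Raising marks the Airflow task as failed so retries can kick in.
        raise RuntimeError(f"Cloud Function failed: {response.error}")
    return response.result


def build_task(dag):
    """Attach the callable to an existing DAG as a PythonOperator task."""
    from airflow.operators.python import PythonOperator

    return PythonOperator(
        task_id="validate_landed_file",
        python_callable=invoke_validation_function,
        op_kwargs={"bucket": "my-landing-bucket", "object_name": "incoming/data.csv"},
        dag=dag,
    )
```

Keeping the resource name and payload in small helpers means the only untested part is the network call itself.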
Pub/Sub
- Use case: Decouple producers and consumers; use messages to trigger downstream processing.
- How Composer interacts: Publish messages from a DAG (or use a sensor / Cloud Function to trigger DAGs on message arrival).
- Why use it: Enables scalable, loosely coupled architectures and event-driven orchestration.
- Example (publish using client library inside a PythonOperator):
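
A minimal sketch of publishing from a task, assuming the `google-cloud-pubsub` client library is available in the Composer environment. The project ID, topic ID (`file-events`), and message fields are hypothetical. The callable deliberately avoids `**kwargs`, because Airflow would otherwise pass its task context into it.

```python
import json
from typing import Optional


def encode_message(payload: dict) -> bytes:
    """Pub/Sub message bodies are bytes; serialize the payload as UTF-8 JSON."""
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish_event(
    payload: dict,
    project_id: str = "my-project",   # hypothetical project
    topic_id: str = "file-events",    # hypothetical topic
    attributes: Optional[dict] = None,
) -> str:
    """Callable for a PythonOperator: publish one message and return its ID."""
    # Imported lazily so this module parses without the client library installed.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    # Attributes are sent as string key/value metadata alongside the body.
    future = publisher.publish(
        topic_path, data=encode_message(payload), **(attributes or {})
    )
    # Block until the service acknowledges the message so failures surface
    # in the Airflow task instead of being lost in the background.
    return future.result(timeout=60)


def build_task(dag):
    """Attach the callable to an existing DAG as a PythonOperator task."""
    from airflow.operators.python import PythonOperator

    return PythonOperator(
        task_id="publish_file_event",
        python_callable=publish_event,
        op_kwargs={
            "payload": {"bucket": "my-landing-bucket", "name": "incoming/data.csv"},
            "attributes": {"source": "composer"},
        },
        dag=dag,
    )
```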
Cloud Storage
- Use case: Landing zone for raw files, intermediate artifacts, and final outputs.
- How Composer interacts: Use GCS operators/hooks to transfer files, list objects, or stream content to downstream tasks (Dataflow, BigQuery).
- Why use it: Durable, highly available storage integrated across GCP.
- Example (list objects using GCS hook):
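
As a rough sketch of the hook-based listing, assuming the Airflow Google provider is installed and the default `google_cloud_default` connection is configured. The bucket name, prefix, and the CSV filter are hypothetical choices made for illustration.

```python
def select_by_suffix(object_names, suffix=".csv"):
    """Keep only object names with the given suffix, in a stable sorted order."""
    return sorted(name for name in object_names if name.endswith(suffix))


def list_landed_files(
    bucket_name="my-landing-bucket",  # hypothetical bucket
    prefix="incoming/",               # hypothetical prefix
    suffix=".csv",
):
    """Callable for a PythonOperator: return matching object names.

    The returned list is pushed to XCom by default, so downstream tasks
    can pull it; keep it small or pass a manifest path instead.
    """
    # Imported lazily so this module parses without Airflow installed.
    from airflow.providers.google.cloud.hooks.gcs import GCSHook

    hook = GCSHook(gcp_conn_id="google_cloud_default")
    object_names = hook.list(bucket_name=bucket_name, prefix=prefix)
    return select_by_suffix(object_names, suffix)
```

When no filtering logic is needed, `GCSListObjectsOperator` from the summary table covers the same listing declaratively.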
BigQuery
- Use case: Analytical warehouse for queries, table refreshes, and materialized results.
- How Composer interacts: Submit jobs with BigQueryInsertJobOperator or BigQueryExecuteQueryOperator (deprecated in recent provider releases in favor of the former), monitor completion, and branch downstream tasks.
- Why use it: Fast, serverless analytics that pairs naturally with Composer orchestration.
- Example (submit a query job):
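
One way to submit a query job is sketched below with `BigQueryInsertJobOperator`, assuming the Airflow Google provider is installed. The SQL, project, dataset, and table names are hypothetical. Writing to a destination table with `WRITE_TRUNCATE` is one possible choice that keeps reruns idempotent.

```python
def build_query_job(sql, destination=None):
    """Build the `configuration` dict for a BigQuery query job.

    `destination` is an optional (project_id, dataset_id, table_id) tuple.
    """
    query = {"query": sql, "useLegacySql": False}
    if destination is not None:
        project_id, dataset_id, table_id = destination
        query["destinationTable"] = {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        }
        # Overwrite the table on each run so retries do not duplicate rows.
        query["writeDisposition"] = "WRITE_TRUNCATE"
    return {"query": query}


def build_task(dag):
    """Attach a query job to an existing DAG."""
    # Imported lazily so this module parses without Airflow installed.
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryInsertJobOperator,
    )

    return BigQueryInsertJobOperator(
        task_id="refresh_daily_summary",
        configuration=build_query_job(
            # Hypothetical query and table names.
            "SELECT event_date, COUNT(*) AS events "
            "FROM `my-project.analytics.events` GROUP BY event_date",
            destination=("my-project", "analytics", "daily_summary"),
        ),
        location="US",
        dag=dag,
    )
```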
Dataflow
- Use case: Scalable streaming or batch ETL (e.g., log ingestion, enrichment, ML feature processing).
- How Composer interacts: Launch, monitor, and retry Dataflow jobs using the Dataflow operators; Composer handles orchestration and job lifecycle.
- Why use it: Offload heavy processing to a managed, autoscaling system while Composer coordinates when jobs start and complete.
- Example (submit a Python Dataflow job — conceptual):
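
A conceptual sketch using `DataflowCreatePythonJobOperator`, the operator named in the summary table. Newer provider releases mark this operator as deprecated in favor of the Apache Beam provider's operators, so check which one your environment ships. The pipeline file, bucket names, job name, and the `input` / `output` pipeline options are hypothetical and depend on what the pipeline script itself accepts.

```python
def build_pipeline_options(temp_bucket, input_path, output_table):
    """Build the options dict that is turned into pipeline command-line flags."""
    return {
        "temp_location": f"gs://{temp_bucket}/tmp/",
        "input": input_path,     # hypothetical option defined by the pipeline
        "output": output_table,  # hypothetical option defined by the pipeline
    }


def build_task(dag):
    """Attach a Python Dataflow job launch to an existing DAG."""
    # Imported lazily so this module parses without Airflow installed.
    from airflow.providers.google.cloud.operators.dataflow import (
        DataflowCreatePythonJobOperator,
    )

    return DataflowCreatePythonJobOperator(
        task_id="run_enrichment_pipeline",
        py_file="gs://my-code-bucket/pipelines/enrich.py",  # hypothetical path
        job_name="enrich-logs",
        options=build_pipeline_options(
            temp_bucket="my-temp-bucket",
            input_path="gs://my-landing-bucket/incoming/*.json",
            output_table="my-project:analytics.enriched_logs",
        ),
        py_interpreter="python3",
        py_requirements=["apache-beam[gcp]"],
        location="us-central1",
        dag=dag,
    )
```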
Dataproc
- Use case: Run Spark, Hive, or Hadoop jobs that need JVM-based engines.
- How Composer interacts: Create ephemeral clusters and submit jobs using Dataproc operators to minimize costs.
- Why use it: Best for existing Spark/Hadoop workloads where Composer schedules and monitors jobs.
- Example (submit a job to Dataproc):
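
The submission step could be sketched as follows with `DataprocSubmitJobOperator`, assuming the Airflow Google provider is installed and a cluster already exists (for example, one created earlier in the DAG by `DataprocCreateClusterOperator`). The project, region, cluster name, and script URI are hypothetical.

```python
def build_pyspark_job(project_id, cluster_name, main_uri, args=None):
    """Build the Dataproc job definition for a PySpark script."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_uri,
            "args": list(args or []),
        },
    }


def build_task(dag):
    """Attach a PySpark job submission to an existing DAG."""
    # Imported lazily so this module parses without Airflow installed.
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocSubmitJobOperator,
    )

    return DataprocSubmitJobOperator(
        task_id="run_spark_job",
        job=build_pyspark_job(
            project_id="my-project",
            cluster_name="ephemeral-etl-cluster",
            main_uri="gs://my-code-bucket/jobs/aggregate.py",
            args=["--date", "{{ ds }}"],  # Airflow template for the run date
        ),
        region="us-central1",
        project_id="my-project",
        dag=dag,
    )
```

To keep the cluster ephemeral, a delete task (such as `DataprocDeleteClusterOperator`) is typically placed downstream of the job so the cluster is torn down after the run.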
Practical orchestration considerations
- Use the appropriate GCP Airflow operators and sensors so your DAGs stay declarative and maintainable.
- Favor idempotent tasks and explicit retry policies to handle transient service failures (timeouts, quota errors).
- Correlate Airflow task logs with service-level logs and metrics (Cloud Logging, Cloud Monitoring) to troubleshoot cross-system failures faster.
- Keep cost in mind: prefer ephemeral Dataproc clusters or on-demand Dataflow jobs and set TTLs where applicable.
- Design tasks to be small, testable, and observable: simple tasks are easier to retry and to reason about in distributed systems.
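
To make the retry advice above concrete, here is a minimal sketch of an explicit retry policy set through `default_args`, assuming Airflow 2.4 or later (for the `schedule` argument). The DAG ID and the specific retry counts and delays are hypothetical starting points, not recommendations from this guide.

```python
from datetime import datetime, timedelta


def build_default_args(retries=3, first_delay_minutes=2, max_delay_minutes=30):
    """Return task defaults that retry transient failures with backoff."""
    return {
        "retries": retries,
        "retry_delay": timedelta(minutes=first_delay_minutes),
        # Grow the wait between attempts instead of retrying at a fixed pace,
        # which helps with quota errors and temporary service outages.
        "retry_exponential_backoff": True,
        "max_retry_delay": timedelta(minutes=max_delay_minutes),
    }


def build_dag():
    """Create a DAG whose tasks all inherit the retry policy."""
    # Imported lazily so this module parses without Airflow installed.
    from airflow import DAG

    return DAG(
        dag_id="gcp_pipeline_with_retries",  # hypothetical DAG ID
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
        default_args=build_default_args(),
    )
```

Retries are only safe when tasks are idempotent, for example by overwriting a destination table or writing to a deterministic output path rather than appending.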
Choosing between Cloud Composer and Cloud Workflows
Cloud Composer (Airflow) and Cloud Workflows solve different orchestration problems:
Cloud Composer
- Best for complex, long-running DAGs, data pipelines, and dependencies between heterogeneous tasks (Python code, SQL jobs, long-running jobs).
- Benefits from the Airflow ecosystem of operators, sensors, and scheduling features.
Cloud Workflows
- Best for short, operational orchestration and API-driven workflows (service orchestration, retries with backoff, lightweight sequences).
- Lightweight, ideal for calling REST APIs and composing short-lived service interactions.