Skip to main content
Welcome back. This lesson covers practical guidance for improving performance, autoscaling behavior, and cost efficiency of Google Cloud Dataflow pipelines. Over months or years of production use, these are frequent operational concerns—this guide highlights key concepts, trade-offs, and mitigation techniques to keep pipelines healthy and affordable. Topics covered:
  • Autoscaling behavior: streaming vs batch
  • Shuffle service trade-offs
  • Detecting and mitigating data skew / hot keys
  • Streaming Engine: separation of compute and service-managed state
  • Cost optimization and periodic design reviews
Let’s get started.

Autoscaling behavior: streaming vs batch

Streaming pipelines run continuously and typically autoscale gradually as load changes. Batch pipelines usually start many workers quickly, process the workload, then scale down when the job completes. These different behaviors imply distinct design and cost trade-offs:
  • Streaming: steady, incremental autoscaling; optimized for sustained workload and low-latency processing.
  • Batch: fast ramp-up and ramp-down; optimized for burst throughput and short-lived runs.
When choosing a pipeline model, consider the trigger cadence. For example, a job that must run every 2 seconds is usually better implemented as a streaming pipeline—attempting to run a batch job that frequently will cause rapid worker churn and unnecessary cost.
CharacteristicStreamingBatch
RuntimeContinuousShort-lived
Autoscaling patternGradual scalingFast ramp-up / ramp-down
Best forLow-latency, steady eventsHigh-throughput batch windows
Cost riskOverprovisioning if misconfiguredWorker churn for frequent runs
A slide titled "Autoscaling Behavior" showing two charts: a "Streaming" graph with a green line that gradually increases workers over time (gradual scaling), and a "Batch" graph with a red line that rapidly ramps up to a plateau then drops off (fast ramp-up/down).

Shuffle service: local disk vs external shuffle

By default, Dataflow can store intermediate shuffle data on worker VMs’ local disks. In this mode, if a worker fails, the stage may need to restart and reprocess lost intermediate data—costly in time and compute. Enabling the Dataflow Shuffle Service stores intermediate data in a separate, resilient service so workers become effectively stateless for intermediate data. A replacement worker can continue from the external shuffle store without reprocessing the entire stage, improving fault tolerance and enabling faster, more reliable scaling. However, the external shuffle service introduces overhead and may not be beneficial for very short-lived batch jobs (e.g., tasks that run for only a minute or two). Evaluate expected job duration, failure tolerance, and cost trade-offs before enabling it.
Use the external shuffle service for long-running or failure-sensitive jobs. For tiny, short-lived batch jobs, the external shuffle overhead can increase latency and cost.
An infographic titled "Shuffle Service" comparing two architectures: the left shows workers each containing shuffle data ("Without Shuffle Service" — limited sharing), and the right shows separate workers using a centralized "Shuffle Service (External Storage)" ("With Shuffle Service" — better scaling).

Detecting and mitigating data skew / hot keys

Data skew (hot keys) happens when a small subset of keys receives a disproportionate share of the data. This creates bottlenecks: the worker handling the hot key becomes overloaded while others are idle, reducing parallelism and increasing latency. Common mitigations:
  • Key salting (sharding / key prefixing): split a hot key into multiple salted keys for partial aggregation, then remove the salt and perform a final aggregation.
  • Reshuffle / multi-stage aggregation: perform partial combines (fanout) before a final CombinePerKey.
  • Worker-side combining: reduce the amount of shuffle data by aggregating locally where possible.
Practical detection and steps:
  • Monitor per-key throughput and latency using Dataflow job metrics and logs to identify skew.
  • For identified hot keys, implement a two-stage combine pattern:
Example pattern (pseudocode): Step 1 — Shard and partial combine:
PCollection<KV<String, V>> input = ...;
PCollection<KV<String, V>> sharded =
  input.apply(MapElements.into(...).via(kv -> KV.of(salt(kv.getKey()), kv.getValue())));
PCollection<KV<String, Accum>> partial = sharded.apply(Combine.perKey(partialCombineFn));
Step 2 — Remove salt and final combine:
PCollection<KV<String, Accum>> unsalted =
  partial.apply(MapElements.into(...).via(kv -> KV.of(unsalt(kv.getKey()), kv.getValue())));
PCollection<KV<String, Result>> finalResults =
  unsalted.apply(Combine.perKey(finalCombineFn));
  • Choose the number of shards based on observed load for the hot key (e.g., K-0..K-N).
  • Use fanout/combiner patterns to reduce intermediate shuffle volume.
These mitigations often yield large speedups when a few keys dominate the workload.
An infographic titled "Data Skew and Hot Keys" showing a problem diagram where one hot key (Key A) accounts for 90% of load and overloads one worker, and a solution diagram illustrating key salting (splitting Key A into A-1/A-2/A-3) to achieve balanced distribution and better parallelization.

Streaming Engine: separation of compute and service-managed state

Dataflow supports two runtime models:
  • Legacy worker-centric streaming (worker holds state and coordination),
  • Managed Streaming Engine (service-managed state and coordination).
The Streaming Engine moves pipeline state and coordination off worker VMs into a Google-managed service, decoupling compute from state storage. Key benefits:
  • Faster autoscaling (workers can be added/removed more quickly),
  • Lower worker memory footprint,
  • Improved worker resource utilization,
  • Better support for low-latency, short-interval streaming.
If your pipeline has short trigger gaps (e.g., events every 1–3 seconds) or strict autoscaling/latency needs, the Streaming Engine is usually recommended. For more details, see the official documentation: Dataflow Streaming Engine.

Cost optimization and ongoing design review

Control costs by selecting appropriate machine types, autoscaling settings, and shuffle/engine configurations. Practical ideas:
  • Right-size worker machine types for CPU and memory needs.
  • Use preemptible workers for batch jobs where occasional interruptions are acceptable (significant cost savings).
  • Tune autoscaling parameters and set reasonable minimum/maximum worker counts to balance latency and cost.
  • Avoid enabling shuffle service for tiny, short-lived jobs where overhead outweighs reliability gains.
  • Apply partial aggregation or fanout to reduce shuffle volume and network I/O.
Checklist for practical optimization:
AreaAction
Machine selectionRight-size CPUs/memory; consider custom machine types
AutoscalingSet min/max workers; tune scaling policies
ShuffleEnable external shuffle for long or failure-sensitive jobs; avoid for tiny jobs
WorkersConsider preemptible workers for batch
Data partitioningMitigate hot keys with salting / fanout
Review cadenceRe-evaluate pipelines periodically (see note)
Do a design review at regular intervals (for example, yearly). Re-evaluating pipelines against newer managed services and features often yields both performance and cost benefits.
Operational practice: schedule periodic architecture reviews—cloud features and pricing evolve quickly, so a pipeline that was optimal two years ago may no longer be the best choice. Ask whether long-running jobs should remain in Dataflow, be simplified, consolidated, or migrated to newer services. That’s it for this lesson. See you in the next lesson.

Watch Video