Welcome back. In this lesson we dive into one of the foundational topics in data engineering: data ingestion patterns. Whether you are building a small application or a large data platform, reliably moving data from sources into a storage or processing system is essential for analytics, reporting, and machine learning. Here we introduce core concepts and the two primary ingestion approaches: batch and streaming.
At a high level, a data ingestion pipeline moves raw data from sources into a storage/processing system where it becomes available for analysis or model training. Typical pipeline stages are:
  • Raw data sources
    The original, unprocessed data from systems such as transactional databases, application logs, IoT sensors, or social media feeds.
  • Ingestion framework
    The mechanisms or services that pull or receive data from sources: connectors, agents, APIs, or messaging systems (for example, change-data-capture connectors, log collectors, or Pub/Sub).
  • Staging / raw landing
    Where ingested raw data is stored (often a data lake or staging bucket). This stage usually includes lightweight validation, schema tagging, and deduplication; heavier enrichment and joins occur later in processing.
  • Processing and transformation
    Jobs or streaming processors that clean, normalize, and enrich data to create analysis-ready datasets.
  • Final storage and consumption
    Processed datasets are written to destinations such as a data warehouse, curated data lake, or analytics platform for reporting, dashboards, or downstream ML pipelines.
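As a rough illustration of the staging / raw landing stage described above, the following minimal sketch applies lightweight validation, schema tagging, and deduplication to a list of in-memory records. The field names (`event_id`, `source`, `payload`), the `SCHEMA_VERSION` tag, and the `stage_records` function are hypothetical choices for this example and are not tied to any particular service.

```python
# Hypothetical minimal schema and version tag, chosen only for illustration.
REQUIRED_FIELDS = {"event_id", "source", "payload"}
SCHEMA_VERSION = "v1"


def stage_records(raw_records):
    """Validate, tag, and deduplicate raw records before landing them."""
    seen_ids = set()
    accepted, rejected = [], []
    for record in raw_records:
        # Lightweight validation: only check that the required fields exist.
        if not REQUIRED_FIELDS.issubset(record):
            rejected.append(record)
            continue
        # Deduplication: keep the first record seen for each event_id.
        if record["event_id"] in seen_ids:
            continue
        seen_ids.add(record["event_id"])
        # Schema tagging: copy the record and attach the schema version.
        accepted.append({**record, "schema_version": SCHEMA_VERSION})
    return accepted, rejected


raw = [
    {"event_id": "a1", "source": "db", "payload": {"amount": 10}},
    {"event_id": "a1", "source": "db", "payload": {"amount": 10}},  # duplicate
    {"source": "iot", "payload": {}},  # missing event_id, so it is rejected
    {"event_id": "b2", "source": "web", "payload": {}},
]
accepted, rejected = stage_records(raw)
```

Rejected records are returned rather than dropped so that a later step can inspect or quarantine them. Heavier enrichment and joins are deliberately left to the processing stage.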
Choosing the right ingestion approach affects latency, throughput, operational complexity, and how you handle ordering, duplicates, and failures. Consider business SLAs (latency vs. completeness), data volume, and toolchain maturity when designing your ingestion layer.
Why the ingestion pattern matters
  • The ingestion pattern determines end-to-end latency, resource usage, fault tolerance, and operational complexity.
  • It affects the choice of tools and architecture (batch schedulers, stream processors, messaging systems, CDC tools).
  • It drives design decisions for ordering, exactly-once or at-least-once semantics, duplicate handling, and state management.
Figure: An infographic titled "Data Ingestion – Introduction". Raw data sources (Database, Web/App, IoT Sensor, Social Media) are funneled through a central "Data Ingestion" process that collects, cleans, and transfers data. The processed data flows to destination systems (Data Warehouse, Data Lake, Analytics Platform). The infographic also highlights batch and streaming ingestion.
Two primary ingestion patterns
  • Batch ingestion
    • Moves data in discrete groups (batches) on a schedule (hourly, daily, nightly).
    • Common use cases: nightly ETL that aggregates yesterday’s transactions; large backfills; bulk data movement to a data warehouse.
    • Characteristics: higher end-to-end latency, simpler processing model, easier to reason about correctness for large bulk operations.
    • Variants:
      • Full loads (replace entire dataset)
      • Incremental loads (only new or changed rows)
      • Change data capture (CDC) for near-real-time incremental loads
      • Micro-batching (small batches at high frequency) as a hybrid to reduce latency
  • Streaming ingestion
    • Continuously ingests and processes events in near real time as they are produced.
    • Common use cases: live user-event analytics, real-time monitoring, fraud detection, anomaly detection, feature pipelines for online ML.
    • Characteristics: low latency, event-driven, often requires windowing semantics, state management, idempotency, and careful handling of event-time vs processing-time.
    • Considerations: event ordering, backpressure, checkpointing, retention, and duplicate event handling. Guarantees vary by platform (at-least-once, exactly-once).
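To make the event-time vs processing-time distinction concrete, here is a minimal sketch of tumbling-window counting keyed by event time. It assumes each event carries its own timestamp in whole seconds. The names `WINDOW_SECONDS`, `window_start`, and `count_by_window` are hypothetical, and the sketch omits watermarks, late-data handling, and state cleanup, which stream processors normally provide.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size for this example


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the tumbling window containing event_time."""
    return event_time - (event_time % size)


def count_by_window(events):
    """Count events per (window start, key) using event time.

    `events` is an iterable of (event_time_seconds, key) pairs in arrival
    order, which may differ from the order in which they were produced.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        counts[(window_start(event_time), key)] += 1
    return dict(counts)


# Events arrive out of order, yet each lands in the window of its own timestamp.
arrivals = [(125, "login"), (61, "login"), (130, "login"), (59, "login")]
result = count_by_window(arrivals)
```

Because the window is derived from the timestamp inside the event, arrival order does not change the counts. Grouping by processing time instead would place a late event in whichever window happened to be open when it arrived.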
Comparison: Batch vs Streaming
| Dimension | Batch ingestion | Streaming ingestion |
|---|---|---|
| Latency | Minutes to hours (or longer) | Milliseconds to seconds |
| Typical use | Periodic aggregations, backfills, ELT into warehouse | Real-time analytics, alerts, feature updates |
| Complexity | Simpler semantics, easier reprocessing | Harder: state, windowing, ordering, checkpoints |
| Fault model | Retry entire batch or re-run job | Checkpoints, replay from message broker, idempotent sinks |
| Examples | Scheduled ETL jobs, nightly loads | Event streams, CDC into Pub/Sub, real-time pipelines |
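The batch fault model of re-running a job is safest when a re-run produces the same result as the first run. One way to get there, sketched below under simplifying assumptions, is an incremental load driven by a high-watermark and written as an upsert. The row fields (`id`, `updated_at`) and the `incremental_load` function are hypothetical, and the target is a plain dictionary standing in for a warehouse table.

```python
def incremental_load(source_rows, target, watermark):
    """Copy rows changed after `watermark` into `target`.

    Rows are upserted by primary key, so re-running the same batch with
    the same watermark leaves `target` unchanged. Returns the new watermark.
    """
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # upsert: insert or overwrite by key
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


source = [
    {"id": 1, "updated_at": 1, "status": "new"},
    {"id": 2, "updated_at": 2, "status": "paid"},
    {"id": 1, "updated_at": 3, "status": "shipped"},
]
warehouse = {}
first = incremental_load(source, warehouse, watermark=1)
snapshot = dict(warehouse)
# Simulate a retry of the same batch after a failure: the result is identical.
retry = incremental_load(source, warehouse, watermark=1)
```

The strict `>` comparison assumes no row is later written with a timestamp equal to an already committed watermark. Real pipelines often reread a small overlap window and rely on the upsert to absorb the resulting duplicates.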
Both patterns can coexist in a modern data platform: use batch for large-scale transformations and historical reprocessing, and streaming for low-latency analytics and timely alerts. Hybrid architectures (e.g., streaming ingestion with periodic micro-batch reprocessing) are common.
This lesson examines available GCP services for both batch and streaming ingestion and dives deeper into their technical trade-offs. Common GCP building blocks include Cloud Storage, Cloud Pub/Sub, Dataflow, Dataproc, and BigQuery for different stages of ingestion and processing:
  • Ingestion: Cloud Pub/Sub, Transfer Service, direct connector agents, Cloud Storage uploads
  • Processing: Dataflow (stream & batch), Dataproc (batch Spark/Hadoop), Cloud Run/Cloud Functions (lightweight ingestion)
  • Storage & consumption: Cloud Storage (raw landing), BigQuery (analytics), Bigtable/Firestore (low-latency lookups)
Be careful with ordering and deduplication. Streaming systems often deliver at-least-once by default; implement idempotent writes or deduplication windows in sinks to avoid data inflation. When exact ordering matters, design with partition keys and windowing semantics in mind.
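A deduplication window in a sink can be sketched as follows. This is a minimal in-memory illustration, assuming every event carries a stable unique ID. The `DedupSink` class and its `window_size` parameter are hypothetical names, and a production sink would typically keep this state in durable storage or rely on keyed upserts in the destination.

```python
from collections import OrderedDict


class DedupSink:
    """Sink that drops events whose ID was seen within a bounded window."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self._seen = OrderedDict()  # event IDs, oldest first
        self.rows = []

    def write(self, event_id, row):
        """Write the row once; return False if it is a recent duplicate."""
        if event_id in self._seen:
            return False
        self._seen[event_id] = None
        if len(self._seen) > self.window_size:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        self.rows.append(row)
        return True


sink = DedupSink(window_size=2)
results = [
    sink.write("e1", {"amount": 5}),
    sink.write("e1", {"amount": 5}),  # redelivery inside the window: dropped
    sink.write("e2", {"amount": 7}),
    sink.write("e3", {"amount": 9}),  # pushes "e1" out of the window
    sink.write("e1", {"amount": 5}),  # redelivery outside the window: written
]
```

The last call shows the trade-off: a bounded window limits memory, but a duplicate that arrives after its ID has been evicted is written again. Size the window to cover the broker's realistic redelivery delay.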
This lesson also covers the specific differences between batch and streaming ingestion in more detail.