Data Ingestion Introduction

Welcome back. In this lesson we dive into one of the foundational topics in data engineering: data ingestion patterns. Whether you are building a small application or a large data platform, reliably moving data from sources into a storage or processing system is essential for analytics, reporting, and machine learning. Here we introduce core concepts and the two primary ingestion approaches: batch and streaming.

At a high level, a data ingestion pipeline moves raw data from sources into a storage or processing system where it becomes available for analysis or model training. Typical pipeline stages are:

Raw data sources
The original, unprocessed data from systems such as transactional databases, application logs, IoT sensors, or social media feeds.
Ingestion framework
The mechanisms or services that pull or receive data from sources: connectors, agents, APIs, or messaging systems (for example, change-data-capture connectors, log collectors, or Pub/Sub).
Staging / raw landing
Where ingested raw data is stored (often a data lake or staging bucket). This stage usually includes lightweight validation, schema tagging, and deduplication; heavier enrichment and joins occur later in processing.
Processing and transformation
Jobs or streaming processors that clean, normalize, and enrich data to create analysis-ready datasets.
Final storage and consumption
Processed datasets are written to destinations such as a data warehouse, curated data lake, or analytics platform for reporting, dashboards, or downstream ML pipelines.
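
The stages above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the sample records, the function names (`ingest`, `stage`, `transform`, `load`), and the in-memory list standing in for a warehouse are all hypothetical and do not correspond to any specific service.

```python
import json
from typing import Iterable, Iterator, List

# Hypothetical raw records, standing in for a source such as an application log.
RAW_SOURCE = [
    '{"user": "a", "amount": "10.5"}',
    '{"user": "b", "amount": "3"}',
    'not-json',
]

def ingest(lines: Iterable[str]) -> Iterator[str]:
    """Ingestion: pull raw records from the source unchanged."""
    yield from lines

def stage(lines: Iterable[str]) -> Iterator[dict]:
    """Staging: lightweight validation only; drop records that fail to parse."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record

def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Processing: normalize types to produce analysis-ready rows."""
    for record in records:
        yield {"user": record["user"], "amount": float(record["amount"])}

def load(rows: Iterable[dict]) -> List[dict]:
    """Final storage: a plain list stands in for a warehouse table."""
    return list(rows)

warehouse = load(transform(stage(ingest(RAW_SOURCE))))
```

In a real pipeline, records rejected at staging would usually be routed to a quarantine location rather than silently dropped; the sketch skips that for brevity.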

Choosing the right ingestion approach affects latency, throughput, operational complexity, and how you handle ordering, duplicates, and failures. Consider business SLAs (latency vs. completeness), data volume, and toolchain maturity when designing your ingestion layer.

Why the ingestion pattern matters

- The ingestion pattern determines end-to-end latency, resource usage, fault tolerance, and operational complexity.
- It affects the choice of tools and architecture (batch schedulers, stream processors, messaging systems, CDC tools).
- It drives design decisions for ordering, exactly-once or at-least-once semantics, duplicate handling, and state management.

Figure: "Data Ingestion – Introduction" infographic. Raw data sources (Database, Web/App, IoT Sensor, Social Media) are funneled through a central "Data Ingestion" process that collects, cleans, and transfers data to destination systems (Data Warehouse, Data Lake, Analytics Platform), with batch and streaming ingestion highlighted.

Two primary ingestion patterns

Batch ingestion
- Moves data in discrete groups (batches) on a schedule (hourly, daily, nightly).
- Common use cases: nightly ETL that aggregates yesterday’s transactions; large backfills; bulk data movement to a data warehouse.
- Characteristics: higher end-to-end latency, simpler processing model, easier to reason about correctness for large bulk operations.
- Variants:
  - Full loads (replace entire dataset)
  - Incremental loads (only new or changed rows)
  - Change data capture (CDC) for near-real-time incremental loads
  - Micro-batching (small batches at high frequency) as a hybrid to reduce latency
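
To make the incremental-load variant concrete, here is a minimal sketch that copies only rows changed since the previous run, tracked with a high-water mark. The `updated_at` column, the integer timestamps, and the `incremental_load` helper are assumptions made for illustration; a real job would query the source system and persist the watermark between runs.

```python
from typing import Dict, List

# Hypothetical source table: each row carries an updated_at value (an integer timestamp here).
SOURCE_ROWS: List[dict] = [
    {"id": 1, "value": "a", "updated_at": 100},
    {"id": 2, "value": "b", "updated_at": 105},
    {"id": 3, "value": "c", "updated_at": 110},
]

def incremental_load(source: List[dict], target: Dict[int, dict], watermark: int) -> int:
    """Copy rows changed after `watermark` into `target` and return the new watermark."""
    changed = [row for row in source if row["updated_at"] > watermark]
    for row in changed:
        target[row["id"]] = row  # upsert by primary key
    # If nothing changed, keep the previous watermark.
    return max((row["updated_at"] for row in changed), default=watermark)

target: Dict[int, dict] = {}
watermark = incremental_load(SOURCE_ROWS, target, watermark=0)   # first run loads every row
SOURCE_ROWS.append({"id": 2, "value": "b2", "updated_at": 120})  # a later change at the source
watermark = incremental_load(SOURCE_ROWS, target, watermark)     # second run picks up only the change
```

A full load would instead rebuild `target` from scratch on every run, which is simpler but scales with the size of the whole dataset rather than the size of the change.
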
Streaming ingestion
- Continuously ingests and processes events in near real time as they are produced.
- Common use cases: live user-event analytics, real-time monitoring, fraud detection, anomaly detection, feature pipelines for online ML.
- Characteristics: low latency, event-driven, often requires windowing semantics, state management, idempotency, and careful handling of event-time vs processing-time.
- Considerations: event ordering, backpressure, checkpointing, retention, and duplicate event handling. Guarantees vary by platform (at-least-once, exactly-once).
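
The distinction between event time and processing time can be illustrated with a small tumbling-window count. This is a sketch under assumptions: the event tuples, the 60-second window size, and the helper names are hypothetical, and the code omits the watermark logic a real stream processor uses to decide when a window is complete.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size for the example

# Hypothetical events as (event_time_seconds, user), listed in arrival order.
# The third event arrives after an event from the next window, i.e. out of order.
EVENTS = [(5, "a"), (62, "b"), (58, "a"), (119, "c"), (121, "a")]

def window_start(event_time: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the tumbling window that contains `event_time`."""
    return event_time - (event_time % size)

def count_by_window(events: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Count events per window using the event's own timestamp, not arrival order."""
    counts: Dict[int, int] = defaultdict(int)
    for event_time, _user in events:
        counts[window_start(event_time)] += 1
    return dict(counts)

counts = count_by_window(EVENTS)
```

Because the window is derived from the event's timestamp, the late event at second 58 is still counted in the first window even though it arrived after an event from the second one.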

Comparison: Batch vs Streaming

Dimension	Batch ingestion	Streaming ingestion
Latency	Minutes to hours (or longer)	Milliseconds to seconds
Typical use	Periodic aggregations, backfills, ELT into warehouse	Real-time analytics, alerts, feature updates
Complexity	Simpler semantics, easier reprocessing	Harder: state, windowing, ordering, checkpoints
Fault model	Retry entire batch or re-run job	Checkpoints, replay from message broker, idempotent sinks
Examples	Scheduled ETL jobs, nightly loads	Event streams, CDC into Pub/Sub, real-time pipelines
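
The streaming fault model in the table (checkpoints plus replay) can be sketched as follows. Everything here is hypothetical: the list `LOG` stands in for a replayable broker topic, the `checkpoint` dictionary for durable offset storage, and the `fail_at` parameter only simulates a crash.

```python
from typing import Dict, List, Optional

LOG = ["e0", "e1", "e2", "e3", "e4"]  # stands in for a replayable broker topic

class Crash(Exception):
    """Simulated worker failure."""

def consume(log: List[str], checkpoint: Dict[str, int], sink: List[str],
            fail_at: Optional[int] = None) -> None:
    """Process records starting at the committed offset, committing after each write."""
    for offset in range(checkpoint["offset"], len(log)):
        sink.append(log[offset])            # the side effect happens first
        if offset == fail_at:
            raise Crash("simulated failure before commit")
        checkpoint["offset"] = offset + 1   # commit only after the write succeeded

checkpoint = {"offset": 0}
sink: List[str] = []
try:
    consume(LOG, checkpoint, sink, fail_at=2)  # crashes after writing e2, before committing it
except Crash:
    pass
consume(LOG, checkpoint, sink)                 # restart: replay from the last committed offset
```

After the restart, `e2` appears twice in `sink`: the write succeeded but its offset was never committed, so replay delivers it again. No record is lost, which is the at-least-once behavior that idempotent sinks are meant to absorb.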

Both patterns can coexist in a modern data platform: use batch for large-scale transformations and historical reprocessing, and streaming for low-latency analytics and timely alerts. Hybrid architectures (e.g., streaming ingestion with periodic micro-batch reprocessing) are common.

This lesson examines available GCP services for both batch and streaming ingestion and dives deeper into their technical trade-offs. Common GCP building blocks include Cloud Storage, Cloud Pub/Sub, Dataflow, Dataproc, and BigQuery for different stages of ingestion and processing:

Ingestion: Cloud Pub/Sub, Transfer Service, direct connector agents, Cloud Storage uploads
Processing: Dataflow (stream & batch), Dataproc (batch Spark/Hadoop), Cloud Run/Cloud Functions (lightweight ingestion)
Storage & consumption: Cloud Storage (raw landing), BigQuery (analytics), Bigtable/Firestore (low-latency lookups)

Be careful with ordering and deduplication. Streaming systems often deliver at-least-once by default; implement idempotent writes or deduplication windows in sinks to avoid data inflation. When exact ordering matters, design with partition keys and windowing semantics in mind.
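
As a minimal sketch of an idempotent write, the example below keys each row by a unique event identifier so that a redelivered event overwrites itself instead of being counted twice. The `event_id` field, the `IdempotentSink` class, and the sample events are assumptions for illustration; real sinks typically rely on a primary key, a merge/upsert statement, or a time-bounded deduplication window.

```python
from typing import Dict, List

class IdempotentSink:
    """Stores rows keyed by event_id, so writing the same event twice has no extra effect."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def write(self, event: dict) -> None:
        self.rows[event["event_id"]] = event  # overwrite rather than append

    def total(self) -> int:
        return sum(row["amount"] for row in self.rows.values())

# Hypothetical at-least-once delivery: event e1 is delivered twice.
DELIVERED: List[dict] = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e1", "amount": 10},
]

naive_total = sum(event["amount"] for event in DELIVERED)  # inflated by the duplicate

sink = IdempotentSink()
for event in DELIVERED:
    sink.write(event)
deduplicated_total = sink.total()
```

The naive sum counts the duplicate delivery, while the keyed sink does not. This approach depends on producers attaching a stable identifier to every event.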

This lesson also covers the specific differences between batch and streaming ingestion in more detail.