Datastream For Change Data Capture CDC

Hello and welcome back. In this lesson we cover Datastream for Change Data Capture (CDC) on Google Cloud. This builds on Cloud Pub/Sub concepts and continues the data ingestion journey by explaining how Datastream captures and delivers database changes in near real time.

What is Datastream?

Datastream is a serverless, real-time data replication service on Google Cloud that continuously captures changes from supported source databases and delivers them to configured destinations. It reads database transaction logs (binlog/WAL/redo logs), converts those changes into events, and streams them to targets with minimal impact on the source systems. Why this matters for data engineers

Keeps analytics and downstream systems up to date with minimal latency.
Transfers only changes (inserts, updates, deletes) rather than full database dumps, reducing load, storage, and network costs.
Enables event-driven architectures, real-time analytics, and near-real-time ETL pipelines.

Exam tip: Datastream supports MySQL, PostgreSQL, and Oracle as source databases. Use it to stream change events to BigQuery, Cloud Storage, or Pub/Sub for downstream processing.

A slide titled "Data Replication Concepts" with four colored blocks labeled Destination Format (Avro/JSON/Parquet), CDC (change data capture), Targets (BigQuery, Cloud Storage, Cloud Spanner) and Source Systems (MySQL, PostgreSQL, Oracle). Each block includes an icon and short text explaining how changed data is identified and replicated to downstream targets in different formats.

How Datastream works (high level)

Application writes data to the source database (inserts/updates/deletes).
Datastream monitors database transaction logs (e.g., MySQL binlog, PostgreSQL WAL, Oracle redo logs) to capture changes in order.
The CDC engine parses and optionally filters events (by table, schema, column, or operation).
Datastream writes change events to the configured destination (BigQuery, Cloud Storage, Pub/Sub, etc.) in the chosen format (commonly Avro or JSON).
Downstream consumers process, transform, or analyze those change events in near real time.

Core components and recommended configurations

Component	Purpose	Example / Notes
Stream	Logical grouping of tables/schemas to replicate	Create one stream per application or workload boundary for easier management
Connection profile	Stores connection and network details for the source	Includes credentials, VPC peering or private IP settings
Stream object	Represents an individual table or schema within a stream	Specify schema/table filters to limit scope
CDC engine	Reads transaction logs and converts them into change events	Managed by Datastream — no infrastructure to run
Destination formats	Output serialization formats for delivered events	Typically Avro or JSON. Parquet is often produced downstream after conversion
Monitoring	Metrics, logs, and alerts to observe stream health	Track latency, connector errors, and throughput

Supported sources and typical targets

Source databases	Typical destinations	Use cases
MySQL, PostgreSQL, Oracle	BigQuery	Near real-time analytics and reporting
MySQL, PostgreSQL, Oracle	Cloud Storage	Data lake storage (raw change events)
MySQL, PostgreSQL, Oracle	Cloud Pub/Sub	Event-driven architectures and microservices

Lifecycle of a change inside Datastream

Log capture: Datastream reads the database transaction log to capture a reliable, ordered sequence of changes.
Parse & filter: Logs are parsed into structured change events; filters (tables, columns, or operation types) can be applied to reduce noise.
Delivery: Events are written to the destination in the configured format.
Ordering & duplicates: Ordering is preserved as captured from transaction logs. Deduplication and exact delivery semantics depend on destination and consumer-side logic.

Important: CDC requires database transaction logging (e.g., MySQL binlog or PostgreSQL WAL), which may be disabled by default. Ensure transaction logs are enabled and that Datastream has network access and the necessary permissions to read them. Coordinate changes with your DBA or infra team before creating streams.

Practical checklist before creating a Datastream stream

Enable transaction logs on the source:
- MySQL: enable binlog
- PostgreSQL: enable WAL and logical replication settings
- Oracle: enable redo logs and supplemental logging
Ensure network connectivity:
- VPC peering, private IP, or public connectivity with proper firewall rules
Service account permissions:
- Grant Datastream and any downstream services the needed IAM roles
Define scope:
- Choose tables/schemas to replicate, and whether to include DDL or only DML
Choose output format and destination:
- Avro or JSON for change events; convert to Parquet later for analytics if needed
Plan for ordering & de-duplication:
- Implement idempotent consumers or deduplication logic where required

Example real-world scenario

E-commerce platform: Orders are created in a transactional database. Datastream captures new order inserts and updates, writes change events to Pub/Sub and Cloud Storage, and streams orders into BigQuery (via Dataflow or a native connector) for dashboards, fraud detection, and shipment workflows. This keeps downstream systems up to date without impacting OLTP performance.

When to use Datastream

Need low-latency replication of database changes with minimal source impact
Want a managed CDC solution integrated with Google Cloud destinations
Building event-driven architectures, streaming ETL, or real-time analytics

Watch Video

Demo Hands on with PubSub Subscription and its integration

Structured Unstructured and Semi structured Data

Introduction

GCP Networking

Identity and Access Management (IAM) in GCP

Cloud Observability

Development & CI/CD

Data Security & Encryption

Data Ingestion Options

Data Storage Options

Database (SQL, NoSQL and memory)

Data Orchestration Options

Data Processing

Data Integration & Transformation Tools

Data Warehouse & Analytics Options

Machine Learning Options

Multi-Cloud & Lakehouse Solutions

Data Management and Governance

GCP Data Engineering Architecture and Landscape

GCP Core Fundamentals & Understanding

Datastream For Change Data Capture CDC

What is Datastream?

How Datastream works (high level)

Core components and recommended configurations

Supported sources and typical targets

Lifecycle of a change inside Datastream

Practical checklist before creating a Datastream stream

Example real-world scenario

When to use Datastream

Further reading and references

Watch Video

​What is Datastream?

​How Datastream works (high level)

​Core components and recommended configurations

​Supported sources and typical targets

​Lifecycle of a change inside Datastream

​Practical checklist before creating a Datastream stream

​Example real-world scenario

​When to use Datastream

​Further reading and references

Watch Video

What is Datastream?

How Datastream works (high level)

Core components and recommended configurations

Supported sources and typical targets

Lifecycle of a change inside Datastream

Practical checklist before creating a Datastream stream

Example real-world scenario

When to use Datastream

Further reading and references