Skip to main content
Hello and welcome back. In this lesson we cover Datastream for Change Data Capture (CDC) on Google Cloud. This builds on Cloud Pub/Sub concepts and continues the data ingestion journey by explaining how Datastream captures and delivers database changes in near real time.

What is Datastream?

Datastream is a serverless, real-time data replication service on Google Cloud that continuously captures changes from supported source databases and delivers them to configured destinations. It reads database transaction logs (binlog/WAL/redo logs), converts those changes into events, and streams them to targets with minimal impact on the source systems. Why this matters for data engineers
  • Keeps analytics and downstream systems up to date with minimal latency.
  • Transfers only changes (inserts, updates, deletes) rather than full database dumps, reducing load, storage, and network costs.
  • Enables event-driven architectures, real-time analytics, and near-real-time ETL pipelines.
Exam tip: Datastream supports MySQL, PostgreSQL, and Oracle as source databases. Use it to stream change events to BigQuery, Cloud Storage, or Pub/Sub for downstream processing.
A slide titled "Data Replication Concepts" with four colored blocks labeled Destination Format (Avro/JSON/Parquet), CDC (change data capture), Targets (BigQuery, Cloud Storage, Cloud Spanner) and Source Systems (MySQL, PostgreSQL, Oracle). Each block includes an icon and short text explaining how changed data is identified and replicated to downstream targets in different formats.

How Datastream works (high level)

  1. Application writes data to the source database (inserts/updates/deletes).
  2. Datastream monitors database transaction logs (e.g., MySQL binlog, PostgreSQL WAL, Oracle redo logs) to capture changes in order.
  3. The CDC engine parses and optionally filters events (by table, schema, column, or operation).
  4. Datastream writes change events to the configured destination (BigQuery, Cloud Storage, Pub/Sub, etc.) in the chosen format (commonly Avro or JSON).
  5. Downstream consumers process, transform, or analyze those change events in near real time.
ComponentPurposeExample / Notes
StreamLogical grouping of tables/schemas to replicateCreate one stream per application or workload boundary for easier management
Connection profileStores connection and network details for the sourceIncludes credentials, VPC peering or private IP settings
Stream objectRepresents an individual table or schema within a streamSpecify schema/table filters to limit scope
CDC engineReads transaction logs and converts them into change eventsManaged by Datastream — no infrastructure to run
Destination formatsOutput serialization formats for delivered eventsTypically Avro or JSON. Parquet is often produced downstream after conversion
MonitoringMetrics, logs, and alerts to observe stream healthTrack latency, connector errors, and throughput

Supported sources and typical targets

Source databasesTypical destinationsUse cases
MySQL, PostgreSQL, OracleBigQueryNear real-time analytics and reporting
MySQL, PostgreSQL, OracleCloud StorageData lake storage (raw change events)
MySQL, PostgreSQL, OracleCloud Pub/SubEvent-driven architectures and microservices

Lifecycle of a change inside Datastream

  • Log capture: Datastream reads the database transaction log to capture a reliable, ordered sequence of changes.
  • Parse & filter: Logs are parsed into structured change events; filters (tables, columns, or operation types) can be applied to reduce noise.
  • Delivery: Events are written to the destination in the configured format.
  • Ordering & duplicates: Ordering is preserved as captured from transaction logs. Deduplication and exact delivery semantics depend on destination and consumer-side logic.
Important: CDC requires database transaction logging (e.g., MySQL binlog or PostgreSQL WAL), which may be disabled by default. Ensure transaction logs are enabled and that Datastream has network access and the necessary permissions to read them. Coordinate changes with your DBA or infra team before creating streams.

Practical checklist before creating a Datastream stream

  • Enable transaction logs on the source:
    • MySQL: enable binlog
    • PostgreSQL: enable WAL and logical replication settings
    • Oracle: enable redo logs and supplemental logging
  • Ensure network connectivity:
    • VPC peering, private IP, or public connectivity with proper firewall rules
  • Service account permissions:
    • Grant Datastream and any downstream services the needed IAM roles
  • Define scope:
    • Choose tables/schemas to replicate, and whether to include DDL or only DML
  • Choose output format and destination:
    • Avro or JSON for change events; convert to Parquet later for analytics if needed
  • Plan for ordering & de-duplication:
    • Implement idempotent consumers or deduplication logic where required

Example real-world scenario

E-commerce platform: Orders are created in a transactional database. Datastream captures new order inserts and updates, writes change events to Pub/Sub and Cloud Storage, and streams orders into BigQuery (via Dataflow or a native connector) for dashboards, fraud detection, and shipment workflows. This keeps downstream systems up to date without impacting OLTP performance.

When to use Datastream

  • Need low-latency replication of database changes with minimal source impact
  • Want a managed CDC solution integrated with Google Cloud destinations
  • Building event-driven architectures, streaming ETL, or real-time analytics

Further reading and references

Summary Datastream is Google Cloud’s managed CDC service for streaming database changes (MySQL, PostgreSQL, Oracle) to destinations such as BigQuery, Cloud Storage, and Pub/Sub. Use Datastream when you need to replicate only the changes with low latency and minimal impact to the source. We’ll cover data store options in the next section.

Watch Video