Structured, Unstructured, and Semi-Structured Data

Hello and welcome back. Before we explore storage options in Google Cloud, it helps to understand the types of data you routinely encounter as a data engineer. This lesson explains structured, semi-structured, and unstructured data using an e-commerce scenario to make the differences concrete. Use cases covered:

How transactional data maps to relational storage
Where event logs and JSON fit (semi-structured)
How media and free-form text are treated as unstructured data

Let’s walk through an example.

E-commerce example: one shopper places an order

Structured data
Order ID, payment details, invoice records, and shipping information are generated and exchanged through APIs and follow a fixed schema. They map cleanly into rows and columns (relational tables or spreadsheets) and are easy to query with SQL.
Semi-structured data
User navigation events, clickstreams, filter choices, and product search queries are often captured as JSON events or logs. They contain keys and nested attributes but don’t require a fixed tabular schema — so they’re semi-structured.
Unstructured data
Product images, video, customer support chat transcripts, and free-text reviews come from humans and lack a consistent schema. These are unstructured data types (images, audio, video, and free-form text) that typically require metadata, indexing, or ML to extract structure.
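
As a rough illustration of the three shapes described above, here is a minimal Python sketch. The table columns, event fields, and review text are made up for this example and do not come from any real schema; SQLite stands in for a relational store only because it ships with Python.

```python
import json
import sqlite3

# Structured: an order fits a fixed schema, so it maps to a table row.
# (Hypothetical columns, chosen for illustration only.)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL, ship_city TEXT)"
)
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", ("ORD-1001", 59.90, "Austin"))
order_row = conn.execute(
    "SELECT order_id, amount FROM orders WHERE ship_city = ?", ("Austin",)
).fetchone()

# Semi-structured: a click event has keys and nesting, but the fields
# can vary from one event to the next.
raw_event = '{"event": "search", "query": "running shoes", "filters": {"size": 10}}'
event = json.loads(raw_event)
shoe_size = event.get("filters", {}).get("size")  # optional nested attribute

# Unstructured: a free-text review has no keys at all; any structure
# (sentiment, topics) has to be extracted afterwards.
review = "Great shoes, but the box arrived damaged."
mentions_damage = "damaged" in review.lower()

print(order_row, shoe_size, mentions_damage)
```

The keyword check on the review is only a placeholder for the metadata, indexing, or ML step mentioned above; it is not meant as a realistic text-analysis method.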

To put this in context:

An infographic titled "An E-Commerce Company's Data Flow in the Cloud" showing shoppers around a large computer screen and labeled data types (Semi-Structured, Structured, Unstructured) with steps to improve user experience, personalize offers, and optimize operations.

Summary: key differences and examples

Structured data
Fits into tables (rows and columns) — examples: CRM records, financial transactions, inventory tables. Best for relational databases and data warehouses.
Semi-structured data
Contains tags/keys and optional nested fields — examples: JSON, XML, structured logs, event data. Provides schema flexibility while preserving queryable attributes.
Unstructured data
No consistent schema — examples: images, video, audio, social media posts, and free-form text. Requires metadata, search indexes, or AI/ML to extract structure and meaning.

Comparison at a glance:

Data Type	Typical Formats	Example Sources	Best Storage / Access Pattern
Structured	CSV, Parquet, relational tables	Order records, invoices, account tables	`Cloud SQL`, `BigQuery`, or relational warehouses
Semi-Structured	JSON, XML, structured logs	Clickstreams, API events, telemetry	Object storage + query tools (e.g., `GCS` + `BigQuery`/Dataflow)
Unstructured	Images, audio, video, free-text	Product photos, support calls, reviews	Object storage (e.g., `GCS`) + ML/indexing services

An infographic titled "Cloud Storage – Structured, Unstructured, and Semi-Structured Data" that categorizes data types into Structured, Semi-Structured, and Unstructured with icons and example sources (e.g., relational databases and spreadsheets; XML/JSON and HTML/email; images, audio/video, text and social media).

Key concept: cloud object storage

In cloud environments, files of all three types are frequently stored as objects. Exported tables, JSON logs, or uploaded media are each treated as objects by cloud object stores.

What is an object?
An object is a single stored unit (a file) that can contain structured, semi-structured, or unstructured content. Examples include:
- a database export file (structured)
- a JSON or XML log file (semi-structured)
- an image, video, or review text (unstructured)
Metadata matters
Each object is stored with metadata (file name, content type, creation timestamp, custom tags). Metadata enables discovery, filtering, and faster access across millions of objects.
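
To make the role of metadata concrete, the sketch below models an object and its metadata in memory and filters on metadata alone. `StoredObject` and `find_by_content_type` are hypothetical names invented for this illustration; they are not part of any Cloud Storage client library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StoredObject:
    """Hypothetical in-memory stand-in for an object plus its metadata."""
    name: str
    content_type: str
    created: datetime
    data: bytes = b""
    tags: dict[str, str] = field(default_factory=dict)


def find_by_content_type(objects, prefix):
    """Return names of objects whose content type starts with `prefix`."""
    return [o.name for o in objects if o.content_type.startswith(prefix)]


now = datetime.now(timezone.utc)
objects = [
    StoredObject("exports/orders.csv", "text/csv", now, tags={"kind": "structured"}),
    StoredObject("logs/clicks.json", "application/json", now, tags={"kind": "semi-structured"}),
    StoredObject("media/shoe-01.jpg", "image/jpeg", now, tags={"kind": "unstructured"}),
]

# Filtering reads only metadata; the object payloads are never opened.
images = find_by_content_type(objects, "image/")
structured = [o.name for o in objects if o.tags.get("kind") == "structured"]
print(images, structured)
```

The point of the design is that discovery works the same way for a CSV export, a JSON log, or a photo, because the filter never looks inside the file.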

A presentation slide titled "What Is an Object?" that defines an object as "a single unit of stored data" and shows examples (database table export, JSON/XML log file, image/video/review text). A bottom bar notes "Together with its metadata" and the slide is © KodeKloud.

In Google Cloud, Cloud Storage (GCS) is the recommended object store for these files. A GCS bucket is schema-agnostic and acts as a container for any object type.

Remember: object storage is schema-agnostic. Use metadata, consistent naming conventions, and catalogs (for example Data Catalog) or query tools (for example BigQuery) to organize and access structured files such as CSV or Parquet stored in GCS.
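
One possible naming convention is sketched below: date-partitioned object names built from `key=value` path segments. The dataset name, the `dt` key, and the `part-NNNNN` suffix are assumptions chosen for this example; the lesson does not prescribe a specific layout.

```python
from datetime import date


def object_name(dataset: str, day: date, part: int, ext: str = "parquet") -> str:
    """Build a date-partitioned object name (hypothetical convention)."""
    return f"{dataset}/dt={day.isoformat()}/part-{part:05d}.{ext}"


def partition_prefix(dataset: str, day: date) -> str:
    """Prefix shared by every object for one dataset and one day."""
    return f"{dataset}/dt={day.isoformat()}/"


name = object_name("orders", date(2024, 1, 15), 3)
prefix = partition_prefix("orders", date(2024, 1, 15))
print(name)    # orders/dt=2024-01-15/part-00003.parquet
print(prefix)  # orders/dt=2024-01-15/
```

Because every object for a given day shares one prefix, a listing or query scoped to that prefix can skip the rest of the bucket.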

Practical tips

Store raw exports and media in GCS and maintain a catalog of metadata for discoverability.
Use partitioned Parquet or Avro for large structured datasets to improve query performance and reduce cost.
For semi-structured logs, keep raw JSON in object storage and use pipelines (Dataflow, Dataproc) to normalize or stream them into analytical stores.
For unstructured media, store originals in GCS and index metadata or thumbnails for search and preview; apply ML APIs for tagging and text extraction.
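
The normalization step for semi-structured logs can be sketched in plain Python. In practice this logic would typically run inside a pipeline such as Dataflow or Dataproc; the event fields (`user`, `event`, `props`) and the target columns are hypothetical and exist only for this illustration.

```python
import json

# Raw newline-delimited JSON, as it might sit in object storage.
RAW_LOG = """\
{"user": "u1", "event": "search", "props": {"query": "shoes"}}
{"user": "u2", "event": "filter", "props": {"size": 10, "color": "red"}}
{"user": "u1", "event": "view"}
"""

# Fixed column set expected by the analytical store (assumed for this sketch).
COLUMNS = ["user", "event", "query", "size", "color"]


def normalize(line: str) -> dict:
    """Flatten one JSON event into the fixed columns, using None when missing."""
    record = json.loads(line)
    props = record.get("props", {})
    flat = {"user": record.get("user"), "event": record.get("event")}
    for key in COLUMNS[2:]:
        flat[key] = props.get(key)
    return flat


rows = [normalize(line) for line in RAW_LOG.splitlines() if line.strip()]
for row in rows:
    print(row)
```

Keeping the raw JSON untouched and deriving the flat rows separately means the column set can change later without losing the original events.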
