DP-900: Microsoft Azure Data Fundamentals

Analyzing Data

Storing Data

In this lesson from the Azure Data Fundamentals (DP-900) course, we explore the Storing phase of the data pipeline. After extracting and transforming data with Azure Data Factory, the next step is to load your cleansed data into the optimal storage solution for batch processing, reporting, and analytical workloads.

Data Stores for Batch Processing

Whether you’re aggregating sales metrics or preparing large datasets for machine learning, choosing the right batch storage is crucial.

The image is a flowchart illustrating the process of storing data, starting from various data sources, transforming the data, and then storing it, before analyzing and reporting on it.

Key Batch Storage Options

Data StoreTypeKey FeaturesCommon Use Case
Azure Synapse & ADLSCloud analytics lakePolyBase for direct querying, supports Parquet/CSV/JSONLarge-scale data lake for SQL analytics
Apache HBaseNoSQL wide-columnHorizontally scalable, HDFS-native, low I/O bottlenecksReal-time reads/writes on big tables
Apache HiveHadoop data warehouseORC columnar storage, fast SUM/AVG/COUNT, fault-tolerantAd-hoc aggregations on Hadoop clusters

Azure Synapse Analytics and Azure Data Lake Storage

Azure Synapse Analytics integrates seamlessly with Azure Data Lake Storage (ADLS). You can store data in multiple formats—CSV, Parquet, JSON, and more—and use PolyBase to run T-SQL queries directly against your files.

Note

No ETL is required before querying. PolyBase pushes computation down to ADLS, boosting performance and reducing costs.

Apache HBase

Apache HBase is a Java-based, NoSQL wide-column store built on Hadoop Distributed File System (HDFS). It offers:

  • Horizontal scalability: Distributes data across many nodes for parallel reads and writes.
  • HDFS ecosystem: Leverages existing Hadoop clusters, minimizing I/O bottlenecks.
  • Easy migration: On-premises HBase schemas lift-and-shift to Azure effortlessly.

Apache Hive

Apache Hive transforms Hadoop into a fault-tolerant data warehouse. Hive uses the Optimized Row Columnar (ORC) format to:

  • Accelerate aggregations like SUM, AVG, and COUNT.
  • Read only relevant columns, reducing I/O overhead.
  • Maintain schema-on-read for flexible data ingestion.

The image is a comparison of data storage technologies: Azure Synapse and Data Lake, Apache HBase, and Apache Hive, highlighting their features and purposes.

Data Warehouse Schemas for Analytics

When designing for analytics, your table schemas often deviate from OLTP patterns. Two popular modeling techniques are the Star Schema and the Cube Schema.

Schema TypeStructureBest For
Star SchemaCentral fact table + surrounding dimensionsMulti-dimensional filters, ad-hoc queries
Cube SchemaMultiple “slices” of the same fact by dimensionHigh-speed aggregations across axes

Star Schema

A star schema centers around a fact table containing numeric measures (e.g., sales amounts), linked to multiple dimension tables (e.g., Customer, Product, Date).

  • Fact table: Holds transactional metrics over time.
  • Dimension tables: Store descriptive attributes for filtering and grouping.

This model simplifies SQL joins and enhances query performance, especially in Azure Synapse dedicated SQL pools.

The image depicts a star schema for data analysis, showing a central fact table "Quantity Sold" connected to four dimension tables: Countries, Customers, Date Sold, and Products. It also explains the difference between fact and dimension tables.

Cube Schema

A cube schema extends beyond two dimensions by stacking multiple slices of data—each slice represents the fact table at a specific dimension value (e.g., year).

  • Fast aggregations along time, geography, and product axes.
  • Pre-aggregated views reduce query runtime.
  • Scales to 4+ dimensions in many OLAP systems.

Cubes optimize dashboards and BI tools that require instant drill-downs and rollups.

The image illustrates an analytical schema using a cube to represent multi-dimensional data, with axes labeled for columns (quantity sold), entities (products), and time (2010 to next year). It explains that cubes speed up data aggregation compared to two-dimensional tables.

Some OLAP solutions support even more dimensions—each tailored for specialized analytical queries.

Watch Video

Watch video content

Previous
Demo Data Factory