DP-900: Microsoft Azure Data Fundamentals

Analyzing Data

Integrated Solutions

Welcome to the Azure Data Fundamentals (DP-900) course. In this lesson, we’ll cover end-to-end integrated solutions for ETL/ELT, data warehousing, and analytics. Instead of piecing together best-of-breed components, you can leverage platforms that manage every step—from ingestion to real-time insights—within a unified service.

Azure Synapse Analytics

Azure Synapse Analytics is Microsoft’s flagship integrated analytics platform. It brings together pipeline orchestration based on Azure Data Factory, a SQL data warehouse, Apache Spark, and Azure Data Explorer in a single workspace. Provisioned compute, such as a dedicated SQL pool, keeps running (and billing) until you pause it, which supports near-real-time data processing and analytics.

The image is a flowchart illustrating the integration process of Azure Synapse Analytics, showing data flow from various DBMS and files through a data factory to Azure Synapse, and then to Apache Spark and Azure Data Explorer. It includes a tip about reducing costs by pausing the service.

Warning

Keeping Synapse compute running continuously gives up-to-the-minute insights but can lead to high compute costs. If you only need batch reports (daily, weekly, or monthly), schedule pipelines to run at those times and pause provisioned compute in between.
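
To get a feel for the difference, here is a back-of-the-envelope sketch in Python. The hourly rate and the two-hour nightly window are made-up numbers chosen for illustration, not Azure prices, and the function name is hypothetical.

```python
def monthly_compute_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Cost of compute that is billed only while it is running."""
    return hourly_rate * hours_per_day * days


# Hypothetical rate for illustration only; not an actual Azure price.
HOURLY_RATE = 1.50

always_on = monthly_compute_cost(HOURLY_RATE, hours_per_day=24)
nightly_batch = monthly_compute_cost(HOURLY_RATE, hours_per_day=2)

print(f"Always on:     {always_on:.2f}")
print(f"2 h per night: {nightly_batch:.2f}")
print(f"Savings:       {1 - nightly_batch / always_on:.0%}")
```

The point is only the ratio: compute that runs 2 hours a day instead of 24 costs roughly a twelfth as much, whatever the real rate is.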

Key Components

  • Azure Data Factory: Orchestrates ETL/ELT workflows
  • Synapse SQL Pool: Dedicated or serverless warehousing
  • Apache Spark: In-memory big data processing
  • Azure Data Explorer: Interactive data exploration

Learn more: Azure Synapse Analytics Documentation


Data Lake Architecture and PolyBase

Under the hood, Synapse’s storage relies on a Data Lake built on Azure Storage. Unlike a traditional data warehouse, a data lake accepts raw files (CSV, JSON, XML, Parquet) without upfront transformation.

The image illustrates a data lake architecture using PolyBase to integrate various file formats like CSV, JSON, XML, and Parquet, with a focus on extract, load, and transform processes.

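When you run a SQL query against your lake, Synapse uses PolyBase to read the files in place. The result is an extract, load, transform (ELT) flow:

  1. Extract: raw data is copied from its source into the lake as-is
  2. Load: PolyBase reads the files dynamically during query execution
  3. Transform: the query shapes the data on the fly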
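
The load-and-transform-at-query-time idea can be sketched in plain Python. This is an analogy only, not PolyBase or Synapse SQL: a string stands in for a CSV file in the lake, and the data and the function name `query_total_by_region` are invented for the example.

```python
import csv
import io

# "Extract": the raw file lands in the lake unchanged.
# (A string stands in for a CSV file in storage; the data is made up.)
RAW_SALES_CSV = """order_id,region,amount
1,west,120.50
2,east,80.00
3,west,42.25
"""


def query_total_by_region(raw_csv: str) -> dict[str, float]:
    """Parse and aggregate the raw text only when the query runs."""
    totals: dict[str, float] = {}
    # "Load": read the raw rows at query time.
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # "Transform" on the fly: cast text to a number, then aggregate.
        amount = float(row["amount"])
        totals[row["region"]] = totals.get(row["region"], 0.0) + amount
    return totals


print(query_total_by_region(RAW_SALES_CSV))
```

Nothing is cleaned or typed when the file is stored; all of that work happens inside the query, which is what separates ELT from a classic ETL pipeline.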

Note

Parquet is a column-oriented storage format optimized for analytics. It delivers high compression and performance, similar to columnar databases.
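
A small Python sketch can show why a column-oriented layout suits analytics. It does not read or write real Parquet files; the table is made up, and run-length encoding is used here simply as one example of an encoding that benefits from similar values being stored together.

```python
# The same made-up table, first in a row-oriented layout.
rows = [
    {"order_id": 1, "region": "west", "amount": 120.5},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 42.25},
]

# Column-oriented: one list per column, as in a columnar file.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# An aggregate reads a single column instead of every field of every row.
total = sum(columns["amount"])

# Similar values sit next to each other, so simple encodings compress well.
print(total)
print(run_length_encode(columns["region"]))
```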


Azure Storage and Hierarchical Namespace

Your data lake files are stored as blobs in an Azure Storage account. To get real folders rather than name prefixes, enable the hierarchical namespace, the setting that turns the account into Azure Data Lake Storage Gen2. This provides:

  • Filesystem semantics (folders/subfolders)
  • Fine-grained ACLs on directories and files
  • Improved performance for large-scale analytics
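
One reason directory operations get cheaper is sketched below in Python. It is a toy model rather than the Azure Storage SDK: plain dictionaries stand in for the two namespace styles, and the file names and helper functions are hypothetical.

```python
# Flat namespace: "folders" are only prefixes inside blob names.
flat_store = {
    "raw/2024/jan.csv": b"",
    "raw/2024/feb.csv": b"",
    "curated/summary.parquet": b"",
}


def rename_prefix_flat(store: dict, old: str, new: str) -> int:
    """Rename every blob under a prefix; returns how many blobs were touched."""
    touched = 0
    for name in [n for n in store if n.startswith(old)]:
        store[new + name[len(old):]] = store.pop(name)
        touched += 1
    return touched


# Hierarchical namespace: directories are real nodes in a tree.
tree_store = {
    "raw": {"2024": {"jan.csv": b"", "feb.csv": b""}},
    "curated": {"summary.parquet": b""},
}


def rename_dir_tree(store: dict, old: str, new: str) -> int:
    """Rename a top-level directory by moving a single tree node."""
    store[new] = store.pop(old)
    return 1


flat_ops = rename_prefix_flat(flat_store, "raw/", "landing/")
tree_ops = rename_dir_tree(tree_store, "raw", "landing")

print(flat_ops)  # one operation per blob under the prefix
print(tree_ops)  # one operation in total
```

With a flat namespace the work grows with the number of files under the prefix; with a directory tree, the rename is one metadata change regardless of how many files sit below it.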

The image illustrates the relationship between a Data Lake and Azure Storage blobs, highlighting that it is built on top of Azure Storage with features like hierarchical storage and access control.

For more information, see Azure Data Lake Storage Gen2.


Delta Lake: ACID and Unified Workloads

A raw data lake is flexible but lacks indexing and ACID transaction guarantees. Delta Lake adds a transactional layer on top of cloud object storage, which brings warehouse capabilities to the lake: schema enforcement, time travel, and unified batch/streaming workloads.

The image illustrates the concept of Delta Lake, showing it as a structure that adds warehousing functionality to a data lake.

Delta Lake Benefits:

  • ACID transactions on Parquet data
  • Schema evolution and enforcement
  • The same tables serve both batch and streaming workloads
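
The transactional layer can be pictured as an ordered log of commits, each recording which data files a table version adds and removes. The Python class below is a toy model of that idea under simplifying assumptions (single writer, in-memory log); it is not the real Delta Lake format or API, and `ToyDeltaTable` and the file names are hypothetical.

```python
class ToyDeltaTable:
    """Toy model of a table defined by an ordered commit log (not the real Delta format)."""

    def __init__(self):
        self._log = []  # one entry per committed version

    def commit(self, add=(), remove=()):
        """Record, all or nothing, which data files a version adds and removes."""
        current = self.snapshot()
        missing = set(remove) - current
        if missing:
            # Reject the whole commit; the log is left untouched.
            raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
        self._log.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self._log) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Replay the log up to `version` (latest if None) to list live files."""
        end = len(self._log) if version is None else version + 1
        files = set()
        for entry in self._log[:end]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return files


table = ToyDeltaTable()
v0 = table.commit(add=["part-000.parquet"])
v1 = table.commit(add=["part-001.parquet"])
v2 = table.commit(add=["part-002.parquet"], remove=["part-000.parquet"])  # rewrite

print(sorted(table.snapshot()))            # latest version
print(sorted(table.snapshot(version=v1)))  # "time travel" to an earlier version
```

Because readers always replay a fixed prefix of the log, they see either all of a commit or none of it, and older versions stay readable as long as their files are kept.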

Read the open-source project: Delta Lake


Azure Databricks

Beyond Microsoft’s own analytics stack, Databricks is a widely used managed platform for Apache Spark and Delta Lake, offered on Azure as Azure Databricks. It provides a unified analytics environment for ETL, data warehousing, machine learning, and BI.

The image describes Databricks as a unified toolset built on Delta Lake and Apache Spark, supporting tasks like extracting, transforming, loading, warehousing, and analyzing data. It facilitates deploying and sharing analytics solutions.

Why Choose Databricks?

  • Fully managed Spark clusters
  • Built-in support for Delta Lake transactions
  • Collaborative notebooks and job scheduling
  • Scales on demand, with pay-as-you-go pricing

Explore: Databricks Documentation


Integrated Platforms Comparison

Platform | Core Components | Ideal Use Case | Documentation
Azure Synapse Analytics | Data Factory, SQL Pools, Spark, Data Explorer | End-to-end analytics with flexible compute options | Synapse Docs
Azure Databricks | Apache Spark, Delta Lake, MLflow | Unified analytics & AI workspaces on managed Spark | Databricks Docs

When to Use Each Platform

  • Azure Synapse Analytics: Best for organizations needing integrated SQL and Spark with hybrid provisioning (serverless + dedicated).
  • Azure Databricks: Ideal for data science, machine learning, and collaborative analytics teams using Spark and Delta Lake.
