KodeKloud Notes

Welcome to the Azure Data Fundamentals (DP-900) course. In this lesson, we’ll cover end-to-end integrated solutions for ETL/ELT, data warehousing, and analytics. Instead of piecing together best-of-breed components, you can leverage platforms that manage every step—from ingestion to real-time insights—within a unified service.

Azure Synapse Analytics

Azure Synapse Analytics is Microsoft’s flagship integrated analytics platform. It combines the power of Azure Data Factory, a data warehouse, Apache Spark, and Azure Data Explorer into a seamless experience. By default, Synapse runs continuously, enabling real-time data processing and analytics.

The image is a flowchart illustrating the integration process of Azure Synapse Analytics, showing data flow from various DBMS and files through a data factory to Azure Synapse, and then to Apache Spark and Azure Data Explorer. It includes a tip about reducing costs by pausing the service.

Warning

Continuous mode in Synapse ensures up-to-the-minute insights but may lead to high compute costs. Schedule or pause pipelines to run only when you need batch reports (daily, weekly, or monthly).

Key Components

Azure Data Factory: Orchestrates ETL/ELT workflows
Synapse SQL Pool: Dedicated or serverless warehousing
Apache Spark: In-memory big data processing
Azure Data Explorer: Interactive data exploration

Learn more: Azure Synapse Analytics Documentation

Data Lake Architecture and PolyBase

Under the hood, Synapse’s storage relies on a Data Lake built on Azure Storage. Unlike traditional data warehouses, you can ingest raw files—CSV, JSON, XML, Parquet—without upfront transformation.

The image illustrates a data lake architecture using PolyBase to integrate various file formats like CSV, JSON, XML, and Parquet, with a focus on extract, load, and transform processes.

When you run a SQL query against your lake, Synapse uses PolyBase to:

Extract raw data into the lake
Load it dynamically during query execution
Transform it on-the-fly

Note

Parquet is a column-oriented storage format optimized for analytics. It delivers high compression and performance, similar to columnar databases.

Azure Storage and Hierarchical Namespace

Your data lake files are stored as blobs in an Azure Storage account. To enable folder-like organization, activate hierarchical namespace (Data Lake Storage Gen2). This provides:

Filesystem semantics (folders/subfolders)
Fine-grained ACLs on directories and files
Improved performance for large-scale analytics

The image illustrates the relationship between a Data Lake and Azure Storage blobs, highlighting that it is built on top of Azure Storage with features like hierarchical storage and access control.

For more information, see Azure Data Lake Storage Gen2.

Delta Lake: ACID and Unified Workloads

A raw data lake is flexible but lacks transactions, indexing, and ACID guarantees. Delta Lake extends cloud object storage with a transactional layer, bringing warehouse capabilities—schema enforcement, time travel, and unified batch/streaming workloads.

The image illustrates the concept of Delta Lake, showing it as a structure that adds warehousing functionality to a data lake.

Delta Lake Benefits:

ACID transactions on Parquet data
Schema evolution and enforcement
Switch from batch to streaming without code changes

Read the open-source project: Delta Lake

Apache Databricks

Beyond Microsoft’s stack, Apache Databricks is the most popular managed platform for Spark and Delta Lake. It offers a unified analytics environment for ETL, data warehousing, machine learning, and BI.

The image describes Apache Databricks as a unified toolset built on Delta Lake and Apache Spark, supporting tasks like extracting, transforming, loading, warehousing, and analyzing data. It facilitates deploying and sharing analytics solutions.

Why Choose Databricks?

Fully managed Spark clusters
Built-in support for Delta Lake transactions
Collaborative notebooks and job scheduling
Scales on demand, with pay-as-you-go pricing

Explore: Databricks Documentation

Integrated Platforms Comparison

Platform	Core Components	Ideal Use Case	Documentation
Azure Synapse Analytics	Data Factory, SQL Pools, Spark, Data Explorer	End-to-end analytics with flexible compute options	Synapse Docs
Apache Databricks	Apache Spark, Delta Lake, MLflow	Unified analytics & AI workspaces on managed Spark	Databricks Docs

When to Use Each Platform

Azure Synapse Analytics: Best for organizations needing integrated SQL and Spark with hybrid provisioning (serverless + dedicated).
Apache Databricks: Ideal for data science, machine learning, and collaborative analytics teams using Spark and Delta Lake.

Watch Video

Watch video content