DP-600 - Lakehouse and Data Warehouse
Fabric Lakehouse Overview
A Fabric Lakehouse is an item that combines the open storage of a data lake with the SQL analytics capabilities of a data warehouse. It stores data in OneLake (ADLS Gen2) in open formats (Delta/Parquet) and provides:
- Files section: Raw file storage (CSV, JSON, Parquet, any format) - accessible via Spark
- Tables section: Managed Delta tables - auto-registered for SQL, Spark, and Direct Lake
- SQL Analytics Endpoint: Auto-generated read-only T-SQL interface for querying Tables
- Default semantic model: Auto-generated Power BI semantic model on the Lakehouse tables
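The Files/Tables split above maps directly onto how you read data from a notebook. A minimal sketch, assuming an attached Lakehouse with an illustrative raw file `Files/raw/orders.csv` and a Delta table named `sales` (both names are assumptions, not Fabric defaults):

```python
# Fabric notebooks provide a ready-made `spark` session.

# Files section: raw files of any format, read with an explicit Spark reader
df_raw = spark.read.option("header", True).csv("Files/raw/orders.csv")

# Tables section: managed Delta tables, already registered for SQL access
df_sales = spark.read.table("sales")
df_sales.show(5)
```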
Delta Lake and ACID Transactions
Delta Lake is the open-source storage format that underlies Fabric Lakehouse tables. Key properties:
| ACID Property | Description |
|---|---|
| Atomicity | Each transaction is all-or-nothing - partial writes are not visible |
| Consistency | Data is always in a valid state before and after each transaction |
| Isolation | Concurrent transactions do not interfere with each other |
| Durability | Committed transactions survive system failures |
How Delta Lake works:
- Data is stored as Parquet files in the table directory
- A _delta_log directory contains JSON transaction log files recording every change
- Time travel: Query any past version using VERSION AS OF n or TIMESTAMP AS OF 'date'
- Schema enforcement: Delta rejects writes that don't match the table schema
- Schema evolution: New columns can be added with the mergeSchema option
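Time travel and schema evolution can be exercised from a notebook. A hedged sketch, assuming a `sales` table and a DataFrame `df_with_new_col` that carries an extra column (both are illustrative, not part of the source):

```python
# Time travel: read an earlier version by version number or timestamp
v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/sales")
old = spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-01'")

# Schema evolution: mergeSchema lets an append add new columns
# (df_with_new_col is an assumed DataFrame with one extra column)
df_with_new_col.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("sales")
```

Without `mergeSchema`, the same append would fail schema enforcement.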
SQL Analytics Endpoint
Every Fabric Lakehouse automatically generates a SQL Analytics Endpoint that exposes the Tables section as a read-only T-SQL database. Features:
- Query Delta tables using T-SQL SELECT statements
- Create views, functions, and stored procedures on top of the tables
- Connect Power BI in DirectQuery mode (or Direct Lake mode for semantic models)
- Connect from SQL Server Management Studio (SSMS) or Azure Data Studio using the SQL endpoint connection string
- Read-only: DML (INSERT/UPDATE/DELETE) is not supported via the SQL endpoint - use the Fabric Data Warehouse for DML
Spark Notebooks
Fabric Notebooks run Apache Spark and support PySpark (Python), Spark SQL, Scala, and R. Key notebook operations on a Lakehouse:
# Read a Delta table as Spark DataFrame
df = spark.read.format("delta").load("Tables/sales")
# or using spark.sql with registered table
df = spark.sql("SELECT * FROM lakehouse1.sales")
# Write DataFrame as Delta table
df.write.format("delta").mode("overwrite").saveAsTable("sales_cleaned")
# Append to existing Delta table
df.write.format("delta").mode("append").saveAsTable("sales_cleaned")
# Upsert using Delta merge
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "sales")
target.alias("t").merge(
    df.alias("s"),
    "t.OrderID = s.OrderID"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
Notebooks can also use the Lakehouse Explorer panel to drag table names into cells. Attaching a Lakehouse to a notebook makes it the default Lakehouse - tables are accessible without schema prefix.
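The default-Lakehouse behavior means the same query can be written with or without a prefix. A small sketch (the names `sales` and `lakehouse1` are illustrative):

```python
# With a default Lakehouse attached, tables resolve without a prefix
df = spark.sql("SELECT COUNT(*) AS n FROM sales")

# Without one (or to reach another Lakehouse), qualify the table name
df = spark.sql("SELECT COUNT(*) AS n FROM lakehouse1.sales")
```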
OPTIMIZE and VACUUM
| Command | Purpose | Syntax |
|---|---|---|
| OPTIMIZE | Compact many small Parquet files into fewer large files (default 1 GB target). Improves read performance. | OPTIMIZE tableName [ZORDER BY (col1, col2)] |
| VACUUM | Delete old Parquet files no longer referenced by the transaction log. Frees storage. Breaks time travel for deleted versions. | VACUUM tableName [RETAIN n HOURS] |
# In a Spark notebook
spark.sql("OPTIMIZE lakehouse1.sales ZORDER BY (CustomerID, OrderDate)")
spark.sql("VACUUM lakehouse1.sales RETAIN 168 HOURS")  # 7 days
Run OPTIMIZE regularly after bulk loads or streaming writes to maintain query performance. Run VACUUM after OPTIMIZE to clean up files older than the retention window.
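The same maintenance can also be driven through the Delta Lake Python API rather than SQL strings. A sketch, assuming the illustrative table `lakehouse1.sales`:

```python
from delta.tables import DeltaTable

t = DeltaTable.forName(spark, "lakehouse1.sales")

# Compact small files and co-locate rows by the Z-order columns
t.optimize().executeZOrderBy("CustomerID", "OrderDate")

# Remove unreferenced files older than 168 hours (7 days)
t.vacuum(168)
```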
Fabric Data Warehouse
The Fabric Data Warehouse is a fully managed, T-SQL warehouse that supports the complete SQL DML surface (INSERT, UPDATE, DELETE, MERGE). Unlike the Lakehouse SQL endpoint, the Warehouse is a first-class write destination.
- Storage backed by OneLake in Delta/Parquet format
- Full ACID transactions on DML operations
- Supports views, stored procedures, functions
- Cross-database queries across Lakehouses and Warehouses in the same workspace
- Visual Query Builder - drag-and-drop SQL query generation
- Column-level security and Dynamic Data Masking
Lakehouse vs Data Warehouse
| Feature | Lakehouse | Data Warehouse |
|---|---|---|
| Primary interface | Spark Notebooks + SQL endpoint | T-SQL (full read/write) |
| DML (INSERT/UPDATE/DELETE) | Via Spark only | Yes - T-SQL DML |
| SQL endpoint | Read-only auto-generated | Full read/write |
| Raw file storage | Yes (Files section) | No |
| Spark notebooks | Native | Not natively (use pipeline/dataflow) |
| Best for | Data engineers, mixed Spark + SQL, raw + curated | SQL analysts, BI star schema, full DML needed |
| Direct Lake | Yes | Yes (via SQL endpoint) |
Medallion Architecture
The medallion (Bronze/Silver/Gold) architecture is a recommended data organization pattern in Fabric:
| Layer | Data Quality | Operations | Typical Storage |
|---|---|---|---|
| Bronze (Raw) | Unprocessed - as ingested | Ingest only - Copy Data, Dataflow, Notebook | Lakehouse Files or Tables (raw) |
| Silver (Cleaned) | Validated, deduplicated, typed | Spark Notebook or Dataflow Gen2 transforms | Lakehouse Delta Tables |
| Gold (Curated) | Business-ready aggregates or star schema | SQL transforms, Warehouse DML | Lakehouse Tables or Warehouse |
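The three layers above can be sketched end to end in a single notebook. Everything here is illustrative (file path, table names, and the columns OrderID/OrderDate/Amount are assumptions), not a prescribed pipeline:

```python
# Bronze: ingest a raw file as-is from the Files section
bronze = spark.read.option("header", True).csv("Files/bronze/orders.csv")

# Silver: deduplicate, type, and validate, then persist as a Delta table
silver = (bronze
          .dropDuplicates(["OrderID"])
          .withColumn("OrderDate", bronze["OrderDate"].cast("date"))
          .filter("OrderID IS NOT NULL"))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: business-ready aggregate for reporting
gold = spark.sql("""
    SELECT OrderDate, SUM(Amount) AS total_sales
    FROM silver_orders
    GROUP BY OrderDate
""")
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_sales")
```

In practice the Gold step often lands in a Warehouse via T-SQL instead, per the table above.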