DP-600 - Lakehouse and Data Warehouse
Fabric Lakehouse Overview
A Fabric Lakehouse is an item that combines the open storage of a data lake with the SQL analytics capabilities of a data warehouse. It stores data in OneLake (ADLS Gen2) in open formats (Delta/Parquet) and provides:
- Files section: Raw file storage (CSV, JSON, Parquet, any format) - accessible via Spark
- Tables section: Managed Delta tables - auto-registered for SQL, Spark, and Direct Lake
- SQL Analytics Endpoint: Auto-generated read-only T-SQL interface for querying Tables
- Default semantic model: Auto-generated Power BI semantic model on the Lakehouse tables
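The Files/Tables split above maps directly onto how you read data from a notebook. A minimal sketch, assuming an attached Lakehouse with an illustrative raw file `Files/raw/orders.csv` and a Delta table named `sales` (both names are assumptions, not Fabric defaults):

```python
# Fabric notebooks provide a ready-made `spark` session.

# Files section: raw files of any format, read with an explicit Spark reader
df_raw = spark.read.option("header", True).csv("Files/raw/orders.csv")

# Tables section: managed Delta tables, already registered for SQL access
df_sales = spark.read.table("sales")
df_sales.show(5)
```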
Delta Lake and ACID Transactions
Delta Lake is the open-source storage format that underlies Fabric Lakehouse tables. Key properties:
| ACID Property | Description |
|---|---|
| Atomicity | Each transaction is all-or-nothing - partial writes are not visible |
| Consistency | Data is always in a valid state before and after each transaction |
| Isolation | Concurrent transactions do not interfere with each other |
| Durability | Committed transactions survive system failures |
How Delta Lake works:
- Data is stored as Parquet files in the table directory
- A _delta_log directory contains JSON transaction log files recording every change
- Time travel: Query any past version using VERSION AS OF n or TIMESTAMP AS OF 'date'
- Schema enforcement: Delta rejects writes that don't match the table schema
- Schema evolution: New columns can be added with the mergeSchema option
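Time travel and schema evolution can be exercised from a notebook. A hedged sketch, assuming a `sales` table and a DataFrame `df_with_new_col` that carries an extra column (both are illustrative, not part of the source):

```python
# Time travel: read an earlier version by version number or timestamp
v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/sales")
old = spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-01'")

# Schema evolution: mergeSchema lets an append add new columns
# (df_with_new_col is an assumed DataFrame with one extra column)
df_with_new_col.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("sales")
```

Without `mergeSchema`, the same append would fail schema enforcement.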
SQL Analytics Endpoint
Every Fabric Lakehouse automatically generates a SQL Analytics Endpoint that exposes the Tables section as a read-only T-SQL database. Features:
- Query Delta tables using T-SQL SELECT statements
- Create views, functions, and stored procedures on top of the tables
- Connect Power BI in DirectQuery mode (or Direct Lake mode for semantic models)
- Connect from SQL Server Management Studio (SSMS) or Azure Data Studio using the SQL endpoint connection string
- Read-only: DML (INSERT/UPDATE/DELETE) is not supported via the SQL endpoint - use the Fabric Data Warehouse for DML
Spark Notebooks
Fabric Notebooks run Apache Spark and support PySpark (Python), Spark SQL, Scala, and R. Key notebook operations on a Lakehouse:
# Read a Delta table as Spark DataFrame
df = spark.read.format("delta").load("Tables/sales")
# or using spark.sql with registered table
df = spark.sql("SELECT * FROM lakehouse1.sales")
# Write DataFrame as Delta table
df.write.format("delta").mode("overwrite").saveAsTable("sales_cleaned")
# Append to existing Delta table
df.write.format("delta").mode("append").saveAsTable("sales_cleaned")
# Upsert using Delta merge
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "sales")
target.alias("t").merge(
    df.alias("s"),
    "t.OrderID = s.OrderID"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
Notebooks can also use the Lakehouse Explorer panel to drag table names into cells. Attaching a Lakehouse to a notebook makes it the default Lakehouse - tables are accessible without schema prefix.
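The default-Lakehouse behavior means the same query can be written with or without a prefix. A small sketch (the names `sales` and `lakehouse1` are illustrative):

```python
# With a default Lakehouse attached, tables resolve without a prefix
df = spark.sql("SELECT COUNT(*) AS n FROM sales")

# Without one (or to reach another Lakehouse), qualify the table name
df = spark.sql("SELECT COUNT(*) AS n FROM lakehouse1.sales")
```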
OPTIMIZE and VACUUM
| Command | Purpose | Syntax |
|---|---|---|
| OPTIMIZE | Compact many small Parquet files into fewer large files (default 1 GB target). Improves read performance. | OPTIMIZE tableName [ZORDER BY (col1, col2)] |
| VACUUM | Delete old Parquet files no longer referenced by the transaction log. Frees storage. Breaks time travel for deleted versions. | VACUUM tableName [RETAIN n HOURS] |
# In a Spark notebook
spark.sql("OPTIMIZE lakehouse1.sales ZORDER BY (CustomerID, OrderDate)")
spark.sql("VACUUM lakehouse1.sales RETAIN 168 HOURS")  # 7 days
Run OPTIMIZE regularly after bulk loads or streaming writes to maintain query performance. Run VACUUM after OPTIMIZE to clean up files older than the retention window.
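The same maintenance can also be driven through the Delta Lake Python API rather than SQL strings. A sketch, assuming the illustrative table `lakehouse1.sales`:

```python
from delta.tables import DeltaTable

t = DeltaTable.forName(spark, "lakehouse1.sales")

# Compact small files and co-locate rows by the Z-order columns
t.optimize().executeZOrderBy("CustomerID", "OrderDate")

# Remove unreferenced files older than 168 hours (7 days)
t.vacuum(168)
```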
Fabric Data Warehouse
The Fabric Data Warehouse is a fully managed, T-SQL warehouse that supports the complete SQL DML surface (INSERT, UPDATE, DELETE, MERGE). Unlike the Lakehouse SQL endpoint, the Warehouse is a first-class write destination.
- Storage backed by OneLake in Delta/Parquet format
- Full ACID transactions on DML operations
- Supports views, stored procedures, functions
- Cross-database queries across Lakehouses and Warehouses in the same workspace
- Visual Query Builder - drag-and-drop SQL query generation
- Column-level security and Dynamic Data Masking
Lakehouse vs Data Warehouse
| Feature | Lakehouse | Data Warehouse |
|---|---|---|
| Primary interface | Spark Notebooks + SQL endpoint | T-SQL (full read/write) |
| DML (INSERT/UPDATE/DELETE) | Via Spark only | Yes - T-SQL DML |
| SQL endpoint | Read-only auto-generated | Full read/write |
| Raw file storage | Yes (Files section) | No |
| Spark notebooks | Native | Not natively (use pipeline/dataflow) |
| Best for | Data engineers, mixed Spark + SQL, raw + curated | SQL analysts, BI star schema, full DML needed |
| Direct Lake | Yes | Yes (via SQL endpoint) |
Medallion Architecture
The medallion (Bronze/Silver/Gold) architecture is a recommended data organization pattern in Fabric:
| Layer | Data Quality | Operations | Typical Storage |
|---|---|---|---|
| Bronze (Raw) | Unprocessed - as ingested | Ingest only - Copy Data, Dataflow, Notebook | Lakehouse Files or Tables (raw) |
| Silver (Cleaned) | Validated, deduplicated, typed | Spark Notebook or Dataflow Gen2 transforms | Lakehouse Delta Tables |
| Gold (Curated) | Business-ready aggregates or star schema | SQL transforms, Warehouse DML | Lakehouse Tables or Warehouse |
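The three layers above can be sketched end to end in a single notebook. Everything here is illustrative (file path, table names, and the columns OrderID/OrderDate/Amount are assumptions), not a prescribed pipeline:

```python
# Bronze: ingest a raw file as-is from the Files section
bronze = spark.read.option("header", True).csv("Files/bronze/orders.csv")

# Silver: deduplicate, type, and validate, then persist as a Delta table
silver = (bronze
          .dropDuplicates(["OrderID"])
          .withColumn("OrderDate", bronze["OrderDate"].cast("date"))
          .filter("OrderID IS NOT NULL"))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: business-ready aggregate for reporting
gold = spark.sql("""
    SELECT OrderDate, SUM(Amount) AS total_sales
    FROM silver_orders
    GROUP BY OrderDate
""")
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_sales")
```

In practice the Gold step often lands in a Warehouse via T-SQL instead, per the table above.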