

DP-600 Lakehouse and Data Warehouse | Microsoft Fabric Analytics Engineer | JavaInUse

DP-600 - Lakehouse and Data Warehouse

Fabric Lakehouse Overview

A Fabric Lakehouse is an item that combines the open storage of a data lake with the SQL analytics capabilities of a data warehouse. It stores data in OneLake (ADLS Gen2) in open formats (Delta/Parquet) and provides:

  • Files section: Raw file storage (CSV, JSON, Parquet, any format) - accessible via Spark
  • Tables section: Managed Delta tables - auto-registered for SQL, Spark, and Direct Lake
  • SQL Analytics Endpoint: Auto-generated read-only T-SQL interface for querying Tables
  • Default semantic model: Auto-generated Power BI semantic model on the Lakehouse tables

Delta Lake and ACID Transactions

Delta Lake is the open-source storage format that underlies Fabric Lakehouse tables. Key properties:

  • Atomicity: Each transaction is all-or-nothing - partial writes are never visible
  • Consistency: Data is always in a valid state before and after each transaction
  • Isolation: Concurrent transactions do not interfere with each other
  • Durability: Committed transactions survive system failures
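The atomicity guarantee can be illustrated with a small stand-alone Python sketch (a toy model, not Fabric or Delta code): stage the data in a temporary file, then commit with a single atomic rename. This is conceptually how Delta's log commits keep partial writes invisible.

```python
import json, os, tempfile

def atomic_write(path, rows):
    """All-or-nothing write (the 'A' in ACID): stage the data in a temp
    file, then atomically rename it over the target. Readers see either
    the old contents or the complete new contents, never a partial write."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        os.replace(tmp, path)  # the atomic "commit" step
    except BaseException:
        os.remove(tmp)  # a failed transaction leaves nothing behind
        raise

with tempfile.TemporaryDirectory() as d:
    table = os.path.join(d, "table.json")
    atomic_write(table, [{"OrderID": 1}])
    print(json.load(open(table)))  # [{'OrderID': 1}]
```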

How Delta Lake works:

  • Data is stored as Parquet files in the table directory
  • A _delta_log directory contains JSON transaction log files recording every change
  • Time travel: Query any past version using VERSION AS OF n or TIMESTAMP AS OF 'date'
  • Schema enforcement: Delta rejects writes that don't match the table schema
  • Schema evolution: New columns can be added with mergeSchema option
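A toy Python model of the transaction log (a simplification, not the real Delta protocol or file format) shows how replaying _delta_log entries yields the table's current file set - and how stopping the replay at an earlier commit is essentially all that VERSION AS OF time travel does:

```python
# Toy _delta_log: one entry per commit, each adding/removing data files.
log = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    {"version": 2, "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},
]

def active_files(log, as_of_version=None):
    """Replay the log up to a version. Reading the table 'as of' version n
    just means using the set of files that were live at that commit."""
    files = set()
    for entry in log:
        if as_of_version is not None and entry["version"] > as_of_version:
            break
        files |= set(entry["add"])
        files -= set(entry["remove"])
    return files

print(active_files(log))                   # current files: part-001 and part-002
print(active_files(log, as_of_version=0))  # VERSION AS OF 0: only part-000
```

Note that "removed" files are only dropped from the live set, not deleted from storage - which is why time travel works until VACUUM physically deletes them.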

SQL Analytics Endpoint

Every Fabric Lakehouse automatically generates a SQL Analytics Endpoint that exposes the Tables section as a read-only T-SQL database. Features:

  • Query Delta tables using T-SQL SELECT statements
  • Create views, functions, and stored procedures on top of the tables
  • Connect Power BI in DirectQuery mode (or Direct Lake mode for semantic models)
  • Connect from SQL Server Management Studio (SSMS) or Azure Data Studio using the SQL endpoint connection string
  • Read-only: DML (INSERT/UPDATE/DELETE) is not supported via the SQL endpoint - use the Fabric Data Warehouse for DML

The SQL endpoint automatically discovers Delta tables in the Lakehouse Tables section and registers them. You don't need to create external table definitions - they appear as regular T-SQL tables.

Spark Notebooks

Fabric Notebooks run Apache Spark and support PySpark (Python), Spark SQL, Scala, and R. Key notebook operations on a Lakehouse:

# Read a Delta table as Spark DataFrame
df = spark.read.format("delta").load("Tables/sales")
# or using spark.sql with registered table
df = spark.sql("SELECT * FROM lakehouse1.sales")

# Write DataFrame as Delta table
df.write.format("delta").mode("overwrite").saveAsTable("sales_cleaned")

# Append to existing Delta table
df.write.format("delta").mode("append").saveAsTable("sales_cleaned")

# Upsert using Delta merge
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "sales")
target.alias("t").merge(
    df.alias("s"),
    "t.OrderID = s.OrderID"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Notebooks can also use the Lakehouse Explorer panel to drag table names into cells. Attaching a Lakehouse to a notebook makes it the default Lakehouse, so its tables are accessible without a schema prefix.
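The merge semantics in the snippet above can be mimicked in plain Python (an illustration of the whenMatchedUpdateAll/whenNotMatchedInsertAll behavior, not Delta code; the sample rows are invented):

```python
def merge_upsert(target, source, key="OrderID"):
    """Plain-Python sketch of Delta merge semantics:
    whenMatchedUpdateAll  -> a source row replaces the matching target row,
    whenNotMatchedInsertAll -> unmatched source rows are inserted."""
    merged = {row[key]: row for row in target}  # existing rows, indexed by key
    for row in source:
        merged[row[key]] = row                  # update if matched, else insert
    return list(merged.values())

target = [{"OrderID": 1, "Qty": 5}, {"OrderID": 2, "Qty": 3}]
source = [{"OrderID": 2, "Qty": 9}, {"OrderID": 3, "Qty": 1}]
print(merge_upsert(target, source))
# OrderID 1 kept as-is, OrderID 2 updated to Qty 9, OrderID 3 inserted
```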

OPTIMIZE and VACUUM

Command | Purpose | Syntax
OPTIMIZE | Compact many small Parquet files into fewer large files (default 1 GB target); improves read performance | OPTIMIZE tableName [ZORDER BY (col1, col2)]
VACUUM | Delete old Parquet files no longer referenced by the transaction log; frees storage but breaks time travel to the deleted versions | VACUUM tableName [RETAIN n HOURS]

# In a Spark notebook
spark.sql("OPTIMIZE lakehouse1.sales ZORDER BY (CustomerID, OrderDate)")
spark.sql("VACUUM lakehouse1.sales RETAIN 168 HOURS")  # 168 hours = 7 days

Run OPTIMIZE regularly after bulk loads or streaming writes to maintain query performance. Run VACUUM after OPTIMIZE to clean up files older than the retention window.
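As a rough illustration (a toy model with invented file records, not actual Delta internals): OPTIMIZE supersedes small files with compacted ones while keeping the originals for time travel, and VACUUM then physically deletes the superseded files that fall outside the retention window.

```python
import time

def optimize(files, target_mb=1024):
    """OPTIMIZE (toy model): compact live small files into ~target-size
    files. The originals stay on disk but are marked superseded, so time
    travel to earlier versions still works - until VACUUM removes them."""
    batch, batches, size = [], [], 0
    for f in [f for f in files if f["live"]]:
        f["live"] = False            # superseded by a compacted file
        batch.append(f)
        size += f["size_mb"]
        if size >= target_mb:
            batches.append(batch)
            batch, size = [], 0
    if batch:
        batches.append(batch)
    now = time.time()
    files.extend({"name": f"compacted-{i}.parquet",
                  "size_mb": sum(f["size_mb"] for f in b),
                  "created": now, "live": True}
                 for i, b in enumerate(batches))
    return files

def vacuum(files, retain_hours=168):
    """VACUUM (toy model): physically delete files that are no longer
    referenced and are older than the retention window."""
    cutoff = time.time() - retain_hours * 3600
    return [f for f in files if f["live"] or f["created"] > cutoff]

files = [{"name": f"part-{i}.parquet", "size_mb": 10, "created": 0.0, "live": True}
         for i in range(10)]
files = optimize(files)   # ten 10 MB files -> one ~100 MB compacted file
files = vacuum(files)     # superseded files past retention are deleted
print([f["name"] for f in files])  # ['compacted-0.parquet']
```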

Fabric Data Warehouse

The Fabric Data Warehouse is a fully managed, T-SQL warehouse that supports the complete SQL DML surface (INSERT, UPDATE, DELETE, MERGE). Unlike the Lakehouse SQL endpoint, the Warehouse is a first-class write destination.

  • Storage backed by OneLake in Delta/Parquet format
  • Full ACID transactions on DML operations
  • Supports views, stored procedures, functions
  • Cross-database queries across Lakehouses and Warehouses in the same workspace
  • Visual Query Builder - drag-and-drop SQL query generation
  • Column-level security and Dynamic Data Masking

Lakehouse vs Data Warehouse

Feature | Lakehouse | Data Warehouse
Primary interface | Spark Notebooks + SQL endpoint | T-SQL (full read/write)
DML (INSERT/UPDATE/DELETE) | Via Spark only | Yes - full T-SQL DML
SQL endpoint | Read-only, auto-generated | Full read/write
Raw file storage | Yes (Files section) | No
Spark notebooks | Native | Not natively (use a pipeline or dataflow)
Best for | Data engineers; mixed Spark + SQL; raw + curated data | SQL analysts; BI star schemas; full DML needed
Direct Lake | Yes | Yes (via its SQL endpoint)

Medallion Architecture

The medallion (Bronze/Silver/Gold) architecture is a recommended data organization pattern in Fabric:

Layer | Data Quality | Operations | Typical Storage
Bronze (Raw) | Unprocessed, as ingested | Ingest only - Copy Data, Dataflow, Notebook | Lakehouse Files or Tables (raw)
Silver (Cleaned) | Validated, deduplicated, typed | Spark Notebook or Dataflow Gen2 transforms | Lakehouse Delta tables
Gold (Curated) | Business-ready aggregates or star schema | SQL transforms, Warehouse DML | Lakehouse Tables or Warehouse

In Fabric, all three layers live in OneLake. Power BI Direct Lake semantic models typically connect to the Gold layer for best performance and data quality.
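The three layers can be sketched end-to-end in plain Python (illustrative only - in Fabric these steps would be Spark or SQL transforms over Lakehouse and Warehouse tables, and the sample rows are invented):

```python
# Bronze: raw rows exactly as ingested - duplicates, strings, nulls
bronze = [
    {"order": "1", "amount": "10.0", "region": "EU"},
    {"order": "1", "amount": "10.0", "region": "EU"},  # duplicate
    {"order": "2", "amount": "7.5",  "region": "US"},
    {"order": "3", "amount": None,   "region": "US"},  # invalid: missing amount
]

# Silver: validate, deduplicate, cast to proper types
seen, silver = set(), []
for r in bronze:
    if r["amount"] is None or r["order"] in seen:
        continue  # drop invalid and duplicate rows
    seen.add(r["order"])
    silver.append({"order": int(r["order"]),
                   "amount": float(r["amount"]),
                   "region": r["region"]})

# Gold: business-ready aggregate (revenue per region)
gold = {}
for r in silver:
    gold[r["region"]] = gold.get(r["region"], 0.0) + r["amount"]

print(gold)  # {'EU': 10.0, 'US': 7.5}
```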
