Top 20 Azure Databricks Interview Questions and Answers
- What is Azure Databricks?
- What are the different types of Databricks clusters?
- What is Delta Lake and why is it important?
- Explain the Databricks workspace architecture.
- What is Unity Catalog in Databricks?
- How do you optimize Spark jobs in Databricks?
- What is the difference between Databricks notebooks and Jobs?
- How do you handle streaming data in Databricks?
- What are Databricks widgets and how are they used?
- How do you implement CI/CD for Databricks?
- What is AutoML in Databricks?
- Explain cluster pools in Databricks.
- How do you manage secrets in Databricks?
- What is Delta Live Tables (DLT)?
- How do you handle slowly changing dimensions?
- What is Photon and what are its benefits?
- How do you integrate Databricks with Azure Data Factory?
- What is MLflow and how is it used in Databricks?
- How do you troubleshoot Spark jobs in Databricks?
- What are best practices for Databricks production workloads?
1. What is Azure Databricks?
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. It's a first-party Microsoft service developed in partnership with Databricks.
Key Features:
- Unified Analytics: Data engineering, data science, and ML in one platform
- Collaborative Workspace: Notebooks, repos, dashboards
- Optimized Spark: 5-10x faster with Photon engine
- Delta Lake: ACID transactions, schema enforcement
- Enterprise Security: Azure AD, RBAC, encryption
Use Cases:
- ETL/ELT pipelines
- Real-time streaming analytics
- Machine learning at scale
- Data warehousing with lakehouse
2. What are the different types of Databricks clusters?
1. All-Purpose Clusters:
- Interactive workloads
- Multiple users can share
- Can be started/stopped manually
- Higher cost per DBU
2. Job Clusters:
- Created for specific job execution
- Terminated when job completes
- Lower cost per DBU
- Recommended for production pipelines
Cluster Modes:
| Mode | Description | Use Case |
|---|---|---|
| Standard | Single user, full Spark features | Data engineering |
| High Concurrency | Shared by multiple users, isolation | Interactive analytics |
| Single Node | Driver only, no workers | Small workloads, ML |
Example cluster configuration (JSON). Note that `autoscale` and a fixed `num_workers` are mutually exclusive in the Clusters API; this example uses autoscaling:

```json
{
    "cluster_name": "my-production-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "spark_conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true"
    }
}
```
3. What is Delta Lake and why is it important?
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Key Features:
- ACID Transactions: Serializable isolation levels
- Schema Enforcement: Prevent bad data writes
- Schema Evolution: Add columns without rewriting
- Time Travel: Query historical versions
- Unified Batch/Streaming: Same table, both workloads
```python
# Create Delta table
df.write.format("delta").save("/delta/events")

# Create managed table
df.write.format("delta").saveAsTable("events")

# Read Delta table
df = spark.read.format("delta").load("/delta/events")

# Time travel - read previous version
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/delta/events")
df_timestamp = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/delta/events")

# MERGE (upsert) operation
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/delta/events")
(deltaTable.alias("target")
    .merge(updates_df.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```
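Time travel works because Delta records every commit as a JSON file in the table's `_delta_log` directory, named by a zero-padded, 20-digit version number; `versionAsOf` replays the log up to that commit. A small sketch of that naming scheme (plain Python, no Spark needed):

```python
# Each Delta commit N is recorded as a log entry at
# _delta_log/<N zero-padded to 20 digits>.json
def commit_filename(version: int) -> str:
    return f"_delta_log/{version:020d}.json"

print(commit_filename(0))  # _delta_log/00000000000000000000.json
print(commit_filename(1))  # _delta_log/00000000000000000001.json
```

Knowing this layout is handy when debugging: listing `_delta_log` shows exactly which versions are available to `versionAsOf`.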
4. Explain the Databricks workspace architecture.
Control Plane (Databricks-managed):
- Workspace application
- Cluster management
- Notebook storage
- Job scheduling
- Identity and access management
Data Plane (Customer's Azure subscription):
- Compute resources (VMs for clusters)
- DBFS storage (Azure Blob Storage)
- Data sources (ADLS, SQL, etc.)
Workspace Components:
```
Workspace/
├── Data/              <- Catalogs, schemas, tables
├── Compute/           <- Clusters, pools, warehouses
├── Workflows/         <- Jobs, DLT pipelines
├── Notebooks/         <- Code notebooks
├── Repos/             <- Git repositories
├── Machine Learning/  <- MLflow experiments, models
└── SQL/               <- SQL queries, dashboards
```
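The split matters in practice: cluster-management requests go to the control plane's REST API, while the VMs they describe run in your own subscription. A minimal sketch of building an authenticated call to the Clusters API (standard library only; the workspace URL and token below are placeholders, not real values):

```python
import os
import urllib.request

# Placeholder workspace URL (assumption) - replace with your deployment's URL.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"

def build_list_clusters_request(host: str, token: str) -> urllib.request.Request:
    """Build an authenticated request to the Clusters API (api/2.0/clusters/list).

    The request is served by the Databricks-managed control plane; the
    clusters it returns are compute resources in the customer's data plane.
    """
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_list_clusters_request(HOST, os.environ.get("DATABRICKS_TOKEN", "<token>"))
print(req.full_url)
```

The same pattern (personal access token in a `Bearer` header) applies to the Jobs, Secrets, and Workspace APIs.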
5. What is Unity Catalog in Databricks?
Unity Catalog is a unified governance solution for all data and AI assets in Databricks, providing centralized access control, auditing, and lineage.
Key Features:
- Three-Level Namespace: Catalog > Schema > Table/View
- Centralized Governance: Single place for all data assets
- Fine-Grained Access Control: Table, row, and column level
- Data Lineage: Track data flow automatically
- Cross-Workspace: Share data across workspaces
```sql
-- Create catalog
CREATE CATALOG IF NOT EXISTS sales_catalog;

-- Create schema
CREATE SCHEMA IF NOT EXISTS sales_catalog.bronze;

-- Create table with Unity Catalog
CREATE TABLE sales_catalog.bronze.raw_orders (
    order_id INT,
    customer_id INT,
    amount DECIMAL(10,2)
);

-- Grant permissions
GRANT SELECT ON TABLE sales_catalog.bronze.raw_orders TO data_analysts;
GRANT ALL PRIVILEGES ON SCHEMA sales_catalog.bronze TO data_engineers;

-- Row-level security: sales_team members see all rows,
-- everyone else only sees the US region
CREATE FUNCTION region_filter(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('sales_team'), true, region = 'US');

ALTER TABLE sales SET ROW FILTER region_filter ON (region);
```
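The three-level namespace means every Unity Catalog object is addressed as `catalog.schema.table`. A tiny illustrative helper (plain Python; the function name is made up for this sketch) for splitting such a name:

```python
def split_table_name(fqn: str) -> tuple[str, str, str]:
    """Split a Unity Catalog fully qualified name into its three levels.

    Illustrative only; it does not handle backtick-quoted identifiers
    that themselves contain dots.
    """
    catalog, schema, table = fqn.split(".")
    return catalog, schema, table

print(split_table_name("sales_catalog.bronze.raw_orders"))
# ('sales_catalog', 'bronze', 'raw_orders')
```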