

Top Azure Databricks Interview Questions (2026) | JavaInuse

Top 20 Azure Databricks Interview Questions and Answers


  1. What is Azure Databricks?
  2. What are the different types of Databricks clusters?
  3. What is Delta Lake and why is it important?
  4. Explain the Databricks workspace architecture.
  5. What is Unity Catalog in Databricks?
  6. How do you optimize Spark jobs in Databricks?
  7. What is the difference between Databricks notebooks and Jobs?
  8. How do you handle streaming data in Databricks?
  9. What are Databricks widgets and how are they used?
  10. How do you implement CI/CD for Databricks?
  11. What is AutoML in Databricks?
  12. Explain cluster pools in Databricks.
  13. How do you manage secrets in Databricks?
  14. What is Delta Live Tables (DLT)?
  15. How do you handle slowly changing dimensions?
  16. What is Photon and what are its benefits?
  17. How do you integrate Databricks with Azure Data Factory?
  18. What is MLflow and how is it used in Databricks?
  19. How do you troubleshoot Spark jobs in Databricks?
  20. What are best practices for Databricks production workloads?

Microsoft Azure Interview Questions

Comprehensive interview questions for Azure cloud services and data engineering roles.

1. What is Azure Databricks?

Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. It's a first-party Microsoft service developed in partnership with Databricks.

Key Features:
- Unified Analytics: Data engineering, data science, and ML in one platform
- Collaborative Workspace: Notebooks, repos, dashboards
- Optimized Spark: significantly faster query execution with the Photon engine
- Delta Lake: ACID transactions, schema enforcement
- Enterprise Security: Azure AD, RBAC, encryption

Use Cases:
- ETL/ELT pipelines
- Real-time streaming analytics
- Machine learning at scale
- Data warehousing with lakehouse

2. What are the different types of Databricks clusters?

1. All-Purpose Clusters:
- Interactive workloads
- Multiple users can share
- Can be started/stopped manually
- Higher cost per DBU

2. Job Clusters:
- Created for specific job execution
- Terminated when job completes
- Lower cost per DBU
- Recommended for production pipelines
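The DBU cost difference between the two cluster types can be made concrete with a quick back-of-the-envelope calculation. The rates below are hypothetical placeholders (actual Azure Databricks rates vary by tier, workload, and region):

```python
# Hypothetical DBU rates -- check the Azure Databricks pricing page for real numbers
ALL_PURPOSE_RATE = 0.55  # $/DBU (assumed)
JOB_RATE = 0.30          # $/DBU (assumed)

def workload_cost(dbu_per_hour: float, hours: float, rate_per_dbu: float) -> float:
    """DBU-based cost: DBUs consumed per hour x hours x $/DBU."""
    return dbu_per_hour * hours * rate_per_dbu

# A pipeline consuming 4 DBU/hour for 100 hours a month
print(f"all-purpose: ${workload_cost(4, 100, ALL_PURPOSE_RATE):.2f}")
print(f"job cluster: ${workload_cost(4, 100, JOB_RATE):.2f}")
```

With the assumed rates, moving the same workload from an all-purpose to a job cluster cuts the DBU bill by roughly 45%, which is why job clusters are the default recommendation for scheduled production pipelines.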

Cluster Modes:
Mode             | Description                          | Use Case
Standard         | Single user, full Spark features     | Data engineering
High Concurrency | Shared by multiple users, isolation  | Interactive analytics
Single Node      | Driver only, no workers              | Small workloads, ML

// Cluster configuration example
{
    "cluster_name": "my-production-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "spark_conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true"
    }
}

3. What is Delta Lake and why is it important?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Key Features:
- ACID Transactions: Serializable isolation levels
- Schema Enforcement: Prevent bad data writes
- Schema Evolution: Add columns without rewriting
- Time Travel: Query historical versions
- Unified Batch/Streaming: Same table, both workloads

# Create Delta table
df.write.format("delta").save("/delta/events")

# Create managed table
df.write.format("delta").saveAsTable("events")

# Read Delta table
df = spark.read.format("delta").load("/delta/events")

# Time travel - read previous version
df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/delta/events")
df_timestamp = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/delta/events")

# MERGE (Upsert) operation
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/delta/events")
deltaTable.alias("target").merge(
    updates_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

4. Explain the Databricks workspace architecture.

Control Plane (Databricks-managed):
- Workspace application
- Cluster management
- Notebook storage
- Job scheduling
- Identity and access management

Data Plane (Customer's Azure subscription):
- Compute resources (VMs for clusters)
- DBFS storage (Azure Blob Storage)
- Data sources (ADLS, SQL, etc.)

Workspace Components:
Workspace/
├── Data/                    <- Catalog, schemas, tables
├── Compute/                 <- Clusters, pools, warehouses
├── Workflows/               <- Jobs, DLT pipelines
├── Notebooks/               <- Code notebooks
├── Repos/                   <- Git repositories
├── Machine Learning/        <- MLflow experiments, models
└── SQL/                     <- SQL queries, dashboards
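Because the control plane fronts workspace operations with a REST API, a client only needs the workspace URL and a bearer token (Azure AD or PAT). A minimal sketch of building the endpoint that lists clusters; the host below is a placeholder, while the path follows the Databricks REST API:

```python
def clusters_list_url(workspace_host: str) -> str:
    """Build the control-plane endpoint that lists clusters (REST API 2.0)."""
    return f"{workspace_host.rstrip('/')}/api/2.0/clusters/list"

# The actual GET request would carry an "Authorization: Bearer <token>" header
print(clusters_list_url("https://adb-1234567890.azuredatabricks.net"))
```

Note the request hits the Databricks-managed control plane; the VMs it describes run in the customer's own subscription (the data plane).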

5. What is Unity Catalog in Databricks?

Unity Catalog is a unified governance solution for all data and AI assets in Databricks, providing centralized access control, auditing, and lineage.

Key Features:
- Three-Level Namespace: Catalog > Schema > Table/View
- Centralized Governance: Single place for all data assets
- Fine-Grained Access Control: Table, row, and column level
- Data Lineage: Track data flow automatically
- Cross-Workspace: Share data across workspaces

-- Create catalog
CREATE CATALOG IF NOT EXISTS sales_catalog;

-- Create schema
CREATE SCHEMA IF NOT EXISTS sales_catalog.bronze;

-- Create table with Unity Catalog
CREATE TABLE sales_catalog.bronze.raw_orders (
    order_id INT,
    customer_id INT,
    amount DECIMAL(10,2)
);

-- Grant permissions
GRANT SELECT ON TABLE sales_catalog.bronze.raw_orders TO data_analysts;
GRANT ALL PRIVILEGES ON SCHEMA sales_catalog.bronze TO data_engineers;

-- Row-level security
CREATE FUNCTION region_filter(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('sales_team'), true, region = current_user());

ALTER TABLE sales SET ROW FILTER region_filter ON (region);




6. How do you optimize Spark jobs in Databricks?

1. Partitioning:
# Optimal partition size: 128MB-256MB
df = df.repartition(200)  # Increase parallelism
df = df.coalesce(10)      # Decrease without shuffle

# Partition by column
df.write.partitionBy("date").parquet("/output")

2. Caching:
# Cache frequently accessed data
df.cache()
df.persist(StorageLevel.MEMORY_AND_DISK)

# Delta cache (automatic for Delta tables)
spark.conf.set("spark.databricks.io.cache.enabled", "true")

3. Broadcast Joins:
from pyspark.sql.functions import broadcast

# Broadcast small tables (< 10MB)
result = large_df.join(broadcast(small_df), "key")

4. Adaptive Query Execution (AQE):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

5. Delta Lake Optimization:
-- Optimize table (compact small files)
OPTIMIZE delta_table;

-- Z-Order for better data skipping
OPTIMIZE delta_table ZORDER BY (customer_id, date);

-- Vacuum old files
VACUUM delta_table RETAIN 168 HOURS;
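The 128MB-256MB partition guideline from step 1 can be turned into a quick estimate of the repartition() count; a sketch using the 200 MB midpoint:

```python
import math

def target_partitions(total_bytes: int, target_mb: int = 200) -> int:
    """Estimate a repartition() count from total data size,
    aiming for the 128-256 MB per-partition guideline (200 MB midpoint)."""
    return max(1, math.ceil(total_bytes / (target_mb * 1024 * 1024)))

print(target_partitions(50 * 1024**3))  # 50 GB -> 256 partitions
```

In practice you would measure the shuffle size in the Spark UI rather than guessing, but this gives a sane starting point before tuning.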

7. What is the difference between Databricks notebooks and Jobs?

Aspect     | Notebooks                    | Jobs
Purpose    | Interactive development      | Scheduled production runs
Execution  | Manual, cell by cell         | Automated, full notebook/script
Cluster    | All-purpose clusters         | Job clusters (recommended)
Monitoring | Real-time output             | Job run logs, alerts
Cost       | Higher (interactive pricing) | Lower (job pricing)
Use Case   | Development, exploration     | Production pipelines

Job Configuration:
{
    "name": "Daily ETL Pipeline",
    "tasks": [{
        "task_key": "bronze_ingestion",
        "notebook_task": {
            "notebook_path": "/Repos/prod/etl/bronze_layer"
        },
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 4
        }
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC"
    },
    "email_notifications": {
        "on_failure": ["team@company.com"]
    }
}
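Note that quartz_cron_expression uses the Quartz format (six or seven fields, starting with seconds), not standard five-field cron. A small helper, hypothetical but handy in interviews, makes "0 0 6 * * ?" readable:

```python
QUARTZ_FIELDS = ["second", "minute", "hour", "day_of_month", "month", "day_of_week"]

def describe_quartz(expr: str) -> dict:
    """Map a six-field Quartz cron expression to named fields.
    In Quartz, '?' means 'no specific value' and is required in exactly
    one of day_of_month / day_of_week."""
    return dict(zip(QUARTZ_FIELDS, expr.split()))

print(describe_quartz("0 0 6 * * ?"))  # fires daily at 06:00 in the job's timezone_id
```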

8. How do you handle streaming data in Databricks?

Structured Streaming:
# Read from Kafka
df_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse JSON
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType

schema = StructType() \
    .add("event_id", StringType()) \
    .add("event_type", StringType()) \
    .add("timestamp", StringType())

df_parsed = df_stream.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")

# Write to Delta table
(df_parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/events")
    .trigger(processingTime="10 seconds")
    .toTable("streaming_events")
)

Auto Loader (Recommended for files):
# Auto Loader for incremental file ingestion
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/schema/events")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/landing/events/")
)

# Write to Delta
df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/events") \
    .trigger(availableNow=True) \
    .toTable("bronze_events")

9. What are Databricks widgets and how are they used?

Widgets allow you to add parameters to notebooks, making them reusable and configurable.

# Create widgets
dbutils.widgets.text("input_path", "/default/path", "Input Path")
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "prod"], "Environment")
dbutils.widgets.combobox("date", "2024-01-01", ["2024-01-01", "2024-01-02"], "Date")
dbutils.widgets.multiselect("regions", "US", ["US", "EU", "APAC"], "Regions")

# Get widget values
input_path = dbutils.widgets.get("input_path")
env = dbutils.widgets.get("environment")

# Use in queries
df = spark.read.parquet(f"{input_path}/{env}")

# Remove widgets
dbutils.widgets.remove("input_path")
dbutils.widgets.removeAll()

Job Parameters:
// Pass parameters in job configuration
{
    "notebook_task": {
        "notebook_path": "/Repos/prod/etl/pipeline",
        "base_parameters": {
            "input_path": "/data/prod",
            "environment": "prod",
            "date": "{{start_date}}"
        }
    }
}

10. How do you implement CI/CD for Databricks?

1. Databricks Repos (Git Integration):
# Repository structure
my-databricks-project/
├── notebooks/
│   ├── bronze/
│   ├── silver/
│   └── gold/
├── src/
│   └── common/
│       └── utils.py
├── tests/
├── .github/
│   └── workflows/
│       └── deploy.yml
└── databricks.yml

2. Databricks Asset Bundles (DABs):
# databricks.yml
bundle:
  name: my-etl-pipeline

workspace:
  host: https://adb-xxx.azuredatabricks.net

resources:
  jobs:
    daily_etl:
      name: "Daily ETL Pipeline"
      tasks:
        - task_key: bronze
          notebook_task:
            notebook_path: ./notebooks/bronze/ingest.py
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            num_workers: 2

targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net
  prod:
    workspace:
      host: https://adb-prod.azuredatabricks.net

3. GitHub Actions Workflow:
name: Deploy to Databricks
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Install Databricks CLI
      uses: databricks/setup-cli@main
    
    - name: Deploy bundle
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      run: databricks bundle deploy

11. What is AutoML in Databricks?

Databricks AutoML automates the machine learning pipeline: feature engineering, model selection, hyperparameter tuning.

from databricks import automl

# Classification
summary = automl.classify(
    dataset=train_df,
    target_col="label",
    primary_metric="f1",
    timeout_minutes=60,
    max_trials=50
)

# Regression
summary = automl.regress(
    dataset=train_df,
    target_col="price",
    primary_metric="rmse",
    exclude_cols=["id", "timestamp"]
)

# Forecasting
summary = automl.forecast(
    dataset=train_df,
    target_col="sales",
    time_col="date",
    horizon=30,
    frequency="d"
)

# Get best model
best_model = summary.best_trial.load_model()
print(f"Best trial: {summary.best_trial.metrics}")

# View the auto-generated notebook for the best trial
displayHTML(f'<a href="{summary.best_trial.notebook_url}">Best Trial Notebook</a>')

12. Explain cluster pools in Databricks.

Cluster pools maintain a set of idle, ready-to-use instances to reduce cluster start and auto-scaling times.

Benefits:
- Reduced cluster startup time (seconds vs minutes)
- Faster auto-scaling
- Cost savings through instance reuse
- Consistent instance types

// Pool configuration
{
    "instance_pool_name": "data-engineering-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,
    "max_capacity": 20,
    "idle_instance_autotermination_minutes": 30,
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1
    },
    "preloaded_spark_versions": ["13.3.x-scala2.12"]
}

// Cluster using pool
{
    "cluster_name": "etl-cluster",
    "instance_pool_id": "pool-xxxxx",
    "num_workers": 4,
    "spark_version": "13.3.x-scala2.12"
}

13. How do you manage secrets in Databricks?

Databricks Secret Scopes:
# Create secret scope (CLI)
databricks secrets create-scope --scope my-secrets

# Add secret
databricks secrets put --scope my-secrets --key db-password

# List secrets
databricks secrets list --scope my-secrets

# Access in notebook
db_password = dbutils.secrets.get(scope="my-secrets", key="db-password")

# Connection string example
jdbc_url = f"jdbc:sqlserver://server.database.windows.net:1433;database=mydb"
connection_properties = {
    "user": dbutils.secrets.get("my-secrets", "db-user"),
    "password": dbutils.secrets.get("my-secrets", "db-password"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(jdbc_url, "dbo.MyTable", properties=connection_properties)

Azure Key Vault-Backed Scope:
# Create a Key Vault-backed scope (CLI; also possible via the UI or REST API)
# Secrets stay in Key Vault and are read through Databricks

databricks secrets create-scope --scope kv-scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.KeyVault/vaults/{vault} \
  --dns-name https://{vault}.vault.azure.net/

14. What is Delta Live Tables (DLT)?

Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data pipelines.

import dlt
from pyspark.sql.functions import *

# Bronze layer - raw ingestion
@dlt.table(
    comment="Raw orders from source",
    table_properties={"quality": "bronze"}
)
def raw_orders():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/orders/")
    )

# Silver layer - cleaned data
@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
def cleaned_orders():
    return (
        dlt.read_stream("raw_orders")
        .select(
            col("order_id"),
            col("customer_id"),
            col("amount").cast("decimal(10,2)"),
            to_timestamp("order_date").alias("order_date")
        )
        .filter(col("order_date") >= "2023-01-01")
    )

# Gold layer - aggregations
@dlt.table(comment="Daily order summary")
def daily_orders():
    return (
        dlt.read("cleaned_orders")
        .groupBy(date_trunc("day", "order_date").alias("date"))
        .agg(
            count("*").alias("order_count"),
            sum("amount").alias("total_amount")
        )
    )

DLT Benefits:
- Declarative syntax (define WHAT, not HOW)
- Built-in data quality expectations
- Automatic dependency management
- Managed infrastructure
- Lineage and monitoring
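The expectation semantics can be mimicked outside DLT. A plain-Python sketch of what @dlt.expect_or_drop does with a predicate (the real decorator additionally records pass/drop metrics in the pipeline event log):

```python
def expect_or_drop(rows, name, predicate):
    """Keep only rows satisfying the expectation; report how many were dropped."""
    kept = [r for r in rows if predicate(r)]
    print(f"{name}: kept {len(kept)}, dropped {len(rows) - len(kept)}")
    return kept

orders = [{"amount": 10.0}, {"amount": -5.0}, {"amount": 2.5}]
valid = expect_or_drop(orders, "valid_amount", lambda r: r["amount"] > 0)
```

@dlt.expect (without _or_drop) would keep the violating rows and only record the failure; @dlt.expect_or_fail would abort the update instead.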




15. How do you handle slowly changing dimensions?

SCD Type 1 (Overwrite):
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/delta/customers")

deltaTable.alias("target").merge(
    updates_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

SCD Type 2 (Historical tracking):
from functools import reduce
from pyspark.sql.functions import col, lit, current_timestamp

def process_scd2(target_df, source_df, key_cols, value_cols):
    # Alias both sides so "source.*" / "target.*" references resolve after the join
    source = source_df.alias("source")
    target = target_df.filter(col("is_current") == True).alias("target")
    
    # Find existing records whose tracked value columns changed
    changes = source.join(target, key_cols, "left").filter(
        reduce(lambda a, b: a | b,
               [col(f"source.{c}") != col(f"target.{c}") for c in value_cols])
    )
    
    # Close old records
    closed_records = changes.select(
        *[col(f"target.{c}") for c in target_df.columns]
    ).withColumn("is_current", lit(False)) \
     .withColumn("end_date", current_timestamp())
    
    # Insert new versions of the changed records
    new_records = changes.select(
        *[col(f"source.{c}") for c in source_df.columns]
    ).withColumn("is_current", lit(True)) \
     .withColumn("start_date", current_timestamp()) \
     .withColumn("end_date", lit(None).cast("timestamp"))
    
    # Align schemas by name: source rows lack the SCD bookkeeping columns
    return closed_records.unionByName(new_records, allowMissingColumns=True)

Using Delta Lake MERGE:
deltaTable.alias("target").merge(
    staged_updates.alias("staged"),
    "target.customer_id = staged.customer_id"
).whenMatchedUpdate(
    condition="target.name != staged.name OR target.address != staged.address",
    set={
        "is_current": "false",
        "end_date": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "staged.customer_id",
        "name": "staged.name",
        "address": "staged.address",
        "is_current": "true",
        "start_date": "current_timestamp()",
        "end_date": "null"
    }
).execute()
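Stripped of Spark, the Type 2 mechanics reduce to "close the current row, append a new version". A pure-Python sketch with the same column names as the example above (the fixed timestamp is just for reproducibility):

```python
from datetime import datetime

def scd2_apply(dimension, updates, key="customer_id", tracked=("name", "address")):
    """SCD Type 2: close changed current rows, append new current versions."""
    now = datetime(2024, 1, 1)  # fixed timestamp for a reproducible example
    current = {row[key]: row for row in dimension if row["is_current"]}
    result = list(dimension)
    for upd in updates:
        old = current.get(upd[key])
        if old is not None and any(old[c] != upd[c] for c in tracked):
            old["is_current"] = False          # close the old version
            old["end_date"] = now
        if old is None or not old["is_current"]:
            result.append({**upd, "is_current": True,
                           "start_date": now, "end_date": None})
    return result

dim = [{"customer_id": 1, "name": "Ada", "address": "Old St",
        "is_current": True, "start_date": None, "end_date": None}]
dim = scd2_apply(dim, [{"customer_id": 1, "name": "Ada", "address": "New St"}])
print(len(dim))  # 2: the closed old row plus the new current row
```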

16. What is Photon and what are its benefits?

Photon is Databricks' native vectorized query engine written in C++ that provides significant performance improvements for Spark SQL workloads.

Key Features:
- Vectorized Processing: Batch-level operations instead of row-by-row
- Native Execution: Runs natively on CPU, bypassing JVM
- Automatic: No code changes required
- Compatible: Works with existing Spark SQL APIs

Performance Benefits:
- 2-8x faster for ETL workloads
- Better CPU utilization
- Reduced memory pressure
- Improved I/O efficiency

// Enable Photon on cluster
{
    "cluster_name": "photon-cluster",
    "spark_version": "13.3.x-photon-scala2.12",  // Use Photon runtime
    "node_type_id": "Standard_DS3_v2",
    "runtime_engine": "PHOTON"
}

-- Check whether a query uses Photon
EXPLAIN FORMATTED SELECT * FROM large_table WHERE date > '2024-01-01';
-- Look for Photon operators (names prefixed with "Photon") in the plan

17. How do you integrate Databricks with Azure Data Factory?

1. Linked Service:
{
    "name": "AzureDatabricks",
    "type": "AzureDatabricks",
    "typeProperties": {
        "domain": "https://adb-xxx.azuredatabricks.net",
        "authentication": "MSI",  // Managed Identity
        "workspaceResourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Databricks/workspaces/{workspace}"
    }
}

2. Notebook Activity:
{
    "name": "RunDatabricksNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricks",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "notebookPath": "/Repos/prod/etl/transform",
        "baseParameters": {
            "date": "@pipeline().parameters.date",
            "input_path": "@pipeline().parameters.inputPath"
        }
    },
    "policy": {
        "timeout": "01:00:00",
        "retry": 2
    }
}

3. New Job Cluster vs Existing Cluster:
- New Job Cluster: Recommended for production, auto-terminates
- Existing Cluster: For development, faster but higher cost

18. What is MLflow and how is it used in Databricks?

MLflow is an open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment.

Components:
- Tracking: Log parameters, metrics, artifacts
- Models: Package and deploy models
- Registry: Centralized model store
- Projects: Reproducible runs

import mlflow
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier

# Start experiment
mlflow.set_experiment("/Shared/my-ml-experiment")

with mlflow.start_run(run_name="random_forest_v1") as run:
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    
    # Log metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    
    # Log model
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature)
    
    # Log artifacts
    mlflow.log_artifact("feature_importance.png")

# Register model in Model Registry
result = mlflow.register_model(
    f"runs:/{run.info.run_id}/model",
    "fraud_detection_model"
)

# Transition to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud_detection_model",
    version=result.version,
    stage="Production"
)

19. How do you troubleshoot Spark jobs in Databricks?

1. Spark UI:
- View stages, tasks, and their durations
- Identify shuffle read/write
- Check for data skew
- Monitor memory usage

2. Common Issues and Solutions:
# Data Skew - Check partition sizes
from pyspark.sql.functions import spark_partition_id
df.groupBy(spark_partition_id()).count().show()

# Solution: Salt the key
from pyspark.sql.functions import concat, lit, rand

df_salted = df.withColumn("salted_key", 
    concat(col("key"), lit("_"), (rand() * 10).cast("int")))

# Memory Issues - Adjust configurations
spark.conf.set("spark.sql.shuffle.partitions", "400")
# Note: executor/driver memory cannot be changed at runtime; set these in the
# cluster's Spark config when the cluster is created:
#   spark.executor.memory 8g
#   spark.driver.memory 4g

# Slow shuffles - Check for large shuffles
# In Spark UI: Look for high "Shuffle Read" or "Shuffle Write"

# OOM Errors - Reduce partition size
df = df.repartition(1000)  # More partitions = smaller per-partition data

3. Query Explain Plan:
# Get execution plan
df.explain(mode="extended")
df.explain(mode="cost")

# SQL explain
EXPLAIN FORMATTED SELECT * FROM table WHERE ...

20. What are best practices for Databricks production workloads?

1. Cluster Configuration:
- Use job clusters (not all-purpose) for production
- Enable auto-scaling
- Use cluster pools for faster startup
- Enable Photon for SQL workloads

2. Code Organization:
- Use Databricks Repos for version control
- Separate notebooks: Bronze, Silver, Gold layers
- Create shared utility libraries
- Implement unit tests

3. Data Management:
- Use Delta Lake for all tables
- Implement OPTIMIZE and VACUUM schedules
- Use Z-ORDER for frequently filtered columns
- Monitor table sizes and partitions

4. Security:
- Use Unity Catalog for governance
- Store secrets in secret scopes
- Use managed identities
- Implement row-level security

5. Monitoring:
# Log metrics to Azure Monitor
from applicationinsights import TelemetryClient

tc = TelemetryClient(dbutils.secrets.get("secrets", "app-insights-key"))
tc.track_metric("rows_processed", row_count)
tc.track_event("pipeline_completed", {"status": "success"})
tc.flush()

6. Cost Optimization:
- Use spot instances for fault-tolerant workloads
- Implement auto-termination
- Right-size clusters based on monitoring
- Schedule jobs during off-peak hours
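Several of these cost levers live directly in the cluster spec. A sketch with illustrative values; field names follow the Databricks Clusters API used in the earlier configuration examples:

```json
{
    "cluster_name": "cost-optimized-etl",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autotermination_minutes": 30,
    "autoscale": {"min_workers": 1, "max_workers": 6},
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 1,
        "spot_bid_max_price": -1
    }
}
```

Here first_on_demand keeps the driver on an on-demand VM while workers use spot capacity, and autotermination_minutes stops the cluster after 30 idle minutes.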
