Top Azure Data Factory Interview Questions (2026) | JavaInuse

Top 20 Azure Data Factory Interview Questions and Answers


  1. What is Azure Data Factory?
  2. What are the key components of Azure Data Factory?
  3. What is Integration Runtime and its types?
  4. Explain the difference between Copy Activity and Data Flow.
  5. What are Linked Services and Datasets?
  6. How do you implement incremental data loading?
  7. What are the different trigger types in ADF?
  8. How do you handle errors and retries in ADF?
  9. What is Mapping Data Flow vs Wrangling Data Flow?
  10. How do you parameterize pipelines in ADF?
  11. Explain ADF expressions and functions.
  12. How do you implement CI/CD for Azure Data Factory?
  13. What are ADF global parameters?
  14. How do you monitor and troubleshoot ADF pipelines?
  15. What is the difference between ForEach and Until activities?
  16. How do you call stored procedures in ADF?
  17. What are data flow transformations?
  18. How do you handle schema drift in ADF?
  19. What are managed virtual networks in ADF?
  20. How do you optimize ADF pipeline performance?

Microsoft Azure Interview Questions

Comprehensive interview questions for Azure cloud services and data engineering roles.

1. What is Azure Data Factory?

Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.

Key Capabilities:
- ETL/ELT: Extract, transform, and load data
- 95+ Connectors: Cloud and on-premises sources
- Code-Free UI: Visual pipeline design
- Serverless: Auto-scale, pay-per-use
- SSIS Integration: Lift and shift SSIS packages

Use Cases:
- Data migration to cloud
- Data warehouse loading
- Data integration from multiple sources
- Big data processing orchestration

2. What are the key components of Azure Data Factory?

1. Pipelines:
- Logical grouping of activities
- Unit of execution

2. Activities:
- Data movement (Copy)
- Data transformation (Data Flow, HDInsight, Databricks)
- Control flow (ForEach, If, Switch, Wait)

3. Datasets:
- Named view of data
- Points to data in linked service

4. Linked Services:
- Connection strings to data stores
- Connection strings to compute

5. Triggers:
- Schedule pipeline execution
- Event-driven execution

6. Integration Runtime:
- Compute infrastructure for activities

// Pipeline JSON structure
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSQL",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobDataset"}],
                "outputs": [{"referenceName": "SQLDataset"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"}
                }
            }
        ]
    }
}

3. What is Integration Runtime and its types?

Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments.

Types of Integration Runtime:
Type           | Description                     | Use Case
Azure IR       | Microsoft-managed, public cloud | Cloud-to-cloud data movement
Self-Hosted IR | Customer-managed, on-premises   | On-prem to cloud, private networks
Azure-SSIS IR  | Managed SSIS environment        | Running SSIS packages in Azure

// Self-Hosted IR Configuration
{
    "name": "SelfHostedIR",
    "type": "SelfHosted",
    "typeProperties": {}
}

// Azure IR with VNET
{
    "name": "ManagedVNetIR",
    "type": "Managed",
    "typeProperties": {
        "computeProperties": {
            "location": "East US",
            "dataFlowProperties": {
                "computeType": "General",
                "coreCount": 8,
                "timeToLive": 10
            }
        }
    },
    "managedVirtualNetwork": {
        "referenceName": "default",
        "type": "ManagedVirtualNetworkReference"
    }
}

4. Explain the difference between Copy Activity and Data Flow.

Aspect          | Copy Activity                             | Data Flow
Purpose         | Move data (ETL/ELT)                       | Transform data (ETL)
Transformations | Limited (column mapping, type conversion) | Full transformations (join, aggregate, pivot)
Compute         | Azure IR or Self-Hosted IR                | Spark clusters (auto-managed)
Coding          | No code                                   | No code (visual designer)
Performance     | Fastest for simple copy                   | Better for complex transformations
Cost            | DIU-based pricing                         | vCore-hours pricing

When to use Copy Activity:
- Simple data movement
- Staging to data lake
- Need fastest copy performance

When to use Data Flow:
- Complex transformations needed
- Multiple source joins
- Business logic in transformation

5. What are Linked Services and Datasets?

Linked Services:
Define the connection information needed to connect to external resources. Similar to connection strings.

// Azure Blob Storage Linked Service
{
    "name": "AzureBlobStorage",
    "type": "AzureBlobStorage",
    "typeProperties": {
        "connectionString": {
            "type": "SecureString",
            "value": "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=***"
        }
    }
}

// Azure SQL Database Linked Service
{
    "name": "AzureSqlDatabase",
    "type": "AzureSqlDatabase",
    "typeProperties": {
        "connectionString": "Server=server.database.windows.net;Database=mydb;User ID=user;Password=***;",
        "encryptedCredential": "..."
    }
}

Datasets:
Represent the structure of data within the data stores and point to specific files, tables, or containers.

// Parquet Dataset
{
    "name": "ParquetDataset",
    "type": "Parquet",
    "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "location": {
            "type": "AzureBlobStorageLocation",
            "container": "data",
            "folderPath": "input",
            "fileName": "*.parquet"
        },
        "compressionCodec": "snappy"
    },
    "schema": [
        {"name": "id", "type": "INT32"},
        {"name": "name", "type": "UTF8"}
    ]
}




6. How do you implement incremental data loading?

Method 1: Watermark Pattern
// 1. Store watermark value in control table
// 2. Lookup current watermark
// 3. Copy data where LastModified > watermark
// 4. Update watermark

// Pipeline activities:
{
    "activities": [
        {
            "name": "GetWatermark",
            "type": "Lookup",
            "typeProperties": {
                "source": {"type": "SqlSource", "sqlReaderQuery": "SELECT watermark FROM ControlTable WHERE TableName='Orders'"}
            }
        },
        {
            "name": "GetMaxDate",
            "type": "Lookup",
            "typeProperties": {
                "source": {"type": "SqlSource", "sqlReaderQuery": "SELECT MAX(LastModified) as MaxDate FROM Orders"}
            }
        },
        {
            "name": "CopyIncrementalData",
            "type": "Copy",
            "dependsOn": [...],
            "typeProperties": {
                "source": {
                    "type": "SqlSource",
                    "sqlReaderQuery": "SELECT * FROM Orders WHERE LastModified > '@{activity('GetWatermark').output.firstRow.watermark}' AND LastModified <= '@{activity('GetMaxDate').output.firstRow.MaxDate}'"
                }
            }
        },
        {
            "name": "UpdateWatermark",
            "type": "StoredProcedure",
            "dependsOn": [{"activity": "CopyIncrementalData", "dependencyConditions": ["Succeeded"]}],
            "typeProperties": {
                "storedProcedureName": "UpdateWatermark",
                "storedProcedureParameters": {
                    "TableName": {"value": "Orders"},
                    "NewWatermark": {"value": "@activity('GetMaxDate').output.firstRow.MaxDate"}
                }
            }
        }
    ]
}

Method 2: Change Data Capture (CDC)
- Native CDC connector for SQL Server, Oracle
- Track inserts, updates, deletes

Method 3: Tumbling Window Trigger
- Process data in time-based windows
- Automatic watermark management
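The watermark pattern above can be sketched outside ADF as plain SQL logic. This minimal Python example uses an in-memory SQLite database as a stand-in for both the source and the control table; the `ControlTable`, `Orders`, and `LastModified` names mirror the pipeline JSON above, and the data is made up for illustration.

```python
import sqlite3

# In-memory database standing in for the source table and the control table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ControlTable (TableName TEXT PRIMARY KEY, Watermark TEXT);
CREATE TABLE Orders (Id INTEGER, LastModified TEXT);
INSERT INTO ControlTable VALUES ('Orders', '2024-01-01');
INSERT INTO Orders VALUES (1, '2023-12-31'), (2, '2024-01-15'), (3, '2024-02-01');
""")

def incremental_load(conn, table):
    # 1. Lookup the current watermark from the control table
    (old_wm,) = conn.execute(
        "SELECT Watermark FROM ControlTable WHERE TableName = ?", (table,)).fetchone()
    # 2. Lookup the new high-water mark in the source
    (new_wm,) = conn.execute(f"SELECT MAX(LastModified) FROM {table}").fetchone()
    # 3. Copy only rows in the (old watermark, new watermark] window
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE LastModified > ? AND LastModified <= ?",
        (old_wm, new_wm)).fetchall()
    # 4. Advance the watermark only after the copy succeeds
    conn.execute("UPDATE ControlTable SET Watermark = ? WHERE TableName = ?",
                 (new_wm, table))
    return rows

changed = incremental_load(conn, "Orders")
print(changed)  # only rows 2 and 3 — row 1 predates the stored watermark
```

Running it again returns nothing, because the watermark has advanced past all existing rows — the same idempotent behavior the ADF pipeline relies on.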

7. What are the different trigger types in ADF?

1. Schedule Trigger:
{
    "name": "DailyTrigger",
    "type": "ScheduleTrigger",
    "typeProperties": {
        "recurrence": {
            "frequency": "Day",
            "interval": 1,
            "startTime": "2024-01-01T06:00:00Z",
            "endTime": "2024-12-31T06:00:00Z",
            "timeZone": "UTC",
            "schedule": {
                "hours": [6],
                "minutes": [0]
            }
        }
    },
    "pipelines": [{"pipelineReference": {"referenceName": "DailyETL"}}]
}

2. Tumbling Window Trigger:
{
    "name": "HourlyWindow",
    "type": "TumblingWindowTrigger",
    "typeProperties": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2024-01-01T00:00:00Z",
        "delay": "00:15:00",
        "maxConcurrency": 10,
        "retryPolicy": {"count": 3, "intervalInSeconds": 30}
    }
}
// Access window times: @trigger().outputs.windowStartTime, @trigger().outputs.windowEndTime

3. Event-Based Trigger:
{
    "name": "BlobCreatedTrigger",
    "type": "BlobEventsTrigger",
    "typeProperties": {
        "blobPathBeginsWith": "/container/input/",
        "blobPathEndsWith": ".csv",
        "events": ["Microsoft.Storage.BlobCreated"],
        "scope": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}"
    }
}
// Access file info: @triggerBody().fileName, @triggerBody().folderPath

4. Custom Event Trigger:
- Triggered by custom events in Event Grid

8. How do you handle errors and retries in ADF?

Activity-Level Retry:
{
    "name": "CopyActivity",
    "type": "Copy",
    "policy": {
        "timeout": "01:00:00",
        "retry": 3,
        "retryIntervalInSeconds": 30,
        "secureOutput": false
    }
}
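The `retry`/`retryIntervalInSeconds` policy above behaves like a simple retry loop around the activity. A minimal sketch of that semantics; `flaky_copy` is a hypothetical activity invented for the example:

```python
import time

def run_with_policy(activity, retry=3, retry_interval_seconds=30):
    """Call `activity` up to 1 + `retry` times, sleeping between attempts,
    mirroring ADF's activity-level retry policy."""
    last_error = None
    for attempt in range(1 + retry):
        try:
            return activity()
        except Exception as exc:      # ADF retries on any activity failure
            last_error = exc
            if attempt < retry:
                time.sleep(retry_interval_seconds)
    raise last_error                  # all attempts exhausted

# Hypothetical flaky activity: fails twice, then succeeds.
calls = {"n": 0}
def flaky_copy():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "rows copied"

print(run_with_policy(flaky_copy, retry=3, retry_interval_seconds=0))
```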

Error Handling with Dependencies:
{
    "activities": [
        {
            "name": "MainActivity",
            "type": "Copy",
            ...
        },
        {
            "name": "OnSuccess",
            "type": "SqlServerStoredProcedure",
            "dependsOn": [{"activity": "MainActivity", "dependencyConditions": ["Succeeded"]}]
        },
        {
            "name": "OnFailure",
            "type": "WebActivity",
            "dependsOn": [{"activity": "MainActivity", "dependencyConditions": ["Failed"]}],
            "typeProperties": {
                "url": "https://hooks.slack.com/services/...",
                "method": "POST",
                "body": {"text": "Pipeline failed: @{pipeline().Pipeline}"}
            }
        },
        {
            "name": "OnCompletion",
            "type": "SetVariable",
            "dependsOn": [{"activity": "MainActivity", "dependencyConditions": ["Completed"]}]
        }
    ]
}

Try-Catch Pattern with Execute Pipeline:
// Main pipeline calls child pipeline
// If child fails, catch and handle error
{
    "name": "ExecuteChildPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {"referenceName": "ChildPipeline"},
        "waitOnCompletion": true
    }
}
// On failure path: log error, send notification, set variable for downstream

9. What is Mapping Data Flow vs Wrangling Data Flow?

Aspect    | Mapping Data Flow        | Wrangling Data Flow
Purpose   | Production ETL           | Data preparation
Interface | Visual canvas            | Power Query M
Scale     | Spark clusters           | Spark (limited)
Users     | Data engineers           | Data analysts
Debugging | Data preview, debug mode | Power Query preview
Status    | GA, recommended          | Preview

Note: Wrangling Data Flow is being deprecated. Use Mapping Data Flow or Power Query in Dataflows Gen2 (Fabric).

10. How do you parameterize pipelines in ADF?

Pipeline Parameters:
// Define parameters
{
    "name": "ParameterizedPipeline",
    "properties": {
        "parameters": {
            "inputPath": {"type": "String", "defaultValue": "/default/input"},
            "outputPath": {"type": "String"},
            "date": {"type": "String"},
            "isFullLoad": {"type": "Bool", "defaultValue": false}
        },
        "activities": [
            {
                "name": "CopyData",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": "@pipeline().parameters.isFullLoad"
                    }
                },
                "inputs": [{
                    "referenceName": "BlobDataset",
                    "parameters": {"folderPath": "@pipeline().parameters.inputPath"}
                }]
            }
        ]
    }
}

Dataset Parameters:
// Parameterized dataset
{
    "name": "DynamicDataset",
    "properties": {
        "parameters": {
            "folderPath": {"type": "String"},
            "fileName": {"type": "String"}
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "data",
                "folderPath": "@dataset().folderPath",
                "fileName": "@dataset().fileName"
            }
        }
    }
}

11. Explain ADF expressions and functions.

System Variables:
@pipeline().DataFactory        // Factory name
@pipeline().Pipeline           // Pipeline name
@pipeline().RunId              // Current run ID
@pipeline().TriggerType        // Manual, Schedule, etc.
@pipeline().TriggerId          // Trigger ID
@pipeline().TriggerName        // Trigger name
@pipeline().TriggerTime        // Trigger time

@activity('ActivityName').output.rowsCopied  // Activity output
@item()                        // Current item in ForEach
@variables('varName')          // Variable value

Common Functions:
// String functions
@concat('Hello', ' ', 'World')
@substring('Hello World', 0, 5)
@replace('Hello', 'H', 'J')
@toUpper('hello')
@split('a,b,c', ',')

// Date functions
@utcNow()
@formatDateTime(utcNow(), 'yyyy-MM-dd')
@addDays(utcNow(), -1)
@startOfMonth(utcNow())
@dayOfWeek(utcNow())

// Logical functions
@if(equals(1, 1), 'yes', 'no')
@coalesce(null, null, 'default')
@empty('')  // true

// Conversion functions
@int('123')
@string(123)
@json('{"key": "value"}')

// Collection functions
@length(array)
@first(array)
@last(array)
@take(array, 3)
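Many of these functions have direct Python analogs, which can be handy for sanity-checking the value an expression should produce; the mappings below are rough equivalents, not exact ADF semantics:

```python
from datetime import datetime, timedelta, timezone

# String analogs of concat, substring, split
greeting = "".join(["Hello", " ", "World"])   # @concat('Hello', ' ', 'World')
prefix = greeting[0:5]                        # @substring('Hello World', 0, 5)
parts = "a,b,c".split(",")                    # @split('a,b,c', ',')

# Date analogs of utcNow, formatDateTime, addDays
now = datetime.now(timezone.utc)              # @utcNow()
today = now.strftime("%Y-%m-%d")              # @formatDateTime(..., 'yyyy-MM-dd')
yesterday = now - timedelta(days=1)           # @addDays(utcNow(), -1)

# Logical analogs of if, coalesce, empty
label = "yes" if 1 == 1 else "no"             # @if(equals(1, 1), 'yes', 'no')
value = next((v for v in (None, None, "default") if v is not None), None)  # @coalesce
is_empty = "" == ""                           # @empty('')

print(greeting, prefix, parts, label, value, is_empty)
```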

12. How do you implement CI/CD for Azure Data Factory?

1. Git Integration:
- Connect ADF to Azure DevOps or GitHub
- Collaboration branch (main), publish branch (adf_publish)
- Feature branches for development

2. ARM Template Deployment:
// Azure DevOps Pipeline
trigger:
  branches:
    include:
      - adf_publish

pool:
  vmImage: 'ubuntu-latest'

stages:
- stage: DeployDev
  jobs:
  - job: Deploy
    steps:
    - task: AzureResourceManagerTemplateDeployment@3
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: 'AzureConnection'
        resourceGroupName: 'rg-adf-dev'
        location: 'East US'
        templateLocation: 'Linked artifact'
        csmFile: '$(Build.SourcesDirectory)/ARMTemplateForFactory.json'
        csmParametersFile: '$(Build.SourcesDirectory)/ARMTemplateParametersForFactory.json'
        overrideParameters: '-factoryName "adf-dev" -LS_Blob_connectionString "$(BlobConnectionString)"'

3. Pre/Post Deployment Script:
# Stop triggers before deployment
$triggers = Get-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf
foreach ($trigger in $triggers) {
    Stop-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf -Name $trigger.Name
}

# Deploy ARM template...

# Start triggers after deployment
foreach ($trigger in $triggers) {
    Start-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf -Name $trigger.Name
}

13. What are ADF global parameters?

Global parameters are constants that can be referenced across all pipelines in a data factory.

// Define global parameters in ADF
{
    "globalParameters": {
        "environment": {"type": "String", "value": "production"},
        "dataLakeAccountName": {"type": "String", "value": "prodlake"},
        "retryCount": {"type": "Int", "value": 3}
    }
}

// Reference in pipeline
@pipeline().globalParameters.environment
@pipeline().globalParameters.dataLakeAccountName

// Override during deployment (ARM parameter)
"parameters": {
    "globalParameters": {
        "value": {
            "environment": {"type": "String", "value": "[parameters('environment')]"}
        }
    }
}

Use Cases:
- Environment-specific configurations
- Common paths or connection info
- Shared settings across pipelines

14. How do you monitor and troubleshoot ADF pipelines?

1. Monitor Hub:
- Pipeline runs history
- Activity runs details
- Trigger runs

2. Diagnostic Settings:
// Send logs to Log Analytics
az monitor diagnostic-settings create --name adf-diagnostics \
  --resource /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.DataFactory/factories/{adf} \
  --logs '[{"category": "PipelineRuns", "enabled": true}, {"category": "ActivityRuns", "enabled": true}, {"category": "TriggerRuns", "enabled": true}]' \
  --workspace {log-analytics-workspace-id}

3. Log Analytics Queries:
// Failed pipeline runs
ADFPipelineRun
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, ErrorMessage
| order by TimeGenerated desc

// Long-running activities
ADFActivityRun
| where Duration > 3600  // > 1 hour
| project ActivityName, PipelineName, Duration, Status

4. Alerts:
- Set alerts on failed runs
- Monitor copy activity metrics
- Alert on data flow failures




15. What is the difference between ForEach and Until activities?

Aspect         | ForEach                      | Until
Iteration Type | Collection-based             | Condition-based
Termination    | After processing all items   | When condition becomes true
Parallelism    | Supports parallel execution  | Sequential only
Use Case       | Process list of files/tables | Polling, waiting for a condition

ForEach Example:
{
    "name": "ProcessTables",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.tableList", "type": "Expression"},
        "isSequential": false,
        "batchCount": 20,  // Max parallel
        "activities": [
            {
                "name": "CopyTable",
                "type": "Copy",
                "inputs": [{"parameters": {"tableName": "@item().name"}}]
            }
        ]
    }
}

Until Example:
{
    "name": "WaitForFile",
    "type": "Until",
    "typeProperties": {
        "expression": {"value": "@equals(activity('CheckFile').output.exists, true)"},
        "timeout": "01:00:00",
        "activities": [
            {
                "name": "CheckFile",
                "type": "GetMetadata",
                "typeProperties": {"fieldList": ["exists"]}
            },
            {
                "name": "Wait30Seconds",
                "type": "Wait",
                "typeProperties": {"waitTimeInSeconds": 30}
            }
        ]
    }
}
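The Until loop above is essentially poll-until-true with a timeout. A sketch of that control flow in Python; `check_file` is a stand-in for the GetMetadata `exists` check, and the wait interval plays the role of the Wait activity:

```python
import time

def until(condition, timeout_seconds, wait_seconds):
    """Re-evaluate `condition` until it returns True or the timeout expires,
    like ADF's Until activity wrapping a GetMetadata + Wait pair."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False      # ADF instead marks the activity as timed out
        time.sleep(wait_seconds)

# Stand-in for the file-exists check: becomes true on the third poll.
polls = {"n": 0}
def check_file():
    polls["n"] += 1
    return polls["n"] >= 3

print(until(check_file, timeout_seconds=5, wait_seconds=0))
```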

16. How do you call stored procedures in ADF?

Stored Procedure Activity:
{
    "name": "ExecuteStoredProcedure",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {"referenceName": "AzureSqlDatabase"},
    "typeProperties": {
        "storedProcedureName": "usp_ProcessData",
        "storedProcedureParameters": {
            "startDate": {"value": "@pipeline().parameters.startDate", "type": "DateTime"},
            "endDate": {"value": "@pipeline().parameters.endDate", "type": "DateTime"},
            "batchSize": {"value": "1000", "type": "Int32"}
        }
    }
}

Copy Activity with Stored Procedure (Sink):
{
    "name": "CopyWithSP",
    "type": "Copy",
    "typeProperties": {
        "source": {...},
        "sink": {
            "type": "SqlSink",
            "sqlWriterStoredProcedureName": "usp_UpsertData",
            "sqlWriterTableType": "DataTableType",
            "storedProcedureTableTypeParameterName": "DataTable"
        }
    }
}

-- Stored procedure for upsert
CREATE PROCEDURE usp_UpsertData @DataTable DataTableType READONLY
AS
BEGIN
    MERGE TargetTable AS target
    USING @DataTable AS source
    ON target.Id = source.Id
    WHEN MATCHED THEN UPDATE SET target.Value = source.Value
    WHEN NOT MATCHED THEN INSERT (Id, Value) VALUES (source.Id, source.Value);
END

17. What are data flow transformations?

Source/Sink Transformations:
- Source: Read from datasets
- Sink: Write to datasets

Multiple Input/Output:
- Join: Combine rows from two streams
- Union: Stack rows from multiple streams
- Lookup: Enrich data from reference
- Exists: Filter based on existence
- Conditional Split: Route rows to different outputs

Single Input:
- Select: Choose, rename, reorder columns
- Filter: Row-level filtering
- Derived Column: Create/modify columns
- Aggregate: Group and aggregate
- Sort: Order rows
- Pivot/Unpivot: Reshape data
- Window: Window functions (rank, running total)
- Alter Row: Mark for insert/update/delete/upsert
- Flatten: Expand arrays/nested structures
- Parse: Parse JSON/XML/text

// Derived Column expression examples
// String manipulation
upper(trim(CustomerName))

// Date formatting
toDate(OrderDate, 'yyyy-MM-dd')

// Conditional logic
iif(Amount > 1000, 'High', 'Low')

// Null handling
coalesce(Email, 'no-email@company.com')

// Complex calculation
round(Amount * (1 + TaxRate/100), 2)
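To make the single-input transformations concrete, here is a rough Python analog of a Derived Column, Filter, and Aggregate chain over a list of row dicts; the column names and data are invented for illustration, and the real thing runs as Spark jobs, not in-process loops:

```python
rows = [
    {"CustomerName": "  alice ", "Amount": 1500, "Region": "East"},
    {"CustomerName": "bob",      "Amount": 200,  "Region": "East"},
    {"CustomerName": "carol",    "Amount": 900,  "Region": "West"},
]

# Derived Column: upper(trim(CustomerName)) and iif(Amount > 1000, 'High', 'Low')
for r in rows:
    r["CustomerName"] = r["CustomerName"].strip().upper()
    r["Tier"] = "High" if r["Amount"] > 1000 else "Low"

# Filter: keep rows with Amount over 500
filtered = [r for r in rows if r["Amount"] > 500]

# Aggregate: group by Region, sum Amount
totals = {}
for r in filtered:
    totals[r["Region"]] = totals.get(r["Region"], 0) + r["Amount"]

print(filtered)
print(totals)
```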

18. How do you handle schema drift in ADF?

Schema drift occurs when source data schema changes (new columns, different types).

1. Enable Schema Drift in Data Flow:
// Source settings
{
    "allowSchemaDrift": true,
    "validateSchema": false
}

// Access drifted columns dynamically
byName('NewColumn')
byPosition(1)
columnNames()
columns()

2. Copy Activity Schema Mapping:
// Auto-mapping (allow changes)
"translator": {
    "type": "TabularTranslator",
    "mappings": [],  // Empty = auto-map all
    "collectionReference": ""
}

// Explicit mapping (strict)
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {"source": {"name": "Id"}, "sink": {"name": "Id", "type": "Int64"}},
        {"source": {"name": "Name"}, "sink": {"name": "CustomerName", "type": "String"}}
    ]
}

3. Late-Arriving Columns:
- Use derived column to handle missing columns
- Set default values for new columns
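The late-arriving-column tactic above boils down to projecting each drifting source row onto a fixed target schema: defaults fill missing columns, unexpected columns are dropped. A minimal sketch; the schema and rows are hypothetical:

```python
TARGET_SCHEMA = {"Id": 0, "Name": "", "Email": "unknown"}   # column -> default

def conform(row, schema=TARGET_SCHEMA):
    """Project a source row onto the target schema: missing columns get
    defaults, drifted (unexpected) columns are dropped."""
    return {col: row.get(col, default) for col, default in schema.items()}

old_row = {"Id": 1, "Name": "Alice"}                        # Email not yet in source
new_row = {"Id": 2, "Name": "Bob", "Email": "b@x.com",
           "Phone": "555"}                                  # Phone drifted in

print(conform(old_row))   # Email filled with its default
print(conform(new_row))   # Phone dropped
```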

19. What are managed virtual networks in ADF?

Managed Virtual Network provides network isolation for Azure Integration Runtime, ensuring data doesn't traverse the public internet.

Features:
- Private endpoints to Azure services
- No public IP for compute
- Traffic stays within Microsoft network

// Enable Managed VNET on Integration Runtime
{
    "name": "ManagedVNetIR",
    "type": "Managed",
    "managedVirtualNetwork": {
        "type": "ManagedVirtualNetworkReference",
        "referenceName": "default"
    }
}

// Create Managed Private Endpoint
{
    "name": "SqlPrivateEndpoint",
    "properties": {
        "privateLinkResourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Sql/servers/{server}",
        "groupId": "sqlServer",
        "fqdns": ["server.database.windows.net"]
    }
}

Supported Services:
- Azure Storage, Azure SQL, Cosmos DB
- Azure Synapse, Azure Key Vault
- Power BI, and more

20. How do you optimize ADF pipeline performance?

1. Copy Activity Optimization:
// Increase DIU (Data Integration Units)
"dataIntegrationUnits": 256,  // Max 256

// Enable parallel copy
"parallelCopies": 32,

// Enable staging for cross-region
"enableStaging": true,
"stagingSettings": {
    "linkedServiceName": {"referenceName": "StagingStorage"}
}

2. Data Flow Optimization:
// Increase core count
"dataFlowProperties": {
    "computeType": "MemoryOptimized",
    "coreCount": 16
}

// Enable TTL (Time to Live)
"timeToLive": 10  // Minutes to keep cluster warm

3. General Best Practices:
- Use appropriate IR location (close to data)
- Use Parquet format for intermediate storage
- Avoid unnecessary data movement
- Use ForEach with parallelism
- Partition data for parallel processing
- Enable compression for network transfer

4. Monitoring for Optimization:
- Check data read/written volumes
- Monitor queue times
- Analyze throughput metrics
- Identify bottleneck activities
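The "partition data for parallel processing" and `parallelCopies` ideas above amount to splitting the source into ranges and copying them concurrently. A toy sketch of that shape; `copy_partition` is a stand-in for a per-partition copy, and the numbers are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))                 # stand-in for a source table

def partition(rows, n):
    """Split rows into n roughly equal ranges, analogous to partitioning
    on a column so ADF can copy ranges in parallel."""
    size = (len(rows) + n - 1) // n
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def copy_partition(chunk):
    return len(chunk)                   # pretend this range was copied

with ThreadPoolExecutor(max_workers=4) as pool:   # ~parallelCopies
    copied = sum(pool.map(copy_partition, partition(data, 4)))

print(copied)
```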
