Top 20 Azure Data Factory Interview Questions and Answers
- What is Azure Data Factory?
- What are the key components of Azure Data Factory?
- What is Integration Runtime and its types?
- Explain the difference between Copy Activity and Data Flow.
- What are Linked Services and Datasets?
- How do you implement incremental data loading?
- What are the different trigger types in ADF?
- How do you handle errors and retries in ADF?
- What is Mapping Data Flow vs Wrangling Data Flow?
- How do you parameterize pipelines in ADF?
- Explain ADF expressions and functions.
- How do you implement CI/CD for Azure Data Factory?
- What are ADF global parameters?
- How do you monitor and troubleshoot ADF pipelines?
- What is the difference between ForEach and Until activities?
- How do you call stored procedures in ADF?
- What are data flow transformations?
- How do you handle schema drift in ADF?
- What are managed virtual networks in ADF?
- How do you optimize ADF pipeline performance?
1. What is Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.
Key Capabilities:
- ETL/ELT: Extract, transform, and load data
- 90+ Connectors: Cloud and on-premises sources
- Code-Free UI: Visual pipeline design
- Serverless: Auto-scale, pay-per-use
- SSIS Integration: Lift and shift SSIS packages
Use Cases:
- Data migration to cloud
- Data warehouse loading
- Data integration from multiple sources
- Big data processing orchestration
2. What are the key components of Azure Data Factory?
1. Pipelines:
- Logical grouping of activities
- Unit of execution
2. Activities:
- Data movement (Copy)
- Data transformation (Data Flow, HDInsight, Databricks)
- Control flow (ForEach, If, Switch, Wait)
3. Datasets:
- Named view of data
- Points to data in linked service
4. Linked Services:
- Connection strings to data stores
- Connection strings to compute
5. Triggers:
- Schedule pipeline execution
- Event-driven execution
6. Integration Runtime:
- Compute infrastructure for activities
Pipeline JSON structure:
```json
{
  "name": "CopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [{"referenceName": "BlobDataset", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SQLDataset", "type": "DatasetReference"}],
        "typeProperties": {
          "source": {"type": "BlobSource"},
          "sink": {"type": "SqlSink"}
        }
      }
    ]
  }
}
```
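The trigger component can be illustrated with a schedule trigger that runs the pipeline above once a day (names and start time are illustrative):
```json
{
  "name": "DailyTrigger",
  "type": "ScheduleTrigger",
  "typeProperties": {
    "recurrence": {
      "frequency": "Day",
      "interval": 1,
      "startTime": "2024-01-01T00:00:00Z",
      "timeZone": "UTC"
    }
  },
  "pipelines": [
    {
      "pipelineReference": {
        "referenceName": "CopyPipeline",
        "type": "PipelineReference"
      }
    }
  ]
}
```
Note that a trigger must be published and started before it fires; manual debug runs do not use triggers.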
3. What is Integration Runtime and its types?
Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments.
Types of Integration Runtime:
| Type | Description | Use Case |
|---|---|---|
| Azure IR | Microsoft-managed, public cloud | Cloud-to-cloud data movement |
| Self-Hosted IR | Customer-managed, on-premises | On-prem to cloud, private networks |
| Azure-SSIS IR | Managed SSIS environment | Running SSIS packages in Azure |
Self-Hosted IR configuration:
```json
{
  "name": "SelfHostedIR",
  "type": "SelfHosted",
  "typeProperties": {}
}
```
Azure IR with a managed virtual network:
```json
{
  "name": "ManagedVNetIR",
  "type": "Managed",
  "typeProperties": {
    "computeProperties": {
      "location": "East US",
      "dataFlowProperties": {
        "computeType": "General",
        "coreCount": 8,
        "timeToLive": 10
      }
    }
  },
  "managedVirtualNetwork": {
    "referenceName": "default",
    "type": "ManagedVirtualNetworkReference"
  }
}
```
4. Explain the difference between Copy Activity and Data Flow.
| Aspect | Copy Activity | Data Flow |
|---|---|---|
| Purpose | Data movement (ETL/ELT staging) | Data transformation |
| Transformations | Limited (column mapping, type conversion) | Full transformations (join, aggregate, pivot) |
| Compute | Azure IR or Self-Hosted IR | Spark clusters (auto-managed) |
| Coding | No code | No code (visual designer) |
| Performance | Fastest for simple copy | Better for complex transformations |
| Cost | DIU-based pricing | vCore-hours pricing |
When to use Copy Activity:
- Simple data movement
- Staging to data lake
- Need fastest copy performance
When to use Data Flow:
- Complex transformations needed
- Multiple source joins
- Business logic in transformation
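The "limited transformations" available in Copy Activity amount to things like explicit column mapping and type conversion via a translator. A sketch of a column-renaming mapping (column names are illustrative):
```json
"typeProperties": {
  "source": {"type": "BlobSource"},
  "sink": {"type": "SqlSink"},
  "translator": {
    "type": "TabularTranslator",
    "mappings": [
      {"source": {"name": "cust_id"}, "sink": {"name": "CustomerId"}},
      {"source": {"name": "cust_name"}, "sink": {"name": "CustomerName"}}
    ]
  }
}
```
Anything beyond renaming and casting, such as joins or aggregations, requires Data Flow or an external compute activity.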
5. What are Linked Services and Datasets?
Linked Services:
Linked services define the connection information needed to connect to external resources, much like connection strings.
Azure Blob Storage linked service:
```json
{
  "name": "AzureBlobStorage",
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": {
      "type": "SecureString",
      "value": "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=***"
    }
  }
}
```
Azure SQL Database linked service:
```json
{
  "name": "AzureSqlDatabase",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "Server=server.database.windows.net;Database=mydb;User ID=user;Password=***;",
    "encryptedCredential": "..."
  }
}
```
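In practice, embedding account keys or passwords directly in a linked service is discouraged; secrets are typically pulled from Azure Key Vault instead. A sketch of a Key Vault-backed connection string (the linked service name `AzureKeyVaultLS` and the secret name are hypothetical):
```json
{
  "name": "AzureBlobStorageKV",
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "AzureKeyVaultLS",
        "type": "LinkedServiceReference"
      },
      "secretName": "blob-connection-string"
    }
  }
}
```
This requires a separate Azure Key Vault linked service and granting the data factory's managed identity access to the vault's secrets.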
Datasets:
Datasets represent the structure of data within the data stores and point to specific files, tables, or containers.
Parquet dataset:
```json
{
  "name": "ParquetDataset",
  "type": "Parquet",
  "linkedServiceName": {
    "referenceName": "AzureBlobStorage",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "location": {
      "type": "AzureBlobStorageLocation",
      "container": "data",
      "folderPath": "input",
      "fileName": "*.parquet"
    },
    "compressionCodec": "snappy"
  },
  "schema": [
    {"name": "id", "type": "INT32"},
    {"name": "name", "type": "UTF8"}
  ]
}
```