Top 20 Azure Data Factory Real-Time Scenario Interview Questions
- How do you handle incremental data loading from a SQL database?
- How do you handle slowly changing dimensions (SCD) Type 2?
- How do you process files that arrive in batches?
- How do you handle data quality checks in ADF?
- How do you implement a data lake architecture using ADF?
- How do you handle API pagination while extracting data?
- How do you implement dynamic pipelines based on metadata?
- How do you handle large file processing?
- How do you implement error handling and retry logic?
- How do you synchronize data between multiple databases?
- How do you handle schema drift in source data?
- How do you implement data archival strategy?
- How do you handle time zone conversions in data pipelines?
- How do you implement CDC (Change Data Capture)?
- How do you handle hierarchical/nested JSON data?
- How do you implement parallel processing for multiple sources?
- How do you handle PII data masking?
- How do you implement data reconciliation?
- How do you migrate on-premises SSIS packages to ADF?
- How do you optimize ADF pipeline performance?
1. How do you handle incremental data loading from a SQL database?
Scenario: You need to load only new or modified records from a SQL Server database daily.
Solution - Watermark Pattern:
-- Watermark Table
CREATE TABLE WatermarkTable (
TableName VARCHAR(100),
WatermarkColumn VARCHAR(100),
WatermarkValue DATETIME
);
-- Source Query (dynamic)
SELECT * FROM Orders
WHERE ModifiedDate > '@{activity('Lookup1').output.firstRow.WatermarkValue}'
AND ModifiedDate <= '@{pipeline().parameters.NewWatermarkValue}'
Pipeline Structure:
Pipeline: IncrementalLoad
├── Lookup (Get Old Watermark)
│   └── Query: SELECT WatermarkValue FROM WatermarkTable WHERE TableName='Orders'
├── Lookup (Get New Watermark)
│   └── Query: SELECT MAX(ModifiedDate) AS NewWatermark FROM Orders
├── Copy Activity (Copy Delta Data)
│   ├── Source Query: SELECT * FROM Orders WHERE ModifiedDate > oldWatermark AND ModifiedDate <= newWatermark
│   └── Sink: ADLS/Destination
└── Stored Procedure (Update Watermark)
    └── UPDATE WatermarkTable SET WatermarkValue = newWatermark WHERE TableName='Orders'
For SCD Type 1 (Update existing):
Use Data Flow with AlterRow transformation to handle Upsert logic.
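The watermark pattern above can be sketched outside ADF as plain logic. This is a minimal Python simulation, assuming an in-memory stand-in for the watermark table and the Orders source; the names are illustrative, not ADF APIs:

```python
from datetime import datetime

# Hypothetical stand-ins for WatermarkTable and the Orders source table.
watermark = {"Orders": datetime(2024, 1, 1)}

orders = [
    {"OrderID": 1, "ModifiedDate": datetime(2024, 1, 1)},
    {"OrderID": 2, "ModifiedDate": datetime(2024, 2, 10)},
    {"OrderID": 3, "ModifiedDate": datetime(2024, 3, 5)},
]

def incremental_load(table_name, rows):
    """Mimics the pipeline: look up the old watermark, compute the new one,
    copy only the delta, then persist the new watermark."""
    old_wm = watermark[table_name]                      # Lookup (old watermark)
    new_wm = max(r["ModifiedDate"] for r in rows)       # Lookup (new watermark)
    delta = [r for r in rows
             if old_wm < r["ModifiedDate"] <= new_wm]   # Copy activity filter
    watermark[table_name] = new_wm                      # Stored procedure update
    return delta

delta = incremental_load("Orders", orders)
```

Note the half-open interval (`> old`, `<= new`): it guarantees a record is never copied twice and never skipped between runs, which is the whole point of the pattern.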
2. How do you handle slowly changing dimensions (SCD) Type 2?
Scenario: Track historical changes to customer data with effective dates.
Solution - Data Flow for SCD Type 2:
Data Flow: SCDType2_Customer
├── Source (New Customer Data)
├── Source (Existing Dimension - Active Records)
│   └── Filter: IsActive = 1
├── Lookup (Match on Business Key)
│   └── Left: New Data, Right: Existing
├── Conditional Split
│   ├── NewRecords: isNull(ExistingKey)
│   ├── ChangedRecords: hash(NewData) != hash(ExistingData)
│   └── UnchangedRecords: Default
├── Derived Column (For New Records)
│   ├── SurrogateKey: autoIncrement()
│   ├── EffectiveStartDate: currentDate()
│   ├── EffectiveEndDate: toDate('9999-12-31')
│   └── IsActive: 1
├── Union (New + Changed Records)
├── Derived Column (Expire Old Records)
│   ├── EffectiveEndDate: currentDate()
│   └── IsActive: 0
└── Sink (Dimension Table - Insert/Update)
-- Dimension Table Structure
CREATE TABLE DimCustomer (
SurrogateKey INT IDENTITY PRIMARY KEY,
CustomerID INT, -- Business Key
CustomerName VARCHAR(100),
Address VARCHAR(200),
EffectiveStartDate DATE,
EffectiveEndDate DATE,
IsActive BIT
);
AlterRow Expression:
-- In AlterRow transformation
Insert if: isNull(ExistingSurrogateKey)  -- New records
Update if: IsActive == 0                 -- Expire old active record
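To make the expire-and-insert mechanics concrete, here is a simplified Python sketch of the SCD Type 2 merge against the DimCustomer structure above. It is an assumption-laden illustration: change detection is reduced to comparing Address only (the data flow would hash all tracked columns), and the surrogate key is assigned in code rather than by IDENTITY:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # sentinel EffectiveEndDate for active rows

def apply_scd2(dimension, incoming, today):
    """Expire the active version of a changed customer and insert a new
    version; brand-new business keys are simply inserted."""
    active = {r["CustomerID"]: r for r in dimension if r["IsActive"]}
    next_key = max((r["SurrogateKey"] for r in dimension), default=0) + 1
    for row in incoming:
        existing = active.get(row["CustomerID"])
        if existing and existing["Address"] == row["Address"]:
            continue                          # unchanged -> no action
        if existing:                          # changed -> expire old version
            existing["IsActive"] = 0
            existing["EffectiveEndDate"] = today
        dimension.append({                    # new active version
            "SurrogateKey": next_key,
            "CustomerID": row["CustomerID"],
            "Address": row["Address"],
            "EffectiveStartDate": today,
            "EffectiveEndDate": HIGH_DATE,
            "IsActive": 1,
        })
        next_key += 1
    return dimension

dim = [{"SurrogateKey": 1, "CustomerID": 100, "Address": "Old St",
        "EffectiveStartDate": date(2023, 1, 1),
        "EffectiveEndDate": HIGH_DATE, "IsActive": 1}]
apply_scd2(dim, [{"CustomerID": 100, "Address": "New St"}], date(2024, 6, 1))
```

After the merge the dimension holds both versions of customer 100: the expired row with a closed date range, and the new active row, which is exactly what the Conditional Split + AlterRow combination produces.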
3. How do you process files that arrive in batches?
Scenario: Multiple CSV files arrive in a folder; process all files and move to archive.
Pipeline: ProcessBatchFiles
├── Get Metadata (List Files)
│   ├── Dataset: Source Folder
│   └── Field List: childItems
├── Filter (Only CSV Files)
│   └── Condition: @endsWith(item().name, '.csv')
├── ForEach (Process Each File)
│   ├── Items: @activity('Filter1').output.value
│   ├── Sequential: false (parallel)
│   └── Activities:
│       ├── Copy Activity
│       │   ├── Source: @item().name
│       │   └── Sink: Destination
│       ├── Copy Activity (Move to Archive)
│       │   ├── Source: @item().name
│       │   └── Sink: Archive/@item().name
│       └── Delete Activity
│           └── Delete source file after archive
└── Send Email (Notification)
    └── Summary of processed files
-- Dynamic File Path Expression
@concat('raw/', item().name)
-- Archive Path with Timestamp
@concat('archive/', formatDateTime(utcNow(), 'yyyy/MM/dd'), '/', item().name)
Error Handling for Individual Files:
ForEach Settings:
├── Batch Count: 20 (process 20 files in parallel)
├── Sequential: false
└── Activities:
    └── Try (Execute Pipeline - Process Single File)
        ├── On Success: Archive file
        └── On Failure: Move to Error folder, Log error
4. How do you handle data quality checks in ADF?
Scenario: Validate data before loading to destination; reject bad records.
Data Flow: DataQualityChecks
├── Source (Raw Data)
├── Derived Column (Quality Flags)
│   ├── IsEmailValid: regexMatch(Email, '^[A-Za-z0-9+_.-]+@(.+)$')
│   ├── IsDateValid: !isNull(toDate(DateString, 'yyyy-MM-dd'))
│   ├── IsAmountValid: Amount > 0 && Amount < 1000000
│   └── HasRequiredFields: !isNull(CustomerID) && !isNull(ProductID)
├── Derived Column (Quality Score)
│   └── QualityScore: iif(IsEmailValid, 1, 0) + iif(IsDateValid, 1, 0) + ...
├── Conditional Split
│   ├── ValidRecords: QualityScore == 4 (all checks passed)
│   ├── PartiallyValid: QualityScore >= 2
│   └── InvalidRecords: Default
├── Sink (Valid → Production Table)
├── Sink (PartiallyValid → Review Queue)
└── Sink (Invalid → Error Log Table)
-- Error Log Entry
ErrorTable:
├── RecordID
├── SourceFile
├── ErrorType
├── ErrorDetails
├── OriginalData (JSON)
└── ProcessedDate
Data Validation Rules Example:
-- Conditional Split expressions
Valid: !isNull(CustomerID) && length(CustomerID) == 10
    && !isNull(Email) && regexMatch(Email, '^[A-Za-z0-9+_.-]+@(.+)$')
    && Amount > 0

DuplicateCheck (using Window function):
    rowNumber = row_number() over(partition by CustomerID order by ModifiedDate desc)
    Keep only rowNumber == 1
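The flag-score-split pattern above is easy to verify in plain Python. This sketch reimplements the four quality flags and the three-way routing; the function name and the date check (a simple format regex instead of ADF's toDate parse) are assumptions for illustration:

```python
import re

def quality_route(record):
    """Compute the four quality flags, sum them into a score, and route
    the record the way the Conditional Split does."""
    checks = {
        "IsEmailValid": bool(re.match(r"^[A-Za-z0-9+_.-]+@(.+)$",
                                      record.get("Email") or "")),
        "IsDateValid": bool(re.match(r"^\d{4}-\d{2}-\d{2}$",
                                     record.get("DateString") or "")),
        "IsAmountValid": 0 < (record.get("Amount") or 0) < 1_000_000,
        "HasRequiredFields": (record.get("CustomerID") is not None
                              and record.get("ProductID") is not None),
    }
    score = sum(checks.values())       # booleans sum to the quality score
    if score == 4:
        return "Valid", score          # -> Production Table
    if score >= 2:
        return "Review", score         # -> Review Queue
    return "Invalid", score            # -> Error Log Table

route, score = quality_route({"Email": "a@b.com", "DateString": "2024-01-01",
                              "Amount": 10, "CustomerID": 1, "ProductID": 2})
```

Scoring instead of a single pass/fail check is what lets partially valid records land in a review queue rather than being silently dropped.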
5. How do you implement a data lake architecture using ADF?
Scenario: Implement medallion architecture (Bronze/Silver/Gold) for analytics.
Data Lake Structure:
├── Bronze (Raw Layer)
│   ├── /bronze/{source}/{table}/{year}/{month}/{day}/
│   └── Format: Raw JSON/CSV/Parquet as-is
├── Silver (Cleansed Layer)
│   ├── /silver/{domain}/{entity}/
│   └── Format: Delta/Parquet, deduplicated, typed
└── Gold (Curated Layer)
    ├── /gold/{subject_area}/{table}/
    └── Format: Delta/Parquet, aggregated, business logic
Pipeline: MedallionArchitecture
├── Pipeline: Bronze_Ingestion
│   ├── Copy Raw Data (as-is)
│   └── Add metadata columns (source, ingestionTime)
├── Pipeline: Silver_Processing
│   ├── Data Flow: Cleanse & Transform
│   │   ├── Remove duplicates
│   │   ├── Apply data types
│   │   ├── Handle nulls
│   │   └── Standardize formats
│   └── Write Delta format (for time travel)
└── Pipeline: Gold_Aggregation
    ├── Data Flow: Business Logic
    │   ├── Joins across entities
    │   ├── Aggregations
    │   └── KPI calculations
    └── Write to Gold layer
-- File naming convention
@concat(
'bronze/sales/orders/',
formatDateTime(utcNow(),'yyyy'), '/',
formatDateTime(utcNow(),'MM'), '/',
formatDateTime(utcNow(),'dd'), '/',
'orders_',
formatDateTime(utcNow(),'yyyyMMddHHmmss'),
'.parquet'
)
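The @concat expression above can be generalized into a small path builder, which is handy for unit-testing a naming convention before wiring it into a pipeline. A Python sketch, with `bronze_path` and its parameters as illustrative names:

```python
from datetime import datetime, timezone

def bronze_path(source, table, ts=None):
    """Build the partitioned bronze-layer file path following the
    /bronze/{source}/{table}/{year}/{month}/{day}/ convention."""
    ts = ts or datetime.now(timezone.utc)   # mirrors utcNow() in the expression
    return (f"bronze/{source}/{table}/"
            f"{ts:%Y}/{ts:%m}/{ts:%d}/"
            f"{table}_{ts:%Y%m%d%H%M%S}.parquet")

path = bronze_path("sales", "orders",
                   datetime(2024, 5, 1, 13, 45, 9, tzinfo=timezone.utc))
```

Embedding the timestamp in the file name (not just the folder) means repeated runs on the same day never overwrite each other, which preserves the bronze layer's append-only, as-is guarantee.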