

Top Azure Data Factory Real-Time Scenario Interview Questions (2026) | JavaInuse

Top 20 Azure Data Factory Real-Time Scenario Interview Questions


  1. How do you handle incremental data loading from a SQL database?
  2. How do you handle slowly changing dimensions (SCD) Type 2?
  3. How do you process files that arrive in batches?
  4. How do you handle data quality checks in ADF?
  5. How do you implement a data lake architecture using ADF?
  6. How do you handle API pagination while extracting data?
  7. How do you implement dynamic pipelines based on metadata?
  8. How do you handle large file processing?
  9. How do you implement error handling and retry logic?
  10. How do you synchronize data between multiple databases?
  11. How do you handle schema drift in source data?
  12. How do you implement data archival strategy?
  13. How do you handle time zone conversions in data pipelines?
  14. How do you implement CDC (Change Data Capture)?
  15. How do you handle hierarchical/nested JSON data?
  16. How do you implement parallel processing for multiple sources?
  17. How do you handle PII data masking?
  18. How do you implement data reconciliation?
  19. How do you migrate on-premises SSIS packages to ADF?
  20. How do you optimize ADF pipeline performance?

Microsoft Azure Interview Questions

Comprehensive interview questions for Azure cloud services and data engineering roles.

1. How do you handle incremental data loading from a SQL database?

Scenario: You need to load only new or modified records from a SQL Server database daily.

Solution - Watermark Pattern:
-- Watermark Table
CREATE TABLE WatermarkTable (
    TableName VARCHAR(100),
    WatermarkColumn VARCHAR(100),
    WatermarkValue DATETIME
);

-- Source Query (dynamic)
SELECT * FROM Orders 
WHERE ModifiedDate > '@{activity('Lookup1').output.firstRow.WatermarkValue}'
  AND ModifiedDate <= '@{pipeline().parameters.NewWatermarkValue}'

Pipeline Structure:
Pipeline: IncrementalLoad
├── Lookup (Get Old Watermark)
│   └── Query: SELECT WatermarkValue FROM WatermarkTable WHERE TableName='Orders'
├── Lookup (Get New Watermark)
│   └── Query: SELECT MAX(ModifiedDate) as NewWatermark FROM Orders
├── Copy Activity (Copy Delta Data)
│   └── Source Query: SELECT * FROM Orders WHERE ModifiedDate > oldWatermark AND ModifiedDate <= newWatermark
│   └── Sink: ADLS/Destination
└── Stored Procedure (Update Watermark)
    └── UPDATE WatermarkTable SET WatermarkValue = newWatermark WHERE TableName='Orders'

For SCD Type 1 (overwrite in place, no history):
Use a Data Flow with an AlterRow transformation to implement the upsert logic.
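The watermark pattern above can be sketched in Python, with an in-memory SQLite database standing in for SQL Server; each step in the function maps to one of the four pipeline activities (the function and sample data are illustrative, not ADF code):

```python
import sqlite3

# In-memory stand-in for the source database plus the watermark table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (OrderID INTEGER, ModifiedDate TEXT);
CREATE TABLE WatermarkTable (TableName TEXT, WatermarkValue TEXT);
INSERT INTO Orders VALUES (1, '2024-01-01'), (2, '2024-01-05'), (3, '2024-01-10');
INSERT INTO WatermarkTable VALUES ('Orders', '2024-01-03');
""")

def incremental_load(conn, table):
    # Step 1: Lookup the old watermark.
    old_wm = conn.execute(
        "SELECT WatermarkValue FROM WatermarkTable WHERE TableName = ?", (table,)
    ).fetchone()[0]
    # Step 2: Lookup the new watermark from the source.
    new_wm = conn.execute(f"SELECT MAX(ModifiedDate) FROM {table}").fetchone()[0]
    # Step 3: Copy activity - select only the delta between the two watermarks.
    delta = conn.execute(
        f"SELECT * FROM {table} WHERE ModifiedDate > ? AND ModifiedDate <= ?",
        (old_wm, new_wm),
    ).fetchall()
    # Step 4: Stored Procedure - persist the new watermark for the next run.
    conn.execute(
        "UPDATE WatermarkTable SET WatermarkValue = ? WHERE TableName = ?",
        (new_wm, table),
    )
    return delta

rows = incremental_load(conn, "Orders")  # only orders modified after 2024-01-03
```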

2. How do you handle slowly changing dimensions (SCD) Type 2?

Scenario: Track historical changes to customer data with effective dates.

Solution - Data Flow for SCD Type 2:
Data Flow: SCDType2_Customer
├── Source (New Customer Data)
├── Source (Existing Dimension - Active Records)
│   └── Filter: IsActive = 1
├── Lookup (Match on Business Key)
│   └── Left: New Data, Right: Existing
├── Conditional Split
│   ├── NewRecords: isNull(ExistingKey)
│   ├── ChangedRecords: hash(NewData) != hash(ExistingData)
│   └── UnchangedRecords: Default
├── Derived Column (For New Records)
│   ├── SurrogateKey: autoIncrement()
│   ├── EffectiveStartDate: currentDate()
│   ├── EffectiveEndDate: toDate('9999-12-31')
│   └── IsActive: 1
├── Union (New + Changed Records)
├── Derived Column (Expire Old Records)
│   ├── EffectiveEndDate: currentDate()
│   └── IsActive: 0
└── Sink (Dimension Table - Insert/Update)

-- Dimension Table Structure
CREATE TABLE DimCustomer (
    SurrogateKey INT IDENTITY PRIMARY KEY,
    CustomerID INT,  -- Business Key
    CustomerName VARCHAR(100),
    Address VARCHAR(200),
    EffectiveStartDate DATE,
    EffectiveEndDate DATE,
    IsActive BIT
);

AlterRow Expression:
-- In AlterRow transformation
Insert if: isNull(ExistingSurrogateKey)  -- New records
Update if: IsActive == 0  -- Expire old active record
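The expire-and-insert logic of the data flow can be sketched in plain Python; the column names match the dimension table above, while the function itself is an illustrative stand-in for the Lookup / Conditional Split / AlterRow chain:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # EffectiveEndDate sentinel for active records

def apply_scd2(dimension, incoming, business_key, tracked_cols, today):
    """Expire changed active rows and append new versions (SCD Type 2)."""
    active = {r[business_key]: r for r in dimension if r["IsActive"] == 1}
    next_sk = max((r["SurrogateKey"] for r in dimension), default=0) + 1
    for row in incoming:
        existing = active.get(row[business_key])
        changed = existing is not None and any(
            existing[c] != row[c] for c in tracked_cols)
        if existing is not None and not changed:
            continue  # unchanged record: nothing to do
        if changed:
            existing["EffectiveEndDate"] = today  # expire the old version
            existing["IsActive"] = 0
        dimension.append({                        # insert the new version
            "SurrogateKey": next_sk,
            business_key: row[business_key],
            **{c: row[c] for c in tracked_cols},
            "EffectiveStartDate": today,
            "EffectiveEndDate": HIGH_DATE,
            "IsActive": 1,
        })
        next_sk += 1
    return dimension

dim = [{"SurrogateKey": 1, "CustomerID": "C001", "Address": "NYC",
        "EffectiveStartDate": date(2023, 1, 1), "EffectiveEndDate": HIGH_DATE,
        "IsActive": 1}]
out = apply_scd2(dim, [{"CustomerID": "C001", "Address": "LA"}],
                 "CustomerID", ["Address"], date(2024, 6, 1))
```

After the run the dimension holds two rows for C001: the expired NYC version and the active LA version.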

3. How do you process files that arrive in batches?

Scenario: Multiple CSV files arrive in a folder; process all files and move to archive.

Pipeline: ProcessBatchFiles
├── Get Metadata (List Files)
│   └── Dataset: Source Folder
│   └── Field List: childItems
├── Filter (Only CSV Files)
│   └── Condition: @endsWith(item().name, '.csv')
├── ForEach (Process Each File)
│   └── Items: @activity('Filter1').output.Value
│   └── Sequential: false (parallel)
│   └── Activities:
│       ├── Copy Activity
│       │   └── Source: @item().name
│       │   └── Sink: Destination
│       ├── Copy Activity (Move to Archive)
│       │   └── Source: @item().name
│       │   └── Sink: Archive/@item().name
│       └── Delete Activity
│           └── Delete source file after archive
└── Send Email (Notification)
    └── Summary of processed files

-- Dynamic File Path Expression
@concat('raw/', item().name)

-- Archive Path with Timestamp
@concat('archive/', formatDateTime(utcNow(), 'yyyy/MM/dd'), '/', item().name)

Error Handling for Individual Files:
ForEach Settings:
├── Batch Count: 20 (process 20 files in parallel)
├── Sequential: false
└── Activities:
    ├── Try (Execute Pipeline - Process Single File)
    │   └── On Success: Archive file
    │   └── On Failure: Move to Error folder, Log error
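A minimal Python sketch of the same flow: list the CSVs, process each, archive successes under a dated path, and quarantine failures (directory layout and the `load` callback are hypothetical):

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile

def process_batch(source_dir, archive_root, error_root, process):
    """List *.csv files, process each; archive successes, quarantine failures."""
    stamp = datetime.now(timezone.utc).strftime("%Y/%m/%d")  # dated archive path
    processed, failed = [], []
    for f in sorted(Path(source_dir).glob("*.csv")):  # Get Metadata + Filter
        try:
            process(f)                                 # per-file Copy activity
            dest = Path(archive_root) / stamp / f.name
            processed.append(f.name)
        except Exception:
            dest = Path(error_root) / f.name           # On Failure: Error folder
            failed.append(f.name)
        dest.parent.mkdir(parents=True, exist_ok=True)
        f.rename(dest)                                 # move = copy + delete
    return processed, failed

# Demo on a throwaway directory: one good file, one bad file, one non-CSV.
tmp = Path(tempfile.mkdtemp())
src, arch, err = tmp / "in", tmp / "archive", tmp / "errors"
src.mkdir()
(src / "a.csv").write_text("ok")
(src / "b.csv").write_text("boom")
(src / "note.txt").write_text("ignored")

def load(f):
    if f.read_text() == "boom":
        raise ValueError("bad file")

processed, failed = process_batch(src, arch, err, load)
```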

4. How do you handle data quality checks in ADF?

Scenario: Validate data before loading to destination; reject bad records.

Data Flow: DataQualityChecks
├── Source (Raw Data)
├── Derived Column (Quality Flags)
│   ├── IsEmailValid: regexMatch(Email, '^[A-Za-z0-9+_.-]+@(.+)$')
│   ├── IsDateValid: !isNull(toDate(DateString, 'yyyy-MM-dd'))
│   ├── IsAmountValid: Amount > 0 && Amount < 1000000
│   └── HasRequiredFields: !isNull(CustomerID) && !isNull(ProductID)
├── Derived Column (Quality Score)
│   └── QualityScore: iif(IsEmailValid, 1, 0) + iif(IsDateValid, 1, 0) + ...
├── Conditional Split
│   ├── ValidRecords: QualityScore == 4 (all checks passed)
│   ├── PartiallyValid: QualityScore >= 2
│   └── InvalidRecords: Default
├── Sink (Valid → Production Table)
├── Sink (PartiallyValid → Review Queue)
└── Sink (Invalid → Error Log Table)

-- Error Log Entry
ErrorTable: 
├── RecordID
├── SourceFile
├── ErrorType
├── ErrorDetails
├── OriginalData (JSON)
└── ProcessedDate

Data Validation Rules Example:
-- Conditional Split expressions
Valid: 
  !isNull(CustomerID) && 
  length(CustomerID) == 10 &&
  !isNull(Email) && 
  regexMatch(Email, '^[A-Za-z0-9+_.-]+@(.+)$') &&
  Amount > 0

DuplicateCheck (using Window function):
  rowNumber = row_number() over(partition by CustomerID order by ModifiedDate desc)
  Keep only rowNumber == 1
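The flag-score-split logic translates directly to Python; the check list mirrors the Derived Column flags above (field names as used in the data flow, thresholds illustrative):

```python
import re

# One entry per quality flag in the Derived Column step.
CHECKS = [
    lambda r: bool(re.match(r"^[A-Za-z0-9+_.-]+@(.+)$", r.get("Email") or "")),
    lambda r: bool(re.match(r"^\d{4}-\d{2}-\d{2}$", r.get("DateString") or "")),
    lambda r: 0 < (r.get("Amount") or 0) < 1_000_000,
    lambda r: r.get("CustomerID") is not None and r.get("ProductID") is not None,
]

def route(record):
    """Conditional Split: all checks pass -> valid, >= 2 -> review, else invalid."""
    score = sum(check(record) for check in CHECKS)
    if score == len(CHECKS):
        return "valid"
    return "review" if score >= 2 else "invalid"
```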

5. How do you implement a data lake architecture using ADF?

Scenario: Implement medallion architecture (Bronze/Silver/Gold) for analytics.

Data Lake Structure:
├── Bronze (Raw Layer)
│   └── /bronze/{source}/{table}/{year}/{month}/{day}/
│   └── Format: Raw JSON/CSV/Parquet as-is
├── Silver (Cleansed Layer)
│   └── /silver/{domain}/{entity}/
│   └── Format: Delta/Parquet, deduplicated, typed
└── Gold (Curated Layer)
    └── /gold/{subject_area}/{table}/
    └── Format: Delta/Parquet, aggregated, business logic

Pipeline: MedallionArchitecture
├── Pipeline: Bronze_Ingestion
│   ├── Copy Raw Data (as-is)
│   └── Add metadata columns (source, ingestionTime)
├── Pipeline: Silver_Processing
│   ├── Data Flow: Cleanse & Transform
│   │   ├── Remove duplicates
│   │   ├── Apply data types
│   │   ├── Handle nulls
│   │   └── Standardize formats
│   └── Write Delta format (for time travel)
└── Pipeline: Gold_Aggregation
    ├── Data Flow: Business Logic
    │   ├── Joins across entities
    │   ├── Aggregations
    │   └── KPI calculations
    └── Write to Gold layer

-- File naming convention
@concat(
    'bronze/sales/orders/',
    formatDateTime(utcNow(),'yyyy'), '/',
    formatDateTime(utcNow(),'MM'), '/',
    formatDateTime(utcNow(),'dd'), '/',
    'orders_',
    formatDateTime(utcNow(),'yyyyMMddHHmmss'),
    '.parquet'
)
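The file-naming expression above can be mirrored in Python to make the layout concrete (a sketch; the `bronze/{source}/{table}` convention is the one defined in the structure diagram):

```python
from datetime import datetime, timezone

def bronze_path(source, table, now=None):
    """Build the dated bronze-layer file path used by the ADF expression."""
    now = now or datetime.now(timezone.utc)
    return (f"bronze/{source}/{table}/{now:%Y}/{now:%m}/{now:%d}/"
            f"{table}_{now:%Y%m%d%H%M%S}.parquet")
```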




6. How do you handle API pagination while extracting data?

Scenario: Extract data from REST API that returns paginated results.

-- Method 1: Copy Activity with Pagination Rules
Copy Activity Settings:
├── Source: REST
│   └── Pagination Rules:
│       ├── AbsoluteUrl: $.nextLink (for next page URL in response)
│       OR
│       ├── QueryParameters.page: RANGE:1:100:1 (page 1 to 100)
│       OR
│       ├── QueryParameters.offset: RANGE:0:10000:100 (offset pagination)

-- Method 2: Until Loop for Complex Pagination
Pipeline: API_Pagination
├── Set Variable (hasMoreData = true, pageNumber = 1)
├── Until (hasMoreData == false)
│   └── Activities:
│       ├── Web Activity (Call API)
│       │   └── URL: @concat(baseUrl, '?page=', variables('pageNumber'))
│       ├── If Condition (Check for more data)
│       │   └── Condition: @greater(length(activity('Web1').output.data), 0)
│       │   └── True:
│       │       ├── Copy Data to Storage
│       │       └── Set Variable (pageNumber = pageNumber + 1)
│       │   └── False:
│       │       └── Set Variable (hasMoreData = false)

-- API Response handling
{
  "data": [...],
  "pagination": {
    "currentPage": 1,
    "totalPages": 50,
    "nextPageUrl": "https://api.example.com/data?page=2"
  }
}

-- Expression to check more pages
@less(
    int(activity('CallAPI').output.pagination.currentPage),
    int(activity('CallAPI').output.pagination.totalPages)
)
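The Until-loop pattern (Method 2) boils down to: call, append, check `currentPage` against `totalPages`, repeat. A sketch with a stub in place of the Web activity (the stub's response shape matches the sample payload above):

```python
def fetch_all(fetch_page):
    """Until-loop pagination: keep requesting while the API reports more pages."""
    page, rows = 1, []
    while True:
        body = fetch_page(page)          # Web activity call
        rows.extend(body["data"])
        pg = body["pagination"]
        if pg["currentPage"] >= pg["totalPages"]:  # the @less(...) check
            break
        page += 1
    return rows

# Stub API returning 3 pages of 2 records each (hypothetical data).
def fake_api(page):
    return {"data": [f"rec{page}a", f"rec{page}b"],
            "pagination": {"currentPage": page, "totalPages": 3}}

records = fetch_all(fake_api)
```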

7. How do you implement dynamic pipelines based on metadata?

Scenario: Single pipeline to load multiple tables based on configuration metadata.

-- Metadata Table
CREATE TABLE PipelineMetadata (
    SourceSchema VARCHAR(50),
    SourceTable VARCHAR(100),
    TargetSchema VARCHAR(50),
    TargetTable VARCHAR(100),
    LoadType VARCHAR(20),  -- 'Full' or 'Incremental'
    WatermarkColumn VARCHAR(100),
    IsActive BIT,
    LastLoadDate DATETIME
);

Pipeline: MetadataDrivenLoad
├── Lookup (Get Active Tables)
│   └── Query: SELECT * FROM PipelineMetadata WHERE IsActive = 1
├── ForEach (Process Each Table)
│   └── Items: @activity('Lookup1').output.value
│   └── Activities:
│       └── Execute Pipeline (GenericCopyPipeline)
│           └── Parameters:
│               ├── SourceSchema: @item().SourceSchema
│               ├── SourceTable: @item().SourceTable
│               ├── TargetTable: @item().TargetTable
│               ├── LoadType: @item().LoadType
│               └── WatermarkColumn: @item().WatermarkColumn

Pipeline: GenericCopyPipeline
├── Parameters: SourceSchema, SourceTable, LoadType, WatermarkColumn
├── If Condition (LoadType == 'Incremental')
│   └── True: Execute Incremental Load
│   └── False: Execute Full Load
├── Copy Activity
│   └── Source Query (Dynamic):
│       @if(equals(pipeline().parameters.LoadType, 'Full'),
│           concat('SELECT * FROM ', pipeline().parameters.SourceSchema, '.', pipeline().parameters.SourceTable),
│           concat('SELECT * FROM ', pipeline().parameters.SourceSchema, '.', pipeline().parameters.SourceTable, 
│                  ' WHERE ', pipeline().parameters.WatermarkColumn, ' > ''', variables('LastWatermark'), '''')
│       )
└── Update Metadata (LastLoadDate)
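The dynamic source query that GenericCopyPipeline builds can be sketched as a plain function over one metadata row (column names from the PipelineMetadata table above; string interpolation shown for clarity, not as a SQL-injection-safe pattern):

```python
def build_source_query(meta, last_watermark=None):
    """Build the Copy activity source query from one PipelineMetadata row."""
    base = f"SELECT * FROM {meta['SourceSchema']}.{meta['SourceTable']}"
    if meta["LoadType"] == "Full":
        return base  # full load: no filter
    # Incremental load: filter on the configured watermark column.
    return f"{base} WHERE {meta['WatermarkColumn']} > '{last_watermark}'"
```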

8. How do you handle large file processing?

Scenario: Process a 50GB file efficiently without memory issues.

-- Strategy 1: Chunked Processing
# Split large file into chunks using Azure Function or Databricks
# Then process each chunk in parallel

Pipeline: LargeFileProcessing
├── Azure Function (Split File)
│   └── Split 50GB into 500MB chunks
│   └── Return list of chunk file paths
├── ForEach (Process Chunks in Parallel)
│   └── Batch Count: 10
│   └── Copy Activity for each chunk
└── Azure Function (Merge Results if needed)

-- Strategy 2: Data Flow Partitioning
Data Flow Settings:
├── Source: Large File
│   └── Enable Staging (for large files)
├── Optimize Tab:
│   └── Partitioning: Hash or Round Robin
│   └── Number of Partitions: 200
└── Sink:
    └── Single file per partition: false
    └── File name option: Pattern
    └── Pattern: output_part{n}.parquet

-- Strategy 3: Copy Activity Settings
Copy Activity:
├── Enable Staging: true
├── Staging Storage: Azure Blob
├── Parallel Copies: 32
├── Data Integration Units (DIU): 256 (max)
└── Sink:
    └── Copy behavior: PreserveHierarchy
    └── Max concurrent connections: 50

-- For very large files, use PolyBase or COPY command
Copy Activity Sink (Synapse):
├── Copy method: PolyBase
├── PolyBase Settings:
│   └── Allow PolyBase: true
│   └── Reject Type: value
│   └── Reject Value: 10
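The chunked-processing idea from Strategy 1 can be sketched as a generator that streams a file in fixed-size pieces rather than loading it whole; this is the splitting an Azure Function (or Databricks job) would do, shown here on a tiny file:

```python
import os
import tempfile

def iter_chunks(path, chunk_bytes=500 * 1024 * 1024):
    """Yield a file in fixed-size chunks so each can be handled independently."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            yield chunk

# Demo with a 12-byte file and a 5-byte chunk size.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789ab")
chunks = list(iter_chunks(path, chunk_bytes=5))
```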

9. How do you implement error handling and retry logic?

Scenario: Handle transient failures with retry and log all errors for investigation.

Pipeline: RobustETLPipeline
├── Activity Settings (on each activity):
│   └── Retry: 3
│   └── Retry Interval: 30 seconds
│   └── Timeout: 01:00:00
│   └── Secure Output: false (to see errors)
│
├── Copy Activity
│   └── On Success → Continue to next step
│   └── On Failure → Execute "LogError" activity
│
├── Stored Procedure (LogError)
│   └── Parameters:
│       ├── PipelineName: @pipeline().Pipeline
│       ├── RunId: @pipeline().RunId
│       ├── ActivityName: 'CopyData'
│       ├── ErrorMessage: @activity('CopyData').error.message
│       ├── ErrorCode: @activity('CopyData').error.errorCode
│       └── Timestamp: @utcNow()

-- Error Logging Table
CREATE TABLE PipelineErrorLog (
    LogId INT IDENTITY PRIMARY KEY,
    PipelineName VARCHAR(200),
    RunId VARCHAR(100),
    ActivityName VARCHAR(200),
    ErrorMessage NVARCHAR(MAX),
    ErrorCode VARCHAR(50),
    Timestamp DATETIME,
    Status VARCHAR(20)
);

-- Web Activity for Alert
Web Activity (Send Alert):
├── URL: Logic App HTTP trigger URL
├── Method: POST
├── Body: 
{
    "pipeline": "@{pipeline().Pipeline}",
    "error": "@{activity('CopyData').error.message}",
    "runId": "@{pipeline().RunId}"
}

-- Conditional execution
Upon Failure path:
├── Log Error
├── Send Alert
└── Set Pipeline Variable (HasError = true)

At End of Pipeline:
├── If HasError == true
│   └── Fail Pipeline with custom message
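The retry-with-interval behavior of the activity settings is easy to model in code; this sketch re-raises after the final attempt so a surrounding failure path can still log and alert (the `flaky` function simulates a transient source failure):

```python
import time

def with_retry(fn, retries=3, interval=30.0):
    """Retry a callable, sleeping `interval` seconds between attempts;
    re-raise the last error once `retries` extra attempts are exhausted."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt > retries:
                raise  # hand off to the On Failure path
            time.sleep(interval)

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retry(flaky, retries=3, interval=0)  # succeeds on the 3rd call
```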

10. How do you synchronize data between multiple databases?

Scenario: Keep data in sync between SQL Server and Azure SQL Database.

-- Bi-directional Sync Pattern
Pipeline: BidirectionalSync
├── Lookup (Changes from Source A - Last 24 hours)
│   └── Query: SELECT * FROM ChangeTracking WHERE ModifiedDate > @lastSync
├── Lookup (Changes from Source B - Last 24 hours)
├── Data Flow (Conflict Resolution)
│   └── Source: Changes from A
│   └── Source: Changes from B
│   └── Join on Business Key
│   └── Conditional Split:
│       ├── OnlyInA: Apply to B
│       ├── OnlyInB: Apply to A
│       └── InBoth: Use latest ModifiedDate (conflict resolution)
├── Copy Activity (A → B changes)
├── Copy Activity (B → A changes)
└── Update Sync Timestamp

-- Conflict Resolution Logic in Data Flow
Derived Column:
├── WinningSource: iif(ModifiedDateA > ModifiedDateB, 'A', 'B')
├── FinalValue: iif(WinningSource == 'A', ValueA, ValueB)

-- Using Change Tracking (SQL Server)
DECLARE @last_sync_version bigint = (SELECT LastSyncVersion FROM SyncMetadata);
DECLARE @current_version bigint = CHANGE_TRACKING_CURRENT_VERSION();

SELECT t.*, ct.SYS_CHANGE_OPERATION
FROM MyTable t
RIGHT JOIN CHANGETABLE(CHANGES MyTable, @last_sync_version) ct
ON t.PrimaryKey = ct.PrimaryKey;

-- Update sync version after successful sync
UPDATE SyncMetadata SET LastSyncVersion = @current_version;
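The latest-ModifiedDate-wins conflict resolution from the data flow can be sketched as a merge over both change sets (record shape is illustrative; `Key` stands for the business key):

```python
def resolve_conflicts(changes_a, changes_b):
    """Merge change sets from both systems; the newer ModifiedDate wins."""
    merged = {}
    for source, changes in (("A", changes_a), ("B", changes_b)):
        for row in changes:
            current = merged.get(row["Key"])
            if current is None or row["ModifiedDate"] > current["ModifiedDate"]:
                merged[row["Key"]] = {**row, "WinningSource": source}
    return merged

a = [{"Key": 1, "ModifiedDate": "2024-01-02", "Value": "A1"}]
b = [{"Key": 1, "ModifiedDate": "2024-01-05", "Value": "B1"},
     {"Key": 2, "ModifiedDate": "2024-01-01", "Value": "B2"}]
merged = resolve_conflicts(a, b)  # key 1: B wins (newer); key 2: only in B
```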

11. How do you handle schema drift in source data?

Scenario: Source adds new columns without notice; pipeline should handle gracefully.

-- Enable Schema Drift in Data Flow
Data Flow Settings:
├── Source:
│   └── Allow schema drift: true
│   └── Infer drifted column types: true
│   └── Validate schema: false
├── Transformations:
│   └── Use byName() to reference columns
│   └── Select: Map drifted columns
└── Sink:
    └── Schema drift: Allow
    └── Auto-map drifted columns: true

-- Handle specific drifted columns
Select Transformation:
├── Fixed Mapping:
│   └── CustomerID → CustomerID
│   └── Name → Name
└── Rule-based Mapping:
    └── Match: name matches 'Custom.*'
    └── Name: 'drifted_' + $$
    └── Type: string

-- Expression for unknown columns
Derived Column:
├── AllDriftedAsJson: 
    toJSON(
        mapIf(
            columnNames(),
            !in(['CustomerID','Name','Email'], #item),
            byName(#item)
        )
    )

-- Sink to flexible schema (JSON/Parquet)
Data Flow Sink:
├── Dataset: Parquet with no schema
├── Settings:
│   └── Output to single file: false
│   └── Auto-schema: true

-- Alert on schema changes
Pipeline:
├── Get Metadata (Get current schema)
├── Compare with stored schema
├── If different → Log change and alert
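The schema-comparison step at the end can be sketched as a simple set difference between the stored column list and what Get Metadata returned:

```python
def schema_diff(stored_cols, current_cols):
    """Report columns added to or removed from the source since last run."""
    stored, current = set(stored_cols), set(current_cols)
    return {"added": sorted(current - stored),
            "removed": sorted(stored - current)}
```

A non-empty `added` or `removed` list would drive the "log change and alert" branch.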

12. How do you implement data archival strategy?

Scenario: Archive data older than 2 years to cold storage, maintain hot data for queries.

Pipeline: DataArchival
├── Lookup (Get Tables to Archive)
│   └── Query: SELECT TableName, RetentionDays FROM ArchivalConfig
├── ForEach Table
│   └── Activities:
│       ├── Copy Activity (Archive Old Data)
│       │   └── Source Query:
│       │       SELECT * FROM @{item().TableName}
│       │       WHERE CreatedDate < DATEADD(day, -@{item().RetentionDays}, GETDATE())
│       │   └── Sink: Archive Storage (Cool/Archive tier)
│       │       └── Path: archive/{table}/{year}/{month}/
│       ├── Stored Procedure (Verify Archive)
│       │   └── Compare row counts
│       └── Stored Procedure (Delete Archived Data)
│           └── DELETE FROM table WHERE CreatedDate < cutoff

-- Archive File Organization
archive/
├── sales/
│   ├── 2022/
│   │   ├── 01/orders_202201.parquet
│   │   └── 02/orders_202202.parquet
│   └── 2023/
└── customers/
    └── ...

-- Lifecycle Management Policy (Azure Storage)
{
  "rules": [
    {
      "name": "moveToArchive",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["archive/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 }
          }
        }
      }
    }
  ]
}

-- Partitioned Table Strategy (Synapse)
-- Drop old partitions instead of delete
ALTER TABLE Orders SWITCH PARTITION 1 TO Orders_Archive PARTITION 1;

13. How do you handle time zone conversions in data pipelines?

Scenario: Data comes from multiple regions; standardize all timestamps to UTC.

-- Data Flow: Time Zone Standardization
Derived Column:
├── UTCTimestamp:
    convertTimeZone(
        LocalTimestamp,
        SourceTimeZone,  -- 'Eastern Standard Time'
        'UTC'
    )

-- Multiple source regions
Derived Column:
├── UTCTimestamp:
    case(
        SourceRegion == 'US-East', convertTimeZone(LocalTimestamp, 'Eastern Standard Time', 'UTC'),
        SourceRegion == 'US-West', convertTimeZone(LocalTimestamp, 'Pacific Standard Time', 'UTC'),
        SourceRegion == 'EU', convertTimeZone(LocalTimestamp, 'Central European Standard Time', 'UTC'),
        SourceRegion == 'Asia', convertTimeZone(LocalTimestamp, 'Singapore Standard Time', 'UTC'),
        LocalTimestamp  -- default, assume UTC
    )

-- ADF Expression (in Copy Activity)
@convertTimeZone(
    activity('Lookup1').output.firstRow.LocalTime,
    'Eastern Standard Time',
    'UTC'
)

-- Handle DST (Daylight Saving Time)
-- Use IANA time zone names for automatic DST handling
Derived Column:
├── UTCTimestamp:
    convertTimeZone(
        LocalTimestamp,
        'America/New_York',  -- IANA format handles DST
        'UTC'
    )

-- Store both local and UTC
Output Columns:
├── OriginalTimestamp (as received)
├── SourceTimeZone
├── UTCTimestamp (standardized)
└── ProcessedAt (pipeline execution time in UTC)
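The IANA-zone conversion the data flow performs maps directly onto Python's `zoneinfo`, which also handles DST automatically (a sketch of the standardization step, not ADF code):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(local_ts, iana_zone):
    """Attach the source zone (IANA names handle DST) and convert to UTC."""
    return local_ts.replace(tzinfo=ZoneInfo(iana_zone)).astimezone(ZoneInfo("UTC"))
```

Noon in New York converts to 16:00 UTC in July (EDT) but 17:00 UTC in January (EST), which is exactly the DST behavior that fixed-offset conversions get wrong.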

14. How do you implement CDC (Change Data Capture)?

Scenario: Capture real-time changes from SQL database and stream to data lake.

-- Method 1: Native ADF CDC Connector
Copy Activity with CDC:
├── Source: SQL Server CDC
│   └── Enable CDC: true
│   └── Net changes: true
│   └── Start from: Last checkpoint
├── Sink: Delta Lake
│   └── Update method: Upsert
│   └── Key columns: [PrimaryKey]

-- Method 2: SQL Server Change Tracking
-- Enable on database
ALTER DATABASE MyDB SET CHANGE_TRACKING = ON;
ALTER TABLE Orders ENABLE CHANGE_TRACKING;

-- Query changes
DECLARE @last_sync bigint = @{variables('LastSyncVersion')};
SELECT 
    ct.SYS_CHANGE_OPERATION,
    ct.SYS_CHANGE_VERSION,
    o.*
FROM CHANGETABLE(CHANGES Orders, @last_sync) ct
LEFT JOIN Orders o ON ct.OrderID = o.OrderID;

-- Data Flow CDC Pattern
Data Flow:
├── Source (CDC Query)
├── Conditional Split
│   ├── Inserts: SYS_CHANGE_OPERATION == 'I'
│   ├── Updates: SYS_CHANGE_OPERATION == 'U'
│   └── Deletes: SYS_CHANGE_OPERATION == 'D'
├── AlterRow
│   └── Insert if: SYS_CHANGE_OPERATION == 'I'
│   └── Update if: SYS_CHANGE_OPERATION == 'U'
│   └── Delete if: SYS_CHANGE_OPERATION == 'D'
└── Sink (Delta Lake)
    └── Enable merge (upsert)

-- Method 3: Event-driven with Event Grid
Trigger: Storage Event (file created)
├── When new CDC file arrives in landing zone
└── Pipeline processes and merges changes
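Replaying the change-tracking operations against a keyed target is the essence of the AlterRow step; a sketch using a dict as the target store (change-record shape is illustrative, operation codes as in SYS_CHANGE_OPERATION):

```python
def apply_cdc(target, changes):
    """Apply I/U/D change rows to a keyed target (upsert + delete)."""
    for change in changes:
        op, key = change["SYS_CHANGE_OPERATION"], change["Key"]
        if op in ("I", "U"):
            target[key] = change["Row"]   # insert or update = upsert
        elif op == "D":
            target.pop(key, None)         # delete if present
    return target

target = {1: {"OrderID": 1, "Amount": 10}}
changes = [
    {"SYS_CHANGE_OPERATION": "U", "Key": 1, "Row": {"OrderID": 1, "Amount": 99}},
    {"SYS_CHANGE_OPERATION": "I", "Key": 2, "Row": {"OrderID": 2, "Amount": 5}},
    {"SYS_CHANGE_OPERATION": "D", "Key": 1, "Row": None},
]
out = apply_cdc(target, changes)  # order 1 updated then deleted; order 2 inserted
```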

15. How do you handle hierarchical/nested JSON data?

Scenario: Flatten complex nested JSON from API into relational tables.

-- Source JSON
{
  "orderId": "ORD001",
  "customer": {
    "id": "C001",
    "name": "John Doe",
    "addresses": [
      {"type": "billing", "city": "NYC"},
      {"type": "shipping", "city": "LA"}
    ]
  },
  "items": [
    {"productId": "P001", "qty": 2, "price": 100},
    {"productId": "P002", "qty": 1, "price": 200}
  ]
}

-- Data Flow: Flatten JSON
Source (JSON file)
├── Flatten (Customer Addresses)
│   └── Unroll: customer.addresses
│   └── Output: orderId, customer.id, addresses.type, addresses.city
├── Flatten (Order Items)  
│   └── Unroll: items
│   └── Output: orderId, items.productId, items.qty, items.price
├── Select (Order Header)
│   └── orderId, customer.id, customer.name

-- Create multiple outputs
Conditional Split:
├── OrderHeader → Sink (Orders table)
├── OrderItems → Sink (OrderItems table)
└── CustomerAddresses → Sink (Addresses table)

-- Flatten Transformation Settings
Flatten:
├── Unroll by: items[]
├── Unroll root: 
├── Input columns: orderId, customer.id
└── Output: orderId, customerId, productId, qty, price

-- Handle deeply nested
Parse Transformation:
├── Column: nestedJson
├── Expression: @json
└── Document form: Single document

-- Extract specific paths
Derived Column:
├── customerId: customer.id
├── customerName: customer.name
├── billingCity: customer.addresses[?(@.type=='billing')].city
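The Flatten/Select steps applied to the sample payload above can be sketched directly in Python, producing the three relational row sets:

```python
def flatten_order(order):
    """Split the nested order into header, item, and address row sets."""
    header = {"orderId": order["orderId"],
              "customerId": order["customer"]["id"],
              "customerName": order["customer"]["name"]}
    # Unroll items: one row per line item, keyed back to the order.
    items = [{"orderId": order["orderId"], **item} for item in order["items"]]
    # Unroll customer.addresses: one row per address, keyed to the customer.
    addresses = [{"customerId": order["customer"]["id"], **addr}
                 for addr in order["customer"]["addresses"]]
    return header, items, addresses

sample = {
    "orderId": "ORD001",
    "customer": {"id": "C001", "name": "John Doe",
                 "addresses": [{"type": "billing", "city": "NYC"},
                               {"type": "shipping", "city": "LA"}]},
    "items": [{"productId": "P001", "qty": 2, "price": 100},
              {"productId": "P002", "qty": 1, "price": 200}],
}
header, items, addresses = flatten_order(sample)
```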




16. How do you implement parallel processing for multiple sources?

Scenario: Load data from 50 different source tables simultaneously with optimal performance.

Pipeline: ParallelMultiSourceLoad
├── Lookup (Get Table List - 50 tables)
├── ForEach (Parallel Processing)
│   └── Settings:
│       ├── Sequential: false
│       ├── Batch Count: 20  -- Process 20 tables at a time
│   └── Activities:
│       └── Execute Pipeline (LoadSingleTable)

-- Optimize with Batching
Pipeline: BatchParallelLoad
├── Set Variable (Create Batches)
│   └── Expression: 
│       @chunk(activity('Lookup').output.value, 10)  -- Batches of 10
├── ForEach (Process Batches Sequentially)
│   └── Sequential: true  -- One batch at a time
│   └── Activities:
│       └── ForEach (Process Tables in Batch Parallel)
│           └── Sequential: false
│           └── Batch Count: 10

-- Data Flow with Multiple Sources
Data Flow: ParallelSources
├── Source1 (Table A) ─┐
├── Source2 (Table B) ─┼─→ Union → Transform → Sink
├── Source3 (Table C) ─┘

-- Integration Runtime Scaling
Self-hosted IR:
├── Create IR with 4 nodes
├── Each node handles different connections
└── Auto load balancing

Azure IR:
├── Time to live: 10 minutes
├── Core count: 16 (General purpose)
└── Reserved for heavy workloads

-- Concurrent Pipeline Runs
Pipeline Settings:
├── Max concurrent runs: 10
└── Activities Settings:
    ├── Copy Activity:
    │   └── Parallel copies: 32
    │   └── DIU: 256
    └── Data Flow:
        └── Core count: 16
        └── Compute type: Memory optimized
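The ForEach-with-batch-count semantics (up to N loads in flight at once) map onto a bounded thread pool; a sketch with a trivial stand-in for the per-table load pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def load_tables(tables, load_one, batch_count=20):
    """Run load_one for every table, at most batch_count concurrently
    (Sequential: false with Batch Count = batch_count)."""
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        return dict(zip(tables, pool.map(load_one, tables)))

results = load_tables([f"table_{i:02d}" for i in range(50)],
                      lambda t: f"loaded {t}", batch_count=20)
```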

17. How do you handle PII data masking?

Scenario: Mask sensitive data (SSN, email, credit card) before loading to analytics.

-- Data Flow: PII Masking
Derived Column (Masking Rules):
├── SSN_Masked: 
    concat('XXX-XX-', right(SSN, 4))
    -- 123-45-6789 → XXX-XX-6789

├── Email_Masked:
    concat(
        left(Email, 2),
        '****@',
        split(Email, '@')[2]
    )
    -- john.doe@email.com → jo****@email.com

├── CreditCard_Masked:
    concat('****-****-****-', right(CreditCard, 4))
    -- 1234-5678-9012-3456 → ****-****-****-3456

├── Phone_Masked:
    concat('(***) ***-', right(replace(Phone, '-', ''), 4))

├── Name_Masked:
    concat(left(FirstName, 1), '****')

-- Hash for consistent anonymization (lookup possible)
├── CustomerID_Hashed:
    sha2(256, concat(CustomerID, 'salt_value'))

-- Conditional Masking based on environment
├── Email_Output:
    iif(
        '@{pipeline().parameters.Environment}' == 'Production',
        Email,  -- Keep original in prod
        Email_Masked  -- Mask in non-prod
    )

-- Dynamic Masking with Metadata
Lookup: MaskingRules
├── ColumnName: 'SSN', MaskType: 'SSN'
├── ColumnName: 'Email', MaskType: 'Email'

Join with column list:
├── Apply mask function based on MaskType
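The masking expressions above translate one-to-one into Python; the salted hash shows the consistent-anonymization idea (same input always hashes the same, so joins still work):

```python
import hashlib

def mask_ssn(ssn):
    return "XXX-XX-" + ssn[-4:]            # 123-45-6789 -> XXX-XX-6789

def mask_email(email):
    local, domain = email.split("@", 1)
    return local[:2] + "****@" + domain    # john.doe@email.com -> jo****@email.com

def mask_card(card):
    return "****-****-****-" + card[-4:]   # keep only the last 4 digits

def hash_id(value, salt="salt_value"):
    """Salted hash: irreversible but deterministic, so lookups remain possible."""
    return hashlib.sha256((value + salt).encode()).hexdigest()
```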

18. How do you implement data reconciliation?

Scenario: Verify data integrity between source and target after ETL load.

Pipeline: DataReconciliation
├── Lookup (Source Count)
│   └── Query: SELECT COUNT(*) as SourceCount FROM SourceTable WHERE Date = @loadDate
├── Lookup (Target Count)
│   └── Query: SELECT COUNT(*) as TargetCount FROM TargetTable WHERE LoadDate = @loadDate
├── Lookup (Source Checksum)
│   └── Query: SELECT SUM(CAST(Amount AS DECIMAL(18,2))) as SourceSum FROM SourceTable
├── Lookup (Target Checksum)
│   └── Query: SELECT SUM(CAST(Amount AS DECIMAL(18,2))) as TargetSum FROM TargetTable
├── If Condition (Validate Counts)
│   └── Condition: 
│       @equals(
│           activity('SourceCount').output.firstRow.SourceCount,
│           activity('TargetCount').output.firstRow.TargetCount
│       )
│   └── False → Fail pipeline with details
├── If Condition (Validate Sums)
│   └── Condition: 
│       @equals(
│           activity('SourceSum').output.firstRow.SourceSum,
│           activity('TargetSum').output.firstRow.TargetSum
│       )
└── Stored Procedure (Log Reconciliation)
    └── INSERT INTO ReconciliationLog (
            LoadDate, SourceCount, TargetCount, 
            SourceSum, TargetSum, Status, Variance
        )

-- Detailed Reconciliation Report
Data Flow: DetailedReconciliation
├── Source (Source System)
├── Source (Target System)
├── Join (Full Outer on Business Key)
├── Conditional Split:
│   ├── MissingInTarget: isNull(TargetKey)
│   ├── MissingInSource: isNull(SourceKey)
│   ├── ValueMismatch: SourceAmount != TargetAmount
│   └── Matched: Default
├── Sink (Discrepancy Report)

-- Expression for variance percentage
Derived Column:
├── VariancePercent:
    abs(SourceCount - TargetCount) * 100.0 / SourceCount
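The count/sum comparison and variance calculation can be condensed into one function (row shape and the `Amount` column name are illustrative):

```python
def reconcile(source_rows, target_rows, amount_key="Amount"):
    """Compare row counts and amount sums; report status and count variance %."""
    src_count, tgt_count = len(source_rows), len(target_rows)
    src_sum = sum(r[amount_key] for r in source_rows)
    tgt_sum = sum(r[amount_key] for r in target_rows)
    variance = (abs(src_count - tgt_count) * 100.0 / src_count
                if src_count else 0.0)
    status = "OK" if (src_count == tgt_count and src_sum == tgt_sum) else "MISMATCH"
    return {"Status": status, "SourceCount": src_count, "TargetCount": tgt_count,
            "SourceSum": src_sum, "TargetSum": tgt_sum,
            "VariancePercent": variance}
```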

19. How do you migrate on-premises SSIS packages to ADF?

Scenario: Convert existing SSIS packages to ADF pipelines.

-- Option 1: Lift and Shift (Azure-SSIS IR)
-- Run existing packages unchanged
Azure-SSIS Integration Runtime:
├── Deploy SSIS packages to SSISDB
├── Execute using Execute SSIS Package activity
├── Benefits: Minimal changes, familiar tooling
└── Limitations: Still SSIS, not cloud-native

Pipeline Activity:
├── Execute SSIS Package
│   └── SSIS package path: /SSISDB/Folder/Project/Package.dtsx
│   └── Connection managers: Override with Azure connections
│   └── Parameters: Pass pipeline parameters

-- Option 2: Convert to ADF Native
SSIS Component → ADF Equivalent:
├── Data Flow Task → Copy Activity / Data Flow
├── Execute SQL → Stored Procedure Activity
├── For Loop → ForEach Activity
├── Sequence Container → Pipeline with dependencies
├── Script Task → Azure Function / Databricks
├── Lookup → Lookup Activity
├── Derived Column → Data Flow Derived Column
├── Aggregate → Data Flow Aggregate
├── Sort → Data Flow Sort
├── Merge Join → Data Flow Join

-- Migration Assessment Tool
# Use SSIS Migration Assessment Tool
# Identifies compatibility issues
# Generates migration report

-- Conversion Example
SSIS Package:
├── Execute SQL (Get Max Date)
├── Data Flow (Load Incremental)
│   ├── OLE DB Source
│   ├── Derived Column
│   └── OLE DB Destination
└── Execute SQL (Update Watermark)

Converted ADF Pipeline:
├── Lookup (Get Max Date)
├── Data Flow
│   ├── Source
│   ├── Derived Column
│   └── Sink
└── Stored Procedure (Update Watermark)

20. How do you optimize ADF pipeline performance?

Best Practices for Optimization:

-- 1. Copy Activity Optimization
Copy Activity Settings:
├── DIU (Data Integration Units): Start with Auto, increase for large data
├── Parallel copies: 32 (default), increase for many small files
├── Enable staging when loading Synapse (enables PolyBase/COPY loads)
├── Use binary copy when no transformation needed

-- 2. Data Flow Optimization
Data Flow Settings:
├── Core count: 16-256 based on data size
├── TTL (Time to Live): 10+ minutes for iterative development
├── Compute type: Memory Optimized for complex transformations
├── Enable staging for Synapse sink

Partitioning Strategy:
├── Hash: For large skewed data
├── Round Robin: For even distribution
├── Key: Preserve source partitioning

-- 3. Pipeline Design
Optimization:
├── Use ForEach with parallel execution (Sequential: false)
├── Batch activities to reduce overhead
├── Use Execute Pipeline for modularity and parallel runs
├── Avoid unnecessary lookups (cache results)

-- 4. Source Optimization
For SQL Sources:
├── Use indexed columns in WHERE clause
├── Avoid SELECT * (specify columns)
├── Use partitioning hint for large tables

Source Query:
├── Query: SELECT col1, col2 FROM Table 
│          WHERE ModifiedDate > ? 
│          OPTION (MAXDOP 8)
├── Partition option: Physical partitions of table

-- 5. Sink Optimization
For SQL Sink:
├── Pre-copy script: TRUNCATE TABLE target (if full load)
├── Batch size: 10000
├── Bulk insert table lock: true
├── Use staging with PolyBase for Synapse

For ADLS Sink:
├── Block size: 100MB
├── Max concurrent connections: 50

-- 6. Integration Runtime
Self-hosted IR:
├── Use multiple nodes (up to 4) for HA and load balancing
├── Separate IR for different workloads
├── Place IR close to data source

Azure IR:
├── Use region close to data
├── Consider managed VNET for security

-- 7. Monitoring and Alerting
Monitor:
├── Track DIU utilization
├── Monitor queue time vs execution time
├── Set up alerts for long-running pipelines
├── Use Log Analytics for detailed analysis
