Top Azure Data Lake Interview Questions (2026) | JavaInuse

Top 20 Azure Data Lake Storage Interview Questions and Answers


  1. What is Azure Data Lake Storage?
  2. What is the difference between ADLS Gen1 and Gen2?
  3. What is Hierarchical Namespace (HNS)?
  4. What storage tiers are available in ADLS Gen2?
  5. How does data redundancy work in ADLS?
  6. What is Azure Blob Storage vs ADLS Gen2?
  7. How do you secure data in Azure Data Lake?
  8. What are Access Control Lists (ACLs) in ADLS?
  9. How do you implement data lifecycle management?
  10. What are the different ways to access ADLS Gen2?
  11. How do you optimize storage costs in ADLS?
  12. What is the difference between RBAC and ACLs?
  13. How do you integrate ADLS with Azure Synapse?
  14. What file formats are recommended for Data Lake?
  15. How do you implement data partitioning in Data Lake?
  16. What is soft delete in ADLS Gen2?
  17. How do you monitor ADLS Gen2?
  18. What is Azure Data Lake Analytics?
  19. How do you handle schema evolution in Data Lake?
  20. What are best practices for Data Lake architecture?

Microsoft Azure Interview Questions

Comprehensive interview questions for Azure cloud services and data engineering roles.

1. What is Azure Data Lake Storage?

Azure Data Lake Storage (ADLS) is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy.

Key Features:
- Unlimited Scale: Petabytes of data, billions of files
- Hadoop Compatible: HDFS semantics, ABFS driver
- High Performance: Optimized for analytics workloads
- Security: Azure AD, RBAC, ACLs, encryption
- Cost Effective: Blob storage pricing with tiering

Common Use Cases:
- Enterprise data lakes
- Big data analytics (Spark, Synapse)
- Machine learning data storage
- Data archival and compliance

2. What is the difference between ADLS Gen1 and Gen2?

Feature | ADLS Gen1 | ADLS Gen2
Foundation | Separate service | Built on Blob Storage
Status | Retired (Feb 2024) | Current recommendation
Pricing | Higher cost | Blob storage pricing
Storage Tiers | Not supported | Hot, Cool, Cold, Archive
Redundancy | LRS, GRS | LRS, ZRS, GRS, GZRS, RA-GRS
Performance | Good | Better (tiered performance)
APIs | WebHDFS only | Blob + ADLS + HDFS APIs
Blob Features | No | Full blob capabilities

Migration Recommendation:
All Gen1 users should migrate to Gen2 using Azure Data Factory or Azure Portal migration tool.

3. What is Hierarchical Namespace (HNS)?

Hierarchical Namespace enables ADLS Gen2 to organize objects/files into a hierarchy of directories and subdirectories, similar to a traditional file system.

Without HNS (Blob Storage):
// Flat namespace - virtual directories
container/
  folder1/file1.txt    <- Single object with "/" in name
  folder1/file2.txt
  folder2/subfolder/file3.txt
  
// Renaming "folder1" requires copying ALL files

With HNS (ADLS Gen2):
// True directory hierarchy
container/
  folder1/            <- Actual directory object
    file1.txt
    file2.txt
  folder2/
    subfolder/
      file3.txt
      
// Renaming "folder1" is atomic (metadata change)

Benefits of HNS:
- Atomic directory operations (rename, delete)
- Better performance for big data workloads
- POSIX-like ACLs for fine-grained security
- Required for HDFS compatibility
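The rename difference above can be illustrated with a toy model in plain Python (no Azure dependency, not the actual service implementation): in a flat namespace a "directory" rename must rewrite every object key under the prefix, while HNS performs a single atomic metadata update.

```python
# Toy model of the rename cost difference; plain Python, no Azure dependency.
def rename_flat(keys, old_prefix, new_prefix):
    """Flat namespace: a 'directory' rename rewrites every matching key."""
    renamed, ops = [], 0
    for key in keys:
        if key.startswith(old_prefix):
            renamed.append(new_prefix + key[len(old_prefix):])
            ops += 1  # one copy-and-delete per blob
        else:
            renamed.append(key)
    return renamed, ops

def rename_hns(directories, old_name, new_name):
    """HNS: the rename is a single atomic metadata update."""
    updated = {(new_name if d == old_name else d): v
               for d, v in directories.items()}
    return updated, 1

keys = ["folder1/file1.txt", "folder1/file2.txt", "folder2/subfolder/file3.txt"]
flat_keys, flat_ops = rename_flat(keys, "folder1/", "renamed/")   # flat_ops == 2
hns_dirs, hns_ops = rename_hns({"folder1": ["file1.txt", "file2.txt"]},
                               "folder1", "renamed")              # hns_ops == 1
```

The operation count in the flat case grows with the number of blobs under the prefix, which is why large-directory renames on plain Blob Storage are slow and non-atomic.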

4. What storage tiers are available in ADLS Gen2?

ADLS Gen2 supports multiple access tiers for cost optimization:

Hot Tier:
- Highest storage cost, lowest access cost
- Frequently accessed data
- No minimum storage duration

Cool Tier:
- Lower storage cost, higher access cost
- Infrequently accessed (30+ days)
- 30-day minimum storage

Cold Tier:
- Even lower storage cost
- Rarely accessed (90+ days)
- 90-day minimum storage

Archive Tier:
- Lowest storage cost, highest access cost
- Rarely accessed (180+ days)
- 180-day minimum storage
- Requires rehydration before access (hours)

# Set tier using Azure CLI
az storage blob set-tier --account-name myaccount \
  --container-name data --name oldfile.parquet --tier Archive

# Lifecycle management policy (JSON)
{
  "rules": [{
    "name": "archiveOldData",
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 90}
        }
      }
    }
  }]
}
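The tier descriptions and lifecycle thresholds above reduce to a simple age-based decision. As a hedged illustration (plain Python, not an Azure API), the 30/90/180-day cutoffs can be expressed as:

```python
def choose_tier(days_since_modified: int) -> str:
    """Map data age to an access tier using the 30/90/180-day cutoffs above."""
    if days_since_modified >= 180:
        return "Archive"
    if days_since_modified >= 90:
        return "Cold"
    if days_since_modified >= 30:
        return "Cool"
    return "Hot"
```

In practice the lifecycle policy applies these rules server-side; a helper like this is only useful for planning or validating a policy before deploying it.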

5. How does data redundancy work in ADLS?

Redundancy Options:

LRS (Locally Redundant Storage):
- 3 copies within single data center
- 11 nines durability
- Lowest cost

ZRS (Zone-Redundant Storage):
- 3 copies across availability zones
- 12 nines durability
- Protects against data center failures

GRS (Geo-Redundant Storage):
- LRS + async copy to secondary region
- 16 nines durability
- 6 total copies

GZRS (Geo-Zone-Redundant Storage):
- ZRS + async copy to secondary region
- Highest durability and availability

RA-GRS/RA-GZRS:
- Read access to secondary region
- Higher availability for read operations

# Create storage account with GRS
az storage account create --name mydatalake --resource-group myRG \
  --location eastus --sku Standard_GRS --kind StorageV2 \
  --hierarchical-namespace true
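The options above can be summarized as a lookup table (illustrative Python; "nines" is the durability figure quoted in the text, e.g. 11 nines = 99.999999999%, and the 16-nines figure for GZRS is an assumption based on its geo-redundant copies):

```python
# Illustrative summary of the redundancy options described above.
REDUNDANCY = {
    "LRS":  {"copies": 3, "scope": "single data center",        "nines": 11},
    "ZRS":  {"copies": 3, "scope": "availability zones",        "nines": 12},
    "GRS":  {"copies": 6, "scope": "primary + secondary region", "nines": 16},
    "GZRS": {"copies": 6, "scope": "zones + secondary region",   "nines": 16},
}

def total_copies(sku: str) -> int:
    """Total physical copies Azure keeps for a given redundancy SKU."""
    return REDUNDANCY[sku]["copies"]
```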

6. What is Azure Blob Storage vs ADLS Gen2?

ADLS Gen2 IS Azure Blob Storage with Hierarchical Namespace enabled.

Feature | Blob Storage (no HNS) | ADLS Gen2 (HNS enabled)
Namespace | Flat | Hierarchical
Directory Operations | Simulated (slow) | Atomic (fast)
HDFS Compatibility | Limited | Full ABFS driver
ACLs | Container level only | File/directory level
Big Data Performance | Good | Optimized
Blob Features | All | Most (some limitations)
Pricing | Standard blob | Same storage, higher transaction cost

When to use Blob without HNS:
- Simple object storage
- CDN/static website hosting
- No big data analytics

When to enable HNS (ADLS Gen2):
- Big data analytics (Spark, Synapse, Databricks)
- Need fine-grained ACLs
- Directory-level operations

7. How do you secure data in Azure Data Lake?

1. Authentication:
- Azure Active Directory (recommended)
- Shared Key
- Shared Access Signatures (SAS)
- Managed Identities

2. Authorization:
- RBAC (Role-Based Access Control)
- ACLs (Access Control Lists)
- SAS tokens

3. Encryption:
// Encryption at rest
- Microsoft-managed keys (default)
- Customer-managed keys (Azure Key Vault)
- Infrastructure encryption (double encryption)

// Encryption in transit
- HTTPS required by default
- TLS 1.2 minimum

# Enable customer-managed keys
az storage account update --name mydatalake --resource-group myRG \
  --encryption-key-source Microsoft.Keyvault \
  --encryption-key-vault https://myvault.vault.azure.net \
  --encryption-key-name mykey

4. Network Security:
- Private endpoints
- Service endpoints
- Firewall rules

8. What are Access Control Lists (ACLs) in ADLS?

ACLs provide fine-grained access control at the directory and file level in ADLS Gen2.

ACL Types:
- Access ACL: Controls access to an object
- Default ACL: Template for child objects (directories only)

Permission Types:
// POSIX-style permissions
R = Read    (4)  - List directory contents / Read file
W = Write   (2)  - Create/delete children / Write file
X = Execute (1)  - Traverse directory / Execute file

// ACL Entry Format
[scope]:[type]:[id]:[permissions]

// Examples
user::rwx          <- Owning user
group::r-x         <- Owning group
other::---         <- Everyone else
user:abc123:r-x    <- Specific user
group:mygroup:rwx  <- Specific group
mask::rwx          <- Maximum permissions for named entries

Setting ACLs:
# Set access ACL
az storage fs access set --acl "user::rwx,group::r-x,other::---" \
  --path mydir --file-system mycontainer --account-name mydatalake

# Set default ACL (for inheritance)
az storage fs access set --acl "default:user::rwx,default:group::r-x" \
  --path mydir --file-system mycontainer --account-name mydatalake

# Recursive ACL update
az storage fs access set-recursive --acl "user:abc123:r-x" \
  --path mydir --file-system mycontainer --account-name mydatalake
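The ACL entry format shown above is regular enough to parse mechanically. A small sketch (plain Python, not part of any Azure SDK) that splits an entry into scope, type, principal, and permission bits:

```python
def parse_acl_entry(entry: str) -> dict:
    """Parse an ACL entry like 'user:abc123:r-x' or 'default:group::r-x'."""
    parts = entry.split(":")
    scope = "access"
    if parts[0] == "default":       # default ACLs carry a leading scope marker
        scope, parts = "default", parts[1:]
    etype, principal, perms = parts
    # Permission bits follow POSIX: r=4, w=2, x=1
    octal = sum(v for flag, v in zip(perms, (4, 2, 1)) if flag != "-")
    return {"scope": scope, "type": etype,
            "id": principal or None, "perms": perms, "octal": octal}

entry = parse_acl_entry("user:abc123:r-x")   # octal == 5 (read + execute)
```

A parser like this is handy when auditing ACLs exported via `az storage fs access show`, since the CLI returns them as comma-separated strings in this exact format.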

9. How do you implement data lifecycle management?

Lifecycle management automates tiering and deletion of data based on rules.

{
  "rules": [
    {
      "enabled": true,
      "name": "TierToArchive",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["historical/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {"daysAfterModificationGreaterThan": 30},
            "tierToCold": {"daysAfterModificationGreaterThan": 90},
            "tierToArchive": {"daysAfterModificationGreaterThan": 180},
            "delete": {"daysAfterModificationGreaterThan": 2555}
          },
          "snapshot": {
            "delete": {"daysAfterCreationGreaterThan": 90}
          }
        }
      }
    },
    {
      "enabled": true,
      "name": "DeleteOldVersions",
      "type": "Lifecycle",
      "definition": {
        "filters": {"blobTypes": ["blockBlob"]},
        "actions": {
          "version": {
            "delete": {"daysAfterCreationGreaterThan": 365}
          }
        }
      }
    }
  ]
}

Apply Policy:
az storage account management-policy create --account-name mydatalake \
  --resource-group myRG --policy @lifecycle-policy.json

10. What are the different ways to access ADLS Gen2?

1. Azure Portal:
- Browse containers and files
- Upload/download files
- Manage ACLs

2. Azure Storage Explorer:
- Desktop application
- Full management capabilities
- ACL and metadata editing

3. SDKs and APIs:
# Python SDK
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_name = "mydatalake"
service_client = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=DefaultAzureCredential()
)

file_system_client = service_client.get_file_system_client("mycontainer")
file_client = file_system_client.get_file_client("myfile.parquet")

# Download
download = file_client.download_file()
data = download.readall()

4. ABFS Driver (Spark, Databricks):
// Spark DataFrame read
val df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path")

// Configuration
spark.conf.set("fs.azure.account.key.account.dfs.core.windows.net", "key")
// Or use OAuth/Managed Identity

5. AzCopy:
azcopy copy "/local/path" "https://account.dfs.core.windows.net/container/folder" --recursive

11. How do you optimize storage costs in ADLS?

1. Use Appropriate Tiers:
- Hot for frequently accessed
- Cool for infrequently accessed (30+ days)
- Archive for rarely accessed (180+ days)

2. Lifecycle Management:
- Automate tier transitions
- Delete old/unused data

3. Choose Right Redundancy:
- LRS for dev/test
- GRS for production
- Don't over-provision

4. Optimize File Sizes:
- Avoid small files (100MB-1GB optimal)
- Compact small files periodically

5. Reserved Capacity:
- 1 or 3-year commitments
- Up to 38% savings

6. Monitor and Analyze:
# Use Storage Analytics
az storage logging update --account-name mydatalake --log rwd --retention 30 --services b

# Azure Cost Management
- Set budgets and alerts
- Analyze by container/path
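Point 4 above (optimal file sizes) is easy to check in an inventory report. A minimal sketch, assuming a list of `(path, size_in_bytes)` pairs obtained from any listing API, that flags files outside the 100 MB-1 GB sweet spot as candidates for compaction:

```python
MB = 1024 * 1024

def flag_for_compaction(files, min_bytes=100 * MB, max_bytes=1024 * MB):
    """Return paths whose size falls outside the 100 MB-1 GB sweet spot."""
    return [path for path, size in files if size < min_bytes or size > max_bytes]

inventory = [("a.parquet", 5 * MB), ("b.parquet", 200 * MB), ("c.parquet", 2048 * MB)]
candidates = flag_for_compaction(inventory)   # ["a.parquet", "c.parquet"]
```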

12. What is the difference between RBAC and ACLs?

Aspect | RBAC | ACLs
Scope | Subscription, resource group, storage account, container | Directory, file
Granularity | Coarse (container level) | Fine (file level)
Inheritance | Azure resource hierarchy | Parent directory to children
Management | Azure Portal, CLI, ARM | Storage Explorer, CLI, SDK
Identity Types | Azure AD only | Azure AD object IDs

Recommended Approach:
- Use RBAC for management operations (create container, manage settings)
- Use ACLs for data access control (read/write specific files)

# RBAC: Grant container-level access
az role assignment create --role "Storage Blob Data Contributor" \
  --assignee user@domain.com \
  --scope "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}"

# ACL: Grant file-level access
az storage fs access set --acl "user:abc123-object-id:r-x" \
  --path specific-folder/file.parquet \
  --file-system container --account-name account

13. How do you integrate ADLS with Azure Synapse?

1. Linked Service Connection:
// Create linked service in Synapse
{
    "name": "DataLakeLinkedService",
    "type": "AzureBlobFS",
    "typeProperties": {
        "url": "https://mydatalake.dfs.core.windows.net",
        "accountKey": {"type": "SecureString", "value": "***"}
    }
}
// Or use Managed Identity (recommended)

2. Serverless SQL Pool:
-- Query Parquet files directly
SELECT *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/container/path/*.parquet',
    FORMAT = 'PARQUET'
) AS data;

-- Create external table
CREATE EXTERNAL TABLE Sales (
    OrderID INT,
    Amount DECIMAL(10,2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);

3. Spark Pool:
# Read directly using linked service
df = spark.read.load(
    'abfss://container@mydatalake.dfs.core.windows.net/path',
    format='parquet'
)

# Write back to data lake
df.write.mode('overwrite').parquet(
    'abfss://container@mydatalake.dfs.core.windows.net/output'
)

14. What file formats are recommended for Data Lake?

Format | Best For | Compression | Schema Evolution
Parquet | Analytics, columnar queries | Excellent (Snappy, GZIP) | Good
Delta Lake | ACID transactions, CDC | Built on Parquet | Excellent
ORC | Hive workloads | Excellent | Good
Avro | Row-based, streaming | Good | Excellent
JSON | Interoperability | Poor (compressible) | Flexible
CSV | Simple data exchange | Poor | None

Recommendations:
- Analytics: Parquet or Delta Lake
- Streaming: Avro or JSON
- ML Pipelines: Parquet
- Legacy Integration: CSV
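The recommendations above can be captured in a hypothetical lookup helper (the workload names are illustrative, not an Azure convention), useful in ingestion frameworks that pick an output format per pipeline:

```python
# Hypothetical mapping derived from the recommendations above.
FORMAT_BY_WORKLOAD = {
    "analytics": "Parquet",
    "acid_transactions": "Delta Lake",
    "streaming": "Avro",
    "ml_pipeline": "Parquet",
    "legacy_exchange": "CSV",
}

def recommend_format(workload: str) -> str:
    """Default to Parquet, the general-purpose analytics choice."""
    return FORMAT_BY_WORKLOAD.get(workload, "Parquet")
```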


15. How do you implement data partitioning in Data Lake?

Partition Strategy:
// Common partitioning schemes
/data/
  year=2023/
    month=01/
      day=01/
        file1.parquet
        file2.parquet
      day=02/
    month=02/
  year=2024/

// Or flat partitioning
/data/
  date=2023-01-01/
  date=2023-01-02/

Writing Partitioned Data (Spark):
# Write with partitioning
df.write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .parquet("abfss://container@account.dfs.core.windows.net/sales")

# Read with partition pruning
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/sales") \
    .filter("year = 2023 AND month = 6")

Best Practices:
- Partition on frequently filtered columns
- Avoid over-partitioning (too many small files)
- Balance partition size (100MB-1GB files)
- Consider query patterns
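A small helper (plain Python sketch, matching the two layouts shown above) that builds Hive-style partition paths consistently, so writers and readers agree on the directory scheme:

```python
from datetime import date

def partition_path(base: str, d: date, style: str = "nested") -> str:
    """Build a Hive-style partition path matching the layouts above.

    style="nested" -> base/year=YYYY/month=MM/day=DD
    style="flat"   -> base/date=YYYY-MM-DD
    """
    if style == "nested":
        return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}"
    return f"{base}/date={d.isoformat()}"

path = partition_path("/data", date(2023, 1, 1))
# -> "/data/year=2023/month=01/day=01"
```

Because Spark recognizes the `key=value` convention, files written to these paths get partition pruning for free when the same columns appear in query filters.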

16. What is soft delete in ADLS Gen2?

Soft delete allows recovery of accidentally deleted blobs, containers, and versions.

# Enable blob soft delete
az storage account blob-service-properties update \
  --account-name mydatalake --enable-delete-retention true \
  --delete-retention-days 30

# Enable container soft delete
az storage account blob-service-properties update \
  --account-name mydatalake --enable-container-delete-retention true \
  --container-delete-retention-days 30

# List soft-deleted blobs
az storage blob list --account-name mydatalake --container mycontainer \
  --include d --query "[?deleted]"

# Restore soft-deleted blob
az storage blob undelete --account-name mydatalake \
  --container mycontainer --name myfile.parquet

Note: Soft delete increases storage costs during retention period.

17. How do you monitor ADLS Gen2?

1. Azure Monitor Metrics:
- Ingress/Egress bytes
- Transactions count
- Availability percentage
- Latency metrics

2. Diagnostic Settings:
# Enable diagnostic logs
az monitor diagnostic-settings create --name mydiag \
  --resource /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default \
  --logs '[{"category": "StorageRead", "enabled": true}, {"category": "StorageWrite", "enabled": true}]' \
  --workspace {log-analytics-workspace-id}

3. Storage Analytics Logging:
- Request-level logging
- Success/failure tracking
- Authentication details

4. Azure Advisor Recommendations:
- Cost optimization suggestions
- Security recommendations

18. What is Azure Data Lake Analytics?

Azure Data Lake Analytics (ADLA) was an on-demand, distributed analytics service that ran big data jobs written in U-SQL, a language combining SQL with C#.

Note: ADLA has been retired (February 2024, alongside ADLS Gen1). Microsoft recommends Azure Synapse Analytics instead.

U-SQL Example:
// U-SQL script
@searchlog =
    EXTRACT UserId int,
            Query string,
            Duration int
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@result =
    SELECT Query,
           SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Query;

OUTPUT @result
TO "/output/SearchLogResult.csv"
USING Outputters.Csv();

Migration Path:
- Migrate to Azure Synapse serverless SQL
- Migrate to Azure Databricks
- Use Spark pools in Synapse

19. How do you handle schema evolution in Data Lake?

1. Delta Lake (Recommended):
# Automatic schema evolution
df.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("abfss://container@account.dfs.core.windows.net/table")

# Schema enforcement
df.write \
    .format("delta") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(path)

2. Parquet Schema Evolution:
# Spark merge schema option
df = spark.read \
    .option("mergeSchema", "true") \
    .parquet("path")

3. Serverless SQL:
-- Use WITH clause for schema specification
SELECT *
FROM OPENROWSET(
    BULK 'path/*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    OrderID INT,
    Amount DECIMAL(10,2),
    NewColumn VARCHAR(100)  -- Handle new columns
) AS data;

20. What are best practices for Data Lake architecture?

1. Medallion Architecture (Bronze/Silver/Gold):
/datalake/
  bronze/           <- Raw data (as-is from source)
    source1/
    source2/
  silver/           <- Cleaned, validated, conformed
    domain1/
    domain2/
  gold/             <- Business-level aggregates
    reporting/
    ml_features/

2. Naming Conventions:
- Consistent folder structure
- Include dates in paths for time-series
- Use lowercase, underscores or hyphens

3. Data Organization:
- Separate by domain/subject area
- Partition by common query filters
- Maintain data catalog (Purview)

4. Security:
- Principle of least privilege
- Use managed identities
- Enable soft delete and versioning
- Encrypt with customer-managed keys

5. Performance:
- Optimal file sizes (100MB-1GB)
- Use columnar formats (Parquet)
- Compact small files regularly
- Colocate related data

6. Governance:
- Use Azure Purview for data catalog
- Implement data lineage tracking
- Document schemas and ownership
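The medallion layout and naming conventions above are easiest to enforce with a single path-building helper shared by all pipelines. A minimal sketch (the `/datalake` root and zone names follow the example layout above; adjust to your container structure):

```python
ZONES = ("bronze", "silver", "gold")

def zone_path(zone: str, *parts: str) -> str:
    """Compose a medallion-layout path under /datalake, rejecting unknown zones."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    # Lowercase components to match the naming conventions above.
    return "/".join(("/datalake", zone) + tuple(p.lower() for p in parts))

raw = zone_path("bronze", "source1")          # "/datalake/bronze/source1"
agg = zone_path("gold", "reporting", "sales")  # "/datalake/gold/reporting/sales"
```

Centralizing path construction this way prevents ad-hoc folder names from creeping into the lake and keeps the bronze/silver/gold contract enforceable in code review.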
