Top 20 Azure Data Lake Storage Interview Questions and Answers
- What is Azure Data Lake Storage?
- What is the difference between ADLS Gen1 and Gen2?
- What is Hierarchical Namespace (HNS)?
- What storage tiers are available in ADLS Gen2?
- How does data redundancy work in ADLS?
- What is Azure Blob Storage vs ADLS Gen2?
- How do you secure data in Azure Data Lake?
- What are Access Control Lists (ACLs) in ADLS?
- How do you implement data lifecycle management?
- What are the different ways to access ADLS Gen2?
- How do you optimize storage costs in ADLS?
- What is the difference between RBAC and ACLs?
- How do you integrate ADLS with Azure Synapse?
- What file formats are recommended for Data Lake?
- How do you implement data partitioning in Data Lake?
- What is soft delete in ADLS Gen2?
- How do you monitor ADLS Gen2?
- What is Azure Data Lake Analytics?
- How do you handle schema evolution in Data Lake?
- What are best practices for Data Lake architecture?
1. What is Azure Data Lake Storage?
Azure Data Lake Storage (ADLS) is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy.
Key Features:
- Unlimited Scale: Petabytes of data, billions of files
- Hadoop Compatible: HDFS semantics, ABFS driver
- High Performance: Optimized for analytics workloads
- Security: Microsoft Entra ID (formerly Azure AD), RBAC, ACLs, encryption
- Cost Effective: Blob storage pricing with tiering
Common Use Cases:
- Enterprise data lakes
- Big data analytics (Spark, Synapse)
- Machine learning data storage
- Data archival and compliance
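Analytics engines such as Spark reach ADLS Gen2 through the ABFS driver mentioned above, which addresses data with `abfss://` URIs. The sketch below shows how such a URI is put together. The account name `mydatalake`, the container `data`, and the helper `abfss_uri` are illustrative assumptions, not names from any SDK.

```python
from urllib.parse import urlparse


def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 container.

    Hypothetical helper for illustration; it only formats a string.
    """
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical account, container, and file path.
uri = abfss_uri("mydatalake", "data", "/logs/2024/01/events.parquet")
print(uri)

# The container sits in the user-info position and the account's
# dfs endpoint is the host, so a standard URL parser can split it.
parsed = urlparse(uri)
print(parsed.username, parsed.hostname, parsed.path)
```

Note the `dfs` endpoint in the host name: the Blob API for the same account uses `blob.core.windows.net` instead.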
2. What is the difference between ADLS Gen1 and Gen2?
| Feature | ADLS Gen1 | ADLS Gen2 |
|---|---|---|
| Foundation | Separate service | Built on Blob Storage |
| Status | Retired (Feb 29, 2024) | Current recommended |
| Pricing | Higher cost | Blob storage pricing |
| Storage Tiers | Not supported | Hot, Cool, Cold, Archive |
| Redundancy | LRS only | LRS, ZRS, GRS, RA-GRS, GZRS, RA-GZRS |
| Performance | Good | Better (tiered performance) |
| APIs | WebHDFS only | Blob + ADLS + HDFS APIs |
| Blob Features | No | Full blob capabilities |
Migration Recommendation:
Any remaining Gen1 workloads should be moved to Gen2, using Azure Data Factory or the migration tool in the Azure portal.
3. What is Hierarchical Namespace (HNS)?
Hierarchical Namespace enables ADLS Gen2 to organize objects/files into a hierarchy of directories and subdirectories, similar to a traditional file system.
Without HNS (Blob Storage):
// Flat namespace - virtual directories
container/
  folder1/file1.txt    <- Single object with "/" in name
  folder1/file2.txt
  folder2/subfolder/file3.txt
// Renaming "folder1" requires copying ALL files
With HNS (ADLS Gen2):
// True directory hierarchy
container/
  folder1/             <- Actual directory object
    file1.txt
    file2.txt
  folder2/
    subfolder/
      file3.txt
// Renaming "folder1" is atomic (metadata change)
Benefits of HNS:
- Atomic directory operations (rename, delete)
- Better performance for big data workloads
- POSIX-like ACLs for fine-grained security
- Required for HDFS compatibility
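To make the rename difference concrete, here is a toy in-memory model, not the storage service itself. A flat namespace is modeled as a dict keyed by full blob name, and a hierarchical namespace as nested dicts. The function names `rename_flat` and `rename_hns` are made up for this sketch.

```python
def rename_flat(blobs: dict[str, bytes], old: str, new: str) -> int:
    """Rename a virtual directory in a flat namespace.

    Every blob whose name starts with the prefix is rewritten under the
    new prefix. Returns the number of objects touched.
    """
    touched = 0
    for name in [n for n in blobs if n.startswith(old + "/")]:
        blobs[new + name[len(old):]] = blobs.pop(name)  # copy, then delete
        touched += 1
    return touched


def rename_hns(root: dict, old: str, new: str) -> int:
    """Rename a directory in a nested tree: one entry is rebound."""
    root[new] = root.pop(old)
    return 1


flat = {
    "folder1/file1.txt": b"a",
    "folder1/file2.txt": b"b",
    "folder2/subfolder/file3.txt": b"c",
}
tree = {
    "folder1": {"file1.txt": b"a", "file2.txt": b"b"},
    "folder2": {"subfolder": {"file3.txt": b"c"}},
}

print(rename_flat(flat, "folder1", "renamed"))  # one operation per blob
print(rename_hns(tree, "folder1", "renamed"))   # one operation in total
```

In the flat model the work grows with the number of files under the prefix, while the hierarchical model touches a single directory entry regardless of how many files it contains.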
4. What storage tiers are available in ADLS Gen2?
ADLS Gen2 supports multiple access tiers for cost optimization:
Hot Tier:
- Highest storage cost, lowest access cost
- Frequently accessed data
- No minimum storage duration
Cool Tier:
- Lower storage cost, higher access cost
- Infrequently accessed (30+ days)
- 30-day minimum storage
Cold Tier:
- Even lower storage cost
- Rarely accessed (90+ days)
- 90-day minimum storage
Archive Tier:
- Lowest storage cost, highest access cost
- Rarely accessed (180+ days)
- 180-day minimum storage
- Requires rehydration before access (hours)
# Set tier using Azure CLI
az storage blob set-tier --account-name myaccount \
--container-name data --name oldfile.parquet --tier Archive
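The minimum storage durations listed above matter when a blob is deleted or moved out of a tier early. The sketch below computes how many days would still be billed, assuming the early deletion charge is prorated by the days remaining in the minimum period. The table `MIN_DAYS` and the function `early_deletion_days` are illustrative names, and no prices are modeled.

```python
# Minimum storage duration per tier, in days (from the list above).
MIN_DAYS = {"Hot": 0, "Cool": 30, "Cold": 90, "Archive": 180}


def early_deletion_days(tier: str, days_in_tier: int) -> int:
    """Days of storage still charged if a blob leaves `tier` now.

    Assumes a prorated charge: minimum duration minus days already spent.
    """
    if days_in_tier < 0:
        raise ValueError("days_in_tier must be non-negative")
    return max(0, MIN_DAYS[tier] - days_in_tier)


print(early_deletion_days("Archive", 45))  # 135 days still charged
print(early_deletion_days("Cool", 40))     # 0, minimum already met
print(early_deletion_days("Hot", 1))       # 0, no minimum duration
```

This is why archiving data that may be needed again within a few months can cost more than leaving it in Cool or Cold.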
# Lifecycle management policy (JSON)
{
  "rules": [{
    "name": "archiveOldData",
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 90}
        }
      }
    }
  }]
}
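To see what the policy above would do to a given blob, the following sketch evaluates it locally. It is a simplified model, not the service's rule engine: it assumes blob paths are matched as `container/blob` prefixes, that the threshold is strictly "greater than", and that the coldest qualifying action wins when several thresholds are exceeded. The function `target_tier` is a hypothetical helper.

```python
import json

POLICY = json.loads("""
{
  "rules": [{
    "name": "archiveOldData",
    "type": "Lifecycle",
    "definition": {
      "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
      "actions": {
        "baseBlob": {
          "tierToCool": {"daysAfterModificationGreaterThan": 30},
          "tierToArchive": {"daysAfterModificationGreaterThan": 90}
        }
      }
    }
  }]
}
""")

# Checked coldest first, so the cheapest qualifying tier is returned.
ACTION_ORDER = [("tierToArchive", "Archive"), ("tierToCool", "Cool")]


def target_tier(policy: dict, blob_path: str, days_since_modified: int):
    """Return the tier the policy would move the blob to, or None."""
    for rule in policy["rules"]:
        definition = rule["definition"]
        prefixes = definition["filters"].get("prefixMatch", [])
        if prefixes and not any(blob_path.startswith(p) for p in prefixes):
            continue  # rule does not apply to this path
        actions = definition["actions"]["baseBlob"]
        for action, tier in ACTION_ORDER:
            limit = actions.get(action, {}).get("daysAfterModificationGreaterThan")
            if limit is not None and days_since_modified > limit:
                return tier
    return None


for days in (30, 31, 120):
    print(days, target_tier(POLICY, "logs/app.log", days))
```

A blob modified exactly 30 days ago is left alone, because the condition is "greater than" rather than "at least".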
5. How does data redundancy work in ADLS?
Redundancy Options:
LRS (Locally Redundant Storage):
- 3 copies within a single data center
- 11 nines durability
- Lowest cost
ZRS (Zone-Redundant Storage):
- 3 copies across availability zones
- 12 nines durability
- Protects against data center failures
GRS (Geo-Redundant Storage):
- LRS + async copy to secondary region
- 16 nines durability
- 6 total copies
GZRS (Geo-Zone-Redundant Storage):
- ZRS + async copy to secondary region
- Highest durability and availability
RA-GRS/RA-GZRS:
- Read access to secondary region
- Higher availability for read operations
# Create a GRS storage account with hierarchical namespace enabled
az storage account create --name mydatalake --resource-group myRG \
--location eastus --sku Standard_GRS --kind StorageV2 \
--enable-hierarchical-namespace true
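The "nines" figures above are easier to compare as expected losses. The sketch below converts them into the expected number of objects lost per year for a hypothetical lake of one billion objects, reading N nines as an annual loss probability of 10^-N per object. The object count and the function name are illustrative assumptions.

```python
from fractions import Fraction

# Durability "nines" per redundancy option, as listed above.
NINES = {"LRS": 11, "ZRS": 12, "GRS": 16}


def expected_annual_loss(objects: int, option: str) -> Fraction:
    """Expected objects lost per year: objects * 10^-nines (exact)."""
    return Fraction(objects, 10 ** NINES[option])


OBJECTS = 1_000_000_000  # hypothetical data lake size
for option in NINES:
    loss = expected_annual_loss(OBJECTS, option)
    print(f"{option}: about {float(loss):.0e} objects per year")
```

Even at LRS this works out to roughly one object per century at this size, so the choice between options is usually driven by availability during zone or regional outages rather than by durability alone.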