Top 20 AWS Data Engineer Interview Questions and Answers
- What are the key AWS services for Data Engineering?
- What is AWS Glue and how does it work?
- What is Amazon Redshift?
- How do you design a data lake on AWS?
- What is AWS Lake Formation?
- What is Amazon EMR?
- What is Amazon Kinesis?
- How do you implement ETL pipelines on AWS?
- What is AWS Data Pipeline?
- What is Amazon Athena?
- How do you optimize S3 for data engineering?
- What is the difference between Glue and EMR?
- How do you handle schema evolution in AWS?
- What is AWS Step Functions?
- How do you implement CDC on AWS?
- What is Amazon DynamoDB Streams?
- How do you secure data on AWS?
- What is AWS Glue Data Catalog?
- How do you monitor data pipelines on AWS?
- What are best practices for AWS Data Engineering?
1. What are the key AWS services for Data Engineering?
AWS Data Engineering Services:

Storage:
├── Amazon S3 - Object storage, data lake foundation
├── Amazon EBS - Block storage for EC2
└── Amazon EFS - Managed file system

Data Processing:
├── AWS Glue - Serverless ETL
├── Amazon EMR - Managed Hadoop/Spark
├── AWS Lambda - Serverless compute
└── Amazon EC2 - Custom processing

Data Warehousing:
├── Amazon Redshift - Cloud data warehouse
├── Amazon Redshift Spectrum - Query S3 data
└── Amazon Redshift Serverless - On-demand warehouse

Streaming:
├── Amazon Kinesis Data Streams - Real-time streaming
├── Amazon Kinesis Firehose - Data delivery
├── Amazon Kinesis Analytics - Stream processing
└── Amazon MSK - Managed Kafka

Analytics & Query:
├── Amazon Athena - Serverless SQL on S3
├── Amazon QuickSight - BI and visualization
└── Amazon OpenSearch - Search and analytics

Orchestration:
├── AWS Step Functions - Workflow orchestration
├── Amazon MWAA - Managed Airflow
└── AWS Data Pipeline - Data movement
2. What is AWS Glue and how does it work?
AWS Glue is a fully managed, serverless ETL service for data preparation and loading.

Components:
AWS Glue Components:
├── Data Catalog
│   ├── Databases
│   ├── Tables (metadata)
│   └── Connections
├── Crawlers
│   └── Auto-discover schema
├── ETL Jobs
│   ├── Spark jobs
│   ├── Python Shell
│   └── Ray (ML workloads)
├── Glue Studio
│   └── Visual ETL designer
└── Data Quality
    └── Built-in rules
# Glue Job Example (PySpark)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read from catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="raw_data"
)

# Transform: rename and cast columns
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("id", "string", "customer_id", "string"),
        ("name", "string", "customer_name", "string"),
        ("amount", "double", "total_amount", "double")
    ]
)

# Write to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://bucket/processed/"},
    format="parquet"
)

job.commit()
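Each mapping tuple in ApplyMapping is (source field, source type, target field, target type): the transform renames the field and casts its value. A plain-Python sketch of that per-record behavior (hypothetical helper, not part of the Glue API):

```python
# Toy stand-in for ApplyMapping on a single record (a plain dict
# instead of a DynamicFrame row). Source type is ignored here.
CASTS = {"string": str, "double": float, "int": int}

def apply_mapping(record, mappings):
    """Rename and cast fields per (src, src_type, dst, dst_type) tuples."""
    out = {}
    for src, _src_type, dst, dst_type in mappings:
        if src in record:
            out[dst] = CASTS[dst_type](record[src])
    return out

row = {"id": "42", "name": "Ada", "amount": "19.99"}
mapped_row = apply_mapping(row, [
    ("id", "string", "customer_id", "string"),
    ("name", "string", "customer_name", "string"),
    ("amount", "string", "total_amount", "double"),
])
# mapped_row == {"customer_id": "42", "customer_name": "Ada", "total_amount": 19.99}
```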
3. What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse.

Architecture:
Redshift Cluster:
├── Leader Node
│   ├── SQL parsing
│   ├── Query planning
│   └── Result aggregation
└── Compute Nodes
    ├── Node slices
    ├── Local storage
    └── Parallel processing

Node Types:
├── RA3 (Recommended)
│   ├── Managed storage (RMS)
│   ├── Scale compute/storage independently
│   └── ra3.xlplus, ra3.4xlarge, ra3.16xlarge
├── DC2 (Dense Compute)
│   ├── Local SSD storage
│   └── dc2.large, dc2.8xlarge
└── DS2 (Dense Storage) - Legacy
-- Create table with distribution and sort keys
CREATE TABLE sales (
    sale_id INT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    amount DECIMAL(10,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date)
DISTSTYLE KEY;

-- Distribution styles:
-- KEY:  Rows with the same key land on the same slice
-- EVEN: Round-robin distribution
-- ALL:  Full copy on every node (small dimension tables)
-- AUTO: Redshift decides based on table size
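The choice between these styles follows from table size and join pattern. A hypothetical helper encoding the rules above (the row-count threshold is illustrative, not official AWS guidance):

```python
def choose_diststyle(row_count, is_dimension, join_key=None):
    """Pick a Redshift distribution style from simple heuristics."""
    if is_dimension and row_count < 1_000_000:
        return "ALL"                # small dims: copy to every node
    if join_key:
        return f"KEY({join_key})"   # collocate joins on the key
    if row_count > 0:
        return "EVEN"               # no clear key: round-robin
    return "AUTO"                   # unknown size: let Redshift decide

print(choose_diststyle(50_000, is_dimension=True))                    # ALL
print(choose_diststyle(500_000_000, False, join_key="customer_id"))   # KEY(customer_id)
```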
4. How do you design a data lake on AWS?
AWS Data Lake Architecture:
S3 Bucket Structure (Medallion):
s3://data-lake/
├── raw/        # Bronze - Raw data
│   ├── source1/
│   │   └── year=2024/month=01/day=15/
│   └── source2/
├── staged/     # Silver - Cleansed
│   ├── domain1/
│   │   └── table1/
│   └── domain2/
├── curated/    # Gold - Business ready
│   ├── analytics/
│   └── reporting/
└── archive/    # Cold storage
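The Hive-style year=/month=/day= folders are what let Athena and Spark prune partitions. A small helper (hypothetical, assuming the bucket layout above) to build such prefixes:

```python
from datetime import date

def partition_prefix(layer, source, d):
    """Build a Hive-style partition prefix for the data-lake layout above."""
    return (f"s3://data-lake/{layer}/{source}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/")

print(partition_prefix("raw", "source1", date(2024, 1, 15)))
# s3://data-lake/raw/source1/year=2024/month=01/day=15/
```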
Components:
1. Storage: S3 with lifecycle policies
2. Catalog: AWS Glue Data Catalog / Lake Formation
3. Processing: Glue, EMR, Athena
4. Security: Lake Formation permissions
5. Governance: Data quality, lineage
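Point 1 mentions lifecycle policies; a sketch of the rule dict you might pass to boto3's put_bucket_lifecycle_configuration (the prefixes and day counts are assumptions, tune them to your retention needs):

```python
# Hypothetical lifecycle rules for the layout above: raw/ to
# Infrequent Access after 90 days, archive/ to Glacier after 30.
lifecycle_config = {
    "Rules": [
        {
            "ID": "raw-to-ia",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "STANDARD_IA"}],
        },
        {
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": "archive/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        },
    ]
}
```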
# Lake Formation setup
import boto3
lakeformation = boto3.client('lakeformation')
# Register S3 location
lakeformation.register_resource(
    ResourceArn='arn:aws:s3:::my-data-lake',
    UseServiceLinkedRole=True
)

# Grant table-level permissions to an analyst role
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/DataAnalyst'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics',
            'Name': 'sales'
        }
    },
    Permissions=['SELECT']
)
5. What is AWS Lake Formation?
AWS Lake Formation simplifies data lake setup, security, and governance.

Key Features:
- Centralized data catalog
- Fine-grained access control (column/row level)
- Data sharing across accounts
- Blueprint-based ingestion
- Governed tables with ACID
Lake Formation Security Model:
Traditional IAM:
- Coarse-grained (bucket/prefix level)
- Complex policies for multiple tables
- Hard to manage at scale
Lake Formation:
- Fine-grained (column/row level)
- Centralized permissions
- Tag-based access control
# Column-level security
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Analyst'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'hr',
            'Name': 'employees',
            'ColumnNames': ['name', 'department', 'title']
            # Excludes: salary, ssn
        }
    },
    Permissions=['SELECT']
)

# Row-level security with data filters
lakeformation.create_data_cells_filter(
    TableData={
        'TableCatalogId': '123456789012',  # AWS account ID (required)
        'DatabaseName': 'sales',
        'TableName': 'orders',
        'Name': 'us_only_filter',
        'RowFilter': {
            'FilterExpression': "region = 'US'"
        },
        'ColumnNames': ['order_id', 'customer', 'amount']
    }
)
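Conceptually, a data cells filter drops rows that fail the filter expression and projects only the listed columns before results reach the caller. A plain-Python sketch of that effect (hypothetical toy model handling only a simple equality filter, not the Lake Formation engine):

```python
def apply_cells_filter(rows, column_names, filter_column, filter_value):
    """Keep rows where filter_column equals filter_value,
    then project only column_names (toy data-cells-filter model)."""
    return [
        {c: row[c] for c in column_names}
        for row in rows
        if row.get(filter_column) == filter_value
    ]

orders = [
    {"order_id": 1, "customer": "A", "amount": 10.0, "region": "US"},
    {"order_id": 2, "customer": "B", "amount": 20.0, "region": "EU"},
]
visible = apply_cells_filter(orders, ["order_id", "customer", "amount"],
                             "region", "US")
# visible == [{"order_id": 1, "customer": "A", "amount": 10.0}]
```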