Top 20 AWS Data Engineering Scenario Questions
- Design a real-time clickstream analytics pipeline
- Build a data lake with raw, curated, and enriched zones
- Design a CDC pipeline from RDS to data warehouse
- Build a real-time fraud detection system
- Design a multi-tenant data platform
- Build an event-driven ETL pipeline
- Design a data quality monitoring framework
- Build a cost-optimized data archive solution
- Design a cross-region data replication strategy
- Build a real-time recommendation engine pipeline
- Design a data governance framework
- Build a streaming ML inference pipeline
- Design a disaster recovery data strategy
- Build a data catalog and discovery platform
- Design a PII data handling pipeline
- Build a real-time aggregation dashboard
- Design a schema evolution strategy
- Build a data pipeline testing framework
- Design a data lineage tracking system
- Build a serverless data processing architecture
1. Design a real-time clickstream analytics pipeline
Scenario: E-commerce company needs real-time user behavior analytics.
Architecture:
+---------------------------------------------------------------+
|  Web/Mobile Apps                                              |
|        |                                                      |
|        v                                                      |
|  API Gateway --> Kinesis Data Streams                         |
|                        |                                      |
|        +---------------+---------------+                      |
|        |               |               |                      |
|        v               v               v                      |
|     Lambda          Kinesis         Kinesis                   |
|  (Real-time)       Analytics        Firehose                  |
|        |               |               |                      |
|        v               v               v                      |
|    DynamoDB        OpenSearch      S3 (Parquet)               |
|   (Counters)       (Dashboard)     (Data Lake)                |
|                                        |                      |
|                                        v                      |
|                                 Athena/Redshift               |
+---------------------------------------------------------------+
# Kinesis stream for clickstream ingestion
import base64
import json

import boto3

kinesis = boto3.client('kinesis')
dynamodb = boto3.client('dynamodb')
sns = boto3.client('sns')

kinesis.create_stream(StreamName='clickstream', ShardCount=10)

# Lambda for real-time processing
def handler(event, context):
    for record in event['Records']:
        data = json.loads(base64.b64decode(record['kinesis']['data']))

        # Update real-time counters in DynamoDB
        dynamodb.update_item(
            TableName='page_views',
            Key={'page_id': {'S': data['page_id']}, 'hour': {'S': data['hour']}},
            UpdateExpression='ADD view_count :inc',
            ExpressionAttributeValues={':inc': {'N': '1'}}
        )

        # Check for traffic anomalies and alert (ALERT_TOPIC: SNS topic ARN)
        if data['events_per_minute'] > 1000:
            sns.publish(TopicArn=ALERT_TOPIC, Message=f"Traffic spike: {data}")
-- Kinesis Data Analytics for per-minute aggregations
CREATE OR REPLACE STREAM pageview_aggregates AS
SELECT STREAM
    page_id,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS views,
    COUNT(DISTINCT user_id) AS unique_users
FROM clickstream
GROUP BY page_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
# Firehose for S3 landing
- Format conversion to Parquet
- Partitioning: year/month/day/hour
- Compression: Snappy
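Those three settings map onto the delivery stream's `ExtendedS3DestinationConfiguration`. A minimal sketch of the config builder; the bucket/role ARNs, prefix, and the Glue table that supplies the Parquet schema are placeholders:

```python
def firehose_s3_config(bucket_arn, role_arn, glue_db, glue_table):
    """Build ExtendedS3DestinationConfiguration for
    firehose.create_delivery_stream(): JSON in -> Parquet out,
    hive-style year/month/day/hour prefixes, Snappy compression."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Partitioning: year/month/day/hour from the record arrival timestamp
        "Prefix": ("clickstream/year=!{timestamp:yyyy}/month=!{timestamp:MM}/"
                   "day=!{timestamp:dd}/hour=!{timestamp:HH}/"),
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 128},
        # Format conversion: deserialize JSON, serialize Parquet (Snappy)
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
            },
            # Parquet schema comes from the Glue Data Catalog table
            "SchemaConfiguration": {
                "DatabaseName": glue_db,
                "TableName": glue_table,
                "RoleARN": role_arn,
            },
        },
    }

cfg = firehose_s3_config("arn:aws:s3:::my-lake",
                         "arn:aws:iam::123456789012:role/firehose-role",
                         "clickstream_db", "events")
# firehose.create_delivery_stream(DeliveryStreamName="clickstream-s3",
#                                 ExtendedS3DestinationConfiguration=cfg)
```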
2. Build a data lake with raw, curated, and enriched zones
Scenario: Build a multi-zone data lake with proper governance.
Architecture:
+---------------------------------------------------------------+
|                      AWS Lake Formation                       |
+---------------------------------------------------------------+
|                                                               |
|  +----------------+  +----------------+  +----------------+   |
|  |    RAW ZONE    |  |  CURATED ZONE  |  | ENRICHED ZONE  |   |
|  |    (Bronze)    |  |    (Silver)    |  |     (Gold)     |   |
|  |                |  |                |  |                |   |
|  | s3://lake/raw/ |  | s3://lake/     |  | s3://lake/     |   |
|  |                |  |    curated/    |  |    enriched/   |   |
|  | - JSON         |  | - Parquet      |  | - Parquet      |   |
|  | - CSV          |  | - Deduplicated |  | - Aggregated   |   |
|  | - As-is data   |  | - Validated    |  | - ML features  |   |
|  +--------+-------+  +--------+-------+  +--------+-------+   |
|           |                   |                   |           |
|           v                   v                   v           |
|  +--------------------------------------------------------+   |
|  |                 AWS Glue Data Catalog                  |   |
|  +--------------------------------------------------------+   |
+---------------------------------------------------------------+
# Glue ETL: Raw to Curated
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext
from pyspark.sql.functions import current_date

glueContext = GlueContext(SparkContext.getOrCreate())

raw_df = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="events"
)

# Data quality checks (Glue Data Quality DQDL ruleset)
quality_results = EvaluateDataQuality.apply(
    frame=raw_df,
    ruleset="""
        Rules = [
            ColumnExists "user_id",
            IsComplete "event_timestamp",
            ColumnValues "event_type" in ["click", "view", "purchase"]
        ]
    """
)

# Transform and deduplicate
curated_df = raw_df.toDF() \
    .dropDuplicates(["event_id"]) \
    .withColumn("processed_date", current_date())

# Write to the curated zone as Snappy-compressed Parquet
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(curated_df, glueContext, "curated"),
    connection_type="s3",
    connection_options={"path": "s3://lake/curated/events/"},
    format="parquet",
    format_options={"compression": "snappy"}
)

# Lake Formation: grant analysts read access to the curated table
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::...:role/AnalystRole'},
    Resource={'Table': {'DatabaseName': 'curated_db', 'Name': 'events'}},
    Permissions=['SELECT']
)
3. Design a CDC pipeline from RDS to data warehouse
Scenario: Real-time replication from operational RDS to Redshift analytics.
Architecture:
+---------------------------------------------------------------+
|   RDS (PostgreSQL)                                            |
|        |                                                      |
|        | (Binary Log / Logical Replication)                   |
|        v                                                      |
|   AWS DMS (CDC Task)                                          |
|        |                                                      |
|        v                                                      |
|   Kinesis Data Streams                                        |
|        |                                                      |
|        +---------------+---------------+                      |
|        |               |               |                      |
|        v               v               v                      |
|     Lambda          Firehose      S3 (Archive)                |
|  (Transform)     (to Redshift)                                |
|        |               |                                      |
|        v               v                                      |
|    DynamoDB         Redshift                                  |
|  (Latest State)    (Analytics)                                |
+---------------------------------------------------------------+
# DMS CDC task: stream ongoing changes from RDS into Kinesis
dms = boto3.client('dms')
dms.create_replication_task(
    ReplicationTaskIdentifier='rds-to-kinesis-cdc',
    SourceEndpointArn=source_endpoint_arn,
    TargetEndpointArn=kinesis_endpoint_arn,
    ReplicationInstanceArn=replication_instance_arn,
    MigrationType='cdc',  # change data capture only, no full load
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "public",
                "table-name": "%"  # all tables in the public schema
            },
            "rule-action": "include"
        }]
    }),
    ReplicationTaskSettings=json.dumps({
        "TargetMetadata": {
            "ParallelLoadThreads": 4,
            "ParallelLoadBufferSize": 500
        },
        "FullLoadSettings": {
            "TargetTablePrepMode": "DO_NOTHING"
        },
        "Logging": {"EnableLogging": True}
    })
)
# Lambda to handle CDC events (DMS envelope: metadata + data)
def handler(event, context):
    for record in event['Records']:
        cdc_event = json.loads(base64.b64decode(record['kinesis']['data']))
        operation = cdc_event['metadata']['operation']  # insert, update, delete
        table = cdc_event['metadata']['table-name']
        data = cdc_event['data']
        if operation in ('insert', 'update'):
            # Overwrite so DynamoDB always holds the latest row state
            # (assumes data is already in DynamoDB attribute-value format)
            dynamodb.put_item(TableName=f'{table}_current', Item=data)
        elif operation == 'delete':
            dynamodb.delete_item(TableName=f'{table}_current',
                                 Key={'id': data['id']})
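On the analytics side, Firehose batches land in a Redshift staging table and each batch is then upserted into the target. A hedged sketch of that upsert using Redshift's MERGE statement; the table, key, and column names (`analytics.orders`, `id`, `amount`, `status`) are illustrative:

```python
def merge_sql(target="analytics.orders", staging="staging.orders_cdc", key="id"):
    """Redshift MERGE applying one staged CDC batch; columns are illustrative."""
    return f"""
        MERGE INTO {target}
        USING {staging} AS src
        ON {target}.{key} = src.{key}
        WHEN MATCHED THEN
            UPDATE SET amount = src.amount, status = src.status
        WHEN NOT MATCHED THEN
            INSERT (id, amount, status) VALUES (src.id, src.amount, src.status);
    """

# Per micro-batch: COPY the delivered files into the staging table, run
# merge_sql() (e.g. via the Redshift Data API execute_statement), then
# truncate the staging table for the next batch.
```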
4. Build a real-time fraud detection system
Scenario: Detect fraudulent transactions in real-time with ML.
Architecture:
+---------------------------------------------------------------+
|   Transaction API                                             |
|        |                                                      |
|        v                                                      |
|   Kinesis Data Streams (transactions)                         |
|        |                                                      |
|        +---------------+---------------+                      |
|        |               |               |                      |
|        v               v               v                      |
|     Lambda           Lambda         Firehose                  |
|  (Feature Eng)    (Rule Engine)    (Archive)                  |
|        |               |                                      |
|        v               v                                      |
|    SageMaker        DynamoDB                                  |
|  (ML Inference)  (Rules/History)                              |
|        |                                                      |
|        +---------------+                                      |
|        |               |                                      |
|        v               v                                      |
|     Approve        SNS Alert                                  |
|   Transaction     (Fraud Team)                                |
+---------------------------------------------------------------+
# Feature-engineering Lambda: score each transaction with SageMaker
from datetime import datetime

def handler(event, context):
    for record in event['Records']:
        txn = json.loads(base64.b64decode(record['kinesis']['data']))

        # Get historical features from DynamoDB
        user_history = dynamodb.get_item(
            TableName='user_transactions',
            Key={'user_id': {'S': txn['user_id']}}
        ).get('Item', {})

        # Combine transaction fields with real-time and historical features
        features = {
            'amount': txn['amount'],
            'merchant_category': txn['merchant_category'],
            'hour_of_day': datetime.fromisoformat(txn['timestamp']).hour,
            'avg_transaction_amount': user_history.get('avg_amount', 0),
            'transaction_count_24h': user_history.get('count_24h', 0),
            'distance_from_last_txn': calculate_distance(txn, user_history),
            'time_since_last_txn': calculate_time_delta(txn, user_history)
        }

        # Call the SageMaker inference endpoint
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName='fraud-detection-model',
            ContentType='application/json',
            Body=json.dumps(features)
        )
        fraud_score = json.loads(response['Body'].read())['fraud_probability']

        if fraud_score > 0.8:
            # Block the transaction and alert the fraud team
            sns.publish(TopicArn=FRAUD_ALERT_TOPIC, Message=json.dumps(txn))
            return {'action': 'BLOCK', 'reason': 'High fraud score'}
        elif fraud_score > 0.5:
            # Request additional verification (step-up auth)
            return {'action': 'VERIFY', 'reason': 'Suspicious activity'}
        return {'action': 'APPROVE'}
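The rule-engine Lambda from the diagram complements the ML score with deterministic checks. A minimal sketch; the rule schema below is an assumption about what the DynamoDB rules table might hold:

```python
# Hypothetical rule records, as they might be loaded from the DynamoDB
# rules table (names, fields, and thresholds are illustrative).
RULES = [
    {"name": "large_amount", "field": "amount", "op": "gt", "value": 5000},
    {"name": "velocity_24h", "field": "count_24h", "op": "gt", "value": 20},
]

def evaluate_rules(txn, rules=RULES):
    """Return the names of every rule the transaction trips; any hit
    routes the transaction to review alongside the ML fraud score."""
    ops = {"gt": lambda a, b: a > b, "lt": lambda a, b: a < b}
    return [r["name"] for r in rules
            if ops[r["op"]](txn.get(r["field"], 0), r["value"])]

# evaluate_rules({"amount": 9000, "count_24h": 3})  # -> ["large_amount"]
```

Keeping rules as data rather than code lets the fraud team tune thresholds without a redeploy.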
5. Design a multi-tenant data platform
Scenario: SaaS platform with isolated data for multiple tenants.
Architecture Options:
+---------------------------------------------------------------+
|  Option 1: Siloed (separate resources per tenant)             |
|                                                               |
|   Tenant A           Tenant B           Tenant C              |
|  +-----------+      +-----------+      +-----------+          |
|  |S3/tenant-a|      |S3/tenant-b|      |S3/tenant-c|          |
|  |Redshift-A |      |Redshift-B |      |Redshift-C |          |
|  |Glue Jobs A|      |Glue Jobs B|      |Glue Jobs C|          |
|  +-----------+      +-----------+      +-----------+          |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
|  Option 2: Pooled (shared resources with isolation)           |
|                                                               |
|  +-----------------------------------------------------+      |
|  |                     Shared S3                       |      |
|  |         s3://data-lake/tenant_id=A/                 |      |
|  |         s3://data-lake/tenant_id=B/                 |      |
|  |         s3://data-lake/tenant_id=C/                 |      |
|  +-----------------------------------------------------+      |
|                          |                                    |
|                          v                                    |
|  +-----------------------------------------------------+      |
|  |       Lake Formation (Row/Column Security)          |      |
|  |       Policy: tenant_id = ${session:tenant_id}      |      |
|  +-----------------------------------------------------+      |
|                          |                                    |
|                          v                                    |
|  +-----------------------------------------------------+      |
|  |                  Shared Redshift                    |      |
|  |  RLS: WHERE tenant_id = current_setting('tenant')   |      |
|  +-----------------------------------------------------+      |
+---------------------------------------------------------------+
# Lake Formation data cell filter: row-level security on the shared table.
# Filter expressions are static, so one filter is typically created per tenant.
lakeformation.create_data_cells_filter(
    TableData={
        'TableCatalogId': account_id,
        'DatabaseName': 'multi_tenant_db',
        'TableName': 'customer_data',
        'Name': 'tenant_filter',
        'RowFilter': {'FilterExpression': "tenant_id = 'tenant_a'"},
        'ColumnWildcard': {}  # all columns; use ColumnNames to restrict
    }
)

# Grant the tenant role SELECT through the filter, not on the raw table
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': f'arn:aws:iam::{account_id}:role/TenantRole'},
    Resource={
        'DataCellsFilter': {
            'TableCatalogId': account_id,
            'DatabaseName': 'multi_tenant_db',
            'TableName': 'customer_data',
            'Name': 'tenant_filter'
        }
    },
    Permissions=['SELECT'],
    PermissionsWithGrantOption=[]
)
-- Redshift row-level security: each session sees only its tenant's rows.
-- The WITH clause binds the policy to the table's tenant_id column.
CREATE RLS POLICY tenant_isolation
WITH (tenant_id varchar(64))
USING (tenant_id = current_setting('app.tenant_id'));

ATTACH RLS POLICY tenant_isolation ON customer_data TO ROLE analyst_role;
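For the RLS policy to filter anything, the application must set `app.tenant_id` before each tenant's queries run in that session. A sketch, assuming the Redshift Data API with illustrative cluster/database names:

```python
def tenant_scoped_statements(tenant_id, query):
    """Pair set_config with the query so both run in the same session;
    the RLS policy then filters rows to the given tenant."""
    safe = tenant_id.replace("'", "")  # the session setting is a string literal
    return [
        f"SELECT set_config('app.tenant_id', '{safe}', FALSE);",
        query,
    ]

stmts = tenant_scoped_statements("tenant_a", "SELECT * FROM customer_data;")
# redshift_data.batch_execute_statement(ClusterIdentifier="shared-cluster",
#     Database="saas", Sqls=stmts)  # both statements share one session
```

Running both statements in one batch matters: session settings do not carry over between separate Data API calls.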