

Top AWS Data Engineering Scenario Questions (2026) | JavaInUse

Top 20 AWS Data Engineering Scenario Questions


  1. Design a real-time clickstream analytics pipeline
  2. Build a data lake with raw, curated, and enriched zones
  3. Design a CDC pipeline from RDS to data warehouse
  4. Build a real-time fraud detection system
  5. Design a multi-tenant data platform
  6. Build an event-driven ETL pipeline
  7. Design a data quality monitoring framework
  8. Build a cost-optimized data archive solution
  9. Design a cross-region data replication strategy
  10. Build a real-time recommendation engine pipeline
  11. Design a data governance framework
  12. Build a streaming ML inference pipeline
  13. Design a disaster recovery data strategy
  14. Build a data catalog and discovery platform
  15. Design a PII data handling pipeline
  16. Build a real-time aggregation dashboard
  17. Design a schema evolution strategy
  18. Build a data pipeline testing framework
  19. Design a data lineage tracking system
  20. Build a serverless data processing architecture

1. Design a real-time clickstream analytics pipeline

Scenario: E-commerce company needs real-time user behavior analytics.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│  Web/Mobile Apps                                             │
│       │                                                      │
│       ▼                                                      │
│  API Gateway ──▶ Kinesis Data Streams                       │
│                      │                                       │
│       ┌──────────────┼──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Lambda          Kinesis        Kinesis                     │
│  (Real-time)     Analytics      Firehose                    │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  DynamoDB        OpenSearch     S3 (Parquet)                │
│  (Counters)      (Dashboard)    (Data Lake)                 │
│                                     │                       │
│                                     ▼                       │
│                               Athena/Redshift               │
└─────────────────────────────────────────────────────────────┘

# Kinesis Stream for clickstream
import base64, json
import boto3

kinesis = boto3.client('kinesis')
kinesis.create_stream(StreamName='clickstream', ShardCount=10)

# Lambda for real-time processing
def handler(event, context):
    for record in event['Records']:
        data = json.loads(base64.b64decode(record['kinesis']['data']))
        
        # Update real-time counters in DynamoDB
        dynamodb.update_item(
            TableName='page_views',
            Key={'page_id': {'S': data['page_id']}, 'hour': {'S': data['hour']}},
            UpdateExpression='ADD view_count :inc',
            ExpressionAttributeValues={':inc': {'N': '1'}}
        )
        
        # Check for anomalies
        if data['events_per_minute'] > 1000:
            sns.publish(TopicArn=ALERT_TOPIC, Message=f"Traffic spike: {data}")

# Kinesis Data Analytics (Flink SQL) for windowed aggregations
CREATE STREAM pageview_aggregates AS
SELECT STREAM
    page_id,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) as window_start,
    COUNT(*) as views,
    COUNT(DISTINCT user_id) as unique_users
FROM clickstream
GROUP BY page_id, TUMBLE(event_time, INTERVAL '1' MINUTE);

# Firehose for S3 landing
- Format conversion to Parquet
- Partitioning: year/month/day/hour
- Compression: Snappy
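The partitioning scheme above implies a time-based S3 prefix per event. A minimal sketch of that key layout (an illustrative helper, not the Firehose API itself):

```python
from datetime import datetime, timezone

def s3_prefix(event_ts: str) -> str:
    """Build the year/month/day/hour partition prefix the Firehose
    delivery stream would write under (illustrative helper)."""
    ts = datetime.fromisoformat(event_ts).astimezone(timezone.utc)
    return (f"year={ts.year}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/")

print(s3_prefix("2026-01-15T09:30:00+00:00"))
# year=2026/month=01/day=15/hour=09/
```

Hive-style `key=value` prefixes let Athena prune partitions instead of scanning the whole lake.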

2. Build a data lake with raw, curated, and enriched zones

Scenario: Build a multi-zone data lake with proper governance.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                    AWS Lake Formation                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   RAW ZONE   │  │ CURATED ZONE │  │ENRICHED ZONE │      │
│  │    (Bronze)  │  │   (Silver)   │  │    (Gold)    │      │
│  │              │  │              │  │              │      │
│  │ s3://lake/   │  │ s3://lake/   │  │ s3://lake/   │      │
│  │   raw/       │  │   curated/   │  │   enriched/  │      │
│  │              │  │              │  │              │      │
│  │ - JSON       │  │ - Parquet    │  │ - Parquet    │      │
│  │ - CSV        │  │ - Deduped    │  │ - Aggregated │      │
│  │ - As-is data │  │ - Validated  │  │ - ML features│      │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         │                 │                 │               │
│         ▼                 ▼                 ▼               │
│  ┌──────────────────────────────────────────────────┐      │
│  │              AWS Glue Data Catalog               │      │
│  └──────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘

# Glue ETL: Raw to Curated
raw_df = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="events"
)

# Data quality checks
from awsgluedq.transforms import EvaluateDataQuality
quality_results = EvaluateDataQuality.apply(
    frame=raw_df,
    ruleset="""
        Rules = [
            ColumnExists "user_id",
            IsComplete "event_timestamp",
            ColumnValues "event_type" in ["click", "view", "purchase"]
        ]
    """
)

# Transform and deduplicate
curated_df = raw_df.toDF() \
    .dropDuplicates(["event_id"]) \
    .withColumn("processed_date", current_date())

# Write to curated zone
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(curated_df, glueContext, "curated"),
    connection_type="s3",
    connection_options={"path": "s3://lake/curated/events/"},
    format="parquet",
    format_options={"compression": "snappy"}
)

# Lake Formation permissions
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::...:role/AnalystRole'},
    Resource={'Table': {'DatabaseName': 'curated_db', 'Name': 'events'}},
    Permissions=['SELECT']
)
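The dedup semantics of `dropDuplicates(["event_id"])` in the Glue job above can be sketched in plain Python, outside Spark, to show what "keep one record per event_id" means:

```python
def deduplicate(events):
    """Keep the first record seen per event_id, mirroring
    dropDuplicates(["event_id"]) in the Glue job above."""
    seen, out = set(), []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            out.append(e)
    return out

raw = [{"event_id": "a", "v": 1}, {"event_id": "a", "v": 2}, {"event_id": "b", "v": 3}]
print(deduplicate(raw))  # two records: event_id "a" (first copy) and "b"
```

Note that, like Spark's `dropDuplicates`, which copy survives is arbitrary unless you sort first; dedup by a business timestamp needs an explicit ordering step.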

3. Design a CDC pipeline from RDS to data warehouse

Scenario: Real-time replication from operational RDS to Redshift analytics.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                                                              │
│  RDS (PostgreSQL)                                            │
│       │                                                      │
│       │ (Binary Log / Logical Replication)                  │
│       ▼                                                      │
│  AWS DMS (CDC Task)                                          │
│       │                                                      │
│       ▼                                                      │
│  Kinesis Data Streams                                        │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Lambda          Firehose       S3 (Archive)                │
│  (Transform)     (to Redshift)                              │
│       │              │                                       │
│       ▼              ▼                                       │
│  DynamoDB        Redshift                                    │
│  (Latest State)  (Analytics)                                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘

# DMS Task configuration
dms.create_replication_task(
    ReplicationTaskIdentifier='rds-to-kinesis-cdc',
    SourceEndpointArn=source_endpoint_arn,
    TargetEndpointArn=kinesis_endpoint_arn,
    ReplicationInstanceArn=replication_instance_arn,
    MigrationType='cdc',  # Change data capture only
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "public",
                "table-name": "%"
            },
            "rule-action": "include"
        }]
    }),
    ReplicationTaskSettings=json.dumps({
        "TargetMetadata": {
            "ParallelLoadThreads": 4,
            "ParallelLoadBufferSize": 500
        },
        "FullLoadSettings": {
            "TargetTablePrepMode": "DO_NOTHING"
        },
        "Logging": {"EnableLogging": True}
    })
)

# Lambda to handle CDC events
def handler(event, context):
    for record in event['Records']:
        cdc_event = json.loads(base64.b64decode(record['kinesis']['data']))
        
        operation = cdc_event['metadata']['operation'].upper()  # insert / update / delete
        table = cdc_event['metadata']['table-name']
        data = cdc_event['data']
        
        if operation in ('INSERT', 'UPDATE'):
            # Upsert keeps the latest row state in DynamoDB
            dynamodb.put_item(TableName=f'{table}_current', Item=data)
        elif operation == 'DELETE':
            dynamodb.delete_item(TableName=f'{table}_current', Key={'id': data['id']})
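The "latest state" logic the Lambda applies to DynamoDB can be checked in isolation by replaying DMS-style events onto a plain dict (a sketch, assuming each row carries an `id` key):

```python
def apply_cdc(state: dict, events: list) -> dict:
    """Replay DMS-style CDC events onto an in-memory latest-state
    table: insert/update upsert the row, delete removes it."""
    for e in events:
        op = e["metadata"]["operation"].upper()
        if op in ("INSERT", "UPDATE"):
            state[e["data"]["id"]] = e["data"]
        elif op == "DELETE":
            state.pop(e["data"]["id"], None)
    return state

events = [
    {"metadata": {"operation": "insert"}, "data": {"id": 1, "name": "a"}},
    {"metadata": {"operation": "update"}, "data": {"id": 1, "name": "b"}},
    {"metadata": {"operation": "delete"}, "data": {"id": 1}},
]
print(apply_cdc({}, events))  # {}
```

Ordering matters: shard by primary key (DMS's Kinesis partition-key-by-table/PK setting) so updates for the same row are never reordered across shards.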

4. Build a real-time fraud detection system

Scenario: Detect fraudulent transactions in real-time with ML.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│  Transaction API                                             │
│       │                                                      │
│       ▼                                                      │
│  Kinesis Data Streams (transactions)                        │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Lambda          Lambda          Firehose                   │
│  (Feature Eng)   (Rule Engine)   (Archive)                  │
│       │              │                                       │
│       ▼              ▼                                       │
│  SageMaker       DynamoDB                                    │
│  (ML Inference)  (Rules/History)                            │
│       │                                                      │
│       └──────────────┬──────────────┐                       │
│                      │              │                       │
│                      ▼              ▼                       │
│                  Approve        SNS Alert                   │
│                  Transaction    (Fraud Team)                │
└─────────────────────────────────────────────────────────────┘

# Feature engineering Lambda
def handler(event, context):
    decisions = []
    for record in event['Records']:
        txn = json.loads(base64.b64decode(record['kinesis']['data']))
        
        # Get historical features from DynamoDB (Item is absent for new users)
        user_history = dynamodb.get_item(
            TableName='user_transactions',
            Key={'user_id': {'S': txn['user_id']}}
        ).get('Item', {})
        
        # Calculate real-time features
        features = {
            'amount': txn['amount'],
            'merchant_category': txn['merchant_category'],
            'hour_of_day': datetime.fromisoformat(txn['timestamp']).hour,
            'avg_transaction_amount': user_history.get('avg_amount', 0),
            'transaction_count_24h': user_history.get('count_24h', 0),
            'distance_from_last_txn': calculate_distance(txn, user_history),
            'time_since_last_txn': calculate_time_delta(txn, user_history)
        }
        
        # Call SageMaker endpoint
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName='fraud-detection-model',
            ContentType='application/json',
            Body=json.dumps(features)
        )
        
        prediction = json.loads(response['Body'].read())
        fraud_score = prediction['fraud_probability']
        
        if fraud_score > 0.8:
            # Block transaction and alert the fraud team
            sns.publish(TopicArn=FRAUD_ALERT_TOPIC, Message=json.dumps(txn))
            decisions.append({'action': 'BLOCK', 'reason': 'High fraud score'})
        elif fraud_score > 0.5:
            # Request additional verification
            decisions.append({'action': 'VERIFY', 'reason': 'Suspicious activity'})
        else:
            decisions.append({'action': 'APPROVE'})
    
    # Returning from inside the loop would silently drop the rest of the batch
    return decisions
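The `transaction_count_24h` feature above is a classic velocity feature; a sketch of how it would be computed from a user's recent timestamps (names are illustrative):

```python
from datetime import datetime, timedelta

def txn_count_24h(timestamps, now):
    """Count transactions in the trailing 24-hour window - the
    'transaction_count_24h' feature fed to the model above."""
    cutoff = now - timedelta(hours=24)
    return sum(1 for t in timestamps if t > cutoff)

now = datetime(2026, 1, 15, 12, 0)
history = [now - timedelta(hours=h) for h in (1, 5, 23, 30)]
print(txn_count_24h(history, now))  # 3
```

In production this usually becomes a DynamoDB item with a rolling counter or a TTL-expired list, so the Lambda never scans full history per transaction.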

5. Design a multi-tenant data platform

Scenario: SaaS platform with isolated data for multiple tenants.

Architecture Options:
┌─────────────────────────────────────────────────────────────┐
│  Option 1: Siloed (Separate resources per tenant)           │
│                                                              │
│  Tenant A          Tenant B          Tenant C               │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐            │
│  │S3/tenant-a│    │S3/tenant-b│    │S3/tenant-c│            │
│  │Redshift-A │    │Redshift-B │    │Redshift-C │            │
│  │Glue Jobs A│    │Glue Jobs B│    │Glue Jobs C│            │
│  └───────────┘    └───────────┘    └───────────┘            │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  Option 2: Pooled (Shared resources with isolation)         │
│                                                              │
│  ┌────────────────────────────────────────────────────┐    │
│  │                   Shared S3                         │    │
│  │  s3://data-lake/tenant_id=A/                       │    │
│  │  s3://data-lake/tenant_id=B/                       │    │
│  │  s3://data-lake/tenant_id=C/                       │    │
│  └────────────────────────────────────────────────────┘    │
│                         │                                   │
│  ┌────────────────────────────────────────────────────┐    │
│  │            Lake Formation (Row/Column Security)    │    │
│  │  Policy: tenant_id = ${session:tenant_id}         │    │
│  └────────────────────────────────────────────────────┘    │
│                         │                                   │
│  ┌────────────────────────────────────────────────────┐    │
│  │               Shared Redshift                       │    │
│  │  RLS: WHERE tenant_id = current_setting('tenant')  │    │
│  └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

# Lake Formation data cell filtering
lakeformation.create_data_cells_filter(
    TableData={
        'TableCatalogId': account_id,
        'DatabaseName': 'multi_tenant_db',
        'TableName': 'customer_data',
        'Name': 'tenant_filter',
        # RowFilter nests inside TableData; FilterExpression and
        # AllRowsWildcard are mutually exclusive (typically one
        # named filter per tenant)
        'RowFilter': {'FilterExpression': "tenant_id = 'tenant_a'"},
        'ColumnWildcard': {}
    }
)

# Grant with session tags
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': f'arn:aws:iam::{account_id}:role/TenantRole'},
    Resource={
        'DataCellsFilter': {
            'TableCatalogId': account_id,
            'DatabaseName': 'multi_tenant_db',
            'TableName': 'customer_data',
            'Name': 'tenant_filter'
        }
    },
    Permissions=['SELECT'],
    PermissionsWithGrantOption=[]
)

# Redshift Row-Level Security
CREATE RLS POLICY tenant_isolation
WITH (tenant_id VARCHAR(64))
USING (tenant_id = current_setting('app.tenant_id'));

ATTACH RLS POLICY tenant_isolation ON customer_data TO ROLE analyst_role;
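In the pooled model, every object key and every query is scoped by tenant. A sketch of that convention as code (illustrative helpers, not an AWS API; real enforcement is Lake Formation filters and Redshift RLS as shown above):

```python
def tenant_key(tenant_id: str, object_key: str) -> str:
    """Build a pooled-model S3 key under the tenant's partition."""
    return f"tenant_id={tenant_id}/{object_key}"

def authorize(tenant_id: str, key: str) -> bool:
    """Coarse guard mirroring the row-level policy: a tenant may
    only touch keys under its own prefix."""
    return key.startswith(f"tenant_id={tenant_id}/")

k = tenant_key("A", "orders/2026/01/part-0.parquet")
print(k)                  # tenant_id=A/orders/2026/01/part-0.parquet
print(authorize("B", k))  # False
```

Keeping `tenant_id` as the leading partition also makes per-tenant deletes (e.g. offboarding, GDPR erasure) a prefix operation.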




6. Build an event-driven ETL pipeline

Scenario: Trigger ETL automatically when new data arrives.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│  S3 Bucket (Raw Data Landing)                               │
│       │                                                      │
│       │ (S3 Event Notification)                             │
│       ▼                                                      │
│  EventBridge                                                 │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Step Functions   Lambda          SNS                       │
│  (Orchestration)  (Validation)    (Notification)            │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Glue Crawler    Glue Job        Quality Check              │
│  (Schema)        (Transform)     (DQ Rules)                 │
│       │              │              │                       │
│       └──────────────┼──────────────┘                       │
│                      ▼                                       │
│                  Curated Zone                                │
└─────────────────────────────────────────────────────────────┘

# EventBridge Rule
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {"name": ["raw-data-landing"]},
    "object": {"key": [{"prefix": "incoming/"}]}
  }
}

# Step Functions Definition
{
  "StartAt": "ValidateFile",
  "States": {
    "ValidateFile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:validate-file",
      "Next": "FileValid?"
    },
    "FileValid?": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.validation.status",
        "StringEquals": "VALID",
        "Next": "UpdateCatalog"
      }],
      "Default": "HandleInvalidFile"
    },
    "UpdateCatalog": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
      "Parameters": {"Name": "raw-data-crawler"},
      "Next": "RunETL"
    },
    "RunETL": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "raw-to-curated",
        "Arguments": {"--source_path.$": "$.s3.object.key"}
      },
      "Next": "QualityCheck"
    },
    "QualityCheck": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:glue:startDataQualityRulesetEvaluationRun",
      "Parameters": {
        "DataSource": {"GlueTable": {"DatabaseName": "curated_db", "TableName": "events"}},
        "Role": "GlueDataQualityRole",
        "RulesetNames": ["curated-quality-rules"]
      },
      "Next": "NotifySuccess"
    },
    "NotifySuccess": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {"TopicArn": "...", "Message": "ETL completed"},
      "End": true
    },
    "HandleInvalidFile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:handle-invalid",
      "End": true
    }
  }
}
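The `ValidateFile` task above is an ordinary Lambda; a minimal sketch of its logic, returning the `{'validation': {'status': ...}}` shape the Choice state branches on (allowed formats and the empty-file check are illustrative assumptions):

```python
ALLOWED_SUFFIXES = (".csv", ".json", ".parquet")  # assumed accepted formats

def validate_file(key: str, size_bytes: int) -> dict:
    """Return the validation payload consumed by the 'FileValid?'
    Choice state in the state machine above."""
    if size_bytes == 0:
        return {"validation": {"status": "INVALID", "reason": "empty file"}}
    if not key.lower().endswith(ALLOWED_SUFFIXES):
        return {"validation": {"status": "INVALID", "reason": "unsupported format"}}
    return {"validation": {"status": "VALID"}}

print(validate_file("incoming/orders.csv", 1024))   # status VALID
print(validate_file("incoming/orders.xlsx", 1024))  # status INVALID
```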

7. Design a data quality monitoring framework

Scenario: Continuous data quality monitoring with alerting.

Framework Components:
┌─────────────────────────────────────────────────────────────┐
│                Data Quality Framework                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                 DQ Rule Repository                   │   │
│  │  - Completeness (null checks)                       │   │
│  │  - Uniqueness (duplicate checks)                    │   │
│  │  - Validity (format, range checks)                  │   │
│  │  - Consistency (referential integrity)              │   │
│  │  - Timeliness (freshness checks)                    │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Glue Data Quality                       │   │
│  │  ruleset = """                                      │   │
│  │    Rules = [                                        │   │
│  │      Completeness "email" > 0.99,                   │   │
│  │      Uniqueness "customer_id" = 1.0,                │   │
│  │      ColumnValues "status" in ["A","I","P"],        │   │
│  │      RowCount between 1000 and 1000000              │   │
│  │    ]                                                │   │
│  │  """                                                │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │            Quality Metrics Dashboard                │   │
│  │  CloudWatch → QuickSight                           │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

# Glue Data Quality Rules
glue.create_data_quality_ruleset(
    Name='customer_data_quality',
    Ruleset="""
        Rules = [
            Completeness "email" >= 0.99,
            Completeness "customer_id" = 1.0,
            Uniqueness "customer_id" = 1.0,
            ColumnValues "country_code" in ["US", "CA", "UK", "DE", "FR"],
            ColumnLength "phone" between 10 and 15,
            ColumnValues "created_date" matches "\\d{4}-\\d{2}-\\d{2}",
            CustomSql "SELECT COUNT(*) FROM primary_table t1 
                       LEFT JOIN reference_table t2 ON t1.ref_id = t2.id 
                       WHERE t2.id IS NULL" = 0,
            RowCount > 0,
            DataFreshness "updated_at" <= 24 hours
        ]
    """,
    TargetTable={
        'TableName': 'customers',
        'DatabaseName': 'production_db'
    }
)

# Schedule quality checks
glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        'GlueTable': {'DatabaseName': 'production_db', 'TableName': 'customers'}
    },
    Role='GlueDataQualityRole',
    RulesetNames=['customer_data_quality']
)

# CloudWatch metrics for alerting
cloudwatch.put_metric_data(
    Namespace='DataQuality',
    MetricData=[{
        'MetricName': 'CompletionRate',
        'Dimensions': [{'Name': 'Table', 'Value': 'customers'}],
        'Value': 95.0,
        'Unit': 'Percent'
    }]
)
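The completeness figure behind the `Completeness` DQDL rule and the CloudWatch metric above is just the non-null fraction of a column; a sketch:

```python
def completeness(records, column):
    """Fraction of rows with a non-null value in `column` - the
    figure behind the Completeness rule and the metric above."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(column) is not None)
    return present / len(records)

rows = [{"email": "a@x.com"}, {"email": None}, {"email": "b@x.com"}, {}]
print(completeness(rows, "email"))  # 0.5
```

Emitting this per table per run makes threshold alarms (e.g. completeness < 0.99) a plain CloudWatch alarm rather than custom pipeline logic.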

8. Build a cost-optimized data archive solution

Scenario: Archive historical data with minimal cost while maintaining accessibility.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                  Data Lifecycle                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Hot Data (0-30 days)           │ S3 Standard              │
│  ├── High access frequency       │ Cost: $$$                │
│  └── Low latency required        │                          │
│                                  │                          │
│  Warm Data (30-90 days)         │ S3 Standard-IA           │
│  ├── Medium access frequency     │ Cost: $$                 │
│  └── Acceptable latency          │                          │
│                                  │                          │
│  Cold Data (90-365 days)        │ S3 Glacier Instant       │
│  ├── Rare access                 │ Cost: $                  │
│  └── Minutes retrieval           │                          │
│                                  │                          │
│  Archive (>365 days)            │ S3 Glacier Deep Archive  │
│  ├── Very rare access            │ Cost: ¢                  │
│  └── Hours retrieval             │                          │
└─────────────────────────────────────────────────────────────┘

# S3 Lifecycle Policy
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-historical-data',
            'Filter': {'Prefix': 'processed/'},
            'Status': 'Enabled',
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                {'Days': 90, 'StorageClass': 'GLACIER_IR'},
                {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
            ],
            'NoncurrentVersionTransitions': [
                {'NoncurrentDays': 30, 'StorageClass': 'GLACIER'}
            ],
            'Expiration': {'Days': 2555}  # 7 years for compliance
        }]
    }
)
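The lifecycle rules above define a deterministic mapping from object age to storage class; the same mapping as a checkable function (the `EXPIRED` label is illustrative, standing in for the Expiration action):

```python
def storage_class(age_days: int) -> str:
    """Storage class an object lands in under the lifecycle rules
    above (30/90/365-day transitions, 7-year expiry)."""
    if age_days >= 2555:
        return "EXPIRED"
    if age_days >= 365:
        return "DEEP_ARCHIVE"
    if age_days >= 90:
        return "GLACIER_IR"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"

for age in (5, 45, 120, 400):
    print(age, storage_class(age))
# 5 STANDARD / 45 STANDARD_IA / 120 GLACIER_IR / 400 DEEP_ARCHIVE
```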

# Intelligent Tiering for unpredictable access
s3.put_bucket_intelligent_tiering_configuration(
    Bucket='data-lake',
    Id='auto-archive',
    IntelligentTieringConfiguration={
        'Id': 'auto-archive',
        'Status': 'Enabled',
        'Tierings': [
            {'Days': 90, 'AccessTier': 'ARCHIVE_ACCESS'},
            {'Days': 180, 'AccessTier': 'DEEP_ARCHIVE_ACCESS'}
        ]
    }
)

# Query archived data with Athena
-- Glacier Instant Retrieval objects are queryable directly; Glacier
-- Flexible Retrieval / Deep Archive objects must be restored first
CREATE EXTERNAL TABLE archived_data (...)
STORED AS PARQUET
LOCATION 's3://data-lake/archive/'
TBLPROPERTIES ('read_restored_glacier_objects'='true');

-- Athena does not trigger restores; restore via S3 first, e.g.
--   aws s3api restore-object --bucket data-lake --key archive/... \
--     --restore-request 'Days=7,GlacierJobParameters={Tier=Bulk}'
SELECT * FROM archived_data WHERE year = 2020;

9. Design a cross-region data replication strategy

Scenario: Global company needs data available in multiple regions.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                   US-EAST-1 (Primary)                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  S3 Bucket ──── Cross-Region Replication ────┐     │   │
│  │  Redshift ────── Datashare ─────────────────────┐ │   │
│  │  DynamoDB ──── Global Tables ────────────────────┼─┘  │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
└─────────────────────────────────────────────────────────────┘
                           │
┌─────────────────────────────────────────────────────────────┐
│                   EU-WEST-1 (Secondary)                      │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  S3 Bucket (Replica)                                │   │
│  │  Redshift (Consumer Cluster)                        │   │
│  │  DynamoDB (Global Table Replica)                    │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

# S3 Cross-Region Replication
s3.put_bucket_replication(
    Bucket='primary-bucket',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::...:role/S3ReplicationRole',
        'Rules': [{
            'ID': 'replicate-all',
            'Status': 'Enabled',
            'Priority': 1,
            'Filter': {'Prefix': ''},
            'Destination': {
                'Bucket': 'arn:aws:s3:::replica-bucket-eu',
                'ReplicationTime': {'Status': 'Enabled', 'Time': {'Minutes': 15}},
                'Metrics': {'Status': 'Enabled', 'EventThreshold': {'Minutes': 15}}
            },
            'DeleteMarkerReplication': {'Status': 'Enabled'}
        }]
    }
)

# Redshift Data Sharing
-- Primary cluster (producer)
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '123456789012';  -- consumer then associates it in eu-west-1

-- Secondary cluster (consumer)
CREATE DATABASE sales_db FROM DATASHARE sales_share
OF ACCOUNT '987654321098' NAMESPACE 'ns-xxx';

# DynamoDB Global Tables (create_global_table is the legacy
# 2017.11.29 API; newer tables add replicas via update_table)
dynamodb.create_global_table(
    GlobalTableName='global-users',
    ReplicationGroup=[
        {'RegionName': 'us-east-1'},
        {'RegionName': 'eu-west-1'},
        {'RegionName': 'ap-southeast-1'}
    ]
)

10. Build a real-time recommendation engine pipeline

Scenario: E-commerce real-time product recommendations.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│  User Activity                                               │
│       │                                                      │
│       ▼                                                      │
│  Kinesis Streams (user_events)                              │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Lambda          Firehose       Personalize                 │
│  (Feature Store) (Archive)      (Training)                  │
│       │                              │                       │
│       ▼                              │                       │
│  SageMaker ◄─────────────────────────┘                      │
│  Feature Store                                               │
│       │                                                      │
│       ▼                                                      │
│  ┌─────────────────────────────────────────────────────┐   │
│  │            Recommendation Flow                       │   │
│  │  API Gateway → Lambda → Personalize/SageMaker       │   │
│  │                      → ElastiCache (cached recs)    │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

# Real-time feature update
def update_user_features(event, context):
    for record in event['Records']:
        activity = json.loads(base64.b64decode(record['kinesis']['data']))
        
        # Update real-time features in Feature Store
        sagemaker_featurestore.put_record(
            FeatureGroupName='user-features',
            Record=[
                {'FeatureName': 'user_id', 'ValueAsString': activity['user_id']},
                {'FeatureName': 'last_viewed_category', 'ValueAsString': activity['category']},
                {'FeatureName': 'session_view_count', 'ValueAsString': str(activity['view_count'])},
                {'FeatureName': 'last_activity_time', 'ValueAsString': activity['timestamp']}
            ]
        )
        
        # Update interaction dataset for Personalize
        personalize_events.put_events(
            trackingId='tracking-id',
            userId=activity['user_id'],
            sessionId=activity['session_id'],
            eventList=[{
                'eventType': activity['event_type'],
                'itemId': activity['product_id'],
                'sentAt': datetime.now().timestamp()
            }]
        )

# Get recommendations API
def get_recommendations(event, context):
    user_id = event['pathParameters']['user_id']
    
    # Check cache first ('elasticache' here is a Redis data-plane
    # client such as redis.Redis, not the boto3 control-plane client)
    cached = elasticache.get(f'recs:{user_id}')
    if cached:
        return {'statusCode': 200, 'body': cached}
    
    # Get real-time features
    features = sagemaker_featurestore.get_record(
        FeatureGroupName='user-features',
        RecordIdentifierValueAsString=user_id
    )
    
    # Get recommendations
    response = personalize_runtime.get_recommendations(
        campaignArn='arn:aws:personalize:...:campaign/product-recs',
        userId=user_id,
        numResults=10,
        context={'DEVICE': event.get('device', 'web')}
    )
    
    # Cache and return
    elasticache.setex(f'recs:{user_id}', 300, json.dumps(response))
    return {'statusCode': 200, 'body': json.dumps(response)}
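The cache-aside pattern in `get_recommendations` generalizes to a small helper. A dict-backed sketch (a real deployment uses Redis with the same `get`/`setex` calls):

```python
import time

class TTLCache:
    """Minimal dict-backed stand-in for the ElastiCache/Redis
    get/setex calls used above."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        return value if time.time() < expires else None
    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.time() + ttl_seconds)

def cached(cache, key, ttl, compute):
    """Cache-aside: return the cached value, else compute, store, return."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = compute()
    cache.setex(key, ttl, value)
    return value

cache = TTLCache()
print(cached(cache, "recs:u1", 300, lambda: ["p1", "p2"]))  # computed
print(cached(cache, "recs:u1", 300, lambda: ["stale"]))     # served from cache
```

A 300-second TTL, as above, bounds staleness while absorbing repeat requests within a session.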

11. Design a data governance framework

Scenario: Enterprise data governance with access controls, auditing, and compliance.

Framework:
┌─────────────────────────────────────────────────────────────┐
│               Data Governance Framework                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │               AWS Lake Formation                      │  │
│  │  ┌─────────────────────────────────────────────────┐ │  │
│  │  │ Data Catalog │ Permissions │ Column/Row Security│ │  │
│  │  └─────────────────────────────────────────────────┘ │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │             Data Classification                       │  │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────┐           │  │
│  │  │ PUBLIC   │ │ INTERNAL │ │CONFIDENTIAL│           │  │
│  │  │          │ │          │ │ (PII/PHI)  │           │  │
│  │  └──────────┘ └──────────┘ └────────────┘           │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │               Macie (PII Detection)                   │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │           CloudTrail (Audit Logging)                  │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

# Tag-Based Access Control
lakeformation.add_lf_tags_to_resource(
    Resource={'Table': {'DatabaseName': 'hr_db', 'Name': 'employees'}},
    LFTags=[
        {'TagKey': 'classification', 'TagValues': ['confidential']},
        {'TagKey': 'pii', 'TagValues': ['true']}
    ]
)

# Grant permissions based on tags
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::...:role/HRAnalyst'},
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [
                {'TagKey': 'classification', 'TagValues': ['confidential']},
                {'TagKey': 'department', 'TagValues': ['hr']}
            ]
        }
    },
    Permissions=['SELECT']
)

# Enable Macie for PII detection
macie.create_classification_job(
    name='pii-scan-data-lake',
    s3JobDefinition={
        'bucketDefinitions': [{'accountId': account_id, 'buckets': ['data-lake']}],
        'scoping': {
            'includes': {'and': [{'simpleScopeTerm': {'comparator': 'EQ', 'key': 'OBJECT_EXTENSION', 'values': ['parquet', 'csv']}}]}
        }
    },
    jobType='SCHEDULED',
    scheduleFrequency={'dailySchedule': {}}
)
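CloudTrail closes the audit loop in the framework above. One way to use it is to query the trail's logs with Athena; the sketch below assumes CloudTrail S3 data events are delivered to S3 and exposed in Athena as a table named `cloudtrail_logs` (both the table and bucket names are assumptions).

```python
# Hypothetical audit query: which principals are reading data-lake objects?
AUDIT_QUERY = """
SELECT useridentity.arn AS principal, eventname, COUNT(*) AS calls
FROM cloudtrail_logs
WHERE eventsource = 's3.amazonaws.com'
  AND json_extract_scalar(requestparameters, '$.bucketName') = 'data-lake'
GROUP BY useridentity.arn, eventname
ORDER BY calls DESC
"""

def start_audit_query(output_location='s3://athena-results/audit/'):
    # Requires AWS credentials at call time; imported here to keep the
    # module importable without boto3 configured
    import boto3
    athena = boto3.client('athena')
    return athena.start_query_execution(
        QueryString=AUDIT_QUERY,
        WorkGroup='primary',
        ResultConfiguration={'OutputLocation': output_location}
    )['QueryExecutionId']
```

The query execution ID can then be polled with `get_query_execution` and the results joined back to the Lake Formation tags for reporting.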

12. Build a streaming ML inference pipeline

Scenario: Real-time ML predictions on streaming data.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│  IoT Sensors / Applications                                  │
│       │                                                      │
│       ▼                                                      │
│  Kinesis Data Streams                                        │
│       │                                                      │
│       ├──────────────────┬──────────────┐                   │
│       │                  │              │                   │
│       ▼                  ▼              ▼                   │
│  Kinesis Analytics   Lambda          Firehose               │
│  (Flink + ML)        (SageMaker)     (Archive)              │
│       │                  │                                   │
│       ▼                  ▼                                   │
│  Output Stream       DynamoDB                                │
│       │              (Results)                               │
│       ▼                                                      │
│  Lambda → SNS (Alerts)                                       │
└─────────────────────────────────────────────────────────────┘

# Lambda with SageMaker inference
def handler(event, context):
    predictions = []
    
    for record in event['Records']:
        data = json.loads(base64.b64decode(record['kinesis']['data']))
        
        # Prepare features
        features = prepare_features(data)
        
        # Invoke the endpoint (for efficiency, batch several instances per call)
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName='anomaly-detector',
            ContentType='application/json',
            Body=json.dumps({'instances': [features]})
        )
        
        prediction = json.loads(response['Body'].read())
        
        if prediction['anomaly_score'] > 0.9:
            # Send alert
            sns.publish(
                TopicArn=ALERT_TOPIC,
                Message=json.dumps({'device_id': data['device_id'], 'anomaly': prediction})
            )
        
        predictions.append({
            'device_id': data['device_id'],
            'timestamp': data['timestamp'],
            'prediction': prediction
        })
    
    # Store predictions (DynamoDB rejects floats; convert via decimal.Decimal)
    with dynamodb.Table('predictions').batch_writer() as batch:
        for pred in predictions:
            batch.put_item(Item=json.loads(json.dumps(pred), parse_float=Decimal))
    
    return {'processed': len(predictions)}
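
The `prepare_features` call above is left undefined; a minimal sketch might look like the following, where the sensor field names and scaling constants are illustrative assumptions, not taken from a real model.

```python
# Hypothetical feature preparation for the anomaly-detector endpoint:
# scale raw sensor readings to roughly unit range before inference.
def prepare_features(data):
    return [
        data['temperature'] / 100.0,   # e.g. degrees C, scaled
        data['pressure'] / 1000.0,     # e.g. kPa, scaled
        data['vibration'] / 10.0,      # e.g. mm/s, scaled
    ]
```

In practice the feature order and scaling must match exactly what the model was trained with, which is why feature stores (as in scenario 10) are often used to share this logic.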

# Kinesis Analytics with Flink ML
# Amazon Kinesis Data Analytics for Apache Flink (now Amazon Managed Service
# for Apache Flink): ML_PREDICT is available in newer Flink SQL, or can be a
# UDF wrapping a SageMaker endpoint call

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE predictions AS
    SELECT
        device_id,
        event_time,
        ML_PREDICT('anomaly-model', temperature, pressure, vibration) AS anomaly_score
    FROM sensor_data
    WHERE ML_PREDICT('anomaly-model', temperature, pressure, vibration) > 0.5
""")

13. Design a disaster recovery data strategy

Scenario: Design DR strategy with RPO < 1 hour and RTO < 4 hours.

DR Tiers:
┌─────────────────────────────────────────────────────────────┐
│  Tier 1: Pilot Light (Low cost, Higher RTO)                 │
│  - Core infrastructure pre-provisioned                       │
│  - Data replicated continuously                              │
│  - Compute scaled up on failover                            │
│  RTO: 4-8 hours | RPO: ~1 hour                              │
├─────────────────────────────────────────────────────────────┤
│  Tier 2: Warm Standby (Medium cost, Medium RTO)             │
│  - Scaled-down version running                               │
│  - Data replicated in near real-time                        │
│  - Quick scale-up on failover                               │
│  RTO: 1-4 hours | RPO: ~15 minutes                          │
├─────────────────────────────────────────────────────────────┤
│  Tier 3: Multi-Site Active (High cost, Lowest RTO)          │
│  - Full production in multiple regions                       │
│  - Real-time data sync                                       │
│  - Automatic failover                                        │
│  RTO: minutes | RPO: ~0                                      │
└─────────────────────────────────────────────────────────────┘

Implementation:
┌─────────────────────────────────────────────────────────────┐
│  Primary (us-east-1)              DR (us-west-2)            │
│  ┌────────────────┐              ┌────────────────┐        │
│  │ S3 ────────────┼── CRR ──────►│ S3 (Replica)   │        │
│  │ RDS ───────────┼── Read Rep ─►│ RDS (Standby)  │        │
│  │ Redshift ──────┼── Snapshot ─►│ Redshift       │        │
│  │ DynamoDB ──────┼── Global ───►│ DynamoDB       │        │
│  │ OpenSearch ────┼── CCR ──────►│ OpenSearch     │        │
│  └────────────────┘              └────────────────┘        │
│                                                             │
│  Route 53 (Health checks + Failover routing)               │
└─────────────────────────────────────────────────────────────┘

# Automated Redshift snapshot copy to DR region (the grant is created in the destination region)
redshift.create_snapshot_copy_grant(
    SnapshotCopyGrantName='dr-grant',
    KmsKeyId='arn:aws:kms:us-west-2:...:key/...'
)

redshift.enable_snapshot_copy(
    ClusterIdentifier='production-cluster',
    DestinationRegion='us-west-2',
    RetentionPeriod=7,
    SnapshotCopyGrantName='dr-grant'
)

# RDS Cross-Region Read Replica (call against the DR region's endpoint)
rds_west = boto3.client('rds', region_name='us-west-2')
rds_west.create_db_instance_read_replica(
    DBInstanceIdentifier='dr-replica',
    SourceDBInstanceIdentifier='arn:aws:rds:us-east-1:...:db:production',
    DBInstanceClass='db.r5.xlarge',
    AvailabilityZone='us-west-2a',
    SourceRegion='us-east-1'
)

# Failover automation with Lambda
def failover_handler(event, context):
    # 1. Promote RDS replica
    rds.promote_read_replica(DBInstanceIdentifier='dr-replica')
    
    # 2. Update Route 53
    route53.change_resource_record_sets(
        HostedZoneId='...',
        ChangeBatch={'Changes': [{'Action': 'UPSERT', 'ResourceRecordSet': {...}}]}
    )
    
    # 3. Scale up DR Redshift
    redshift.resize_cluster(
        ClusterIdentifier='dr-cluster',
        ClusterType='multi-node',
        NumberOfNodes=4
    )
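
The DynamoDB Global row in the diagram can be enabled with `UpdateTable` and a `ReplicaUpdates` block (global tables version 2019.11.21); the table and region names below are assumptions.

```python
def replica_update(region):
    # ReplicaUpdates payload for UpdateTable (global tables v2019.11.21)
    return {'ReplicaUpdates': [{'Create': {'RegionName': region}}]}

def add_dr_replica(table_name='orders', region='us-west-2'):
    # The table must have DynamoDB Streams (NEW_AND_OLD_IMAGES) enabled;
    # boto3 imported here so the module loads without AWS credentials
    import boto3
    dynamodb = boto3.client('dynamodb', region_name='us-east-1')
    return dynamodb.update_table(TableName=table_name, **replica_update(region))
```

Once the replica is active, writes in either region replicate automatically, which is what gives the near-zero RPO of the multi-site tier.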

14. Build a data catalog and discovery platform

Scenario: Enterprise data catalog for data discovery and collaboration.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                 Data Catalog Platform                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              AWS Glue Data Catalog                    │  │
│  │  ┌─────────────────────────────────────────────────┐ │  │
│  │  │ Databases │ Tables │ Schemas │ Partitions      │ │  │
│  │  └─────────────────────────────────────────────────┘ │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                Glue Crawlers                          │  │
│  │  S3 → Crawler → Catalog (auto-schema detection)      │  │
│  │  RDS → Crawler → Catalog (JDBC)                      │  │
│  │  Redshift → Crawler → Catalog                        │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Data Quality & Profiling                 │  │
│  │  Glue Data Quality + Custom profiling jobs           │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                Search & Discovery                     │  │
│  │  OpenSearch (catalog search) + QuickSight (viz)      │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

# Enhanced table metadata
glue.update_table(
    DatabaseName='analytics',
    TableInput={
        'Name': 'customer_orders',
        'Description': 'Customer orders from e-commerce platform',
        'Parameters': {
            'classification': 'parquet',
            'data_owner': 'analytics-team',
            'data_steward': 'john.doe@company.com',
            'refresh_frequency': 'daily',
            'pii_columns': 'email,phone,address',
            'business_glossary_term': 'Customer Transaction',
            'quality_score': '95'
        },
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'customer_id', 'Type': 'string', 'Comment': 'Unique customer identifier'},
                {'Name': 'email', 'Type': 'string', 'Comment': 'Customer email (PII)'},
                {'Name': 'order_total', 'Type': 'decimal(10,2)', 'Comment': 'Total order value'}
            ],
            'Location': 's3://data-lake/orders/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'}
        }
    }
)

# Index catalog in OpenSearch for search
def index_catalog_to_opensearch():
    # get_tables is paginated; iterate all pages
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName='analytics'):
        for table in page['TableList']:
            params = table.get('Parameters', {})
            doc = {
                'name': table['Name'],
                'database': table['DatabaseName'],
                'description': table.get('Description', ''),
                'columns': [col['Name'] for col in table['StorageDescriptor']['Columns']],
                'owner': params.get('data_owner', ''),
                'tags': params.get('classification', ''),
                'updated': table['UpdateTime'].isoformat()
            }
            opensearch.index(index='data-catalog', body=doc, id=f"{table['DatabaseName']}.{table['Name']}")
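With the catalog indexed, discovery becomes a full-text search; a sketch of the query side follows (the `data-catalog` index matches the indexing code above, while the field boost values are assumptions).

```python
def build_catalog_query(term, size=20):
    # Multi-field search over the indexed metadata; table names rank highest
    return {
        'size': size,
        'query': {
            'multi_match': {
                'query': term,
                'fields': ['name^3', 'description^2', 'columns', 'tags']
            }
        }
    }

def search_catalog(opensearch_client, term):
    hits = opensearch_client.search(index='data-catalog', body=build_catalog_query(term))
    return [h['_source'] for h in hits['hits']['hits']]
```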

15. Design a PII data handling pipeline

Scenario: Handle PII data with encryption, masking, and access controls.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                  PII Data Pipeline                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Raw Data (PII) → Macie (Detection) → Classification        │
│       │                                                      │
│       ▼                                                      │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Glue ETL (Transformation)                │  │
│  │  - Tokenization (replace with tokens)                 │  │
│  │  - Encryption (KMS)                                   │  │
│  │  - Masking (partial visibility)                       │  │
│  │  - Hashing (one-way for matching)                     │  │
│  └──────────────────────────────────────────────────────┘  │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Tokenized      Encrypted       Masked                      │
│  (Analytics)    (Secure Store)  (Reporting)                 │
│                                                              │
│  Lake Formation (Column-level security on PII)              │
└─────────────────────────────────────────────────────────────┘

# PII Detection and Masking in Glue
from pyspark.sql.functions import sha2, regexp_replace, col
from awsglue.transforms import *

# Read raw data
raw_df = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="customers"
).toDF()

# Tokenize email (reversible via lookup table)
def tokenize_email(email):
    token = str(uuid.uuid4())
    dynamodb.Table('pii_tokens').put_item(Item={'token': token, 'value': email})
    return token

# Hash for matching (irreversible)
hashed_df = raw_df.withColumn('email_hash', sha2(col('email'), 256))

# Mask SSN (show last 4 only)
masked_df = hashed_df.withColumn(
    'ssn_masked',
    regexp_replace(col('ssn'), r'^\d{3}-\d{2}', 'XXX-XX')
)

# Encrypt sensitive columns with KMS envelope encryption
from cryptography.fernet import Fernet

def encrypt_column(value, kms_key):
    # Generate a data key; persist data_key['CiphertextBlob'] alongside the
    # ciphertext so the value can later be decrypted via kms.decrypt
    data_key = kms.generate_data_key(KeyId=kms_key, KeySpec='AES_256')
    cipher = Fernet(base64.urlsafe_b64encode(data_key['Plaintext']))
    return cipher.encrypt(value.encode())

# encrypt_udf: a Spark UDF wrapping encrypt_column
encrypted_df = masked_df.withColumn(
    'address_encrypted',
    encrypt_udf(col('address'))
)

# Lake Formation column-level security
# (ColumnNames and ColumnWildcard are mutually exclusive; list allowed columns)
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::...:role/AnalystRole'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'customers_db',
            'Name': 'customers',
            'ColumnNames': ['customer_id', 'name', 'city']  # PII columns excluded
        }
    },
    Permissions=['SELECT']
)
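
For the reverse path, only a tightly-scoped role should be able to resolve tokens back to values. A sketch follows: the `pii_tokens` table matches the tokenization step above, and the mask helper shows the same last-4 convention used for SSNs.

```python
def mask_ssn(ssn):
    # Same convention as the Glue masking step: keep only the last 4 digits
    return 'XXX-XX-' + ssn[-4:]

def detokenize(token, table_name='pii_tokens'):
    # Lookup should be restricted by IAM to a dedicated detokenization role;
    # boto3 imported here so the module loads without AWS credentials
    import boto3
    table = boto3.resource('dynamodb').Table(table_name)
    item = table.get_item(Key={'token': token}).get('Item')
    return item['value'] if item else None
```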

16. Build a real-time aggregation dashboard

Scenario: Real-time business metrics dashboard with sub-second updates.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│  Events (Sales, Clicks, etc.)                                │
│       │                                                      │
│       ▼                                                      │
│  Kinesis Data Streams                                        │
│       │                                                      │
│       ├──────────────┬──────────────┐                       │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│  Lambda          Analytics      Firehose                     │
│  (Pre-agg)       (Windows)      (Archive)                    │
│       │              │                                       │
│       ▼              ▼                                       │
│  ElastiCache     TimeStream                                  │
│  (Real-time)     (Time series)                              │
│       │              │                                       │
│       └──────────────┼──────────────┐                       │
│                      │              │                       │
│                      ▼              ▼                       │
│               API Gateway      QuickSight                   │
│                    │           (Dashboard)                   │
│                    ▼                                         │
│               WebSocket                                      │
│               (Live updates)                                 │
└─────────────────────────────────────────────────────────────┘

# Lambda for real-time aggregation
def current_minute():
    return datetime.utcnow().strftime('%Y-%m-%dT%H:%M')

def handler(event, context):
    aggregates = defaultdict(lambda: {'count': 0, 'sum': 0})
    
    for record in event['Records']:
        data = json.loads(base64.b64decode(record['kinesis']['data']))
        
        # Aggregate by dimension
        key = f"{data['product_category']}:{data['region']}"
        aggregates[key]['count'] += 1
        aggregates[key]['sum'] += data['amount']
    
    # Update Redis with atomic operations
    minute_key = f'metrics:{current_minute()}'
    pipe = redis.pipeline()
    for key, agg in aggregates.items():
        pipe.hincrby(minute_key, f'{key}:count', agg['count'])
        pipe.hincrbyfloat(minute_key, f'{key}:sum', agg['sum'])
    pipe.expire(minute_key, 3600)
    pipe.execute()
    
    # Push to every open WebSocket connection (IDs tracked in a connections table)
    for connection_id in get_active_connections():
        api_gateway.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps({'type': 'update', 'aggregates': dict(aggregates)})
        )

# Timestream for historical time series
timestream_write.write_records(
    DatabaseName='metrics_db',
    TableName='sales_metrics',
    Records=[{
        'Dimensions': [
            {'Name': 'category', 'Value': data['category']},
            {'Name': 'region', 'Value': data['region']}
        ],
        'MeasureName': 'sales',
        'MeasureValue': str(data['amount']),
        'MeasureValueType': 'DOUBLE',
        'Time': str(int(time.time() * 1000)),
        'TimeUnit': 'MILLISECONDS'
    }]
)

# Query Timestream
SELECT 
    bin(time, 1m) as minute,
    category,
    SUM(measure_value::double) as total_sales
FROM "metrics_db"."sales_metrics"
WHERE time > ago(1h)
GROUP BY bin(time, 1m), category
ORDER BY minute DESC
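
On the read side, the flat Redis hash written by the aggregation Lambda has to be reshaped for the API response. A minimal sketch, where the field layout follows the `hincrby`/`hincrbyfloat` keys above:

```python
def parse_metrics(raw_hash):
    """Turn {'electronics:us:count': '3', 'electronics:us:sum': '99.5'}
    into {'electronics:us': {'count': 3, 'sum': 99.5}}."""
    out = {}
    for field, value in raw_hash.items():
        dimension, metric = field.rsplit(':', 1)
        out.setdefault(dimension, {})[metric] = (
            float(value) if metric == 'sum' else int(value)
        )
    return out
```

The API Lambda can run this over `redis.hgetall(f'metrics:{current_minute()}')` and return the nested dict to the dashboard.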

17. Design a schema evolution strategy

Scenario: Handle schema changes without breaking existing pipelines.

Strategies:
┌─────────────────────────────────────────────────────────────┐
│                Schema Evolution Strategies                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Backward Compatible (readers can read old data)         │
│     - Add optional fields with defaults                      │
│     - Don't remove required fields                           │
│                                                              │
│  2. Forward Compatible (old readers can read new data)      │
│     - Don't add required fields                              │
│     - Old code ignores new fields                            │
│                                                              │
│  3. Full Compatible (both directions)                       │
│     - Only add optional fields                               │
│     - Only remove optional fields                            │
└─────────────────────────────────────────────────────────────┘
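
Backward compatibility can be seen concretely in how an Avro reader resolves old records: fields missing from the record are filled from the reader schema's defaults. A simplified, library-free sketch of that resolution rule:

```python
def resolve_record(record, reader_fields):
    # New reader, old record: absent fields fall back to schema defaults;
    # an absent field with no default means the schemas are incompatible.
    resolved = {}
    for field in reader_fields:
        if field['name'] in record:
            resolved[field['name']] = record[field['name']]
        elif 'default' in field:
            resolved[field['name']] = field['default']
        else:
            raise ValueError(f"incompatible: no value or default for {field['name']}")
    return resolved
```

An old record without `email` still resolves cleanly under the new schema below, because `email` is optional with a default of null.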

# Glue Schema Registry
glue.create_schema(
    RegistryId={'RegistryName': 'data-schemas'},
    SchemaName='customer_events',
    DataFormat='AVRO',
    Compatibility='BACKWARD',  # BACKWARD, FORWARD, FULL, NONE
    SchemaDefinition=json.dumps({
        "type": "record",
        "name": "CustomerEvent",
        "fields": [
            {"name": "customer_id", "type": "string"},
            {"name": "event_type", "type": "string"},
            {"name": "timestamp", "type": "long"},
            {"name": "email", "type": ["null", "string"], "default": null}  # Optional
        ]
    })
)

# Register new schema version
glue.register_schema_version(
    SchemaId={'RegistryName': 'data-schemas', 'SchemaName': 'customer_events'},
    SchemaDefinition=json.dumps({
        "type": "record",
        "name": "CustomerEvent",
        "fields": [
            {"name": "customer_id", "type": "string"},
            {"name": "event_type", "type": "string"},
            {"name": "timestamp", "type": "long"},
            {"name": "email", "type": ["null", "string"], "default": null},
            {"name": "phone", "type": ["null", "string"], "default": null}  # New field
        ]
    })
)

# Kinesis producer with schema registry
# (illustrative: a Glue Schema Registry Avro serializer, e.g. from the
# community aws-glue-schema-registry package; class names vary by library)
serializer = GlueSchemaRegistryAvroSerializer(registry_name='data-schemas')

data = {'customer_id': '123', 'event_type': 'click', 'timestamp': int(time.time())}
encoded = serializer.encode('customer_events', data)

kinesis.put_record(StreamName='events', Data=encoded, PartitionKey='123')

# Glue ETL handling schema evolution
df = glueContext.create_dynamic_frame.from_catalog(database="db", table_name="events")
df_resolved = df.resolveChoice(choice="make_struct")  # Handle schema conflicts

18. Build a data pipeline testing framework

Scenario: Comprehensive testing for data pipelines.

Testing Framework:
┌─────────────────────────────────────────────────────────────┐
│                Data Pipeline Testing                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                 Unit Tests                            │  │
│  │  - Transformation logic                               │  │
│  │  - Data validation functions                          │  │
│  │  - Schema mapping                                     │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Integration Tests                        │  │
│  │  - End-to-end pipeline execution                      │  │
│  │  - AWS service interactions                           │  │
│  │  - Data flow validation                               │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │               Data Quality Tests                      │  │
│  │  - Row counts                                         │  │
│  │  - Schema validation                                  │  │
│  │  - Business rule validation                           │  │
│  │  - Referential integrity                              │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

# Unit test for Glue transformation
import pytest
from pyspark.sql import SparkSession
from my_glue_job import transform_data

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[*]").getOrCreate()

def test_transform_removes_nulls(spark):
    input_data = [
        {"id": 1, "name": "Alice", "email": "alice@test.com"},
        {"id": 2, "name": None, "email": "bob@test.com"},
        {"id": 3, "name": "Charlie", "email": None}
    ]
    input_df = spark.createDataFrame(input_data)
    
    result_df = transform_data(input_df)
    
    assert result_df.filter("name IS NULL").count() == 0
    assert result_df.count() == 2

def test_transform_standardizes_email(spark):
    input_data = [{"id": 1, "email": "ALICE@TEST.COM"}]
    input_df = spark.createDataFrame(input_data)
    
    result_df = transform_data(input_df)
    
    assert result_df.first()["email"] == "alice@test.com"

# Integration test with moto (AWS mocking)
import boto3
import moto

@moto.mock_aws  # moto >= 5; use @moto.mock_s3 on older versions
def test_pipeline_writes_to_s3():
    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='test-bucket')
    
    # Run pipeline
    run_pipeline(source='test_data.csv', destination='s3://test-bucket/output/')
    
    # Verify output
    objects = s3.list_objects_v2(Bucket='test-bucket', Prefix='output/')
    assert objects['KeyCount'] > 0

# Data quality test
def test_data_quality_after_pipeline():
    run = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={'GlueTable': {'DatabaseName': 'test_db', 'TableName': 'output'}},
        Role='GlueRole',
        RulesetNames=['quality_rules']
    )
    
    # Poll until finished, then fetch rule-level results by ID
    status = glue.get_data_quality_ruleset_evaluation_run(RunId=run['RunId'])
    assert status['Status'] == 'SUCCEEDED'
    for result_id in status['ResultIds']:
        result = glue.get_data_quality_result(ResultId=result_id)
        assert all(r['Result'] == 'PASS' for r in result['RuleResults'])

19. Design a data lineage tracking system

Scenario: Track data lineage from source to consumption.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                 Data Lineage System                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                Lineage Sources                        │  │
│  │  - Glue Job metadata                                  │  │
│  │  - Step Functions execution                           │  │
│  │  - CloudTrail S3 events                               │  │
│  │  - Custom instrumentation                             │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │             Lineage Collection                        │  │
│  │  EventBridge → Lambda → Neptune (Graph DB)           │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │               Lineage Graph                           │  │
│  │                                                       │  │
│  │   [Source A] ──► [ETL Job 1] ──► [Table X]          │  │
│  │        │              │              │               │  │
│  │        │              ▼              ▼               │  │
│  │        └───► [ETL Job 2] ──► [Table Y] ──► [Report] │  │
│  │                                                       │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Lineage Visualization                    │  │
│  │  QuickSight / Custom UI with Neptune queries         │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

# Capture Glue job lineage
def capture_glue_lineage(job_run):
    lineage_event = {
        'job_name': job_run['JobName'],
        'run_id': job_run['JobRunId'],
        'inputs': extract_inputs(job_run),
        'outputs': extract_outputs(job_run),
        'timestamp': job_run['CompletedOn'].isoformat(),
        'execution_time': job_run['ExecutionTime']
    }
    
    # Store in Neptune (Gremlin traversal g): one vertex per job run,
    # edges to the tables it reads and writes
    job_vertex = g.addV('ETLJob') \
        .property('name', lineage_event['job_name']) \
        .property('run_id', lineage_event['run_id']).next()
    
    for input_table in lineage_event['inputs']:
        table_vertex = g.V().has('Table', 'name', input_table).next()
        g.addE('reads_from').from_(job_vertex).to(table_vertex).iterate()
    
    for output_table in lineage_event['outputs']:
        table_vertex = g.V().has('Table', 'name', output_table).next()
        g.addE('writes_to').from_(job_vertex).to(table_vertex).iterate()

# Query lineage graph
def get_upstream_lineage(table_name, depth=5):
    query = f"""
    g.V().has('Table', 'name', '{table_name}')
      .repeat(__.in('writes_to').out('reads_from'))
      .times({depth})
      .path()
      .by('name')
    """
    return gremlin.submit(query)

def get_downstream_lineage(table_name, depth=5):
    query = f"""
    g.V().has('Table', 'name', '{table_name}')
      .repeat(__.out('reads_from').in('writes_to'))
      .times({depth})
      .path()
      .by('name')
    """
    return gremlin.submit(query)

# Impact analysis (assumes the path query is projected to {'name', 'type'} dicts)
def get_impacted_assets(source_change):
    downstream = get_downstream_lineage(source_change, depth=10)
    nodes = [node for path in downstream for node in path]
    return {
        'jobs': [n for n in nodes if n['type'] == 'ETLJob'],
        'tables': [n for n in nodes if n['type'] == 'Table'],
        'reports': [n for n in nodes if n['type'] == 'Report']
    }
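
The collection layer in the diagram (EventBridge → Lambda → Neptune) needs a rule that fires when a Glue job finishes. A sketch follows; the rule name is a placeholder and the target Lambda ARN is supplied by the caller.

```python
import json

# Matches the "Glue Job State Change" events EventBridge emits for Glue jobs
GLUE_JOB_SUCCEEDED = {
    'source': ['aws.glue'],
    'detail-type': ['Glue Job State Change'],
    'detail': {'state': ['SUCCEEDED']}
}

def create_lineage_rule(target_lambda_arn, rule_name='glue-lineage-capture'):
    # boto3 imported here so the module loads without AWS credentials
    import boto3
    events = boto3.client('events')
    events.put_rule(Name=rule_name, EventPattern=json.dumps(GLUE_JOB_SUCCEEDED))
    events.put_targets(Rule=rule_name,
                       Targets=[{'Id': 'lineage-lambda', 'Arn': target_lambda_arn}])
```

The target Lambda receives the job name and run ID in the event detail and can call `get_job_run` to feed `capture_glue_lineage`.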

20. Build a serverless data processing architecture

Scenario: Fully serverless data platform with minimal operational overhead.

Architecture:
┌─────────────────────────────────────────────────────────────┐
│             Serverless Data Platform                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Ingestion:                                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  API Gateway → Lambda → Kinesis Data Streams         │  │
│  │  S3 Events → Lambda → SQS → Lambda                   │  │
│  │  EventBridge → Step Functions                        │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  Processing:                                                 │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Lambda (real-time)                                   │  │
│  │  Glue (Serverless Spark)                             │  │
│  │  Athena (ad-hoc queries)                             │  │
│  │  Step Functions (orchestration)                       │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  Storage:                                                    │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  S3 (data lake)                                       │  │
│  │  DynamoDB (operational)                               │  │
│  │  Timestream (time series)                             │  │
│  │  Redshift Serverless (analytics)                      │  │
│  └──────────────────────────────────────────────────────┘  │
│                          │                                  │
│  Serving:                                                    │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  API Gateway → Lambda → DynamoDB/Athena              │  │
│  │  AppSync (GraphQL)                                    │  │
│  │  QuickSight (BI)                                      │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
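
The `S3 Events → Lambda → SQS → Lambda` ingestion row above can be sketched as a small Lambda that fans object notifications into a queue; the queue URL is a placeholder.

```python
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue'  # placeholder

def extract_objects(s3_event):
    # Pull (bucket, key) pairs out of an S3 event notification payload
    return [(r['s3']['bucket']['name'], r['s3']['object']['key'])
            for r in s3_event.get('Records', [])]

def handler(event, context):
    # boto3 imported here so the module loads without AWS credentials
    import boto3, json
    sqs = boto3.client('sqs')
    for bucket, key in extract_objects(event):
        sqs.send_message(QueueUrl=QUEUE_URL,
                         MessageBody=json.dumps({'bucket': bucket, 'key': key}))
```

Decoupling through SQS lets the downstream processing Lambda retry and scale independently of the notification rate.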

# Serverless ETL with Step Functions + Athena
{
  "StartAt": "RunAthenaQuery",
  "States": {
    "RunAthenaQuery": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "INSERT INTO curated.orders SELECT * FROM raw.orders WHERE order_date = ''",
        "WorkGroup": "primary",
        "ResultConfiguration": {
          "OutputLocation": "s3://athena-results/"
        }
      },
      "Next": "UpdateGlueCatalog"
    },
    "UpdateGlueCatalog": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:glue:updatePartition",
      "Parameters": {
        "DatabaseName": "curated",
        "TableName": "orders",
        "PartitionValueList.$": "States.Array($.date)"
      },
      "Next": "NotifyCompletion"
    },
    "NotifyCompletion": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:...:etl-notifications",
        "Message.$": "States.Format('ETL completed for {}', $.date)"
      },
      "End": true
    }
  }
}

# Redshift Serverless for on-demand analytics
redshift_serverless.create_workgroup(
    workgroupName='analytics',
    namespaceName='analytics-ns',
    baseCapacity=32,  # base RPUs; compute is billed only while queries run
    maxCapacity=128,
    configParameters=[
        {'parameterKey': 'auto_mv', 'parameterValue': 'true'},
        {'parameterKey': 'enable_case_sensitive_identifier', 'parameterValue': 'true'}
    ]
)

# Cost optimization
- Lambda: Pay per invocation + duration
- Glue: Pay per DPU-hour
- Athena: Pay per TB scanned
- Redshift Serverless: Pay per RPU-hour (no compute charge when idle)
- S3: Pay per GB stored + requests

