

Top AWS Data Engineer Interview Questions (2026) | JavaInuse

Top 20 AWS Data Engineer Interview Questions and Answers


  1. What are the key AWS services for Data Engineering?
  2. What is AWS Glue and how does it work?
  3. What is Amazon Redshift?
  4. How do you design a data lake on AWS?
  5. What is AWS Lake Formation?
  6. What is Amazon EMR?
  7. What is Amazon Kinesis?
  8. How do you implement ETL pipelines on AWS?
  9. What is AWS Data Pipeline?
  10. What is Amazon Athena?
  11. How do you optimize S3 for data engineering?
  12. What is the difference between Glue and EMR?
  13. How do you handle schema evolution in AWS?
  14. What is AWS Step Functions?
  15. How do you implement CDC on AWS?
  16. What is Amazon DynamoDB Streams?
  17. How do you secure data on AWS?
  18. What is AWS Glue Data Catalog?
  19. How do you monitor data pipelines on AWS?
  20. What are best practices for AWS Data Engineering?

1. What are the key AWS services for Data Engineering?

AWS Data Engineering Services:

Storage:
├── Amazon S3 - Object storage, data lake foundation
├── Amazon EBS - Block storage for EC2
└── Amazon EFS - Managed file system

Data Processing:
├── AWS Glue - Serverless ETL
├── Amazon EMR - Managed Hadoop/Spark
├── AWS Lambda - Serverless compute
└── Amazon EC2 - Custom processing

Data Warehousing:
├── Amazon Redshift - Cloud data warehouse
├── Amazon Redshift Spectrum - Query S3 data
└── Amazon Redshift Serverless - On-demand warehouse

Streaming:
├── Amazon Kinesis Data Streams - Real-time streaming
├── Amazon Kinesis Firehose - Data delivery
├── Amazon Kinesis Analytics - Stream processing
└── Amazon MSK - Managed Kafka

Analytics & Query:
├── Amazon Athena - Serverless SQL on S3
├── Amazon QuickSight - BI and visualization
└── Amazon OpenSearch - Search and analytics

Orchestration:
├── AWS Step Functions - Workflow orchestration
├── Amazon MWAA - Managed Airflow
└── AWS Data Pipeline - Data movement (deprecated)

2. What is AWS Glue and how does it work?

AWS Glue is a fully managed serverless ETL service for data preparation and loading.

Components:
AWS Glue Components:
├── Data Catalog
│   ├── Databases
│   ├── Tables (metadata)
│   └── Connections
├── Crawlers
│   └── Auto-discover schema
├── ETL Jobs
│   ├── Spark jobs
│   ├── Python Shell
│   └── Ray (ML workloads)
├── Glue Studio
│   └── Visual ETL designer
└── Data Quality
    └── Built-in rules

# Glue Job Example (PySpark)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="raw_data"
)

# Transform
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("id", "string", "customer_id", "string"),
        ("name", "string", "customer_name", "string"),
        ("amount", "double", "total_amount", "double")
    ]
)

# Write to S3
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://bucket/processed/"},
    format="parquet"
)

job.commit()

3. What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse.

Architecture:
Redshift Cluster:
├── Leader Node
│   ├── SQL parsing
│   ├── Query planning
│   └── Result aggregation
└── Compute Nodes
    ├── Node Slices
    ├── Local storage
    └── Parallel processing

Node Types:
├── RA3 (Recommended)
│   ├── Managed storage (RMS)
│   ├── Scale compute/storage independently
│   └── ra3.xlplus, ra3.4xlarge, ra3.16xlarge
├── DC2 (Dense Compute)
│   ├── Local SSD storage
│   └── dc2.large, dc2.8xlarge
└── DS2 (Dense Storage) - Legacy

-- Create table with distribution
CREATE TABLE sales (
    sale_id INT,
    customer_id INT,
    product_id INT,
    sale_date DATE,
    amount DECIMAL(10,2)
)
DISTKEY(customer_id)
SORTKEY(sale_date)
DISTSTYLE KEY;

-- Distribution Styles:
-- KEY: Rows with same key on same slice
-- EVEN: Round-robin distribution
-- ALL: Copy to all nodes (small dims)
-- AUTO: Redshift decides
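As a sketch, DDL like the statement above can also be submitted without a JDBC driver via the Redshift Data API; the cluster identifier, database, and user names below are placeholders:

```python
def build_execute_statement(sql, cluster_id="my-redshift-cluster",
                            database="dev", db_user="awsuser"):
    """Build a Redshift Data API execute_statement request.

    cluster_id, database, and db_user are placeholder names.
    """
    return {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }

# To actually run it (requires AWS credentials):
# import boto3
# client = boto3.client("redshift-data")
# resp = client.execute_statement(**build_execute_statement("SELECT 1"))
# client.describe_statement(Id=resp["Id"])  # poll until FINISHED
```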

4. How do you design a data lake on AWS?

AWS Data Lake Architecture:

S3 Bucket Structure (Medallion):
s3://data-lake/
├── raw/                    # Bronze - Raw data
│   ├── source1/
│   │   └── year=2024/month=01/day=15/
│   └── source2/
├── staged/                 # Silver - Cleansed
│   ├── domain1/
│   │   └── table1/
│   └── domain2/
├── curated/                # Gold - Business ready
│   ├── analytics/
│   └── reporting/
└── archive/                # Cold storage

Components:
1. Storage: S3 with lifecycle policies
2. Catalog: AWS Glue Data Catalog / Lake Formation
3. Processing: Glue, EMR, Athena
4. Security: Lake Formation permissions
5. Governance: Data quality, lineage

# Lake Formation setup
import boto3

lakeformation = boto3.client('lakeformation')

# Register S3 location
lakeformation.register_resource(
    ResourceArn='arn:aws:s3:::my-data-lake',
    UseServiceLinkedRole=True
)

# Grant permissions
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/DataAnalyst'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics',
            'Name': 'sales'
        }
    },
    Permissions=['SELECT']
)

5. What is AWS Lake Formation?

AWS Lake Formation simplifies data lake setup, security, and governance.

Key Features:
- Centralized data catalog
- Fine-grained access control (column/row level)
- Data sharing across accounts
- Blueprint-based ingestion
- Governed tables with ACID

Lake Formation Security Model:

Traditional IAM:
- Coarse-grained (bucket/prefix level)
- Complex policies for multiple tables
- Hard to manage at scale

Lake Formation:
- Fine-grained (column/row level)
- Centralized permissions
- Tag-based access control

# Column-level security
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Analyst'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'hr',
            'Name': 'employees',
            'ColumnNames': ['name', 'department', 'title']
            # Excludes: salary, ssn
        }
    },
    Permissions=['SELECT']
)

# Row-level security with data filters
lakeformation.create_data_cells_filter(
    TableData={
        'TableCatalogId': '123456789012',  # catalog (account) ID is required
        'DatabaseName': 'sales',
        'TableName': 'orders',
        'Name': 'us_only_filter',
        'RowFilter': {
            'FilterExpression': "region = 'US'"
        },
        'ColumnNames': ['order_id', 'customer', 'amount']
    }
)




6. What is Amazon EMR?

Amazon EMR (Elastic MapReduce) is a managed big data platform for processing vast amounts of data using Spark, Hive, Presto, and more.

EMR Cluster Architecture:
├── Master Node
│   ├── Resource management
│   ├── Job coordination
│   └── YARN ResourceManager
├── Core Nodes
│   ├── HDFS storage
│   ├── Task execution
│   └── Data processing
└── Task Nodes (optional)
    ├── Task execution only
    └── No HDFS (spot instances)

EMR Deployment Options:
1. EMR on EC2 - Traditional cluster
2. EMR on EKS - Kubernetes
3. EMR Serverless - No infrastructure

# Create EMR cluster (boto3)
emr = boto3.client('emr')

cluster = emr.run_job_flow(
    Name='MySparkCluster',
    ReleaseLabel='emr-6.10.0',
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hive'}
    ],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'Ec2SubnetId': 'subnet-xxx',
        'Ec2KeyName': 'my-key'
    },
    Steps=[{
        'Name': 'Spark Job',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--class', 'MyApp', 's3://bucket/app.jar']
        }
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    AutoTerminationPolicy={'IdleTimeout': 3600}
)

# EMR Serverless
emr_serverless = boto3.client('emr-serverless')
job = emr_serverless.start_job_run(
    applicationId='app-id',
    executionRoleArn='arn:aws:iam::123456789012:role/EMRServerlessRole',
    jobDriver={
        'sparkSubmit': {
            'entryPoint': 's3://bucket/script.py',
            'sparkSubmitParameters': '--conf spark.executor.memory=4g'
        }
    }
)

7. What is Amazon Kinesis?

Amazon Kinesis is a platform for real-time streaming data processing.

Service                | Purpose                   | Use Case
Kinesis Data Streams   | Real-time data ingestion  | Custom processing
Kinesis Data Firehose  | Data delivery             | Load to S3, Redshift, OpenSearch
Kinesis Data Analytics | Stream processing         | SQL/Flink on streams
Kinesis Video Streams  | Video ingestion           | ML on video

# Kinesis Data Streams - Producer
import boto3
import json

kinesis = boto3.client('kinesis')

def send_to_kinesis(data):
    response = kinesis.put_record(
        StreamName='my-stream',
        Data=json.dumps(data),
        PartitionKey=str(data['customer_id'])  # partition key must be a string
    )
    return response

# Consumer with the KCL (amazon_kclpy v2 interface, simplified)
from amazon_kclpy import kcl

class RecordProcessor(kcl.RecordProcessorBase):
    def process_records(self, process_records_input):
        for record in process_records_input.records:
            data = json.loads(record.binary_data)
            # Process data
            print(f"Processing: {data}")
        process_records_input.checkpointer.checkpoint()

# Kinesis Firehose - Delivery stream
firehose = boto3.client('firehose')
firehose.put_record(
    DeliveryStreamName='my-delivery-stream',
    Record={'Data': json.dumps(event)}
)

# Firehose automatically batches and delivers to:
# - S3, Redshift, OpenSearch, Splunk, HTTP endpoint

8. How do you implement ETL pipelines on AWS?

ETL Pipeline Options:

1. AWS Glue (Serverless)
├── Visual ETL (Glue Studio)
├── PySpark/Scala scripts
├── Bookmarks for incremental
└── Job triggers/workflows

2. Amazon EMR
├── Spark/Hive jobs
├── Step Functions orchestration
└── Spot instances for cost

3. Lambda + Step Functions
├── Small data volumes
├── Event-driven
└── Serverless

# Glue Workflow Example
glue = boto3.client('glue')

# Create workflow
glue.create_workflow(Name='daily-etl')

# Add trigger
glue.create_trigger(
    Name='daily-trigger',
    Type='SCHEDULED',
    Schedule='cron(0 1 * * ? *)',
    Actions=[{'JobName': 'extract-job'}],
    WorkflowName='daily-etl'
)

# Add conditional trigger
glue.create_trigger(
    Name='transform-after-extract',
    Type='CONDITIONAL',
    Predicate={
        'Conditions': [{
            'LogicalOperator': 'EQUALS',
            'JobName': 'extract-job',
            'State': 'SUCCEEDED'
        }]
    },
    Actions=[{'JobName': 'transform-job'}],
    WorkflowName='daily-etl'
)

# Glue Bookmark for incremental load
job.init(args['JOB_NAME'], args)
# Bookmark tracks processed files automatically
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="source_table",
    transformation_ctx="datasource"  # Enable bookmark
)

9. What is AWS Data Pipeline?

AWS Data Pipeline is a web service for orchestrating data movement and transformation. It is deprecated and closed to new customers, so prefer Step Functions, Glue workflows, or Amazon MWAA for new pipelines.

Data Pipeline Components:
├── Pipeline Definition
│   ├── Data Nodes (S3, RDS, Redshift)
│   ├── Activities (Copy, SQL, EMR)
│   └── Schedules
├── Task Runner
│   └── Executes activities
└── Preconditions
    └── Check before execution

# Modern Alternative: Step Functions + Glue
{
  "StartAt": "ExtractData",
  "States": {
    "ExtractData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "extract-job"
      },
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "transform-job"
      },
      "Next": "LoadData"
    },
    "LoadData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "load-job"
      },
      "End": true
    }
  }
}
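A state machine like the one above can be triggered from code with boto3. A minimal sketch (the state machine ARN and input key are illustrative):

```python
import json

def build_start_execution(state_machine_arn, run_date):
    """Build the stepfunctions.start_execution request.

    state_machine_arn and the runDate input key are placeholders.
    """
    return {
        "stateMachineArn": state_machine_arn,
        "input": json.dumps({"runDate": run_date}),
    }

# To actually run it (requires AWS credentials):
# import boto3
# sfn = boto3.client("stepfunctions")
# resp = sfn.start_execution(**build_start_execution(
#     "arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl",
#     "2024-01-15"))
# sfn.describe_execution(executionArn=resp["executionArn"])
```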

10. What is Amazon Athena?

Amazon Athena is a serverless interactive query service that uses SQL to analyze data in S3.

-- Create external table
CREATE EXTERNAL TABLE sales (
    sale_id STRING,
    customer_id STRING,
    amount DOUBLE,
    sale_date DATE
)
PARTITIONED BY (year STRING, month STRING)
STORED AS PARQUET
LOCATION 's3://bucket/sales/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Add partitions
MSCK REPAIR TABLE sales;
-- Or manually
ALTER TABLE sales ADD PARTITION (year='2024', month='01')
LOCATION 's3://bucket/sales/year=2024/month=01/';

-- Query with partition pruning
SELECT customer_id, SUM(amount) as total
FROM sales
WHERE year = '2024' AND month = '01'
GROUP BY customer_id;

-- CTAS (Create Table As Select)
CREATE TABLE aggregated_sales
WITH (
    format = 'PARQUET',
    external_location = 's3://bucket/aggregated/',
    partitioned_by = ARRAY['year']
) AS
SELECT customer_id, SUM(amount) as total, year
FROM sales
GROUP BY customer_id, year;

-- Athena Best Practices:
-- 1. Use columnar formats (Parquet, ORC)
-- 2. Partition data by query patterns
-- 3. Use compression (Snappy, ZSTD)
-- 4. Avoid small files (aim for 128MB+)
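The queries above can also be submitted programmatically. A minimal sketch of building the `start_query_execution` request (database name and results location are placeholders):

```python
def build_query_request(sql, database="analytics",
                        output="s3://bucket/athena-results/"):
    """Build an athena.start_query_execution request.

    database and output location are placeholder values.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output},
    }

# To actually run it (requires AWS credentials):
# import boto3
# athena = boto3.client("athena")
# qid = athena.start_query_execution(**build_query_request(
#     "SELECT customer_id, SUM(amount) FROM sales "
#     "WHERE year='2024' GROUP BY customer_id"))["QueryExecutionId"]
# athena.get_query_execution(QueryExecutionId=qid)  # poll State until SUCCEEDED
```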

11. How do you optimize S3 for data engineering?

S3 Optimization Strategies:

1. File Formats:
├── Parquet - Columnar, great for analytics
├── ORC - Columnar, Hive optimized
├── Avro - Row-based, schema evolution
└── JSON/CSV - Avoid for large datasets

2. File Sizing:
├── Target: 128MB - 1GB per file
├── Too small: Overhead, slow listing
└── Too large: Can't parallelize

3. Partitioning:
s3://bucket/table/year=2024/month=01/day=15/data.parquet

4. S3 Storage Classes:
├── Standard - Frequent access
├── Intelligent-Tiering - Unknown patterns
├── Standard-IA - Infrequent access
├── Glacier - Archive
└── Glacier Deep Archive - Long-term

# Lifecycle policy
{
    "Rules": [{
        "ID": "ArchiveOldData",
        "Status": "Enabled",
        "Filter": {"Prefix": "raw/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"}
        ]
    }]
}
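The lifecycle policy above can be applied with boto3. A small sketch (bucket name and prefix are placeholders):

```python
def build_lifecycle_rules(prefix="raw/"):
    """Lifecycle rules matching the policy above (prefix is illustrative)."""
    return {
        "Rules": [{
            "ID": "ArchiveOldData",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    }

# To actually apply it (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration=build_lifecycle_rules())
```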

5. S3 Select & Glacier Select:
# Query inside objects (note: AWS has stopped onboarding new S3 Select
# customers; prefer Athena for new in-place query workloads)
s3.select_object_content(
    Bucket='bucket',
    Key='data.parquet',
    Expression="SELECT * FROM s3object WHERE amount > 1000",
    ExpressionType='SQL',
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}}
)

12. What is the difference between Glue and EMR?

Aspect        | AWS Glue          | Amazon EMR
Management    | Fully serverless  | Managed cluster
Pricing       | Per DPU-hour      | Per instance-hour
Scaling       | Auto-scaling      | Manual or auto
Frameworks    | Spark, Python     | Spark, Hive, Presto, etc.
Customization | Limited           | Full control
Best For      | ETL jobs, catalog | Complex processing
Startup Time  | Minutes           | 5-10 minutes

Choose Glue when:
- Serverless is preferred
- Standard ETL workloads
- Need Data Catalog
- Predictable workloads

Choose EMR when:
- Need custom frameworks
- Long-running clusters
- Cost optimization with Spot
- Complex ML workloads

13. How do you handle schema evolution in AWS?

Schema Evolution Strategies:

1. Glue Schema Registry
# Register a schema with compatibility enforcement (boto3)
glue = boto3.client('glue')

schema = glue.create_schema(
    RegistryId={'RegistryName': 'my-registry'},
    SchemaName='customer-events',
    DataFormat='AVRO',
    Compatibility='BACKWARD',  # or FORWARD, FULL, NONE
    SchemaDefinition=avro_schema
)

2. Glue Crawler with Schema Changes
glue.update_crawler(
    Name='my-crawler',
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    }
)

3. Delta Lake on S3
from delta import DeltaTable

# Schema evolution with merge
delta_table = DeltaTable.forPath(spark, "s3://bucket/delta/")
delta_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Enable auto-merge for new columns
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

4. Iceberg Tables
CREATE TABLE catalog.db.events (
    id STRING,
    data STRING
) USING iceberg;

-- Add column (schema evolution)
ALTER TABLE catalog.db.events ADD COLUMN new_field STRING;

14. What is AWS Step Functions?

AWS Step Functions orchestrates serverless workflows using state machines.

# Data Pipeline with Step Functions
{
  "Comment": "ETL Pipeline",
  "StartAt": "CheckSourceData",
  "States": {
    "CheckSourceData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-source",
      "Next": "StartGlueJob"
    },
    "StartGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "etl-job",
        "Arguments": {
          "--source_path.$": "$.sourcePath",
          "--target_path.$": "$.targetPath"
        }
      },
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "NotifyFailure"
      }],
      "Next": "ValidateOutput"
    },
    "ValidateOutput": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
      "Next": "JobSucceeded"
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:alerts",
        "Message": "ETL job failed"
      },
      "Next": "JobFailed"
    },
    "JobSucceeded": {
      "Type": "Succeed"
    },
    "JobFailed": {
      "Type": "Fail"
    }
  }
}

15. How do you implement CDC on AWS?

CDC (Change Data Capture) Options:

1. AWS DMS (Database Migration Service)
dms = boto3.client('dms')

# Create replication task for CDC
dms.create_replication_task(
    ReplicationTaskIdentifier='cdc-task',
    SourceEndpointArn='arn:aws:dms:...:endpoint:source',
    TargetEndpointArn='arn:aws:dms:...:endpoint:target',
    ReplicationInstanceArn='arn:aws:dms:...:rep:instance',
    MigrationType='cdc',  # or 'full-load-and-cdc'
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "public",
                "table-name": "%"
            },
            "rule-action": "include"
        }]
    })
)

2. DynamoDB Streams + Lambda
# Lambda triggered by DynamoDB stream
def handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            new_item = record['dynamodb']['NewImage']
        elif record['eventName'] == 'MODIFY':
            old_item = record['dynamodb']['OldImage']
            new_item = record['dynamodb']['NewImage']
        elif record['eventName'] == 'REMOVE':
            old_item = record['dynamodb']['OldImage']
        # Process changes

3. Debezium on MSK
# Debezium captures changes from MySQL/PostgreSQL
# Streams to Kafka (MSK) for downstream processing

4. Zero-ETL Integration
# Aurora to Redshift zero-ETL (near real-time)
# No ETL code required
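A running DMS CDC task should also be monitored. A minimal sketch of a health check over `describe_replication_tasks` output (the task filter value is illustrative):

```python
def is_cdc_healthy(task):
    """Check one describe_replication_tasks entry for a running CDC task."""
    return (task.get("MigrationType") in ("cdc", "full-load-and-cdc")
            and task.get("Status") == "running")

# To actually check (requires AWS credentials):
# import boto3
# dms = boto3.client("dms")
# tasks = dms.describe_replication_tasks(
#     Filters=[{"Name": "replication-task-id", "Values": ["cdc-task"]}]
# )["ReplicationTasks"]
# unhealthy = [t for t in tasks if not is_cdc_healthy(t)]
```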




16. What is Amazon DynamoDB Streams?

DynamoDB Streams captures item-level changes in DynamoDB tables for real-time processing.

# Enable DynamoDB Streams
dynamodb = boto3.client('dynamodb')

dynamodb.update_table(
    TableName='Orders',
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
        # Options: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES
    }
)

# Process with Lambda
def handler(event, context):
    for record in event['Records']:
        event_name = record['eventName']  # INSERT, MODIFY, REMOVE
        
        if event_name == 'INSERT':
            new_item = record['dynamodb']['NewImage']
            # New order created
            process_new_order(new_item)
            
        elif event_name == 'MODIFY':
            old_item = record['dynamodb']['OldImage']
            new_item = record['dynamodb']['NewImage']
            # Order updated - check status change
            if old_item['status'] != new_item['status']:
                notify_status_change(new_item)
                
        elif event_name == 'REMOVE':
            # Order deleted
            old_item = record['dynamodb']['OldImage']
            archive_order(old_item)

# Use cases:
# - Real-time analytics
# - Cross-region replication
# - Materialized views
# - Event-driven architectures

17. How do you secure data on AWS?

Data Security Layers:

1. Encryption at Rest
# S3 encryption
- SSE-S3: S3 managed keys
- SSE-KMS: Customer managed keys
- SSE-C: Customer provided keys

# Enable default encryption
s3.put_bucket_encryption(
    Bucket='my-bucket',
    ServerSideEncryptionConfiguration={
        'Rules': [{
            'ApplyServerSideEncryptionByDefault': {
                'SSEAlgorithm': 'aws:kms',
                'KMSMasterKeyID': 'arn:aws:kms:...'
            }
        }]
    }
)

2. Encryption in Transit
- TLS/SSL for all connections
- VPC endpoints for private access
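A gateway VPC endpoint keeps S3 traffic on the AWS network instead of the public internet. A sketch of building the request (the VPC and route table IDs are placeholders):

```python
def build_s3_endpoint_request(vpc_id, route_table_ids, region="us-east-1"):
    """Build an ec2.create_vpc_endpoint request for an S3 gateway endpoint.

    vpc_id and route_table_ids are placeholder identifiers.
    """
    return {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
        "RouteTableIds": route_table_ids,
    }

# To actually create it (requires AWS credentials):
# import boto3
# boto3.client("ec2").create_vpc_endpoint(
#     **build_s3_endpoint_request("vpc-0abc...", ["rtb-0def..."]))
```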

3. Access Control
# IAM policy for data access
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::bucket/prefix/*",
        "Condition": {
            "StringEquals": {
                "s3:ExistingObjectTag/classification": "public"
            }
        }
    }]
}

4. Lake Formation for fine-grained
# Column-level security
# Row-level security with data filters
# Tag-based access control

5. VPC and Network
# VPC endpoints for S3, Glue, etc.
# Security groups
# NACLs

18. What is AWS Glue Data Catalog?

The Glue Data Catalog is a centralized metadata repository compatible with Apache Hive metastore.

Glue Catalog Hierarchy:
├── Catalog (AWS Account)
│   ├── Database 1
│   │   ├── Table A
│   │   │   ├── Columns
│   │   │   ├── Partitions
│   │   │   └── Properties
│   │   └── Table B
│   └── Database 2
└── Connections

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics',
        'Description': 'Analytics database',
        'LocationUri': 's3://bucket/analytics/'
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics',
    TableInput={
        'Name': 'sales',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'id', 'Type': 'string'},
                {'Name': 'amount', 'Type': 'double'}
            ],
            'Location': 's3://bucket/analytics/sales/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            }
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'string'},
            {'Name': 'month', 'Type': 'string'}
        ],
        'TableType': 'EXTERNAL_TABLE'
    }
)

# Crawler auto-discovers schema
glue.create_crawler(
    Name='sales-crawler',
    Role='GlueServiceRole',
    DatabaseName='analytics',
    Targets={'S3Targets': [{'Path': 's3://bucket/analytics/sales/'}]},
    Schedule='cron(0 1 * * ? *)'
)

19. How do you monitor data pipelines on AWS?

Monitoring Tools:

1. CloudWatch Metrics
# Glue job metrics (enable job metrics on the job to emit these)
- glue.driver.jvm.heap.usage
- glue.ALL.jvm.heap.usage
- glue.driver.aggregate.elapsedTime

# Custom metrics from a job (no dedicated Glue metrics SDK; use CloudWatch)
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='ETL/Custom',
    MetricData=[{'MetricName': 'records_processed', 'Value': 1000}]
)

2. CloudWatch Logs
# Glue continuous logging
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Logs automatically streamed to CloudWatch

3. CloudWatch Alarms
cloudwatch.put_metric_alarm(
    AlarmName='GlueJobFailure',
    MetricName='glue.driver.aggregate.numFailedTasks',
    Namespace='Glue',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:...']
)

4. EventBridge for job events
{
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "state": ["FAILED", "TIMEOUT"]
    }
}
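The event pattern above can be registered as an EventBridge rule from code. A sketch (the rule name and SNS target are assumptions):

```python
import json

def build_glue_failure_rule():
    """Build an events.put_rule request matching failed/timed-out Glue jobs.

    The rule name 'glue-job-failures' is a placeholder.
    """
    pattern = {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }
    return {"Name": "glue-job-failures",
            "EventPattern": json.dumps(pattern)}

# To actually register it (requires AWS credentials):
# import boto3
# events = boto3.client("events")
# events.put_rule(**build_glue_failure_rule())
# events.put_targets(Rule="glue-job-failures", Targets=[
#     {"Id": "alerts", "Arn": "arn:aws:sns:us-east-1:123456789012:alerts"}])
```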

5. Data Quality Monitoring
# Glue Data Quality
dqdl_rules = """
    Rules = [
        RowCount > 1000,
        IsComplete "customer_id",
        ColumnValues "amount" > 0
    ]
"""

20. What are best practices for AWS Data Engineering?

1. Architecture:
- Use medallion architecture (raw/staged/curated)
- Implement data lake with Lake Formation
- Separate storage and compute

2. Data Storage:
# Best practices
- Use Parquet/ORC for analytics
- Partition by query patterns
- Target 128MB-1GB file sizes
- Implement lifecycle policies
- Enable versioning for critical data

3. Processing:
- Use serverless (Glue, Athena) when possible
- Implement incremental processing
- Use bookmarks/watermarks
- Handle schema evolution

4. Security:
- Encrypt at rest and in transit
- Use Lake Formation for fine-grained access
- Implement least privilege
- Use VPC endpoints

5. Cost Optimization:
# Cost strategies
- Use S3 Intelligent-Tiering
- Compress data (Snappy, ZSTD)
- Use Spot instances for EMR
- Right-size Glue DPUs
- Partition pruning in queries

6. Operations:
- Implement comprehensive monitoring
- Set up alerting for failures
- Use Infrastructure as Code
- Document data lineage
- Implement data quality checks

