

Top AWS EMR and Glue Interview Questions (2026) | JavaInuse

Top 20 AWS EMR and Glue Interview Questions


  1. What is Amazon EMR?
  2. What is EMR architecture?
  3. What are EMR cluster types?
  4. What is AWS Glue?
  5. What are Glue components?
  6. How do you create a Glue ETL job?
  7. What is the Glue Data Catalog?
  8. What are Glue crawlers?
  9. How do you optimize Glue jobs?
  10. What is EMR on EKS?
  11. What is EMR Serverless?
  12. How do you configure EMR applications?
  13. What are EMR instance fleets?
  14. How do you handle Spark on EMR?
  15. What are Glue job bookmarks?
  16. What are Glue workflows?
  17. How do you implement Glue DataBrew?
  18. What is Glue Streaming?
  19. How do you monitor EMR and Glue?
  20. What are EMR and Glue best practices?

1. What is Amazon EMR?

Amazon EMR (Elastic MapReduce) is a managed cluster platform for running big data frameworks like Apache Spark, Hive, Presto, and Hadoop.

EMR Features:
├── Managed Hadoop ecosystem
├── Auto-scaling capabilities
├── Spot instance support
├── Integration with S3 (EMRFS)
├── Multiple deployment options
└── Cost-effective big data processing

Supported Frameworks:
├── Apache Spark
├── Apache Hive
├── Presto/Trino
├── Apache Flink
├── Apache HBase
├── Apache Hadoop
└── Apache Hudi, Delta Lake, Iceberg

# Create EMR cluster via CLI
aws emr create-cluster \
    --name "My Spark Cluster" \
    --release-label emr-7.0.0 \
    --applications Name=Spark Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes SubnetId=subnet-xxx

2. What is EMR architecture?

EMR Cluster Architecture:

┌─────────────────────────────────────────────────────┐
│                  EMR Cluster                         │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐                  │
│  │ Master Node │   │ Master Node │  (HA optional)   │
│  │  - YARN RM  │   │  - Standby  │                  │
│  │  - Hive     │   │             │                  │
│  │  - Spark    │   │             │                  │
│  └─────────────┘   └─────────────┘                  │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐   ┌───────────┐ │
│  │ Core Node 1 │   │ Core Node 2 │   │Core Node 3│ │
│  │  - HDFS     │   │  - HDFS     │   │  - HDFS   │ │
│  │  - YARN NM  │   │  - YARN NM  │   │  - YARN NM│ │
│  └─────────────┘   └─────────────┘   └───────────┘ │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐   (Auto-scales) │
│  │ Task Node 1 │   │ Task Node 2 │                  │
│  │  - YARN NM  │   │  - YARN NM  │                  │
│  │  - No HDFS  │   │  - No HDFS  │                  │
│  └─────────────┘   └─────────────┘                  │
└─────────────────────────────────────────────────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │    Amazon S3        │
              │  (EMRFS - Storage)  │
              └─────────────────────┘

Node Types:
├── Master: Cluster coordination, resource management
├── Core: HDFS storage + computation
└── Task: Computation only (no HDFS, Spot-friendly)
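Because only core nodes host HDFS, on-cluster storage planning depends on the replication factor. EMR's default dfs.replication scales with core-node count (1 for fewer than 4 core nodes, 2 for 4-9, 3 for 10 or more). A quick sizing sketch; the helper names are illustrative, not an AWS API:

```python
def emr_default_replication(core_nodes: int) -> int:
    """EMR's default dfs.replication based on core-node count."""
    if core_nodes < 4:
        return 1
    if core_nodes < 10:
        return 2
    return 3

def hdfs_capacity_needed_gb(data_gb: int, core_nodes: int) -> int:
    """Raw HDFS capacity required to hold data_gb at the default replication."""
    return data_gb * emr_default_replication(core_nodes)

print(hdfs_capacity_needed_gb(500, 3))   # 500 (replication 1)
print(hdfs_capacity_needed_gb(500, 10))  # 1500 (replication 3)
```

This is one reason the architecture keeps durable data on S3 via EMRFS: HDFS capacity (and its replication overhead) only has to cover intermediate data.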

3. What are EMR cluster types?

Type             Description                     Use Case
EMR on EC2       Traditional managed clusters    Full control, long-running
EMR on EKS       Run on Kubernetes               Container orchestration
EMR Serverless   No infrastructure management    Variable workloads
EMR on Outposts  Run on-premises                 Data residency requirements

Cluster Modes:

1. Long-running cluster
# Persistent, for interactive queries
aws emr create-cluster \
    --keep-job-flow-alive-when-no-steps \
    ...

2. Transient cluster
# Terminates after steps complete
aws emr create-cluster \
    --auto-terminate \
    --steps Type=Spark,Name=MyJob,Args=[...] \
    ...

3. Instance Groups vs Instance Fleets
# Instance Groups: Same instance type per group
# Instance Fleets: Mix of instance types (cost optimization)

aws emr create-cluster \
    --instance-fleets '[
        {
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]
        },
        {
            "InstanceFleetType": "CORE",
            "TargetSpotCapacity": 4,
            "InstanceTypeConfigs": [
                {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
                {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2}
            ]
        }
    ]'

4. What is AWS Glue?

AWS Glue is a fully managed ETL (Extract, Transform, Load) service with serverless infrastructure.

Glue Features:
├── Serverless ETL engine (Spark-based)
├── Data Catalog (Hive metastore compatible)
├── Crawlers (schema discovery)
├── Visual ETL (Glue Studio)
├── Job bookmarks (incremental processing)
├── Workflows (orchestration)
└── DataBrew (no-code data prep)

Glue Components:
┌────────────────────────────────────────────────────┐
│                   AWS Glue                          │
├────────────────┬───────────────┬───────────────────┤
│  Data Catalog  │  ETL Engine   │   Orchestration   │
│  ┌──────────┐  │  ┌─────────┐  │  ┌─────────────┐  │
│  │Databases │  │  │Glue Jobs│  │  │ Workflows   │  │
│  │Tables    │  │  │(Spark)  │  │  │ Triggers    │  │
│  │Crawlers  │  │  │Streaming│  │  │ Schedules   │  │
│  └──────────┘  │  └─────────┘  │  └─────────────┘  │
└────────────────┴───────────────┴───────────────────┘

Pricing:
├── ETL Jobs: DPU-hours (Data Processing Units)
├── Data Catalog: Free up to 1M objects
├── Crawlers: DPU-hours
└── DataBrew: Sessions (interactive) + jobs
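DPU-hour billing is easy to estimate by hand. A rough cost calculator, assuming per-second billing with a 1-minute minimum (Glue 2.0+) and a placeholder rate of $0.44 per DPU-hour (the actual rate varies by region):

```python
def glue_job_cost(dpus: int, runtime_seconds: int,
                  price_per_dpu_hour: float = 0.44) -> float:
    """Estimate ETL job cost: per-second billing, 1-minute minimum (Glue 2.0+)."""
    billed_seconds = max(runtime_seconds, 60)
    return dpus * (billed_seconds / 3600) * price_per_dpu_hour

# 10 DPUs running for 15 minutes:
print(round(glue_job_cost(10, 15 * 60), 2))  # 1.1
```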

5. What are Glue components?

Core Components:

1. Data Catalog
# Centralized metadata repository
# Hive metastore compatible
# Used by Athena, EMR, Redshift

2. Databases and Tables
import boto3
glue = boto3.client('glue')

glue.create_database(
    DatabaseInput={'Name': 'my_database'}
)

glue.create_table(
    DatabaseName='my_database',
    TableInput={
        'Name': 'my_table',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'id', 'Type': 'string'},
                {'Name': 'name', 'Type': 'string'}
            ],
            'Location': 's3://bucket/path/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde'}
        }
    }
)

3. Connections
# Database connections (JDBC)
# Network configuration

4. Crawlers
# Auto-discover schemas
# Populate Data Catalog

5. Jobs
# ETL processing (Spark, Python Shell)
# Visual or code-based

6. Triggers
# Schedule or event-based execution

6. How do you create a Glue ETL job?

# Glue ETL Job (PySpark)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="source_table"
)

# Or read from S3
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/input/"]},
    format="parquet"
)

# Transform
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("old_col", "string", "new_col", "string"),
        ("amount", "double", "amount", "double")
    ]
)

# Filter
filtered = Filter.apply(
    frame=mapped,
    f=lambda x: x["amount"] > 100
)

# Write to S3
glueContext.write_dynamic_frame.from_options(
    frame=filtered,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/"},
    format="parquet"
)

job.commit()




7. What is the Glue Data Catalog?

The Glue Data Catalog is a centralized metadata repository compatible with Apache Hive metastore.

Data Catalog Structure:
├── Catalog (Account level)
│   ├── Database 1
│   │   ├── Table A
│   │   │   ├── Columns
│   │   │   ├── Partitions
│   │   │   └── Properties
│   │   └── Table B
│   └── Database 2
│       └── Table C

# Query catalog
tables = glue.get_tables(DatabaseName='my_database')
for table in tables['TableList']:
    print(f"Table: {table['Name']}")
    for col in table['StorageDescriptor']['Columns']:
        print(f"  - {col['Name']}: {col['Type']}")

# Partition management
glue.create_partition(
    DatabaseName='my_database',
    TableName='my_table',
    PartitionInput={
        'Values': ['2024', '01', '15'],
        'StorageDescriptor': {
            'Location': 's3://bucket/data/year=2024/month=01/day=15/',
            ...
        }
    }
)

# Batch create partitions
glue.batch_create_partition(
    DatabaseName='my_database',
    TableName='my_table',
    PartitionInputList=[...]
)
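Hand-writing PartitionInput entries is error-prone because the Values list and the Hive-style S3 location must stay in sync. A small helper (hypothetical, not part of boto3) that derives the location from the partition keys; a real entry would also carry the format/serde fields elided above:

```python
def partition_input(base_location: str, keys, values) -> dict:
    """Build one PartitionInput dict with a Hive-style S3 location."""
    suffix = "/".join(f"{k}={v}" for k, v in zip(keys, values))
    return {
        "Values": list(values),
        "StorageDescriptor": {
            # format/serde fields omitted for brevity
            "Location": f"{base_location.rstrip('/')}/{suffix}/",
        },
    }

p = partition_input("s3://bucket/data", ["year", "month", "day"], ["2024", "01", "15"])
print(p["StorageDescriptor"]["Location"])
# s3://bucket/data/year=2024/month=01/day=15/
```

A list comprehension over such helpers is a convenient way to build the PartitionInputList for batch_create_partition.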

Integration:
├── Athena: Query tables directly
├── EMR: Use as Hive metastore
├── Redshift Spectrum: External tables
├── Lake Formation: Fine-grained access
└── Data Quality: Define rules

8. What are Glue crawlers?

Crawlers automatically discover schema and update the Data Catalog.

# Create crawler
import json  # needed below for the Configuration string

glue.create_crawler(
    Name='my-crawler',
    Role='GlueServiceRole',
    DatabaseName='my_database',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://bucket/data/',
                'Exclusions': ['*.tmp', 'temp/*']
            }
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'my-jdbc-connection',
                'Path': 'database/schema/table'
            }
        ]
    },
    Schedule='cron(0 1 * * ? *)',  # Daily at 1 AM
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    RecrawlPolicy={
        'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'
    },
    Configuration=json.dumps({
        'Version': 1.0,
        'CrawlerOutput': {
            'Partitions': {'AddOrUpdateBehavior': 'InheritFromTable'}
        }
    })
)

# Run crawler
glue.start_crawler(Name='my-crawler')

# Crawler behaviors:
├── Detect new partitions
├── Infer schema from data
├── Create/update tables
├── Handle schema evolution
└── Support multiple data stores

9. How do you optimize Glue jobs?

Optimization Strategies:

1. Right-size DPUs
# Standard: legacy worker type (2-10 DPUs typical for small jobs)
# G.1X: 1 DPU per worker (4 vCPU, 16 GB memory) - default
# G.2X: 2 DPUs per worker (8 vCPU, 32 GB memory) - memory-intensive jobs

2. Enable job metrics
glue.create_job(
    Name='my-job',
    ...
    DefaultArguments={
        '--enable-metrics': 'true',
        '--enable-spark-ui': 'true',
        '--spark-event-logs-path': 's3://bucket/spark-logs/'
    }
)

3. Partition pruning
# Filter on partition columns
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    push_down_predicate="year='2024' and month='01'"
)

4. Use push-down predicates
# Filter at source, not in Spark
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": ["s3://bucket/data/"],
        "recurse": True
    },
    additional_options={
        "filterPredicate": "amount > 100"
    }
)

5. Optimize file sizes
# Avoid small files (< 128MB)
# Use coalesce/repartition
df = datasource.toDF()
df.coalesce(10).write.parquet("s3://bucket/output/")

6. Use Spark DataFrame when needed
df = datasource.toDF()  # Convert to DataFrame
# Perform complex transformations
dynamic_frame = DynamicFrame.fromDF(df, glueContext)

10. What is EMR on EKS?

EMR on EKS runs EMR jobs on Amazon Elastic Kubernetes Service clusters.

Benefits:
├── Shared infrastructure with other apps
├── Kubernetes-native management
├── Faster startup than EC2
├── Fine-grained resource control
└── Multi-tenant clusters

Architecture:
┌─────────────────────────────────────────────┐
│              Amazon EKS Cluster             │
├─────────────────────────────────────────────┤
│  ┌─────────────┐  ┌──────────────────────┐  │
│  │ EMR Virtual │  │   Other Workloads    │  │
│  │   Cluster   │  │   (Microservices)    │  │
│  │  ┌───────┐  │  │                      │  │
│  │  │ Spark │  │  │                      │  │
│  │  │  Job  │  │  │                      │  │
│  │  └───────┘  │  │                      │  │
│  └─────────────┘  └──────────────────────┘  │
└─────────────────────────────────────────────┘

# Create virtual cluster
aws emr-containers create-virtual-cluster \
    --name my-virtual-cluster \
    --container-provider '{
        "id": "eks-cluster-id",
        "type": "EKS",
        "info": {
            "eksInfo": {"namespace": "emr"}
        }
    }'

# Submit job
aws emr-containers start-job-run \
    --virtual-cluster-id vc-xxx \
    --name spark-job \
    --execution-role-arn arn:aws:iam::xxx:role/EMRJobRole \
    --release-label emr-6.9.0-latest \
    --job-driver '{
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://bucket/script.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4G"
        }
    }'

11. What is EMR Serverless?

EMR Serverless provides serverless Spark and Hive without managing infrastructure.

EMR Serverless Features:
├── No cluster management
├── Auto-scaling workers
├── Pre-initialized capacity option
├── Pay per use (vCPU, memory, storage)
└── Supports Spark and Hive

# Create application
emr_serverless = boto3.client('emr-serverless')

app = emr_serverless.create_application(
    name='my-spark-app',
    releaseLabel='emr-6.9.0',
    type='SPARK',
    initialCapacity={
        'Driver': {
            'workerCount': 1,
            'workerConfiguration': {
                'cpu': '4vCPU',
                'memory': '16GB'
            }
        },
        'Executor': {
            'workerCount': 10,
            'workerConfiguration': {
                'cpu': '4vCPU',
                'memory': '16GB'
            }
        }
    },
    maximumCapacity={
        'cpu': '200vCPU',
        'memory': '800GB'
    }
)

# Start job
job = emr_serverless.start_job_run(
    applicationId=app['applicationId'],
    executionRoleArn='arn:aws:iam::xxx:role/EMRServerlessRole',
    jobDriver={
        'sparkSubmit': {
            'entryPoint': 's3://bucket/script.py',
            'sparkSubmitParameters': '--conf spark.executor.cores=4'
        }
    },
    configurationOverrides={
        'monitoringConfiguration': {
            's3MonitoringConfiguration': {
                'logUri': 's3://bucket/logs/'
            }
        }
    }
)

12. How do you configure EMR applications?

# Configuration via CLI
aws emr create-cluster \
    --configurations '[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.executor.memory": "8g",
                "spark.executor.cores": "4",
                "spark.dynamicAllocation.enabled": "true"
            }
        },
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": 
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
        },
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.enableServerSideEncryption": "true"
            }
        }
    ]'

# Bootstrap actions
aws emr create-cluster \
    --bootstrap-actions '[
        {
            "Name": "Install packages",
            "ScriptBootstrapAction": {
                "Path": "s3://bucket/bootstrap.sh",
                "Args": ["arg1", "arg2"]
            }
        }
    ]'

# bootstrap.sh example
#!/bin/bash
sudo pip install pandas numpy
sudo yum install -y htop

# Steps (jobs)
aws emr add-steps \
    --cluster-id j-xxx \
    --steps '[
        {
            "Type": "Spark",
            "Name": "My Spark Job",
            "ActionOnFailure": "CONTINUE",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://bucket/script.py"
            ]
        }
    ]'

13. What are EMR instance fleets?

Instance Fleets allow mixing instance types for cost optimization and capacity availability.

Instance Fleets vs Instance Groups:

Instance Groups:
└── One instance type per group
└── Simpler but less flexible

Instance Fleets:
└── Multiple instance types
└── Spot allocation strategy
└── On-Demand/Spot mix
└── Better cost optimization

# Instance Fleet configuration
{
    "InstanceFleets": [
        {
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [
                {"InstanceType": "m5.xlarge", "WeightedCapacity": 1}
            ]
        },
        {
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,
            "TargetSpotCapacity": 8,
            "InstanceTypeConfigs": [
                {"InstanceType": "m5.xlarge", "WeightedCapacity": 1, "BidPriceAsPercentageOfOnDemandPrice": 100},
                {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2, "BidPriceAsPercentageOfOnDemandPrice": 100},
                {"InstanceType": "r5.xlarge", "WeightedCapacity": 1, "BidPriceAsPercentageOfOnDemandPrice": 100}
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 60,
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                    "AllocationStrategy": "capacity-optimized"
                }
            }
        },
        {
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": 20,
            "InstanceTypeConfigs": [...]
        }
    ]
}

Spot Allocation Strategies:
├── capacity-optimized: Lowest interruption probability
├── price-capacity-optimized: Balance price and capacity
└── lowest-price: Cheapest (higher interruption risk)
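The TargetSpotCapacity/WeightedCapacity math determines how many instances EMR actually launches: it provisions until the summed weights meet the target, possibly overshooting when the weight doesn't divide the target evenly. A per-instance-type sketch (fleets mix types, so real allocations combine several of these):

```python
import math

def instances_to_provision(target_capacity: int, weighted_capacity: int) -> int:
    """Instances needed so summed WeightedCapacity >= target capacity."""
    return math.ceil(target_capacity / weighted_capacity)

# TargetSpotCapacity=4 filled entirely with m5.2xlarge (WeightedCapacity=2):
print(instances_to_provision(4, 2))  # 2
# Same target filled with m5.xlarge (WeightedCapacity=1):
print(instances_to_provision(4, 1))  # 4
```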

14. How do you handle Spark on EMR?

# Spark Submit
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 8g \
    --executor-cores 4 \
    --num-executors 10 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    s3://bucket/script.py

# PySpark script
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read from S3
df = spark.read.parquet("s3://bucket/input/")

# Transformations
result = df \
    .filter(df.amount > 100) \
    .groupBy("category") \
    .agg({"amount": "sum"})

# Write with partitioning
result.write \
    .partitionBy("category") \
    .mode("overwrite") \
    .parquet("s3://bucket/output/")

# Spark optimizations for EMR
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.speculation", "true")
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

# Use EMRFS optimized committer
spark.conf.set("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")

15. What are Glue job bookmarks?

Job bookmarks enable incremental processing by tracking processed data.

# Enable job bookmarks in job creation
glue.create_job(
    Name='incremental-job',
    ...
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-enable'
    }
)

# In job script
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'job-bookmark-option'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read with bookmark awareness
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="source_table",
    transformation_ctx="datasource"  # Required for bookmarks
)

# Process data...

# Write with bookmark awareness
glueContext.write_dynamic_frame.from_options(
    frame=result,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/"},
    format="parquet",
    transformation_ctx="output"  # Required for bookmarks
)

job.commit()  # Saves bookmark state

# Bookmark options:
├── job-bookmark-enable: Track and skip processed data
├── job-bookmark-disable: Process all data
└── job-bookmark-pause: Don't update bookmark

# Reset bookmark
glue.reset_job_bookmark(JobName='incremental-job')

16. What are Glue workflows?

Workflows orchestrate multiple crawlers and jobs with dependencies.

# Create workflow
glue.create_workflow(
    Name='etl-workflow',
    Description='Daily ETL pipeline'
)

# Add triggers
# 1. Schedule trigger (starts workflow)
glue.create_trigger(
    Name='daily-schedule',
    WorkflowName='etl-workflow',
    Type='SCHEDULED',
    Schedule='cron(0 1 * * ? *)',
    Actions=[{'CrawlerName': 'source-crawler'}]
)

# 2. Conditional trigger (after crawler)
glue.create_trigger(
    Name='after-crawler',
    WorkflowName='etl-workflow',
    Type='CONDITIONAL',
    Predicate={
        'Conditions': [
            {
                'LogicalOperator': 'EQUALS',
                'CrawlerName': 'source-crawler',
                'CrawlState': 'SUCCEEDED'
            }
        ]
    },
    Actions=[{'JobName': 'transform-job'}]
)

# 3. After transform job
glue.create_trigger(
    Name='after-transform',
    WorkflowName='etl-workflow',
    Type='CONDITIONAL',
    Predicate={
        'Logical': 'AND',
        'Conditions': [
            {'JobName': 'transform-job', 'State': 'SUCCEEDED'}
        ]
    },
    Actions=[
        {'JobName': 'load-job'},
        {'CrawlerName': 'output-crawler'}
    ]
)

# Activate triggers
glue.start_trigger(Name='daily-schedule')

# Manual workflow run
glue.start_workflow_run(Name='etl-workflow')




17. How do you implement Glue DataBrew?

Glue DataBrew is a visual data preparation tool for cleaning and normalizing data without code.

DataBrew Components:
├── Datasets: Source data definitions
├── Projects: Interactive data exploration
├── Recipes: Saved transformations
└── Jobs: Apply recipes to datasets

# Create dataset
databrew = boto3.client('databrew')

databrew.create_dataset(
    Name='sales-dataset',
    Input={
        'S3InputDefinition': {
            'Bucket': 'my-bucket',
            'Key': 'data/sales.csv'
        }
    },
    FormatOptions={
        'Csv': {
            'Delimiter': ',',
            'HeaderRow': True
        }
    }
)

# Create project (for interactive work)
databrew.create_project(
    Name='sales-project',
    DatasetName='sales-dataset',
    RecipeName='sales-recipe',
    RoleArn='arn:aws:iam::xxx:role/DataBrewRole'
)

# Create recipe job (apply transformations)
databrew.create_recipe_job(
    Name='sales-job',
    DatasetName='sales-dataset',
    RecipeReference={
        'Name': 'sales-recipe',
        'RecipeVersion': 'LATEST_PUBLISHED'
    },
    RoleArn='arn:aws:iam::xxx:role/DataBrewRole',
    Outputs=[
        {
            'Location': {'Bucket': 'my-bucket', 'Key': 'output/'},
            'Format': 'PARQUET',
            'Overwrite': True
        }
    ]
)

# Common transformations:
├── Column operations (rename, delete, type change)
├── Filtering and aggregation
├── String manipulation
├── Date/time parsing
├── Data validation and cleaning
└── Statistical functions

18. What is Glue Streaming?

Glue Streaming ETL enables continuous processing of streaming data from Kinesis or Kafka.

# Glue Streaming Job
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Kinesis
kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-1:xxx:stream/my-stream",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "true",
    "classification": "json"
}

source_df = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options=kinesis_options,
    transformation_ctx="source"
)

# Or from Kafka
kafka_options = {
    "connectionName": "my-kafka-connection",
    "topicName": "my-topic",
    "startingOffsets": "earliest",
    "inferSchema": "true",
    "classification": "json"
}

# Process streaming data
def process_batch(data_frame, batch_id):
    if data_frame.count() > 0:
        # Transform
        transformed = data_frame.filter(col("amount") > 100)
        
        # Write to S3
        transformed.write \
            .mode("append") \
            .parquet("s3://bucket/streaming-output/")

# Write stream
glueContext.forEachBatch(
    frame=source_df,
    batch_function=process_batch,
    options={"windowSize": "60 seconds", "checkpointLocation": "s3://bucket/checkpoint/"}
)

job.commit()

19. How do you monitor EMR and Glue?

EMR Monitoring:

1. CloudWatch Metrics
├── YARNMemoryAvailablePercentage
├── AppsRunning, AppsPending
├── CoreNodesPending, TaskNodesPending
├── HDFSUtilization
└── IsIdle

2. Spark UI
# Access via SSH tunnel or Application UIs
http://master-dns:18080  # Spark History Server
http://master-dns:8088   # YARN ResourceManager

3. CloudWatch Logs
# Enable in cluster configuration
"Classification": "spark-log4j",
"Properties": {
    "log4j.appender.DRFA": "org.apache.log4j.DailyRollingFileAppender"
}
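The IsIdle metric above is commonly wired to a CloudWatch alarm that terminates idle clusters. A minimal decision helper over sampled metric values (names and the 5-minute sampling assumption are illustrative):

```python
def should_terminate(is_idle_samples, consecutive: int = 6) -> bool:
    """True when the last `consecutive` IsIdle samples (assume one per
    5-minute CloudWatch period) all report the cluster as idle."""
    if len(is_idle_samples) < consecutive:
        return False
    return all(is_idle_samples[-consecutive:])

print(should_terminate([0, 0, 1, 1, 1, 1, 1, 1]))  # True  (idle ~30 min)
print(should_terminate([1, 1, 1, 0, 1, 1]))        # False (work resumed)
```

EMR also offers a built-in auto-termination policy for idle clusters, which avoids running this logic yourself.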

Glue Monitoring:

1. Job Metrics
├── glue.driver.aggregate.numCompletedTasks
├── glue.driver.aggregate.numFailedTasks
├── glue.driver.BlockManager.memory.memUsed_MB
└── glue.driver.ExecutorAllocationManager.executors.numberAllExecutors

2. Enable detailed monitoring
DefaultArguments={
    '--enable-metrics': 'true',
    '--enable-continuous-cloudwatch-log': 'true',
    '--enable-spark-ui': 'true'
}

3. Job run insights
# Check job runs
runs = glue.get_job_runs(JobName='my-job')
for run in runs['JobRuns']:
    print(f"Status: {run['JobRunState']}")
    print(f"Duration: {run['ExecutionTime']} seconds")
    if 'ErrorMessage' in run:
        print(f"Error: {run['ErrorMessage']}")

20. What are EMR and Glue best practices?

EMR Best Practices:
1. Use Instance Fleets with Spot
-- Mix instance types
-- capacity-optimized allocation
-- Task nodes for Spot (no HDFS)

2. Use EMRFS (S3) over HDFS
-- Separate compute and storage
-- Persistent data on S3
-- Use S3 for output

3. Enable auto-scaling
{
    "Constraints": {
        "MinCapacity": 2,
        "MaxCapacity": 20
    },
    "Rules": [{
        "Name": "ScaleOut",
        "Action": {"SimpleScalingPolicyConfiguration": {"AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": 2}},
        "Trigger": {"CloudWatchAlarmDefinition": {"MetricName": "YARNMemoryAvailablePercentage", "ComparisonOperator": "LESS_THAN", "Threshold": 20}}
    }]
}

4. Optimize Spark configurations
-- spark.sql.adaptive.enabled=true
-- spark.dynamicAllocation.enabled=true
-- spark.speculation=true

Glue Best Practices:
1. Right-size DPUs
-- Start small, increase if needed
-- Use G.2X for compute-intensive

2. Use job bookmarks for incremental
-- Avoid reprocessing
-- Use transformation_ctx

3. Partition data effectively
-- Use partition predicates
-- push_down_predicate parameter

4. Optimize output
-- Target file sizes 128MB-1GB
-- Use Parquet/ORC formats
-- Coalesce before writing
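A rough way to pick the coalesce count for the 128MB-1GB file-size target above (a sketch; tune target_mb per format, compression, and query pattern):

```python
import math

def coalesce_partitions(total_bytes: int, target_mb: int = 256) -> int:
    """Partition count that yields roughly target_mb per output file."""
    return max(1, math.ceil(total_bytes / (target_mb * 1024 * 1024)))

# ~10 GB of output with 256 MB target files:
print(coalesce_partitions(10 * 1024**3))  # 40
```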

Common Guidelines:
- Use Glue Data Catalog for metadata
- Monitor costs with tags
- Enable encryption (at rest, in transit)
- Use VPC for network isolation
- Implement proper IAM roles

