Top 20 AWS EMR and Glue Interview Questions
- What is Amazon EMR?
- What is EMR architecture?
- What are EMR cluster types?
- What is AWS Glue?
- What are Glue components?
- How do you create a Glue ETL job?
- What is the Glue Data Catalog?
- What are Glue crawlers?
- How do you optimize Glue jobs?
- What is EMR on EKS?
- What is EMR Serverless?
- How do you configure EMR applications?
- What are EMR instance fleets?
- How do you handle Spark on EMR?
- What are Glue job bookmarks?
- What are Glue workflows?
- How do you implement Glue DataBrew?
- What is Glue Streaming?
- How do you monitor EMR and Glue?
- What are EMR and Glue best practices?
1. What is Amazon EMR?
Amazon EMR (Elastic MapReduce) is a managed cluster platform for running big data frameworks like Apache Spark, Hive, Presto, and Hadoop.
EMR Features:
├── Managed Hadoop ecosystem
├── Auto-scaling capabilities
├── Spot instance support
├── Integration with S3 (EMRFS)
├── Multiple deployment options
└── Cost-effective big data processing
Supported Frameworks:
├── Apache Spark
├── Apache Hive
├── Presto/Trino
├── Apache Flink
├── Apache HBase
├── Apache Hadoop
└── Apache Hudi, Delta Lake, Iceberg
# Create EMR cluster via CLI
aws emr create-cluster \
--name "My Spark Cluster" \
--release-label emr-7.0.0 \
--applications Name=Spark Name=Hive \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles \
--ec2-attributes SubnetId=subnet-xxx
2. What is EMR architecture?
EMR Cluster Architecture:
┌─────────────────────────────────────────────────────┐
│                     EMR Cluster                     │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐  (HA optional)   │
│  │ Master Node │   │ Master Node │                  │
│  │ - YARN RM   │   │ - Standby   │                  │
│  │ - Hive      │   │             │                  │
│  │ - Spark     │   │             │                  │
│  └─────────────┘   └─────────────┘                  │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │
│  │ Core Node 1 │ │ Core Node 2 │ │ Core Node 3 │    │
│  │ - HDFS      │ │ - HDFS      │ │ - HDFS      │    │
│  │ - YARN NM   │ │ - YARN NM   │ │ - YARN NM   │    │
│  └─────────────┘ └─────────────┘ └─────────────┘    │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐ ┌─────────────┐  (Auto-scales)     │
│  │ Task Node 1 │ │ Task Node 2 │                    │
│  │ - YARN NM   │ │ - YARN NM   │                    │
│  │ - No HDFS   │ │ - No HDFS   │                    │
│  └─────────────┘ └─────────────┘                    │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
              ┌─────────────────────┐
              │      Amazon S3      │
              │  (EMRFS - Storage)  │
              └─────────────────────┘
Node Types:
├── Master: Cluster coordination, resource management
├── Core: HDFS storage + computation
└── Task: Computation only (no HDFS, Spot-friendly)
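Because task nodes hold no HDFS data, they can be added or removed without risking data loss, which makes them a natural fit for Spot capacity. A minimal boto3 sketch of adding a Spot task instance group to a running cluster (the cluster ID is a placeholder):
import boto3

emr = boto3.client('emr')

# Add a Spot-backed task instance group; task nodes run only YARN
# NodeManagers, so reclaimed Spot instances never lose HDFS blocks
emr.add_instance_groups(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster ID
    InstanceGroups=[{
        'Name': 'spot-task-nodes',
        'InstanceRole': 'TASK',
        'InstanceType': 'm5.xlarge',
        'InstanceCount': 2,
        'Market': 'SPOT'
    }]
)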
3. What are EMR cluster types?
| Type | Description | Use Case |
|---|---|---|
| EMR on EC2 | Traditional managed clusters | Full control, long-running |
| EMR on EKS | Run on Kubernetes | Container orchestration |
| EMR Serverless | No infrastructure management | Variable workloads |
| EMR on Outposts | Run on-premises | Data residency requirements |
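With EMR on EKS, jobs run in a virtual cluster mapped to an EKS namespace instead of on dedicated EC2 nodes. A minimal boto3 sketch, assuming the virtual cluster, execution role, and entry-point script already exist (all IDs, ARNs, and paths below are placeholders):
import boto3

emr_containers = boto3.client('emr-containers')

# Submit a Spark job to a virtual cluster backed by an EKS namespace
emr_containers.start_job_run(
    name='my-eks-job',
    virtualClusterId='abcdef1234567890',  # placeholder virtual cluster ID
    executionRoleArn='arn:aws:iam::123456789012:role/EMRContainersRole',  # placeholder
    releaseLabel='emr-7.0.0-latest',
    jobDriver={
        'sparkSubmitJobDriver': {
            'entryPoint': 's3://bucket/scripts/job.py'  # placeholder script
        }
    }
)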
Cluster Modes:
1. Long-running cluster
# Persistent, for interactive queries
aws emr create-cluster \
--no-auto-terminate \
...
2. Transient cluster
# Terminates after steps complete
aws emr create-cluster \
--auto-terminate \
--steps Type=Spark,Name=MyJob,Args=[...] \
...
3. Instance Groups vs Instance Fleets
# Instance Groups: Same instance type per group
# Instance Fleets: Mix of instance types (cost optimization)
aws emr create-cluster \
--instance-fleets '[
{
"InstanceFleetType": "MASTER",
"TargetOnDemandCapacity": 1,
"InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]
},
{
"InstanceFleetType": "CORE",
"TargetSpotCapacity": 4,
"InstanceTypeConfigs": [
{"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
{"InstanceType": "m5.2xlarge", "WeightedCapacity": 2}
]
}
]'
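EMR Serverless, by contrast, removes instance selection entirely: you create an application and submit job runs, and workers are provisioned per run. A minimal boto3 sketch (the role ARN and script path are placeholders):
import boto3

emr_serverless = boto3.client('emr-serverless')

# Create a Spark application; capacity scales automatically per job run
app = emr_serverless.create_application(
    name='my-serverless-app',
    releaseLabel='emr-7.0.0',
    type='SPARK'
)

# Submit a job run to the application
emr_serverless.start_job_run(
    applicationId=app['applicationId'],
    executionRoleArn='arn:aws:iam::123456789012:role/EMRServerlessRole',  # placeholder
    jobDriver={
        'sparkSubmit': {
            'entryPoint': 's3://bucket/scripts/job.py'  # placeholder script
        }
    }
)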
4. What is AWS Glue?
AWS Glue is a fully managed ETL (Extract, Transform, Load) service built on serverless infrastructure.
Glue Features:
├── Serverless ETL engine (Spark-based)
├── Data Catalog (Hive metastore compatible)
├── Crawlers (schema discovery)
├── Visual ETL (Glue Studio)
├── Job bookmarks (incremental processing)
├── Workflows (orchestration)
└── DataBrew (no-code data prep)
Glue Components:
┌────────────────────────────────────────────────────┐
│                      AWS Glue                      │
├─────────────────┬────────────────┬─────────────────┤
│  Data Catalog   │   ETL Engine   │  Orchestration  │
│  ┌───────────┐  │  ┌──────────┐  │  ┌───────────┐  │
│  │ Databases │  │  │ Glue Jobs│  │  │ Workflows │  │
│  │ Tables    │  │  │ (Spark)  │  │  │ Triggers  │  │
│  │ Crawlers  │  │  │ Streaming│  │  │ Schedules │  │
│  └───────────┘  │  └──────────┘  │  └───────────┘  │
└─────────────────┴────────────────┴─────────────────┘
Pricing:
├── ETL Jobs: DPU-hours (Data Processing Units)
├── Data Catalog: Free up to 1M objects
├── Crawlers: DPU-hours
└── DataBrew: Sessions (interactive) + jobs
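Because ETL jobs and crawlers bill by the DPU-hour, worker type and worker count are the main cost levers. A minimal boto3 sketch of starting an existing job with explicit capacity (the job name and argument are hypothetical):
import boto3

glue = boto3.client('glue')

# Start a job run with explicit capacity; cost scales with workers × runtime (DPU-hours)
glue.start_job_run(
    JobName='my-etl-job',          # hypothetical existing Glue job
    WorkerType='G.1X',             # 1 DPU per worker
    NumberOfWorkers=10,
    Arguments={'--ENV': 'prod'}    # hypothetical job argument
)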
5. What are Glue components?
Core Components:
1. Data Catalog
# Centralized metadata repository
# Hive metastore compatible
# Used by Athena, EMR, Redshift
2. Databases and Tables
import boto3
glue = boto3.client('glue')
glue.create_database(
DatabaseInput={'Name': 'my_database'}
)
glue.create_table(
DatabaseName='my_database',
TableInput={
'Name': 'my_table',
'StorageDescriptor': {
'Columns': [
{'Name': 'id', 'Type': 'string'},
{'Name': 'name', 'Type': 'string'}
],
'Location': 's3://bucket/path/',
'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde'}
}
}
)
3. Connections
# Database connections (JDBC)
# Network configuration
4. Crawlers
# Auto-discover schemas
# Populate the Data Catalog (see the crawler sketch after this list)
5. Jobs
# ETL processing (Spark, Python Shell)
# Visual or code-based
6. Triggers
# Schedule or event-based execution (see the trigger sketch after this list)
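Crawlers and triggers are created through the same boto3 client as databases and tables. A minimal sketch, assuming an existing IAM role for the crawler (the role ARN, S3 path, and job name are placeholders):
import boto3

glue = boto3.client('glue')

# Crawler: scans S3, infers schemas, and writes tables to the Data Catalog
glue.create_crawler(
    Name='my-crawler',
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',  # placeholder role
    DatabaseName='my_database',
    Targets={'S3Targets': [{'Path': 's3://bucket/raw/'}]}   # placeholder path
)
glue.start_crawler(Name='my-crawler')

# Trigger: run a job on a cron schedule
glue.create_trigger(
    Name='nightly-trigger',
    Type='SCHEDULED',
    Schedule='cron(0 2 * * ? *)',          # 02:00 UTC daily
    Actions=[{'JobName': 'my-etl-job'}],   # placeholder job name
    StartOnCreation=True
)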
6. How do you create a Glue ETL job?
# Glue ETL Job (PySpark)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Initialize
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read from catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
database="my_database",
table_name="source_table"
)
# Or read from S3
datasource = glueContext.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={"paths": ["s3://bucket/input/"]},
format="parquet"
)
# Transform
mapped = ApplyMapping.apply(
frame=datasource,
mappings=[
("old_col", "string", "new_col", "string"),
("amount", "double", "amount", "double")
]
)
# Filter
filtered = Filter.apply(
frame=mapped,
f=lambda x: x["amount"] > 100
)
# Write to S3
glueContext.write_dynamic_frame.from_options(
frame=filtered,
connection_type="s3",
connection_options={"path": "s3://bucket/output/"},
format="parquet"
)
job.commit()
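The script above is only the job body; it still has to be registered as a Glue job that points at the script's S3 location. A minimal boto3 sketch, assuming the script has been uploaded to S3 and an IAM role exists (the role ARN and script path are placeholders):
import boto3

glue = boto3.client('glue')

# Register the PySpark script above as a Spark ETL job
glue.create_job(
    Name='my-etl-job',
    Role='arn:aws:iam::123456789012:role/GlueJobRole',  # placeholder role
    Command={
        'Name': 'glueetl',  # Spark ETL job type
        'ScriptLocation': 's3://bucket/scripts/my_etl_job.py',  # placeholder
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    WorkerType='G.1X',
    NumberOfWorkers=5
)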