Top 20 AWS Redshift Interview Questions and Answers
- What is Amazon Redshift?
- What is Redshift architecture?
- What are Redshift node types?
- What are distribution styles in Redshift?
- What are sort keys in Redshift?
- What is Redshift Spectrum?
- How do you load data into Redshift?
- What is the COPY command?
- What is Redshift Serverless?
- How do you optimize query performance?
- What is workload management (WLM)?
- What are materialized views in Redshift?
- How do you implement data sharing?
- What is concurrency scaling?
- How do you handle vacuuming and analyzing?
- What are Redshift security best practices?
- How do you monitor Redshift?
- What is Redshift ML?
- How do you migrate to Redshift?
- What are Redshift best practices?
AWS Interview Questions - All Topics
1. What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse that uses columnar storage and parallel query execution for fast analytics.Key Features:
- Columnar storage with compression
- Massively Parallel Processing (MPP)
- SQL-based interface (PostgreSQL compatible)
- Automatic backups and snapshots
- Integration with AWS ecosystem
Redshift Capabilities:
âââ Traditional BI/Analytics
âââ Operational Analytics
âââ Data Lake Queries (Spectrum)
âââ Federated Queries (RDS, Aurora)
âââ Machine Learning (Redshift ML)
âââ Data Sharing (cross-account)
âââ Streaming Ingestion (Kinesis, MSK)
# Connect to Redshift
import psycopg2
conn = psycopg2.connect(
host='cluster.us-east-1.redshift.amazonaws.com',
port=5439,
database='mydb',
user='admin',
password='password'
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM sales LIMIT 10")
2. What is Redshift architecture?
Redshift Cluster Architecture:
âââââââââââââââââââââââââââââââââââââââââââ
â Leader Node â
â - SQL parsing and planning â
â - Query optimization â
â - Result aggregation â
â - Client connections â
ââââââââââââââââââ¬âââââââââââââââââââââââââ
â
ââââââââââââââ¼âââââââââââââ
â¼ â¼ â¼
ââââââââââ ââââââââââ ââââââââââ
âCompute â âCompute â âCompute â
âNode 1 â âNode 2 â âNode 3 â
ââââââââââ ââââââââââ ââââââââââ
ââSlice1ââ ââSlice1ââ ââSlice1ââ
âââââââââ¤â âââââââââ¤â âââââââââ¤â
ââSlice2ââ ââSlice2ââ ââSlice2ââ
ââââââââââ ââââââââââ ââââââââââ
ââââââââââ ââââââââââ ââââââââââ
Key Concepts:
âââ Leader Node: Query coordination
âââ Compute Nodes: Data storage and query execution
âââ Slices: Portion of compute node's memory/disk
âââ Columns: Data stored by column, not row
âââ Blocks: Columnar data stored in 1MB blocks
3. What are Redshift node types?
| Type | Storage | Use Case |
|---|---|---|
| RA3 | Managed (RMS) | Scale compute/storage independently |
| DC2 | Local SSD | High performance, smaller datasets |
| DS2 | Local HDD | Legacy, large storage (deprecated) |
RA3 Node Types (Recommended):
âââ ra3.xlplus: 32 GB, 32 TB managed storage
âââ ra3.4xlarge: 96 GB, 128 TB managed storage
âââ ra3.16xlarge: 384 GB, 128 TB managed storage
DC2 Node Types:
âââ dc2.large: 160 GB SSD, 0.16 TB/node
âââ dc2.8xlarge: 2.56 TB SSD, 2.56 TB/node
RA3 Benefits:
- Redshift Managed Storage (RMS)
- Hot data cached on SSD
- Cold data on S3 automatically
- Scale compute without storage limits
- Cross-AZ durability
# Create RA3 cluster
aws redshift create-cluster \
--cluster-identifier my-cluster \
--node-type ra3.4xlarge \
--number-of-nodes 3 \
--master-username admin \
--master-user-password MyPassword123
4. What are distribution styles in Redshift?
Distribution style determines how data is distributed across nodes for parallel processing.
Distribution Styles:
1. KEY Distribution
-- Rows with same key on same slice
-- Best for join columns
CREATE TABLE orders (
order_id INT,
customer_id INT,
amount DECIMAL(10,2)
)
DISTKEY(customer_id);
2. EVEN Distribution
-- Round-robin distribution
-- Default, good for uniform access
CREATE TABLE events (
event_id INT,
event_data VARCHAR(1000)
)
DISTSTYLE EVEN;
3. ALL Distribution
-- Copy entire table to all nodes
-- Small dimension tables
CREATE TABLE countries (
country_code CHAR(2),
country_name VARCHAR(100)
)
DISTSTYLE ALL;
4. AUTO Distribution
-- Redshift chooses automatically
-- Starts as ALL, converts to EVEN if grows
CREATE TABLE products (
product_id INT,
name VARCHAR(200)
)
DISTSTYLE AUTO;
Best Practices:
âââ KEY: Large fact tables, join columns
âââ ALL: Small dimension tables (< 3M rows)
âââ EVEN: No clear join pattern
âââ AUTO: Let Redshift decide
5. What are sort keys in Redshift?
Sort keys determine the physical order of data on disk, enabling efficient range queries and joins.
Sort Key Types:
1. Compound Sort Key
-- Columns sorted in defined order
-- Best for queries filtering on prefix
CREATE TABLE sales (
sale_date DATE,
region VARCHAR(50),
product_id INT,
amount DECIMAL(10,2)
)
COMPOUND SORTKEY(sale_date, region);
-- Efficient: WHERE sale_date = '2024-01-01'
-- Efficient: WHERE sale_date = '2024-01-01' AND region = 'US'
-- Inefficient: WHERE region = 'US' (skips first column)
2. Interleaved Sort Key
-- Equal weight to all columns
-- Good for multiple query patterns
CREATE TABLE events (
event_date DATE,
user_id INT,
event_type VARCHAR(50)
)
INTERLEAVED SORTKEY(event_date, user_id, event_type);
-- Efficient for any column filter
-- Higher VACUUM overhead
3. AUTO Sort Key
-- Redshift manages automatically
CREATE TABLE logs (
log_time TIMESTAMP,
message TEXT
)
SORTKEY AUTO;
Zone Maps:
-- Redshift maintains min/max per block
-- Skips blocks that can't contain data
-- Sort key maximizes zone map effectiveness