Top AWS Redshift Interview Questions (2026) | JavaInuse

Top 20 AWS Redshift Interview Questions and Answers


  1. What is Amazon Redshift?
  2. What is Redshift architecture?
  3. What are Redshift node types?
  4. What are distribution styles in Redshift?
  5. What are sort keys in Redshift?
  6. What is Redshift Spectrum?
  7. How do you load data into Redshift?
  8. What is the COPY command?
  9. What is Redshift Serverless?
  10. How do you optimize query performance?
  11. What is workload management (WLM)?
  12. What are materialized views in Redshift?
  13. How do you implement data sharing?
  14. What is concurrency scaling?
  15. How do you handle vacuuming and analyzing?
  16. What are Redshift security best practices?
  17. How do you monitor Redshift?
  18. What is Redshift ML?
  19. How do you migrate to Redshift?
  20. What are Redshift best practices?

1. What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse that uses columnar storage and parallel query execution for fast analytics.

Key Features:
- Columnar storage with compression
- Massively Parallel Processing (MPP)
- SQL-based interface (PostgreSQL compatible)
- Automatic backups and snapshots
- Integration with AWS ecosystem

Redshift Capabilities:
├── Traditional BI/Analytics
├── Operational Analytics
├── Data Lake Queries (Spectrum)
├── Federated Queries (RDS, Aurora)
├── Machine Learning (Redshift ML)
├── Data Sharing (cross-account)
└── Streaming Ingestion (Kinesis, MSK)

# Connect to Redshift (avoid hardcoding credentials in production;
# prefer IAM temporary credentials or a secrets manager)
import psycopg2

conn = psycopg2.connect(
    host='cluster.us-east-1.redshift.amazonaws.com',
    port=5439,
    database='mydb',
    user='admin',
    password='password'
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM sales LIMIT 10")
rows = cursor.fetchall()

2. What is Redshift architecture?

Redshift Cluster Architecture:

┌─────────────────────────────────────────┐
│           Leader Node                    │
│  - SQL parsing and planning             │
│  - Query optimization                   │
│  - Result aggregation                   │
│  - Client connections                   │
└────────────────┬────────────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
┌────────┐  ┌────────┐  ┌────────┐
│Compute │  │Compute │  │Compute │
│Node 1  │  │Node 2  │  │Node 3  │
│┌──────┐│  │┌──────┐│  │┌──────┐│
││Slice1││  ││Slice1││  ││Slice1││
│├──────┤│  │├──────┤│  │├──────┤│
││Slice2││  ││Slice2││  ││Slice2││
│└──────┘│  │└──────┘│  │└──────┘│
└────────┘  └────────┘  └────────┘

Key Concepts:
├── Leader Node: Query coordination
├── Compute Nodes: Data storage and query execution
├── Slices: Portion of compute node's memory/disk
├── Columns: Data stored by column, not row
└── Blocks: Columnar data stored in 1MB blocks

3. What are Redshift node types?

Type | Storage       | Use Case
-----|---------------|------------------------------------
RA3  | Managed (RMS) | Scale compute/storage independently
DC2  | Local SSD     | High performance, smaller datasets
DS2  | Local HDD     | Legacy, large storage (deprecated)

RA3 Node Types (Recommended):
├── ra3.xlplus: 32 GB, 32 TB managed storage
├── ra3.4xlarge: 96 GB, 128 TB managed storage
└── ra3.16xlarge: 384 GB, 128 TB managed storage

DC2 Node Types:
├── dc2.large: 160 GB SSD, 0.16 TB/node
└── dc2.8xlarge: 2.56 TB SSD, 2.56 TB/node

RA3 Benefits:
- Redshift Managed Storage (RMS)
- Hot data cached on SSD
- Cold data on S3 automatically
- Scale compute without storage limits
- Cross-AZ durability

# Create RA3 cluster
aws redshift create-cluster \
    --cluster-identifier my-cluster \
    --node-type ra3.4xlarge \
    --number-of-nodes 3 \
    --master-username admin \
    --master-user-password MyPassword123

4. What are distribution styles in Redshift?

Distribution style determines how data is distributed across nodes for parallel processing.

Distribution Styles:

1. KEY Distribution
-- Rows with same key on same slice
-- Best for join columns
CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    amount DECIMAL(10,2)
)
DISTKEY(customer_id);

2. EVEN Distribution
-- Round-robin distribution
-- Default, good for uniform access
CREATE TABLE events (
    event_id INT,
    event_data VARCHAR(1000)
)
DISTSTYLE EVEN;

3. ALL Distribution
-- Copy entire table to all nodes
-- Small dimension tables
CREATE TABLE countries (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;

4. AUTO Distribution
-- Redshift chooses automatically
-- Starts as ALL; converts to EVEN or KEY as the table grows
CREATE TABLE products (
    product_id INT,
    name VARCHAR(200)
)
DISTSTYLE AUTO;

Best Practices:
├── KEY: Large fact tables, join columns
├── ALL: Small dimension tables (< 3M rows)
├── EVEN: No clear join pattern
└── AUTO: Let Redshift decide
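The trade-off between KEY and EVEN can be sketched in a few lines. This is purely illustrative: Redshift's actual hash function is internal, so Python's `hashlib` stands in here just to show that KEY co-locates equal key values (and can skew when the key has few distinct values) while EVEN spreads rows uniformly.

```python
# Illustrative sketch: how KEY vs EVEN distribution spreads rows across
# slices. Redshift's real hash is internal; hashlib is a stand-in.
import hashlib
from collections import Counter

NUM_SLICES = 6  # e.g. 3 nodes x 2 slices each

def key_slice(dist_key_value):
    """KEY: the same key value always lands on the same slice."""
    digest = hashlib.md5(str(dist_key_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES

def even_slices(num_rows):
    """EVEN: round-robin, regardless of row content."""
    return [i % NUM_SLICES for i in range(num_rows)]

# Only 4 distinct customer_ids -> KEY can use at most 4 slices (skew!)
rows = [{"order_id": i, "customer_id": i % 4} for i in range(1000)]

key_dist = Counter(key_slice(r["customer_id"]) for r in rows)
even_dist = Counter(even_slices(len(rows)))

print(len(key_dist), "slices used by KEY")    # at most 4
print(len(even_dist), "slices used by EVEN")  # all 6
```

This is why KEY distribution on a low-cardinality column causes skew: some slices do all the work while others sit idle.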

5. What are sort keys in Redshift?

Sort keys determine the physical order of data on disk, enabling efficient range queries and joins.

Sort Key Types:

1. Compound Sort Key
-- Columns sorted in defined order
-- Best for queries filtering on prefix
CREATE TABLE sales (
    sale_date DATE,
    region VARCHAR(50),
    product_id INT,
    amount DECIMAL(10,2)
)
COMPOUND SORTKEY(sale_date, region);

-- Efficient: WHERE sale_date = '2024-01-01'
-- Efficient: WHERE sale_date = '2024-01-01' AND region = 'US'
-- Inefficient: WHERE region = 'US' (does not filter on the leading sort column)

2. Interleaved Sort Key
-- Equal weight to all columns
-- Good for multiple query patterns
CREATE TABLE events (
    event_date DATE,
    user_id INT,
    event_type VARCHAR(50)
)
INTERLEAVED SORTKEY(event_date, user_id, event_type);

-- Efficient for any column filter
-- Higher VACUUM overhead

3. AUTO Sort Key
-- Redshift manages automatically
CREATE TABLE logs (
    log_time TIMESTAMP,
    message TEXT
)
SORTKEY AUTO;

Zone Maps:
-- Redshift maintains min/max per block
-- Skips blocks that can't contain data
-- Sort key maximizes zone map effectiveness
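The zone-map idea above can be demonstrated with a toy model: keep a min/max pair per block and count how many blocks a range predicate must touch. The 1,000-value block size here is illustrative (real Redshift blocks are 1 MB of columnar data), but the effect is the same: sorted data yields tight, non-overlapping ranges, so most blocks are skipped.

```python
# Sketch of zone maps: min/max per block, skip blocks whose range
# cannot overlap the predicate. Sorted data skips far more blocks.
import random

def build_zone_map(values, block_size):
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]

def blocks_to_scan(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)

data = list(range(10_000))
sorted_map = build_zone_map(data, block_size=1000)

random.seed(0)
shuffled = data[:]
random.shuffle(shuffled)
unsorted_map = build_zone_map(shuffled, block_size=1000)

# Range filter: WHERE value BETWEEN 2500 AND 2600
print(blocks_to_scan(sorted_map, 2500, 2600))    # sorted: 1 block
print(blocks_to_scan(unsorted_map, 2500, 2600))  # unsorted: nearly all 10
```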




6. What is Redshift Spectrum?

Redshift Spectrum enables querying data in S3 directly from Redshift without loading it.

Spectrum Architecture:
┌──────────────┐     ┌─────────────────────┐
│   Redshift   │────▶│  Spectrum Layer     │
│   Cluster    │     │  (Shared compute)   │
└──────────────┘     └──────────┬──────────┘
                                │
                     ┌──────────▼──────────┐
                     │   Amazon S3         │
                     │   (External data)   │
                     └─────────────────────┘

# Create external schema
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

# Create external table
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://bucket/sales/';

# Add partitions
ALTER TABLE spectrum_schema.sales
ADD PARTITION (year=2024, month=1)
LOCATION 's3://bucket/sales/year=2024/month=1/';

# Query joining local and S3 data
SELECT c.name, SUM(s.amount)
FROM local_schema.customers c
JOIN spectrum_schema.sales s ON c.id = s.customer_id
WHERE s.year = 2024
GROUP BY c.name;

7. How do you load data into Redshift?

Data Loading Methods:

1. COPY Command (Fastest)
COPY sales FROM 's3://bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;

2. INSERT (Small data)
INSERT INTO sales (id, amount) VALUES (1, 100.00);

3. AWS Glue
-- ETL job writes directly to Redshift

4. Kinesis Data Firehose
-- Streaming data delivery

5. Streaming Ingestion
CREATE MATERIALIZED VIEW sales_stream AS
SELECT *
FROM kinesis_schema.sales_stream
WHERE is_json_valid(json_data);

6. Zero-ETL (Aurora → Redshift)
-- Near real-time replication
-- No ETL code needed

Loading Best Practices:
├── Use COPY, not INSERT for bulk
├── Split files (parallel load)
├── Use columnar formats (Parquet, ORC)
├── Compress data (GZIP, LZO, ZSTD)
├── Sort input files by sort key
└── Load during maintenance window
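The "split files" practice can be sketched as follows. This is a hypothetical helper, not an AWS API: it splits rows into N compressed CSV parts (ideally a multiple of the cluster's slice count) that would each be uploaded to S3 for a single parallel COPY.

```python
# Sketch: split a dataset into N gzipped CSV parts so COPY can load
# them in parallel. Paths/part count are illustrative.
import csv
import gzip
import io

def split_csv(rows, header, num_parts):
    """Return num_parts gzipped CSV payloads, distributing rows round-robin."""
    buffers = [io.StringIO() for _ in range(num_parts)]
    writers = [csv.writer(buf) for buf in buffers]
    for w in writers:
        w.writerow(header)
    for i, row in enumerate(rows):
        writers[i % num_parts].writerow(row)
    return [gzip.compress(buf.getvalue().encode()) for buf in buffers]

header = ["id", "amount"]
rows = [[i, i * 1.5] for i in range(100)]
parts = split_csv(rows, header, num_parts=4)
# Each part would then be uploaded as s3://bucket/data/part-000N.csv.gz
print(len(parts), "parts,", sum(len(p) for p in parts), "compressed bytes")
```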

8. What is the COPY command?

COPY is the most efficient way to load data into Redshift from S3, DynamoDB, or remote hosts.

COPY Command Syntax:
COPY table_name
FROM 's3://bucket/prefix'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
[options];

# Load CSV with options
COPY sales
FROM 's3://bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
CSV
IGNOREHEADER 1
DELIMITER ','
DATEFORMAT 'YYYY-MM-DD'
TIMEFORMAT 'auto'
REGION 'us-east-1'
MAXERROR 100
COMPUPDATE ON
STATUPDATE ON;

# Load Parquet (no format options needed)
COPY sales
FROM 's3://bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
FORMAT AS PARQUET;

# Load from manifest
COPY sales
FROM 's3://bucket/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
MANIFEST;

# Manifest file
{
  "entries": [
    {"url": "s3://bucket/file1.csv", "mandatory": true},
    {"url": "s3://bucket/file2.csv", "mandatory": true}
  ]
}

# Check load errors
SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;
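Manifest files like the one above are easy to generate programmatically. A minimal sketch (bucket and key names are illustrative); setting `"mandatory": true` makes COPY fail if any listed file is missing rather than silently skipping it:

```python
# Sketch: build a COPY manifest from a list of S3 keys.
import json

def build_manifest(bucket, keys, mandatory=True):
    return {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
            for key in keys
        ]
    }

manifest = build_manifest("bucket", ["file1.csv", "file2.csv"])
print(json.dumps(manifest, indent=2))
# Upload this JSON to s3://bucket/manifest.json, then run COPY ... MANIFEST;
```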

9. What is Redshift Serverless?

Redshift Serverless provides on-demand analytics without managing cluster infrastructure.

Redshift Serverless Components:
├── Namespace: Database objects, users, schemas
├── Workgroup: Compute resources, network config
└── RPU: Redshift Processing Units (billing)

Benefits:
- No cluster management
- Auto-scaling compute
- Pay per use (RPU-hours)
- Ideal for variable workloads

# Create via AWS CLI
aws redshift-serverless create-namespace \
    --namespace-name my-namespace \
    --admin-username admin \
    --admin-user-password MyPassword123 \
    --db-name mydb

aws redshift-serverless create-workgroup \
    --workgroup-name my-workgroup \
    --namespace-name my-namespace \
    --base-capacity 32 \
    --max-capacity 512

# Capacity settings
Base Capacity: 32-512 RPUs (minimum when active)
Max Capacity: Up to 512 RPUs (auto-scale limit)

Pricing:
- RPU-hours for compute
- GB-months for RMS storage
- No charge when idle

# Connection
Endpoint: workgroup-name.account-id.region.redshift-serverless.amazonaws.com:5439
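The RPU-hour billing model lends itself to back-of-envelope arithmetic. A sketch, with two stated assumptions: the $0.375 per RPU-hour price is illustrative (actual prices vary by region), and billing is per second with a 60-second minimum per workload:

```python
# Back-of-envelope RPU cost. Price and minimum-charge assumptions are
# illustrative; check current regional pricing.
def rpu_cost(rpus, seconds, price_per_rpu_hour):
    billable_seconds = max(seconds, 60)  # assumed 60-second minimum
    return rpus * (billable_seconds / 3600) * price_per_rpu_hour

# A 10-minute workload on the 32-RPU base capacity:
cost = rpu_cost(rpus=32, seconds=600, price_per_rpu_hour=0.375)
print(f"${cost:.2f}")  # 32 RPU * (600/3600) h * $0.375/RPU-h = $2.00
```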

10. How do you optimize query performance?

Query Optimization Strategies:

1. Use EXPLAIN to analyze
EXPLAIN SELECT * FROM sales WHERE date > '2024-01-01';
-- Check for seq scans, sort, distribution

2. Optimize table design
-- Appropriate distribution key (join columns)
-- Sort key (filter/join columns)
-- Compression encoding

3. Write efficient queries
-- Avoid SELECT *
-- Use WHERE to filter early
-- Avoid DISTINCT when possible
-- Use approximate functions

-- Bad
SELECT DISTINCT category FROM products;

-- Better
SELECT category FROM products GROUP BY category;

-- For counts, use approximation
SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;

4. Use result caching
-- Enable by default for identical queries
SET enable_result_cache_for_session TO ON;

5. Analyze statistics
ANALYZE sales;
-- Updates statistics for query optimizer

6. Review SVL tables
-- Execution details
SELECT * FROM svl_query_summary WHERE query = 123;

-- Disk-based operations (bad)
SELECT * FROM svl_query_report WHERE query = 123 AND is_diskbased = 't';

11. What is workload management (WLM)?

WLM manages query queues and resource allocation for different workloads.

WLM Modes:

1. Automatic WLM (Recommended)
-- Redshift manages memory and concurrency
-- Uses machine learning

2. Manual WLM
-- Define queues, memory, concurrency manually

# Configure WLM Parameter Group
{
    "wlm_json_configuration": [
        {
            "name": "ETL",
            "query_group": ["etl"],
            "memory_percent_to_use": 40,
            "max_execution_time": 3600000
        },
        {
            "name": "Reporting",
            "user_group": ["analysts"],
            "memory_percent_to_use": 40,
            "concurrency_scaling": "auto"
        },
        {
            "name": "Default",
            "memory_percent_to_use": 20,
            "query_concurrency": 5
        }
    ]
}

# Assign query to queue
SET query_group TO 'etl';
COPY sales FROM 's3://...';
RESET query_group;

# Short Query Acceleration (SQA)
-- Routes short queries to fast lane
-- Enable: Automatic WLM with SQA
-- Identifies queries < threshold (seconds)

# Monitor WLM
SELECT * FROM stv_wlm_query_state;
SELECT * FROM stl_wlm_query;

12. What are materialized views in Redshift?

Materialized views store precomputed query results for faster access to complex aggregations.

# Create materialized view
CREATE MATERIALIZED VIEW daily_sales AS
SELECT 
    sale_date,
    region,
    SUM(amount) as total_amount,
    COUNT(*) as order_count
FROM sales
GROUP BY sale_date, region;

# Refresh options
-- Manual refresh
REFRESH MATERIALIZED VIEW daily_sales;

-- Auto refresh (Redshift manages)
CREATE MATERIALIZED VIEW daily_sales
AUTO REFRESH YES
AS SELECT ...;

# Incremental refresh
-- For views with aggregations on tables with sort keys
-- Much faster than full refresh
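The incremental-refresh idea can be sketched conceptually: instead of recomputing the whole aggregate, only the delta rows since the last refresh are folded into the stored result. This toy model (plain dicts, not Redshift internals) shows why it is cheaper while producing the same answer:

```python
# Conceptual sketch of incremental MV refresh: fold only new rows into
# the existing materialized aggregate.
from collections import defaultdict

def full_refresh(rows):
    totals = defaultdict(float)
    for r in rows:
        totals[(r["sale_date"], r["region"])] += r["amount"]
    return dict(totals)

def incremental_refresh(mv, delta_rows):
    """Apply only the delta rows to the stored result."""
    for r in delta_rows:
        key = (r["sale_date"], r["region"])
        mv[key] = mv.get(key, 0.0) + r["amount"]
    return mv

base = [{"sale_date": "2024-01-01", "region": "US", "amount": 100.0}]
delta = [{"sale_date": "2024-01-01", "region": "US", "amount": 50.0},
         {"sale_date": "2024-01-02", "region": "EU", "amount": 75.0}]

mv = full_refresh(base)
mv = incremental_refresh(mv, delta)
# Same result as recomputing from scratch, at a fraction of the work
assert mv == full_refresh(base + delta)
print(mv)
```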

# Query rewriting
-- Optimizer automatically uses MV
-- Even if you query base table
SET mv_enable_aqmv_for_session TO TRUE;

SELECT sale_date, SUM(amount)
FROM sales
WHERE sale_date > '2024-01-01'
GROUP BY sale_date;
-- May use daily_sales MV automatically

# Streaming ingestion with MV (uses the stream's built-in columns)
CREATE MATERIALIZED VIEW streaming_sales AS
SELECT
    approximate_arrival_timestamp,
    JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema.sales_stream;

13. How do you implement data sharing?

Data sharing enables sharing live data across Redshift clusters without copying.

Data Sharing Components:
├── Producer: Cluster that shares data
├── Consumer: Cluster that accesses data
├── Datashare: Collection of shared objects
└── Namespace: Cluster identifier

# On Producer Cluster
-- Create datashare
CREATE DATASHARE sales_share;

-- Add objects to share
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;
ALTER DATASHARE sales_share ADD TABLE public.customers;

-- Grant to consumer (same account)
GRANT USAGE ON DATASHARE sales_share
TO NAMESPACE 'consumer-namespace-id';

-- Grant to consumer (cross-account)
GRANT USAGE ON DATASHARE sales_share
TO ACCOUNT '123456789012';

# On Consumer Cluster
-- Create database from datashare
CREATE DATABASE shared_db FROM DATASHARE sales_share
OF NAMESPACE 'producer-namespace-id';

-- Query shared data
SELECT * FROM shared_db.public.sales;

Benefits:
├── Live data (no ETL)
├── No storage cost for consumer
├── Read-only access
├── Cross-account sharing
└── Supports Redshift Serverless

14. What is concurrency scaling?

Concurrency scaling automatically adds temporary cluster capacity during usage spikes.

How it Works:
1. Main cluster queue fills up
2. Redshift launches scaling clusters
3. Queries routed to scaling clusters
4. Results returned seamlessly
5. Scaling clusters terminated when idle

# Enable for WLM queue
{
    "name": "Reporting",
    "concurrency_scaling": "auto"
}

# Monitor scaling
SELECT * FROM svcs_concurrency_scaling_usage;

# Pricing
-- 1 hour free credit per 24 hours (per cluster)
-- Additional: Per-second billing

# Eligible queries
-- Read queries (SELECT)
-- COPY, UNLOAD, INSERT INTO SELECT
-- Not: DDL, maintenance

# Configure via parameter
-- max_concurrency_scaling_clusters: 0-10
-- 0 = disabled
-- 10 = up to 10 additional clusters

Benefits:
├── Predictable performance during spikes
├── No capacity planning needed
├── Pay only when used
└── Transparent to users

15. How do you handle vacuuming and analyzing?

VACUUM reclaims space and re-sorts rows; ANALYZE updates statistics for query optimization.

VACUUM Operations:

# Full vacuum (sort + delete)
VACUUM FULL sales;

# Delete only (reclaim deleted rows)
VACUUM DELETE ONLY sales;

# Sort only (re-sort unsorted rows)
VACUUM SORT ONLY sales;

# Reindex (interleaved sort key)
VACUUM REINDEX sales;

# Automatic vacuum
-- Redshift runs automatically during maintenance
-- Sort and delete thresholds monitored

# Check vacuum status
SELECT * FROM svv_vacuum_progress;
SELECT * FROM svv_vacuum_summary;

# Check table health
SELECT "table", unsorted, vacuum_sort_benefit
FROM svv_table_info
WHERE unsorted > 5;

ANALYZE Operations:

# Analyze all columns
ANALYZE sales;

# Analyze specific columns
ANALYZE sales(sale_date, customer_id);

# Auto analyze
-- Runs automatically after COPY
-- STATUPDATE ON in COPY command

# Check statistics
SELECT * FROM svv_table_info;

Best Practices:
├── Let auto vacuum/analyze run
├── Manual vacuum after large deletes
├── Analyze after schema changes
└── Monitor unsorted percentage
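The "monitor unsorted percentage" practice can be automated. A minimal sketch of a maintenance helper: it applies the same `unsorted > 5` cutoff used in the `svv_table_info` query above; the input rows would come from a cursor, which is assumed rather than shown.

```python
# Sketch: pick tables to vacuum based on unsorted percentage, mirroring
# SELECT "table", unsorted FROM svv_table_info WHERE unsorted > 5;
def tables_needing_vacuum(table_stats, unsorted_threshold=5.0):
    """table_stats: iterable of (table_name, unsorted_pct); unsorted
    may be NULL (None) for small or freshly loaded tables."""
    return [name for name, unsorted in table_stats
            if unsorted is not None and unsorted > unsorted_threshold]

stats = [("sales", 12.4), ("customers", 0.8), ("events", None), ("logs", 6.1)]
for table in tables_needing_vacuum(stats):
    print(f"VACUUM SORT ONLY {table};")  # would be cursor.execute(...)
```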

16. What are Redshift security best practices?

Security Layers:

1. Network Security
-- VPC with private subnets
-- Security groups (restrict port 5439)
-- VPC endpoints for S3, Glue

2. Encryption
-- At rest: KMS managed keys
-- In transit: SSL/TLS required

# Force SSL (set in the cluster parameter group, not per user)
aws redshift modify-cluster-parameter-group \
    --parameter-group-name my-params \
    --parameters ParameterName=require_ssl,ParameterValue=true

3. Authentication
-- Database users
-- IAM authentication
-- Federated access (SAML)

# IAM authentication
CREATE USER iam_user PASSWORD DISABLE;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO iam_user;

# Generate temp credentials
aws redshift get-cluster-credentials \
    --db-user iam_user \
    --cluster-identifier my-cluster

4. Authorization
-- Role-based access control
-- Row-level security
-- Column-level security

# Create role
CREATE ROLE analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO ROLE analysts;
GRANT ROLE analysts TO user1;

# Row-level security
CREATE RLS POLICY region_policy
WITH (region VARCHAR(50))
USING (region = current_setting('app.region'));

-- Attach the policy to a table for a role
ATTACH RLS POLICY region_policy ON sales TO ROLE analysts;

5. Audit Logging
-- Enable audit logging to S3
-- CloudTrail for API calls




17. How do you monitor Redshift?

Monitoring Tools:

1. CloudWatch Metrics
├── CPUUtilization
├── PercentageDiskSpaceUsed
├── ReadIOPS, WriteIOPS
├── DatabaseConnections
├── QueryDuration
└── WLMQueueLength

2. System Tables (STL, STV, SVL, SVV)
# Recent queries
SELECT query, starttime, endtime, querytxt
FROM stl_query
ORDER BY starttime DESC LIMIT 10;

# Query execution steps
SELECT * FROM svl_query_report WHERE query = 123;

# Disk-based queries (need more memory)
SELECT query, segment, step, is_diskbased
FROM svl_query_summary
WHERE is_diskbased = 't';

# Current running queries
SELECT * FROM stv_recents WHERE status = 'Running';

3. Query Monitoring Rules (QMR)
# Defined in the WLM configuration JSON (not SQL)
{
    "rules": [
        {
            "rule_name": "abort_long_queries",
            "predicate": [
                {"metric_name": "query_execution_time", "operator": ">", "value": 3600}
            ],
            "action": "abort"
        }
    ]
}

4. Advisor Recommendations
SELECT * FROM svv_advisor_recommendations;
-- Distribution key suggestions
-- Sort key suggestions
-- Compression changes

5. CloudWatch Logs
-- User activity logs
-- Connection logs
-- User logs
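A CloudWatch alarm on one of the metrics above boils down to checking whether a metric stays past a threshold for N consecutive evaluation periods. A standalone sketch of that logic (the datapoints are made-up samples, not real metric output):

```python
# Sketch of CloudWatch-style alarm evaluation: fire when
# PercentageDiskSpaceUsed exceeds the threshold for N consecutive
# periods.
def breaches_threshold(datapoints, threshold, consecutive_periods):
    streak = 0
    for value in datapoints:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive_periods:
            return True
    return False

disk_used = [72.0, 88.1, 91.5, 92.3, 90.8]  # sample 5-minute datapoints
print(breaches_threshold(disk_used, threshold=90.0, consecutive_periods=3))
```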

18. What is Redshift ML?

Redshift ML enables creating, training, and deploying machine learning models using SQL.

# Create model (uses SageMaker Autopilot)
CREATE MODEL customer_churn_model
FROM (
    SELECT 
        tenure,
        monthly_charges,
        total_charges,
        contract_type,
        churn  -- Target column
    FROM customers
)
TARGET churn
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET 'my-redshift-ml-bucket',
    MAX_RUNTIME 3600
);

# Check model status
SHOW MODEL customer_churn_model;

# Make predictions
SELECT 
    customer_id,
    predict_churn(tenure, monthly_charges, total_charges, contract_type) as predicted_churn
FROM new_customers;

# Supported problem types
├── Binary classification
├── Multi-class classification
├── Regression
└── BYOM (Bring Your Own Model)

# Import existing SageMaker model
CREATE MODEL sentiment_model
FUNCTION predict_sentiment (text VARCHAR)
RETURNS VARCHAR
SAGEMAKER 'sentiment-endpoint'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole';

19. How do you migrate to Redshift?

Migration Options:

1. AWS Schema Conversion Tool (SCT)
-- Convert schemas from Oracle, SQL Server, etc.
-- Identifies conversion issues
-- Generates target DDL

2. AWS Database Migration Service (DMS)
-- Full load + CDC
-- Minimal downtime
-- Supports many sources

# DMS Task for Redshift
{
    "TargetMetadata": {
        "TargetSchema": "",
        "SupportLobs": false,
        "FullLobMode": false,
        "LobMaxSize": 0
    },
    "FullLoadSettings": {
        "TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD"
    }
}

3. AWS Glue
-- ETL transformations
-- Schema flexibility

4. COPY from S3
-- Export from source to S3
-- COPY into Redshift

Migration Steps:
1. Assess (SCT assessment report)
2. Convert schema (SCT)
3. Migrate data (DMS or COPY)
4. Validate data
5. Cutover application
6. Monitor performance

Best Practices:
├── Start with assessment
├── Test with subset of data
├── Optimize table design
├── Validate row counts
└── Plan maintenance window
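The "validate row counts" step can be sketched as a simple per-table comparison. The connections that would produce these counts (source database and Redshift) are assumed, and the table names are illustrative:

```python
# Sketch of migration validation: compare per-table row counts between
# source and target and report mismatches.
def compare_row_counts(source_counts, target_counts):
    """Each argument: dict of table_name -> row count."""
    mismatches = {}
    for table, src in source_counts.items():
        tgt = target_counts.get(table)
        if tgt != src:
            mismatches[table] = (src, tgt)
    return mismatches

source = {"sales": 1_000_000, "customers": 50_000}
target = {"sales": 1_000_000, "customers": 49_998}
print(compare_row_counts(source, target))  # {'customers': (50000, 49998)}
```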

20. What are Redshift best practices?

1. Table Design:
-- Choose appropriate distribution
-- Large fact tables: DISTKEY on join column
-- Small dimensions: DISTSTYLE ALL
-- Default: AUTO

-- Choose sort keys wisely
-- Filter columns in COMPOUND
-- Multiple patterns: INTERLEAVED (carefully)

-- Use compression
ANALYZE COMPRESSION sales;
-- Apply recommended encodings

2. Data Loading:
- Use COPY, not INSERT for bulk
- Split files for parallel load
- Use columnar formats (Parquet)
- Compress source files

3. Query Optimization:
-- Avoid SELECT *
-- Filter early with WHERE
-- Use EXPLAIN to analyze
-- Leverage result caching
-- Use materialized views

4. Maintenance:
- Monitor table health (unsorted %)
- Let auto vacuum run
- Update statistics regularly
- Review advisor recommendations

5. Cost Optimization:
- Right-size cluster
- Use Reserved Instances
- Pause clusters when idle
- Use Redshift Serverless for variable workloads
- Archive old data to S3 (query with Spectrum)

