Top 20 GCP Dataproc & Dataflow Interview Questions
- What is Dataproc?
- What is Dataflow?
- What are the differences between Dataproc and Dataflow?
- What is Apache Beam?
- How do you create a Dataproc cluster?
- What is Dataproc Serverless?
- How do you write a Dataflow pipeline?
- What are Dataflow templates?
- How do you handle streaming in Dataflow?
- What are Dataproc autoscaling policies?
- How do you optimize Spark jobs on Dataproc?
- What are Dataflow windowing strategies?
- How do you handle late data in Dataflow?
- What is Dataproc Hub?
- How do you connect to BigQuery from Dataproc?
- What are Dataflow side inputs?
- How do you monitor Dataproc jobs?
- What are Dataflow Flex Templates?
- How do you optimize Dataflow pipelines?
- What are best practices for data processing?
1. What is Dataproc?
Dataproc is a fully managed service for running Apache Spark, Hadoop, and other open-source data processing frameworks.
Dataproc Features:
+-- Managed Spark/Hadoop clusters
+-- Fast cluster provisioning (~90 seconds)
+-- Autoscaling
+-- Per-second billing
+-- Integrated with GCP services
+-- Optional components (Jupyter, Presto, etc.)
+-- Dataproc Serverless for Spark
Dataproc Architecture:
+-------------------------------------------------------------+
|                      Dataproc Cluster                       |
+-------------------------------------------------------------+
|   +-----------------------------------------------------+   |
|   |                   Master Node(s)                    |   |
|   |    +---------+      +---------+      +---------+    |   |
|   |    | YARN RM |      | HDFS    |      | Spark   |    |   |
|   |    |         |      | NameNode|      | History |    |   |
|   |    +---------+      +---------+      +---------+    |   |
|   +-----------------------------------------------------+   |
|                              |                              |
|   +-----------------------------------------------------+   |
|   |                    Worker Nodes                     |   |
|   |    +---------+      +---------+      +---------+    |   |
|   |    | Worker 1|      | Worker 2|      | Worker N|    |   |
|   |    | NodeMgr |      | NodeMgr |      | NodeMgr |    |   |
|   |    | DataNode|      | DataNode|      | DataNode|    |   |
|   |    +---------+      +---------+      +---------+    |   |
|   +-----------------------------------------------------+   |
|                                                             |
|   Optional:                                                 |
|   +-- Secondary workers (preemptible)                       |
|   +-- High availability (3 masters)                         |
|   +-- Custom machine types                                  |
+-------------------------------------------------------------+
# Create a simple cluster
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --image-version=2.1-debian11
2. What is Dataflow?
Dataflow is a fully managed service for executing Apache Beam pipelines for batch and stream processing.
Dataflow Features:
+-- Serverless (no cluster management)
+-- Unified batch and stream processing
+-- Auto-scaling
+-- Apache Beam SDK
+-- Exactly-once processing
+-- Integrated with GCP services
Dataflow Architecture:
+-------------------------------------------------------------+
|                      Dataflow Service                       |
+-------------------------------------------------------------+
|                                                             |
|   +-----------------------------------------------------+   |
|   |                      Pipeline                       |   |
|   |                                                     |   |
|   |    Source -> Transform -> Transform -> Sink         |   |
|   |      |           |            |         |           |   |
|   |    Read      ParDo        GroupBy      Write        |   |
|   |    Pub/Sub   Filter       Window       BigQuery     |   |
|   |    GCS       Map          Aggregate    GCS          |   |
|   +-----------------------------------------------------+   |
|                              |                              |
|                              v                              |
|   +-----------------------------------------------------+   |
|   |             Worker Pool (Auto-managed)              |   |
|   |    +---------+      +---------+      +---------+    |   |
|   |    | Worker 1|      | Worker 2|      | Worker N|    |   |
|   |    |         |      |         |      |         |    |   |
|   |    +---------+      +---------+      +---------+    |   |
|   |            Auto-scales based on workload            |   |
|   +-----------------------------------------------------+   |
+-------------------------------------------------------------+
# Simple Dataflow pipeline (Python)
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--project=my-project',
    '--region=us-central1',
    '--runner=DataflowRunner',
    '--temp_location=gs://bucket/temp'
])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://bucket/input.txt')
     | 'Transform' >> beam.Map(lambda x: x.upper())
     | 'Write' >> beam.io.WriteToText('gs://bucket/output'))
3. What are the differences between Dataproc and Dataflow?
| Aspect | Dataproc | Dataflow |
|---|---|---|
| Infrastructure | Managed clusters | Serverless |
| Frameworks | Spark, Hadoop, Flink, Presto | Apache Beam only |
| Scaling | Autoscaling (configurable) | Automatic |
| Use case | Complex/existing Spark workloads | New pipelines, streaming |
| State management | Manual (checkpointing) | Automatic |
| Pricing | Compute Engine VM cost + Dataproc fee per vCPU-hour, billed per second | Per vCPU-hour + memory GB-hour, billed per second |
| Lift and shift | Easier for existing code | Requires Beam rewrite |
| Cluster management | Required | None |
When to use Dataproc:
+-- Existing Spark/Hadoop jobs
+-- Complex ML with Spark MLlib
+-- Interactive analysis (Jupyter)
+-- Need fine-grained control
+-- Multiple frameworks in one cluster
+-- Presto/Trino for SQL queries
When to use Dataflow:
+-- New data pipelines
+-- Streaming data processing
+-- Unified batch/stream logic
+-- No infrastructure management desired
+-- Need exactly-once semantics
+-- Auto-scaling without configuration
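The two checklists above can be read as a rough scoring heuristic. The sketch below is only an illustration of that reading: the `Workload` fields and the `suggest_service` function are hypothetical names invented here, and a real decision would weigh cost, team skills, and existing code as well.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    has_existing_spark_or_hadoop_jobs: bool = False
    needs_interactive_notebooks: bool = False
    needs_multiple_frameworks: bool = False
    is_streaming: bool = False
    wants_no_cluster_management: bool = False
    shares_batch_and_stream_logic: bool = False


def suggest_service(w: Workload) -> str:
    """Return whichever checklist the workload matches more; ties give 'either'."""
    dataproc_score = sum([
        w.has_existing_spark_or_hadoop_jobs,
        w.needs_interactive_notebooks,
        w.needs_multiple_frameworks,
    ])
    dataflow_score = sum([
        w.is_streaming,
        w.wants_no_cluster_management,
        w.shares_batch_and_stream_logic,
    ])
    if dataproc_score > dataflow_score:
        return "Dataproc"
    if dataflow_score > dataproc_score:
        return "Dataflow"
    return "either"


print(suggest_service(Workload(has_existing_spark_or_hadoop_jobs=True)))
print(suggest_service(Workload(is_streaming=True, wants_no_cluster_management=True)))
```

The first call matches only the Dataproc checklist and the second only the Dataflow checklist, so they print `Dataproc` and `Dataflow` respectively.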
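# Same logic in both (count occurrences of each distinct line):
# Dataproc (PySpark)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
df = spark.read.text("gs://bucket/input.txt")  # one row per line, in column "value"
counts = df.groupBy("value").count()
counts.write.format("json").save("gs://bucket/output")

# Dataflow (Apache Beam)
import apache_beam as beam

# With no options this runs on the local DirectRunner; pass --runner=DataflowRunner
# (plus project, region, and temp_location) to run on Dataflow.
with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://bucket/input.txt')
     | beam.combiners.Count.PerElement()  # emits (line, count) pairs
     | beam.io.WriteToText('gs://bucket/output'))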
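Both snippets above count how often each distinct line occurs; despite the app name "wordcount", neither splits lines into words. A plain-Python sketch of that result can help when checking job output locally. The sample input and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_per_element(lines: Iterable[str]) -> Dict[str, int]:
    """Occurrences of each distinct line, the result both jobs above compute."""
    return dict(Counter(lines))


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    """A true word count: split each line on whitespace before counting."""
    return dict(Counter(word for line in lines for word in line.split()))


sample = ["to be", "or not", "to be"]  # hypothetical input lines
print(count_per_element(sample))  # {'to be': 2, 'or not': 1}
print(count_words(sample))        # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

To get a real word count in either framework, a split step (for example a flat-map over `line.split()`) would have to come before the counting step.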
4. What is Apache Beam?
Apache Beam is a unified programming model for batch and streaming data processing.
Apache Beam Concepts:
Pipeline:
+-- Complete data processing workflow
+-- Contains PCollections and PTransforms
+-- Runs on a runner (Dataflow, Spark, etc.)
PCollection:
+-- Distributed dataset
+-- Immutable
+-- Can be bounded (batch) or unbounded (stream)
+-- Elements can be any type
PTransform:
+-- Data transformation operation
+-- Takes PCollection, outputs PCollection
+-- Built-in: ParDo, Map, Filter, GroupByKey, etc.
+-- Composite transforms
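To make the three concepts concrete, here is a toy, in-memory model of the `|` and `>>` syntax. It is a teaching sketch only: the class names mirror Beam's, but this is not the Beam API, it is not distributed, and it handles bounded data only.

```python
from typing import Callable, Iterable, List


class PCollection:
    """Toy immutable, bounded collection (real PCollections are distributed)."""

    def __init__(self, elements: Iterable):
        self._elements = tuple(elements)

    def __or__(self, transform: "PTransform") -> "PCollection":
        # `pcoll | transform` applies the transform and returns a NEW PCollection.
        return transform.expand(self)

    def to_list(self) -> List:
        return list(self._elements)


class PTransform:
    """Toy transform: takes a PCollection, returns a PCollection."""

    label = ""

    def expand(self, pcoll: PCollection) -> PCollection:
        raise NotImplementedError

    def __rrshift__(self, label: str) -> "PTransform":
        # `'Label' >> transform` attaches a step name and returns the transform.
        self.label = label
        return self


class Map(PTransform):
    def __init__(self, fn: Callable):
        self.fn = fn

    def expand(self, pcoll: PCollection) -> PCollection:
        return PCollection(self.fn(x) for x in pcoll.to_list())


class Filter(PTransform):
    def __init__(self, predicate: Callable):
        self.predicate = predicate

    def expand(self, pcoll: PCollection) -> PCollection:
        return PCollection(x for x in pcoll.to_list() if self.predicate(x))


words = PCollection(["a", "bb", "ccc"])
result = (words
          | 'Upper' >> Map(str.upper)
          | 'KeepLong' >> Filter(lambda w: len(w) > 1))
print(result.to_list())  # ['BB', 'CCC']
print(words.to_list())   # ['a', 'bb', 'ccc'] (the input is unchanged)
```

Because `>>` binds more tightly than `|` in Python, each label attaches to its transform before the transform is applied, which is why the chained form reads left to right.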
# Beam Pipeline Example
import apache_beam as beam

class ParseEvent(beam.DoFn):
    """Parse one JSON line into a dict with the fields used downstream."""
    def process(self, element):
        import json  # imported here so it is available on remote workers
        event = json.loads(element)
        yield {
            'user_id': event['user_id'],
            'event_type': event['event_type'],
            'timestamp': event['timestamp']
        }

class FilterClicks(beam.DoFn):
    """Keep only click events."""
    def process(self, element):
        if element['event_type'] == 'click':
            yield element

with beam.Pipeline() as p:
    events = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://bucket/events.json')
        | 'Parse' >> beam.ParDo(ParseEvent())
        | 'Filter' >> beam.ParDo(FilterClicks())
        # Count.PerKey needs (key, value) pairs; keying by user_id assumes
        # the goal is the number of clicks per user.
        | 'KeyByUser' >> beam.Map(lambda e: (e['user_id'], e))
        | 'Count' >> beam.combiners.Count.PerKey()
        | 'Format' >> beam.Map(lambda x: f'{x[0]}: {x[1]}')
        | 'Write' >> beam.io.WriteToText('gs://bucket/output')
    )
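The pipeline above appears intended to count clicks per user. A plain-Python sketch of the same parse, filter, count, and format steps, run on hypothetical sample lines, shows the output shape to expect. `clicks_per_user` is an illustrative helper, not part of Beam.

```python
import json
from collections import Counter
from typing import Iterable, List


def clicks_per_user(lines: Iterable[str]) -> List[str]:
    """Mirror the pipeline steps on an in-memory list of JSON lines."""
    events = (json.loads(line) for line in lines)                   # Parse
    clicks = (e for e in events if e["event_type"] == "click")      # Filter
    counts = Counter(e["user_id"] for e in clicks)                  # Key by user, count
    return [f"{user}: {n}" for user, n in sorted(counts.items())]   # Format


# Hypothetical sample input, one JSON object per line.
sample_lines = [
    '{"user_id": "u1", "event_type": "click", "timestamp": 1}',
    '{"user_id": "u2", "event_type": "view", "timestamp": 2}',
    '{"user_id": "u1", "event_type": "click", "timestamp": 3}',
    '{"user_id": "u2", "event_type": "click", "timestamp": 4}',
]
print(clicks_per_user(sample_lines))  # ['u1: 2', 'u2: 1']
```

A distributed runner does not guarantee output order, so the sorting here is only for readable local output.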
Beam Runners:
+-- DirectRunner - Local testing
+-- DataflowRunner - Google Cloud Dataflow
+-- SparkRunner - Apache Spark
+-- FlinkRunner - Apache Flink
+-- SamzaRunner - Apache Samza
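Because the runner is chosen through pipeline options, the same pipeline code can move from local testing to Dataflow by changing arguments only. The helper below is a hypothetical sketch (`pipeline_args` is not a Beam function) that builds the flag list used earlier in this article; the result could be passed to `PipelineOptions(...)`.

```python
from typing import List, Optional


def pipeline_args(runner: str = "DirectRunner",
                  project: Optional[str] = None,
                  region: Optional[str] = None,
                  temp_location: Optional[str] = None) -> List[str]:
    """Build the command-line style flags for the chosen runner."""
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        required = [("project", project), ("region", region),
                    ("temp_location", temp_location)]
        missing = [name for name, value in required if not value]
        if missing:
            raise ValueError(f"DataflowRunner needs: {', '.join(missing)}")
        args += [f"--project={project}",
                 f"--region={region}",
                 f"--temp_location={temp_location}"]
    return args


# Local testing: no cloud settings needed.
print(pipeline_args())
# Dataflow: same pipeline code, different options.
print(pipeline_args("DataflowRunner", "my-project", "us-central1", "gs://bucket/temp"))
```

Failing early on missing Dataflow settings is a design choice of this sketch; it surfaces configuration mistakes before a job is submitted.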
5. How do you create a Dataproc cluster?
Cluster Creation Methods:
1. gcloud CLI
# On 2.1+ images the SQL engine component is TRINO (PRESTO applies to 2.0 and earlier).
# --metadata takes comma-separated KEY=VALUE pairs, so list pip packages space-separated and quote the value.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=500GB \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=500GB \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible \
    --image-version=2.1-debian11 \
    --optional-components=JUPYTER,TRINO \
    --enable-component-gateway \
    --initialization-actions=gs://bucket/init.sh \
    --metadata='PIP_PACKAGES=pandas numpy' \
    --properties=spark:spark.executor.memory=4g \
    --scopes=cloud-platform \
    --max-idle=1h
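When cluster creation is scripted, building the argument list in code avoids shell-quoting mistakes, especially around `--metadata`, which gcloud parses as comma-separated KEY=VALUE pairs. The sketch below uses only flags that appear in the command above; `build_create_command` is a hypothetical helper, and it prints the command rather than running it.

```python
import shlex
from typing import Dict, List


def build_create_command(cluster: str, region: str, num_workers: int,
                         metadata: Dict[str, str]) -> List[str]:
    """Return the argv list for `gcloud dataproc clusters create`."""
    for key, value in metadata.items():
        if "," in value:
            # A comma would be read as the start of another KEY=VALUE pair.
            raise ValueError(f"metadata value for {key!r} must not contain a comma")
    cmd = ["gcloud", "dataproc", "clusters", "create", cluster,
           f"--region={region}", f"--num-workers={num_workers}"]
    if metadata:
        pairs = ",".join(f"{k}={v}" for k, v in metadata.items())
        cmd.append(f"--metadata={pairs}")
    return cmd


cmd = build_create_command("my-cluster", "us-central1", 2,
                           {"PIP_PACKAGES": "pandas numpy"})
print(shlex.join(cmd))  # shell-safe rendering; pass `cmd` to subprocess.run to execute
```

Passing the list form to `subprocess.run` skips the shell entirely, so the space inside the metadata value needs no extra quoting.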
2. Terraform
resource "google_dataproc_cluster" "cluster" {
  name   = "my-cluster"
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      machine_type  = "n1-standard-4"
      disk_config {
        boot_disk_size_gb = 500
      }
    }

    worker_config {
      num_instances = 2
      machine_type  = "n1-standard-4"
      disk_config {
        boot_disk_size_gb = 500
      }
    }

    preemptible_worker_config {
      num_instances = 2
    }

    software_config {
      image_version = "2.1-debian11"
      # On 2.1+ images the SQL engine component is TRINO (PRESTO applies to 2.0 and earlier)
      optional_components = ["JUPYTER", "TRINO"]
      override_properties = {
        "spark:spark.executor.memory" = "4g"
      }
    }

    gce_cluster_config {
      subnetwork             = "default"
      service_account_scopes = ["cloud-platform"]
    }

    # Assumes a google_dataproc_autoscaling_policy resource named "asp" is defined elsewhere
    autoscaling_config {
      policy_uri = google_dataproc_autoscaling_policy.asp.name
    }
  }
}
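3. Submit job to ephemeral cluster
# The cluster must already exist. With --max-idle=1h (set in the gcloud create command above),
# it is deleted automatically after an hour without jobs.
# Point --jars at the connector JAR your job actually needs.
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery-connector.jar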
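An ephemeral-cluster workflow is usually three steps: create the cluster, submit the job, then delete the cluster even if the job fails. The sketch below shows that ordering with a pluggable `run` callable so it can be dry-run safely. `run_on_ephemeral_cluster` and `run_with_gcloud` are hypothetical helper names, and the cluster settings are reduced to a minimum for illustration.

```python
import subprocess
from typing import Callable, List

Command = List[str]


def run_on_ephemeral_cluster(cluster: str, region: str, job_file: str,
                             run: Callable[[Command], None]) -> None:
    """Create a cluster, run one PySpark job, and always delete the cluster."""
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}", "--num-workers=2"]
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_file,
              f"--cluster={cluster}", f"--region={region}"]
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              f"--region={region}", "--quiet"]
    run(create)
    try:
        run(submit)
    finally:
        run(delete)  # runs even when the job fails, so no idle cluster is left behind


def run_with_gcloud(cmd: Command) -> None:
    """Real executor: raises CalledProcessError if gcloud exits non-zero."""
    subprocess.run(cmd, check=True)


# Dry run: record the commands instead of executing them.
executed: List[Command] = []
run_on_ephemeral_cluster("my-cluster", "us-central1", "my_job.py", executed.append)
for command in executed:
    print(" ".join(command))
```

Swapping `executed.append` for `run_with_gcloud` would execute the commands for real. An orchestrator-managed workflow is a common alternative to hand-rolled scripts like this one.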