Top 20 GCP Data Engineer Interview Questions
- What is Google Cloud Platform for Data Engineering?
- What is BigQuery?
- What is Cloud Dataflow?
- What is Pub/Sub?
- What is Cloud Storage (GCS)?
- What is Dataproc?
- What is Cloud Composer?
- What is Dataplex?
- What is Data Catalog?
- How do you design a data lake on GCP?
- What are BigQuery best practices?
- How do you handle streaming data on GCP?
- What is BigQuery ML?
- How do you implement ETL on GCP?
- What is Dataform?
- How do you optimize costs on GCP?
- What is Cloud Data Fusion?
- How do you secure data on GCP?
- What are GCP data integration patterns?
- How do you monitor data pipelines on GCP?
1. What is Google Cloud Platform for Data Engineering?
Google Cloud Platform (GCP) provides a comprehensive suite of data engineering services for building scalable data pipelines.
GCP Data Engineering Services:
+-- Storage
| +-- Cloud Storage (GCS) - Object storage
| +-- BigQuery - Data warehouse
| +-- Cloud SQL - Managed relational DB
| +-- Cloud Spanner - Global relational DB
| +-- Bigtable - NoSQL wide-column
| +-- Firestore - Document DB
|
+-- Processing
| +-- Dataflow - Stream/batch processing
| +-- Dataproc - Managed Spark/Hadoop
| +-- Cloud Functions - Serverless compute
| +-- Cloud Run - Containerized apps
|
+-- Orchestration
| +-- Cloud Composer - Managed Airflow
| +-- Workflows - Serverless orchestration
| +-- Cloud Scheduler - Cron jobs
|
+-- Analytics
| +-- BigQuery - Analytics warehouse
| +-- Looker - BI platform
| +-- Data Studio - Dashboards
|
+-- Governance
  +-- Dataplex - Data fabric
  +-- Data Catalog - Metadata management
  +-- DLP API - Data protection
Typical Data Pipeline:
+----------+    +----------+    +----------+    +----------+
| Sources  |--->| Pub/Sub  |--->| Dataflow |--->| BigQuery |
| (Apps,   |    | (Ingest) |    | (Process)|    | (Analyze)|
|  IoT)    |    |          |    |          |    |          |
+----------+    +----------+    +----------+    +----------+
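The hand-off between the stages in the diagram can be mimicked locally in plain Python. This is only a sketch: the queue stands in for Pub/Sub, `process` for a Dataflow transform, and `analyze` for a BigQuery aggregation. All function names are hypothetical and no GCP service is called.

```python
import json
import queue


def ingest(events, topic):
    """Sources -> Pub/Sub stand-in: serialize each event and enqueue it."""
    for event in events:
        topic.put(json.dumps(event).encode("utf-8"))


def process(topic):
    """Dataflow stand-in: decode, keep purchases, project the needed fields."""
    rows = []
    while not topic.empty():
        event = json.loads(topic.get())
        if event.get("type") == "purchase":
            rows.append({"user_id": event["user_id"], "amount": event["amount"]})
    return rows


def analyze(rows):
    """BigQuery stand-in: total purchase amount per user."""
    totals = {}
    for row in rows:
        totals[row["user_id"]] = totals.get(row["user_id"], 0) + row["amount"]
    return totals


topic = queue.Queue()
ingest(
    [
        {"type": "purchase", "user_id": "u1", "amount": 10},
        {"type": "view", "user_id": "u1"},
        {"type": "purchase", "user_id": "u2", "amount": 5},
        {"type": "purchase", "user_id": "u1", "amount": 7},
    ],
    topic,
)
revenue_by_user = analyze(process(topic))
print(revenue_by_user)
```

In a real deployment each stage scales independently, which is the main reason for putting a message queue between the sources and the processing step.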
2. What is BigQuery?
BigQuery is a serverless, highly scalable, and cost-effective enterprise data warehouse with built-in ML capabilities.
BigQuery Features:
+-- Serverless - No infrastructure management
+-- Columnar storage - Optimized for analytics
+-- Petabyte scale - Handles massive datasets
+-- SQL interface - Standard SQL support
+-- Built-in ML - BigQuery ML
+-- Real-time analytics - Streaming inserts
+-- Separation of storage and compute
BigQuery Architecture:
+----------------------------------------------------+
|                      BigQuery                      |
+----------------------------------------------------+
| +------------------------------------------------+ |
| |            Dremel Execution Engine             | |
| |         (Distributed query processing)         | |
| +------------------------------------------------+ |
|                         |                          |
| +------------------------------------------------+ |
| |                Colossus Storage                | |
| |         (Distributed columnar storage)         | |
| +------------------------------------------------+ |
+----------------------------------------------------+
-- Query example
SELECT
  DATE(timestamp) AS date,
  COUNT(*) AS events,
  SUM(revenue) AS total_revenue
FROM `project.dataset.events`
WHERE timestamp >= '2024-01-01'
GROUP BY date
ORDER BY date;

-- Create table
CREATE TABLE `project.dataset.users` (
  user_id STRING,
  email STRING,
  created_at TIMESTAMP,
  metadata STRUCT<
    source STRING,
    campaign STRING
  >
)
PARTITION BY DATE(created_at)
CLUSTER BY user_id;
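The `PARTITION BY` clause above matters because a filter on the partitioning column lets BigQuery skip whole partitions instead of scanning the full table. The toy model below illustrates that idea in plain Python. It is not how BigQuery is implemented, and `PartitionedTable` and its methods are hypothetical names used only for this sketch.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy table that stores rows in one bucket per calendar day."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, created_on, row):
        self.partitions[created_on].append(row)

    def scan(self, start=None, end=None):
        """Return (rows, partitions_read) for days within [start, end]."""
        rows, partitions_read = [], 0
        for day, partition_rows in self.partitions.items():
            if start is not None and day < start:
                continue  # pruned: partition is entirely before the filter
            if end is not None and day > end:
                continue  # pruned: partition is entirely after the filter
            partitions_read += 1
            rows.extend(partition_rows)
        return rows, partitions_read


table = PartitionedTable()
for day in range(1, 31):
    table.insert(date(2024, 1, day), {"user_id": f"u{day}"})

_, full_scan = table.scan()
rows, pruned_scan = table.scan(start=date(2024, 1, 10), end=date(2024, 1, 12))
print(full_scan, pruned_scan, len(rows))
```

The unfiltered scan touches all 30 daily partitions, while the filtered one touches only the three days in range, which is the effect a `WHERE` clause on the partitioning column aims for.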
3. What is Cloud Dataflow?
Cloud Dataflow is a fully managed service for executing Apache Beam pipelines for both batch and stream processing.
Dataflow Features:
+-- Unified batch and streaming
+-- Auto-scaling
+-- Exactly-once processing
+-- Apache Beam SDK
+-- Templates for common patterns
+-- Integration with GCP services
# Python Dataflow Pipeline
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--project=my-project',
    '--region=us-central1',
    '--runner=DataflowRunner',
    '--temp_location=gs://my-bucket/temp',
    '--streaming',  # For streaming jobs
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/my-sub')
     # ReadFromPubSub yields bytes; json.loads accepts bytes directly
     | 'Parse JSON' >> beam.Map(json.loads)
     | 'Filter' >> beam.Filter(lambda x: x['type'] == 'purchase')
     | 'Extract fields' >> beam.Map(lambda x: {
         'user_id': x['user_id'],
         'amount': x['amount'],
         'timestamp': x['timestamp'],
     })
     | 'Window' >> beam.WindowInto(
         beam.window.FixedWindows(60))  # 1-minute windows
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         'project:dataset.table',
         schema='user_id:STRING,amount:FLOAT,timestamp:TIMESTAMP',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
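`FixedWindows(60)` in the pipeline above assigns each element to a non-overlapping 60-second interval based on its event timestamp. Windowing only changes results once a grouping or combining step follows it, so the sketch below adds a per-window sum to make the effect visible. It uses no Beam code: `assign_fixed_window` and `sum_per_window` are hypothetical helpers, and the sketch assumes a zero window offset and timestamps in whole seconds.

```python
from collections import defaultdict


def assign_fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains `timestamp`."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sum_per_window(events, size):
    """Sum amounts per window; `events` is an iterable of (timestamp, amount)."""
    totals = defaultdict(int)
    for timestamp, amount in events:
        totals[assign_fixed_window(timestamp, size)] += amount
    return dict(totals)


events = [(5, 10), (59, 2), (60, 1), (125, 4)]
window_totals = sum_per_window(events, size=60)
print(window_totals)
```

An element with timestamp 59 lands in the window starting at 0, while one with timestamp 60 opens the next window, because each window includes its start and excludes its end.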
4. What is Pub/Sub?
Pub/Sub is a fully managed, real-time messaging service for event-driven systems and streaming analytics.
Pub/Sub Concepts:
+-- Topic - Named resource for messages
+-- Subscription - Named resource for receiving
+-- Message - Data + attributes
+-- Publisher - Sends messages to topic
+-- Subscriber - Receives from subscription
Pub/Sub Architecture:
+-----------------------------------------------------+
|                     Publishers                      |
|         +-----+       +-----+       +-----+         |
|         |App 1|       |App 2|       |App 3|         |
|         +--+--+       +--+--+       +--+--+         |
|            +-------------+-------------+            |
|                          v                          |
|                      +-------+                      |
|                      | Topic |                      |
|                      +---+---+                      |
|            +-------------+-------------+            |
|            v             v             v            |
|         +-----+       +-----+       +-----+         |
|         |Sub A|       |Sub B|       |Sub C|         |
|         +--+--+       +--+--+       +--+--+         |
|            v             v             v            |
|         +-----+       +-----+       +-----+         |
|         |Svc 1|       |Svc 2|       |Svc 3|         |
|         +-----+       +-----+       +-----+         |
+-----------------------------------------------------+
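# Python Pub/Sub
import json
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

# Publisher
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')
data = json.dumps({'event': 'purchase', 'amount': 99.99})
# Extra keyword arguments are sent as string-valued message attributes
future = publisher.publish(topic_path, data.encode('utf-8'),
                           user_id='123', event_type='purchase')
print(f'Published message ID: {future.result()}')

# Subscriber
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'my-sub')

def callback(message):
    print(f'Received: {message.data}')
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
# subscribe() returns immediately; block the main thread so callbacks can run
with subscriber:
    try:
        streaming_pull_future.result(timeout=60)  # listen for 60 seconds
    except TimeoutError:
        streaming_pull_future.cancel()  # stop pulling messages
        streaming_pull_future.result()  # wait for the shutdown to complete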
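As the architecture diagram shows, every subscription attached to a topic receives its own copy of each published message. The in-memory model below sketches that fan-out behavior. `ToyTopic` is a hypothetical class, not part of the Pub/Sub client library, and it ignores delivery retries, ordering, and acknowledgement deadlines.

```python
from collections import deque


class ToyTopic:
    """In-memory topic that copies every message to each subscription."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        """Create an empty backlog for a new subscription."""
        self.subscriptions[name] = deque()

    def publish(self, data, **attributes):
        """Append the message to the backlog of every subscription."""
        message = {"data": data, "attributes": attributes}
        for backlog in self.subscriptions.values():
            backlog.append(message)

    def pull(self, name):
        """Return and remove the oldest message, or None if the backlog is empty."""
        backlog = self.subscriptions[name]
        return backlog.popleft() if backlog else None


topic = ToyTopic()
topic.subscribe("sub-a")
topic.subscribe("sub-b")
topic.publish(b'{"event": "purchase"}', event_type="purchase")

received_a = topic.pull("sub-a")
received_b = topic.pull("sub-b")
print(received_a["data"], received_b["data"])
print(topic.pull("sub-a"))
```

Because each subscription keeps its own backlog, consuming a message through one subscription does not remove it from the others, which is what lets several independent services react to the same event.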
5. What is Cloud Storage (GCS)?
Cloud Storage is a unified object storage service for developers and enterprises, designed for high availability and 99.999999999% (11 nines) annual durability.
Storage Classes:
+-- Standard - Frequently accessed data
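+-- Nearline - Accessed about once a month or less (30-day minimum storage duration)
+-- Coldline - Accessed about once a quarter or less (90-day minimum storage duration)
+-- Archive - Accessed less than once a year (365-day minimum storage duration)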
# Python GCS operations
from google.cloud import storage

client = storage.Client()

# Create bucket
bucket = client.create_bucket('my-bucket', location='US')

# Upload file
blob = bucket.blob('data/file.parquet')
blob.upload_from_filename('/local/file.parquet')

# Download file
blob.download_to_filename('/local/downloaded.parquet')

# List objects
blobs = client.list_blobs('my-bucket', prefix='data/')
for blob in blobs:
    print(blob.name)

# Lifecycle rules
bucket.lifecycle_rules = [{
    'action': {'type': 'SetStorageClass', 'storageClass': 'NEARLINE'},
    'condition': {'age': 30}
}, {
    'action': {'type': 'SetStorageClass', 'storageClass': 'COLDLINE'},
    'condition': {'age': 90}
}, {
    'action': {'type': 'Delete'},
    'condition': {'age': 365}
}]
bucket.patch()
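To see what the lifecycle rules above do to an object over time, the following sketch evaluates the same three rules against an object's age in days. It is a local approximation only: `applicable_action` is a hypothetical helper, and it assumes the precedence described in the Cloud Storage documentation as I understand it, where `Delete` wins over `SetStorageClass` and, among class changes, the colder (cheaper at rest) class wins.

```python
# Rank of each class from warmest to coldest; used to pick the colder target.
COLDNESS = {"STANDARD": 0, "NEARLINE": 1, "COLDLINE": 2, "ARCHIVE": 3}

RULES = [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"}, "condition": {"age": 365}},
]


def applicable_action(age_days, rules=RULES):
    """Return the action that applies to an object of the given age, or None."""
    matched = [r["action"] for r in rules if age_days >= r["condition"]["age"]]
    if not matched:
        return None
    for action in matched:
        if action["type"] == "Delete":
            return action  # assumed: deletion takes precedence
    # assumed: among storage class changes, the coldest target class wins
    return max(matched, key=lambda a: COLDNESS[a["storageClass"]])


for age in (10, 45, 120, 400):
    print(age, applicable_action(age))
```

With these rules an object stays in its original class for the first 30 days, moves to Nearline, then to Coldline at 90 days, and is deleted once it is a year old.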
GCS URL formats:
- gs://bucket-name/object-path
- https://storage.googleapis.com/bucket-name/object-path
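The two URL formats refer to the same object, so one can be derived from the other. A minimal sketch of that conversion follows; `gs_to_https` is a hypothetical helper, not part of the `google-cloud-storage` library, and it assumes the object path should be percent-encoded while keeping `/` separators intact.

```python
from urllib.parse import quote


def gs_to_https(gs_url):
    """Convert gs://bucket/object into its storage.googleapis.com URL."""
    prefix = "gs://"
    if not gs_url.startswith(prefix):
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    bucket, _, object_path = gs_url[len(prefix):].partition("/")
    if not bucket or not object_path:
        raise ValueError(f"expected gs://bucket/object, got {gs_url!r}")
    # quote() leaves "/" unescaped by default, so nested paths stay readable
    return f"https://storage.googleapis.com/{bucket}/{quote(object_path)}"


print(gs_to_https("gs://bucket-name/object-path"))
print(gs_to_https("gs://my-bucket/data/my file.parquet"))
```

The `gs://` form is what tools such as `gsutil`, Dataflow, and BigQuery load jobs expect, while the HTTPS form is used for direct HTTP access to the object.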