Search Tutorials


Top GCP Data Engineer Interview Questions (2026) | JavaInUse

Top 20 GCP Data Engineer Interview Questions


  1. What is Google Cloud Platform for Data Engineering?
  2. What is BigQuery?
  3. What is Cloud Dataflow?
  4. What is Pub/Sub?
  5. What is Cloud Storage (GCS)?
  6. What is Dataproc?
  7. What is Cloud Composer?
  8. What is Dataplex?
  9. What is Data Catalog?
  10. How do you design a data lake on GCP?
  11. What are BigQuery best practices?
  12. How do you handle streaming data on GCP?
  13. What is BigQuery ML?
  14. How do you implement ETL on GCP?
  15. What is Dataform?
  16. How do you optimize costs on GCP?
  17. What is Cloud Data Fusion?
  18. How do you secure data on GCP?
  19. What are GCP data integration patterns?
  20. How do you monitor data pipelines on GCP?

Google Cloud Interview Questions

1. What is Google Cloud Platform for Data Engineering?

Google Cloud Platform (GCP) provides a comprehensive suite of data engineering services for building scalable data pipelines.

GCP Data Engineering Services:
+-- Storage
|   +-- Cloud Storage (GCS) - Object storage
|   +-- BigQuery - Data warehouse
|   +-- Cloud SQL - Managed relational DB
|   +-- Cloud Spanner - Global relational DB
|   +-- Bigtable - NoSQL wide-column
|   +-- Firestore - Document DB
|
+-- Processing
|   +-- Dataflow - Stream/batch processing
|   +-- Dataproc - Managed Spark/Hadoop
|   +-- Cloud Functions - Serverless compute
|   +-- Cloud Run - Containerized apps
|
+-- Orchestration
|   +-- Cloud Composer - Managed Airflow
|   +-- Workflows - Serverless orchestration
|   +-- Cloud Scheduler - Cron jobs
|
+-- Analytics
|   +-- BigQuery - Analytics warehouse
|   +-- Looker - BI platform
|   +-- Data Studio - Dashboards
|
+-- Governance
    +-- Dataplex - Data fabric
    +-- Data Catalog - Metadata management
    +-- DLP API - Data protection

Typical Data Pipeline:
+----------+    +----------+    +----------+    +----------+
|  Sources |--->|  Pub/Sub |--->| Dataflow |--->| BigQuery |
|  (Apps,  |    | (Ingest) |    | (Process)|    | (Analyze)|
|  IoT)    |    |          |    |          |    |          |
+----------+    +----------+    +----------+    +----------+

2. What is BigQuery?

BigQuery is a serverless, highly scalable, and cost-effective enterprise data warehouse with built-in ML capabilities.

BigQuery Features:
+-- Serverless - No infrastructure management
+-- Columnar storage - Optimized for analytics
+-- Petabyte scale - Handles massive datasets
+-- SQL interface - Standard SQL support
+-- Built-in ML - BigQuery ML
+-- Real-time analytics - Streaming inserts
+-- Separation of storage and compute

BigQuery Architecture:
+-----------------------------------------------------+
|                    BigQuery                         |
+-----------------------------------------------------+
|  +------------------------------------------------+ |
|  |              Dremel Execution Engine           | |
|  |  (Distributed query processing)                | |
|  +------------------------------------------------+ |
|                        |                            |
|  +------------------------------------------------+ |
|  |              Colossus Storage                  | |
|  |  (Distributed columnar storage)                | |
|  +------------------------------------------------+ |
+-----------------------------------------------------+

# Query example
SELECT
  DATE(timestamp) as date,
  COUNT(*) as events,
  SUM(revenue) as total_revenue
FROM `project.dataset.events`
WHERE timestamp >= '2024-01-01'
GROUP BY date
ORDER BY date;

# Create table
CREATE TABLE `project.dataset.users` (
  user_id STRING,
  email STRING,
  created_at TIMESTAMP,
  metadata STRUCT<
    source STRING,
    campaign STRING
  >
)
PARTITION BY DATE(created_at)
CLUSTER BY user_id;

3. What is Cloud Dataflow?

Cloud Dataflow is a fully managed service for executing Apache Beam pipelines for both batch and stream processing.

Dataflow Features:
+-- Unified batch and streaming
+-- Auto-scaling
+-- Exactly-once processing
+-- Apache Beam SDK
+-- Templates for common patterns
+-- Integration with GCP services

# Python Dataflow Pipeline
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--project=my-project',
    '--region=us-central1',
    '--runner=DataflowRunner',
    '--temp_location=gs://my-bucket/temp',
    '--streaming'  # For streaming jobs
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/my-sub')
     | 'Parse JSON' >> beam.Map(lambda x: json.loads(x))
     | 'Filter' >> beam.Filter(lambda x: x['type'] == 'purchase')
     | 'Extract fields' >> beam.Map(lambda x: {
         'user_id': x['user_id'],
         'amount': x['amount'],
         'timestamp': x['timestamp']
     })
     | 'Window' >> beam.WindowInto(
         beam.window.FixedWindows(60))  # 1-minute windows
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         'project:dataset.table',
         schema='user_id:STRING,amount:FLOAT,timestamp:TIMESTAMP',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

4. What is Pub/Sub?

Pub/Sub is a fully managed, real-time messaging service for event-driven systems and streaming analytics.

Pub/Sub Concepts:
+-- Topic - Named resource for messages
+-- Subscription - Named resource for receiving
+-- Message - Data + attributes
+-- Publisher - Sends messages to topic
+-- Subscriber - Receives from subscription

Pub/Sub Architecture:
+-----------------------------------------------------+
|                    Publishers                       |
|  +-----+  +-----+  +-----+                          |
|  |App 1|  |App 2|  |App 3|                          |
|  +--+--+  +--+--+  +--+--+                          |
|     +-------+--------+                              |
|             v                                       |
|         +--------+                                  |
|         | Topic  |                                  |
|         +---+----+                                  |
|     +-------+--------+                              |
|     v       v        v                              |
|  +-----+  +-----+  +-----+                          |
|  |Sub A|  |Sub B|  |Sub C|                          |
|  +--+--+  +--+--+  +--+--+                          |
|     v       v        v                              |
|  +-----+  +-----+  +-----+                          |
|  |Svc 1|  |Svc 2|  |Svc 3|                          |
|  +-----+  +-----+  +-----+                          |
+-----------------------------------------------------+

# Python Pub/Sub
from google.cloud import pubsub_v1

# Publisher
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

data = json.dumps({'event': 'purchase', 'amount': 99.99})
future = publisher.publish(topic_path, data.encode('utf-8'),
                          user_id='123', event_type='purchase')
print(f'Published message ID: {future.result()}')

# Subscriber
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'my-sub')

def callback(message):
    print(f'Received: {message.data}')
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

5. What is Cloud Storage (GCS)?

Cloud Storage is a unified object storage for developers and enterprises with high durability and availability.

Storage Classes:
+-- Standard - Frequently accessed data
+-- Nearline - Once per month access
+-- Coldline - Once per quarter access
+-- Archive - Once per year access

# Python GCS operations
from google.cloud import storage

client = storage.Client()

# Create bucket
bucket = client.create_bucket('my-bucket', location='US')

# Upload file
blob = bucket.blob('data/file.parquet')
blob.upload_from_filename('/local/file.parquet')

# Download file
blob.download_to_filename('/local/downloaded.parquet')

# List objects
blobs = client.list_blobs('my-bucket', prefix='data/')
for blob in blobs:
    print(blob.name)

# Lifecycle rules
bucket.lifecycle_rules = [{
    'action': {'type': 'SetStorageClass', 'storageClass': 'NEARLINE'},
    'condition': {'age': 30}
}, {
    'action': {'type': 'SetStorageClass', 'storageClass': 'COLDLINE'},
    'condition': {'age': 90}
}, {
    'action': {'type': 'Delete'},
    'condition': {'age': 365}
}]
bucket.patch()

GCS URL formats:
- gs://bucket-name/object-path
- https://storage.googleapis.com/bucket-name/object-path





6. What is Dataproc?

Dataproc is a fully managed service for running Apache Spark and Hadoop clusters.

Dataproc Features:
+-- Managed Spark/Hadoop clusters
+-- Fast cluster creation (90 seconds)
+-- Autoscaling
+-- Preemptible VMs for cost savings
+-- Integration with GCS, BigQuery
+-- Serverless option available

# Create cluster with gcloud
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2 \
    --image-version=2.0-debian10 \
    --optional-components=JUPYTER \
    --enable-component-gateway

# Submit PySpark job
gcloud dataproc jobs submit pyspark \
    --cluster=my-cluster \
    --region=us-central1 \
    gs://my-bucket/scripts/etl_job.py \
    -- --input=gs://my-bucket/input/ \
       --output=gs://my-bucket/output/

# PySpark job example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ETL').getOrCreate()

# Read from GCS
df = spark.read.parquet('gs://my-bucket/raw/events/')

# Transform
result = df.filter(df.event_type == 'purchase') \
    .groupBy('user_id') \
    .agg({'amount': 'sum', 'event_id': 'count'})

# Write to BigQuery
result.write.format('bigquery') \
    .option('table', 'project.dataset.user_purchases') \
    .mode('overwrite') \
    .save()

# Dataproc Serverless
gcloud dataproc batches submit pyspark \
    gs://my-bucket/scripts/etl_job.py \
    --region=us-central1 \
    --deps-bucket=gs://my-bucket/deps

7. What is Cloud Composer?

Cloud Composer is a fully managed Apache Airflow service for workflow orchestration.

Cloud Composer Features:
+-- Managed Apache Airflow
+-- GCP service integration
+-- Web UI for monitoring
+-- Python-based DAGs
+-- Composer 2 with improved scaling

# Airflow DAG for GCP
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator, BigQueryCheckOperator
)
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartFlexTemplateOperator
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator
)
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG('daily_etl', default_args=default_args, schedule_interval='@daily') as dag:
    
    # Load raw data to BigQuery
    load_raw = GCSToBigQueryOperator(
        task_id='load_raw_data',
        bucket='raw-data-bucket',
        source_objects=['events/{{ ds }}/*.json'],
        destination_project_dataset_table='project.raw.events',
        source_format='NEWLINE_DELIMITED_JSON',
        write_disposition='WRITE_APPEND'
    )
    
    # Transform with BigQuery
    transform = BigQueryInsertJobOperator(
        task_id='transform_data',
        configuration={
            'query': {
                'query': '''
                    INSERT INTO `project.curated.daily_summary`
                    SELECT DATE(timestamp) as date, COUNT(*) as events
                    FROM `project.raw.events`
                    WHERE DATE(timestamp) = '{{ ds }}'
                    GROUP BY date
                ''',
                'useLegacySql': False
            }
        }
    )
    
    # Data quality check
    quality_check = BigQueryCheckOperator(
        task_id='quality_check',
        sql='SELECT COUNT(*) > 0 FROM `project.curated.daily_summary` WHERE date = "{{ ds }}"',
        use_legacy_sql=False
    )
    
    load_raw >> transform >> quality_check

8. What is Dataplex?

Dataplex is a data fabric that unifies distributed data and automates data management and governance.

Dataplex Concepts:
+-- Lake - Logical grouping of data
+-- Zone - Subdivision (raw, curated)
+-- Asset - Actual data (GCS, BigQuery)
+-- Tasks - Automated data processing
+-- Data Quality - Automated validation

Dataplex Architecture:
+-----------------------------------------------------+
|                    Dataplex Lake                    |
+-----------------------------------------------------+
|  +-----------------+    +-----------------+         |
|  |    Raw Zone     |    |  Curated Zone   |         |
|  |                 |    |                 |         |
|  |  +-----------+  |    |  +-----------+  |         |
|  |  | GCS Asset |  |--->|  | BQ Asset  |  |         |
|  |  |(raw files)|  |    |  | (tables)  |  |         |
|  |  +-----------+  |    |  +-----------+  |         |
|  +-----------------+    +-----------------+         |
|                                                     |
|  +---------------------------------------------+    |
|  |            Data Quality Rules               |    |
|  |  - Completeness, Uniqueness, Validity       |    |
|  +---------------------------------------------+    |
+-----------------------------------------------------+

# Create lake with gcloud
gcloud dataplex lakes create my-lake \
    --location=us-central1 \
    --display-name="Analytics Lake"

# Create zone
gcloud dataplex zones create raw-zone \
    --lake=my-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION

# Add asset
gcloud dataplex assets create raw-events \
    --lake=my-lake \
    --zone=raw-zone \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/raw-events-bucket \
    --discovery-enabled

9. What is Data Catalog?

Data Catalog is a fully managed metadata management service for discovering and managing data.

Data Catalog Features:
+-- Automatic metadata discovery
+-- Custom tags and templates
+-- Search across projects
+-- Policy tags for column-level security
+-- Integration with BigQuery, GCS, Pub/Sub

# Create tag template
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

template = datacatalog_v1.TagTemplate()
template.display_name = "Data Quality Metadata"
template.fields["owner"] = datacatalog_v1.TagTemplateField(
    display_name="Data Owner",
    type_=datacatalog_v1.FieldType(primitive_type="STRING")
)
template.fields["quality_score"] = datacatalog_v1.TagTemplateField(
    display_name="Quality Score",
    type_=datacatalog_v1.FieldType(primitive_type="DOUBLE")
)
template.fields["pii"] = datacatalog_v1.TagTemplateField(
    display_name="Contains PII",
    type_=datacatalog_v1.FieldType(primitive_type="BOOL")
)

created_template = client.create_tag_template(
    parent="projects/my-project/locations/us-central1",
    tag_template_id="data_quality",
    tag_template=template
)

# Search catalog
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")

results = client.search_catalog(
    scope=scope,
    query="tag:data_quality.pii=true"
)

for result in results:
    print(f"Found: {result.linked_resource}")

10. How do you design a data lake on GCP?

GCP Data Lake Architecture:
+-------------------------------------------------------------+
|                      Data Sources                           |
|  +-----+ +-----+ +-----+ +-----+ +-----+                    |
|  | API | | IoT | | DB  | |Files| |Stream|                   |
|  +--+--+ +--+--+ +--+--+ +--+--+ +--+--+                    |
+----+-------+-------+-------+-------+------------------------+
     |       |       |       |       |
     v       v       v       v       v
+-------------------------------------------------------------+
|                    Ingestion Layer                          |
|  +--------------+  +--------------+  +--------------+       |
|  |   Pub/Sub    |  |  Dataflow    |  |  Transfer    |       |
|  |  (Streaming) |  |  (ETL)       |  |  Service     |       |
|  +--------------+  +--------------+  +--------------+       |
+-------------------------------------------------------------+
                         |
                         v
+-------------------------------------------------------------+
|                    Storage Layer                            |
|  +------------------------------------------------------+   |
|  |                 Cloud Storage (GCS)                  |   |
|  |  +------------+ +------------+ +------------+        |   |
|  |  | Raw Zone   | | Processed  | | Curated    |        |   |
|  |  | (Landing)  | | Zone       | | Zone       |        |   |
|  |  +------------+ +------------+ +------------+        |   |
|  +------------------------------------------------------+   |
|                         |                                   |
|  +------------------------------------------------------+   |
|  |                    BigQuery                          |   |
|  |  (Analytics tables, ML models, BI views)             |   |
|  +------------------------------------------------------+   |
+-------------------------------------------------------------+

Zone Design:
+-- Raw Zone (gs://lake/raw/)
|   +-- Format: JSON, CSV, Avro (as received)
|   +-- Retention: Full history
|   +-- Access: Data engineers only
|
+-- Processed Zone (gs://lake/processed/)
|   +-- Format: Parquet (optimized)
|   +-- Partitioned by date
|   +-- Access: Data engineers, analysts
|
+-- Curated Zone (BigQuery)
    +-- Format: BigQuery tables
    +-- Business-ready datasets
    +-- Access: All analytics users

# Bucket organization
gs://my-lake/
+-- raw/
|   +-- source_name/
|       +-- year=2024/month=01/day=15/
|           +-- data.json
+-- processed/
|   +-- domain/
|       +-- entity/
|           +-- year=2024/month=01/
|               +-- data.parquet
+-- curated/
    +-- External tables pointing to processed data

11. What are BigQuery best practices?

Performance Best Practices:

1. Partitioning
-- Partition by date for time-series data
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM source_table;

-- Partition pruning in queries
SELECT * FROM events
WHERE DATE(event_timestamp) = '2024-01-15'  -- Scans one partition

2. Clustering
-- Cluster frequently filtered columns
CREATE TABLE orders
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_category;

3. Avoid SELECT *
-- Bad
SELECT * FROM large_table;

-- Good
SELECT user_id, event_type, timestamp FROM large_table;

4. Use approximate functions
-- Exact (slow)
SELECT COUNT(DISTINCT user_id) FROM events;

-- Approximate (fast)
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

5. Materialized views
CREATE MATERIALIZED VIEW `project.dataset.daily_stats`
PARTITION BY date
CLUSTER BY category
AS
SELECT
  DATE(timestamp) as date,
  category,
  COUNT(*) as events,
  SUM(amount) as total
FROM `project.dataset.events`
GROUP BY date, category;

Cost Optimization:
+-- Use flat-rate pricing for predictable workloads
+-- Set custom quotas per project/user
+-- Use BI Engine for dashboards
+-- Expire partitions with partition_expiration_days
+-- Preview queries before running (dry run)

12. How do you handle streaming data on GCP?

Streaming Architecture:
+-------------------------------------------------------------+
|  Sources                                                    |
|  +-----+ +-----+ +-----+                                    |
|  | IoT | | Apps| |Kafka|                                    |
|  +--+--+ +--+--+ +--+--+                                    |
|     +-------+-------+                                       |
|             v                                               |
|  +-------------------------------------------------------+  |
|  |                    Pub/Sub                            |  |
|  |  (Message buffering, fan-out)                         |  |
|  +-------------------------------------------------------+  |
|             |                                               |
|             v                                               |
|  +-------------------------------------------------------+  |
|  |                   Dataflow                            |  |
|  |  (Windowing, aggregation, enrichment)                 |  |
|  +-------------------------------------------------------+  |
|             |                                               |
|     +-------+-------+                                       |
|     v       v       v                                       |
|  BigQuery  GCS   Bigtable                                   |
|  (Analytics)(Archive)(Real-time)                            |
+-------------------------------------------------------------+

# Dataflow streaming pipeline
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline(options=options) as p:
    events = (p
        | 'Read' >> beam.io.ReadFromPubSub(subscription=sub)
        | 'Parse' >> beam.Map(json.loads)
        | 'AddTimestamp' >> beam.Map(lambda x: beam.window.TimestampedValue(
            x, x['event_time']))
    )
    
    # Real-time aggregation (1-minute windows)
    windowed = (events
        | 'Window' >> beam.WindowInto(window.FixedWindows(60))
        | 'Key' >> beam.Map(lambda x: (x['category'], x['amount']))
        | 'Sum' >> beam.CombinePerKey(sum)
    )
    
    # Write to BigQuery (streaming)
    windowed | 'ToBQ' >> beam.io.WriteToBigQuery(
        'project:dataset.realtime_metrics',
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS
    )
    
    # Archive to GCS
    events | 'ToGCS' >> beam.io.WriteToText(
        'gs://archive/events',
        file_name_suffix='.json'
    )

13. What is BigQuery ML?

BigQuery ML enables creating and executing machine learning models using SQL.

BigQuery ML Model Types:
+-- Linear regression
+-- Logistic regression
+-- K-means clustering
+-- Matrix factorization
+-- Time series (ARIMA_PLUS)
+-- Deep neural networks
+-- XGBoost
+-- AutoML Tables
+-- Imported TensorFlow models

# Create classification model
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['churned'],
  auto_class_weights=TRUE
) AS
SELECT
  user_id,
  days_since_last_purchase,
  total_orders,
  avg_order_value,
  support_tickets,
  churned
FROM `project.dataset.user_features`
WHERE _PARTITIONDATE BETWEEN '2023-01-01' AND '2023-12-31';

# Evaluate model
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.churn_model`);

# Make predictions
SELECT
  user_id,
  predicted_churned,
  predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL `project.dataset.churn_model`,
  (SELECT * FROM `project.dataset.user_features`
   WHERE _PARTITIONDATE = CURRENT_DATE()))
WHERE predicted_churned_probs[OFFSET(1)].prob > 0.7;

# Time series forecasting
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='sales',
  time_series_id_col='product_id',
  horizon=30
) AS
SELECT date, product_id, sales
FROM `project.dataset.daily_sales`;

# Forecast
SELECT * FROM ML.FORECAST(MODEL `project.dataset.sales_forecast`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level));

14. How do you implement ETL on GCP?

ETL Options on GCP:
+-- Dataflow - Complex transformations, streaming
+-- Dataproc - Spark-based ETL
+-- BigQuery - SQL-based transformations
+-- Cloud Data Fusion - Visual ETL
+-- Dataform - SQL-based data modeling

# BigQuery SQL-based ETL
-- Incremental load pattern
MERGE `project.curated.customers` AS target
USING (
  SELECT * FROM `project.staging.customers_daily`
  WHERE load_date = CURRENT_DATE()
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    email = source.email,
    updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, created_at, updated_at)
  VALUES (source.customer_id, source.email, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP());

# Scheduled query for daily ETL
bq query --schedule='every 24 hours' \
  --display_name='Daily Customer ETL' \
  --destination_table='project.curated.daily_summary' \
  --replace=true \
  'SELECT DATE(timestamp) as date, COUNT(*) as events 
   FROM project.raw.events 
   WHERE DATE(timestamp) = CURRENT_DATE() - 1 
   GROUP BY date'

# Dataflow template for ETL
gcloud dataflow jobs run my-etl-job \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
    javascriptTextTransformGcsPath=gs://my-bucket/transform.js,\
    JSONPath=gs://my-bucket/schema.json,\
    inputFilePattern=gs://my-bucket/input/*.json,\
    outputTable=project:dataset.table,\
    bigQueryLoadingTemporaryDirectory=gs://my-bucket/temp

15. What is Dataform?

Dataform is a service for managing SQL-based data transformation workflows using a Git-like development experience.

Dataform Features:
+-- SQL-based transformations
+-- Dependency management
+-- Version control integration
+-- Data assertions/testing
+-- Documentation generation
+-- Incremental processing

# Dataform project structure
dataform/
+-- definitions/
|   +-- sources/
|   |   +-- raw_events.sqlx
|   +-- staging/
|   |   +-- stg_events.sqlx
|   +-- marts/
|       +-- dim_customers.sqlx
+-- includes/
|   +-- constants.js
+-- dataform.json

# Source declaration (raw_events.sqlx)
config {
  type: "declaration",
  schema: "raw",
  name: "events"
}

# Staging transformation (stg_events.sqlx)
config {
  type: "view",
  schema: "staging",
  description: "Cleaned and deduplicated events"
}

SELECT DISTINCT
  event_id,
  LOWER(TRIM(user_id)) AS user_id,
  PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', event_time) AS event_timestamp,
  event_type,
  JSON_EXTRACT_SCALAR(properties, '$.page') AS page
FROM ${ref("raw", "events")}
WHERE event_id IS NOT NULL

# Mart table with incremental (dim_customers.sqlx)
config {
  type: "incremental",
  schema: "marts",
  uniqueKey: ["customer_id"],
  bigquery: {
    partitionBy: "DATE(updated_at)",
    clusterBy: ["customer_id"]
  }
}

SELECT
  customer_id,
  first_name,
  last_name,
  email,
  CURRENT_TIMESTAMP() AS updated_at
FROM ${ref("staging", "stg_customers")}
${when(incremental(), `WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})`)}





16. How do you optimize costs on GCP?

Cost Optimization Strategies:

1. BigQuery
+-- Use flat-rate for predictable workloads
+-- Partition and cluster tables
+-- Set table/partition expiration
+-- Use BI Engine for dashboards
+-- Avoid SELECT *, use column selection
+-- Use preview/dry-run before executing

# Set partition expiration
ALTER TABLE `project.dataset.logs`
SET OPTIONS (
  partition_expiration_days=90
);

2. Cloud Storage
+-- Use appropriate storage class
+-- Set lifecycle rules
+-- Enable Object Lifecycle Management
+-- Use Autoclass for variable access patterns

3. Dataflow
+-- Use autoscaling
+-- Use Dataflow Prime for better resource management
+-- Use FlexRS for cost-effective batch jobs
+-- Choose appropriate machine types

gcloud dataflow jobs run my-job \
    --flexrs-goal=COST_OPTIMIZED \
    --max-workers=10

4. Dataproc
+-- Use preemptible VMs (up to 80% savings)
+-- Autoscaling policies
+-- Use Dataproc Serverless
+-- Right-size clusters

# Preemptible secondary workers
gcloud dataproc clusters create my-cluster \
    --num-workers=2 \
    --num-secondary-workers=10 \
    --secondary-worker-type=preemptible

5. Compute
+-- Committed use discounts (1-3 year)
+-- Spot VMs for fault-tolerant workloads
+-- Right-sizing recommendations
+-- Scheduled scaling

17. What is Cloud Data Fusion?

Cloud Data Fusion is a fully managed, visual data integration service for building and managing ETL pipelines.

Data Fusion Features:
+-- Visual pipeline builder
+-- 150+ pre-built connectors
+-- Code-free transformations
+-- Built on CDAP (open source)
+-- Wrangler for data prep
+-- Reusable pipeline templates

Data Fusion Architecture:
+-------------------------------------------------------------+
|                    Cloud Data Fusion                        |
+-------------------------------------------------------------+
|  +-------------------------------------------------------+  |
|  |              Pipeline Studio (UI)                     |  |
|  |  Source -> Transform -> Transform -> Sink             |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|  +-------------------------------------------------------+  |
|  |              Execution (Dataproc)                     |  |
|  |  Spark jobs run on ephemeral clusters                 |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+

Pipeline Components:
+-- Sources: GCS, BigQuery, JDBC, Kafka, Salesforce
+-- Transforms: Filter, Join, Aggregate, JavaScript
+-- Sinks: BigQuery, GCS, Bigtable, Spanner

# Deploy pipeline via API
from google.cloud import datafusion_v1

client = datafusion_v1.DataFusionClient()

# Get instance
instance = client.get_instance(
    name='projects/my-project/locations/us-central1/instances/my-instance'
)

# Create pipeline using REST API to CDAP
import requests
cdap_endpoint = instance.api_endpoint
pipeline_json = {...}  # Pipeline configuration

response = requests.put(
    f'{cdap_endpoint}/v3/namespaces/default/apps/my-pipeline',
    headers={'Authorization': f'Bearer {token}'},
    json=pipeline_json
)

18. How do you secure data on GCP?

GCP Data Security Layers:

1. Identity & Access Management (IAM)
# Grant BigQuery access
gcloud projects add-iam-policy-binding my-project \
    --member="user:analyst@company.com" \
    --role="roles/bigquery.dataViewer"

# Custom role
gcloud iam roles create DataAnalyst \
    --project=my-project \
    --permissions=bigquery.tables.getData,bigquery.jobs.create

2. Column-Level Security (BigQuery)
-- Create policy tag
-- In Data Catalog, create taxonomy and policy tags

-- Apply to column
ALTER TABLE `project.dataset.customers`
ALTER COLUMN ssn
SET OPTIONS (policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/456']);

3. Row-Level Security (BigQuery)
CREATE ROW ACCESS POLICY region_filter
ON `project.dataset.sales`
GRANT TO ("user:analyst@company.com")
FILTER USING (region = 'US');

4. VPC Service Controls
# Create perimeter
gcloud access-context-manager perimeters create my-perimeter \
    --title="Data Perimeter" \
    --resources=projects/123456789 \
    --restricted-services=bigquery.googleapis.com,storage.googleapis.com

5. Encryption
+-- Default: Google-managed keys
+-- CMEK: Customer-managed encryption keys
+-- CSEK: Customer-supplied encryption keys

# BigQuery with CMEK
CREATE TABLE `project.dataset.sensitive_data`
OPTIONS(
  kms_key_name='projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key'
) AS SELECT * FROM source;

6. Data Loss Prevention (DLP)
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Inspect content for PII
response = dlp.inspect_content(
    parent=f'projects/my-project',
    inspect_config={
        'info_types': [
            {'name': 'EMAIL_ADDRESS'},
            {'name': 'PHONE_NUMBER'},
            {'name': 'CREDIT_CARD_NUMBER'}
        ]
    },
    item={'value': text_to_inspect}
)

19. What are GCP data integration patterns?

Common Integration Patterns:

1. Batch Ingestion
Sources --> Transfer Service/Dataflow --> GCS --> BigQuery

# Storage Transfer Service
gcloud transfer jobs create \
    --source-agent-pool=my-pool \
    --source-directory=/data \
    --destination-bucket=my-bucket \
    --schedule-repeats-every=1d

2. Streaming Ingestion
Sources --> Pub/Sub --> Dataflow --> BigQuery/Bigtable

3. CDC (Change Data Capture)
Database --> Datastream --> BigQuery

# Create Datastream stream
gcloud datastream streams create my-stream \
    --location=us-central1 \
    --source=my-mysql-source \
    --destination=my-bigquery-dest \
    --backfill-all

4. API Integration
Cloud Functions/Cloud Run --> Process --> BigQuery

# Cloud Function for API ingestion
def ingest_api_data(request):
    response = requests.get('https://api.example.com/data')
    data = response.json()
    
    client = bigquery.Client()
    errors = client.insert_rows_json('project.dataset.table', data)
    
    return 'Success' if not errors else f'Errors: {errors}'

5. Cross-Cloud Integration
AWS/Azure --> Transfer Service --> GCS --> BigQuery

6. Hybrid Integration
On-prem --> Transfer Appliance --> GCS
On-prem DB --> Datastream --> BigQuery

Integration Architecture:
+-------------------------------------------------------------+
|  On-Premises          |           Google Cloud              |
|  +---------+          |    +-----------------------------+  |
|  | Database|--Datastream-->| BigQuery (CDC)              |  |
|  +---------+          |    +-----------------------------+  |
|  +---------+          |    +-----------------------------+  |
|  | Files   |--Transfer--->| Cloud Storage (Batch)        |  |
|  +---------+  Service |    +-----------------------------+  |
|  +---------+          |    +-----------------------------+  |
|  | Kafka   |--Pub/Sub---->| Dataflow (Streaming)         |  |
|  +---------+  Connector|    +-----------------------------+  |
+-------------------------------------------------------------+

20. How do you monitor data pipelines on GCP?

Monitoring Stack:

1. Cloud Monitoring
+-- Metrics from all GCP services
+-- Custom metrics
+-- Dashboards
+-- Alerting policies

# Create alert policy
gcloud monitoring policies create \
    --policy-from-file=alert-policy.yaml

# alert-policy.yaml
displayName: "Dataflow Job Failed"
conditions:
- displayName: "Job State Failed"
  conditionThreshold:
    filter: 'resource.type="dataflow_job" AND metric.type="dataflow.googleapis.com/job/current_state"'
    comparison: COMPARISON_EQ
    thresholdValue: 4  # FAILED state
    duration: 60s

2. Cloud Logging
# Query Dataflow logs
gcloud logging read 'resource.type="dataflow_step" severity>=ERROR' \
    --limit=50 --format=json

# BigQuery audit logs
SELECT
  protopayload_auditlog.methodName,
  protopayload_auditlog.authenticationInfo.principalEmail,
  resource.labels.dataset_id,
  timestamp
FROM `project.cloudaudit_googleapis_com_data_access_*`
WHERE DATE(timestamp) = CURRENT_DATE()

3. BigQuery Information Schema
-- Query slot usage
SELECT
  job_id,
  user_email,
  total_slot_ms,
  total_bytes_processed,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC
LIMIT 20;

4. Dataflow Monitoring
# Key metrics to monitor:
+-- system_lag (streaming)
+-- data_watermark_age
+-- elements_produced
+-- job_current_state
+-- worker_utilization

5. Custom Monitoring
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# Write custom metric
series = monitoring_v3.TimeSeries()
series.metric.type = 'custom.googleapis.com/pipeline/records_processed'
series.resource.type = 'global'
series.points.add(
    value={'int64_value': records_count},
    interval={'end_time': {'seconds': int(time.time())}}
)

client.create_time_series(name=f'projects/my-project', time_series=[series])

Google Cloud Interview Questions


Popular Posts