Top 20 GCP Data Engineer Interview Questions

GCP Data Engineering Services:
+-- Storage
|   +-- Cloud Storage (GCS) - Object storage
|   +-- BigQuery - Data warehouse
|   +-- Cloud SQL - Managed relational DB
|   +-- Cloud Spanner - Global relational DB
|   +-- Bigtable - NoSQL wide-column
|   +-- Firestore - Document DB
|
+-- Processing
|   +-- Dataflow - Stream/batch processing
|   +-- Dataproc - Managed Spark/Hadoop
|   +-- Cloud Functions - Serverless compute
|   +-- Cloud Run - Containerized apps
|
+-- Orchestration
|   +-- Cloud Composer - Managed Airflow
|   +-- Workflows - Serverless orchestration
|   +-- Cloud Scheduler - Cron jobs
|
+-- Analytics
|   +-- BigQuery - Analytics warehouse
|   +-- Looker - BI platform
|   +-- Data Studio - Dashboards
|
+-- Governance
    +-- Dataplex - Data fabric
    +-- Data Catalog - Metadata management
    +-- DLP API - Data protection

Typical Data Pipeline:
+----------+    +----------+    +----------+    +----------+
|  Sources |--->|  Pub/Sub |--->| Dataflow |--->| BigQuery |
|  (Apps,  |    | (Ingest) |    | (Process)|    | (Analyze)|
|  IoT)    |    |          |    |          |    |          |
+----------+    +----------+    +----------+    +----------+

2. What is BigQuery?

BigQuery is a serverless, highly scalable, and cost-effective enterprise data warehouse with built-in ML capabilities.

BigQuery Features:
+-- Serverless - No infrastructure management
+-- Columnar storage - Optimized for analytics
+-- Petabyte scale - Handles massive datasets
+-- SQL interface - Standard SQL support
+-- Built-in ML - BigQuery ML
+-- Real-time analytics - Streaming inserts
+-- Separation of storage and compute

BigQuery Architecture:
+-----------------------------------------------------+
|                    BigQuery                         |
+-----------------------------------------------------+
|  +------------------------------------------------+ |
|  |              Dremel Execution Engine           | |
|  |  (Distributed query processing)                | |
|  +------------------------------------------------+ |
|                        |                            |
|  +------------------------------------------------+ |
|  |              Colossus Storage                  | |
|  |  (Distributed columnar storage)                | |
|  +------------------------------------------------+ |
+-----------------------------------------------------+

# Query example
SELECT
  DATE(timestamp) as date,
  COUNT(*) as events,
  SUM(revenue) as total_revenue
FROM `project.dataset.events`
WHERE timestamp >= '2024-01-01'
GROUP BY date
ORDER BY date;

# Create table
CREATE TABLE `project.dataset.users` (
  user_id STRING,
  email STRING,
  created_at TIMESTAMP,
  metadata STRUCT<
    source STRING,
    campaign STRING
  >
)
PARTITION BY DATE(created_at)
CLUSTER BY user_id;

3. What is Cloud Dataflow?

Cloud Dataflow is a fully managed service for executing Apache Beam pipelines for both batch and stream processing.

Dataflow Features:
+-- Unified batch and streaming
+-- Auto-scaling
+-- Exactly-once processing
+-- Apache Beam SDK
+-- Templates for common patterns
+-- Integration with GCP services

# Python Dataflow Pipeline
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--project=my-project',
    '--region=us-central1',
    '--runner=DataflowRunner',
    '--temp_location=gs://my-bucket/temp',
    '--streaming'  # For streaming jobs
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/my-sub')
     | 'Parse JSON' >> beam.Map(lambda x: json.loads(x))
     | 'Filter' >> beam.Filter(lambda x: x['type'] == 'purchase')
     | 'Extract fields' >> beam.Map(lambda x: {
         'user_id': x['user_id'],
         'amount': x['amount'],
         'timestamp': x['timestamp']
     })
     | 'Window' >> beam.WindowInto(
         beam.window.FixedWindows(60))  # 1-minute windows
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         'project:dataset.table',
         schema='user_id:STRING,amount:FLOAT,timestamp:TIMESTAMP',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

4. What is Pub/Sub?

Pub/Sub is a fully managed, real-time messaging service for event-driven systems and streaming analytics.

Pub/Sub Concepts:
+-- Topic - Named resource for messages
+-- Subscription - Named resource for receiving
+-- Message - Data + attributes
+-- Publisher - Sends messages to topic
+-- Subscriber - Receives from subscription

Pub/Sub Architecture:
+-----------------------------------------------------+
|                    Publishers                       |
|  +-----+  +-----+  +-----+                          |
|  |App 1|  |App 2|  |App 3|                          |
|  +--+--+  +--+--+  +--+--+                          |
|     +-------+--------+                              |
|             v                                       |
|         +--------+                                  |
|         | Topic  |                                  |
|         +---+----+                                  |
|     +-------+--------+                              |
|     v       v        v                              |
|  +-----+  +-----+  +-----+                          |
|  |Sub A|  |Sub B|  |Sub C|                          |
|  +--+--+  +--+--+  +--+--+                          |
|     v       v        v                              |
|  +-----+  +-----+  +-----+                          |
|  |Svc 1|  |Svc 2|  |Svc 3|                          |
|  +-----+  +-----+  +-----+                          |
+-----------------------------------------------------+

# Python Pub/Sub
from google.cloud import pubsub_v1

# Publisher
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')

data = json.dumps({'event': 'purchase', 'amount': 99.99})
future = publisher.publish(topic_path, data.encode('utf-8'),
                          user_id='123', event_type='purchase')
print(f'Published message ID: {future.result()}')

# Subscriber
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'my-sub')

def callback(message):
    print(f'Received: {message.data}')
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

5. What is Cloud Storage (GCS)?

Cloud Storage is a unified object storage for developers and enterprises with high durability and availability.

Storage Classes:
+-- Standard - Frequently accessed data
+-- Nearline - Once per month access
+-- Coldline - Once per quarter access
+-- Archive - Once per year access

# Python GCS operations
from google.cloud import storage

client = storage.Client()

# Create bucket
bucket = client.create_bucket('my-bucket', location='US')

# Upload file
blob = bucket.blob('data/file.parquet')
blob.upload_from_filename('/local/file.parquet')

# Download file
blob.download_to_filename('/local/downloaded.parquet')

# List objects
blobs = client.list_blobs('my-bucket', prefix='data/')
for blob in blobs:
    print(blob.name)

# Lifecycle rules
bucket.lifecycle_rules = [{
    'action': {'type': 'SetStorageClass', 'storageClass': 'NEARLINE'},
    'condition': {'age': 30}
}, {
    'action': {'type': 'SetStorageClass', 'storageClass': 'COLDLINE'},
    'condition': {'age': 90}
}, {
    'action': {'type': 'Delete'},
    'condition': {'age': 365}
}]
bucket.patch()

GCS URL formats:
- gs://bucket-name/object-path
- https://storage.googleapis.com/bucket-name/object-path

6. What is Dataproc?

Dataproc is a fully managed service for running Apache Spark and Hadoop clusters.

Dataproc Features:
+-- Managed Spark/Hadoop clusters
+-- Fast cluster creation (90 seconds)
+-- Autoscaling
+-- Preemptible VMs for cost savings
+-- Integration with GCS, BigQuery
+-- Serverless option available

# Create cluster with gcloud
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2 \
    --image-version=2.0-debian10 \
    --optional-components=JUPYTER \
    --enable-component-gateway

# Submit PySpark job
gcloud dataproc jobs submit pyspark \
    --cluster=my-cluster \
    --region=us-central1 \
    gs://my-bucket/scripts/etl_job.py \
    -- --input=gs://my-bucket/input/ \
       --output=gs://my-bucket/output/

# PySpark job example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ETL').getOrCreate()

# Read from GCS
df = spark.read.parquet('gs://my-bucket/raw/events/')

# Transform
result = df.filter(df.event_type == 'purchase') \
    .groupBy('user_id') \
    .agg({'amount': 'sum', 'event_id': 'count'})

# Write to BigQuery
result.write.format('bigquery') \
    .option('table', 'project.dataset.user_purchases') \
    .mode('overwrite') \
    .save()

# Dataproc Serverless
gcloud dataproc batches submit pyspark \
    gs://my-bucket/scripts/etl_job.py \
    --region=us-central1 \
    --deps-bucket=gs://my-bucket/deps

7. What is Cloud Composer?

Cloud Composer is a fully managed Apache Airflow service for workflow orchestration.

Cloud Composer Features:
+-- Managed Apache Airflow
+-- GCP service integration
+-- Web UI for monitoring
+-- Python-based DAGs
+-- Composer 2 with improved scaling

# Airflow DAG for GCP
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator, BigQueryCheckOperator
)
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStartFlexTemplateOperator
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator
)
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG('daily_etl', default_args=default_args, schedule_interval='@daily') as dag:
    
    # Load raw data to BigQuery
    load_raw = GCSToBigQueryOperator(
        task_id='load_raw_data',
        bucket='raw-data-bucket',
        source_objects=['events/{{ ds }}/*.json'],
        destination_project_dataset_table='project.raw.events',
        source_format='NEWLINE_DELIMITED_JSON',
        write_disposition='WRITE_APPEND'
    )
    
    # Transform with BigQuery
    transform = BigQueryInsertJobOperator(
        task_id='transform_data',
        configuration={
            'query': {
                'query': '''
                    INSERT INTO `project.curated.daily_summary`
                    SELECT DATE(timestamp) as date, COUNT(*) as events
                    FROM `project.raw.events`
                    WHERE DATE(timestamp) = '{{ ds }}'
                    GROUP BY date
                ''',
                'useLegacySql': False
            }
        }
    )
    
    # Data quality check
    quality_check = BigQueryCheckOperator(
        task_id='quality_check',
        sql='SELECT COUNT(*) > 0 FROM `project.curated.daily_summary` WHERE date = "{{ ds }}"',
        use_legacy_sql=False
    )
    
    load_raw >> transform >> quality_check

8. What is Dataplex?

Dataplex is a data fabric that unifies distributed data and automates data management and governance.

Dataplex Concepts:
+-- Lake - Logical grouping of data
+-- Zone - Subdivision (raw, curated)
+-- Asset - Actual data (GCS, BigQuery)
+-- Tasks - Automated data processing
+-- Data Quality - Automated validation

Dataplex Architecture:
+-----------------------------------------------------+
|                    Dataplex Lake                    |
+-----------------------------------------------------+
|  +-----------------+    +-----------------+         |
|  |    Raw Zone     |    |  Curated Zone   |         |
|  |                 |    |                 |         |
|  |  +-----------+  |    |  +-----------+  |         |
|  |  | GCS Asset |  |--->|  | BQ Asset  |  |         |
|  |  |(raw files)|  |    |  | (tables)  |  |         |
|  |  +-----------+  |    |  +-----------+  |         |
|  +-----------------+    +-----------------+         |
|                                                     |
|  +---------------------------------------------+    |
|  |            Data Quality Rules               |    |
|  |  - Completeness, Uniqueness, Validity       |    |
|  +---------------------------------------------+    |
+-----------------------------------------------------+

# Create lake with gcloud
gcloud dataplex lakes create my-lake \
    --location=us-central1 \
    --display-name="Analytics Lake"

# Create zone
gcloud dataplex zones create raw-zone \
    --lake=my-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION

# Add asset
gcloud dataplex assets create raw-events \
    --lake=my-lake \
    --zone=raw-zone \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/raw-events-bucket \
    --discovery-enabled

9. What is Data Catalog?

Data Catalog is a fully managed metadata management service for discovering and managing data.

Data Catalog Features:
+-- Automatic metadata discovery
+-- Custom tags and templates
+-- Search across projects
+-- Policy tags for column-level security
+-- Integration with BigQuery, GCS, Pub/Sub

# Create tag template
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

template = datacatalog_v1.TagTemplate()
template.display_name = "Data Quality Metadata"
template.fields["owner"] = datacatalog_v1.TagTemplateField(
    display_name="Data Owner",
    type_=datacatalog_v1.FieldType(primitive_type="STRING")
)
template.fields["quality_score"] = datacatalog_v1.TagTemplateField(
    display_name="Quality Score",
    type_=datacatalog_v1.FieldType(primitive_type="DOUBLE")
)
template.fields["pii"] = datacatalog_v1.TagTemplateField(
    display_name="Contains PII",
    type_=datacatalog_v1.FieldType(primitive_type="BOOL")
)

created_template = client.create_tag_template(
    parent="projects/my-project/locations/us-central1",
    tag_template_id="data_quality",
    tag_template=template
)

# Search catalog
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")

results = client.search_catalog(
    scope=scope,
    query="tag:data_quality.pii=true"
)

for result in results:
    print(f"Found: {result.linked_resource}")

10. How do you design a data lake on GCP?

GCP Data Lake Architecture:
+-------------------------------------------------------------+
|                      Data Sources                           |
|  +-----+ +-----+ +-----+ +-----+ +-----+                    |
|  | API | | IoT | | DB  | |Files| |Stream|                   |
|  +--+--+ +--+--+ +--+--+ +--+--+ +--+--+                    |
+----+-------+-------+-------+-------+------------------------+
     |       |       |       |       |
     v       v       v       v       v
+-------------------------------------------------------------+
|                    Ingestion Layer                          |
|  +--------------+  +--------------+  +--------------+       |
|  |   Pub/Sub    |  |  Dataflow    |  |  Transfer    |       |
|  |  (Streaming) |  |  (ETL)       |  |  Service     |       |
|  +--------------+  +--------------+  +--------------+       |
+-------------------------------------------------------------+
                         |
                         v
+-------------------------------------------------------------+
|                    Storage Layer                            |
|  +------------------------------------------------------+   |
|  |                 Cloud Storage (GCS)                  |   |
|  |  +------------+ +------------+ +------------+        |   |
|  |  | Raw Zone   | | Processed  | | Curated    |        |   |
|  |  | (Landing)  | | Zone       | | Zone       |        |   |
|  |  +------------+ +------------+ +------------+        |   |
|  +------------------------------------------------------+   |
|                         |                                   |
|  +------------------------------------------------------+   |
|  |                    BigQuery                          |   |
|  |  (Analytics tables, ML models, BI views)             |   |
|  +------------------------------------------------------+   |
+-------------------------------------------------------------+

Zone Design:
+-- Raw Zone (gs://lake/raw/)
|   +-- Format: JSON, CSV, Avro (as received)
|   +-- Retention: Full history
|   +-- Access: Data engineers only
|
+-- Processed Zone (gs://lake/processed/)
|   +-- Format: Parquet (optimized)
|   +-- Partitioned by date
|   +-- Access: Data engineers, analysts
|
+-- Curated Zone (BigQuery)
    +-- Format: BigQuery tables
    +-- Business-ready datasets
    +-- Access: All analytics users

# Bucket organization
gs://my-lake/
+-- raw/
|   +-- source_name/
|       +-- year=2024/month=01/day=15/
|           +-- data.json
+-- processed/
|   +-- domain/
|       +-- entity/
|           +-- year=2024/month=01/
|               +-- data.parquet
+-- curated/
    +-- External tables pointing to processed data

11. What are BigQuery best practices?

Performance Best Practices:

1. Partitioning
-- Partition by date for time-series data
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM source_table;

-- Partition pruning in queries
SELECT * FROM events
WHERE DATE(event_timestamp) = '2024-01-15'  -- Scans one partition

2. Clustering
-- Cluster frequently filtered columns
CREATE TABLE orders
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_category;

3. Avoid SELECT *
-- Bad
SELECT * FROM large_table;

-- Good
SELECT user_id, event_type, timestamp FROM large_table;

4. Use approximate functions
-- Exact (slow)
SELECT COUNT(DISTINCT user_id) FROM events;

-- Approximate (fast)
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

5. Materialized views
CREATE MATERIALIZED VIEW `project.dataset.daily_stats`
PARTITION BY date
CLUSTER BY category
AS
SELECT
  DATE(timestamp) as date,
  category,
  COUNT(*) as events,
  SUM(amount) as total
FROM `project.dataset.events`
GROUP BY date, category;

Cost Optimization:
+-- Use flat-rate pricing for predictable workloads
+-- Set custom quotas per project/user
+-- Use BI Engine for dashboards
+-- Expire partitions with partition_expiration_days
+-- Preview queries before running (dry run)

12. How do you handle streaming data on GCP?

Streaming Architecture:
+-------------------------------------------------------------+
|  Sources                                                    |
|  +-----+ +-----+ +-----+                                    |
|  | IoT | | Apps| |Kafka|                                    |
|  +--+--+ +--+--+ +--+--+                                    |
|     +-------+-------+                                       |
|             v                                               |
|  +-------------------------------------------------------+  |
|  |                    Pub/Sub                            |  |
|  |  (Message buffering, fan-out)                         |  |
|  +-------------------------------------------------------+  |
|             |                                               |
|             v                                               |
|  +-------------------------------------------------------+  |
|  |                   Dataflow                            |  |
|  |  (Windowing, aggregation, enrichment)                 |  |
|  +-------------------------------------------------------+  |
|             |                                               |
|     +-------+-------+                                       |
|     v       v       v                                       |
|  BigQuery  GCS   Bigtable                                   |
|  (Analytics)(Archive)(Real-time)                            |
+-------------------------------------------------------------+

# Dataflow streaming pipeline
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline(options=options) as p:
    events = (p
        | 'Read' >> beam.io.ReadFromPubSub(subscription=sub)
        | 'Parse' >> beam.Map(json.loads)
        | 'AddTimestamp' >> beam.Map(lambda x: beam.window.TimestampedValue(
            x, x['event_time']))
    )
    
    # Real-time aggregation (1-minute windows)
    windowed = (events
        | 'Window' >> beam.WindowInto(window.FixedWindows(60))
        | 'Key' >> beam.Map(lambda x: (x['category'], x['amount']))
        | 'Sum' >> beam.CombinePerKey(sum)
    )
    
    # Write to BigQuery (streaming)
    windowed | 'ToBQ' >> beam.io.WriteToBigQuery(
        'project:dataset.realtime_metrics',
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS
    )
    
    # Archive to GCS
    events | 'ToGCS' >> beam.io.WriteToText(
        'gs://archive/events',
        file_name_suffix='.json'
    )

13. What is BigQuery ML?

BigQuery ML enables creating and executing machine learning models using SQL.

BigQuery ML Model Types:
+-- Linear regression
+-- Logistic regression
+-- K-means clustering
+-- Matrix factorization
+-- Time series (ARIMA_PLUS)
+-- Deep neural networks
+-- XGBoost
+-- AutoML Tables
+-- Imported TensorFlow models

# Create classification model
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['churned'],
  auto_class_weights=TRUE
) AS
SELECT
  user_id,
  days_since_last_purchase,
  total_orders,
  avg_order_value,
  support_tickets,
  churned
FROM `project.dataset.user_features`
WHERE _PARTITIONDATE BETWEEN '2023-01-01' AND '2023-12-31';

# Evaluate model
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.churn_model`);

# Make predictions
SELECT
  user_id,
  predicted_churned,
  predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL `project.dataset.churn_model`,
  (SELECT * FROM `project.dataset.user_features`
   WHERE _PARTITIONDATE = CURRENT_DATE()))
WHERE predicted_churned_probs[OFFSET(1)].prob > 0.7;

# Time series forecasting
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type='ARIMA_PLUS',
  time_series_timestamp_col='date',
  time_series_data_col='sales',
  time_series_id_col='product_id',
  horizon=30
) AS
SELECT date, product_id, sales
FROM `project.dataset.daily_sales`;

# Forecast
SELECT * FROM ML.FORECAST(MODEL `project.dataset.sales_forecast`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level));

14. How do you implement ETL on GCP?

ETL Options on GCP:
+-- Dataflow - Complex transformations, streaming
+-- Dataproc - Spark-based ETL
+-- BigQuery - SQL-based transformations
+-- Cloud Data Fusion - Visual ETL
+-- Dataform - SQL-based data modeling

# BigQuery SQL-based ETL
-- Incremental load pattern
MERGE `project.curated.customers` AS target
USING (
  SELECT * FROM `project.staging.customers_daily`
  WHERE load_date = CURRENT_DATE()
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET
    email = source.email,
    updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, created_at, updated_at)
  VALUES (source.customer_id, source.email, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP());

# Scheduled query for daily ETL
bq query --schedule='every 24 hours' \
  --display_name='Daily Customer ETL' \
  --destination_table='project.curated.daily_summary' \
  --replace=true \
  'SELECT DATE(timestamp) as date, COUNT(*) as events 
   FROM project.raw.events 
   WHERE DATE(timestamp) = CURRENT_DATE() - 1 
   GROUP BY date'

# Dataflow template for ETL
gcloud dataflow jobs run my-etl-job \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
    javascriptTextTransformGcsPath=gs://my-bucket/transform.js,\
    JSONPath=gs://my-bucket/schema.json,\
    inputFilePattern=gs://my-bucket/input/*.json,\
    outputTable=project:dataset.table,\
    bigQueryLoadingTemporaryDirectory=gs://my-bucket/temp

15. What is Dataform?

Dataform is a service for managing SQL-based data transformation workflows using a Git-like development experience.

Dataform Features:
+-- SQL-based transformations
+-- Dependency management
+-- Version control integration
+-- Data assertions/testing
+-- Documentation generation
+-- Incremental processing

# Dataform project structure
dataform/
+-- definitions/
|   +-- sources/
|   |   +-- raw_events.sqlx
|   +-- staging/
|   |   +-- stg_events.sqlx
|   +-- marts/
|       +-- dim_customers.sqlx
+-- includes/
|   +-- constants.js
+-- dataform.json

# Source declaration (raw_events.sqlx)
config {
  type: "declaration",
  schema: "raw",
  name: "events"
}

# Staging transformation (stg_events.sqlx)
config {
  type: "view",
  schema: "staging",
  description: "Cleaned and deduplicated events"
}

SELECT DISTINCT
  event_id,
  LOWER(TRIM(user_id)) AS user_id,
  PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', event_time) AS event_timestamp,
  event_type,
  JSON_EXTRACT_SCALAR(properties, '$.page') AS page
FROM ${ref("raw", "events")}
WHERE event_id IS NOT NULL

# Mart table with incremental (dim_customers.sqlx)
config {
  type: "incremental",
  schema: "marts",
  uniqueKey: ["customer_id"],
  bigquery: {
    partitionBy: "DATE(updated_at)",
    clusterBy: ["customer_id"]
  }
}

SELECT
  customer_id,
  first_name,
  last_name,
  email,
  CURRENT_TIMESTAMP() AS updated_at
FROM ${ref("staging", "stg_customers")}
${when(incremental(), `WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})`)}

16. How do you optimize costs on GCP?

Cost Optimization Strategies:

1. BigQuery
+-- Use flat-rate for predictable workloads
+-- Partition and cluster tables
+-- Set table/partition expiration
+-- Use BI Engine for dashboards
+-- Avoid SELECT *, use column selection
+-- Use preview/dry-run before executing

# Set partition expiration
ALTER TABLE `project.dataset.logs`
SET OPTIONS (
  partition_expiration_days=90
);

2. Cloud Storage
+-- Use appropriate storage class
+-- Set lifecycle rules
+-- Enable Object Lifecycle Management
+-- Use Autoclass for variable access patterns

3. Dataflow
+-- Use autoscaling
+-- Use Dataflow Prime for better resource management
+-- Use FlexRS for cost-effective batch jobs
+-- Choose appropriate machine types

gcloud dataflow jobs run my-job \
    --flexrs-goal=COST_OPTIMIZED \
    --max-workers=10

4. Dataproc
+-- Use preemptible VMs (up to 80% savings)
+-- Autoscaling policies
+-- Use Dataproc Serverless
+-- Right-size clusters

# Preemptible secondary workers
gcloud dataproc clusters create my-cluster \
    --num-workers=2 \
    --num-secondary-workers=10 \
    --secondary-worker-type=preemptible

5. Compute
+-- Committed use discounts (1-3 year)
+-- Spot VMs for fault-tolerant workloads
+-- Right-sizing recommendations
+-- Scheduled scaling

17. What is Cloud Data Fusion?

Cloud Data Fusion is a fully managed, visual data integration service for building and managing ETL pipelines.

Data Fusion Features:
+-- Visual pipeline builder
+-- 150+ pre-built connectors
+-- Code-free transformations
+-- Built on CDAP (open source)
+-- Wrangler for data prep
+-- Reusable pipeline templates

Data Fusion Architecture:
+-------------------------------------------------------------+
|                    Cloud Data Fusion                        |
+-------------------------------------------------------------+
|  +-------------------------------------------------------+  |
|  |              Pipeline Studio (UI)                     |  |
|  |  Source -> Transform -> Transform -> Sink             |  |
|  +-------------------------------------------------------+  |
|                         |                                   |
|  +-------------------------------------------------------+  |
|  |              Execution (Dataproc)                     |  |
|  |  Spark jobs run on ephemeral clusters                 |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+

Pipeline Components:
+-- Sources: GCS, BigQuery, JDBC, Kafka, Salesforce
+-- Transforms: Filter, Join, Aggregate, JavaScript
+-- Sinks: BigQuery, GCS, Bigtable, Spanner

# Deploy pipeline via API
from google.cloud import datafusion_v1

client = datafusion_v1.DataFusionClient()

# Get instance
instance = client.get_instance(
    name='projects/my-project/locations/us-central1/instances/my-instance'
)

# Create pipeline using REST API to CDAP
import requests
cdap_endpoint = instance.api_endpoint
pipeline_json = {...}  # Pipeline configuration

response = requests.put(
    f'{cdap_endpoint}/v3/namespaces/default/apps/my-pipeline',
    headers={'Authorization': f'Bearer {token}'},
    json=pipeline_json
)

18. How do you secure data on GCP?

GCP Data Security Layers:

1. Identity & Access Management (IAM)
# Grant BigQuery access
gcloud projects add-iam-policy-binding my-project \
    --member="user:analyst@company.com" \
    --role="roles/bigquery.dataViewer"

# Custom role
gcloud iam roles create DataAnalyst \
    --project=my-project \
    --permissions=bigquery.tables.getData,bigquery.jobs.create

2. Column-Level Security (BigQuery)
-- Create policy tag
-- In Data Catalog, create taxonomy and policy tags

-- Apply to column
ALTER TABLE `project.dataset.customers`
ALTER COLUMN ssn
SET OPTIONS (policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/456']);

3. Row-Level Security (BigQuery)
CREATE ROW ACCESS POLICY region_filter
ON `project.dataset.sales`
GRANT TO ("user:analyst@company.com")
FILTER USING (region = 'US');

4. VPC Service Controls
# Create perimeter
gcloud access-context-manager perimeters create my-perimeter \
    --title="Data Perimeter" \
    --resources=projects/123456789 \
    --restricted-services=bigquery.googleapis.com,storage.googleapis.com

5. Encryption
+-- Default: Google-managed keys
+-- CMEK: Customer-managed encryption keys
+-- CSEK: Customer-supplied encryption keys

# BigQuery with CMEK
CREATE TABLE `project.dataset.sensitive_data`
OPTIONS(
  kms_key_name='projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key'
) AS SELECT * FROM source;

6. Data Loss Prevention (DLP)
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Inspect content for PII
response = dlp.inspect_content(
    parent=f'projects/my-project',
    inspect_config={
        'info_types': [
            {'name': 'EMAIL_ADDRESS'},
            {'name': 'PHONE_NUMBER'},
            {'name': 'CREDIT_CARD_NUMBER'}
        ]
    },
    item={'value': text_to_inspect}
)

19. What are GCP data integration patterns?

Common Integration Patterns:

1. Batch Ingestion
Sources --> Transfer Service/Dataflow --> GCS --> BigQuery

# Storage Transfer Service
gcloud transfer jobs create \
    --source-agent-pool=my-pool \
    --source-directory=/data \
    --destination-bucket=my-bucket \
    --schedule-repeats-every=1d

2. Streaming Ingestion
Sources --> Pub/Sub --> Dataflow --> BigQuery/Bigtable

3. CDC (Change Data Capture)
Database --> Datastream --> BigQuery

# Create Datastream stream
gcloud datastream streams create my-stream \
    --location=us-central1 \
    --source=my-mysql-source \
    --destination=my-bigquery-dest \
    --backfill-all

4. API Integration
Cloud Functions/Cloud Run --> Process --> BigQuery

# Cloud Function for API ingestion
def ingest_api_data(request):
    response = requests.get('https://api.example.com/data')
    data = response.json()
    
    client = bigquery.Client()
    errors = client.insert_rows_json('project.dataset.table', data)
    
    return 'Success' if not errors else f'Errors: {errors}'

5. Cross-Cloud Integration
AWS/Azure --> Transfer Service --> GCS --> BigQuery

6. Hybrid Integration
On-prem --> Transfer Appliance --> GCS
On-prem DB --> Datastream --> BigQuery

Integration Architecture:
+-------------------------------------------------------------+
|  On-Premises          |           Google Cloud              |
|  +---------+          |    +-----------------------------+  |
|  | Database|--Datastream-->| BigQuery (CDC)              |  |
|  +---------+          |    +-----------------------------+  |
|  +---------+          |    +-----------------------------+  |
|  | Files   |--Transfer--->| Cloud Storage (Batch)        |  |
|  +---------+  Service |    +-----------------------------+  |
|  +---------+          |    +-----------------------------+  |
|  | Kafka   |--Pub/Sub---->| Dataflow (Streaming)         |  |
|  +---------+  Connector|    +-----------------------------+  |
+-------------------------------------------------------------+

20. How do you monitor data pipelines on GCP?

Monitoring Stack:

1. Cloud Monitoring
+-- Metrics from all GCP services
+-- Custom metrics
+-- Dashboards
+-- Alerting policies

# Create alert policy
gcloud monitoring policies create \
    --policy-from-file=alert-policy.yaml

# alert-policy.yaml
displayName: "Dataflow Job Failed"
conditions:
- displayName: "Job State Failed"
  conditionThreshold:
    filter: 'resource.type="dataflow_job" AND metric.type="dataflow.googleapis.com/job/current_state"'
    comparison: COMPARISON_EQ
    thresholdValue: 4  # FAILED state
    duration: 60s

2. Cloud Logging
# Query Dataflow logs
gcloud logging read 'resource.type="dataflow_step" severity>=ERROR' \
    --limit=50 --format=json

# BigQuery audit logs
SELECT
  protopayload_auditlog.methodName,
  protopayload_auditlog.authenticationInfo.principalEmail,
  resource.labels.dataset_id,
  timestamp
FROM `project.cloudaudit_googleapis_com_data_access_*`
WHERE DATE(timestamp) = CURRENT_DATE()

3. BigQuery Information Schema
-- Query slot usage
SELECT
  job_id,
  user_email,
  total_slot_ms,
  total_bytes_processed,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC
LIMIT 20;

4. Dataflow Monitoring
# Key metrics to monitor:
+-- system_lag (streaming)
+-- data_watermark_age
+-- elements_produced
+-- job_current_state
+-- worker_utilization

5. Custom Monitoring
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# Write custom metric
series = monitoring_v3.TimeSeries()
series.metric.type = 'custom.googleapis.com/pipeline/records_processed'
series.resource.type = 'global'
series.points.add(
    value={'int64_value': records_count},
    interval={'end_time': {'seconds': int(time.time())}}
)

client.create_time_series(name=f'projects/my-project', time_series=[series])