

Top GCP Cloud Storage & Data Lake Interview Questions (2026) | JavaInUse

Top 20 GCP Cloud Storage & Data Lake Interview Questions


  1. What is Cloud Storage?
  2. What are storage classes?
  3. What is object lifecycle management?
  4. What is Dataplex?
  5. What is Data Catalog?
  6. How do you implement data lake architecture?
  7. What is BigLake?
  8. How do you secure Cloud Storage?
  9. What is transfer service?
  10. What are signed URLs?
  11. What is object versioning?
  12. How do you organize data in GCS?
  13. What is the Pub/Sub notification feature?
  14. What are retention policies?
  15. How do you optimize storage costs?
  16. What is data quality in Dataplex?
  17. What are Dataplex lakes and zones?
  18. How do you implement data lineage?
  19. What is Analytics Hub?
  20. What are Cloud Storage best practices?

Google Cloud Interview Questions

1. What is Cloud Storage?

Cloud Storage is Google Cloud's unified object storage service for developers and enterprises, designed for high durability and availability.

Cloud Storage Features:
+-- Object storage (unstructured data)
+-- 99.999999999% (11 nines) durability
+-- Global accessibility
+-- Multiple storage classes
+-- Strong consistency
+-- Unlimited storage

Key Concepts:
+-- Buckets - Container for objects
+-- Objects - Files with metadata
+-- Regions/Multi-regions - Data location
+-- ACLs/IAM - Access control

# Create bucket
gsutil mb -p my-project -c STANDARD -l US-CENTRAL1 gs://my-bucket/

# Or with gcloud
gcloud storage buckets create gs://my-bucket \
    --project=my-project \
    --location=US-CENTRAL1 \
    --uniform-bucket-level-access

# Upload file
gsutil cp local-file.csv gs://my-bucket/data/

# Download file
gsutil cp gs://my-bucket/data/file.csv ./

# List objects
gsutil ls gs://my-bucket/data/

# Python client
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')

# Upload
blob = bucket.blob('data/file.csv')
blob.upload_from_filename('local-file.csv')

# Download
blob = bucket.blob('data/file.csv')
blob.download_to_filename('downloaded.csv')

# Read directly (download_as_string() is deprecated; download_as_bytes() returns bytes)
content = blob.download_as_bytes()

2. What are storage classes?

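+----------+------------------+--------------+----------------+
| Class    | Use Case         | Min Duration | Retrieval Cost |
+----------+------------------+--------------+----------------+
| Standard | Frequent access  | None         | None           |
| Nearline | Monthly access   | 30 days      | $0.01/GB       |
| Coldline | Quarterly access | 90 days      | $0.02/GB       |
| Archive  | Yearly access    | 365 days     | $0.05/GB       |
+----------+------------------+--------------+----------------+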

Storage Class Selection:

# Create bucket with storage class
gsutil mb -c NEARLINE -l US gs://archive-bucket/

# Change object storage class
gsutil rewrite -s COLDLINE gs://bucket/object

# Set default storage class
gsutil defstorageclass set NEARLINE gs://bucket/

Autoclass (Automatic Tiering):
# Automatically moves objects based on access patterns
gcloud storage buckets create gs://my-bucket \
    --location=US \
    --enable-autoclass

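Autoclass Behavior (the Coldline and Archive transitions apply only when the terminal storage class is set to Archive; the default terminal class is Nearline):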
+-------------------------------------------------------------+
|  Object Access Pattern     |    Storage Class Transition    |
+-------------------------------------------------------------+
|  Frequently accessed       |    Standard                    |
|  Not accessed 30 days      |    Nearline                    |
|  Not accessed 90 days      |    Coldline                    |
|  Not accessed 365 days     |    Archive                     |
|  Accessed again            |    Back to Standard            |
+-------------------------------------------------------------+

Pricing (US regions, approximate list prices; check the current price list for your location):
+-- Standard: $0.020/GB/month
+-- Nearline: $0.010/GB/month
+-- Coldline: $0.004/GB/month
+-- Archive: $0.0012/GB/month
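
To make the trade-off between storage price and retrieval price concrete, here is a minimal sketch that estimates one month of cost per class. It assumes the per-GB prices listed above and ignores operation charges, egress, and minimum storage durations; the function names `monthly_cost` and `cheapest_class` are illustrative, not part of any Google library.

```python
# Prices copied from the storage class table and pricing list above (USD, US regions).
# They are illustrative only and will drift over time.
STORAGE_PRICE_PER_GB_MONTH = {
    "STANDARD": 0.020,
    "NEARLINE": 0.010,
    "COLDLINE": 0.004,
    "ARCHIVE": 0.0012,
}
RETRIEVAL_PRICE_PER_GB = {
    "STANDARD": 0.0,
    "NEARLINE": 0.01,
    "COLDLINE": 0.02,
    "ARCHIVE": 0.05,
}


def monthly_cost(storage_class, stored_gb, read_gb_per_month):
    """Storage plus retrieval cost for one month (operations and egress ignored)."""
    storage = stored_gb * STORAGE_PRICE_PER_GB_MONTH[storage_class]
    retrieval = read_gb_per_month * RETRIEVAL_PRICE_PER_GB[storage_class]
    return storage + retrieval


def cheapest_class(stored_gb, read_gb_per_month):
    """Return the class with the lowest estimated monthly cost."""
    return min(
        STORAGE_PRICE_PER_GB_MONTH,
        key=lambda cls: monthly_cost(cls, stored_gb, read_gb_per_month),
    )


if __name__ == "__main__":
    for reads in (0, 100, 2000):
        print(reads, cheapest_class(1000, reads))
```

The point of the sketch is that colder classes only win when the data is read rarely: once monthly reads exceed the stored volume, retrieval fees outweigh the storage savings.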

3. What is object lifecycle management?

Lifecycle management automates object transitions and deletions based on rules.

Lifecycle Rules:

# lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
        "condition": {"age": 365, "matchesStorageClass": ["COLDLINE"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 730}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"isLive": false, "numNewerVersions": 3}
      },
      {
        "action": {"type": "AbortIncompleteMultipartUpload"},
        "condition": {"age": 7}
      }
    ]
  }
}

# Apply lifecycle
gsutil lifecycle set lifecycle.json gs://my-bucket/

# View lifecycle
gsutil lifecycle get gs://my-bucket/

Conditions Available:
+-- age - Days since creation
+-- createdBefore - Created before date
+-- isLive - Current vs noncurrent (versioned)
+-- numNewerVersions - Version count
+-- matchesStorageClass - Current class
+-- matchesPrefix - Object name prefix
+-- matchesSuffix - Object name suffix
+-- daysSinceCustomTime - Custom metadata time

Actions Available:
+-- Delete - Remove object
+-- SetStorageClass - Change class
+-- AbortIncompleteMultipartUpload - Clean up
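
The way conditions and actions combine can be sketched in plain Python. This is a simplified local model, not the service's implementation: it handles only `age`, `matchesStorageClass`, and `isLive`, treats all conditions in a rule as ANDed, and resolves conflicts by letting `Delete` win over `SetStorageClass` and the coldest class win among `SetStorageClass` actions. The object dictionary shape and the helper names are assumptions made for illustration.

```python
from datetime import date, timedelta

# Ordered warmest to coldest; used to break ties between SetStorageClass actions.
COLDNESS = ["STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE"]

# A subset of the rules from lifecycle.json above.
RULES = [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}},
    {"action": {"type": "Delete"}, "condition": {"age": 730}},
]


def matches(condition, obj, today):
    """True when every condition present in the rule is satisfied."""
    age_days = (today - obj["created"]).days
    if "age" in condition and age_days < condition["age"]:
        return False
    if ("matchesStorageClass" in condition
            and obj["storage_class"] not in condition["matchesStorageClass"]):
        return False
    if "isLive" in condition and obj["is_live"] != condition["isLive"]:
        return False
    return True


def pick_action(rules, obj, today):
    """Return the single action that applies to obj today, or None."""
    actions = [r["action"] for r in rules if matches(r["condition"], obj, today)]
    if not actions:
        return None
    for action in actions:
        if action["type"] == "Delete":
            return action  # Delete takes precedence over class changes
    set_class = [a for a in actions if a["type"] == "SetStorageClass"]
    if set_class:
        return max(set_class, key=lambda a: COLDNESS.index(a["storageClass"]))
    return actions[0]


if __name__ == "__main__":
    today = date(2024, 6, 1)
    obj = {"created": today - timedelta(days=45),
           "storage_class": "STANDARD", "is_live": True}
    print(pick_action(RULES, obj, today))
```

Running rules through a model like this before applying them to a bucket is a cheap way to catch a rule that would delete data earlier than intended.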

4. What is Dataplex?

Dataplex is an intelligent data fabric that unifies distributed data and automates data management.

Dataplex Architecture:
+-------------------------------------------------------------+
|                       Dataplex                               |
+-------------------------------------------------------------+
|  +-----------------------------------------------------+   |
|  |                    Lake                              |   |
|  |  +-------------+  +-------------+  +-------------+ |   |
|  |  |  Raw Zone   |  | Curated Zone |  | Consumption | |   |
|  |  |  (Landing)  |  |  (Refined)   |  |    Zone     | |   |
|  |  +------+------+  +------+------+  +------+------+ |   |
|  |         |                |                |        |   |
|  |    +----+----+     +----+----+     +----+----+   |   |
|  |    |  Asset  |     |  Asset  |     |  Asset  |   |   |
|  |    |  (GCS)  |     | (BigQuery)|    |  (BQ)  |   |   |
|  |    +---------+     +---------+     +---------+   |   |
|  +-----------------------------------------------------+   |
|                                                              |
|  Features:                                                   |
|  +-- Unified governance                                     |
|  +-- Automated data discovery                               |
|  +-- Data quality management                                |
|  +-- Security & policies                                    |
|  +-- Metadata management                                    |
+-------------------------------------------------------------+

# Create Dataplex lake
gcloud dataplex lakes create my-lake \
    --location=us-central1 \
    --display-name="My Data Lake"

# Create zone
gcloud dataplex zones create raw-zone \
    --lake=my-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION \
    --display-name="Raw Data Zone"

# Add asset (GCS bucket)
gcloud dataplex assets create raw-data-asset \
    --lake=my-lake \
    --zone=raw-zone \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/raw-data-bucket \
    --discovery-enabled

# Add BigQuery dataset as asset
gcloud dataplex assets create curated-data-asset \
    --lake=my-lake \
    --zone=curated-zone \
    --location=us-central1 \
    --resource-type=BIGQUERY_DATASET \
    --resource-name=projects/my-project/datasets/curated

5. What is Data Catalog?

Data Catalog is a fully managed metadata management service for discovering and managing data.

Data Catalog Features:
+-- Automatic metadata discovery
+-- Custom metadata (tags)
+-- Data lineage tracking
+-- Unified search
+-- Policy tags for security
+-- Business glossary

# Search data assets (a search scope such as --include-project-ids is required)
gcloud data-catalog search "type=table AND system=bigquery" \
    --include-project-ids=my-project

# Create tag template
gcloud data-catalog tag-templates create data-quality-template \
    --location=us-central1 \
    --display-name="Data Quality" \
    --field=id=owner,display-name="Data Owner",type=string,required=true \
    --field=id=quality_score,display-name="Quality Score",type=double \
    --field=id=pii_flag,display-name="Contains PII",type=bool

# Create entry for external data
gcloud data-catalog entries create my-external-data \
    --entry-group=my-group \
    --location=us-central1 \
    --display-name="External Sales Data" \
    --type=FILESET \
    --gcs-file-patterns="gs://bucket/sales/*"

# Python: Add tag to BigQuery table
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

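# Look up entry (project, dataset, and table are placeholders for your own IDs)
project, dataset, table = "my-project", "my_dataset", "my_table"
resource_name = f"//bigquery.googleapis.com/projects/{project}/datasets/{dataset}/tables/{table}"
entry = client.lookup_entry(request={"linked_resource": resource_name})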

# Create tag
tag = datacatalog_v1.Tag()
tag.template = f"projects/{project}/locations/us-central1/tagTemplates/data-quality-template"
tag.fields["owner"] = datacatalog_v1.TagField(string_value="data-team@company.com")
tag.fields["quality_score"] = datacatalog_v1.TagField(double_value=0.95)
tag.fields["pii_flag"] = datacatalog_v1.TagField(bool_value=True)

client.create_tag(parent=entry.name, tag=tag)





6. How do you implement data lake architecture?

Data Lake Architecture on GCP:
+-------------------------------------------------------------+
|                    GCP Data Lake                             |
+-------------------------------------------------------------+
|                                                              |
|  +------------------------------------------------------+  |
|  |                   Ingestion Layer                     |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  |  | Pub/Sub| |Dataflow| |Transfer| | Cloud  |        |  |
|  |  |        | |        | |Service | |Functions|       |  |
|  |  +---+----+ +---+----+ +---+----+ +---+----+        |  |
|  +------+----------+----------+----------+--------------+  |
|         +----------+----------+----------+                  |
|                          |                                  |
|                          v                                  |
|  +------------------------------------------------------+  |
|  |                   Storage Layer                       |  |
|  |  +-------------+  +-------------+  +-------------+  |  |
|  |  |  Raw Zone   |  | Processed   |  |  Curated    |  |  |
|  |  |  (Landing)  |->|    Zone     |->|    Zone     |  |  |
|  |  |    GCS      |  |    GCS      |  |  BigQuery   |  |  |
|  |  +-------------+  +-------------+  +-------------+  |  |
|  +------------------------------------------------------+  |
|                          |                                  |
|                          v                                  |
|  +------------------------------------------------------+  |
|  |                 Processing Layer                      |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  |  |Dataproc| |Dataflow| |Spark   | |BigQuery|        |  |
|  |  |        | |        | |Serverls| |   ML   |        |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  +------------------------------------------------------+  |
|                          |                                  |
|                          v                                  |
|  +------------------------------------------------------+  |
|  |                 Consumption Layer                     |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  |  | Looker | | Vertex | | Data   | |  APIs  |        |  |
|  |  | Studio | |   AI   | | Studio | |        |        |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  +------------------------------------------------------+  |
+-------------------------------------------------------------+

Zone Organization:
gs://datalake-raw/           # Landing zone (raw data)
+-- source=salesforce/
+-- source=ga4/
+-- source=iot/

gs://datalake-processed/     # Processed zone
+-- domain=sales/
+-- domain=marketing/
+-- domain=operations/

project.curated.*            # BigQuery curated datasets
+-- sales_analytics
+-- customer_360
+-- operational_metrics

7. What is BigLake?

BigLake provides unified fine-grained access control for data lakes across Cloud Storage and BigQuery.

BigLake Features:
+-- Fine-grained security on GCS data
+-- Row and column-level security
+-- Apache Iceberg support
+-- Cross-cloud (S3, Azure)
+-- Unified governance
+-- Query acceleration

# Create BigLake connection
bq mk --connection \
    --connection_type=CLOUD_RESOURCE \
    --location=us \
    my-biglake-connection

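# Grant connection access to GCS (use the service account ID reported by
# `bq show --connection`; the address below is a placeholder)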
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:connection-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

# Create BigLake table
CREATE EXTERNAL TABLE `project.dataset.biglake_events`
WITH CONNECTION `project.us.my-biglake-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/events/*.parquet'],
  max_staleness = INTERVAL 4 HOUR,
  metadata_cache_mode = 'AUTOMATIC'
);

# Apply row-level security
CREATE ROW ACCESS POLICY sales_region
ON `project.dataset.biglake_events`
GRANT TO ("user:analyst@company.com")
FILTER USING (region = 'NORTH');

# Apply column-level security (policy tags)
ALTER TABLE `project.dataset.biglake_events`
ALTER COLUMN customer_email
SET OPTIONS (
  policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/pii']
);

BigLake vs External Tables:
+--------------------+--------------------+--------------------+
| Feature            | External Table     | BigLake Table      |
+--------------------+--------------------+--------------------+
| Row-level security | ✗                  | ✓                  |
| Column-level sec.  | ✗                  | ✓                  |
| Metadata caching   | ✗                  | ✓                  |
| Apache Iceberg     | ✗                  | ✓                  |
| Cross-cloud        | ✗                  | ✓                  |
+--------------------+--------------------+--------------------+

8. How do you secure Cloud Storage?

Security Options:

1. IAM Policies (Recommended)
# Grant bucket access
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
    --member="user:analyst@company.com" \
    --role="roles/storage.objectViewer"

# Roles available:
+-- storage.admin - Full control
+-- storage.objectAdmin - Object CRUD
+-- storage.objectViewer - Read objects
+-- storage.objectCreator - Create objects
+-- storage.legacyBucketReader - List objects + read bucket metadata

2. Uniform Bucket-Level Access
# Disable ACLs, use only IAM
gcloud storage buckets update gs://my-bucket \
    --uniform-bucket-level-access

3. VPC Service Controls
# Create perimeter
gcloud access-context-manager perimeters create my-perimeter \
    --title="Data Perimeter" \
    --resources=projects/12345 \
    --restricted-services=storage.googleapis.com \
    --policy=my-policy

4. Customer-Managed Encryption Keys (CMEK)
# Create key
gcloud kms keys create my-key \
    --location=us \
    --keyring=my-keyring \
    --purpose=encryption

# Create bucket with CMEK
gcloud storage buckets create gs://secure-bucket \
    --location=US \
    --default-encryption-key=projects/my-project/locations/us/keyRings/my-keyring/cryptoKeys/my-key

5. Data Access Logs
# Enable audit logging
gcloud projects get-iam-policy my-project > policy.yaml
# Add audit config for storage

6. Object Holds
# Temporary hold (prevent deletion)
gsutil retention temp set gs://bucket/object

# Event-based hold
gsutil retention event set gs://bucket/object

9. What is transfer service?

Storage Transfer Service:
+-- Transfer from AWS S3, Azure Blob
+-- Transfer between GCS buckets
+-- Transfer from HTTP/HTTPS sources
+-- Scheduled transfers
+-- On-premises transfers (via transfer agents; Transfer Appliance is a separate offline product)
+-- Transfer for POSIX filesystems

# Transfer from S3 to GCS
gcloud transfer jobs create \
    s3://source-bucket \
    gs://destination-bucket \
    --source-creds-file=s3-creds.json \
    --name="s3-to-gcs-daily" \
    --schedule-starts="2024-01-15T00:00:00Z" \
    --schedule-repeats-every="1d"

# Transfer within GCS (different class)
gcloud transfer jobs create \
    gs://source-bucket \
    gs://archive-bucket \
    --name="archive-old-data" \
    --include-prefixes="logs/2023/" \
    --overwrite-when="different"

# Transfer from HTTP
{
  "httpDataSource": {
    "listUrl": "https://example.com/data/manifest.txt"
  },
  "gcsDataSink": {
    "bucketName": "my-bucket",
    "path": "http-data/"
  }
}

# Python: Create transfer job
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    'project_id': 'my-project',
    'transfer_spec': {
        'aws_s3_data_source': {
            'bucket_name': 'source-bucket',
            'aws_access_key': {'access_key_id': 'key', 'secret_access_key': 'secret'}
        },
        'gcs_data_sink': {
            'bucket_name': 'dest-bucket'
        }
    },
    'schedule': {
        'schedule_start_date': {'year': 2024, 'month': 1, 'day': 15},
        'start_time_of_day': {'hours': 0, 'minutes': 0}
    },
    'status': 'ENABLED'
}

result = client.create_transfer_job({'transfer_job': transfer_job})

10. What are signed URLs?

Signed URLs provide time-limited access to objects without requiring Google account authentication.

Signed URL Types:
+-- V4 signing (recommended)
+-- V2 signing (legacy)
+-- Download URLs
+-- Upload URLs

# Generate signed URL with gsutil
gsutil signurl -d 1h service-account.json gs://bucket/object

# Python: V4 signed URL
from google.cloud import storage
from datetime import timedelta

def generate_download_url(bucket_name, blob_name, expiration_minutes=15):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=expiration_minutes),
        method="GET"
    )
    return url

def generate_upload_url(bucket_name, blob_name, expiration_minutes=15):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=expiration_minutes),
        method="PUT",
        content_type="application/octet-stream"
    )
    return url

# Upload using signed URL
import requests

upload_url = generate_upload_url('my-bucket', 'uploads/file.txt')
with open('local-file.txt', 'rb') as f:
    response = requests.put(
        upload_url,
        data=f,
        headers={'Content-Type': 'application/octet-stream'}
    )

# Signed URL for resumable upload
def generate_resumable_upload_url(bucket_name, blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(hours=1),
        method="RESUMABLE",
        content_type="application/octet-stream"
    )
    return url
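
V4 signed URLs can be valid for at most 7 days, so it helps to validate the requested lifetime before signing. The following is a small sketch of such a guard; `checked_expiration` is a hypothetical helper name, and the result is meant to be passed as the `expiration` argument in the functions above.

```python
from datetime import timedelta

# Upper bound for V4 signed URL lifetime.
MAX_V4_EXPIRATION = timedelta(days=7)


def checked_expiration(minutes):
    """Convert minutes to a timedelta, rejecting values V4 signing cannot honor."""
    expiration = timedelta(minutes=minutes)
    if expiration <= timedelta(0):
        raise ValueError("expiration must be positive")
    if expiration > MAX_V4_EXPIRATION:
        raise ValueError("V4 signed URLs cannot be valid for more than 7 days")
    return expiration


if __name__ == "__main__":
    print(checked_expiration(15))
```

Failing fast here gives a clearer error than letting the signing call or the eventual request fail.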

11. What is object versioning?

Object Versioning:
+-- Keeps previous versions of objects
+-- Protects against accidental deletion
+-- Each version has unique generation number
+-- Increases storage costs

# Enable versioning
gcloud storage buckets update gs://my-bucket --versioning

# Disable versioning
gcloud storage buckets update gs://my-bucket --no-versioning

# List all versions
gsutil ls -a gs://my-bucket/object

# Restore previous version
gsutil cp gs://my-bucket/object#1234567890 gs://my-bucket/object

# Delete specific version
gsutil rm gs://my-bucket/object#1234567890

# Delete all versions of an object (live and noncurrent)
gsutil rm -a gs://my-bucket/object

Versioning with Lifecycle:
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "isLive": false,
          "numNewerVersions": 3
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "isLive": false,
          "daysSinceNoncurrentTime": 30
        }
      }
    ]
  }
}
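
The effect of the `numNewerVersions` condition can be checked with a short local simulation. This sketch assumes that a higher generation number means a newer version and that the newest generation is the live one; `versions_to_delete` is an illustrative name, not a library function.

```python
def versions_to_delete(generations, num_newer_versions):
    """Return generations that have at least num_newer_versions newer versions.

    generations: all generation numbers of one object. The highest is assumed
    to be the live version, so with num_newer_versions >= 1 it is never returned.
    """
    ordered = sorted(generations, reverse=True)  # newest first
    # The version at index i has exactly i newer versions.
    return [gen for i, gen in enumerate(ordered) if i >= num_newer_versions]


if __name__ == "__main__":
    # Five versions, rule numNewerVersions = 3: the live version and the two
    # most recent noncurrent versions survive.
    print(versions_to_delete([101, 102, 103, 104, 105], 3))
```

With the first rule above, the bucket therefore keeps the live version plus the two most recent noncurrent versions of each object.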

# Python: Work with versions
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')

# List all versions
blobs = bucket.list_blobs(prefix='object', versions=True)
for blob in blobs:
    print(f'{blob.name} - Generation: {blob.generation}')

# Get specific version
blob = bucket.blob('object', generation=1234567890)
content = blob.download_as_bytes()  # download_as_string() is deprecated

12. How do you organize data in GCS?

Data Organization Patterns:

1. By Source/Domain
gs://datalake/
+-- source=salesforce/
|   +-- entity=accounts/
|   +-- entity=opportunities/
+-- source=ga4/
|   +-- entity=events/
+-- source=iot/
    +-- device_type=sensor/

2. By Processing Stage
gs://company-data/
+-- raw/                    # Landing zone
|   +-- source/date/
+-- processed/              # Cleaned data
|   +-- domain/date/
+-- curated/                # Analytics-ready
    +-- dataset/

3. Hive-Style Partitioning
gs://bucket/events/
+-- year=2024/
|   +-- month=01/
|   |   +-- day=15/
|   |   |   +-- data.parquet
|   |   +-- day=16/
|   +-- month=02/
+-- year=2023/

# Define an external table over the Hive layout so BigQuery can prune partitions
CREATE EXTERNAL TABLE events
WITH PARTITION COLUMNS (year INT64, month INT64, day INT64)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/events/*'],
  hive_partition_uri_prefix = 'gs://bucket/events/'
);
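
Writers need to produce exactly this `key=value` layout for partition pruning to work. Below is a minimal sketch of two helpers, one that builds the prefix for a given day and one that recovers the partition values from an object name. The function names are illustrative, and the zero-padded month and day follow the directory listing above.

```python
from datetime import date


def partition_prefix(base, day):
    """Build a Hive-style prefix such as .../year=2024/month=01/day=15/."""
    return (f"{base.rstrip('/')}/year={day.year}"
            f"/month={day.month:02d}/day={day.day:02d}/")


def parse_partition(object_name):
    """Extract key=value path segments from an object name into a dict."""
    parts = {}
    for segment in object_name.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that contain '=' are partition keys
            parts[key] = value
    return parts


if __name__ == "__main__":
    prefix = partition_prefix("gs://bucket/events/", date(2024, 1, 15))
    print(prefix)
    print(parse_partition("events/year=2024/month=01/day=15/data.parquet"))
```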

4. By Tenant (Multi-tenant)
gs://saas-data/
+-- tenant=acme/
|   +-- data/
+-- tenant=globex/
|   +-- data/
+-- _shared/
    +-- reference_data/

5. Delta Lake / Iceberg
gs://bucket/delta-table/
+-- _delta_log/
|   +-- 00000000000000000000.json
|   +-- 00000000000000000001.json
+-- part-00000-xxx.parquet

13. What is the Pub/Sub notification feature?

Cloud Storage Notifications:
+-- Pub/Sub notifications
+-- Cloud Functions triggers
+-- Eventarc integration
+-- Real-time data processing

# Create Pub/Sub notification
gsutil notification create -t my-topic -f json gs://my-bucket

# List notifications
gsutil notification list gs://my-bucket

# Delete notification
gsutil notification delete projects/_/buckets/my-bucket/notificationConfigs/1

# Event types:
+-- OBJECT_FINALIZE - Object created/overwritten
+-- OBJECT_METADATA_UPDATE - Metadata changed
+-- OBJECT_DELETE - Object deleted
+-- OBJECT_ARCHIVE - Live version became noncurrent (versioned buckets)

# Event payload example
{
  "kind": "storage#object",
  "id": "my-bucket/my-object/1234567890",
  "selfLink": "https://www.googleapis.com/storage/v1/b/my-bucket/o/my-object",
  "name": "my-object",
  "bucket": "my-bucket",
  "generation": "1234567890",
  "metageneration": "1",
  "contentType": "application/json",
  "timeCreated": "2024-01-15T00:00:00.000Z",
  "updated": "2024-01-15T00:00:00.000Z",
  "size": "1024"
}

# Cloud Function triggered by GCS
import functions_framework

@functions_framework.cloud_event
def process_file(cloud_event):
    data = cloud_event.data
    bucket = data['bucket']
    name = data['name']
    
    print(f'Processing file: gs://{bucket}/{name}')
    # Process the file...
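
When notifications are delivered through a Pub/Sub push subscription rather than a Cloud Function trigger, the handler receives an envelope whose `message.attributes` carry the event type and whose `message.data` is the base64-encoded JSON payload shown earlier. The sketch below decodes such an envelope using only the standard library; `object_uri_from_push` is a hypothetical helper name, and the sample envelope is built by hand for illustration.

```python
import base64
import json


def object_uri_from_push(envelope):
    """Return gs://bucket/name for OBJECT_FINALIZE events, else None."""
    message = envelope["message"]
    attributes = message.get("attributes", {})
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None  # ignore deletes, archives, and metadata updates
    payload = json.loads(base64.b64decode(message["data"]))
    return f"gs://{payload['bucket']}/{payload['name']}"


if __name__ == "__main__":
    payload = {"bucket": "my-bucket", "name": "data/file.csv"}
    envelope = {
        "message": {
            "attributes": {"eventType": "OBJECT_FINALIZE"},
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        }
    }
    print(object_uri_from_push(envelope))
```

Filtering on the event type first avoids decoding payloads the pipeline does not care about.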

14. What are retention policies?

Retention Policies:
+-- Minimum retention period
+-- Cannot delete/overwrite during retention
+-- Compliance and regulatory requirements
+-- Can be locked (immutable)
+-- Bucket-level or object-level

# Set bucket retention policy
gcloud storage buckets update gs://my-bucket \
    --retention-period=365d

# Lock retention policy (IRREVERSIBLE)
gcloud storage buckets update gs://my-bucket \
    --lock-retention-period

# Object holds
# Temporary hold
gsutil retention temp set gs://bucket/object

# Release temporary hold
gsutil retention temp release gs://bucket/object

# Event-based hold (default for new objects)
gcloud storage buckets update gs://my-bucket \
    --default-event-based-hold

# Python: Set retention
from google.cloud import storage
from datetime import datetime, timedelta

client = storage.Client()
bucket = client.get_bucket('my-bucket')

# Set retention policy
bucket.retention_period = 365 * 24 * 60 * 60  # 365 days in seconds
bucket.patch()

# Lock retention (careful - irreversible!)
# bucket.lock_retention_policy()

# Set object hold
blob = bucket.blob('important-file.txt')
blob.temporary_hold = True
blob.patch()

Retention vs Lifecycle:
+--------------------+--------------------+--------------------+
| Feature            | Retention          | Lifecycle          |
+--------------------+--------------------+--------------------+
| Purpose            | Prevent deletion   | Auto-delete/move   |
| Enforcement        | Hard block         | Automated action   |
| Compliance         | Yes (lockable)     | No                 |
| Reversible         | Until locked       | Yes                |
+--------------------+--------------------+--------------------+
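
The "hard block" in the table can be pictured as a simple time check. This is a local sketch of the rule, not a call to the service: an object may be deleted only after its creation time plus the retention period has passed and no hold is set. The helper names are illustrative, and event-based holds (which restart the retention clock when released) are not modeled.

```python
from datetime import datetime, timedelta, timezone


def retention_expiration(time_created, retention_period_seconds):
    """Earliest moment at which the object may be deleted or overwritten."""
    return time_created + timedelta(seconds=retention_period_seconds)


def can_delete(time_created, retention_period_seconds, now, temporary_hold=False):
    """Model the retention check: holds and unexpired retention both block deletion."""
    if temporary_hold:
        return False
    return now >= retention_expiration(time_created, retention_period_seconds)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    period = 365 * 24 * 60 * 60  # 365 days in seconds, as in the example above
    print(retention_expiration(created, period))
    print(can_delete(created, period, datetime(2024, 6, 1, tzinfo=timezone.utc)))
```

A lifecycle `Delete` rule that fires before this expiration time has no effect on the object until the retention period has been satisfied.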

15. How do you optimize storage costs?

Cost Optimization Strategies:

1. Use Autoclass for automatic tiering
gcloud storage buckets create gs://my-bucket \
    --location=US \
    --enable-autoclass

2. Lifecycle management
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
       "condition": {"age": 30}},
      {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
       "condition": {"age": 90}},
      {"action": {"type": "Delete"},
       "condition": {"age": 365}}
    ]
  }
}

3. Regional vs Multi-regional
# Single region (cheaper)
gcloud storage buckets create gs://regional-bucket -l us-central1

# Multi-region (higher availability)
gcloud storage buckets create gs://multi-bucket -l US

4. Compression
# Compress before upload
gzip large-file.csv
gsutil cp large-file.csv.gz gs://bucket/

# Or let gsutil gzip during upload (object is stored with Content-Encoding: gzip)
gsutil cp -Z large-file.csv gs://bucket/

5. Monitor usage
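# Storage Insights (command group and flags differ between gcloud releases;
# confirm with `gcloud storage insights --help`)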
gcloud storage insights datasets create my-insights \
    --location=us \
    --source-bucket=gs://my-bucket

# Query costs in BigQuery (billing export)
SELECT
  sku.description,
  SUM(cost) as total_cost
FROM `billing.gcp_billing_export`
WHERE service.description = 'Cloud Storage'
GROUP BY sku.description
ORDER BY total_cost DESC;

6. Clean up old versions
# Delete noncurrent versions older than 7 days
{
  "lifecycle": {
    "rule": [{
      "action": {"type": "Delete"},
      "condition": {"isLive": false, "daysSinceNoncurrentTime": 7}
    }]
  }
}

16. What is data quality in Dataplex?

Dataplex Data Quality:
+-- Automated quality scanning
+-- Custom quality rules
+-- Quality scores
+-- Integration with Data Catalog
+-- Alerts and notifications

# Create data quality scan
gcloud dataplex datascans create data-quality my-scan \
    --location=us-central1 \
    --data-source-resource="//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table" \
    --data-quality-spec-file=rules.yaml

# rules.yaml
rules:
  - dimension: COMPLETENESS
    column: email
    threshold: 0.95
    nonNullExpectation: {}

  - dimension: VALIDITY
    column: age
    threshold: 0.99
    rangeExpectation:
      minValue: "0"
      maxValue: "150"

  - dimension: UNIQUENESS
    column: customer_id
    threshold: 1.0
    uniquenessExpectation: {}

  - dimension: VALIDITY
    column: email
    threshold: 0.95
    regexExpectation:
      regex: "^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$"

# Run scan
gcloud dataplex datascans run my-scan \
    --location=us-central1

# View results
gcloud dataplex datascans describe my-scan \
    --location=us-central1 \
    --view=FULL

Data Quality Dimensions:
+-- COMPLETENESS - Non-null values
+-- UNIQUENESS - Unique values
+-- VALIDITY - Format/range checks
+-- ACCURACY - Reference data checks
+-- CONSISTENCY - Cross-table checks
+-- FRESHNESS - Data freshness (timeliness)
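
What a rule threshold means can be shown with a small local evaluator: each rule computes the fraction of rows that pass its check and compares that ratio with the threshold. This is a simplified model for intuition, not how Dataplex executes scans; the rule dictionary shape, the sample rows, and the function names are assumptions made for the example.

```python
def pass_ratio(values, predicate):
    """Fraction of values for which predicate is true (1.0 for empty input)."""
    values = list(values)
    if not values:
        return 1.0
    return sum(1 for v in values if predicate(v)) / len(values)


def evaluate(rows, rules):
    """Return {rule name: (ratio, passed)} where passed means ratio >= threshold."""
    results = {}
    for rule in rules:
        column_values = [row.get(rule["column"]) for row in rows]
        ratio = pass_ratio(column_values, rule["check"])
        results[rule["name"]] = (ratio, ratio >= rule["threshold"])
    return results


ROWS = [
    {"email": "a@example.com", "age": 30},
    {"email": None, "age": 41},
    {"email": "b@example.com", "age": 29},
    {"email": "c@example.com", "age": 55},
]

RULES = [
    # COMPLETENESS: at least 95% of emails must be non-null
    {"name": "email_not_null", "column": "email", "threshold": 0.95,
     "check": lambda v: v is not None},
    # VALIDITY: at least 99% of ages must fall in [0, 150]
    {"name": "age_in_range", "column": "age", "threshold": 0.99,
     "check": lambda v: v is not None and 0 <= v <= 150},
]

if __name__ == "__main__":
    for name, (ratio, passed) in evaluate(ROWS, RULES).items():
        print(name, ratio, "PASS" if passed else "FAIL")
```

In the sample data one of four emails is null, so the completeness rule scores 0.75 and fails its 0.95 threshold, while the range rule passes.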

17. What are Dataplex lakes and zones?

Dataplex Hierarchy:
Lake --> Zone --> Asset

Lake:
+-- Logical container for data
+-- Represents business domain
+-- Unified governance
+-- Example: "Sales Data Lake"

Zone:
+-- Subdivision within lake
+-- RAW or CURATED type
+-- Example: "Raw Zone", "Analytics Zone"

Asset:
+-- Data resources (GCS, BigQuery)
+-- Metadata discovery
+-- Example: "Sales Events Bucket"

# Create complete structure
# 1. Create lake
gcloud dataplex lakes create sales-lake \
    --location=us-central1 \
    --display-name="Sales Data Lake"

# 2. Create raw zone
gcloud dataplex zones create raw-zone \
    --lake=sales-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION \
    --display-name="Raw Data Zone"

# 3. Create curated zone
gcloud dataplex zones create curated-zone \
    --lake=sales-lake \
    --location=us-central1 \
    --type=CURATED \
    --resource-location-type=SINGLE_REGION \
    --display-name="Curated Analytics Zone"

# 4. Add GCS asset to raw zone
gcloud dataplex assets create raw-events \
    --lake=sales-lake \
    --zone=raw-zone \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/sales-raw-events \
    --discovery-enabled

# 5. Add BigQuery asset to curated zone
gcloud dataplex assets create sales-analytics \
    --lake=sales-lake \
    --zone=curated-zone \
    --location=us-central1 \
    --resource-type=BIGQUERY_DATASET \
    --resource-name=projects/my-project/datasets/sales_analytics

Zone Types:
+-- RAW - Landing zone for raw data
|   +-- Any format accepted
|   +-- Schema discovery enabled
+-- CURATED - Processed/analytics data
    +-- Structured data required
    +-- Higher quality standards





18. How do you implement data lineage?

Data Lineage in GCP:
+-- Data Catalog lineage (automatic)
+-- Dataplex lineage
+-- BigQuery lineage
+-- Dataproc lineage
+-- Custom lineage via API

# Automatic lineage captured from:
+-- BigQuery queries
+-- Dataproc Spark jobs
+-- Dataflow pipelines
+-- Cloud Composer (Airflow)
+-- Data Fusion pipelines

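# Look up the Data Catalog entry for a table (the lineage graph itself is
# shown in the console or retrieved through the Data Lineage API)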
gcloud data-catalog entries lookup \
    '//bigquery.googleapis.com/projects/my-project/datasets/analytics/tables/summary'

# Custom lineage via API
from datetime import datetime, timezone
from google.cloud import datacatalog_lineage_v1

client = datacatalog_lineage_v1.LineageClient()

# project and location are placeholders for your own values
project, location = "my-project", "us"
parent = f"projects/{project}/locations/{location}"

# Create process (transformation); the service assigns the resource name
process = client.create_process(
    parent=parent,
    process=datacatalog_lineage_v1.Process(display_name="Daily ETL Job")
)

# Create run (keep the returned object, it carries the assigned name)
now = datetime.now(timezone.utc)
run = client.create_run(
    parent=process.name,
    run=datacatalog_lineage_v1.Run(
        start_time=now,
        end_time=now,
        state=datacatalog_lineage_v1.Run.State.COMPLETED
    )
)

# Create lineage event linking source to target
event = datacatalog_lineage_v1.LineageEvent(
    start_time=now,
    links=[
        datacatalog_lineage_v1.EventLink(
            source=datacatalog_lineage_v1.EntityReference(
                fully_qualified_name="bigquery:my-project.raw.events"
            ),
            target=datacatalog_lineage_v1.EntityReference(
                fully_qualified_name="bigquery:my-project.analytics.summary"
            )
        )
    ]
)

client.create_lineage_event(parent=run.name, lineage_event=event)

Lineage Visualization:
+-----------+     +-----------+     +-----------+
|  Raw      |---->|  Staging  |---->|  Analytics|
|  Events   |     |  Table    |     |  Summary  |
|  (GCS)    |     |  (BQ)     |     |  (BQ)     |
+-----------+     +-----------+     +-----------+
              ▲                  ▲
              |                  |
        Dataflow Job       BigQuery Query
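
Conceptually, a lineage graph is a set of (source, target) links, and the usual question is "which assets feed this table?". The sketch below answers that with a breadth-first walk over links held in memory. It is a model of the idea, not a call to the Lineage API, and the asset names and the `upstream` function are made up for illustration.

```python
from collections import defaultdict, deque


def upstream(links, target):
    """Return every asset that directly or indirectly feeds target."""
    parents = defaultdict(set)
    for source, dest in links:
        parents[dest].add(source)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:  # guards against cycles and repeats
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    links = [
        ("gcs:raw_events", "bq:staging_table"),      # Dataflow job
        ("bq:staging_table", "bq:analytics_summary"),  # BigQuery query
    ]
    print(sorted(upstream(links, "bq:analytics_summary")))
```

The same traversal run in the opposite direction gives impact analysis: which downstream tables are affected when a source changes.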

19. What is Analytics Hub?

Analytics Hub is a data exchange platform for sharing and discovering data assets securely.

Analytics Hub Features:
+-- Data exchanges (public/private)
+-- Linked datasets (no data copy)
+-- Fine-grained access control
+-- Data commercialization
+-- Cross-org data sharing

# Create data exchange
gcloud bigquery analytics-hubs data-exchanges create my-exchange \
    --location=us \
    --display-name="Company Data Exchange"

# Create listing
gcloud bigquery analytics-hubs listings create sales-data \
    --data-exchange=my-exchange \
    --location=us \
    --display-name="Sales Data" \
    --bigquery-dataset=projects/my-project/datasets/shared_sales

# Subscribe to listing
gcloud bigquery analytics-hubs subscriptions create sales-sub \
    --listing=projects/publisher/locations/us/dataExchanges/exchange/listings/sales-data \
    --destination-dataset=projects/my-project/datasets/subscribed_sales

Sharing Models:
+-------------------------------------------------------------+
|  Publisher                         Subscriber               |
|  +---------+                      +---------+              |
|  | Source  |                      | Linked  |              |
|  | Dataset |  ---- Listing ---->  | Dataset |              |
|  +---------+                      +---------+              |
|                                                             |
|  - Data stays in publisher project                         |
|  - Subscriber queries via linked dataset                   |
|  - No data duplication                                     |
|  - Publisher controls access                               |
+-------------------------------------------------------------+

Use Cases:
+-- Internal data marketplace
+-- Partner data sharing
+-- Data monetization
+-- Open data publishing
+-- Cross-department sharing

20. What are Cloud Storage best practices?

Best Practices:

1. Naming Conventions
# Good: Descriptive, partitioned
gs://company-datalake-prod/domain=sales/year=2024/month=01/data.parquet

# Bad: Generic, flat
gs://bucket123/file.csv

2. Security
+-- Enable Uniform bucket-level access
+-- Use IAM instead of ACLs
+-- Enable audit logging
+-- Use CMEK for sensitive data
+-- Implement VPC Service Controls
+-- Use signed URLs for external access

3. Performance
+-- Use composite objects for large files
+-- Use parallel uploads
+-- Avoid sequential naming (use hashing)
+-- Use regional buckets for compute colocation
+-- Enable CDN for public content

# Parallel upload
gsutil -m cp -r local-dir gs://bucket/

# Avoid sequential keys
# Bad: 2024-01-15-00001.json, 2024-01-15-00002.json
# Good: hash-abc123.json, hash-def456.json
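
One way to get the "good" names above is to prepend a short hash of the original name, which spreads sequential uploads across the key space while keeping the readable name. This is a minimal sketch; `hashed_object_name` and the six-character prefix length are illustrative choices, and MD5 is used here only for distribution, not for security.

```python
import hashlib


def hashed_object_name(name, prefix_len=6):
    """Prefix name with the first prefix_len hex characters of its MD5 digest."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{name}"


if __name__ == "__main__":
    for original in ("2024-01-15-00001.json", "2024-01-15-00002.json"):
        print(hashed_object_name(original))
```

Because the prefix is derived from the name, readers can recompute it and locate an object without a lookup table.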

4. Cost Optimization
+-- Use Autoclass for variable access patterns
+-- Implement lifecycle policies
+-- Clean up incomplete uploads
+-- Monitor with storage insights
+-- Use appropriate storage class

5. Data Organization
+-- Use consistent prefixes
+-- Implement Hive-style partitioning
+-- Separate raw/processed/curated
+-- Document bucket purposes
+-- Use Dataplex for governance

6. Reliability
+-- Enable versioning for critical data
+-- Use multi-region for high availability
+-- Implement retention policies
+-- Set up monitoring alerts
+-- Regular backup verification

# Complete bucket setup
gcloud storage buckets create gs://prod-datalake \
    --location=US \
    --uniform-bucket-level-access \
    --enable-autoclass \
    --public-access-prevention \
    --soft-delete-duration=7d
