Top 20 GCP Cloud Storage & Data Lake Interview Questions

1. What is Cloud Storage?

Cloud Storage Features:
+-- Object storage (unstructured data)
+-- 99.999999999% (11 nines) durability
+-- Global accessibility
+-- Multiple storage classes
+-- Strong consistency
+-- Unlimited storage

Key Concepts:
+-- Buckets - Container for objects
+-- Objects - Files with metadata
+-- Regions/Multi-regions - Data location
+-- ACLs/IAM - Access control

# Create bucket
gsutil mb -p my-project -c STANDARD -l US-CENTRAL1 gs://my-bucket/

# Or with gcloud
gcloud storage buckets create gs://my-bucket \
    --project=my-project \
    --location=US-CENTRAL1 \
    --uniform-bucket-level-access

# Upload file
gsutil cp local-file.csv gs://my-bucket/data/

# Download file
gsutil cp gs://my-bucket/data/file.csv ./

# List objects
gsutil ls gs://my-bucket/data/

# Python client
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')

# Upload
blob = bucket.blob('data/file.csv')
blob.upload_from_filename('local-file.csv')

# Download
blob = bucket.blob('data/file.csv')
blob.download_to_filename('downloaded.csv')

# Read directly into memory (download_as_string is a deprecated alias)
content = blob.download_as_bytes()

2. What are storage classes?

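+----------+------------------+--------------+----------------+
| Class    | Use Case         | Min Duration | Retrieval Cost |
+----------+------------------+--------------+----------------+
| Standard | Frequent access  | None         | None           |
| Nearline | Monthly access   | 30 days      | $0.01/GB       |
| Coldline | Quarterly access | 90 days      | $0.02/GB       |
| Archive  | Yearly access    | 365 days     | $0.05/GB       |
+----------+------------------+--------------+----------------+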

Storage Class Selection:

# Create bucket with storage class
gsutil mb -c NEARLINE -l US gs://archive-bucket/

# Change object storage class
gsutil rewrite -s COLDLINE gs://bucket/object

# Set default storage class
gsutil defstorageclass set NEARLINE gs://bucket/

Autoclass (Automatic Tiering):
# Automatically moves objects based on access patterns
gcloud storage buckets create gs://my-bucket \
    --location=US \
    --autoclass

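Autoclass Behavior (Coldline and Archive transitions apply only when the terminal storage class is ARCHIVE; the default terminal class is NEARLINE):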
+-------------------------------------------------------------+
|  Object Access Pattern     |    Storage Class Transition    |
+-------------------------------------------------------------+
|  Frequently accessed       |    Standard                    |
|  Not accessed 30 days      |    Nearline                    |
|  Not accessed 90 days      |    Coldline                    |
|  Not accessed 365 days     |    Archive                     |
|  Accessed again            |    Back to Standard            |
+-------------------------------------------------------------+
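
The transition table above can be sketched as a small lookup. This is an illustration only, not how the service is implemented: the function name `autoclass_target` is hypothetical, and it assumes the bucket's terminal storage class is NEARLINE unless ARCHIVE is passed explicitly.

```python
# Idle-day thresholds taken from the table above, coldest first.
TRANSITIONS = [(365, "ARCHIVE"), (90, "COLDLINE"), (30, "NEARLINE")]
COLDNESS = ["STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE"]


def autoclass_target(days_since_access, terminal_class="NEARLINE"):
    """Return the class an object would sit in after `days_since_access` idle days."""
    target = "STANDARD"
    for threshold, storage_class in TRANSITIONS:
        if days_since_access >= threshold:
            target = storage_class
            break
    # Never go colder than the bucket's terminal storage class.
    if COLDNESS.index(target) > COLDNESS.index(terminal_class):
        target = terminal_class
    return target
```

An access resets the idle counter to zero, which is why a re-read object lands back in Standard.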

Approximate at-rest pricing (US regions; varies by location and changes over time):
+-- Standard: $0.020/GB/month
+-- Nearline: $0.010/GB/month
+-- Coldline: $0.004/GB/month
+-- Archive: $0.0012/GB/month
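
To see how at-rest price and retrieval fees trade off, here is a rough estimator. It uses only the example figures quoted in this section; the names `monthly_cost` and `cheapest_class` are hypothetical, and operation charges, early-deletion fees, and network egress are ignored.

```python
# Example prices from this section (USD per GB, US regions); not authoritative.
STORAGE_PRICE_PER_GB = {"STANDARD": 0.020, "NEARLINE": 0.010,
                        "COLDLINE": 0.004, "ARCHIVE": 0.0012}
RETRIEVAL_PRICE_PER_GB = {"STANDARD": 0.0, "NEARLINE": 0.01,
                          "COLDLINE": 0.02, "ARCHIVE": 0.05}


def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Estimate one month's cost: at-rest storage plus retrieval fees."""
    at_rest = STORAGE_PRICE_PER_GB[storage_class] * stored_gb
    retrieval = RETRIEVAL_PRICE_PER_GB[storage_class] * retrieved_gb
    return round(at_rest + retrieval, 4)


def cheapest_class(stored_gb, retrieved_gb):
    """Return the class with the lowest estimated monthly cost."""
    return min(STORAGE_PRICE_PER_GB,
               key=lambda c: monthly_cost(c, stored_gb, retrieved_gb))
```

With 1000 GB stored and 100 GB read per month, Coldline comes out cheapest under these figures; reading 2000 GB per month tips the balance back to Standard.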

3. What is object lifecycle management?

Lifecycle management automates object transitions and deletions based on rules.

Lifecycle Rules:

# lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
        "condition": {"age": 365, "matchesStorageClass": ["COLDLINE"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 730}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"isLive": false, "numNewerVersions": 3}
      },
      {
        "action": {"type": "AbortIncompleteMultipartUpload"},
        "condition": {"age": 7}
      }
    ]
  }
}

# Apply lifecycle
gsutil lifecycle set lifecycle.json gs://my-bucket/

# View lifecycle
gsutil lifecycle get gs://my-bucket/

Conditions Available:
+-- age - Days since creation
+-- createdBefore - Created before date
+-- isLive - Current vs noncurrent (versioned)
+-- numNewerVersions - Version count
+-- matchesStorageClass - Current class
+-- matchesPrefix - Object name prefix
+-- matchesSuffix - Object name suffix
+-- daysSinceCustomTime - Custom metadata time

Actions Available:
+-- Delete - Remove object
+-- SetStorageClass - Change class
+-- AbortIncompleteMultipartUpload - Clean up
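
A simplified local model of rule evaluation can help when reasoning about which action fires. The sketch below handles only a subset of conditions (age, matchesStorageClass, isLive, matchesPrefix); the object dictionary shape and the names `rule_matches` and `pick_action` are assumptions made for illustration. It follows the documented precedence: Delete wins over SetStorageClass, and among SetStorageClass actions the colder (cheaper at rest) class wins.

```python
CLASS_ORDER = ["STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE"]  # warmest to coldest

RULES = [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}},
    {"action": {"type": "Delete"}, "condition": {"age": 730}},
]


def rule_matches(condition, obj):
    """All conditions in one rule must hold (logical AND)."""
    if "age" in condition and obj["age_days"] < condition["age"]:
        return False
    if ("matchesStorageClass" in condition
            and obj["storage_class"] not in condition["matchesStorageClass"]):
        return False
    if "isLive" in condition and obj["is_live"] != condition["isLive"]:
        return False
    if "matchesPrefix" in condition and not any(
            obj["name"].startswith(p) for p in condition["matchesPrefix"]):
        return False
    return True


def pick_action(rules, obj):
    """Return the single action that would apply to `obj`, or None."""
    matched = [r["action"] for r in rules if rule_matches(r["condition"], obj)]
    deletes = [a for a in matched if a["type"] == "Delete"]
    if deletes:
        return deletes[0]  # Delete takes precedence
    moves = [a for a in matched if a["type"] == "SetStorageClass"]
    if moves:
        return max(moves, key=lambda a: CLASS_ORDER.index(a["storageClass"]))
    return None
```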

4. What is Dataplex?

Dataplex is an intelligent data fabric that unifies distributed data and automates data management.

Dataplex Architecture:
+-------------------------------------------------------------+
|                       Dataplex                               |
+-------------------------------------------------------------+
|  +-----------------------------------------------------+   |
|  |                    Lake                              |   |
|  |  +-------------+  +-------------+  +-------------+ |   |
|  |  |  Raw Zone   |  | Curated Zone |  | Consumption | |   |
|  |  |  (Landing)  |  |  (Refined)   |  |    Zone     | |   |
|  |  +------+------+  +------+------+  +------+------+ |   |
|  |         |                |                |        |   |
|  |    +----+----+     +----+----+     +----+----+   |   |
|  |    |  Asset  |     |  Asset  |     |  Asset  |   |   |
|  |    |  (GCS)  |     | (BigQuery)|    |  (BQ)  |   |   |
|  |    +---------+     +---------+     +---------+   |   |
|  +-----------------------------------------------------+   |
|                                                              |
|  Features:                                                   |
|  +-- Unified governance                                     |
|  +-- Automated data discovery                               |
|  +-- Data quality management                                |
|  +-- Security & policies                                    |
|  +-- Metadata management                                    |
+-------------------------------------------------------------+

# Create Dataplex lake
gcloud dataplex lakes create my-lake \
    --location=us-central1 \
    --display-name="My Data Lake"

# Create zone
gcloud dataplex zones create raw-zone \
    --lake=my-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION \
    --display-name="Raw Data Zone"

# Add asset (GCS bucket)
gcloud dataplex assets create raw-data-asset \
    --lake=my-lake \
    --zone=raw-zone \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/raw-data-bucket \
    --discovery-enabled

# Add BigQuery dataset as asset
gcloud dataplex assets create curated-data-asset \
    --lake=my-lake \
    --zone=curated-zone \
    --location=us-central1 \
    --resource-type=BIGQUERY_DATASET \
    --resource-name=projects/my-project/datasets/curated

5. What is Data Catalog?

Data Catalog is a fully managed metadata management service for discovering and managing data.

Data Catalog Features:
+-- Automatic metadata discovery
+-- Custom metadata (tags)
+-- Data lineage tracking
+-- Unified search
+-- Policy tags for security
+-- Business glossary

# Search data assets (a search scope such as --include-project-ids is required)
gcloud data-catalog search "type=table AND system=bigquery" \
    --include-project-ids=my-project

# Create tag template
gcloud data-catalog tag-templates create data-quality-template \
    --location=us-central1 \
    --display-name="Data Quality" \
    --field=id=owner,display-name="Data Owner",type=string,required=true \
    --field=id=quality_score,display-name="Quality Score",type=double \
    --field=id=pii_flag,display-name="Contains PII",type=bool

# Create entry for external data
gcloud data-catalog entries create my-external-data \
    --entry-group=my-group \
    --location=us-central1 \
    --display-name="External Sales Data" \
    --type=FILESET \
    --gcs-file-patterns="gs://bucket/sales/*"

# Python: Add tag to BigQuery table
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Placeholder identifiers; replace with your own
project, dataset, table = "my-project", "my_dataset", "my_table"

# Look up entry
resource_name = f"//bigquery.googleapis.com/projects/{project}/datasets/{dataset}/tables/{table}"
entry = client.lookup_entry(request={"linked_resource": resource_name})

# Create tag
tag = datacatalog_v1.Tag()
tag.template = f"projects/{project}/locations/us-central1/tagTemplates/data-quality-template"
tag.fields["owner"] = datacatalog_v1.TagField(string_value="data-team@company.com")
tag.fields["quality_score"] = datacatalog_v1.TagField(double_value=0.95)
tag.fields["pii_flag"] = datacatalog_v1.TagField(bool_value=True)

client.create_tag(parent=entry.name, tag=tag)

6. How do you implement data lake architecture?

Data Lake Architecture on GCP:
+-------------------------------------------------------------+
|                    GCP Data Lake                             |
+-------------------------------------------------------------+
|                                                              |
|  +------------------------------------------------------+  |
|  |                   Ingestion Layer                     |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  |  | Pub/Sub| |Dataflow| |Transfer| | Cloud  |        |  |
|  |  |        | |        | |Service | |Functions|       |  |
|  |  +---+----+ +---+----+ +---+----+ +---+----+        |  |
|  +------+----------+----------+----------+--------------+  |
|         +----------+----------+----------+                  |
|                          |                                  |
|                          v                                  |
|  +------------------------------------------------------+  |
|  |                   Storage Layer                       |  |
|  |  +-------------+  +-------------+  +-------------+  |  |
|  |  |  Raw Zone   |  | Processed   |  |  Curated    |  |  |
|  |  |  (Landing)  |->|    Zone     |->|    Zone     |  |  |
|  |  |    GCS      |  |    GCS      |  |  BigQuery   |  |  |
|  |  +-------------+  +-------------+  +-------------+  |  |
|  +------------------------------------------------------+  |
|                          |                                  |
|                          v                                  |
|  +------------------------------------------------------+  |
|  |                 Processing Layer                      |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  |  |Dataproc| |Dataflow| |Spark   | |BigQuery|        |  |
|  |  |        | |        | |Serverls| |   ML   |        |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  +------------------------------------------------------+  |
|                          |                                  |
|                          v                                  |
|  +------------------------------------------------------+  |
|  |                 Consumption Layer                     |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  |  | Looker | | Vertex | | Data   | |  APIs  |        |  |
|  |  | Studio | |   AI   | | Studio | |        |        |  |
|  |  +--------+ +--------+ +--------+ +--------+        |  |
|  +------------------------------------------------------+  |
+-------------------------------------------------------------+

Zone Organization:
gs://datalake-raw/           # Landing zone (raw data)
+-- source=salesforce/
+-- source=ga4/
+-- source=iot/

gs://datalake-processed/     # Processed zone
+-- domain=sales/
+-- domain=marketing/
+-- domain=operations/

project.curated.*            # BigQuery curated datasets
+-- sales_analytics
+-- customer_360
+-- operational_metrics

7. What is BigLake?

BigLake provides unified fine-grained access control for data lakes across Cloud Storage and BigQuery.

BigLake Features:
+-- Fine-grained security on GCS data
+-- Row and column-level security
+-- Apache Iceberg support
+-- Cross-cloud (S3, Azure)
+-- Unified governance
+-- Query acceleration

# Create BigLake connection
bq mk --connection \
    --connection_type=CLOUD_RESOURCE \
    --location=us \
    my-biglake-connection

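# Grant the connection's service account read access to GCS
# (find the real account with: bq show --connection my-project.us.my-biglake-connection)
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:connection-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"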

# Create BigLake table
CREATE EXTERNAL TABLE `project.dataset.biglake_events`
WITH CONNECTION `project.us.my-biglake-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/events/*.parquet'],
  max_staleness = INTERVAL 4 HOUR,  -- metadata caching needs a staleness interval
  metadata_cache_mode = 'AUTOMATIC'
);

# Apply row-level security
CREATE ROW ACCESS POLICY sales_region
ON `project.dataset.biglake_events`
GRANT TO ("user:analyst@company.com")
FILTER USING (region = 'NORTH');

# Apply column-level security (policy tags)
ALTER TABLE `project.dataset.biglake_events`
ALTER COLUMN customer_email
SET OPTIONS (
  policy_tags = ['projects/my-project/locations/us/taxonomies/123/policyTags/pii']
);

BigLake vs External Tables:
+--------------------+--------------------+--------------------+
| Feature            | External Table     | BigLake Table      |
+--------------------+--------------------+--------------------+
| Row-level security | ✗                  | ✓                  |
| Column-level sec.  | ✗                  | ✓                  |
| Metadata caching   | ✗                  | ✓                  |
| Apache Iceberg     | ✗                  | ✓                  |
| Cross-cloud        | ✗                  | ✓                  |
+--------------------+--------------------+--------------------+

8. How do you secure Cloud Storage?

Security Options:

1. IAM Policies (Recommended)
# Grant bucket access
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
    --member="user:analyst@company.com" \
    --role="roles/storage.objectViewer"

# Roles available:
+-- storage.admin - Full control
+-- storage.objectAdmin - Object CRUD
+-- storage.objectViewer - Read objects
+-- storage.objectCreator - Create objects
+-- storage.legacyBucketReader - List objects + read bucket metadata

2. Uniform Bucket-Level Access
# Disable ACLs, use only IAM
gcloud storage buckets update gs://my-bucket \
    --uniform-bucket-level-access

3. VPC Service Controls
# Create perimeter
gcloud access-context-manager perimeters create my-perimeter \
    --title="Data Perimeter" \
    --resources=projects/12345 \
    --restricted-services=storage.googleapis.com \
    --policy=my-policy

4. Customer-Managed Encryption Keys (CMEK)
# Create key
gcloud kms keys create my-key \
    --location=us \
    --keyring=my-keyring \
    --purpose=encryption

# Create bucket with CMEK
gcloud storage buckets create gs://secure-bucket \
    --location=US \
    --default-encryption-key=projects/my-project/locations/us/keyRings/my-keyring/cryptoKeys/my-key

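5. Data Access Logs
# Enable Data Access audit logs: export the IAM policy, add an auditConfigs
# entry for storage.googleapis.com (DATA_READ, DATA_WRITE), then re-apply it
gcloud projects get-iam-policy my-project > policy.yaml
gcloud projects set-iam-policy my-project policy.yaml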

6. Object Holds
# Temporary hold (prevent deletion)
gsutil retention temp set gs://bucket/object

# Event-based hold
gsutil retention event set gs://bucket/object

9. What is transfer service?

Storage Transfer Service:
+-- Transfer from AWS S3, Azure Blob
+-- Transfer between GCS buckets
+-- Transfer from HTTP/HTTPS sources
+-- Scheduled transfers
+-- On-premises transfers (agent-based; Transfer Appliance is a separate offline option)
+-- Transfer for POSIX filesystems

# Transfer from S3 to GCS
gcloud transfer jobs create \
    s3://source-bucket \
    gs://destination-bucket \
    --source-creds-file=s3-creds.json \
    --name="s3-to-gcs-daily" \
    --schedule-starts="2024-01-15T00:00:00Z" \
    --schedule-repeats-every="1d"

# Transfer within GCS (different class)
gcloud transfer jobs create \
    gs://source-bucket \
    gs://archive-bucket \
    --name="archive-old-data" \
    --include-prefixes="logs/2023/" \
    --overwrite-when="different"

# Transfer from HTTP
{
  "httpDataSource": {
    "listUrl": "https://example.com/data/manifest.txt"
  },
  "gcsDataSink": {
    "bucketName": "my-bucket",
    "path": "http-data/"
  }
}

# Python: Create transfer job
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    'project_id': 'my-project',
    'transfer_spec': {
        'aws_s3_data_source': {
            'bucket_name': 'source-bucket',
            'aws_access_key': {'access_key_id': 'key', 'secret_access_key': 'secret'}
        },
        'gcs_data_sink': {
            'bucket_name': 'dest-bucket'
        }
    },
    'schedule': {
        'schedule_start_date': {'year': 2024, 'month': 1, 'day': 15},
        'start_time_of_day': {'hours': 0, 'minutes': 0}
    },
    'status': 'ENABLED'
}

result = client.create_transfer_job({'transfer_job': transfer_job})

10. What are signed URLs?

Signed URLs provide time-limited access to objects without requiring Google account authentication.

Signed URL Types:
+-- V4 signing (recommended)
+-- V2 signing (legacy)
+-- Download URLs
+-- Upload URLs

# Generate signed URL with gsutil
gsutil signurl -d 1h service-account.json gs://bucket/object

# Python: V4 signed URL
from google.cloud import storage
from datetime import timedelta

def generate_download_url(bucket_name, blob_name, expiration_minutes=15):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=expiration_minutes),
        method="GET"
    )
    return url

def generate_upload_url(bucket_name, blob_name, expiration_minutes=15):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=expiration_minutes),
        method="PUT",
        content_type="application/octet-stream"
    )
    return url

# Upload using signed URL
import requests

upload_url = generate_upload_url('my-bucket', 'uploads/file.txt')
with open('local-file.txt', 'rb') as f:
    response = requests.put(
        upload_url,
        data=f,
        headers={'Content-Type': 'application/octet-stream'}
    )

# Signed URL for resumable upload
def generate_resumable_upload_url(bucket_name, blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(hours=1),
        method="RESUMABLE",
        content_type="application/octet-stream"
    )
    return url
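
When debugging a rejected signed URL, it helps to work out when it expires. The sketch below reads the `X-Goog-Date` and `X-Goog-Expires` query parameters that V4 signed URLs carry; the helper names `signed_url_expiry` and `is_expired` are hypothetical, and the signature itself is not verified.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

MAX_V4_EXPIRES_SECONDS = 7 * 24 * 60 * 60  # V4 signed URLs last at most 7 days


def signed_url_expiry(url):
    """Return the UTC datetime at which a V4 signed URL stops working."""
    query = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(
        query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = int(query["X-Goog-Expires"][0])
    if lifetime > MAX_V4_EXPIRES_SECONDS:
        raise ValueError("X-Goog-Expires exceeds the 7-day V4 limit")
    return signed_at + timedelta(seconds=lifetime)


def is_expired(url, now=None):
    """True once the current time (or `now`) has reached the expiry time."""
    now = now or datetime.now(timezone.utc)
    return now >= signed_url_expiry(url)
```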

11. What is object versioning?

Object Versioning:
+-- Keeps previous versions of objects
+-- Protects against accidental deletion
+-- Each version has unique generation number
+-- Increases storage costs

# Enable versioning
gcloud storage buckets update gs://my-bucket --versioning

# Disable versioning
gcloud storage buckets update gs://my-bucket --no-versioning

# List all versions
gsutil ls -a gs://my-bucket/object

# Restore previous version
gsutil cp gs://my-bucket/object#1234567890 gs://my-bucket/object

# Delete specific version
gsutil rm gs://my-bucket/object#1234567890

# Delete all versions of an object (live and noncurrent)
gsutil rm -a gs://my-bucket/object

Versioning with Lifecycle:
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {
          "isLive": false,
          "numNewerVersions": 3
        }
      },
      {
        "action": {"type": "Delete"},
        "condition": {
          "isLive": false,
          "daysSinceNoncurrentTime": 30
        }
      }
    ]
  }
}

# Python: Work with versions
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')

# List all versions
blobs = bucket.list_blobs(prefix='object', versions=True)
for blob in blobs:
    print(f'{blob.name} - Generation: {blob.generation}')

# Get specific version
blob = bucket.blob('object', generation=1234567890)
content = blob.download_as_bytes()  # download_as_string is a deprecated alias
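
The `numNewerVersions` lifecycle condition is easy to misread, so here is a small model of it. It assumes the object still has a live version and that newer versions have higher generation numbers; the function name `versions_to_delete` is hypothetical.

```python
def versions_to_delete(generations, num_newer_versions):
    """Return noncurrent generations that have at least N newer versions.

    `generations` lists every version of one object; the highest is live.
    """
    ordered = sorted(generations, reverse=True)  # newest first
    doomed = []
    for newer_count, generation in enumerate(ordered):
        is_live = newer_count == 0  # nothing is newer than the live version
        if not is_live and newer_count >= num_newer_versions:
            doomed.append(generation)
    return doomed
```

With `numNewerVersions` set to 3, the live version and the two most recent noncurrent versions survive; anything older is removed.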

12. How do you organize data in GCS?

Data Organization Patterns:

1. By Source/Domain
gs://datalake/
+-- source=salesforce/
|   +-- entity=accounts/
|   +-- entity=opportunities/
+-- source=ga4/
|   +-- entity=events/
+-- source=iot/
    +-- device_type=sensor/

2. By Processing Stage
gs://company-data/
+-- raw/                    # Landing zone
|   +-- source/date/
+-- processed/              # Cleaned data
|   +-- domain/date/
+-- curated/                # Analytics-ready
    +-- dataset/

3. Hive-Style Partitioning
gs://bucket/events/
+-- year=2024/
|   +-- month=01/
|   |   +-- day=15/
|   |   |   +-- data.parquet
|   |   +-- day=16/
|   +-- month=02/
+-- year=2023/

# Hive-partitioned external table in BigQuery (filters on year/month/day prune partitions)
CREATE EXTERNAL TABLE mydataset.events
WITH PARTITION COLUMNS (year INT64, month INT64, day INT64)
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/events/*'],
  hive_partition_uri_prefix = 'gs://bucket/events/'
);
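
Partition pruning on a Hive-style layout boils down to reading `key=value` segments out of the object name. The sketch below shows the idea on plain strings; `parse_partitions` and `prune` are hypothetical helper names, and values are compared as strings, so `month="01"` is not the same as `month=1`.

```python
def parse_partitions(object_name):
    """Extract key=value path segments, e.g. 'events/year=2024/month=01/...'."""
    parts = {}
    for segment in object_name.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(object_names, **wanted):
    """Keep only objects whose partition values match every filter in `wanted`."""
    kept = []
    for name in object_names:
        parts = parse_partitions(name)
        if all(parts.get(key) == str(value) for key, value in wanted.items()):
            kept.append(name)
    return kept
```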

4. By Tenant (Multi-tenant)
gs://saas-data/
+-- tenant=acme/
|   +-- data/
+-- tenant=globex/
|   +-- data/
+-- _shared/
    +-- reference_data/

5. Delta Lake / Iceberg
gs://bucket/delta-table/
+-- _delta_log/
|   +-- 00000000000000000000.json
|   +-- 00000000000000000001.json
+-- part-00000-xxx.parquet

13. What is the Pub/Sub notification feature?

Cloud Storage Notifications:
+-- Pub/Sub notifications
+-- Cloud Functions triggers
+-- Eventarc integration
+-- Real-time data processing

# Create Pub/Sub notification
gsutil notification create -t my-topic -f json gs://my-bucket

# List notifications
gsutil notification list gs://my-bucket

# Delete notification
gsutil notification delete projects/_/buckets/my-bucket/notificationConfigs/1

# Event types:
+-- OBJECT_FINALIZE - Object created/overwritten
+-- OBJECT_METADATA_UPDATE - Metadata changed
+-- OBJECT_DELETE - Object deleted
+-- OBJECT_ARCHIVE - Live version became noncurrent (versioned buckets)

# Event payload example
{
  "kind": "storage#object",
  "id": "my-bucket/my-object/1234567890",
  "selfLink": "https://www.googleapis.com/storage/v1/b/my-bucket/o/my-object",
  "name": "my-object",
  "bucket": "my-bucket",
  "generation": "1234567890",
  "metageneration": "1",
  "contentType": "application/json",
  "timeCreated": "2024-01-15T00:00:00.000Z",
  "updated": "2024-01-15T00:00:00.000Z",
  "size": "1024"
}

# Cloud Function triggered by GCS
import functions_framework

@functions_framework.cloud_event
def process_file(cloud_event):
    data = cloud_event.data
    bucket = data['bucket']
    name = data['name']
    
    print(f'Processing file: gs://{bucket}/{name}')
    # Process the file...
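
For the Pub/Sub notification path, a subscriber has to decode the message itself. The sketch below assumes a push-style JSON message whose `data` field is base64-encoded and whose attributes include `eventType`, `bucketId`, `objectId`, and `objectGeneration`; the function names are hypothetical.

```python
import base64
import json


def parse_notification(message):
    """Decode a Pub/Sub message dict with `attributes` and base64 `data`."""
    attrs = message["attributes"]
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    return {
        "event_type": attrs["eventType"],
        "uri": f"gs://{attrs['bucketId']}/{attrs['objectId']}",
        "generation": attrs.get("objectGeneration"),
        "size_bytes": int(payload.get("size", 0)),
    }


def should_process(event):
    """React only to newly written objects, not deletes or metadata updates."""
    return event["event_type"] == "OBJECT_FINALIZE"
```

Filtering on the event type keeps a pipeline from reprocessing a file when only its metadata changes.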

14. What are retention policies?

Retention Policies:
+-- Minimum retention period
+-- Cannot delete/overwrite during retention
+-- Compliance and regulatory requirements
+-- Can be locked (immutable)
+-- Bucket-level or object-level

# Set bucket retention policy
gcloud storage buckets update gs://my-bucket \
    --retention-period=365d

# Lock retention policy (IRREVERSIBLE)
gcloud storage buckets update gs://my-bucket \
    --lock-retention-period

# Object holds
# Temporary hold
gsutil retention temp set gs://bucket/object

# Release temporary hold
gsutil retention temp release gs://bucket/object

# Event-based hold (default for new objects)
gcloud storage buckets update gs://my-bucket \
    --default-event-based-hold

# Python: Set retention
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')

# Set retention policy
bucket.retention_period = 365 * 24 * 60 * 60  # 365 days in seconds
bucket.patch()

# Lock retention (careful - irreversible!)
# bucket.lock_retention_policy()

# Set object hold
blob = bucket.blob('important-file.txt')
blob.temporary_hold = True
blob.patch()

Retention vs Lifecycle:
+--------------------+--------------------+--------------------+
| Feature            | Retention          | Lifecycle          |
+--------------------+--------------------+--------------------+
| Purpose            | Prevent deletion   | Auto-delete/move   |
| Enforcement        | Hard block         | Automated action   |
| Compliance         | Yes (lockable)     | No                 |
| Reversible         | Until locked       | Yes                |
+--------------------+--------------------+--------------------+
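
The "hard block" in the table can be modeled as a simple check. This is a simplified sketch: the function names are hypothetical, and it ignores the detail that releasing an event-based hold restarts the retention clock.

```python
from datetime import datetime, timedelta, timezone


def earliest_delete_time(created, retention_seconds):
    """First moment at which the retention policy allows deletion."""
    return created + timedelta(seconds=retention_seconds)


def can_delete(created, retention_seconds, now,
               temporary_hold=False, event_based_hold=False):
    """True only if no hold is set and the retention period has elapsed."""
    if temporary_hold or event_based_hold:
        return False
    return now >= earliest_delete_time(created, retention_seconds)
```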

15. How do you optimize storage costs?

Cost Optimization Strategies:

1. Use Autoclass for automatic tiering
gcloud storage buckets create gs://my-bucket \
    --location=US \
    --autoclass

2. Lifecycle management
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
       "condition": {"age": 30}},
      {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
       "condition": {"age": 90}},
      {"action": {"type": "Delete"},
       "condition": {"age": 365}}
    ]
  }
}

3. Regional vs Multi-regional
# Single region (cheaper)
gcloud storage buckets create gs://regional-bucket -l us-central1

# Multi-region (higher availability)
gcloud storage buckets create gs://multi-bucket -l US

4. Compression
# Compress before upload
gzip large-file.csv
gsutil cp large-file.csv.gz gs://bucket/

# Or let gsutil gzip on upload (stored with Content-Encoding: gzip)
gsutil cp -z csv large-file.csv gs://bucket/

5. Monitor usage
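# Storage Insights inventory report (destination bucket is a placeholder;
# check `gcloud storage insights inventory-reports create --help` for flags)
gcloud storage insights inventory-reports create gs://my-bucket \
    --destination=gs://my-reports-bucket/inventory/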

# Query costs in BigQuery (billing export)
SELECT
  sku.description,
  SUM(cost) as total_cost
FROM `billing.gcp_billing_export`
WHERE service.description = 'Cloud Storage'
GROUP BY sku.description
ORDER BY total_cost DESC;

6. Clean up old versions
# Delete noncurrent versions older than 7 days
{
  "lifecycle": {
    "rule": [{
      "action": {"type": "Delete"},
      "condition": {"isLive": false, "daysSinceNoncurrentTime": 7}
    }]
  }
}
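
To gauge whether the compression strategy above is worth it for a given file type, a quick local measurement is enough. The sketch uses only the standard library; `gzip_savings` is a hypothetical helper, and already-compressed formats such as Parquet will show little or no gain.

```python
import gzip


def gzip_savings(data):
    """Return (original_bytes, compressed_bytes, fraction_saved) for `data`."""
    if not data:
        return 0, 0, 0.0
    compressed = gzip.compress(data)
    saved = 1 - len(compressed) / len(data)
    return len(data), len(compressed), saved
```

Repetitive text such as CSV or JSON logs usually compresses well, which reduces both stored bytes and transfer time.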

16. What is data quality in Dataplex?

Dataplex Data Quality:
+-- Automated quality scanning
+-- Custom quality rules
+-- Quality scores
+-- Integration with Data Catalog
+-- Alerts and notifications

# Create data quality scan
gcloud dataplex datascans create data-quality my-scan \
    --location=us-central1 \
    --data-source-resource="//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table" \
    --data-quality-spec-file=rules.yaml

# rules.yaml
rules:
  - dimension: COMPLETENESS
    column: email
    threshold: 0.95
    nonNullExpectation: {}

  - dimension: VALIDITY
    column: age
    threshold: 0.99
    rangeExpectation:
      minValue: 0
      maxValue: 150

  - dimension: UNIQUENESS
    column: customer_id
    threshold: 1.0
    uniquenessExpectation: {}

  - dimension: VALIDITY
    column: email
    threshold: 0.95
    regexExpectation:
      regex: "^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$"

# Run scan
gcloud dataplex datascans run my-scan \
    --location=us-central1

# View results
gcloud dataplex datascans describe my-scan \
    --location=us-central1 \
    --view=FULL

Data Quality Dimensions:
+-- COMPLETENESS - Non-null values
+-- UNIQUENESS - Unique values
+-- VALIDITY - Format/range checks
+-- ACCURACY - Reference data checks
+-- CONSISTENCY - Cross-table checks
+-- FRESHNESS - Data freshness
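
Each rule reduces to a pass ratio compared against a threshold. The sketch below reproduces that idea locally on a list of dictionaries; it is not the Dataplex engine, the helper names are hypothetical, and null values count as failures here.

```python
import re


def non_null(value):
    return value is not None


def in_range(min_value, max_value):
    return lambda value: value is not None and min_value <= value <= max_value


def matches_regex(pattern):
    compiled = re.compile(pattern)
    return lambda value: value is not None and compiled.search(value) is not None


def pass_ratio(rows, column, check):
    """Fraction of rows whose `column` value passes `check` (1.0 for no rows)."""
    if not rows:
        return 1.0
    passed = sum(1 for row in rows if check(row.get(column)))
    return passed / len(rows)


def evaluate(rows, rules):
    """Return (dimension, column, ratio, passed) for each rule."""
    results = []
    for rule in rules:
        ratio = pass_ratio(rows, rule["column"], rule["check"])
        results.append((rule["dimension"], rule["column"], ratio,
                        ratio >= rule["threshold"]))
    return results
```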

17. What are Dataplex lakes and zones?

Dataplex Hierarchy:
Lake --> Zone --> Asset

Lake:
+-- Logical container for data
+-- Represents business domain
+-- Unified governance
+-- Example: "Sales Data Lake"

Zone:
+-- Subdivision within lake
+-- RAW or CURATED type
+-- Example: "Raw Zone", "Analytics Zone"

Asset:
+-- Data resources (GCS, BigQuery)
+-- Metadata discovery
+-- Example: "Sales Events Bucket"

# Create complete structure
# 1. Create lake
gcloud dataplex lakes create sales-lake \
    --location=us-central1 \
    --display-name="Sales Data Lake"

# 2. Create raw zone
gcloud dataplex zones create raw-zone \
    --lake=sales-lake \
    --location=us-central1 \
    --type=RAW \
    --resource-location-type=SINGLE_REGION \
    --display-name="Raw Data Zone"

# 3. Create curated zone
gcloud dataplex zones create curated-zone \
    --lake=sales-lake \
    --location=us-central1 \
    --type=CURATED \
    --resource-location-type=SINGLE_REGION \
    --display-name="Curated Analytics Zone"

# 4. Add GCS asset to raw zone
gcloud dataplex assets create raw-events \
    --lake=sales-lake \
    --zone=raw-zone \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/sales-raw-events \
    --discovery-enabled

# 5. Add BigQuery asset to curated zone
gcloud dataplex assets create sales-analytics \
    --lake=sales-lake \
    --zone=curated-zone \
    --location=us-central1 \
    --resource-type=BIGQUERY_DATASET \
    --resource-name=projects/my-project/datasets/sales_analytics

Zone Types:
+-- RAW - Landing zone for raw data
|   +-- Any format accepted
|   +-- Schema discovery enabled
+-- CURATED - Processed/analytics data
    +-- Structured data required
    +-- Higher quality standards
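
The Lake --> Zone --> Asset hierarchy is also visible in the resource names the API uses. The sketch below builds those names from their parts; the class name `DataplexAsset` is hypothetical and no API call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataplexAsset:
    project: str
    location: str
    lake: str
    zone: str
    asset: str

    @property
    def lake_name(self):
        return f"projects/{self.project}/locations/{self.location}/lakes/{self.lake}"

    @property
    def zone_name(self):
        return f"{self.lake_name}/zones/{self.zone}"

    @property
    def name(self):
        return f"{self.zone_name}/assets/{self.asset}"
```

Each level's name is a prefix of the level below it, which is why policies set on a lake or zone can apply to everything underneath.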
