Top 20 GCP Cloud Storage & Data Lake Interview Questions
- What is Cloud Storage?
- What are storage classes?
- What is object lifecycle management?
- What is Dataplex?
- What is Data Catalog?
- How do you implement data lake architecture?
- What is BigLake?
- How do you secure Cloud Storage?
- What is transfer service?
- What are signed URLs?
- What is object versioning?
- How do you organize data in GCS?
- What is the Pub/Sub notification feature?
- What are retention policies?
- How do you optimize storage costs?
- What is data quality in Dataplex?
- What are Dataplex lakes and zones?
- How do you implement data lineage?
- What is Analytics Hub?
- What are Cloud Storage best practices?
1. What is Cloud Storage?
Cloud Storage is Google Cloud's managed object storage service for unstructured data, designed for high durability and availability.
Cloud Storage Features:
+-- Object storage (unstructured data)
+-- 99.999999999% (11 nines) durability
+-- Global accessibility
+-- Multiple storage classes
+-- Strong consistency
+-- No limit on total storage (a single object can be up to 5 TiB)
Key Concepts:
+-- Buckets - Container for objects
+-- Objects - Files with metadata
+-- Regions/Multi-regions - Data location
+-- ACLs/IAM - Access control
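To make the bucket/object split concrete, here is a minimal sketch that breaks a `gs://` URI into its bucket and object name. The helper `parse_gcs_uri` and the `GcsPath` type are hypothetical names used only for illustration; they are not part of any Google library.

```python
from typing import NamedTuple


class GcsPath(NamedTuple):
    bucket: str       # container name, e.g. "my-bucket"
    object_name: str  # full object name, e.g. "data/file.csv" ("" for a bare bucket)


def parse_gcs_uri(uri: str) -> GcsPath:
    """Split a gs:// URI into bucket and object name (illustrative helper)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is one flat
    # object name. "Folders" are only prefixes inside that name.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return GcsPath(bucket, object_name)


path = parse_gcs_uri("gs://my-bucket/data/file.csv")
print(path.bucket, path.object_name)
```

The object name keeps its slashes because Cloud Storage has a flat namespace: `data/file.csv` is a single key, not a file inside a directory.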
# Create bucket
gsutil mb -p my-project -c STANDARD -l US-CENTRAL1 gs://my-bucket/
# Or with gcloud
gcloud storage buckets create gs://my-bucket \
--project=my-project \
--location=US-CENTRAL1 \
--uniform-bucket-level-access
# Upload file
gsutil cp local-file.csv gs://my-bucket/data/
# Download file
gsutil cp gs://my-bucket/data/file.csv ./
# List objects
gsutil ls gs://my-bucket/data/
# Python client
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
# Upload
blob = bucket.blob('data/file.csv')
blob.upload_from_filename('local-file.csv')
# Download
blob = bucket.blob('data/file.csv')
blob.download_to_filename('downloaded.csv')
# Read directly into memory (download_as_string() is deprecated)
content = blob.download_as_bytes()
2. What are storage classes?
| Class | Use Case | Min Duration | Retrieval Cost |
|---|---|---|---|
| Standard | Frequent access | None | None |
| Nearline | Monthly access | 30 days | $0.01/GB |
| Coldline | Quarterly access | 90 days | $0.02/GB |
| Archive | Yearly access | 365 days | $0.05/GB |
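The minimum durations in the table matter for billing: an object deleted or rewritten before its minimum duration is generally charged as if it had been stored for the full period. A rough sketch of that rule, using the table's values (`billable_days` is a hypothetical helper, not an API):

```python
# Minimum storage durations in days, taken from the table above.
MIN_DURATION_DAYS = {
    "STANDARD": 0,
    "NEARLINE": 30,
    "COLDLINE": 90,
    "ARCHIVE": 365,
}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days of storage billed for an object removed after `days_stored` days."""
    if days_stored < 0:
        raise ValueError("days_stored must be non-negative")
    # Early deletion is billed up to the class minimum; after that,
    # the object is billed for the time it was actually stored.
    return max(days_stored, MIN_DURATION_DAYS[storage_class])


print(billable_days("NEARLINE", 10))   # deleted early
print(billable_days("COLDLINE", 120))  # kept past the minimum
```

This is why short-lived data usually belongs in Standard even when it is rarely read.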
Storage Class Selection:
# Create bucket with storage class
gsutil mb -c NEARLINE -l US gs://archive-bucket/
# Change object storage class
gsutil rewrite -s COLDLINE gs://bucket/object
# Set default storage class
gsutil defstorageclass set NEARLINE gs://bucket/
Autoclass (Automatic Tiering):
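# Automatically moves objects between classes based on access patterns.
# The default terminal class is NEARLINE; set ARCHIVE to allow Coldline and Archive transitions.
gcloud storage buckets create gs://my-bucket \
--location=US \
--enable-autoclass \
--autoclass-terminal-storage-class=ARCHIVE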
Autoclass Behavior:
| Object Access Pattern | Storage Class Transition |
|---|---|
| Frequently accessed | Standard |
| Not accessed 30 days | Nearline |
| Not accessed 90 days | Coldline |
| Not accessed 365 days | Archive |
| Accessed again | Back to Standard |
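The transition table can be read as a simple threshold lookup. The sketch below only mirrors the table; it is not how Autoclass is implemented, and `autoclass_target` is a hypothetical name.

```python
# (days without access, class) pairs from the Autoclass table, coldest first.
TRANSITIONS = [
    (365, "ARCHIVE"),
    (90, "COLDLINE"),
    (30, "NEARLINE"),
]


def autoclass_target(days_since_last_access: int) -> str:
    """Return the class an object would sit in after the given idle time."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    for threshold, storage_class in TRANSITIONS:
        if days_since_last_access >= threshold:
            return storage_class
    # Recently accessed objects stay in (or return to) Standard.
    return "STANDARD"


for days in (0, 45, 120, 400):
    print(days, autoclass_target(days))
```

An access resets the idle counter to zero, which is what the "Accessed again" row describes.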
Pricing (US regions, approximate list prices; check the current pricing page):
+-- Standard: $0.020/GB/month
+-- Nearline: $0.010/GB/month
+-- Coldline: $0.004/GB/month
+-- Archive: $0.0012/GB/month
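A quick way to reason about class selection is to add the storage price and the retrieval fee for an expected monthly workload. The sketch below uses the figures quoted in this section and deliberately ignores operation charges, network egress, and early-deletion fees; `monthly_cost` and `cheapest_class` are hypothetical helpers.

```python
# Prices in USD per GB, as quoted in this section (subject to change).
STORAGE_PRICE = {"STANDARD": 0.020, "NEARLINE": 0.010, "COLDLINE": 0.004, "ARCHIVE": 0.0012}
RETRIEVAL_PRICE = {"STANDARD": 0.0, "NEARLINE": 0.01, "COLDLINE": 0.02, "ARCHIVE": 0.05}


def monthly_cost(storage_class: str, stored_gb: float, retrieved_gb: float = 0.0) -> float:
    """Storage plus retrieval cost for one month in the given class."""
    return (stored_gb * STORAGE_PRICE[storage_class]
            + retrieved_gb * RETRIEVAL_PRICE[storage_class])


def cheapest_class(stored_gb: float, retrieved_gb: float) -> str:
    """Class with the lowest combined cost for this workload."""
    return min(STORAGE_PRICE, key=lambda c: monthly_cost(c, stored_gb, retrieved_gb))


# 1 TB that is never read vs. 1 TB that is read twice a month.
print(cheapest_class(1000, 0))
print(cheapest_class(1000, 2000))
```

The second case shows why cold classes are a poor fit for hot data: the retrieval fee quickly outweighs the lower at-rest price.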
3. What is object lifecycle management?
Lifecycle management automates object transitions and deletions based on rules.
Lifecycle Rules:
# lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
        "condition": {"age": 365, "matchesStorageClass": ["COLDLINE"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 730}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"isLive": false, "numNewerVersions": 3}
      },
      {
        "action": {"type": "AbortIncompleteMultipartUpload"},
        "condition": {"age": 7}
      }
    ]
  }
}
# Apply lifecycle
gsutil lifecycle set lifecycle.json gs://my-bucket/
# View lifecycle
gsutil lifecycle get gs://my-bucket/
Conditions Available:
+-- age - Days since creation
+-- createdBefore - Created before date
+-- isLive - Current vs noncurrent (versioned)
+-- numNewerVersions - Version count
+-- matchesStorageClass - Current class
+-- matchesPrefix - Object name prefix
+-- matchesSuffix - Object name suffix
+-- daysSinceCustomTime - Custom metadata time
Actions Available:
+-- Delete - Remove object
+-- SetStorageClass - Change class
+-- AbortIncompleteMultipartUpload - Clean up
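To see how conditions and actions combine, the sketch below evaluates a few of the rules from this section against an in-memory description of an object. It is a simplified model, not the service's implementation: only four conditions are supported, and `ObjectState`, `matches`, and `pick_action` are hypothetical names. It assumes that all conditions in a rule must hold, that Delete takes precedence over SetStorageClass, and that the coldest target class wins among class changes.

```python
from dataclasses import dataclass
from typing import Optional

RULES = [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90, "matchesStorageClass": ["NEARLINE"]}},
    {"action": {"type": "Delete"}, "condition": {"age": 730}},
    {"action": {"type": "Delete"},
     "condition": {"isLive": False, "numNewerVersions": 3}},
]
COLDNESS = ["STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE"]


@dataclass
class ObjectState:
    age_days: int
    storage_class: str
    is_live: bool = True
    newer_versions: int = 0


def matches(condition: dict, obj: ObjectState) -> bool:
    """True when every condition in the rule holds for the object."""
    checks = {
        "age": lambda v: obj.age_days >= v,
        "matchesStorageClass": lambda v: obj.storage_class in v,
        "isLive": lambda v: obj.is_live == v,
        "numNewerVersions": lambda v: obj.newer_versions >= v,
    }
    unsupported = set(condition) - set(checks)
    if unsupported:
        raise ValueError(f"condition not modelled in this sketch: {unsupported}")
    return all(checks[key](value) for key, value in condition.items())


def pick_action(rules: list, obj: ObjectState) -> Optional[dict]:
    """Return the single action that would apply, or None."""
    actions = [r["action"] for r in rules if matches(r["condition"], obj)]
    deletes = [a for a in actions if a["type"] == "Delete"]
    if deletes:
        return deletes[0]
    changes = [a for a in actions if a["type"] == "SetStorageClass"]
    if changes:
        return max(changes, key=lambda a: COLDNESS.index(a["storageClass"]))
    return None


print(pick_action(RULES, ObjectState(45, "STANDARD")))
print(pick_action(RULES, ObjectState(800, "ARCHIVE")))
```

Running a candidate rule set through a model like this before applying it can catch rules that delete more than intended.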
4. What is Dataplex?
Dataplex is a data fabric service that organizes data spread across Cloud Storage and BigQuery into lakes and zones, and centralizes discovery, governance, and data quality without moving the data.
Dataplex Architecture:
Lake
+-- Raw Zone (Landing): Asset (GCS)
+-- Curated Zone (Refined): Asset (BigQuery)
+-- Consumption Zone: Asset (BigQuery)
Features:
+-- Unified governance
+-- Automated data discovery
+-- Data quality management
+-- Security & policies
+-- Metadata management
# Create Dataplex lake
gcloud dataplex lakes create my-lake \
--location=us-central1 \
--display-name="My Data Lake"
# Create zone
gcloud dataplex zones create raw-zone \
--lake=my-lake \
--location=us-central1 \
--type=RAW \
--resource-location-type=SINGLE_REGION \
--display-name="Raw Data Zone"
# Add asset (GCS bucket)
gcloud dataplex assets create raw-data-asset \
--lake=my-lake \
--zone=raw-zone \
--location=us-central1 \
--resource-type=STORAGE_BUCKET \
--resource-name=projects/my-project/buckets/raw-data-bucket \
--discovery-enabled
# Add BigQuery dataset as asset (assumes a zone named curated-zone of type CURATED already exists)
gcloud dataplex assets create curated-data-asset \
--lake=my-lake \
--zone=curated-zone \
--location=us-central1 \
--resource-type=BIGQUERY_DATASET \
--resource-name=projects/my-project/datasets/curated
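The commands above create a lake, zone, and asset hierarchy, and each level has a fully qualified resource name. The sketch below builds those names and the `--resource-name` values used for attached buckets and datasets. The function names are hypothetical; the path layout is assumed from the commands in this section and the Dataplex resource hierarchy.

```python
def lake_name(project: str, location: str, lake: str) -> str:
    """Fully qualified name of a Dataplex lake."""
    return f"projects/{project}/locations/{location}/lakes/{lake}"


def zone_name(project: str, location: str, lake: str, zone: str) -> str:
    """Zones live under a lake."""
    return f"{lake_name(project, location, lake)}/zones/{zone}"


def asset_name(project: str, location: str, lake: str, zone: str, asset: str) -> str:
    """Assets live under a zone."""
    return f"{zone_name(project, location, lake, zone)}/assets/{asset}"


def attached_resource(project: str, resource_type: str, resource_id: str) -> str:
    """Value for --resource-name, matching the --resource-type flag."""
    collections = {"STORAGE_BUCKET": "buckets", "BIGQUERY_DATASET": "datasets"}
    if resource_type not in collections:
        raise ValueError(f"unsupported resource type: {resource_type}")
    return f"projects/{project}/{collections[resource_type]}/{resource_id}"


print(asset_name("my-project", "us-central1", "my-lake", "raw-zone", "raw-data-asset"))
print(attached_resource("my-project", "STORAGE_BUCKET", "raw-data-bucket"))
```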
5. What is Data Catalog?
Data Catalog is a fully managed metadata service for discovering, searching, and tagging data assets across Google Cloud.
Data Catalog Features:
+-- Automatic metadata discovery
+-- Custom metadata (tags)
+-- Data lineage tracking
+-- Unified search
+-- Policy tags for security
+-- Business glossary
# Search data assets
gcloud data-catalog search "type=table AND system=bigquery" \
--include-project-ids=my-project
# Create tag template
gcloud data-catalog tag-templates create data-quality-template \
--location=us-central1 \
--display-name="Data Quality" \
--field=id=owner,display-name="Data Owner",type=string,required=true \
--field=id=quality_score,display-name="Quality Score",type=double \
--field=id=pii_flag,display-name="Contains PII",type=bool
# Create entry for external data
gcloud data-catalog entries create my-external-data \
--entry-group=my-group \
--location=us-central1 \
--display-name="External Sales Data" \
--type=FILESET \
--gcs-file-patterns="gs://bucket/sales/*"
# Python: Add tag to BigQuery table
from google.cloud import datacatalog_v1
client = datacatalog_v1.DataCatalogClient()
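# Placeholder identifiers; replace with your own
project, dataset, table = "my-project", "my_dataset", "my_table"
# Look up the catalog entry for the BigQuery table
resource_name = f"//bigquery.googleapis.com/projects/{project}/datasets/{dataset}/tables/{table}"
entry = client.lookup_entry(request={"linked_resource": resource_name})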
# Create tag
tag = datacatalog_v1.Tag()
tag.template = f"projects/{project}/locations/us-central1/tagTemplates/data-quality-template"
tag.fields["owner"] = datacatalog_v1.TagField(string_value="data-team@company.com")
tag.fields["quality_score"] = datacatalog_v1.TagField(double_value=0.95)
tag.fields["pii_flag"] = datacatalog_v1.TagField(bool_value=True)
client.create_tag(parent=entry.name, tag=tag)
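Before calling `create_tag`, it can help to check tag values locally against the template's field definitions, since a missing required field or a wrong type is rejected by the service. The sketch below mirrors the `data-quality-template` fields defined earlier; `TEMPLATE_FIELDS` and `validate_tag` are hypothetical names and do not call the Data Catalog API.

```python
from typing import Any, Dict, List

# field id -> (type, required), mirroring the data-quality-template fields.
TEMPLATE_FIELDS = {
    "owner": ("string", True),
    "quality_score": ("double", False),
    "pii_flag": ("bool", False),
}


def _type_ok(kind: str, value: Any) -> bool:
    if kind == "string":
        return isinstance(value, str)
    if kind == "bool":
        return isinstance(value, bool)
    if kind == "double":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False


def validate_tag(values: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the tag looks valid."""
    errors = []
    for field_id, (kind, required) in TEMPLATE_FIELDS.items():
        if field_id not in values:
            if required:
                errors.append(f"missing required field: {field_id}")
        elif not _type_ok(kind, values[field_id]):
            errors.append(f"{field_id} must be {kind}")
    for field_id in values:
        if field_id not in TEMPLATE_FIELDS:
            errors.append(f"unknown field: {field_id}")
    return errors


print(validate_tag({"owner": "data-team@company.com", "quality_score": 0.95, "pii_flag": True}))
print(validate_tag({"quality_score": "high"}))
```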