
Top 20 AWS S3 and Lake Formation Interview Questions


  1. What is Amazon S3?
  2. What are S3 storage classes?
  3. What is S3 versioning and lifecycle policies?
  4. How do you secure S3 buckets?
  5. What is AWS Lake Formation?
  6. How do you set up a data lake with Lake Formation?
  7. What is the Lake Formation permission model?
  8. What is tag-based access control (TBAC)?
  9. How do you implement data sharing in Lake Formation?
  10. What are S3 performance optimization techniques?
  11. What is S3 Select and Glacier Select?
  12. How do you configure S3 event notifications?
  13. What is S3 Replication?
  14. What are S3 access points?
  15. How do you implement data governance with Lake Formation?
  16. What is the Glue Data Catalog integration?
  17. How do you handle data quality in Lake Formation?
  18. What are governed tables?
  19. How do you monitor S3 and Lake Formation?
  20. What are best practices for S3 data lakes?

1. What is Amazon S3?

Amazon S3 (Simple Storage Service) is an object storage service offering scalability, data availability, security, and performance.

S3 Key Concepts:
├── Bucket: Container for objects
├── Object: File + metadata
├── Key: Unique identifier (path)
├── Region: Physical location
└── Version ID: Object version

S3 Features:
├── 11 9s durability (99.999999999%)
├── 99.99% availability (Standard)
├── Unlimited storage
├── Objects up to 5TB
├── Multipart upload (recommended >100MB, required >5GB)
└── Server-side encryption

# Upload object
import boto3

s3 = boto3.client('s3')
s3.upload_file('local_file.csv', 'my-bucket', 'data/file.csv')

# Download object
s3.download_file('my-bucket', 'data/file.csv', 'local_file.csv')

# List objects
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])

2. What are S3 storage classes?

Class                 | Use Case               | Retrieval     | Cost
Standard              | Frequent access        | Immediate     | $$$$
Intelligent-Tiering   | Unknown patterns       | Immediate     | $$$
Standard-IA           | Infrequent access      | Immediate     | $$
One Zone-IA           | Infrequent, single AZ  | Immediate     | $
Glacier Instant       | Archive, fast access   | Milliseconds  | $
Glacier Flexible      | Archive                | Minutes-hours | $
Glacier Deep Archive  | Long-term archive      | 12-48 hours   | $

# Upload with storage class
s3.put_object(
    Bucket='my-bucket',
    Key='archive/data.csv',
    Body=data,
    StorageClass='GLACIER'
)

# Intelligent-Tiering archive config
s3.put_bucket_intelligent_tiering_configuration(
    Bucket='my-bucket',
    Id='archive-config',
    IntelligentTieringConfiguration={
        'Id': 'archive-config',
        'Status': 'Enabled',
        'Tierings': [
            {'Days': 90, 'AccessTier': 'ARCHIVE_ACCESS'},
            {'Days': 180, 'AccessTier': 'DEEP_ARCHIVE_ACCESS'}
        ]
    }
)

3. What is S3 versioning and lifecycle policies?

# Enable versioning
s3.put_bucket_versioning(
    Bucket='my-bucket',
    VersioningConfiguration={'Status': 'Enabled'}
)

# List versions
versions = s3.list_object_versions(Bucket='my-bucket', Prefix='data/')
for version in versions.get('Versions', []):
    print(f"{version['Key']} - {version['VersionId']}")

# Delete specific version
s3.delete_object(Bucket='my-bucket', Key='data/file.csv', VersionId='xxx')

# Lifecycle Policy
lifecycle_policy = {
    'Rules': [
        {
            'ID': 'TransitionToIA',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'logs/'},
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                {'Days': 90, 'StorageClass': 'GLACIER'}
            ],
            'Expiration': {'Days': 365}
        },
        {
            'ID': 'DeleteOldVersions',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            'NoncurrentVersionTransitions': [
                {'NoncurrentDays': 30, 'StorageClass': 'GLACIER'}
            ],
            'NoncurrentVersionExpiration': {'NoncurrentDays': 90}
        },
        {
            'ID': 'AbortIncompleteUploads',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7}
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration=lifecycle_policy
)

4. How do you secure S3 buckets?

S3 Security Layers:

1. Block Public Access (Account/Bucket level)
s3.put_public_access_block(
    Bucket='my-bucket',
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    }
)

2. Bucket Policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencrypted",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"
                }
            }
        },
        {
            "Sid": "EnforceSSL",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::my-bucket/*",
            "Condition": {
                "Bool": {"aws:SecureTransport": "false"}
            }
        }
    ]
}

3. Encryption
# Default encryption
s3.put_bucket_encryption(
    Bucket='my-bucket',
    ServerSideEncryptionConfiguration={
        'Rules': [{
            'ApplyServerSideEncryptionByDefault': {
                'SSEAlgorithm': 'aws:kms',
                'KMSMasterKeyID': 'arn:aws:kms:...'
            },
            'BucketKeyEnabled': True  # Reduces KMS costs
        }]
    }
)

4. Access Logging
s3.put_bucket_logging(
    Bucket='my-bucket',
    BucketLoggingStatus={
        'LoggingEnabled': {
            'TargetBucket': 'logs-bucket',
            'TargetPrefix': 's3-access-logs/'
        }
    }
)

5. What is AWS Lake Formation?

AWS Lake Formation simplifies building, securing, and managing data lakes with centralized governance.

Lake Formation Components:
├── Data Catalog (Glue Catalog)
├── Blueprints (automated ingestion)
├── Security (fine-grained access)
├── Data Sharing (cross-account)
└── Governed Tables (ACID)

Benefits:
├── Centralized security management
├── Fine-grained access control
├── Column and row-level security
├── Data sharing without copying
└── Integration with Glue, Athena, and Redshift

# Register S3 location
import boto3
lf = boto3.client('lakeformation')

lf.register_resource(
    ResourceArn='arn:aws:s3:::my-data-lake',
    UseServiceLinkedRole=True,
    HybridAccessEnabled=False
)

# Grant data location permission
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/GlueRole'},
    Resource={'DataLocation': {'ResourceArn': 'arn:aws:s3:::my-data-lake'}},
    Permissions=['DATA_LOCATION_ACCESS']
)




6. How do you set up a data lake with Lake Formation?

Data Lake Setup Steps:

1. Create S3 bucket structure
s3://my-data-lake/
├── raw/           # Bronze layer
├── staged/        # Silver layer
├── curated/       # Gold layer
└── archive/       # Cold storage

2. Register with Lake Formation
lf.register_resource(
    ResourceArn='arn:aws:s3:::my-data-lake',
    UseServiceLinkedRole=True
)

3. Create databases in Glue Catalog
glue = boto3.client('glue')
glue.create_database(
    DatabaseInput={
        'Name': 'raw_db',
        'Description': 'Raw data layer',
        'LocationUri': 's3://my-data-lake/raw/'
    }
)

4. Set up blueprints for ingestion
# Via console: Lake Formation > Blueprints
# Blueprint types: database snapshot, incremental database, log file (e.g. CloudTrail, ELB)

5. Configure permissions
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/DataScientist'},
    Resource={
        'Database': {'Name': 'curated_db'}
    },
    Permissions=['DESCRIBE']
)

lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/DataScientist'},
    Resource={
        'Table': {
            'DatabaseName': 'curated_db',
            'Name': 'sales'
        }
    },
    Permissions=['SELECT']
)

7. What is the Lake Formation permission model?

Permission Model:

Traditional IAM (Coarse-grained):
├── Bucket/prefix-level access
├── Complex policies for multiple tables
└── Hard to manage at scale

Lake Formation (Fine-grained):
├── Database permissions
├── Table permissions
├── Column-level permissions
├── Row-level permissions (data filters)
└── Tag-based permissions (LF-TBAC)

# Database permissions
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Analyst'},
    Resource={'Database': {'Name': 'analytics'}},
    Permissions=['DESCRIBE', 'CREATE_TABLE', 'ALTER', 'DROP']
)

# Table permissions
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Analyst'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics',
            'Name': 'customers'
        }
    },
    Permissions=['SELECT', 'INSERT', 'DELETE']
)

# Column-level permissions
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Analyst'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'hr',
            'Name': 'employees',
            'ColumnNames': ['name', 'department', 'title']
            # Excludes: salary, ssn
        }
    },
    Permissions=['SELECT']
)

8. What is tag-based access control (TBAC)?

Lake Formation Tag-Based Access Control (LF-TBAC) simplifies permission management using tags.

# Create LF-Tags
lf.create_lf_tag(
    TagKey='classification',
    TagValues=['public', 'internal', 'confidential', 'restricted']
)

lf.create_lf_tag(
    TagKey='domain',
    TagValues=['sales', 'marketing', 'finance', 'hr']
)

# Assign tags to resources
lf.add_lf_tags_to_resource(
    Resource={
        'Table': {
            'DatabaseName': 'analytics',
            'Name': 'sales_data'
        }
    },
    LFTags=[
        {'TagKey': 'classification', 'TagValues': ['internal']},
        {'TagKey': 'domain', 'TagValues': ['sales']}
    ]
)

# Grant permissions based on tags
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/SalesAnalyst'},
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [
                {'TagKey': 'domain', 'TagValues': ['sales']},
                {'TagKey': 'classification', 'TagValues': ['public', 'internal']}
            ]
        }
    },
    Permissions=['SELECT', 'DESCRIBE']
)

# Benefits of TBAC:
# - Scale to thousands of tables
# - Self-service access management
# - Automatic inheritance for new resources

9. How do you implement data sharing in Lake Formation?

Data Sharing Options:

1. Same Account Sharing
# Grant to IAM role
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/ConsumerRole'},
    Resource={'Table': {'DatabaseName': 'db', 'Name': 'table'}},
    Permissions=['SELECT']
)

2. Cross-Account Sharing
# Step 1: Share database/table
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '987654321098'},  # Consumer account ID
    Resource={'Database': {'Name': 'shared_db'}},
    Permissions=['DESCRIBE'],
    PermissionsWithGrantOption=['DESCRIBE']
)

# Step 2: Consumer accepts (via RAM or Lake Formation)
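# A hedged sketch of the consumer-side accept via AWS RAM (scanning for pending
# invitations is illustrative, not the only way to accept a share):
ram = boto3.client('ram')
for invite in ram.get_resource_share_invitations().get('resourceShareInvitations', []):
    if invite['status'] == 'PENDING':
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invite['resourceShareInvitationArn']
        )
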
# Step 3: Consumer creates a resource link (via the Glue API; the Lake Formation
# client has no resource-link call)
glue = boto3.client('glue')
glue.create_database(
    DatabaseInput={
        'Name': 'local_shared_db',
        'TargetDatabase': {
            'CatalogId': '123456789012',  # Producer (owner) account ID
            'DatabaseName': 'shared_db'
        }
    }
)

3. Tag-Based Sharing (LF-TBAC across accounts)
# Share all tables with specific tags
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '987654321098'},
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [{'TagKey': 'shared', 'TagValues': ['yes']}]
        }
    },
    Permissions=['SELECT']
)

10. What are S3 performance optimization techniques?

S3 Performance Optimization:

1. Prefix Design (3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix)
# Good: Distribute across prefixes
s3://bucket/2024/01/15/data.csv
s3://bucket/2024/01/16/data.csv

# For high throughput, use random prefixes
s3://bucket/a1b2c3/data1.csv
s3://bucket/d4e5f6/data2.csv

2. Multipart Upload (>100MB)
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # 100MB
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=10
)
s3.upload_file('large_file.csv', 'bucket', 'key', Config=config)

3. Byte-Range Fetches (parallel downloads)
# Download parts in parallel
response = s3.get_object(
    Bucket='bucket',
    Key='large_file.csv',
    Range='bytes=0-999999'  # First 1MB
)
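
# A minimal sketch of fetching ranges concurrently (bucket, key, and part size
# are placeholders; assumes the s3 client from the earlier examples):
from concurrent.futures import ThreadPoolExecutor

def fetch_range(start, end):
    # Each worker fetches an independent byte range of the same object
    resp = s3.get_object(Bucket='bucket', Key='large_file.csv',
                         Range=f'bytes={start}-{end}')
    return resp['Body'].read()

size = s3.head_object(Bucket='bucket', Key='large_file.csv')['ContentLength']
part = 8 * 1024 * 1024  # 8 MB per range (assumption)
ranges = [(i, min(i + part, size) - 1) for i in range(0, size, part)]
with ThreadPoolExecutor(max_workers=10) as pool:
    data = b''.join(pool.map(lambda r: fetch_range(*r), ranges))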

4. S3 Transfer Acceleration
# Enable on bucket, use accelerate endpoint
s3.put_bucket_accelerate_configuration(
    Bucket='my-bucket',
    AccelerateConfiguration={'Status': 'Enabled'}
)
# Use: bucket.s3-accelerate.amazonaws.com

5. Optimal File Sizes
# Avoid small files: 128MB - 1GB optimal
# Combine small files for analytics

6. Use columnar formats
# Parquet, ORC for analytics (only read needed columns)
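# A hedged sketch of converting CSV to Parquet before upload (assumes pandas
# with pyarrow installed; file and key names are placeholders):
import pandas as pd

df = pd.read_csv('local_file.csv')
df.to_parquet('local_file.parquet', compression='snappy')  # columnar + compressed
s3.upload_file('local_file.parquet', 'my-bucket', 'data/file.parquet')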

11. What is S3 Select and Glacier Select?

S3 Select enables SQL queries directly on S3 objects, reducing data transfer.

# S3 Select - Query CSV
response = s3.select_object_content(
    Bucket='my-bucket',
    Key='data/sales.csv',
    ExpressionType='SQL',
    Expression="SELECT customer_id, amount FROM s3object WHERE amount > 1000",
    InputSerialization={
        'CSV': {
            'FileHeaderInfo': 'USE',
            'FieldDelimiter': ','
        }
    },
    OutputSerialization={'CSV': {}}
)

for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))

# S3 Select - Query Parquet
response = s3.select_object_content(
    Bucket='my-bucket',
    Key='data/sales.parquet',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object WHERE sale_date > '2024-01-01'",
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}}
)

# S3 Select - Query JSON
response = s3.select_object_content(
    Bucket='my-bucket',
    Key='data/events.json',
    ExpressionType='SQL',
    Expression="SELECT s.user_id, s.event FROM s3object s WHERE s.event = 'purchase'",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}}
)

# Glacier Select
# Query archived data without full restore
# Supports CSV, JSON
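# A hedged sketch of a Glacier select-restore: the query runs against the
# archived object and writes results to another S3 location (bucket names,
# prefix, and query are placeholders):
s3.restore_object(
    Bucket='my-bucket',
    Key='archive/data.csv',
    RestoreRequest={
        'Type': 'SELECT',
        'Tier': 'Standard',
        'SelectParameters': {
            'InputSerialization': {'CSV': {'FileHeaderInfo': 'USE'}},
            'ExpressionType': 'SQL',
            'Expression': "SELECT * FROM s3object WHERE amount > 1000",
            'OutputSerialization': {'CSV': {}}
        },
        'OutputLocation': {
            'S3': {'BucketName': 'results-bucket', 'Prefix': 'glacier-select/'}
        }
    }
)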

12. How do you configure S3 event notifications?

Event Notification Targets:
├── Lambda Function
├── SQS Queue
├── SNS Topic
└── EventBridge

# Configure notifications
notification_config = {
    'LambdaFunctionConfigurations': [
        {
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:ProcessFile',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {'Name': 'prefix', 'Value': 'uploads/'},
                        {'Name': 'suffix', 'Value': '.csv'}
                    ]
                }
            }
        }
    ],
    'QueueConfigurations': [
        {
            'QueueArn': 'arn:aws:sqs:us-east-1:123456789012:file-queue',
            'Events': ['s3:ObjectCreated:*', 's3:ObjectRemoved:*']
        }
    ]
}

s3.put_bucket_notification_configuration(
    Bucket='my-bucket',
    NotificationConfiguration=notification_config
)

# EventBridge integration (more features)
s3.put_bucket_notification_configuration(
    Bucket='my-bucket',
    NotificationConfiguration={
        'EventBridgeConfiguration': {}
    }
)

# Lambda receives event
{
    "Records": [{
        "eventSource": "aws:s3",
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-bucket"},
            "object": {"key": "uploads/file.csv", "size": 1024}
        }
    }]
}
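
A minimal Lambda handler for the event shape above (a sketch; the processing step is just a placeholder print):

import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(f"Processing {key} ({obj['ContentLength']} bytes) from {bucket}")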

13. What is S3 Replication?

Replication Types:

1. Same-Region Replication (SRR)
- Log aggregation
- Data sovereignty
- Live replication between accounts

2. Cross-Region Replication (CRR)
- Compliance requirements
- Latency reduction
- Disaster recovery

# Configure CRR
replication_config = {
    'Role': 'arn:aws:iam::123456789012:role/S3ReplicationRole',
    'Rules': [
        {
            'ID': 'ReplicateAll',
            'Status': 'Enabled',
            'Priority': 1,
            'Filter': {'Prefix': ''},
            'Destination': {
                'Bucket': 'arn:aws:s3:::destination-bucket',
                'StorageClass': 'STANDARD_IA',
                'ReplicationTime': {
                    'Status': 'Enabled',
                    'Time': {'Minutes': 15}
                },
                'Metrics': {
                    'Status': 'Enabled',
                    'EventThreshold': {'Minutes': 15}
                }
            },
            'DeleteMarkerReplication': {'Status': 'Enabled'}
        }
    ]
}

s3.put_bucket_replication(
    Bucket='source-bucket',
    ReplicationConfiguration=replication_config
)

# S3 Replication Time Control (RTC)
# SLA: 99.99% objects replicated within 15 minutes
# Requires metrics and replication time enabled

14. What are S3 access points?

S3 Access Points simplify managing data access at scale with dedicated endpoints.

# Create access point
s3control = boto3.client('s3control')

s3control.create_access_point(
    AccountId='123456789012',
    Name='analytics-access-point',
    Bucket='data-lake-bucket',
    VpcConfiguration={
        'VpcId': 'vpc-xxx'  # Restrict to VPC
    },
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    }
)

# Access point policy
access_point_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/AnalystRole"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:us-east-1:123456789012:accesspoint/analytics-access-point",
            "arn:aws:s3:us-east-1:123456789012:accesspoint/analytics-access-point/object/*"
        ],
        "Condition": {
            "StringLike": {"s3:prefix": ["analytics/*"]}
        }
    }]
}

# Access via access point
s3.get_object(
    Bucket='arn:aws:s3:us-east-1:123456789012:accesspoint/analytics-access-point',
    Key='analytics/report.csv'
)

# Multi-Region Access Points
# Single endpoint routing to nearest region

15. How do you implement data governance with Lake Formation?

Data Governance Features:

1. Data Catalog
- Centralized metadata repository
- Schema versioning
- Data lineage (via Glue)

2. Access Control
- Fine-grained permissions
- Column/row-level security
- Tag-based access

3. Audit Logging
- CloudTrail integration
- Access monitoring

# Row-Level Security with Data Filters
lf.create_data_cells_filter(
    TableData={
        'TableCatalogId': '123456789012',
        'DatabaseName': 'sales_db',
        'TableName': 'orders',
        'Name': 'us_region_filter',
        'RowFilter': {
            'FilterExpression': "region = 'US'"
        },
        'ColumnNames': ['order_id', 'customer', 'amount', 'region']
    }
)

# Grant filter to user
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/USAnalyst'},
    Resource={
        'DataCellsFilter': {
            'TableCatalogId': '123456789012',
            'DatabaseName': 'sales_db',
            'TableName': 'orders',
            'Name': 'us_region_filter'
        }
    },
    Permissions=['SELECT']
)

4. Data Masking (Column-level)
# Exclude sensitive columns from grants
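# Item 4 can be expressed as a grant with a column wildcard that excludes the
# sensitive columns (role ARN and column names are illustrative):
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Analyst'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'hr',
            'Name': 'employees',
            'ColumnWildcard': {'ExcludedColumnNames': ['salary', 'ssn']}
        }
    },
    Permissions=['SELECT']
)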




16. What is the Glue Data Catalog integration?

Lake Formation uses the Glue Data Catalog as its centralized metadata repository.

Glue Catalog Components:
├── Databases
├── Tables (metadata)
├── Partitions
├── Connections
└── Crawlers

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics',
        'Description': 'Analytics database',
        'LocationUri': 's3://my-lake/analytics/'
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics',
    TableInput={
        'Name': 'sales',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'sale_id', 'Type': 'string'},
                {'Name': 'amount', 'Type': 'double'},
                {'Name': 'customer_id', 'Type': 'string'}
            ],
            'Location': 's3://my-lake/analytics/sales/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            }
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'string'},
            {'Name': 'month', 'Type': 'string'}
        ],
        'TableType': 'EXTERNAL_TABLE'
    }
)

# Crawler discovers schema
glue.create_crawler(
    Name='sales-crawler',
    Role='GlueServiceRole',
    DatabaseName='analytics',
    Targets={'S3Targets': [{'Path': 's3://my-lake/analytics/sales/'}]},
    Schedule='cron(0 1 * * ? *)'
)
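
Once the crawler has run, the discovered schema can be read back from the catalog (a brief usage sketch; the table name assumes the crawler derives it from the S3 path):

glue.start_crawler(Name='sales-crawler')
# ... after the crawler finishes ...
table = glue.get_table(DatabaseName='analytics', Name='sales')
for col in table['Table']['StorageDescriptor']['Columns']:
    print(col['Name'], col['Type'])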

17. How do you handle data quality in Lake Formation?

Data Quality with Glue Data Quality:

# Define DQDL rules
dqdl_ruleset = """
Rules = [
    RowCount > 1000,
    IsComplete "customer_id",
    IsUnique "transaction_id",
    ColumnValues "amount" > 0,
    ColumnValues "email" matches "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    ColumnLength "phone" = 10,
    ColumnValues "status" in ["pending", "completed", "cancelled"],
    Completeness "email" >= 0.95,
    ReferentialIntegrity "customer_id" "customers.id"
]
"""

# Create ruleset
glue.create_data_quality_ruleset(
    Name='sales_quality_rules',
    Ruleset=dqdl_ruleset,
    TargetTable={
        'TableName': 'sales',
        'DatabaseName': 'analytics'
    }
)

# Run evaluation (capture the returned RunId)
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        'GlueTable': {
            'DatabaseName': 'analytics',
            'TableName': 'sales'
        }
    },
    Role='GlueServiceRole',
    RulesetNames=['sales_quality_rules']
)
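
# The run is asynchronous; a hedged sketch of polling it and reading per-rule
# results (status and field names reflect my understanding of the boto3 Glue
# data-quality API):
import time

while True:
    status = glue.get_data_quality_ruleset_evaluation_run(RunId=run['RunId'])
    if status['Status'] in ('SUCCEEDED', 'FAILED', 'STOPPED'):
        break
    time.sleep(15)

for result_id in status.get('ResultIds', []):
    result = glue.get_data_quality_result(ResultId=result_id)
    for rule in result['RuleResults']:
        print(rule['Name'], rule['Result'])  # e.g. PASS / FAIL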

# In Glue job
from awsgluedq.transforms import EvaluateDataQuality

results = EvaluateDataQuality.apply(
    frame=datasource,
    ruleset=dqdl_ruleset
)

18. What are governed tables?

Governed tables provide ACID transactions and automatic data compaction in Lake Formation.

Governed Table Features:
├── ACID transactions
├── Time travel queries
├── Automatic compaction
├── Row-level permissions
└── Branch/merge operations

# Create governed table
glue.create_table(
    DatabaseName='governed_db',
    TableInput={
        'Name': 'transactions',
        'StorageDescriptor': {...},
        'TableType': 'GOVERNED'
    }
)

# Transaction operations
tx = lf.start_transaction(TransactionType='READ_AND_WRITE')  # returns TransactionId

# Update within transaction
lf.update_table_objects(
    CatalogId='123456789012',
    DatabaseName='governed_db',
    TableName='transactions',
    TransactionId=tx['TransactionId'],
    WriteOperations=[
        {
            'AddObject': {
                'Uri': 's3://bucket/path/file.parquet',
                'ETag': 'xxx',
                'Size': 1024
            }
        }
    ]
)

lf.commit_transaction(TransactionId=tx['TransactionId'])

# Time travel query (via Athena)
SELECT * FROM "governed_db"."transactions"
FOR TIMESTAMP AS OF (current_timestamp - interval '1' hour);

19. How do you monitor S3 and Lake Formation?

Monitoring Tools:

1. S3 Metrics (CloudWatch)
├── BucketSizeBytes
├── NumberOfObjects
├── AllRequests
├── GetRequests, PutRequests
├── 4xxErrors, 5xxErrors
└── BytesDownloaded, BytesUploaded

2. S3 Storage Lens
# Analytics across organization
# Dashboard with recommendations
# Prefix-level metrics

3. S3 Access Logs
s3.put_bucket_logging(
    Bucket='my-bucket',
    BucketLoggingStatus={
        'LoggingEnabled': {
            'TargetBucket': 'logs-bucket',
            'TargetPrefix': 's3-logs/'
        }
    }
)

4. CloudTrail
# API-level logging
# Data events for object operations
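# A hedged sketch of enabling S3 data events on an existing trail (trail and
# bucket names are placeholders):
cloudtrail = boto3.client('cloudtrail')
cloudtrail.put_event_selectors(
    TrailName='my-trail',
    EventSelectors=[{
        'ReadWriteType': 'All',
        'IncludeManagementEvents': True,
        'DataResources': [{
            'Type': 'AWS::S3::Object',
            'Values': ['arn:aws:s3:::my-bucket/']
        }]
    }]
)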

5. Lake Formation Audit Logging
# Data access events to CloudTrail
# Who accessed what data

# CloudWatch alarm
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='S3-4xx-Errors',
    MetricName='4xxErrors',
    Namespace='AWS/S3',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[
        {'Name': 'BucketName', 'Value': 'my-bucket'},
        {'Name': 'FilterId', 'Value': 'EntireBucket'}
    ]
)

20. What are best practices for S3 data lakes?

1. Organization:
s3://data-lake/
├── raw/           # Bronze - as received
├── staged/        # Silver - cleansed
├── curated/       # Gold - business ready
└── archive/       # Cold storage

# Partitioning
s3://bucket/table/year=2024/month=01/day=15/data.parquet

2. Data Formats:
- Use Parquet or ORC for analytics
- Optimal file size: 128MB - 1GB
- Compress with Snappy or ZSTD
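
# A hedged sketch combining the format and partitioning guidance above
# (assumes the awswrangler library; path and partition columns are placeholders):
import awswrangler as wr
import pandas as pd

df = pd.read_csv('local_file.csv')  # must contain 'year' and 'month' columns
wr.s3.to_parquet(
    df=df,
    path='s3://data-lake/curated/sales/',
    dataset=True,
    partition_cols=['year', 'month'],
    compression='snappy'
)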

3. Security:
- Block public access
- Enable default encryption (KMS)
- Use Lake Formation for fine-grained access
- Enable versioning for critical data

4. Cost Optimization:
- Use lifecycle policies
- S3 Intelligent-Tiering for unknown patterns
- Archive old data to Glacier
- Delete incomplete multipart uploads

5. Performance:
- Partition by query patterns
- Avoid small files
- Use S3 Select for filtering
- Consider Transfer Acceleration

6. Governance:
- Centralize with Lake Formation
- Implement tag-based access control
- Enable audit logging
- Define data quality rules

