

Top AWS SageMaker Interview Questions (2026) | JavaInUse

Top 20 AWS SageMaker Interview Questions


  1. What is Amazon SageMaker?
  2. What are SageMaker components?
  3. What is SageMaker Studio?
  4. How do you train a model in SageMaker?
  5. What are SageMaker built-in algorithms?
  6. How do you deploy models in SageMaker?
  7. What is SageMaker Pipelines?
  8. What is SageMaker Feature Store?
  9. What is SageMaker Model Registry?
  10. What are SageMaker experiments?
  11. What is hyperparameter tuning?
  12. What is SageMaker Clarify?
  13. What is SageMaker Debugger?
  14. How do you implement MLOps with SageMaker?
  15. What are SageMaker inference options?
  16. What is SageMaker Processing?
  17. What is SageMaker JumpStart?
  18. How do you optimize costs in SageMaker?
  19. How do you monitor SageMaker?
  20. What are SageMaker best practices?

1. What is Amazon SageMaker?

Amazon SageMaker is a fully managed machine learning platform for building, training, and deploying ML models at scale.

SageMaker Features:
├── SageMaker Studio (IDE)
├── Notebooks (Jupyter)
├── Training (managed infrastructure)
├── Hosting (deployment)
├── Pipelines (MLOps)
├── Feature Store
├── Model Registry
├── Experiments
├── Debugger
├── Clarify (bias/explainability)
├── JumpStart (pre-trained models)
└── Canvas (no-code ML)

ML Lifecycle with SageMaker:
┌─────────────────────────────────────────────────────┐
│                   ML Lifecycle                       │
├─────────┬──────────┬──────────┬──────────┬─────────┤
│ Prepare │  Build   │  Train   │  Deploy  │ Monitor │
│         │          │          │          │         │
│ Data    │ Notebooks│ Training │ Endpoints│ Model   │
│ Wrangler│ Studio   │ Jobs     │ Batch    │ Monitor │
│ Feature │ Autopilot│ HPO      │ Serverles│ Clarify │
│ Store   │          │ Debugger │          │         │
└─────────┴──────────┴──────────┴──────────┴─────────┘

2. What are SageMaker components?

Core Components:

1. Notebooks
├── Notebook instances (managed Jupyter)
├── Studio notebooks (collaborative)
└── Pre-built kernels with ML frameworks

2. Training
├── Training jobs (managed compute)
├── Built-in algorithms
├── Custom containers
├── Distributed training
└── Spot instances support

3. Hosting
├── Real-time endpoints
├── Serverless inference
├── Batch transform
├── Multi-model endpoints
└── Asynchronous inference

4. MLOps Tools
├── Pipelines (workflow orchestration)
├── Model Registry (version control)
├── Feature Store (feature management)
├── Experiments (tracking)
└── Model Monitor (drift detection)

5. Data Tools
├── Data Wrangler (visual data prep)
├── Ground Truth (labeling)
├── Processing jobs (data processing)
└── Clarify (bias detection)

# Basic SageMaker SDK usage
import sagemaker
from sagemaker import Session

session = Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

3. What is SageMaker Studio?

SageMaker Studio is an integrated development environment (IDE) for machine learning.

Studio Features:
├── JupyterLab-based interface
├── Integrated tools (all SageMaker features)
├── Collaborative notebooks
├── Visual experiment tracking
├── Model building workflows
└── Git integration

Studio Components:
┌─────────────────────────────────────────────────────┐
│               SageMaker Studio                       │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Notebooks  │  │  Experiments│  │  Pipelines  │ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Models     │  │  Endpoints  │  │  Feature    │ │
│  │  Registry   │  │             │  │  Store      │ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Data       │  │  JumpStart  │  │  AutoML     │ │
│  │  Wrangler   │  │             │  │  (Autopilot)│ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────┘

# Create Studio domain
import boto3

sagemaker_client = boto3.client('sagemaker')
sagemaker_client.create_domain(
    DomainName='my-domain',
    AuthMode='IAM',  # or 'SSO'
    DefaultUserSettings={
        'ExecutionRole': role_arn
    },
    SubnetIds=['subnet-xxx'],
    VpcId='vpc-xxx'
)

4. How do you train a model in SageMaker?

# Training with Built-in Algorithm (XGBoost)
from sagemaker.xgboost import XGBoost

xgb = XGBoost(
    entry_point='train.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    framework_version='1.7-1',
    py_version='py3',
    hyperparameters={
        'max_depth': 5,
        'eta': 0.2,
        'objective': 'binary:logistic',
        'num_round': 100
    }
)

# Define data channels
train_input = sagemaker.inputs.TrainingInput(
    s3_data=f's3://{bucket}/train/',
    content_type='text/csv'
)
val_input = sagemaker.inputs.TrainingInput(
    s3_data=f's3://{bucket}/validation/',
    content_type='text/csv'
)

# Start training
xgb.fit({'train': train_input, 'validation': val_input})

# Training with Custom Script
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    source_dir='src',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.0',
    py_version='py310',
    hyperparameters={
        'epochs': 10,
        'batch_size': 64,
        'learning_rate': 0.001
    },
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'train_loss: ([0-9\\.]+)'}
    ]
)

estimator.fit({'training': train_input})

# Access trained model
model_data = estimator.model_data  # S3 path to model artifacts

5. What are SageMaker built-in algorithms?

Algorithm              | Type            | Use Case
XGBoost                | Supervised      | Classification, Regression
Linear Learner         | Supervised      | Classification, Regression
K-Means                | Unsupervised    | Clustering
PCA                    | Unsupervised    | Dimensionality Reduction
BlazingText            | NLP             | Text Classification, Word2Vec
Image Classification   | Computer Vision | Image Classification
Object Detection       | Computer Vision | Object Detection
Semantic Segmentation  | Computer Vision | Pixel-level Classification
DeepAR                 | Time Series     | Forecasting
Factorization Machines | Supervised      | Recommendations

# Using Built-in Algorithm Container
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Get algorithm image
image_uri = image_uris.retrieve(
    framework='xgboost',
    region='us-east-1',
    version='1.7-1'
)

# Create estimator
xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=f's3://{bucket}/output/',
    hyperparameters={
        'max_depth': 5,
        'eta': 0.2,
        'objective': 'binary:logistic',
        'num_round': 100
    }
)

# Input format requirements vary by algorithm
# XGBoost: CSV, LibSVM, Parquet
# Image Classification: RecordIO, image files
# BlazingText: Text files with labels
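For the built-in XGBoost container specifically, CSV input must have the label in the first column and no header row. A minimal sketch of writing such a file with the standard library (the column names here are hypothetical):

```python
import csv

def to_builtin_csv(rows, label_key, feature_keys, path):
    """Write rows as label-first, headerless CSV, the format the
    built-in XGBoost container expects for training data."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row[label_key]] + [row[k] for k in feature_keys])

rows = [
    {'label': 1, 'age': 25, 'income': 50000},
    {'label': 0, 'age': 40, 'income': 72000},
]
to_builtin_csv(rows, 'label', ['age', 'income'], 'train.csv')
# train.csv now contains:
#   1,25,50000
#   0,40,72000
```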




6. How do you deploy models in SageMaker?

Deployment Options:

1. Real-time Endpoint
from sagemaker.model import Model

model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name='my-endpoint'
)

# Invoke endpoint
response = predictor.predict(data)

# Or via boto3
import json
import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',
    ContentType='application/json',
    Body=json.dumps(data)
)

2. Serverless Inference
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name='serverless-endpoint'
)

3. Batch Transform
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=f's3://{bucket}/batch-output/'
)

transformer.transform(
    data=f's3://{bucket}/batch-input/',
    content_type='text/csv',
    split_type='Line'
)

4. Asynchronous Inference
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path=f's3://{bucket}/async-output/',
    max_concurrent_invocations_per_instance=4
)
# Pass as model.deploy(..., async_inference_config=async_config)

7. What is SageMaker Pipelines?

SageMaker Pipelines enables building, automating, and managing ML workflows.

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CreateModelStep
from sagemaker.workflow.parameters import ParameterString
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.inputs import TrainingInput, CreateModelInput

# Define parameters
instance_type = ParameterString(name='TrainingInstanceType', default_value='ml.m5.xlarge')

# Processing step
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

step_process = ProcessingStep(
    name='PreprocessData',
    processor=processor,
    inputs=[ProcessingInput(source=input_data, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(output_name='train', source='/opt/ml/processing/train')],
    code='preprocess.py'
)

# Training step
step_train = TrainingStep(
    name='TrainModel',
    estimator=estimator,
    inputs={
        'train': TrainingInput(s3_data=step_process.properties.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri)
    }
)

# Create model step
step_create_model = CreateModelStep(
    name='CreateModel',
    model=model,
    inputs=CreateModelInput(instance_type='ml.m5.xlarge')
)

# Define pipeline
pipeline = Pipeline(
    name='MLPipeline',
    parameters=[instance_type],
    steps=[step_process, step_train, step_create_model]
)

# Create/update pipeline
pipeline.upsert(role_arn=role)

# Start execution
execution = pipeline.start()
execution.wait()

8. What is SageMaker Feature Store?

Feature Store is a centralized repository for storing, sharing, and managing ML features.

Feature Store Components:
├── Feature Groups: Tables of features
├── Online Store: Low-latency serving
├── Offline Store: Training data (S3/Athena)
└── Feature Records: Feature values

# Create Feature Group
from sagemaker.feature_store.feature_group import FeatureGroup

feature_group = FeatureGroup(
    name='customer-features',
    sagemaker_session=session
)

# Define schema (the SDK uses FeatureDefinition objects)
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition, FeatureTypeEnum
)

feature_group.feature_definitions = [
    FeatureDefinition('customer_id', FeatureTypeEnum.STRING),
    FeatureDefinition('age', FeatureTypeEnum.INTEGRAL),
    FeatureDefinition('total_purchases', FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition('event_time', FeatureTypeEnum.FRACTIONAL)
]

feature_group.create(
    s3_uri=f's3://{bucket}/feature-store/',
    record_identifier_name='customer_id',
    event_time_feature_name='event_time',
    role_arn=role,
    enable_online_store=True
)

# Ingest features
import time
import pandas as pd
df = pd.DataFrame({
    'customer_id': ['C001', 'C002'],
    'age': [25, 30],
    'total_purchases': [150.0, 200.0],
    'event_time': [time.time(), time.time()]
})

feature_group.ingest(data_frame=df, max_workers=3, wait=True)

# Get features (online - low latency)
record = feature_group.get_record(record_identifier_value_as_string='C001')

# Query features (offline - training)
query = feature_group.athena_query()
query.run(query_string='SELECT * FROM customer_features', output_location=f's3://{bucket}/query/')
df = query.as_dataframe()

9. What is SageMaker Model Registry?

Model Registry provides a central repository for model versioning and lifecycle management.

# Create Model Package Group
from sagemaker.model import Model
from sagemaker.model_metrics import ModelMetrics, MetricsSource

sagemaker_client.create_model_package_group(
    ModelPackageGroupName='fraud-detection-models',
    ModelPackageGroupDescription='Models for fraud detection'
)

# Register model version
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=f's3://{bucket}/evaluation/statistics.json',
        content_type='application/json'
    ),
    bias=MetricsSource(
        s3_uri=f's3://{bucket}/clarify/bias.json',
        content_type='application/json'
    )
)

model_package = model.register(
    model_package_group_name='fraud-detection-models',
    inference_instances=['ml.m5.xlarge', 'ml.m5.2xlarge'],
    transform_instances=['ml.m5.xlarge'],
    content_types=['application/json'],
    response_types=['application/json'],
    model_metrics=model_metrics,
    approval_status='PendingManualApproval',  # or 'Approved'
    description='Fraud detection model v1.0'
)

# Approve model
sagemaker_client.update_model_package(
    ModelPackageArn=model_package.model_package_arn,
    ModelApprovalStatus='Approved'
)

# Deploy from registry
from sagemaker.model import ModelPackage

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn
)
predictor = model.deploy(instance_type='ml.m5.xlarge', initial_instance_count=1)

10. What are SageMaker experiments?

SageMaker Experiments helps organize, track, and compare ML experiments.

from sagemaker.experiments.run import Run, load_run

# Create experiment
with Run(
    experiment_name='fraud-detection-experiment',
    run_name='xgboost-run-1',
    sagemaker_session=session
) as run:
    # Log parameters
    run.log_parameter('max_depth', 5)
    run.log_parameter('learning_rate', 0.1)
    run.log_parameter('algorithm', 'xgboost')
    
    # Training
    estimator.fit(inputs)
    
    # Log metrics
    run.log_metric('accuracy', 0.95)
    run.log_metric('f1_score', 0.92)
    run.log_metric('auc', 0.98)
    
    # Log artifacts
    run.log_artifact(name='model', value=model_data)
    run.log_file('confusion_matrix.png', name='confusion_matrix')

# Query experiments
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(
    experiment_name='fraud-detection-experiment',
    sagemaker_session=session
)

# Get dataframe of all runs
df = analytics.dataframe()
print(df[['run_name', 'max_depth', 'accuracy', 'f1_score']])

# Compare runs in Studio
# Visual comparison of metrics, parameters, artifacts

11. What is hyperparameter tuning?

Hyperparameter Optimization (HPO) automatically finds the best hyperparameters for your model.

from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

# Define hyperparameter ranges
hyperparameter_ranges = {
    'max_depth': IntegerParameter(3, 10),
    'eta': ContinuousParameter(0.01, 0.3),
    'min_child_weight': IntegerParameter(1, 10),
    'subsample': ContinuousParameter(0.5, 1.0),
    'colsample_bytree': ContinuousParameter(0.5, 1.0)
}

# Define objective metric
objective_metric_name = 'validation:auc'
objective_type = 'Maximize'

# Create tuner
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type=objective_type,
    max_jobs=20,
    max_parallel_jobs=5,
    strategy='Bayesian',  # or 'Random', 'Hyperband', 'Grid'
    early_stopping_type='Auto'
)

# Start tuning
tuner.fit({'train': train_input, 'validation': val_input})

# Get best training job
best_job = tuner.best_training_job()
print(f"Best job: {best_job}")

# Get best hyperparameters
best_params = sagemaker_client.describe_training_job(
    TrainingJobName=best_job
)['HyperParameters']

# Deploy best model
best_predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)

12. What is SageMaker Clarify?

SageMaker Clarify detects bias in data and models, and explains model predictions.

from sagemaker.clarify import (
    SageMakerClarifyProcessor,
    DataConfig, BiasConfig, ModelConfig, SHAPConfig
)

clarify_processor = SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    sagemaker_session=session
)

# Data configuration
data_config = DataConfig(
    s3_data_input_path=f's3://{bucket}/data/test.csv',
    s3_output_path=f's3://{bucket}/clarify-output/',
    label='target',
    headers=['feature1', 'feature2', 'age', 'gender', 'target'],
    dataset_type='text/csv'
)

# Bias configuration
bias_config = BiasConfig(
    label_values_or_threshold=[1],
    facet_name='gender',
    facet_values_or_threshold=[0],  # 0 = female
    group_name='age'
)

# Run pre-training bias analysis
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config
)

# Model configuration (for post-training analysis)
model_config = ModelConfig(
    model_name='my-model',
    instance_type='ml.m5.xlarge',
    instance_count=1,
    content_type='text/csv',
    accept_type='text/csv'
)

# SHAP explainability
shap_config = SHAPConfig(
    baseline=[baseline_data],
    num_samples=500,
    agg_method='mean_abs'
)

# Run post-training bias and explainability
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config
)

13. What is SageMaker Debugger?

SageMaker Debugger captures training metrics and analyzes training jobs in real-time.

from sagemaker.debugger import (
    Rule, ProfilerRule, rule_configs,
    DebuggerHookConfig, CollectionConfig,
    ProfilerConfig, FrameworkProfile
)

# Debugger rules
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]

# Debugger hook configuration
debugger_hook_config = DebuggerHookConfig(
    s3_output_path=f's3://{bucket}/debug-output/',
    collection_configs=[
        CollectionConfig(name='weights', parameters={'save_interval': '100'}),
        CollectionConfig(name='gradients', parameters={'save_interval': '100'}),
        CollectionConfig(name='losses', parameters={'save_interval': '10'})
    ]
)

# Profiler configuration
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        detailed_profiling=True,
        start_step=5,
        num_steps=10
    )
)

# Create estimator with debugging
estimator = PyTorch(
    ...,
    rules=rules,
    debugger_hook_config=debugger_hook_config,
    profiler_config=profiler_config
)

estimator.fit(inputs)

# Access debug data
from smdebug.trials import create_trial
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
tensor_names = trial.tensor_names()
loss_values = trial.tensor('CrossEntropyLoss').values()

14. How do you implement MLOps with SageMaker?

MLOps Architecture:
┌─────────────────────────────────────────────────────┐
│                  MLOps Pipeline                      │
├─────────────────────────────────────────────────────┤
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌────────┐ │
│  │ Code    │→ │ Build & │→ │ Train & │→ │ Deploy │ │
│  │ Commit  │  │ Test    │  │ Evaluate│  │        │ │
│  └─────────┘  └─────────┘  └─────────┘  └────────┘ │
│       │            │            │            │      │
│       │            │            │            │      │
│  ┌────▼────────────▼────────────▼────────────▼────┐│
│  │            SageMaker Pipelines                 ││
│  └────────────────────────────────────────────────┘│
│  ┌────────────┐  ┌────────────┐  ┌────────────────┐│
│  │ Feature    │  │ Model      │  │ Model          ││
│  │ Store      │  │ Registry   │  │ Monitor        ││
│  └────────────┘  └────────────┘  └────────────────┘│
└─────────────────────────────────────────────────────┘

# CI/CD with CodePipeline
{
  "pipeline": {
    "stages": [
      {
        "name": "Source",
        "actions": [{"actionTypeId": {"provider": "CodeCommit"}}]
      },
      {
        "name": "Build",
        "actions": [{"actionTypeId": {"provider": "CodeBuild"}}]
      },
      {
        "name": "Train",
        "actions": [{
          "actionTypeId": {"provider": "SageMaker"},
          "configuration": {"PipelineExecutionArn": "..."}
        }]
      },
      {
        "name": "Deploy",
        "actions": [{
          "actionTypeId": {"provider": "CloudFormation"},
          "configuration": {"ActionMode": "CREATE_UPDATE"}
        }]
      }
    ]
  }
}

# Model monitoring
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)
monitor.create_monitoring_schedule(
    endpoint_input=endpoint_name,
    output_s3_uri=f's3://{bucket}/monitoring/',
    statistics=baseline_statistics,
    constraints=baseline_constraints,
    schedule_cron_expression=CronExpressionGenerator.hourly()
)

15. What are SageMaker inference options?

Inference Options:

1. Real-time Inference
├── Always-on endpoints
├── Sub-second latency
├── Auto-scaling support
└── Multi-model endpoints

2. Serverless Inference
├── Pay per invocation
├── Auto-scales to zero
├── Cold start consideration
└── Good for intermittent traffic

3. Batch Transform
├── Large batch predictions
├── No persistent endpoint
├── Cost-effective for bulk
└── Parallel processing

4. Asynchronous Inference
├── Long-running predictions
├── Queue-based
├── S3 output
└── Good for large payloads
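Asynchronous inference is invoked by pointing the endpoint at a payload already uploaded to S3, rather than sending the payload inline. A minimal sketch (endpoint and bucket names are hypothetical) that builds the request arguments as plain data, with the actual boto3 call shown in comments:

```python
def build_async_invoke_args(endpoint_name, input_s3_uri,
                            content_type='application/json'):
    """Arguments for sagemaker-runtime invoke_endpoint_async: the payload
    lives in S3 (InputLocation) and the result is written to the endpoint's
    configured S3 output path, so large or slow requests don't hold an
    HTTP connection open."""
    return {
        'EndpointName': endpoint_name,
        'InputLocation': input_s3_uri,
        'ContentType': content_type,
    }

args = build_async_invoke_args('async-endpoint',
                               's3://my-bucket/async-input/payload.json')
# runtime = boto3.client('sagemaker-runtime')
# response = runtime.invoke_endpoint_async(**args)
# response['OutputLocation'] -> S3 URI where the prediction will appear
```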

# Multi-Model Endpoint
from sagemaker.multidatamodel import MultiDataModel

mme = MultiDataModel(
    name='multi-model-endpoint',
    model_data_prefix=f's3://{bucket}/models/',
    model=model,
    sagemaker_session=session
)

predictor = mme.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)

# Add models dynamically
mme.add_model(model_data_source='s3://bucket/model1.tar.gz', model_data_path='model1.tar.gz')
mme.add_model(model_data_source='s3://bucket/model2.tar.gz', model_data_path='model2.tar.gz')

# Invoke specific model
response = predictor.predict(data, target_model='model1.tar.gz')

16. What is SageMaker Processing?

SageMaker Processing runs data processing and model evaluation workloads.

from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# Using Scikit-learn
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

processor.run(
    code='preprocessing.py',
    inputs=[
        ProcessingInput(
            source=f's3://{bucket}/raw-data/',
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='train',
            source='/opt/ml/processing/train',
            destination=f's3://{bucket}/processed/train'
        ),
        ProcessingOutput(
            output_name='test',
            source='/opt/ml/processing/test',
            destination=f's3://{bucket}/processed/test'
        )
    ],
    arguments=['--split-ratio', '0.8']
)

# preprocessing.py
import pandas as pd
import argparse
import os
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--split-ratio', type=float, default=0.8)
    args = parser.parse_args()
    
    # Read data
    input_path = '/opt/ml/processing/input'
    df = pd.read_csv(os.path.join(input_path, 'data.csv'))
    
    # Process
    train, test = train_test_split(df, train_size=args.split_ratio)
    
    # Save
    train.to_csv('/opt/ml/processing/train/train.csv', index=False)
    test.to_csv('/opt/ml/processing/test/test.csv', index=False)




17. What is SageMaker JumpStart?

JumpStart provides pre-trained models and solution templates for common ML tasks.

JumpStart Offerings:
├── Foundation Models (LLMs)
│   ├── Llama 2/3
│   ├── Falcon
│   ├── Mistral
│   └── AI21
├── Computer Vision
│   ├── Image classification
│   ├── Object detection
│   └── Semantic segmentation
├── NLP
│   ├── Text classification
│   ├── Named entity recognition
│   └── Question answering
└── Solution Templates
    ├── Fraud detection
    ├── Predictive maintenance
    └── Document understanding

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Deploy pre-trained model
model = JumpStartModel(
    model_id='huggingface-llm-falcon-7b-instruct-bf16',
    role=role
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    endpoint_name='falcon-endpoint'
)

# Invoke
response = predictor.predict({
    'inputs': 'What is machine learning?',
    'parameters': {
        'max_new_tokens': 256,
        'temperature': 0.7
    }
})

# Fine-tune JumpStart model
estimator = JumpStartEstimator(
    model_id='huggingface-text-classification-bert-base-cased',
    role=role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    hyperparameters={
        'epochs': 3,
        'learning_rate': 2e-5
    }
)

estimator.fit({'training': train_data})

18. How do you optimize costs in SageMaker?

Cost Optimization Strategies:

1. Use Spot Instances for Training
estimator = PyTorch(
    ...,
    use_spot_instances=True,
    max_wait=3600,  # Max wait including interruptions
    max_run=3600,   # Max training time
    checkpoint_s3_uri=f's3://{bucket}/checkpoints/'
)

2. Serverless Inference
# No cost when idle
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5
)

3. Auto-scaling Endpoints
# Endpoint auto-scaling is configured via Application Auto Scaling
import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'

autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

autoscaling.put_scaling_policy(
    PolicyName='invocations-per-instance',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,  # Target invocations per instance
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)

4. Multi-Model Endpoints
# Host multiple models on single endpoint
# Reduce costs vs separate endpoints

5. Inference Recommender
# Recommends instance types and configs via the SageMaker API
sagemaker_client.create_inference_recommendations_job(
    JobName='recommendation-job',
    JobType='Default',
    RoleArn=role,
    InputConfig={
        'ModelPackageVersionArn': model_package_arn
    }
)
# Results list recommended instance types with cost/latency metrics

6. Right-size Instances
# Use instance types appropriate for workload
# Monitor CloudWatch metrics
# GPU utilization for deep learning
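The spot-instance savings above can be sanity-checked with simple arithmetic. A sketch with hypothetical hourly rates (not actual AWS pricing):

```python
def training_cost(hourly_rate, hours, instances=1, spot_discount=0.0):
    """Estimated training-job cost. spot_discount is the fractional saving
    versus on-demand (e.g. 0.7 for a spot price 70% cheaper); actual spot
    discounts vary by instance type and region."""
    return hourly_rate * hours * instances * (1 - spot_discount)

# Hypothetical GPU-instance rate of $3.825/hr, 10-hour training job
on_demand = training_cost(3.825, 10)
spot      = training_cost(3.825, 10, spot_discount=0.7)
print(f"on-demand ${on_demand:.2f} vs spot ${spot:.2f}")
```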

19. How do you monitor SageMaker?

Monitoring Options:

1. CloudWatch Metrics
├── Training: CPUUtilization, MemoryUtilization, GPUUtilization
├── Endpoints: Invocations, InvocationErrors, Latency
└── Processing: CPUUtilization, MemoryUtilization
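The endpoint metrics above can be pulled programmatically. A sketch that builds the `get_metric_statistics` request as plain data (the variant name 'AllTraffic' is the deploy default; the boto3 call is shown in comments):

```python
from datetime import datetime, timedelta, timezone

def endpoint_metric_query(endpoint_name, metric='Invocations', minutes=60):
    """Request parameters for CloudWatch get_metric_statistics on a
    SageMaker endpoint over the last `minutes` minutes."""
    now = datetime.now(timezone.utc)
    return {
        'Namespace': 'AWS/SageMaker',
        'MetricName': metric,
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': 'AllTraffic'},
        ],
        'StartTime': now - timedelta(minutes=minutes),
        'EndTime': now,
        'Period': 60,            # 1-minute datapoints
        'Statistics': ['Sum'],
    }

params = endpoint_metric_query('my-endpoint')
# cloudwatch = boto3.client('cloudwatch')
# data = cloudwatch.get_metric_statistics(**params)
```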

2. Model Monitor
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

# Create baseline
monitor.suggest_baseline(
    baseline_dataset=baseline_data,
    dataset_format=DatasetFormat.csv(header=True)
)

# Schedule monitoring
monitor.create_monitoring_schedule(
    endpoint_input=endpoint_name,
    output_s3_uri=f's3://{bucket}/monitoring/',
    schedule_cron_expression='cron(0 * ? * * *)'
)

3. Data Quality Monitor
# Data quality monitoring is handled by DefaultModelMonitor (above):
# it compares statistics of live endpoint traffic against the baseline
# and flags drift in feature distributions

4. Model Quality Monitor (ground truth)
from sagemaker.model_monitor import ModelQualityMonitor

model_monitor = ModelQualityMonitor(role=role, ...)
model_monitor.create_monitoring_schedule(
    endpoint_input=endpoint_name,
    ground_truth_input=ground_truth_s3_uri,
    problem_type='BinaryClassification'
)

5. Alerts
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='HighLatency',
    MetricName='ModelLatency',
    Namespace='AWS/SageMaker',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    Statistic='Average',
    Period=60,
    EvaluationPeriods=3,
    Threshold=1000,  # ModelLatency is reported in microseconds
    ComparisonOperator='GreaterThanThreshold'
)

20. What are SageMaker best practices?

Training Best Practices:
├── Use Spot instances (up to 90% savings)
├── Enable checkpointing
├── Right-size instances
├── Use distributed training for large models
├── Optimize data loading (Pipe mode)
└── Monitor with Debugger

Deployment Best Practices:
├── Use auto-scaling
├── Consider serverless for variable traffic
├── Multi-model endpoints for many models
├── A/B testing with production variants
├── Monitor with Model Monitor
└── Use VPC endpoints for security
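A/B testing with production variants comes down to an endpoint config that splits traffic by weight between two models. A hedged sketch that builds the `create_endpoint_config` request as plain data (config and model names are hypothetical):

```python
def ab_endpoint_config(config_name, model_a, model_b, weight_a=0.9):
    """create_endpoint_config request with two weighted variants.
    InitialVariantWeight values are relative, so 0.9/0.1 yields a
    90/10 traffic split."""
    return {
        'EndpointConfigName': config_name,
        'ProductionVariants': [
            {'VariantName': 'VariantA', 'ModelName': model_a,
             'InstanceType': 'ml.m5.xlarge', 'InitialInstanceCount': 1,
             'InitialVariantWeight': weight_a},
            {'VariantName': 'VariantB', 'ModelName': model_b,
             'InstanceType': 'ml.m5.xlarge', 'InitialInstanceCount': 1,
             'InitialVariantWeight': round(1 - weight_a, 3)},
        ],
    }

cfg = ab_endpoint_config('ab-config', 'model-v1', 'model-v2')
# sagemaker_client.create_endpoint_config(**cfg)
# Shift weights gradually via update_endpoint_weights_and_capacities
```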

MLOps Best Practices:
├── Version code, data, and models
├── Use Pipelines for automation
├── Implement CI/CD
├── Register models in Model Registry
├── Track experiments
└── Automate retraining triggers
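Retraining triggers are commonly wired through EventBridge, which can start a SageMaker pipeline on a schedule or in response to a drift alarm. A sketch (ARNs and names are hypothetical placeholders) that builds the rule and target as plain data:

```python
def retraining_schedule(rule_name, pipeline_arn, role_arn,
                        cron='cron(0 6 ? * MON *)'):
    """EventBridge rule + target that starts a SageMaker pipeline weekly.
    A Model Monitor drift alarm can target the same pipeline for
    event-driven (rather than scheduled) retraining."""
    rule = {
        'Name': rule_name,
        'ScheduleExpression': cron,
        'State': 'ENABLED',
    }
    target = {
        'Rule': rule_name,
        'Targets': [{
            'Id': 'retrain-pipeline',
            'Arn': pipeline_arn,   # pipeline ARN (placeholder below)
            'RoleArn': role_arn,   # role EventBridge assumes to start it
        }],
    }
    return rule, target

rule, target = retraining_schedule(
    'weekly-retrain',
    'arn:aws:sagemaker:us-east-1:123456789012:pipeline/MLPipeline',
    'arn:aws:iam::123456789012:role/events-invoke-role'
)
# events = boto3.client('events')
# events.put_rule(**rule)
# events.put_targets(**target)
```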

Security Best Practices:
├── Use VPC for network isolation
├── Encrypt data at rest and in transit
├── Use IAM roles (least privilege)
├── Enable logging (CloudTrail)
├── Private endpoints (VPC endpoints)
└── Secrets Manager for credentials

Cost Best Practices:
├── Spot training
├── Serverless inference
├── Auto-scaling
├── Inference Recommender
├── Clean up unused resources
└── Use Savings Plans
