Autoscaling Configuration
Enable automatic scaling for self-hosted models based on CPU and memory usage to handle varying inference loads efficiently. When enabled, the platform automatically adjusts the number of running instances to match demand.
The autoscaler monitors your model's resource usage and intelligently scales up during high demand and down during low usage periods, optimizing both performance and cost.
How to Enable Autoscaling
- During model deployment, scroll to the Auto-scaling section
- Check Enable auto-scaling
- Configure thresholds and limits based on your needs
Configuration Options
| Setting | Default | Range | Description |
|---|---|---|---|
| Min Replicas | 1 | 1-20 | Minimum number of instances always running |
| Max Replicas | 10 | 2-50 | Maximum number of instances to scale to |
| CPU Threshold | 70% | 1-100% | Scale up when average CPU exceeds this |
| Memory Threshold | 80% | 1-100% | Scale up when average memory exceeds this |
| Polling Interval | 30s | 10-300s | How often to check metrics |
| Scale Up Cooldown | 60s | 30-600s | Wait time after scaling up before next scale decision |
| Scale Down Cooldown | 120s | 60-900s | Wait time after scaling down before next scale decision |
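The settings in the table above can be sketched as a small configuration object. The class and field names here are illustrative, not the platform's actual API; the defaults and ranges mirror the table.

```python
from dataclasses import dataclass

@dataclass
class AutoscalingConfig:
    """Illustrative container for the autoscaling settings (hypothetical names)."""
    min_replicas: int = 1             # range 1-20
    max_replicas: int = 10            # range 2-50
    cpu_threshold: float = 70.0       # percent, range 1-100
    memory_threshold: float = 80.0    # percent, range 1-100
    polling_interval_s: int = 30      # range 10-300
    scale_up_cooldown_s: int = 60     # range 30-600
    scale_down_cooldown_s: int = 120  # range 60-900

    def validate(self) -> None:
        # Enforce the documented ranges plus basic internal consistency.
        assert 1 <= self.min_replicas <= 20
        assert 2 <= self.max_replicas <= 50
        assert self.min_replicas <= self.max_replicas
        assert 1 <= self.cpu_threshold <= 100
        assert 1 <= self.memory_threshold <= 100
        assert 10 <= self.polling_interval_s <= 300
        assert 30 <= self.scale_up_cooldown_s <= 600
        assert 60 <= self.scale_down_cooldown_s <= 900
```

Validating the configuration before deployment catches mistakes such as a min replica count above the max.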
How Autoscaling Works
The autoscaling system maintains optimal performance through the following process:
1. Metrics Collection
The platform monitors CPU and memory usage across all instances at every polling interval (default: 30 seconds).
2. Scale Up Decision
When either CPU or memory usage exceeds its threshold, a new instance is added (one at a time).
Trigger conditions:
- CPU usage > CPU threshold OR
- Memory usage > Memory threshold
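The scale-up test is an OR over the two metrics; a sketch (function name is illustrative):

```python
def should_scale_up(cpu_pct: float, mem_pct: float,
                    cpu_threshold: float = 70.0,
                    mem_threshold: float = 80.0) -> bool:
    """Scale up when EITHER average metric exceeds its threshold (OR semantics)."""
    return cpu_pct > cpu_threshold or mem_pct > mem_threshold
```

Note that a single hot metric is enough: high memory with idle CPUs still triggers a scale-up.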
3. Scale Down Decision
When both CPU and memory stay below 50% of their thresholds for a sustained period, instances are removed.
Trigger conditions:
- CPU usage < 50% of CPU threshold AND
- Memory usage < 50% of Memory threshold AND
- Sustained for at least one scale down cooldown period
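Scale-down is stricter: an AND over both metrics, sustained across every sample in the window. A sketch (function name is illustrative), where `samples` covers at least one scale-down cooldown period:

```python
def should_scale_down(samples: list,
                      cpu_threshold: float = 70.0,
                      mem_threshold: float = 80.0) -> bool:
    """samples: list of (cpu_pct, mem_pct) observations covering at least one
    scale-down cooldown period. Scale down only if EVERY sample is below 50%
    of BOTH thresholds (AND semantics, sustained)."""
    return bool(samples) and all(
        cpu < 0.5 * cpu_threshold and mem < 0.5 * mem_threshold
        for cpu, mem in samples
    )
```

With the defaults (CPU 70%, memory 80%), the scale-down trigger is CPU below 35% and memory below 40% for the whole window; one spiky sample resets the decision.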
4. Cooldown Periods
After scaling up (60s) or down (120s), the system waits before making another scaling decision. This prevents rapid scaling oscillations and allows new instances to stabilize.
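The cooldown gate can be expressed as a simple time comparison; the function below is a sketch with illustrative names, taking timestamps explicitly rather than reading the clock:

```python
def in_cooldown(now: float, last_event_ts: float, direction: str,
                up_cooldown_s: int = 60, down_cooldown_s: int = 120) -> bool:
    """True while the autoscaler should hold off on further scaling decisions.

    direction: "up" or "down" -- which kind of scale event happened last,
    since each has its own cooldown length.
    """
    wait = up_cooldown_s if direction == "up" else down_cooldown_s
    return now - last_event_ts < wait
```

During a cooldown the metrics are still collected; only the scaling decision is suppressed.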
5. Dedicated Nodes
Each autoscaled instance is provisioned on its own dedicated GPU compute node. This ensures performance isolation and prevents GPU contention, but it can increase infrastructure costs during scale-up events; monitor your usage in the FinOps dashboard.
Best Practices
Development Environment
Configuration:
- Autoscaling: Disabled
- Fixed Instances: 1
- Instance Type: Smallest GPU that fits model
Why: Minimize costs during development when traffic is minimal and predictable.
Production Environment
Configuration:
- Autoscaling: Enabled
- Min Replicas: ≥ 2 for high availability
- Max Replicas: Based on expected peak load
- Lower thresholds: CPU 60-70%, Memory 70-80%
Why: Ensure high availability and responsive scaling for production workloads.
Variable Traffic Patterns
Configuration:
- Set higher max replicas (10-20)
- Lower thresholds (CPU 60%, Memory 70%)
- Shorter polling interval (20-30s)
Why: Respond quickly to unpredictable traffic spikes.
Steady Traffic Patterns
Configuration:
- Use fixed instances instead of autoscaling
- Size appropriately for average load
Why: Autoscaling overhead is not needed for predictable workloads.
CPU-Intensive Models
Configuration:
- Lower CPU threshold to 60%
- Keep memory threshold at default 80%
Why: Prevent CPU bottlenecks in compute-heavy models.
Memory-Intensive Models
Configuration:
- Lower memory threshold to 70%
- Keep CPU threshold at default 70%
Why: Prevent OOM (Out of Memory) errors in large language models.
GPU-Based Models
Configuration:
- Longer cooldown periods (180-300s)
- Conservative thresholds (CPU 60%, Memory 70%)
Why: GPU models have longer initialization times, so avoid unnecessary scaling.
Example Configurations
Small Model - Development
Autoscaling: Disabled
Fixed Instances: 1
Instance Type: g5.xlarge (1x A10G GPU)
Use case: Development and testing of small models (up to ~7B parameters)
Cost: Minimal, single instance only when needed
Medium Model - Production (Moderate Traffic)
Min Replicas: 2
Max Replicas: 10
CPU Threshold: 70%
Memory Threshold: 80%
Polling Interval: 30s
Instance Type: g5.2xlarge (1x A10G 24GB GPU)
Use case: Production API with moderate, variable traffic
Cost: 2 instances minimum, scales up to 10 during peaks
Large Model - High-Traffic API
Min Replicas: 3
Max Replicas: 20
CPU Threshold: 60%
Memory Threshold: 70%
Polling Interval: 20s
Scale Up Cooldown: 90s
Scale Down Cooldown: 180s
Instance Type: p4de.24xlarge (8x A100 80GB GPUs)
Use case: High-traffic production API requiring low latency
Cost: 3 instances minimum, scales up to 20 for peak loads
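For any of the configurations above, the monthly cost falls between a floor (only the min replicas running) and a ceiling (max replicas running continuously). A rough sketch, where `hourly_rate_usd` is whatever your provider charges for the chosen instance type (the rate in the usage note below is hypothetical):

```python
def monthly_cost_range(min_replicas: int, max_replicas: int,
                       hourly_rate_usd: float,
                       hours_per_month: int = 730) -> tuple:
    """Rough monthly cost bounds: floor assumes only min replicas ever run;
    ceiling assumes max replicas run around the clock."""
    floor = min_replicas * hourly_rate_usd * hours_per_month
    ceiling = max_replicas * hourly_rate_usd * hours_per_month
    return floor, ceiling
```

For example, at an assumed $1.00/hour the medium-model configuration (min 2, max 10) spans $1,460 to $7,300 per month; the actual bill lands between the bounds depending on traffic.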
Monitoring Autoscaling
View autoscaling activity in your model details page:
Current Status
- Current Replicas: Real-time instance count
- Last Scale Event: Timestamp of last scale up/down with reason
- Next Evaluation: When the next scaling decision will be made
Metrics Chart
- CPU Usage: Line chart showing CPU usage across all instances
- Memory Usage: Line chart showing memory usage across all instances
- Threshold Lines: Visual indicators of scale up/down thresholds
- Time Range: View last 1h, 6h, 24h, or 7 days
Scaling History
Log of all scaling decisions with metrics snapshots:
2024-01-15 14:32:15 - Scaled up from 2 to 3 replicas (CPU: 75%, Memory: 68%)
2024-01-15 14:28:45 - CPU threshold exceeded: 75% > 70%
2024-01-15 13:45:20 - Scaled down from 3 to 2 replicas (CPU: 25%, Memory: 30%)
2024-01-15 13:40:10 - Resources below threshold for 120s
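If you export the scaling history for analysis, entries in the format above can be parsed with a regular expression. This is a sketch against the sample log lines shown here, not a documented export format:

```python
import re

# Matches "Scaled up/down from X to Y replicas (CPU: N%, Memory: M%)" entries.
SCALE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"Scaled (?P<dir>up|down) from (?P<frm>\d+) to (?P<to>\d+) replicas "
    r"\(CPU: (?P<cpu>\d+)%, Memory: (?P<mem>\d+)%\)$"
)

def parse_scale_event(line: str):
    """Return a dict for 'Scaled up/down' entries, or None for other log lines."""
    m = SCALE_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    return {"ts": d["ts"], "direction": d["dir"],
            "from": int(d["frm"]), "to": int(d["to"]),
            "cpu": int(d["cpu"]), "mem": int(d["mem"])}
```

Threshold-crossing lines (e.g. "CPU threshold exceeded") deliberately return None so you can count actual replica changes separately.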
Cost Impact
Track how autoscaling affects your compute costs:
- Current Cost: Real-time cost of running instances
- Projected Monthly: Estimated monthly cost at current scale
- Autoscaling Savings: Cost saved by scaling down during low usage
- Cost per Request: Average cost per inference request
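The dashboard's exact formulas are not specified here, but plausible definitions for the last two figures look like this (function names and inputs are illustrative):

```python
def autoscaling_savings(replica_hours_used: float, max_replicas: int,
                        window_hours: float, hourly_rate_usd: float) -> float:
    """Savings vs. running max_replicas flat-out for the whole window."""
    flat_cost = max_replicas * window_hours * hourly_rate_usd
    actual_cost = replica_hours_used * hourly_rate_usd
    return flat_cost - actual_cost

def cost_per_request(total_cost_usd: float, request_count: int) -> float:
    """Average inference cost; guard against a zero-request window."""
    if request_count == 0:
        raise ValueError("no requests in the billing window")
    return total_cost_usd / request_count
```

For example, 60 replica-hours consumed in a 24-hour window with max 10 replicas at $2.00/hour yields $360 saved versus a fixed fleet.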
Troubleshooting
Problem: Instances scale up too frequently
Solution:
- Increase CPU/Memory thresholds
- Increase polling interval
- Increase scale up cooldown period
Problem: Instances scale down too quickly
Solution:
- Increase scale down cooldown period
- Lower the scale down trigger (currently 50% of threshold)
Problem: Not scaling up when needed
Solution:
- Lower CPU/Memory thresholds
- Decrease polling interval
- Check if max replicas limit is reached
Problem: High costs due to autoscaling
Solution:
- Reduce max replicas
- Lower the scale down cooldown so idle instances are removed sooner
- Consider scheduled deployment instead
- Review threshold settings to avoid unnecessary scaling
Problem: Cold starts affecting performance
Solution:
- Increase min replicas to reduce scale-from-zero events
- Consider "Always On" deployment for critical applications
- Use scheduled deployment if traffic patterns are predictable
Next Steps
- Monitor model performance to optimize autoscaling settings
- Optimize costs with right-sized resources
- Learn about deployment options for different use cases