Autoscaling Configuration
Enable automatic scaling for self-hosted models based on CPU and memory usage to handle varying inference loads efficiently. When enabled, the platform automatically adjusts the number of running instances to match demand.
The autoscaler monitors your model's resource usage and intelligently scales up during high demand and down during low usage periods, optimizing both performance and cost.
How to Enable Autoscaling
- During model deployment, scroll to the Auto-scaling section
- Check Enable auto-scaling
- Configure thresholds and limits based on your needs
Configuration Options
| Setting | Default | Range | Description |
|---|---|---|---|
| Min Replicas | 1 | 1-20 | Minimum number of instances always running |
| Max Replicas | 10 | 2-50 | Maximum number of instances to scale to |
| CPU Threshold | 70% | 1-100% | Scale up when average CPU exceeds this |
| Memory Threshold | 80% | 1-100% | Scale up when average memory exceeds this |
| Polling Interval | 30s | 10-300s | How often to check metrics |
| Scale Up Cooldown | 60s | 30-600s | Wait time after scaling up before next scale decision |
| Scale Down Cooldown | 120s | 60-900s | Wait time after scaling down before next scale decision |
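The settings in the table above can be sketched as a small configuration object. The class and field names here are illustrative, not the platform's actual API; the defaults and ranges mirror the table.

```python
from dataclasses import dataclass

@dataclass
class AutoscalingConfig:
    """Illustrative container for the autoscaling settings (hypothetical names)."""
    min_replicas: int = 1             # range 1-20
    max_replicas: int = 10            # range 2-50
    cpu_threshold: float = 70.0       # percent, range 1-100
    memory_threshold: float = 80.0    # percent, range 1-100
    polling_interval_s: int = 30      # range 10-300
    scale_up_cooldown_s: int = 60     # range 30-600
    scale_down_cooldown_s: int = 120  # range 60-900

    def validate(self) -> None:
        # Enforce the documented ranges plus basic internal consistency.
        assert 1 <= self.min_replicas <= 20
        assert 2 <= self.max_replicas <= 50
        assert self.min_replicas <= self.max_replicas
        assert 1 <= self.cpu_threshold <= 100
        assert 1 <= self.memory_threshold <= 100
        assert 10 <= self.polling_interval_s <= 300
        assert 30 <= self.scale_up_cooldown_s <= 600
        assert 60 <= self.scale_down_cooldown_s <= 900
```

Validating the configuration before deployment catches mistakes such as a min replica count above the max.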
How Autoscaling Works
The autoscaling system maintains optimal performance through the following process:
1. Metrics Collection
The platform monitors CPU and memory usage across all instances at every polling interval (default: 30 seconds).
2. Scale Up Decision
When either CPU or memory usage exceeds its threshold, a new instance is added (one at a time).
Trigger conditions:
- CPU usage > CPU threshold OR
- Memory usage > Memory threshold
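The scale-up test is an OR over the two metrics; a sketch (function name is illustrative):

```python
def should_scale_up(cpu_pct: float, mem_pct: float,
                    cpu_threshold: float = 70.0,
                    mem_threshold: float = 80.0) -> bool:
    """Scale up when EITHER average metric exceeds its threshold (OR semantics)."""
    return cpu_pct > cpu_threshold or mem_pct > mem_threshold
```

Note that a single hot metric is enough: high memory with idle CPUs still triggers a scale-up.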
3. Scale Down Decision
When both CPU and memory stay below 50% of their thresholds for a sustained period, instances are removed.
Trigger conditions:
- CPU usage < 50% of CPU threshold AND
- Memory usage < 50% of Memory threshold AND
- Sustained for at least one scale down cooldown period
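Scale-down is stricter: an AND over both metrics, sustained across every sample in the window. A sketch (function name is illustrative), where `samples` covers at least one scale-down cooldown period:

```python
def should_scale_down(samples: list,
                      cpu_threshold: float = 70.0,
                      mem_threshold: float = 80.0) -> bool:
    """samples: list of (cpu_pct, mem_pct) observations covering at least one
    scale-down cooldown period. Scale down only if EVERY sample is below 50%
    of BOTH thresholds (AND semantics, sustained)."""
    return bool(samples) and all(
        cpu < 0.5 * cpu_threshold and mem < 0.5 * mem_threshold
        for cpu, mem in samples
    )
```

With the defaults (CPU 70%, memory 80%), the scale-down trigger is CPU below 35% and memory below 40% for the whole window; one spiky sample resets the decision.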
4. Cooldown Periods
After scaling up (60s) or down (120s), the system waits before making another scaling decision. This prevents rapid scaling oscillations and allows new instances to stabilize.
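The cooldown gate can be expressed as a simple time comparison; the function below is a sketch with illustrative names, taking timestamps explicitly rather than reading the clock:

```python
def in_cooldown(now: float, last_event_ts: float, direction: str,
                up_cooldown_s: int = 60, down_cooldown_s: int = 120) -> bool:
    """True while the autoscaler should hold off on further scaling decisions.

    direction: "up" or "down" -- which kind of scale event happened last,
    since each has its own cooldown length.
    """
    wait = up_cooldown_s if direction == "up" else down_cooldown_s
    return now - last_event_ts < wait
```

During a cooldown the metrics are still collected; only the scaling decision is suppressed.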
5. Dedicated Nodes
Each autoscaled instance is provisioned on its own dedicated GPU compute node. This ensures performance isolation and prevents GPU contention, but it can increase infrastructure costs during scale-up events; monitor your usage in the FinOps dashboard.
Best Practices
Development Environment
Configuration:
- Autoscaling: Disabled
- Fixed Instances: 1
- Instance Type: Smallest GPU that fits model
Why: Minimize costs during development when traffic is minimal and predictable.
Production Environment
Configuration:
- Autoscaling: Enabled
- Min Replicas: ≥ 2 for high availability
- Max Replicas: Based on expected peak load
- Lower thresholds: CPU 60-70%, Memory 70-80%
Why: Ensure high availability and responsive scaling for production workloads.
Variable Traffic Patterns
Configuration:
- Set higher max replicas (10-20)
- Lower thresholds (CPU 60%, Memory 70%)
- Shorter polling interval (20-30s)
Why: Respond quickly to unpredictable traffic spikes.
Steady Traffic Patterns
Configuration:
- Use fixed instances instead of autoscaling
- Size appropriately for average load
Why: Autoscaling overhead is not needed for predictable workloads.
CPU-Intensive Models
Configuration:
- Lower CPU threshold to 60%
- Keep memory threshold at default 80%
Why: Prevent CPU bottlenecks in compute-heavy models.
Memory-Intensive Models
Configuration:
- Lower memory threshold to 70%
- Keep CPU threshold at default 70%
Why: Prevent OOM (Out of Memory) errors in large language models.
GPU-Based Models
Configuration:
- Longer cooldown periods (180-300s)
- Conservative thresholds (CPU 60%, Memory 70%)
Why: GPU models have longer initialization times, so avoid unnecessary scaling.
Example Configurations
Small Model - Development
Autoscaling: Disabled
Fixed Instances: 1
Instance Type: g5.xlarge (1x A10G GPU)
Use case: Development and testing of small models (up to ~7B parameters)
Cost: Minimal, single instance only when needed
Medium Model - Production (Moderate Traffic)
Min Replicas: 2
Max Replicas: 10
CPU Threshold: 70%
Memory Threshold: 80%
Polling Interval: 30s
Instance Type: g5.2xlarge (1x A10G 24GB GPU)
Use case: Production API with moderate, variable traffic
Cost: 2 instances minimum, scales up to 10 during peaks
Large Model - High-Traffic API
Min Replicas: 3
Max Replicas: 20
CPU Threshold: 60%
Memory Threshold: 70%
Polling Interval: 20s
Scale Up Cooldown: 90s
Scale Down Cooldown: 180s
Instance Type: p4de.24xlarge (8x A100 80GB GPUs)
Use case: High-traffic production API requiring low latency
Cost: 3 instances minimum, scales up to 20 for peak loads
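For any of the configurations above, the monthly cost falls between a floor (only the min replicas running) and a ceiling (max replicas running continuously). A rough sketch, where `hourly_rate_usd` is whatever your provider charges for the chosen instance type (the rate in the usage note below is hypothetical):

```python
def monthly_cost_range(min_replicas: int, max_replicas: int,
                       hourly_rate_usd: float,
                       hours_per_month: int = 730) -> tuple:
    """Rough monthly cost bounds: floor assumes only min replicas ever run;
    ceiling assumes max replicas run around the clock."""
    floor = min_replicas * hourly_rate_usd * hours_per_month
    ceiling = max_replicas * hourly_rate_usd * hours_per_month
    return floor, ceiling
```

For example, at an assumed $1.00/hour the medium-model configuration (min 2, max 10) spans $1,460 to $7,300 per month; the actual bill lands between the bounds depending on traffic.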
Monitoring Autoscaling
View autoscaling activity in your model details page:
Current Status
- Current Replicas: Real-time instance count
- Last Scale Event: Timestamp of last scale up/down with reason
- Next Evaluation: When the next scaling decision will be made
Metrics Chart
- CPU Usage: Line chart showing CPU usage across all instances
- Memory Usage: Line chart showing memory usage across all instances
- Threshold Lines: Visual indicators of scale up/down thresholds
- Time Range: View last 1h, 6h, 24h, or 7 days
Scaling History
Log of all scaling decisions with metrics snapshots:
2024-01-15 14:32:15 - Scaled up from 2 to 3 replicas (CPU: 75%, Memory: 68%)
2024-01-15 14:28:45 - CPU threshold exceeded: 75% > 70%
2024-01-15 13:45:20 - Scaled down from 3 to 2 replicas (CPU: 25%, Memory: 30%)
2024-01-15 13:40:10 - Resources below threshold for 120s
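If you export the scaling history for analysis, entries in the format above can be parsed with a regular expression. This is a sketch against the sample log lines shown here, not a documented export format:

```python
import re

# Matches "Scaled up/down from X to Y replicas (CPU: N%, Memory: M%)" entries.
SCALE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"Scaled (?P<dir>up|down) from (?P<frm>\d+) to (?P<to>\d+) replicas "
    r"\(CPU: (?P<cpu>\d+)%, Memory: (?P<mem>\d+)%\)$"
)

def parse_scale_event(line: str):
    """Return a dict for 'Scaled up/down' entries, or None for other log lines."""
    m = SCALE_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    return {"ts": d["ts"], "direction": d["dir"],
            "from": int(d["frm"]), "to": int(d["to"]),
            "cpu": int(d["cpu"]), "mem": int(d["mem"])}
```

Threshold-crossing lines (e.g. "CPU threshold exceeded") deliberately return None so you can count actual replica changes separately.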
Cost Impact
Track how autoscaling affects your compute costs:
- Current Cost: Real-time cost of running instances
- Projected Monthly: Estimated monthly cost at current scale
- Autoscaling Savings: Cost saved by scaling down during low usage
- Cost per Request: Average cost per inference request
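The dashboard's exact formulas are not specified here, but plausible definitions for the last two figures look like this (function names and inputs are illustrative):

```python
def autoscaling_savings(replica_hours_used: float, max_replicas: int,
                        window_hours: float, hourly_rate_usd: float) -> float:
    """Savings vs. running max_replicas flat-out for the whole window."""
    flat_cost = max_replicas * window_hours * hourly_rate_usd
    actual_cost = replica_hours_used * hourly_rate_usd
    return flat_cost - actual_cost

def cost_per_request(total_cost_usd: float, request_count: int) -> float:
    """Average inference cost; guard against a zero-request window."""
    if request_count == 0:
        raise ValueError("no requests in the billing window")
    return total_cost_usd / request_count
```

For example, 60 replica-hours consumed in a 24-hour window with max 10 replicas at $2.00/hour yields $360 saved versus a fixed fleet.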
Troubleshooting
Problem: Instances scale up too frequently
Solution:
- Increase CPU/Memory thresholds
- Increase polling interval
- Increase scale up cooldown period
Problem: Instances scale down too quickly
Solution:
- Increase scale down cooldown period
- Lower the scale down trigger (currently 50% of threshold)
Problem: Not scaling up when needed
Solution:
- Lower CPU/Memory thresholds
- Decrease polling interval
- Check if max replicas limit is reached
Problem: High costs due to autoscaling
Solution:
- Reduce max replicas
- Lower the scale down cooldown so idle instances are removed sooner
- Consider scheduled deployment instead
- Review threshold settings to avoid unnecessary scaling
Problem: Cold starts affecting performance
Solution:
- Increase min replicas to reduce scale-from-zero events
- Consider "Always On" deployment for critical applications
- Use scheduled deployment if traffic patterns are predictable
Next Steps
- Monitor model performance to optimize autoscaling settings
- Optimize costs with right-sized resources
- Learn about deployment options for different use cases