Autoscaling Configuration

Enable automatic scaling for self-hosted models based on CPU and memory usage to handle varying inference loads efficiently. When enabled, the platform automatically adjusts the number of running instances to match demand.

Intelligent Scaling

The autoscaler monitors your model's resource usage and intelligently scales up during high demand and down during low usage periods, optimizing both performance and cost.

How to Enable Autoscaling

  1. During model deployment, scroll to the Auto-scaling section
  2. Check Enable auto-scaling
  3. Configure thresholds and limits based on your needs

Configuration Options

Setting               Default   Range      Description
Min Replicas          1         1-20       Minimum number of instances always running
Max Replicas          10        2-50       Maximum number of instances to scale to
CPU Threshold         70%       1-100%     Scale up when average CPU exceeds this
Memory Threshold      80%       1-100%     Scale up when average memory exceeds this
Polling Interval      30s       10-300s    How often to check metrics
Scale Up Cooldown     60s       30-600s    Wait time after scaling up before the next scaling decision
Scale Down Cooldown   120s      60-900s    Wait time after scaling down before the next scaling decision
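
The settings above can be expressed as a single configuration object. The sketch below is illustrative only; the field names are assumptions, not the platform's actual API schema.

```python
from dataclasses import dataclass

@dataclass
class AutoscalingConfig:
    # Field names are hypothetical; check your deployment schema.
    min_replicas: int = 1          # range 1-20
    max_replicas: int = 10         # range 2-50
    cpu_threshold: float = 0.70    # scale up above 70% average CPU
    memory_threshold: float = 0.80 # scale up above 80% average memory
    polling_interval_s: int = 30   # range 10-300s
    scale_up_cooldown_s: int = 60  # range 30-600s
    scale_down_cooldown_s: int = 120  # range 60-900s

# Example: a production-style setup with two always-on replicas
config = AutoscalingConfig(min_replicas=2, max_replicas=10)
```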

How Autoscaling Works

The autoscaling system follows a five-step process to maintain optimal performance:

1. Metrics Collection

The platform monitors CPU and memory usage across all instances at every polling interval (default: 30 seconds).

2. Scale Up Decision

When either CPU or memory usage exceeds its threshold, a new instance is added (one at a time).

Trigger conditions:

  • CPU usage > CPU threshold OR
  • Memory usage > Memory threshold

3. Scale Down Decision

When both CPU and memory usage remain below 50% of their thresholds for a sustained period, instances are removed.

Trigger conditions:

  • CPU usage < 50% of CPU threshold AND
  • Memory usage < 50% of Memory threshold AND
  • Sustained for at least one scale down cooldown period
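
The trigger conditions in steps 2 and 3 can be sketched as two predicates (a minimal illustration, assuming metrics are reported as fractions of capacity):

```python
def should_scale_up(cpu: float, mem: float,
                    cpu_threshold: float = 0.70,
                    mem_threshold: float = 0.80) -> bool:
    # Step 2: either metric above its threshold triggers a scale-up.
    return cpu > cpu_threshold or mem > mem_threshold

def should_scale_down(cpu: float, mem: float,
                      below_for_s: float,
                      cpu_threshold: float = 0.70,
                      mem_threshold: float = 0.80,
                      cooldown_s: float = 120.0) -> bool:
    # Step 3: both metrics below 50% of their thresholds,
    # sustained for at least one scale-down cooldown period.
    return (cpu < 0.5 * cpu_threshold
            and mem < 0.5 * mem_threshold
            and below_for_s >= cooldown_s)
```

Note the asymmetry: scale-up needs only one metric over threshold (OR), while scale-down requires both metrics low (AND) plus the sustained-duration check, which biases the system toward availability over cost.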

4. Cooldown Periods

After scaling up (60s) or down (120s), the system waits before making another scaling decision. This prevents rapid scaling oscillations and allows new instances to stabilize.

5. Dedicated Nodes

Each new instance is provisioned on its own dedicated GPU node for isolation and performance.

Important

Each autoscaled instance receives its own dedicated GPU compute node. This ensures performance isolation and prevents GPU contention but may increase infrastructure costs during scale-up events. Monitor your usage in the FinOps dashboard.

Best Practices

Development Environment

Configuration:

  • Autoscaling: Disabled
  • Fixed Instances: 1
  • Instance Type: Smallest GPU that fits model

Why: Minimize costs during development when traffic is minimal and predictable.

Production Environment

Configuration:

  • Autoscaling: Enabled
  • Min Replicas: ≥ 2 for high availability
  • Max Replicas: Based on expected peak load
  • Lower thresholds: CPU 60-70%, Memory 70-80%

Why: Ensure high availability and responsive scaling for production workloads.

Variable Traffic Patterns

Configuration:

  • Set higher max replicas (10-20)
  • Lower thresholds (CPU 60%, Memory 70%)
  • Shorter polling interval (20-30s)

Why: Respond quickly to unpredictable traffic spikes.

Steady Traffic Patterns

Configuration:

  • Use fixed instances instead of autoscaling
  • Size appropriately for average load

Why: Autoscaling overhead is not needed for predictable workloads.

CPU-Intensive Models

Configuration:

  • Lower CPU threshold to 60%
  • Keep memory threshold at default 80%

Why: Prevent CPU bottlenecks in compute-heavy models.

Memory-Intensive Models

Configuration:

  • Lower memory threshold to 70%
  • Keep CPU threshold at default 70%

Why: Prevent OOM (Out of Memory) errors in large language models.

GPU-Based Models

Configuration:

  • Longer cooldown periods (180-300s)
  • Conservative thresholds (CPU 60%, Memory 70%)

Why: GPU models have longer initialization times, so avoid unnecessary scaling.

Example Configurations

Small Model - Development

Autoscaling: Disabled
Fixed Instances: 1
Instance Type: g5.xlarge (1x A10G GPU)

Use case: Development and testing of small models (around 7B parameters)

Cost: Minimal, single instance only when needed

Medium Model - Production (Moderate Traffic)

Min Replicas: 2
Max Replicas: 10
CPU Threshold: 70%
Memory Threshold: 80%
Polling Interval: 30s
Instance Type: g5.2xlarge (1x A10G 24GB GPU)

Use case: Production API with moderate, variable traffic

Cost: 2 instances minimum, scales up to 10 during peaks

Large Model - High-Traffic API

Min Replicas: 3
Max Replicas: 20
CPU Threshold: 60%
Memory Threshold: 70%
Polling Interval: 20s
Scale Up Cooldown: 90s
Scale Down Cooldown: 180s
Instance Type: p4d.24xlarge (8x A100 80GB GPUs)

Use case: High-traffic production API requiring low latency

Cost: 3 instances minimum, scales up to 20 for peak loads

Monitoring Autoscaling

View autoscaling activity in your model details page:

Current Status

  • Current Replicas: Real-time instance count
  • Last Scale Event: Timestamp of last scale up/down with reason
  • Next Evaluation: When the next scaling decision will be made

Metrics Chart

  • CPU Usage: Line chart showing CPU usage across all instances
  • Memory Usage: Line chart showing memory usage across all instances
  • Threshold Lines: Visual indicators of scale up/down thresholds
  • Time Range: View last 1h, 6h, 24h, or 7 days

Scaling History

Log of all scaling decisions with metrics snapshots:

2024-01-15 14:32:15 - Scaled up from 2 to 3 replicas (CPU: 75%, Memory: 68%)
2024-01-15 14:28:45 - CPU threshold exceeded: 75% > 70%
2024-01-15 13:45:20 - Scaled down from 3 to 2 replicas (CPU: 25%, Memory: 30%)
2024-01-15 13:40:10 - Resources below threshold for 120s

Cost Impact

Track how autoscaling affects your compute costs:

  • Current Cost: Real-time cost of running instances
  • Projected Monthly: Estimated monthly cost at current scale
  • Autoscaling Savings: Cost saved by scaling down during low usage
  • Cost per Request: Average cost per inference request
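
Cost per request can be approximated from instance count, hourly rate, and request volume. The rate below is a made-up placeholder, not actual pricing; consult the FinOps dashboard for real figures.

```python
def cost_per_request(replicas: int, hourly_rate_usd: float,
                     requests_per_hour: int) -> float:
    """Average cost per inference request at a fixed replica count.
    hourly_rate_usd is a placeholder; check your provider's pricing."""
    return (replicas * hourly_rate_usd) / requests_per_hour

# e.g. 2 replicas at a hypothetical $1.50/hr serving 10,000 req/hr
print(round(cost_per_request(2, 1.50, 10_000), 6))  # → 0.0003
```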

Troubleshooting

Problem: Instances scale up too frequently

Solution:

  • Increase CPU/Memory thresholds
  • Increase polling interval
  • Increase scale up cooldown period

Problem: Instances scale down too quickly

Solution:

  • Increase scale down cooldown period
  • Lower the scale-down trigger fraction (currently 50% of threshold) so instances are removed only at lower usage

Problem: Not scaling up when needed

Solution:

  • Lower CPU/Memory thresholds
  • Decrease polling interval
  • Check if max replicas limit is reached

Problem: High costs due to autoscaling

Solution:

  • Reduce max replicas
  • Speed up scale-down by lowering the scale down cooldown period
  • Consider scheduled deployment instead
  • Review threshold settings to avoid unnecessary scaling

Problem: Cold starts affecting performance

Solution:

  • Increase min replicas to reduce scale-from-zero events
  • Consider "Always On" deployment for critical applications
  • Use scheduled deployment if traffic patterns are predictable

Next Steps