Auto-Scaling Configuration
Enable automatic scaling based on CPU and memory usage to handle varying workloads efficiently. When enabled, the platform automatically adjusts the number of running instances to match demand using Kubernetes Horizontal Pod Autoscaler (HPA).
Overview
The autoscaler monitors your app's resource usage and intelligently scales up during high demand and down during low usage periods, optimizing both performance and cost.
How It Works
- Metrics Collection: Platform monitors CPU and memory usage across all instances every polling interval (default: 30 seconds)
- Scale Up Decision: When either CPU or memory exceeds threshold, a new instance is added (one at a time)
- Scale Down Decision: When both CPU and memory are below 50% of threshold for sustained period, instances are removed
- Cooldown Periods: After scaling up or down, the system waits before making another scaling decision
- HPA Integration: The platform creates and manages Kubernetes Horizontal Pod Autoscalers for your deployment
The autoscaler uses cooldown periods and a lower scale-down band (50% of the scale-up thresholds) to prevent scaling flapping and ensure smooth transitions between instance counts.
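The decision rules above can be sketched as a small function. This is an illustrative model only, not the platform's actual HPA-backed implementation: scale up when either metric exceeds its threshold, scale down only when both sit below 50% of their thresholds.

```python
def scaling_decision(cpu_pct, mem_pct, cpu_threshold=70.0, mem_threshold=80.0):
    """Return +1 (scale up), -1 (scale down), or 0 (hold).

    Scale up when EITHER metric exceeds its threshold; scale down
    only when BOTH metrics sit below 50% of their thresholds.
    """
    if cpu_pct > cpu_threshold or mem_pct > mem_threshold:
        return 1
    if cpu_pct < cpu_threshold * 0.5 and mem_pct < mem_threshold * 0.5:
        return -1
    return 0
```

Note the gap between the two bands (e.g. CPU between 35% and 70%): inside it the instance count holds steady, which is what dampens oscillation.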
Enabling Autoscaling
During Deployment
- Navigate to Apps and deploy your application
- Scroll to the Autoscaling (Optional) section
- Check Enable Autoscaling
- Configure thresholds and limits based on your needs
After Deployment
- Navigate to app details page
- Click Configuration tab
- Scroll to Autoscaling section
- Toggle Enable Autoscaling
- Redeploy to apply changes
Configuration Options
| Setting | Default | Range | Description |
|---|---|---|---|
| Min Replicas | 1 | 1-20 | Minimum number of instances always running |
| Max Replicas | 10 | 1-50 | Maximum number of instances to scale to |
| CPU Threshold | 70% | 1-100% | Scale up when average CPU exceeds this |
| Memory Threshold | 80% | 1-100% | Scale up when average memory exceeds this |
| Polling Interval | 30s | 10-300s | How often to check metrics |
| Scale Up Cooldown | 60s | 30-600s | Wait time after scaling up before next scale decision |
| Scale Down Cooldown | 120s | 60-900s | Wait time after scaling down before next scale decision |
max_replicas must be greater than min_replicas. Both have a minimum value of 1.
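As a sketch, the constraints from the table above could be checked with a helper like this (the function name and error messages are hypothetical, not a platform API):

```python
def validate_autoscaling(min_replicas, max_replicas, cpu_threshold, memory_threshold):
    """Check an autoscaling config against the documented ranges."""
    if not 1 <= min_replicas <= 20:
        raise ValueError("min_replicas must be between 1 and 20")
    if not 1 <= max_replicas <= 50:
        raise ValueError("max_replicas must be between 1 and 50")
    if max_replicas <= min_replicas:
        raise ValueError("max_replicas must be greater than min_replicas")
    for name, pct in (("cpu_threshold", cpu_threshold),
                      ("memory_threshold", memory_threshold)):
        if not 1 <= pct <= 100:
            raise ValueError(f"{name} must be between 1 and 100")
    return True
```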
Understanding Thresholds
CPU Threshold: Percentage of allocated CPU (from resources.cpu_limit in manifest)
- If cpu_limit: "1000m" and the threshold is 70%, scaling triggers at 700m (0.7 cores)
- Lower thresholds = more aggressive scaling
- Higher thresholds = fewer instances, lower cost
Memory Threshold: Percentage of allocated memory (from resources.memory_limit in manifest)
- If memory_limit: "1Gi" and the threshold is 80%, scaling triggers at ~819MB
- Critical for preventing OOM (Out of Memory) errors
- Recommended: 70-80% for memory-intensive apps
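The arithmetic behind these trigger points can be sketched in Python (assuming plain "m" CPU suffixes and "Mi"/"Gi" memory suffixes, the only formats shown in this guide):

```python
def cpu_trigger_millicores(cpu_limit, threshold_pct):
    """Absolute CPU level (millicores) at which scale-up fires."""
    millicores = int(cpu_limit.rstrip("m"))  # e.g. "1000m" -> 1000
    return millicores * threshold_pct / 100

def memory_trigger_mib(memory_limit, threshold_pct):
    """Absolute memory level (MiB) at which scale-up fires."""
    units = {"Mi": 1, "Gi": 1024}
    value, unit = memory_limit[:-2], memory_limit[-2:]
    return int(value) * units[unit] * threshold_pct / 100
```

For example, `cpu_trigger_millicores("1000m", 70)` gives 700m, and `memory_trigger_mib("1Gi", 80)` gives 819.2 MiB, matching the figures above.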
Scaling Behavior
Scale Up Triggers
A new instance is added when either condition is met:
- Average CPU usage > CPU threshold
- Average memory usage > Memory threshold
Example:
```yaml
# Manifest resources
resources:
  cpu_limit: "1000m"
  memory_limit: "1Gi"

# Autoscaling config
cpu_threshold: 70%    # Scale up at 700m CPU
memory_threshold: 80% # Scale up at 819MB memory
```
If average usage across instances is:
- CPU: 750m (75%) OR
- Memory: 850MB (83%)
Scale up triggered.
Scale Down Triggers
An instance is removed when both conditions are met:
- Average CPU usage < 35% (50% of 70% threshold)
- Average memory usage < 40% (50% of 80% threshold)
- Sustained for at least 2x polling interval
Example:
- CPU: 300m (30%) AND
- Memory: 300MB (29%)
- For at least 60 seconds
Scale down triggered.
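Plugging the example figures from both triggers into the documented rules (70%/80% thresholds, scale-down band at half of each, 30s default polling interval):

```python
# Scale-up: EITHER metric over its threshold
cpu_pct, mem_pct = 75, 83                  # from the scale-up example
scale_up = cpu_pct > 70 or mem_pct > 80    # True: both conditions exceed

# Scale-down: BOTH metrics under 50% of their thresholds, sustained
cpu_pct, mem_pct = 30, 29                  # from the scale-down example
sustained_seconds = 60
scale_down = (cpu_pct < 35                 # 50% of the 70% CPU threshold
              and mem_pct < 40             # 50% of the 80% memory threshold
              and sustained_seconds >= 2 * 30)  # 2x polling interval
```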
Cooldown Periods
After scaling events, the system waits before making another decision:
- Scale Up Cooldown: 60 seconds (configurable: 30-600s)
- Scale Down Cooldown: 120 seconds (configurable: 60-900s)
This prevents rapid scaling oscillations ("flapping").
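A minimal sketch of how such a cooldown gate might behave (illustrative only; CooldownGate is a hypothetical name, not a platform class):

```python
import time

class CooldownGate:
    """Block scaling decisions during cooldown windows."""

    def __init__(self, up_cooldown=60, down_cooldown=120):
        self.up_cooldown = up_cooldown
        self.down_cooldown = down_cooldown
        self._blocked_until = 0.0

    def try_scale(self, direction, now=None):
        """direction: +1 for up, -1 for down. True if the event may fire now."""
        now = time.monotonic() if now is None else now
        if now < self._blocked_until:
            return False  # still cooling down from the last event
        cooldown = self.up_cooldown if direction > 0 else self.down_cooldown
        self._blocked_until = now + cooldown
        return True
```

With the defaults, a scale-up at t=0 blocks any further decision until t=60, and a scale-down then blocks until t=180.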
Example Configurations
Light Traffic API (Development)
Best for development and testing with predictable low traffic.
```yaml
# Manifest
resources:
  cpu_request: "100m"
  memory_request: "256Mi"
  cpu_limit: "500m"
  memory_limit: "512Mi"
```
Autoscaling: Disabled (use fixed instances instead)
- Fixed Instances: 1
- Resource Tier: Small
Use Case: Development, testing, personal projects
Production Web App (Moderate Traffic)
Balanced configuration for production apps with moderate traffic.
```yaml
# Manifest
resources:
  cpu_request: "200m"
  memory_request: "512Mi"
  cpu_limit: "1000m"
  memory_limit: "1Gi"
```
Autoscaling:
- Enabled: Yes
- Min Replicas: 2
- Max Replicas: 10
- CPU Threshold: 70%
- Memory Threshold: 80%
- Polling Interval: 30s
- Scale Up Cooldown: 60s
- Scale Down Cooldown: 120s
Use Case: Production web apps, REST APIs, dashboards
Expected Behavior:
- Always 2 instances for high availability
- Scales to 10 during traffic spikes
- Handles 5x traffic increase automatically
High-Traffic API (Variable Load)
Aggressive scaling for APIs with unpredictable traffic patterns.
```yaml
# Manifest
resources:
  cpu_request: "500m"
  memory_request: "1Gi"
  cpu_limit: "2000m"
  memory_limit: "2Gi"
```
Autoscaling:
- Enabled: Yes
- Min Replicas: 3
- Max Replicas: 20
- CPU Threshold: 60%
- Memory Threshold: 70%
- Polling Interval: 20s
- Scale Up Cooldown: 30s
- Scale Down Cooldown: 60s
Use Case: High-traffic APIs, real-time services, production critical apps
Expected Behavior:
- Always 3 instances for redundancy
- Scales aggressively (lower thresholds)
- Fast response to load changes (20s polling)
- Can handle 6-7x traffic increases
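The rough "N× traffic" figures quoted in these profiles come from replica headroom, assuming throughput scales roughly linearly with instance count (real workloads only approximate this):

```python
def traffic_headroom(min_replicas, max_replicas):
    """Approximate traffic multiple a profile can absorb, assuming
    throughput grows linearly with the replica count."""
    return max_replicas / min_replicas
```

For the moderate-traffic profile this gives 10 / 2 = 5x, and for the high-traffic profile 20 / 3 ≈ 6.7x, matching the expected-behavior notes.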
CPU-Intensive Processing
Optimized for compute-heavy workloads.
```yaml
# Manifest
resources:
  cpu_request: "1000m"
  memory_request: "512Mi"
  cpu_limit: "4000m"
  memory_limit: "1Gi"
```
Autoscaling:
- Enabled: Yes
- Min Replicas: 2
- Max Replicas: 15
- CPU Threshold: 60% (lower for CPU-heavy apps)
- Memory Threshold: 80%
- Polling Interval: 30s
Use Case: Image processing, video transcoding, data analysis
Memory-Intensive Application
Optimized for memory-heavy workloads.
```yaml
# Manifest
resources:
  cpu_request: "200m"
  memory_request: "2Gi"
  cpu_limit: "1000m"
  memory_limit: "4Gi"
```
Autoscaling:
- Enabled: Yes
- Min Replicas: 2
- Max Replicas: 10
- CPU Threshold: 70%
- Memory Threshold: 70% (lower to prevent OOM)
- Polling Interval: 30s
Use Case: In-memory caching, ML inference, data processing
Monitoring Autoscaling
View autoscaling activity in your app details page:
Metrics Dashboard
- Current Replicas: Real-time instance count
- CPU Usage: Percentage of allocated CPU across all instances
- Memory Usage: Percentage of allocated memory across all instances
Best Practices
Traffic Patterns
Predictable Traffic:
- Use fixed instances instead of autoscaling
- Right-size resource allocation
- Lower operational complexity
Unpredictable Traffic:
- Enable autoscaling with higher max replicas
- Set lower thresholds for faster response
- Monitor scaling patterns and adjust
Resource Allocation
CPU-Intensive Apps:
- Lower CPU threshold (60%)
- Higher CPU limits
- Prevents performance degradation
Memory-Intensive Apps:
- Lower memory threshold (70%)
- Higher memory limits
- Prevents OOM errors
Balanced Apps:
- Standard thresholds (70% CPU, 80% memory)
- Balanced resource allocation
Cost Optimization
- Start Conservative: Begin with low max replicas, increase if needed
- Monitor Patterns: Review metrics to find optimal settings
- Use Min Replicas Wisely: Higher min = higher baseline cost
- Development Environments: Disable autoscaling, use 1 instance
- Tune Cooldowns: Adjust scale_up_cooldown (30-600s) and scale_down_cooldown (60-900s) based on your traffic patterns
Troubleshooting
Not Scaling Up
Problem: CPU/memory high but no new instances
Check:
- Reached max replicas limit?
- Currently in cooldown period?
- Thresholds configured correctly?
- Metrics being collected?
Solution:
```yaml
# Lower thresholds for more aggressive scaling
cpu_threshold: 60%    # Was 70%
memory_threshold: 70% # Was 80%

# Increase max replicas
max_replicas: 20      # Was 10
```
Scaling Flapping
Problem: Instances constantly scaling up and down
Check:
- Scale-up thresholds too close to the scale-down band (50% of each threshold)?
- Polling interval too short?
- Resource limits too low?
Solution:
```yaml
# Increase cooldown periods
scale_up_cooldown: 120   # Was 60
scale_down_cooldown: 300 # Was 120

# Increase polling interval
polling_interval: 60     # Was 30
```
Slow Scaling
Problem: Scaling happens too slowly
Check:
- Polling interval too long?
- Thresholds too high?
- Cooldown periods too long?
Solution:
```yaml
# Decrease polling interval
polling_interval: 20  # Was 30

# Lower thresholds
cpu_threshold: 60%    # Was 70%

# Reduce cooldown
scale_up_cooldown: 30 # Minimum allowed
```
High Costs
Problem: Too many instances running
Solution:
```yaml
# Reduce min replicas
min_replicas: 1 # Was 3

# Reduce max replicas
max_replicas: 5 # Was 20

# Allow more aggressive scale-down
# (automatic - 50% of thresholds)
```