Auto-Scaling Configuration

Enable automatic scaling based on CPU and memory usage to handle varying workloads efficiently. When enabled, the platform automatically adjusts the number of running instances to match demand using Kubernetes Horizontal Pod Autoscaler (HPA).

Overview

The autoscaler monitors your app's resource usage and intelligently scales up during high demand and down during low usage periods, optimizing both performance and cost.

How It Works

  1. Metrics Collection: Platform monitors CPU and memory usage across all instances every polling interval (default: 30 seconds)
  2. Scale Up Decision: When either CPU or memory exceeds threshold, a new instance is added (one at a time)
  3. Scale Down Decision: When both CPU and memory are below 50% of threshold for sustained period, instances are removed
  4. Cooldown Periods: After scaling up or down, the system waits before making another scaling decision
  5. HPA Integration: The platform creates and manages Kubernetes Horizontal Pod Autoscalers for your deployment
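The decision flow above can be sketched in Python. This is an illustrative model of the documented behavior, not the platform's implementation; `ScalerConfig` and `decide` are invented names, and the sustained-usage requirement for scale-down (step 3) is omitted here for brevity.

```python
from dataclasses import dataclass

@dataclass
class ScalerConfig:
    min_replicas: int = 1
    max_replicas: int = 10
    cpu_threshold: float = 70.0     # percent of cpu_limit
    memory_threshold: float = 80.0  # percent of memory_limit
    scale_up_cooldown: int = 60     # seconds
    scale_down_cooldown: int = 120  # seconds

def decide(cfg, replicas, avg_cpu, avg_mem,
           seconds_since_last_scale, last_was_scale_up):
    """Return the replica-count delta (+1, -1, or 0) for one polling tick."""
    # Cooldown: wait after the previous scaling event before deciding again.
    cooldown = cfg.scale_up_cooldown if last_was_scale_up else cfg.scale_down_cooldown
    if seconds_since_last_scale < cooldown:
        return 0
    # Scale up when EITHER metric exceeds its threshold (one instance at a time).
    if ((avg_cpu > cfg.cpu_threshold or avg_mem > cfg.memory_threshold)
            and replicas < cfg.max_replicas):
        return +1
    # Scale down when BOTH metrics sit below 50% of their thresholds.
    if (avg_cpu < 0.5 * cfg.cpu_threshold
            and avg_mem < 0.5 * cfg.memory_threshold
            and replicas > cfg.min_replicas):
        return -1
    return 0
```

Note that scale-up fires on either metric but scale-down requires both, which biases the system toward availability over cost.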

Smart Scaling

The autoscaler uses intelligent algorithms to prevent scaling flapping and ensure smooth transitions between instance counts.

Enabling Autoscaling

During Deployment

  1. Navigate to Apps and deploy your application
  2. Scroll to the Autoscaling (Optional) section
  3. Check Enable Autoscaling
  4. Configure thresholds and limits based on your needs

After Deployment

  1. Navigate to app details page
  2. Click Configuration tab
  3. Scroll to Autoscaling section
  4. Toggle Enable Autoscaling
  5. Redeploy to apply changes

Configuration Options

| Setting | Default | Range | Description |
|---|---|---|---|
| Min Replicas | 1 | 1-20 | Minimum number of instances always running |
| Max Replicas | 10 | 1-50 | Maximum number of instances to scale to |
| CPU Threshold | 70% | 1-100% | Scale up when average CPU exceeds this |
| Memory Threshold | 80% | 1-100% | Scale up when average memory exceeds this |
| Polling Interval | 30s | 10-300s | How often to check metrics |
| Scale Up Cooldown | 60s | 30-600s | Wait time after scaling up before next scale decision |
| Scale Down Cooldown | 120s | 60-900s | Wait time after scaling down before next scale decision |
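Put together as a manifest fragment, the defaults above might look like this (the `autoscaling` block and its snake_case key names are assumed from the keys used elsewhere on this page, not a confirmed schema):

```yaml
# Hypothetical autoscaling block using the documented defaults
autoscaling:
  enabled: true
  min_replicas: 1
  max_replicas: 10
  cpu_threshold: 70%
  memory_threshold: 80%
  polling_interval: 30
  scale_up_cooldown: 60
  scale_down_cooldown: 120
```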

max_replicas Validation

max_replicas must be greater than min_replicas. Both have a minimum value of 1.
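This validation rule can be expressed as a small check (a hypothetical helper, not the platform's code):

```python
def validate_replicas(min_replicas: int, max_replicas: int) -> None:
    """Enforce the documented constraints on replica bounds."""
    if min_replicas < 1 or max_replicas < 1:
        raise ValueError("min_replicas and max_replicas must be at least 1")
    if max_replicas <= min_replicas:
        raise ValueError("max_replicas must be greater than min_replicas")
```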

Understanding Thresholds

CPU Threshold: Percentage of allocated CPU (from resources.cpu_limit in manifest)

  • If cpu_limit: "1000m" and threshold is 70%, scaling triggers at 700m (0.7 cores)
  • Lower thresholds = more aggressive scaling
  • Higher thresholds = fewer instances, lower cost

Memory Threshold: Percentage of allocated memory (from resources.memory_limit in manifest)

  • If memory_limit: "1Gi" and threshold is 80%, scaling triggers at ~819MiB
  • Critical for preventing OOM (Out of Memory) errors
  • Recommended: 70-80% for memory-intensive apps
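The trigger points follow directly from the limits; a quick check of the arithmetic (illustrative helpers, invented names):

```python
def cpu_trigger_millicores(cpu_limit_m: int, threshold_pct: float) -> float:
    """Absolute CPU (in millicores) at which scale-up triggers."""
    return cpu_limit_m * threshold_pct / 100

def memory_trigger_mib(memory_limit_mib: int, threshold_pct: float) -> float:
    """Absolute memory (in MiB) at which scale-up triggers."""
    return memory_limit_mib * threshold_pct / 100

# cpu_limit "1000m" at a 70% threshold -> 700m (0.7 cores)
print(cpu_trigger_millicores(1000, 70))
# memory_limit "1Gi" (1024 MiB) at an 80% threshold -> ~819 MiB
print(memory_trigger_mib(1024, 80))
```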

Scaling Behavior

Scale Up Triggers

A new instance is added when either condition is met:

  • Average CPU usage > CPU threshold
  • Average memory usage > Memory threshold

Example:

```yaml
# Manifest resources
resources:
  cpu_limit: "1000m"
  memory_limit: "1Gi"

# Autoscaling config
cpu_threshold: 70%    # Scale up at 700m CPU
memory_threshold: 80% # Scale up at ~819MiB memory
```

If the current average usage across instances is:

  • CPU: 750m (75%) OR
  • Memory: 850MiB (83%)

Scale up triggered.

Scale Down Triggers

An instance is removed when all of the following hold:

  • Average CPU usage < 35% (50% of the 70% threshold)
  • Average memory usage < 40% (50% of the 80% threshold)
  • Both sustained for at least 2× the polling interval

Example:

  • CPU: 300m (30%) AND
  • Memory: 300MiB (29%)
  • For at least 60 seconds

Scale down triggered.
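The sustained-usage requirement can be modeled as a simple timer that resets whenever either metric breaches its floor. A sketch of the assumed behavior (class and method names are invented):

```python
class ScaleDownGate:
    """Allows scale-down only after both metrics have stayed below
    50% of their thresholds for at least 2x the polling interval."""

    def __init__(self, cpu_threshold: float, memory_threshold: float,
                 polling_interval: int):
        self.cpu_floor = 0.5 * cpu_threshold    # e.g. 35% for a 70% threshold
        self.mem_floor = 0.5 * memory_threshold # e.g. 40% for an 80% threshold
        self.polling_interval = polling_interval
        self.required = 2 * polling_interval    # sustained window in seconds
        self.below_for = 0                      # seconds both metrics stayed low

    def tick(self, avg_cpu: float, avg_mem: float) -> bool:
        """Call once per polling interval; True means scale-down may fire."""
        if avg_cpu < self.cpu_floor and avg_mem < self.mem_floor:
            self.below_for += self.polling_interval
        else:
            self.below_for = 0  # any breach resets the sustained clock
        return self.below_for >= self.required
```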

Cooldown Periods

After scaling events, the system waits before making another decision:

  • Scale Up Cooldown: 60 seconds (configurable: 30-600s)
  • Scale Down Cooldown: 120 seconds (configurable: 60-900s)

This prevents rapid scaling oscillations ("flapping").

Example Configurations

Light Traffic API (Development)

Best for development and testing with predictable low traffic.

```yaml
# Manifest
resources:
  cpu_request: "100m"
  memory_request: "256Mi"
  cpu_limit: "500m"
  memory_limit: "512Mi"
```

Autoscaling: Disabled (use fixed instances instead)

  • Fixed Instances: 1
  • Resource Tier: Small

Use Case: Development, testing, personal projects


Production Web App (Moderate Traffic)

Balanced configuration for production apps with moderate traffic.

```yaml
# Manifest
resources:
  cpu_request: "200m"
  memory_request: "512Mi"
  cpu_limit: "1000m"
  memory_limit: "1Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 2
  • Max Replicas: 10
  • CPU Threshold: 70%
  • Memory Threshold: 80%
  • Polling Interval: 30s
  • Scale Up Cooldown: 60s
  • Scale Down Cooldown: 120s

Use Case: Production web apps, REST APIs, dashboards

Expected Behavior:

  • Always 2 instances for high availability
  • Scales to 10 during traffic spikes
  • Handles 5x traffic increase automatically

High-Traffic API (Variable Load)

Aggressive scaling for APIs with unpredictable traffic patterns.

```yaml
# Manifest
resources:
  cpu_request: "500m"
  memory_request: "1Gi"
  cpu_limit: "2000m"
  memory_limit: "2Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 3
  • Max Replicas: 20
  • CPU Threshold: 60%
  • Memory Threshold: 70%
  • Polling Interval: 20s
  • Scale Up Cooldown: 30s
  • Scale Down Cooldown: 60s

Use Case: High-traffic APIs, real-time services, production critical apps

Expected Behavior:

  • Always 3 instances for redundancy
  • Scales aggressively (lower thresholds)
  • Fast response to load changes (20s polling)
  • Can handle 6-7x traffic increases

CPU-Intensive Processing

Optimized for compute-heavy workloads.

```yaml
# Manifest
resources:
  cpu_request: "1000m"
  memory_request: "512Mi"
  cpu_limit: "4000m"
  memory_limit: "1Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 2
  • Max Replicas: 15
  • CPU Threshold: 60% (lower for CPU-heavy apps)
  • Memory Threshold: 80%
  • Polling Interval: 30s

Use Case: Image processing, video transcoding, data analysis


Memory-Intensive Application

Optimized for memory-heavy workloads.

```yaml
# Manifest
resources:
  cpu_request: "200m"
  memory_request: "2Gi"
  cpu_limit: "1000m"
  memory_limit: "4Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 2
  • Max Replicas: 10
  • CPU Threshold: 70%
  • Memory Threshold: 70% (lower to prevent OOM)
  • Polling Interval: 30s

Use Case: In-memory caching, ML inference, data processing

Monitoring Autoscaling

View autoscaling activity in your app details page:

Metrics Dashboard

  • Current Replicas: Real-time instance count
  • CPU Usage: Percentage of allocated CPU across all instances
  • Memory Usage: Percentage of allocated memory across all instances

Best Practices

Traffic Patterns

Predictable Traffic:

  • Use fixed instances instead of autoscaling
  • Right-size resource allocation
  • Lower operational complexity

Unpredictable Traffic:

  • Enable autoscaling with higher max replicas
  • Set lower thresholds for faster response
  • Monitor scaling patterns and adjust

Resource Allocation

CPU-Intensive Apps:

  • Lower CPU threshold (60%)
  • Higher CPU limits
  • Prevents performance degradation

Memory-Intensive Apps:

  • Lower memory threshold (70%)
  • Higher memory limits
  • Prevents OOM errors

Balanced Apps:

  • Standard thresholds (70% CPU, 80% memory)
  • Balanced resource allocation

Cost Optimization

  1. Start Conservative: Begin with low max replicas, increase if needed
  2. Monitor Patterns: Review metrics to find optimal settings
  3. Use Min Replicas Wisely: Higher min = higher baseline cost
  4. Development Environments: Disable autoscaling, use 1 instance
  5. Tune Cooldowns: Adjust scale_up_cooldown (30-600s) and scale_down_cooldown (60-900s) based on your traffic patterns

Troubleshooting

Not Scaling Up

Problem: CPU/memory high but no new instances

Check:

  • Reached max replicas limit?
  • Currently in cooldown period?
  • Thresholds configured correctly?
  • Metrics being collected?

Solution:

```yaml
# Lower thresholds for more aggressive scaling
cpu_threshold: 60%    # Was 70%
memory_threshold: 70% # Was 80%

# Increase max replicas
max_replicas: 20 # Was 10
```

Scaling Flapping

Problem: Instances constantly scaling up and down

Check:

  • Scale-up and scale-down thresholds too close together?
  • Polling interval too short?
  • Resource limits too low?

Solution:

```yaml
# Increase cooldown periods
scale_up_cooldown: 120   # Was 60
scale_down_cooldown: 300 # Was 120

# Increase polling interval
polling_interval: 60 # Was 30
```

Slow Scaling

Problem: Scaling happens too slowly

Check:

  • Polling interval too long?
  • Thresholds too high?
  • Cooldown periods too long?

Solution:

```yaml
# Decrease polling interval
polling_interval: 20 # Was 30

# Lower thresholds
cpu_threshold: 60% # Was 70%

# Reduce cooldown
scale_up_cooldown: 30 # Minimum allowed
```

High Costs

Problem: Too many instances running

Solution:

```yaml
# Reduce min replicas
min_replicas: 1 # Was 3

# Reduce max replicas
max_replicas: 5 # Was 20

# Allow more aggressive scale-down
# (automatic - 50% of thresholds)
```

Next Steps