Auto-Scaling Configuration

Enable automatic scaling based on CPU and memory usage to handle varying workloads efficiently. When enabled, the platform automatically adjusts the number of running instances to match demand using Kubernetes Horizontal Pod Autoscaler (HPA).

Overview

The autoscaler monitors your app's resource usage and intelligently scales up during high demand and down during low usage periods, optimizing both performance and cost.

How It Works

  1. Metrics Collection: Platform monitors CPU and memory usage across all instances every polling interval (default: 30 seconds)
  2. Scale Up Decision: When either CPU or memory exceeds threshold, a new instance is added (one at a time)
  3. Scale Down Decision: When both CPU and memory are below 50% of threshold for sustained period, instances are removed
  4. Cooldown Periods: After scaling up or down, the system waits before making another scaling decision
  5. HPA Integration: The platform creates and manages Kubernetes Horizontal Pod Autoscalers for your deployment
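The decision flow above can be sketched in Python. This is an illustrative model of the documented behavior, not the platform's implementation; `ScalerConfig` and `decide` are invented names, and the sustained-usage requirement for scale-down (step 3) is omitted here for brevity.

```python
from dataclasses import dataclass

@dataclass
class ScalerConfig:
    min_replicas: int = 1
    max_replicas: int = 10
    cpu_threshold: float = 70.0     # percent of cpu_limit
    memory_threshold: float = 80.0  # percent of memory_limit
    scale_up_cooldown: int = 60     # seconds
    scale_down_cooldown: int = 120  # seconds

def decide(cfg, replicas, avg_cpu, avg_mem,
           seconds_since_last_scale, last_was_scale_up):
    """Return the replica-count delta (+1, -1, or 0) for one polling tick."""
    # Cooldown: wait after the previous scaling event before deciding again.
    cooldown = cfg.scale_up_cooldown if last_was_scale_up else cfg.scale_down_cooldown
    if seconds_since_last_scale < cooldown:
        return 0
    # Scale up when EITHER metric exceeds its threshold (one instance at a time).
    if ((avg_cpu > cfg.cpu_threshold or avg_mem > cfg.memory_threshold)
            and replicas < cfg.max_replicas):
        return +1
    # Scale down when BOTH metrics sit below 50% of their thresholds.
    if (avg_cpu < 0.5 * cfg.cpu_threshold
            and avg_mem < 0.5 * cfg.memory_threshold
            and replicas > cfg.min_replicas):
        return -1
    return 0
```

Note that scale-up fires on either metric but scale-down requires both, which biases the system toward availability over cost.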

Smart Scaling

The autoscaler uses intelligent algorithms to prevent scaling flapping and ensure smooth transitions between instance counts.

Enabling Autoscaling

During Deployment

  1. Navigate to Apps and deploy your application
  2. Scroll to the Autoscaling (Optional) section
  3. Check Enable Autoscaling
  4. Configure thresholds and limits based on your needs

After Deployment

  1. Navigate to app details page
  2. Click Configuration tab
  3. Scroll to Autoscaling section
  4. Toggle Enable Autoscaling
  5. Redeploy to apply changes

Configuration Options

| Setting | Default | Range | Description |
|---|---|---|---|
| Min Replicas | 1 | 1-20 | Minimum number of instances always running |
| Max Replicas | 10 | 1-50 | Maximum number of instances to scale to |
| CPU Threshold | 70% | 1-100% | Scale up when average CPU exceeds this |
| Memory Threshold | 80% | 1-100% | Scale up when average memory exceeds this |
| Polling Interval | 30s | 10-300s | How often to check metrics |
| Scale Up Cooldown | 60s | 30-600s | Wait time after scaling up before next scale decision |
| Scale Down Cooldown | 120s | 60-900s | Wait time after scaling down before next scale decision |
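Put together as a manifest fragment, the defaults above might look like this (the `autoscaling` block and its snake_case key names are assumed from the keys used elsewhere on this page, not a confirmed schema):

```yaml
# Hypothetical autoscaling block using the documented defaults
autoscaling:
  enabled: true
  min_replicas: 1
  max_replicas: 10
  cpu_threshold: 70%
  memory_threshold: 80%
  polling_interval: 30
  scale_up_cooldown: 60
  scale_down_cooldown: 120
```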

max_replicas Validation

max_replicas must be greater than min_replicas. Both have a minimum value of 1.
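This validation rule can be expressed as a small check (a hypothetical helper, not the platform's code):

```python
def validate_replicas(min_replicas: int, max_replicas: int) -> None:
    """Enforce the documented constraints on replica bounds."""
    if min_replicas < 1 or max_replicas < 1:
        raise ValueError("min_replicas and max_replicas must be at least 1")
    if max_replicas <= min_replicas:
        raise ValueError("max_replicas must be greater than min_replicas")
```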

Understanding Thresholds

CPU Threshold: Percentage of allocated CPU (from resources.cpu_limit in manifest)

  • If cpu_limit: "1000m" and threshold is 70%, scaling triggers at 700m (0.7 cores)
  • Lower thresholds = more aggressive scaling
  • Higher thresholds = fewer instances, lower cost

Memory Threshold: Percentage of allocated memory (from resources.memory_limit in manifest)

  • If memory_limit: "1Gi" and threshold is 80%, scaling triggers at ~819MiB
  • Critical for preventing OOM (Out of Memory) errors
  • Recommended: 70-80% for memory-intensive apps
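The trigger points follow directly from the limits; a quick check of the arithmetic (illustrative helpers, invented names):

```python
def cpu_trigger_millicores(cpu_limit_m: int, threshold_pct: float) -> float:
    """Absolute CPU (in millicores) at which scale-up triggers."""
    return cpu_limit_m * threshold_pct / 100

def memory_trigger_mib(memory_limit_mib: int, threshold_pct: float) -> float:
    """Absolute memory (in MiB) at which scale-up triggers."""
    return memory_limit_mib * threshold_pct / 100

# cpu_limit "1000m" at a 70% threshold -> 700m (0.7 cores)
print(cpu_trigger_millicores(1000, 70))
# memory_limit "1Gi" (1024 MiB) at an 80% threshold -> ~819 MiB
print(memory_trigger_mib(1024, 80))
```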

Scaling Behavior

Scale Up Triggers

A new instance is added when either condition is met:

  • Average CPU usage > CPU threshold
  • Average memory usage > Memory threshold

Example:

```yaml
# Manifest resources
resources:
  cpu_limit: "1000m"
  memory_limit: "1Gi"

# Autoscaling config
cpu_threshold: 70%    # Scale up at 700m CPU
memory_threshold: 80% # Scale up at ~819MiB memory
```

If the current average usage across instances is:

  • CPU: 750m (75%) OR
  • Memory: 850MiB (83%)

Scale up triggered.

Scale Down Triggers

An instance is removed when all of the following hold:

  • Average CPU usage < 35% (50% of the 70% threshold)
  • Average memory usage < 40% (50% of the 80% threshold)
  • Both sustained for at least 2× the polling interval

Example:

  • CPU: 300m (30%) AND
  • Memory: 300MiB (29%)
  • For at least 60 seconds

Scale down triggered.
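The sustained-usage requirement can be modeled as a simple timer that resets whenever either metric breaches its floor. A sketch of the assumed behavior (class and method names are invented):

```python
class ScaleDownGate:
    """Allows scale-down only after both metrics have stayed below
    50% of their thresholds for at least 2x the polling interval."""

    def __init__(self, cpu_threshold: float, memory_threshold: float,
                 polling_interval: int):
        self.cpu_floor = 0.5 * cpu_threshold    # e.g. 35% for a 70% threshold
        self.mem_floor = 0.5 * memory_threshold # e.g. 40% for an 80% threshold
        self.polling_interval = polling_interval
        self.required = 2 * polling_interval    # sustained window in seconds
        self.below_for = 0                      # seconds both metrics stayed low

    def tick(self, avg_cpu: float, avg_mem: float) -> bool:
        """Call once per polling interval; True means scale-down may fire."""
        if avg_cpu < self.cpu_floor and avg_mem < self.mem_floor:
            self.below_for += self.polling_interval
        else:
            self.below_for = 0  # any breach resets the sustained clock
        return self.below_for >= self.required
```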

Cooldown Periods

After scaling events, the system waits before making another decision:

  • Scale Up Cooldown: 60 seconds (configurable: 30-600s)
  • Scale Down Cooldown: 120 seconds (configurable: 60-900s)

This prevents rapid scaling oscillations ("flapping").

Example Configurations

Light Traffic API (Development)

Best for development and testing with predictable low traffic.

```yaml
# Manifest
resources:
  cpu_request: "100m"
  memory_request: "256Mi"
  cpu_limit: "500m"
  memory_limit: "512Mi"
```

Autoscaling: Disabled (use fixed instances instead)

  • Fixed Instances: 1
  • Resource Tier: Small

Use Case: Development, testing, personal projects


Production Web App (Moderate Traffic)

Balanced configuration for production apps with moderate traffic.

```yaml
# Manifest
resources:
  cpu_request: "200m"
  memory_request: "512Mi"
  cpu_limit: "1000m"
  memory_limit: "1Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 2
  • Max Replicas: 10
  • CPU Threshold: 70%
  • Memory Threshold: 80%
  • Polling Interval: 30s
  • Scale Up Cooldown: 60s
  • Scale Down Cooldown: 120s

Use Case: Production web apps, REST APIs, dashboards

Expected Behavior:

  • Always 2 instances for high availability
  • Scales to 10 during traffic spikes
  • Handles 5x traffic increase automatically

High-Traffic API (Variable Load)

Aggressive scaling for APIs with unpredictable traffic patterns.

```yaml
# Manifest
resources:
  cpu_request: "500m"
  memory_request: "1Gi"
  cpu_limit: "2000m"
  memory_limit: "2Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 3
  • Max Replicas: 20
  • CPU Threshold: 60%
  • Memory Threshold: 70%
  • Polling Interval: 20s
  • Scale Up Cooldown: 30s
  • Scale Down Cooldown: 60s

Use Case: High-traffic APIs, real-time services, production critical apps

Expected Behavior:

  • Always 3 instances for redundancy
  • Scales aggressively (lower thresholds)
  • Fast response to load changes (20s polling)
  • Can handle 6-7x traffic increases

CPU-Intensive Processing

Optimized for compute-heavy workloads.

```yaml
# Manifest
resources:
  cpu_request: "1000m"
  memory_request: "512Mi"
  cpu_limit: "4000m"
  memory_limit: "1Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 2
  • Max Replicas: 15
  • CPU Threshold: 60% (lower for CPU-heavy apps)
  • Memory Threshold: 80%
  • Polling Interval: 30s

Use Case: Image processing, video transcoding, data analysis


Memory-Intensive Application

Optimized for memory-heavy workloads.

```yaml
# Manifest
resources:
  cpu_request: "200m"
  memory_request: "2Gi"
  cpu_limit: "1000m"
  memory_limit: "4Gi"
```

Autoscaling:

  • Enabled: Yes
  • Min Replicas: 2
  • Max Replicas: 10
  • CPU Threshold: 70%
  • Memory Threshold: 70% (lower to prevent OOM)
  • Polling Interval: 30s

Use Case: In-memory caching, ML inference, data processing

Monitoring Autoscaling

View autoscaling activity in your app details page:

Metrics Dashboard

  • Current Replicas: Real-time instance count
  • CPU Usage: Percentage of allocated CPU across all instances
  • Memory Usage: Percentage of allocated memory across all instances

Best Practices

Traffic Patterns

Predictable Traffic:

  • Use fixed instances instead of autoscaling
  • Right-size resource allocation
  • Lower operational complexity

Unpredictable Traffic:

  • Enable autoscaling with higher max replicas
  • Set lower thresholds for faster response
  • Monitor scaling patterns and adjust

Resource Allocation

CPU-Intensive Apps:

  • Lower CPU threshold (60%)
  • Higher CPU limits
  • Prevents performance degradation

Memory-Intensive Apps:

  • Lower memory threshold (70%)
  • Higher memory limits
  • Prevents OOM errors

Balanced Apps:

  • Standard thresholds (70% CPU, 80% memory)
  • Balanced resource allocation

Cost Optimization

  1. Start Conservative: Begin with low max replicas, increase if needed
  2. Monitor Patterns: Review metrics to find optimal settings
  3. Use Min Replicas Wisely: Higher min = higher baseline cost
  4. Development Environments: Disable autoscaling, use 1 instance
  5. Tune Cooldowns: Adjust scale_up_cooldown (30-600s) and scale_down_cooldown (60-900s) based on your traffic patterns

Troubleshooting

Not Scaling Up

Problem: CPU/memory high but no new instances

Check:

  • Reached max replicas limit?
  • Currently in cooldown period?
  • Thresholds configured correctly?
  • Metrics being collected?

Solution:

```yaml
# Lower thresholds for more aggressive scaling
cpu_threshold: 60%    # Was 70%
memory_threshold: 70% # Was 80%

# Increase max replicas
max_replicas: 20 # Was 10
```

Scaling Flapping

Problem: Instances constantly scaling up and down

Check:

  • Scale-up and scale-down thresholds too close together?
  • Polling interval too short?
  • Resource limits too low?

Solution:

```yaml
# Increase cooldown periods
scale_up_cooldown: 120   # Was 60
scale_down_cooldown: 300 # Was 120

# Increase polling interval
polling_interval: 60 # Was 30
```

Slow Scaling

Problem: Scaling happens too slowly

Check:

  • Polling interval too long?
  • Thresholds too high?
  • Cooldown periods too long?

Solution:

```yaml
# Decrease polling interval
polling_interval: 20 # Was 30

# Lower thresholds
cpu_threshold: 60% # Was 70%

# Reduce cooldown
scale_up_cooldown: 30 # Minimum allowed
```

High Costs

Problem: Too many instances running

Solution:

```yaml
# Reduce min replicas
min_replicas: 1 # Was 3

# Reduce max replicas
max_replicas: 5 # Was 20

# Allow more aggressive scale-down
# (automatic - 50% of thresholds)
```

Next Steps