Cost Optimization
Optimize your AI infrastructure costs with smart deployment strategies, resource configuration, and usage patterns. This guide covers best practices for reducing costs while maintaining performance and reliability.
Understanding AI Model Costs
Self-Hosted Models
Self-hosted models incur compute and storage costs:
- Compute Costs: Pay for GPU/CPU resources while instances are running
- Storage Costs: Pay for model storage and volumes
- Network Costs: Data transfer and load balancing
- Deployment Model: total cost depends on whether the model runs Always On, On Demand, or Scheduled
Third-Party Providers
Third-party providers charge per usage:
- Token-Based Pricing: Pay per input and output token
- Model Tier Pricing: Different rates for different models
- Volume Discounts: Some providers offer bulk pricing
- No Infrastructure Costs: Provider manages infrastructure
Cost Optimization Strategies
1. Choose the Right Deployment Model
On Demand (70-90% Savings)
Best for:
- Development and testing
- Low-traffic applications
- Intermittent workloads
- Prototype and demo applications
How it works:
- Scales to zero when idle (default: 5 minutes)
- Starts on first request (30-60 second cold start)
- Pay only for active time
Configuration:
Deployment: On Demand
Scale-to-Zero Timeout: 5 minutes
Cold Start: 30-60 seconds
Example savings:
- Always On: $1,440/month (24/7 at $2/hour)
- On Demand: $144-288/month (10-20% utilization)
- Savings: $1,152-1,296/month (80-90%)
Scheduled Deployment (Up to 75% Savings)
Best for:
- Business hours only (9am-5pm)
- Batch processing jobs
- Regional availability windows
- Predictable usage patterns
How it works:
- Runs only during specified time windows
- Use cron expressions to define schedule
- Automatically starts and stops
Configuration:
Deployment: Scheduled
Schedule: "0 9-17 * * 1-5" # Weekdays 9am-5pm
# or
Schedule: "0 8 * * *" # Daily at 8am
Example savings:
- Always On: $1,440/month (24/7)
- Scheduled (9am-5pm, weekdays): $352/month (40 hours/week)
- Savings: $1,088/month (75%)
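The figures in both examples come from the same arithmetic. A minimal sketch, using the illustrative $2/hour rate from this guide rather than real platform pricing:

```javascript
// Hypothetical rates and utilization figures, matching the examples above.
const HOURLY_RATE = 2.0;      // $/hour for one instance
const HOURS_PER_MONTH = 720;  // 30 days x 24 hours

// Always On: billed for every hour of the month.
const alwaysOn = HOURLY_RATE * HOURS_PER_MONTH;  // $1,440

// On Demand: billed only for active hours (10-20% utilization).
const onDemandLow = alwaysOn * 0.10;             // $144
const onDemandHigh = alwaysOn * 0.20;            // $288

// Scheduled: billed for the window only (8h/day x ~22 weekdays).
const scheduled = HOURLY_RATE * 8 * 22;          // $352

console.log({ alwaysOn, onDemandLow, onDemandHigh, scheduled });
```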
Always On
Best for:
- Production applications with consistent traffic
- Low-latency requirements
- High availability needs
Trade-offs:
- No cold starts
- Instant response
- Fixed costs regardless of usage
2. Right-Size Your Resources
Match GPU to Model Size
| Model Size | Recommended GPU | Monthly Cost | Oversized GPU | Wasted Cost |
|---|---|---|---|---|
| 7B params | 1x T4 (16GB) | $250 | 1x A100 (40GB) | $950 |
| 13B params | 1x A10G (24GB) | $450 | 1x A100 (40GB) | $750 |
| 70B params | 4x A100 (40GB) | $4,800 | 8x A100 (80GB) | $4,800 |
Best practice: Use the smallest GPU that fits your model
Use Quantization
Reduce model size and GPU requirements with quantization:
| Model | Full Precision | 8-bit Quantized | 4-bit Quantized |
|---|---|---|---|
| 70B Model | 4x A100 (160GB) | 2x A100 (80GB) | 1x A100 (40GB) |
| Monthly Cost | $4,800 | $2,400 | $1,200 |
| Quality Loss | 0% | Less than 2% | 2-5% |
Savings: 50-75% with minimal quality impact
3. Optimize Autoscaling Configuration
Conservative Scaling (Avoid Over-Scaling)
Problem: Aggressive autoscaling wastes money in two ways:
- Scaling up on temporary spikes that would pass on their own
- Flapping: scaling down too eagerly, then immediately paying to scale back up
Solution:
CPU Threshold: 75% # Higher threshold
Memory Threshold: 85% # Higher threshold
Scale Down Cooldown: 300s # Longer cooldown prevents flapping
Right-Size Min/Max Replicas
Problem: Min replicas too high = wasted capacity
Solution:
- Development: Min=0 or 1, Max=3
- Production (moderate): Min=2, Max=10
- Production (high-traffic): Min=3, Max=20
Cost impact:
- Min replicas=5 (excessive): $6,000/month
- Min replicas=2 (appropriate): $2,400/month
- Savings: $3,600/month (60%)
4. Optimize Token Usage
Reduce Prompt Length
Problem: Long prompts increase token costs
Solution:
- Remove unnecessary context
- Use concise instructions
- Implement prompt templates
- Cache common context
Example:
- Verbose prompt: 500 tokens
- Optimized prompt: 150 tokens
- Savings: 70% on input tokens
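Prompt templates are the easiest of these techniques to standardize. A minimal sketch, where the function name, instruction text, and context budget are all illustrative:

```javascript
// Build a compact prompt from a fixed instruction plus trimmed context.
function buildPrompt(question, context = '') {
  // One short instruction instead of verbose boilerplate.
  const instruction = 'Answer concisely using only the context provided.';
  // Send only as much context as the task needs, not everything you have.
  const trimmed = context.slice(0, 1000);
  return `${instruction}\n\nContext: ${trimmed}\n\nQuestion: ${question}`;
}

const policyDoc = '...'; // stands in for a long retrieved document
console.log(buildPrompt('What is the refund policy?', policyDoc));
```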
Limit Output Tokens
Problem: Unlimited outputs can generate excessive tokens
Solution:
- Set the max_tokens parameter
- Request concise responses
- Use streaming to stop generation early
Example:
```javascript
{
  max_tokens: 500,   // cap response length instead of allowing unlimited output
  temperature: 0.7,
}
```
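For the streaming bullet above, the client can close the stream once it has enough output; most providers stop generating (and billing) when the connection closes. A sketch assuming an OpenAI-compatible streaming endpoint; the URL and payload shape are assumptions, so check your provider's docs:

```javascript
// Read a streamed completion and cancel early after maxChars of raw output.
async function streamWithCutoff(prompt, maxChars = 2000) {
  const res = await fetch('https://api.example.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-3.5-turbo',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 500, // server-side cap as a second line of defense
      stream: true,
    }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let raw = '';
  while (raw.length < maxChars) {
    const { done, value } = await reader.read();
    if (done) break;
    // Raw SSE chunks; event parsing is omitted for brevity.
    raw += decoder.decode(value, { stream: true });
  }
  await reader.cancel(); // close the stream to halt further generation
  return raw;
}
```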
Implement Response Caching
Problem: Repeated identical requests waste tokens
Solution:
- Cache frequent queries
- Use Redis or similar
- Set appropriate TTL
Savings: 50-80% for repeated queries
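A minimal in-process sketch of the pattern; a production setup would back it with Redis and a real TTL as noted above, and `callModel` stands in for your existing model call:

```javascript
// Cache responses keyed on the exact prompt, with a simple TTL.
const cache = new Map();
const TTL_MS = 60 * 60 * 1000; // 1 hour; tune per use case

async function cachedCompletion(prompt, callModel) {
  const hit = cache.get(prompt);
  if (hit && Date.now() - hit.at < TTL_MS) {
    return hit.value; // cache hit: zero tokens billed
  }
  const value = await callModel(prompt); // cache miss: pay once
  cache.set(prompt, { value, at: Date.now() });
  return value;
}
```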
5. Choose Cost-Effective Models
Model Tier Selection
| Task Complexity | Recommended Model | Cost per 1M Tokens | Overkill Model | Wasted Cost |
|---|---|---|---|---|
| Simple classification | GPT-3.5 Turbo | $1.50 | GPT-4 | $30 |
| General Q&A | Claude Haiku | $0.80 | Claude Opus | $60 |
| Basic completion | Mistral-7B (self-hosted) | $0.20 | Llama-70B | $2.00 |
Best practice: Use smallest model that meets quality requirements
Self-Hosted vs Third-Party Cost Comparison
Low Volume (< 1M tokens/month):
- Third-party: $1.50 (GPT-3.5)
- Self-hosted on-demand: $50-100
- Winner: Third-party
Medium Volume (10M tokens/month):
- Third-party: $15 (GPT-3.5)
- Self-hosted scheduled: $350
- Winner: Third-party
High Volume (100M tokens/month):
- Third-party: $150 (GPT-3.5)
- Self-hosted always-on: $1,440
- Winner: Third-party
Very High Volume (1B+ tokens/month):
- Third-party: $1,500 (GPT-3.5)
- Self-hosted optimized: $2,000-3,000
- Winner: Depends on optimization
For most use cases, third-party providers are more cost-effective until you reach very high volumes (1B+ tokens/month). Self-hosted becomes competitive only with proper optimization and sustained high usage.
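You can estimate your own tipping point by dividing fixed monthly infrastructure cost by the per-token price. A sketch using the illustrative figures above:

```javascript
// Monthly token volume at which self-hosting matches per-token pricing.
function breakEvenTokens(selfHostedMonthlyUSD, pricePer1MTokens) {
  return (selfHostedMonthlyUSD / pricePer1MTokens) * 1e6;
}

// $2,000/month optimized self-hosting vs GPT-3.5 at $1.50 per 1M tokens:
console.log(breakEvenTokens(2000, 1.5)); // ≈ 1.33 billion tokens/month
```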
6. Implement Smart Routing
Route by Task Complexity
Strategy: Use different models for different task types
```javascript
// Route each request to the cheapest model that can handle it
// (prices per 1M tokens from the table above).
function selectModel(task) {
  if (task.complexity === 'simple') {
    return 'gpt-3.5-turbo'; // $1.50 per 1M tokens
  } else if (task.complexity === 'medium') {
    return 'claude-haiku'; // $0.80 per 1M tokens
  }
  return 'gpt-4'; // $30 per 1M tokens
}
```
Savings: 50-80% by avoiding overkill models
Fallback to Cheaper Models
Strategy: Try cheaper model first, fallback to expensive if needed
```javascript
// Try the cheap model first; escalate only when quality looks insufficient.
// `confidence` is a placeholder for whatever quality signal you have
// (a judge model, heuristics, or a provider-reported score).
async function getResponse(prompt) {
  const response = await cheapModel(prompt);
  if (response.confidence < 0.8) {
    return await expensiveModel(prompt); // fall back to the stronger model
  }
  return response;
}
```
Savings: 30-60% while maintaining quality
7. Monitor and Optimize Continuously
Track Cost Metrics
Key metrics to monitor:
- Cost per request
- Cost per 1K tokens
- Cost per user
- Cost by model
- Cost by provider
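These unit metrics are simple ratios over your usage logs. A sketch with illustrative field names and numbers:

```javascript
// Derive unit costs from aggregate usage data.
function unitCosts({ totalCostUSD, requests, tokens }) {
  return {
    costPerRequest: totalCostUSD / requests,
    costPer1kTokens: (totalCostUSD / tokens) * 1000,
  };
}

console.log(unitCosts({ totalCostUSD: 450, requests: 90000, tokens: 50e6 }));
// { costPerRequest: 0.005, costPer1kTokens: 0.009 }
```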
Set Budget Alerts
Configure alerts:
- Daily spend threshold
- Monthly budget warning (80% of limit)
- Anomaly detection (2x normal daily spend)
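A minimal sketch of these checks, assuming you can already query spend figures from your billing data; the thresholds match the bullets above:

```javascript
// Evaluate budget-alert conditions against current spend figures.
function checkBudget({ todaySpend, monthSpend, monthlyBudget, avgDailySpend }) {
  const alerts = [];
  if (monthSpend >= 0.8 * monthlyBudget) {
    alerts.push('WARNING: 80% of monthly budget consumed');
  }
  if (todaySpend >= 2 * avgDailySpend) {
    alerts.push('ANOMALY: daily spend is 2x the normal level');
  }
  return alerts;
}

console.log(checkBudget({
  todaySpend: 210, monthSpend: 4100, monthlyBudget: 5000, avgDailySpend: 100,
})); // both alerts fire
```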
Regular Cost Reviews
Weekly:
- Review top 5 most expensive models
- Identify optimization opportunities
- Check for unused or idle models
Monthly:
- Analyze usage trends
- Adjust autoscaling configurations
- Evaluate model selection
8. Batch Processing
Batch Similar Requests
Problem: Individual requests have overhead
Solution:
- Group similar requests
- Process in batches
- Share context across requests
Savings: 20-40% through reduced overhead
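One common implementation is a short collection window that flushes queued prompts as a single call. A sketch where `batchModelCall` is a placeholder for however your provider or gateway accepts batched requests:

```javascript
// Collect requests for a short window, then send them as one batch.
const queue = [];
const WINDOW_MS = 200;
let timer = null;

function batched(prompt, batchModelCall) {
  return new Promise((resolve) => {
    queue.push({ prompt, resolve });
    if (!timer) {
      timer = setTimeout(async () => {
        const batch = queue.splice(0); // take everything queued so far
        timer = null;
        // One call with shared overhead instead of N separate calls.
        const results = await batchModelCall(batch.map((b) => b.prompt));
        batch.forEach((b, i) => b.resolve(results[i]));
      }, WINDOW_MS);
    }
  });
}
```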
Off-Peak Processing
Problem: On-demand processing during peak hours
Solution:
- Queue non-urgent requests
- Process during off-peak hours
- Use scheduled deployment
Savings: Leverage scheduled deployment savings
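A minimal sketch of the queue-and-defer pattern; the off-peak window and check interval are illustrative, and this pairs naturally with a Scheduled deployment covering the same hours:

```javascript
// Defer non-urgent jobs and drain them only during off-peak hours.
const deferred = [];

function submit(job, { urgent = false } = {}) {
  if (urgent) return job(); // run immediately
  deferred.push(job);       // otherwise wait for the off-peak window
}

function isOffPeak(date = new Date()) {
  const hour = date.getHours();
  return hour >= 22 || hour < 6; // 10pm-6am, adjust to your traffic
}

// Check once a minute and process queued work while off-peak.
setInterval(async () => {
  while (isOffPeak() && deferred.length > 0) {
    await deferred.shift()();
  }
}, 60000);
```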
Cost Optimization Checklist
Development Phase
- Use on-demand deployment
- Start with small models (7B parameters)
- Use third-party providers for prototyping
- Set max_tokens limits
- Disable autoscaling (use 1 instance)
Testing Phase
- Implement response caching
- Test with quantized models
- Optimize prompt templates
- Set up cost monitoring
- Use scheduled deployment for batch tests
Production Phase
- Right-size GPU resources
- Configure autoscaling appropriately
- Implement smart routing
- Set budget alerts
- Use model tier appropriate for tasks
- Monitor cost per request
- Review and optimize monthly
Cost Optimization ROI Calculator
Example Scenario: API with 50M tokens/month
Before Optimization:
Model: one large model for every request (self-hosted)
Deployment: Always On
GPU: 4x A100 (oversized)
Monthly Cost: $5,000
After Optimization:
Model: GPT-3.5 Turbo (for simple tasks) + GPT-4 (for complex tasks)
Deployment: On Demand
Smart Routing: 70% simple, 30% complex
Monthly Cost: $500 (third-party) or $1,200 (self-hosted on-demand)
Savings: $3,800-4,500/month (75-90%)
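The after-optimization figure follows directly from the routing split and the per-token prices used earlier. A sketch of the arithmetic:

```javascript
// Blended third-party cost for 50M tokens/month with smart routing.
const TOKENS = 50e6;
const SIMPLE_SHARE = 0.7;  // GPT-3.5 Turbo at $1.50 per 1M tokens
const COMPLEX_SHARE = 0.3; // GPT-4 at $30 per 1M tokens

const monthlyCost =
  (TOKENS * SIMPLE_SHARE / 1e6) * 1.5 +   // $52.50
  (TOKENS * COMPLEX_SHARE / 1e6) * 30;    // $450.00

console.log(monthlyCost); // 502.5, i.e. roughly the $500 figure above
```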
Common Cost Pitfalls
1. Forgotten Always-On Models
Problem: Models left running 24/7 unnecessarily
Solution: Audit all deployments monthly, switch to on-demand if appropriate
Cost impact: $1,440/month per forgotten model
2. Oversized GPU Selection
Problem: Using A100 for 7B model that fits on T4
Solution: Match GPU to model requirements
Cost impact: $950/month wasted per model
3. No Token Limits
Problem: Unlimited output generation
Solution: Set max_tokens based on use case
Cost impact: 2-10x higher token costs
4. Autoscaling to Max Unnecessarily
Problem: Aggressive autoscaling hitting max replicas
Solution: Tune thresholds and cooldown periods
Cost impact: Thousands per month in unnecessary scaling
5. Wrong Provider for Volume
Problem: Using self-hosted for low volume
Solution: Use third-party until reaching cost tipping point
Cost impact: 5-10x higher costs at low volume
Next Steps
- Monitor your costs with real-time analytics
- Configure autoscaling for optimal resource usage
- Review deployment options for your use case
- Set up budget alerts in the FinOps dashboard