Monitoring Models
Monitor model usage, performance, and costs with real-time analytics. The AI Gateway provides a comprehensive analytics dashboard for tracking model performance, identifying issues, and optimizing spend.
Analytics Dashboard
Access the analytics dashboard to view metrics across all your deployed models.
Summary Metrics
High-level metrics displayed at the top of the dashboard:
- Total Requests: Aggregate count of all API calls across models
- Total Tokens: Sum of input + output tokens processed
- Average Response Time: Mean latency across all requests
- Total Cost: Cumulative spend across all models and providers
Time-Series Charts
Requests Over Time
Line chart showing request volume by hour (24h view) or day (7d/30d/90d views)
Use cases:
- Identify traffic patterns and peak usage times
- Detect anomalies or unexpected traffic spikes
- Plan capacity for expected load
Token Usage Over Time
Line chart showing token consumption trends over selected date range
Use cases:
- Track token consumption trends
- Forecast future usage and costs (a simple forecast sketch follows this list)
- Identify models with highest token usage
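One simple way to act on these trends is to fit a line to recent daily token totals and project it forward. A minimal sketch (Python 3.10+; the daily figures are hypothetical stand-ins for values read off the token-usage chart or CSV export):

```python
# Minimal sketch: project the next 30 days of token usage from recent
# daily totals. The numbers below are hypothetical.
from statistics import linear_regression

daily_tokens = [1_200_000, 1_350_000, 1_280_000, 1_500_000,
                1_420_000, 1_610_000, 1_580_000]  # last 7 days

days = list(range(len(daily_tokens)))
slope, intercept = linear_regression(days, daily_tokens)

# Sum the fitted line over the next 30 days for a monthly estimate.
forecast_30d = sum(slope * d + intercept
                   for d in range(len(daily_tokens), len(daily_tokens) + 30))
print(f"Projected tokens over the next 30 days: {forecast_30d:,.0f}")
```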
Cost Trends
Track spending over time to forecast monthly costs
Use cases:
- Monitor spending patterns
- Set budget alerts
- Identify cost optimization opportunities
Provider Distribution
Doughnut chart showing request distribution across providers (OpenAI, Anthropic, Self-Hosted, etc.)
Use cases:
- Understand provider usage mix
- Identify opportunities to consolidate providers
- Track provider reliability
Model Details Table
Per-model analytics with sortable columns:
| Metric | Description |
|---|---|
| Total Requests | Number of API calls to this model |
| Total Tokens | Input + output tokens processed (affects cost) |
| Input/Output Tokens | Separate breakdown for prompt vs completion tokens |
| Avg Response Time | Mean latency for model requests |
| P95 Response Time | 95th percentile latency; 5% of requests are slower than this |
| P99 Response Time | 99th percentile latency; 1% of requests are slower than this |
| Error Rate | Percentage of failed requests |
| Success Rate | Percentage of successful requests |
| Total Cost | Cumulative spend for this model |
| 24h/7d/30d Metrics | Rolling window stats for requests, tokens, cost |
Understanding Latency Metrics
- Average Response Time: Good for understanding typical performance
- P95 Response Time: The time within which 95% of requests complete
- P99 Response Time: The time within which 99% of requests complete
Monitor P95 and P99 latency to ensure consistent user experience. High P95/P99 values indicate performance issues affecting some users even if average latency is acceptable.
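To make the distinction concrete, here is a minimal sketch that computes the average and nearest-rank percentiles from per-request response times (the sample values are hypothetical). Two slow outliers barely move the average but dominate the tail:

```python
# Minimal sketch: average vs. tail-percentile latency over hypothetical samples.
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value that p% of samples fall at or below."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 135, 110, 980, 140, 125, 2400, 130, 145, 118]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"avg={avg:.0f}ms p95={percentile(latencies_ms, 95)}ms "
      f"p99={percentile(latencies_ms, 99)}ms")
# avg=440ms p95=2400ms p99=2400ms -- the average hides the slow tail
```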
Provider Statistics
Aggregated metrics by provider (a roll-up sketch follows this list):
- Model Count: Number of models from this provider
- Total Requests: Aggregate requests across all models
- Total Tokens: Aggregate token consumption
- Average Latency: Mean response time for provider
- Total Cost: Cumulative spend with provider
- Success Rate: Percentage of successful requests across all models
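These provider statistics are per-model metrics rolled up by provider. A sketch of the aggregation, using hypothetical model records shaped like rows of the model details table:

```python
# Minimal sketch: roll per-model stats up to provider-level totals.
# The records below are hypothetical.
models = [
    {"provider": "OpenAI",    "requests": 12_000, "errors": 60,  "cost": 340.0},
    {"provider": "OpenAI",    "requests": 4_000,  "errors": 10,  "cost": 95.0},
    {"provider": "Anthropic", "requests": 8_000,  "errors": 160, "cost": 410.0},
]

providers: dict[str, dict[str, float]] = {}
for m in models:
    agg = providers.setdefault(m["provider"],
                               {"requests": 0, "errors": 0, "cost": 0.0})
    agg["requests"] += m["requests"]
    agg["errors"] += m["errors"]
    agg["cost"] += m["cost"]

for name, agg in providers.items():
    success = 100 * (1 - agg["errors"] / agg["requests"])
    print(f"{name}: {agg['requests']:,.0f} requests, "
          f"{success:.1f}% success, ${agg['cost']:.2f} total")
```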
User Analytics (Admin Only)
Track usage per user for cost allocation and monitoring:
Per-User Metrics
- Requests: Total API calls by user
- Tokens: Total tokens consumed by user
- Cost: Total spend attributed to user
- Avg Response Time: Mean latency for user's requests
- Success/Error Rates: Request success and failure percentages
Usage Patterns
- Model Usage: Unique models accessed by user
- Providers Used: Which providers the user accesses
- Guardrail Violations: Count of requests blocked by content policies
Activity Tracking
- First Request: When user first accessed the gateway
- Last Request: Most recent activity timestamp
- Days Active: Number of days with at least one request
- Avg Requests per Day: Average daily usage
Cost Metrics
- Cost per Request: Average cost per API call
- Cost per 1K Tokens: Normalized cost metric (a worked example follows this list)
- Max Tokens in Single Request: Largest request by token count
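The normalized metrics above are simple ratios over a user's period totals. A worked example with hypothetical figures:

```python
# Minimal sketch: derive the normalized cost metrics from period totals.
# All figures are hypothetical.
total_cost = 42.50        # USD spent by the user in the period
total_requests = 1_700
total_tokens = 5_100_000

cost_per_request = total_cost / total_requests            # $0.0250
cost_per_1k_tokens = total_cost / (total_tokens / 1_000)  # $0.0083

print(f"cost/request: ${cost_per_request:.4f}")
print(f"cost/1K tokens: ${cost_per_1k_tokens:.4f}")
```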
Guardrails Monitoring
Platform-Wide Guardrail Metrics
- Total Models: Number of deployed models (self-hosted + third-party)
- Protected Models: Count of models with guardrails enabled
- Total Rules: Aggregate count of all active guardrail rules
- Blocked Today: Requests blocked by guardrails in last 24 hours
- Modified Today: Requests modified by guardrails in last 24 hours
- Active Alerts: Current policy violations requiring attention
Per-Model Guardrail Metrics
Each model tracks:
- Enabled Status: Whether guardrails are active for this model
- Rules Count: Total number of configured rules
- Input Rules: Rules applied to user prompts before model inference
- Output Rules: Rules applied to model responses before returning to user
- Both Rules: Rules that apply to both input and output
- Total Requests: Number of API calls processed
- Blocked Requests: Requests denied due to policy violations
- Modified Requests: Requests altered by guardrails (e.g., PII redaction)
- Block Rate: Percentage of requests blocked
- Last Triggered: Timestamp of most recent guardrail activation
Guardrail Rule Types
Monitor specific rule categories:
Content Filtering
- Toxicity Detection: Blocks toxic, offensive, or harmful content
  - Categories: Hate speech, harassment, violence, profanity, sexual content
  - Threshold levels: Low (permissive), Medium (balanced), High (strict)
- PII Detection: Identifies and redacts personally identifiable information (a redaction sketch follows this list)
  - Detects: Email addresses, phone numbers, SSNs, credit card numbers, IP addresses, names, addresses, dates of birth
  - Actions: Redact, mask, or block the entire request
- Prompt Injection Detection: Detects attempts to manipulate model behavior
  - Patterns: "Ignore previous instructions", system prompt extraction, role confusion
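As a sketch of the PII Redact action, the snippet below swaps matched spans for typed placeholders. Production systems typically rely on trained entity recognizers; the two regex patterns here (email, US SSN) are illustrative assumptions only:

```python
# Minimal sketch of regex-based PII redaction (the "Redact" action).
# These two patterns are illustrative, not production-grade detectors.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [EMAIL REDACTED], SSN [SSN REDACTED].
```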
Topic Restrictions
- Allowed Topics: Restricts the model to specific subject areas (an allowlist sketch follows this list)
  - Configuration: Whitelist of allowed topics
  - Action: Reject off-topic queries with a helpful message
- Banned Topics: Blocks specific prohibited subjects
  - Configuration: Blacklist of topics
  - Action: Reject the query and suggest alternative resources
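Topic restriction reduces to "classify the query, then accept or reject with guidance". The sketch below stands in a naive keyword match for the classifier (an assumption; real deployments use a topic model) to show the control flow:

```python
# Minimal sketch of an allowed-topics check. The keyword match is a
# placeholder for a real topic classifier.
ALLOWED_TOPICS = {"billing", "shipping", "returns"}

def check_topic(prompt: str) -> tuple[bool, str]:
    """Return (allowed, rejection_message) for a user prompt."""
    if any(topic in prompt.lower() for topic in ALLOWED_TOPICS):
        return True, ""
    return False, ("I can only help with billing, shipping, or returns. "
                   "Please rephrase your question.")

allowed, message = check_topic("Where is my shipping confirmation?")
print(allowed, repr(message))  # True ''
```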
Rate Limiting
- Request Rate Limits: Limits requests per user/API key per time window (a sliding-window sketch follows this list)
  - Configuration: Requests per minute/hour/day
  - Scope: Per user, per API key, per IP address, global
- Token Limits: Limits total tokens consumed per user per period
  - Configuration: Max tokens per day/week/month
  - Tracking: Real-time token usage dashboard
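A common way to implement per-key request limits is a sliding window over recent request timestamps. This in-memory sketch shows the idea; a production gateway would typically back it with a shared store such as Redis (an assumption, not a documented detail of this gateway):

```python
# Minimal sketch: sliding-window rate limiting per API key, in memory.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100  # requests per minute per API key

_history: dict[str, deque] = defaultdict(deque)

def allow_request(api_key: str) -> bool:
    """Return True if the key is under its limit; record the request."""
    now = time.monotonic()
    window = _history[api_key]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # over the limit: the gateway would answer HTTP 429
    window.append(now)
    return True
```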
Output Validation
- Hallucination Detection: Flags responses that may contain fabricated information
  - Method: Confidence scoring, fact-checking, consistency checks
  - Action: Add a disclaimer, request regeneration, or block the response
- Format Enforcement: Ensures output matches a required structure (a parse-and-retry sketch follows this list)
  - Formats: JSON schema, XML, markdown, specific patterns
  - Action: Retry generation or return an error
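Format enforcement with the Retry action can be as simple as parse, validate, retry. In this sketch, generate() is a hypothetical stand-in for whatever call produces the model response:

```python
# Minimal sketch: retry generation until the output parses as JSON.
# generate is any zero-argument callable returning the model's raw text.
import json

def enforce_json(generate, max_attempts: int = 3) -> dict:
    """Return parsed JSON output, retrying on parse failure."""
    last_output = ""
    for _ in range(max_attempts):
        last_output = generate()
        try:
            return json.loads(last_output)
        except json.JSONDecodeError:
            continue  # the "Retry generation" action
    # The "return error" action once retries are exhausted.
    raise ValueError(f"Model never produced valid JSON: {last_output!r}")
```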
Filters & Options
Customize your analytics view:
Date Range
- Last 24 hours
- Last 7 days
- Last 30 days
- Last 90 days
- Custom date range
Provider Filter
View all providers or filter by:
- OpenAI
- Anthropic
- Mistral
- Cohere
- Self-Hosted
Export Options
- Download CSV: Export model analytics for reporting
- API Access: Programmatic access to metrics via REST API (a fetch-and-export sketch follows this list)
- Webhook Integration: Real-time alerts for anomalies
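For example, a reporting script might pull per-model metrics from the REST API and write them to CSV. The endpoint path, query parameter, field names, and auth header below are assumptions; substitute your gateway's actual API:

```python
# Minimal sketch: fetch model analytics and export to CSV.
# URL, endpoint, and field names are hypothetical.
import csv
import json
import urllib.request

GATEWAY_URL = "https://gateway.example.com"   # hypothetical
API_KEY = "YOUR_API_KEY"

req = urllib.request.Request(
    f"{GATEWAY_URL}/api/v1/analytics/models?range=7d",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
)
with urllib.request.urlopen(req) as resp:
    models = json.load(resp)

with open("model_analytics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "requests", "tokens", "cost"])
    writer.writeheader()
    for m in models:
        writer.writerow({k: m.get(k) for k in writer.fieldnames})
```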
Auto-Refresh
- Enabled: Data updates automatically every 30 seconds
- Disabled: Manual refresh required
Alerts and Notifications
Set up alerts for important events (a minimal evaluator sketch follows these categories):
Performance Alerts
- High Latency: Alert when P95 latency exceeds threshold
- Error Rate: Alert when error rate exceeds percentage
- Availability: Alert when model becomes unavailable
Cost Alerts
- Budget Threshold: Alert when spending exceeds budget
- Anomaly Detection: Alert on unusual spending patterns
- Daily Limits: Alert when approaching daily token limits
Guardrail Alerts
- High Block Rate: Alert when a model's block rate exceeds a set threshold
- Policy Violations: Alert on specific policy violations
- Security Events: Alert on prompt injection attempts
Capacity Alerts
- High Utilization: Alert when CPU/Memory exceeds threshold
- Max Replicas: Alert when autoscaling hits max replicas
- Queue Depth: Alert when request queue grows too large
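However alerts are delivered, the core is a threshold check over polled metrics. The metric names and rules in this sketch are assumptions chosen to mirror the alert types above:

```python
# Minimal sketch: evaluate threshold-based alert rules over polled metrics.
# Metric names are hypothetical; map them to your gateway's metrics API.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str       # e.g. "p95_latency_ms", "error_rate_pct"
    threshold: float
    message: str

RULES = [
    AlertRule("p95_latency_ms", 2000.0, "P95 latency above 2s"),
    AlertRule("error_rate_pct", 5.0, "Error rate above 5%"),
    AlertRule("daily_cost_usd", 500.0, "Daily spend above budget"),
]

def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return a message for every rule whose threshold is exceeded."""
    return [r.message for r in RULES
            if metrics.get(r.metric, 0.0) > r.threshold]

print(evaluate({"p95_latency_ms": 2450.0, "error_rate_pct": 1.2}))
# ['P95 latency above 2s']
```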
Best Practices
Regular Monitoring
- Daily: Check summary metrics and cost trends
- Weekly: Review model performance and optimize configurations
- Monthly: Analyze usage patterns and forecast future needs
Performance Optimization
- Identify Slow Models: Sort by P95/P99 latency
- Analyze Error Rates: Investigate models with high error rates
- Optimize Token Usage: Review input/output token ratios
- Scale Appropriately: Adjust autoscaling based on usage patterns
Cost Management
- Track Spending: Monitor costs daily to avoid surprises
- Identify High-Cost Models: Sort by total cost
- Optimize Model Selection: Use smaller models for simple tasks
- Review Provider Mix: Compare costs across providers
Security Monitoring
- Review Guardrail Activity: Check blocked and modified requests
- Investigate Anomalies: Look for unusual traffic patterns
- Monitor User Activity: Track per-user usage for abuse detection
- Update Rules: Adjust guardrails based on observed patterns
Troubleshooting
High Error Rates
Possible causes:
- Model overloaded (needs more resources)
- API key issues with third-party provider
- Network connectivity problems
- Invalid request formats
Solutions:
- Enable autoscaling or add more instances
- Verify API key validity
- Check network connectivity
- Review request logs for formatting issues
High Latency
Possible causes:
- Insufficient compute resources
- Cold starts (on-demand deployment)
- Large input/output token counts
- Network latency
Solutions:
- Scale up resources or enable autoscaling
- Use "Always On" deployment
- Optimize prompts to reduce token usage
- Deploy in regions closer to users
Unexpected Costs
Possible causes:
- Autoscaling to max replicas
- High token usage from verbose prompts/responses
- Forgotten "Always On" models
- Inefficient model selection
Solutions:
- Review autoscaling configuration
- Optimize prompts and set max_tokens limits (a request sketch follows this list)
- Audit deployed models and stop unused ones
- Use smaller models for simpler tasks
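For the max_tokens fix, the request below caps completion length. It assumes the gateway exposes an OpenAI-compatible chat completions endpoint; the URL, model name, and key are placeholders:

```python
# Minimal sketch: cap completion length with max_tokens. Assumes an
# OpenAI-compatible endpoint; URL, model, and key are placeholders.
import json
import urllib.request

payload = {
    "model": "my-model",  # hypothetical model name
    "messages": [{"role": "user", "content": "Summarize our Q3 report."}],
    "max_tokens": 256,    # hard cap on completion tokens per request
}

req = urllib.request.Request(
    "https://gateway.example.com/v1/chat/completions",  # hypothetical URL
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```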
Guardrails Over-Blocking
Possible causes:
- Thresholds too strict
- False positives in content detection
- Overly broad topic restrictions
Solutions:
- Adjust sensitivity thresholds
- Review blocked requests to identify patterns
- Refine allowed/banned topic lists
- Add exemptions for legitimate use cases
Next Steps
- Optimize costs based on usage insights
- Configure autoscaling based on traffic patterns
- Learn about deployment options to improve performance