Monitoring Models
Monitor model usage, performance, and costs with analytics. The AI Gateway provides comprehensive analytics endpoints to track model performance, identify issues, and optimize costs.
Analytics Endpoints
The gateway exposes analytics data through REST API endpoints under /api/v1/analytics/:
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/analytics/models | GET | Model-level analytics with rolling metrics |
| /api/v1/analytics/timeseries | GET | Time-series data for charts |
| /api/v1/analytics/providers | GET | Statistics grouped by provider |
| /api/v1/analytics/costs | GET | Cost breakdown analysis |
| /api/v1/analytics/usage | GET | Detailed usage statistics |
| /api/v1/analytics/performance | GET | Performance metrics (latency, throughput) |
| /api/v1/analytics/users | GET | Per-user analytics (admin/developer views) |
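All of these endpoints can be called like any other REST resource. A minimal Python sketch, assuming bearer-token authentication and a gateway at `http://localhost:8080` (adjust both for your deployment; the helper names are illustrative, not part of the gateway):

```python
import json
import urllib.parse
import urllib.request

def build_analytics_url(endpoint, base_url="http://localhost:8080", **params):
    """Build the query URL for an analytics endpoint."""
    query = urllib.parse.urlencode(params)
    return f"{base_url}/api/v1/analytics/{endpoint}?{query}"

def fetch_analytics(endpoint, token, **params):
    """GET an analytics endpoint and decode the JSON payload."""
    url = build_analytics_url(endpoint, **params)
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Example (requires a running gateway):
# stats = fetch_analytics("models", token="YOUR_API_KEY", range="7d", provider="openai")
```

The same pattern works for every endpoint in the table; only the path segment and query parameters change.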
Common Query Parameters
All analytics endpoints support:
| Parameter | Default | Options | Description |
|---|---|---|---|
| range | 30d | 24h, 7d, 30d, 90d | Date range filter |
| provider | all | Any provider name | Filter by provider |
Model Analytics
GET /api/v1/analytics/models?range=30d&provider=all
Per-model metrics with rolling window statistics:
| Metric | Description |
|---|---|
| Total Requests | Number of API calls to this model |
| Total Tokens | Input + output tokens processed (affects cost) |
| Input/Output Tokens | Separate breakdown for prompt vs completion tokens |
| Avg Response Time | Mean latency for model requests |
| P95 Response Time | 95th percentile latency (only the slowest 5% of requests exceed this) |
| P99 Response Time | 99th percentile latency (only the slowest 1% of requests exceed this) |
| Error Rate | Percentage of failed requests |
| Success Rate | Percentage of successful requests |
| Total Cost | Cumulative spend for this model |
| 24h/7d/30d Metrics | Rolling window stats for requests, tokens, cost |
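The rolling-window columns are straightforward aggregations over a trailing time window. A sketch of that computation, assuming usage records carry `timestamp`, `tokens`, and `cost` fields (the field names are illustrative):

```python
from datetime import datetime, timedelta, timezone

def window_totals(records, hours, now=None):
    """Aggregate requests, tokens, and cost over a rolling window ending at `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    recent = [r for r in records if r["timestamp"] >= cutoff]
    return {
        "requests": len(recent),
        "tokens": sum(r["tokens"] for r in recent),
        "cost": sum(r["cost"] for r in recent),
    }

now = datetime(2024, 5, 31, 12, 0, tzinfo=timezone.utc)
records = [
    {"timestamp": now - timedelta(hours=2), "tokens": 1_000, "cost": 0.02},
    {"timestamp": now - timedelta(days=3), "tokens": 5_000, "cost": 0.10},
]
last_24h = window_totals(records, hours=24, now=now)
# Only the 2-hour-old record falls inside the 24h window
```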
Understanding Latency Metrics
- Average Response Time: Good for understanding typical performance, but hides tail latency
- P95 Response Time: 95% of requests complete within this time
- P99 Response Time: 99% of requests complete within this time
Monitor P95 and P99 latency to ensure a consistent user experience. High P95/P99 values indicate performance issues affecting some users even when average latency looks acceptable.
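Why the tail matters is easiest to see with numbers. A minimal nearest-rank percentile sketch (one common definition; the gateway's exact method may differ):

```python
def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample that at least `pct`%
    of all samples are less than or equal to."""
    ordered = sorted(latencies_ms)
    # ceil(pct/100 * n) via negative floor division gives the 1-based rank
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

# Eight fast requests and two slow outliers
latencies = [120, 130, 125, 900, 140, 135, 128, 132, 1500, 138]
avg = sum(latencies) / len(latencies)   # ~345 ms: looks tolerable
p95 = percentile(latencies, 95)         # 1500 ms: the outliers dominate
```

Here the average (~345 ms) looks acceptable while P95 reveals that some users wait 1.5 s, which is exactly the gap these metrics exist to expose.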
Time-Series Data
GET /api/v1/analytics/timeseries?range=30d&granularity=daily
Returns time-bucketed data for charts:
- Hourly buckets for 24h range
- Daily buckets for 7d/30d/90d ranges
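The bucketing above amounts to truncating each request timestamp to its hour or day. A sketch of that grouping (illustrative, not the gateway's internal code):

```python
from collections import Counter
from datetime import datetime, timezone

def bucket_key(ts, granularity):
    """Truncate a timestamp to its hourly or daily bucket label."""
    if granularity == "hourly":
        return ts.strftime("%Y-%m-%dT%H:00")
    return ts.strftime("%Y-%m-%d")

requests = [
    datetime(2024, 5, 1, 9, 15, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 9, 50, tzinfo=timezone.utc),
    datetime(2024, 5, 1, 14, 5, tzinfo=timezone.utc),
]
hourly = Counter(bucket_key(ts, "hourly") for ts in requests)
# Two requests land in the 09:00 bucket, one in 14:00
```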
Use cases:
- Identify traffic patterns and peak usage times
- Detect anomalies or unexpected traffic spikes
- Plan capacity for expected load
Provider Statistics
GET /api/v1/analytics/providers?range=30d
Aggregated metrics by provider:
- Model Count: Number of models from this provider
- Total Requests: Aggregate requests across all models
- Total Tokens: Aggregate token consumption
- Average Latency: Mean response time for provider
- Total Cost: Cumulative spend with provider
- Success Rate: Percentage of successful requests across all models
Cost Breakdown
GET /api/v1/analytics/costs?range=30d&group_by=model
Cost analysis grouped by different dimensions:
| group_by | Description |
|---|---|
| model | Cost per model |
| provider | Cost per provider |
| day | Daily cost trends |
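Each group_by value is just a different key to sum over. A sketch of that aggregation, assuming usage records expose model, provider, day, and cost fields (field names are illustrative):

```python
from collections import defaultdict

def group_costs(records, group_by):
    """Sum record costs by one dimension, mirroring the group_by parameter."""
    totals = defaultdict(float)
    for record in records:
        totals[record[group_by]] += record["cost"]
    return dict(totals)

records = [
    {"model": "gpt-4o", "provider": "openai", "day": "2024-05-01", "cost": 1.20},
    {"model": "gpt-4o", "provider": "openai", "day": "2024-05-02", "cost": 0.80},
    {"model": "claude-3-5-sonnet", "provider": "anthropic", "day": "2024-05-01", "cost": 0.95},
]
by_model = group_costs(records, "model")      # gpt-4o ~2.00, claude-3-5-sonnet ~0.95
by_day = group_costs(records, "day")          # daily trend view of the same data
```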
Usage Statistics
GET /api/v1/analytics/usage?range=30d
Detailed usage statistics with optional filtering:
| Parameter | Type | Description |
|---|---|---|
| model_id | string | Filter by specific model |
| user_id | string | Filter by specific user |
| range | string | Date range |
Returns:
- Total requests, tokens, and cost
- Per-model breakdown
- Date range context
Performance Metrics
GET /api/v1/analytics/performance?range=30d
Performance-focused metrics for models:
| Parameter | Type | Description |
|---|---|---|
| model_id | string | Filter by specific model |
| range | string | Date range |
User Analytics (Admin Only)
GET /api/v1/analytics/users?range=30d&limit=100
Track usage per user for cost allocation and monitoring. Access is role-based:
- Admin users see all users' data
- Developer users see only their own data
Per-User Metrics
- Requests: Total API calls by user
- Tokens: Total tokens consumed (input, output, total)
- Cost: Total spend attributed to user
- Avg Response Time: Mean latency for user's requests
- Success/Error Rates: Request success and failure percentages
Usage Patterns
- Unique Models Count: Number of distinct models accessed
- Providers Used: Which providers the user accesses
- Guardrail Violations: Count of requests blocked by content policies
Activity Tracking
- First Request: When user first accessed the gateway
- Last Request: Most recent activity timestamp
- Days Active: Number of days with at least one request
- Avg Requests per Day: Average daily usage
Cost Efficiency Metrics
- Cost per Request: Average cost per API call
- Cost per 1K Tokens: Normalized cost metric
- Max Tokens in Single Request: Largest request by token count
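The efficiency metrics are simple ratios over the raw per-user totals. A sketch, assuming the totals come from the usage endpoint above (the function and field names are illustrative):

```python
def cost_efficiency(total_cost, total_requests, total_tokens):
    """Derive cost-per-request and cost-per-1K-tokens from raw totals."""
    return {
        "cost_per_request": total_cost / total_requests,
        "cost_per_1k_tokens": total_cost / total_tokens * 1000,
    }

# 300 requests consuming 900K tokens at $4.50 total
metrics = cost_efficiency(total_cost=4.50, total_requests=300, total_tokens=900_000)
# cost_per_request ~ $0.015, cost_per_1k_tokens ~ $0.005
```

Cost per 1K tokens is the most useful number for comparing users (or models) fairly, because it normalizes away differences in request volume.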
Guardrails Monitoring
Platform-Wide Guardrail Metrics
- Total Models: Number of deployed models (self-hosted + third-party)
- Protected Models: Count of models with guardrails enabled
- Total Rules: Aggregate count of all active guardrail rules
- Blocked Today: Requests blocked by guardrails in last 24 hours
- Modified Today: Requests modified by guardrails in last 24 hours
Per-Model Guardrail Metrics
Each model tracks:
- Enabled Status: Whether guardrails are active for this model
- Rules Count: Total number of configured rules
- Input Rules: Rules applied to user prompts before model inference
- Output Rules: Rules applied to model responses before returning to user
- Total Requests: Number of API calls processed
- Blocked Requests: Requests denied due to policy violations
- Modified Requests: Requests altered by guardrails (e.g., PII redaction)
- Block Rate: Percentage of requests blocked
Guardrail Rule Types
Monitor specific rule categories:
Content Filtering
- Toxicity Detection: Blocks toxic, offensive, or harmful content
  - Categories: Hate speech, harassment, violence, profanity, sexual content
  - Threshold levels: Low (permissive), Medium (balanced), High (strict)
- PII Detection: Identifies and redacts personally identifiable information
  - Detects: Email addresses, phone numbers, SSN, credit cards, IP addresses, names, addresses, dates of birth
  - Actions: Redact, mask, or block entire request
- Prompt Injection Detection: Detects attempts to manipulate model behavior
  - Patterns: "Ignore previous instructions", system prompt extraction, role confusion
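To make the "redact" action concrete, here is a deliberately minimal regex-based sketch covering just two PII categories. This is an illustration only, not the gateway's detector, which covers many more categories with more robust matching:

```python
import re

# Two illustrative patterns; real detectors handle far more categories
# (SSNs, credit cards, names, addresses, ...) and far more formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace each detected PII span with its category placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redact("Contact jane.doe@example.com or 555-123-4567")
# → "Contact [EMAIL] or [PHONE]"
```

The "mask" action works the same way but preserves partial content (e.g. keeping the last four digits), while "block" rejects the whole request instead of rewriting it.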
Topic Restrictions
- Allowed Topics: Restricts model to specific subject areas
- Banned Topics: Blocks specific prohibited subjects
Output Validation
- Format Enforcement: Ensures output matches required structure
  - Formats: JSON schema, XML, markdown, specific patterns
  - Action: Retry generation or return error
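The retry-or-error loop for format enforcement can be sketched as follows, here validating only that output parses as JSON (schema validation would layer on top; the function name and retry count are illustrative):

```python
import json

def enforce_json(generate, max_retries=2):
    """Call `generate` until its output parses as JSON; raise if every
    attempt fails (the 'return error' action)."""
    last_error = None
    for _ in range(1 + max_retries):
        output = generate()
        try:
            return json.loads(output)
        except json.JSONDecodeError as err:
            last_error = err
    raise ValueError(f"model never produced valid JSON: {last_error}")

# Simulated model that fails once, then returns valid JSON
attempts = iter(["not json", '{"answer": 42}'])
result = enforce_json(lambda: next(attempts))
# → {"answer": 42} after one retry
```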
Filters & Options
Customize your analytics view:
Date Range
- Last 24 hours
- Last 7 days
- Last 30 days
- Last 90 days
Provider Filter
View all providers or filter by:
- OpenAI
- Anthropic
- Google (Gemini)
- Mistral
- Cohere
- Groq
- DeepSeek
- Grok (xAI)
- ElevenLabs
- Stability AI
- Black Forest Labs
- Runway
- Luma AI
- Self-Hosted (vLLM)
API Access
All analytics data is available programmatically via the REST API endpoints listed above.
Best Practices
Regular Monitoring
- Daily: Check summary metrics and cost trends
- Weekly: Review model performance and optimize configurations
- Monthly: Analyze usage patterns and forecast future needs
Performance Optimization
- Identify Slow Models: Sort by P95/P99 latency
- Analyze Error Rates: Investigate models with high error rates
- Optimize Token Usage: Review input/output token ratios
- Scale Appropriately: Adjust autoscaling based on usage patterns
Cost Management
- Track Spending: Monitor costs daily to avoid surprises
- Identify High-Cost Models: Sort by total cost
- Optimize Model Selection: Use smaller models for simple tasks
- Review Provider Mix: Compare costs across providers
Security Monitoring
- Review Guardrail Activity: Check blocked and modified requests
- Investigate Anomalies: Look for unusual traffic patterns
- Monitor User Activity: Track per-user usage for abuse detection
- Update Rules: Adjust guardrails based on observed patterns
Troubleshooting
High Error Rates
Possible causes:
- Model overloaded (needs more resources)
- API key issues with third-party provider
- Network connectivity problems
- Invalid request formats
Solutions:
- Enable autoscaling or add more instances
- Verify API key validity
- Check network connectivity
- Review request logs for formatting issues
High Latency
Possible causes:
- Insufficient compute resources
- Cold starts (on-demand deployment)
- Large input/output token counts
- Network latency
Solutions:
- Scale up resources or enable autoscaling
- Use "Always On" deployment
- Optimize prompts to reduce token usage
- Deploy in regions closer to users
Unexpected Costs
Possible causes:
- Autoscaling to max replicas
- High token usage from verbose prompts/responses
- Forgotten "Always On" models
- Inefficient model selection
Solutions:
- Review autoscaling configuration
- Optimize prompts and set max_tokens limits
- Audit deployed models and stop unused ones
- Use smaller models for simpler tasks
Guardrails Over-Blocking
Possible causes:
- Thresholds too strict
- False positives in content detection
- Overly broad topic restrictions
Solutions:
- Adjust sensitivity thresholds
- Review blocked requests to identify patterns
- Refine allowed/banned topic lists
- Add exemptions for legitimate use cases
Next Steps
- Optimize costs based on usage insights
- Configure autoscaling based on traffic patterns
- Learn about deployment options to improve performance