Monitoring Models

Monitor model usage, performance, and costs with real-time analytics. The AI Gateway provides comprehensive dashboards for tracking performance, identifying issues, and optimizing spend.

Analytics Dashboard

Access the analytics dashboard to view metrics across all your deployed models.

Summary Metrics

High-level metrics displayed at the top of the dashboard:

  • Total Requests: Aggregate count of all API calls across models
  • Total Tokens: Sum of input + output tokens processed
  • Average Response Time: Mean latency across all requests
  • Total Cost: Cumulative spend across all models and providers

Time-Series Charts

Requests Over Time

Line chart showing request volume by hour (24h view) or day (7d/30d/90d views)

Use cases:

  • Identify traffic patterns and peak usage times
  • Detect anomalies or unexpected traffic spikes
  • Plan capacity for expected load

Token Usage Over Time

Line chart showing token consumption trends over the selected date range

Use cases:

  • Track token consumption trends
  • Forecast future usage and costs
  • Identify models with highest token usage

Cost Over Time

Line chart showing spend over the selected date range, useful for forecasting monthly costs

Use cases:

  • Monitor spending patterns
  • Set budget alerts
  • Identify cost optimization opportunities

Provider Distribution

Doughnut chart showing request distribution across providers (OpenAI, Anthropic, Self-Hosted, etc.)

Use cases:

  • Understand provider usage mix
  • Identify opportunities to consolidate providers
  • Track provider reliability

Model Details Table

Per-model analytics with sortable columns:

  • Total Requests: Number of API calls to this model
  • Total Tokens: Input + output tokens processed (affects cost)
  • Input/Output Tokens: Separate breakdown for prompt vs completion tokens
  • Avg Response Time: Mean latency for model requests
  • P95 Response Time: 95th percentile latency (the slowest 5% of requests exceed this)
  • P99 Response Time: 99th percentile latency (the slowest 1% of requests exceed this)
  • Error Rate: Percentage of failed requests
  • Success Rate: Percentage of successful requests
  • Total Cost: Cumulative spend for this model
  • 24h/7d/30d Metrics: Rolling-window stats for requests, tokens, and cost

Understanding Latency Metrics

  • Average Response Time: Useful for gauging typical performance, but can hide slow outliers
  • P95 Response Time: 95% of requests complete within this time; the slowest 5% take longer
  • P99 Response Time: 99% of requests complete within this time; the slowest 1% take longer

Performance Monitoring

Monitor P95 and P99 latency to ensure consistent user experience. High P95/P99 values indicate performance issues affecting some users even if average latency is acceptable.
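
To make these figures concrete, here is a minimal sketch in Python (not the gateway's implementation) showing how average and nearest-rank percentile latency can be computed from raw request timings:

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample covering pct% of requests."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Example request latencies in milliseconds: most are fast, a few are slow
latencies_ms = [120, 95, 110, 480, 105, 98, 2100, 101, 99, 115]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"avg={avg:.0f}ms p95={percentile(latencies_ms, 95)}ms p99={percentile(latencies_ms, 99)}ms")
# avg=342ms p95=2100ms p99=2100ms -- the slow tail dominates even though
# most requests finish near 100ms
```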

Provider Statistics

Aggregated metrics by provider:

  • Model Count: Number of models from this provider
  • Total Requests: Aggregate requests across all models
  • Total Tokens: Aggregate token consumption
  • Average Latency: Mean response time for this provider
  • Total Cost: Cumulative spend with this provider
  • Success Rate: Percentage of successful requests across all models

User Analytics (Admin Only)

Track usage per user for cost allocation and monitoring:

Per-User Metrics

  • Requests: Total API calls by user
  • Tokens: Total tokens consumed by user
  • Cost: Total spend attributed to user
  • Avg Response Time: Mean latency for user's requests
  • Success/Error Rates: Request success and failure percentages

Usage Patterns

  • Model Usage: Unique models accessed by user
  • Providers Used: Which providers the user accesses
  • Guardrail Violations: Count of requests blocked by content policies

Activity Tracking

  • First Request: When user first accessed the gateway
  • Last Request: Most recent activity timestamp
  • Days Active: Number of days with at least one request
  • Avg Requests per Day: Average daily usage

Cost Metrics

  • Cost per Request: Average cost per API call
  • Cost per 1K Tokens: Normalized cost metric
  • Max Tokens in Single Request: Largest request by token count
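
As a worked example with made-up numbers, the normalized cost metrics derive from a user's totals like so:

```python
# Hypothetical per-user totals read off the dashboard
total_cost = 12.48        # USD
total_requests = 3_200
total_tokens = 1_560_000

cost_per_request = total_cost / total_requests            # $0.0039 per call
cost_per_1k_tokens = total_cost / (total_tokens / 1_000)  # $0.0080 per 1K tokens

print(f"${cost_per_request:.4f}/request, ${cost_per_1k_tokens:.4f}/1K tokens")
```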

Guardrails Monitoring

Platform-Wide Guardrail Metrics

  • Total Models: Number of deployed models (self-hosted + third-party)
  • Protected Models: Count of models with guardrails enabled
  • Total Rules: Aggregate count of all active guardrail rules
  • Blocked Today: Requests blocked by guardrails in last 24 hours
  • Modified Today: Requests modified by guardrails in last 24 hours
  • Active Alerts: Current policy violations requiring attention

Per-Model Guardrail Metrics

Each model tracks:

  • Enabled Status: Whether guardrails are active for this model
  • Rules Count: Total number of configured rules
  • Input Rules: Rules applied to user prompts before model inference
  • Output Rules: Rules applied to model responses before returning to user
  • Both Rules: Rules that apply to both input and output
  • Total Requests: Number of API calls processed
  • Blocked Requests: Requests denied due to policy violations
  • Modified Requests: Requests altered by guardrails (e.g., PII redaction)
  • Block Rate: Percentage of requests blocked
  • Last Triggered: Timestamp of most recent guardrail activation

Guardrail Rule Types

Monitor specific rule categories:

Content Filtering

  • Toxicity Detection: Blocks toxic, offensive, or harmful content

    • Categories: Hate speech, harassment, violence, profanity, sexual content
    • Threshold levels: Low (permissive), Medium (balanced), High (strict)
  • PII Detection: Identifies and redacts personally identifiable information (see the sketch after this list)

    • Detects: Email addresses, phone numbers, SSN, credit cards, IP addresses, names, addresses, dates of birth
    • Actions: Redact, mask, or block entire request
  • Prompt Injection Detection: Detects attempts to manipulate model behavior

    • Patterns: "Ignore previous instructions", system prompt extraction, role confusion
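
To illustrate the redact action named above, here is a minimal regex-based sketch. It covers only two PII types and is a deliberate simplification, not the gateway's actual detector:

```python
import re

# Simplified patterns for two common PII types (illustrative only)
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [REDACTED_EMAIL], SSN [REDACTED_SSN].
```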

Topic Restrictions

  • Allowed Topics: Restricts model to specific subject areas

    • Configuration: Whitelist of allowed topics
    • Action: Reject off-topic queries with a helpful message
  • Banned Topics: Blocks specific prohibited subjects

    • Configuration: Blacklist of topics
    • Action: Reject query and suggest alternative resources
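
A naive sketch of the allow/ban decision follows; the topic names are hypothetical, and real topic classification would use a model rather than exact string matching:

```python
ALLOWED_TOPICS = {"billing", "shipping", "returns"}   # hypothetical whitelist
BANNED_TOPICS = {"medical advice", "legal advice"}    # hypothetical blacklist

def route_query(detected_topic: str) -> str:
    """Decide how to handle a query given its classified topic."""
    if detected_topic in BANNED_TOPICS:
        return "reject: suggest an appropriate alternative resource"
    if detected_topic not in ALLOWED_TOPICS:
        return "reject: please ask about billing, shipping, or returns"
    return "allow"

print(route_query("shipping"))        # allow
print(route_query("medical advice"))  # reject and suggest alternatives
```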

Rate Limiting

  • Request Rate Limits: Limits requests per user/API key per time window (see the sketch after this list)

    • Configuration: Requests per minute/hour/day
    • Scope: Per user, per API key, per IP address, global
  • Token Limits: Limits total tokens consumed per user per period

    • Configuration: Max tokens per day/week/month
    • Tracking: Real-time token usage dashboard
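
A fixed-window request limiter can be sketched as below. This is a toy in-memory version; a production gateway would typically back the counters with a shared store such as Redis:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds per key."""

    def __init__(self, limit: int, window_s: int = 60):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str) -> bool:
        window = int(time.time()) // self.window_s
        bucket = (key, window)
        if self.counts[bucket] >= self.limit:
            return False  # over limit; a gateway would return HTTP 429
        self.counts[bucket] += 1
        return True

limiter = FixedWindowLimiter(limit=3, window_s=60)
print([limiter.allow("api-key-123") for _ in range(5)])
# [True, True, True, False, False]
```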

Output Validation

  • Hallucination Detection: Flags responses that may contain fabricated information

    • Method: Confidence scoring, fact-checking, consistency checks
    • Action: Add disclaimer, request regeneration, or block response
  • Format Enforcement: Ensures output matches required structure

    • Formats: JSON schema, XML, markdown, specific patterns
    • Action: Retry generation or return error
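
The retry action can be sketched as follows, assuming a generate() callable that returns raw model text; a required-keys check stands in for full JSON Schema validation:

```python
import json

def enforce_json(generate, required_keys, max_attempts=3):
    """Call generate() until it yields valid JSON containing required_keys."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry generation
        if all(key in data for key in required_keys):
            return data
    raise ValueError("output failed format validation after retries")

# Usage (call_model is a stand-in for your model invocation):
# enforce_json(lambda: call_model(prompt), {"title", "summary"})
```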

Filters & Options

Customize your analytics view:

Date Range

  • Last 24 hours
  • Last 7 days
  • Last 30 days
  • Last 90 days
  • Custom date range

Provider Filter

View all providers or filter by:

  • OpenAI
  • Anthropic
  • Mistral
  • Cohere
  • Google
  • Self-Hosted

Export Options

  • Download CSV: Export model analytics for reporting
  • API Access: Programmatic access to metrics via REST API (see the sketch below)
  • Webhook Integration: Real-time alerts for anomalies
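
Programmatic access might look like the sketch below. The base URL, endpoint path, query parameter, and response shape are placeholders; consult the gateway's API reference for the actual contract:

```python
import requests

GATEWAY_URL = "https://gateway.example.com"  # placeholder
API_KEY = "YOUR_API_KEY"                     # placeholder

# Hypothetical per-model metrics endpoint
resp = requests.get(
    f"{GATEWAY_URL}/api/v1/analytics/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"range": "7d"},
    timeout=30,
)
resp.raise_for_status()

for model in resp.json():  # assumed shape: a list of per-model metric objects
    print(model)
```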

Auto-Refresh

  • Enabled: Data updates automatically every 30 seconds
  • Disabled: Manual refresh required

Alerts and Notifications

Set up alerts for important events:

Performance Alerts

  • High Latency: Alert when P95 latency exceeds threshold
  • Error Rate: Alert when error rate exceeds percentage
  • Availability: Alert when model becomes unavailable

Cost Alerts

  • Budget Threshold: Alert when spending exceeds budget
  • Anomaly Detection: Alert on unusual spending patterns
  • Daily Limits: Alert when approaching daily token limits

Guardrail Alerts

  • High Block Rate: Alert when too many requests are blocked
  • Policy Violations: Alert on specific policy violations
  • Security Events: Alert on prompt injection attempts

Capacity Alerts

  • High Utilization: Alert when CPU/Memory exceeds threshold
  • Max Replicas: Alert when autoscaling hits max replicas
  • Queue Depth: Alert when request queue grows too large

Best Practices

Regular Monitoring

  • Daily: Check summary metrics and cost trends
  • Weekly: Review model performance and optimize configurations
  • Monthly: Analyze usage patterns and forecast future needs

Performance Optimization

  1. Identify Slow Models: Sort by P95/P99 latency
  2. Analyze Error Rates: Investigate models with high error rates
  3. Optimize Token Usage: Review input/output token ratios
  4. Scale Appropriately: Adjust autoscaling based on usage patterns

Cost Management

  1. Track Spending: Monitor costs daily to avoid surprises
  2. Identify High-Cost Models: Sort by total cost
  3. Optimize Model Selection: Use smaller models for simple tasks
  4. Review Provider Mix: Compare costs across providers

Security Monitoring

  1. Review Guardrail Activity: Check blocked and modified requests
  2. Investigate Anomalies: Look for unusual traffic patterns
  3. Monitor User Activity: Track per-user usage for abuse detection
  4. Update Rules: Adjust guardrails based on observed patterns

Troubleshooting

High Error Rates

Possible causes:

  • Model overloaded (needs more resources)
  • API key issues with third-party provider
  • Network connectivity problems
  • Invalid request formats

Solutions:

  • Enable autoscaling or add more instances
  • Verify API key validity
  • Check network connectivity
  • Review request logs for formatting issues

High Latency

Possible causes:

  • Insufficient compute resources
  • Cold starts (on-demand deployment)
  • Large input/output token counts
  • Network latency

Solutions:

  • Scale up resources or enable autoscaling
  • Use "Always On" deployment
  • Optimize prompts to reduce token usage
  • Deploy in regions closer to users

Unexpected Costs

Possible causes:

  • Autoscaling to max replicas
  • High token usage from verbose prompts/responses
  • Forgotten "Always On" models
  • Inefficient model selection

Solutions:

  • Review autoscaling configuration
  • Optimize prompts and set max_tokens limits (see the sketch after this list)
  • Audit deployed models and stop unused ones
  • Use smaller models for simpler tasks
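
For the max_tokens point above, the cap is a single field in the request. A sketch assuming the gateway exposes an OpenAI-compatible chat completions endpoint (the URL and model name are placeholders):

```python
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "small-task-model",  # smaller model for a simple task
        "messages": [{"role": "user", "content": "Summarize in one sentence: ..."}],
        "max_tokens": 100,            # hard cap on completion length
    },
    timeout=60,
)
print(resp.json())
```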

Guardrails Over-Blocking

Possible causes:

  • Thresholds too strict
  • False positives in content detection
  • Overly broad topic restrictions

Solutions:

  • Adjust sensitivity thresholds
  • Review blocked requests to identify patterns
  • Refine allowed/banned topic lists
  • Add exemptions for legitimate use cases

Next Steps