
Monitoring Models

Monitor model usage, performance, and costs with analytics. The AI Gateway provides comprehensive analytics endpoints to track model performance, identify issues, and optimize costs.

Analytics Endpoints

The gateway exposes analytics data through REST API endpoints under /api/v1/analytics/:

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/analytics/models | GET | Model-level analytics with rolling metrics |
| /api/v1/analytics/timeseries | GET | Time-series data for charts |
| /api/v1/analytics/providers | GET | Statistics grouped by provider |
| /api/v1/analytics/costs | GET | Cost breakdown analysis |
| /api/v1/analytics/usage | GET | Detailed usage statistics |
| /api/v1/analytics/performance | GET | Performance metrics (latency, throughput) |
| /api/v1/analytics/users | GET | Per-user analytics (admin/developer views) |

Common Query Parameters

All analytics endpoints support:

| Parameter | Default | Options | Description |
|---|---|---|---|
| range | 30d | 24h, 7d, 30d, 90d | Date range filter |
| provider | all | Any provider name | Filter by provider |
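
These common parameters can be assembled into a request URL client-side. A minimal Python sketch (the base URL is a placeholder, and authentication is omitted):

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://gateway.example.com"  # placeholder: substitute your gateway host


def analytics_url(endpoint: str, range: str = "30d", provider: str = "all", **extra) -> str:
    """Build an analytics request URL with the common query parameters."""
    params = {"range": range, "provider": provider, **extra}
    return urljoin(BASE_URL, f"/api/v1/analytics/{endpoint}") + "?" + urlencode(params)


print(analytics_url("models"))
# https://gateway.example.com/api/v1/analytics/models?range=30d&provider=all
```

Endpoint-specific parameters (such as group_by for the costs endpoint) can be passed as extra keyword arguments.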

Model Analytics

GET /api/v1/analytics/models?range=30d&provider=all

Per-model metrics with rolling window statistics:

| Metric | Description |
|---|---|
| Total Requests | Number of API calls to this model |
| Total Tokens | Input + output tokens processed (affects cost) |
| Input/Output Tokens | Separate breakdown for prompt vs completion tokens |
| Avg Response Time | Mean latency for model requests |
| P95 Response Time | 95th percentile latency (only the slowest 5% of requests exceed this) |
| P99 Response Time | 99th percentile latency (only the slowest 1% of requests exceed this) |
| Error Rate | Percentage of failed requests |
| Success Rate | Percentage of successful requests |
| Total Cost | Cumulative spend for this model |
| 24h/7d/30d Metrics | Rolling window stats for requests, tokens, cost |
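
A response from this endpoint can be post-processed client-side, for instance to flag models breaching an error budget. A sketch over illustrative data (the field names here are assumptions, not the gateway's documented schema):

```python
# Sample rows in the shape suggested by the table above (field names are assumptions).
models = [
    {"model": "gpt-4o", "total_requests": 12000, "error_rate": 0.4, "p95_ms": 1800},
    {"model": "llama-3-8b", "total_requests": 54000, "error_rate": 3.2, "p95_ms": 950},
    {"model": "claude-3-haiku", "total_requests": 8000, "error_rate": 0.1, "p95_ms": 600},
]

# Flag models whose error rate exceeds a 1% budget, sorted worst-first.
flagged = sorted(
    (m for m in models if m["error_rate"] > 1.0),
    key=lambda m: m["error_rate"],
    reverse=True,
)
print([m["model"] for m in flagged])  # ['llama-3-8b']
```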

Understanding Latency Metrics

  • Average Response Time: shows typical performance but hides outliers
  • P95 Response Time: 95% of requests complete within this time
  • P99 Response Time: 99% of requests complete within this time

Performance Monitoring

Monitor P95 and P99 latency to ensure consistent user experience. High P95/P99 values indicate performance issues affecting some users even if average latency is acceptable.
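
To make the percentile definitions concrete, here is how P95/P99 can be computed from raw latency samples using the nearest-rank method (an illustration, not necessarily the gateway's exact calculation):

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # 100 samples: 1..100 ms
print(percentile(latencies_ms, 95))  # 95
print(percentile(latencies_ms, 99))  # 99
```

With this data, the mean is 50.5 ms, yet 5% of requests still take 95 ms or longer, which is exactly the gap between average and tail latency described above.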

Time-Series Data

GET /api/v1/analytics/timeseries?range=30d&granularity=daily

Returns time-bucketed data for charts:

  • Hourly buckets for 24h range
  • Daily buckets for 7d/30d/90d ranges

Use cases:

  • Identify traffic patterns and peak usage times
  • Detect anomalies or unexpected traffic spikes
  • Plan capacity for expected load
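
The same bucketing can be reproduced client-side by truncating event timestamps to the hour or day; a minimal sketch:

```python
from collections import Counter
from datetime import datetime


def bucket_key(ts: datetime, granularity: str) -> str:
    """Truncate a timestamp to its hourly or daily bucket label."""
    if granularity == "hourly":
        return ts.strftime("%Y-%m-%d %H:00")
    return ts.strftime("%Y-%m-%d")


events = [
    datetime(2024, 5, 1, 9, 15),
    datetime(2024, 5, 1, 9, 45),
    datetime(2024, 5, 1, 14, 5),
    datetime(2024, 5, 2, 8, 0),
]

daily = Counter(bucket_key(ts, "daily") for ts in events)
print(daily)  # Counter({'2024-05-01': 3, '2024-05-02': 1})
```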

Provider Statistics

GET /api/v1/analytics/providers?range=30d

Aggregated metrics by provider:

  • Model Count: Number of models from this provider
  • Total Requests: Aggregate requests across all models
  • Total Tokens: Aggregate token consumption
  • Average Latency: Mean response time for provider
  • Total Cost: Cumulative spend with provider
  • Success Rate: Percentage of successful requests across all models

Cost Breakdown

GET /api/v1/analytics/costs?range=30d&group_by=model

Cost analysis grouped by different dimensions:

| group_by | Description |
|---|---|
| model | Cost per model |
| provider | Cost per provider |
| day | Daily cost trends |
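
Each group_by dimension maps onto a simple sum-by-key aggregation. A sketch over illustrative usage records (the field names are assumptions):

```python
from collections import defaultdict

# Usage records in an assumed shape; the real response fields may differ.
records = [
    {"model": "gpt-4o", "provider": "OpenAI", "day": "2024-05-01", "cost": 1.20},
    {"model": "gpt-4o", "provider": "OpenAI", "day": "2024-05-02", "cost": 0.80},
    {"model": "claude-3-haiku", "provider": "Anthropic", "day": "2024-05-01", "cost": 0.30},
]


def cost_breakdown(records, group_by: str) -> dict[str, float]:
    """Sum cost per value of the chosen dimension (model, provider, or day)."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[group_by]] += r["cost"]
    return dict(totals)


print(cost_breakdown(records, "provider"))
```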

Usage Statistics

GET /api/v1/analytics/usage?range=30d

Detailed usage statistics with optional filtering:

| Parameter | Type | Description |
|---|---|---|
| model_id | string | Filter by specific model |
| user_id | string | Filter by specific user |
| range | string | Date range |

Returns:

  • Total requests, tokens, and cost
  • Per-model breakdown
  • Date range context

Performance Metrics

GET /api/v1/analytics/performance?range=30d

Performance-focused metrics for models:

| Parameter | Type | Description |
|---|---|---|
| model_id | string | Filter by specific model |
| range | string | Date range |

User Analytics (Admin Only)

GET /api/v1/analytics/users?range=30d&limit=100

Track usage per user for cost allocation and monitoring. Access is role-based:

  • Admin users see all users' data
  • Developer users see only their own data

Per-User Metrics

  • Requests: Total API calls by user
  • Tokens: Total tokens consumed (input, output, total)
  • Cost: Total spend attributed to user
  • Avg Response Time: Mean latency for user's requests
  • Success/Error Rates: Request success and failure percentages

Usage Patterns

  • Unique Models Count: Number of distinct models accessed
  • Providers Used: Which providers the user accesses
  • Guardrail Violations: Count of requests blocked by content policies

Activity Tracking

  • First Request: When user first accessed the gateway
  • Last Request: Most recent activity timestamp
  • Days Active: Number of days with at least one request
  • Avg Requests per Day: Average daily usage

Cost Efficiency Metrics

  • Cost per Request: Average cost per API call
  • Cost per 1K Tokens: Normalized cost metric
  • Max Tokens in Single Request: Largest request by token count
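
The two efficiency ratios are straightforward to derive from the raw totals; an illustrative sketch with made-up numbers:

```python
def cost_efficiency(total_cost: float, total_requests: int, total_tokens: int) -> dict[str, float]:
    """Derive the per-user efficiency metrics listed above from raw totals."""
    return {
        "cost_per_request": total_cost / total_requests,
        "cost_per_1k_tokens": total_cost / total_tokens * 1000,
    }


print(cost_efficiency(total_cost=12.0, total_requests=400, total_tokens=600_000))
```

Cost per 1K tokens is the more useful figure when comparing models, since it normalizes away differences in request size.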

Guardrails Monitoring

Platform-Wide Guardrail Metrics

  • Total Models: Number of deployed models (self-hosted + third-party)
  • Protected Models: Count of models with guardrails enabled
  • Total Rules: Aggregate count of all active guardrail rules
  • Blocked Today: Requests blocked by guardrails in last 24 hours
  • Modified Today: Requests modified by guardrails in last 24 hours

Per-Model Guardrail Metrics

Each model tracks:

  • Enabled Status: Whether guardrails are active for this model
  • Rules Count: Total number of configured rules
  • Input Rules: Rules applied to user prompts before model inference
  • Output Rules: Rules applied to model responses before returning to user
  • Total Requests: Number of API calls processed
  • Blocked Requests: Requests denied due to policy violations
  • Modified Requests: Requests altered by guardrails (e.g., PII redaction)
  • Block Rate: Percentage of requests blocked
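
Block rate is derived from the counters above; a minimal sketch, guarding against division by zero for models with no traffic:

```python
def block_rate(blocked: int, total: int) -> float:
    """Percentage of requests blocked by guardrails (0 when there is no traffic)."""
    return 100.0 * blocked / total if total else 0.0


print(block_rate(blocked=12, total=4800))  # 0.25
```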

Guardrail Rule Types

Monitor specific rule categories:

Content Filtering

  • Toxicity Detection: Blocks toxic, offensive, or harmful content

    • Categories: Hate speech, harassment, violence, profanity, sexual content
    • Threshold levels: Low (permissive), Medium (balanced), High (strict)
  • PII Detection: Identifies and redacts personally identifiable information

    • Detects: Email addresses, phone numbers, SSN, credit cards, IP addresses, names, addresses, dates of birth
    • Actions: Redact, mask, or block entire request
  • Prompt Injection Detection: Detects attempts to manipulate model behavior

    • Patterns: "Ignore previous instructions", system prompt extraction, role confusion

Topic Restrictions

  • Allowed Topics: Restricts model to specific subject areas
  • Banned Topics: Blocks specific prohibited subjects

Output Validation

  • Format Enforcement: Ensures output matches required structure
    • Formats: JSON schema, XML, markdown, specific patterns
    • Action: Retry generation or return error

Filters & Options

Customize your analytics view:

Date Range

  • Last 24 hours
  • Last 7 days
  • Last 30 days
  • Last 90 days

Provider Filter

View all providers or filter by:

  • OpenAI
  • Anthropic
  • Google (Gemini)
  • Mistral
  • Cohere
  • Groq
  • DeepSeek
  • Grok (xAI)
  • ElevenLabs
  • Stability AI
  • Black Forest Labs
  • Runway
  • Luma AI
  • Self-Hosted (vLLM)

API Access

All analytics data is available programmatically via the REST API endpoints listed above.

Best Practices

Regular Monitoring

  • Daily: Check summary metrics and cost trends
  • Weekly: Review model performance and optimize configurations
  • Monthly: Analyze usage patterns and forecast future needs

Performance Optimization

  1. Identify Slow Models: Sort by P95/P99 latency
  2. Analyze Error Rates: Investigate models with high error rates
  3. Optimize Token Usage: Review input/output token ratios
  4. Scale Appropriately: Adjust autoscaling based on usage patterns

Cost Management

  1. Track Spending: Monitor costs daily to avoid surprises
  2. Identify High-Cost Models: Sort by total cost
  3. Optimize Model Selection: Use smaller models for simple tasks
  4. Review Provider Mix: Compare costs across providers

Security Monitoring

  1. Review Guardrail Activity: Check blocked and modified requests
  2. Investigate Anomalies: Look for unusual traffic patterns
  3. Monitor User Activity: Track per-user usage for abuse detection
  4. Update Rules: Adjust guardrails based on observed patterns

Troubleshooting

High Error Rates

Possible causes:

  • Model overloaded (needs more resources)
  • API key issues with third-party provider
  • Network connectivity problems
  • Invalid request formats

Solutions:

  • Enable autoscaling or add more instances
  • Verify API key validity
  • Check network connectivity
  • Review request logs for formatting issues

High Latency

Possible causes:

  • Insufficient compute resources
  • Cold starts (on-demand deployment)
  • Large input/output token counts
  • Network latency

Solutions:

  • Scale up resources or enable autoscaling
  • Use "Always On" deployment
  • Optimize prompts to reduce token usage
  • Deploy in regions closer to users

Unexpected Costs

Possible causes:

  • Autoscaling to max replicas
  • High token usage from verbose prompts/responses
  • Forgotten "Always On" models
  • Inefficient model selection

Solutions:

  • Review autoscaling configuration
  • Optimize prompts and set max_tokens limits
  • Audit deployed models and stop unused ones
  • Use smaller models for simpler tasks

Guardrails Over-Blocking

Possible causes:

  • Thresholds too strict
  • False positives in content detection
  • Overly broad topic restrictions

Solutions:

  • Adjust sensitivity thresholds
  • Review blocked requests to identify patterns
  • Refine allowed/banned topic lists
  • Add exemptions for legitimate use cases

Next Steps