Monitoring Models

Monitor model usage, performance, and costs with real-time analytics. The AI Gateway provides comprehensive dashboards for tracking performance, identifying issues, and optimizing spend.

Analytics Dashboard

Access the analytics dashboard to view metrics across all your deployed models.

Summary Metrics

High-level metrics displayed at the top of the dashboard:

  • Total Requests: Aggregate count of all API calls across models
  • Total Tokens: Sum of input + output tokens processed
  • Average Response Time: Mean latency across all requests
  • Total Cost: Cumulative spend across all models and providers

Time-Series Charts

Requests Over Time

Line chart showing request volume by hour (24h view) or day (7d/30d/90d views)

Use cases:

  • Identify traffic patterns and peak usage times
  • Detect anomalies or unexpected traffic spikes
  • Plan capacity for expected load

Token Usage Over Time

Line chart showing token consumption trends over the selected date range

Use cases:

  • Track token consumption trends
  • Forecast future usage and costs
  • Identify models with highest token usage

Cost Over Time

Line chart showing spend over the selected date range, useful for forecasting monthly costs

Use cases:

  • Monitor spending patterns
  • Set budget alerts
  • Identify cost optimization opportunities

Provider Distribution

Doughnut chart showing request distribution across providers (OpenAI, Anthropic, Self-Hosted, etc.)

Use cases:

  • Understand provider usage mix
  • Identify opportunities to consolidate providers
  • Track provider reliability

Model Details Table

Per-model analytics with sortable columns:

  • Total Requests: Number of API calls to this model
  • Total Tokens: Input + output tokens processed (affects cost)
  • Input/Output Tokens: Separate breakdown for prompt vs completion tokens
  • Avg Response Time: Mean latency for model requests
  • P95 Response Time: 95th percentile latency (the slowest 5% of requests exceed this)
  • P99 Response Time: 99th percentile latency (the slowest 1% of requests exceed this)
  • Error Rate: Percentage of failed requests
  • Success Rate: Percentage of successful requests
  • Total Cost: Cumulative spend for this model
  • 24h/7d/30d Metrics: Rolling-window stats for requests, tokens, and cost

Understanding Latency Metrics

  • Average Response Time: Useful for gauging typical performance, but can hide slow outliers
  • P95 Response Time: 95% of requests complete within this time; the slowest 5% take longer
  • P99 Response Time: 99% of requests complete within this time; the slowest 1% take longer

Performance Monitoring

Monitor P95 and P99 latency to ensure consistent user experience. High P95/P99 values indicate performance issues affecting some users even if average latency is acceptable.
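
To make these figures concrete, here is a minimal sketch in Python (not the gateway's implementation) showing how average and nearest-rank percentile latency can be computed from raw request timings:

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample covering pct% of requests."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Example request latencies in milliseconds: most are fast, a few are slow
latencies_ms = [120, 95, 110, 480, 105, 98, 2100, 101, 99, 115]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"avg={avg:.0f}ms p95={percentile(latencies_ms, 95)}ms p99={percentile(latencies_ms, 99)}ms")
# avg=342ms p95=2100ms p99=2100ms -- the slow tail dominates even though
# most requests finish near 100ms
```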

Provider Statistics

Aggregated metrics by provider:

  • Model Count: Number of models from this provider
  • Total Requests: Aggregate requests across all models
  • Total Tokens: Aggregate token consumption
  • Average Latency: Mean response time for this provider
  • Total Cost: Cumulative spend with this provider
  • Success Rate: Percentage of successful requests across all models

User Analytics (Admin Only)

Track usage per user for cost allocation and monitoring:

Per-User Metrics

  • Requests: Total API calls by user
  • Tokens: Total tokens consumed by user
  • Cost: Total spend attributed to user
  • Avg Response Time: Mean latency for user's requests
  • Success/Error Rates: Request success and failure percentages

Usage Patterns

  • Model Usage: Unique models accessed by user
  • Providers Used: Which providers the user accesses
  • Guardrail Violations: Count of requests blocked by content policies

Activity Tracking

  • First Request: When user first accessed the gateway
  • Last Request: Most recent activity timestamp
  • Days Active: Number of days with at least one request
  • Avg Requests per Day: Average daily usage

Cost Metrics

  • Cost per Request: Average cost per API call
  • Cost per 1K Tokens: Normalized cost metric
  • Max Tokens in Single Request: Largest request by token count
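
As a worked example with made-up numbers, the normalized cost metrics derive from a user's totals like so:

```python
# Hypothetical per-user totals read off the dashboard
total_cost = 12.48        # USD
total_requests = 3_200
total_tokens = 1_560_000

cost_per_request = total_cost / total_requests            # $0.0039 per call
cost_per_1k_tokens = total_cost / (total_tokens / 1_000)  # $0.0080 per 1K tokens

print(f"${cost_per_request:.4f}/request, ${cost_per_1k_tokens:.4f}/1K tokens")
```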

Guardrails Monitoring

Platform-Wide Guardrail Metrics

  • Total Models: Number of deployed models (self-hosted + third-party)
  • Protected Models: Count of models with guardrails enabled
  • Total Rules: Aggregate count of all active guardrail rules
  • Blocked Today: Requests blocked by guardrails in last 24 hours
  • Modified Today: Requests modified by guardrails in last 24 hours
  • Active Alerts: Current policy violations requiring attention

Per-Model Guardrail Metrics

Each model tracks:

  • Enabled Status: Whether guardrails are active for this model
  • Rules Count: Total number of configured rules
  • Input Rules: Rules applied to user prompts before model inference
  • Output Rules: Rules applied to model responses before returning to user
  • Both Rules: Rules that apply to both input and output
  • Total Requests: Number of API calls processed
  • Blocked Requests: Requests denied due to policy violations
  • Modified Requests: Requests altered by guardrails (e.g., PII redaction)
  • Block Rate: Percentage of requests blocked
  • Last Triggered: Timestamp of most recent guardrail activation

Guardrail Rule Types

Monitor specific rule categories:

Content Filtering

  • Toxicity Detection: Blocks toxic, offensive, or harmful content

    • Categories: Hate speech, harassment, violence, profanity, sexual content
    • Threshold levels: Low (permissive), Medium (balanced), High (strict)
  • PII Detection: Identifies and redacts personally identifiable information (see the sketch after this list)

    • Detects: Email addresses, phone numbers, SSN, credit cards, IP addresses, names, addresses, dates of birth
    • Actions: Redact, mask, or block entire request
  • Prompt Injection Detection: Detects attempts to manipulate model behavior

    • Patterns: "Ignore previous instructions", system prompt extraction, role confusion
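
To illustrate the redact action named above, here is a minimal regex-based sketch. It covers only two PII types and is a deliberate simplification, not the gateway's actual detector:

```python
import re

# Simplified patterns for two common PII types (illustrative only)
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [REDACTED_EMAIL], SSN [REDACTED_SSN].
```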

Topic Restrictions

  • Allowed Topics: Restricts model to specific subject areas

    • Configuration: Whitelist of allowed topics
    • Action: Reject off-topic queries with a helpful message
  • Banned Topics: Blocks specific prohibited subjects

    • Configuration: Blacklist of topics
    • Action: Reject query and suggest alternative resources
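
A naive sketch of the allow/ban decision follows; the topic names are hypothetical, and real topic classification would use a model rather than exact string matching:

```python
ALLOWED_TOPICS = {"billing", "shipping", "returns"}   # hypothetical whitelist
BANNED_TOPICS = {"medical advice", "legal advice"}    # hypothetical blacklist

def route_query(detected_topic: str) -> str:
    """Decide how to handle a query given its classified topic."""
    if detected_topic in BANNED_TOPICS:
        return "reject: suggest an appropriate alternative resource"
    if detected_topic not in ALLOWED_TOPICS:
        return "reject: please ask about billing, shipping, or returns"
    return "allow"

print(route_query("shipping"))        # allow
print(route_query("medical advice"))  # reject and suggest alternatives
```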

Rate Limiting

  • Request Rate Limits: Limits requests per user/API key per time window (see the sketch after this list)

    • Configuration: Requests per minute/hour/day
    • Scope: Per user, per API key, per IP address, global
  • Token Limits: Limits total tokens consumed per user per period

    • Configuration: Max tokens per day/week/month
    • Tracking: Real-time token usage dashboard
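
A fixed-window request limiter can be sketched as below. This is a toy in-memory version; a production gateway would typically back the counters with a shared store such as Redis:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds per key."""

    def __init__(self, limit: int, window_s: int = 60):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str) -> bool:
        window = int(time.time()) // self.window_s
        bucket = (key, window)
        if self.counts[bucket] >= self.limit:
            return False  # over limit; a gateway would return HTTP 429
        self.counts[bucket] += 1
        return True

limiter = FixedWindowLimiter(limit=3, window_s=60)
print([limiter.allow("api-key-123") for _ in range(5)])
# [True, True, True, False, False]
```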

Output Validation

  • Hallucination Detection: Flags responses that may contain fabricated information

    • Method: Confidence scoring, fact-checking, consistency checks
    • Action: Add disclaimer, request regeneration, or block response
  • Format Enforcement: Ensures output matches required structure

    • Formats: JSON schema, XML, markdown, specific patterns
    • Action: Retry generation or return error
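
The retry action can be sketched as follows, assuming a generate() callable that returns raw model text; a required-keys check stands in for full JSON Schema validation:

```python
import json

def enforce_json(generate, required_keys, max_attempts=3):
    """Call generate() until it yields valid JSON containing required_keys."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry generation
        if all(key in data for key in required_keys):
            return data
    raise ValueError("output failed format validation after retries")

# Usage (call_model is a stand-in for your model invocation):
# enforce_json(lambda: call_model(prompt), {"title", "summary"})
```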

Filters & Options

Customize your analytics view:

Date Range

  • Last 24 hours
  • Last 7 days
  • Last 30 days
  • Last 90 days
  • Custom date range

Provider Filter

View all providers or filter by:

  • OpenAI
  • Anthropic
  • Mistral
  • Cohere
  • Google
  • Self-Hosted

Export Options

  • Download CSV: Export model analytics for reporting
  • API Access: Programmatic access to metrics via REST API (see the sketch below)
  • Webhook Integration: Real-time alerts for anomalies
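
Programmatic access might look like the sketch below. The base URL, endpoint path, query parameter, and response shape are placeholders; consult the gateway's API reference for the actual contract:

```python
import requests

GATEWAY_URL = "https://gateway.example.com"  # placeholder
API_KEY = "YOUR_API_KEY"                     # placeholder

# Hypothetical per-model metrics endpoint
resp = requests.get(
    f"{GATEWAY_URL}/api/v1/analytics/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"range": "7d"},
    timeout=30,
)
resp.raise_for_status()

for model in resp.json():  # assumed shape: a list of per-model metric objects
    print(model)
```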

Auto-Refresh

  • Enabled: Data updates automatically every 30 seconds
  • Disabled: Manual refresh required

Alerts and Notifications

Set up alerts for important events:

Performance Alerts

  • High Latency: Alert when P95 latency exceeds threshold
  • Error Rate: Alert when error rate exceeds percentage
  • Availability: Alert when model becomes unavailable

Cost Alerts

  • Budget Threshold: Alert when spending exceeds budget
  • Anomaly Detection: Alert on unusual spending patterns
  • Daily Limits: Alert when approaching daily token limits

Guardrail Alerts

  • High Block Rate: Alert when too many requests are blocked
  • Policy Violations: Alert on specific policy violations
  • Security Events: Alert on prompt injection attempts

Capacity Alerts

  • High Utilization: Alert when CPU/Memory exceeds threshold
  • Max Replicas: Alert when autoscaling hits max replicas
  • Queue Depth: Alert when request queue grows too large

Best Practices

Regular Monitoring

  • Daily: Check summary metrics and cost trends
  • Weekly: Review model performance and optimize configurations
  • Monthly: Analyze usage patterns and forecast future needs

Performance Optimization

  1. Identify Slow Models: Sort by P95/P99 latency
  2. Analyze Error Rates: Investigate models with high error rates
  3. Optimize Token Usage: Review input/output token ratios
  4. Scale Appropriately: Adjust autoscaling based on usage patterns

Cost Management

  1. Track Spending: Monitor costs daily to avoid surprises
  2. Identify High-Cost Models: Sort by total cost
  3. Optimize Model Selection: Use smaller models for simple tasks
  4. Review Provider Mix: Compare costs across providers

Security Monitoring

  1. Review Guardrail Activity: Check blocked and modified requests
  2. Investigate Anomalies: Look for unusual traffic patterns
  3. Monitor User Activity: Track per-user usage for abuse detection
  4. Update Rules: Adjust guardrails based on observed patterns

Troubleshooting

High Error Rates

Possible causes:

  • Model overloaded (needs more resources)
  • API key issues with third-party provider
  • Network connectivity problems
  • Invalid request formats

Solutions:

  • Enable autoscaling or add more instances
  • Verify API key validity
  • Check network connectivity
  • Review request logs for formatting issues

High Latency

Possible causes:

  • Insufficient compute resources
  • Cold starts (on-demand deployment)
  • Large input/output token counts
  • Network latency

Solutions:

  • Scale up resources or enable autoscaling
  • Use "Always On" deployment
  • Optimize prompts to reduce token usage
  • Deploy in regions closer to users

Unexpected Costs

Possible causes:

  • Autoscaling to max replicas
  • High token usage from verbose prompts/responses
  • Forgotten "Always On" models
  • Inefficient model selection

Solutions:

  • Review autoscaling configuration
  • Optimize prompts and set max_tokens limits (see the sketch after this list)
  • Audit deployed models and stop unused ones
  • Use smaller models for simpler tasks
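
For the max_tokens point above, the cap is a single field in the request. A sketch assuming the gateway exposes an OpenAI-compatible chat completions endpoint (the URL and model name are placeholders):

```python
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "small-task-model",  # smaller model for a simple task
        "messages": [{"role": "user", "content": "Summarize in one sentence: ..."}],
        "max_tokens": 100,            # hard cap on completion length
    },
    timeout=60,
)
print(resp.json())
```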

Guardrails Over-Blocking

Possible causes:

  • Thresholds too strict
  • False positives in content detection
  • Overly broad topic restrictions

Solutions:

  • Adjust sensitivity thresholds
  • Review blocked requests to identify patterns
  • Refine allowed/banned topic lists
  • Add exemptions for legitimate use cases

Next Steps