A/B Testing Routers
A/B testing routers split traffic between multiple model variants, compare their performance, and enable data-driven decisions about which model to deploy to production. Routers are deployed to Kubernetes via the AI Gateway and serve predictions through a dedicated endpoint.
Why Use A/B Testing?
Machine learning models need continuous validation in production. A/B testing helps you:
- Validate Model Changes -- Test a new model against your current production model with real traffic before fully replacing it
- Measure Impact -- Quantify differences in accuracy, latency, and business metrics between model versions
- Reduce Risk -- Gradually roll out new models rather than replacing everything at once
- Optimize Continuously -- Use multi-armed bandit algorithms to automatically route more traffic to better-performing models
Routing Strategies
Weighted Random
Split traffic randomly based on configured percentages. Weights are proportional -- a variant with weight 70 gets 70% of traffic relative to the total weight of all enabled variants.
If all weights are zero, traffic is distributed equally across all enabled variants.
Best for: Standard A/B tests where you want precise, controlled traffic allocation.
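The weighted random behavior described above (proportional weights, equal split when all weights are zero) can be sketched as follows. The variant fields (variantId, weight, isEnabled) mirror this document; the selection function itself is an illustrative assumption, not the gateway's actual implementation.

```javascript
// Sketch of proportional weighted selection over enabled variants.
function pickWeighted(variants) {
  const enabled = variants.filter((v) => v.isEnabled);
  const total = enabled.reduce((sum, v) => sum + (v.weight || 0), 0);
  // If all weights are zero, fall back to a uniform split.
  if (total === 0) {
    return enabled[Math.floor(Math.random() * enabled.length)];
  }
  let roll = Math.random() * total;
  for (const v of enabled) {
    roll -= v.weight || 0;
    if (roll < 0) return v;
  }
  return enabled[enabled.length - 1]; // guard against float rounding
}
```

A variant with weight 70 alongside one with weight 30 receives roughly 70% of requests; disabled variants are excluded before the total weight is computed.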
Feature-Based
Route requests based on input feature values. Rules are evaluated in priority order (lower number = higher priority). The first matching rule determines which variant handles the request. If no rule matches, the request goes to the control variant.
Supported operators:
| Operator | Description | Example |
|---|---|---|
| equals | Exact match | region equals "us-east" |
| not_equals | Not equal | tier not_equals "free" |
| in | Value is in array | country in ["US", "CA", "UK"] |
| not_in | Value is not in array | status not_in ["blocked", "suspended"] |
| greater_than | Numeric greater than | age greater_than 18 |
| less_than | Numeric less than | score less_than 0.5 |
| contains | String contains | email contains "@company.com" |
| regex | Regular expression match | sku regex "^PREMIUM-.*" |
Each rule specifies:
| Field | Description |
|---|---|
| ruleId | Unique identifier for the rule |
| featureName | Input feature to evaluate |
| operator | Comparison operator |
| value | Value to compare against |
| targetVariantId | Route to this variant if rule matches |
| priority | Evaluation order (lower = higher priority) |
Best for: Segment-specific testing, personalized model selection, routing premium customers to specialized models.
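The evaluation order described above (sort by priority, first match wins, control as fallback) can be sketched as below. The rule fields follow the table in this section; the operator implementations are assumptions consistent with the operator table.

```javascript
// Operator implementations matching the supported-operators table.
const OPS = {
  equals: (a, b) => a === b,
  not_equals: (a, b) => a !== b,
  in: (a, b) => b.includes(a),
  not_in: (a, b) => !b.includes(a),
  greater_than: (a, b) => a > b,
  less_than: (a, b) => a < b,
  contains: (a, b) => String(a).includes(b),
  regex: (a, b) => new RegExp(b).test(String(a)),
};

// Evaluate rules in priority order (lower number = higher priority);
// the first matching rule wins, otherwise fall back to the control.
function routeByFeatures(features, rules, controlVariantId) {
  const ordered = [...rules].sort((x, y) => x.priority - y.priority);
  for (const rule of ordered) {
    const op = OPS[rule.operator];
    if (op && op(features[rule.featureName], rule.value)) {
      return rule.targetVariantId;
    }
  }
  return controlVariantId; // no rule matched
}
```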
Multi-Armed Bandit
Automatically balance exploration (testing variants) and exploitation (routing to the best performer). The algorithm learns which variants perform best and adjusts traffic allocation over time based on reward feedback.
Algorithms available:
| Algorithm | How It Works |
|---|---|
| Epsilon-Greedy | Routes most traffic to the best-performing variant. With probability epsilon (default 0.1), explores a random variant instead. Variants without any stats are explored first. |
| Thompson Sampling | Uses Bayesian probability with a Beta distribution. Samples from each variant's posterior distribution and selects the variant with the highest sample. Starts with an uninformed prior (Beta(1,1) = uniform). |
| UCB1 | Upper Confidence Bound algorithm. Selects the variant with the highest UCB score: avg_reward + c * sqrt(ln(total_pulls) / arm_pulls). Unpulled arms have infinite UCB and are explored first. The exploration bonus c (default 2.0) controls the exploration-exploitation tradeoff. |
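The UCB1 scoring rule from the table (avg_reward + c * sqrt(ln(total_pulls) / arm_pulls), with unpulled arms scored as infinite) can be sketched as follows. The arm fields (pulls, avgReward) match the arm-statistics description later in this document; the selection function is an illustrative assumption.

```javascript
// Sketch of UCB1 selection: pick the arm with the highest UCB score.
function selectUcb1(arms, c = 2.0) {
  const totalPulls = arms.reduce((sum, a) => sum + a.pulls, 0);
  let best = null;
  let bestScore = -Infinity;
  for (const arm of arms) {
    // Unpulled arms have infinite UCB and are explored first.
    const score = arm.pulls === 0
      ? Infinity
      : arm.avgReward + c * Math.sqrt(Math.log(totalPulls) / arm.pulls);
    if (score > bestScore) {
      bestScore = score;
      best = arm;
    }
  }
  return best;
}
```

Note how the exploration bonus favors lightly-sampled arms: an arm with few pulls gets a larger sqrt term, so it keeps being tried until its estimate is trustworthy.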
Bandit configuration:
| Field | Default | Description |
|---|---|---|
| algorithm | epsilon_greedy | Which MAB algorithm to use |
| epsilon | 0.1 | Exploration rate for epsilon-greedy (0-1) |
| explorationBonus | 2.0 | Exploration bonus constant for UCB1 |
| rewardMetric | success_rate | Metric to optimize: latency, success_rate, or custom |
| updateIntervalMinutes | 60 | How often to recalculate arm statistics |
Best for: Continuous optimization when you want to minimize exposure to poor-performing variants while still discovering better ones.
Canary
Gradually roll out a new model variant by routing a configurable percentage of traffic to the canary while the rest goes to the control variant. This strategy is designed for safe, incremental deployments.
Canary configuration:
| Field | Default | Description |
|---|---|---|
| targetVariantId | Second variant | The canary variant receiving incremental traffic |
| controlVariantId | First variant | The stable control variant |
| currentPercentage | 5 | Current percentage of traffic going to canary |
| targetPercentage | 100 | Target percentage for full rollout |
| incrementPercentage | 10 | How much to increase per interval |
| incrementIntervalMinutes | 60 | How often to increment the percentage |
| errorRateThreshold | 0.05 | Maximum acceptable error rate before rollback |
| latencyThresholdMs | 5000 | Maximum acceptable latency before rollback |
Best for: Safe rollouts of new model versions where you want to gradually increase traffic while monitoring for regressions.
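One evaluation tick of the canary loop described above could look like the sketch below. The configuration fields come from the table in this section; the metrics shape and the rollback behavior (dropping the canary to 0%) are assumptions for illustration.

```javascript
// Sketch of a single canary evaluation interval: roll back on a
// threshold breach, otherwise step toward the target percentage.
function evaluateCanary(config, metrics) {
  if (metrics.errorRate > config.errorRateThreshold ||
      metrics.avgLatencyMs > config.latencyThresholdMs) {
    return { ...config, currentPercentage: 0, rolledBack: true };
  }
  const next = Math.min(
    config.currentPercentage + config.incrementPercentage,
    config.targetPercentage
  );
  return { ...config, currentPercentage: next, rolledBack: false };
}
```

With the defaults (5% start, 10% increments every 60 minutes), a healthy canary reaches 100% in roughly ten intervals; a single threshold breach halts the rollout.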
Creating a Router
Via the UI
- Navigate to MLOps then A/B Testing Routers
- Click Create Router
- Enter a name and optional description and tags
- Select a routing strategy
- Add model variants (minimum 2 required):
- Assign variant IDs (e.g., "control", "treatment_a")
- Select models from the Model Registry
- Set traffic weights (for weighted random)
- Mark one variant as the control
- Configure strategy-specific settings:
- Weighted Random -- Set weights per variant (0-100)
- Feature-Based -- Define routing rules with operators and priorities
- Multi-Armed Bandit -- Choose algorithm and parameters
- Canary -- Configure target/control variants and rollout settings
- Click Create Router
Router Methods
routers.create({
name,
description,
tags,
strategy,
variants,
featureRules,
banditConfig,
canaryConfig,
workspaceId
})
Variant structure:
| Field | Required | Description |
|---|---|---|
| variantId | Yes | Unique identifier (e.g., "control", "treatment_a") |
| modelId | Yes | Reference to model in Model Registry |
| weight | No | Traffic weight 0-100 (default: equal distribution) |
| isControl | No | Whether this is the control variant (default: first variant) |
Variants are automatically initialized with isEnabled: true, isShadow: false, and zero metrics.
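A create payload might look like the sketch below. The field names follow the tables above, but the specific strategy string ("weighted_random"), model IDs, and the commented-out client call are assumptions for illustration.

```javascript
// Hypothetical router configuration for a two-variant weighted test.
const routerConfig = {
  name: "churn-model-ab-test",
  description: "Compare current churn model against retrained candidate",
  tags: ["churn", "experiment"],
  strategy: "weighted_random", // assumed strategy identifier
  variants: [
    { variantId: "control", modelId: "churn-v1", weight: 80, isControl: true },
    { variantId: "treatment_a", modelId: "churn-v2", weight: 20 },
  ],
};
// const router = await routers.create(routerConfig);
```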
Managing Variants
Adjusting Traffic Weights
routers.updateVariantWeight(routerId, variantId, weight)
Weight must be between 0 and 100. Changes take effect immediately for subsequent requests. This is only meaningful for the weighted random strategy.
Enabling and Disabling Variants
routers.toggleVariant(routerId, variantId, enabled)
Toggle a variant's enabled status. At least 2 variants must remain enabled at all times. When a variant is disabled, its traffic is redistributed among the remaining enabled variants.
Deployment Operations
Deploying a Router
routers.deploy(routerId)
Deploys the router to Kubernetes via the AI Gateway service:
- Router status changes to deploying
- Configuration is sent to the AI Gateway backend with organization ID for namespace isolation
- On success, status changes to running and the endpoint URL is stored
- On failure, status changes to failed with error details in the deployment logs
The router is deployed with default resources:
- CPU: 500m
- Memory: 512Mi
- Replicas: 1
Stopping a Router
routers.stop(routerId)
Stops the router deployment while preserving all configuration and historical data. Status changes to stopped.
Starting a Stopped Router
routers.start(routerId)
Re-deploys a stopped router. The router must be in stopped status.
Deleting a Router
routers.delete(routerId)
Performs a soft delete -- sets deleted: true and deletedAt timestamp. If the router is currently running, it is stopped first. Deleted routers are excluded from list queries.
Metrics and Predictions
Router Metrics
routers.getMetrics(routerId, { startDate, endDate })
Returns aggregated metrics calculated from the router_predictions collection:
| Metric | Description |
|---|---|
| totalRequests | Total prediction requests processed |
| successCount | Number of successful predictions |
| errorCount | Number of failed predictions |
| successRate | Ratio of successes to total requests |
| avgLatencyMs | Average total latency across all requests |
| variantMetrics | Per-variant breakdown of requests, success rate, and latency |
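The aggregation these metrics imply can be sketched from prediction log entries. The entry shape (metrics.success, metrics.totalLatencyMs) follows the prediction-history table below; the aggregation function itself is an assumption.

```javascript
// Sketch of deriving router-level metrics from prediction log entries.
function aggregateMetrics(predictions) {
  const totalRequests = predictions.length;
  const successCount = predictions.filter((p) => p.metrics.success).length;
  const totalLatency = predictions.reduce(
    (sum, p) => sum + p.metrics.totalLatencyMs, 0);
  return {
    totalRequests,
    successCount,
    errorCount: totalRequests - successCount,
    successRate: totalRequests ? successCount / totalRequests : 0,
    avgLatencyMs: totalRequests ? totalLatency / totalRequests : 0,
  };
}
```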
Prediction History
routers.getPredictions(routerId, { limit, offset, variantId })
Returns prediction log entries sorted by creation date (newest first). Each entry includes:
| Field | Description |
|---|---|
| predictionId | Unique prediction identifier |
| variantId | Which variant handled the request |
| modelId | Which model served the prediction |
| routingDecision | Strategy used and reason for routing (e.g., "epsilon_greedy: exploit") |
| request | Input features and timestamp |
| response | Prediction result, probabilities, and confidence |
| metrics | Routing time, model latency, total latency, success status |
| entityId | Business entity ID for ground truth matching |
| feedback | Reward feedback if provided (for MAB) |
Prediction logs have a 90-day TTL and are automatically cleaned up.
Feedback for Bandit Optimization
For multi-armed bandit routing to optimize traffic allocation, you need to provide reward feedback:
routers.recordFeedback(predictionId, { reward, label })
| Field | Description |
|---|---|
| reward | Numeric reward value (e.g., 1.0 for success, 0.0 for failure) |
| label | Optional label for the feedback (e.g., "correct", "incorrect") |
When reward feedback is recorded:
- The prediction's feedback record is updated with the reward, label, timestamp, and source
- If the router uses a bandit strategy, the arm statistics for the variant are updated:
  - pulls is incremented
  - totalReward accumulates the reward value
  - avgReward is recalculated as totalReward / pulls
This feedback loop drives the bandit algorithm's learning -- variants that receive higher rewards are selected more frequently.
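The arm-statistics update described above can be sketched directly. The arm fields (pulls, totalReward, avgReward) match this section; updating the arm in place is an illustrative choice.

```javascript
// Sketch of the arm update applied when reward feedback is recorded.
function recordReward(arm, reward) {
  arm.pulls += 1;                          // pulls is incremented
  arm.totalReward += reward;               // totalReward accumulates
  arm.avgReward = arm.totalReward / arm.pulls; // avgReward recalculated
  return arm;
}
```

Over many feedback events, avgReward converges toward each variant's true success rate, which is what the bandit's selection step compares.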
Permissions
Updating Access
routers.updatePermissions(routerId, { isPublic, sharedWith })
| Field | Description |
|---|---|
| isPublic | If true, all authenticated users can view the router |
| sharedWith | Array of user IDs who can view the router |
Only the owner and administrators can modify router permissions.
Access Rules
- Admins see all routers (excluding soft-deleted ones)
- Owners have full access to their own routers
- Shared users can view routers shared with them
- Public routers are visible to all authenticated users
- Only owners and admins can modify, deploy, stop, or delete routers
Collections and Indexes
Routers Collection
| Index | Purpose |
|---|---|
| access.owner | Find routers by owner |
| deployment.status | Filter by deployment status |
| workspaceId | Find routers in a workspace |
| createdAt (desc) | Sort by creation date |
| access.sharedWith | Find shared routers |
| deleted | Exclude soft-deleted routers |
Router Predictions Collection
| Index | Purpose |
|---|---|
| routerId, createdAt (desc) | Query predictions for a router |
| predictionId (unique) | Look up predictions by ID |
| entityId, routerId | Match ground truth to predictions |
| variantId, routerId | Filter predictions by variant |
| createdAt (TTL: 90 days) | Automatic cleanup of old predictions |
Router Experiments Collection
| Index | Purpose |
|---|---|
| routerId | Find experiments for a router |
| status | Filter by experiment status |
| access.owner | Find experiments by owner |
| createdAt (desc) | Sort by creation date |
Best Practices
Designing Tests
- Define success metrics before creating a router -- know what you are measuring
- Start with small traffic percentages (5-10%) to the new variant before increasing
- Run tests long enough to achieve statistically significant sample sizes
- Document hypotheses for each test to track what you learned
Choosing a Strategy
- Use weighted random for standard A/B tests with precise traffic control
- Use feature-based when different user segments should see different models
- Use multi-armed bandit when you want to minimize exposure to underperforming variants and automatically optimize
- Use canary for cautious rollouts of new model versions with automatic rollback thresholds
Monitoring
- Watch for increased error rates or latency spikes after deploying a new variant
- Monitor per-variant metrics to compare performance across all dimensions
- For bandit routing, track arm statistics to verify the algorithm is converging on the best variant
- Set up canary thresholds to automatically detect regressions
After Testing
- Once a test concludes, deploy the winning variant as the sole model
- Clean up stopped routers to keep the list manageable
- Archive test results and learnings for future reference