# Drift Detection
Drift detection monitors changes in your model's input data distributions over time and alerts you when production data differs significantly from training data. When combined with ground truth integration, it also detects prediction drift -- when model accuracy degrades compared to baseline performance.
## Why Drift Detection Matters
Machine learning models are trained on historical data, but the real world is constantly changing. Over time, the data your model sees in production may differ from the data it was trained on, causing performance degradation. This is known as drift.
## Types of Drift
| Type | Description | How Detected |
|---|---|---|
| Data Drift | The statistical distribution of input features changes over time (e.g., customer demographics shift, seasonal patterns change) | PSI, KS test, Chi-Square, Jensen-Shannon divergence |
| Concept Drift | The relationship between inputs and outputs changes (e.g., customer behavior evolves, economic conditions shift) | Ground truth accuracy comparison over time |
| Prediction Drift | Model accuracy degrades compared to baseline performance | Requires ground truth labels to detect |
## Reference Baselines
A reference baseline captures the statistical distributions of your training data features. When you run drift analysis, current production data is compared against this baseline to detect changes.
### Creating a Baseline

```javascript
drift.createBaseline({
  modelId,
  version,
  featureData,
  targetData
})
```
| Parameter | Required | Description |
|---|---|---|
| `modelId` | Yes | ID of the model this baseline is for |
| `version` | Yes | Version string for this baseline (e.g., "v1.0") |
| `featureData` | Yes | Object mapping feature names to arrays of values |
| `targetData` | No | Target variable distribution (classification or regression) |
The system automatically:
- Detects whether each feature is numerical or categorical based on the first value
- For numerical features: calculates mean, standard deviation, min, max, percentiles (5th, 25th, 50th, 75th, 95th), and histogram
- For categorical features: calculates category frequencies as proportions
- For target data: calculates either classification class frequencies or regression distribution
- Deactivates any previously active baseline for the model
- Sets the new baseline as active
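The profiling steps above can be sketched as follows. This is illustrative only: `buildNumericProfile` and `buildCategoricalProfile` are hypothetical names, and the histogram step is omitted.

```javascript
// Sketch of baseline profiling (illustrative, not the actual implementation).
function buildNumericProfile(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, v) => s + v, 0) / n;
  const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  // Nearest-rank percentile over the sorted sample
  const pct = (p) => sorted[Math.min(n - 1, Math.floor((p / 100) * n))];
  return {
    mean,
    std: Math.sqrt(variance),
    min: sorted[0],
    max: sorted[n - 1],
    percentiles: { p5: pct(5), p25: pct(25), p50: pct(50), p75: pct(75), p95: pct(95) },
  };
}

function buildCategoricalProfile(values) {
  const counts = {};
  for (const v of values) counts[v] = (counts[v] || 0) + 1;
  const freqs = {};
  for (const [k, c] of Object.entries(counts)) freqs[k] = c / values.length;
  return freqs; // category frequencies as proportions
}

const age = buildNumericProfile([20, 30, 40, 50, 60]);
const plan = buildCategoricalProfile(["basic", "basic", "pro", "pro"]);
```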
### Managing Baselines

| Method | Description |
|---|---|
| `drift.getActiveBaseline(modelId)` | Get the currently active baseline |
| `drift.setActiveBaseline(modelId, baselineId)` | Switch the active baseline |
| `drift.getBaselines(modelId)` | List all baselines (sorted by creation date, newest first) |
Only one baseline can be active per model at a time. Setting a new active baseline deactivates all others.
## Prediction Logging
All predictions must be logged for drift analysis. Each prediction log captures the input features, model output, and metadata needed for later comparison.
### Logging a Prediction

```javascript
drift.logPrediction({
  modelId,
  entityId,
  features,
  prediction,
  probabilities,
  modelVersion,
  latencyMs,
  requestSource
})
```
| Field | Required | Description |
|---|---|---|
| `modelId` | Yes | Model that produced the prediction |
| `entityId` | No | Business entity ID for ground truth matching (defaults to the `predictionId` if not provided) |
| `features` | Yes | Input features as key-value pairs |
| `prediction` | Yes | Model output (any type) |
| `probabilities` | No | Prediction probabilities array (for classification) |
| `modelVersion` | No | Version of the model used |
| `latencyMs` | No | Prediction latency in milliseconds |
| `requestSource` | No | Source of the prediction request |

Returns a unique `predictionId`.
### Querying Predictions

```javascript
drift.getPredictions(modelId, { limit, offset, startDate, endDate, hasGroundTruth })
```
Filter predictions by date range and whether they have ground truth attached. Returns predictions sorted by timestamp (newest first).
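The filter semantics can be sketched against an in-memory array. `queryPredictions` is a hypothetical helper that mirrors the documented behavior, not the actual implementation:

```javascript
// Illustrative sketch of the getPredictions filter semantics.
function queryPredictions(logs, { limit = 100, offset = 0, startDate, endDate, hasGroundTruth } = {}) {
  return logs
    .filter((p) =>
      (startDate === undefined || p.timestamp >= startDate) &&
      (endDate === undefined || p.timestamp <= endDate) &&
      (hasGroundTruth === undefined || (p.actualOutcome !== undefined) === hasGroundTruth))
    .sort((a, b) => b.timestamp - a.timestamp) // newest first
    .slice(offset, offset + limit);
}

const logs = [
  { id: "a", timestamp: 1, actualOutcome: "churn" },
  { id: "b", timestamp: 2 },
  { id: "c", timestamp: 3 },
];
const unmatched = queryPredictions(logs, { hasGroundTruth: false });
```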
### Data Retention

Prediction logs have a 90-day TTL -- they are automatically deleted after 90 days. Historical drift analysis results are stored indefinitely in the `drift_results` collection.
## Ground Truth Integration
Ground truth (actual outcomes) allows you to measure prediction drift and model accuracy over time. Ground truth records are matched to predictions via the entityId field.
### Single Record Upload

```javascript
drift.addGroundTruth({
  modelId,
  entityId,
  actualOutcome,
  outcomeTimestamp
})
```
The system automatically attempts to match the ground truth record with an existing prediction that has the same `modelId` and `entityId` and has not yet been matched. When a match is found:
- The prediction log is updated with the ground truth data (`actualOutcome`, `outcomeTimestamp`, `matchedAt`)
- The ground truth record is marked as `matched: true` with the `matchedPredictionId`
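The matching rule can be sketched in memory. `matchGroundTruth` is a hypothetical helper illustrating the rule, not the documented API:

```javascript
// Sketch of the ground truth matching rule: same modelId + entityId, not yet matched.
function matchGroundTruth(predictions, gt) {
  const match = predictions.find(
    (p) => p.modelId === gt.modelId && p.entityId === gt.entityId && p.matchedAt === undefined
  );
  if (!match) return { ...gt, matched: false };
  // Update the prediction log with the ground truth data
  match.actualOutcome = gt.actualOutcome;
  match.outcomeTimestamp = gt.outcomeTimestamp;
  match.matchedAt = Date.now();
  // Mark the ground truth record as matched
  return { ...gt, matched: true, matchedPredictionId: match.predictionId };
}

const preds = [{ predictionId: "p1", modelId: "m1", entityId: "cust-42", prediction: "churn" }];
const result = matchGroundTruth(preds, {
  modelId: "m1", entityId: "cust-42", actualOutcome: "churn", outcomeTimestamp: 1700000000,
});
```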
### Bulk Upload

```javascript
drift.uploadGroundTruth({
  modelId,
  records: [
    { entityId, actualOutcome, outcomeTimestamp },
    ...
  ]
})
```
Processes records sequentially, attempting to match each with its corresponding prediction. Returns a summary:
| Field | Description |
|---|---|
| `imported` | Number of records successfully imported |
| `matched` | Number of records matched to predictions |
| `errors` | Array of error messages for failed records |
| `batchId` | Unique identifier for this upload batch |

Records missing an `entityId` are skipped with an error.
### Finding Unmatched Predictions

```javascript
drift.getUnmatchedPredictions(modelId, limit = 100)
```
Returns predictions that have not yet been matched with ground truth. Use this to identify which predictions still need actual outcomes uploaded.
## Statistical Tests
The drift analysis engine uses multiple statistical methods to detect distribution changes:
### PSI (Population Stability Index)
The primary metric for drift detection. Measures the overall shift in distribution between baseline and current data.
| PSI Range | Status | Interpretation |
|---|---|---|
| PSI < 0.1 | ok | No significant drift. Data distribution is stable. |
| 0.1 <= PSI < 0.2 | warning | Moderate drift. Investigate potential causes. |
| PSI >= 0.2 | alert | Significant drift. Action required. Consider retraining. |
For numerical features, PSI is calculated by comparing histogram bin frequencies between the baseline and current distributions (aligned to baseline bin edges).
For categorical features, PSI is calculated by aligning category frequencies between baseline and current distributions. Missing categories are assigned a small frequency (0.001) to avoid division by zero.
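The categorical PSI calculation, including the 0.001 floor for missing categories, can be sketched as:

```javascript
// PSI over category frequencies: sum of (actual - expected) * ln(actual / expected).
function categoricalPsi(baselineFreq, currentFreq) {
  const categories = new Set([...Object.keys(baselineFreq), ...Object.keys(currentFreq)]);
  let psi = 0;
  for (const c of categories) {
    const expected = baselineFreq[c] || 0.001; // floor avoids division by zero
    const actual = currentFreq[c] || 0.001;
    psi += (actual - expected) * Math.log(actual / expected);
  }
  return psi;
}

const stable = categoricalPsi({ a: 0.5, b: 0.5 }, { a: 0.5, b: 0.5 });  // no drift
const shifted = categoricalPsi({ a: 0.5, b: 0.5 }, { a: 0.8, b: 0.2 }); // lands in the alert range
```

A stable distribution yields PSI 0, while the shifted example above exceeds the 0.2 alert threshold.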
### Kolmogorov-Smirnov Test
Used for numerical features only. Compares the cumulative distributions of baseline and current data samples.
- p-value < 0.05 indicates statistically significant drift
- The test is performed on up to 1,000 samples (approximated from the baseline distribution when raw baseline samples are not available)
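A minimal two-sample KS statistic (the maximum gap between the two empirical CDFs) might look like this; the p-value computation is omitted:

```javascript
// KS statistic: largest absolute difference between the two empirical CDFs.
function ksStatistic(sampleA, sampleB) {
  const a = [...sampleA].sort((x, y) => x - y);
  const b = [...sampleB].sort((x, y) => x - y);
  let i = 0, j = 0, d = 0;
  while (i < a.length && j < b.length) {
    const x = a[i], y = b[j];
    // Advance past ties so both CDFs step together at shared values
    if (x <= y) while (i < a.length && a[i] === x) i++;
    if (y <= x) while (j < b.length && b[j] === y) j++;
    d = Math.max(d, Math.abs(i / a.length - j / b.length));
  }
  return d;
}

const same = ksStatistic([1, 2, 3, 4], [1, 2, 3, 4]);      // identical samples
const far = ksStatistic([1, 2, 3, 4], [11, 12, 13, 14]);   // non-overlapping samples
```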
### Chi-Square Test
Used for categorical features only. Compares observed versus expected frequencies across categories.
- p-value < 0.05 indicates statistically significant drift
- Expected frequencies are derived from the baseline distribution
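The chi-square statistic itself can be sketched as follows; converting the statistic to a p-value requires the chi-square CDF, which is omitted here:

```javascript
// Chi-square statistic: sum of (observed - expected)^2 / expected over categories,
// with expected counts derived from the baseline proportions.
function chiSquareStatistic(observedCounts, baselineFreq) {
  const total = Object.values(observedCounts).reduce((s, c) => s + c, 0);
  let stat = 0;
  for (const [cat, freq] of Object.entries(baselineFreq)) {
    const expected = freq * total;
    const observed = observedCounts[cat] || 0;
    stat += (observed - expected) ** 2 / expected;
  }
  return stat;
}

// Baseline says 50/50; current window observed 80/20 out of 100 predictions.
const stat = chiSquareStatistic({ a: 80, b: 20 }, { a: 0.5, b: 0.5 });
```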
### Jensen-Shannon Divergence
Used for categorical features as a symmetric measure of distribution similarity.
- Ranges from 0 (identical distributions) to 1 (completely different)
- Provides a complementary view to PSI for categorical data
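A sketch of Jensen-Shannon divergence over category frequencies, using log base 2 so the result stays in [0, 1]:

```javascript
// Jensen-Shannon divergence: average KL divergence of each distribution to their midpoint.
function jensenShannon(p, q) {
  const cats = new Set([...Object.keys(p), ...Object.keys(q)]);
  const kl = (a, b) => { // KL divergence in bits
    let d = 0;
    for (const c of cats) {
      const pa = a[c] || 0;
      if (pa > 0) d += pa * Math.log2(pa / (b[c] || 1e-12));
    }
    return d;
  };
  const m = {};
  for (const c of cats) m[c] = ((p[c] || 0) + (q[c] || 0)) / 2; // midpoint distribution
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}

const identical = jensenShannon({ a: 1 }, { a: 1 });  // identical distributions
const disjoint = jensenShannon({ a: 1 }, { b: 1 });   // completely different
```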
## Running Drift Analysis

### Triggering Analysis

```javascript
drift.analyze(modelId, { windowDays, windowStart, windowEnd })
```
| Parameter | Default | Description |
|---|---|---|
| `windowDays` | 7 | Number of days of recent predictions to analyze |
| `windowStart` | Calculated from `windowDays` | Start of analysis window |
| `windowEnd` | Current time | End of analysis window |
Requirements:
- An active baseline must exist for the model
- At least one prediction must exist in the specified time window
### Analysis Process

- Retrieves the active baseline for the model
- Queries all predictions within the time window
- Extracts feature values from predictions and groups them by feature name
- For each feature in the baseline:
  - Numerical features: calculates PSI from histogram frequencies and the KS statistic from sample comparison
  - Categorical features: calculates PSI, Chi-Square, and Jensen-Shannon divergence from category frequencies
- If predictions with ground truth exist, calculates current accuracy
- Computes the overall drift score from all PSI values
- Determines the overall status (ok, warning, or alert)
- Stores the result in the `drift_results` collection
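One way the overall status could fall out of the per-feature PSI values, using the thresholds from the PSI table. The exact aggregation behind `driftScore` is not specified here, so taking the worst feature is an assumption:

```javascript
// Assumed aggregation: the overall status is driven by the worst per-feature PSI.
function overallStatus(psiValues) {
  const worst = Math.max(...psiValues);
  if (worst >= 0.2) return "alert";    // significant drift
  if (worst >= 0.1) return "warning";  // moderate drift
  return "ok";                         // stable
}

const status = overallStatus([0.02, 0.15, 0.04]); // one feature in the warning range
```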
### Analysis Result

The drift result contains:

| Field | Description |
|---|---|
| `modelId` | Model that was analyzed |
| `baselineId` | Baseline used for comparison |
| `windowStart` / `windowEnd` | Time window analyzed |
| `sampleSize` | Number of predictions analyzed |
| `overallStatus` | Aggregate status: ok, warning, or alert |
| `driftScore` | Aggregate drift score (0-1) |
| `featureDrift` | Array of per-feature drift results |
| `predictionDrift` | Accuracy comparison (only if ground truth is available) |
| `calculatedAt` | When the analysis was performed |
| `calculationDurationMs` | How long the analysis took |
### Per-Feature Results

Each feature drift result includes:

| Field | Description |
|---|---|
| `featureName` | Name of the feature |
| `driftType` | Type of drift detected (data) |
| `metrics.psi` | PSI value, p-value, and status |
| `metrics.ksStatistic` | KS test result (numerical features) |
| `metrics.chiSquare` | Chi-Square test result (categorical features) |
| `metrics.jensenShannon` | Jensen-Shannon divergence (categorical features) |
| `currentDistribution` | Current distribution statistics |
| `baselineDistribution` | Baseline distribution for comparison |
| `hasDrift` | Boolean flag set if significant drift was detected |
### Querying Results

| Method | Description |
|---|---|
| `drift.getLatestResult(modelId)` | Get the most recent analysis result |
| `drift.getResults(modelId, { limit, status })` | Get result history with an optional status filter |
## Alert Configuration

Configure automated monitoring thresholds for each model:

```javascript
drift.updateAlertConfig(modelId, config)
```
### Default Thresholds

| Threshold | Default | Description |
|---|---|---|
| `psiWarning` | 0.1 | PSI value that triggers a warning |
| `psiAlert` | 0.2 | PSI value that triggers an alert |
| `ksPValueThreshold` | 0.05 | KS test p-value significance threshold |
| `accuracyDropWarning` | 0.05 | A 5% accuracy drop triggers a warning |
| `accuracyDropAlert` | 0.1 | A 10% accuracy drop triggers an alert |
### Notification Settings

| Field | Default | Description |
|---|---|---|
| `email` | true | Whether to send email notifications |
| `recipients` | [] | Array of email addresses to notify |
### Scheduled Analysis

| Field | Default | Description |
|---|---|---|
| `enabled` | false | Whether scheduled analysis is enabled |
| `frequency` | daily | How often to run: hourly, daily, or weekly |
| `timeWindowDays` | 7 | Days of data to include in each analysis |
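Putting the tables together, a call to `drift.updateAlertConfig` might look like the sketch below. The field names come from the tables above, but the nesting of the payload (`thresholds`, `notifications`, `schedule`) and the model ID are assumptions for illustration:

```javascript
// Hypothetical config payload; the exact shape is an assumption.
drift.updateAlertConfig("churn-model", {
  thresholds: {
    psiWarning: 0.1,
    psiAlert: 0.2,
    ksPValueThreshold: 0.05,
    accuracyDropWarning: 0.05,
    accuracyDropAlert: 0.1,
  },
  notifications: { email: true, recipients: ["ml-team@example.com"] },
  schedule: { enabled: true, frequency: "daily", timeWindowDays: 7 },
});
```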
## Performance History

Track model accuracy over time with configurable granularity:

```javascript
drift.getPerformanceHistory(modelId, { windowDays, granularity })
```

| Parameter | Default | Description |
|---|---|---|
| `windowDays` | 30 | Number of days of history to return |
| `granularity` | daily | Time bucket size: hourly, daily, or weekly |
Returns an array of data points, each containing:
| Field | Description |
|---|---|
| `timestamp` | Start of the time bucket |
| `accuracy` | Accuracy within the bucket (requires ground truth) |
| `predictionCount` | Number of predictions in the bucket |
| `groundTruthCount` | Number of predictions with ground truth in the bucket |

Accuracy is calculated as the proportion of predictions where `prediction === groundTruth.actualOutcome` within each time bucket.
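The daily bucketing can be sketched as follows; `dailyPerformance` is a hypothetical helper, not the actual implementation:

```javascript
// Sketch of daily bucketing for performance history.
const DAY_MS = 24 * 60 * 60 * 1000;

function dailyPerformance(predictions) {
  const buckets = new Map();
  for (const p of predictions) {
    const key = Math.floor(p.timestamp / DAY_MS) * DAY_MS; // start of the day bucket
    const b = buckets.get(key) ||
      { timestamp: key, predictionCount: 0, groundTruthCount: 0, correct: 0 };
    b.predictionCount++;
    if (p.actualOutcome !== undefined) {
      b.groundTruthCount++;
      if (p.prediction === p.actualOutcome) b.correct++;
    }
    buckets.set(key, b);
  }
  // Accuracy is only defined for buckets that have ground truth
  return [...buckets.values()].map((b) => ({
    ...b,
    accuracy: b.groundTruthCount > 0 ? b.correct / b.groundTruthCount : null,
  }));
}

const history = dailyPerformance([
  { timestamp: 0, prediction: "churn", actualOutcome: "churn" },
  { timestamp: 1000, prediction: "churn", actualOutcome: "stay" },
  { timestamp: DAY_MS + 1, prediction: "stay" }, // no ground truth yet
]);
```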
## Collections and Indexes

### Prediction Logs Collection

| Index | Purpose |
|---|---|
| `modelId`, `timestamp` (desc) | Query predictions for a model |
| `entityId`, `modelId` | Match ground truth to predictions |
| `timestamp` (TTL: 90 days) | Automatic cleanup of old predictions |
### Ground Truth Collection
Stores actual outcome data uploaded for ground truth matching.
### Reference Baselines Collection

| Index | Purpose |
|---|---|
| `modelId`, `isActive` | Find the active baseline for a model |
### Drift Results Collection

| Index | Purpose |
|---|---|
| `modelId`, `calculatedAt` (desc) | Query results for a model |
### Alert Config Collection

| Index | Purpose |
|---|---|
| `modelId` (unique) | One config per model |
## Interpreting Common Scenarios

### Seasonal Drift

**Pattern:** Drift appears at regular intervals (monthly, quarterly).

**Action:** Consider building season-aware models or using different models for different periods. This may be expected behavior rather than a problem.

### Sudden Drift Spike

**Pattern:** PSI jumps dramatically in a short time.

**Action:** Investigate recent changes: new data sources, pipeline bugs, external events, or upstream system changes.

### Gradual Increase

**Pattern:** Drift slowly increases over weeks or months.

**Action:** Schedule model retraining as part of regular maintenance. Create a new baseline after retraining.

### Single Feature Drift

**Pattern:** One feature shows high drift while others are stable.

**Action:** Investigate that specific feature. The cause may be an upstream data issue, a feature engineering bug, or a genuine change in the data source.

### High Drift but Good Accuracy

**Pattern:** PSI is in the alert range but accuracy (from ground truth) remains stable.

**Action:** The distribution has shifted, but the model's decision boundaries may still be valid. Monitor closely and document the finding. Consider updating the baseline if this becomes the new normal.
## Best Practices
### Establish Baselines Early
Create reference baselines immediately after training while the training data is readily available. The training data distribution is your ground truth for comparison.
### Monitor Continuously
Do not wait for problems to appear. Schedule regular drift analysis (daily or weekly) to catch issues early. The performance history view helps identify gradual degradation that might not be obvious from a single analysis.
### Collect Ground Truth
Where possible, collect actual outcomes to measure real model performance:
- Data drift detection (via PSI, KS, etc.) tells you the input distribution changed
- Ground truth matching tells you whether the model's predictions are still accurate
- Both are valuable, but accuracy measurement is the definitive indicator of model health
### Investigate Warnings
When drift is detected at the warning level, investigate before it becomes critical:
- Is this expected (seasonal change, business event, holiday)?
- Are specific features driving the drift?
- Is model performance actually affected (check ground truth accuracy)?
### Plan for Retraining
When significant drift is confirmed:
- Collect recent data with ground truth labels
- Retrain the model on updated data
- Create a new reference baseline from the new training data
- Use an A/B testing router to safely compare the retrained model against the current production version
- Gradually shift traffic to the retrained model once performance is validated
Prediction logs have a 90-day TTL and are automatically deleted after that period. Ensure that drift analysis results and ground truth data are captured before the prediction logs expire if you need long-term records.