
Drift Detection

Drift detection monitors changes in your model's input data distributions over time and alerts you when production data differs significantly from training data. When combined with ground truth integration, it also detects prediction drift -- when model accuracy degrades compared to baseline performance.

Why Drift Detection Matters

Machine learning models are trained on historical data, but the real world is constantly changing. Over time, the data your model sees in production may differ from the data it was trained on, causing performance degradation. This is known as drift.

Types of Drift

| Type | Description | How Detected |
| --- | --- | --- |
| Data Drift | The statistical distribution of input features changes over time (e.g., customer demographics shift, seasonal patterns change) | PSI, KS test, Chi-Square, Jensen-Shannon divergence |
| Concept Drift | The relationship between inputs and outputs changes (e.g., customer behavior evolves, economic conditions shift) | Ground truth accuracy comparison over time |
| Prediction Drift | Model accuracy degrades compared to baseline performance | Requires ground truth labels to detect |

Reference Baselines

A reference baseline captures the statistical distributions of your training data features. When you run drift analysis, current production data is compared against this baseline to detect changes.

Creating a Baseline

drift.createBaseline({
  modelId,
  version,
  featureData,
  targetData
})
| Parameter | Required | Description |
| --- | --- | --- |
| modelId | Yes | ID of the model this baseline is for |
| version | Yes | Version string for this baseline (e.g., "v1.0") |
| featureData | Yes | Object mapping feature names to arrays of values |
| targetData | No | Target variable distribution (classification or regression) |

The system automatically:

  1. Detects whether each feature is numerical or categorical based on the first value
  2. For numerical features: calculates mean, standard deviation, min, max, percentiles (5th, 25th, 50th, 75th, 95th), and histogram
  3. For categorical features: calculates category frequencies as proportions
  4. For target data: calculates either classification class frequencies or regression distribution
  5. Deactivates any previously active baseline for the model
  6. Sets the new baseline as active
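The per-feature statistics described in the steps above can be sketched as follows. This is an illustrative reconstruction, not the library's actual code: the 10-bin equal-width histogram and the percentile method are assumptions.

```javascript
// Sketch of baseline statistics for one feature (assumed behavior).
// Type is inferred from the first value, as described above.
function buildFeatureBaseline(values) {
  const isNumeric = typeof values[0] === "number";
  if (!isNumeric) {
    // Categorical: category frequencies as proportions.
    const frequencies = {};
    for (const v of values) frequencies[v] = (frequencies[v] || 0) + 1 / values.length;
    return { type: "categorical", frequencies };
  }
  // Numerical: mean, std, min, max, percentiles, and a histogram.
  const sorted = [...values].sort((a, b) => a - b);
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
  const pct = (p) => sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  const [min, max] = [sorted[0], sorted[sorted.length - 1]];
  // 10 equal-width bins between min and max (bin count is an assumption).
  const histogram = new Array(10).fill(0);
  for (const v of values) {
    const i = Math.min(9, Math.floor(((v - min) / (max - min || 1)) * 10));
    histogram[i] += 1 / values.length;
  }
  return {
    type: "numerical", mean, std: Math.sqrt(variance), min, max,
    percentiles: { p5: pct(0.05), p25: pct(0.25), p50: pct(0.5), p75: pct(0.75), p95: pct(0.95) },
    histogram,
  };
}
```

Note that the histogram stores proportions rather than counts, so bins from baseline and current data can be compared directly during PSI calculation.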

Managing Baselines

| Method | Description |
| --- | --- |
| drift.getActiveBaseline(modelId) | Get the currently active baseline |
| drift.setActiveBaseline(modelId, baselineId) | Switch the active baseline |
| drift.getBaselines(modelId) | List all baselines (sorted by creation date, newest first) |

Only one baseline can be active per model at a time. Setting a new active baseline deactivates all others.

Prediction Logging

All predictions must be logged for drift analysis. Each prediction log captures the input features, model output, and metadata needed for later comparison.

Logging a Prediction

drift.logPrediction({
  modelId,
  entityId,
  features,
  prediction,
  probabilities,
  modelVersion,
  latencyMs,
  requestSource
})
| Field | Required | Description |
| --- | --- | --- |
| modelId | Yes | Model that produced the prediction |
| entityId | No | Business entity ID for ground truth matching (defaults to predictionId if not provided) |
| features | Yes | Input features as key-value pairs |
| prediction | Yes | Model output (any type) |
| probabilities | No | Prediction probabilities array (for classification) |
| modelVersion | No | Version of the model used |
| latencyMs | No | Prediction latency in milliseconds |
| requestSource | No | Source of the prediction request |

Returns a unique predictionId.

Querying Predictions

drift.getPredictions(modelId, { limit, offset, startDate, endDate, hasGroundTruth })

Filter predictions by date range and whether they have ground truth attached. Returns predictions sorted by timestamp (newest first).
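The filtering and sorting semantics can be pictured as an in-memory equivalent. This is a sketch of the assumed behavior, not the server-side query itself; queryPredictions is a hypothetical stand-in.

```javascript
// In-memory equivalent of the getPredictions filters (assumed semantics):
// date-range filter, optional ground-truth filter, newest-first sort,
// then offset/limit pagination.
function queryPredictions(logs, { limit = 100, offset = 0, startDate, endDate, hasGroundTruth } = {}) {
  return logs
    .filter((p) =>
      (!startDate || p.timestamp >= startDate) &&
      (!endDate || p.timestamp <= endDate) &&
      (hasGroundTruth === undefined || (p.actualOutcome !== undefined) === hasGroundTruth))
    .sort((a, b) => b.timestamp - a.timestamp) // newest first
    .slice(offset, offset + limit);
}
```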

Data Retention

Prediction logs have a 90-day TTL: they are automatically deleted 90 days after creation. Historical drift analysis results are stored indefinitely in the drift_results collection.

Ground Truth Integration

Ground truth (actual outcomes) allows you to measure prediction drift and model accuracy over time. Ground truth records are matched to predictions via the entityId field.

Single Record Upload

drift.addGroundTruth({
  modelId,
  entityId,
  actualOutcome,
  outcomeTimestamp
})

The system automatically attempts to match the ground truth with an existing prediction that has the same modelId and entityId and has not yet been matched. When matched:

  • The prediction log is updated with the ground truth data (actualOutcome, outcomeTimestamp, matchedAt)
  • The ground truth record is marked as matched: true with the matchedPredictionId

Bulk Upload

drift.uploadGroundTruth({
  modelId,
  records: [
    { entityId, actualOutcome, outcomeTimestamp },
    ...
  ]
})

Processes records sequentially, attempting to match each with its corresponding prediction. Returns a summary:

| Field | Description |
| --- | --- |
| imported | Number of records successfully imported |
| matched | Number of records matched to predictions |
| errors | Array of error messages for failed records |
| batchId | Unique identifier for this upload batch |

Records missing an entityId are skipped with an error.
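The matching loop might look like the following sketch. The logic is assumed from the description above (first unmatched prediction with the same modelId and entityId wins), and the batchId format is hypothetical.

```javascript
// Sketch of the bulk ground-truth matching loop (assumed logic).
function uploadGroundTruth(predictions, { modelId, records }) {
  const batchId = `batch-${Date.now()}`; // hypothetical batch ID format
  const summary = { imported: 0, matched: 0, errors: [], batchId };
  for (const rec of records) {
    if (!rec.entityId) {
      // Records missing an entityId are skipped with an error.
      summary.errors.push("missing entityId");
      continue;
    }
    summary.imported++;
    // Match the first prediction for this model/entity that has no outcome yet.
    const pred = predictions.find(
      (p) => p.modelId === modelId && p.entityId === rec.entityId && p.actualOutcome === undefined
    );
    if (pred) {
      pred.actualOutcome = rec.actualOutcome;
      pred.outcomeTimestamp = rec.outcomeTimestamp;
      pred.matchedAt = new Date().toISOString();
      summary.matched++;
    }
  }
  return summary;
}
```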

Finding Unmatched Predictions

drift.getUnmatchedPredictions(modelId, limit = 100)

Returns predictions that have not yet been matched with ground truth. Use this to identify which predictions still need actual outcomes uploaded.

Statistical Tests

The drift analysis engine uses multiple statistical methods to detect distribution changes:

PSI (Population Stability Index)

The primary metric for drift detection. Measures the overall shift in distribution between baseline and current data.

| PSI Range | Status | Interpretation |
| --- | --- | --- |
| PSI < 0.1 | ok | No significant drift. Data distribution is stable. |
| 0.1 <= PSI < 0.2 | warning | Moderate drift. Investigate potential causes. |
| PSI >= 0.2 | alert | Significant drift. Action required. Consider retraining. |

For numerical features, PSI is calculated by comparing histogram bin frequencies between the baseline and current distributions (aligned to baseline bin edges).

For categorical features, PSI is calculated by aligning category frequencies between baseline and current distributions. Missing categories are assigned a small frequency (0.001) to avoid division by zero.
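A minimal PSI implementation over aligned bin (or category) frequencies looks like this. As a sketch it applies the 0.001 floor to both sides of every bin, which generalizes the missing-category rule described above.

```javascript
// PSI over aligned frequency arrays: sum of (cur - base) * ln(cur / base).
// Zero frequencies are floored at 0.001 to avoid division by zero (assumption:
// the floor is applied symmetrically to baseline and current).
function psi(baselineFreqs, currentFreqs) {
  let total = 0;
  for (let i = 0; i < baselineFreqs.length; i++) {
    const b = Math.max(baselineFreqs[i], 0.001);
    const c = Math.max(currentFreqs[i], 0.001);
    total += (c - b) * Math.log(c / b);
  }
  return total;
}

// Map a PSI value to the documented status bands.
function psiStatus(v) {
  return v < 0.1 ? "ok" : v < 0.2 ? "warning" : "alert";
}
```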

Kolmogorov-Smirnov Test

Used for numerical features only. Compares the cumulative distributions of baseline and current data samples.

  • p-value < 0.05 indicates statistically significant drift
  • The test is performed on up to 1,000 samples (approximated from the baseline distribution when raw baseline samples are not available)
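The KS D statistic itself can be computed with a two-pointer scan over the sorted samples. This sketch shows only the statistic; the p-value approximation the engine uses is not reproduced here.

```javascript
// Two-sample Kolmogorov-Smirnov D statistic: the maximum absolute gap
// between the two empirical CDFs. Ties are advanced through together.
function ksStatistic(sampleA, sampleB) {
  const a = [...sampleA].sort((x, y) => x - y);
  const b = [...sampleB].sort((x, y) => x - y);
  let i = 0, j = 0, d = 0;
  while (i < a.length && j < b.length) {
    const x = Math.min(a[i], b[j]);
    while (i < a.length && a[i] === x) i++;
    while (j < b.length && b[j] === x) j++;
    d = Math.max(d, Math.abs(i / a.length - j / b.length));
  }
  return d;
}
```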

Chi-Square Test

Used for categorical features only. Compares observed versus expected frequencies across categories.

  • p-value < 0.05 indicates statistically significant drift
  • Expected frequencies are derived from the baseline distribution
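The statistic (without the p-value lookup) reduces to the classic observed-versus-expected sum, with expected counts obtained by scaling the baseline proportions to the current sample size:

```javascript
// Chi-square statistic sketch: baselineProps maps categories to baseline
// proportions; observedCounts maps categories to current counts.
function chiSquareStatistic(baselineProps, observedCounts) {
  const n = Object.values(observedCounts).reduce((s, v) => s + v, 0);
  let stat = 0;
  for (const cat of Object.keys(baselineProps)) {
    const expected = baselineProps[cat] * n; // baseline proportion scaled to n
    const observed = observedCounts[cat] || 0;
    if (expected > 0) stat += (observed - expected) ** 2 / expected;
  }
  return stat;
}
```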

Jensen-Shannon Divergence

Used for categorical features as a symmetric measure of distribution similarity.

  • Ranges from 0 (identical distributions) to 1 (completely different)
  • Provides a complementary view to PSI for categorical data
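A compact implementation over two aligned probability vectors, using log base 2 so the divergence is bounded by [0, 1] as stated above:

```javascript
// Jensen-Shannon divergence: average KL divergence of each distribution
// from their midpoint mixture, in bits (log base 2).
function jensenShannon(p, q) {
  const kl = (a, b) =>
    a.reduce((s, ai, i) => s + (ai > 0 ? ai * Math.log2(ai / b[i]) : 0), 0);
  const m = p.map((pi, i) => (pi + q[i]) / 2);
  return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}
```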

Running Drift Analysis

Triggering Analysis

drift.analyze(modelId, { windowDays, windowStart, windowEnd })
| Parameter | Default | Description |
| --- | --- | --- |
| windowDays | 7 | Number of days of recent predictions to analyze |
| windowStart | Calculated from windowDays | Start of analysis window |
| windowEnd | Current time | End of analysis window |

Requirements:

  • An active baseline must exist for the model
  • At least one prediction must exist in the specified time window

Analysis Process

  1. Retrieves the active baseline for the model
  2. Queries all predictions within the time window
  3. Extracts feature values from predictions and groups by feature name
  4. For each feature in the baseline:
    • Numerical features: Calculates PSI from histogram frequencies and KS test from sample comparison
    • Categorical features: Calculates PSI, Chi-Square, and Jensen-Shannon divergence from category frequencies
  5. If predictions with ground truth exist, calculates current accuracy
  6. Computes overall drift score from all PSI values
  7. Determines overall status (ok, warning, or alert)
  8. Stores the result in the drift_results collection
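Steps 6 and 7 above can be sketched as follows. The aggregation is not specified, so this assumes the overall drift score is the worst per-feature PSI (clamped to the documented 0-1 range) and the overall status is the worst per-feature status; the flat { psi } feature shape is a simplification of the real per-feature result.

```javascript
// Hypothetical aggregation of per-feature PSI values into an overall result.
function summarize(featureDrift) {
  const worstPsi = Math.max(...featureDrift.map((f) => f.psi));
  const driftScore = Math.min(1, worstPsi); // clamp to the documented 0-1 range
  const rank = { ok: 0, warning: 1, alert: 2 };
  const overallStatus = featureDrift
    .map((f) => (f.psi >= 0.2 ? "alert" : f.psi >= 0.1 ? "warning" : "ok"))
    .reduce((worst, s) => (rank[s] > rank[worst] ? s : worst), "ok");
  return { driftScore, overallStatus };
}
```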

Analysis Result

The drift result contains:

| Field | Description |
| --- | --- |
| modelId | Model that was analyzed |
| baselineId | Baseline used for comparison |
| windowStart / windowEnd | Time window analyzed |
| sampleSize | Number of predictions analyzed |
| overallStatus | Aggregate status: ok, warning, or alert |
| driftScore | Aggregate drift score (0-1) |
| featureDrift | Array of per-feature drift results |
| predictionDrift | Accuracy comparison (only if ground truth available) |
| calculatedAt | When the analysis was performed |
| calculationDurationMs | How long the analysis took |

Per-Feature Results

Each feature drift result includes:

| Field | Description |
| --- | --- |
| featureName | Name of the feature |
| driftType | Type of drift detected (data) |
| metrics.psi | PSI value, p-value, and status |
| metrics.ksStatistic | KS test result (numerical features) |
| metrics.chiSquare | Chi-Square test result (categorical features) |
| metrics.jensenShannon | Jensen-Shannon divergence (categorical features) |
| currentDistribution | Current distribution statistics |
| baselineDistribution | Baseline distribution for comparison |
| hasDrift | Boolean flag set when significant drift was detected |

Querying Results

| Method | Description |
| --- | --- |
| drift.getLatestResult(modelId) | Get the most recent analysis result |
| drift.getResults(modelId, { limit, status }) | Get result history with optional status filter |

Alert Configuration

Configure automated monitoring thresholds for each model:

drift.updateAlertConfig(modelId, config)

Default Thresholds

| Threshold | Default | Description |
| --- | --- | --- |
| psiWarning | 0.1 | PSI value that triggers a warning |
| psiAlert | 0.2 | PSI value that triggers an alert |
| ksPValueThreshold | 0.05 | KS test p-value significance threshold |
| accuracyDropWarning | 0.05 | A 5% accuracy drop triggers a warning |
| accuracyDropAlert | 0.1 | A 10% accuracy drop triggers an alert |
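Applying these thresholds might look like the sketch below. The threshold field names mirror the defaults table; the result shape (featurePsi array, baseline/current accuracy fields) is a simplification for illustration.

```javascript
// Hypothetical threshold evaluation combining PSI drift and accuracy drop.
function evaluate(result, config) {
  const {
    psiWarning = 0.1, psiAlert = 0.2,
    accuracyDropWarning = 0.05, accuracyDropAlert = 0.1,
  } = config;
  const worstPsi = Math.max(...result.featurePsi);
  // Accuracy drop is only measurable when ground truth accuracy is available.
  const drop = result.baselineAccuracy !== undefined && result.currentAccuracy !== undefined
    ? result.baselineAccuracy - result.currentAccuracy
    : 0;
  if (worstPsi >= psiAlert || drop >= accuracyDropAlert) return "alert";
  if (worstPsi >= psiWarning || drop >= accuracyDropWarning) return "warning";
  return "ok";
}
```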

Notification Settings

| Field | Default | Description |
| --- | --- | --- |
| email | true | Whether to send email notifications |
| recipients | [] | Array of email addresses to notify |

Scheduled Analysis

| Field | Default | Description |
| --- | --- | --- |
| enabled | false | Whether scheduled analysis is enabled |
| frequency | daily | How often to run: hourly, daily, or weekly |
| timeWindowDays | 7 | Days of data to include in each analysis |

Performance History

Track model accuracy over time with configurable granularity:

drift.getPerformanceHistory(modelId, { windowDays, granularity })
| Parameter | Default | Description |
| --- | --- | --- |
| windowDays | 30 | Number of days of history to return |
| granularity | daily | Time bucket size: hourly, daily, or weekly |

Returns an array of data points, each containing:

| Field | Description |
| --- | --- |
| timestamp | Start of the time bucket |
| accuracy | Accuracy within the bucket (requires ground truth) |
| predictionCount | Number of predictions in the bucket |
| groundTruthCount | Number of predictions with ground truth in the bucket |

Accuracy is calculated as the proportion of predictions where prediction === groundTruth.actualOutcome within each time bucket.
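The bucketing can be sketched as below, assuming epoch-millisecond timestamps and fixed-width buckets; accuracy uses only the predictions in a bucket that have ground truth attached, as described above.

```javascript
// Group predictions into fixed-size time buckets and compute per-bucket
// accuracy (correct matched predictions / matched predictions).
function performanceHistory(predictions, bucketMs) {
  const buckets = new Map();
  for (const p of predictions) {
    const start = Math.floor(p.timestamp / bucketMs) * bucketMs;
    if (!buckets.has(start)) {
      buckets.set(start, { timestamp: start, predictionCount: 0, groundTruthCount: 0, correct: 0 });
    }
    const b = buckets.get(start);
    b.predictionCount++;
    if (p.actualOutcome !== undefined) {
      b.groundTruthCount++;
      if (p.prediction === p.actualOutcome) b.correct++;
    }
  }
  return [...buckets.values()]
    .sort((a, b) => a.timestamp - b.timestamp)
    .map(({ correct, ...rest }) => ({
      ...rest,
      // Accuracy is undefined when a bucket has no ground truth.
      accuracy: rest.groundTruthCount ? correct / rest.groundTruthCount : null,
    }));
}
```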

Collections and Indexes

Prediction Logs Collection

| Index | Purpose |
| --- | --- |
| modelId, timestamp (desc) | Query predictions for a model |
| entityId, modelId | Match ground truth to predictions |
| timestamp (TTL: 90 days) | Automatic cleanup of old predictions |

Ground Truth Collection

Stores actual outcome data uploaded for ground truth matching.

Reference Baselines Collection

| Index | Purpose |
| --- | --- |
| modelId, isActive | Find the active baseline for a model |

Drift Results Collection

| Index | Purpose |
| --- | --- |
| modelId, calculatedAt (desc) | Query results for a model |

Alert Config Collection

| Index | Purpose |
| --- | --- |
| modelId (unique) | One config per model |

Interpreting Common Scenarios

Seasonal Drift

Pattern: Drift appears at regular intervals (monthly, quarterly).
Action: Consider building season-aware models or using different models for different periods. This may be expected behavior rather than a problem.

Sudden Drift Spike

Pattern: PSI jumps dramatically in a short time.
Action: Investigate recent changes: new data sources, pipeline bugs, external events, or upstream system changes.

Gradual Increase

Pattern: Drift slowly increases over weeks or months.
Action: Schedule model retraining as part of regular maintenance. Create a new baseline after retraining.

Single Feature Drift

Pattern: One feature shows high drift while others are stable.
Action: Investigate that specific feature. The cause may be an upstream data issue, a feature engineering bug, or a genuine change in the data source.

High Drift but Good Accuracy

Pattern: PSI is in alert range but accuracy (from ground truth) remains stable.
Action: The distribution has shifted but the model's decision boundaries may still be valid. Monitor closely and document the finding. Consider updating the baseline if this becomes the new normal.

Best Practices

Establish Baselines Early

Create reference baselines immediately after training while the training data is readily available. The training data distribution is your ground truth for comparison.

Monitor Continuously

Do not wait for problems to appear. Schedule regular drift analysis (daily or weekly) to catch issues early. The performance history view helps identify gradual degradation that might not be obvious from a single analysis.

Collect Ground Truth

Where possible, collect actual outcomes to measure real model performance:

  • Data drift detection (via PSI, KS, etc.) tells you the input distribution changed
  • Ground truth matching tells you whether the model's predictions are still accurate
  • Both are valuable, but accuracy measurement is the definitive indicator of model health

Investigate Warnings

When drift is detected at the warning level, investigate before it becomes critical:

  • Is this expected (seasonal change, business event, holiday)?
  • Are specific features driving the drift?
  • Is model performance actually affected (check ground truth accuracy)?

Plan for Retraining

When significant drift is confirmed:

  1. Collect recent data with ground truth labels
  2. Retrain the model on updated data
  3. Create a new reference baseline from the new training data
  4. Use an A/B testing router to safely compare the retrained model against the current production version
  5. Gradually shift traffic to the retrained model once performance is validated
Warning: Prediction logs have a 90-day TTL and are automatically deleted after that period. Ensure that drift analysis results and ground truth data are captured before the prediction logs expire if you need long-term records.