A model you deployed three months ago is showing declining performance. Walk me through how you'd set up monitoring to detect this proactively, distinguish between data drift and concept drift, and decide when to trigger retraining.
formulate your own answer first, then compare:
tldr
Monitor three layers: input features (PSI/KS for drift), model predictions (score distribution), and business outcomes (ground-truth labels, lagged). Distinguish data drift (feature distributions change) from concept drift (the label relationship changes) — they have different root causes and fixes. Instrument your prediction logs from day one; retrofitting observability is expensive.
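A minimal sketch of the feature-layer check using NumPy/SciPy. The `psi` helper and the synthetic data are illustrative, not a production implementation; the 0.1 / 0.25 cutoffs in the comments are the common rule-of-thumb PSI thresholds, not hard limits.

```python
import numpy as np
from scipy import stats

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample."""
    # Bin edges come from the baseline (training-time) distribution.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range production values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)  # training-time feature values
stable   = rng.normal(0.0, 1.0, 10_000)  # production window, no drift
shifted  = rng.normal(0.5, 1.0, 10_000)  # production window, mean shift

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 investigate, > 0.25 actionable drift.
print(psi(baseline, stable))
print(psi(baseline, shifted))

# KS two-sample test as a second opinion; with large windows, pair the
# p-value with an effect-size check so trivial shifts don't page anyone.
print(stats.ks_2samp(baseline, shifted).pvalue)
```

In practice you would run this per feature on a rolling window against a frozen training-time reference, and apply the same machinery to the prediction-score distribution (the second monitoring layer) since score drift often shows up before any single input feature crosses a threshold.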
follow-up
- How would you handle monitoring for a model with 48-hour delayed labels, like a churn model?
- What's the difference between retraining from scratch vs. fine-tuning on recent data? When would you choose each?
- How would you set up shadow scoring — running both an old and new model in parallel — without impacting production latency?