mlprep
mlprep/ML Breadthhard15 min

Your model's accuracy dropped 8% over 2 weeks. Walk through your complete debugging process. How do you isolate the cause and decide whether to rollback, hotfix, or retrain?

formulate your answer, then —

tldr

Debug in order: (1) validate metric is real and understand temporal shape; (2) check serving infrastructure — feature pipeline, null rates, code changes; (3) compare feature distributions using PSI; (4) segment error analysis to find the concentrated failure; (5) classify: pipeline failure → rollback/hotfix, drift → retrain. A cliff suggests a code or pipeline regression; a gradual slide suggests drift. Assess blast radius before rollback. Post-incident: close the observability gap that let 8% accumulate undetected.

follow-up

  • How do you compute Population Stability Index and what threshold triggers investigation?
  • Your feature pipeline has 200 features. How do you prioritize which ones to check first?
  • You identify concept drift as the cause. How do you decide how much recent data to retrain on — and how do you validate the retrained model before deploying?