ML Breadth · hard · 10 min

You're predicting whether a user will convert within 7 days. Your training pipeline runs at midnight. Which features are safe to use, and which might leak future information?


tldr

Temporal leakage: any feature that includes information from after the prediction timestamp. Safe: static attributes and historical aggregates whose windows close before prediction time. Leaky: post-prediction activity, rolling windows that extend past the prediction point, and batch aggregates that accidentally include future events (for example, a daily aggregate computed at midnight and joined to a training row whose prediction timestamp falls earlier that day). Validate with walk-forward CV and never shuffle temporal data. In production, log the features used at serving time and compare their distributions to the training distributions to catch misalignment.
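A minimal sketch of two of the ideas above, assuming events are plain Python `datetime` values: a historical aggregate whose window closes strictly before the prediction time, and a walk-forward split in which training rows always precede test rows. The function names `count_events_in_window` and `walk_forward_splits`, the 30-day window, and the sample data are all hypothetical, chosen for illustration only.

```python
from datetime import datetime, timedelta


def count_events_in_window(event_times, prediction_time, window_days=30):
    """Count events in [prediction_time - window, prediction_time).

    The upper bound is exclusive, so an event stamped exactly at the
    prediction time (or later) never reaches the feature.
    """
    start = prediction_time - timedelta(days=window_days)
    return sum(1 for t in event_times if start <= t < prediction_time)


def walk_forward_splits(timestamps, n_folds=3):
    """Yield (train_idx, test_idx) pairs in time order, without shuffling.

    Rows are sorted by timestamp and cut into n_folds + 1 equal blocks;
    fold k trains on the first k blocks and tests on block k + 1.
    Leftover rows at the end are ignored in this sketch.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    fold_size = len(order) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = order[: k * fold_size]
        test = order[k * fold_size : (k + 1) * fold_size]
        yield train, test


# Hypothetical user events around a midnight prediction time.
prediction_time = datetime(2024, 1, 10, 0, 0)
events = [
    datetime(2023, 12, 1, 12, 0),   # older than the 30-day window
    datetime(2024, 1, 9, 23, 59),   # inside the window
    datetime(2024, 1, 10, 0, 0),    # at prediction time: excluded
    datetime(2024, 1, 10, 8, 0),    # after prediction time: would leak
]
safe_count = count_events_in_window(events, prediction_time)

# Hypothetical row timestamps, deliberately out of order.
row_times = [datetime(2024, 1, d) for d in (5, 1, 3, 2, 4, 6, 8, 7)]
splits = list(walk_forward_splits(row_times, n_folds=3))
```

Here `safe_count` counts only the single event that falls inside the window and before the prediction time. A naive "all events on or before the end of that day" aggregate would also pick up the two later events, which is the leak described above. Rows with identical timestamps could straddle a fold boundary in this sketch, so real data may need splitting on time cutoffs rather than row positions.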

follow-up

  • What is point-in-time correctness in a feature store and how does it prevent temporal leakage at scale?
  • You discover your model has temporal leakage after it's already deployed and has been making predictions for a month. How do you handle this?
  • How do you design a training dataset where labels are known with a lag — for example, returns that may arrive 30 days after purchase?