You're predicting if a user will convert within 7 days. Your training pipeline runs at midnight. Which features are safe to use and which might leak future information?
formulate your answer before reading on.
tldr
Temporal leakage: any feature that incorporates information from after the prediction timestamp. Safe: static attributes (e.g. signup channel, country) and historical aggregates whose windows close before the prediction time. Leaky: post-prediction activity, rolling windows that extend past the prediction point, and batch aggregates computed at midnight that silently include events after each user's individual prediction timestamp. Validate with walk-forward CV; never shuffle temporally ordered data. In production, log the features actually served and compare them against training distributions to catch training/serving misalignment.
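A minimal sketch of the safe-vs-leaky distinction using pandas. All table and column names here are hypothetical; the point is that a per-user feature must be filtered to events strictly before that user's prediction timestamp, not aggregated over the whole event log.

```python
import pandas as pd

# Hypothetical event log: one row per user action.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime([
        "2024-01-01", "2024-01-05", "2024-01-20",  # user 1
        "2024-01-03", "2024-01-15",                # user 2
    ]),
})

# Prediction rows: features may only use events before predict_time.
predictions = pd.DataFrame({
    "user_id": [1, 2],
    "predict_time": pd.to_datetime(["2024-01-10", "2024-01-10"]),
})

# LEAKY: aggregate over ALL events, including ones after predict_time.
leaky = events.groupby("user_id").size().rename("event_count_leaky")

# SAFE: join events to prediction rows, then keep only events that
# happened strictly before each row's prediction timestamp.
joined = predictions.merge(events, on="user_id")
safe = (
    joined[joined["event_time"] < joined["predict_time"]]
    .groupby("user_id").size().rename("event_count_safe")
)

out = predictions.set_index("user_id").join([leaky, safe])
print(out)
# User 1's leaky count includes the Jan 20 event, which happens
# after the Jan 10 prediction time; the safe count does not.
```

At feature-store scale this per-row filtering is what a point-in-time-correct join automates (e.g. `pd.merge_asof` for the nearest-preceding-value case).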
follow-up
- What is point-in-time correctness in a feature store and how does it prevent temporal leakage at scale?
- You discover your model has temporal leakage after it's already deployed and has been making predictions for a month. How do you handle this?
- How do you design a training dataset where labels are known with a lag — for example, returns that may arrive 30 days after purchase?