How do you build data pipelines for ML that are reliable enough to feed production training and serving? What failure modes do you design against?
You mentioned silent failures are worse — what does a silent data quality failure look like in practice, and how do you catch it before it reaches training?
tldr
ML data pipelines fail loudly (jobs crash, alerts fire) or silently (bad data flows through and the model quietly degrades). Silent failures are worse. Make pipelines idempotent (re-running a stage produces the same output), validate schema and statistics at every stage boundary, and fail fast before bad data reaches training. The most dangerous failures are silent drops in join coverage, null explosions for new user segments, and leakage from timestamp bugs — none of which surface in system-level metrics.
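A minimal sketch of what "validate at every stage boundary" can look like in practice, assuming pandas DataFrames. The function names, column names, and thresholds here are illustrative, not from any specific library: the idea is that each stage raises immediately on schema drift, a null-rate spike, or a silent drop in join coverage, instead of letting bad rows flow downstream.

```python
import pandas as pd

def validate_stage(df: pd.DataFrame, expected_schema: dict, max_null_rate: float = 0.02) -> pd.DataFrame:
    """Fail fast if the frame violates schema or statistical expectations."""
    for col, dtype in expected_schema.items():
        # Schema check: every expected column present, with the expected dtype.
        if col not in df.columns:
            raise ValueError(f"missing column: {col}")
        if str(df[col].dtype) != dtype:
            raise ValueError(f"dtype drift on {col}: {df[col].dtype} != {dtype}")
        # Statistical check: null-rate guard catches silent null explosions
        # (e.g. a new user segment whose upstream fields are all empty).
        null_rate = df[col].isna().mean()
        if null_rate > max_null_rate:
            raise ValueError(f"null rate {null_rate:.1%} on {col} exceeds {max_null_rate:.1%}")
    return df

def join_with_coverage_check(left: pd.DataFrame, right: pd.DataFrame,
                             on: str, min_coverage: float = 0.95) -> pd.DataFrame:
    """Left-join and fail if match coverage silently drops below the floor."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    coverage = (merged["_merge"] == "both").mean()
    if coverage < min_coverage:
        raise ValueError(f"join coverage {coverage:.1%} below floor {min_coverage:.1%}")
    return merged.drop(columns="_merge")
```

The join check is the key one for silent failures: a left join never crashes when the right side goes stale, it just fills nulls, so row counts and job status look healthy while the model trains on garbage. Checking the `_merge` indicator turns that silent degradation into a loud, immediate failure.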
follow-up
- How do you handle schema evolution in a data pipeline without breaking downstream models?
- What's the difference between batch and streaming pipelines for ML, and how do you decide which to use?
- How do you backfill a feature that didn't exist when the model was originally trained?