mlprep
ML Breadth · medium · 15 min

Walk me through your feature engineering process for a new tabular ML problem. What do you look at first, and what transformations do you commonly apply?

tldr

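Feature engineering pipeline: (1) understand data types and distributions; (2) handle missing values: impute and add a missing-value indicator; (3) transform numerics: log-transform skewed features, clip outliers, create ratios; (4) scale for non-tree models using training statistics only; (5) encode categoricals according to cardinality (for example, one-hot for low cardinality, target or frequency encoding for high cardinality); (6) decompose datetimes into components and encode the periodic ones (hour, weekday, month) cyclically with sine and cosine. Fit every transformation on training data only, then apply it to validation and test data. Tree-based models do not need scaling; linear models and neural networks generally do.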
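As a rough illustration of steps (2), (3), (4), and (6), here is a minimal sketch in plain Python, written without libraries so each step is visible. The names `fit_numeric`, `transform_numeric`, `encode_hour`, and the income values are hypothetical, chosen only for this example; in practice one would more likely use tools such as scikit-learn's `SimpleImputer` and `StandardScaler`. The sketch assumes the numeric feature is non-negative, so `log1p` is defined.

```python
import math
from statistics import mean, median, pstdev


def fit_numeric(train_values):
    """Learn imputation and scaling statistics from TRAINING values only."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)  # the median is robust to the outlier below
    # Impute, then log-transform to tame right skew (assumes values >= 0).
    logged = [math.log1p(fill if v is None else v) for v in train_values]
    return {
        "fill": fill,
        "mean": mean(logged),
        "std": pstdev(logged) or 1.0,  # guard against a constant column
    }


def transform_numeric(values, stats):
    """Return (scaled_value, missing_indicator) pairs using stored stats."""
    out = []
    for v in values:
        missing = 1 if v is None else 0
        x = stats["fill"] if v is None else v
        z = (math.log1p(x) - stats["mean"]) / stats["std"]
        out.append((z, missing))
    return out


def encode_hour(hour):
    """Map hour-of-day (0-23) onto the unit circle so 23:00 sits near 00:00."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


# Hypothetical data: a skewed income column with one missing value per split.
train_income = [20_000, 35_000, None, 50_000, 1_000_000]
test_income = [None, 40_000]

stats = fit_numeric(train_income)                        # fit on train only
train_features = transform_numeric(train_income, stats)
test_features = transform_numeric(test_income, stats)    # reuse train stats
```

The point of splitting `fit_numeric` from `transform_numeric` is that the test split never contributes to the median, mean, or standard deviation; it only consumes them.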

follow-up

  • How do you prevent data leakage when using target encoding in cross-validation?
  • Your dataset has a feature with 10% missing values. The missingness correlates with the target. How do you handle it?
  • When would you use automated feature engineering (e.g., Featuretools) vs. manual domain-driven engineering?
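
For the first follow-up, one common approach is out-of-fold target encoding: each row's encoding is computed from the targets of the other folds, so a row never sees its own label. The sketch below is one possible way to do this under stated assumptions, not a reference implementation. The function `oof_target_encode`, the `smoothing` parameter, and the toy data are hypothetical, and folds are assigned by row index modulo `n_folds`, which assumes the rows are already shuffled.

```python
from collections import defaultdict


def oof_target_encode(categories, targets, n_folds=3, smoothing=1.0):
    """Encode each row with target statistics from the OTHER folds only.

    smoothing must be > 0; it shrinks small categories toward the fold prior.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: rows are pre-shuffled, so index % n_folds is a fair split.
        holdout = [i for i in range(n) if i % n_folds == fold]
        fit_rows = [i for i in range(n) if i % n_folds != fold]

        # Prior = mean target over the rows used for fitting in this fold.
        prior = sum(targets[i] for i in fit_rows) / len(fit_rows)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in fit_rows:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1

        for i in holdout:
            c = categories[i]
            # Smoothed mean; a category unseen in fit_rows falls back to prior.
            encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded


# Hypothetical toy data: a categorical feature and a binary target.
cities = ["a", "a", "b", "b", "a", "c"]
y = [1, 0, 1, 1, 1, 0]
encoded = oof_target_encode(cities, y)
```

At prediction time, the test set would be encoded with statistics fit on the full training set, which again keeps test labels out of the feature.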