You have 500 features. How do you decide which ones to keep? Walk me through your feature selection process.
formulate your own answer first, then read on
tldr
Three families of feature selection methods:
- Filter: variance threshold, mutual information (MI), correlation. Fast and model-agnostic, but each feature is scored in isolation, so feature interactions are missed.
- Wrapper: recursive feature elimination (RFE). Usually the best subset quality, but expensive because the model is retrained many times.
- Embedded: L1/lasso regularization, tree feature importance. Selection happens as a side effect of training.

For tree importance, prefer permutation importance or SHAP over Gini (impurity-based) importance, which is biased toward high-cardinality features. Suggested workflow: variance filter → correlation filter → MI ranking → tree-based final ranking.
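
The four-stage workflow above can be sketched with scikit-learn on synthetic data. This is a minimal illustration, not a production recipe: the dataset shape, the 0.95 correlation cutoff, and the top-20 MI cut are all assumed values you would tune for your own data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)

# 1. Variance filter: drop constant / near-constant features.
vt = VarianceThreshold(threshold=0.0)
X_v = vt.fit_transform(X)

# 2. Correlation filter: for each highly correlated pair (|r| > 0.95),
#    drop the second feature of the pair.
corr = np.corrcoef(X_v, rowvar=False)
upper = np.triu(np.abs(corr), k=1)
to_drop = {j for i, j in zip(*np.where(upper > 0.95))}
keep = [j for j in range(X_v.shape[1]) if j not in to_drop]
X_c = X_v[:, keep]

# 3. Mutual-information ranking: keep the top 20 features by MI with the target.
mi = mutual_info_classif(X_c, y, random_state=0)
top_k = np.argsort(mi)[::-1][:20]
X_mi = X_c[:, top_k]

# 4. Tree-based final ranking via permutation importance, computed on a
#    held-out split (avoids the Gini bias toward high-cardinality features
#    and the optimism of importances measured on training data).
X_tr, X_te, y_tr, y_te = train_test_split(X_mi, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
print("top 5 surviving features:", ranking[:5])
```

Note that every fit here happens inside one dataset; in a real pipeline each selection step should be fit on training folds only (e.g. inside a `Pipeline` under cross-validation) to avoid the leakage the follow-up question asks about.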
follow-up
- Why is Gini importance from random forests biased toward high-cardinality features, and why doesn't permutation importance have this bias?
- How would you use feature selection in a production pipeline while avoiding data leakage?
- When would you use PCA over feature selection, and vice versa?