You have a categorical feature with high cardinality — say, 10,000 unique zip codes. How do you encode it? Walk me through the trade-offs of different approaches.
formulate your own answer first, then read on.
tldr
- Low cardinality → one-hot.
- High cardinality + tree models → out-of-fold target encoding (prevents leakage) with smoothing for rare categories.
- High cardinality + neural nets → learned embeddings (nn.Embedding).
- Label encoding works for trees but never for linear or distance-based models, since it imposes an arbitrary ordering on the categories.
- Hashing trick for streaming or very high cardinality.
- Always handle unseen categories at inference: UNKNOWN bucket or fall back to the global mean.
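A minimal sketch of the out-of-fold target encoding mentioned above, using pandas and scikit-learn's KFold. The function name `oof_target_encode`, the smoothing formula, and the toy zip-code data are illustrative choices, not a fixed API:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, cat_col, target_col, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding with additive smoothing.

    Each row's encoding is computed from the *other* folds only,
    so a row's own target never leaks into its feature.
    `smoothing` shrinks rare categories toward the global mean.
    """
    encoded = np.full(len(df), np.nan)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
        # smoothed mean: (n * cat_mean + smoothing * global_mean) / (n + smoothing)
        smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
            stats["count"] + smoothing
        )
        # categories absent from the training folds fall back to the global mean
        encoded[val_idx] = (
            df.iloc[val_idx][cat_col].map(smoothed).fillna(global_mean).to_numpy()
        )
    return encoded

# toy usage: three zip codes, binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "zip": rng.choice(["94103", "10001", "60601"], size=100),
    "y": rng.integers(0, 2, size=100),
})
df["zip_enc"] = oof_target_encode(df, "zip", "y")
```

The `fillna(global_mean)` line doubles as the inference-time policy for unseen categories: anything not in the fitted statistics gets the global mean.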
follow-up
- How would you implement out-of-fold target encoding correctly in a sklearn pipeline without leakage?
- Your model is deployed. A new product category appears in production that wasn't in training. How do you handle it?
- How do entity embeddings (like those used in FastAI's tabular model) compare to standard target encoding on structured data?
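For the streaming / very-high-cardinality case in the tl;dr, the hashing trick needs no fitted vocabulary at all, which is also one answer to the unseen-category follow-up. A stdlib-only sketch (the helper `hashed_bucket` and the bucket count are illustrative):

```python
import hashlib

def hashed_bucket(value: str, n_buckets: int = 1024) -> int:
    """Map a category string to a fixed-size bucket index.

    md5 (rather than Python's built-in hash(), which is salted
    per process) keeps the mapping stable across runs, which
    matters for a model served in production. Hash collisions
    between categories are accepted as noise.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# the same zip always lands in the same bucket, including zips never seen in training
assert hashed_bucket("94103") == hashed_bucket("94103")
```

No lookup table is stored, so a brand-new production category simply hashes into some existing bucket instead of crashing the encoder.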