mlprep / ML Coding · hard · 25 min

Implement a decision tree classifier from scratch in Python. Use information gain (entropy) to choose splits. Support both categorical and continuous features. Walk me through the recursive structure and how you'd prevent overfitting.

formulate your answer, then —

Good implementation. Now: what's the connection between this single decision tree and a random forest? What does the "random" in random forest actually refer to, and why does it help?

formulate your answer, then —

tldr

A decision tree greedily picks, at each node, the (feature, threshold) split that maximizes information gain: parent entropy minus the weighted average of the children's entropies. Prevent overfitting with max_depth, min_samples_split, min_samples_leaf, or post-pruning. A random forest averages many decorrelated trees, each trained on a bootstrap sample of the rows with a random subset of features considered at each split; averaging reduces variance while leaving bias roughly unchanged.
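As a reference point rather than the canonical answer, here is a compact sketch covering both halves of the card: an entropy-based tree (continuous features only; categorical features could be handled by one-hot encoding or equality splits, omitted for brevity) plus a small forest adding the two sources of randomness the second question targets. Class and parameter names are illustrative, and only a subset of the stopping rules above is implemented.

```python
import numpy as np
from collections import Counter

def entropy(y):
    """Shannon entropy of an array of non-negative integer labels."""
    counts = np.bincount(y)
    probs = counts[counts > 0] / len(y)
    return -np.sum(probs * np.log2(probs))

def information_gain(y, left_mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n, n_left = len(y), left_mask.sum()
    if n_left == 0 or n_left == n:  # degenerate split, no information gained
        return 0.0
    h_children = (n_left / n) * entropy(y[left_mask]) \
               + ((n - n_left) / n) * entropy(y[~left_mask])
    return entropy(y) - h_children

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

class DecisionTree:
    """Classification tree on continuous features with entropy-based splits."""
    def __init__(self, max_depth=10, min_samples_split=2, n_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_features = n_features  # if set, sample this many features per split (used by the forest)

    def fit(self, X, y):
        self.root = self._grow(np.asarray(X, float), np.asarray(y), depth=0)
        return self

    def _grow(self, X, y, depth):
        # Stopping rules: purity, depth cap, and minimum node size all limit overfitting.
        if depth >= self.max_depth or len(y) < self.min_samples_split or len(set(y)) == 1:
            return Node(label=Counter(y).most_common(1)[0][0])
        features = range(X.shape[1])
        if self.n_features is not None:
            features = np.random.choice(X.shape[1], self.n_features, replace=False)
        best_gain, best = 0.0, None
        for f in features:
            for t in np.unique(X[:, f]):
                mask = X[:, f] <= t
                gain = information_gain(y, mask)
                if gain > best_gain:
                    best_gain, best = gain, (f, t, mask)
        if best is None:  # no split reduces entropy; make a majority-vote leaf
            return Node(label=Counter(y).most_common(1)[0][0])
        f, t, mask = best
        return Node(feature=f, threshold=t,
                    left=self._grow(X[mask], y[mask], depth + 1),
                    right=self._grow(X[~mask], y[~mask], depth + 1))

    def predict(self, X):
        return np.array([self._walk(x, self.root) for x in np.asarray(X, float)])

    def _walk(self, x, node):
        if node.label is not None:
            return node.label
        branch = node.left if x[node.feature] <= node.threshold else node.right
        return self._walk(x, branch)

class RandomForest:
    """Bagging of decorrelated trees: bootstrapped rows + random feature subsets per split."""
    def __init__(self, n_trees=25, max_depth=10):
        self.n_trees, self.max_depth = n_trees, max_depth

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        n, d = X.shape
        k = max(1, int(np.sqrt(d)))  # common classification default: ~sqrt(d) features per split
        self.trees = []
        for _ in range(self.n_trees):
            idx = np.random.randint(0, n, n)  # bootstrap sample, drawn with replacement
            self.trees.append(DecisionTree(self.max_depth, n_features=k).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees])
        # majority vote across trees for each sample
        return np.apply_along_axis(lambda col: Counter(col).most_common(1)[0][0], 0, votes)
```

The two sources of randomness are exactly what the second question is probing: bootstrap sampling of rows and feature subsampling at each split decorrelate the trees, and averaging decorrelated predictors is what actually lowers variance.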

follow-up

  • How would you modify this implementation to handle missing values — what does XGBoost do differently from sklearn's approach?
  • Implement feature importance computation from a trained tree — both impurity-based and permutation-based methods (a permutation-based sketch follows this list).
  • Why do gradient-boosted trees (XGBoost, LightGBM) use shallow trees (depth 3-8) while random forests use deep trees? What's the theoretical reason?
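
For the permutation half of the second follow-up, a minimal sketch, assuming `model` exposes a `.predict(X)` method (e.g. the DecisionTree above) and that X, y are held-out data; names and defaults are illustrative. Impurity-based importance would instead sum each feature's information gain over the tree's internal nodes, weighted by the fraction of samples reaching each node.

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=10, seed=None):
    """Mean drop in accuracy when one feature's column is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    baseline = np.mean(model.predict(X) == y)
    importances = np.zeros(X.shape[1])
    for f in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # shuffle one column to break its relationship with the target
            X_perm[:, f] = rng.permutation(X_perm[:, f])
            drops.append(baseline - np.mean(model.predict(X_perm) == y))
        importances[f] = np.mean(drops)
    return importances
```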