mlprep
ML Breadth · medium · 12 min

Explain cross-entropy loss. Why do we use it for classification instead of MSE? Walk me through binary vs. categorical cross-entropy and how softmax fits in.

Formulate your own answer first, then compare —

tldr

Cross-entropy = -log(probability assigned to the correct class). Use it for classification because: (1) the gradient doesn't vanish when the model is confidently wrong — unlike MSE on a sigmoid output; (2) minimizing it is equivalent to maximum likelihood under a categorical distribution; (3) softmax + cross-entropy has a clean gradient with respect to the logits: (ŷ - y). Use binary cross-entropy (BCE) for binary tasks, categorical CE for multiclass.
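A minimal NumPy sketch (the function names and toy values are illustrative, not from the source) checking two of the claims above: the softmax + cross-entropy gradient with respect to the logits equals (ŷ - y), and MSE on a sigmoid output has a tiny gradient when the model is confidently wrong while BCE does not.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, y):
    # categorical CE for a one-hot target y and predicted distribution p
    return -np.sum(y * np.log(p))

# --- Claim: analytic gradient (y_hat - y) matches finite differences ---
z = np.array([2.0, -1.0, 0.5])   # toy logits
y = np.array([0.0, 1.0, 0.0])    # one-hot target
y_hat = softmax(z)
analytic = y_hat - y

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(softmax(zp), y)
                  - cross_entropy(softmax(zm), y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)

# --- Claim: confidently-wrong sigmoid unit, target = 1 ---
z = -8.0                          # sigmoid(z) ≈ 0.0003: confident and wrong
s = 1.0 / (1.0 + np.exp(-z))
grad_mse = (s - 1) * s * (1 - s)  # d/dz of 0.5*(s-1)^2: scaled by s(1-s) → ~0
grad_bce = s - 1                  # d/dz of -log(s): stays near -1
print(grad_mse, grad_bce)
```

The MSE gradient carries an extra s(1 - s) factor from the sigmoid derivative, so it collapses toward zero exactly when the model most needs correcting; the BCE gradient stays near -1.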

follow-up

  • How does label smoothing modify cross-entropy and why does it help generalization?
  • What's the difference between cross-entropy and KL divergence? When are they equivalent?
  • You have a 1000-class classification problem. How does the softmax denominator scale and what problems can arise?
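One hint for the last follow-up: the naive softmax denominator sums exp(z_i) over all classes and can overflow or underflow in floating point. A common fix (sketched here with illustrative values, not tied to any specific library) is the log-sum-exp trick: subtract the max logit before exponentiating, which leaves the result mathematically unchanged.

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                 # overflows for large logits
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - z.max())       # shift by max: same ratios, no overflow
    return e / e.sum()

z = np.array([1000.0, 999.0, 998.0])   # large logits, e.g. unscaled scores
print(naive_softmax(z))    # contains nan: exp(1000) overflows to inf
print(stable_softmax(z))   # valid distribution summing to 1
```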