Explain cross-entropy loss. Why do we use it for classification instead of MSE? Walk me through binary vs. categorical cross-entropy and how softmax fits in.
Formulate your own answer first, then compare:
tldr
Cross-entropy = -log(probability the model assigns to the correct class). Use it for classification because: (1) its gradient doesn't vanish when the model is confidently wrong, unlike MSE applied to a sigmoid output; (2) minimizing it is equivalent to maximum-likelihood estimation under a Bernoulli (binary) or categorical (multiclass) model; (3) paired with softmax, the gradient with respect to the logits is simply (ŷ - y). Softmax's job is to turn raw logits into a valid probability distribution for cross-entropy to score. Use binary cross-entropy (BCE) with sigmoid for binary or multi-label tasks, and categorical CE with softmax for single-label multiclass.
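A minimal numpy sketch of points (1) and (3); the logit values are hypothetical, picked so the saturation effect is visible:

```python
import numpy as np

# (1) Confidently-wrong binary case: true label y = 1, very negative logit.
z = -6.0                          # hypothetical raw logit
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid output ~= 0.0025
y = 1.0

# BCE: L = -[y*log(p) + (1-y)*log(1-p)]; dL/dz simplifies to (p - y).
grad_bce = p - y                  # ~= -0.998: strong corrective signal

# MSE on the sigmoid: L = (p - y)**2; dL/dz = 2*(p - y)*p*(1 - p).
# The extra p*(1-p) factor is ~0 when the sigmoid saturates, so the
# gradient vanishes exactly where the model is most wrong.
grad_mse = 2 * (p - y) * p * (1 - p)   # ~= -0.005: learning stalls

print(f"BCE dL/dz = {grad_bce:.4f}, MSE dL/dz = {grad_mse:.4f}")

# (3) Multiclass: the softmax + CE gradient w.r.t. the logits is (y_hat - y).
logits = np.array([2.0, 0.5, -1.0])
target = np.array([0.0, 1.0, 0.0])        # one-hot true class
y_hat = np.exp(logits - logits.max())     # max-shift for numerical stability
y_hat /= y_hat.sum()
print("grad wrt logits:", y_hat - target)
```

The log in cross-entropy is exactly what cancels the sigmoid's p*(1-p) factor, which is why the BCE gradient stays large at saturation.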
follow-up
- How does label smoothing modify cross-entropy and why does it help generalization? (see the smoothing sketch below)
- What's the difference between cross-entropy and KL divergence? When are they equivalent? (see the decomposition below)
- You have a 1000-class classification problem. How does the softmax denominator scale and what problems can arise? (see the stability sketch below)
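A sketch of one answer to the label-smoothing follow-up: the hard one-hot target is replaced by (1 - ε)·one_hot + ε/K, so the loss never rewards pushing the correct-class probability all the way to 1. The ε and logits below are hypothetical:

```python
import numpy as np

def smoothed_ce(logits, true_class, eps=0.1):
    """Cross-entropy against a label-smoothed target:
    (1 - eps) on the true class plus eps/K spread over all K classes."""
    K = logits.shape[-1]
    log_probs = logits - logits.max()             # stable log-softmax
    log_probs -= np.log(np.exp(log_probs).sum())
    target = np.full(K, eps / K)
    target[true_class] += 1.0 - eps
    return -(target * log_probs).sum()

logits = np.array([5.0, 1.0, -2.0])
print(smoothed_ce(logits, 0, eps=0.0))  # plain CE: can be driven toward 0
print(smoothed_ce(logits, 0, eps=0.1))  # smoothed: bounded away from 0
```

Because the target keeps some mass on the wrong classes, the optimal logit gaps are finite, which discourages overconfident predictions and tends to improve calibration.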
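For the CE-vs-KL follow-up, the standard decomposition (a worked equation, not from the source):

```latex
\[
H(p, q) = -\sum_i p_i \log q_i
        = \underbrace{-\sum_i p_i \log p_i}_{H(p)}
        + \underbrace{\sum_i p_i \log \frac{p_i}{q_i}}_{D_{\mathrm{KL}}(p \,\|\, q)}
\]
```

H(p) does not depend on the model q, so minimizing cross-entropy and minimizing KL give identical gradients; with one-hot labels H(p) = 0, making the two losses numerically equal.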
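For the 1000-class follow-up: the denominator Σ_j e^(z_j) is an O(K) sum over every class per example, and e^z overflows float32 once any logit exceeds roughly 88.7. The usual fix, subtracting the max logit (the log-sum-exp trick), is sketched below with hypothetical logits:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical large logits; with scale 50, the max lands well past
# float32's overflow point for exp (~88.7).
logits = rng.normal(scale=50.0, size=1000).astype(np.float32)

# Naive softmax: np.exp overflows to inf (numpy emits a RuntimeWarning),
# and inf/inf produces NaN.
naive = np.exp(logits) / np.exp(logits).sum()

# Max-shift: softmax is invariant to adding a constant to all logits,
# so subtracting the max is mathematically identical but overflow-free.
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print("naive has NaN:  ", np.isnan(naive).any())      # True
print("stable is finite:", np.isfinite(stable).all())  # True
```

The O(K) denominator is cheap at K = 1000, but at vocabulary scale (hundreds of thousands of classes) it becomes a bottleneck, which is what motivates approximations like sampled or hierarchical softmax.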