Explain cross-entropy loss. Why do we use it for classification instead of MSE? Walk me through binary vs. categorical cross-entropy and how softmax fits in.
Formulate your own answer first, then compare:
tldr
Cross-entropy = -log(probability the model assigns to the correct class). Use it for classification because: (1) its gradient doesn't vanish when the model is confidently wrong, unlike MSE applied to a sigmoid output; (2) minimizing it is equivalent to maximum-likelihood estimation under a Bernoulli (binary) or categorical (multiclass) model; (3) paired with softmax, the gradient with respect to the logits is simply (ŷ - y). Softmax's job is to turn raw logits into a valid probability distribution for cross-entropy to score. Use binary cross-entropy (BCE) with sigmoid for binary or multi-label tasks, and categorical CE with softmax for single-label multiclass.
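A minimal numpy sketch of points (1) and (3); the logit values are hypothetical, picked so the saturation effect is visible:

```python
import numpy as np

# (1) Confidently-wrong binary case: true label y = 1, very negative logit.
z = -6.0                          # hypothetical raw logit
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid output ~= 0.0025
y = 1.0

# BCE: L = -[y*log(p) + (1-y)*log(1-p)]; dL/dz simplifies to (p - y).
grad_bce = p - y                  # ~= -0.998: strong corrective signal

# MSE on the sigmoid: L = (p - y)**2; dL/dz = 2*(p - y)*p*(1 - p).
# The extra p*(1-p) factor is ~0 when the sigmoid saturates, so the
# gradient vanishes exactly where the model is most wrong.
grad_mse = 2 * (p - y) * p * (1 - p)   # ~= -0.005: learning stalls

print(f"BCE dL/dz = {grad_bce:.4f}, MSE dL/dz = {grad_mse:.4f}")

# (3) Multiclass: the softmax + CE gradient w.r.t. the logits is (y_hat - y).
logits = np.array([2.0, 0.5, -1.0])
target = np.array([0.0, 1.0, 0.0])        # one-hot true class
y_hat = np.exp(logits - logits.max())     # max-shift for numerical stability
y_hat /= y_hat.sum()
print("grad wrt logits:", y_hat - target)
```

The log in cross-entropy is exactly what cancels the sigmoid's p*(1-p) factor, which is why the BCE gradient stays large at saturation.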
follow-up
- How does label smoothing modify cross-entropy and why does it help generalization? (see the smoothing sketch below)
- What's the difference between cross-entropy and KL divergence? When are they equivalent? (see the decomposition below)
- You have a 1000-class classification problem. How does the softmax denominator scale and what problems can arise? (see the stability sketch below)
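A sketch of one answer to the label-smoothing follow-up: the hard one-hot target is replaced by (1 - ε)·one_hot + ε/K, so the loss never rewards pushing the correct-class probability all the way to 1. The ε and logits below are hypothetical:

```python
import numpy as np

def smoothed_ce(logits, true_class, eps=0.1):
    """Cross-entropy against a label-smoothed target:
    (1 - eps) on the true class plus eps/K spread over all K classes."""
    K = logits.shape[-1]
    log_probs = logits - logits.max()             # stable log-softmax
    log_probs -= np.log(np.exp(log_probs).sum())
    target = np.full(K, eps / K)
    target[true_class] += 1.0 - eps
    return -(target * log_probs).sum()

logits = np.array([5.0, 1.0, -2.0])
print(smoothed_ce(logits, 0, eps=0.0))  # plain CE: can be driven toward 0
print(smoothed_ce(logits, 0, eps=0.1))  # smoothed: bounded away from 0
```

Because the target keeps some mass on the wrong classes, the optimal logit gaps are finite, which discourages overconfident predictions and tends to improve calibration.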
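For the CE-vs-KL follow-up, the standard decomposition (a worked equation, not from the source):

```latex
\[
H(p, q) = -\sum_i p_i \log q_i
        = \underbrace{-\sum_i p_i \log p_i}_{H(p)}
        + \underbrace{\sum_i p_i \log \frac{p_i}{q_i}}_{D_{\mathrm{KL}}(p \,\|\, q)}
\]
```

H(p) does not depend on the model q, so minimizing cross-entropy and minimizing KL give identical gradients; with one-hot labels H(p) = 0, making the two losses numerically equal.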
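For the 1000-class follow-up: the denominator Σ_j e^(z_j) is an O(K) sum over every class per example, and e^z overflows float32 once any logit exceeds roughly 88.7. The usual fix, subtracting the max logit (the log-sum-exp trick), is sketched below with hypothetical logits:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical large logits; with scale 50, the max lands well past
# float32's overflow point for exp (~88.7).
logits = rng.normal(scale=50.0, size=1000).astype(np.float32)

# Naive softmax: np.exp overflows to inf (numpy emits a RuntimeWarning),
# and inf/inf produces NaN.
naive = np.exp(logits) / np.exp(logits).sum()

# Max-shift: softmax is invariant to adding a constant to all logits,
# so subtracting the max is mathematically identical but overflow-free.
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print("naive has NaN:  ", np.isnan(naive).any())      # True
print("stable is finite:", np.isfinite(stable).all())  # True
```

The O(K) denominator is cheap at K = 1000, but at vocabulary scale (hundreds of thousands of classes) it becomes a bottleneck, which is what motivates approximations like sampled or hierarchical softmax.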