Classical statistics says that a model with more parameters than data points should overfit. GPT-4 has hundreds of billions of parameters, arguably trained on far fewer effective independent examples than that. Why doesn't it overfit?
tldr
Classical bias-variance analysis predicts overfitting past the interpolation threshold, the point where the model can fit the training set exactly. Empirically, double descent shows test error getting worse near that threshold and then improving again in the overparameterized regime. Why:
- SGD has an implicit bias toward flat minima and, in linear settings, minimum-norm solutions; both are forms of implicit regularization.
- High-dimensional loss surfaces have few sharp, isolated local minima; most bad critical points are saddles.
- Architectures encode inductive biases (convolution, attention, weight sharing) that constrain learned functions toward smooth, structured representations.

Scaling laws show loss decreasing predictably with model size, data, and compute. Classical overfitting still occurs with small fine-tuning sets or insufficient data relative to model size.
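A minimal double descent demo with random-feature ridgeless regression (all data, widths, and seeds here are illustrative; the exact peak location varies, but test error typically spikes near width ≈ n_train and falls again past it):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 500

def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def random_relu_features(x, width, seed=1):
    # Fixed random ReLU features, shared between train and test: relu(x @ W + b)
    r = np.random.default_rng(seed)
    W = r.standard_normal((x.shape[1], width))
    b = r.uniform(-1, 1, width)
    return np.maximum(x @ W + b, 0.0)

for width in [5, 10, 20, 40, 80, 200, 1000]:
    Phi_tr = random_relu_features(x_tr, width)
    Phi_te = random_relu_features(x_te, width)
    # lstsq gives the least-squares fit when overdetermined and the
    # minimum-norm interpolant when underdetermined (width > n_train)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    print(f"width={width:5d}  test MSE={np.mean((Phi_te @ w - y_te) ** 2):.3f}")
```

The interesting regime is width > 40: every such model fits the 40 training points exactly, yet the minimum-norm interpolant keeps improving on test data as width grows.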
follow-up
- What is the minimum-norm solution, and why does gradient descent implicitly favor it for overparameterized linear models? (gradient-descent sketch below)
- How do Chinchilla scaling laws change how you'd allocate a compute budget between model size and training tokens? (allocation sketch below)
- You're fine-tuning a 7B model on 500 labeled examples. Do you expect classical overfitting or modern generalization behavior, and what does that imply for your training approach? (fine-tuning sketch below)
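A concrete version of the implicit-bias claim (a numpy sketch on random synthetic data): infinitely many weight vectors interpolate an underdetermined linear system, but gradient descent from zero init converges to the minimum-norm one, because its iterates never leave the row space of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                           # underdetermined: more unknowns than equations
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

lr = 1.0 / np.linalg.norm(A, 2) ** 2     # safe step size: 1 / sigma_max(A)^2
w = np.zeros(d)                          # zero init keeps iterates in the row space of A
for _ in range(10_000):
    w -= lr * A.T @ (A @ w - y)          # gradient descent on 0.5 * ||A w - y||^2

w_pinv = np.linalg.pinv(A) @ y           # closed-form minimum-norm interpolant
print("train residual:", np.linalg.norm(A @ w - y))          # ~0: fits exactly
print("distance to min-norm:", np.linalg.norm(w - w_pinv))   # ~0: same solution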
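For the Chinchilla question, a back-of-envelope allocation using the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens at the compute-optimal point. The exact exponents and constants depend on the fit in Hoffmann et al. (2022); treat "20 tokens per parameter" as a rule of thumb, not a law:

```python
def chinchilla_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    # C ~ 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in [1e21, 1e23, 1e25]:
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```

Sanity check: plugging in Chinchilla's own budget of roughly 5.9e23 FLOPs recovers about 70B parameters and 1.4T tokens, matching the paper's configuration. The practical upshot: under a fixed compute budget, scale parameters and tokens together rather than growing the model alone.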
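For the fine-tuning question: 500 examples against 7B parameters is firmly classical-overfitting territory, so constrain trainable capacity and stop early. A sketch using parameter-efficient fine-tuning with the Hugging Face peft library; the model name, target modules, and hyperparameters are illustrative assumptions, not a recommendation:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; substitute whatever 7B checkpoint you actually use.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora = LoraConfig(
    r=8,                                  # low rank: millions of trainable params, not 7B
    lora_alpha=16,
    lora_dropout=0.1,                     # extra regularization for a tiny dataset
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Then: small learning rate, hold out ~10-20% of the 500 examples, and
# early-stop on validation loss; in this regime validation loss does turn
# back up, unlike in the large-scale pretraining regime.
```

The design choice to defend in an interview: you expect classical behavior here because the data-to-trainable-parameter ratio is tiny, so you shrink that ratio back toward sanity (LoRA, dropout, early stopping) instead of relying on the implicit regularization that saves pretraining-scale runs.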