Classical statistics says that a model with more parameters than data points should overfit. GPT-4 has hundreds of billions of parameters, arguably trained on far fewer effective independent examples than that. Why doesn't it overfit?
tldr
Classical bias-variance analysis predicts overfitting past the interpolation threshold, the point where the model can fit the training set exactly. Empirically, double descent shows test error getting worse near that threshold and then improving again in the overparameterized regime. Why:
- SGD has an implicit bias toward flat minima and, in linear settings, minimum-norm solutions; both are forms of implicit regularization.
- High-dimensional loss surfaces have few sharp, isolated local minima; most bad critical points are saddles.
- Architectures encode inductive biases (convolution, attention, weight sharing) that constrain learned functions toward smooth, structured representations.

Scaling laws show loss decreasing predictably with model size, data, and compute. Classical overfitting still occurs with small fine-tuning sets or insufficient data relative to model size.
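A minimal double descent demo with random-feature ridgeless regression (all data, widths, and seeds here are illustrative; the exact peak location varies, but test error typically spikes near width ≈ n_train and falls again past it):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 500

def make_data(n):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def random_relu_features(x, width, seed=1):
    # Fixed random ReLU features, shared between train and test: relu(x @ W + b)
    r = np.random.default_rng(seed)
    W = r.standard_normal((x.shape[1], width))
    b = r.uniform(-1, 1, width)
    return np.maximum(x @ W + b, 0.0)

for width in [5, 10, 20, 40, 80, 200, 1000]:
    Phi_tr = random_relu_features(x_tr, width)
    Phi_te = random_relu_features(x_te, width)
    # lstsq gives the least-squares fit when overdetermined and the
    # minimum-norm interpolant when underdetermined (width > n_train)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    print(f"width={width:5d}  test MSE={np.mean((Phi_te @ w - y_te) ** 2):.3f}")
```

The interesting regime is width > 40: every such model fits the 40 training points exactly, yet the minimum-norm interpolant keeps improving on test data as width grows.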
follow-up
- What is the minimum-norm solution, and why does gradient descent implicitly favor it for overparameterized linear models? (gradient-descent sketch below)
- How do Chinchilla scaling laws change how you'd allocate a compute budget between model size and training tokens? (allocation sketch below)
- You're fine-tuning a 7B model on 500 labeled examples. Do you expect classical overfitting or modern generalization behavior, and what does that imply for your training approach? (fine-tuning sketch below)
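A concrete version of the implicit-bias claim (a numpy sketch on random synthetic data): infinitely many weight vectors interpolate an underdetermined linear system, but gradient descent from zero init converges to the minimum-norm one, because its iterates never leave the row space of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                           # underdetermined: more unknowns than equations
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

lr = 1.0 / np.linalg.norm(A, 2) ** 2     # safe step size: 1 / sigma_max(A)^2
w = np.zeros(d)                          # zero init keeps iterates in the row space of A
for _ in range(10_000):
    w -= lr * A.T @ (A @ w - y)          # gradient descent on 0.5 * ||A w - y||^2

w_pinv = np.linalg.pinv(A) @ y           # closed-form minimum-norm interpolant
print("train residual:", np.linalg.norm(A @ w - y))          # ~0: fits exactly
print("distance to min-norm:", np.linalg.norm(w - w_pinv))   # ~0: same solution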
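For the Chinchilla question, a back-of-envelope allocation using the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens at the compute-optimal point. The exact exponents and constants depend on the fit in Hoffmann et al. (2022); treat "20 tokens per parameter" as a rule of thumb, not a law:

```python
def chinchilla_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    # C ~ 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in [1e21, 1e23, 1e25]:
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```

Sanity check: plugging in Chinchilla's own budget of roughly 5.9e23 FLOPs recovers about 70B parameters and 1.4T tokens, matching the paper's configuration. The practical upshot: under a fixed compute budget, scale parameters and tokens together rather than growing the model alone.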
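For the fine-tuning question: 500 examples against 7B parameters is firmly classical-overfitting territory, so constrain trainable capacity and stop early. A sketch using parameter-efficient fine-tuning with the Hugging Face peft library; the model name, target modules, and hyperparameters are illustrative assumptions, not a recommendation:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; substitute whatever 7B checkpoint you actually use.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora = LoraConfig(
    r=8,                                  # low rank: millions of trainable params, not 7B
    lora_alpha=16,
    lora_dropout=0.1,                     # extra regularization for a tiny dataset
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Then: small learning rate, hold out ~10-20% of the 500 examples, and
# early-stop on validation loss; in this regime validation loss does turn
# back up, unlike in the large-scale pretraining regime.
```

The design choice to defend in an interview: you expect classical behavior here because the data-to-trainable-parameter ratio is tiny, so you shrink that ratio back toward sanity (LoRA, dropout, early stopping) instead of relying on the implicit regularization that saves pretraining-scale runs.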