Large, overparameterized models such as neural networks are now the workhorses of modern machine learning. These models are often trained to near-zero error on noisy datasets yet simultaneously generalize well to unseen data, in contrast to the textbook intuition regarding the perils of overfitting. At the same time, near-perfect data-fitting can have severe consequences for robustness, privacy, and fairness. Classical theoretical frameworks provide little guidance for navigating these questions in the overparameterized regime. It is thus crucial to develop new intuition regarding overfitting and generalization that is reflective of these empirical observations. In this tutorial, we discuss recent work in the learning theory literature that provides theoretical insights into these phenomena.
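To make the phenomenon concrete, here is a minimal sketch (our illustration, not part of the tutorial material) of noisy interpolation in overparameterized linear regression: with far more features than samples, the minimum-ℓ2-norm interpolator fits noisy labels exactly, yet its test error remains bounded. The dimensions, noise level, and data model are arbitrary choices for illustration.

```python
# Illustrative sketch of noisy interpolation in overparameterized linear
# regression; the dimensions and noise level below are assumptions chosen
# for demonstration, not values from the tutorial.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2000                                  # far more parameters than samples
w_star = rng.standard_normal(d) / np.sqrt(d)     # ground-truth coefficients
X = rng.standard_normal((n, d))
y = X @ w_star + 0.5 * rng.standard_normal(n)    # noisy labels

# Minimum-l2-norm interpolator: the pseudoinverse solution w_hat = X^+ y.
w_hat = np.linalg.pinv(X) @ y

train_mse = np.mean((X @ w_hat - y) ** 2)        # ~0: the noise is fit exactly

X_test = rng.standard_normal((10_000, d))
test_mse = np.mean((X_test @ w_hat - X_test @ w_star) ** 2)

print(f"train MSE: {train_mse:.2e}")
print(f"test  MSE: {test_mse:.3f}")              # bounded despite interpolating noise
```

Whether such interpolation is "benign" (test risk near optimal) or merely "tempered" depends on the data distribution and the inductive bias of the estimator; this is exactly the question the referenced works analyze.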

Speakers: Spencer Frei (UC Davis), Vidya Muthukumar (Georgia Tech), and Fanny Yang (ETH Zurich)

Slides, Video

Panel members

  • Moderator: Daniel Hsu, Columbia University,
    Daniel Hsu is an associate professor of computer science at Columbia University. His research interests in machine learning include the generalization behavior of interpolating models (Belkin et al. 2019) and the role of overparameterization (Dudeja et al. 2022). He has organized workshops at DIMACS, FOCS, ICML, and NeurIPS, and has given tutorials at AAAI, ICML, and the Simons Institute.

  • Panelist: Arash Amini, UCLA,
    Arash Amini is an Associate Professor of Statistics at UCLA working on high-dimensional statistics, machine learning, and optimization. He has done foundational work on theoretical guarantees for high-dimensional sparse PCA and community detection. His perspective as a statistician will complement the other panelists’ machine learning backgrounds.

  • Panelist: Kamalika Chaudhuri, Meta AI/UC San Diego,
    Kamalika Chaudhuri is a Research Scientist at Meta AI and Professor of Computer Science and Engineering at UCSD. She has worked extensively in the areas of robust machine learning, differential privacy, and out-of-distribution robustness. Her expertise in applied and theoretical machine learning provides an acute understanding of the practical implications of memorization in AI systems (Meehan et al. 2020).

  • Panelist: Nathan (Nati) Srebro, Toyota Technological Institute at Chicago,
    Nati Srebro is a Full Professor at TTIC working on methodological, statistical and computational aspects of machine learning, and related problems in optimization. His work on generalization in deep learning and implicit bias of optimization algorithms (e.g., Soudry et al. 2018) provides the foundation for many recent works on interpolation learning and benign overfitting, areas to which he continues to make important contributions.

  • Panelist: Chiyuan Zhang, Google Research,
    Chiyuan Zhang is a research scientist at Google Research, working on deep learning and computational neuroscience. His 2017 paper on generalization in deep learning (Zhang et al. 2021) was a major impetus for many theoretical studies of overparameterized and interpolating models. Chiyuan also investigates generalization and memorization in machine and human learning, as well as implications in areas such as privacy and large language models (Carlini et al. 2023).

Primary references

In the (rough) order of presentation

  • Hastie, Trevor and Montanari, Andrea and Rosset, Saharon and Tibshirani, Ryan J (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics.

  • Bartlett, Peter L and Long, Philip M and Lugosi, Gabor and Tsigler, Alexander (2020). Benign overfitting in linear regression. PNAS.

  • Muthukumar, Vidya and Vodrahalli, Kailas and Subramanian, Vignesh and Sahai, Anant (2020). Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory.

  • Wang, Guillaume and Donhauser, Konstantin and Yang, Fanny (2022). Tight bounds for minimum 1-norm interpolation of noisy data. In: AISTATS.

  • Donhauser, Konstantin and Ruggeri, Nicolo and Stojanovic, Stefan and Yang, Fanny (2022). Fast rates for noisy interpolation require rethinking the effects of inductive bias. In: ICML.

  • Hsu, Daniel and Muthukumar, Vidya and Xu, Ji (2021). On the proliferation of support vectors in high dimensions. In: AISTATS.

  • Muthukumar, Vidya and Narang, Adhyyan and Subramanian, Vignesh and Belkin, Mikhail and Hsu, Daniel and Sahai, Anant (2021). Classification vs regression in overparameterized regimes: Does the loss function matter?. Journal of Machine Learning Research.

  • Frei, Spencer and Vardi, Gal and Bartlett, Peter and Srebro, Nathan and Hu, Wei (2023). Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data. In: ICLR.

  • Frei, Spencer and Vardi, Gal and Bartlett, Peter L. and Srebro, Nathan (2023). Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization. In: COLT.

  • Xu, Xingyu and Gu, Yuantao (2023). Benign overfitting of non-smooth neural networks beyond lazy training. In: AISTATS.

Categorized references

Note: This list is in alphabetical order and is by no means exhaustive; rather, it is intended to contain a representative subset of the (primarily non-asymptotic) literature on this subject.

Empirical works

  • Belkin, Mikhail and Hsu, Daniel and Ma, Siyuan and Mandal, Soumik (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. PNAS.

  • Mallinar, Neil and Simon, James B. and Abedsoltan, Amirhesam and Pandit, Parthe and Belkin, Mikhail and Nakkiran, Preetum (2022). Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. In: NeurIPS.

  • Nakkiran, Preetum and Kaplun, Gal and Bansal, Yamini and Yang, Tristan and Barak, Boaz and Sutskever, Ilya (2021). Deep double descent: Where bigger models and more data hurt. In: ICLR.

  • Neyshabur, Behnam and Tomioka, Ryota and Srebro, Nathan (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.

  • Zhang, Chiyuan and Bengio, Samy and Hardt, Moritz and Recht, Benjamin and Vinyals, Oriol (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM.

Implicit bias in linear models

  • Efron, Bradley and Hastie, Trevor and Johnstone, Iain and Tibshirani, Robert (2004). Least Angle Regression. Annals of Statistics.

  • Gunasekar, Suriya and Lee, Jason and Soudry, Daniel and Srebro, Nathan (2018). Characterizing implicit bias in terms of optimization geometry. In: ICML.

  • Hastie, Trevor and Rosset, Saharon and Tibshirani, Robert and Zhu, Ji (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research.

  • Ji, Ziwei and Telgarsky, Matus (2018). Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300.

  • Soudry, Daniel and Hoffer, Elad and Shpigel Nacson, Mor and Gunasekar, Suriya and Srebro, Nathan (2018). The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research.

  • Telgarsky, Matus (2013). Margins, shrinkage, and boosting. In: ICML.

  • Zhang, Tong and Yu, Bin (2005). Boosting with early stopping: Convergence and Consistency. The Annals of Statistics.

Implicit bias in neural networks

  • Frei, Spencer and Vardi, Gal and Bartlett, Peter and Srebro, Nathan and Hu, Wei (2023). Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data. In: ICLR.

  • Ji, Ziwei and Telgarsky, Matus (2020). Directional convergence and alignment in deep learning. In: NeurIPS.

  • Kou, Yiwen and Chen, Zixiang and Gu, Quanquan (2023). Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data. In: NeurIPS.

  • Lyu, Kaifeng and Li, Jian (2019). Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890.

  • Vardi, Gal and Shamir, Ohad (2021). Implicit Regularization in ReLU Networks with the Square Loss. In: COLT.

Generalization bounds for interpolating linear regression

  • min-2-norm interpolators

    • Bartlett, Peter L and Long, Philip M and Lugosi, Gabor and Tsigler, Alexander (2020). Benign overfitting in linear regression. PNAS.

    • Belkin, Mikhail and Hsu, Daniel and Xu, Ji (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science.

    • Hastie, Trevor and Montanari, Andrea and Rosset, Saharon and Tibshirani, Ryan J (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics. (asymptotic)

    • Kobak, Dmitry and Lomond, Jonathan and Sanchez, Benoit (2020). The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research.

    • Muthukumar, Vidya and Vodrahalli, Kailas and Subramanian, Vignesh and Sahai, Anant (2020). Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory.

  • min-p-norm interpolators

    • Chinot, Geoffrey and Löffler, Matthias and van de Geer, Sara (2022). On the robustness of minimum norm interpolators and regularized empirical risk minimizers. Annals of Statistics.

    • Donhauser, Konstantin and Ruggeri, Nicolo and Stojanovic, Stefan and Yang, Fanny (2022). Fast rates for noisy interpolation require rethinking the effects of inductive bias. In: ICML.

    • Li, Yue and Wei, Yuting (2021). Minimum 1-norm interpolators: Precise asymptotics and multiple descent. arXiv preprint arXiv:2110.09502. (asymptotic)

    • Wang, Guillaume and Donhauser, Konstantin and Yang, Fanny (2022). Tight bounds for minimum 1-norm interpolation of noisy data. In: AISTATS.

Generalization bounds for interpolating linear classification

  • max-2-margin solutions

    • Ardeshir, Navid and Sanford, Clayton and Hsu, Daniel (2021). Support vector machines and linear regression coincide with very high-dimensional features. In: NeurIPS.

    • Chatterji, Niladri and Long, Philip M. (2021). Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. Journal of Machine Learning Research.

    • Deng, Zeyu and Kammoun, Abla and Thrampoulidis, Christos (2022). A model of double descent for high-dimensional binary linear classification. Information and Inference: A Journal of the IMA. (asymptotic)

    • Hsu, Daniel and Muthukumar, Vidya and Xu, Ji (2021). On the proliferation of support vectors in high dimensions. In: AISTATS.

    • Montanari, Andrea and Ruan, Feng and Sohn, Youngtak and Yan, Jun (2019). The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544. (asymptotic)

    • Muthukumar, Vidya and Narang, Adhyyan and Subramanian, Vignesh and Belkin, Mikhail and Hsu, Daniel and Sahai, Anant (2021). Classification vs regression in overparameterized regimes: Does the loss function matter?. Journal of Machine Learning Research.

    • Wang, Ke and Thrampoulidis, Christos (2022). Binary classification of gaussian mixtures: Abundance of support vectors, benign overfitting, and regularization. SIAM Journal on Mathematics of Data Science. (asymptotic)

  • max-p-margin solutions

    • Chinot, Geoffrey and Kuchelmeister, Felix and Löffler, Matthias and van de Geer, Sara (2022). AdaBoost and robust one-bit compressed sensing. Mathematical Statistics and Learning.

    • Donhauser, Konstantin and Ruggeri, Nicolo and Stojanovic, Stefan and Yang, Fanny (2022). Fast rates for noisy interpolation require rethinking the effects of inductive bias. In: ICML.

    • Stojanovic, Stefan and Donhauser, Konstantin and Yang, Fanny (2024). Tight bounds for maximum 1-margin classifiers. In: ALT.

Generalization bounds for random features

  • Belkin, Mikhail and Hsu, Daniel and Xu, Ji (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science.

  • Mei, Song and Montanari, Andrea (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics.

Generalization bounds for kernels, random forests, and other methods

  • Aerni, Michael and Milanta, Marco and Donhauser, Konstantin and Yang, Fanny (2023). Strong inductive biases provably prevent harmless interpolation. In: ICLR.

  • Belkin, Mikhail and Hsu, Daniel J and Mitra, Partha (2018). Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In: NeurIPS.

  • Belkin, Mikhail and Rakhlin, Alexander and Tsybakov, Alexandre B (2019). Does data interpolation contradict statistical optimality?. In: AISTATS.

  • Liang, Tengyuan and Rakhlin, Alexander and Zhai, Xiyu (2020). On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In: COLT.

  • Wyner, Abraham J and Olson, Matthew and Bleich, Justin and Mease, David (2017). Explaining the success of adaboost and random forests as interpolating classifiers. Journal of Machine Learning Research.

Generalization bounds for interpolating neural networks

  • Frei, Spencer and Chatterji, Niladri S. and Bartlett, Peter L (2022). Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data. In: COLT.

  • Frei, Spencer and Vardi, Gal and Bartlett, Peter L. and Srebro, Nathan (2023). Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization. In: COLT.

  • Kornowski, Guy and Yehudai, Gilad and Shamir, Ohad (2023). From Tempered to Benign Overfitting in ReLU Neural Networks. In: NeurIPS.

  • Kou, Yiwen and Chen, Zixiang and Chen, Yuanzhou and Gu, Quanquan (2023). Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. In: ICML.

  • Xu, Xingyu and Gu, Yuantao (2023). Benign overfitting of non-smooth neural networks beyond lazy training. In: AISTATS.