Project basics


  • Form groups of two
  • Pick a paper that includes an extensive proof from the list below (or double-check your own suggestion with me)
  • Sign-ups for a paper/project open on 12.3. at 13:15




By 18.3., 23:59, send an email to the supervisor of your project listed here with a PDF containing:

  • Choice of paper and group members
  • Division of labor within the group
  • Identification and motivation of the paper's problem in your own words (as defined below)

Mid-project drafts

Consider this an opportunity to receive feedback. Submit:

  • Draft presentation slides in .pdf format, according to the instructions below
  • A .pdf with
    • a report on what each individual has done
    • a plan and division of labor until the report deadline

Final content

The project report and presentation should include the following sections:

  1. Identify and motivate the problem: why should I / the community care? Include a literature review (done-ish at this stage)

In presentation (and report):

  1. “Detective hat”: an intuitive (not just technical) understanding of the proof, its assumptions, and its statement, in depth

    • Present it in a modular way: (a) give intuition; (b) show that the proof depends on Lemmas 1, 2, 3, which are insightful and intuitive (and may give rise to new proofs); (c) give the proofs of the lemmas that are technical
    • If a proof is written in a way that obscures its insights, fix that
    • Examine whether the results follow from older known results and are merely a reformulation, facilitated by ignorance of prior work (reinventing the wheel)
  2. “Reviewer hat”: Which relevant questions does the paper shed light on, and does it actually answer them / solve the problem? How significant is its contribution compared to the existing literature?

  3. What interesting, impactful follow-up questions did the authors leave unanswered that would be worth pursuing? Show evidence that the question(s) you identified are indeed relevant to understanding important phenomena in practice and are novel in the literature. You can start from the paper’s weaknesses. Examples:

    • They developed theory, but the assumptions are too removed from the practice they aim to explain, i.e. there is no recognizable link (an example could be a toy example that shares the phenomenon). In this case, come up with and run experiments that prove this point, and propose what is needed (theoretically or experimentally) to make the paper impactful
    • You see that the experiments are non-rigorous and not reproducible; they support neither the theoretical nor the heuristic claims
    • The problem they aim to solve is not quite relevant. Identify which tweaks to their problem formulation could fix that
    • By doing some experiments, you find a curious phenomenon that has not been considered and is not explained by the paper
    • Tightening upper bounds by constant or logarithmic factors did not lead to new insights
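As a minimal sketch of the kind of toy experiment you might run, the snippet below probes double descent in ridgeless least squares (cf. paper 1 on the list). All sizes, the noise level, and the feature counts are illustrative choices, not taken from any paper.

```python
# Toy experiment sketch: test error of the minimum-norm ("ridgeless") least
# squares fit as the number of features d varies around the training size n.
# All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d_true, n_test = 50, 20, 500  # train size, true feature count, test size

def avg_test_error(d, n_trials=20):
    """Mean test MSE of the minimum-norm least squares fit on d features."""
    errs = []
    for _ in range(n_trials):
        w = rng.standard_normal(d_true) / np.sqrt(d_true)
        d_max = max(d, d_true)
        X = rng.standard_normal((n, d_max))
        y = X[:, :d_true] @ w + 0.1 * rng.standard_normal(n)
        # pinv returns the minimum-norm solution; it interpolates when d >= n
        beta = np.linalg.pinv(X[:, :d]) @ y
        X_test = rng.standard_normal((n_test, d_max))
        y_test = X_test[:, :d_true] @ w
        errs.append(np.mean((X_test[:, :d] @ beta - y_test) ** 2))
    return float(np.mean(errs))

errors = {d: avg_test_error(d) for d in (10, 40, 50, 200)}
# the risk typically spikes near the interpolation threshold d = n
```

Even such a small script can surface a phenomenon (here, the risk spike at the interpolation threshold) and serve as a starting point for the experiments asked for above.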

In report:

  1. Break down the problem as much as you can into chunks that you can indeed pursue (or at least the first few steps); e.g., to prove a conjecture, give intuition and the lemmas you think you need, and try to prove some of them.

  2. Show your attempts to tackle the first few steps.


Each of the project proposal, progress report, and final project report should be neatly typeset as a PDF document using TeX, LaTeX, or similar systems with bibliographic references (e.g., using BibTeX). Check here for typesetting resources.
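As a minimal sketch of such a setup (the filenames and the citation key are hypothetical), a LaTeX document with BibTeX references can look like:

```latex
% report.tex -- minimal skeleton; filenames and citation key are illustrative
\documentclass{article}

\begin{document}

\section{Introduction}
We study ridgeless interpolation \cite{hastie2019surprises}.

\bibliographystyle{plain}
\bibliography{refs} % refs.bib holds the corresponding @article entries

\end{document}
```

Compile with `pdflatex report`, then `bibtex report`, then `pdflatex report` twice more so the references resolve.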

The total project grade will depend equally on

  • presentation with rubric
  • structure, clarity, and language of the report (as in the guidelines)
  • content as specified below



  • Length: 18 minutes total using slides (beamer recommended, e.g. via pandoc markdown), plus 5 minutes of discussion
  • For groups of two: each person should present for 9 minutes
  • Watch this video on good slides
  • Find more detailed, generally applicable project feedback, addressing common issues, here
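As a sketch of the pandoc route mentioned above (filename and titles are illustrative), a markdown source compiled with `pandoc -t beamer slides.md -o slides.pdf` can look like:

```markdown
---
title: "Paper Title: A Critical Look"
author: Group member 1, Group member 2
---

# Problem and motivation

- Why should the community care?

# Proof overview

- Intuition first, then the key lemmas, then the technical details
```

By default, each top-level heading starts a new slide.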

Content (see this for corresponding details)

  • Short problem introduction, motivation and lit review (20%)
  • Proof presentation (30%)
  • Paper discussion (30%)
  • Follow-up questions + motivation and possible methods of attack (20%)

The later your presentation, the more of the last bullet point you are expected to include.

The grade is determined by peer feedback and self-feedback on the presentation.

See the rubric that will be used for assessment.



  • The combination of presentation slides and report should contain the content described above. How to split it will likely differ between projects. If you were able to discuss the proof in the amount of detail you find insightful during the presentation, you can focus more on your own work in the report.

  • Furthermore, the pure reproduction (including necessary restructuring and rephrasing) of the paper's results and the proof presentation sections should not constitute more than 50% of the report. Your own investigations should be the focus here. These may include an extensive literature review, experimental explorations, follow-up theoretical conjectures/results, etc. If the proof is poorly presented in the original paper (e.g. convoluting the key ingredients), a simpler proof will also count towards “own investigations”.

  • Length: In the NeurIPS format (see template below), the report should be at most 10 pages of main text, excluding references and the appendix, where you can add more experiments, proofs of technical lemmas, etc.


Please see the following guidelines, which will be used to assess the presentation portion of the report grade:

  • Follow this as a guideline for content
  • Follow this as a guideline for good academic writing in general, and this specifically for mathematical writing
  • Follow the LaTeX-specific style guide for the report
  • Use this paper template
  • On individual contributions: please indicate who contributed what

Content grade

In terms of content, it is not the absolute results that will be graded (i.e. you don’t have to prove a new theorem or write a new conference paper), but rather the depth at which you critically investigate the paper’s faults and contributions, how well you put them into context, and the novelty and impact of the follow-up questions you would like to pursue. Hence, you will primarily be graded on points 1–4 in Final Content.

Obviously, it’s great if you succeed in solving your follow-up questions (i.e. successfully manage points 5–6), and that will be a big bonus for the grade, but you can achieve a good grade without actually having publishable results. Think of it as a proposal for a master’s thesis project with preliminary ideas and/or results.

Presentation schedule

Date  | Presenters                                | Paper title | Peer graders
27.5. | Marco Milanta, Václav Volhejn             | Deep Equals Shallow for ReLU Networks in Kernel Regimes
      | Matteo Russo, Zhihan Jin                  | Proper Learning, Helly Number, and an Optimal SVM Bound
      | Nicolas Emmenegger, Sebastian Haslebacher | Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
28.5. | Mikhail Makarov                           | On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
      | Arman Raayatsanati                        | On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
3.6.  | Nils Wachsmuth, Michael Aerni             | Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
      | Mislav Balunovic, Daniel Paleka           | How benign is benign overfitting?
      | Hugo Yèche, Federico Soldà                | Bridging Theory and Algorithm for Domain Adaptation
      | Batuhan Tomekce, Ege Karaismailoglu       | Domain adaptation under structural causal models
4.6.  | Shyam Ramesh, Alexandre Bense             | Certifying Some Distributional Robustness with Principled Adversarial Training
      | Tao Sun, Yu Hong                          | Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS


  • Please put your name down in the spreadsheet announced on Campuswire on 12.3., next to the paper you want to work on
  • If two papers are listed under one number, you are expected to read both for context (this actually makes your life easier, since you are given what you would otherwise need to find yourself) and can choose to focus on one
  • One group per paper, two people per group. First come, first served.

Note: Many of these have been published at conferences or in journals by now. The arXiv link is provided so you can find the newest version.

  1. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani arXiv:1903.08560 [math.ST]

    Exact expressions for double descent and implicit regularization via surrogate random design. Michał Dereziński, Feynman Liang, Michael W. Mahoney arXiv:1912.04533 [cs.LG]

  2. Linearized two-layers neural networks in high dimension. Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari arXiv:1904.12191 [math.ST]

    When Do Neural Networks Outperform Kernel Methods? Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari arXiv:2006.13409 [stat.ML]

  3. Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed. Maria Refinetti, Sebastian Goldt, Florent Krzakala, Lenka Zdeborová arXiv:2102.11742 [cs.LG]

  4. Generalisation error in learning with random features and the hidden manifold model. Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, Lenka Zdeborová. arXiv:2002.09339 [math.ST]

  5. Triple descent and the two kinds of overfitting: Where & why do they appear? Stéphane d’Ascoli, Levent Sagun, Giulio Biroli arXiv:2006.03509 [cs.LG]

  6. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. Niladri S. Chatterji, Philip M. Long arXiv:2004.12019v2 [stat.ML]

  7. On the Inductive Bias of Neural Tangent Kernels. Alberto Bietti, Julien Mairal. arXiv:1905.12173 [stat.ML]

  8. Deep Equals Shallow for ReLU Networks in Kernel Regimes. Alberto Bietti, Francis Bach. arXiv:2009.14397 [stat.ML]

  9. Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS. Lin Chen, Sheng Xu. arXiv:2009.10683 [stat.ML]

    On the Similarity between the Laplace and Neural Tangent Kernels. Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, Ronen Basri. arXiv:2007.01580 [cs.LG]

  10. Nonparametric regression using deep neural networks with ReLU activation function. Johannes Schmidt-Hieber. arXiv:1708.06633 [math.ST]

  11. Gradient Methods Never Overfit On Separable Data. Ohad Shamir. arXiv:2007.00028 [cs.LG]

  12. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy. Edward Moroshko, Blake E. Woodworth, Suriya Gunasekar, Jason D. Lee, Nati Srebro, Daniel Soudry. arXiv:2007.06738 [cs.LG]

  13. Learning Parities with Neural Networks. Amit Daniely, Eran Malach. arXiv:2002.07400 [cs.LG]

  14. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Peter L. Bartlett, Nick Harvey, Chris Liaw, Abbas Mehrabian. arXiv:1703.02930 [cs.LG]

    Size-Independent Sample Complexity of Neural Networks. Noah Golowich, Alexander Rakhlin, Ohad Shamir. arXiv:1712.06541 [cs.LG]

  15. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. Ziwei Ji, Matus Telgarsky. arXiv:1909.12292 [cs.LG]

  16. Gradient Descent Finds Global Minima of Deep Neural Networks. Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai. arXiv:1811.03804 [cs.LG]

  17. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. Lenaic Chizat, Francis Bach. arXiv:1805.09545 [math.OC].

  18. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. Lenaic Chizat, Francis Bach. arXiv:2002.04486 [math.OC]

  19. Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks. Mingchen Li, Mahdi Soltanolkotabi, Samet Oymak. arXiv:1903.11680 [cs.LG]

  20. Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks. Spencer Frei, Yuan Cao, Quanquan Gu. arXiv:1910.02934 [cs.LG]

  21. Bridging Theory and Algorithm for Domain Adaptation. Yuchen Zhang*, Tianle Liu, Mingsheng Long, Michael I. Jordan (* not the same as above) arXiv:1904.05801 [cs.LG], supplement

  22. Certifying Some Distributional Robustness with Principled Adversarial Training. Aman Sinha, Hongseok Namkoong, John Duchi. arXiv:1710.10571 [stat.ML]

  23. Domain adaptation under structural causal models. Yuansi Chen, Peter Bühlmann arXiv:2010.15764 [stat.ML]

  24. Minimax optimality of permutation tests. Ilmun Kim, Sivaraman Balakrishnan, Larry Wasserman arXiv:2003.13208 [math.ST]

You can choose your own paper; however, you have to double-check it with the instructor before registering. Other possible topics:

  • Online learning
  • Model selection
  • Non-parametric feature importance