Student projects

List of available projects

If you are an ETH student in CS, EE or Statistics (math can also be arranged) and interested in doing a project or thesis, please fill out this form and email both Fanny and the project advisor. There are more opportunities in the lab than the listed projects: if you are interested in the general direction of trustworthy ML or causal inference (both empirically and theoretically) and have an excellent mathematical or coding background, feel free to contact us.

Beyond Accuracy on the Line: Evaluating Out-of-Distribution Generalization in Machine Learning

Develop novel ways to evaluate machine learning methods out-of-distribution that reflect their true generalization capabilities.

The accuracy of machine learning methods often drops when they are trained on one domain and deployed on another, a failure that has been observed repeatedly in empirical studies. It is less clear, however, which actions can mitigate this failure, if any. One intuitive approach, distributionally robust optimization (DRO), aims to find a model that performs well on all test distributions “close” to the training distribution in some probability metric. However, this approach tends to produce overly pessimistic models that are robust against “unrealistic” distribution shifts. Instead, a number of methods have been proposed that aim to identify and exclusively use stable (“causal”) relationships in the data. The corresponding theory states that such methods are guaranteed to outperform standard empirical risk minimization (ERM) under worst-case distribution shifts.

When put to the test empirically, these findings do not seem to hold up well: on real-world datasets, causality- (or invariance-) based generalization methods are very often outperformed by ERM, and seem to generalize worse both in- and out-of-distribution (OOD) [Nastl & Hardt, 2024; Salaudeen et al., 2025]. This is consistent with the “accuracy-on-the-line” hypothesis, which postulates that the ranking of models is often preserved across distributions. Recent work argues that this mismatch between theoretical and empirical findings is an “artifact” of misspecified OOD datasets, which do not contain sufficiently adversarial shifts.
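To make the contrast concrete, here is a minimal toy sketch (not any specific method from the literature) comparing the ERM objective with a simple group-DRO objective on per-sample losses from two hypothetical domains; the domain names and loss values are invented for illustration. ERM averages over all samples, so the large, easy domain dominates; group DRO evaluates the model by its worst per-group average loss.

```python
import numpy as np

# Toy per-sample losses from two made-up domains: a large "easy" domain
# and a small "hard" domain (values are purely illustrative).
rng = np.random.default_rng(0)
losses_domain_a = rng.normal(loc=0.3, scale=0.05, size=100)  # easy, common domain
losses_domain_b = rng.normal(loc=0.9, scale=0.05, size=20)   # hard, rare domain

def erm_objective(groups):
    # ERM: mean loss over all pooled samples, so large groups dominate.
    return np.concatenate(groups).mean()

def group_dro_objective(groups):
    # Group DRO: worst average loss over groups, i.e. the model is judged
    # by the hardest domain regardless of its sample count.
    return max(g.mean() for g in groups)

groups = [losses_domain_a, losses_domain_b]
print(f"ERM objective:       {erm_objective(groups):.3f}")
print(f"Group-DRO objective: {group_dro_objective(groups):.3f}")
```

A model selected by the ERM objective can look good here while performing poorly on the rare domain, which is exactly the gap that worst-case (DRO-style) evaluation is meant to expose.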

The goal is to resolve the mismatch between theoretical and empirical findings in multiple ways:

  • by verifying whether invariance-based OOD methods rank better if the distribution shift is constructed to be worst-case;
  • by constructing benchmarks with varying strength and complexity of the distribution shift which could help evaluate a “spectrum” of OOD generalization of models;
  • by providing a theoretical justification of recent empirical findings through analysis of the mismatch between benchmarks and assumptions.

Goals of the project:

  • Create novel evaluation schemes for OOD machine learning methods;
  • Construct novel benchmarks which more accurately measure OOD generalization;
  • Test a large variety of large-scale models via the novel schemes and benchmarks.

Key Skills & Qualifications:

  • Strong background in machine learning, familiarity with reading and understanding research papers.
  • Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
  • Background in statistics, basic machine learning theory, training and evaluation of machine learning models.
  • Interest in out-of-distribution generalization, design of safe and robust models, and causality.

Contact: Julia and Jun

Examples of previous student projects

Tight bounds for maximum l1-margin classifiers

Stefan Stojanovic with Konstantin Donhauser and Fanny Yang. ALT 2024. [paper]

Certified private data release for sparse Lipschitz functions

Johan Lokna and Robert Hoenig with Konstantin Donhauser, Amartya Sanyal, March Boedihardjo, and Fanny Yang. AISTATS 2024. [paper]

Can semi-supervised learning use all the data effectively? A lower bound perspective

Gizem Yüce with Alexandru Ţifrea, Amartya Sanyal, and Fanny Yang. NeurIPS 2023, Spotlight 🏅. [paper]

Strong inductive biases provably prevent harmless interpolation

Marco Milanta with Michael Aerni, Konstantin Donhauser, and Fanny Yang. ICLR 2023. [paper]

Why adversarial training can hurt robust accuracy

Jacob Clarysse with Julia Hörrmann and Fanny Yang. ICLR 2023. [paper]

How unfair is private learning?

Yaxi Hu with Amartya Sanyal and Fanny Yang. UAI 2022, Oral 🏅. [paper]