student projects
List of available projects
If you are an ETH student in CS, EE or Statistics (Math can also be arranged) and interested in doing a project or thesis, please fill out this form and email both Fanny and the project advisor. There are more opportunities in the lab than the projects listed here: if you are interested in the general direction of trustworthy ML or causal inference (both empirically and theoretically) and have an excellent mathematical or coding background, feel free to contact us.
Inference-Time Unlearning for Large Language Models
The project will explore inference-time unlearning, where the main language model remains fixed and unlearning is implemented by modifying the decoding process.
Large language models tend to memorize parts of their training data, including sensitive or proprietary content, and may reproduce those sequences verbatim during generation. In many applications, the goal is not to fully erase all traces of specific data, but to avoid generating certain sentences, passages, or styles when users query the model, while preserving general usefulness. Examples include copyrighted corpora, time-limited data (e.g., pre-GDPR logs), or user-specific content that must not be surfaced.
Most existing unlearning approaches modify the model weights: retraining from scratch without the data to forget, fine-tuning against it, or using specialized unlearning objectives. These strategies are expensive, can degrade performance on unrelated tasks, and make it hard to support multiple unlearning configurations (for different users, products, or jurisdictions) without duplicating or repeatedly editing the base model.
Problem statement: The project will explore inference-time unlearning schemes in which the base language model remains fixed and unlearning is implemented entirely in the decoding process. The central idea is to approximate the behaviour of a hypothetical “unlearned” model by transforming the base model’s token distribution using auxiliary signals that indicate how similar a candidate continuation is to a “forget” region and how compatible it is with a desired “retain” region (for example, safe domains, styles, or time slices). Auxiliary signals may come from small models trained on curated splits (retain/forget), discriminators, or other lightweight predictors that score tokens or continuations. The core question is how to design simple, practical rules that use these signals during decoding to nudge the distribution away from unwanted content and towards acceptable behaviour, while preserving capabilities and requiring only light additional training and computational overhead.
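To make this concrete, below is a minimal sketch of one possible decoding rule, assuming HuggingFace-style causal language models that share a tokenizer; the function names, the auxiliary retain/forget models, and the specific reweighting (boosting tokens the retain model prefers over the forget model) are illustrative assumptions rather than the scheme the project is expected to produce.

# Minimal sketch of an inference-time unlearning rule (illustrative only).
# The base LM stays fixed; two small auxiliary LMs trained on retain/forget
# splits reweight the next-token distribution at every decoding step:
#   p_adj(token) ∝ p_base(token) * (p_retain(token) / p_forget(token))**alpha
import torch
import torch.nn.functional as F

@torch.no_grad()
def adjusted_next_token_logits(base_lm, retain_lm, forget_lm, input_ids, alpha=1.0):
    logits_base = base_lm(input_ids).logits[:, -1, :]
    logits_retain = retain_lm(input_ids).logits[:, -1, :]
    logits_forget = forget_lm(input_ids).logits[:, -1, :]
    # Steering signal: log-likelihood ratio between the retain and forget models.
    steer = F.log_softmax(logits_retain, dim=-1) - F.log_softmax(logits_forget, dim=-1)
    return logits_base + alpha * steer

@torch.no_grad()
def generate(base_lm, retain_lm, forget_lm, input_ids, max_new_tokens=50, alpha=1.0):
    for _ in range(max_new_tokens):
        logits = adjusted_next_token_logits(base_lm, retain_lm, forget_lm, input_ids, alpha)
        next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids

The weight alpha controls how aggressively generations are pushed away from the forget region; setting alpha = 0 recovers the base model.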
Goals of the project:
- Construct an LLM unlearning benchmark with clear forget/retain data and memorization-utility metrics.
- Propose and implement one or two inference-time decoding schemes that steer generations away from the forget region using auxiliary signals.
- Experimentally compare these schemes against weight-editing and filtering baselines, and report where inference-time unlearning works best.
Key Skills & Qualifications:
- Strong background in machine learning and deep learning, and comfortable working with modern language models.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Interested in practical questions around memorization, unlearning, and safety, and in designing simple mechanisms with clear empirical evaluation.
- Able to work fairly independently once the experimental setup is defined, and to organize and interpret results across multiple baselines and configurations.
Copyright-Protected Language Generation via Adaptive Model Fusion
Performance-Aware Optimization for Unlearning in Large Language Models
The project will study optimization-based unlearning schemes that explicitly control downstream performance, aiming to stay close to an ideal retrained baseline.
A common approach to unlearning in large language models is to directly update the weights of a pretrained model. Standard optimization-based schemes typically apply gradient ascent (or related updates) on a “forget” set to increase its loss, and add a regularization term—often a KL divergence—to keep the updated model close to the original parameters.
These methods can suppress the targeted content, but often at the cost of substantial utility loss on downstream tasks. The natural gold standard is the retrained model: a model trained from scratch only on the retain data (the original training data minus the forget set). In practice, however, most work focuses on tuning regularization strength and form, without explicitly asking whether the resulting unlearned model remains useful on relevant tasks or approximates the behaviour of this retrained baseline.
Problem statement: The project will investigate optimization-based unlearning from a performance-aware perspective. Instead of regularizing only towards the original model, the goal is to design unlearning objectives that also control performance on data drawn from the retain distribution, for example through validation-based penalties or constraints defined on a retain/validation set. There is also scope for theory in simplified settings (e.g., convex models or small classifiers) to understand which regularizers or constrained formulations yield solutions that are both “unlearned” and close to the retrained solution.
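As a concrete starting point, the sketch below spells out one possible performance-aware objective, assuming HuggingFace-style models that return a .loss when labels are included in the batch; the particular combination of terms and the placeholder weights lam_kl and mu_retain are illustrative assumptions, not the formulation the project should converge on.

# Minimal sketch of a performance-aware unlearning objective (illustrative).
import torch
import torch.nn.functional as F

def unlearning_loss(model, ref_model, forget_batch, retain_batch,
                    lam_kl=0.1, mu_retain=1.0):
    # (1) Gradient ascent on the forget set: minimizing -loss pushes its likelihood down.
    loss_forget = -model(**forget_batch).loss

    # (2) KL regularization towards the frozen original model on retain data.
    #     Note: F.kl_div(input, target) computes KL(target || input),
    #     here KL(p_ref || p_model) token-wise.
    retain_out = model(**retain_batch)
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(retain_out.logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    # (3) Performance-aware term: directly penalize the loss on retain/validation data,
    #     rather than relying on closeness in parameter or distribution space alone.
    loss_retain = retain_out.loss

    return loss_forget + lam_kl * kl + mu_retain * loss_retain

A constrained variant would instead keep loss_retain (or a validation metric) below a tolerance rather than adding it as a penalty, which is one place where theory in simplified settings could guide the choice.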
Goals of the project:
- Define a rigorous optimization-based unlearning setup on a pretrained open-source LLM, with clearly specified forget and retain/validation sets and metrics for both unlearning quality and retained utility.
- Develop and implement at least one performance-aware unlearning objective or constraint that explicitly incorporates retain/validation performance, with an optional theoretical analysis in a simplified setting.
- Empirically compare the proposed schemes against standard KL-regularized unlearning in terms of forgetting, downstream utility, and proximity to a model retrained from scratch on the retain data.
Key Skills & Qualifications:
- Strong background in machine learning and deep learning, and comfortable reading research papers.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Interest in optimization, constrained objectives, and connecting empirical behaviour with simple theoretical models.
- Able to work fairly independently once the experimental setup is defined, and to organize and interpret results across multiple baselines and configurations.
Reinforcement-Learning-Free Multi-Objective Language Model Alignment
The project will explore RL-free, theory-driven methods for aligning large language models along multiple, potentially conflicting objectives.
Large language models are typically aligned with human preferences using methods such as RLHF or DPO, which assume that all relevant preferences can be collapsed into a single scalar objective. In practice, however, systems often need to balance several heterogeneous and sometimes conflicting goals (for example, helpfulness, harmlessness, and strict policy compliance).
Multi-objective alignment treats different alignment axes as separate objectives and offers a principled way to reason about these trade-offs. Existing approaches frequently rely on reinforcement learning with multiple rewards merged into one, or on ad-hoc heuristics that resolve conflicts between objectives. These strategies are computationally heavy, high-variance, and provide limited insight into which trade-offs the final model is actually realizing.
Problem statement: The project will explore RL-free multi-objective alignment methods that operate directly at the loss or gradient level, rather than through explicit reward models and an RL loop. The aim is to design preference-aware training objectives that keep different alignment axes separate and are grounded in simple optimization principles (for example, scalarisation schemes or Pareto-inspired gradient rules), instead of ad-hoc tricks. There is also an opportunity to study how unlabeled data can be integrated into such schemes, inspired by recent work on semi-supervised multi-objective learning (e.g., Wegel et al., 2025).
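For illustration, the sketch below combines per-axis DPO losses through an explicit scalarisation; the function names, the use of precomputed chosen-vs-rejected log-probability ratios, and the fixed per-axis weights are hypothetical choices, just one of many gradient-level rules the project could study.

# Minimal sketch of a scalarised, RL-free multi-objective alignment loss
# built from one DPO-style term per alignment axis (illustrative only).
import torch.nn.functional as F

def dpo_loss(policy_logratios, ref_logratios, beta=0.1):
    # Standard DPO loss, given the summed log-prob ratio log pi(chosen) - log pi(rejected)
    # under the trained policy and under the frozen reference model.
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

def multi_objective_alignment_loss(per_axis_batches, weights, beta=0.1):
    # per_axis_batches: dict axis -> (policy_logratios, ref_logratios), one preference
    # dataset per alignment axis (e.g. helpfulness, harmlessness, policy compliance).
    # weights: dict axis -> scalarisation weight; keeping the weights explicit makes
    # the realised trade-off visible instead of collapsing it into a single hidden reward.
    total = 0.0
    for axis, (pol, ref) in per_axis_batches.items():
        total = total + weights[axis] * dpo_loss(pol, ref, beta)
    return total

Pareto-inspired alternatives would replace the fixed weights with weights chosen per step from the per-axis gradients, and unlabeled data could enter through additional consistency or pseudo-labeling terms.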
Goals of the project:
- Formalise a simple, theory-motivated multi-objective alignment setup on a pretrained open-source LLM with 2-3 alignment axes and a clear evaluation protocol.
- Propose and implement at least one principled RL-free training rule that combines these objectives at the loss or gradient level, with an optional extension that integrates unlabeled data.
- Empirically compare the proposed method to simple baselines (single-objective alignment, naive scalarisation) and characterise the resulting trade-offs in alignment behaviour, stability, and computational cost.
Key Skills & Qualifications:
- Strong background in machine learning and deep learning, and comfortable reading research papers.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Interest in optimization, multi-objective trade-offs, and theory-driven methods.
- Enjoys combining simple theoretical intuitions (toy examples, small arguments) with empirical validation on real models.
Beyond Accuracy on the Line: Evaluating Out-of-Distribution Generalization in Machine Learning
Develop novel ways of evaluating machine learning methods out-of-distribution that reflect their true generalization capabilities.
The accuracy of machine learning methods often drops when they are trained on one domain and deployed on another, a failure that has been observed repeatedly in empirical studies. It is much less clear, however, which actions can be taken to mitigate it, if any. One intuitive approach, distributionally robust optimization (DRO), aims to find a model that performs well on all test distributions “close” to the training distribution in some probability distance. In practice, this approach tends to yield overly pessimistic models that are robust against “unrealistic” distribution shifts. As an alternative, a number of methods have been proposed that aim to identify and exclusively use stable (“causal”) relationships in the data; the corresponding theory guarantees that such methods perform better than standard empirical risk minimization (ERM) under worst-case distribution shifts.
When put to the test empirically, these guarantees do not seem to hold up well: on real-world datasets, causality- (or invariance-) based generalization methods are very often outperformed by ERM, and seem to generalize worse both in- and out-of-distribution (OOD) [Nastl & Hardt, 2024; Salaudeen et al., 2025]. This is consistent with the “accuracy-on-the-line” hypothesis, which postulates that the ranking of models is often preserved across distributions. Recent work argues that this mismatch between theoretical and empirical findings is an “artifact” of misspecified OOD benchmarks whose distribution shifts are not sufficiently adversarial.
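As a small illustration of why the evaluation protocol matters, the sketch below (hypothetical numbers and names) compares model rankings under average versus worst-case reweighting of test environments; a worst-case mixture over environments reduces to the minimum per-environment accuracy.

# Minimal sketch: does the ERM-vs-invariance ranking change when test
# environments are reweighted adversarially instead of averaged? (Illustrative.)
import numpy as np

def worst_case_accuracy(per_env_accuracies):
    # The worst-case mixture over environments puts all weight on the hardest one.
    return float(np.min(per_env_accuracies))

def compare_rankings(per_env_acc_by_model):
    # per_env_acc_by_model: dict model_name -> array of per-environment accuracies.
    avg = {m: float(np.mean(a)) for m, a in per_env_acc_by_model.items()}
    worst = {m: worst_case_accuracy(a) for m, a in per_env_acc_by_model.items()}
    rank_avg = sorted(avg, key=avg.get, reverse=True)
    rank_worst = sorted(worst, key=worst.get, reverse=True)
    return rank_avg, rank_worst, rank_avg != rank_worst

# Hypothetical numbers: the invariance-based model loses on average accuracy
# but wins under the worst-case reweighting, so the ranking flips.
per_env = {"erm": np.array([0.95, 0.93, 0.60]),
           "invariant": np.array([0.82, 0.81, 0.78])}
print(compare_rankings(per_env))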
The goal is to resolve the mismatch between theoretical and empirical findings in multiple ways:
- by verifying whether invariance-based OOD methods rank better if the distribution shift is constructed to be worst-case;
- by constructing benchmarks with varying strength and complexity of the distribution shift which could help evaluate a “spectrum” of OOD generalization of models;
- by providing a theoretical justification of recent empirical findings through analysis of the mismatch between benchmarks and assumptions.
Goals of the project:
- Create novel evaluation schemes for OOD machine learning methods;
- Construct novel benchmarks which more accurately measure OOD generalization;
- Evaluate a broad range of large-scale models using the novel schemes and benchmarks.
Key Skills & Qualifications:
- Strong background in machine learning, familiarity with reading and understanding research papers.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Background in statistics, basic machine learning theory, training and evaluation of machine learning models.
- Interest in out-of-distribution generalization, the design of safe and robust models, and causality.
References:
[1] Do causal predictors generalize better to new domains? (Nastl & Hardt, 2024)
[2] Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified? (Salaudeen et al., 2025)
[3] Invariance, causality and robustness.
[4] In search of lost domain generalization.
[5] In search of forgotten domain generalization. (Not the same as [4]!)
Examples of previous student projects
Tight bounds for maximum l1-margin classifiers
Stefan Stojanovic with Konstantin Donhauser and Fanny Yang. ALT 2024. [paper]
Certified private data release for sparse Lipschitz functions
Johan Lokna and Robert Hoenig with Konstantin Donhauser, Amartya Sanyal, March Boedihardjo, and Fanny Yang. AISTATS 2024. [paper]
Can semi-supervised learning use all the data effectively? A lower bound perspective
Gizem Yüce with Alexandru Ţifrea, Amartya Sanyal, and Fanny Yang. NeurIPS 2023, Spotlight 🏅. [paper]
Strong inductive biases provably prevent harmless interpolation
Marco Milanta with Michael Aerni, Konstantin Donhauser, and Fanny Yang. ICLR 2023. [paper]
Why adversarial training can hurt robust accuracy
Jacob Clarysse with Julia Hörrmann and Fanny Yang. ICLR 2023. [paper]
How unfair is private learning?
Yaxi Hu with Amartya Sanyal and Fanny Yang. UAI 2022, Oral 🏅. [paper]