student projects
List of available projects
If you are an ETH student in CS, EE or Statistics (Math can also be arranged) and interested in doing a project or thesis, please fill out this form and email both Fanny and the project advisor. There are more opportunities in the lab than the projects listed here: if you are interested in the general direction of trustworthy ML or causal inference (both empirically and theoretically) and have an excellent mathematical or coding background, feel free to contact us.
Inference-Time Unlearning for Large Language Models
The project will explore inference-time unlearning, where the main language model remains fixed and unlearning is implemented by modifying the decoding process.
Large language models tend to memorize parts of their training data, including sensitive or proprietary content, and may reproduce those sequences verbatim during generation. In many applications, the goal is not to fully erase all traces of specific data, but to avoid generating certain sentences, passages, or styles when users query the model, while preserving general usefulness. Examples include copyrighted corpora, time-limited data (e.g., pre-GDPR logs), or user-specific content that must not be surfaced.
Most existing unlearning approaches modify the model weights: retraining from scratch without the data to forget, fine-tuning against it, or using specialized unlearning objectives. These strategies are expensive, can degrade performance on unrelated tasks, and make it hard to support multiple unlearning configurations (for different users, products, or jurisdictions) without duplicating or repeatedly editing the base model.
Problem statement: The project will explore inference-time unlearning schemes in which the base language model remains fixed and unlearning is implemented entirely in the decoding process. The central idea is to approximate the behaviour of a hypothetical “unlearned” model by transforming the base model’s token distribution using auxiliary signals that indicate how similar a candidate continuation is to a “forget” region and how compatible it is with a desired “retain” region (for example, safe domains, styles, or time slices). Auxiliary signals may come from small models trained on curated splits (retain/forget), discriminators, or other lightweight predictors that score tokens or continuations. The core question is how to design simple, practical rules that use these signals during decoding to nudge the distribution away from unwanted content and towards acceptable behaviour, while preserving capabilities and requiring only light additional training and computational overhead.
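To make this concrete, below is a minimal sketch of one possible decoding rule, assuming HuggingFace-style causal language models that share a tokenizer; the function names, the auxiliary retain/forget models, and the specific reweighting (boosting tokens the retain model prefers over the forget model) are illustrative assumptions rather than the scheme the project is expected to produce.

# Minimal sketch of an inference-time unlearning rule (illustrative only).
# The base LM stays fixed; two small auxiliary LMs trained on retain/forget
# splits reweight the next-token distribution at every decoding step:
#   p_adj(token) ∝ p_base(token) * (p_retain(token) / p_forget(token))**alpha
import torch
import torch.nn.functional as F

@torch.no_grad()
def adjusted_next_token_logits(base_lm, retain_lm, forget_lm, input_ids, alpha=1.0):
    logits_base = base_lm(input_ids).logits[:, -1, :]
    logits_retain = retain_lm(input_ids).logits[:, -1, :]
    logits_forget = forget_lm(input_ids).logits[:, -1, :]
    # Steering signal: log-likelihood ratio between the retain and forget models.
    steer = F.log_softmax(logits_retain, dim=-1) - F.log_softmax(logits_forget, dim=-1)
    return logits_base + alpha * steer

@torch.no_grad()
def generate(base_lm, retain_lm, forget_lm, input_ids, max_new_tokens=50, alpha=1.0):
    for _ in range(max_new_tokens):
        logits = adjusted_next_token_logits(base_lm, retain_lm, forget_lm, input_ids, alpha)
        next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids

The weight alpha controls how aggressively generations are pushed away from the forget region; setting alpha = 0 recovers the base model.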
Goals of the project:
- Construct an LLM unlearning benchmark with clear forget/retain data and memorization-utility metrics.
- Propose and implement one or two inference-time decoding schemes that steer generations away from the forget region using auxiliary signals.
- Experimentally compare these schemes against weight-editing and filtering baselines, and report where inference-time unlearning works best.
Key Skills & Qualifications:
- Strong background in machine learning and deep learning, and comfortable working with modern language models.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Interested in practical questions around memorization, unlearning, and safety, and in designing simple mechanisms with clear empirical evaluation.
- Able to work fairly independently once the experimental setup is defined, and to organize and interpret results across multiple baselines and configurations.
Copyright-Protected Language Generation via Adaptive Model Fusion
Performance-Aware Optimization for Unlearning in Large Language Models
The project will study optimization-based unlearning schemes that explicitly control downstream performance, aiming to stay close to an ideal retrained baseline.
A common approach to unlearning in large language models is to directly update the weights of a pretrained model. Standard optimization-based schemes typically apply gradient ascent (or related updates) on a “forget” set to increase its loss, and add a regularization term—often a KL divergence—to keep the updated model close to the original parameters.
These methods can suppress the targeted content, but often at the cost of substantial utility loss on downstream tasks. The natural gold standard is the retrained model: a model trained from scratch only on the retain data (the original training data minus the forget set). In practice, however, most work focuses on tuning regularization strength and form, without explicitly asking whether the resulting unlearned model remains useful on relevant tasks or approximates the behaviour of this retrained baseline.
Problem statement: The project will investigate optimization-based unlearning from a performance-aware perspective. Instead of regularizing only towards the original model, the goal is to design unlearning objectives that also control performance on data drawn from the retain distribution, for example through validation-based penalties or constraints defined on a retain/validation set. There is also scope for theory in simplified settings (e.g., convex models or small classifiers) to understand which regularizers or constrained formulations yield solutions that are both “unlearned” and close to the retrained solution.
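As a concrete starting point, the sketch below spells out one possible performance-aware objective, assuming HuggingFace-style models that return a .loss when labels are included in the batch; the particular combination of terms and the placeholder weights lam_kl and mu_retain are illustrative assumptions, not the formulation the project should converge on.

# Minimal sketch of a performance-aware unlearning objective (illustrative).
import torch
import torch.nn.functional as F

def unlearning_loss(model, ref_model, forget_batch, retain_batch,
                    lam_kl=0.1, mu_retain=1.0):
    # (1) Gradient ascent on the forget set: minimizing -loss pushes its likelihood down.
    loss_forget = -model(**forget_batch).loss

    # (2) KL regularization towards the frozen original model on retain data.
    #     Note: F.kl_div(input, target) computes KL(target || input),
    #     here KL(p_ref || p_model) token-wise.
    retain_out = model(**retain_batch)
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(retain_out.logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    # (3) Performance-aware term: directly penalize the loss on retain/validation data,
    #     rather than relying on closeness in parameter or distribution space alone.
    loss_retain = retain_out.loss

    return loss_forget + lam_kl * kl + mu_retain * loss_retain

A constrained variant would instead keep loss_retain (or a validation metric) below a tolerance rather than adding it as a penalty, which is one place where theory in simplified settings could guide the choice.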
Goals of the project:
- Define a rigorous optimization-based unlearning setup on a pretrained open-source LLM, with clearly specified forget and retain/validation sets and metrics for both unlearning quality and retained utility.
- Develop and implement at least one performance-aware unlearning objective or constraint that explicitly incorporates retain/validation performance, with an optional theoretical analysis in a simplified setting.
- Empirically compare the proposed schemes against standard KL-regularized unlearning in terms of forgetting, downstream utility, and proximity to a model retrained from scratch on the retain data.
Key Skills & Qualifications:
- Strong background in machine learning and deep learning, and comfortable reading research papers.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Interest in optimization, constrained objectives, and connecting empirical behaviour with simple theoretical models.
- Able to work fairly independently once the experimental setup is defined, and to organize and interpret results across multiple baselines and configurations.
Reinforcement-Learning-Free Multi-Objective Language Model Alignment
The project will explore RL-free, theory-driven methods for aligning large language models along multiple, potentially conflicting objectives.
Large language models are typically aligned with human preferences using methods such as RLHF or DPO, which assume that all relevant preferences can be collapsed into a single scalar objective. In practice, however, systems often need to balance several heterogeneous and sometimes conflicting goals (for example, helpfulness, harmlessness, and strict policy compliance).
Multi-objective alignment treats different alignment axes as separate objectives and offers a principled way to reason about these trade-offs. Existing approaches frequently rely on reinforcement learning with multiple rewards merged into one, or on ad-hoc heuristics that resolve conflicts between objectives. These strategies are computationally heavy, high-variance, and provide limited insight into which trade-offs the final model is actually realizing.
Problem statement: The project will explore RL-free multi-objective alignment methods that operate directly at the loss or gradient level, rather than through explicit reward models and an RL loop. The aim is to design preference-aware training objectives that keep different alignment axes separate and are grounded in simple optimization principles (for example, scalarisation schemes or Pareto-inspired gradient rules), instead of ad-hoc tricks. There is also an opportunity to study how unlabeled data can be integrated into such schemes, inspired by recent work on semi-supervised multi-objective learning (e.g., Wegel et al., 2025).
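For illustration, the sketch below combines per-axis DPO losses through an explicit scalarisation; the function names, the use of precomputed chosen-vs-rejected log-probability ratios, and the fixed per-axis weights are hypothetical choices, just one of many gradient-level rules the project could study.

# Minimal sketch of a scalarised, RL-free multi-objective alignment loss
# built from one DPO-style term per alignment axis (illustrative only).
import torch.nn.functional as F

def dpo_loss(policy_logratios, ref_logratios, beta=0.1):
    # Standard DPO loss, given the summed log-prob ratio log pi(chosen) - log pi(rejected)
    # under the trained policy and under the frozen reference model.
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

def multi_objective_alignment_loss(per_axis_batches, weights, beta=0.1):
    # per_axis_batches: dict axis -> (policy_logratios, ref_logratios), one preference
    # dataset per alignment axis (e.g. helpfulness, harmlessness, policy compliance).
    # weights: dict axis -> scalarisation weight; keeping the weights explicit makes
    # the realised trade-off visible instead of collapsing it into a single hidden reward.
    total = 0.0
    for axis, (pol, ref) in per_axis_batches.items():
        total = total + weights[axis] * dpo_loss(pol, ref, beta)
    return total

Pareto-inspired alternatives would replace the fixed weights with weights chosen per step from the per-axis gradients, and unlabeled data could enter through additional consistency or pseudo-labeling terms.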
Goals of the project:
- Formalise a simple, theory-motivated multi-objective alignment setup on a pretrained open-source LLM with 2-3 alignment axes and a clear evaluation protocol.
- Propose and implement at least one principled RL-free training rule that combines these objectives at the loss or gradient level, with an optional extension that integrates unlabeled data.
- Empirically compare the proposed method to simple baselines (single-objective alignment, naive scalarisation) and characterise the resulting trade-offs in alignment behaviour, stability, and computational cost.
Key Skills & Qualifications:
- Strong background in machine learning and deep learning, and comfortable reading research papers.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Interest in optimization, multi-objective trade-offs, and theory-driven methods.
- Enjoys combining simple theoretical intuitions (toy examples, small arguments) with empirical validation on real models.
Beyond Accuracy on the Line: Evaluating Out-of-Distribution Generalization in Machine Learning
Develop novel ways of evaluating machine learning methods out-of-distribution that reflect their true generalization capabilities.
The accuracy of machine learning methods often drops when they are trained on one domain and deployed on another, a failure that has been observed repeatedly in empirical studies. It is much less clear, however, which actions can be taken to mitigate it, if any. One intuitive approach, distributionally robust optimization (DRO), aims to find a model that performs well on all test distributions “close” to the training distribution in some probability distance. In practice, this approach tends to yield overly pessimistic models that are robust against “unrealistic” distribution shifts. As an alternative, a number of methods have been proposed that aim to identify and exclusively use stable (“causal”) relationships in the data; the corresponding theory guarantees that such methods perform better than standard empirical risk minimization (ERM) under worst-case distribution shifts.
When put to the test empirically, these guarantees do not seem to hold up well: on real-world datasets, causality- (or invariance-) based generalization methods are very often outperformed by ERM, and seem to generalize worse both in- and out-of-distribution (OOD) [Nastl & Hardt, 2024; Salaudeen et al., 2025]. This is consistent with the “accuracy-on-the-line” hypothesis, which postulates that the ranking of models is often preserved across distributions. Recent work argues that this mismatch between theoretical and empirical findings is an “artifact” of misspecified OOD benchmarks whose distribution shifts are not sufficiently adversarial.
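As a small illustration of why the evaluation protocol matters, the sketch below (hypothetical numbers and names) compares model rankings under average versus worst-case reweighting of test environments; a worst-case mixture over environments reduces to the minimum per-environment accuracy.

# Minimal sketch: does the ERM-vs-invariance ranking change when test
# environments are reweighted adversarially instead of averaged? (Illustrative.)
import numpy as np

def worst_case_accuracy(per_env_accuracies):
    # The worst-case mixture over environments puts all weight on the hardest one.
    return float(np.min(per_env_accuracies))

def compare_rankings(per_env_acc_by_model):
    # per_env_acc_by_model: dict model_name -> array of per-environment accuracies.
    avg = {m: float(np.mean(a)) for m, a in per_env_acc_by_model.items()}
    worst = {m: worst_case_accuracy(a) for m, a in per_env_acc_by_model.items()}
    rank_avg = sorted(avg, key=avg.get, reverse=True)
    rank_worst = sorted(worst, key=worst.get, reverse=True)
    return rank_avg, rank_worst, rank_avg != rank_worst

# Hypothetical numbers: the invariance-based model loses on average accuracy
# but wins under the worst-case reweighting, so the ranking flips.
per_env = {"erm": np.array([0.95, 0.93, 0.60]),
           "invariant": np.array([0.82, 0.81, 0.78])}
print(compare_rankings(per_env))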
The goal is to resolve the mismatch between theoretical and empirical findings in multiple ways:
- by verifying whether invariance-based OOD methods rank better if the distribution shift is constructed to be worst-case;
- by constructing benchmarks with varying strength and complexity of the distribution shift which could help evaluate a “spectrum” of OOD generalization of models;
- by providing a theoretical justification of recent empirical findings through analysis of the mismatch between benchmarks and assumptions.
Goals of the project:
- Create novel evaluation schemes for OOD machine learning methods;
- Construct novel benchmarks which more accurately measure OOD generalization;
- Evaluate a broad range of large-scale models using the novel schemes and benchmarks.
Key Skills & Qualifications:
- Strong background in machine learning, familiarity with reading and understanding research papers.
- Solid Python and PyTorch skills, plus basic experience running experiments on GPUs.
- Background in statistics, basic machine learning theory, training and evaluation of machine learning models.
- Interest in out-of-distribution generalization, the design of safe and robust models, and causality.
References:
[1] Do causal predictors generalize better to new domains? (Nastl & Hardt, 2024)
[2] Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified? (Salaudeen et al., 2025)
[3] Invariance, causality and robustness.
[4] In search of lost domain generalization.
[5] In search of forgotten domain generalization. (Not the same as [4]!)
Examples of previous student projects
Tight bounds for maximum l1-margin classifiers
Stefan Stojanovic with Konstantin Donhauser and Fanny Yang. ALT 2024. [paper]
Certified private data release for sparse Lipschitz functions
Johan Lokna and Robert Hoenig with Konstantin Donhauser, Amartya Sanyal, March Boedihardjo, and Fanny Yang. AISTATS 2024. [paper]
Can semi-supervised learning use all the data effectively? A lower bound perspective
Gizem Yüce with Alexandru Ţifrea, Amartya Sanyal, and Fanny Yang. NeurIPS 2023, Spotlight 🏅. [paper]
Strong inductive biases provably prevent harmless interpolation
Marco Milanta with Michael Aerni, Konstantin Donhauser, and Fanny Yang. ICLR 2023. [paper]
Why adversarial training can hurt robust accuracy
Jacob Clarysse with Julia Hörrmann and Fanny Yang. ICLR 2023. [paper]
How unfair is private learning?
Yaxi Hu with Amartya Sanyal and Fanny Yang. UAI 2022, Oral 🏅. [paper]