I am an Assistant Professor in the Computer Science Department (D-INFK) at ETH Zurich. Previously, I was a postdoctoral scholar at Stanford University working with John Duchi and Percy Liang, and a Junior Fellow at the Institute for Theoretical Studies at ETH Zurich working with Nicolai Meinshausen. Before that, I was a PhD student in the EECS department at UC Berkeley, advised by Martin Wainwright.
Research interests
I’m generally interested in the theoretical understanding of machine learning and statistics, and in developing tools that work well in practice. Currently I am particularly curious about gaining a theoretical understanding of the generalization properties of overparameterized models for high-dimensional data (motivated by neural networks), as well as a range of questions related to obtaining more trustworthy ML models, in particular distributional robustness, domain generalization and interpretability.
For the latter set of questions I’m particularly excited about problems in the medical domain, so if you’re facing concrete reliability issues when using ML for medical diagnostics or treatment, please don’t hesitate to ping me.
Recent talk slides
- (2022) At the MSRI workshop Foundations of Stable, Generalizable and Transferable Statistical Learning, on fast rates for min-ℓp-norm interpolation (for p in [1,2]) and issues of interpolating models for robust evaluation (slides)
- (2021) At the ELLIS Doctoral Symposium, on limits of rotationally invariant kernels in high dimensions (such as the NTK) and on semi-supervised novelty detection (slides)
Recent papers
- Certified private data release for sparse Lipschitz functions
  Konstantin Donhauser, Johan Lokna, Amartya Sanyal, March Boedihardjo, Robert Hoenig, and Fanny Yang
  Theory and Practice of Differential Privacy (TPDP), 2023
As machine learning has become more relevant for everyday applications, a natural requirement is the protection of the privacy of the training data. When the relevant learning questions are unknown in advance, or hyper-parameter tuning plays a central role, one solution is to release a differentially private synthetic data set that leads to similar conclusions as the original training data. In this work, we introduce an algorithm that enjoys fast rates for the utility loss for sparse Lipschitz queries. Furthermore, we show how to obtain a certificate for the utility loss for a large class of algorithms.
- PILLAR: How to make semi-private learning more effective
  Francesco Pinto, Yaxi Hu, Fanny Yang, and Amartya Sanyal
  Theory and Practice of Differential Privacy (TPDP), 2023
In Semi-Supervised Semi-Private (SP) learning, the learner has access to both public unlabelled and private labelled data. We propose a computationally efficient algorithm that, under mild assumptions on the data, provably achieves significantly lower private labelled sample complexity and can be efficiently run on real-world datasets. For this purpose, we leverage the features extracted by networks pre-trained on public (labelled or unlabelled) data, whose distribution can significantly differ from the one on which SP learning is performed. To validate its empirical effectiveness, we propose a wide variety of experiments under tight privacy constraints (ε=0.1) and with a focus on low-data regimes. In all of these settings, our algorithm exhibits significantly improved performance over available baselines that use similar amounts of public data.
- Margin-based sampling in high dimensions: When being active is less efficient than staying passive
  Alexandru Tifrea*, Jacob Clarysse*, and Fanny Yang
  International Conference on Machine Learning (ICML), 2023
It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
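  As a rough illustration of the two sampling schemes compared above (not the paper's code or experimental setup), the sketch below contrasts margin-based querying with passive random sampling for logistic regression on synthetic data; it assumes scikit-learn, and all dimensions and budgets are arbitrary placeholders.

  ```python
  # Minimal sketch: margin-based active learning vs. passive (random) sampling.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n_pool, d, budget, seed_size = 2000, 50, 200, 20

  X = rng.normal(size=(n_pool, d))
  w_star = np.zeros(d)
  w_star[:5] = 1.0                                  # sparse ground truth
  y = (X @ w_star > 0).astype(int)                  # noiseless labels

  def fit_on(idx):
      return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

  # Passive learning (PL): label a uniformly random subset of the pool.
  passive_idx = rng.choice(n_pool, size=budget, replace=False)

  # Margin-based AL: repeatedly label the unlabelled point closest to the boundary.
  labelled = list(rng.choice(n_pool, size=seed_size, replace=False))
  while len(labelled) < budget:
      clf = fit_on(labelled)
      margins = np.abs(clf.decision_function(X))    # proxy for distance to boundary
      margins[labelled] = np.inf                    # never re-query labelled points
      labelled.append(int(np.argmin(margins)))

  X_test = rng.normal(size=(5000, d))
  y_test = (X_test @ w_star > 0).astype(int)
  print("passive accuracy:     ", fit_on(passive_idx).score(X_test, y_test))
  print("margin-based accuracy:", fit_on(labelled).score(X_test, y_test))
  ```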
- Strong inductive biases provably prevent harmless interpolation
  Michael Aerni*, Marco Milanta*, Konstantin Donhauser, and Fanny Yang
  International Conference on Learning Representations (ICLR), 2023
Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise – a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator’s inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.
- Why adversarial training can hurt robust accuracy
  Jacob Clarysse, Julia Hörrmann, and Fanny Yang
  International Conference on Learning Representations (ICLR), 2023
Machine learning classifiers with high test accuracy often perform poorly under adversarial attacks. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite may be true: even though adversarial training helps when enough data is available, it may hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Our proof provides explanatory insights that may also transfer to feature learning models. Further, we observe in experiments on standard image datasets that the same behavior occurs for perceptible attacks that effectively reduce class information, such as mask attacks and object corruptions.
- Tight bounds for maximum ℓ1-margin classifiers
  Stefan Stojanovic, Konstantin Donhauser, and Fanny Yang
  arXiv preprint, 2022
Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum ℓ1-margin classifier, a.k.a. the sparse hard-margin SVM, in high-dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the ℓ1-norm achieve improved statistical rates for hard sparse ground truths. We show that, surprisingly, this adaptivity does not apply to the maximum ℓ1-margin classifier in a standard discriminative setting. In particular, in the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order ||w*||_1^{2/3}/n^{1/3} for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order 1/√(log(d/n)). We are therefore the first to show benign overfitting for the maximum ℓ1-margin classifier.
- How robust accuracy suffers from certified training with convex relaxations
  Piersilvio De Bartolomeis, Jacob Clarysse, Amartya Sanyal, and Fanny Yang
  NeurIPS Workshop on Empirical Falsification (long talk), 2022
Adversarial attacks pose significant threats to deploying state-of-the-art classifiers in safety-critical applications. Two classes of methods have emerged to address this issue: empirical defences and certified defences. Although certified defences come with robustness guarantees, empirical defences such as adversarial training enjoy much higher popularity among practitioners. In this paper, we systematically compare the standard and robust error of these two robust training paradigms across multiple computer vision tasks. We show that in most tasks and for both 𝓁∞-ball and 𝓁2-ball threat models, certified training with convex relaxations suffers from worse standard and robust error than adversarial training. We further explore how the error gap between certified and adversarial training depends on the threat model and the data distribution. In particular, besides the perturbation budget, we identify as important factors the shape of the perturbation set and the implicit margin of the data distribution. We support our arguments with extensive ablations on both synthetic and image datasets.
- Provable concept learning for interpretable predictions using variational inference
  Armeen Taeb, Nicolo Ruggeri, Carina Schnuck, and Fanny Yang
  ICML Workshop AI4Science, 2022
In safety-critical applications, practitioners are reluctant to trust neural networks when no interpretable explanations are available. Many attempts to provide such explanations revolve around pixel-level attributions or use previously known concepts. In this paper we aim to provide explanations by provably identifying high-level, previously unknown concepts. To this end, we propose a probabilistic modeling framework to derive (C)oncept (L)earning and (P)rediction (CLAP) – a VAE-based classifier that uses visually interpretable concepts as linear predictors. Assuming that the data-generating mechanism involves predictive concepts, we prove that our method is able to identify them while attaining optimal classification accuracy. We use synthetic experiments for validation, and also show that on real-world (PlantVillage and ChestXRay) datasets, CLAP effectively discovers interpretable factors for classifying diseases.
- How unfair is private learning?
  Amartya Sanyal*, Yaxi Hu*, and Fanny Yang
  Conference on Uncertainty in Artificial Intelligence (UAI), Oral, 2022
As machine learning algorithms are deployed on sensitive data in critical decision-making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and achieve high accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR-10 and CelebA), and tabular (Law School) datasets and learning algorithms.
- Semi-supervised novelty detection using ensembles with regularized disagreement
  Alexandru Țifrea, Eric Stavarache, and Fanny Yang
  Conference on Uncertainty in Artificial Intelligence (UAI), 2022
Despite their excellent performance on in-distribution (ID) data, machine learning-based prediction systems often predict out-of-distribution (OOD) samples incorrectly while indicating high confidence. Instead, they should flag samples that are not similar to the training data, for example when new classes emerge over time. Even though current OOD detection algorithms can successfully distinguish completely different data sets, they fail to reliably identify samples from novel classes. We develop a new ensemble-based procedure that promotes model diversity and exploits regularization to limit disagreement to only OOD samples, using a batch containing an unknown mixture of ID and OOD data. We show that our procedure significantly outperforms state-of-the-art methods, including those that have access, during training, to data that is known to be OOD. We run extensive comparisons of our approach on a variety of novel-class detection scenarios, on standard image data sets such as SVHN/CIFAR-10/CIFAR-100 as well as on new disease detection on medical image data sets.
- Fast rates for noisy interpolation require rethinking the effects of inductive bias
  Konstantin Donhauser, Nicolo Ruggeri, Stefan Stojanovic, and Fanny Yang
  International Conference on Machine Learning (ICML), 2022
Good generalization performance on high-dimensional data crucially hinges on a simple structure of the ground truth and a corresponding strong inductive bias of the estimator. Even though this intuition is valid for regularized models, in this paper we caution against a strong inductive bias for interpolation in the presence of noise: Our results suggest that, while a stronger inductive bias encourages a simpler structure that is more aligned with the ground truth, it also increases the detrimental effect of noise. Specifically, for both linear regression and classification with a sparse ground truth, we prove that minimum ℓp-norm and maximum ℓp-margin interpolators achieve fast polynomial rates up to order 1/n for p > 1, compared to a logarithmic rate for p = 1. Finally, we provide experimental evidence that this trade-off may also play a crucial role in understanding non-linear interpolating models used in practice.
- Tight bounds for minimum ℓ1-norm interpolation of noisy data
  Guillaume Wang*, Konstantin Donhauser*, and Fanny Yang
  International Conference on Artificial Intelligence and Statistics (AISTATS), 2022
We provide matching upper and lower bounds of order σ²/log(d/n) for the prediction error of the minimum ℓ1-norm interpolator, a.k.a. basis pursuit. Our result is tight up to negligible terms when d ≫ n, and is the first to imply asymptotic consistency of noisy minimum-norm interpolation for isotropic features and sparse ground truths. Our work complements the literature on "benign overfitting" for minimum ℓ2-norm interpolation, where asymptotic consistency can be achieved only when the features are effectively low-dimensional.
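  For illustration only (not code from the paper), the minimum ℓ1-norm interpolator, i.e. basis pursuit, can be computed as a linear program; the sketch below uses scipy's linprog on synthetic sparse data, with all dimensions and noise levels chosen arbitrarily.

  ```python
  # Minimal sketch of basis pursuit:  minimize ||w||_1  subject to  X w = y.
  import numpy as np
  from scipy.optimize import linprog

  rng = np.random.default_rng(0)
  n, d = 50, 400                                    # overparameterized regime: d >> n
  X = rng.normal(size=(n, d))
  w_star = np.zeros(d)
  w_star[:3] = 1.0                                  # sparse ground truth
  y = X @ w_star + 0.1 * rng.normal(size=n)         # noisy observations

  # Variables z = (w, u) with |w_j| <= u_j, so sum_j u_j = ||w||_1 at the optimum.
  c = np.concatenate([np.zeros(d), np.ones(d)])
  I = np.eye(d)
  A_ub = np.block([[I, -I], [-I, -I]])              #  w - u <= 0   and   -w - u <= 0
  b_ub = np.zeros(2 * d)
  A_eq = np.hstack([X, np.zeros((n, d))])           #  X w = y  (exact interpolation)
  res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
  w_hat = res.x[:d]
  print("||w_hat||_1:      ", np.abs(w_hat).sum())
  print("training residual:", np.linalg.norm(X @ w_hat - y))   # ~0: interpolation
  print("estimation error: ", np.linalg.norm(w_hat - w_star))
  ```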
- Self-supervised Reinforcement Learning with Independently Controllable Subgoals
  Andrii Zadaianchuk, Georg Martius, and Fanny Yang
  Conference on Robot Learning (CoRL), 2021
To successfully tackle challenging manipulation tasks, autonomous agents must learn a diverse set of skills and how to combine them. Recently, self-supervised agents that set their own abstract goals by exploiting the discovered structure in the environment were shown to perform well on many different tasks. In particular, some of them were applied to learn basic manipulation skills in compositional multi-object environments. However, these methods learn skills without taking the dependencies between objects into account. Thus, the learned skills are difficult to combine in realistic environments. We propose a novel self-supervised agent that estimates relations between environment components and uses them to independently control different parts of the environment state. In addition, the estimated relations between objects can be used to decompose a complex goal into a compatible sequence of subgoals. We show that, by using this framework, an agent can efficiently and automatically learn manipulation tasks in multi-object environments with different relations between objects.
- How rotational invariance of common kernels prevents generalization in high dimensions
  Konstantin Donhauser, Mingqi Wu, and Fanny Yang
  International Conference on Machine Learning (ICML), 2021
Kernel ridge regression is well-known to achieve minimax optimal rates in low-dimensional settings. However, its behavior in high dimensions is much less understood. Recent work establishes consistency for high-dimensional kernel regression for a number of specific assumptions on the data distribution. In this paper, we show that in high dimensions, the rotational invariance property of commonly studied kernels (such as RBF, inner product kernels and fully-connected NTK of any depth) leads to inconsistent estimation unless the ground truth is a low-degree polynomial. Our lower bound on the generalization error holds for a wide range of distributions and kernels with different eigenvalue decays. This lower bound suggests that consistency results for kernel ridge regression in high dimensions generally require a more refined analysis that depends on the structure of the kernel beyond its eigenvalue decay.
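  As a toy illustration of the object studied in this paper (not the paper's experiments), the sketch below fits kernel ridge regression with a rotationally invariant RBF kernel to a low-degree and a higher-degree polynomial target in a high-dimensional regime; it assumes scikit-learn, and all hyperparameters are arbitrary.

  ```python
  # Kernel ridge regression with a rotationally invariant (RBF) kernel in high dimensions.
  import numpy as np
  from sklearn.kernel_ridge import KernelRidge

  rng = np.random.default_rng(0)
  n, d = 500, 200                                   # high-dimensional regime: d of order n

  def sample(m):
      X = rng.normal(size=(m, d)) / np.sqrt(d)      # inputs roughly on the unit sphere
      y_low = X[:, 0]                               # degree-1 (low-degree) target
      y_high = X[:, 0] * X[:, 1] * X[:, 2]          # degree-3 target
      return X, y_low, y_high

  X, y_low, y_high = sample(n)
  X_test, yt_low, yt_high = sample(2000)

  for name, y_tr, y_te in [("degree-1 target", y_low, yt_low),
                           ("degree-3 target", y_high, yt_high)]:
      krr = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-3).fit(X, y_tr)
      rel_err = np.mean((krr.predict(X_test) - y_te) ** 2) / np.mean(y_te ** 2)
      print(f"{name}: relative test error = {rel_err:.2f}")
  ```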
- Interpolation can hurt robust generalization even when there is no noise
  Konstantin Donhauser*, Alexandru Tifrea*, Michael Aerni, Reinhard Heckel, and Fanny Yang
  Neural Information Processing Systems (NeurIPS), 2021
Numerous recent works show that overparameterization implicitly reduces variance for min-norm interpolators and max-margin classifiers. These findings suggest that ridge regularization has vanishing benefits in high dimensions. We challenge this narrative by showing that, even in the absence of noise, avoiding interpolation through ridge regularization can significantly improve generalization. We prove this phenomenon for the robust risk of both linear regression and classification and hence provide the first theoretical result on robust overfitting.
- Understanding and Mitigating the Tradeoff between Robustness and Accuracy
  Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang
  International Conference on Machine Learning (ICML), 2020
Adversarial training augments the training set with perturbations to improve the robust error (over worst-case perturbations), but it often leads to an increase in the standard error (on unperturbed test inputs). Previous explanations for this tradeoff rely on the assumption that no predictor in the hypothesis class has low standard and robust error. In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. In particular, we show that the standard error could increase even when the augmented perturbations have noiseless observations from the optimal linear predictor. We then prove that the recently proposed robust self-training (RST) estimator improves robust error without sacrificing standard error for noiseless linear regression. Empirically, for neural networks, we find that RST with different adversarial training methods improves both standard and robust error for random and adversarial rotations and adversarial ℓ∞ perturbations in CIFAR-10.
- Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness
  Fanny Yang, Zuowen Wang, and Christina Heinze-Deml
  Neural Information Processing Systems (NeurIPS), 2019
This work provides theoretical and empirical evidence that invariance-inducing regularizers can increase predictive accuracy for worst-case spatial transformations (spatial robustness). Evaluated on these adversarially transformed examples, we demonstrate that adding regularization on top of standard or adversarial training reduces the relative error by 20% for CIFAR-10 without increasing the computational cost. This outperforms handcrafted networks that were explicitly designed to be spatially equivariant. Furthermore, we observe for SVHN, which is known to have inherent variance in orientation, that robust training also improves standard accuracy on the test set. We prove that this no-trade-off phenomenon holds for adversarial examples from transformation groups in the infinite data limit.
Selected older publications
- Early stopping for kernel boosting algorithms: A general analysis with localized complexities
  Fanny Yang*, Yuting Wei*, and Martin J. Wainwright
  IEEE Transactions on Information Theory (IEEE IT), 2019
  Neural Information Processing Systems (NeurIPS), Spotlight, 2017
Early stopping of iterative algorithms is a widely-used form of regularization in statistics, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including L2-boost, LogitBoost and AdaBoost, among others), we exhibit a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class. This connection allows us to show that local fixed point analysis of Gaussian or Rademacher complexities, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes, and illustrate the correspondence of our theory with practice for Sobolev kernel classes.
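  The paper derives theoretically optimal, data-dependent stopping rules from localized complexities; purely as a point of reference, the sketch below shows the generic hold-out version of early stopping for a boosting model, assuming scikit-learn and arbitrary hyperparameters.

  ```python
  # Hedged sketch: early stopping for boosting chosen on a hold-out set.
  import numpy as np
  from sklearn.ensemble import GradientBoostingRegressor
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(0)
  X = rng.uniform(-1, 1, size=(600, 1))
  y = np.sin(4 * X[:, 0]) + 0.3 * rng.normal(size=600)
  X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=0)

  boost = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                    max_depth=2, random_state=0).fit(X_tr, y_tr)

  # Validation error after each boosting iteration; stop where it is smallest.
  val_mse = [np.mean((pred - y_val) ** 2) for pred in boost.staged_predict(X_val)]
  t_hat = int(np.argmin(val_mse)) + 1
  print(f"stop after {t_hat} of {boost.n_estimators} iterations, "
        f"validation MSE = {min(val_mse):.4f}")
  ```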
- A framework for multi-A(rmed)/B(andit) testing with online FDR control
  Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, and Martin J. Wainwright
  Neural Information Processing Systems (NeurIPS), Spotlight, 2017
We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and anytime online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in sample optimality, high power, and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
- Statistical and computational guarantees for the Baum-Welch algorithm
  Fanny Yang, Sivaraman Balakrishnan, and Martin J. Wainwright
  Journal of Machine Learning Research (JMLR), 2017
The Hidden Markov Model (HMM) is one of the mainstays of statistical modeling of discrete time series, with applications including speech recognition, computational biology, computer vision and econometrics. Estimating an HMM from its observation process is often addressed via the Baum-Welch algorithm, which is known to be susceptible to local optima. In this paper, we first give a general characterization of the basin of attraction associated with any global optimum of the population likelihood. By exploiting this characterization, we provide non-asymptotic finite sample guarantees on the Baum-Welch updates, guaranteeing geometric convergence to a small ball of radius on the order of the minimax rate around a global optimum. As a concrete example, we prove a linear rate of convergence for a hidden Markov mixture of two isotropic Gaussians given a suitable mean separation and an initialization within a ball of large radius around (one of) the true parameters. To our knowledge, these are the first rigorous local convergence guarantees to global optima for the Baum-Welch algorithm in a setting where the likelihood function is nonconvex. We complement our theoretical results with thorough numerical simulations studying the convergence of the Baum-Welch algorithm and illustrating the accuracy of our predictions.
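  As a hedged illustration of the algorithm being analyzed (not the paper's own code), the sketch below runs Baum-Welch on data simulated from a hidden Markov mixture of two isotropic Gaussians, using the third-party hmmlearn package.

  ```python
  # Baum-Welch (EM) for a two-state Gaussian HMM via hmmlearn.
  import numpy as np
  from hmmlearn.hmm import GaussianHMM             # third-party package: hmmlearn

  rng = np.random.default_rng(0)

  # Simulate a hidden Markov mixture of two isotropic Gaussians in 2D.
  means = np.array([[-2.0, 0.0], [2.0, 0.0]])
  states = [0]
  for _ in range(1999):                             # sticky two-state Markov chain
      stay = rng.uniform() < 0.9
      states.append(states[-1] if stay else 1 - states[-1])
  X = means[states] + rng.normal(size=(2000, 2))

  # Baum-Welch from a random initialization; it is susceptible to local optima,
  # so in practice one would compare several restarts.
  model = GaussianHMM(n_components=2, covariance_type="spherical",
                      n_iter=100, random_state=1).fit(X)
  print("estimated state means:\n", model.means_)
  print("log-likelihood of the data:", model.score(X))
  ```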
- Phase Retrieval via Structured Modulation in Paley-Wiener Spaces
  F. Yang, V. Pohl, and H. Boche
  International Conference on Sampling Theory and Applications, 2013
This paper considers the recovery of continuous-time signals from the magnitudes of their samples. It uses a combination of structured modulation and oversampling, and provides sufficient conditions on the signal and the sampling system under which signal recovery is possible. In particular, it is shown that an average sampling rate of four times the Nyquist rate is sufficient to reconstruct almost every signal from its magnitude measurements.
Short C.V.
01/2020 - present | Assistant Professor, ETH Zurich
04/2019 - 12/2019 | Postdoctoral Fellow, Stanford University
09/2018 - 09/2019 | Junior Fellow (Postdoc), ETH Zurich
06/2017 - 02/2018 | Applied Scientist Intern, Amazon AWS, Palo Alto
08/2013 - 08/2018 | PhD, UC Berkeley
10/2010 - 08/2013 | M.Sc., Technical University Munich (TUM)
10/2007 - 10/2010 | B.Sc., Karlsruhe Institute of Technology (KIT)
The best way to reach me is via e-mail at fan.yang (at) inf.ethz.ch. However, note that I cannot respond to most requests, although I try to answer all research-related messages.
Meetings take place in my office: CAB G19.1.
Note: enter the CAB building from the north side and walk up to the G floor. The G floor is not connected across its sections, so it would be a pity to end up in the wrong one.
Personal
In my free time I play the violin, more extensively in earlier years. Here are a few recordings:
2023 Hottingersaal, Zurich with Adrien Luecker, Carmen Weber: Schubert B-Flat Major Trio (Audio)
2015 Berkeley Hertz Hall with Adam Bloniarz: Brahms violin sonata 3 & Debussy violin sonata (Video)
2013 Alte Hofkapelle der Residenz Munich with Kammerorchester L’Estro: Vivaldi Winter (Video)
2010 Karlsruhe Konzerthaus, solo violin with Frank-Michael Guthmann and the KIT Sinfonieorchester:
Brahms Double Concerto (Audio) Mvt 1, Mvt 2, Mvt 3
-2005 in Stuttgart: Ysaye Violin Sonata 3 (Audio), Ysaye Violin Sonata 4 (Audio),
Pablo de Sarasate Carmen Fantasy (Audio)
I also like to take photographs, now primarily with my Fuji X-T3, 35mm f1.4.