I am a Doctoral Student in Machine Learning and a Fellow at the ETH AI Center in Zurich, advised by Prof. Fanny Yang, Prof. Julia Vogt, and Prof. Bjoern Menze. Through my research, I aim to develop ML models that can be trusted for high-stakes decision-making, with an emphasis on medical applications. I am particularly excited about questions related to causal reasoning, representation learning, interpretability, and privacy in machine learning.
Previously, I led a project on conformal prediction under the guidance of Adrian Weller MBE, and researched interpretability methods for causal inference with Prof. Mihaela van der Schaar, both at the University of Cambridge. I was also a Research Scientist at Featurespace, researching and implementing ML models to fight financial crime.
Recent papers
- Privacy-preserving data release leveraging optimal transport and particle gradient descent
Konstantin Donhauser*, Javier Abad*, Neha Hulkund, and Fanny Yang
International Conference on Machine Learning (ICML), 2024
We present a novel approach for differentially private data synthesis of protected tabular datasets, a relevant task in highly sensitive domains such as healthcare and government. Current state-of-the-art methods predominantly use marginal-based approaches, where a dataset is generated from private estimates of the marginals. In this paper, we introduce PrivPGD, a new generation method for marginal-based private data synthesis, leveraging tools from optimal transport and particle gradient descent. Our algorithm outperforms existing methods on a large range of datasets while being highly scalable and offering the flexibility to incorporate additional domain-specific constraints.
- Detecting critical treatment effect bias in small subgroups
Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, and Fanny Yang
Conference on Uncertainty in Artificial Intelligence (UAI), 2024
Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.
- Hidden yet quantifiable: A lower bound for confounding strength using randomized trials
Piersilvio De Bartolomeis*, Javier Abad*, Konstantin Donhauser, and Fanny Yang
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new drugs in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions from observational data. We propose a novel strategy to quantify unobserved confounding by leveraging randomized trials. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.
Preprints
- Copyright-Protected Language Generation via Adaptive Model Fusion
Javier Abad, Konstantin Donhauser, Francesco Pinto, and Fanny Yang
arXiv preprint, 2024
The risk of language models reproducing copyrighted material from their training data has led to the development of various protective measures. Among these, inference-time strategies that impose constraints via post-processing have shown promise in addressing the complexities of copyright regulation. However, they often incur prohibitive computational costs or suffer from performance trade-offs. To overcome these limitations, we introduce Copyright-Protecting Model Fusion (CP-Fuse), a novel approach that combines models trained on disjoint sets of copyrighted material during inference. In particular, CP-Fuse adaptively aggregates the model outputs to minimize the reproduction of copyrighted content, adhering to a crucial balancing property that prevents the regurgitation of memorized data. Through extensive experiments, we show that CP-Fuse significantly reduces the reproduction of protected material without compromising the quality of text and code generation. Moreover, its post-hoc nature allows seamless integration with other protective measures, further enhancing copyright safeguards. Lastly, we show that CP-Fuse is robust against common techniques for extracting training data.
Workshop papers
- Strong Copyright Protection for Language Models via Adaptive Model Fusion
Javier Abad*, Konstantin Donhauser*, Francesco Pinto, and Fanny Yang
ICML 2nd GenLaw Workshop, 2024
The risk of language models unintentionally reproducing copyrighted material from their training data has led to the development of various protective measures. In this paper, we propose model fusion as an effective solution to safeguard against copyright infringement. In particular, we introduce Copyright-Protecting Fusion (CP-Fuse), an algorithm that adaptively combines language models to minimize the reproduction of protected materials. CP-Fuse is inspired by the recently proposed Near-Access Free (NAF) framework and additionally incorporates a desirable balancing property that we demonstrate prevents the reproduction of memorized training data. Our results show that CP-Fuse significantly reduces the memorization of copyrighted content while maintaining high-quality text and code generation. Furthermore, we demonstrate how CP-Fuse can be integrated with other techniques for enhanced protection.
The easiest way to reach me is by emailing javier.abadmartinez@ai.ethz.ch. I am also looking forward to supervising motivated students in my fields of expertise. If you are interested, feel free to drop me a line.
You can also find me on LinkedIn, Google Scholar, and X.