Efficient randomized experiments using foundation models
Randomized experiments are the gold standard for estimating causal effects, but in practice sample sizes are often limited by budget and recruitment constraints, leading to imprecise treatment effect estimates.
At the same time, foundation models can predict outcomes in many domains where we run experiments. This creates an appealing opportunity to combine experimental data with model predictions to make the estimates from randomized experiments more precise. However, naively incorporating predictions from these models can introduce significant bias and invalidate the statistical guarantees of the randomized experiment.
Therefore, we aim to develop a “safe” method for integrating foundation model predictions into randomized experiments. Specifically, we want an estimator that meets the following desiderata:
- integrates foundation model predictions to reduce variance,
- has valid inference guarantees even when the models are not accurate,
- has asymptotic variance no larger than that of the efficient estimator based on experimental data alone.
Towards this goal, we introduce Hybrid Augmented Inverse Probability Weighting (H-AIPW) [1], an estimator that meets all the desiderata outlined above. Across eight randomized experiments spanning various scientific disciplines, H-AIPW consistently improves the precision of treatment effect estimates, with variance reductions on the order of 5–15% relative to the standard AIPW estimator based only on experimental data.
The efficient estimator based on experimental data alone
We consider a randomized experiment with binary treatment assignment \(A \in \{0,1\}\), known randomization probabilities \(\pi_a = \Pr(A = a)\), outcome \(Y \in \mathbb R\), and covariates \(X\). We observe an i.i.d. sample \(\{(X_i, A_i, Y_i)\}_{i=1}^n\) from the target population. Our parameter of interest is the average treatment effect (ATE)
\[\theta = \mathbb{E}\{Y(1)\} - \mathbb{E}\{Y(0)\},\]where \(Y(a)\) denotes the potential outcome under treatment \(A = a\).
We can estimate \(\theta\) using an estimator in the family of Augmented Inverse Probability Weighting (AIPW) [2]. In this family, each estimator is indexed by an outcome regression \(h(x,a)\) and has the form
\[\widehat\theta_{\mathrm{AIPW}}(h) = \frac{1}{n}\sum_{i=1}^n \psi_i(h),\]where the AIPW estimating function is
\[\psi_i(h) = h(X_i,1) - h(X_i,0) + \frac{A_i}{\pi_1}\bigl\{Y_i - h(X_i,1)\bigr\} - \frac{1 - A_i}{\pi_0}\bigl\{Y_i - h(X_i,0)\bigr\}.\]Semiparametric efficiency theory establishes that the efficient estimator in this class is obtained by taking \(h\) equal to the true conditional mean
\[h^\star(x,a) = \mathbb{E}[Y \mid X = x, A = a].\]\(\widehat\theta_{\mathrm{AIPW}}(h^\star)\) attains the semiparametric efficiency bound and therefore has the smallest possible asymptotic variance among all regular estimators.
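Concretely, \(\widehat\theta_{\mathrm{AIPW}}(h)\) is just the sample average of \(\psi_i(h)\). Here is a minimal NumPy sketch (our own illustration with hypothetical names, not the authors' reference code):

```python
import numpy as np

def aipw_estimate(X, A, Y, h, pi1):
    """AIPW point estimate and standard error for an outcome regression h.

    h(X, a) returns predicted outcomes for every row of X under arm a;
    pi1 is the known randomization probability Pr(A = 1).
    """
    pi0 = 1.0 - pi1
    h1, h0 = h(X, 1), h(X, 0)
    # Per-observation estimating function psi_i(h) from the display above.
    psi = h1 - h0 + A / pi1 * (Y - h1) - (1 - A) / pi0 * (Y - h0)
    theta_hat = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))  # plug-in standard error
    return theta_hat, se
```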
In practice, however, the true regression function \(h^\star\) is unknown and must be estimated from the experimental data. Because randomized experiments often have modest sample sizes, it is common to approximate \(h^\star\) with a simple parametric model, for example a linear regression of \(Y\) on \((X,A)\). However, this modelling choice can be substantially misspecified when the relationship between \((X,A)\) and \(Y\) is complex, which in turn limits the efficiency gains achievable by AIPW.
The H-AIPW estimator
In many applications we now have access to predictive models trained on massive datasets; e.g. large language models that can be prompted to predict outcomes given covariates and treatment. These models may approximate \(h^\star\) much better than a simple linear regression, especially in high-dimensional or multi-modal problems. The challenge is to take advantage of these potentially powerful external regressions without compromising the experiment’s validity, that is, while still guaranteeing valid inference. Further, we would like the new estimator to never perform worse than the AIPW estimator that relies solely on the experimental data.
To achieve this, we do not commit to a single outcome regression model. Instead, we work with a collection of candidate regressions \(\{f_1, \dots, f_k\},\) where
- \(f_1\) is the linear regression model fitted on \(\{(X_i,A_i,Y_i)\}_{i=1}^n\), and
- \(f_2,\dots,f_k\) are regressions derived from foundation models (e.g. different models or prompting strategies) that use external information to predict outcomes.
Each regression \(f_j\) induces a valid AIPW estimator
\[\hat\theta_j \equiv \hat\theta_{\mathrm{AIPW}}(f_j)= \frac{1}{n} \sum_{i=1}^n \psi_i(f_j).\]The H-AIPW estimator combines these candidates in a data-driven way using weighted averages of the form
\[\hat\theta_\lambda \equiv \sum_{j=1}^k \lambda_j \hat\theta_j,~~\text{s.t.}~~ \sum_{j=1}^k \lambda_j = 1.\]Crucially, for any choice of \(\lambda\), the combined estimator \(\hat\theta_\lambda\) is still in the class of AIPW estimators. The weights are chosen to minimize an estimate of the asymptotic variance. Concretely, letting \(\hat\Sigma\) denote the estimated covariance matrix of \((\hat\theta_1,\dots,\hat\theta_k)\), we solve
\[\hat\lambda = \arg\min_{\lambda : \sum_{j=1}^k \lambda_j = 1} \lambda^\top \hat\Sigma \lambda,~~\text{and define}~~\hat\theta_{\text{H-AIPW}} \equiv \hat\theta_{\hat\lambda}.\]Because \(\hat\theta_{\text{H-AIPW}}\) remains in the AIPW class, it inherits the usual properties of AIPW estimators that allow for valid inference. Moreover, its asymptotic variance is no larger than that of any individual candidate \(\hat\theta_j\), including the estimator based on experimental data alone. Thus, if the foundation models are informative for the experiment at hand, H-AIPW gains efficiency; if they are not informative, it automatically falls back toward the best estimator based on the experimental data alone, providing a safe way to leverage external models.
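The constrained minimization above has a closed-form solution, \(\hat\lambda = \hat\Sigma^{-1}\mathbf{1} / (\mathbf{1}^\top \hat\Sigma^{-1} \mathbf{1})\). Below is a minimal NumPy sketch of the combination step, assuming \(\hat\Sigma\) is estimated from the empirical covariance of the estimating functions \(\psi_i(f_j)\); the function and variable names are our own, not the paper's reference code.

```python
import numpy as np

def haipw_combine(psi_matrix):
    """Combine k candidate AIPW estimators into the H-AIPW estimate.

    psi_matrix is (n, k): column j holds the estimating-function values
    psi_i(f_j), so the column means are the candidate estimates theta_j.
    """
    n, k = psi_matrix.shape
    theta = psi_matrix.mean(axis=0)
    # Covariance of the k estimates, estimated from the covariance of the psi's.
    Sigma = np.cov(psi_matrix, rowvar=False) / n
    ones = np.ones(k)
    # Closed-form minimizer of lambda' Sigma lambda subject to sum(lambda) = 1;
    # a pseudoinverse guards against nearly collinear candidates.
    w = np.linalg.pinv(Sigma) @ ones
    lam = w / (ones @ w)
    return lam @ theta, np.sqrt(lam @ Sigma @ lam), lam
```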
Using language models as outcome regressions
To make this practical, we now instantiate H-AIPW with large language models in social science survey experiments. The key idea is to treat LLMs as black-box predictors of potential outcomes.
For each participant, we (see the sketch after this list):
- Extract covariates (e.g. age, gender, ideology, income, religion).
- Build a short persona that summarizes these covariates.
- Provide the treatment or control text and the survey question as the user prompt.
- Ask the LLM to output a numeric response on the same scale as the original outcome (e.g. a 1–5 Likert scale).
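A minimal sketch of these four steps; the prompt wording and the `llm` interface are hypothetical stand-ins, not the exact prompts used in the paper:

```python
def predict_outcome(llm, covariates, arm_text, question):
    """Ask an LLM to role-play a participant and predict their response.

    llm: any callable mapping (system_prompt, user_prompt) to a string.
    Returns a prediction on the original outcome scale (here 1-5 Likert).
    """
    # Steps 1-2: build a short persona from the participant's covariates.
    persona = (f"You are a {covariates['age']}-year-old {covariates['gender']} "
               f"with {covariates['ideology']} political views, income bracket "
               f"{covariates['income']}, and religion {covariates['religion']}.")
    # Step 3: the treatment or control text plus the survey question.
    user_prompt = (f"You just read the following:\n{arm_text}\n\n{question}\n"
                   "Reply with a single number from 1 to 5.")
    # Step 4: parse the numeric reply back onto the outcome scale.
    return float(llm(persona, user_prompt).strip())
```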
We can thus obtain predicted outcomes under treatment and control from several LLMs. Each such prediction scheme defines a foundation model \(f_j\), and therefore an AIPW estimator \(\hat\theta_{\mathrm{AIPW}}(f_j)\). H-AIPW will combine these estimators with the standard AIPW estimator based on the experimental data alone.
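Tying the pieces together, here is a hypothetical end-to-end call (reusing `haipw_combine` from the sketch above; `fit_linear`, `llm_regression`, and the data arrays `X, A, Y` are placeholders):

```python
import numpy as np

# Candidate outcome regressions, each a callable f(X, a) -> predictions under arm a.
candidates = [fit_linear(X, A, Y),         # f_1: fitted on the experiment itself
              llm_regression("model-a"),   # f_2, ..., f_k: foundation-model
              llm_regression("model-b")]   # prediction schemes

pi1, pi0 = 0.5, 0.5  # known randomization probabilities
psi_cols = []
for f in candidates:
    h1, h0 = f(X, 1), f(X, 0)
    psi_cols.append(h1 - h0 + A / pi1 * (Y - h1) - (1 - A) / pi0 * (Y - h0))

theta, se, lam = haipw_combine(np.column_stack(psi_cols))
```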
Experimental results
We evaluate H-AIPW on eight randomized experiments in economics, psychology, political science, foreign policy, and sociology, drawn from the TESS repository. Each study measures an average treatment effect on an attitudinal outcome (e.g., beliefs about terrorism risk, racial discrimination, environmental behavior).
To mimic the small-sample regime typical of many experiments, we subsample each full dataset to target sample sizes \(n = 100\) and \(n = 200\). We compare the following estimators:
- H-AIPW,
- Difference-in-means (DM),
- AIPW (standard) with a linear outcome regression fitted on the experimental data only,
- AIPW (boosting) with a more flexible boosting model,
- PPCT [3] and PROCOVA [4], two recent methods that also leverage foundation models.
For each estimator, we report the variance multiplied by \(n\), averaged over 10,000 subsampling repetitions.
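Our own sketch of this evaluation protocol (the resampling details below are assumptions on our part, not the paper's evaluation code):

```python
import numpy as np

def scaled_variance(data, estimator, n, reps=10_000, seed=0):
    """Approximate n * Var(estimator) by subsampling the full study."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        # Draw a subsample of size n without replacement (data must support
        # fancy indexing, e.g. a NumPy record array).
        idx = rng.choice(len(data), size=n, replace=False)
        estimates[r] = estimator(data[idx])
    return n * estimates.var(ddof=1)
```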
Figure 3. Scaled variance of several estimators across eight randomized experiments and two sample sizes. Smaller values indicate better precision. The blue row “AIPW (standard)” is the experimental-data-only baseline that H-AIPW aims to improve upon.
FAQ
Does this replace randomized experiments with model simulations?
No. H-AIPW requires randomized data and maintains the same identification assumptions as classical ATE estimators. Foundation models are only used to improve precision, not to identify causal effects on their own.
What if the foundation models are significantly biased?
Our results allow the foundation models to be arbitrarily biased. If their predictions are not helpful, the covariance-based weighting shrinks their contribution, and H-AIPW behaves like the standard AIPW estimator. Asymptotically, the variance of H-AIPW is no worse than that of the standard AIPW estimator, and the H-AIPW estimate remains consistent despite the bias.
How is this related to prediction-powered inference (PPI)?
In the randomized experiment ATE estimation setting, PPI-based estimators (e.g. PPCT) are actually special cases of our H-AIPW framework, and therefore of AIPW itself. They correspond (up to some minor variations) to using a single external prediction model as the outcome regression and combining it only with the trivial “zero” model that yields the difference-in-means estimator. In contrast, our method has two crucial differences that make it more efficient and better suited to treatment effect estimation.
First, we explicitly allow different black-box models for the treated and control outcome regressions, rather than forcing a single model to fit both arms. Second, we also include the standard AIPW estimator based on experimental data in the convex combination. As a result, the H-AIPW estimator can be efficient and attain the semiparametric efficiency bound when the standard AIPW model is well-specified. In contrast, PPI-based methods can only achieve the efficiency bound if the black-box model is equal to the true outcome regression.
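To see the “zero model” correspondence concretely: plugging \(h \equiv 0\) into the AIPW estimating function gives
\[\psi_i(0) = \frac{A_i}{\pi_1} Y_i - \frac{1 - A_i}{\pi_0} Y_i,\qquad \widehat\theta_{\mathrm{AIPW}}(0) = \frac{1}{n}\sum_{i=1}^n \frac{A_i Y_i}{\pi_1} - \frac{1}{n}\sum_{i=1}^n \frac{(1 - A_i) Y_i}{\pi_0},\]which reduces to the difference in means between arms once the known \(\pi_a\) are replaced by the empirical arm proportions \(n_a/n\).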
References
[1] De Bartolomeis, Piersilvio, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M. Duch, Fanny Yang, and Issa J. Dahabreh. “Efficient randomized experiments using foundation models.” Advances in Neural Information Processing Systems (NeurIPS) (2025).
[2] Robins, James M., and Andrea Rotnitzky. “Semiparametric efficiency in multivariate regression models with missing data.” Journal of the American Statistical Association 90.429 (1995): 122-129.
[3] Poulet, Pierre-Emmanuel, Maylis Tran, Sophie Tezenas du Montcel, Bruno Dubois, Stanley Durrleman, and Bruno Jedynak. “Prediction-powered inference for clinical trials.” medRxiv (2025).
[4] Liao, Lauren, Emilie Højbjerre-Frandsen, Alan Hubbard, and Alejandro Schuler. “Prognostic adjustment with efficient estimators to unbiasedly leverage historical data in randomized trials.” arXiv preprint arXiv:2305.19180 (2023).