The package multe implements contamination bias diagnostics for regressions with multiple treatments developed in Goldsmith-Pinkham, Hull, and Kolesár (2024). This vignette illustrates the methods using data from Fryer and Levitt (2013).

First, we fit a regression of test scores on a race dummy (treatment of interest) and a few controls, weighting using sampling weights:

library("multe")
## Regression of IQ at 24 months on race indicators
## and baseline controls
r1 <- stats::lm(std_iq_24 ~ race + factor(age_24) + female +
    SES_quintile, weight = W2C0, data = fl)
## Compute alternatives estimates free of
## contamination bias
m1 <- multe(r1, "race", cluster = NULL)
print(m1, digits = 3)
#> Estimates on full sample:
#>               PL     OWN     ATE      EW      CW
#> Black    -0.2574 -0.2482 -0.2655 -0.2550 -0.2604
#> SE        0.0281  0.0291  0.0298  0.0289  0.0292
#> Hispanic -0.2931 -0.2829 -0.2992 -0.2862 -0.2944
#> SE        0.0260  0.0267  0.0299  0.0268  0.0279
#> Asian    -0.2621 -0.2609 -0.2599 -0.2611 -0.2694
#> SE        0.0343  0.0343  0.0418  0.0343  0.0475
#> Other    -0.1563 -0.1448 -0.1503 -0.1447 -0.1522
#> SE        0.0369  0.0370  0.0359  0.0368  0.0370
#> 
#> P-values for null hypothesis of no propensity score variation:
#> Wald test: 0, LM test: 0 
#> 
#> SD(estimated propensity score), maximum over treatment arms:
#> Full sample: 0.196

The package reports five different estimators:

PL: The uninteracted regression estimator based on the partially linear (PL) model.
OWN: The own-treatment effect component of the contamination bias decomposition. If OWN is close to PL, as above, this indicates negligible contamination bias.
ATE: The unweighted average treatment effect, implemented using regression that interacts the treatment dummies with the controls.
EW: Weighted ATE estimator based on easiest-to-estimate weighting (EW) scheme, implemented by running one-treatment-at-a-time regressions.
CW: Weighted ATE estimator using easiest-to-estimate common weighting (CW) scheme from Corollary 2 in Goldsmith-Pinkham, Hull, and Kolesár (2024), implemented using weighted regression.

Precise definitions of these estimators are given in the Methods section below.

In this example, the propensity score varies significantly with covariates, as indicated by the p-values of the Wald and LM tests.

Including many controls may result in overlap failure, as the next example demonstrates:

r2 <- stats::lm(std_iq_24 ~ race + factor(age_24) + female +
    SES_quintile + factor(siblings) + family_structure,
    weight = W2C0, data = fl)
m2 <- multe(r2, treatment = "race")
#> For variable factor(siblings) the following levels fail overlap:
#> 6
#> Dropping observations with these levels
print(m2, digits = 3)
#> Estimates on full sample:
#>               PL     OWN     ATE      EW      CW
#> Black    -0.2438 -0.2043 -0.2482 -0.2180 -0.2408
#> SE        0.0308  0.0332  0.0355  0.0328  0.0389
#> Hispanic -0.2928 -0.2801 -0.2878 -0.2850 -0.2964
#> SE        0.0259  0.0267  0.0300  0.0267  0.0299
#> Asian    -0.2739 -0.2836 -0.2742 -0.2839 -0.2916
#> SE        0.0342  0.0343  0.0420  0.0343  0.0459
#> Other    -0.1520 -0.1277      NA -0.1295 -0.1459
#> SE        0.0369  0.0374      NA  0.0371  0.0385
#> 
#> Estimates on overlap sample:
#>               PL     OWN     ATE      EW      CW
#> Black    -0.2444 -0.2049 -0.2505 -0.2191 -0.2426
#> SE        0.0309  0.0334  0.0357  0.0329  0.0388
#> Hispanic -0.2915 -0.2791 -0.2871 -0.2839 -0.2974
#> SE        0.0259  0.0267  0.0300  0.0267  0.0299
#> Asian    -0.2766 -0.2863 -0.2769 -0.2865 -0.2924
#> SE        0.0344  0.0345  0.0421  0.0345  0.0459
#> Other    -0.1522 -0.1280 -0.1397 -0.1298 -0.1465
#> SE        0.0369  0.0374  0.0362  0.0371  0.0385
#> 
#> P-values for null hypothesis of no propensity score variation:
#> Wald test: 0, LM test: 0 
#> 
#> SD(estimated propensity score), maximum over treatment arms:
#> Full sample: 0.211, Overlap sample: 0.21

The issue is that no observations with 6 siblings have race equal to other:

table(fl$race[fl$siblings == 6])
#> 
#>    White    Black Hispanic    Asian    Other 
#>       18       10       12        5        0

Thus, the ATE estimator comparing other to white is not identified. The package drops observations with 6 siblings from the sample to form an “overlap sample” (see Methods section below for precise construction of this sample), where the all estimators are identified.

For a researcher who wants to check whether there is a significant difference between the PL estimator and the other estimators, the data frame cb_f reports the difference between the estimates in the full sample, along with the corresponding standard error. The data frame cb_o reports the differences and standard errors for the overlap sample:

print(m2$cb_f, digits = 3)
#>          PL      OWN       ATE       EW       CW
#> Black    NA -0.03947  0.004397 -0.02573 -0.00292
#> pop_se   NA  0.01204  0.017297  0.01027  0.02078
#> Hispanic NA -0.01269 -0.004962 -0.00783  0.00362
#> pop_se   NA  0.00571  0.013143  0.00497  0.01274
#> Asian    NA  0.00975  0.000265  0.01001  0.01769
#> pop_se   NA  0.00487  0.024595  0.00480  0.02847
#> Other    NA -0.02430        NA -0.02246 -0.00605
#> pop_se   NA  0.00738        NA  0.00684  0.01334
print(m2$cb_o, digits = 3)
#>          PL      OWN       ATE       EW       CW
#> Black    NA -0.03951  0.006062 -0.02537 -0.00185
#> pop_se   NA  0.01203  0.017395  0.01026  0.02060
#> Hispanic NA -0.01246 -0.004473 -0.00759  0.00589
#> pop_se   NA  0.00569  0.013110  0.00494  0.01252
#> Asian    NA  0.00963  0.000209  0.00984  0.01578
#> pop_se   NA  0.00488  0.024687  0.00481  0.02840
#> Other    NA -0.02423 -0.012490 -0.02240 -0.00573
#> pop_se   NA  0.00738  0.010968  0.00683  0.01326

We see statistically significant difference between the OWN and PL estimate (i.e. significant contamination bias) for all races, both in the full sample and in the overlap sample.

Standard errors

The package also computes “oracle” standard errors, in addition to the usual standard errors reported above. These can be accessed in the data frame est_f (or est_o for the overlap sample results)

print(m1$est_f, digits = 3)
#>                PL     OWN     ATE      EW      CW
#> Black     -0.2574 -0.2482 -0.2655 -0.2550 -0.2604
#> pop_se     0.0281  0.0291  0.0298  0.0289  0.0292
#> oracle_se      NA      NA  0.0298  0.0288  0.0290
#> Hispanic  -0.2931 -0.2829 -0.2992 -0.2862 -0.2944
#> pop_se     0.0260  0.0267  0.0299  0.0268  0.0279
#> oracle_se      NA      NA  0.0299  0.0268  0.0278
#> Asian     -0.2621 -0.2609 -0.2599 -0.2611 -0.2694
#> pop_se     0.0343  0.0343  0.0418  0.0343  0.0475
#> oracle_se      NA      NA  0.0418  0.0344  0.0465
#> Other     -0.1563 -0.1448 -0.1503 -0.1447 -0.1522
#> pop_se     0.0369  0.0370  0.0359  0.0368  0.0370
#> oracle_se      NA      NA  0.0359  0.0360  0.0366

These oracle standard errors (oracle_se) don’t account for estimation error in the propensity score, in contrast to the default standard errors (pop_se), see Methods section below.

Specifying the cluster argument allows for computation of clustered standard errors:

## cluster in interviewer ID
m1alt <- multe(r1, "race", cluster = factor(factor(fl$interviewer_ID_24)))
print(m1alt, digits = 3)
#> Estimates on full sample:
#>               PL     OWN     ATE      EW      CW
#> Black    -0.2574 -0.2482 -0.2655 -0.2550 -0.2604
#> SE        0.0412  0.0425  0.0410  0.0422  0.0420
#> Hispanic -0.2931 -0.2829 -0.2992 -0.2862 -0.2944
#> SE        0.0441  0.0454  0.0495  0.0457  0.0474
#> Asian    -0.2621 -0.2609 -0.2599 -0.2611 -0.2694
#> SE        0.0521  0.0523  0.0619  0.0523  0.0675
#> Other    -0.1563 -0.1448 -0.1503 -0.1447 -0.1522
#> SE        0.0404  0.0416  0.0410  0.0414  0.0428
#> 
#> P-values for null hypothesis of no propensity score variation:
#> Wald test: 0, LM test: 0 
#> 
#> SD(estimated propensity score), maximum over treatment arms:
#> Full sample: 0.196

Methods

This section describes the implementation of the bias decomposition formula and the implementation of alternative estimators. Relative to Goldsmith-Pinkham, Hull, and Kolesár (2024), we generalize the setup to allow for sampling weights ω_i² (setting the sampling weights to one recovers unweighted formulas). We also explicitly deal with overlap issues.

We are interested in the effect of treatment D_i ∈ {0, 1, …, K} on an outcome Y_i. Let $X_{i}=\1{D_{i}=1,\dotsc, D_{i}=K}$ denote a vector of treatment indicators, let $X_{i0}=\1{D_{i}=0}$, and let Z_i = (1, W_i′)′ denote a vector of controls, including an intercept. We focus on the case where the controls enter linearly, so that control functions take the form 𝒢 = {z′γ: γ ∈ ℝ^{1 + dim (W_i)}}.

We assume that μ_k(W_i) := E[Y_i(k) ∣ W_i] ∈ 𝒢, so that we may write μ_k(W_i) = W_i′α_k for some vectors α_k, k = 0, …, K. The average treatment effect (ATE) conditional on W_i is then given by τ(W_i) = Z_i′(α_k − α₀), and α_k correspond to the coefficients in the interacted regression with U̇_i conditionally mean zero. The uninteracted partially linear (PL) estimator is given by estimating by weighted least squares (WLS), yielding β̂ = (∑_iω_i²Ẋ_iẊ_i′)⁻¹∑_iω_i²Ẋ_iY_i, where Ẋ is the sample residual from WLS regression of X_i onto Z_i. By Proposition 1 in Goldsmith-Pinkham, Hull, and Kolesár (2024), the population analog of β̂, β, has the decomposition where Λ_i = E[X̃_iX̃_i′]⁻¹E[X̃_iX_i ∣ W_i], and X̃_i is the population analog of Ẋ_i, the population residual from regressing X_i onto W_i. Let α̂_k denote the WLS estimates based on . By construction, the sample residuals from estimating and Z_i are both orthogonal to Ẋ_i. As a result, we obtain the exact decomposition where Γ̂ is a matrix with columns γ̂_k = α̂_k − α̂₀, Λ̂_i = E_n[Ẋ_iẊ_i′]⁻¹Ẋ_iX_i′, and E_n[A_i] = ∑_iω_i²A_i/∑_iω_i² denotes the weighted sample mean.

To compute this decomposition, we don’t need to explicitly compute Λ̂_i. Instead, we use the identity where δ̂_kk is a WLS estimator of the system of regressions

Note this decomposition and associated standard errors, in the next subsection, are purely regression-based, so they remain valid even if X_i is not a set of binary indicators. Likewise, misspecification of the interacted regression only affects the interpretation of the decomposition; if μ_k is not linear, the decomposition will not consistently estimate the contamination bias.

The own treatment weights in this decomposition sum to one, and the contamination weights sum to zero, since E_n[Λ̂_i] = I_k, mimicking the population decomposition. If the propensity score doesn’t satisfy p_k ∈ 𝒢, the implied estimate of Λ(w), may not be positive definite; in particular, the diagonal elements may not all be positive, in line with Proposition 1 in Goldsmith-Pinkham, Hull, and Kolesár (2024).

In addition to this decomposition, the package also computes the following alternative estimators:

The unweighted ATE estimator, $\hbate_{k}=E_{n}[W_{i}]'\hat{\gamma}_{k}$
The one-treatment-at-a time estimator $\hbew_{k}$ that fits using only observations with D_i ∈ {0, k}. In other words, it estimates the regression among observations with D_i ∈ {0, k}. This estimator weights the treatment effects using the variance-minimizing weighting scheme given in Corollary 1 in Goldsmith-Pinkham, Hull, and Kolesár (2024). Consequently, we refer to as the efficiently weighted ATE estimator.
The common weights estimator $\hbcw$, given by the WLS regression of Y_i onto X_i, weighting each observation by where, by default, the probabilities π_k correspond to the marginal probability E_n[X_ik] in the dataset. The propensity scores p̂_k(W_i) are based on fitting a multinomial logit model for the treatments. This estimator estimates a weighted ATE with weights $\lambda^{\text{CW}}(W_{i})=\left(\sum_{k=0}^{K}\frac{\pi_{k}(1-\pi_{k})}{p_{k}(W_{i})}\right)^{-1}$. By Corollary 2 in Goldsmith-Pinkham, Hull, and Kolesár (2024), this weighting scheme minimizes the average variance, under homoskedasticity, across all treatment comparisons—comparisons of outcomes under treatment k vs treatment ℓ, if we draw the treatments k and ℓ independently from the marginal treatment distribution (π₀, …, π_K). Option cw_uniform=TRUE in the multe function sets these probabilities to 1/K; setting the option to its default, FALSE, sets them to (E_n[X_i0], …, E_n[X_iK]).

Standard errors

To compute cluster-robust standard errors for an asymptotically linear estimator with influence function ψ_i, we use the formula Here G_i denotes cluster membership, as specified by the multe argument cluster, and G the number of clusters. Specifying cluster=NULL assumes independent data, setting each observation to be in its own cluster (G_i = i and G = N), so the reported standard errors are robust to heteroskedasticity, but not clustering.

We now describe the form of the influence function for the estimators above. For a generic WLS regression of A onto B, let $(Q_{1},Q_{2}) \left(\begin{smallmatrix} R & S\\ 0 & 0 \end{smallmatrix}\right)\Pi'$ denote the QR decomposition of $\diag(\omega_{i})B$. If B has rank r, then R has dimension r × r, Q has dimension N × r, where r is the rank of the regressor matrix, and Π is a permutation matrix. The WLS estimator is then given by $b=\Pi\left(\begin{smallmatrix} R^{-1}Q_{1}'\diag(\omega_{i})A\\ \mathtt{NA}\end{smallmatrix}\right)$. Denoting the regression residual by ϵ̂_i, the influence function is thus given by See the internal function multe:::reg_if for implementation. The influence function for the inner product of linear estimators a and b, is by the delta method given by while for scalars s₁, s₂, ψ(s₁a + s₂b) = s₁ψ(a) + s₂β(b).

We use to compute ψ(α̂_k), as well as where ζ̂_ik is the WLS residual based on , and $\ddot{X}_{ik}$ is the residual from regressing X_ik onto X_i, −k and Z_i. It then follows from and the influence function formulas above, that where X̂_ik is the residual from regressing X_ik onto Z in the subset with D_i ∈ {0, k}, and Û_ik the residual from regressing Y_i onto X_ik and Z_i in this subsample.

When the treatment is binary and overlap holds, the formula for $\psi_{i}(\hbate_{k})$ is similar to that discussed on page 29 in Imbens and Wooldridge (2009), except we don’t assume that the regression error V_i in is conditionally mean zero, so that the standard error is robust to misspecification.

Derivations in the last section show that the influence function for the common weights estimator is given by $\hat{\psi}_{i}(\hbcw_{k})=\hat{\psi}_{i}(\hat{\alpha}^{\textnormal{CW}}_{k}-\hat{\alpha}^{\textnormal{CW}}_{k})$, where

with the formula for a_i given in below, θ corresponds to the parameters in the multinomial logit model, π_k(W_i; θ̂) to the fitted probabilities in this model, and $\hat{\alpha}^{\textnormal{CW}}_{k}$ is the estimate based on below.

Oracle standard errors

The package also reports “oracle” standard errors, which interprets the alternative estimators as estimates of the contrasts with λ(W_i) = 1 for the unweighted ATE, $\lambda(W_{i})=\lcw(W_{i})$ for the common weights estimator and $\lambda(W_{i})=\frac{p_{k}(W_{i})p_{0}(W_{i})}{p_{k}(W_{i})+p_{0}(W_{i})}$ for the efficiently weighted ATE estimator. In contrast, the standard errors in the previous subsection set the estimands to be the population counterparts to these weighted averages, replacing the sums in the above display with population expectations. In addition, the oracle standard errors don’t account for estimation error in the propensity score p(W_i).

For the unweighted ATE, the oracle standard error is based on the influence function ψ̃_i(β̂_k^ATE) = Z̄′ψ_i(α̂_k − α̂₀). From the derivation in the last section, it follows that the oracle standard error for $\hbew_{k}$ is given by where $\hat{\dot{U}}_{i}$ is the interacted regression residual based on .

Finally, the oracle standard errors for $\hbcw_{k}$ are based on the influence function $\tilde{\psi}_{i}(\hbcw_{k})=\tilde{\psi}_{i}(\hat{\alpha}^{\text{CW}}_{k})-\tilde{\psi}_{i}(\hat{\alpha}^{\text{CW}}_{0})$, where

Overlap sample

The package applies the above formulas to the full sample. In cases with poor overlap, this may not yield well-defined estimates or bias decomposition for all treatments. For components of the decomposition and alternative estimators that are not identified, the package returns NA. In such cases, the package also returns results for a trimmed “overlap sample”, where the decomposition and alternative estimators are all identified. The overlap sample is constructed as follows:

Find a factor variable among the controls with the greatest number of levels. If there are no factor variables, skip this step. If for some levels of this variable, we don’t see observations that take on one or more of the K + 1 possible treatments, drop observations with these levels.
For the remaining controls, if a control doesn’t display any variation in the subset of the data with treatment k = 0, …, K, drop the control.

Wald and LM tests

We now give the form of the Wald and LM tests for variation in the propensity score. First, we give a general derivation of these tests in a likelihood context when the Hessian may be reduced rank. We then specialize the formulas to the case where the likelihood corresponds to the that for the multinomial logit model.

Consider a log-likelihood ℓ_n(θ) for a p-dimensional parameter θ, with score function S that’s approximately normal with covariance matrix Ω, and Hessian H. We’re interested in testing the hypothesis that last r elements of θ are zero, H₀: θ₂ = 0. We assume that the submatrix H₁₁ of the Hessian corresponding to the restricted model is full rank, but the full matrices Ω or H may not be invertible.

The score evaluated at the unrestricted estimator θ̂ satisfies ignoring in the notation that the Hessian evaluated needs to be evaluated at intermediate values. Rearranging, This yields the Wald statistic where A⁺ denotes a generalized inverse. By Lemma 9.7 in Newey and McFadden (1994), the statistic has an asymptotic χ² distribution with degrees of freedom equal to the rank of the variance.

The score evaluated at the restricted estimator θ̄₁ satisfies which implies θ̄₁ − θ₁ = −H₁₁⁻¹S₁(θ₁, 0), and hence Thus the statistic will again have a χ² distribution.

To apply these formulas in the context of a multinomial logit model, we use the score and the Hessian

Derivations

We first derive . Observe first that the common weights estimator is identical to the two-step GMM estimator that in the first step, fits a multinomial logit model with the normalization θ₀ = 0. In the second step, we use the moment condition to obtain estimates α̂_k^CW, and set $\hbcw_k=\hat\alpha^{\text{CW}}_{k}-\hat\alpha^{\text{CW}}_{0}$.

Let By equation (6.6) in Newey and McFadden (1994), the influence function of this two-step estimator is given by where ψ_i(θ) is the influence function of the multinomial logit estimator θ̂, and M_k(θ) is the derivative of wrt θ.

Since $\partial \zeta_{k}(W_{i};\theta)/\partial \theta_{j}= Z_{i}\zeta_{k}(W_{i};\theta)[\pi_{j}(1-\pi_{j})\zeta_{k}(W_{i};\theta)-\1{k=j>0}]$, it follows that Since the multinomial logit log-likelihood is given by $\ell_{i}=\sum_{k=0}^{K}X_{k}\log(\pi_{k})=\sum_{k=0}^{K}X_{k}Z_{i}'\theta_{k} -\log(\sum_{k=0}^{K}e^{Z_{i}'\theta_{k}})$, the score and the Hessian are Since ψ_i(θ) = −H(θ)⁻¹S_i(θ), this yields with M̂_k(θ) = (∑_i(ζ − e_k) ⊗ Z_i ⋅ ζ_kX_ik(Y_i − α_λ^C, k)), and $\hat{H}({\theta})= \sum_{i}(\diag(\pi(W_{i}))-\pi(W_{i})\pi(W_{i})')\otimes Z_{i}Z_{i}'$. When π_k = 1/(K + 1) this formula reduces to that in Theorem 1 in Li and Li (2019).

Next, we show . Note that it follows from Proposition 2 in Goldsmith-Pinkham, Hull, and Kolesár (2024) that the efficient influence function is given by where r_k = r_k(W_i) = E[X_ik ∣ W_i, X_ik + X_i0 = 1]. The result then follows since X̂_ik is an estimator of X_ik − r_k(W_i).

Finally, we show . The derivative of the moment condition with respect to π_k = p_k (assuming correct specification of the propensity score) is given by where we write λ for $\lcw(W_{i})$. Since p_k is a projection, by Proposition 4 in Newey (1994), the influence function for α̂_k^CW is given by Next, as noted in Abadie, Imbens, and Zheng (2014), we can view α̃_k^CW = ∑_iλμ_k(W_i)/∑_iλ as an estimator of α_k^CW based on the moment condition E[λ(μ_k(W_i) − α_k^CW)] = 0, which by standard arguments has influence function given by $\frac{\lambda}{E[\lambda]}(\mu_{k}(W_{i})-{\alpha}^{\text{CW}}_{k})$. Since α̂_k^CW − α_k^CW = (α̂_k^CW − α_k^CW) − (α_k^CW − α̃_k^CW), we subtract this influence function from the preceding display to obtain .

References

Abadie, Alberto, Guido W. Imbens, and Fanyin Zheng. 2014. “Inference for Misspecified Models with Fixed Regressors.” Journal of the American Statistical Association 109 (508): 1601–14. https://doi.org/10.1080/01621459.2014.928218.

Fryer, Roland G, and Steven D Levitt. 2013. “Testing for Racial Differences in the Mental Ability of Young Children.” American Economic Review 103 (2): 981–1005. https://doi.org/10.1257/aer.103.2.981.

Goldsmith-Pinkham, Paul, Peter Hull, and Michal Kolesár. 2024. “Contamination Bias in Linear Regressions.” https://arxiv.org/abs/2106.05024.

Imbens, Guido W., and Jeffrey Wooldridge. 2009. “Recent Developments in the Econometrics of Program Evaluation.” Journal of Economic Literature 47 (1): 5–86. https://doi.org/10.1257/jel.47.1.5.

Li, Fan, and Fan Li. 2019. “Propensity Score Weighting for Causal Inference with Multiple Treatments.” The Annals of Applied Statistics 13 (4): 2389–2415. https://doi.org/10.1214/19-AOAS1282.

Newey, Whitney K. 1994. “The Asymptotic Variance of Semiparametric Estimators.” Econometrica 62 (6): 1349–82. https://doi.org/10.2307/2951752.

Newey, Whitney K., and Daniel L. McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, edited by Robert F. Engle and Daniel L. McFadden, 4:2111–2245. New York, NY: Elsevier. https://doi.org/10.1016/S1573-4412(05)80005-4.

Multiple Treatment Effects Regression

Standard errors

Methods

Standard errors

Oracle standard errors

Overlap sample

Wald and LM tests

Derivations

References