The package multe
implements contamination bias diagnostics for regressions with multiple
treatments developed in Goldsmith-Pinkham et al.
(2024). This vignette illustrates the methods using data from
Fryer and Levitt (2013).
First, we fit a regression of test scores on a race dummy (treatment of interest) and a few controls, weighting using sampling weights:
library("multe")
## Regression of IQ at 24 months on race indicators
## and baseline controls
r1 <- stats::lm(std_iq_24 ~ race + factor(age_24) + female +
SES_quintile, weight = W2C0, data = fl)
## Compute alternatives estimates free of
## contamination bias
m1 <- multe(r1, "race", cluster = NULL)
print(m1, digits = 3)
#> Estimates on full sample:
#> PL OWN ATE EW CW
#> Black -0.2574 -0.2482 -0.2655 -0.2550 -0.2604
#> SE 0.0281 0.0291 0.0298 0.0289 0.0292
#> Hispanic -0.2931 -0.2829 -0.2992 -0.2862 -0.2944
#> SE 0.0260 0.0267 0.0299 0.0268 0.0279
#> Asian -0.2621 -0.2609 -0.2599 -0.2611 -0.2694
#> SE 0.0343 0.0343 0.0418 0.0343 0.0475
#> Other -0.1563 -0.1448 -0.1503 -0.1447 -0.1522
#> SE 0.0369 0.0370 0.0359 0.0368 0.0370
#>
#> P-values for null hypothesis of no propensity score variation:
#> Wald test: 0, LM test: 0
#>
#> SD(estimated propensity score), maximum over treatment arms:
#> Full sample: 0.196The package reports five different estimators:
Precise definitions of these estimators are given in the Methods section below.
In this example, the propensity score varies significantly with covariates, as indicated by the p-values of the Wald and LM tests.
Including many controls may result in overlap failure, as the next example demonstrates:
r2 <- stats::lm(std_iq_24 ~ race + factor(age_24) + female +
SES_quintile + factor(siblings) + family_structure,
weight = W2C0, data = fl)
m2 <- multe(r2, treatment = "race")
#> For variable factor(siblings) the following levels fail overlap:
#> 6
#> Dropping observations with these levels
print(m2, digits = 3)
#> Estimates on full sample:
#> PL OWN ATE EW CW
#> Black -0.2438 -0.2043 -0.2482 -0.2180 -0.2408
#> SE 0.0308 0.0332 0.0355 0.0328 0.0389
#> Hispanic -0.2928 -0.2801 -0.2878 -0.2850 -0.2964
#> SE 0.0259 0.0267 0.0300 0.0267 0.0299
#> Asian -0.2739 -0.2836 -0.2742 -0.2839 -0.2916
#> SE 0.0342 0.0343 0.0420 0.0343 0.0459
#> Other -0.1520 -0.1277 NA -0.1295 -0.1459
#> SE 0.0369 0.0374 NA 0.0371 0.0385
#>
#> Estimates on overlap sample:
#> PL OWN ATE EW CW
#> Black -0.2444 -0.2049 -0.2505 -0.2191 -0.2426
#> SE 0.0309 0.0334 0.0357 0.0329 0.0388
#> Hispanic -0.2915 -0.2791 -0.2871 -0.2839 -0.2974
#> SE 0.0259 0.0267 0.0300 0.0267 0.0299
#> Asian -0.2766 -0.2863 -0.2769 -0.2865 -0.2924
#> SE 0.0344 0.0345 0.0421 0.0345 0.0459
#> Other -0.1522 -0.1280 -0.1397 -0.1298 -0.1465
#> SE 0.0369 0.0374 0.0362 0.0371 0.0385
#>
#> P-values for null hypothesis of no propensity score variation:
#> Wald test: 0, LM test: 0
#>
#> SD(estimated propensity score), maximum over treatment arms:
#> Full sample: 0.211, Overlap sample: 0.21The issue is that no observations with 6 siblings have race equal to
other:
Thus, the ATE estimator comparing other to
white is not identified. The package drops observations
with 6 siblings from the sample to form an “overlap sample” (see Methods
section below for precise construction of this sample), where the all
estimators are identified.
For a researcher who wants to check whether there is a significant
difference between the PL estimator and the other estimators, the data
frame cb_f reports the difference between the estimates in
the full sample, along with the corresponding standard error. The data
frame cb_o reports the differences and standard errors for
the overlap sample:
print(m2$cb_f, digits = 3)
#> PL OWN ATE EW CW
#> Black NA -0.03947 0.004397 -0.02573 -0.00292
#> pop_se NA 0.01204 0.017297 0.01027 0.02078
#> Hispanic NA -0.01269 -0.004962 -0.00783 0.00362
#> pop_se NA 0.00571 0.013143 0.00497 0.01274
#> Asian NA 0.00975 0.000265 0.01001 0.01769
#> pop_se NA 0.00487 0.024595 0.00480 0.02847
#> Other NA -0.02430 NA -0.02246 -0.00605
#> pop_se NA 0.00738 NA 0.00684 0.01334
print(m2$cb_o, digits = 3)
#> PL OWN ATE EW CW
#> Black NA -0.03951 0.006062 -0.02537 -0.00185
#> pop_se NA 0.01203 0.017395 0.01026 0.02060
#> Hispanic NA -0.01246 -0.004473 -0.00759 0.00589
#> pop_se NA 0.00569 0.013110 0.00494 0.01252
#> Asian NA 0.00963 0.000209 0.00984 0.01578
#> pop_se NA 0.00488 0.024687 0.00481 0.02840
#> Other NA -0.02423 -0.012490 -0.02240 -0.00573
#> pop_se NA 0.00738 0.010968 0.00683 0.01326We see statistically significant difference between the OWN and PL estimate (i.e. significant contamination bias) for all races, both in the full sample and in the overlap sample.
The package also computes “oracle” standard errors, in addition to
the usual standard errors reported above. These can be accessed in the
data frame est_f (or est_o for the overlap
sample results)
print(m1$est_f, digits = 3)
#> PL OWN ATE EW CW
#> Black -0.2574 -0.2482 -0.2655 -0.2550 -0.2604
#> pop_se 0.0281 0.0291 0.0298 0.0289 0.0292
#> oracle_se NA NA 0.0298 0.0288 0.0290
#> Hispanic -0.2931 -0.2829 -0.2992 -0.2862 -0.2944
#> pop_se 0.0260 0.0267 0.0299 0.0268 0.0279
#> oracle_se NA NA 0.0299 0.0268 0.0278
#> Asian -0.2621 -0.2609 -0.2599 -0.2611 -0.2694
#> pop_se 0.0343 0.0343 0.0418 0.0343 0.0475
#> oracle_se NA NA 0.0418 0.0344 0.0465
#> Other -0.1563 -0.1448 -0.1503 -0.1447 -0.1522
#> pop_se 0.0369 0.0370 0.0359 0.0368 0.0370
#> oracle_se NA NA 0.0359 0.0360 0.0366These oracle standard errors (oracle_se) don’t account
for estimation error in the propensity score, in contrast to the default
standard errors (pop_se), see Methods section below.
Specifying the cluster argument allows for computation
of clustered standard errors:
## cluster in interviewer ID
m1alt <- multe(r1, "race", cluster = factor(factor(fl$interviewer_ID_24)))
print(m1alt, digits = 3)
#> Estimates on full sample:
#> PL OWN ATE EW CW
#> Black -0.2574 -0.2482 -0.2655 -0.2550 -0.2604
#> SE 0.0412 0.0425 0.0410 0.0422 0.0420
#> Hispanic -0.2931 -0.2829 -0.2992 -0.2862 -0.2944
#> SE 0.0441 0.0454 0.0495 0.0457 0.0474
#> Asian -0.2621 -0.2609 -0.2599 -0.2611 -0.2694
#> SE 0.0521 0.0523 0.0619 0.0523 0.0675
#> Other -0.1563 -0.1448 -0.1503 -0.1447 -0.1522
#> SE 0.0404 0.0416 0.0410 0.0414 0.0428
#>
#> P-values for null hypothesis of no propensity score variation:
#> Wald test: 0, LM test: 0
#>
#> SD(estimated propensity score), maximum over treatment arms:
#> Full sample: 0.196This section describes the implementation of the bias decomposition formula and the implementation of alternative estimators. Relative to Goldsmith-Pinkham et al. (2024), we generalize the setup to allow for sampling weights \(\omega_{i}^{2}\) (setting the sampling weights to one recovers unweighted formulas). We also explicitly deal with overlap issues.
We are interested in the effect of treatment \(D_{i}\in\{0,1,\dotsc, K\}\) on an outcome \(Y_{i}\). Let \(X_{i}=\1{D_{i}=1,\dotsc, D_{i}=K}\) denote a vector of treatment indicators, let \(X_{i0}=\1{D_{i}=0}\), and let \(Z_{i}=(1,W_{i}')'\) denote a vector of controls, including an intercept. We focus on the case where the controls enter linearly, so that control functions take the form \(\mathcal{G}=\{z'\gamma\colon \gamma\in\mathbb{R}^{1+\dim(W_i)}\}\).
We assume that \(\mu_{k}(W_i):=E[Y_{i}(k)\mid W_{i}]\in\mathcal{G}\), so that we may write \(\mu_{k}(W_{i})=W_{i}'\alpha_{k}\) for some vectors \(\alpha_{k}\), \(k=0,\dotsc, K\). The average treatment effect (ATE) conditional on \(W_{i}\) is then given by \(\tau(W_{i})=Z_{i}'(\alpha_{k}-\alpha_{0})\), and \(\alpha_{k}\) correspond to the coefficients in the interacted regression \[\begin{equation} Y_{i}=\sum_{k=0}^{K}X_{ik}Z_{i}'\alpha_{k}+\dot{U}_{i}\label{eq:linear_mu}, % =Z_{i}'\alpha_{0}+\sum_{k=1}^{K}X_{ik}Z_{i}'\gamma_{k}\label{eq:linear_gamma} % &=\alpha_{0, k}+W_{i}'\alpha_{W, 0}+\sum_{k=1}^{K}X_{ik}\tau_{k}+\sum_{k=1}^{K}X_{ik}(W_{i}-E[W_{i}])'\gamma_{W, k}. % \label{eq:linear_tau} \end{equation}\] with \(\dot{U}_{i}\) conditionally mean zero. The uninteracted partially linear (PL) estimator is given by estimating \[\begin{equation}\label{eq:uninteracted_linear} Y_i=\sum_{k=1}^{K} X_{ik}\beta + Z_i'\phi +U_{i}, \end{equation}\] by weighted least squares (WLS), yielding \(\hat{\beta}=(\sum_{i}\omega_{i}^{2}\dot{X}_{i}\dot{X}_{i}')^{-1}\sum_{i}\omega_{i}^{2}\dot{X}_{i}Y_{i}\), where \(\dot{X}\) is the sample residual from WLS regression of \(X_{i}\) onto \(Z_{i}\). By Proposition 1 in Goldsmith-Pinkham et al. (2024), the population analog of \(\hat{\beta}\), \(\beta\), has the decomposition \[\begin{equation*} \beta=E[\diag(\Lambda_{i})\tau(W_{i})]+E[\Lambda_{i}-\diag(\Lambda_{i})\tau(W_{i})], \end{equation*}\] where \(\Lambda_{i}=E[\tilde{X}_{i}\tilde{X}_{i}']^{-1}E[\tilde{X}_{i}X_{i}\mid W_{i}]\), and \(\tilde{X}_{i}\) is the population analog of \(\dot{X}_{i}\), the population residual from regressing \(X_{i}\) onto \(W_{i}\). Let \(\hat{\alpha}_{k}\) denote the WLS estimates based on \(\eqref{eq:linear_mu}\). By construction, the sample residuals from estimating \(\eqref{eq:linear_mu}\) and \(Z_{i}\) are both orthogonal to \(\dot{X}_{i}\). As a result, we obtain the exact decomposition \[\begin{equation}\label{eq:1} \begin{split} \hat{\beta} &=E_{n}[\dot{X}_{i}\dot{X}_{i}']^{-1}E_{n}[\dot{X}_{i}Y_{i}]= % E_{n}[\dot{X}_{i}\dot{X}_{i}']^{-1} % E_{n}[\dot{X}_{i}\sum_{k=1}^{K}X_{ik}Z_{i}'(\hat{\alpha}_{k}-\hat{\alpha}_{0})] \\ & = % E_{n}[\hat{\Lambda}_{i}\hat{\Gamma}'Z_{i}]= E_{n}[\diag(\hat{\Lambda}_{i})\hat{\Gamma}'Z_{i}]+E_{n}[(\hat{\Lambda}_{i}-\diag(\hat{\Lambda}_{i}))\hat{\Gamma}'Z_{i}] =:\hbo+\hbcb, \end{split} \end{equation}\] where \(\hat{\Gamma}\) is a matrix with columns \(\hat{\gamma}_{k}=\hat{\alpha}_{k}-\hat{\alpha}_{0}\), \(\hat{\Lambda}_{i}=E_{n}[\dot{X}_{i}\dot{X}_{i}']^{-1}\dot{X}_{i}X_{i}'\), and \(E_{n}[A_{i}]=\sum_{i}\omega_{i}^{2}A_{i}/\sum_{i}\omega_{i}^{2}\) denotes the weighted sample mean.
To compute this decomposition, we don’t need to explicitly compute \(\hat{\Lambda}_{i}\). Instead, we use the identity \[\begin{equation*} \hbo_{k}=e_{k}'E_{n}[\dot{X}_{i}\dot{X}_{i}']^{-1}E_{n}[\dot{X}_{i}X_{ik}Z_{i}'\hat{\gamma}_{k}]=\hat{\delta}_{k k}' \hat{\gamma}_{k}, \end{equation*}\] where \(\hat{\delta}_{k k}\) is a WLS estimator of the system of regressions \[\begin{equation}\label{eq:delta_system} Z_{i}X_{ik}=\delta_{k k}X_{ik}+\sum_{\ell\neq k}\delta_{k \ell}X_{i\ell} +\Delta_{Z, k}Z_{i}+\zeta_{ik}. \end{equation}\]
Note this decomposition and associated standard errors, in the next subsection, are purely regression-based, so they remain valid even if \(X_{i}\) is not a set of binary indicators. Likewise, misspecification of the interacted regression only affects the interpretation of the decomposition; if \(\mu_k\) is not linear, the decomposition will not consistently estimate the contamination bias.
The own treatment weights in this decomposition sum to one, and the contamination weights sum to zero, since \(E_{n}[\hat{\Lambda}_{i}]=I_{k}\), mimicking the population decomposition. If the propensity score doesn’t satisfy \(p_{k}\in\mathcal{G}\), the implied estimate of \(\Lambda(w)\), \[\begin{equation*} \hat{\Lambda}(w)=\frac{1}{n}\sum_{i=1}^{n}\1{W_{i}=w}\hat{\Lambda}_{i} \end{equation*}\] may not be positive definite; in particular, the diagonal elements may not all be positive, in line with Proposition 1 in Goldsmith-Pinkham et al. (2024).
In addition to this decomposition, the package also computes the following alternative estimators:
cw_uniform=TRUE in the multe function sets
these probabilities to \(1/K\); setting
the option to its default, FALSE, sets them to \((E_{n}[X_{i0}],\dotsc,E_{n}[X_{i
K}])\).To compute cluster-robust standard errors for an asymptotically
linear estimator with influence function \(\psi_{i}\), we use the formula \[\begin{equation*}
\widehat{\operatorname{se}}(\psi)^{2}=\frac{G}{G-1}\sum_{g}\left(\sum_{G_{i}=
g}\psi_{i}\right)
\left(\sum_{G_{i}= g}\psi_{i}\right)'.
\end{equation*}\] Here \(G_{i}\)
denotes cluster membership, as specified by the multe
argument cluster, and \(G\) the number of clusters. Specifying
cluster=NULL assumes independent data, setting each
observation to be in its own cluster (\(G_{i}=i\) and \(G=N\)), so the reported standard errors are
robust to heteroskedasticity, but not clustering.
We now describe the form of the influence function for the estimators
above. For a generic WLS regression of \(A\) onto \(B\), let \((Q_{1},Q_{2}) \left(\begin{smallmatrix}
R & S\\
0 & 0
\end{smallmatrix}\right)\Pi'\) denote the QR decomposition of
\(\diag(\omega_{i})B\). If \(B\) has rank \(r\), then \(R\) has dimension \(r\times r\), \(Q\) has dimension \(N\times r\), where \(r\) is the rank of the regressor matrix,
and \(\Pi\) is a permutation matrix.
The WLS estimator is then given by \(b=\Pi\left(\begin{smallmatrix}
R^{-1}Q_{1}'\diag(\omega_{i})A\\
\mathtt{NA}\end{smallmatrix}\right)\).
Denoting the regression residual by \(\hat{\epsilon}_{i}\), the influence
function is thus given by \[\begin{equation}\label{eq:psi_reg}
\psi_{i}(b)=\Pi\left(\begin{smallmatrix}
R^{-1}Q_{1i}\omega_{i}\hat{\epsilon}_{i}\\
\mathtt{NA}\end{smallmatrix}\right).
\end{equation}\] See the internal function
multe:::reg_if for implementation. The influence function
for the inner product of linear estimators \(a\) and \(b\), is by the delta method given by \[\begin{equation*}
\psi_{i}(a' b)=a'\psi_{i}(b)+b'\psi_{i}(a),
\end{equation*}\] while for scalars \(s_{1},s_{2}\), \(\psi(s_{1}a+s_{2}b)=s_{1}\psi(a)+s_{2}\beta(b)\).
We use \(\eqref{eq:psi_reg}\) to compute \(\psi(\hat{\alpha}_{k})\), as well as \[\begin{align*} \psi_{i}(\bar{Z})&=\frac{\omega_{i}^{2}(Z_{i}-\bar{Z})}{\sum_{i}\omega_{i}^{2}}\\ \psi_{i}(\hat{\delta}_{k k})&=\frac{\omega_{i}^{2}\hat{\zeta}_{ik}\ddot{X}_{ik}}{\sum_{i}\omega_{i}^{2}\ddot{X}_{ik}^{2}}, \end{align*}\] where \(\hat{\zeta}_{ik}\) is the WLS residual based on \(\eqref{eq:delta_system}\), and \(\ddot{X}_{ik}\) is the residual from regressing \(X_{ik}\) onto \(X_{i, -k}\) and \(Z_{i}\). It then follows from \(\eqref{eq:psi_reg}\) and the influence function formulas above, that \[\begin{align*} \psi_{i}(\hbo_{k})&=\hat{\delta}_{k k}'\psi_{i}(\hat{\gamma}_{k})+ \hat{\gamma}_{k}'\psi_{i}(\hat{\delta}_{k k})\\ \psi_{i}(\hat{\beta})&=\left(\sum_{i}\omega_{i}^{2}\dot{X}_{i}\dot{X}_{i}'\right)^{-1}\omega_{i}^{2}\dot{X}_{i}\hat{U}_{i}\\ \psi_{i}(\hbew_{k})&=\frac{\1{D_{i}\in\{0,k\}} \omega_{i}^{2}\hat{X}_{ik}\hat{U}_{ik}}{\sum_{i}\1{D_{i}\in\{0,k\}}\omega_{i}^{2}\hat{X}_{ik}^{2}}\\ \psi_{i}(\hbate_{k})&=\bar{Z}'\psi_{i}(\hat{\alpha}_{k}-\hat{\alpha}_{0}) +\psi_{i}(\bar{Z})(\hat{\alpha}_{k}-\hat{\alpha}_{0}). \end{align*}\] where \(\hat{X}_{ik}\) is the residual from regressing \(X_{ik}\) onto \(Z\) in the subset with \(D_{i}\in\{0,k\}\), and \(\hat{U}_{ik}\) the residual from regressing \(Y_{i}\) onto \(X_{ik}\) and \(Z_{i}\) in this subsample.
When the treatment is binary and overlap holds, the formula for \(\psi_{i}(\hbate_{k})\) is similar to that discussed on page 29 in Imbens and Wooldridge (2009), except we don’t assume that the regression error \(V_{i}\) in \(\eqref{eq:linear_mu}\) is conditionally mean zero, so that the standard error is robust to misspecification.
Derivations in the last section show that the influence function for the common weights estimator is given by \(\hat{\psi}_{i}(\hbcw_{k})=\hat{\psi}_{i}(\hat{\alpha}^{\textnormal{CW}}_{k}-\hat{\alpha}^{\textnormal{CW}}_{k})\), where \[\begin{equation}\label{eq:psi_cw} \hat{\psi}_{i}(\hat{\alpha}^{\textnormal{CW}}_{k})= \frac{1}{\sum_{i}\lcw(W_{i})}\left( \frac{\lcw(W_{i})X_{ik}}{\pi_{k}(W_{i};\hat{\theta})}(Y_{i}-\hat{\alpha}^{\textnormal{CW}}_{k}) +a_{i}(\hat{\theta})\right). \end{equation}\]
with the formula for \(a_i\) given in \(\eqref{eq:a_i}\) below, \(\theta\) corresponds to the parameters in the multinomial logit model, \(\pi_{k}(W_{i};\hat{\theta})\) to the fitted probabilities in this model, and \(\hat{\alpha}^{\textnormal{CW}}_{k}\) is the estimate based on \(\eqref{eq:moment_overlap_weights}\) below.
The package also reports “oracle” standard errors, which interprets the alternative estimators as estimates of the contrasts \[\begin{equation*} \beta_{\lambda,k}=\frac{\sum_{i=1}^{N}\lambda(W_{i})(\mu_{k}(W_{i})-\mu_{0}(W_{i}))}{\sum_{i=1}^{N}\lambda(W_{i})}, \end{equation*}\] with \(\lambda(W_{i})=1\) for the unweighted ATE, \(\lambda(W_{i})=\lcw(W_{i})\) for the common weights estimator and \(\lambda(W_{i})=\frac{p_{k}(W_{i})p_{0}(W_{i})}{p_{k}(W_{i})+p_{0}(W_{i})}\) for the efficiently weighted ATE estimator. In contrast, the standard errors in the previous subsection set the estimands to be the population counterparts to these weighted averages, replacing the sums in the above display with population expectations. In addition, the oracle standard errors don’t account for estimation error in the propensity score \(p(W_{i})\).
For the unweighted ATE, the oracle standard error is based on the influence function \(\tilde{\psi}_{i}(\hat{\beta}_{k}^{ATE})=\bar{Z}'\psi_{i}(\hat{\alpha}_{k}-\hat{\alpha}_{0})\). From the derivation in the last section, it follows that the oracle standard error for \(\hbew_{k}\) is given by \[\begin{equation}\label{eq:hbew_oracle} \psi_{i}(\hbew_{k})=\frac{\1{D_{i}\in\{0,k\}} \omega_{i}^{2}\hat{X}_{ik}\hat{\dot{U}}_{i}}{\sum_{i}\1{D_{i}\in\{0,k\}}\omega_{i}^{2}\hat{X}_{ik}^{2}}, \end{equation}\] where \(\hat{\dot{U}}_{i}\) is the interacted regression residual based on \(\eqref{eq:linear_mu}\).
Finally, the oracle standard errors for \(\hbcw_{k}\) are based on the influence function \(\tilde{\psi}_{i}(\hbcw_{k})=\tilde{\psi}_{i}(\hat{\alpha}^{\text{CW}}_{k})-\tilde{\psi}_{i}(\hat{\alpha}^{\text{CW}}_{0})\), where \[\begin{equation}\label{eq:overlap_oracle} \tilde{\psi}_{i}(\hat{\alpha}^{\text{CW}}_{k})=\frac{\omega_{i}^{2}\lcw(W_{i};\hat{\theta})}{ \sum_{i}\omega_{i}^{2}\lcw(W_{i};\hat{\theta})}\frac{X_{ik}}{\pi_{k}(W_{i};\hat{\theta})}\hat{\dot{U}}_{i}. \end{equation}\]
The package applies the above formulas to the full sample. In cases
with poor overlap, this may not yield well-defined estimates or bias
decomposition for all treatments. For components of the decomposition
and alternative estimators that are not identified, the package returns
NA. In such cases, the package also returns results for a
trimmed “overlap sample”, where the decomposition and alternative
estimators are all identified. The overlap sample is constructed as
follows:
We now give the form of the Wald and LM tests for variation in the propensity score. First, we give a general derivation of these tests in a likelihood context when the Hessian may be reduced rank. We then specialize the formulas to the case where the likelihood corresponds to the that for the multinomial logit model.
Consider a log-likelihood \(\ell_{n}(\theta)\) for a \(p\)-dimensional parameter \(\theta\), with score function \(S\) that’s approximately normal with covariance matrix \(\Omega\), and Hessian \(H\). We’re interested in testing the hypothesis that last \(r\) elements of \(\theta\) are zero, \(H_{0}\colon \theta_{2}=0\). We assume that the submatrix \(H_{11}\) of the Hessian corresponding to the restricted model is full rank, but the full matrices \(\Omega\) or \(H\) may not be invertible.
The score evaluated at the unrestricted estimator \(\hat{\theta}\) satisfies \[\begin{equation*} 0= \begin{pmatrix} S_{1}(\hat{\theta}_{1},\hat{\theta}_{2})\\ S_{2}(\hat{\theta}_{1},\hat{\theta}_{2}) \end{pmatrix}= \begin{pmatrix} S_{1}(\theta_{1},0)\\ S_{2}(\theta_{1},0) \end{pmatrix}+ \begin{pmatrix} H_{11}(\hat{\theta}_{1}-\theta_{1}) + H_{12}\hat{\theta}_{2}\\ H_{21}(\hat{\theta}_{1}-\theta_{1}) + H_{22}\hat{\theta}_{2}, \end{pmatrix} \end{equation*}\] ignoring in the notation that the Hessian evaluated needs to be evaluated at intermediate values. Rearranging, \[\begin{equation*} \begin{pmatrix} \hat{\theta}_{1}-\theta_{1}\\ (H_{22}-H_{21}H_{11}^{-1}H_{12})\hat{\theta}_{2} \end{pmatrix} = \begin{pmatrix} -H_{11}^{-1}S_{1}(\theta_{1},0)-H_{11}^{-1}H_{12}\hat{\theta}_{2}\\ H_{21}H_{11}^{-1}S_{1}(\theta)-S_{2}(\theta) \end{pmatrix} \end{equation*}\] This yields the Wald statistic \[\begin{equation*} W=\hat{\theta}_{2}'(H_{22}-H_{21}H_{11}^{-1}H_{12})' \operatorname{var}(S_{2}(\theta_{1},0)-H_{21}H_{11}^{-1}S_{1}(\theta_{1},0))^{+} (H_{22}-H_{21}H_{11}^{-1}H_{12})\hat{\theta}_{2}, \end{equation*}\] where \(A^{+}\) denotes a generalized inverse. By Lemma 9.7 in Newey and McFadden (1994), the statistic has an asymptotic \(\chi^{2}\) distribution with degrees of freedom equal to the rank of the variance.
The score evaluated at the restricted estimator \(\bar{\theta}_{1}\) satisfies \[\begin{equation*} \begin{pmatrix} 0\\ S_{2}(\bar{\theta}_{1},0) \end{pmatrix}= \begin{pmatrix} S_{1}(\theta_{1},0)\\ S_{2}(\theta_{1},0) \end{pmatrix}+ \begin{pmatrix} H_{11}(\bar{\theta}_{1}-\theta_{1})\\ H_{21}(\bar{\theta}_{1}-\theta_{1}) \end{pmatrix}, \end{equation*}\] which implies \(\bar{\theta}_{1}-\theta_{1}=-H_{11}^{-1}S_{1}(\theta_{1},0)\), and hence \[\begin{equation*} S_{2}(\bar{\theta}_{1},0) =S_{2}(\theta_{1},0)-H_{21}H_{11}^{-1}S_{1}(\theta_{1},0). \end{equation*}\] Thus the statistic \[\begin{equation*} LM=S_{2}(\bar{\theta}_{1},0)'\operatorname{var}(S_{2}(\theta_{1},0)-H_{21}H_{11}^{-1}S_{1}(\theta_{1},0))^{+} S_{2}(\bar{\theta}_{1},0) \end{equation*}\] will again have a \(\chi^{2}\) distribution.
To apply these formulas in the context of a multinomial logit model, we use the score and the Hessian \[\begin{equation*} S(\theta)= \sum_{i}\omega^{2}_{i}(X_{i}-\pi(Z_{i};\theta))\otimes Z_{i},\quad H(\theta)= -\sum_{i}\omega^{2}_{i}(\diag(\pi(Z_{i};\theta))-\pi(Z_{i};)\pi(Z_{i})')\otimes Z_{i}Z_{i}' \end{equation*}\]
We first derive \(\eqref{eq:psi_cw}\). Observe first that the common weights estimator is identical to the two-step GMM estimator that in the first step, fits a multinomial logit model \[\begin{equation}\label{eq:m-n-l} P(D_{i}=k\mid W_{i})=\frac{e^{Z_{i}'\theta_{k}}}{\sum_{k'=0}^{K}e^{Z_{i}'\theta_{k'}}}=: \pi_{k}(W_{i},\theta), \end{equation}\] with the normalization \(\theta_{0}=0\). In the second step, we use the moment condition \[\begin{equation}\label{eq:moment_overlap_weights} E\left[\frac{\lcw(W_{i};\theta)X_{ik}}{\pi_{k}(W_{i};\theta)}(Y_{i}-\alpha^{\text{CW}}_{k})\right]=0. \end{equation}\] to obtain estimates \(\hat\alpha^{\text{CW}}_{k}\), and set \(\hbcw_k=\hat\alpha^{\text{CW}}_{k}-\hat\alpha^{\text{CW}}_{0}\).
Let \[\begin{equation*} \zeta_{k}(W_{i};\hat{\theta})=\frac{\lcw(W_{i};\hat{\theta})}{\pi_{k}(W_{i},\hat{\theta})} % =\frac{1}{\pi_{k}(W_{i},\hat{\theta})\sum_{k'=0}^{K}\frac{\pi_{k'}(1-\pi_{k'})}{\pi_{k}(W_{i};\hat{\theta})}} =\frac{e^{-Z_{i}'\hat{\theta}_{k}}}{\sum_{j=0}^{K}\pi_{j}(1-\pi_{j})e^{-Z_{i}'\hat{\theta}_{j}}} \end{equation*}\] By equation (6.6) in Newey and McFadden (1994), the influence function of this two-step estimator is given by \[\begin{equation*} \psi_{i}(\hat{\alpha}^{\textnormal{CW}}_{k})= \frac{1}{E[\lcw(W_{i})]}\left( \frac{\lcw(W_{i})X_{ik}}{\pi_{k}(W_{i})}(Y_{i}-\alpha^{\text{CW}}_{k}) +M_{k}(\theta)\psi_{i}(\theta)\right), \end{equation*}\] where \(\psi_{i}(\theta)\) is the influence function of the multinomial logit estimator \(\hat{\theta}\), and \(M_{k}(\theta)\) is the derivative of \(\eqref{eq:moment_overlap_weights}\) wrt \(\theta\).
Since \(\partial \zeta_{k}(W_{i};\theta)/\partial \theta_{j}= Z_{i}\zeta_{k}(W_{i};\theta)[\pi_{j}(1-\pi_{j})\zeta_{k}(W_{i};\theta)-\1{k=j>0}]\), it follows that \[\begin{equation*} M_{k}(\theta)= E[(\eta-e_{k})\otimes Z_{i}\cdot \zeta_{k}X_{ik}(Y_{i}-\alpha_{\lambda^{C},k})],\qquad\eta=( \pi_{j}(1-\pi_{j})\zeta_{j}, \dotsc,\pi_{K}(1-\pi_{K})\zeta_{K})'. \end{equation*}\] Since the multinomial logit log-likelihood is given by \(\ell_{i}=\sum_{k=0}^{K}X_{k}\log(\pi_{k})=\sum_{k=0}^{K}X_{k}Z_{i}'\theta_{k} -\log(\sum_{k=0}^{K}e^{Z_{i}'\theta_{k}})\), the score and the Hessian are \[\begin{equation*} S_{i}(\theta)=(X_{i}-\pi(W_{i};\theta))\otimes Z_{i},\qquad H(\theta)=-E_{n}[(\diag(\pi(W_{i}))-\pi(W_{i})\pi(W_{i})')\otimes Z_{i}Z_{i}'], \end{equation*}\] Since \(\psi_{i}(\theta)=-H(\theta)^{-1}S_{i}(\theta)\), this yields \[\begin{equation}\label{eq:a_i} a_{i}(\theta)=\hat{M}_{k}(\hat{\theta}) \hat{H}(\hat{\theta})^{-1}S_{i}(\theta), \qquad S_{i}(\theta)= (X_{i}-\pi(W_{i};\theta))\otimes Z_{i}, \end{equation}\] with \(\hat{M}_{k}({\theta})=\left(\sum_{i}(\zeta-e_{k})\otimes Z_{i}\cdot \zeta_{k}X_{ik}(Y_{i}-\alpha_{\lambda^{C},k})\right)\), and \(\hat{H}({\theta})= \sum_{i}(\diag(\pi(W_{i}))-\pi(W_{i})\pi(W_{i})')\otimes Z_{i}Z_{i}'\). When \(\pi_k=1/(K+1)\) this formula reduces to that in Theorem 1 in Li and Li (2019).
Next, we show \(\eqref{eq:hbew_oracle}\). Note that it follows from Proposition 2 in Goldsmith-Pinkham et al. (2024) that the efficient influence function is given by \[\begin{multline*} \psi_{i}({\alpha}_{\lambda_{k},k0})= \frac{\lambda(W_{i})}{E[\lambda(W_{i})]}\left(\frac{X_{ik}}{p_{k}(W_{i})}(Y_{i}-\mu_{k}(W_{i})) - \frac{X_{i0}}{p_{0}(W_{i})}(Y_{i}-\mu_{0}(W_{i}))\right)\\ = \frac{X_{ik}+X_{i0}}{E[\lambda(W_{i})]}\left(X_{ik}-r_{k}\right)V_{i}= \frac{(X_{ik}+X_{i0})(X_{ik}-r_{k})V_{i}}{E[(X_{ik}+X_{i0})(X_{ik}-r_{k})^{2}]}, \end{multline*}\] where \(r_{k}=r_{k}(W_{i})=E[X_{ik}\mid W_i,X_{ik}+X_{i0}=1]\). The result then follows since \(\hat{X}_{ik}\) is an estimator of \(X_{ik}-r_{k}(W_{i})\).
Finally, we show \(\eqref{eq:overlap_oracle}\). The derivative of the moment condition \(\eqref{eq:moment_overlap_weights}\) with respect to \(\pi_{k}=p_{k}\) (assuming correct specification of the propensity score) is given by \[\begin{equation*} -E[\lambda\frac{X_{ik}}{p_{k}^{2}(W_{i})} (\mu_{k}-\alpha^{\text{CW}}_{k})\dot{p}_{k}(W_{i})], \end{equation*}\] where we write \(\lambda\) for \(\lcw(W_{i})\). Since \(p_{k}\) is a projection, by Proposition 4 in Newey (1994), the influence function for \(\hat{\alpha}^{\text{CW}}_{k}\) is given by \[\begin{multline*} \frac{1}{E[\lambda]}\left( \lambda \frac{X_{ik}}{p_{k}(W_{i})}(Y_{i}-\alpha^{\text{CW}}_{k}) -\frac{\lambda}{p_{k}(W_{i})}(\mu_{k}(W_{i})-\alpha^{\text{CW}}_{k})(X_{ik}-p_{k}(W_{i}))\right)\\ = \frac{1}{E[\lambda]}\left( \lambda\frac{X_{ik}}{p_{k}(W_{i})}(Y_{i}-\mu_{k}(W_{i})) + \lambda(\mu_{k}(W_{i})-\alpha^{\text{CW}}_{k}) \right). \end{multline*}\] Next, as noted in Abadie et al. (2014), we can view \(\tilde{\alpha}^{\text{CW}}_{k}=\sum_{i}\lambda\mu_{k}(W_{i})/\sum_{i}\lambda\) as an estimator of \({\alpha}^{\text{CW}}_{k}\) based on the moment condition \(E[\lambda(\mu_{k}(W_{i})-{\alpha}^{CW}_{k})]=0\), which by standard arguments has influence function given by \(\frac{\lambda}{E[\lambda]}(\mu_{k}(W_{i})-{\alpha}^{\text{CW}}_{k})\). Since \(\hat{\alpha}^{\text{CW}}_{k}-\alpha^{CW}_{k}=(\hat{\alpha}^{\text{CW}}_{k}-{\alpha}^{\text{CW}}_{k})-({\alpha}^{\text{CW}}_{k} -\tilde{\alpha}^{\text{CW}}_{k})\), we subtract this influence function from the preceding display to obtain \(\eqref{eq:overlap_oracle}\).