Design based sandwich SEs for differences of Hajek estimators

We’ve promised to provide sandwich-type SEs for estimates of treatment effect, with stacking of estimated equations as appropriate to address possible use of separate regression fits to estimate covariance adjustment terms and then to estimate marginal treatment effects. Often ``sandwich standard error’’ conjures an expectation of standard errors that are model-based in the sense of assuming i.i.d. data $(A_i, W_{i}, X_i, Y_{0i}, \ldots Y_{Ki})$ and making inference conditional on the values of treatment assignments $A \in \{0, \ldots, K\}$ , importance or frequency weights $W$ , and covariates $X$ , but not the values of potential responses $\{Y_k: k \leq K\}$ . (This document takes $i$ to range over intact clusters.) With the i.i.d assumption, these model-based SEs are valid even if the design specification isn’t, provided that the outcome model is well-specified. The role of the study design is to provide analysis weights which, if respected, ensure desired interpretations of marginal effect estimators if the $Y$ -model happens to be misspecified.

An alternative would be to offer sandwich standard errors under a design-based interpretation, differing from the above principally in terms of conditioning on

$\mathcal{Z} = \sigma(\{(X_i, Y_{0i}, \ldots Y_{Ki}): i\}, \{\sum_{i \in S}\, \![A_i=k\!] : k \in \{0, \ldots, K\}; S \in \mathcal{P}\} )$

where $\mathcal{P}$ is a given partition of {1,, n} encoding a stratification. Of particular interest are standard errors for Hajek estimators, i.e. estimators of form $\hat{\mu}_k[\mathbf{v}] = \frac{\sum_i \check{w}_{i[k]} [\![A_i=k]\!] v_i}{\sum_i \check{w}_{i[k]} [\![A_i=k]\!]}$ and their differences $\begin{equation}\label{eq:hajekdiff} \hat{\mu}_k[\mathbf{u}] - \hat{\mu}_0[\mathbf{v}]. \end{equation}$ The case weights $\{\check{w}_{i[k]}:i\leq n, k\leq K\}$ are the products of a frequency or importance weight $w_{i}$ and, as appopriate, a weighting factor representing a reciprocal of $i$ ’s probability (or odds) of assignment to $k$ (or to one of { $k, \ldots, K$ }); in general differ by $k$ . With $\check{w}_{i[j]} =[\mathbb{P}(A_{i}=j \mid \mathcal{Z}]^{-1}$ for $j=0, \ldots, K$ , $\hat{\mu}_k[\mathbf{y}_{k}] - \hat{\mu}_0[\mathbf{y}_{0}]$ is the natural estimate of the effect of treatment $k$ within the experimental or quasiexperimental study sample, $n^{-1}\sum_{i=1}^{n}(y_{ki}-y_{k0})$ . This is also known as the sample average treatment effect. If the experimental observations are samples from a broader super-population, with sample inclusion probabilities $\pi_{1}, \ldots, \pi_{n}$ , then setting $\check{w}_{i[j]} =\pi_{i}^{-1}[\mathbb{P}(A_{i}=j)]^{-1}$ for $j=0, \ldots, K$ makes $\hat{\mu}_k[\mathbf{y}_{k}] - \hat{\mu}_0[\mathbf{y}_{0}]$ into the corresponding natural estimate of the population average treatment effect.

In one special case a design-based SE for has been available for some time, with an optional interpretation in terms of M-estimation emerging more recently. Suppose:

In this case the variance estimate used in the 2 sample $t$ test without pooling of variances validly estimates the variance of , in the following sense. If $v$ is observed only when $A=k$ while $u$ is observed only when $A=0$ , then $s_{vu} = (n-1)^{-1} \sum_{i=1}^n (v_i - \bar v)(u_i - \bar v)$ is not identified; as far as the data are concerned, it could lie anywhere between $\pm s_{v} s_{u}$ , where $s_{v}^2 = s_{vv} = (n-1)^{-1} \sum_{i=1}^n (v_i - \bar v)^2$ . Assuming $s_{vu} = s_{v}s_{u}$ , then the ordinary unpooled variance is unbiased for the design-based variance (conditional variance given $\mathcal{Z}$ ) of . If on the other hand $s_{uv} \leq s_{v}s_{u}$ , then the expected value of this ordinary unpooled variance exceeds ’s design-based variance . The possibilities $s_{vu} = s_{v}s_{u}$ and $s_{vu} \leq s_{v}s_{u}$ being exhaustive, by Cauchy-Schwartz, the expected value of the unpooled variance can be no less than $\mathrm{Var}(\hat{\mu}_k[\mathbf{u}] - \hat{\mu}_0[\mathbf{v}] \mid \mathcal{Z})$ ; it is valid as a possibly conservative estimate in that it cannot be negatively biased.

As shown by , in this case the same variance estimate coincides with the HC2 flavor of the Huber-White estimate. As a result, it’s sometimes said that ordinary sandwich or cluster-robust variance estimates admit of design-based interpretations provided that one uses the HC2 form. I suspect this works only in special cases, however, perhaps not going far beyond the one delimited above.

The use of the Neyman insight to get design-based SEs through Huber-White covariance estimates intended for model-based interpretation has several important limitations.

The M-estimation framework promises solutions to many or all of these impediments.

The Hajek estimators $\hat{\mu}_{0}[\mathbf{y}_{0}]$ , , $\hat{\mu}_{k}[\mathbf{y}_{k}]$ admit of being defined as the unique multivariate root of the estimating function $\begin{equation} \label{eq:ee-hajekdiff} (\mu_{0}, \ldots, \mu_{K}) \mapsto \left[ \begin{array}{c} \sum_i [\![a_i=K]\!] \check{w}_{i[K]} (y_{Ki} - \mu_K) \\ \vdots \\ \sum_i [\![a_i=0]\!] \check{w}_{i[0]} (y_{0i} - \mu_0) \end{array} \right] . \end{equation}$

The information matrices $A(\mu_0, \ldots, \mu_K)$ corresponding to are diagonal. For a sandwich estimate of variance, then, we only need estimates of the covariance of~. In the one-stratum case with multiple representatives of conditions $k$ and $0$ , we can simply apply Neyman’s two-sample unpooled variances SE to observations $\{ \check{w}_{i[k]} (y_{ki} - \mu_k) : A_i = k\}$ and $\{ \check{w}_{i[0]} (y_{0i} - \mu_0) : A_i = 0\}$ . Impediment~ is solved.

For standard errors of the difference of Hajek estimators within a subgroup $g$ , $\begin{equation*} \frac{\sum_i \check{w}_{i[k]} [\![A_i=k, G_{i} =g]\!] y_{ki}}{\sum_i \check{w}_{i[k]} [\![A_i=k, G_{i}=g]\!]} - \frac{\sum_i \check{w}_{i[0]} [\![A_i=0, G_{i} =g]\!] y_{0i}}{\sum_i \check{w}_{i[0]} [\![A_i=0, G_{i}=g]\!]}, \end{equation*}$ we can simply fold the subgroup indicator into the weights, weighting by $\{\tilde{w}_{i[j]}:j; i\} = \{\check{w}_{i[j]}[\![G_{i}=g]\!]: j; i\}$ as opposed to $\{\check{w}_{i[j]}:j; i\}$ . This solves the first part of Impediment~.

(The remaining piece of Impediment~ is the need for estimates of $\begin{equation*} \mathrm{Cov}\left( \frac{\sum_i \check{w}_{i[k]} [\![A_i=k]\!] v_{ki}}{\sum_i \check{w}_{i[k]} [\![A_i=k]\!]} - \frac{\sum_i \check{w}_{i[k]} [\![A_i=k]\!] v_{0i}}{\sum_i \check{w}_{i[k]} [\![A_i=k]\!]}, \frac{\sum_i \check{w}_{ki} [\![A_i=k]\!] u_{ki}}{\sum_i \check{w}_{ki} [\![A_i=k]\!]} - \frac{\sum_i \check{w}_{ki} [\![A_i=k]\!] u_{0i}}{\sum_i \check{w}_{ki} [\![A_i=k]\!]} \right) \end{equation*}$ where $v_{ji}$ and $u_{ji}$ are observed only when $A_{i}=j$ , all $i$ and $j$ . I haven’t had the opportunity to sit down and try to work this out, but I’m optimistic that this can be handled with straightforward extensions of the Neyman insight.)

Under the design-based interpretation, variances and covariances of quantile regression estimating functions — for a quantile regression with covariates $X$ and $c_{i}$ observations per cluster $i$ , $\begin{equation*} \beta \mapsto \sum_{i=1}^{n}\sum_{j=1}^{c_{i}}\sum_{k=0}^{K} x_{ij}([\![y_{kij} - X_{ij}'\beta, a_{i}= k)]\!] - \tau) \end{equation*}$ — are likely to be estimable in much the same way as variances and covariances of estimating functions for other forms of regression. In any case, there will be no need for estimation of a sparsity parameter. The impediment is removed.

(Quantile regression doesn’t bear any of its usual interpretations under conditioning on $\mathcal{Z}$ , but that doesn’t matter for purposes of using it to estimate slopes in an RDD.)

Now consider Impediment~, strata $S$ in which one or both of the conditions $j$ being contrasted have only one representative, $\sum_{i\in S} [\![A_{i}=j]\!] =1$ . Consider the situation without covariates.

As there is independence across if not within strata, the B matrices corresponding to~ are sums of stratum-wise contributions $\begin{equation} \label{eq:Bform} B_S((\mu_0, \ldots, \mu_K)) = \mathrm{Cov}\left(\left[ \begin{array}{c} \sum_{i\in S} [\![A_i=K]\!] \check{w}_{i[K]} (y_{Ki} - \mu_K) \\ \vdots \\ \sum_{i\in S} [\![A_i=0]\!] \check{w}_{i[0]} (y_{0i} - \mu_0) \end{array} \right] \mid \mathcal{Z}\right), \quad S \in \mathcal{P}. \end{equation}$ For any stratum $S$ and condition $k$ s.t. $\sum_{i \in S} [\![A_i=k]\!] =1$ , $\mathrm{Var}\left\{ \sum_{i\in S} [\![A_i=k]\!] \check{w}_{i[k]} (y_{ki} - \mu_k) \right\}$ is not estimable, just as the variance $[(\# S)-1]^{-1}\sum_{i \in S} (y_{ki} - \bar{y}_k)^2$ is not estimable. On the other hand, the second moment $\mathbb{E}\left[\left\{ \sum_{i\in S} [\![A_i=k]\!] \check{w}_{i[k]} (y_{ki} - \mu_k) \right\}^2\right]$ estimable, by it sample realization $\left\{ \sum_{i\in S} [\![a_i=k]\!] \check{w}_{i[k]} (y_{ki} - \mu_k) \right\}^2$ : in M-estimation, we’re entitled to treat $\mu_k$ is a fixed constant, not a random variable. Since the second moment bounds the variance from above, $\{\sum_{i\in S} [\![A_i=k]\!] \check{w}_{i[k]} (y_{ki} - \mu_k)\}^2$ is a safe, conservative variance estimate.

M-estimation doesn’t help us to estimate $\mathrm{Cov}\left\{ \sum_{i\in S} [\![A_i=k]\!] \check{w}_{i[k]} (y_{ki} - \mu_k), \sum_{i\in S} [\![A_i=0]\!] \check{w}_{i[0]} (y_{0i} - \mu_0) \right\}$ but these terms aren’t identified even for large strata with multiple representatives of either condition. When interest is in the difference $\hat{\mu}_k - \hat{\mu}_0$ , conservatism leads us to ``impute’’ the covariance value that maximizes the variance of that difference, that corresponding to a correlation of 1 between $\sum_{i\in S} [\![A_i=k]\!] \check{w}_{i[k]} (y_{ki} - \mu_k)$ and $\sum_{i\in S} [\![A_i=0]\!] \check{w}_{i[0]} (y_{0i} - \mu_0)$ . We can follow the same principle when approaching the problem from the perspective of $M$ -estimation, whatever the representation of conditions $k$ and 0 within stratum $S$ .

In separate hand-written notes I begin fleshing this out into methods of estimating design-based second moments of first differences of the estimating functions that define the Hajek estimators, which can in turn be used to approximate the corresponding B-matrix terms; extending this work might be a suitable problem for a PhD student.

If covariate $X$ are included in a linear regression specification with unit weights $\sum_{j=1}^{K}\check{w}_{i[j]}[\![a_{i}=j]\!]$ — that is, the specification that without $X$ would have engendered estimation of Hajek estimator differences as in — then B-matrix estimation can still be done just as it is without the covariates. The A matrices are no longer diagonal, but are otherwise straightforward. Combining estimates of these A and B matrices would give legitimate standard errors for differences of Hajek estimators as before, but now with adjustment for covariates.

Another approach to covariates is to have fit their coefficients in a separate and prior regression fit — perhaps a robust linear fit; or perhaps a binary regression fit — and then to contrast Hajek aggregates of residuals from these fits . This approach calls for stacked estimating equations, with accompanying complications to both the A and the B matrix. We might as well consider it in tandem with challenges and .

To draw a distinction between experimental versus non-experimental observations, let $\mathcal{P}$ be a partition of a nonempty subset of $\{1, \ldots, n\}$ , so that some but not necessarily all clusters 1,, $n$ fall into some stratum $S \in \mathcal{P}$ ; these clusters are then the experimental or quasi-experimental sample. Re-define $\mathcal{Z}$ to contain similar information as above but also full information on clusters falling outside of $\mathcal{E} = \bigcap_{S \in \mathcal{P}}S$ : $\begin{equation} \label{eq:Zdefextended} \mathcal{Z} = \sigma\left(\begin{array}{c} \{(X_i, Y_{0i}, \ldots Y_{Ki}): i \in \mathcal{E}\}, \{\sum_{i \in S}\, \![A_i=k\!] : k \in \{0, \ldots, K\}; S \in \mathcal{P}\} , \\ \{(A_{i}, X_i, Y_{0i}, \ldots Y_{Ki}): i \not\in\mathcal{E}\} \end{array}\right). \end{equation}$ One now needs A and B matrices for estimating functions of the Hajek estimator and the regression estimator simultaneously. Given that we’re ultimately interested only to estimate covariances for the Hajek differences, simplified expressions in terms of A- and B-submatrices are available . Further simplification of B matrices is possible given our conditioning on $\mathcal{Z}$ in ; the B matrix continues to be a sum of form~, with clusters $i\not\in \mathcal{E}$ making no contribution. (The design-based interpretation has no bearing on calculation of $A$ matrices.)

References