Instrumental variables in econometrics and epidemiology

Mendelian randomization (MR) has become very popular in recent years. Even clinical journals that are quite conservative about using causal language outside randomized trials now embrace MR studies. On the methodological side, many new MR methods have been proposed. However, I’m very skeptical about them because they impose extremely strict parametric assumptions. For example, with very few exceptions (see Shi 2021 and 2022), they assume strict homogeneity (see Morrison 2020). Furthermore, the biggest problem in the current literature is that we are not very transparent about the parameter we are estimating. The target parameter is absent!

In this post, I review the two approaches to instrumental variables. The first is the local average treatment effect (LATE) framework, famous in econometrics (Imbens and Angrist, 1994). The second is the structural mean model (SMM) framework proposed by the Harvard CI group (Robins, 1994).

We start from the switching formula.

\[Y_i = D_i \cdot Y_i(1) + (1-D_i) \cdot Y_i(0)\]

which gives

\[Y_i = \tau_i \cdot D_i + Y_i(0)\]

after reordering. $\tau_i = Y_i(1) - Y_i(0)$ is the individual treatment effect.
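As a quick sanity check, here is a minimal sketch showing that the two forms of the switching formula agree unit by unit. The simulated potential outcomes and the variable names are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical potential outcomes and a binary treatment (illustrative only).
y0 = rng.normal(size=n)
tau = rng.normal(loc=1.0, size=n)   # individual treatment effects tau_i
y1 = y0 + tau                       # so that tau_i = Y_i(1) - Y_i(0)
d = rng.integers(0, 2, size=n)

y_switch = d * y1 + (1 - d) * y0    # switching formula
y_reordered = tau * d + y0          # reordered form

print(np.allclose(y_switch, y_reordered))
```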

Learning the importance of heterogeneous treatment effects (TE) in IV was a very illuminating experience. At the same time, the plethora of assumptions was very confusing. How do these different assumptions relate to each other? I don’t have a full-blown answer to this question. However, I think that the two approaches by econometricians and epidemiologists can be classified by the stage at which the assumption is applied. The former, usually referred to as monotonicity, is a matter of how the instrument $Z$ affects the treatment $D$. The latter (e.g. the no treatment effect modification (NEM) assumption) restricts how the treatment $D$ influences the outcome $Y$. The following derivations will make this point clear.

Monotonicity

The proof is essentially the same as in Mostly Harmless Econometrics (Theorem 4.4.1, p.155). Apply $E[\cdot \vert Z]$ to equation (2).

\[E[Y_i \vert Z_i] = E[\tau_i \cdot D_i \vert Z_i] + E[Y_i(0) \vert Z_i]\]

The last term is just a constant $E[Y_i(0)]$ by the exclusion restriction $Y(d) \perp\kern-5pt\perp Z$, so it cancels when we take the difference between $Z_i=1$ and $Z_i=0$.

\[E[Y_i \vert Z_i=1] - E[Y_i \vert Z_i=0] \\ = E[\tau_i \cdot D_i \vert Z_i=1] - E[\tau_i \cdot D_i \vert Z_i=0]\]

To simplify this equation, we have two choices. One is to impose some condition on $\tau_i$ and the other is to impose some condition on $D_i$. Monotonicity does the latter by assuming no defiers. Let $C_i = D_i(1) - D_i(0)$ be the compliance status. Once defiers ($C_i = -1$) are ruled out, $C_i = 1$ for compliers and $C_i = 0$ for always-takers and never-takers. Then

\[E[\tau_i \cdot D_i \vert Z_i] \\= E[ E[ \tau_i \cdot D_i \vert Z_i, C_i ] \vert Z_i] \\= \sum_c E[\tau_i \cdot D_i \vert Z_i, C_i=c] \cdot P(C_i =c \vert Z_i)\]

By exogeneity of the instrument, $P(C_i \vert Z_i) = P(C_i)$. For $Z=1$,

\[= \sum_c E[\tau_i \cdot D_i \vert Z_i=1, C_i=c] \cdot P(C_i =c) \\ = E[\tau_i \cdot 1 \vert Z_i=1, C_i=1] \cdot P(C_i =1) \\ + E[\tau_i \cdot D_i \vert Z_i=1, C_i=0] \cdot P(C_i=0)\]

and for $Z=0$,

\[= E[\tau_i \cdot 0 \vert Z_i=0, C_i=1] \cdot P(C_i =1) \\ + E[\tau_i \cdot D_i \vert Z_i=0, C_i=0] \cdot P(C_i=0)\]

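Finally, substitute (6) and (7) into (4). The $C_i = 0$ terms cancel because $D_i = D_i(1) = D_i(0)$ in that stratum, and the exclusion restriction removes the remaining conditioning on $Z_i$. This guarantees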

\[E[\tau_i \cdot D_i \vert Z_i=1] - E[\tau_i \cdot D_i \vert Z_i=0] \\ = E[\tau_i \cdot 1 \vert C_i =1] \cdot P(C_i=1) \\ = E[\tau_i \vert C_i=1] \cdot E[D_i(1) - D_i(0)] \\ = E[\tau_i \vert C_i=1] \cdot (E[D_i \vert Z=1] - E[D_i \vert Z=0])\]

where the last line follows from exogeneity. The proof shows that there is no restriction on $\tau_i$; only assumptions involving $D_i$ were used to derive the Wald ratio.
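The result can be checked numerically. Below is a minimal simulation sketch, assuming a randomized binary instrument, three compliance types with no defiers, and type-specific treatment effects. All numbers and names are made up for illustration. The Wald ratio should land close to the average effect among compliers and not the PATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

z = rng.integers(0, 2, size=n)                      # randomized instrument
# Compliance types: 0 = never-taker, 1 = complier, 2 = always-taker (no defiers).
ctype = rng.choice([0, 1, 2], size=n, p=[0.3, 0.5, 0.2])
d0 = (ctype == 2).astype(int)                       # D_i(0)
d1 = (ctype >= 1).astype(int)                       # D_i(1)
d = np.where(z == 1, d1, d0)                        # observed treatment

# Heterogeneous effects: the mean of tau_i differs by compliance type.
tau = np.array([-1.0, 2.0, 5.0])[ctype] + rng.normal(size=n)
y0 = rng.normal(size=n) + 1.0 * (ctype == 2)        # Y(0) depends on type
y = tau * d + y0                                    # switching formula

wald = (y[z == 1].mean() - y[z == 0].mean()) / (
    d[z == 1].mean() - d[z == 0].mean()
)
late = tau[ctype == 1].mean()                       # complier average effect
pate = tau.mean()

print(f"Wald: {wald:.3f}, LATE: {late:.3f}, PATE: {pate:.3f}")
```

In this setup the complier mean effect is 2 while the PATE is 1.7, so the gap between the two is what makes the target parameter visible.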

Structural Mean Model (SMM)

The additive SMM is

\[E[Y-Y(0) \vert D, Z] = (\psi_0 + \psi_1 Z) \cdot D\]

When I saw this for the first time, I was confused. Most importantly, the interpretation of $\psi_0$ and $\psi_1$ wasn’t very transparent to me. Therefore, I thought it was better to start from equation (2) to make it more sensible.

Applying $E[\cdot \vert D,Z]$ to equation (2) gives

\[E[Y_i - Y_i(0) \vert D_i,Z_i ] = E[\tau_i \cdot D_i \vert D_i, Z_i] \\ = E[\tau_i \vert D_i, Z_i] \cdot D_i \\ = E[\tau_i \vert Z_i] \cdot D_i\]

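So what the SMM does is specify $E[\tau_i \vert Z_i]$. (Note that the last step above drops $D_i$ from the conditioning set, which implicitly assumes that $\tau_i$ is mean-independent of $D_i$ given $Z_i$.) The specification is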

\[E[\tau_i \vert Z_i] = \psi_0 + \psi_1 Z_i\]

NEM imposes $\psi_1 = 0$ so that $\psi_0$ becomes the PATE. Following the soul of Wooldridge’s idea that I wrote about, using $Z - \mu_Z$ instead of $Z$ would have given $\psi_0$ the same interpretation without NEM, although the identification problem would have remained (the number of moment conditions is smaller than the number of unknown parameters).
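As a small worked step for the centering remark (a sketch in the notation above, with $\mu_Z = E[Z_i]$), suppose the specification is written with a centered instrument:

\[E[\tau_i \vert Z_i] = \psi_0 + \psi_1 (Z_i - \mu_Z)\]

Averaging over $Z_i$ gives

\[E[\tau_i] = \psi_0 + \psi_1 (E[Z_i] - \mu_Z) = \psi_0\]

so $\psi_0$ equals the PATE whatever the value of $\psi_1$.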

To obtain the Wald ratio, apply the law of iterated expectations (assuming NEM):

\[E[Y_i - Y_i(0) \vert Z] = \\ E[ E[Y_i - Y_i(0) \vert D,Z] \vert Z ] \\ = E[ \psi_0 \cdot D \vert Z ] \\ = \psi_0 \cdot E[ D \vert Z ]\]

and substituting it into equation (4) gives the desired result. Examining the consequence of NEM on equation (11) directly shows that SMM-based approaches achieve identification by restricting the mode of $D \rightarrow Y$.
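Here is a minimal simulation sketch of the same point, under assumptions of my own choosing. The treatment effect $\tau_i$ is drawn independently of the compliance type and the instrument, so NEM holds with $\psi_1 = 0$, and I deliberately include defiers so that monotonicity fails. The Wald ratio should still be close to $\psi_0$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

z = rng.integers(0, 2, size=n)                      # randomized instrument
# Types: 0 = never-taker, 1 = complier, 2 = always-taker, 3 = defier.
ctype = rng.choice([0, 1, 2, 3], size=n, p=[0.2, 0.5, 0.2, 0.1])
d0 = np.isin(ctype, [2, 3]).astype(int)             # D_i(0)
d1 = np.isin(ctype, [1, 2]).astype(int)             # D_i(1)
d = np.where(z == 1, d1, d0)

psi0 = 1.5
tau = psi0 + rng.normal(size=n)                     # independent of type and Z
y0 = rng.normal(size=n) + 2.0 * (ctype == 2)        # Y(0) depends on type
y = tau * d + y0

wald = (y[z == 1].mean() - y[z == 0].mean()) / (
    d[z == 1].mean() - d[z == 0].mean()
)

print(f"Wald: {wald:.3f}, psi0: {psi0}")
```

The restriction here sits entirely on the $D \rightarrow Y$ stage: nothing is assumed about how $Z$ moves $D$ beyond a nonzero first stage.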

Concluding remark

My impression of these results is that neither monotonicity nor the SMM-based assumptions are necessarily stronger or weaker than the other. The choice ultimately depends on which stage, $Z \rightarrow D$ or $D \rightarrow Y$, is restricted by the additional assumptions. Maybe someone can come up with an alternative identification strategy that imposes restrictions on both stages, each weaker than what is currently imposed on that stage.

Regress-out in single-cell biology

I first intended this content as a research paper, but I’m not currently planning any additional single-cell projects, so I am writing it up as a blog post. In many, and I mean really many, single-cell analysis notes, people suggest regressing out certain variables prior to analysis. For example, the cell-cycle vignette in the Seurat documentation recommends regressing out cell-cycle scores prior to PCA. I think this practice is not legitimate because it wipes out not only the variation due to cell cycle but also other biological variation that is correlated with cell cycle. I show, in this post, that this is true in general.
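To give a flavor of the claim, here is a toy sketch (my own made-up numbers, not real single-cell data and not Seurat's implementation). A gene's expression is driven only by a biological signal that happens to be correlated with a cell-cycle score. Regressing out the score still removes most of the gene's variance.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

cycle = rng.normal(size=n)                          # hypothetical cell-cycle score
# Biological signal that is correlated with the cell-cycle score.
biology = 0.8 * cycle + 0.6 * rng.normal(size=n)
# Expression depends on biology only; there is no direct cell-cycle effect.
expr = biology + 0.1 * rng.normal(size=n)

# "Regress out" the cell-cycle score with a simple linear fit.
slope, intercept = np.polyfit(cycle, expr, deg=1)
resid = expr - (slope * cycle + intercept)

print(f"variance kept after regress-out: {resid.var() / expr.var():.2f}")
print(f"corr(residual, cycle): {np.corrcoef(resid, cycle)[0, 1]:.2e}")
```

The residual is uncorrelated with the score by construction, so whatever part of the biology moves together with the cell cycle is gone as well.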


Treatment effect heterogeneity and linear regression

In Mostly Harmless Econometrics (3.3.7, p75), the ordinary least squares (OLS) estimand is given by

\[\frac{E[ \sigma_D^2 (X_i) \delta_X ]}{E[\sigma_D^2 (X_i)]}\]

where $\sigma_D^2(X_i)$ is the conditional variance of the treatment variable $D$ given $X_i$ and $\delta_X$ is the stratum-specific treatment effect. The formula simply says that under heterogeneous treatment effects, OLS estimates a variance-weighted average of $\delta_X$. In certain cases, this might not be the policy-relevant measure, although it retains the notion of a causal effect. Read MHE for more details.
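The weighting can be seen in a small simulation. This is a minimal sketch under my own assumptions: three strata of $X$, stratum-specific treatment probabilities, and stratum-specific effects. It regresses $Y$ on $D$ and a full set of stratum dummies, then compares the coefficient on $D$ with the sample analogue of the variance-weighted average.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

strata_prob = np.array([0.5, 0.3, 0.2])             # P(X = x)
propensity = np.array([0.5, 0.2, 0.05])             # P(D = 1 | X = x)
delta = np.array([1.0, 2.0, 4.0])                   # stratum-specific effects

x = rng.choice(3, size=n, p=strata_prob)
d = rng.binomial(1, propensity[x])
y = delta[x] * d + 0.5 * x + rng.normal(size=n)

# OLS of Y on D plus a dummy for every stratum (saturated in X, no intercept).
dummies = (x[:, None] == np.arange(3)).astype(float)
design = np.column_stack([d, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
ols_d = coef[0]

# Sample analogue of E[sigma_D^2(X) * delta_X] / E[sigma_D^2(X)].
weights = np.zeros(3)
effects = np.zeros(3)
for k in range(3):
    in_k = x == k
    p_hat = d[in_k].mean()
    weights[k] = in_k.sum() * p_hat * (1 - p_hat)
    effects[k] = y[in_k & (d == 1)].mean() - y[in_k & (d == 0)].mean()
weighted = (weights * effects).sum() / weights.sum()

pate = (strata_prob * delta).sum()
print(f"OLS: {ols_d:.4f}, variance-weighted: {weighted:.4f}, PATE: {pate:.2f}")
```

With these numbers the population variance-weighted average is about 1.42 while the PATE is 1.9, because the stratum with the largest effect has very little variation in $D$ and is therefore down-weighted.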

After some time, I read Wooldridge’s recent work on staggered difference-in-differences. In this work, Wooldridge shows a simple solution to the negative weights that arise in two-way fixed effects (TWFE) when the TE is heterogeneous. One thing that caught my eye was the mean-centering of the covariates to preserve the average treatment effect on the treated (ATT) interpretation of the main coefficient. Again, read the paper for more details.

Soon after, I found another work by Wooldridge (with Negi) which uses the same technique in OLS estimation. After a close inspection, I realized that all these works share the same soul and can be derived using the machinery of MHE. This blog post is simply a personal note on how I derive and understand these fantastic works.

With a binary treatment variable $D$ and outcome $Y$, we write the potential outcomes as $Y(d)$ ($d=0,1$). With the consistency assumption, $Y = Y(d)$ if $D=d$, we can write the famous switching formula.

\[Y_i = D_i \cdot Y_i(1) + (1-D_i) \cdot Y_i(0)\]

A careful reordering gives the following form.

\[Y_i = (Y_i(1) - Y_i(0)) \cdot D_i + Y_i(0) = \tau_i \cdot D_i + Y_i(0)\]

where $\tau_i = Y_i(1)- Y_i(0)$ is the individual treatment effect. The average over $\tau_i$ is $\tau = E[\tau_i]$ which is the population average treatment effect (PATE).

Let $X$ be the covariates that are confounders or effect modifiers (or both). By taking $E[\cdot \vert X, D]$ on both sides, we have something more familiar.

\[E[Y_i \vert X_i, D_i] = E[\tau_i \cdot D_i \vert X_i,D_i] + E[Y_i(0) \vert X_i,D_i] \\ = E[\tau_i \vert X_i,D_i] \cdot D_i + E[Y_i(0) \vert X_i,D_i]\]

We have to simplify $E[\tau_i \vert X_i,D_i]$ and $E[Y_i(0) \vert X_i,D_i]$. Under conditional exchangeability ($Y(d) \perp\kern-5pt\perp D \mid X$), we remove $D$ from the latter. In the linear regression context, we usually assume a linear conditional expectation function (CEF), so $E[Y_i(0) \vert X_i,D_i] = X_i\beta$. If you want semi/non-parametric regression, then you can just leave it unspecified and let the algorithm fit the functional form.

Our main interest is the former term $E[\tau_i \vert X_i,D_i]$. As $\tau_i = \tau_i(X_i)$, we can first remove $D$, which leaves $E[\tau_i \vert X_i]$. Without any functional-form assumptions, we simply write

\[\tau_i = \tau + f(X_i)\]

where $f$ is some arbitrary function such that $E[f(X)] = 0$ and, consequently, $E[\tau_i] = \tau$, which is the PATE. Substituting this into the expression for $E[Y_i \vert X_i,D_i]$ above gives

\[E[Y_i \vert X_i,D_i ] = \tau \cdot D_i + f(X_i) \cdot D_i + X_i \beta\]

with $E[f(X)] = 0$.

What’s there in Wooldridge’s paper is just setting $f(x) = \rho \cdot (x - \mu_X)$. This would be a saturated model if $x$ is binary. If not, it would possibly be a false assumption, but we are working with linear regression, so let’s believe it’s true. The subtraction forces $E[f(X)]=0$, which is the key to making the coefficient of the main term $D$, which is $\tau$, the PATE. What is important is $E[f(X)]=0$: as long as this moment condition is retained, we may take a more flexible specification of $f(X)$ than just $f(x) = \rho \cdot (x-\mu_X)$. Nevertheless, Wooldridge provides us with a simple way to deal with heterogeneous TE while retaining a simple regression framework.
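To see what the centering buys us, here is a minimal simulation sketch under assumptions of my own: a single continuous covariate, a treatment probability that depends on it, and a treatment effect that is linear in it. The same interaction regression is fitted twice, once with $D \cdot (X - \bar{X})$ and once with $D \cdot X$, and only the coefficient on $D$ is compared.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

mu_x = 2.0
x = rng.normal(loc=mu_x, scale=1.0, size=n)
p = 1.0 / (1.0 + np.exp(-(x - mu_x)))               # treatment depends on X
d = rng.binomial(1, p)

tau_pop, rho = 1.0, 0.8
tau_i = tau_pop + rho * (x - mu_x)                  # E[tau_i] = tau_pop (PATE)
y = tau_i * d + 0.5 * x + rng.normal(size=n)


def ols(design, outcome):
    """Least-squares coefficients for the given design matrix."""
    return np.linalg.lstsq(design, outcome, rcond=None)[0]


ones = np.ones(n)
xc = x - x.mean()                                   # mean-centered covariate
centered = ols(np.column_stack([ones, d, d * xc, x]), y)
uncentered = ols(np.column_stack([ones, d, d * x, x]), y)

print(f"coefficient on D, centered:   {centered[1]:.3f}")
print(f"coefficient on D, uncentered: {uncentered[1]:.3f}")
```

With centering, the coefficient on $D$ should be close to the PATE of 1. Without it, the same coefficient estimates the effect at $x = 0$, which is $1 - 0.8 \cdot 2 = -0.6$ in this setup, a value that no unit in the population needs to be anywhere near.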

This blog post was intended to show that the soul of Wooldridge’s approach can be extended to non/semi-parametric or other more flexible approaches that account for heterogeneous TE. I tried to show which assumption is essential and how standard potential outcome devices can be used to derive the results. I’m currently trying to apply this to genetic association studies by combining it with some population genetic theory. In case you’re interested, please contact me.

Posting Test

This blog post was created for testing purposes.