Treatment effect heterogeneity and linear regression

In Mostly Harmless Econometrics (3.3.7, p75), the ordinary least square (OLS) estimand is given by

\[\frac{E[ \sigma_D^2 (X_i) \delta_X ]}{E[\sigma_D^2 (X_i)]}\]

where $\sigma_D^2(X_i)$ is the variance of treatment variable $D$ and $\delta_X$ is the stratum specific treatment effect. The formula simply says that under heterogeneous treatment effect, the OLS estimates a variance weighted average over $\delta_X$. In certain cases, this might not be the policy relevant measure although it retains the notion of causal effects. Read MHE for more details.

After some time, I read Wooldrige’s recent work on staggered difference-in-differences. In this work, Wooldridge shows a simple solution to negative weights arising in two-way fixed effects (TWFE) when TE is heterogeneous. One thing that catched my eyes was the mean-centering of the covariates to preserve the average treatment effect of the treated (ATT) interpretation of the main coefficient. Again, read the paper for more details.

Soon after, I found that another work by Wooldridge (with Negi) which uses the same technique to estimate OLS. After a close inspection, I realized that all these works share the same soul and can be derived using the machinary of MHE. This blog post is simply a personal note on how I derive and understand these fantastic works.

With bianry treatment variable $D$ and outcome $Y$, we write potential outcomes as $Y(d)$ ($d=0,1$). With consistency assumption $Y = Y(d)$ if $D=d$, we can write the famous switching formula.

\[Y_i = D_i \cdot Y_i(1) + (1-D_i) \cdot Y_i(0)\]

A careful reordering gives the following form.

\[Y_i = (Y_i(1) - Y_i(0)) \cdot D_i + Y_i(0) = \tau_i \cdot D_i + Y_i(0)\]

where $\tau_i = Y_i(1)- Y_i(0)$ is the individual treatment effect. The average over $\tau_i$ is $\tau = E[\tau_i]$ which is the population average treatment effect (PATE).

Let $X$ be the covariates that are confoudners or effect modifiers (or both). By taking $E[\cdot \vert X, D]$ on both sides, we have something more familiar.

\[E[Y_i \vert X_i, D_i] = E[\tau_i \cdot D_i \vert X_i,D_i] + E[Y_i(0) \vert X_i,D_i] \\ = E[\tau_i \vert X_i,D_i] \cdot D_i + E[Y_i(0) \vert X_i,D_i]\]

We have to simplify $E[\tau_i \vert X_i,D_i]$ and $E[Y_i(0) \vert X_i,D_i]$. Under conditional exchangability ($Y(d) \perp\kern-5pt\perp D \mid X$), we remove $D$ from the latter. In linear regression context, we usually assume a linear conditional expectation function (CEF) so $E[Y_i(0) \vert X_i,D_i] = X_i\beta$. If you want semi/non-parametric regression than you can just leave it and let the algorithm fit the functional form.

Our main interest is the former term $E[\tau_i \vert X_i,D_i]$. As $\tau_i = \tau_i(X_i)$, we can remove $D$ first so we have $E[\tau_i \vert X_i]$. Without any functional assumptions, we just simply put

\[\tau_i = \tau + f(X_i)\]

where $f$ is some arbitrary function such that $E[f(X)] = 0$ and consequently, $E[\tau_i] = \tau$ which is the PATE. Substituting this to equation (4) gives

\[E[Y_i \vert X_i,D_i ] = \tau \cdot D_i + f(X_i) \cdot D_i + X_i \beta\]

with $E[f(X)] = 0$.

What’s there in Wooldridge’s paper is just setting $f(x) = \rho \cdot (x - \mu_X)$. This would be a saturated model if $x$ is binary. If not, it would possibly be a false assumption but we are working with linear regression so let’s believe it’s true. The subtraction forces $E[f(X)]=0$ which is the key to make the coefficient of the main term $D$, which is $\tau$, the PATE. What is important is $E[f(X)]=0$ because as long as this moment condition is retained, we may take more flexible specifictaion of $f(X)$ which may not be just $f(x) = \rho \cdot (x-\mu_X)$. Nevertheless, Wooldridge provides us a simple way to deal with heterogenous TE while retaining a simple regression framework.

This blog post was intended to show that the soul of Wooldridge’s approach can be extended to non/semi-parametric or whatsoever more flexible approaches that account for heterogenous TE. I tried to show what assumption is essential and how standard potential outcome devices can be used to derive the results. I’m currently trying to apply this to genetic association study by combining some population genetic theory. In case your interested, please contact me.

Comments