The variants in chr:pos format were first written to a separate file and then assigned the RSIDs. Therefore, the user had to manually combine the original sumstats file with the chr:pos-RSID file after obtaining the RSIDs for the chr:pos variants.
This post explains how to avoid these steps and write a single command that does everything you need.
The magic comes from using pipes (|) that combine multiple commands in a shell.
Pipes are also handy for debugging your script in the shell: you can temporarily throw a head command between the stages.
Additionally, I explain how to move between different genome builds (e.g. between hg19 and hg38) using the liftOver command.
The required programs and files are listed below.
- liftOver command-line tool. Find the liftOver binary in the link. The link contains x86_64-linux binaries, so look in the other directories for other operating systems.
- bedops command-line tool. bedops is a collection of many command-line tools; the one we are going to use is bedmap, and sort-bed is used too.
- liftOver chain files for hg38 and hg19. Files for other genome builds can be found in the parent directories of the link. Note that the genome build here refers to the genome build of the input file. Some basic examples for using these chain files can be found here.
- dbSNP annotation tables for hg38 and hg19. The file we use today is snp144.txt.gz, which is a full list of SNPs in dbSNP version 144. The list contains the SNPs together with their annotations such as RSID, chr:pos, variant type and many more.
After installing the programs (simply put the binary files into your ~/bin folder) and saving the files (see the download sketch after the directory listing below), my folder currently looks like
├── chains
│ ├── hg19ToHg38.over.chain.gz -> # downloaded from `hg19` chains folder
│ └── hg38ToHg19.over.chain.gz # downloaded from `hg38` chains folder
├── dbSNP
│ └── snp144.txt.gz # downloaded from `hg19` dbSNP
└── sumstats.txt
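For completeness, here is a hedged sketch of how these files might be fetched. The UCSC paths below are my assumption of the usual download locations (they occasionally move), so prefer the links above if any of them fail.
mkdir -p ~/bin chains dbSNP
# liftOver binary (x86_64 Linux build; other platforms live in sibling directories)
wget -P ~/bin https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chmod +x ~/bin/liftOver
# bedops binaries are distributed as a tarball on its GitHub releases page; unpack its bin/ contents into ~/bin
# chain files (named after the genome build of the *input* file)
wget -P chains https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
wget -P chains https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
# dbSNP 144 table for hg19
wget -P dbSNP https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp144.txt.gz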
Both liftOver and bedops operate on a bed file.
The UCSC wiki gives the precise description of the format.
Twelve columns are described, but only the first three actually matter for our application.
The programs we use today don’t even care what appears after the first three.
First, we convert snp144.txt.gz into a proper bed file.
Before the conversion, let’s see what the file looks like.
The file is compressed, so a naive way to see the content is to unzip it first.
However, as you’ll notice, the file is huge, so unzipping the whole thing isn’t a nice choice.
The nice part of working in a shell is that many programs operate on a line-by-line basis: the program reads the file line by line and emits output as it goes.
The first few lines will tell you what the file looks like, so we only need to unzip those.
This can be done by combining the zcat and head commands.
zcat snp144.txt.gz | head -n 5
585 chr1 10019 10020 rs775809821 0 + A A -/A genomic deletion unknown near-gene-5 exact 1 1 SSMP, 0
585 chr1 10055 10055 rs768019142 0 + - - -/A genomic insertion unknown near-gene-5 between 1 1 SSMP, 0
585 chr1 10107 10108 rs62651026 0 + C C C/T genomic single unknown 0 near-gene-5 exact 1 1 BCMHGSC_JDW, 0
585 chr1 10108 10109 rs376007522 0 + A A A/T genomic single unknown 0 near-gene-5 exact 1 1 BILGI_BIOE, 0
585 chr1 10138 10139 rs368469931 0 + A A A/T genomic single unknown 0 near-gene-5 exact 1 1 BILGI_BIOE, 0
There are 20 columns in total.
zcat sequentially unzips the file line by line and prints the content.
The pipe | takes the output that zcat produces and feeds it to head.
head receives that stream, and the -n 5 option tells it to print the first 5 lines and then terminate.
As a result, everything happens in under a second without ever unzipping the large file as a whole.
What are the relevant parts of the file?
Column 2 (the chromosome), columns 3/4 (the 0-based coordinates of the variant) and column 5 (the RSID) are all we need.
The first three are required to form a proper bed file and the last column is the annotation we are after.
As explained earlier, the programs we use don’t care much about the columns after the first three, so you can add any number of additional columns after them.
zcat dbSNP/snp144.txt.gz | awk -v OFS="\t" '{print $2,$3,$4,$5}' > dbSNP/snp144.bed
zcat extracts the lines from snp144.txt.gz.
awk takes each line and prints columns 2, 3, 4 and 5 (denoted $2,$3,$4,$5).
Finally, > writes the output of awk to snp144.bed.
You can check the content using the head command.
head -n 5 dbSNP/snp144.bed
chr1 10019 10020 rs775809821
chr1 10055 10055 rs768019142
chr1 10107 10108 rs62651026
chr1 10108 10109 rs376007522
chr1 10138 10139 rs368469931
Now we are done with preprocessing.
Suppose that you have a sumstats file in hg38 coordinates without RSIDs.
You first have to convert the chr:pos to hg19 and then assign the RSIDs.
The build conversion can be done using the UCSC liftOver tool; a simple example can be found here.
liftOver basically deals with bed files, which means you should first convert your summary statistics to a bed file.
Since other programs that consume summary statistics require their own formatting, you need at least four consecutive conversions: sumstats-to-bed conversion, genome build conversion using liftOver, adding the RSIDs, and converting back to the sumstats format.
Done naively, each step has to write a file to the hard disk and read it back.
As modern sumstats files are frequently as large as several gigabytes, this is a very I/O-intensive task that can be very slow.
The magic of using pipes is to avoid the unnecessary disk I/O and to read and write only once: the output of each command is handed to the next command in memory, and only the final result is written to disk.
My summary statistics look like this (head -n 5 sumstats.txt).
snp G_BETA G_SE N P
chr1:786325:A:G 0.007589 0.028391 9574 0.789250
chr1:805145:G:A -0.009773 0.042689 9574 0.818923
chr1:809277:C:T 0.004939 0.028554 9574 0.862690
chr1:860040:G:A 0.005936 0.031575 9574 0.850886
First, look at the command I’ll use to do all of these jobs at once.
awk -v OFS="\t" 'NR!=1 {split($1,a,":"); print a[1],a[2]-1,a[2],a[3],a[4],$2,$3,$4,$5}' sumstats.txt \
| liftOver stdin chains/hg38ToHg19.over.chain.gz stdout trash.bed -bedPlus=3 \
| sort-bed - \
| bedmap --echo --echo-map-id --delim "\t" - dbSNP/snp144.bed \
>> sumstats.rsid.hg19.txt
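As mentioned at the start, pipes also make debugging painless. As a hedged aside (not part of the workflow itself), you can check what any intermediate stage produces by splicing a head between two pipes and dropping the final redirection:
awk -v OFS="\t" 'NR!=1 {split($1,a,":"); print a[1],a[2]-1,a[2],a[3],a[4],$2,$3,$4,$5}' sumstats.txt \
| liftOver stdin chains/hg38ToHg19.over.chain.gz stdout trash.bed -bedPlus=3 \
| head -n 5
This prints the first five lifted-over lines and exits, without writing the full output anywhere.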
The \ is simply the shell’s line-continuation character, so don’t be bothered by it.
sumstats.txt is read once in the first line and sumstats.rsid.hg19.txt is written to disk in the last line; there is no disk I/O anywhere else.
The first line reads sumstats.txt line by line and processes each line.
After excluding the header line (NR!=1), it splits the first column on the separator (split($1,a,":")) and stores the resulting pieces in an array named a.
The print statement then outputs the chromosome, the position minus one, the position, the reference allele and the alternative allele (the position-minus-one part comes from the 0-based start coordinate convention of bed files).
The $2,$3,$4,$5 appearing afterwards are the remaining columns of sumstats.txt: the effect size, standard error, sample size and P-value.
The pipe takes the output of awk and passes it to liftOver.
The stdin argument tells liftOver that the input file is replaced by the lines coming from the pipe (the output of awk in the previous line).
The stdout argument tells liftOver to write its output to standard output so that the next pipe can receive it.
trash.bed stores the variants that liftOver failed to map.
The -bedPlus=3 option tells liftOver that it only needs to treat the first 3 columns as bed fields.
sort-bed then receives the output of liftOver through the pipe via -, which plays the same role as liftOver's stdin: it tells sort-bed to read whatever the pipe passes to it.
This command sorts the input in ascending order so that subsequent programs like bedmap can process the file efficiently.
Finally, bedmap receives the output of sort-bed through the pipe and appends to each input line the matching RSID, looked up in snp144.bed.
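The fourth conversion mentioned earlier, going back from bed to a sumstats layout, is not part of the command above. A minimal sketch of my own (assuming the 10-column output written by the pipeline, i.e. chr, start, end, ref, alt, beta, se, N, P, rsid; the output file name is illustrative):
awk -v OFS="\t" 'BEGIN {print "snp","rsid","G_BETA","G_SE","N","P"} {print $1":"$3":"$4":"$5, ($10==""?".":$10), $6, $7, $8, $9}' sumstats.rsid.hg19.txt > sumstats.rsid.hg19.formatted.txt
This could equally be appended as one more stage of the pipeline itself instead of a separate pass over the output file.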
I’ve seen several solutions on the internet but none of them was suitable for my purpose.
They were either very slow or simply not working.
After a few days of searching, I found a fast and convenient way to do this using bedops.
First, install bedops from their homepage.
Next, you need a reference dataset that contains both the Chromosome:Position and the RSID.
This can be found on the UCSC FTP server (hg19, hg38).
The file I used was snp150.txt.gz.
To supply this file to bedops, you first have to convert the reference file to the .bed format.
zcat snp150.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp150_snp.bed
I’ll explain how this works.
Since snp150.txt.gz is compressed, zcat unzips the file line by line and passes the output to the command after the pipe (|).
Then, the awk command receives each line from zcat and checks whether the 12th column’s value is “single”.
The 12th column tells you the type of the variant; “single” means a single-nucleotide substitution, i.e. a SNP.
The -v OFS='\t' option ensures that the output is tab-delimited.
Finally, the > redirection writes awk’s output to the snp150_snp.bed file.
Now let’s move on to the summary statistics. Mine look like
snp beta se N P
1:729632:C:T -0.0416469 0.0577902 4344 0.47116
1:754063:G:T -0.0579509 0.104104 4152 0.577788
1:754105:C:T -0.0293816 0.0579841 4350 0.612379
1:754211:G:A -0.0384934 0.105497 4136 0.715221
1:754629:A:G -0.0733465 0.106585 4147 0.491398
1:759036:G:A -0.0361858 0.0579107 4342 0.532099
1:759884:C:A 0.0624291 0.082878 4252 0.451333
1:765904:G:T 0.0725445 0.0893362 4262 0.416814
1:767096:A:G 0.0287257 0.04053 4343 0.478517
The first column contains the Chromosome:Position.
We now have to convert this information into the .bed format.
awk -v OFS='\t' 'NR!=1 {split($1,a,":"); print "chr"a[1],a[2]-1,a[2]}' sumstats.txt > query.bed
The NR!=1 condition tells awk to ignore the first line (the header).
The split function splits the value of the first column (the Chromosome:Position) on ":" and stores the pieces in the array a.
Subtracting 1 from a[2] is related to the coordinate system of the .bed format; read this link for more information.
Finally, we run bedmap, which is part of the bedops program.
bedmap --echo --echo-map-id --delim '\t' query.bed snp150_snp.bed > output.bed
This gives the desired result.
chr1 729631 729632 rs116720794
chr1 754062 754063 rs12184312
chr1 754104 754105 rs12184325
chr1 754210 754211 rs12184313
chr1 754628 754629 rs10454459
chr1 759035 759036 rs114525117
chr1 759883 759884 rs188068004
chr1 765903 765904 rs115541281
chr1 767095 767096 rs115991721
chr1 767812 767813 rs114066716
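What remains is the manual step of merging these RSIDs back into the summary statistics. A minimal sketch of my own (it assumes output.bed has exactly one line per non-header line of sumstats.txt and in the same order, which holds when bedmap echoes every query; sumstats.rsid.txt is an illustrative name):
paste <(tail -n +2 sumstats.txt) <(cut -f4 output.bed) > sumstats.rsid.txt
The process substitution <( ) is a bash feature. This extra bookkeeping is exactly what the pipe-based single command shown earlier avoids.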
Calculating IBD probabilities was always confusing to me because of the difference between $p$ and $p^2$ (and likewise for $q$) depending on whether the alleles are IBD or non-IBD. I kept looking for an extra line in the derivation that would help me walk through it, and I finally found one.
The usual calculation goes \(\begin{aligned} P(A_1 A_1) &= P(A_1 A_1 \mid \mathrm{IBD}) P (\mathrm{IBD}) + P(A_1 A_1 \mid \mathrm{non-IBD}) P(\mathrm{non-IBD}) \\ &= pF + p^2 (1-F) \end{aligned}\) while I insert one additional line \(\begin{aligned} P(A_1 A_1) &= P(A_1 A_1 \mid \mathrm{IBD}) P (\mathrm{IBD}) + P(A_1 A_1 \mid \mathrm{non-IBD}) P(\mathrm{non-IBD}) \\ &= P(A_1 \mid \mathrm{IBD}, A_1)P(A_1 \mid \mathrm{IBD}) P(\mathrm{IBD}) + P(A_1 A_1 \mid \mathrm{non-IBD}) P(\mathrm{non-IBD}) \\ &= pF + p^2 (1-F) \end{aligned}\) using the chain rule of probability ($P(A_1 A_1 \mid \mathrm{IBD}) = P(A_1 \mid \mathrm{IBD}, A_1)P(A_1 \mid \mathrm{IBD})$). To elaborate: once you know that one haplotype carries $A_1$ and that the other haplotype is IBD with it, the allele on the other haplotype is $A_1$ with probability one and $A_2$ with probability zero, while the first $A_1$ itself is drawn with probability $p$.
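A quick numeric check (my own hedged illustration, not part of the original note): with allele frequency $p = 0.3$ and inbreeding coefficient $F = 0.1$, \[P(A_1A_1) = pF + p^2(1-F) = 0.3 \times 0.1 + 0.09 \times 0.9 = 0.111,\] which is larger than the Hardy–Weinberg value $p^2 = 0.09$, as expected since IBD inflates homozygosity.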
I hope it might help other people who are struggling to understand it.
I attended Seoul National University College of Medicine. The college has three main teaching hospitals: Seoul National University Hospital in Yeongeon-dong (commonly called the main hospital), Seoul National University Bundang Hospital, and the municipal Boramae Medical Center in Dongjak-gu.
The three hospitals have very different characters. While the main hospital, as a tertiary referral hospital, focuses on treating and researching intractable diseases such as cancer and rare disorders, Boramae Medical Center is a municipal hospital whose mission is to care for underserved populations. I could feel this difference clearly during my clinical rotations as a student. On days with a long lunch break I would drink coffee near the main entrance, and with only slight exaggeration, every ten minutes I would see an elderly patient stepping out of a chauffeur-driven S-Class. It felt like every luxury car in the neighborhood could be spotted at the hospital. Boramae, on the other hand, is a completely different world. Many patients there cannot afford national health insurance premiums and instead receive Medical Aid, a form of public assistance. In the outpatient clinic I also saw quite a few patients from outside the capital region who presented only after their disease was already far advanced.
At the same time, we live in an era in which big data and artificial intelligence are in the spotlight. Following this trend, the university and its hospitals are actively pursuing research that tries to generate new knowledge from patient data accumulated over more than a decade. But a question occurred to me: an elderly person living in an Apgujeong Hyundai apartment worth some five billion won and an elderly person living in rural Yangsan, South Gyeongsang do not have equal opportunities to visit Seoul National University Hospital, so can good research really come from data produced by such a biased collection of patients?
Causal inference is the field that studies how to draw causal conclusions from observed correlations, and my question is a problem that has long been studied within it. The phenomenon in which non-representative data produce meaningless correlations that do not stem from causality is called selection bias. Selection bias is everywhere in daily life. Married women often lament that, among everyone they ever dated, their current spouse has the best personality but the worst looks. What causal mechanism could produce this inverse relationship between looks and character? The theory of selection bias gives a firm answer: because both looks and character increase the chance of dating (that is, dating is a consequence of both), the two appear inversely related when we look only at the people someone has dated. If dating occurred completely independently of looks and character, no such correlation would arise.
For the same reason, big data from Seoul National University Hospital includes only the patients who, as a result of many factors, were able to visit the hospital, so it inevitably suffers from the same problem as the example above. One kind of such medical big data is the large-scale genomic study. I study genetics and analyze such data. But genomic studies also include only people who volunteer to participate, so they cannot be free from selection bias either. It is therefore necessary to quantify how large that selection bias is and which factors drive it, and my research addresses exactly this problem.
A study published in Nature Genetics in 2021 analyzed the correlation between biological sex and autosomal genetic variants. Since biological sex is determined by the sex chromosomes and is unrelated to the autosomes, one would expect no correlation at all, but the result was shockingly different: hundreds of autosomal variants turned out to be correlated with sex.
That study pointed to selection bias as the cause of these correlations but did not pin down the exact mathematical mechanism. My work proposes the mathematical mechanism, and the general theory, that they did not.
Thinking about why I did this research, my experience in medical school was decisive. I lived in Busan for twenty years before coming to university, and despite its status as the country's second city, Busan's medical environment lags far behind the Jeolla and Gyeongbuk regions. The medical environment I experienced in Seoul was therefore astonishing in itself, and within it the consequences of inequality, such as the difference between the main hospital and Boramae, were always in the back of my mind. When I then came across that Nature Genetics study, I saw an opportunity to make my concerns concrete, and that is how this research began.
Although I did not write this in the paper itself, my research has several important implications. First, to remove selection bias from large-scale genomic studies, we need to know each participant's probability of taking part, and to know that probability we must understand the pathways by which people come to participate. Second, bringing this back to the Korean context, Korea is a country with very large regional inequalities in healthcare and large differences in access to care across social strata. No matter how much data a study in such a country is based on, it cannot be free of selection bias. So even for the sake of rigorous research, we are forced to think about the causes of and remedies for health inequality. Finally, the more equal a society's access to healthcare, the easier cutting-edge research becomes, because in such a society there is no selection bias arising from access to care.
I stand at the intersection of population genetics and causal inference, which means I do not study healthcare access or inequality per se. But the data I work with are produced in society, and precisely for that reason I cannot ignore healthcare access and inequality. This research shows that point, indirectly, from a mathematical perspective.
Another implication of this study design is that most markers are not actually causal; they merely tag a nearby causal variant. Therefore, a causal claim based on GWAS does not really make sense in its raw form. To make sense of this, many have proposed a framework viewing genotyped markers as noisy measurements of the causal locus. Read Pritchard and Przeworski, and Edge et al.
The consequence of measurement error has been documented in various fields, and the work of Edge builds on the psychometrics literature. Wooldridge's Econometric Analysis of Cross Section and Panel Data gives a more comprehensive treatment of the general issue, which I describe in this post.
Consider the following structural equation. It means that the explanatory variables have a causal effect on the dependent variable: changing an explanatory variable changes the distribution of the dependent variable.
\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K^m + v\]where the superscript $m$ marks the variable we do not have access to; instead of $x_K^m$ we observe its error-prone measurement $x_K$. Also, assume $\mathbb{E}[v \vert x_1, \ldots, x_K^m] = 0$ for unconfoundedness and $\mathbb{E}[v]=0$, since we've included the intercept $\beta_0$.
By a noisy measurement we mean that $x_K^m$ is observed as $x_K$ with an error $e_K$, where $e_K$ is independent of $x_1, \ldots, x_{K-1}$ and of the true value $x_K^m$, uncorrelated with $v$, and $\mathbb{E}[e_K] = 0$ (the classical errors-in-variables setup).
\[x_K = x_K^m + e_K\]It's important to clarify whether equation (2) is structural or not. We assume it is structural for the moment and discuss this later. The setup can be described by a DAG.
\[e_K \rightarrow x_K \leftarrow x_K^m \rightarrow y\]Substituting equation (2) into (1) gives
\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + (v-\beta_K e_K)\]Applying ordinary least squares (OLS) to equation (4) will not give a consistent estimate of $\beta_K$, since $x_K$ and $v-\beta_K e_K$ are generally correlated due to equation (2) (or diagram (3)). To see this, use the Frisch-Waugh-Lovell (FWL) theorem. First define $r_K = x_K - \mathrm{L}(x_K \vert 1, x_1, \ldots, x_{K-1})$, where $\mathrm{L}(\cdot \vert \cdot)$ is the linear projection. Then by FWL, we have
\[\mathrm{plim} \hat{\beta}_K^{\mathrm{OLS}} = \frac{\mathbb{E}[r_Ky]}{\mathbb{E}[r_K r_K]}\]From the definition of $r_K$,
\[x_K - \mathrm{L}(x_K \vert 1, x_1, \ldots, x_{K-1}) \\ = x_K^m + e_K - \mathrm{L}(x_K^m + e_K \vert 1, x_1, \ldots, x_{K-1}) \\ = r_K^m + e_K - \mathrm{L}(e_K \vert 1, x_1, \ldots, x_{K-1}) \\ = r_K^m + e_K\]where $r_K^m = x_K^m - \mathrm{L}(x_K^m \vert 1, x_1, \ldots, x_{K-1})$. The last equality comes from the fact that the measurement error $e_K$ is independent of $x_1, \ldots, x_{K-1}$ and $\mathbb{E}[e_K] = 0$.
Substituting this into the numerator of (5) gives
\[\mathbb{E}[r_K y] = \mathbb{E}[r_K^m y] + \mathbb{E}[e_K y]\\ = \mathbb{E}[r_K^m r_K^m]\beta_K + \mathbb{E}[e_K y]\\ = \mathbb{E}[r_K^m r_K^m]\beta_K + \mathbb{E}[e_K v] \\ = \mathbb{E}[r_K^m r_K^m]\beta_K\]and into the denominator gives
\[\mathbb{E}[r_K r_K] = \mathbb{E} [r_K^m r_K^m + 2 r_K^m e_K + e_K e_K] \\ = \mathbb{E} [r_K^m r_K^m] + \mathbb{E}[e_K e_K]\]Hence, we finally arrive at the following result.
\[\mathrm{plim}{\hat{\beta}_K^{\mathrm{OLS}}} = \beta_K \cdot \frac{\mathrm{Var}(r_K^m)}{\mathrm{Var}(r_K^m)+ \mathrm{Var}(e_K)}\]which states that OLS converges to a value attenuated toward zero, i.e. smaller in magnitude than the true value.
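To make the attenuation factor concrete (a hedged illustration of my own, not from the book): if the measurement error is as variable as the residualized true regressor, $\mathrm{Var}(e_K) = \mathrm{Var}(r_K^m)$, then \[\mathrm{plim}\,\hat{\beta}_K^{\mathrm{OLS}} = \frac{\beta_K}{2},\] i.e. the estimated marker effect is only half of the causal-locus effect; the attenuation factor is exactly the reliability of the marker as a measurement of the causal genotype.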
The core assumption that led to equation (9) is that the error $e_K$ is exogenous with respect to the variables appearing in the structural equation (1). Some mechanistic reasoning about the process generating LD makes this assumption doubtful. This becomes clear when we focus on the DAG (3). When the genotype at the causal locus is $x_K^m$ and the marker genotype is $x_K$, the latter is not caused by $x_K^m$. Instead, the two become correlated through an evolutionary process (say, $E$), depicted in the following DAG.
\[x_K \leftarrow E \rightarrow x_K^m\]This will eventually break the exogeneity of $e_K$, so the structural interpretation of equation (2) is lost. Furthermore, as the evolutionary process can produce population structure, $x_1, \ldots, x_{K-1}$ may be correlated with $e_K$ if we think of them as covariates (e.g. PCs). Therefore, the mismeasurement model for GWAS effect sizes is valid only under a restricted set of evolutionary processes, although such processes might be plausible in real human populations.
This scenario is partially addressed by Edge et al., where the measurement error $e_K$ is allowed to be correlated with $v$. When $e_K$ and $v$ are correlated, equation (7) retains a non-zero $\mathbb{E}[e_K v]$, which makes equation (9) invalid. Furthermore, the estimand of OLS is then not proportional to $\beta_K$, which makes the causal interpretation more difficult: if it were proportional, a non-zero effect size at the marker would be direct evidence for the presence of a causal variant, but a non-zero $\mathbb{E}[e_K v]$ invalidates this claim.
The issue might be resolved by adding ancestry-associated covariates that make $v$ independent of $e_K$. Nevertheless, this leaves the problem of correlation between $e_K$ and $x_1, \ldots, x_{K-1}$, which makes $\mathrm{L}(e_K \vert 1, x_1, \ldots, x_{K-1})$ non-zero.
As a concluding remark, it would be interesting to verify the consequences of the phenomena outlined here through simulations.
The book I mentioned (Wooldridge, 2010) doesn't consider $\mathrm{L}(e_K \vert x_{-K}) \neq 0$, where $x_{-K} = (1, x_1, \ldots, x_{K-1})$. For notational convenience, also define $x_{-1,-K} = (x_1, \ldots, x_{K-1})$. I did some calculations on this quantity. Let $\alpha_0$ and $\alpha = (\alpha_1, \ldots, \alpha_{K-1})$ be the coefficients of $\mathrm{L}(e_K \vert x_{-K})$.
\[\mathbb{E}[y \mathrm{L}(e_K \vert x_{-K})] = \mathbb{E}[y \alpha_0 + y x_{-1,-K} \alpha] = \\ \mathbb{E}[-y \mathbb{E}(x_{-1,-K})\alpha + yx_{-1,-K} \alpha] = \\ \mathbb{E}[y(x_{-1,-K}-\mathbb{E}(x_{-1,-K}))] \alpha = \\ \mathrm{Cov}(x_{-1,-K},y) \cdot \alpha\]This term also has nothing to do with $\beta_K$.
In this post, I review two approaches to instrumental variables. The first is the local average treatment effect (LATE) framework famous in econometrics (Imbens and Angrist, 1994). The second is the structural mean model (SMM) framework proposed by the Harvard causal inference group (Robins, 1994).
We start from the switching formula.
\[Y_i = D_i \cdot Y_i(1) + (1-D_i) \cdot Y_i(0)\]which gives
\[Y_i = \tau_i \cdot D_i + Y_i(0)\]after reordering. $\tau_i = Y_i(1) - Y_i(0)$ is the individual treatment effect.
Learning about the importance of heterogeneous treatment effects in IV was a very illuminating experience. At the same time, the plethora of assumptions was very confusing: how do these different assumptions relate to each other? I don't have a full-blown answer to this question. However, I think the two approaches, by econometricians and by epidemiologists, can be classified by the stage at which the assumption is applied. The former, usually referred to as monotonicity, is a restriction on how the instrument $Z$ affects the treatment $D$. The latter (e.g. the no treatment effect modification (NEM) assumption) instead restricts how the treatment $D$ influences the outcome $Y$. The following derivations make this point clear.
The proof is essentially the same as in Mostly Harmless Econometrics (Theorem 4.4.1, p.155). Apply $E[\cdot \vert Z]$ to equation (2).
\[E[Y_i \vert Z_i] = E[\tau_i \cdot D_i \vert Z_i] + E[Y_i(0) \vert Z_i]\]The last term is just the constant $E[Y_i(0)]$ by the exclusion restriction $Y(d) \perp\kern-5pt\perp Z$.
\[E[Y_i \vert Z_i=1] - E[Y_i \vert Z_i=0] \\ = E[\tau_i \cdot D_i \vert Z_i=1] - E[\tau_i \cdot D_i \vert Z_i=0]\]To simplify this equation, we have two choices. One is to impose some condition on $\tau_i$ and the other is to impose some condition on $D_i$. Monotonicity does the latter by assuming no defiers. Let $C_i = D_i(1) - D_i(0)$ be the compliance status. Then
\[E[\tau_i \cdot D_i \vert Z_i] \\= E[ E[ \tau_i \cdot D_i \vert Z_i, C_i ] \vert Z_i] \\= \sum_c E[\tau_i \cdot D_i \vert Z_i, C_i=c] \cdot P(C_i =c \vert Z_i)\]By exogeneity of the instrument, $P(C_i \vert Z_i) = P(C_i)$. For $Z=1$,
\[= \sum_c E[\tau_i \cdot D_i \vert Z_i=1, C_i=c] \cdot P(C_i =c) \\ = E[\tau_i \cdot 1 \vert Z_i=1, C_i=1] \cdot P(C_i =1) \\ + E[\tau_i \cdot D_i \vert Z_i=1, C_i=0] \cdot P(C_i=0)\]and for $Z=0$,
\[= E[\tau_i \cdot 0 \vert Z_i=0, C_i=1] \cdot P(C_i =1) \\ + E[\tau_i \cdot D_i \vert Z_i=0, C_i=0] \cdot P(C_i=0)\]Finally, substituting (6) and (7) into (4), the exclusion restriction guarantees
\[E[\tau_i \cdot D_i \vert Z_i=1] - E[\tau_i \cdot D_i \vert Z_i=0] \\ = E[\tau_i \cdot 1 \vert C_i =1] \cdot P(C_i=1) \\ = E[\tau_i \vert C_i=1] \cdot E[D_i(1) - D_i(0)] \\ = E[\tau_i \vert C_i=1] \cdot (E[D_i \vert Z=1] - E[D_i \vert Z=0])\]where the last line comes from exogeneity. The proof shows that there is no restriction on $\tau_i$; only the assumptions involving $D_i$ were used to derive the Wald ratio.
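Spelled out (a hedged restatement of the standard result, nothing beyond the derivation above), dividing equation (4) by the first-stage difference gives the Wald ratio \[\frac{E[Y_i \mid Z_i=1] - E[Y_i \mid Z_i=0]}{E[D_i \mid Z_i=1] - E[D_i \mid Z_i=0]} = E[\tau_i \mid C_i = 1],\] the average treatment effect among compliers (LATE).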
The additive SMM is
\[E[Y-Y(0) \vert D, Z] = (\psi_0 + \psi_1 Z) \cdot D\]When I saw this for the first time, I was confused. Most importantly, the interpretation of $\psi_0$ and $\psi_1$ wasn’t very transparent to me. Therefore, I thought it was better to start from equation (2) to make it more sensible.
Applying $E[\cdot \vert D,Z]$ to equation (2) gives
\[E[Y_i - Y_i(0) \vert D_i,Z_i ] = E[\tau_i \cdot D_i \vert D_i, Z_i] \\ = E[\tau_i \vert D_i, Z_i] \cdot D_i \\ = E[\tau_i \vert Z_i] \cdot D_i\]So what happens in the SMM is the specification of $E[\tau_i \vert Z_i]$ (the last equality implicitly assumes that, given $Z_i$, the treatment effect does not depend further on $D_i$), which is
\[E[\tau_i \vert Z_i] = \psi_0 + \psi_1 Z_i\]NEM imposes $\psi_1 = 0$, so that $\psi_0$ becomes the PATE. Embracing the soul of Wooldridge's idea that I wrote about, using $Z - \mu_Z$ instead of $Z$ would give $\psi_0$ the same interpretation without the NEM, although the identification problem would remain (the number of moment conditions is smaller than the number of estimands).
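To spell that remark out (a small derivation of my own, not taken from the references): with the centered specification \[E[\tau_i \mid Z_i] = \psi_0 + \psi_1 (Z_i - \mu_Z), \qquad \mathbb{E}\big[E[\tau_i \mid Z_i]\big] = \psi_0 + \psi_1 \cdot 0 = \psi_0,\] so $\psi_0$ equals $E[\tau_i]$, the PATE, regardless of whether $\psi_1$ is zero; what is still missing is a second moment condition to pin down $\psi_1$.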
To obtain the Wald ratio, apply the law of iterated expectations (assuming NEM):
\[E[Y_i - Y_i(0) \vert Z] = \\ E[ E[Y_i - Y_i(0) \vert D,Z] \vert Z ] \\ = E[ \psi_0 \cdot D \vert Z ] \\ = \psi_0 \cdot E[ D \vert Z ]\]and substituting this into equation (4) gives the desired result. Examining the consequence of NEM on equation (11) directly shows that SMM-based approaches achieve identification by restricting the mode of $D \rightarrow Y$.
My impression of these results is that monotonicity and the SMM-based assumptions are not necessarily stronger or weaker than one another. The choice ultimately depends on which stage, $Z \rightarrow D$ or $D \rightarrow Y$, is to be restricted by the additional assumptions. Maybe someone can come up with an alternative identification strategy that imposes restrictions on both stages, with each restriction weaker than its counterpart above.
where $\sigma_D^2(X_i)$ is the conditional variance of the treatment variable $D$ and $\delta_X$ is the stratum-specific treatment effect. The formula simply says that under heterogeneous treatment effects, OLS estimates a variance-weighted average of $\delta_X$. In certain cases this might not be the policy-relevant measure, although it retains the notion of a causal effect. Read MHE for more details.
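For reference (quoted from memory, so treat this as a hedged paraphrase of the MHE regression-anatomy result rather than an exact citation), the formula being referred to is \[\delta_R = \frac{\mathbb{E}\big[\sigma_D^2(X_i)\,\delta_{X_i}\big]}{\mathbb{E}\big[\sigma_D^2(X_i)\big]}, \qquad \sigma_D^2(X_i) = \mathrm{Var}(D_i \mid X_i),\] i.e. strata in which the treatment varies more get more weight in the regression coefficient $\delta_R$.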
After some time, I read Wooldridge's recent work on staggered difference-in-differences. In this work, Wooldridge shows a simple solution to the negative weights that arise in two-way fixed effects (TWFE) when the treatment effect is heterogeneous. One thing that caught my eye was the mean-centering of the covariates to preserve the average treatment effect on the treated (ATT) interpretation of the main coefficient. Again, read the paper for more details.
Soon after, I found another work by Wooldridge (with Negi) that uses the same technique in OLS estimation of treatment effects. After a close inspection, I realized that all these works share the same soul and can be derived with the machinery of MHE. This blog post is simply a personal note on how I derive and understand these fantastic works.
With a binary treatment variable $D$ and outcome $Y$, we write the potential outcomes as $Y(d)$ ($d=0,1$). With the consistency assumption $Y = Y(d)$ if $D=d$, we can write the famous switching formula.
\[Y_i = D_i \cdot Y_i(1) + (1-D_i) \cdot Y_i(0)\]A careful reordering gives the following form.
\[Y_i = (Y_i(1) - Y_i(0)) \cdot D_i + Y_i(0) = \tau_i \cdot D_i + Y_i(0)\]where $\tau_i = Y_i(1)- Y_i(0)$ is the individual treatment effect. The average over $\tau_i$ is $\tau = E[\tau_i]$ which is the population average treatment effect (PATE).
Let $X$ be the covariates, which are confounders or effect modifiers (or both). Taking $E[\cdot \vert X, D]$ on both sides gives something more familiar.
\[E[Y_i \vert X_i, D_i] = E[\tau_i \cdot D_i \vert X_i,D_i] + E[Y_i(0) \vert X_i,D_i] \\ = E[\tau_i \vert X_i,D_i] \cdot D_i + E[Y_i(0) \vert X_i,D_i]\]We have to simplify $E[\tau_i \vert X_i,D_i]$ and $E[Y_i(0) \vert X_i,D_i]$. Under conditional exchangeability ($Y(d) \perp\kern-5pt\perp D \mid X$), we can drop $D$ from the latter. In a linear regression context, we usually assume a linear conditional expectation function (CEF), so $E[Y_i(0) \vert X_i,D_i] = X_i\beta$. If you want semi- or non-parametric regression, you can just leave it and let the algorithm fit the functional form.
Our main interest is the former term $E[\tau_i \vert X_i,D_i]$. Since $\tau_i$ is a function of the potential outcomes, conditional exchangeability also lets us drop $D$ here, leaving $E[\tau_i \vert X_i]$. Without any functional assumptions, we simply write
\[\tau_i = \tau + f(X_i)\]where $f$ is some arbitrary function such that $E[f(X)] = 0$ and consequently $E[\tau_i] = \tau$, the PATE. Substituting this into equation (4) gives
\[E[Y_i \vert X_i,D_i ] = \tau \cdot D_i + f(X_i) \cdot D_i + X_i \beta\]with $E[f(X)] = 0$.
What Wooldridge's paper does is simply set $f(x) = \rho \cdot (x - \mu_X)$. This would be a saturated model if $x$ were binary; if not, it is possibly a false assumption, but we are working with linear regression, so let's believe it's true. The subtraction forces $E[f(X)]=0$, which is the key to making the coefficient of the main term $D$, namely $\tau$, the PATE. What matters is $E[f(X)]=0$: as long as this moment condition holds, we may take a more flexible specification of $f(X)$ than $f(x) = \rho \cdot (x-\mu_X)$. Nevertheless, Wooldridge gives us a simple way to deal with heterogeneous treatment effects while retaining a simple regression framework.
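Concretely (my own hedged restatement of the estimating equation implied above, not a quote from the paper), the regression one actually runs is \[Y_i = \alpha + \tau\, D_i + \rho\,(X_i - \bar{X})\, D_i + X_i\,\beta + \varepsilon_i,\] where $\bar{X}$ is the sample mean standing in for $\mu_X$: demean the covariates, interact them with the treatment indicator, and, under the assumptions above, read the coefficient on $D_i$ as the PATE.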
This blog post was intended to show that the soul of Wooldridge's approach extends to non- or semi-parametric (or otherwise more flexible) approaches that account for heterogeneous treatment effects. I tried to show which assumption is essential and how standard potential-outcome devices can be used to derive the results. I'm currently trying to apply this to genetic association studies by combining it with some population genetic theory; in case you're interested, please contact me.
As an illustrative example, let's think of the following model, which includes Poisson, negative binomial and many more.
\[\log E[Y \vert m, T, X] = \log m + T + \beta X\]where $m$ is the size factor, $T$ is some biological effect (cell type, trajectory, etc.) and $X$ is the covariate (this may include the intercept). Say someone wants to remove $X$ from the data using the regress-out approach.
Equation (1) is just a standard log-linear conditional expectation function (CEF), as assumed in most count-based single-cell methods. One might worry about over-dispersion in scRNA-seq data; note that because we only assume equation (1), the setting is robust to any type of over- or under-dispersion, so don't worry.
In general, we don't have any information about $T$, and as in the Seurat vignette, it is inferred after we regress out $X$. Therefore, the regression is done based on the following equation.
\[\log E[Y \vert m, X] = \log m + \beta' X\]Note that I added a prime to state explicitly that $\beta$ and $\beta'$ are not the same. The parameter $\beta$ is
\[\beta = \log \frac{E[Y \vert m, T, X=x+1]}{E[Y \vert m, T, X=x]}\]and $\beta’$ is
\[\beta' = \log \frac{E[Y \vert m, X=x+1]}{E[Y \vert m, X=x]}\]The difference between the two quantities can be analytically obtained using the law of iterated expectation.
\[E[Y \vert m, X] = E[ E[Y \vert m, T, X] \vert m, X] \\= E[ \exp(\log m + \beta X + T ) \vert m, X] \\= \exp(\log m + \beta X ) \cdot E[ \exp(T) \vert m,X ]\]Substituting this into equation (4) gives
\[\beta' = \log \frac{E[Y \vert m, X=x+1]}{E[Y \vert m, X=x]} \\ = \log \frac{\exp(\log m + \beta \cdot (x+1)) \cdot E[\exp(T) \vert m,X=x+1]}{\exp(\log m + \beta \cdot x) \cdot E[\exp(T) \vert m,X=x]} \\ = \beta + \log \frac{E[\exp(T) \vert X=x+1]}{E[\exp(T) \vert X=x]}\]Therefore,
\[\beta' - \beta = \log \frac{E[\exp(T) \vert X=x+1]}{E[\exp(T) \vert X=x]}\]which shows that the regress-out approach estimates $\beta'$, which is not $\beta$, the parameter we intended to estimate. Equation (6) is zero if the covariate $X$ and $T$ are independent. This may not be true, for example, when $X$ is the batch covariate and one tries to merge data from two batches with different cell-type compositions: $T$ is then correlated with $X$, so equation (6) is non-zero. Unless the covariate is independent of $T$, the regress-out approach estimates the wrong number and subtracts it from the data. Hence, I believe that, with few exceptions, it should not be recommended.
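As a concrete illustration (my own hedged example, not from the original post): suppose $T$ is a binary cell-type effect taking the values $0$ and $t$, and the fraction of cells with $T=t$ is $\pi_x$ in batch $X=x$. Then $E[\exp(T) \vert X=x] = \pi_x e^{t} + (1-\pi_x)$, so \[\beta' - \beta = \log \frac{\pi_{x+1}\, e^{t} + (1-\pi_{x+1})}{\pi_{x}\, e^{t} + (1-\pi_{x})},\] which is non-zero whenever the composition differs across batches ($\pi_{x+1} \neq \pi_x$) and the cell types actually differ in expression ($t \neq 0$).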