In the last post, I explained how to add RSIDs to GWAS summary statistics.
Although the final RSID assignment step was fast enough for processing large data, the overall process was quite cumbersome for several reasons.
Most of all, it first extracted the chr:pos
to a separate file and then assigned the RSIDs.
Therefore, the user had to combine the original sumstats file with the chr:pos
-RSID file manually after obtaining the RSIDs for the variants in chr:pos
.
This post explains how to avoid these steps and write a single command that does everything you need.
The magic comes from using pipes (|
) that combine multiple commands in a shell.
This is also helpful for debugging your script in the shell by throwing the head
command between the lines.
Additionally, I explain how to move between different genome builds (e.g. between hg19
and hg38
) using the liftOver
command.
The required programs and the files is listed below
liftOver
command-line tool. Find theliftOver
binary in the link. The link containsx86_64-linux
binaries so look at other directories for other operating systems.bedops
command-line tool.bedops
is a collection of many command-line tools. The one we are gonna use isbedmap
.sort-bed
is used too.liftOver
chain files forhg38
andhg19
. Files for other genome builds can be found in the parent directories of the link. Note that the genome build here referes to the genome build of the input file. Some basic examples for using these chain files can be found here.- dbSNP reference for
hg38
andhg19
. The file we use today issnp144.txt.gz
which is a full list of SNPs in the dbSNP version 144. The list contains the SNPs together with their annotations such as RSID,chr:pos
, variant type and many more.
After installing the programs (simply put the binary files in to your ~/bin
folder) and saving the files, my folder currently looks like
├── chains
│ ├── hg19ToHg38.over.chain.gz -> # downloaded from `hg19` chains folder
│ └── hg38ToHg19.over.chain.gz # downloaded from `hg38` chains folder
├── dbSNP
│ └── snp144.txt.gz # downloaded from `hg19` dbSNP
└── sumstats.txt
Step 1: Preparing the files
Both liftOver
and bedops
operate on a bed
file.
In the UCSC wiki, the precise description of the format is given.
12 columns are explained but only the first three is actually important in our application.
The programs we use today don’t even care what appears after the first three.
First, we convert snp144.txt.gz
into a proper bed
file.
Before the conversion, let’s see how the file looks like.
The file is compressed so a naive way to see the content is to unzip it first.
However, as you’ll notice, the file is huge so unzipping the whole file isn’t a nice choice.
The nice part of working in a shell is that many programs operates in a line-by-line basis.
This means that the program processes the file line-by-line and throws the output.
Only the first few lines will tell you how the file looks like so we could unzip the first few lines.
This can be done using the combination of zcat
and head
commands.
zcat snp144.txt.gz | head -n 5
585 chr1 10019 10020 rs775809821 0 + A A -/A genomic deletion unknownnear-gene-5 exact 1 1 SSMP, 0
585 chr1 10055 10055 rs768019142 0 + - - -/A genomic insertion unknownnear-gene-5 between 1 1 SSMP, 0
585 chr1 10107 10108 rs62651026 0 + C C C/T genomic single unknown 0 near-gene-5 exact 1 1 BCMHGSC_JDW, 0
585 chr1 10108 10109 rs376007522 0 + A A A/T genomic single unknown 0 near-gene-5 exact 1 1 BILGI_BIOE, 0
585 chr1 10138 10139 rs368469931 0 + A A A/T genomic single unknown 0 near-gene-5 exact 1 1 BILGI_BIOE, 0
There are total 20 columns.
zcat
sequentially unzips the file line-by-line and prompts the content.
The pipe |
receives the output in which zcat
provides and supplies it to head
.
-n 5
option receives the output of zcat
through the pipe and prompts the first 5 lines and terminates.
As a result, things just happen in less than a second without unzipping the large file as a whole.
What are the relevant parts of the file?
Column 2 (the chromosome number), columns 3/4 (0-based coordinates of the variant) and column 5 (the RSIDs) are all we need.
The first three is required to form a proper bed
file and the last columns is the annotation for our purpose.
As explained earlier, programs we use don’t care much about the columns after the first three so you can add any number of additional columns after the first three.
zcat dbSNP/snp144.txt.gz | awk -v OFS="\t" '{print $2,$3,$4,$5}' > dbSNP/snp144.bed
zcat
extracts the lines from snp144.txt.gz
.
awk
takes the lines and prompts columns 2,3,4 and 5 (which is denoted as $2,$3,$4,$5
).
Finally, >
writes the output of awk
to snp144.bed
.
You can check the content using the head
command.
head -n 5 dbSNP/snp144.bed
chr1 10019 10020 rs775809821
chr1 10055 10055 rs768019142
chr1 10107 10108 rs62651026
chr1 10108 10109 rs376007522
chr1 10138 10139 rs368469931
Now we are done with preprocessing.
Step 2: Converting between genome builds
Suppose that you have a sumstats in hg38
format without RSIDs.
You first have to convert the chr:pos
to hg19
and then assign the RSIDs.
This can be done using the UCSC liftOver
tool.
A simple example can be found here.
liftOver
basically deals with bed
files.
This means that you should first convert your summary statistics to a bed
file.
As other programs using summary statistics requires some formatting, you need at least four consecutive conversions (sumstat to bed
conversion, genome build conversion using liftOver
, add the RSIDs, convert back to the sumstats format).
Each step has to write and read the content of a file from the harddisk.
As modern sumstats files are frequently big as much as several gigabytes, this is a very I/O intensive task that can be very slow.
The magic of using pipes is to avoid unnecessary disk I/O and only read and write once.
This is done by transferring the output of the previous command to the next command using pipes within the memory and only writing the final result to the disk.
My summary statistics looks like this (head -n 5 sumstats.txt
).
snp G_BETA G_SE N P
chr1:786325:A:G 0.007589 0.028391 9574 0.789250
chr1:805145:G:A -0.009773 0.042689 9574 0.818923
chr1:809277:C:T 0.004939 0.028554 9574 0.862690
chr1:860040:G:A 0.005936 0.031575 9574 0.850886
First, look at the command I’ll use to do several jobs using a single command.
awk -v OFS="\t" 'NR!=1 {split($1,a,":"); print a[1],a[2]-1,a[2],a[3],a[4],$2,$3,$4,$5}' sumstats.txt \
| liftOver stdin chains/hg38ToHg19.over.chain.gz stdout trash.bed -bedPlus=3 \
| sort-bed - \
| bedmap --echo --echo-map-id --delim "\t" - dbSNP/snp144.bed \
>> sumstats.rsid.hg19.txt
\
is simply the linebreak symbol in the shell so don’t be bothered.
sumstats.txt
is read once in the first line and sumstats.rsid.hg19.txt
is written to the disk in the last line.
There are no disk I/Os elsewhere.
The first line reads sumstats.txt
line-by-line and processes the line.
After excluding the first line which contains the header (NR!=1
), it splits the first column with the separator (split($1,a,":");
) and stores the splitted elements to an array named a
.
a
stores the chromosome number, position minus one, position, the reference allele and the alternative allele (the position minus one thing is related to the convention of 0-based indexing of variants).
$num
s appearing afterwards are the columns of sumstats.txt
that are effect size, standard error, number of samples and P-value.
The pipe takes the output of awk
and passes to liftOver
.
The stdin
part tells liftOver
that the input file is replaced by the lines coming from the pipe (which is the output of awk
in the previous line).
The stdout
part tells liftOver
to pass the output to the shell so that the pipe can receive it.
trash.bed
stores the variants in which liftOver
failed to map.
-bedPlus=3
part tells liftOver
that the program just needs to consider the first 3 columns of the input.
sort-bed
then receives the output of liftOver
through the pipe using -
which has the same function as the stdin
of liftOver
that tells sort-bed
to receive what the pipe is passing to it.
This command sorts the input in an ascending order so that the subsequent programs like bedmap
can efficiently process the files.
Finally, bedmap
again receives the output of sort-bed
through the pipe and adds to the input line the RSIDs by reading it from snp144.bed
.