Querying RSID from Chromosome:Position (2)

In the last post, I explained how to add RSIDs to GWAS summary statistics. Although the final RSID assignment step was fast enough for processing large data, the overall process was quite cumbersome for several reasons. Most of all, it first extracted the chr:pos to a separate file and then assigned the RSIDs. Therefore, the user had to combine the original sumstats file with the chr:pos-RSID file manually after obtaining the RSIDs for the variants in chr:pos.

This post explains how to avoid these steps and write a single command that does everything you need. The magic comes from using pipes (|) that combine multiple commands in a shell. This is also helpful for debugging your script in the shell by throwing the head command between the lines. Additionally, I explain how to move between different genome builds (e.g. between hg19 and hg38) using the liftOver command.

The required programs and the files is listed below

  1. liftOver command-line tool. Find the liftOver binary in the link. The link contains x86_64-linux binaries so look at other directories for other operating systems.
  2. bedops command-line tool. bedops is a collection of many command-line tools. The one we are gonna use is bedmap. sort-bed is used too.
  3. liftOver chain files for hg38 and hg19. Files for other genome builds can be found in the parent directories of the link. Note that the genome build here referes to the genome build of the input file. Some basic examples for using these chain files can be found here.
  4. dbSNP reference for hg38 and hg19. The file we use today is snp144.txt.gz which is a full list of SNPs in the dbSNP version 144. The list contains the SNPs together with their annotations such as RSID, chr:pos, variant type and many more.

After installing the programs (simply put the binary files in to your ~/bin folder) and saving the files, my folder currently looks like

├── chains
│   ├── hg19ToHg38.over.chain.gz -> # downloaded from `hg19` chains folder
│   └── hg38ToHg19.over.chain.gz # downloaded from `hg38` chains folder
├── dbSNP
│   └── snp144.txt.gz # downloaded from `hg19` dbSNP
└── sumstats.txt

Step 1: Preparing the files

Both liftOver and bedops operate on a bed file. In the UCSC wiki, the precise description of the format is given. 12 columns are explained but only the first three is actually important in our application. The programs we use today don’t even care what appears after the first three.

First, we convert snp144.txt.gz into a proper bed file. Before the conversion, let’s see how the file looks like. The file is compressed so a naive way to see the content is to unzip it first. However, as you’ll notice, the file is huge so unzipping the whole file isn’t a nice choice.

The nice part of working in a shell is that many programs operates in a line-by-line basis. This means that the program processes the file line-by-line and throws the output. Only the first few lines will tell you how the file looks like so we could unzip the first few lines. This can be done using the combination of zcat and head commands.

zcat snp144.txt.gz | head -n 5

585     chr1    10019   10020   rs775809821     0       +       A       A       -/A     genomic deletion        unknownnear-gene-5      exact   1               1       SSMP,   0
585     chr1    10055   10055   rs768019142     0       +       -       -       -/A     genomic insertion       unknownnear-gene-5      between 1               1       SSMP,   0
585     chr1    10107   10108   rs62651026      0       +       C       C       C/T     genomic single  unknown 0      near-gene-5      exact   1               1       BCMHGSC_JDW,    0
585     chr1    10108   10109   rs376007522     0       +       A       A       A/T     genomic single  unknown 0      near-gene-5      exact   1               1       BILGI_BIOE,     0
585     chr1    10138   10139   rs368469931     0       +       A       A       A/T     genomic single  unknown 0      near-gene-5      exact   1               1       BILGI_BIOE,     0

There are total 20 columns. zcat sequentially unzips the file line-by-line and prompts the content. The pipe | receives the output in which zcat provides and supplies it to head. -n 5 option receives the output of zcat through the pipe and prompts the first 5 lines and terminates. As a result, things just happen in less than a second without unzipping the large file as a whole.

What are the relevant parts of the file? Column 2 (the chromosome number), columns 3/4 (0-based coordinates of the variant) and column 5 (the RSIDs) are all we need. The first three is required to form a proper bed file and the last columns is the annotation for our purpose. As explained earlier, programs we use don’t care much about the columns after the first three so you can add any number of additional columns after the first three.

zcat dbSNP/snp144.txt.gz | awk -v OFS="\t" '{print $2,$3,$4,$5}' > dbSNP/snp144.bed

zcat extracts the lines from snp144.txt.gz. awk takes the lines and prompts columns 2,3,4 and 5 (which is denoted as $2,$3,$4,$5). Finally, > writes the output of awk to snp144.bed.

You can check the content using the head command.

head -n 5 dbSNP/snp144.bed

chr1    10019   10020   rs775809821
chr1    10055   10055   rs768019142
chr1    10107   10108   rs62651026
chr1    10108   10109   rs376007522
chr1    10138   10139   rs368469931

Now we are done with preprocessing.

Step 2: Converting between genome builds

Suppose that you have a sumstats in hg38 format without RSIDs. You first have to convert the chr:pos to hg19 and then assign the RSIDs. This can be done using the UCSC liftOver tool. A simple example can be found here.

liftOver basically deals with bed files. This means that you should first convert your summary statistics to a bed file. As other programs using summary statistics requires some formatting, you need at least four consecutive conversions (sumstat to bed conversion, genome build conversion using liftOver, add the RSIDs, convert back to the sumstats format). Each step has to write and read the content of a file from the harddisk. As modern sumstats files are frequently big as much as several gigabytes, this is a very I/O intensive task that can be very slow. The magic of using pipes is to avoid unnecessary disk I/O and only read and write once. This is done by transferring the output of the previous command to the next command using pipes within the memory and only writing the final result to the disk.

My summary statistics looks like this (head -n 5 sumstats.txt).

snp     G_BETA  G_SE    N       P
chr1:786325:A:G 0.007589        0.028391        9574    0.789250
chr1:805145:G:A -0.009773       0.042689        9574    0.818923
chr1:809277:C:T 0.004939        0.028554        9574    0.862690
chr1:860040:G:A 0.005936        0.031575        9574    0.850886

First, look at the command I’ll use to do several jobs using a single command.

awk -v OFS="\t" 'NR!=1 {split($1,a,":"); print a[1],a[2]-1,a[2],a[3],a[4],$2,$3,$4,$5}' sumstats.txt \
    | liftOver stdin chains/hg38ToHg19.over.chain.gz stdout trash.bed -bedPlus=3 \
    | sort-bed - \
    | bedmap --echo --echo-map-id --delim "\t" - dbSNP/snp144.bed \
    >> sumstats.rsid.hg19.txt

\ is simply the linebreak symbol in the shell so don’t be bothered. sumstats.txt is read once in the first line and sumstats.rsid.hg19.txt is written to the disk in the last line. There are no disk I/Os elsewhere.

The first line reads sumstats.txt line-by-line and processes the line. After excluding the first line which contains the header (NR!=1), it splits the first column with the separator (split($1,a,":");) and stores the splitted elements to an array named a. a stores the chromosome number, position minus one, position, the reference allele and the alternative allele (the position minus one thing is related to the convention of 0-based indexing of variants). $nums appearing afterwards are the columns of sumstats.txt that are effect size, standard error, number of samples and P-value.

The pipe takes the output of awk and passes to liftOver. The stdin part tells liftOver that the input file is replaced by the lines coming from the pipe (which is the output of awk in the previous line). The stdout part tells liftOver to pass the output to the shell so that the pipe can receive it. trash.bed stores the variants in which liftOver failed to map. -bedPlus=3 part tells liftOver that the program just needs to consider the first 3 columns of the input.

sort-bed then receives the output of liftOver through the pipe using - which has the same function as the stdin of liftOver that tells sort-bed to receive what the pipe is passing to it. This command sorts the input in an ascending order so that the subsequent programs like bedmap can efficiently process the files. Finally, bedmap again receives the output of sort-bed through the pipe and adds to the input line the RSIDs by reading it from snp144.bed.