GWAS summary statistics are frequently shared these days. They have many usages, for example, an input for LDSC. However, they are sometimes not formatted properly including the absence of RSID. When RSID is absent, Chromosome:Position coordinates are there instead. In theory, it should be possible to map this value to RSID but it’s a demanding task when you have several millions of variants to be queried.
I’ve seen several solutions in the internet but neither of them was suitable for my purpose.
It was either very slow or not working.
After few days of search, I found a fast and a convenient way to do this using bedops
.
First, install bedops
from their homepage.
Next, you need a reference dataset that contains both the Chromosome:Position and the RSID.
This can be found in the ucsc FTP server hg19 hg38.
The file I used was snp150.txt.gz
.
To supply this file to bedops
, you first have to convert the reference file to a .bed
format.
zcat snp150.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp150_snp.bed
I’ll explain how this works.
Since snp150.txt.gz
is archived, zcat
unzips the file line-by-line and throws the output to the command after the pipe (|
).
Then, the awk
command receives the line from zcat
and checks if the 12th column’s value is “single”.
Note that the 12th column tells you the type of the variant.
“single” means it’s a single nucleotide subsitution, a SNP.
-v OFS='\t'
option ensures that the output is tab-limited.
Finally, awk
throws it’s output to >
which writes the output in the snp150_snp.bed
file.
Now let’s move onto the summary statistics. Mine looks like
snp beta se N P
1:729632:C:T -0.0416469 0.0577902 4344 0.47116
1:754063:G:T -0.0579509 0.104104 4152 0.577788
1:754105:C:T -0.0293816 0.0579841 4350 0.612379
1:754211:G:A -0.0384934 0.105497 4136 0.715221
1:754629:A:G -0.0733465 0.106585 4147 0.491398
1:759036:G:A -0.0361858 0.0579107 4342 0.532099
1:759884:C:A 0.0624291 0.082878 4252 0.451333
1:765904:G:T 0.0725445 0.0893362 4262 0.416814
1:767096:A:G 0.0287257 0.04053 4343 0.478517
The first column contains the Chromosome:Position.
We now have to convert this information into a .bed
format.
awk -v OFS='\t' 'NR!=1 {split($1,a,":"); print "chr"a[1],a[2]-1,a[2]}' sumstats.txt > query.bed
The NR!=1
option tells the awk
command to ignore the first line.
The split
command splits the value of the first column (which is the Chromosome:Position) and stores it to variable a
.
Subtracting 1 from a[2]
is related to the coordinate system of the .bed
format.
Read this link for more information.
Finally, we run bedmap
which is a part of the bedops
program.
bedmap --echo --echo-map-id --delim '\t' query.bed snp150_snp.bed > output.bed
This gives the desired result.
chr1 729631 729632 rs116720794
chr1 754062 754063 rs12184312
chr1 754104 754105 rs12184325
chr1 754210 754211 rs12184313
chr1 754628 754629 rs10454459
chr1 759035 759036 rs114525117
chr1 759883 759884 rs188068004
chr1 765903 765904 rs115541281
chr1 767095 767096 rs115991721
chr1 767812 767813 rs114066716