Querying RSID from Chromosome:Position

GWAS summary statistics are frequently shared these days. They have many usages, for example, an input for LDSC. However, they are sometimes not formatted properly including the absence of RSID. When RSID is absent, Chromosome:Position coordinates are there instead. In theory, it should be possible to map this value to RSID but it’s a demanding task when you have several millions of variants to be queried.

I’ve seen several solutions in the internet but neither of them was suitable for my purpose. It was either very slow or not working. After few days of search, I found a fast and a convenient way to do this using bedops.

First, install bedops from their homepage. Next, you need a reference dataset that contains both the Chromosome:Position and the RSID. This can be found in the ucsc FTP server hg19 hg38. The file I used was snp150.txt.gz. To supply this file to bedops, you first have to convert the reference file to a .bed format.

zcat snp150.txt.gz | awk -v OFS='\t' '{if($12=="single") print $2,$3,$4,$5}' > snp150_snp.bed 

I’ll explain how this works. Since snp150.txt.gz is archived, zcat unzips the file line-by-line and throws the output to the command after the pipe (|). Then, the awk command receives the line from zcat and checks if the 12th column’s value is “single”. Note that the 12th column tells you the type of the variant. “single” means it’s a single nucleotide subsitution, a SNP. -v OFS='\t' option ensures that the output is tab-limited. Finally, awk throws it’s output to > which writes the output in the snp150_snp.bed file.

Now let’s move onto the summary statistics. Mine looks like

	snp     beta    se      N       P
	1:729632:C:T    -0.0416469      0.0577902       4344    0.47116
	1:754063:G:T    -0.0579509      0.104104        4152    0.577788
	1:754105:C:T    -0.0293816      0.0579841       4350    0.612379
	1:754211:G:A    -0.0384934      0.105497        4136    0.715221
	1:754629:A:G    -0.0733465      0.106585        4147    0.491398
	1:759036:G:A    -0.0361858      0.0579107       4342    0.532099
	1:759884:C:A    0.0624291       0.082878        4252    0.451333
	1:765904:G:T    0.0725445       0.0893362       4262    0.416814
	1:767096:A:G    0.0287257       0.04053 4343    0.478517

The first column contains the Chromosome:Position. We now have to convert this information into a .bed format.

awk -v OFS='\t' 'NR!=1 {split($1,a,":"); print "chr"a[1],a[2]-1,a[2]}' sumstats.txt > query.bed

The NR!=1 option tells the awk command to ignore the first line. The split command splits the value of the first column (which is the Chromosome:Position) and stores it to variable a. Subtracting 1 from a[2] is related to the coordinate system of the .bed format. Read this link for more information.

Finally, we run bedmap which is a part of the bedops program.

bedmap --echo --echo-map-id --delim '\t' query.bed snp150_snp.bed > output.bed

This gives the desired result.

	chr1    729631  729632  rs116720794
	chr1    754062  754063  rs12184312
	chr1    754104  754105  rs12184325
	chr1    754210  754211  rs12184313
	chr1    754628  754629  rs10454459
	chr1    759035  759036  rs114525117
	chr1    759883  759884  rs188068004
	chr1    765903  765904  rs115541281
	chr1    767095  767096  rs115991721
	chr1    767812  767813  rs114066716