Conversion between DNA analysis services
I’ve spent the last months trying to figure out how to use data from Whole Genome Sequencing on various DNA analysis websites. While the output of WGS gives you all the information given by sites like 23-and-me and more, wrangling it into the correct format is a non-trivial task.
After uncountable attempts, and many hours of CPU time, today I finally managed to produce an apparently correct file in 23-and-me format.
Disclaimer: I am not a biologist, so the information below might be inaccurate, misleading, or even incorrect.
Whole Genome Sequencing
Very briefly, the process of genome sequencing nowadays proceeds by
- creating multiple copies of the DNA using PCR
- slicing up the DNA in short chunks
- reading a brief part of each chunk (100-150 bases)
The output of this procedure is a set of so-called reads
, and is usually
saved in a format called FASTA (or FASTQ when a quality index for each basis is
provided).
Afterwards, the reads are aligned on a reference genome, i.e. for each read its most likely start and end position are determined (compared to a fixed assembly of the human genome). The DNA service that I used (DanteLabs) uses the human assembly 37 for this purpose.
The output of this procedure is typically a BAM file. This file contains all the information that was determined by the sequencing procedure.
The next step is Variant Calling, in which some algorithm determines what are the differences between the reference genome and the provided BAM file. The resulting file is a VCF (Variant Call Format) file. The most common type of differences are SNPs, Single Nucleotide Polymorphisms, and correspond to variations of a single DNA basis.
Some services provide you a VCF file, but sometimes it’s just a filtered one, i.e. with just a subset of the SNPs. Trying to overcome this limitation motivated this journey: extracting all the information from a BAM file to be able to construct a 23-and-me file.
First step: variant calling
After multiple attempts, I decided to use deepvariant for variant calling. Assuming that you are in the folder containing your BAM file, the following commands should run the variant calling procedure:
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
sudo docker run -v "/data:/data" -v "$(pwd)":"/input" -v "$(pwd):/output" \
google/deepvariant:"${BIN_VERSION}" /opt/deepvariant/bin/run_deepvariat \
--model_type=WGS --ref=/input/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
--reads=/input/YOUR_BAM_FILE.bam --output_vcf=/output/calls.vcf \
--output_gvcf=/output/calls.gvcf --num_shards=$(nproc) \
--intermediate_results_dir=/data/genome
This command assumes that you have a BAM file using the genome assembly 37, and
that your /data/
folder contains sufficient free space. This procedure will take
roughly a couple of days.
Second step: annotation
We now need to annotate the variants, assigning to each of them their names. This can be done as follows:
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz
tabix 00-All.vcf.gz
bgzip calls.vcf
tabix calls.vcf.gz
bcftools annotate -a 00-All.vcf.gz -c ID -Oz -o YOUR_NAME.annotated.vcf.gz calls.vcf.gz
Third step: conversion to 23-and-me format
This uses code provided at https://github.com/2sh/vcf-to-23andme.git. We will use a different template file, more up to date, which you can find here.
python vcf-to-23andme/data_to_db.py YOUR_NAME.annotated.vcf.gz vcf output_db.sql
python vcf-to-23andme/db_to_23.py output_db.sql 23andme_v5_blank.txt genome_YOUR_NAME_v5_Full_20200228144640.txt
This should be all!