WARNING: Not actively maintained!
This tutorial uses VEP to annotate Platinum Genomes variants called by DeepVariant and aligned to build GRCh38.
The BigQuery Web UI and dsub are used to run all of these steps in the cloud.
A query can be used to export variants from BigQuery variants tables in a format that VEP can take as input.
- The specific query below will export variants for all members of the Platinum Genomes cohort in "ensembl" format.
- To use variants from your own table, just change the table name.
- To work with a subset of genomes or just one genome, modify the query to filter by call set name.
Run this query using the BigQuery Web UI and materialize the result to a table in your project.
#standardSQL
--
-- Extract the information needed to annotate variants with VEP.
-- This is small test on just the variants in chromosome 22.
--
WITH
variants AS (
SELECT
reference_name AS chrom,
start + 1 AS pos, -- convert to 1-based coordinates
`end`,
reference_bases AS ref,
alt
FROM
`genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls`,
UNNEST(alternate_bases) AS alt -- flatten multiallelic sites
WHERE
-- Include only sites of variation (exclude non-variant segments).
alt IS NOT NULL AND alt NOT IN ("<NON_REF>", "<*>")
-- Remove this line to include the entire genome.
AND reference_name IN ('chr22', '22')
)
SELECT
-- http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input
CONCAT(
chrom, '\t',
CAST(pos AS STRING), '\t',
CAST(`end` AS STRING), '\t',
ref, '/', alt, '\t',
'+') AS vep_input
FROM
variants
Export the contents of
the table created in the prior step as a gzipped CSV file to Cloud Storage (for
example to path gs://your-bucket-name/platinum-genomes-grch38-chr22.csv.gz
).
Run run_vep_remote.sh to launch dsub jobs that will use the VEP Docker container to run VEP on the variants, writing the resulting annotations as a BigQuery table.
- Run
./run_vep_remote.sh --help
for additional documentation on its command line parameters.
# The Google Cloud Platform project id in which the docker containers
# are stored.
PROJECT_ID=your-project-id
# The bucket name (with the gs:// prefix) to hold temp files and logs.
BUCKET=gs://your-bucket-name
# The full path to the VEP input file.
INPUT_FILE=${BUCKET}/platinum-genomes-grch38-chr22.csv.gz
# Kick off annotation.
./run_vep_remote.sh \
--project_id ${PROJECT_ID} \
--bucket ${BUCKET}/temp \
--docker_image gcr.io/${PROJECT_ID}/vep_grch38 \
--table_name platinum_genomes_grch38_chr22_annotations \
--shards_per_file 10 \
${INPUT_FILE}
Use the following query to do some basic checks on the annotations. The query
below assumes the annotations are in table
vep.platinum_genomes_grch38_chr22_annotations
in your project.
#standardSQL
--
-- Count the number of variants per chromosome for both the variants and
-- VEP output table. This basic QC metric ensures that every chromosome
-- we expected successfully completed.
--
SELECT chrom, variants_count, vep_count
FROM (
SELECT LTRIM(reference_name, "chr") AS chrom, COUNT(1) AS variants_count
FROM
`genomics-public-data.platinum_genomes_deepvariant.single_sample_genome_calls`,
UNNEST(alternate_bases) AS alt -- flatten multi allelic sites
WHERE
-- Include only sites of variantion (exclude non-variant segments).
alt IS NOT NULL AND alt NOT IN ("<NON_REF>", "<*>")
GROUP BY reference_name)
FULL JOIN (
SELECT LTRIM(seq_region_name, "chr") AS chrom, COUNT(1) AS vep_count
FROM `vep.platinum_genomes_grch38_chr22_annotations`
GROUP BY seq_region_name)
USING (chrom)
WHERE vep_count IS NOT NULL
ORDER BY chrom
We expect a result of 210,279 variants and VEP annotations for chromosome 22.
The annotations table created by VEP is a little different than the variant tables:
- VEP uses 1-based coordinates whereas the variants tables use 0-based coordinates per GA4GH.
- VEP rewrites some of the fields. For example field
start
excludes the first base for a deletion.
Run a query like the following to reshape the annotations data and materialize the result to a new table.
#standardSQL
--
-- Add additional columns to the VEP table to facilitate easier JOINs
-- with variant tables.
--
SELECT
SPLIT(input, "\t")[OFFSET(0)] AS reference_name,
CAST(SPLIT(input, "\t")[OFFSET(1)] AS INT64) - 1 AS start_0_based_coords,
SPLIT(SPLIT(input, "\t")[OFFSET(3)], '/')[OFFSET(0)] AS reference_bases,
SPLIT(SPLIT(input, "\t")[OFFSET(3)], '/')[OFFSET(1)] AS alternate_bases,
*
FROM
`vep.platinum_genomes_grch38_chr22_annotations`