The VEP cache extractor is a command-line utility for extracting fields from the cache used by Ensembl's Variant Effect Predictor (VEP) tool. The cache itself is a large collection of Perl Storable files, each containing a large list of transcript entries.
The script requires a modern version of Perl 5 to run. You probably already have one if you are running VEP, so this is unlikely to be an issue.
To install the script's dependencies, make sure cpanm is installed with
cpan App::cpanminus
and then install the dependencies using
cpanm --installdeps .
If you are using a system version of Perl, running these commands may give you an error message along these lines:
Can't write to /usr/local/share/perl/5.22.2 and /usr/local/bin
If this is the case, run these commands with sudo.
You'll also need a copy of the VEP cache itself. If you already use VEP, you may already have one; in that case, simply pass the cache directory as the first argument to the extract script. If not, you can use the downloader script to obtain it, and use vep_cache (a relative path) as the cache path (documented below).
The downloader script, download.sh, downloads a chosen version of the VEP cache into the repository directory (the same directory as the script). Use it as follows:
Usage: download.sh -c cache_type -e ensembl_release -g genome-build
-c, --cache-type
The version of the cache to download. Either 'merged', 'ensembl' or 'refseq'
-e, --ensembl-release
The ensembl release number to download the cache for. e.g. 75, 85 etc.
-g, --genome-build
The grch build version to download for. e.g. 37, 38
Use the extract script as follows:
./extract.pl /path/to/cache path_1:column_1 path_2:column_2
In other words, the script's first argument is the path of the directory where the cache is located. The second and all following arguments are descriptors indicating which fields you want to extract from the cache and what to name them. Each descriptor is a single path:column_name pair. For example, the second argument might be _trans_exon_array.0.stable_id:exon_id
(both parts are explained below)
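To make the descriptor format concrete, here is a minimal Python sketch of how one descriptor could be interpreted. It is an illustration only, not the code in extract.pl: the function names split_descriptor and resolve_path, the sample records, and the choice to split at the last colon are all assumptions made for this example. Quoting each value with json.dumps is likewise just one plausible way to produce quoted output.

```python
import json


def split_descriptor(descriptor):
    """Split 'path:column_name' into its two parts.

    Splitting at the last colon is an assumption for this sketch.
    """
    path, _, column_name = descriptor.rpartition(":")
    return path, column_name


def resolve_path(record, path):
    """Walk a dot-separated path through nested dicts and lists.

    Segments are used as dict keys, or as integer indices when the
    current value is a list.
    """
    current = record
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current[segment]
    return current


# Hypothetical transcript records, not real cache contents.
transcripts = [
    {"_trans_exon_array": [{"stable_id": "exon_a1"}, {"stable_id": "exon_a2"}]},
    {"_trans_exon_array": [{"stable_id": "exon_b1"}]},
]

path, column_name = split_descriptor("_trans_exon_array.0.stable_id:exon_id")
rows = [json.dumps(resolve_path(t, path)) for t in transcripts]

print(column_name)
for row in rows:
    print(row)
```

Each transcript contributes one row containing the value found at the path, under a header taken from the column name.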
path
: A dot-separated string in which each segment is a hash key or array index, indicating which field to choose. For example, "_trans_exon_array.0.stable_id" would choose the stable_id of the first exon in the transcript, which in JSON terms means extracting transcript["_trans_exon_array"][0]["stable_id"] for each transcript. You might find it helpful to look at the sample.json file in the repository, as this gives some indication of which fields are stored in the VEP cache.
column_name
: A string indicating the name of the column to store this data in. For example, if we used the path above, along with the column_name of "exon_id", we would get output as follows:
exon_id
"id67641"
"id67642"
"NM_152663.3.1"
"XM_005245297.1.1"
"id67643"
"id67663"
"ENSESTE00000219503"
"ENSESTE00000220088"