Please consider using variantcentrifuge instead, which offers similar functionality with improved maintainability, logging options, and ongoing support.
This repository contains scripts and documentation for filtering variant call format (VCF) files to identify rare genetic variants in genes of interest using a streamlined bash script, filter_variants.sh
.
The filter_variants.sh
script performs the following steps:
- Extract Genes of Interest: Uses
snpEff genes2bed
to produce a BED file containing the genes of interest. - Sort BED File: Sorts the generated BED file.
- Modify BED File: Adds "chr" prefix to the entries in the BED file if
add_chr
is set to true. - Extract Variants: Uses
bcftools
to extract the variants in the BED file from the VCF. - Filter for Rare Variants: Uses
SnpSift
to filter for rare variants based on the provided filter string. - Extract Fields of Interest: Uses
SnpSift
again to extract the specified fields of interest. - Modify Header: Removes the "ANN[0]" and "GEN[*]" prefixes from the header.
- Replace GT Values: Uses
replace_gt_with_sample.sh
to replace the GT values with the sample names. - Save Output: Saves the output to a specified file.
Example of the shell pipeline the script is composing:
snpEff genes2bed GRCh38.mane.1.0.refseq OFD1 | sortBed | awk '{print "chr"$0}' | bcftools view ann.dbnsfp.vcf.gz -R - | SnpSift -Xmx8g filter " (( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')) " | SnpSift -Xmx4g extractFields -s "," -e "NA" - CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT | sed -e '1s/ANN\[0\]\.//g; s/GEN\[\*\]\.//g' | ./replace_gt_with_sample.sh samples.txt 21 > OFD1_rare_variants.GCKD.tsv
./filter_variants.sh [--config config_file] <gene_name> <vcf_file_location> [reference] [add_chr] [filters] [fields_to_extract] [sample_file] [replace_script_location] [output_file]
--config config_file
: (Optional) The path to the configuration file containing default values for parameters.gene_name
: The name of the gene of interest, e.g., "BICC1".vcf_file_location
: The location of the VCF file.reference
: (Optional, default: "GRCh38.mane.1.0.refseq") The reference to use.add_chr
: (Optional, default: true) Whether or not to add "chr" to the chromosome name. Use "true" or "false".filters
: (Optional, default: Filters for rare and moderate/high impact variants) The filters to apply.fields_to_extract
: (Optional, default: Various fields including gene info, predictions, allele counts) The fields to extract.sample_file
: (Optional, default: "samples.txt") The path to the file containing the sample values to use for replacement.replace_script_location
: (Optional, default: "./replace_gt_with_sample.sh") The location of thereplace_gt_with_sample.sh
script.replace_script_options
: (Optional) Additional options to pass to thereplace_gt_with_sample.sh
script. This can be used to append genotype values to sample names for non-"0/0" genotypes.output_file
: (Optional, default: "variants.tsv") The name of the output file.
The script allows users to provide a configuration file containing default values for parameters. The configuration file is sourced if provided, and the values specified in it are used as defaults.
Example of a configuration file:
reference=GRCh38.mane.1.0.refseq
add_chr=true
filters=(( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))
fields_to_extract=CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT
sample_file=samples.txt
replace_script_location=./replace_gt_with_sample.sh
replace_script_options="--append-genotype"
output_file=variants.tsv
To generate the sample file from a multi-sample VCF, you can use the following command:
bcftools view -h /path/to/your_multi_sample.vcf.gz | awk -F' ' '{ for (i=10; i<=NF; ++i) printf "%s%s", $i, (i==NF ? RS : ",") }' > /path/to/samplefile.txt
- GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
- GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
- bcftools 1.17
- snpEff version SnpEff 5.1d (build 2022-04-19 15:49)
- SnpSift version 5.1d (build 2022-04-19 15:50)