Skip to content

update_vcf

dytk2134 edited this page Sep 12, 2018 · 1 revision

Introduction

Convert all the coordinates, IDs and some attributes in a BAM file. Requires output from fasta_diff.

Variant Call Format (VCF)

  • The Variant Call Format Specification:
  • A valid VCF file is composed of three main parts:
    • meta-information lines(optional): prefixed with ##. If -ref is given, the script will update the meta-information lines below
      • contig
        • ##contig=<ID=ID,length=length>
        • ID=[new contig ID in the new reference genome fasta file]
        • length=[the length of contig in the new reference genome fasta file]
      • reference
        • ##reference=reference
        • reference=[the file name of the new reference genome fasta file]
    • header line: prefixed with #, names the 8 fixed, mandatory columns (would not be updated) - #CHROM POS ID REF ALT QUAL FILTER INFO
    • data lines: tab-delimited, 8 mandatory columns and unlimited custom defined column. The Data lines that multiple match or do not exactly match the alignment in alignment file will be removed. Currently, only CHROM and POS will be updated.
      • CHROM
        • will be the new id in the alignment file(output from fata_diff.py)
      • POS
        • will be updated by this formula: new POS = old POS - old start in alignment file + new start in alignment file
        • filter criteria(data line that do not match the criteria below will be removed):
          • old start in alignment file < old POS <= old end in alignment file
          • old start in alignment file < old POS + the length of old REF -1<= old end in alignment file

Example

update_vcf -a match.tsv -ref example_file/new.fa example_file/example.vcf

alignment file (match.tsv)

Scaffold4139    2368    8532    JHOM01041610.1  0       6164
Scaffold1688    0       390     KK246853.1      0       390
Scaffold1688    2775    4110    KK246853.1      2775    4110
Scaffold1688    4670    5814    KK246853.1      4670    5814
Scaffold1688    8337    8871    KK246853.1      8337    8871
Scaffold1688    10333   11477   KK246853.1      10333   11477

VCF file

  • meta-information lines

original header

##contig=<ID=Scaffold1687,length=41690>
##contig=<ID=Scaffold1688,length=11477>
##contig=<ID=Scaffold1689,length=8196>
##reference=old_reference.fa

updated header

##contig=<length=41690,ID=KK246852.1>
##contig=<length=11477,ID=KK246853.1>
##contig=<length=8196,ID=KK246854.1>
##reference=new_reference.fa
  • data lines

original data lines

Scaffold1688    11067   .       G       A       52.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=26.37;SOR=2.303 GT:AD:DP:GQ:PL  1/1:0,2:2:6:80,6,0
Scaffold4139    57      .       A       T       56.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=28.37;SOR=2.303 GT:AD:DP:GQ:PL  1/1:0,2:2:6:84,6,0
Scaffold4139    3359    .       A       T       377.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.469;ClippingRankSum=0.000;DP=14;ExcessHet=3.0103;FS=3.256;MLEAC=1;MLEAF=0.500;MQ=50.00;MQRankSum=0.000;QD=26.98;ReadPosRankSum=1.736;SOR=0.105       GT:AD:DP:GQ:PL  0/1:2,12:14:48:406,0,48

updated data lines

KK246853.1      11067   .       G       A       52.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=26.37;SOR=2.303 GT:AD:DP:GQ:PL  1/1:0,2:2:6:80,6,0
JHOM01041610.1  991     .       A       T       377.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.469;ClippingRankSum=0.000;DP=14;ExcessHet=3.0103;FS=3.256;MLEAC=1;MLEAF=0.500;MQ=50.00;MQRankSum=0.000;QD=26.98;ReadPosRankSum=1.736;SOR=0.105       GT:AD:DP:GQ:PL  0/1:2,12:14:48:406,0,48

removed data lines

Scaffold4139    57      .       A       T       56.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=28.37;SOR=2.303 GT:AD:DP:GQ:PL  1/1:0,2:2:6:84,6,0

Running the program with –h prints the following help:

update_vcf -h

usage: update_vcf [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
                  [-r REMOVED_POSTFIX] [-ref REFERENCE] [-v]
                  VCF_FILE [VCF_FILE ...]

Update the sequence id and coordinates of a VCF file using an alignment file generated by the fasta_diff program.
Updated features are written to a new file with '_updated'(default) appended to the original VCF file name.
Feature that can not be updated, due to the id being removed completely or the feature contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original VCF file name.
Example:
    fasta_diff example_file/old.fa example_file/new.fa | update_vcf -ref example_file/new.fa example_file/example.vcf

positional arguments:
  VCF_FILE              List one or more VCF files to be updated

optional arguments:
  -h, --help            show this help message and exit
  -a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        The alignment file generated by fasta_diff, a TSV file
                        with 6 columns: old_id, old_start, old_end, new_id,
                        new_start, new_end (default: STDIN)
  -u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
                        The filename postfix for updated features (default:
                        "_updated")
  -r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
                        The filename postfix for removed features (default:
                        "_removed")
  -ref REFERENCE, --reference REFERENCE
                        The new reference genome fasta file
  -v, --version         show program's version number and exit