-
Notifications
You must be signed in to change notification settings - Fork 2
update_vcf
dytk2134 edited this page Sep 12, 2018
·
1 revision
Convert all the coordinates, IDs and some attributes in a BAM file. Requires output from fasta_diff.
- The Variant Call Format Specification:
- A valid VCF file is composed of three main parts:
- meta-information lines(optional): prefixed with ##. If
-ref
is given, the script will update the meta-information lines below- contig
##contig=<ID=ID,length=length>
- ID=[new contig ID in the new reference genome fasta file]
- length=[the length of contig in the new reference genome fasta file]
- reference
##reference=reference
- reference=[the file name of the new reference genome fasta file]
- contig
- header line: prefixed with #, names the 8 fixed, mandatory columns (would not be updated)
-
#CHROM POS ID REF ALT QUAL FILTER INFO
- data lines: tab-delimited, 8 mandatory columns and unlimited custom defined column. The Data lines that multiple match or do not exactly match the alignment in alignment file will be removed. Currently, only
CHROM
andPOS
will be updated.- CHROM
- will be the new id in the alignment file(output from fata_diff.py)
- POS
- will be updated by this formula:
new POS = old POS - old start in alignment file + new start in alignment file
- filter criteria(data line that do not match the criteria below will be removed):
- old start in alignment file < old POS <= old end in alignment file
- old start in alignment file < old POS + the length of old REF -1<= old end in alignment file
- will be updated by this formula:
- CHROM
- meta-information lines(optional): prefixed with ##. If
update_vcf -a match.tsv -ref example_file/new.fa example_file/example.vcf
Scaffold4139 2368 8532 JHOM01041610.1 0 6164
Scaffold1688 0 390 KK246853.1 0 390
Scaffold1688 2775 4110 KK246853.1 2775 4110
Scaffold1688 4670 5814 KK246853.1 4670 5814
Scaffold1688 8337 8871 KK246853.1 8337 8871
Scaffold1688 10333 11477 KK246853.1 10333 11477
- meta-information lines
original header
##contig=<ID=Scaffold1687,length=41690>
##contig=<ID=Scaffold1688,length=11477>
##contig=<ID=Scaffold1689,length=8196>
##reference=old_reference.fa
updated header
##contig=<length=41690,ID=KK246852.1>
##contig=<length=11477,ID=KK246853.1>
##contig=<length=8196,ID=KK246854.1>
##reference=new_reference.fa
- data lines
original data lines
Scaffold1688 11067 . G A 52.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=26.37;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:80,6,0
Scaffold4139 57 . A T 56.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=28.37;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:84,6,0
Scaffold4139 3359 . A T 377.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.469;ClippingRankSum=0.000;DP=14;ExcessHet=3.0103;FS=3.256;MLEAC=1;MLEAF=0.500;MQ=50.00;MQRankSum=0.000;QD=26.98;ReadPosRankSum=1.736;SOR=0.105 GT:AD:DP:GQ:PL 0/1:2,12:14:48:406,0,48
updated data lines
KK246853.1 11067 . G A 52.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=26.37;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:80,6,0
JHOM01041610.1 991 . A T 377.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.469;ClippingRankSum=0.000;DP=14;ExcessHet=3.0103;FS=3.256;MLEAC=1;MLEAF=0.500;MQ=50.00;MQRankSum=0.000;QD=26.98;ReadPosRankSum=1.736;SOR=0.105 GT:AD:DP:GQ:PL 0/1:2,12:14:48:406,0,48
removed data lines
Scaffold4139 57 . A T 56.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;QD=28.37;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:84,6,0
update_vcf -h
usage: update_vcf [-h] [-a ALIGNMENT_FILE] [-u UPDATED_POSTFIX]
[-r REMOVED_POSTFIX] [-ref REFERENCE] [-v]
VCF_FILE [VCF_FILE ...]
Update the sequence id and coordinates of a VCF file using an alignment file generated by the fasta_diff program.
Updated features are written to a new file with '_updated'(default) appended to the original VCF file name.
Feature that can not be updated, due to the id being removed completely or the feature contains regions that
are removed or replaced with Ns, are written to a new file with '_removed'(default) appended to the original VCF file name.
Example:
fasta_diff example_file/old.fa example_file/new.fa | update_vcf -ref example_file/new.fa example_file/example.vcf
positional arguments:
VCF_FILE List one or more VCF files to be updated
optional arguments:
-h, --help show this help message and exit
-a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
The alignment file generated by fasta_diff, a TSV file
with 6 columns: old_id, old_start, old_end, new_id,
new_start, new_end (default: STDIN)
-u UPDATED_POSTFIX, --updated_postfix UPDATED_POSTFIX
The filename postfix for updated features (default:
"_updated")
-r REMOVED_POSTFIX, --removed_postfix REMOVED_POSTFIX
The filename postfix for removed features (default:
"_removed")
-ref REFERENCE, --reference REFERENCE
The new reference genome fasta file
-v, --version show program's version number and exit