Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TribbleIndexedFeatureReader throws "TribbleException: there aren't enough columns .." on large files unless there's a tabix index #1168

Open
bw2 opened this issue Jul 19, 2018 · 2 comments

Comments

@bw2
Copy link
Contributor

bw2 commented Jul 19, 2018

Verify

This is the same Issue as
broadinstitute/gatk#4761

I got the error below when running picard liftover on the bravo-dbsnp-all.vcf.gz (GRCh38 => GRCh37).

I'm reposting this because, even though the user closed broadinstitute/gatk#4761 after they saw that bgzipping + tabixing the input vcf avoids the error, I think there are 2 issues which may not have been fixed yet(?)

  1. if TribbleIndexedFeatureReader stops in the middle of a line, it causes a cryptic parsing error that's hard to connect to lack of tabix index
  2. in the rare event that it stops at the end of a line, it would presumably silently continue and result in truncated output that users may not notice, and that's even harder to debug.
Exception in thread "main" htsjdk.tribble.TribbleException: Line 998904: there aren't enough columns for line 1	4723990	TOPMed_freeze_5?chr1:4,723,9 (we expected 9 tokens, and saw 3 ), for input source: file:///Users/weisburd/code/bravo/bravo-dbsnp-all.removed_chr_prefix.vcf.gz
	at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
	at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:324)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Subject of the issue

The full logs are:

++ i=
++ java -Xmx50G -jar picard.jar LiftoverVcf R=hg19.fa C=hg38ToHg19.over.chain.remove_chr_prefix.gz I=bravo-dbsnp-all.removed_chr_prefix.vcf.gz O=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz TMP_DIR=./tmp VALIDATION_STRINGENCY=LENIENT WARN_ON_MISSING_CONTIG=true WRITE_ORIGINAL_POSITION=true
++ tee .log
09:33:54.465 INFO  NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/weisburd/code/bravo/picard.jar!/com/intel/gkl/native/libgkl_compression.dylib
[Thu Jul 19 09:33:54 EDT 2018] LiftoverVcf INPUT=bravo-dbsnp-all.removed_chr_prefix.vcf.gz OUTPUT=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz CHAIN=hg38ToHg19.over.chain.remove_chr_prefix.gz REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz WARN_ON_MISSING_CONTIG=true WRITE_ORIGINAL_POSITION=true TMP_DIR=[./tmp] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=hg19.fa    LOG_FAILED_INTERVALS=true LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false TAGS_TO_REVERSE=[AF] TAGS_TO_DROP=[MAX_AF] VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jul 19 09:33:54 EDT 2018] Executing as weisburd@wm4d1-03b on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12; Deflater: Intel; Inflater: Intel; Picard version: 2.17.4-1-gb63d73e-SNAPSHOT
INFO	2018-07-19 09:33:54	LiftoverVcf	Loading up the target reference genome.
INFO	2018-07-19 09:34:04	LiftoverVcf	Lifting variants over and sorting (not yet writing the output file.)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180127-180133 failed to match chain 816 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180221-180227 failed to match chain 816 because intersection length 2 < minMatchSize 7.0 (0.2857143 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180221-180227 failed to match chain 977 because intersection length 5 < minMatchSize 7.0 (0.71428573 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180801-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 8.0 (0.375 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180801-180808 failed to match chain 2410 because intersection length 3 < minMatchSize 8.0 (0.375 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180802-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 7.0 (0.42857143 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180802-180808 failed to match chain 2410 because intersection length 2 < minMatchSize 7.0 (0.2857143 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180804 failed to match chain 2410 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180808 failed to match chain 23749811 because intersection length 3 < minMatchSize 6.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180808 failed to match chain 2410 because intersection length 1 < minMatchSize 6.0 (0.16666667 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180809 failed to match chain 23749811 because intersection length 4 < minMatchSize 7.0 (0.5714286 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180803-180809 failed to match chain 2410 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180804-180806 failed to match chain 23749811 because intersection length 1 < minMatchSize 3.0 (0.33333334 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180869 failed to match chain 7666873 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180869 failed to match chain 23749811 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180875 failed to match chain 7666873 because intersection length 7 < minMatchSize 13.0 (0.53846157 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180863-180875 failed to match chain 23749811 because intersection length 6 < minMatchSize 13.0 (0.46153846 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180869 failed to match chain 7666873 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180869 failed to match chain 23749811 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180870 failed to match chain 7666873 because intersection length 2 < minMatchSize 3.0 (0.6666667 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180870 failed to match chain 23749811 because intersection length 1 < minMatchSize 3.0 (0.33333334 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180874 failed to match chain 7666873 because intersection length 6 < minMatchSize 7.0 (0.85714287 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180874 failed to match chain 23749811 because intersection length 1 < minMatchSize 7.0 (0.14285715 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180880 failed to match chain 7666873 because intersection length 12 < minMatchSize 13.0 (0.9230769 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180868-180880 failed to match chain 23749811 because intersection length 1 < minMatchSize 13.0 (0.07692308 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180993-181004 failed to match chain 2576 because intersection length 9 < minMatchSize 12.0 (0.75 < 1.0)
INFO	2018-07-19 09:34:06	LiftOver	Interval 1:180993-181004 failed to match chain 2410 
...
...
...
INFO	2018-07-19 09:34:13	LiftOver	Interval 1:1663380-1663401 failed to match chain 2 because intersection length 1 < minMatchSize 22.0 (0.045454547 < 1.0)
INFO	2018-07-19 09:34:13	LiftOver	Interval 1:1663399-1663409 failed to match chain 2 because intersection length 7 < minMatchSize 11.0 (0.6363636 < 1.0)
INFO	2018-07-19 09:34:22	LiftOver	Interval 1:2767665-2767666 failed to match chain 4680 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO	2018-07-19 09:34:22	LiftOver	Interval 1:2778770-2778774 failed to match chain 2 because intersection length 3 < minMatchSize 5.0 (0.6 < 1.0)
INFO	2018-07-19 09:34:22	LiftOver	Interval 1:2778772-2778776 failed to match chain 2 because intersection length 1 < minMatchSize 5.0 (0.2 < 1.0)


INFO	2018-07-19 09:34:23	LiftOver	Interval 1:2970389-2970390 failed to match chain 2 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
[Thu Jul 19 09:34:30 EDT 2018] picard.vcf.LiftoverVcf done. Elapsed time: 0.60 minutes.
Runtime.totalMemory()=6373244928
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: Line 998904: there aren't enough columns for line 1	4723990	TOPMed_freeze_5?chr1:4,723,9 (we expected 9 tokens, and saw 3 ), for input source: file:///Users/weisburd/code/bravo/bravo-dbsnp-all.removed_chr_prefix.vcf.gz
	at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:281)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:64)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
	at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:365)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:346)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:307)
	at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:324)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

Your environment

 $ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

$ java  -Xmx50G -jar picard.jar LiftoverVcf --version
2.17.4-1-gb63d73e-SNAPSHOT

happens on latest MacOS and CentOS

Steps to reproduce

Tell us how to reproduce this issue. If possible, include a short code snippet to demonstrate the problem.

Run

java  -Xmx50G -jar picard.jar LiftoverVcf \
    R=hg19.fa \
    C=hg38ToHg19.over.chain.remove_chr_prefix.gz \
    I=bravo-dbsnp-all.removed_chr_prefix.vcf.gz \
    O=bravo-dbsnp-all.removed_chr_prefix.liftunder_grch19.vcf.gz \
    REJECT=bravo-dbsnp-all.removed_chr_prefix.rejected.vcf.gz \
    TMP_DIR=./tmp \
    VALIDATION_STRINGENCY=LENIENT \
    WARN_ON_MISSING_CONTIG=true \
    WRITE_ORIGINAL_POSITION=true |& tee ${i}.log

on the vcf from https://bravo.sph.umich.edu/freeze5/hg38/download

Actual behaviour

I haven't looked through the htsjdk code, but it looks like if there's no .tbi index, the TribbleReader will return lines until it hits some threshold. It then stops & returns incomplete data.
I double checked the input .vcf with a python script and confirmed there's nothing unusual about the line it errored on - the file itself isn't corrupt/truncated.

Generating a tabix index prevents the error.

@magicDGS
Copy link
Member

I believe that this is a picard issue and should be reported there. The gatk issue that you mention is closed because it was already fixed with the update of htsjdk (version >= 2.14.2 have the fix).

You are using a snapshot of picard (concretely, version 2.17.4-1-gb63d73e-SNAPSHOT, as reported in your logs) and the htsjdk version was updated with 2.17.8. Updating picard should fix this problem. I recommend to use the latest version (2.18.9 at this moment).

If the problem persists after update it, please report it here.

@kliarist
Copy link

kliarist commented Jan 3, 2022

Has there been any update for the issue ? I have recently tried using picard (Liftover) to convert hg38ToHg19 and was facing similar issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants