
Identity By State variant streamer fails with write error #213

Open

pbilling opened this issue May 26, 2017 · 1 comment

pbilling commented May 26, 2017

I am trying to apply the IdentityByState pipeline to my variant data, but it reliably fails with a write error (4 out of 4 runs so far).

Error message:

(dd6b6f2b6ea510df): Workflow failed. Causes: (dd6b6f2b6ea5110a): S07:VariantStreamer/ParDo(RetrieveVariants)+VariantStreamer/ParDo(ConvergeVariantsList)+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/ParDo(BinVariants)+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/GroupByKey/Reify+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/GroupByKey/Write failed.

Command:

$ java -cp target/google-genomics-dataflow-v1-0.8-SNAPSHOT-runnable.jar com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--project=gbsc-gcp-project-mvp \
--variantSetId=17987177733120369382 \
--runner=BlockingDataflowPipelineRunner \
--stagingLocation=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/staging \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--output=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/result/17987177733120369382-n1820-ibs.tsv

I'm retrying now with the --hasNonVariantSegments flag removed, but this data was generated from gVCF files and processed with the non-variant-segment transformer, so it should contain non-variant segments.

I'm not really sure what this means or how I can go about debugging it. Any ideas are greatly appreciated!

deflaux (Contributor) commented May 30, 2017

With --hasNonVariantSegments, it performs the same merge used in https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation so that all the genotypes for each SNP site in the cohort are available as input to one of the similarity measures.
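To make that shuffle concrete, here is a minimal, hypothetical sketch of the bin-and-group stage named in the error trace (BinVariants followed by GroupByKey), written against the Dataflow 1.x SDK. The Variant class and method names are simplified stand-ins for illustration, not the pipeline's actual code:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class BinVariantsSketch {

  // Simplified stand-in for the pipeline's variant record.
  public static class Variant implements java.io.Serializable {
    final String referenceName;
    final long start;

    Variant(String referenceName, long start) {
      this.referenceName = referenceName;
      this.start = start;
    }
  }

  // Key each variant by the genomic bin it falls into, then group. After
  // GroupByKey, every record for a given bin is materialized on a single
  // worker, which is what allows the merge, and also what drives memory
  // pressure when a bin holds a lot of data.
  static PCollection<KV<String, Iterable<Variant>>> binAndGroup(
      PCollection<Variant> variants, final long binSize) {
    return variants
        .apply(ParDo.of(new DoFn<Variant, KV<String, Variant>>() {
          @Override
          public void processElement(ProcessContext c) {
            Variant v = c.element();
            String binKey = v.referenceName + ":" + (v.start / binSize);
            c.output(KV.of(binKey, v));
          }
        }))
        .apply(GroupByKey.<String, Variant>create());
  }
}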

I recommend clicking through to the detailed logs to see whether there is any more information there. The most likely issue is an OutOfMemory exception somewhere, because the merge operation needs to co-locate all data for a contiguous genomic region on a single machine to perform the merge. If that is the issue, try highmem machines. If you still see an OOM, then try smaller genomic regions by decreasing the value of --binSize.
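Concretely, a resubmission along these lines is a reasonable first attempt; --workerMachineType is the standard Dataflow worker machine type option, and the --binSize value below is only an illustrative starting point to tune downward if OOMs persist:

$ java -cp target/google-genomics-dataflow-v1-0.8-SNAPSHOT-runnable.jar com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--project=gbsc-gcp-project-mvp \
--variantSetId=17987177733120369382 \
--runner=BlockingDataflowPipelineRunner \
--stagingLocation=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/staging \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--workerMachineType=n1-highmem-8 \
--binSize=1000 \
--output=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/result/17987177733120369382-n1820-ibs.tsv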

This implementation of Identity-By-State reads from VariantStore. If it were updated to read from BigQuery instead, it could consume the result of https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation directly, so that the merge would not need to happen twice.
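For anyone exploring that change, a rough sketch of the BigQuery read with the Dataflow 1.x SDK might look like the following; the table name is a hypothetical placeholder for the codelab's output table:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadMergedVariantsSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read rows already merged by the variant-transformation codelab, so the
    // non-variant-segment merge does not have to run a second time here.
    // The table name is a placeholder, not a real table.
    PCollection<TableRow> rows = p.apply(
        BigQueryIO.Read
            .named("ReadMergedVariants")
            .from("my-project:my_dataset.platinum_genomes_expanded_variants"));

    // A downstream similarity computation would consume `rows` here.
    p.run();
  }
}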
