
Identity By State variant streamer fails with write error #213

Open

pbilling opened this issue May 26, 2017 · 1 comment

pbilling commented May 26, 2017

I am trying to apply the IdentityByState pipeline to my variant data, but it reliably fails with a write error (4 out of 4 runs so far).

Error message:

(dd6b6f2b6ea510df): Workflow failed. Causes: (dd6b6f2b6ea5110a): S07:VariantStreamer/ParDo(RetrieveVariants)+VariantStreamer/ParDo(ConvergeVariantsList)+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/ParDo(BinVariants)+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/GroupByKey/Reify+JoinNonVariantSegmentsWithVariants.BinShuffleAndCombineTransform/GroupByKey/Write failed.

Command:

$ java -cp target/google-genomics-dataflow-v1-0.8-SNAPSHOT-runnable.jar com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--project=gbsc-gcp-project-mvp \
--variantSetId=17987177733120369382 \
--runner=BlockingDataflowPipelineRunner \
--stagingLocation=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/staging \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--output=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/result/17987177733120369382-n1820-ibs.tsv

I'm retrying now with the --hasNonVariantSegments flag removed, but this data was generated from gVCF files and processed with the non-variant-segment transformer, so it should contain non-variant segments.

I'm not really sure what this means or how I can go about debugging it. Any ideas are greatly appreciated!

deflaux (Contributor) commented May 30, 2017

With --hasNonVariantSegments, it performs the same merge used in https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation so that all the genotypes for each SNP site in the cohort are available as input to one of the similarity measures.
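To make that shuffle concrete, here is a minimal, hypothetical sketch of the bin-and-group stage named in the error trace (BinVariants followed by GroupByKey), written against the Dataflow 1.x SDK. The Variant class and method names are simplified stand-ins for illustration, not the pipeline's actual code:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class BinVariantsSketch {

  // Simplified stand-in for the pipeline's variant record.
  public static class Variant implements java.io.Serializable {
    final String referenceName;
    final long start;

    Variant(String referenceName, long start) {
      this.referenceName = referenceName;
      this.start = start;
    }
  }

  // Key each variant by the genomic bin it falls into, then group. After
  // GroupByKey, every record for a given bin is materialized on a single
  // worker, which is what allows the merge, and also what drives memory
  // pressure when a bin holds a lot of data.
  static PCollection<KV<String, Iterable<Variant>>> binAndGroup(
      PCollection<Variant> variants, final long binSize) {
    return variants
        .apply(ParDo.of(new DoFn<Variant, KV<String, Variant>>() {
          @Override
          public void processElement(ProcessContext c) {
            Variant v = c.element();
            String binKey = v.referenceName + ":" + (v.start / binSize);
            c.output(KV.of(binKey, v));
          }
        }))
        .apply(GroupByKey.<String, Variant>create());
  }
}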

I recommend clicking through to the detailed logs to see whether there is any more information there. The most likely issue is an OutOfMemory exception somewhere, because the merge operation needs to co-locate all data for a contiguous genomic region on a single machine to perform the merge. If that is the issue, try highmem machines. If you still see an OOM, then try smaller genomic regions by decreasing the value of --binSize.
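Concretely, a resubmission along these lines is a reasonable first attempt; --workerMachineType is the standard Dataflow worker machine type option, and the --binSize value below is only an illustrative starting point to tune downward if OOMs persist:

$ java -cp target/google-genomics-dataflow-v1-0.8-SNAPSHOT-runnable.jar com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--project=gbsc-gcp-project-mvp \
--variantSetId=17987177733120369382 \
--runner=BlockingDataflowPipelineRunner \
--stagingLocation=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/staging \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--workerMachineType=n1-highmem-8 \
--binSize=1000 \
--output=gs://gbsc-gcp-project-mvp-group/test/dataflow-java/ibs/mvp-phase-2/result/17987177733120369382-n1820-ibs.tsv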

This implementation of Identity-By-State reads from VariantStore. If it were updated to read from BigQuery instead, it could consume the result of https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation directly, so that the merge would not need to happen twice.
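For anyone exploring that change, a rough sketch of the BigQuery read with the Dataflow 1.x SDK might look like the following; the table name is a hypothetical placeholder for the codelab's output table:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadMergedVariantsSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read rows already merged by the variant-transformation codelab, so the
    // non-variant-segment merge does not have to run a second time here.
    // The table name is a placeholder, not a real table.
    PCollection<TableRow> rows = p.apply(
        BigQueryIO.Read
            .named("ReadMergedVariants")
            .from("my-project:my_dataset.platinum_genomes_expanded_variants"));

    // A downstream similarity computation would consume `rows` here.
    p.run();
  }
}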
