Multiple CCS.bam file #1

Closed
wyim-pgl opened this issue Feb 16, 2018 · 11 comments

@wyim-pgl

Dear Kristoffer,

Hello,
I am trying to use IsoCon for my transcriptome.
We have 30 cells to analyze, and my flnc file was generated from all 30 cells.
Is it okay to use a BAM file merged with samtools? Or with bax2bam?

Thank you.

Won

@ksahlin
Owner

ksahlin commented Feb 16, 2018

Hi Won,

If you have the ccs reads in separate “*.ccs.bam” files per cell, it should be okay to simply merge them with samtools. The important thing is that all the reads in the flnc fasta file are also found in the bam file.

However, is your Iso-Seq dataset targeted or not? IsoCon is designed for use with targeted Iso-Seq sequencing. If you have a non-targeted dataset, the algorithm will likely not scale (in runtime), since IsoCon uses an alignment strategy that is optimized for highly similar sequences. There is a way to control this (set a low value for --neighbor_search_depth, e.g. --neighbor_search_depth 1000 or lower), but it will likely affect the quality of the output. We are currently working on an approach for non-targeted data that uses many of IsoCon's ideas, and I hope to release this repository soon.

Best,
Kristoffer
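
For anyone following along, here is a minimal Python sketch of the consistency check described above: every read in the flnc fasta should also be present in the (merged) ccs BAM. It is not part of IsoCon. The helper names `zmw_key` and `missing_from_bam` are hypothetical, and the sketch assumes that flnc headers look like `movie/zmw/start_end_CCS` while BAM read names look like `movie/zmw/ccs`, so reads are compared on the `movie/zmw` prefix only. Adjust this if your names differ.

```python
def zmw_key(read_name):
    """Reduce a read name to 'movie/zmw' (ASSUMED naming scheme, see lead-in)."""
    # Drop a leading '>' (FASTA header) and anything after the first whitespace.
    fields = read_name.lstrip(">").split()
    name = fields[0] if fields else ""
    return "/".join(name.split("/")[:2])


def missing_from_bam(fasta_lines, bam_read_names):
    """Return the flnc read names whose movie/zmw key is absent from the BAM."""
    bam_keys = {zmw_key(name) for name in bam_read_names if name.strip()}
    return [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip() and zmw_key(line) not in bam_keys
    ]


if __name__ == "__main__":
    # Toy data only; real names would come from the flnc fasta and the BAM.
    flnc = [">m1/10/30_1500_CCS strand=+", "ACGT", ">m1/99/5_900_CCS", "GGCC"]
    bam_names = ["m1/10/ccs", "m2/7/ccs"]
    print(missing_from_bam(flnc, bam_names))
```

The BAM read names could be collected with something like `samtools view merged.ccs.bam | cut -f1` after merging the per-cell files with `samtools merge`. An empty result from `missing_from_bam` means every flnc read has a counterpart in the BAM.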

@wyim-pgl
Author

Dear Kristoffer,

Thank you for your comment.
This dataset is NOT targeted. Our species is just polyploid.
I will use the --neighbor_search_depth option to reduce the runtime.
Does the BAM file need special pulse features, such as DeletionQV, DeletionTag, InsertionQV, IPD, MergeQV?

Cheers,

Won

@ksahlin
Owner

ksahlin commented Feb 16, 2018

Hi,

No, IsoCon does not need the pulse features, it only needs the quality values that were generated for the CCS reads, i.e., the ccs bamfile should be the output generated by the tool ccs.

OK, good to know about the nontargeted data. I will definitely let you know when we have the nontargeted approach ready. Non-targeted data has more variable cut points at the ends of transcripts, and this can cause some redundancy in IsoCon. There is a parameter for that as well, --ignore_ends_len, which we set to a default value of 15 for targeted data. It is possible that ends have higher variability in non-targeted data, so the value should therefore be increased (with the obvious downside that reads from two genuinely different isoforms that differ only at their ends may then be treated as one). I don't have any data on this variability for a good estimate though, maybe 30-50 or so.
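
To make the two options discussed so far concrete, here is a small Python sketch that assembles an IsoCon command line with `--neighbor_search_depth` and `--ignore_ends_len`. The two flags are the ones named in this thread. The base invocation (`IsoCon pipeline -fl_reads ... -outfolder ... --ccs ...`) is an assumption about the CLI and should be checked against `IsoCon pipeline --help`. The helper name `isocon_command` and all file names are hypothetical.

```python
import shlex


def isocon_command(flnc_fasta, ccs_bam, outfolder,
                   neighbor_search_depth=None, ignore_ends_len=None):
    """Build an argv list for IsoCon; nothing is executed here."""
    # ASSUMPTION: base invocation, verify with `IsoCon pipeline --help`.
    argv = ["IsoCon", "pipeline", "-fl_reads", flnc_fasta,
            "-outfolder", outfolder, "--ccs", ccs_bam]
    # Lower values reduce runtime but may reduce output quality (per this thread).
    if neighbor_search_depth is not None:
        argv += ["--neighbor_search_depth", str(neighbor_search_depth)]
    # Default mentioned in this thread is 15; 30-50 was suggested as a rough guess.
    if ignore_ends_len is not None:
        argv += ["--ignore_ends_len", str(ignore_ends_len)]
    return argv


if __name__ == "__main__":
    cmd = isocon_command("flnc.fasta", "merged.ccs.bam", "isocon_out",
                         neighbor_search_depth=1000, ignore_ends_len=30)
    # Print a shell-safe string to copy into a terminal or job script.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list instead of a raw string keeps paths with spaces intact if the list is later passed to `subprocess.run`.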

@wyim-pgl
Author

wyim-pgl commented Feb 16, 2018

Thanks!

Is it okay to convert h5 to BAM with bax2bam?

Does it need a .pbi file as well as a .bai?

Also, does the BAM file need to be sorted?

I will use the --ignore_ends_len option.

It looks like it runs faster with --neighbor_search_depth.

Regards,
Won

@ksahlin
Owner

ksahlin commented Feb 16, 2018

Yes, that is what I've been using, namely: bax2bam {hdf5_path}/*bax.h5 -o {out}. Then, for the ccs tool, we have been using the commands (based on recommended settings):

ccs --numThreads=64 --polish --minLength=10 --minPasses=1 --minZScore=-999 --maxDropFraction=0.8 --minPredictedAccuracy=0.8 --minSnr=4 {input.bam_subreads} {output.ccs_bam}

The commands were taken from the snakemake file in our evaluation repository, lines 180 and 196.

No, it does not need to be sorted. The default output from ccs works.
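
For a multi-cell run such as the 30 cells in this issue, the steps above could be scripted roughly as follows. This Python sketch only generates the command strings; it does not run anything. The `bax2bam` and `ccs` invocations are copied from the commands quoted above, while the helper name `per_cell_commands`, the directory layout, and the output file names (including the assumption that `bax2bam -o NAME` writes `NAME.subreads.bam`) are hypothetical.

```python
def per_cell_commands(cells, threads=64):
    """cells: mapping of cell name -> directory holding that cell's *.bax.h5 files."""
    commands, ccs_bams = [], []
    for name, hdf5_dir in sorted(cells.items()):
        subreads = f"{name}.subreads.bam"  # ASSUMED bax2bam output name for prefix `name`
        ccs_bam = f"{name}.ccs.bam"
        # Step 1: convert the cell's h5 files to a subreads BAM.
        commands.append(f"bax2bam {hdf5_dir}/*bax.h5 -o {name}")
        # Step 2: build CCS reads with the settings quoted in this thread.
        commands.append(
            f"ccs --numThreads={threads} --polish --minLength=10 --minPasses=1 "
            "--minZScore=-999 --maxDropFraction=0.8 --minPredictedAccuracy=0.8 "
            f"--minSnr=4 {subreads} {ccs_bam}"
        )
        ccs_bams.append(ccs_bam)
    # Step 3: merge the per-cell CCS BAMs into one file.
    commands.append("samtools merge merged.ccs.bam " + " ".join(ccs_bams))
    return commands


if __name__ == "__main__":
    for command in per_cell_commands({"cell1": "/data/cell1", "cell2": "/data/cell2"}):
        print(command)
```

Printing the commands first makes it easy to inspect them before handing them to a shell or a workflow manager.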

@wyim-pgl
Author

Thank you so much.
I am running it now and will let you know.
Cheers,

Won

@ksahlin
Owner

ksahlin commented Feb 25, 2018

Hi again Won,

Just wanted to let you know that while working on extending the IsoCon algorithm for nontargeted data (repository not available yet), I’ve discovered additional parts in the original IsoCon code that would not scale to a nontargeted dataset (especially one of 30 cells). So I wouldn’t wait for IsoCon to try to finish. While I’m incorporating some of the changes in the IsoCon code (e.g., this commit), I still believe that IsoCon is not suitable for a nontargeted dataset (runtime-wise), unless the reads are somehow broken into rough batches first, based on e.g. some sequence similarity and length.

Best,
K
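
As a rough illustration of the batching idea mentioned above, here is a Python sketch that splits flnc reads into batches by read length alone. It is a simple stand-in and not IsoCon's or the author's method: the sequence-similarity part is not covered, and reads from the same isoform that fall on either side of a bin boundary would end up in different batches. The function names and the `bin_width` value are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batch_by_length(records, bin_width=500):
    """Group records by length bin; keys are bin starts (0, bin_width, ...)."""
    batches = {}
    for header, seq in records:
        bin_start = (len(seq) // bin_width) * bin_width
        batches.setdefault(bin_start, []).append((header, seq))
    return batches


if __name__ == "__main__":
    toy = [">a", "ACGT", ">b", "ACGTACGTAC", ">c", "AC", "GT"]
    for start, recs in sorted(batch_by_length(read_fasta(toy), bin_width=5).items()):
        print(start, [h for h, _ in recs])
```

Each batch could then be written to its own fasta file and processed separately.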

@wyim-pgl
Author

Kristoffer,

Thank you for letting me know.

I will think about other ways to do this.

Regards,

@wyim-pgl
Author

Hi Kristoffer,

Is it possible to run with a subset? For example, we are targeting some specific genes. I can BLAST or map the CCS reads to them and then run IsoCon.

@ksahlin
Owner

ksahlin commented Mar 23, 2018

That will probably work. Just make sure that all fasta sequences are also in the ccs.bam. Let me know how large your dataset is after blasting as well. It is possible that you want to set --ignore_ends_len to higher than 15 (default) if your reads are not cut at relatively precise breakpoints.

Let me know how it goes and I'm happy to help you get the most out of this analysis.
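
A minimal Python sketch of the subsetting step discussed above: keep only the flnc reads that have a BLAST hit against the genes of interest. It assumes the reads were used as BLAST queries with tabular output (`-outfmt 6`, where the first column is the query ID) and that the query ID equals the first whitespace-delimited token of the fasta header. The function names are hypothetical, and no filtering on identity or alignment length is done here.

```python
def hit_query_ids(blast_tab_lines):
    """Collect query IDs (column 1) from BLAST tabular output lines."""
    ids = set()
    for line in blast_tab_lines:
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t")[0].strip())
    return ids


def subset_fasta(fasta_lines, keep_ids):
    """Yield the FASTA lines of records whose ID (first header token) is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


if __name__ == "__main__":
    hits = ["r1\tgeneA\t98.5", "r3\tgeneB\t91.0", "r1\tgeneB\t90.0"]
    fasta = [">r1 x", "AAA", ">r2", "CCC", ">r3", "GGG", "TTT"]
    print(list(subset_fasta(fasta, hit_query_ids(hits))))
```

The reads kept this way still need to be present in the ccs.bam passed to IsoCon, as noted above.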

@ksahlin
Owner

ksahlin commented Mar 29, 2018

Hi @Ascendo, just wanted to let you know how you can possibly make your analysis faster for a nontargeted dataset. Take-home message: cut transcripts at precise ends after blasting. See issue 2 and issue 3.
