Fragment assignments to transcripts #889
-
Assume a fragment, say f_i, maps to two transcripts, t_k and t_l. When we give initial nucleotide-fraction values (eta_k and eta_l, associated with t_k and t_l) to the likelihood function, do we count f_i in both t_k and t_l (i.e., twice, and then calculate each eta based on the reads that belong to it), or do we guess eta_k independently of f_i? In other words, do the initial etas that we guess for the likelihood function depend on the reads assigned to each transcript or not? If they do depend on them, how do we count reads that belong to multiple transcripts? The ultimate goal is to estimate eta, and we know that eta depends on how many fragments are assigned to a transcript. We know that in the last step it is based on the fragments assigned, because by then every fragment is assigned to only one transcript; but before the end, we haven't decided yet, and a fragment can still belong to multiple transcripts. In that case, how do we find (estimate) the initial eta? Is it based on the fragments assigned?
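For reference, here is a sketch of the likelihood the question is about, in the standard transcript-quantification notation (the textbook form, assumed rather than copied from salmon's source):

```latex
% L(eta): likelihood of the N observed fragments given the nucleotide
% fractions eta over M transcripts, with sum_k eta_k = 1 and
% P(f_i | t_k) the conditional probability of fragment f_i arising from
% transcript t_k (zero if f_i does not map to t_k).
\mathcal{L}(\eta) \;=\; \prod_{i=1}^{N} \; \sum_{k=1}^{M} \eta_k \, P(f_i \mid t_k)
```

Note that a multi-mapping fragment contributes a single factor containing a sum over its candidate transcripts, so its one unit of probability mass is shared between t_k and t_l rather than counted twice.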
-
Or do we first give eta_k and eta_l "randomly," then calculate p_ki and p_li (the probabilities that read i comes from transcript k or from transcript l), and, depending on which probability is bigger, give f_i to that transcript?
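For comparison with the answer below: plain EM does neither of these exactly; it splits a shared fragment fractionally between its candidate transcripts at every iteration rather than giving it to the more probable one. A minimal toy sketch of one such round (made-up numbers, not salmon's implementation):

```python
import numpy as np

# Toy data: two transcripts, two fragments; numbers are illustrative.
eta = np.array([0.5, 0.5])      # initial guess for (eta_k, eta_l)
cond_prob = np.array([
    [1.0, 0.0],                 # f_0 maps uniquely to t_k
    [0.6, 0.4],                 # f_1 maps to both t_k and t_l
])                              # cond_prob[i, k] = P(f_i | t_k)

for _ in range(50):
    # E-step: r[i, k] = eta_k * P(f_i|t_k) / sum_j eta_j * P(f_i|t_j).
    weighted = eta * cond_prob
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    # Each row of resp sums to 1, so the shared fragment f_1 is counted
    # once in total, split fractionally between t_k and t_l.
    # M-step: eta proportional to expected fragment counts
    # (effective-length normalization omitted for brevity).
    eta = resp.sum(axis=0) / resp.sum()

print(eta)   # converges toward the maximum-likelihood nucleotide fractions
```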
-
@Ray6283,
There are multiple (defensible) ways to initialize eta (i.e., eta^0). One valid initialization is uniform. Another is uniform plus the assignment of uniquely-mapped fragments; since those are always a lower bound for the abundance of any transcript, they are always safe to assign, though if we don't, the EM will do it in the first round anyway. Finally, yet another way is a mix (an affine combination) of the uniform initialization and the result of the online estimate; this last one is, in fact, what salmon does. Since salmon is a dual-phase algorithm, there are actually two separate phases to the optimization.

The first phase happens while the fragment mappings are being parsed. At this point, the full fragment-level probabilities are available, but, at least initially, we don't have an estimate of the transcript abundances. To this end, we employ an online stochastic inference algorithm, stochastic collapsed variational Bayesian inference. In short, this estimation itself starts with a uniform prior over transcript abundances, but updates the abundance estimates after each "batch" of observed alignments. After we observe the alignments for a read, we update the transcript abundances and place the reads into (range-factorized) equivalence classes.

After we have observed all of the reads, we have an estimate of transcript abundances from this online phase; call it eta'. Then we can set eta^0 as an affine combination of eta' and the uniform distribution. In practice, we weight the affine combination based on the number of observed mapped reads, between 0 (all uniform) and 50,000,000 (all eta'). In reality, the specific initialization generally only has a small effect on the abundances found by the offline inference procedure.
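A minimal sketch of the initialization just described, assuming a simple linear interpolation weight (the exact weighting function salmon uses may differ in detail):

```python
import numpy as np

def init_eta(eta_online, n_mapped_reads, max_reads=50_000_000):
    """eta^0 as an affine combination of the online estimate eta' and
    the uniform distribution, weighted by the number of mapped reads.
    An illustrative reading of the description above, not salmon's code."""
    m = len(eta_online)
    uniform = np.full(m, 1.0 / m)
    w = min(n_mapped_reads / max_reads, 1.0)   # 0 => all uniform, 1 => all eta'
    eta0 = w * np.asarray(eta_online) + (1.0 - w) * uniform
    return eta0 / eta0.sum()                   # renormalize against rounding drift

# Example: 10M mapped reads => w = 0.2, so eta^0 stays close to uniform.
print(init_eta([0.7, 0.2, 0.1], 10_000_000))
```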