Fragment assignments to transcripts #889
-
Assume a fragment, say f_i, maps to two transcripts, t_k and t_l. When we give initial nucleotide-fraction values (eta_k and eta_l, associated with t_k and t_l) to the likelihood function, do we count f_i in both t_k and t_l (i.e., twice, and then calculate each eta based on the reads that belong to it), or do we guess eta_k independently of f_i? In other words, do the initial etas that we guess for the likelihood function depend on the reads assigned to each transcript or not? If they do depend on them, how do we count reads that belong to multiple transcripts? The ultimate goal is to estimate eta, and we know that eta depends on how many fragments are assigned to a transcript. We know that in the last step it is based on the fragments assigned, because by then every fragment is assigned to only one transcript; but before the end, we haven't decided yet, and a fragment can still belong to multiple transcripts. In that case, how do we find (estimate) the initial eta? Is it based on the fragments assigned?
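For reference, here is a sketch of the likelihood the question is about, in the standard transcript-quantification notation (the textbook form, assumed rather than copied from salmon's source):

```latex
% L(eta): likelihood of the N observed fragments given the nucleotide
% fractions eta over M transcripts, with sum_k eta_k = 1 and
% P(f_i | t_k) the conditional probability of fragment f_i arising from
% transcript t_k (zero if f_i does not map to t_k).
\mathcal{L}(\eta) \;=\; \prod_{i=1}^{N} \; \sum_{k=1}^{M} \eta_k \, P(f_i \mid t_k)
```

Note that a multi-mapping fragment contributes a single factor containing a sum over its candidate transcripts, so its one unit of probability mass is shared between t_k and t_l rather than counted twice.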
-
Or do we first give eta_k and eta_l "randomly," then calculate p_ki and p_li (the probabilities that read i comes from transcript k or from transcript l), and, depending on which probability is bigger, give f_i to that transcript?
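For comparison with the answer below: plain EM does neither of these exactly; it splits a shared fragment fractionally between its candidate transcripts at every iteration rather than giving it to the more probable one. A minimal toy sketch of one such round (made-up numbers, not salmon's implementation):

```python
import numpy as np

# Toy data: two transcripts, two fragments; numbers are illustrative.
eta = np.array([0.5, 0.5])      # initial guess for (eta_k, eta_l)
cond_prob = np.array([
    [1.0, 0.0],                 # f_0 maps uniquely to t_k
    [0.6, 0.4],                 # f_1 maps to both t_k and t_l
])                              # cond_prob[i, k] = P(f_i | t_k)

for _ in range(50):
    # E-step: r[i, k] = eta_k * P(f_i|t_k) / sum_j eta_j * P(f_i|t_j).
    weighted = eta * cond_prob
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    # Each row of resp sums to 1, so the shared fragment f_1 is counted
    # once in total, split fractionally between t_k and t_l.
    # M-step: eta proportional to expected fragment counts
    # (effective-length normalization omitted for brevity).
    eta = resp.sum(axis=0) / resp.sum()

print(eta)   # converges toward the maximum-likelihood nucleotide fractions
```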
-
@Ray6283,
There are multiple (defensible) ways to initialize eta (i.e., eta^0). One valid initialization is uniform. Another is uniform plus the assignment of uniquely-mapped fragments; since those are always a lower bound for the abundance of any transcript, they are always safe to assign, though if we don't, the EM will do it in the first round anyway. Finally, yet another way is a mix (an affine combination) of the uniform initialization and the result of the online estimate; this last one is, in fact, what salmon does. Since salmon is a dual-phase algorithm, there are actually two separate phases to the optimization.

The first phase happens while the fragment mappings are being parsed. At this point, the full fragment-level probabilities are available, but, at least initially, we don't have an estimate of the transcript abundances. To this end, we employ an online stochastic inference algorithm, stochastic collapsed variational Bayesian inference. In short, this estimation itself starts with a uniform prior over transcript abundances, but updates the abundance estimates after each "batch" of observed alignments. After we observe the alignments for a read, we update the transcript abundances and place the reads into (range-factorized) equivalence classes.

After we have observed all of the reads, we have an estimate of transcript abundances from this online phase; call it eta'. Then we can set eta^0 as an affine combination of eta' and the uniform distribution. In practice, we weight the affine combination based on the number of observed mapped reads, between 0 (all uniform) and 50,000,000 (all eta'). In reality, the specific initialization generally only has a small effect on the abundances found by the offline inference procedure.
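A minimal sketch of the initialization just described, assuming a simple linear interpolation weight (the exact weighting function salmon uses may differ in detail):

```python
import numpy as np

def init_eta(eta_online, n_mapped_reads, max_reads=50_000_000):
    """eta^0 as an affine combination of the online estimate eta' and
    the uniform distribution, weighted by the number of mapped reads.
    An illustrative reading of the description above, not salmon's code."""
    m = len(eta_online)
    uniform = np.full(m, 1.0 / m)
    w = min(n_mapped_reads / max_reads, 1.0)   # 0 => all uniform, 1 => all eta'
    eta0 = w * np.asarray(eta_online) + (1.0 - w) * uniform
    return eta0 / eta0.sum()                   # renormalize against rounding drift

# Example: 10M mapped reads => w = 0.2, so eta^0 stays close to uniform.
print(init_eta([0.7, 0.2, 0.1], 10_000_000))
```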