Question on the output of MoDec #1

winnieWei123456 · 2024-02-27T06:28:12Z

Dear developers,

Hello. I have been using MoDec v 1.2 and I would like to know how the "Score" in the output file "Responsibilities/bestPepResp_....txt" is calculated. I couldn't find the calculation method in the original text. Additionally, could you explain the parameter "-S 0, --Salign 0"? I understand that it calculates the binding core offset starting from different positions, and setting it to zero assumes that the central amino acid of the peptide is the center. However, I would like to know how the calculation method for different offsets is related to the weights.

I look forward to your response. Thank you very much.

jracle85 · 2024-03-01T09:43:40Z

Dear Winnie Wei,

This score corresponds to the term
$$\sum_{k=0}^K \sum_{s=-S}^S w_{k,s}\prod_{l=1}^L \frac{\theta_{l,x_{l\oplus s}^n}^k}{f_{x_{l\oplus s}^n}} $$
that appears in equation (1) of our paper.

Concerning -S 0 or --Salign 0: this indeed corresponds to a central alignment of the peptides. With this setting, the score is computed as in the equation above (eq. 1 from the paper), with the sum over $s$ running from $-S$ to $S$ and $x_{l\oplus s}$ using the "special sum" defined in equation N1 of the Supplementary Note of our paper. With -S 1, you would redefine $S = \max_n(\lambda^n) - L$, eq. (1) would have a sum over $s$ from $0$ to $S$, and there would be a simple $x_{l + s}$ term. The idea is similar for -S 2, where the offsets are counted from the C-terminus of the peptide. This has implications for the $w_{k,s}$ term because the peptides do not all have the same length. For example, a peptide of length $L$ is only counted in the $w_{k,0}$ term, while longer peptides are counted in multiple $w_{k,s}$ terms.
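
For readers who want to see the N-terminal case (-S 1) spelled out, here is a minimal Python sketch of the score term with the simple $x_{l+s}$ indexing. It is not MoDec's own code: the function names, the nested-list layout of `theta` and `w`, and the two-letter toy alphabet are all assumptions made for illustration, and the central alignment (-S 0) is not covered because it relies on the "special sum" of equation N1.

```python
from math import prod


def core_offsets(peptide_length, core_length):
    """Offsets s (counted from the N-terminus) at which a core of length L fits."""
    return range(peptide_length - core_length + 1)


def score_terms(peptide, theta, f, w):
    """Return {(k, s): w[k][s] * prod_l theta[k][l][aa] / f[aa]} for one peptide.

    theta[k][l][aa]: frequency of amino acid aa at core position l in motif k
    f[aa]:           background frequency of amino acid aa
    w[k][s]:         weight of motif k at offset s
    (This data layout is hypothetical, chosen only for the sketch.)
    """
    core_length = len(theta[0])
    terms = {}
    for k, motif in enumerate(theta):
        for s in core_offsets(len(peptide), core_length):
            core = peptide[s:s + core_length]
            odds = prod(motif[l][aa] / f[aa] for l, aa in enumerate(core))
            terms[(k, s)] = w[k][s] * odds
    return terms


def score(peptide, theta, f, w):
    """Sum of all (motif, offset) terms, i.e. the double sum over k and s."""
    return sum(score_terms(peptide, theta, f, w).values())


# Toy values (invented): one motif, core length L = 2, alphabet {A, G}.
TOY_F = {"A": 0.5, "G": 0.5}
TOY_THETA = [[{"A": 0.9, "G": 0.1}, {"A": 0.2, "G": 0.8}]]
TOY_W = [[0.6, 0.4]]  # offsets s = 0 and s = 1, so the longest peptide has length 3

short_terms = score_terms("AG", TOY_THETA, TOY_F, TOY_W)   # length L: only s = 0
long_terms = score_terms("AAG", TOY_THETA, TOY_F, TOY_W)   # length L + 1: s = 0 and s = 1
print(sorted(short_terms), sorted(long_terms))
```

In this toy setting the peptide of length $L$ only touches $w_{0,0}$, while the longer peptide contributes one term per admissible offset, which is the point made above about peptides of different lengths.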

Best regards,

Julien

winnieWei123456 · 2024-03-12T12:07:20Z

Dear Professor Julien Racle,

Thank you for your kind response. I read the MoDec articles again and tried new parameter settings, and I now understand MoDec better. However, since I want to compare MoDec with Gibbscluster, I have a few more questions.

First, in the article "Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes," the term "fi" represents the expected background frequency of amino acid "i" in HLA-II peptidomics data. What are the values of these background frequencies? Were they calculated from the data you curated specifically for this paper?

Second, since MoDec uses ggseqlogo to visualize motifs, should the input be the binding core sequences of the peptides assigned to each motif according to their best responsibility value? ggseqlogo accepts either a sequence alignment as a character vector or a position frequency matrix, so both options are possible. I used the binding cores of the peptides from a specific cluster in Seq2Logo, without selecting a sequence weighting method and with the pseudo count set to 0, and generated a motif image. It looks very similar to the one generated by MoDec using ggseqlogo. Is my approach correct? Are the two images identical, apart from the different color settings for the amino acid properties? I have sent the two images as attachments.

Third, how are the values of q(k, s) in the fullPepRes_K output of the MoDec results obtained? Is it by scoring the corresponding binding cores with the PWM calculated by MoDec and then allocating the values proportionally? I noticed that the q(k,s) values sum to one.

Finally, how is θ estimated? Is it the frequency of a specific amino acid in the original data, without any sequence weighting?

I am not very familiar with MoDec, so some of my questions may sound "silly." I truly appreciate your helpful responses, as they have been valuable for my ongoing study and research. I apologize for taking up your time, and I will always remember your kindness.

Yours faithfully, Winnie Wei


jracle85 · 2024-03-14T08:57:33Z

Hello Winnie Wei,

Good that you use our tool. I'll try to answer your questions below.

1. f_i was indeed obtained from our curated MS data, in two steps: I first estimated the binding core positions with MoDec using the frequencies in our full MS data, and then I recomputed these frequencies from our HLA-II MS data after removing the first 3 AAs, the last 3 AAs and the 9 AAs predicted to be in the binding core (to remove the bias in frequencies due, for example, to some AAs being preferentially seen at binding anchor positions). The values currently used are listed below (I didn't update them based on our more recent dataset, but that wouldn't change them much and wouldn't have a big impact):

   f['A'] = 0.0793; f['C'] = 0.00257; f['D'] = 0.06849; f['E'] = 0.09386; f['F'] = 0.02431; f['G'] = 0.07418; f['H'] = 0.02733; f['I'] = 0.03553; f['K'] = 0.08251; f['L'] = 0.05667; f['M'] = 0.01006; f['N'] = 0.04149; f['P'] = 0.0682; f['Q'] = 0.05515; f['R'] = 0.06199; f['S'] = 0.07242; f['T'] = 0.05682; f['V'] = 0.05747; f['W'] = 0.00721; f['Y'] = 0.02443;
2. I don't see the attached logo images, but it doesn't matter. Both approaches indeed give very similar results. In the MoDec report, I directly use the position frequency matrices found in the PWM folder of the MoDec results (you can check the file make_logo_report.R). These do include the peptide similarity weighting. On the other hand, when I show a final figure of the motif found for one allele after grouping data from multiple samples, I use the binding core sequences from MoDec, without weighting peptides by their similarity. (Whether or not the peptides are weighted has only a very small influence, barely visible on the logos, and the pseudo count addition won't change much in general if you have many sequences; what is probably more important is that you process the Gibbscluster results in the same way if you want to compare the two.)
3. and 4. I think that you got the general idea for these two parameters (theta includes the sequence weighting and the peptide responsibility of each possible binding core of the peptide). But I'm sorry, this is too lengthy to derive here. If you want to derive how to compute these values, you should check the equations in my paper and in its SI, together with the derivation in Bishop cited there, and follow the same steps as in Bishop starting from my reformulated equations.
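
As a small illustration of the background frequencies and the unweighted logo input discussed above, the following Python sketch stores the quoted f values and builds a plain position frequency matrix from aligned binding cores. The example cores are invented, the function names are hypothetical, and the flat additive pseudocount is a simplification that is not claimed to match what MoDec or Seq2Logo do internally; in particular, no peptide similarity weighting is applied.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Background frequencies f_i as quoted in the reply above.
F_BACKGROUND = {
    "A": 0.0793, "C": 0.00257, "D": 0.06849, "E": 0.09386, "F": 0.02431,
    "G": 0.07418, "H": 0.02733, "I": 0.03553, "K": 0.08251, "L": 0.05667,
    "M": 0.01006, "N": 0.04149, "P": 0.0682, "Q": 0.05515, "R": 0.06199,
    "S": 0.07242, "T": 0.05682, "V": 0.05747, "W": 0.00721, "Y": 0.02443,
}


def position_frequency_matrix(cores, pseudocount=0.0):
    """Unweighted per-position amino-acid frequencies from equal-length cores.

    Returns a list with one {amino_acid: frequency} dict per core position.
    """
    if not cores:
        raise ValueError("at least one core is required")
    core_length = len(cores[0])
    if any(len(core) != core_length for core in cores):
        raise ValueError("all binding cores must have the same length")
    total = len(cores) + pseudocount * len(AMINO_ACIDS)
    pfm = []
    for position in range(core_length):
        counts = Counter(core[position] for core in cores)
        pfm.append({aa: (counts.get(aa, 0) + pseudocount) / total
                    for aa in AMINO_ACIDS})
    return pfm


def enrichment(pfm, background=F_BACKGROUND):
    """Per-position ratio of observed frequency to background frequency."""
    return [{aa: column[aa] / background[aa] for aa in AMINO_ACIDS}
            for column in pfm]


# Made-up 9-mer cores, only to show the call pattern.
EXAMPLE_CORES = ["FKAAAAAAL", "FRAAAAAAL", "YKAAAAAAV"]
example_pfm = position_frequency_matrix(EXAMPLE_CORES)
example_enrichment = enrichment(example_pfm)
print(round(sum(F_BACKGROUND.values()), 5))
```

A matrix like `example_pfm` (or the list of cores itself) is the kind of input a logo tool expects, so applying the same unweighted construction to the cores from both MoDec and Gibbscluster keeps the comparison like for like.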

Best regards,

Julien

winnieWei123456 · 2024-03-20T02:25:11Z

Thanks for your quick and extensive reply. I will check the equations and try to learn more about it. Best, Winnie Wei


jracle85 added the question Further information is requested label Mar 1, 2024
