-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All aditional sequences removed in seed-to-alignment step #36
Comments
So, I've been trying different configurations for the seed file and perhaps I am not building it properly. From the example (LY86/LY96) it seems I shouldn't need to add a sequence as an outgroup, so in one of our attempts we made a similar file - if I want to reconstruct ancestrals for a pair of paralogs derived from a duplication that happened during early chordate evolution, I should get a basal chordate species that already has the two paralogs and a more derived one which also has both, and that will define the scope, is that correct? In my case, one of the paralogs is missing in humans. Does that mean I should not include the other human sequence or is it ok as long as if I identify it? Also, looking at the manual I noticed that even though aliases would be optional they should help the program identify which are the paralogs. So I built a new file with only species containing the two paralogs, and to each of them I added all aliases for each of them on Uniprot (also including gene names). This time I do get a final alignment that has more stuff than what was already on the seed, but it seems they only managed to retrieve sequences from one of the two paralogs, but not the other. Here's what I have in my seed file: species,name,aliases,sequence,accession |
Update: it seems the issue was caused by the fact one of the aliases for HIUase is part of one of the alias for TTRs (HIUases were also known as Transthyretin-related proteins), and that would cause it to drop it. Removing that alias seems to have solved the problem, I now got an alignment with sequences for both paralogs. |
Hi! I just wanted to let you know that Mike is out of town until tomorrow. I'm glad you were able to figure out where things were going wrong in your seed file. I hope the rest of the steps go more smoothly. This is really helpful feedback for us to describe what to do and what not to do when constructing your seed dataset, as well as what kinds of checks topiary could potentially do to help the user from having too similar of aliases. Thank you! |
Thanks, Kona, after fixing that detail it seems to be working fine with various input changes. As for defining scope, am I right in not including an outgroup in the initial dataset? |
That depends on what ancestors you are interested in. If you want to reconstruct the deepest ancestor of the family encompassing both HIUases and TTRs, then I think it would be best to include an outgroup species (with sequence and aliases) with the "key_species" column set to True. Otherwise, topiary might have a hard time finding and deciding to keep this kind of outgroup sequence. If you are reconstructing the deepest ancestor of either HIUases or TTRs then your current seed setup would be fine. Also, to make things move faster, you can set the "key_species" column to True for only the most evolutionary divergent species for each paralog (example: both paralogs from Amblyraja radiata, both paralogs from Mus musculus, and the outgroup species sequence). It is ideal to choose species with well-annotated genomes, but that's not always an option. Then set all other special sequences you want to include in your project as "key_species" = False. |
@lbleicher : Glad you got it to work! Did topiary give you any warning about the alias overlap? In many cases, it will do so; I'd like it to do so all of the time if possible. |
It did! I probably missed it before because it came early in the output - perhaps it would be interesting to write the warnings as the last output text? Perhaps I had the same issue in other attempts but didn't notice it. |
Good to know it at least warned you. Unfortunately, python doesn’t give much control over output of warnings and errors. I’ll have to think about how to make that more user friendly...
… On Mar 24, 2023, at 12:22 PM, lbleicher ***@***.***> wrote:
It did! I probably missed it before because it came early in the output - perhaps it would be interesting to write the warnings as the last output text? Perhaps I had the same issue in other attempts but didn't notice it.
—
Reply to this email directly, view it on GitHub <#36 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABFZA6V2JREIU24ZEB2Z3STW5XX6DANCNFSM6AAAAAAWBUMYGY>.
You are receiving this because you commented.
|
Hi, I've tried to run seed-to-alignment in another computer and this time it ran to the finish, but all sequences were removed (they get flagged Keep=false) and the final file (06_alignment.fasta) only has what was originally in my list of sequences. Am I doing something wrong with my input file? It has five sequences, two from Mus Musculus (which has the two paralogs), one from Homo sapiens (which lost one of them), one from Danio Rerio and one from Saccoglossus kowalevskii (as the outgroup). I'm sending the input file.
mq_lb_test.csv
The text was updated successfully, but these errors were encountered: