Won't identify TEs that are created with EDTA for non-model organism #10
Comments
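Hi! Thanks for reporting this. Looks like the code has some trouble writing the results to a file. My first guesses would be:
- The actual string of [RAW DATA] or [TE] contains some symbol that turns it into an invalid filepath, e.g. / or a space? Seems odd though if this happens for all TEs.
- Permissions of the directory that it tries to write to could be another issue, but then I would expect a different error.
Would you mind sharing the command used to run deviaTE? And maybe double-check that the library of TE sequences is a valid fasta file? |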
Hi,
The TE names that EDTA output all contained a "/", so I think that is the
issue. I corrected this in my reference library and am rerunning now. I
will let you know if this issue persists.
Thanks!
Cory
…On Tue, Mar 29, 2022 at 12:52 AM W-L ***@***.***> wrote:
Hi! Thanks for reporting this. Looks like the code has some trouble
writing the results to a file. My first guesses would be:
- the actual string of [RAW DATA] or [TE] contains some symbol that
turns it into an invalid filepath, e.g. / or a space? Seems odd though
if this happens for all TEs
- Permissions of the directory that it tries to write to could be
another issue, but then I would expect a different Error.
Would you mind sharing the command used to run deviaTE? And maybe
double-check that the library of TE sequences is a valid fasta file?
cheers
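A quick way to test the first guess is to scan the headers of the TE library for characters that cannot appear in a file name. The following is a minimal sketch, not part of deviaTE: the function name `unsafe_fasta_names`, the set of characters treated as unsafe ("/" and whitespace), and the example header with a slash are all assumptions made for illustration. It checks the whole header line, so headers with a description after a space are flagged too.

```python
import re

# Assumed set of unsafe characters: the path separator "/" and any whitespace.
UNSAFE = re.compile(r"[/\s]")

def unsafe_fasta_names(fasta_lines):
    """Return the names from FASTA header lines that contain unsafe characters."""
    flagged = []
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].rstrip("\n")
            if UNSAFE.search(name):
                flagged.append(name)
    return flagged

# Hypothetical library content; only the first header contains a "/".
example = [
    ">TE_00000718_INT#LTR/unknown\n",
    "ACGT\n",
    ">TE_00000001#DNA-Helitron\n",
    "TTGA\n",
]
flagged = unsafe_fasta_names(example)
print(flagged)
```

For a real library, the lines of the open FASTA file can be passed in place of `example`.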
|
Hi,
The naming convention was the issue; it seems to be running fine now. On a
side note, I am scanning for the presence of a large list of transposable
elements, and many don't have any reads mapping. Is there any way to prevent
output from being produced when there are no reads mapping to a particular
element?
Thank you,
Cory
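One possible workaround outside of deviaTE is to check the mapped-read counts per TE before deciding which elements to analyse. The sketch below assumes the sorted BAM file is indexed and that `samtools idxstats` has been run on it; that command prints one tab-separated line per reference sequence (name, length, mapped read-segments, unmapped read-segments). The function name `mapped_references` and the numbers in the example are made up for illustration, and this is not an existing deviaTE option.

```python
def mapped_references(idxstats_text):
    """Return names of reference sequences with at least one mapped read.

    Expects the tab-separated output of `samtools idxstats`:
    name, sequence length, mapped read-segments, unmapped read-segments.
    """
    names = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        # The "*" line collects reads without a reference and is skipped.
        if name != "*" and int(mapped) > 0:
            names.append(name)
    return names

# Illustrative idxstats output; the counts are invented.
example = (
    "TE_00000718_INT#LTR-unknown\t5120\t0\t0\n"
    "TE_00000001#DNA-Helitron\t2300\t148\t3\n"
    "*\t0\t0\t912\n"
)
with_reads = mapped_references(example)
print(with_reads)
```

The resulting list could then be used to restrict the analysis to TEs that actually have reads.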
|
This is the output I get for each transposable element in my test. To me,
this suggests there are no reads mapping to this particular TE?
******************** Analysis
Starting analysis of TE_00000718_INT#LTR-unknown in SRR10235406-final.fastq.fused.sort.bam..
No annotaions found for: TE_00000718_INT#LTR-unknown
Normalization: none (values are raw abundances)
Analysis completed - output written to: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown
******************** Visualization
Loading data: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown
Visualization written to: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown.pdf
|
Hi! Glad that your original issue was solved. deviaTE should probably check for such situations itself to be fair. I'll implement a fix for that.
The program should then exit without producing any output. Hope this helps! |
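I added a check to replace invalid characters in TE names, which should prevent the original error (10d2b70). I'm not going to make a new release of the package at this point. But if you would like to make use of this change, you can replace the code file on your computer with the updated one (bin/deviaTE_analyse in this repository). In case you installed the tool via conda, it should be located somewhere along the lines of:
~/miniconda3/envs/deviaTE_env/bin/deviaTE_analyse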
|
Thank you for creating a fix for that naming issue. I am still curious
about the other issue where it said I had no annotations but I still
received output, can you explain what that means?
Cory
…On Tue, Apr 5, 2022 at 3:54 AM W-L ***@***.***> wrote:
I added a check to replace invalid characters in TE names, which should
prevent the original error (10d2b70
<10d2b70>).
I'm not going to make a new release of the package at this point. But if
you would like to make use of this change, you can replace the updated code
file on your computer (bin/deviaTE_analyse in this repository). In case
you installed the tool via conda, it should be located somewhere along the
lines of:
~/miniconda3/envs/deviaTE_env/bin/deviaTE_analyse
|
Ahh, I see, thanks for clarifying! So it is working as intended, fantastic.
I also wanted to raise another, broader question since I have your
attention:
I am trying to identify TEs in unassembled natural genomes (not high enough
coverage for a full assembly, especially for high repeat regions), so the
library I am using is from TEs identified in a chromosome level genome
build of a colony population. I feel like I will be missing potentially
novel TEs circulating in these natural populations by using this method,
which is the intent of this analysis. Can you provide any ideas on how to
build a more fitting library for identification so I can identify TEs that
might not be represented in the colony genome?
Thank you,
Cory
…On Tue, Apr 5, 2022 at 9:30 AM W-L ***@***.***> wrote:
No problem! Forgot to mention that the fix is basically replacing
problematic characters with dashes, so that the analysis can proceed
without issues.
The message about "no annotations" refers to the optional parameter
--annotation. This can be used to provide GFF3 files with annotations of
the TE sequences, e.g. the location of CDS and other defined genetic
elements. These will mainly be used in the visualisation, e.g. at the
bottom of this one:
[image: image]
<https://user-images.githubusercontent.com/16755298/161801714-24779b2b-0c4d-4aeb-82e3-e7a74214f75b.png>
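The dash replacement described at the start of this comment can be sketched in a few lines. This is an illustration only and not the code from the commit referenced in this thread: the function name `sanitize_te_name` and the set of characters treated as problematic (slashes, backslashes and whitespace) are assumptions.

```python
import re

def sanitize_te_name(name, replacement="-"):
    """Replace characters that are problematic in file names with `replacement`."""
    # Assumed problematic characters: "/", "\" and any whitespace.
    return re.sub(r"[/\\\s]", replacement, name)

# Hypothetical TE name with a slash, in the style of the names in this thread.
clean = sanitize_te_name("TE_00000718_INT#LTR/unknown")
print(clean)
```

Names without problematic characters pass through unchanged, so the function can be applied to every name in a library.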
|
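That's a tricky one. I think a two-pronged approach might be worth considering in this case.
- Repository-based: Try and collect all relevant sequences from already existing TE databases for the species that you are studying.
- De-novo assembly of repeats from raw reads: There are quite a few tools that can do this, but I don't know for which species and coverage they are suitable. Some that come to my mind are RepeatExplorer (https://pubmed.ncbi.nlm.nih.gov/23376349/), dnaPipeTE (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419797/), REPdenovo (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792456/).
You could then, for example, use a combined library of TE sequences from these with deviaTE to quantify the TE content.
A possibly helpful review with lots of links to databases & tools: https://www.nature.com/articles/s41576-018-0050-x#ref-CR77 |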
Thank you for the very useful information. Let me get back to you when I
have had a chance to run this. I appreciate your help!
Cory
…On Thu, Apr 7, 2022 at 3:01 AM W-L ***@***.***> wrote:
That's a tricky one. I think a two-pronged approach might be worth
considering in this case.
- Repository-based: Try and collect all relevant sequences from
already existing TE databases for the species that you are studying
- De-novo assembly of repeats from raw reads: There are quite a few
tools that can do this, but I don't know for which species and coverage
they are suitable. Some that come to my mind are RepeatExplorer (
https://pubmed.ncbi.nlm.nih.gov/23376349/), dnaPipeTE (
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419797/), REPdenovo (
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792456/).
You could then, for example, use a combined library of TE sequences from
these with deviaTE to quantify the TE content.
A possibly helpful review with lots of links to databases & tools:
https://www.nature.com/articles/s41576-018-0050-x#ref-CR77
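Combining the two sources into one library could look roughly like the sketch below. It assumes both libraries are plain FASTA and merges them by sequence name, keeping the first record for each name. The function names and the toy records are hypothetical, and the sketch does not detect sequences that are similar but named differently, which would need a separate clustering step.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def combine_libraries(*libraries):
    """Merge record lists, keeping the first record seen for each name."""
    seen = set()
    combined = []
    for records in libraries:
        for name, seq in records:
            if name not in seen:
                seen.add(name)
                combined.append((name, seq))
    return combined

# Toy libraries: one from a database, one from de novo assembly.
database_lib = read_fasta([">TE_A", "ACGT", "ACGT", ">TE_B", "GGCC"])
denovo_lib = read_fasta([">TE_B", "GGCC", ">contig_1", "TTAA"])
combined = combine_libraries(database_lib, denovo_lib)
print([name for name, _ in combined])
```

Putting the database library first means its records win when a name appears in both sources.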
|
Hello,
I am trying to run this for a set of raw sequences for Anopheles gambiae. I used EDTA to create a TE library from the AgamP4 genome assembly and then used my raw sequences as input for this pipeline to identify which TEs are present in which of our samples. Following trimming/mapping, the pipeline attempts to identify TEs, but I get the following error for every TE identified by EDTA:
Starting analysis of [TE] in [RAW DATA]-final.fastq.fused.sort.bam..
No annotaions found for: [TE]
Traceback (most recent call last):
  File "/home/ch943/bin/miniconda/envs/deviaTE_env/bin/deviaTE_analyse", line 100, in <module>
    sample.write_frame(out=args.output + '.raw', insertions=ihat, command=comm, t=timestamp, norm='raw')
  File "/home/ch943/bin/miniconda/envs/deviaTE_env/lib/python3.6/site-packages/deviaTE/deviaTE_pileup.py", line 204, in write_frame
    with open(out, 'w') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: '[RAW DATA]-final.fastq.[TE].raw'
Any guidance would be appreciated.
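The error in the traceback above can be reproduced in isolation when the output file name contains a "/". The sketch below is an illustration under that assumption and does not use deviaTE itself. The helper `try_write` and the file names are hypothetical. Because "/" is read as a directory separator, `open` looks for a directory that does not exist.

```python
import os
import tempfile

def try_write(directory, filename):
    """Try to create `filename` inside `directory`; return the exception name or 'ok'."""
    path = os.path.join(directory, filename)
    try:
        with open(path, "w") as outfile:
            outfile.write("test\n")
    except OSError as err:
        return type(err).__name__
    return "ok"

with tempfile.TemporaryDirectory() as tmp:
    # "sample.TE_1#LTR" is treated as a missing directory in the first call.
    with_slash = try_write(tmp, "sample.TE_1#LTR/unknown.raw")
    with_dash = try_write(tmp, "sample.TE_1#LTR-unknown.raw")

print(with_slash, with_dash)
```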