Train database usage #11
-
Hello, sorry for the late reply. We wanted to expand our dataset with as many ncRNA sequences as we could find. Still, the majority of our dataset consists of sequences from RNAcentral (mostly because, as you said, most of these datasets are at least to some degree a subset of RNAcentral). However, including the additional datasets added several million sequences that were not found in RNAcentral. To prevent potential redundancy in the dataset, we used seqkit's rmdup tool to remove duplicate sequences, and we then clustered the sequences with MMseqs2.
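To illustrate the first step, here is a minimal Python sketch of exact-sequence deduplication, which is conceptually what a by-sequence `seqkit rmdup` run does. It is not the pipeline used for the dataset: the function names `parse_fasta` and `dedup_by_sequence` are hypothetical, the case-insensitive comparison is an assumption, and the MMseqs2 clustering step is not reproduced here.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedup_by_sequence(
    records: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Keep only the first record for each distinct sequence.

    Comparison is case-insensitive (an assumption made for this sketch).
    """
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


if __name__ == "__main__":
    fasta = [">a", "ACGU", ">b", "acgu", ">c", "GGCC", "AA"]
    for header, seq in dedup_by_sequence(parse_fasta(fasta)):
        print(f">{header}\n{seq}")
```

Exact deduplication like this only removes identical sequences; near-duplicates still remain, which is why a clustering step is needed afterwards.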
-
Thank you so much for answering! Can you please convert this issue to a discussion?
-
Hi,
I noticed you are using a combination of databases, including RNAcentral, Rfam, Ensembl, and nt.
May I ask why you chose these databases?
Specifically, RNAcentral should be a superset of Rfam and Ensembl. While nt is not part of RNAcentral, it should be very similar to the ENA database, which is also a subset of RNAcentral.
Also, what deduplication pipeline is applied to remove the redundancy?