Relax reference name validation with ValidationStringency #39

Open
maxibor opened this issue Jun 28, 2020 · 6 comments

@maxibor

maxibor commented Jun 28, 2020

Right now, the (default?) reference name validation stringency of htsjdk is pretty strict, leading to errors when reference names in alignment files are ill-formatted (for example, the reference names in the metaphlan database).
This should be relaxed with ValidationStringency to allow for non-perfectly formatted reference names.
CC @apeltzer @JudithNeukamm

@JudithNeukamm
Collaborator

Thanks for your comment.
The ValidationStringency is set to LENIENT by default, which emits warnings but keeps the run going if possible. Did you get wrong output, or did the tool just throw a warning?

@maxibor
Author

maxibor commented Jun 29, 2020

$ damageprofiler -i metagenomebis.all_mapped.bam -r mpa_db_latest.fa -o damageprofiler
DamageProfiler v0.4.6
Invalid SAM/BAM file. Please check your file.
htsjdk.samtools.SAMException: Sequence name '157592__A0A150IGK6__fliD,flbC,flaV' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*'

Version: 0.4.6, installed via Conda
Test files are unfortunately too big to attach. The error comes from characters in the sequence name that the regex does not allow (in the example above, the comma).
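
To see which characters trip the check, the pattern from the error message can be applied one character at a time. This is a minimal sketch, assuming the regex printed in the error above is the complete rule; `offending_chars` is a hypothetical helper for illustration and not part of htsjdk or DamageProfiler.

```python
import re

# Character classes copied from the regex in the htsjdk error message above.
# The first character has a slightly narrower set than the remaining ones
# ('*' and '=' are only allowed after the first position).
FIRST_CHAR = re.compile(r"[0-9A-Za-z!#$%&+./:;?@^_|~-]")
OTHER_CHAR = re.compile(r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def offending_chars(name):
    """Return (position, character) pairs that the pattern rejects.

    Hypothetical helper. An empty name is also rejected by the full
    pattern, but it has no offending character to report.
    """
    bad = []
    for i, ch in enumerate(name):
        allowed = FIRST_CHAR if i == 0 else OTHER_CHAR
        if not allowed.fullmatch(ch):
            bad.append((i, ch))
    return bad


# Sequence name taken from the error message in this comment.
print(offending_chars("157592__A0A150IGK6__fliD,flbC,flaV"))
```

For the name in the error message, only the two commas are reported; every other character is accepted by the pattern.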

@JudithNeukamm
Collaborator

I made a small test file, and changing the ValidationStringency to 'SILENT' does not solve the problem, unfortunately. I will try to solve this by the next release.

@apeltzer
Contributor

To be honest, that's also quite an invalid FASTA header 🙄 157592__A0A150IGK6__fliD,flbC,flaV 🤦

@maxibor
Author

maxibor commented Jul 6, 2020

Agree, Metaphlan uses funny reference names. Though, for example, this is valid: 157592__A0A150IGK6__fliD;flbC;flaV

@JudithNeukamm
Collaborator

Unfortunately, I couldn't solve this problem. It's not influenced by the ValidationStringency parameter, and there doesn't seem to be an option to set a user-defined regex pattern.
Might it be an option to contact the developers of metaphlan to make them aware of this issue, or to fix the file before running DamageProfiler?
If you find a way to solve this within the code, I'll be happy to include it.
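
One way to "fix the file" beforehand could be to rewrite the reference names so they pass the pattern from the error message. The sketch below is only an illustration under assumptions: `sanitize_name` and `sanitize_fasta_lines` are hypothetical helpers, the default replacement character `_` is an arbitrary choice, and the sequence name is assumed to be everything in the FASTA header up to the first space.

```python
import re

# Character classes copied from the regex in the htsjdk error message
# quoted earlier in this thread.
FIRST_CHAR = re.compile(r"[0-9A-Za-z!#$%&+./:;?@^_|~-]")
OTHER_CHAR = re.compile(r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]")


def sanitize_name(name, replacement="_"):
    """Replace every character the pattern rejects with `replacement`."""
    out = []
    for i, ch in enumerate(name):
        allowed = FIRST_CHAR if i == 0 else OTHER_CHAR
        out.append(ch if allowed.fullmatch(ch) else replacement)
    return "".join(out)


def sanitize_fasta_lines(lines, replacement="_"):
    """Yield FASTA lines with the name in each header line sanitized.

    Raises ValueError if two different original names would collapse
    into the same sanitized name.
    """
    seen = {}  # sanitized name -> original name, to detect collisions
    for line in lines:
        if not line.startswith(">"):
            yield line  # sequence lines pass through unchanged
            continue
        header = line[1:].rstrip("\r\n")
        # Assumption: the name is the part before the first space.
        name, _, description = header.partition(" ")
        new_name = sanitize_name(name, replacement)
        if seen.setdefault(new_name, name) != name:
            raise ValueError(f"{name!r} and {seen[new_name]!r} both map to {new_name!r}")
        yield ">" + new_name + (" " + description if description else "") + "\n"


# Example with the name from this issue, using ';' as maxibor suggested.
example = [">157592__A0A150IGK6__fliD,flbC,flaV\n", "ACGT\n"]
print("".join(sanitize_fasta_lines(example, replacement=";")), end="")
```

Renaming the FASTA alone is probably not enough: the BAM header carries the same reference names, so the reads would have to be mapped again against the renamed reference, or the BAM header rewritten with the same mapping (for instance with `samtools reheader`).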

@JudithNeukamm JudithNeukamm self-assigned this Aug 5, 2020