fix to mzRefinery temporary file naming #129
tools/msconvert/msconvert_macros.xml
@@ -25,7 +25,9 @@
#else
--input=${input}
#if hasattr($input, 'display_name')
--input_name='${input.display_name}'
##--input_name='${input.display_name}'
#set basename = $input.display_name
Using display_name always seems dangerous to me. Can we not hardcode this to "foo.bar"?
@bgruening That's actually exactly what I originally did and it works fine (as long as --ident_name has the same prefix). I was trying to match the original logic as much as possible.
Incidentally, 20 out of 24 tests failed for this module (the updated one for this filter passed). Looks like mostly small differences due to changes in ontology version etc. I see that this tool is currently blacklisted in the build files, so I assume this is known and on the TODO list. I can try to address if time permits.
Blacklisted just means there is no conda package at the moment and we cannot test it with Travis. The correct way would be to get this dependency into Conda, or check if there is already one. Thanks @jvolkening!
I guess the tool is using my local msconvert then, which would explain the difference in ontology version. I see now that the <sourceFile> on most of the outputs is also affected by this patch, which explains the other source of problems. Easy to update once the exact syntax of the fix is finalized.
Pfft, I did this one at the conda hackathon. :) https://bioconda.github.io/recipes/proteowizard/README.html
@chambm ha great, so let's get this tool under testing - or am I missing something?
@jvolkening can you try to use the conda package from @chambm?
Yes, will try it out. Should I go ahead and hardcode the temporary file names as mentioned above? I don't mind replacing the filenames in the tests once that is decided.
I would hardcode them. Using the user-facing dataset name can be a security issue and is error-prone.
Looking at this again, the hard-coded filename will be written to the output (when output format is mzML) as an attribute of the |
Probably, but this can still cause unexpected things. Also, the name does not need to be the original name from the upload, right?
I suppose, although in my experience this is the default. I believe this question would only matter for a user who relied on this field for some sort of in-house sample or analysis tracking and who downloaded the mzML output to use elsewhere. I will hard-code with a source comment regarding the potential issues for future reference.
If I understand what's being discussed here correctly, I think we've discussed this before @bgruening . :) In shotgun proteomics it is best practice to keep the original MS run filenames at least to the point where some aggregation is done:
When your analysis has foodata_1.RAW, foodata_2.RAW, ... , foodata_15.RAW, and also bardata_1.RAW, bardata_2.RAW, ... , bardata_15.RAW, and bazdata_1.RAW, bazdata_2.RAW, ... , bazdata_15.RAW, etc.... it's decidedly not best practice to rename all these things to an arbitrary Galaxy dataset numeric name. Note that SOME file formats can keep the original name internally (mzML, mz5, mzXML, pepXML, mzIdentML), but other formats cannot reliably do so (MGF). So if one searches an MGF called "msconvert on dataset 42.mgf" and they want to go back to the original, unfiltered spectrum in the RAW file (for example because the MGF has been through some kind of filtering), a user will have much fun digging through the Galaxy dataset provenance. And no non-Galaxy software will be able to do so automatically (with MGF). And SearchGUI only supports MGF. 👎 |
@chambm The issue at hand is, I think, slightly different from what you describe. With the current PR, the output dataset name is preserved from input with the proper suffix substitution (I didn't change this part of the wrapper). However, in order for mzRefinery (and possibly other filters) to work, the input spectra and search ID files are symlinked to the working directory with new names prior to running I believe the question is (1) how important/meaningful are the filenames thus stored and (2) are there security issues with using the user-supplied dataset name in the symlink call. It's certainly easier, simpler and possibly safer just to use hard-coded temporary filenames, except for the question of (1).
True, it's a slightly different case. But the security concerns are the same as with the main pipeline as far as I can tell, where we symlink the dataset names to the display names so that the filenames are preserved in the pipeline at every step (e.g. msconvert needs to see the file as "foodata_1.RAW" because the RAW file doesn't have 'foodata_1' stored internally). So we need to find a way to mitigate the security concerns or else sacrifice the usability of many proteomics tools (and apply the security concerns about symlinks uniformly, not just in the mzRefinement part). |
I just ran a test locally, changing the display name of an input dataset to an absolute path to an existing file in the galaxy user's home. Running the msconvert wrapper on this, I observed four things:
Is it enough to sanitize the display name within the python wrapper by throwing an error if a '/' is seen, or maybe not dying but substituting with '_'? I can't think of any other special characters that would be a concern, since the variable is not being plugged into a system call anywhere but rather used in the python |
I think '/' is the only problematic character for absolute paths but I'm not 100% certain. Could an escaped slash also evaluate to an absolute path? But absolute paths are only one possible threat. We also need to prevent command injection. Careful quoting of filenames should avoid that, but that assumes the shell will deal with quoted arguments properly (which is probably a safe assumption these days). An alternative is replacing the potential command separators: ; && || (are there any others?) And, for completeness, we should never use eval from user input:
Whatever characters we decide need to be sanitized, I prefer the substitute with '_' approach (with a warning) rather than just terminating. |
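To illustrate why an absolute path in the display name matters, and what the substitute-with-'_' approach would do about it, here is a minimal Python sketch. It assumes the link path is built by joining the name onto a working directory; the directory, the hostile name, and the helper neutralize_slashes are all hypothetical and not taken from the wrapper.

```python
import posixpath
import warnings


def neutralize_slashes(display_name):
    """Replace path separators in a user-supplied name with '_' and warn."""
    cleaned = display_name.replace("/", "_")
    if cleaned != display_name:
        warnings.warn("'/' in dataset name %r replaced with '_'" % display_name)
    return cleaned


work_dir = "/galaxy/job_dir"        # hypothetical working directory
hostile = "/home/galaxy/.bashrc"    # hypothetical display name set by a user

# An absolute second component makes join() discard the working directory.
unsafe_target = posixpath.join(work_dir, hostile)

# After substitution the path stays inside the working directory.
safe_target = posixpath.join(work_dir, neutralize_slashes(hostile))
```

Substituting rather than raising keeps the job running, which matches the preference stated above; the warning leaves a trace for the admin.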
Correct me if I'm wrong, but even though it's Python doing the symlinking, the command injection could come from the quoting. To fix that we only need to prevent the unquoting, i.e. in each user-set argument, replace each occurrence of ' (single quote) with the four-character string '\''.
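A small Python sketch of that single-quote replacement, for illustration only. The helper single_quote and the hostile name are hypothetical; the check uses shlex.split, which parses with POSIX shell-like rules, as a stand-in for what a shell would see.

```python
import shlex


def single_quote(arg):
    """Wrap arg in single quotes, rewriting each ' as the sequence '\\''."""
    return "'" + arg.replace("'", "'\\''") + "'"


name = "it's; rm -rf ~ #.mzML"   # hypothetical hostile dataset name
quoted = single_quote(name)

# A POSIX-style parse of the quoted form yields the original name as one token.
tokens = shlex.split("msconvert " + quoted)
```

The standard library's shlex.quote achieves the same effect with a different escape sequence, so in practice it is probably the safer choice over a hand-rolled helper.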
Absolutely. I hadn't looked closely at how the python wrapper was constructed. I saw immediately that the actual call to msconvert looks like this:
Generally, On the other hand, I don't see any strong reason NOT to replace pipes, ampersands, etc as well. Personally I'd prefer to err on the side of security rather than try to accommodate an edge case for some user's funky filename. |
The symlinking is a separate issue (I believe) but you're certainly right that there is an avenue for command injection as things are now written due to the fact that the input filename is also used in the system call to |
Based on testing, it appears that potentially dangerous characters (I tried at least '|', '&', and ';') are already being substituted or stripped. I tried including them in both the dataset display name and in a free-form string parameter and they are stripped or substituted by Galaxy prior to being used in the Cheetah command. So this may be a non-issue. I still would like to see if the system call in the python wrapper can be made without using the shell. Where I see this right now:
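The wrapper's own call is not reproduced here. As a general illustration of making a system call without a shell, the Python sketch below passes the arguments as a list to subprocess.run, so metacharacters in a filename reach the child process as literal text. The child command is a stand-in (a Python one-liner that echoes its argument), not msconvert, and the hostile name is hypothetical.

```python
import subprocess
import sys

hostile_name = "spectra.mzML; echo INJECTED"   # hypothetical input name

# Stand-in for the real tool: a child Python process that echoes its argument.
cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile_name]

# With an argument list and no shell=True, no shell ever parses the name.
result = subprocess.run(
    cmd, stdout=subprocess.PIPE, universal_newlines=True, check=True
)
echoed = result.stdout.strip()
```

Here the ';' arrives in the child as part of a single argument, so quoting rules stop being a concern for that call.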
I'm hoping for feedback on the following issue. As I was updating the test files for this package, I realized that this would need to be done each time the proteowizard version was updated in the bioconda package, as the metadata in the output would change accordingly. At first I thought it would be enough to add a
I'm inclined toward solution (3) but would like feedback before spending more time on it. |
As a separate issue, there is logic in the Cheetah code to handle Agilent WIFF file naming specially, but I've never worked with these files and am unsure how to deal with this bit of code. Additionally, there are currently no tests or test files relating to this aspect of the code. Apparently there are multiple input files needed for this format? Does anyone have any test files in this format that could be used to add to the test harness? |
I like #2. Planemo makes #1 pretty easy. BTW, msconvert itself doesn't have any functional testing in ProteoWizard's own repository (due to priorities/laziness I'm afraid), but there are functional tests for the vendor readers and the individual filters. So these functional tests for the command-line invocations are nice to have. |
WIFF files are Sciex vendor files and would need a Windows host to convert/test them. But yes, these days they are a pair of files, |
@bgruening I believe there is a compromise that is still fairly conservative in terms of security by using the display name but limiting it to a small set of allowable characters. My current version under testing allows only alphanumerics and the set [ -+._ ] -- everything else is converted to underscore. Personally, as an admin I'm okay with having to help the occasional user who is trying to use a weird filename if it means I can feel better about security. |
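A minimal Python sketch of such an allow-list, for illustration. It assumes "alphanumerics" means ASCII letters and digits; the function name sanitize_display_name is hypothetical and this is not the code under testing.

```python
import re

# Anything outside ASCII alphanumerics and the set [ -+._ ] becomes '_'.
_DISALLOWED = re.compile(r"[^A-Za-z0-9 \-+._]")


def sanitize_display_name(name):
    """Return name with every disallowed character replaced by '_'."""
    return _DISALLOWED.sub("_", name)


clean = sanitize_display_name("foodata_1.RAW")   # unchanged
dirty = sanitize_display_name("a/b|c&d;e")       # separators replaced
```

An allow-list like this covers '/', quotes, pipes, ampersands, and semicolons in one rule, without having to enumerate every dangerous character.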
The You can git diff the result to verify that the change(s) are expected. |
@chambm How does this work in terms of file upload? This is a compound datatype, correct? If someone uploads 'foo.wiff' and 'foo.wiff.scan', what would the value of |
Thanks! |
Oddly enough, the mz5 output file size from msconvert seems non-deterministic. I ran the same mzML->mz5 command on the file used in test #3 500 times - about 30% of the time the file is one size and the rest of the time it is one of various larger sizes. All of these files convert back to the same mzML file, so they appear to be consistent in that respect. I found a few comments regarding other software using HDF5 where internal compression can be inconsistent - perhaps something like that is going on here. This is not a Galaxy issue, but I had to increase the tolerance for this test to ~40% of the total file size to account for the variation observed plus some wiggle room.
There are still various warnings in the lint check but I thought they should be dealt with in a separate PR. |
Thanks @jvolkening looks great! We can fix linting later I think!
@chambm I will leave it to you to merge :)
Great work Jeremy! Just 2 last things:
Yes, we can blacklist the win part by adding |
@chambm my mistake, I misinterpreted (or perhaps over-interpreted) what you said to mean remove the windows-specific stuff. To be clear, you would like |
Right. Except for tool_dependencies.xml. The _raw.xml wrapper is the previous version of msconvert_win before it was renamed and split out to a subdirectory. |
…l dependency files" This reverts commit 736af90.
@bgruening Even with |
Narf. What's the plan for the Windows ones? Can we move them out for the moment in a
Given that the I will try to always test these locally first. I wasn't anticipating any problems with this commit but I guess I learned a lesson. |
How hard would it be to add a feature in planemo, such as a file like Moving the files would break the relative symbolic links. Not impossible of course but tedious. |
planemo has some support for include and exclude, I think: https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/.shed.yml#L13 But what we want is to get this under testing and not skip it. If we can test this here as well, as @jvolkening is indicating, let's try this. I think the problem is in the msconvert_wrapper.py script.
Really, the only difference between |
@bgruening you were too fast for me. |
A human race condition! |
Sorry. This is just too exciting :)
So can we just comment out the single Windows-specific (Thermo RAW) test for now, until the testing infrastructure supports Windows testing? |
Yes! I vote for this. The wrapper tests should first and foremost test the wrapper itself, not the binary's functionality. IMHO the current tests do this, and we can safely say this wrapper is running.
Yes, it's fine to comment it out for now. But eventually we'll want to test the job_conf that we give as an example; I consider that part of the wrapper (for msconvert_win). I guess it could run via planemo's cluster support: http://planemo.readthedocs.io/en/latest/writing_advanced.html#test-against-clusters-job-config-file |
Trying to use the mzRefinery filter in msconvert fails on my system. This appears to be because msconvert requires the identification file prefix to match the spectral file prefix (see here). Based on trial and error, "match" means the identification filename must contain the spectral prefix as a substring (I've confirmed that the beginning and end of the identification filename can differ as long as it is a superstring of the spectral filename prefix). Also, msconvert doesn't seem to care what the suffix of the spectral file is (it will autodetect format).
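The matching rule as I observed it can be modelled in a few lines of Python. This is a sketch of the trial-and-error behaviour described above, not msconvert's actual code; the function name is hypothetical, and it assumes the "prefix" is the spectral filename with its last extension removed.

```python
import os


def ident_name_matches(spectra_name, ident_name):
    """True if ident_name contains the spectral file's prefix as a substring."""
    prefix = os.path.splitext(os.path.basename(spectra_name))[0]
    return prefix in os.path.basename(ident_name)


# Extra text before and after the prefix is tolerated.
ok = ident_name_matches("foodata_1.RAW", "x_foodata_1_search.pepXML")

# An unrelated identification filename does not match.
bad = ident_name_matches("foodata_1.mzML", "dataset_42.pepXML")
```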
Please review the attached PR as a proposed fix. I changed some of the temporary file naming logic to simplify things because the temporary filenames don't seem to ever be exposed to the end user.
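As a rough sketch of the idea (not the PR's actual code), fixed temporary names that share one prefix could be set up in Python like this. The function link_inputs, the prefix "spectra", and the chosen suffixes are hypothetical; the throwaway files stand in for Galaxy datasets.

```python
import os
import tempfile


def link_inputs(spectra_path, ident_path, work_dir, prefix="spectra"):
    """Symlink both inputs into work_dir under names that share one prefix."""
    spectra_link = os.path.join(work_dir, prefix + ".mzML")
    ident_link = os.path.join(work_dir, prefix + ".pepXML")
    os.symlink(os.path.abspath(spectra_path), spectra_link)
    os.symlink(os.path.abspath(ident_path), ident_link)
    return spectra_link, ident_link


# Demo with throwaway files standing in for Galaxy datasets.
with tempfile.TemporaryDirectory() as tmp:
    spectra = os.path.join(tmp, "dataset_1.dat")
    ident = os.path.join(tmp, "dataset_2.dat")
    for path in (spectra, ident):
        open(path, "w").close()
    work_dir = os.path.join(tmp, "work")
    os.mkdir(work_dir)
    links = link_inputs(spectra, ident, work_dir)
    link_names = [os.path.basename(p) for p in links]
    targets_ok = [os.path.realpath(p) for p in links] == [
        os.path.realpath(spectra),
        os.path.realpath(ident),
    ]
```

Because no user-supplied text ends up in the link names, the prefix requirement is met without any sanitizing step.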
I also fixed some typos in the existing test element for this feature which caused it to fail even before my edits and changed the pepXML filename as part of testing the specific fix in this PR.