Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix to mzRefinery temporary file naming #129

Merged
merged 18 commits into from
Mar 14, 2017

Conversation

jvolkening
Copy link
Contributor

Trying to use the mzRefinery filter in msconvert fails on my system. This appears to be because msconvert requires the identification file prefix to match the spectral file prefix (see here). Based on trial-and-error, "match" means the identification filename must contain the spectral prefix as a substring (I've confirmed that the beginning and end of the identification filename can differ as long as it is a superstring of the spectral filename prefix. Also, msconvert doesn't seem to care what the suffix of the spectral file is (it will autodetect format).

Please review the attached PR as a proposed fix. I changed some of the temporary file naming logic to simplify things because the temporary filenames don't seem to ever be exposed to the end user.

I also fixed some typos in the existing test element for this feature which caused it to fail even before my edits and changed the pepXML filename as part of testing the specific fix in this PR.

@@ -25,7 +25,9 @@
#else
--input=${input}
#if hasattr($input, 'display_name')
--input_name='${input.display_name}'
##--input_name='${input.display_name}'
#set basename = $input.display_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using display_name seems always be dangerous to me. Can we not hardcode this to "foo.bar"?

Copy link
Contributor Author

@jvolkening jvolkening Feb 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgruening That's actually exactly what I originally did and it works fine (as long as --ident_name has the same prefix). I was trying to match the original logic as much as possible.

Incidentally, 20 out of 24 tests failed for this module (the updated one for this filter passed). Looks like mostly small differences due to changes in ontology version etc. I see that this tool is currently blacklisted in the build files, so I assume this is known and on the TODO list. I can try to address if time permits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blacklisted, just means there is no conda package at the moment and we can not test it with travis. The correct way would be to get this dependency into Conda, or check if there is already one. Thanks @jvolkening!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the tool is using my local msconvert then, which would explain the difference in ontology version. I see now that the <sourceFile> on most of the outputs is also affected by this patch, which explains the other source of problems. Easy to update once the exact syntax of the fix is finalized.

@chambm
Copy link
Contributor

chambm commented Feb 7, 2017

Pfft, I did this one at the conda hackathon. :) https://bioconda.github.io/recipes/proteowizard/README.html

@bgruening
Copy link
Member

@chambm ha great, so lets get this tool under testing - or do I miss something?

@bgruening
Copy link
Member

@jvolkening can you try to use the conda package from @chambm?

@jvolkening
Copy link
Contributor Author

Yes, will try it out. Should I go ahead and hardcore the temporary file names as mentioned above? I don't mind replacing the filenames in the tests once that is decided.

@bgruening
Copy link
Member

I would hardcode them. Using the user-facing dataset name, can be a security issue and is error prone.

@jvolkening
Copy link
Contributor Author

Looking at this again, the hard-coded filename will be written to the output (when output format is mzML) as an attribute of the <sourceFile> element. This may not be a big issue but would be one argument for trying to retain the original dataset name. Does Galaxy sanitize this variable before passing it to the wrapper?

@bgruening
Copy link
Member

Probably, but still this can cause unexpected things. Also the name does not need to original name from the upload, right?

@jvolkening
Copy link
Contributor Author

Also the name does not need to original name from the upload, right?

I suppose, although from my experience this is the default. I believe this question would only matter for a user who relied on this field for some sort of in-house sample or analysis tracking and who downloaded the mzML output to use elsewhere. I will hard-code with a source comment regarding the potential issues for future reference.

@chambm
Copy link
Contributor

chambm commented Feb 20, 2017

If I understand what's being discussed here correctly, I think we've discussed this before @bgruening . :) In shotgun proteomics it is best practice to keep the original MS run filenames at least to the point where some aggregation is done:

  1. Mass spectrometer creates a file foodata.RAW
  2. msconvert converts it to foodata.mzML (or .MGF, .mz5, etc.)
  3. Search engine analyzes it and writes out foodata.pepXML (or .pep.xml or .mzid)
  4. Protein assembly/filtering analyzes it, usually aggregation is done at this point, so the name becomes less important because the software will keep the names internally

When your analysis has foodata_1.RAW, foodata_2.RAW, ... , foodata_15.RAW, and also bardata_1.RAW, bardata_2.RAW, ... , bardata_15.RAW, and bazdata_1.RAW, bazdata_2.RAW, ... , bazdata_15.RAW, etc.... it's decidedly not best practice to rename all these things to an arbitrary Galaxy dataset numeric name.

Note that SOME file formats can keep the original name internally (mzML, mz5, mzXML, pepXML, mzIdentML), but other formats cannot reliably do so (MGF). So if one searches an MGF called "msconvert on dataset 42.mgf" and they want to go back to the original, unfiltered spectrum in the RAW file (for example because the MGF has been through some kind of filtering), a user will have much fun digging through the Galaxy dataset provenance. And no non-Galaxy software will be able to do so automatically (with MGF). And SearchGUI only supports MGF. 👎

@jvolkening
Copy link
Contributor Author

@chambm The issue at hand is, I think, slightly different that what you describe. With the current PR, the output dataset name is preserved from input with the proper suffix substitution (I didn't change this part of the wrapper). However, in order for mzRefinery (and possibly other filters) to work, the input spectra and search ID files are symlinked to the working directory with new names prior to running msconvert in order to satisfy msconvert's requirement that the two files share some common prefix. The question is what temporary filenames to use. Normally this wouldn't matter, since the temporary filenames aren't typically exposed to the end user. However, in this case they are recorded in the output of some file formats (for example, mzML <sourceFile> element).

I believe the question is (1) how important/meaningful are the filenames thus stored and (2) are there security issues with using the user-supplied dataset name in the symlink call. It's certainly easier, simpler and possibly safer just to use hard-coded temporary filenames except for the question of (1).

@chambm
Copy link
Contributor

chambm commented Feb 20, 2017

True, it's a slightly different case. But the security concerns are the same as with the main pipeline as far as I can tell, where we symlink the dataset names to the display names so that the filenames are preserved in the pipeline at every step (e.g. msconvert needs to see the file as "foodata_1.RAW" because the RAW file doesn't have 'foodata_1' stored internally). So we need to find a way to mitigate the security concerns or else sacrifice the usability of many proteomics tools (and apply the security concerns about symlinks uniformly, not just in the mzRefinement part).

@jvolkening
Copy link
Contributor Author

I just ran a test locally, changing the display name of an input dataset to an absolute path to an existing file in the galaxy user's home. Running the msconvert wrapper on this, I observed four things:

  1. Galaxy itself didn't modify the variable at all before it passed it along to the python wrapper script.

  2. The wrapper adds a filetype extension automatically if it is missing, which limits the mischief that can be done.

  3. Python will refuse to overwrite an existing file with the new symlink, which again limits any potential damage.

  4. Taking into consideration the limits of (2) and (3), the current tool allows a user to create arbitrary symlinks to the input data anywhere on disk where the galaxy user has write permissions. I'm not sure what the possible malicious uses of this would be given the user can't control which files are symlinked to or overwrite existing files or even have complete control over the link name, but it is still something that should be addressed.

Is it enough to sanitize the display name within the python wrapper by throwing an error if a '/' is seen, or maybe not dying but substituting with '_'? I can't think of any other special characters that would be a concern, since the variable is not being plugged into a system call anywhere but rather used in the python os.symlink().

@chambm
Copy link
Contributor

chambm commented Feb 21, 2017

I think '/' is the only problematic character for absolute paths but I'm not 100% certain. Could an escaped slash also evaluate to an absolute path?

But absolute paths are only one possible threat. We also need to prevent command injection. Careful quoting of filenames should avoid that, but that assumes the shell will deal with quoted arguments properly (which is probably a safe assumption these days). An alternative is replacing the potential command separators: ; && || (are there any others?)

And, for completeness, we should never use eval from user input:

eval $BAR
This is a no-no. Never run eval on data passed in by a user unless you have very, very carefully sanitized it (and if possible, use a whitelist to limit the allowed values).

The attack:
Pass a dangerous command for BAR.

Mitigation:
Just don’t do that.

Whatever characters we decide need to be sanitized, I prefer the substitute with '_' approach (with a warning) rather than just terminating.

@chambm
Copy link
Contributor

chambm commented Feb 21, 2017

Correct me if I'm wrong, but even though it's python doing the symlinking, the command injection could come from the quoting. To fix that we only need to prevent the unquoting, i.e. in each user-set argument, replace each occurrence of ' (single quote) by the four-character string '''.

@jvolkening
Copy link
Contributor Author

But absolute paths are only one possible threat. We also need to prevent command injection.

Absolutely. I hadn't looked closely at how the python wrapper was constructed. I saw immediately that the actual call to msconvert looks like this:

subprocess.Popen(args=command, shell=True, ...)

Generally, shell=True is rather dangerous as regards command injection as it passes the command string straight to a shell. If shell=False is used, the characters you mentioned shouldn't be a concern because python doesn't involve the shell. I can't tell at first glance whether there was a specific reason to use shell=True here or whether it was just for convenience.

On the other hand, I don't see any strong reason NOT to replace pipes, ampersands, etc as well. Personally I'd prefer to err on the side of security rather than try to accommodate an edge case for some user's funky filename.

@jvolkening
Copy link
Contributor Author

Correct me if I'm wrong, but even though it's python doing the symlinking, the command injection could come from the quoting.

The symlinking is a separate issue (I believe) but you're certainly right that there is an avenue for command injection as things are now written due to the fact that the input filename is also used in the system call to msconvert which is being performed through the shell.

@jvolkening
Copy link
Contributor Author

jvolkening commented Feb 21, 2017

Based on testing, it appears that potentially dangerous characters (I tried at least '|', '&', and ';') are already being substituted or stripped. I tried including them in both the dataset display name and in a free-form string parameter and they are stripped or substituted by Galaxy prior to being used in the Cheetah command. So this may be a non-issue. I still would like to see if the system call in the python wrapper can be made without using the shell.

Where I see this right now:

  1. Use display name as msconvert input to prevent breaking other tools that rely on this information being the correct filename (if the user decides to change the default display name, that is their issue).

  2. Replace any '/' with '_' in the display name to prevent injection of absolute paths or backtracking paths.

  3. Try to avoid using direct shell call in command execution (substitution of dangerous characters is already done by Galaxy).

@bgruening
Copy link
Member

  1. should be done either way. I don't like 1) but if there is no other way let's do it and improve user training for this special issue.

@jvolkening
Copy link
Contributor Author

I'm hoping for feedback on the following issue. As I was updating the test files for this package, I realized that this would need to be done each time the proteowizard version was updated in the bioconda package, as the metadata in the output would change accordingly. At first I thought it would be enough to add a lines_diff attribute on the outputs to allow for two or three differences. However, for indexed mzML and mzXML every index line also changes due to the change in upstream content, so the number of allowed differences would need to be quite a bit larger. In addition, I observed one spectrum that actually had different binary arrays after peak-picking, presumably due to changes in the msconvert algorithms. I can see the following options to address this:

  1. Find or create a way to automatically update each of the expected output files each time the bioconda proteowizard version is updated.

  2. Turn off indexing (--noindex) for mzML and mzXML output for most tests and set lines_diff to some reasonably low number. There is currently no Galaxy parameter for controlling indexing, so this would need to be added as well.

  3. Re-think the type of testing done. For instance, rather than comparing file contents directly, perhaps it is sufficient to check that (1) msconvert produced the expected file type and (2) the output file size is close to the expected size within a small tolerance. (1) could be accomplished by checking for expected strings typical of each output type, and (2) could be done by using a fairly conservative value for file size difference. After all, this isn't supposed to be a regression test for msconvert itself but just a test that the calls to msconvert are being made properly.

I'm inclined toward solution (3) but would like feedback before spending more time on it.

@jvolkening
Copy link
Contributor Author

As a separate issue, there is logic in the Cheetah code to handle Agilent WIFF file naming specially, but I've never worked with these files and am unsure how to deal with this bit of code. Additionally, there are currently no tests or test files relating to this aspect of the code. Apparently there are multiple input files needed for this format? Does anyone have any test files in this format that could be used to add to the test harness?

@chambm
Copy link
Contributor

chambm commented Feb 24, 2017

I like #2. Planemo makes #1 pretty easy. BTW, msconvert itself doesn't have any functional testing in ProteoWizard's own repository (due to priorities/laziness I'm afraid), but there are functional tests for the vendor readers and the individual filters. So these functional tests for the command-line invocations are nice to have.

@chambm
Copy link
Contributor

chambm commented Feb 24, 2017

WIFF files are Sciex vendor files and would need a Windows host to convert/test them. But yes, these days they are a pair of files, .wiff and .wiff.scan. Which is arguably better or worse than vendor 'files' which aren't files but directories, e.g. Agilent, Bruker, and Waters!

@jvolkening
Copy link
Contributor Author

I don't like 1) but if there is no other way let's do it and improve user training for this special issue.

@bgruening I believe there is a compromise that is still fairly conservative in terms of security by using the display name but limiting it to a small set of allowable characters. My current version under testing allows only alphanumerics and the set [ -+._ ] -- everything else is converted to underscore. Personally, as an admin I'm okay with having to help the occasional user who is trying to use a weird filename if it means I can feel better about security.

@jvolkening
Copy link
Contributor Author

Planemo makes #1 pretty easy.

@chambm Can you expand or point me to the proper docs? I was trying to look for a way to do this easily but couldn't find anything.

@chambm
Copy link
Contributor

chambm commented Feb 24, 2017

The --update_test_data flag for the test command: http://planemo.readthedocs.io/en/latest/commands.html#test-command

You can git diff the result to verify that the change(s) are expected.

@jvolkening
Copy link
Contributor Author

WIFF files are Sciex vendor files and would need a Windows host to convert/test them. But yes, these days they are a pair of files, .wiff and .wiff.scan. Which is arguably better or worse than vendor 'files' which aren't files but directories, e.g. Agilent, Bruker, and Waters!

@chambm How does this work in terms of file upload? This is a compound datatype, correct? If someone uploads 'foo.wiff' and 'foo.wiff.scan', what would the value of $input.display_name be? I'm trying to figure out how to adapt my changes for this filetype.

@jvolkening
Copy link
Contributor Author

The --update_test_data flag for the test command: http://planemo.readthedocs.io/en/latest/commands.html#test-command

Thanks!

@jvolkening
Copy link
Contributor Author

Oddly enough, the mz5 output file size from msconvert seems non-deterministic. I ran the same mzML->mz5 command on the file used in test #3 500 times - about 30% of the time the file is one size and the rest of the time it is various larger size. All of these files convert back to the same mzML file so they appear to be consistent in that respect. I found a few comments regarding other software using HDF5 where internal compression can be inconsistent - perhaps something like that is going on here.

This is not a Galaxy issue but I had to increase the tolerance for this test to ~ 40% of the total file size to account for the variation observed plus some wiggle room.

@jvolkening
Copy link
Contributor Author

There are still various warnings in the lint check but I thought they should be dealt with in a separate PR.

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jvolkening looks great! We can fix linting later I think!
@chambm I will leave it to you too merge :)

@chambm
Copy link
Contributor

chambm commented Mar 14, 2017

Great work Jeremy! Just 2 last things:

  1. Correct me if I'm wrong @bgruening : can't we keep the msconvert_win tool but blacklist it for Travis? (that would mean keeping msconvert_win.xml, the msconvert_macros.xml and msconvert_wrapper links, the README, the job_conf sample, and .shed.yml; might as well keep the test-data links too)
  2. Add yourself to the contributors list in the README 💯

@bgruening
Copy link
Member

Yes, we can blacklist the win part by adding tools/msconvert/msconvert_win to the blacklist file.

@jvolkening
Copy link
Contributor Author

@chambm my mistake, I misinterpreted (or perhaps over-interpreted) what you said to mean remove the windows-specific stuff. To be clear, you would like msconvert_raw.xml and msconvert_raw_wrapper.py removed but leave the msconvert_win directory intact?

@chambm
Copy link
Contributor

chambm commented Mar 14, 2017

Right. Except for tool_dependencies.xml. The _raw.xml wrapper is the previous version of msconvert_win before it was renamed and split out to a subdirectory.

@chambm
Copy link
Contributor

chambm commented Mar 14, 2017

@bgruening Even with tools/msconvert/msconvert_win on the blacklist, planemo returns tools/msconvert and tools/msconvert/idconvert as the changed tools, so planemo tests those, which of course recurses into the subdirectories. How do you recommend fixing? :)

@bgruening
Copy link
Member

Narf. What the plan for the windows ones. Can we move them out for the moment in a staged folder or something.

@jvolkening
Copy link
Contributor Author

jvolkening commented Mar 14, 2017

Given that the msconvert_win tests DID run, I can't figure out why they failed. They should all be identical to the tests for msconvert_nix (imported from same macro) with symlinks to the same test data. Only one test (raw input) should fail. Log output doesn't help me any.

I will try to always test these locally first. I wasn't anticipating any problems with this commit but I guess I learned a lesson.

@chambm
Copy link
Contributor

chambm commented Mar 14, 2017

How hard would it be to add a feature in planemo, such as a file like .planemo_skip_test, that acts as a flag to not search that subdirectory for tools? Or perhaps there is another solution that will still allow the msconvert_win files to be maintained here and manually uploaded to the toolshed after manual pulsar testing.

Moving the files would break the relative symbolic links. Not impossible of course but tedious.

@bgruening
Copy link
Member

planemo has some support for include and explcued I think: https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/.shed.yml#L13

But what we want is to get this under testing and not skip it. If we can test this here as well as @jvolkening is indicating let's try this. I think the problem is in the msconvert_wrappery.py script.

@jvolkening
Copy link
Contributor Author

msconvert_win was missing the macro that pulled in proteowizard from conda, which is almost certainly while most of the tests were failing (re-testing now).

Really, the only difference between msconvert_nix and msconvert_win` are the allowed input types and the one extra test. But the RAW test will always fail on other platforms unless there is a way to exclude it specifically.

@jvolkening
Copy link
Contributor Author

@bgruening you were too fast for me.

@chambm
Copy link
Contributor

chambm commented Mar 14, 2017

A human race condition!

@bgruening
Copy link
Member

Sorry. This is just to exciting :)
We get closer and closer to all galaxyp tools are tested and condafied ... - thanks to you guys!

@jvolkening
Copy link
Contributor Author

So can we just comment out the single Windows-specific (Thermo RAW) test for now, until the testing infrastructure supports Windows testing?

@bgruening
Copy link
Member

Yes! I vote for this. The wrapper should foremost test them self not the binary or functional tests - Imho this is done with the tests currently and we can safely say this wrapper is running.

@chambm
Copy link
Contributor

chambm commented Mar 14, 2017

Yes, it's fine to comment it out for now. But eventually we'll want to test the job_conf that we give as an example; I consider that part of the wrapper (for msconvert_win). I guess it could run via planemo's cluster support: http://planemo.readthedocs.io/en/latest/writing_advanced.html#test-against-clusters-job-config-file

@chambm chambm merged commit d56659d into galaxyproteomics:master Mar 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants