fix to mzRefinery temporary file naming #129

jvolkening · 2017-02-06T22:46:07Z

Trying to use the mzRefinery filter in msconvert fails on my system. This appears to be because msconvert requires the identification file prefix to match the spectral file prefix (see here). Based on trial-and-error, "match" means the identification filename must contain the spectral prefix as a substring (I've confirmed that the beginning and end of the identification filename can differ as long as it is a superstring of the spectral filename prefix. Also, msconvert doesn't seem to care what the suffix of the spectral file is (it will autodetect format).

Please review the attached PR as a proposed fix. I changed some of the temporary file naming logic to simplify things because the temporary filenames don't seem to ever be exposed to the end user.

I also fixed some typos in the existing test element for this feature which caused it to fail even before my edits and changed the pepXML filename as part of testing the specific fix in this PR.

bgruening · 2017-02-06T22:54:28Z

tools/msconvert/msconvert_macros.xml

@@ -25,7 +25,9 @@
    #else
      --input=${input}
      #if hasattr($input, 'display_name')
-      --input_name='${input.display_name}'
+      ##--input_name='${input.display_name}'
+      #set basename = $input.display_name


using display_name seems always be dangerous to me. Can we not hardcode this to "foo.bar"?

@bgruening That's actually exactly what I originally did and it works fine (as long as --ident_name has the same prefix). I was trying to match the original logic as much as possible.

Incidentally, 20 out of 24 tests failed for this module (the updated one for this filter passed). Looks like mostly small differences due to changes in ontology version etc. I see that this tool is currently blacklisted in the build files, so I assume this is known and on the TODO list. I can try to address if time permits.

Blacklisted, just means there is no conda package at the moment and we can not test it with travis. The correct way would be to get this dependency into Conda, or check if there is already one. Thanks @jvolkening!

I guess the tool is using my local msconvert then, which would explain the difference in ontology version. I see now that the <sourceFile> on most of the outputs is also affected by this patch, which explains the other source of problems. Easy to update once the exact syntax of the fix is finalized.

chambm · 2017-02-07T00:03:48Z

Pfft, I did this one at the conda hackathon. :) https://bioconda.github.io/recipes/proteowizard/README.html

bgruening · 2017-02-07T00:06:29Z

@chambm ha great, so lets get this tool under testing - or do I miss something?

bgruening · 2017-02-16T12:30:56Z

@jvolkening can you try to use the conda package from @chambm?

jvolkening · 2017-02-18T20:04:49Z

Yes, will try it out. Should I go ahead and hardcore the temporary file names as mentioned above? I don't mind replacing the filenames in the tests once that is decided.

bgruening · 2017-02-18T22:58:44Z

I would hardcode them. Using the user-facing dataset name, can be a security issue and is error prone.

jvolkening · 2017-02-20T16:41:14Z

Looking at this again, the hard-coded filename will be written to the output (when output format is mzML) as an attribute of the <sourceFile> element. This may not be a big issue but would be one argument for trying to retain the original dataset name. Does Galaxy sanitize this variable before passing it to the wrapper?

bgruening · 2017-02-20T16:45:54Z

Probably, but still this can cause unexpected things. Also the name does not need to original name from the upload, right?

jvolkening · 2017-02-20T17:04:01Z

Also the name does not need to original name from the upload, right?

I suppose, although from my experience this is the default. I believe this question would only matter for a user who relied on this field for some sort of in-house sample or analysis tracking and who downloaded the mzML output to use elsewhere. I will hard-code with a source comment regarding the potential issues for future reference.

chambm · 2017-02-20T17:05:54Z

If I understand what's being discussed here correctly, I think we've discussed this before @bgruening . :) In shotgun proteomics it is best practice to keep the original MS run filenames at least to the point where some aggregation is done:

Mass spectrometer creates a file foodata.RAW
msconvert converts it to foodata.mzML (or .MGF, .mz5, etc.)
Search engine analyzes it and writes out foodata.pepXML (or .pep.xml or .mzid)
Protein assembly/filtering analyzes it, usually aggregation is done at this point, so the name becomes less important because the software will keep the names internally

When your analysis has foodata_1.RAW, foodata_2.RAW, ... , foodata_15.RAW, and also bardata_1.RAW, bardata_2.RAW, ... , bardata_15.RAW, and bazdata_1.RAW, bazdata_2.RAW, ... , bazdata_15.RAW, etc.... it's decidedly not best practice to rename all these things to an arbitrary Galaxy dataset numeric name.

Note that SOME file formats can keep the original name internally (mzML, mz5, mzXML, pepXML, mzIdentML), but other formats cannot reliably do so (MGF). So if one searches an MGF called "msconvert on dataset 42.mgf" and they want to go back to the original, unfiltered spectrum in the RAW file (for example because the MGF has been through some kind of filtering), a user will have much fun digging through the Galaxy dataset provenance. And no non-Galaxy software will be able to do so automatically (with MGF). And SearchGUI only supports MGF. 👎

jvolkening · 2017-02-20T18:03:01Z

@chambm The issue at hand is, I think, slightly different that what you describe. With the current PR, the output dataset name is preserved from input with the proper suffix substitution (I didn't change this part of the wrapper). However, in order for mzRefinery (and possibly other filters) to work, the input spectra and search ID files are symlinked to the working directory with new names prior to running msconvert in order to satisfy msconvert's requirement that the two files share some common prefix. The question is what temporary filenames to use. Normally this wouldn't matter, since the temporary filenames aren't typically exposed to the end user. However, in this case they are recorded in the output of some file formats (for example, mzML <sourceFile> element).

I believe the question is (1) how important/meaningful are the filenames thus stored and (2) are there security issues with using the user-supplied dataset name in the symlink call. It's certainly easier, simpler and possibly safer just to use hard-coded temporary filenames except for the question of (1).

chambm · 2017-02-20T19:22:26Z

True, it's a slightly different case. But the security concerns are the same as with the main pipeline as far as I can tell, where we symlink the dataset names to the display names so that the filenames are preserved in the pipeline at every step (e.g. msconvert needs to see the file as "foodata_1.RAW" because the RAW file doesn't have 'foodata_1' stored internally). So we need to find a way to mitigate the security concerns or else sacrifice the usability of many proteomics tools (and apply the security concerns about symlinks uniformly, not just in the mzRefinement part).

jvolkening · 2017-02-20T20:34:19Z

I just ran a test locally, changing the display name of an input dataset to an absolute path to an existing file in the galaxy user's home. Running the msconvert wrapper on this, I observed four things:

Galaxy itself didn't modify the variable at all before it passed it along to the python wrapper script.
The wrapper adds a filetype extension automatically if it is missing, which limits the mischief that can be done.
Python will refuse to overwrite an existing file with the new symlink, which again limits any potential damage.
Taking into consideration the limits of (2) and (3), the current tool allows a user to create arbitrary symlinks to the input data anywhere on disk where the galaxy user has write permissions. I'm not sure what the possible malicious uses of this would be given the user can't control which files are symlinked to or overwrite existing files or even have complete control over the link name, but it is still something that should be addressed.

Is it enough to sanitize the display name within the python wrapper by throwing an error if a '/' is seen, or maybe not dying but substituting with '_'? I can't think of any other special characters that would be a concern, since the variable is not being plugged into a system call anywhere but rather used in the python os.symlink().

chambm · 2017-02-21T17:51:27Z

I think '/' is the only problematic character for absolute paths but I'm not 100% certain. Could an escaped slash also evaluate to an absolute path?

But absolute paths are only one possible threat. We also need to prevent command injection. Careful quoting of filenames should avoid that, but that assumes the shell will deal with quoted arguments properly (which is probably a safe assumption these days). An alternative is replacing the potential command separators: ; && || (are there any others?)

And, for completeness, we should never use eval from user input:

eval $BAR
This is a no-no. Never run eval on data passed in by a user unless you have very, very carefully sanitized it (and if possible, use a whitelist to limit the allowed values).

The attack:
Pass a dangerous command for BAR.

Mitigation:
Just don’t do that.

Whatever characters we decide need to be sanitized, I prefer the substitute with '_' approach (with a warning) rather than just terminating.

chambm · 2017-02-21T18:16:14Z

Correct me if I'm wrong, but even though it's python doing the symlinking, the command injection could come from the quoting. To fix that we only need to prevent the unquoting, i.e. in each user-set argument, replace each occurrence of ' (single quote) by the four-character string '''.

jvolkening · 2017-02-21T18:23:33Z

But absolute paths are only one possible threat. We also need to prevent command injection.

Absolutely. I hadn't looked closely at how the python wrapper was constructed. I saw immediately that the actual call to msconvert looks like this:

subprocess.Popen(args=command, shell=True, ...)

Generally, shell=True is rather dangerous as regards command injection as it passes the command string straight to a shell. If shell=False is used, the characters you mentioned shouldn't be a concern because python doesn't involve the shell. I can't tell at first glance whether there was a specific reason to use shell=True here or whether it was just for convenience.

On the other hand, I don't see any strong reason NOT to replace pipes, ampersands, etc as well. Personally I'd prefer to err on the side of security rather than try to accommodate an edge case for some user's funky filename.

jvolkening · 2017-02-21T18:27:03Z

Correct me if I'm wrong, but even though it's python doing the symlinking, the command injection could come from the quoting.

The symlinking is a separate issue (I believe) but you're certainly right that there is an avenue for command injection as things are now written due to the fact that the input filename is also used in the system call to msconvert which is being performed through the shell.

jvolkening · 2017-02-21T18:44:15Z

Based on testing, it appears that potentially dangerous characters (I tried at least '|', '&', and ';') are already being substituted or stripped. I tried including them in both the dataset display name and in a free-form string parameter and they are stripped or substituted by Galaxy prior to being used in the Cheetah command. So this may be a non-issue. I still would like to see if the system call in the python wrapper can be made without using the shell.

Where I see this right now:

Use display name as msconvert input to prevent breaking other tools that rely on this information being the correct filename (if the user decides to change the default display name, that is their issue).
Replace any '/' with '_' in the display name to prevent injection of absolute paths or backtracking paths.
Try to avoid using direct shell call in command execution (substitution of dangerous characters is already done by Galaxy).

bgruening · 2017-02-24T11:52:56Z

should be done either way. I don't like 1) but if there is no other way let's do it and improve user training for this special issue.

jvolkening · 2017-02-24T20:10:40Z

I'm hoping for feedback on the following issue. As I was updating the test files for this package, I realized that this would need to be done each time the proteowizard version was updated in the bioconda package, as the metadata in the output would change accordingly. At first I thought it would be enough to add a lines_diff attribute on the outputs to allow for two or three differences. However, for indexed mzML and mzXML every index line also changes due to the change in upstream content, so the number of allowed differences would need to be quite a bit larger. In addition, I observed one spectrum that actually had different binary arrays after peak-picking, presumably due to changes in the msconvert algorithms. I can see the following options to address this:

Find or create a way to automatically update each of the expected output files each time the bioconda proteowizard version is updated.
Turn off indexing (--noindex) for mzML and mzXML output for most tests and set lines_diff to some reasonably low number. There is currently no Galaxy parameter for controlling indexing, so this would need to be added as well.
Re-think the type of testing done. For instance, rather than comparing file contents directly, perhaps it is sufficient to check that (1) msconvert produced the expected file type and (2) the output file size is close to the expected size within a small tolerance. (1) could be accomplished by checking for expected strings typical of each output type, and (2) could be done by using a fairly conservative value for file size difference. After all, this isn't supposed to be a regression test for msconvert itself but just a test that the calls to msconvert are being made properly.

I'm inclined toward solution (3) but would like feedback before spending more time on it.

jvolkening · 2017-02-24T20:13:41Z

As a separate issue, there is logic in the Cheetah code to handle Agilent WIFF file naming specially, but I've never worked with these files and am unsure how to deal with this bit of code. Additionally, there are currently no tests or test files relating to this aspect of the code. Apparently there are multiple input files needed for this format? Does anyone have any test files in this format that could be used to add to the test harness?

chambm · 2017-02-24T20:15:16Z

I like #2. Planemo makes #1 pretty easy. BTW, msconvert itself doesn't have any functional testing in ProteoWizard's own repository (due to priorities/laziness I'm afraid), but there are functional tests for the vendor readers and the individual filters. So these functional tests for the command-line invocations are nice to have.

chambm · 2017-02-24T20:16:31Z

WIFF files are Sciex vendor files and would need a Windows host to convert/test them. But yes, these days they are a pair of files, .wiff and .wiff.scan. Which is arguably better or worse than vendor 'files' which aren't files but directories, e.g. Agilent, Bruker, and Waters!

jvolkening · 2017-02-24T20:24:13Z

I don't like 1) but if there is no other way let's do it and improve user training for this special issue.

@bgruening I believe there is a compromise that is still fairly conservative in terms of security by using the display name but limiting it to a small set of allowable characters. My current version under testing allows only alphanumerics and the set [ -+._ ] -- everything else is converted to underscore. Personally, as an admin I'm okay with having to help the occasional user who is trying to use a weird filename if it means I can feel better about security.

jvolkening · 2017-02-24T20:25:27Z

Planemo makes #1 pretty easy.

@chambm Can you expand or point me to the proper docs? I was trying to look for a way to do this easily but couldn't find anything.

chambm · 2017-02-24T20:28:53Z

The --update_test_data flag for the test command: http://planemo.readthedocs.io/en/latest/commands.html#test-command

You can git diff the result to verify that the change(s) are expected.

jvolkening · 2017-02-24T20:35:11Z

WIFF files are Sciex vendor files and would need a Windows host to convert/test them. But yes, these days they are a pair of files, .wiff and .wiff.scan. Which is arguably better or worse than vendor 'files' which aren't files but directories, e.g. Agilent, Bruker, and Waters!

@chambm How does this work in terms of file upload? This is a compound datatype, correct? If someone uploads 'foo.wiff' and 'foo.wiff.scan', what would the value of $input.display_name be? I'm trying to figure out how to adapt my changes for this filetype.

jvolkening · 2017-02-24T20:37:20Z

The --update_test_data flag for the test command: http://planemo.readthedocs.io/en/latest/commands.html#test-command

Thanks!

…ency files

jvolkening · 2017-03-14T03:18:28Z

Oddly enough, the mz5 output file size from msconvert seems non-deterministic. I ran the same mzML->mz5 command on the file used in test #3 500 times - about 30% of the time the file is one size and the rest of the time it is various larger size. All of these files convert back to the same mzML file so they appear to be consistent in that respect. I found a few comments regarding other software using HDF5 where internal compression can be inconsistent - perhaps something like that is going on here.

This is not a Galaxy issue but I had to increase the tolerance for this test to ~ 40% of the total file size to account for the variation observed plus some wiggle room.

jvolkening · 2017-03-14T03:19:34Z

There are still various warnings in the lint check but I thought they should be dealt with in a separate PR.

bgruening

Thanks @jvolkening looks great! We can fix linting later I think!
@chambm I will leave it to you too merge :)

chambm · 2017-03-14T15:31:54Z

Great work Jeremy! Just 2 last things:

Correct me if I'm wrong @bgruening : can't we keep the msconvert_win tool but blacklist it for Travis? (that would mean keeping msconvert_win.xml, the msconvert_macros.xml and msconvert_wrapper links, the README, the job_conf sample, and .shed.yml; might as well keep the test-data links too)
Add yourself to the contributors list in the README 💯

bgruening · 2017-03-14T15:35:52Z

Yes, we can blacklist the win part by adding tools/msconvert/msconvert_win to the blacklist file.

jvolkening · 2017-03-14T16:20:48Z

@chambm my mistake, I misinterpreted (or perhaps over-interpreted) what you said to mean remove the windows-specific stuff. To be clear, you would like msconvert_raw.xml and msconvert_raw_wrapper.py removed but leave the msconvert_win directory intact?

chambm · 2017-03-14T16:23:42Z

Right. Except for tool_dependencies.xml. The _raw.xml wrapper is the previous version of msconvert_win before it was renamed and split out to a subdirectory.

…l dependency files" This reverts commit 736af90.

chambm · 2017-03-14T18:02:54Z

@bgruening Even with tools/msconvert/msconvert_win on the blacklist, planemo returns tools/msconvert and tools/msconvert/idconvert as the changed tools, so planemo tests those, which of course recurses into the subdirectories. How do you recommend fixing? :)

bgruening · 2017-03-14T18:06:46Z

Narf. What the plan for the windows ones. Can we move them out for the moment in a staged folder or something.

jvolkening · 2017-03-14T18:07:55Z

Given that the msconvert_win tests DID run, I can't figure out why they failed. They should all be identical to the tests for msconvert_nix (imported from same macro) with symlinks to the same test data. Only one test (raw input) should fail. Log output doesn't help me any.

I will try to always test these locally first. I wasn't anticipating any problems with this commit but I guess I learned a lesson.

chambm · 2017-03-14T18:10:27Z

How hard would it be to add a feature in planemo, such as a file like .planemo_skip_test, that acts as a flag to not search that subdirectory for tools? Or perhaps there is another solution that will still allow the msconvert_win files to be maintained here and manually uploaded to the toolshed after manual pulsar testing.

Moving the files would break the relative symbolic links. Not impossible of course but tedious.

bgruening · 2017-03-14T18:17:30Z

planemo has some support for include and explcued I think: https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/.shed.yml#L13

But what we want is to get this under testing and not skip it. If we can test this here as well as @jvolkening is indicating let's try this. I think the problem is in the msconvert_wrappery.py script.

jvolkening · 2017-03-14T18:26:24Z

msconvert_win was missing the macro that pulled in proteowizard from conda, which is almost certainly while most of the tests were failing (re-testing now).

Really, the only difference between msconvert_nix and msconvert_win` are the allowed input types and the one extra test. But the RAW test will always fail on other platforms unless there is a way to exclude it specifically.

jvolkening · 2017-03-14T18:27:01Z

@bgruening you were too fast for me.

chambm · 2017-03-14T18:27:30Z

A human race condition!

bgruening · 2017-03-14T18:28:21Z

Sorry. This is just to exciting :)
We get closer and closer to all galaxyp tools are tested and condafied ... - thanks to you guys!

jvolkening · 2017-03-14T18:32:13Z

So can we just comment out the single Windows-specific (Thermo RAW) test for now, until the testing infrastructure supports Windows testing?

bgruening · 2017-03-14T18:34:30Z

Yes! I vote for this. The wrapper should foremost test them self not the binary or functional tests - Imho this is done with the tests currently and we can safely say this wrapper is running.

chambm · 2017-03-14T18:35:49Z

Yes, it's fine to comment it out for now. But eventually we'll want to test the job_conf that we give as an example; I consider that part of the wrapper (for msconvert_win). I guess it could run via planemo's cluster support: http://planemo.readthedocs.io/en/latest/writing_advanced.html#test-against-clusters-job-config-file

jvolkening added 2 commits February 6, 2017 15:18

fix mzRefinery file names

51601e6

fixed soft links to renamed pepXML

6b08261

bgruening reviewed Feb 6, 2017

View reviewed changes

jvolkening added 4 commits March 13, 2017 20:12

removed windows-specific files (temporarily) and outdated tool depend…

736af90

…ency files

reverted pepXML test data name (other tools share it)

f568fe5

updated repo URL in help text

465629f

greatly expanded tolerance for mz5 output size

85df091

bgruening approved these changes Mar 14, 2017

View reviewed changes

jvolkening added 3 commits March 14, 2017 11:30

Revert "removed windows-specific files (temporarily) and outdated too…

3a2f085

…l dependency files" This reverts commit 736af90.

remove outdated dependency definitions and RAW-specific handlers

7326d40

blacklist msconvert_win

ff1e424

bgruening added 2 commits March 14, 2017 19:24

Small refinements

456e1c6

Add requirements

1b4db39

commented out Windows-specific test (temporarily)

88ae8eb

chambm merged commit d56659d into galaxyproteomics:master Mar 14, 2017

fix to mzRefinery temporary file naming #129

fix to mzRefinery temporary file naming #129

Conversation

jvolkening commented Feb 6, 2017

bgruening Feb 6, 2017

Choose a reason for hiding this comment

jvolkening Feb 6, 2017 • edited Loading

Choose a reason for hiding this comment

bgruening Feb 6, 2017

Choose a reason for hiding this comment

jvolkening Feb 7, 2017

Choose a reason for hiding this comment

chambm commented Feb 7, 2017

bgruening commented Feb 7, 2017

bgruening commented Feb 16, 2017

jvolkening commented Feb 18, 2017

bgruening commented Feb 18, 2017

jvolkening commented Feb 20, 2017

bgruening commented Feb 20, 2017

jvolkening commented Feb 20, 2017

chambm commented Feb 20, 2017

jvolkening commented Feb 20, 2017

chambm commented Feb 20, 2017

jvolkening commented Feb 20, 2017

chambm commented Feb 21, 2017 • edited Loading

chambm commented Feb 21, 2017

jvolkening commented Feb 21, 2017

jvolkening commented Feb 21, 2017

jvolkening commented Feb 21, 2017 • edited Loading

bgruening commented Feb 24, 2017

jvolkening commented Feb 24, 2017

jvolkening commented Feb 24, 2017

chambm commented Feb 24, 2017

chambm commented Feb 24, 2017

jvolkening commented Feb 24, 2017

jvolkening commented Feb 24, 2017

chambm commented Feb 24, 2017 • edited Loading

jvolkening commented Feb 24, 2017

jvolkening commented Feb 24, 2017

jvolkening commented Mar 14, 2017

jvolkening commented Mar 14, 2017

bgruening left a comment

Choose a reason for hiding this comment

chambm commented Mar 14, 2017

bgruening commented Mar 14, 2017

jvolkening commented Mar 14, 2017

chambm commented Mar 14, 2017

chambm commented Mar 14, 2017

bgruening commented Mar 14, 2017

jvolkening commented Mar 14, 2017 • edited Loading

chambm commented Mar 14, 2017 • edited Loading

bgruening commented Mar 14, 2017

jvolkening commented Mar 14, 2017

jvolkening commented Mar 14, 2017

chambm commented Mar 14, 2017

bgruening commented Mar 14, 2017

jvolkening commented Mar 14, 2017

bgruening commented Mar 14, 2017

chambm commented Mar 14, 2017

jvolkening Feb 6, 2017 •

edited

Loading

chambm commented Feb 21, 2017 •

edited

Loading

jvolkening commented Feb 21, 2017 •

edited

Loading

chambm commented Feb 24, 2017 •

edited

Loading

jvolkening commented Mar 14, 2017 •

edited

Loading

chambm commented Mar 14, 2017 •

edited

Loading