-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bump msconvert #721
base: master
Are you sure you want to change the base?
bump msconvert #721
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this. IIRC I stopped updating because it was a pain to update the test reference files.
@chambm linting still fails because of booleans and one data output name |
Will fix it. |
@@ -523,20 +550,20 @@ | |||
|
|||
<xml name="msconvertOutput"> | |||
<outputs> | |||
<data format="mzml" name="output" label="${($input.name[:-4] if $input.name.endswith('.tar') else $input.name).rsplit('.',1)[0]}.${output_type}" > | |||
<filter>general_options['multi_run_output']['do_multi_run_output'] == False</filter> | |||
<data format="mzml" name="output" label="${tool.name} on ${on_string}" > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never liked these <tool> on <dataset>
names in Galaxy nor that you can only override them by using a workflow. Even in a workflow the renaming isn't very flexible. That's why I had this python code here to name the dataset the same way it would be if you downloaded the file on your computer, ran msconvert locally, then uploaded the result (that was the idea anyway). If I run msconvert on mydatafile.raw to get an mzML, I should get a dataset named mydatafile.mzML, not "msconvert on mydatafile.raw". If lint is erroring on this output renaming, I think the lint is being too aggressive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had an analogous discussion here #706 (comment) and the solution is to use collections.
We are currently adapting the W4M tools accordingly: workflow4metabolomics/tools-metabolomics#238 workflow4metabolomics/tools-metabolomics#239 .. let me check if the msconvert tool uses name
somewhere.
I think the lint is being too aggressive.
We discussed this here: galaxyproject/galaxy#15933 (comment)
The problem is that it's hard to check if the filters are mutually exclusive. So we decided to enforce being explicit and forbid outputs of the same name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who decided this ? This sounds like a bad idea. warning is the right level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never liked these
<tool> on <dataset>
names in Galaxy nor that you can only override them by using a workflow
Yes, 💯, such a bad idea to attempt to capture provenance. I'd say though that input.name is also not exactly the right thing to do. I'd just give it a reasonable default, like ${tool.name}: raw data
or whatever is appropriate. If passed a collection the element_identifier will be kept throughout the analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With ${tool.name}: raw data
all msconvert jobs in a history would have the same name ... and I would prefer any of the two other possibilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you never need to demultiplex but need to keep the filename a collection with a single element would work, if you do need to demultiplex, the tool that does the multiplexing only needs to output a collection with the demultiplexed samples, like you've said. I guess I don't understand why you need to address a portion of an input dataset, and why this works when you don't use a dataset collection but fails if you do ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't see a good reason for the naming behavior of single datasets and collections to be so different. It's not unusual to test a tool or series of tools on just a single file, and I'm not sure why one should need to make a collection of that single file first in order to get sensible naming. TBH I'd even like to see collection member names change depending on their type so it's obvious just from looking at the entry that it's an mzML, or a mzIdentML, or TSV, etc. File extensions are an imperfect but very popular and established solution to the real problem of being able to look at a filename and have an idea of what its contents/purpose is. IMO tool wrappers for command-line tools that use file extensions descriptively should preserve that behavior, whether in single datasets or collections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't see a good reason for the naming behavior of single datasets and collections to be so different. It's not unusual to test a tool or series of tools on just a single file, and I'm not sure why one should need to make a collection of that single file first in order to get sensible naming.
What should the name in that case be ? A common way to deal with this on the command line is to have directories
├── step0
│ └── sample_1.fastq
├── step1
│ └── sample_1.bam
└── step2
└── sample_1.vcf
the collection gives you a sort of container where the container describes what's inside it.
If you'd treat the dataset name as the immutable thing in the analysis you'd end up with a lot of datasets with the same name (and maybe different extensions) ... that would probably be a little confusing to users.
TBH I'd even like to see collection member names change depending on their type so it's obvious just from looking at the entry that it's an mzML, or a mzIdentML, or TSV, etc.
this seems straightforward to change in the new history .. you'd like to always see the datatype extension, even in the summary view, right ? I think that'd be pretty cool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that every automatic naming scheme has its pros and cons and there will always be users who like one more than the other.
I admit that I disliked Galaxy's default naming scheme at first, but I got used to it .. which does not mean that it's good :)
It really should work just like the command-line
The problem is that the command line is operated by a human that sets the argument values, for "free text parameters" like filenames the operating human can choose sensible names (i.e. that make sense to her).
So in some sense, Galaxy works exactly as the command line .. it just uses other names than you would. One thing I like about Galaxy is the consistency of the naming scheme even if it may not be the best in all cases.
pretty easy to reconstruct the on section in the client
I found it always very convenient to be able to reconstruct this in most cases without clicking. Butm maybe that's the main point: which information do users want to be visible (without the need for additional clicks):
- which tool generated the data with which input (data sets)
- sample name + file type
Both will "fail" for instance for an analysis of the parameter space of a tool. Then you would like to have the value of a specific parameter in the dataset name, e.g. BWA with k=23
.
What came to my mind, was how about per user / per history configurable naming schemes?
... here's also WIFF files which contain multiple runs ...
That's just another can of worms. Seems that some proteomics datatypes include references to other files. This won't work in Galaxy (out of the box), since Galaxy stores the datasets using completely different names (using some hash as a name).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the command line is operated by a human that sets the argument values, for "free text parameters" like filenames the operating human can choose sensible names (i.e. that make sense to her).
In most cases in proteomics, the software always picks a default output name based on the input filename (usually by changing file extension). That's why I think it should be the default naming scheme for Galaxy wrappers that wrap such tools. But of course you're right it has pros and cons.
Maybe it's time to update the CI |
ab42080
to
c58d281
Compare
make booleans -> selects
just stumbled over msconvert and thought it might be a good idea to update it.