Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error model output #169

Open
wants to merge 11 commits into
base: dev
Choose a base branch
from

Conversation

jordenrabasco
Copy link
Contributor

@jordenrabasco jordenrabasco commented Sep 22, 2024

This pull request is to resolve issue #158

Big changes:

  • Q2-DADA2 'denoise-' commands output a collection[DADA2stats] rather than a DADA2STATS object
  • New Q2-DADA2 action stats_viz will visualize all DADA2stats in the collection[DADA2stats] (denoised stats, and error model stats) as a singular visualization with different tabs for each DADA2STATs object.
  • tests were updated accommodate the new output type
  • tests added for the visualization to make sure that all support files are generated

- generates new action to visualize dada2 stats in a distinct tab format
- alters dada2 stats output to be a collection[DADA2Stat] obj
-updates existing tests to accommodate this change in data type
-update existing tests to come into line with the new output collection[dada2stats]
-adds tests for the output error stats table
-adds tests for the error model vizualization
- fixes bug with vizualizing ccs stats
-removing blanks from test files to fix errors in qiime CI
@gregcaporaso
Copy link
Member

gregcaporaso commented Sep 26, 2024

Thanks @jordenrabasco!

Because this introduces a breaking change (changing the output to a Collection), we can't merge this before the 2024.10 release because we always give warning about upcoming breaking changes. We can mention that this change is coming though in the 2024.10 release notes, and then review/merge early in the 2025.4 development cycle so this new functionality is available in development versions of QIIME 2.

Sound ok? Let us know if you have any questions.

Todo:

  • add note about the breaking change to the 2024.10 release notes

@jordenrabasco
Copy link
Contributor Author

@gregcaporaso Of course! That all sounds good to me. In the meantime, let me know if you need me to change anything with the implementation on my end to make it easier for you guys!

@hagenjp hagenjp self-assigned this Oct 3, 2024
@hagenjp hagenjp removed their assignment Nov 14, 2024
@colinvwood colinvwood self-assigned this Dec 3, 2024
@colinvwood
Copy link
Contributor

Hello @jordenrabasco, how do you feel about making the new visualizer only display the transition plots? I suggest this because we already have the metadata tabulate action to display the dada2 stats which makes the "Denoised-Stats" tab in the new visualizer redundant. The metadata tabulate action also is more full featured, for example it displays the column datatype and the direction of sorting. Additionally, one might want to call metadata tabulate on the transition stats so metadata tabulate will remain a key tool for inspecting dada2 output one way or the other.

One other quick thing I noticed while trying out these changes, try to name artifacts-and-visualizers-like-this.qza (no capitalization and hyphens instead of underscores), this is just a convention we try to follow. I'll circle back around to a more in-depth review once the visualizer has been ironed out.

@colinvwood
Copy link
Contributor

By the way, if at all possible it would be best if we could avoid adding additional R software to this repository. In fact, we're planning to transition some of the R code here to python using the rpy2 library (see #172). This makes things much easier for us to maintain. For example, the internal_plotErrors function that you added to run_dada.R would be much preferred in python. Or probably even better yet, you could simply reach out to dada2's plotErrors function for this purpose.

@jordenrabasco
Copy link
Contributor Author

Hi @colinvwood I can change the names and start transferring things over to rpy2 no problem!

In terms of the tabs in the visualizer I initially incorporated the tab format with dada2-stats and the error-model-stats as it was requested in issue #158, should I remove the visualizer tab format entirely or keep the framework?

@colinvwood
Copy link
Contributor

Hello @jordenrabasco,

In terms of the tabs in the visualizer I initially incorporated the tab format with dada2-stats and the error-model-stats as it was requested in issue #158

I see. In the interest of reducing code duplication I think it makes sense to keep them separate. If it's easy to do so you could perhaps borrow the tabulate template from q2-metadata. This way the presentation is as users are used to.

I also think the denoising stats and the transition frequencies are quite distinct sets of information, and most often users will likely only be interested in (and only able to immediately understand) the former. Keeping the two separate also means that the denoising stats are only able to be visualized one way--with metadata tabulate. This keeps things simple.

What's your opinion from an ease of use perspective--separate visualizers or a tabbed one? If you prefer the tabbed approach I think it would be best if we borrowed the tabulate functionality for the denoising stats tab.

I can change the names and start transferring things over to rpy2 no problem!

Don't worry about rpy2. This is something that I would like to do in the future, but I think it only makes sense if the entire run_dada.R script gets converted all in one sweep. What would be helpful however is if the transitions plot were not implemented in R (because this will mean more refactoring to do later when the R script gets removed). Is there a reason we couldn't use the plotErrors function from the dada2 package (see here for an example)? It's quite straightforward to get an R plot into a qiime2 visualizer: you can just save the plot to disk in a temporary directory, copy it into the visualizer's output_dir, and reference it in the index.html file.

@jordenrabasco
Copy link
Contributor Author

What's your opinion from an ease of use perspective --separate visualizers or a tabbed one?

Personally I think that the having all of the diagnostic information readily available within the same visualization via tabs would be very useful to the user as opposed to having it be separated out and needing to be viewed discreetly. In my experience people generally only look at the diagnostics when things go wrong downstream, so I think having everything in one place could be handy. However, this may also be a personal preference of mine.

If you prefer the tabbed approach I think it would be best if we borrowed the tabulate functionality for the denoising stats tab.

I thought about doing this but couldn't figure out how to incorporate the tabulate functionality within a tabbed visualizer. I could convert the visualizer into a pipeline but that would just output different .qzv objects for the denoising stats, and the error plot stats.

Don't worry about rpy2. This is something that I would like to do in the future, but I think it only makes sense if the entire run_dada.R script gets converted all in one sweep.

Ah okay my mistake sorry for misunderstanding!

What would be helpful however is if the transitions plot were not implemented in R (because this will mean more refactoring to do later when the R script gets removed).

The transition plots themselves are generated in _vizualizer.py in python however the preprocessing of the plot information is in R within internal_plotErrors. Do you want all the preprocessing to happen in python as well?

Is there a reason we couldn't use the plotErrors function from the dada2 package (see here for an example)? It's quite straightforward to get an R plot into a qiime2 visualizer: you can just save the plot to disk in a temporary directory, copy it into the visualizer's output_dir, and reference it in the index.html file.

plotErrors returns a ggplot2 object and when processing the data in the denoise- actions I was uncertain of how to save that as a .qza file and then incorporate it into the tabbed visualizer in the stats-viz action. It is my understanding that the temporary dir is deleted at the end of a qiime2 action. Is this correct or does it stay for the entirety of your qiime session? With this in mind I opted to preprocess the data, save the df as a .qza and then generate the plot via python in the visualizer itself. I didn't invoke the ploterrors function within the visualizer itself due to not wanting to expand the code base by adding an additional rscript (I refactored the q2-dada2 repo a while ago to get rid of tangential and redundant Rscipts). However, if you think this is the way to go I can implement it this way (i.e. visualizer calls Rscript to make the plot).

Also to be clear I am open to going with whatever you guys think is best!
Let me know what you think!

@colinvwood
Copy link
Contributor

@jordenrabasco,

Personally I think that the having all of the diagnostic information readily available within the same visualization via tabs would be very useful to the user

Sounds good, let's go forward with this approach.

I thought about doing this but couldn't figure out how to incorporate the tabulate functionality within a tabbed visualizer. I could convert the visualizer into a pipeline but that would just output different .qzv objects for the denoising stats, and the error plot stats.

This should definitely be doable. Were there any specific problems you ran into that I could help with?

plotErrors returns a ggplot2 object and when processing the data in the denoise- actions I was uncertain of how to save that as a .qza file and then incorporate it into the tabbed visualizer in the stats-viz action.

The temporary directory does go away once you exit the context manager for it, but as long as you copy the saved file into output_dir within the context manager (or save the file right into the output dir) everything works. Take a look at this function as a reference. I think this is probably going to be the best approach because its such a minimal amount of code--one plotErrors call and a few more lines to save the file into output_dir.

Hopefully this is enough to get started. Let me know if you get stuck on anything.

@jordenrabasco
Copy link
Contributor Author

jordenrabasco commented Dec 15, 2024

Hi @colinvwood I am having trouble with a few things.

This should definitely be doable. Were there any specific problems you ran into that I could help with?

I am a bit confused here, do you mean you can render a .qzv file within an html tabbed framework index file or is there another way to go about this?

The temporary directory does go away once you exit the context manager for it, but as long as you copy the saved file into output_dir within the context manager (or save the file right into the output dir) everything works. Take a look at this function as a reference. I think this is probably going to be the best approach because its such a minimal amount of code--one plotErrors call and a few more lines to save the file into output_dir.

I am having some issues changing over the code to the format you provided more specifically saving the errorplots via ggsave into the output dir. From my understanding only a visualizer has the output_dir as the default first arg and denoise_* is registered as a method and so it doesn't have access to the output_dir address. Do you know of a way around this?

Sorry if I am misunderstanding and thanks so much!

@colinvwood
Copy link
Contributor

Hey @jordenrabasco,

I am a bit confused here, do you mean you can render a .qzv file within an html tabbed framework index file or is there another way to go about this?

You'll have to do some work to make the tabulate visualizer (or the template which it extends) be tabbed. We use jinja2 to manage the various visualizer templates so if you don't have any experience with that you might want to take some time to read its documentation. It looks like the table itself is made with jQuery but you probably shouldn't have to mess with that. The tabulate visualizer is in the q2_metadata repository and the base templates are in the q2templates repository.

All of this is maybe an argument against making a single tabbed visualizer and for making two separate visualizers...

I am having some issues changing over the code to the format you provided more specifically saving the errorplots via ggsave into the output dir. From my understanding only a visualizer has the output_dir as the default first arg and denoise_* is registered as a method and so it doesn't have access to the output_dir address. Do you know of a way around this?

You don't want to make the visualization in the denoise-* method, but in the visualization. My understanding was the the input needed by the plotErrors function was the new output that you added in this PR. Thus the visualizer would take this input, call plotErrors, and then save the figure to the output_dir. Let me know if I'm misunderstanding anything.

@jordenrabasco
Copy link
Contributor Author

jordenrabasco commented Dec 16, 2024

@colinvwood

You'll have to do some work to make the tabulate visualizer (or the template which it extends) be tabbed. We use jinja2 to manage the various visualizer templates so if you don't have any experience with that you might want to take some time to read its documentation. It looks like the table itself is made with jQuery but you probably shouldn't have to mess with that. The tabulate visualizer is in the q2_metadata repository and the base templates are in the q2templates repository.

Ah okay I understand now, I thought you wanted me to utilize the tabulate visualizer action itself within the code, instead of augmenting the code to be extended into tabbed format . Apologies for misunderstanding.

All of this is maybe an argument against making a single tabbed visualizer and for making two separate visualizers...

Quite possibly. I will try to see if I can change the table over into tabulate format within a tabbed format and if not I will revisit this idea. If that's okay with you?

You don't want to make the visualization in the denoise-* method, but in the visualization. My understanding was the the input needed by the plotErrors function was the new output that you added in this PR. Thus the visualizer would take this input, call plotErrors, and then save the figure to the output_dir. Let me know if I'm misunderstanding anything.

Ah okay, so the new output from the denoise functions is not the error model output but an augmented dataframe to allow for some preprocessing and to allow for saving both fwd/rev reads within the same obj, as QIIME doesn't allow for optional outputs. I can try and change things over to output the model information instead of the augmented dataframe, however if I do this I will need to create another R script to then import this data back into R and call ploterrors(). I considered this apporach previously but didn't want to expand the code base further via another R file. Is this an approach that would work better for you guys?

Let me know what you think!

@colinvwood
Copy link
Contributor

Quite possibly. I will try to see if I can change the table over into tabulate format within a tabbed format and if not I will revisit this idea. If that's okay with you?

Of course, sounds good.

Ah okay, so the new output from the denoise functions is not the error model output but an augmented dataframe to allow for some preprocessing and to allow for saving both fwd/rev reads within the same obj, as QIIME doesn't allow for optional outputs. I can try and change things over to output the model information instead of the augmented dataframe, however if I do this I will need to create another R script to then import this data back into R and call ploterrors(). I considered this apporach previously but didn't want to expand the code base further via another R file. Is this an approach that would work better for you guys?

I think it's totally fine in this case because the R will be so minimal/boilerplate. Eventually it can be refactored using rpy2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Status: Completed
Development

Successfully merging this pull request may close these issues.

4 participants