Using CH5 Files in Python #160

Closed
roshankern opened this issue Jun 1, 2022 · 5 comments

Comments

@roshankern

roshankern commented Jun 1, 2022

Thanks to @pwalczysko's great help with #158, I am able to use the Aspera download client to download well data for idr0013 in the form of .ch5 files. From what I have read, the CH5 (CellH5) format is quite outdated and does not have much support. I have been able to open the files in ImageJ with the Bio-Formats plugin, so the data is readable. However, when I try to install the CellH5 Python library I run into the same issue described in CellH5/cellh5#14. My attempts to load a CH5 file with python-bioformats result in the error:
OSError: [Errno 22] Could not load the file as an image (see log for details).

What is the best way to either:

  1. Open the CH5 files in Python and get the image data for manipulation
  2. Use the Aspera download client to download the idr0013 well data in a different format (e.g. TIFF)

Thanks!

CC @gwaygenomics to keep you in the loop

@sbesson
Member

sbesson commented Jun 2, 2022

@roshankern: indeed, the raw data for idr0013 (the MitoCheck study) is stored in the CellH5 format. Bio-Formats can read it because the format developers worked to add support for it (which is what enabled IDR to load it), but as far as I can tell there is no active work on this format.

Importantly, Aspera will only give you access to the data in its original submitted format. For this study, all we have is CellH5 data for each well (as discussed in #158), and we introduced our custom .screen file to aggregate wells into plates.

For 1, I would also naively have expected python-bioformats to be able to use the Bio-Formats reader to access the data. It might be worth raising your issue with a sample file against https://github.com/CellProfiler/python-bioformats/ in case the CellProfiler team, which maintains python-bioformats, has some insights about the issue.
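
While the python-bioformats problem is open, one fallback worth noting: CellH5 is an HDF5-based format, so a generic HDF5 reader such as h5py should at least be able to open a .ch5 file and show what it contains. The sketch below is a minimal illustration, not the CellH5 library's API. It assumes nothing about the internal layout and simply lists every dataset it finds; the helper names `iter_datasets` and `summarize_ch5` are hypothetical.

```python
def iter_datasets(group, prefix=""):
    """Yield (path, shape, dtype) for every dataset-like node under `group`.

    Works on any mapping-like container whose leaves expose `.shape`,
    which is how h5py.File, h5py.Group and h5py.Dataset behave.
    """
    for name, node in group.items():
        path = f"{prefix}/{name}"
        if hasattr(node, "shape"):
            # Dataset-like leaf: report where it lives and what it holds.
            yield path, node.shape, getattr(node, "dtype", None)
        elif hasattr(node, "items"):
            # Group-like node: descend into it.
            yield from iter_datasets(node, path)


def summarize_ch5(path):
    """Open a .ch5 file read-only and list its datasets (requires h5py)."""
    import h5py  # third-party; imported lazily so the walker has no dependencies

    with h5py.File(path, "r") as handle:
        return list(iter_datasets(handle))
```

Calling `summarize_ch5` on a downloaded well file returns the dataset paths with their shapes and dtypes. Once the path of the image dataset is known from that listing, it can be read into a NumPy array with ordinary h5py indexing.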

For 2, as mentioned above, we only have the original data in CellH5 format, so anything else would require some form of export. As you might be aware, we are actively using IDR to drive the OME-NGFF specification, and a subset of images and plates have been converted into the cloud-optimized OME-Zarr format: see https://idr.github.io/ome-ngff-samples/. Would this be something of interest for your use case? If so, it should be very easy to convert and upload a test plate from idr0013 to the public idr bucket on the EMBL-EBI Embassy object store as another example, and for you to evaluate whether you can re-use it. From there we could start the discussion of a mass conversion of the entire study.

@roshankern
Author

roshankern commented Jun 2, 2022

Thank you for the help @sbesson!

I reached out to the CellProfiler team (CellProfiler/python-bioformats#159), but would still be interested in downloading/using the idr0013 data in OME-Zarr format from the idr bucket. However, if we did a mass conversion of the entire study would it still be possible to use Aspera (or another high-performance transfer client) to download the data?

@sbesson
Member

sbesson commented Jun 6, 2022

> I reached out to the CellProfiler team (CellProfiler/python-bioformats#159), but would still be interested in downloading/using the idr0013 data in OME-Zarr format from the idr bucket.

Thanks for the interest. I converted the first plate from idr0013 using the 0.4 specification and uploaded it to the public idr bucket: see https://hms-dbmi.github.io/vizarr/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0013A/3451.zarr, and more generally https://idr.github.io/ome-ngff-samples/ for a catalog of the converted IDR OME-NGFF samples. Do you want to give this a try and see if this format would be useful for your use case?
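
For anyone evaluating the converted plate from Python, here is a minimal sketch of how the plate can be navigated using only the standard library. It assumes the bucket serves the Zarr metadata files over plain HTTPS, and it relies on the OME-NGFF 0.4 layout, where the plate's `.zattrs` lists wells under `plate.wells[].path` and each well's `.zattrs` lists its fields under `well.images[].path`. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

PLATE_URL = "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0013A/3451.zarr"


def fetch_attrs(group_url):
    """Download and parse the `.zattrs` JSON document of a Zarr group."""
    with urlopen(f"{group_url.rstrip('/')}/.zattrs") as response:
        return json.load(response)


def well_urls(plate_url, plate_attrs):
    """Map each well path (e.g. "A/1") to its URL, from plate metadata."""
    base = plate_url.rstrip("/")
    return {
        well["path"]: f"{base}/{well['path']}"
        for well in plate_attrs["plate"]["wells"]
    }


def field_urls(well_url, well_attrs):
    """Return the URLs of the field-of-view images listed in a well."""
    base = well_url.rstrip("/")
    return [f"{base}/{image['path']}" for image in well_attrs["well"]["images"]]
```

A typical walk would call `fetch_attrs(PLATE_URL)`, pass the result to `well_urls`, then call `fetch_attrs` and `field_urls` for each well of interest. The resulting image URLs are ordinary Zarr groups that a Zarr-aware library can open for the pixel data.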

> However, if we did a mass conversion of the entire study would it still be possible to use Aspera (or another high-performance transfer client) to download the data?

The sample plate above is hosted on the EMBL-EBI Embassy object store. The Aspera service gives access to the data stored on the EMBL-EBI NFS servers, so the two storage and download mechanisms are separate at the moment.

The mass conversion of an entire study is not something we have done so far, although we have talked about it several times internally. It raises several interesting questions, including in terms of storage & accessibility, which will need to be resolved together with our partners at EMBL-EBI providing the underlying infrastructure. In this context, it would be useful to hear about your experience accessing this data, e.g. what is the typical access speed when using aws or rclone directly from EMBL-EBI Embassy? And how does it compare with the original data access using Aspera?
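
One rough way to put a number on access speed from Python is to time a streaming read and report megabytes per second. The sketch below is only an illustration of that idea, not a substitute for measuring with aws or rclone themselves. The function names are hypothetical, and `measure_url` assumes the object is reachable over plain HTTPS.

```python
import time
from urllib.request import urlopen


def measure_throughput(stream, chunk_size=1 << 20, clock=time.perf_counter):
    """Read `stream` to the end; return (bytes read, seconds, MB per second)."""
    start = clock()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = clock() - start
    # Guard against a zero-length interval on very small reads.
    rate = total / 1e6 / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


def measure_url(url):
    """Time the download of a single object served over HTTP(S)."""
    with urlopen(url) as response:
        return measure_throughput(response)
```

A single small file says little about sustained speed, so it is more informative to run this over several chunk files from one plate and compare the aggregate rate with what the Aspera client reports for a similar volume of data.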

@gwaybio

gwaybio commented Jun 8, 2022

Thanks for your work supporting us @sbesson! It has been a real learning experience trying to wrangle publicly available data! Lots of challenges and opportunities, which I'm sure you're well aware of :)

I'll respond to many of your points below based on what @roshankern and I discussed. I'll also point you to this issue WayScience/mitocheck_data/issues/1, where we've outlined our decision process to pursue the aspera CH5 download option directly.

> Do you want to give this (OME-NGFF) a try and see if this format would be useful for your use case?

Roshan was successfully able to access and use the file format - but, given time constraints, this is no longer useful for us in the immediate term (see WayScience/mitocheck_data/issues/1)

> The sample plate above is hosted on the EMBL-EBI Embassy object store. The Aspera service gives access to the data stored on the EMBL-EBI NFS servers, so the two storage and download mechanisms are separate at the moment.

Roshan and I tried to figure out the implications of this, and we landed on this explanation: if you do pursue the file format conversion, the original CH5 data would still be available for download. Is this accurate?

> It raises several interesting questions, including in terms of storage & accessibility, which will need to be resolved together with our partners at EMBL-EBI providing the underlying infrastructure. In this context, it would be useful to hear about your experience accessing this data, e.g. what is the typical access speed when using aws or rclone directly from EMBL-EBI Embassy? And how does it compare with the original data access using Aspera?

I agree that all these questions are quite interesting... My lab intends to use IDR data heavily, so I am very much interested in helping to resolve these issues! I admire the IDR team's effort on this front, especially with regard to the emphasis on metadata and OME-NGFF.

We have not tested either aws or rclone for accessing data, as we have found Aspera to be working quite well (and fast!). We are particularly enjoying the -k flag, which avoids re-downloading files if a connection is interrupted mid-download.
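
For readers scripting their downloads, a minimal sketch of assembling an `ascp` call with the `-k` resume flag from Python follows. The thread does not say which resume level was used, so the default of `1` here is an assumption, as are the helper name `build_ascp_command` and the placeholder source and destination. Any other options your setup needs (key file, port, rate limit) would go in `extra_args`.

```python
def build_ascp_command(source, destination, resume_level=1, extra_args=()):
    """Build an argument list for the Aspera `ascp` client with resume enabled.

    `resume_level` is the value passed to `-k` (0 to 3); the default of 1 is
    an assumption, not something stated in this thread.
    """
    if resume_level not in (0, 1, 2, 3):
        raise ValueError("ascp resume level must be 0, 1, 2 or 3")
    return ["ascp", "-k", str(resume_level), *extra_args, source, destination]


# Placeholder endpoints: substitute the values from the IDR download instructions.
command = build_ascp_command("<user>@<host>:<remote path>", "<local directory>")
```

The list can be handed to `subprocess.run(command, check=True)`. Re-running the same command after an interruption lets `ascp` skip files that already arrived intact instead of transferring them again.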

@sbesson
Member

sbesson commented Jun 14, 2022

> If you do pursue the file format conversion, the original CH5 data would still be available for download. Is this accurate?

Yes, the transformation into a cloud-optimized format might happen for some IDR studies in the future, but for users the expectation is that the original data will remain available for download.

Assuming you guys have settled on using Aspera for downloading the raw data, I think the only remaining issue is on the python-bioformats side, so I'll close this issue.

@sbesson sbesson closed this as completed Jun 14, 2022