Add non-geoscience example datasets #172

Open
dcherian opened this issue Jun 2, 2023 · 11 comments
@dcherian
Contributor

dcherian commented Jun 2, 2023

@rsatapat

rsatapat commented Feb 26, 2024

Xarray is a great tool for neuroscience research, since we typically gather data involving multiple dimensions (trials, days, animals, conditions, etc.).
The Allen Institute provides an SDK for reading and processing such data, along with an "observatory" that contains relevant data (https://allensdk.readthedocs.io/en/latest/).
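
For illustration, here is a minimal sketch of how such multi-dimensional recordings might be labeled in Xarray. The values are random synthetic numbers, and the dimension and coordinate names (`animal`, `day`, `condition`, `trial`) are assumptions made up for this example; nothing here uses the Allen SDK.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Synthetic stand-in for recorded responses (not real data):
# 3 animals x 2 days x 2 conditions x 5 trials
responses = xr.DataArray(
    rng.normal(size=(3, 2, 2, 5)),
    dims=("animal", "day", "condition", "trial"),
    coords={
        "animal": ["m1", "m2", "m3"],  # hypothetical animal IDs
        "day": [1, 2],
        "condition": ["control", "stimulus"],
    },
    name="response",
)

# Average over trials by dimension name, then select by label
trial_mean = responses.mean("trial")
stimulus_day1 = trial_mean.sel(condition="stimulus", day=1)
print(stimulus_day1.dims)
```

The point of the sketch is that the reduction and the selection refer to dimension names and labels rather than to axis positions.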

@negin513
Contributor

negin513 commented May 17, 2024

Hello @rsatapat, can we add a subset of the data to xarray-data for future tutorials? Are there any concerns about adding a subset of the data for tutorials?

@negin513
Contributor

Relevant content from @jsiegle: https://xarray.dev/blog/xarray-for-neurophysiology

@scottyhq scottyhq pinned this issue May 17, 2024
@scottyhq
Contributor

scottyhq commented May 17, 2024

Just keeping a list of some other examples here

Already using Xarray:

Would require modification to use xarray instead of numpy or custom objects:

@scottyhq
Contributor

scottyhq commented Jun 6, 2024

It would be interesting to modify some of these examples to see whether Xarray works well in place of plain NumPy arrays: https://numpy.org/numpy-tutorials/ ... it's also an excellent repository overall.

@scottyhq
Contributor

scottyhq commented Jun 7, 2024

Brainstormed a bit more on this today with @TomNicholas. There are really two separate things to accomplish:

  1. Just highlight (visually) a few non-geoscience example data structures in the tutorial and Xarray docs to make it clear that Xarray is flexible and relevant to different domains. So from the genomic surveillance example above:
    1. "a set of genotype calls obtained from sequencing some mosquitoes. These data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (variants) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (samples), and a third dimension corresponds to the number of genomes within each individual (ploidy)." :
image

Note: On one hand it's nice to reuse the existing graphic and actual dataset, but we could simplify even further by reducing the size, adding dimension labels to the image on the left, dropping "alleles", and running set_index() on the DataArray on the right so the two match up easily!
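
As a rough sketch of what a reduced version of that data structure could look like, the snippet below builds a tiny 3-dimensional array of genotype calls with the three dimensions from the quoted description. The call values, genome positions, and sample names are all invented for illustration and are not taken from the actual dataset.

```python
import numpy as np
import xarray as xr

# Made-up genotype calls: 0 = reference allele, 1 = alternate allele.
# Shape is (4 variants, 3 samples, 2 ploidy).
calls = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 1], [0, 0], [0, 0]],
        [[1, 1], [0, 1], [0, 0]],
        [[0, 0], [0, 0], [0, 1]],
    ],
    dtype="int8",
)

genotypes = xr.DataArray(
    calls,
    dims=("variants", "samples", "ploidy"),
    coords={
        "variants": [1001, 1045, 2310, 2988],  # hypothetical genome positions
        "samples": ["mosq_A", "mosq_B", "mosq_C"],  # hypothetical sample IDs
    },
    name="genotype_calls",
)

# Select one mosquito by label instead of by integer position
one_sample = genotypes.sel(samples="mosq_B")

# Count alternate alleles per variant and sample by summing over ploidy
alt_count = genotypes.sum("ploidy")
print(alt_count.sel(variants=1001, samples="mosq_C").item())
```

An array this small would be easy to draw next to its repr, with each axis of the picture labeled by the matching dimension name.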

  2. Bespoke formats (text or binary) are pervasive (not HDF, Zarr, netCDF, TIF). It would be great to add an example that coerces such a format into Xarray and does a simple, useful visualization or computation.
    1. NumPy .npz files + metadata, which can be opened into Xarray variables easily. Many people still use .npz, but which example in the wild should we use?
    2. A collection of X-ray images could work (https://numpy.org/numpy-tutorials/content/tutorial-x-ray-image-processing.html), but to be really useful we would want to illustrate labeling (and ultimately selection) by physical coordinates, so we would have to invent some (patientID, x_distance (mm)).
      1. This would segue nicely into the docs on building a custom backend: https://tutorial.xarray.dev/advanced/backends/backends.html
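
A minimal sketch of the .npz idea, assuming a made-up file layout: the array names (`images`, `patient_ids`, `pixel_size_mm`, `metadata`), the helper `npz_to_dataarray`, and the invented physical coordinates are all hypothetical. The file is written to an in-memory buffer so the snippet is self-contained; a real example would point at a file on disk.

```python
import io
import json

import numpy as np
import xarray as xr

# Stand-in for a bespoke file: a small .npz written to an in-memory buffer
buffer = io.BytesIO()
np.savez(
    buffer,
    images=np.arange(2 * 3 * 4, dtype="float32").reshape(2, 3, 4),
    patient_ids=np.array(["p01", "p02"]),
    pixel_size_mm=np.array(0.5),
    metadata=np.array(json.dumps({"modality": "x-ray"})),
)
buffer.seek(0)


def npz_to_dataarray(file):
    """Hypothetical helper: read the arrays and attach labeled coordinates."""
    with np.load(file) as npz:
        images = npz["images"]
        patient_ids = npz["patient_ids"]
        pixel_size = float(npz["pixel_size_mm"])
        attrs = json.loads(npz["metadata"].item())
    _, ny, nx = images.shape
    return xr.DataArray(
        images,
        dims=("patient_id", "y_mm", "x_mm"),
        coords={
            "patient_id": patient_ids,
            "y_mm": np.arange(ny) * pixel_size,  # invented physical coordinates
            "x_mm": np.arange(nx) * pixel_size,
        },
        attrs=attrs,
        name="intensity",
    )


xray = npz_to_dataarray(buffer)

# Select by patient label and by physical distance rather than pixel index
column = xray.sel(patient_id="p02").sel(x_mm=1.0, method="nearest")
print(column.values)
```

A function like this is roughly the piece that a custom backend would wrap, which is why the example could lead into the backend docs.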

@dcherian
Contributor Author

dcherian commented Jun 7, 2024

https://docs.google.com/forms/d/1x9bOIelnUsDMyI1tF4bN7TWK0v4nBDiwhpxh9mi6PaI/edit#responses

One of the user survey responses specifically calls this out:

Examples with Astropy to read FITS files, using Astropy Tables

@scottyhq
Contributor

scottyhq commented Jun 7, 2024

Examples with Astropy to read FITS files, using Astropy Table

Some renewed activity in this repository that seems relevant! ratt-ru/xarray-fits#26

@TomNicholas
Member

@tomwhite mentioned that the sgkit file openers / converters are actually about to be deprecated in favour of a new package called bio2zarr. Basically their motivation is that the text-based VCF format etc. is so awfully-designed that efficient access via a kerchunk-like approach is basically impossible, so they end up having to convert it to zarr anyway.

@tomwhite

@tomwhite mentioned that the sgkit file openers / converters are actually about to be deprecated in favour of a new package called bio2zarr. Basically their motivation is that the text-based VCF format etc. is so awfully-designed that efficient access via a kerchunk-like approach is basically impossible, so they end up having to convert it to zarr anyway.

The VCF conversion code in sgkit and the new bio2zarr project both output the same Zarr format (specified here). The reason for bio2zarr is that users were struggling to get the Dask-based sgkit VCF conversion working reliably, so the code was rewritten as a command-line application that runs on multi-core local machines or HPC schedulers, and bio2zarr is the result.

There are a couple of example sgkit tutorials that may be of interest here: https://sgkit-dev.github.io/sgkit/latest/examples/index.html
