
Conversation

@eauel commented Mar 23, 2019

Hi @alimanfoo,

I have been working on adding the Dask implementations of SVD and randomized SVD to scikit-allel, and I wanted to post my current progress on here for feedback and suggestions.

This PR adds the ability to use the Dask versions of these functions whenever the input is a Dask array (e.g. DaskGenotypeArray), falling back to the NumPy SVD functions otherwise (e.g. for ChunkedGenotypeArray or GenotypeArray).

cc @ebegoli

Eric Auel added 2 commits March 6, 2019 20:34
preprocessing.py:  Re-add the asarray_ndim calls to ensure input is either a dask or numpy array
util.py: add checks to asarray_ndim to ensure that dask arrays are not converted to numpy arrays
eauel commented Mar 23, 2019

So far I have tested these changes on a 4-node Dask distributed system, and they appear to work properly. One issue I have noticed, however, shows up with larger data sets. If I try to include more than 65536 variants in the data set for PCA, I get an error from the underlying Dask tsqr() function. I believe this is because the function expects a single chunk along the variants dimension: the chunk size I'm using on the transposed array is (64, 65536), so including more variants than this produces more than one chunk in that dimension, causing the error.

One workaround I found for this is to rechunk the Dask array such that all variants are included in a single chunk and only chunk in the samples dimension. Not sure if this is the best way to resolve the issue though.
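The workaround described above can be sketched as follows (the shapes and chunk sizes here are illustrative, not taken from the PR):

```python
import numpy as np
import dask.array as da

# Hypothetical transposed layout: 64 samples x ~100k variants, with the
# variants dimension split into chunks of 65536 (the failing case).
x = da.from_array(np.random.rand(64, 100_000), chunks=(64, 65_536))
assert len(x.chunks[1]) == 2  # two chunks along the variants dimension

# Workaround: collapse all variants into a single chunk and chunk only
# along the samples dimension (block size 8 chosen arbitrarily here).
x = x.rechunk({1: -1, 0: 8})
assert len(x.chunks[1]) == 1  # variants now in a single chunk
```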

@alimanfoo commented

Hi @eauel, very cool to see this, thanks for making the PR.

Unfortunately I'm about to go offline for a couple of weeks and so can't follow up in detail, but hopefully we can catch up when I'm back online.

cc @mrocklin - thought you might be interested to see this. If you have any bandwidth, there are a couple of points where your advice would be much appreciated.

The first is regarding the chunking that dask's SVD requires, see @eauel's comment above - does it require input data to be rechunked in any way?

The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.

@mrocklin left a comment

The first is regarding the chunking that dask's SVD requires, see @eauel's comment above - does it require input data to be rechunked in any way?

Yes, the most efficient algorithm would prefer to have all columns in one chunk. You might want to do something like the following:

x = x.rechunk({1: -1, 0: 'auto'})

Chunks of size (64, 64k) seem smallish to me, especially if the dtype is compact.
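As a concrete sketch of the preferred layout (shapes are illustrative only), a tall-and-skinny array chunked only along the rows satisfies what the tsqr-based SVD expects:

```python
import dask.array as da

# Tall-and-skinny: rows split into chunks, all 64 columns in one chunk.
x = da.random.random((100_000, 64), chunks=(10_000, 64))

# Lazy SVD; tsqr handles the row chunks because the column
# dimension is a single chunk.
u, s, v = da.linalg.svd(x)
assert s.compute().shape == (64,)
```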

The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.

Perhaps. Dask array supports the NEP-18 __array_function__ protocol in master, but not yet in the latest release. Similarly, NumPy doesn't have it on by default yet (you have to set an environment variable). As long as you're comfortable with this feature only working on recent releases, then yes, this seems cleaner; it depends on what you're looking for.

Note that I think only np.linalg.svd respects this protocol, not the scipy version.
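A minimal sketch of what the NEP-18 route looks like, assuming versions of NumPy and Dask recent enough that the __array_function__ protocol is enabled:

```python
import numpy as np
import dask.array as da

x = da.random.random((1_000, 10), chunks=(100, 10))

# With the protocol active, np.linalg.svd dispatches to Dask's
# implementation and returns lazy dask arrays rather than eagerly
# materializing a NumPy array.
u, s, v = np.linalg.svd(x)
assert isinstance(s, da.Array)
```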

self.std_ = da.std(gn, axis=1, keepdims=True)
else:
self.mean_ = np.mean(gn, axis=1, keepdims=True)
self.std_ = np.std(gn, axis=1, keepdims=True)


I would just use gn.mean(...) which should work in both cases.
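For example, the method form works unchanged on either array type (the arrays below are stand-ins for the genotype data in the PR):

```python
import numpy as np
import dask.array as da

gn_np = np.arange(12, dtype=float).reshape(3, 4)
gn_da = da.from_array(gn_np, chunks=(3, 2))

# The same method call works on both; the Dask result stays lazy
# until .compute() is called.
mean_np = gn_np.mean(axis=1, keepdims=True)
mean_da = gn_da.mean(axis=1, keepdims=True)
assert np.allclose(mean_np, mean_da.compute())
```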

kwargs.pop('copy', False)
a = da.array(a, **kwargs)
else:
a = np.array(a, **kwargs)


I would raise an issue with Dask asking them to support this keyword.
