max hypercube sample size #118

Open

MLWave opened this issue Oct 12, 2018 · 4 comments

MLWave (Member) commented Oct 12, 2018

One problem is a large concentration of samples in a certain projection range. Graphs may look good for the non-concentrated samples, but lack structure for the large concentrations. It can also slow down clustering algorithms when 25,000+ samples land in a single hypercube.

I would like to be able to set a max_clustering_size. When a hypercube exceeds this size, it should be subdivided into smaller hypercubes. Preferably, this subdivision can be customized.

Currently, the only way to overcome this is to set a finer resolution or add another dimension to the projection; these settings, however, drastically alter the output graphs. Another partial workaround is to rank-order the projection (which puts an equal number of samples in each hypercube), but this stretches out, and thus obscures, legitimate clusters.

The reason I'd like a custom subdivision is that you may use an overlap percentage of 100% or more, and recursing on that makes my brain hurt too much. Though I can think of ways to implement this, I can't think of a solid API for adding custom mapping on hypercubes. At least, not until we make covers proper classes, like kmapper.cover.MultiScaleCubical(). Then max_sample_cluster and subcover=kmapper.cover.MultiScaleCubical() could be parameters for MultiScaleCubical.
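
To make the subdivision rule concrete, here is a rough standalone sketch for a 1-d lens (cover_1d, max_sample_cluster, and the recursion scheme are all hypothetical, not current kmapper API):

import numpy as np

def cover_1d(lens, idx=None, n_cubes=4, perc_overlap=0.1,
             max_sample_cluster=1000):
    # Yield original-row indices per hypercube; re-cover any
    # hypercube holding more than max_sample_cluster samples.
    if idx is None:
        idx = np.arange(len(lens))
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_cubes
    radius = width * (1 + perc_overlap) / 2.0
    centers = lo + width * (np.arange(n_cubes) + 0.5)
    for c in centers:
        mask = np.abs(lens - c) <= radius
        # only recurse if subdividing can actually shrink the cube
        if max_sample_cluster < mask.sum() < len(lens):
            yield from cover_1d(lens[mask], idx[mask], n_cubes,
                                perc_overlap, max_sample_cluster)
        else:
            yield idx[mask]

# hypercubes = list(cover_1d(my_lens))  # my_lens: 1-d numpy array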

sauln (Member) commented Oct 12, 2018

I like this idea and believe making a separate cover class would be a good approach.

One thought I had was to adapt something like a Barnes-Hut tree to have an overlap parameter.
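Back-of-the-envelope, for a 1-d lens (all names here are made up; real Barnes-Hut trees live in 2-d/3-d and don't overlap, so this is just the capacity-driven splitting idea plus an overlap fraction):

import numpy as np

def overlap_tree(lens, lo, hi, capacity=1000, perc_overlap=0.1):
    # Halve an interval while it holds more than `capacity` samples,
    # then widen each leaf by the overlap fraction when collecting
    # members, so neighboring leaves share points.
    inside = int(((lens >= lo) & (lens <= hi)).sum())
    if inside <= capacity or hi <= lo:  # no guard for point masses
        pad = (hi - lo) * perc_overlap / 2.0
        yield np.where((lens >= lo - pad) & (lens <= hi + pad))[0]
    else:
        mid = (lo + hi) / 2.0
        yield from overlap_tree(lens, lo, mid, capacity, perc_overlap)
        yield from overlap_tree(lens, mid, hi, capacity, perc_overlap)

# leaves = list(overlap_tree(lens, lens.min(), lens.max()))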

What do you mean by "not until we make those proper classes"?

MLWave (Member Author) commented Oct 12, 2018

Adding it to mapper with flat parameters feels cluttered:

mapper.map(
    n_cubes=4, 
    perc_overlap=0.1,
    max_sample_cluster=1000, 
    sub_n_cubes=2, 
    sub_perc_overlap=0.2)

Instead, I am thinking of (but haven't fleshed out yet) something like:

mapper.map(cover=km.cover.MultiScaleCubical(n_cubes=[4,10], perc_overlap=[0.5, 0.5]))

and either adding subcover parameters to .MultiScaleCubical or .map, or perhaps separate classes RecursiveCubical and RecursiveMultiScaleCubical?
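
Roughly, usage could then look like this (RecursiveCubical, max_sample_cluster, and subcover are all hypothetical names, nothing that exists yet):

mapper.map(lens, X,
    cover=km.cover.RecursiveCubical(
        n_cubes=4,
        perc_overlap=0.1,
        max_sample_cluster=1000,
        subcover=km.cover.MultiScaleCubical(
            n_cubes=[4, 10], perc_overlap=[0.5, 0.5])))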

I really like the Barnes-Hut idea. If I understand correctly, the subdivision is automatically driven by the number of samples in each split. That seems smarter than a simple sub-mapping with evenly spaced, fat partitions.

sauln (Member) commented Oct 13, 2018

Have you seen the newly supported interface for covers?

mapper.map(lens, X,
    cover=CubicalCover(n_cubes=10, perc_overlap=0.1)
)

I'd like to phase out the n_cubes and perc_overlap parameters in map as much as possible so we can adopt other more interesting covers, like the one you've suggested.

You're right about the Barnes-Hut. We would probably have to build a custom implementation so we can specify how many samples we want per hypercube.

Also, in the example code you posted, are you setting a different number of cubes and overlap for each dimension of the lens? If so, that feature should be supported in the normal cover.

mapper.map(lens, X,
   cover=cover.CubicalCover(n_cubes=[5, 10], perc_overlap=[0.2, 0.4])
)

MLWave (Member Author) commented Oct 28, 2018

Thanks, I totally missed that KeplerMapper supports that! Such a subtle implementation ;).

So I made a km.cover.RecursiveCubicalCover(), and it works (kinda). The problem is the overlap: it quickly becomes overlap-of-overlap and causes an edge explosion (entire hypercubes may sit inside the overlap between two larger hypercubes).

I think a solution may be to create subgraphs: either attach them to the node that exceeded max_samples_cluster, or show them separately as their own graphs. This would require tighter integration with .map(), though.
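
As a post-hoc version of that idea, something along these lines seems doable with the current cover interface (the max_samples_cluster threshold and the CubicalCover settings below are just placeholders; attaching the subgraphs back to their parent node is the part that would need .map() support):

import numpy as np
import kmapper as km
from kmapper.cover import CubicalCover
from sklearn import datasets

X, _ = datasets.make_circles(n_samples=5000, noise=0.03, factor=0.3)

mapper = km.KeplerMapper()
lens = mapper.fit_transform(X)
graph = mapper.map(lens, X,
                   cover=CubicalCover(n_cubes=10, perc_overlap=0.1))

max_samples_cluster = 200
subgraphs = {}
for node_id, members in graph["nodes"].items():
    if len(members) > max_samples_cluster:
        members = np.array(members)
        # re-run Mapper on just this node's members, producing a
        # subgraph keyed by the oversized parent node
        subgraphs[node_id] = mapper.map(
            lens[members], X[members],
            cover=CubicalCover(n_cubes=10, perc_overlap=0.1))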

