ramp color by kmeans clustering result #1524

Open
djfan opened this issue Jan 30, 2020 · 3 comments

Comments

@djfan
Contributor

djfan commented Jan 30, 2020

If the distribution of the target feature looks like this:
Screen Shot 2020-01-30 at 2 44 01 PM

Commonly, we want to assign colors based on breaks like these:

Screen Shot 2020-01-30 at 2 45 14 PM

There are a few ways to do this (e.g. inspect the distribution plot, then work out the percentiles of the breaks).

But it would be much more convenient and accurate to apply KMeans (or another clustering method) to find the breaks.

Currently, this takes several lines of code and some value transformations. I hope it can be included as one of the built-in color ramp options, alongside Quantile, EqualInterval, StdDev, etc.
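
To make the request concrete, here is a minimal sketch of how breaks could be derived from a 1-D KMeans fit. The helper name `kmeans_breaks` and the sample values are hypothetical, chosen only for illustration; this is not an existing library function. It relies on the fact that, in one dimension, the boundary between two neighboring clusters is the midpoint of their centers.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_breaks(values, n_bins, random_state=0):
    """Return n_bins - 1 break values separating 1-D k-means clusters.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=random_state).fit(x)
    centers = np.sort(km.cluster_centers_.ravel())
    # Each point belongs to its nearest center, so in 1-D the boundary
    # between neighboring clusters is the midpoint of their centers.
    return (centers[:-1] + centers[1:]) / 2


# Made-up sample values with a long right tail.
values = [1, 2, 3, 50, 52, 55, 400, 410, 1000]
breaks = kmeans_breaks(values, n_bins=4)

# Ordered bin index per value: 0 is the lowest bin.
bins = np.digitize(values, breaks)
```

Because the breaks are sorted, the resulting bin indices are already ordered by value, so no relabeling step is needed. The breaks could then be handed to any ramp that accepts explicit break values, assuming one is available.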

cc @andy-esch

@djfan
Contributor Author

djfan commented Jan 30, 2020

For example,

from sklearn.cluster import KMeans

# Run KMeans on the target feature: confirmedCount
color_kmeans = KMeans(n_clusters=7)
data_province['color'] = color_kmeans.fit_predict(data_province['confirmedCount'].values.reshape(-1, 1))

# The cluster labels returned by KMeans are not ordered by the target feature values.
# The following step relabels the clusters by ascending mean, which guarantees that
# feature values in cluster 0 < cluster 1 < cluster 2 < 3 < ...
cluster_means = data_province.groupby('color')['confirmedCount'].mean().sort_values()
color_dict = {label: rank for rank, label in enumerate(cluster_means.index)}
data_province['color'] = data_province['color'].apply(lambda x: str(color_dict[x]))

# Rename each cluster (0, 1, 2, 3, ...) as the [min, max] of the values it contains.
range_dict = data_province.groupby('color')['confirmedCount'].apply(lambda s: str([s.min(), s.max()])).to_dict()
data_province['range'] = data_province['color'].map(range_dict)

# Sort so the legend items appear in order with `color_category_legend`
# (the continuous legend doesn't work as expected).
# The labels are strings, so this ordering holds only for fewer than 10 clusters.
data_province = data_province.sort_values(by='color')

Screen Shot 2020-01-30 at 3 11 05 PM

@djfan
Contributor Author

djfan commented Feb 13, 2020

It seems that pygeoda now supports NaturalBreaks.

@djfan
Contributor Author

djfan commented Feb 28, 2020
