pandas.CategoricalDtype <-> digital_encoding (#35)
* DataStore (#1)

* repair and test subspace fallbacks
* allow error on graphviz
* find "get" when inside another function
* fix root_dims
* datastore class
* digital_encoding -> cat

* ruff

* fix name

* some docstrings

* pyupgrade

* tar_zst

* one remove-cell tag

* fix cell tags

* pd categorical in docs

* improve memory on Dataset.iat
jpn-- authored Apr 15, 2023
1 parent 5c6aecb commit 11cc7f1
Showing 28 changed files with 1,940 additions and 285 deletions.
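The mapping named in the commit title can be sketched with plain pandas: an ordered `Categorical`'s small-integer codes plus its label array carry the same information as a dictionary digital encoding. This is a minimal illustration of the idea, not sharrow's actual implementation:

```python
import numpy as np
import pandas as pd

# Build an ordered categorical, as the encoding walkthrough below does with pd.cut.
income_grp = pd.Series(
    pd.cut(
        [10_000, 45_000, 90_000],
        bins=[-np.inf, 30_000, 60_000, np.inf],
        labels=["Low", "Mid", "High"],
    )
)

# The "digital encoding" view: small integer codes plus a label dictionary.
codes = income_grp.cat.codes.to_numpy()            # int8 codes: 0, 1, 2
dictionary = income_grp.cat.categories.to_numpy()  # labels: 'Low', 'Mid', 'High'

# Round trip back to the pandas categorical dtype.
restored = pd.Series(
    pd.Categorical.from_codes(codes, categories=dictionary, ordered=True)
)
assert restored.equals(income_grp)
```

The round trip is lossless because `Categorical.from_codes` rebuilds the dtype entirely from the codes, the category labels, and the ordered flag.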
8 changes: 4 additions & 4 deletions .github/workflows/run-tests.yml
@@ -46,12 +46,12 @@ jobs:
run: |
conda info -a
conda list
- name: Lint with flake8
- name: Lint with Ruff
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
ruff check . --select=E9,F63,F7,F82 --statistics
# exit-zero treats all errors as warnings.
ruff check . --exit-zero --statistics
- name: Test with pytest
run: |
python -m pytest
11 changes: 6 additions & 5 deletions .pre-commit-config.yaml
@@ -13,6 +13,12 @@ repos:
hooks:
- id: nbstripout

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.0.257
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]

- repo: https://github.com/pycqa/isort
rev: 5.10.1
hooks:
@@ -23,8 +29,3 @@ repos:
rev: 22.10.0
hooks:
- id: black

- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
- id: flake8
8 changes: 4 additions & 4 deletions docs/_script/hide_test_cells.py
@@ -11,10 +11,10 @@

# Text to look for in adding tags
text_search_dict = {
"# TEST": "remove_cell", # Remove the whole cell
"# HIDDEN": "remove_cell", # Remove the whole cell
"# NO CODE": "remove_input", # Remove only the input
"# HIDE CODE": "hide_input", # Hide the input w/ a button to show
"# TEST": "remove-cell", # Remove the whole cell
"# HIDDEN": "remove-cell", # Remove the whole cell
"# NO CODE": "remove-input", # Remove only the input
"# HIDE CODE": "hide-input", # Hide the input w/ a button to show
}

# Search through each notebook and look for the text, add a tag if necessary
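The tag-adding pass in `hide_test_cells.py` can be sketched with plain dicts standing in for notebook cells (the real script reads and writes `.ipynb` files; the cell contents here are hypothetical). Note the hyphenated tag names the diff above switches to:

```python
# Markers searched for in cell source, mapped to the cell tag to add.
text_search_dict = {
    "# TEST": "remove-cell",      # remove the whole cell
    "# HIDDEN": "remove-cell",    # remove the whole cell
    "# NO CODE": "remove-input",  # remove only the input
    "# HIDE CODE": "hide-input",  # hide the input w/ a button to show
}

# Plain-dict stand-ins for notebook cells (hypothetical contents).
cells = [
    {"source": "# TEST\nassert x == 1", "metadata": {}},
    {"source": "x = 1", "metadata": {}},
]

# For each cell, add the matching tag if any marker appears in its source.
for cell in cells:
    for marker, tag in text_search_dict.items():
        if marker in cell["source"]:
            tags = cell["metadata"].setdefault("tags", [])
            if tag not in tags:
                tags.append(tag)

print(cells[0]["metadata"]["tags"])  # → ['remove-cell']
```

The hyphenated spellings matter because the notebook tooling that consumes these tags recognizes `remove-cell`, `remove-input`, and `hide-input`, not the underscored variants the diff removes.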
207 changes: 199 additions & 8 deletions docs/walkthrough/encoding.ipynb
@@ -16,7 +16,7 @@
"id": "f17c8818",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -30,7 +30,9 @@
"cell_type": "code",
"execution_count": null,
"id": "d4e7246c",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
@@ -217,7 +219,7 @@
"id": "2ad591bb",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -296,7 +298,7 @@
"id": "bd549b5e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -377,7 +379,7 @@
"id": "7d74e53e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -477,7 +479,7 @@
"id": "a016d30f",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -525,7 +527,7 @@
"id": "28afb335",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -690,6 +692,195 @@
" _name='WLK_LOC_WLK_FAR'\n",
").to_series() == [0,152,474]).all()"
]
},
{
"cell_type": "markdown",
"id": "cb219dc3-dd66-44cd-a7c5-2a1da4bc1467",
"metadata": {
"tags": []
},
"source": [
"# Pandas Categorical Dtype\n",
"\n",
"Dictionary encoding is very similar to the approach used for the pandas Categorical dtype, and\n",
"can be used to achieve some of the efficiencies of categorical data, even though xarray lacks\n",
"a formal native categorical data representation. Sharrow's `construct` function for creating\n",
"Dataset objects will automatically use dictionary encoding for \"category\" data. \n",
"\n",
"To demonstrate, we'll load some household data and create a categorical data column."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b765919-69a4-4fb0-b805-9d3b5fed7897",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh = sh.example_data.get_households()\n",
"hh[\"income_grp\"] = pd.cut(hh.income, bins=[-np.inf,30000,60000,np.inf], labels=['Low', \"Mid\", \"High\"])\n",
"hh = hh[[\"income\",\"income_grp\"]]\n",
"hh.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "312faa0b-13cf-4649-9835-7a53b5e81a0b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh.info()"
]
},
{
"cell_type": "markdown",
"id": "c51a88d2-02b1-4502-9f4b-271fbb126699",
"metadata": {},
"source": [
"We'll then create a Dataset using `construct`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd1c2fd5-59c6-48cb-bd6e-d2f9dde2aa36",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset = sh.dataset.construct(hh[[\"income\",\"income_grp\"]])\n",
"hh_dataset"
]
},
{
"cell_type": "markdown",
"id": "033b3629-a16b-47a4-bb18-10af9c7c4f07",
"metadata": {},
"source": [
"Note that the \"income\" variable remains an integer as expected, but the \"income_grp\" variable, \n",
"which had been a \"category\" dtype in pandas, is now stored as an `int8`, giving the \n",
"category _index_ of each element (it would be an `int16` or larger if needed, but that's\n",
"not necessary with only 3 categories). The information about the labels for the categories is \n",
"retained not in the data itself but in the `digital_encoding`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "369442af-1c69-41eb-b530-ea398d6eac7a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset[\"income_grp\"].digital_encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58db6505-1c90-475e-8d91-0e2e89ec0f0e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TESTING\n",
"assert hh_dataset[\"income_grp\"].dtype == \"int8\"\n",
"assert hh_dataset[\"income_grp\"].digital_encoding.keys() == {'dictionary', 'ordered'}\n",
"assert all(hh_dataset[\"income_grp\"].digital_encoding['dictionary'] == np.array(['Low', 'Mid', 'High'], dtype='<U4'))\n",
"assert hh_dataset[\"income_grp\"].digital_encoding['ordered'] is True"
]
},
{
"cell_type": "markdown",
"id": "38f8a6c9-4bca-4e73-82b0-e0996814565a",
"metadata": {},
"source": [
"If you try to make the return trip to a pandas DataFrame using the regular \n",
"`xarray.Dataset.to_pandas()` method, the details of the categorical nature\n",
"of this variable are lost, and only the int8 index is available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad2f8677-b02c-4d28-892f-3272804bf714",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset.to_pandas()"
]
},
{
"cell_type": "markdown",
"id": "95f26992-241f-4982-aa56-f1055a35f969",
"metadata": {},
"source": [
"But if you use the `single_dim` accessor on the dataset provided by sharrow,\n",
"the categorical dtype is restored correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43b94637-c025-4578-9e97-4b6484cb231e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset.single_dim.to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1b86a4b-c19f-4ae9-8a8b-395f53209bc6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TESTING\n",
"pd.testing.assert_frame_equal(\n",
" hh_dataset.single_dim.to_pandas(),\n",
" hh\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5e66c294-747d-414c-8e03-b8b551a0e2a9",
"metadata": {},
"source": [
"Note that this automatic handling of categorical data only applies when constructing\n",
"or deconstructing a dataset with a single dimension (i.e. the `index` is not a MultiIndex).\n",
"Multidimensional datasets use the normal xarray processing, which dumps string\n",
"categoricals back into Python objects, a costly representation for high-performance applications."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2208108a-d051-4941-ad72-b89a05169d81",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sh.dataset.construct(\n",
" hh[[\"income\",\"income_grp\"]].reset_index().set_index([\"HHID\", \"income\"])\n",
")"
]
}
],
"metadata": {
@@ -708,7 +899,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
"version": "3.10.9"
},
"toc": {
"base_numbering": 1,
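The restoration that `single_dim.to_pandas()` performs in the notebook above can be sketched with plain pandas, assuming the `dictionary` and `ordered` values come from the variable's `digital_encoding`. This is a sketch of the idea, not sharrow's code:

```python
import numpy as np
import pandas as pd

# Values as stored in the Dataset, plus the metadata kept in digital_encoding.
codes = np.array([0, 2, 1, 0], dtype="int8")   # encoded data
dictionary = np.array(["Low", "Mid", "High"])  # category labels
ordered = True

# Rebuild the pandas categorical column from codes + labels + ordered flag.
income_grp = pd.Series(
    pd.Categorical.from_codes(codes, categories=dictionary, ordered=ordered)
)
print(list(income_grp))  # → ['Low', 'High', 'Mid', 'Low']
```

This is why the plain `to_pandas()` round trip is lossy: without the `dictionary` and `ordered` entries, the `int8` codes alone cannot reconstruct the categorical dtype.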
15 changes: 6 additions & 9 deletions docs/walkthrough/two-dim.ipynb
@@ -32,7 +32,7 @@
"id": "1f409525",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -288,8 +288,7 @@
"id": "7b0a4e36",
"metadata": {
"tags": [
"remove-cell",
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -523,8 +522,7 @@
"id": "03bb0e22",
"metadata": {
"tags": [
"remove-cell",
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -760,8 +758,7 @@
"id": "dfd6d42a",
"metadata": {
"tags": [
"remove-cell",
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -910,7 +907,7 @@
"id": "e26a7f58",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -938,7 +935,7 @@
"id": "4a4fb08e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
2 changes: 1 addition & 1 deletion envs/development.yml
@@ -8,7 +8,7 @@ dependencies:
# required for testing
- dask
- filelock
- flake8
- ruff
- jupyter
- larch>=5.7.1
- nbmake
2 changes: 1 addition & 1 deletion envs/testing.yml
@@ -14,7 +14,7 @@ dependencies:
- numexpr
- sparse
- filelock
- flake8
- ruff
# required for testing
- pytest
- pytest-cov