pandas.CategoricalDtype <-> digital_encoding (#35)
* DataStore (#1)

* repair and test subspace fallbacks
* allow error on graphviz
* find "get" when inside another function
* fix root_dims
* datastore class
* digital_encoding -> cat

* ruff

* fix name

* some docstrings

* pyupgrade

* tar_zst

* one remove-cell tag

* fix cell tags

* pd categorical in docs

* improve memory on Dataset.iat
jpn-- authored Apr 15, 2023
1 parent 5c6aecb commit 11cc7f1
Showing 28 changed files with 1,940 additions and 285 deletions.
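The mapping named in the commit title can be sketched with plain pandas: an ordered `Categorical`'s small-integer codes plus its label array carry the same information as a dictionary digital encoding. This is a minimal illustration of the idea, not sharrow's actual implementation:

```python
import numpy as np
import pandas as pd

# Build an ordered categorical, as the encoding walkthrough below does with pd.cut.
income_grp = pd.Series(
    pd.cut(
        [10_000, 45_000, 90_000],
        bins=[-np.inf, 30_000, 60_000, np.inf],
        labels=["Low", "Mid", "High"],
    )
)

# The "digital encoding" view: small integer codes plus a label dictionary.
codes = income_grp.cat.codes.to_numpy()            # int8 codes: 0, 1, 2
dictionary = income_grp.cat.categories.to_numpy()  # labels: 'Low', 'Mid', 'High'

# Round trip back to the pandas categorical dtype.
restored = pd.Series(
    pd.Categorical.from_codes(codes, categories=dictionary, ordered=True)
)
assert restored.equals(income_grp)
```

The round trip is lossless because `Categorical.from_codes` rebuilds the dtype entirely from the codes, the category labels, and the ordered flag.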
8 changes: 4 additions & 4 deletions .github/workflows/run-tests.yml
@@ -46,12 +46,12 @@ jobs:
run: |
conda info -a
conda list
- name: Lint with flake8
- name: Lint with Ruff
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
ruff check . --select=E9,F63,F7,F82 --statistics
# exit-zero treats all errors as warnings.
ruff check . --exit-zero --statistics
- name: Test with pytest
run: |
python -m pytest
11 changes: 6 additions & 5 deletions .pre-commit-config.yaml
@@ -13,6 +13,12 @@ repos:
hooks:
- id: nbstripout

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.0.257
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]

- repo: https://github.com/pycqa/isort
rev: 5.10.1
hooks:
@@ -23,8 +29,3 @@ repos:
rev: 22.10.0
hooks:
- id: black

- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
- id: flake8
8 changes: 4 additions & 4 deletions docs/_script/hide_test_cells.py
@@ -11,10 +11,10 @@

# Text to look for in adding tags
text_search_dict = {
"# TEST": "remove_cell", # Remove the whole cell
"# HIDDEN": "remove_cell", # Remove the whole cell
"# NO CODE": "remove_input", # Remove only the input
"# HIDE CODE": "hide_input", # Hide the input w/ a button to show
"# TEST": "remove-cell", # Remove the whole cell
"# HIDDEN": "remove-cell", # Remove the whole cell
"# NO CODE": "remove-input", # Remove only the input
"# HIDE CODE": "hide-input", # Hide the input w/ a button to show
}

# Search through each notebook and look for the text, add a tag if necessary
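The tag-adding pass in `hide_test_cells.py` can be sketched with plain dicts standing in for notebook cells (the real script reads and writes `.ipynb` files; the cell contents here are hypothetical). Note the hyphenated tag names the diff above switches to:

```python
# Markers searched for in cell source, mapped to the cell tag to add.
text_search_dict = {
    "# TEST": "remove-cell",      # remove the whole cell
    "# HIDDEN": "remove-cell",    # remove the whole cell
    "# NO CODE": "remove-input",  # remove only the input
    "# HIDE CODE": "hide-input",  # hide the input w/ a button to show
}

# Plain-dict stand-ins for notebook cells (hypothetical contents).
cells = [
    {"source": "# TEST\nassert x == 1", "metadata": {}},
    {"source": "x = 1", "metadata": {}},
]

# For each cell, add the matching tag if any marker appears in its source.
for cell in cells:
    for marker, tag in text_search_dict.items():
        if marker in cell["source"]:
            tags = cell["metadata"].setdefault("tags", [])
            if tag not in tags:
                tags.append(tag)

print(cells[0]["metadata"]["tags"])  # → ['remove-cell']
```

The hyphenated spellings matter because the notebook tooling that consumes these tags recognizes `remove-cell`, `remove-input`, and `hide-input`, not the underscored variants the diff removes.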
207 changes: 199 additions & 8 deletions docs/walkthrough/encoding.ipynb
@@ -16,7 +16,7 @@
"id": "f17c8818",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -30,7 +30,9 @@
"cell_type": "code",
"execution_count": null,
"id": "d4e7246c",
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import numpy as np\n",
@@ -217,7 +219,7 @@
"id": "2ad591bb",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -296,7 +298,7 @@
"id": "bd549b5e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -377,7 +379,7 @@
"id": "7d74e53e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -477,7 +479,7 @@
"id": "a016d30f",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -525,7 +527,7 @@
"id": "28afb335",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -690,6 +692,195 @@
" _name='WLK_LOC_WLK_FAR'\n",
").to_series() == [0,152,474]).all()"
]
},
{
"cell_type": "markdown",
"id": "cb219dc3-dd66-44cd-a7c5-2a1da4bc1467",
"metadata": {
"tags": []
},
"source": [
"# Pandas Categorical Dtype\n",
"\n",
"Dictionary encoding is very similar to the approach used for the pandas Categorical dtype, and\n",
"can be used to achieve some of the efficiencies of categorical data, even though xarray lacks\n",
"a formal native categorical data representation. Sharrow's `construct` function for creating\n",
"Dataset objects will automatically use dictionary encoding for \"category\" data. \n",
"\n",
"To demonstrate, we'll load some household data and create a categorical data column."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b765919-69a4-4fb0-b805-9d3b5fed7897",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh = sh.example_data.get_households()\n",
"hh[\"income_grp\"] = pd.cut(hh.income, bins=[-np.inf,30000,60000,np.inf], labels=['Low', \"Mid\", \"High\"])\n",
"hh = hh[[\"income\",\"income_grp\"]]\n",
"hh.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "312faa0b-13cf-4649-9835-7a53b5e81a0b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh.info()"
]
},
{
"cell_type": "markdown",
"id": "c51a88d2-02b1-4502-9f4b-271fbb126699",
"metadata": {},
"source": [
"We'll then create a Dataset using `construct`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd1c2fd5-59c6-48cb-bd6e-d2f9dde2aa36",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset = sh.dataset.construct(hh[[\"income\",\"income_grp\"]])\n",
"hh_dataset"
]
},
{
"cell_type": "markdown",
"id": "033b3629-a16b-47a4-bb18-10af9c7c4f07",
"metadata": {},
"source": [
"Note that the \"income\" variable remains an integer as expected, but the \"income_grp\" variable, \n",
"which had been a \"category\" dtype in pandas, is now stored as an `int8`, giving the \n",
"category _index_ of each element (it would be an `int16` or larger if needed, but that's\n",
"not necessary with only 3 categories). The information about the labels for the categories is \n",
"retained not in the data itself but in the `digital_encoding`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "369442af-1c69-41eb-b530-ea398d6eac7a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset[\"income_grp\"].digital_encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58db6505-1c90-475e-8d91-0e2e89ec0f0e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TESTING\n",
"assert hh_dataset[\"income_grp\"].dtype == \"int8\"\n",
"assert hh_dataset[\"income_grp\"].digital_encoding.keys() == {'dictionary', 'ordered'}\n",
"assert all(hh_dataset[\"income_grp\"].digital_encoding['dictionary'] == np.array(['Low', 'Mid', 'High'], dtype='<U4'))\n",
"assert hh_dataset[\"income_grp\"].digital_encoding['ordered'] is True"
]
},
{
"cell_type": "markdown",
"id": "38f8a6c9-4bca-4e73-82b0-e0996814565a",
"metadata": {},
"source": [
"If you try to make the return trip to a pandas DataFrame using the regular \n",
"`xarray.Dataset.to_pandas()` method, the details of the categorical nature\n",
"of this variable are lost, and only the int8 index is available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad2f8677-b02c-4d28-892f-3272804bf714",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset.to_pandas()"
]
},
{
"cell_type": "markdown",
"id": "95f26992-241f-4982-aa56-f1055a35f969",
"metadata": {},
"source": [
"But if you use the `single_dim` accessor on the dataset provided by sharrow,\n",
"the categorical dtype is restored correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43b94637-c025-4578-9e97-4b6484cb231e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"hh_dataset.single_dim.to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1b86a4b-c19f-4ae9-8a8b-395f53209bc6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# TESTING\n",
"pd.testing.assert_frame_equal(\n",
" hh_dataset.single_dim.to_pandas(),\n",
" hh\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5e66c294-747d-414c-8e03-b8b551a0e2a9",
"metadata": {},
"source": [
"Note that this automatic handling of categorical data only applies when constructing\n",
"or deconstructing a dataset with a single dimension (i.e. the `index` is not a MultiIndex).\n",
"Multidimensional datasets use the normal xarray processing, which dumps string\n",
"categoricals back into Python objects, a costly representation for high-performance applications."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2208108a-d051-4941-ad72-b89a05169d81",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sh.dataset.construct(\n",
" hh[[\"income\",\"income_grp\"]].reset_index().set_index([\"HHID\", \"income\"])\n",
")"
]
}
],
"metadata": {
@@ -708,7 +899,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
"version": "3.10.9"
},
"toc": {
"base_numbering": 1,
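The restoration that `single_dim.to_pandas()` performs in the notebook above can be sketched with plain pandas, assuming the `dictionary` and `ordered` values come from the variable's `digital_encoding`. This is a sketch of the idea, not sharrow's code:

```python
import numpy as np
import pandas as pd

# Values as stored in the Dataset, plus the metadata kept in digital_encoding.
codes = np.array([0, 2, 1, 0], dtype="int8")   # encoded data
dictionary = np.array(["Low", "Mid", "High"])  # category labels
ordered = True

# Rebuild the pandas categorical column from codes + labels + ordered flag.
income_grp = pd.Series(
    pd.Categorical.from_codes(codes, categories=dictionary, ordered=ordered)
)
print(list(income_grp))  # → ['Low', 'High', 'Mid', 'Low']
```

This is why the plain `to_pandas()` round trip is lossy: without the `dictionary` and `ordered` entries, the `int8` codes alone cannot reconstruct the categorical dtype.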
15 changes: 6 additions & 9 deletions docs/walkthrough/two-dim.ipynb
@@ -32,7 +32,7 @@
"id": "1f409525",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -288,8 +288,7 @@
"id": "7b0a4e36",
"metadata": {
"tags": [
"remove-cell",
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -523,8 +522,7 @@
"id": "03bb0e22",
"metadata": {
"tags": [
"remove-cell",
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -760,8 +758,7 @@
"id": "dfd6d42a",
"metadata": {
"tags": [
"remove-cell",
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -910,7 +907,7 @@
"id": "e26a7f58",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
@@ -938,7 +935,7 @@
"id": "4a4fb08e",
"metadata": {
"tags": [
"remove_cell"
"remove-cell"
]
},
"outputs": [],
2 changes: 1 addition & 1 deletion envs/development.yml
@@ -8,7 +8,7 @@ dependencies:
# required for testing
- dask
- filelock
- flake8
- ruff
- jupyter
- larch>=5.7.1
- nbmake
2 changes: 1 addition & 1 deletion envs/testing.yml
@@ -14,7 +14,7 @@ dependencies:
- numexpr
- sparse
- filelock
- flake8
- ruff
# required for testing
- pytest
- pytest-cov