Merge pull request #21 from BirkhoffG/readme

BirkhoffG · web-flow · commit e114fe95b184 · 2023-12-26T16:23:57.000+08:00
Update readme
diff --git a/README.md b/README.md
@@ -17,13 +17,13 @@ License](https://img.shields.io/github/license/BirkhoffG/jax-dataloader.svg)
 - **downloading and pre-processing datasets** via [huggingface
   datasets](https://github.com/huggingface/datasets), [pytorch
   Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset),
-  and tensorflow dataset (forthcoming)
+  and [tensorflow dataset](https://www.tensorflow.org/datasets);
 
 - **iteratively loading batches** via (vanilla) [jax
   dataloader](https://birkhoffg.github.io/jax-dataloader/core.html#jax-dataloader),
   [pytorch
-  dataloader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),
-  tensorflow (forthcoming), and merlin (forthcoming).
+  dataloader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
+  and [tensorflow dataset](https://www.tensorflow.org/datasets).
 
 A minimum `jax-dataloader` example:
 
@@ -57,9 +57,11 @@ pip install git+https://github.com/BirkhoffG/jax-dataloader.git
 > **Note**
 >
 > We will only install `jax`-related dependencies. If you wish to use
-> integration of `pytorch` or huggingface `datasets`, you should try to
-> manually install them, or run `pip install jax-dataloader[all]` for
-> installing all the dependencies.
+> integration of `pytorch`, huggingface `datasets`, or `tensorflow`, we
+> recommend manually installing those dependencies.
+>
+> You can also run `pip install jax-dataloader[all]` to install
+> everything (not recommended).
 
 </div>
 
@@ -68,16 +70,21 @@ pip install git+https://github.com/BirkhoffG/jax-dataloader.git
 [`jax_dataloader.core.DataLoader`](https://birkhoffg.github.io/jax-dataloader/core.html#dataloader)
 follows an API similar to the pytorch dataloader.
 
-- The `dataset` argument takes `jax_dataloader.core.Dataset` or
-  `torch.utils.data.Dataset` or (the huggingface) `datasets.Dataset` as
-  an input from which to load the data.
-- The `backend` argument takes `"jax"` or`"pytorch"` as an input, which
-  specifies which backend dataloader to use batches.
+- The `dataset` should be an instance of a subclass of
+  `jax_dataloader.core.Dataset`, `torch.utils.data.Dataset`, (the
+  huggingface) `datasets.Dataset`, or `tf.data.Dataset`.
+- The `backend` should be one of `"jax"`, `"pytorch"`, or
+  `"tensorflow"`. This argument specifies which backend dataloader is
+  used to load batches.
 
-``` python
-import jax_dataloader as jdl
-import jax.numpy as jnp
-```
+Note that not every dataset is compatible with every backend. See the
+compatibility table below:
+
+|                | `jdl.Dataset` | `torch_data.Dataset` | `tf.data.Dataset` | `datasets.Dataset` |
+|:---------------|:--------------|:---------------------|:------------------|:-------------------|
+| `"jax"`        | ✅            | ❌                   | ❌                | ✅                 |
+| `"pytorch"`    | ✅            | ✅                   | ❌                | ✅                 |
+| `"tensorflow"` | ✅            | ❌                   | ✅                | ✅                 |
 
 ### Using [`ArrayDataset`](https://birkhoffg.github.io/jax-dataloader/dataset.html#arraydataset)
 
@@ -94,7 +101,7 @@ y = jnp.arange(10)
 arr_ds = jdl.ArrayDataset(X, y)
 ```
 
-This `arr_ds` can be loaded by both `"jax"` and `"pytorch"` dataloaders.
+This `arr_ds` can be loaded by *every* backend.
 
 ``` python
 # Create a `DataLoader` from the `ArrayDataset` via jax backend
@@ -103,6 +110,31 @@ dataloader = jdl.DataLoader(arr_ds, 'jax', batch_size=5, shuffle=True)
 dataloader = jdl.DataLoader(arr_ds, 'pytorch', batch_size=5, shuffle=True)
 ```
 
+### Using Huggingface Datasets
+
+The huggingface [datasets](https://github.com/huggingface/datasets) is a
+modern library for downloading, pre-processing, and sharing datasets.
+`jax_dataloader` supports directly passing the huggingface datasets.
+
+``` python
+from datasets import load_dataset
+```
+
+For example, we load the `"squad"` dataset from `datasets`:
+
+``` python
+hf_ds = load_dataset("squad")
+```
+
+Then, we can use `jax_dataloader` to load batches of `hf_ds`.
+
+``` python
+# Create a `DataLoader` from the `datasets.Dataset` via jax backend
+dataloader = jdl.DataLoader(hf_ds['train'], 'jax', batch_size=5, shuffle=True)
+# Or we can use the pytorch backend
+dataloader = jdl.DataLoader(hf_ds['train'], 'pytorch', batch_size=5, shuffle=True)
+```
+
 ### Using Pytorch Datasets
 
 The [pytorch Dataset](https://pytorch.org/docs/stable/data.html) and its
@@ -147,27 +179,24 @@ This `pt_ds` can **only** be loaded via `"pytorch"` dataloaders.
 dataloader = jdl.DataLoader(pt_ds, 'pytorch', batch_size=5, shuffle=True)
 ```
 
-### Using Huggingface Datasets
+### Using Tensorflow Datasets
 
-The huggingface [datasets](https://github.com/huggingface/datasets) is a
-morden library for downloading, pre-processing, and sharing datasets.
-`jax_dataloader` supports directly passing the huggingface datasets.
+`jax_dataloader` supports directly passing the [tensorflow
+datasets](https://www.tensorflow.org/datasets).
 
 ``` python
-from datasets import load_dataset
+import tensorflow_datasets as tfds
+import tensorflow as tf
 ```
 
-For example, We load the `"squad"` dataset from `datasets`:
+For instance, we can load the MNIST dataset from `tensorflow_datasets`:
 
 ``` python
-hf_ds = load_dataset("squad")
+tf_ds = tfds.load('mnist', split='test', as_supervised=True)
 ```
 
-This `hf_ds` can be loaded via `"jax"` and `"pytorch"` dataloaders.
+and use `jax_dataloader` to iterate over the dataset.
 
 ``` python
-# Create a `DataLoader` from the `datasets.Dataset` via jax backend
-dataloader = jdl.DataLoader(hf_ds['train'], 'jax', batch_size=5, shuffle=True)
-# Or we can use the pytorch backend
-dataloader = jdl.DataLoader(hf_ds['train'], 'pytorch', batch_size=5, shuffle=True)
+dataloader = jdl.DataLoader(tf_ds, 'tensorflow', batch_size=5, shuffle=True)
 ```
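The compatibility table added to the README above can also be read as a plain lookup. The following is a minimal sketch of that reading, not code from `jax_dataloader`: the `COMPATIBILITY` dict and the helpers `is_compatible` and `backends_for` are hypothetical names, and the dataset kinds are written as the strings used in the table header.

```python
# Hypothetical encoding of the README compatibility table.
# Keys are backend names; values are the dataset kinds marked with a check mark.
COMPATIBILITY = {
    "jax": {"jdl.Dataset", "datasets.Dataset"},
    "pytorch": {"jdl.Dataset", "torch_data.Dataset", "datasets.Dataset"},
    "tensorflow": {"jdl.Dataset", "tf.data.Dataset", "datasets.Dataset"},
}


def is_compatible(backend: str, dataset_kind: str) -> bool:
    """Return True if the table marks `dataset_kind` as loadable by `backend`."""
    if backend not in COMPATIBILITY:
        raise ValueError(f"unknown backend: {backend!r}")
    return dataset_kind in COMPATIBILITY[backend]


def backends_for(dataset_kind: str) -> list:
    """List, in alphabetical order, the backends that accept `dataset_kind`."""
    return sorted(b for b, kinds in COMPATIBILITY.items() if dataset_kind in kinds)


print(backends_for("jdl.Dataset"))         # all three backends
print(backends_for("torch_data.Dataset"))  # only the pytorch backend
```

Read this way, the table says that `jdl.Dataset` and huggingface `datasets.Dataset` objects work with every backend, while pytorch and tensorflow datasets are tied to their own backend.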
diff --git a/jax_dataloader/core.py b/jax_dataloader/core.py
@@ -62,14 +62,15 @@ def _check_backend_compatibility(ds, backend: str):
     return DataLoader(ds, backend=backend)
 
 # %% ../nbs/core.ipynb 8
-def get_backend_compatibilities():
+def get_backend_compatibilities() -> dict[str, list[type]]:
 
     ds = {
-        'JAX': ArrayDataset(np.array([1,2,3])),
-        'Pytorch': torch_data.Dataset(),
-        'Tensorflow': tf.data.Dataset.from_tensor_slices(np.array([1,2,3])),
-        'Huggingface': hf_datasets.Dataset.from_dict({'a': [1,2,3]})
+        JAXDataset: ArrayDataset(np.array([1,2,3])),
+        TorchDataset: torch_data.Dataset(),
+        TFDataset: tf.data.Dataset.from_tensor_slices(np.array([1,2,3])),
+        HFDataset: hf_datasets.Dataset.from_dict({'a': [1,2,3]})
     }
+    assert len(ds) == len(SUPPORTED_DATASETS)
     backends = {b: [] for b in _get_backends()}
     for b in _get_backends():
         for name, dataset in ds.items():
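The hunk above keys one sample dataset per dataset type and loops over every backend; the remainder of the loop body is not shown in this diff. As a rough illustration only, the sketch below shows one way such a probing loop could fill a backend-to-types mapping. The function `probe_compatibilities`, the `can_load` callback, and the stand-in classes are all hypothetical and are not taken from `jax_dataloader`.

```python
from typing import Callable, Dict, Iterable, List


def probe_compatibilities(
    backends: Iterable[str],
    samples: Dict[type, object],
    can_load: Callable[[object, str], None],
) -> Dict[str, List[type]]:
    """Map each backend name to the dataset types it accepts.

    `can_load` is a hypothetical probe: it should raise if `backend`
    cannot handle `dataset`, and return normally otherwise.
    """
    backends = list(backends)
    result: Dict[str, List[type]] = {b: [] for b in backends}
    for b in backends:
        for ds_type, dataset in samples.items():
            try:
                can_load(dataset, b)
            except Exception:
                continue  # incompatible pair: leave it out of the mapping
            result[b].append(ds_type)
    return result


# Stand-in dataset classes, used only to make the sketch runnable.
class ArrayLike:
    pass


class TorchLike:
    pass


def toy_probe(dataset: object, backend: str) -> None:
    # Toy rule for illustration: the "jax" backend rejects TorchLike datasets.
    if backend == "jax" and isinstance(dataset, TorchLike):
        raise ValueError("unsupported dataset for this backend")


table = probe_compatibilities(
    ["jax", "pytorch"],
    {ArrayLike: ArrayLike(), TorchLike: TorchLike()},
    toy_probe,
)
```

Keying the samples by type rather than by display name, as the diff does, lets the resulting mapping be compared directly against a list of supported dataset types.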
diff --git a/nbs/core.ipynb b/nbs/core.ipynb
@@ -23,16 +23,7 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "The autoreload extension is already loaded. To reload it, use:\n",
-      "  %reload_ext autoreload\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "#| include: false\n",
     "%load_ext autoreload\n",
@@ -47,7 +38,18 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "2023-12-26 15:13:36.437449: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
+      "2023-12-26 15:13:36.437528: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
+      "2023-12-26 15:13:36.439236: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
+      "2023-12-26 15:13:37.500782: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
+     ]
+    }
+   ],
    "source": [
     "#| export\n",
     "from __future__ import print_function, division, annotations\n",
@@ -143,14 +145,15 @@
    "outputs": [],
    "source": [
     "#| export\n",
-    "def get_backend_compatibilities():\n",
+    "def get_backend_compatibilities() -> dict[str, list[type]]:\n",
     "\n",
     "    ds = {\n",
-    "        'JAX': ArrayDataset(np.array([1,2,3])),\n",
-    "        'Pytorch': torch_data.Dataset(),\n",
-    "        'Tensorflow': tf.data.Dataset.from_tensor_slices(np.array([1,2,3])),\n",
-    "        'Huggingface': hf_datasets.Dataset.from_dict({'a': [1,2,3]})\n",
+    "        JAXDataset: ArrayDataset(np.array([1,2,3])),\n",
+    "        TorchDataset: torch_data.Dataset(),\n",
+    "        TFDataset: tf.data.Dataset.from_tensor_slices(np.array([1,2,3])),\n",
+    "        HFDataset: hf_datasets.Dataset.from_dict({'a': [1,2,3]})\n",
     "    }\n",
+    "    assert len(ds) == len(SUPPORTED_DATASETS)\n",
     "    backends = {b: [] for b in _get_backends()}\n",
     "    for b in _get_backends():\n",
     "        for name, dataset in ds.items():\n",
diff --git a/nbs/index.ipynb b/nbs/index.ipynb