Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Make Cross Validation optional #52

Merged
merged 20 commits into from
Sep 22, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -164,3 +164,5 @@ cython_debug/
# Dataset files
examples/iris.data
examples/mnist.npz
examples/cifar-10-python.tar.gz
examples/cifar-10-batches-py/
26 changes: 13 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
<br />
<div align="center">
<a href="https://github.com/gmontamat/gentun">
<img alt="plugin-icon" src="assets/icon.png">
<img alt="plugin-icon" src="https://github.com/gmontamat/gentun/blob/develop/assets/icon.png?raw=true">
</a>
<h1 style="margin: 0;" align="center">gentun</h1>
<p>
Expand Down Expand Up @@ -113,7 +113,7 @@ genes = [
]
```

We are using the `gentun.models.xgboost.XGBoostCV` handler, which performs k-fold cross validation with available train
We are using the `gentun.models.xgboost.XGBoost` handler, which performs k-fold cross validation with available train
data and returns an average metric over the folds. Thus, we need to define some static parameters which are shared
across the population over all generations:

Expand All @@ -134,11 +134,11 @@ passed either through genes or keyword arguments.

```python
from gentun.algorithms import Tournament
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.populations import Population

# Run the genetic algorithm with a population of 50 for 100 generations
population = Population(genes, XGBoostCV, 50, x_train, y_train, **kwargs)
population = Population(genes, XGBoost, 50, x_train, y_train, **kwargs)
algorithm = Tournament(population)
algorithm.run(100, maximize=False)
```
Expand All @@ -159,7 +159,7 @@ population. You can add custom individuals to the population before running the
an intuition of which hyperparameters work well with your model:

```python
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.populations import Population


Expand All @@ -171,7 +171,7 @@ hyperparams = {
}

# Generate a random population and then add a custom individual
population = Population(genes, XGBoostCV, 49, x_train, y_train, **kwargs)
population = Population(genes, XGBoost, 49, x_train, y_train, **kwargs)
population.add_individual(hyperparams)
```

Expand All @@ -183,7 +183,7 @@ method, so that uniformly distributed hyperparameter values are obtained with it

```python
from gentun.genes import RandomChoice, RandomLogUniform
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.populations import Grid


Expand All @@ -196,7 +196,7 @@ genes = [
gene_samples = [10, 8, 11] # How many samples we want to get from each gene

# Generate a grid of individuals
population = Grid(genes, XGBoostCV, gene_samples, x_train, y_train, **kwargs)
population = Grid(genes, XGBoost, gene_samples, x_train, y_train, **kwargs)
```

Running the genetic algorithm on this population for just one generation is equivalent to doing a grid search over 10
Expand Down Expand Up @@ -227,12 +227,12 @@ processes. Once this is done, the mutation and reproduction steps are run by the
produced.

```python
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.services import RedisController

controller = RedisController("experiment", host="localhost", port=6379)
# ... define genes
population = Population(genes, XGBoostCV, 100, controller=controller, **kwargs)
population = Population(genes, XGBoost, 100, controller=controller, **kwargs)
# ... run algorithm
```

Expand All @@ -243,10 +243,10 @@ its `run()` method with train data to begin processing jobs from the queue. You
as they have network access to the redis server.

```python
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.services import RedisWorker

worker = RedisWorker("experiment", XGBoostCV, host="localhost", port=6379)
worker = RedisWorker("experiment", XGBoost, host="localhost", port=6379)

# ... fetch x_train and y_train
worker.run(x_train, y_train)
Expand All @@ -264,7 +264,7 @@ This project supports hyperparameter tuning for the following models:
## Contributing

We welcome contributions to enhance this library. You can submit your custom subclasses for:
- [`gentun.models.Handler`](src/gentun/models/base.py#L9-L25)
- [`gentun.models.Handler`](src/gentun/models/base.py#L11-L30)
- [`gentun.genes.Gene`](src/gentun/genes.py#L11-L47)

Our roadmap includes:
Expand Down
73 changes: 73 additions & 0 deletions examples/geneticcnn_cifar10.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
#!/usr/bin/env python
"""
Implementation of Genetic CNN on CIFAR-10 data. This is a replica of
the algorithm described on section 4.2 of the Genetic CNN paper.
http://arxiv.org/pdf/1703.01513
"""

import os
import pickle
from typing import Tuple

import numpy as np
from sklearn.model_selection import train_test_split

from gentun.algorithms import RussianRoulette
from gentun.genes import Binary
from gentun.models.tensorflow import GeneticCNN
from gentun.populations import Population


def load_cifar10(data_dir: str, test_size: int = 10000) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
"""Load, sample, one-hot encode, and normalize CIFAR-10."""

def unpickle(file_name: str) -> dict:
with open(file_name, "rb") as fo:
data = pickle.load(fo, encoding="bytes")
return data

x_train = []
y_raw = []
for batch in range(1, 6):
batch_file = os.path.join(data_dir, f"data_batch_{batch}")
batch_data = unpickle(batch_file)
x_train.append(batch_data[b"data"])
y_raw.extend(batch_data[b"labels"])
x_train = np.concatenate(x_train).astype(np.float32)
x_train = np.moveaxis(x_train.reshape(-1, 3, 32, 32), 1, -1) / 255
y_raw = np.array(y_raw, dtype=np.int32)
# One-hot encode the output
y_train = np.zeros((y_raw.size, 10))
y_train[np.arange(y_raw.size), y_raw] = 1
return train_test_split(x_train, y_train, test_size=test_size, shuffle=True, stratify=y_raw)


if __name__ == "__main__":
# Genetic CNN static parameters
kwargs = {
"nodes": (3, 4, 5),
"input_shape": (32, 32, 3),
"kernels_per_layer": (8, 16, 32),
"kernel_sizes": ((5, 5), (5, 5), (5, 5)),
"pool_sizes": ((3, 3), (3, 3), (3, 3)),
"dense_units": 128,
"dropout_probability": 0.5,
"classes": 10,
"epochs": (120, 60, 40, 20),
"learning_rate": (1e-2, 1e-3, 1e-4, 1e-5),
"batch_size": 32, # Not mentioned in the paper, but 32 is a good default for most cases
"plot": False, # if True, graphviz needs to be installed on your system
}
# Genetic CNN hyperparameters
genes = [Binary(f"S_{i + 1}", int(K_s * (K_s - 1) / 2)) for i, K_s in enumerate(kwargs["nodes"])]

x_train, x_test, y_train, y_test = load_cifar10("cifar-10-batches-py")
population = Population(genes, GeneticCNN, 20, x_train, y_train, x_test, y_test, **kwargs)
algorithm = RussianRoulette(
population,
crossover_probability=0.2, # p_C
crossover_rate=0.2, # q_C
mutation_probability=0.8, # p_M
mutation_rate=0.05, # q_M
)
algorithm.run(50)
29 changes: 12 additions & 17 deletions examples/geneticcnn_mnist.py
Original file line number Diff line number Diff line change
@@ -1,27 +1,23 @@
#!/usr/bin/env python
"""
Implementation of Genetic CNN on MNIST data.
This is a replica of the algorithm described
on section 4.1.1 of the Genetic CNN paper.
Implementation of Genetic CNN on MNIST data. This is a replica of the
algorithm described on section 4.1 of the Genetic CNN paper.
http://arxiv.org/pdf/1703.01513
"""

import random
from typing import Tuple

import numpy as np
from sklearn.model_selection import train_test_split

from gentun.algorithms import RussianRoulette
from gentun.genes import Binary
from gentun.models.tensorflow import GeneticCNN
from gentun.populations import Population


def load_mnist(file_name: str, sample_size: int = 10000) -> Tuple[np.ndarray, np.ndarray]:
"""
Load, sample, one-hot encode,
and normalize MNIST dataset.
"""
def load_mnist(file_name: str, test_size: int = 10000) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
"""Load, sample, one-hot encode, and normalize MNIST."""
mnist = np.load(file_name)
x = mnist["x_train"].reshape(mnist["x_train"].shape[:-2] + (-1,))
y_raw = mnist["y_train"]
Expand All @@ -32,9 +28,8 @@ def load_mnist(file_name: str, sample_size: int = 10000) -> Tuple[np.ndarray, np
# One-hot encode the output
y = np.zeros((size, 10))
y[np.arange(size), y_raw] = 1
# TODO: stratified selection or random (check paper)?
selection = random.sample(range(size), sample_size)
return x[selection], y[selection]
# Split the data into training and test sets, stratified by y
return train_test_split(x, y, test_size=test_size, shuffle=True, stratify=y_raw)


if __name__ == "__main__":
Expand All @@ -44,20 +39,20 @@ def load_mnist(file_name: str, sample_size: int = 10000) -> Tuple[np.ndarray, np
"input_shape": (28, 28, 1),
"kernels_per_layer": (20, 50),
"kernel_sizes": ((5, 5), (5, 5)),
"pool_sizes": ((2, 2), (2, 2)),
"dense_units": 500,
"dropout_probability": 0.5,
"classes": 10,
"kfold": 5,
"epochs": (20, 4, 1),
"learning_rate": (1e-3, 1e-4, 1e-5),
"batch_size": 32,
"plot": True,
"batch_size": 32, # Not mentioned in the paper, but 32 is a good default for most cases
"plot": False, # if True, graphviz needs to be installed on your system
}
# Genetic CNN hyperparameters
genes = [Binary(f"S_{i + 1}", int(K_s * (K_s - 1) / 2)) for i, K_s in enumerate(kwargs["nodes"])]

x_train, y_train = load_mnist("mnist.npz")
population = Population(genes, GeneticCNN, 20, x_train, y_train, **kwargs)
x_train, x_test, y_train, y_test = load_mnist("mnist.npz")
population = Population(genes, GeneticCNN, 20, x_train, y_train, x_test, y_test, **kwargs)
algorithm = RussianRoulette(
population,
crossover_probability=0.2, # p_C
Expand Down
6 changes: 6 additions & 0 deletions examples/get_datasets.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,10 @@ if [ ! -e mnist.npz ]; then
wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
fi

if [ ! -e cifar-10-python.tar.gz ]; then
echo "Downloading CIFAR-10 dataset..."
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvzf cifar-10-python.tar.gz
fi

echo "Done!"
6 changes: 3 additions & 3 deletions examples/scikit_iris.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@

from gentun.algorithms import Tournament
from gentun.genes import RandomChoice
from gentun.models.sklearn import SklearnCV
from gentun.models.sklearn import Sklearn
from gentun.populations import Population


Expand Down Expand Up @@ -47,12 +47,12 @@ def parse_iris(file_name: str) -> Tuple[np.ndarray, np.ndarray]:
"sklearn_model": RandomForestClassifier,
"sklearn_metric": f1_score,
"metric_kwargs": {"average": "macro"},
"kfold": 5,
"folds": 5,
}

# Fetch training data
x_train, y_train = parse_iris("iris.data")
# Run genetic algorithm on a population of 10 for 10 generations
population = Population(genes, SklearnCV, 10, x_train, y_train, **kwargs)
population = Population(genes, Sklearn, 10, x_train, y_train, **kwargs)
algorithm = Tournament(population)
algorithm.run(10)
4 changes: 2 additions & 2 deletions examples/xgboost_grid_iris.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@

from gentun.algorithms import Tournament
from gentun.genes import RandomChoice, RandomLogUniform
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.populations import Grid


Expand Down Expand Up @@ -56,6 +56,6 @@ def parse_iris(file_name: str) -> Tuple[np.ndarray, np.ndarray]:
# Fetch training data
x_train, y_train = parse_iris("iris.data")
# Run genetic algorithm on a grid population for 1 generation
population = Grid(genes, XGBoostCV, gene_samples, x_train, y_train, **kwargs)
population = Grid(genes, XGBoost, gene_samples, x_train, y_train, **kwargs)
algorithm = Tournament(population)
algorithm.run(1, maximize=False)
4 changes: 2 additions & 2 deletions examples/xgboost_iris.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@

from gentun.algorithms import Tournament
from gentun.genes import RandomChoice, RandomLogUniform, RandomUniform
from gentun.models.xgboost import XGBoostCV
from gentun.models.xgboost import XGBoost
from gentun.populations import Population


Expand Down Expand Up @@ -63,6 +63,6 @@ def parse_iris(file_name: str) -> Tuple[np.ndarray, np.ndarray]:
# Fetch training data
x_train, y_train = parse_iris("iris.data")
# Run genetic algorithm on a population of 50 for 100 generations
population = Population(genes, XGBoostCV, 50, x_train, y_train, **kwargs)
population = Population(genes, XGBoost, 50, x_train, y_train, **kwargs)
algorithm = Tournament(population)
algorithm.run(100, maximize=False)
5 changes: 3 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -307,8 +307,8 @@ valid-classmethod-first-arg="cls"
valid-metaclass-classmethod-first-arg="cls"

[tool.pylint.'DESIGN']
max-args=5
max-attributes=7
max-args=9 # 5
max-attributes=12 # 7
max-bool-expr=5
max-branches=12
max-locals=15
Expand All @@ -322,6 +322,7 @@ min-public-methods=2
allow-wildcard-with-all="no"
analyse-fallback-blocks="no"
deprecated-modules="optparse,tkinter.tix"
ignored-modules="tensorflow,tensorflow.keras"

[tool.pylint.'EXCEPTIONS']
overgeneral-exceptions= [
Expand Down
Loading
Loading