# DistributedScan

The experiment is configured and started through the `DistributedScan()` command. All of the options affecting the experiment, other than the hyperparameters themselves, are configured through the `DistributedScan()` arguments. In a typical use-case, roughly ten of the arguments are invoked.

## Minimal Example

```python
jako.DistributedScan(x='x', y='y', params=p, model=input_model, config='config.json')
```

## DistributedScan Arguments

`x`, `y`, `params`, `model` and `config` are the only required arguments to start the experiment; all others are optional.

Argument | Input | Description
--------- | ------- | -----------
`x` | array or list of arrays | prediction features
`y` | array or list of arrays | prediction outcome variable
`params` | dict or ParamSpace object | the parameter dictionary or the ParamSpace object after splitting
`model` | function | the Keras model as a function
`experiment_name` | str | used for creating the experiment logging folder
`x_val` | array or list of arrays | validation data for x
`y_val` | array or list of arrays | validation data for y
`val_split` | float | validation data split ratio
`random_method` | str | the random method to be used
`seed` | float | seed for random states
`performance_target` | list | a result at which point to end the experiment
`fraction_limit` | float | the fraction of permutations to be processed
`round_limit` | int | maximum number of permutations in the experiment
`time_limit` | datetime | time limit for the experiment in format `%Y-%m-%d %H:%M`
`boolean_limit` | function | limit permutations based on a lambda function
`reduction_method` | str | type of reduction optimizer to be used
`reduction_interval` | int | number of permutations after which reduction is applied
`reduction_window` | int | the lookback window for the reduction process
`reduction_threshold` | float | the threshold at which reduction is applied
`reduction_metric` | str | the metric to be used for reduction
`minimize_loss` | bool | whether `reduction_metric` is a loss
`disable_progress_bar` | bool | disable the live updating progress bar
`print_params` | bool | print the hyperparameters of each permutation
`clear_session` | bool | clear the backend session between permutations
`save_weights` | bool | keep model weights (increases memory pressure for large models)
`config` | str or dict | configuration with information about the machines to distribute to and the database to upload the data to

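To illustrate how the optional arguments combine, the sketch below assembles a hypothetical set of keyword arguments (names follow the table above; the data objects `x`, `y`, `p` and `input_model` are assumed to exist elsewhere) and checks that the `time_limit` string matches the required format:

```python
from datetime import datetime

# Hypothetical optional arguments; names follow the table above.
scan_kwargs = {
    'experiment_name': 'iris_experiment',  # logging folder name (illustrative)
    'val_split': 0.3,                      # hold out 30% of the data
    'fraction_limit': 0.1,                 # process 10% of all permutations
    'time_limit': '2024-01-01 23:59',      # must match `%Y-%m-%d %H:%M`
}

# Verify the time limit parses with the required format.
parsed = datetime.strptime(scan_kwargs['time_limit'], '%Y-%m-%d %H:%M')

# The experiment would then be started with:
# jako.DistributedScan(x=x, y=y, params=p, model=input_model,
#                      config='config.json', **scan_kwargs)
```
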
## DistributedScan Object Properties

Once the `DistributedScan()` procedures are completed, an object with several useful properties is returned. The namespace is kept strictly clean, so all the properties contain meaningful contents.

Assuming we conducted the following experiment, we can access the properties through `distributed_scan_object`, which is a Python class object.

```python
distributed_scan_object = jako.DistributedScan(x, y, model=iris_model, params=p, fraction_limit=0.1, config='config.json')
```
<hr>

**`best_model`** picks the best model based on a given metric and returns the index number for the model.

```python
distributed_scan_object.best_model(metric='f1score', asc=False)
```
NOTE: `metric` has to be one of the metrics used in the experiment, and `asc` has to be True when the metric is to be minimized.

<hr>

**`data`** returns a pandas DataFrame with the results for the experiment together with the hyperparameter permutation details.

```python
distributed_scan_object.data
```

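The returned DataFrame can be explored with ordinary pandas operations. The sketch below uses a small mock DataFrame with hypothetical column names (`val_acc`, `first_neuron`, `activation`); the real columns depend on your metrics and parameter dictionary:

```python
import pandas as pd

# Mock of distributed_scan_object.data with hypothetical columns.
data = pd.DataFrame({
    'val_acc': [0.91, 0.87, 0.95],
    'first_neuron': [12, 24, 48],
    'activation': ['relu', 'elu', 'relu'],
})

# Sort permutations by the validation metric, best first.
best = data.sort_values('val_acc', ascending=False)
print(best.iloc[0]['first_neuron'])  # hyperparameter of the best round
```

Sorting by the validation metric is a quick way to see which hyperparameter regions perform best.
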
<hr>

**`details`** returns a pandas Series with various meta-information about the experiment.

```python
distributed_scan_object.details
```

<hr>

**`evaluate_models`** creates a new column in `distributed_scan_object.data` with the results from k-fold cross-validation.

```python
distributed_scan_object.evaluate_models(x_val=x_val,
                                        y_val=y_val,
                                        n_models=10,
                                        metric='f1score',
                                        folds=5,
                                        shuffle=True,
                                        average='binary',
                                        asc=False)
```

Argument | Description
-------- | -----------
`distributed_scan_object` | The class object returned by `DistributedScan()` upon completion of the experiment.
`x_val` | Input data (features) in the same format as used in `DistributedScan()`, but it should not be the same data (or it will not be much of a validation).
`y_val` | Input data (labels) in the same format as used in `DistributedScan()`, but it should not be the same data (or it will not be much of a validation).
`n_models` | The number of models to be evaluated. If set to 10, the 10 models with the highest metric value are evaluated.
`metric` | The metric to be used for picking the models to be evaluated.
`folds` | The number of folds to be used in the evaluation.
`shuffle` | Whether the data is to be shuffled or not. Always set to False for timeseries, but keep in mind that you might get periodical/seasonal bias.
`average` | One of the supported averaging methods: 'binary', 'micro', or 'macro'.
`asc` | Set to True if the metric is to be minimized.
`saved` | bool; whether a model saved on the local machine should be used.
`custom_objects` | dict; if the model has a custom object, pass it here.

<hr>

**`learning_entropy`** returns a pandas DataFrame with an entropy measure for each permutation, describing how much variation there is between the results of each epoch in the permutation.

```python
distributed_scan_object.learning_entropy
```

<hr>

**`params`** returns a dictionary with the original input parameter ranges for the experiment.

```python
distributed_scan_object.params
```

<hr>

**`round_times`** returns a pandas DataFrame with the time when each permutation started and ended, and how many seconds it took.

```python
distributed_scan_object.round_times
```

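As an illustration, the per-permutation timing can be aggregated with pandas. The column names (`start`, `end`, `duration`) in this mock are assumptions; check the actual DataFrame for the real ones:

```python
import pandas as pd

# Mock of distributed_scan_object.round_times with hypothetical columns.
round_times = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01 10:00:00', '2024-01-01 10:00:30']),
    'end':   pd.to_datetime(['2024-01-01 10:00:25', '2024-01-01 10:01:10']),
    'duration': [25, 40],  # seconds per permutation
})

# Total wall-clock seconds spent across all permutations.
total_seconds = round_times['duration'].sum()

# Cross-check the durations against the timestamps.
computed = (round_times['end'] - round_times['start']).dt.total_seconds()
print(total_seconds)
```
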
<hr>

**`round_history`** returns epoch-by-epoch data for each model in a dictionary.

```python
distributed_scan_object.round_history
```

<hr>

**`saved_models`** returns the JSON (dictionary) for each model.

```python
distributed_scan_object.saved_models
```

<hr>

**`saved_weights`** returns the weights for each model.

```python
distributed_scan_object.saved_weights
```

<hr>

**`x`** returns the input data (features).

```python
distributed_scan_object.x
```

<hr>

**`y`** returns the input data (labels).

```python
distributed_scan_object.y
```

## Input Model

The input model is any Keras or tf.keras model. It's the model that Jako will use as the basis for the hyperparameter experiment.

#### A minimal example

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def input_model(x_train, y_train, x_val, y_val, params):

    model = Sequential()
    model.add(Dense(12, input_dim=8, activation=params['activation']))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=params['optimizer'])
    out = model.fit(x=x_train,
                    y=y_train,
                    validation_data=[x_val, y_val])

    return out, model
```
See specific details about defining the model [here](Examples_Typical?id=defining-the-model).

#### Models with multiple inputs or outputs (list of arrays)

For both cases, `x_val` and `y_val` must be explicitly set in `DistributedScan()`, i.e. you split the data yourself before passing it into Jako. The examples below use the minimal example above as a reference.

For **multi-input**, change `model.fit()` as shown below:

```python
out = model.fit(x=[x_train_a, x_train_b],
                y=y_train,
                validation_data=[[x_val_a, x_val_b], y_val])
```

For **multi-output**, the same structure is expected, but instead of changing the `x` argument values, change `y`:

```python
out = model.fit(x=x_train,
                y=[y_train_a, y_train_b],
                validation_data=[x_val, [y_val_a, y_val_b]])
```
For the case where it's both **multi-input** and **multi-output**, both the `x` and `y` argument values follow the same structure:

```python
out = model.fit(x=[x_train_a, x_train_b],
                y=[y_train_a, y_train_b],
                validation_data=[[x_val_a, x_val_b], [y_val_a, y_val_b]])
```

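As a sketch of preparing list-of-arrays inputs, the example below builds two hypothetical input branches with NumPy and performs the manual split that produces `x_val` and `y_val` (all array names and shapes are illustrative):

```python
import numpy as np

# Two hypothetical input branches and 100 samples.
x_a = np.random.rand(100, 8)
x_b = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=(100, 1))

# Manual 80/20 split, since x_val/y_val must be passed explicitly.
split = 80
x_train, x_val = [x_a[:split], x_b[:split]], [x_a[split:], x_b[split:]]
y_train, y_val = y[:split], y[split:]

print([a.shape for a in x_val])  # one shape per input branch
```
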
## Parameter Dictionary

The first step in an experiment is to decide the hyperparameters you want to use in the optimization process.

#### A minimal example

```python
p = {
    'first_neuron': [12, 24, 48],
    'activation': ['relu', 'elu'],
    'batch_size': [10, 20, 30]
}
```

#### Supported Input Formats

Parameters may be provided either as a list or as a tuple.

As a set of discrete values in a list:

```python
p = {'first_neuron': [12, 24, 48]}
```
As a range of values `(min, max, steps)`:

```python
p = {'first_neuron': (12, 48, 2)}
```

For the case where a static value is preferred, but it's still useful to include it in the parameters dictionary, use a list:

```python
p = {'first_neuron': [48]}
```

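For list-valued parameters, the total number of permutations is the product of the list lengths, which is worth knowing before setting `fraction_limit` or `round_limit`. A quick way to count them (plain `itertools`, independent of Jako):

```python
from itertools import product

p = {
    'first_neuron': [12, 24, 48],
    'activation': ['relu', 'elu'],
    'batch_size': [10, 20, 30],
}

# Full grid: every combination of one value per parameter.
permutations = list(product(*p.values()))
print(len(permutations))  # 3 * 2 * 3 = 18
```
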
## DistributedScan Config file

The config file contains all the information needed to connect to the remote machines. It also designates one of the remote machines as the central datastore. A sample config file looks like the following:

```
{
    "run_central_node": true,
    "machines": [
        {
            "machine_id": 1,
            "JAKO_IP_ADDRESS": "machine_1_ip_address",
            "JAKO_PORT": machine_1_port,
            "JAKO_USER": "machine_1_username",
            "JAKO_PASSWORD": "machine_1_password"
        },
        {
            "machine_id": 2,
            "JAKO_IP_ADDRESS": "machine_2_ip_address",
            "JAKO_PORT": machine_2_port,
            "JAKO_USER": "machine_2_username",
            "JAKO_KEY_FILENAME": "machine_2_key_file_path"
        }
    ],
    "database": {
        "DB_HOST_MACHINE_ID": 1,
        "DB_USERNAME": "database_username",
        "DB_PASSWORD": "database_password",
        "DB_TYPE": "database_type",
        "DATABASE_NAME": "database_name",
        "DB_PORT": database_port,
        "DB_ENCODING": "LATIN1",
        "DB_UPDATE_INTERVAL": 5
    }
}
```

### DistributedScan config arguments

Argument | Input | Description
--------- | ------- | -----------
`run_central_node` | bool | if set to true, the central machine where the script runs is also included in the distributed run
`machines` | list of dict | list of machine configurations
`machine_id` | int | id for each machine, in ascending order
`JAKO_IP_ADDRESS` | str | ip address of the remote machine
`JAKO_PORT` | int | port number of the remote machine
`JAKO_USER` | str | username for the remote machine
`JAKO_PASSWORD` | str | password for the remote machine
`JAKO_KEY_FILENAME` | str | if a password is not available, the path to the machine's RSA private key can be supplied instead
`database` | dict | configuration parameters for the central datastore
`DB_HOST_MACHINE_ID` | int | the id of the remote machine that hosts the database
`DB_USERNAME` | str | database username
`DB_PASSWORD` | str | database password
`DB_TYPE` | str | database type. Default is `postgres`. The available options are `postgres`, `mysql` and `sqlite`
`DATABASE_NAME` | str | database name
`DB_PORT` | int | database port
`DB_ENCODING` | str | defaults to `LATIN1`
`DB_UPDATE_INTERVAL` | int | the frequency, in seconds, at which database updates happen. Defaults to `5`

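Since the `config` argument accepts a dict as well as a file path, the same configuration can also be built directly in Python. A minimal single-machine sketch with placeholder credentials (all values here are illustrative):

```python
# Single-machine config built as a plain dict; every value is a placeholder.
config = {
    'run_central_node': True,
    'machines': [
        {
            'machine_id': 1,
            'JAKO_IP_ADDRESS': '192.0.2.10',  # placeholder address
            'JAKO_PORT': 22,
            'JAKO_USER': 'user',
            'JAKO_PASSWORD': 'password',
        },
    ],
    'database': {
        'DB_HOST_MACHINE_ID': 1,
        'DB_USERNAME': 'db_user',
        'DB_PASSWORD': 'db_password',
        'DB_TYPE': 'postgres',
        'DATABASE_NAME': 'jako_experiment',
        'DB_PORT': 5432,
        'DB_ENCODING': 'LATIN1',
        'DB_UPDATE_INTERVAL': 5,
    },
}
# The dict would then be passed directly:
# jako.DistributedScan(..., config=config)
```

Building the dict in Python avoids writing credentials to a file on disk; either form works.
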