
Commit c63c87d

Merge pull request #8 from autonomio/initial_release
#2 Initial Release Loose-Ends Tracker
2 parents 1a51961 + fa2a606 commit c63c87d

15 files changed (+856, -436 lines)

.github/workflows/ci.yml

+4
```diff
@@ -35,6 +35,10 @@ jobs:
       with:
         name: "config.json"
         json: ${{ secrets.CONFIG_JSON }}
+    - name: Add Key
+      run: |
+        echo "${{ secrets.AUTONOMIO_DEV_PEM }}" > autonomio-dev.pem
+        chmod 0400 autonomio-dev.pem
     - name: Tests
       run: |
         pip install tensorflow>=2.0
```

docs/DistributedScan.md

+316
# DistributedScan

The experiment is configured and started through the `DistributedScan()` command. All of the options affecting the experiment, other than the hyperparameters themselves, are configured through the `DistributedScan()` arguments. The most common use-case involves roughly ten arguments.

## Minimal Example

```python
jako.DistributedScan(x=x, y=y, params=p, model=input_model, config='config.json')
```

## DistributedScan Arguments

`x`, `y`, `params`, `model` and `config` are the only required arguments to start the experiment; all others are optional.

Argument | Input | Description
--------- | ------- | -----------
`x` | array or list of arrays | prediction features
`y` | array or list of arrays | prediction outcome variable
`params` | dict or ParamSpace object | the parameter dictionary or the ParamSpace object after splitting
`model` | function | the Keras model as a function
`experiment_name` | str | used for creating the experiment logging folder
`x_val` | array or list of arrays | validation data for x
`y_val` | array or list of arrays | validation data for y
`val_split` | float | validation data split ratio
`random_method` | str | the random method to be used
`seed` | float | seed for random states
`performance_target` | list | a result at which point to end the experiment
`fraction_limit` | float | the fraction of permutations to be processed
`round_limit` | int | maximum number of permutations in the experiment
`time_limit` | datetime | time limit for the experiment in format `%Y-%m-%d %H:%M`
`boolean_limit` | function | limit permutations based on a lambda function
`reduction_method` | str | type of reduction optimizer to be used
`reduction_interval` | int | number of permutations after which reduction is applied
`reduction_window` | int | the lookback window for the reduction process
`reduction_threshold` | float | the threshold at which reduction is applied
`reduction_metric` | str | the metric to be used for reduction
`minimize_loss` | bool | set to True when `reduction_metric` is a loss
`disable_progress_bar` | bool | disable the live updating progress bar
`print_params` | bool | print the hyperparameters of each permutation
`clear_session` | bool | clear the backend session between permutations
`save_weights` | bool | keep model weights (increases memory pressure for large models)
`config` | str or dict | configuration with information about the machines to distribute to and the database to upload the data to

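For illustration, here is a sketch of a call that combines several of the optional arguments above. The `experiment_name`, `time_limit` and `boolean_limit` values are hypothetical, and the lambda is assumed to receive the permutation's parameter dictionary:

```python
jako.DistributedScan(x=x,
                     y=y,
                     params=p,
                     model=input_model,
                     experiment_name='iris_experiment',
                     fraction_limit=0.1,
                     # stop the experiment at this wall-clock time (%Y-%m-%d %H:%M)
                     time_limit='2030-01-01 00:00',
                     # only process permutations for which the lambda is True
                     boolean_limit=lambda params: params['batch_size'] <= params['first_neuron'],
                     config='config.json')
```
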
## DistributedScan Object Properties

Once the `DistributedScan()` procedures are completed, an object with several useful properties is returned. The namespace is kept strictly clean, so all the properties consist of meaningful contents.

Assuming we conducted the following experiment, we can access the properties through `distributed_scan_object`, which is a Python class object.

```python
distributed_scan_object = jako.DistributedScan(x, y,
                                               model=iris_model,
                                               params=p,
                                               fraction_limit=0.1,
                                               config='config.json')
```

<hr>

**`best_model`** picks the best model based on a given metric and returns the index number for the model.

```python
distributed_scan_object.best_model(metric='f1score', asc=False)
```

NOTE: `metric` has to be one of the metrics used in the experiment, and `asc` has to be set to True when the metric is one that should be minimized.

<hr>

**`data`** returns a pandas DataFrame with the results for the experiment together with the hyperparameter permutation details.

```python
distributed_scan_object.data
```

<hr>

**`details`** returns a pandas Series with various meta-information about the experiment.

```python
distributed_scan_object.details
```

<hr>

**`evaluate_models`** creates a new column in `distributed_scan_object.data` with the results of k-fold cross-evaluation.

```python
distributed_scan_object.evaluate_models(x_val=x_val,
                                        y_val=y_val,
                                        n_models=10,
                                        metric='f1score',
                                        folds=5,
                                        shuffle=True,
                                        average='binary',
                                        asc=False)
```

Argument | Description
-------- | -----------
`distributed_scan_object` | the class object returned by `DistributedScan()` upon completion of the experiment
`x_val` | input data (features) in the same format as used in `DistributedScan()`, but it should not be the same data (or it will not be much of a validation)
`y_val` | input data (labels) in the same format as used in `DistributedScan()`, but it should not be the same data (or it will not be much of a validation)
`n_models` | the number of models to be evaluated; if set to 10, the 10 models with the highest metric value are evaluated
`metric` | the metric to be used for picking the models to be evaluated
`folds` | the number of folds to be used in the evaluation
`shuffle` | whether the data is to be shuffled or not; always set to False for timeseries, but keep in mind that you might get periodical/seasonal bias
`average` | one of the supported averaging methods: 'binary', 'micro', or 'macro'
`asc` | set to True if the metric is to be minimized
`saved` | bool; whether a model saved on the local machine should be used
`custom_objects` | dict; if the model has a custom object, pass it here

<hr>

**`learning_entropy`** returns a pandas DataFrame with an entropy measure for each permutation, reflecting how much the results vary between the epochs of that permutation.

```python
distributed_scan_object.learning_entropy
```

<hr>

**`params`** returns a dictionary with the original input parameter ranges for the experiment.

```python
distributed_scan_object.params
```

<hr>

**`round_times`** returns a pandas DataFrame with the time when each permutation started, ended, and how many seconds it took.

```python
distributed_scan_object.round_times
```

<hr>
**`round_history`** returns epoch-by-epoch data for each model in a dictionary.

```python
distributed_scan_object.round_history
```

<hr>

**`saved_models`** returns the JSON (dictionary) for each model.

```python
distributed_scan_object.saved_models
```

<hr>

**`saved_weights`** returns the weights for each model.

```python
distributed_scan_object.saved_weights
```

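Together, these properties can be used to restore a trained model. A minimal sketch, assuming `best_model` returns an index into these collections and that `saved_models` entries are Keras JSON strings:

```python
from tensorflow.keras.models import model_from_json

# index of the best permutation by the chosen metric
best = distributed_scan_object.best_model(metric='f1score', asc=False)

# rebuild the architecture from its JSON, then restore its weights
model = model_from_json(distributed_scan_object.saved_models[best])
model.set_weights(distributed_scan_object.saved_weights[best])
```
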
<hr>

**`x`** returns the input data (features).

```python
distributed_scan_object.x
```

<hr>

**`y`** returns the input data (labels).

```python
distributed_scan_object.y
```

## Input Model

The input model is any Keras or tf.keras model. It's the model that Jako will use as the basis for the hyperparameter experiment.

#### A minimal example

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def input_model(x_train, y_train, x_val, y_val, params):

    model = Sequential()
    model.add(Dense(12, input_dim=8, activation=params['activation']))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=params['optimizer'])
    out = model.fit(x=x_train,
                    y=y_train,
                    validation_data=[x_val, y_val])

    return out, model
```

See specific details about defining the model [here](Examples_Typical?id=defining-the-model).

#### Models with multiple inputs or outputs (list of arrays)

For both cases, `DistributedScan(... x_val, y_val ...)` must be explicitly set, i.e. you split the data yourself before passing it into Jako. The examples below use the above minimal example as a reference.

For **multi-input**, change `model.fit()` as shown below:

```python
out = model.fit(x=[x_train_a, x_train_b],
                y=y_train,
                validation_data=[[x_val_a, x_val_b], y_val])
```

For **multi-output**, the same structure is expected, but instead of changing the `x` argument values, now change `y`:

```python
out = model.fit(x=x_train,
                y=[y_train_a, y_train_b],
                validation_data=[x_val, [y_val_a, y_val_b]])
```

For the case where it's both **multi-input** and **multi-output**, both the `x` and `y` argument values follow the same structure:

```python
out = model.fit(x=[x_train_a, x_train_b],
                y=[y_train_a, y_train_b],
                validation_data=[[x_val_a, x_val_b], [y_val_a, y_val_b]])
```

## Parameter Dictionary

The first step in an experiment is to decide the hyperparameters you want to use in the optimization process.

#### A minimal example

```python
p = {
    'first_neuron': [12, 24, 48],
    'activation': ['relu', 'elu'],
    'batch_size': [10, 20, 30]
}
```

#### Supported Input Formats

Parameters may be input either as a list or a tuple.

As a set of discrete values in a list:

```python
p = {'first_neuron': [12, 24, 48]}
```

As a range of values `(min, max, steps)`:

```python
p = {'first_neuron': (12, 48, 2)}
```

For the case where a static value is preferred, but it's still useful to include it in the parameters dictionary, use a single-element list:

```python
p = {'first_neuron': [48]}
```
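
The formats can also be mixed freely within a single dictionary; for example:

```python
p = {'first_neuron': (12, 48, 2),    # range of values
     'activation': ['relu', 'elu'],  # discrete values
     'batch_size': [30]}             # static value
```
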
## DistributedScan Config file

A config file has all the information regarding the connection to the remote machines. The config file also designates one of the remote machines as the central datastore. A sample config file will look like the following, with the placeholder values (e.g. `machine_1_ip_address`) replaced by your own; note that port values must be integers:

```
{
    "run_central_node": true,
    "machines": [
        {
            "machine_id": 1,
            "JAKO_IP_ADDRESS": "machine_1_ip_address",
            "JAKO_PORT": machine_1_port,
            "JAKO_USER": "machine_1_username",
            "JAKO_PASSWORD": "machine_1_password"
        },
        {
            "machine_id": 2,
            "JAKO_IP_ADDRESS": "machine_2_ip_address",
            "JAKO_PORT": machine_2_port,
            "JAKO_USER": "machine_2_username",
            "JAKO_KEY_FILENAME": "machine_2_key_file_path"
        }
    ],
    "database": {
        "DB_HOST_MACHINE_ID": 1,
        "DB_USERNAME": "database_username",
        "DB_PASSWORD": "database_password",
        "DB_TYPE": "database_type",
        "DATABASE_NAME": "database_name",
        "DB_PORT": database_port,
        "DB_ENCODING": "LATIN1",
        "DB_UPDATE_INTERVAL": 5
    }
}
```

### DistributedScan config arguments

Argument | Input | Description
--------- | ------- | -----------
`run_central_node` | bool | if set to true, the central machine where the script runs is also included in the distributed run
`machines` | list of dict | list of machine configurations
`machine_id` | int | id for each machine, in ascending order
`JAKO_IP_ADDRESS` | str | ip address of the remote machine
`JAKO_PORT` | int | port number of the remote machine
`JAKO_USER` | str | username for the remote machine
`JAKO_PASSWORD` | str | password for the remote machine
`JAKO_KEY_FILENAME` | str | if a password is not available, the path to the machine's RSA private key can be supplied instead
`database` | dict | configuration parameters for the central datastore
`DB_HOST_MACHINE_ID` | int | the machine id of the machine that hosts the database
`DB_USERNAME` | str | database username
`DB_PASSWORD` | str | database password
`DB_TYPE` | str | database type; the default is `postgres` and the available options are `postgres`, `mysql` and `sqlite`
`DATABASE_NAME` | str | database name
`DB_PORT` | int | database port
`DB_ENCODING` | str | database encoding; defaults to `LATIN1`
`DB_UPDATE_INTERVAL` | int | the frequency, in seconds, with which the database is updated; defaults to `5`

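Because `config` accepts either a file path or a dict, the same configuration can also be loaded and passed in as a dictionary. A minimal sketch, assuming a `config.json` like the sample above with real values filled in:

```python
import json

# read the configuration from disk and pass it in as a dict
with open('config.json') as f:
    config = json.load(f)

jako.DistributedScan(x=x, y=y, params=p, model=input_model, config=config)
```
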

docs/Install_Options.md

+57
# Install Options

Before installing Jako, it is recommended to first set up and start the following:

* A Python or conda environment.
* A PostgreSQL database on one of the machines. This will be used as the central datastore.

## Installing Jako

#### Creating a python virtual environment

```bash
virtualenv -p python3 jako_env
source jako_env/bin/activate
```

#### Creating a conda virtual environment

```bash
conda create --name jako_env
conda activate jako_env
```

#### Install latest from PyPI

```bash
pip install jako
```

#### Install a specific version from PyPI

```bash
pip install jako==0.1
```

#### Upgrade installation from PyPI

```bash
pip install -U --no-deps jako
```

#### Install from monthly

```bash
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako
```

#### Install from weekly

```bash
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako@dev
```

#### Install from daily

```bash
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako@daily-dev
```

## Installing a postgres database

To enable Postgres as your central datastore, follow the steps in one of these guides:

* Postgres for Ubuntu machines: [link](https://blog.logrocket.com/setting-up-a-remote-postgres-database-server-on-ubuntu-18-04/)
* Postgres for Mac machines: [link](https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/)
