# Archiving

## Introduction

Have you ever trained a neural network, tweaked one little thing, and then wished you still had the results from your previous run? Then you might consider archiving everything you do in Keras with `CMS_SURF_2016.utils.archiving`. Archiving is a great way to speed up your workflow. Instead of running one network or a series of networks and then dedicating an IPython notebook or script to the results, you can quickly iterate in one notebook or script, and everything you do will be stored: the model architecture, best weights, training history, compilation parameters, fit parameters, preprocessing procedures, and so on. All of this information can then be easily searched for in your archive.
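Each trial is identified by a cryptographic hash of its inputs, so identical configurations always map to the same archive entry. As a rough illustration of the idea (this is not the library's actual hashing scheme, and `config_hash` is a hypothetical helper), a lookup key for a trial configuration might be derived like this:

```python
import hashlib
import json

def config_hash(config):
    """Hash a JSON-serializable configuration dict to a hex id.

    Sorting the keys makes the hash deterministic regardless of the
    order in which the dict was built.
    """
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()

trial_config = {
    "optimizer": "rmsprop",
    "loss": "categorical_crossentropy",
    "nb_epoch": 18,
    "layers": [{"type": "Dense", "units": 32}, {"type": "Dense", "units": 10}],
}
print(config_hash(trial_config))  # same inputs always give the same id
```

Because the key depends only on the configuration, looking up "the trial I already ran with these exact settings" is a dictionary lookup rather than a naming convention you have to remember.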
## Examples

We start our new workflow as usual, by simply building our model.

```python
%matplotlib inline
import sys, os
if __package__ is None:
    sys.path.append(os.path.realpath("/data/shared/Software/"))
from CMS_SURF_2016.utils.archiving import DataProcedure, KerasTrial, get_all_preprocessing, get_all_trials
from keras.models import Sequential, Model
from keras.layers import Dense, Flatten, Reshape, Activation, Dropout, Convolution2D, Merge, Input
from keras.callbacks import EarlyStopping
from keras.utils.np_utils import to_categorical
import numpy as np

archive_dir = 'MyArchiveDir/'

# Define callbacks
earlystopping = EarlyStopping(patience=10, verbose=1)

# Warning: it's important to name all of your layers, otherwise the hashing won't work

# Make two input branches
left_branch = Sequential()
left_branch.add(Dense(32, input_dim=784))

right_branch = Sequential()
right_branch.add(Dense(32, input_dim=784))

merged = Merge([left_branch, right_branch], mode='concat')

# Make a two-layer dense model
model = Sequential()
model.add(merged)
model.add(Dense(10, activation='softmax'))
model.add(Dense(10, activation='softmax'))
```

To keep track of absolutely everything, we wrap our Keras code in a KerasTrial so that everything we input can be saved and cryptographically hashed for unique identification. The functions we use to get and preprocess our data are wrapped in DataProcedures. The DataProcedure class wraps any function that returns a tuple (X, Y), or a generator that returns batches of data in the form (X, Y). We pass our DataProcedures into setTrain() and setValidation(). When execute() is called, they grab the data and training begins.
```python
# Define a function for our DataProcedure. Note: it must return X, Y
def myGetXY(thousand, one, b=784, d=10):
    data_1 = np.random.random((thousand, b))
    data_2 = np.random.random((thousand, b))
    labels = np.random.randint(d, size=(thousand, one))
    labels = to_categorical(labels, d)
    X = [data_1, data_2]
    Y = labels
    return X, Y

# Define a list of two DataProcedures for the model to be fit on, one after the other.
# Each DataProcedure gets the function that generates our training data, plus its arguments.
data = [DataProcedure(archive_dir, True, myGetXY, 1000, 1, b=784, d=10) for i in range(2)]
val_data = DataProcedure(archive_dir, True, myGetXY, 1000, 1, b=784, d=10)

# Build our KerasTrial object and name it
trial = KerasTrial(archive_dir, name="MyKerasTrial", model=model)

# Set the training data
trial.setTrain(train_procedure=data)
trial.setValidation(0.2)

# Set the compilation parameters
trial.setCompilation(optimizer='rmsprop',
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])

# Set the fit parameters
trial.setFit(callbacks=[earlystopping], nb_epoch=18)
```

Finally, we can execute everything using the parameters we just set. If we were to run execute() a second time, it would simply tell us that the trial had already been completed. If we really want to run the trial again, we can do `trial.execute(redo=True)`.
```python
# Execute the trial, fitting on each DataProcedure in turn
trial.execute()
print("OK IT FINISHED!")
```

Normally, at this point you would be responsible for storing the training history, model, and weights. In addition, you would have to concoct a naming scheme so that you could get that information back later. With archiving, that whole process is streamlined and automated.
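For contrast, the manual bookkeeping that archiving replaces might look something like the sketch below. Everything here is an invented placeholder: the run-name scheme, the `results/` directory, and the history numbers are not part of the library.

```python
import json
import os
import time

# Hypothetical manual bookkeeping: invent a run name, create a
# directory, and save weights and history by hand.
run_name = "merge_dense_rmsprop_%d" % int(time.time())
out_dir = os.path.join("results", run_name)
os.makedirs(out_dir)

# model.save_weights(os.path.join(out_dir, "weights.h5"))  # needs a live model

# Placeholder numbers standing in for a real Keras History object
history = {"loss": [2.41, 2.35], "acc": [0.09, 0.11]}
with open(os.path.join(out_dir, "history.json"), "w") as f:
    json.dump(history, f)
print("saved to", out_dir)
```

The fragile part is the naming scheme: if `run_name` doesn't encode every setting that changed, two different experiments can silently collide, which is exactly what hashing the full configuration avoids.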
```python
from keras.utils.visualize_util import plot
from IPython.display import Image, display
from CMS_SURF_2016.utils.plot import plot_history

# Luckily, no information was lost. We can still get the training history for the trial.
history = trial.get_history()
plot_history([('myhistory', history)])

test_pp = DataProcedure(archive_dir, True, myGetXY, 1000, 1, b=784, d=10)
test_X, test_Y = test_pp.get_XY()

# And even the model and weights are still intact
model = trial.compile(loadweights=True)
ev = trial.test(test_pp)
loss = ev[0]
accuracy = ev[1]

plot(model, to_file='model3.png', show_shapes=True, show_layer_names=False)
display(Image("model3.png"))

print('\n')
print("Test_Loss:", loss)
print("Test_Accuracy:", accuracy)
```

Output:

```
DataProcedure results 'f630227125ebbd009156bc108c71da5c055b3dae' read from archive
 768/1000 [======================>.......] - ETA: 0sDataProcedure results 'f630227125ebbd009156bc108c71da5c055b3dae' read from archive
 800/1000 [=======================>......] - ETA: 0s
('Test_Loss:', 2.3548251056671141)
('Test_Accuracy:', 0.091999999999999998)
```

We can also do our fit using generators; instead of setFit() we use setFit_Generator(), and we must pass a DataProcedure that returns a generator into setTrain(). For setValidation() we can use either a DataProcedure that returns regular data (X, Y) or one that returns a generator.
```python
def myGen(dps, batch_size):
    if not isinstance(dps, list):
        dps = [dps]
    for dp in dps:
        if not isinstance(dp, DataProcedure):
            raise TypeError("Only takes DataProcedure, got %r" % type(dp))
    while True:
        for i in range(len(dps)):
            X, Y = dps[i].getData()
            if not isinstance(X, list):
                X = [X]
            if not isinstance(Y, list):
                Y = [Y]
            tot = Y[0].shape[0]
            assert tot == X[0].shape[0]
            for start in range(0, tot, batch_size):
                end = start + min(batch_size, tot - start)
                yield [x[start:end] for x in X], [y[start:end] for y in Y]
```
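The batch-slicing pattern inside myGen can be sanity-checked on its own. This small sketch (plain NumPy, independent of the library) confirms that stepping `start` by `batch_size` and slicing `[start:end]` yields every sample exactly once, with a shorter final batch holding the remainder:

```python
import numpy as np

# Stand-alone version of the slicing loop used in myGen
def batches(X, Y, batch_size):
    tot = Y.shape[0]
    for start in range(0, tot, batch_size):
        end = start + min(batch_size, tot - start)
        yield X[start:end], Y[start:end]

X = np.arange(10).reshape(10, 1)
Y = np.arange(10)
sizes = [yb.shape[0] for _, yb in batches(X, Y, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Because Keras counts samples rather than batches (via `samples_per_epoch`), the uneven final batch is harmless as long as the batch sizes sum to the dataset size.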
Now we wrap `myGen` in a DataProcedure and build the trial as before:

```python
train_proc = DataProcedure(archive_dir, True, myGen, data, 100)

# Build our KerasTrial object and name it
trial = KerasTrial(archive_dir, name="MyKerasTrial", model=model)

# Set the training data
trial.setTrain(train_procedure=train_proc,
               samples_per_epoch=1000)
trial.setValidation(train_proc, nb_val_samples=100)

# Set the compilation parameters
trial.setCompilation(optimizer='rmsprop',
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])

# Set the fit parameters
trial.setFit_Generator(callbacks=[earlystopping], nb_epoch=18)

# Execute the trial, fitting on each DataProcedure in turn
trial.execute()
print("OK IT FINISHED!")

ev = trial.test(train_proc, 1000)
loss = ev[0]
accuracy = ev[1]

print('\n')
print("Test_Loss:", loss)
print("Test_Accuracy:", accuracy)
```

Output:

```
DataProcedure results 'f630227125ebbd009156bc108c71da5c055b3dae' read from archive
DataProcedure results 'f630227125ebbd009156bc108c71da5c055b3dae' read from archive
('Test_Loss:', 2.2868499279022219)
('Test_Accuracy:', 0.13299999833106996)
```

But what if I lose everything and can't regenerate the trial, or what if I just want to get my results back without having to rerun the code that generated them? You can simply search through your archive: everything you've done is all there.
```python
trials = get_all_trials(archive_dir)
for t in trials:
    t.summary()
    t.remove_from_archive()
trials = get_all_trials(archive_dir)
print("Deleted all trials?:", len(trials) == 0)

pps = get_all_preprocessing(archive_dir)
for p in pps:
    p.summary()
    p.remove_from_archive()
pps = get_all_preprocessing(archive_dir)
print("Deleted all data?:", len(pps) == 0)
```

Output:
```
--------------------------------------------------
TRIAL SUMMARY (1fc8a8a67063eb3f5c1209b9af76ceb5f5746b88)
Record_Info:
    elapse_time = 0.975623, fit_cycles = 1, name = MyKerasTrial, num_test = 1000, num_train = 1000.0, num_validation = 0.0, test_acc = 0.0979999992996, test_loss = 2.31007232666, val_acc = 0.170000001788
Training:
    __main__.myGen([<CMS_SURF_2016.utils.archiving.DataProcedure object at 0x7fa9e3827c90>, <CMS_SURF_2016.utils.archiving.DataProcedure object at 0x7fa9e3827d10>],100,)
    samples_per_epoch = 1000
Validation:
    __main__.myGen([<CMS_SURF_2016.utils.archiving.DataProcedure object at 0x7fa9e3827e10>, <CMS_SURF_2016.utils.archiving.DataProcedure object at 0x7fa9e3827e90>],100,)
    nb_val_samples = 100
Compilation:
    optimizer=rmsprop, loss=categorical_crossentropy, metrics=[u'accuracy']
Fit:
    batch_size=32, nb_epoch=18, callbacks=[{u'patience': 10, u'verbose': 1, u'type': u'EarlyStopping', u'mode': u'auto', u'monitor': u'val_loss'}], validation_split=0.0, shuffle=True, class_weight=True
--------------------------------------------------
--------------------------------------------------
TRIAL SUMMARY (e7ef32d0bf5e6f4d6893812da59c3e59eb291f8f)
Record_Info:
    elapse_time = 2.542489, fit_cycles = 2, name = MyKerasTrial, num_test = 1000, num_train = 1600.0, num_validation = 400.0, test_acc = 0.109, test_loss = 2.33810310555, val_acc = 0.13
Training:
    __main__.myGetXY(1000,1,b=784,d=10), __main__.myGetXY(1000,1,b=784,d=10)
Validation:
    validation_split = 0.2
Compilation:
    optimizer=rmsprop, loss=categorical_crossentropy, metrics=[u'accuracy']
Fit:
    batch_size=32, nb_epoch=18, callbacks=[{u'patience': 10, u'verbose': 1, u'type': u'EarlyStopping', u'mode': u'auto', u'monitor': u'val_loss'}], validation_split=0.2, shuffle=True
--------------------------------------------------
('Deleted all trials?:', True)
--------------------------------------------------
DataProcedure ('f630227125ebbd009156bc108c71da5c055b3dae')
    __main__.myGetXY(1000,1,b=784,d=10)
--------------------------------------------------
('Deleted all data?:', True)
```