
Implement Dropout #1

Open

eric-tramel opened this issue Sep 9, 2015 · 6 comments

@eric-tramel

Goal

One of the most recent and most effective regularisation techniques for training RBMs is dropout. Unfortunately, the original Boltzmann.jl package does not implement it, so we should take this on ourselves.

Technique

During the training phase of the RBM, each hidden node is present with only probability $p$. Training is performed for this reduced model, and the resulting trained models are then combined. The pertinent section from (Srivastava et al., 2014) reads,

8.2 Learning Dropout RBMs

Learning algorithms developed for RBMs such as Contrastive Divergence (Hinton et al., 2006) can be directly applied for learning Dropout RBMs. The only difference is that r is first sampled and only the hidden units that are retained are used for training. Similar to dropout neural networks, a different r is sampled for each training case in every minibatch. In our experiments, we use CD-1 for training dropout RBMs.
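
In code, the quoted procedure amounts to something like the following rough sketch of a single CD-1 step. Here sample_hiddens and sample_visibles just stand in for whatever conditional samplers the package provides, and the mask shape (one column per training case) follows the quote:

function cd1_dropout_step(rbm, vis_batch::Matrix{Float64}, p::Float64)
    n_hid, n_obs = length(rbm.hbias), size(vis_batch, 2)

    # Retain mask r: each hidden unit is kept with probability p,
    # sampled independently for every training case in the mini-batch.
    r = rand(n_hid, n_obs) .< p

    # Positive phase: hidden activations with dropped units zeroed out.
    h_pos = sample_hiddens(rbm, vis_batch) .* r

    # Negative phase (CD-1): reconstruct and re-sample, reusing the same mask.
    v_neg = sample_visibles(rbm, h_pos)
    h_neg = sample_hiddens(rbm, v_neg) .* r

    return h_pos, v_neg, h_neg
end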

References

Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," JMLR, vol. 15, 2014, pp. 1929-1958.

eric-tramel self-assigned this Sep 9, 2015
@eric-tramel (Author)

I forgot to reference this issue in commit d3c2caa! I have created a separate branch to work on implementing this feature.

Modifications

For my first attempt at implementing this feature, I added an optional parameter to rbm.jl/fit() to allow the user to specify the dropout rate:

function fit(rbm::RBM, X::Mat{Float64};
             persistent=true, lr=0.1, n_iter=10, batch_size=100, n_gibbs=1, dorate=0.0)
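
With this signature, a call from an experiment script would look something like the following (the constructor call, the training matrix X with one sample per column, and the 0.5 rate are illustrative, not taken from the repository):

# Illustrative: an RBM on 28x28 images with 256 hidden units, with hidden
# units dropped at rate 0.5 during training.
rbm = BernoulliRBM(784, 256)
fit(rbm, X; persistent=true, lr=0.1, n_iter=10, batch_size=100, n_gibbs=1, dorate=0.5)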

From here, I tried to take the approach quoted earlier (§8.2 of Srivastava et al., 2014) and apply a different dropout pattern for each training sample in the mini-batch. I accomplish this within rbm.jl/gibbs():

function gibbs(rbm::RBM, vis::Mat{Float64}; n_times=1, dorate=0.0)
    suppressedUnits = rand(size(rbm.hbias,1), size(vis,2)) .< dorate
   ...
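
The elided part of the chain would then thread the mask through each reconstruction, roughly as follows (a sketch of the intent; sample_hiddens and sample_visibles stand in for the package's conditional samplers):

function gibbs(rbm::RBM, vis::Mat{Float64}; n_times=1, dorate=0.0)
    # One dropout decision per hidden unit and per training case.
    suppressedUnits = rand(size(rbm.hbias,1), size(vis,2)) .< dorate

    v_pos = vis
    h_pos = sample_hiddens(rbm, v_pos)
    v_neg, h_neg = v_pos, h_pos
    for i = 1:n_times
        # Dropped hidden units are zeroed before each visible reconstruction.
        v_neg = sample_visibles(rbm, h_neg, suppressedUnits)
        h_neg = sample_hiddens(rbm, v_neg)
    end
    return v_pos, h_pos, v_neg, h_neg
end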

I then modify rbm.jl/sample_visibles() (and the corresponding rbm.jl/vis_means()) to take a logical array specifying the suppressed/dropped hidden units, and to assign zeros to the dropped hidden units before calculating the matrix-matrix product between rbm.W and the hidden activations:

function vis_means(rbm::RBM, hid::Mat{Float64}, suppressedUnits::Mat{Bool})    
    hid[suppressedUnits] = 0.0          # Suppress dropped hidden units
    p = rbm.W' * hid .+ rbm.vbias
    return logistic(p)
end
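
For completeness, the sample_visibles() counterpart simply threads the mask through to vis_means() and draws Bernoulli visibles from the returned means. A sketch of that wrapper (not the exact repository code):

function sample_visibles(rbm::RBM, hid::Mat{Float64}, suppressedUnits::Mat{Bool})
    means = vis_means(rbm, hid, suppressedUnits)   # dropped hidden units already zeroed here
    # Bernoulli draw of each visible unit from its conditional mean.
    return float(rand(size(means)) .< means)
end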

Taken together, this should accomplish dropout.

Doubts

What isn't clear to me is whether the dropout pattern should also change from epoch to epoch. The paper indicates that the pattern should change from mini-batch to mini-batch, but it doesn't say anything about epochs. I am assuming that the pattern is resampled at every mini-batch computation; see the sketch below. If anyone has references to other RBM dropout implementations, they might help clear this up.
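
Concretely, the assumption is that the mask is drawn inside every per-batch update, so no pattern survives across mini-batches or across epochs. A sketch of the outer loop (the batching helper and the fit_batch! name are illustrative placeholders):

for epoch = 1:n_iter
    for batch in make_minibatches(X, batch_size)   # illustrative batching helper
        # gibbs() draws a fresh suppressedUnits mask inside this call, one
        # column per training case, so no dropout pattern is reused across
        # mini-batches or epochs.
        fit_batch!(rbm, batch; n_gibbs=n_gibbs, lr=lr, dorate=dorate)
    end
end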

eric-tramel added a commit that referenced this issue Sep 9, 2015
Made a mistake: the hidden activations are matrices, not vectors, since
we compute a whole mini-batch at once.
eric-tramel added a commit that referenced this issue Sep 9, 2015
Just putting this in to make sure that the package manager in Julia is
working properly.
eric-tramel added a commit that referenced this issue Sep 9, 2015
Wanted to check the size of vis argument, not a parameter of the rbm
class…
@eric-tramel (Author)

Okay, there were some issues (bugs I introduced), but it is now building and passing tests! I'll need to write a dedicated dropout test to ensure that everything is really working correctly.

eric-tramel added a commit that referenced this issue Sep 9, 2015
Adding in a dropout rate to the `fit()` call in the experiment code.
eric-tramel added a commit that referenced this issue Sep 9, 2015
@eric-tramel (Author)

Okay! It works! The issues I was having with the keywords not being recognised were due to the workspace not being cleared before running the mnistexample_dropout.jl script. After clearing the workspace, it runs fine. What remains to be done is a comparison showing that this implementation of dropout really gives some advantage over no dropout.

Thanks @alaa-saade!

@eric-tramel (Author)

So, it seems there is still something to be desired in the dropout performance. Currently there does not seem to be much difference between the pseudo-likelihood obtained with dropout and without it, as shown in the following figure:
[Figure: dropout_trainingpl — training pseudo-likelihood with and without dropout]

I'm going to restructure where the dropout is enforced; perhaps I'm not doing it in the right manner. Referring to this Lua/Torch7 implementation, it seems that we need to make sure to suppress these units in the gradient update as well.
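
Concretely, that would mean masking the hidden statistics before forming the weight gradient, something along these lines (the variable names here are mine, not the package's):

# Zero the entries of the hidden statistics that belong to dropped units
# (per hidden unit and per training case) so they contribute nothing to dW.
h_pos_masked = h_pos .* (1.0 .- suppressedUnits)
h_neg_masked = h_neg .* (1.0 .- suppressedUnits)

# Weight gradient over the mini-batch; this then enters the usual
# momentum / learning-rate update.
dW = (h_pos_masked * v_pos' .- h_neg_masked * v_neg') ./ size(v_pos, 2)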

eric-tramel added a commit that referenced this issue Sep 10, 2015
@krzakala

Interesting. But is it known that the effect of dropout can be seen in the pseudo-likelihood?

@eric-tramel (Author)

@krzakala: I honestly don't know whether the effect can be seen in the PL or not; you may well be right on this point. I'm now working on a demo which also reports the estimated features (W). I'll include a histogram of the hidden activations as well, as was done in (Srivastava et al., 2014), to show the discrepancy between the two approaches.
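
For the histogram, the plan is roughly the following, where hid_means denotes the conditional hidden probabilities (the counterpart of vis_means) and rbm_dropout, rbm_plain, and X_test are placeholders:

# Conditional hidden-unit activation probabilities on held-out data, one model
# trained with dropout and one without; the flattened values are what gets
# histogrammed, as in Srivastava et al. (2014).
acts_dropout  = vec(hid_means(rbm_dropout, X_test))
acts_baseline = vec(hid_means(rbm_plain,   X_test))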

eric-tramel added a commit that referenced this issue Sep 10, 2015
I think I had done something stupid and wasn’t passing the dropout rate
argument properly. Hence, I was getting nearly identical results
between the dropout and non-dropout cases…because they were doing the
exact same thing.
eric-tramel added a commit that referenced this issue Sep 14, 2015
eric-tramel added a commit that referenced this issue Sep 15, 2015
According to the original dropout paper, the use of dropout should
induce a sparser distribution of activations.
eric-tramel added a commit that referenced this issue Sep 16, 2015
It seems the weight decays are working as they should.
eric-tramel added a commit that referenced this issue Sep 16, 2015
After correcting the momentum (no longer scaling it by the learning
rate), we see some odd effects on learning when the momentum is
non-zero. At zero momentum, we get training performance that seems to
make sense, but for non-zero momentum we don't get any kind of
reasonable training.
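
(For reference, the two update conventions being contrasted, with illustrative names: grad_W is the fresh gradient estimate and dW_prev the previous update.)

# Before the correction: the whole velocity, momentum term included, was
# scaled by the learning rate.
dW = lr * (momentum * dW_prev + grad_W)

# After the correction: only the fresh gradient is scaled by the learning
# rate, and the momentum term carries over at full strength.
dW = momentum * dW_prev + lr * grad_W
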
eric-tramel added a commit that referenced this issue Sep 21, 2015
Added in all the speedup and fix patches from the main branch.
eric-tramel added a commit that referenced this issue Sep 21, 2015
Plot the activation histograms side-by-side instead of in a stacked
format.
eric-tramel added a commit that referenced this issue Sep 21, 2015
eric-tramel added a commit that referenced this issue Sep 21, 2015
Adjusting the testing parameters for the dropout version of RBM
training to be closer to the experiments shown in (Srivastava *et al*,
2014).
eric-tramel added a commit that referenced this issue Sep 21, 2015
Now showing the learned weights (receptive fields) and their
distributions.
eric-tramel added a commit that referenced this issue Sep 22, 2015
Changed the test script to run the dropout comparison on the full
dataset in an attempt to match the results in the 2014 paper.