
Implement Dropout #1

Open

eric-tramel opened this issue Sep 9, 2015 · 6 comments

@eric-tramel

Goal

One of the most recent and most effective regularisation techniques for training RBMs is dropout. Unfortunately, the original Boltzmann.jl package does not implement it, so we should take this on ourselves.

Technique

During the training phase of the RBM, each hidden node is present with only probability $p$. Training is performed for this reduced model, and the resulting trained models are then combined. The pertinent section from (Srivastava et al., 2014) reads,

8.2 Learning Dropout RBMs

Learning algorithms developed for RBMs such as Contrastive Divergence (Hinton et al., 2006) can be directly applied for learning Dropout RBMs. The only difference is that r is first sampled and only the hidden units that are retained are used for training. Similar to dropout neural networks, a different r is sampled for each training case in every minibatch. In our experiments, we use CD-1 for training dropout RBMs.
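
In code, the quoted procedure amounts to something like the following rough sketch of a single CD-1 step. Here sample_hiddens and sample_visibles just stand in for whatever conditional samplers the package provides, and the mask shape (one column per training case) follows the quote:

function cd1_dropout_step(rbm, vis_batch::Matrix{Float64}, p::Float64)
    n_hid, n_obs = length(rbm.hbias), size(vis_batch, 2)

    # Retain mask r: each hidden unit is kept with probability p,
    # sampled independently for every training case in the mini-batch.
    r = rand(n_hid, n_obs) .< p

    # Positive phase: hidden activations with dropped units zeroed out.
    h_pos = sample_hiddens(rbm, vis_batch) .* r

    # Negative phase (CD-1): reconstruct and re-sample, reusing the same mask.
    v_neg = sample_visibles(rbm, h_pos)
    h_neg = sample_hiddens(rbm, v_neg) .* r

    return h_pos, v_neg, h_neg
end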

References

Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," JMLR, vol. 15, 2014, pp. 1929-1958.

eric-tramel self-assigned this Sep 9, 2015
@eric-tramel (Author)

I forgot to reference this issue in commit d3c2caa! I have created a separate branch to work on implementing this feature.

Modifications

For my first attempt at implementing this feature, I added an optional parameter to rbm.jl/fit() to allow the user to specify the dropout rate:

function fit(rbm::RBM, X::Mat{Float64};
             persistent=true, lr=0.1, n_iter=10, batch_size=100, n_gibbs=1, dorate=0.0)
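
With this signature, a call from an experiment script would look something like the following (the constructor call, the training matrix X with one sample per column, and the 0.5 rate are illustrative, not taken from the repository):

# Illustrative: an RBM on 28x28 images with 256 hidden units, with hidden
# units dropped at rate 0.5 during training.
rbm = BernoulliRBM(784, 256)
fit(rbm, X; persistent=true, lr=0.1, n_iter=10, batch_size=100, n_gibbs=1, dorate=0.5)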

From here, I tried to take the approach quoted earlier (§8.2 of Srivastava et al., 2014) and apply a different dropout pattern for each training sample in the mini-batch. I accomplish this within rbm.jl/gibbs():

function gibbs(rbm::RBM, vis::Mat{Float64}; n_times=1, dorate=0.0)
    suppressedUnits = rand(size(rbm.hbias,1), size(vis,2)) .< dorate
   ...
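
The elided part of the chain would then thread the mask through each reconstruction, roughly as follows (a sketch of the intent; sample_hiddens and sample_visibles stand in for the package's conditional samplers):

function gibbs(rbm::RBM, vis::Mat{Float64}; n_times=1, dorate=0.0)
    # One dropout decision per hidden unit and per training case.
    suppressedUnits = rand(size(rbm.hbias,1), size(vis,2)) .< dorate

    v_pos = vis
    h_pos = sample_hiddens(rbm, v_pos)
    v_neg, h_neg = v_pos, h_pos
    for i = 1:n_times
        # Dropped hidden units are zeroed before each visible reconstruction.
        v_neg = sample_visibles(rbm, h_neg, suppressedUnits)
        h_neg = sample_hiddens(rbm, v_neg)
    end
    return v_pos, h_pos, v_neg, h_neg
end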

I then modify rbm.jl/sample_visibles() (and the corresponding rbm.jl/vis_means()) to take a logical array specifying the suppressed/dropped hidden units, and to assign zeros to the dropped hidden units before calculating the matrix-matrix product between rbm.W and the hidden activations:

function vis_means(rbm::RBM, hid::Mat{Float64}, suppressedUnits::Mat{Bool})    
    hid[suppressedUnits] = 0.0          # Suppress dropped hidden units
    p = rbm.W' * hid .+ rbm.vbias
    return logistic(p)
end
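
For completeness, the sample_visibles() counterpart simply threads the mask through to vis_means() and draws Bernoulli visibles from the returned means. A sketch of that wrapper (not the exact repository code):

function sample_visibles(rbm::RBM, hid::Mat{Float64}, suppressedUnits::Mat{Bool})
    means = vis_means(rbm, hid, suppressedUnits)   # dropped hidden units already zeroed here
    # Bernoulli draw of each visible unit from its conditional mean.
    return float(rand(size(means)) .< means)
end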

Taken together, this should accomplish dropout.

Doubts

What isn't clear to me is whether the dropout pattern should also change from epoch to epoch. The paper indicates that the pattern should change from mini-batch to mini-batch, but it doesn't say anything about epochs. I am assuming that the pattern is resampled at every mini-batch computation; see the sketch below. If anyone has references to other RBM dropout implementations, they might help clear this up.
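
Concretely, the assumption is that the mask is drawn inside every per-batch update, so no pattern survives across mini-batches or across epochs. A sketch of the outer loop (the batching helper and the fit_batch! name are illustrative placeholders):

for epoch = 1:n_iter
    for batch in make_minibatches(X, batch_size)   # illustrative batching helper
        # gibbs() draws a fresh suppressedUnits mask inside this call, one
        # column per training case, so no dropout pattern is reused across
        # mini-batches or epochs.
        fit_batch!(rbm, batch; n_gibbs=n_gibbs, lr=lr, dorate=dorate)
    end
end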

eric-tramel added a commit that referenced this issue Sep 9, 2015
Made a mistake: the hidden activations are matrices, not vectors, since
we compute a whole mini-batch at once.
eric-tramel added a commit that referenced this issue Sep 9, 2015
Just putting this in to make sure that the package manager in Julia is
working properly.
eric-tramel added a commit that referenced this issue Sep 9, 2015
Wanted to check the size of vis argument, not a parameter of the rbm
class…
@eric-tramel (Author)

Okay, there were some issues (bugs I introduced), but it is now building and passing tests! I'll need to write a dedicated dropout test to ensure that everything is really working correctly.

eric-tramel added a commit that referenced this issue Sep 9, 2015
Adding in a dropout rate to the `fit()` call in the experiment code.
eric-tramel added a commit that referenced this issue Sep 9, 2015
@eric-tramel (Author)

Okay! It works! The issues I was having with the keywords not being recognised were due to the workspace not being cleared before running the mnistexample_dropout.jl script. After clearing the workspace, it runs fine. What remains to be done is a comparison showing that this implementation of dropout really gives some advantage over no dropout.

Thanks @alaa-saade!

@eric-tramel (Author)

So, it seems there is still something to be desired in the dropout performance. Currently there does not seem to be much difference between the pseudo-likelihood obtained with dropout and without it, as shown in the following figure:
[Figure: dropout_trainingpl — training pseudo-likelihood with and without dropout]

I'm going to restructure where the dropout is enforced; perhaps I'm not doing it in the right manner. Referring to this Lua/Torch7 implementation, it seems that we need to make sure to suppress these units in the gradient update as well.
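
Concretely, that would mean masking the hidden statistics before forming the weight gradient, something along these lines (the variable names here are mine, not the package's):

# Zero the entries of the hidden statistics that belong to dropped units
# (per hidden unit and per training case) so they contribute nothing to dW.
h_pos_masked = h_pos .* (1.0 .- suppressedUnits)
h_neg_masked = h_neg .* (1.0 .- suppressedUnits)

# Weight gradient over the mini-batch; this then enters the usual
# momentum / learning-rate update.
dW = (h_pos_masked * v_pos' .- h_neg_masked * v_neg') ./ size(v_pos, 2)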

eric-tramel added a commit that referenced this issue Sep 10, 2015
@krzakala

Interesting. But is it known that the effect of dropout can be seen in the pseudo-likelihood?

@eric-tramel (Author)

@krzakala: I honestly don't know whether the effect can be seen in the PL or not; you may well be right on this point. I'm now working on a demo which also reports the estimated features (W). I'll include a histogram of the hidden activations as well, as was done in (Srivastava et al., 2014), to show the discrepancy between the two approaches.
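
For the histogram, the plan is roughly the following, where hid_means denotes the conditional hidden probabilities (the counterpart of vis_means) and rbm_dropout, rbm_plain, and X_test are placeholders:

# Conditional hidden-unit activation probabilities on held-out data, one model
# trained with dropout and one without; the flattened values are what gets
# histogrammed, as in Srivastava et al. (2014).
acts_dropout  = vec(hid_means(rbm_dropout, X_test))
acts_baseline = vec(hid_means(rbm_plain,   X_test))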

eric-tramel added a commit that referenced this issue Sep 10, 2015
I think I had done something stupid and wasn’t passing the dropout rate
argument properly. Hence, I was getting nearly identical results
between the dropout and non-dropout cases…because they were doing the
exact same thing.
eric-tramel added a commit that referenced this issue Sep 14, 2015
eric-tramel added a commit that referenced this issue Sep 15, 2015
According to the original dropout paper, the use of dropout should
induce a sparser distribution of activations.
eric-tramel added a commit that referenced this issue Sep 16, 2015
It seems the weight decays are working as they should.
eric-tramel added a commit that referenced this issue Sep 16, 2015
After correcting the momentum (no longer scaling it by the learning
rate), we see some odd effects on learning when the momentum is
non-zero. At zero momentum, we get training performance that seems to
make sense, but for non-zero momentum we don't get any kind of
reasonable training.
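
(For reference, the two update conventions being contrasted, with illustrative names: grad_W is the fresh gradient estimate and dW_prev the previous update.)

# Before the correction: the whole velocity, momentum term included, was
# scaled by the learning rate.
dW = lr * (momentum * dW_prev + grad_W)

# After the correction: only the fresh gradient is scaled by the learning
# rate, and the momentum term carries over at full strength.
dW = momentum * dW_prev + lr * grad_W
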
eric-tramel added a commit that referenced this issue Sep 21, 2015
Added in all the speedup and fix patches from the main branch.
eric-tramel added a commit that referenced this issue Sep 21, 2015
Plot the activation histograms side-by-side instead of in a stacked
format.
eric-tramel added a commit that referenced this issue Sep 21, 2015
eric-tramel added a commit that referenced this issue Sep 21, 2015
Adjusting the testing parameters for the dropout version of RBM
training to be closer to the experiments shown in (Srivastava *et al*,
2014).
eric-tramel added a commit that referenced this issue Sep 21, 2015
Now showing the learned weights (receptive fields) and their
distributions.
eric-tramel added a commit that referenced this issue Sep 22, 2015
Changed the test script to run the dropout comparison on the full
dataset in an attempt to match the results in the 2014 paper.