
Experiment 1a (Simple Analysis of SWAG) #1

tmgrgg opened this issue Jul 24, 2020 · 1 comment
tmgrgg commented Jul 24, 2020

Analysis of performance of SWAG versus approximation rank (1a)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the SWAG posterior approximation improves performance as the approximation rank increases. Since I was writing most of the code alongside running the experiments, the experiments themselves were run in notebooks on a Colab GPU; I intend to standardise these so that parameterised experiments can be run from the command line, but for now I'll just link the notebooks.

1. I used SGD to descend to a suitably strong mode which would act as the pretrained solution for the SWAG sampling process.

LR_INIT = 0.1
MOMENTUM = 0.85
L2 = 1e-4

Note that rather than using a final learning rate (SWA_LR) of 0.05, as in the original paper, I used 0.005, as this appeared to lead to a more stable mode (suggesting that with this learning rate the SGD iterates had reached a suitable stationary distribution). The final training graph looks like this:

[Figure: pretrained training graph]

Final pretrained model performance is:

::: Train :::
 {'loss': 0.034561568461060524, 'accuracy': 99.41}
::: Valid :::
 {'loss': 0.23624498672485353, 'accuracy': 92.41}
::: Test :::
 {'loss': 0.2568485828399658, 'accuracy': 92.28}

Note also that, when training the initial solution, I adopted the same learning rate schedule as in the original paper, namely:

def schedule(lr_init, epoch, max_epochs):
    # Hold lr_init for the first half of training, linearly anneal towards
    # FINAL_LR (the final/SWA learning rate, 0.005 here) between 50% and 90%
    # of training, then hold FINAL_LR constant for the remainder.
    t = epoch / max_epochs
    lr_ratio = FINAL_LR / lr_init
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - (1.0 - lr_ratio) * (t - 0.5) / 0.4
    else:
        factor = lr_ratio
    return lr_init * factor
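
For concreteness, this is roughly how the schedule and the hyperparameters above plug into a standard PyTorch SGD loop; model, train_loader and MAX_EPOCHS are placeholders rather than the actual objects and values from the notebook:

import torch
import torch.nn.functional as F

LR_INIT, MOMENTUM, L2 = 0.1, 0.85, 1e-4
FINAL_LR = 0.005   # the final (SWA) learning rate discussed above
MAX_EPOCHS = 100   # placeholder value

optimizer = torch.optim.SGD(model.parameters(), lr=LR_INIT,
                            momentum=MOMENTUM, weight_decay=L2)

for epoch in range(MAX_EPOCHS):
    lr = schedule(LR_INIT, epoch, MAX_EPOCHS)
    for group in optimizer.param_groups:
        group['lr'] = lr  # apply the annealed learning rate for this epoch
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()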

2. I then used this pretrained solution to build a SWAG model (more as a test of the Posterior and Sampling classes); a rough sketch of the statistics being collected follows the parameter list below.

The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing

The SWAG model in the notebook above was trained with the following parameters:

SWA_LR = 0.005
SWA_MOMENTUM = 0.85
L2 = 1e-4
RANK = 30
SAMPLES_PER_EPOCH = 1
SAMPLE_FREQ = int((1/SAMPLES_PER_EPOCH)*len(train_set)/batch_size)
SAMPLING_CONDITION = lambda: True
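
Roughly speaking, SWAG keeps a running first and second moment of the SGD iterates plus the last RANK deviation vectors, which together define a diagonal-plus-low-rank Gaussian posterior. Below is a minimal sketch of that collection and sampling logic for flattened parameters; it is an illustration under those assumptions, not the Posterior and Sampling classes used in the notebook:

import math
import torch

def flatten(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

class SWAGSketch:
    """Illustrative SWAG statistics collector (not the repo's implementation)."""

    def __init__(self, model, rank=30):
        w = flatten(model)
        self.rank = rank
        self.n = 0
        self.mean = torch.zeros_like(w)
        self.sq_mean = torch.zeros_like(w)
        self.dev_cols = []  # last `rank` deviation vectors (columns of D)

    def collect(self, model):
        # Called once every SAMPLE_FREQ batches while training at SWA_LR.
        w = flatten(model)
        self.n += 1
        self.mean += (w - self.mean) / self.n              # running first moment
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n   # running second moment
        self.dev_cols.append(w - self.mean)
        if len(self.dev_cols) > self.rank:
            self.dev_cols.pop(0)

    def sample(self):
        # Diagonal + low-rank Gaussian draw, as in the SWAG paper:
        # w = mean + sqrt(diag) * z1 / sqrt(2) + D @ z2 / sqrt(2 * (K - 1))
        diag = torch.clamp(self.sq_mean - self.mean ** 2, min=0.0)
        w = self.mean + diag.sqrt() * torch.randn_like(self.mean) / math.sqrt(2.0)
        if len(self.dev_cols) > 1:
            D = torch.stack(self.dev_cols, dim=1)
            z2 = torch.randn(D.shape[1])
            w = w + D @ z2 / math.sqrt(2.0 * (D.shape[1] - 1))
        return w

The sampled vector then has to be loaded back into the model (and, in the original implementation, the BatchNorm statistics recomputed) before evaluating; that is what the BMA sketch further below assumes.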

I plotted the SWA performance (the SWA solution is the mean of the SWAG posterior) as a proxy for the validation learning curve of the SWAG solution.
[Figure: SWA training curve]

I also plotted the number of samples drawn for Bayesian Model Averaging against performance of the final SWAG model:
[Figure: BMA samples vs. final SWAG model performance]

Looks weird. May want to investigate the code.
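
For context, the Bayesian model average here is just the average of the predictive distributions over N posterior draws. A minimal sketch, assuming some sample_fn that loads one sampled weight vector into the model (e.g. a draw from the sketch above followed by a BatchNorm refresh); the helper names are hypothetical, not the notebook's API:

import torch

def bma_predict(model, loader, sample_fn, n_samples=30):
    """Average the softmax predictions of n_samples posterior draws (sketch only)."""
    avg_probs = None
    for _ in range(n_samples):
        sample_fn(model)  # hypothetical: load one sampled weight vector into the model
        model.eval()
        with torch.no_grad():
            probs = torch.cat([torch.softmax(model(x), dim=-1) for x, _ in loader])
        avg_probs = probs if avg_probs is None else avg_probs + probs
    return avg_probs / n_samples

Evaluating the running average after each draw, rather than only at the end, is what produces the samples-vs-performance curve above.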

3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 SWAG epochs and drew one sample per epoch (same as above).

The training notebook is here: https://colab.research.google.com/drive/1L2D7aAXxOdrhK-Kk3FxUs9vhsjP_vTf6?usp=sharing
Analysis notebook is here: https://colab.research.google.com/drive/1DHbudUH2BFdlJgmCopv93arsCx9advu6?usp=sharing
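
The sweep itself amounts to repeating the SWAG collection for each rank k and evaluating with a fixed BMA budget; a rough sketch with hypothetical train_swag and evaluate_bma helpers (placeholder names, not the functions used in the notebooks):

# Hypothetical sweep over approximation ranks.
results = {}
for rank in range(1, 31, 2):
    swag_posterior = train_swag(pretrained_model, rank=rank,
                                swag_epochs=100, samples_per_epoch=1)
    results[rank] = evaluate_bma(swag_posterior, valid_loader, n_samples=30)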

What I've found so far is a little surprising (though it could very well be down to an implementation error, or some other problem with my experiment, e.g. poor parameter choices):

SWAG rank versus SWAG performance on train and validation data (with N = 30 samples for Bayesian model averaging):
[Figure: SWAG rank vs. performance]

I also plotted the SWA performance (we would expect this to be essentially constant as SWA is independent of rank).
[Figure: SWA performance vs. SWAG rank]

tmgrgg commented Aug 7, 2020

After fixing a number of issues and comparing my implementation against a run of the original authors' implementation to check its correctness, I am now confident that the two implementations agree, and I have obtained the following results, which replace those above:

This is with 30 samples in the Bayesian model averages and 50 "SWAG samples":
[Screenshot: results with 50 SWAG samples]

This is with 30 samples in the Bayesian model averages and 150 "SWAG samples":
[Screenshot: results with 150 SWAG samples]

A much clearer improvement from including local uncertainty!
