Analysis of performance of SWAG versus approximation rank (1a)
(Dataset: FashionMNIST, Model: DenseNet with depth 10)
I wanted to get an idea of how the SWAG posterior approximation improves performance as the approximation rank increases. Since I was writing most of the code alongside running the experiments, the experiments themselves were run in notebooks on a Colab GPU. I intend to standardise these so that parameterised experiments can be run from the command line; for now I'll just link the notebooks.
1. I used SGD to descend to a suitably strong mode, which would act as the pretrained solution for the SWAG sampling process.
The notebook is: https://colab.research.google.com/drive/1eCTHFa5vooirGJi9YdsXo080PoMCdb-d
and the results of the grid search can be found here: https://drive.google.com/file/d/1-ZWK0jN8cLK8Cw9D7rK7OXz0XxXyWCqC/view?usp=sharing
Notebook: https://colab.research.google.com/drive/110ae3O6WoMqlqpvwH8LnmyTTpoKzFjDr#scrollTo=Z5suM6SZQmEp
Note that rather than using a final learning rate (SWA_LR) of 0.05, as in the original paper, I used 0.005, as this appeared to lead to a more stable mode (suggesting that with this learning rate the SGD iterates had reached a suitable stationary distribution). The final training graph looks like this:
Final pretrained model performance is:
Note also that I adopted the same learning rate schedule when training the initial solution as in the original paper, namely:
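For reference, a rough sketch of that piecewise schedule (hold the initial learning rate, decay linearly to SWA_LR, then hold SWA_LR); the exact breakpoints of 50% and 90% of training are assumptions from my reading of the SWA/SWAG reference code, not something verified here:

```python
def swa_lr_schedule(epoch, num_epochs, lr_init=0.05, swa_lr=0.005):
    """Hold lr_init, decay linearly to swa_lr, then hold swa_lr (sketch only)."""
    t = epoch / num_epochs
    ratio = swa_lr / lr_init
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - (1.0 - ratio) * (t - 0.5) / 0.4
    else:
        factor = ratio
    return lr_init * factor


# Applied once per epoch, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = swa_lr_schedule(epoch, num_epochs)
```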
2. I then used this pretrained solution to build a SWAG model (more as a test for the Posterior and Sampling classes).
The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing
The model in the notebook above was trained with the following parameters:
I plotted the SWA curve (equivalent to the expected value of SWAG) as a proxy for the validation learning curve of the SWAG solution.
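As a rough sketch of what the Posterior class needs to accumulate per snapshot (the names below are illustrative, not the actual classes in this repo): a running mean of the weights (which is exactly the SWA solution), a running second moment for the diagonal covariance, and the last K deviation vectors for the low-rank term.

```python
from collections import deque

import torch


class SwagMoments:
    """Running SWAG statistics over flattened weight snapshots (illustrative sketch)."""

    def __init__(self, max_rank: int):
        self.n = 0
        self.mean = None       # running first moment -> the SWA solution
        self.sq_mean = None    # running second moment -> diagonal covariance
        self.deviations = deque(maxlen=max_rank)  # last K deviations -> low-rank part

    def collect(self, flat_params: torch.Tensor) -> None:
        """Fold one snapshot of the (flattened) weights into the running moments."""
        if self.n == 0:
            self.mean = flat_params.clone()
            self.sq_mean = flat_params.clone() ** 2
        else:
            self.mean = (self.n * self.mean + flat_params) / (self.n + 1)
            self.sq_mean = (self.n * self.sq_mean + flat_params ** 2) / (self.n + 1)
        self.deviations.append(flat_params - self.mean)
        self.n += 1
```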
I also plotted the number of samples drawn for Bayesian model averaging against the performance of the final SWAG model:
Looks weird. May want to investigate the code.
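For context, the Bayesian model averaging here just averages the predictive distribution over N weight draws from the fitted SWAG posterior. A minimal sketch, assuming a hypothetical helper that draws one weight sample, loads it into the model, and refreshes the batch-norm statistics (as in the paper's evaluation procedure):

```python
import torch


@torch.no_grad()
def bma_predict(model, loader, draw_and_load_sample, n_samples=30, device="cuda"):
    """Average softmax predictions over n_samples SWAG weight draws.

    `draw_and_load_sample` is a hypothetical helper: it samples one set of
    weights from the SWAG posterior, loads it into `model`, and updates the
    batch-norm statistics before evaluation.
    """
    model.eval()
    mean_probs = None
    for _ in range(n_samples):
        draw_and_load_sample(model)
        batch_probs = []
        for inputs, _ in loader:
            logits = model(inputs.to(device))
            batch_probs.append(torch.softmax(logits, dim=-1).cpu())
        probs = torch.cat(batch_probs)
        mean_probs = probs if mean_probs is None else mean_probs + probs
    return mean_probs / n_samples
```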
3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 swag epochs and performed one sample per epoch (same as above).
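To be explicit about what the rank controls: k is the number of deviation vectors kept for the low-rank part of the SWAG covariance, and a weight sample is drawn roughly as theta = theta_SWA + (1/sqrt(2)) * Sigma_diag^(1/2) * z1 + (1/sqrt(2(K-1))) * D_hat * z2 (Maddox et al., 2019). A sketch reusing the moment names from the snippet above; note that the mean (the SWA solution) does not depend on k at all:

```python
import torch


def sample_swag_weights(mean, sq_mean, deviations, scale=1.0):
    """Draw one weight vector from the SWAG posterior (diagonal + rank-K part).

    `deviations` is the (K, num_params) matrix of stored deviation vectors;
    only this term depends on the approximation rank K, the mean (SWA) does not.
    """
    K = deviations.shape[0]
    diag_var = torch.clamp(sq_mean - mean ** 2, min=1e-30)
    z1 = torch.randn_like(mean)
    z2 = torch.randn(K)
    diag_term = diag_var.sqrt() * z1 / (2 ** 0.5)
    # max(K - 1, 1) guards the K = 1 case in this sketch
    low_rank_term = (deviations.t() @ z2) / ((2 * max(K - 1, 1)) ** 0.5)
    return mean + scale * (diag_term + low_rank_term)
```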
What I've found so far is a little surprising (though it could very well be down to an implementation error, or some other problem with my experiment, e.g. poor parameter choices):
SWAG rank versus SWAG performance on train and validation data (with N = 30 for Bayesian model averaging).
I also plotted the SWA performance (we would expect this to be essentially constant as SWA is independent of rank).
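For completeness, the whole sweep boils down to a loop like the one below; `fit_swag` and `bma_accuracy` are hypothetical stand-ins for the training and evaluation routines in the notebooks, not functions from this repo:

```python
def rank_sweep(fit_swag, bma_accuracy, ranks=range(1, 31, 2)):
    """Sweep the SWAG approximation rank and record SWA / SWAG BMA accuracy.

    `fit_swag` and `bma_accuracy` are hypothetical callables standing in for
    the routines in the training and analysis notebooks above.
    """
    results = {}
    for k in ranks:
        posterior = fit_swag(rank=k, swag_epochs=100, snapshots_per_epoch=1)
        results[k] = {
            "swa": bma_accuracy(posterior, n_samples=1, mean_only=True),  # SWA baseline
            "swag": bma_accuracy(posterior, n_samples=30),                # full BMA
        }
    return results
```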
After fixing a number of issues and comparing my implementation against a run of the original authors' implementation to check its correctness, I am now confident that the two implementations agree, and I have obtained the following results, which replace those above:
This is with 30 samples in the Bayesian model averages and 50 "SWAG samples" (collected snapshots):
This is with 30 samples in the Bayesian model averages and 150 "SWAG samples":
A much clearer improvement from including local uncertainty!