Hi, I am from Stack Overflow. I am trying to understand your implementation of the paper "Extrapolating of Learning Curve ..". As far as I understand, they use 11 different mathematical models to fit the learning curve and then predict with a Monte Carlo estimator. But I can't find where in your code you build these models or where the Monte Carlo calculations are. Can you please clarify? Thanks.
maxim5 commented on Jan 19, 2018
Hi @baothienpp,
I've implemented LinearCurvePredictor, which is a simple but rather efficient method. In my experiments it was good enough and saved ~50% of training time, though I haven't tried more sophisticated models. The downside is that it requires a burn-in period of ~20-25 full training cycles before it can understand the learning curves. See the code in curve_predictor.py. Feel free to implement BaseCurvePredictor if you wish to try any other approximator.
baothienpp commented on Jan 19, 2018
That sounds interesting, because I tried the implementation from the paper. It took a lot of computational power because of the Monte Carlo calculation. I am trying to understand your method. Could you tell me more about the concept behind it, or is it very similar to the paper?
maxim5 commented on Jan 19, 2018
Yeah, I can imagine.
It's a simple linear regression, implemented by applying the normal equation. The whole math is in the _compute_matrix method; everything around it is just there to make it nicer. Intuitively, it computes an average learning curve from the set of existing ones. The stop condition is that the current learning curve is significantly worse than the curves seen so far. Until you have tens of thousands of learning curves, it's very efficient.
baothienpp commented on Jan 19, 2018
May I ask why you don't fit a polynomial instead of a linear model? Do you think we could use a Gaussian process with a squared exponential kernel to model the learning curve?
baothienpp commented on Jan 21, 2018
Hi, I think I figured out why you don't use a polynomial: you fit a linear model on a set of learning curves (multivariable regression). At first, I understood that you fit a linear model on every single curve and make predictions based on that. So that means the burn-in period is the set of learning curves you have to collect first. Did I understand you correctly?
maxim5 commented on Jan 22, 2018
Hi @baothienpp,
Correct, the features are the whole curve. So the predictor doesn't try to learn trends or anything like that; it compares the given curve to the set of previous ones and checks the probability that it'll be better. The burn-in period is basically the training data for the predictor.
I'm sure there are more sophisticated models, and I'd love to have more implementations in the library. If you're interested in contributing, I'd be happy to merge it.
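A minimal sketch of the idea described in this reply, not the code from curve_predictor.py: the first few epochs of each earlier curve are the features, the curve's final value is the target, and the weights come from the normal equation. The function names, the synthetic curves, the prefix length, and the `margin` threshold are all assumptions made for illustration.

```python
import numpy as np


def fit_normal_equation(prefixes, finals):
    """Least-squares weights for: final value ~ bias + curve prefix."""
    X = np.hstack([np.ones((prefixes.shape[0], 1)), prefixes])  # bias column
    # Normal equation: solve (X^T X) w = X^T y.
    # Assumes more burn-in curves than features, so X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ finals)


def predict_final(weights, prefix):
    """Predicted final value of a curve from its first few points."""
    return float(np.concatenate(([1.0], prefix)) @ weights)


def should_stop(weights, prefix, best_so_far, margin=0.05):
    """Hypothetical stop rule: prediction clearly below the best curve so far."""
    return predict_final(weights, prefix) < best_so_far - margin


# Synthetic burn-in set (an assumption): 25 saturating curves of 10 epochs.
rng = np.random.default_rng(0)
n_curves, n_epochs, prefix_len = 25, 10, 4
shape = 1.0 - np.exp(-np.arange(1, n_epochs + 1) / 3.0)
ceilings = rng.uniform(0.5, 0.9, size=n_curves)
history = ceilings[:, None] * shape + rng.normal(0.0, 0.01, size=(n_curves, n_epochs))

weights = fit_normal_equation(history[:, :prefix_len], history[:, -1])
best_so_far = float(history[:, -1].max())

weak_prefix = 0.3 * shape[:prefix_len]     # a run that is clearly lagging
strong_prefix = 0.95 * shape[:prefix_len]  # a run ahead of everything seen

print(should_stop(weights, weak_prefix, best_so_far))
print(should_stop(weights, strong_prefix, best_so_far))
```

Because the regression is fitted across curves rather than along a single curve, it needs the burn-in set before it can say anything, which matches the "training data for the predictor" remark above.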
maxim5 commented on Jan 22, 2018
By the way, I've added a bunch of examples lately. Please take a look; I'm looking forward to your feedback.
baothienpp commented on Jan 23, 2018
Thanks for those examples, they really help. I am thinking about using Bayesian linear regression (BLR) instead of simple linear regression. The BLR output is a normal distribution, so we could use simple math to calculate the probability that a learning curve will be good or bad. I will try it first and report later. Generally, I like the idea of using simple regression over the model in the paper, which has just too much computational overhead.
maxim5 commented on Jan 24, 2018
@baothienpp Sounds great. Looking forward to seeing your model in action. When you test it, take a look at the tests.
baothienpp commented on Feb 3, 2018
Hi Maxim, a short unrelated question: if I want to use your idea in some of my work, how can I cite you?
maxim5 commented on Feb 3, 2018
Hi @baothienpp
That'll be great if you do this. Please use this code:
Of course, I'll be curious to read the paper once it's out, so don't forget to post the link here ;)
baothienpp commented on Feb 6, 2018
Thanks! Unfortunately it is something for work, so I can't publish it :( , but don't worry, I cited you. It seems like your framework can only handle a single GPU. Any chance of multi-GPU support?
baothienpp commented on Feb 15, 2018
So I did build a new model using your idea. I used Bayesian ridge regression. Basically, in linear regression you minimize the MSE, and in ridge regression you minimize the MSE plus an L2 regularization term; for more detail you can read here. Bayesian ridge regression is then the probabilistic version of ridge regression, whose output is a mean and a variance. I then calculate the probability that the current curve reaches a higher value than the previous best; the formula is exactly the one in Probability of Improvement. I tested it with your cifar10 learning curve set. Here is the result (the dashed lines are the curves used in the burn-in).
With a burn-in period as small as 5, it still predicts well.
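For reference, the quantities mentioned in this comment can be written out as follows. The notation is an assumption made here, not taken from the thread: $X$ holds one learning-curve prefix per row, $y$ the corresponding final values, $w$ the regression weights, $\lambda$ the regularization strength, $\mu$ and $\sigma$ the predictive mean and standard deviation for the current curve, $y^{*}$ the best value seen so far, and $\Phi$ the standard normal CDF.

```latex
% Linear regression: minimize the squared error
\min_{w} \; \lVert y - Xw \rVert_2^2

% Ridge regression: squared error plus an L2 penalty
\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2

% Probability of improvement for a Gaussian prediction N(\mu, \sigma^2)
\mathrm{PI} = P(y > y^{*}) = 1 - \Phi\!\left(\frac{y^{*} - \mu}{\sigma}\right)
            = \Phi\!\left(\frac{\mu - y^{*}}{\sigma}\right)
```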
maxim5 commented on Feb 15, 2018
This looks really impressive: the burn-in period 5 is very low! Thanks for the update.
If you can make a pull request or somehow share your code, I'd incorporate it in the lib, and it looks like a good default. Otherwise I'll try to replicate your results from scratch.
maxim5 commented on Feb 15, 2018
Sorry, I forgot about your question: right now, the model itself can go multi-GPU and that's it. I'd implement distributed training at the library level, but I think the trivial Bayesian optimization would assign the same hyper-parameters to all GPUs, so that doesn't make sense. It should be a bit smarter and run different optimizations in parallel, e.g., UCB on GPU 0 and the PI method on GPU 1.
baothienpp commented on Feb 17, 2018
I am currently a bit busy, but I will soon upload some short code to show how I did it, because I implemented it differently from your interface. Another question: does the portfolio strategy you used randomly choose a utility function at every iteration?
maxim5 commented on Feb 17, 2018
OK. No problem.
Yes, see BayesianPortfolioStrategy. It is possible to fix the distribution over utilities; otherwise it will construct a distribution based on their performance.
baothienpp commented on Feb 22, 2018
So I am going to briefly describe my method. I used scikit-learn to implement BRR (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge). It has two methods, fit() and predict(); it is important to set the parameter return_std in predict() to True. So now you have the prediction and the std. To calculate the probability, I used the scipy package to calculate the CDF:
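A minimal sketch of the steps described above, written as an illustration rather than as the code from this comment. It assumes synthetic learning curves instead of the library's JSON curve set, a hypothetical helper `probability_of_improvement`, a burn-in of 5 curves, and the first 4 epochs of each curve as features.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import BayesianRidge

# Synthetic burn-in set (an assumption): 5 saturating curves of 10 epochs.
rng = np.random.default_rng(0)
n_curves, n_epochs, prefix_len = 5, 10, 4
shape = 1.0 - np.exp(-np.arange(1, n_epochs + 1) / 3.0)
ceilings = rng.uniform(0.5, 0.9, size=n_curves)
history = ceilings[:, None] * shape + rng.normal(0.0, 0.01, size=(n_curves, n_epochs))

# Features: the first epochs of each curve. Target: its final value.
model = BayesianRidge()
model.fit(history[:, :prefix_len], history[:, -1])
best_so_far = float(history[:, -1].max())


def probability_of_improvement(model, prefix, best_so_far):
    """P(final value > best_so_far) under the Gaussian prediction."""
    mean, std = model.predict(prefix.reshape(1, -1), return_std=True)
    # 1 - CDF(best) is the probability mass above the current best.
    return float(1.0 - norm.cdf(best_so_far, loc=mean[0], scale=std[0]))


pi_weak = probability_of_improvement(model, 0.3 * shape[:prefix_len], best_so_far)
pi_strong = probability_of_improvement(model, 0.95 * shape[:prefix_len], best_so_far)
print(pi_weak, pi_strong)
```

A run could then be stopped early when this probability drops below a chosen threshold; the threshold itself is a tuning choice that the thread does not specify.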
maxim5 commented on Feb 23, 2018
Got it. Do you use the same data as I did, i.e. the set of learning curves?
baothienpp commented on Feb 23, 2018
Yes, I used the curves in your JSON file.