
Curve Predictor #3

Open
baothienpp opened this issue Jan 19, 2018 · 20 comments
@baothienpp

Hi, I'm from Stack Overflow. I am trying to understand your implementation of the paper "Extrapolating of Learning Curve ..". As far as I understand, they use 11 different mathematical models to fit the learning curve and then predict with a Monte Carlo estimator. But I can't find where in your code these models are built and where the Monte Carlo calculation happens. Can you please clarify? Thanks.

@maxim5
Owner

maxim5 commented Jan 19, 2018

Hi @baothienpp,

I've implemented LinearCurvePredictor, which is a simple but rather efficient method. In my experiments, it was good enough and saved ~50% of training time, though I haven't tried more sophisticated models. The downside is that it requires a burn-in period of ~20-25 full training cycles before it can model the learning curves.

See the code in curve_predictor.py. Feel free to implement BaseCurvePredictor if you wish to try any other approximator.

@baothienpp
Author

That sounds interesting, because I tried the implementation from the paper. It took a lot of computational power because of the Monte Carlo calculation. I am trying to understand your method; could you tell me more about the concept behind it, or is it very similar to the paper?

@maxim5
Owner

maxim5 commented Jan 19, 2018

It took a lot of computational power because of the Monte Carlo calculation.

Yeah, I can imagine.

I am trying to understand your method; could you tell me more about the concept behind it, or is it very similar to the paper?

It's a simple linear regression, implemented via the normal equation. The whole math is in the _compute_matrix method; everything around it is just there to make it nicer. Intuitively, it computes an average learning curve from the set of existing ones. The stop condition is that the current learning curve is significantly worse than the curves seen so far. Until you have tens of thousands of learning curves, it's very efficient.
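Roughly, the idea looks like this (an illustrative sketch, not the library's actual code; the names X, y, fit_curves, and predict_final are made up):

    import numpy as np

    # Rows of X are burn-in learning curves truncated to the current length;
    # y holds each curve's final value.
    def fit_curves(X, y):
        # Normal equation w = (X^T X)^{-1} X^T y, via the stabler lstsq.
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def predict_final(w, partial_curve):
        # Estimated final value of a new, partially observed curve.
        return float(np.dot(partial_curve, w))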

@baothienpp
Author

May I ask why you don't fit a polynomial instead of a linear model? Do you think we could use a Gaussian process with a squared-exponential kernel to model the learning curve?

@baothienpp
Author

Hi, I think I figured out why you don't use a polynomial: you fit a linear model on a set of learning curves (multivariate regression). At first, I thought you fit a line to every single curve and made predictions based on that. So the burn-in period is the set of learning curves you have to collect first; did I understand you correctly?

@maxim5
Owner

maxim5 commented Jan 22, 2018

Hi @baothienpp ,

Correct, the features are the whole curve. So the predictor doesn't try to learn trends or anything like that; it compares the given curve to the set of previous ones and checks the probability that it'll be better. The burn-in period is basically the training data for the predictor.
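To make the stop check concrete, a hypothetical version on top of the sketch above (best_so_far and margin are illustrative, not actual library parameters):

    def should_stop(w, partial_curve, best_so_far, margin=0.05):
        # Stop early if the predicted final value is significantly worse
        # than the best result observed so far.
        return predict_final(w, partial_curve) < best_so_far - margin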

I'm sure there are more sophisticated models, and I'd love to have more implementations in the library. If you're interested in contributing, I'd be happy to merge it.

@maxim5
Owner

maxim5 commented Jan 22, 2018

By the way, I've added a bunch of examples lately. Please take a look; I'm looking forward to your feedback.

@baothienpp
Author

baothienpp commented Jan 23, 2018

Thanks for those examples, they really help. I am thinking about using Bayesian linear regression (BLR) instead of simple linear regression. The BLR output is a normal distribution, so we can use simple math to calculate the probability that a learning curve will be good or bad. I will try it first and report back later. Generally, I like the idea of using simple regression over the model in the paper; the latter is just too much computational overhead.

@maxim5
Owner

maxim5 commented Jan 24, 2018

@baothienpp Sounds great. Looking forward to seeing your model in action. When you test it, take a look at the tests.

@baothienpp
Author

Hi Maxim, a short unrelated question: if I want to use your idea in some of my work, how can I cite you?

@maxim5
Owner

maxim5 commented Feb 3, 2018

Hi @baothienpp

It'd be great if you do. Please use this BibTeX entry:

@article{podkolzine17,
  author  = {Maxim Podkolzine},
  title   = {Hyper-Engine: Hyper-parameters Tuning for Machine Learning},
  journal = {https://github.com/maxim5/hyper-engine},
  year    = {2017},
}

Of course, I'll be curious to read the paper once it's out, so don't forget to post the link here ;)

@baothienpp
Author

baothienpp commented Feb 6, 2018

Thanks! Unfortunately it is something for work, so I can't publish it :( But don't worry, I cited you. It seems like your framework can only handle a single GPU; any chance of multi-GPU support?

@baothienpp
Author

baothienpp commented Feb 15, 2018

So I built a new model using your idea: Bayesian ridge regression (BRR). Basically, in linear regression you minimize the MSE, while in ridge regression you minimize MSE + L2 regularization; for more detail you can read here. Bayesian ridge regression is then the probabilistic version of ridge regression, whose output is a mean and a variance. I then calculate the probability that the current curve yields a higher best value than the previous best; the formula is exactly the one in Probability of Improvement (spelled out below). I tested it with your cifar10 learning curve set. Here is the result (the dashed lines are the curves used in burn-in):

[figure: curves_compare — predicted vs. actual cifar10 learning curves; dashed lines are the burn-in curves]

With a burn-in period as small as 5, it still gives good predictions.
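For reference, the Probability of Improvement formula used here, with BRR posterior mean μ, posterior std σ, and current best value y*:

    PI = P(y > y*) = 1 − Φ((y* − μ) / σ) = Φ((μ − y*) / σ)

where Φ is the standard normal CDF.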

@maxim5
Owner

maxim5 commented Feb 15, 2018

This looks really impressive: a burn-in period of 5 is very low! Thanks for the update.
If you can make a pull request or somehow share your code, I'd incorporate it into the lib; it looks like a good default. Otherwise I'll try to replicate your results from scratch.

@maxim5
Owner

maxim5 commented Feb 15, 2018

Sorry, I forgot about your question: right now the model itself can go multi-GPU, and that's it. I'd implement distributed training at the library level, but trivial Bayesian optimization would assign the same hyper-parameters to all GPUs, so that doesn't make sense. It should be a bit smarter and run different optimizations in parallel, e.g., UCB on GPU 0 and the PI method on GPU 1.

@baothienpp
Author

I am currently a bit busy, but I will soon upload some short code describing how I did it, since I implemented it differently from your interface. Another question: is the portfolio strategy you use basically choosing a utility function at random every iteration?

@maxim5
Owner

maxim5 commented Feb 17, 2018

OK. No problem.

is the portfolio strategy you use basically choosing a utility function at random every iteration?

Yes, see BayesianPortfolioStrategy. You can either fix the distribution over utilities, or it will construct a distribution based on their performance.
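A rough sketch of that idea (illustrative only; this is not BayesianPortfolioStrategy's actual code or API):

    import numpy as np

    def pick_utility(utilities, fixed_probs=None, past_rewards=None):
        # Either use a fixed distribution over utility functions, or build
        # one from each utility's average past performance (softmax weights).
        if fixed_probs is None:
            scores = np.array([np.mean(r) if len(r) else 0.0 for r in past_rewards])
            exp = np.exp(scores - scores.max())
            fixed_probs = exp / exp.sum()
        idx = np.random.choice(len(utilities), p=fixed_probs)
        return utilities[idx]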

@baothienpp
Author

baothienpp commented Feb 22, 2018

So I am going to briefly describe my method. I used scikit-learn's BayesianRidge to implement BRR (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge). It has two methods, fit() and predict(); it is important to set the parameter return_std in predict() to True. Then you have the prediction and the std. To calculate the probability, I used the scipy package to compute the CDF:

    from scipy import stats

    # Inside the predictor, after fitting the Bayesian ridge model:
    prediction, std = self.predict()
    # self.target is the current best curve; max(self.target) is the value to beat.
    best = max(self.target)
    # P(new result > best) = 1 - cdf(best), i.e. the survival function.
    probability = stats.norm(prediction, std).sf(best)
    # Only the upper half of the distribution counts as 100%: a raw
    # probability of 0.5 or more maps to 100, so rescale and cap.
    probability = min(probability * 100 / 0.5, 100)
    # Terminate this run early if the improvement probability is below 75%.
    if probability < 75:
        ...  # stop training this configuration
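Here self.predict() just wraps the scikit-learn calls, roughly like this (variable names are illustrative):

    from sklearn.linear_model import BayesianRidge

    # X_burn_in: partial curves from the burn-in set; y_burn_in: their best values.
    model = BayesianRidge().fit(X_burn_in, y_burn_in)
    prediction, std = model.predict(x_current.reshape(1, -1), return_std=True)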

@maxim5
Owner

maxim5 commented Feb 23, 2018

Got it. Do you use the same data as I did, i.e. the set of learning curves?

@baothienpp
Author

Yes, I used the curves in your JSON file.
