MFTreeSearchCV

This is a package for fast hyper-parameter tuning using noisy multi-fidelity tree-search for scikit-learn estimators (classifiers/regressors). Given ranges and types (categorical, integer, reals) for several hyper-parameters, this package is designed to search for a good configuration by treating the k-fold cross-validation errors under different settings and under different fidelities (levels of approximation based on amount of data used), as a multi-fidelity noisy black-box function. This work is based on the publications:

A deterministic version of MF Tree Seach
A version that can hadle noise -- which is there in tuning

Please cite the above papers, when using this software in any publications/reports.

Prerequisites

You will need a C and a Fortran compiler such as gnu95. Please install them by following the correct steps for your machine and OS.
You will need the following python packages: sklearn, numpy, brewer2mpl, scipy, matplotlib

Installing

Go to the location of your choice and clone the repo.

git clone https://github.com/rajatsen91/MFTreeSearchCV.git

Now the next steps are to setup the multi-fidelity enviroment.

We will assume that the location of the project is /home/MFTreeSearchCV/
You first need to build the direct fortran library. For this cd into /home/MFTreeSearchCV/utils/direct_fortran and run bash make_direct.sh. You will need a fortran compiler such as gnu95. Once this is done, you can run simple_direct_test.py to make sure that it was installed correctly.
Edit the set_up_gittins file to change the GITTINS_PATH to the local location of the repo.

GITTINS_PATH=/home/MFTreeSearchCV

Run source set_up_gittins to set up all environment variables.
To test the installation, run bash run_all_tests.sh. Some of the tests are probabilistic and could fail at times. Some tests that require a scratch directory will fail, but nothing to worry about.

Usage

The usage of this repository is designed to be similar to parameter tuning functions in sklearn.model_selection, like gridsearchcv. The main function is MFTreeSearchCV.MFTreeSearchCV. The arguments and methods of this function are as follows,

	"""Multi-Fidelity  Tree Search over specified parameter ranges for an estimator.
	Important members are fit, predict.
	MFTreeSearchCV implements a "fit" and a "score" method.
	It also implements "predict", "predict_proba" is they are present in the base-estimator.
	The parameters of the estimator used to apply these methods are optimized
	by cross-validated Tree Search over a parameter search space.
	----------
	estimator : estimator object.
		This is assumed to implement the scikit-learn estimator interface.
		Unlike grid search CV, estimator need not provide a ``score`` function.
		Therefore ``scoring`` must be passed. 
	param_dict : Dictionary with parameters names (string) as keys and and the value is another dictionary. The value dictionary has
	the keys 'range' that specifies the range of the hyper-parameter, 'type': 'int' or 'cat' or 'real' (integere, categorical or real),
	and 'scale': 'linear' or 'log' specifying whether the search is done on a linear scale or a logarithmic scale. An example for param_dict
	for scikit-learn SVC is as follows:
		eg: param_dict = {'C' : {'range': [1e-2,1e2], 'type': 'real', 'scale': 'log'}, \
		'kernel' : {'range': [ 'linear', 'poly', 'rbf', 'sigmoid'], 'type': 'cat'}, \
		'degree' : {'range': [3,10], 'type': 'int', 'scale': 'linear'}}
	scoring : string, callable, list/tuple, dict or None, default: None
		A single string (see :ref:`scoring_parameter`). this must be specified as a string. See scikit-learn metrics 
		for more details. 
	fixed_params: dictionary of parameter values other than the once in param_dict, that should be held fixed at the supplied value.
	For example, if fixed_params = {'nthread': 10} is passed with estimator as XGBoost, it means that all
	XGBoost instances will be run with 10 parallel threads
	cv : int, cross-validation generator or an iterable, optional
		Determines the cross-validation splitting strategy.
		Possible inputs for cv are:
		- None, to use the default 3-fold cross validation,
		- integer, to specify the number of folds in a `(Stratified)KFold`,
	debug : Binary
		Controls the verbosity: True means more messages, while False only prints critical messages
	refit : True means the best parameters are fit into an estimator and trained, while False means the best_estimator is not refit

	fidelity_range : range of fidelity to use. It is a tuple (a,b) which means lowest fidelity means a samples are used for training and 
	validation and b samples are used when fidelity is the highest. We recommend setting b to be the total number of training samples
	available and a to bea reasonable value. 
	
	n_jobs : number of parallel runs for the CV. Note that njobs * (number of threads used in the estimator) must be less than the number of threads 
	allowed in your machine. default value is 1. 

	nu_max : automatically set, but can be give a default values in the range (0,2]
	rho_max : rho_max in the paper. Default value is 0.95 and is recommended
	sigma : sigma in the paper. Default value is 0.02, adjust according to the believed noise standard deviation in the system
	C : default is 1.0, which is overwritten if Auto = True, which is the recommended setting
	Auto : If True then the bias function parameter C is auto set. This is recommended. 
	tol : default values is 1e-3. All fidelities z_1, z_2 such that |z_1 - z_2| < tol are assumed to yield the same bias value

	total_budget : total budget for the search in seconds. This includes the time for automatic parameter C selection and does not include refit time. 
	total_budget should ideally be more than 5X the unit_cost which is the time taken to run one experiment at the highest fidelity
	
	unit_cost : time in seconds required to fit the base estimator at the highest fidelity. This should be estimated by the user and then supplied. 
	
	Attributes
	----------
	cv_results_ : dictionary showing the scores attained under a few parameters setting. Each
	parameter setting is the best parameter obtained from a tree-search call. 
	best_estimator_ : estimator or dict
		Estimator that was chosen by the search, i.e. estimator
		which gave highest score (or smallest loss if specified)
		on the left out data. Not available if ``refit=False``.
		See ``refit`` parameter for more information on allowed values.
	best_score_ : float
		Mean cross-validated score of the best_estimator
	best_params_ : dict
		Parameter setting that gave the best results on the hold out data.
	refit_time_ : float
		Seconds used for refitting the best model on the whole dataset.
		This is present only if ``refit`` is not False.
	fit_time_ : float
		Seconds taken to find the best parameters. Should be close to the budget given. 
	

	
	"""

A functional example is provided in the ipython notebook Illustrate.ipynb.

Example

estimator = LogisticRegression() #base estimator
param_dict = {'C':{'range':[1e-5,1e5],'scale':'log','type':'real'},\
              'penalty':{'range':['l1','l2'],'scale':'linear','type':'cat'}} #parameter space
fidelity_range = [500,15076] # fidelity range, lowest fidelity uses 500 samples while the highest one uses 
#the whole dataset  
n_jobs = 3 # number of jobs
cv = 3 # cv level
fixed_params = {}
scoring = 'accuracy'

t1 = time.time()
estimator = estimator.fit(X_train,y_train)
t2 = time.time()
total_budget = 500 # total budget in seconds
print('Time without CV: ', t2 - t1)

model = MFTreeSearchCV(estimator=estimator,param_dict=param_dict,scoring=scoring,\
                      fidelity_range=fidelity_range,unit_cost=None,\
                    cv=cv,  n_jobs = n_jobs,total_budget=total_budget,debug = True,fixed_params=fixed_params)

## running in debug mode will display certain outputs


m = model.fit(X_train,y_train) # actual tree search will be done here 

y_pred = m.predict(X_test) # the returned model m will have the best parameters fitted as 'refit = True'

accuracy = accuracy_score(y_pred,y_test) # this is the best accuracy obtained

print(m.best_params_) # will print the best params 

print(m.cv_results_) # will print cv scores for some other parameters as well, that were close

Authors

Rajat Sen

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

We acknowledge the support from @kirthevasank for providing the original multi-fidelity black-box function code base.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

MFTreeSearchCV

Prerequisites

Installing

Usage

Example

Authors

License

Acknowledgments

Files

README.md

Latest commit

History

README.md

File metadata and controls

MFTreeSearchCV

Prerequisites

Installing

Usage

Example

Authors

License

Acknowledgments