research-ensemble methods
Voting classifier (hard voting across three different algorithms):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# example data so the snippet runs end-to-end; substitute your own training split
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# hard voting: each classifier gets one vote, the majority class wins
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)
What would the ideal ensemble be, of:
1. detection methods / data sources (FIRMS, MODIS, VIIRS, wind, lightning, soil moisture, weather, atmospheric data {which one?})
   COUNT THEM ALL AND BRAINSTORM -->> MISSING ONES <<--
2. ML models/algorithms to combine and analyze them -->> make predictions,
   understand patterns / recognize fires faster, etc.
ensemble methods: "suppose you ask a complex question to thousands of random people, then aggregate
their answers. In many cases you will find that this aggregated answer is better than
an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate
the predictions of a group of predictors (such as classifiers or regressors), you will
often get better predictions than with the best individual predictor." A.G.
-"Let’s look at each classifier’s accuracy on the test set:" A.G.
>>> from sklearn.metrics import accuracy_score
>>> for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
...
clf.fit(X_train, y_train)
...
y_pred = clf.predict(X_test)
...
print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
...
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896
- Gradient Boosting Machines (GBMs) and AdaBoost: sequential boosting
This sequential learning technique sounds a bit like Gradient Descent, except that instead of
tweaking a single predictor's parameters to minimise a cost function, boosting (e.g. AdaBoost) adds
predictors to the ensemble, each one gradually improving it. The great disadvantage of this approach
is that training cannot be parallelized, since each predictor can only be trained after the previous
one has been trained and evaluated.
--Adaptive Boosting --AdaBoost
Weights are determined using the error value:
the higher the error on an observation, the higher the weight assigned to that observation.
This process is repeated until the error no longer improves,
or the maximum number of estimators is reached.
Hyperparameters
base_estimator : the base estimator type, i.e. the algorithm used as the weak learner (a decision stump by default).
n_estimators : the number of base estimators (sklearn's AdaBoostClassifier defaults to 50);
increasing it can improve performance at the cost of training time.
learning_rate : same kind of impact as in the gradient descent algorithm (shrinks each predictor's contribution).
max_depth : maximum depth of each individual tree; set on the base estimator, not on AdaBoost itself.
n_jobs : how many processors the model may use ('-1' means use all of them); only relevant for
ensembles that can train in parallel (e.g. bagging / random forests), since boosting is sequential.
random_state : makes the model's output reproducible - with a fixed value, the same parameters
and training data will always produce the same results.
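A minimal AdaBoost sketch, assuming the X_train / X_test split from the voting-classifier example at the top of these notes (the specific hyperparameter values are only illustrative):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# decision stumps (max_depth=1) as weak learners;
# learning_rate shrinks each predictor's contribution to the ensemble
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))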
Difference between AdaBoost and Gradient Boosting:
Gradient Boosting is another very popular boosting algorithm that works much like AdaBoost:
it sequentially adds predictors to the ensemble, each one correcting the errors made by its predecessor.
The difference lies in what it does with the underfitted values of its predecessor.
Contrary to AdaBoost, which tweaks the instance weights at every iteration,
Gradient Boosting tries to fit the new predictor to the --residual errors-- made by the previous predictor.
-------------------------------
Ensemble learning reduces the tendency to memorize noise.
The best ensemble is one where the models are skillful in different ways
and compensate for each other's weaknesses.
Voting Classifier
-same training set, different algorithms
BAGGING (Bootstrap aggregation)
-one algorithm, different subsets of the training set
Bagging reduces the variance of the individual models in the ensemble
Bagging classifiers aggregate predictions by majority voting
Bagging regressors aggregate predictions by averaging
Out-of-bag (OOB) instances are the roughly 37% of training instances that are never sampled by the bagging model.
Since they were not seen during training, they can be used to test its performance (OOB evaluation).
OOB evaluation is useful to get a SECOND/ADDITIONAL performance metric (apart from the test set accuracy score)
without resorting to cross-validation.
Inside the bagging classifier, add the parameter oob_score=True (see the sketch below).
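A minimal bagging sketch with OOB evaluation, again assuming the X_train / X_test split defined above (the decision-tree base estimator and 500 estimators are just example choices):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 trees, each trained on a bootstrap sample of the training set;
# oob_score=True evaluates each tree on the instances it never saw
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,
    random_state=42)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)             # OOB accuracy estimate
print(bag_clf.score(X_test, y_test))  # compare with the test set accuracy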
-------Random Forests
-Unlike bagging (which can use different models as its base estimator [logistic regression, neural nets, KNN, etc.]),
Random Forests only use Decision Trees.
Bootstrap samples have the same size as the training set (whereas bagging can also use smaller subsets).
"RF introduces more randomization than bagging because only d features can be sampled at each node WITHOUT REPLACEMENT
(where d is a portion of the total number of features)
[in scikit-learn the default for d is the square root of the total number of features, e.g. 10 for 100].
Each tree added to the ensemble is trained on a different bootstrap sample from the data set.
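A minimal random forest sketch under the same assumed data split; max_features corresponds to the d described above:

from sklearn.ensemble import RandomForestClassifier

# 500 trees; at each node only a random subset of sqrt(n_features) features is considered
rf_clf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    random_state=42)
rf_clf.fit(X_train, y_train)
print(rf_clf.score(X_test, y_test))
print(rf_clf.feature_importances_)  # per-feature importance scores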
-------------Boosting
Many predictors are trained SEQUENTIALLY and each predictor LEARNS FROM THE ERRORS OF ITS PREDECESSOR.
Many weak learners are combined to form a strong one.
2 popular ones: AdaBoost and Gradient Boosting
-AdaBoost (adaptive boosting)
Gives more weight to the instances that were misclassified before passing them to the next predictor.
-Gradient Boosting
In contrast to AdaBoost, the weights of the training instances are not tweaked.
Instead, each predictor is trained using the RESIDUAL ERRORS of its predecessor as LABELS.
SHRINKAGE: in gradient boosting, the prediction of each tree in the ensemble is shrunk,
i.e. multiplied by the learning rate (eta, between 0 and 1).
As with AdaBoost, there is a trade-off between eta and the number of estimators:
decreasing the learning rate has to be compensated by increasing the number of estimators
if we want to achieve good performance.
(formula) y_pred = y1 + eta*r1 + eta*r2 + ... + eta*rN
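A minimal gradient boosting sketch for regression, illustrating the eta / n_estimators trade-off; the data here is only a placeholder so the sketch is self-contained:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# placeholder regression data; replace with the real feature set
Xr, yr = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=42)

# a smaller learning_rate (eta) is compensated by a larger number of estimators
gbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=3,
    random_state=42)
gbr.fit(Xr_train, yr_train)
print(gbr.score(Xr_test, yr_test))  # R^2 on the held-out set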
----Stochastic gradient boosting
Normal gradient boosting is computationally intensive,
and some CARTs may end up using the same split points on the same features.
-In SGB, each CART is trained on a RANDOM SUBSET of the training data (not the whole training set).
Instances AND features are sampled without replacement == more diversity in the ensemble (more variance-resistance)
(subsampling instances and features).
Not all features are considered when a split is made.
Each tree is trained, residual errors are computed, multiplied by the learning rate (eta), and fed to the next tree.
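A stochastic variant of the previous sketch, reusing the same placeholder regression split; subsample draws a fraction of the instances for each tree and max_features limits the features considered per split:

from sklearn.ensemble import GradientBoostingRegressor

# subsample < 1.0 -> each tree sees only a random fraction of the training instances
# max_features < 1.0 -> only a subset of the features is considered at each split
sgbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.1,
    subsample=0.8,
    max_features=0.75,
    random_state=42)
sgbr.fit(Xr_train, yr_train)
print(sgbr.score(Xr_test, yr_test))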
------
Stacking
"Stacking is an ensemble learning technique that combines multiple classification or regression models
via a meta-classifier or a meta-regressor.
The base level models are trained based on a complete training set,
then the meta-model is trained on the outputs of the base level model as features."
"Stacked Generalization or “Stacking” for short is an ensemble machine learning algorithm.
It involves combining the predictions
from multiple machine learning models
on the same dataset,
like bagging and boosting."
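A minimal stacking sketch with scikit-learn's StackingClassifier, reusing the assumed X_train / X_test split from the voting example; the base models and the logistic-regression meta-learner are only example choices:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# base models produce predictions; a logistic regression meta-model learns how to combine them
stack_clf = StackingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(probability=True))],
    final_estimator=LogisticRegression(),
    cv=5)
stack_clf.fit(X_train, y_train)
print(stack_clf.score(X_test, y_test))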