Pine - a tool for Random Decision Forests

Ensembles of random decision trees.

This app implements the machine learning technique of generating groups of random decision trees, where each tree is responsible for a small part of the dataset.

Training and prediction for continuous floating-point or integer data are supported (categorical voting is not).

See this Kaggle discussion of the term random forest.

usage

The tree executable can be downloaded from the Releases page.

Go is required to build this app yourself:

cd pine/tree
make deps
go build # outputs the executable "tree" app
./tree # prints help

Training:

./tree -train -data=../test-data/iris.csv -save=../sav.gob

Predicting:

./tree -pred -model=../sav.gob -seed=5.7,3.8,1.7,0.3

All options:

Usage of ./tree:
  -charmode skipSize
    	Character prediction mode rather than numeric feature mode. This will create test cases by iterating through the data skipSize at a time, and making the previous `sequenceLength` items have higher weights based on their closeness to the current item being predicted.
  -data string
    	Training data input file
  -folds int
    	How many subdivisions of the dataset to make for cross-validation (default 5)
  -m int
    	Override calculation for feature split size (little m)
  -max int
    	Stop predicting after this many rounds (-pred only)
  -model string
    	Load a pretrained model for prediction
  -pred
    	Make a prediction
  -profile string
    	[cpu|mem] enable profiling
  -save string
    	Where to save the model after training
  -seed string
    	Predict based on this string of data
  -seqlen int
    	Normally equal to the number of variables during -charmode, override for fewer previous look-behind-memory-variables in every input test cases
  -skipsize int
    	During -charmode, how many items to skip before making another training case (default 3)
  -subsetpct float
    	Percent of the dataset which should be used to train a tree (always minus 1 fold for cross-validation) (default 0.6)
  -tojson
    	Convert a model to json
  -train
    	Train a model
  -trees int
    	How many decision trees to make per fold of the dataset (default 1)

experimental character mode

There is an experimental -charmode flag that attempts to encode strings of text and make predictions on it, like you would with a neural network.
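Based on the -charmode, -seqlen, and -skipsize descriptions in the options above, training cases appear to be built by sliding through the text skipSize characters at a time and weighting the preceding seqLen characters by closeness to the character being predicted. The Go sketch below shows one possible reading of that scheme; the struct, function names, and weighting rule are assumptions for illustration, not pine's actual encoding.

// Rough sketch of how -charmode might turn a string into training cases.
// The weighting scheme here is an assumption, not pine's implementation.
package main

import "fmt"

type trainingCase struct {
	features []float32 // previous characters, weighted by closeness
	label    float32   // the character being predicted
}

func makeCharCases(text string, seqLen, skipSize int) []trainingCase {
	var cases []trainingCase
	runes := []rune(text)
	for i := seqLen; i < len(runes); i += skipSize {
		feats := make([]float32, seqLen)
		for j := 0; j < seqLen; j++ {
			// characters closer to the prediction target get a higher weight (assumed scheme)
			weight := float32(j+1) / float32(seqLen)
			feats[j] = float32(runes[i-seqLen+j]) * weight
		}
		cases = append(cases, trainingCase{features: feats, label: float32(runes[i])})
	}
	return cases
}

func main() {
	for _, c := range makeCharCases("hello world", 4, 3) {
		fmt.Println(c.features, "->", c.label)
	}
}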

how it works

training random decision tree ensembles

The training data is a set of rows of input features x, where the last column is the expected category y. These are typically encoded in CSV format, and every value should be parseable as a float32.
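For illustration only, a few rows in that layout might look like the following, with four feature columns and an integer-encoded class in the last column (the actual iris.csv encoding may differ):

5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
6.3,3.3,6.0,2.5,2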

Train a set of random decision trees per bag:

  • Given a training set (every fold except the held-out one, see cross-validation below) and a held-out test fold
  • Do the following to create however many trees you want in each set (see the sketch after this list):
    • Randomly select a sample that is 2/3 of the training set, drawn with replacement, so duplicate rows are likely.
    • With the sample, determine the best split point of the data:
      • M is the total number of input features.
      • m is the number of features, out of M, that each tree will be responsible for. In other words, each tree tries to predict using only m of the M total features. m is typically computed as the square root of M, or by another rule that yields a smaller value.
      • Randomly pick the m features this tree will care about.
      • For each of those features, run through every possible split point among the values in the input rows.
      • Track which predictions each candidate split gets right, and use the split with the most correct predictions on this sample. This is also called minimizing the Gini impurity.
    • Continue splitting the tree into nodes until reaching the maximum desired depth, or until the tree naturally terminates.
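The per-tree steps above can be condensed into a short sketch. This is illustrative Go, not pine's actual implementation; the helper names (bootstrapSample, pickFeatures, bestSplit) and the example rows are made up for demonstration.

// Sketch of one tree's training steps: bootstrap sampling, choosing m random
// features, and scanning candidate splits by Gini impurity.
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// A row is all features followed by the class label in the last column.
type row []float32

// bootstrapSample draws n rows with replacement (duplicates are expected).
func bootstrapSample(data []row, n int) []row {
	out := make([]row, n)
	for i := range out {
		out[i] = data[rand.Intn(len(data))]
	}
	return out
}

// pickFeatures chooses m distinct feature indices out of M.
func pickFeatures(M, m int) []int {
	return rand.Perm(M)[:m]
}

// gini computes the Gini impurity of a group of rows.
func gini(group []row) float64 {
	if len(group) == 0 {
		return 0
	}
	counts := map[float32]int{}
	for _, r := range group {
		counts[r[len(r)-1]]++
	}
	impurity := 1.0
	for _, c := range counts {
		p := float64(c) / float64(len(group))
		impurity -= p * p
	}
	return impurity
}

// bestSplit scans every value of every candidate feature as a split point and
// keeps the one with the lowest weighted Gini impurity.
func bestSplit(sample []row, features []int) (feature int, value float32, score float64) {
	score = math.Inf(1)
	for _, f := range features {
		for _, r := range sample {
			var left, right []row
			for _, other := range sample {
				if other[f] < r[f] {
					left = append(left, other)
				} else {
					right = append(right, other)
				}
			}
			n := float64(len(sample))
			weighted := gini(left)*float64(len(left))/n + gini(right)*float64(len(right))/n
			if weighted < score {
				feature, value, score = f, r[f], weighted
			}
		}
	}
	return
}

func main() {
	data := []row{
		{5.1, 3.5, 1.4, 0.2, 0},
		{7.0, 3.2, 4.7, 1.4, 1},
		{6.3, 3.3, 6.0, 2.5, 2},
		{4.9, 3.0, 1.4, 0.2, 0},
	}
	M := len(data[0]) - 1
	m := int(math.Sqrt(float64(M))) // one common way to choose little m
	sample := bootstrapSample(data, 2*len(data)/3)
	f, v, s := bestSplit(sample, pickFeatures(M, m))
	fmt.Printf("split on feature %d at %.2f (gini %.3f)\n", f, v, s)
}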

Once the forest is trained, a prediction is made by running a sample (without the last column) through every tree and taking the mode (most frequent) prediction across the trees.
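As a sketch of that aggregation step (again illustrative, not pine's code), taking the mode of the trees' votes could look like this:

// modePrediction returns the most frequent class among the trees' votes.
package main

import "fmt"

func modePrediction(votes []float32) float32 {
	counts := map[float32]int{}
	best, bestCount := votes[0], 0
	for _, v := range votes {
		counts[v]++
		if counts[v] > bestCount {
			best, bestCount = v, counts[v]
		}
	}
	return best
}

func main() {
	// one vote per tree in the ensemble
	fmt.Println(modePrediction([]float32{0, 2, 2, 1, 2})) // prints 2
}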

Cross-validation

Cross-validation is a way to minimize the out-of-bag error. In other words, we validate that samples not included in the bags are still predicted correctly.

To do it, start by splitting the whole dataset into equal folds without replacement, before any bootstrap bags are drawn for training.

For example, say there are 20 samples and we want 4 folds. Each fold will have 5 samples, and none of the 20 samples will be repeated across all the folds. However, they need to be put randomly into the folds (random without replacement).

Next, loop through all the folds. The fold for the current iteration is the test set, so reserve it for later, and use all the other folds to train a set of decision trees. In the example above, on the first iteration we would train on the last 3 folds; on the second, train on the first fold and the last two; and so on. For every training set, construct decision trees that best predict it.
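A minimal sketch of the fold-splitting step, assuming a simple shuffle-and-deal approach (not necessarily how pine assigns rows to folds):

// makeFolds splits n sample indices into k folds without replacement by
// shuffling the indices and dealing them out round-robin.
package main

import (
	"fmt"
	"math/rand"
)

func makeFolds(n, k int) [][]int {
	folds := make([][]int, k)
	for i, idx := range rand.Perm(n) {
		folds[i%k] = append(folds[i%k], idx)
	}
	return folds
}

func main() {
	folds := makeFolds(20, 4) // 20 samples, 4 folds of 5
	for i, fold := range folds {
		fmt.Println("fold", i, fold)
	}
	// Each fold takes a turn as the test set while the rest train the trees.
}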

License

MIT

Sources

http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics

http://blog.yhat.com/posts/random-forests-in-python.html

https://machinelearningmastery.com/implement-random-forest-scratch-python/
