\section{Introduction}
% no \IEEEPARstart
%What is the problem?
%Why is it interesting and important?
%Why is it hard? (E.g., why do naive approaches fail?)
%Why hasn't it been solved before? (Or, what's wrong with previous proposed solutions? How does mine differ?)
%What are the key components of my approach and results? Also include any specific limitations.
Decision trees provide simple predictive models that are widely used
in machine learning and data mining for both classification and
regression problems. Each decision tree leaf represents a particular
label in a classification problem (or a particular scalar value in a
regression problem), and the internal nodes of the tree represent
conjunctions of features that lead to the tree leaves. Decision
trees are popular not only because they are easy to interpret, but
also because they provide the basic building blocks of more sophisticated
and powerful ensemble methods such as AdaBoost~\cite{freund1995desicion},
Random Forests~\cite{breiman2001random} and Gradient Tree
Boosting~\cite{friedman2001greedy}. Such ensemble methods are often among
the best performers in practice for both classification and regression tasks.
%What all these
%methods have in common is that they use randomized decision trees as a
%base learner.

Learning a decision tree over a training set is a recursive top-down
procedure. In each step, the procedure finds a splitting rule that
partitions the underlying space into two segments so as to maximize
some notion of information gain, i.e., to reduce the variability of
the labels in the two resulting child nodes compared to the parent
node. The search for the optimal splitting rule makes the running time
of each partition decision linear in the number of features and in the
number of possible values in the training set.
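
To make this search concrete, the sketch below enumerates every candidate
threshold for a single feature using Gini impurity. It is an illustrative
simplification that assumes dense input, and the helper names
(\texttt{gini}, \texttt{best\_split}) are hypothetical; it is not the
procedure proposed in this paper.
\begin{verbatim}
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return the threshold on feature values x that maximizes
    the impurity decrease, scanning every candidate split."""
    order = np.argsort(x)          # sorting dominates the per-feature cost
    x_sorted, y_sorted = x[order], y[order]
    parent, n = gini(y_sorted), len(y_sorted)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):          # every position between two samples
        if x_sorted[i] == x_sorted[i - 1]:
            continue               # no threshold between equal values
        gain = (parent
                - (i / n) * gini(y_sorted[:i])
                - ((n - i) / n) * gini(y_sorted[i:]))
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_threshold, best_gain
\end{verbatim}
Repeating this scan over every feature at every node is what makes each
partition decision linear in both the number of features and the number of
values.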

Such a cost makes the use of tree-based methods prohibitive for
high-dimensional supervised learning problems, such as text or image
annotation, that are common in many research and industrial
settings. Such problems require defining a mapping between the raw
input space and a target vector space. For example, in text
classification a document (raw input space) is usually mapped to a
vector whose dimensions correspond to all of the possible words in a
dictionary and whose element values are determined by the frequency
of the words in the document. Although such vectors have many
dimensions, they can often be represented sparsely. For instance, the
number of unique words appearing in a text document is small compared
to the total number of words in a given language.
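
As a small illustration (using scikit-learn's \texttt{CountVectorizer} on a
toy corpus; the particular vectorizer is incidental), a bag-of-words matrix
is naturally stored in a sparse format:
\begin{verbatim}
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
X = CountVectorizer().fit_transform(docs)  # scipy.sparse matrix
print(X.shape)  # (2, vocabulary size)
print(X.nnz)    # stored nonzeros; for realistic corpora this is far
                # smaller than n_documents * vocabulary_size
\end{verbatim}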

Many models, such as logistic regression or SVMs, can directly
benefit from input sparsity by formulating the entire algorithm as
a set of dot products that can be computed efficiently with
appropriate sparse vector representations. However, this is not
possible for tree-based methods, since the tree learning procedure
cannot be expressed in terms of dot products. Consequently, most machine
learning packages either do not support sparse input for tree-based
methods, are restricted to decision stumps (decision trees with only one
internal node), or provide a sub-optimal implementation that simulates
random access to the data as in the dense case. The only remaining option
is often to densify the input, which leads first to severe memory
constraints and then to slow training times.
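
The following sketch illustrates why sparse dot products are cheap: only the
stored nonzero entries of a CSR row participate in the product with a dense
weight vector. It is a didactic example, not how these libraries are
actually implemented.
\begin{verbatim}
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0., 3., 0., 0., 1.],
                         [2., 0., 0., 0., 0.]]))
w = np.arange(5, dtype=float)

def sparse_row_dot(X, i, w):
    """Dot product of row i of a CSR matrix with a dense vector,
    touching only the stored nonzero entries."""
    start, end = X.indptr[i], X.indptr[i + 1]
    return np.dot(X.data[start:end], w[X.indices[start:end]])

print(sparse_row_dot(X, 0, w))  # 3*1 + 1*4 = 7.0
print(X @ w)                    # [7. 0.], scipy's optimized equivalent
\end{verbatim}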

In this paper, we present an efficient splitting procedure tailored to
numerical sparse input data in the compressed sparse column (CSC) format.
For a given subset of samples, we are able to efficiently extract the
nonzero values of a given feature over this subset. Knowing which elements
are nonzero allows large speedups: it decreases the time needed to sort the
samples of the current node along each feature, an essential component of
all tree-based models, and it reduces the set of candidate splits to
evaluate at each node. We also want to highlight that the contribution of
this paper has been proposed for and merged into the
\emph{scikit-learn}~\cite{buitinck2013api,pedregosa2011scikit} open source
package, making it directly available to the machine learning community.
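
For instance, with this contribution merged, scikit-learn's decision trees
can be fit directly on a \texttt{scipy.sparse} CSC matrix; the snippet below
is an illustration on synthetic random data, not an experiment from this
paper.
\begin{verbatim}
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.tree import DecisionTreeClassifier

# A high-dimensional, mostly-zero input matrix in CSC format.
X = sparse_random(1000, 5000, density=0.01, format="csc", random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=1000)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)          # no densification of X is required
print(clf.score(X, y))
\end{verbatim}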

The rest of this paper is organized as follows: Section~\ref{sec:background}
introduces the decision tree splitting algorithm and sparse matrix formats;
Section~\ref{sec:sparse-input-dt} describes the proposed splitting algorithm
for sparse input data; Section~\ref{sec:experiments} summarizes the
experiments carried out with the proposed algorithm, comparing it to the
traditional decision tree training algorithm; and
Section~\ref{sec:conclusion} concludes and discusses further perspectives.