\section{Decision Trees With Dense Input Data} \label{sec:background}
\subsection{Notation}
We denote by $\mathcal{X}$ the input space and by $\mathcal{Y}$ the output space.
Without loss of generality, we suppose that $\mathcal{X} = \mathcal{R}^m$, where
$m$ denotes the number of features, and that $\mathcal{Y} = \mathcal{R}$ in the case of regression or $\mathcal{Y} = \{0, 1, \cdots, K-1\}$ in the case of $K$-class classification. Learning samples are represented by a pair of matrices
$(X, Y) \in \mathcal{X}^n \times \mathcal{Y}^n$ with $n$ rows, where each row
corresponds to a sample and each column to a feature or an output variable.
During decision tree induction we have to move samples around; instead of moving the actual rows of $(X, Y)$, we move their indices. For this purpose we introduce $\mathcal{L} = [0, 1, \cdots , n-1]$, the list of row indices of $(X, Y)$. As we will see in the next section, the impurity measure plays a crucial role in the induction of decision trees. The impurity of a set of samples $\mathcal{L}$ measures the heterogeneity of the target variable over those samples. The variance of the target variable is used in regression, while the Gini index and cross-entropy are usually used in tree-based classification algorithms. Given $\mathcal{L}$, we let $\hat{p}_k$ be the empirical probability of class $k$ among the samples in $\mathcal{L}$. The Gini impurity measure is defined as:
\begin{equation}
\label{gini}
I(\mathcal{L}) = \sum_{k=0}^{K-1} \hat{p}_k \left(1- \hat{p}_k\right),
\end{equation}
and the cross-entropy impurity measure is defined as:
\begin{equation}
\label{entropy}
I(\mathcal{L}) = -\sum_{k=0}^{K-1} \hat{p}_k \log\left(\hat{p}_k\right).
\end{equation}
Note that the impurity measure is maximized when the target variable is uniformly distributed, i.e. $\hat{p}_k = \frac{1}{K}$ for all $k$, which makes $\mathcal{L}$ most heterogeneous. Conversely, the impurity measure is minimized when the empirical probability mass is concentrated on a single class, i.e. $\hat{p}_k = 1$ for one class and $\hat{p}_k = 0$ for all the others.
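For instance, in a hypothetical two-class node ($K = 2$) with equal class frequencies, the Gini impurity of Equation~\ref{gini} attains its maximum,
\[
I(\mathcal{L}) = \frac{1}{2}\left(1 - \frac{1}{2}\right) + \frac{1}{2}\left(1 - \frac{1}{2}\right) = \frac{1}{2},
\]
whereas a pure node with $\hat{p}_0 = 1$ and $\hat{p}_1 = 0$ yields $I(\mathcal{L}) = 0$.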
\subsection{Training}
A decision tree \cite{breiman1984classification} is built by recursively
splitting the sample set into two nodes so as to maximize the reduction of an impurity measure:
\begin{equation}
\label{red}
\Delta{}I(s, \mathcal{L}) =
I(\mathcal{L}) - \left(
\frac{|\mathcal{L}_l|}{|\mathcal{L}|} I(\mathcal{L}_l) +
\frac{|\mathcal{L}_r|}{|\mathcal{L}|} I(\mathcal{L}_r)\right)
\end{equation}
where $s=\left(j, \, \theta\right)$ is a splitting rule composed of a feature index $j$ and a threshold value $\theta$, which divides the sample set $\mathcal{L}$ into $\mathcal{L}_l$ and $\mathcal{L}_r$, where
\begin{equation}
\label{left}
\mathcal{L}_l = \{ i \in \mathcal{L} \, | \, X_{i,j} < \theta \},
\end{equation}
and
\begin{equation}
\label{right}
\mathcal{L}_r = \{ i \in \mathcal{L} \, | \, X_{i,j} \ge \theta \}.
\end{equation}
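As a hypothetical illustration of Equation~\ref{red}, consider a node with eight samples, four from each of two classes, and a split that sends all class-$0$ samples to the left and all class-$1$ samples to the right. Under the Gini measure $I(\mathcal{L}) = \frac{1}{2}$, both children are pure, and
\[
\Delta{}I(s, \mathcal{L}) = \frac{1}{2} - \left(\frac{4}{8}\cdot 0 + \frac{4}{8}\cdot 0\right) = \frac{1}{2},
\]
which is the largest reduction achievable for this node.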
This recursive procedure is repeated until a stopping condition is met, e.g. a maximal depth
is reached or there are too few samples to split. These stopping criteria act
as regularization parameters. Leaves are labeled by the mean output of the training samples
reaching them in regression, or by their class frequencies in classification. The recursive
induction of the decision tree is described by Algorithm~\ref{algo:tree-induction} and the
search for the best split is described by Algorithm~\ref{algo:find-best-split}. Note that
sorting the samples of a node along each feature (Line~\ref{alg-line:sorting}) is at the core
of Algorithm~\ref{algo:find-best-split}; it allows the impurity measure to be computed
incrementally over all possible splitting rules. In Algorithm~\ref{algo:tree-induction} below,
a stack is initially set up with a single element, namely the pair of the root $t_0$ and the
entire set of samples $\mathcal{L}$ (Line~5). Line~7 pops the next set of samples
$\mathcal{L}_p$ from the stack and splits it into two nodes, the left node $\mathcal{L}_l$ and
the right node $\mathcal{L}_r$, based on the best splitting rule obtained in Line~11. This
splitting is carried out recursively until the stopping criteria are met (Lines~8--9), in
which case the current node becomes a leaf.
In the context of ensemble methods, trees are further randomized by searching for the
best split among only $k$ features at each node, and they may also be induced on a
bootstrap copy of the samples. Alternatively, the tree can be grown in a best-first
manner by replacing the stack of Algorithm~\ref{algo:tree-induction}
with a priority queue whose priority is given by the expected impurity reduction.
\begin{algorithm}
\caption{Build a decision tree}
\label{algo:tree-induction}
\textnormal{
\begin{algorithmic}[1]
\Function{InduceDecisionTree}{$X$, $Y$}
\State Initialize a tree structure $\tau$ with root node $t_0$
\State Initialize an empty stack $stack$
\State Initialize a sample set $\mathcal{L}=\{0,\ldots,n-1\}$
\State $stack$\Call{.push}{($t_0$, $\mathcal{L}$)}
\While{$stack$ is not empty}
\State $t_p$, $\mathcal{L}_p$ = $stack$\Call{.pop}{}
\If{$t_p$ satisfies stopping criterion}
\State Make $t_p$ a leaf node using $\mathcal{L}_p$ and $Y$.
\Else
\State \begin{varwidth}[t]{0.8\linewidth}
Find a splitting rule $s^*$ which maximizes impurity reduction
among possible splitting rules: \\\\%$Q(\mathcal{L}_p, X)$:\\\\
$
s^* = {FindBestSplit}(\mathcal{L}_p)\\
%\arg\max_{s \in Q(\mathcal{L}_p, X)} \Delta{}I(s, \mathcal{L}_p).
$
\end{varwidth}
\State Make $t_p$ an internal node with splitting rule $s^*$.
\State Partition $\mathcal{L}_p$ into $\mathcal{L}_r$ and
$\mathcal{L}_l$ given $s^*$. \label{alg-line:partition}
\State Create two empty nodes $t_r$ and $t_l$, children of $t_p$.
\State $stack$\Call{.push}{($t_r$, $\mathcal{L}_r$)}
\State $stack$\Call{.push}{($t_l$, $\mathcal{L}_l$)}
\EndIf
\EndWhile
\State \Return $\tau$
\EndFunction
\end{algorithmic}
}
\end{algorithm}
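To make the control flow concrete, the following sketch mirrors the stack-based loop of Algorithm~\ref{algo:tree-induction} in plain Python. It is only an illustration under stated assumptions: the helpers \texttt{stopping\_criterion\_met}, \texttt{find\_best\_split}, \texttt{partition} and \texttt{leaf\_value} are hypothetical placeholders for the routines described above, not part of any existing implementation.
\begin{verbatim}
# Minimal sketch of the stack-based loop of Algorithm 1 (illustrative only).
# stopping_criterion_met, find_best_split, partition and leaf_value are
# hypothetical callables standing in for the routines described in the text.
def induce_decision_tree(X, Y, stopping_criterion_met,
                         find_best_split, partition, leaf_value):
    samples = list(range(len(X)))    # L: row indices of (X, Y)
    tree = {}                        # node id -> node description
    counter = iter(range(10**9))     # simple source of fresh node ids
    root = next(counter)
    stack = [(root, samples)]        # pairs (node id, sample indices)
    while stack:
        t_p, L_p = stack.pop()
        if stopping_criterion_met(L_p, Y):
            # Leaf: mean output (regression) or class frequencies.
            tree[t_p] = {"leaf": True, "value": leaf_value(L_p, Y)}
        else:
            s = find_best_split(L_p, X, Y)   # s = (feature j, threshold theta)
            L_l, L_r = partition(L_p, X, s)  # rows with X[i][j] < theta go left
            t_l, t_r = next(counter), next(counter)
            tree[t_p] = {"leaf": False, "split": s, "left": t_l, "right": t_r}
            stack.append((t_r, L_r))
            stack.append((t_l, L_l))
            # Best-first growth: replace the stack with a priority queue
            # (e.g. heapq) keyed on the expected impurity reduction.
    return root, tree
\end{verbatim}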
\subsection{The Optimal Split}
Given a set of samples $\mathcal{L}_p$, the best splitting rule is computed by finding, for each feature, the threshold that reduces the average impurity (Equation~\ref{red}) the most, and then picking the feature, together with its optimal threshold, that yields the largest reduction over all features. For a given splitting rule $s = \left(j, \theta\right)$, the whole sample set needs to be traversed once to determine whether each sample row belongs to the left node (Equation~\ref{left}) or to the right node (Equation~\ref{right}). To avoid multiple traversals, for a given feature $j$ the samples are sorted along that feature in ascending order, and the impurity reduction is computed incrementally from the lowest value of feature $j$ to the highest, treating each feature value as a candidate threshold. As an example, consider computing the Gini impurity (Equation~\ref{gini}) for all values of feature $j$. Equation~\ref{red} can be updated cheaply by keeping track of the class frequencies of the left node $\mathcal{L}_l$ and the right node $\mathcal{L}_r$ and updating them as we move from one threshold to the next, e.g. from $\theta = X_{i,j}$ to $\theta = X_{i+1,j}$. In other words, $\Delta{}I\left(s = (j, X_{i+1,j}), \mathcal{L}\right)$ can be computed by updating the left and right class frequencies obtained for $\theta = X_{i,j}$, from which the empirical probabilities $\hat{p}_k$ and the sizes $|\mathcal{L}_l|$ and $|\mathcal{L}_r|$ of Equation~\ref{red} follow directly.
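To make this incremental computation concrete, the following Python sketch scans a single feature $j$ of a node: it sorts the node's sample indices once, keeps running class counts for the left and right children, and updates them while sliding the threshold from one sorted value to the next. The function and variable names are ours and purely illustrative.
\begin{verbatim}
# Incremental Gini scan for one feature j of a node (illustrative sketch).
def best_threshold_for_feature(L_p, X, y, j, n_classes):
    order = sorted(L_p, key=lambda i: X[i][j])   # sort node samples by feature j
    n = len(order)

    left_counts = [0] * n_classes                # start with everything on the right
    right_counts = [0] * n_classes
    for i in order:
        right_counts[y[i]] += 1

    def gini(counts, total):
        # 1 - sum_k p_k^2, algebraically equal to sum_k p_k * (1 - p_k)
        return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

    parent_impurity = gini(right_counts, n)
    best_gain, best_theta = -float("inf"), None

    # Move samples one by one from right to left; each boundary between two
    # consecutive sorted values is a candidate threshold theta.
    for pos in range(1, n):
        i = order[pos - 1]
        left_counts[y[i]] += 1
        right_counts[y[i]] -= 1
        theta = X[order[pos]][j]
        if theta == X[order[pos - 1]][j]:
            continue                             # identical values cannot be split
        n_l, n_r = pos, n - pos
        gain = parent_impurity - (n_l / n * gini(left_counts, n_l)
                                  + n_r / n * gini(right_counts, n_r))
        if gain > best_gain:
            best_gain, best_theta = gain, theta
    return best_theta, best_gain
\end{verbatim}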
In Algorithm~\ref{algo:find-best-split} below, Line~3 iterates over all features ranging from $0$ to $m-1$, creating for each one a collection $\mathcal{F}_j$ of all the values found in column $j$ for the samples of the current node $\mathcal{L}_p$. The algorithm computes the impurity reduction for each possible threshold in $\mathcal{F}_j$ in Lines~6 to 8. Line~10 keeps track of the best impurity reduction and Line~11 of the corresponding rule $s^*$, a combination of a feature index $j$ and a threshold $\theta$. The best splitting rule $s^*$ is the value returned by the algorithm. %For more details of this algorithm refer to Algorithm~\ref{algo:find-best-split} below.
\begin{algorithm}
\caption{Search for the best split}
\label{algo:find-best-split}
\textnormal{
\begin{algorithmic}[1]
\Function{FindBestSplit}{$\mathcal{L}_p$}
\State $\text{best} = -\infty$
\For{$j \in \{0, \ldots, m-1\}$}
\State Extract the values of feature $j$ for the samples reaching the node
\[
\mathcal{F}_j = \{X_{i,j}, \forall i \in \mathcal{L}_p\}.
\] \label{alg-line:value-extract}
\State Sort $\mathcal{L}_p$ and $\mathcal{F}_j$ by increasing values
of $\mathcal{F}_j$. \label{alg-line:sorting}
%\State Generate all possible splitting rules
% \[
%Q(\mathcal{F}_j)=\{((x_j \leq \nu), (x_j > \nu))| \nu \in \mathcal{F}_j)
%\]
%\For{$s$ in $Q(\mathcal{F}_j)$}
\For{$\theta$ in $\mathcal{F}_j$}
\State $s$ = $\left(j, \theta\right)$
\State Evaluate the impurity reduction of $s$
\[
\text{score} = \Delta{}I(s, \mathcal{L}_p).
\]
\If{$\text{score} > \text{best}$}
\State $\text{best} =\text{score}$
\State $s^* = s$
\EndIf
\EndFor
\EndFor
\State \Return $s^*$
\EndFunction
\end{algorithmic}
}
\end{algorithm}
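Combining the per-feature scan above with the loop over all features of Algorithm~\ref{algo:find-best-split} then amounts to a few extra lines; this gives one possible concrete form of the \texttt{find\_best\_split} helper assumed in the earlier sketch, again under purely illustrative names.
\begin{verbatim}
# Outer loop of Algorithm 2 (sketch): try every feature and keep the best
# (feature, threshold) pair, i.e. the splitting rule s*.
def find_best_split(L_p, X, y, n_features, n_classes):
    best_gain, best_rule = -float("inf"), None
    for j in range(n_features):
        theta, gain = best_threshold_for_feature(L_p, X, y, j, n_classes)
        if theta is not None and gain > best_gain:
            best_gain, best_rule = gain, (j, theta)
    return best_rule                 # s* = (feature index j, threshold theta)
\end{verbatim}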