(1)

Engineering Part IIB & EIST Part II Paper I10: Advanced Pattern Processing

Handout 7: Classification and Regression Trees

[Figure: an example decision tree with binary threshold questions on the feature x at each internal node (answered yes/no) and leaf nodes labelled with the classes ω1, ω2 and ω3]

Mark Gales

mjfg@eng.cam.ac.uk

October 2001

(2)

Introduction

So far we have only considered situations where the feature vectors have been real valued, or discrete valued. In these cases we have been able to find a natural distance measure between the vectors.

Consider the situation where we have nominal data - for instance descriptions that are discrete and for which there is no natural concept of ordering. The question is how such data can be used in a classification task, and most importantly how we can use such data to train a classifier.

One approach to handling this problem is to use decision trees.

We will look at the issues behind training and classifying with decision trees. A general tree growing methodology known as classification and regression trees (CART) will be discussed.

(3)

A Simple Decision Tree

Rather than using a single decision stage it is possible to use a multi-stage classification process. One possible example is the decision, or classification, tree.

[Figure: the example decision tree from the title page, with binary threshold questions on the feature x at each internal node (answered yes/no) and leaf nodes labelled ω1, ω2 and ω3]

A classification tree consists of a set of nodes (a minimal data-structure sketch follows the list):

Root node: the top node in the tree.

Internal nodes: consist of variable(s) and threshold(s).

Terminal nodes (or leaves): consist of a class label.
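To make this structure concrete, here is a minimal Python sketch of such a node; the class name, field names and the left/right = yes/no convention are illustrative choices, not part of the handout.

```python
class TreeNode:
    """One node of a binary classification tree (illustrative sketch)."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature = feature      # index of the variable tested at this node
        self.threshold = threshold  # threshold used in the yes/no question
        self.left = left            # descendant followed when the answer is yes
        self.right = right          # descendant followed when the answer is no
        self.label = label          # class label (set only at terminal nodes)

    def is_leaf(self):
        return self.label is not None


def classify(root, x):
    """Follow the questions from the root node down to a terminal node."""
    node = root
    while not node.is_leaf():
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label
```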

(4)

Decision Tree Construction

When constructing decision trees a number of questions must be addressed:

1. how many splits should there be at each node (binary?);

2. which property should be tested at each node;

3. when should a node be declared a leaf;

4. if the tree becomes too large how should it be pruned;

5. if a leaf is impure how should it be labelled;

6. how should missing data be handled.

Each of these questions, other than the missing data issue, will be addressed in this lecture.

Two forms of query exist, monothetic and polythetic queries.

In monothetic queries we have questions like (size=medium?)

In polythetic queries we have

(size=medium?) AND (NOT (colour=yellow?))

Polythetic queries may be used but can result in a vast number of possible questions.
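As a small illustration (assuming the attributes from the example queries above), the two query forms can be written as predicates on a nominal feature vector:

```python
# Monothetic query: a single attribute is tested.
def monothetic(x):
    return x["size"] == "medium"

# Polythetic query: a Boolean combination of attribute tests.
def polythetic(x):
    return x["size"] == "medium" and not (x["colour"] == "yellow")

sample = {"size": "medium", "colour": "green"}
print(monothetic(sample), polythetic(sample))   # True True
```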

(5)

Number of Splits

Each decision outcome at a node is called a split. It splits the data into subsets. The root node splits the complete training set, subsequent nodes split subsets of the data.

In general the number of splits is determined by an expert and can vary throughout the tree (we will briefly look at a scheme for this). However, every decision, and hence every decision tree, can be generated using just binary decisions.

The number of splits may dramatically alter the total number of nodes in the tree.

[Figure: (a) a binary tree asking (size=medium?) and then (size=big?), with leaves Apple, Watermelon and Grape; (b) an equivalent tertiary tree asking Size?, with branches big, medium and small leading to Watermelon, Apple and Grape]

(6)

Node Assignment

If the generation of the tree has been very successful then the nodes of the tree will be pure; that is, all samples at a terminal node have the same class label. In this case assignment of the terminal nodes to classes is trivial.

However, extremely pure trees are not necessarily good. Purity often indicates that the tree is over-tuned to the training data and will not generalise.

Impure nodes are assigned to the class with the greatest number of samples at that node, in other words

$$\hat{\omega} = \arg\max_{k} \left\{ P(\omega_k|N) \right\}$$

where P(ωk|N) = nk, and nk is the fraction of the samples assigned to node N that belong to class ωk.
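A short sketch of this assignment rule, assuming the class labels of the training samples reaching node N are available as a list:

```python
from collections import Counter

def assign_label(labels_at_node):
    """Label an impure node with the class that has the greatest number of
    samples at the node, i.e. the class maximising P(omega_k | N)."""
    return Counter(labels_at_node).most_common(1)[0][0]

# e.g. a node reached mostly by samples of class 1
print(assign_label([1, 1, 2, 1, 3, 1, 2, 1]))   # -> 1
```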

(7)

Splitting Rules

The aim of the set of splits is to yield a set of pure terminal nodes.

To assess the purity of the nodes, a node impurity function I(N) at a particular node N is defined

$$I(N) = \phi(P(\omega_1|N), \ldots, P(\omega_K|N))$$

where there are K classification classes.

The function φ() should have the following properties

φ() is a maximum when P(ωi) = 1/K for all i

φ() is at a minimum when P(ωi) = 1 and P(ωj) = 0, j ≠ i.

It is a symmetric function (i.e. the order of the class probabilities doesn’t matter).

The obvious way to use the purity when constructing the tree is to split the node according to the question that maximises the increase in purity (or minimises the impurity). The drop in impurity for a binary split is

$$\Delta I(N) = I(N) - n_L I(N_L) - (1 - n_L) I(N_R)$$

where:

NL and NR are the left and right descendants;

nL is the fraction of the samples in the left descendant.
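A sketch of this computation from the class labels that fall in each descendant; the impurity argument stands for any of the purity measures defined later, and class_probs is an illustrative helper.

```python
def class_probs(labels):
    """Estimate P(omega_k | N) as the fraction of samples of each class."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def binary_impurity_drop(labels, left_labels, right_labels, impurity):
    """Delta I(N) = I(N) - n_L I(N_L) - (1 - n_L) I(N_R)."""
    n_left = len(left_labels) / len(labels)
    return (impurity(class_probs(labels))
            - n_left * impurity(class_probs(left_labels))
            - (1.0 - n_left) * impurity(class_probs(right_labels)))
```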

(8)

Splitting Rules (cont)

We may also want to extend the binary split into a multi-way split. Here we can use the natural extension

$$\Delta I(N) = I(N) - \sum_{k=1}^{B} n_k I(N_k)$$

where nk is the fraction of the samples in descendant k.

The direct use of this expression to determine B has a drawback:

large B are inherently favoured over small B.

To avoid this problem the split is commonly selected by

$$\Delta I_B(N) = \frac{\Delta I(N)}{-\sum_{k=1}^{B} n_k \log(n_k)}$$

This is the gain ratio impurity.
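A sketch of the gain ratio for a B-way split, reusing the class_probs helper and the impurity argument from the earlier sketch (names are illustrative):

```python
import math

def multiway_impurity_drop(labels, branch_labels, impurity):
    """Delta I(N) = I(N) - sum_{k=1..B} n_k I(N_k)."""
    n = len(labels)
    drop = impurity(class_probs(labels))
    for subset in branch_labels:                      # one label list per branch
        drop -= (len(subset) / n) * impurity(class_probs(subset))
    return drop

def gain_ratio(labels, branch_labels, impurity):
    """Normalise the drop in impurity by the entropy of the split fractions
    n_k, so that large branching factors B are not inherently favoured."""
    n = len(labels)
    split_entropy = -sum((len(s) / n) * math.log(len(s) / n)
                         for s in branch_labels if len(s) > 0)
    return multiway_impurity_drop(labels, branch_labels, impurity) / split_entropy
```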

Using the splitting rules described before results in a greedy algorithm. This means that any split is only locally good.

There is no guarantee that a globally optimal set of splits will be found, nor that the smallest tree will be obtained.

(9)

Purity Measures

There are a number of commonly used measures:

Entropy measure:

$$I(N) = -\sum_{i=1}^{K} P(\omega_i|N) \log(P(\omega_i|N))$$

Variance measure: for a two class problem a simple measure is

$$I(N) = P(\omega_1|N) P(\omega_2|N)$$

Gini measure: this is a generalisation of the variance measure to K classes

$$I(N) = \sum_{i \neq j} P(\omega_i|N) P(\omega_j|N) = 1 - \sum_{i=1}^{K} P(\omega_i|N)^2$$

Misclassification measure:

$$I(N) = 1 - \max_{k} \left( P(\omega_k|N) \right)$$

Scaled versions of the purity measures for a two class problem are shown below.

[Figure: scaled versions of the entropy, misclassification and Gini/variance purity measures plotted against the class probability for a two class problem]
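The measures above translate directly into code; a sketch suitable for use as the impurity argument in the earlier snippets (function names are illustrative):

```python
import math

def entropy(probs):
    """Entropy measure: -sum_i P(omega_i|N) log P(omega_i|N)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gini(probs):
    """Gini measure: 1 - sum_i P(omega_i|N)^2 (the variance measure for K = 2)."""
    return 1.0 - sum(p * p for p in probs)

def misclassification(probs):
    """Misclassification measure: 1 - max_k P(omega_k|N)."""
    return 1.0 - max(probs)

# For a fairly pure two-class node with P(omega_1|N) = 0.9:
p = [0.9, 0.1]
print(entropy(p), gini(p), misclassification(p))
```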

(10)

Selecting a Split

We will assume that we are only considering binary splits, with n training samples. We will also only consider splits parallel to the axes. The choice of where to split for a particular dimension is determined by:

1. sorting the n samples according to the particular feature selected (assuming a natural sort order is possible);

2. examining the impurity measure at each of the possible n − 1 splits;

3. selecting the split with the lowest impurity;

Each of the possible d-dimensions should be examined.
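A sketch of this procedure for a single feature, sorting the samples and evaluating the n − 1 candidate splits (the threshold is placed midway between adjacent sorted values, one common convention; class_probs and the impurity argument are the helpers sketched earlier):

```python
def best_split_for_feature(values, labels, impurity):
    """Sort the n samples on one feature and evaluate each of the n - 1
    possible binary splits, returning the lowest-impurity threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(order)):
        left = [labels[j] for j in order[:i]]
        right = [labels[j] for j in order[i:]]
        n_left = i / len(order)
        score = (n_left * impurity(class_probs(left))
                 + (1.0 - n_left) * impurity(class_probs(right)))
        if score < best_score:
            threshold = 0.5 * (values[order[i - 1]] + values[order[i]])
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Each of the d dimensions would be examined in the same way.
```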

The above algorithm is only applicable when there is a natural ordering to the data.

What about nominal data?

In the general case we can consider any particular split based on questions about attributes of the feature vector. We saw the example where the size was used to determine the fruit.

More general questions are possible. For many tasks, such as clustering the phonetic context in speech recognition, the exact set of questions is not found to be important, provided a sensible set of questions is used.

(11)

When to stop splitting?

If the splitting is allowed to carry on we will eventually reach a stage where there is only one sample in each node.

Pure nodes can always be obtained!

However these pure nodes are not useful. It is unlikely that the tree built will generalise to unseen data. The tree has effectively become a lookup-table. There are a number of techniques that may be used to stop the tree growth at an appropriate stage.

Threshold: the tree is grown until the reduction in the impurity measure fails to exceed some specified minimum. We stop splitting a node N when

$$\Delta I(N) \leq \beta$$

where β is the minimum threshold. This is a very simple scheme to operate. It allows the tree to stop splitting at different levels.

MDL: a criterion related to minimum description length (MDL) may also be used. Here we look for a global minimum of

$$\alpha \, \text{size} + \sum_{\text{leaf nodes}} I(N)$$

where size represents the number of nodes and α is a positive constant. The problem is that a tunable parameter α has been added.

Other generic techniques, such as cross-validation and held-out data, may also be used.
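A sketch of the two stopping rules above, assuming the drop in impurity for a candidate split and the impurities of the current leaves are available (names and the example numbers are illustrative):

```python
def stop_splitting(delta_impurity, beta):
    """Threshold rule: stop splitting node N when Delta I(N) <= beta."""
    return delta_impurity <= beta

def mdl_criterion(leaf_impurities, num_nodes, alpha):
    """MDL-style criterion alpha * size + sum over leaf nodes of I(N);
    the candidate tree minimising this value is preferred."""
    return alpha * num_nodes + sum(leaf_impurities)

# e.g. a 7-node tree whose 4 leaves have impurities 0.0, 0.1, 0.05 and 0.2
print(stop_splitting(0.01, beta=0.05))                        # True: stop here
print(mdl_criterion([0.0, 0.1, 0.05, 0.2], num_nodes=7, alpha=0.02))
```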

(12)

Pruning

Stopping criteria can suffer from a lack of sufficient look-ahead, sometimes called the horizon effect. Instead of using a stopping criterion it is possible to generate a large tree and then prune this tree. The basic process is as follows (a sketch of the loop is given after the steps):

1. grow a tree until leaf nodes have a minimum impurity;

2. consider all adjacent pairs of leaf nodes (i.e. ones that have a common parent). For any pair that yields only a small increase in the impurity (i.e. gain is below some specified threshold)

(a) delete the two leaf nodes;

(b) make the parent a leaf node;

3. goto (2) unless all pairs are above the threshold.
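A sketch of this loop as a single bottom-up recursion over the TreeNode sketch from earlier, assuming each node recorded its own impurity, sample count and majority label while the tree was grown (these attribute names are illustrative):

```python
def prune(node, threshold):
    """Replace a pair of sibling leaves by their parent whenever merging them
    increases the impurity by less than the threshold. Pruning the subtrees
    first means merges can cascade up the tree in one pass."""
    if node.is_leaf():
        return node
    node.left = prune(node.left, threshold)
    node.right = prune(node.right, threshold)
    if node.left.is_leaf() and node.right.is_leaf():
        n_left = node.left.n_samples / node.n_samples
        children = (n_left * node.left.impurity
                    + (1.0 - n_left) * node.right.impurity)
        if node.impurity - children < threshold:   # only a small impurity increase
            node.left = node.right = None          # (a) delete the two leaf nodes
            node.label = node.majority_label       # (b) make the parent a leaf
    return node
```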

There are a number of features of pruning.

The resulting tree may be highly unbalanced.

Pruning avoids the horizon effect. Compared to techniques like cross-validation, no held-out data is required.

Pruning can be computationally expensive for very large trees. This may be reduced by not restricting merging to only act at leaves.

(13)

Multivariate Decision Trees

In some situations the proper feature to use to form the decision boundary does not line up with the axes.

[Figure: (c) a univariate tree partitions the (x1, x2) space with axis-parallel boundaries; (d) a multivariate tree allows oblique decision boundaries]

This has led to the use of multivariate decision trees. Here decisions do not have to be parallel to the axes. The question is how to select the direction of the new split.

This may be viewed as producing a hyperplane that correctly partitions the data. This is the same problem as for the single layer perceptron. We can therefore use:

perceptron criterion;

least mean squares;

Fisher’s linear discriminant.

There is a problem with this form of rule as the training time may be slow, particularly near the root node.
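A sketch of one of these options, using Fisher's linear discriminant to pick the split direction at a two-class node (NumPy is used for the linear algebra; the projections give a one-dimensional problem to which the earlier threshold-selection sketch can be applied):

```python
import numpy as np

def fisher_split_direction(X, y):
    """Return the Fisher direction w = S_w^{-1} (m1 - m2) for a two-class
    node, together with the 1-D projections of the samples onto w."""
    X1, X2 = X[y == 0], X[y == 1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_w, m1 - m2)
    return w, X @ w

# Illustrative usage on random two-class data in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 1.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, projections = fisher_split_direction(X, y)
```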

(14)

Computational Complexity

Suppose there are n training examples of d-dimensional data.

What are the training and test time and memory costs?

Training: initially only consider the root node. We first sort the data for each dimension, O(dn log2(n)). For all possible splits we calculate, for example, the entropy measure, O(n) + (n − 1)O(d). Thus the cost associated with the root node is O(dn log2(n)). Assume that there is approximately an equal split of the data. The cost at the first level is O(dn log2(n/2)), at the second level O(dn log2(n/4)), and so on. The maximum number of levels is O(log2(n)). Summing the levels yields cost O(dn(log2(n))^2) (the level-by-level sum is written out below).

The memory cost is related to the number of nodes, which is 1 + 2 + 4 + ... + n/2 ≈ n, i.e. O(n).

Testing: the time complexity for performing classification is simply the number of levels in the decision tree, O(log2(n)).
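As a check of the training cost quoted above (under the equal-split assumption, level l holds roughly 2^l nodes of n/2^l samples each), the per-level costs sum as:

$$\sum_{l=0}^{\log_2 n - 1} 2^l \, O\!\left( d \, \frac{n}{2^l} \log_2 \frac{n}{2^l} \right) = \sum_{l=0}^{\log_2 n - 1} O\!\left( d \, n \, (\log_2 n - l) \right) = O\!\left( d \, n \, (\log_2 n)^2 \right)$$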

In practice many of the assumptions made, for example equal splits, are not true. Also many heuristics are used to speed up the generation of trees. However these costs yield a reasonable “rule-of-thumb” for building and classifying with decision trees.

(15)

Other Tree Classifiers

This lecture has concentrated on CART. Alternative schemes are

ID3: this is intended for use with nominal data only. The system runs by splitting the data with a branching factor Bj for node j. As the number of splits is usually not binary the gain ratio impurity is commonly used. The number of levels of the tree is equal to the number of input variables, d. In the standard scheme there is no pruning.

C4.5: this is the successor to and refinement of ID3. Real valued samples are treated in the same fashion as CART.

Multi-way splits are used for the nominal data (as in ID3).

Heuristics for pruning based on the statistical significance of splits are used. There are a number of differences to CART - see Duda, Hart and Stork for more details.

(16)

Classification Trees?

The classification tree approach is recommended when:

1. The dataset is complex and believed to have non-linear decision boundaries, which may be approximated by a sum of linear boundaries.

2. There is a need to gain insight into the nature of the data structure. In multi-layer perceptrons it is hard to interpret the meaning of a particular node or layer, due to the high connectivity.

3. Need for ease of implementation. The generation of decision trees is relatively simple compared to, for example, multi-layer perceptrons.

4. Speed of classification. As discussed in the cost slide, classification can be achieved in cost O(L) where L is the number of levels in the tree. The questions at each level are relatively simple.
