(1)

Engineering Part IIB & EIST Part II Paper I10: Advanced Pattern Processing

Handout 7: Classification and Regression Trees

[Figure: an example decision tree with binary threshold questions on the feature x at each internal node (answered yes/no) and leaf nodes labelled with the classes ω1, ω2 and ω3]

Mark Gales

mjfg@eng.cam.ac.uk

October 2001

(2)

Introduction

So far we have only considered situations where the feature vectors have been real valued, or discrete valued. In these cases we have been able to find a natural distance measure between the vectors.

Consider the situation where we have nominal data - for instance descriptions that are discrete and for which there is no natural concept of ordering. The question is how such data can be used in a classification task, and most importantly how we can use such data to train a classifier.

One approach to handling this problem is to use decision trees.

We will look at the issues behind training and classifying with decision trees. A general tree growing methodology known as classification and regression trees (CART) will be discussed.

(3)

A Simple Decision Tree

Rather than using a single decision stage it is possible to use a multi-stage classification process. One possible example is the decision, or classification, tree.

[Figure: the example decision tree from the title page, with binary threshold questions on the feature x at each internal node (answered yes/no) and leaf nodes labelled ω1, ω2 and ω3]

A classification tree consists of a set of nodes (a minimal data-structure sketch follows the list):

Root node: the top node in the tree.

Internal nodes: consist of variable(s) and threshold(s).

Terminal nodes (or leaves): consist of a class label.
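To make this structure concrete, here is a minimal Python sketch of such a node; the class name, field names and the left/right = yes/no convention are illustrative choices, not part of the handout.

```python
class TreeNode:
    """One node of a binary classification tree (illustrative sketch)."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature = feature      # index of the variable tested at this node
        self.threshold = threshold  # threshold used in the yes/no question
        self.left = left            # descendant followed when the answer is yes
        self.right = right          # descendant followed when the answer is no
        self.label = label          # class label (set only at terminal nodes)

    def is_leaf(self):
        return self.label is not None


def classify(root, x):
    """Follow the questions from the root node down to a terminal node."""
    node = root
    while not node.is_leaf():
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label
```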

(4)

Decision Tree Construction

When constructing decision trees a number of questions must be addressed:

1. how many splits should there be at each node (binary?);

2. which property should be tested at each node;

3. when should a node be declared a leaf;

4. if the tree becomes too large how should it be pruned;

5. if a leaf is impure how should it be labelled;

6. how should missing data be handled.

Each of these questions, other than the missing data issue, will be addressed in this lecture.

Two forms of query exist, monothetic and polythetic queries.

In monothetic queries we have questions like (size=medium?)

In polythetic queries we have

(size=medium?) AND (NOT (colour=yellow?))

Polythetic queries may be used but can result in a vast number of possible questions.
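As a small illustration (assuming the attributes from the example queries above), the two query forms can be written as predicates on a nominal feature vector:

```python
# Monothetic query: a single attribute is tested.
def monothetic(x):
    return x["size"] == "medium"

# Polythetic query: a Boolean combination of attribute tests.
def polythetic(x):
    return x["size"] == "medium" and not (x["colour"] == "yellow")

sample = {"size": "medium", "colour": "green"}
print(monothetic(sample), polythetic(sample))   # True True
```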

(5)

Number of Splits

Each decision outcome at a node is called a split. It splits the data into subsets. The root node splits the complete training set, subsequent nodes split subsets of the data.

In general the number of splits is determined by an expert and can vary throughout the tree (we will briefly look at a scheme for this). However, every decision, and hence every decision tree, can be generated using just binary decisions.

The number of splits may dramatically alter the total number of nodes in the tree.

[Figure: (a) a binary tree asking (size=medium?) and then (size=big?), with leaves Apple, Watermelon and Grape; (b) an equivalent tertiary tree asking Size?, with branches big, medium and small leading to Watermelon, Apple and Grape]

(6)

Node Assignment

If the generation of the tree has been very successful then the nodes of the tree will be pure; that is, all samples at a terminal node have the same class label. In this case assignment of the terminal nodes to classes is trivial.

However, extremely pure trees are not necessarily good. Purity often indicates that the tree is over-tuned to the training data and will not generalise.

Impure nodes are assigned to the class with the greatest number of samples at that node, in other words

$$\hat{\omega} = \arg\max_{k} \left\{ P(\omega_k|N) \right\}$$

where P(ωk|N) = nk, and nk is the fraction of the samples assigned to node N that belong to class ωk.
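A short sketch of this assignment rule, assuming the class labels of the training samples reaching node N are available as a list:

```python
from collections import Counter

def assign_label(labels_at_node):
    """Label an impure node with the class that has the greatest number of
    samples at the node, i.e. the class maximising P(omega_k | N)."""
    return Counter(labels_at_node).most_common(1)[0][0]

# e.g. a node reached mostly by samples of class 1
print(assign_label([1, 1, 2, 1, 3, 1, 2, 1]))   # -> 1
```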

(7)

Splitting Rules

The aim of the set of splits is to yield a set of pure terminal nodes.

To assess the purity of the nodes, a node impurity function I(N) at a particular node N is defined

$$I(N) = \phi(P(\omega_1|N), \ldots, P(\omega_K|N))$$

where there are K classification classes.

The function φ() should have the following properties

φ() is a maximum when P(ωi) = 1/K for all i

φ() is at a minimum when P(ωi) = 1 and P(ωj) = 0, j ≠ i.

It is a symmetric function (i.e. the order of the class probabilities doesn’t matter).

The obvious way to use the purity when constructing the tree is to split the node according to the question that maximises the increase in purity (or minimises the impurity). The drop in impurity for a binary split is

$$\Delta I(N) = I(N) - n_L I(N_L) - (1 - n_L) I(N_R)$$

where:

NL and NR are the left and right descendants;

nL is the fraction of the samples in the left descendant.
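A sketch of this computation from the class labels that fall in each descendant; the impurity argument stands for any of the purity measures defined later, and class_probs is an illustrative helper.

```python
def class_probs(labels):
    """Estimate P(omega_k | N) as the fraction of samples of each class."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def binary_impurity_drop(labels, left_labels, right_labels, impurity):
    """Delta I(N) = I(N) - n_L I(N_L) - (1 - n_L) I(N_R)."""
    n_left = len(left_labels) / len(labels)
    return (impurity(class_probs(labels))
            - n_left * impurity(class_probs(left_labels))
            - (1.0 - n_left) * impurity(class_probs(right_labels)))
```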

(8)

Splitting Rules (cont)

We may also want to extend the binary split into a multi-way split. Here we can use the natural extension

$$\Delta I(N) = I(N) - \sum_{k=1}^{B} n_k I(N_k)$$

where nk is the fraction of the samples in descendant k.

The direct use of this expression to determine B has a drawback:

large B are inherently favoured over small B.

To avoid this problem the split is commonly selected by

$$\Delta I_B(N) = \frac{\Delta I(N)}{-\sum_{k=1}^{B} n_k \log(n_k)}$$

This is the gain ratio impurity.
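A sketch of the gain ratio for a B-way split, reusing the class_probs helper and the impurity argument from the earlier sketch (names are illustrative):

```python
import math

def multiway_impurity_drop(labels, branch_labels, impurity):
    """Delta I(N) = I(N) - sum_{k=1..B} n_k I(N_k)."""
    n = len(labels)
    drop = impurity(class_probs(labels))
    for subset in branch_labels:                      # one label list per branch
        drop -= (len(subset) / n) * impurity(class_probs(subset))
    return drop

def gain_ratio(labels, branch_labels, impurity):
    """Normalise the drop in impurity by the entropy of the split fractions
    n_k, so that large branching factors B are not inherently favoured."""
    n = len(labels)
    split_entropy = -sum((len(s) / n) * math.log(len(s) / n)
                         for s in branch_labels if len(s) > 0)
    return multiway_impurity_drop(labels, branch_labels, impurity) / split_entropy
```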

Using the splitting rules described before results in a greedy algorithm. This means that any split is only locally good.

There is no guarantee that a globally optimal set of splits will be found, nor that the smallest tree will be obtained.

(9)

Purity Measures

There are a number of commonly used measures:

Entropy measure:

$$I(N) = -\sum_{i=1}^{K} P(\omega_i|N) \log(P(\omega_i|N))$$

Variance measure: for a two class problem a simple measure is

$$I(N) = P(\omega_1|N) P(\omega_2|N)$$

Gini measure: this is a generalisation of the variance measure to K classes

$$I(N) = \sum_{i \neq j} P(\omega_i|N) P(\omega_j|N) = 1 - \sum_{i=1}^{K} P(\omega_i|N)^2$$

Misclassification measure:

$$I(N) = 1 - \max_{k} \left( P(\omega_k|N) \right)$$

Scaled versions of the purity measures for a two class problem are shown below.

[Figure: scaled versions of the entropy, misclassification and Gini/variance purity measures plotted against the class probability for a two class problem]
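The measures above translate directly into code; a sketch suitable for use as the impurity argument in the earlier snippets (function names are illustrative):

```python
import math

def entropy(probs):
    """Entropy measure: -sum_i P(omega_i|N) log P(omega_i|N)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gini(probs):
    """Gini measure: 1 - sum_i P(omega_i|N)^2 (the variance measure for K = 2)."""
    return 1.0 - sum(p * p for p in probs)

def misclassification(probs):
    """Misclassification measure: 1 - max_k P(omega_k|N)."""
    return 1.0 - max(probs)

# For a fairly pure two-class node with P(omega_1|N) = 0.9:
p = [0.9, 0.1]
print(entropy(p), gini(p), misclassification(p))
```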

(10)

Selecting a Split

We will assume that we are only considering binary splits, with n training samples. We will also only consider splits parallel to the axes. The choice of where to split for a particular dimension is determined by:

1. sorting the n samples according to the particular feature selected (assuming a natural sort order is possible);

2. examining the impurity measure at each of the possible n − 1 splits;

3. selecting the split with the lowest impurity;

Each of the possible d-dimensions should be examined.
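A sketch of this procedure for a single feature, sorting the samples and evaluating the n − 1 candidate splits (the threshold is placed midway between adjacent sorted values, one common convention; class_probs and the impurity argument are the helpers sketched earlier):

```python
def best_split_for_feature(values, labels, impurity):
    """Sort the n samples on one feature and evaluate each of the n - 1
    possible binary splits, returning the lowest-impurity threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(order)):
        left = [labels[j] for j in order[:i]]
        right = [labels[j] for j in order[i:]]
        n_left = i / len(order)
        score = (n_left * impurity(class_probs(left))
                 + (1.0 - n_left) * impurity(class_probs(right)))
        if score < best_score:
            threshold = 0.5 * (values[order[i - 1]] + values[order[i]])
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Each of the d dimensions would be examined in the same way.
```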

The above algorithm is only applicable when there is a natural ordering to the data.

What about nominal data?

In the general case we can consider any particular split based on questions about attributes of the feature vector. We saw the example where the size was used to determine the fruit.

More general questions are possible. For many tasks, such as clustering the phonetic context in speech recognition, the exact set of questions is not found to be important, provided a sensible set of questions is used.

(11)

When to stop splitting?

If the splitting is allowed to carry on we will eventually reach a stage where there is only one sample in each node.

Pure nodes can always be obtained!

However these pure nodes are not useful. It is unlikely that the tree built will generalise to unseen data. The tree has effectively become a lookup-table. There are a number of techniques that may be used to stop the tree growth at an appropriate stage.

Threshold: the tree is grown until the reduction in the impurity measure fails to exceed some specified minimum. We stop splitting a node N when

$$\Delta I(N) \leq \beta$$

where β is the minimum threshold. This is a very simple scheme to operate. It allows the tree to stop splitting at different levels.

MDL: a criterion related to minimum description length (MDL) may also be used. Here we look for a global minimum of

$$\alpha \, \text{size} + \sum_{\text{leaf nodes}} I(N)$$

where size represents the number of nodes and α is a positive constant. The problem is that a tunable parameter α has been added.

Other generic techniques, such as cross-validation and held-out data, may also be used.
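A sketch of the two stopping rules above, assuming the drop in impurity for a candidate split and the impurities of the current leaves are available (names and the example numbers are illustrative):

```python
def stop_splitting(delta_impurity, beta):
    """Threshold rule: stop splitting node N when Delta I(N) <= beta."""
    return delta_impurity <= beta

def mdl_criterion(leaf_impurities, num_nodes, alpha):
    """MDL-style criterion alpha * size + sum over leaf nodes of I(N);
    the candidate tree minimising this value is preferred."""
    return alpha * num_nodes + sum(leaf_impurities)

# e.g. a 7-node tree whose 4 leaves have impurities 0.0, 0.1, 0.05 and 0.2
print(stop_splitting(0.01, beta=0.05))                        # True: stop here
print(mdl_criterion([0.0, 0.1, 0.05, 0.2], num_nodes=7, alpha=0.02))
```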

(12)

Pruning

Stopping criteria can suffer from a lack of sufficient look-ahead, sometimes called the horizon effect. Instead of using a stopping criterion it is possible to generate a large tree and then prune this tree. The basic process is as follows (a sketch of the loop is given after the steps):

1. grow a tree until leaf nodes have a minimum impurity;

2. consider all adjacent pairs of leaf nodes (i.e. ones that have a common parent). For any pair that yields only a small increase in the impurity (i.e. gain is below some specified threshold)

(a) delete the two leaf nodes;

(b) make the parent a leaf node;

3. goto (2) unless all pairs are above the threshold.
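A sketch of this loop as a single bottom-up recursion over the TreeNode sketch from earlier, assuming each node recorded its own impurity, sample count and majority label while the tree was grown (these attribute names are illustrative):

```python
def prune(node, threshold):
    """Replace a pair of sibling leaves by their parent whenever merging them
    increases the impurity by less than the threshold. Pruning the subtrees
    first means merges can cascade up the tree in one pass."""
    if node.is_leaf():
        return node
    node.left = prune(node.left, threshold)
    node.right = prune(node.right, threshold)
    if node.left.is_leaf() and node.right.is_leaf():
        n_left = node.left.n_samples / node.n_samples
        children = (n_left * node.left.impurity
                    + (1.0 - n_left) * node.right.impurity)
        if node.impurity - children < threshold:   # only a small impurity increase
            node.left = node.right = None          # (a) delete the two leaf nodes
            node.label = node.majority_label       # (b) make the parent a leaf
    return node
```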

There are a number of features of pruning.

The resulting tree may be highly unbalanced.

Pruning avoids the horizon effect. Compared to techniques like cross-validation, no held-out data is required.

Pruning can be computationally expensive for very large trees. This may be reduced by not restricting merging to only act at leaves.

(13)

Multivariate Decision Trees

In some situations the proper feature to use to form the decision boundary does not line up with the axes.

[Figure: (c) a univariate tree partitions the (x1, x2) space with axis-parallel boundaries; (d) a multivariate tree allows oblique decision boundaries]

This has led to the use of multivariate decision trees. Here decisions do not have to be parallel to the axes. The question is how to select the direction of the new split.

This may be viewed as producing a hyperplane that correctly partitions the data. This is the same problem as for the single layer perceptron. We can therefore use:

perceptron criterion;

least mean squares;

Fisher’s linear discriminant.

There is a problem with this form of rule as the training time may be slow, particularly near the root node.
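A sketch of one of these options, using Fisher's linear discriminant to pick the split direction at a two-class node (NumPy is used for the linear algebra; the projections give a one-dimensional problem to which the earlier threshold-selection sketch can be applied):

```python
import numpy as np

def fisher_split_direction(X, y):
    """Return the Fisher direction w = S_w^{-1} (m1 - m2) for a two-class
    node, together with the 1-D projections of the samples onto w."""
    X1, X2 = X[y == 0], X[y == 1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_w, m1 - m2)
    return w, X @ w

# Illustrative usage on random two-class data in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 1.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, projections = fisher_split_direction(X, y)
```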

(14)

Computational Complexity

Suppose there are n training examples of d-dimensional data.

What are the training and test time and memory costs?

Training: initially only consider the root node. We first sort the data for each dimension, O(dn log2(n)). For all possible splits we calculate, for example, the entropy measure, O(n) + (n − 1)O(d). Thus the cost associated with the root node is O(dn log2(n)). Assume that there is approximately an equal split of the data. The cost at the first level is O(dn log2(n/2)), at the second level O(dn log2(n/4)), and so on. The maximum number of levels is O(log2(n)). Summing the levels yields cost O(dn(log2(n))^2) (the level-by-level sum is written out below).

The memory cost is related to the number of nodes, which is 1 + 2 + 4 + ... + n/2 ≈ n, i.e. O(n).

Testing: the time complexity for performing classification is simply the number of levels in the decision tree, O(log2(n)).
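As a check of the training cost quoted above (under the equal-split assumption, level l holds roughly 2^l nodes of n/2^l samples each), the per-level costs sum as:

$$\sum_{l=0}^{\log_2 n - 1} 2^l \, O\!\left( d \, \frac{n}{2^l} \log_2 \frac{n}{2^l} \right) = \sum_{l=0}^{\log_2 n - 1} O\!\left( d \, n \, (\log_2 n - l) \right) = O\!\left( d \, n \, (\log_2 n)^2 \right)$$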

In practice many of the assumptions made, for example equal splits, are not true. Also many heuristics are used to speed up the generation of trees. However these costs yield a reasonable “rule-of-thumb” for building and classifying with decision trees.

(15)

Other Tree Classifiers

This lecture has concentrated on CART. Alternative schemes are

ID3: this is intended for use with nominal data only. The system runs by splitting the data with a branching factor Bj for node j. As the number of splits is usually not binary the gain ratio impurity is commonly used. The number of levels of the tree is equal to the number of input variables, d. In the standard scheme there is no pruning.

C4.5: this is the successor to and refinement of ID3. Real valued samples are treated in the same fashion as CART.

Multi-way splits are used for the nominal data (as in ID3).

Heuristics for pruning based on the statistical significance of splits are used. There are a number of differences to CART - see Duda, Hart and Stork for more details.

(16)

Classification Trees?

The classification tree approach is recommended when:

1. The dataset is complex and believed to have non-linear decision boundaries, which may be approximated by a sum of linear boundaries.

2. There is a need to gain insight into the nature of the data structure. In multi-layer perceptrons it is hard to interpret the meaning of a particular node or layer, due to the high connectivity.

3. Need for ease of implementation. The generation of decision trees is relatively simple compared to, for example, multi-layer perceptrons.

4. Speed of classification. As discussed in the cost slide, classification can be achieved in cost O(L) where L is the number of levels in the tree. The questions at each level are relatively simple.
