● Avoids global optimization step used in C4.5rules and RIPPER
● Generates an unrestricted decision list using basic separate-and-conquer procedure
● Builds a partial decision tree to obtain a rule
♦ A rule is only pruned if all its implications are known
♦ Prevents hasty generalization
● Uses C4.5’s procedures to build a tree
Building a partial tree
Expand-subset(S):
  Choose test T and use it to split set of examples into subsets
  Sort subsets into increasing order of average entropy
  while (there is a subset X that has not yet been expanded
         AND all subsets expanded so far are leaves)
    expand-subset(X)
  if (all subsets expanded are leaves
      AND estimated error for subtree ≥ estimated error for node)
    undo expansion into subsets and make node a leaf
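A rough Python rendering of this expansion strategy, to make the control flow concrete (choose_test and estimated_error are assumed helpers, and details such as minimum coverage are omitted):

import math
from collections import Counter

def entropy(examples):
    """Entropy of the class labels in a list of (attributes, label) pairs."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def expand_subset(examples, choose_test, estimated_error):
    """Expand subsets in order of increasing entropy, but only while all
    previously expanded subsets came back as leaves (assumes choose_test
    always makes progress, i.e. shrinks the subsets)."""
    if entropy(examples) == 0:                       # pure subset -> leaf
        return {"leaf": True, "examples": examples}
    test, subsets = choose_test(examples)            # choose test T and split
    subsets.sort(key=entropy)                        # increasing average entropy
    children = []
    for subset in subsets:
        if children and not all(c["leaf"] for c in children):
            break                                    # a non-leaf stops expansion
        children.append(expand_subset(subset, choose_test, estimated_error))
    if all(c["leaf"] for c in children) and \
       sum(estimated_error(c["examples"]) for c in children) >= estimated_error(examples):
        return {"leaf": True, "examples": examples}  # undo expansion, make a leaf
    return {"leaf": False, "test": test, "children": children, "examples": examples}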
Example
Notes on PART
● Make leaf with maximum coverage into a rule
● Treat missing values just as C4.5 does
♦ I.e. split instance into pieces
● Time taken to generate a rule:
♦ Worst case: same as for building a pruned tree
● Occurs when data is noisy
♦ Best case: same as for building a single rule
● Occurs when data is noise free
Rules with exceptions
1. Given: a way of generating a single good rule
2. Then it's easy to generate rules with exceptions
3. Select default class for top-level rule
4. Generate a good rule for one of the remaining classes
5. Apply this method recursively to the two subsets produced by the rule
(i.e. instances that are covered/not covered)
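A hedged Python sketch of steps 1-5 (learn_one_rule, covers, and majority_class are assumed helpers; the exclude keyword is hypothetical):

def rules_with_exceptions(instances, learn_one_rule, covers, majority_class):
    """Steps 1-5: pick a default class, find one good rule for another
    class, then recurse on the covered and uncovered subsets."""
    if not instances:
        return None
    default = majority_class(instances)                 # step 3
    rule = learn_one_rule(instances, exclude=default)   # step 4
    if rule is None:
        return {"default": default}
    covered = [x for x in instances if covers(rule, x)]
    uncovered = [x for x in instances if not covers(rule, x)]
    if not covered or not uncovered:
        return {"default": default}                     # degenerate rule: stop
    return {"default": default, "rule": rule,           # step 5: recurse
            "exceptions": rules_with_exceptions(covered, learn_one_rule,
                                                covers, majority_class),
            "rest": rules_with_exceptions(uncovered, learn_one_rule,
                                          covers, majority_class)}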
Iris data example
Exceptions are represented as dotted paths, alternatives as solid ones.
Extending linear classification
● Linear classifiers can’t model nonlinear class boundaries
● Simple trick:
♦ Map attributes into new space consisting of combinations of attribute values
♦ E.g.: all products of n factors that can be constructed from the attributes
● Example with two attributes and n = 3: $x = w_1 a_1^3 + w_2 a_1^2 a_2 + w_3 a_1 a_2^2 + w_4 a_2^3$
Problems with this approach
● 1st problem: speed
♦ 10 attributes, and n = 5 ⇒ >2000 coefficients
♦ Use linear regression with attribute selection
♦ Run time is cubic in number of attributes
● 2nd problem: overfitting
♦ Number of coefficients is large relative to the number of training instances
♦ Curse of dimensionality kicks in
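The ">2000" figure can be checked directly: the number of distinct products of n = 5 factors drawn, with repetition, from 10 attributes is the multiset coefficient C(10 + 5 − 1, 5):

from math import comb

m, n = 10, 5                 # 10 attributes, products of n = 5 factors
print(comb(m + n - 1, n))    # 2002 -> ">2000 coefficients"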
Support vector machines
● Support vector machines are algorithms for learning linear classifiers
● Resilient to overfitting because they learn a particular linear decision boundary:
♦ The maximum margin hyperplane
● Fast in the nonlinear case
♦ Use a mathematical trick to avoid creating “pseudo-attributes”
The maximum margin hyperplane
● The instances closest to the maximum margin hyperplane are called support vectors
Support vectors
● The support vectors define the maximum margin hyperplane
● All other instances can be deleted without changing its position and orientation
Finding support vectors
● Support vector: training instance for which αi > 0
● Determining $\alpha_i$ and b is a constrained quadratic optimization problem
♦ Off-the-shelf tools for solving these problems
♦ However, special-purpose algorithms are faster
♦ Example: Platt’s sequential minimal optimization algorithm (implemented in WEKA)
● Note: all this assumes separable data!
$x = b + \sum_{i\ \text{is supp. vector}} \alpha_i y_i \, \mathbf{a}(i) \cdot \mathbf{a}$
Nonlinear SVMs
● “Pseudo attributes” represent attribute combinations
● Overfitting not a problem because the maximum margin hyperplane is stable
♦ There are usually few support vectors relative to the size of the training set
● Computation time still an issue
♦ Each time the dot product is computed, all the “pseudo attributes” must be included
A mathematical trick
● Avoid computing the “pseudo attributes”
● Compute the dot product before doing the nonlinear mapping
● Example: $(\mathbf{a}(i) \cdot \mathbf{a})^n$
● Corresponds to a map into the instance space spanned by all products of n attributes
$x = b + \sum_{i\ \text{is supp. vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n$
Other kernel functions
● Mapping is called a “kernel function”
● Polynomial kernel:
$x = b + \sum_{i\ \text{is supp. vector}} \alpha_i y_i \, (\mathbf{a}(i) \cdot \mathbf{a})^n$
● We can use others:
$x = b + \sum_{i\ \text{is supp. vector}} \alpha_i y_i \, K(\mathbf{a}(i), \mathbf{a})$
● Only requirement: $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ for some mapping $\Phi$
● Examples:
$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$
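A minimal sketch of how the kernelized output is computed; the kernel touches the original attribute vectors only through a dot product (illustrative only, not WEKA's SMO implementation):

import numpy as np

def poly_kernel(x, y, n=3):
    """Polynomial kernel (x . y + 1)^n: a dot product in the space of
    all products of up to n attributes, without building that space."""
    return (np.dot(x, y) + 1) ** n

def svm_output(a, support_vectors, alphas, ys, b, kernel=poly_kernel):
    """x = b + sum over support vectors of alpha_i * y_i * K(a(i), a)."""
    return b + sum(alpha * y * kernel(sv, a)
                   for sv, alpha, y in zip(support_vectors, alphas, ys))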
Noise
● Have assumed that the data is separable (in original or transformed space)
● Can apply SVMs to noisy data by introducing a “noise” parameter C
● C bounds the influence of any one training instance on the decision boundary
♦ Corresponding constraint: 0 ≤ αi ≤ C
● Still a quadratic optimization problem
● Have to determine C by experimentation
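For that experimentation, cross-validated grid search is a common recipe; a sketch using scikit-learn (an assumed library here; the slides themselves refer to WEKA):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Larger C lets individual training instances influence the
# boundary more; smaller C tolerates more margin violations.
search = GridSearchCV(SVC(kernel="poly", degree=3),
                      param_grid={"C": [0.1, 1, 10, 100]}, cv=5)
# search.fit(X_train, y_train)    # X_train, y_train: your data
# print(search.best_params_)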
Sparse data
● SVM algorithms speed up dramatically if the data is sparse (i.e. many values are 0)
● Why? Because they compute lots and lots of dot products
● Sparse data ⇒ compute dot products very efficiently
● Iterate only over nonzero values
● SVMs can process sparse datasets with 10,000s of attributes
Applications
● Machine vision: e.g. face identification
● Outperforms alternative approaches (1.5% error)
● Handwritten digit recognition: USPS data
● Comparable to best alternative (0.8% error)
● Bioinformatics: e.g. prediction of protein secondary structure
● Text classification
● Can modify SVM technique for numeric prediction problems
Support vector regression
● Maximum margin hyperplane only applies to classification
● However, idea of support vectors and kernel functions can be used for regression
● Basic method same as in linear regression: want to minimize error
♦ Difference A: ignore errors smaller than ε and use absolute error instead of squared error
♦ Difference B: simultaneously aim to maximize flatness of function
More on SVM regression
● If there are tubes that enclose all the training points, the flattest of them is used
♦ E.g.: mean is used if 2ε > range of target values
● Model can be written as:
♦ Support vectors: points on or outside tube
♦ Dot product can be replaced by kernel function
♦ Note: coefficients $\alpha_i$ may be negative
● No tube that encloses all training points?
♦ Requires tradeoff between error and flatness
♦ Controlled by upper limit C on absolute value of coefficients α
$x = b + \sum_{i\ \text{is supp. vector}} \alpha_i \, \mathbf{a}(i) \cdot \mathbf{a}$
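A sketch of these knobs using scikit-learn's SVR (an assumed library; epsilon is the tube half-width, C caps the coefficient magnitudes):

from sklearn.svm import SVR

# Errors inside the epsilon-tube are ignored; C trades error
# against flatness by bounding |alpha_i|.
model = SVR(kernel="poly", degree=2, epsilon=0.5, C=1.0)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)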
Examples
(Figure: regression examples, e.g. with ε = 2.)
The kernel perceptron
● Can use “kernel trick” to make nonlinear classifier using perceptron rule
● Observation: weight vector is modified by adding or subtracting training instances
● Can represent weight vector using all instances that have been misclassified:
♦ Can use $\sum_i \sum_j y(j) \, a'(j)_i \, a_i$ instead of $\sum_i w_i a_i$
(where y is either −1 or +1)
● Now swap summation signs: $\sum_j y(j) \sum_i a'(j)_i \, a_i$
♦ Can be expressed as: $\sum_j y(j) \, \mathbf{a}'(j) \cdot \mathbf{a}$
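A small numpy implementation of the resulting algorithm, keeping one count per instance instead of an explicit weight vector (a toy sketch; precomputing the Gram matrix is an assumption for speed):

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """X: (n, d) instances; y: labels in {-1, +1}.
    Returns c, how often each instance was added to the 'weight vector'."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(X)
    c = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(epochs):
        for j in range(n):
            # prediction = sign of sum_i c_i * y_i * K(x_i, x_j)
            if np.sign(np.sum(c * y * K[:, j])) != y[j]:
                c[j] += 1                    # add the misclassified instance
    return c

def predict(x, X, y, c, kernel):
    return np.sign(sum(ci * yi * kernel(xi, x)
                       for ci, yi, xi in zip(c, y, X)))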
Comments on kernel perceptron
● Finds separating hyperplane in space created by kernel function (if it exists)
♦ But: doesn't find maximum-margin hyperplane
● Easy to implement, supports incremental learning
● Linear and logistic regression can also be upgraded using the kernel trick
♦ But: solution is not “sparse”: every training instance contributes to solution
● Perceptron can be made more stable by using all weight vectors encountered during learning, not just last one
Multilayer perceptrons
● Using kernels is only one way to build nonlinear classifier based on perceptrons
● Can create network of perceptrons to approximate arbitrary target concepts
● Multilayer perceptron is an example of an artificial neural network
♦ Consists of: input layer, hidden layer(s), and output layer
● Structure of MLP is usually found by experimentation
● Parameters can be found using backpropagation
Examples
Backpropagation
● How to learn weights given network structure?
♦ Cannot simply use perceptron learning rule because we have hidden layer(s)
♦ Function we are trying to minimize: error
♦ Can use a general function minimization technique called gradient descent
● Need differentiable activation function: use sigmoid function instead of threshold function
● Need differentiable error function: can't use zeroone loss, but can use squared error
$f(x) = \frac{1}{1 + \exp(-x)}$
The two activation functions
Gradient descent example
● Function: $x^2 + 1$
● Derivative: $2x$
● Learning rate: 0.1
● Start value: 4
Can only find a local minimum!
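The example as code:

def gradient_descent(df, x, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

# f(x) = x**2 + 1, f'(x) = 2x, start value 4
print(gradient_descent(lambda x: 2 * x, x=4.0))   # converges toward 0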
Minimizing the error I
● Need to find partial derivative of error function for each parameter (i.e. weight)
$\frac{dE}{dw_i} = (y - f(x)) \, \frac{df(x)}{dw_i}$
$\frac{df(x)}{dx} = f(x)(1 - f(x))$
$x = \sum_i w_i f(x_i)$
$\frac{df(x)}{dw_i} = f'(x) \, f(x_i)$
$\frac{dE}{dw_i} = (y - f(x)) \, f'(x) \, f(x_i)$
Minimizing the error II
● What about the weights for the connections from the input to the hidden layer?
$\frac{dE}{dw_{ij}} = \frac{dE}{dx} \frac{dx}{dw_{ij}} = (y - f(x)) \, f'(x) \, \frac{dx}{dw_{ij}}$
$x = \sum_i w_i f(x_i)$
$\frac{dx}{dw_{ij}} = w_i \, \frac{df(x_i)}{dw_{ij}}$
$\frac{df(x_i)}{dw_{ij}} = f'(x_i) \, \frac{dx_i}{dw_{ij}} = f'(x_i) \, a_i$
$\frac{dE}{dw_{ij}} = (y - f(x)) \, f'(x) \, w_i \, f'(x_i) \, a_i$
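A compact numpy sketch of both update rules for one hidden layer, a single sigmoid output unit, and squared error (stochastic updates; a toy illustration under those assumptions, not WEKA's implementation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(a, y, w, W, rate=0.1):
    """One stochastic update. a: input vector, y: target,
    w: hidden-to-output weights (h,), W: input-to-hidden weights (h, d)."""
    f_i = sigmoid(W @ a)               # hidden activations f(x_i)
    f = sigmoid(w @ f_i)               # network output f(x)
    err = y - f
    fprime = f * (1 - f)               # f'(x) = f(x)(1 - f(x))
    # dE/dw_i  = (y - f(x)) f'(x) f(x_i)
    dw = rate * err * fprime * f_i
    # dE/dw_ij = (y - f(x)) f'(x) w_i f'(x_i) a_j
    dW = rate * np.outer(err * fprime * w * f_i * (1 - f_i), a)
    return w + dw, W + dW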
Remarks
● Same process works for multiple hidden layers and multiple output units (e.g. for multiple classes)
● Can update weights after all training instances have been processed or incrementally:
♦ batch learning vs. stochastic backpropagation
♦ Weights are initialized to small random values
● How to avoid overfitting?
♦ Early stopping: use validation set to check when to stop
♦ Weight decay: add penalty term to error function
● How to speed up learning?
Radial basis function networks
● Another type of feedforward network with two layers (plus the input layer)
● Hidden units represent points in instance space and activation depends on distance
♦ To this end, distance is converted into similarity: Gaussian activation function
● Width may be different for each hidden unit
♦ Points of equal activation form hypersphere (or hyperellipsoid) as opposed to hyperplane
● Output layer same as in MLP
Learning RBF networks
● Parameters: centers and widths of the RBFs + weights in output layer
● Can learn two sets of parameters independently and still get accurate models
♦ E.g.: clusters from k-means can be used to form basis functions
♦ Linear model can be used based on fixed RBFs
♦ Makes learning RBFs very efficient
● Disadvantage: no builtin attribute weighting based on relevance
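A sketch of this two-stage recipe with scikit-learn (an assumed library): k-means supplies the centers, then a linear model is fitted to the fixed Gaussian activations (a single shared width, for simplicity):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def fit_rbf_network(X, y, k=10, width=1.0):
    """Stage 1: k-means centers. Stage 2: linear model on RBF activations."""
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    def activations(Z):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * width ** 2))     # Gaussian activation
    model = Ridge().fit(activations(X), y)
    return lambda Znew: model.predict(activations(Znew))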
Instance-based learning
● Practical problems of 1NN scheme:
♦ Slow (but: fast tree-based approaches exist)
● Remedy: remove irrelevant data
♦ Noise (but: k-NN copes quite well with noise)
● Remedy: remove noisy instances
♦ All attributes deemed equally important
● Remedy: weight attributes (or simply select)
♦ Doesn’t perform explicit generalization
● Remedy: rule-based NN approach
Learning prototypes
● Only those instances involved in a decision need to be stored
● Noisy instances should be filtered out
Speed up, combat noise
● IB2: save memory, speed up classification
♦ Work incrementally
♦ Only incorporate misclassified instances
♦ Problem: noisy data gets incorporated
● IB3: deal with noise
♦ Discard instances that don’t perform well
♦ Compute confidence intervals for
● 1. Each instance’s success rate
● 2. Default accuracy of its class
♦ Accept/reject instances
● Accept if lower limit of 1 exceeds upper limit of 2
Weight attributes
● IB4: weight each attribute
(weights can be classspecific)
● Weighted Euclidean distance:
$\sqrt{w_1^2 (x_1 - y_1)^2 + \dots + w_n^2 (x_n - y_n)^2}$
● Update weights based on nearest neighbor
● Class correct: increase weight
● Class incorrect: decrease weight
● Amount of change for i-th attribute depends on $|x_i - y_i|$
Rectangular generalizations
● Nearest-neighbor rule is used outside rectangles
● Rectangles are rules! (But they can be more conservative than “normal” rules.)
● Nested rectangles are rules with exceptions
Generalized exemplars
● Generalize instances into hyperrectangles
♦ Online: incrementally modify rectangles
♦ Offline version: seek small set of rectangles that cover the instances
● Important design decisions:
♦ Allow overlapping rectangles?
● Requires conflict resolution
♦ Allow nested rectangles?
♦ Dealing with uncovered instances?
Separating generalized exemplars
(Figure: exemplars of Class 1 and Class 2, separated.)
Generalized distance functions
● Given: some transformation operations on attributes
● K*: similarity = probability of transforming instance A into B by chance
● Average over all transformation paths
● Weight paths according to their probability (need a way of measuring this)
● Uniform way of dealing with different attribute types
Numeric prediction
● Counterparts exist for all schemes previously discussed
♦ Decision trees, rule learners, SVMs, etc.
● (Almost) all classification schemes can be applied to regression problems using discretization
♦ Discretize the class into intervals
♦ Predict weighted average of interval midpoints
♦ Weight according to class probabilities
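For instance, with three intervals (toy numbers):

midpoints = [5.0, 15.0, 25.0]    # interval midpoints of the discretized class
probs = [0.2, 0.7, 0.1]          # class probabilities from the classifier

print(sum(p * m for p, m in zip(probs, midpoints)))   # 14.0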
Regression trees
● Like decision trees, but:
♦ Splitting criterion: minimize intra-subset variation
♦ Termination criterion: std dev becomes small
♦ Pruning criterion: based on numeric error measure
♦ Prediction: Leaf predicts average class values of instances
● Piecewise constant functions
Model trees
● Build a regression tree
● Each leaf ⇒ linear regression function
● Smoothing: factor in ancestor's predictions
♦ Smoothing formula: $p' = \frac{np + kq}{n + k}$
(n: instances reaching the node below, p: prediction passed up from below, q: value predicted by this node's model, k: smoothing constant)
♦ Same effect can be achieved by incorporating ancestor models into the leaves
● Need linear regression function at each node
● At each node, use only a subset of attributes
♦ Those occurring in subtree
♦ (+ maybe those occurring in path to the root)
● Fast: tree usually uses only a small subset of the attributes
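The smoothing formula above as code (toy values; k = 15 is a common choice of constant):

def smooth(p, q, n, k=15):
    """p: prediction from below, q: this node's model value,
    n: instances reaching the node below, k: smoothing constant."""
    return (n * p + k * q) / (n + k)

print(smooth(p=10.0, q=12.0, n=30))   # 10.67: pulled toward the ancestor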
Building the tree
● Splitting: standard deviation reduction
$SDR = sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i)$
● Termination:
♦ Standard deviation < 5% of its value on full training set
♦ Too few instances remain (e.g. < 4)
● Pruning:
♦ Heuristic estimate of absolute error of LR models:
$\frac{n + \nu}{n - \nu} \times \text{average absolute error}$
(n: number of training instances, ν: number of parameters in the model)
♦ Greedily remove terms from LR models to minimize estimated error
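The SDR criterion above as code (using population standard deviation; a toy split):

import statistics

def sdr(T, subsets):
    """sd(T) minus the size-weighted standard deviations of the subsets."""
    return statistics.pstdev(T) - sum(
        len(Ti) / len(T) * statistics.pstdev(Ti) for Ti in subsets)

values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
print(sdr(values, [values[:3], values[3:]]))   # large: the split separates well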
Nominal attributes
● Convert nominal attributes to binary ones
● Sort attribute by average class value
● If attribute has k values,
generate k – 1 binary attributes
● i-th is 0 if value lies within the first i, otherwise 1
● Treat binary attributes as numeric
● Can prove: best split on one of the new attributes is the best (binary) split on original
Missing values
● Modify splitting criterion:
● To determine which subset an instance goes into, use surrogate splitting
● Split on the attribute whose correlation with original is greatest
● Problem: complex and timeconsuming
● Simple solution: always use the class
$SDR = \frac{m}{|T|} \times \left[ sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i) \right]$
(m: number of instances with known values for the attribute)
Surrogate splitting based on class
● Choose split point based on instances with known values
● Split point divides instances into 2 subsets
● L (smaller class average)
● R (larger)
● m is the average of the two averages
● For an instance with a missing value:
● Choose L if class value < m
● Otherwise R
● Once full tree is built, replace missing values with averages of corresponding leaf nodes
Pseudocode for M5'
● Four methods:
♦ Main method: MakeModelTree
♦ Method for splitting: split
♦ Method for pruning: prune
♦ Method that computes error: subtreeError
● We’ll briefly look at each method in turn
● Assume that linear regression method performs attribute subset selection based on error
MakeModelTree
MakeModelTree(instances) {
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)
}
split
split(node) {
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05*SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)
}
prune
prune(node) {
  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF
}
subtreeError
subtreeError(node) {
  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances)*subtreeError(l)
            + sizeof(r.instances)*subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
}
Model tree for servo data
(Figure: the induced model tree; an annotation marks the result of merging.)
Rules from model trees
● PART algorithm generates classification rules by building partial decision trees
● Can use the same method to build rule sets for regression
♦ Use model trees instead of decision trees
♦ Use variance instead of entropy to choose node to expand when building partial tree
● Rules will have linear models on right-hand side
● Caveat: using smoothed trees may not be appropriate due to the separate-and-conquer strategy
Locally weighted regression
● Numeric prediction that combines
♦ instance-based learning
♦ linear regression
● “Lazy”:
♦ computes regression function at prediction time
♦ works incrementally
● Weight training instances
♦ according to distance to test instance
♦ needs weighted version of linear regression
● Advantage: nonlinear approximation
● But: slow
Design decisions
● Weighting function:
♦ Inverse Euclidean distance
♦ Gaussian kernel applied to Euclidean distance
♦ Triangular kernel used the same way
♦ etc.
● Smoothing parameter is used to scale the distance function
♦ Multiply distance by inverse of this parameter
♦ Possible choice: distance of k-th nearest training instance
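A numpy sketch of the whole scheme, using a Gaussian kernel on Euclidean distance and a weighted least-squares fit at prediction time (illustrative assumptions throughout):

import numpy as np

def locally_weighted_regression(X, y, query, smoothing=1.0):
    """Fit a weighted linear model around the query point, lazily."""
    d = np.linalg.norm(X - query, axis=1) / smoothing   # scaled distances
    w = np.exp(-d ** 2 / 2)                             # Gaussian kernel weights
    Xb = np.hstack([np.ones((len(X), 1)), X])           # add intercept column
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y) # weighted least squares
    return np.concatenate([[1.0], query]) @ beta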
Discussion
● Regression trees were introduced in CART
● Quinlan proposed model tree method (M5)
● M5’: slightly improved, publicly available
● Quinlan also investigated combining instance-based learning with M5
● CUBIST: Quinlan’s commercial rule learner for numeric prediction
● Interesting comparison: neural nets vs. M5
Clustering: how many clusters?
● How to choose k in k-means? Possibilities:
♦ Choose k that minimizes cross-validated squared distance to cluster centers
♦ Use penalized squared distance on the training data (e.g. using an MDL criterion)
♦ Apply k-means recursively with k = 2 and use stopping criterion (e.g. based on MDL)
● Seeds for subclusters can be chosen by seeding along direction of greatest variance in cluster
(one standard deviation away in each direction from cluster center of parent cluster)
Incremental clustering
● Heuristic approach (COBWEB/CLASSIT)
● Form a hierarchy of clusters incrementally
● Start:
♦ tree consists of empty root node
● Then:
♦ add instances one by one
♦ update tree appropriately at each stage
♦ to update, find the right leaf for an instance
♦ May involve restructuring the tree
● Base update decisions on category utility
Clustering weather data
ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
(Figures 1-3: the cluster tree after successive instances are added.)
Clustering weather data
ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True
(Figures 3-5: later stages of the cluster tree; “Merge best host and runner-up”.)
Final hierarchy
ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
(Figure: the final hierarchy over these instances.)
Example: the iris data (subset)
Clustering with cutoff
Category utility
● Category utility: quadratic loss function defined on conditional probabilities:
$CU(C_1, C_2, \ldots, C_k) = \frac{\sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}$
● Every instance in a different category ⇒ numerator becomes maximum:
$n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2$
(where n is the number of attributes)
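A direct Python transcription for nominal data (instances as attribute→value dicts; a toy sketch):

def category_utility(clusters, domain):
    """clusters: list of clusters, each a list of instances (dicts).
    domain: dict mapping each attribute to its possible values."""
    everything = [x for c in clusters for x in c]
    n = len(everything)
    def p(insts, a, v):
        return sum(1 for x in insts if x[a] == v) / len(insts)
    total = 0.0
    for c in clusters:
        inner = sum(p(c, a, v) ** 2 - p(everything, a, v) ** 2
                    for a, vals in domain.items() for v in vals)
        total += len(c) / n * inner
    return total / len(clusters)        # division by k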
Numeric attributes
● Assume normal distribution:
$f(a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(a - \mu)^2}{2\sigma^2} \right)$
● Then:
$\sum_j \Pr[a_i = v_{ij}]^2 \equiv \int f(a_i)^2 \, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}$
● Thus
$CU(C_1, C_2, \ldots, C_k) = \frac{\sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}$
becomes
$CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_l \Pr[C_l] \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)$
Probability-based clustering
● Problems with heuristic approach:
♦ Division by k?
♦ Order of examples?
♦ Are restructuring operations sufficient?
♦ Is result at least local minimum of category utility?
● Probabilistic perspective ⇒ seek the most likely clusters given the data
● Also: instance belongs to a particular cluster with a certain probability
Finite mixtures
● Model data using a mixture of distributions
● One cluster, one distribution
♦ governs probabilities of attribute values in that cluster
● Finite mixtures : finite number of clusters
● Individual distributions are normal (usually)
● Combine distributions using cluster weights
Two-class mixture model
A 51 A 43 B 62 B 64 A 45 A 42 A 46 A 45 A 45
B 62 A 47 A 52 B 64 A 51 B 65 A 48 A 49 A 46
B 64 A 51 A 52 B 62 A 49 A 48 B 62 A 43 A 40
A 48 B 64 A 51 B 63 A 43 B 65 B 66 B 65 A 46
A 39 B 62 B 64 A 52 B 63 B 64 A 48 B 64 A 48
A 51 A 48 B 64 A 42 A 48 A 41
data
(Figure: the two-component mixture model fitted to this data.)
Using the mixture model
● Probability that instance x belongs to cluster A:
$\Pr[A \mid x] = \frac{\Pr[x \mid A] \Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A) \, p_A}{\Pr[x]}$
with $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
● Probability of an instance given the clusters:
$\Pr[x \mid \text{the clusters}] = \sum_i \Pr[x \mid \text{cluster}_i] \Pr[\text{cluster}_i]$
Learning the clusters
● Assume:
♦ we know there are k clusters
● Learn the clusters ⇒
♦ determine their parameters
♦ I.e. means and standard deviations
● Performance criterion:
♦ probability of training data given the clusters
● EM algorithm
♦ finds a local maximum of the likelihood
EM algorithm
● EM = Expectation-Maximization
● Generalize k-means to probabilistic setting
● Iterative procedure:
● E “expectation” step:
Calculate cluster probability for each instance
● M “maximization” step:
Estimate distribution parameters from cluster probabilities
● Store cluster probabilities as instance weights
More on EM
● Estimate parameters from weighted instances
● Stop when log-likelihood saturates
● Log-likelihood:
$\sum_i \log\left( p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B] \right)$
● Weighted parameter estimates:
$\mu_A = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$
$\sigma_A^2 = \frac{w_1 (x_1 - \mu)^2 + w_2 (x_2 - \mu)^2 + \dots + w_n (x_n - \mu)^2}{w_1 + w_2 + \dots + w_n}$
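A toy numpy version of the E and M steps for two one-dimensional Gaussians (crude initialization; real implementations restart and regularize):

import numpy as np

def em_two_gaussians(x, iters=100):
    x = np.asarray(x, float)
    muA, muB, sdA, sdB, pA = x.min(), x.max(), x.std(), x.std(), 0.5
    normal = lambda z, mu, sd: (np.exp(-(z - mu) ** 2 / (2 * sd ** 2))
                                / (np.sqrt(2 * np.pi) * sd))
    for _ in range(iters):
        # E step: cluster probabilities, stored as instance weights
        a = pA * normal(x, muA, sdA)
        b = (1 - pA) * normal(x, muB, sdB)
        w = a / (a + b)                                  # Pr[A | x_i]
        # M step: weighted means and standard deviations
        muA = np.sum(w * x) / np.sum(w)
        muB = np.sum((1 - w) * x) / np.sum(1 - w)
        sdA = np.sqrt(np.sum(w * (x - muA) ** 2) / np.sum(w))
        sdB = np.sqrt(np.sum((1 - w) * (x - muB) ** 2) / np.sum(1 - w))
        pA = np.mean(w)
    return muA, sdA, muB, sdB, pA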
Extending the mixture model
● More than two distributions: easy
● Several attributes: easy—assuming independence!
● Correlated attributes: difficult
♦ Joint model: bivariate normal distribution with a (symmetric) covariance matrix
♦ n attributes: need to estimate n + n (n+1)/2 parameters
More mixture model extensions
● Nominal attributes: easy if independent
● Correlated nominal attributes: difficult
● Two correlated attributes ⇒ $v_1 \times v_2$ parameters
● Missing values: easy
● Can use other distributions than normal:
● “log-normal” if predetermined minimum is given
● “log-odds” if bounded from above and below
● Poisson for attributes that are integer counts
● Use crossvalidation to estimate k !
Bayesian clustering
● Problem: many parameters ⇒ EM overfits
● Bayesian approach: give every parameter a prior probability distribution
♦ Incorporate prior into overall likelihood figure
♦ Penalizes introduction of parameters
● Eg: Laplace estimator for nominal attributes
● Can also have prior on number of clusters!
● Implementation: NASA’s AUTOCLASS
Discussion
● Can interpret clusters by using supervised learning
♦ postprocessing step
● Decrease dependence between attributes?
♦ preprocessing step
♦ E.g. use principal component analysis
● Can be used to fill in missing values
● Key advantage of probabilistic clustering:
♦ Can estimate likelihood of data
♦ Use it to compare different models objectively
From naïve Bayes to Bayesian Networks
● Naïve Bayes assumes: attributes conditionally independent given the class
● Doesn't hold in practice but classification accuracy often high
● However: sometimes performance much worse than e.g. decision tree
Enter Bayesian networks
● Graphical models that can represent any probability distribution
● Graphical representation: directed acyclic graph, one node for each attribute
● Overall probability distribution factorized into component distributions
● Graph's nodes hold component distributions (conditional distributions)
Network for the weather data
(Figure: a Bayesian network for the weather data.)
Network for the weather data
(Figure: a second network for the weather data, with more edges.)
Computing the class probabilities
● Two steps: computing a product of probabilities for each class and normalization
♦ For each class value
● Take all attribute values and class value
● Look up corresponding entries in conditional probability distribution tables
● Take the product of all probabilities
♦ Divide the product for each class by the sum of the products (normalization)
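The two steps on a toy network (class → outlook, class → windy; all table entries are made-up numbers purely for illustration):

prior = {"yes": 0.6, "no": 0.4}                      # Pr[class]
p_outlook = {("sunny", "yes"): 0.2, ("sunny", "no"): 0.6,
             ("rainy", "yes"): 0.8, ("rainy", "no"): 0.4}
p_windy = {(True, "yes"): 0.3, (True, "no"): 0.7,
           (False, "yes"): 0.7, (False, "no"): 0.3}

def class_probabilities(outlook, windy):
    # Step 1: product of the looked-up table entries, per class value
    products = {c: prior[c] * p_outlook[(outlook, c)] * p_windy[(windy, c)]
                for c in prior}
    # Step 2: normalization
    total = sum(products.values())
    return {c: v / total for c, v in products.items()}

print(class_probabilities("sunny", True))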
Why can we do this? (Part I)
● Single assumption: values of a node's parents completely determine probability distribution for current node
♦ Means that node/attribute is conditionally independent of other ancestors given parents:
$\Pr[\text{node} \mid \text{ancestors}] = \Pr[\text{node} \mid \text{parents}]$
Why can we do this? (Part II)
● Chain rule from probability theory:
$\Pr[a_1, a_2, \ldots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_{i-1}, \ldots, a_1]$
● Because of our assumption from the previous slide:
$\Pr[a_1, a_2, \ldots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_{i-1}, \ldots, a_1] = \prod_{i=1}^{n} \Pr[a_i \mid a_i\text{'s parents}]$