Data Mining


Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 07/20/06

Data Mining

Practical Machine Learning Tools and Techniques

Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank

Input: Concepts, instances, attributes

Terminology

What’s a concept?
- Classification, association, clustering, numeric prediction

What’s in an example?
- Relations, flat files, recursion

What’s in an attribute?
- Nominal, ordinal, interval, ratio

Preparing the input
- ARFF, attributes, missing values, getting to know data

Terminology

Components of the input:

- Concepts: kinds of things that can be learned
  Aim: intelligible and operational concept description
- Instances: the individual, independent examples of a concept
  Note: more complicated forms of input are possible
- Attributes: measuring aspects of an instance
  We will focus on nominal and numeric ones

What’s a concept?

Styles of learning:

- Classification learning: predicting a discrete class
- Association learning: detecting associations between features
- Clustering: grouping similar instances into clusters
- Numeric prediction: predicting a numeric quantity

Concept: thing to be learned

Concept description: output of learning scheme


Classification learning

Example problems: weather data, contact lenses, irises, labor negotiations

Classification learning is supervised

- Scheme is provided with actual outcome

Outcome is called the class of the example

Measure success on fresh data for which class labels are known (test data)

In practice success is often measured subjectively


Association learning

Can be applied if no class is specified and any kind of structure is considered “interesting”

Difference from classification learning:
- Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
- Hence: far more association rules than classification rules
- Thus: constraints are necessary
  Minimum coverage and minimum accuracy
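
The two constraints can be made concrete with a small sketch. It assumes that coverage means the number of instances a rule predicts correctly, and accuracy means that number as a fraction of the instances the rule's antecedent applies to. The four instances and the rule are hypothetical and are not the full weather data.

```python
def rule_stats(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'."""
    def matches(instance, conditions):
        return all(instance.get(k) == v for k, v in conditions.items())

    applies = [i for i in instances if matches(i, antecedent)]
    correct = [i for i in applies if matches(i, consequent)]
    coverage = len(correct)
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy

# Hypothetical toy data, for illustration only.
toy = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "sunny", "windy": "false", "play": "yes"},
]

# Rule: if outlook = sunny then play = no
coverage, accuracy = rule_stats(toy, {"outlook": "sunny"}, {"play": "no"})
```

A rule miner would keep only rules whose coverage and accuracy both exceed chosen minimum thresholds.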


Clustering

Finding groups of items that are similar

Clustering is unsupervised

- The class of an example is not known

Success often measured subjectively

      Sepal length  Sepal width  Petal length  Petal width  Type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
…

Numeric prediction

Variant of classification learning where

“class” is numeric (also called “regression”)

Learning is supervised

- Scheme is being provided with target value

Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…

What’s in an example?

Instance: specific type of example

Thing to be classified, associated, or clustered

Individual, independent example of target concept

Characterized by a predetermined set of attributes

Input to learning scheme: set of instances/dataset

Represented as a single relation/flat file

Rather restricted form of input

No relationships between objects

Most common form in practical data mining


A family tree

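[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F). Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M). Pam (F) = Ian (M), with children Anna (F) and Nikki (F).]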


Family tree represented as a table

Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian

The “sister-of” relation

Listing every pair:

First person  Second person  Sister of?
Peter         Peggy          No
Peter         Steven         No
…
Steven        Peter          No
Steven        Graham         No
Steven        Pam            Yes
…
Ian           Pippa          Yes
…
Anna          Nikki          Yes
…
Nikki         Anna           Yes

Listing only the positive examples (closed-world assumption):

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No


A full representation in one table

First person                          Second person                         Sister of?
Name    Gender  Parent 1  Parent 2    Name    Gender  Parent 1  Parent 2
Steven  Male    Peter     Peggy       Pam     Female  Peter     Peggy       Yes
Graham  Male    Peter     Peggy       Pam     Female  Peter     Peggy       Yes
Ian     Male    Grace     Ray         Pippa   Female  Grace     Ray         Yes
Brian   Male    Grace     Ray         Pippa   Female  Grace     Ray         Yes
Anna    Female  Pam       Ian         Nikki   Female  Pam       Ian         Yes
Nikki   Female  Pam       Ian         Anna    Female  Pam       Ian         Yes
All the rest                                                                No

If second person’s gender = female
and first person’s parent = second person’s parent
then sister-of = yes
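
As a rough sketch, the sister-of rule can be applied to the family table to regenerate the positive examples. The dictionary layout and the function name `sister_of` are illustrative choices. Excluding self-pairs and treating unknown parents as never matching are assumptions added here, since the rule itself does not address them.

```python
# Family table from the slides: name -> (gender, parent1, parent2); None = unknown.
FAMILY = {
    "Peter": ("male", None, None),
    "Peggy": ("female", None, None),
    "Steven": ("male", "Peter", "Peggy"),
    "Graham": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Ian": ("male", "Grace", "Ray"),
    "Pippa": ("female", "Grace", "Ray"),
    "Brian": ("male", "Grace", "Ray"),
    "Anna": ("female", "Pam", "Ian"),
    "Nikki": ("female", "Pam", "Ian"),
}

def sister_of(first, second):
    """True if `second` is a sister of `first` according to the rule."""
    _, p1_first, p2_first = FAMILY[first]
    gender_second, p1_second, p2_second = FAMILY[second]
    if first == second or p1_first is None:
        # Assumption: no self-pairs, and unknown parents never match.
        return False
    return (gender_second == "female"
            and (p1_first, p2_first) == (p1_second, p2_second))

# Closed-world assumption: every pair not listed here counts as "No".
positives = sorted((a, b) for a in FAMILY for b in FAMILY if sister_of(a, b))
```

Under these assumptions `positives` contains the six "Yes" pairs from the table, and all remaining pairs fall under "All the rest".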

Generating a flat file

Process of flattening called “denormalization”

- Several relations are joined together to make one

Possible with any finite set of finite relations

Problematic: relationships without pre-specified number of objects
- Example: concept of nuclear-family

Denormalization may produce spurious regularities that reflect structure of database
- Example: “supplier” predicts “supplier address”

The “ancestor-of” relation

First person                          Second person                         Ancestor of?
Name    Gender  Parent 1  Parent 2    Name    Gender  Parent 1  Parent 2
Peter   Male    ?         ?           Steven  Male    Peter     Peggy       Yes
Peter   Male    ?         ?           Pam     Female  Peter     Peggy       Yes
Peter   Male    ?         ?           Anna    Female  Pam       Ian         Yes
Peter   Male    ?         ?           Nikki   Female  Pam       Ian         Yes
Pam     Female  Peter     Peggy       Nikki   Female  Pam       Ian         Yes
Grace   Female  ?         ?           Ian     Male    Grace     Ray         Yes
Other positive examples here                                                Yes
All the rest                                                                No

Recursion

Infinite relations require recursion

If person1 is a parent of person2
then person1 is an ancestor of person2

If person1 is a parent of person2
and person2 is an ancestor of person3
then person1 is an ancestor of person3

Appropriate techniques are known as “inductive logic programming”
- (e.g. Quinlan’s FOIL)
- Problems: (a) noise and (b) computational complexity
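
The two ancestor rules translate directly into a recursive function. The sketch below hand-codes them against the parent links of the family-tree table. It only evaluates the rules and does not learn them, which is the job of an inductive logic programming system. The names `PARENTS`, `is_parent`, and `is_ancestor` are illustrative.

```python
# Parent links from the family-tree table: child -> (parent1, parent2).
PARENTS = {
    "Steven": ("Peter", "Peggy"),
    "Graham": ("Peter", "Peggy"),
    "Pam": ("Peter", "Peggy"),
    "Ian": ("Grace", "Ray"),
    "Pippa": ("Grace", "Ray"),
    "Brian": ("Grace", "Ray"),
    "Anna": ("Pam", "Ian"),
    "Nikki": ("Pam", "Ian"),
}

def is_parent(person1, person2):
    """True if person1 is listed as a parent of person2."""
    return person1 in PARENTS.get(person2, ())

def is_ancestor(person1, person3):
    """Apply the two rules: the base case, then the recursive case."""
    # Rule 1: a parent is an ancestor.
    if is_parent(person1, person3):
        return True
    # Rule 2: person1 is a parent of some person2 who is an ancestor of person3.
    return any(is_parent(person1, person2) and is_ancestor(person2, person3)
               for person2 in PARENTS)
```

The recursion terminates because the family tree has no cycles, so each recursive call moves one generation down.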

What’s in an attribute?

Each instance is described by a fixed predefined set of features, its “attributes”

But: number of attributes may vary in practice

- Possible solution: “irrelevant value” flag

Related problem: existence of an attribute may depend on the value of another one

Possible attribute types (“levels of measurement”):
- Nominal, ordinal, interval and ratio

Nominal quantities

Values are distinct symbols

- Values themselves serve only as labels or names
- Nominal comes from the Latin word for name

Example: attribute “outlook” from weather data
- Values: “sunny”, “overcast”, and “rainy”

No relation is implied among nominal values (no ordering or distance measure)

Only equality tests can be performed


Ordinal quantities

Impose order on values

But: no distance between values defined

Example: attribute “temperature” in weather data
- Values: “hot” > “mild” > “cool”

Note: addition and subtraction don’t make sense

Example rule: temperature < hot ⇒ play = yes

Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

Interval quantities

Interval quantities are not only ordered but measured in fixed and equal units

Example 1: attribute “temperature” expressed in degrees Fahrenheit

Example 2: attribute “year”

Difference of two values makes sense

Sum or product doesn’t make sense
- Zero point is not defined!

Ratio quantities

Ratio quantities are ones for which the measurement scheme defines a zero point

Example: attribute “distance”

- Distance between an object and itself is zero

Ratio quantities are treated as real numbers
- All mathematical operations are allowed

But: is there an “inherently” defined zero point?
- Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)

Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and ordinal

Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
- But: “enumerated” and “discrete” imply order

Special case: dichotomy (“boolean” attribute)

Ordinal attributes are called “numeric”, or “continuous”
- But: “continuous” implies mathematical continuity

Metadata

Information about the data that encodes background knowledge

Can be used to restrict search space

Examples:

- Dimensional considerations (i.e. expressions must be dimensionally correct)
- Circular orderings (e.g. degrees in compass)
- Partial orderings (e.g. generalization/specialization relations)

Preparing the input

Denormalization is not the only issue

Problem: different data sources (e.g. sales department, customer billing department, …)

- Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
- Data must be assembled, integrated, cleaned up
- “Data warehouse”: consistent point of access

External data may be required (“overlay data”)

Critical: type and level of data aggregation


The ARFF format

%

% ARFF file for weather data with some numeric features

%

@relation weather

@attribute outlook {sunny, overcast, rainy}

@attribute temperature numeric

@attribute humidity numeric

@attribute windy {true, false}

@attribute play? {yes, no}

@data

sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
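
To show how the parts of the file fit together, here is a minimal reader for the subset of ARFF used in the weather example: comments, `@relation`, `@attribute`, and `@data`. It is a sketch and not a full ARFF parser. It assumes unquoted values and ignores string, date, and sparse data, and the function name `parse_arff` is illustrative.

```python
def parse_arff(text):
    """Parse a small subset of ARFF into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    # Convert the values of numeric attributes to float ("?" is left as is).
    for row in rows:
        for i, (_, spec) in enumerate(attributes):
            if spec == "numeric" and row[i] != "?":
                row[i] = float(row[i])
    return relation, attributes, rows

WEATHER = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attributes, rows = parse_arff(WEATHER)
```

After parsing, each row is a list with one entry per declared attribute, in declaration order.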

Additional attribute types

ARFF supports string attributes:

- Similar to nominal attributes but list of values is not pre-specified

It also supports date attributes:

- Uses the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss

@attribute description string

@attribute today date

Sparse data

In some applications most attribute values in a dataset are zero

- E.g.: word counts in a text categorization problem

ARFF supports sparse data

0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"

{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}

This also works for nominal attributes (where the first value corresponds to “zero”)
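
The mapping from the dense rows to the sparse rows can be sketched as follows: each non-zero value is written as its zero-based attribute index followed by the value. The function name `to_sparse` is illustrative. The sketch handles only numeric zeros and does not implement the nominal case, where the first declared value plays the role of zero.

```python
def to_sparse(row):
    """Format a dense row as an ARFF-style sparse row, omitting zero values."""
    parts = []
    for index, value in enumerate(row):
        if value == 0:
            continue  # zeros are implicit in the sparse format
        text = f'"{value}"' if isinstance(value, str) else str(value)
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"

dense_a = [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"]
dense_b = [0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"]

sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)
```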

Attribute types

Interpretation of attribute types in ARFF depends on learning scheme

- Numeric attributes are interpreted as
    ordinal scales if less-than and greater-than are used
    ratio scales if distance calculations are performed (normalization/standardization may be required)
- Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
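
A minimal sketch of such a distance calculation follows. It assumes min-max normalization and a Euclidean combination of the per-attribute differences. Neither choice is prescribed by the slides, and the function names are illustrative.

```python
def normalize(values):
    """Rescale numeric values to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def distance(a, b):
    """Euclidean distance over mixed attributes.

    Numeric values are assumed to be normalized already; nominal values
    (strings) contribute 0 if equal and 1 otherwise.
    """
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            diff = 0.0 if x == y else 1.0
        else:
            diff = x - y
        total += diff * diff
    return total ** 0.5

# Temperatures of the three weather instances shown in the ARFF example.
temps = normalize([85, 80, 83])

d = distance(("sunny", temps[0]), ("overcast", temps[2]))
same = distance(("sunny", temps[0]), ("sunny", temps[0]))
```

Without normalization, an attribute with a large numeric range would dominate the nominal 0/1 differences.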

Integers in some given data file: nominal, ordinal, or ratio scale?

Nominal vs. ordinal

Attribute “age” nominal

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)

If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
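
The effect of treating “age” as ordinal can be checked with a short sketch: the single rule that tests whether age is at most pre-presbyopic gives the same answers as the two equality-based rules. The function names are illustrative, and the values "yes" and "reduced" used to enumerate the other attributes are assumptions, since they do not appear on the slide.

```python
AGE_ORDER = ["young", "pre-presbyopic", "presbyopic"]  # ordering from the slide

def soft_nominal(age, astigmatic, tear_rate):
    """Two rules, one per age value, using equality tests only."""
    if age == "young" and astigmatic == "no" and tear_rate == "normal":
        return True
    if age == "pre-presbyopic" and astigmatic == "no" and tear_rate == "normal":
        return True
    return False

def soft_ordinal(age, astigmatic, tear_rate):
    """One rule, using a less-than-or-equal test on the ordered values."""
    at_most = AGE_ORDER.index(age) <= AGE_ORDER.index("pre-presbyopic")
    return at_most and astigmatic == "no" and tear_rate == "normal"

# Compare both formulations on every combination of (assumed) attribute values.
agree = all(
    soft_nominal(a, s, t) == soft_ordinal(a, s, t)
    for a in AGE_ORDER
    for s in ("no", "yes")
    for t in ("normal", "reduced")
)
```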

Missing values

Frequently indicated by out-of-range entries

- Types: unknown, unrecorded, irrelevant
- Reasons:
    malfunctioning equipment
    changes in experimental design
    collation of different datasets
    measurement not possible

Missing value may have significance in itself (e.g. missing test in a medical examination)
- Most schemes assume that is not the case: “missing” may need to be coded as additional value
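
Coding “missing” as an additional value can be sketched as a simple recoding step for one nominal column. The sketch assumes that missing entries are marked with "?" (the ARFF convention). The rows, the label "missing", and the function name are hypothetical.

```python
def code_missing(rows, column, marker="?", label="missing"):
    """Return copies of the rows in which the missing marker in one nominal
    column is replaced by an explicit extra value."""
    return [
        [label if (i == column and v == marker) else v for i, v in enumerate(row)]
        for row in rows
    ]

# Hypothetical rows with columns (outlook, humidity).
rows = [["sunny", "high"], ["?", "normal"], ["rainy", "?"]]

# Only column 0 is recoded; the "?" in column 1 is left untouched.
coded = code_missing(rows, column=0)
```

The learning scheme then sees “missing” as just another nominal value of that attribute, so it can pick up on any significance the absence carries.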


Inaccurate values

Reason: data was not collected with mining in mind

Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)

Typographical errors in nominal attributes ⇒ values need to be checked for consistency

Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified

Errors may be deliberate (e.g. wrong zip codes)
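
Both checks can be sketched in a few lines. The consistency check compares nominal values against the declared set. The outlier check uses distance from the median measured in median absolute deviations, which is one common heuristic chosen here for illustration and not a method taken from the slides. The data values and the threshold `k` are hypothetical.

```python
from collections import Counter
from statistics import median

def unexpected_values(values, allowed):
    """Count nominal values that are not in the declared set (possible typos)."""
    return Counter(v for v in values if v not in allowed)

def flag_outliers(values, k=3.0):
    """Flag values more than k median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if mad > 0 and abs(v - med) > k * mad]

# Hypothetical data containing one typo and one implausible measurement.
outlook = ["sunny", "suny", "rainy", "overcast", "sunny"]
humidity = [85, 90, 86, 96, 80, 70, 650]

typos = unexpected_values(outlook, {"sunny", "overcast", "rainy"})
outliers = flag_outliers(humidity)
```

Flagged values still need human judgment: an outlier may be a measurement error or a genuine but rare case.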

Getting to know the data

Simple visualization tools are very useful

- Nominal attributes: histograms (Distribution consistent with background knowledge?)
- Numeric attributes: graphs (Any obvious outliers?)

2-D and 3-D plots show dependencies

Need to consult domain experts

Too much data to inspect? Take a sample!