Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 07/20/06
Data Mining
Practical Machine Learning Tools and Techniques Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 2 07/20/06
Input:
Concepts, instances, attributesTerminology What’s a concept?
zClassification, association, clustering, numeric prediction What’s in an example?
zRelations, flat files, recursion What’s in an attribute?
zNominal, ordinal, interval, ratio Preparing the input
zARFF, attributes, missing values, getting to know data
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 3 07/20/06
Terminology
Components of the input:
zConcepts: kinds of things that can be learned
Aim: intelligible and operational concept description zInstances: the individual, independent examples
of a concept
Note: more complicated forms of input are possible zAttributes: measuring aspects of an instance
We will focus on nominal and numeric ones
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 4 07/20/06
What’s a concept?
Styles of learning:
zClassification learning:
predicting a discrete class
zAssociation learning:
detecting associations between features
zClustering:
grouping similar instances into clusters
zNumeric prediction:
predicting a numeric quantity
Concept: thing to be learned
Concept description:
output of learning scheme
5 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Classification learning
Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised
zScheme is provided with actual outcome
Outcome is called the class of the example
Measure success on fresh data for which class labels are known (test data)
In practice success is often measured subjectively
6 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Association learning
Can be applied if no class is specified and any kind of structure is considered “interesting”
Difference to classification learning:
zCan predict any attribute’s value, not just the class, and more than one attribute’s value at a time
zHence: far more association rules than classification rules
zThus: constraints are necessary
Minimum coverage and minimum accuracy
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 7 07/20/06
Clustering
Finding groups of items that are similar
Clustering is unsupervised
zThe class of an example is not known
Success often measured subjectively
…
…
…
Iris virginica 1 .9
5 .1 2 .7 5 .8 10 2 10 1 52 51 2 1
Iris virginica 2 .5
6 .0 3 .3 6 .3
Iris versicolor 1 .5
4.5 3 .2 6 .4
Iris versicolor 1 .4
4.7 3 .2 7 .0
Iris setosa 0 .2 1 .4 3 .0 4 .9
Iris setosa 0 .2 1 .4 3 .5 5 .1
Type Petal width Petal length Sepal width Sepal length
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 8 07/20/06
Numeric prediction
Variant of classification learning where
“class” is numeric (also called “regression”)
Learning is supervised
zScheme is being provided with target value
Measure success on test data
…
…
…
…
…
40 False Normal Mild Rainy
55 False High Hot Overcast
0 True High Hot Sunny
5 False High Hot Sunny
Play- tim e Windy Humidity Temperature Outlook
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 9 07/20/06
What’s in an example?
Instance: specific type of example
Thing to be classified, associated, or clustered
Individual, independent example of target concept
Characterized by a predetermined set of attributes Input to learning scheme: set of
instances/dataset
Represented as a single relation/flat file Rather restricted form of input
No relationships between objects
Most common form in practical data mining
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 0 07/20/06
A family tree
=
St even M
Graham M
Pam F
Grace F
Ray
= M
Ian M
Pip pa F
Br ian
= M
Anna F
Nikk i F Peggy
F Peter
M
1 1 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Family tree represented as a table
Ian Pam Female Nikki
Ian Pam Female Anna
Ray Grace Male Brian
Ray Grace Female Pippa
Ray Grace Male Ian
Peggy Peter Female Pam
Peggy Peter Male Graham
Peggy Peter Male Steven
?
? Female Peggy
?
? Male Peter
parent2 Parent1 Gender Nam e
1 2 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
The “sister-of” relation
yes Anna Nikki
…
…
…
Yes Nikki Anna
…
…
…
Yes Pippa Ian
…
…
…
Yes Pam Steven
No Graham Steven
No Peter Steven
…
…
…
No Steven Peter
No Peggy Peter
Sister of?
Second person First person
No All the rest
Yes Anna Nikki
Yes Nikki Anna
Yes Pippa Brian
Yes Pippa Ian
Yes Pam Graham
Yes Pam Steven
Sister of?
Second person First person
Closed-world assumption
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 3 07/20/06
A full representation in one table
Ian Ian Ray Ray Peggy Peggy Parent 2
Fem ale Fem ale Fem ale Fem ale Fem ale Fem ale Gender
Pam Pam Grace Grace Pet er Pet er Parent 1 Name
Parent 2 Parent 1 Gender Name
Ian Ian Ray Ray Peggy Peggy
Pam Pam Grace Grace Pet er Pet er
Fem ale Fem ale Male Male Male Male
No All the rest
Yes Anna
Nik ki
Yes Nik ki
Anna
Yes Pippa
Brian
Yes Pippa
Ian
Yes Pam
Graham
Yes Pam
St even
Sister of ? Second person
First person
If second person’s gender = female
and first person’s parent = second person’s parent then sister-of = yes
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 4 07/20/06
Generating a flat file
Process of flattening called “denormalization”
zSeveral relations are joined together to make one
Possible with any finite set of finite relations
Problematic: relationships without pre-specified number of objects
zExample: concept of nuclear-family
Denormalization may produce spurious regularities that reflect structure of database
zExample: “supplier” predicts “supplier address”
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 5 07/20/06
The “ancestor-of” relation
Yes Other positive examples here
Yes Ian Pam Femal e Nikk i
?
? Femal e Grace
Ray Ian Ian Ian Peggy Peggy Parent 2
Male Femal e Femal e Femal e Femal e Male Gender
Grace Pam Pam Pam Pet er Pet er Parent 1 Nam e Parent
2 Parent
1 Gender Nam e
? Peggy
?
?
?
?
? Pet er
?
?
?
?
Femal e Femal e Male Male Male Male
No All the rest
Yes Ian
Grace
Yes Nikk i
Pam
Yes Nikk i
Pet er
Yes Anna
Pet er
Yes Pam
Pet er
Yes Steven
Pet er
Ancestor of?
Second person First person
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 6 07/20/06
Recursion
Appropriate techniques are known as
“inductive logic programming”
z(e.g. Quinlan’s FOIL)
zProblems: (a) noise and (b) computational complexity If person1 is a parent of person2
then person1 is an ancestor of person2
If person1 is a parent of person2 and person2 is an ancestor of person3 then person1 is an ancestor of person3
Infinite relations require recursion
1 7 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
What’s in an attribute?
Each instance is described by a fixed predefined set of features, its “attributes”
But: number of attributes may vary in practice
zPossible solution: “irrelevant value” flag
Related problem: existence of an attribute may depend of value of another one
Possible attribute types (“levels of measurement”):
zNominal, ordinal, interval and ratio
1 8 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Nominal quantities
Values are distinct symbols
zValues themselves serve only as labels or names
zNominal comes from the Latin word for name
Example: attribute “outlook” from weather data
zValues: “sunny”,”overcast”, and “rainy”
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 1 9 07/20/06
Ordinal quantities
Impose order on values
But: no distance between values defined
Example:
attribute “temperature” in weather data
zValues: “hot” > “mild” > “cool”
Note: addition and subtraction don’t make sense
Example rule:
temperature < hot ‰play = yes
Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 20 07/20/06
Interval quantities
Interval quantities are not only ordered but measured in fixed and equal units
Example 1: attribute “temperature”
expressed in degrees Fahrenheit
Example 2: attribute “year”
Difference of two values makes sense
Sum or product doesn’t make sense
zZero point is not defined!
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 2 1 07/20/06
Ratio quantities
Ratio quantities are ones for which the measurement scheme defines a zero point
Example: attribute “distance”
zDistance between an object and itself is zero
Ratio quantities are treated as real numbers
zAll mathematical operations are allowed
But: is there an “inherently” defined zero point?
zAnswer depends on scientific knowledge (e.g.
Fahrenheit knew no lower limit to temperature)
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 22 07/20/06
Attribute types used in practice
Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called
“categorical”, ”enumerated”, or “discrete”
zBut: “enumerated” and “discrete” imply order
Special case: dichotomy (“boolean” attribute)
Ordinal attributes are called “numeric”, or
“continuous”
zBut: “continuous” implies mathematical continuity
2 3 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Metadata
Information about the data that encodes background knowledge
Can be used to restrict search space
Examples:
zDimensional considerations
(i.e. expressions must be dimensionally correct)
zCircular orderings (e.g. degrees in compass)
zPartial orderings
(e.g. generalization/specialization relations)
2 4 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Preparing the input
Denormalization is not the only issue
Problem: different data sources (e.g. sales department, customer billing department, …)
zDifferences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
zData must be assembled, integrated, cleaned up
z“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Critical: type and level of data aggregation
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 2 5 07/20/06
The ARFF format
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes ...
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 26 07/20/06
Additional attribute types
ARFF supports string attributes:
zSimilar to nominal attributes but list of values is not pre-specified
It also supports date attributes:
zUses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss
@attribute description string
@attribute today date
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 2 7 07/20/06
Sparse data
In some applications most attribute values in a dataset are zero
zE.g.: word counts in a text categorization problem
ARFF supports sparse data
This also works for nominal attributes (where the first value corresponds to “zero”)
0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A”
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”
{1 26, 6 63, 10 “class A”}
{3 42, 10 “class B”}
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 2 8 07/20/06
Attribute types
Interpretation of attribute types in ARFF depends on learning scheme
zNumeric attributes are interpreted as
ordinal scales if less-than and greater-than are used
ratio scales if distance calculations are performed (normalization/standardization may be required) zInstance-based schemes define distance between
nominal values (0 if values are equal, 1 otherwise)
Integers in some given data file: nominal, ordinal, or ratio scale?
2 9 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Nominal vs. ordinal
Attribute “age” nominal
Attribute “age” ordinal
(e.g. “young” < “pre-presbyopic” < “presbyopic”)
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age ) pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
3 0 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 07/20/06
Missing values
Frequently indicated by out-of-range entries
zTypes: unknown, unrecorded, irrelevant
zReasons:
malfunctioning equipment
changes in experimental design
collation of different datasets
measurement not possible
Missing value may have significance in itself (e.g.
missing test in a medical examination)
zMost schemes assume that is not the case: “missing”
may need to be coded as additional value
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 3 1 07/20/06
Inaccurate values
Reason: data has not been collected for mining it
Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)
Typographical errors in nominal attributes ‰ values need to be checked for consistency
Typographical and measurement errors in numeric attributes ‰ outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 3 2 07/20/06
Getting to know the data
Simple visualization tools are very useful
zNominal attributes: histograms (Distribution consistent with background knowledge?)
zNumeric attributes: graphs (Any obvious outliers?)
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!