• Section 7.1 Introduction to Clustering Techniques
Rajaraman, A.; Leskovec, J.; Ullman, J.D. Mining of Massive Datasets. Cambridge University Press (2019). ISBN: 9781107015357
• Available at http://www.mmds.org
• QUICK-R – Cluster Analysis
• R has a variety of functions for cluster analysis
• This tutorial describes hierarchical agglomerative, partitioning, and model-based approaches
• Available at https://www.statmethods.net/advstats/cluster.html
• Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
• 2.3 Clustering - A large variety of clustering algorithms
• Available at https://scikit-learn.org/stable/modules/clustering.html
• Introduction
• Motivation
• Examples
• Distance measures
• Major approaches to clustering
• Hierarchical and point-assignment
• The “curse of dimensionality”
• Makes clustering in high-dimensional spaces difficult
• Enables some simplifications if used correctly in a clustering algorithm
• The Problem of Clustering
• Given a set of points, with a notion of distance between points, group the points into clusters, so that
• Members of a cluster are close/similar to each other
• Members of different clusters are dissimilar
• Observations:
• Usually points are in a high-dimensional space
• Each point has a large number of attributes, used for clustering
• Similarity is defined using a distance measure
• Euclidean, Cosine, Jaccard, edit distance, …
[Figure: scatter plot of points forming several clusters, with one isolated point labeled as an outlier]
• Facts:
• Clustering in two dimensions is easy
• Clustering of small amounts of data is easy
• Many applications involve large datasets
• Many applications model points with a large number of dimensions (~10,000)
• High-dimensional spaces look different:
• Almost all pairs of points are at about the same distance
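The claim that almost all pairs of points end up at about the same distance can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly at random from the unit cube (the slides do not specify a distribution) and using 1,000 dimensions instead of ~10,000 to keep the run short; the function name `distance_spread` is hypothetical.

```python
import math
import random

def distance_spread(n_points, dim, seed=0):
    """Return (min/mean, max/mean) over all pairwise Euclidean distances."""
    rng = random.Random(seed)
    # Assumption: points are uniform in the unit cube [0, 1]^dim.
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)   # visit each pair once
    ]
    mean = sum(dists) / len(dists)
    return min(dists) / mean, max(dists) / mean

for dim in (2, 1000):
    lo, hi = distance_spread(30, dim)
    print(f"dim={dim}: min/mean={lo:.2f}, max/mean={hi:.2f}")
```

In two dimensions the smallest and largest distances differ widely relative to the mean, while in 1,000 dimensions both ratios should sit close to 1, which is why "nearest" and "farthest" lose much of their meaning.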
• 1854 cholera outbreak in London
• By plotting the homes of cholera victims on a map, John Snow was able to identify a water pump in Broad Street as the source of the local cases of cholera
• Snow's map essentially proved that cholera was spread by contaminated water and disproved the prevailing miasma theory, which held that diseases like cholera were transmitted by bad air
• He showed that homes supplied by the Southwark and Vauxhall Waterworks Company, which was taking water from sewage-polluted sections of the Thames, had a cholera rate fourteen times that of homes supplied by the Lambeth Waterworks Company, which obtained water from the cleaner, upriver Seething Wells
https://en.wikipedia.org/wiki/John_Snow
• SDSS – Sloan Digital Sky Survey
• A catalog of 2 billion “sky objects” that represents objects by their radiation in 7 dimensions (frequency bands)
• Problem
• Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
• Intuition:
• Music is divided into categories, and customers prefer a few categories
• But what are categories?
• Represent a CD by a set of customers who bought it:
• Similar CDs have similar sets of customers, and vice versa
• Space of all CDs:
• CD space = one dimension for each customer
• Values in a dimension may be 0 or 1 only
• A CD is a point in this space (x1, x2,…, xk), where xi = 1 iff the ith customer bought the CD
• Example: for Amazon, the number of dimensions is in the tens of millions
• Task:
• Find clusters of similar CDs
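The CD representation above can be sketched in a few lines. This is only an illustration under stated assumptions: the purchase data, the customer numbering, and the function names `to_vector` and `jaccard_similarity_01` are hypothetical, and Jaccard similarity is used here as one plausible way to compare 0/1 vectors.

```python
# Hypothetical data: CD name -> set of customer ids (1-based) who bought it.
purchases = {
    "jazz_a": {1, 2, 3, 4},
    "jazz_b": {2, 3, 4, 5},
    "metal_a": {6, 7, 8},
}

def to_vector(customers, n_customers):
    """0/1 vector with x_i = 1 iff the ith customer bought the CD."""
    return [1 if i in customers else 0 for i in range(1, n_customers + 1)]

def jaccard_similarity_01(x, y):
    """Jaccard similarity of two 0/1 vectors of equal length."""
    both = sum(1 for a, b in zip(x, y) if a and b)     # bought both CDs
    either = sum(1 for a, b in zip(x, y) if a or b)    # bought at least one
    return both / either if either else 1.0

vectors = {cd: to_vector(c, 8) for cd, c in purchases.items()}
print(vectors["jazz_a"])
print(jaccard_similarity_01(vectors["jazz_a"], vectors["jazz_b"]))
print(jaccard_similarity_01(vectors["jazz_a"], vectors["metal_a"]))
```

With tens of millions of customers the dense vector is mostly zeros, so in practice storing only the set of buyers per CD is likely the more economical choice.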
• Finding topics:
• Represent a document by a vector (x1, x2,…, xk), where xi = 1 iff the ith word (in some order) appears in the document
• It actually does not matter if k is infinite; i.e., do not limit the set of words
• Documents with similar sets of words may be about the same topic
• Set similarity
• Jaccard distance
[Figure: two overlapping sets C1 and C2 with 3 elements in the intersection and 7 in the union; Jaccard similarity = 3/7, containment similarity = 3/5]
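The counts shown above (3 in the intersection, 7 in the union) can be reproduced with Python sets. This is a sketch under assumptions: the element names are invented, the set sizes are chosen only to match those counts, and containment is taken to mean |C1 ∩ C2| / |C1|, which is consistent with the 3/5 value if C1 has 5 elements.

```python
from fractions import Fraction

def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| as an exact fraction."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(1)

def containment(a, b):
    """Fraction of a's elements that also appear in b (a must be non-empty)."""
    return Fraction(len(a & b), len(a))

# Hypothetical sets sized to match the example: |C1| = |C2| = 5.
c1 = {"a", "b", "c", "d", "e"}
c2 = {"c", "d", "e", "f", "g"}

print(jaccard_similarity(c1, c2))       # 3/7
print(containment(c1, c2))              # 3/5
print(1 - jaccard_similarity(c1, c2))   # Jaccard distance: 4/7
```

The Jaccard distance used for clustering is one minus the Jaccard similarity, so identical sets are at distance 0 and disjoint sets at distance 1.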
• Similarity of vectors
• cosine distance
• Similarity of sets of points:
• Euclidean distance
• The Euclidean distance is the prototypical distance in a metric space: it is non-negative (zero only between identical points), symmetric, and obeys the triangle inequality
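The two vector-based measures can be sketched as follows. One assumption to flag: cosine distance is computed here as the angle between the vectors, which is the definition used in Mining of Massive Datasets; many libraries instead report 1 minus the cosine. The function names are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """Angle between x and y in radians; both must be non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(cosine_distance([1, 0], [0, 1]))      # a right angle, pi/2 radians

# Spot check of the triangle inequality on three sample points.
p, q, r = [0, 0], [3, 4], [6, 0]
print(euclidean_distance(p, r) <= euclidean_distance(p, q) + euclidean_distance(q, r))
```

Cosine distance ignores vector length, so [1, 2] and [2, 4] are at angle 0 even though their Euclidean distance is positive.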
• Hierarchical or agglomerative algorithms
1. Start with each point in its own cluster
2. Clusters are combined based on their “closeness,” using one of many possible definitions of “close”
3. Combination stops when further combination leads to clusters that are undesirable for one of several reasons
• Stop when we have a predetermined number of clusters, or
• Use a measure of compactness for clusters, and refuse to construct a cluster by combining two smaller clusters if the resulting cluster has points that are spread out over too large a region
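The three steps above can be sketched as a naive agglomerative procedure. This is an illustration rather than a definitive implementation: it assumes a Euclidean space, takes "closeness" to be the distance between cluster centroids (one of the many possible definitions), and stops at a predetermined number of clusters `k`. The names `centroid` and `agglomerate` are hypothetical.

```python
import math

def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))

def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[tuple(p)] for p in points]        # step 1: one point per cluster
    while len(clusters) > k:                       # step 3: stop at k clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]    # step 2: combine closest pair
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for cluster in agglomerate(points, 2):
    print(sorted(cluster))
```

Every merge scans all pairs of clusters, so this version is only practical for small datasets; replacing the `k` test with a compactness test would give the second stopping rule.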
[Figure: scatter plot illustrating two stopping rules: stop when reaching 2 clusters, or stop when the clusters are too far apart]
• Hierarchical or agglomerative algorithms
• The other class of algorithms
• Involve point assignment
• Points are considered in some order, and each one is assigned to the cluster into which it best fits
• This process is normally preceded by a short phase in which initial clusters are estimated
• Variations
• Allow occasional combining or splitting of clusters, or
• Allow points to remain unassigned, if they are outliers (points too far from any of the current clusters)
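A single point-assignment pass, including the outlier variation, might look like the sketch below. The assumptions are significant: the space is Euclidean, the initial clusters are summarized by given centroids that stay fixed during the pass, "best fits" means nearest centroid, and `max_dist` is a hypothetical threshold beyond which a point is left unassigned.

```python
import math

def assign_points(points, centroids, max_dist=math.inf):
    """Assign each point to its nearest centroid, or mark it as an outlier."""
    clusters = [[] for _ in centroids]
    outliers = []
    for p in points:                                  # points taken in given order
        dists = [math.dist(p, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        if dists[best] > max_dist:
            outliers.append(p)                        # too far from every cluster
        else:
            clusters[best].append(p)
    return clusters, outliers

points = [(0, 0), (1, 0), (10, 10), (11, 10), (50, 50)]
clusters, outliers = assign_points(points, [(0, 0), (10, 10)], max_dist=5)
print(clusters)
print(outliers)
```

Algorithms in the k-means family repeat a pass like this, recomputing the centroids from the assigned points between passes.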
• Algorithms for clustering can also be distinguished by:
• Whether the algorithm assumes a Euclidean space, or whether the algorithm works for an arbitrary distance measure
• A Euclidean space allows summarizing a collection of points by their centroid (the average of the points)
• For a non-Euclidean space, there is no notion of a centroid; another way to summarize clusters is required
[Figure: scatter plot of points grouped into clusters]
• A clustroid is a representative point of a cluster: the member that is closest to the other points of the cluster (for example, the one minimizing the sum of distances to them)
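Because a clustroid must be an actual member of the cluster, it can be found with any distance function. The sketch below assumes "closest to all other points" means the smallest sum of distances (other criteria, such as the smallest maximum distance, are also possible) and uses Jaccard distance on sets as an example of a non-Euclidean measure; the data is invented.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def clustroid(cluster, distance):
    """Member of the cluster with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(distance(p, q) for q in cluster))

# Hypothetical cluster of sets; no average "set" exists, so no centroid.
cluster = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
print(clustroid(cluster, jaccard_distance))
```

Here the middle-sized set is selected, since its total distance to the other two members is the smallest.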
• Algorithms for clustering can also be distinguished by:
• Whether the algorithm assumes that the data is small enough to fit in main memory, or whether data must reside in secondary memory
• Algorithms for large amounts of data often must take shortcuts, since it is infeasible to look at all pairs of points
• It is also necessary to summarize clusters in main memory, since we cannot hold all the points of all the clusters in main memory at the same time
• “Curse of dimensionality”
• Unintuitive properties of high-dimensional spaces
• Examples: In high dimensions...
• Almost all pairs of points are equally far away from one another
• Almost any two vectors are almost orthogonal
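Near-orthogonality can be illustrated by measuring the cosine of the angle between random vectors. This is a rough sketch, assuming vectors with independent standard normal components (the slides do not fix a distribution); `mean_abs_cosine` is a hypothetical helper name.

```python
import math
import random

def mean_abs_cosine(n_vectors, dim, seed=0):
    """Average |cos(angle)| over all pairs of random vectors."""
    rng = random.Random(seed)
    # Assumption: each component is an independent standard normal draw.
    vecs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_vectors)]
    norms = [math.sqrt(sum(a * a for a in v)) for v in vecs]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            total += abs(dot / (norms[i] * norms[j]))
            pairs += 1
    return total / pairs

for dim in (2, 1000):
    print(f"dim={dim}: mean |cos| = {mean_abs_cosine(30, dim):.3f}")
```

A cosine near 0 means an angle near 90 degrees, so the average should drop sharply as the dimension grows.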
• 7.2 Hierarchical Clustering
• Hierarchical clustering in a Euclidean space
• The algorithm can only be used for relatively small datasets
• Hierarchical clustering in non-Euclidean space
• Representing clusters by their “clustroids”
• Representing clusters when there is no centroid or average point
• Chap. 2 High-Dimensional Space
Blum, A., Hopcroft, J., Kannan, R. Foundations of Data Science.
(Version of 04/01/2018)
Available at https://www.cs.cornell.edu/jeh/book.pdf