• Section 7.1 Introduction to Clustering Techniques Rajaraman, A.; Leskovec, J.; Ullman, J.D. Mining of Massive Datasets. Cambridge University Press (2019). ISBN: 9781107015357


Academic year: 2022


• Section 7.1 Introduction to Clustering Techniques

Rajaraman, A.; Leskovec, J.; Ullman, J.D. Mining of Massive Datasets. Cambridge University Press (2019). ISBN: 9781107015357

Available at http://www.mmds.org

• QUICK-R – Cluster Analysis

R has a variety of functions for cluster analysis

This tutorial describes hierarchical agglomerative, partitioning, and model-based approaches

Available at https://www.statmethods.net/advstats/cluster.html

• Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

2.3 Clustering - A large variety of clustering algorithms

Available at https://scikit-learn.org/stable/modules/clustering.html


(3)

• Introduction

• Motivation

• Examples

• Distance measures

• Major approaches to clustering

• Hierarchical and point-assignment

• The “curse of dimensionality”

• Makes clustering in high-dimensional spaces difficult

• Enables some simplifications if used correctly in a clustering algorithm

(4)

• The Problem of Clustering

• Given a set of points, with a notion of distance between points, group the points into clusters, so that

Members of a cluster are close/similar to each other

Members of different clusters are dissimilar

• Obs.

• Usually points are in a high-dimensional space

Each point has a large number of attributes, used for clustering

• Similarity is defined using a distance measure

Euclidean, Cosine, Jaccard, edit distance, …
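Of the measures listed above, edit distance is the only one not illustrated later in these slides. The sketch below assumes the common Levenshtein variant (insertions, deletions, and substitutions each cost 1); other definitions allow only insertions and deletions. The function name `edit_distance` is illustrative, not taken from the source.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```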


(5)

[Figure: scatter plot of points grouped into several clusters, with one isolated point labeled as an outlier]

(6)


(7)

• Facts:

• Clustering in two dimensions is easy

• Clustering of small amounts of data is easy

• Many applications involve large datasets

• Many applications model points with a large number of dimensions (~10,000)

• High-dimensional spaces look different:

Almost all pairs of points are at about the same distance

(8)


(9)

1854 cholera outbreak in London

By plotting the homes of cholera victims on a map, John Snow was able to identify a water pump in Broad Street as the cause of all the local cases of cholera

Snow's map essentially proved that cholera was spread by contaminated water and disproved the prevailing miasma theory, which held that diseases like cholera were transmitted by bad air

He showed that homes supplied by the Southwark and Vauxhall Waterworks Company, which was taking water from sewage-polluted sections of the Thames, had a cholera rate fourteen times that of those supplied by Lambeth Waterworks Company, which obtained water from the upriver, cleaner Seething Wells

https://en.wikipedia.org/wiki/John_Snow

(10)

• SDSS – Sloan Digital Sky Survey

A catalog of 2 billion “sky objects”

represents objects by their radiation in 7 dimensions (frequency bands)

• Problem

Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.


(11)

• Intuition:

• Music is divided into categories, and customers prefer a few categories

• But what are categories?

• Represent a CD by a set of customers who bought it:

• Similar CDs have similar sets of customers, and vice-versa

(12)

• Space of all CDs:

• CD space = one dimension for each customer

Values in a dimension may be 0 or 1 only

A CD is a point in this space (x_1, x_2, …, x_k), where x_i = 1 iff the i-th customer bought the CD

• Example: the dimension is tens of millions for Amazon

• Task:

• Find clusters of similar CDs


(13)

• Finding topics:

• Represent a document by a vector (x_1, x_2, …, x_k), where x_i = 1 iff the i-th word (in some order) appears in the document

It actually does not matter if k is infinite; i.e., do not limit the set of words

• Documents with similar sets of words may be about the same topic
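A minimal sketch of this representation, assuming a simple lowercase tokenizer (the helper names `word_set` and `to_binary_vector` and the sample text are illustrative only). Storing just the set of words present is equivalent to the 0/1 vector and needs no fixed k, which is why the vocabulary can stay unbounded.

```python
import re


def word_set(text):
    """Return the set of distinct lowercase words in `text`."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def to_binary_vector(words, vocabulary):
    """Expand a word set into (x_1, ..., x_k) for a fixed, ordered vocabulary."""
    return [1 if w in words else 0 for w in vocabulary]


doc = word_set("Clustering groups similar points; clustering needs a distance.")
vocabulary = ["clustering", "distance", "galaxy", "points"]  # hypothetical word order
print(to_binary_vector(doc, vocabulary))  # prints [1, 1, 0, 1]
```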

(14)

• Set similarity

Jaccard distance

[Figure: Venn diagram of two sets, C1 and C2, with 3 elements in the intersection and 7 in the union]

Jaccard similarity = 3/7

Containment similarity = 3/5
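The numbers on this slide can be reproduced with a short sketch. The concrete elements of C1 and C2 below are made up so that the intersection has 3 elements and the union has 7. Jaccard distance is taken as 1 minus Jaccard similarity, and containment is assumed to mean the fraction of C1 that also lies in C2.

```python
def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


def containment(a, b):
    """Fraction of the elements of `a` that also appear in `b`."""
    if not a:
        raise ValueError("containment is undefined for an empty first set")
    return len(a & b) / len(a)


c1 = {1, 2, 3, 4, 5}   # hypothetical elements
c2 = {3, 4, 5, 6, 7}   # shares 3, 4, 5 with c1
print(jaccard_similarity(c1, c2))  # 3/7
print(jaccard_distance(c1, c2))    # 4/7
print(containment(c1, c2))         # 3/5
```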

(15)

• Similarity of vectors

cosine distance
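As a sketch, cosine distance can be computed as the angle between two vectors, which is the definition assumed here. Some libraries instead report 1 minus the cosine; the slide does not say which is intended. The function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance_degrees(u, v):
    """Angle between u and v in degrees (0 = same direction, 90 = orthogonal)."""
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))


print(cosine_distance_degrees([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_distance_degrees([1, 2], [2, 4]))  # same direction, different length
```

Because only the direction matters, scaling a vector does not change its cosine distance to any other vector.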

(16)

• Similarity of sets of points:

Euclidean distance

The Euclidean distance is the prototypical example of the distance in a metric space: symmetric, positive, obeys the triangle inequality
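The three properties named above can be spot-checked numerically. This is only a sanity check on a few hand-picked sample points, not a proof; the helper name `euclidean` and the points are illustrative.

```python
import math
from itertools import permutations


def euclidean(p, q):
    """Straight-line (L2) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


points = [(0, 0), (3, 4), (6, 8), (1, -2)]  # hypothetical sample points
for x, y, z in permutations(points, 3):
    assert euclidean(x, y) == euclidean(y, x)       # symmetric
    assert euclidean(x, y) > 0                      # positive for distinct points
    # Triangle inequality, with a small tolerance for rounding.
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12

print(euclidean((0, 0), (3, 4)))  # prints 5.0
```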


(17)

• Hierarchical or agglomerative algorithms

1. Start with each point in its own cluster

2. Clusters are combined based on their “closeness,”

using one of many possible definitions of “close”

[Figure: scatter plot of the points to be clustered]

(18)

• Hierarchical or agglomerative algorithms

1. Start with each point in its own cluster

2. Clusters are combined based on their “closeness,”

using one of many possible definitions of “close”

3. Combination stops when further combination leads to clusters that are undesirable for one of several reasons

Stop when we have a predetermined number of clusters, or

Use a measure of compactness for clusters, and refuse to construct a cluster by combining two smaller clusters if the resulting cluster has points that are spread out over too large a region
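The three steps can be sketched as a naive agglomerative loop. This version assumes a Euclidean space, defines "closeness" as the distance between cluster centroids, and uses a centroid-distance threshold (`max_merge_dist`, a hypothetical parameter) as a simple stand-in for a compactness test. It recomputes all pairwise distances on every round, so it suits only small inputs.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def agglomerate(points, k=1, max_merge_dist=math.inf):
    """Merge the two closest clusters until k remain or the closest
    pair is farther apart than max_merge_dist."""
    clusters = [[tuple(p)] for p in points]      # step 1: one cluster per point
    while len(clusters) > k:                     # step 3a: target cluster count
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_merge_dist:                   # step 3b: clusters too far apart
            break
        clusters[i] = clusters[i] + clusters[j]  # step 2: combine the closest pair
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agglomerate(points, k=2))
```

Either stopping rule can be used alone: pass only `k` to stop at a fixed number of clusters, or leave `k=1` and pass `max_merge_dist` to stop when the remaining clusters are too far apart.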

[Figure: the points after merging, annotated with two stopping rules: "Stop when reaching 2 clusters" or "Stop when clusters are too far apart"]

(19)

• Hierarchical or agglomerative algorithms

(20)

• The other class of algorithms

• Involve point assignment

Points are considered in some order, and each one is assigned to the cluster into which it best fits

This process is normally preceded by a short phase in which initial clusters are estimated

Variations

Allow occasional combining or splitting of clusters, or

Allow points to remain unassigned, if they are outliers (points too far from any of the current clusters)
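A minimal sketch of point assignment, assuming a Euclidean space, initial clusters estimated by a few seed locations, and "best fit" meaning the nearest centroid. The names `assign_points`, `seeds`, and `outlier_dist` are illustrative. Points farther than `outlier_dist` from every current centroid are left unassigned, as in the outlier variation above; combining or splitting clusters is not shown.

```python
import math


def assign_points(points, seeds, outlier_dist=math.inf):
    """Assign each point, in order, to the cluster with the nearest centroid."""
    centroids = [tuple(s) for s in seeds]  # initial estimates of the clusters
    members = [[] for _ in seeds]
    unassigned = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        if dists[best] > outlier_dist:
            unassigned.append(p)           # too far from every current cluster
            continue
        members[best].append(p)
        # Once a cluster has members, its centroid is their average
        # (the seed itself is no longer counted).
        n = len(members[best])
        centroids[best] = tuple(
            sum(q[d] for q in members[best]) / n for d in range(len(p))
        )
    return members, unassigned


points = [(0, 0), (1, 0), (10, 10), (11, 10), (50, 50)]
members, outliers = assign_points(points, seeds=[(0, 0), (10, 10)], outlier_dist=5)
print(members)
print(outliers)
```

Because points are handled one at a time, the result can depend on the order in which they arrive.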


(21)

• Algorithms for clustering can also be distinguished by:

Whether the algorithm assumes a Euclidean space, or

whether the algorithm works for an arbitrary distance measure

A Euclidean space allows summarizing a collection of points by their centroid – the average of the points

For a non-Euclidean space, there is no notion of a centroid;

another way to summarize clusters is required

[Figure: scatter plot of clustered points]

A clustroid is a representative point of a cluster: the cluster member that is closest to all the other points
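The contrast between the two summaries can be sketched as follows. "Closest to all other points" is interpreted here as the smallest sum of distances to the other members; this is one of several reasonable criteria, and the slide does not fix one. The example uses Jaccard distance on sets, where averaging points is not meaningful; the sets themselves are made up.

```python
def centroid(points):
    """Average of the points; meaningful only in a Euclidean space."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def clustroid(points, dist):
    """Cluster member with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)


print(centroid([(0, 0), (2, 0), (1, 3)]))  # prints (1.0, 1.0)

cluster = [frozenset({1, 2, 3}), frozenset({1, 2, 3, 4}),
           frozenset({1, 2, 5}), frozenset({2, 3, 4})]
print(clustroid(cluster, jaccard_distance))  # the set {1, 2, 3, 4}
```

Unlike a centroid, a clustroid is always one of the original points, so it works with any distance measure.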

(22)

• Algorithms for clustering can also be distinguished by:

Whether the algorithm assumes that the data is small enough to fit in main memory, or whether data must reside in secondary memory

Algorithms for large amounts of data often must take shortcuts, since it is infeasible to look at all pairs of points

It is also necessary to summarize clusters in main memory, since we cannot hold all the points of all the clusters in main memory at the same time
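One possible way to summarize a cluster in main memory, assuming a Euclidean space, is to keep only a point count and a per-dimension sum, from which the centroid can be recovered at any time. The class `ClusterSummary` and the function `summarize` are hypothetical names for this sketch, not something taken from the source.

```python
class ClusterSummary:
    """Fixed-size summary of a cluster: point count and per-dimension sum."""

    def __init__(self, dim):
        self.n = 0
        self.total = [0.0] * dim

    def add(self, point):
        """Fold one point into the summary; the point itself is not stored."""
        self.n += 1
        for d, value in enumerate(point):
            self.total[d] += value

    def centroid(self):
        if self.n == 0:
            raise ValueError("an empty cluster has no centroid")
        return tuple(s / self.n for s in self.total)


def summarize(stream, dim):
    """Consume an iterable of points (e.g. read from disk) one at a time."""
    summary = ClusterSummary(dim)
    for point in stream:
        summary.add(point)
    return summary


s = summarize(iter([(1, 2), (3, 4)]), dim=2)
print(s.n, s.centroid())  # prints 2 (2.0, 3.0)
```

The memory used depends only on the number of dimensions, not on how many points the cluster contains.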

Example: the SDSS catalog of 2 billion “sky objects”

(23)

• “Curse of dimensionality”

• Unintuitive properties of high-dimensional spaces

• Examples: In high dimensions...

• Almost all pairs of points are equally far away from one another

• Almost any two vectors are almost orthogonal
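Both effects can be observed in a small experiment. This sketch assumes points drawn uniformly at random from the cube [-1, 1]^d; the sample size, dimensions, and seed are arbitrary choices, and the helper names are illustrative.

```python
import math
import random
from itertools import combinations


def random_points(n, dim, rng):
    """n points with coordinates drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]


def distance_spread(points):
    """Smallest pairwise distance divided by the largest (1.0 = all equal)."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return min(dists) / max(dists)


def mean_abs_cosine(points):
    """Average |cos(angle)| over all pairs (0.0 = all pairs orthogonal)."""
    values = []
    for p, q in combinations(points, 2):
        dot = sum(a * b for a, b in zip(p, q))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_q = math.sqrt(sum(b * b for b in q))
        values.append(abs(dot) / (norm_p * norm_q))
    return sum(values) / len(values)


rng = random.Random(0)
for dim in (2, 1000):
    pts = random_points(30, dim, rng)
    print(dim, distance_spread(pts), mean_abs_cosine(pts))
```

In 2 dimensions the spread ratio is small and the cosines vary widely. In 1,000 dimensions the ratio moves close to 1 and the average cosine moves close to 0.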

(24)

• 7.2 Hierarchical Clustering

• Hierarchical clustering in a Euclidean space

This algorithm can only be used for relatively small datasets

• Hierarchical clustering in non-Euclidean space

Representing clusters by their “clustroids”

Representing clusters when there is no centroid or average point


(25)

• Chap. 2 High-Dimensional Space

Blum, A., Hopcroft, J., Kannan, R. Foundations of Data Science.

(Version of 04/01/2018)

Available at https://www.cs.cornell.edu/jeh/book.pdf
