S patial “K” luster A nalysis T hrough E dge R emoval SKATER

(1)

SKATER

Spatial “K”luster Analysis Through Edge Removal

Extensions, possible applications and computational implementation

Paulo Justiniano Ribeiro Jr

LEG:Laborat ório de Estat´ıstica e Geoinformaç ão / UFPR In collaboration with:

Elias Teixeira Krainski (LEG/UFPR) Renato Martins Assunc¸ ˜ao (LESTE/UFMG)

http://www.leg.ufpr.br e-mail:paulojus@ufpr.br

CHICAS Lancaster, UK

(2)

IBC-2010/Floripa e 55 RBRAS

International Biometrics Conference &

55âReuni ão da Regi ão Brasileira da Sociedade Internacional de Biometria

1 Organization: IBS, Rbras, RArg

2 05 a 10 december de 2010, Florian ´opolis, SC, Brasil

3 satelite events (opened to proposals)

4 Free/excursion day (and RBras/RArg meeting) on wednesday, 07/12

5 http://www.ibc-floripa-2010.org/andhttp://www.tibs.org

(3)

Outline

Motivatition Basics of SKATER

Computational implementation Extensions and fundamentals Some applications

Final remarks

(4)

Motivating Examples I: Living condition indexes in BH

Saude Educacao

Crianca Economico

15 35 55 75 95

(5)

Clustering and regionalization

Clustering:

nindividuals with their “atributes”−→k“homogeneous” groups group composition (possibly with minimal number)

how many groups?

Regionalization:

classification procedure

applied to spatial objects (with an areal representation) groups them into homogeneous contiguous regions

Why?

detecting heterogeneous sub-regions (“step”) exploratory

administrative/management purposes . . .

reduced number of possible groups

(6)

Approaches

1 non-spatial clustering + neighbouring preserving classification 2 steps

possibly too many groups

2 clustering including (weighted) spatial covariates SAGE (spatial analysis GIS environment)

objective function involveshomogeneity,compactnessandequality

3 explicit use of neighbouring structure in optimization implicit constraints

AZP (automatic zonning procedure) computationally expensive

(7)

The SKATER approach

algorithm of type (3)

graph to introduce/represent neighbourhood vertices (units) + edges

“cost” of edges: dissimilarities

heuristics for prunning

minimal 1 group graph (MST: minimal spanning tree) regionalization−→optimal graph partitioning

(8)

Skater in action I

1

2

3 4 5 6 7

8 9 10

11 12 13

15 14 16

17 18 19 20

21 22 23

24 25 26

27 28

30 29

31 32 33

34 35

36

37

38

39 40

41 42

43

44 45

46

47 48

49 50 51

52

53 54

55 56

57

58 59

60 61

62 63 64 65

66

68 67 69

71 70 72

73 74

75

76

77

78 79

80

81 82

83

84 85

86 87 88

89 90

91 92

93 94

95 96

97

98

(9)

Skater in action II

1

2

3 4 5 6 7

8 9 10

11 12 13

15 14 16

17 18 19 20

21 22 23

24 25 26

27 28

30 29

31 32 33

34 35

36

37

38

39 40

41 42

43

44 45

46

47 48

49 50 51

52

53 54

55 56

57

58 59

60 61

62 63 64 65

66

68 67 69

71 70 72

73 74

75

76

77

78 79

80

81 82

83

84 85

86 87 88

89 90

91 92

93 94

95 96

97

98

1

2

3 4 5 6 7

8 9 10

11 12 13

15 14 16

17 18 19 20

21 22 23

24 25 26

27 28

30 29

31 32 33

34 35

36

37

38

39 40

41 42

43

44 45

46

47 48

49 50 51

52

53 54

55 56

57

58 59

60 61

62 63 64 65

66

68 67 69

71 70 72

73 74

75

76

77

78 79

80

81 82

83

84 85

86 87 88

89 90

91 92

93 94

95 96

97

98

(10)

An “R”-ish template for the analysis

1 spatial packages: sp, spdep, . . .

2 Read attributes and the map read.table()andreadShapePoly()

3 standardize data (scale())

4 neighbourhood list (poly2nb())

5 costs for edges (dissimilarities) (nbcosts())

6 weighted neighbourhood structure (nb2listw())

7 minimum spanning tree (mstree())

8 partition (skater())

(11)

Yet another example

(12)

Extending SKATER

costs of edges

euclidean, mahalanobis, . . .

density based measurements (non-Gaussian data) measures of homogeneity

distance to the mean(s)

likelihood based measures, criteria for number of groups applications

areal data

point processes, geostatistical data, time series, independent observations

(13)

Extending SKATER

costs of edges

density based measurements (non-Gaussian data) measures of homogeneity

distance to the mean(s)

areal data

(14)

Extending SKATER

costs of edges

density based measurements (non-Gaussian data)

measures of homogeneity distance to the mean(s)

areal data

(15)

Extending SKATER

costs of edges

density based measurements (non-Gaussian data)

measures of homogeneity distance to the mean(s)

likelihood based measures, criteria for number of groups

applications areal data

(16)

Minimum spanning tree

connected graph(nodesv, edges e, attributes y) path between any pair(vi,vj) tree(no circuit)

spanning treen nodes V= (v1, . . . ,vn)andn−1 edgesE= (e1, . . . ,en−1) costs for each edge (dissimilarities)d(vi,vj) =d(ek) =P

l(yil−yjl)² minimum spanning tree (MST): set{e1, . . .en−1}minimizingPn−1

k=1d(ek) removal of an edge results in two graphs (also MSTs)

Prim (1957) algorithm: start from an empty set and include a first node inV compute dissimilarities between nodesVinandVout

select the node fromVoutwith minimum dissimilarity iterate until all included

unique under certain conditions

(17)

Minimum spanning tree

Y with p.d.f p(y/θ, φ), location (θ) andφ(scale) plθ,φ(θ, φ/X,y) =_max ^L^(θ,φ/^X^,^y⁾

θ′,φ′[L(θ^′,φ^′/X,y)]

similarly fory : ply(y/X, θ, φ) =_max_{y′ ∈ℜ}^p⁽^y^/_[^X_p^,θ,φ)₍_y_/_X_,θ,φ]

distance measures

common scale:dij=log{[p(yi/Xi, θ, φ)p(yj/Xj, θ, φ)]⁻¹} more general:dij=log{[ply(yi/Xi, θ, φ)ply(yj/Xj, θ, φ)]⁻¹}

(18)

Gaussian case

yi∼N(µi, σ²)

d(i_,j) =log(2πσ²) +_2σ¹₂[(y_i−µi)²+ (y_j−µj)²] µˆi= ˆµj= (yi+yj)/2

d(i,j) =log(2π) + (yi−yj)²

yi∼N(µi, σ²_i).

d(i,j) =log(2πσ²_i) + ¹

2σ²_i[(yi−µi)²+ (yj−µj)²] ˆ

µi= ˆµj= (yi+yj)/2 andˆσ²_i = ˆσ²_j = (yi−yj)²/2 d(i,j) =log(π(yi−yj)²)

yi∼N(0, σ²_i)

d(i,j) =¹₂[log(2πσ²_i) +log(2πσ²_j)] + ^y

2 i 2σ²_i + ^y

2 j 2σ²_j

ˆσ²_i = ˆσ²_j = ˆσ²_ij= (y_i²+y_j²)/2 d(i,j) =log(π(y_i−y_j)²) +1

(19)

Poisson case

Common parameter for mean an variance ply:

to comparey_k withy_iandy_jon different groups

yiandyjon the same group with different offset (e.g.population sizes)λi=θi×oi

yi∼Poisson(λi)p(yi/θi,oi) =p(yi/λi) =λ^y_iⁱe^−λⁱ/Qyi

maxy^′{p(y^′/λi)}=maxy^′{λ^y_i^′e^−λⁱ/(y^′!) 0 forλi<0.5 and≈λi−0.5 forλi≥0.5

(20)

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0

0.2 0.4 0.6 0.8

0.1

y’

density

0.0 0.5 1.0 1.52.0 2.5 3.0 3.5 0.0

0.1 0.2 0.3 0.4 0.5 0.6

0.5

y’

density

0 1 2 3

0.0 0.1 0.2 0.3 0.4

0.5 0.75

y’

density

0 1 2 3 4

0.0 0.1 0.2 0.3 0.4

1

y’

density

0 1 2 3 4

0.05 0.10 0.15 0.20 0.25 0.30

1.5

y’

density

0 1 2 3 4 5

0.05 0.10 0.15 0.20 0.25

2.5

y’

density

47 48 49 50 51 52 53

0.050 0.051 0.052 0.053 0.054 0.055 0.056

50

density

0 1 2 3 4 5

−0.5

−0.4

−0.3

−0.2

−0.1 0.0

0e+00 1e+05 2e+05 3e+05 4e+05

−0.520

−0.515

−0.510

−0.505

−0.500

−0.495

−0.490

(21)

Partitioning the MST

hierarquical division of MST (groups) For each edgek :

remove the edge

compute homogeneity (Hg) for the groups (e.g.Hg=P

iP

l(y_il−¯y_l)²) remove edge minimizingP

gHg iterate

objective function:Hg−(Hg1+Hg2)

stopping criteria: number of groups, minimum number within groups, reduction of P

gHg, etc

efficient algorithm (Assunc¸ ˜ao et. al. 2006)

(22)

Alternative Homogeneity measures

likelihood based for groupi: Hi=−P

jlog(p(yj|θˆi,φˆi)), for K groups:Pk

i=1Hi

“deviance”:Dk =Hk −Hk−1

Dk ∼χ²p: stopping criteria

(23)

Poins proccess

1

2 3

4

5 6

78 9

10 11

12 13 14

15

16 17

18 19 20

212223 24 25

26 27 2829

30 3132

33 34 35

36

3738 39 40 41 42 43

44 4546 47

49 48 50 51

52 535455

56 57

58

59 60 616263 64 65

(24)

R code

require(spdep); require(spatstat); data(japanesepines) del = deldir(japanesepines, rw=c(0,1,0,1))

d.area = data.frame(del$summary)$dir.area

nb.del = tapply(c(del$dirs[,5], del$dirs[,6]), c(del$dirs[,6], del$dirs[,5]), as.integer) class(nb.del) ¡- ”nb”

costs.a = nbcosts(nb.del, d.area)

nbw.a = nb2listw(nb.del, costs.a, style=”B”) mst.a = mstree(nbw.a)

ldnorm = function(x, id) -sum(dnorm(x[id], mean(x[id]), sd(d.area), log=TRUE)) sk5.a = skater(mst.a[,1:2], d.area, 4, method=ldnorm)