SKATER
Spatial “K”luster Analysis Through Edge Removal
Extensions, possible applications and computational implementation
Paulo Justiniano Ribeiro Jr
LEG: Laboratório de Estatística e Geoinformação / UFPR
In collaboration with:
Elias Teixeira Krainski (LEG/UFPR)
Renato Martins Assunção (LESTE/UFMG)
http://www.leg.ufpr.br
e-mail: paulojus@ufpr.br
CHICAS Lancaster, UK
IBC-2010/Floripa and 55th RBRAS
International Biometrics Conference &
55th Meeting of the Brazilian Region of the International Biometric Society (55ª Reunião da Região Brasileira da Sociedade Internacional de Biometria)
1 Organization: IBS, RBras, RArg
2 From 5 to 10 December 2010, Florianópolis, SC, Brazil
3 Satellite events (open to proposals)
4 Free/excursion day (and RBras/RArg meeting) on Wednesday, 07/12
5 http://www.ibc-floripa-2010.org/ and http://www.tibs.org
Outline
Motivation
Basics of SKATER
Computational implementation
Extensions and fundamentals
Some applications
Final remarks
Motivating Examples I: Living condition indexes in BH
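[Figure: four maps of living condition indexes, panels titled Saúde (health), Educação (education), Criança (children) and Econômico (economic); legend breaks at 15, 35, 55, 75, 95]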
Clustering and regionalization
Clustering:
  n individuals with their "attributes" −→ k "homogeneous" groups
  group composition (possibly with a minimal number within groups)
  how many groups?
Regionalization:
  classification procedure
  applied to spatial objects (with an areal representation)
  groups them into homogeneous contiguous regions
Why?
  detecting heterogeneous sub-regions ("step")
  exploratory
  administrative/management purposes ...
  reduced number of possible groups
Approaches
1 non-spatial clustering + neighbourhood-preserving classification
  2 steps
  possibly too many groups
2 clustering including (weighted) spatial covariates
  SAGE (spatial analysis in a GIS environment)
  objective function involves homogeneity, compactness and equality
3 explicit use of neighbouring structure in optimization
  implicit constraints
  AZP (automatic zoning procedure)
  computationally expensive
The SKATER approach
algorithm of type (3)
graph to introduce/represent neighbourhood
  vertices (units) + edges
  "cost" of edges: dissimilarities
heuristics for pruning
  minimal 1 group graph (MST: minimum spanning tree)
regionalization −→ optimal graph partitioning
Skater in action I
[Figure: graph with nodes labelled 1 to 98]
Skater in action II
[Figure: two panels, each showing the graph with nodes labelled 1 to 98]
An “R”-ish template for the analysis
1 spatial packages: sp, spdep, . . .
2 read attributes and the map: read.table() and readShapePoly()
3 standardize data (scale())
4 neighbourhood list (poly2nb())
5 costs for edges (dissimilarities) (nbcosts())
6 weighted neighbourhood structure (nb2listw())
7 minimum spanning tree (mstree())
8 partition (skater())
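The steps above can be strung together as in the following minimal sketch. It assumes the spdep package is installed. A regular 6 x 6 lattice built with cell2nb() and two simulated attributes stand in for the map and the attribute table (they replace steps 2 and 4 for illustration only and are not data from the talk).

```r
library(spdep)

set.seed(1)
# Stand-in for steps 2 and 4: rook neighbours on a 6 x 6 lattice
# (with a real map one would use poly2nb() on the polygons instead).
nb <- cell2nb(6, 6)

# Step 3: standardized attributes (simulated here, hypothetical data).
dat <- scale(cbind(x1 = rnorm(36), x2 = rnorm(36)))

# Step 5: edge costs (Euclidean dissimilarities between neighbours).
costs <- nbcosts(nb, dat)

# Step 6: weighted neighbourhood structure, costs used as they are (style "B").
nbw <- nb2listw(nb, glist = costs, style = "B")

# Step 7: minimum spanning tree (matrix with columns: node, node, cost).
mst <- mstree(nbw)

# Step 8: 3 cuts of the tree, giving 4 contiguous groups.
res <- skater(mst[, 1:2], dat, ncuts = 3)
table(res$groups)
```

The number of groups is always ncuts + 1; res$groups holds the group label of each unit.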
Yet another example
Extending SKATER
costs of edges
  Euclidean, Mahalanobis, ...
  density based measurements (non-Gaussian data)
measures of homogeneity
  distance to the mean(s)
  likelihood based measures, criteria for number of groups
applications
  areal data
  point processes, geostatistical data, time series, independent observations
Minimum spanning tree
connected graph (nodes $v$, edges $e$, attributes $y$): path between any pair $(v_i, v_j)$
tree (no circuit)
spanning tree: $n$ nodes $V = (v_1, \ldots, v_n)$ and $n-1$ edges $E = (e_1, \ldots, e_{n-1})$
costs for each edge (dissimilarities): $d(v_i, v_j) = d(e_k) = \sum_l (y_{il} - y_{jl})^2$
minimum spanning tree (MST): set $\{e_1, \ldots, e_{n-1}\}$ minimizing $\sum_{k=1}^{n-1} d(e_k)$
removal of an edge results in two graphs (also MSTs)
Prim (1957) algorithm:
  start from an empty set and include a first node from $V$
  compute dissimilarities between nodes in $V_{in}$ and $V_{out}$
  select the node from $V_{out}$ with minimum dissimilarity
  iterate until all included
unique under certain conditions
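A minimal base R sketch of Prim's algorithm as outlined above, assuming a connected graph stored as a dense cost matrix with Inf for pairs that are not neighbours. The function name prim_mst and the five-unit toy graph are hypothetical, for illustration only.

```r
# Prim's algorithm on a symmetric cost matrix (Inf = no edge).
# Returns the n - 1 tree edges in the order they were added.
prim_mst <- function(cost, start = 1) {
  n <- nrow(cost)
  in_tree <- rep(FALSE, n)
  in_tree[start] <- TRUE                      # V_in starts with a single node
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "cost")))
  for (k in seq_len(n - 1)) {
    # dissimilarities between nodes in V_in (rows) and V_out (columns)
    sub <- cost[in_tree, !in_tree, drop = FALSE]
    pos <- arrayInd(which.min(sub), dim(sub))
    from <- which(in_tree)[pos[1]]
    to <- which(!in_tree)[pos[2]]
    edges[k, ] <- c(from, to, cost[from, to])
    in_tree[to] <- TRUE                       # move the selected node into V_in
  }
  edges
}

# Toy example: 5 units, one attribute, 6 neighbour pairs.
y <- c(1, 2, 10, 11, 12)
pairs <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(1, 3), c(3, 5))
adj <- matrix(FALSE, 5, 5)
adj[pairs] <- TRUE
adj[pairs[, 2:1]] <- TRUE
cost <- outer(y, y, function(a, b) (a - b)^2)  # squared differences as costs
cost[!adj] <- Inf
mst <- prim_mst(cost)
mst
```

The tree keeps the four cheapest edges that connect all units; the expensive edge between units 2 and 3 is retained only because it is the cheapest link between the two halves of the graph.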
Minimum spanning tree
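$Y$ with p.d.f. $p(y \mid \theta, \phi)$, location ($\theta$) and scale ($\phi$)
$pl_{\theta,\phi}(\theta, \phi \mid X, y) = \dfrac{L(\theta, \phi \mid X, y)}{\max_{\theta', \phi'} [L(\theta', \phi' \mid X, y)]}$
similarly for $y$: $pl_y(y \mid X, \theta, \phi) = \dfrac{p(y \mid X, \theta, \phi)}{\max_{y' \in \Re} [p(y' \mid X, \theta, \phi)]}$
distance measures
  common scale: $d_{ij} = \log\{[p(y_i \mid X_i, \theta, \phi)\, p(y_j \mid X_j, \theta, \phi)]^{-1}\}$
  more general: $d_{ij} = \log\{[pl_y(y_i \mid X_i, \theta, \phi)\, pl_y(y_j \mid X_j, \theta, \phi)]^{-1}\}$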
Gaussian case
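$y_i \sim N(\mu_i, \sigma^2)$
  $d(i,j) = \log(2\pi\sigma^2) + \frac{1}{2\sigma^2}[(y_i - \mu_i)^2 + (y_j - \mu_j)^2]$
  $\hat\mu_i = \hat\mu_j = (y_i + y_j)/2$
  $d(i,j) = \log(2\pi\sigma^2) + (y_i - y_j)^2/(4\sigma^2)$
$y_i \sim N(\mu_i, \sigma_i^2)$
  $d(i,j) = \log(2\pi\sigma_i^2) + \frac{1}{2\sigma_i^2}[(y_i - \mu_i)^2 + (y_j - \mu_j)^2]$
  $\hat\mu_i = \hat\mu_j = (y_i + y_j)/2$ and $\hat\sigma_i^2 = \hat\sigma_j^2 = (y_i - y_j)^2/2$
  $d(i,j) = \log(\pi (y_i - y_j)^2) + 1/2$
$y_i \sim N(0, \sigma_i^2)$
  $d(i,j) = \frac{1}{2}[\log(2\pi\sigma_i^2) + \log(2\pi\sigma_j^2)] + \frac{y_i^2}{2\sigma_i^2} + \frac{y_j^2}{2\sigma_j^2}$
  $\hat\sigma_i^2 = \hat\sigma_j^2 = \hat\sigma_{ij}^2 = (y_i^2 + y_j^2)/2$
  $d(i,j) = \log(\pi (y_i^2 + y_j^2)) + 1$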
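The plug-in step can be checked numerically. The sketch below computes the pairwise cost as minus the sum of the two Gaussian log-densities and compares it with the closed forms obtained by substitution: $\log(2\pi\sigma^2) + (y_i - y_j)^2/(4\sigma^2)$ for a common known variance with the pair mean plugged in, and $\log(\pi(y_i^2 + y_j^2)) + 1$ for zero mean with the pair variance plugged in. The values of yi, yj and sigma2 are arbitrary, and pair_cost is a hypothetical helper name.

```r
# Cost of a pair: minus the sum of the Gaussian log-densities.
pair_cost <- function(y, mu, sigma2) {
  -sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

yi <- 1.3
yj <- 2.9

# Common known variance, pair mean plugged in.
sigma2 <- 1.7
d1_num <- pair_cost(c(yi, yj), (yi + yj) / 2, sigma2)
d1_closed <- log(2 * pi * sigma2) + (yi - yj)^2 / (4 * sigma2)

# Zero mean, pair variance estimated by (yi^2 + yj^2) / 2.
s2_hat <- (yi^2 + yj^2) / 2
d3_num <- pair_cost(c(yi, yj), 0, s2_hat)
d3_closed <- log(pi * (yi^2 + yj^2)) + 1

# Differences between numerical and closed-form costs (zero up to rounding).
c(common_variance = d1_num - d1_closed, zero_mean = d3_num - d3_closed)
```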
Poisson case
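common parameter for mean and variance
$pl_y$:
  to compare $y_k$ with $y_i$ and $y_j$ in different groups
  $y_i$ and $y_j$ in the same group with different offsets (e.g. population sizes): $\lambda_i = \theta_i \times o_i$
$y_i \sim \mathrm{Poisson}(\lambda_i)$: $p(y_i \mid \theta_i, o_i) = p(y_i \mid \lambda_i) = \lambda_i^{y_i} e^{-\lambda_i} / y_i!$
$\max_{y'}\{p(y' \mid \lambda_i)\} = \max_{y'}\{\lambda_i^{y'} e^{-\lambda_i} / (y'!)\}$, attained at $y' = 0$ for $\lambda_i < 0.5$ and at $y' \approx \lambda_i - 0.5$ for $\lambda_i \geq 0.5$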
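The location of the maximum can be explored numerically. This sketch assumes that $y'$ is treated as continuous, with $y'!$ replaced by $\Gamma(y' + 1)$ (an assumption made here for illustration), and compares the numerical maximizer with the approximation $\max(\lambda - 0.5, 0)$. The function name pois_argmax is hypothetical.

```r
# Maximizer over real y' >= 0 of the Poisson density with y'! = gamma(y' + 1).
pois_argmax <- function(lambda) {
  logdens <- function(y) y * log(lambda) - lambda - lgamma(y + 1)
  optimize(logdens, interval = c(0, lambda + 10), maximum = TRUE)$maximum
}

lambdas <- c(0.1, 0.5, 0.75, 1, 1.5, 2.5, 50)
argmax <- sapply(lambdas, pois_argmax)

# Numerical maximizer next to the approximation max(lambda - 0.5, 0).
round(cbind(lambda = lambdas, argmax = argmax,
            approx = pmax(lambdas - 0.5, 0)), 3)
```

For small $\lambda$ the log-density is decreasing in $y'$, so the maximizer sits at the boundary $y' = 0$; for large $\lambda$ it settles close to $\lambda - 0.5$.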
[Figure: density against y′, panels titled 0.1, 0.5, 0.75, 1, 1.5, 2.5 and 50; two further panels with vertical axes ranging from −0.5 to 0.0 and from −0.520 to −0.490]
Partitioning the MST
hierarchical division of MST (groups)
For each edge $k$:
  remove the edge
  compute homogeneity ($H_g$) for the groups (e.g. $H_g = \sum_i \sum_l (y_{il} - \bar{y}_l)^2$)
remove edge minimizing $\sum_g H_g$
iterate
objective function: $H_g - (H_{g1} + H_{g2})$
stopping criteria: number of groups, minimum number within groups, reduction of $\sum_g H_g$, etc.
efficient algorithm (Assunção et al. 2006)
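One pruning step can be sketched in base R as a brute-force search over all tree edges. This is an illustration only, not the efficient algorithm cited above; the helper names (ssd, components, best_cut) and the toy data are hypothetical.

```r
# H_g: sum of squared deviations from the group mean(s).
ssd <- function(y) sum(sweep(y, 2, colMeans(y))^2)

# Label connected components of a graph given as a 2-column edge matrix:
# propagate the smallest label along edges until nothing changes.
components <- function(n, edges) {
  lab <- seq_len(n)
  repeat {
    changed <- FALSE
    for (r in seq_len(nrow(edges))) {
      m <- min(lab[edges[r, ]])
      if (any(lab[edges[r, ]] != m)) {
        lab[edges[r, ]] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  lab
}

# Try every edge: remove it, compute sum_g H_g, keep the best removal.
best_cut <- function(y, edges) {
  total <- sapply(seq_len(nrow(edges)), function(k) {
    lab <- components(nrow(y), edges[-k, , drop = FALSE])
    sum(tapply(seq_len(nrow(y)), lab,
               function(id) ssd(y[id, , drop = FALSE])))
  })
  list(edge = which.min(total), total = min(total))
}

# Toy tree: 5 units in a chain with one attribute.
y <- matrix(c(1, 2, 10, 11, 12), ncol = 1)
edges <- cbind(from = 1:4, to = 2:5)
res <- best_cut(y, edges)
res
```

Removing the edge between units 2 and 3 separates {1, 2} from {3, 4, 5}, the split with the smallest total within-group sum of squares. Iterating the same search inside the resulting groups gives the hierarchical division.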
Alternative Homogeneity measures
likelihood based, for group $i$: $H_i = -\sum_j \log(p(y_j \mid \hat\theta_i, \hat\phi_i))$; for $K$ groups: $\sum_{i=1}^{K} H_i$
"deviance": $D_k = H_k - H_{k-1}$
$D_k \sim \chi^2_p$: stopping criteria
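A small sketch of the likelihood-based measure for Gaussian data, assuming that both the mean and the variance are estimated by maximum likelihood within each group, and reading $H_k$ as the total over $k$ groups. The data and group labels are made up, and every group needs at least two distinct values for the variance estimate to be positive.

```r
# H_i: minus the Gaussian log-likelihood of a group at its own ML estimates.
neg_loglik <- function(y) {
  s <- sqrt(mean((y - mean(y))^2))   # ML estimate of the standard deviation
  -sum(dnorm(y, mean = mean(y), sd = s, log = TRUE))
}

# Total over groups: sum_i H_i.
total_H <- function(y, groups) sum(tapply(y, groups, neg_loglik))

y <- c(1.0, 2.0, 1.5, 10.0, 11.0, 12.5)
H1 <- total_H(y, rep(1, 6))               # one group
H2 <- total_H(y, c(1, 1, 1, 2, 2, 2))     # two groups
D2 <- H2 - H1                             # change when going from 1 to 2 groups
c(H1 = H1, H2 = H2, D2 = D2)
```

A strongly negative value indicates that the extra group improves the fit considerably; small changes suggest stopping.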
Point process
[Figure: point pattern with points labelled 1 to 65]
R code
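require(spdep); require(spatstat); require(deldir)
data(japanesepines)
# Delaunay triangulation and Dirichlet tessellation of the point pattern
del <- deldir(japanesepines, rw = c(0, 1, 0, 1))
# attribute: area of the Dirichlet tile of each point
d.area <- del$summary$dir.area
# neighbour list: points whose tiles share an edge
nb.del <- tapply(c(del$dirsgs[, 5], del$dirsgs[, 6]), c(del$dirsgs[, 6], del$dirsgs[, 5]), as.integer)
class(nb.del) <- "nb"
# edge costs, weighted neighbourhood structure and minimum spanning tree
costs.a <- nbcosts(nb.del, d.area)
nbw.a <- nb2listw(nb.del, costs.a, style = "B")
mst.a <- mstree(nbw.a)
# homogeneity: minus the Gaussian log-likelihood, group mean and overall sd
ldnorm <- function(x, id) -sum(dnorm(x[id], mean(x[id]), sd(d.area), log = TRUE))
# 4 cuts of the tree, i.e. 5 groups
sk5.a <- skater(mst.a[, 1:2], d.area, 4, method = ldnorm)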