(1)

SKATER

Spatial “K”luster Analysis Through Edge Removal

Extensions, possible applications and computational implementation

Paulo Justiniano Ribeiro Jr

LEG: Laboratório de Estatística e Geoinformação / UFPR

In collaboration with:

Elias Teixeira Krainski (LEG/UFPR)
Renato Martins Assunção (LESTE/UFMG)

http://www.leg.ufpr.br
e-mail: paulojus@ufpr.br

CHICAS Lancaster, UK

(2)

IBC-2010/Floripa and 55th RBRAS

International Biometrics Conference &

55ª Reunião da Região Brasileira da Sociedade Internacional de Biometria

1 Organization: IBS, RBras, RArg

2 05 to 10 December 2010, Florianópolis, SC, Brazil

3 satellite events (open to proposals)

4 Free/excursion day (and RBras/RArg meeting) on Wednesday, 07/12

5 http://www.ibc-floripa-2010.org/ and http://www.tibs.org

(3)

Outline

Motivation

Basics of SKATER

Computational implementation

Extensions and fundamentals

Some applications

Final remarks

(4)

Motivating Examples I: Living condition indexes in BH

[Figure: four panels, Saúde (health), Educação (education), Criança (children) and Econômico (economic); legend classes 15, 35, 55, 75, 95]

(5)

Clustering and regionalization

Clustering:

  n individuals with their “attributes” → k “homogeneous” groups
  group composition (possibly with minimal number)
  how many groups?

Regionalization:

  classification procedure
  applied to spatial objects (with an areal representation)
  groups them into homogeneous contiguous regions

Why?

  detecting heterogeneous sub-regions (“step”)
  exploratory
  administrative/management purposes . . .
  reduced number of possible groups

(6)

Approaches

1 non-spatial clustering + neighbouring preserving classification
  2 steps
  possibly too many groups

2 clustering including (weighted) spatial covariates
  SAGE (spatial analysis GIS environment)
  objective function involves homogeneity, compactness and equality

3 explicit use of neighbouring structure in optimization
  implicit constraints
  AZP (automatic zoning procedure)
  computationally expensive

(7)

The SKATER approach

algorithm of type (3)

graph to introduce/represent neighbourhood: vertices (units) + edges

“cost” of edges: dissimilarities

heuristics for pruning

minimal one-group graph (MST: minimum spanning tree)

regionalization → optimal graph partitioning

(8)

Skater in action I

[Figure: the 98 units, labelled 1 to 98]

(9)

Skater in action II

[Figure: two panels, each showing the 98 units labelled 1 to 98]

(10)

An “R”-ish template for the analysis

1 spatial packages: sp, spdep, . . .

2 read attributes and the map (read.table() and readShapePoly())

3 standardize data (scale())

4 neighbourhood list (poly2nb())

5 costs for edges (dissimilarities) (nbcosts())

6 weighted neighbourhood structure (nb2listw())

7 minimum spanning tree (mstree())

8 partition (skater())
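The steps of the template can be sketched end to end on a toy example. This is only an illustration under stated assumptions: no map file is read, so a 6 x 6 grid of made-up units stands in for the polygons, `dnearneigh()` stands in for `poly2nb()`, and the two attributes `a` and `b` are simulated.

```r
library(spdep)

set.seed(1)

# Toy "map": centroids of a 6 x 6 grid (assumption; replaces reading a shapefile)
coords <- as.matrix(expand.grid(x = 1:6, y = 1:6))

# Simulated attributes: 'a' shifts between the left and right halves, 'b' is noise
attrs <- data.frame(
  a = ifelse(coords[, "x"] <= 3, 0, 5) + rnorm(36, sd = 0.3),
  b = rnorm(36)
)

# 3: standardize data
z <- data.frame(scale(attrs))

# 4: neighbourhood list (grid cells at distance 1 are neighbours)
nb <- dnearneigh(coords, 0, 1.1)

# 5: costs for edges (Euclidean dissimilarities by default)
costs <- nbcosts(nb, z)

# 6: weighted neighbourhood structure
nbw <- nb2listw(nb, glist = costs, style = "B")

# 7: minimum spanning tree (n - 1 rows: from, to, cost)
mst <- mstree(nbw)

# 8: partition; one cut gives two groups
res <- skater(mst[, 1:2], z, ncuts = 1)
table(res$groups)
```

With real data, steps 2 and 4 would read the attributes and the polygons and call `poly2nb()` on them, as listed in the template.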

(11)

Yet another example

(12)

Extending SKATER

costs of edges

  euclidean, mahalanobis, . . .
  density based measurements (non-Gaussian data)

measures of homogeneity

  distance to the mean(s)
  likelihood based measures, criteria for number of groups

applications

  areal data
  point processes, geostatistical data, time series, independent observations

(16)

Minimum spanning tree

connected graph (nodes $v$, edges $e$, attributes $y$): path between any pair $(v_i, v_j)$

tree: no circuit

spanning tree: $n$ nodes $V = (v_1, \ldots, v_n)$ and $n-1$ edges $E = (e_1, \ldots, e_{n-1})$

costs for each edge (dissimilarities): $d(v_i, v_j) = d(e_k) = \sum_l (y_{il} - y_{jl})^2$

minimum spanning tree (MST): set $\{e_1, \ldots, e_{n-1}\}$ minimizing $\sum_{k=1}^{n-1} d(e_k)$

removal of an edge results in two graphs (also MSTs)

Prim (1957) algorithm:

  start from an empty set and include a first node in $V$
  compute dissimilarities between nodes $V_{in}$ and $V_{out}$
  select the node from $V_{out}$ with minimum dissimilarity
  iterate until all included

unique under certain conditions
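The Prim steps listed above can be written out in a few lines of base R. This is a minimal sketch, not the implementation used in spdep: `prim_mst` is a hypothetical helper, the neighbourhood is assumed to be a plain list of neighbour indices, and the edge cost is the squared Euclidean dissimilarity defined on this slide.

```r
# Prim's algorithm restricted to a neighbourhood graph (base R only).
# nb: list, nb[[i]] holds the indices of the neighbours of node i
# y:  numeric matrix of attributes, one row per node
prim_mst <- function(nb, y, start = 1) {
  y <- as.matrix(y)
  n <- nrow(y)
  cost <- function(i, j) sum((y[i, ] - y[j, ])^2)
  in_tree <- rep(FALSE, n)
  in_tree[start] <- TRUE                      # V_in starts with one node
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "cost")))
  for (k in seq_len(n - 1)) {
    best <- c(NA, NA, Inf)
    for (i in which(in_tree)) {               # nodes already in V_in
      for (j in nb[[i]]) {                    # their neighbours
        if (!in_tree[j]) {                    # only those still in V_out
          d <- cost(i, j)
          if (d < best[3]) best <- c(i, j, d)
        }
      }
    }
    if (!is.finite(best[3])) stop("graph is not connected")
    edges[k, ] <- best                        # cheapest edge leaving V_in
    in_tree[best[2]] <- TRUE
  }
  edges
}

# Four nodes: path 1-2-3-4 plus the extra edge 1-3; one attribute
nb <- list(c(2, 3), c(1, 3), c(1, 2, 4), 3)
y <- matrix(c(0, 1, 5, 6), ncol = 1)
mst <- prim_mst(nb, y)
mst
```

The scan over all candidate edges at every step keeps the code short; it is quadratic in the worst case and is meant only to mirror the description.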

(17)

Minimum spanning tree

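$Y$ with p.d.f. $p(y \mid \theta, \phi)$, location ($\theta$) and scale ($\phi$)

$pl_{\theta,\phi}(\theta, \phi \mid X, y) = \dfrac{L(\theta, \phi \mid X, y)}{\max_{\theta}[L(\theta, \phi \mid X, y)]}$

similarly for $y$: $pl_y(y \mid X, \theta, \phi) = \dfrac{p(y \mid X, \theta, \phi)}{\max_{y' \in \Re}[p(y' \mid X, \theta, \phi)]}$

distance measures

  common scale: $d_{ij} = \log\{[p(y_i \mid X_i, \theta, \phi)\, p(y_j \mid X_j, \theta, \phi)]^{-1}\}$
  more general: $d_{ij} = \log\{[pl_y(y_i \mid X_i, \theta, \phi)\, pl_y(y_j \mid X_j, \theta, \phi)]^{-1}\}$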

(18)

Gaussian case

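$y_i \sim N(\mu_i, \sigma^2)$

  $d(i,j) = \log(2\pi\sigma^2) + \frac{1}{2\sigma^2}[(y_i - \mu_i)^2 + (y_j - \mu_j)^2]$
  $\hat\mu_i = \hat\mu_j = (y_i + y_j)/2$
  $d(i,j) = \log(2\pi\sigma^2) + (y_i - y_j)^2/(4\sigma^2)$, i.e. $\log(2\pi) + (y_i - y_j)^2/4$ for $\sigma^2 = 1$

$y_i \sim N(\mu_i, \sigma_i^2)$

  $d(i,j) = \log(2\pi\sigma_i^2) + \frac{1}{2\sigma_i^2}[(y_i - \mu_i)^2 + (y_j - \mu_j)^2]$
  $\hat\mu_i = \hat\mu_j = (y_i + y_j)/2$ and $\hat\sigma_i^2 = \hat\sigma_j^2 = (y_i - y_j)^2/2$
  $d(i,j) = \log(\pi (y_i - y_j)^2) + \frac{1}{2}$

$y_i \sim N(0, \sigma_i^2)$

  $d(i,j) = \frac{1}{2}[\log(2\pi\sigma_i^2) + \log(2\pi\sigma_j^2)] + \frac{y_i^2}{2\sigma_i^2} + \frac{y_j^2}{2\sigma_j^2}$
  $\hat\sigma_i^2 = \hat\sigma_j^2 = \hat\sigma_{ij}^2 = (y_i^2 + y_j^2)/2$
  $d(i,j) = \log(\pi (y_i^2 + y_j^2)) + 1$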
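The three Gaussian cases can be checked numerically. In the sketch below each distance is computed as minus the log of the joint density of the pair, with the plug-in estimates from the slide, and compared with the closed form obtained by substitution: $\log(2\pi\sigma^2) + (y_i - y_j)^2/(4\sigma^2)$, $\log(\pi(y_i - y_j)^2) + 1/2$ and $\log(\pi(y_i^2 + y_j^2)) + 1$. The function names are hypothetical and only serve the illustration.

```r
# d(i,j) = log{[p(y_i) p(y_j)]^(-1)} for a pair of Gaussian observations
pair_cost <- function(yi, yj, mu, s2) {
  -(dnorm(yi, mu, sqrt(s2), log = TRUE) + dnorm(yj, mu, sqrt(s2), log = TRUE))
}

# Case 1: common known variance, pair mean estimated by (y_i + y_j)/2
d_common_var <- function(yi, yj, sigma2 = 1) {
  pair_cost(yi, yj, mu = (yi + yj) / 2, s2 = sigma2)
}

# Case 2: pair mean and pair variance (y_i - y_j)^2 / 2 both estimated
d_pair_var <- function(yi, yj) {
  pair_cost(yi, yj, mu = (yi + yj) / 2, s2 = (yi - yj)^2 / 2)
}

# Case 3: zero mean, pair variance (y_i^2 + y_j^2) / 2 estimated
d_zero_mean <- function(yi, yj) {
  pair_cost(yi, yj, mu = 0, s2 = (yi^2 + yj^2) / 2)
}

yi <- 1; yj <- 3
all.equal(d_common_var(yi, yj), log(2 * pi) + (yi - yj)^2 / 4)
all.equal(d_pair_var(yi, yj),   log(pi * (yi - yj)^2) + 0.5)
all.equal(d_zero_mean(yi, yj),  log(pi * (yi^2 + yj^2)) + 1)
```

In the second and third cases the additive constant does not depend on the data, so it does not affect the ordering of the edge costs.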

(19)

Poisson case

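Common parameter for mean and variance

$pl_y$:

  to compare $y_k$ with $y_i$ and $y_j$ on different groups
  $y_i$ and $y_j$ on the same group with different offset (e.g. population sizes): $\lambda_i = \theta_i \times o_i$

$y_i \sim \text{Poisson}(\lambda_i)$, $p(y_i \mid \theta_i, o_i) = p(y_i \mid \lambda_i) = \lambda_i^{y_i} e^{-\lambda_i} / y_i!$

$\max_y\{p(y_i)\} = \max_y \lambda_i^{y} e^{-\lambda_i} / (y!)$, attained at $0$ for $\lambda_i < 0.5$ and at $\approx \lambda_i - 0.5$ for $\lambda_i \geq 0.5$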
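The location of the maximum can be explored numerically. The sketch below assumes that $y$ is treated as continuous, with $y!$ replaced by $\Gamma(y + 1)$; that extension and the helper name `pois_mode` are assumptions made for the illustration, not something stated on the slide.

```r
# Maximiser over y >= 0 of lambda^y * exp(-lambda) / gamma(y + 1),
# found on the log scale for numerical stability
pois_mode <- function(lambda) {
  logdens <- function(y) y * log(lambda) - lambda - lgamma(y + 1)
  optimize(logdens, interval = c(0, lambda + 10),
           maximum = TRUE, tol = 1e-8)$maximum
}

lambdas <- c(0.1, 0.5, 0.75, 1, 1.5, 2.5, 50)
modes <- sapply(lambdas, pois_mode)

# Difference between the maximiser and lambda, to compare with -0.5
round(cbind(lambda = lambdas, mode = modes, diff = modes - lambdas), 3)
```

For small $\lambda$ the maximum sits at the boundary $y = 0$, so `optimize()` returns a value within its tolerance of zero rather than exactly zero.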

(20)

[Figure: density against y′, panels titled 0.1, 0.5, 0.75, 1, 1.5, 2.5 and 50; two further panels with horizontal ranges 0 to 5 and 0 to 4e+05 and vertical ranges −0.5 to 0.0 and −0.520 to −0.490]

(21)

Partitioning the MST

hierarchical division of MST (groups)

For each edge $k$:

  remove the edge
  compute homogeneity ($H_g$) for the groups (e.g. $H_g = \sum_i \sum_l (y_{il} - \bar{y}_l)^2$)

remove edge minimizing $\sum_g H_g$

iterate

objective function: $H_g - (H_{g1} + H_{g2})$

stopping criteria: number of groups, minimum number within groups, reduction of $\sum_g H_g$, etc

efficient algorithm (Assunção et al. 2006)
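One partitioning step, as described above, can be sketched in base R: every edge of the tree is removed in turn, the two resulting groups are scored by their within-group sum of squares, and the cut with the smallest total is kept. This brute-force version is an assumption-laden illustration with hypothetical helper names (`ssd`, `component`, `best_cut`); it is not the efficient algorithm cited on the slide.

```r
# H_g: sum of squared deviations from the group mean vector
ssd <- function(y, ids) {
  g <- y[ids, , drop = FALSE]
  sum(sweep(g, 2, colMeans(g))^2)
}

# Nodes connected to `start` using only the rows of `edges` (2-column matrix)
component <- function(edges, start) {
  seen <- start
  repeat {
    hit <- edges[, 1] %in% seen | edges[, 2] %in% seen
    grown <- union(seen, c(edges[hit, 1], edges[hit, 2]))
    if (length(grown) == length(seen)) return(sort(grown))
    seen <- grown
  }
}

# Try every edge of the tree and keep the cut minimising H_g1 + H_g2
best_cut <- function(edges, y) {
  y <- as.matrix(y)
  nodes <- sort(unique(as.vector(edges)))
  best <- list(within = Inf)
  for (k in seq_len(nrow(edges))) {
    rest <- edges[-k, , drop = FALSE]
    g1 <- component(rest, edges[k, 1])
    g2 <- setdiff(nodes, g1)
    within <- ssd(y, g1) + ssd(y, g2)
    if (within < best$within) {
      best <- list(edge = k, g1 = g1, g2 = g2, within = within)
    }
  }
  best$gain <- ssd(y, nodes) - best$within   # H_g - (H_g1 + H_g2)
  best
}

# Tree 1-2-3-4 with one attribute
edges <- rbind(c(1, 2), c(2, 3), c(3, 4))
y <- matrix(c(0, 1, 5, 6), ncol = 1)
res <- best_cut(edges, y)
str(res)
```

Applying `best_cut()` again to the group with the larger remaining $H_g$ (or to each group in turn) gives the hierarchical division, stopped by any of the criteria listed on the slide.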

(22)

Alternative Homogeneity measures

likelihood based for group $i$: $H_i = -\sum_j \log(p(y_j \mid \hat\theta_i, \hat\phi_i))$, for $k$ groups: $\sum_{i=1}^{k} H_i$

“deviance”: $D_k = H_k - H_{k-1}$

$D_k \sim \chi^2_p$: stopping criteria
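As an illustration of the likelihood-based measure, the sketch below computes $H_i$ under a Gaussian model with the group mean and the maximum-likelihood variance plugged in, and then the difference $D_k$ between the totals for $k$ and $k-1$ groups. The Gaussian choice, the toy data and the helper names `h_gauss` and `total_h` are assumptions for the example.

```r
# H_i = -sum_j log p(y_j | theta_hat_i, phi_hat_i), Gaussian model
h_gauss <- function(y) {
  mu <- mean(y)
  s2 <- mean((y - mu)^2)              # maximum-likelihood variance
  -sum(dnorm(y, mu, sqrt(s2), log = TRUE))
}

# Total over groups: sum_i H_i
total_h <- function(y, groups) sum(tapply(y, groups, h_gauss))

# Toy data: two clearly separated sets of three values
y <- c(0.1, -0.2, 0.3, 5.2, 4.9, 5.1)
one_group  <- rep(1, 6)
two_groups <- c(1, 1, 1, 2, 2, 2)

H1 <- total_h(y, one_group)
H2 <- total_h(y, two_groups)
D2 <- H2 - H1                         # D_k = H_k - H_{k-1}
c(H1 = H1, H2 = H2, D2 = D2)
```

With this sign convention a negative $D_k$ means the extra group lowered the total $H$. Groups with a single unit or with identical values have zero estimated variance, so a minimum group size is needed before this measure can be used.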

(23)

Point process

[Figure: point pattern with the 65 points labelled 1 to 65]

(24)

R code

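# packages and data: Japanese pines point pattern
require(spdep); require(spatstat); require(deldir)
data(japanesepines)
# Delaunay triangulation / Dirichlet tessellation on the unit square
del = deldir(japanesepines, rw=c(0,1,0,1))
# attribute: area of the Dirichlet tile of each point
d.area = data.frame(del$summary)$dir.area
# neighbour list: columns 5 and 6 of the Dirichlet segments index the two points sharing a tile edge
nb.del = tapply(c(del$dirs[,5], del$dirs[,6]), c(del$dirs[,6], del$dirs[,5]), as.integer)
class(nb.del) <- "nb"
# edge costs, weighted neighbourhood structure and minimum spanning tree
costs.a = nbcosts(nb.del, d.area)
nbw.a = nb2listw(nb.del, costs.a, style="B")
mst.a = mstree(nbw.a)
# homogeneity: minus the Gaussian log-likelihood of a group (group mean, overall sd)
ldnorm = function(x, id) -sum(dnorm(x[id], mean(x[id]), sd(d.area), log=TRUE))
# 4 cuts, i.e. 5 groups
sk5.a = skater(mst.a[,1:2], d.area, 4, method=ldnorm)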
