FctClus: A Fast Clustering Algorithm for Heterogeneous Information Networks

Jing Yang1, Limin Chen1,2*, Jianpei Zhang1

1 Institute of Computer Science and Technology, Harbin Engineering University, Harbin, China; 2 Institute of Computer Science and Technology, Mudanjiang Teachers College, Mudanjiang, China

* chenlimin_clm@126.com

Abstract

It is important to cluster heterogeneous information networks. A fast clustering algorithm based on an approximate commute time embedding for heterogeneous information networks with a star network schema is proposed in this paper by utilizing the sparsity of heterogeneous information networks. First, a heterogeneous information network is transformed into multiple compatible bipartite graphs from the compatible point of view. Second, the approximate commute time embedding of each bipartite graph is computed using random mapping and a linear time solver. All of the indicator subsets in each embedding simultaneously determine the target dataset. Finally, a general model is formulated by these indicator subsets, and a fast algorithm is derived by simultaneously clustering all of the indicator subsets using the sum of the weighted distances for all indicators of an identical target object. The proposed fast algorithm, FctClus, is shown to be efficient and generalizable and exhibits high clustering accuracy and fast computation speed based on a theoretical analysis and experimental verification.

Introduction

Information networks are ubiquitous and include social information networks and DBLP bibliographic networks. Numerous studies on homogeneous information networks, which consist of a single type of data object, have been performed; however, little research has been performed on the clustering of heterogeneous information networks, which consist of multiple types of data objects. Clustering a heterogeneous network may lead to a better understanding of the hidden structures and deeper meanings of the network [1].

The star network schema is popular and important in the field of heterogeneous information networks. The star network schema includes one target type and multiple attribute types of data objects, whereby each relation links a target data object to the attribute data objects associated with it.

Algorithms based on compatible bipartite graphs can effectively consider multiple types of relational data. Various classical clustering algorithms, such as algorithms based on semi-definite programming [2,3], algorithms based on information theory [4] and spectral clustering algorithms for multi-type relational data [5], have been proposed for heterogeneous data from the compatible point of view. These algorithms are generalizable, but their computational complexity is too great for use in clustering heterogeneous information networks.

Citation: Yang J, Chen L, Zhang J (2015) FctClus: A Fast Clustering Algorithm for Heterogeneous Information Networks. PLoS ONE 10(6): e0130086. doi:10.1371/journal.pone.0130086

Academic Editor: Enrique Hernandez-Lemus, National Institute of Genomic Medicine, MEXICO

Received: December 27, 2014

Accepted: May 15, 2015

Published: June 19, 2015

Copyright: © 2015 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All relevant data are within the paper and its Supporting Information files.

Funding: The authors have no support or funding to report.


Sun et al. present an algorithm, NetClus [6], and a PathSim-based clustering algorithm [7] for clustering heterogeneous information networks. NetClus is effective for DBLP bibliographic networks, but it is not a general model for clustering other heterogeneous information networks, and it is not sufficiently stable. The concept behind NetClus is also used for clustering service webs [8,9]. The PathSim-based clustering algorithm requires user guidance, and its clustering quality reflects the requirements of users rather than the requirements of the network. ComClus [10] is a derived algorithm of NetClus for use with hybrid networks that simultaneously include heterogeneous and homogeneous relations. NetClus and ComClus are not general and depend on the given application.

Dynamic link inference in heterogeneous networks [11] requires accurate initial clustering: high clustering quality is necessary for network analysis, and low computation speed is intolerable because of the large network scales involved. The accuracy of the LDCC algorithm [12] is improved because both the heterogeneous and homogeneous data relations are explored. The CESC algorithm [13] is very effective for clustering homogeneous data using an approximate commute time embedding. A heterogeneous information network with a star network schema can be transformed into multiple compatible bipartite graphs from the compatible point of view. When the relation between any two nodes of a bipartite graph is represented by the commute time, the relations of both heterogeneous and homogeneous data objects can be explored, and the clustering accuracy can be improved. Heterogeneous information networks are large but very sparse; therefore, the approximate commute time embedding of each bipartite graph can be quickly computed using random mapping and a linear time solver [14]. All of the indicator subsets in each embedding indicate the target dataset, and a general model for clustering heterogeneous information networks is formulated based on all indicator subsets. All weighted distances between the indicators and the cluster centers in the respective indicator subsets are computed, and all indicator subsets can be simultaneously clustered according to the sum of the weighted distances for all indicators of an identical target object. Based on the above discussion, an effective clustering algorithm, FctClus, based on the approximate commute time embedding for heterogeneous information networks, is proposed in this paper. The computation speed and clustering accuracy of FctClus are both high.

Methods

Commute Time Embedding of the Bipartite Graph

Given two types of datasets, $X_0 = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_{n_0}^{(0)}\}$ and $X_1 = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_{n_1}^{(1)}\}$, the graph $G_b = \langle V, E \rangle$ is called a bipartite graph if $V(G_b) = X_0 \cup X_1$ and $E(G_b) = \{\langle x_i^{(0)}, x_j^{(1)} \rangle\}$, where $1 \le i \le n_0$ and $1 \le j \le n_1$. $W_{n_0 \times n_1}$ is the relation matrix between $X_0$ and $X_1$, where the element $w_{ij}$ is the edge weight between $x_i^{(0)}$ and $x_j^{(1)}$. Then the adjacency matrix of the bipartite graph $G_b$ can be denoted as

$$\tilde{W}_{n \times n} = \begin{bmatrix} 0 & W_{n_0 \times n_1} \\ (W^T)_{n_1 \times n_0} & 0 \end{bmatrix}$$

$D_1$ and $D_2$ are the diagonal matrices whose diagonal elements are $d_i = \sum_{j=1}^{n_1} w_{ij}$ for $D_1$ and $d_j = \sum_{i=1}^{n_0} w_{ij}$ for $D_2$:

$$D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}$$


Thus the Laplacian matrix of the bipartite graph $G_b$ is $L = D - \tilde{W}$. $L$ can be eigen-decomposed into $L = \Phi \Lambda \Phi^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix composed of the eigenvalues of $L$ with $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, and $\Phi = (\phi_1, \phi_2, \ldots, \phi_n)$ is the eigenmatrix in which $\phi_i$ is the eigenvector corresponding to the eigenvalue $\lambda_i$. Let $L^+$ be the pseudo-inverse matrix of $L$, $L^+ = \sum_{i=2}^{n} \frac{1}{\lambda_i} \phi_i \phi_i^T$.

The bipartite graph is also an undirected weighted graph. According to the literature [15], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ can be computed from the pseudo-inverse matrix $L^+$:

$$c_{ij} = g_v (l_{ii}^+ + l_{jj}^+ - 2 l_{ij}^+) = g_v (e_i - e_j)^T L^+ (e_i - e_j) \qquad (1)$$

where $l_{ij}^+$ is the $(i,j)$ element of $L^+$, $g_v = \sum w_{ij}$, and $e_i$ is a unit column vector whose $i$-th element is 1; that is, $e_i = [0_1, \ldots, 0_{i-1}, 1_i, 0_{i+1}, \ldots, 0_n]^T$.

According to the literature [15,16], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ is

$$c_{ij} = g_v (e_i - e_j)^T L^+ (e_i - e_j) = g_v (e_i - e_j)^T \Phi \Lambda^{-1} \Phi^T (e_i - e_j) = [\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T (e_i - e_j)]^T [\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T (e_i - e_j)]$$

Thus, the commute time $c_{ij}$ is the squared pairwise Euclidean distance between the row vectors of $(\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T)^T$, or equivalently the column vectors of $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ [13]; $(\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T)^T$ or $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ is called the commute time embedding of the bipartite graph $G_b$. $c_{ij}$ reflects the average path length between two nodes rather than the shortest path between them. Using the commute time for clustering noisy data increases robustness and captures complex clusters; therefore, clustering in the commute time embedding can also effectively capture complex clusters. $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ is used in this paper. If the normalized Laplacian matrix $L_n = D^{-1/2} L D^{-1/2}$ is used, the commute time embedding is $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T D^{-1/2}$ [13].
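To make the embedding concrete, the following MATLAB sketch (ours, not taken from the paper's S3 File code; the toy relation matrix and node indices are arbitrary) verifies numerically that the commute time computed from $L^+$ equals the squared distance between columns of $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$:

% Toy bipartite graph: 2 target objects, 3 attribute objects (arbitrary weights).
W = [1 0 2; 0 3 1];                      % relation matrix between X0 and X1
[n0, n1] = size(W); n = n0 + n1;
Wt = [zeros(n0) W; W' zeros(n1)];        % adjacency matrix of the bipartite graph
L = diag(sum(Wt, 2)) - Wt;               % graph Laplacian L = D - Wt
gv = sum(W(:));                          % g_v = sum of edge weights
[Phi, Lam] = eig(L);                     % eigen-decomposition L = Phi*Lam*Phi'
[lam, idx] = sort(diag(Lam)); Phi = Phi(:, idx);
Lplus = Phi(:, 2:n) * diag(1 ./ lam(2:n)) * Phi(:, 2:n)';   % pseudo-inverse L^+
Y = sqrt(gv) * diag(1 ./ sqrt(lam(2:n))) * Phi(:, 2:n)';    % commute time embedding
ei = zeros(n, 1); ei(1) = 1;             % unit vectors for nodes 1 and 4
ej = zeros(n, 1); ej(4) = 1;
c = gv * (ei - ej)' * Lplus * (ei - ej); % commute time via Eq (1)
d2 = norm(Y(:, 1) - Y(:, 4))^2;          % squared distance in the embedding
fprintf('commute time %.6f, squared embedding distance %.6f\n', c, d2);

The two printed values agree because the zero eigenvalue is dropped on both sides, exactly as in the definition of $L^+$ above.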

Approximate Commute Time Embedding of the Bipartite Graph

Directly computing $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ or $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T D^{-1/2}$ requires $O(n^3)$ time for the eigen-decomposition of the Laplacian matrix $L$ or $L_n$, where $n = n_0 + n_1$ is the number of nodes and $s$ is the number of edges in the bipartite graph $G_b$. According to the literature [17], if the edges in $G_b$ are oriented and

$$B(i,j) = \begin{cases} 1 & \text{if node } i \text{ is the tail and node } j \text{ is the head} \\ -1 & \text{if node } i \text{ is the head and node } j \text{ is the tail} \\ 0 & \text{otherwise} \end{cases}$$

where $i$ and $j$ are nodes of $G_b$, then $B_{s \times n}$ is a directed edge-node incidence matrix. With $\hat{W}_{s \times s}$ as the diagonal matrix whose entries are the edge weights, $L = B^T \hat{W} B$. Furthermore,

$$\psi = \sqrt{g_v}\, \hat{W}^{1/2} B L^+ \in R^{s \times n} \qquad (2)$$

and the commute time is the squared Euclidean distance between $i$ and $j$ in $\psi$, because

$$c_{ij} = g_v (e_i - e_j)^T L^+ (e_i - e_j) = g_v (e_i - e_j)^T L^+ L L^+ (e_i - e_j) = g_v (e_i - e_j)^T L^+ B^T \hat{W} B L^+ (e_i - e_j) = [\sqrt{g_v}\, \hat{W}^{1/2} B L^+ (e_i - e_j)]^T [\sqrt{g_v}\, \hat{W}^{1/2} B L^+ (e_i - e_j)]$$

According to the literature [18], given vectors $v_1, \ldots, v_n \in R^s$ and $\varepsilon > 0$, let $Q_{k_r \times s}$ be a random matrix of row vectors whose entries are $Q(i,j) = \pm 1/\sqrt{k_r}$ with equal probability, where $k_r = O(\log n / \varepsilon^2)$. Then, with probability at least $1 - 1/n$,

$$(1 - \varepsilon) \|v_i - v_j\|^2 \le \|Q v_i - Q v_j\|^2 \le (1 + \varepsilon) \|v_i - v_j\|^2 \qquad (3)$$

for all pairs.
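A quick empirical check of this bound (our illustration; the sizes $s$ and $k_r$ are arbitrary) can be run in MATLAB:

% Project two random vectors in R^s with Q(i,j) = +-1/sqrt(kr) and compare
% the squared pairwise distance before and after projection (Eq (3)).
s = 2000; kr = 300;
v1 = randn(s, 1); v2 = randn(s, 1);
Q = sign(randn(kr, s)) / sqrt(kr);       % random +-1/sqrt(kr) entries
ratio = norm(Q*v1 - Q*v2)^2 / norm(v1 - v2)^2;
fprintf('distance ratio after projection: %.3f (close to 1)\n', ratio);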

Therefore, given the bipartite graph $G_b$ with $n$ nodes and $s$ edges, $\varepsilon > 0$, and the matrix $Y_{k_r \times n} = \sqrt{g_v}\, Q \hat{W}^{1/2} B L^+$, with probability at least $1 - 1/n$,

$$(1 - \varepsilon) c_{ij} \le \|Y(e_i - e_j)\|^2 \le (1 + \varepsilon) c_{ij} \qquad (4)$$

for any nodes $i, j \in G_b$, where $k_r = O(\log n / \varepsilon^2)$.

The proof of Eq (4) follows directly from Eq (2) and Eq (3): $c_{ij} \approx \|Y(e_i - e_j)\|^2$ with an error $\varepsilon$ based on Eq (4). Directly computing $Y_{k_r \times n} = \sqrt{g_v}\, Q \hat{W}^{1/2} B L^+$ requires first computing $L^+$, which is excessively expensive. However, using the method in the literature [19,20] to compute $Y_{k_r \times n}$, the complexity is decreased. Let $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$; then $Y = \theta L^+$, which is equivalent to $YL = \theta$. First, $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$ is computed; then each row $y_i$ of $Y$ is obtained by solving the system $y_i L = \theta_i$, where $\theta_i$ is the $i$-th row of $\theta$. The linear time solver of Spielman and Teng [19,20] requires only $\tilde{O}(s)$ time to solve each such system. Because $\|y_i - \hat{y}_i\|_L \le \varepsilon \|y_i\|_L$ [17], where $\hat{y}_i$ is the solution of $y_i L = \theta_i$ returned by the linear time solver, it holds that [17]

$$(1 - \varepsilon)^2 c_{ij} \le \|\hat{Y}(e_i - e_j)\|^2 \le (1 + \varepsilon)^2 c_{ij}$$

Therefore, $c_{ij} \approx \|\hat{Y}(e_i - e_j)\|^2$ with an error bound of $\varepsilon^2$. The algorithm for the approximate commute time embedding of the bipartite graph is given as follows.

Algorithm 1 ApCte (Approximate Commute Time Embedding of the Bipartite Graph)

1. Input the relation matrix $W_{n_0 \times n_1}$;
2. Compute the matrices $B$, $\hat{W}$ and $L$ from $W_{n_0 \times n_1}$;
3. Compute $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$;
4. Compute each $\hat{y}_i$ from the system $y_i L = \theta_i$ by calling the Spielman-Teng solver $k_r$ times [14], $1 \le i \le k_r$;
5. Output the approximate commute time embedding $\hat{Y}$.

All data objects of $X_0$ and $X_1$ are mapped into a common subspace $\hat{Y}$, where the first $n_0$ column vectors of $\hat{Y}$ indicate $X_0$ and the last $n_1$ column vectors of $\hat{Y}$ indicate $X_1$. The dataset composed of the $n = n_0 + n_1$ column vectors of $\hat{Y}$ is called an indicator dataset.

The cost of computing the matrices $B$, $\hat{W}$ and $L$ in step 2 is $O(2s) + O(s) + O(n)$: the sparse matrix $B$ has $2s$ nonzero elements, and the diagonal matrix $\hat{W}$ has $s$ nonzero elements. Computing $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$ takes $O(2 s k_r + s)$ time in step 3. Because the linear time solver of Spielman and Teng [19,20] requires only $\tilde{O}(s)$ time to solve each system $y_i L = \theta_i$, constructing $\hat{Y}$ takes $\tilde{O}(s k_r)$ time in step 4. Therefore, the complexity of Algorithm 1, ApCte, is $O(2s) + O(s) + O(n) + O(2 s k_r + s) + \tilde{O}(s k_r) = \tilde{O}(4s + n + 3 s k_r)$. In practice, $k_r = O(\log n / \varepsilon^2)$ is small and does not vary much between different datasets. The indicator dataset consists of low-dimensional homogeneous data; therefore, traditional algorithms can be applied to it.
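A minimal MATLAB sketch of ApCte follows. It is our illustration rather than the ApCte.m shipped in S3 File: the bipartite graph is assumed connected, and pcg on a slightly regularized Laplacian stands in for the Spielman-Teng/CMG solver that the paper actually uses.

function Yhat = apcte_sketch(W, kr)
% APCTE_SKETCH  Approximate commute time embedding of the bipartite graph
% defined by the sparse relation matrix W (n0 x n1); kr is the embedding dimension.
[n0, n1] = size(W); n = n0 + n1;
[ri, ci, wv] = find(W);                  % one oriented edge per nonzero of W
s = numel(wv);
% Step 2: edge-node incidence B (s x n), edge-weight diagonal Wh, L = B'*Wh*B.
B = sparse([(1:s)'; (1:s)'], [ri; n0 + ci], [ones(s,1); -ones(s,1)], s, n);
Wh = spdiags(wv, 0, s, s);
L = B' * Wh * B;
gv = full(sum(wv));                      % g_v = sum of edge weights
% Step 3: theta = sqrt(gv) * Q * (Wh^{1/2} * B) with Q(i,j) = +-1/sqrt(kr).
Q = sign(randn(kr, s)) / sqrt(kr);
theta = sqrt(gv) * (Q * (spdiags(sqrt(wv), 0, s, s) * B));
% Step 4: solve y_i * L = theta_i for each row; L is symmetric, so solve L*y' = theta'.
% A tiny ridge keeps the singular Laplacian positive definite for pcg.
Yhat = zeros(kr, n);
for i = 1:kr
    [yi, flag] = pcg(L + 1e-10 * speye(n), theta(i, :)', 1e-6, 500);
    Yhat(i, :) = yi';
end
end

For the star-schema networks discussed below, such a routine would be called once per relation matrix $W^{(0t)}$.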

A General Model Formulation

Given a dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types, where $X_t$ is the dataset belonging to the $t$-th type, a weighted graph $G = \langle V, E, W \rangle$ on $\chi$ is called an information network if $V(G) = \chi$, $E(G)$ is a binary relation on $V$, and $W: E \to R^+$. Such an information network is called a heterogeneous information network when $T \ge 1$ and a homogeneous information network when $T = 0$ [6].

An information network $G = \langle V, E, W \rangle$ on $\chi$ is called a heterogeneous information network with a star network schema if $\forall e = \langle x_i, x_j \rangle \in E$, $x_i \in X_0$ and $x_j \in X_t$ ($t \ne 0$). $X_0$ is the target dataset, and each $X_t$ ($t \ne 0$) is an attribute dataset.

To derive a general model for clustering the target dataset, consider a heterogeneous information network with a star network schema over the dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types, where $X_0$ is the target dataset and $\{X_t\}_{t=1}^{T}$ are the attribute datasets. $X_t = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_{n_t}^{(t)}\}$, where $n_t$ is the number of objects in $X_t$. $W^{(0t)} \in R^{n_0 \times n_t}$ denotes the relation matrix between the target dataset $X_0$ and the attribute dataset $X_t$, where the element $w_{ij}^{(0t)}$ denotes the relation between $x_i^{(0)}$ of $X_0$ and $x_j^{(t)}$ of $X_t$. If an edge between $x_i^{(0)}$ and $x_j^{(t)}$ exists, $w_{ij}^{(0t)}$ is its edge weight; if no edge exists, $w_{ij}^{(0t)} = 0$. $T$ relation matrices $\{W^{(0t)}\}_{t=1}^{T}$ exist in the heterogeneous information network with a star network schema.

The target dataset $X_0$ and the attribute dataset $X_t$ constitute a bipartite graph $G^{(0t)}$, which corresponds to the relation matrix $W^{(0t)}$. The indicator dataset $Y^{(0t)} = \{y_1^{(0t)}, y_2^{(0t)}, \ldots, y_{n_0+n_t}^{(0t)}\}$, which is also the approximate commute time embedding of $G^{(0t)}$, can be quickly computed by ApCte, where the first $n_0$ data of $Y^{(0t)}$ indicate $X_0$ and the last $n_t$ data of $Y^{(0t)}$ indicate the attribute dataset $X_t$. $Y_t^{(0)}$ consists of the first $n_0$ data of $Y^{(0t)}$, and $Y^{(t)}$ consists of the last $n_t$ data of $Y^{(0t)}$; $Y_t^{(0)}$ and $Y^{(t)}$ are called the indicator subsets. $y_i^{(t)} \in Y_t^{(0)}$ indicates the $i$-th object of $X_0$ and is called an indicator, $1 \le i \le n_0$. There exists a one-to-one correspondence between the indicators of $Y_t^{(0)}$ and the objects of $X_0$. Because the $T$ bipartite graphs correspond to $T$ indicator datasets, the target dataset $X_0$ is simultaneously indicated by the $T$ indicator subsets $\{Y_t^{(0)}\}_{t=1}^{T}$, and each object of $X_0$ is simultaneously indicated by $T$ indicators.

$\beta^{(t)}$ is the weight of the relation matrix $W^{(0t)}$, where $\sum_{t=1}^{T} \beta^{(t)} = 1$ and $\beta^{(t)} > 0$. The target dataset $X_0$ is partitioned into $K$ clusters. The indicators of $\{Y_t^{(0)}\}_{t=1}^{T}$ that indicate an identical object of $X_0$ belong to $T$ clusters; these $T$ clusters lie in $T$ different indicator subsets and are denoted using the same label. Let

$$F = \sum_{t=1}^{T} \left( \beta^{(t)} \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \| y_i^{(t)} - o_j^{(t)} \|^2 \right) \qquad (5)$$

where $o_j^{(t)}$ is the $j$-th cluster center of the indicator subset $Y_t^{(0)}$. There exists a one-to-one correspondence between the indicator functions $\gamma = \{\gamma_{ij}\}_{i=1}^{n_0}$ and the objects of $X_0$. If all indicators $\{y_i^{(t)}\}_{t=1}^{T}$ that indicate the $i$-th object of $X_0$ belong to the $j$-th cluster, $\gamma_{ij} = 1$; otherwise, $\gamma_{ij} = 0$.

If the objective function $F$ in Eq (5) is minimized, the clusters of $X_0$ are optimal from the compatible point of view, because each indicator subset reflects the relation between the target dataset and one attribute dataset. Obviously, determining the global minimum of Eq (5) is NP-hard.
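For reference, the objective $F$ is straightforward to evaluate for a given hard assignment. The following MATLAB sketch (ours; the container names Ys, Os and the assignment vector labels are hypothetical conventions reused in the later sketches) computes Eq (5) with $\gamma_{ij} = 1$ iff labels(i) = j:

function F = objective_F(Ys, Os, beta, labels)
% OBJECTIVE_F  Ys{t} is the kr x n0 indicator subset Y_t^(0), Os{t} the kr x K
% centers o_j^(t), beta(t) the matrix weights, labels the n0 x 1 cluster indices.
T = numel(Ys);
F = 0;
for t = 1:T
    diff = Ys{t} - Os{t}(:, labels);     % y_i^(t) - o_{labels(i)}^(t) per column
    F = F + beta(t) * sum(sum(diff .^ 2));
end
end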

Derivation of a Fast Algorithm for Clustering Heterogeneous Information Networks

The following steps allow the local minimum of $F$ in Eq (5) to be reached quickly by simultaneously clustering all of the indicator subsets.

Setting the Cluster Label

When the cluster labels of each indicator subset are given, the modeling process can be simplified. Suppose that the labels of the $K$ clusters of each $Y_t^{(0)}$ are set. Let $q_1, q_2 \in X_0$ and $y_1^{(1)}, y_2^{(1)} \in Y_1^{(0)}$, ..., $y_1^{(T)}, y_2^{(T)} \in Y_T^{(0)}$, where $\{y_1^{(t)}\}_{t=1}^{T}$ indicate $q_1$ and $\{y_2^{(t)}\}_{t=1}^{T}$ indicate $q_2$. The clusters to which the indicators of an identical target object belong have the same label. If one indicator of $\{y_1^{(t)}\}_{t=1}^{T}$ belongs to the $j$-th cluster, all of the other indicators of $\{y_1^{(t)}\}_{t=1}^{T}$ also belong to the $j$-th cluster in their respective indicator subsets. If $\{y_1^{(t)}\}_{t=1}^{T}$ belong to the $j$-th cluster, then the indicators $\{y_2^{(t)}\}_{t=1}^{T}$ either all belong to the $j$-th cluster in their respective indicator subsets or none of them do.

Each cluster of $Y_t^{(0)}$ has an initial center. $K$ random objects are selected from the target dataset $X_0$, and the indicators indicating these $K$ objects are taken as the initial cluster centers of each $Y_t^{(0)}$; the clusters whose centers indicate an identical target object receive the same label. Then, all of the other indicators of an identical target object either belong to the $j$-th cluster in each $Y_t^{(0)}$ or none of them belong to the $j$-th cluster, where $1 \le j \le K$. Thus, the labels of the $K$ clusters of $\{Y_t^{(0)}\}_{t=1}^{T}$ are set.

The Sum of the Weighted Distances

An object of $X_0$ is indicated by $T$ indicators, and all $T$ distances between these indicators and the centers in each $Y_t^{(0)}$ affect the object allocation. The target object allocation is determined by the sum of the weighted distances over the $T$ indicators. Let $q_i \in X_0$ and $y_i^{(1)} \in Y_1^{(0)}$, ..., $y_i^{(T)} \in Y_T^{(0)}$, where $\{y_i^{(t)}\}_{t=1}^{T}$ indicate $q_i$. The weighted distance between $y_i^{(t)}$ and the $j$-th cluster center in $Y_t^{(0)}$ is $\beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2$. The sum of the weighted distances, $dis = \sum_{t=1}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2$, determines the cluster to which the object $q_i$ belongs:

$$j = \arg\min_j \sum_{t=1}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2 \qquad (6)$$
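A minimal MATLAB sketch of this allocation rule (ours, using the same hypothetical Ys/Os/beta conventions as above) assigns every target object by Eq (6):

function labels = assign_by_eq6(Ys, Os, beta)
% ASSIGN_BY_EQ6  labels(i) = argmin_j sum_t beta(t) * ||y_i^(t) - o_j^(t)||^2.
T = numel(Ys); n0 = size(Ys{1}, 2); K = size(Os{1}, 2);
dis = zeros(n0, K);                      % summed weighted squared distances
for t = 1:T
    for j = 1:K
        diff = Ys{t} - repmat(Os{t}(:, j), 1, n0);
        dis(:, j) = dis(:, j) + beta(t) * sum(diff .^ 2, 1)';
    end
end
[mindis, labels] = min(dis, [], 2);      % Eq (6): nearest center by weighted sum
end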


The Local Minimum of F

$F$ in Eq (5) can also be expressed as

$$F = \sum_{t=1}^{T} \left( \beta^{(t)} \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \| y_i^{(t)} - o_j^{(t)} \|^2 \right) = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=1}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2 \qquad (7)$$

Obviously, Eq (7) is another representation of Eq (5).

Given the initial centers $\{o_j^{(t)}\}_{j=1}^{K}$ and the cluster labels in the $T$ indicator subsets $\{Y_t^{(0)}\}_{t=1}^{T}$, $\{Y_t^{(0)}\}_{t=1}^{T}$ are first partitioned by computing Eq (6), and $F = F_0$ is set in Eq (7). The cluster centers of $\{Y_t^{(0)}\}_{t=2}^{T}$ remain the same, and $\gamma_{ij}$ is unchanged. The new center $\{\hat{o}_j^{(1)}\}_{j=1}^{K}$ of each cluster in $Y_1^{(0)}$ is computed; each new center is the mean of all data of its cluster. The new centers $\{\hat{o}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ replace the old centers, and Eq (7) is then used to set $F = F_1$. Then,

$$F_1 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \left( \beta^{(1)} \| y_i^{(1)} - \hat{o}_j^{(1)} \|^2 + \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2 \right) \le F_0 \qquad (8)$$

To prove this, write

$$F_1 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \left( \beta^{(1)} \| y_i^{(1)} - \hat{o}_j^{(1)} \|^2 + \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2 \right)$$

Because only the new centers $\{\hat{o}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ replace the old centers, $\gamma_{ij}$ remains unchanged. Therefore,

$$F_1 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - \hat{o}_j^{(1)} \|^2 + \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2$$

Because the cluster centers of $\{Y_t^{(0)}\}_{t=2}^{T}$ also remain unchanged, the second term $\sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2$ is constant, and since each new center $\hat{o}_j^{(1)}$ is the mean of its cluster, $\sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - \hat{o}_j^{(1)} \|^2 \le \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - o_j^{(1)} \|^2$. Subsequently,

$$F_1 \le \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - o_j^{(1)} \|^2 + \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - o_j^{(t)} \|^2 = F_0$$

Thus, replacing the cluster centers of $Y_1^{(0)}$ gives $F_1 \le F_0$.

After the new centers $\{\hat{o}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ replace the old centers, while the centers of $\{Y_t^{(0)}\}_{t=2}^{T}$ remain unchanged, re-clustering $\{Y_t^{(0)}\}_{t=1}^{T}$ using Eq (6), with the corresponding value $F = F_2$ in Eq (7), gives $F_2 \le F_1$.

In this way, $\{Y_t^{(0)}\}_{t=1}^{T}$ are partitioned using Eq (6), the new cluster centers $\{\hat{o}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ are computed, and the new centers replace the old centers $\{o_j^{(1)}\}_{j=1}^{K}$; then the same procedure is repeated for each of $\{Y_t^{(0)}\}_{t=2}^{T}$. The value of $F$ decreases at every step. The above procedures are repeated until $F$ in Eq (7) converges to a local minimum.

Algorithm 2 FctClus (Fast Clustering Algorithm Based on the Approximate Commute Time Embedding for Heterogeneous Information Networks)

1. Input the relation matrices $\{W^{(0t)} \in R^{n_0 \times n_t}\}_{t=1}^{T}$, the weights $\{\beta^{(t)} > 0\}_{t=1}^{T}$ and the cluster number $K$;
2. for $t = 1$ to $T$ do
3. Compute the indicator dataset $Y^{(0t)}$ of the bipartite graph corresponding to $W^{(0t)}$ using Algorithm 1;
4. Constitute the indicator subset $Y_t^{(0)}$ that indicates $X_0$;
5. end for
6. Initialize the $K$ initial cluster centers $\{o_j^{(t)}\}_{j=1}^{K}$ of $\{Y_t^{(0)}\}_{t=1}^{T}$ and set the cluster labels;
7. loop
8. for $t = 1$ to $T$ do
9. Partition $\{Y_t^{(0)}\}_{t=1}^{T}$ into $K$ clusters by computing Eq (6);
10. Re-compute the new cluster centers $\{\hat{o}_j^{(t)}\}_{j=1}^{K}$ of $Y_t^{(0)}$;
11. $\{o_j^{(t)} = \hat{o}_j^{(t)}\}_{j=1}^{K}$;
12. end for
13. end loop when $F$ in Eq (7) converges;
14. Output the clusters of $X_0$.
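The following MATLAB sketch ties the pieces together. It is our illustration, not the FctClus.m in S3 File: it reuses the hypothetical apcte_sketch and assign_by_eq6 helpers above and, as a simplification, refreshes all $T$ center sets in each sweep rather than one subset at a time as in steps 8~12.

function labels = fctclus_sketch(Ws, beta, K, kr, u)
% FCTCLUS_SKETCH  Ws is a 1 x T cell array of sparse relation matrices W^(0t)
% sharing the same target dataset X0; beta holds the weights; K clusters; u sweeps.
T = numel(Ws); n0 = size(Ws{1}, 1);
Ys = cell(1, T);
for t = 1:T                              % steps 2-5: one embedding per bipartite graph
    Y0t = apcte_sketch(Ws{t}, kr);
    Ys{t} = Y0t(:, 1:n0);                % indicator subset Y_t^(0): first n0 columns
end
seed = randperm(n0); seed = seed(1:K);   % step 6: K random target objects
Os = cell(1, T);
for t = 1:T
    Os{t} = Ys{t}(:, seed);              % their indicators become the initial centers
end
for it = 1:u                             % steps 7-13
    labels = assign_by_eq6(Ys, Os, beta);% partition all subsets via Eq (6)
    for t = 1:T
        for j = 1:K                      % new center = mean of the cluster's indicators
            members = (labels == j);
            if any(members)
                Os{t}(:, j) = mean(Ys{t}(:, members), 2);
            end
        end
    end
end
end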

In Algorithm 2, the computational complexity of steps 2~5 is $\tilde{O}(\sum_{t=1}^{T} (4 s_t + n_t + 3 s_t k_r))$, where $T$ is the number of relation matrices in the heterogeneous information network, $k_r$ is the data dimension of $Y_t^{(0)}$, and $n_t$ and $s_t$ are the node number and edge number of the $t$-th bipartite graph, respectively. Step 6 requires only constant $O(K)$ time. The object number of $X_0$ is equal to the indicator number of each indicator subset; thus, the computational complexity of steps 7~13 is $O(u T K k_r n_0)$, where $K$ is the number of clusters of each $Y_t^{(0)}$, $n_0$ is the data number of each $Y_t^{(0)}$, and $u$ is the number of iterations until $F$ in Eq (7) converges. Therefore, the computational complexity of Algorithm 2, FctClus, is $\tilde{O}(\sum_{t=1}^{T} (4 s_t + n_t + 3 s_t k_r)) + O(u T K k_r n_0)$, where $k_r$ and $u$ are small and $T$ and $K$ are constant.

Experiments

The Experimental Dataset

The experimental datasets are composed of real data selected from the DBLP data. The DBLP is a typical heterogeneous information network in the computer science domain and contains 4 types of objects: papers, authors, terms and venues. Two heterogeneous datasets of different scales, called $S_{small}$ and $S_{large}$, are used in the experiments.

$S_{small}$ is the small test dataset and is called the "four-area dataset", as in the literature [6]. $S_{small}$, extracted from the DBLP dataset downloaded in 2011, contains four data-related research areas; representative conferences for each area are chosen, and all papers and the terms that appear in their titles are included. $S_{small}$ is shown in S1 File.

$S_{large}$ is the large test dataset and is extracted from the Chinese DBLP dataset, a shared resource released by the Institute of Automation, Chinese Academy of Sciences. $S_{large}$ includes 34 computer science journals, 16,567 papers, 47,701 authors and 52,262 terms (keywords). $S_{large}$ is shown in S2 File.

When papers are analyzed, they form the target dataset, and the other objects are the attribute datasets. There are no direct links between papers because the DBLP provides very limited citation information. When authors are analyzed, they form the target dataset, while papers and venues are the attribute datasets. However, direct links exist between authors because of the co-author relation; therefore, authors also serve as an attribute dataset related to the target dataset.

The experiments are performed in the MATLAB 7.0 programming environment. The MATLAB source code for our algorithm is shown in S3 File and is available online at https://github.com/lsy917/chenlimin; it includes a main program and three function programs. FctClus.m is the main program, which outputs the clusters of the target dataset, and ApCte.m, Prematrix.m and Net_Branches.m are function programs. The Koutis CMG solver [14] is used in all experiments as the nearly linear time solver to create the embedding. The solver, which works with symmetric, diagonally dominant matrices, is available online at http://www.cs.cmu.edu/~jkoutis/cmg.html.

The Relational Matrix

Papers are the target dataset, while authors, venues and terms are the attribute datasets. $X_0$ denotes papers, and $X_1$, $X_2$ and $X_3$ denote authors, venues and terms, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $1 \le t \le 3$. The element of $\{W^{(0t)}\}_{t=1}^{3}$ is

$$w_{ij}^{(0t)} = \begin{cases} 1 & \text{if } i \in X_0,\ j \in X_1 \cup X_2 \text{ and node } i \text{ links to node } j \\ p & \text{if } i \in X_0,\ j \in X_3 \text{ and term } j \text{ appears } p \text{ times in paper } i \\ 0 & \text{otherwise} \end{cases}$$

When authors are the target dataset, papers and venues are the attribute datasets; authors are also an attribute dataset because of the co-author relation existing between authors. $X_0$ denotes authors, while $X_1$ and $X_2$ denote papers and venues, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $0 \le t \le 2$. The element of $\{W^{(0t)}\}_{t=0}^{2}$ is

$$w_{ij}^{(0t)} = \begin{cases} 1 & \text{if } i \in X_0,\ j \in X_1 \cup X_2 \text{ and node } i \text{ links to node } j \\ p & \text{if } i \in X_0,\ j \in X_0 \text{ and nodes } i \text{ and } j \text{ co-author } p \text{ papers} \\ 0 & \text{otherwise} \end{cases}$$

All of the algorithms use the same relation matrices in all experiments.
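As a concrete illustration of how these matrices can be assembled (a toy example of our own; the edge lists and sizes are hypothetical), sparse relation matrices for the papers-as-target case can be built in MATLAB as follows:

% Toy sizes: 4 papers, 3 authors, 2 venues, 5 terms.
np = 4; na = 3; nv = 2; nt = 5;
paper_author = [1 1; 1 2; 2 2; 3 3; 4 1];        % [paper, author] links (weight 1)
paper_venue  = [1 1; 2 1; 3 2; 4 2];             % [paper, venue] links (weight 1)
paper_term   = [1 2 3; 2 5 1; 3 4 2];            % [paper, term, p occurrences]
W01 = sparse(paper_author(:,1), paper_author(:,2), 1, np, na);
W02 = sparse(paper_venue(:,1),  paper_venue(:,2),  1, np, nv);
W03 = sparse(paper_term(:,1), paper_term(:,2), paper_term(:,3), np, nt);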

Parameter Analysis

Analysis of Parameter $k_r$. The equation [13]

$$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \delta(\mathrm{label}(i), \mathrm{map}(c_i))$$

is used to compute the clustering accuracy in the experiments, where $n$ is the number of objects in the dataset, $\mathrm{label}(i)$ is the cluster label of object $i$, $c_i$ is the predicted label of object $i$, and $\mathrm{map}(\cdot)$ maps each predicted label to a cluster label. $\delta(\cdot)$ is an indicator function:

$$\delta(\cdot) = \begin{cases} 1 & \mathrm{map}(i) = \mathrm{label}(i) \\ 0 & \mathrm{map}(i) \ne \mathrm{label}(i) \end{cases}$$
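A minimal MATLAB sketch of this measure (ours; here map() is realized by giving each predicted cluster the majority true label of its members, one simple choice among several, e.g., a Hungarian-style optimal matching):

function acc = clustering_accuracy(labels_true, labels_pred)
% CLUSTERING_ACCURACY  Fraction of objects whose mapped predicted label
% matches the true label.
mapped = zeros(size(labels_pred));
for j = 1:max(labels_pred)
    members = (labels_pred == j);
    if any(members)
        mapped(members) = mode(labels_true(members));  % majority true label
    end
end
acc = sum(mapped == labels_true) / numel(labels_true);
end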

$k_r$ is small in practice, and minimal differences exist among the various datasets [13]. The literature [13] has proved that the accuracy curve is flat for clustering different homogeneous datasets when $k_r \ge 50$.

Using the small dataset $S_{small}$, the clustering accuracy as a function of $k_r$ in a heterogeneous information network is studied.

An experiment with different $k_r$ is conducted on the small dataset $S_{small}$. In the FctClus algorithm, the weights of $\{W^{(0t)}\}_{t=1}^{3}$ are taken as $\beta^{(1)} = 0.3$, $\beta^{(2)} = 0.4$ and $\beta^{(3)} = 0.3$ for clustering papers; the weights of $\{W^{(0t)}\}_{t=0}^{2}$ are taken as $\beta^{(1)} = 0.4$, $\beta^{(2)} = 0.2$ and $\beta^{(3)} = 0.4$ for clustering authors. The clustering accuracy as a function of $k_r$ is shown in Fig 1 and Fig 2.

The parameter $k_r$ can be kept quite small because the accuracy curve flattens once $k_r$ reaches a certain value; $k_r = 60$ is suitable for the datasets in this experiment. Such a small $k_r$ does not considerably affect the computation speed of FctClus, and it is advantageous that FctClus is sensitive to $k_r$ in neither accuracy nor performance. The same relation-matrix weights and $k_r = 60$ are used in the other experiments.

Analysis of Iteration u

An experiment is conducted on the small dataset $S_{small}$ to examine the influence of the iteration number $u$ on the clustering result, where $k_r = 60$. The influence of $u$ on clustering papers and authors is shown in Fig 3 and Fig 4. The algorithm converges quickly, by about $u = 30$; $u = 40$ is used in the other experiments.

Fig 1. The influence of $k_r$ for clustering papers on $S_{small}$.


Comparison of Clustering Accuracy and Computation Speed

The complexity of the algorithms based on semi-definite programming [2,3] and of the spectral clustering algorithms for multi-type relational data [5] is too high for large-scale networks. The low-complexity algorithms CIT [4], NetClus [6] and ComClus [10] are therefore selected for comparison with the FctClus algorithm in terms of clustering accuracy and computation speed; the datasets $S_{small}$ and $S_{large}$ are used for this experiment.

The initial cluster centers of FctClus, or the initial cluster partitions of the other three algorithms, are randomly selected 3 times. The best clustering accuracy of the 3 runs is reported as the clustering accuracy of each of the four algorithms, and the computation speed of that run is reported as the measured computation speed. The parameters of NetClus are those used in the literature [6], and the parameters of ComClus are those used in the literature [10]. The comparison results are shown in Table 1 and Table 2.

The clustering accuracy of FctClus is the highest of the four algorithms. The clustering accuracy of CIT is lower than that of FctClus because the bipartite graphs of the heterogeneous information networks are sparse: the computational complexity of CIT is $O(n^2)$, and the convergence speed of CIT is low when the heterogeneous information network is sparse. The clustering accuracy of NetClus is low because only heterogeneous relations are used. Homogeneous and heterogeneous relations are both used in ComClus; therefore, the accuracy of ComClus is higher than that of NetClus. FctClus is an algorithm based on the commute time embedding: the data relations are explored using the commute time, and the direct relations of the target dataset are considered. FctClus is not affected by the sparsity of the networks; thus, FctClus is highly accurate.

Fig 2. The influence of $k_r$ for clustering authors on $S_{small}$.

doi:10.1371/journal.pone.0130086.g002

Fig 3. The influence of $u$ for clustering papers on $S_{small}$.

The computation speed of FctClus is nearly as high as that of NetClus. The experiment demonstrates that FctClus is effective. FctClus is more universal and can be adapted for clustering any heterogeneous information network with a star network schema, whereas NetClus and ComClus can only be adapted for clustering bibliographic networks because they depend on a ranking function of a specific application field.

Comparison of Clustering Stability

To compare the stability of the FctClus, NetClus and CIT algorithms, the small dataset $S_{small}$ is used for clustering papers in this experiment. ComClus is a derived algorithm of NetClus and has the same properties as NetClus, so ComClus is not considered in this study.

The initial cluster centers of FctClus and the initial cluster partitions of NetClus and CIT are randomly generated 10 times, and the three algorithms are each executed 10 times. The clustering accuracy of the three algorithms over the 10 runs is shown in Fig 5.

Fig 4. The influence of $u$ for clustering authors on $S_{small}$.

doi:10.1371/journal.pone.0130086.g004

Table 1. Comparison of clustering accuracy (%).

Target object & dataset     CIT     NetClus   ComClus   FctClus
Papers on S_small           73.91   71.54     72.83     78.87
Authors on S_small          74.41   69.13     74.91     81.33
Papers on S_large           70.84   71.28     72.93     76.36
Authors on S_large          71.02   68.29     73.01     77.94


Although the computation speeds of FctClus and NetClus are both high, Fig 5 shows that the stability of FctClus is higher than that of NetClus and that the initial centers do not greatly impact the clustering result of FctClus. However, NetClus is very unstable, and the initial clusters greatly impact the clustering accuracy and convergence speed of NetClus. CIT is more stable than NetClus, but its clustering accuracy is low.

Running Time Analysis of the FctClus Algorithm

The running time distributions of FctClus on the two datasets are shown in Table 3. The experimental data show that FctClus is efficient. The running time for serially computing the three embeddings is less than 50% of the total running time; computing the three embeddings in parallel would further increase the computation speed, and clustering the indicator subsets in parallel could increase it as well.

Table 2. Comparison of computation speed (s).

Target object & dataset     CIT      NetClus   ComClus   FctClus
Papers on S_small           78.5     37.3      40.3      37.1
Authors on S_small          79.8     36.9      39.8      38.3
Papers on S_large           1469.3   802.6     827.3     808.4
Authors on S_large          1484.7   743.7     781.4     774.9

doi:10.1371/journal.pone.0130086.t002

Fig 5. A stability comparison of the three algorithms over 10 runs.

doi:10.1371/journal.pone.0130086.g005

Table 3. Distribution of running time for FctClus.

Target object & dataset     Embedding time (s)   Clustering time (s)   Total time (s)
Papers on S_small           19.6                 17.5                  37.1
Authors on S_small          18.1                 20.2                  38.3
Papers on S_large           398.8                409.6                 808.4
Authors on S_large          382.4                392.5                 774.9


Conclusions

The relations between the original data described by the commute time guarantee the accuracy and performance of the FctClus algorithm. Because heterogeneous information networks are sparse, FctClus can use random mapping and a linear time solver [14] to compute the approximate commute time embedding, which guarantees the high computation speed. FctClus is effective and may be broadly applied to large heterogeneous information networks, as demonstrated in theory and experimentally. The weights of the relation matrices impact the objective function, but they cannot yet be determined self-adaptively; this requires further research. The relations of data in the real world are typically high-order heterogeneous, so effective clustering algorithms for heterogeneous information networks with an arbitrary schema will be studied in the future.

Supporting Information

S1 File. $S_{small}$ dataset. (TXT)

S2 File. $S_{large}$ dataset. (TXT)

S3 File. The MATLAB source code for the algorithms. (TXT)

Author Contributions

Conceived and designed the experiments: JY LMC. Performed the experiments: LMC. Analyzed the data: JPZ. Contributed reagents/materials/analysis tools: JY LMC. Wrote the paper: LMC.

References

1. Sun Y, Han J (2012) Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery 3(2): 1–159.

2. Gao B, Liu TY, Qin T, Zheng X, Cheng QS, Ma WY (2005) Web image clustering by consistent utilization of visual features and surrounding texts. In Proceedings of the 13th Annual ACM International Conference on Multimedia. pp. 112–121.

3. Gao B, Liu TY, Zheng X, Cheng QS, Ma WY (2005) Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. pp. 41–50.

4. Gao B, Liu TY, Ma WY (2006) Star-structured high-order heterogeneous data co-clustering based on consistent information theory. In Data Mining, 2006. ICDM '06. Sixth International Conference on. pp. 880–884.

5. Long B, Zhang ZM, Wu X, Yu PS (2006) Spectral clustering for multi-type relational data. In Proceedings of the 23rd International Conference on Machine Learning. pp. 585–592.

6. Sun Y, Yu Y, Han J (2009) RankClus: ranking-based clustering of heterogeneous information networks with star network schema. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 797–806.

7. Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2012) Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1348–1356.

8. Li P, Wen J, Li X (2013) SNTClus: A novel service clustering algorithm based on network analysis and service tags. Przegląd Elektrotechniczny 89.

10. Wang R, Shi C, Yu PS, Wu B (2013) Integrating clustering and ranking on hybrid heterogeneous information network. In Advances in Knowledge Discovery and Data Mining. pp. 583–594.

11. Aggarwal CC, Xie Y, Philip SY (2012) Dynamic link inference in heterogeneous networks. In SDM. pp. 415–426.

12. Zhang L, Chen C, Bu J, Chen Z, Cai D, Han J (2012) Locally discriminative coclustering. IEEE Transactions on Knowledge and Data Engineering 24(6): 1025–1035.

13. Khoa NLD, Chawla S (2011) Large scale spectral clustering using approximate commute time embedding. arXiv preprint arXiv:1111.4541.

14. Koutis I, Miller GL, Tolliver D (2011) Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. Computer Vision and Image Understanding 115(12): 1638–1646.

15. Fouss F, Pirotte A, Renders JM, Saerens M (2007) Random walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering 19(3): 355–369.

16. Qiu H, Hancock ER (2007) Clustering and embedding using commute times. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(11): 1873–1890. PMID: 17848771

17. Spielman DA, Srivastava N (2008) Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08. pp. 563–568.

18. Achlioptas D (2001) Database-friendly random projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '01. pp. 274–281.

19. Spielman DA, Teng SH (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC '04. pp. 81–90.

20. Spielman DA, Teng SH (2014) Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications 35(3): 835–
