FctClus: A Fast Clustering Algorithm for
Heterogeneous Information Networks
Jing Yang1, Limin Chen1,2*, Jianpei Zhang1
1 Institute of Computer Science and Technology, Harbin Engineering University, Harbin, China; 2 Institute of Computer Science and Technology, Mudanjiang Teachers College, Mudanjiang, China
* chenlimin_clm@126.com
Abstract
It is important to cluster heterogeneous information networks. A fast clustering algorithm based on an approximate commute time embedding for heterogeneous information networks with a star network schema is proposed in this paper by utilizing the sparsity of heterogeneous information networks. First, a heterogeneous information network is transformed into multiple compatible bipartite graphs from the compatible point of view. Second, the approximate commute time embedding of each bipartite graph is computed using random mapping and a linear time solver. All of the indicator subsets in each embedding simultaneously indicate the target dataset. Finally, a general model is formulated from these indicator subsets, and a fast algorithm is derived by simultaneously clustering all of the indicator subsets using the sum of the weighted distances of all indicators for an identical target object. Theoretical analysis and experimental verification show that the proposed fast algorithm, FctClus, is efficient and generalizable and exhibits both high clustering accuracy and fast computation speed.
Introduction
Information networks are ubiquitous and include social information networks and DBLP bibliographic networks. Numerous studies on homogeneous information networks, which consist of a single type of data object, have been performed; however, little research has been performed on the clustering of heterogeneous information networks, which consist of multiple types of data objects. Clustering a heterogeneous network may lead to a better understanding of the hidden structures and deeper meanings of the network [1].
The star network schema is popular and important in the field of heterogeneous information networks. The star network schema includes one target type of data object and multiple attribute types of data objects, whereby every link connects a target data object to one of the attribute data objects.
Algorithms based on compatible bipartite graphs can effectively consider multiple types of relational data. Various classical clustering algorithms, such as algorithms based on semi-definite programming [2,3], algorithms based on information theory [4] and spectral clustering algorithms for multi-type relational data [5], have been proposed for heterogeneous data from the compatible point of view. These algorithms are generalizable, but their computational complexity is too great for use in clustering heterogeneous information networks.
Citation: Yang J, Chen L, Zhang J (2015) FctClus: A Fast Clustering Algorithm for Heterogeneous Information Networks. PLoS ONE 10(6): e0130086. doi:10.1371/journal.pone.0130086
Academic Editor: Enrique Hernandez-Lemus, National Institute of Genomic Medicine, MEXICO
Received: December 27, 2014
Accepted: May 15, 2015
Published: June 19, 2015
Copyright: © 2015 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All relevant data are within the paper and its Supporting Information files.
Funding: The authors have no support or funding to report.
Sun et al. present an algorithm, NetClus [6], and a PathSim-based clustering algorithm [7] for clustering heterogeneous information networks. NetClus is effective for DBLP bibliographic networks, but it is not a general model for clustering other heterogeneous information networks, and it is not sufficiently stable. The concept behind NetClus is also used for clustering service webs [8,9]. The PathSim-based clustering algorithm requires user guidance, and the clustering quality reflects the requirements of users rather than the requirements of the network. ComClus [10] is a derivative of NetClus for use with hybrid networks that simultaneously include heterogeneous and homogeneous relations. NetClus and ComClus are not general and depend on the given application.
Dynamic link inference in heterogeneous networks [11] requires accurate initial clustering. High clustering quality is necessary for network analysis, but low computation speed is intolerable because of the large network scales involved. The accuracy of the LDCC algorithm [12] is improved because both the heterogeneous and homogeneous data relations are explored. The CESC algorithm [13] is very effective for clustering homogeneous data using an approximate commute time embedding.
A heterogeneous information network with a star network schema can be transformed into multiple compatible bipartite graphs from the compatible point of view. When the relation between any two nodes of a bipartite graph is represented by the commute time, the relations of both heterogeneous and homogeneous data objects can be explored, and the clustering accuracy can be improved. Heterogeneous information networks are large but very sparse; therefore, the approximate commute time embedding of each bipartite graph can be quickly computed using random mapping and a linear time solver [14]. All of the indicator subsets in each embedding indicate the target dataset, and a general model for clustering heterogeneous information networks is then formulated from all indicator subsets. All weighted distances between the indicators and the cluster centers in the respective indicator subsets are computed, and all indicator subsets can be simultaneously clustered according to the sum of the weighted distances of all indicators for an identical target object. Based on the above discussion, an effective clustering algorithm, FctClus, based on the approximate commute time embedding for heterogeneous information networks, is proposed in this paper. The computation speed and clustering accuracy of FctClus are high.
Methods
Commute Time Embedding of the Bipartite Graph
Given two types of datasets, $X_0 = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_{n_0}^{(0)}\}$ and $X_1 = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_{n_1}^{(1)}\}$, the graph $G_b = \langle V, E \rangle$ is called a bipartite graph if $V(G_b) = X_0 \cup X_1$ and $E(G_b) = \{\langle x_i^{(0)}, x_j^{(1)} \rangle\}$, where $1 \le i \le n_0$ and $1 \le j \le n_1$. $W_{n_0 \times n_1}$ is the relation matrix between $X_0$ and $X_1$, whose element $w_{ij}$ is the edge weight between $x_i^{(0)}$ and $x_j^{(1)}$. The adjacency matrix of the bipartite graph $G_b$ can then be written as

$$\tilde{W}_{n \times n} = \begin{bmatrix} 0 & W_{n_0 \times n_1} \\ (W^T)_{n_1 \times n_0} & 0 \end{bmatrix}$$

$D_1$ and $D_2$ are diagonal matrices, where the diagonal elements of $D_1$ are $d_i = \sum_{j=1}^{n_1} w_{ij}$ and the diagonal elements of $D_2$ are $d_j = \sum_{i=1}^{n_0} w_{ij}$, and

$$D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}$$

thus the Laplacian matrix of the bipartite graph $G_b$ is $L = D - \tilde{W}$. $L$ can be eigen-decomposed into $L = \Phi \Lambda \Phi^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix composed of the eigenvalues of $L$ with $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, and $\Phi = (\phi_1, \phi_2, \ldots, \phi_n)$ is the eigenvector matrix with $\phi_i$ the eigenvector corresponding to the eigenvalue $\lambda_i$. Let $L^+$ be the pseudo-inverse matrix of $L$, where $L^+ = \sum_{i=2}^{n} \frac{1}{\lambda_i} \phi_i \phi_i^T$.
The bipartite graph is also an undirected weighted graph. According to the literature [15], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ can be computed from the pseudo-inverse matrix $L^+$:

$$c_{ij} = g_v (l_{ii}^+ + l_{jj}^+ - 2 l_{ij}^+) = g_v (e_i - e_j)^T L^+ (e_i - e_j) \quad (1)$$

where $l_{ij}^+$ is the $(i,j)$ element of $L^+$, $g_v = \sum w_{ij}$, and $e_i$ is the unit column vector whose $i$-th element is 1; that is, $e_i = [0_1, \ldots, 0_{i-1}, 1_i, 0_{i+1}, \ldots, 0_n]^T$.
According to the literature [15,16], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ satisfies

$$c_{ij} = g_v (e_i - e_j)^T L^+ (e_i - e_j) = g_v (e_i - e_j)^T \Phi \Lambda^{-1} \Phi^T (e_i - e_j) = [\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T (e_i - e_j)]^T [\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T (e_i - e_j)]$$

Thus, the commute time $c_{ij}$ is the squared pairwise Euclidean distance between the row vectors of $(\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T)^T$, or equivalently between the column vectors of $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ [13]. $(\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T)^T$ or $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ is called the commute time embedding of the bipartite graph $G_b$. $c_{ij}$ reflects the average path length between two nodes rather than the shortest path. Using the commute time for clustering noisy data increases robustness and captures complex clusters; therefore, clustering in the commute time embedding can also effectively capture complex clusters. $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ is used in this paper. If the normalized Laplacian matrix $L_n = D^{-1/2} L D^{-1/2}$ is used, the commute time embedding is $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T D^{-1/2}$ [13].
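As a concrete illustration, the exact commute time embedding of a small bipartite graph can be computed from the eigendecomposition above and checked against Eq (1). This is a minimal NumPy sketch, not the paper's implementation; the toy relation matrix W is invented for illustration, and the graph volume is taken here as the sum of all degrees.

```python
import numpy as np

# Toy bipartite graph: n0 = 2 target objects, n1 = 3 attribute objects.
W = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
n0, n1 = W.shape
n = n0 + n1

# Adjacency matrix W_tilde and Laplacian L = D - W_tilde of the bipartite graph.
W_tilde = np.block([[np.zeros((n0, n0)), W],
                    [W.T, np.zeros((n1, n1))]])
L = np.diag(W_tilde.sum(axis=1)) - W_tilde
gv = W_tilde.sum()  # graph volume, taken here as the sum of all degrees

# Eigendecomposition; the toy graph is connected, so only lambda_1 is zero.
lam, Phi = np.linalg.eigh(L)
L_plus = (Phi[:, 1:] / lam[1:]) @ Phi[:, 1:].T  # pseudo-inverse of L

def commute(i, j):
    """Commute time c_ij = gv * (e_i - e_j)^T L^+ (e_i - e_j), as in Eq (1)."""
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0
    return gv * e @ L_plus @ e

# Embedding sqrt(gv) * Lambda^(-1/2) * Phi^T over the nonzero eigenvalues;
# squared distances between its columns equal commute times.
embedding = np.sqrt(gv) * (Phi[:, 1:] / np.sqrt(lam[1:])).T
d2 = np.sum((embedding[:, 0] - embedding[:, 3]) ** 2)
assert abs(commute(0, 3) - d2) < 1e-8
```

The assertion confirms that the commute time between a target node and an attribute node equals the squared Euclidean distance between their embedding columns.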
Approximate Commute Time Embedding of the Bipartite Graph
Directly computing $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ or $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T D^{-1/2}$ requires $O(n^3)$ time for the eigen-decomposition of the Laplacian matrix $L$ or $L_n$, where $n = n_0 + n_1$ is the number of nodes and $s$ is the number of edges in the bipartite graph $G_b$. According to the literature [17], if the edges in $G_b$ are oriented and

$$B(i,j) = \begin{cases} 1 & \text{if node } j \text{ is the head of edge } i \\ -1 & \text{if node } j \text{ is the tail of edge } i \\ 0 & \text{otherwise} \end{cases}$$

then $B_{s \times n}$ is a directed edge-node incidence matrix. With $\hat{W}_{s \times s}$ a diagonal matrix whose entries are the edge weights, $L = B^T \hat{W} B$. Furthermore,

$$\psi = \sqrt{g_v}\, \hat{W}^{1/2} B L^+ \in \mathbb{R}^{s \times n} \quad (2)$$
and the commute time is the squared Euclidean distance between columns $i$ and $j$ of $\psi$ because

$$c_{ij} = g_v (e_i - e_j)^T L^+ (e_i - e_j) = g_v (e_i - e_j)^T L^+ L L^+ (e_i - e_j) = g_v (e_i - e_j)^T L^+ B^T \hat{W} B L^+ (e_i - e_j) = [\sqrt{g_v}\, \hat{W}^{1/2} B L^+ (e_i - e_j)]^T [\sqrt{g_v}\, \hat{W}^{1/2} B L^+ (e_i - e_j)]$$
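The factorization $L = B^T \hat{W} B$ can be verified numerically on a small bipartite graph; the edge orientation is arbitrary. A sketch, with an invented relation matrix:

```python
import numpy as np

# Toy bipartite graph with weighted edges (illustration data).
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
n0, n1 = W.shape
n = n0 + n1

# Enumerate edges as (tail, head, weight): target i <-> attribute n0 + j.
edges = [(i, n0 + j, W[i, j]) for i in range(n0) for j in range(n1) if W[i, j] > 0]
s = len(edges)

# Directed edge-node incidence matrix B (s x n) and diagonal weight matrix W_hat.
B = np.zeros((s, n))
W_hat = np.zeros((s, s))
for k, (tail, head, w) in enumerate(edges):
    B[k, tail] = -1.0
    B[k, head] = 1.0
    W_hat[k, k] = w

# Laplacian built directly from the adjacency matrix, for comparison.
W_tilde = np.block([[np.zeros((n0, n0)), W],
                    [W.T, np.zeros((n1, n1))]])
L = np.diag(W_tilde.sum(axis=1)) - W_tilde

# The factorization holds regardless of how each edge is oriented.
assert np.allclose(L, B.T @ W_hat @ B)
```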
According to the literature [18], given vectors $v_1, \ldots, v_n \in \mathbb{R}^s$ and $\varepsilon > 0$, let $Q_{k_r \times s}$ be a random matrix whose entries are $Q(i,j) = \pm 1/\sqrt{k_r}$ with equal probability, where $k_r = O(\log n / \varepsilon^2)$. Then, with probability at least $1 - 1/n$,

$$(1 - \varepsilon) \|v_i - v_j\|^2 \le \|Q v_i - Q v_j\|^2 \le (1 + \varepsilon) \|v_i - v_j\|^2 \quad (3)$$

for all pairs. Therefore, given the bipartite graph $G_b$ with $n$ nodes and $s$ edges, $\varepsilon > 0$, and the matrix $Y_{k_r \times n} = \sqrt{g_v}\, Q \hat{W}^{1/2} B L^+$, with probability at least $1 - 1/n$:

$$(1 - \varepsilon) c_{ij} \le \|Y(e_i - e_j)\|^2 \le (1 + \varepsilon) c_{ij} \quad (4)$$

for any nodes $i, j \in G_b$, where $k_r = O(\log n / \varepsilon^2)$.
The proof of Eq (4) follows directly from Eq (2) and Eq (3); thus $c_{ij} \approx \|Y(e_i - e_j)\|^2$ with an error $\varepsilon$ based on Eq (4). Directly computing $Y_{k_r \times n} = \sqrt{g_v}\, Q \hat{W}^{1/2} B L^+$ requires $L^+$ to be computed first, but the computational complexity of directly computing $L^+$ is excessive. However, using the method in the literature [19,20] to compute $Y_{k_r \times n}$, the complexity is decreased. Let $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$; then $Y = \theta L^+$, which is equivalent to $YL = \theta$. First, $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$ is computed; then each row $y_i$ of $Y$ is computed by solving the system $y_i L = \theta_i$, where $\theta_i$ is the $i$-th row of $\theta$. The linear time solver of Spielman and Teng [19,20] requires only $\tilde{O}(s)$ time to solve each system. Because $\|y_i - \hat{y}_i\|_L \le \varepsilon \|y_i\|_L$ [17], where $\hat{y}_i$ is the solution of $y_i L = \theta_i$ obtained using the linear time solver, it follows that [17]

$$(1 - \varepsilon)^2 c_{ij} \le \|\hat{Y}(e_i - e_j)\|^2 \le (1 + \varepsilon)^2 c_{ij}$$

Therefore, $c_{ij} \approx \|\hat{Y}(e_i - e_j)\|^2$ with an error bound of $\varepsilon^2$. The algorithm for computing the approximate commute time embedding of the bipartite graph is given as follows.
Algorithm 1 ApCte (Approximate Commute Time Embedding of the Bipartite Graph)
1. Input the relation matrix $W_{n_0 \times n_1}$;
2. Compute the matrices $B$, $\hat{W}$ and $L$ from $W_{n_0 \times n_1}$;
3. Compute $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$;
4. Compute each $\hat{y}_i$ from the system $y_i L = \theta_i$ by calling the Spielman-Teng solver $k_r$ times [14], $1 \le i \le k_r$;
5. Output the approximate commute time embedding $\hat{Y}$.
All data objects of $X_0$ and $X_1$ are mapped into a common subspace $\hat{Y}$, where the first $n_0$ column vectors of $\hat{Y}$ indicate $X_0$ and the last $n_1$ column vectors indicate $X_1$. The dataset composed of the $n = n_0 + n_1$ column vectors of $\hat{Y}$ is called an indicator dataset. Computing the matrices $B$, $\hat{W}$ and $L$ in step 2 takes $O(2s) + O(s) + O(n)$ time: the sparse matrix $B$ has $2s$ nonzero elements, and the diagonal matrix $\hat{W}$ has $s$ nonzero elements. Computing $\theta = \sqrt{g_v}\, Q(\hat{W}^{1/2} B)$ takes $O(2 s k_r + s)$ time in step 3. Because the linear time solver of Spielman and Teng [19,20] requires only $\tilde{O}(s)$ time to solve for each $y_i$ of the system $y_i L = \theta_i$, constructing $\hat{Y}$ takes $\tilde{O}(s k_r)$ time in step 4. Therefore, the complexity of Algorithm 1, ApCte, is only $O(2s) + O(s) + O(n) + O(2 s k_r + s) + \tilde{O}(s k_r) = \tilde{O}(4s + n + 3 s k_r)$. In practice, $k_r = O(\log n / \varepsilon^2)$ is small and does not vary much between different datasets. The indicator dataset consists of low-dimensional homogeneous data; therefore, traditional algorithms can be applied to it.
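The steps of ApCte can be sketched compactly in Python. This is not the paper's MATLAB/CMG implementation: a dense least-squares solve stands in for the Spielman-Teng/CMG nearly-linear-time solver (which is what makes the real algorithm fast on large sparse graphs), and the relation matrix and k_r value are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def apcte(W, kr=64):
    """Sketch of Algorithm 1 (ApCte); least-squares replaces the fast solver."""
    n0, n1 = W.shape
    n = n0 + n1
    edges = [(i, n0 + j, W[i, j]) for i in range(n0) for j in range(n1) if W[i, j] > 0]
    s = len(edges)
    B = np.zeros((s, n))          # edge-node incidence matrix
    w = np.empty(s)               # edge weights (diagonal of W_hat)
    for k, (tail, head, wk) in enumerate(edges):
        B[k, tail], B[k, head], w[k] = -1.0, 1.0, wk
    L = B.T @ (w[:, None] * B)    # Laplacian L = B^T W_hat B
    gv = 2.0 * w.sum()            # graph volume (sum of all degrees)
    Q = rng.choice([-1.0, 1.0], size=(kr, s)) / np.sqrt(kr)   # random projection
    theta = np.sqrt(gv) * Q @ (np.sqrt(w)[:, None] * B)       # theta = sqrt(gv) Q W_hat^(1/2) B
    # Solve y_i L = theta_i; since L is symmetric this is L y_i^T = theta_i^T.
    Y = np.vstack([np.linalg.lstsq(L, theta[i], rcond=None)[0] for i in range(kr)])
    return Y                      # kr x n approximate embedding

W = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
Y = apcte(W, kr=200)
# Approximate commute time between nodes 0 and 1:
c01 = np.sum((Y[:, 0] - Y[:, 1]) ** 2)
```

With k_r large relative to this tiny graph, c01 lands close to the exact commute time, in line with Eq (4).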
A General Model Formulation
Given a dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types, where $X_t$ is the dataset belonging to the $t$-th type, a weighted graph $G = \langle V, E, W \rangle$ on $\chi$ is called an information network if $V(G) = \chi$, $E(G)$ is a binary relation on $V$, and $W: E \to \mathbb{R}^+$. Such an information network is called a heterogeneous information network when $T \ge 1$ and a homogeneous information network when $T = 0$ [6].
An information network $G = \langle V, E, W \rangle$ on $\chi$ is called a heterogeneous information network with a star network schema if $\forall e = \langle x_i, x_j \rangle \in E$, $x_i \in X_0$ and $x_j \in X_t$ ($t \ne 0$). $X_0$ is the target dataset, and each $X_t$ ($t \ne 0$) is an attribute dataset.
To derive a general model for clustering the target dataset, consider a heterogeneous information network with a star network schema on the dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types, where $X_0$ is the target dataset and $\{X_t\}_{t=1}^{T}$ are the attribute datasets. $X_t = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_{n_t}^{(t)}\}$, where $n_t$ is the number of objects in $X_t$. $W^{(0t)} \in \mathbb{R}^{n_0 \times n_t}$ denotes the relation matrix between the target dataset $X_0$ and the attribute dataset $X_t$, where the element $w_{ij}^{(0t)}$ denotes the relation between $x_i^{(0)}$ of $X_0$ and $x_j^{(t)}$ of $X_t$. If an edge between $x_i^{(0)}$ and $x_j^{(t)}$ exists, its edge weight is $w_{ij}^{(0t)}$; if no edge exists, $w_{ij}^{(0t)} = 0$. $T$ relation matrices $\{W^{(0t)}\}_{t=1}^{T}$ exist in the heterogeneous information network with a star network schema.
The target dataset $X_0$ and the attribute dataset $X_t$ constitute a bipartite graph $G^{(0t)}$, which corresponds to the relation matrix $W^{(0t)}$. The indicator dataset $Y^{(0t)} = \{y_1^{(0t)}, y_2^{(0t)}, \ldots, y_{n_0+n_t}^{(0t)}\}$, which is also the approximate commute time embedding of $G^{(0t)}$, can be quickly computed by ApCte, where the first $n_0$ data of $Y^{(0t)}$ indicate $X_0$ and the last $n_t$ data indicate the attribute dataset $X_t$. $Y_t^{(0)}$ consists of the first $n_0$ data of $Y^{(0t)}$, and $Y^{(t)}$ consists of the last $n_t$ data of $Y^{(0t)}$. $Y_t^{(0)}$ and $Y^{(t)}$ are called indicator subsets. $y_i^{(t)} \in Y_t^{(0)}$ indicates the $i$-th object of $X_0$ and is called an indicator, $1 \le i \le n_0$. There exists a one-to-one correspondence between the indicators of $Y_t^{(0)}$ and the objects of $X_0$. Because the $T$ bipartite graphs correspond to $T$ indicator datasets, the target dataset $X_0$ is simultaneously indicated by the $T$ indicator subsets $\{Y_t^{(0)}\}_{t=1}^{T}$, and each object of $X_0$ is simultaneously indicated by $T$ indicators.
$\beta^{(t)}$ is the weight of the relation matrix $W^{(0t)}$, where $\sum_{t=1}^{T} \beta^{(t)} = 1$ and $\beta^{(t)} > 0$. The target dataset $X_0$ is partitioned into $K$ clusters. The indicators of $\{Y_t^{(0)}\}_{t=1}^{T}$ that indicate an identical object of $X_0$ belong to $T$ clusters; these $T$ clusters lie in $T$ different indicator subsets and are denoted by the same label. Let

$$F = \sum_{t=1}^{T} \beta^{(t)} \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \| y_i^{(t)} - \omega_j^{(t)} \|^2 \quad (5)$$

where $\omega_j^{(t)}$ is the $j$-th cluster center of the indicator subset $Y_t^{(0)}$. There exists a one-to-one correspondence between the indicator function $\gamma = \{\gamma_{ij}\}_{i=1}^{n_0}$ and the objects of $X_0$: if all indicators $\{y_i^{(t)}\}_{t=1}^{T}$ that indicate the $i$-th object of $X_0$ belong to the $j$-th cluster, then $\gamma_{ij} = 1$; otherwise, $\gamma_{ij} = 0$.
If the objective function $F$ in Eq (5) is minimized, the clusters of $X_0$ are optimal from the compatible point of view, because each indicator subset reflects the relation between the target dataset and one attribute dataset. Obviously, determining the global minimum of Eq (5) is NP-hard.
Derivation of a Fast Algorithm for Clustering Heterogeneous Information Networks
The following steps allow a local minimum of $F$ in Eq (5) to be quickly attained by simultaneously clustering all of the indicator subsets.
Setting the Cluster Label
When the cluster labels of each indicator subset are given, the modeling process can be simplified. Suppose that the labels of the $K$ clusters of each $Y_t^{(0)}$ are set. Let $q_1, q_2 \in X_0$ and $y_1^{(1)}, y_2^{(1)} \in Y_1^{(0)}$, ..., $y_1^{(T)}, y_2^{(T)} \in Y_T^{(0)}$, where $\{y_1^{(t)}\}_{t=1}^{T}$ indicate $q_1$ and $\{y_2^{(t)}\}_{t=1}^{T}$ indicate $q_2$. The clusters to which the indicators for an identical target object belong have the same label. If one indicator of $\{y_1^{(t)}\}_{t=1}^{T}$ belongs to the $j$-th cluster, all of the other indicators of $\{y_1^{(t)}\}_{t=1}^{T}$ also belong to the $j$-th cluster in their respective indicator subsets. If $\{y_1^{(t)}\}_{t=1}^{T}$ belong to the $j$-th cluster, then the indicators $\{y_2^{(t)}\}_{t=1}^{T}$ either all belong to the $j$-th cluster in their respective indicator subsets or none of them do.
Each cluster of $Y_t^{(0)}$ has an initial center. $K$ random objects are selected from the target dataset $X_0$; the indicators indicating these $K$ objects are taken as the initial cluster centers of each $Y_t^{(0)}$, and the clusters whose centers indicate an identical target object are given the same label. Then, all of the other indicators for an identical target object either belong to the $j$-th cluster in each $Y_t^{(0)}$ or none of them belong to the $j$-th cluster, where $1 \le j \le K$. Therefore, the labels of the $K$ clusters of $\{Y_t^{(0)}\}_{t=1}^{T}$ are set.
The Sum of the Weighted Distances
An object of $X_0$ is indicated by $T$ indicators, and all $T$ distances between these indicators and the centers in each $Y_t^{(0)}$ affect the object's allocation. The allocation of a target object is determined by the sum of the weighted distances of its $T$ indicators. Let $q_i \in X_0$ and $y_i^{(1)} \in Y_1^{(0)}$, ..., $y_i^{(T)} \in Y_T^{(0)}$, where $\{y_i^{(t)}\}_{t=1}^{T}$ indicate $q_i$. The weighted distance between $y_i^{(t)}$ and the $j$-th cluster center in $Y_t^{(0)}$ is $\beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2$. The sum of the weighted distances is $dis = \sum_{t=1}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2$, which determines the cluster to which the object $q_i$ belongs:

$$j = \arg\min_j \sum_{t=1}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2 \quad (6)$$
The Local Minimum of F
$F$ in Eq (5) can also be expressed as

$$F = \sum_{t=1}^{T} \beta^{(t)} \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \| y_i^{(t)} - \omega_j^{(t)} \|^2 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=1}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2 \quad (7)$$

Obviously, Eq (7) is another representation of Eq (5).
Given the initial centers $\{\omega_j^{(t)}\}_{j=1}^{K}$ and the cluster labels in the $T$ indicator subsets $\{Y_t^{(0)}\}_{t=1}^{T}$, $\{Y_t^{(0)}\}_{t=1}^{T}$ are first partitioned by computing Eq (6), and $F = F_0$ is set in Eq (7). The cluster centers of $\{Y_t^{(0)}\}_{t=2}^{T}$ remain the same, and $\gamma_{ij}$ is unchanged. The new centers $\{\hat{\omega}_j^{(1)}\}_{j=1}^{K}$ of the clusters in $Y_1^{(0)}$ are computed; each new center is the mean of all data in its cluster. The new centers $\{\hat{\omega}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ replace the old centers, and Eq (7) is then used to set $F = F_1$. Then,

$$F_1 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \left( \beta^{(1)} \| y_i^{(1)} - \hat{\omega}_j^{(1)} \|^2 + \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2 \right) \le F_0 \quad (8)$$
Proof:

$$F_1 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \left( \beta^{(1)} \| y_i^{(1)} - \hat{\omega}_j^{(1)} \|^2 + \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2 \right)$$

Because only the new centers $\{\hat{\omega}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ replace the old centers, $\gamma_{ij}$ remains unchanged. Therefore,

$$F_1 = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - \hat{\omega}_j^{(1)} \|^2 + \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2$$

Because the cluster centers of $\{Y_t^{(0)}\}_{t=2}^{T}$ also remain unchanged, $\sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2$ is constant, and, because each new center is the mean of its cluster, $\sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - \hat{\omega}_j^{(1)} \|^2 \le \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - \omega_j^{(1)} \|^2$. Subsequently,

$$F_1 \le \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \beta^{(1)} \| y_i^{(1)} - \omega_j^{(1)} \|^2 + \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij} \sum_{t=2}^{T} \beta^{(t)} \| y_i^{(t)} - \omega_j^{(t)} \|^2 = F_0$$

Thus, replacing the cluster centers of $Y_1^{(0)}$ gives $F_1 \le F_0$.
After the new centers $\{\hat{\omega}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$ replace the old centers, while the centers of $\{Y_t^{(0)}\}_{t=2}^{T}$ remain unchanged, re-clustering $\{Y_t^{(0)}\}_{t=1}^{T}$ using Eq (6), with the corresponding value $F = F_2$ in Eq (7), gives $F_2 \le F_1$.
Partitioning $\{Y_t^{(0)}\}_{t=1}^{T}$ using Eq (6), computing the new cluster centers $\{\hat{\omega}_j^{(1)}\}_{j=1}^{K}$ of $Y_1^{(0)}$, and replacing the old centers $\{\omega_j^{(1)}\}_{j=1}^{K}$ is then repeated in the same way for each of $\{Y_t^{(0)}\}_{t=2}^{T}$; the value of $F$ decreases in each case. The above procedures are repeated until $F$ in Eq (7) converges.
Algorithm 2 FctClus (Fast Clustering Algorithm Based on the Approximate Commute Time Embedding for Heterogeneous Information Networks)
1. Input the relation matrices $\{W^{(0t)} \in \mathbb{R}^{n_0 \times n_t}\}_{t=1}^{T}$, the weights $\{\beta^{(t)} > 0\}_{t=1}^{T}$ and the cluster number $K$;
2. for $t = 1$ to $T$ do
3.   Compute the indicator dataset $Y^{(0t)}$ of the bipartite graph corresponding to $W^{(0t)}$ using Algorithm 1;
4.   Constitute the indicator subset $Y_t^{(0)}$ that indicates $X_0$;
5. end for
6. Initialize the $K$ initial cluster centers $\{\omega_j^{(t)}\}_{j=1}^{K}$ of $\{Y_t^{(0)}\}_{t=1}^{T}$ and set the cluster labels;
7. loop
8.   for $t = 1$ to $T$ do
9.     Partition $\{Y_t^{(0)}\}_{t=1}^{T}$ into $K$ clusters by computing Eq (6);
10.    Re-compute the new cluster centers $\{\hat{\omega}_j^{(t)}\}_{j=1}^{K}$ of $Y_t^{(0)}$;
11.    $\{\omega_j^{(t)} = \hat{\omega}_j^{(t)}\}_{j=1}^{K}$;
12.  end for
13. end loop
14. Output the clusters of $X_0$.
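The clustering loop of Algorithm 2 (steps 6-13) can be sketched in Python on synthetic data. This is a simplified illustration, not the authors' MATLAB code: the indicator subsets are generated randomly rather than by ApCte, and all names and values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def fctclus_loop(indicator_subsets, betas, K, iters=40, seeds=None):
    """Sketch of steps 6-13 of Algorithm 2. Each element of indicator_subsets
    is a (kr x n0) matrix Y_t^(0) whose columns indicate the n0 target
    objects; betas are the relation-matrix weights (summing to 1)."""
    T = len(indicator_subsets)
    n0 = indicator_subsets[0].shape[1]
    # Step 6: pick K random target objects; their indicators initialize the
    # K cluster centers of every indicator subset, with matching labels.
    if seeds is None:
        seeds = rng.choice(n0, size=K, replace=False)
    centers = [Y[:, seeds].copy() for Y in indicator_subsets]  # each kr x K
    labels = np.zeros(n0, dtype=int)
    for _ in range(iters):
        # Eq (6): assign each target object by the sum of the weighted
        # distances of its T indicators to the corresponding centers.
        dist = np.zeros((n0, K))
        for t in range(T):
            Y, O = indicator_subsets[t], centers[t]
            dist += betas[t] * ((Y[:, :, None] - O[:, None, :]) ** 2).sum(axis=0)
        labels = dist.argmin(axis=1)
        # Steps 10-11: each new center is the mean of its cluster's indicators.
        for t in range(T):
            for j in range(K):
                if np.any(labels == j):
                    centers[t][:, j] = indicator_subsets[t][:, labels == j].mean(axis=1)
    return labels

# Two well-separated groups of n0 = 20 target objects, seen through T = 2
# synthetic indicator subsets (illustration data, not real embeddings).
half = 10
Y1 = np.hstack([rng.normal(0, 0.1, (4, half)), rng.normal(5, 0.1, (4, half))])
Y2 = np.hstack([rng.normal(0, 0.1, (4, half)), rng.normal(5, 0.1, (4, half))])
labels = fctclus_loop([Y1, Y2], betas=[0.5, 0.5], K=2)
```

All indicator subsets are clustered simultaneously: the assignment in Eq (6) is shared across the T subsets, while each subset keeps its own centers.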
The computational complexity of steps 2~5 of Algorithm 2 is $\tilde{O}(\sum_{t=1}^{T} (4 s_t + n_t + 3 s_t k_r))$, where $T$ is the number of relation matrices in the heterogeneous information network, $k_r$ is the data dimension of $Y_t^{(0)}$, and $n_t$ and $s_t$ are the node number and edge number of the $t$-th bipartite graph, respectively. Step 6 requires only constant $O(K)$ time. The object number of $X_0$ equals the indicator number of each indicator subset; thus, the computational complexity of steps 7~13 is $O(u T K k_r n_0)$, where $K$ is the number of clusters of each $Y_t^{(0)}$, $n_0$ is the data number of each $Y_t^{(0)}$, and $u$ is the number of iterations until $F$ in Eq (7) converges. Therefore, the computational complexity of Algorithm 2, FctClus, is $\tilde{O}(\sum_{t=1}^{T} (4 s_t + n_t + 3 s_t k_r)) + O(u T K k_r n_0)$, where $k_r$ and $u$ are small and $T$ and $K$ are constant.
Experiments
The Experimental Dataset
The experimental datasets are composed of real data selected from the DBLP data. The DBLP is a typical heterogeneous information network in the computer science domain and contains 4 types of objects: papers, authors, terms and venues. Two different-scale heterogeneous datasets, called Ssmall and Slarge respectively, are used in the experiments.
Ssmall is the small test dataset and is called the "four-area dataset", as in the literature [6]. Ssmall, extracted from the DBLP dataset downloaded in 2011, contains four areas related to data mining; representative conferences for each area are chosen, and all of their papers and the terms that appear in the titles are included. Ssmall is shown in S1 File.
Slarge is the large test dataset and is extracted from the Chinese DBLP dataset, a shared resource released by the Institute of Automation, Chinese Academy of Sciences. Slarge includes 34 computer science journals, 16,567 papers, 47,701 authors and 52,262 terms (keywords). Slarge is shown in S2 File.
When papers are analyzed, this object type is the target dataset, and the other object types are the attribute datasets; there is no direct link between papers because the DBLP provides very limited citation information. When authors are analyzed, this object type is the target dataset, while papers and venues are the attribute datasets. However, direct links exist between authors because of the co-author relation; therefore, authors also form an attribute dataset related to the target dataset.
The experiments are performed in the MATLAB 7.0 programming environment. The MATLAB source code for our algorithm is shown in S3 File and is available online at https://github.com/lsy917/chenlimin; it includes a main program and three function programs. FctClus.m is the main program, which outputs the clusters of the target dataset; ApCte.m, Prematrix.m and Net_Branches.m are function programs. The Koutis CMG solver [14] is used in all experiments as the nearly linear time solver to create the embedding. The solver, which works on symmetric, diagonally dominant matrices, is available online at http://www.cs.cmu.edu/~jkoutis/cmg.html.
The Relational Matrix
Papers are the target dataset, while authors, venues and terms are the attribute datasets.X0 denotes papers, andX1,X2andX3denote authors, venues and terms, respectively.W(0t)is the relation matrix betweenX0andXt, 1t3. The element offWð0tÞg
3 t¼1is
wð0tÞij ¼
1 if i2X0;j2X1[X2;nodeilinks toj; p ifi2X0;j2X3; nodeiappearsptimes in nodej;
0 otherwise;
8 > > <
> > :
When authors are the target dataset, papers and venues are the attribute datasets; authors are also an attribute dataset because of the co-author relation existing between authors. $X_0$ denotes authors, while $X_1$ and $X_2$ denote papers and venues, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $0 \le t \le 2$. The elements of $\{W^{(0t)}\}_{t=0}^{2}$ are

$$w_{ij}^{(0t)} = \begin{cases} 1 & \text{if } i \in X_0,\ j \in X_1 \cup X_2 \text{ and node } i \text{ links to node } j \\ p & \text{if } i \in X_0,\ j \in X_0 \text{ and nodes } i \text{ and } j \text{ co-author } p \text{ papers} \\ 0 & \text{otherwise} \end{cases}$$
All the algorithms use the same relation matrix for all experiments.
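To make the weighting scheme concrete, the following sketch builds the three paper-centred relation matrices for a hypothetical toy network; all names and counts are invented for illustration, while the real matrices are built from the DBLP files.

```python
import numpy as np

# Hypothetical toy network: 3 papers (target), 2 authors, 2 venues, 3 terms.
papers = ["p1", "p2", "p3"]
authors = ["a1", "a2"]
venues = ["v1", "v2"]
terms = ["graph", "cluster", "network"]

paper_authors = {"p1": ["a1"], "p2": ["a1", "a2"], "p3": ["a2"]}
paper_venue = {"p1": ["v1"], "p2": ["v1"], "p3": ["v2"]}
# Title term counts: term j appearing p times in paper i gives weight p.
paper_terms = {"p1": {"graph": 2, "cluster": 1},
               "p2": {"cluster": 1},
               "p3": {"graph": 1, "network": 1}}

def link_matrix(rows, cols, links):
    """0/1 relation matrix: w_ij = 1 iff target i links to attribute j."""
    M = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for c in links[r]:
            M[i, cols.index(c)] = 1.0
    return M

W01 = link_matrix(papers, authors, paper_authors)  # paper-author, 0/1 weights
W02 = link_matrix(papers, venues, paper_venue)     # paper-venue, 0/1 weights
W03 = np.zeros((len(papers), len(terms)))          # paper-term, count weights
for i, p in enumerate(papers):
    for t, cnt in paper_terms[p].items():
        W03[i, terms.index(t)] = cnt
```

Each of the three matrices then defines one bipartite graph, whose embedding is computed independently by ApCte.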
Parameter Analysis
Analysis of Parameter kr. The equation [13]

$$Accuracy = \frac{1}{n} \sum_{i=1}^{n} \delta(i)$$

is used to compute the clustering accuracy in the experiments, where $n$ is the number of objects in the dataset, $label(i)$ is the true cluster label of object $i$, $c_i$ is its predicted label, and $map(\cdot)$ maps each predicted label to a cluster label. $\delta(\cdot)$ is an indicator function:

$$\delta(i) = \begin{cases} 1 & map(i) = label(i) \\ 0 & map(i) \ne label(i) \end{cases}$$
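The map(·) step selects the assignment of predicted cluster ids to true labels that maximizes the accuracy. For a small number of clusters this can be done by trying every permutation; a sketch (the function name and data are illustrative):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(pred, true, K):
    """Accuracy = (1/n) * sum_i delta(map(c_i), label(i)): try every mapping
    of predicted cluster ids onto true labels and keep the best one.
    Brute force over permutations is fine for small K."""
    best = 0.0
    for perm in permutations(range(K)):
        mapped = np.array([perm[c] for c in pred])
        best = max(best, np.mean(mapped == np.asarray(true)))
    return best

pred = [0, 0, 1, 1, 1]   # predicted cluster ids
true = [1, 1, 0, 0, 1]   # true labels
acc = clustering_accuracy(pred, true, K=2)  # 0.8: best map sends 0->1, 1->0
```

For larger K, the same best mapping is usually found with the Hungarian algorithm on the confusion matrix instead of brute-force enumeration.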
kr is small in practice, and minimal differences exist among various datasets [13]. The literature [13] has shown that the accuracy curve is flat when clustering different homogeneous datasets for kr ≥ 50.
Using the small dataset Ssmall, the clustering accuracy as a function of kr in a heterogeneous information network is studied. An experiment with different kr is conducted on the small dataset Ssmall. In the FctClus algorithm, the weights of {W(0t)}, t = 1..3, are taken as β(1) = 0.3, β(2) = 0.4 and β(3) = 0.3 for clustering papers, and the weights of {W(0t)}, t = 0..2, are taken as β(1) = 0.4, β(2) = 0.2 and β(3) = 0.4 for clustering authors. The influence of kr on the clustering accuracy is shown in Fig 1 and Fig 2.
The parameter kr can be kept quite small because the accuracy curve flattens once kr reaches a certain value; kr = 60 is suitable for the datasets in the experiment. kr is small and does not considerably affect the computation speed of FctClus. It is advantageous that FctClus is not sensitive to kr in terms of both accuracy and performance. The same weights of the relation matrices and kr = 60 are used in the other experiments.
Analysis of Iteration u
An experiment is conducted on the small dataset Ssmall to examine the influence of the iteration number u on the clustering result, where kr = 60. The influence of u on clustering papers and authors is shown in Fig 3 and Fig 4. The algorithm converges quickly, by u = 30; u = 40 is used in the other experiments.
Fig 1. The influence of kr for clustering papers on Ssmall.
Comparison of Clustering Accuracy and Computation Speed
The complexity of the algorithms based on semi-definite programming [2,3] and of the spectral clustering algorithms for multi-type relational data [5] is too high for large-scale networks. The low-complexity algorithms CIT [4], NetClus [6] and ComClus [10] are therefore selected for comparison with the FctClus algorithm in terms of clustering accuracy and computation speed; the datasets Ssmall and Slarge are used in this experiment.
The initial cluster centers of FctClus and the initial cluster partitions of the other three algorithms are randomly selected 3 times. The best clustering accuracy of the 3 runs is used as the clustering accuracy of each of the four algorithms, and the computation speed of that run is taken as the measured computation speed. The parameters from the literature [6] are used for NetClus, and the parameters from the literature [10] are used for ComClus. The comparison results are shown in Table 1 and Table 2.
The clustering accuracy of FctClus is the highest of all four algorithms. The clustering accuracy of CIT is lower than that of FctClus because the bipartite graphs of the heterogeneous information networks are sparse: the computational complexity of CIT is O(n2), and the convergence speed of CIT is low when the heterogeneous information network is sparse. The clustering accuracy of NetClus is low because only heterogeneous relations are used.

Fig 2. The influence of kr for clustering authors on Ssmall.
doi:10.1371/journal.pone.0130086.g002

Fig 3. The influence of u for clustering papers on Ssmall.
Homogeneous and heterogeneous relations are both used in ComClus; therefore, the accuracy of ComClus is higher than that of NetClus. FctClus is an algorithm based on the commute time embedding: the data relations are explored using the commute time, and the direct relations of the target dataset are considered. FctClus is not affected by the sparsity of the networks; thus, FctClus is highly accurate.
The computation speed of FctClus is nearly as fast as that of NetClus. The experiment demonstrates that FctClus is effective. FctClus is more universal and can be adapted to cluster any heterogeneous information network with a star network schema. However, NetClus and ComClus can only be adapted for clustering bibliographic networks because they depend on a ranking function of a specific application field.
Comparison of Clustering Stability
To compare the stability of the FctClus, NetClus and CIT algorithms, the small dataset Ssmall is used for clustering papers in this experiment. ComClus is a derivative of NetClus and has the same properties as NetClus; it is therefore not considered in this study.
The initial cluster centers of FctClus and the initial cluster partitions of NetClus and CIT are randomly selected 10 times, and the three algorithms are each executed 10 times. The clustering accuracy of the three algorithms over the 10 runs is shown in Fig 5. Although the computation speeds of FctClus and NetClus are both high, Fig 5 shows that the stability of FctClus is higher than that of NetClus and that the initial centers do not greatly impact the clustering result of FctClus. However, NetClus is very unstable, and the initial clusters greatly impact its clustering accuracy and convergence speed. CIT is more stable than NetClus, but its clustering accuracy is low.

Fig 4. The influence of u for clustering authors on Ssmall.
doi:10.1371/journal.pone.0130086.g004

Table 1. Comparison of clustering accuracy (%).

target object & dataset    CIT      NetClus    ComClus    FctClus
Papers on Ssmall           73.91    71.54      72.83      78.87
Authors on Ssmall          74.41    69.13      74.91      81.33
Papers on Slarge           70.84    71.28      72.93      76.36
Authors on Slarge          71.02    68.29      73.01      77.94
Running Time Analysis of the FctClus Algorithm
The running time distributions of FctClus on the two datasets are shown inTable 3. The exper-imental data show that FctClus is effective. The running time for serial computing the three embedding is less than 50% of the total running time. When utilizing parallel computing for the three embedding, the computation speed is higher. When clustering indicator subsets in parallel, the computation speed may also be increased.
Table 2. Comparison of computation speed (s).

target object & dataset    CIT       NetClus    ComClus    FctClus
Papers on Ssmall           78.5      37.3       40.3       37.1
Authors on Ssmall          79.8      36.9       39.8       38.3
Papers on Slarge           1469.3    802.6      827.3      808.4
Authors on Slarge          1484.7    743.7      781.4      774.9
doi:10.1371/journal.pone.0130086.t002
Fig 5. A stability comparison of the 3 algorithms over 10 runs.
doi:10.1371/journal.pone.0130086.g005

Table 3. Distribution of running time for FctClus.

target object & dataset    Embedding time (s)    Clustering time (s)    Total time (s)
Papers on Ssmall           19.6                  17.5                   37.1
Authors on Ssmall          18.1                  20.2                   38.3
Papers on Slarge           398.8                 409.6                  808.4
Authors on Slarge          382.4                 392.5                  774.9
Conclusions
The relation between the original data described by the commute time guarantees the accuracy and performance of the FctClus algorithm. Because heterogeneous information networks are sparse, FctClus can use random mapping and a linear time solver [14] to compute the approximate commute time embedding, which guarantees its high computation speed. FctClus is effective and may be broadly applied to large heterogeneous information networks, as demonstrated theoretically and experimentally. The weights of the relation matrices impact the target function, but they cannot yet be determined self-adaptively; this requires further research. Data relations in the real world are typically high-order and heterogeneous, so effective clustering algorithms for heterogeneous information networks with arbitrary schemas will be studied in the future.
Supporting Information
S1 File.Ssmalldataset.
(TXT)
S2 File.Slargedataset.
(TXT)
S3 File. The MATLAB source code for the algorithm.
(TXT)
Author Contributions
Conceived and designed the experiments: JY LMC. Performed the experiments: LMC. Analyzed the data: JPZ. Contributed reagents/materials/analysis tools: JY LMC. Wrote the paper: LMC.
References
1. Sun Y, Han J (2012) Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3 (2). pp.1–159.
2. Gao B, Liu TY, Qin T, Zheng X, Cheng QS, Ma WY (2005) Web image clustering by consistent utilization of visual features and surrounding texts. In Proceedings of the 13th annual ACM international conference on Multimedia. pp.112-121.
3. Gao B, Liu TY, Zheng X, Cheng QS, Ma WY (2005) Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. pp.41-50.
4. Gao B, Liu TY, Ma WY (2006) Star-structured high-order heterogeneous data co-clustering based on consistent information theory. In Data Mining, 2006. ICDM '06. Sixth International Conference on. pp.880-884.
5. Long B, Zhang ZM, Wu X, Yu PS (2006) Spectral clustering for multi-type relational data. In Proceedings of the 23rd international conference on Machine learning. pp.585-592.
6. Sun Y, Yu Y, Han J (2009) Rankclus: ranking-based clustering of heterogeneous information networks with star network schema. In Proceedings of the 15th ACM SIGKDD international conference on Knowl-edge discovery and data mining. pp.797-806.
7. Sun Y, Norick B, Han J, Yan X, Yu PS, Yu, X (2012) Integrating meta-path selection with user-guided object clustering in heterogeneous information networks, In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. pp.1348-1356.
8. Li P, Wen J, Li X (2013) SNTClus: A novel service clustering algorithm based on network analysis and service tags. Przegląd Elektrotechniczny.pp.89.
10. Wang R, Shi C, Yu PS, Wu B (2013) Integrating clustering and ranking on hybrid heterogeneous infor-mation network. In Advances in Knowledge Discovery and Data Mining.pp.583-594.
11. Aggarwal CC, Xie Y, Philip SY (2012) Dynamic link inference in heterogeneous networks. In SDM. pp.415-426.
12. Zhang L, Chen C, Bu J, Chen Z, Cai D, Han J (2012) Locally discriminative coclustering. Knowledge and Data Engineering, IEEE Transactions on, 24 (6). pp.1025–1035.
13. Khoa NLD, Chawla S (2011) Large scale spectral clustering using approximate commute time embed-ding. arXiv preprint arXiv:1111.4541.
14. Koutis I, Miller GL, Tolliver D (2011) Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. Computer Vision and Image Understanding, 115 (12). pp.1638–1646.
15. Fouss F, Pirotte A, Renders JM, Saerens M (2007) Random walk computation of similarities between nodes of a graph with application to collaborative recommendation. Knowledge and Data Engineering, IEEE Transactions on, 19 (3). pp.355–369.
16. Qiu H, Hancock ER (2007) Clustering and embedding using commute times. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29 (11). pp.1873–1890. PMID:17848771
17. Spielman DA, Srivastava N (2008) Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08.pp.563-568.
18. Achlioptas D (2001) Database-friendly random projections, in Proceedings of the twentieth ACM SIG-MOD SIGACT SIGART symposium on Principles of database systems, PODS '01.pp.274-281.
19. Spielman DA, Teng SH (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, STOC '04.pp.81-90.
20. Spielman DA, Teng SH (2014) Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35 (3).pp.835–