This section synthesizes the classification results. Table 3.93 presents the best classification accuracy obtained by each classification algorithm used in this technical report. The highest accuracy for each collection is highlighted in bold.
Considering the results in Table 3.93, we analyzed which classification algorithm is
advisable for each domain. To do so, we generated the average ranking [Demsar, 2006] of
the algorithms over the collections of each domain (see Table 2.91). Figure 3.1
presents the average ranking for each domain. The diagrams show that the most
accurate classifier varies by domain. MNB obtained the best accuracies for most
collections from the MD, NA, SD, and WP domains. IBk obtained the best accuracies mainly
for collections from the EM and TREC domains, and SMO was advisable for collections from
the SA domain.
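The average ranking step can be illustrated with a minimal Python sketch. It is not the code used to produce the diagrams: the function name `average_ranks` is ours, and the four collections below were picked only for illustration and are not claimed to form one of the report's domains. We assume that rank 1 goes to the highest accuracy and that tied algorithms share the mean of the positions they occupy, as is usual for the ranking in [Demsar, 2006]. The accuracies are copied from the table of best accuracies in this section.

```python
ALGORITHMS = ["NB", "MNB", "J48", "SMO", "IBk"]

# Illustrative subset of collections (not an actual domain grouping).
ACCURACIES = {
    "Oh0":  [79.66, 89.83, 80.95, 81.55, 81.85],
    "Oh5":  [78.76, 86.27, 80.39, 77.24, 79.41],
    "Oh10": [72.38, 80.66, 72.09, 76.00, 73.61],
    "Oh15": [75.03, 83.68, 75.78, 75.03, 75.57],
}


def average_ranks(scores):
    """Return the mean rank of each algorithm over all collections.

    scores maps a collection name to a list of accuracies, one per algorithm.
    Rank 1 is the highest accuracy; ties receive the mean of their positions.
    """
    n_algs = len(next(iter(scores.values())))
    totals = [0.0] * n_algs
    for row in scores.values():
        # Algorithm indices ordered from best to worst accuracy.
        order = sorted(range(n_algs), key=lambda j: -row[j])
        ranks = [0.0] * n_algs
        pos = 0
        while pos < n_algs:
            # Extend 'end' over the run of accuracies tied with position 'pos'.
            end = pos
            while end + 1 < n_algs and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = ((pos + 1) + (end + 1)) / 2.0
            for k in range(pos, end + 1):
                ranks[order[k]] = mean_rank
            pos = end + 1
        for j in range(n_algs):
            totals[j] += ranks[j]
    return [t / len(scores) for t in totals]


avg_rank = dict(zip(ALGORITHMS, average_ranks(ACCURACIES)))
print(avg_rank)
# {'NB': 4.375, 'MNB': 1.0, 'J48': 3.25, 'SMO': 3.625, 'IBk': 2.75}
```

On this subset MNB has the lowest (best) average rank. NB and SMO tie on Oh15 and therefore each receive rank 4.5 for that collection.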
Table 3.93: Best accuracies for the classification algorithms used in this technical report.
Collection NB MNB J48 SMO IBk
20ng 64.61 90.08 73.96 84.81 86.71
ACM 73.89 76.04 66.36 77.38 64.36
Classic4 88.48 95.79 90.35 94.53 94.24
CSTR 78.59 84.64 66.85 75.26 82.29
Dmoz-Business-500 58.61 68.28 53.10 65.84 61.49
Dmoz-Health-500 73.15 82.07 73.61 79.90 77.83
Dmoz-Computers-500 59.67 70.13 54.88 66.01 63.70
Dmoz-Science-500 62.11 73.81 57.20 67.38 64.38
Dmoz-Sports-500 75.85 83.76 83.88 85.48 80.09
Enron-Top-20 51.54 72.60 64.47 65.98 66.72
Fbis 61.79 77.18 71.49 78.92 80.99
Hitech 62.92 72.92 56.76 66.44 71.79
Industry-Sector 41.54 76.23 57.56 70.34 77.58
IrishEconomic 59.75 67.65 51.08 65.54 60.90
La1s 75.21 88.17 76.65 84.30 80.55
La2s 75.25 89.91 76.84 86.76 82.79
LATimes 70.76 85.00 75.68 81.11 76.83
Multi-Domain-Sentiment 72.85 78.65 74.95 81.90 71.40
New3 56.72 79.16 70.85 71.93 79.26
NFS 70.86 83.84 70.74 81.87 78.88
Oh0 79.66 89.83 80.95 81.55 81.85
Oh5 78.76 86.27 80.39 77.24 79.41
Oh10 72.38 80.66 72.09 76.00 73.61
Oh15 75.03 83.68 75.78 75.03 75.57
Ohscal 62.78 74.73 71.30 76.69 68.65
Ohsumed-400 34.75 40.51 30.91 35.71 38.16
Opinosis 60.74 59.56 60.83 61.03 62.87
Pubmed-Cancer 69.90 86.61 96.80 95.98 –
Re0 57.05 79.92 75.26 77.79 83.51
Re1 66.73 83.34 79.60 72.72 81.89
Re8 81.27 95.33 90.73 93.95 94.14
Review-Polarity 66.80 80.10 68.25 83.65 70.50
Reviews 85.22 93.33 88.35 91.64 92.30
SpamAssassin 87.80 96.51 96.61 98.96 98.72
SpamTrec-3000 81.42 97.42 96.88 97.72 98.61
SyskillWebert 72.51 90.75 95.81 77.85 95.81
Tr11 54.06 85.00 78.98 77.06 86.95
Tr12 57.82 80.15 79.23 69.61 81.74
Tr21 47.95 61.35 81.27 79.77 88.66
Tr23 56.85 70.61 93.19 72.54 84.33
Tr31 79.72 94.38 93.10 90.72 94.60
Tr41 84.95 94.52 90.77 87.80 93.04
Tr45 66.66 82.46 90.28 81.30 88.84
WAP 71.73 81.02 67.05 81.85 74.48
WebKb 41.39 60.38 69.08 57.20 67.95
Figure 3.1: Average ranking diagrams for classification results. Panels: (a) EM domain, (b) MD domain, (c) NA domain, (d) SA domain, (e) SD domain, (f) TREC domain, (g) WP domain.
Chapter 4
Clustering Results
We used the Torch toolkit to generate the text clustering results. The Torch toolkit was developed in Java and supports the ARFF file format for document representation.
We ran two traditional clustering algorithms: k-means (partitional clustering) and bisecting k-means (hierarchical clustering). The number of classes of each text collection was used as the value of k (number of clusters), and the cosine correlation was used as the similarity measure. In all executions of k-means, we used random initialization of the centroids and a maximum of 30 iterations for convergence. Furthermore, we selected the partition with the lowest quadratic error among 10 different runs of k-means.
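The k-means configuration described above can be sketched in Python with NumPy. This is not the Torch toolkit's implementation, and several details are assumptions made for illustration: the function names `kmeans_cosine` and `best_of_runs` are ours, random initialization is done by picking k distinct documents as centroids, a run stops early when the assignments no longer change, and the quadratic error is taken as the sum of squared Euclidean distances between the unit-normalized document vectors and their centroids. The sketch also assumes that no document vector is all zeros. Bisecting k-means is not covered.

```python
import numpy as np


def kmeans_cosine(X, k, max_iter=30, rng=None):
    """One k-means run using cosine similarity (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Unit-normalize the rows so that a dot product equals the cosine.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Assumed initialization: k distinct documents chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        norms = np.linalg.norm(centroids, axis=1, keepdims=True)
        sims = X @ (centroids / np.maximum(norms, 1e-12)).T
        new_labels = sims.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Assumed definition of the quadratic error of the partition.
    error = float(((X - centroids[labels]) ** 2).sum())
    return labels, error


def best_of_runs(X, k, n_runs=10, seed=0):
    """Keep the partition with the lowest quadratic error over n_runs runs."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_cosine(X, k, rng=rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[1])


# Toy data: two groups of vectors pointing in clearly different directions.
X = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
              [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]])
labels, error = best_of_runs(X, k=2)
print(labels, error)
```

For real collections, `X` would be the document-term matrix and k the number of classes, as stated above.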
We used the FSCORE measure to quantitatively assess the quality of the obtained clustering results. It is essentially an information retrieval measure that quantifies how well the clustering solution recovers the class information associated with each document.
For this purpose, consider the following:
• $T$ is a clustering solution;
• $Q_i$ is a single cluster belonging to $T$, where $Q_i$ contains a set of documents; and
• $L_r$ is a reference class and its respective set of documents.
The F measure of a class $L_r$ is the maximum value obtained over the clusters $Q_i \in T$, according to Eq. (4.1). Here, $F(L_r, Q_i)$ is the harmonic mean of the precision $P(L_r, Q_i) = \frac{|L_r \cap Q_i|}{|Q_i|}$ and the recall $R(L_r, Q_i) = \frac{|L_r \cap Q_i|}{|L_r|}$. The FSCORE of a clustering solution with $n$ documents and $c$ classes is the sum of the F values of the classes weighted by the size of each class (Eq. 4.2).

$$F(L_r) = \max_{Q_i \in T} F(L_r, Q_i) \qquad (4.1)$$

$$\mathrm{FSCORE} = \sum_{r=1}^{c} \frac{|L_r|}{n} \, F(L_r) \qquad (4.2)$$
Thus, if the clustering solution perfectly recovers the class information of the documents, the FSCORE measure equals 1. In general, the higher the FSCORE value, the better the clustering solution.
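The computation of the measure can be made concrete with a short Python sketch. It follows the definitions given in this chapter, under the assumption that the mean of precision and recall is the usual harmonic mean, $2PR/(P+R)$. The function name `f_score` and the toy labels are ours and are only illustrative.

```python
from collections import Counter


def f_score(classes, clusters):
    """Class-size-weighted sum of the best F value of each reference class.

    classes[d] is the reference class of document d and clusters[d] is the
    cluster assigned to it; both sequences have one entry per document.
    """
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))  # |L_r ∩ Q_i| for each pair
    total = 0.0
    for label, class_size in class_sizes.items():
        best = 0.0
        for cluster, cluster_size in cluster_sizes.items():
            common = overlap[(label, cluster)]
            if common == 0:
                continue  # F is 0 when class and cluster share no document
            precision = common / cluster_size
            recall = common / class_size
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)  # Eq. (4.1): best cluster for this class
        total += class_size / n * best  # weight by relative class size
    return total


classes = ["a", "a", "a", "b", "b", "b"]
perfect = f_score(classes, [0, 0, 0, 1, 1, 1])
mixed = f_score(classes, [0, 0, 1, 1, 1, 1])
print(perfect, round(mixed, 4))
# 1.0 0.8286
```

In the second partition one document of class "a" falls into the cluster dominated by class "b", so the best F value is 0.8 for class "a" and 6/7 for class "b", giving 29/35 overall.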
Table 4.1 presents the FSCORE for the collections analyzed in this technical report. The highest FSCORE for each collection is highlighted in bold.
Considering the results in Table 4.1, we analyzed which clustering algorithm is
advisable for each domain. To do so, we generated the average ranking of the algorithms
over the collections of each domain. Figure 4.1 presents the average ranking for
each domain. The diagrams show that bisecting k-means is advisable for all
domains.
Table 4.1: FSCORE for the clustering algorithms used in this technical report.
Collection K-means Bisecting K-means
20ng 0.419 0.459
ACM 0.411 0.461
Classic3 0.916 0.797
CSTR 0.467 0.718
Dmoz-Business-500 0.216 0.184
Dmoz-Computers-500 0.297 0.289
Dmoz-Health-500 0.386 0.402
Dmoz-Science-500 0.276 0.259
Dmoz-Sports-500 0.439 0.349
Enron-Top-20 0.349 0.427
FBIS 0.563 0.607
Hitech 0.522 0.589
Industry-Sector 0.206 0.259
IrishEconomic 0.407 0.500
La1s 0.477 0.523
La2s 0.509 0.604
LATimes 0.411 0.505
Multi-Domain-Sentiment 0.606 0.544
New3 0.410 0.527
NFS 0.327 0.445
Oh0 0.525 0.601
Oh10 0.537 0.568
Oh15 0.509 0.513
Oh5 0.519 0.524
Ohscal 0.503 0.426
Ohsumed-400 0.152 0.152
Opinosis 0.545 0.601
PubMed-Cancer 0.372 0.316
Re0 0.411 0.652
Re1 0.463 0.589
Re8 0.515 0.812
Review-Polarity 0.566 0.568
Reviews 0.771 0.804
SpamAssassin 0.619 0.750
SpamTrec-3000 0.661 0.685
SyskillWebert 0.830 0.933
Tr11 0.458 0.671
Tr12 0.342 0.681
Tr21 0.612 0.771
Tr23 0.397 0.619
Tr31 0.630 0.712
Tr41 0.571 0.693
Tr45 0.412 0.620
WAP 0.461 0.608
WebKb 0.364 0.520
[Figure panels: (a) EM domain, (b) MD domain, (c) NA domain, (d) SA domain, (e) SD domain, (f) TREC domain, (g) WP domain]