Synthesis of the Classification Results - Benchmarking Text Collections for Classiﬁcation and C

This section synthesizes the classification results. Table 3.93 presents the best accuracy obtained by each classification algorithm used in this technical report. The highest accuracy for each collection is highlighted in bold.

Considering the results from Table 3.93, we analyzed which classification algorithm is advisable for each domain. To do so, we generated the average ranking [Demsar, 2006] of the algorithms over the collections of each domain (see Table 2.91). Figure 3.1 presents the average ranking for each domain. The diagrams show that the most accurate classifier varies by domain. MNB obtained the best accuracies for most collections from the MD, NA, SD, and WP domains. IBk obtained the best accuracies mainly for collections from the EM and TREC domains, and SMO was advisable for collections from the SA domain.
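The average ranking can be illustrated with a minimal sketch. It assumes that, for each collection, the algorithms are ranked by accuracy (rank 1 for the highest accuracy, tied algorithms sharing the mean of their ranks) and that the ranks are then averaged over the collections. The function names `rank_row` and `average_ranking` are hypothetical, and the three rows are copied from the accuracy table only as an illustrative subset, not as one of the report's domains.

```python
def rank_row(scores):
    """Rank the scores of one collection: 1 = highest; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def average_ranking(table, algorithms):
    """Average the per-collection ranks of each algorithm."""
    totals = [0.0] * len(algorithms)
    for scores in table.values():
        for i, rank in enumerate(rank_row(scores)):
            totals[i] += rank
    return {name: total / len(table) for name, total in zip(algorithms, totals)}


ALGORITHMS = ["NB", "MNB", "J48", "SMO", "IBk"]
# Illustrative subset of rows taken from the accuracy table.
SAMPLE = {
    "Oh15": [75.03, 83.68, 75.78, 75.03, 75.57],
    "Review-Polarity": [66.80, 80.10, 68.25, 83.65, 70.50],
    "SyskillWebert": [72.51, 90.75, 95.81, 77.85, 95.81],
}
avg = average_ranking(SAMPLE, ALGORITHMS)
```

A lower average rank indicates an algorithm that tends to be more accurate across the chosen collections.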

Table 3.93: Best accuracies for the classification algorithms used in this technical report.

Collection NB MNB J48 SMO IBk

20ng 64.61 90.08 73.96 84.81 86.71

ACM 73.89 76.04 66.36 77.38 64.36

Classic4 88.48 95.79 90.35 94.53 94.24

CSTR 78.59 84.64 66.85 75.26 82.29

Dmoz-Business-500 58.61 68.28 53.10 65.84 61.49

Dmoz-Health-500 73.15 82.07 73.61 79.90 77.83

Dmoz-Computers-500 59.67 70.13 54.88 66.01 63.70

Dmoz-Science-500 62.11 73.81 57.20 67.38 64.38

Dmoz-Sports-500 75.85 83.76 83.88 85.48 80.09

Enron-Top-20 51.54 72.60 64.47 65.98 66.72

Fbis 61.79 77.18 71.49 78.92 80.99

Hitech 62.92 72.92 56.76 66.44 71.79

Industry-Sector 41.54 76.23 57.56 70.34 77.58

IrishEconomic 59.75 67.65 51.08 65.54 60.90

La1s 75.21 88.17 76.65 84.30 80.55

La2s 75.25 89.91 76.84 86.76 82.79

LATimes 70.76 85.00 75.68 81.11 76.83

Multi-Domain-Sentiment 72.85 78.65 74.95 81.90 71.40

New3 56.72 79.16 70.85 71.93 79.26

NFS 70.86 83.84 70.74 81.87 78.88

Oh0 79.66 89.83 80.95 81.55 81.85

Oh5 78.76 86.27 80.39 77.24 79.41

Oh10 72.38 80.66 72.09 76.00 73.61

Oh15 75.03 83.68 75.78 75.03 75.57

Ohscal 62.78 74.73 71.30 76.69 68.65

Ohsumed-400 34.75 40.51 30.91 35.71 38.16

Opinosis 60.74 59.56 60.83 61.03 62.87

Pubmed-Cancer 69.90 86.61 96.80 95.98 –

Re0 57.05 79.92 75.26 77.79 83.51

Re1 66.73 83.34 79.60 72.72 81.89

Re8 81.27 95.33 90.73 93.95 94.14

Review-Polarity 66.80 80.10 68.25 83.65 70.50

Reviews 85.22 93.33 88.35 91.64 92.30

SpamAssassin 87.80 96.51 96.61 98.96 98.72

SpamTrec-3000 81.42 97.42 96.88 97.72 98.61

SyskillWebert 72.51 90.75 95.81 77.85 95.81

Tr11 54.06 85.00 78.98 77.06 86.95

Tr12 57.82 80.15 79.23 69.61 81.74

Tr21 47.95 61.35 81.27 79.77 88.66

Tr23 56.85 70.61 93.19 72.54 84.33

Tr31 79.72 94.38 93.10 90.72 94.60

Tr41 84.95 94.52 90.77 87.80 93.04

Tr45 66.66 82.46 90.28 81.30 88.84

WAP 71.73 81.02 67.05 81.85 74.48

WebKb 41.39 60.38 69.08 57.20 67.95

Figure 3.1: Average ranking diagrams for classification results. Panels: (a) EM domain, (b) MD domain, (e) SD domain, (f) TREC domain, (g) WP domain.

Chapter 4 Clustering Results

We used the Torch toolkit to generate the text clustering results. The Torch toolkit was developed in Java and supports the ARFF file format for document representation.

We ran two traditional clustering algorithms: k-means (partitional clustering) and bisecting k-means (hierarchical clustering). The number of classes of each text collection was used as the value of k (number of clusters), and cosine similarity was used as the similarity measure. In all executions of k-means, we used random initialization of the centroids and a maximum of 30 iterations for convergence. Furthermore, we selected the partition with the lowest quadratic error from 10 different runs of k-means.
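The k-means protocol described above can be sketched as follows. This is not the Torch implementation: it is a minimal pure-Python illustration in which the function names (`cosine`, `kmeans`, `best_of_runs`), the dense list-of-lists document representation, and the toy data are all hypothetical. The quadratic error is assumed here to be the sum of squared cosine distances between each document and its centroid, which the report does not spell out.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(docs, k, rng, max_iter=30):
    """One k-means run: random initial centroids, at most max_iter iterations."""
    centroids = [list(d) for d in rng.sample(docs, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each document to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Assumed error: sum of squared cosine distances to the assigned centroid.
    error = sum((1.0 - cosine(d, centroids[lab])) ** 2 for d, lab in zip(docs, labels))
    return labels, error


def best_of_runs(docs, k, runs=10, seed=0):
    """Keep the partition with the lowest error among several random restarts."""
    rng = random.Random(seed)
    return min((kmeans(docs, k, rng) for _ in range(runs)), key=lambda run: run[1])


# Toy data: two documents along each of two directions.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, error = best_of_runs(docs, k=2)
```

Restarting from several random initializations reduces the chance of reporting a poor local optimum, since a single k-means run depends on where the initial centroids fall.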

We used the $F_{SCORE}$ measure to quantitatively assess the quality of the clustering results. It is essentially an information retrieval measure that computes how well the clustering solution recovers the class information associated with each document.

For this purpose, consider the following:

• $T$ is a clustering solution;

• $Q_i$ is a single cluster belonging to $T$, where $Q_i$ contains a set of documents; and

• $L_r$ is a reference class with its respective set of documents.

The F measure of a class $L_r$ is the maximum value obtained over the clusters $Q_i \in T$, according to Eq. (4.1). Here, $F(L_r, Q_i)$ is a mean of precision $P(L_r, Q_i) = \frac{|L_r \cap Q_i|}{|Q_i|}$ and recall $R(L_r, Q_i) = \frac{|L_r \cap Q_i|}{|L_r|}$. The $F_{SCORE}$ of a clustering solution with $n$ documents and $c$ classes is the sum of the F values of the classes weighted by the size of each class (Eq. 4.2).

$$F(L_r) = \max_{Q_i \in T} F(L_r, Q_i) \qquad (4.1)$$

$$F_{SCORE} = \sum_{r=1}^{c} \frac{|L_r|}{n} F(L_r) \qquad (4.2)$$

Thus, if the clustering solution perfectly recovers the class information of the documents, the $F_{SCORE}$ measure is equal to 1. In general, the higher the $F_{SCORE}$ value, the better the clustering solution.
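As a worked illustration of the measure, the following sketch computes it from two parallel label lists. It assumes that the mean of precision and recall is the harmonic mean, as in the standard F measure; the text above only says "a mean". The function name `f_score` and its inputs are hypothetical.

```python
from collections import Counter


def f_score(classes, clusters):
    """F_SCORE of a clustering, given parallel lists of class and cluster labels."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    # Size of the intersection of each (class, cluster) pair that shares documents.
    overlap = Counter(zip(classes, clusters))
    best = Counter()  # best F value of each class over all clusters
    for (r, i), both in overlap.items():
        precision = both / cluster_sizes[i]
        recall = both / class_sizes[r]
        f = 2 * precision * recall / (precision + recall)  # assumed harmonic mean
        best[r] = max(best[r], f)
    # Weight each class by its share of the n documents.
    return sum(class_sizes[r] / n * best[r] for r in class_sizes)


perfect = f_score(["a", "a", "b", "b"], [0, 0, 1, 1])
merged = f_score(["a", "a", "b", "b"], [0, 0, 0, 0])
```

In this toy case, `perfect` equals 1 because every class coincides with one cluster, while `merged` is lower because a single cluster holding all four documents has a precision of only 0.5 for each class.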

Table 4.1 presents the $F_{SCORE}$ for the collections analyzed in this technical report. The highest $F_{SCORE}$ for each collection is highlighted in bold.

Considering the results from Table 4.1, we analyzed which clustering algorithm is advisable for each domain. To do so, we generated the average ranking of the algorithms over the collections of each domain. Figure 4.1 presents the average ranking for each domain. The diagrams show that bisecting k-means is advisable for all the domains.

Table 4.1: $F_{SCORE}$ for the clustering algorithms used in this technical report.

Collection K-means Bisecting K-means

20ng 0.419 0.459

ACM 0.411 0.461

Classic3 0.916 0.797

CSTR 0.467 0.718

Dmoz-Business-500 0.216 0.184

Dmoz-Computers-500 0.297 0.289

Dmoz-Health-500 0.386 0.402

Dmoz-Science-500 0.276 0.259

Dmoz-Sports-500 0.439 0.349

Enron-Top-20 0.349 0.427

FBIS 0.563 0.607

Hitech 0.522 0.589

Industry-Sector 0.206 0.259

IrishEconomic 0.407 0.500

La1s 0.477 0.523

La2s 0.509 0.604

LATimes 0.411 0.505

Multi-Domain-Sentiment 0.606 0.544

New3 0.410 0.527

NFS 0.327 0.445

Oh0 0.525 0.601

Oh10 0.537 0.568

Oh15 0.509 0.513

Oh5 0.519 0.524

Ohscal 0.503 0.426

Ohsumed-400 0.152 0.152

Opinosis 0.545 0.601

PubMed-Cancer 0.372 0.316

Re0 0.411 0.652

Re1 0.463 0.589

Re8 0.515 0.812

Review-Polarity 0.566 0.568

Reviews 0.771 0.804

SpamAssassin 0.619 0.750

SpamTrec-3000 0.661 0.685

SyskillWebert 0.830 0.933

Tr11 0.458 0.671

Tr12 0.342 0.681

Tr21 0.612 0.771

Tr23 0.397 0.619

Tr31 0.630 0.712

Tr41 0.571 0.693

Tr45 0.412 0.620

WAP 0.461 0.608

WebKb 0.364 0.520

Figure 4.1: Average ranking diagrams for clustering results. Panels: (a) EM domain, (b) MD domain, (e) SD domain, (f) TREC domain, (g) WP domain.

Chapter 5 Final Considerations

This technical report aims to address the lack of benchmark collections and results in the literature. To this end, we make 45 preprocessed text collections available in an online repository: http://sites.labic.icmc.usp.br/text_collections/. The tool developed to preprocess the text collections (Text Preprocessing Tool) is available at http://sites.labic.icmc.usp.br/tpt/.

The collections used in this technical report come from different domains: e-mails, medical, news, sentiment analysis, scientific, and TREC. The collections have different characteristics. The number of documents ranges from 204 to 65991, the number of terms from 1762 to 100464, the average number of terms per document from 6.55 to 720.30, and the number of classes from 2 to 51. Besides these traditional characteristics, we extracted others, such as the matrix sparsity, the standard deviation of the number of documents per class, the majority class, and the S-Index (see Section 2).
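To make these characteristics concrete, the sketch below derives several of them from a small dense document-term matrix. The function name `collection_stats` and the toy data are hypothetical. The exact definitions are assumptions: terms per document are counted as non-zero entries, and the class-size deviation and majority class are expressed as fractions of the documents. The S-Index is omitted because it is not defined in this section.

```python
import statistics
from collections import Counter


def collection_stats(matrix, labels):
    """Summarize a dense document-term count matrix and its class labels."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    nonzero_per_doc = [sum(1 for x in row if x != 0) for row in matrix]
    class_counts = Counter(labels)
    class_shares = [count / n_docs for count in class_counts.values()]
    return {
        "documents": n_docs,
        "terms": n_terms,
        "avg_terms_per_doc": sum(nonzero_per_doc) / n_docs,
        # Sparsity: share of zero entries in the matrix.
        "sparsity": 1 - sum(nonzero_per_doc) / (n_docs * n_terms),
        "classes": len(class_counts),
        # Spread of the class sizes and size of the largest class (as fractions).
        "class_stdev": statistics.pstdev(class_shares),
        "majority_class": max(class_shares),
    }


# Toy collection: 4 documents, 4 terms, 2 classes.
matrix = [[1, 0, 0, 2], [0, 0, 3, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
stats = collection_stats(matrix, ["x", "x", "x", "y"])
```

Real text collections are far sparser than this toy matrix, so a sparse representation would normally replace the dense lists used here.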

We also provided classification and clustering results using traditional/state-of-the-art algorithms. The classification algorithms used were Naïve Bayes, Multinomial Naïve Bayes, C4.5, Support Vector Machine, and k-Nearest Neighbors. The tool developed to generate the classification results (Inductive Classification Tool) is available at http://sites.labic.icmc.usp.br/ict/. The experiments carried out in this report showed that Multinomial Naïve Bayes is advisable for the MD, NA, SD, and WP domains, kNN for the EM and TREC domains, and SVM for the SA domain.

For clustering, we used the k-means and bisecting k-means algorithms. The tool developed to generate the clustering results (Torch) is available at http://sites.labic.icmc.usp.br/torch/. We found that, in general, bisecting k-means provided better results than k-means for collections from all the domains.

This technical report provides collections with their characteristics and baseline results

Bibliography

[Apache, 2006] Apache (2006). The Apache SpamAssassin project. http://spamassassin.apache.org/publiccorpus/. Last access: November 6 2013.

[Berry, 2004] Berry, M. (2004). Survey of Text Mining I: Clustering, Classification, and Retrieval. Number v. 1. Springer.

[Berry et al., 2008] Berry, M. and Castellanos, M. (2008). Survey of Text Mining II: Clustering, Classification, and Retrieval. Number v. 2 in Computer science. Springer.

[Blitzer et al., 2009] Blitzer, J., Dredze, M., and Pereira, F. (2009). Multi-domain sentiment dataset (version 2.0). http://www.cs.jhu.edu//~mdredze/datasets/sentiment/. Last access: November 6 2013.

[Caruana and Niculescu-Mizil, 2006] Caruana, R. and Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. In Proceedings of the International Conference on Machine Learning, pages 161–168.

[Cohen, 2009] Cohen, W. W. (2009). Enron email dataset. http://www.cs.cmu.edu/~enron/. Last access: July 19 2013.

[Cormack and Lynam, 2007] Cormack, G. V. and Lynam, T. R. (2007). 2007 TREC public spam corpus. http://plg.uwaterloo.ca/~gvcormac/treccorpus07/. Last access:
