
Applying feature transformation using Relative Frequency with Power Transformation and Lemmatization in automatic Spam Filtering

International Journal of Computer Science & Network Solutions, Oct. 2014, Volume 2, No. 10, ISSN 2345-3397

Augustine Malero

Department of Computer Science, University of Dodoma

mkumbi@gmail.com

Abstract

Advances in information and communication technology have paved the way for electronic mail, commonly referred to as email, to become a primary medium of communication. In recent years this medium has become a target of abuse through spamming. One approach to combating spamming is automatic spam filtering through machine learning. The conventional feature representation in automatic spam filtering is Term Frequency weighted by Inverse Document Frequency (TFIDF). In this paper, an alternative approach is presented: Relative Frequency with Power Transformation (RFPT) coupled with a lemmatization technique. The proposed techniques show considerable improvements over the conventional TFIDF representation.

Keywords: Spam filtering, machine learning, TFIDF, RFPT, lemmatization.

I. Introduction

Advances in information and communication technology have paved the way for electronic mail, commonly referred to as email, to become a primary medium of communication. Despite the advantages this medium enjoys, it has also become a target of abuse. The practice of sending unsolicited emails, commonly known as spam, has increasingly become a nuisance. The severity of the problem ranges from ordinary users, who waste time deleting spam before reading their legitimate mail, to large companies, which spend millions of dollars yearly to combat it.

To combat the spam problem, several approaches have been used; however, the prevailing techniques have not been able to completely eliminate it. Classically, many machine learning filters use term frequencies weighted by inverse document frequency (TFIDF) to generate feature vectors. That is, suppose each email is represented as a feature vector $X = [x_1, \ldots, x_n]^T$, where $x_i$ is the frequency of term $i$. Each component is then weighted as in equation (1), where $N$ is the total number of emails in the collection and $N_i$ is the email frequency, i.e., the number of emails in which term $i$ occurs:

$W_i = x_i \log\left(\frac{N}{N_i}\right)$ (1)
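To make the weighting concrete, the following minimal Python sketch (not from the paper; the function name and toy counts are illustrative) computes the weights of equation (1) from raw term-frequency vectors:

```python
# Illustrative sketch of TFIDF weighting as in equation (1):
# W_i = x_i * log(N / N_i), using plain Python lists.
import math

def tfidf_weights(count_vectors):
    """count_vectors: one term-frequency vector per email."""
    n_emails = len(count_vectors)                 # N in equation (1)
    n_terms = len(count_vectors[0])
    # N_i: number of emails in which term i occurs at least once
    doc_freq = [sum(1 for v in count_vectors if v[i] > 0) or 1
                for i in range(n_terms)]
    return [[x * math.log(n_emails / doc_freq[i])
             for i, x in enumerate(vec)]
            for vec in count_vectors]

# Example: 3 emails over a 4-word lexicon
emails = [[2, 0, 1, 0], [1, 1, 0, 0], [0, 3, 0, 1]]
weighted = tfidf_weights(emails)
```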

This paper proposes the use of Relative Frequency with Power Transformation (RFPT) coupled with lemmatization to improve the performance of automatic spam filtering. Experimental results show that RFPT coupled with lemmatization is statistically better than TFIDF for k-Nearest Neighbor (k-NN) and Support Vector Machine with radial basis function (RBF) classifiers.

II. Feature Transformation Techniques

A. Relative Term Frequency (RF)

RF is the absolute frequency of a term divided by the total number of term occurrences observed in the email. RF is applied to eliminate the dependency of the features on text length, as shown in equation (2).

$y_i = \frac{x_i}{\sum_j x_j}$ (2)

B. Power Transformation (PT)

The power transformation is a continuously varying function with respect to the power parameter $\nu$, written in piece-wise form so that it remains continuous at the point of singularity ($\nu = 0$). For data vectors $(y_1, \ldots, y_n)$ the power transform is expressed by equation (3). This transformation makes the sample distribution of the frequencies $y_i \geq 0$ more Gaussian-like.

$z_i = y_i^{\nu}, \quad 0 < \nu < 1$ (3)

When RF is transformed using equation (3), the resulting feature is termed Relative Frequency with Power Transformation (RFPT); such features have properties that simplify the learning task of a classification system. Equation (3) can then be rewritten as equation (4), expressing RFPT explicitly. These transformations aim at achieving a desirable Gaussian-like distribution, since such a distribution leads to an optimal decision boundary. In the experiments $\nu$ was set to 0.5, which normalizes the Euclidean length of each RFPT vector to 1, because $\sum_i z_i^2 = \sum_i y_i = 1$.

$z_i = \left(\frac{x_i}{\sum_j x_j}\right)^{\nu}, \quad 0 < \nu < 1$ (4)
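A minimal sketch of the RFPT transform of equation (4), assuming plain Python lists of term counts (the helper name is hypothetical):

```python
# Illustrative sketch of RFPT, equation (4): z_i = (x_i / sum_j x_j)^v.
# With v = 0.5 the resulting vector has Euclidean length 1, since
# sum_i z_i^2 = sum_i y_i = 1.
def rfpt(counts, v=0.5):
    total = sum(counts) or 1            # guard against an empty email
    return [(x / total) ** v for x in counts]

z = rfpt([2, 0, 1, 0])                  # -> [0.8165..., 0.0, 0.5773..., 0.0]
# sum(x * x for x in z) == 1.0 (up to floating-point error)
```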

III. Lemmatization


Lemmatization maps the inflected forms of a word to its base or dictionary form (the lemma) using vocabulary and morphological analysis, whereas stemming simply chops off word endings. For example, if confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the token was used as a verb or a noun. Lemmatization is important in natural language processing (NLP) applications because it brings out grammatical and semantic relations that are otherwise not accessible to the software (Ingason et al.). In the experiments of this research, a lemmatized version of the Ling-Spam corpus was used.
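The saw/see example can be reproduced with NLTK's WordNet lemmatizer; this is only an illustration, since the paper used a pre-lemmatized version of the corpus rather than any particular NLP library:

```python
# Illustrative sketch of the saw/see example using NLTK's WordNet
# lemmatizer (NLTK is not named in the paper).
import nltk
nltk.download("wordnet", quiet=True)    # fetch the WordNet data once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("saw", pos="v"))   # -> "see" (verb reading)
print(lemmatizer.lemmatize("saw", pos="n"))   # -> "saw" (noun reading)
```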

IV. Methodology

A. Data for Experiments

To obtain reliable results for the dimensionality reduction, feature transformation, and classification experiments, the study adopted the Ling-Spam corpus, a collection of emails that has been widely used by other researchers (Androutsopoulos et al., 2000).

The corpus contains 10 subdirectories (part1, …, part10) holding 2412 legitimate messages from the Linguist mailing list, obtained by randomly downloading digests from the list's archives, separating their messages, and removing text added by the list's server, together with 481 spam messages received by the corpus' first author. Attachments, HTML tags, and duplicate spam messages received on the same day were not included. In the experiments both the bare and the lemmatized versions of the corpus were used; in each case one part was reserved for testing and the other nine were used for training.

After selecting the data for the experiments, the procedure for automatic email classification can be divided into four general steps: feature vector generation, dimensionality reduction, learning (classifier training), and testing. The following subsections describe each step.

B. Feature Vector Generation

Feature vector generation was performed so that emails could be interpreted by a classifier, and features were extracted to represent the email documents. First, a lexicon consisting of all distinct words in the training email set was generated and sorted into alphabetic order. The feature vectors were then composed of the frequencies of the lexicon words found in each email. The extracted features can be represented as the feature vector $X$ in equation (5), where $n$ is the dimensionality (the size of the lexicon), $x_i$ is the frequency of the $i$th word, and $T$ denotes the transpose of a vector.

$X = (x_1, \ldots, x_n)^T$ (5)
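A minimal sketch of this step, assuming whitespace tokenization (the paper does not specify its tokenizer; all names here are illustrative):

```python
# Illustrative sketch of feature-vector generation: build an alphabetically
# ordered lexicon from the training emails, then map each email to the
# vector of lexicon-word frequencies, as in equation (5).
def build_lexicon(training_emails):
    words = set()
    for email in training_emails:
        words.update(email.lower().split())
    return sorted(words)                      # alphabetic order, as in the paper

def to_feature_vector(email, lexicon):
    tokens = email.lower().split()
    return [tokens.count(word) for word in lexicon]

train = ["win free money now", "meeting agenda for monday"]
lexicon = build_lexicon(train)                # n = size of lexicon
X = [to_feature_vector(e, lexicon) for e in train]
```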

C. Dimensionality Reduction by Principal Component Analysis (PCA)


PCA can be defined as the linear projection that minimizes the average projection cost, referred to as the mean squared distance between the data points and their projections (Bishop, 2006). PCA is used mainly to solve the problem of high dimensionality (Duda et al, 2000).
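A minimal sketch of this step using scikit-learn's PCA (the paper does not name an implementation; the dimensionalities shown are hypothetical):

```python
# Illustrative sketch of the dimensionality-reduction step with PCA.
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(100, 500)     # stand-in for the training vectors
pca = PCA(n_components=50)             # project 500-dim features to 50 dims
X_train_reduced = pca.fit_transform(X_train)
# Apply the same projection, fitted on training data only, to test data
X_test_reduced = pca.transform(np.random.rand(10, 500))
```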

D. Classification Methods

Two classifiers were used in the experiments: k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM). These two methods were chosen because of the advantages each has over other methods. SVM is a learning system based on statistical learning theory (Cortes and Vapnik, 1995; Cristianini et al., 2003; Duda et al., 2000; Joachims, 1998) and has several remarkable properties that account for its strong performance in text classification (Duda et al., 2000): it learns largely independently of the dimensionality of the feature space, it performs efficiently even when the feature space is large, and it handles the sparse document vectors typical of text. In the experiments the SVMlight package (Joachims, 1998) was used, with three different kernel functions: Linear Kernel (Linear), Polynomial Kernel (Polynomial) and Radial Basis Function (RBF).

The k-nearest neighbor algorithm (k-NN) classifies objects based on the closest training examples in the feature space. k-NN is a type of instance-based, or lazy, learning in which the function is only approximated locally and all computation is deferred until a new instance is fed to the algorithm. It classifies an object by a majority vote of its training neighbors, assigning the object to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbor. The k-NN classifier generally has satisfactory noise-rejection properties; it is also analytically tractable and simple to implement (Bremner, 2005). In the experiments k was set to 3.
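A minimal sketch of the classification setup using scikit-learn (the paper itself used the SVMlight package; this modern equivalent with stand-in data is for illustration only):

```python
# Illustrative sketch of the two classifier families with the three SVM
# kernels and 3-NN, trained on random stand-in data.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 50)), rng.integers(0, 2, 100)  # stand-ins
X_test = rng.random((10, 50))

classifiers = {
    "SVM Linear": SVC(kernel="linear"),
    "SVM Polynomial": SVC(kernel="poly"),
    "SVM RBF": SVC(kernel="rbf"),
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),  # k = 3, as in the paper
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.predict(X_test))
```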

E. Evaluation of Classifiers

To evaluate the effectiveness of the classifiers, the F1 measure was applied; the measure was first defined by C.J. van Rijsbergen (1979). It is often used in information retrieval for measuring search, document classification, and query classification performance (Beitzel, 2006), and is defined as the harmonic mean of recall $R$ and precision $P$ (Busagala et al., 2006), given by equation (7). Precision is the proportion of documents assigned by the classification system to a class that really belong to that class, expressed in equation (8); recall measures the ability of a system to present all relevant items, expressed in equation (9). Recall, precision, and the F1 measure are regarded as standard evaluation methods for automatic email classification systems.

$F_1 = \frac{2RP}{R + P}$ (7)

$P = \frac{TP}{TP + FP} = \frac{\sum_{j=1}^{C} TP_j}{\sum_{j=1}^{C} (TP_j + FP_j)}$ (8)

$R = \frac{TP}{TP + FN} = \frac{\sum_{j=1}^{C} TP_j}{\sum_{j=1}^{C} (TP_j + FN_j)}$ (9)
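A minimal sketch of equations (7)-(9), computing micro-averaged precision, recall, and F1 from hypothetical per-class counts:

```python
# Illustrative sketch of equations (7)-(9): micro-averaged precision,
# recall, and F1 from per-class true/false positive and negative counts.
def micro_f1(tp, fp, fn):
    """tp, fp, fn: per-class lists of true-positive, false-positive,
    and false-negative counts (index j = class, j = 1..C)."""
    precision = sum(tp) / (sum(tp) + sum(fp))             # equation (8)
    recall = sum(tp) / (sum(tp) + sum(fn))                # equation (9)
    return 2 * recall * precision / (recall + precision)  # equation (7)

# Two classes (spam, legitimate) with hypothetical counts:
print(micro_f1(tp=[45, 230], fp=[3, 5], fn=[5, 3]))
```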

V. Results and Analysis

A. Performance Comparison of Classifiers

Figure 1 summarizes the spam filtering results obtained before transformation and lemmatization, with lemmatization only, with RFPT transformation only, with TFIDF transformation only, and with RFPT transformation and lemmatization.

It can be observed that: (1) the k-NN classifier achieved its best classification rates on the RFPT feature vectors, where it reached 100% accuracy, although SVM RBF had the best classification rates overall; (2) the classification rate was significantly improved by employing RFPT, for example the accuracy of SVM Polynomial improved by 7.6492%, SVM Linear by 1.5873%, and k-NN by 1.7606% (details not shown); (3) as Figure 1 shows, RFPT combined with lemmatization further improved the classification accuracy of the SVM Linear and SVM RBF classifiers, by 1.7637% and 0.1736% respectively (details not shown).

B. Performance Comparisons of RFPT with Lemmatization against TFIDF

In this section we compare the results obtained by relative frequency with power transformation (RFPT) coupled with lemmatization against the conventional term frequency weighted by inverse document frequency (TFIDF).

From the results in Figure 2, it is evident that RFPT coupled with lemmatization gives better results than TFIDF. For example, the SVM RBF result for RFPT with lemmatization shows a 0.8742% improvement over that of TFIDF, suggesting that RFPT with lemmatization is better than TFIDF in classification systems.

VI. Conclusions

In this paper, several sets of experiments were carried out to study the impact of feature transformation coupled with lemmatization on email classification. The comparison experiments show that lemmatizing features, then normalizing absolute frequencies to relative frequencies, and finally applying the power transformation significantly improved the classifiers' performance. The empirical evidence shows that relative frequency and power transformation with lemmatization yield considerable improvements in email classification performance. This implies that combining lemmatization, which reduces the inflectional forms of features, with RFPT, which gives the features a more Gaussian-like distribution and removes the dependency on text length, results in higher classification rates than using untransformed features to represent emails, even at lower dimensionalities.

References

i. C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, 1995, Vol. 20, No. 3, pp. 273–297.

[Figure 2: Micro-averaged F-measure (%) of the k-NN, SVM Linear, SVM Polynomial, and SVM RBF classifiers for TFIDF versus RFPT with lemmatization (ν = 0.5); y-axis range 98.4–100%.]

ii. C.D. Manning, P. Raghavan and H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.

iii. C.J. van Rijsbergen, "Information Retrieval", Butterworth, London, Chapter 7, 1979.

iv. C.M. Bishop, "Pattern Recognition and Machine Learning" (Information Science and Statistics), Springer-Verlag New York Inc., Secaucus, NJ, USA, 2006.

v. D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin and G. Toussaint, "Output-Sensitive Algorithms for Computing Nearest-Neighbor Decision Boundaries", Discrete and Computational Geometry, 2005, Vol. 33, No. 4, pp. 593–604.

vi. I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos and P. Stamatopoulos, "Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach", 2000.

vii. J. Clark, I. Koprinska and J. Poon, "LINGER – A Smart Personal Assistant for E-mail Classification", School of Information Technologies, University of Sydney, Sydney, Australia, 2006.

viii. K.P. Bennett and C. Campbell, "Support Vector Machines: Hype or Hallelujah?", SIGKDD Explorations, 2000, Vol. 2, Issue 2.

ix. L.F. Cranor and B.A. LaMacchia, "Spam!", Communications of the ACM, 1998, Vol. 41, No. 8, pp. 74–83.

x. L.S.P. Busagala, W. Ohyama, T. Wakabayashi and F. Kimura, "Machine Learning with Transformed Features in Automatic Text Classification", 2006.

xi. N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods", Cambridge University Press, 2003.

xii. R. Segal and J. Kephart, "MailCat: An Intelligent Assistant for Organizing E-Mail", 3rd Int. Conf. on Autonomous Agents, 1999.

xiii. R.O. Duda, P.E. Hart and D.G. Stork, "Pattern Classification", 2nd edition, 2000.

xiv. S.M. Beitzel, "On Understanding and Classifying Web Queries", PhD Thesis, 2006.

xv. T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", 1998.
