A comparison of classification methods applied to credit card fraud detection

Manoel Fernando Alonso Gadi, Alair Pereira do Lago, Xidi Wang

Article presented to

São Paulo, Brazil, April 2008

Abstract

On 31 January 2002, the journal Nature [Kla02], which has a strong impact in the scientific community, published a news item about immune-based systems. Among the applications considered was the detection of fraudulent financial transactions, and the item mentioned the possibility of a commercial use of such a system as early as 2003 by a British company.

In spite of that, we do not know of any scientific publication that uses Artificial Immune Systems in financial fraud detection. This work reports very satisfactory results on the application of Artificial Immune Systems (AIS) to credit card fraud detection.

In fact, scientific publications on financial fraud detection are quite rare, as Phua et al. [PLSG05] point out, in particular for credit card transactions.

Phua et al. identify the lack of a public database of fraudulent financial transactions available for testing as the main cause of this small number of publications. Two of the most important publications on this subject that report results about their implementations are the prize-winning [MTV00], which compares Neural Networks and Bayesian Networks in credit card fraud detection with a result favouring Bayesian Networks, and [SFL⁺97], which proposed the AdaCost method. This work joins both of these and publishes results in credit card fraud detection.

Moreover, in spite of the unavailability of Maes's data and implementations, we reproduce their results and extend the set of comparisons so as to compare Neural Networks and Bayesian Networks with Artificial Immune Systems, Decision Trees, and even the simple Naive Bayes.

All work presented here is a compilation of the master's thesis [Gad08], which took into account the skewed nature of the dataset as well as the need for parameter tuning, sometimes through the use of genetic algorithms, in order to obtain the best results for each compared method.

1 Introduction

In recent years many bio-inspired algorithms have emerged for solving classification and optimization problems. Examples are Ant Colony [MGdL01], Neural Networks [Ros62] and Artificial Immune Systems [dCT02] and [dCZ].

Human systems have been a source of inspiration for scientific development on several fronts. In artificial intelligence, for instance, some research tries to create artificial systems that simulate the functioning of human systems in order to take decisions considered intelligent. An important example is that in January 2002 the journal Nature [Kla02] published an article about immune-based systems, with special attention to applications in the protection of computers against intruders and in the detection of fraudulent financial transactions.

The same article anticipated the commercial use of immune-based systems for fraud detection in business systems by 2003. Despite this, works on fraud detection in general are rare, as pointed out by Phua et al. [PLSG05], who indicate the absence of a public database on which diverse methods can be applied as the main cause of this low number of publications. Two of the most important publications on the subject are [MTV00], which compares Neural Nets and Bayesian Nets in credit card fraud detection with a result favorable to Bayesian Nets, and [SFL⁺97], which begins with excellent work on fraud detection but ends by drawing most of its conclusions on the detection of intruders in computer networks.

The aim of this work was to give a real-life application of Artificial Immune Systems to credit card data, reproducing the [MTV00] comparison between Neural Nets and Bayesian Nets and also including Naive Bayes and Decision Trees in the pool of compared methods. This comparison took into account particularities of these data, such as the skewness of the data and the different costs of false positives and false negatives. It also involved a parametric adjustment for each method.

Background on the application field: Fraud prevention is a subject that has always attracted interest and investment from financial institutions. The advent of new technologies such as the telephone, automated teller machines (ATMs) and credit card systems has increased the volume of fraud losses at many banks. In this context, fraud prevention, and automatic fraud detection in particular, arises as an open field for the application of all known classification methods.

We live in a world where analysing whether each transaction is legitimate or not is very expensive. Worse than that, confirming whether a transaction was made by the client or by a fraudster by calling the card holder is cost prohibitive if we do it for all transactions.

In this context, pattern recognition systems play a very important role, since they are able to learn from past experience (frauds that happened in the past) and classify new instances (transactions) into a fraud-like group or a legitimate-like group. In the credit card business today, the most commonly used method is Neural Networks. In general, the Neural Network is implemented inside a complex workflow system that is integrated with the bank database: when a new transaction comes in, the workflow calculates all the model input variables and then the output fraud score. This fraud score is then used to decide which transactions to check manually and, among those selected, which one to check first.

Skewed data and other discussions: Fraud detection is a very difficult task for prediction models. Skewness of the data, search space dimensionality, different costs of false positives and false negatives, durability of the model and short time-to-answer are some of the main problems one has to face when developing a fraud detection model. For this article we focused our attention on the skewness of the data, comparing 5 methods that we consider to work well in such an unbalanced world. The problem of the different costs of false positives and false negatives needs an article of its own, which we intend to write before December this year. We intend to start analyzing the durability and short time-to-answer problems next year. Having set our goal, a question arose: how to compare the five methods in a way that would be acceptable to both academia and the card industry. We considered using KS, ROC Curve, Lift Curve, Hit Rate and Detection Rate. The first three are well accepted in academia but have little practical use; on the other hand, Hit Rate and Detection Rate have little meaning for academia and, even worse, cannot be interpreted separately. For these reasons we proposed the use of a cost function, based on the cost caused by the action of the fraudster (cost of the money stolen) plus the cost of analyzing each transaction (cost of the phone call).

Objective function: Our objective function is based on the confusion matrix that comes as an output of each Weka execution:

=== Confusion Matrix ===

      a      b   <-- classified as
 120220    540 |  a = N   <-- real non-fraud
    412     55 |  b = S   <-- real fraud

In the matrix above we have a total of 121,227 registers. The main diagonal gives the number of correctly classified registers (120,220 + 55 = 120,275, or 99.21%). The anti-diagonal gives the number of misclassified registers (540 + 412 = 952, or 0.79%). We can also define some important indicators used in the market.

Hit Rate (or confidence) is 55/(540+55) = 9.24% (the number of fraudulent registers classified as fraudulent divided by the total number of registers classified as fraudulent).

Detection Rate (or coverage) is 55/(412+55) = 11.8% (the number of fraudulent registers classified as fraudulent divided by the total number of fraudulent registers).
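As a minimal sketch, the indicators above can be recomputed directly from the four cells of the confusion matrix. The function name and the argument names (tn, fp, fn, tp) are ours and are not part of Weka's output.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return accuracy, hit rate and detection rate of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total         # main diagonal over all registers
    hit_rate = tp / (tp + fp)            # frauds among registers flagged as fraud
    detection_rate = tp / (tp + fn)      # flagged frauds among all real frauds
    return accuracy, hit_rate, detection_rate


# Cells of the confusion matrix shown in the text.
accuracy, hit_rate, detection_rate = confusion_metrics(tn=120220, fp=540, fn=412, tp=55)
print(f"accuracy={accuracy:.2%} hit_rate={hit_rate:.2%} detection_rate={detection_rate:.2%}")
```

Hit Rate and Detection Rate correspond to what is usually called precision and recall of the fraud class, which is why neither is informative when read in isolation.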

Faced with the problem of how to combine Hit Rate and Detection Rate into a useful objective function, we talked to specialists from a fraud prevention department and decided to use a cost-based function, since it takes care of this combination.

Two financial costs were taken into account. The first was the fraud detection cost, defined as (540 + 55) * Avg Cprev¹; assuming Avg Cprev = $ 1.00, we would spend $ 595.00 to prevent these 55 frauds from occurring. The second was the fraud loss cost, caused by the fraudulent transactions that were not prevented, defined as 412 * Avg FraudLoss²; assuming Avg FraudLoss = $ 100.00, we would have $ 41,200.00 of fraudulent authorizations passing undetected and generating financial losses. The cost function is the sum of these two costs.

Below, we see the cost for the confusion matrix presented above (a), the cost if we consider all transactions legitimate and let every fraudulent transaction go untraced (b), and finally, the cost if we investigate all transactions (c).

(a) f(x) = - [100 * 412 + 1 * 595] = - 41,795.00
(b) f(x) = - [100 * 467] = - 46,700.00
(c) f(x) = - [1 * 121227] = - 121,227.00

¹ Where Avg Cprev is defined as the average cost of analyzing a transaction.

² Where Avg FraudLoss is the average ticket of a confirmed fraudulent transaction.
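The three scenarios (a), (b) and (c) can be reproduced with a short sketch of the cost function. The function and argument names are ours; the defaults are the illustrative values Avg Cprev = $ 1.00 and Avg FraudLoss = $ 100.00 assumed in the text.

```python
def total_cost(fp, fn, tp, avg_cprev=1.0, avg_fraud_loss=100.0):
    """Negative cost: analysis of every flagged transaction plus loss of every missed fraud."""
    prevention_cost = (fp + tp) * avg_cprev   # every register classified as fraud is analyzed
    fraud_loss = fn * avg_fraud_loss          # every undetected fraud generates a loss
    return -(prevention_cost + fraud_loss)


# (a) the classifier of the confusion matrix in the text
cost_a = total_cost(fp=540, fn=412, tp=55)
# (b) nothing is flagged: all 467 frauds pass undetected
cost_b = total_cost(fp=0, fn=467, tp=0)
# (c) everything is flagged: all 121,227 transactions are analyzed
cost_c = total_cost(fp=120760, fn=0, tp=467)
print(cost_a, cost_b, cost_c)
```

With these illustrative unit costs, investigating everything (c) is far more expensive than doing nothing (b), so a useful model has to beat scenario (b).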


2 Experimental Description

The experimental part of our work began when we received and transformed the credit card database provided by the Bank. This was followed by the definition of the test methodology used in our comparisons, which relies on a genetic algorithm (GA). In this section, we describe in more detail the data transformation, the GA and the objective function.

In Section 3, we discuss the function and the domain of the parameters associated with each method.

Data transformation: Our research received support from an important bank in the Brazilian credit card market.

We received from the Bank a database with 41647 registers. Each register represents a credit card authorization, so the database contains only approved transactions: transactions denied by any credit criterion had been removed before arriving at the 41647 registers. The time window of this database is from Jul/14/2004 to Sep/12/2004.

Evaluation of the cases as fraud and non-fraud: The Bank decided to apply the following rule to assign the class of an authorization:

• If, within the 2 months after the date of the transaction (called the performance period), the client disputed the transaction (did not recognize having made it), or the bank suspected that it was not legitimate and confirmed that it was not made by the client, then the transaction is considered fraudulent.

• Otherwise it is considered legitimate.

It is important to observe that when an authorization is labeled fraudulent, the Bank is almost 100% certain about the class of the transaction. When a transaction is labeled legitimate, however, we cannot affirm that it is in fact legitimate; we can only affirm that it had not yet been identified as fraudulent within the performance window. In practice this is of little relevance since, according to the Bank, at least 80% of the frauds that occur are identified as fraudulent within the period of 2 months.


Sampling: The sampling of the transactions sent for our study had two stages:

1. The first was a random sampling of the card numbers to be analyzed in this period, regardless of whether the card had any transaction in the historical period.

2. The second was a weighted sampling by class, which selected 10% of the transactions evaluated as legitimate and 100% of the transactions evaluated as fraudulent.
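The second sampling stage can be illustrated with the sketch below. It is not the Bank's procedure: the record layout (a dict with a "fraud" flag), the function name and the toy data are assumptions made only for the example.

```python
import random


def weighted_class_sample(transactions, legit_rate=0.10, seed=0):
    """Keep every fraudulent transaction and roughly `legit_rate` of the legitimate ones."""
    rng = random.Random(seed)
    sample = []
    for tx in transactions:
        if tx["fraud"] or rng.random() < legit_rate:
            sample.append(tx)
    return sample


# Toy data (hypothetical): 1,000 transactions, the first 20 of them fraudulent.
toy = [{"id": i, "fraud": i < 20} for i in range(1000)]
sample = weighted_class_sample(toy)
n_fraud = sum(tx["fraud"] for tx in sample)
print(len(sample), n_fraud)
```

Because legitimate transactions are under-sampled, each legitimate register in the sample stands for about ten transactions of the original population, which has to be remembered when costs are extrapolated.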

Categorization: With the database effectively in our hands, the preparation can be described in four stages:

1. In the first stage, we analyzed some frequencies and decided that some variables were not important enough to be considered for the modeling (e.g. card number). Figure ?? shows how the filtering was performed; the variables marked with an x in the Stage 2 column are those that remained for the next stage.

2. The second stage was the binning of the variables. All variables except Merchant Category Code (MCC)³ were categorized into up to 10 groups, one digit only.

3. The third stage consisted of transforming the previous database into a comma separated value (csv) format.

4. The fourth stage was responsible for generating 9 sets of bases. Each set consists of a pair of bases: a database with 70% of the transactions for development (training set), and another database with 30% of the transactions for validation of the models (testing set). Table 1 shows that these bases have about the same number of frauds and legitimate transactions.

After this sampling process, all databases were converted to Weka's native format (.arff).

³ MCC got 32 categories so that it could fit the number of groups of Transaction Category Code (TCC).

base   #development frauds   #development legitimates   #validation frauds   #validation legitimates
1      1084                  27904                      475                  12184
2      1092                  28012                      467                  12076
3      1088                  28061                      471                  12027
4      1075                  28145                      484                  11943
5      1081                  28045                      478                  12043
6      1116                  27973                      443                  12115
7      1099                  28113                      460                  11975
8      1106                  27884                      453                  12204
9      1100                  28188                      459                  11960

Table 1: Number of frauds and legitimate transactions in each database

Methodology of comparison: At the beginning of our work, we planned to compare only two methods, Neural Networks and Artificial Immune Systems, but early on we expanded our study to the other three methods. After deciding which methods to compare, the question became how to compare them. Among all the works consulted, [AR05] presented itself as a very good direction. That text compares image recognition methods: it contrasts five different filters (a = Steerable, b = Quadrature steerable, c = Real Gabor, d = Complex Gabor and e = Line operator) using the Receiver Operating Characteristic (ROC) curve⁴ or, to be more exact, the AUC.

[AR05] presents an exhaustive search in the parameter space, which was possible because they had only two parameters to evaluate and a fast algorithm to run.

Finally, the text presents a two-dimensional plot of level curves for each filter. From this plot, the text calculates ρ as the proportion of AUC values greater than 0.9⁵. By comparing all the filters according to this value of ρ, the article is able to point out which method is more robust.

We evaluated the use of a methodology similar to the one in [AR05]. However, some impeding factors, such as the large number of parameters of NN and AIS, the time they take to run and the lack of a ROC/AUC implementation in Weka, forced us in another direction. As will be presented in a future work that we are still developing, the Cost Sensitive Classifier meta-heuristic implements a cost-sensitive analysis that has a very good parallel with the ROC/AUC kind of analysis.

⁴ A ROC curve is a two-dimensional depiction of the performance of a classifier (a ROC curve has a shape similar to that of a ln(x) function in a two-dimensional plot). To compare classifiers, the text reduces the ROC curve to a scalar value. The most common way to perform this reduction is to calculate the area under the ROC curve (AUC), and this was the method used. An interesting remark is that the AUC is a portion of the unit square (ROC space), so its values vary between 0.0 and 1.0. Since a random classifier has an AUC of 0.5 and classifiers worse than random are not expected in ROC space, no realistic classifier should have an AUC below 0.5.

⁵ The text calls this value the detection band.
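To make the reduction of a ROC curve to the AUC concrete, here is a small self-contained sketch using the trapezoidal rule. It is only an illustration of the measure used in [AR05], not part of our experiments; the function names and the toy scores are assumptions.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate) for decreasing thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        j = i
        # instances with the same score cross the threshold together
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a ROC curve given as a list of points, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores (hypothetical): 1 = fraud, 0 = legitimate.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
area = auc(roc_points(scores, labels))
print(area)
```

The value obtained equals the fraction of (fraud, legitimate) pairs in which the fraud receives the higher score, which is why a random scoring gives an area of about 0.5.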

2.1 Tuning the GA

[Gad08] contains a very detailed analysis showing that our GA finds the optimum; because of page restrictions we chose to omit it here.

GA utilization: For Decision Tree, Bayesian Network and Naive Bayes, GA optimization was not necessary, since they have a small number of parameters and a short running time. GA optimization was used for NN and AIS, for which an exhaustive search was impracticable.

2.2 Short GA description for parameter tuning

The first step was 50 random executions, followed by 20 GA generations.

Each GA generation combined two randomly selected candidates among the 15 best from the previous generation. For each parameter independently, the combination applied one of cross-over, mutation, random change or no action, and as the generations pass the chance of no action increases. In the end, we performed a local search around the optimum found by the GA optimization.

To clarify, we present part of the code used for the GA optimization algorithm (L is the learning rate parameter of NN):

my $avg = rand();
my $mid = int(100 * ($L1 + $L2) / 2) / 100;   # average of both parents, kept to two decimals
my $Lnew;
if ($avg < 0.10) {           # 10% chance of fusion.
    $Lnew = $mid;
} elsif ($avg < 0.20) {      # 10% chance of fusion with up mutation.
    $Lnew = int(rand() * 20) / 100 + $mid;
} elsif ($avg < 0.30) {      # 10% chance of fusion with down mutation.
    $Lnew = -int(rand() * 20) / 100 + $mid;
} elsif ($avg < 0.40) {      # 10% chance of up mutation.
    $Lnew = int(rand() * 20) / 100 + $L1;
} elsif ($avg < 0.50) {      # 10% chance of down mutation.
    $Lnew = -int(rand() * 20) / 100 + $L1;
} elsif ($avg < 0.75) {      # 25% chance of no action.
    $Lnew = $L1;
} elsif ($avg < 0.95) {      # 20% chance of cross-over.
    $Lnew = $L2;
} else {                     # 5% chance of a random choice.
    $Lnew = int(100 * (rand() + 0.001)) / 100.0;
}

L1 represents the learning rate of one of the best candidates from the previous GA generation. L2 represents the learning rate of another (different) one of the best candidates from the previous GA generation. Lnew represents the learning rate of the new candidate created in this generation to be evaluated. rand() is the function that generates random numbers between 0 and 1.0.

The algorithm did the same for the other parameters as for the learning rate, for a total of 5 of the 7 parameters of NN and 7 of the 9 parameters of AIS.
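The overall search (50 random executions followed by 20 generations bred from the 15 best candidates) can be sketched as below. This is a simplified illustration under several assumptions that are ours: the toy cost function stands in for a full Weka run, the number of children per generation (n_children) is hypothetical, the operator probabilities are kept fixed instead of increasing the chance of no action over the generations, values are clamped to [0, 1], and the final local search is omitted.

```python
import random


def breed(p1, p2, rng):
    """Combine one parameter of two parents with the probabilities of the excerpt above."""
    r = rng.random()
    mid = (p1 + p2) / 2
    step = rng.randrange(20) / 100         # mutation size in [0.00, 0.19]
    if r < 0.10:
        new = mid                          # fusion
    elif r < 0.20:
        new = mid + step                   # fusion with up mutation
    elif r < 0.30:
        new = mid - step                   # fusion with down mutation
    elif r < 0.40:
        new = p1 + step                    # up mutation
    elif r < 0.50:
        new = p1 - step                    # down mutation
    elif r < 0.75:
        new = p1                           # no action
    elif r < 0.95:
        new = p2                           # cross-over
    else:
        new = rng.random()                 # random choice
    return min(1.0, max(0.0, round(new, 2)))   # assumption: keep values inside [0, 1]


def tune(cost, n_params, rng, n_random=50, n_generations=20, n_best=15, n_children=15):
    """Random start followed by GA generations; returns (lowest cost, its parameters)."""
    population = []
    for _ in range(n_random):
        params = [round(rng.random(), 2) for _ in range(n_params)]
        population.append((cost(params), params))
    for _ in range(n_generations):
        best = sorted(population)[:n_best]
        for _ in range(n_children):
            parent_a, parent_b = rng.sample(best, 2)   # two different candidates
            child = [breed(x, y, rng) for x, y in zip(parent_a[1], parent_b[1])]
            population.append((cost(child), child))
    return min(population)


def toy_cost(params):
    """Hypothetical stand-in for the cost of one classifier execution."""
    return sum((x - 0.3) ** 2 for x in params)


rng = random.Random(42)
best_cost, best_params = tune(toy_cost, n_params=2, rng=rng)
print(best_cost, best_params)
```

Here the cost is minimized, which is equivalent to maximizing the negative cost f(x) used as objective function in the text.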

3 Parameter description

Instead of presenting the methods in detail or describing Weka’s implementations, we chose to discuss only what matters most for our tests: the set of parameters of each method. All information about the parameters was retrieved from [WF08] and [WF05].

Decision Tree: DT has two parameters (C, M):

• C confidence. Set confidence threshold for pruning. (Default: 0.25)

• M number. Set minimum number of instances per leaf. (Default: 2)

Neural Network: NN has seven parameters (L, M, N, V, S, E, H):

• L num. Set the learning rate. (default 0.3). The closer to zero, the smaller the impact of newly incoming information.

• M num. Set the momentum. (default 0.2). It varies between 0.00 and 1.00, and its inclusion (values greater than zero) aims to increase the speed of training of a neural net and to reduce instability.

• N num. Set the number of epochs to train through. (default 500). Because of state space restrictions, and knowing after tests that using N greater than 500 did not increase the performance much, we decided not to vary this parameter, fixing it at its default of 500.


• V num. Set the percentage size of the validation set taken from the training set. (default 0: no validation set is used, the number of epochs is used instead). It varies between 0% and 99.99%; when this parameter is greater than zero it is intended to reduce overfitting. If V = 0 then E loses its meaning.

• S num. Set the seed for the random number generator. (default 0). In our case, as we wanted to eliminate any random factors, we decided not to vary this parameter, fixing it at zero.

• E num. Set the threshold for the number of consecutive errors allowed during validation testing. (default 20). A number between 1 and 100. In combination with the parameter -N, it forms the stop conditions of the algorithm.

• H str. Set the number of nodes to be used on each layer. Each number represents its own layer and the number of nodes on that layer. The numbers should be comma separated. There are also the wildcards ’a’, ’i’, ’o’, ’t’. (default 4). This sets what the hidden layers are made up of when auto build is enabled. To have no hidden units, just put a single 0; any more 0’s will indicate that the string is badly formed and make it unaccepted, and negative numbers and floats will do the same. The wildcards are ’a’ = (number of attributes + number of classes) / 2, ’i’ = number of attributes, ’o’ = number of classes, and ’t’ = number of attributes + number of classes.
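The rules for the H string can be illustrated with the following sketch, which turns a specification such as "a,4" into a list of layer sizes. It is our own illustration of the description above and not Weka code; the function name is hypothetical and the integer division used for ’a’ is an assumption.

```python
def hidden_layer_sizes(spec, n_attributes, n_classes):
    """Translate an -H string such as "a,4" into a list of hidden layer sizes."""
    wildcards = {
        "a": (n_attributes + n_classes) // 2,   # assumption: integer division
        "i": n_attributes,
        "o": n_classes,
        "t": n_attributes + n_classes,
    }
    tokens = [token.strip() for token in spec.split(",")]
    if tokens == ["0"]:
        return []                               # a single 0 means no hidden units
    sizes = []
    for token in tokens:
        if token in wildcards:
            sizes.append(wildcards[token])
        elif token.isdigit() and int(token) > 0:
            sizes.append(int(token))
        else:
            # extra zeros, negative numbers and floats are rejected
            raise ValueError(f"badly formed layer specification: {spec!r}")
    return sizes


print(hidden_layer_sizes("a", n_attributes=10, n_classes=2))
print(hidden_layer_sizes("20", n_attributes=10, n_classes=2))
```

With 10 attributes and 2 classes, for instance, "a" gives one hidden layer of 6 nodes and "i,o" gives two hidden layers of 10 and 2 nodes.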

Naive Bayes: NB does not have any parameters.

Bayesian Network: BN has three parameters (D, Q, E):

• D defines whether or not a structure called ADTree is used;

• Q defines which topology search algorithm is used. The available ones are GeneticSearch, HillClimber, K2, LocalScoreSearchAlgorithm, RepeatedHillClimber, SimulatedAnnealing, TabuSearch and TAN. The search algorithms have two parameters:

– P defines the number of parents allowed in the topology.

– S defines the type of score used to build the conditional table. It can be BAYES, BDeu, MDL, ENTROPY or AIC;

• E defines the estimator, which is the algorithm that calculates the conditional tables. In Weka it can be BayesNetEstimator, BMAEstimator, MultiNomialBMAEstimator or SimpleEstimator. The estimator has one parameter (A), called alpha, which varies between 0% and 100% and represents a start value for the conditional probability.

Artificial Immune System: AIS has 10 parameters (S, F, C, H, M, R, V, A, E, K)⁶:

• S num. Set the seed for the random number generator. (default 0). In our case, as we wanted to eliminate any random factors, we decided not to vary this parameter, fixing it at 1 (one). For AIS, however, we found convergence problems on some of the databases we tested, so in some cases we had to change this parameter.

• F perc. Set the minimum percentage affinity threshold (see [WTB04], page 6);

• C num. Clonal rate is an integer that determines the rate of clones. It varies between 0 and 100;

• H num. Hyper-mutation rate. It varies between 0 and 100, and determines the percentage of clones (from the previous parameter) that will undergo mutation;

• M perc. Mutation rate is a value between 0 and 1 that corresponds to the probability of a given characteristic being mutated. It only applies to the clones chosen to be mutated (previous parameter);

• R num. Total resources is the maximum number of B-Cells (or ARBs) allowed in the system;

• V perc. Stimulation threshold is a number between 0 and 1 used as the criterion to keep or drop a given B-Cell;

• A num. Number of affinity threshold instances. Unfortunately, this parameter is not explained in [Bro05], so we decided not to change it, fixing it at its default of -1;

• E num. Memory pool size. Defines the number of random initialisation instances. For simplicity we varied it between 0 and 10;

• K num. K nearest neighbors is the number of B-Cells to be matched and consulted in a vote on which class the current transaction belongs to (fraud or legitimate). K equal to 1 means no voting.

⁶ The implementation used here is version 1.6 (March 2006) by Jason Brownlee [Bro05]. Brownlee states that he implemented the algorithm found in [WTB04].

4 Summary of Results

This section gives a summarized view of the results reached in our executions.

To clarify, the five compared methods were: Naive Bayes (NB), Neural Network (NN), Bayesian Network (BN), Artificial Immune System (AIS) and Decision Tree (DT).

The execution strategies for a given method M were Stand^DEF(M)⁷, Stand^GA(M) and Stand^STA(M).

Analysis by method: Figure 1 shows in a very condensed way how each method performed under each strategy we used. We chose to show here only the results for the 3 evaluation databases. From this figure we can see that:

• NB has no parameters to tune;

• NN obtained a 23.33%⁸ decrease in cost when we applied the GA, relative to the execution with default parameters;

⁷ Stand must be interpreted as the standard execution of Weka (without any meta-heuristics); ^DEF must be interpreted as the use of Weka's default parameters; ^GA must be interpreted as the use of the set of parameters optimized by the genetic algorithm for each database; and ^STA must be interpreted as the use of the optimal stable set of parameters (found in a semi-exhaustive search for NN and AIS on 6 of the 9 databases).

⁸ 23.33% = 1 − $ 29.98 thousand / $ 39.10 thousand = 1 − Stand^GA(NN)/Stand^DEF(NN)

Method   Stand^DEF   Cost^DEF   Stand^GA   Cost^GA   Stand^STA   Cost^STA
DT       32.76       21.76      27.84      19.17     27.87       19.52
AIS      35.66       30.80      24.97      21.56     23.30       21.92
BN       28.91       31.75      28.90      22.25     28.90       23.10
NN       39.10       25.76      29.98      22.07     36.33       23.34
NB       30.44       31.04      30.44      31.23     30.44       31.23

Figure 1: Summary of the results by method: average cost (R$ thousand) on bases 8, 9 and 1 for each method and strategy.

• BN, like NB, did not reduce its cost: changing the maximum number of parents of a node in the BN was not enough to modify the probabilities to the point of changing the classification;

• AIS obtained a very interesting result. The use of the GA reduced the cost by 29.98%⁹ relative to the execution with default parameters;

• Meanwhile, DT reached a 15.01%¹⁰ reduction in cost relative to the execution with default parameters.

For all methods except BN and NB, we see that, in general, we obtain considerable reductions by using the genetic algorithm for the parametric search.
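The reductions quoted in this list follow from a one-line formula, 1 − tuned cost / default cost. The sketch below recomputes them from the average costs of Figure 1 (in thousands, standard execution with default and with GA parameters); the variable names are ours, and small differences in the last digit can come from the rounding of the figure values.

```python
# Average cost (thousands, bases 8, 9 and 1) as read from Figure 1.
stand_def = {"DT": 32.76, "AIS": 35.66, "BN": 28.91, "NN": 39.10, "NB": 30.44}
stand_ga = {"DT": 27.84, "AIS": 24.97, "BN": 28.90, "NN": 29.98, "NB": 30.44}


def reduction(default_cost, tuned_cost):
    """Relative cost reduction obtained by tuning, as a fraction of the default cost."""
    return 1 - tuned_cost / default_cost


reductions = {method: reduction(stand_def[method], stand_ga[method]) for method in stand_def}
for method, value in sorted(reductions.items(), key=lambda item: -item[1]):
    print(f"{method}: {value:.2%}")
```

Sorting by reduction makes the ordering discussed above explicit: AIS benefits most from the GA, followed by NN and DT, while BN and NB are essentially unchanged.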

Analysis by strategy: We also present another way to compare the strategies. Figure 2 shows a different view of the results, which makes it easier to establish an order between the methods and, in particular, to identify the best method for each type of strategy. In this figure we can see that:

• For strategy Stand^DEF(M), BN obtained an unquestionably better result, while AIS and NN had very bad results compared to the other methods. In particular, NN improved by only 15.4%¹¹ relative to a strategy that considers all transactions legitimate;

• For Stand^GA(M), we verify that almost all the methods obtained a reduction in comparison to Stand^DEF(M). The method that reduced its cost the most, taking first place in the ranking, was AIS;

• When we analyze the execution Stand^STA(M), we see two important facts. The first is an abrupt increase in the cost of Stand^STA(NN) relative to Stand^GA(NN), which shows the overfitting tendency of NN with optimized parameters. The second is the reduction of the cost of Stand^STA(AIS) relative to Stand^GA(AIS).

⁹ 29.98% = 1 − $ 24.97 thousand / $ 35.66 thousand = 1 − Stand^GA(AIS)/Stand^DEF(AIS)

¹⁰ 15.01% = 1 − $ 27.84 thousand / $ 32.76 thousand = 1 − Stand^GA(DT)/Stand^DEF(DT)

¹¹ 15.4% = 1 − $ 39.1 thousand / $ 46.2 thousand, where $ 46.2 thousand corresponds to the average cost of bases 8, 9 and 1 when one decides not to create a fraud detection area.

Figure 2: Summary of the results by strategy: average cost (R$ thousand) on bases 8, 9 and 1 for NB, NN, BN, AIS and DT under the strategies Stand^DEF, Cost^DEF, Stand^GA, Cost^GA, Stand^STA and Cost^STA (the same values as Figure 1, grouped by strategy). Because of a technical limitation of the spreadsheet used, for a given method the error bar shown for the six strategies in the figure has to be the same; we adopted the maximum of the six standard deviations.

We suppose that this happened because AIS has more parameters than all the other methods, and therefore the largest parametric search space. When the parametric space is reduced by freezing some parameters during the process of stabilization of the parameters, a more efficient optimization can be observed. This phenomenon is often referred to as the “Curse of Dimensionality”.

• The last strategy is Cost^STA(M). Here the superiority of DT over the other methods is clear, AIS comfortably takes second place, NN and BN tie for third place, and NB turns out to be the worst method.

Finally, we highlight that Naive Bayes was the worst among the compared methods, Neural Network loses much performance when a set of stable parameters is required, and Artificial Immune System and Decision Tree presented themselves as the best methods.

Analysis of the set of optimal stable parameters: As a complement to the summary (the first part of this section), Figure 3 synthesizes the results of the search for optimal stable parameters for the 9 databases, the 5 methods and the Standard execution. Moreover, Table 2 presents the set of optimal stable parameters for each method.

Method   Average cost on validation   Stable parameters
DT       -$ 27,870.66                 -C 0.49 -M 1
NN       -$ 36,332.33                 -L 0.40 -M 0.12 -N 500 -V 0 -S 0 -E 0 -H 20
NB       -$ 30,439.33                 n/a
BN       -$ 28,901.66                 -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5
AIS      -$ 23,303.00                 -S 1 -F 0 -C 30 -H 10 -R 177 -V 1 -A -1 -E 5 -K 1

Table 2: Summary of optimal stable parameters

At first glance, we can affirm that for DT we have a tree with minimum pruning according to parameter M. For NN, we see that the parameters L and M reached very interesting values, with a large L (learning rate) and a very small M (momentum); this allows us to draw a parallel with DT, saying that, like DT, NN takes a step towards less pruning and more overfitting. BN did not change in the way we desired. Finally, for AIS, we obtained a very good set of parameters from the GA execution, which made the semi-exhaustive search for optimal stable parameters easy. One of the most surprising results was K equal to 1, which means that no voting is necessary: the first rule to match decides the class.

Development bases, GA, standard execution, stable parameters:

Method   Base 2    Base 3    Base 4    Base 5    Base 6    Base 7    Stand^STA   Stand^GA   Worsening (%)
DT       -27,133   -27,467   -30,308   -28,996   -27,326   -26,403   -27,939     -27,932    0.0%
NN       -30,765   -31,382   -36,530   -33,406   -29,558   -31,491   -32,189     -30,960    4.0%
NB       -30,787   -30,227   -30,961   -30,493   -30,325   -31,678   -30,745     -30,745    0.0%
BN       -28,375   -28,478   -30,880   -30,880   -28,946   -28,712   -29,379     -29,125    0.9%
AIS      -22,353   -23,330   -25,697   -24,750   -22,389   -21,612   -23,355     -23,303    0.2%

Development bases, GA, cost sensitive, stable parameters:

Method   Base 2    Base 3    Base 4    Base 5    Base 6    Base 7    Cost^STA    Cost^GA    Worsening (%)
DT       -21,814   -19,505   -21,197   -20,677   -20,845   -20,559   -20,766     -20,419    1.7%
NN       -23,531   -22,351   -23,262   -24,047   -22,865   -22,428   -23,081     -22,216    3.9%
NB       -29,966   -31,031   -29,618   -31,526   -32,141   -30,225   -30,751     -30,751    0.0%
BN       -24,042   -22,475   -24,141   -23,787   -22,582   -22,116   -23,191     -22,998    0.8%
AIS      -21,755   -23,174   -23,971   -23,832   -22,322   -22,465   -22,920     -22,314    2.7%

Evaluation bases, GA, standard execution, stable parameters:

Method   Base 8    Base 9    Base 1    Stand^STA   Stand^GA   Worsening (%)
DT       -27,063   -27,333   -29,216   -27,871     -27,841    0.1%
NN       -36,913   -39,549   -32,535   -36,332     -30,047    20.9%
NB       -29,946   -30,155   -31,217   -30,439     -30,439    0.0%
BN       -28,092   -28,755   -29,858   -28,902     -28,902    0.0%
AIS      -22,797   -23,860   -23,252   -23,303     -24,966    -6.7%

Evaluation bases, GA, cost sensitive, stable parameters:

Method   Base 8    Base 9    Base 1    Cost^STA    Cost^GA    Worsening (%)
DT       -19,584   -19,856   -19,131   -19,524     -19,167    1.9%
NN       -23,427   -22,842   -23,752   -23,340     -22,314    4.6%
NB       -30,446   -32,444   -30,786   -31,225     -31,225    0.0%
BN       -22,432   -23,473   -23,399   -23,101     -22,246    3.8%
AIS      -21,682   -23,130   -22,007   -22,273     -21,557    3.3%

Figure 3: Results of the search for stable parameters: executions of the methods with the genetic algorithm plus a semi-exhaustive search to find stable parameters.
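The last two columns of Figure 3 can be checked with a short computation. The sketch assumes that the stable-parameter column is the plain average over the bases and that the worsening is the relative increase of that average over the GA average; this reading is ours, chosen because it reproduces the values of the figure. Only two rows of the evaluation bases (standard execution) are used.

```python
def worsening(stable_cost, ga_cost):
    """Relative increase in absolute cost when moving from GA to stable parameters."""
    return abs(stable_cost) / abs(ga_cost) - 1


# Evaluation bases 8, 9 and 1, standard execution (values read from Figure 3).
per_base = {
    "NN": [-36913, -39549, -32535],
    "AIS": [-22797, -23860, -23252],
}
ga_average = {"NN": -30047, "AIS": -24966}

stable_average = {m: sum(costs) / len(costs) for m, costs in per_base.items()}
change = {m: worsening(stable_average[m], ga_average[m]) for m in per_base}
for method in per_base:
    print(method, round(stable_average[method], 2), f"{change[method]:.1%}")
```

A positive value means that the stable parameters cost more than the per-database GA parameters (NN), while a negative value means that the stable parameters were actually cheaper (AIS).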

5 Conclusions and future works

Our work consisted of presenting a comparison of five classification methods (Decision Tree, Neural Network, Bayesian Network, Naive Bayes and Artificial Immune System). All the implementations used came from the basic package of Weka-3-4-11, except AIS, whose author was Brownlee [Bro05]. Through the analysis of the results of the tests that we carried out, we could arrive at some conclusions.

Perhaps because DT is a classic classification method, it has been forgotten in recent works; however, it still reveals itself as one of the best methods, with quite competitive results.

The definition of the objective function to be optimized has a crucial impact on the quality of the generated model.

In all our executions, except for NB, which has no parameters, we verified that the best results were not reached with the default set of parameters. In particular, for NN and AIS the results obtained with default parameters are far worse than those obtained after a parametric adjustment using GA. Unfortunately, having in hand all the results we reached, we dare say that it is practically possible to generate any ordering of the five methods through an appropriate choice of parameters for each one.

Our tests reproduced the results of Maes in [MTV00] in finding that BN is better than NN; this occurred in all our evaluated tests.

NN is the most used method in real applications today; however, our tests showed that it is almost always among the worst methods.

In our tests, AIS had a surprising increase in performance from default parameters to GA-optimized parameters, and its parameters were so stable that it still performed well after the search for a set of stable parameters. This performance gave it the lead in our tests.

In addition to confirming the award-winning results of Maes, who pointed out that BN is better than NN at detecting fraud, we concluded that DT also belongs to the group of methods that surpass NN.