
A Comparative Analysis of Features Selection Techniques Using Genetic Algorithm in Keystroke Dynamics




Federal University of Rio Grande do Norte Center for Agricultural Sciences

Technology in Systems Analysis and Development

A Comparative Analysis of Features

Selection Techniques Using Genetic

Algorithm in Keystroke Dynamics

Tuany Mariah Lima do Nascimento

Macaíba 2019


Tuany Mariah Lima do Nascimento

A Comparative Analysis of Features Selection Techniques Using

Genetic Algorithm in Keystroke Dynamics

Undergraduate report submitted to the Academic Unit Specialized in Agricultural Sciences of the Federal University of Rio Grande do Norte as a partial requirement to obtain the degree of Technologist in System Analysis and Development

Supervisor: Laura Emmanuella Alves dos Santos Santana de Oliveira

Co-supervisor: Márjory da Costa Abreu

Macaíba

2019


Nascimento, Tuany Mariah Lima do.

A comparative analysis of features selection techniques using genetic algorithm in keystroke dynamics / Tuany Mariah Lima do Nascimento. - 2019.

37 leaves: ill.

Monograph (technologist degree) - Federal University of Rio Grande do Norte, Academic Unit Specialized in Agricultural Sciences, Undergraduate Program in Systems Analysis and Development.

Macaíba, RN, 2019.

Supervisor: Profa. Dra. Laura Emmanuella Alves dos Santos Santana.

Co-supervisor: Profa. Dra. Márjory da Costa Abreu.

1. Genetic algorithms - Monograph. 2. Features selection - Monograph. 3. Gender recognition - Monograph. I. Santana, Laura Emmanuella Alves dos Santos. II. Abreu, Márjory da Costa. III. Title.

RN/UF/BSPRH CDU 004.023

Universidade Federal do Rio Grande do Norte - UFRN, Sistema de Bibliotecas - SISBI

Cataloging-in-publication data. UFRN, Biblioteca Setorial Prof. Rodolfo Helinski, Escola Agrícola de Jundiaí - EAJ



Supervisor: Profa. Dra. Laura Emmanuella Alves dos Santos Santana de Oliveira
Federal University of Rio Grande do Norte - EAJ/UFRN

Co-supervisor: Profa. Dra. Márjory da Costa Abreu
Federal University of Rio Grande do Norte - DIMAp/UFRN

Prof. Dr. Josenalde Barbosa de Oliveira
Federal University of Rio Grande do Norte - EAJ/UFRN

Prof. Dr. Daniel Sabino Amorim de Araújo
Federal University of Rio Grande do Norte


ACKNOWLEDGEMENTS

This is probably the only part of this work that was originally written in Portuguese. So... There is no other way to begin these acknowledgements than by thanking God, who made everything, who loved me so much and took care of me, and who gave me patience and wisdom to face difficult moments.

To my family, my parents, Dilma Dorotéia and Eliseu Inácio, for always being with me, supporting me and listening to me; even without understanding anything about my algorithms, they were present at every achievement and every successful compilation. They always told me, "Education is the greatest inheritance a father can give his child. So, my daughter, study!", and always did everything they could to invest in my education and in that of my sister, Clara Juliany, whom I also thank for all her support and understanding, and for picking me up at DIMAp when I needed to stay late studying (that includes you too, J. Danilo).

To my long-time friends, Kadja Kaline, Samyr Saraiva, Geovana Quixabeira, Leonardo Cardoso de Mello, Andreza Duarte, Nathália Duarte, Lucas Genuíno and Dereck Mutran: I not only thank you for all the support, for listening to me and for always being by my side, but I also want to apologize for my absence.

I also thank the faculty of the program, from whom I learned not only inside the classroom but outside it as well, and especially my supervisor and "undergraduate mother", Laura Emmanuella, who was with me from the first to the last semester. I thank her not only for the knowledge shared inside and outside the classroom, but also for trusting me to proctor her exams (whenever there is one, you can call me), for her understanding, scoldings, advice and rides, and for believing in me even when I did not believe in myself and thought about giving up. I feel privileged and honored to have met and worked with a person who conveys such inner peace and calm. I also thank Professor Márjory da Costa Abreu for welcoming me into her undergraduate research project, and for all the encouragement and the correction and revision of my English (sorry if it was a lot of work). Truly, this work would not have been possible without the encouragement of you both.

To the TADS crowd, for all the encouragement: Fábio Henrique (for being with me in this great marathon, for all the advice and concern; I cannot forget what you did for me and will always be grateful, THANK YOU SO MUCH!), André Gomes, Ayrton Barreto (thanks for the rides, for always letting me drive your car and for the incredible outings), Hudson Silva, Lucas Andrade and J. Gabriel (I will remember you not only for the political debates, but also for the technology know-how and the very sweet oranges at the R.U., hahaha), Lucas Antônio, Adryel Julião, Adriano Macedo, Eric Couto, Larysse Savanna and Joffrey Peyrac (for always advising me and staying with me, trying to calm me down in this time of great anxiety), Joel Santos and Iaslan Nascimento (my favorite teaching assistants), Bruno (for all the support and friendship, and for always studying with me), Laércio Medeiros, Elidiel, Pedro PP, Jailton Câmara, Andrelyne Vitória (my assistant and co-advisee, for helping me run the experiments), Luis Fernando (thank you for always lightening the mood on stressful days), the support staff and all the students of the technical course in informatics, for all the fun moments in the corridors.

And finally, to myself, Tuany Mariah, author of this work, for having accepted and faced every challenge, even when afraid. For having managed (or tried) to keep a level head in such difficult moments, and for not letting anyone tell you that you are not capable. If you have a dream, chase it, and keep grabbing the opportunities that knock on your door; I believe you are on the right path! Just never stop studying, staying humble, and improving every day personally, academically and professionally.


ABSTRACT

Due to the continuous use of social networks, users can be exposed to risky situations such as threats from paedophiles. One way to investigate an alleged paedophile is to verify whether the gender the individual claims is genuine. One possible technique for this is keystroke dynamics analysis. However, this technique can extract many attributes, and the presence of redundant and irrelevant attributes can harm the accuracy of the classifier. Therefore, the present work presents a comparative analysis of two attribute selection approaches, wrapper and hybrid (wrapper + filter), using a genetic algorithm as the metaheuristic, KNN, SVM and Naive Bayes as classifiers, and Correlation and Relief as filters. The best results were obtained by the SVM classifier with the wrapper approach, for both databases.


LIST OF FIGURES

Figure 1 – Users in the age group of 9 to 17 years who knew strangers by means of communication

Figure 2 – Keystroke

Figure 3 – Filter

Figure 4 – Wrapper

Figure 5 – Flowchart

Figure 6 – Crossover

Figure 7 – Using the filter approach in the Crossover

Figure 8 – Box plot of Accuracy and Attributes for the Brazilian hand-based behavioral biometrics Database


CONTENTS

1 INTRODUCTION

1.1 Justification

1.2 Objectives

2 RELATED WORKS

3 THEORETICAL REFERENCE

3.1 Keystroke Dynamics

3.2 Feature Selection

3.3 Genetic Algorithms

3.4 Genetic Algorithm Proposed

4 METHODOLOGY

4.1 Databases

4.2 Classifiers

4.3 Filters

5 RESULTS AND DISCUSSION

6 CONCLUSION AND FUTURE WORKS


1 INTRODUCTION

The use of social networks has grown enormously, connecting different groups of people and allowing them to share information. However, users can be exposed to risky situations, and paedophilia is one of them, as shown in Figure 1. One way to investigate an alleged paedophile is to validate whether the individual is, in fact, of the gender he or she claims to be. A possible technique for this is keystroke dynamics (ANANYA; SINGH, 2018). Keystroke dynamics is a behavioral biometric that consists of analyzing the way a user types at a terminal, monitoring the keyboard in order to identify the user based on his/her habitual typing rhythm. Keystroke dynamics can be extracted from fixed text or free text, that is, from texts that are predetermined for all individuals under observation or from periodic monitoring of the user's keystrokes, respectively (DARABSEH; NAMIN, 2015). This technique is accessible because no external hardware other than the keyboard is required to collect the data (REVATHY; KARNAN; SUNDARAM, 2017; KOSTYUCHENKO et al., 2018).

Figure 1 – Users in the age group of 9 to 17 years who knew strangers by means of communication

Studies such as Tsimperidis, Arampatzis and Karakos (2018) and Antal and Nemes (2016) have shown that keystroke dynamics is a viable technique for gender recognition. However, this technique can extract many attributes, and dealing with a large number of attributes can have a negative impact. Intuitively, the greater the number of attributes in a database, the easier it should be to extract knowledge models. In practice this may not be true: irrelevant or redundant attributes and noise can confuse the classifier and impair its accuracy, and the computational cost increases exponentially with the number of attributes, making the model difficult to construct. This problem is known as the curse of dimensionality (FACELI et al., ). For this reason, attribute selection has become one of the main data preprocessing steps for tasks such as data mining, machine learning and pattern recognition. Its main objective is to select a subset of relevant attributes from all the attributes available for the task (SANTANA, 2012).

Feature selection can be done using a filter-based or a wrapper-based approach. In the filter-based approach, feature selection is performed independently of the classifier, taking into account some evaluation metric. The filter approach tends to select a larger number of features in its subsets and to have a much lower computational cost than the wrapper approach (SANTANA; CANUTO, 2014). The wrapper approach uses the accuracy of the classifier to evaluate each subset of selected features, which requires high computational power (SHEN et al., 2017).

Tsimperidis, Arampatzis and Karakos (2018) and Antal and Nemes (2016) used the filter approach to feature selection for the problem of gender recognition using keystroke dynamics. The first used information gain as the evaluation metric, while the second used correlation to evaluate the selected subsets.

Other works used the wrapper approach to select features from keystroke dynamics data, but for individual identification, such as Nisha and Kumar (2014), Revathy, Karnan and Sundaram (2017) and Darabseh and Namin (2015). These works used Genetic Algorithms and Particle Swarm Optimization as metaheuristics, and the Multilayer Perceptron neural network, Support Vector Machine and K-Nearest Neighbor as classification algorithms.

Kawamura and Chakraborty (2017) proposed a hybrid approach that combines the filter and wrapper approaches to find an ideal subset of features, applied to general-purpose databases available in the UCI repository (DUA; GRAFF, 2017). In the first step, the filter approach was used to rank the features; in the next step, the highly ranked features were used in the wrapper approach with a genetic algorithm and particle swarm optimization.

Thus, the present work aims to perform a comparative analysis of feature selection techniques for gender recognition on keystroke dynamics databases, using a wrapper approach with genetic algorithms and a hybrid approach that applies filters to rank the features during the crossover step of the genetic algorithm.

This work is organized as follows. Chapter 2 presents some of the main related works. Chapter 3 presents basic concepts about keystroke dynamics, feature selection, genetic algorithms and the genetic algorithm proposed in the present work. The methods used in the work are described in Chapter 4, and the results are presented in Chapter 5.


1.1 JUSTIFICATION

The work of Othman et al. (2016) shows that feature selection performed by metaheuristics in the context of gender recognition offers a better accuracy rate than other types of feature selection. The work of Kawamura and Chakraborty (2017) shows that the hybrid approach is effective in obtaining a smaller, optimal subset of attributes with high accuracy and low computational cost.

Previous works indicate that feature selection improves the performance of classification systems, that gender recognition from keystroke dynamics databases is feasible, and that the wrapper and hybrid approaches are effective solutions in other contexts. This work therefore makes it possible to identify which feature selection techniques result in smaller subsets of attributes with increased system accuracy.

This chapter gave an introduction to the work that will be presented, as well as motivation.

1.2 OBJECTIVES

The general objectives of the present work are (i) to compare the wrapper and hybrid approaches to feature selection on keystroke dynamics databases for gender recognition; and (ii) to identify the combination of classifier and feature selection technique that results in the greatest accuracy for gender recognition.

As a specific objective, it is hoped to increase the accuracy of gender identification systems based on keystroke dynamics, allowing a more effective use of this type of system.


2 RELATED WORKS

Gender recognition can be applied in biometric identification systems based on the recognition of speech, face or iris. In some cases, gender classification from the face is performed as an earlier step (OTHMAN et al., 2016).

Antal and Nemes (2016) used two databases in their experiments. The first contained 38 female and 38 male users, and the data were collected from swipes made while answering a questionnaire of 58 questions. The second database has 18 male and 18 female users, and the data were collected from key presses. For feature selection, the authors used the Weka data mining software (HALL et al., 2009) with the filter approach, using correlation as the evaluation method. Correlation-based attribute ranking sorts the attributes according to their correlation with the class. Random Forest was used as the classifier to evaluate the data sets. The authors used two types of cross-validation, standard 10-fold cross-validation and leave-one-user-out cross-validation, to evaluate the accuracy of the methods. Classification was performed on both databases with all attributes and then with the subset of five attributes that remained after selection with the filter approach. The results were promising with cross-validation on both data sets.

Fairhurst and Costa-Abreu (2011) proposed the classification of user identity and gender using the GREYC database, which has 60 attributes and 98 male and 35 female users. The authors used 10-fold cross-validation with the KNN, Decision Tree and Naive Bayes classifiers, and also investigated the combination of classifiers using some known techniques, namely Dynamic Classifier Selection based on Local Accuracy (DCS-LA), Majority Vote and Sum.

Tsimperidis, Arampatzis and Karakos (2018) preferred free text because it integrates better with the volunteers' regular typing activities and is less intrusive. To create a new database, they developed a free-text keylogger, that is, a hardware or software tool that monitors the keys typed on the keyboard and stores this information in a log file (RAHIM et al., 2018), which can be installed on any computer with the Windows operating system. A total of 248 log files were generated, 125 from male and 123 from female users. About 10,000 known and unknown attributes were collected. For feature selection, the authors used the filter approach with information measures. Several experiments were performed using five classifiers, SVM, MLP, NB, RF and RBFN, with the number of attributes ranging from 50 to 400. The RBFN model predicted the gender of unknown users with an accuracy of 95.6%.


In this chapter, we discussed several related works, presenting some steps of attribute selection in keystroke dynamics databases, the fine-tuning used to improve the performance of recognition systems, and relevant information about the databases used. In the works of Tsimperidis, Arampatzis and Karakos (2018) and Antal and Nemes (2016), selection was done using the filter approach, whereas the present work uses the wrapper and hybrid approaches. Fairhurst and Costa-Abreu (2011) also used the GREYC database for gender recognition, but in order to investigate a combination of classifiers.


3 THEORETICAL REFERENCE

In this chapter we address some basic concepts in order to establish a theoretical basis for the present work.

3.1 KEYSTROKE DYNAMICS

Also known as keyboard dynamics, keystroke dynamics is considered a behavioral biometric technique whose purpose is to identify users based on the way they type. Several features can be extracted from a keystroke sequence, such as the press time of the first key, the flight time between key presses and the press time of the second key, as shown in Figure 2 (ANUSAS-AMORNKUL; WANGSUK, 2015).

Figure 2 – Keystroke

Source: The author
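The three timing features named above can be illustrated with a minimal sketch. It assumes that each key event is available as a (key, press_time, release_time) tuple in milliseconds and that flight time is measured from the release of the first key to the press of the second; the event format, the function name digraph_features and the sample timestamps are hypothetical and are not taken from the databases used in this work.

```python
def digraph_features(first, second):
    """Return (dwell_first, flight, dwell_second) in milliseconds.

    Each argument is a hypothetical (key, press_time, release_time) tuple.
    """
    _, press1, release1 = first
    _, press2, release2 = second
    dwell_first = release1 - press1    # how long the first key was held down
    flight = press2 - release1         # gap between the keys (assumed definition)
    dwell_second = release2 - press2   # how long the second key was held down
    return dwell_first, flight, dwell_second


# Made-up timestamps for the digraph "ME"
features = digraph_features(("M", 0, 95), ("E", 140, 230))
print(features)
```

With three values per digraph, a database built this way has three attributes for every digraph it covers.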

The identification of a user through keystroke dynamics compares the current activity with a similar stored sample captured during enrollment. In order to reduce unauthorized access, most applications aim to detect significant differences in computer use (FAIRHURST; COSTA-ABREU, 2011).

The methods for analyzing a user's keystroke dynamics can be divided into two large groups, fixed text and free text. Fixed-text authentication is based on texts obtained by entering a predetermined word, such as a login and password. Free-text authentication is based on continuous monitoring (KIM; KIM; KANG, 2018; KOSTYUCHENKO et al., 2018; DARABSEH; NAMIN, 2015; FAIRHURST; COSTA-ABREU, 2011). The data can be obtained through the keyboard, without additional hardware. However, the use of keystroke dynamics is meaningful only for users who have their own characteristic typing rhythm (VENKO; KUMAR, 2016).

Because it is a behavioral biometric, keystroke dynamics has a degree of variability that physiological biometrics do not, even without considering the psychological and physiological state of the individual under observation. So even if the user tries to provide the samples consistently, there will be more variability than with physiological biometrics (GIOT; EL-ABED; ROSENBERGER, 2009; KANG; CHO, 2009; HWANG; LEE; CHO, 2009). Physiological characteristics tend to be more stable, although some behavioral modalities, such as handwriting and voice patterns, present a manageable level of uniformity. When typing on a keyboard, however, it is difficult to maintain significant control over, for example, the number of milliseconds for which a key is pressed (FAIRHURST; COSTA-ABREU, 2011; SILVA, 2019).

3.2 FEATURE SELECTION

As mentioned previously, feature selection is considered one of the most important steps in data mining, machine learning and pattern recognition, since it aims to reduce dimensionality by selecting a subset of features relevant to building a model; the presence of irrelevant or redundant features can affect the performance of the classification algorithm (SANTANA, 2012; BIDI; ELBERRICHI, 2016). Feature selection can be performed using two approaches, filter or wrapper. In the filter approach, an evaluation measure is applied to assign a score to each feature, independently of the learning algorithm (Figure 3). The evaluation measures can be based on information gain, distance or correlation between features. The filter approach has a low computational cost compared to the wrapper approach, because in the wrapper approach the relevance of the features is evaluated using a learning model (KAWAMURA; CHAKRABORTY, 2017; LI; LIU, 2017; BIDI; ELBERRICHI, 2016). In the wrapper approach, the quality of a feature or of a subset of features is measured using a classifier (Figure 4). It can obtain good results, but for databases that contain many features the computational cost can be high (SANTANA, 2012; LI; LIU, 2017).

Figure 3 – Filter


Figure 4 – Wrapper

Source: The author
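To make the wrapper idea concrete, the sketch below scores a candidate subset by the accuracy of a classifier trained only on the selected features. It is an illustration under assumptions, not the Weka-based procedure used in this work: the classifier is a hand-written 1-nearest-neighbour rule, the evaluation is a single train/test split, and the function names and data are hypothetical.

```python
def nearest_label(train_x, train_y, query):
    """1-nearest-neighbour prediction with squared Euclidean distance."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]


def wrapper_fitness(mask, train_x, train_y, test_x, test_y):
    """Accuracy of the classifier when only the features flagged in mask are used."""
    keep = [j for j, bit in enumerate(mask) if bit == 1]
    if not keep:
        return 0.0                      # an empty subset cannot classify anything

    def project(row):
        return [row[j] for j in keep]

    train_p = [project(row) for row in train_x]
    hits = sum(nearest_label(train_p, train_y, project(row)) == label
               for row, label in zip(test_x, test_y))
    return hits / len(test_x)


# Made-up data: feature 0 carries the class, feature 1 is misleading noise.
train_x = [[0.0, 9.0], [0.2, 1.0], [1.0, 8.0], [1.2, 0.0]]
train_y = ["A", "A", "B", "B"]
test_x = [[0.3, 8.1], [1.1, 8.5]]
test_y = ["A", "B"]
print(wrapper_fitness([1, 0], train_x, train_y, test_x, test_y))
print(wrapper_fitness([1, 1], train_x, train_y, test_x, test_y))
```

Every call trains and tests a classifier, which is why the wrapper approach costs more than a filter that scores each feature once.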

Kawamura and Chakraborty (2017) proposed a hybrid two-step approach combining the filter and wrapper approaches to find an ideal subset of features. In the first step, the filter approach was used to rank the features; in the next step, the highly ranked features were used in the wrapper approach with genetic algorithms and particle swarm optimization. As filter algorithms the authors used CFS and minimum redundancy maximum relevance (mRMR) in the fitness function of the evolutionary algorithm, and as classifier they used SVM with 4-fold cross-validation. The results showed that the hybrid approach is effective in obtaining a smaller, optimal subset of features with high accuracy and low computational cost.

3.3 GENETIC ALGORITHMS

Genetic algorithms are a search and optimization technique based on the Darwinian principle of natural selection, proposed in the book The Origin of Species in 1859. The algorithm can be used for feature selection in a wrapper approach, and was used in this way in this work. Its design is based on biological mechanisms such as heredity and evolution. The algorithm is a population-based method, and a chromosome is used to represent each solution to the given problem (SANTANA; CANUTO, 2014).

The genetic algorithm starts by randomly generating an initial population, which is the first generation. Each individual of this population is then evaluated using an objective function, or fitness function, which defines how close the individual is to the optimal solution. Based on this evaluation, the fittest individuals are selected to continue in the genetic process; they are placed in a temporary population, called the parents, which is responsible for the next generation. This loop of evaluation, selection and genetic operators is repeated until a stopping condition is reached. The following stopping criteria can be used: (i) maximum number of generations; (ii) processing timeout; (iii) stagnation, which occurs when there is no improvement in the population over successive iterations (SANTANA, 2012).

Figure 5 – Flowchart

Source: The author

Figure 6 – Crossover

Source: The author


The following parameters were adopted in this experiment:

• Initial population of 30 individuals

• Selection by binary tournament with elitism

• Genetic operators with a mutation rate of 0.5 and a crossover rate of 0.9

• Single-point crossover, that is, the two parents selected to reproduce exchange their genetic material at a single crossover point, which is chosen at random, as shown in Figure 6.
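The loop and parameters above can be sketched as follows. The population size, crossover rate and mutation rate are the values listed; everything else is an assumption made for illustration: the number of generations, keeping exactly one elite individual, applying the mutation rate per child as a single bit flip, and the toy fitness function, which only stands in for the classifier accuracy used in a real wrapper.

```python
import random


def binary_tournament(population, scores, rng):
    """Pick two random individuals and return the fitter one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if scores[a] >= scores[b] else population[b]


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two chromosomes at one random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chromosome, rng):
    """Flip one randomly chosen bit (assumed mutation operator)."""
    child = list(chromosome)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def genetic_search(fitness, n_features, pop_size=30, crossover_rate=0.9,
                   mutation_rate=0.5, generations=20, seed=0):
    """Return the best feature mask found; generations=20 is an assumption."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        best = population[scores.index(max(scores))]
        next_population = [list(best)]        # elitism: the best individual survives
        while len(next_population) < pop_size:
            a = binary_tournament(population, scores, rng)
            b = binary_tournament(population, scores, rng)
            if rng.random() < crossover_rate:
                children = single_point_crossover(a, b, rng)
            else:
                children = (list(a), list(b))
            for child in children:
                if rng.random() < mutation_rate:
                    child = mutate(child, rng)
                next_population.append(child)
        population = next_population[:pop_size]
    scores = [fitness(ind) for ind in population]
    return population[scores.index(max(scores))]


# Toy fitness standing in for classifier accuracy: reward features 0 and 3,
# penalise every selected feature.
def toy_fitness(mask):
    return 2 * (mask[0] + mask[3]) - sum(mask)


best_mask = genetic_search(toy_fitness, n_features=8)
print(best_mask, toy_fitness(best_mask))
```

Each chromosome is a bit mask with one position per attribute, so a 1 means the attribute is kept in the candidate subset.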

3.4 GENETIC ALGORITHM PROPOSED

In feature selection with the hybrid approach, the algorithm follows the entire flowchart described above, with the same parameters. The difference proposed in this work is the application of the Correlation and Relief filters in the crossover step: each filter is applied to one parent and returns the top half of the features according to the ranking generated by that filter. The child is the union of the features selected from the two parents. Since the crossover of two parents must generate two children to maintain the population size, this child is duplicated, forming twins, as shown in Figure 7.

Once this is done, the genetic algorithm follows its normal flow: mutation is applied, which can cause the twins to differ, and the children are added to the new population.
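A minimal sketch of this crossover is given below. The text does not spell out every detail, so several points are assumptions: "half the features" is read as the better-ranked half of the features currently selected in the parent (rounded up), the Correlation ranking is applied to the first parent and the Relief ranking to the second, and the rankings themselves are made up rather than computed by the filters.

```python
def top_half(parent_mask, ranking):
    """Keep the better-ranked half of the features selected in parent_mask.

    ranking lists feature indices from best to worst, as a filter would.
    """
    selected = [f for f in ranking if parent_mask[f] == 1]
    keep = selected[:(len(selected) + 1) // 2]   # round up for odd counts (assumption)
    return set(keep)


def hybrid_crossover(parent_a, parent_b, correlation_rank, relief_rank):
    """Build one child from the union of the filtered halves and duplicate it."""
    chosen = top_half(parent_a, correlation_rank) | top_half(parent_b, relief_rank)
    child = [1 if i in chosen else 0 for i in range(len(parent_a))]
    return child, list(child)                    # twins keep the population size


# Hypothetical rankings for six features (index 0 is ranked best by Correlation).
correlation_rank = [0, 1, 2, 3, 4, 5]
relief_rank = [5, 4, 3, 2, 1, 0]
twins = hybrid_crossover([1, 1, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1],
                         correlation_rank, relief_rank)
print(twins)
```

The twins are separate lists, so a later mutation of one does not change the other.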

Thus, it is expected that with the application of the filters in the crossover step there will be an increase in accuracy.

This chapter presented the technical background necessary to understand all the techniques, definitions and models that will be used in this work.


Figure 7 – Using the filter approach in the Crossover


4 METHODOLOGY

The experiment was conducted in three stages using two databases, Brazilian hand-based behavioral biometrics (SILVA; SILVA; COSTA-ABREU, 2016) and GREYC (GIOT; EL-ABED; ROSENBERGER, 2009). The first stage was feature selection with the wrapper approach, using genetic algorithms with the KNN, SVM and Naive Bayes classifiers. The second was feature selection with the hybrid approach (filter + wrapper), using the genetic algorithm with the Relief and Correlation filters applied in the crossover step, for the same classifiers. These filters were chosen because both produce ranked outputs, which makes them uniform to apply. All experiments were run 30 times so that, in the third and final stage, the mean and standard deviation could be calculated and statistical tests could be performed with the t-test.

For the development of this work we used the classifiers and filters of the Weka API (HALL et al., 2009), and for the data analysis we used the Python language with the Pandas (MCKINNEY, 2018) and Matplotlib (HUNTER, 2007) libraries.
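The third stage can be sketched with the Python standard library alone. The text does not say which variant of the t-test was applied, so the unequal-variance (Welch) statistic below is an assumption, and the accuracy values are made up for illustration; they are not results from this work, where each configuration was run 30 times.

```python
import math
import statistics


def summarize(accuracies):
    """Mean and sample standard deviation of repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (assumed test variant)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))


# Made-up accuracies for illustration only.
wrapper_runs = [0.80, 0.82, 0.81, 0.79, 0.83]
hybrid_runs = [0.78, 0.80, 0.79, 0.77, 0.81]
print(summarize(wrapper_runs))
print(welch_t(wrapper_runs, hybrid_runs))
```

The statistic is then compared against the t distribution to decide whether the difference between the two approaches is significant.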

4.1 DATABASES

For the development of the present work, two databases were used. The first is presented in Silva, Silva and Costa-Abreu (2016), where attributes were extracted from keystroke dynamics on a physical keyboard. It had the participation of 77 volunteers, 57 male and 20 female. They were asked to type the same text, consisting of words used in Brazilian Portuguese, including English cognates and, when possible, digraphs belonging to both languages.

The keyboard used for data collection was a QWERTY keyboard, which may facilitate replication of this experiment because of its very broad usage. There were no uppercase letters, special characters or accents, since these would make the selection process more complex and could interfere with the data analysis. The features collected were the dwell time of the first key, the flight time between the first and the second key, and the dwell time of the second key. In order to obtain three samples from each user, the database gathered three occurrences of the same digraph in different words, or considered two or three digraphs as one due to the proximity of their keys on the keyboard. The database totals 14 digraphs, namely ME, ER, IR, IC, CA, MI, SE, MO, OO, DE, EL, RM and UE, so it has 42 attributes.

The second database is presented in Giot, El-Abed and Rosenberger (2009), from which it is possible to obtain information about the gender of individuals through keystroke dynamics. The data were collected in experiments with a total of 133 users, 35 female and 98 male. During sessions held at different times, participants were asked to type the same text, "greycology", six times on a laptop keyboard and six times on a USB keyboard.

The database has 60 attributes, representing the interval between the releases of consecutive keys, the interval between the press of a key and the press of its successor, the time for which each key is held down, the time at which each key is pressed, and finally a vector V that combines the four previous values.

Another factor to consider is the classes and the number of patterns in each one. The female class has 1933 patterns, while the male class has 5615. Thus, the database has a total of 7555 patterns. In addition, the database contains the user id and gender; for this work the id was removed, since the goal was to identify the gender of individuals.

4.2 CLASSIFIERS

In order to perform a comparative analysis of the attribute selection techniques, we used three classifiers from the standard implementation of the Weka toolbox. These classifiers were chosen because they belong to different learning paradigms and can bring different aspects to the comparative analysis.

K-Nearest Neighbor (KNN): KNN is a distance-based model that classifies a new object based on the training examples closest to it. The model therefore assumes that similar data tend to be concentrated in the same region, while dissimilar data tend to be far apart. Similarity is calculated with the Euclidean distance. Besides the distance metric, the value of K (the neighborhood size) is the parameter that affects classification. For this experiment, K was set to 11 for the Brazilian hand-based behavioral biometrics database and to 3 for the GREYC database (HALL et al., 2009; FACELI et al., ; BIDI; ELBERRICHI, 2016; PEDREGOSA et al., 2011).
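The classification rule described above can be sketched in a few lines. This is a plain-Python illustration, not the Weka implementation used in the experiments; the function names, the class labels "A" and "B" and the data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Label the query by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: euclidean(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up data set with two timing features per sample.
train_x = [[90, 40], [95, 45], [100, 50], [150, 90], [155, 95], [160, 100]]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_x, train_y, [98, 47], k=3))
```

An odd K, such as the values 11 and 3 used here, avoids tied votes in a two-class problem.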

Support Vector Machine (SVM): The Support Vector Machine, known as SVM, is an optimization-based classifier. Its central idea is to build a hyperplane that separates the dataset into classes. The parameters adjusted for this classifier in this work were C = 10 for the Brazilian hand-based behavioral biometrics database and C = 100 for the GREYC database, with the Puk kernel for both databases (HALL et al., 2009; FACELI et al., ; BIDI; ELBERRICHI, 2016; PEDREGOSA et al., 2011).
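For reference, the Puk kernel is the Pearson VII universal kernel, whose usual form is sketched below. The text does not report the kernel parameters, so the values omega = 1 and sigma = 1 used here are assumptions chosen only to make the example concrete.

```python
import math


def puk_kernel(x, y, omega=1.0, sigma=1.0):
    """Pearson VII universal kernel (Puk) between two feature vectors."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    scaled = 2.0 * distance * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


print(puk_kernel([0.1, 0.2], [0.1, 0.2]))   # identical vectors
print(puk_kernel([0.1, 0.2], [0.4, 0.6]))   # more distant vectors give a smaller value
```

The kernel value depends only on the distance between the two vectors, and omega and sigma control the shape and width of the peak around zero distance.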

Naive Bayes: Naive Bayes is a probabilistic model based on Bayes' theorem. This algorithm assumes there is no correlation between features, that is, they act independently (HALL et al., 2009; FACELI et al., ; BIDI; ELBERRICHI, 2016; PEDREGOSA et al., 2011).
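The independence assumption can be seen in the sketch below, where the score of a class is its prior multiplied by one likelihood per feature (summed here in log space). Modelling each numeric feature with a Gaussian is an assumption made for this illustration, as are the function names, labels and data; it is not a description of the Weka configuration used in the experiments.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior under independence."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Each feature contributes its own term: the "naive" assumption.
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Tiny made-up data set with two timing features per sample.
rows = [[90, 40], [95, 45], [100, 50], [150, 90], [155, 95], [160, 100]]
labels = ["A", "A", "A", "B", "B", "B"]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [98, 47]))
```

When two features are strongly correlated, their evidence is counted twice, which is one way correlated attributes can hurt this model.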


4.3 FILTERS

The filters used in this work, like the classifiers, can be found in the Weka toolbox. They were used in the crossover step, making the attribute selection hybrid. These filters are the same as those used in (SILVA; SILVA; COSTA-ABREU, 2016).

Correlation: Evaluates the worth of a feature by measuring the Pearson correlation between the feature and the class. The search method used was Ranker, because it evaluates the features individually and orders them by importance. A result of -1 or 1 means a perfect negative or positive correlation, respectively, while a value of 0 means that the feature and the class do not depend linearly on one another. Correlation is thus a technique for selecting the most relevant attributes of a database (HALL et al., 2009).
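A minimal sketch of this kind of ranking is shown below. It assumes a binary class encoded as 0/1 and orders features by the absolute value of their Pearson correlation with the class; ranking by absolute value, the function names and the data are assumptions for illustration and are not taken from the Weka implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                      # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(rows, labels):
    """Return feature indices ordered from most to least correlated with the class."""
    n_features = len(rows[0])
    scores = [abs(pearson([row[j] for row in rows], labels))
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Made-up data: three features, class encoded as 0/1.
rows = [[1.0, 5.0, 3.0], [2.0, 5.0, 1.0], [3.0, 5.0, 4.0], [4.0, 5.0, 2.0]]
labels = [0, 0, 1, 1]
print(rank_by_correlation(rows, labels))
```

Each feature is scored on its own, so the ranking costs one pass per feature and never trains a classifier.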

Relief: Through the use of Euclidean distance, Relief aims to evaluate attribute quality by identifying attributes that display different values for instances of distinct classes and similar values for instances of the same class (HALL et al., 2009).
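The weighting idea can be sketched as follows. This is a simplified version of the basic Relief scheme, not the Weka implementation used in the experiments: it visits every instance instead of a random sample, uses range-normalized differences, and all names and data are hypothetical.

```python
import math


def relief_weights(X, y):
    """Score each feature: reward differences to the nearest instance of
    another class (miss), penalize differences to the nearest instance of
    the same class (hit)."""
    n, d = len(X), len(X[0])

    # Value range of each feature, used to normalize differences to [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        span = max(column) - min(column)
        spans.append(span if span > 0 else 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        # Euclidean distance over the normalized feature differences.
        return math.sqrt(sum(diff(j, a, b) ** 2 for j in range(d)))

    weights = [0.0] * d
    for i in range(n):
        hits = [X[k] for k in range(n) if k != i and y[k] == y[i]]
        misses = [X[k] for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # cannot score an instance without both neighbor types
        near_hit = min(hits, key=lambda row: distance(X[i], row))
        near_miss = min(misses, key=lambda row: distance(X[i], row))
        for j in range(d):
            weights[j] += (diff(j, X[i], near_miss) - diff(j, X[i], near_hit)) / n
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.9], [0.2, 0.1], [0.9, 0.4], [1.0, 0.8], [0.8, 0.2]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
```

Attributes with higher weights are those that best distinguish instances of different classes while staying stable within a class.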

This chapter presented a detailed explanation of all the techniques used in this work.


5 RESULTS AND DISCUSSION

As mentioned previously, our experiments were conducted using two databases, Brazilian hand-based behavioral biometrics and GREYC, with 43 and 60 attributes and 231 and 7555 instances, respectively. Figures 8 and 9 show the classification accuracy obtained with all attributes, using 66% of the instances for training and 33% for testing, with the KNN, SVM and Naive Bayes classifiers.

Figure 8 – Box plot of Accuracy and Attributes for the Brazilian hand-based behavioral biometrics Database

Source: The author himself

According to the results presented in Figures 8 and 9, Naive Bayes had the lowest accuracy on both databases. This can be explained by a possible correlation between the attributes, since the model assumes that attributes are independent. KNN obtained higher accuracy on both databases, although on the Brazilian hand-based behavioral biometrics database SVM was slightly better adjusted.

Figures 8 and 9 also show the classification accuracy obtained with the best subsets selected by the wrapper approach, as well as the number of attributes remaining after selection with this approach.

As already mentioned, the parameters used in our experiments were a crossover rate of 0.9, a mutation rate of 0.5 and a population size of 30. Parents were selected by binary tournament, and the stopping condition was the maximum number of generations.
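To make these parameters concrete, the sketch below wires them into a minimal genetic algorithm over binary attribute masks. Only the three rates and sizes, the binary tournament and the generation-count stopping rule come from the text. The one-point crossover, the single-bit mutation applied per offspring, the number of generations and the toy fitness function are assumptions for illustration; in a wrapper setting the fitness would be the accuracy of a classifier trained on the selected attributes, and the filter-guided crossover of the hybrid approach is not shown.

```python
import random

CROSSOVER_RATE = 0.9   # from the text
MUTATION_RATE = 0.5    # from the text
POPULATION_SIZE = 30   # from the text


def binary_tournament(population, fitnesses, rng):
    """Pick two distinct individuals at random and return the fitter one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if fitnesses[a] >= fitnesses[b] else population[b]


def one_point_crossover(p1, p2, rng):
    """Assumed operator: swap the tails of two parents at a random cut point."""
    point = rng.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def mutate(chromosome, rng):
    """Assumed operator: flip one randomly chosen bit."""
    child = list(chromosome)
    idx = rng.randrange(len(child))
    child[idx] = 1 - child[idx]
    return child


def run_ga(n_attributes, fitness, max_generations, seed=0):
    """Evolve attribute masks and return the best mask seen."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_attributes)]
        for _ in range(POPULATION_SIZE)
    ]
    best = max(population, key=fitness)
    for _ in range(max_generations):  # stopping condition: generation count
        fitnesses = [fitness(ind) for ind in population]
        offspring = []
        while len(offspring) < POPULATION_SIZE:
            p1 = binary_tournament(population, fitnesses, rng)
            p2 = binary_tournament(population, fitnesses, rng)
            if rng.random() < CROSSOVER_RATE:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = list(p1), list(p2)
            for child in (c1, c2):
                if rng.random() < MUTATION_RATE:
                    child = mutate(child, rng)
                offspring.append(child)
        population = offspring[:POPULATION_SIZE]
        best = max(population + [best], key=fitness)
    return best


def toy_fitness(mask):
    """Hypothetical stand-in for classifier accuracy on the selected subset."""
    relevant = {0, 3, 5}
    hits = sum(bit for i, bit in enumerate(mask) if i in relevant)
    extras = sum(bit for i, bit in enumerate(mask) if i not in relevant)
    return hits - 0.1 * extras


best_mask = run_ga(n_attributes=10, fitness=toy_fitness, max_generations=20)
```

Each bit of the returned mask marks whether the corresponding attribute is kept. Since every fitness evaluation in a wrapper requires training a classifier, the cost grows with the population size, the number of generations and the number of instances.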


Figure 9 – Box plot of Accuracy and Attributes for the GREYC Database

Source: The author himself

Accuracy increased on both databases for all classifiers when compared with the results on the complete database. SVM was the most suitable classifier on both databases, with accuracy increases of 14.55% and 11.77%, even though it did not remove as many attributes as Naive Bayes. Naive Bayes showed accuracy increases of 26.10% and 8.11% and retained about 24% and 20% of the attributes for the Brazilian hand-based behavioral biometrics and GREYC databases, respectively. This reinforces the hypothesis that the databases contain correlated attributes that harm Naive Bayes performance on the complete database. The accuracy increase for the GREYC database with the KNN classifier was very low, about 2.14%.

With the hybrid approach, the graphs show that SVM again obtained the best result for the GREYC database, reaching 90% accuracy, an increase of 11.72% for this classifier. For both approaches, the computational cost on the GREYC database was extremely high because of the number of instances: the 30 executions took about one month on average, using two machines, the first with an i7 processor and 16 GB of RAM and the second with a Xeon processor and 16 GB of RAM.

To verify our results, we used the t-test, analyzing the performance of each classifier individually under the hybrid approach, the wrapper approach and no selection, in order to check whether the observed result was statistically superior. The t-test checks whether the difference between two samples is real or only apparent. The difference is considered real when the test result (the p-value) is less than 0.05 (an acceptable error probability of 5%); in that case, we reject the null hypothesis, which states that the difference is only apparent. For all comparisons made, considering each classifier individually in each approach, the test result was always below 0.05, indicating that the observed difference is real.
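The decision rule can be sketched with a small example. The text does not state which variant of the t-test was applied, so the paired form below is an assumption, as are the function name and the synthetic accuracies, which are generated only for illustration and are not the results of this work. Instead of computing a p-value, the sketch compares the statistic with the two-sided critical value for 29 degrees of freedom (30 executions) at the 5% level, which is an equivalent decision.

```python
import math
import random
import statistics


def paired_t_statistic(sample_a, sample_b):
    """t statistic for paired samples (assumed variant of the test)."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n))


# Two-sided critical value of Student's t for alpha = 0.05 and 29 degrees
# of freedom (30 paired executions).
T_CRITICAL_DF29 = 2.045

# Synthetic accuracies for 30 executions (illustrative only).
rng = random.Random(42)
without_selection = [0.70 + rng.gauss(0, 0.01) for _ in range(30)]
with_selection = [0.75 + rng.gauss(0, 0.01) for _ in range(30)]

t_stat = paired_t_statistic(with_selection, without_selection)
# |t| above the critical value corresponds to a p-value below 0.05.
significant = abs(t_stat) > T_CRITICAL_DF29
```

When `significant` is true, the null hypothesis of no difference between the two configurations is rejected at the 5% level.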

Figures 8 and 9 also show the number of attributes that each database retains after attribute selection with the wrapper and hybrid approaches. The attribute counts fall within a similar range, indicating that, in percentage terms, both approaches reduced the attributes by roughly the same proportion.


6 CONCLUSION AND FUTURE WORKS

Attribute selection is one of the most important steps in data mining, pattern recognition and machine learning, since databases with many attributes may require high computational power. Selecting a subset of features that is relevant to the class may also increase classification accuracy. Genetic algorithms appear to be a good option for this task: although they increase the computational cost, they may yield better results in terms of both accuracy and subset quality. The present work presented a comparative analysis of two attribute selection techniques, the hybrid and wrapper approaches with genetic algorithms, on keystroke dynamics databases for gender recognition. Our results showed that the wrapper approach is better than the hybrid approach, a difference that was statistically verified. The classifier that achieved the highest accuracy in classifying a user's gender on both databases was SVM, with 90% accuracy.

As future work, we intend to apply the hybrid approach in the parent selection, mutation and evaluation stages, in order to perform a comparative analysis. Further improvements include adding other keystroke dynamics databases as well as touch-based databases, and balancing the GREYC database.


BIBLIOGRAPHY

ANANYA; SINGH, S. Keystroke dynamics for continuous authentication. In: 8TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE ENGINEERING (CONFLUENCE). Proceedings... [S.l.], 2018. p. 205–208. Cited on the page 13

ANTAL, M.; NEMES, G. Gender recognition from mobile biometric data. In: IEEE 11TH INTERNATIONAL SYMPOSIUM ON APPLIED COMPUTATIONAL INTELLIGENCE AND INFORMATICS (SACI). Proceedings... [S.l.], 2016. p. 243–248. Cited 4 times on the pages 13, 14, 17, and 18

ANUSAS-AMORNKUL, T.; WANGSUK, K. A comparison of keystroke dynamics techniques for user authentication. In: INTERNATIONAL COMPUTER SCIENCE AND ENGINEERING CONFERENCE (ICSEC). Proceedings... [S.l.], 2015. p. 1–5. Cited on the page 19

BIDI, N.; ELBERRICHI, Z. Feature selection for text classification using genetic algorithms. In: 8TH INTERNATIONAL CONFERENCE ON MODELLING, IDENTIFICATION AND CONTROL (ICMIC). Proceedings... [S.l.], 2016. p. 806–810. Cited 2 times on the pages 20 and 26

DARABSEH, A.; NAMIN, A. S. Effective user authentications using keystroke dynamics based on feature selections. In: 2015 IEEE 14TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA). Proceedings... [S.l.], 2015. p. 307–312. Cited 3 times on the pages 13, 14, and 19

DUA, D.; GRAFF, C. UCI Machine Learning Repository. 2017. Available at: <http://archive.ics.uci.edu/ml>. Accessed on 28/05/2019. Cited on the page 14

FACELI, K. et al. Artificial Intelligence: A machine learning approach. Rio de Janeiro: LTC. Cited 2 times on the pages 14 and 26

FAIRHURST, M.; COSTA-ABREU, M. D. Using keystroke dynamics for gender identification in social network environment. In: 4TH INTERNATIONAL CONFERENCE ON IMAGING FOR CRIME DETECTION AND PREVENTION 2011 (ICDP 2011). Proceedings... [S.l.], 2011. p. 1–6. Cited 4 times on the pages 17, 18, 19, and 20

GIOT, R.; EL-ABED, M.; ROSENBERGER, C. Greyc keystroke: A benchmark for keystroke dynamics biometric systems. In: IEEE 3RD INTERNATIONAL CONFERENCE ON BIOMETRICS: THEORY, APPLICATIONS, AND SYSTEMS. Proceedings... [S.l.], 2009. p. 1–6. Cited 2 times on the pages 20 and 25

HALL, M. et al. The weka data mining software: An update. SIGKDD Explorations, v. 11, n. 1, p. 10–18, 2009. Cited 4 times on the pages 17, 25, 26, and 27

HUNTER, J. D. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, IEEE COMPUTER SOC, v. 9, n. 3, p. 90–95, 2007. Cited on the page 25

HWANG, S. seob; LEE, H. joo; CHO, S. Improving authentication accuracy using artificial rhythms and cues for keystroke dynamics-based authentication. Expert Systems with Applications, v. 36, n. 7, p. 10649 – 10656, 2009. Cited on the page 20


KANG, P.; CHO, S. A hybrid novelty score and its use in keystroke dynamics-based user authentication. Pattern Recognition, v. 42, n. 11, p. 3115 – 3127, 2009. Cited on the page 20

KAWAMURA, A.; CHAKRABORTY, B. A hybrid approach for optimal feature subset selection with evolutionary algorithms. In: IEEE 8TH INTERNATIONAL CONFERENCE ON AWARENESS SCIENCE AND TECHNOLOGY (ICAST). Proceedings... [S.l.], 2017. p. 564–568. Cited 4 times on the pages 14, 15, 20, and 21

KIM, J.; KIM, H.; KANG, P. Keystroke dynamics-based user authentication using freely typed text based on user-adaptive feature extraction and novelty detection. Applied Soft Computing, v. 62, p. 1077 – 1087, 2018. ISSN 1568-4946. Cited on the page 19

KOSTYUCHENKO, E. Y. et al. User identification by the free-text keystroke dynamics. In: 3RD RUSSIAN-PACIFIC CONFERENCE ON COMPUTER TECHNOLOGY AND APPLICATIONS (RPC). Proceedings... [S.l.], 2018. p. 1–4. Cited 2 times on the pages 13 and 19

LI, J.; LIU, H. Challenges of feature selection for big data analytics. IEEE Intelligent Systems, v. 32, n. 2, p. 9–15, Mar 2017. Cited on the page 20

MCKINNEY, W. Python for Data Analysis. São Paulo: Novatec, 2018. Cited on the page 25

NISHA, J. R.; KUMAR, R. P. A. User authentication based on keystroke dynamics. In: . [S.l.: s.n.], 2014. Cited on the page 14

OTHMAN, M. et al. A review of feature selection in gender classification. Journal of Advanced Review on Scientific Research, v. 27, p. 8–17, 01 2016. Cited 2 times on the pages 15 and 17

PEDREGOSA, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, v. 12, p. 2825–2830, 2011. Cited on the page 26

RAHIM, R. et al. Keylogger application to monitoring users activity with exact string matching algorithm. Journal of Physics: Conference Series, v. 954, p. 012008, 01 2018. Cited on the page 17

REVATHY, U.; KARNAN, D. M.; SUNDARAM, D. S. M. Personal identification systems for keystroke dynamics with genetic algorithm and particle swarm optimization. In: . [S.l.: s.n.], 2017. Cited 2 times on the pages 13 and 14

SANTANA, L. E. A. S. Otimização em comitês de classificadores: Uma abordagem baseada em filtro para seleção de subconjuntos de atributos. 2012. 189 f. Thesis (Doctorate) - Universidade Federal do Rio Grande do Norte, Natal, 2012. Cited 3 times on the pages 14, 20, and 22

SANTANA, L. E. A. S.; CANUTO, A. M. P. Filter-based optimization techniques for selection of feature subsets in ensemble systems. Expert Systems with Applications, v. 41, n. 4, Part 2, p. 1622 – 1631, 2014. Cited 2 times on the pages 14 and 21

SHEN, J. et al. Sliding block-based hybrid feature subset selection in network traffic. IEEE Access, v. 5, p. 18179–18186, 2017. Cited on the page 14


SILVA, V. R. D. An investigation of biometric-based user predictability in the online game League of Legends. 2012. 189 f. Dissertation (Master's) - Universidade Federal do Rio Grande do Norte, Natal, 2019. Cited on the page 20

SILVA, V. R. D.; SILVA, J. C. G. D. A.; COSTA-ABREU, M. D. A new brazilian hand-based behavioural biometrics database: data collection and analysis. In: 7TH INTERNATIONAL CONFERENCE ON IMAGING FOR CRIME DETECTION AND PREVENTION (ICDP 2016). Proceedings... [S.l.], 2016. p. 1–6. Cited 2 times on the pages 25 and 27

TSIMPERIDIS, I.; ARAMPATZIS, A.; KARAKOS, A. Keystroke dynamics features for gender recognition. Digital Investigation, v. 24, p. 4 – 10, 2018. Cited 4 times on the pages 13, 14, 17, and 18

VENKO, C.; KUMAR, S. Biometric based keystroke dynamics authentication - a review. Asian Journal of Research in Social Sciences and Humanities, v. 6, p. 698, 01 2016. Cited on the page 19
