Future Work - Um estudo de limpeza em base de dados desbalanceada e com sobreposição de classes

Chapter 5 Conclusion

5.3 Future Work

This research relied on several heuristics: the number of clusters of the k-means algorithm was fixed at k = 15 for C-clear, the limiarSmote parameter of C-clear was fixed at the frequency of the positive class, and the cleaning thresholds limiarLimpezaPositivo and limiarLimpezaNegativo were both fixed at 0.7. However, the values chosen for these parameters are possibly not optimal for every dataset. The results obtained with the cleaning performed by C-clear support this hypothesis: only the Pima dataset showed a performance gain with this method, while the other three datasets showed significant performance degradation. One direction for future work is therefore to extend C-clear so that it learns its parameters from the training data itself, as done in [Rakotomamonjy, 2004][Sing, Beerenwinkel, & Lengauer, 2004][Brefeld & Scheffer, 2005][Prati & Flach, 2005][Ataman, Street & Zhang, 2006]. To that end, the validation framework implemented in UnBMiner (Section 4.1.4) was designed so that this extension can be implemented easily.
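As an illustration of what learning the parameters from the training data could look like, the sketch below runs a plain grid search with k-fold validation. It is not the UnBMiner validation framework nor the C-clear implementation. The function names (`auc`, `k_fold_indices`, `select_parameters`), the `fit` callback, and the use of AUC as the selection criterion are all assumptions made for this example. `fit` stands for any routine that cleans the training fold with a given parameter setting, trains a classifier, and returns a scoring function.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly; ties count as half a correct pair."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n: int, folds: int) -> List[List[int]]:
    """Assign case i to fold i % folds (no shuffling, no stratification)."""
    return [list(range(i, n, folds)) for i in range(folds)]


def select_parameters(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    grid: Dict[str, Sequence],
    fit: Callable[..., Callable[[Sequence[float]], float]],
    folds: int = 5,
) -> Tuple[dict, float]:
    """Return the parameter setting with the best mean validation AUC.

    `fit(X_train, y_train, **params)` is a hypothetical callback that
    returns a function mapping one case to a score for the positive class.
    """
    names = list(grid)
    best_params, best_auc = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_aucs = []
        for held_out in k_fold_indices(len(X), folds):
            held = set(held_out)
            labels = [y[i] for i in held_out]
            if len(set(labels)) < 2:
                continue  # AUC is undefined when a fold has a single class
            X_train = [x for i, x in enumerate(X) if i not in held]
            y_train = [c for i, c in enumerate(y) if i not in held]
            scorer = fit(X_train, y_train, **params)
            fold_aucs.append(auc([scorer(X[i]) for i in held_out], labels))
        if not fold_aucs:
            continue
        mean_auc = sum(fold_aucs) / len(fold_aucs)
        if mean_auc > best_auc:
            best_params, best_auc = params, mean_auc
    if best_params is None:
        raise ValueError("no fold contained both classes")
    return best_params, best_auc
```

Under these assumptions, a search over the C-clear parameters could be written as `select_parameters(X, y, {"k": [5, 10, 15, 20], "limiarLimpezaPositivo": [0.5, 0.7, 0.9], "limiarLimpezaNegativo": [0.5, 0.7, 0.9]}, fit=clean_then_train)`, where `clean_then_train` is a hypothetical wrapper around the cleaning step and the classifier. The candidate values shown are illustrative, not values recommended by this study. With strongly imbalanced data, stratified folds would be a safer choice than the simple split used here.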

Bibliography

Ataman, K.; Street, W.N. & Zhang, Y. (2006). Learning to rank by maximizing AUC with linear programming. IEEE International Joint Conference on Neural Networks IJCNN’2006. p.123-129.

Batista, G.E.A.P.A.; Prati, R.C. & Monard, M.C. (2004). A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explorations, v.6 p.20-29.

Brefeld, U. & Scheffer, T. (2005). AUC Maximizing Support Vector Learning. Proceedings of the ICML Workshop on ROC Analysis in Machine Learning.

Boley, D.L. (1998). Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, v.2, n.4, p.325-344.

Castillo, M. & Serrano, J. (2004). A Multistrategy Approach for Digital Text Categorization from Imbalanced Documents. SIGKDD Explorations, v.6 p.70-79.

Chawla, N.V.; Bowyer, K.W.; Hall, L.O. & Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. JAIR, v.16, p.321–357.

Chawla, N.V.; Japkowicz, N. & Kotcz, A. (2003). (Editors) Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Data Sets.

Chawla, N.V.; Japkowicz, N.; Kotcz, A. (2004) (Editors) Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations. v.6. p.1-6.

Dasarathy, B.; Sanchez, J. & Townsend, S. (2000). Nearest Neighbor Editing and Condensing Tools – Synergy Exploitation. Pattern Analysis and Applications. v.3, n.1, p.19-30.

Daskalaki, S.; Kopanas, I. & Avouris, N. (2006). Evaluation of Classifiers for an Uneven Class Distribution Problem. Applied Artificial Intelligence. v.20, p.381-417.

Fayyad, U.M. (1997). Editorial: Data Mining and Knowledge Discovery. v.1 p.5-10.

Fayyad, U.M. (2004). (Editor). Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations. v.6.

Fawcett, T. (2004) ROC Graphs - Notes and Practical Considerations. Machine Learning.

Ferri, C.; Flach, P. & Hernández-Orallo, J.H. (2002). Learning Decision Trees using the Area under the ROC curve. In C.S.A. Hoffman, editor, Nineteenth International Conference on Machine Learning (ICML’2002). Morgan Kaufmann Publishers. p.139–146.

Gama, J. & Brazdil, P. (2000) Cascade Generalization. Machine Learning. v.41 n.3 p.315-343.

Guo, H. & Viktor, H.L. (2004). Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach. SIGKDD Explorations. v.6 p.30-39.

Han, H.; Wang, W.Y. & Mao, B.H. (2005) Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of ICIC. Hefei. p.878-887.

Hart, P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory IT-14. p.515–516.

He, Z.; Xu, X. & Deng, S. (2002). Squeezer: an Efficient Algorithm for Clustering Categorical Data. Journal of Computer Science and Technology. v.17, n.5, p.611-625.

He, Z.; Xu, X. & Deng, S. (2005). Clustering Mixed Numeric and Categorical Data: a Cluster Ensemble Approach. ArXiv Computer Science e-prints. (Accessed 12/12/2006. Available at: http://arxiv.org/ftp/cs/papers/0509/0509011.pdf)

Japkowicz, N. (2002). Supervised Learning with Unsupervised Output Separation. In Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing (ASC). p.321-325.

Japkowicz, N. (2003). Class imbalances: Are we Focusing on the Right Issue? In Proceedings of the ICML’03 Workshop on Learning from Imbalanced Data Sets.

Jo, T. & Japkowicz, N. (2004). Class Imbalances versus Small Disjuncts. SIGKDD Explorations, v.6 p.40-49.

Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. International Joint Conference on Artificial Intelligence (IJCAI’1995). p.1137-1145

Kubat, M., Holte, R. & Matwin, S. (1998). Machine Learning for the Detection of Oil Spills in Radar Images. Machine Learning, v.30 p.195-215.

Kubat, M. & Matwin, S. (1997). Addressing the Curse of Imbalanced Training Sets: One-sided Selection. In Proceedings of ICML. Nashville: p.179–86.

Lachiche, N. & Flach, P.A. (2003). Improving Accuracy and Cost of Two-class and Multi-class Probabilistic Classifiers Using ROC Curves. ICML’2003. p.416-423.

Ladeira, M.; Vieira, M.H.P.; Prado, H.A.; Noivo, R.M. & Castanheira, D.B.S. (2005). UnBMiner - Ferramenta Aberta Para Mineração de Dados. Revista Tecnologia da Informação, Brasília-DF, v.5, n.1, p.45-63.

Langley, P.; Iba, W. & Thompson, K. (1992). An Analysis of Bayesian Classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence. AAAI Press and MIT Press. p.223-228.

Laurikkala, J. (2001). Improving Identification of Difficult Small Classes by Balancing Class Distribution. (TR A-2001-2). University of Tampere.

MacQueen, J.B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, University of California Press, v.1, p.281-297.

Merz, C.J. & Murphy, P.M. (1998) UCI Repository of Machine Learning Datasets. http://www.ics.uci.edu/~mlearn/MLRepository.html. (Accessed 20/01/2007).

Mitchell, T. (1997). Machine Learning. New York. McGraw Hill.

Nickerson, A.; Japkowicz, N. & Milios, E. (2001). Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets. In Proceedings of the 8th International Workshop on AI and Statistics. Key West. p.261-65.

Oliveira, G.L. & Neto, M.G.M. (2004). ExperText: Uma Ferramenta de Combinação de Múltiplos Classificadores Naive Bayes. Anales de la 4ª Jornadas Iberoamericanas de Ingeniería de Software e Ingeniería de Conocimiento. Madrid. v.1, p.317-32.

Phua, C.; Alahakoon, D. & Lee, V. (2004). Minority Report in Fraud Detection: Classification of Skewed Data. ACM SIGKDD Explorations. v.6. p.50-59.

Prati, R.C.; Batista, G.E.A.P.A. & Monard, M.C (2003). Uma experiência no Balanceamento Artificial de Conjuntos de Dados para Aprendizado com Classes Desbalanceadas utilizando Análise ROC. IV Workshop de Inteligência Artificial ATAI'2003.

Prati, R.C.; Batista, G.E.A.P.A. & Monard, M.C (2004). Class Imbalances versus Class Overlapping: an Analysis of a Learning System Behavior. In MICAI. p.312-321.

Prati, R.C. & Flach, P.A. (2005). ROCCER: An algorithm for rule learning based on ROC analysis. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI’2005). p.823-828. Edinburgh, Scotland, UK.

Prati, R.C. (2006). Novas abordagens em aprendizado de máquina para a geração de regras, classes desbalanceadas e ordenação de casos. Doctoral thesis. Available at: www.teses.usp.br/teses/disponiveis/55/55134/tde-01092006-155445.

Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), p.81-106.

Quinlan, J.R. (1993). C4.5 Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Sanches, M. K. & Monard, M. C. (2004). Proposta de um Algoritmo de Clustering Semi-supervisionado para Rotular Exemplos a Partir de Poucos Exemplos Rotulados. In Workshop in Artificial Intelligence, Jornadas Chilenas de Computación. Sociedad Chilena de Ciencias de la Computación. Arica, Chile. v.1, p.1-9.

Sing, T.; Beerenwinkel, N. & Lengauer, T. (2004). Learning Mixtures of Localized Rules by Maximizing the Area Under the ROC Curve. European Conference on Artificial Intelligence – ROCAI’2004. Workshop on ROC Analysis in AI. p.89-96.

Raskutti, B. & Kowalczyk, A. (2003). Extreme Re-balancing for SVMs: a Case Study. In Proceedings of Workshop on Learning from Imbalanced Data Sets II. Washington, DC.

Rakotomamonjy, A. (2004). Optimizing Area Under ROC Curve with SVMs. In First Workshop on ROC Analysis in AI. Valencia, Spain.

SPSS Inc.; NCR Systems Engineering Copenhagen & DaimlerChrysler AG (1999). CRISP-DM 1.0 – Step-by-step Data Mining Guide. SPSS & CRISP-DM Consortium. (Available at www.crisp-dm.org/CRISPWP-0800.pdf. Accessed 26/04/2006).

Tomek, I. (1976). Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC. v.6. p.769–772.

Van Rijsbergen, C. J. (1979). Information Retrieval. 2nd Edition, London, Butterworths.

Wilson, D.R. & Martinez, T.R. (2000). Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning. v.38, n.3, p.257-286.

Weiss, G. (2004) Mining with Rarity: A Unifying Framework. ACM SIGKDD Explorations v.6. p.7-19.

Appendix A Classifiers

For the purpose of building classifiers, a set of n variables can be divided into n-1 attribute variables and one class variable. Because there is only one class variable, its states are usually referred to as the possible classes. A case is an instance of the n-1 attribute variables, so a case together with its class forms a record in a database. A classifier is a model built from a set of attribute-class instances (cases and their respective classes). In other words, it is a function that maps the values of the attribute variables to the possible states of the class variable.
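To make the idea of a classifier as a function concrete, here is a deliberately simple sketch. It is not one of the classifiers used in this study. The names `build_classifier` and `classify` are assumptions for this example, and the lookup-table rule with a majority-class fallback is just the simplest model that fits the definition above.

```python
from collections import Counter, defaultdict
from typing import Callable, Hashable, Sequence, Tuple

Case = Tuple[Hashable, ...]  # values of the n-1 attribute variables


def build_classifier(
    cases: Sequence[Case], classes: Sequence[Hashable]
) -> Callable[[Case], Hashable]:
    """Build a classifier from (case, class) records.

    The returned function maps a case to the class most often seen with
    exactly those attribute values, or to the overall majority class when
    the case never appeared in the training records.
    """
    if not cases or len(cases) != len(classes):
        raise ValueError("need one class per case and at least one record")

    # Count how often each class occurs for each distinct case.
    per_case = defaultdict(Counter)
    for case, label in zip(cases, classes):
        per_case[tuple(case)][label] += 1

    # Fallback for unseen cases: the most frequent class overall.
    default = Counter(classes).most_common(1)[0][0]

    def classify(case: Case) -> Hashable:
        counts = per_case.get(tuple(case))
        if counts is None:
            return default
        return counts.most_common(1)[0][0]

    return classify
```

Calling `build_classifier(cases, classes)` returns the mapping itself, so `classify(case)` yields one of the states of the class variable for any combination of attribute values.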
