A. Pedro Duarte Silva 1, Paula Brito 2
1 Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa (Porto), Porto, Portugal, [email protected]
2 Faculdade de Economia & LIAAD-INESC TEC, Universidade do Porto, Porto, Portugal, [email protected]
Abstract
Building on probabilistic models for interval-valued variables, parametric classification rules, based on Normal or Skew-Normal distributions, are derived for interval data. The performance of such rules is then compared with distance-based methods previously investigated. The results show that parametric approaches generally outperform distance-based ones, and that restricted cases of the variance-covariance matrix which take into account the particular nature of interval data lead to parsimonious rules, which are sometimes quite effective in reducing expected error rates.
Keywords Discriminant analysis, Interval data, Parametric modelling of interval data
1. Discriminant approaches for interval data
In this paper, we are interested in the analysis of interval data, i.e., data where elements are characterized by variables whose values are intervals of R. We investigate and compare different methods for discriminant analysis of such data.
Distance-based approaches to linear discriminant analysis of interval data are discussed in Duarte Silva & Brito (2006). Three fundamental approaches are considered. The first approach assumes a uniform distribution in each observed interval, derives the corresponding measures of dispersion and association, and appropriately defines linear combinations of interval variables that maximize the usual discriminant criterion; the second approach expands the original data set into the set of all interval description vertices, and proceeds with a classical analysis of the expanded set; finally, the third approach replaces each interval by a midpoint and range representation. These approaches lead to representations in the discriminant space in the form of intervals or single points, from which distance-based allocation rules are derived. In Brito & Duarte Silva (2012), a parametric modelling for interval data, assuming multivariate Normal or Skew-Normal (Azzalini & Capitanio (1999)) distributions for the MidPoints and Log-Ranges of the interval variables, is proposed. The intrinsic nature of the interval variables leads to special structures of the variance-covariance matrix, represented by different possible configurations. This approach is implemented in an R package, MAINT.DATA (Duarte Silva & Brito (2011)), available at the CRAN repository, which includes several tools for modelling and analysing interval data. In particular, MAINT.DATA introduces a data class for representing interval data and provides methods and functions for parameter estimation,
Programa e resumos 160
statistical tests for the different covariance configurations, and parametric Discriminant Analysis.
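The two data representations mentioned above can be illustrated with a minimal sketch, offered only as an illustration and not as the MAINT.DATA implementation. It assumes each interval is given by its lower and upper bounds, that the Log-Range is the natural logarithm of the interval width, and that the vertices of an observation are all combinations of its variables' bounds. The function names and the example values are hypothetical.

```python
import itertools
import math


def to_midpoint_log_range(lower, upper):
    """Map one interval [lower, upper] to its (MidPoint, Log-Range) pair."""
    if upper <= lower:
        # A degenerate interval has zero width, so its log-range is undefined.
        raise ValueError("interval needs upper > lower for a finite log-range")
    return (lower + upper) / 2.0, math.log(upper - lower)


def interval_vertices(intervals):
    """Return the 2**p vertices of an observation described by p intervals."""
    return list(itertools.product(*intervals))


# Hypothetical observation described by two interval variables.
observation = [(1.0, 3.0), (10.0, 14.0)]

features = [to_midpoint_log_range(lo, hi) for lo, hi in observation]
vertices = interval_vertices(observation)
```

Here `features` holds one (MidPoint, Log-Range) pair per interval variable, which is the kind of input a parametric model would work with, while `vertices` lists the four corner points that a vertices-based analysis would treat as ordinary real-valued observations.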
Discriminant analysis of interval data has been investigated by other authors in different contexts. Ishibuchi, Tanaka & Fukuoka (1990) address discriminant analysis of interval data by determining interval representations in a discriminant space using a mathematical programming formulation. Approaches to discriminant analysis of interval data based on imprecise probability theory may be found in Nivlet, Fournier & Royer (2001) and Utkin & Coolen (2011). In Lauro, Verde & Palumbo (2000), a generalization of classical Factorial Discriminant Analysis to symbolic data is proposed. This method is based on a numerical analysis of the transformed symbolic data, followed by a symbolic interpretation of the results; it allows for quantitative, qualitative nominal or distributional variables; classification rules are then based on proximities in the factorial plane (see also Lauro, Verde & Irpino (2008)).
2. Comparison of classification methodologies
This paper evaluates the relative performance of different classification rules for interval data. It compares the distance-based classification rules considered in Duarte Silva & Brito (2006), the parametric classification rules derived from the models discussed in Brito & Duarte Silva (2012), and Factorial Discriminant Analysis (Lauro, Verde & Palumbo (2000)).
The comparisons rely on cross-validated classification rates for a real data set of temperatures in meteorological stations in China, with four interval variables and 899 observations, and on a controlled experiment with simulated data. For the China dataset, the primary available data (obtained from the University Corporation for Atmospheric Research, UCAR) consist of monthly maximum and minimum temperatures for the different stations; these have been aggregated over the four trimesters (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec), leading to four interval variables. Twenty-five different discriminant methods are applied to the resulting dataset, namely: nine distance-based approaches, Factorial Discriminant Analysis (FDA) considering single, average and complete linkage allocations, and sixteen parametric approaches, eight using the Gaussian model (Linear and Quadratic Discriminant Analysis) and eight using Skew-Normal Discriminant Analysis (Location and General models), each with 4 different configurations for the variance-covariance matrix (see Brito & Duarte Silva (2012)).
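The aggregation of monthly temperatures into four interval variables can be sketched as follows. The text does not state the aggregation rule, so this sketch assumes that each trimester interval runs from the smallest monthly minimum to the largest monthly maximum within that trimester. The function name and the station values are hypothetical and are not taken from the UCAR data.

```python
def trimester_intervals(monthly_min, monthly_max):
    """Aggregate 12 monthly minima and maxima into four (lower, upper) intervals.

    Assumed rule: lower bound = smallest monthly minimum in the trimester,
    upper bound = largest monthly maximum in the trimester.
    """
    if len(monthly_min) != 12 or len(monthly_max) != 12:
        raise ValueError("expected 12 monthly minima and 12 monthly maxima")
    intervals = []
    for start in range(0, 12, 3):  # Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec
        lower = min(monthly_min[start:start + 3])
        upper = max(monthly_max[start:start + 3])
        intervals.append((lower, upper))
    return intervals


# Hypothetical monthly temperatures (degrees Celsius) for one station.
monthly_min = [-5, -3, 2, 8, 13, 18, 21, 20, 15, 9, 2, -4]
monthly_max = [3, 6, 12, 19, 25, 29, 32, 31, 26, 19, 11, 4]

station = trimester_intervals(monthly_min, monthly_max)
```

Each station then contributes one observation with four interval-valued variables, one per trimester.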
The simulation experiment uses a full factorial design for problems with two groups, three interval variables, and the following seven factors:
− Classification method (22 methods): all the methods compared in the China temperature data, except for the three Factorial Discriminant Analysis methods.
− Data Generating Process (2 levels): MidPoints generated by transformations using
− Separation (2 levels): Good and bad separation between group centroids.
− Range heterogeneity (2 levels): Same or group-specific distribution, used in the generation of Log-Ranges.
− Training sample size (4 levels): Total number of training sample observations, set at 30, 60, 100 and 150.
− Variance ratios (2 levels): Homoscedastic and heteroscedastic problems.
− True Variance-Covariance configuration (4 levels): Configuration of the population covariance used in the data generation, taken from the same configurations assumed by the parametric methods under comparison.
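The size of the full factorial design described by the seven factors above can be checked with a short enumeration. This is only a sketch: the level labels (such as `method_1` or `config_1`) are hypothetical placeholders, since the text gives only the number of levels for most factors.

```python
import itertools

# Factor levels; labels are placeholders, only the level counts come from the text.
factors = {
    "classification_method": [f"method_{i}" for i in range(1, 23)],
    "data_generating_process": ["dgp_1", "dgp_2"],
    "separation": ["good", "bad"],
    "range_heterogeneity": ["same", "group_specific"],
    "training_sample_size": [30, 60, 100, 150],
    "variance_ratios": ["homoscedastic", "heteroscedastic"],
    "true_covariance_configuration": [f"config_{i}" for i in range(1, 5)],
}

# A full factorial design crosses every level of every factor.
design = [
    dict(zip(factors, levels))
    for levels in itertools.product(*factors.values())
]

# Number of distinct data conditions, i.e. cells ignoring the method factor.
n_data_conditions = len(design) // len(factors["classification_method"])
```

Under these level counts the design has 22 × 2 × 2 × 2 × 4 × 2 × 4 = 5632 cells, corresponding to 256 data conditions on which each of the 22 methods is evaluated.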
The results show that parametric approaches generally outperform other approaches, and that restricted configurations of the variance-covariance matrix which take into account the particular nature of interval data lead to parsimonious rules, which can be quite effective in reducing expected error rates. Furthermore, the Gaussian parametric methods proved more reliable than the Skew-Normal based methods, particularly in the cases with small training samples, where the latter methods appear to suffer from estimation difficulties. With large samples, corresponding Gaussian and Skew-Normal based methods tend to produce similar results, but the Skew-Normal parametric methods are never clearly superior to the corresponding Gaussian methods, even when the data generation is based on the Skew-Normal distribution.
Acknowledgements: This research is supported by the Project NORTE-07-0124-FEDER-000059, financed by the North Portugal Regional Operational Programme (ON.2 – O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT), through the project PEst-OE/EGE/UI0731/2011.
References
AZZALINI, A. & CAPITANIO, A. (1999). Statistical applications of the multivariate Skew-Normal distribution. J. R. Statist. Soc. B, 61 (3), 579-602.
BRITO, P. & DUARTE SILVA, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics, 39 (1), 3-20.
DUARTE SILVA, A.P. & BRITO, P. (2006). Linear discriminant analysis for interval data.
DUARTE SILVA, A.P. & BRITO, P. (2011). MAINT.DATA: Model and Analyze Interval Data. R Package, version 0.2. Available at: http://cran.r-project.org/web/packages/MAINT.Data/index.html.
ISHIBUCHI, H., TANAKA, H. & FUKUOKA, N. (1990). Discriminant analysis of multi-dimensional interval data and its application to chemical sensing. International Journal of General Systems, 16 (4), 311-329.
LAURO, N.C., VERDE, R. & PALUMBO, F. (2000). Factorial discriminant analysis on symbolic objects. In: BOCK, H.-H. & DIDAY, E. (Eds.), Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg, 212-233.
LAURO, N.C., VERDE, R. & IRPINO, A. (2008). Factorial discriminant analysis. In: DIDAY, E. & NOIRHOMME-FRAITURE, M. (Eds.), Symbolic Data Analysis and the SODAS Software. Wiley, Chichester, 341-358.
NIVLET, P., FOURNIER, F. & ROYER, J.J. (2001). Interval discriminant analysis: An efficient method to integrate errors in supervised pattern recognition. In: ISIPTA'01, 284-292.
UTKIN, L.V. & COOLEN, F.P.A. (2011). Interval-valued regression and classification models in the framework of machine learning. In: Proc. 7th International Symposium on Imprecise
Data Mining, Saturday, 12 April, Room 316 (10h00)