Linear regression with empirical distributions


Sónia Manuela Mendes Dias

Doctoral thesis presented to the Faculdade de Ciências da Universidade do Porto, Mathematics, 2014.

Doctoral Programme in Applied Mathematics.

Supervisor: Paula Brito, Associate Professor, Faculdade de Economia, Universidade do Porto.

For you, Nuno.


Acknowledgements

Getting through this “trial”, which as someone always told me was “a marathon, not a 100 m sprint”, would not have been possible without the support of all those who stayed close during these five years and who always believed, often more than I did, that I would overcome the obstacles and “reach the finish line”. My heartfelt thanks...

To Professor Paula Brito, for her unconditional support and for always being available to help and advise me. For giving me the strength to continue when I lost heart, for always having a word of encouragement, and for making me believe that I could overcome the difficulties and reach the end of this stage. I must also thank her for showing me what it is to be part of the “world of research”, for introducing me to the “Symbolic Data community”, and for encouraging me to take part in conferences to present the work we were developing. All of this was undoubtedly important in helping me grow, not only scientifically but also as a person. Professor Paula, I very much enjoyed working with you. I hope we continue to collaborate.

To all those who work in Symbolic Data Analysis and who, more or less directly, contributed to the development of this work.

To the institution and the Board of the Escola Superior de Tecnologia e Gestão (ESTG) of the Instituto Politécnico de Viana do Castelo, where I have worked since 2001, for granting me some special conditions that made it possible, over these last years, to carry out in parallel, and successfully, my activities as a lecturer and as a doctoral student.

To my colleagues in the Mathematics group at ESTG, who always supported me and understood my absence from the school and my lack of availability for activities other than the strictly necessary ones.

To my friends, for their support, understanding and friendship. For the good times we spent together, which helped me relax and regain my strength. A special thank you to Sandra, for the endless conversations, the support and the sharing of experiences. It was very important to have a friend nearby who, being at a similar stage of life, understood me in a special way.

To my parents and my brother Sérgio. Thank you so much for everything you have already given me and done for me, and for everything I know you will always do. Your support during the years I spent on this doctorate was fundamental. I will never be able to thank you enough for your boundless availability. Thank you for always being close by. Sérgio, thank you for your professional advice, which improved the graphic design of this thesis.

To Nuno, because I certainly would not have come this far if you had not always been by my side. It was you who encouraged me to pursue a doctorate, who supported me and who helped me reach the end. With your help I achieved what I had never dreamed of: earning a doctorate and overcoming the barrier of the English language. You said many times that you deserved to be co-supervisor of my thesis, and I have to agree that you came very close. Thank you for your help, your patience, your unconditional support, the trust you always placed in me, and for believing that I would be able to complete the doctorate. Forgive my bad phases and the mood swings that followed whether the doctoral work was going well or badly. It is for all of this that I dedicate this thesis to you.

Resumo

In the context of classical data, a single real value or a category is associated with each individual (microdata). However, many studies are interested in sets of records aggregated according to characteristics of individuals or classes of individuals, the so-called macrodata. The classical way to study such situations is to associate with each individual or class of individuals a measure of central tendency, for example the corresponding mean or the mode of the set of records; with this option, however, the variability inherent in the data is lost. In such situations, Symbolic Data Analysis proposes that each unit be associated with the distribution or the interval of values covering the individual records, thus considering new types of variables, called symbolic variables. One type of symbolic variable is the histogram-valued variable, for which each unit corresponds to an empirical distribution that can be represented by a histogram or a quantile function. If, for all observations, each unit takes values in a single interval with weight equal to one, the histogram-valued variable reduces to the particular case of the interval-valued variable. In both cases it is assumed that the values within each interval are uniformly distributed. Consequently, the concepts and methods of classical statistics must be adapted to these new types of variables.

The functional relation between histogram-valued variables or between interval-valued variables cannot be obtained by a simple adaptation of the classical regression model. In this work, new linear regression models for histogram and interval data are proposed. These new regression models, called Distribution and Symmetric Distribution Models, make it possible to predict distributions/intervals, represented by their quantile functions, from distributions and intervals associated with the explanatory variables. Determining the parameters of the models requires solving quadratic optimization problems subject to non-negativity constraints on the unknowns. The Mallows distance is used to define the minimization problems and to compute the error between the observed and predicted distributions. As in classical analysis, a goodness-of-fit measure whose values range between 0 and 1 can be deduced from the models.

The behavior of the proposed models and of the goodness-of-fit measure is illustrated with examples on real data and with simulation studies. These studies indicate good performance of the proposed methods and of the respective coefficients of determination.

Keywords: Data with variability, histogram-valued variables, interval-valued variables, linear regression, Symbolic Data Analysis.

Abstract

In the classical data framework, one numerical value or one category is associated with each individual (microdata). However, the interest of many studies lies in groups of records gathered according to characteristics of the individuals or classes of individuals, leading to macrodata. The classical solution for these situations is to associate with each individual or class of individuals a central measure, e.g., the mean or the mode of the corresponding records; with this option, however, the variability across the records is lost. For such situations, Symbolic Data Analysis proposes that a distribution or an interval of the individual records’ values be associated with each unit, thereby considering new variable types, named symbolic variables. One such type of symbolic variable is the histogram-valued variable, where an empirical distribution, which can be represented by a histogram or a quantile function, corresponds to each entity under analysis. If, for all observations, each unit takes values on only one interval with weight equal to one, the histogram-valued variable reduces to the particular case of an interval-valued variable. In either case, a Uniform distribution is assumed within the considered intervals. Accordingly, it is necessary to adapt concepts and methods of classical statistics to these new kinds of variables.

The functional linear relations between histogram-valued variables or between interval-valued variables cannot be obtained by a simple adaptation of the classical regression model. In this work, new linear regression models for histogram data and interval data are proposed. These new Distribution and Symmetric Distributions (DSD) Regression Models allow predicting distributions/intervals, represented by their quantile functions, from distributions/intervals of the explanatory variables. To determine the parameters of the models, it is necessary to solve quadratic optimization problems subject to non-negativity constraints on the unknowns. To define the minimization problems and to compute the error measure between the predicted and observed distributions, the Mallows distance is used. As in classical analysis, it is possible to deduce from the models a goodness-of-fit measure whose values range between 0 and 1. Examples on real data as well as simulated experiments illustrate the behavior of the proposed models and the goodness-of-fit measure. These studies indicate a good performance of the proposed methods and of the respective coefficients of determination.

Keywords: Data with variability, histogram-valued variables, interval-valued variables, linear regression, Symbolic Data Analysis.
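
As a rough illustration of the Mallows distance named in the abstract, the sketch below computes its square (the integral over [0, 1] of the squared difference between two quantile functions) for a pair of histograms. It assumes uniformity within each subinterval, as the abstract states, and additionally assumes that both histograms have already been rewritten over the same sequence of weights. The function names `quantile`, `mallows_sq` and `mallows_sq_numeric`, and the `(lower, upper, weight)` tuple representation, are hypothetical choices made for this example and are not notation from the thesis. Under these assumptions each quantile function is piecewise linear, so the integral reduces to a weighted sum over differences of centers and half-ranges; the numeric version is included only as a cross-check.

```python
from math import isclose


def quantile(hist, t):
    """Evaluate the piecewise-linear quantile function of a histogram at t in [0, 1].

    hist is a list of (lower, upper, weight) tuples with positive weights summing to 1
    (a representation assumed for this sketch).
    """
    cum = 0.0
    for lower, upper, weight in hist:
        if t <= cum + weight:
            return lower + (t - cum) / weight * (upper - lower)
        cum += weight
    return hist[-1][1]  # guard against rounding when t is essentially 1


def mallows_sq(hist_x, hist_y):
    """Squared Mallows distance between two histograms sharing the same weights,
    assuming values are uniformly distributed within each subinterval."""
    if len(hist_x) != len(hist_y):
        raise ValueError("histograms must have the same number of subintervals")
    total = 0.0
    for (lx, ux, px), (ly, uy, py) in zip(hist_x, hist_y):
        if not isclose(px, py):
            raise ValueError("histograms must share the same weights")
        dc = (lx + ux) / 2 - (ly + uy) / 2  # difference of centers
        dr = (ux - lx) / 2 - (uy - ly) / 2  # difference of half-ranges
        total += px * (dc ** 2 + dr ** 2 / 3)
    return total


def mallows_sq_numeric(hist_x, hist_y, n=20000):
    """Midpoint-rule approximation of the same integral, used as a cross-check."""
    step = 1.0 / n
    return sum(
        (quantile(hist_x, (k + 0.5) * step) - quantile(hist_y, (k + 0.5) * step)) ** 2
        for k in range(n)
    ) * step


# An interval is a histogram with a single subinterval of weight one.
interval_x = [(1.0, 3.0, 1.0)]
interval_y = [(2.0, 6.0, 1.0)]
print(mallows_sq(interval_x, interval_y))

# Two histograms defined over the same weights (0.5, 0.5).
hist_x = [(0.0, 2.0, 0.5), (2.0, 6.0, 0.5)]
hist_y = [(1.0, 3.0, 0.5), (3.0, 5.0, 0.5)]
print(mallows_sq(hist_x, hist_y), mallows_sq_numeric(hist_x, hist_y))
```

For the two intervals, the centers differ by 2 and the half-ranges by 1, so the closed form gives 2² + 1²/3. Requiring a common weight sequence keeps the sketch short; histograms defined over different weights would first have to be re-expressed over a shared partition of [0, 1].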

Contents

List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Data with variability and imprecise data
  1.3 Symbolic Data Analysis
  1.4 Linear regression between empirical distributions
  1.5 Organization of the thesis

2 Histogram and Interval data
  2.1 Symbolic variables
    2.1.1 Definition and classification
    2.1.2 Histogram and Interval-valued variables
  2.2 Arithmetics
    2.2.1 Interval Arithmetic
    2.2.2 Histogram Arithmetic
    2.2.3 Operations with Quantile Functions
      2.2.3.1 Quantile Functions defined with the same number of pieces
      2.2.3.2 Operations
      2.2.3.3 The space of quantile functions
  2.3 Distances
    2.3.1 Distances between intervals
    2.3.2 Distances between distributions
  2.4 Descriptive Statistics
    2.4.1 Symbolic-numerical-numerical category
      2.4.1.1 Univariate descriptive statistics
      2.4.1.2 Bivariate descriptive statistics
      2.4.1.3 Examples
    2.4.2 Symbolic-symbolic-symbolic category
      2.4.2.1 Descriptive measures defined from the Mallows distance
      2.4.2.2 Descriptive measures defined from the Wasserstein distance
      2.4.2.3 Examples
  2.5 Conclusion

3 State of the art
  3.1 Linear Regression Models for interval variables
    3.1.1 Descriptive linear regression models
      3.1.1.1 Linear regression models for data with variability
      3.1.1.2 Linear regression models for imprecise data
    3.1.2 Probabilistic linear regression models
      3.1.2.1 Linear regression models for data with variability
      3.1.2.2 Linear regression models for imprecise data
  3.2 Linear Regression Models for histogram variables
  3.3 Conclusion

4 Regression Models for histogram data
  4.1 The DSD Regression Model I
    4.1.1 Definition of the Model
    4.1.2 Estimation of the parameters of the DSD Model I
      4.1.2.1 Optimization problem
      4.1.2.2 Kuhn Tucker conditions
    4.1.3 Goodness-of-fit measure
  4.2 The DSD Regression Model II
    4.2.1 Definition of the Model
    4.2.2 Estimation of the parameters of the DSD Model II
      4.2.2.1 Optimization problem
      4.2.2.2 Properties and the Goodness-of-fit measure
  4.3 Conclusion

5 Regression Models for interval data
  5.1 The DSD Regression Model I
    5.1.1 Particularization of the Model to interval-valued variables
    5.1.2 The DSD Model I is a generalization of the classical linear regression model
    5.1.3 The simple DSD Model I for interval-valued variables
  5.2 DSD Regression Model II
    5.2.1 Definition and Properties
    5.2.2 Goodness-of-fit measures
  5.3 Conclusion

6 Simulation studies
  6.1 Building symbolic simulated data tables
  6.2 Simulation study I
    6.2.1 Factorial design
    6.2.2 Results and conclusions
      6.2.2.1 Study of the behavior of the error function
      6.2.2.2 Study of the behavior of the coefficients of determination $\Omega$ and $\widetilde{\Omega}$
      6.2.2.3 Comparison of the observed and predicted intervals
  6.3 Simulation study II
    6.3.1 Simulation study with interval-valued variables
      6.3.1.1 Description of the simulation study
      6.3.1.2 Results and conclusions
    6.3.2 Simulation study with histogram-valued variables
      6.3.2.1 Description of the simulation study
      6.3.2.2 Results and conclusions
      6.3.2.3 Concerning the goodness-of-fit measures versus level of linearity
      6.3.2.4 Concerning the analysis of the parameters’ estimation
      6.3.2.5 Concerning symmetry/asymmetry of $\widehat{Y}(j)$
      6.3.2.6 Comparing DSD Models I and II
  6.4 Conclusion

7 Analysis of data with variability
  7.1 Prediction of the hematocrit values
    7.1.1 The histogram data
    7.1.2 The DSD Models
    7.1.3 Comparison of the DSD Models with other proposed symbolic models
    7.1.4 Interval data
  7.2 Distributions of Crimes in USA
    7.2.1 The data
    7.2.2 Three approaches to study linear relations between data with variability
    7.2.3 The prediction of the violent crimes in the state of Arkansas
    7.2.4 Predicted Quantiles
  7.3 Time of unemployment from years of employment
    7.3.1 The data
    7.3.2 Prediction with the DSD Model I
    7.3.3 Comparison of the predictions with different symbolic models
  7.4 Predicted burned area of forest fires
    7.4.1 The data
    7.4.2 Linear regression studies with macrodata
      7.4.2.1 Non symbolic approaches
      7.4.2.2 Symbolic approaches
  7.5 Conclusion

8 General conclusions and Future Work
  8.1 Conclusions of the work
  8.2 Future Work

References

Appendices
  A Ordered and disjoint histograms
  B Behavior of pairs of quantile functions
  C Results of simulation studies
    C.1 Simulation study I with interval variables
    C.2 Simulation study II with interval variables
    C.3 Simulation study II with histogram variables


List of Tables

1.1 Symbolic data table with information of three healthcare centers (part 1).
1.2 Symbolic data table with information of three healthcare centers (part 2).
1.3 Classical data table (microdata) with the records of hematocrit and hemoglobin of each patient per day.
1.4 Symbolic data table (macrodata) when the values of hematocrit and hemoglobin are symbolic variables (Billard and Diday (2002)).
2.1 Symbolic data table.
2.2 Divergency measures between distributions (Arroyo (2008)).
2.3 Mallows and Wasserstein distances between the observations of the histogram-valued variable “Waiting time for a consult” in Table 1.2.
2.4 Symbolic data table where the variables hematocrit and hemoglobin are interval-valued variables.
2.5 Symbolic data table where the variables hematocrit and hemoglobin are histogram-valued variables.
2.6 Table of frequencies associated with interval-valued variable “Age” in Table 1.1.
2.7 Descriptive statistics for the interval-valued variable “Age” in Table 1.1.
2.8 Table of frequencies of the histogram-valued variable “Waiting time for a consult” in Table 1.2.
2.9 Descriptive statistics for the histogram-valued variable “Waiting time for a consult” in Table 1.2.
2.10 Values of the covariance and variance for the data in Tables 2.4 and 2.5.
4.1 Symbolic data table for histogram-valued variables −X and Y.
5.1 Symbolic data table for interval-valued variables −X and Y.
6.1 Mean values of $\Omega$, $\widetilde{\Omega}$ and $RMSE_M$ in the situations analyzed in the simulation study I, when the error function is generated considering $\tilde{c}(j) \in U(-5, 5)$ and $\tilde{r}(j) \in U(-2, 2)$.
6.2 Mean values of $\Omega$ considering different levels of linearity, when the distributions generating observations of X are Uniform ($\Omega_U$) and Normal ($\Omega_N$).
6.3 Mean values of $\Omega$ considering different levels of linearity, when the distributions generating observations of X are Log-Normal ($\Omega_{LogN}$) and a mixture of distributions ($\Omega_{Mix}$).
6.4 Relative efficiency of the estimation of parameters a and b when the distributions of Y are created by DSD Model I and DSD Model II.
6.5 Evaluation measures for the situations $C_{DSDI}/P_{DSDI}$, $C_{DSDI}/P_{DSDII}$, $C_{DSDII}/P_{DSDII}$, and $C_{DSDII}/P_{DSDI}$.
7.1 Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 1: patients 1 to 4).
7.2 Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 2: patients 5 to 8).
7.3 Observed and predicted histograms (using different methods) of the hematocrit values shown in Table 2.5 (part 3: patients 9 and 10).
7.4 Comparison of the expressions of the symbolic linear regression models for the histogram-valued variables in Table 2.5.
7.5 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the proposed models for the histogram-valued variables in Table 2.5.
7.6 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the DSD Models for the interval-valued variables in Table 2.4.
7.7 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the DSD Models for the histogram-valued variables in Section 7.2.
7.8 Comparison of the expressions of the symbolic linear regression models that predict the number of violent crimes in USA states.
7.9 Performance of the symbolic linear regression models that predict the number of violent crimes in USA states.
7.10 Comparison between the observed and predicted LVC quantiles.
7.11 Symbolic data table where the two variables, time of activity before unemployment and time of unemployment, are interval-valued variables.
7.12 Comparison of the expressions of the symbolic linear regression models for interval-valued variables in Table 7.11.
7.13 Performance of the symbolic linear regression models that predict the logarithm of the time of unemployment.
7.14 Data with information about the total burned area of forest fire and other four variables: LNarea, temp, wind and rh organized by month.
7.15 Comparison of the expressions of the symbolic linear regression models for interval-valued variables in Table 7.14.
7.16 Comparison of the Root Mean Square Error values when the Leave-One-Out method is not/is applied together with the proposed models for the histogram-valued variables in Table 7.14.
C.1 Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is low and similar.
C.2 Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is high and similar.
C.3 Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is mixed (type I).
C.4 Results of the DSD Model I with a = 2, b = 1 and v = −1, when the variability in variable X is mixed (type II).
C.5 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is low and similar (part I).
C.6 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is low and similar (part II).
C.7 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is high and similar (part I).
C.8 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is high and similar (part II).
C.9 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is mixed (type I) (part I).
C.10 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is mixed (type I) (part II).
C.11 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is mixed (type II) (part I).
C.12 Results of the DSD Model II with a = 2, b = 1 and v = [−2, 0], when the variability in variable X is mixed (type II) (part II).
C.13 Results, in different conditions, of the DSD Model I with a = 2, b = 1 and v = −1.
C.14 Results, in different conditions, of the DSD Model I with a = 2, b = 8 and v = 3.
C.15 Results, in different conditions, of the DSD Model I with a = 6, b = 0 and v = 2.
C.16 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and v = −1.
C.17 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and v = −1 (continuation of Table C.16).
C.18 Results, in different conditions, of the DSD Model II with a = 2, b = 1 and $\Psi^{-1}_{Constant}(t)$ that represents the interval I = [−2, 0].
C.19 Results, in different conditions, of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t)$ that represents the interval I = [1, 5].
C.20 Results, in different conditions, of the DSD Model II with a = 6, b = 0 and $\Psi^{-1}_{Constant}(t)$ that represents the interval I = [1, 3].
C.21 Results, in different conditions, of the DSD Model II with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval I = [−2, 0].
C.22 Results, in different conditions, of the DSD Model II with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and $\Psi^{-1}_{Constant}(t)$ that represents the interval I = [−2, 0] (continuation of Table C.21).
C.23 Results, in different conditions, of the DSD Model I with a = 2, b = 1 and v = −1.
C.24 Results, in different conditions, of the DSD Model I with a = 2, b = 8 and v = 3.
C.25 Results, in different conditions, of the DSD Model I with a = 6, b = 0 and v = 2.
C.26 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and v = −1.
C.27 Results, in different conditions, of the DSD Model I with $a_1 = 2$, $b_1 = 1$, $a_2 = 0.5$, $b_2 = 3$, $a_3 = 1.5$, $b_3 = 1$ and v = −1 (continuation of Table C.26).
C.28 Results, in different conditions, of the DSD Model I with $a_1 = 6$, $b_1 = 0$, $a_2 = 2$, $b_2 = 8$, $a_3 = 10$, $b_3 = 5$ and v = 3.
C.29 Results, in different conditions, of the DSD Model I with $a_1 = 6$, $b_1 = 0$, $a_2 = 2$, $b_2 = 8$, $a_3 = 10$, $b_3 = 5$ and v = 3 (continuation of Table C.28).
C.30 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{U}(t)$ when the histogram-valued variable X has Uniform or Normal distributions.
C.31 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{U}(t)$ when the histogram-valued variable X has LogNormal distributions or a mixture of different distributions.
C.32 Results, in different conditions, of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{U}(t)$ (continuation of Tables C.30 and C.31).
C.33 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$ when the histogram-valued variable X has Uniform or Normal distributions.
C.34 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$ when the histogram-valued variable X has LogNormal distributions or a mixture of different distributions.
C.35 Results, in different conditions, of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$ (continuation of Tables C.33 and C.34).
C.36 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{LogN}(t)$ when the histogram-valued variable X has Uniform or Normal distributions.
C.37 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{LogN}(t)$ when the histogram-valued variable X has LogNormal distributions or a mixture of different distributions.
C.38 Results, in different conditions, of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{LogN}(t)$ (continuation of Tables C.36 and C.37).
C.39 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{Mix}(t)$ when the histogram-valued variable X has Uniform or Normal distributions.
C.40 Results of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{Mix}(t)$ when the histogram-valued variable X has LogNormal distributions or a mixture of different distributions.
C.41 Results, in different conditions, of the DSD Model II with a = 2, b = 8 and $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{Mix}(t)$ (continuation of Tables C.39 and C.40).

List of Figures

2.1 Representation of the histograms associated with each healthcare center for the histogram-valued variable $Y_5$ in Table 1.2.
2.2 Representation of the quantile functions $\Psi^{-1}_{Y_5(A)}$, $\Psi^{-1}_{Y_5(B)}$, $\Psi^{-1}_{Y_5(C)}$ in Table 1.2.
2.3 Representation of the quantile functions $\Psi^{-1}_{Y_2(A)}$; $\Psi^{-1}_{Y_2(B)}$; $\Psi^{-1}_{Y_2(C)}$ in Table 1.1.
2.4 Representation of interval $I_X + I_Y$ in Example 2.4.
2.5 Representation of intervals $I_X$, $2I_X$ and respective symmetric intervals $-I_X$, $-2I_X$ in Example 2.4.
2.6 Representation of histograms $H_X$, $H_Y$ and $H_X + H_Y$ in Example 2.5.
2.7 Representation of histograms $H_X$, $H_X + 2$ and $H_X - 2$ in Example 2.5.
2.8 Representation of histograms $H_X$, $2H_X$ and $-H_X$ in Example 2.5.
2.9 Representation of histogram $H_X - H_X$ in Example 2.6.
2.10 Representation of the quantile functions $\Psi^{-1}_X$ and $\Psi^{-1}_Y$ in Example 2.7.
2.11 Representation of the quantile functions $\Psi^{-1}_X$, $\Psi^{-1}_Y$, $\Psi^{-1}_X + \Psi^{-1}_Y$ in Example 2.8.
2.12 Representation of the quantile functions $\Psi^{-1}_X$, $\Psi^{-1}_X + 2$, $\Psi^{-1}_X - 2$ in Example 2.8.
2.13 Representation of the functions $\Psi^{-1}_X(t)$, $2\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(t)$ in Example 2.8.
2.14 Representation of the histogram $H_X$ in Example 2.8 and the respective symmetric $-H_X$.
2.15 Representation of the functions $\Psi^{-1}_X(t)$; $-\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(1 - t)$ in Example 2.8.
2.16 Representation of the functions $\Psi^{-1}_X(t)$; $-\Psi^{-1}_X(t)$, $-\Psi^{-1}_X(1 - t)$ in Example 2.4.
2.17 Scatter Plot of the interval data in Table 2.4 described in case 1.a).
2.18 Scatter Plot of the interval data in Table 2.4 described in case 2.

2.19 Scatter Plot of the histogram data in Table 2.5.
2.20 Histogram of the interval-valued variable “Age” in Table 1.1.
2.21 Histogram of the cumulative relative frequency and the quartiles of the interval-valued variable “Age” in Table 1.1.
2.22 Histogram of the histogram-valued variable “Waiting time for a consult” in Table 1.2.
2.23 Barycentric histogram of the histograms in Example 2.15.
2.24 Quantile function that represents the barycentric histogram of the histograms in Example 2.15.
2.25 Quantile function that represents the barycentric histogram of the histograms in Example 2.16.
2.26 Quantile function that represents the median histogram of the histograms in Example 2.16.
2.27 Quantile function that represents the barycentric interval of the intervals in Example 2.17.
2.28 Quantile function that represents the median histogram of the intervals in Example 2.17.
4.1 Scatter plots considering the histogram-valued variables $X$ and $Y$ in Table 2.5.
4.2 Scatter plots considering the histogram-valued variables $-X$ and $Y$ in Table 4.1.
4.3 Scatter plots considering the mean values of the observations of the histogram-valued variables $X$ and $Y$ in (a); $-X$ and $Y$ in (b).
5.1 Scatter plots associated with interval-valued variables $X$ and $Y$ in Table 2.4.
5.2 Scatter plots associated with interval-valued variables $-X$ and $Y$ in Table 5.1.
6.1 Mean values of $\Omega$ and the respective standard deviation for different error functions.
6.2 Mean values of $\widetilde{\Omega}$ and the respective standard deviation for different error functions.

6.3 Mean values of $RMSE_M$ and the respective standard deviation for different error functions.
6.4 Observed and predicted intervals with low and similar variability when $\tilde{c} \in \mathcal{U}(-5, 5)$ and $\tilde{r} \in \mathcal{U}(-2, 2)$ ($\Omega = 0.7138$ and $\widetilde{\Omega} = 0.0993$).
6.5 Observed and predicted intervals with high and similar variability when $\tilde{c} \in \mathcal{U}(-5, 5)$ and $\tilde{r} \in \mathcal{U}(-2, 2)$ ($\Omega = 0.9898$ and $\widetilde{\Omega} = 0.0730$).
6.6 Observed and predicted intervals with mixed variability I when $\tilde{c} \in \mathcal{U}(-5, 5)$ and $\tilde{r} \in \mathcal{U}(-2, 2)$ ($\Omega = 0.9677$ and $\widetilde{\Omega} = 0.6728$).
6.7 Observed and predicted intervals with mixed variability II when $\tilde{c} \in \mathcal{U}(-5, 5)$ and $\tilde{r} \in \mathcal{U}(-2, 2)$ ($\Omega = 0.9571$ and $\widetilde{\Omega} = 0.7914$).
6.8 Observed and predicted intervals with low and similar variability when $\tilde{c} \in \mathcal{U}(-1, 1)$ and $\tilde{r} \in \mathcal{U}(-0.5, 0.5)$ ($\Omega = 0.9866$ and $\widetilde{\Omega} = 0.6807$).
6.9 Observed and predicted intervals with high and similar variability when $\tilde{c} \in \mathcal{U}(-1, 1)$ and $\tilde{r} \in \mathcal{U}(-0.5, 0.5)$ ($\Omega = 0.9996$ and $\widetilde{\Omega} = 0.6943$).
6.10 Observed and predicted intervals with high and similar variability when $\tilde{c} \in \mathcal{U}(-60, 60)$ and $\tilde{r} \in \mathcal{U}(-mr, mr)$ ($\Omega = 0.3663$ and $\widetilde{\Omega} = 0.0018$).
6.11 Observed and predicted intervals with mixed variability II when $\tilde{c} \in \mathcal{U}(-40, 40)$ and $\tilde{r} \in \mathcal{U}(-mr, mr)$ ($\Omega = 0.2199$ and $\widetilde{\Omega} = 0.0262$).
6.12 Representation of the $RMSE_M$ in all cases when the DSD Model I, with one explicative variable, is applied.
6.13 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model I ($a = 2$; $b = 8$; $v = 3$) is applied to interval-valued variables and when the level I error is considered.
6.14 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model II ($a = 2$; $b = 8$; $\Psi^{-1}_{Constant}(t) = 3 + 2(2t - 1)$) is applied to interval-valued variables and when the level I error is considered.
6.15 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model I ($a = 2$; $b = 8$; $v = 3$) is applied to interval-valued variables and when the level I error is considered.
6.16 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model II ($a = 2$; $b = 8$; $\Psi^{-1}_{Constant}(t) = 3 + 2(2t - 1)$) is applied to interval-valued variables and when the level I error is considered.

6.17 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model I ($a = 2$, $b = 8$, $v = 3$) is applied to histogram-valued variables and when level I error is considered.
6.18 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model I ($a = 2$, $b = 8$, $v = 3$) is applied to histogram-valued variables and when level I error is considered.
6.19 Boxplots of the values estimated for parameter $v$, under different conditions, when DSD Model I ($a = 2$, $b = 8$, $v = 3$) is applied to histogram-valued variables and when level I error is considered.
6.20 Boxplots of the values estimated for parameter $a$, under different conditions, when DSD Model II ($a = 2$, $b = 8$, $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$) is applied to histogram-valued variables and when level I error is considered.
6.21 Boxplots of the values estimated for parameter $b$, under different conditions, when DSD Model II ($a = 2$, $b = 8$, $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$) is applied to histogram-valued variables and when level I error is considered.
6.22 Representation of the $D_M^2(\Psi^{-1}_{N}(t), \widehat{\Psi}^{-1}_{N}(t))$ in different conditions, when DSD Model II ($a = 2$; $b = 8$; $\Psi^{-1}_{Constant}(t) = \Psi^{-1}_{N}(t)$) is applied to histogram-valued variables and when level I error is considered.
6.23 Boxplots that represent the “skewness” of the distributions estimated with DSD Model I.
6.24 Boxplots that represent the “skewness” of the distributions estimated with DSD Model II.
6.25 Boxplots that represent the “skewness” of the distributions estimated with the DSD Model II with parameters $a = 2$, $b = 8$, $\Psi^{-1}_{LogN(0.95,1)}$ and $a = 2$, $b = 8$, $\Psi^{-1}_{LogN(0.5,0.4)}$.
6.26 Boxplot for the estimated parameter $a$, in four cases $C_{DSDI}/P_{DSDI}$, $C_{DSDI}/P_{DSDII}$, $C_{DSDII}/P_{DSDII}$, $C_{DSDII}/P_{DSDI}$.
6.27 Boxplot for the estimated parameter $b$, in four cases $C_{DSDI}/P_{DSDI}$, $C_{DSDI}/P_{DSDII}$, $C_{DSDII}/P_{DSDII}$, $C_{DSDII}/P_{DSDI}$.
7.1 Comparing the predictions and error functions for observation 1 in Table 2.5.
7.2 Observed and predicted quantile functions of each observation in Table 2.5.

7.3 Scatter plots representing the observed and predicted intervals of hematocrit values in Table 2.4.
7.4 Scatter plots representing the observed and predicted intervals of hematocrit values in Table 2.4 when the Leave-One-Out method is applied.
7.5 Map of the communities of the USA.
7.6 Selected states used to define the model.
7.7 Observed and predicted quantile functions of $LVC$ considering the approaches: symbolic, classic, classic-symbolic (part 1).
7.8 Observed and predicted quantile functions of $LVC$ considering the approaches: symbolic, classic, classic-symbolic (part 2).
7.9 Observed and estimated quantile function of the variable $LVC$ in the state of Arkansas.
7.10 Predicted quantile 0.2 of $LVC$ for all states.
7.11 Predicted quantile 0.4 of $LVC$ for all states.
7.12 Predicted quantile 0.6 of $LVC$ for all states.
7.13 Predicted quantile 0.8 of $LVC$ for all states.
7.14 Scatter plot of the explicative interval-valued variables $E$ and of the response variables $LNU$ observed in (a) and predicted with the DSD Model I, in (b).
7.15 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 1).
7.16 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 2).
7.17 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 3).
7.18 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 4).
7.19 Observed and predicted quantile functions considering all methods presented in Table 7.12 (part 5).
7.20 Localization of Montesinho natural park.
7.21 Observed and predicted intervals of burned area by month in Table 7.14.
7.22 Observed and predicted intervals of the burned area in Montesinho natural park in Table 7.14.

7.23 Observed and predicted intervals applying the DSD Model I and DSD Model I with LOO, for the interval-valued variable $LNarea$, for each month.
B.1 Relative positions of the intervals when the respective quantile functions are parallel.
B.2 Relative positions of the intervals when the respective quantile functions intersect outside the interval [0, 1].
B.3 Relative positions of the intervals when the respective quantile functions intersect in the interval [0, 1].

1. Introduction

In this work, we propose linear regression models that predict intervals and histograms from other intervals and histograms. The models are developed in the context of Symbolic Data Analysis. This first chapter introduces the concepts of data with variability and symbolic variables, which generalize the classical definition of variables in Multivariate Data Analysis. It also explains why linear regression between empirical distributions is studied here using Symbolic Data Analysis rather than other approaches. The chapter closes with a short description of the structure of the thesis.

1.1. Motivation

The extensive and complex data that emerged in the last decades made it necessary to extend and generalize the classical concept of data sets. Data tables where the cells contain a single quantitative or categorical value were no longer sufficient. More complex data tables were needed, with cells that include more accurate and complete information. In some situations, the data need to express the variability or imprecision of the records associated with each observed unit. In this research we focus on situations where variability in the data description occurs.

The classical solution to analyze these data is to reduce the collection of records associated with each individual or class of individuals to one value, typically the mean, mode or maximum/minimum; with this option, however, the variability across the records is lost. The main goal of this study is to propose a linear regression model that predicts, for each observed unit, a distribution representing the variability of its values for one variable from the distributions of its values for other variables.

1.2. Data with variability and imprecise data

It is important to clarify the difference between data with variability and imprecise data, two types of data that are sometimes confused. When we want to study a characteristic that fluctuates over a period of time or is associated with a specific group/class of individuals, the “value” that best describes this characteristic is not a real value or category but a set/distribution/range of values. In this case we are in the presence of data with variability. This kind of data arises when the variables are, for example, the temperature or stock values that fluctuate during a day, week or month; the prices of a product in some regions; the age of the patients in various healthcare centers; the weight or height of the players of a football team.

The variability of the data might emerge from the aggregation of single observations (Arroyo (2008)). This aggregation is named contemporary if the records are collected at the same temporal instant or the temporal instant is not relevant; it is named temporal if time is the aggregation criterion and the records are grouped along one unit of time (one day, for example) but the order in which they were observed is not pertinent. When we work with data with variability, the “best values” that represent the observations associated with each unit (individual or class of individuals) are empirical distributions or, in particular, intervals of values.

Although with a different meaning, the observations associated with imprecise data are also represented by ranges of values. Imprecise data occur when the interval associated with each unit under analysis represents the uncertainty about the value of the record (e.g. the measurement of distances or longitudes obtained with imprecise instruments). In this context, “the intervals are a imprecise perception of real values non observable” (Moore (1966)).
It is important to underline that the same variable can be approached in different contexts. As an example, the variable “weight” is a classical variable if the goal is to study a data table where the individuals are football players and to each football player corresponds his exact weight. If we are interested in studying the weight not of one single player but of a football team, then, since each team is a class of individuals, it is characterized not by a real value but by a set of values, an interval or a distribution. For example, the interval [75, 80] may represent the weights of all players from a given football team. The variable weight can also be studied in other situations. For example, if we do not know the exact weight of the football players of the team, we can associate with each player the “value” that represents this imprecision. The interval [80, 82] may mean that the weight of one football player is between 80 and 82 kg. In these examples, the interval represents two different types of records. In the first situation the interval captures the variability of the weight values in the football team, whereas in the second situation the interval represents the imprecision of the weight value.

In this research we work with data with variability and not with imprecise data. Considering all values or distributions associated with each unit allows accounting for the variability of these records and hence performing more accurate studies. This perspective is in agreement with the opinion of Schweizer, who around 30 years ago advocated that “distributions are the numbers of the future”. Following in his footsteps, Diday generalized the classical concept of variables in Multivariate Data Analysis and introduced Symbolic Data Analysis (Diday (1988)). It is in this context that this work is developed.

1.3. Symbolic Data Analysis

The data studied in Symbolic Data Analysis are represented in complex data tables where each cell expresses the variability of the records of each observed unit. These tables are called symbolic data tables and their cells may contain finite sets of values/categories, intervals or distributions. In these cases the variables are named symbolic variables. The objects/units/individuals may be single individuals (first-level units) or classes of individuals (higher-level units). As in the classical case, symbolic variables may be classified as quantitative or qualitative.
For quantitative symbolic variables, each unit is allowed to take a single value (single-valued variables); a finite set of values (multi-valued variables); an interval (interval-valued variables); or a probability/frequency/weight distribution (modal-valued variables). Histogram-valued variables are a particular type of modal-valued variable. In this case, the values attained by the variable for each unit are empirical distributions or, more specifically, histograms, where the values in each subinterval are assumed to be uniformly distributed. If we consider a symbolic variable where all units are associated with only one interval of real numbers (uniformly distributed) with probability/frequency/weight equal to one, then we are in the presence of an interval-valued variable.

As an example, let us consider a symbolic data table containing information about patients (adults) attending healthcare centers during a fixed period of time. In healthcare center A, the age of patients ranged from 25 to 83 years old; in healthcare center B, it ranged from 18 to 90 years old; and in healthcare center C, from 20 to 74 years old, so that the age is an interval-valued variable (see Table 1.1). Now consider another variable which records the waiting time for a consult. In this case, the information is recorded as intervals of time with the associated frequencies of the waiting time in each healthcare center. Each entity is a histogram and the waiting time for a consult is a histogram-valued variable (see Table 1.2). Notice that in this example the entities under analysis are the healthcare centers (higher-level units), for each of which we have aggregated information (contemporary aggregation), and not the individual patients of each center (first-level units).

Table 1.1: Symbolic data table with information of three healthcare centers (part 1).

Healthcare center | Gender (Y1) | Age (Y2) | Education (Y3)
A | {F, 1/2; M, 1/2} | [25, 83] | {9th grade, 1/2; Higher education, 1/2}
B | {F, 2/3; M, 1/3} | [18, 90] | {6th grade, 1/4; 9th grade, 1/4; 12th grade, 1/4; Higher education, 1/4}
C | {F, 2/5; M, 3/5} | [20, 74] | {4th grade, 1/3; 9th grade, 1/3; 12th grade, 1/3}

Table 1.2: Symbolic data table with information of three healthcare centers (part 2).

Healthcare center | Number of emergency consults (Y4) | Waiting time for a consult, in minutes (Y5) | Emergency 24h (Y6)
A | {1, 2, 3} | {[15, 30[, 0.1; [30, 45[, 0.6; [45, 90], 0.3} | No
B | {0, 1, 4, 5, 10} | {[0, 15[, 0.8; [15, 45[, 0.2} | Yes
C | {0, 1, 3, 7} | {[0, 15[, 0.6; [30, 60], 0.4} | Yes

In other situations we may have multiple records associated with each unit that may be the result of several observations performed in one day/month/year.
If we want to study such a variable then, as an alternative to summarizing all values by a single value and losing the variability of the information, we may aggregate the information referring to one specific period of time (temporal aggregation). Each individual (first-level unit) may thereby be associated with an interval of values (interval-valued variable) or with a distribution (histogram-valued variable). As an example, consider the classical data table in Table 1.3, which contains information about the level of hematocrit and hemoglobin of a set of patients attending a healthcare center during one month. Aggregating the values associated with each patient we build a symbolic data table, Table 1.4, where to each unit, the patient, corresponds a distribution or an interval of values that describes the variability of the values of hematocrit and hemoglobin recorded for that patient during one month.

Table 1.3: Classical data table (microdata) with the records of hematocrit and hemoglobin of each patient per day.

Patients | Hematocrit (Y): Day 1 | Day 2 | ... | Day 30 | Hemoglobin (X): Day 1 | Day 2 | ... | Day 30
1 | 35.68 | 39.61 | ... | 34.54 | 12.4 | 12.19 | ... | 11.54
2 | 40.83 | 36.69 | ... | 39.45 | 12.67 | 13.04 | ... | 12.07
3 | 46.45 | 47.97 | ... | 48.68 | 12.38 | 13.63 | ... | 16.16
4 | 42.62 | 38.34 | ... | 39.89 | 14.26 | 13.58 | ... | 12.89
5 | 48.65 | 46.32 | ... | 39.19 | 14.61 | 13.80 | ... | 16.24
6 | 46.58 | 39.70 | ... | 39.12 | 13.98 | 14.54 | ... | 13.81
7 | 47.64 | 46.09 | ... | 48.25 | 14.81 | 15.55 | ... | 14.68
8 | 43.68 | 39.84 | ... | 38.40 | 13.27 | 13.68 | ... | 13.67
9 | 38.88 | 29.06 | ... | 41.64 | 10.97 | 11.98 | ... | 13.56
10 | 47.54 | 50.60 | ... | 49.82 | 15.95 | 15.64 | ... | 16.01

Since the 1980s, Symbolic Data Analysis has seen considerable development of new statistical techniques to analyze multi-valued data (see, for instance, Bock and Diday (2000), Billard and Diday (2003, 2006), Diday and Noirhomme-Fraiture (2008), Noirhomme-Fraiture and Brito (2011)). Recently, there has been a growing interest in the analysis of histogram-valued variables, but the bulk of the research is still developed for interval-valued variables. The methods proposed so far for the former are, frequently, a generalization of their counterparts for the latter.
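The aggregation step described above, from daily records (microdata) to one interval or one histogram per patient, can be sketched as follows. This is only an illustrative sketch: the function names `to_interval` and `to_histogram`, the choice of bin edges, and the two extra hemoglobin values are assumptions made for the example and do not come from the thesis.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Bin = Tuple[float, float, float]  # (lower bound, upper bound, relative frequency)


def to_interval(records: Sequence[float]) -> Interval:
    """Summarize the records of one unit by their range [min, max]."""
    return (min(records), max(records))


def to_histogram(records: Sequence[float], edges: Sequence[float]) -> List[Bin]:
    """Relative frequencies of the records over the bins defined by `edges`.

    Bins are [e_i, e_{i+1}[ except the last one, which is closed on the right.
    Assumes `edges` is increasing and covers every record.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in records:
        for i in range(len(counts)):
            in_bin = edges[i] <= x < edges[i + 1]
            if in_bin or (i == last and x == edges[-1]):
                counts[i] += 1
                break
    n = len(records)
    return [(edges[i], edges[i + 1], counts[i] / n) for i in range(len(counts))]


# Hemoglobin of one patient: the first three values are the Day 1, Day 2 and
# Day 30 records of patient 1 in Table 1.3; the last two are HYPOTHETICAL
# stand-ins for the days elided in the table.
hemoglobin = [12.4, 12.19, 11.54, 12.8, 11.9]

interval = to_interval(hemoglobin)                                # interval-valued cell
histogram = to_histogram(hemoglobin, edges=[11.54, 12.19, 12.8])  # histogram-valued cell
print(interval)
print(histogram)
```

Both summaries keep information that a single mean value would discard: the interval keeps the range of the records, and the histogram additionally keeps how the records are spread within that range.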
The concepts and methods which have been developed in the context of Symbolic Data Analysis can be classified in three categories, according to the input data, the method and the output data (Irpino and Verde (2012, 2013)), as follows:

Table 1.4: Symbolic data table (macrodata) when the values of hematocrit and hemoglobin are symbolic variables (Billard and Diday (2002)).

Patients | Hematocrit (Y) | Hemoglobin (X)
1 | [33.29; 39.61] | {[11.54; 12.19[, 0.4; [12.19; 12.8], 0.6}
2 | [36.69; 45.12] | {[12.07; 13.32[, 0.5; [13.32; 14.17], 0.5}
3 | [36.70; 48.68] | {[12.38; 14.2[, 0.3; [14.2; 16.16], 0.7}
4 | [36.38; 47.41] | {[12.38; 14.26[, 0.5; [14.26; 15.29], 0.5}
5 | [39.19; 50.86] | {[13.58; 14.28[, 0.3; [14.28; 16.24], 0.7}
6 | [39.7; 47.24] | {[13.81; 14.5[, 0.4; [14.5; 15.2], 0.6}
7 | [41.56; 48.81] | {[14.34; 14.81[, 0.5; [14.81; 15.55], 0.5}
8 | [38.4; 45.22] | {[13.27; 14.0[, 0.6; [14.0; 14.6], 0.4}
9 | [28.83; 41.98] | {[9.92; 11.98[, 0.4; [11.98; 13.8], 0.6}
10 | [44.48; 52.53] | {[15.37; 15.78[, 0.3; [15.78; 16.75], 0.7}

symbolic-numerical-numerical: Symbolic input data are transformed into standard data in order to apply classical multivariate techniques. The results are real values. For example, the symbolic mean of a set of intervals is the classical mean of their centers.

symbolic-numerical-symbolic: Symbolic input data are analyzed with classical multivariate techniques and the results are symbolic data. Most Symbolic Data Analysis techniques belong to this category. For example, the linear regression models defined for interval-valued variables reduce the intervals to their centers and half ranges, apply classical linear regression models to these elements and afterwards reconstruct the intervals from the estimated centers and half ranges.

symbolic-symbolic-symbolic: Symbolic input data are transformed using generalization/specialization operators and the results are symbolic data. In this category, some concepts were proposed only recently, such as the concept of barycentric histogram. In that case, the input data are a set of histograms, the method involves a distance that works with distributions (a histogram may represent an empirical distribution) and the result is a symbolic element, a barycentric histogram, which lies at the minimum distance from the set of histograms.
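To make the symbolic-numerical-symbolic category concrete, the sketch below follows the recipe stated above for interval-valued variables: reduce each interval to its center and half range, fit two classical simple linear regressions, and rebuild the predicted interval. It is a minimal illustration under stated assumptions, not the implementation of any published model. The helper names are hypothetical, and the explicative intervals are obtained by taking the overall bounds of the hemoglobin histograms in Table 1.4, which is a simplification made only for this example.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Line = Tuple[float, float]  # (intercept, slope)


def ols(x: Sequence[float], y: Sequence[float]) -> Line:
    """Classical least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return (my - slope * mx, slope)


def fit_center_range(X: List[Interval], Y: List[Interval]) -> Tuple[Line, Line]:
    """Fit one regression on the centers and another on the half ranges."""
    xc = [(l + u) / 2 for l, u in X]
    xr = [(u - l) / 2 for l, u in X]
    yc = [(l + u) / 2 for l, u in Y]
    yr = [(u - l) / 2 for l, u in Y]
    return ols(xc, yc), ols(xr, yr)


def predict_center_range(model: Tuple[Line, Line], x: Interval) -> Interval:
    """Rebuild an interval from the predicted center and half range."""
    (c0, c1), (r0, r1) = model
    center = c0 + c1 * (x[0] + x[1]) / 2
    half_range = r0 + r1 * (x[1] - x[0]) / 2
    # Nothing in this sketch forces half_range >= 0, so the rebuilt bounds
    # can come out in the wrong order for some inputs.
    return (center - half_range, center + half_range)


# Hematocrit intervals (Y) from Table 1.4; hemoglobin intervals (X) are the
# overall bounds of the histograms in Table 1.4 (an assumption of this sketch).
Y = [(33.29, 39.61), (36.69, 45.12), (36.70, 48.68), (36.38, 47.41), (39.19, 50.86),
     (39.7, 47.24), (41.56, 48.81), (38.4, 45.22), (28.83, 41.98), (44.48, 52.53)]
X = [(11.54, 12.8), (12.07, 14.17), (12.38, 16.16), (12.38, 15.29), (13.58, 16.24),
     (13.81, 15.2), (14.34, 15.55), (13.27, 14.6), (9.92, 13.8), (15.37, 16.75)]

model = fit_center_range(X, Y)
print(predict_center_range(model, X[0]))
```

Note that the method in the middle is entirely classical: the intervals are never handled as intervals, only as pairs of real numbers, which is what distinguishes this category from the symbolic-symbolic-symbolic one.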

The category symbolic-symbolic-symbolic is the most recent approach and is the one considered in this research. In this context it was necessary to study the behavior of the new elements, which are now intervals and histograms. In the Symbolic Data Analysis approach and in this category, linear regression models do not predict real values from real values but rather distributions from other distributions. All concepts and methods in this category require new tools to work with these more complex elements, with which it is necessary to learn how to operate. Knowing the arithmetic and the distances that apply to these elements, and analyzing the behavior of the spaces whose elements are intervals and distributions, constitutes important knowledge from which we can define statistical concepts in the category symbolic-symbolic-symbolic.

Most concepts and methods developed in the Symbolic Data Analysis approach are descriptive, since no probabilistic assumption is considered. The development of non-descriptive methods for Symbolic Data Analysis is still an open research topic for almost all kinds of symbolic variables. Studies proposing probabilistic models for interval-valued variables were recently published (see Lima Neto et al. (2011) and Brito and Duarte Silva (2012)). However, in these works, intervals are treated as vectors whose components are the centers and half ranges or the bounds, instead of being considered as intervals as such. In general, the difficulty of working with symbolic elements in a probabilistic context lies in the extension of the concept of randomness.

1.4. Linear regression between empirical distributions

When this research started, several linear regression models for interval-valued variables had already been defined (Billard and Diday (2000, 2002), Lima Neto and De Carvalho (2008, 2010)), but for histogram-valued variables only an extension of the first models of Billard and Diday for interval-valued variables had been proposed (Billard and Diday (2002, 2006)). The models proposed by Billard and Diday present some limitations. The main one is that those models are a simple adaptation of the classical models, where the parameters are estimated using the symbolic variance and symbolic covariance definitions; moreover, the estimation of intervals does not consider the intervals as such, but rather estimates the bounds separately; consequently, the elements predicted by the models may fail to form an interval. Because the first models proposed for histogram-valued variables are a generalization of the models presented for interval-valued variables, the problems associated with this case are similar but more difficult to solve, because distributions are much more complex elements.

For interval-valued variables, Lima Neto and De Carvalho (2008, 2010), Lima Neto et al. (2011) and, recently, also other authors (Giordani (2011, 2014), Yang et al. (2011), Ahn et al. (2012)) proposed new approaches to define linear regression models in which the limitations of the first proposed models are solved. However, no model was defined in which the intervals are fully treated as such. While the research presented in this work was being developed, alternative methods for histogram-valued variables were proposed by Verde and Irpino (2010) and Irpino and Verde (2012, 2013). In this case the model is already defined taking into account the entire distributions.

As histogram-valued variables are still little studied, the development of a linear regression model for histogram-valued variables that predicts distributions from other distributions is the main goal of this research. The analysis of the limitations of the models presented in the literature was the starting point to define the goals for the models to propose.
These goals consist in designing a linear regression model that:
• is flexible and truly predicts histograms or intervals from other histograms or intervals;
• solves the problem of the lack of linearity in the spaces whose elements are intervals and histograms, a limitation that forces a linear regression between intervals or histograms to be direct;
• uses an adequate distance to measure the error between the observed and predicted elements;
• can be particularized to interval-valued variables; for this particular case different distributions within the intervals may be considered.

The difficulties associated with the definition of a linear regression between intervals or distributions lie in the definition of linear combination between these kinds of elements. The interval and histogram arithmetics are complex, and the semi-linearity of the spaces whose elements are intervals or histograms does not allow for the generalization of the classical definition of linear combination and, consequently, of linear relation between elements of these types.

To overcome this first limitation, the solution is to represent the histograms by the inverse of the cumulative distribution function, as proposed by Irpino and Verde (2006). This representation of the histograms is named quantile function. With this representation, instead of working with intervals/histograms, we work with linear functions or piecewise linear functions. Since the distributions or intervals may be represented by quantile functions, we may now define a linear regression where the elements involved are functions. However, these functions have an important property. As the subintervals within the histograms (in the case of histogram-valued variables) and the intervals (for interval-valued variables) have an upper bound greater than or equal to the lower bound, the quantile functions used to represent them are always non-decreasing functions.

Since we are working with functions, we might consider applying Functional Data Analysis methods (Ramsay and Silverman (2005)) to histogram-valued variables. In fact, in the Functional Data Analysis context the data are functions rather than individual values or sequences of individual observations. Usually, in this context, functional data are observed and recorded as discrete points $(t_j, y_j)$ where, for the individual $j$, the values $y_j$ of the variable are observed, for example, at time $t_j$. However, Functional Data Analysis considers not those discrete points but rather a function obtained from the observed data, i.e. it considers data that associate with each individual a curve that represents the mathematical description of discrete data points distributed over space, time, and other types of continuum.
The curves considered in Functional Data Analysis are typically fitted by a weighted sum, or linear combination, of basis functions such as B-spline functions or Fourier series. The functions thus obtained, and the functions generally considered in this context, are smooth.

Under the Functional Data Analysis framework, linear models may be functional because the prediction of the response variable is a function and/or the observations of the explicative variables in the model are functions. In any case, the regression coefficients are functions rather than real numbers as in the classical case. Three types of functional linear regression models may be considered.

• Functional linear model for functional responses.
  – The parameters and the observations of the response variable are functional, but the observations of the explicative variables, $x_{jk}$, are multivariate, as in the classical linear regression model. In this case, the predicted function $\hat{y}_j$ is given by

    $$\hat{y}_j(t) = \beta_0 + \sum_{k=1}^{p} \beta_j(t)\, x_{jk}.$$

  – The fully functional linear model considers the observations of both the response variable $y_j$ and the explicative variables $x_j$ as functions, defined on intervals $T_Y$ and $T_X$, respectively. In this case, the functional prediction is obtained by

    $$\hat{y}_j(t) = \beta_0(t) + \int_{T_X} \beta_j(s,t)\, x_j(s)\, ds.$$

• Functional linear model for scalar responses. The response variable $y_j$ is scalar or multivariate and is predicted from the functions $x_j$ defined on an interval $T_X$, i.e.,

    $$\hat{y}_j = \beta_0 + \int_{T_X} \beta_j(s)\, x_j(s)\, ds.$$

In this work, the goal is to predict distributions from other distributions. Consequently, the most straightforward situation would be the case where the observations of both the response and the explicative variables are functions. However, in general, these methods are not adequate for quantile functions, because quantile functions are not smooth. One way to apply the methods of Functional Data Analysis to quantile functions would be to smooth the observed quantile functions, assuming in this case that we are working with the distribution function instead of the empirical distribution function. In this situation, when the microdata are not known, the smoothing would have to be obtained only from the quantiles used in the quantile functions, which may be too little information to represent the behavior of the variable for each observation. Another option would be to consider a different model for each piece (linear function) that composes the quantile functions.
For this, all quantile functions would have to be rewritten with the same number of pieces and an equal domain for each piece. However, this option is not applied to the whole function associated with each unit: it predicts the pieces separately, which may lead to functions that are not non-decreasing; consequently, when the pieces are joined to build the function, we may fail to obtain quantile functions.

1.5. Organization of the thesis

In addition to this introductory chapter, this thesis is composed of 7 more chapters and 3 appendices. The remainder of the thesis is organized as follows.

Chapter 2 introduces Symbolic Data Analysis and the kinds of data and variables that this approach uses, focusing on histogram-valued and interval-valued variables. The first sections of the chapter present the arithmetic and the distance measures most commonly applied to these kinds of elements. The Mallows distance, selected for use in this work, and the reasons behind this choice are explained. Later sections define in detail the concepts and methods of descriptive statistics for histogram-valued and interval-valued variables. To aid understanding of the definitions, concepts and methods, some examples are presented throughout the chapter.

The main goal of Chapter 3 is to present the state of the art of linear regression models proposed for the case where the observations in the data tables are intervals or distributions. Both the methods that use the Symbolic Data Analysis approach and the models proposed for imprecise data are presented.

The new linear regression models for histogram-valued variables and interval-valued variables are presented in Chapters 4 and 5, respectively. The problem of defining a linear regression model for histogram-valued variables is addressed, and two alternative models are proposed. In Chapter 5, since interval-valued variables constitute a particular case of histogram-valued variables, the linear regression models for interval-valued variables emerge as a special case. We also prove that the particularization of the proposed model to degenerate intervals coincides with the classical linear regression model.
In addition to the presentation of the models and the respective goodness-of-fit measures, several related properties are stated and proved in both chapters.

Chapter 6 focuses on simulation studies considering both kinds of symbolic variables treated in this work. We first introduce the process used to generate symbolic data tables; then the factorial design of each study is described. Two simulation studies are performed for interval-valued variables and just one for histogram-valued variables. The results and conclusions of these studies are discussed; the tables with the records that support the analysis are included in Appendices C.1, C.2 and C.3.

To illustrate the application of the models to real data, four examples are studied in Chapter 7. In several situations, not only the new models proposed in this work but also other methods proposed in the literature are applied, which makes it possible to compare the models and the predictions obtained. In one of these cases, a classical study is also performed, with the goal of comparing the conclusions obtained using the two approaches, the classical and the symbolic.

Finally, the general conclusions of this work are presented and directions for future research are pointed out.

2. Histogram and Interval data

The research presented in this work is developed within the scope of Symbolic Data Analysis, considering interval-valued variables and histogram-valued variables, for which the observed values are not single values or categories but intervals or distributions, respectively.

In the first part of this chapter we define histogram-valued variables and interval-valued variables. As the main goal is to work with interval data and histogram data in the symbolic-symbolic-symbolic category, it is important to find the best representation for these kinds of data, to study the arithmetic that may be used to operate with these elements, and to select a distance that provides a good dissimilarity measure between intervals and distributions. These three points are the main contributions that make it possible to define a new approach to linear regression that predicts intervals or distributions from other intervals or distributions.

In the second part of this chapter we review the state of the art of the descriptive statistics associated with these types of symbolic variables. Many concepts and definitions of descriptive measures for one or more variables were proposed in the symbolic-numerical-numerical category. More recently, descriptive measures were proposed in the symbolic-symbolic-symbolic category. The results associated with the descriptive statistics will be introduced and demonstrated in detail.

2.1. Symbolic variables

2.1.1. Definition and classification

Classical multivariate statistics studies data tables that summarize observations of “statistical units” (individuals). Each row of these tables represents one individual and each of