
Pedro Filipe Pessoa Macedo

Contributos para a teoria de máxima entropia na estimação de modelos mal-postos

Contributions to the theory of maximum entropy estimation for ill-posed models


2013


Thesis submitted to the Universidade de Aveiro in fulfilment of the requirements for the degree of Doctor in Mathematics, carried out under the scientific supervision of Doctor Manuel González Scotto, Assistant Professor with Habilitation at the Department of Mathematics of the Universidade de Aveiro, and of Doctor Elvira Maria de Sousa Silva, Associate Professor with Habilitation at the Faculdade de Economia of the Universidade do Porto.

Financial support from FCT - Fundação para a Ciência e a Tecnologia through the doctoral grant with reference SFRH/BD/40821/2007.

Financial support from FEDER through COMPETE – Programa Operacional Fatores de Competitividade, and from national funds through the Centro de Investigação e Desenvolvimento em Matemática e Aplicações (Universidade de Aveiro) and FCT, within project PEst-C/MAT/UI4106/2011 with COMPETE number FCOMP-01-0124-FEDER-022690.


the jury

president Prof. Dr. Aníbal Manuel de Oliveira Duarte
Full Professor, Universidade de Aveiro

members Prof. Dr. António Manuel Pacheco Pires
Full Professor, Instituto Superior Técnico, Universidade Técnica de Lisboa

Prof. Dr. Elvira Maria de Sousa Silva
Associate Professor with Habilitation, Faculdade de Economia, Universidade do Porto (Co-supervisor)

Prof. Dr. Diana Elisabeta Aldea Mendes
Associate Professor, Instituto Universitário de Lisboa

Prof. Dr. Manuel González Scotto
Assistant Professor with Habilitation, Universidade de Aveiro (Supervisor)

Prof. Dr. Andreia Teixeira Marques Dionísio
Assistant Professor, Escola de Ciências Sociais, Universidade de Évora

Prof. Dr. Maria Manuela Souto de Miranda


acknowledgements To Professor Manuel Scotto and Professor Elvira Silva, for their dedicated scientific supervision. Thank you very much.

To the Fundação para a Ciência e a Tecnologia for the doctoral grant. To the Universidade de Aveiro, the Department of Mathematics and the Center for Research and Development in Mathematics and Applications for the various forms of support provided. A special word of thanks to Professor João Santos and Professor Luís Castro.

To my colleagues at the Department of Mathematics and at the Department of Economics, Management and Industrial Engineering, in particular those who shared these last years with me, for their comradeship.


palavras-chave Maximum entropy, collinearity, outliers, small samples, linear regression, robust regression, ridge regression, ridge parameter, quantum electrodynamics, technical efficiency, state-contingent production frontiers.

resumo Statistical techniques are fundamental in science and linear regression analysis is perhaps one of the most widely used methodologies. It is well known from the literature that, under certain conditions, linear regression is a very powerful statistical tool. Unfortunately, in practice, some of those conditions are rarely satisfied and the regression models become ill-posed, thus preventing the application of the traditional estimation methods. This work presents some contributions to the theory of maximum entropy estimation of ill-posed models, in particular the estimation of linear regression models with small samples affected by collinearity and outliers. The research is developed in three directions, namely the estimation of technical efficiency with state-contingent production frontiers, the estimation of the ridge parameter in ridge regression and, finally, new developments in maximum entropy estimation. In the estimation of technical efficiency with state-contingent production frontiers, the work developed shows a better performance of the maximum entropy estimators relative to the maximum likelihood estimator. This good performance is notable in models with few observations per state and in models with a large number of states, which are commonly affected by collinearity. The use of maximum entropy estimators is expected to contribute to the much desired increase of empirical work with these production frontiers.

In ridge regression, the greatest challenge is the estimation of the ridge parameter. Although countless procedures are available in the literature, the truth is that none of them outperforms all the others. In this work a new estimator of the ridge parameter is proposed, which combines the analysis of the ridge trace with maximum entropy estimation. The results obtained in the simulation studies suggest that this new estimator is one of the best procedures available in the literature for the estimation of the ridge parameter.

The maximum entropy Leuven estimator is based on the least squares method, the Shannon entropy and concepts from quantum electrodynamics. This estimator overcomes the main criticism of the generalized maximum entropy estimator, since it does not require supports for the parameters and errors of the regression model. In this work, new contributions to the theory of maximum entropy estimation of ill-posed models are presented, based on the maximum entropy Leuven estimator, information theory and robust regression. The estimators developed reveal a good performance in linear regression models with small samples affected by collinearity and outliers.

Finally, some computational codes for maximum entropy estimation are presented, thereby contributing to an increase of the scarce computational resources currently available.


keywords Maximum entropy, collinearity, outliers, small sample sizes, robust regression, linear regression, ridge regression, ridge parameter, quantum electrodynamics, technical efficiency, state-contingent production frontiers.

abstract Statistical techniques are essential in most areas of science, linear regression being one of the most widely used. It is well known that, under fairly general conditions, linear regression is a powerful statistical tool. Unfortunately, some of these conditions are usually not satisfied in practice and the regression models become ill-posed, which means that the application of traditional estimation methods may lead to non-unique or highly unstable solutions.

This work is mainly focused on the maximum entropy estimation of ill-posed models, in particular the estimation of regression models with small sample sizes affected by collinearity and outliers. The research is developed in three directions, namely the estimation of technical efficiency with state-contingent production frontiers, the estimation of the ridge parameter in ridge regression, and some developments in maximum entropy estimation.

In the estimation of technical efficiency with state-contingent production frontiers, this work reveals that the maximum entropy estimators outperform the maximum likelihood estimator in most of the cases analyzed, namely in models with few observations in some states of nature and models with a large number of states of nature, which usually represent models affected by collinearity. The maximum entropy estimators are expected to make an important contribution to the increase of empirical work with state-contingent production frontiers.

The main challenge in ridge regression is the selection of the ridge parameter. There is a huge number of methods to estimate the ridge parameter and no single method emerges in the literature as the best overall. In this work, a new method to select the ridge parameter in ridge regression is presented. The simulation study reveals that, in the case of regression models with small sample sizes affected by collinearity, the new estimator is probably one of the best ridge parameter estimators available in the literature on ridge regression.

Founded on the Shannon entropy, the ordinary least squares estimator and some concepts from quantum electrodynamics, the maximum entropy Leuven estimator overcomes the main weakness of the generalized maximum entropy estimator, avoiding exogenous information that is usually not available. Based on the maximum entropy Leuven estimator, information theory and robust regression, new developments on the theory of maximum entropy estimation are provided in this work. The simulation studies and the empirical applications reveal that the new estimators are a good choice in the estimation of linear regression models with small sample sizes affected by collinearity and outliers.

Finally, a contribution to the increase of computational resources on maximum entropy estimation is also accomplished in this work.

Contents

1 Introduction 1

1.1 Motivation and objectives . . . 1

1.2 Structure of the thesis and main achievements . . . 7

2 Background and state of the art 13
2.1 Entropy . . . 13

2.1.1 Entropy concepts . . . 13

2.1.2 Maximum entropy principle . . . 23

2.2 Estimators . . . 30

2.2.1 The generalized maximum entropy estimator . . . 30

2.2.2 The generalized cross-entropy estimator . . . 35

2.2.3 Higher-order entropy estimators . . . 37

2.3 Regression diagnostics: collinearity and outliers . . . 39

2.4 Technical efficiency . . . 43

3 Technical efficiency with state-contingent production frontiers 49
3.1 Introduction . . . 49

3.2 State-contingent production with maximum entropy estimators . . . 51

3.2.1 A state-contingent production frontier model . . . 51

3.2.2 Maximum entropy estimators . . . 53

3.3 Simulation study . . . 58


3.4 Conclusions . . . 65

4 The choice of the ridge parameter in ridge regression 69
4.1 Introduction . . . 69

4.2 The ridge regression estimator . . . 72

4.3 The Ridge-GME estimator . . . 74

4.4 Simulation study . . . 76

4.5 Numerical example . . . 82

4.6 Conclusions . . . 85

5 Some developments in the maximum entropy estimation 87
5.1 Maximum entropy robust regression group estimators . . . 87

5.1.1 Introduction . . . 87

5.1.2 The maximum entropy Leuven estimator . . . 89

5.1.3 The MERG estimators . . . 92

5.1.4 Some properties of the MERG estimators . . . 95

5.1.5 Simulation study: collinearity and outliers . . . 98

5.1.6 Examples and additional simulation studies . . . 102

5.1.6.1 HDI and Portland cement models . . . 102

5.1.6.2 Additional simulation studies . . . 106

5.1.7 Conclusions . . . 108

5.2 An extension of the maximum entropy robust regression group estimators . . 108

5.2.1 Introduction . . . 108

5.2.2 The MERGE estimators . . . 109

5.2.3 Improvements for MERGE estimators . . . 112

5.2.3.1 Cross-entropy formalism . . . 112

5.2.3.2 Parameter inequality restrictions . . . 113


5.2.5 Conclusions . . . 117

6 Concluding remarks 119

6.1 Final conclusions . . . 119

6.2 Future work . . . 122

References 125

Index 141

Appendices 144

Appendix A – Shannon entropy as a measure of information . . . 145

Appendix B – Two new measures of technical inefficiency with directional technology distance functions . . . 149

List of figures

2.1 Shannon entropy. . . 16

2.2 R´enyi entropy. . . 19

2.3 Tsallis entropy. . . 21

2.4 Estimated ME distributions for the die problem. . . 29

4.1 Support intervals and ridge interval in the Ridge-GME estimator for an arbitrary ridge trace. . . 75

4.2 Ridge trace for the Portland cement model (non-standardized coefficients). . 84

4.3 Ridge trace for the Portland cement model (standardized coefficients). . . 84

4.4 Selection of the Ridge-GME estimate for the Portland cement model. . . 85

5.1 The analogy with quantum electrodynamics. . . 90

5.2 OLS regression lines in the HDI model. . . 103

A.1 Shannon entropy as a measure of information. . . 145

B.1 Technical inefficiency measures based on the mean and median of the data. . 151

B.2 Different measures of technical inefficiency. . . 153

List of tables

2.1 Estimated ME distributions for the die problem. . . 28

3.1 MSEL and DMTE for the different estimators (Model 1 and Model 2). . . 61

3.2 MSEL and DMTE for the different estimators (Model 3 and Model 4). . . 62

3.3 MSEL and DMTE for the different estimators (Model 5 and Model 6). . . 63

3.4 MSEL and DMTE for the different estimators (Model 7 and Model 8). . . 64

3.5 MSEL and DMTE for the different estimators (Model 9 and Model 10). . . . 65

4.1 MSEL for OLS and different ridge estimators (N = 10). . . 79

4.2 MSEL for OLS and different ridge estimators (N = 20). . . 80

4.3 MSEL for OLS and different ridge estimators (N = 50). . . 81

4.4 MSEL for OLS and different ridge estimators (N = 100). . . 82

4.5 MSE for different estimators in the Portland cement model. . . 85

5.1 MERG estimators. . . 93

5.2 MSEL in the simulation study with outliers and collinearity. . . 101

5.3 Estimates for β0 and β1 in the HDI model. . . 103

5.4 Estimates in the original Portland cement model. . . 104

5.5 Parameter supports for GME estimators in the Portland cement model. . . . 105

5.6 Estimates in the Portland cement model with intercept. . . 106

5.7 MSEL in the simulation study with outliers. . . 107


5.8 MSEL in the simulation study with collinearity. . . 107

5.9 MERGE estimators. . . 110

5.10 MSEL for the estimators in the simulation study (N = 10). . . 115

5.11 MSEL for the estimators in the simulation study (N = 30). . . 116

B.1 Most popular directional vectors. . . 150

B.2 Results from Figure B.1. . . 151

List of abbreviations

BMM Bayesian method of moments

CONSTR optimization function for MATLAB

COVRATIO outlier diagnostic procedure

DEA data envelopment analysis

DFBETAS outlier diagnostic procedure

DFFITS outlier diagnostic procedure

DMTE difference between the true and the estimated mean of technical efficiency

FMINCON optimization function for MATLAB

GAMS software (http://www.gams.com)

GAUSS software (http://www.aptech.com)

GCE generalized cross-entropy

GCV generalized cross-validation

GDP gross domestic product

GEL generalized empirical likelihood

GLS generalized least squares

GME generalized maximum entropy

GME-α higher-order generalized maximum entropy


GMM generalized method of moments

HDI human development index

HK ridge parameter estimator from Hoerl and Kennard [76]

HKB ridge parameter estimator from Hoerl et al. [78]

IEE information and entropy econometrics

IRLS iteratively reweighted least squares

KM4 fourth ridge parameter estimator from Muniz and Kibria [120]

KM5 fifth ridge parameter estimator from Muniz and Kibria [120]

KM6 sixth ridge parameter estimator from Muniz and Kibria [120]

KS ridge parameter estimator from Khalaf and Shukur [90]

LAD least absolute deviations

LIMDEP software (http://www.limdep.com)

LMS least median of squares

LTS least trimmed squares

MATLAB software (http://www.mathworks.com)

ME maximum entropy

MEL maximum entropy Leuven

MERG maximum entropy robust regression group

MERG(E) MERG and MERGE estimators

MERGE maximum entropy robust regression group extended

ML maximum likelihood


MSE mean squared error

MSEL mean squared error loss

OLS ordinary least squares

PPP US$ purchasing power parity United States dollar

QR QR decomposition (orthogonal-triangular decomposition)

Ridge-GME ridge parameter estimator (combines the ridge trace and the GME estimator)

RR-MM robust ridge regression estimator based on repeated M-estimation

SFA stochastic frontier analysis

SIMPS optimization function for MATLAB

Notation

A matrix (a letter in bold uppercase)

A′ transpose of matrix A

A−1 inverse of matrix A

a column vector (a letter in bold lowercase)

a′ transpose of the column vector a (a row vector)

xk kth column of matrix X

1N column vector of ones with dimension (N × 1)

IN identity matrix with dimension (N × N)

β̂ estimator (or estimate) of the unknown parameter vector β

‖ · ‖ Euclidean norm

cond2 2-norm condition number

⊗ Kronecker product

⊙ element-by-element Hadamard product


Introduction

“S = k log w”

Entropy formula in the epitaph of Boltzmann’s grave in Vienna.

The motivation and the objectives of this work, as well as the structure of the thesis and the main achievements, are discussed in this introduction. A reference to the publications and communications produced during the work is provided at the end of the chapter.

1.1 Motivation and objectives

Statistical techniques are essential in most areas of science, linear regression being one of the most widely used. It is well known that, under fairly general conditions, linear regression is a powerful statistical tool. Unfortunately, some of these conditions are seldom satisfied in practice. This idea is well expressed in Golan et al. [69, p. 3]:

“[. . . ] because of limited, partial data or insufficient information, many econometric problems fall in the ill-posed, underdetermined category. Convenient assumptions, representing information we do not possess, are typically used to convert ill-posed problems into seemingly well-posed statistical models [. . . ]. However, this approach often leads to erroneous interpretations and treatments. In fact, in applied mathematics, statistics and econometrics, ill-posed inverse problems may be the rule rather than the exception.”


The main motivation of this work comes from this unfortunate reality. In a traditional linear regression framework in which the main interest is to recover the unknown parameter vector from a known matrix with explanatory variables and a known vector of noisy observations, a regression model is, in general, ill-posed if it does not satisfy the required conditions of classical statistical estimation methods.1 In such cases, the application of traditional estimation methods might lead to non-unique solutions and/or solutions that may be highly unstable, i.e., very sensitive to small perturbations in the original data.2 Ill-posedness is a broad concept. However, the ill-posedness of a model typically arises from the limited information available, usually from small sample sizes, incomplete data or when the number of unknown parameters exceeds the number of observations (under-determined model); and/or from an experiment that is badly designed (e.g., leading to models with aggregated or missing data), or an experiment that simply results in a model affected by collinearity and/or outliers; see, among others, Golan [65], Golan et al. [69] and O'Sullivan [126], and the references therein.

At this point it seems reasonable to ask how to make the best possible predictions with such ill-posed problems. An attractive approach which plays a central role in this work is the maximum entropy (ME) principle due to Edwin Jaynes; e.g., Jaynes [81, 82, 83, 84, 85, 86]. The ME principle, as well as other methods available in the literature, such as the generalized maximum entropy (GME), the higher-order generalized maximum entropy (GME-α), the generalized cross-entropy (GCE), the generalized method of moments (GMM), the Bayesian method of moments (BMM), the generalized empirical likelihood (GEL) and, in general, all the methods directly or indirectly related to the information and entropy econometrics (IEE) research field, are designed to extract information from limited and noisy data using minimal statements on the data generation process; see Golan [64, 65, 66] and the references therein for a review. These methods are established on three general assumptions:

Assumption 1.1. Not everything about the model is known.

Assumption 1.2. Only minimal a priori assumptions should be assigned.

1. As noted by O'Sullivan [126], there is an inverse problem whenever inferences are made from partial or incomplete information, and thus statistical estimation is an inverse problem. The inverse problems that are not suitable to the classical statistical estimation methods are defined as ill-posed. In this work, for simplicity, an ill-posed inverse model/problem is just denoted as ill-posed model/problem.

2. This definition is usually attributed to Jacques Hadamard at the beginning of the XX century.


Assumption 1.3. The solution should reflect only the available information.

These are logical assumptions for any estimation method to deal with ill-posed models.

The IEE literature is mainly concerned with the estimation of ill-posed models and the study of information measures, with a particular emphasis on economic problems. As the name itself suggests, entropy, information theory and ME are central in IEE; section 2.1 presents an overview of some interpretations of entropy from its advent in classical thermodynamics to current applications in science, as well as the ME principle proposed by Edwin Jaynes.

This thesis is mainly focused on the ME estimation of ill-posed models, in particular the ME estimation of linear regression models with small sample sizes affected by collinearity, a problem that hampers the empirical work with regression analysis and is probably the most frequent characteristic of ill-posed models in real-world problems. The presence of outliers in the linear regression model is also discussed, though briefly, in this thesis, particularly when associated with collinearity; in section 2.2 a brief review of some ME estimators is presented.

Why are these topics so relevant in regression analysis and why does their presence lead to ill-posed models? Regression models with small sample sizes may compromise the usefulness of classical statistical methods, as well as statistical inference. If there are cases where it is possible to collect additional information (normally requiring more time and cost), there are other cases where such additional information simply does not exist! In either case, it is necessary to make the best possible predictions with such limited information.4 Collinearity is the term usually used in the literature to represent a near-linear relationship between two or more regressors. Collinearity is responsible for inflating the variance associated with the regression coefficient estimates and, in general, may affect the signs of the estimates, as well as statistical inference. Outliers are atypical observations, often influential ones, that can produce a large impact on the ordinary least squares (OLS) parameter estimates. In short, small samples, collinearity and outliers may lead, although due to different reasons, to absurd results in regression analysis, since the solutions may be undefined or may be highly unstable; in section 2.3 some diagnostic procedures and strategies for dealing with collinearity and outliers are briefly reviewed.

4. Note that there are no precise definitions of a small sample and a large sample. These definitions vary across different areas of science and depend on several factors. However, in empirical work, the boundary between small and large samples usually lies between 30 and 50 observations.


In this research work, the ME estimation of regression models with small sample sizes, collinearity and outliers is investigated from both theoretical and applied points of view. By using the links between information theory, ME and statistical inference, this research is developed in three directions, namely: (a) the estimation of technical efficiency with state-contingent production frontiers; (b) the estimation of the ridge parameter in ridge regression; and (c) some potential developments in ME estimation.

In a single input-output production technology, technical efficiency can be defined as the ability to minimize the quantity of input used in the production of a given quantity of output, or the ability to maximize the quantity of output produced with a given quantity of input. Technical efficiency can be computed by comparing the observed output and the potential output of a production unit (e.g., a firm). Thus, technical efficiency analysis is a fundamental tool to measure the performance of the production activity. There is a wide range of methodologies to measure technical efficiency and the choice of a specific approach is always controversial, since different choices lead to different results; e.g., Kalirajan and Shand [89] and Kumbhakar and Lovell [96]. Section 2.4 presents a brief review on technical efficiency analysis.
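As a simple numerical illustration (the figures are hypothetical, not taken from the thesis): for an output-oriented measure, a firm producing y = 8 units of output when the frontier (potential) output for its input level is y* = 10 has technical efficiency TE = y / y* = 8 / 10 = 0.8, i.e., it attains 80% of its potential output.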

In the last decade, after the work of Chambers and Quiggin [23], an increasing interest in state-contingent production frontiers has emerged in the production literature, which contributed, at least partially, to decrease the controversy in production analysis under uncertainty. This interest, as noted by Quiggin and Chambers [134], is due to the fact that uncertainty in economics is best interpreted in a state-contingent framework. However, this increasing (theoretical) interest has not yet been reflected in an increase of empirical work with this approach. Why? The answer is straightforward: the empirical models with state-contingent production frontiers are usually ill-posed. In particular, these empirical models involve small sample sizes and are affected by (severe) collinearity.

Thus, how to increase the empirical work with state-contingent production frontiers? In particular, how to estimate technical efficiency with state-contingent production frontiers under difficult empirical conditions (ill-posed models)?

The first objective of this work is to develop a fairly general extension of the production model proposed by O'Donnell et al. [125] that can be employed in real-world empirical applications, and to develop all the procedures to use the GME, GME-α and GCE estimators to assess technical efficiency with state-contingent production frontiers under difficult empirical conditions. The main goal is to make a contribution to the empirical literature on state-contingent production frontiers, probably the most complete approach to model uncertainty in economics; see Chapter 3 for details.

Ridge regression, discussed by Hoerl and Kennard [76], is a very popular estimation methodology to handle collinearity without removing variables from the regression model. The importance of this methodology is discussed by McDonald [112], who analyzed the number of publications related to ridge regression in Technometrics, the Journal of the American Statistical Association, Communications in Statistics – Theory and Methods, and Communications in Statistics – Simulation and Computation. Approximately 300 articles related to ridge regression have been published in these four scientific journals since the seventies.5 In the presence of collinearity, traditional estimators such as the OLS estimator perform poorly since the variances of the parameter estimates can be substantially large. By adding a small non-negative constant (often referred to as the ridge parameter) to the diagonal of the correlation matrix of the explanatory variables, it is possible to reduce the variance of the OLS estimator through the introduction of some bias into the regression model.
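In matrix terms, the construction just described can be sketched in a few lines of MATLAB. The data, the near-collinear design and the ridge parameter k below are arbitrary illustrative choices (not values used in the thesis), and the snippet assumes a MATLAB release with implicit expansion (R2016b or later); it is only a sketch of the standard ridge estimator, not code from Appendix C.

% Minimal sketch of the ridge regression estimator in correlation form.
rng(1);                                   % for reproducibility
N = 20; K = 3;
X = randn(N, K);
X(:,3) = X(:,2) + 0.01*randn(N, 1);       % two nearly collinear columns
y = X*[1; 2; -1] + 0.1*randn(N, 1);
Xs = X - mean(X);
Xs = Xs ./ sqrt(sum(Xs.^2));              % unit-length columns, so Xs'*Xs is the correlation matrix
ys = y - mean(y);
k = 0.1;                                  % ridge parameter (illustrative value)
beta_ols   = (Xs'*Xs) \ (Xs'*ys);
beta_ridge = (Xs'*Xs + k*eye(K)) \ (Xs'*ys);
disp([beta_ols beta_ridge])               % ridge shrinks the unstable OLS estimates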

The challenge in ridge regression is the selection of the ridge parameter. This choice is usually made by the inspection of the ridge trace (a subjective choice) or by a formal method (depending on some parameters that must be estimated from the data). Moreover, there is a huge number of methods to estimate the ridge parameter (several dozens) and no single method emerges in the literature as the best overall.

Thus, how to select the ridge parameter? Is it possible to find a method that reduces the subjectivity in the selection of the ridge parameter and/or does not depend crucially on other parameters that must be estimated from the data?

The second objective of this work is to introduce a new estimator for the ridge parameter, combining the analysis of the ridge trace with ME estimation, namely the GME estimator. The main goal is to create one of the best ridge parameter estimators in the literature on ridge regression, in particular concerning regression models with small sample sizes; see Chapter 4 for details.

5. This number of publications results from a recent update made by us in May 2012, based on the study of McDonald [112].


The GME estimator developed by Golan et al. [69] is described in subsection 2.2.1. The GME estimator has acquired special importance in the toolkit of econometric techniques, by allowing econometric formulations free of restrictive and unnecessary assumptions. Moreover, the GME estimator is useful in linear regression models with small sample sizes, in which the design matrix is ill-conditioned and/or the number of unknown parameters exceeds the number of observations. However, despite these advantages, many statisticians reject the GME estimator. The main weakness of the GME estimator is that support intervals (i.e., exogenous information that may not always be available) for the parameters and error vectors are needed. Those supports are defined as closed and bounded intervals in which each parameter or error is restricted to lie.
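To make the role of these supports concrete, the following MATLAB fragment sketches the standard GME formulation of Golan et al. [69] for y = Xβ + e: each coefficient is written as a convex combination of M support points z, each error as a convex combination of J support points v, and the joint Shannon entropy of the weights is maximized subject to the data and adding-up constraints. The data, the support choices and the use of fmincon (Optimization Toolbox) are illustrative assumptions; this is not the code of Appendix C.

% Minimal GME sketch for y = X*beta + e (illustrative data and supports).
rng(1);
N = 10; K = 3;
X = randn(N, K);
y = X*[1; -2; 0.5] + 0.2*randn(N, 1);
M = 5;  z = linspace(-10, 10, M)';          % parameter support (assumed wide)
J = 3;  v = linspace(-3, 3, J)';            % error support (e.g., a 3-sigma rule)
% Unknowns: q = [p; w], with p (K*M x 1) and w (N*J x 1) probability weights.
% Data constraints:      kron(X, z')*p + kron(eye(N), v')*w = y
% Adding-up constraints: each block of p and each block of w sums to one.
Aeq = [kron(X, z'),              kron(eye(N), v');
       kron(eye(K), ones(1, M)), zeros(K, N*J);
       zeros(N, K*M),            kron(eye(N), ones(1, J))];
beq = [y; ones(K, 1); ones(N, 1)];
nq  = K*M + N*J;
q0  = [repmat(1/M, K*M, 1); repmat(1/J, N*J, 1)];      % uniform starting point
negent = @(q) sum(q .* log(q + 1e-12));                % minimize the negative entropy
opts = optimoptions('fmincon', 'Algorithm', 'interior-point', 'Display', 'off');
q = fmincon(negent, q0, [], [], Aeq, beq, zeros(nq, 1), ones(nq, 1), [], opts);
p = q(1:K*M);
beta_gme = kron(eye(K), z') * p;            % recover the coefficient estimates
disp(beta_gme')

The sketch makes the weakness discussed above visible: the estimates depend on the (exogenous) choice of z and v, which is exactly the information the MEL estimator introduced next tries to avoid.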

Because of this (possible) difficulty in the definition of support intervals, Paris [127] develops the maximum entropy Leuven (MEL) estimator based on some ideas from the theory of light (quantum electrodynamics) of Feynman [57], the Shannon entropy measure and the OLS estimator. Paris [127, 128] shows that the MEL estimator can rival the GME estimator in linear regression models affected by collinearity, without requiring exogenous information as in the GME estimator.

Is it possible to improve the MEL estimator? Is it possible to generalize this estimator using information theory and robust regression?

The third objective of this work is to generalize the MEL estimator using other entropy measures and different methods employed in the robust regression literature. The idea is to explore the advantages obtained by merging different entropy measures and robust estimators in order to improve the performance of the MEL estimator, namely in the estimation of linear regression models with small sample sizes affected by collinearity and/or outliers. Moreover, since there are some doubts about whether the analogy with the theory of light (quantum electrodynamics) used in the MEL estimator is valid in different regression models, other approaches should be investigated. Discussing some directions for future research in IEE, Golan [66] raises a question about the possibility of making the theory easier to apply by practitioners. The third objective also attempts to address this question, by developing new methodologies that are simpler and easier to apply; see Chapter 5 for details.

Finally, concerning computational resources for ME estimation, there are only a few (and sometimes limited) options available in commercial software for the estimation of linear regression models. Thus, in most cases, researchers and practitioners need to develop their own codes using different optimization tools. Naturally, the lack of computational resources for ME estimation does not help the diffusion of these estimation techniques among practitioners.

Is it possible to develop some friendly ME codes designed for users who are not familiar with ME estimation, using popular and widely disseminated software?

A final objective of this work is to develop some user-friendly codes for ME estimation in MATLAB, and thus to make a contribution to the increase of computational resources for this type of estimation; see Appendix C for details.

1.2 Structure of the thesis and main achievements

This thesis contains six chapters, including the introduction and the concluding remarks.

Chapter 2 includes background results and the state of the art on different topics covered in this work. Although the literature review is the main focus of Chapter 2, some original work (e.g., examples, discussions and proofs) is also presented within this chapter. Chapters 3, 4 and 5 present the main contributions of this work, namely the estimation of technical efficiency with state-contingent production frontiers, the estimation of the ridge parameter in ridge regression, and some developments in ME estimation. Finally, an illustration of the Shannon entropy as a measure of information, short original research on measures of technical inefficiency with directional distance functions, and some new MATLAB codes for ME estimation are provided in the appendices.

Chapter 3 presents original work concerning the estimation of technical efficiency with state-contingent production frontiers under difficult empirical conditions, combining the GCE, GME and GME-α estimators. The contributions of this chapter are

• an extension of the production model proposed by O'Donnell et al. [125] that can be employed in real-world empirical applications;

• the procedures to use the GME, GME-α and GCE estimators to assess technical efficiency with state-contingent production frontier models, namely

– a new proposal to define the supports for the inefficiency error;

– different possibilities to define the supports with the GCE estimator;

– the possibility to include different orders of entropy with the GME-α estimators in the two-error component rather than using the same value for both;

• the evidence that the GME, GME-α and GCE estimators are powerful alternatives to the maximum likelihood (ML) estimator in the estimation of state-contingent production frontiers under severe empirical conditions, namely in

– models with few observations in some states of nature (that strongly restrict the use of traditional estimators);

– models with collinearity and severe collinearity problems.

Although the theory of state-contingent production is well established, the empirical implementation of this approach is still in its infancy. The ME estimators are expected to make an important contribution to the increase of empirical work with state-contingent production frontiers in the near future. This is an important contribution since, for many authors, the state-contingent approach is the most complete procedure in the production literature to evaluate technical efficiency in the context of production uncertainty.

Chapter 4 presents original work in ridge regression with an application of the GME estimator in the estimation of the ridge parameter. The idea is to introduce a new estimator for the ridge parameter, which efficiently combines the ridge trace and the GME estimator. The contributions of this chapter are

• the development of a new estimator, denoted as the Ridge-GME estimator, that combines the analysis of the ridge trace with the GME estimator;

• the discussion and comparison of the performance of the Ridge-GME estimator with several traditional competitors in a Monte Carlo simulation study and an empirical application to the well-known Portland cement data set.

The simulation study reveals that, in the case of regression models with small sample sizes affected by collinearity, the Ridge-GME estimator is probably one of the best ridge parameter estimators available in the literature on ridge regression. This finding is very important for ridge regression users. It is important to note that the main challenge in ridge regression is the selection of the ridge parameter and there is in the literature a huge number of methods to estimate this parameter! According to this finding, the Ridge-GME estimator can be recommended to practitioners and should belong to the restricted group of ridge parameter estimators that may be considered in any ridge regression analysis with small samples.

In Chapter 5, section 5.1, a third set of contributions of this thesis is presented, which may be considered an upgrade of the GME estimator or, more generally, a new approach to ME estimation. In fact, a first step was already taken by Paris [127] with the MEL estimator, based on the Shannon entropy, some concepts from the theory of light (quantum electrodynamics) and the OLS estimator. The contributions of this section are

• the introduction of the maximum entropy robust regression group (MERG) estimators, which represent a generalization of the MEL estimator, and the discussion of

– the structure of the MERG estimators, which includes the Shannon, Rényi and Tsallis entropies, the OLS estimator and different estimators based on robust regression, namely the least trimmed squares (LTS), the least absolute deviations (LAD), and the least median of squares (LMS) estimators;

– some properties, namely scale invariance, consistency and asymptotic normality for some MERG estimators;

• the evaluation and comparison of the performance of the MERG estimators with several traditional estimators in different simulation studies and in models with real data.

The MERG estimators may be a good choice in the estimation of linear regression models with small sample sizes, not only in models affected by outliers and collinearity simultaneously, but also in models affected by collinearity or outliers separately. The MERG estimators are easy to compute and, most importantly, no relevant prior information is needed to implement them. These two features are probably the most important ones of the MERG estimators.

Additional original work is provided in Chapter 5, section 5.2: an extension of the MERG estimators, denoted as MERGE estimators. This acronym comes from the initials of the words maximum entropy robust regression group extended and also suggests the merging of different estimators in a new class with high performance in linear regression models with small sample sizes affected by collinearity and outliers. The contributions of this section are

• the introduction of the MERGE estimators, which are an extension of the MERG estimators developed in section 5.1, and the discussion of

– the structure of the MERGE estimators, which avoids the analogy with quantum electrodynamics and includes supports for the parameters as in the GME estimator;

– some potential improvements for the MERGE estimators, namely the possibility to impose parameter inequality restrictions through the parameter support matrix (as done in the GME estimator) and the use of the cross-entropy formalism;

• the evaluation and comparison of the performance of the MERGE estimators with the MERG estimators and a recent powerful estimator for the combined collinearity-outliers problem in linear regression.

This extension makes it possible to include supports for the parameters as in the GME estimator, since there are regression models where the supports for the parameters are known and provided by theory (e.g., in economics, when estimating the marginal propensity to consume) or by the experience of the researchers. The simulation study reveals a good performance of the MERGE estimators in linear regression models with small sample sizes affected by collinearity and outliers.

Finally, Appendix A presents an illustration of the Shannon entropy as a measure of information in the context of two simple games with roulette wheels; Appendix B provides short research on technical inefficiency with directional technology distance functions; and Appendix C includes MATLAB codes with some estimators presented in the thesis. In summary, the contributions in the appendices are

• two new measures of technical inefficiency with directional distance functions;

• new MATLAB codes for ME estimation.

It is important to note that directional technology distance functions provide a complete representation of the production technology and a natural technical inefficiency measure; see Chambers et al. [27]. The MATLAB codes in Appendix C are based on the ones used in this thesis, but they are particularly designed for users who are not familiar with ME estimation. These and other codes, as well as new updates and additional information on the ME estimators, will soon be available at http://www.ua.pt/mat. In order to disseminate the ME estimators to a wider audience, some of those codes for GAMS and Microsoft Excel will also be available on the same website.

As a final remark, five papers based on this work were prepared for publication (four in international scientific journals and one in conference proceedings): Macedo et al. [103, 104, 105, 106]; one paper was recently submitted. Additionally, eleven presentations were given at national and international conferences, namely the 27th and 28th European Meetings of Statisticians (two talks; one poster), the 17th, 18th, 19th and 20th Conferences of the Portuguese Statistical Society (five talks; one poster), the first Research Day at the University of Aveiro (one poster), and the 58th World Statistics Congress of the International Statistical Institute (one talk).


Background and state of the art

“[. . . ] the principle of maximum entropy is not an Oracle telling which predictions must be right; it is a rule for inductive reasoning that tells us which predictions are most strongly indicated by our present information.”

Jaynes [86, p. 369].

In this chapter, an overview of the main topics covered in the thesis is presented, with particular attention to entropy and ME estimation.1 In section 2.1, some interpretations of entropy and the ME principle are discussed. Section 2.2 presents a brief review of some ME estimators, namely the GME, GCE, and GME-α estimators. The two remaining sections briefly review some diagnostic procedures and strategies for dealing with collinearity and outliers, as well as some topics on technical efficiency analysis.

1. This thesis covers different topics from different areas, such as physics, econometrics, economics, statistics and mathematics, which causes a natural problem of inclusion. An additional effort is made so that this work is self-contained. Naturally, a complete background and state of the art on these topics can only be achieved by reviewing some seminal and relevant work, which is mentioned throughout the thesis and where detailed definitions, properties, proofs, and some other historical details can be found.

2.1 Entropy

2.1.1 Entropy concepts

The notion of entropy appears in the foundations of thermodynamics in the XIX century. This concept is introduced by Clausius [30], expressing a relationship between heat and temperature in a physical system. Later, based on the work by Maxwell [110, 111] in the kinetic theory of gases, Ludwig Boltzmann, Josiah Gibbs and Max Planck are the main architects of statistical mechanics; see, for example, Gibbs [62]. From the work produced by these three authors, reflected in a large number of books and articles,2 the following formulas of entropy emerge in the literature:

S = -k \sum_{i=1}^{w} p_i \log p_i \quad (2.1) \qquad \text{and} \qquad S = k \log w. \quad (2.2)

In the above expressions, S is the entropy, k is the Boltzmann constant, p_i is the probability of microstate i and w is the number of microstates for a given macrostate of the system. If the probabilities p_i are equal, equation (2.2) is obtained from (2.1). In equation (2.2), the entropy of a macrostate increases when the number of microstates increases, i.e., when the number of possible configurations of the atoms increases.
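Writing the step from (2.1) to (2.2) out explicitly for equal probabilities (an added worked step):

p_i = \frac{1}{w}, \; i = 1, \dots, w \quad \Longrightarrow \quad S = -k \sum_{i=1}^{w} \frac{1}{w} \log\frac{1}{w} = -k \log\frac{1}{w} = k \log w .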

The entropy is also connected with the second law of thermodynamics, which generally states that the entropy of isolated systems always increases in order to achieve a maximum at equilibrium. This law expresses an unmistakable reality of nature through the one-way direction it imposes on spontaneous processes. Due to this natural irreversibility imposed by the second law of thermodynamics, a famous “demon” appears in the literature: Maxwell's demon. This demon is an imaginary being that is able to contradict the second law of thermodynamics and accomplish the impossible task of reducing the entropy of an isolated system.3 Note that, although this issue has been debated with little prominence over the years, it has never been abandoned; e.g., Raizen [135].

The interpretation of entropy is still controversial nowadays. This is evident when a simple search is made in some scientific journals and several dozens of works are found that directly or indirectly debate this issue. For example, Styer [163, p. 1090] states that

“Of all the difficult concepts of classical physics – concepts like acceleration, energy, electric field, and time – the most difficult is entropy. Even von Neumann claimed that “nobody really knows what entropy is anyway.” [. . . ] The metaphoric images invoked for entropy include “disorder,” “randomness,” “smoothness,” “dispersion,” and “homogeneity.” In a posthumous fragment, Gibbs mentioned “entropy as mixed-up-ness”.”

2. Some of these works can be downloaded from http://www.archive.org. On the website of the School of Mathematics and Statistics of the University of St Andrews, Scotland, some biographies can be found.

3. In http://nautilus.fis.uc.pt/molecularium/pt/entropia/index.html, an interesting game with Maxwell's demon can be found.

The terms “complexity” and “unavailable energy” are also usually used. Styer [163] suggests the use of both “disorder” and “freedom” to define entropy. This discussion is naturally beyond the scope of this thesis. The literature concerning entropy, thermodynamics and statistical mechanics is massive; see, among many others, Ben-Naim [11], Dugdale [42], Jaynes [86], Sethna [148], Styer [163, 164] and the references therein.

The concept of entropy presented above is strictly confined to physics. The Shannon entropy measure presented next has, in general, a different interpretation and represents a broader concept of entropy. It is interesting to note how Shannon [149, p. 10] introduces the problem:

“Suppose we have a set of possible events whose probabilities of occurrence [. . . ] are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome?”

In the paper A mathematical theory of communication, Shannon [149] begins by defining three properties that such a measure should satisfy.4 These properties are usually the minimal axioms required for a consistent measure of the “amount of uncertainty”; see Jaynes [86, Chapter 11]. Consider H(p_1, p_2, \dots, p_K) as the measure to find, and p_1, p_2, \dots, p_K the probabilities of occurrence from a set of possible events.

Axiom 2.1. H(p_1, p_2, \dots, p_K) should be a continuous function of the p_k, k = 1, 2, \dots, K.

Axiom 2.2. If all the p_k, k = 1, 2, \dots, K, are equal, then H(p_1, p_2, \dots, p_K) should be a monotonic increasing function of K.

Axiom 2.3. If a choice is split into two successive choices, the original H(p_1, p_2, \dots, p_K) should be equal to the weighted sum of the individual values of H(p_1, p_2, \dots, p_K).

4. Only the case of discrete random variables is considered here. The entropy of a continuous distribution can be defined in a similar way considering probability density functions. The continuous case shares most of the properties of the discrete case, but not all of them; see Shannon [149, pp. 35–38] for further details.


Shannon [149] and Jaynes [86, Chapter 11] demonstrate that the only H(p_1, p_2, \dots, p_K) that satisfies Axioms 2.1–2.3 is given by equation (2.3), presented below.

Definition 2.1. The Shannon entropy measure is given by

H(p_1, p_2, \dots, p_K) = -c \sum_{k=1}^{K} p_k \ln p_k, \quad (2.3)

where c is a positive constant and p_k, k = 1, 2, \dots, K, are the probabilities of occurrence from a set of possible events.

Figure 2.1 illustrates the Shannon entropy measure in the case of two possible outcomes with probabilities p and 1 − p, where H(p, 1 − p) = -p \ln p - (1 − p) \ln(1 − p), considering c = 1. Note that 0 \ln 0 = 0 is considered in Definition 2.1 if some p_k = 0, since \lim_{x \to 0} x \ln x = 0 and the continuity of H(p_1, p_2, \dots, p_K) imposed by Axiom 2.1 is satisfied. It is important to note that the Shannon entropy represents an average logarithm of the probabilities p_k, and events with very low or very high probability make a small contribution to the entropy value; e.g., Golan and Perloff [68] and Holste et al. [79].

Figure 2.1: Shannon entropy.
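The curve in Figure 2.1 is easy to reproduce numerically. The following MATLAB fragment (illustrative, not from Appendix C) evaluates H(p, 1 − p) with c = 1 and confirms that the maximum, ln 2, occurs at p = 0.5.

% Binary Shannon entropy H(p, 1-p) with c = 1 (natural logarithm).
p = linspace(0.001, 0.999, 999);
H = -p.*log(p) - (1-p).*log(1-p);
[Hmax, idx] = max(H);
fprintf('maximum entropy %.4f at p = %.3f (ln 2 = %.4f)\n', Hmax, p(idx), log(2));
plot(p, H), xlabel('p'), ylabel('H(p, 1-p)')   % reproduces the shape of Figure 2.1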

Since the choice of c is just a matter of convenience and merely amounts to a choice of a unit of measure, it is usually considered that c = 1. Although the natural logarithm is used in Definition 2.1, it is possible to take a logarithm with any base greater than one (the constant c reflects these changes). As noted by Shannon [149, p. 1], the choice of the base corresponds to the choice of the units in which the information is measured; see the example in Appendix A for details.

With Claude Shannon the concept of entropy acquires a new meaning as a measure of information or uncertainty. In addition to Shannon, other pioneers in information theory deserve a mention here, namely Hartley [74], Nyquist [122, 123] and Wiener [176].

Shannon [149, pp. 11–13] presents six properties supporting the choice of (2.3) as a reasonable measure of information; see also Jaynes [86] and Khinchin [91]. Among these properties, there is one that is extremely important: for a given K, H(p_1, p_2, \dots, p_K) reaches a maximum when all the p_k are equal. This is intuitively the most uncertain situation and, in this case, H(p_1, p_2, \dots, p_K) = c \ln K.

There are several stories in the literature on the choice of the name for the measure presented in Definition 2.1. It seems that Claude Shannon did not know which name he should give to the new measure of information and John von Neumann was the one who suggested the name entropy, since there was already a similar expression used in statistical mechanics; see (2.1) presented previously. This story is told in Tribus and McIrvine [172, p. 180]:

“In 1961 one of us (Tribus) asked Shannon what he had thought about when he had finally confirmed his famous measure. Shannon replied: “My greatest concern was what to call it. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one knows what entropy really is, so in a debate you will always have the advantage.’ ””

In Appendix A, the Shannon entropy as a measure of information is illustrated in the discussion of two simple games with roulette wheels. Several other interesting examples that illustrate the Shannon entropy as a measure of information can be found, among many others, in Brillouin [17] and Yaglom and Yaglom [178]. The Shannon entropy has acquired an importance that extends well beyond the field where it was initially developed. Nowadays, the Shannon entropy measure is found in many areas of science.

Two of the most well-known generalizations of the Shannon entropy measure are presented next: the Rényi entropy and the Tsallis entropy.

Rényi [139, 140] generalizes the notion of random variable and defines an incomplete random variable, an incomplete probability distribution and a complete conditional distribution of the incomplete random variable. Note that if X is an incomplete random variable with values x_k and associated probabilities p_k > 0, k = 1, 2, \dots, K, then \sum_{k=1}^{K} p_k \le 1, and not necessarily \sum_{k=1}^{K} p_k = 1.

Rényi [140, pp. 570–574] defines the gain of information, denoted by I(Q‖P), and presents six postulates that this measure must satisfy. I(Q‖P) represents the gain of information obtained when an incomplete distribution P = (p_1, p_2, \dots, p_K), with p_k > 0, ∀k, of an incomplete random variable X is substituted by an incomplete distribution Q = (q_1, q_2, \dots, q_K).

Assuming that I(Q‖P) satisfies the six postulates already mentioned, Theorem 2.1 defines the gain of information (Rényi [140, p. 574]).

Theorem 2.1. (Measure of order α of the gain of information). There exists a real number α ≠ 1 such that

I(Q‖P) = I_\alpha(Q‖P) = \frac{1}{\alpha - 1} \ln \left( \frac{1}{\sum_{k=1}^{K} q_k} \sum_{k=1}^{K} \frac{q_k^{\alpha}}{p_k^{\alpha - 1}} \right) \quad (2.4)

or

I(Q‖P) = I_1(Q‖P) = \frac{1}{\sum_{k=1}^{K} q_k} \sum_{k=1}^{K} q_k \ln \frac{q_k}{p_k}. \quad (2.5)

Proof. See Rényi [140, pp. 574–578].

Based on Theorem 2.1, the Rényi entropy can be established for a complete probability distribution (a distribution of an ordinary random variable).


Definition 2.2. The Rényi entropy measure of order α for a complete probability distribution is given by

H_\alpha^R(p_1, p_2, \dots, p_K) = \frac{1}{1 - \alpha} \ln \sum_{k=1}^{K} p_k^{\alpha}, \quad (2.6)

where α ≠ 1 is a real number and p_k are the probabilities from the complete distribution.

An important relation between the Rényi entropy and the Shannon entropy is presented below.

Proposition 2.1. The Rényi entropy reduces to the Shannon entropy when α → 1.

Proof. Using L'Hôpital's rule, it can be established that

\lim_{\alpha \to 1} \frac{\ln \sum_{k=1}^{K} p_k^{\alpha}}{1 - \alpha} = \lim_{\alpha \to 1} \frac{\frac{1}{\sum_{k=1}^{K} p_k^{\alpha}} \sum_{k=1}^{K} p_k^{\alpha} \ln p_k}{-1} = -\sum_{k=1}^{K} p_k \ln p_k. \quad (2.7)

Figure 2.2: Rényi entropy.

The Rényi entropy satisfies some of the properties of the Shannon entropy, namely standard additivity and non-negativity, and both reach an extreme value when all probabilities are equal; see Curado and Tsallis [34], Rényi [140] and Tavares [167]. However, one distinctive characteristic of the Rényi entropy, when compared with the Shannon entropy, is the dependence on the real number α, implying that the Rényi entropy is not always concave, as illustrated in Figure 2.2 for a complete distribution P = (p, 1 − p).

As mentioned by Rényi [140, p. 581], the entropy measure presented in Definition 2.2 should be considered as a true measure of information only when α is positive.7 Furthermore, the Rényi entropy is defined as an average of probabilities p_k raised to powers of α, rather than the average logarithm defined by the Shannon entropy. The value of α > 1 defines the relative contribution of event k to the entropy value, and thus events with higher probability contribute more to the value of the entropy than lower probability events; see, for example, Golan and Perloff [68] and Holste et al. [79].
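A quick numerical check of Proposition 2.1 in MATLAB (the distribution below is an arbitrary illustrative choice, not data from the thesis):

% Renyi entropy H_alpha^R approaches the Shannon entropy as alpha -> 1.
p = [0.5 0.2 0.2 0.1];                         % an arbitrary complete distribution
shannon = -sum(p .* log(p));
for alpha = [2 1.5 1.1 1.01 1.001]
    renyi = log(sum(p.^alpha)) / (1 - alpha);  % equation (2.6) with natural logarithm
    fprintf('alpha = %5.3f  Renyi = %.6f  (Shannon = %.6f)\n', alpha, renyi, shannon);
end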

Later, Tsallis [173] develops another entropy measure, which is defined next.

Definition 2.3. The Tsallis entropy measure is given by

H_\alpha^T(p_1, p_2, \dots, p_K) = \frac{c}{\alpha - 1} \left( 1 - \sum_{k=1}^{K} p_k^{\alpha} \right), \quad (2.8)

where α ≠ 1 is a real number, c is a positive constant (usually assumed c = 1) and p_k are the probabilities of occurrence from a set of possible events (possible microscopic configurations of a system in the original context).

Figure 2.3 illustrates the Tsallis entropy measure presented in Definition 2.3 for K = 2 and some values of α. Considering only the case of α > 1, as for the Rényi entropy, events with higher probability contribute more to the value of the Tsallis entropy than lower probability events.

Proposition 2.2. The Tsallis entropy reduces to the Shannon entropy when α → 1.

Proof. It can be easily established that

\lim_{\alpha \to 1} \frac{c}{\alpha - 1} \left( 1 - \sum_{k=1}^{K} p_k^{\alpha} \right) = c \lim_{\alpha \to 1} \frac{\sum_{k=1}^{K} p_k \left( 1 - p_k^{\alpha - 1} \right)}{\alpha - 1} = -c \sum_{k=1}^{K} p_k \ln p_k, \quad (2.9)

which is the Shannon entropy measure; see Tavares [167, p. 62] for a detailed discussion.

7. For convenience, it is always considered, throughout this work, that α > 1.


Figure 2.3: Tsallis entropy.

The Tsallis entropy shares some properties with the Shannon and Rényi entropies, namely it is non-negative and it reaches an extreme value when all the probabilities are equal. However, the standard additivity property satisfied by the Shannon and Rényi entropies is violated by the Tsallis entropy. One of the most important features of the Tsallis entropy is pseudoadditivity; e.g., Abe [1], Curado and Tsallis [34], Santos [147] and Tsallis [173].

Theorem 2.2. (Pseudoadditivity). For two independent random variables A and B,

HαT(A, B) = HαT(A) + HαT(B) + (1 − α)HαT(A)HαT(B). (2.10)

Proof. With no loss of generality, it is assumed, as usual, that c = 1. Considering
\[
H_{\alpha}^{T}(A) = \frac{1}{\alpha - 1}\left(1 - \sum_{k_A=1}^{K_A} (p_{k_A})^{\alpha}\right)
\quad \text{and} \quad
H_{\alpha}^{T}(B) = \frac{1}{\alpha - 1}\left(1 - \sum_{k_B=1}^{K_B} (p_{k_B})^{\alpha}\right), \tag{2.11}
\]
since A and B are independent random variables, then
\[
H_{\alpha}^{T}(A, B) = \frac{1}{\alpha - 1}\left(1 - \sum_{k_A=1}^{K_A} \sum_{k_B=1}^{K_B} (p_{k_A} p_{k_B})^{\alpha}\right). \tag{2.12}
\]
Therefore,
\[
\begin{aligned}
H_{\alpha}^{T}(A, B) &= \frac{1}{\alpha - 1}\left(1 - \sum_{k_A=1}^{K_A} (p_{k_A})^{\alpha}\right) + \frac{1}{\alpha - 1}\left(1 - \sum_{k_B=1}^{K_B} (p_{k_B})^{\alpha}\right) \\
&\quad - \frac{1}{\alpha - 1}\left(1 - \sum_{k_A=1}^{K_A} (p_{k_A})^{\alpha} - \sum_{k_B=1}^{K_B} (p_{k_B})^{\alpha} + \sum_{k_A=1}^{K_A} (p_{k_A})^{\alpha} \sum_{k_B=1}^{K_B} (p_{k_B})^{\alpha}\right) \\
&= H_{\alpha}^{T}(A) + H_{\alpha}^{T}(B) + (1 - \alpha) H_{\alpha}^{T}(A) H_{\alpha}^{T}(B). \tag{2.13}
\end{aligned}
\]
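A quick numerical check of the pseudoadditivity property (2.10) can be written in a few lines; the two marginal distributions below are arbitrary choices used only for this illustration, and c = 1 as in the proof.

    import numpy as np

    def tsallis(p, alpha, c=1.0):
        # Tsallis entropy, equation (2.8), with c = 1 by default
        p = np.asarray(p, dtype=float)
        return c / (alpha - 1.0) * (1.0 - np.sum(p ** alpha))

    alpha = 2.0
    pA = np.array([0.6, 0.4])           # arbitrary distribution of A
    pB = np.array([0.2, 0.3, 0.5])      # arbitrary distribution of B
    pAB = np.outer(pA, pB).ravel()      # joint distribution under independence

    lhs = tsallis(pAB, alpha)
    rhs = (tsallis(pA, alpha) + tsallis(pB, alpha)
           + (1 - alpha) * tsallis(pA, alpha) * tsallis(pB, alpha))
    print(lhs, rhs)                     # the two values coincide, confirming (2.10)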

For α ≠ 1, the Tsallis entropy is a nonextensive measure, where α determines the degree of nonextensivity of a system; see, for example, Di Sisto et al. [37], Plastino and Plastino [131], Santos [147] and Suyari [165] for further details. Nonextensive statistical mechanics is nowadays a very active research field. The latest research conducted by the group of Prof. Constantino Tsallis, as well as a list with some thousands of published works related to nonextensive statistical mechanics, can be accessed on the website of the Centro Brasileiro de Pesquisas Físicas.

It is also important to note that, in addition to the fact that the Rényi and Tsallis entropies are related to the Shannon entropy, the Rényi and Tsallis entropies are also related to each other (e.g., Curado and Tsallis [34]).

Proposition 2.3. The Rényi entropy is related to the Tsallis entropy as follows:
\[
H_{\alpha}^{R}(p_1, p_2, \ldots, p_K) = \frac{1}{1 - \alpha} \ln\left[1 + (1 - \alpha)\, H_{\alpha}^{T}(p_1, p_2, \ldots, p_K)\right].
\]
Proof. Assuming c = 1, the relation trivially holds by substitution.
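Under the same kind of illustrative assumptions (an arbitrary distribution and c = 1), the identity in Proposition 2.3 can also be verified numerically with a few lines of Python.

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])       # arbitrary distribution
    alpha = 2.0
    renyi = np.log(np.sum(p ** alpha)) / (1 - alpha)       # Rényi entropy (Definition 2.2)
    tsallis = (1 - np.sum(p ** alpha)) / (alpha - 1)       # Tsallis entropy (2.8), c = 1
    print(renyi, np.log(1 + (1 - alpha) * tsallis) / (1 - alpha))   # identical values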

It is important to note that the Shannon, Rényi and Tsallis entropies are not the only entropy measures in the literature. For example, Taneja [166] presents 25 different entropy expressions, 24 of which have the Shannon entropy as a limit or a particular case. Taneja [166, p. 410] presents an interesting “entropy graph” that indicates how the 24 entropies are related to the Shannon entropy measure.

The literature on entropy and information theory is vast, both theoretical and applied; see, among many others, Ash [8], Brillouin [17], Dionísio et al. [38, 39], Galleani and Garello [59], Jaynes [86], Khinchin [91], Mana [107], Merhav [115], Rastegin [138], Saboia et al. [145], Vila et al. [174] and Yaglom and Yaglom [178]. Tavares [167] is an excellent reference in Portuguese concerning mathematical issues of the Shannon, Rényi and Tsallis entropies, as well as the foundations of entropy.

2.1.2 Maximum entropy principle

Jaynes [86, p. 365] states that the maximum entropy (ME) principle is a simple and straightforward idea. This statement is explored in this subsection. Consider a pure linear inverse model, defined next.

Definition 2.4. A pure linear inverse model is stated as

\[
\mathbf{y} = X\boldsymbol{\beta}, \tag{2.14}
\]

where y denotes a known (N × 1) vector of observations, β is a (K × 1) vector of unknown parameters, and X is a known (N × K) matrix.

Assuming an exact relationship between the dependent variable and the independent variables, there is no error term in (2.14). Following Golan et al. [69], an ill-posed model is specified by considering that X is a non-invertible matrix with N < K, and β = p is a vector of probabilities such that $\sum_{k=1}^{K} p_k = 1$ and $0 < p_k < 1$, for k = 1, 2, . . . , K. From all the probability distributions that satisfy model (2.14), how can an unambiguous estimate of p be chosen? The ME principle proposed by Jaynes [81, 82] provides an answer by choosing the distribution of probabilities that maximizes the Shannon entropy measure. Jaynes [81, p. 623] is clear:

“[. . . ] in making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make [. . . ].”

A question arises: why the maximization of the Shannon entropy? In order to justify this, an approach based on the Wallis derivation is considered here, since it leads to the maximization of (2.3) without conditions and, perhaps more important, without the need for an interpretation of a measure of uncertainty; see Jaynes [86, p. 351] for further details.

Consider an experiment with K possible outcomes that is repeated in N trials (the development here follows the one presented by Golan et al. [69, pp. 8–9]), and let $N_1, N_2, \ldots, N_K$ be the number of times that each outcome k occurs in the experiment, such that $\sum_{k=1}^{K} N_k = N$ and $N_k \geq 0$. The number of ways a particular set of frequencies, say $N_k = N p_k$, can be realized is given by
\[
W = \frac{N!}{N_1! \, N_2! \cdots N_K!}, \tag{2.15}
\]

known as the multinomial coefficient. Thus, the most probable set of frequencies (i.e., the set of frequencies that occurs in the greatest number of ways) must be chosen in order to maximize W or a monotonic function of W, such as
\[
\ln W = \ln N! - \sum_{k=1}^{K} \ln N_k!. \tag{2.16}
\]

As N → ∞, it follows that $N_k/N \to p_k$, and using Stirling's approximation, it can be found that
\[
\ln W \approx N \ln N - N - \left(\sum_{k=1}^{K} N_k \ln N_k - \sum_{k=1}^{K} N_k\right)
\approx N \ln N - \sum_{k=1}^{K} N_k \ln N_k
\approx N \ln N - \sum_{k=1}^{K} N p_k \ln (N p_k). \tag{2.17}
\]
Since
\[
\sum_{k=1}^{K} N p_k \ln (N p_k) = \sum_{k=1}^{K} N_k \ln N + \sum_{k=1}^{K} N p_k \ln p_k, \tag{2.18}
\]
it follows that
\[
\ln W \approx -N \sum_{k=1}^{K} p_k \ln p_k, \tag{2.19}
\]
which means that $N^{-1} \ln W \approx -\sum_{k=1}^{K} p_k \ln p_k$ (simulations with marbles in cells illustrating this relationship are provided by Prof. Arieh Ben-Naim; information is available at http://www.ariehbennaim.com/books/discover.html). The set of frequencies that occurs in the greatest number of ways is just the one that maximizes the Shannon entropy measure.
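The quality of this approximation can be inspected numerically. The sketch below (an illustration only, with an arbitrary distribution and a few arbitrary values of N) computes $N^{-1}\ln W$ exactly through log-factorials and compares it with the Shannon entropy.

    import numpy as np
    from scipy.special import gammaln

    def log_multiplicity(counts):
        # ln W = ln N! - sum_k ln N_k!, computed via gammaln(n + 1) = ln n!
        counts = np.asarray(counts, dtype=float)
        return gammaln(counts.sum() + 1.0) - np.sum(gammaln(counts + 1.0))

    p = np.array([0.5, 0.3, 0.2])                 # arbitrary distribution
    shannon = -np.sum(p * np.log(p))
    for N in (10, 100, 10000):
        counts = N * p                            # frequencies N_k = N p_k
        print(N, log_multiplicity(counts) / N, shannon)
    # As N grows, N^{-1} ln W approaches the Shannon entropy, in line with (2.19).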


Following Jaynes [81, 86] and Golan et al. [69], by maximizing (2.3) subject to the limited available data in model (2.14), the most probable set of $p_k$ that is consistent with the available information is obtained. The importance of measure (2.3) in the ME principle is recognized by Jaynes [81, p. 622]:

“The great advance provided by information theory lies in the discovery that there is a unique, unambiguous criterion for the “amount of uncertainty” represented by a discrete probability distribution [. . . ].”

The ME formalism is defined next in matrix form.

Definition 2.5. For a pure linear inverse model as stated in Definition 2.4, where β = p is a vector of probabilities, the ME formalism is defined as
\[
\underset{\mathbf{p}}{\operatorname{argmax}} \left\{ -\mathbf{p}' \ln \mathbf{p} \right\}, \tag{2.20}
\]
subject to the model (or data consistency) constraint, $X\mathbf{p} = \mathbf{y}$, and the additivity (or normalization) constraint, $\mathbf{1}'\mathbf{p} = 1$, where $\mathbf{1}$ is a (K × 1) vector of ones, and $\mathbf{p} > \mathbf{0}$ is a (K × 1) vector of probabilities.

The ME principle provides a tool to make the best prediction (i.e., the one that is most strongly indicated) from the available information (and only this, since the introduction of any other subjective information is not recommended). If the entropy function in (2.20) is maximized without the model constraint, the uniform distribution is obtained as the solution. In this case, it is interesting to note that the ME principle can be seen as an extension of Bernoulli's principle of insufficient reason; see Jaynes [81, p. 623].

The analytical solution of the maximization problem specified in Definition 2.5 can be obtained using the traditional Lagrange multipliers method. In matrix form, the Lagrangian function is given by
\[
L(\mathbf{p}, \boldsymbol{\lambda}, \mu) = -\mathbf{p}' \ln \mathbf{p} + \boldsymbol{\lambda}'(\mathbf{y} - X\mathbf{p}) + \mu(1 - \mathbf{1}'\mathbf{p}), \tag{2.21}
\]
with the first-order optimality conditions
\[
\frac{\partial L(\cdot)}{\partial \mathbf{p}} = -\ln \mathbf{p} - \mathbf{1} - X'\boldsymbol{\lambda} - \mu\mathbf{1} = \mathbf{0}, \tag{2.22}
\]
\[
\frac{\partial L(\cdot)}{\partial \boldsymbol{\lambda}} = \mathbf{y} - X\mathbf{p} = \mathbf{0}, \tag{2.23}
\]
\[
\frac{\partial L(\cdot)}{\partial \mu} = 1 - \mathbf{1}'\mathbf{p} = 0. \tag{2.24}
\]

Solving for $p_k$ in terms of $\boldsymbol{\lambda}$, it follows that
\[
\hat{p}_k = \frac{\exp(-\mathbf{x}_k' \hat{\boldsymbol{\lambda}})}{\sum_{k=1}^{K} \exp(-\mathbf{x}_k' \hat{\boldsymbol{\lambda}})}, \tag{2.25}
\]
where $\mathbf{x}_k$ is an (N × 1) vector corresponding to the kth column of X and $\hat{\boldsymbol{\lambda}}$ is an (N × 1) vector of estimated Lagrange multipliers on the model constraint. Equivalently, the solution can also be presented by
\[
\hat{p}_k = \frac{\exp\left(-\sum_{n=1}^{N} x_{nk} \hat{\lambda}_n\right)}{\sum_{k=1}^{K} \exp\left(-\sum_{n=1}^{N} x_{nk} \hat{\lambda}_n\right)}. \tag{2.26}
\]

Given the Lagrangian function and the first-order optimality conditions, the Hessian matrix is given by
\[
\nabla^2(\mathbf{p}) =
\begin{pmatrix}
-\frac{1}{p_1} & 0 & \cdots & 0 \\
0 & -\frac{1}{p_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & -\frac{1}{p_K}
\end{pmatrix}, \tag{2.27}
\]

implying that it is negative definite for $0 < p_k < 1$, so a unique solution (a global maximum) of the ME formalism is ensured. The maximization problem in Definition 2.5 does not have a closed-form solution, which means that the ME solution must be found with numerical optimization procedures. Finally, it is important to note that the Lagrange multipliers on the model constraint reflect the information contribution of each constraint to the objective function. For example, if some $\lambda_n$ is zero, this implies that the corresponding constraint has no “informational value” and does not reduce the maximum entropy value, i.e., the level of uncertainty; see Golan et al. [69, p. 26] for further details.
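Before turning to the examples, a minimal sketch of such a numerical procedure is given below, using a general-purpose constrained optimizer; the matrix X, the vector y and the solver settings are illustrative assumptions, not a formulation taken from the text.

    import numpy as np
    from scipy.optimize import minimize

    def max_entropy(X, y):
        # Maximize -sum_k p_k ln p_k subject to X p = y and sum_k p_k = 1 (Definition 2.5).
        K = X.shape[1]
        neg_entropy = lambda p: np.sum(p * np.log(p))
        constraints = [{"type": "eq", "fun": lambda p: X @ p - y},
                       {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
        bounds = [(1e-10, 1.0)] * K               # keep each p_k strictly positive
        p0 = np.full(K, 1.0 / K)                  # start from the uniform distribution
        return minimize(neg_entropy, p0, method="SLSQP",
                        bounds=bounds, constraints=constraints).x

    X = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.0, 1.0, 0.0, 0.0]])          # arbitrary (N x K) matrix with N < K
    y = np.array([2.9, 0.4])                      # arbitrary consistent observations
    p_hat = max_entropy(X, y)
    print(p_hat, X @ p_hat)                       # estimated probabilities; X p_hat recovers y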

To illustrate the ME formalism, a simple example is presented. Suppose that an experiment with three possible outcomes, 1, 2 and 3, is repeated a large number N of times, and that the only available information is the average of the outcomes, y. For example, assuming that y = 2.5, what is the expected probability of outcome 1 in the (N + 1)th trial of this experiment?

There are three unknowns ($p_1$, $p_2$ and $p_3$) and two constraints ($p_1 + p_2 + p_3 = 1$ and $p_1 + 2p_2 + 3p_3 = y$). Following the ME formalism from Definition 2.5, the solution is given by the probabilities that maximize
\[
H(p_1, p_2, p_3) = -p_1 \ln p_1 - p_2 \ln p_2 - p_3 \ln p_3 \tag{2.28}
\]
subject to
\[
p_1 + 2p_2 + 3p_3 = 2.5 \tag{2.29}
\]
and
\[
p_1 + p_2 + p_3 = 1. \tag{2.30}
\]
Since $p_1$ is the only variable that matters once the constraints are substituted, one can simply maximize the following entropy function:
\[
H(p_1) = -p_1 \ln p_1 - (-2p_1 + 0.5) \ln(-2p_1 + 0.5) - (p_1 + 0.5) \ln(p_1 + 0.5). \tag{2.31}
\]

It follows that p1 ≈ 0.12. Suppose now that the average of outcomes from a large number N

is y = 2. What is now the expected probability of outcome 1? It is expected that p1 = 1/3,

since y = 2 is the mean of a discrete uniform distribution (1, 3).
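As a small numerical check (again only a sketch), maximizing the one-dimensional function (2.31) over its feasible interval 0 < p_1 < 0.25 reproduces the value reported above.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def H(p1):
        # Entropy as a function of p1 only, equation (2.31)
        p2 = 0.5 - 2 * p1
        p3 = 0.5 + p1
        return -(p1 * np.log(p1) + p2 * np.log(p2) + p3 * np.log(p3))

    res = minimize_scalar(lambda p1: -H(p1), bounds=(1e-6, 0.25 - 1e-6), method="bounded")
    print(res.x)    # approximately 0.12, as stated in the text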

To explain this issue in more detail, the previous example is extended to the die problem presented in Golan et al. [69, p. 12], which is based, in turn, on the problems discussed by Jaynes [83, 86]. Knowing that the average outcome from a large number N of independent rolls of a die is y, the aim is to estimate the probability vector $\mathbf{p} = (p_1, p_2, \ldots, p_6)$. Only with this information at hand (the average of the results), the ME principle can be applied to select the probability vector p that maximizes
\[
H(\mathbf{p}) = H(p_1, p_2, \ldots, p_6) = -\sum_{k=1}^{6} p_k \ln p_k \tag{2.32}
\]

subject to the model constraint
\[
\sum_{k=1}^{6} k\, p_k = y \tag{2.33}
\]

and the additivity constraint
\[
\sum_{k=1}^{6} p_k = 1. \tag{2.34}
\]
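A numerical sketch of the die problem is given below. It uses the concentrated (dual) objective $\ln\sum_{k=1}^{6}\exp(-\lambda k) + \lambda y$, whose minimizer is the Lagrange multiplier appearing in the solution form (2.25); this unconstrained reformulation is a standard device assumed here for the illustration rather than a derivation taken from the text, and the value y = 2.0 is an arbitrary choice.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def die_me(y, K=6):
        # Minimize the concentrated objective ln sum_k exp(-lambda k) + lambda y over lambda;
        # the minimizer is the estimated multiplier, and the probabilities follow (2.25).
        k = np.arange(1, K + 1)
        dual = lambda lam: np.log(np.sum(np.exp(-lam * k))) + lam * y
        lam_hat = minimize_scalar(dual).x
        p_hat = np.exp(-lam_hat * k)
        return lam_hat, p_hat / p_hat.sum()

    lam_hat, p_hat = die_me(2.0)
    print(lam_hat)                                 # approximately 0.63
    print(p_hat)                                   # probabilities decrease in k, since y < 3.5
    print(p_hat @ np.arange(1, 7), p_hat.sum())    # constraints (2.33) and (2.34) hold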

