
q-Gaussians for pattern recognition


Pós-Graduação em Ciência da Computação

q-Gaussians for pattern recognition

Por

Dusan Stosic

Dissertação de Mestrado

Universidade Federal de Pernambuco posgraduacao@cin.ufpe.br www.cin.ufpe.br/~posgraduacao


UNIVERSIDADE FEDERAL DE PERNAMBUCO CENTRO DE INFORMÁTICA

PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO

DUSAN STOSIC

Q-GAUSSIANS FOR PATTERN RECOGNITION

ESTE TRABALHO FOI APRESENTADO À PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO DO CENTRO DE INFORMÁTICA DA UNIVERSIDADE FEDERAL DE PERNAMBUCO COMO REQUISITO PARCIAL PARA OBTENÇÃO DO GRAU DE MESTRE EM CIÊNCIA DA COMPUTAÇÃO.

ORIENTADOR(A): Teresa Bernarda Ludermir

RECIFE 2016


Catalogação na fonte

Bibliotecária Monick Raquel Silvestre da S. Portes, CRB4-1217

S888a Stosic, Dusan

q-Gaussians for pattern recognition / Dusan Stosic. – 2016. 73 p.: il., fig., tab.

Orientadora: Teresa Bernarda Ludermir.

Dissertação (Mestrado) – Universidade Federal de Pernambuco.

CIn, Ciência da computação, Recife, 2016. Inclui referências e anexos.

1. Inteligência artificial. 2. Reconhecimento de padrões. 3. Redes neurais. I. Ludermir, Teresa Bernarda (orientadora). II. Título.

006.3 CDD (23. ed.) UFPE- MEI 2016-028


Dusan Stosic

q-Gaussians for pattern recognition

Dissertação de Mestrado apresentada ao Programa de Pós-Graduação em Ciência da Computação da Universidade Federal de Pernambuco, como requisito parcial para a obtenção do título de Mestre em Ciência da Computação

Aprovado em: 01/03/2016.

BANCA EXAMINADORA

__________________________________________ Prof. Dr. Tsang Ing Ren

Centro de Informática / UFPE

__________________________________________ Prof. Dr. Adauto Jose Ferreira de Souza

Departamento de Física / UFRPE

__________________________________________ Profa. Dra. Teresa Bernarda Ludermir

Centro de Informática / UFPE


Acknowledgements


Abstract

Pattern recognition plays an important role in solving many problems in our everyday lives: from simple tasks such as reading texts to more complex ones like driving cars. Subconsciously, the recognition of patterns is instantaneous and an innate ability to every human. However, programming (or “teaching”) a machine how to do the same can present an incredibly difficult task. There are many situations where irrelevant or misleading patterns, poorly represented classes, and complex decision boundaries make recognition very hard, or even impossible by current standards. Important contributions to the field of pattern recognition have been attained through the adoption of methods of statistical mechanics, which has paved the road for much of the research done in academia and industry, ranging from the revival of connectionism to modern-day deep learning. Yet traditional statistical mechanics is not universal and has a limited domain of applicability; outside this domain it can make wrong predictions. Non-extensive statistical mechanics has recently emerged to cover a variety of anomalous situations that cannot be described within standard Boltzmann-Gibbs theory, such as non-ergodic systems characterized by long-range interactions, or long-term memories. The literature on pattern recognition is vast, and scattered with applications of non-extensive statistical mechanics. However, most of this work has been done using non-extensive entropy, and little can be found on practical applications of other non-extensive constructs. In particular, non-extensive entropy is widely used to improve segmentation of images that possess strongly correlated patterns, while only a small number of works employ concepts other than entropy for solving similar recognition tasks. The main goal of this dissertation is to expand applications of non-extensive distributions, namely the q-Gaussian, in pattern recognition. We present our contributions in the form of two (published) articles where practical uses of q-Gaussians are explored in neural networks. The first paper introduces q-Gaussian transfer functions to improve classification of random neural networks, and the second paper extends this work to ensembles, which involves combining a set of such classifiers via majority voting.

Keywords: non-extensive statistics. pattern recognition. q-Gaussian. random neural networks.


Resumo

Reconhecimento de padrões tem um papel importante na solução de diversos problemas no nosso quotidiano: a partir de tarefas simples como ler textos, até as mais complexas como dirigir carros. Inconscientemente, o reconhecimento de padrões pelo cérebro é instantâneo, representando uma habilidade inata de cada ser humano. No entanto, programar (ou “ensinar”) uma máquina para fazer o mesmo pode se tornar uma tarefa extremamente difícil.

Há muitas situações onde padrões irrelevantes ou enganosos, classes mal representadas, ou bordas de decisão complexas, tornam o reconhecimento muito difícil, ou mesmo impossível pelos padrões atuais. Diversas contribuições importantes na área de reconhecimento de padrões foram alcançadas através da aplicação de métodos provenientes da mecânica estatística, que estimularam uma grande parte da pesquisa conduzida na academia bem como na indústria, desde o renascimento do conexionismo até o moderno conceito de “deep learning”. No entanto, a mecânica estatística tradicional não é universal e tem um domínio de aplicação limitado; fora deste domínio ela pode fazer previsões erradas. A mecânica estatística não-extensiva surgiu recentemente para atender uma variedade de situações anômalas que não podem ser descritas de forma adequada com a teoria de Boltzmann-Gibbs, tais como sistemas não-ergódicos, caracterizados por interações de longo alcance, ou memórias de longo prazo. A literatura sobre reconhecimento de padrões é vasta, e dispersa com aplicações da mecânica estatística não-extensiva. No entanto, a maioria destes trabalhos utiliza a entropia não-extensiva, e existem poucas aplicações práticas de outros conceitos não-extensivos. Em particular, a entropia não-extensiva é amplamente usada para aperfeiçoar a segmentação de imagens que possuem padrões fortemente correlacionados, enquanto apenas um pequeno número de trabalhos emprega outros conceitos não-extensivos para resolver tarefas semelhantes. O objetivo principal desta dissertação é expandir aplicações de distribuições não-extensivas, como a q-Gaussiana, em reconhecimento de padrões. Nós apresentamos as nossas contribuições no formato de dois artigos (publicados) onde exploramos usos práticos da q-Gaussiana em redes neurais. O primeiro artigo introduz funções de transferência baseadas na q-Gaussiana para aperfeiçoar a classificação de redes neurais aleatórias, e o segundo artigo estende este trabalho para ensembles, onde um conjunto de tais classificadores é combinado através de votação por maioria.

Palavras-chave: estatística não-extensiva. reconhecimento de padrões. q-Gaussiana. redes neurais aleatórias.


List of Figures

Figure 1 – Phase space portraits.

Figure 2 – Scaling of entropy, S, for (a) locally or (b) globally correlated systems.

Figure 3 – Non-extensive generalization of (a) exponential and (b) entropy.

Figure 4 – Schematic diagram of neural network architectures.

Figure 5 – Neural network representation of (a) input space and (b) hidden space.

Figure 6 – Non-extensive generalization of (a) hyperbolic tangent and (b) polyharmonic spline.

Figure 7 – Probability density functions estimated from mixture models applied to


Contents

1 INTRODUCTION
2 NON-EXTENSIVE STATISTICAL MECHANICS
2.1 Background
2.2 Theory
2.3 Formalism
2.4 Applications
2.4.1 Natural Sciences
2.4.2 Social Sciences
2.4.3 Computer Sciences
3 PATTERN RECOGNITION
3.1 Statistical Pattern Recognition
3.2 Neural networks
4 CONTRIBUTION
5 CONCLUSION
REFERENCES
ANNEX
ANNEX A – QRNN: Q-GENERALIZED RANDOM NEURAL NETWORK
ANNEX B – VOTING BASED Q-GENERALIZED EXTREME LEARNING MACHINE


1 Introduction

The classical view of the universe until the early 20th century was that physical laws are mechanical in nature, where these laws govern all natural systems and correspond to precise equations of motion. The equations determine positions and velocities of every particle at any point in time, and once solved they provide a complete time evolution of the system. The outcome of a dynamical system can then be predicted at any time given all of its elements are known to us. There is however no theoretically plausible way of simultaneously knowing the exact positions and velocities of each particle because of the uncertainty principle. Moreover, we take this ability for granted until faced with the task of describing such systems at a human scale. Although quantum theory has emerged to replace classical mechanics, it becomes intractable for macroscopic problems as the number of particles is very large and interactions between particles are exceedingly complex. We can then ask ourselves whether it is possible to fill this gap between the laws of mechanics and macroscopic systems that arise in nature. Statistical mechanics is a branch of physics that attempts to provide such a connection by applying concepts from probability theory and statistics.

Statistical mechanics has roots in thermodynamics and plays a crucial role in the development of quantum mechanics. It combines the laws of mechanics and theory of probabilities to provide a connection between macroscopic properties of the system and its microscopic constituents. More generally, the formalism describes how can distinct collective behavior emerge in a system due to interactions with many degrees of freedom. The goal of statistical mechanics is to explain the behavior of systems that we observe in everyday life by studying statistical properties of their underlying components (e.g. molecules in a liquid solution). Statistical mechanics provides a framework with a collection of mathematical tools reserved for dealing with large systems of elements that are subject to the microscopic laws of mechanics. This framework allows us to describe complex systems using relatively simple models where an exact understanding of the microscopic world is not necessary. An early use of statistical mechanics was to explain thermodynamic properties of materials in equilibrium. Since then it has evolved to encompass a much larger domain, creating a highly versatile and exciting area of research. Statistical physics is now accepted as the most interdisciplinary branch of physics with applications in nearly every other field. Its techniques often overlap with many other disciplines, from which a vast array of applications have emerged in areas outside of physics, such as biological and chemical sciences, computer sciences and engineering, and social sciences (1).

The recognition of patterns is an innate ability to every human. Starting at an early age, toddlers can already recognize patterns in ideas, words, symbols, numbers,
and images better than most automatic (machine) recognition systems. It also plays an important role in many everyday life experiences such as reading texts, identifying objects, understanding sounds, and navigation. However, teaching a machine to do the same is an incredibly difficult task. Pattern recognition is the study of how machines can learn to identify patterns and regularities in the data. There are a number of important problems that employ pattern recognition in disciplines such as biology, computer science, medicine, psychology, and statistics. Interest in the area is growing due to the availability of large databases as well as emerging applications which are becoming more challenging and computationally demanding (2). Automatic pattern recognition systems have recently made significant advances in solving problems that involve image and speech recognition, natural language understanding, among others (3). These computer algorithms seem to surpass humans in several visual recognition tasks mainly because they excel at identifying fine-grained objects. For example, humans can easily recognize an object as a dog or a painting, but have trouble distinguishing between different dog species or types of paintings, since that requires domain expertise that only the machine might learn (4). Yet machines still make mistakes in cases that are trivial for humans, usually in problems that are context specific. Thus, finding new ways to improve pattern recognition algorithms is an important step toward achieving near-human recognition.

In connection with pattern recognition, statistical mechanics has paved the road for many of the algorithms that are being used today in academic research and industry. Perhaps the first crossover was made in 1969 when Richardson (5) established a mathematical connection between classes of problems in pattern recognition and in statistical mechanics. However, statistical mechanics only made a significant impact in the pattern recognition community with the revival of connectionism, where Hopfield (6) proposed a new form of neural network based on the Ising model that could learn and process information in an entirely new way. This paved the road for a widespread adoption of concepts and methods from statistical mechanics in recognition tasks. Ackley (7) introduced Boltzmann machines as more efficient and stochastic counterparts of Hopfield networks that rely on the Boltzmann factor for their sampling function. Monte Carlo methods have emerged as the most influential in pattern recognition, with applications ranging from true Bayesian approaches of sampling the posterior distribution (8) to optimization problems via simulated annealing (9). Recent advances in deep learning are also indebted to influences from statistical mechanics, where stacking restricted Boltzmann machines in deep belief networks (10) is one of the most common machine learning strategies being used today.

Statistical mechanics is not universal and has a limited domain of applicability; outside this domain it tends to make wrong predictions. Non-extensive statistical mechanics was proposed in order to cover a variety of anomalous systems that could not be described otherwise. These include many non-ergodic systems which are typically characterized by long-range interactions (correlations) or long-term memories. The advent of non-extensive
statistics opens new possibilities for its treatment in thermodynamics as well as in many interdisciplinary areas. Since its inception we have witnessed an explosion in non-extensive literature with a diverse set of applications. In pattern recognition, non-extensive entropy is particularly successful at solving image recognition problems that exhibit strongly correlated patterns, which has resulted in an extensive literature regarding its application to recognition tasks. Yet only a few works in literature have applied other concepts from non-extensive statistics to tackle such problems.

The objective of this work is to extend applications of non-extensive statistical mechanics in the computer science literature. Our focus involves practical uses of other non-extensive constructs, such as non-extensive distributions, to pattern recognition problems. In particular, we present two (published) articles that incorporate q-Gaussian transfer functions in order to improve classification of random neural networks. The remainder of this work is organized as follows: Chapter 2 introduces some of the formalism and theory behind non-extensive statistical mechanics to the reader. Chapter 3 reviews basic concepts and history in statistical pattern recognition, and provides background material on neural networks. Chapter 4 presents an overview of our contributions (articles) to the current work, while Chapter 5 summarizes the conclusions and indicates possible areas of interest for future work. Annex A and Annex B include the articles associated with this work.


2 Non-extensive Statistical Mechanics

In 1865 Clausius (11) set forth the concept of entropy, S, to explain the energy lost in early heat engines. This was done in the context of classical thermodynamics, without any reference to the microscopic world (12). Nearly a decade later, Boltzmann established a specific connection to the microscopic states of an ideal gas, which led to the famous expression $S \equiv k_B \ln W$, where $k_B$ is the Boltzmann constant and W is the total number of microstates available to the system. However, this formula was valid only in systems where each possible microscopic state is equally probable. Gibbs then proposed a generalization, $S \equiv -k_B \sum_i p_i \ln p_i$, for situations where states of the system may not have equal probabilities. Many scientists including Maxwell, Shannon, von Neumann, and others also provided significant contributions to statistical entropies and their connections to Clausius entropy. Einstein, for example, independently advanced his own ideas on the subject, publishing several articles in Annalen der Physik around the same period (13). The collection of concepts that were being developed at the turn of the twentieth century established the foundations for what is today known as statistical mechanics.

2.1 Background

In what follows, the reader is briefly introduced to fundamental concepts in statistical mechanics that are needed to understand this chapter. However, we only cover a few relevant ideas and defer a complete review of the theory to any general textbook in physics such as (14).

The goal of statistical mechanics is to describe a dynamical system through its microscopic constituents. One way to accomplish this task involves analyzing the phase space, in which all possible states of the system are represented. These states correspond to points in a multidimensional space, where each dimension represents the range of possible values that a particular variable in the system can take. Also, the phase space is used here in the same context as in classical mechanics. In particular, a system consisting of N particles, where each particle is associated with three position variables and three momentum variables, can be completely described by a point in a 6N-dimensional phase space (see Figure 1). Such a point in phase space is called a microstate, i.e. a specific microscopic configuration of the system. However, describing physical systems at a microscopic level is impractical because the number of particles is on the order of Avogadro's number, which makes the number of possible configurations (microstates) extremely large for any practical laboratory situation. Instead, the phase space can be represented by macroscopic properties of the system (e.g. temperature, pressure, volume, and density), where each point in phase space is correspondingly called a macrostate. More precisely, a macrostate is characterized as a particular set of microstates which have the same macroscopic properties, but represent different individual states of the system. Since there are many more microstates than macrostates, this allows for a much easier description of the physical system in terms of its phase space.

Figure 1 – Phase space portraits: (a) classical system with N particles represented as a microstate in a 6N-dimensional space; (b) ergodic (solid) and non-ergodic (dotted) trajectories. Source: created by author.

Another important concept is the time evolution of a dynamical system. The evolution is characterized as a continuous sequence of points (i.e. a trajectory) in phase space, which are defined according to dynamical equations of the system. These trajectories depart from a starting configuration of variables referred to as initial conditions of the system. However, initial conditions cannot in general be described exactly, which becomes an issue for systems that have a sensitive dependence on starting values. From a statistical perspective, initial conditions can instead be represented using a small volume rather than a single point in phase space. The initial volume contains a set of possible trajectories of the system, e.g. the microstates within a given macrostate, as a byproduct of the uncertainty in initial conditions. According to the Liouville theorem, this starting region should spread with time in such a way that its volume remains constant over the whole phase space, while its shape can change drastically. Notice that the above condition provides the essential requirement for the applicability of statistical mechanics. More precisely, this condition can be fulfilled through mixing, which occurs when there is a quick enough (exponential) divergence of trajectories from initial conditions that allows them to uniformly occupy the entire phase space. Krylov (15) also argued that the mixing property is much more
important than mere ergodicity, where trajectories evenly explore the accessible phase space in the limit of long times. On the other hand, non-ergodic (or non-mixing) systems only partially visit the accessible phase space, since trajectories are confined to a very small region and thus cannot be described within standard statistical mechanics. These two concepts are illustrated in Figure 1, where the solid line represents trajectories of an ergodic system and thus visits the entire phase space, while the dotted line represents trajectories of a non-ergodic system and explores only a small region of the phase space.

2.2 Theory

Boltzmann-Gibbs (BG) statistical mechanics plays a central role in statistical physics by providing an important bridge between microscopic laws and thermodynamics. It has been successfully applied in a number of thermodynamic problems for more than a century, and thus constitutes one of the pillars of modern physics. However, the BG theory is not universal and has a limited domain of applicability. Since this domain so far has no precise mathematical definition, we present below a rough description of what these physical restrictions might be, and refer the reader to (16) for a more detailed discussion. Krylov argued that mixing is a much more important condition for the applicability of statistical mechanics than ergodicity. The obvious question is what sort of mixing will guarantee that the BG theory works. An answer which seems appropriate suggests that BG statistical mechanics remains valid only in the presence of quick enough, exponential mixing, where short relaxation times and thermodynamic extensivity (i.e., thermodynamic properties such as entropy, potentials, and other macroscopic quantities are proportional to the number of elements in the system) exist. This situation corresponds to strong chaos, characterized by positive Lyapunov exponents. Yet many important phenomena in natural and artificial systems cannot be described within standard BG theory. The typical situation is when weak chaos takes over the microscopic dynamics and causes a slow, power law, mixing which can be associated with algebraic relaxations and thermodynamic nonextensivity. In the current context, weak chaos means an algebraic (not exponential) divergence of infinitesimally close trajectories in the phase space with time, and not in the sense used in theory of non-linear dynamical systems (17). This is particularly frequent in systems that emerge along the frontiers of order and disorder, where strong chaos is replaced by its weak version and the non-linear dynamics becomes asymptotically scale invariant.

Figure 2 – Scaling of entropy, S, for (a) locally correlated systems (W ∼ 2^N) and (b) globally correlated systems (W ∼ N^2), for several values of q. Source: created by author.

An important aspect in any physical system is the structure (or geometry) of the occupied phase space, which in turn depends on the microscopic dynamics and initial conditions: the microscopic dynamics informs where the system is allowed to live and the initial conditions reveal where it is likely to live in the explored space (18). We can then ask what geometrical structures will be responsible for exponential or algebraic mixing, and reflect the

domain where BG theory is valid. It follows that phase spaces with Euclidean geometry (i.e., spaces that are continuous and differentiable) typically have exponential mixing because the system can easily visit all of the allowed microstates with equal probability. On the other hand, phase spaces with multifractal or similar (hierarchical) structures are generally more difficult to occupy and can result in a mixing of the slower algebraic type. Thus, we find that the geometric structure of the phase space determines which statistical mechanical formalism can accommodate a connection to the microscopic information of the system. The only question that remains is what type of phenomena can be accurately described within BG statistical mechanics. It appears that the theory is applicable for systems with short-range interactions (i.e. spatially close correlations) and short-term memories (for example, Markov processes), where the phase space is smooth (Euclidean) and leads to exponential mixing, short relaxations, and extensive thermodynamics. In contrast, systems that are in some sense scale invariant, with long-range interactions or long-term memories, consist of multifractal phase spaces which are associated to power law mixing, long relaxations, and non-extensive thermodynamics (16). These systems fall outside the BG domain of validity and thus need a different formalism to describe them; we will come back to this concept later on. The overall picture is that BG statistical mechanics provides a microscopic description associated with Euclidean geometry (translational invariant in some sense), whereas there is a corresponding (anomalous) statistical mechanics with connections to multifractal structures (scaling invariant in some sense) (17).

Statistical mechanics can also be viewed as adopting a specific entropy functional that accommodates a connection of Clausius entropy with microscopic information of the system. In some sense, it provides us a way to obtain macroscopic properties of the
system without having to deal with mechanical laws at the microscopic level (18). One of the concepts that appears most naturally within classical thermodynamics is extensivity. Extensivity refers to the (asymptotic) proportionality of some property (e.g. entropy) to the amount of matter involved or the number of elements in a system (12). Properties like volume or mass are obviously extensive, where doubling the size of a system will certainly double its volume or mass, while intensive properties such as temperature remain unchanged with size. Clausius conception of entropy can also be treated as an extensive property, since the amount of disorder is proportional to the amount of matter in a system. This means that the particular form of our entropy functional must be extensive in order to agree with thermodynamic entropy for the phenomenon being observed. Boltzmann-Gibbs entropy, $S_{BG} \equiv k \ln W$ for the case of equiprobable microstates, is one example that satisfies Clausius prescription under certain conditions. In particular, extensivity of $S_{BG}$ can be trivially satisfied for statistically independent systems, i.e. the joint probability is $p_{1,2,\dots,N} = p_1 p_2 \dots p_N$, and it is verified asymptotically for systems with local interactions (correlations) (12). As an example, consider the toss of N identical coins, where each coin has two possible states – heads or tails – that can occur with equal probability. The total number of states W is found by counting all possible combinations of outcomes from the coin tosses. According to combinatorics, this number must be $W = 2^N$, which means that the entropy is given by $S_{BG} \equiv N k \ln 2$. Since $S_{BG}$ is proportional to N, we can claim that it is extensive for the coin toss problem (see Figure 2). Now consider the similar example of rolling N dice where each die has six possible states (faces). The total number of states is $W = 6^N$ and the entropy becomes $S_{BG} \equiv N k \ln 6$, which is also proportional to N. In both cases we find that entropy is extensive because the exponent N can be taken outside of the logarithm as a multiplier, which means that the mathematical form of $S_{BG}$ ensures it will remain extensive whenever the number of states grows exponentially with problem size. However, there are many more complex phenomena for which $S_{BG}$ is not extensive. These situations typically occur in systems with strong correlations, where the number of states no longer increases exponentially with the number of elements, but instead might follow a power law of the form $W \propto N^b$. Consider linguistics as an example, or more precisely the formation of sentences. Given any set of words, the number of allowed combinations to form a sentence is greatly reduced by grammar conventions, which produces strong correlations between certain words. In such situations, the entropy formula can be written as $S_{BG} = b k \ln N$ and is thus forced to be non-extensive. Figure 2 illustrates this concept for the case of $W \propto N^2$, where $S_{BG}$ is no longer proportional to N. The question that arises is whether we can devise an entropy functional that remains extensive and accurately describes the macroscopic behavior of strongly correlated (complex) systems. One theory which appears to at least partially answer this question is non-extensive statistical mechanics.


Figure 3 – Non-extensive generalization of (a) the exponential, $e_q(x)$, and (b) the entropy, $S_q$, for q = −1, 0, 1, 2. Source: created by author.

Non-extensive statistical mechanics generalizes BG statistical mechanics in order to deal with anomalous systems (e.g. scale invariant with multifractal phase spaces) that cannot be described otherwise. Tsallis (19) first proposed the theory in 1988, more than a century after Boltzmann and Gibbs, and was inspired by the scaling properties of multifractals. The basis of this non-extensive formalism is a specific entropic form, $S_q \equiv k \frac{W^{1-q}-1}{1-q}$, which generalizes the BG entropy using probabilities raised to some power q. In particular, this parameter q is called the entropic index, which is intimately related to the microscopic dynamics and characterizes the degree of non-extensivity in the system (20). The main idea behind $S_q$ is to present an extensive expression of entropy in situations where $S_{BG}$ has proven to be non-extensive. For example, consider a correlated system in which the number of microstates increases as a power law, $W \propto N^b$. It was shown earlier that $S_{BG}$ is non-extensive in this case because the logarithm makes its expression not proportional to size. $S_q$, however, contains an extra degree of freedom that can force entropy to be extensive. Our problem then reduces to one of finding which parameter value will most closely satisfy extensivity. In the example above, the new entropic form becomes $S_q \equiv k \frac{N^{b(1-q)}-1}{1-q}$, which retains extensivity whenever the exponent of N is unitary, namely when $q = 1 - \frac{1}{b}$. Figure 2 illustrates this situation for b = 2, where $S_q$ is extensive only if q = 1/2 (dashed line). As a result, we find that $S_q$ can be used to describe the microscopic behavior in many situations where $S_{BG}$ does not work.
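To make the scaling argument concrete, here is a minimal numerical sketch (Python, written for this discussion; the values of N, b, and k = 1 are illustrative assumptions) comparing how $S_{BG}$ and $S_q$ grow when the number of microstates follows the power law $W \propto N^b$, using the value $q = 1 - 1/b$ derived above.

```python
import numpy as np

def S_BG(W, k=1.0):
    """Boltzmann-Gibbs entropy for W equiprobable microstates: k ln W."""
    return k * np.log(W)

def S_q(W, q, k=1.0):
    """Tsallis entropy for W equiprobable microstates: k (W^(1-q) - 1)/(1 - q)."""
    if np.isclose(q, 1.0):
        return S_BG(W, k)
    return k * (W ** (1.0 - q) - 1.0) / (1.0 - q)

b = 2                      # microstates grow as a power law, W ~ N^b
q = 1.0 - 1.0 / b          # entropic index that restores extensivity (q = 1/2 here)
for N in (10, 100, 1000):
    W = float(N) ** b
    # S_BG grows only logarithmically in N, while S_q grows linearly (extensive).
    print(f"N={N:5d}  S_BG={S_BG(W):8.2f}  S_q={S_q(W, q):8.2f}")
```

Running the loop shows $S_{BG}$ saturating logarithmically while $S_q$ with $q = 1 - 1/b$ is proportional to N, which is precisely the extensivity restored by the extra degree of freedom.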

2.3 Formalism

There are several ways for proposing a new physical theory or generalizing an existing one, but ultimately they will often rely on some kind of metaphor (17). A possible
metaphor for deriving non-extensive statistical mechanics involves solving an ordinary differential equation of the general form

$$\frac{dy}{dx} = y^q \qquad (2.1)$$

where $q \in \mathbb{R}$ and the initial condition is $y(0) = 1$. The solution to the equation above (and its inverse) corresponds to the q-exponential (and q-logarithm) (21)

$$e_q(x) \equiv [1 + (1-q)x]^{\frac{1}{1-q}}, \qquad \ln_q(x) \equiv \frac{x^{1-q} - 1}{1-q} \qquad (2.2)$$

which are generalizations of the usual exponential and logarithmic functions (see Figure 3), reducing to their classical forms as $q \to 1$. However, the q-exponential is not an entirely new concept and has already been long known as a particular solution of the Bernoulli equation (22). Notice that the q-exponential is always taken to be positive (e.g. $[z]_+ = \max\{z, 0\}$), while the q-logarithm becomes undefined when x is less than zero. These two generalized expressions constitute the basis of non-extensive statistical mechanics; many constructs associated with BG statistical mechanics can be extended to a non-extensive framework by replacing the usual exponential and logarithm functions with their q-exponential and q-logarithmic counterparts. Most importantly, we can generalize the BG entropy as

$$S_q \equiv k \ln_q W = k\, \frac{W^{1-q} - 1}{1-q} \qquad (2.3)$$

in the case of equal probabilities (see Figure 3), and more commonly as

$$S_q \equiv k \sum_{i=1}^{W} p_i \ln_q \frac{1}{p_i} = k\, \frac{1 - \sum_{i=1}^{W} p_i^q}{q-1} \qquad (2.4)$$

where microstates have arbitrary probabilities. The continuous and the quantum expressions of $S_q$ are respectively given by

$$S_q \equiv \frac{1 - \int dx\, [p(x)]^q}{q-1} \qquad (2.5)$$

and

$$S_q \equiv \frac{1 - \mathrm{Tr}\,\rho^q}{q-1} \qquad (2.6)$$

where $\rho$ is the density matrix. However, we are primarily interested in its discrete form given the context of our current work.
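As a hands-on companion to Eqs. (2.2) and (2.4), the following sketch (a minimal Python illustration, not part of the original text; the probability vector is arbitrary) implements the q-exponential, the q-logarithm, and the discrete entropy $S_q$, and checks that they reduce to their standard counterparts as q → 1.

```python
import numpy as np

def exp_q(x, q):
    """q-exponential e_q(x) = [1 + (1-q)x]_+^(1/(1-q)); reduces to exp(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def log_q(x, q):
    """q-logarithm ln_q(x) = (x^(1-q) - 1)/(1-q), the inverse of e_q for x > 0."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q, k=1.0):
    """Discrete S_q of Eq. (2.4): k (1 - sum_i p_i^q)/(q - 1)."""
    p = np.asarray(p, dtype=float)
    if np.isclose(q, 1.0):
        return -k * np.sum(p * np.log(p))   # BG/Shannon limit
    return k * (1.0 - np.sum(p ** q)) / (q - 1.0)

p = np.array([0.5, 0.3, 0.2])
print(tsallis_entropy(p, 0.5), tsallis_entropy(p, 0.999), tsallis_entropy(p, 2.0))
print(np.allclose(log_q(exp_q(0.7, 1.5), 1.5), 0.7))   # ln_q inverts e_q
```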

It is important to make a distinction between non-extensive entropy and BG entropy. An example is additivity, which states that for two probabilistically independent systems A and B, the entropy of the composite system must coincide with the sum of
the entropies of the individual systems. Notice that this situation is quite different from the thermodynamic concept of extensivity. $S_{BG}$ is said to be additive,

$$S_{BG}(A + B) = S_{BG}(A) + S_{BG}(B), \qquad (2.7)$$

but $S_q$ has the pseudo-additive property

$$S_q(A + B) = S_q(A) + S_q(B) + (1-q)\,S_q(A)\,S_q(B), \qquad (2.8)$$

which is characterized as superadditive for q < 1, additive for q = 1, and subadditive for q > 1 (16). If A and B are correlated, however, then $S_{BG}$ is nonadditive but a value of q exists such that $S_q$ becomes strictly or asymptotically additive. Another property worth mentioning is the grouping axiom, arguably one of the most important and defining properties of Shannon entropy. It is often called strong additivity because it represents a more constrained version of the additive entropy. Suppose a system (set) of W microstates (possibilities) is arbitrarily partitioned into N nonintersecting subsets containing $W_1, \dots, W_N$ elements that must satisfy $\sum_{i=1}^{N} W_i = W$. Non-extensive entropy is defined on the entire set as

$$S_q(\{p_{j=1,\dots,W}\}) = S_q(\{P_{i=1,\dots,N}\}) + \sum_{i=1}^{N} P_i^q\, S_q(\{p_j / P_i\}) \qquad (2.9)$$

where $P_i = \sum_{j \in W_i} p_j$ is the sum of all probabilities in a given subset, and $\{p_j / P_i\}$ are the conditional probabilities. This relation reduces to the grouping axiom from Shannon theory in the limit $q \to 1$ and plays a central role in the generalization of thermostatistics. Since the probabilities $\{p_j\}$ are always between zero and one, raising their values to a power q will make them either increase (q < 1) or decrease (q > 1) in magnitude. Consequently, $S_q$ places a greater emphasis on rare events (i.e. microstates with smaller probabilities) for q < 1 and privileges more frequent events (i.e. microstates with higher probabilities) for q > 1. Also, the entropic index can be interpreted as a tuning parameter that determines which type of events, e.g. rare or frequent ones, will matter most in entropy calculations. This means that non-extensive entropy can be very important for describing correlated systems, where the distribution of event frequencies is usually distorted and only $q \neq 1$ will guarantee extensivity.
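The pseudo-additive property of Eq. (2.8) is easy to verify numerically. The short sketch below (an illustrative Python check with arbitrary probabilities and k = 1) composes two independent subsystems and confirms the identity for a few values of q.

```python
import numpy as np

def tsallis_entropy(p, q, k=1.0):
    """Discrete S_q (Eq. 2.4), with the BG limit at q = 1."""
    p = np.asarray(p, dtype=float)
    if np.isclose(q, 1.0):
        return -k * np.sum(p * np.log(p))
    return k * (1.0 - np.sum(p ** q)) / (q - 1.0)

# Two probabilistically independent subsystems: joint probabilities are products.
pA = np.array([0.6, 0.4])
pB = np.array([0.7, 0.2, 0.1])
joint = np.outer(pA, pB).ravel()

for q in (0.5, 1.0, 2.0):
    SA, SB = tsallis_entropy(pA, q), tsallis_entropy(pB, q)
    lhs = tsallis_entropy(joint, q)
    rhs = SA + SB + (1.0 - q) * SA * SB      # pseudo-additivity, Eq. (2.8)
    print(q, np.isclose(lhs, rhs))           # True for every q
```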

There are many contributions to the subject of entropies. The first statistical approach was the BG entropy, which provides a microscopic connection to thermodynamics and was later adopted in information theory (23). Since then, a vast array of entropic functionals (i.e. 20 to 25 different ones) have been established in areas outside of physics. In many situations, these entropies appear repeatedly throughout literature under different circumstances. Yet this independent rediscovering of phenomena, natural laws, and constructs happens more frequently in science than one might be inclined to believe. In particular, the non-extensive entropy had already been introduced long ago in
the cybernetics and information community. Havrda (24) proposed a structural entropy in 1967 and a year later Vajda (25) discussed its axioms, while Daroczy (26) rediscovered it in 1970 by means of information functions. However, it was Tsallis in his 1988 paper who first advanced a connection with thermodynamics, proposing a generalization of BG statistical mechanics with this new entropy functional. Another comparison worth making is between Rényi entropy (27) and non-extensive entropy. Rényi entropy is also a generalization of the BG entropy which relies on multifractals, but violates several thermodynamic properties that are satisfied under non-extensive entropy. Lastly, we find that unlike other statistical entropies, $S_q$ is the source of many mathematical generalizations that appear throughout literature.

Non-extensive statistical mechanics is responsible for extending a diverse collection of mathematical constructs. An entire generalized algebra (28, 29) has emerged with deformed expressions of basic operations such as addition, subtraction, product, and division. Similarly, a non-extensive calculus (29) was proposed with deformed derivatives and integrals. These generalized constructs derive from the q-exponential and q-logarithm functions and have stimulated similar developments in number theory, the central limit theorem (CLT), combinatorial analysis, among others (30). Other contributions include theoretical generalizations of the Laplace (31) and Fourier (32) transforms. Table 1 lists some of the common non-extensive constructs that have emerged in literature.

Table 1 – Non-extensive (mathematical) constructs.

q-sum: $x \oplus_q y = x + y + (1-q)\,xy$
q-subtraction: $x \ominus_q y = \dfrac{x - y}{1 + (1-q)\,y}$
q-product: $x \otimes_q y = \left[x^{1-q} + y^{1-q} - 1\right]_+^{\frac{1}{1-q}}$
q-division: $x \oslash_q y = \left[x^{1-q} - y^{1-q} + 1\right]_+^{\frac{1}{1-q}}$
q-derivative: $D_q f(x) = [1 + (1-q)x]\,\dfrac{df(x)}{dx}$
q-integral: $\int_q f(x)\, d_q x = \int \dfrac{f(x)}{1 + (1-q)x}\, dx$
q-Fourier transform: $F_q[f(x)](w) = \int_{-\infty}^{\infty} f(x)\, e_q^{\,iwx[f(x)]^{q-1}}\, dx$
q-Laplace transform: $L_q[f(t)](s) = \int_{0}^{\infty} f(t)\, e_q^{-st}\, dt$
q-Exponential distribution: $p_E(x) = \dfrac{1}{Z_{q,\lambda}}\, e_q^{-x/\lambda}$
q-Gaussian distribution: $p_G(x) = \dfrac{1}{Z_{q,\beta}}\, e_q^{-\beta x^2}$
q-Weibull distribution: $p_W(x) = \dfrac{1}{Z_{q,\lambda,\kappa}} \left(\dfrac{x}{\lambda}\right)^{\kappa-1} e_q^{-(x/\lambda)^{\kappa}}$


Perhaps the most important of these are non-extensive distributions, also called q-distributions, which can be obtained by maximizing $S_q$ under certain constraints or by simply replacing exponentials with q-exponentials in an existing distribution. Common non-extensive distributions include the q-Exponential, q-Gaussian, and q-Weibull, which have been applied in areas such as astronomy, chemistry, finance, geology, and physics (33). However, these distributions remain largely unknown to the pattern recognition community, where only a small selection of works have applied them to neural networks (34, 35, 36, 37) and mixture models (38).

Gaussian distributions play a central role in BG statistical mechanics, so we can only assume that q-Gaussians will have an equal or greater importance in the non-extensive framework. The latter distribution is conjectured to be an attractor in the CLT sense when the random values being summed are allowed to be strongly correlated, it can describe probabilistic models in the limit $N \to \infty$, and it provides exact solutions of non-linear homogeneous and linear inhomogeneous equations (39). From a differential equations perspective, one way to derive the non-extensive distribution is by solving

$$\frac{dy}{dx} = \rho\, y^q \qquad (2.10)$$

where $\rho \propto x$ and the solution is a q-Gaussian of the form

$$p(x) = \frac{e_q^{-\beta x^2}}{\int dx\; e_q^{-\beta x^2}} \qquad (2.11)$$

with $\beta > 0$ and the entropic (scaling) parameter q. In particular, the q-Gaussian has finite support for q < 1 and infinite support for q ≥ 1, but remains normalizable only when q < 3. The tails of these distributions asymptotically decay as power laws for q > 1, while the variance is finite when q < 5/3 and diverges when q ≥ 5/3. Also, we discover that the limiting cases q = 1 and q = 2 recover the Gaussian and Cauchy distributions, respectively. It is important to mention that q-Gaussians are not unique to non-extensive statistics, but have appeared in other situations as well. These distributions have long been known in plasma physics as suprathermal or κ distributions (40) for q > 1 and are equivalent to Student t-distributions (41) in heavy-tail regions (q > 1); they are also sometimes referred to as generalized Lorentzians (42).
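A small sketch of Eq. (2.11) is given below (Python; the grid-based numerical normalization and the parameter values are assumptions made for illustration, and the grid would need to be widened as q approaches 3 because of the heavy tails). The special cases q = 1 and q = 2 reproduce the Gaussian and Cauchy shapes mentioned above, while q < 1 yields compact support.

```python
import numpy as np

def exp_q(x, q):
    """q-exponential with the positive-part cutoff [.]_+."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_gaussian_pdf(x, q, beta=1.0, grid_half_width=50.0):
    """q-Gaussian density of Eq. (2.11), normalized numerically (valid for q < 3)."""
    grid = np.linspace(-grid_half_width, grid_half_width, 200001)
    Z = np.trapz(exp_q(-beta * grid ** 2, q), grid)     # numerical normalization
    return exp_q(-beta * x ** 2, q) / Z

x = np.linspace(-3.0, 3.0, 7)
print(q_gaussian_pdf(x, q=1.0))   # recovers the Gaussian
print(q_gaussian_pdf(x, q=2.0))   # recovers the Cauchy (power-law tails)
print(q_gaussian_pdf(x, q=0.5))   # compact support: zero for |x| > 1/sqrt(beta*(1-q))
```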

2.4 Applications

Non-extensive concepts have advanced a considerable number of applications and connections in the literature. They concern problems in areas such as physics, astrophysics, geophysics, chemistry, biology, mathematics, economics, linguistics, engineering, medicine, physiology, cognitive psychology, sports, and others (43). Some of these applications present strong experimental and theoretical evidence, while others are phenomenological observations obtained through data fitting. In such cases, they normally have some underlying
non-extensive character that describes the system, such as long-range interactions (e.g. gravitation and Coulomb forces) or long-range correlations (e.g. vicinity of second-order phase transition or self-organized criticality). However, there are many applications with no physical interpretation which consist of improved algorithms meant for optimization, recognition, among other tasks. We present an overview of applications and connections to the non-extensive formalism that have been explored in literature. Since an exhaustive list containing all applications is not possible due to space limitations, the reader is referred to more complete reviews in (18, 17).

2.4.1 Natural Sciences

Applications which adopt non-extensive concepts can be found in many areas of natural science. In the context of physics, connections have been established with phenomena arising in astrophysics and cosmology (44), cold atoms (45), high energy physics (46), turbulence (47), plasma (48), condensed matter (49), granular matter (50), quantum chaos (51), quantum entanglement (52), and many others. Chemistry involves applications of the non-extensive formalism to the Arrhenius law (53), models for chemical reactions and growths (54), folded proteins (55), and the ground state energy of chemical elements (56). Biology, however, is mainly concerned with providing connections to biological evolution (57) and imitation games (58). In geophysics, non-extensive concepts are mostly used to study earthquakes (59), but also appear in analysis of clouds (60), volcanoes (61), and oceanic temperatures (62).

2.4.2 Social Sciences

Non-extensive formalism is present in a large number of problems from the social sciences. In economics, many works have applied these concepts to price and volume returns (63), option pricing (64), volatilities (65), and risk aversion (66). Another connection occurs between scale free networks and non-extensive statistical mechanics (67), where several growing (68) and nongrowing (69) models have been studied. Linguistics involves applications that describe the frequency-rank distribution of words in a variety of literary texts (70). Non-extensive behavior also emerges in many other social phenomena such as publication density (publications per citation) (71), city populations (72), train delays (73), and football dynamics (74).

2.4.3 Computer Sciences

We can also discuss applications of non-extensive concepts from a computer science perspective. Many optimization algorithms have been generalized using non-extensive theory, such as gradient-based methods (75, 76), particle swarm optimization (77), genetic algorithms (78), and simulated annealing (79). Generalized simulated annealing is a
particularly successful approach and can be applied to a variety of problems including traveling salesman (80), curve fitting (81), quantum chemistry (82), gravity models (83), spin systems (84), protein folding (85), etc. Non-extensive applications of time series and signal analysis are also common in computer science. Some of these applications appear in brain activity (86), heart activity (87), complexity estimation (88), and self-similar time series (89). However, most of the existing literature is concerned with the analysis of images (i.e. image processing). In particular, non-extensive entropy has been widely used in problems related to segmentation (90), edge detection (91), facial recognition (92), and many more. Another set of applications involves using non-extensive distributions in neural networks (34, 35, 36, 37).


3 Pattern Recognition

3.1 Statistical Pattern Recognition

Pattern recognition is the study of how machines can observe the environment, learn to identify patterns of interest, and make reasonable decisions from these patterns (2). It involves the recognition of regularities in the data for a number of tasks (e.g. classification and clustering) that are important in areas such as biology, computer science, finance, medicine, psychology, and statistics. In statistical pattern recognition, each pattern is represented as a set of d features that can be viewed as a d-dimensional feature vector, or a point in a d-dimensional (input) space. These features correspond to properties or characteristics of the phenomenon being observed, such as pixels of an image or statistical measures of a time series. The primary objective of statistical pattern recognition is to assign an input pattern into an object or class based on features present in the data. More precisely, this can be interpreted as partitioning the input space into compact and disjoint regions of patterns belonging to the same class. Note that the problem here is presented in the context of classification, which has been most commonly used in recognition tasks. From a different perspective, pattern recognition problems are formulated as a set of mappings from input variables (features) to output variables (classes). These mappings can be modeled as mathematical functions containing a number of adjustable parameters whose values are determined from the data (93). The goal of pattern recognition is then to determine the functional forms of such mappings (and their parameters) which will construct the best achievable solution. Although we do not here introduce all the concepts completely, we refer the reader to (2, 93) for more detailed reviews of the corresponding formalism.

Statistical pattern recognition can also be carried out using machine learning techniques in order to achieve far better results. Machine learning explores new algorithms for learning and predicting from the data, but is closely related to pattern recognition since it engages in many of the same problems. The two subjects can be viewed as different approaches to the same discipline, where pattern recognition has its origins in engineering and machine learning grew out of computer science (94). However, machine learning can also be applied to problems other than pattern recognition, while pattern recognition does not necessarily need to solve problems only using techniques from machine learning. Given a pattern, its recognition may consist of one of the following machine learning tasks: 1) supervised learning (e.g. classification), in which a known dataset is used to make predictions and the goal is to learn a general rule that maps input patterns to predefined classes, and 2) unsupervised learning (e.g. clustering), in which the input patterns are unlabeled and the algorithm
attempts to find hidden structures or groupings in the data to represent the unknown classes.

In more than 50 years of research, pattern recognition has covered scientific advances across a variety of areas, and remains an ever growing field due to the recent success of deep learning in applications such as image recognition, language processing, and self-driving cars. The first significant appearance of pattern recognition was in the 1960s, when the academic community witnessed an explosion in pattern recognition literature, with more than 500 papers and about half a dozen books published in the span of only a few years (2). Among the most important ones stand out the earliest textbooks on the subject (95, 96), which take deterministic and statistical approaches to discuss many of the fundamental concepts, along with survey papers written by Nagy (97) and Kanal (98) to describe the current state of the field in 1968 and 1974, respectively. Nagy revealed that early roots of pattern recognition have already been established in other areas such as statistical decision theory (99), automata theory (100), and information theory (101), and introduced a number of potential applications. Kanal, on the other hand, placed more emphasis on modeling and design than applications, and believed an early motivation for working on pattern recognition stemmed from biologically inspired learning, namely the perceptron and other 1960 vintage networks (98). Yet the field of pattern recognition has become so large that it is nearly impossible to cover everything from the last 50 years. For this reason, we only review concepts and applications that are relevant in the current context, and rely on other works found in literature to fill the gaps in any missing information.

Pattern recognition problems could initially only be solved using discriminant analysis (e.g. linear or quadratic) or nonparametric approaches such as clustering (e.g. k-nearest neighbors) and density estimation (e.g. Parzen windows) (102). Active research into perceptrons was also being carried out for pattern recognition, but the stagnation of neural network research throughout the 1960s and 1970s shifted most attention to knowledge-based systems. These expert systems quickly became mainstream because their design was simple and they made advances in solving problems that had so far resisted the artificial intelligence community. Thus, an entire industry emerged in support of expert systems, including hardware and software companies, where many corporations adopted them for pattern recognition tasks such as medical diagnosis and natural language understanding (103). Traditional classification methods and expert systems remained popular until the late 1980s, when almost all funding for AI research was cut because of the lack of progress and unrealistic expectations that failed to materialize (104). However, the revival of neural networks spurred a widespread adoption of machine learning for pattern recognition. Neural networks then became increasingly common for solving problems in pattern recognition, and today are widely used in many aspects of modern society. Parallel to developments in neural networks, mixture models have carved a rich history in the (unsupervised) statistical branch of pattern recognition. Mixture models were initially
studied by Pearson (105) in 1894 and then revisited in a number of seminal papers during the 1960s. Despite this long history, genuine interest in applying mixture models to pattern recognition only materialized with the introduction of expectation maximization (106) in 1977 for maximum likelihood estimation. Since then, the rapid growth of available computing power has enabled faster processing of large problems and resulted in the widespread use of mixture models over the last few decades (107).

Pattern recognition plays an important role in many aspects of modern society: from web searches to content filtering, biometric analysis to self-driving cars, and weather forecasting to clinical diagnosis. It can be used to identify objects in images (108), transcribe speech into text (109), recommend movies to watch (110), play a round of poker (111), detect weather patterns (112), predict financial market activity (113), perceive human emotions (114), analyze particle accelerator data (115), discover stars and cosmic rays (116), extract patterns in “big data” (117), among many others. However, emerging applications are becoming more challenging and computationally demanding every day, which has spurred a renewed interest in the area of pattern recognition. A common characteristic in a number of these applications involves borrowing concepts from other disciplines to improve recognition. Many works have exploited different forms of entropy functionals, since pattern recognition algorithms can often be expressed in terms of entropy minimization (118). One such example is the non-extensive entropy, Sq, which has proven particularly good at solving problems with patterns that have long-range interactions (correlations) or exhibit multifractal structures. Albuquerque (90) was the first to introduce non-extensive entropy for image segmentation, where he used the entropic parameter to characterize correlations between pixel intensities in an object of interest. Since then, an extensive body of literature has emerged with applications of non-extensive entropy in many areas of image recognition: image thresholding (90,119, 120, 121, 122,123,124, 125, 126, 127,128,129), edge detection (91), MRI analysis (130,131), image registration (132,133,134,135,136), fuzzy methods (137, 138, 139), texture descriptors (140), classification (141, 142, 143), and facial recognition (92, 144). Non-extensive statistical mechanics has also appeared in other pattern recognition tasks such as wavelets (145, 146, 147, 148, 149, 150, 151), biometric identification (152), texture classification (153, 154), mixture models (38), and others. Nevertheless, due to space limitations, the above does not represent a complete list of all available applications that use non-extensive entropy.
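To illustrate the flavor of these entropic segmentation methods, the sketch below selects a binarization threshold by maximizing the pseudo-additively combined q-entropies of the background and foreground histograms, in the spirit of Albuquerque's approach (90). It is a simplified illustration rather than a reproduction of any cited implementation; the choice q = 0.8 and the synthetic bimodal data are arbitrary assumptions.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Discrete S_q with k = 1, ignoring empty histogram bins."""
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def tsallis_threshold(pixels, q=0.8, levels=256):
    """Return the gray level maximizing the pseudo-additive combination of S_q's."""
    hist, _ = np.histogram(pixels, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    best_t, best_score = 1, -np.inf
    for t in range(1, levels - 1):
        pA, pB = p[:t].sum(), p[t:].sum()
        if pA == 0.0 or pB == 0.0:
            continue
        SA = tsallis_entropy(p[:t] / pA, q)          # background distribution
        SB = tsallis_entropy(p[t:] / pB, q)          # foreground distribution
        score = SA + SB + (1.0 - q) * SA * SB        # pseudo-additive combination
        if score > best_score:
            best_t, best_score = t, score
    return best_t

rng = np.random.default_rng(0)
pixels = np.concatenate([rng.normal(80, 10, 2000), rng.normal(170, 10, 2000)])
print(tsallis_threshold(np.clip(pixels, 0, 255)))    # lies between the two modes
```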

3.2 Neural networks

Conventional techniques in statistical pattern recognition face severe limitations when extended to problems with many dimensions, because the number of parameters grows drastically as the dimensionality of the input space increases. Yet many practical applications have features that are significantly correlated, which leads to only partial occupation of the accessible input space. The number of model parameters should only need to depend on the complexity of the problem, and not necessarily on the number of features present in the data. Although there are many different approaches to handle the problem of dimensionality scaling, an important contribution from neural networks is the manner in which they manage to solve it (93).

Figure 4 – Schematic diagram of neural network architectures: (a) shallow neural network; (b) deep neural network. Source: created by author.

Neural networks are computational models, based on the biological brain, that are capable of solving complex tasks. These models are generally described by structures of interconnected neurons which exchange information between each other. However, they can also be viewed as networks of weighted directed graphs in which nodes represent artificial neurons and edges provide connections between different neurons (2). The main characteristic of neural networks is their ability to describe complex non-linear input-output mappings within a very powerful and general framework. In particular, mapping functions can be expressed in terms of linear combinations of basis functions, also called activation or transfer functions, which are themselves adapted to the data. The process of adjusting network parameters (bias and connection weights) according to patterns in the data is called learning, which is an important and desired factor in most pattern recognition tasks. Learning from a set of patterns allows more compact representations to be formed, such that the model only grows with complexity of the problem and not with dimensionality of the input space.

The most commonly used neural network architectures are feedforward networks, which include multilayer perceptrons and radial basis function networks. These are organized into multiple layers of neurons, where each neuron is connected to all neurons in the following layer, and the information moves in the forward direction (see Figure 4). Shallow networks are composed of only one hidden layer and have been very common throughout the literature because of their simple architecture and universal approximation ability (155), i.e. they are able to learn arbitrary decision boundaries. On the other hand, deep neural networks are becoming increasingly popular for solving many complex problems.

There are many different ways to interpret the hidden layers in a neural network. Neurons in any hidden layer can represent non-linear decision boundaries formed in the input space, where the first layer divides the original space into halfspaces using hyperplanes, the second layer intersects these halfspaces to form convex regions, and deeper layers construct more complex decision regions. Another interpretation is that each hidden layer maps the input data into a non-linear feature space, typically of lower dimensionality than the input space. The goal is then to find a set of basis functions which can transform the problem into a space where it becomes linearly separable. Figure 5 illustrates both of these interpretations with a simple example. Suppose we want a neural network to learn to classify any point in a plane as belonging to either the red or the blue (parabolic) curve shown in the left panel. Once the network has been trained, it will form two decision regions, where points falling inside the blue region will belong to the blue curve and points within the red region will belong to the red curve. When we look at the input layer (left panel), we find that the hidden neurons form a complex non-linear decision boundary between the two curves. At the hidden layer (right panel), on the other hand, each dimension corresponds to the output of a neuron, and the data becomes linearly separable. In the context of deep learning, hidden layers can also be viewed as multiple levels of representation over the data, with higher levels representing more abstract concepts. An image, for example, is entered into the network as an array of pixel values, where the first layer of abstraction forms basic features such as edges, the second layer combines edges into parts of particular shapes, and deeper layers detect objects as combinations of these parts (3).

Figure 5 – Neural network representation of (a) the input space, showing non-linear decision boundaries, and (b) the hidden space, where the regions become linearly separable. Source: created by author.

Another model that has gained widespread attention is the random neural network (156), which under the pseudonym extreme learning machine (157) has inadvertently caused several heated discussions regarding its true origins (158, 159). Random neural networks are shallow architectures where parameters in the hidden layer are randomly assigned and output weights are determined by solving a linear system of equations. The reasoning behind random choices derives from the fact that neural networks have a redundancy of solutions in the parameter space, so that one can easily find parameter configurations which result in statistically acceptable solutions to the classification task (156). Since these solutions are rather insensitive to the hidden layer parameters, optimal classifications can be achieved even with random values. From a different perspective, random neural networks can be understood as performing random transformations on the input data, where the resulting linear problem is solved directly (e.g. by matrix inversion). The hidden layer constructs a feature space composed of random features which can make pattern classes linearly separable in a number of recognition tasks. Although learned features are generally more powerful and can construct better representations in deep neural networks, random features can work surprisingly well for shallow models. As a result, the main advantage of random neural networks is that they can be trained much faster than with most learning algorithms, such as backpropagation, while obtaining similar or even better classification performance compared to other shallow networks. It is also important to emphasize that random representations can hardly be considered a novel concept and have already been explored in many other disciplines. In nuclear physics, for example, random matrices originated more than 60 years ago to explain the energy levels of complex nuclei (160), and have since then appeared in areas such as finance, neuroscience, number theory, and statistics. Random representations have also been used for problems in pattern recognition and machine learning in the context of bootstrap aggregating (161), random features (162), random subspaces (163), and random weights (156).
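To make this training scheme concrete, a minimal sketch is given below: the hidden weights and biases are drawn at random and kept fixed, and only the output weights are fitted by a regularized least-squares solution of the resulting linear system. The sigmoidal hidden layer, the ridge regularization, and all names are illustrative assumptions; the sketch is not a reproduction of the exact models used in the articles.

    import numpy as np

    def train_random_network(X, T, n_hidden=50, reg=1e-6, seed=None):
        # X: (n_samples, n_features) inputs; T: (n_samples, n_outputs) targets (e.g. one-hot labels).
        # Hidden weights and biases are random and fixed; only output weights are learned.
        rng = np.random.default_rng(seed)
        W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input-to-hidden weights
        b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # hidden layer outputs (sigmoid)
        # Output weights: regularized least-squares solution of the linear system H beta = T
        beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
        return W, b, beta

    def predict(X, W, b, beta):
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        return (H @ beta).argmax(axis=1)  # predicted class per sample

    # Toy usage: two Gaussian blobs with one-hot targets
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 0.3, (50, 2)), rng.normal(1.0, 0.3, (50, 2))])
    T = np.repeat(np.eye(2), 50, axis=0)
    W, b, beta = train_random_network(X, T, n_hidden=20, seed=0)
    print("training accuracy:", np.mean(predict(X, W, b, beta) == T.argmax(axis=1)))

In practice the linear system is often solved via the Moore-Penrose pseudo-inverse; the ridge term here merely keeps the toy example numerically stable.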

Since shallow networks consist of only a single hidden layer, transfer functions play an important role in constructing accurate decision boundaries, or equivalently, in making the feature space linearly separable. Many different transfer functions have emerged in the literature (164, 165, 166), where neural networks have almost exclusively used sigmoidal or Gaussian functions. Although such networks have been shown to be universal approximators (167, 168), they do not always provide an optimal choice, since some data distributions are easier to handle with certain transfer functions. A classification problem with spherical pattern distributions, for example, can be trivially solved using a single Gaussian function but may need several hyperplanes arising from sigmoidal functions to approximate the same decision boundaries. On the other hand, pattern classes with hyperplanar separations are easily estimated using sigmoidal functions, whereas Gaussian approximation can be difficult and usually requires many ellipsoids (164). However, many problems in pattern recognition consist of even more complicated distributions where neither sigmoidal nor Gaussian functions are flexible enough to describe the decision boundaries. In such situations, more complex transfer functions are used: rational functions (169), which perform well in many real-world problems; conic section functions (170), which generalize sigmoidal and Gaussian-like functions; circular transfer functions (171), which retain both surface-based and prototype-based paradigms; and rectified linear units (3), which are the most popular non-linear functions used in deep neural networks.

Neural networks are most commonly used to solve problems in pattern recognition, primarily because of their natural ability to adapt themselves to the data and their limited need for domain-specific knowledge. Recently, deep learning has made a big impact in academia and industry, with major advances in state-of-the-art recognition tasks such as self-driving cars and speech recognition (3). Although learned features in deep neural networks are generally more powerful than random features in shallow models, the latter have been shown to be capable of achieving good recognition rates in a number of real-world problems. More precisely, random neural networks have gained increasing interest in several research areas and have been applied to a vast array of applications ranging from biomedical engineering to computer vision, system modeling and prediction to chemical processes, and fault detection to analysis of time series (172). This shows that random features play an important role in pattern recognition and should not be discarded in favor of abstract representations that are learned from the data. In connection with statistical mechanics, several works have borrowed concepts from the non-extensive formalism and applied them to neural networks. Cannas (173) was the first to propose a generalization of the perceptron learning rule using non-extensive entropy. Other works (75, 76, 174, 175) soon followed by introducing several learning algorithms based on non-extensive statistical mechanics for training neural networks. Another line of research that has emerged more recently in the literature includes works (e.g. (34, 35, 36, 37)) that apply non-extensive distributions as transfer functions in shallow neural networks. This approach allows the hidden layer to construct a more representative feature space, where the problem can become linearly separable for different values of the entropic parameter.
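To make the shape of such non-extensive transfer functions concrete, a minimal sketch of a q-Gaussian is given below. It assumes the usual definition in terms of the q-exponential, exp_q(u) = [1 + (1 - q)u]_+^{1/(1-q)}; the exact parameterization (normalization, width, and how the function is applied to the hidden layer activations) used in the articles may differ.

    import numpy as np

    def q_exponential(u, q):
        # exp_q(u) = [1 + (1 - q) * u]_+ ** (1 / (1 - q)); recovers exp(u) as q -> 1
        if np.isclose(q, 1.0):
            return np.exp(u)
        return np.maximum(1.0 + (1.0 - q) * u, 0.0) ** (1.0 / (1.0 - q))

    def q_gaussian(x, q=1.5, beta=1.0):
        # unnormalized q-Gaussian transfer function: exp_q(-beta * x^2)
        return q_exponential(-beta * np.asarray(x) ** 2, q)

    # q < 1 gives compact support (low-order polynomial shapes, e.g. q = 0 or 0.5),
    # q = 1 recovers the ordinary Gaussian, and q > 1 produces heavy tails.
    x = np.linspace(-3.0, 3.0, 7)
    for q in (0.0, 0.5, 1.0, 2.0):
        print(q, np.round(q_gaussian(x, q=q), 3))

Varying q therefore changes the geometry of the decision regions that a single hidden neuron can carve out, which is precisely the flexibility exploited in the contributions described in the next chapter.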


4 Contribution

The literature on pattern recognition is vast and scattered with many applications of non-extensive statistical mechanics. However, most work has been done using the non-extensive entropy, and little can be found on practical uses of other non-extensive constructs. The main objective of this dissertation is to employ these constructs for solving problems which involve the recognition of difficult patterns. In particular, we are interested in applying non-extensive distributions, namely the q-Gaussian, to existing methods in neural networks. Our contributions are presented as research articles (176, 177), which reflect the bulk of our work. These articles have been published in international peer-reviewed journals and can be found in the appendix for the reader’s convenience. The remainder of this chapter briefly reviews what has been done in each paper.

Appendix A presents our contributions to the area of neural networks coupled with non-extensive statistics. It focuses on improving classification using random neural networks, which have become increasingly popular in pattern recognition. Since these are shallow networks, transfer functions play a vital role in achieving good recognition rates. The work in question introduces q-Gaussians as transfer functions in random neural networks, characterized by low-order polynomials for q = 0, 0.5 and heavy tails for q > 1. This approach allows us to construct more flexible decision boundaries (e.g. of different shapes) by varying the entropic parameter. Another possible interpretation is that non-extensive transfer functions can capture long-range correlations between patterns belonging to the same class. Our results indicate that q-Gaussians are superior to many traditional functions (such as Gaussian and sigmoidal) in a number of different situations. Some works (34, 35, 36, 37) have also applied non-extensive transfer functions to other types of shallow networks; however, random neural networks are much faster to train and can build more powerful (random) representations. In particular, we find that our model outperforms many classical methods such as multilayer perceptrons, naive Bayes, and support vector machines.

Appendix B extends our previous work to ensembles, i.e. combinations of such classifiers, which are intuitively appealing because they can construct more complex decision boundaries than individual classifiers. The main motivation is that ensembles of random neural networks can solve more complex problems but are still relatively quick to train. Our paper introduces such an ensemble with q-Gaussian transfer functions, and combines individual decisions with majority voting. The advantage of this approach is that each classifier can form different transfer functions, by varying the entropic parameter, which prevents a poorly selected function from dominating ensemble performance. We find that our model performs better than other random neural network ensembles, such as those based on bagging, pruning, and voting, and conclude that this happens mainly because non-extensive classifiers are more accurate than traditional ones whose transfer functions are Gaussian or sigmoidal.
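For completeness, a generic sketch of the majority voting rule is shown below; it illustrates only the combination step, with hypothetical labels, and not the specific ensemble construction (number of members, how the entropic parameter is sampled, or tie-breaking) reported in the article.

    import numpy as np

    def majority_vote(predictions):
        # predictions: (n_classifiers, n_samples) array of integer class labels.
        # Returns the most frequent label per sample (ties broken by the lowest label).
        predictions = np.asarray(predictions)
        n_classes = predictions.max() + 1
        votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
        return votes.argmax(axis=0)

    # Example: three classifiers (e.g. trained with different values of q) voting on four samples
    preds = [[0, 1, 1, 2],
             [0, 1, 2, 2],
             [1, 1, 2, 0]]
    print(majority_vote(preds))  # -> [0 1 2 2]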

Referências
