
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE

CENTRO DE TECNOLOGIA

GRADUATE PROGRAM IN ELECTRICAL AND COMPUTING ENGINEERING

A methodology for detection of causal relationships between discrete time series on systems

Rute Souza de Abreu

Advisor: Prof. DSc. Luiz Affonso Henderson Guedes de Oliveira

Master's Dissertation presented to the Graduate Program in Electrical and Computing Engineering (area of knowledge: Computing Engineering) as part of the requirements for obtaining the title of Master of Science.


Abreu, Rute Souza de.

Uma metodologia para detecção de relações de causalidade entre séries de tempo discreto em sistemas / Rute Souza de Abreu. - 2019.

64 f.: il.

Dissertação (mestrado) - Universidade Federal do Rio Grande do Norte, Centro de Tecnologia, Programa de Pós-Graduação em Engenharia Elétrica e de Computação. Natal, RN, 2019.

Orientador: Prof. Dr. Luiz Affonso Henderson Guedes de Oliveira.

1. Detecção de Relações de Causalidade - Dissertação. 2. Transferência de Entropia - Dissertação. 3. Algoritmo K2 - Dissertação. 4. Redes Bayesianas - Dissertação. 5. Aprendizagem Estrutural - Dissertação. I. Oliveira, Luiz Affonso Henderson Guedes de. II. Título.

RN/UF/BCZM CDU 004(043.3)

Universidade Federal do Rio Grande do Norte - UFRN Sistema de Bibliotecas - SISBI

Catalogação de Publicação na Fonte. UFRN - Biblioteca Central Zila Mamede


Abstract

The need for detecting causality relations between processes, events, or variables is present in many areas of knowledge, e.g., distributed computing, the stock market, industry, and the medical sector. This occurs because knowledge of these relations can often be helpful in solving a variety of problems, for example, maintaining the consistency of replicated databases when writing distributed algorithms, or optimizing the purchase and sale of stocks in the stock market. In this context, this dissertation proposes a new methodology for detecting causality relations in systems by using information criteria and Bayesian networks to generate the most probable structure of connections between discrete time series. Modeling the system as a directed graph, in which the nodes are the discrete time series and the edges represent the relations, the main idea of this work is to detect causality relations between the nodes. This detection is made using the method of transfer entropy, which quantifies the information transferred between two variables, and the K2 algorithm, a heuristic method whose objective is to find the most probable belief-network structure given a data set. Because K2 depends on the premise of having a prior structure that defines the hierarchy among the network nodes, the methodology proposes creating this prior ordering on the nodes considering direct and indirect relations, and modeling these relations according to the lag between cause and effect. In addition, knowing that the K2 algorithm considers that each case of the data set occurs simultaneously, the proposed methodology modifies the original algorithm by inserting the dynamics of these lags into it. This modification provides a mechanism for comparing direct and indirect causality relations regarding their contribution to the structure. As a result, a graph of causality relations between the series is obtained, with the relations' lags made explicit.

Keywords: Detection of Causality Relations, Transfer Entropy, K2 Algorithm, Bayesian Networks


Resumo

A necessidade de detectar relações de causalidade entre processos, eventos ou variáveis está presente em diversas áreas do conhecimento, por exemplo, computação distribuída, mercado de ações, indústria, medicina, etc. Isso ocorre porque a identificação dessas relações pode, muitas vezes, ser útil na solução de diversos problemas. Por exemplo, manter a consistência de bancos de dados replicados ao escrever algoritmos distribuídos ou otimizar a compra e venda de ações no mercado financeiro. Neste contexto, esta dissertação propõe uma nova metodologia para detecção de relações de causalidade em sistemas utilizando critérios de informação e redes Bayesianas para gerar a estrutura de conexões mais provável entre séries temporais de tempo discreto. Modelando o sistema como um grafo, no qual os nós são as séries temporais discretas e as arestas representam as relações, a ideia principal deste trabalho é detectar relações de causalidade entre os nós. Essa detecção é feita usando o método de transferência de entropia, que é um método para quantificar a transferência de informação entre duas variáveis, e o algoritmo K2, um método heurístico cujo objetivo é encontrar a estrutura de rede Bayesiana mais provável, dado um conjunto de dados. Porque o K2 depende da premissa de ter uma estrutura prévia que define a hierarquia entre os nós da rede, é proposto na metodologia a criação desta pré-ordem considerando as relações diretas e indiretas, e a modelagem destas de acordo com o atraso entre causa e efeito. Além disso, sabendo que o algoritmo K2 considera que cada instância do conjunto de dados ocorre simultaneamente, a metodologia proposta modifica o algoritmo original inserindo nele a dinâmica desses atrasos. Esta modificação provê um mecanismo para comparar as relações de causalidade direta e indireta em relação à contribuição destas para a estrutura da rede. Como resultado obtém-se um grafo que representa relações de causalidade entre as séries, com os atrasos das relações explicitados.

Palavras-chave: Detecção de Relações de Causalidade, Transferência de Entropia, Algoritmo K2, Redes Bayesianas


Contents

List of Figures
List of Tables
List of Symbols and Abbreviations

1 Introduction
1.1 Proposal
1.2 Scope
1.3 Organization

2 Theoretical Foundation
2.1 Information Theory
2.1.1 Information Entropy
2.1.2 Kullback-Leibler Divergence
2.1.3 Mutual Information
2.1.4 Transfer Entropy
2.2 Bayesian Networks
2.2.1 Structural Learning in Bayesian Networks
2.2.2 K2 Algorithm

3 Proposed Methodology
3.1 General Concept
3.2 Proposed Approach
3.2.1 Generation of information flow graph
3.2.2 Application of statistical threshold
3.2.3 Removal of Cycles from Graph
3.2.4 Generation of Common and Virtual Ancestors
3.2.5 Modification and Computation of K2
3.3 Final Considerations

4 Case Study and Results
4.1 Tennessee Eastman Process
4.2 Experimental Setup
4.2.1 Processing of Data
4.2.2 Settings for the Methodology
4.3 Case Study Activity Flow
4.4 Results and Discussion
4.4.1 Overview on process variables and behavior of the system
4.4.2 Results Before K2-Modified Application
4.4.3 Graphs of causal relationships

5 Conclusions and Future Works
5.1 Future Works
5.2 Contributions


List of Figures

1.1 Research Diagram
2.1 Entropy of X = p
2.2 Example of entropy rate - Symbol Producers Machines
2.3 Choosing of the samples in TE
2.4 Bayesian Network Asia
2.5 Database of cases
3.1 Block Diagram of Methodology
3.2 Example of graph generated on the second stage
3.3 Example - Ancestors of a node
3.4 Virtual ancestors on a graph
3.5 Structure of node ancestors for all nodes
3.6 Typical structure for K2 ordering and data set of cases
3.7 Graph of causal relationships
4.1 TEP schematic
4.2 Effect of moving average
4.3 Disturbance's window application
4.4 Histogram of nodes-cause transferred entropy
4.5 Histogram of transferred entropies
4.6 Case Study Activity Flow
4.7 Process Variables Overview
4.8 Graph obtained without methodology application
4.9 Results comparison


List of Tables

4.1 Experimental setup of the simulation of TEP
4.2 Description of process variables analyzed
4.3 Alarm Settings
4.4 Settings for the application of the Methodology


List of Symbols and Abbreviations

KL   Kullback-Leibler Divergence
MI   Mutual Information
TE   Transfer Entropy
TEP  Tennessee Eastman Process


Chapter 1

Introduction

Cause-and-effect relationships are present in the most diverse situations in life. For example, a speech made by a president may affect the parliament's trust in him or her. Lack of cooling in an engine could make it stop working. Even the simplest movement made by a person cannot occur if the brain does not send a signal to the muscles. Hence, the study of this type of relationship, and of how to correctly identify it, plays an important role in many areas of science and technology. The first step in detecting causality is to verify the association between the variables. For instance, if the variables do not correlate, the chance that they belong to a cause-and-effect relationship is low. According to Pearson (1896), the linear correlation between two variables x and y can be measured by the coefficient ρ expressed in Equation 1.1, where the term E[(x − x̄)(y − ȳ)] is the covariance between the variables, and σ_x, σ_y are the standard deviations of each variable.

\rho = \frac{E[(x - \bar{x})(y - \bar{y})]}{\sigma_x \, \sigma_y}    (1.1)

This equation computes a value ranging from -1 to 1, where -1 means that the variables are strongly inversely correlated. Conversely, the value 1 represents a strong direct correlation. Although this measure characterizes the size and behavior of a relation, it is not sufficient for determining the nature of that relation: the existence of a causal relationship cannot be affirmed with such a measure.

Another way of verifying the association between two variables is through cross-correlation, a measure of similarity between two variables. It is similar to standard correlation, but instead of considering the variables at the same time instant, it is computed as a function of the relative displacement between the variables. However, just like correlation, it cannot be used as a measure of causation. But what is a causal relationship?

As an informal definition, a causal relationship can be defined as a relationship in which a change in the behavior of one variable is reflected in the behavior of the other(s). A well-known approach for "quantifying" this kind of relationship was given by Granger (1969). Granger proposed that two signals X_t and Y_t belong to a causal relationship in which Y_t is causing X_t when it is better to predict X_t using all the information about X_t and Y_t accumulated since time t − 1 than using the information of X_t alone. This approach rests on the idea that the future of X_t is better explained by a combination of its own past and the past of Y_t.

However, the causality defined by Granger requires the assumption of a stationary signal, which is often not the case. In fact, if the signal is a time series, the stationarity criterion is hardly ever achieved, since noise and other abnormalities are very common. Furthermore, this particular type of variable presents other challenges regarding the detection of causality, e.g., distribution changes (Luo et al. 2015).

Despite these obstacles, time series are present in several areas of human interest, such as economics, biology, neuroscience, engineering, and computing. Hence, aware of the relevance of these variables and of this research area, this work proposes a methodology based on information theory and structural learning of Bayesian networks for detecting causal relationships between discrete time series.

1.1 Proposal

This dissertation proposes a methodology for the detection of causal relationships between discrete time series. It uses concepts from Information Theory to identify and quantify informational relations. More specifically, it uses the method of Transfer Entropy (TE) to grade the relationships according to the amount of information transferred from one variable to another. Because transfer entropy is computed pairwise, it can produce an almost complete graph of informational relationships. Hence, aiming to reduce the number of arcs, the present work proposes to model these relationships with structural learning of Bayesian networks. The graph structure is obtained using the K2 algorithm, a well-established heuristic method for structural learning in Bayesian networks. Besides reducing the number of causal relationships, the motivation for using the K2 algorithm came from the desire to identify the most probable structure of causality, in the sense that the final graph would be free of spurious relationships as well as of duplicate flows of information. Finally, because the causal relationships between the series have time lags, a time-base transformation of these series is necessary in order to apply the K2 algorithm properly.

1.2 Scope

The proposal described in the previous section defines a methodology for detecting causal relationships between discrete-valued time series on systems. This approach joins two distinct areas of knowledge, information theory and structural learning of Bayesian networks, from which the methods of Transfer Entropy and the K2 algorithm were respectively chosen.

The Transfer Entropy method is an information-theoretic measure based on the Kullback-Leibler divergence. It is used to verify whether the past of one variable influences the future of another. It has three parameters: the time horizons k and l, and the prediction horizon h. The first two are related to the past of the cause and effect variables, and the third is related to the future of the effect variable.


Figure 1.1: Research Diagram

The K2 algorithm is a heuristic method for determining the most probable Bayesian network structure given a database and a prior ordering on the nodes of the network. It uses score metrics to decide whether the addition of a node affects the quality of the network positively or negatively.

In order to clarify where the proposed methodology is placed, a Venn diagram is shown in Figure 1.1.

In the literature, some approaches for detecting causal relationships between time series are available, either with information criteria or directly using Bayesian network structures. Although many of these approaches are applied to the analysis of industrial alarms, the detection of causality relationships has vast application in several sectors of society, e.g., medicine, economics, and computer science. Some of these applications are discussed below.

For instance, a hybrid implementation of transfer entropy was presented by Su et al. (2017). In that work, given two variables X_t and Y_t and the computation of TE from X to Y, the paper proposed using the conditional mutual information measure, MI, to define the prediction horizon of transfer entropy. The main idea of that approach was to define the prediction horizon h as the time instant that maximized the mutual information MI(X_t, Y_{t+h}).

Yu & Yang (2015) applied transfer entropy to discrete time series and proposed a routine for extracting the significant values of TE. Their approach consisted of generating a surrogate distribution from a set of false entropies. For a computed entropy to be considered significant rather than spurious, its value needed to be above the 95% quantile of the values of the fake distribution.

A model for alarm prediction in Cyber-Physical Systems based on causality analysis and probability inference was produced by Chen et al. (2018). A structure describing the causal relationships of a given system was proposed by Pearl (2002). This structure consists of a Directed Acyclic Graph (DAG), in which the edges represent a relation of causal dependence between a node and its parent set.


Hu et al. (2017) proposed a method based on transfer entropy for detecting causal relationships between industrial alarm variables, which are discrete-valued time variables. They proposed two statistical metrics: Normalized Transfer Entropy and Normalized Direct Transfer Entropy. Given two variables X_t and Y_t and the computation of TE from X to Y, these metrics use the lag d of occurrence between the variables to compute TE considering only the historical past of X_{t−d}.

These applications corroborate the use of the concepts approached in this work as valuable tools for detecting causal relationships.

1.3 Organization

This dissertation is divided into five chapters. The first chapter introduces the proposal, locating it in a research field. It exposes the motivations for the work and gives an overview of the background of the area of knowledge, presenting some of the existing related work. Moreover, it sets the organization of the work, summarizing each part of it. Chapter 2 presents the theoretical foundation necessary for understanding the methods and processes used in this work. The third chapter details the proposal and the approach used to build the final graph of causal relationships; it also describes the algorithms produced during the work.

To evaluate the proposed methodology, Chapter 4 carries out a case study on the detection of causality between discrete-valued industrial alarm series. The data used in this study were obtained with a well-established industrial process simulator, the Tennessee Eastman Process (TEP). Chapter 4 also presents a discussion of the obtained results, comparing them with the graph obtained without the use of the proposed methodology.

Chapter 5 presents a discussion of the results obtained in this dissertation and the conclusions drawn from them. Moreover, directions for future works are indicated. Closing the dissertation, a section presenting the contributions of this work is also included in this last chapter.


Chapter 2

Theoretical Foundation

This chapter explains the fundamental concepts used in this dissertation. It ranges from Information Theory to Bayesian Networks, and its purpose is to offer the support needed to comprehend the basic concepts of the proposed methodology. It is divided into two sections: Information Theory and Bayesian Networks. The first presents some of the main concepts related to information entropy, explains the functioning of the Transfer Entropy method, and exposes some of its applications. The second gives an overview of the concept of Bayesian Networks, explaining some properties of the structure and focusing on the methods for structural learning, which is the type of learning used in this work.

2.1 Information Theory

Information Theory is a mathematical field that studies the quantification and transmission of information in communication systems. Its foundation is credited to Claude Shannon and his paper "A Mathematical Theory of Communication", in which Shannon introduced and mathematically defined the concept of information entropy as a measure of the uncertainty of a random process. According to Cover & Thomas (2012), the entropy can be interpreted as "the self-measure of the information of a random variable". This concept is closely related to the notion of Mutual Information, which regards the amount of information one random variable possesses about another.

2.1.1 Information Entropy

The information entropy, or average rate of entropy, was defined by Shannon (1948) as a measure of the information produced by a stochastic process. Assuming a set of n events with probabilities of occurrence (p_1, p_2, ..., p_n), this measure quantifies, in bits, how much uncertainty is involved in the selection of an event. Defined by equation 2.1, the entropy rate H uses the inverse of the probability of the events in order to measure the unexpectedness of an outcome.

For example, assume the event of flipping a coin, in which the possible outcomes are heads and tails. If the probability of occurrence of heads is 1, then there is no "surprise" in the outcome, since heads will be the only one. In this case the entropy rate is zero. Thus, the lower the probability of occurrence of an outcome, the bigger the entropy.

H = \sum_{i=1}^{n} p(i) \, \log_2 \frac{1}{p(i)}    (2.1)

It is worth noting that H always satisfies the following properties:

1. It is continuous in p(i), the probability mass function.
2. It is monotonically increasing in n when all events are equally likely.
3. It is weighted additive when a choice is broken down into successive choices.

Figure 2.1 shows the behavior of H for the case of two possibilities, with probabilities p and 1 − p. The x-axis represents the probability P(X = p), and the corresponding entropy value is shown on the y-axis. Note in this plot that the entropy is maximum when P(X = p) = 0.5, that is, when the two possibilities are equally likely, and minimum when the probability is 0 or 1. This occurs because as P(X = p) gets nearer to 1, the certainty of the outcome p becomes higher, and as P(X = p) gets nearer to 0, the certainty of the outcome 1 − p becomes higher. Thus the entropy, or the uncertainty, is reduced.

Figure 2.1: Entropy of X = p

For a better understanding of the concept of entropy and of equation 2.1, Figures 2.2(a) and 2.2(b) show two different machines, each with its own set of probabilities. Assuming that each machine will produce a symbol, how many questions, on average, would a person have to ask each machine in order to guess the produced symbol?


(a) Machine #1 (b) Machine #2

Figure 2.2: Example of entropy rate - Symbol Producers Machines

By the definition of equation 2.1, the entropy rate of the machines is given by:

H = p(A) \log_2 \frac{1}{p(A)} + p(B) \log_2 \frac{1}{p(B)} + p(C) \log_2 \frac{1}{p(C)} + p(D) \log_2 \frac{1}{p(D)}    (2.2)

Meaning that the entropy rate of Machine #1, H_1, is:

H_1 = 0.25 \log_2 \frac{1}{0.25} + 0.25 \log_2 \frac{1}{0.25} + 0.25 \log_2 \frac{1}{0.25} + 0.25 \log_2 \frac{1}{0.25} = 2    (2.3)

and the entropy rate of Machine #2, H_2, is:

H_2 = 0.5 \log_2 \frac{1}{0.5} + 0.25 \log_2 \frac{1}{0.25} + 0.125 \log_2 \frac{1}{0.125} + 0.125 \log_2 \frac{1}{0.125} = 1.75    (2.4)

Thus, we would have to ask, on average, two questions to Machine #1 and 1.75 questions to Machine #2. This happens because Machine #1 has a higher level of uncertainty, since all its symbols are equally likely, whereas Machine #2 has its probabilities unequally distributed, making some symbols more likely to be generated than others and reducing the uncertainty of the machine.

2.1.2 Kullback-Leibler Divergence

Here an information source is assumed to be a stochastic process, with a set of n events with probability distribution p. According to Kullback (1997), the Kullback-Leibler divergence (KL) measures the error, or divergence, incurred when it is assumed that the probability distribution of these events is q instead of p. In addition, the measure has the following property: if p(i) ≠ 0 but q(i) = 0, then the divergence can be considered ∞, meaning that the distributions are completely different. Equation 2.5 shows the formula to compute the divergence:

KL_I = \sum_{i} p(i) \log \frac{p(i)}{q(i)}    (2.5)

Because KL is a non-symmetric measure, meaning that the KL from p to q can differ from the KL from q to p, it cannot be considered a distance. Other important measures derived from this one are the KL divergences for joint and conditional probabilities, defined by equations 2.6 and 2.7, respectively.

K_{I,J} = \sum_{i,j} p(i,j) \log \frac{p(i,j)}{q(i,j)}    (2.6)

K_{I|J} = \sum_{i,j} p(i,j) \log \frac{p(i|j)}{q(i|j)}    (2.7)

2.1.3 Mutual Information

Mutual Information (MI) is a measure derived from the Kullback-Leibler divergence. It is calculated between two processes I and J and quantifies the amount of information obtained about one process when observing the other. According to Schreiber (2000), MI can be seen as the "information produced by erroneously assuming that the two processes are independent". Using the KL divergence for joint probabilities, the mutual information can be defined by equation 2.8.

MI_{I,J} = \sum_{i,j} p(i,j) \log \frac{p(i,j)}{p(i) \, p(j)}    (2.8)

In this equation, the joint distribution p(i, j) is compared with the product of the individual distributions p(i) and p(j). That is, mutual information measures the untruthfulness of equation 2.9: if the equation is true, there is no mutual information; otherwise, the processes are dependent. Because MI is a symmetric measure, it cannot give any sense of direction; the MI from I to J is the same as from J to I.

p(i, j) = p(i) · p(j)    (2.9)
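A minimal sketch of equation 2.8, assuming a joint probability table as a NumPy array (the function name and the two toy tables are illustrative only):

```python
import numpy as np

def mutual_information(joint):
    """MI in bits from a joint probability table p(i, j), per equation 2.8."""
    p_i = joint.sum(axis=1, keepdims=True)   # marginal p(i)
    p_j = joint.sum(axis=0, keepdims=True)   # marginal p(j)
    mask = joint > 0                         # skip zero-probability states
    return float((joint[mask] * np.log2(joint[mask] / (p_i @ p_j)[mask])).sum())

# Independent processes: p(i, j) = p(i) p(j) holds, so MI is 0.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
# Perfectly coupled processes: observing one fully determines the other (1 bit).
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1.0
```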

2.1.4 Transfer Entropy

The transfer entropy (TE) was initially proposed by Schreiber (2000), being classified as a theoretical measure "that shares some of the desired properties of mutual information but takes the dynamics of information transport into account". Through the computation of this measure, it is possible to quantify, dynamically, the exchange of information between two systems and to know in which direction this exchange propagates.

Given two random variables I and J, we want to know how much information is dynamically transmitted from J to I. Assuming that I and J are two discrete time series, TE uses a set of samples from the past of both series to verify whether the future of I is influenced by the past of J.

In order to check this hypothesis, it is necessary to build the vectors i^k_t and j^l_t, respectively defined by equations 2.10 and 2.11. In these equations the superscript indexes, called time horizons, define how many samples of the past of the variable are utilized; e.g., if k = 3, three samples from the past of the I variable are used, starting from the time instant t. The same is valid for variable J, regarding the l horizon.

i^k_t = [i_t, ..., i_{t−k+1}]    (2.10)

j^l_t = [j_t, ..., j_{t−l+1}]    (2.11)

In addition to these definitions, it is necessary to set a prediction horizon for the variable I. This horizon indicates how far into the future of I the analysis goes and is symbolized by the parameter h. That is, if h = 1 the method will always verify only one sample ahead of the present over the whole analysis period, meaning that, taking instant t as the time reference, the method checks whether or not the past of J influences the behavior of the I variable at time instant t + 1.

The transfer entropy is computed through equation 2.12, using the joint probability p(i_{t+h}, i^k_t, j^l_t) and the conditional probabilities p(i_{t+h} | i^k_t, j^l_t) and p(i_{t+h} | i^k_t) to identify a directed exchange of information between the series. In this equation, a summation is made over all possible states of the vector [i_{t+h}, i^k_t, j^l_t] over time; the probabilities of each event are then calculated in order to compute the equation on each iteration.

TE_{J \to I} = \sum_{i_{t+h}, \, i^k_t, \, j^l_t} p(i_{t+h}, i^k_t, j^l_t) \log \frac{p(i_{t+h} \,|\, i^k_t, j^l_t)}{p(i_{t+h} \,|\, i^k_t)}    (2.12)

Here, the Kullback-Leibler divergence for conditional probabilities is used to infer the veracity of equation 2.13. When this equation is true, the future of I is independent of the past of J and, because of that, the transferred entropy is zero. If the assertion is not true, the series are dependent: the quotient inside the logarithm differs from 1, and the transfer entropy is a non-zero value indicating that information is being exchanged.

p(i_{t+h} \,|\, i^k_t, j^l_t) = p(i_{t+h} \,|\, i^k_t)    (2.13)

Because the probability values can never be less than zero and the logarithm function log(x) is monotonically increasing for x > 0, the greater the conditional probability in the numerator, the greater the transferred entropy. In order to exemplify the functioning of the method, Figure 2.3 shows how the samples are chosen over time for the computation of the transfer entropy. In this case the adopted parameters were k = 3, l = 2, h = 1, and both series have sizes of 6 samples.

Since this method needs samples of the past of the series, the computation always starts from the time instant corresponding to the longer time horizon. That is, if we set k = 3 and l = 2, the computation will start at t = 3, whereas if we set, for example, k = 3 and l = 4, it will start at t = 4. This occurs because it is necessary to guarantee that the samples indicated by the time horizons are available for the computation.

Figure 2.3: Choosing of the samples in TE.
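The following is a minimal Python sketch of a frequentist TE estimator along the lines of equation 2.12 (the function name and the toy series are illustrative, not code from this dissertation); it counts the states (i_{t+h}, i^k_t, j^l_t) and derives the conditional probabilities from the joint frequencies:

```python
from collections import Counter
from math import log2

def transfer_entropy(i, j, k=1, l=1, h=1):
    """Frequentist estimate of the TE from series j to series i (equation 2.12)."""
    start = max(k, l) - 1                      # first t with full time horizons
    states = []
    for t in range(start, len(i) - h):
        ik = tuple(i[t - k + 1 : t + 1])       # past of the effect series, i^k_t
        jl = tuple(j[t - l + 1 : t + 1])       # past of the cause series, j^l_t
        states.append((i[t + h], ik, jl))

    n = len(states)
    p_full = Counter(states)                   # counts of (i_{t+h}, i^k, j^l)
    p_ik_jl = Counter((ik, jl) for _, ik, jl in states)
    p_ih_ik = Counter((ih, ik) for ih, ik, _ in states)
    p_ik = Counter(ik for _, ik, _ in states)

    te = 0.0
    for (ih, ik, jl), c in p_full.items():
        cond_full = c / p_ik_jl[(ik, jl)]            # p(i_{t+h} | i^k, j^l)
        cond_ik = p_ih_ik[(ih, ik)] / p_ik[ik]       # p(i_{t+h} | i^k)
        te += (c / n) * log2(cond_full / cond_ik)
    return te

# j copied into i with one sample of delay: strong transfer from j to i.
j = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
i = [0] + j[:-1]
print(transfer_entropy(i, j, k=1, l=1, h=1))   # high: j drives i
print(transfer_entropy(j, i, k=1, l=1, h=1))   # much smaller: no transfer back
```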

2.2 Bayesian Networks

Bayesian Networks are probabilistic models based on directed acyclic graphs. Also called belief networks, in these networks the nodes represent variables of interest, while the connections represent relationships of informational or causal dependence (Pearl 2011). These dependencies are relative to the parents of the node; that is, they are conditional probabilities for a node given its parents. Figure 2.4 shows an example of a Bayesian Network. This structure was originally provided in (Lauritzen & Spiegelhalter 1988) and its main purpose is to assist the diagnosis of a patient with shortness of breath (dyspnoea). The hypothetical situation is that a patient with this symptom has recently visited Asia and the results of their X-ray are not yet available. Thus, it is desirable to know the chances of one of the possible diseases among Lung Cancer, Tuberculosis, and Bronchitis causing the dyspnoea, given the knowledge of the Asia visit. For instance, if the patient is a smoker, the probability of lung cancer increases, although smoking could also increase the chances of bronchitis.

To model these probabilities, Bayesian Networks represent the joint distribution of the data in a more compact form. Instead of computing the chain rule of equation 2.14, which considers all the variables involved in the joint distribution, Bayesian Networks factor the global joint distribution into local conditional distributions for each node given its parents (Pearl 2011). Equation 2.15 shows how the full joint distribution is computed in the network.

P(A_n \cap ... \cap A_1) = \prod_{i=1}^{n} P(A_i \,|\, \cap_{j=1}^{i-1} A_j)    (2.14)

P(A_1, ..., A_n) = \prod_{i} P(A_i \,|\, pa_i)    (2.15)

The term "pa" in this equation stands for the set of parents of the nodes. This is possible because these networks consider that each node is conditionally independent of the others, given the knowledge of its parents. For example, in the network of Figure 2.4, the only parents of the node Tuberculosis or Cancer are Tuberculosis and Lung Can-cernodes, then the conditional probability of this node given any other subset of nodes, will only consider its parents, as shown in the example of Equation 2.16. In addition, each node has a conditional probability table (CPT) relating it to its parents (Cooper & Herskovits 1992). This table needs to define all possible probability occurrence values.

P(TubercOrCancer | Cancer, Bronchitis, Asia) = P(TubercOrCancer | Cancer) (2.16)

Figure 2.4: Bayesian Network Asia

Having the structure defined, as well as the conditional probability tables, the probability of any event occurring given some evidence can be calculated through the use of inference methods.
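A minimal sketch of the factorization of equation 2.15, using a hypothetical three-node chain Smoking → Cancer → Dyspnoea with illustrative CPT numbers (these are not the actual Asia network tables):

```python
# CPTs for a hypothetical chain Smoking -> Cancer -> Dyspnoea.
p_smoking = {True: 0.3, False: 0.7}
p_cancer = {True: {True: 0.1, False: 0.9},    # p(cancer | smoking)
            False: {True: 0.01, False: 0.99}}
p_dyspnoea = {True: {True: 0.8, False: 0.2},  # p(dyspnoea | cancer)
              False: {True: 0.1, False: 0.9}}

def joint(smoking, cancer, dyspnoea):
    """Full joint via equation 2.15: P(S, C, D) = P(S) P(C | S) P(D | C)."""
    return (p_smoking[smoking]
            * p_cancer[smoking][cancer]
            * p_dyspnoea[cancer][dyspnoea])

print(joint(True, True, True))  # 0.3 * 0.1 * 0.8 = 0.024
```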


2.2.1 Structural Learning in Bayesian Networks

The structure of a Bayesian Network is sometimes not fully known. This occurs because knowing how a set of variables is connected is often not an easy task, requiring the help of specialists in the field of knowledge where the problem takes place.

In the literature, there are data-based approaches for obtaining the structure that optimally describes the relationships between the variables. Friedman et al. (1999) proposed the "Sparse Candidate" algorithm to generate the network structure. The concept consists of selecting the variables with strong dependency and grouping them near each other. This strength can be measured using mutual information or correlation methods. The proposal is then to find, for each variable, a set of nodes with which it has strong dependency connections. Thus, during the search for the structure, the algorithm focuses on the networks that contain these relations, restraining the complexity of the search.

Cooper & Herskovits (1992) proposed a quality-score-based metric in order to define the most probable network structure. Their idea was to obtain information about the relationships among the nodes from a database composed of a set of occurrences of these variables. Along with the database, their method requires a prior structure for the network. This is done in order to restrain the search space of the method, reducing its complexity. The routine they proposed bounds the number of parents of each node and consists of, starting from the presupposition that each node is an "orphan", verifying whether the addition of an ancestor to the node increases the score of the network.

Heckerman et al. (1995) use a prior network, called the gold standard, and a database for estimating the network structure. Through sampling of the gold standard, a new database is generated, which, along with Bayesian quality metrics and search procedures, is used to generate a set of possible networks. These new networks are then used to estimate the probability of the next case given a database and the current state of information. The result obtained is compared with the result produced by the gold-standard network, and the chosen structure is the one with the highest posterior probability.

The second approach mentioned above is the one used in the methodology proposed in this dissertation: in order to generate the structure of the graph of causal relationships, the K2 algorithm proposed by Cooper & Herskovits (1992) is utilized. However, because of the temporal nature of the process used in this work, some modifications are made to K2. In addition, the usual metric of K2, known as the Cooper-Herskovits metric, is substituted by the Minimum Description Length metric. This variation of K2 is also known as K3, but this work will refer to it only as K2-Modified, since other modifications are made to the original algorithm.

2.2.2 K2 Algorithm

The K2 algorithm is a Bayesian method for estimating a probabilistic network from data, proposed by Cooper & Herskovits (1992). The algorithm searches for the network that has the highest posterior probability given a database of records, called cases. Figure 2.5 shows an example of a database used for a network containing three nodes, taken from (Cooper & Herskovits 1992). Each case informs the status of the variables in the record, which in this example are present or absent.

Figure 2.5: Database of cases

To identify how these three variables relate to each other, K2 proposes an approach that, from a prior ordering on these nodes, calculates the probability that a structure B_S is the one that best represents the relationships among the nodes. Hence, the objective is to find the Bayesian structure B_s that maximizes P(B_s, D), where D is a given database, with P(B_s, D) as presented in Equation 2.17.

P(B_s, D) = c \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!    (2.17)

Where:
• D is a given database;
• n is the number of nodes in the network;
• q_i is the number of unique instantiations of the set of parents of a node;
• r_i is the number of all possible values assumed by the node;
• N_{ijk} is the number of cases in D in which the node is instantiated with its kth value and the parents of the node are instantiated with the jth instantiation;
• N_{ij} = \sum_{k=1}^{r_i} N_{ijk}.

The authors argue that to maximize this equation it is only necessary to find the parent set that maximizes the inner product. Thus, the K2 algorithm defines as g(i, π_i) the function that computes this part of the equation. This function is known as the Cooper-Herskovits metric, and it measures the probability of the parent set π_i being the correct parent set of the node x_i. Equation 2.18 shows how the g function is defined.

g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!    (2.18)

Algorithm 1 shows the steps for the implementation of K2. It starts by assuming that the node has no parents, and then incrementally adds to the parent set of the node the node that maximizes the score of the resultant structure. If there is no parent whose addition increases the probability of the network given the database, the algorithm stops and starts the analysis of the next node in the node set. It is important to say that only the ancestors given in the prior ordering are analyzed as possible parents.
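A minimal Python sketch of equation 2.18, computed in log space with lgamma to avoid overflowing the factorials on large databases (function and variable names are illustrative; cases are assumed to be dictionaries mapping variable names to integer values 0..r−1):

```python
from math import lgamma
from collections import Counter

def log_g(node, parents, data, arity):
    """log of the Cooper-Herskovits metric g(i, pi_i) of equation 2.18.

    data: list of cases, each a dict {variable: value};
    arity: number of values the node can assume (r_i).
    """
    r = arity
    # N_ijk: cases where parents take instantiation j and the node takes value k.
    n_ijk = Counter((tuple(case[p] for p in parents), case[node])
                    for case in data)
    # N_ij: cases where the parents take instantiation j, summed over k.
    n_ij = Counter(tuple(case[p] for p in parents) for case in data)

    score = 0.0
    for j, nij in n_ij.items():                  # each parent instantiation seen
        score += lgamma(r) - lgamma(nij + r)     # log[(r-1)! / (N_ij + r - 1)!]
        for k in range(r):
            score += lgamma(n_ijk[(j, k)] + 1)   # log(N_ijk!)
    return score

# Three binary cases for nodes x1 and x2, in the spirit of Figure 2.5.
cases = [{"x1": 1, "x2": 0}, {"x1": 1, "x2": 1}, {"x1": 0, "x2": 0}]
print(log_g("x2", ["x1"], cases, arity=2))
```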

Minimum Description Length

In addition to Bayesian metrics, there are other metrics used for estimating the Bayesian network structure. One of them was proposed by Bouckaert (1993) and is considered a measure of information quality, being based on the principle of Minimum Description Length (MDL). The main idea is to select the structure that maximizes the MDL score. Equation 2.19 shows the expression used to compute the metric.

L(B_S, D) = \log P(B_s) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log\!\left(\frac{N_{ijk}}{N_{ij}}\right) - \frac{1}{2} \log(N) \sum_{i=1}^{n} q_i (r_i - 1)    (2.19)

Where:
• D is a given database;
• n is the number of nodes in the network;
• q_i is the number of unique instantiations of the set of parents of a node;
• r_i is the number of all possible values assumed by the node;
• N_{ijk} is the number of cases in D in which the node is instantiated with its kth value and the parents of the node are instantiated with the jth instantiation;
• N_{ij} = \sum_{k=1}^{r_i} N_{ijk}.

The first term of this equation denotes the prior probability of the network structure. As in the Cooper-Herskovits metric, it serves to add possible prior knowledge of the network. The second term is called the empirical entropy of the structure, and its value decreases as more arcs are added to the network. The last term models the cost of estimating the probability tables for each node of the network given its parent set. The factor (1/2) log(N) in this term aims to penalize structures with a greater number of parameters, since they increase the complexity of computing the joint distribution of the network.
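A minimal sketch of the second and third terms of equation 2.19 for a single node, under the assumption of a uniform structure prior (so the log P(B_s) term is dropped); q_i is approximated here by the number of parent instantiations actually observed in the data:

```python
from math import log
from collections import Counter

def mdl_node_score(node, parents, data, arity):
    """Empirical-entropy and penalty terms of equation 2.19 for one node,
    assuming a uniform structure prior (log P(Bs) dropped)."""
    n = len(data)
    n_ijk = Counter((tuple(case[p] for p in parents), case[node])
                    for case in data)
    n_ij = Counter(tuple(case[p] for p in parents) for case in data)

    # Sum of N_ijk log(N_ijk / N_ij): the empirical entropy term.
    entropy_term = sum(c * log(c / n_ij[j]) for (j, _), c in n_ijk.items())
    # q_i approximated by the parent instantiations seen in the database.
    q = len(n_ij) if parents else 1
    penalty = 0.5 * log(n) * q * (arity - 1)
    return entropy_term - penalty

cases = [{"x1": 1, "x2": 0}, {"x1": 1, "x2": 1}, {"x1": 0, "x2": 0}]
print(mdl_node_score("x2", ["x1"], cases, arity=2))
```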


Algorithm 1: K2 Algorithm
Input: a set of n nodes (x_1, ..., x_n); an ordering on the nodes; a data set of n columns and m cases (database)
Output: Bayesian network topology (the parent set π_i of each node)

for i ← 1 to n do
    π_i ← ∅
end
for i ← 1 to n do
    P_old ← g(i, π_i)                # g is computed using equation 2.18
    while True do
        pred_xi ← Pred(x_i)          # set of nodes that precede x_i in the ordering
        select the node x_j ∈ pred_xi \ π_i that maximizes g(i, π_i ∪ {x_j})
        P_new ← g(i, π_i ∪ {x_j})
        sigma ← (P_new > P_old)
        if sigma then
            P_old ← P_new
            π_i ← π_i ∪ {x_j}
        end
        if pred_xi ≠ ∅ then
            pred_xi ← pred_xi \ {x_j}
        end
        if not sigma or pred_xi = ∅ then
            break
        end
    end
end
return the parent set of each node


Chapter 3

Proposed Methodology

There are many situations in which identifying causal relationships helps in observing the behavior of systems. This occurs because this kind of relationship can be used to determine the impact one variable has over another, and it can narrow the search space when a problem occurs. For instance, in the neuroscience field, causal interactions across brain structures can reveal the flow of information associated with neuronal processing, as stated by Vicente et al. (2011). In the medical field, the study of causal relationships made possible the identification of the causality between HPV and cervical cancer (Bosch et al. 2002). Thus, the objective of this chapter is to explain the proposal of this dissertation, which consists of a methodology to detect causal relationships between discrete time series. The chapter is organized in two sections: the first presents the general concept, while the second details the proposed approach, including the algorithms needed to reproduce it, as well as the description of each of its stages.

3.1 General Concept

The purpose of this dissertation is to define a methodology to identify causal relationships between discrete time series on systems with interconnected entities, where these entities can be any source of periodic data. This type of system is present in many areas of knowledge. For example, an alarm system could have its alarm variables as sources of information; a stock system could be modeled with the closing and opening price series, as well as the daily traded volume of a share, as the mentioned discrete time series; even in a TV audience ratings system, the audience rating of a show could be seen as a source of periodic information.

Using the concepts of Information Theory and Bayesian Networks, this dissertation joins the method of Transfer Entropy and the K2 algorithm into a single methodology for the detection of these relations. Transfer Entropy has a well-known algorithm used to indicate a causal relationship between two discrete variables. It applies the Kullback-Leibler divergence to conditional probabilities in order to compute the entropy transferred between two variables, and takes advantage of its non-symmetry to indicate the direction of this action. However, although it can obtain significant results with little knowledge of the variables, being a noise-sensitive method it can also generate misleading results.

On the other hand, the K2 algorithm is a heuristic method widely used in the field of structural learning, and it performs well in determining the most probable structure of a network using the conditional probability concepts of the Bayesian approach. This algorithm has been tested for the reconstruction of complex networks, obtaining satisfactory results, as in the case of the ALARM network, which had 37 nodes and 46 arcs and for which the algorithm had only one missing arc and one arc added erroneously, as stated by Cooper & Herskovits (1992).

Nevertheless, its performance relies on a robust data set of cases and on the quality of the ordering on the nodes, which can be decisive for the correct definition of the delivered structure. The construction of this ordering usually requires someone with expertise in the functioning of the system and in how its variables are interconnected, so that the relations contained in it do not lead the algorithm to produce an erroneous structure.

Hence, this dissertation proposes a methodology that utilizes the graph of causal relationships produced by the pairwise application of Transfer Entropy on a data set as the ordering on the nodes for the K2 algorithm, in order to determine the most probable structure of causal relationships. However, a modification of the K2 algorithm is needed, aiming to insert into it the dynamics imposed by the relations between discrete time series. These dynamics regard the delay present in cause-effect relations, where the effect of a change in the state of one variable can take some time to be felt by the other.

3.2 Proposed Approach

In order to detect the causal relationships, the referred system will be modeled as a graph in which the nodes are the entities related to each other by a causality relation. This detection is made in five stages, represented by the block diagram in Figure 3.1:

1. Generation of information flow graph: the Transfer Entropy measurement is used to identify the preliminary causal relationships, as well as the lags of these relationships.
2. Application of statistical threshold: the stage responsible for selecting the most relevant relationships based on the amount of information transferred.
3. Removal of cycles: promotes the removal of possible cycles in the graph, based on an information criterion.
4. Generation of common and virtual ancestors: promotes a new modeling of the graph, differentiating direct and indirect relationships.
5. Modification and computation of K2: modifies the K2 algorithm in order to insert into it the dynamics of the lags of the relationships among the nodes.

Although the system is modeled as a graph, this work represents the graph as a matrix; thus the next subsections will refer to the graph of causal relationships as a matrix.

Figure 3.1: Block Diagram of Methodology

3.2.1 Generation of information flow graph

Given a set of N discrete time series, this stage computes an N × N matrix in which each position, except those on the diagonal, holds the amount of entropy transferred between two nodes.

As mentioned before, TE is a parametrized method; thus each computation needs a parameterization defined a priori. For the time horizons k and l, a fixed setting is used in this work, while for the prediction horizon a maximum value is defined. This differentiation is made in order to identify the lag of each relationship.

Algorithm 2 shows the computation of the transfer entropy. Assuming two series I and J of size N, this algorithm computes the approximate joint probability p(i_{t+h}, i^k_t, j^l_t) using a frequentist approach, running through all possible states of the vector [i_{t+h}, i^k_t, j^l_t]. This means that if the series have binary values there will be 2^{k+l+1} possible states. The conditional probabilities p(i_{t+h} | i^k_t) and p(i_{t+h} | i^k_t, j^l_t) are computed from the joint density probabilities p(i_{t+h}, i^k_t) and p(i_{t+h}, i^k_t, j^l_t), respectively.

After the probability computations, the TE is computed using equation 2.12, previously shown in Chapter 2. Because in this case the prediction horizon is the maximum value of an interval, each TE is computed several times during the generation of the matrix of transferred entropies, varying the parameter h from 1 to the maximum h. The maximum value of entropy obtained is then chosen, as shown in Algorithm 2, and the corresponding h is used as the lag of the relationship.


Algorithm 2: Generate graph of transferred entropies and relation lags
Input: a data set composed of the discrete time series; maximum prediction horizon h_max; time horizons k and l
Output: graph of transferred entropies and lags (TE_Graph)

foreach time series J ∈ data set do
    foreach time series I ∈ data set do
        if J ≠ I then
            entropies ← ∅
            for h ← 1 to h_max do
                TE_{J→I} ← Σ p(i_{t+h}, i^k_t, j^l_t) log [ p(i_{t+h} | i^k_t, j^l_t) / p(i_{t+h} | i^k_t) ]
                entropies ← entropies ∪ {TE_{J→I}}
            end
            TE_Graph ← TE_Graph ∪ {(max(entropies), the h of the maximum)}
        end
    end
end
return TE_Graph, representing the entropy and lag of each relationship
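As a companion to Algorithm 2, the following Python sketch (illustrative names; it assumes the transfer_entropy function sketched in Section 2.1.4 is in scope) scans h from 1 to its maximum for every ordered pair and keeps the largest entropy together with the h that produced it:

```python
def te_graph(series, k=1, l=1, h_max=5):
    """series: dict {name: list of samples}.
    Returns {(cause, effect): (max TE, lag h of the maximum)}."""
    graph = {}
    for cause, j in series.items():
        for effect, i in series.items():
            if cause == effect:
                continue                       # the diagonal is skipped
            entropies = [(transfer_entropy(i, j, k, l, h), h)
                         for h in range(1, h_max + 1)]
            graph[(cause, effect)] = max(entropies)   # picks the largest TE
    return graph
```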

3.2.2 Application of statistical threshold

Aiming to extract only the most relevant TE values, a statistical threshold is applied to the relationships. This stage replaces with zero the TE values that are smaller than the defined threshold.

Because the distribution of the set of computed entropies can vary according to the set of discrete time series being analyzed, it is advisable to choose the threshold through analysis of the data distribution. In the studies made in this work, a threshold around the 80th percentile proved suitable.
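A minimal sketch of this stage, assuming the graph dictionary produced by the te_graph sketch above (dropping an edge is equivalent to zeroing its TE value):

```python
import numpy as np

def apply_threshold(graph, percentile=80):
    """Keep only the relationships whose TE is at or above the given percentile."""
    values = np.array([te for te, _ in graph.values()])
    cutoff = np.percentile(values, percentile)
    return {edge: (te, lag) for edge, (te, lag) in graph.items() if te >= cutoff}
```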

As mentioned before, apart from the transferred entropy, the lag of each relation is also stored. Hence, each causal relationship is characterized by two attributes: the amount of information and the lag between cause and effect. As a result of this stage, a graph similar to the one shown in Figure 3.2 is produced.

It is worth highlighting that, although delays are commonly presented in time units, in this work the lags of the relationships are given in samples. Another point to be observed is that, except for self-cycles, this graph does not have any restriction on its relations. This particularity hinders the identification of the correct flow of information. For example, node A is directly related to node C and also indirectly related to this same node through node D, making it harder to identify which flow best describes the system.


Figure 3.2: Example of graph generated on the second stage.

3.2.3 Removal of Cycles from Graph

In order to prepare the causal graph generated in the previous stage to be used by the K2 algorithm, cycles are removed in the third stage. The strategy used for this removal consists of favoring the strongest relationships of the graph by adding first the ones with the greatest amount of information.

Algorithm 3 presents the steps needed to execute this removal. Initially, an empty graph is considered as the target (cycle-free) graph. Subsequently, the relationship with the greatest amount of entropy is selected. In order to guarantee that the addition of this relation to the new graph will not generate cycles, an exploration of the ancestors of the node-cause is made.

This exploration is presented in Algorithm 4 and consists of a recursive search in which the nodes "above" the node-cause are marked as its ancestors. For example, in Figure 3.3 the ancestors returned for node D will be A, B, and C, while the list of ancestors of node E will be composed only of A and C.

Figure 3.3: Example - Ancestors of a node

If the node-effect, also called the node-son, is not in the list of ancestors of the node-cause, then no cycle is generated and the relation is added to the graph. This process is repeated until all relations have been analyzed, at which point the non-cyclic graph has been generated.

Algorithm 3: Remove Graph Cycles
Input: TE graph G_TE
Output: TE graph without cycles G'_TE

Assuming G_TE = <E, N>, where each edge E = <node_c, node_e, weight>, as the input,
and G'_TE = <E', N'>, where each edge E' = <node'_c, node'_e, weight'>, as the output.

E_max ← max(E)              # the edge with the maximum value of TE
forbiddenSons ← ∅           # the set of nodes that a node cannot have as its son
while weight(E_max) > 0 do
    forbiddenSons ← get_node_ancestors(node_c, G'_TE)   # defined in Algorithm 4
    if node_e ∉ forbiddenSons then
        G'_TE ← G'_TE ∪ {E_max}
    end
    select the next E_max of the graph G_TE
end


Algorithm 4: Get node ancestors
Input: a node whose ancestors will be computed (node); TE graph without cycles G'_TE
Output: the set of ancestors of the node (A_node)

Assuming G'_TE = <E', N'>, where each edge e = <node'_c, node'_e, weight'>;
the subscripts c and e stand for cause and effect.

get_node_ancestors(node, G'_TE):
    A_node ← {n_a | e ∈ E' and e = <node'_c = n_a, node'_e = node>}
    foreach node ∈ A_node do
        A_node ← A_node ∪ get_node_ancestors(node, G'_TE)
    end
    return A_node
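A compact Python sketch of Algorithms 3 and 4 together, assuming the {(cause, effect): (te, lag)} dictionary produced by the earlier stage sketches (function names are illustrative):

```python
def ancestors(node, edges):
    """Recursive ancestor set of a node (Algorithm 4).
    edges: dict keyed by (cause, effect)."""
    direct = {c for (c, e) in edges if e == node}
    found = set(direct)
    for parent in direct:
        found |= ancestors(parent, edges)
    return found

def remove_cycles(te_graph):
    """Rebuild the graph strongest-edge-first, skipping any edge that would
    close a cycle (Algorithm 3)."""
    acyclic = {}
    for (cause, effect), (te, lag) in sorted(
            te_graph.items(), key=lambda kv: kv[1][0], reverse=True):
        # Adding cause -> effect closes a cycle iff effect is already an
        # ancestor of cause in the graph built so far.
        if cause != effect and effect not in ancestors(cause, acyclic):
            acyclic[(cause, effect)] = (te, lag)
    return acyclic
```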

3.2.4 Generation of Common and Virtual Ancestors

As stated before, some nodes may relate to others both directly and indirectly in the same graph. When this happens it is not possible to assert which one is the real causal relationship. In order to determine the most probable flow through which the information propagates, the fourth stage separates these two kinds of relationship by modeling a new graph in which the indirect relationships are represented as direct relationships whose lag is the cumulative sum of the lags of the intermediate connections. In this context, the node-cause of the first type of relationship is called a common ancestor, whereas the node-cause of the second type is called a virtual ancestor.

(a) Graph with indirect relations (b) Virtual ancestors associated with the node C
Figure 3.4: Virtual ancestors on a graph

For example, in the graph shown in Figure 3.4(a), node C is caused by node A in two ways: passing through B (A → B → C) and directly from A (A → C). A similar situation occurs with the relationship D → B → C, but in this case only the indirect connection exists. Thus, node C has two virtual ancestors associated with it: A_13, coming from A → B → C, and D_7, coming from D → B → C.


Besides computing the common and virtual ancestors of a node, this stage is also responsible for generating the K2 ordering. This process is done individually, that is, for each node, by Algorithm 5, and then gathered by Algorithm 6, which assembles the computations for all nodes, creating the K2 ordering. To explain how Algorithm 5 handles the individual case, Figure 3.4(a) is taken as the reference. For instance, assume Algorithm 5 is computing the ancestors of node C. Initially it accounts for all the adjacent ancestors of the node, that is, the ancestors directly connected to it. Then, for each adjacent ancestor, it does a recursive search, computing the cumulative sum of the lags along the path.

This means that if node C has a direct connection to an ancestor node, as in the relation A → C, the algorithm will add A to the list of ancestors of C with the corresponding lag, which in this case is the relationship's own lag: 2 samples. However, if C has indirect connections to other nodes, as in the relation A → B → C, then those nodes are added to the list of ancestors of C with the lag being the cumulative sum along the path between the node and the ancestor. In this case node A would be added to the list again, but this time as a virtual node with a lag of 13 samples, resulting from the sum 8 samples + 5 samples.

Algorithm 6 does this computation for all nodes of the graph and returns a map structure containing, for each node, the map of ancestors generated before. It is worth noting that this structure is an adequation of the TE graph, made in order to handle the lags of the relations, since the K2 algorithm does not take the delays of the relationships into account. Figure 3.5 shows this map structure of ancestors for the graph generated by the TE shown in Figure 3.4(a).

Algorithm 5: Generation of Common and Virtual Ancestors
Input: a node whose ancestors will be computed (node); a graph without cycles G'_TE
Output: a map structure (AncesMap) containing all the ancestors of the node and the cumulative lag corresponding to the respective path between each ancestor and the node

Assuming G'_TE = <E', N'>, where each edge e = <node'_c, node'_e, weight'_TE, lag'_TE>,
and AncesMap = {<node_c, sum_list>}.

gen_common_virtual_ancestors(node, G'_TE, lag_cum_sum, AncesMap):
    A_edge ← {e | e ∈ E' and e = <node'_c, node'_e = node>}
    foreach edge ∈ A_edge do
        lag_cum_sum ← lag_cum_sum + lag(edge)
        AncesMap(node'_c) ← AncesMap(node'_c) ∪ {lag_cum_sum}
        gen_common_virtual_ancestors(node'_c, G'_TE, lag_cum_sum, AncesMap)
        lag_cum_sum ← lag_cum_sum − lag(edge)
    end
    return AncesMap


(a) Ancestors of node A (b) Ancestors of node B

(c) Ancestors of node C

Figure 3.5: Structure of node ancestors for all nodes

Algorithm 6: Ensemble nodes ancestors
Input: a graph without cycles G'_TE
Output: a map structure (TreeMap) containing, for each node, its AncesMap (see Algorithm 5)

Assuming G'_TE = <E', N'>, where each edge e = <node'_c, node'_e, weight'_TE, lag'_TE>,
and TreeMap = <node_key, AncesMap>.

foreach node ∈ N' do
    sum ← 0
    TreeMap[node] ← gen_common_virtual_ancestors(node, G'_TE, sum)   # Algorithm 5
end
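A Python sketch of Algorithm 5, assuming the {(cause, effect): (te, lag)} edge dictionary of the earlier sketches; an ancestor reached through more than one path receives one lag entry per path, and the extra entries play the role of the virtual ancestors:

```python
def common_virtual_ancestors(node, edges, lag_so_far=0, out=None):
    """Collect every ancestor of `node` with the cumulative lag of its path."""
    if out is None:
        out = {}
    for (cause, effect), (_, lag) in edges.items():
        if effect == node:
            total = lag_so_far + lag
            out.setdefault(cause, []).append(total)
            common_virtual_ancestors(cause, edges, total, out)
    return out

# The graph of Figure 3.4(a): A -> B (lag 8), B -> C (5), A -> C (2), D -> B (2).
edges = {("A", "B"): (1.0, 8), ("B", "C"): (0.9, 5),
         ("A", "C"): (0.8, 2), ("D", "B"): (0.7, 2)}
print(common_virtual_ancestors("C", edges))
# {'B': [5], 'A': [13, 2], 'D': [7]}  -> A_13 and D_7 are virtual ancestors of C
```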


3.2.5 Modification and Computation of K2

In order to compute the K2 algorithm, three inputs are needed: a set of n variables (also called nodes), an ordering on the referred variables, and a data set of cases. This work proposes two modifications to the K2 algorithm. The first one regards the set of nodes: instead of using only the original nodes in the database, the virtual nodes are also considered in the prior ordering of K2. This strategy aims to detect the real flow of information on the graph when there are ambiguities. For example, in Figure 3.4(a), the relationship A → B → C and the by-pass relationship A → C constitute a possible ambiguity.

The second modification regards the concept of causal relationship lag. In the K2 algorithm, each case is considered an instance of an event. For example, in Figure 3.6(b), case 1 represents that the state of the variables is A = 1, B = 1 and C = 0. It does not take into account the instant of occurrence of the variables. Thus, it is not possible to say which variable assumed its value first, or even whether all three variables assumed their values at the same time. Aiming to adapt the K2 algorithm to handle the dynamics imposed by the lag of the causal relationships among discrete time series, the fifth stage of the proposed approach adapts the database of the series by shifting the common and virtual ancestor time series according to the lag of the relationship they belong to.

Because virtual ancestors were added to the graph, the graph delivered by the K2 algorithm does not necessarily contain only the true causal relationships. This happens because the indirect relationships were modeled as direct ones in the generation of the virtual ancestors. Thus, a reconstruction of the graph is also needed. The next three topics detail, respectively, the modifications discussed here, the computation of the modified K2 algorithm, and the reconstruction of the graph of causal relationships.

Figure 3.6: Typical structure for K2 ordering and data set of cases. (a) K2 pre-order; (b) data set of cases.

Insertion of relationship lag dynamics

Knowing that the K2 algorithm analyses the nodes one by one, the idea of this modification is to shift all of the discrete time series involved in each individual analysis, except the one whose ancestors are being computed. This shift is defined according to the lag of the causal relationship. For example, when defining the ancestors of C, if the relationship B → C in Figure 3.7 had a lag of 5 samples, then the time series B would be shifted to the left by 5 samples, that is, it would be delayed by 5 samples. If the relationship A → B had a lag of 2 samples, then a new series would be added to the database: the series A shifted to the left by 7 samples, since the cumulative sum of the relationship A → B → C is 7 samples. Thus, each iteration of the K2 algorithm uses a database derived from the original one, but with the columns shifted. Algorithm 7 generates this database for each iteration of the K2 algorithm. It is worth highlighting that the original data is not modified; only a copy of it is used to generate the database used in each iteration.

Figure 3.7: Graph of causal relationships

Algorithm 7: Generate data set of K2 iteration
Input : a data set with all node cases - database,
        a node for which the data set will be generated - node,
        a map structure containing for each node an AncesMap - TreeMap
Output: a data set with new columns added, aligned according to the lag between the node and its ancestors - dataSet

1  if TreeMap[node] ≠ ∅ then
2    foreach (key, value) ∈ TreeMap[node] do
3      foreach lag ∈ value do
4        column ← shift(key, −lag)
5        database ← database ∪ {column}
6      end
7    end
8  end
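Assuming the discrete time series are stored as columns of a pandas DataFrame, Algorithm 7 could be sketched as follows; the naming convention for the shifted (virtual) columns is an illustrative choice.

import pandas as pd

def generate_dataset_for_node(database, node, tree_map):
    # Return a copy of `database` with one extra column per (ancestor, lag)
    # pair, shifted so that cause and effect samples become aligned.
    data = database.copy()             # the original data set is never modified
    for ancestor, lags in tree_map.get(node, {}).items():
        for lag in lags:
            # shift(-lag) moves the ancestor series `lag` samples to the left
            data[f"{ancestor}_shift{lag}"] = database[ancestor].shift(-lag)
    return data

For the example of Figure 3.7, the iteration for node C would add the column B shifted to the left by 5 samples and the column A shifted to the left by 7 samples.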


Computation of the K2 Algorithm

Considering that the original K2 algorithm does not take time into account, this work presents a modified version of the K2 algorithm, described by Algorithm 8 and denominated K2-Modified. The main modifications consist of using an ordering on the nodes that models indirect relationships as direct relationships, and of inserting the lag dynamics by using a technique that introduces the effect of the lag between the relationships into the utilized database.

To generate the new graph using K2-Modified, it is first considered that each node has no ancestor; thus, the first structure proposed by the algorithm is similar to a tree with only leaf nodes. For future comparison, this structure has its quality factor measured by the f function, which returns the score of the structure given the provided database. In this work, the f function is not the Cooper-Herskovits metric originally used in K2, but a function that scores the network based on the principle of minimum description length, as previously defined in Chapter 2.
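Equation 2.19 is given in Chapter 2; purely to fix ideas, the sketch below shows one common form of an MDL network score for a node given a candidate parent set (the empirical log-likelihood penalized by (log m)/2 times the number of free parameters). It is a generic illustration, not necessarily the exact expression adopted in this work.

import numpy as np
import pandas as pd

def mdl_score(data, node, parents):
    # MDL-style score of `node` given `parents` (higher is better):
    # log-likelihood - (log m / 2) * number of free parameters.
    m = len(data)
    r = data[node].nunique()                    # number of states of the node
    if parents:
        groups = [g for _, g in data.groupby(list(parents))[node]]
    else:
        groups = [data[node]]
    log_lik = 0.0
    for child in groups:
        counts = child.value_counts().to_numpy()
        log_lik += (counts * np.log(counts / counts.sum())).sum()
    n_params = len(groups) * (r - 1)            # (r - 1) per parent configuration
    return log_lik - 0.5 * np.log(m) * n_params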

The next step is to analyze all the ancestors proposed in the ordering for each original node. However, in order to insert the modification into the algorithm, before starting the computation for a node, a new database, generated according to the lags of the node's ancestors, is associated to the node, as shown in Algorithm 7. It is this database that is used by the algorithm in the respective iteration. Subsequently, the ancestor whose addition to the network maximizes the quality criterion is found. If the new quality factor is better than the old one, that is, if the addition of this node produces a more probable Bayesian network structure, then this node is added to the network. Otherwise, the algorithm defines the current ancestors of the node as its permanent ancestors and starts the analysis for the next node. This process is repeated until all nodes of the network have their ancestors defined.

Reconstruction of the graph

As commented before, the structure obtained by K2-Modified cannot yet be taken as the final graph of causal relationships. Due to the addition of the virtual nodes, the retrieved graph can contain relationships that do not correctly represent the indirect relationships among the nodes, nor their respective lags. Hence, a reconstruction of the graph is needed; this reconstruction retrieves the indirect relations together with their respective lags. The process is done as shown in Algorithm 9, where each virtual ancestor has the path between it and its child node reconstituted. Because at this point the causal relationships have already been detected, the final graph only exhibits the lags, instead of also showing the amount of entropy transferred. This reconstruction finalizes the steps proposed by the methodology.

3.3 Final Considerations

The application of the proposed methodology consists of performing the whole process exposed in the five stages. Obtaining the first graph is the core of this methodology: it is through this graph that the relationships between the series are previously defined.


Algorithm 8: K2-Modified
Input : a set of n nodes - (x_1, ..., x_n),
        an ordering on the nodes,
        a data set of n columns and m cases - database
Output: Bayesian network topology

1   π_i denotes the set of parents of node x_i
2   for i ← 1 to n do
3     π_i ← ∅
4   end
5   for i ← 1 to n do
6     dataSetIteration ← generateDataSet(database)  # database built according to the lags between nodes - see Algorithm 7
7     P_old ← f(i, π_i)  # f is computed using Equation 2.19, based on the data set of the iteration
8     pred_xi ← Pred(x_i)  # set of nodes that precede x_i in the ordering
9     while True do
10      select the node x_j ∈ pred_xi \ π_i that maximizes f(i, π_i ∪ {x_j})
11      P_new ← f(i, π_i ∪ {x_j})
12      sigma ← P_new > P_old
13      if sigma = True then
14        P_old ← P_new
15        π_i ← π_i ∪ {x_j}
16      end
17      if not (pred_xi = ∅) then
18        pred_xi ← pred_xi \ {x_j}
19      end
20      if (not sigma or pred_xi = ∅) then
21        break
22      end
23    end
24  end
25  return the graph reconstructed by Algorithm 9
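Combining the previous sketches, the K2-Modified loop itself could look like the following Python sketch. It assumes that the virtual ancestors appear in the ordering under the same names given to the shifted columns, and that generate_dataset_for_node and mdl_score are the illustrative functions sketched earlier.

def k2_modified(nodes, ordering, database, tree_map):
    # Greedy search of each node's parent set over the lag-shifted data.
    parents = {x: set() for x in nodes}
    for node in nodes:
        # data set of the iteration, with rows lost to the shifts dropped
        data = generate_dataset_for_node(database, node, tree_map).dropna()
        p_old = mdl_score(data, node, parents[node])
        pred = set(ordering[:ordering.index(node)])   # candidate ancestors
        while pred:
            # ancestor whose addition maximizes the quality criterion
            best = max(pred, key=lambda x: mdl_score(data, node, parents[node] | {x}))
            p_new = mdl_score(data, node, parents[node] | {best})
            if p_new <= p_old:            # no improvement: parents become permanent
                break
            p_old = p_new                 # a more probable structure: keep the ancestor
            parents[node].add(best)
            pred.discard(best)
    return parents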

The removal of cycles helps to construct a more concise graph, since information is more likely to flow in one direction than to follow a loop. Another important aspect is the generation of the virtual ancestors and their addition to the graph: this modification allows the K2 algorithm to identify the most probable flow of information when there are ambiguities in the information flows.


Algorithm 9: Reconstruction of original relations
Input : the graph generated by K2 - G_v; the original graph - G_r
Output: the reconstructed graph - G_n

1  Let G_v = <E_v, N_v> be the graph produced by the K2 algorithm,
2  G_r = <E_r, N_r> be the graph generated by Transfer Entropy, already processed, and
3  G_n = <E_n, N_n> be the reconstructed graph,
4  where E_v = <node1_v, node2_v>,
5  E_r = <node1_r, node2_r> and
6  E_n = <node1_n, node2_n>
7  foreach e_v ∈ E_v do
8    subGraph ← f(e_v, G_r)  # f retrieves the original path of the relation and its respective lags
9    G_n ← G_n ∪ subGraph
10 end
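Finally, a sketch of the reconstruction step, assuming virtual parents are recognizable by the suffix used for the shifted columns and that the processed TE graph is available as the same (cause, effect, lag) edge list used before:

def reconstruct_graph(k2_parents, te_edges):
    # Replace each (possibly virtual) edge chosen by K2 by its original TE path.
    final_edges = set()

    def paths_to(src, dst, acc):
        # Enumerate all TE paths src -> dst as (edge list, cumulative lag).
        for cause, effect, lag in te_edges:
            if cause != src:
                continue
            step = acc + [(cause, effect, lag)]
            if effect == dst:
                yield step, sum(l for _, _, l in step)
            else:
                yield from paths_to(effect, dst, step)

    for node, parents in k2_parents.items():
        for parent in parents:
            origin, _, suffix = parent.partition("_shift")
            target_lag = int(suffix) if suffix else None
            for path, cum in paths_to(origin, node, []):
                # keep the path whose cumulative lag matches the parent's lag
                if target_lag is None or cum == target_lag:
                    final_edges.update(path)
    return sorted(final_edges)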

Likewise, the insertion of the relationship lags adjusts the algorithm to handle the delay of the relationships and to produce a consistent result. After these steps, the reconstruction stage brings back the relationships in the form they had before the adaptation made to substitute the indirect relationships by direct ones.


Chapter 4

Case Study and Results

In order to apply and evaluate the performance of the methodology, this chapter carries out a case study on the detection of causal relationships between industrial alarm variables. The study was carried out using the Tennessee Eastman Process (TEP) simulator, a well-established benchmark for research in monitoring, control, and fault detection of industrial processes.

This chapter is organized as follows. The first section gives an overview of the Tennessee Eastman Process and its functioning. The second one details the experimental setup used in the simulation performed for this case study, as well as the settings used for the application of the methodology. The third section presents the activity flow of the experiment. The last section presents and discusses the obtained results, going from the intermediate to the final results. Furthermore, a comparison between the results obtained with the proposed threshold and with the blind threshold is made.

4.1 Tennessee Eastman Process

The Tennessee Eastman Process (TEP) is a plant-wide industrial process based on an actual chemical process of the Eastman Chemical Company. It was proposed by Downs & Vogel (1993) and is composed of eight components: A, B, C, D, E, F, G and H, which, through exothermic reactions, produce two products and two byproducts. Equation 4.1 shows the reactions involving these components and the obtained products.

A(g) + C(g) + D(g) → G(liq),   Product 1
A(g) + C(g) + E(g) → H(liq),   Product 2
A(g) + E(g) → F(liq),          Byproduct
3D(g) → 2F(liq),               Byproduct          (4.1)

In addition, the process is composed of five major unit operations: the reactor, the product condenser, the vapor-liquid separator, the recycle compressor and the product stripper. It has a set of 12 manipulated variables and 41 measured variables, listed in Tables 3-5 of Downs & Vogel (1993). Some of the mentioned measured variables have specific operational constraints that should be respected by the control system. They are defined in
