Blood Pressure Pattern Identification in AKI Patients using
Linguistic Summaries
Ana Luísa Martins
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisors:
Prof. Susana Margarida da Silva Vieira
Eng. Cátia Matos Salgado
Examination Committee
Chairperson: Prof. Paulo Jorge Coelho Ramalho Oliveira
Supervisor: Prof. Susana Margarida da Silva Vieira
Member of the Committee: Prof. João Paulo Baptista de Carvalho
Acknowledgments
Resumo
O desenvolvimento progressivo e constante que tem sido testemunhado nos ´ultimos anos conduziu a di-ficuldades na an ´alise e processamento de dados. Nos ´ultimos anos, a produc¸ ˜ao e acesso a informac¸ ˜ao t ˆem-se tornado gradualmente mais acess´ıveis. Dif´ıcil torna-se, no entanto, a filtragem desta mesma informac¸ ˜ao de modo a que conclus ˜oes relevantes possam ser da´ı retiradas.
Os sum ´arios lingu´ısticos, um m ´etodo utilizado para extrair padr ˜oes e sintetizar grandes volumes de informac¸ ˜ao em frases simples, s ˜ao uma poss´ıvel e interessante soluc¸ ˜ao para lidar com este desafio.
Assim, esta dissertac¸ ˜ao prop ˜oe-se estudar a evoluc¸ ˜ao de duas vari ´aveis variantes no tempo, a Press ˜ao Arterial M ´edia (MAP) e a Taxa de Sa´ıda de Urina (UO), de pacientes que permaneceram mais de 24 horas numa unidade de cuidados intensivos utilizando a sumarizac¸ ˜ao lingu´ıstica. Para uma melhor e mais compreensiva an ´alise, os pacientes foram divididos em dois grupos: aqueles que tiveram uma queda na Taxa de Sa´ıda de Urina abaixo de um determinado valor e aqueles que a tiveram acima desse mesmo valor.
Diferentes m ´etodos de sumarizac¸ ˜ao foram utilizados para lidar com o problema: sum ´arios lingu´ısticos categ ´oricos e multivari ´aveis, sum ´arios lingu´ısticos obtidos a partir de m ´etodos de Clustering dos dados, sum ´arios lingu´ısticos a partir de vari ´aveis demogr ´aficas e, por fim, sum ´arios lingu´ısticos temporais.
Palavras-chave:
Sum ´arios Lingu´ısticos, Press ˜ao Arterial M ´edia, Taxa de Sa´ıda de Urina, Cuidados de Sa ´ude, An ´alise de SimilaridadeAbstract
The high development of Information Systems that has been witnessed in the past few years has led to a new problem of data analysis and processing. In the past few years, the access and production of information have considerably eased. Hardier, though, is it to filter such information so that relevant conclusions can be withdrawn.
Linguistic summaries, a data mining and knowledge discovery approach to extract patterns and sum up large volume of data into simple sentences, are one interesting solution to overcome this challenge.
This dissertation aims to study the evolution of two time-varying variables, Mean Arterial Blood Pres-sure (MAP) and Urine Output (UO), from patients who stayed more than 24 hours at an intensive care unite (ICU) resorting to linguistic summarization. For a better and comparative understanding of their evolutions, patients were initially split into two different groups: those who had a critical drop in UO before a certain MAP level and those who had it above the same value.
In order to treat the problem different summarization methods were applied: categorical and multivari-able linguistic summaries, linguistic summaries obtained from clusters of datasets, linguistic summaries concerning demographic variables and temporal linguistic summaries.
Keywords:
Linguistic Summary, Mean Arterial Blood Pressure, Urine Output, Health Care, Similarity AnalysisContents
Acknowledgments . . . iii
Resumo . . . v
Abstract . . . vii
List of Tables . . . xi
List of Figures . . . xiii
1 Introduction 1 1.1 Clinical Data Analysis . . . 1
1.2 The Usefulness of Linguistic Summarization . . . 1
1.3 Blood Pressure Pattern Identification . . . 2
1.4 Objective . . . 2
1.5 Contributions . . . 2
1.6 Outline . . . 3
2 Linguistic Summaries 4 2.1 Linguistic Summaries of Data . . . 5
2.2 Time Series Segmentation and Linguistic Summarization . . . 7
2.2.1 Features of Segments . . . 8
2.3 Protoforms of Linguistic Summaries of Time Series . . . 10
2.4 Quality Measures for Linguistic Summaries . . . 11
2.4.1 Truth value . . . 11
2.4.2 Measure of specificity . . . 12
2.4.3 Degree of covering (support) . . . 13
2.4.4 Degree of focus . . . 14
2.4.5 Degree of appropriateness . . . 14
2.4.6 Degree of informativeness . . . 15
2.5 Similarity Analysis between Sets of Linguistic Summaries . . . 16
2.6 Evaluation of the similarity of two time series based on a set of their most representative linguistic summaries . . . 18
3 Clustering Techniques 20 3.1 K-Means Clustering . . . 20
3.2 Fuzzy C-Means Clustering . . . 22
3.3 Single-linkage Clustering . . . 22
3.4 NERF C-Means . . . 23
3.5 Correlation Cluster Validity . . . 24
4 Target Blood Pressure Pattern Identification using Linguistic Summaries 26 4.1 Data Set Characterization and Treatment . . . 26
4.1.1 Data Processing . . . 27
4.1.2 Data Characterization . . . 29
4.2 Problem Formulation . . . 39
4.2.1 Categorical linguistic summaries . . . 40
4.2.2 Temporal linguistic summaries . . . 42
4.2.3 Clustering of linguistic summaries for greater conciseness . . . 45
5 Results 47 5.1 Categorical linguistic summaries . . . 48
5.1.1 Categorical linguistic summaries for different demographic groups . . . 51
5.1.2 Categorical linguistic summaries of clustered data . . . 52
5.1.3 Multivariable categorical linguistic summaries . . . 53
5.2 Temporal Linguistic Summaries . . . 53
5.3 Clustering of linguistic summaries for greater conciseness . . . 55
6 Conclusions and Future Work 58 Bibliography 60 A Linguistic Summaries 65 A.1 Categorical linguistic summaries . . . 65
A.2 Temporal linguistic summaries . . . 78
List of Tables
3.1 K-Means Algorithm . . . 21
3.2 C-Means Algorithm . . . 22
3.3 Single-linkage Clustering Algorithm . . . 23
4.1 Patients demographic in Group 1 . . . 29
4.2 Patients demographic in Group 2 . . . 29
4.3 Demographics from Hypertensive Patients of Group 1 . . . 33
4.4 Demographics for non-hypertensive patients of Group 1 . . . 34
4.5 Demographics from Hypertensive Patients of Group 2 . . . 36
4.6 Demographics from Non-Hypertensive Patients of Group 2 . . . 37
5.1 Similarity between the two groups according their temporal summary . . . 55
5.2 Validation indices for different clustering methods . . . 56
5.3 Linguistic medoid for SL and NERF with 3 clusters . . . 56
5.4 Linguistic medoid for SL and NERF with 2 clusters . . . 57
A.1 Categorical linguistic summaries for MAP of patients from group 1 . . . 66
A.2 Categorical linguistic summaries for MAP of patients from group 2 . . . 66
A.3 Categorical linguistic summaries for UO of patients from group 1 . . . 67
A.4 Categorical linguistic summaries for UO of patients of group 2 . . . 67
A.5 Categorical linguistic summaries for general trends of MAP from group 1 . . . 68
A.6 Categorical linguistic summaries for general trends of MAP from group 2 . . . 68
A.7 Categorical linguistic summaries for MAP of hypertensive patients from group 1 . . . 69
A.8 Categorical linguistic summaries for MAP of non-hypertensive patients from group 1 . . . 69
A.9 Categorical linguistic summaries for UO of hypertensive patients from group1 . . . 70
A.10 Categorical linguistic summaries for UO of non-hypertensive patients from group1 . . . . 70
A.11 Categorical linguistic summaries for MAP of hypertensive patients from group 2 . . . 71
A.12 Categorical linguistic summaries for MAP of non-hypertensive patients from group 2 . . . 71
A.13 Categorical linguistic summaries for UO of hypertensive patients from group 2 . . . 72
A.14 Categorical linguistic summaries for UO of non-hypertensive patients from group 2 . . . . 72
A.15 Categorical linguistic summaries for MAP from patients of cluster 1 from group 1 . . . 73
A.17 Categorical summaries for MAP from patients of cluster 1 from group 2 . . . 75
A.18 Categorical summaries for MAP from patients of cluster 1 from group 2 . . . 76
A.19 Multivariable categorical linguistic summaries for MAP from patients of group 1 . . . 77
A.20 Multivariable categorical linguistic summaries for MAP from patients of group 2 . . . 77
A.21 Temporal linguistic summaries for MAP trends from group 1 . . . 79
A.22 Temporal linguistic summaries for MAP trends from group 2 . . . 79
A.23 Temporal linguistic summaries for UO trends from group 1 . . . 80
List of Figures
2.1 Example of a membership function for a Summarizer . . . 6
2.2 Representation of angle granules defining the dynamics of change [16] . . . 9
2.3 Possible membership function for ”long trend” [16] . . . 9
2.4 A trapezoidal membership function of a set [31] . . . 13
3.1 NERF Algorithm [19] . . . 24
4.1 Example of a patient time series . . . 28
4.2 MAP for all patients . . . 28
4.3 UO for all Patients . . . 28
4.4 MAP behaviour from patients of group 1 . . . 30
4.5 Urine Output behaviour from patients of group 1 . . . 31
4.6 MAP behaviour from patients of group 2 . . . 31
4.7 Urine Output behaviour from patients of group 2 . . . 32
4.8 MAP from Hypertensive Patients of Group 1 . . . 33
4.9 UO from Hypertensive Patients of Group 1 . . . 34
4.10 MAP from Non-Hypertensive Patients of Group 1 . . . 35
4.11 UO from Non-Hypertensive Patients of Group 1 . . . 35
4.12 MAP from Hypertensive Patients of Group 2 . . . 36
4.13 Urine Output from Hypertensive Patients of Group 2 . . . 37
4.14 MAP from Non-Hypertensive Patients of Group 2 . . . 38
4.15 Urine Output from Non-Hypertensive Patients of Group 2 . . . 38
4.16 Representation of the memberships functions for each attribute . . . 40
4.17 Representation of the granulation for each linguistic value for duration . . . 43
4.18 Representation of the granulation for each linguistic value for the temporal expression . . 44
4.19 Flowchart explaining temporal linguistic summaries generation process . . . 45
A.1 FCM clustering of patients of group 1, c=2 . . . 81
Chapter 1
Introduction
1.1
Clinical Data Analysis
The high development of Information Systems that has been witnessed in the past few years has led to a new problem of data analysis and processing. In the past few years, the access and production of information has considerably eased. Hardier, though, is it to filter such information so that relevant conclusions can be withdrawn.
This problem reaches dramatic levels when considering medical databases. For each patient, there is a big number of variables that may be described through time. Even though the amount of information is not necessarily higher than for other systems, the inherent complexity of each variable combined with huge databases with several individuals, makes its concise analysis almost a herculean task.
In hospitals, huge amounts of data are recorded concerning the diagnosis and treatments of patients in what is typically called event-logs [1]. Event logs show occurrences of events at specific moments in time, where each event refers to a specific process and case. Usually event logs are large text files with some standardized format, sorted according to the time they were stored [2]. This information, however, may be of difficult read and interpretation and, sometimes, not easy to translate into a visual environment. Thus, a need for a methodology that suitably represents this data in a understandable and easy way prevails.
1.2
The Usefulness of Linguistic Summarization
Linguistic summarization is a data mining and knowledge discovery approach to extract patterns and sum up large volume of data into simple sentences [3]. A linguistic summary of data is then a concise description of this same data in terms of natural human language. Its main aim is to capture the essence of the behaviour of a large data set and translate it into small and human comprehensible statements.
Distinguished and several times complementary to data visualization, which is used to increase the understanding of data by emphasizing or discovering visual cues, linguistic summarization aims to show the most characteristic and relevant aspects of data by natural language and help, this way, humans
dis-covering relations in data that would have, otherwise, remained unknown or difficult to notice. However and despite its shown advantages in translating complex data events into simple sentences, linguistic summarization is still sparsely present in field literature [2].
1.3
Blood Pressure Pattern Identification
At the intensive care unit (ICU), established protocols recommend that patients with sepsis should be treated with vasopressors or fluids as soon as their mean arterial blood pressure (MAP) drops below 65 mmHg. However, older patients with atherosclerosis or previous hypertension, for example, may benefit from being treated at a higher MAP than younger patients without any cardiovascular conditions. There are currently no guidelines indicating a target MAP for groups of patients with distinct characteristics.
A critical urine output (UO) entails an impending organ failure, advocating the initiation of treatment at a specific point in time - being it below or above the established protocol.
From a clinical perspective, it is very important that treatment starts before the onset of a critical UO, to avoid severe kidney complications. Previous studies have tried to establish a relationship between MAP and UO [4]. However, the relation is very complex and not possible to describe through linear systems.
1.4
Objective
In order to understand the evolution of MAP and UO two groups are considered: group 1, the ones who had a critical event (UO dropping below 30 ml/h ) before the established protocol (MAP below 65 mmHg) or group 2, the ones after (MAP above 65 mmHg).
The idea is to understand the main differences between these two groups through the study of the linguistic summaries that characterize their time series, finding this way what may indicate a future occurrence of a critical event.
This knowledge is of most importance in a clinical point of view, once that better protocols can be defined that will allow the administration of vasopressors on patients who really need this treatment before the onset of a critical UO, avoiding serious kidney complications.
1.5
Contributions
The study of what causes a critical event has been an object of careful attention for a long time. If predicting a critical event is already difficult, understanding its causes is an even tougher challenge. This thesis contributed to explore a new method, linguistic summaries, that are used to find patterns and describe them.
Concerning the method itself different types of summaries, that lead to different types of analysis and conclusions, were created.
The main contribution are related to the opening of new paths that can partner up with traditional methods in discovering hidden patterns.
1.6
Outline
This dissertation is split into six parts. The present one makes an overview of the subject, clarifying the aim of this study. At the second chapter, the topic of Linguistic Summaries will be introduced as well as the main methodologies developed to generate and study them, according to expertise literature. The third chapter will briefly introduce the Cluster Techniques used as support to the Linguistic Summaries. This chapter will not go further into this subject because it is not the focus of this dissertation. The explanation of the data and its processing, as well as the problem formulation and procedure for its resolution through the generation of different types of linguistic summaries (here called protoforms) will be the emphasis of chapter four. Finally, chapter five will show the results obtained for the different methods presented in four and chapter six the final conclusions.
Chapter 2
Linguistic Summaries
A linguistic summary of data is a concise description of this same data in terms of natural human lan-guage. Its main aim is to capture the essence of the behavior of a large data set and translate it into small and human comprehensible statements. The concept of linguistic summaries has been developed by Yager [5] and further developed by Kacprzyk and Yager [6] and Kacprzyk, Yager and Zadrozny [7].
In this chapter, the concept of Linguistic Summaries will initially be explored and then its extension to the summarization of time series as proposed by Kacprzyk, Wilbik and Zadrozny [8]. For such pur-pose, the concept of protoforms as proposed by Zadeh [9] will be described for its relevance in dealing with linguistic summaries, as shown by Kacprzyk and Zadrozny [10]. Subsequently, some techniques developed for the computation of the truth of a linguistic value will be shown.
The recent advances in Information Technology and the great amount of data that arrived with it leads to the major issue of how to analyse it with success. As the apprehension of information is essential to the process of decision making, the summarization of big data that is able to catch its core is becoming vital for all sectors of activity. Since natural language is for humanity the most important and developed means of communication, the use of data summarization through a small number of sentences would be both useful and human consistent.
According to [3], most techniques found in literature follow two different ways to develop linguistic summaries: one using natural language generation and the other fuzzy logic tools. A more insightful study about these two major procedures can be found in [11]. This research follows the second one. Most linguistic summaries are generated with protoforms such as “Most students are young” as it will be further explained in this chapter, however it has been recently proposed to perform linguistic summaries using If–Then rules such as “IF X is large and Y is medium, THEN Z is small” [12]. These If–Then rules provide a linguistic description of the database and can also be used for prediction.
Other protoforms were introduced, based on fuzzy dependencies, like “most stocks in the same kind of sector have performed the same way this year” in [13], [14], or based on gradual rules like “the higher the benefit, the higher the dividend” in [14], [15]. As reported in [11], Systems based on these protoforms have been implemented, such as FQuery, Quantirius and Summary SQL.
are also many studies that deal with Health Care ([18], [19]) and others where linguistic summaries are used for analyses of Web Server Logs [20]. A comparison between the similarities of a set of linguistic summaries in different time periods for different investment funds are studied for instance in [16]. Comparing time series based on the result of user defined queries over a data cube with time dimension was performed in [21]. The similarity between time series is then described using local changes. In [22] is performed a linguistic summarization of a time series based on a linear approximation of its trends, here identified with straight linear segments, and then its linguistic characterization based on the slope of its segments, the accuracy of the approximation and the length of the trend. In [18] a similar idea is used for a linguistic summarization procedure for describing long-term trends of change in human behavior, while [19] provides activity summaries for eldercare based on the analysis of time series sensor data collected at an eldercare facility. The continuous monitoring of eldercare received further attention in [23] and [24] with different approaches to compute distances between linguistic summaries to define the presence of abnormal conditions.
2.1
Linguistic Summaries of Data
As stated above, a linguistic summary is aimed to be a natural language representation (sentences for instance) of a data set that summarizes it to its essence. Here the notation as proposed by Yager [5] is followed:
1. Y = {y1, ..., yn} is a set of objects in a database, for instance a set of employees.
2. A = {A1, ..., An} is a set of attributes characterizing yi from Y , e.g., salary and Aj(yi)is a value attribute Ajof yi.
A linguistic summary of a data set consists of:
1. A summarizer P , i.e., an attribute with linguistic value (fuzzy predicate) defined on the domain of Aj(e.g. high for the attribute salary );
2. A quantity of agreement Q, i.e., a linguistic quantifier (e.g. most);
3. The Truth (validity), T , of the summary, i.e. a number from the interval [0,1], assessing the truth (validity) of the summary as, e.g., ”T (most of employees have high salaries) = 0.7”;
4. A qualifier R (optional), i.e., another attribute with linguistic value (also fuzzy predicate) defined on the domain of Akdetermining a fuzzy subset of Y (e.g. young for the attribute age).
As explained in [10], here are not considered other forms of linguistic summaries as exemplified by ”70% of employees have salaries over 3000 C” that may be seen as a similar provider of information as ”most employees have high salaries”, because the first case is clearly outside the class of linguistic summaries here considered.
For a further understanding of these essential elements of the linguistic summary, follows a brief com-mentary on the summarizer, P, and on the quantity of agreement, Q, (what will be said about the sum-marizer is also valid for the qualifier, R, so no special highlight will be made about it).
The Summarizer
Knowing that natural language is the only human, consistent and comprehensible mean of commu-nication, the summarizer can only be a linguistic expression semantically represented by a fuzzy set. For example, a summarizer like ”young” is represented by a fuzzy set in the universe of dis-course, that is, containing possible values of human age, {1, . . . , 100}. This way, the summarizer ”young” can be given, for instance, by a fuzzy set with a non-increasing membership function such that until the age of 35 the certainty of being ”young” is absolute, thus the grade of membership is equal to 1, and after 65 the certainty of not being ”young” is also absolute, and the grade of membership is equal to 0. Between the ages of 35 and 65 years the grades of membership are between 1 and 0, the higher the age the lower its corresponding grade of membership. A simple example can be seen in the following figure. It’s relevant to say that the meaning of the summarizer
Figure 2.1: Example of a membership function for a Summarizer
(therefore its corresponding fuzzy set) is subjective and may be predefined or chosen by the user.
The Quantity of Agreement
The quantity of agreement, Q, shows the extent to which the data satisfies the summary. A precise indication is not human consistent, and a linguistic term must be represented by a fuzzy set. According to [10], two types of a linguistic quantity in agreement can be used:
– absolute, for instance, ”about 10”, ”more or less 100”, and
– relative, e.g., ”a few”, ”about a half”, ”most”, ”many” and ”almost all”.
The above linguistic expressions are this way the fuzzy linguistic quantifiers that can be handled by fuzzy logic and being represented the sense of Zadeh [25], i.e. regular, non-decreasing and monotone:
(a) µQ(0) = 0, (b) µQ(1) = 1, and
It can be exemplified as in [10] with most given by: µQ= 1 for x ≥ 0.8 2x − 0.6 for 0.3 < x < 0.8 0 for x ≤ 0.3 (2.1)
Thus, the core of a linguistic summary is a linguistically quantified proposition in the sense of Zadeh which can be written in its simple form as:
Qy's areP (2.2)
e.g.: ”Most employees have high salaries.” And in its extended form as:
QRy's areP (2.3)
e.g.: ”Most young employees have high salaries.”
The truth value (T ) of them can be calculated by using either original Zadeh’s calculus of linguistically quantified propositions (cf. [25]), or other interpretations of linguistic quantifiers, i.e., respectively:
T (Qy's are P ) = µQ 1 n n X i=1 µP(yi) ! (2.4) T (QRy's are P ) = µQ n P i=1 µR(yi) ∩ µP(yi) n P i=1 µR(yi) (2.5)
where ∩ is the minimum operation (more generally it also can be another appropriate operator, for instance a t-norm).
It is interesting to notice that the use of P and R can also be extended to create more sophisticated linguistic summaries through the union of several attributes, for instance ”young and bright employees”. To combine more than one attribute, P1, P2,..., Pn, t-norms (like the minimum or product) are used for the conjunction.
The truth value can also be computed with OWA operators ([26]) or with the Choquet integrals ([27]), however these approaches will not be here explored.
Linguistic summaries can also be evaluated with other quality criteria that will be better explained later in this chapter.
2.2
Time Series Segmentation and Linguistic Summarization
In many studies, while resorting to time series data, the aim is to look at the relation between certain data points at a certain time, instead of just considering their absolute value. The time series may be very long and take too much space, for which a more compact version is needed in which a time series
data set is split into more compact subsets (i.e. time segments or trends). This procedure is called segmentation.
Many techniques have been developed to reach an optimal segmentation([28],[29],[30],etc.). Ac-cording to Wilbik [31] One of this methods consists on the representation of the time series by intervals and then - for each interval - the value of the time series is represented as a constant. One of the most popular procedures to time series segmentation is to assume the it behaves linearly in every segment [32]. This methods are called piecewise linear.
The trends obtained from the segmentation will then be the basis for linguistic summarization as it is explained is the following subsection.
2.2.1
Features of Segments
All segments are here considered linear. The following approach to study time segments and compute linguistic summaries based on them was developed by Kacprzyk, Wilbik and Zadrozny [8] and is largely followed. It considers three aspects of the time series: (1) dynamic of change, (2) duration, (3) and variability.
1. Dynamic of Change
The dynamic of change refers to the speed of changes that can be described by the slope of a line representing a trend. Thus, to quantify dynamics of change it is used the interval of possible angles α ∈ h−90, 90i.
However, a direct use of an angle to describe the dynamic of change of the trends is not human consistent and goes outside the scope of linguistic summarization. For this reason, fuzzy granu-lation is commonly adopted, meeting the users’ needs and being also defined by them, who may build the scale of linguistic terms corresponding to various inclinations of a trend line as:
• quickly decreasing, • decreasing, • slowly decreasing, • constant, • slowly increasing, • increasing, • quickly increasing
Figure 2.2: Representation of angle granules defining the dynamics of change [16] 2. Duration
The duration describes linguistically the length of a single trend that is related to a fuzzy set defined over a time span of time series.
An example can be the linguistic value ”long trend” defined by a fuzzy set with a membership function that could be similar to the one in Figure 2.2.
Figure 2.3: Possible membership function for ”long trend” [16]
The linguistic terms describing duration are also user defined, concerning their vision and aim. 3. Variability
Variability refers to how the data is spread out and varies through time. Can be described by five classical statistical measures:
• The range (maximum – minimum). Despite being the easiest variable of variability, the range is not frequently used as it only concerns two data points: the extremes. Thus it is highly vulnerable to outliers and may lead to inadequate values of variability;
• The interquartile range (IQR) calculated as the third quartile (the third quartile is the 75th per-centile) minus the first quartile (the first quartile is the 25th perper-centile) that may be interpreted
as representing the middle 50% of the data. It is both resistant to outliers and computationally easy;
• The variance, computed by P
i(xi−¯x)2
n , where ¯xis the mean value; • The standard deviation, computed by
qP
i(xi−¯x)2
n , i.e., the the square root of the variance. Both the variance and the standard deviation are affected by extreme values;
• The mean absolute deviation (MAD), computed as P
i|xi−¯x|
n . It is computationally harder even though it has the naturally intuitive meaning of ”mean deviation from the mean”. For this reason it is not frequently found in statistics.
Once more, the measure of variability is a linguistic variable expressed using linguistic terms mod-elled by user defined fuzzy sets.
2.3
Protoforms of Linguistic Summaries of Time Series
The concept of the protoform as thought by Zadeh’s [9] is useful for dealing with linguistic summaries. Such idea is supported by Kacprzyk and Zadrozny [10] and is owed to the fact that a protoform is a template of a linguistically quantified proposition. This way the summaries above cited can take the following representations as shown Wilbik [31].
• Short protoform (in certain cases can be equivalent to the simple protoform presented in section 2.1).
Among all trends, Q are P . (2.6) e.g.: ”Among all trends, most are quickly decreasing.
• Extended protoform (in certain cases can also be equivalent to the extended protoform presented in section 2.1).
Among all R trends, Q are P . (2.7) e.g.: ”Among all long trends, most are quickly decreasing.
Additionally a temporal expression, such as ”in the beginning” or ”recently”, can be added. This leads to temporal protoforms as the following:
• Short temporal protoform
ET among all trends, Q are P . (2.8) e.g.: ”Initially among all trends, most are quickly decreasing.
• Extended temporal protoform
e.g.: ”Initially among all long trends, most are quickly decreasing.
2.4
Quality Measures for Linguistic Summaries
In order to evaluate the quality of the generated linguistic summaries, quality measures have been used. Except for the truth value, the quality measures here introduced have been proposed in Yager Ford and Canas [33]. Another quality criteria, the degree of focus, found in [34] and widely used will be also here introduced. This way, the linguistic summaries are commonly evaluated through the following criteria:
• truth value (validity), • degree of imprecision, • degree of specificity, • degree of fuzziness, • degree of covering, • degree of focus, • degree of appropriateness, • measure of informativeness, • length of the summary,
In the following sub-section, with the exception of the degree of imprecision, the degree of specificity, the degree of fuzziness and the length of the summary (that are concerned with the form of the summary and thus are out of the scope of this study), all these criteria will be explained in detail. The truth value already explained in sub-section 2.1, will here be expanded for temporal linguistic summaries. Additionally, the measure of specificity will be introduced for being necessary to compute the measure of informativeness.
2.4.1
Truth value
The truth value is the most important quality measure for the linguistic summaries. It was already intro-duced in sub-section 2.1 for simple protoforms and extended protorforms. Here it will be extended to temporal protoforms.
The computation of a temporal linguistic summary is very similar to the previous ones, it will have a tem-poral expression, ET, as an additional qualifier (R), as the temporal expression, similar to the qualifier, limits the universe of interest to those segments that only occur on the time axis described by a fuzzy set modelling the expression ET. That is the proportion of trends that are P and happen in ET to those
that occur in ET. Then the quantity of agreement of this proportion is computed. Thus the truth value of a temporal protoform is computed by:
T (ET among all y's, Q are P ) = µQ n P i=1 µET(yi) ∩ µP(yi) n P i=1 µET(yi) (2.10)
where µET is the degree to which a trend occurs during the time span described by ET. Similarly we compute the truth of an extended temporal protoform as:
T (ET among all Ry's, Q are P ) = µQ n P i=1 µET(yi) ∩ µP(yi) ∩ µR(yi) n P i=1 µET(yi) ∩ µR(yi) (2.11)
2.4.2
Measure of specificity
The concept of specificity provides a measure of the amount of information contained in a fuzzy subset or possible distribution. A specificity measure evaluates the degree to which a fuzzy subset points to one and only one element as its member. It is closely related to the inverse of the cardinality of a fuzzy set.
As proposed by Yager [35], the measure of specificity measures the degree to which a fuzzy subset contains one and only one element. The measure of specificity is a measure Sp : IX → I, I ∈ [0, 1] which has the following properties:
• Sp(A) = 1, if and only if A = x, (is a singleton), • Sp() = 0
• ∂Sp(A)
∂a1 > 0and
∂Sp(A
∂aj <for all j ≥ 2. The measure of specificity is then computed by:
Sp(A) = Z αmax
0
1
card(Aα)dα (2.12)
where αmax is the largest membership grade of A, and Aα the α-level set of A, i.e. A = x : A(X) ≥ α and card(Aα)the number of elements in Aα.
Let X be a continuous space, then Spcan be computed as:
Sp(A) = Z αmax
0
F (µ(Aα)) dα (2.13) where F is a function F : [0, 1] → [0, 1], such that F (0) = 1, F (1) = 0 and F (x) ≤ F (y) ≤ 0 for x ≥ y. If F is defined as F (z) = 1 − z, the measure µ of an interval [a, b] is defined as µ([a, b]) = b − a, and the space is normalized at [0, 1], then the degree of specificity of the fuzzy set A is computed as
If the fuzzy set A has a trapezoidal membership function, as e.g., shown in the following figure
Figure 2.4: A trapezoidal membership function of a set [31]
then the measure of specificity can be computed as:
Sp(A) = 1 −
d − a + c − b
2 (2.15)
2.4.3
Degree of covering (support)
The degree of covering, introduced in [6], expresses how many objects in the data set are covered by the summarizer condition. It represents the proportion of elements show the condition of P and R (when an extended summary) to the total number of elements in the set.
Thus, when the degree of covering is low, the corresponding summary describes a pattern that rarely occurs. On the other hand, a high degree of covering represents many objects, its more general. De-pending on the cases, a higher support is more desirable and can serve as a threshold when generating summaries.
This measure is similar to the measure of support of the association rules and the alternative version of the degree of covering for the simple and extended protoform summaries (cf. Kacprzyk and Wilbik [36]) is, respectively: dc = 1 n n X i=1 µP(yi) (2.16)
dc(Among all Ry’s, Q are P) = 1 n n X i=1 (µR(yi) ∩ µP(yi)) (2.17)
∩ is the minimum but can also be computed with a different t-norm.
The degree of covering for simple and extended temporal protoforms is respectively:
dc(ET among all y's, Q are P ) = 1 n n X i=1 µP(yi) ∩ µET(yi) (2.18)
dc(ET among all Ry's, Q are P ) = 1 n n X i=1 (µR(yi) ∩ µP(yi) ∩ µET(yi)) (2.19)
2.4.4
Degree of focus
Another quality criterion is the degree of focus introduced in [34]. Similar to the degree of covering, the purpose of the degree of focus is to limit the search for the best linguistic summary taking into consid-eration other information behind the truth value. The use of an extended protoform helps limiting this search by reducing it to a subspace of objects that full-fill the R condition. The degree of focus gives the proposition of objects satisfying the property R to all objects. This way, it helps to control the process of discarding non-promising linguistic summaries.
df oc(QRy's are P )) = 1 n n X i=1 µR(yi) (2.20) The degree of focus exists only for summaries of extended protoforms.
If the degree of focus is high,then it is certain that such summary concerns many objects,so that it is more general.On the other hand,when the degree of focus is low, this means the linguistic summary describes a (local) pattern that occurs seldom.
This way, the degree of focus can be used to eliminate groups of extended linguistic summaries for which qualifier R limits the set of possible trends to, for instance, 5%. Such summaries, although possibly true (with high truth value), will not be representative.
The formula for the degree of focus for external temporal protoforms treats the temporal expression as an external qualifier. The resultant formula is:
df oc(ET among all Ry's, Q are P ) = 1 n Pn i=1µET ∩ µR(yi) Pn i=1µET (2.21) This degree represents the proportion of segments satisfying property R in ET time span to all other trends happening in that same time period.
An additional measure to the degree of focus for external temporal protoforms is the degree of focus of temporal expression, dET. This measure gives the proportion of trends from a given tends that occur in the time span, described by ET, to all trends at the time series. This measure is also necessary to compute the measure of informativeness (later explained in this sub-section) and is computed by:
dET(ET among all Ry's, Q are P ) = 1 n n X i=1 µET(yi) (2.22)
2.4.5
Degree of appropriateness
The degree of appropriateness introduced in [6] indicates if, and to which degree, a linguistic summary is surprising to the user. It is believed to be one of the most important measures of quality. In [31], this
measure is computed by: da= | m Y i=1 Pn j=1µAi(yj) n − Pn j=1minAi(µAi(yj)) n | (2.23)
where Aiis a predicate belonging to summarizer P or additionally in the case of extended summaries, to qualifier R. For the temporal protoform summaries, the equation (2.23) is modified to:
da = | m Y i=1 Pn j=1µAi(yj) ∩ µET n − Pn j=1minAi(µAi(yj) ∩ µET) n | (2.24)
ET is here treated as an external qualifier.
This measure can be interpreted as the following, as proposed by Wilbik [31]: Assuming there are two independent properties A and B and among all n trends 50% have property A and 50% have property B, it is then expected that 25% share both properties. This value is computed as the first expression in the difference. The exact number of object having those two properties is computed by the second expression in the above formula. If the real proportion of trends having both those properties is much higher or lower than expected, the summary is interesting.
The maximum value of this measure depends on the number of properties in the summary, and it can be computed as m−1q1
m, where m is the number of properties in the summary. In order to compare those values for summaries with a different number of predicates we should normalize by dividing the obtained value of degree of appropriateness, da by the maximum value of this measure for a given number of predicates in the summary.
Note that this measure equals 0 for the simple protoform summary with one predicate only.
2.4.6
Degree of informativeness
The degree of informativeness proposed by [33] measures to what extent a particular linguistic summary is informative. It is computed for simple protoforms by:
I(Qy's are P ) = (T · Sp(Q) · Sp(P )) ∪ ((1 − T ) · Sp( ¯Q) · Sp( ¯P )) (2.25)
where ¯P is the negation of P and ¯Qthe negation of Q. Sp(Q) is the measure of specificity of Q as computed in section 2.4.2, and the same for P , ¯Pand ¯Q.
For the extended protoform, the measure is computed by:
I(QRy's are P ) = (T · Sp(Q) · Sp(P ) · Sp(R) · df oc) ∪ ((1 − T ) · Sp( ¯Q) · Sp( ¯P ) · Sp(R) · df oc) (2.26)
And for the temporal protoform and extended temporal protoform is computed by respectively: I(ET among all y's, Q are P ) = (T · Sp(Q) · Sp(P ) · Sp(ET) · dET)
∪(1 − T ) · Sp( ¯Q) · Sp( ¯P ) · Sp(ET) · dET)
I(ET among all Ry's, Q are P ) = (T · Sp(Q) · Sp(P ) · Sp(R) · df oc· Sp(ET) · dET) ∪((1 − T ) · Sp( ¯Q) · Sp( ¯P ) · Sp(R) · df oc· Sp(ET) · dET)
(2.28)
2.5
Similarity Analysis between Sets of Linguistic Summaries
The relevance of a Linguistic Summary is not only related to it’s absolute value (given by its Truth Value and meaning) but also by how its related to other Linguistic Summaries from both its set and others. That is to say, it is only possible to evaluate the peculiar result from a LS by comparing its distance to other summaries. This takes a major significance if one wants to find differences between clusters or groups of Data.
As proposed by Wilbik and Keller [24], this work will follow a dissimilarity/similarity measure between summaries based on fuzzy protoforms. It was taken into account not only the linguistic meaning of the summaries, but also two quality evaluations, namely the truth values and degree of focus (in the particular case of this study, the last was not used for the reasons already pointed out).
In this section it will be described how to compute the aimed degree of similarity or distance between two linguistic summaries.
Considering two summaries Q1R1y's are P1 and Q2R2y's are P2 with the truth values T1 and T2, respectively. The qualifiers R1and/or R2may be empty in case of simple protoform summaries.
The degree of similarity is computed as the minimum of the following four elements:
1. Similarity of the summarizers P1 and P2. The formula for calculating this similarity depends if the summarizers describe the same attributes or not. Namely,
sim(P1, P2) = min a b, R µP1∩ µP2 R µP1∪ µP2 (2.29) where a is the number of joint (common) attributes for summarizers P1and P2and b is the number of distinct attributes used in summarizer P1or P2, i.e., a \b is the Jaccard measure [37] of the sets of attributes for the summarizers P1and P2. Also,
R µP1∩ µP2 R µP1∪ µP2
(2.30) is the Jaccard measure of the summarizers where P1 and P2 are the membership functions of fuzzy sets used to model the summarizers P1 and P2. Note that if the summarizers contain more than one attribute we operate on their cylindrical extensions [38].
2. Similarity of the quantifiers Q1and Q2computed using the Jaccard measure, i.e.,
sim(Q1, Q2) = min
R µQ1∩ µQ2 R µQ1∪ µQ2
(2.31)
quanti-fiers Q1and Q2.
3. Similarity of the truth values T1 and T2, of the summaries Q1R1y's are P1 and Q2R2y's are P2, respectively, computed as
sim(P1, P2) = 1 − |T1− T2| (2.32)
4. Similarity of the qualifiers R1and R2. Note that in case of simple protoform summaries R is absent, and so, R should be treated as being the fuzzy set that characterizes the whole universe, Y, i.e., the set whose membership is 1 for all y. The similarity of the qualifiers, sim(R1, R2), is defined as the minimum of two elements:
• Similarity computed using the Jaccard measure of fuzzy sets R1and R2, R µR1∩ µR2
R µR1∪ µR2
(2.33) where R1 and R2 are the membership functions of fuzzy sets used to model the qualifiers R1 and R2. Note that if the qualifiers contain more than one attribute then their cylindrical extensions are considered.
• Similarity between the values of the degree of focus for those two summaries, computed as 1 − |df oc(Q1R1y's areP1) − df oc(Q2R2y's areP2)|.
Hence,
sim(R1, R2) = min
R µR1∩ µR2 R µR1∪ µR2
, 1 − |df oc(Q1R1y's areP1) − df oc(Q2R2y's areP2)|
(2.34)
Thus, if the two protoforms are simple, sim(R1, R2) = 1, since the degrees of focus are zero and the fuzzy sets are the universe set.
Finally, the total similarity between two protoform linguistic summaries is defined as
sim(Q1R1y's areP1, Q2R2y's areP2) = min(sim(P1, P2), sim(Q1, Q2), sim(T1, T2), sim(R1, R2)) (2.35)
Alternatively we may use the notion of dissimilarity, which is defined as
2.6
Evaluation of the similarity of two time series based on a set
of their most representative linguistic summaries
Following the assumption made by Wilbik [31], it is supposed that if two time series have similar linguistic summaries then they may be similar. Therefore, the degree of similarity of two time series can be computed as the similarity between the most representative summaries of one time series, suppose A,with the most representative summaries of time series B. (Note that such procedure will always carry some inaccuracy by extrapolating the information of the ”most” representative summaries to the entire time series).
Assuming time series A is described by k representative linguistic summaries,sAj, j = 1, .., k and time series B with l representative summaries, sBj, j = 1, ..., l. Their truth values are respectively TAj and TBj. Initially the degree of similarity between each representative summary describing A and B is computed with the expression give in section 2.5. The the degree to which a representative summary from time series A is similar to some of the most valid summaries of time series B.
sim(sAi, B) = µsome Pl j=1(1 − |TAi− TBj|)sim(sAi, sBj) sim(sAi, sBj) ! (2.37)
Then the above similarity values are aggregated by a quantifier driven aggregation:
sim(A, B) = µmost 1 k k X i=1 sim(sAi, B) ! (2.38)
Similarly, the similarity between time series can be computed by the similarity between their most rep-resentative temporal linguistic summaries. However, the comparison can only be made between sum-maries reporting to the same temporal expression, ET, that is a summary of time series A that describe a given variable x at the ”beginning of the time span” can only be compared to a summary of time
series B describing the same variable x at the same ”beginning of the time span”.
This way, initially the similarity between most representative summaries of A and B with the same ET is computed. Being sA,ET the summary of time series A that occurs in ET then, for each ET the degree to each representative summary of A is similar to some of the most representative summaries of B that also occur in ET. sim(sA,ET, BET) i= µ some Pl j=1(1 − |T i A,ET − T j B,ET|)sim(s i A,ET, s j B,ET) sim(si A,ET, s j B,ET) ! (2.39)
Again, the above similarity values are aggregated by a quantifier driven aggregation:
sim(AET, BET) = µmost 1 k k X i=1 sim(siA,E T, BET) ! (2.40)
Finally, two different procedures can be done: considering the degrees of similarity for each of the given temporal expressions and analysing the time series similarity based on those values or aggregating the
similarity values for each temporal expression using again a linguistic quantifier aggregation: sim(A, B) = µmost 1 k t X p=1 sim(AET, BET) ! (2.41)
Chapter 3
Clustering Techniques
As it will be further observed in the following chapter, some complementary techniques were used both previous and for a possible linguistic summaries creation. In this chapter these methods will be better explained.
Clustering is a technique that partitions a data set into small groups of similar objects (data points), producing, this way, a concise system’s representation. It’s applied in several fields such as data mining, machine learning, pattern recognition data compression, etc.
The techniques here described were used in both data partition and linguistic summaries grouping.
3.1
K-Means Clustering
The K-Means Clustering has been applied to a variety of areas, including image processing, data com-pression, data preprocessing for system modelling.
The K-Means algorithm decomposes a collection of n vector xj, j = 1, ..., ninto groups Gi, i = 1, ..., c and finds a cluster center in each group such that a objection function of distance measure is minimized. When the chosen dissimilarity measure between a vector xk in group j and the corresponding cluster center cj is the Euclidean distance, then the cost function is defined by
J = c X i=1 X k,xk∈Gi kxk− cik2 (3.1) where Ji=Pk,xk∈Gikxk− cik
2is the cost function within group i. Thus, the value of J
idepends on the geometrical properties of Gi and the location of ci.
In general, a generic distance function d(xk, ci)can be applied for a vector xkin group i; the correspond-ing overall function is thus expressed as
J = c X i=1 X k,xk∈Gi d(xk− ci) (3.2)
For simplicity, the Euclidean distance is used as the dissimilarity measure and the overall cost function is expressed as in (3.1).
The partitioned groups typically defined by c × n binary membership function U, where the element uijis 1 if the jth data point xjbelongs to group i. and 0 otherwise. Once the cluster centrers ciare fixed, the minimizing uij for Equation (3.1) can be derived as follows
uij = 1 if kxk− cik2≤ kxk− cik2for each, k 6= i 0 otherwise (3.3)
Reminding, xj belongs to i if ci is the closest center among all. As a data point can only belong to a group, the membership function U takes the following properties:
c X i=1 uij = 1,∀j = 1, ..., n c X i=1 n X j=1 uij = n
On the other side, if uij is fixed, then the optimal center cithat minimize Equation (3.1) is the mean of all vectors in group i:
ci= 1 |Gi| X k,xk∈Gi xk, (3.4)
Where |Gi| is the size of Gi or Gi=P n j=1uij.
For a batch-mode operations, the K-means algorithm is presented with a data set xi, i = 1, ..., n; the algorithm determines the optimal cluster center ciand the membership matrix U iteratively using the following steps:
K-Means Algorithm 1 Initialize cluster centres randomly
2 Compute membership matrix U
3 Compute cost function. If it’s below tolerance value or over previous iteration: Stop
4 Centre upgrading
5 Go to step 2
Table 3.1: K-Means Algorithm
This algorithm is iterative and no guarantee can be made that it will converge to a optimum solution. The performance of K-means depends on the initial position of the cluster, this way the usage of front-end methods to find good initial cluster or running the algorithm several times, each time-width a different set of initial clusters are advisable approaches.
3.2
Fuzzy C-Means Clustering
Fuzzy C-Means is a data clustering technique which each data point belongs to a cluster to a certain degree that is specified by a membership grade. This technique was originally introduced by Jim Bezdek in 1981 as an improvement on earlier clustering methods.
The FCM algorithm aims to partition a finite collection of n elements X = x1, . . . , xninto a collection of clusters with respect to some given criterion. The major difference between this method and K-Means, specified in the preview section, is that FCM employs a fuzzy partitioning such that a data point can belong to several groups with a degree of belongness specified by membership grades between 0 and 1.
The algorithm returns then a list of centrers C = c1, . . . , ccand a partition matrix W = wi,j∈ [0, 1], i = 1, . . . , n; j = 1, . . . , c where each element wi,j represents the degree to which element xi belongs to cluster cj.
Fuzzy C-Means aims to minimize the objective function argminC Pn i=1 Pc j=1w m i,jkxi− cjk2where, wi,j= 1 Pc k=1 kxi−cjk kxj−ckk 2 m−1
This procedure is divided in five main steps, namely:
C-Means Algorithm 1 Choose a number of clusters
2 Assign coefficients randomly to each data point
3 Repeat until the algorithm has converged
4 Compute the centroid for each cluster
5 Compute the coefficient for being in the cluster for each data point Table 3.2: C-Means Algorithm
3.3
Single-linkage Clustering
Single-linkage clustering is one method of hierarchical clustering. It is based on grouping clusters in bottom-up fashion, at each step combining two clusters that contain the closest pair of elements that do not belong to the same cluster.
At the beginning, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters, until all similar elements end up being in the same cluster. At each step, the two clusters separated by the shortest distance are combined.
The distance between two clusters is determined by a single element pair, more precisely two ele-ments (one in each cluster) that are closest to one another. The shortest of these links which remains after all steps, causes the fusion of the two clusters whose elements are tangled. The method is also known as nearest neighbour clustering for this reason.
The linkage function is computed as
D(X, Y ) = min
x∈X, y∈Yd(x, y) (3.5) The algorithm is composed by the following steps:
Single-linkage Clustering Algorithm
1 Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0
2 Find the most similar pair of clusters in the current clustering (computing euclidean distances)
3
Increment the sequence number: m = m + 1.
Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r), (s)].
4 Update the similarity matrix, D,
5 If all objects are in one cluster, stop.
Otherwise, go to step 2.
Table 3.3: Single-linkage Clustering Algorithm
3.4
NERF C-Means
NERF C-Means as defined in [19] is an extension of relational FCM, which is dual to FCM when D is a Euclidean matrix. The “non-Euclidean” added to RFCM yields NERFCM, shortened to NERF. The extension is based on the b spread transformation, which is used to adjust D when distances in D are non-Euclidean. It is necessary to specify the number of clusters for each run of NERF.
Figure 3.1: NERF Algorithm [19]
3.5
Correlation Cluster Validity
The Correlation Cluster Validity (CCV) indices compare a given U to the structure in the data matrix D by first transforming U into a matrix of dissimilarities using the formula
D(U ) = 1n−
UTU maxi,j(UTU )i,j
(3.6) Pearson and Spearman’s CCV were used in the present work as will be later specified in chapter 4. Pearson’s product moment moment correlation coefficient [39] is one statistic that compares each D(U) to D, quantifying the linear relationship between its entries. It is a parametric test based on the assump-tion that the observaassump-tions are normally distributed.
νCCV P(D, D(U )) = Pn i=1 Pn j=1AijBij q Pn i=1 Pn j=1A 2 ij q Pn i=1 Pn j=1B 2 ij (3.7)
Aij = (Dij− ¯Dij (3.8)
Bij= (Dij(U ) −Dij¯(U ) (3.9) ¯
Dand ¯D(U )are matrices whose every entry is the average value of D and D(U), respectively.
The CCV index based on Spearman’s correlation coeficient [40] is a more general non-parametric index that bores the extent of monotonic relations between the two matrices. It uses the same formula as Pearson’s correlation but the comparison is made between the paired ranks of the values instead of the values themselves.
νCCV S(D, D(U )) = 1 − 6 n3− n n X i=1 n X j=1 (Dij− [D(U )]ij)2 (3.10)
Chapter 4
Target Blood Pressure Pattern
Identification using Linguistic
Summaries
4.1
Data Set Characterization and Treatment
This study used data from the Multi-parameter Intelligent Monitoring for Intensive Care (MIMIC III) database. This is a large database of ICU patients admitted to the Beth Israel Deaconess Medical Center, collected from 2001 to 2012, that has been de-identified by removal of all Protected Health Information.
Clinical records store various measurements for each patient. These measurements can include patient measurements, fluid output, and laboratory measurements
The used data contains information about all the patients who stayed longer than 24 hours and without chronic renal disease.
For each patient the information about ID of staying in the ICU and the subject ID (which are different since one subject can go to the ICU more than once), demographics, clinical data, laboratory measure-ments and co-morbidity scores. Demographics encompasses information regarding patient age, gender, ethnicity and history of hypertension. Clinical data records contain the patient Urine Output, MAP, Di-astolic and Systolic pressures. Laboratory measurements contain records referring to haemoglobin, white blood cell count, platelets, potassium, serum creatinine, chloride, BUN, AST, ALT, ionized calcium glucose, amylase, Lipase, magnesium and sepsis diagnosis.
At this present study, from all the variables available at MIMIC III database, only age, gender, history of hypertension, MAP, the UO rate from previous hour, if vasopressor where administrated at current hour, magnesium, potassium, OASIS scores, OASIS respiratory rate score and Elixhause for fluid and electrolyte were used. The selection of this particular variables was made taken into consideration the work done in Ricardo Pacheco, C ´atia M. Salgado, Rodrigo Deliberato, Leo Anthony Celi, Susana M.
Vieira [4].
4.1.1
Data Processing
Following the steps made in [4], patients younger than 17 or with stays below to 24 hours weren’t included in this study. Patients that exhibited deficient or unknown kidney function at admission were also excluded. Kidney status was evaluated through creatinine levels at the time of admission. Patients with a creatinine level at or above 1.2 mg/dl were excluded. This threshold follows the one defined in [4]. Clinical data is collected in irregular intervals. Vital signs and fluids are usually collected on an hourly basis, while laboratory measurements are recorded daily. In the ICU, the total amount of urine produced by the patients since the last collection is recorded.
A critical event, that will be the main concern of this study, is characterized by a drop at the UO below 30ml/hin the next hour. Depending on MAP measures during the first critical event, the dataset was split into two different groups as stated on the first chapter: first group with the patients who had a critical event before the established protocol (MAP below 65 mmHg) and the second group with the patients who had a critical event after and even though the established protocol was applied (MAP above 65 mmHg). From now on, all data will be seen and characterized through these two groups, named group 1 and group 2.
The dataset was initially processed to remove outliers, patients with missing data or with no critical event, and then was divided into the two initial groups as said above. From this initial cleanse, a first group with 103 patients and a second group with 601 patients were obtained.
Figures 4.1 shows a typical patient’s time series. On the horizontal axis is the time since the admis-sion of a patient and on the vertical axis, the latest MAP measurement at that time or Urine Output from previous hour.
Figure 4.1: Example of a patient time series
As previously stated, MAP and UO are measured hourly. However a linear regression was made in order to better show the temporal evolution of both variables for all patients. This option took also into consideration the approach described in Section 2.2 to deal with time-varying Linguistic Summaries. For these reasons, all graphics at this report concerning MAP and UO will be linear instead of discrete.
The following figures show examples of MAP and UO for all patients linearised conformed what was said above.
Figure 4.2: MAP for all patients
4.1.2
Data Characterization
For a feasible study of this data, it is necessary that the dataset has the same time interval. This way, initially, a trim of the time series for each patient from both groups was performed. For this to be achieved, only the first six hours before the first critical event were considered for this study. Such was chosen for both the need of a short time range, that could include more patients, and the fact that, according to previous studies [10], more correlations were found during the first six hours before the critical event. Following, a demographic study was performed to understand how data was distributed in both groups and if any similarity could be found between them. To achieve this goal, the data concerning patients’ demographic variables was described as it can be seen in the following tables 4.1 and 4.2.
Number of Patients % <=80 95 92 Age >80 8 8 Female 45 44 Gender Male 58 56 Yes 45 44 History of hypertension No 58 56 Yes 44 43
Administered vasopressors during hospital stay
No 59 57 Table 4.1: Patients demographic in Group 1
Number of Patients % <=80 601 100 Age >80 0 0 Female 239 40 Gender Male 362 60 Yes 250 42 History of hypertension No 351 58 Yes 196 33
Administered vasopressors during hospital stay
No 405 67 Table 4.2: Patients demographic in Group 2
A brief analysis of these tables shows an overall similarity between the demographic data from both groups. The major difference between the two groups here found is still very slight, of ten percentage points, and is related to the administration of vasopressors during the stay at the hospital.
In order that a similar demographic study for the time series could be performed, it was necessary to consider each time segment between a interval of two measures of MAP and UO and see for both this variables which ”behaviours” were more common for patients from both groups. This way, and for future coherence with what will be done in the following section, the behaviours of each trend were described considering their slope:quickly decreasing ]-90, -70], decreasing ]-70, -30], constant ]-30,
30],increasing ]30, 70] and quickly increasing ]70, 90[.
The following figures show how MAP and UO behaves for both groups 1 and 2.
Figure 4.5: Urine Output behaviour from patients of group 1
From the two graphs above it is possible to see the expected decrease in UO at the last hour before critical event (UO must drop below 30ml/h for a critical event to take place). Moreover, looking to figure 4.6, it is possible to see that at the last hour there’s a prevalence of patients from group 1 who have MAP quickly decreasing (about 60%).
Figure 4.7: Urine Output behaviour from patients of group 2
Figures 4.8 and 4.9 show the behaviour of both MAP and UO from patients from the second group. Here again, the UO quickly decreases in the last hour before a critical event for 100% patients from the second group. This evidence shared by the two groups is interesting because, despite being in its definition UO must drop below 30ml/h for a critical event, nothing says this decrease must be fast.
Different from patients of group 1, patients from group 2 seem to have their MAP with no prevalent behaviour among all trends. There’s only a clear incidence on fast dynamics (quickly decreasing or quickly increasing) but even here this characteristic does not cover more than 40% patients.
Pursuing a better description of both groups and trying to find which variables other than MAP and UO could help contributing to a critical event, for both groups 1 and 2 a analysis of MAP and UO was made as well as other demographic values, considering now sub-divisions for both groups 1 and 2 related to the history of hypertension. Demographic groups based on age were not analysed because of the small number of over 80 years patients present in both group 1 and 2.
Data Characterization in Group 1 for Different Demographic Groups 1. Hypertensive Patients Number of Patients % <=80 39 87 Age >80 6 13 Female 24 53 Gender Male 21 47 Yes 17 38
Administered vasopressors during hospital stay
No 28 62 Table 4.3: Demographics from Hypertensive Patients of Group 1
The following figures show how MAP and UO behave during the 6 hours period for all hypertensive patients from group 1.
Figure 4.9: UO from Hypertensive Patients of Group 1
UO has no significant change when considering only the hypertensive patients compared to all patients of group 1. A drop in the coverage of MAP in last hour, however occurs. Also, the fast increase of MAP in the time period between the fourth and fifth hours previous the critical event is much more prevalent in this particular sub-group of patients than in the entire group 1.
2. Non-Hypertensive Patients Number of Patients % <=80 56 90 Age >80 2 10 Female 21 46 Gender Male 37 64 Yes 27 47
Administered vasopressors during hospital stay
No 31 53 Table 4.4: Demographics for non-hypertensive patients of Group 1
The following image shows the behaviour of MAP trends at their respective time intervals in non-hypertensive patients from group 1.
Figure 4.10: MAP from Non-Hypertensive Patients of Group 1
Figure 4.11: UO from Non-Hypertensive Patients of Group 1
Finally, for non-hypertensive patients UO shows, again, no significant change in its distribution throughout the time periods. MAP, however, has a fast decrease in the final hour that covers almost 70% of non-hypertensive patients.
Data Characterization in Group 2 for Different Demographic Groups
The same study was then made for patients from the second group. In the following subsections it is possible to see the differences and similarities between hypertensive and non-hypertensive patients form group 2. 1. Hypertensive Patients Number of Patients % <=80 228 91 Age >80 22 9 Female 95 38 Gender Male 155 62 Yes 88 35
Administered vasopressors during hospital stay
No 162 65 Table 4.5: Demographics from Hypertensive Patients of Group 2
The following images shows how MAP and UO trends’ behaviours are distributed in each time period for hypertensive patients from group 2.
Figure 4.13: Urine Output from Hypertensive Patients of Group 2
From the graphs above, it is clear that no significant change occur for both UO and MAP from a general analysis to this one. MAP is again well distributed in all dynamics throughout time, with some prevalence again in the fast dynamics.
2. Non-Hypertensive Patients Number of Patients % <=80 338 96 Age >80 13 4 Female 144 41 Gender Male 207 59 Yes 108 43
Administered vasopressors during hospital stay
No 243 57 Table 4.6: Demographics from Non-Hypertensive Patients of Group 2
The following images shows how MAP and UO trends’ behaviours are distributed in each time period for non-hypertensive patients from group 2.
Figure 4.14: MAP from Non-Hypertensive Patients of Group 2
Figure 4.15: Urine Output from Non-Hypertensive Patients of Group 2
Looking to the images, it is again visible that for non-hypertensive patients the dynamic of changes that MAP has during the measured period is well distributed among all labels. The most prevalent trends throughout time are ”quickly decreasing” and ”quickly increasing”. Being the ”quickly increasing label for the last hour with major representativeness (but not more than 45%.
The UO follows the tendency seen in both general view from group 2 and hypertensives from group 2.