Exploratory Analysis of Meteorological Data

Joel Agostinho Nunes Pinto Sousa

Integrated Master's in Network and Information Systems Engineering

Departamento de Ciência de Computadores, 2019

Supervisor

Rita Ribeiro, Assistant Professor, Faculdade de Ciências da Universidade do Porto

Co-supervisor

Sérgio Crisóstomo, Assistant Professor, Faculdade de Ciências da Universidade do Porto; Researcher, Instituto de Telecomunicações

All corrections determined by the jury, and only those, have been made.

The President of the Jury,

Abstract

The meteorological data recorded by the Escola Médico-Cirúrgica do Porto (EMCP) between December 1860 and March 1898 were stored in files. These data remained raw, consisting only of records that had not yet been analyzed. Through the Instituto Geofísico da Universidade do Porto (IGUP), we had access to these files, which gave us the opportunity to perform an exploratory analysis of these meteorological data. This analysis allowed us to identify missing values, suspicious values, and values with format errors. Through the development of a web app, it became possible to analyze the evolution of the trend of the meteorological parameters over the period covered by the records and to analyze the evolution of the meteorological parameters in specific time intervals, such as months, years, and seasons. It was also possible to study specific temperature and precipitation indices, such as heatwaves and the maximum number of consecutive days of precipitation, as well as to observe the relationships between meteorological parameters. The developed web app allows the graphical visualization and interpretation of the exploratory analysis of the data in a dynamic and interactive way, a relevant contribution to this study as well as to future analyses by experts in the field.

Summary

The set of meteorological data recorded by the Escola Médico-Cirúrgica do Porto (EMCP) between December 1860 and March 1898 was stored in files. These data remained raw, consisting only of records that had not yet been analyzed. Through the Instituto Geofísico da Universidade do Porto (IGUP), we had access to these files, which gave us the opportunity to carry out an exploratory analysis of these meteorological data. This analysis allowed us to identify missing values, suspicious values, and values with format errors. Through the development of a web app, it became possible to analyze the evolution of the trend of the meteorological parameters over the period covered by the records and to analyze the evolution of meteorological parameters in specific time intervals, such as months, years, and seasons. It also became possible to study specific temperature and precipitation indices, such as heatwaves and the maximum number of consecutive days of precipitation, as well as to observe the relationships between the meteorological parameters. The developed web app allows the graphical visualization and interpretation of the exploratory data analysis in a dynamic and interactive way, a relevant contribution to this study as well as to future analyses by experts in the field.

opportunity, to Sofia for her unconditional support. My sincere thanks.

List of Tables

2.1 Temperature indices used in [20] and [47].
2.2 Precipitation indices used in [16] and [49].
3.1 Meteorological parameters available on the Excel files.
3.2 Number of missing values, format error values and suspicious values per parameter.
3.3 Missing value: precipitation.
3.4 Format error values: atmospheric pressure at 9h.
3.5 Limits used for each parameter to find suspicious values.
3.6 Suspicious values: maximum air temperature.
3.7 Suspicious values: average air temperature.
3.8 Suspicious values: minimum air temperature ≥ maximum air temperature.
3.9 Summary values of minimum air temperature by decade.
3.10 Summary values of maximum air temperature by decade.
3.11 Summary values of precipitation by decade.
A.1 Missing values and format error values from the atmospheric pressure at 9h, at 12h, at 15h and average.
A.2 Missing values and format error values from the minimum, maximum and average air temperatures.
A.3 Missing values and format error values from the shade air temperatures.
A.4 Missing values and format error values from the exposure air temperatures.
A.5 Missing values and format error values from relative humidity.
A.6 Missing values and format error values from vapor pressure.
A.8 Missing values and format error values from ozone.
A.9 Missing values and format error values from wind speed.
A.10 Suspicious values from minimum, maximum and averages temperatures. Limits used to the minimum air temperature: -10 and 30. Limits used to the maximum air temperature: 0 and 41. Limits used to the average air temperature: 0 and 33.
A.11 Days when the average air temperature differs from the mean between maximum and minimum air temperatures. The supposed average is the average calculated between the minimum air temperature and the maximum air temperature.
A.12 Days in which the maximum air temperature is less than or equal to the minimum air temperature.
A.13 Suspicious values from exposure air temperatures. Limits used: 0 and 50.
A.14 Suspicious values from shade air temperatures. Limits used: -10 and 40.
A.15 Missing values and format error values from the atmospheric pressure at 9h, at 12h, at 15h and average. Limits used: 700 and 800.
A.16 Days that the average atmospheric pressure differs from averaging the values of the 3 daily values. The supposed average is the average calculated between the atmospheric pressure at 9h, 12h and 15h.
A.17 Suspicious values from vapor pressure. Limits used: 1 and 35.
A.18 Suspicious values from relative humidity. Limits used: 0 and 100.
A.19 Suspicious values from precipitation. Limits used: 0 and 100.
A.20 Suspicious values from ozone. Limits used: 0 and 20.
A.21 Suspicious values from wind speed. Limits used to absolute wind speed: 0 and 1000. Limits used to wind speed at 15h: 0 and 40.
A.22 Days in which the absolute wind speed is less than the wind speed at 15h.

List of Figures

2.1 Meteorological parameters.
2.2 Bar chart showing the number of observations by product type [37].
2.3 Histogram showing the frequency of the temperature values divided into 5-value intervals.
2.4 Boxplot composition [24].
2.5 Scatter plot with smooth curve and confidence interval (gray area) [65].
2.6 Correlogram with precipitation data. The horizontal dashed lines are approximate 95% confidence limits.
2.7 Stationary Time Series.
2.8 Non-stationary Time Series.
2.9 Trending time series [33].
2.10 Seasonal and cyclical time series [33].
2.11 Different moving averages applied in the same original time series [33].
2.12 Decomposed time series using additive model [33].
3.1 June 1880 Excel worksheet.
3.2 Minimum air temperature plot where it is possible to see outliers.
3.3 Minimum air temperature scatter plot with trend line.
3.4 Minimum air temperature annual average plot.
3.5 Annual number of days with minimum air temperature.
3.6 Maximum air temperature scatter plot with trend line.
3.7 Shade air temperature at 9h
3.8 Decomposition
3.10 Total precipitation by month.
3.11 Annual number of days
3.12 Precipitation
3.13 Evolution over time of the various parameters of atmospheric pressure. A moving average of 365 is applied to the plot.
3.14 Atmospheric pressure at 9h per month where each line represents a different year.
3.15 Humidity at 9h annual average.
3.16 Seasonal average
3.17 Monthly average
3.18 Vapor pressure at 9h plot with trend line.
3.19 Seasonal average
3.20 Monthly average
3.21 Ozone plot with trend line.
3.22 Ozone
3.23 Absolute wind speed decomposition.
3.24 Absolute wind speed
3.25 Absolute wind speed
3.26 Meteorological parameters correlation matrix.
3.27 Scatter plot of correlation between minimum air temperature and vapor pressure at 9h with LOESS approximation and with Pearson correlation R and p-values.
3.28 Scatter plot of correlation between average atmospheric pressure and precipitation with LOESS approximation and with Pearson correlation R and p-values.
3.29 Scatter plot of correlation between humidity at 15h and shade air temperature at 15h with LOESS approximation and with Pearson correlation R and p-values.
3.30 Scatter plot of correlation between ozone and precipitation with LOESS approximation and with Pearson correlation R and p-values.
3.31 Evolution of all shade and exposures temperatures parameters. A moving average of 365 is applied to the plot.
3.32 STL decomposition
4.1 Home tab of the shiny app.
4.2 Shiny app layout demonstration.
4.3 Univariate analysis tab layout.
4.4 Average air temperature statistics.
4.5 Statistics per year tab view.
4.6 Statistics per month tab view
4.7 Statistics per season tab view.
4.8 Decomposition tab view.
4.9 Data table tab view.
4.10 Multivariate analysis tab view.
4.11 Visualizations tab view.
4.12 Records tab view.
4.13 Application directory structure.

Chapter 1 Introduction

The analysis of meteorological parameters is important for understanding the climate and its evolution. A large majority of human activities depend on data collected from weather stations, from the daily lives of ordinary citizens to sectors such as energy, renewable energy, agriculture, and public health. Discussions about the evolution of some meteorological parameters have been intensifying, since significant changes in parameters such as temperature, humidity, precipitation, and atmospheric pressure have started to concern several areas of knowledge involved in climate research. Given the importance of climate studies, this dissertation analyzes the meteorological data recorded by the Escola Médico-Cirúrgica do Porto (EMCP) between December 1860 and March 1898, provided to us by the Instituto Geofísico da Universidade do Porto (IGUP).

1.1 Motivation

The data collected by the EMCP remain raw: they are records that have not yet been analyzed. Access to these data is therefore the main motivation for this dissertation, since it allows an exploratory analysis of the meteorological data to be carried out. The analysis of the data provided by the IGUP is relevant not only because of the importance of climate studies but also because it allows errors present in the data to be corrected, which supports a reliable analysis of the climate.

1.2 Objectives

The main objective of this dissertation is the exploratory analysis of meteorological parameters, so as to provide relevant, accurate, and easy-to-understand climate information. These parameters belong to the meteorological data collected by the EMCP. Based on an exploratory analysis of this data set, the specific objectives of the dissertation are:

• identification and replacement of errors in the data, such as missing values, suspicious values, and values with format errors;

• development of a web application that dynamically generates easy-to-understand graphs, allowing access to, visualization of, and analysis of the meteorological data in an interactive way.

1.3 Thesis outline

After this introductory chapter, the rest of the thesis is organized as follows:

In Chapter 2 we give a short introduction to the meteorological parameters presented in the data recorded by the EMCP. Then we mention types of analyses that are performed on air temperature and precipitation. We give an introduction to essential concepts used in exploratory data analysis and time series. Finally, we discuss the tools used in this study.

In Chapter 3 we characterize the data set, then we describe the pre-processing of the data. Finally, we discuss some analysis of meteorological parameters belonging to the data set and correlations between each other.

In Chapter 4 we introduce the shiny app. We describe its layout and features, so that the chapter also serves as a manual for using the application. We also give a general description of the application's organization and architecture.

In Chapter 5 we present the main results of this thesis. The limitations encountered throughout the work are also mentioned, as well as suggestions for future studies.

Chapter 2 Background

In this chapter, we give some background on meteorological parameters. We also mention typical analyses performed on air temperature and precipitation. We present different techniques used in exploratory data analysis and time series analysis and, finally, we mention the tools used in this study.

2.1 Meteorological variables

Meteorological observations were and are used to record the meteorological conditions of each place and their evolution, in order to characterize the respective climates [2]. Meteorology is the study of the physical properties of the atmosphere; it seeks to understand the processes that explain the evolution of the atmosphere so that its future states can be forecast. Meteorologists focus on a subset of parameters with the greatest impact on human activities or with the greatest explanatory power over the meteorological evolution [42]. Some of these parameters (shown in Figure 2.1) are air temperature, precipitation, atmospheric pressure, relative humidity, vapor pressure, wind, and ozone. These parameters are characterized below.

Air temperature

Formally, temperature is related to the speed of motion of the particles that make up matter: the more agitation they exhibit, the higher the temperature [48]. Air temperature varies from day to night as well as from one place to another, and both positive and negative temperatures can occur on the same day. For this reason, meteorological stations use shelters with thermometers placed 1.5 m to 2 m above the ground to obtain weather observations with high precision and temporal resolution [56, 60, 61]. The thermometer is the instrument used to measure temperature accurately [48]; however, recorded data are limited by the poor distribution of weather stations [69]. Temperature can be expressed in different units, depending on which scale is more convenient: degrees Celsius (°C), degrees Fahrenheit (°F), or kelvin (K), and one unit can be converted into another through defined formulas.

Figure 2.1: Meteorological parameters.
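The conversions between the three temperature scales mentioned above can be sketched as follows. This is a minimal illustration using the standard conversion formulas; the function names are hypothetical and are not taken from the thesis.

```python
def celsius_to_fahrenheit(temp_c: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius to kelvin."""
    return temp_c + 273.15


def kelvin_to_celsius(temp_k: float) -> float:
    """Convert kelvin to degrees Celsius."""
    return temp_k - 273.15


# Example: a daily maximum of 25 °C expressed in the other two scales.
print(celsius_to_fahrenheit(25.0), celsius_to_kelvin(25.0))
```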

Precipitation

Precipitation comes from clouds, which are formed by a large number of small droplets and ice crystals [58]. These droplets result from a change of state of water vapor which, as it ascends in the atmosphere, cools and reaches saturation [48]. Precipitation varies from year to year and over the decades, with changes in its amount, intensity, frequency, and type [58]. It can occur in two states: liquid (e.g., drizzle or showers) and solid (e.g., snow or hail) [4]. According to global estimates, the global average precipitation rate is about 2.8 mm per day [32, 66]; however, there is evidence that man-made climate change is radically changing precipitation levels and the hydrologic cycle [58]. The rain gauge is the instrument used to measure precipitation at a particular location within a specific time period [27].

Atmospheric pressure

Under the action of gravity, the surrounding air has weight, even if we do not notice it, and therefore exerts a force on all bodies. This force per unit area is called atmospheric pressure [39]. Atmospheric pressure depends on several variables, such as temperature, humidity, geographic location, climatic conditions and, in particular, altitude: the higher the altitude, the lower the atmospheric pressure exerted on a body, because there is less air above it [48]. Atmospheric pressure can also be associated with the state of the weather: when the weather is unstable, atmospheric pressure decreases and, conversely, when the weather is stable, atmospheric pressure increases [48]. Atmospheric pressure is traditionally measured with a barometer [39] and can be expressed in different units. The most common are millimeters of mercury, atmospheres, millibars, and pascals, the latter being the unit of the International System [4].

Relative humidity

Water can exist in the liquid, solid, or gaseous state and is one of the main components of the atmosphere. The amount of water vapor present in the air is called humidity. This amount is not constant and depends on several factors, such as rain, proximity to the sea, or the presence of plants [4]. There are several ways to express the humidity content of the atmosphere [48]. One of them is absolute humidity, which is the mass of water vapor, in grams, contained in 1 m³ of dry air. However, relative humidity is the most commonly used measure; it is expressed as a percentage (%) and calculated according to the following expression:

$$h = \frac{e}{\xi} \times 100$$

where h is the relative humidity, e is the vapor content of the air mass, and ξ is the maximum storage capacity, called the saturation vapor pressure. This last value is the maximum amount of water vapor that a mass of air can contain before the vapor turns into liquid water, a state commonly known as saturation. Relative humidity therefore conveys how close the mass of air is to saturation. When the relative humidity is at 100%, the mass of air cannot store more water vapor, and any extra amount of vapor is transformed into liquid water or ice crystals, according to environmental conditions [4, 48]. A change in air temperature may cause a change in relative humidity, because the saturation vapor pressure of the air changes with the air temperature. If the air temperature increases, the saturation vapor pressure also increases, which increases the water vapor capacity of the air [4]; for the same vapor content, the relative humidity then decreases. The psychrometer is the instrument usually used to measure humidity [27].
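The relationship between relative humidity, vapor content, and saturation vapor pressure described above can be illustrated with a short sketch. The ratio itself follows the expression in the text. The Magnus-type approximation used to obtain the saturation vapor pressure from the air temperature is an assumption introduced here for illustration and does not come from the thesis, as are the function names and the sample values.

```python
import math


def saturation_vapor_pressure_hpa(temp_c: float) -> float:
    """Approximate saturation vapor pressure over water, in hPa.

    Assumption: Magnus-type approximation with commonly used coefficients;
    this formula is not taken from the thesis.
    """
    return 6.1094 * math.exp(17.625 * temp_c / (temp_c + 243.04))


def relative_humidity(vapor_pressure: float, saturation_vapor_pressure: float) -> float:
    """Relative humidity in percent: vapor content over its saturation value, times 100."""
    if saturation_vapor_pressure <= 0:
        raise ValueError("saturation vapor pressure must be positive")
    return vapor_pressure / saturation_vapor_pressure * 100.0


# Same vapor content (12 hPa, an illustrative value) at two air temperatures.
vapor = 12.0
rh_cool = relative_humidity(vapor, saturation_vapor_pressure_hpa(15.0))
rh_warm = relative_humidity(vapor, saturation_vapor_pressure_hpa(25.0))
print(f"15 °C: {rh_cool:.1f} %   25 °C: {rh_warm:.1f} %")
```

With the vapor content held fixed, the warmer air has the higher saturation vapor pressure and therefore the lower relative humidity, which is the temperature effect described in the paragraph above.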

Wind speed and direction

Wind is the movement of air from one area to another. Although it has several causes, it generally originates when there is a temperature or pressure difference between two points [48]. In the case of a pressure difference, the air generally moves from the high-pressure point to the low-pressure point; we then say that there is a pressure gradient between the two points. In the case of a temperature difference, the wind is caused by the thermal contrast: when a mass of air has a higher temperature than the air around it, its volume increases and its density decreases. Due to buoyancy, the mass of hot air rises and its position is occupied by other air masses; this displacement generates the wind [48]. Due to its complexity, wind is considered one of the most difficult weather parameters to predict [52], in particular its direction and speed. Wind direction is the azimuth angle (measured clockwise from north) from which the wind is blowing [28], and its prediction has several relevant uses, such as aircraft and ship routing [7]. Wind speed is considered to be the result of complex interactions between large-scale forcing mechanisms, such as pressure and temperature gradients, the rotation of the Earth, and local surface characteristics [52]. Wind speed and wind direction are measured by different instruments. The anemometer is the instrument most used to measure wind speed: the wind speed is proportional to the rotation of the instrument, and the units of measure are km/h or m/s. Wind direction is measured with weathervanes, which indicate the geographical origin of the wind; we then speak, for example, of a north, northeast, west, or southwest wind [27].

Ozone

Ozone is an important element in many physical and chemical atmospheric processes, as it absorbs ultraviolet [10] and infrared solar radiation; that is, it is a greenhouse gas [44]. Tropospheric ozone is reported as the main substance in the photochemical pollution index [40] and is recognized as one of the major air pollutants [23, 67]. Ozone has thus contributed to climate change [34], with adverse impacts on forests and crops [9] and even on human health [21]. The harmful effects of high concentrations of ozone, especially in terms of human health, remain an air quality problem [45]. While ozone was once a relatively constant constituent of air, in recent years its concentration has shown a constant increase in the upper layer of the atmosphere [40]. The relationship between surface ozone and meteorological variables is complex [13]; however, surface ozone concentrations are strongly dependent on meteorological variables such as solar radiation fluxes, temperature, cloudiness, wind speed and direction [17, 25], and relative humidity [3].

2.2 Climatological data analysis

The climate is a collection of atmospheric weather conditions, where the state of the weather refers to the meteorological conditions at a particular time and place [2]. The climate has always been a subject of interest to researchers, and over time studies of it have become increasingly elaborate and complex. Today, researchers are interested not only in the climate but also in its changes; the number of studies on this theme has increased considerably in the last century, driven mainly by the increase in average global and regional temperatures [47].

Climate change is the subject of many studies, since its effects on the environment and society are of great concern worldwide. Changes in climate variables such as precipitation and/or air temperature can give rise to enormous socioeconomic impacts at both the regional and global levels, affecting areas such as agriculture, food security, public health, and the availability of water and other natural resources [15]. Throughout the twentieth century, temperatures never reached before were recorded on a global scale, confirming a set of climate change scenarios. As a result, climate change has come to be seen as a major global concern [41]. There is thus a wide variety of studies on climate change, which makes it necessary to seek a better understanding of the results obtained through these investigations [20].

Climate is characterized by defining a set of variables. Regarding air temperature, studies such as [20] and [47] selected indices to characterize extreme air temperatures. These indices are shown in Table 2.1. Their purpose is to assess climate change, in particular changes in the intensity, frequency, and duration of extreme temperatures. The indices are divided into categories: absolute, duration, threshold, and other indices. In [47] the selected indices were calculated annually, whereas in [20] they were calculated at the seasonal scale.

Index | Description | Definition
----- | ----------- | ----------
MTX | Maximum temperature | Mean of maximum temperature
TX25 | Summer days | Number of days with daily maximum temperature > 25 °C
TX35 | Extremely hot days | Number of days with daily maximum temperature > 35 °C
TXx | Warmest day | Maximum value of daily maximum temperature
TXn | Coldest day | Minimum value of daily maximum temperature
HWD | Heat wave duration | Number of days in intervals of at least 6 consecutive days with TX > mean + 5 °C
MTN | Minimum temperature | Mean of minimum temperature
TN0 | Frost days | Number of days with daily minimum temperature < 0 °C
TN20 | Tropical nights | Number of days with daily minimum temperature > 20 °C
TNx | Warmest night | Maximum value of daily Tmin
TNn | Coldest night | Minimum value of daily Tmin
CWD | Cold wave duration | Number of days in intervals of at least 6 consecutive days with TN < mean − 5 °C
DTR | Diurnal temperature range | Mean difference between maximum and minimum temperatures
ETR | Extreme temperature range | Difference between TXx and TNn

Table 2.1: Temperature indices used in [20] and [47].
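As an illustration of how several of the indices in Table 2.1 follow from daily records, the sketch below computes the absolute and threshold indices from two lists of daily maximum and minimum temperatures. The function name and the sample values are hypothetical; the duration indices (HWD and CWD) are left out because they depend on a reference mean whose definition is not given in this section.

```python
from statistics import mean
from typing import Dict, Sequence


def temperature_indices(tmax: Sequence[float], tmin: Sequence[float]) -> Dict[str, float]:
    """Compute a subset of the Table 2.1 indices for one period (e.g. one year).

    tmax and tmin hold the daily maximum and minimum air temperatures in °C,
    one value per day, in the same order.
    """
    if len(tmax) != len(tmin) or not tmax:
        raise ValueError("tmax and tmin must be non-empty and of equal length")
    txx = max(tmax)  # warmest day
    tnn = min(tmin)  # coldest night
    return {
        "MTX": mean(tmax),
        "MTN": mean(tmin),
        "TX25": sum(1 for t in tmax if t > 25.0),
        "TX35": sum(1 for t in tmax if t > 35.0),
        "TN0": sum(1 for t in tmin if t < 0.0),
        "TN20": sum(1 for t in tmin if t > 20.0),
        "TXx": txx,
        "TXn": min(tmax),
        "TNx": max(tmin),
        "TNn": tnn,
        "DTR": mean(hi - lo for hi, lo in zip(tmax, tmin)),
        "ETR": txx - tnn,
    }


# Four illustrative days (made-up values, not taken from the EMCP records).
indices = temperature_indices([24.0, 26.0, 36.0, 30.0], [-1.0, 10.0, 21.0, 15.0])
print(indices)
```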

Precipitation is studied in [16] and [49]. In these studies, a set of indices derived from daily precipitation data was selected with the objective of exploring changes in the intensity, frequency, and duration of precipitation. These indices are shown in Table 2.2. They were calculated on an annual scale in [16] and on a seasonal scale in [49].

Index | Description | Definition
----- | ----------- | ----------
CDD | Consecutive dry days | Maximum length of dry spell (RR < 1 mm)
CWD | Consecutive wet days | Maximum length of wet spell (RR ≥ 1 mm)
R10 | Heavy precipitation days | Number of days per year with RR ≥ 10 mm
R20 | Very heavy precipitation days | Number of days per year with RR ≥ 20 mm
R25 | Extremely heavy precipitation days | Number of days per year with RR ≥ 25 mm
RX1D | Highest precipitation amount in one-day | Annual maximum precipitation on 1-day intervals
RX5D | Highest precipitation amount in five-days | Annual maximum precipitation on 5-day intervals
SDII | Simple daily intensity index | Mean precipitation amount on a wet day (RR > 1 mm)
PrecTot | Annual total wet-day precipitation | Annual total precipitation from days ≥ 1 mm

Table 2.2: Precipitation indices used in [16] and [49].
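The indices in Table 2.2 can likewise be derived from a series of daily precipitation totals (RR). The following is a minimal sketch under stated assumptions: the function names and sample values are hypothetical, and a single wet-day threshold of RR ≥ 1 mm is assumed for every index, although the table lists RR > 1 mm for SDII.

```python
from typing import Callable, Dict, Sequence

WET_DAY_MM = 1.0  # assumption: a wet day is RR >= 1 mm for all indices below


def longest_run(values: Sequence[float], condition: Callable[[float], bool]) -> int:
    """Length of the longest run of consecutive values that satisfy condition."""
    best = current = 0
    for value in values:
        current = current + 1 if condition(value) else 0
        best = max(best, current)
    return best


def precipitation_indices(rr: Sequence[float]) -> Dict[str, float]:
    """Compute the Table 2.2 indices for one period of daily precipitation (mm)."""
    if len(rr) < 5:
        raise ValueError("at least 5 daily values are needed for RX5D")
    wet = [value for value in rr if value >= WET_DAY_MM]
    return {
        "CDD": longest_run(rr, lambda v: v < WET_DAY_MM),
        "CWD": longest_run(rr, lambda v: v >= WET_DAY_MM),
        "R10": sum(1 for v in rr if v >= 10.0),
        "R20": sum(1 for v in rr if v >= 20.0),
        "R25": sum(1 for v in rr if v >= 25.0),
        "RX1D": max(rr),
        # Largest total over any window of 5 consecutive days.
        "RX5D": max(sum(rr[i:i + 5]) for i in range(len(rr) - 4)),
        "SDII": sum(wet) / len(wet) if wet else 0.0,
        "PrecTot": sum(wet),
    }


# Ten illustrative days (made-up values, not taken from the EMCP records).
daily_rr = [0.0, 0.5, 12.0, 25.0, 3.0, 0.0, 0.0, 0.0, 2.0, 30.0]
rr_indices = precipitation_indices(daily_rr)
print(rr_indices)
```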

temperature, observations of the monthly, annual, and seasonal averages of the minimum, average, and maximum air temperature are presented. Observations of the difference between the maximum and minimum air temperature are also presented, as are the annual and seasonal averages of the number of days on which the minimum air temperature is less than or equal to 0 °C or greater than or equal to 20 °C, and on which the maximum air temperature is greater than or equal to 25 °C.

Regarding precipitation, observations of the monthly and annual average of total precipitation are presented. Observations of the annual and seasonal average number of days with precipitation values greater than or equal to 0.1 mm, greater than or equal to 1 mm, greater than or equal to 10 mm and greater than or equal to 30 mm are also presented. Observations about the months and years with the lowest and highest precipitation are also presented.

2.3 Exploratory data analysis

Much of the progress of science is based on data analysis. The main operations that constitute this analysis are divided into two broad phases: an exploratory phase and a confirmatory phase [29].

Generally, it is advisable to start any statistical analysis with an exploratory graphical analysis. This approach is called Exploratory Data Analysis (EDA). EDA provides the first contact with the data; it is the first step in analyzing data and always precedes confirmatory data analysis [12]. It is used to isolate patterns and characteristics of the data and to reveal them to the analyst. The use of EDA is encouraged not only for the description of data but also for the formulation of models [12], since it precedes the selection of these models [29]. In summary, EDA emphasizes the search for clues and evidence, attempting to identify the main characteristics of a data set of interest and to produce suggestions for additional investigation. Confirmatory data analysis emphasizes the evaluation of the evidence found by EDA [29]. Its attention is focused on the study of the observed patterns or effects, the specification of the model, the estimation of parameters, hypothesis tests, and firm decisions about the data [14].

The focus of this thesis is on the EDA phase. The main purposes of the EDA are [12]:

• to make clear the general structure of the data;

• to compute summary statistics;

• to obtain simple descriptive summaries;

• to check the quality of the data;

• to draw appropriate graphs;

To measure the strength and direction of the linear correlation between two variables, that is, to see how one variable is linearly related to another, the Pearson correlation coefficient can be used. As stated in [1], this coefficient takes values between -1 and 1. A value of 1 means that the variables are perfectly linearly related by an increasing relationship (perfect positive correlation), while a value of -1 means that they are perfectly linearly related by a decreasing relationship (perfect negative correlation). If the value is 0, the variables are not linearly related. In a scatter plot, if all points lie on a straight line, we can say that there is a perfect correlation [51].
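The three cases described above (1, -1, and 0) can be reproduced with a small sketch of the Pearson correlation coefficient, written here from its usual textbook definition as the covariance of the two variables divided by the product of their standard deviations. The function name and the sample values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_increasing = pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0])  # points on a rising line
r_decreasing = pearson_r(x, [10.0, 8.0, 6.0, 4.0, 2.0])  # points on a falling line
r_parabola = pearson_r(x, [4.0, 1.0, 0.0, 1.0, 4.0])     # y = (x - 3)^2
print(r_increasing, r_decreasing, r_parabola)
```

The last case shows why a coefficient of 0 only rules out a linear relationship: the second variable is fully determined by the first, but the dependence is symmetric rather than linear.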

EDA uses a series of graphical techniques. These techniques are very important for studying relations, trends, and patterns in the data in an informal and simplified visual form. They allow us to become familiar with the data and to draw some conclusions from it [19]. Plotting data therefore plays a very important role in EDA, because humans can look at a plot and easily extract a lot of information from it [59]. Some examples of these graphical techniques are bar charts, histograms, box plots, and scatter plots. These techniques are explained below.

Bar chart

A bar chart is used for categorical data [38]. It is a graph in which the height of each bar is proportional to the value it represents. Bar charts are typically used to visualize quantities associated with a set of items [57]. Figure 2.2 illustrates a bar chart representing the number of observations for a set of items.

Figure 2.2: Bar chart showing the number of observations by product type [37].

Histogram

A histogram is a graphical summary of quantitative data that reveals the frequency distribution of the data values [36]. Visually, a histogram is very similar to a bar chart but without the gaps between the bars. Unlike the bar chart, the histogram does not represent unrelated categorical data, but rather different points on a numerical measurement scale [30]. Figure 2.3 shows a histogram of temperature frequencies in which each bar corresponds to a range of 5 values.

Figure 2.3: Histogram showing the frequency of the temperature values divided into 5-value intervals.
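The binning behind a histogram such as the one in Figure 2.3 can be sketched as counting how many values fall in each interval of width 5. The function name, the choice of half-open intervals starting at multiples of the bin width, and the sample values are assumptions made for this illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def histogram_counts(values: Iterable[float], bin_width: float = 5.0) -> Dict[float, int]:
    """Count values per bin [lower, lower + bin_width), keyed by the lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts: Counter = Counter()
    for value in values:
        lower_edge = math.floor(value / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Illustrative temperatures in °C (made-up values).
temperatures = [1.0, 4.9, 5.0, 7.5, 12.0, -0.5]
bins = histogram_counts(temperatures)
for lower, count in bins.items():
    print(f"[{lower:5.1f}, {lower + 5.0:5.1f}): {'#' * count}")
```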

Box plot

One of the most widely used graphical techniques for analyzing the distribution of a numeric variable is the box plot. The box plot is a compact distributional summary that displays several main features of a batch of data. Typically, it assumes a normal distribution for the variable and uses a five-number summary to describe the sample distribution: the minimum, the 1st quartile (Q1), the 2nd quartile (median, Q2), the 3rd quartile (Q3) and the maximum. An example of a box plot can be seen in Figure 2.4. As stated in [31], the box plot is composed of a box extending from Q1 to Q3. The interior of this box spans the interquartile range (IQR), which contains 50% of the data. The line inside the box marks the median value (Q2) of the distribution. The whiskers are determined from these values. They are the lines that extend from the ends of the box (Q1 and Q3) to the most extreme data points that still lie within Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively. Values that lie beyond the ends of the whiskers are considered potential outliers. Outliers are observations in a set of data that appear to be inconsistent with the remaining observations that follow a hypothesized distribution [54]. For this reason, box plots are a very good technique for detecting outliers. However, when the data is skewed, many points usually exceed the whiskers and are often erroneously declared as outliers. For that reason, some alternatives have been proposed, such as the so-called adjusted box plot [31].
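The whisker rule described above can be made concrete with a short sketch. It is written in Python with the standard library only (the thesis itself works in R), the function name `boxplot_summary` and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: other conventions give slightly different Q1 and Q3.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return quartiles, whisker ends and potential outliers (1.5 x IQR rule)."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions shift Q1/Q3 slightly.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical temperatures in degrees Celsius; 41.0 is an implausible reading.
sample = [9.5, 10.1, 11.0, 11.4, 12.0, 12.3, 12.9, 13.5, 14.2, 41.0]
summary = boxplot_summary(sample)
print(summary["whiskers"], summary["outliers"])
```

In this illustrative sample only the value 41.0 falls beyond the upper fence, so it is the only point reported as a potential outlier.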


Figure 2.4: Boxplot composition [24].

Scatter plot

A scatter plot is a type of plot that uses Cartesian coordinates to display discrete data points, typically described by two variables. Scatter plots are useful for showing relationships in the data, such as correlation or other patterns [6]. They are used in statistical data analysis, in particular for the preliminary examination of regression data. The purpose of regression analysis is to discover relationships between a response variable and one or more predictors [64]. These relationships allow future values of the response variable to be predicted from observations of the predictors [50]. According to [22], simple non-parametric regression is often called scatter plot smoothing because the method passes a smooth curve through the points in an x versus y scatter plot. An example of a scatter plot is shown in Figure 2.5.

Figure 2.5: Scatter plot with smooth curve and confidence interval (gray area) [65].

LOESS is short for locally weighted regression smoothing and is a locally weighted least squares regression smoother [5]. LOESS fits multiple regressions in local neighborhoods, and it is a non-parametric approach since it does not require an a priori specification of the relationship between the variables [35]. Through LOESS it is possible to obtain a graphic summary of the relationship between the variables, and it is widely used for fitting smooth curves to scatter plots [35]. Another approximation method used in scatter plots is linear regression. In this linear approach, the effect of one variable on another is estimated by the slope of the regression line between these variables [18]. While LOESS produces a curve, linear regression produces a straight line.
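To illustrate the difference between the two approaches, the following Python sketch fits a global least-squares line and a simplified local linear smoother. It is only an approximation of LOESS under stated assumptions: tricube weights, a fixed neighbourhood fraction `frac`, distinct x values, and no robustness iterations. The function names are hypothetical and do not correspond to the R routines used in the thesis.

```python
def weighted_line(xs, ys, ws):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def loess_at(x0, xs, ys, frac=0.5):
    """Local linear estimate at x0 from roughly frac of the points."""
    k = max(3, round(frac * len(xs)))
    # Take k + 1 neighbours: the farthest one only sets the bandwidth h
    # and receives weight zero under the tricube kernel.
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[: k + 1]
    h = abs(nearest[-1][0] - x0)
    local_x = [x for x, _ in nearest]
    local_y = [y for _, y in nearest]
    weights = [(1 - (abs(x - x0) / h) ** 3) ** 3 for x in local_x]
    intercept, slope = weighted_line(local_x, local_y, weights)
    return intercept + slope * x0


# Hypothetical, exactly linear data: both methods must recover y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
intercept, slope = weighted_line(xs, ys, [1.0] * len(xs))  # ordinary least squares
print(slope, loess_at(4.5, xs, ys))
```

With equal weights `weighted_line` reduces to ordinary linear regression, whose slope summarizes the whole relationship, whereas `loess_at` refits the line around every evaluation point and can therefore follow curvature.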

2.4 Time Series

Often, when studying a phenomenon, we find a set of data in which observations are made according to the order of time. This sequence of time-ordered observations is called a time series [63], in which the observed values {x_1, x_2, ..., x_n} are measured at fixed discrete moments {t_1, t_2, ..., t_n}.

The fundamental characteristic of a time series is that its observations are correlated [63]. The correlation introduced by sampling adjacent points in time leads to new problems in statistical modeling and inference for which most traditional statistical methods are not appropriate, since these methods do not respect the order between observations. Consequently, different methods are required. The statistical approach that addresses the problems caused by time correlations is called time series analysis [53,63]. The first step in a time series study is a careful examination of the recorded data (observations) plotted over time, with consecutive observations joined by straight lines [33]. This examination frequently suggests the method of analysis and the statistics that can be used to summarize the information in the data [53].

The main objectives of time series analysis are to explore and extract signals (patterns) contained in the time series, to make forecasts (that is, future predictions in time) using this knowledge to optimally control processes [55] and to develop mathematical models that provide acceptable descriptions for sample data [53].

Before discussing specific statistical methods, it is convenient to note that there are two main approaches to time series analysis: the time domain approach and the frequency domain approach [53]. The time domain approach describes the characteristics of a time series by representing them as functions of time. It considers the investigation of lagged relationships to be the most important. For this purpose, time functions such as the mean function (the simplest descriptive measure of a time series), the autocorrelation function (ACF), the partial autocorrelation function (PACF) and the auto-covariance are used to describe the time series [53, 55]. In the frequency domain approach, the time series is represented as spectral expansions in Fourier or wavelet modes, and the investigation of cycles is considered the most important [53,55]. A spectral function is used to study how the variation of a time series can be explained by a mixture of sines and cosines at various frequencies [63].

Measures of Dependence Various measures describe the general behavior of a process as it evolves over time [53].


• Mean Function

The mean function is a rather simple descriptive measure, such as the average monthly high temperature for a city. In this case, the mean is a function of time.

• Auto-covariance Function

The auto-covariance measures the linear dependence between two points on the same time series observed at different times.

• Autocorrelation Function (ACF)

The ACF measures the linear predictability of the series at time t, say x_t, using only the value x_s. It describes the inherent correlation between observations of a time series that are separated in time by some lag k [26]. Figure 2.6 shows the ACF of precipitation data. We can see that the larger the lag, the less the autocorrelation exceeds the confidence limits (dashed lines).

Figure 2.6: Correlogram with precipitation data. The horizontal dashed lines are approximate 95% confidence limits.

• Partial Autocorrelation Function (PACF)

PACF is the correlation between Z_t and Z_{t+k} after their mutual linear dependency on the intervening variables has been removed [62]. This is because if Z_0 is correlated with Z_1,
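As an illustration of the measures listed above, the sample autocorrelation at lag k can be sketched in Python as follows. The function names are hypothetical, the estimator divides by n at every lag (a common convention, assumed here), and the limits ±1.96/√n are the usual white-noise approximation to 95% limits, which may differ from the exact computation behind Figure 2.6.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev) / n  # sample auto-covariance at lag 0
    acf = []
    for k in range(max_lag + 1):
        # Sample auto-covariance at lag k, divided by n (assumed convention).
        ck = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf


def white_noise_limit(n):
    """Approximate 95% limit for the ACF of white noise: 1.96 / sqrt(n)."""
    return 1.96 / math.sqrt(n)


# Hypothetical series with an exact period of 4 observations.
series = [0, 1, 0, -1] * 6
acf = sample_acf(series, max_lag=4)
limit = white_noise_limit(len(series))
print([round(r, 3) for r in acf], round(limit, 3))
```

For this periodic example the autocorrelation is strongly negative at lag 2 and strongly positive at lag 4, both well outside the approximate limits.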


2.4.1 Stationarity

Time series can be considered stationary or non-stationary. A stationary time series is one for which the probabilistic behavior of every collection of values {y_{t_1}, y_{t_2}, ..., y_{t_k}} is identical to that of the shifted values {y_{t_1+h}, y_{t_2+h}, ..., y_{t_k+h}}, for all k = 1, 2, ..., all time points t_1, t_2, ..., t_k, and all time shifts h = 0, ±1, ±2, ... [53]. A stationary time series should exhibit similar behavior over different time intervals, with regularity in its mean and autocorrelation functions [53]. A non-stationary time series, on the other hand, is one for which this regularity is not guaranteed. The global temperature over the years is one example of such a time series, because of its increasing trend. Figure 2.7 represents a stationary time series. It shows that the y-axis variable evolves over the years around a constant average. Figure 2.8 represents a non-stationary time series. It illustrates the evolution of the y-axis variable over the years, where one can see that the average is not constant.

Figure 2.7: Stationary Time Series.

Figure 2.8: Non-stationary Time Series.

2.4.2 Components

According to [33] and [55], it is possible to detect three main components in a time series: trend, seasonal and cyclic.

Trend Long-term changes in the mean level that show a rise or a fall in the data. These changes are not necessarily linear, and they can move from a rising trend to a downward trend [33]. Figure 2.9 illustrates the evolution of a time series trend over the days.

Seasonal A seasonal pattern occurs when a phenomenon repeats itself over identical periods of time (e.g., every month or every year).

Cyclic This component exists when the data increases and decreases without a fixed frequency. If a variation occurs with no fixed frequency, it is cyclic; if the frequency is fixed, it is seasonal. In general, cyclical variations are greater than seasonal ones [33].


Figure 2.9: Trending time series [33].

There is also the irregular component, which is what remains after the trend and the seasonality are removed [55].

A time series can have different combinations of these components, as shown in Figure 2.10. The components detected in a time series may also change if the observed time period is extended.

Figure 2.10: Seasonal and cyclical time series [33].

2.4.3 Transformations and smoothing

The simplest transformation that can be applied to a time series is the so-called moving average. It is applied with the objective of revealing low-frequency features and trends, yielding a smoother time series [53]. It is obtained by applying a running mean to the original series over a specified number of consecutive time points [55]. In Figure 2.11 we can see different moving averages applied to the same time series. The transformation applied over the highest number of consecutive observations (9-MA) is the least sensitive to changes in the original time series. We can then say that the larger the moving average window, the less sensitive the smoothed series is to changes in the original time series.
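A minimal Python sketch of a centred moving average is given below. It assumes an odd window size, leaves the end points where the window does not fit as `None`, and uses hypothetical values. The function name is illustrative.

```python
def moving_average(series, window):
    """Centred moving average over an odd number of consecutive points."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = [None] * len(series)
    for i in range(half, len(series) - half):
        smoothed[i] = sum(series[i - half : i + half + 1]) / window
    return smoothed


# Hypothetical series with one abrupt spike at position 4.
series = [10, 10, 10, 10, 19, 10, 10, 10, 10]
ma3 = moving_average(series, 3)
ma9 = moving_average(series, 9)
print(ma3)
print(ma9)
```

The spike raises the 3-point average by 3 units at its peak but the 9-point average by only 1 unit, which is the loss of sensitivity described above.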


Figure 2.11: Different moving averages applied in the same original time series [33].

the time series [55]. There are two decomposition models, the additive and the multiplicative. Having a time series Y , the two decomposition models are [33]:

• additive model, where Y = Trend + Seasonal + Random

• multiplicative model, where Y = Trend × Seasonal × Random

STL is a Seasonal-Trend decomposition procedure based on the LOESS smoother [68]. It is an algorithm developed to divide a time series into three components: trend, seasonality and remainder. The division is additive, which means that summing the three components gives the original series again [8]. An example of a time series decomposed by the additive model is shown in Figure 2.12.
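The additive model can be illustrated with a classical decomposition, sketched below in Python. This is not the STL algorithm: here the trend is estimated with a centred moving average instead of LOESS, and the seasonal component is the average detrended value per position in the period. The function name and the data are hypothetical.

```python
def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + remainder."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half : i + half + 1]
        if period % 2 == 0:
            # Even period: the two end points of the window get half weight.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[i] = total / period
    # Seasonal effect: mean detrended value for each position in the period.
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(series[i] - t)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period  # force the seasonal effects to sum to zero
    seasonal = [means[i % period] - centre for i in range(n)]
    remainder = [
        None if t is None else series[i] - t - seasonal[i]
        for i, t in enumerate(trend)
    ]
    return trend, seasonal, remainder


# Hypothetical series: linear trend plus a period-4 seasonal pattern.
pattern = [2.0, -1.0, -3.0, 2.0]
series = [0.5 * i + pattern[i % 4] for i in range(24)]
trend, seasonal, remainder = decompose_additive(series, period=4)
print(trend[2], seasonal[:4])
```

Because the example series contains no noise, the remainder is zero (up to rounding) wherever the trend is defined, and adding the three components returns the original values.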

2.5 Tools

In this study, we perform an exploratory analysis of meteorological data. Using the Shiny [11] package for the R language [46], it is possible to perform dynamic and interactive analysis. This allows the data to be visualized through graphs that are easy to understand, and it also provides some statistics that can be considered relevant.

R [46] is a free and open-source statistical programming language. It is a language and environment for statistical computing and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages, and debugging capabilities. Although R [46] comes with an integrated set of packages by default, many others can be added in order to extend its features. These packages are collections of R [46] functions, data, and compiled code.


Figure 2.12: Decomposed time series using additive model [33].

Shiny [11] is a web application framework for R [46] that only requires knowledge of the R [46] programming language. Shiny [11] combines the computing power of R [46] with web interactivity. Through Shiny [11], it is possible to build interactive web applications directly from R [46] that provide interactive data summaries to end-users through any modern web browser. Shiny [11] comes with a variety of pre-built widgets that allow us to configure interactive user interfaces. These interactive HTML web applications can be hosted on the Internet without using any web programming language other than R [46].


Chapter 3 Exploratory analysis of meteorological data

In the previous chapter, we presented the background needed for this dissertation. In this chapter, we describe the data on which the exploratory analysis is done, as well as their pre-processing and validation. We also present some of the analyses that were performed on the data.

3.1 Data set characterization

The data set on which the exploratory analysis is performed was provided by the Instituto Geofísico da Universidade do Porto (IGUP) and was recorded between December 1860 and March 1898. At that time the IGUP did not yet exist, and the data were recorded by the meteorological observatory of the Escola Médico-Cirúrgica do Porto (EMCP).

The EMCP was founded in 1836 and until 1911 was a higher education establishment in the area of medicine and pharmaceuticals, operating in the Santo António Hospital. The reason for collecting meteorological data had to do with the influence of temperature and humidity variations on the occurrence of cardiovascular accidents [43]. Thus, it was relevant to know these data in advance so that surgical interventions could be scheduled according to weather forecasts. However, along with the educational reform that followed the establishment of the Portuguese Republic, this school was elevated to the Faculty of Medicine of Porto on 22 February 1911. The EMCP became part of the then created University of Porto, making it the embryo of the current Institute of Biomedical Sciences Abel Salazar, Faculty of Medicine and Faculty of Pharmacy of that University.

On the other hand, the IGUP succeeded the so-called Serra do Pilar Meteorological Observatory. It was founded in 1883 and became one of the first Portuguese meteorological stations, at the time called Porto Meteorological Station and Magnetic House. Currently, IGUP belongs to the National Meteorological Network and has an automatic weather station, as well as a seismograph of the National Network, both under the responsibility of the Instituto Português do Mar e da Atmosfera (IPMA). Thus, this infrastructure has the ability to contribute studies, data and parameters to civil society and the scientific community.

Focusing on the data, there are records from December 1, 1860 to March 31, 1898. During this period, the data were recorded in books by the EMCP and were later manually transcribed to Excel files by IGUP professionals. The data is divided into several files. Each file contains the records for one year, and the records can start either in December or in January. Each file, with the exception of the 1898 file, is divided into 13 sheets: one sheet for each month of the year and one additional sheet with the annual summary. The Excel files do not all have the same format and, in specific situations, they do not contain the same data, since some data are missing.

Figure 3.1: June 1880 Excel worksheet.

In Figure 3.1, we can see the normal structure of a sheet in one of the Excel files. As we can see, the Excel files organize the meteorological data by measuring instrument. The instruments present in the files are the thermometer, pluviometer, barometer, psychrometer, ozonometer, anemoscope and anemometer. Each instrument provides data on the parameters it measures: thermometer - air temperature, pluviometer - precipitation, barometer - atmospheric pressure, psychrometer - vapor pressure and relative humidity, ozonometer - ozone, anemoscope - wind direction, anemometer - wind speed. The way these meteorological parameters appear in the files is described in more detail in Table 3.1. In this table, we can see the different observations made for each of the meteorological parameters mentioned above. In addition, Table 3.1 also shows the period during which each meteorological parameter was recorded. We can see that all parameters except one are recorded during the total time period covered by the files, that is, from 1 December 1860 to 31 March 1898. The parameter that differs from the others is the wind speed, whose observation period runs from 1 January 1865 to 31 March 1898.

The frequency of observation differs between parameters. On the one hand, the air temperature (both in the shade and in exposition), atmospheric pressure, relative humidity, vapor pressure and wind direction are recorded at three times of the day: 9h, 12h and 15h. On the other hand, precipitation, ozone and wind speed are recorded only once a day. Precipitation is recorded at 9h or 15h, ozone at 12h and wind speed at 15h.


In addition to the hourly observations, there are also other records of the parameters. For the air temperature, there are the minimum, maximum and average of each day. For the atmospheric pressure, the daily average is given. Regarding wind speed, there is information about the daily absolute speed.

All the parameters in the data set are regular time series, since they are recorded at regularly spaced intervals of time. The regularity differs from parameter to parameter.

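Parameter | Recordings | Recorded period
Air temperature | Daily minimum | December 1860 - March 1898
Air temperature | Daily maximum | December 1860 - March 1898
Air temperature | Daily average | December 1860 - March 1898
Air temperature | In the shade at 9h | December 1860 - March 1898
Air temperature | In the shade at 12h | December 1860 - March 1898
Air temperature | In the shade at 15h | December 1860 - March 1898
Air temperature | In exposition at 9h | December 1860 - March 1898
Air temperature | In exposition at 12h | December 1860 - March 1898
Air temperature | In exposition at 15h | December 1860 - March 1898
Atmospheric pressure | At 9h | December 1860 - March 1898
Atmospheric pressure | At 12h | December 1860 - March 1898
Atmospheric pressure | At 15h | December 1860 - March 1898
Atmospheric pressure | Daily average | December 1860 - March 1898
Precipitation | Daily precipitation | December 1860 - March 1898
Vapor pressure | At 9h | December 1860 - March 1898
Vapor pressure | At 12h | December 1860 - March 1898
Vapor pressure | At 15h | December 1860 - March 1898
Relative humidity | At 9h | December 1860 - March 1898
Relative humidity | At 12h | December 1860 - March 1898
Relative humidity | At 15h | December 1860 - March 1898
Ozone | At 15h | December 1860 - March 1898
Wind | Direction at 9h | December 1860 - March 1898
Wind | Direction at 12h | December 1860 - March 1898
Wind | Direction at 15h | December 1860 - March 1898
Wind | Speed at 15h | January 1865 - March 1898
Wind | Daily absolute speed | January 1865 - March 1898

Table 3.1: Meteorological parameters available in the Excel files.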

3.2 Data pre-processing

We started our study by performing a first pre-processing of the data provided by the IGUP, regarding the meteorological parameters belonging to the period of time previously mentioned.


The purpose of this initial step is to prepare the data for the exploratory analysis by cleaning and filtering it. To achieve this goal, the data was imported from the Excel files into R, creating a single data set. After this import, we obtained a data set composed of 13602 observations covering the period from December 1860 to March 1898, with 26 attributes concerning the above-mentioned parameters (cf. Table 3.1). Each attribute of this data set is a different parameter of the data. In addition to these attributes, an attribute with the date was created in order to identify the day of each observation. It was also necessary to transform the types of the attributes. The attribute representing the date was converted to the Date type, and the remaining attributes, except the wind direction attribute, were converted to the numeric type. The wind direction attribute was converted to the factor type.

After the data was imported and transformed, it was cleaned and filtered in the pre-processing stage. This made it possible to find the missing values, the values with format errors and the suspicious values in the data set. To help with the pre-processing, a data pre-processing report was created, listing the values found in each of these categories (missing values, values with format error, suspicious values). This report gathers all the possibly erroneous values found during data preparation, so that they can be identified and replaced, if necessary, more simply and quickly. The report can be found in Appendix A. Below, only a few examples of the possibly erroneous values listed in the report are presented.

3.2.1 Missing values

Not available values (NA’s) are missing values in the data set, that is, days for which there are no records of a certain parameter. It is noteworthy that the December 1881 sheet has no records. Table 3.3 shows an example of a missing value (cf. Appendix A). It presents the attribute (precipitation) and the day for which the corresponding value is missing (1897-01-01). This is the only NA present in the data set that does not belong to wind speed (cf. Table 3.2). There are 126 wind speed NA’s because there are no wind speed records between December 1867 and February 1868, nor in July 1871.

3.2.2 Format error values

Because the data set attributes are numeric, the records must be numeric. Records with format error values are therefore records that are not of that type. Table 3.4 shows the records under these conditions for atmospheric pressure at 9h. We can see which attribute they belong to, on which days they occur and what their values are. In Table 3.2 we can see the number of format error values for each parameter.

Parameter | Missing values | Format error values | Suspicious values
Air temperature minimum | 0 | 1 | 4
Air temperature maximum | 0 | 1 | 4
Air temperature average | 0 | 1 | 109
Air temperature in the shade at 9h | 0 | 3 | 2
Air temperature in exposition at 9h | 0 | 0 | 6
Atmospheric pressure at 9h | 0 | 7 | 7
Atmospheric pressure average | 0 | 2 | 122
Precipitation | 1 | 0 | 3
Vapor pressure at 9h | 0 | 4 | 5
Vapor pressure at 12h | 0 | 4 | 2
Vapor pressure at 15h | 0 | 0 | 2
Relative humidity at 9h | 0 | 1 | 3
Relative humidity at 12h | 0 | 0 | 1
Relative humidity at 15h | 0 | 0 | 0
Ozone | 0 | 3 | 4
Wind direction at 9h | 0 | 0 | 0
Wind direction at 12h | 0 | 0 | 0
Wind direction at 15h | 0 | 0 | 0
Wind speed at 15h | 126 | 0 | 4
Wind absolute speed | 126 | 3 | 3

Table 3.2: Number of missing values, format error values and suspicious values per parameter.

Parameter | Day | Value
Precipitation | 1897-01-01 | NA

Table 3.3: Missing value: precipitation.

These values are wrong and therefore had to be changed. Since we do not know the actual values of these observations, we first transformed them into NA’s and then assigned them a new value. This new value is the average of the two nearest neighbors, that is, the observations immediately before and immediately after. To replace the format error values by the average of their neighbors, the na.approx function [70] in R was used. This function replaces missing values (NA’s) by linear interpolation.
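The two steps described above (turning malformed records into missing values and then filling them by linear interpolation) can be sketched in Python as follows. This is only an illustration of the idea, not the na.approx implementation used in the thesis: the function names are hypothetical, the neighbouring readings in the example are invented, and leading or trailing gaps are simply left as `None`.

```python
def parse_reading(text):
    """Return the reading as a float, or None when the format is invalid."""
    try:
        return float(text)
    except ValueError:
        return None


def interpolate_missing(values):
    """Fill interior runs of None by linear interpolation between neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# "754..48" is a malformed value of the kind listed in Table 3.4;
# the two neighbouring readings are hypothetical.
raw = ["754.10", "754..48", "755.30"]
values = [parse_reading(text) for text in raw]
filled = interpolate_missing(values)
print(values, filled)
```

For a single missing value the interpolated result is exactly the average of the two neighbours. For a run of consecutive missing values, such as the three consecutive malformed days in January 1882, the values are spread evenly between the nearest valid observations.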


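Parameter | Day | Value
Atmospheric pressure at 9h | 1863-09-02 | 754..48
Atmospheric pressure at 9h | 1867-02-05 | 7-62.89
Atmospheric pressure at 9h | 1867-04-08 | 76..22
Atmospheric pressure at 9h | 1871-12-29 | 754..57
Atmospheric pressure at 9h | 1882-01-23 | 762,.07
Atmospheric pressure at 9h | 1882-01-24 | 763,.65
Atmospheric pressure at 9h | 1882-01-25 | 763,.95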

Table 3.4: Format error values: atmospheric pressure at 9h.

3.2.3 Suspicious values

To find suspicious values, we defined the acceptable range of values for each parameter in the data set (see Table 3.5). Suspicious values are values that fall outside the range defined for the respective parameter, that is, values below the lower limit or above the upper limit. For example, for the maximum air temperature, we set the lower limit to 0 °C and the upper limit to 41 °C. Table 3.6 contains the suspicious values for the maximum air temperature and the days on which they occur. The parameters with the highest number of suspicious values are the average air temperature and the average atmospheric pressure (cf. Table 3.2).

Parameter Lower limit Upper limit

Maximum air temperature 0 41

Minimum air temperature -10 30

Average air temperature 0 33

Shade air temperature -10 40

Exposure air temperature 10 50

Precipitation 0 100

Atmospheric pressure 700 800

Vapor pressure 1 35

Relative humidity 0 100

Ozone 0 20

Wind speed - absolute 0 1000

Wind speed - 15h 0 40

Table 3.5: Limits used for each parameter to find suspicious values.
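A range check of this kind can be sketched in a few lines of Python. The limits are taken from Table 3.5 and the four flagged values from Table 3.6. The function name and the in-range reading are hypothetical.

```python
# Lower and upper limits for a few parameters, as given in Table 3.5.
LIMITS = {
    "Maximum air temperature": (0, 41),
    "Minimum air temperature": (-10, 30),
    "Relative humidity": (0, 100),
}


def find_suspicious(parameter, records):
    """Return the (day, value) pairs that fall outside the allowed range."""
    low, high = LIMITS[parameter]
    return [
        (day, value)
        for day, value in records
        if value is not None and not (low <= value <= high)
    ]


records = [
    ("1862-08-15", 234),
    ("1862-08-16", 23.1),  # hypothetical in-range reading
    ("1862-08-25", 232),
    ("1867-04-16", 296),
    ("1879-12-22", 145),
]
flagged = find_suspicious("Maximum air temperature", records)
print(flagged)
```

Missing values are skipped rather than flagged, so that they can be reported separately, as in Table 3.2.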

Parameter | Day | Value
Maximum air temperature | 1862-08-15 | 234
Maximum air temperature | 1862-08-25 | 232
Maximum air temperature | 1867-04-16 | 296
Maximum air temperature | 1879-12-22 | 145

Table 3.6: Suspicious values: maximum air temperature.

There are also cases where suspicious values involve two or more attributes. The average values are one such case. In the data set, there are two attributes concerning average values: the average air temperature and the average atmospheric pressure. These values, as the name implies, should be the average of the daily values. The average air temperature is calculated from the maximum and minimum values of the day. The average atmospheric pressure is calculated by averaging the three daily atmospheric pressure values, at 9h, 12h and 15h. So, for the average air temperature, the suspicious values are those that differ from the average of the maximum and minimum air temperature. For the average atmospheric pressure, the suspicious values are those that differ from the average of the atmospheric pressure at 9h, 12h and 15h.

In Table 3.7 we can see some daily average air temperature values that are considered suspicious because they do not correspond to the average of the maximum and minimum temperature of those days. Table 3.7 also shows the expected value for each suspicious value. These expected values are the average of the maximum and minimum temperature, that is, the values that we expected to be correct.

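Parameter | Day | Value | Minimum | Maximum | Expected value
Average air temperature | 1863-02-27 | 12.8 | 7.2 | 17.3 | 12.2
Average air temperature | 1863-03-18 | 11.6 | 6.1 | 15.1 | 10.6
Average air temperature | 1863-04-22 | 17.7 | 12.4 | 21.3 | 16.9
Average air temperature | 1863-05-13 | 18.6 | 15.0 | 20.2 | 17.6
Average air temperature | 1863-06-12 | 15.1 | 11.0 | 18.2 | 14.6
Average air temperature | 1864-04-10 | 18.5 | 13.1 | 23.0 | 18.1
Average air temperature | 1864-07-03 | 24.3 | 16.4 | 26.3 | 21.4
Average air temperature | 1864-08-13 | 22.2 | 18.2 | 24.3 | 21.2
Average air temperature | 1864-08-18 | 22.3 | 19.2 | 23.4 | 21.3
Average air temperature | 1865-02-19 | 8.8 | 4.0 | 13.2 | 8.6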

Table 3.7: Suspicious values: average air temperature.

There is yet another case of suspicious values. It concerns the air temperature and occurs when the maximum air temperature is less than or equal to the minimum air temperature of the same day. Suspicious values found under these conditions are presented in Table 3.8.
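The two cross-attribute checks described in this section can be sketched in Python as follows. The function names are hypothetical, and the tolerance of 0.06 °C is an assumption made here to absorb rounding to one decimal place: the thesis does not state which tolerance, if any, was used.

```python
def average_is_suspicious(minimum, maximum, recorded_average, tolerance=0.06):
    """Flag a daily average that differs from (minimum + maximum) / 2."""
    expected = (minimum + maximum) / 2
    # Assumption: differences up to the tolerance are treated as rounding.
    return abs(recorded_average - expected) > tolerance


def extremes_are_suspicious(minimum, maximum):
    """Flag a day whose maximum is less than or equal to its minimum."""
    return maximum <= minimum


# Rows (day, recorded average, minimum, maximum) taken from Table 3.7,
# followed by one hypothetical consistent day.
rows = [
    ("1863-02-27", 12.8, 7.2, 17.3),
    ("1865-02-19", 8.8, 4.0, 13.2),
    ("hypothetical day", 15.0, 10.0, 20.0),
]
for day, average, minimum, maximum in rows:
    print(day, average_is_suspicious(minimum, maximum, average),
          extremes_are_suspicious(minimum, maximum))
```

The same pattern applies to the average atmospheric pressure, with the expected value computed from the three readings at 9h, 12h and 15h.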

By representing the parameters graphically, it is possible in some cases to identify outliers, since some observations differ greatly from the rest. One of these cases is represented in Figure 3.2, where 4 observations clearly do not follow the pattern of all the others. These outliers were identified during the pre-processing stage as suspicious values.


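Parameter | Day | Minimum temperature | Maximum temperature
Average air temperature | 1863-02-27 | 12.8 | 7.2
Average air temperature | 1863-03-18 | 11.6 | 6.1
Average air temperature | 1863-04-22 | 17.7 | 12.4
Average air temperature | 1863-05-13 | 18.6 | 15.0
Average air temperature | 1863-06-12 | 15.1 | 11.0
Average air temperature | 1864-04-10 | 18.5 | 13.1
Average air temperature | 1864-07-03 | 24.3 | 16.4
Average air temperature | 1864-08-13 | 22.2 | 18.2
Average air temperature | 1864-08-18 | 22.3 | 19.2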

Table 3.8: Suspicious values: minimum air temperature ≥ maximum air temperature.

Figure 3.2: Minimum air temperature plot where it is possible to see outliers.

3.3 Data analysis

After pre-processing the data, we can concentrate on data analysis. The analysis is above all visual, focusing mainly on graphs. In these graphs, we can see how the meteorological parameters in the data set evolve, as well as how each parameter evolves in relation to another. From these graphs, we intend to draw relevant information and statistics about the meteorological parameters in order to characterize their evolution over time.

3.3.1 Univariate Analysis

In order to understand how each meteorological parameter in the data set evolved over time, a univariate analysis was performed for each one. In this analysis, it is possible to visualize different graphs for each parameter. From these graphs, we can see how each parameter has evolved over time, whether by days, months or years. A trend line was drawn in the graphs. Although loess gives a better appearance, it does not work well for larger data sets. Therefore, the loess method is used when the number of observations is less than 1000; otherwise, the gam method is used.

Turning to particular cases, we start by presenting the air temperature.

3.3.1.1 Air temperature

Figure 3.3: Minimum air temperature scatter plot with trend line.

Looking at Figure 3.3, we can see that the minimum air temperature has a considerable upward trend until 1865 and then a downward trend until 1872. From then on, the trend remains stable, without major changes. Figure 3.3 also shows that from 1878 onward the values behave differently, being more continuous. We do not know if this is due to changes in the instrumentation used or to another reason. The average annual minimum air temperature is shown in Figure 3.4, and we can see that overall the values were higher during the 1860s.


Period of time | Min | Max | Avg | Cold nights | Tropical nights | Cold waves
1860 - 1870 | 0 | 28.1 | 11.7 | 0 | 123 | 201
1870 - 1880 | -0.8 | 27.4 | 10.63 | 4 | 65 | 206
1880 - 1890 | -0.4 | 25.2 | 10.64 | 16 | 60 | 217
1890 - 1898 | -0.8 | 27 | 10.74 | 9 | 51 | 163
Global | -0.8 | 28.1 | 10.92 | 29 | 299 | 741

Table 3.9: Summary values of minimum air temperature by decade.

Table 3.9 provides some aggregate minimum air temperature statistics by decade. From the table we can see that the highest values of minimum air temperature were recorded in the 1860s. We can also observe that the number of cold nights (days when the minimum air temperature was below 0 °C) was higher from the early 1880s until the early 1890s. In the 1880s there were more cold nights (16) than in all the other decades combined (13). In the 1860s, there is no record of any cold night. The year with the most cold nights recorded was 1887, with 7 (Figure 3.5a). Regarding tropical nights (days when the minimum air temperature was above 20 °C), they were most frequent in the 1860s (123), and 1865 was by far the year with the most tropical nights (34) (Figure 3.5b). Comparing cold nights with tropical nights, one can see that their trends are opposite: the 1860s were the decade with the most tropical nights (123) and also the decade with the fewest cold nights (0). Figures 3.5a and 3.5b illustrate these observations, as well as the trends of cold nights and tropical nights.

Cold waves were obtained by first averaging the minimum temperature of each day of the year over all the years recorded (i.e., the average of 1 January, the average of 2 January, ...). A cold wave was then considered to have occurred if, for at least 6 consecutive days, the minimum air temperature was more than 5 °C below the average for the respective day. For example, if on 1 March 1870 the minimum temperature was more than 5 °C below the average calculated over all the 1 March days, then that day could be part of a cold wave. If the same happened on (at least) the next 5 days, a cold wave occurred. Regarding the data, the last decade of observations (the 1890s) had the fewest cold waves, but this may be because the recording period is shorter, since there are only records until 1898.
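The cold wave rule described above can be sketched in Python as follows. The function names and the small data set are hypothetical. The sketch assumes a dictionary that maps each date to its minimum temperature, and it returns the first and last day of each detected wave, which is one possible way of counting them.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_climatology(series):
    """Mean value of each calendar day (month, day) over all years."""
    groups = defaultdict(list)
    for day, value in series.items():
        groups[(day.month, day.day)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def cold_waves(series, threshold=5.0, min_days=6):
    """Runs of at least min_days consecutive days more than threshold below the mean."""
    climatology = daily_climatology(series)
    waves, run = [], []
    for day in sorted(series):
        is_cold = series[day] < climatology[(day.month, day.day)] - threshold
        contiguous = bool(run) and day - run[-1] == timedelta(days=1)
        if is_cold and (contiguous or not run):
            run.append(day)
        else:
            # The current run ends here: keep it only if it is long enough.
            if len(run) >= min_days:
                waves.append((run[0], run[-1]))
            run = [day] if is_cold else []
    if len(run) >= min_days:
        waves.append((run[0], run[-1]))
    return waves


# Hypothetical data: the first 20 days of January in three years at 8.0 °C.
demo = {}
for year in (1870, 1871, 1872):
    for offset in range(20):
        demo[date(year, 1, 1) + timedelta(days=offset)] = 8.0
for offset in range(4, 11):   # 5-11 January 1871: seven very cold days
    demo[date(1871, 1, 1) + timedelta(days=offset)] = -4.0
for offset in range(14, 17):  # 15-17 January 1872: only three cold days
    demo[date(1872, 1, 1) + timedelta(days=offset)] = -4.0
waves = cold_waves(demo)
print(waves)
```

In this example only the seven-day spell of January 1871 qualifies. The three-day spell of 1872 is below the threshold but too short. Heat waves could be detected with the same structure by using the maximum temperature and reversing the comparison.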

Regarding the maximum air temperature, it can be seen in Figure 3.6 that, as with the minimum air temperature, the values are more continuous from a certain period, in this case -1879. In Figure 3.6, it is possible to observe that the trend remains stable until the beginning of the 1890s and from then on the trend grew. The highest value came in 1897 (40.2 °C), and of the eight records that are greater than or equal to 38 °C, 6 happened in the 1890s, which corroborates the idea that the highest values happened in the 1890s. Table 3.10 provides some aggregate maximum air temperature statistics by decades. We can see that the average maximum air temperature was higher in the 1890s. It is also possible to see that the number of summer days and extremely hot days was also higher in that decade. All of these statistics show that there was actually an increase in maximum air temperature values in the 1890s. Regarding the heat waves, they were calculated in the same way as the cold waves, but using the maximum air temperature. It was in the 1860s and 1890s that fewer heat waves occurred, while the 1870s and


Figure 3.5: Annual number of days with minimum air temperature (a) below 0 °C and (b) above 20 °C.

1880s had the highest number of occurrences.

Period of time   Min (°C)   Max (°C)   Average (°C)   Summer days   Extremely hot days   Heat waves
1860 - 1870      3          37.4       20.06          637           13                   193
1870 - 1880      8          36.3       19.59          636           8                    215
1880 - 1890      5          38.2       19.56          613           19                   213
1890 - 1898      7.4        40.2       20.76          664           22                   150
Global           3          40.2       19.94          2550          62                   766

Table 3.10: Summary values of maximum air temperature by decade.
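A decade summary of this kind could be computed along the following lines. This is a minimal sketch rather than the procedure used in this work: `decade_summary` is a hypothetical name, the input is assumed to be a dictionary mapping each `datetime.date` to the maximum air temperature in °C, records are grouped by calendar decade, and both thresholds (25 °C for summer days, 35 °C for extremely hot days) are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date


def decade_summary(tmax, summer_threshold=25.0, extreme_threshold=35.0):
    """Aggregate daily maximum temperatures (°C) by calendar decade.

    Both thresholds are assumptions for illustration, not values from the text.
    """
    groups = defaultdict(list)
    for day, value in tmax.items():
        groups[day.year // 10 * 10].append(value)
    summary = {}
    for decade, values in sorted(groups.items()):
        summary[decade] = {
            "min": min(values),
            "max": max(values),
            "average": round(sum(values) / len(values), 2),
            "summer_days": sum(v > summer_threshold for v in values),
            "extremely_hot_days": sum(v > extreme_threshold for v in values),
        }
    return summary
```

The result is keyed by the first year of each decade (for example `1890`), so the last group simply covers whatever years are present in the record.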

In Figure 3.7 we can see the annual average shade temperature at 9h. The bar graph of Figure 3.7a and the boxplots of Figure 3.7b show that the year 1881 was an atypical year compared to its neighbors. In this year, the average value was 16.7 °C, much higher than the


Figure 3.6: Maximum air temperature scatter plot with trend line.

adjacent years. We can also see from Figure 3.7b that the boxplot for this year differs from the others: its interquartile range is smaller than that of every other year, and it has more outliers than any other year except 1882.

In Figures 3.8a, 3.8b and 3.8c, we can see the STL decomposition of the exposure air temperature time series at 9h, 12h and 15h. Between 1864 and 1871, the trend of the 9h and 12h series behaves abnormally, since the values recorded during this time are much higher than the others. We do not know whether this is due to an instrumentation anomaly or to the configuration of the measurement location. For the exposure air temperature at 15h, although there is a lot of noise during this period, the same behavior is not observed.

3.3.1.2 Precipitation

As for precipitation, the total precipitation shows a growing trend until the early 1870s and a decreasing trend thereafter. This is shown in Figure 3.9, which presents the total precipitation per year. During the decreasing trend, some years deviate from it with high precipitation values, namely 1876, 1886 and 1895. The only day when it rained 100 mm was in 1871. The years 1872 and 1876 had the highest total precipitation, 2053 mm and 1897 mm, respectively. Regarding the analysis by month, Figure 3.10 shows the total precipitation for each month over the period for which there are records. The total precipitation was highest in November and January, in this order. At the other extreme, July was the month with the lowest total precipitation, followed by August. Combining months, the period from June to September has the lowest total precipitation values, while the period from October to January has the highest.


Figure 3.7: Shade air temperature at 9h. (a) Annual average. (b) Annual average - boxplots.

days with heavy precipitation (greater than 10 mm) and very heavy precipitation (greater than 20 mm); on the other hand, 1876 was the year with the most days of extremely heavy precipitation (greater than 25 mm). The driest years occurred in the 1890s. The trend of these three statistics is the same: it increases until 1870 and decreases from then on.
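The precipitation indices used in this section can be illustrated with a short Python sketch. It is not the code behind the figures: the function names are hypothetical, the input is assumed to be a list of daily precipitation values (mm) for one year in date order with no gaps, and the 1 mm threshold for a wet day is an assumption. The 10 mm, 20 mm and 25 mm thresholds are the ones given in the text.

```python
def days_above(precip, threshold):
    """Count days whose precipitation (mm) is strictly greater than `threshold`."""
    return sum(1 for value in precip if value > threshold)


def max_consecutive_wet_days(precip, wet_threshold=1.0):
    """Longest run of days with precipitation >= `wet_threshold` (assumed 1 mm)."""
    best = current = 0
    for value in precip:
        current = current + 1 if value >= wet_threshold else 0
        best = max(best, current)
    return best


def max_five_day_total(precip, window=5):
    """Largest precipitation total over any `window` consecutive days."""
    if len(precip) < window:
        return sum(precip)
    return max(sum(precip[i:i + window]) for i in range(len(precip) - window + 1))


def yearly_indices(precip):
    """Collect the indices for one year of daily precipitation values."""
    return {
        "heavy_days": days_above(precip, 10.0),
        "very_heavy_days": days_above(precip, 20.0),
        "extremely_heavy_days": days_above(precip, 25.0),
        "max_consecutive_wet_days": max_consecutive_wet_days(precip),
        "max_five_day_total": max_five_day_total(precip),
    }
```

Applying `yearly_indices` to each year of the record in turn would give one row per year, from which the trend of each index can be plotted.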

All statistics agree with the observed trend: the annual average, annual maximum, total precipitation per year, number of days with heavy precipitation, number of days with very heavy precipitation, number of days with extremely heavy precipitation, maximum consecutive wet days per year and annual greatest 5-day precipitation total all show the same trend, rising until the early 1870s and decreasing thereafter. Table 3.11 shows the precipitation data divided by decade, which confirms the above. Based on Figure 3.12a, there is less precipitation in the summer, and the difference from the other seasons is very large. Contrary to what many people may think, the season with the highest average


Figure 3.8: Decomposition. (a) Exposure air temperature at 9h (°C). (b) Exposure air temperature at 12h (°C). (c) Exposure air temperature at 15h (°C).

precipitation is autumn, followed very closely by winter. Figure 3.12b adds that the lowest average precipitation is registered in July and August, and that the month with the highest average is November.
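The seasonal and monthly averages discussed here amount to grouping the daily values before averaging. The sketch below is one way to do that, under assumptions that are not stated in the text: the seasons are taken to be the meteorological ones (December to February as winter, and so on), the average is taken over daily values, and the names `SEASON_OF_MONTH`, `mean_by`, `seasonal_means` and `monthly_means` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Assumed meteorological seasons for the Northern Hemisphere.
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def mean_by(precip, key):
    """Average daily precipitation (mm) over the groups produced by `key(day)`."""
    groups = defaultdict(list)
    for day, value in precip.items():
        groups[key(day)].append(value)
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


def seasonal_means(precip):
    """Average daily precipitation per season."""
    return mean_by(precip, lambda day: SEASON_OF_MONTH[day.month])


def monthly_means(precip):
    """Average daily precipitation per calendar month (1 to 12)."""
    return mean_by(precip, lambda day: day.month)
```

Both helpers take a dictionary mapping `datetime.date` to daily precipitation, so the same record can feed the seasonal and the monthly view.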
