A Machine Learning Approach to Sentinel-3 Feature Extraction In The Context Of Harmful Algal Blooms


JOÃO COSTA

BSc in Computer Science and Engineering

MASTER IN COMPUTER SCIENCE AND ENGINEERING NOVA University Lisbon

DEPARTMENT OF INFORMATICS

Adviser: Dr. Ludwig Krippahl

Assistant Professor, NOVA School of Science and Technology

Co-adviser: Dr. Marta Belchior Lopes

Researcher, NOVA School of Science and Technology

Examination Committee:

Chair: Dr. Sérgio Marco Duarte,

Rapporteurs: Dr. Rui Alberto Pimenta Rodrigues,

Adviser: Dr. Ludwig Krippahl,


The NOVA School of Science and Technology and the NOVA University Lisbon have the right, perpetual and without geographical boundaries, to file and publish this dissertation through printed copies reproduced on paper or on digital form, or by any other means known or that may be invented, and to disseminate through scientific repositories and admit its copying and distribution for non-commercial, educational or research purposes, as long as credit is given to the author and editor.

ACKNOWLEDGMENTS

I would first like to thank my thesis adviser, Dr. Ludwig Krippahl, and my co-adviser, Dr. Marta Belchior Lopes, for their availability and for guiding me with their expertise through each stage of the process.

I acknowledge the NOVA Laboratory for Computer Science and Informatics for providing me with scholarship funding which allowed me to sustain myself for the duration of the research.

Finally, I would like to thank my family and friends for supporting me and providing me some relief in the highly stressful moments.


“You cannot teach a man anything; you can only help him discover it in himself.” (Galileo).

ABSTRACT

Harmful Algal Blooms (HAB) are typically described as blooms of phytoplankton species that can harm not only the environment but also humans. Some species that form these blooms can release biotoxins, which accumulate in shellfish [1]. Consuming contaminated shellfish can cause adverse health problems in humans [2]–[4]. Due to the associated risk of contamination, shellfisheries are forced to close, sometimes for months, leading to significant economic losses. Although microscopes enable toxic species identification, and bioassays enable biotoxin identification and quantification, these methods are impractical for continuous monitoring since they require recurrent in situ data sampling, followed by laboratory analysis. Chlorophyll a is a pigment common to almost all marine phytoplankton groups. It has a spectral signature that makes it detectable by remote satellites that capture water-leaving radiance [5]. Remote sensing can be very useful since it allows us to take synoptic measurements of large sea areas [6]. Several machine learning algorithms have been researched to detect or forecast algal biomass or HAB presence [7]–[10]. However, the application of remotely sensed images to detect and forecast biotoxin concentration seems relatively unexplored. Given this problem, two datasets of Sentinel-3 imagery patches from along the west coastal region of Portugal were created, which differ in size and in the preprocessing applied. We assessed the application of Machine Learning (ML) models to extract informative features from the datasets. The models were evaluated quantitatively and qualitatively. The qualitative analysis showed that the features extracted by the models seem to be consistent with features extracted for downstream tasks in the literature, suggesting the features retain helpful information. However, at this time, further work is required to determine whether the features can be helpful in the task of biotoxin concentration forecasting.


Keywords: machine learning, feature extraction, Sentinel-3, remote sensing

RESUMO

A Harmful Algal Bloom (HAB) is typically described as the proliferation of phytoplankton species that can harm not only the environment but also humans. Some species that form HABs can release biotoxins, which accumulate in shellfish [1]. When humans consume contaminated shellfish, adverse health problems can follow [2]–[4]. Because of the associated contamination risk, bivalve harvesting areas are forced to close, sometimes for months, leading to significant economic losses. Chlorophyll a is a pigment common to almost all marine phytoplankton groups and has a spectral signature that makes it detectable by remote satellites that capture the radiance leaving the seawater [5]. Remote sensing can be very useful, since it allows us to take synoptic measurements of large sea areas [6]. Several machine learning models have been researched to detect or forecast the presence of algal biomass or HABs [7]–[10]. However, the use of remotely sensed images to detect and forecast biotoxin concentration seems relatively unexplored.

Given this problem, two datasets of Sentinel-3 satellite image patches were created along the western coastal region of Portugal, which differ in size and in the preprocessing applied. We evaluated different machine learning models to extract informative features from the datasets. The models were evaluated quantitatively and qualitatively. The qualitative analysis showed that the information extracted by the models appears consistent with that extracted in the literature to inform other models, suggesting that the features retain useful information. However, at this time, future work is needed to determine whether this information can be useful in the task of forecasting biotoxin concentration.

Keywords: machine learning, information extraction, Sentinel-3, remote sensing

CONTENTS

1 INTRODUCTION

1.1 Background knowledge

1.2 Motivation

1.3 Contributions

2 RELATED WORK

2.1 HAB models

HAB detection

HAB Forecasting

2.2 Biotoxin models

Biotoxin concentration estimation

Biotoxin concentration forecasting

3 THEORETICAL BACKGROUND

3.1 Principal Component Analysis

3.2 Artificial Neural Network

Perceptron

Multilayer Perceptron

Network Regularization

Recurrent Neural Network

Convolutional Neural Network

3.3 Autoencoders

Undercomplete Autoencoders

Sparse Autoencoders

Denoising Autoencoders

Convolutional Autoencoders

4 SENTINEL-3 DATA

4.1 Data description

Sentinel-3 mission

Ocean and Land Color Instrument

OLCI Level-2 Water Full Resolution Product

Product Files and Naming Convention

5 DATASET EXTRACTION AND TRANSFORMATION

5.1 Resources Used

MLCube

5.2 First Dataset (D1)

Data Extraction

Data Pre-processing

5.3 Second Dataset (D2)

Data Extraction

Data Pre-processing

6 FEATURE EXTRACTION METHODOLOGY AND RESULTS

6.1 Resources Used

6.2 Methodology

Evaluation Metrics

CNN-AE

Hyperparameter Search

PCA

Visualization

Clustering

6.3 Results

Hyperparameter Search

Model Reconstruction Error Comparison

PCA encoding sizes

Encoding Space Visualization

DBSCAN

Reconstructed Image Evaluation

7 CONCLUSION AND FUTURE WORK

7.1 Conclusions

7.2 Future Works

A PCA-256 CLUSTERS

A.1 Cluster 0

A.2 Cluster 1

A.3 Cluster 2

A.4 Cluster 3

LIST OF FIGURES

Figure 3.1: Two-dimensional data where the first PC is represented in green, as we can visually confirm that the variance along that direction is the highest, i.e., more "spread out". In pink is the second PC, which as we can see, is orthogonal to the first PC, and the variance along that direction is smaller. Source: [61].

Figure 3.2: Perceptron model as proposed by Rosenblatt. Image source: [62].

Figure 3.3: MLP with an input layer of 𝑛 features, one hidden layer of 𝑘 neurons, and an output layer of one neuron that returns the prediction 𝑦. Source: [63].

Figure 3.4: Typical CNN structure. Source: [64].

Figure 3.5: Undercomplete AE architecture, with the latent space (code) in red.

Figure 3.6: Sparse AE architecture.

Figure 3.7: Denoising AE, where the corrupted files are transformed into uncorrupted counterparts.

Figure 3.8: CNN-AE architecture.

Figure 4.1: Revisit time for the OLCI with Sentinel-3A and Sentinel-3B. Source: [65].

Figure 4.2: The area bounded by the two squares represents the region that is covered by the frames 2340 and 2160 of the OLCI Level-2 Water Full Resolution products.

Figure 5.1: Sequence diagram describing the steps taken by the data extraction pipeline to collect data from the WEkEO platform.

Figure 5.2: Boundaries used to extract patches from the north-western coastline of Portugal for D1.

Figure 5.3: CHL-NN mosaic of the S3B on 09-11-2020 for D1 on the left and the corresponding mask on the right. A pixel the corresponding mask has a count greater than 0.

Figure 5.4: On the left side, we can see the boundaries used to extract the patches for D1, alongside the reference points for each patch. On the right side, the red squares correspond to the extracted patches, whereas the green points represent the reference points.

Figure 5.5: CHL-NN extracted patch of the S3B on 09-11-2020 for D1.

Figure 5.6: Boundaries used to extract patches of the western coast of Portugal for D2. These boundaries cover most of the western coast, while the boundaries used for D1 focused on the northwest coast.

Figure 5.7: log10 scaled CHL-NN mosaic of the S3B on 13-12-2017 for D2.

Figure 5.8: On the left side, we can see the boundaries used to extract the patches for D2, alongside the reference points for each patch as green points. On the right side, the red squares correspond to the extracted patches, whereas the green points represent the reference points.

Figure 5.9: Example of the Chl-a NN channel of a D2 extracted patch after pre-processing.

Figure 6.1: CNN architecture used to extract features from D1 and D2, with an encoding size of 256. This visualization was created with the Visualkeras library.

Figure 6.2: Parallel curves of the hyperparameters searched for the D1 trained CNN-AE, highlighting the best performing models regarding the validation R².

Figure 6.3: Parallel curves of the hyperparameters searched for the D2 trained CNN-AE, highlighting the best performing models regarding the validation R².

Figure 6.4: Line chart showing the cumulative explained variance as the number of principal components increases.

Figure 6.5: 2D visualization of the PCA-256 encodings using the TSNE algorithm, color-coded with the respective seasons.

Figure 6.6: Representation of the PCA-256 clusters, with different colors representing different clusters.

Figure 6.7: Images reconstructed by the PCA-256 model plotted against the original images (for a random channel), and a map for the absolute difference between the original and the reconstruction.

Figure 6.8: Images reconstructed by the D1 trained CNN-AE model plotted against the original images (for a random channel).

Figure 6.9: Original and reconstructed images from the autoencoder model trained in [59], which were used to extract features from multispectral images (Landsat 8). The extracted features were used for a downstream satellite imagery classification task achieving good performance.

LIST OF TABLES

Table 4.1: S3 OLCI spectral bands definition. Source: [26].

Table 4.2: S3 OLCI Level-2 Bands. Source: [27].

Table 4.3: S3 OLCI Quality and Science Flags. Source: [32].

Table 4.4: Description of the fields in a product’s filename. Source: [33].

Table 6.1: Results of the hyperparameter search for the CNN trained with D1, with the validation metrics denoted by "Val".

Table 6.2: Results of the hyperparameter search for the CNN trained with D2, with the validation metrics denoted by "Val".

Table 6.3: Comparison of the model reconstruction metrics on the validation set.

Table 6.4: Performance of the PCA models with different encoding dimensions trained on D2.

1 INTRODUCTION

1.1 Background knowledge

Harmful Algal Blooms (HAB) are typically described as blooms of phytoplankton species that can harm not only the environment but also humans. The term "bloom" defines regions where anomalous marine phytoplankton cell concentrations contrast with their surroundings [6]. Some species that form these blooms can release biotoxins, which accumulate in filter-feeding organisms such as shellfish [1]. Consuming contaminated shellfish can lead to intoxication, diarrhea, respiratory paralysis, and, in extreme cases, death [2]–[4]. In 2018, two patients were hospitalized with severe symptoms after consuming contaminated shellfish [3]. Due to the associated risk of contamination, shellfisheries are forced to close, sometimes for months, leading to significant economic losses. Early warning forecasting systems are warranted given the reported worldwide increase in the frequency of these events [6].

Microscopes are commonly used to identify toxic species; however, this procedure requires taxonomic expertise, is impractical for continuous monitoring, and does not quantify the toxin concentration but rather the biomass of the toxin-producing species. Structure-based bioassays can identify and quantify biotoxins, but they are just as burdensome as microscopy.

Marine phytoplankton cells, which can proliferate to become a HAB, are photosynthesizing organisms. Chlorophyll-a (Chl-a), a pigment used in photosynthesis, is therefore common to almost all taxonomic groups. For this reason, Chl-a concentration has been used as a proxy for phytoplankton biomass to monitor bloom events [6]. However, it does not provide information about the species or its toxicity [11]. This pigment, along with other pigments specific to a genus or species (fucoxanthin in diatoms and peridinin in dinoflagellates, for example), absorbs and reflects radiation in specific regions of the electromagnetic spectrum (its spectral signature), which makes it detectable by remote satellites that capture water-leaving radiance [5]. This discoloring also means that we can use ocean color data to locate HABs.

Remote sensing can be very useful since it allows us to take synoptic measurements of large sea areas [6]. To understand synopticity, we can look at a simple example: taking the temperature of a house at a point in time by going into each room with one thermometer (instead of having one in each room). We can get the whole house’s temperature at a particular point in time because the time spent going to each room and taking a measurement is much lower than the time it would take for the house’s thermal conditions to change significantly.

The description of the temperature of the house can therefore be considered synoptic. In other words, if the time between the first and the last measurement is shorter than the time it would take for the observed variable to change, the measurements can be considered synoptic. On the other hand, the measurements taken by boats to cover large areas are not synoptic since it would take hours, if not days, to go from one point to another until complete coverage of an area, as demonstrated by Rixen et al. [12].

Remotely sensed images can also provide measurements where boat access is restricted, and they are cheap compared to oceanographic cruises, rendering them a practical tool for continuous monitoring and early warning systems.

1.2 Motivation

The MATISSE project (A machine learning-based forecasting system for shellfish safety) was created to respond to flaws in the current strategy for monitoring shellfish contamination, dictated by EU legislation. This strategy is reactive, which means it only avoids further damage once shellfish contamination has occurred. The MATISSE project looks at the problem from a different perspective: an early detection system capable of forecasting future biotoxin values allows for proactive strategies (preventive or alternative solutions) to avoid the frequent economic losses related to shellfisheries. The forecasts will be built with machine learning tools trained on different data sources, making the system more robust to missing information (e.g., remote sensing data, which is negatively affected by clouds). The trained models will be validated against historical data from the shellfisheries’ routine environmental surveys and combined for improved performance. The resulting model will then be integrated into a web platform that can be used by the shellfish sector and the Public Administration, namely the Portuguese Institute of Sea and Atmosphere (IPMA) and the Directorate-General of Natural Resources, Security and Maritime Services (DGRM).

The work of this dissertation was developed under the umbrella of project MATISSE. The focus was to explore feature extraction methods on Sentinel-3 image data from the western coastal region of Portugal. Given the lack of labeled marine toxin data, unsupervised learning approaches can be used to improve the performance of the supervised estimators. This improvement is particularly noticeable when there is high data availability, as is the case for remote sensing, where several years of combined Sentinel-2 and Sentinel-3 data exist.

Accordingly, an unsupervised estimator can create a good representation of this data that can later be used by the forecasting models, which are supervised. However, we must note that even though we might have a good representation of the remotely sensed data, there is no guarantee that the information extracted is helpful for the supervised task. The impact of the features on the forecasting task can only be determined upon training the supervised model. The forecasting of marine toxins is out of the scope of this thesis, so the representations extracted here will need to be validated in future phases of the project.

1.3 Contributions

This thesis is part of an initial phase of the MATISSE project and makes the following contributions:

• Compared the performance of the representations generated by the feature extractors in both generated datasets.

• Created two datasets that differ in size and the preprocessing procedures, but both pertain to Sentinel-3 imagery patches from along the west coastal region of Portugal.

• Assessed the application of Machine Learning (ML) models to extract informative features from the datasets.

• Provided a quantitative and qualitative evaluation of the models. The qualitative analysis showed that the features extracted by the models seem to be consistent with features extracted for downstream tasks in the literature, suggesting the features retain helpful information. Determining whether the features are helpful in the task of biotoxin concentration forecasting is left for future work.

2 RELATED WORK

2.1 HAB models

HAB detection

2.1.1.1 Biomass indicator concentration

Using remotely sensed images, we can employ reflectance band-ratio algorithms to estimate Chl-a concentrations or create HAB indicators.

First, an empirical relation between Chl-a concentrations and water-leaving reflectance is established in situ. A standard spectral band ratio used to establish this relationship is the blue-green (440–550 nm) band ratio because most phytoplankton absorption occurs within this range of the visible spectrum. With this relationship established and given that phytoplankton is the primary water constituent, we can map the water-leaving reflectance captured by the satellites in that band ratio to the corresponding Chl-a concentration. However, the use of blue-green spectral bands for the specific detection of Chl-a in coastal waters is affected by the absorption signal of colored dissolved organic matter (CDOM) and total suspended matter (TSM). To overcome this limitation, an empirical relationship can also be established in the red-NIR (680–750 nm) region because there seems to be a correlation between the scattering from algal biomass and Chl-a concentration. This region is more appropriate for coastal areas because the in vivo absorption peak near 676 nm is minimally affected by CDOM and TSM when the two are in low concentrations. In addition, band-ratio algorithms with different widths in the red-NIR region have performed best at different Chl-a concentrations. However, there are some limitations: (1) the relative error on Chl-a retrievals might be larger at low Chl-a concentrations, and (2) these algorithms can have significant errors when applied at a global scale. Therefore, we might need to validate them using in situ data from the target study regions.
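To make the idea of an empirical band-ratio relationship concrete, the sketch below fits a straight line between the logarithm of a band ratio and the logarithm of in situ Chl-a. The log-log linear form, the function names `fit_band_ratio` and `predict_chl`, and the synthetic numbers are assumptions chosen for illustration; they are not taken from any of the algorithms cited in this chapter.

```python
import numpy as np

def fit_band_ratio(ratio, chl):
    """Fit log10(chl) = intercept + slope * log10(ratio) by least squares.

    `ratio` is a reflectance band ratio (e.g., blue/green) and `chl` the
    matching in situ Chl-a concentrations. The log-log linear form is an
    illustrative assumption.
    """
    x = np.log10(np.asarray(ratio, dtype=float))
    y = np.log10(np.asarray(chl, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def predict_chl(ratio, slope, intercept):
    """Map a band ratio to a Chl-a estimate using the fitted coefficients."""
    return 10.0 ** (intercept + slope * np.log10(np.asarray(ratio, dtype=float)))

# Synthetic, noise-free example data (not real measurements).
ratio = np.array([0.5, 1.0, 2.0, 4.0])
chl = 2.0 * ratio ** -1.5

slope, intercept = fit_band_ratio(ratio, chl)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"estimated Chl-a at ratio 1.5: {float(predict_chl(1.5, slope, intercept)):.3f}")
```

In practice, the coefficients would be fitted on in situ samples from the study region and then checked against held-out samples, which is the regional validation step mentioned above.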

Researchers used a reflectance-based algorithm in 2020 to monitor a HAB during a government-issued emergency lockdown in the COVID-19 pandemic, when in situ samples were difficult to collect [13]. This bloom contained a harmful dinoflagellate species, which the authors associated with the massive salmon mortalities recorded in Chile in 2020. The harmful species had a strong reflectance response in the near-infrared part of the spectrum, and this peak could be detected using the spectral bands of Sentinel-3 (S3) and Sentinel-2 (S2), which the authors exploited with a reflectance-based algorithm termed the normalized difference chlorophyll index (NDCI). The algorithm uses the red and red-edge bands, which are appropriate in coastal regions due to the water’s visual complexity (associated with CDOM and TSM). They implemented the NDCI using both S3 and S2 remotely sensed images. First, the researchers detected the HAB presence and obtained a rough location at the mesoscale with NDCI using S3 (which has a higher spectral and temporal resolution). Then, they obtained more precise information using the S2 implementation of NDCI, thanks to the higher spatial resolution of this satellite. This research is an excellent example of the vital information we can extract using remote sensing technology to monitor HABs.
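As an illustration, the NDCI is commonly written as the normalized difference between the red-edge and red reflectances, (R_red-edge − R_red) / (R_red-edge + R_red). The sketch below computes it per pixel with NumPy. The function name `ndci`, the `eps` guard, and the sample reflectance values are assumptions for illustration, and the exact bands used in [13] should be taken from that paper.

```python
import numpy as np

def ndci(red_edge, red, eps=1e-12):
    """Normalized difference chlorophyll index per pixel.

    `red_edge` and `red` are reflectance arrays of the same shape
    (around 708 nm and 665 nm in the usual formulation). Pixels whose
    denominator is (near) zero are returned as NaN.
    """
    red_edge = np.asarray(red_edge, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = red_edge + red
    out = np.full(denom.shape, np.nan)
    np.divide(red_edge - red, denom, out=out, where=np.abs(denom) > eps)
    return out

# Made-up reflectances: a bloom-like pixel, a clear-water-like pixel,
# and a pixel with no signal.
red_edge = np.array([0.03, 0.01, 0.0])
red = np.array([0.01, 0.03, 0.0])
print(ndci(red_edge, red))
```

Higher values indicate a stronger red-edge peak relative to the red band, which is the signal exploited to flag likely bloom pixels.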

As mentioned earlier, some factors might limit the estimation of Chl-a concentration using remote satellites. For example, the images captured with these satellites cover a discrete number of bands, which means the spectral features that would allow us to identify a specific species might be outside the images’ spectral range. Also, the water-leaving reflectance is often the result of interactions between species and other components [14]. Nonetheless, Palenzuela et al. [10] obtained reasonable estimations using fuzzy c-means (FCM) clustering and an Artificial Neural Network (ANN), given S3 reflectance images as input. First, FCM was used to cluster the S3 images by water type, given their visual complexity. They then trained an ANN to estimate the Chl-a concentration using the reflectance of the pixels that belonged to the cluster covering the largest portion of the image, with in situ Chl-a concentrations as labels.

Palenzuela et al. [10] used two metrics to evaluate the performance of their model: the coefficient of determination (R²) and the root mean squared error (RMSE). Their model estimated Chl-a with an R² of 0.95 and an RMSE of 0.44. Therefore, the values predicted by the model explain 95% of the variability in the in situ data, indicating a good relationship between the two.
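For reference, the two metrics can be computed as in the following sketch. It assumes the usual definitions, R² = 1 − SS_res / SS_tot and RMSE as the square root of the mean squared residual; whether [10] used exactly these definitions (rather than, for example, the squared Pearson correlation) is not stated here, and the sample values are invented.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented in situ values and model estimates, for illustration only.
in_situ = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
print(f"R2 = {r_squared(in_situ, estimated):.3f}")
print(f"RMSE = {rmse(in_situ, estimated):.3f}")
```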

These results suggest that the model is a contender to substitute manual in situ data collection instruments. The validation of a measurement instrument should demonstrate the reliability of its measurements of the quantitative variable. However, the correlation between the two instruments’ measurements does not tell us how well the two methods agree, because it does not account for systematic differences between the methods (such as a bias). We can perform, for example, a Bland-Altman analysis to evaluate the agreement between two measurements [15]. This method computes the mean difference between the paired measurements and outputs limits of agreement, which denote the range within which about 95% of the differences between the measurements are expected to lie. These limits of agreement help us judge whether the differences are small enough not to cause problems in practice. Considering that the authors did not use any method to assess the proposed model’s agreement, the model’s practical utility is questionable.
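A minimal sketch of the Bland-Altman computation follows. It assumes the standard formulation in which the limits of agreement are the mean difference ± 1.96 standard deviations of the differences, which in turn assumes approximately normally distributed differences. The function name and the paired values are hypothetical.

```python
import numpy as np

def bland_altman_limits(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the differences a - b; the limits of agreement
    are bias +/- 1.96 * sample standard deviation of the differences.
    """
    a = np.asarray(method_a, dtype=float)
    b = np.asarray(method_b, dtype=float)
    diff = a - b
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired Chl-a values: in situ vs. model estimates.
in_situ = [10.0, 12.0, 14.0, 16.0]
model = [9.0, 12.0, 13.0, 17.0]
bias, lower, upper = bland_altman_limits(in_situ, model)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

Whether limits of this width are acceptable is a domain decision: they have to be compared with the largest disagreement that would still be tolerable for the monitoring task.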

2.1.1.2 HAB occurrence/non-occurrence

Valvi et al. [16] developed a Random Forest (RF) model to detect blooms of Alexandrium minutum (a dinoflagellate species) in the north-western Adriatic coastal waters. In contrast to the previously analyzed model (which posed the problem as a regression task to estimate algal biomass), their model detects the occurrence/non-occurrence of blooms containing the harmful species. Their RF model takes as input nutrient information collected from monthly water samples and other environmental variables (such as wind direction, Sea Surface Temperature (SST), and salinity, among others) measured with a CTD probe. They trained 160 RFs with all the predictive features to search for the hyperparameters that gave the best performance on the test set. These hyperparameters were the number of trees in the RF, the number of variables available at each split, and the minimum number of records in each terminal node, i.e., in each "leaf." The optimal cutoff value was selected for each trained model based on the best ratio between true and false positives. Although not stated by the authors, optimizing this ratio is equivalent to optimizing the precision of the model. The authors used a receiver operating characteristic (ROC) curve to determine the best cutoff. The ROC curve plots the false positive rate on the x-axis against the true positive rate on the y-axis for different cutoffs. The best model was selected based on Cohen’s Kappa coefficient, a measure of inter-rater agreement that compares the observed agreement between the model’s predictions and the labels with the agreement expected by chance. Then, they trained another 160 RFs but excluded the nutrient-related variables. Excluding the nutrient information removes the need to take water samples for laboratory analysis to determine nutrient concentrations, making the model easier to use.

Both trained models were able to predict with 85.5% accuracy and a Cohen’s Kappa of 0.7, indicating substantial agreement [17]. We should note that, while the lack of nutrient information did not seem to affect the model’s performance, diatoms tend to have a lower affinity for nutrient uptake than flagellates [18]. The model was also only trained to detect a single species’ presence over a relatively restricted area; hence, we should be wary before generalizing to other species, genera, or locations.
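To show how Cohen’s Kappa corrects the observed accuracy for chance agreement, the sketch below computes it from a confusion matrix as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected from the row and column totals. The confusion matrix is invented for illustration and is not the one reported in [16].

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows: true, cols: predicted)."""
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement: product of the marginal frequencies, summed over classes.
    p_expected = float(np.sum(cm.sum(axis=0) * cm.sum(axis=1))) / n ** 2
    return float((p_observed - p_expected) / (1.0 - p_expected))

# Hypothetical bloom / no-bloom confusion matrix (illustrative counts only).
cm = [[40, 10],
      [5, 45]]
accuracy = (40 + 45) / 100
print(f"accuracy = {accuracy:.2f}, kappa = {cohens_kappa(cm):.2f}")
```

With these made-up counts, an accuracy of 0.85 corresponds to a Kappa of about 0.70, because half of the agreement would already be expected by chance given the class totals.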

HAB Forecasting

We have established that we can obtain a reasonable estimation of Chl-a concentration, but we can also forecast future concentration values. Li et al. [8] trained machine learning models to forecast Chl-a concentration (μg/l) with one-week and two-week forecast horizons in Tolo Harbour, China. They used monthly and biweekly water quality monitoring data (e.g., Chl-a, water temperature, nutrient information) and daily meteorological wind speed and solar radiation data to make their forecasts. The model that obtained the best results on the validation data was the Support Vector Machine (SVM), with R² = 0.994 and RMSE = 1.583 for the one-week horizon, and R² = 0.819 and RMSE = 5.436 for the two-week horizon. The SVM slightly outperformed the ANN model, proving to be a robust alternative. However, regular feed-forward ANNs do not model the sequential structure of time-series data as Recurrent Neural Networks (RNN) do, and researchers have successfully applied RNNs to temporal forecasting in other domains [19]. Therefore, the results might have differed had the authors used an RNN.

Lake Erie is an area where cyanobacteria-induced HABs are a common occurrence. Therefore, Reinoso et al. [9] proposed two machine learning models to forecast the monthly Cyanobacterial Index (CI), a HAB biomass indicator (which can be measured with remote sensing satellites), for July, August, September, and October (when bloom events are recurring).

One of those models was an ANN, which used several algal biomass predictors, such as water temperature, nutrient levels, and others. The model’s test results showed a moderate-high correlation (R= 0.83, R²= 0.69) with the true values. However, a drawback of this model is that it requires the previous month’s CI value for the forecast, which shortens the forecast horizon.

The researchers also trained a Classification and Regression Tree (CART) model, a decision tree predictive algorithm, to predict the forecast months with a discrete, binned CI output. The CI values were binned into three classes as follows:

• "1" = Mild (CI < 1.5),

• "2" = Significant (1.5 < CI < 7),

• "3" = Severe (CI > 7).
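The binning above can be written as a small function. The thresholds come from the list; how the boundary values CI = 1.5 and CI = 7 are assigned is not specified there, so placing both in class 2 is an assumption of this sketch, as is the function name.

```python
def ci_class(ci):
    """Map a Cyanobacterial Index value to the binned class label.

    1 = Mild (CI < 1.5), 2 = Significant, 3 = Severe (CI > 7).
    Assumption: the boundary values 1.5 and 7 fall in class 2.
    """
    if ci < 1.5:
        return 1
    if ci <= 7:
        return 2
    return 3

# Example CI values (illustrative, not taken from [9]).
for value in (0.5, 3.0, 9.0):
    print(value, "->", ci_class(value))
```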

CART models have some advantages in terms of interpretation and visualization over ANN models because we can see the characteristics of the variables that may lead to a particular classification. During training, CART models essentially form groups (nodes) of data bounded by certain variable conditions, where each group is represented by its mode (for classification problems) or its mean value (for regression problems). The authors reported the precision of the best CART model, which was 92.9%. However, this information alone does not paint the whole picture. They also reported the model’s confusion matrix, from which we can derive other metrics such as the model’s accuracy, which was 92.86%. However, the dataset was unbalanced (class "1" had the same number of labeled data points as classes "2" and "3" combined); therefore, accuracy is not an appropriate metric. To understand why, consider a hypothetical unbalanced dataset with 10% of label "0" and 90% of label "1". If the model always predicted label "1", it would achieve 90% accuracy, even though it did not learn anything. Likewise, if the model randomly assigned 90% of its predictions to label "1" and 10% to label "0", it would be expected to reach 82% accuracy (0.9 × 0.9 + 0.1 × 0.1) by chance alone. We can also calculate Cohen’s Kappa from the confusion matrix, which gives 0.877, indicating almost perfect agreement [17].
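The chance-level accuracies in the hypothetical example can be checked with a few lines of code. The sketch assumes predictions are drawn independently of the true labels, so the expected accuracy is the sum over classes of (share of the class in the data) × (share of predictions assigned to it). The function name is illustrative.

```python
def chance_accuracy(label_shares, prediction_shares):
    """Expected accuracy when predictions are independent of the true labels."""
    return sum(p * q for p, q in zip(label_shares, prediction_shares))

labels = [0.1, 0.9]  # 10% of label "0", 90% of label "1"

always_majority = chance_accuracy(labels, [0.0, 1.0])
random_matching = chance_accuracy(labels, [0.1, 0.9])

print(f"always predicting '1': {always_majority:.2f}")
print(f"random 10%/90% predictions: {random_matching:.2f}")
```

This chance agreement is exactly the term that Cohen’s Kappa subtracts from the observed accuracy, which is why Kappa is more informative than accuracy on unbalanced data.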

Considering these results, the ANN and CART models could be run in conjunction in an early-warning system. The ANN forecasts numerical algal biomass in the short term, whereas the CART model forecasts over a longer horizon, giving decision-makers the opportunity to react earlier.

Hill et al. [7] used a CNN and RNN-based architecture to detect and forecast the presence of HABs related to the Karenia brevis species in Florida’s coastal waters using MODIS satellite imagery. They tackled the problem as a multimodal task (since multiple modalities were used, such as Chl-a concentration and bathymetry) with spatiotemporal dependencies, using the concept of a datacube. Datacubes contain images from different modalities aligned in time and space, centered spatially around the HAB event, including images from the event's ten previous days. First, NASNet-Mobile, a CNN-based model, is used to extract features from every modality at every timestep. Next, the modality features of the same timestep are concatenated and fed into an LSTM model. Finally, the LSTM model’s outputs at every timestep are concatenated to form a feature vector, which they used as input to several other machine learning models, such as SVM, ANN, and RF. All the trained models were validated with a validation dataset, and the model that obtained the best performance was able to detect Karenia brevis bloom events with 91% accuracy and a Cohen’s kappa of 0.81, indicating almost perfect agreement [17]. The authors also leveraged the datacube architecture to predict HAB events over an 8-day forecast horizon with 86% accuracy, using only data from two timesteps.

2.2 Biotoxin models

Biotoxin concentration estimation

The systems described have been focused on detecting or forecasting algal biomass continuous values (direct indicators such as algal cell concentration or indirect indicators such as Chl-a) or the presence/absence of a HAB. However, only some blooms produce biotoxins, and in those that do, the toxicity of the species does not necessarily increase with the algal biomass. Furthermore, the toxins released by these species are responsible for considerable economic losses due to the closure of shellfisheries and health problems due to the consumption of these toxins in the trophic chain [1]. Therefore, we should also examine the work done until now to monitor the release of these toxins.

Generally, we can detect algal toxins with bioassays, which, in this context, are methods that quantify the concentration of biotoxins by observing the effects of these toxins in vivo or in vitro. Structure-based methods, such as immunoassays, are commonly used to identify and quantify analytes (here, the toxin under analysis). However, these procedures must be carried out in a laboratory, rendering them impractical for continuous monitoring. Thus, [20] explored the application of a portable device with an embedded neural network biosensor (NNB), a sensor capable of taking electrophysiological measurements, which allows for detecting biotoxins on-site. Their function-based bioassay facilitates in situ monitoring of HAB-produced neurotoxins such as brevetoxin-3 (Karenia brevis) and saxitoxin (Alexandrium fundyense) dissolved in seawater, even in the presence of their algal producers. The NNB detects the reaction of cultured mammalian neurons in the presence of seawater-diluted neurotoxins. Even though the tested toxins elicited different electrophysiological effects on nervous tissue, spike dynamics remained similar. Using a logistic regression model (comparable to an ANN with one neuron and a sigmoid activation), the researchers established a relationship between mean spike rate and biotoxin concentration. Though the proposed method is helpful, it does not remove the need for structure-based methods, since it can only detect the presence of neurotoxins but can neither quantify their concentration nor identify the toxin. Thus, it should be used as a simple detection approach, followed by a structure-based method for more detailed toxin information.

Biotoxin concentration forecasting

Although the method described in the previous section simplifies the toxins’ monitoring, a forecast would help aid decision-making. Grasso et al. [21] tried to tackle this issue with an ANN ensemble consisting of 50 randomized initializations of the same neural network architecture. The model accepts as input the past five weeks of toxicity concentration of 12 different toxins (collected in-situ) to forecast a binned toxicity concentration in the Gulf of Maine. The model can generate a forecast for the following week, though it can forecast reasonably up to two weeks in the future. Past this forecast horizon, we can observe a drop in performance. The authors binned the toxicity concentrations for prediction into four categories:

• Class 0: 0–10 μg 100 g⁻¹ of shellfish tissue (80% of samples),

• Class 1: 10–30 μg 100 g⁻¹ (11% of samples),

• Class 2: 30–80 μg 100 g⁻¹ (5% of samples),

• Class 3: > 80 μg 100 g⁻¹, the cutoff limit established by the National Shellfish Sanitation Program (4% of samples).

The cutoff of class 3 considers that shellfish harvesting on the coast of Maine closes when the paralytic shellfish toxin levels in the tissue of M. edulis reach a level of 80 μg 100g⁻¹ of tissue established by the National Shellfish Sanitation Program. The primary objective was the distinction between closure and non-closure events. However, since the dataset is highly unbalanced (>90% non-closures), this categorization alleviates the drawbacks of the unbalance while possibly facilitating the model’s prediction. The model, which was trained with three years of data (2014-2016) and validated with one year of data (2017), achieved an accuracy of 96.1%±0.2%, successfully forecasting discretized toxin concentrations in the Gulf of Maine, given historical toxin concentrations of the previous five weeks, with a one-week horizon.
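The binning described above can be sketched as follows. This is only an illustration: the function and constant names are hypothetical, and the handling of values that fall exactly on a bin edge (assigned here to the lower class, so that only values strictly above 80 reach class 3) is an assumption, since the text does not state it.

```python
import numpy as np

# Upper edges of classes 0, 1 and 2, in micrograms per 100 g of tissue.
BIN_EDGES = np.array([10.0, 30.0, 80.0])

def toxicity_class(concentrations):
    """Map toxin concentrations to the four classes 0..3.

    With right=True a value equal to an edge falls in the lower class
    (assumption), so class 3 contains only values strictly above 80.
    """
    return np.digitize(concentrations, BIN_EDGES, right=True)

# Hypothetical concentrations, for illustration only.
example_classes = toxicity_class(np.array([5.0, 20.0, 50.0, 80.0, 120.0]))
is_closure = example_classes == 3  # binary closure / non-closure view
```

Keeping the four-class target while deriving the closure flag from class 3 mirrors the idea that the finer categorization is a training aid, whereas the closure decision remains the primary objective.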

3

THEORETICAL BACKGROUND

3.1 Principal Component Analysis

Unstructured data types, such as images, tend to be high dimensional, making it difficult for machine learning models to discern between valuable features and how they all relate to the output variable. This problem is further aggravated with remote sensing data, such as the one used in this thesis, where the images contain a much higher number of channels.

Feature extraction approaches can be applied to tackle this problem. These methods transform the original data into a lower-dimensional dataset with more informative features. For example, a widely used approach to extract useful features from data such as this is Principal Component Analysis (PCA).

Figure 3.1: Two-dimensional data where the first PC is represented in green; we can visually confirm that the variance along that direction is the highest, i.e., the data is the most "spread out". The second PC, in pink, is orthogonal to the first PC, and the variance along that direction is smaller. Source: [61].

The process can be understood as iterative, though it is not performed iteratively in practice. First, we find the direction along which the variance of the data points is maximized (represented in green in Figure 3.1). Then, we project the data points in this direction, generating a new feature (the principal component [PC]) as a linear combination of the original input variables. This process is then repeated by choosing a new direction orthogonal to all previous ones. In practice, this is done by computing eigenvectors, and it also generates eigenvalues which allow us to rank different PCs by order of importance (variance of the data explained by the PC).
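The eigenvector formulation mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic two-dimensional data, not the implementation used in this work; the function name and the example data are hypothetical.

```python
import numpy as np

def pca(data, n_components):
    """Project `data` (samples x features) onto its first principal components."""
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]            # sort by explained variance
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    components = eigenvectors[:, :n_components]      # one direction per column
    projected = centered @ components                # the new features (PCs)
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, components, explained_ratio

# Synthetic, strongly correlated 2D data (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=200)])
projected, components, explained_ratio = pca(data, n_components=1)
```

Because the two synthetic features are almost perfectly correlated, a single PC explains nearly all of the variance, which is the situation in which dropping the remaining components loses little information.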

3.2 Artificial Neural Network

Perceptron

Artificial Neural Network (ANN) models are inspired by biological brain neurons. The simplest ANN is called the perceptron [22], which consists of a single neuron that outputs a binary classification (0 or 1). The neuron holds parameters (weights) that are multiplied by the input variables; the products are summed together with a bias term and passed to a threshold function. If the result of these numerical operations is greater than or equal to 0, then the perceptron outputs the number 1; otherwise, it returns the number 0. The perceptron is exemplified in Figure 3.2.

Figure 3.2—Perceptron model as proposed by Rosenblatt. Image source [62].

The perceptron featured an iterative algorithm that would update the parameters using example data and make small changes depending on the error between the predicted and the expected label. This process would be repeated until the convergence of the parameters.
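The threshold rule and the iterative update can be sketched as below. This is a minimal illustration on a toy, linearly separable problem (the logical AND); the function names, the learning rate, and the epoch limit are hypothetical choices.

```python
import numpy as np

def train_perceptron(inputs, labels, learning_rate=1.0, max_epochs=100):
    """Train a single threshold neuron with the perceptron update rule."""
    weights = np.zeros(inputs.shape[1])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(inputs, labels):
            prediction = 1 if np.dot(weights, x) + bias >= 0 else 0
            error = target - prediction          # -1, 0 or +1
            if error != 0:
                weights += learning_rate * error * x
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:                        # parameters have converged
            break
    return weights, bias

def predict(inputs, weights, bias):
    """Apply the threshold function to a batch of inputs."""
    return (inputs @ weights + bias >= 0).astype(int)

# Toy linearly separable problem: logical AND.
inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 0, 0, 1])
weights, bias = train_perceptron(inputs, labels)
predictions = predict(inputs, weights, bias)
```

The same loop would never reach zero mistakes on a problem that is not linearly separable, such as XOR, which is the limitation discussed next.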

Though this network seemed promising initially, it was later shown to be equivalent to linear models such as logistic regression, which meant that it only performed adequately on linearly separable classes. So, although the perceptron model is not typically used today, it is conceptually fundamental to understanding the deep learning models that followed.

Multilayer Perceptron

The multilayer perceptron (MLP) model was the successor of the perceptron, addressing its limitations. Also named deep feedforward networks, these models are trained to approximate some function 𝑓, such that 𝑦 = 𝑓(𝑥), where 𝑥 is the input data, and 𝑦 is the expected labels. It does so by updating a set of parameters until converging to a function that minimizes the estimation error.

These models are considered deep networks because they are composed of a chain of functions. For example, given the functions 𝑓′⁽¹⁾, 𝑓′⁽²⁾, and 𝑓′⁽³⁾, the output of the network would be 𝑓′(𝑥) = 𝑓′⁽³⁾(𝑓′⁽²⁾(𝑓′⁽¹⁾(𝑥))), where 𝑥 is the input data, 𝑓′⁽¹⁾ is the first layer, 𝑓′⁽²⁾ the second layer, and 𝑓′⁽³⁾ the output layer. The training of MLPs became possible with the establishment of the back-propagation method. This iterative algorithm gradually updates the parameters of each layer by calculating the gradient of a cost function (a task-specific function we want to minimize, e.g., the mean squared error for regression tasks) with respect to each parameter, until a prespecified stopping condition is met (hopefully after converging to a global minimum).

Furthermore, each layer of an MLP can have multiple neurons (the width of the layer), where each neuron computes the weighted sum of the previous layer's outputs plus a bias. Finally, this sum is fed to a non-linear function called the activation function, which gives the neuron's output. An example of an MLP can be seen in Figure 3.3.
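The chain of functions and the per-neuron computation can be sketched as a forward pass. This is an illustration only: the layer sizes, the ReLU activation, and the random (untrained) parameters are assumptions, and no back-propagation is shown.

```python
import numpy as np

def relu(z):
    """A common non-linear activation function (assumed here)."""
    return np.maximum(z, 0.0)

def identity(z):
    """Linear output, as typically used for regression."""
    return z

def dense(inputs, weights, bias, activation):
    """One layer: weighted sum of the previous outputs plus bias, then activation."""
    return activation(inputs @ weights + bias)

def mlp_forward(x, params):
    """Compute f3(f2(f1(x))) for a network with two hidden layers."""
    hidden1 = dense(x, *params[0], relu)
    hidden2 = dense(hidden1, *params[1], relu)
    return dense(hidden2, *params[2], identity)

# Hypothetical sizes: 4 input features, two hidden layers of width 8, 1 output.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]
params = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
batch = rng.normal(size=(5, 4))
output = mlp_forward(batch, params)
```

Each weight matrix has one column per neuron, so the width of a layer is simply the number of columns of its matrix.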

The non-linearity of this model makes it more powerful than its predecessor, as it allows it to also approximate non-linear functions. Adding layers to make the model deeper enables the network to learn more data representations, where deeper layers can express increasingly complex concepts. Unfortunately, this also makes the model more prone to overfitting the training data. If the dataset is large enough, overfitting is mitigated, leading the deeper model to perform better than a shallow model (only one hidden layer); otherwise, the shallow model could be better.

Network Regularization

A model is overfitting when the training error is low, but the validation error is high. In other words, the model does not generalize to new data. To decrease the model's generalization error, getting more data can help. Nevertheless, this may lead to expensive resource spending or may not be possible.

Another option could be to augment the dataset with newly generated samples. For example, certain operations, such as a translation or a rotation, could be applied to the original dataset to get augmented data. However, the choice of augmentation is specific to the problem and might not always be clear (as is the case for multispectral images).

A third option could be to add regularization to the neural network. Below are some types of regularizations that are commonly used.

Figure 3.3 MLP with an input layer of 𝑛 features, one hidden layer of 𝑘 neurons, and an output layer of one neuron that returns the prediction 𝑦. Source: [63].

L1 Regularization: ANNs are trained to minimize a cost function. This kind of regularization adds a term to the cost function that penalizes the sum of the absolute values of the model parameters. It acts as a type of feature selection since it can drive some weights to exactly zero.

L2 Regularization: This kind of regularization adds a term to the cost function that penalizes the sum of the squared values of the parameters. When used in a model, it pushes the overall magnitude of the parameters closer to zero.

Dropout: This technique consists of dropping neurons, based on a user-specified rate, during the forward passes of the training phase. For example, if we define a dropout layer with a drop rate of 0.2, each neuron it is applied to has a 20% chance of being dropped during a forward pass of a training example. This layer adds regularization to the network since deeper layers learn not to rely overly on one particular neuron, as this neuron might not be available for other training examples. The layer is only active during the training of the network; during inference, the drop rate is effectively 0 (every neuron is kept with probability 1), so no neurons are dropped/turned off.
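The three regularizers above can be sketched as follows. The function names and the `strength` parameter are hypothetical. The rescaling of the kept activations by 1 / (1 - drop_rate), often called inverted dropout, is an assumption borrowed from common deep learning frameworks and is not described in the text.

```python
import numpy as np

def l1_penalty(weights, strength):
    """Term added to the cost: sum of the absolute values of the weights."""
    return strength * np.sum(np.abs(weights))

def l2_penalty(weights, strength):
    """Term added to the cost: sum of the squared weights."""
    return strength * np.sum(weights ** 2)

def dropout(activations, drop_rate, training, rng):
    """Randomly zero activations during training; do nothing at inference."""
    if not training or drop_rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= drop_rate
    # Assumption: rescale so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - drop_rate)

rng = np.random.default_rng(0)
weights = np.array([1.0, -2.0, 0.0])
l1_value = l1_penalty(weights, strength=0.1)
l2_value = l2_penalty(weights, strength=0.1)

activations = np.ones(1000)
train_out = dropout(activations, drop_rate=0.2, training=True, rng=rng)
infer_out = dropout(activations, drop_rate=0.2, training=False, rng=rng)
dropped_fraction = float(np.mean(train_out == 0.0))
```

With a drop rate of 0.2, roughly one fifth of the activations are zeroed in the training pass, while the inference pass returns the input untouched.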

Recurrent Neural Network

A Recurrent Neural Network (RNN) is an ANN developed to solve sequence-dependent problems. For this reason, RNNs are often used in Natural Language Processing problems, where the order of words can give helpful information to predict the output. Additionally, time-series data can also be viewed as sequential, rendering RNNs appropriate models to solve time-dependent problems. A simple RNN is composed of a unit that processes the input at each timestep sequentially, using information from past timesteps stored in a recursively updated internal memory. This memory helps establish a relationship between timesteps and share features learned across different timesteps. The internal state allows the model to accept inputs and outputs of variable lengths, and it uses fewer parameters than those required to accomplish the same task with an MLP.
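The recursive update of the internal memory can be sketched as below. This is a minimal illustration of a simple (Elman-style) RNN unit with a tanh activation; the names, sizes, and random parameters are hypothetical, and no training is shown.

```python
import numpy as np

def rnn_forward(sequence, w_input, w_hidden, bias):
    """Run a simple RNN unit over a sequence of shape (timesteps, features).

    The same parameters are reused at every timestep, which is why the
    unit can process sequences of any length.
    """
    hidden = np.zeros(w_hidden.shape[0])     # internal memory, initially empty
    states = []
    for x_t in sequence:
        # New memory depends on the current input and the previous memory.
        hidden = np.tanh(x_t @ w_input + hidden @ w_hidden + bias)
        states.append(hidden)
    return np.stack(states)

# Hypothetical sizes: 3 input features, hidden state of size 4.
rng = np.random.default_rng(0)
w_input = rng.normal(scale=0.5, size=(3, 4))
w_hidden = rng.normal(scale=0.5, size=(4, 4))
bias = np.zeros(4)

short_states = rnn_forward(rng.normal(size=(5, 3)), w_input, w_hidden, bias)
long_states = rnn_forward(rng.normal(size=(12, 3)), w_input, w_hidden, bias)
```

Both sequences are processed with the same 32 parameters, whereas an MLP taking the whole sequence as a flat input would need a separate weight for every timestep.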

Long Short-Term Memory (LSTM) models are variants of RNNs that address some of their limitations. For example, ANNs are typically trained with the gradient descent optimization algorithm, which minimizes the prediction error (calculated with a cost function) by updating the neurons’ weights in small steps until the output error converges to a minimum value. To update the weights, the algorithm calculates the gradient of the cost function with respect to each weight. However, this can lead to exploding and vanishing gradients [23], slowing the learning of earlier layers. LSTMs mitigate this problem and help establish longer-term dependencies between timesteps. They also learn when to accumulate or forget information from previous timesteps by using several gate mechanisms inside a unit.

Convolutional Neural Network

A Convolutional Neural Network (CNN) is an ANN commonly used to process 2D inputs, such as images. CNNs are composed of sets of learnable filters (of variable size), which can be stacked in a deep architecture. At prediction time, the filters are applied (with the convolution operation) layer by layer until the final convolution layer, whose output is usually flattened before being passed to the output layer, which generates the final prediction. The shallower layers of the CNN learn low-level features, such as edges and corners, while the deeper layers learn higher-level representations, such as “dog” or “cat” (example in Figure 3.4). If we introduce pooling layers to the architecture, the model becomes robust to small input translations [23]. This type of ANN also uses fewer parameters than an equivalent MLP, as the small kernels are convolved over the previous layer, hence not requiring a parameter for each of the outputs of that layer, only a small set representing the kernel.
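How a small kernel is slid over an image can be sketched with a naive loop. As in most deep learning frameworks, the operation below is technically a cross-correlation (the kernel is not flipped). The function name, the toy image, and the hand-written edge kernel are hypothetical; in a CNN the kernel values would be learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the products."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same few kernel parameters are reused at every position.
            output[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return output

# Toy image: dark left half, bright right half (a vertical edge).
image = np.zeros((4, 6))
image[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]])   # responds to left-to-right increases
feature_map = conv2d_valid(image, edge_kernel)
```

The feature map is non-zero only along the column where the intensity changes, which is the kind of low-level edge response attributed to the shallow layers. The kernel holds two parameters, regardless of the size of the image.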

3.3 Autoencoders

An Autoencoder (AE) is an ANN that consists of an encoder network followed by a decoder network, essentially copying the input to the output. However, to prevent the network from simply learning the identity function, it is typically restricted (only allowing for an approximate copy), enabling it to learn useful features that describe the data.

Figure 3.4—Typical CNN structure. Source [64].

It is an unsupervised algorithm, taking advantage of the fact that most unstructured data is available without labels (e.g., Sentinel-3 image products). Given the high volume of data, the model can learn generalizable representations of the data that facilitate the extraction of useful information for supervised predictors. Therefore, the most significant performance improvement can be observed when few labeled data are available.

Undercomplete Autoencoders

An Undercomplete AE (Figure 3.5) applies the restriction mentioned in the previous section by creating a “bottleneck,” ensuring that the encoder output (the latent space, also called the code) is smaller in width than the input layer. When the decoder can reconstruct the original data with low error, the code retains a good amount of information while effectively reducing the input dimensionality.
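The bottleneck idea can be sketched with the simplest possible case: a linear encoder and decoder trained with plain gradient descent on the mean squared reconstruction error. This is an illustration under strong assumptions (no activations, no biases, synthetic data with three underlying factors) and not the architecture used in this work; all names and hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_autoencoder(data, code_size, learning_rate=0.02,
                             epochs=3000, seed=0):
    """Train a linear undercomplete autoencoder with gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    encoder = rng.normal(scale=0.1, size=(n_features, code_size))
    decoder = rng.normal(scale=0.1, size=(code_size, n_features))
    losses = []
    for _ in range(epochs):
        code = data @ encoder                 # bottleneck representation
        reconstruction = code @ decoder
        error = reconstruction - data
        losses.append(float(np.mean(error ** 2)))
        # Gradients of the mean squared error with respect to both matrices.
        grad_reconstruction = 2.0 * error / error.size
        grad_decoder = code.T @ grad_reconstruction
        grad_encoder = data.T @ (grad_reconstruction @ decoder.T)
        encoder -= learning_rate * grad_encoder
        decoder -= learning_rate * grad_decoder
    return encoder, decoder, losses

# Synthetic data: 6 observed features generated from 3 hidden factors.
rng = np.random.default_rng(1)
factors = rng.normal(size=(200, 3))
data = factors @ rng.normal(size=(3, 6))
encoder, decoder, losses = train_linear_autoencoder(data, code_size=3)
code = data @ encoder
```

The code has half the width of the input, and the reconstruction error drops during training because the synthetic data only has three underlying degrees of freedom to begin with.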

Figure 3.5—Undercomplete AE architecture, with the latent space (code) in red.

Sparse Autoencoders

Sparse Autoencoders (Figure 3.6) restrict the model by adding a sparsity penalty on the latent space to the cost function, thereby forcing the model to generate sparse encodings of the training examples. This makes it possible to have a model with a code size greater than or equal to the input size that is still capable of learning valuable information.

Denoising Autoencoders

Denoising Autoencoders (Figure 3.7) are slightly different from the previous autoencoders. Instead of adding a penalty to the cost function, the model's input is corrupted by some noise, while the target output remains the uncorrupted version. Therefore, the model is trained to take a corrupted version of an image and then output the uncorrupted version. In doing this, the autoencoder also becomes less sensitive to perturbations in the input.
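The difference with respect to the other autoencoders lies only in how the training pairs and the loss are built, which can be sketched as follows. Gaussian noise is an assumption (the text only says "some noise"), and the function names and the stand-in identity model are hypothetical.

```python
import numpy as np

def corrupt(clean, noise_std, rng):
    """Add zero-mean Gaussian noise to the clean inputs (assumed noise model)."""
    return clean + rng.normal(scale=noise_std, size=clean.shape)

def denoising_loss(model, clean, noise_std, rng):
    """Feed the corrupted input, but compare the output with the clean target."""
    noisy = corrupt(clean, noise_std, rng)
    reconstruction = model(noisy)
    return float(np.mean((reconstruction - clean) ** 2))

rng = np.random.default_rng(0)
clean_batch = rng.random((8, 16))

# A model that merely copies its input cannot reach zero loss here.
identity_loss = denoising_loss(lambda x: x, clean_batch, noise_std=0.1, rng=rng)
```

The identity function is left with a loss close to the noise variance (about 0.01 here), so copying the input is no longer a solution and the model has to learn structure in the data to remove the noise.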

Figure 3.6: Sparse AE architecture.

Figure 3.7: Denoising AE, where the corrupted inputs are transformed into their uncorrupted counterparts.

Convolutional Autoencoders

A Convolutional Autoencoder (CNN-AE) is akin to the Undercomplete AE, as the code is also smaller than the number of input features. However, this approach uses CNNs, which are the most appropriate models for 2D data. Therefore, this kind of AE is appropriate to extract features from images. The encoder is composed of convolution layers with activation functions and Max Pooling, which reduce the size of the feature map. There are also convolution and activation layers in the decoder, but we need an operation to essentially invert the Max Pooling operation and increase the size of the feature map. To do so, we can use Deconvolution layers. However, a better alternative would be an Upsampling layer, as it generates fewer artifacts in the reconstructed images [24]. An example of architecture can be seen in Figure 3.8.
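The size-reducing and size-restoring operations can be sketched on a single feature map. The 2x2 window and the nearest-neighbour upsampling are assumptions chosen for simplicity; the function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Halve each spatial dimension by keeping the maximum of every 2x2 block."""
    height, width = feature_map.shape
    blocks = feature_map.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))

def upsample_2x2(feature_map):
    """Double each spatial dimension by repeating every value (nearest neighbour)."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(feature_map)     # encoder side: 4x4 -> 2x2
restored = upsample_2x2(pooled)        # decoder side: 2x2 -> 4x4
```

Upsampling restores the spatial size but not the discarded values, so the convolution layers that follow it in the decoder are responsible for reconstructing the detail.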

Figure 3.8—CNN-AE architecture.

4

SENTINEL-3 DATA

4.1 Data description

In 1998, several agencies with space-related activities in Europe founded an Earth Observation program termed Global Monitoring for Environment and Security (GMES). It was later renamed Copernicus, and the European Commission coordinates and manages the program in collaboration with the European Space Agency (ESA), European Union Member States, and European Union (EU) agencies. It has been fully operational since 2014. Its objective is to provide accurate and timely atmosphere, marine, land, and climate information free of charge to the public, incentivizing third-party creation of downstream earth monitoring and forecasting services and applications. In addition, the easily accessible information provided by the program also facilitates climate change investigation, social security, and emergency response, among other applications [25].

The information collected through the program has two sources: ground-based, airborne, and seaborne measurement systems, which collect in-situ data, and several observation satellites from space missions, namely, ESA's five Sentinel families.

Sentinel-3 mission

The Sentinel-3 (S3) mission is one of the several space missions managed under the Copernicus program. It is operated by ESA and the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT) to provide ocean and land observation services, and it has a lifetime of 7 years. It is the successor of the Envisat mission, which ended in 2012; hence, Earth monitoring and forecasting systems should be capable of making predictions based on currently operating missions such as this one.

Sentinel-3 comprises two satellites: Sentinel-3A, launched in February 2016, and Sentinel-3B, launched in April 2018. These satellites contain payload instruments to capture ocean and land color, surface temperature, and other remotely sensed geophysical land and water products. They were designed to be highly accurate and reliable, with over 95% availability, while maintaining the same quality as Envisat instruments. In addition, the orbits are sun-synchronous; therefore, the surface is always illuminated at the same sun angle (within the same season) at a constant time, eliminating the need for light corrections since the illumination conditions are consistent.

Ocean and Land Color Instrument

The Ocean and Land Color Instrument (OLCI) is one of the instruments aboard S3, and it is the successor to the Medium Resolution Imaging Spectrometer (MERIS) instrument aboard Envisat. The OLCI products are available at Full Resolution (around 300 m) and Reduced Resolution (around 1.2 km), with 21 bands from the visible to the near-infrared (400 nm to 1020 nm). As shown in Table 4.1, there was an increase in bands compared to the 15 bands available with MERIS, among other operational improvements. Furthermore, although there are satellites with higher spatial resolution, such as Landsat-8 (around 30 m), their spectral range is constrained to the optical, making it challenging to discern HABs in near-coastal areas.

However, there is less interference in the near-infrared region, and it has been successfully used to estimate Chl-a in those areas [6].

The orbital cycle of the S3 satellite is 27 days, which means it takes 27 days for the satellite to pass over the same point on the Earth's surface at nadir (directly below the satellite).

However, the orbital cycle is not the same as the revisit period. The satellites view areas (swaths of 1270 km for the OLCI), not just the nadir point, which means swaths overlap in different orbits, making the revisit period shorter than the orbital cycle. Since two satellites are operational (Sentinel-3A and Sentinel-3B), the OLCI revisit time is shorter than two days, as shown in Figure 4.1.

Figure 4.1—Revisit time for the OLCI with Sentinel-3A and Sentinel-3B. Source: [65]

Table 4.1—S3 OLCI spectral bands definition. Source: [26]

Band | λ center (nm) | Width (nm) | Function
Oa01 | 400 | 15 | Aerosol correction, improved water constituent retrieval
Oa02 | 412.5 | 10 | Yellow substance and detrital pigments (turbidity)
Oa03 | 442.5 | 10 | Chlorophyll absorption maximum, biogeochemistry, vegetation
Oa04 | 490 | 10 | High Chlorophyll
Oa05 | 510 | 10 | Chlorophyll, sediment, turbidity, red tide
Oa06 | 560 | 10 | Chlorophyll reference (Chlorophyll minimum)
Oa07 | 620 | 10 | Sediment loading
Oa08 | 665 | 10 | Chlorophyll (2nd Chlorophyll absorption maximum), sediment, yellow substance/vegetation
Oa09 | 673.75 | 7.5 | For improved fluorescence retrieval and to better account for smile together with the bands 665 and 680 nm
Oa10 | 681.25 | 7.5 | Chlorophyll fluorescence peak, red edge
Oa11 | 708.75 | 10 | Chlorophyll fluorescence baseline, red edge transition
Oa12 | 753.75 | 7.5 | O2 absorption/clouds, vegetation
Oa13 | 761.25 | 2.5 | O2 absorption band/aerosol correction
Oa14 | 764.375 | 3.75 | Atmospheric correction
Oa15 | 767.5 | 2.5 | O2A used for cloud top pressure, fluorescence over land
Oa16 | 778.75 | 15 | Atmospheric correction/aerosol correction
Oa17 | 865 | 20 | Atmospheric correction/aerosol correction, clouds, pixel co-registration
Oa18 | 885 | 10 | Water vapor absorption reference band. Common reference band with SLSTR instrument. Vegetation monitoring
Oa19 | 900 | 10 | Water vapor absorption/vegetation monitoring (maximum reflectance)
Oa20 | 940 | 20 | Water vapor absorption, atmospheric correction/aerosol correction
Oa21 | 1020 | 40 | Atmospheric correction/aerosol correction

OLCI Level-2 Water Full Resolution Product

Earth observation data products are typically distributed in different levels of processing, labeled numerically, where a higher number implies that additional layers of processing were applied to the image.

This dissertation used OLCI Level-2 Water products (in Full Resolution), which contain water and atmosphere geophysical products (Table 4.2). Several flags are available at this level, which allow us to remove pixels that belong to clouds or land, or that are invalid (Table 4.3). Furthermore, the top of atmosphere (TOA) radiances from Level-1B are converted to water-leaving reflectances (with an Atmospheric Correction algorithm) and subsequently corrected for gaseous absorption using bands Oa13 to Oa15, Oa19, and Oa20; hence, these bands are not available to the end-user. Finally, several products are computed, such as OC4Me Chlorophyll, PAR, and others.

Table 4.2—S3 OLCI Level-2 Bands. Source: [27]

Variables | Description | Units | Input Bands
Rxxx | Surface directional reflectance, corrected for atmosphere and sun specular reflection | dimensionless | all except Oa13, Oa14, Oa15, Oa19 and Oa20
chl_oc4me and chl_NN | Chlorophyll-a concentration, computed using "OC4Me" or Neural Network algorithms | mg (Chl a) m^-3 | Oa3 to Oa6, Oa1-Oa12, Oa16, Oa17, Oa18 and Oa21
TSM_NN | Total suspended matter concentration | g.m^-3 | Oa1-Oa12, Oa16, Oa17, Oa18 and Oa21
KD490_M07 | Diffuse attenuation coefficient for down-welling irradiance, at 490 nm | m^-1 | Oa4 and Oa6
ADG_443_NN | Absorption of colored detrital and dissolved material at 443 nm | m^-1 | Oa1, Oa12, Oa16, Oa17, Oa21
PAR | Quantum energy flux from the sun in the spectral range 400-700 nm | µEinstein.m^-2.s^-1 | -
T865 and A865 | Aerosol load, expressed in optical depth at a given wavelength (865 nm) and spectral dependency of the aerosol optical depth, between 779 and 865 nm | dimensionless | Oa5, Oa16, and Oa17
IWV | Integrated Water Vapor column | kg.m^-2 | Oa18, Oa19

In early 2017, S3 OLCI Level-2 was released for public consumption [28]. Data and Information Access Services (DIAS) provide access to Copernicus data, with discovery and download functionality. The DIAS used in this dissertation was WEkEO [29], which provides a REST API to request and download the data we required for specific frames and dates free of charge, as opposed to other services, which are paid or limit the number of product downloads per month on free accounts. However, WEkEO only provides access to a one-year rolling archive of S3 data (ending on the day of access to the catalog and starting a year prior) available from EUMETSAT. Older data can be accessed from the Copernicus Online Data Access (REProcessed) service (CODAREP) [30], which contains reprocessed data, or from the EUMETSAT data center (not reprocessed) [31]. After contacting the WEkEO support team, we were made aware that there are plans to integrate the entire archive seamlessly, but this was not available when the dataset was downloaded.

Table 4.3—S3 OLCI Quality and Science Flags. Source: [32].

Flag Name | Flag Description
INVALID | Invalid flag: instrument data missing or invalid
WATER | Clear sky water
LAND | Clear sky land
CLOUD | Cloudy pixel
CLOUD_AMBIGUOUS | Potentially cloudy pixels
CLOUD_MARGIN | A margin around CLOUD and CLOUD_AMBIGUOUS of 2 pixels in RR and 4 pixels in FR products
SNOW_ICE | Possible sea-ice or snow contamination
INLAND_WATER | Fresh inland waters flag: based on Level-1 land_water flag
TIDAL | Pixel is in shallow water: based on Level-1 land_water flag
COSMETIC | Cosmetic flag (from Level-1B): missing data filled in by interpolation
SUSPECT | Suspect flag (from Level-1B): transmission errors mean measurements may be unreliable
HISOLZEN | High solar zenith: θs > 70°
SATURATED | Saturation flag: saturated within any band from 400 to 754 nm or in bands 779, 865, 885, and 1020 nm
MEGLINT | Flag for pixels corrected for glint
RISKGLINT | Flag for pixels for which the glint correction is not reliable
WHITECAPS | Whitecaps flag: see ATBD SD-03-C06 for details
ADJAC | Meaningless - reserved for future use
WVFAIL | The water vapor retrieval algorithm failed
PAR_FAIL | PAR calculation failed. Internal flag is OC_PAR_FAIL
ACFAIL | Atmospheric correction is suspect
OC4ME_FAIL | OC4Me algorithm failed
OCNN_FAIL | IMT NN algorithm failed
KDM_FAIL | KDM07 algorithm failed
KDL_FAIL | KDL05 algorithm failed
BPAC_ON | BWAC was switched on and attempted
WHITE_SCATT | "White" scatterer flag within the water
LOWRW | (p'w(b, j, f) < R560MIN) or HIINLD_F raised
HIGHRW | High RW at 560nm or CASE2_F raised
ANNOT | Annotation flag for the quality of the atmospheric correction
RWNEG | Provides a "negative water-leaving reflectance" flag for each water-leaving reflectance band
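Removing cloud, land, and invalid pixels with such flags can be sketched with bitwise operations, assuming the flags of each pixel are packed into a single integer with one bit per flag. The bit positions below are hypothetical placeholders chosen for illustration; the actual values must be read from the metadata of the product's flag file, and the function names are not part of any library.

```python
import numpy as np

# Hypothetical bit positions (illustration only, not the real OLCI encoding).
FLAG_BITS = {"INVALID": 0, "WATER": 1, "LAND": 2, "CLOUD": 3}

def flag_mask(names):
    """Combine the bits of several flags into one integer mask."""
    mask = 0
    for name in names:
        mask |= 1 << FLAG_BITS[name]
    return np.uint64(mask)

def usable_pixels(flags):
    """Return True where none of the rejected flags is raised."""
    rejected = flag_mask(["INVALID", "LAND", "CLOUD"])
    return (flags & rejected) == 0

# Five hypothetical pixels: water, cloudy water, land, invalid, water.
flags = np.array([2, 2 | 8, 4, 1, 2], dtype=np.uint64)
keep = usable_pixels(flags)
```

The resulting boolean array can be used to mask every band of the product at once, since all bands share the same pixel grid.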

OLCI Level-2 Water Full Resolution products are disseminated in frames, each covering an area of about 1227 km x 1460 km (calculated considering the 300 m resolution of the OLCI instrument and a frame size of 4091x4865 pixels). In this project, frames 2340 and 2160 were used because, when combined, they almost entirely cover Portugal's inland, coastal, and near-coastal areas (Figure 4.2).

Product Files and Naming Convention

Each product file is distributed with filenames that follow a specific naming convention to describe the contents it provides [33]. The files are named as follows, with fixed-length fields, described in Table 4.4, separated by an underscore:

MMM_OL_L_TTTTTT_yyyymmddThhmmss_YYYYMMDDTHHMMSS_YYYYMMDDTHHMMSS_DDDD_CCC_LLL_FFFF_GGG_P_XX_NNN.SEN3

Figure 4.2: The area bounded by the two squares represents the region that is covered by the frames 2340 and 2160 of the OLCI Level-2 Water Full Resolution products.

Table 4.4—Description of the fields in a product’s filename. Source [33].

Field | Description
MMM | “S3A” for Sentinel-3A or “S3B” for Sentinel-3B
OL | Indicates the instrument used was the OLCI
L | Processing level (for our dataset, it takes the value “2”)
TTTTTT | Data type (for our dataset, it takes the value “WFR___”)
yyyymmddThhmmss | Sensing start time of the frame
YYYYMMDDTHHMMSS | Sensing end time of the frame
YYYYMMDDTHHMMSS | Creation date of the frame’s product
DDDD | Duration
CCC | Cycle number
LLL | Relative orbit number
FFFF | Frame id (“2340” and “2160” for our dataset)
GGG | The center which generated the file (“MAR” in our dataset)
P | Platform (“O” for operational in our dataset)
XX | Readiness of the processing workflow (“NT” for Non-Time Critical in our dataset)
NNN | Baseline collection or data usage (“002” or “003” in our dataset)
.SEN3 | Filename extension

As an example, below is the filename of one of the files used in our experiments:

S3A_OL_2_WFR____20171213T103050_20171213T103350_20171215T150441_0179_025_279_2340_MAR_O_NT_002.SEN3

This filename shows that the file contains the products from frame 2340 captured by Sentinel-3A on the 13th of December 2017 between 10h30min and 10h33min. The “.SEN3” file can be opened like a folder. Inside, it contains files in NetCDF4 format (with the “.nc” extension) for the coordinates of every pixel in the image and for every available band and product.
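Because the fields have fixed lengths, a filename can be decoded by position rather than by splitting on underscores (the data type field itself is padded with underscores). The sketch below is a hypothetical helper, not part of any library; the key names are chosen for illustration and the field widths follow Table 4.4.

```python
from datetime import datetime

# (key, width) pairs in filename order; key names are hypothetical.
FIELD_WIDTHS = [
    ("mission", 3), ("instrument", 2), ("level", 1), ("data_type", 6),
    ("sensing_start", 15), ("sensing_end", 15), ("creation", 15),
    ("duration", 4), ("cycle", 3), ("relative_orbit", 3), ("frame", 4),
    ("centre", 3), ("platform", 1), ("timeliness", 2), ("baseline", 3),
]

def parse_product_name(filename):
    """Split a product filename into its fixed-length fields."""
    fields = {}
    position = 0
    for key, width in FIELD_WIDTHS:
        fields[key] = filename[position:position + width]
        position += width + 1  # skip the underscore separator
    for key in ("sensing_start", "sensing_end", "creation"):
        fields[key] = datetime.strptime(fields[key], "%Y%m%dT%H%M%S")
    return fields

example = ("S3A_OL_2_WFR____20171213T103050_20171213T103350_"
           "20171215T150441_0179_025_279_2340_MAR_O_NT_002.SEN3")
parsed = parse_product_name(example)
```

Decoding the sensing times into `datetime` objects makes it straightforward to filter or resample the downloaded products by date.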

5

DATASET EXTRACTION AND TRANSFORMATION

Two datasets were created in this study, both containing 64x64 pixel patches (19.2x19.2 km) from the western coastal area of Portugal. The first dataset (D1) contains data from the 17th of May 2020 to the 17th of May 2021, resampled to a weekly frequency (i.e., one product file per week). The second dataset (D2) contains data from the 10th of December 2017 to the 10th of December 2019, which was not resampled.

5.1 Resources Used

The following resources were used to develop the pipelines to extract and pre-process the S3 datasets:

• WEkEO HDA API Client [34]: Python library that communicates with the WEkEO API to place download orders for the datasets available on the WEkEO platform;

• MLCube [35]: open-source Python framework that standardizes the execution of everyday machine learning operations, such as data extraction, pre-processing, and model training. This framework is described in greater detail in Section 5.1.1;

• Snappy: Python library that wraps the Java library of the Sentinel Application Platform (SNAP) and provides methods to process data from Sentinel satellites;

• xarray: open-source Python library that operates efficiently on multi-dimensional data and can read files saved in NetCDF4, the format in which the S3 files are disseminated;

• satpy: open-source Python library with several utility classes and functions to read and process remotely sensed images such as those captured by S3. It is used as a faster
