First Dataset (D1) - Dataset Extraction and Transformation

5 Dataset Extraction and Transformation

5.2 First Dataset (D1)

5.2.1 Data Extraction

For the creation of the first dataset (D1), the Harmonized Data Access (HDA) REST API was used, since it allows product files to be downloaded from the WEkEO catalog programmatically [38]. In addition, a Python library developed by WEkEO [39] provides several functions to query the API; hence, it was used in the data extraction pipeline.

The product extraction pipeline is illustrated in Figure 5.1.

To gain access to the data, a token was requested for the dataset with ID “EO:EUM:DAT:SENTINEL-3:OL_2_WFR___”, using the credentials of a free WEkEO account. This token was valid for one hour; therefore, it was refreshed every time the server returned a response with a status code of 403 Forbidden.
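The refresh-on-403 behaviour can be sketched as a small retry wrapper. This is only an illustration under stated assumptions: `send` and `fetch_token` are hypothetical callables standing in for the actual HTTP request and the token request, and are not functions of the HDA library.

```python
from typing import Callable, Tuple

FORBIDDEN = 403  # status code returned once the one-hour token has expired


def call_with_token_refresh(
    send: Callable[[str], Tuple[int, object]],
    fetch_token: Callable[[], str],
    token: str,
    max_refreshes: int = 1,
) -> Tuple[str, int, object]:
    """Call `send(token)`; on a 403, fetch a new token and retry.

    `send` and `fetch_token` are hypothetical stand-ins for the real requests.
    Returns the token finally used, plus the status code and response body,
    so that the caller can keep reusing the refreshed token.
    """
    status, body = send(token)
    refreshes = 0
    while status == FORBIDDEN and refreshes < max_refreshes:
        token = fetch_token()        # expired token: request a fresh one
        status, body = send(token)   # retry the same request
        refreshes += 1
    return token, status, body
```

Returning the token alongside the response avoids requesting a new token for every subsequent call.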

I created a job to download the data using a query that specified the dataset's ID, the desired date range, and the bounding box coordinates. Once the job is created, a list with the filenames of the products that match the query is returned, along with the endpoints where they can be accessed (the HDA Python library only obtained the first page of items; hence, it was slightly modified to return all the resulting pages of filenames). However, to download a specific product, an order must first be requested for it. Then, another endpoint, located at ¹, is queried until a status of “completed” is returned. Only then can the product be downloaded at the ² endpoint, substituting {order_id} with the id of the product order. It is important to note that this must be done individually for every product we wish to download.

1 https://wekeo-broker.apps.mercator.dpi.wekeo.eu/databroker/dataorder/status/{order_id}

2 https://wekeo-broker.apps.mercator.dpi.wekeo.eu/databroker/dataorder/download/{order_id}
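The order polling step described above can be sketched as follows. The two URL templates are the ones given in the footnotes; everything else is an assumption made for illustration: `get_status` is a hypothetical callable that performs the GET request and returns the status string, and the polling interval and limit are arbitrary.

```python
import time
from typing import Callable

BROKER = "https://wekeo-broker.apps.mercator.dpi.wekeo.eu/databroker"
STATUS_URL = BROKER + "/dataorder/status/{order_id}"
DOWNLOAD_URL = BROKER + "/dataorder/download/{order_id}"


def wait_for_order(
    order_id: str,
    get_status: Callable[[str], str],
    poll_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the status endpoint until the order reports "completed".

    `get_status` is a hypothetical helper that queries the given URL and
    returns the order status. Returns the download URL of the order, or
    raises TimeoutError if the order never completes (a stuck order).
    """
    status_url = STATUS_URL.format(order_id=order_id)
    for _ in range(max_polls):
        if get_status(status_url) == "completed":
            return DOWNLOAD_URL.format(order_id=order_id)
        sleep(poll_seconds)
    raise TimeoutError(f"order {order_id} did not complete after {max_polls} polls")
```

Bounding the number of polls turns an order that never completes into an explicit error instead of an endless loop.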

The bounding box which covered D1 and was used in the query to HDA corresponds to the following latitude and longitude in the WGS-84 system:

• bottom right corner: 40° 0' 0"N -8° 27' 23.22"E,

• top left corner: 41° 52' 11.3622"N -9° 6' 36.4926"E.

However, HDA only uses these coordinates as a reference to select the tiles that intersect the bounding box; it neither joins the tiles nor subsets the data to the given box.
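For reference, the corner coordinates listed above can be converted to decimal degrees with a short helper. This is a minimal sketch that assumes the negative sign in front of the longitudes denotes west; the function and variable names are illustrative.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float,
                   negative: bool = False) -> float:
    """Convert degrees, minutes, and seconds to signed decimal degrees."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if negative else value


# Bottom right corner: 40° 0' 0"N, -8° 27' 23.22"E
south = dms_to_decimal(40, 0, 0)
east = dms_to_decimal(8, 27, 23.22, negative=True)

# Top left corner: 41° 52' 11.3622"N, -9° 6' 36.4926"E
north = dms_to_decimal(41, 52, 11.3622)
west = dms_to_decimal(9, 6, 36.4926, negative=True)

print(f"west={west:.5f} south={south:.5f} east={east:.5f} north={north:.5f}")
```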

Although D1 only contains one year of data, some orders would get stuck during extraction, never returning the status of “completed.” This could be worked around by waiting a few hours and then creating another job to resume the download. To avoid re-downloading data that had already been retrieved, jobs were created using queries that covered only one month of data. Whenever an order got stuck, we manually stopped the download task, waited a couple of hours, and then resumed the download from the month in which it had stalled. This process was repeated until all the data was downloaded.
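The month-by-month strategy can be illustrated with a helper that splits a date range into one window per calendar month and skips the windows already downloaded. This is a sketch only; the function names and the bookkeeping of completed windows are assumptions, not part of the original pipeline.

```python
from datetime import date, timedelta
from typing import Iterator, List, Set, Tuple

Window = Tuple[date, date]


def monthly_windows(start: date, end: date) -> Iterator[Window]:
    """Yield (first_day, last_day) pairs covering [start, end], one per month."""
    current = start
    while current <= end:
        # First day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        yield current, min(next_month - timedelta(days=1), end)
        current = next_month


def pending_windows(windows: List[Window], completed: Set[Window]) -> List[Window]:
    """Drop windows already downloaded, so a resumed run starts at the stuck month."""
    return [window for window in windows if window not in completed]
```

Each window then becomes the date range of one job, so a stalled order only affects the month it belongs to.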

As mentioned in Section 4.1.3, WEkEO only provides access to a one-year rolling archive of S3 data, which is why D1 contains only one year of data. Additionally, HDA provides a function to download products individually but no function to download a batch of products. Though a batch could be downloaded sequentially by repeatedly calling that function, this would not be time-efficient, considering that file sizes range from around 200 MB to 700 MB and that the query returned over 300 products within the date range selected for D1.

Figure 5.1: Sequence diagram describing the steps taken by the data extraction pipeline to collect data from the WEkEO platform

Considering the high data volume, the products for D1 were first resampled to a weekly sampling rate before their extraction. This is possible because, after a job is created, the names of the products that match the query are returned and, as described in Section 4.1.4, each product name contains metadata in fixed-size fields separated by underscores. Using the sensing start dates in the product filenames, only the earliest product pair of every week (with the previously chosen frames) was extracted. Additionally, by only considering the products where both frames of the same satellite are available on a given day, the number of products is reduced to 128. In practice, the products are grouped by satellite (Sentinel-3A or Sentinel-3B) and date, the groups with a count lower than 2 are excluded (since there are two tiles), the remaining products are resampled to one pair per week, and finally the data is downloaded from the product endpoints.
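The filename-based selection can be sketched as below. This is one possible reading of the procedure, not the original implementation: the regular expression encodes an assumed layout of the Sentinel-3 OLCI Level 2 WFR product names (mission, sensing start, sensing stop, creation date, duration, cycle, relative orbit, frame), and weeks are taken to be ISO weeks.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout of the product names; only satellite, start, and frame are used.
NAME_RE = re.compile(
    r"^(?P<satellite>S3[AB])_OL_2_WFR____"
    r"(?P<start>\d{8}T\d{6})_\d{8}T\d{6}_\d{8}T\d{6}_"
    r"\d{4}_\d{3}_\d{3}_(?P<frame>\d{4})_"
)
FRAMES = {"2160", "2340"}  # the two frames chosen for D1


def select_weekly_pairs(filenames):
    """Return the earliest complete (two-frame) product pair of each ISO week."""
    groups = defaultdict(dict)  # (satellite, date) -> {frame: (start, filename)}
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None or match["frame"] not in FRAMES:
            continue
        start = datetime.strptime(match["start"], "%Y%m%dT%H%M%S")
        groups[(match["satellite"], start.date())][match["frame"]] = (start, name)

    weekly = {}  # (iso_year, iso_week) -> (earliest start, [filenames])
    for frames in groups.values():
        if set(frames) != FRAMES:  # drop groups with fewer than two tiles
            continue
        earliest = min(start for start, _ in frames.values())
        iso = earliest.isocalendar()
        week = (iso[0], iso[1])
        if week not in weekly or earliest < weekly[week][0]:
            weekly[week] = (earliest, sorted(name for _, name in frames.values()))
    return [names for _, names in sorted(weekly.values())]
```

Because the selection only needs the product names, it can run before any order is placed, which is what keeps the download volume low.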

Although this reduces the number of products by more than half, downloading them sequentially would still take too long. Extracting the files in parallel helps here, as it can yield a speed-up of up to ~99% compared to the sequential counterpart [40]. For this reason, the request and download of orders for every job were implemented with an Apache Beam pipeline.
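The idea of running the per-order requests concurrently can be illustrated with the Python standard library. Note that this sketch deliberately swaps Apache Beam for a `ThreadPoolExecutor`, so it is a stand-in for the concept rather than the pipeline used for D1; `download_one` is a hypothetical function that requests, polls, and downloads a single order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(order_ids, download_one, max_workers=8):
    """Run `download_one(order_id)` concurrently for every order.

    `download_one` is a hypothetical callable handling one order end to end.
    Returns two dictionaries keyed by order id: results and failures.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, oid): oid for oid in order_ids}
        for future in as_completed(futures):
            oid = futures[future]
            try:
                results[oid] = future.result()
            except Exception as error:  # keep going; failed orders can be retried
                failures[oid] = error
    return results, failures
```

Threads suit this workload because each order spends most of its time waiting on the network rather than on the CPU.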

5.2.2 Data Pre-processing

After the data was obtained in the previous step, it was still not ready to be fed into an ML model: it required pre-processing. The downloaded data consists of the two frames that cover Portugal in the WGS-84 system, so I still had to join the two tiles, resample the data into an appropriate projection (so that an image coordinate always corresponds to the same geocoordinate, which is not the case for the unprocessed frames), and extract several patches along the coastal area. An orthorectification step was not required, as the Level 2 products contain a band that maps every image coordinate to the respective geocoordinate. The pre-processing pipeline was also implemented with the Apache Beam Python library to concurrently process the frames (matched by date and satellite) for D1.

The MLCube pre-processing task takes the zipped raw data from a given folder and outputs the processed patches into another folder. The data is unzipped for the pre-processing operations, and the unzipped tiles are deleted once they are no longer needed to avoid excessive disk usage. This task is not responsible for creating the TFRecord dataset mentioned in Section 5.2.2.3, due to conflicts between TensorFlow and other packages required for the previous pre-processing operations.

5.2.2.1 Mosaic Operation

Until this step, the data from a particular day and satellite is partitioned into two files (corresponding to frames 2160 and 2340). To extract the patches, the two frames were first concatenated. Then, the image was reprojected onto a shared grid so that the resulting patch coordinates (y, x) consistently corresponded to the same geocoordinates (latitude, longitude), regardless of the day on which, or the satellite with which, the image was taken. This was done through a resampling operation, which determines the value that a particular coordinate in the projected image should take (for example, when more than one point falls on the same coordinate).

Finally, the resulting image was reprojected into the EPSG:32629 projection, which was deemed the most appropriate for the selected frames because it minimizes the distortion of distances between nearby points, with every pixel representing 300 m × 300 m.

There are several resampling algorithms, out of which the nearest-neighbor (NN) algorithm was used, since it is fast, it is provided by the snappy library, and it is the default method in other remote sensing libraries such as Google Earth Engine [41]. In addition, this resampling method vastly reduced the total duration of the pre-processing pipeline when compared to other algorithms, such as bilinear resampling. With NN, the value of a coordinate in the projected image is the value of the nearest coordinate in the original image. This implies that all the pixel values in the resulting image come from the original image and that some of the original values are lost or duplicated.
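A one-dimensional toy example makes the loss and duplication of values concrete. This is a plain-Python sketch of the nearest-neighbor rule, not the snappy implementation; the coordinates and values are made up for illustration.

```python
def resample_nearest(src_coords, src_values, dst_coords):
    """For each target coordinate, copy the value of the closest source coordinate."""
    resampled = []
    for target in dst_coords:
        nearest = min(range(len(src_coords)), key=lambda i: abs(src_coords[i] - target))
        resampled.append(src_values[nearest])
    return resampled


source_x = [0.0, 1.0, 2.0, 3.0]   # coordinates of the original samples
source_v = [10, 20, 30, 40]       # values of the original samples
target_x = [0.0, 0.2, 2.2, 3.0]   # coordinates of the target grid

print(resample_nearest(source_x, source_v, target_x))  # [10, 10, 30, 40]
```

Here the value 10 is duplicated and the value 20 is lost, while every output value still comes from the original samples. The two-dimensional case applies the same rule with a two-dimensional distance.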

The concatenation and reprojection of the two frames were performed using the mosaic operation of the snappy Python library. However, this operation also requires that a geocoordinate boundary be provided for the target projection and, although we wanted to create several patches along the coastline, the reprojection operation is costly; hence, it would be inefficient to run it for every patch. For this reason, instead of giving the mosaic operation the boundaries of a specific patch, I resampled the frames to the boundaries specified in Section 5.2.1 (which are visually represented in Figure 5.2).

The pre-processing pipeline starts by getting metadata from the filenames of the products contained in an input folder. Then, the files are grouped by date and satellite (thereby grouping the frames to concatenate), and a folder is created in the output folder for every product date. Within these folders, another folder is created for the satellite that captured the data (to avoid conflicts on dates where both Sentinel-3A and Sentinel-3B captured data). Finally, the input files are unzipped, the snappy mosaic operation is called for every one of the previously matched groups, and the resulting products are saved in NetCDF4 format (which allows them to be manipulated further using the xarray library) in the corresponding date and satellite output folders. The EPSG:32629 projection was the target of the mosaic operation.

Figure 5.2: Boundaries used to extract patches from the north-western coastline of Portugal for D1.

The mosaic operation requires that the target bands be specified. The following bands were included for D1:

• all sea surface reflectance bands (except Oa13, Oa14, Oa15, Oa19, and Oa20 as they are not available at Level 2),

• chl_oc4me,

• TSM_NN,

• Longitude,

• Latitude,

• PAR.

To create masks in the resulting projection, the following flags were given (based on the flags used in [42]):

• CLOUD,

• CLOUD_AMBIGUOUS,

• CLOUD_MARGIN,

• INVALID,

• COSMETIC,

• SATURATED,

• SUSPECT,

• HISOLZEN,

• HIGHGLINT,

• SNOW_ICE,

• AC_FAIL,

• WHITECAPS,

• ANNOT_ABSO_D,

• ANNOT_MIXR1,

• ANNOT_TAU06,

• LAND.

The mosaic operation returned products 692 pixels high and 185 pixels wide (208 km × 56 km). For every selected band, it also created a mask band named after the corresponding band and suffixed with “_count” (e.g., the band “Oa01_reflectance” has the mask “Oa01_reflectance_count”). If the value of a mask pixel is greater than 0, then the same coordinate is valid in the corresponding band (as dictated by the flags passed to snappy).
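The way a “_count” mask is read can be sketched with plain nested lists. This is only an illustration of the rule stated above (a pixel is valid when its count is greater than 0); the helper names and the toy arrays are assumptions.

```python
def apply_count_mask(band, count):
    """Return `band` with pixels replaced by None wherever the count mask is 0."""
    return [
        [value if n > 0 else None for value, n in zip(band_row, count_row)]
        for band_row, count_row in zip(band, count)
    ]


def valid_fraction(count):
    """Fraction of pixels whose count is greater than 0."""
    pixels = [n for row in count for n in row]
    return sum(n > 0 for n in pixels) / len(pixels)


# Toy 2x2 band and its mask, e.g. "Oa01_reflectance" and "Oa01_reflectance_count".
band = [[0.1, 0.2], [0.3, 0.4]]
count = [[1, 0], [0, 2]]
masked = apply_count_mask(band, count)
```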

Since the resulting mosaics are saved in NetCDF4 format, some products were manually inspected using the Panoply software to validate that the operation was performed correctly (an example mosaic is shown in Figure 5.3).

5.2.2.2 Patch Operation

After the mosaic operation, the patch operation follows, which is responsible for retrieving 64x64 subsets of the mosaic, with each patch corresponding to a training example of D1.

The patches were extracted along the western coastline of Portugal. To do so, we first had to find the coordinates of several points on the coastline that could serve as references for the extraction. These coordinates were obtained from Natural Earth, a public dataset supported by the North American Cartographic Information Society (NACIS) [43]. This dataset contains vector geodata in ESRI shapefile format, whose points, lines, and polygons describe coastlines, country boundary lines, land, and other features. However, we do not deal with these files directly, since the data can be easily accessed through the NaturalEarthFeature class of the cartopy library, which exposes the features as shapely geometries. This class was used to get the coastline coordinates of the world at a resolution of “10m” (note that this does not mean that the provided points are evenly separated by 10 m; it denotes the 1:10,000,000 scale at which the points are represented). Out of this data, we were only interested in the points within the bounds of the mosaic; hence, only this subset of points was kept for the following operations (Figure 5.4), resulting in 104 patches per mosaic (6716 patches in total for D1).

Figure 5.3: CHL-NN mosaic of the S3B on 09-11-2020 for D1 on the left and the corresponding mask on the right. A pixel is valid if the corresponding mask has a count greater than 0.

To determine the bounds for every patch, the coastline points first needed to be resampled to the same grid as the mosaics. This was accomplished by selecting the nearest coordinates in the mosaic. Then, for every point, the patch bounds were selected such that the point represented the bottom right corner and the patch had dimensions of 64x64 pixels. Once these bounds were determined, only the patches whose bounds were still within the mosaic’s bounds were kept (Figure 5.4).
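The bound computation can be sketched in index space. This is a minimal illustration under stated assumptions: coastline points are given in projected metres, `(x0, y0)` is taken to be the centre of the top-left mosaic pixel, the row index grows as y decreases, and all function names are hypothetical. The mosaic size (692 × 185), patch size (64 × 64), and pixel size (300 m) come from the text.

```python
PATCH = 64
MOSAIC_ROWS, MOSAIC_COLS = 692, 185


def nearest_index(value, origin, step):
    """Index of the grid cell whose coordinate is closest to `value`."""
    return round((value - origin) / step)


def patch_bounds(row, col, size=PATCH):
    """Half-open (top, bottom, left, right) bounds with (row, col) as bottom-right pixel."""
    return row - size + 1, row + 1, col - size + 1, col + 1


def inside_mosaic(bounds, rows=MOSAIC_ROWS, cols=MOSAIC_COLS):
    """True if the patch lies entirely within the mosaic."""
    top, bottom, left, right = bounds
    return top >= 0 and left >= 0 and bottom <= rows and right <= cols


def patches_for(points, x0, y0, pixel=300.0):
    """Map (x, y) coastline points to patch bounds, keeping those inside the mosaic."""
    kept = []
    for x, y in points:
        col = nearest_index(x, x0, pixel)
        row = nearest_index(y, y0, -pixel)  # y decreases as the row index grows
        bounds = patch_bounds(row, col)
        if inside_mosaic(bounds):
            kept.append(bounds)
    return kept
```

With half-open bounds, a patch can be cut out directly by slicing rows `top:bottom` and columns `left:right`.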

The patches were then extracted using the xarray library by selecting the subset of mosaic data within each patch’s bounds. Although this subset operation could also be performed by snappy, we found that it was faster to do so using the xarray library, possibly due to the overhead associated with snappy being a wrapper around a Java library.

Finally, every patch was saved individually alongside the mosaic it was extracted from, in NetCDF4 format, and named with a specific identifier (an integer) related to the patch’s geocoordinates. This means that, for example, a patch with id “t0” on two different dates represents a region bounded by the same coordinates. A few of these patches were also validated using the Panoply software to ensure that the operation was successful (Figure 5.5).

Figure 5.4: On the left side, we can see the boundaries used to extract the patches for D1, alongside the reference points for each patch. On the right side, the red squares correspond to the extracted patches, whereas the green points represent the reference points.

5.2.2.3 TFRecord dataset

After validating some of the patches from the previous step, the training examples were concatenated and saved into seven TFRecord files. This was ideal, given that the deep learning framework used in this dissertation (TensorFlow) is optimized to read files of this type. Additionally, separating the data into several files (also known as shards) allowed the model to be trained with data that does not fit in memory.

To create the TFRecord shards, Apache Beam’s “WriteToTFRecord” operator was used. By using Apache Beam, the shards were created in parallel, using only as many resources as were available on the machine, which made the operation faster (due to the use of multiple CPU cores) without exhausting the memory (since it would only open as many patch files as would fit in memory). Each example saved in the shards corresponds to a single patch and contains a mapping from the channel name to the respective matrix with that channel’s values. The “WriteToTFRecord” operator accepted these examples and was responsible for automatically creating the seven shards in an output folder, compressed in gzip format. The number of shards was chosen so that the files would have roughly 100 MB each, as this is the size recommended in the TensorFlow documentation to boost model training performance.
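The shard count follows from simple arithmetic on the total dataset size. The sketch below assumes a target of 100 MiB per shard and uses a hypothetical total size chosen only for illustration; the shard naming mirrors Apache Beam's default `-SSSSS-of-NNNNN` template, and the prefix and suffix are made-up names.

```python
import math

TARGET_SHARD_BYTES = 100 * 1024 ** 2  # roughly 100 MB per shard


def shard_count(total_bytes: int, target: int = TARGET_SHARD_BYTES) -> int:
    """Smallest number of shards that keeps each shard at or below the target size."""
    return max(1, math.ceil(total_bytes / target))


def shard_names(prefix: str, count: int, suffix: str = "") -> list:
    """File names following the `-SSSSS-of-NNNNN` shard naming pattern."""
    return [f"{prefix}-{index:05d}-of-{count:05d}{suffix}" for index in range(count)]


# Hypothetical total size: 650 MiB of serialized examples.
total = 650 * 1024 ** 2
count = shard_count(total)
names = shard_names("d1", count, ".tfrecord.gz")
```

The resulting count would then be passed as the number of shards when writing the records.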

Figure 5.5: CHL-NN extracted patch of the S3B on 09-11-2020 for D1.
