3.4 Data Transformation Process
3.4.1 Dimensional Model
To develop the dimensional model for Electrão, a four-step approach is taken following Kimball’s methodology and design process (Kimball et al., 2016):
1. Summarization of business processes based on the perspectives of analysis and respective business questions.
2. Identifying the grain by building upon the business questions to assess the required granularity.
3. Identifying business relevant dimensions that enable Electrão to get the insights to respond to its business needs.
4. Identifying the facts for measurement that can encapsulate the respective information.
The result of this process is a fact constellation schema, also known as Galaxy schema, as two (or more) fact tables share one or more dimensions. The schema will be implemented in Power-BI.
3.4.1.1 Business Process and Model Granularity
The business needs identified during the interviews with the experts of each department are reused to assess the required granularity and are summarized in the table below. This analysis provides insights to further define the facts and dimensions.
Table 11 - Granularity assessment for a dimensional model Business
activity Perspective Grain
POM
Producers of EEE What is the number of EEE producers per category, per district and year by weight, units & value (€)
Producers of BA What is the number of BA producers per typology, per district and year by weight, units & value (€)
EEE quantities
POM How much EEE was put on market per category & subcategory by weight, units
& value (€) per year?
36 Business
activity Perspective Grain
POM BA quantities POM How much BA was put on market per typology & chemical system by weight, units & value (€) per year?
Operations
Rede Electrão What is the number of collection points per type and typology, per district per year?
Collected quantities of WEEE & BA
What is the total amount of WEEE collected per category, per type &
typology per year?
What is the total amount of BA collected per typology, per chemical system, per collection type & typology per year?
WEEE/BA per treatment destination
What is the total amount of WEEE collected and sent to treatment facilities per category, per type of quantity, per year?
Treatment
Recycling &
recovery rates What is the recycling/ recovery rate per category and per year.
Components of
mandatory removal What is the total amount of material/components collected per category and per year?
Materials obtained What is the material composition per category, per destination, per year?
3.4.1.2 Dimensions & Hierarchies
There are, in total, 14 dimensions containing five different hierarchies. Each dimension has a Surrogate Key (SK) that is a unique identifier and is used to connect the data in the fact tables via Foreign Key (FK). Each dimension is explained in detail in the following.
Date dimension:
The date dimension is the only dimension that is connected to all nine fact tables including two dimensions (the tables “Rede Electrão REEE” and “Rede Electrão PA” represent two exceptions within this model). The date dimension contains a hierarchy that allows a drill down from year to day, which is the lowest granularity in this dimension. However, given that all data sets are aggregated by year (01/01/2018, 01/01/2019 etc.), the level of granularity up to the day is not required. The dates range from 2014 to 2021. Since not every data set has the same date range, some pages show longer time periods than others.
POM producer dimension (EEE & BA):
The POM producer dimension contains all the commercial merchants that estimate their sales volume, including their fiscal number (pt. NIF) and the waste stream (BA and/or EEE). The dimension is connected to the two fact tables “POM EEE” and “POM BA”.
37
Category dimension:
The category dimension only includes the category number and is connected to five fact tables: Fractions, Material, WEEE Collection, Rates and Treatment.
EEE category dimension:
The EEE category POM dimension was created in addition to the previous dimension as it requires a different granularity & hierarchy: It contains the number and name of the official EU categories, including the operational sub-categories that are used by Electrão. This represents the second hierarchy in this model: Category number > Category name > Subcategory name. This dimension is only connected to the POM EEE fact table.
BA typology dimension:
This dimension is unique for the waste stream of batteries and accumulators and is connected to the BA Collection fact table and the BA POM table. The dimension contains a hierarchy: Typology number > Typology name >
Chemical system.
Collection points dimension (WEEE & BA):
The Collection Point dimension contains all collection points for WEEE and BA (this also includes collection points that are not part of Electrão’s network). It contains the name of the collection point (pt. origem), the type & typology of collection point (pt. Tipo LR and Tipo interveniente), the municipality (pt.
concelho), and district (pt. distrito). The two columns Nivel Nacional (engl.
national level) and the column Concelho Mapa (engl. municipality map) were created in Power Query.
The dimension is connected to the WEEE Collection and BA Collection fact tables. This dimension includes two hierarchies: 1) The location hierarchy: National Level > District > Municipality and 2) the collection point typology hierarchy: Typology > Type of collection point.
38
Entry point dimension:
The entry point dimension is also connected to the two collection fact tables (WEEE & BA) and only hosts the entry point of waste collection for either batteries
& accumulators or WEEE.
Fraction dimension:
This dimension is only connected to one single fact table (fraction fact table).
It contains the depollution type and the type of fraction of mandatory removal.
Material dimension:
The dimension is only connected to the fact material table and contains the additional materials obtained during treatment.
Rate dimension:
The rate dimension contains the type of rate (recycling or recovery) and is connected to the rate fact table.
Destination dimension:
The destination dimension is connected to three fact tables: Treatment, WEEE treated, and material fact table and contains the different destinations of materials or treatment destination.
Treatment operator dimension:
This table contains the name of the different treatment operators and is only connected to the treatment fact table.
39
Exceptions - Rede Electrão REEE & BA:
The two tables that contain Electrão’s collection network are an exceptional case within this model and represent the original structure from the excel sheets. In theory, these entries should have been appended to the WEEE and BA collection point dimension with an additional column that would have allowed us to distinguish between the collection points that belong to Electrão’s network and not. However, given the data quality, it was not possible to merge the entries from these two tables as they could not be matched via a unique identifier (key). The connection between this table (potential dimension) and the WEEE/BA Collection fact table would have resulted in a many-to-one relationship. To make time filters (year) interact across tables, these two tables are only connected to the Date dimension table. Both tables include two hierarchies: 1) The location hierarchy: National Level > District > Municipality and 2) the collection point typology hierarchy: Typology > Type of collection point.
3.4.1.3 Fact Tables
In the final step, the quantifiable facts and measures are defined. In total, there are nine fact tables.
As mentioned previously, the galaxy schema results in several fact tables shared with one or more dimensions via a foreign key (FK).
POM EEE fact table:
The POM sales data is always estimated in weights (kg) or (tons), units, and monetary value (€). Hence, most of the additional measures created for further analysis have to be calculated for all three perspectives, such as Year-over-Year changes and the share of totals to understand the percentage share of a given category or producer. In addition, a ranking measure was created that allows to calculate the percentage share of the total value that the Top 10 producers of EEE and BA put on the market.
40
POM BA fact table:
The POM BA fact table is set up very similar to the EEE POM fact table.
All sales data is again estimated in weights, units, and in monetary value (€). Measures like Year-over-Year change and percentage shares of a given dimension are created. The only difference is the connection between this and the dimension tables. Batteries and accumulators use a different categorization (typology) and must be connected to the respective dimension.
WEEE collection fact table:
The collection of WEEE is measured in weight (tons or kg). Four additional measures were created: The percentage share of total collection per category and district, the year-over-year change per collected weight, and the total weight collected of the previous year (this is required for some column charts that show the value of the filtered and the previous year to make comparison easier for the user.
BA collection fact table:
The BA collection fact table contains the collected weight (in tons) of batteries & accumulators. Two additional measures are created: The year-over-year change of collected BA and the percentage share of collected BA per district.
41
Treatment fact table:
The treatment fact table captures all WEEE collected that is sent to a treatment facility. The amounts are stored in tons, and the following three measures were created: The percentage share of WEEE collected per treatment operator, the percentage share of WEEE treated per category, and the year-over-year change of WEEE collected & treated.
WEEE treated fact table:
This fact table represents almost the same data as the previous fact table but already totals that are aggregated by year and per WEEE quantity type (and not broken down per treatment operator). This level of granularity is missing in the previous table, and hence a separate fact table had to be created.
Rates fact table:
The rates fact table contains the different recovery and recycling rates per given EEE category. Given that also the international EU targets vary across years, they were included in the fact table and not in the dimension. In addition, three more measures were created: The rate that was achieved in the previous year (this is used in column charts to show filtered and the previous year at the same time) and the year-over-year change in absolute and relative terms.
Material fact table:
The material fact table captures the amount of material (in kg or tons) retrieved during treatment processes. The following three measures allow additional analysis: The year-over-year change of obtained material, the material that was obtained in previous years (again to allow a filtered year and previous year in one visualization), and the percentage share of the total material collected per category.
42
Fraction fact table:
This table represents the weight (kg or tons) of components that are mandatory to remove during treatment processes and have to be reported according to EU legislation. Two additional measures were created: The percentage share of a fraction per category and the year-over-year change (%) of removed fractions.
3.4.1.4 The Galaxy Schema
The result of this four-step approach by Kimball is the galaxy schema, as multiple fact tables share one or more normalized dimensions (as opposed to the snowflake or star schema, where multiple dimensions are mapped to a single fact table). It is one of the most common schemas used for data warehouse modeling and is usually applied to more complex data structures, which also explains the application for the given data sets. Given that several fact tables share one or more dimensions, they can also be seen as interlinked star schemas. The galaxy schema is generally more flexible but can be much more complex and hence, harder to implement and maintain. It also has a decreased performance compared to star or snowflake schemas as it usually underlines more complex data aggregations.
In this business context, the dimensional model includes nine fact tables and 12 dimensions in which five hierarchies are contained. Given the model's size, the schema is broken down into screenshots of the individual fact tables and their connections to the dimensions.
Figure 24 – Overview galaxy schema
43 Figure 25 – Fact table POM EEE with dimensions
Figure 26 - Fact table POM BA with dimensions
44 Figure 27 - Fact table WEEE collection quantities with dimensions
Figure 28 – Fact table BA collection quantities with dimensions
Figure 29 - Tables Electrão's network WEEE & BA
45 Figure 30 - Fact table WEEE collected & sent to treatment with dimensions
Figure 31 - Fact table general rates with dimensions
Figure 32 - Fact table material destinations with dimensions
46 Figure 33 - Fact table fractions with dimensions
Figure 34 - Fact table treatment with dimensions
47