4.2 Methodology
4.2.1 Local Causal Discovery Module
As explained previously, the proposed methodology takes advantage of the potential causal relationships found in the data to create the model. These relationships are discovered at every split, using the data available at that point. The causal module searches for potential causal relations of the form variable → class (class being the target variable) and uses the measured dependence as an input when selecting the best attribute to split on.
Before going into more detail about the module, some key definitions must be recalled, namely what a (potential) causal relationship is and how it can be measured. By causal relationship, we understand a relation between events in which one is identified as the cause of the other. This means that, in theory, if the first event occurs, we can expect the second to occur as well. Conversely, if the first event does not occur, the second is not expected to occur either.
Generally, a causal algorithm can be classified as a local or global discovery algorithm, depending on its purpose and mechanism applied to find relationships.
Although this distinction between causal algorithms has no universally agreed definition, we can define a global causal discovery algorithm (also known as global structure discovery) as one that searches for all the potential causal relationships among a set of variables [107]. This type of algorithm is usually used to study the general causal interactions in a given system.
As for local causal discovery algorithms (also called local structure discovery), their objective is only to find the causal relationships of a specific variable rather than those among all the variables [107]. These algorithms are mainly used in two cases: problems with high-dimensional data and feature selection problems.
Besides searching only for causal relationships with a target variable, local causal discovery differs from global causal discovery in another aspect. Typically, local causal discovery algorithms return only undirected causal relationships, i.e., they find relationships but give no information about their direction. In contrast, global causal discovery algorithms perform an extra step to find the direction of these relationships.
Arguably, the most common way found in the machine learning literature to assess whether a relation is causal is through conditional independence tests [41]. State-of-the-art global causal algorithms such as PC and FCI, among others [94], apply these tests to uncover which variables are independent of which, so that only the potentially dependent ones remain.
As causal relations tend to persist in the presence of other influences, these conditional independence tests verify whether a potential relation is maintained when one, two, or several other variables influence its values. If the relation disappears, the claim that the two variables are causally related is not sustained.
Several conditional independence tests can be applied, but the most commonly used for discrete data are χ² and G² [94]. Although both are chi-squared-based independence tests and are widely used because they are independent of order, they have several limitations. In the case of χ², testing whether two variables are dependent using only the information from those two variables cannot establish that they are causally dependent, since, in general, a causal relationship remains even when other factors influence it; i.e., given three variables A, B and C, we can only say that A and B are causally related if their relation is maintained when C also influences it [54,108]. Although G² addresses this problem by including the influence of other variables in the dependence calculation, it appears to be sensitive to sample size [109,110], meaning that it does not detect relationships well in small data sets.
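The limitation of the unconditional χ² test described above can be illustrated with a small numerical sketch. The counts below are hypothetical and were chosen so that A and B look dependent when C is ignored, while being exactly independent within each level of C; the function name `chi2_statistic` is likewise an assumption made for illustration.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic for a two-way contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts of A (rows) by B (columns), one table per level of C.
stratum_c0 = [[16, 4], [4, 1]]
stratum_c1 = [[1, 4], [4, 16]]

# Collapsing over C gives the table that an unconditional test would see.
marginal = [[x + y for x, y in zip(r0, r1)]
            for r0, r1 in zip(stratum_c0, stratum_c1)]

CRITICAL_1DF_05 = 3.841  # chi-squared critical value, 1 degree of freedom, alpha = 0.05

marginal_stat = chi2_statistic(marginal)
stratum_stats = [chi2_statistic(stratum_c0), chi2_statistic(stratum_c1)]
print(marginal_stat > CRITICAL_1DF_05, stratum_stats)
```

In this constructed case the marginal statistic exceeds the critical value, whereas the statistic inside each stratum of C is zero, so the apparent A-B relation is not maintained once C is taken into account.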
Some tests search for this type of association, called partial association (a statistical measure used to find conditional independence in controlled experiments [111]). One example is the GCMH test.
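To give an idea of how a test of partial association works, the sketch below computes the classical Cochran-Mantel-Haenszel (CMH) statistic for a set of 2×2 tables, one per stratum of the conditioning variable. This is only the binary special case, written without continuity correction; it is not the generalised version used in this work, which also handles larger tables. The function name and the example counts are assumptions.

```python
def cmh_statistic(strata):
    """CMH statistic for a list of 2x2 tables [[a, b], [c, d]], one per stratum.

    No continuity correction is applied. Under conditional independence the
    statistic is approximately chi-squared with 1 degree of freedom.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n          # observed minus expected count
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var_sum == 0:
        return 0.0
    return diff_sum ** 2 / var_sum

# Hypothetical data: association explained entirely by the stratifying variable.
confounded = [[[16, 4], [4, 1]], [[1, 4], [4, 16]]]
# Hypothetical data: the same association present inside every stratum.
associated = [[[10, 5], [5, 10]], [[10, 5], [5, 10]]]

print(cmh_statistic(confounded), cmh_statistic(associated))
```

The first data set yields a statistic of zero, because the relation vanishes within strata, while the second yields a clearly positive value, because the relation is maintained in every stratum.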
As mentioned earlier, although conditional independence tests are a vital component in finding potential causal relations, it is important to note that they do not orient the potential dependences, i.e., they do not indicate whether A is the potential cause of B or vice versa; they only hint that a relationship exists. As stated in Chapter 2, this orientation can be done by using a set of established rules, by using experimental data to orient the edges, or by combining both approaches.
Returning our focus to the proposed module: although it can be classified as a local causal discovery algorithm, since it searches for the causal relationships of a specific variable, it has one crucial difference that brings it closer to global causal discovery algorithms. Instead of searching for and returning undirected causal relationships, this module orients all the causal relationships found using an asymmetric dependence measure and returns only the causes of the target variable (variable → target) and the respective coefficients (Algorithm 4.1)*. With this extra orientation step, we ensure that the chosen variables are the ones that directly influence the target (they cause it) and are not merely (causally) related to it. This contrasts with, for example, the CDT, which chooses any variable that is related to the target without verifying whether that variable is its cause or is caused by it.
In this module, we propose using the GCMH test instead of the traditional χ² or G² tests regularly used in the literature, because it mitigates the problems presented previously.
In addition, the GCMH test can be used with both binary and non-binary discrete data. As an orientation method, we propose the uncertainty coefficient (UC), which measures how dependent two variables are and takes values between 0 and 1. Since this coefficient is asymmetric, the direction of the dependence can be determined by comparing the values obtained for A → B and B → A and choosing the direction with the stronger dependency.
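The orientation step can be sketched as follows. The code computes the uncertainty coefficient from entropies, as the mutual information divided by the entropy of the variable being explained. Reading I(A; B) / H(B) as the strength of A → B is an assumed convention, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def uncertainty_coefficient(cause, effect):
    """Fraction of the entropy of `effect` explained by `cause`.

    Computed as I(cause; effect) / H(effect). Interpreting this as the
    strength of cause -> effect is an assumed convention.
    """
    h_effect = entropy(effect)
    if h_effect == 0:
        return 0.0  # a constant variable has no uncertainty to explain
    h_joint = entropy(list(zip(cause, effect)))
    mutual_information = entropy(cause) + h_effect - h_joint
    return mutual_information / h_effect

# Toy data: B is fully determined by A, but A is not fully determined by B.
a = [0, 1, 2, 3] * 5
b = [value // 2 for value in a]

uc_a_to_b = uncertainty_coefficient(a, b)
uc_b_to_a = uncertainty_coefficient(b, a)
direction = "A -> B" if uc_a_to_b > uc_b_to_a else "B -> A"
print(uc_a_to_b, uc_b_to_a, direction)
```

Because knowing A removes all uncertainty about B while knowing B removes only half of the uncertainty about A, the comparison selects A → B in this example.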
*It is important to note that, from now on, causal sufficiency and faithfulness are assumed.
Algorithm 4.1: CAUSALM: module for finding potential causal relationships and respective UCs
Input: Let D be a data set with a set of variables S = {s1, s2, ..., sn} and a target variable t. Let α be the significance level for the conditional independence tests and ct the corresponding critical value. Let ucoef be the minimum accepted coefficient.
Output: R, list of all causal relationships and respective UCs
1 foreach pair of variables {s, t} in D, with distribution d = dist(s, t) do
2     if Generalised_CMH(d) ≥ ct then
3         Verify the s → t and t → s directions using the uncertainty coefficient (UC)
4         if the coefficient of s → t is higher than that of t → s and than ucoef then
5             Save s and the respective coefficient in R
6 return R
To find all possible causal relationships with the target variable, the causal module (Algorithm 4.1) starts by applying the independence test to all variables (with a significance level defined a priori) to determine the possible relationships between them and the target (line 2). The UC is then applied to every variable selected in this way (that is, every variable partially associated with the target) by testing both variable → target and target → variable. If variable → target has the higher coefficient, and that coefficient exceeds a user-defined minimum (to ensure that only strong dependencies are chosen), the variable and its coefficient are saved (this value is used later in the calculation of the variables’ importance).
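The steps of Algorithm 4.1 can be sketched in Python as shown below. This is a minimal illustration under several assumptions: the data set is a dictionary of equal-length columns, the dependence test is a pluggable function whose default is a plain Pearson χ² statistic used only as a placeholder for the GCMH test, and the direction convention of the UC is I(cause; effect) / H(effect). All function and variable names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def uc(cause, effect):
    """Uncertainty coefficient I(cause; effect) / H(effect); convention assumed."""
    h_effect = entropy(effect)
    if h_effect == 0:
        return 0.0
    h_joint = entropy(list(zip(cause, effect)))
    return (entropy(cause) + h_effect - h_joint) / h_effect

def chi2_stand_in(xs, ys):
    """Pearson chi-squared statistic: a placeholder, not the GCMH test itself."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    xy_counts = Counter(zip(xs, ys))
    stat = 0.0
    for x in x_counts:
        for y in y_counts:
            expected = x_counts[x] * y_counts[y] / n
            stat += (xy_counts[(x, y)] - expected) ** 2 / expected
    return stat

def causalm(data, target, ct, ucoef, statistic=chi2_stand_in):
    """Return {variable: UC} for the variables kept as potential causes of `target`."""
    t = data[target]
    relations = {}
    for name, s in data.items():
        if name == target:
            continue
        if statistic(s, t) >= ct:                    # line 2: dependence test
            s_to_t, t_to_s = uc(s, t), uc(t, s)      # line 3: both directions
            if s_to_t > t_to_s and s_to_t > ucoef:   # line 4: direction and strength
                relations[name] = s_to_t             # line 5: save
    return relations

# Toy data: s1 determines the target, s2 is independent of it.
s1 = [0, 1, 2, 3] * 5
data = {"s1": s1, "s2": [0, 1, 0, 1] * 5, "t": [v // 2 for v in s1]}

# 7.815 is the chi-squared critical value for 3 degrees of freedom at alpha = 0.05.
result = causalm(data, target="t", ct=7.815, ucoef=0.3)
print(result)
```

In this toy example only s1 passes both the dependence test and the orientation step, so it is the only variable returned together with its coefficient.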
4.2.2 Revisiting SC Trees
Returning now to the proposed algorithm, its operation is similar to that of traditional decision trees in that it applies a divide-and-conquer methodology to build the tree, as shown in Algorithm 4.2.
The main difference between the traditional method and the proposed method lies in the way the attributes’ importance is calculated. Instead of using only the information gain (IG) as a measure of importance, we propose using a weighted sum of the IG and the uncertainty coefficient when there is a causal relationship (referred to in this work as semi-causal information gain, or SCIG) in the calculation of the information gain ratio (referred to in this work as semi-causal information gain ratio, or SCIGR).
Algorithm 4.2: SCT: Semi-Causal Tree
Input: Let D be a data set with a set of variables S = {s1, s2, ..., sn} and a target variable t
Output: Tree
1 Tree = {}
2 if D is pure OR another stop criterion is met then
3     return Tree
4 Map all the potential causal relationships in D with t using CAUSALM
5 forall attribute a in D do
6     Compute the semi-causal gain ratio criterion if we split on a (4.5)
7 abest = Best attribute according to the above-computed criterion (the attribute that maximises the gain ratio)
8 Tree = Create a decision node that tests abest in the root
9 Dv = Induced subsets from D based on abest
10 forall Dv do
11     Treev = SCT(Dv)
12     Attach Treev to the corresponding branch of the Tree
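\[
\mathrm{SCIG}(X) =
\begin{cases}
(\beta \times IG_X) + (\theta \times UC_X) & \text{if } GCMH_X < \alpha \text{ and } UC_X \geq uc_{coef} \\
IG_X & \text{otherwise}
\end{cases}
\tag{4.4}
\]

\[
\mathrm{SCIGR}(X) = \frac{\mathrm{SCIG}(X)}{IV(X)}
\tag{4.5}
\]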
Therefore, as we can see in Algorithm 4.2, at each split the algorithm begins by searching for all potential causal relationships with the target (line 4). This information is then used to calculate the semi-causal information gain ratio (4.5) of each variable, and the variable with the highest value is chosen.
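The divide-and-conquer structure of Algorithm 4.2 can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: the scoring function is pluggable and defaults to the plain information gain ratio, leaves hold the majority class, and a split is refused when no attribute has a positive score. These choices, and all names, are assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def gain_ratio(rows, attr, target):
    """Plain gain ratio IG / IV; a stand-in for the semi-causal ratio (4.5)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    n = len(rows)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy([row[target] for row in rows]) - remainder
    intrinsic_value = entropy([row[attr] for row in rows])
    return info_gain / intrinsic_value if intrinsic_value > 0 else 0.0

def build_tree(rows, attrs, target, score=gain_ratio):
    """Recursive induction; returns a class label (leaf) or a nested dict (node)."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:        # lines 2-3: stop criteria
        return majority
    # Line 4 (CAUSALM) would run here and feed its UCs into `score`.
    scores = {a: score(rows, a, target) for a in attrs}   # lines 5-6
    best = max(scores, key=scores.get)                    # line 7
    if scores[best] <= 0:
        return majority  # assumed extra stop criterion: nothing is informative
    node = {"attribute": best, "branches": {}}            # line 8
    subsets = defaultdict(list)                           # line 9
    for row in rows:
        subsets[row[best]].append(row)
    remaining = [a for a in attrs if a != best]
    for value, subset in subsets.items():                 # lines 10-12
        node["branches"][value] = build_tree(subset, remaining, target, score)
    return node

# Toy data: the class y is determined by x1, while x2 is uninformative.
rows = [
    {"x1": 0, "x2": 0, "y": "no"},
    {"x1": 0, "x2": 1, "y": "no"},
    {"x1": 1, "x2": 0, "y": "yes"},
    {"x1": 1, "x2": 1, "y": "yes"},
]
tree = build_tree(rows, ["x1", "x2"], "y")
print(tree)
```

Passing a different `score` function is the point at which a semi-causal criterion could replace the plain gain ratio without changing the recursion itself.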
It is important to note that the first case of (4.4) is used in the gain ratio calculation if and only if there is evidence of a (strong) causal relationship between the target and the current variable. This evidence is given first and foremost by the GCMH test (Algorithm 4.1), which is responsible for assessing whether there is a causal dependence between a variable and the target. Only after this causal dependence is established is the UC applied, to ensure that the relationship is strong and that its direction is the desired one (variable → target).
If this condition does not hold, only the IG (the second case) is used in the gain ratio calculation.
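Equations (4.4) and (4.5) translate almost directly into code. In the sketch below, the GCMH term is assumed to be the p-value of the test, since it is compared with α; the weights β and θ, the threshold and all numeric values are hypothetical, and returning 0 when IV is 0 is an assumed convention.

```python
def scig(ig, uc, gcmh_p_value, alpha, uc_coef, beta, theta):
    """Semi-causal information gain, following (4.4)."""
    if gcmh_p_value < alpha and uc >= uc_coef:
        return beta * ig + theta * uc   # first case: causal evidence available
    return ig                           # second case: plain information gain

def scigr(ig, iv, uc, gcmh_p_value, alpha, uc_coef, beta, theta):
    """Semi-causal information gain ratio, following (4.5)."""
    if iv == 0:
        return 0.0  # assumed convention for an attribute with a single value
    return scig(ig, uc, gcmh_p_value, alpha, uc_coef, beta, theta) / iv

# Hypothetical settings shared by both calls.
settings = dict(alpha=0.05, uc_coef=0.3, beta=0.5, theta=0.5)

# Significant test and strong UC: the weighted sum is used.
causal = scigr(ig=0.4, iv=2.0, uc=0.6, gcmh_p_value=0.01, **settings)
# Non-significant test: only the IG enters the ratio.
non_causal = scigr(ig=0.4, iv=2.0, uc=0.6, gcmh_p_value=0.20, **settings)
print(causal, non_causal)
```

With these illustrative numbers the attribute scores higher when the causal evidence is present, because its UC is larger than its IG and both weights are equal.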