
2.6 Artificial intelligence

2.6.2 Models description

Several ML strategies exist to learn from annotated data through supervised learning.

These are encoded as algorithms with different calculation methods, different assumptions regarding the data, and different balances between the assumptions made about the data distribution and the modelled function (the model's bias) and the way the model adapts to differences in the ds. (the model's variance) [52]. In this subsection, a brief description of each strategy used in this thesis follows.

One of the most ubiquitous ML tools is Logistic Regression (LR). As a special case of generalized linear models, LR assumes the d.v. is modulated by a linear combination of the i.vs., but unlike linear regression, LR is designed for classification problems, where the d.v. is the positive class probability [51]. Like linear regression, it also assumes no collinearity between i.vs. (although it is still robust under non-ideal circumstances), and it requires the number of cases to be greater than the number of features [51]. A Decision Boundary (DB) set at p = 0.5 can be found for LR by solving the sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}}, \tag{2.1}$$

where $x$ is a regressive input function, i.e., a linear combination of the i.vs. One advantage of LR over other modelling strategies is that it does not assume i.vs. are normally distributed; its d.v. is expected to follow the Bernoulli distribution [51, 53]. One important LR limitation is its linear DB assumption, which makes it unable to model more complex interactions between variables [53, 54].
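As an illustration of Equation 2.1 and of the DB at p = 0.5, the following minimal Python sketch evaluates the sigmoid on a linear score. The coefficient values and the helper names (`predict_proba`, `predict`) are hypothetical and chosen only for this example; they are not taken from the models trained in this thesis.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function of Equation 2.1: maps a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Positive-class probability for one observation.

    The linear score bias + sum(w_j * x_j) plays the role of the
    regressive input function fed to the sigmoid.
    """
    score = bias + sum(w * v for w, v in zip(weights, x))
    return sigmoid(score)


def predict(x: Sequence[float], weights: Sequence[float], bias: float,
            threshold: float = 0.5) -> int:
    """Assign the positive class when the probability reaches the threshold."""
    return int(predict_proba(x, weights, bias) >= threshold)


# Hypothetical coefficients, chosen only for illustration.
WEIGHTS, BIAS = [2.0, -1.0], 0.5

# A point whose linear score is exactly 0 lies on the DB (p = 0.5).
on_boundary = [0.25, 1.0]  # 2.0 * 0.25 - 1.0 * 1.0 + 0.5 = 0.0
print(predict_proba(on_boundary, WEIGHTS, BIAS))
print(predict([1.0, 0.0], WEIGHTS, BIAS), predict([0.0, 2.0], WEIGHTS, BIAS))
```

Because the score is linear in the features, the set of points with p = 0.5 is a hyperplane, which is the linear DB limitation mentioned above.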

Also assuming a linear DB is Linear Discriminant Analysis (LDA). This classifier, unlike LR, assumes that observations come from a Gaussian distribution and that the covariance matrix is identical for all classes to be classified. When those assumptions are not met, it provides unreliable results. LDA approximates the Bayes Optimal Classifier [54].

Another model used, derived from LDA, is Quadratic Discriminant Analysis (QDA). It extends the LDA formulation by accepting a different covariance matrix for each class, allowing more complex Decision Boundaries (DBs) to be defined. When that condition holds (the most usual circumstance), the DB is expressed by a quadratic formula, which gives this modelling strategy its name [54]. When the QDA covariance matrices are all assumed to be diagonal, it is equivalent to the Gaussian Naïve-Bayes classifier (GNBC) [55].
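The difference between the LDA and QDA boundaries can be sketched in one dimension, where the covariance matrix reduces to a variance. The snippet below uses the standard Gaussian discriminant score (log prior plus Gaussian log-density, dropping the constant term); the means, variances and priors are hypothetical values picked for illustration only.

```python
import math


def gaussian_discriminant(x, mean, var, prior):
    """Log prior plus Gaussian log-density, without the constant -0.5*log(2*pi)."""
    return math.log(prior) - 0.5 * math.log(var) - (x - mean) ** 2 / (2.0 * var)


def classify(x, means, variances, priors):
    """Return the index of the class with the highest discriminant score."""
    scores = [gaussian_discriminant(x, m, v, p)
              for m, v, p in zip(means, variances, priors)]
    return scores.index(max(scores))


# Hypothetical two-class problem.
means, priors = [0.0, 2.0], [0.5, 0.5]
lda_vars = [1.0, 1.0]    # LDA assumption: one variance shared by all classes
qda_vars = [0.25, 4.0]   # QDA: one variance per class

for x in (-5.0, 0.0, 5.0):
    print(x, classify(x, means, lda_vars, priors), classify(x, means, qda_vars, priors))
```

With a shared variance the quadratic terms cancel and a single threshold separates the classes. With class-specific variances, class 1 is selected on both sides of class 0, which a single linear boundary cannot express.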

GNBC is a classifier modelling strategy that assigns class labels to problem instances, where the probabilities involved are estimated as proportions of the outcome to be predicted, based only on the data available in the ds., as approximations to the population's proportions. After decomposing the ds. and computing those estimates, a maximum likelihood rule evaluates the influence of each class combination using Bayes probabilities, as defined by the conditional probability:

$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}, \tag{2.2}$$

where $C_k$ is the $k$-th class of the outcome and $x$ is the vector with the representations of every feature in the observation. GNBC assumes that each of these features is independent of the others given the class, so the conditional probability $p(C_k \mid x)$ is proportional to $p(C_k)$ multiplied by the product of the per-feature probabilities $p(x_i \mid C_k)$, where $x_i$ is the $i$-th component of the input vector $x$. The main advantage of GNBC is that, in dss. where features are indeed independent of each other, it can perform well with small training sets [54].
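The computation in Equation 2.2 can be sketched as follows. This is a minimal from-scratch Gaussian Naïve Bayes, assuming maximum likelihood estimates of per-class means and variances and a small variance floor for numerical stability; the toy data and the function names (`fit_gnb`, `posterior`) are hypothetical.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Estimate class priors and per-feature Gaussian parameters from the ds."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # 1e-9 is an assumed variance floor to avoid division by zero.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(v, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mean) ** 2 / (2.0 * var)


def posterior(model, x):
    """p(C_k | x) for every class, following Equation 2.2."""
    log_joint = {
        label: math.log(prior) + sum(log_gaussian(v, m, s)
                                     for v, m, s in zip(x, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    # p(x) is the sum of the joint terms; subtract the max for stability.
    top = max(log_joint.values())
    unnormalised = {label: math.exp(lj - top) for label, lj in log_joint.items()}
    total = sum(unnormalised.values())
    return {label: u / total for label, u in unnormalised.items()}


# Hypothetical toy ds.: two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
print(posterior(model, [1.0, 2.0]))
```

The per-feature sum of log-densities is where the independence assumption enters: the joint likelihood is treated as a product of univariate terms.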

Decision Trees (DTs) are a very different type of modelling strategy from the ones presented above. They define models as a branched function, creating a hierarchical structure with the most important classification choices in the upper levels (the ones that reduce entropy the most) and refining the model at each extra level with choices that enable a further reduction of the overall classification entropy. The entropy metric is defined as

$$D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}, \tag{2.3}$$

while the Gini Index, on the other hand, is defined by

$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left(1 - \hat{p}_{mk}\right), \tag{2.4}$$

where $D$ is the entropy, $G$ is the Gini Index, $K$ is the total number of classes of the outcome, and $\hat{p}_{mk}$ is the proportion of class $k$ observations in node $m$ of the tree. In DTs and associated methods, these measures are preferred over the more common classification error rate due to their greater sensitivity to node purity. The main advantages of DTs are that they work well with limited data, they are among the most explainable and interpretable models, and they can map very complex DBs [54]. However, Decision Tree (DT) models have high variance, especially when configured to overfit: they are unstable, varying significantly with the underlying subset of data in use; they give more preponderance to categorical features with more classes; and models with many features can have structures that are hard to understand in practice [54].
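Equations 2.3 and 2.4 can be computed directly from the labels falling in a node. The sketch below assumes the natural logarithm in the entropy and uses a hypothetical helper, `weighted_impurity`, to score a candidate split as the size-weighted impurity of its two child nodes.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of each class present in a node."""
    counts = Counter(labels)
    n = len(labels)
    return [count / n for count in counts.values()]


def entropy(labels):
    """Equation 2.3 (natural logarithm assumed)."""
    return -sum(p * math.log(p) for p in class_proportions(labels))


def gini(labels):
    """Equation 2.4."""
    return sum(p * (1.0 - p) for p in class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Hypothetical node with two balanced classes.
node = ["a", "a", "b", "b"]
print(gini(node), entropy(node))
# A split that separates the classes perfectly has zero impurity.
print(weighted_impurity(["a", "a"], ["b", "b"], gini))
```

Both measures are zero for a pure node and maximal when classes are evenly mixed, so a tree-growing procedure can pick, at each level, the split with the lowest weighted impurity.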

The main drawbacks of DTs can be overcome with ensembles. Ensembles, or ensemble methods, are models that use the output of groups of learning models to produce their predictions. In theory, the errors of individual models are diluted through majority voting, or the errors of each learner are emphasized for the next learner in the ensemble, so that those errors are avoided. One major drawback of ensemble methods is the obfuscation of individual learners, so the models created are opaque and, therefore, less explainable and interpretable.

Random Forests (RF) is one popular ensemble method that uses DTs as learners. It uses bootstrapping for each of its individual trees, and an ensemble process known as bagging, which is mathematically defined as

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x), \tag{2.5}$$

where $\hat{f}^{*b}(x)$ is the prediction of the learner trained on the $b$-th bootstrapped ds., and $B$ is the total number of bootstrapped dss. In bagging, learners are grown independently, but RF enhances the randomization of the data available to each learner by randomly subsetting features at each split of each DT [54]. Bagging allows each learner to draw on the entire ds. while having a different perspective on it, and feature subsetting avoids overrepresentation of features with many categories, making these models usually perform better than DTs [54].
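The bagging average of Equation 2.5 can be sketched with a deliberately simple learner. The snippet below is an illustration under stated assumptions: the weak learner is a hypothetical one-dimensional threshold rule rather than a full DT, the toy data are invented, and the per-split feature subsetting that distinguishes RF from plain bagging is not implemented.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Hypothetical weak learner for 1-D data with labels in {0, 1}.

    It thresholds halfway between the two class means; a real RF would
    grow a full DT and also subset features at each split.
    """
    xs0 = [x for x, label in sample if label == 0]
    xs1 = [x for x, label in sample if label == 1]
    if not xs0 or not xs1:
        # Degenerate bootstrap sample with a single class: constant learner.
        constant = 1.0 if xs1 else 0.0
        return lambda x: constant
    mean0, mean1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2.0
    upper_is_positive = mean1 > mean0
    return lambda x: 1.0 if (x > threshold) == upper_is_positive else 0.0


def bagged_predict(x, learners):
    """Equation 2.5: average of the B individual learner predictions."""
    return sum(f(x) for f in learners) / len(learners)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [(0.10, 0), (0.35, 0), (0.40, 0), (0.65, 1), (0.80, 1), (0.90, 1)]
B = 25
learners = [fit_stump(bootstrap_sample(data, rng)) for _ in range(B)]
print(bagged_predict(0.0, learners), bagged_predict(1.0, learners))
```

For classification, the averaged value can be read as the fraction of learners voting for the positive class, so thresholding it at 0.5 corresponds to majority voting.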

The other ensemble methods used in this thesis are Adaptive Boosting (better known as AdaBoost), Extreme Gradient Boosting (better known as XGBoost) and Light Gradient Boosting Machine (LGBM). Unlike RF, they all use boosting, which is an ensemble of weak learners (i.e., models intentionally underfit) where learners are grown in succession, each giving more weight to the samples misclassified by its predecessors [54]. Boosting is known for its top performance in many dss., and AdaBoost became well known in early data science competitive scenarios [56]. It has fewer hyperparameters (hps.), so it is easier to configure than previous methods, and like all boosting methods, it usually performs much better than individual learners. As disadvantages, it is sensitive to noise, and it may become more easily biased by irrelevant features than competing methods [57, 58]. XGBoost improves on AdaBoost by adding automatic Feature Selection (FS), individual tree penalization, proportional shrinking of leaf nodes, a better method for solving the optimization problem (Newton's method), and the ability to take advantage of parallelized resources [57, 59].

Finally, LGBM is a competing method to XGBoost. It is technically similar, despite differences in the method for split finding and the ability to deal with categorical variables without prior preprocessing, but it can leverage GPU resources and several heuristics, leading to much faster execution than XGBoost with equivalent results [60].
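The boosting principle described above (weak learners grown in succession, with more weight given to misclassified samples) can be sketched with the classic discrete AdaBoost update. This is an illustrative sketch only: the one-dimensional threshold stumps, the toy data and the function names are assumptions made for this example, and it does not reproduce the XGBoost or LGBM algorithms.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: predicts polarity at or above the threshold, else -polarity."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Discrete AdaBoost with labels in {-1, +1}."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples are up-weighted, correct ones down-weighted.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical 1-D ds. that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy ds. a single stump misclassifies three samples, while three boosted stumps classify every training sample correctly, which illustrates why boosted ensembles usually outperform their individual learners.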

Another important modelling strategy experimented on during this thesis is Support Vector Machines (SVMs). They are typically used as classifiers but, due to the high computational complexity of their algorithm, are best suited for moderately sized dss. They are classifiers robust to outliers that have become known for their speed and high performance in classification, especially in applications where NNs used to be applied. Their robustness to outliers stems from their base principle: SVMs and Maximum Margin Classifiers (MMCs) try to find their DB by finding the points furthest from each class's centroid that are the closest to other classes' boundaries, known as support vectors [54]. Often there is an intersection between the volumes where samples from multiple classes occur, so MMCs cannot provide a solution. SVMs instead consider a soft margin, where points of various overlapping classes occur, and each of those points weighs on the DB. In this way, outliers are disregarded, since the boundary only considers the points within the margin. This can be mathematically described as:

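$$\underset{\beta_0, \beta_1, \ldots, \beta_p,\; \epsilon_1, \ldots, \epsilon_n,\; M}{\text{maximize}} \quad M \tag{2.6}$$

$$\text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \quad y_i \left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) \geq M(1 - \epsilon_i), \quad \epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C, \tag{2.7}$$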

where $M$ is the margin's width, $C$ is an arbitrary tuning parameter, and $\epsilon_i$ are the slack variables, controlling the number of points allowed to intrude on the opposite side of the DB [54]. SVMs have a rigid assumption regarding their DB shape (which would be linear in the regular case), but they are more versatile than other methods given that kernel transformations can be used to adapt to different DB shapes. An SVM kernel with no transformation is defined by:

$$K(x_i, x_j) = \langle x_i, x_j \rangle, \tag{2.8}$$

where $K$ is the kernel transformation, $\langle \cdot, \cdot \rangle$ is the inner product, and $x$ is the derived feature set, in this case without transformation, which produces linear DBs.

Non-linear boundaries can be achieved by using:

$$K(x_i, x_j) = e^{-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}}, \tag{2.9}$$

or polynomial DBs by using:

$$K(x_i, x_j) = \left(\gamma x_i^T x_j + r\right)^d, \tag{2.10}$$

where $\sigma$, $\gamma$ and $r$ are kernel parameters and $d$ is the polynomial transformation degree [54].
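The three kernels in Equations 2.8 to 2.10 can be written as plain functions of two feature vectors. The sketch below is a direct transcription under the assumption that Equation 2.9 is the Gaussian radial kernel with width $\sigma$; the default parameter values are arbitrary and chosen only for illustration.

```python
import math


def linear_kernel(xi, xj):
    """Equation 2.8: inner product of the untransformed feature vectors."""
    return sum(a * b for a, b in zip(xi, xj))


def rbf_kernel(xi, xj, sigma=1.0):
    """Equation 2.9: Gaussian radial kernel with width sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def polynomial_kernel(xi, xj, gamma=1.0, r=1.0, d=2):
    """Equation 2.10: polynomial kernel of degree d."""
    return (gamma * linear_kernel(xi, xj) + r) ** d


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))       # 1*3 + 2*4
print(rbf_kernel(u, u))          # identical points give the maximum similarity
print(polynomial_kernel(u, v))   # (1*11 + 1)**2
```

Each function returns a similarity score between two observations; swapping one kernel for another changes the DB shape without changing the optimization problem in Equations 2.6 and 2.7.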

SVMs became popular due to their versatility, allowing them to model complex CV and Natural Language Processing (NLP) tasks, especially due to their robustness while modelling highly-dimensional dss. Support Vector Machine (SVM) derivative methods are still used since they achieve results comparable to NNs with less parametrization effort and faster training and inference on moderately sized dss. [61], but they tend to have worse generalizability than NNs for the same volumes of data [54].