
5.3.1 Data preparation

After the clinical data had been preprocessed and a preliminary analysis carried out in an R pipeline, the triaged and processed data was loaded into a Python notebook. As a first quality control step, the data to be fed to the models was checked for missing values (NAs). After confirming that the dataset was complete, a principal component analysis (PCA) was conducted to evaluate the best feature selection (FS) strategy.
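A minimal sketch of the completeness check and the exploratory PCA is given below. The data frame is a randomly generated stand-in with hypothetical column names, and standardizing the features before PCA is an assumption, not something stated for the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the processed clinical table.
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["f1", "f2", "f3", "f4"])

# Completeness check: count missing values per column and in total.
missing_per_column = df.isna().sum()
n_missing = int(missing_per_column.sum())

# PCA on standardized features (the standardization step is an assumption).
pca = PCA().fit(StandardScaler().fit_transform(df))
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
```

Inspecting `cumulative_variance` shows how many components are needed to retain a given share of the variance, which is one way such an analysis can inform the choice of FS strategy.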

For each set of chosen clinical features, the data was loaded into a Python framework and each feature was classified as a categorical, boolean, integer, or floating-point field. Numeric variables did not show relevant outliers, so all cases were kept, as seen in Tables 6.1, 6.2 and 6.3. For feature transformation, age, mRS before the event, NIH Stroke Scale, and the time differences in minutes from onset to the first hospital, to the first CT and to the second hospital were treated as numerical variables and rescaled using a MinMaxScaler. Boolean variables were binarized, and categorical variables were dummy encoded by parametrizing one-hot encoding. Despite the presence of irrelevant variables and collinearity in the data, most variables were included in the first modelling stage, since some models handle these issues better than others.

5.3.2 Testing

The dataset was split into training, validation and test sets, using the definitions provided by Brian Ripley in ‘Pattern Recognition and Neural Networks’:

– Training set: A set of examples used for learning, that is to fit the parameters of the classifier.

– Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.

– Test set: A set of examples used only to assess the performance of a fully-specified classifier. [166]

From the initial dataset, 20% of the samples were randomly selected for the test set. For model selection, repeated stratified 10-fold cross-validation was used. The cross-validation was non-nested, so the cross-validation metrics reported are the ones obtained while searching for the best model. Because model selection optimizes for a metric, the metrics obtained this way are optimistic. One solution would be to use nested cross-validation, or to rely on the performance measured on a separate dataset. Given that the difference between nested and non-nested cross-validation is on average below 2 percent [167], and that nested cross-validation increases search and evaluation time considerably, nested cross-validation was not used.
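The splitting scheme can be illustrated with the short sketch below. The synthetic data, the seed value, the number of repeats, the stratified hold-out split and the logistic regression model are all assumptions made for illustration; only the 20% test fraction and the 10 stratified folds come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

SEED = 42  # hypothetical seed

# Synthetic, mildly imbalanced stand-in for the clinical dataset.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.7, 0.3], random_state=SEED
)

# Hold out 20% of the samples as the test set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=SEED
)

# Repeated stratified 10-fold cross-validation on the remaining 80%.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=SEED)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_dev, y_dev, cv=cv, scoring="f1_weighted"
)
```

With 10 folds and 3 repeats, `scores` holds 30 validation scores, one per fold and repeat.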

The metric selected for the search was the weighted F1-score, while the intended testing and comparison metric was the AUC score. Finally, the parametrized models were evaluated with ROC curves on the test set to assess whether their general performance fell within the confidence intervals obtained from the cross-validated validation scores.
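One way to compare the test-set AUC against the spread of cross-validated AUC scores is sketched below. The text does not state how the confidence interval was computed, so the percentile interval used here is an assumption, as are the synthetic data and the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Cross-validated AUC scores and an assumed 95% percentile interval.
cv_auc = cross_val_score(model, X_dev, y_dev, cv=cv, scoring="roc_auc")
lower, upper = np.percentile(cv_auc, [2.5, 97.5])

# AUC on the held-out test set, using the positive-class probability.
model.fit(X_dev, y_dev)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
within_interval = bool(lower <= test_auc <= upper)
```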

5.3.3 Model selection

An array of classifier strategies of interest was chosen, including linear methods such as LR, Linear Discriminant Analysis (LDA), and linear SVM; decision-tree-based methods such as single DTs, Random Forests (RF), AdaBoost, XGBoost (XGBM) and Light Gradient Boosting Machine (LightGBM) classifiers; Bayesian methods represented by the GNBC (for testing purposes only, given that several variables do not follow the normal distribution); lazy learning methods represented by the k-Nearest Neighbours (k-NN) strategy; SVMs with different kernel transformations; Quadratic Discriminant Analysis (QDA); and NNs represented by a fully connected network, the Multi-Layer Perceptron (MLP) available in scikit-learn. Before training, a global random generator seed was set, and for each model the seed was also set internally through the corresponding hyperparameter.

Several manually tailored Grid Searches (GSs) were conducted, guided by previously found hyperparameters: each time a value at the limit of a range was selected, the next Grid Search (GS) was adjusted so that this extreme value became the median value of the new range, following Bayesian search principles. After finding the main models for each modelling strategy, the modelling strategies were compared by the average metrics calculated on the cross-validated results.
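The grid-refinement rule could be automated along the lines of the sketch below, although the searches in this work were adjusted manually. The helper `next_grid`, the factor of 10 between grid values, the starting grid, and the choice of tuning only the `C` hyperparameter of a logistic regression are hypothetical choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def next_grid(grid, best, factor=10):
    """Re-centre the grid when the best value sits at either limit of the range."""
    if best in (min(grid), max(grid)):
        # The extreme value becomes the median of the new grid.
        return [best / factor, best, best * factor]
    return list(grid)  # best value is interior: keep the grid


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = [0.01, 0.1, 1.0]
for _ in range(3):  # at most three refinement rounds
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": grid},
        scoring="f1_weighted",
        cv=cv,
    )
    search.fit(X, y)
    best_c = search.best_params_["C"]
    refined = next_grid(grid, best_c)
    if refined == grid:
        break  # best value is no longer at a range limit
    grid = refined
```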

5.3.4 Performance metrics and statistical model comparison

During the model selection phase, the weighted F1-score was used as the selection metric. The positive class was defined as the bad outcome, that is, the patients rejected for thrombectomy.

Accuracy, balanced accuracy, and Area Under the (ROC) Curve (AUC) were also calculated and checked to ensure that the model performed consistently on all relevant metrics. All metrics were calculated and stored for each cross-validation set. A classification report, which calculates precision, recall and F1-score for each predicted class together with their macro and weighted averages, was also produced.
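As an illustration, these metrics can be computed for a single validation fold with scikit-learn as sketched below. The labels and scores are made-up toy values, with label 1 standing for the bad outcome (the positive class), and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    f1_score,
    roc_auc_score,
)

# Toy fold: 1 = bad outcome (positive class), 0 = good outcome.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]  # predicted P(bad outcome)
y_pred = [int(s >= 0.5) for s in y_score]            # assumed 0.5 threshold

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
    "auc": roc_auc_score(y_true, y_score),  # AUC uses scores, not hard labels
}

# Per-class precision, recall and F1, plus macro and weighted averages.
report = classification_report(
    y_true,
    y_pred,
    target_names=["good outcome", "bad outcome"],
    output_dict=True,
)
```

Storing one such `metrics` dictionary per cross-validation fold yields the per-fold score lists used later for model comparison.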

Statistical model comparison was done by performing a Friedman test on the list of cross-validated AUC scores from the best model of each modelling strategy. The AUC score was the main model evaluation metric. It is calculated from the Receiver Operating Characteristic (ROC) curves and has the advantage of summarizing those curves in a single metric [54]. When this test rejected the null hypothesis, the multiple-model comparison was done with a Nemenyi post hoc test [168], conducted on the same AUC score matrix.

Models whose average ranks differed by less than the critical distance were considered not statistically significantly different and were grouped together. The group of models that performed best was used in further analysis, after the initial survey on the base clinical dataset.
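The comparison procedure can be sketched as follows. The AUC matrix is synthetic, with three hypothetical models and 30 cross-validation scores each. The critical values are the two-tailed Nemenyi values at a significance level of 0.05 as commonly tabulated for this test, and the critical distance follows the usual formula CD = q * sqrt(k(k+1)/(6N)) for k models and N score sets.

```python
import math

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Synthetic AUC matrix: rows are cross-validation sets, columns are models.
rng = np.random.default_rng(0)
auc = np.array([0.80, 0.78, 0.70]) + rng.normal(0.0, 0.01, size=(30, 3))
n_sets, n_models = auc.shape

# Friedman test over the per-model score lists.
statistic, p_value = friedmanchisquare(*auc.T)

# Average rank of each model (rank 1 = highest AUC within a row).
ranks = np.apply_along_axis(rankdata, 1, -auc)
mean_ranks = ranks.mean(axis=0)

# Nemenyi critical values for alpha = 0.05, indexed by number of models.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}
critical_distance = Q_ALPHA_005[n_models] * math.sqrt(
    n_models * (n_models + 1) / (6.0 * n_sets)
)

# Pairs of models whose mean ranks differ by less than the critical
# distance are treated as not significantly different.
rank_gap = np.abs(mean_ranks[:, None] - mean_ranks[None, :])
not_significant = rank_gap < critical_distance
```

The boolean matrix `not_significant` is only meaningful when the Friedman test has rejected the null hypothesis, which is why `p_value` should be checked first.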

5.3.5 Automated Machine Learning (AutoML)

The clinical dataset was passed to Auto-sklearn, testing the strategies of both versions: the base version, which uses meta-learning to warm-start the Bayesian optimization, followed by ensemble creation; and the version 2.0 strategy, which extends the first version with Portfolio Successive Halving (a way to test groups of models with increasing resources, such as the number of samples or iterations), early stopping of unpromising tests, and automatic selection of the search policy based on the information learned from the dataset.

The models were run for the same time as the total time of the last manually constructed GS and compared against all base models preceding it.

5.3.6 Feature Selection (FS)

A study of each tabular dataset was conducted after the best models were found, considering the top two or three hyperparametrized models. For each of them, the Hughes Phenomenon (HP) [169] was studied by repeating the model's training with an increasing number of features, selected by their ANOVA F-values between label and feature for classification tasks. The previous pipelines were then rerun for every dataset, trimming the features to the number recommended by the FS analysis, and when the metrics improved, these new models were considered final.
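A sketch of this feature-count sweep is shown below, using scikit-learn's `SelectKBest` with the `f_classif` (ANOVA F-value) score. The synthetic dataset, the logistic regression model and the 5-fold cross-validation are assumptions made for illustration; placing the selector inside a pipeline is a design choice that keeps the feature ranking from seeing the validation folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 20 features, of which only a few are informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Retrain with an increasing number of top-ranked features.
mean_auc = {}
for k in range(1, X.shape[1] + 1):
    model = make_pipeline(
        SelectKBest(score_func=f_classif, k=k),  # keep the k best by F-value
        LogisticRegression(max_iter=1000),
    )
    mean_auc[k] = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

# Number of features with the highest mean cross-validated AUC.
best_k = max(mean_auc, key=mean_auc.get)
```

Plotting `mean_auc` against `k` typically shows performance rising and then flattening or dropping as uninformative features are added, which is the pattern the Hughes Phenomenon describes.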