Supervised Learning

Supervised algorithms were tested for each set of workers and their health in order to train the model. In the implementation of the current research, regression methods were employed as restrictive algorithms, whereas they were used as more flexible algorithms.

Figure 2.1: AI applicability in healthcare (diagram labels: prevention, data gathering, diagnosis, treatment, surgery, medical research, drug development; elderly caretaker, elderly companion, care specialist, nurses, GP, radiologist, neurologist, psychiatrist, hematologist; compassion needed, compassion not needed; Human, AI, Human + AI assistant)

2.3.1 Ensemble Algorithms

Bastos used an ensemble Decision Tree (DT) in clinical decision-making for patients with multiple acute or chronic diseases (i.e. multimorbidity). This study showed that one factor that can influence how decisions are made under conditions of risk and uncertainty is the decision maker’s personality. The variables were well modelled by at least one of the sets of features extracted [14]. These algorithms are described below:

Decision Tree. When the relationship between the features and the output is highly non-linear and complex, DTs outperform traditional approaches such as linear regression.

To construct a DT, the predictor space is divided into J distinct and non-overlapping regions (R_1, R_2, ..., R_J). Every observation that falls into a region receives the same prediction, and the regions are chosen so as to minimise the residual sum of squares [83].
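The region search described above can be illustrated for a single feature and a single split. The following is a minimal sketch, not the procedure used in this research: the function names and the toy data are invented for illustration, and a full DT would repeat the search recursively inside each resulting region.

```python
from typing import List, Tuple


def rss(values: List[float]) -> float:
    """Residual sum of squares of `values` around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (threshold, total RSS) of the best single split on feature x."""
    pairs = sorted(zip(x, y))
    best = (float("nan"), float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data (hypothetical): the response jumps between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total_rss = best_split(x, y)
```

On this toy data the chosen threshold lies midway between 3 and 10, which separates the two groups of responses.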

In a nutshell, a DT’s final structure is a flowchart, with each internal node representing a "test" on an input variable, each branch representing the test’s outcome, each leaf node representing a label (i.e. a final decision), and the paths from root to leaf representing rules. This structure makes the decision process behind the DT easier to understand and describe. However, DTs are extremely sensitive to the training data, and even minor changes can have a big impact on the outcome [83].

Random Forest is an ensemble method that combines many DTs to provide a more accurate and reliable prediction. To overcome the DTs’ sensitivity, each tree in a Random Forest (RF) is trained on a different sample of the data using bagging, which is a method of randomly sampling a data set with replacement. Each node considers only a random selection of features, which produces an uncorrelated forest of trees whose combined prediction is more accurate than that of any single tree. Without this restriction, highly predictive features would appear at the top of every tree and produce similar trees [20]. According to [59], the subset of variables considered at each node should be one-third of the total features.
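The two sources of randomness described above, bagging of rows and a random feature subset of about one-third of the features, can be sketched as follows. This is an illustrative sketch only: the helper names, the seed and the data dimensions are assumptions, and no tree is actually fitted.

```python
import random
from typing import List


def bootstrap_indices(n_rows: int, rng: random.Random) -> List[int]:
    """Bagging: sample n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(n_features: int, rng: random.Random) -> List[int]:
    """Pick roughly one-third of the features, without replacement, for one node."""
    k = max(1, n_features // 3)
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
n_rows, n_features, n_trees = 10, 9, 3  # hypothetical data set dimensions

# One bootstrap sample per tree, plus the features its root node would consider.
forest_plan = [
    {
        "rows": bootstrap_indices(n_rows, rng),
        "root_features": feature_subset(n_features, rng),
    }
    for _ in range(n_trees)
]
```

In a complete RF a new feature subset is drawn at every node of every tree; only the root node is shown here to keep the sketch short.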

Gradient Boosting is an ensemble of DTs, similar to RF. What sets it apart from RF is the technique of tree growth: instead of bagging, it uses boosting. Unlike bagging, where the data set is randomly sampled, boosting uses a weighted data set in which some observations are more likely to be included in new sets. As a result, each tree is trained using information from previously trained trees, ensuring that the trees grow sequentially and that weak learners become strong learners. Gradient Boosting (GB) uses the gradients of the loss function, a measure of how well the model’s coefficients fit the data, to identify weak learners. To minimise over-fitting, unlike in RF, the number of trees in boosting should be limited. Furthermore, the shrinkage parameter, which controls the learning rate, should be chosen with the number of trees in mind, as many trees are required to obtain good performance with a small learning rate [53,70,80].
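The sequential nature of boosting and the role of shrinkage can be illustrated with a small sketch. It assumes a squared-error loss, for which the negative gradient is simply the residual, and uses one-split trees (stumps) on a single feature; the function names and toy data are hypothetical and the sketch is not the implementation used in this work.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(x: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression tree to the residuals by least squares.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    return best[1], best[2], best[3]


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, lmean, rmean = stump
    return lmean if xi <= threshold else rmean


def mse(y: List[float], preds: List[float]) -> float:
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Grow trees one after the other, each fitted to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # negative gradient of 1/2 (y - p)^2
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrinkage: only a fraction of each tree's correction is applied.
        preds = [p + learning_rate * predict_stump(stump, xi) for p, xi in zip(preds, x)]
    return base, stumps, preds


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # toy data, invented for illustration
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
base, stumps, preds = gradient_boost(x, y, n_trees=50, learning_rate=0.1)
```

With a smaller learning rate each tree contributes less, so more trees are needed to reach the same training error, which is the trade-off mentioned above.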

2.3.2 Machine learning and Gradient Boosting Decision Tree

The Bayes theorem, which was proposed by Bayes in 1763 [15], is shown to be the foundation of ML. For reference, the Bayes theorem’s mathematical formula is as follows:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{2.1}$$
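A short worked example may help to make Equation 2.1 concrete. The numbers below are hypothetical and chosen only for illustration: A is "the worker has a disorder" and B is "the screening test is positive". P(B) is obtained from the law of total probability.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical values: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
```

With these assumed values the posterior is 0.009 / 0.0585, roughly 0.15, which shows how a low prior keeps the posterior modest even when the test is sensitive.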

This mathematical formula can be used to assess the probability of an event occurring based on previous observations. Although the Bayes Theorem may appear simple today, it was a groundbreaking scientific accomplishment when it was originally discovered and has made significant contributions to a wide range of fields. Turing (1950) [158] and Rosenblatt (1958) [131], for example, began research in the 1950s that would eventually be connected with ML. However, Samuel (1959) [135] was the first to use the term ML, in his groundbreaking study on using ML to construct a program that could compete with a human player in a game of checkers. Since then, ML has advanced significantly, and it is now used in virtually every aspect of life.

When evaluating and choosing ML models in the past, performance, speed, and flexibility were often emphasised. However, some researchers, such as [98], suggest that the General Data Protection Regulation (GDPR) in the European Union, as well as the public’s increasing knowledge of their privacy rights, has made choosing a model far more difficult. The GDPR is designed to prevent, or at the very least discourage, the employment of algorithms that might exploit, oppress, marginalize, discriminate, or otherwise infringe on an individual’s liberty and right to privacy.

This is especially true in situations where the planned solution deals with personal information. As a result, Lepri et al. (2018) [98] argue that attributes like transparency, intelligibility, and accountability have become important, if not mandatory, considerations when choosing an ML model. Models that have the potential to be highly efficient, such as Artificial Neural Networks (ANNs), often lack these transparency properties, posing legal and ethical problems if used in certain decision-making contexts. Various ML model types and tools are available when it comes to employing regression approaches for decision making. GBDTs are a sort of regression ML model that can provide a high level of transparency and intelligibility while also performing well in terms of correct predictions, scalability, and efficiency. Friedman first proposed the concept of GBDTs [54]. They introduced a new classification method in their paper that would merge numerous weak classifiers to form a single robust classifier.

Condorcet [35] proposed that as the number of predictors increases, the probability of reaching a correct conclusion increases, as long as each predictor is more likely to be correct than incorrect. In essence, an ensemble of weak predictors works in a similar way, with individual weak predictors in place of people. In GBDTs, according to [87], a decision tree is a flowchart-like diagram made up of a number of nodes. These nodes are divided into three types: decision nodes, chance nodes, and end nodes.

• Decision nodes are nodes that are subdivided into new sub-nodes.

• Chance nodes are nodes that indicate a group of uncontrollable probable events.

• End nodes (also known as "leaf nodes") are connected to their parent nodes and generally signify a result or decision.
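The argument attributed to Condorcet above can be checked numerically. The sketch below assumes independent predictors that are each correct with the same probability p and that the ensemble decides by strict majority; these assumptions, and the chosen value p = 0.6, are illustrative only.

```python
from math import comb


def majority_correct(n: int, p: float) -> float:
    """Probability that a strict majority of n independent predictors is correct."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Each predictor is only slightly better than chance (p = 0.6, an assumed value).
probs = [majority_correct(n, 0.6) for n in (1, 11, 101)]
```

The probability of a correct majority decision grows with the number of predictors, which is the intuition behind combining many weak learners. Real ensembles violate the independence assumption to some degree, so the gain in practice is smaller.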

GBDTs also have various advantages over other ML model types, one of which is that they are very good at avoiding overfitting. When implementing an ML model, overfitting is a typical issue that occurs when the model has learnt to recognise the training data by memorisation rather than abstraction. Overfitted ML models are undesirable since they fail to make predictions from new data. Study [10] provides an indication of where to begin when developing a GBDT model. Chen and Guestrin [32], Ke et al. [88], and Prokhorenkova et al. [121] provide three state-of-the-art GBDTs: XGBoost, Light Gradient Boosting Machine (LightGBM), and CatBoost, respectively. Owing to its good decision performance, fast computing speed and other features, the DT-based extreme gradient boosting algorithm XGBoost has attracted considerable research interest in the industrial machinery, power system, and industrial infrastructure domains [171,176].

The XGBoost-based feature importance ranking, in particular, can analyze the relationship between output results and input features, assisting network operators in comprehending failure detection results. XGBoost is an integrated model algorithm that uses the Classification and Regression Tree (CART) as its base learner. XGBoost is made up of simple sub-units that are coupled to form a system with a high model complexity and learning ability. Unlike typical ANNs, XGBoost’s base learner is made up of the root node, branches, and leaf nodes, and it enhances the model’s performance by greedily adding trees. When creating the CART DT, the feature and splitting point that offer the largest gain to the loss function are chosen for node splitting [32].
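The greedy choice of a splitting point can be sketched with the structure-score gain given by Chen and Guestrin [32], in which G and H denote the sums of first- and second-order gradients in a node, lambda is the L2 regularisation term and gamma is the penalty for adding a leaf. The code below is a simplified single-feature illustration with hypothetical function names and toy data; it is not the XGBoost library itself.

```python
from typing import List, Optional, Tuple


def leaf_score(g: float, h: float, lam: float) -> float:
    """Structure score G^2 / (H + lambda) of a leaf."""
    return g * g / (h + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0) -> float:
    """Gain of splitting a node into a left and a right child."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


def best_split(x: List[float], grad: List[float], hess: List[float],
               lam: float = 1.0, gamma: float = 0.0) -> Tuple[Optional[float], float]:
    """Scan the sorted feature values and keep the threshold with the largest gain."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    g_total, h_total = sum(grad), sum(hess)
    g_left = h_left = 0.0
    best_threshold, best_gain = None, 0.0
    for pos in range(len(order) - 1):
        i, j = order[pos], order[pos + 1]
        g_left += grad[i]
        h_left += hess[i]
        if x[i] == x[j]:
            continue  # identical values cannot be separated
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left, lam, gamma)
        if gain > best_gain:
            best_threshold, best_gain = (x[i] + x[j]) / 2, gain
    return best_threshold, best_gain


# Toy data with squared-error loss and an initial prediction of 0:
# gradient = prediction - target, hessian = 1 (an assumption of this sketch).
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
grad = [0.0 - yi for yi in y]
hess = [1.0] * len(y)
threshold, gain = best_split(x, grad, hess)
```

If no candidate yields a positive gain, the function returns None as the threshold, which corresponds to leaving the node unsplit.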

Furthermore, the node splitting procedure is parallel, which increases the model’s processing speed. As Figure 2.2 shows, the trees grow one after the other, each with its own prediction score, and the final result is calculated by adding the scores of all individual trees together.

Figure 2.2: In XGBoost trees, level-wise growth occurs. The blue circles represent an older level of leaves that have already been calculated, while the red circles represent the level of newly added leaves to the tree. The purple circles depict end-nodes, which are connected with parent nodes [8].

An XGBoost tree grows in a level-wise manner (Figure 2.2). XGBoost is one of the GBDT algorithms [10] that generates exponential leaf growth. When working with larger datasets, it is vital to tune XGBoost properly, as it otherwise tends to use up the available memory quickly. As a result, other GBDT alternatives are likely to be better choices for instances where applications must scale to large datasets.

Chen et al. outline some of the novel features that contribute to XGBoost’s scalability, which can be summarised as follows [32,31]:

• A new tree learning approach for sparse datasets has been developed.

• A theoretically justified weighted quantile sketch approach that allows approximate tree learning to handle instance weights.

• Using distributed computing, data scientists can process vast amounts of data more quickly.

LightGBM is a popular GBDT model developed by Ke et al. [88] that consistently proves to be quite capable of tackling classification challenges. They argue in their study that, up until then, GBDTs lacked efficiency and scalability in scenarios involving large amounts of data and a large number of features. The LightGBM model uses XGBoost as a baseline but takes a different approach to classification by introducing and combining two new techniques, Gradient-based One-Side Sampling and Exclusive Feature Bundling.

Gradient-based One-Side Sampling means that the model ignores the vast majority of cases in which the gradient is expected to be small. This can prevent the algorithm from travelling down branches that are thought to be less important. The authors came up with this strategy after noticing that data instances with different gradient values have a different impact on the expected information gain. As seen in Figure 2.3, the expansion of the trees in LightGBM is performed leaf by leaf.

Figure 2.3: Leaf-wise growth in LightGBM. The blue circles denote older leaves, which have already been explored. The red circles denote the current leaf that is being considered. The leaves that are expanded upon are the ones that have the highest predicted max delta loss. If the new nodes have a lower max delta loss than one of the previous branches, the algorithm backtracks to the earlier leaf that now has the highest max delta loss. The purple circles depict end-nodes, which are connected with parent nodes [8].
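The Gradient-based One-Side Sampling idea described above can be sketched as follows, following the scheme of Ke et al. [88]: the instances with the largest absolute gradients are always kept, a random fraction of the remaining instances is sampled, and the sampled instances are re-weighted so that the data distribution is approximately preserved. The function name, the rates and the toy gradients are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def goss_sample(gradients: List[float], top_rate: float = 0.2,
                other_rate: float = 0.1, seed: int = 0) -> Tuple[List[int], List[float]]:
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    n = len(gradients)
    # Rank instances from largest to smallest absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    # Small-gradient instances are amplified to compensate for the ones dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


# Hypothetical gradients for 20 training instances.
gradients = [(-1) ** i * i / 10 for i in range(20)]
indices, weights = goss_sample(gradients)
```

In this example only 6 of the 20 instances are used to evaluate candidate splits, which is where the saving in training time comes from.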

The Exclusive Feature Bundling approach, which is used by LightGBM, helps to reduce feature sparsity in circumstances where one-hot encoding is used, as the resulting encodings of one feature are virtually always exclusive with the others. By combining these sparse features and reducing their total number, the model’s training time can be reduced and it can be deployed considerably faster than other GBDTs, such as XGBoost.
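The bundling of mutually exclusive features can be illustrated with one-hot encoded columns. The sketch below is a deliberately simplified version: it only handles binary columns that are strictly exclusive and encodes the bundle as a single integer column, whereas LightGBM works on histogram bins and tolerates a small number of conflicts. The function name and the colour example are hypothetical.

```python
from typing import List


def bundle_exclusive(columns: List[List[int]]) -> List[int]:
    """Merge mutually exclusive sparse columns into one integer-coded column.

    A value of 0 means that all bundled features are zero for that row;
    a value of k means that the k-th bundled feature (1-based) is non-zero.
    """
    n_rows = len(columns[0])
    bundled = [0] * n_rows
    for k, column in enumerate(columns, start=1):
        for row, value in enumerate(column):
            if value != 0:
                if bundled[row] != 0:
                    raise ValueError("features conflict: they are not mutually exclusive")
                bundled[row] = k
    return bundled


# One-hot encoding of a hypothetical colour feature for four rows.
red = [1, 0, 0, 1]
green = [0, 1, 0, 0]
blue = [0, 0, 1, 0]
bundle = bundle_exclusive([red, green, blue])
```

Three sparse columns are replaced by one dense column, so the number of features that must be scanned when searching for a split is reduced.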

In summary, this results in a GBDT that is extremely fast in terms of training, tolerant in situations where memory is limited, and capable of making correct predictions. LightGBM’s key disadvantage is that it overfits the data when used on smaller datasets. In instances where there is only a small amount of data available for the model to train on, other models should be chosen.

CatBoost (Figure 2.4) uses balanced oblivious trees to predict labels, which significantly reduces parameter tuning time [40,121].

Figure 2.4: CatBoost builds symmetric trees and is a level-wise algorithm. This symmetry distinguishes the algorithm from other GBDTs.

Prokhorenkova et al. address heterogeneous data using GBDT algorithms. "GB has been the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies for many years: web search, recommendation systems, weather forecasting, and many others," they write. Heterogeneous datasets include features of various data kinds. In relational databases, tables are frequently heterogeneous. Homogeneous data is the opposite of heterogeneous data: it is data that is all of the same type. CatBoost is designed with categorical features in mind. The CatBoost method adds two new fundamental functions: a permutation-driven ordered boosting algorithm and a novel algorithmic approach specifically for processing categorical information. CatBoost deals with exponential growth in feature combinations by employing a greedy technique for each new split in the existing tree.

CatBoost can also handle cases where the number of categories is too large for current GBDT models to manage. CatBoost takes three measures to address this issue [8]:

• Divide the data into random subsets initially.

• The labels are then converted to integers.

• Finally, the remaining categorical attributes are converted to numerical values.
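The three measures listed above can be sketched with an ordered target statistic, in which each categorical value is replaced by a smoothed average of the integer labels of the examples that precede it in a random permutation. The sketch is an approximation written for illustration: the function name, the prior, the prior weight and the toy data are assumptions, and CatBoost itself uses several permutations and further refinements.

```python
import random
from typing import Dict, List


def ordered_target_stats(categories: List[str], targets: List[int], prior: float,
                         prior_weight: float = 1.0, seed: int = 0) -> List[float]:
    """Encode a categorical feature using only examples seen earlier in a random order."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # measure 1: random ordering of the data
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # measure 3: the category becomes a number based on earlier labels only
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        sums[c] = s + targets[i]  # measure 2: labels are integers (here 0 or 1)
        counts[c] = k + 1
    return encoded


categories = ["a", "b", "a", "a", "b", "c"]  # hypothetical categorical feature
targets = [1, 0, 1, 0, 1, 1]                 # hypothetical binary labels
encoded = ordered_target_stats(categories, targets, prior=0.5)
```

Because the encoding of an example never uses its own label, the target leakage that a plain per-category mean would introduce is avoided.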

Another feature of CatBoost is the ability to choose the maximum number of iterations, the maximum depth of the constituent DTs, and the maximum number of categorical feature pairings to combine. These are all values that the user can change to trade resource usage for performance. Furthermore, the settings that researchers select for these hyper-parameters may help explain why CatBoost performs differently from other learners. CatBoost is discussed further in Section 3.3 in terms of its state of the art.

2.3.3 Artificial Neural Network

ANNs outperform many of the earlier ML methods in the field of AI. For example, Belo [17] argued that the healthcare system places a burden on physicians, which reduces the effectiveness of patient data collection. Different architecture configurations were explored for signal processing and decision making. A Recurrent Neural Network (RNN)-based architecture was able to autonomously replicate three types of biosignals with a high degree of confidence.

Deep Learning (DL) within AI is represented by ANNs, which step in and fill the gap in such situations. ANNs are based on the biological neurons in the human body that activate when certain conditions are met, causing the body to respond with a certain activity.

Artificial neural nets are made up of multiple layers of interconnected artificial neurons that are controlled by activation functions that turn them on and off. In the training phase, neural nets learn particular values, just as traditional machine learning algorithms do.

In a nutshell, each neuron multiplies its inputs by weights, which are initialised randomly, and combines the result with a bias value (which is unique to each neuron layer). The sum is then passed to an appropriate activation function, which determines the final value to be output by the neuron. Depending on the nature of the input values (X_n), different activation functions are available. Once the final neural net layer’s output is generated, the loss function (predicted output versus target output) is computed, and backpropagation is used to change the weights to make the loss as small as possible (Figure 2.5). The overall operation revolves around determining the best weight values (W_n).

Figure 2.5: Structure of ANN (diagram labels: inputs X₀, X₁, X₂; weights W₀, W₁, W₂; products X₀W₀, X₁W₁, X₂W₂; activation function AF; bias; output O)
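The computation performed by a single neuron, and one weight update by gradient descent, can be sketched as follows. The sigmoid activation, the squared-error loss, the learning rate and all numerical values are assumptions made for this illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def neuron_forward(inputs: List[float], weights: List[float], bias: float) -> float:
    """Weighted sum of the inputs plus the bias, passed through the activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def train_step(inputs: List[float], target: float, weights: List[float],
               bias: float, lr: float = 0.5) -> Tuple[List[float], float]:
    """One gradient-descent update for the loss 1/2 (output - target)^2."""
    out = neuron_forward(inputs, weights, bias)
    # Chain rule: d loss / d z = (out - target) * sigmoid'(z) = (out - target) * out * (1 - out)
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias


inputs, target = [1.0, 0.5, -1.0], 1.0  # hypothetical training example
weights, bias = [0.1, -0.2, 0.3], 0.0   # hypothetical initial weights

loss_before = 0.5 * (neuron_forward(inputs, weights, bias) - target) ** 2
weights, bias = train_step(inputs, target, weights, bias)
loss_after = 0.5 * (neuron_forward(inputs, weights, bias) - target) ** 2
```

After the update the output moves towards the target, so the loss on this example decreases; backpropagation applies the same chain rule layer by layer in a full network.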

The inputs are multiplied by the weights, which are real-valued (X_nW_n). The weights are modified in backpropagation to minimise the loss. In basic terms, weights are the values that an ANN learns: the gap between the predicted outputs and the training targets causes them to self-adjust. The Activation Function (AF) is a mathematical function that assists in the ON/OFF switching of neurons (AF(XW + Bias)). The input layer (I) represents the input vector’s dimensions. The hidden layer represents the intermediate nodes that divide the input space into regions with (soft) edges. It takes in a set of weighted inputs and, using an activation function, generates an output. The output layer (O) represents the neural network’s output (Figure 2.6).

Figure 2.6: Input, hidden and output layers

2.3.3.1 Types of Neural Networks

There are many different types of ANNs that now exist or are in development. They can be categorised based on their structure, data flow, neuron density, layers, depth, activation filters, etc. The following neural networks will be discussed:

• Perceptron

• Feed Forward Neural Network

• Multilayer Perceptron

• Convolutional Neural Network (CNN)

• Long Short-Term Memory (LSTM)

• RNN

• Modular Neural Network

Perceptron. The Minsky-Papert [106] perceptron model is one of the simplest and oldest neuron models. It is the smallest unit of an ANN that performs certain computations in order to discover features in input data. It takes weighted inputs and applies the activation function to produce the final result. Threshold Logic Unit (TLU) is another name for the perceptron. The perceptron is a binary classifier, that is, a supervised learning system that divides data into two groups. A perceptron divides the input space into two categories using a hyperplane.

Logic gates such as AND, OR, and NAND can be implemented with perceptrons. Only linearly separable tasks, such as the boolean AND problem, can be learned using perceptrons. A perceptron does not work for tasks that are not linearly separable, such as the boolean XOR problem.
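The difference between the AND and XOR problems can be demonstrated with the classic perceptron learning rule. The sketch below is illustrative: the function names, the number of epochs and the learning rate are arbitrary choices.

```python
from typing import List, Tuple

Sample = Tuple[Tuple[int, int], int]


def step(z: float) -> int:
    """Threshold activation: the neuron fires when the weighted sum reaches 0."""
    return 1 if z >= 0 else 0


def train_perceptron(samples: List[Sample], epochs: int = 20, lr: float = 1.0):
    """Perceptron learning rule: adjust the weights only when a sample is misclassified."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            err = target - step(w[0] * x1 + w[1] * x2 + b)
            if err != 0:
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                b += lr * err
                errors += 1
        if errors == 0:
            break  # every sample is classified correctly
    return w, b


def accuracy(samples: List[Sample], w: List[float], b: float) -> float:
    hits = sum(step(w[0] * x1 + w[1] * x2 + b) == t for (x1, x2), t in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
```

The perceptron classifies all four AND cases correctly, whereas for XOR no choice of weights can do so, because no single straight line separates the two classes.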

Feed Forward Neural Network. This is the most basic version of ANNs, in which input data flows in one direction only, passing through artificial neural nodes and out through output nodes. Input and output layers are always present, whereas hidden layers may or may not be present. Based on this, the network can be characterised as a single-layered or multi-layered feed-forward neural network (see Figure 2.7). The number of layers is determined by the function’s complexity. Propagation is unidirectional: there is forward propagation but no backward propagation, so the weights are fixed. Inputs are multiplied by weights and fed into an activation function, for which a classification activation function or a step activation function is used. For example, if the weighted sum exceeds the threshold (typically 0), the neuron is activated and produces 1 as an output. If it is below the threshold, the neuron is not activated and produces -1. The Feed Forward Neural Network (FFNN) is simple to maintain and is equipped to deal with data that contains a lot of noise.

The advantages and disadvantages of this model are as follows: a) it is easier to build and manage because it is less complicated, b) one-way propagation is quick and efficient, c) it copes well with noisy data, d) due to the lack of dense layers and back propagation, it cannot be used for DL. FFNN applications include:

• Face recognition (Simple straight forward image processing)

• Computer vision (for difficult-to-classify target classes)

• Speech Recognition

The Multilayer Perceptron is a type of FFNN and serves as an introduction to more sophisticated neural nets. As Figure 2.8 presents, input data is transmitted through multiple layers of artificial neurons. It is a fully connected neural network, since every node is connected to all neurons in the following layer. It has three or more layers, including the input and output layers, with multiple hidden layers. It possesses bi-directional propagation, which means it can propagate both forward and backward. Inputs are multiplied by weights and sent to the activation function, and the weights are changed in backpropagation to minimise the loss. Weights are, to put it simply, the values that the ANN learns. They self-adjust depending on the difference between the predicted outputs and the training targets. Softmax is used as the output layer activation function alongside nonlinear activation functions.

Figure 2.7: The simplest form of Feed Forward Neural Network, where input data travels in one direction to output nodes.

Figure 2.8: FFNN Multi-Layer Perceptron.

The advantages and disadvantages of this model are as follows: a) due to the presence of dense, fully connected layers and back propagation, it can be used for DL, b) design and maintenance are both somewhat difficult, c) it is comparatively slow (depending on the number of hidden layers). The Multi-Layer Perceptron is applicable in the areas of speech recognition, machine translation, and complex classification.
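A forward pass through a small fully connected network with a softmax output layer can be sketched as follows. The layer sizes, the tanh hidden activation and all weight values are assumptions chosen for illustration; in practice the weights would be learned by backpropagation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Fully connected layer: every output node is connected to every input."""
    return [sum(x * w for x, w in zip(inputs, row)) + b for row, b in zip(weights, biases)]


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out) -> List[float]:
    hidden = [math.tanh(z) for z in dense(x, w_hidden, b_hidden)]  # nonlinear hidden layer
    return softmax(dense(hidden, w_out, b_out))                    # softmax output layer


# Hypothetical network: 2 inputs, 3 hidden nodes, 2 output classes.
w_hidden = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]]
b_out = [0.0, 0.0]

probabilities = mlp_forward([1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
```

The output can be read as one probability per class, which is why softmax is a common choice for the output layer of a classifier.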

Convolutional Neural Network. Instead of a two-dimensional array, a CNN has a three-dimensional layout of neurons. The first layer is a convolutional layer. Each neuron in a convolutional layer only analyses data from a small part of the image field. Like a filter, input features are taken in batch-wise. The network decodes images in chunks and can perform these operations numerous times to complete the entire image processing. During processing, the image is converted from RGB to grey scale. Variations in pixel value then aid in the detection of edges, allowing images to be categorised into several categories. Propagation is unidirectional where the CNN contains one or more convolutional layers followed by pooling, and bidirectional where the output of the convolution layer goes to a fully connected neural network for classification. In the Multilayer Perceptron (MLP), the inputs are weighted and supplied to the activation function. The CNN applies convolution, while the MLP employs a nonlinear activation function followed by softmax. In picture and video recognition, semantic parsing, and paraphrase detection, CNNs produce excellent results.

The advantages and disadvantages of this model are as follows: a) it is used for deep learning with a reduced number of parameters compared with fully connected ANNs, b) when compared to a fully connected layer, there are fewer parameters to learn, c) it is comparatively complex to design and maintain, d) it is comparatively slow (depending on the number of hidden layers). The CNN is applicable in image processing, computer vision, speech recognition and machine translation.
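How a convolutional filter responds to a variation in pixel values can be illustrated on a tiny grey-scale image. The sketch assumes a "valid" convolution without padding or stride and, as is usual in deep learning libraries, does not flip the kernel; the image and the kernel are invented for illustration.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 4x4 grey-scale image: dark on the left, bright on the right.
image = [[0, 0, 1, 1] for _ in range(4)]

# Kernel that responds to a dark-to-bright transition from left to right.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, vertical_edge)
```

Each neuron of the convolutional layer looks at only a 2x2 patch, and the feature map is non-zero only in the column where the patch straddles the edge.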

Recurrent Neural Networks (RNN) are designed, as can be seen in Figure 2.9, to take the output of a layer and feed it back to the input in order to help predict the layer’s outcome. The first layer is usually a feed forward neural network, followed by a recurrent neural network layer, in which a memory function remembers some information from the previous time step. Forward propagation is used in this case, and the network stores information that will be needed in the future. If the prediction is incorrect, the learning rate is used to make minor adjustments, so that the back propagation learning algorithm gradually increases the probability of making the correct prediction.
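The feedback of the previous output into the next step can be sketched with a single recurrent unit. The scalar weights, the tanh activation and the input sequence are assumptions chosen for illustration.

```python
import math
from typing import List


def rnn_forward(inputs: List[float], w_x: float, w_h: float, b: float,
                h0: float = 0.0) -> List[float]:
    """Run a one-unit recurrent cell over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h is fed back in together with the current input x.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


sequence = [1.0, 0.0, 0.0]  # a single pulse followed by silence (hypothetical input)
with_memory = rnn_forward(sequence, w_x=1.0, w_h=0.5, b=0.0)
without_memory = rnn_forward(sequence, w_x=1.0, w_h=0.0, b=0.0)
```

With a non-zero recurrent weight the pulse at the first time step still influences the later states, whereas with a recurrent weight of zero the cell forgets it immediately.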

The advantages and disadvantages of this model are as follows: a) it models sequential data, in which each sample can be assumed to depend on previous ones, b) it can be used in conjunction with convolution layers to extend the effective pixel neighbourhood, c) it suffers from vanishing and exploding gradient problems, d) recurrent neural networks may be challenging to train, e) processing long sequences of data with the Rectified Linear Activation Function (ReLU) as an activation function is difficult. Text processing such as auto-suggestion, grammar checking, text-to-speech processing, image tagging, sentiment analysis and translation are examples of the applicability of the RNN.

Figure 2.9: Recurrent Neural Networks (diagram labels: I/P, Recurrent Cell, O/P)

Long Short-Term Memory (LSTM) networks are a sort of RNN that employs a combination of internal units. LSTM units include a ’memory cell’, which can store data for lengthy periods of time. A system of gates governs when information enters the memory, when it is output, and when it is forgotten. There are three types of gates: input gates, output gates, and forget gates. The input gate determines how much data from the previous sample will be stored in memory; the output gate controls the amount of data sent to the next layer; and the forget gate governs the rate at which stored memory is discarded. The LSTM structure is shown in Figure 2.10.
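The interaction of the three gates with the memory cell can be sketched for a single scalar LSTM unit using the standard LSTM update equations. The parameter names and values are hypothetical; the biases are chosen so that the forget gate is almost fully open and the input gate almost fully closed.

```python
import math
from typing import Dict, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              p: Dict[str, float]) -> Tuple[float, float]:
    """One time step of a scalar LSTM cell; returns (hidden state, cell state)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory content
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new content
    h = o * math.tanh(c)     # expose part of the memory to the next layer
    return h, c


params = {
    "wf": 0.5, "uf": 0.5, "bf": 20.0,   # forget gate close to 1: memory is retained
    "wi": 0.5, "ui": 0.5, "bi": -20.0,  # input gate close to 0: new input is ignored
    "wo": 0.5, "uo": 0.5, "bo": 0.0,
    "wg": 0.5, "ug": 0.5, "bg": 0.0,
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
```

With these settings the cell state stays practically unchanged, which illustrates how an LSTM can carry information across many time steps.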

A Modular Neural Network (MNN) is made up of several separate networks that work independently and execute different tasks. During the calculation process, the various networks do not really interact with or notify one another. They each compute on their own to achieve the desired result. As a result, by breaking a large and complex computational process down into discrete components, it can be completed much more quickly. Because the networks do not interact and are not even connected to each other, the calculation speed increases (see Figure 2.11).

The advantages and disadvantages of this model are as follows: a) efficiency, b) independent training, c) robustness, and d) moving-target problems (each unit within the network tries to evolve into a feature detector while its input keeps changing). Stock market prediction tools, character recognition using an adaptive Modular Neural Network (MNN), and high-level input data compression are examples of the application of this model in real environments.

Figure 2.10: LSTM structure (diagram labels: I/P, RNN, Memory, O/P)

Figure 2.11: Modular Neural Network (diagram labels: inputs X₁ to X₅; Model 1, Model 2, Model 3; Gating Network with gates g₁, g₂, g₃; summation ∑; input [X]; output [Y])
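One way in which the outputs of independent modules can be combined is through a gating network, as the labels of Figure 2.11 suggest. The sketch below assumes that the gate values are obtained with a softmax and that the final output is the gate-weighted sum of the module outputs; the three modules are placeholder functions standing in for trained networks.

```python
import math
from typing import Callable, List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modular_predict(x: float, modules: List[Callable[[float], float]],
                    gate_scores: List[float]) -> float:
    """Combine independently computed module outputs using gating weights."""
    gates = softmax(gate_scores)               # g_1, g_2, g_3 sum to 1
    outputs = [module(x) for module in modules]  # modules do not interact with each other
    return sum(g * y for g, y in zip(gates, outputs))


# Placeholder modules (hypothetical stand-ins for three trained networks).
modules = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x]

equal_mix = modular_predict(2.0, modules, gate_scores=[0.0, 0.0, 0.0])
```

Because each module is evaluated on its own, the modules could be trained and run in parallel, and only the gating step needs to see all of their outputs.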

