To develop an automated histology scoring system, predictive classification and regression models based on pre-trained CNN models including VGG [52], ResNet [55], and AlexNet [50], were developed using transfer learning. In these models, the histology images were input while their associated histology scores (Mankin and OARSI) were output. To achieve the best models, different scenarios were tested. These scenarios are
Figure 6.3: Windowing of the processed image. Highlighted sections indicate the
overlapping parts.
including or excluding the followings: windowing, bone-deletion, and the image sets with the abnormal low-level features. In each scenario, the CNN models were trained via a cross-validation scheme to ensure the prevention of overfitting. Moreover, various metrics including Mean Squared Error Percentage (MSEP), accuracy, and precision were utilized to assess the performance of the CNN models in these scenarios. Furthermore, the best- performing models were packaged for web deployment. The details of the employed dataset, cross-validation scheme, performance metrics, and web deployment are presented in the following subsections.
6.3.1 Dataset
Our dataset consisted of 3102 histopathological images. These images were gathered from knee joints of nine cadavers (nine groups of images). The images from cadavers 8 and 9 were recorded using a different camera setting, hence showing clearly different low-level image features (e.g., contrast and color).
These images were labeled by Mankin and OARSI scoring systems and all possible scores were observed in the dataset. Regression models were developed to predict a real number value in the range of these scoring systems (0-14 for Mankin, 0-6 for OARSI).
Figure 6.4 shows the distribution of the Mankin scores in the dataset. The classification model was developed for the assessment of tissue integrity into three different classes based on specific thresholds applied to the Mankin score (mild: scores<4, moderate:
4<scores<7, advanced: scores>7). To investigate the effect of preprocessing and camera settings, several scenarios were generated:
● Scenario 1: Only images from the first seven cadavers that showed similar camera features were included (groups 1-7), windowing, rotation, and bone-deletion, were applied on all images. Training data consisted of 80 percent of the images.
Training/validation split was performed randomly (plus shuffling) from all seven groups (optimal validation size was achieved by cross-validation).
● Scenario 2: All nine groups of images were included, windowing, rotation, and bone-deletion were applied on all images. The validation set was made from groups 5 and 8.
● Scenario 3: All nine groups of images were included; windowing was applied on all images. The validation set was made from groups 5 and 8.
● Scenario 4: All nine groups of images were included, rotation and bone-deletion were applied on all images. The validation set was made from groups 5 and 8.
● Scenario 5: All nine groups of images were included, none of windowing, rotation, or bone-deletion was applied. The validation set was made from groups 5 and 8.
6.3.2 Learning procedure
The models used in this study include VGG16, VGG19, ResNet18, ResNet50, and AlexNet. Due to the nature of parameter-base transfer learning, in a series of attempts different layers of the pre-trained models were “frozen” (not to be affected by the new training procedure). Different epoch values for each model were tested and chosen properly to avoid overfitting. The test was conducted with up to 24 epochs. Overfitting could be observed for almost all models at epochs greater than 5. A constant size of 224x224 pixels was used as the input for all images, and different values were tested for each model's batch. The Fastai library has a built-in function for the selection of the optimal batch size based on an initial evaluation. These values did not always show the best performance in several trials for different models; hence, the best batch size value was selected after numerous trials. After this initial learning that adjusts the weights in the unfrozen layers, fine-tuning was performed in a series of 1, 2, and 3 repetitions. We adjusted the weights of the new CNN, built on the pre-trained model, using the unfreeze function and the fit cycle. The purpose of this action is to hyper-tune all the layers of the CNN instead of keeping some of the earlier layers fixed (due to overfitting) and only fine- tune some higher layers of the CNN. This is motivated by the observation that the first layers of a CNN include more generic features which are helpful to various machine vision applications, whereas the last layers of the CNN tend to detect more specific features in the dataset.
Figure 6.4: Distribution of scores in the dataset.
Two approaches were taken for the classification problem:
a) Multi-class (3-class classification), and b) Nested binary (2 + 2) classification.
The schematic of these approaches was depicted in Figure 6.5. In the 3-class classification approach, images were labeled as one of the three classes of "mild",
"moderate", and "advanced". While in the second approach, the histology images were first classified as “healthy” and “unhealthy”, then at the second stage the unhealthy images were classified as “moderate” and “advanced”.
The set of images consisted of two distinct groups with different low-level image features.
Based on this difference, two types of tests were conducted, one using only the group with similar features that made 7 out of 9 groups of cadaver images, and the other with all image sets (including the two abnormal groups). In the scenario where only normal groups were evaluated, training/validation splits were followed with 20% for validation and 80% for training, a value that was achieved from fixed-sized standard cross-validation (this was done to include an acceptable distribution of scores within the training set). All images were separated based on their scores; hence, this splitting would result in a stratified division based on their scores. Where the abnormal groups were included (scenarios 2 to 5), the split was conducted differently. The validation and training data each had one set belonging to the abnormal groups, with other images from the normal groups. Also, to avoid biased models, training and validation sets were built upon different cadaver images.
After generating the dropout layers, all the classification models feedworad their output to the softmax layer. The softmax layer at the end uses the softmax function, also known as softargmax, or normalized exponential function, which is a generalization of the logistic
Figure 6.5: Classification methods.
function to multiple dimensions. The output layer of the models includes a softmax activation function to map the output of the network to a probability distribution over the target classes. The softmax function receives a vector of real numbers as input and normalizes it into a probability distribution consisting of K (the number of classes) probabilities proportional to the exponentials of the input numbers. Before the softmax layer is applied, some vector components might be negative or greater than 1, but after applying softmax, the components will be normalized onto the interval of [0, 1] with their sum equal to 1, thus they can be interpreted as probabilities. The larger input components will correspond to larger probabilities. Before this function is applied, the vector of the numbers is not limited to any class and is distributed within a certain linear range, at this point (before the softmax layer), this vector can be extracted to be regarded as a regression output.