Academic year: 2021

Conditional Random Fields Improve the CNN-based Prostate Cancer Classification Performance

Paulo Alberto Fernandes Lapa

Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science in Advanced Analytics


Supervisor: Professor Mauro Castelli
Co-supervisor: Doctor Leonardo Rundo


Acknowledgements

I would like to thank everyone who has helped me, directly and indirectly, during my academic path and particularly with this thesis.

This work would not have been possible without Mauro Castelli and Leonardo Rundo’s guidance. Thank you for your time, infinite patience and knowledge.

A very special note to Maria, for teaching me that taking life too seriously is a function with diminishing marginal returns (and what diminishing marginal returns are in the first place!), and to Jorge, who, among various life skills, has taught me how to give a proper handshake. Thanks to both of you for investing in me - who would have thought that one could become close friends with one's professors?

A freaking huge thanks to Jan for teaching me everything I know about data science, for making me explore, and for always pushing me to believe in myself: more than a friend, a brother. Danke for all the brilliant ideas and projects that came out of our heads over caffeine-fueled discussions in the pond.

Obrigado to everyone in the invited researchers' room, namely Illya and Carina, for all the lunch breaks and the nights in Lux (that I managed to evade).

This work, and my happiness, would not have been possible without your patience and love, Daniela. Thank you for putting up with me (and me) more than I could ever have asked. Of all the reasons I have to love you, your smile is the biggest.

A man is nothing without his family, and this one is no exception. Ana, thank you for being the light in my life and my moral compass; there is no kindness that can even compare to yours. Vasco, I may have been the fastest spermatozoid, but you are most definitely the nicest, funniest and coolest half of us. Mãe, Pai, thank you for believing in me and being my safety net throughout all these years; one cannot express in words the importance of having two people who love and care without doubts or restraints. There is no love like a mother's and no pride like a father's. To you I am most grateful, and words cannot describe it.

Xica, Catarina, Paulo and Dona Paula, my second family: I may be happy now, but all I know about living and enjoying the small things in life, you have taught me.


Life is a neural network with many hidden neurons
Marisa Fernandes, 2018


Abstract

Prostate cancer is a condition with life-threatening implications but with no clear causes yet identified.

Several diagnostic procedures can be used, ranging from human-dependent and very invasive to state-of-the-art non-invasive medical imaging. With recent academic and industry focus on the deep learning field, novel research has been performed on how to improve prostate cancer diagnosis using Convolutional Neural Networks to interpret Magnetic Resonance images.

Conditional Random Fields have achieved outstanding results in the image segmentation task, by promoting homogeneous classification at the pixel level. A new implementation, CRF-RNN, defines Conditional Random Fields by means of convolutional layers, allowing the end-to-end training of the feature extractor and classifier models.

This work tries to repurpose CRFs for the image classification task, a more traditional sub-field of image analysis, in a way that, to the best of the author's knowledge, has not been implemented before.

To achieve this, a purpose-built architecture was refitted, adding a CRF layer as a feature extractor step.

To serve as the implementation's benchmark, a multi-parametric Magnetic Resonance Imaging dataset was used, initially provided for the PROSTATEx Challenge 2017 and collected by Radboud University.

The results are very promising, showing an increase in the network’s classification quality.

Keywords: Prostate Cancer; Convolutional Neural Networks; Conditional Random Fields


Resumo

Prostate cancer is a condition that can be life-threatening, but whose causes have not yet been correctly identified.

Several diagnostic methods can be used, from very invasive and operator-dependent procedures to state-of-the-art non-invasive methods based on medical imaging. With the growing interest of universities and industry in the deep learning field, research has been carried out with the purpose of improving prostate cancer diagnosis through Convolutional Neural Networks (CNN) that interpret Magnetic Resonance images.

Conditional Random Fields (CRF) have achieved very promising results in the image segmentation field, by promoting homogeneous classifications at the pixel level. A new implementation, CRF-RNN, redefines CRFs through CNN layers, thus allowing the joint training of the network that extracts the features and the model that performs the classification.

This work tries to leverage CRFs for the image classification task, a more traditional field, in an approach that, to the best of the author's knowledge, has never been implemented before.

To achieve this, a new architecture was defined, using a CRF-RNN layer as a feature extractor.

As a benchmark, a multi-parametric Magnetic Resonance imaging dataset was used, collected by Radboud University and initially used for the PROSTATEx Challenge 2017.

The results are very promising, showing an improvement in the classification capability of the neural network.

Keywords: Prostate Cancer; Convolutional Neural Networks; Conditional Random Fields


Contents

List of Figures

List of Tables

Acronyms

1 Introduction
1.1 Prostate cancer
1.1.1 Causes
1.1.2 Diagnosis and treatment
1.1.3 Biopsy Gleason Score
1.2 Computer assisted diagnosis

2 Introduction to deep learning for medical imaging
2.1 Prostate Magnetic Resonance Imaging
2.1.1 Multiparametric MRI
2.2 Convolutional Neural Networks
2.2.1 Training
2.2.2 Layers
2.2.3 Backpropagation
2.2.4 Optimizers
2.3 Conditional Random Fields
2.3.1 Mean field approximation
2.3.2 Conditional Random Fields with Convolutional Neural Networks
2.3.3 Conditional Random Fields as Recurrent Neural Networks
2.3.4 General overview of CRF-RNN
2.4 Semantic Learning Machine

3 Methods
3.1 PROSTATEx Challenge 2017 data
3.1.1 Descriptive analysis
3.1.2 Data Processing
3.2 Architectures
3.2.1 Convolutional Architectures
3.2.2 CRF Architectures

4 Results
4.1 Experimental setup
4.1.1 Random search
4.1.2 Metrics
4.1.3 Code implementation
4.2 Discussion
4.3 Semantic Learning Machine Results

5 Conclusions


List of Figures

2.1 T2-Weighted Slice
2.2 Apparent Diffusion Coefficient Slice
2.3 Proton Density Slice
2.4 K-trans Slice
2.5 Training loop
2.6 Sigmoid activation function
2.7 ReLU activation function
2.8 Relationship between input image and label on a CRF model
2.9 Common CRF graph structures
2.10 CNN features as inputs to CRFs
2.11 CRF with CNN as feature extractor
3.1 Number of exams per patient
3.2 Number of lesions per patient
3.3 Centre of mass registration
3.4 Affine registration
3.5 Rigid body registration
3.6 Example of the concatenation of three MRI images used as inputs
3.7 AlexNet architecture
3.8 VGG16 architecture
3.9 XmasNet architecture
3.10 ResNet architecture
3.11 CRFXmasNet architecture
4.1 Binary cross entropy test set results
4.2 AUROC validation test set results
4.3 Workflow of the proposed neuroevolution approach based on the SLM


List of Tables

3.1 Information available for a lesion
3.2 Information available for an image
3.3 Description of exams per patient
3.4 Lesion distribution per patient
3.5 Gaussian pyramid parameters used for affine registration
3.6 AlexNet parameters
3.7 VGG16 - parameters of the convolutional layers
3.8 VGG16 - parameters of the fully connected layers
3.9 XmasNet parameters
3.10 ResNet50 parameters
3.11 CRFXmasNet parameters
4.1 Optimizer / hyperparameter compatibility
4.2 Binary cross entropy example values
4.3 Best configuration of each architecture


Acronyms

ADC Apparent Diffusion Coefficient image.
AUROC Area under the ROC curve.
BCE Binary Cross Entropy.
CNN Convolutional Neural Network.
CoM Centre of Mass registration.
CRF Conditional Random Field.
DCE Dynamic Contrast Enhanced imaging.
DW Diffusion Weighted imaging.
EDC Endorectal Coil.
GS Gleason Score.
ILSVRC ImageNet Large Scale Visual Recognition Challenge.
K-TRANS Transfer constant.
MI Mutual Information Criterion.
PCa Prostate cancer.
PD Proton Density image.
PSA Prostate-specific antigen.
RB Rigid Body registration.
ReLU Rectified Linear Unit.
ROC Receiver Operating Characteristic.
SGD Stochastic Gradient Descent.
T2w T2-weighted image.


Chapter 1

Introduction

1.1 Prostate cancer

The prostate is a gland present in the pelvic region of men, located between the penis and the bladder and typically the size of a walnut. Its main function is to produce the liquid that forms semen. Prostate cancer (PCa) is characterized by the abnormal growth of cancerous cells in that gland. According to the World Cancer Research Fund and the American Cancer Society, PCa is the second most common form of cancer in men and the fourth overall. In 2018 there were 1.3 million new cases, and in the US 30,000 men die every year of related causes [8].

PCa usually develops slowly and without the presence of major symptoms. Because of the typically slow onset, not every diagnosed patient will develop a clinically significant condition that warrants active treatment [36, 51].

1.1.1 Causes

It is still not understood what factors cause PCa, but some have been identified as possible causes [45]:

1. Age: it is well understood that men over 50 years old are at a high risk of developing PCa;

2. Unhealthy habits, such as smoking and alcohol consumption, show a strong relationship: not only has a link between smoking and PCa incidence been identified, but also between heavier smoking habits and PCa mortality;

3. Unhealthy eating habits, like a lack of consumption of fresh fruits and vegetables. A study showed a strong association between consuming tomato-rich products and lower PCa incidence;


4. Geography: PCa is more common in developed regions (i.e., North America, Australia, New Zealand, Western and Northern Europe), but the highest mortality rates are found in low- and middle-income regions (sub-tropical Africa, South America and the Caribbean); the PCa mortality rate in Asia, Africa and Central America is lower than in other parts of the world. This may be related to the higher life expectancy of men in developed countries and more advanced diagnostic techniques, leading to more diagnoses;

5. Family history and genetics: the probability of a PCa diagnosis in a family member of a diagnosed PCa patient is estimated at around 20%. The reasons may be related to shared genes, lifestyles and environmental conditions. At the genetic level, several genes and chromosomal regions have been found to be associated with PCa;

6. Ethnicity: African-American and Caribbean men have the highest incidence rates, and the PCa mortality rate among African-American men is double that of white men [45];

7. Occupation: PCa risk is lower among forestry workers, police officers, office workers and other white-collar occupations when compared to others. The risk of PCa is higher in farmers, but this is generally associated with exposure to pesticides [45].

1.1.2 Diagnosis and treatment

PCa symptoms are related to an increase in prostate size, affecting the urethra. This leads to an increased need to urinate, pain when doing so, or the feeling that the bladder was not fully emptied.

It is important to note that no single exam can uniquely diagnose PCa, and each has advantages and drawbacks. The most common diagnostic methods are [49]:

1. Digital Rectal Exam (DRE): a physical exam where the doctor's finger is inserted into the patient's body to feel the prostate and the surrounding tissue. With this it is possible to detect any particular bumps or textures that may indicate the presence of PCa;

2. Prostate-specific antigen (PSA) blood test: this exam measures PSA levels, which tend to be higher in the presence of PCa. The exam is unreliable: high PSA levels can be associated with other conditions and, in some cases, PCa itself is not associated with high PSA values;

3. Transrectal ultrasound (TRUS): a small probe that emits sound waves, creating echoes, is inserted into the patient's body. This data is then transformed into a computer image. TRUS can be used as a second exam after the DRE or PSA exams give abnormal results. It can also be used to guide the needle during the biopsy procedure;

4. Biopsy: a spring-loaded instrument with a needle is used to extract a sample of the prostate tissue. The sample is then sent to a lab and evaluated. While a biopsy provides a quantitative result (the Gleason Score), it has some drawbacks, namely discomfort to the patient and false-negative diagnoses, because the probe can miss the cancerous cells;

5. Magnetic Resonance Imaging scan: using an MRI, the doctor can visually evaluate the prostate and, if any suspicion arises, can recommend a biopsy to be performed. This is normally a non-invasive method.

Currently, medical consensus recognizes the potential of using MRI as a means to guide the biopsy to larger and probably more significant tumours [56].

Depending on the stage of the cancer, several treatments can be proposed: watchful waiting (delaying treatment and waiting to see if any symptoms develop), active surveillance (regular exams to ensure any PCa progression is found early), radical prostatectomy (i.e., removal of the prostate) and radio- or hormone-therapy.

1.1.3 Biopsy Gleason Score

The Gleason Score (GS) analyses the tissue extracted from a biopsy, based on its appearance and on how much it looks like healthy tissue. More abnormal-looking cancers, which are more likely to grow and spread, are given a higher grade [49].

The cancer is graded on a scale from 1 (normal tissue) to 5 (very abnormal). Almost all cancers are graded 3 or higher [49].

The GS grades the two areas that make up most of the cancer; each area is given a grade and their sum yields the GS. The first number is the most common grade in the tumor tissue. For example, if a GS is given as 3+4=7, most of the tumor is of grade 3 and less of it is of grade 4, adding up to a GS of 7 [49].

1.2 Computer assisted diagnosis

Several methods have been presented that propose the usage of medical images and Machine Learning for the task of correctly detecting and staging cancer, in a process called Computer Assisted Diagnosis (CAD).

The usage of CAD can range from data preprocessing tasks (i.e., registration, Region of Interest selection, feature extraction and selection) [27] to several cancer-related applications that benefit from Deep Learning (DL) [30].

Models like linear regression, ensemble learning classifiers, Gaussian processes or support vector machines have been used, with varying degrees of success.


A particularly popular set of models are the Deep Convolutional Neural Networks (CNNs), which have found success in the medical image analysis (MIA) field in various tasks, ranging from unsupervised learning problems (problems that do not have a target variable to measure the model quality) to supervised learning problems [30].

In the field of supervised MIA, three main challenges can be identified: image classification, detection, and segmentation. All three have a series of exams or images as input, but the desired outputs differ [30]. Image classification problems (such as this work) have a single variable as output (e.g., cancer present or not). Image detection models define boundaries around objects of interest (i.e., organs, regions or lesions) [46]. Lastly, the segmentation problem's goal is to identify the voxels that make up the object of interest (i.e., the boundaries or the interior).

Various anatomical applications have been found [30]: in the head region, DL with MRIs has been used for brain MIA (e.g., disorder classification, lesion/tumor segmentation/classification or survival prediction) or eye MIA (e.g., blood vessel segmentation, glaucoma detection). In the torso region, DL has been used for cardiac MIA (e.g., ventricle slice detection, heart structure detection or coronary calcium detection), liver lesion segmentation or kidney localization. Lastly, DL has been used for musculoskeletal MIA (e.g., knee cartilage segmentation, vertebrae localization or even hand age estimation).

In the anatomical region of the prostate, CNNs have been employed in CAD tasks like PCa segmentation [17], [54], [53] and classification [32], [57], [2].

With regard to Conditional Random Fields, they have been used for segmentation tasks in PCa [5], [39], [19] and for brain cancer segmentation [58] as well.

In all these applications, some challenges are always present [30], [46], namely the lack of large training data sets, the absence of reliable ground truth data and the difficulty of training large models. Nonetheless, some factors can always be considered important in the success of DL models [30]: expert knowledge, novel data preprocessing or augmentation techniques, and the application of task-specific architectures.

The goal of this work is to merge the classification abilities of CNNs and the local segmentation provided by CRFs to develop a novel way of diagnosing PCa that, to the best of the author's knowledge, has not been proposed before.

This work is organized as follows: in chapter 2, a brief introduction to Magnetic Resonance Imaging, Convolutional Neural Networks and Conditional Random Fields is given; in chapter 3, section 3.1 introduces the dataset and the treatments performed before using it, and section 3.2 presents the four off-the-shelf CNN architectures used and the one created for this work. Finally, chapter 4 discusses the training methodology and the results obtained.


Chapter 2

Introduction to deep learning for medical imaging

To better understand the research work that has been carried out, some clinical background and theoretical knowledge of its various parts is recommended.

This chapter aims to provide a short introduction to them, organizing the contents as follows: section 2.1 is devoted to Magnetic Resonance Imaging (MRI) data acquisition and preparation, with particular interest in prostate cancer diagnosis. Section 2.2 introduces Convolutional Neural Networks (CNNs) and their strengths in computer vision, while section 2.3 presents Conditional Random Fields (CRFs). Lastly, section 2.4 gives an intuition on the workings of the Semantic Learning Machine [20], a neuroevolutionary algorithm.

2.1 Prostate Magnetic Resonance Imaging

MRI is an acquisition modality that allows for studying both human body anatomy and physiology, thus providing insights into the diagnosis of different diseases and conditions. It not only provides high-resolution images, especially for analyzing the structure of soft tissues, but also information at the molecular level, without requiring an invasive procedure [40].

As a matter of fact, the human body is composed of different tissues containing mainly water molecules, which contain protons. When protons are excited (through a pulse caused by the MRI scanner), they emit a radio-frequency signal that is received by a coil.

From the moment the pulse is produced, two time sequences can be identified: how long the protons take to receive the pulse and go to an excited state (i.e., T1 or longitudinal relaxation time) and how long they take to return from their excited state to their initial state (i.e., T2 or transverse relaxation time) [38], [44]. These times are measured and serve as the source of contrast in the MR image, the premise being that different tissue types have different relaxation times [38]. This allows certain tissue properties to be enhanced by careful parameter tuning.

Sometimes, a contrast agent (typically based on Gadolinium) can be administered to the patient for higher image contrast, also allowing for dynamically evaluating the vascularity of the tumor microenvironment by means of Dynamic Contrast Enhanced (DCE) imaging. Moreover, depending on the magnetic field strength, an endorectal coil is generally used to increase the Signal-to-Noise Ratio (SNR), especially in the case of 1.5T MRI scanners. When 3.0T MRI scanners are exploited, the acquisition can be performed via a pelvic coil, guaranteeing a good SNR.

The existing methods for PCa diagnosis have been characterized by overdiagnosing low-risk lesions and underdiagnosing high-risk cancers [56]. Usually, random biopsies are performed, but this comes with serious disadvantages, namely: a likely increase in complications due to the over-sampling of healthy tissue; tumors outside the biopsy site can easily be missed; and it may be difficult to determine the site of a previous biopsy when repeating the exam [9], which might cause hemorrhages.

2.1.1 Multiparametric MRI

Multiparametric MRI (mpMRI) of the prostate comprehensively depicts the prostate, allowing for better tumor detection, and has "recently emerged as the most promising imaging modality for this application" [56], when compared to biopsy or traditional Prostate-Specific Antigen (PSA) assessment.

With this in mind, there has been growing recognition of mpMRI as a means to guide the biopsy to larger and probably more significant tumors [56], essentially creating a synergy between two very different diagnostic methods.

An mpMRI can be obtained by capturing multiple carefully tuned MRI sequences. An MRI sequence is defined by a particular set of parameters that change the types of tissues or features that are emphasized during the acquisition process. An mpMRI consists of anatomical sequences (i.e., T1-Weighted (T1W), T2-Weighted (T2W) and Proton Density (PD)) and functional sequences, such as Diffusion Weighted Imaging (DWI) and DCE.

To improve PCa diagnosis, several modalities, which often convey complementary information, should be used in combination. Clinical consensus holds that T2W imaging should be used together with at least two functional modalities [29], because this can improve cancer detection, location and staging, and then be used to help define personalized therapies [9].


2.1.1.1 MRI Sequences

2.1.1.2 T2W

T2W sequences measure the time taken by the excited protons to return to their normal state. They are particularly well-suited for cancer detection, characterization and localization [56].

In the prostate area, T2W imaging is well suited to depict the anatomy because it returns high signal intensity in the peripheral zone, which is formed of muscle and glandular tissue, when compared to the central and transitional zones [9].

T2W sequences are useful for PCa-related applications because both PCa and the prostate have low signal intensities in the central and transitional zones. In the peripheral zone, where high intensity values are expected, low values might be a clue not only of cancer cells, but also of other conditions, such as biopsy-related hemorrhages, fibrosis or lesions caused by other therapies [9].

Figure 2.1: T2W slice extracted from patient #29

2.1.1.3 Diffusion-weighted Imaging (DWI)

DWI measures the water diffusion characteristics of tissue cells. A quantitative map can be achieved by means of the Apparent Diffusion Coefficient (ADC) [9].

Compared to normal tissue, PCa typically has tightly packed cells, dense surrounding regions and intra- and inter-cellular membranes that reduce water motion. PCa cells typically have lower diffusion values than healthy cells in ADC images. Furthermore, a relationship has been found between lower diffusion values and higher PCa aggressiveness.


The combination of DWI and T2W sequences has been shown to significantly improve the diagnostic quality of PCa [56].

Taking this into consideration, caution is still necessary: although ADC values are a good indicator, individual variability can strongly impact the accuracy of ADC in PCa diagnosis [9].

Figure 2.2: ADC slice extracted from patient #29

2.1.1.4 Proton Density (PD)

As the name suggests, PD reflects the presence of protons in the body tissue, with higher-density regions appearing brighter. PD provides good distinction between fat, fluid and cartilage [35]. More specifically, PD images are formed as a mix between T1 and T2W images, by having long TR times and short TE times [35].

2.1.1.5 Dynamic Contrast Enhanced (DCE)

DCE employs an external agent to improve image quality. In DCE, a Gadolinium-based contrast agent is injected into the patient's blood flow. This agent then travels through the patient's vascular system, further characterizing it. Typically, the blood vessel structure of a tumor is very different from that of healthy tissue. Indeed, tumors have an increased number of blood vessels, higher permeability and a higher amount of interstitial tissue. These conditions make the patterns of cancer tissue different from those of healthy tissue.

Figure 2.3: PD slice extracted from patient #29

The contrast values can be decomposed into several factors: regional blood flow, the size and number of blood vessels, and their permeability. It is not possible to separate these components individually, but their combined effect can be modeled using a transfer constant, Ktrans. This Ktrans is an index that characterizes the presence of gadolinium in the vascular endothelium (the membrane that covers the interior of blood vessels) [55].
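The text leaves the underlying model implicit. In the DCE literature, this combined effect is commonly formalized by the standard Tofts model; the equation below is included as general background (an assumption, not the thesis's own formulation):

```latex
C_t(t) = K^{trans} \int_0^t C_p(\tau)\, e^{-k_{ep}(t - \tau)}\, \mathrm{d}\tau,
\qquad k_{ep} = \frac{K^{trans}}{v_e}
```

where $C_t$ is the tissue contrast-agent concentration, $C_p$ the plasma concentration and $v_e$ the extravascular extracellular volume fraction, so that $K^{trans}$ acts as the single rate constant lumping together flow, vessel surface area and permeability.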

From a health-economic point of view, DCE images are more expensive to collect and cause more discomfort to the patient, since agents like Gadolinium need to be injected, as well as raising a safety risk, because there is evidence of possible depositions in the body [21], such as in the brain [15].

2.2 Convolutional Neural Networks

Neural networks (NNs) have become one of the most common supervised learning techniques. They are able to learn complex patterns from unstructured data (i.e., text or images) with little domain knowledge needed. NNs are arranged in a hierarchical fashion, that is, in layers. Each layer is capable of extracting simple features from its inputs, which are then refined by the next layers.

The element responsible for feature extraction is the neuron (or hidden unit). Each layer has a variable number of neurons. The neurons take a varying number of inputs (from the previous layer) and perform a dot product using weights that are improved over the training procedure.

Convolutional Neural Networks (CNNs) go a step further. By using the convolution operation, they can perform their task on two- or higher-dimensional inputs. They can consider not only the input pixel but also its neighboring region, making them well-suited for image applications.
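This neighborhood-based dot product can be illustrated with a minimal NumPy sketch (a toy example with an invented edge-detector kernel, not the thesis's implementation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (stride 1, no padding): slide the kernel
    over the image and take a dot product with each local neighborhood."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # neighboring region of the pixel
            out[i, j] = np.sum(patch * kernel)  # dot product with the weights
    return out

# A tiny synthetic image with a vertical edge, and a kernel sensitive to it:
image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
kernel = np.array([[1., -1.],
                   [1., -1.]])
print(conv2d(image, kernel))  # nonzero only at the 0/1 edge column
```

Deep learning frameworks implement the same operation over batches and channels, with the kernel weights learned during training rather than hand-set.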


Figure 2.4: K-trans slice extracted from patient #29

2.2.1 Training

After defining the network architecture details (more details in sections 2.2.2 and 3.2.1), it must be trained. Training is the procedure that creates the model, making it learn the features in the data. It is an incremental procedure, where improvements accumulate over time.

For the training, two components need to be defined: a loss function and an optimizer (more details in sections 2.2.4 and 4.1.2, respectively).

The loss function measures how well the model is able to predict the data when compared to the ground truth.

An optimizer adjusts the parameters of the network, taking into account the feedback it receives from the loss function. This adjustment is done by means of backpropagation.

Initially, the parameters of the network (the weights) are randomly chosen. This means that in the early epochs the network is just applying random transformations without any predictive quality, which is why loss values are typically so high at that stage.

A network is typically trained in the following fashion:

1. Randomly initialize the network weights;

2. For a pre-defined number of epochs, or until a convergence criterion is met:

a) Present a batch of data points to the network and generate a prediction;


Figure 2.5: Training loop

b) Calculate the loss score of each pair {prediction, true value} ({y′, y}); this measures how good the prediction is compared to the actual value;

c) Calculate the overall loss score based on the individual loss values (e.g., their sum or mean);

d) Present the loss score to the optimizer;

e) The optimizer then performs weight updates on the network's output layers and propagates them to the network's hidden layers;

f) Go to step a).

This is the training loop; when enough iterations are performed over a dataset, it should return a network trained as close to optimal as possible. The way these weights are updated is shown in more detail in sections 2.2.3 and 2.2.4.
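The loop above can be sketched end-to-end with a single sigmoid neuron standing in for the network, binary cross entropy as the loss, and plain gradient descent as the optimizer (a toy NumPy illustration on invented data, not the thesis's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy data: the label is 1 when the feature sum is positive.
X = rng.normal(size=(200, 4))
y = (X.sum(axis=1) > 0).astype(float)

# 1. Randomly initialize the network weights.
w = rng.normal(scale=0.1, size=4)
b = 0.0
lr = 0.5

for epoch in range(200):                    # 2. pre-defined number of epochs
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # a) predictions for the batch
    eps = 1e-12                             # guard against log(0)
    losses = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))  # b) per-pair loss
    loss = losses.mean()                    # c) aggregate loss score
    grad = p - y                            # d) feedback handed to the optimizer
    w -= lr * (X.T @ grad) / len(y)         # e) weight update (gradient descent)
    b -= lr * grad.mean()                   # f) loop back to a)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = ((p > 0.5) == y).mean()
print(f"final loss={loss:.3f}, accuracy={accuracy:.2f}")
```

In a real CNN the forward pass involves many layers and the gradients are propagated backwards through all of them, but the loop structure is the same.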

2.2.2 Layers

Layers are the building blocks of a CNN: from the input of the network until a probability is returned, layers and their neurons communicate through weights to extract features, explore relationships in the data and calculate the network's output. Careful layer configuration can promote faster convergence and lower training times, and has an impact on the quality of the predictions.

The interactions among layers have been extensively studied, giving rise to the development of many state-of-the-art CNN architectures. The most relevant architectures, like VGG16, AlexNet or ResNet, have been used in this work and are presented in more detail in section 3.2.1.

This section gives a quick introduction to the most common layer types used in this work, to better understand their roles and mechanisms.


Please note that little mathematical notation is presented. For a more technical explanation, the reader is directed to the bibliography, or to the books "Deep Learning with Python" by Chollet [11] and "Deep Learning" by Goodfellow et al. [14].

2.2.2.1 Convolutional

A convolution (conv) is an operation that allows the learning of local patterns, defined over a small window (the kernel). These learnable patterns can be edges, textures, lines, etc., and have two main characteristics:

• They are translation invariant, meaning that a convnet can recognize a pattern in any location of the image, regardless of where it was originally learned. E.g., it is able to recognize a lesion in any region of the image, even if during training all the lesions were in the same position.

• They can learn spatial hierarchies: initial conv layers will learn small local patterns that will provide the next layers with more complex patterns, and so on. This allows convnets to learn complex relationships in the data.

In image analysis, convolutions operate over three dimensions: height, width and channels - which in this work match the three mMRI sequences.

A convolution extracts patches from the input and computes dot products between them and its weights. This produces a feature map: it represents the desired features - edges, textures, etc. The number of features the layer learns is defined by the depth of the output. The depth, width and height are hyperparameters of the network that need to be defined a priori.

With this information in mind, convolutional layers have two key parameters:

• Patch size, extracted from the input when considering local patterns, typically 3x3, 5x5 or in some cases 1x1, and

• Output depth, the number of channels computed by the layer.

Each conv layer will have z·z·c·d weights to learn, z being the patch size, c the number of input channels and d the output depth.
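The patch-and-dot-product mechanism can be sketched in plain numpy (not the Keras layer used in this work) for a single-channel image and a single kernel. The edge-detecting kernel below is an illustrative choice:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a (H, W) image with a (z, z) kernel."""
    z = kernel.shape[0]
    H, W = image.shape
    out = np.empty((H - z + 1, W - z + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + z, j:j + z]       # extract the local patch
            out[i, j] = np.sum(patch * kernel)    # dot product -> feature map value
    return out

# A 3x3 vertical-edge kernel responds where intensity changes left to right.
image = np.zeros((5, 5)); image[:, 2:] = 1.0      # dark left half, bright right half
edge_kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], float)
fmap = conv2d(image, edge_kernel)
print(fmap.shape)   # (3, 3): a feature map smaller than the 5x5 input
```

The feature map responds strongly where the vertical edge sits, which is the "local pattern" the text describes.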

2.2.2.2 Max Pooling

Max pooling is an operation that reduces the size of the feature maps extracted by the conv layers, and introduces spatial-filter hierarchies.

Max pooling extracts sliding windows from the input and outputs the maximum value present in the window. This effectively downsamples the input size by a factor, determined by the stride [10].


Max pooling has two key parameters:

• Stride, how many pixels the window shifts at a time, and

• Window size, the region to consider when applying the max function. Lower window size values keep more local information.
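The operation can be sketched directly in numpy (a toy stand-in for the pooling layer used in the actual architectures); window and stride values are the common 2/2 choice:

```python
import numpy as np

def max_pool(fmap, window=2, stride=2):
    """Slide a window over the feature map and keep only the maximum per window."""
    H, W = fmap.shape
    out_h = (H - window) // stride + 1
    out_w = (W - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = fmap[r:r + window, c:c + window].max()
    return out

fmap = np.arange(16.0).reshape(4, 4)
pooled = max_pool(fmap)
print(pooled)   # [[ 5.  7.] [13. 15.]]: each dimension halved by the factor 2
```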

2.2.2.3 Batch normalization

Typically, normalization is a preprocessing step done once, before feeding the dataset to the model. This operation converts the input variables to the same scale. Usually, it is performed for supervised learning algorithms that are sensitive to the scale of the inputs. This operation, therefore, promotes faster convergence and better classification performance.

Batch Normalization (BN) extends this concept beyond the input of the first layer to the hidden layers, by centering and scaling the output of the previous layer.

This regularization effect reduces the amount by which the weights change, thus preventing overfitting [18].
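The centering-and-scaling step can be illustrated per feature over a batch. This is a simplified sketch (it omits the running statistics a real BN layer keeps for inference); gamma and beta stand in for the learnable scale and shift:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Center and scale each feature over the batch, then apply gamma and beta."""
    mean = x.mean(axis=0)               # per-feature batch mean
    var = x.var(axis=0)                 # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two features on wildly different scales...
x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
out = batch_norm(x)
# ...both end up with ~zero mean and ~unit variance.
print(out.mean(axis=0), out.std(axis=0))
```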

2.2.2.4 Fully Connected (FC)

This layer is the building block of any Neural Network: it simply outputs the dot product between the inputs and its weights.

This layer can be used as a hidden layer to extract features, or as the last layer, to output the model's prediction. An activation function can be applied to its output, so that its scale changes or certain behaviors that ease training are enhanced.

2.2.2.5 Activation

The activation layer applies a function to the output of a previous layer, typically an FC or conv layer.

Activation functions are particularly important because they allow the modeling of non-linear relationships, improve the generalization ability, or simply put the output of the network in the range [0, 1] to represent a probability.

The activation functions used in this work’s architectures are now presented:

Sigmoid The sigmoid is a well-known function with an S-shape, in the range [0, 1], as illustrated in Figure 2.6. Because of this, it was used as the last layer in every architecture as a way to produce the probability of an image having PCa.

In the case of a multiclass classification problem, the softmax function, which is the generalization of the sigmoid function, should be used instead.

ReLU The Rectified Linear Unit is a function that takes the positive part of its argument: f (x) = max(0, x).



Figure 2.6: Sigmoid activation function.


Figure 2.7: ReLU activation function. Notice how it resembles a ramp.

Graphically it looks like a ramp, as shown in Figure 2.7, and it increases training speed significantly, thus allowing the creation of deeper architectures.

The ReLU function has very desirable properties compared to previously popular activation functions: it promotes better gradient propagation, easier computation and faster backpropagation calculations [24].
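Both activation functions used in this work are one-liners in numpy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # S-shaped, squashes any input into (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # ramp: zero for negatives, identity otherwise

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # values in (0, 1); sigmoid(0) = 0.5
print(relu(x))      # [0. 0. 2.]
```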


2.2.2.6 Dropout

A dropout layer is a very simple way to prevent overfitting and reduce training time in NNs. In each training epoch, a random subset of the connections between two hidden layers is dropped out of the network and ignored [50]. The fraction of weights that are ignored is a hyperparameter defined by the user.

This significantly reduces the risk of overfitting because it reduces the reliance of the network on certain neurons.
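The masking step can be sketched as below. This follows the common "inverted dropout" variant (survivors are rescaled so the expected layer output is unchanged), which is an implementation choice, not necessarily the exact variant of [50]:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = rng.random(activations.shape) >= rate   # random binary mask
    return activations * keep / (1.0 - rate)       # rescale so E[output] is unchanged

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, rate=0.5, rng=rng)
print((out == 0).mean())   # roughly 0.5 of the units were dropped this pass
```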

2.2.3 Backpropagation

While it is easy to calculate the error of the network's output (through the loss function), it is conceptually difficult to understand the impact of each neuron and layer on the classification process.

Backpropagation follows the inverse path of an input observation and distributes the error through the network's layers.

This is done by computing the gradient of the loss function (with chain rule derivatives) with regard to the network's weights and then slightly adjusting them in the correct direction.

2.2.4 Optimizers

It is the job of the optimizer to perform this adjustment. Each optimizer uses different techniques to do so. Some optimizers have mechanisms that prevent sudden jumps, others allow different parameters to have different updates, and others prefer certain local optima characteristics. This means that there is no overall best optimizer; different problems and datasets require different solutions.

2.2.4.1 Stochastic Gradient Descent (SGD)

The simplest optimizer is Stochastic Gradient Descent (SGD). Its default update formula is:

w = w − lr · ∇θJ(θ)

Overall it works fairly well, but falls short in complex gradient landscapes (e.g. saddle points). In these situations, the network gets stuck in a local optimum, and in later iterations it may not converge because the learning rate (lr) is too high.
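The basic update rule can be sketched on a 1-D quadratic loss J(w) = (w − 3)², whose gradient is known in closed form. The learning rate and iteration count are illustrative:

```python
# Basic SGD sketch: repeatedly step against the gradient, scaled by lr.
def grad(w):
    return 2.0 * (w - 3.0)     # dJ/dw for J(w) = (w - 3)^2

w, lr = 0.0, 0.1
for _ in range(100):
    w = w - lr * grad(w)       # the SGD update rule
print(round(w, 4))             # converges to the minimum at w = 3
```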

With this in mind, a more complex SGD implementation has been developed with particular features that solve most of these problems:

1. Decay rate: as the training progresses the learning rate (lr) slowly decreases by a set decay rate (d): lr = lr · d. This prevents the weights from jumping around the optima in the latter stages of the training.

(36)

C H A P T E R 2 . I N T R O D U C T I O N T O D E E P L E A R N I N G F O R M E D I C A L I M AG I N G

2. Momentum: SGD has trouble navigating complex loss function surfaces, like ravines, which are common around local optima. Momentum dampens oscillations, thus making convergence faster.

3. Nesterov Momentum: the previous approach has the shortfall that the updates can be too large and miss the local optima. Nesterov momentum tries to predict where the next weight update will land and calculates the gradient at that position.

All these features and hyperparameters come into place in the following formulas [10]:

lr = d · lr

vt = γ · vt−1 + lr · ∇θJ(θ − γ · vt−1)

w = w − vt

where:

1. w is the neuron's weight;

2. ∇θJ(θ) is the gradient of the loss function w.r.t. the weight;

3. γ is the momentum (usually a high value like 0.9);

4. d is the decay rate.

2.2.4.2 RMSPROP

RMSPROP is an optimizer that adapts the learning rate to each weight and is based on Adagrad. The most frequently updated weights (e.g. those for common features) have lower learning rates, receiving smaller updates; larger updates are applied to more sparsely used parameters. The Adagrad update is:

gt,i = ∇θJ(θt,i)

wt+1,i = wt,i − (η / √(Gt,ii + ε)) · gt,i

That is, gt,i is the partial derivative of the loss function w.r.t. parameter θi at time t, and Gt,ii is the sum of the squares of the gradients of parameter wi up to time t.

With this approach, Adagrad eliminates the need to manually tune down the lr, but it now accumulates quadratically growing gradients in the denominator, making the effective learning rate shrink towards 0, at which point no more significant changes are performed.


2.3. CONDITIONAL RANDOM FIELDS

RMSPROP tackles this by dividing the learning rate by the square root of a moving average of the squared gradient. This adds a weighted average between the current gradient and past gradients, with the weight defined by the parameter β [10]:

E[g²]t = β · E[g²]t−1 + (1 − β) · (δC/δw)²

wt = wt−1 − (η / √(E[g²]t)) · δC/δw

2.2.4.3 Adam

Similar to RMSPROP, ADaptive Moment Estimation (Adam) keeps a decaying average of past squared gradients, vt, but it also keeps an exponentially decaying average of past gradients, mt, in an approach similar to SGD's momentum.

Adam is particularly efficient, requires little memory, and is suited for problems that are large in terms of data and/or parameters [22]. Compared to other optimizers, Adam prefers flat local optima in the error surface.

mt = β1 · mt−1 + (1 − β1) · gt

vt = β2 · vt−1 + (1 − β2) · gt²

m̂t = mt / (1 − β1^t)

v̂t = vt / (1 − β2^t)

θt+1 = θt − (η / (√v̂t + ε)) · m̂t

mt and vt are values related to the weight gradient (the mean and the variance, respectively), and β1 and β2 are the decay rates. The authors empirically show that Adam works well in practice [22].
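The update above can be sketched for a single parameter. The hyperparameter defaults below are the commonly cited Adam defaults; minimizing θ² is just an illustrative toy problem:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, following the formulas above."""
    m = b1 * m + (1 - b1) * g          # decaying mean of gradients
    v = b2 * v + (1 - b2) * g ** 2     # decaying mean of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize J(theta) = theta^2 starting from theta = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2.0 * theta                    # gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)                           # close to the minimum at 0
```

Note how the step size is roughly η regardless of the gradient's magnitude: the division by √v̂t normalizes the update, which is the "adaptive" part of the method.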

2.3 Conditional Random Fields

When dealing with image analysis, context can be taken into account: the value of a pixel is certainly related to that of the surrounding pixels, as homogeneous regions exist. It does not make sense to have a random pixel of "cloud" in a region labeled as "grass".

A CRF allows this dependence to be modeled by defining a discriminative undirected probabilistic graphical model, representing relationships between two different sets of features: observed and unobserved [33].


A discriminative model learns the conditional probability distribution P(y|X): the probability of y given X. In contrast, a generative model learns the joint probability distribution P(X, y): the probability of both X and y [41].

In our case, a generative model would learn the probability of a pixel being black and belonging to a cancerous region. A discriminative model would learn the probability of a region being cancerous, knowing that it has a black pixel.

An undirected probabilistic graphical model means that when inferring the class of an observation yi, not only the input variables Xi associated with yi need to be accounted for, but also its neighbors yi−k, yi−k+1, . . . , yi+k−1, yi+k. This constraint promotes homogeneous regions.

Figure 2.8: The predicted value y5 does not depend only on the input image and the extracted features, but also on the predicted values for the adjacent pixels y1, y2, . . . , y8, y9.

A CRF defines a Markov Random Field by means of an undirected graph (V, E). The graph defines a set of random variables as nodes V (in this case pixels) and the edges E that connect them.

Furthermore, this relationship was structured as a pairwise model: each label yi (e.g. cancer or not) has associated a set of observed values Xi on the image (traditionally



Figure 2.9: Common graph structures for image segmentation. (a, b) 4- and 8-grid, respectively. (c) Fully connected.

For image segmentation, the following graph structures are common: 4- or 8-grid graphs, where only neighboring pixels are connected, and fully connected graphs, where all pairs of pixels are connected by edges. These structures are illustrated in Figure 2.9.

Grid CRFs are very efficient when inferring but suffer from some limitations: they only model local interactions and excessively smooth object boundaries. Fully connected


CRFs are slower when compared to grid CRFs, but are not limited to local interactions and allow better-defined object boundaries.

For this implementation, a Fully connected CRF was considered, where all nodes are connected.

To understand how a fully connected CRF works, it is necessary to introduce the notion of Energy. It can be defined as the cost associated with assigning a given label to a given data point.

Based on [23] and [59], let us define:

• Xi as a random variable associated with pixel i, which represents the label assigned to pixel i.

• Xi can take any value from a pre-defined set of labels L. In our work L = {0, 1}, denoting the presence or not of cancer, respectively.

• X as the vector formed by the random variables X1, X2, . . . , XN, with N being the number of pixels in the image.

• A graph G = (V, E), where V = {X1, X2, . . . , XN}.

• An image (global observation) I.

The pair (I, X) - the mapping of the input I to the mask X - can be modelled as a CRF characterized by a Gibbs distribution of the form P(X = x|I) = (1/Z(I)) exp(−E(x|I)), where E(x|I) is the energy of the configuration and Z(I) the partition function. E(X|I) will be abbreviated to E(X) from now on.

The energy E(X) is composed of two parts, unary and pairwise, defined by [23] as:

E(x) = Σi Ψu(xi) + Σ(i,j) Ψp(xi, xj)

The component Ψu(xi) is the unary energy: it measures the cost of pixel i taking the label xi. The unary energy predicts the label for a given data point without taking into consideration the smoothness and consistency of the assignment. The unary energies were obtained at the output of the feature extraction phase of a CNN [59].

Ψp(xi, xj) corresponds to the pairwise energy. It measures the cost of assigning labels xi and xj to pixels i and j simultaneously. It ensures image smoothness and consistency: pixels with similar properties should have similar labels. It is defined as:

Ψp(xi, xj) = µ(xi, xj)k(fi, fj)


k(fi, fj) = w(1) · exp( −|pi − pj|² / (2θα²) − |Ii − Ij|² / (2θβ²) )   [appearance kernel]
          + w(2) · exp( −|pi − pj|² / (2θγ²) )   [smoothness kernel]

where θα, θβ and θγ are hyperparameters and µ(xi, xj) is a label compatibility function. It introduces a penalty if nearby pixels i, j have different labels. In this implementation the Potts model is used, µ(xi, xj) = [xi ≠ xj] [23] [59]:

µ(xi, xj) = 1 if xi ≠ xj, 0 otherwise

The appearance kernel is inspired by the notion that nearby similar pixels are more likely to be of the same class. The smoothness kernel removes small isolated regions. The degrees of nearness, θα and θγ, are hyperparameters defined a priori.
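The pairwise energy for a pair of pixels can be computed directly from the formulas above. This is a toy sketch: the hyperparameter values (w1, w2, θα, θβ, θγ) are arbitrary, and the Potts model is taken as the indicator of differing labels:

```python
import numpy as np

def potts(label_i, label_j):
    """Potts compatibility: penalize pairs whose labels differ."""
    return 1.0 if label_i != label_j else 0.0

def pairwise_kernel(p_i, p_j, I_i, I_j,
                    w1=1.0, w2=1.0, t_alpha=10.0, t_beta=10.0, t_gamma=3.0):
    d2 = np.sum((p_i - p_j) ** 2)      # squared distance between pixel positions
    c2 = np.sum((I_i - I_j) ** 2)      # squared intensity difference
    appearance = w1 * np.exp(-d2 / (2 * t_alpha ** 2) - c2 / (2 * t_beta ** 2))
    smoothness = w2 * np.exp(-d2 / (2 * t_gamma ** 2))
    return appearance + smoothness

p_i, p_j = np.array([0.0, 0.0]), np.array([1.0, 0.0])   # two neighboring pixels
cost_same = potts(0, 0) * pairwise_kernel(p_i, p_j, 0.2, 0.2)
cost_diff = potts(0, 1) * pairwise_kernel(p_i, p_j, 0.2, 0.2)
print(cost_same, cost_diff)   # equal labels cost nothing; differing labels cost > 0
```

Because the kernels decay with distance and intensity difference, the penalty for label disagreement is strongest between nearby, similar-looking pixels, which is exactly the smoothing behavior described above.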

[59] defined the pairwise potentials as weighted Gaussian kernels of the form:

Ψp(xi, xj) = µ(xi, xj) Σ(m=1..M) w(m) kG(m)(fi, fj)

where kG(m), m = 1, . . . , M, is a Gaussian kernel applied on feature vectors derived from image features.

Minimizing the CRF energy E(X) returns the most probable label assignment for the input image. For ease of computation, the mean field approximation can be used [23].

2.3.1 Mean field approximation

To calculate P(X), an exact or an approximate inference method can be used. Exact inference tries to compute the exact distribution P(X). The most popular method, the junction tree algorithm, converts the graph into a tree by grouping variables.

This method can require exponential time in the worst case, so approximate methods were developed.

Several methods have been developed for approximating the CRF parameters: pseudo-likelihood, belief propagation and Markov Chain Monte Carlo (MCMC) [52].

In the context of image segmentation, the graph structure can quickly grow in complexity: a 64x64 pixel image will have 4096 nodes and C(4096, 2) = 8,386,560 different edges. For higher-resolution images, this number grows even higher.

While faster than exact inference, the training time of traditional CRF inference methods is still suboptimal, and another drawback arises: the traditional training methods for CRFs are not adapted for training using backpropagation.

While the feature extraction layers could be trained separately and their outputs then fed to the CRF to be trained, this would not allow the desirable feature of end-to-end training.

Mean field approximation on a CRF solves both these problems. Training a fully connected CRF with this approach instead of MCMC can be two orders of magnitude


faster [23]. Most importantly, mean field approximation can be rewritten as a set of Recurrent Neural Network (RNN) layers.

Mean field approximation consists in approximating the distribution P(X) with a simpler distribution Q(X) that can be written as the product of independent marginal distributions:

Q(X) = Πi Qi(Xi), subject to Σ(xi) Qi(xi) = 1

Qi(xi) is defined as a function that can be updated iteratively, using this algorithm:

Initialize Q: Qi(xi) ← (1/Zi) exp{−ψu(xi)};

while not converged do
    - Message passing from all Xj to Xi:
      Q̃(m)i(l) ← Σ(j≠i) k(m)(fi, fj) Qj(l) for all m
    - Compatibility transform:
      Q̂i(xi) ← Σ(l∈L) µ(xi, l) Σm w(m) Q̃(m)i(l)
    - Local update:
      Qi(xi) ← exp{−ψu(xi) − Q̂i(xi)}
    - Normalize Qi(xi)
end

Algorithm 1: Mean field approximation in fully connected CRFs

2.3.2 Conditional Random Fields with Convolutional Neural Networks

CRFs do a good job of using their inputs to create segmentation masks that are sensible, reasonable and accurate. Traditionally, the inputs given to the CRF were features defined a priori (e.g. color, texture, opacity) by the human operator, with manual tuning and selection. With this, the process was not streamlined: the feature extraction was independent of the classification and naturally often far from the optimum.

CNNs appear promising here because they are able to extract relevant feature maps without needing them to be defined a priori. This happens because the model will naturally define and optimize the features best suited for the task - in a fashion no human operator could.

The primary idea is that the features extracted by the convolutional layers will serve as the unary energies (i.e. inputs) fed to the CRF, as Figure 2.10 illustrates.

Some CNNs have used the CRF as a separate step of the pipeline (i.e. train the CNN separately and then train the CRF), while others have implemented it directly in the architecture [4] [59]. The latter has achieved state-of-the-art performance in several domains [4], as can be seen in Figures 2.10 and 2.11, and has


the advantage of making the training streamlined, as the process is reduced to one task.

Figure 2.11 compares several image segmentation techniques. It is clear that more complex feature extraction methods (i.e. CNNs) and larger CRF configurations (e.g. fully connected grids) achieve the best results. For example, notice the high marginal improvement, compared to previous techniques, of integrating the CRF directly in the network's architecture.

Figure 2.10: The features and classifications extracted by the CNNs can be further improved by applying a CRF model.

Figure 2.11: Evolution of CRFs at image segmentation. Notice the impact of different CRF grid configurations and improvement when using a CNN as a feature extractor. Extracted from [4]


2.3.3 Conditional Random Fields as Recurrent Neural Networks

As shown in section 2.3.2, it is highly desirable to integrate the CNN and the CRF in the same process, allowing for end-to-end training. In a novel way, presented by [59], the mean field approximation algorithm (introduced in section 2.3.1 and detailed in Algorithm 1) can be reformulated as a set of conv layers, and multiple iterations of the algorithm can then be represented using a Recurrent Network. This section details this adaptation:

Initialize Q: Qi(l) ← (1/Zi) exp(Ui(l)) for all i;

while not converged do
    - Message passing:
      Q̃(m)i(l) ← Σ(j≠i) k(m)(fi, fj) Qj(l) for all m
    - Weighting filter outputs:
      Q̌i(l) ← Σm w(m) Q̃(m)i(l)
    - Compatibility transform:
      Q̂i(l) ← Σ(l′∈L) µ(l, l′) Q̌i(l′)
    - Adding unary potentials:
      Q̆i(l) ← Ui(l) − Q̂i(l)
    - Normalizing:
      Qi(l) ← (1/Zi) exp(Q̆i(l))
end

Algorithm 2: Mean field in fully connected CRFs as a stack of CNN layers [59]

Apart from the added steps, the new method is more general: instead of the traditional Potts model for the label compatibility function, a custom function can be used, e.g. learned from the data.

2.3.3.1 Initialization

Before any update, the function Qi(l) needs to be initialized: Qi(l) ← (1/Zi) exp(Ui(l)), where Zi = Σl exp(Ui(l)). This is simply a softmax over the unary potentials across all labels of every pixel, so at this stage it does not use neighbor information.

2.3.3.2 Message Passing

In a traditional dense CRF, message passing is performed via M Gaussian filters on the Q values. The filter coefficients are based on pixel locations and RGB values, and reflect how strongly a pixel is related to other pixels.

In the implementation of [59], this is performed via a permutohedral lattice, with O(N) time complexity, N being the number of pixels in the image.

To apply backpropagation, the derivatives of the error with respect to the inputs are calculated by sending the error signal through the same Gaussian filters in the inverse direction.


2.3.3.3 Weighting Filter Outputs

This step performs a weighted sum of the M filter outputs from the previous step for each class label.

If each class is considered individually, this can be seen as a 1x1 convolution with M input channels and one output channel.

Backpropagation can also be performed in a similar fashion to the message passing step.

2.3.3.4 Compatibility transform

The output of the previous step is shared between the labels, depending on their compatibility. The compatibility in this implementation is defined by the Potts model.

This step can be viewed as another 1x1 convolution layer and the number of both input and output channels is L, the number of labels.

2.3.3.5 Adding Unary Potentials

In this step, the output of the compatibility transform step is subtracted element-wise from the unary inputs U .

2.3.3.6 Normalization

Finally, a Normalization is performed, by applying a softmax function.

2.3.4 General overview of CRF-RNN

This approach allows the construction of an end-to-end network that has both the strengths of the CRF and the flexibility of a CNN, allowing it to be seamlessly integrated into any NN architecture.

The CNN stage performs pixel-level feature extraction, which is then followed by a prediction that takes into account the structure of the image.

In our case, a Fully Connected layer was used for calculating the overall image probability.

The final network will have three hyperparameters specific to the CRF-CNN implementation:

• θα: degree of nearness required for the appearance kernel

• θβ: associated

• θγ: degree of nearness required for the smoothness kernel


2.4 Semantic Learning Machine

On the topic of CNN-based PCa classification, this work also explored a novel way to improve the model’s performance.

Most of the contributions that try to improve CNN-based classification do so by focusing on the earlier layers of the network, and little attention is given to the last layers [53], responsible for the actual prediction. They typically form a very simple fully connected architecture, and little thought is given to their design.

The idea explored in this work's contribution was to improve the performance of the final model using a network generated by a neuroevolutionary algorithm, the Semantic Learning Machine (SLM) [20].

The SLM constructs Neural Networks using hill-climbing, not traditional backpropagation. The network definition occurs by means of a specially defined variation operator: a mutation operator that induces a unimodal fitness landscape (i.e., without any local optima) [20].

In the original contribution, the SLM outperformed several NN algorithms: Neuroevolution of Augmenting Topologies (NEAT), Fixed-Topology Neuroevolution (FTNE), Multilayer Perceptron (MLP) and Support Vector Machines (SVM). SLM had the best performance in 24 out of 36 comparisons.

For our contribution, the SLM was used to build the neural network used to create a prediction, based on the features extracted by the CNN.


Chapter 3

Methods

This chapter highlights all the pieces necessary to execute this work, namely the dataset and the data preparation steps employed or considered (section 3.1). In section 3.2 an introduction to the architectures used is provided. The chapter ends with the presentation of the proposed architecture, integrated with a CRF (subsection 3.2.2).

3.1 PROSTATEx Challenge 2017 data

The dataset used for this work was compiled at the Radboud University Medical Centre in Nijmegen, The Netherlands [29]. It was made available for the SPIE-AAPM-NCI Prostate MR Classification Challenge (PROSTATEx Challenge 2017) [3, 31]. It was compiled in-house for the purpose of developing and evaluating a CAD system under the supervision of Dr. Huisman [29].

The data contains multi-parametric images and corresponding lesion information for 344 patients. Of those 344, 204 comprise the training set and 140 the test set. Only the training set was considered because the test set did not have the target variable available.

The images were presented to an expert who identified regions in which he considered there could be cancerous cells present. In those regions of interest, a biopsy was performed and then analyzed. This information is considered the question of interest of the data, as well as the ground truth.

If the lesion had a biopsy Gleason score of 7 or higher, it was considered Clinically Significant (CS).

For each lesion, the following information was available, provided in a comma-separated file.


Field     Description
ProxID    ProstateX patient identifier
fid       Finding ID
pos       Scanner coordinate position of the finding
ClinSig   Whether this is a clinically significant lesion or not (1 if so, 0 otherwise)

Table 3.1: Information available for a lesion

The last column, ClinSig is only available in the training set (204 patients), as the ground truth.

The data contains at least 5 images for each patient: T2-weighted (T2W), Proton Density-weighted (PDw), diffusion-weighted (DW) and Dynamic Contrast Enhanced (DCE) images in various planes. This data comes encoded in two formats: the Dynamic Contrast Enhanced image comes from a T1 weighted image encoded in two files, .mhd and .zraw. The remaining image modalities are provided in DICOM format.

Field                 Description
ProxID                ProstateX patient identifier
fid                   Finding ID
pos                   Scanner coordinate position of the finding
WorldMatrix           Matrix describing image orientation and scaling
ijk                   Image column (i), row (j), and slice (k) coordinates of the finding. Using the VTK/ITK/Python array convention, (0,0,0) represents the first column and first row of the first slice.
TopLevel              0 - series forms one image; 1 - set of series forming a 4D image; NA - series forms one image, but part of a level 1 4D image
SpacingBetweenSlices  Scalar spacing between slices
VoxelSpacing          Vector with x, y, z spacing scalars
Dim                   Vector with 4D dimensions of image
DCMSerDescr           Original DICOM series description
DCMSerNum             DICOM series number

Table 3.2: Information available for an image

An additional comma-separated file was made available with metadata for every image, as can be seen in Table 3.2.

Further detailed metadata is available in the DICOM/KTRANS encoded images, but because these details are not uniformly available, they were not considered further.

3.1.1 Descriptive analysis

3.1.1.1 Images

Every patient has the 3 modalities available, with T2w images captured in the three planes: sagittal, coronal and transverse. Because the data was collected on an ad-hoc


Measure  Value
Count    204
Max      114
Min      8
Mean     9.43
Std      9.43

Table 3.3: Description of exams per patient


Figure 3.1: Number of exams per patient

basis, the patient may have other modalities available, as can be seen in Table 3.3, but they were not considered.

It is possible to see that one patient has over 110 images available. These are probably DCE images collected at different time steps. Naturally, because this was the only patient with this information available, it could not be considered further.

In this work, only non-contrast-enhanced images (T2w, PD and ADC series) in the transverse plane were used, because DCE images present several disadvantages without evidence of impact on the model's predictive ability [9]. This was further discussed in section 2.1.1.5.


3.1.1.2 Lesion

A lesion corresponds to a region of interest where the technician suspects there could be prostate cancer. As it originally stands, there are 359 lesions, of which 81 (22.6%) are clinically significant - cancerous.

Measure  Value
Max      10.0
Min      1.0
Median   1.0
Average  1.768

Table 3.4: Lesion distribution per patient


Figure 3.2: Number of lesions per patient

3.1.2 Data Processing

Being this an mMRI study, different modalities and images are available for each patient. Therefore it is of the utmost interest to combine all of this information when assessing the presence of PCa. This can be done, e.g., by assigning each modality to a channel of the final image.

In this section, the processing steps carried out will be explained and given a short introduction. In this work, four steps were applied: interpolation, co-registration,


standardization and patch retrieval. Lastly, image augmentation was tried but ultimately not implemented. Outlier or anomaly removal methods were not considered.

3.1.2.1 Isotropic interpolation

The resolution of an image can be interpreted as the amount of information per pixel of that image. This means that an image with higher resolution will provide greater detail and it is easier to distinguish different tissues neighboring each other.

Depending on the collection conditions (e.g. different collection parameters), the different modalities will not have the same resolution in all three axes: a 3D image may actually not be perfectly isotropic.

This can be analysed by verifying the slice thickness used: the slice thickness is the distance between consecutive slices. Higher thickness means that each slice has to carry more information - thus less detail and resolution, making images blurrier.

In the case of this dataset, the data is anisotropic: unequal slice spacing means unequal image resolution. The x- and y-axes (the short axes) have a smaller spacing (higher resolution) than the z-axis (the long axis).

For example, the T2w images were collected with a voxel spacing of Z = (0.5625, 0.5625, 3) mm along the x, y and z axes, respectively. Therefore a pixel along the x axis represents 0.5625 mm of tissue, and the same holds for the y axis, but a pixel along the z axis contains 3 mm of tissue.

Based on this information and on the previous work of Liu et al. [32], isotropic interpolation was performed.

The objective of this step is to create images that have the same resolution in all planes. The chosen voxel spacing is Z* = (1, 1, 1) mm: each pixel carries information at a resolution of 1 mm of tissue. To achieve this, cubic interpolation was performed using the Dipy [13] package, a Python library focused on diffusion MRI analysis.

Carrying out the interpolation requires access to the image metadata provided in Table 3.2, namely the World Matrix and the Voxel Spacing.

For example, if the original image had dimensions S = ⟨320, 320, 19⟩ pixels, the new image will have dimensions ⟨180, 180, 68⟩, each axis being scaled by the ratio Z/Z*. Now each pixel corresponds to 1 mm of tissue.

It is possible to see that the short axes now carry slightly less information (from a resolution of 0.5625 mm to 1 mm), while the long axis carries more (from 3 mm to 1 mm). To achieve these changes in image resolution, cubic interpolation was used.
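A minimal sketch of the resampling step, using scipy.ndimage.zoom as a stand-in for the Dipy resampling routine used in this work. The volume is a random placeholder with the T2w spacing reported above, so the exact output shape depends on the original slice count:

```python
import numpy as np
from scipy.ndimage import zoom

# Hypothetical anisotropic T2w volume: 0.5625 mm in-plane, 3 mm slices.
volume = np.random.rand(320, 320, 19)
old_spacing = np.array([0.5625, 0.5625, 3.0])
new_spacing = np.array([1.0, 1.0, 1.0])

# Zoom factor per axis = old spacing / new spacing; order=3 selects
# cubic spline interpolation, matching the choice made in the thesis.
factors = old_spacing / new_spacing
isotropic = zoom(volume, factors, order=3)
print(isotropic.shape)  # (180, 180, 57)
```

The short axes shrink (320 → 180 pixels) while the long axis is upsampled, yielding voxels that each cover 1 mm of tissue in every direction.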

3.1.3 Image co-registration

Two MRI sequences cannot simply be overlaid, despite depicting the same location on the same patient: they may present the same features in different locations, additional noise, different representations, etc.
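To illustrate the underlying idea on a toy case, the sketch below recovers a pure integer translation between two slices by brute-force maximization of normalized cross-correlation. Real co-registration pipelines also handle rotations, deformations and multi-modal intensity differences, which this sketch does not:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def best_translation(fixed, moving, max_shift=5):
    """Brute-force search for the integer (dy, dx) translation that
    maximizes normalized cross-correlation between two 2D slices.
    A toy stand-in for a full co-registration pipeline."""
    def ncc(a, b):
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())

    best, best_score = (0, 0), -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = ncc(fixed, nd_shift(moving, (dy, dx), order=0))
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

# Synthetic example: 'moving' is 'fixed' displaced by a known offset.
rng = np.random.default_rng(0)
fixed = rng.random((64, 64))
moving = nd_shift(fixed, (-2, 3), order=0)
print(best_translation(fixed, moving))  # recovers (2, -3)
```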

