
Handling data access latency in distributed medical imaging environments



Share "Handling data access latency in distributed medical imaging environments"

Copied!
180
0
0

Texto


Universidade de Aveiro, Departamento de Electrónica, Telecomunicações e Informática, 2015
Programa Doutoral em Informática das Universidades do Minho, Aveiro e Porto

Carlos André Marques Viana Ferreira

Gestão de Latência no Acesso a Dados em Ambientes Distribuídos de Imagem Médica

Handling Data Access Latency in Distributed Medical Imaging Environments


Tese apresentada à Universidade de Aveiro para cumprimento dos requisitos necessários à obtenção do grau de Doutor em Ciências da Computação, realizada sob a orientação científica do Doutor Carlos Costa, Professor do Departamento de Electrónica, Telecomunicações e Informática da Universidade de Aveiro.


o júri / the jury

presidente / president
Doutor António Carlos Matias Correia
Professor Catedrático da Universidade de Aveiro (por delegação do Reitor da Universidade de Aveiro)

vogais / examiners committee
Doutor Rui Pedro Sanches de Castro Lopes
Professor Coordenador, Escola Superior de Tecnologia e Gestão, Instituto Politécnico de Bragança

Doutor Rui Carlos Mendes Oliveira
Professor Associado, Universidade do Minho

Doutor Enrique Fernández Blanco
Professor Assistente, Departamento de Tecnologias de Informação e Comunicação, Universidade da Corunha

Doutor Paulo José Osório Rupino Cunha
Professor Auxiliar, Faculdade de Ciências e Tecnologia, Universidade de Coimbra

Doutor Carlos Manuel Azevedo Costa
Professor Auxiliar


acknowledgements

First of all, I would like to express my special gratitude to my advisor, Prof. Carlos Costa, as well as to Prof. José Luís Oliveira, for their tireless cooperation and patience. I would also like to thank Sérgio Matos for advice and guidance that proved to be vital.

I express my deepest thanks to the whole bioinformatics group for the good atmosphere and cooperation, which made the research and development process much more pleasant, especially Luís Bastião, Luís Ribeiro, Daniel Ferreira, Frederico Valente, Eduardo Pinho and Tiago Godinho, who directly helped me in this doctorate.

Moreover, many thanks to my family and to Cândida Vitoriano for their constant love, strength and support, without whom this work would not have been possible.

Many friends have helped me in numerous ways; my deepest thanks to them, especially António Silva, Adriana and Luís Filipe, Luís Pinho, Rui Henriques, Domingos Terra, Andreia Soares, Rita Silva, Andreia Cabral and Isabel Peixoto, not forgetting Ricardo Leite, Vera Godinho, Filipe Sousa, Isabel Tavares, Vitor Correia, Verónica Vitoriano and Adriana Meneses.

I also acknowledge all my night-out companions for all the relaxed and humorous moments they provided me, namely Hugo Belo, Jorge Ruivo, Bruno Tavares, Lionel Sousa and Alberto Ferreira.

Finally, I gratefully acknowledge the Fundação para a Ciência e a Tecnologia (FCT) for funding this work through grant SFRH/BD/68280/2010.


Resumo

As tecnologias Web têm sido usadas cada vez mais no universo dos Picture Archiving and Communication Systems (PACS), nomeadamente em serviços de armazenamento, distribuição e visualização de imagem médica. Atualmente, verificamos que existe uma tendência para as instituições partilharem fluxos de trabalho e contratualizarem serviços na Cloud. No entanto, gerir as comunicações entre entidades geograficamente distribuídas continua a ser um desafio complexo devido ao enorme volume de dados e às limitações de largura de banda. A latência de acesso remoto aos dados é um problema importante que dificulta a adopção deste paradigma. Para melhorar o desempenho de redes distribuídas de imagem médica, podemos utilizar mecanismos de encaminhamento com cache e prefetching. Este doutoramento propõe uma arquitetura de cache baseada em regras estáticas e reconhecimento de padrões para prefetching e limpeza da cache.


Abstract

Web-based technologies have been increasingly used in Picture Archiving and Communication Systems (PACS), in services related to the storage, distribution and visualization of medical images. Nowadays, many healthcare institutions are federating services and outsourcing their repositories to the Cloud. However, managing communications between multiple geo-distributed locations is still challenging due to the complexity of dealing with huge volumes of data and bandwidth limitations. Communication latency is a critical issue that still hinders the adoption of this paradigm. In order to improve the performance of distributed medical imaging networks, routing mechanisms with cache and prefetching can be used. This doctorate proposes a cache architecture based on static rules together with pattern recognition for both cache eviction and prefetching.


Contents

Contents
List of Figures
List of Tables
Acronyms

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Structure
  1.4 Scientific Results
    1.4.1 International Journals
    1.4.2 Book Chapters
    1.4.3 International Conferences
  1.5 Contribution of the Publications in the Thesis

2 Distributed Medical Imaging Environments
  2.1 Medical Imaging Laboratories
    2.1.1 PACS
    2.1.2 DICOM
    2.1.3 IHE
  2.2 Distributed Architectures
    2.2.1 Enabling Technologies
    2.2.2 Practical Cases
    2.3.1 Infrastructure Outsourcing
    2.3.2 Bridge of Communications
  2.4 Challenges of Using Cloud in Medical Imaging
    2.4.1 The Impact of the Communication Latency
    2.4.2 Service Availability
    2.4.3 Data Security
    2.4.4 Migration of Large Volumes of Data
    2.4.5 Interface DICOM – Cloud
  2.5 Mechanisms to Minimize the Impact of Latency
    2.5.1 Compression
    2.5.2 Hierarchical Storage and Prefetching
    2.5.3 Tuning of Communication Parameters
    2.5.4 Multiple Data Sources
  2.6 Summary

3 Architecture Proposal for Handling Communication Latency
  3.1 Path Towards the Proposed Architecture
    3.1.1 Dicoogle Relay: a Cloud Communications Bridge for Medical Imaging
    3.1.2 Cache System for Medical Imaging Data
    3.1.3 Conclusions
  3.2 Proposed Architecture
  3.3 Summary

4 Pattern Recognition System
  4.1 Machine Learning for Pattern Recognition
    4.1.1 Artificial Neural Networks
    4.1.2 Support Vector Machines
    4.1.3 K-Nearest Neighbors
    4.1.4 Decision Trees
    4.1.5 Learning Modes
  4.2 Hydra: A Successful Failure
    4.2.1 Training Process
    4.2.2 Testing Process
    4.2.3 Hydra on Medical Imaging
  4.3 Pattern Recognition System for DICOM Traffic
    4.3.1 Usage Patterns
    4.3.2 Sensing
    4.3.3 Grouping Agent
    4.3.4 Pre-processing and Feature Extraction
    4.3.5 Labeler
    4.3.6 Trainer Agent
    4.3.7 Pattern Recognition Agent
    4.3.8 Models
    4.3.9 Pattern Recognition Workflows
  4.4 Experimental Methodology
    4.4.1 Real-World Dataset
    4.4.2 Synthesized Dataset
    4.4.3 Experimental Tests
  4.5 Results and Discussion
    4.5.1 Real-World Dataset
    4.5.2 Synthesized Datasets
    4.5.3 Overall Discussion
  4.6 Summary

5 Cache and Prefetching
  5.1 Architecture to Support Eviction and Prefetching Policies
    5.1.1 Sensors
    5.1.2 Eviction Agent
    5.1.3 Prefetching Agent
    5.1.4 Cache Repository
  5.2 Eviction and Prefetching Rules
    5.2.1 Eviction Rules
    5.2.2 Prefetching Rules
  5.3 Experimental Procedure
  5.4 Results
  5.5 Summary

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

Appendix A Datasets
  A.1 Extraction of the Real-World Dataset
  A.2 DICOM Traffic Generator
    A.2.1 Randomizer
    A.2.2 Database
    A.2.3 Patient Generator
    A.2.4 Network Status Generator
    A.2.5 Latency Calculation
    A.2.6 Studies Generator
    A.2.7 Consumers, Consumer Launcher and Log Writer
  A.3 Summary

Appendix B Hydra on Bioassessment of Water Quality
  B.1 Adaptation of Hydra for Bioassessment
  B.2 Validation By Specialists
  B.3 Other Experiments
  B.4 Results
  B.5 Summary


List of Figures

1.1 Outline of the thesis with chapters and their dependencies.
2.1 A typical PACS architecture instance.
2.2 A simplified diagram of the DICOM composite instance IOD information model.
2.3 Diagram of a non-exhaustive example of the structure of a DICOM image file.
2.4 Diagram of the actors and transactions between them in the XDS architecture.
2.5 Architecture of the PACS Cloud system proposed by Silva et al.
2.6 Architecture of the MIAPS system proposed by Shen et al.
2.7 Diagram of the system architecture deployed in a federation of healthcare institutions.
2.8 Diagram of the architecture proposed by Ribeiro et al., using multiple data sources.
3.1 Distribution of the time consumed by each part of the message interchange between two peers, without and with the query optimization.
3.2 Graph of the number of results retrieved per second as a function of the total number of search hits.
3.3 Graph of the time needed to transfer DICOM studies.
3.4 Diagram of the cache system architecture.
3.5 Diagram of the software tools and components of the cache system.
3.6 Distribution of the cache in regions and its limits.
3.7 Sequence diagrams showing how it is possible to provide the same quality of service in both scenarios: (a) the study is fully stored in the gateway's cache and (b) only 3 of the study's images are stored in cache.
3.8 Percentage of the achieved speedup (Y axis), with multiple percentages of the study in cache (X axis), taking the complete study in cache as reference (in inverse order).
3.9 Gateway usage scenario.
3.10 Diagram of the architecture of the gateway.
4.1 Diagram of a neuron of an ANN.
4.2 Diagram of the structure of a feedforward, fully connected ANN.
4.3 Diagram of the SVM training process: (a) samples distributed in the feature space, (b) feature space transformed by a kernel, (c) maximization of the distance between the boundary and the samples of each class.
4.4 Diagram of an example of the topology of a DT.
4.5 Diagram of the pattern recognition system architecture.
4.6 Diagram of the organization of the MLPs in the pattern recognition system.
4.7 Accuracy of the pattern recognition model for each distinct training condition, with the real-world dataset.
4.8 F-measures of the pattern recognition model for each distinct training condition, with the real-world dataset.
4.9 Accuracy of the pattern recognition model for each distinct training condition, for workstation A of the synthesized dataset.
4.10 F-measures of the pattern recognition model for each distinct training condition, for workstation A of the synthesized dataset.
4.11 Accuracy of the pattern recognition model for each distinct training condition, for workstation B of the synthesized dataset.
4.12 F-measures of the pattern recognition model for each distinct training condition, for workstation B of the synthesized dataset.
4.13 Accuracy of the pattern recognition model for each distinct training condition, for workstation C of the synthesized dataset.
4.14 F-measures of the pattern recognition model for each distinct training condition, for workstation C of the synthesized dataset.
4.15 Accuracy of the pattern recognition model for each distinct training condition, for all workstations of the synthesized dataset.
4.16 F-measures of the pattern recognition model for each distinct training condition, for all workstations of the synthesized dataset.
4.17 Evolution of the accuracy of the pattern recognition system with incremental learning and MLP as decision function for the combined dataset, over time (in months). For comparison purposes, the rectangles represent the accuracy of the tested batch learning conditions and the period they were able to predict.
5.1 Diagram of the proposed overall architecture.
A.1 Main components of the dataset generator.
A.2 Example of a network status function for one day.
B.1 Comparison of accuracies obtained for the same taxa by each technique, with section I showing the taxa that were best predicted by KNN in decreasing order of accuracy, and section II those taxa best predicted by SVM. In the middle are the few taxa that were best predicted by MLP.
B.2 Linear regressions between O50 and E50 values obtained through Hydra configurations A (minimum accuracy = 50% and no rare species elimination), B (minimum accuracy = 75% and no rare species elimination) and C (minimum accuracy = 50% with elimination of rare species at <5% of sites).
B.3 Screenshot of Aquaweb showing a description of one Hydra model.


List of Tables

1.1 Contributions of the publications in the thesis by chapters.
3.1 Study retrieval times between the DICOM router platform and the standard DICOM protocol.
3.2 Remote storage performance measurements.
4.1 Distribution of the samples among the distinct classes (1 - Patient revising, 2 - Modality revising, 3 - Inconsequent query and 4 - Other usages) in the 3 synthesized workstations (A, B and C) and in the whole synthesized dataset (Combined).
5.1 Hit ratios for the real-world dataset.
5.2 Hit ratios for the synthesized dataset.
5.3 Retrieval time per image for the real-world dataset, in seconds.
5.4 Retrieval time per image for the synthesized dataset, in seconds.
B.1 Number of taxa predicted by each technique (n) and respective average accuracy (%) and standard deviation (Mean±SD).
B.2 Comparison of the three tested Hydra configurations (A, B and C), regarding number of selected models, taxa that could not be predicted (invalid), OE ranges for each quality class and Spearman correlations.


Acronyms

AAC: Advanced Audio Coding
AE: Application Entity
AETitle: Application Entity Title
ANN: Artificial Neural Network
ANNA: Assessment by Nearest Neighbor Analysis
ARPANET: Advanced Research Projects Agency Network
AUSRIVAS: Australian River Assessment System
BEAST: BEnthic Assessment of SedimenT
C/S: Client/Server
CAD: Computer-Aided Diagnosis
CBIR: Content-based Image Retrieval
CIFS: Common Internet File System
CPU: Central Processing Unit
CR: Computed Radiography
CT: Computed Tomography
DICOM: Digital Imaging and Communications in Medicine
DIMSE-C: DICOM Message Service Element-Composite
DT: Decision Tree
FIFO: First In First Out
GAE: Google App Engine
HIMSS: Healthcare Information and Management Systems Society
HIS: Hospital Information System
HL7: Health Level 7
HTTP: HyperText Transfer Protocol
HTTPS: Hypertext Transfer Protocol Secure
IE: Information Entities
IHE: Integrating the Healthcare Enterprise
IOD: Information Object Definition
IT: Information Technology
JPEG: Joint Photographic Experts Group
KNN: K-Nearest Neighbors
LAN: Local Area Network
LFU: Least Frequently Used
LRU: Least Recently Used
LZW: Lempel-Ziv-Welch
ML: Machine Learning
MLP: Multilayer Perceptron
MP3: Moving Picture Experts Group Audio Layer 3
MPEG: Moving Picture Experts Group
MRI: Magnetic Resonance Imaging
MTU: Maximum Transfer Unit
OE: Observed/Expected
OSI: Open Systems Interconnection
P2P: Peer-to-Peer
PaaS: Platform as a Service
PACS: Picture Archiving and Communication System
PC: Personal Computer
PET: Positron Emission Tomography
PNG: Portable Network Graphics
QIDO-RS: Query based on ID for DICOM Objects by Representational State transfer
QoS: Quality of Service
RAM: Random Access Memory
RCA: Reference Condition Approach
REST: Representational State Transfer
RIS: Radiology Information System
RIVPACS: River InVertebrate Prediction and Classification System
RMI: Remote Method Invocation
RSNA: Radiological Society of North America
SMO: Sequential Minimal Optimization
SPECT: Single Photon Emission Computed Tomography
SR: Structured Report
STOW-RS: STore Over the Web by Representational State transfer
SVM: Support Vector Machines
TCP/IP: Transmission Control Protocol/Internet Protocol
TLV: Tag Length and Value
UID: Unique Identifier
US: Ultrasound
VO: Virtual Organization
VoIP: Voice over Internet Protocol
VPN: Virtual Private Network
VR: Value Representation
WADO: Web Access to DICOM Persistent Objects
WAN: Wide Area Network
WFD: Water Framework Directive
XDS: Cross-Enterprise Document Sharing


Chapter 1

Introduction

Medical imaging plays an important role in healthcare, in both diagnostic and treatment processes [1, 2]. It contains important biomarkers and provides a view not only of the anatomy but also of the physical processes [3]. The importance of this tool has grown over time, following the increasing availability of computational resources that supported the creation of new imaging modalities such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Ultrasound (US), Positron Emission Tomography (PET) and Single Photon Emission Computed Tomography (SPECT) [4, 5]. Besides enabling new methods of acquiring medical images, computers are also changing the storage and distribution of images and associated data [6]. Much of this is due to the appearance of the Picture Archiving and Communication System (PACS) concept, an umbrella term for systems that gather a set of technologies for the acquisition, storage, transmission and visualization of medical imaging [6, 7, 8].

PACS appeared in the 1980s in the shape of small islands, each one mainly composed of: (1) imaging acquisition devices, i.e., the equipment responsible for the extraction of data and, consequently, for the construction of an image representative of the collected data; (2) visualization workstations; and (3) printers. Soon, however, this concept started to proliferate, associated with the massification of different modalities and their usage by hospital departments beyond Radiology. Yet, interoperability was restricted to equipment from the same manufacturer. Hence, an international consortium developed a normalization effort that resulted in the Digital Imaging and Communications in Medicine (DICOM) standard [9]. Among other things, DICOM defines the network communication layers, the service commands, the coding of persistent objects, the media exchange structure and the documentation that must accompany an implementation [10]. This standard was well accepted by the community, being implemented in most medical imaging equipment. As a consequence, during the last decades, large amounts of DICOM data have been created and stored in the repositories of distinct medical institutions. However, these data need to be exchanged between healthcare providers, so that physicians can have an integrated view of patients' examinations [11].
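As an illustration of the persistent object coding mentioned above, DICOM serializes each attribute as a tag, value representation (VR) and length followed by the value (the TLV scheme listed in the acronyms). The sketch below encodes one data element in the explicit VR little-endian form; it is a simplified illustration of the standard's encoding rules, not code from this thesis, and handles only short-form VRs.

```python
import struct

def encode_element(group, element, vr, value):
    """Encode one DICOM data element (explicit VR, little-endian).

    Simplified sketch: only the short form (2-byte length field) is
    handled; values are padded to even length, as DICOM requires.
    """
    data = value.encode("ascii")
    if len(data) % 2:                 # pad odd-length values
        data += b"\x00" if vr == "UI" else b" "
    return struct.pack("<HH2sH", group, element, vr.encode(), len(data)) + data

# Patient's Name, tag (0010,0010), VR PN ("person name"):
blob = encode_element(0x0010, 0x0010, "PN", "Doe^John")
assert blob == b"\x10\x00\x10\x00PN\x08\x00Doe^John"
```

A DICOM file or network message is essentially a sequence of such elements, which is why routers and caches can inspect metadata (patient, study, modality) without decoding pixel data.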

The widespread availability of the Internet enabled the expansion of PACS horizons, promoting data access not only from other departments but also from other institutions. This new reality also fostered the development of telemedicine and teleradiology platforms, which can be defined as the use of telecommunication technologies for the provision of medical (radiologic) information and services [12, 13]. In this context, two main scenarios appeared: (1) the outsourcing of the storage facility and (2) inter-institutional collaboration.

In this doctorate, we have studied some of the current technologies that could support those scenarios, concluding that Cloud technology is the most suitable for medical imaging environments. Among the several definitions of Cloud computing, the common points describe it as a distributed system technology that consists of the aggregation of distributed resources into a single system. Cloud computing aims at virtualization, in other words, decoupling the business service from the underlying infrastructure, and at scalability, i.e., the capability of the system to grow when needed [14, 15].

With Cloud computing, resources shift their location from the users' machines to the Cloud's servers, freeing users from the responsibility of maintaining an Information Technology (IT) infrastructure. In this way, it is possible to perform resource-intensive tasks from equipment with reduced computational power, such as PDAs, mobile phones, tablets or cars [16].

The outsourcing of PACS services is a very present trend in medical imaging environments [17, 18], where institutions aim to reduce the costs of local storage maintenance and to promote inter-institutional workflows. Although this migration of services to the Cloud may bring many economic and technological benefits, it also presents new challenges for storing, indexing and retrieving data. The main drawbacks of migrating PACS to the Cloud concern the privacy, ownership and availability of medical imaging data.


1.1 Motivation

Hosting the PACS infrastructure on the Cloud is very different from hosting it in-house. Traditional PACS installations take advantage of a high-speed Local Area Network (LAN), while using the Cloud, either for bridging communications or for infrastructure outsourcing, makes communications highly dependent on several third-party conditions, for instance, Internet connection bandwidth [19]. The availability rate of Cloud services is very high, which means that services are virtually always ready and reachable. However, the performance of access to those resources is a major concern.

This problem is reinforced by the typical sizes of medical imaging examinations, which may reach hundreds of megabytes, or even gigabytes. Besides, physicians need to access the images as fast as possible to provide an efficient healthcare service. In conclusion, one of the main problems associated with the use of Cloud computing in medical imaging scenarios is communication latency [20, 21]. Thus, the main motivation of this thesis is to study and propose solutions for handling data access latency in distributed medical imaging environments.
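To make the scale of the problem concrete, a back-of-the-envelope estimate shows how raw bandwidth and per-request round trips combine. The numbers below are illustrative assumptions, not measurements from this thesis, and the model deliberately ignores protocol overhead, congestion and parallelism.

```python
def study_transfer_time(size_mb, bandwidth_mbps, rtt_s, round_trips):
    """Rough transfer time: serialization delay plus per-request latency."""
    serialization = size_mb * 8 / bandwidth_mbps  # seconds to push the bits
    latency = rtt_s * round_trips                 # seconds lost to round trips
    return serialization + latency

# A hypothetical 500 MB study over a 100 Mbit/s link, fetched image by
# image (200 images, one 50 ms round trip each):
total = study_transfer_time(500, 100, 0.050, 200)
print(f"{total:.0f} s")  # 40 s of raw transfer + 10 s of latency
```

Even on a good link, a physician would wait tens of seconds per study, which is why the caching and prefetching mechanisms studied in this thesis aim to hide that delay rather than merely shrink it.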

1.2 Objectives

The main goal of this doctorate was to study, design and develop new informatics solutions to reduce the communication latency of accessing remote medical imaging repositories or, at least, its impact. To reach this goal, several minor goals were defined:

• To study and analyze the medical imaging environment: concepts, standards, workflows, user behaviors, characteristics of medical imaging data, characteristics of traditional repositories for this scenario, required features, among others.

• To study and analyze technologies that may support IT infrastructure outsourcing and communication bridging, taking into account the specificities of the medical imaging scenario. Afterwards, one must be chosen as the supporting technology of the architectural design.

• To design an architecture that promotes inter-institutional collaboration and IT infrastructure outsourcing, while taking the most out of the selected technology. Moreover, it should merge the advantages of the existing architectural designs.


• To study, design, develop and improve solutions that minimize the communication latency problem of remote repositories.

• To develop mechanisms to validate the developed solutions, so that they can be improved. Also, the validation mechanisms shall be extensible so that they can be used in other research projects.

• To contribute to science with international publications (e.g., book chapters, scientific journal articles and conference papers).

• To point to future research directions that can be followed by others to achieve enhanced solutions.

1.3 Structure

The remainder of this document is structured as follows (figure 1.1):

[Figure 1.1: Outline of the thesis with chapters and their dependencies.]


• Chapter 2 - Medical Imaging and Distributed Repositories: describes medical imaging scenarios and the standards associated with this tool. It also characterizes and compares some technologies that might support distributed repositories: Peer-to-Peer, Grid and Cloud. After concluding that, amongst these three paradigms, Cloud is the most promising technology to support distributed medical imaging repositories, this chapter describes the main concerns related to the use of Cloud for medical imaging. One of them is the high communication latency, which is the main focus of this doctorate. As such, this chapter ends with a description of some existing strategies to minimize this issue.

• Chapter 3 - Architecture: describes the path taken towards the proposed architecture for making medical imaging available everywhere, at any time. It finishes with the description of the proposed architecture.

• Chapter 4 - Pattern Recognition: describes the path followed in order to build a mechanism able to predict which usage pattern best fits the user's next actions. The output of this mechanism is used by the cache and prefetching mechanisms, since usage patterns define a subset of objects that are likely to be needed soon.

• Chapter 5 - Cache and Prefetching: describes the main characteristics of the resulting cache and prefetching solutions, including how pattern recognition was embedded in this mechanism.
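As background for the cache work outlined above, classic eviction policies such as LRU (Least Recently Used, listed in the acronyms and revisited in Chapter 5) can be sketched in a few lines. This is a generic textbook illustration of the baseline policy, not the rule-based mechanism this thesis proposes.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal Least Recently Used cache: evicts the entry that has
    gone longest without being read or written."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("study-1", b"...")
cache.put("study-2", b"...")
cache.get("study-1")          # touch study-1
cache.put("study-3", b"...")  # evicts study-2, the coldest entry
assert cache.get("study-2") is None
assert cache.get("study-1") is not None
```

Policies like this look only at recency; the thesis argues that combining static rules with recognized usage patterns can evict and prefetch more accurately than recency or frequency alone.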

• Chapter 6 - Conclusions: presents the final remarks of this thesis. In addition, it also presents some research paths that may be followed in the future.

1.4 Scientific Results

This doctorate generated a set of scientific outcomes, enumerated in the following sub-sections.

1.4.1 International Journals

1. Carlos Viana-Ferreira, Sérgio Matos, and Carlos Costa, "An Intelligent Cloud Storage Gateway for Medical Imaging", IEEE Transactions on Biomedical Engineering, under review.


2. Carlos Viana-Ferreira, Luís Ribeiro, Sérgio Matos, and Carlos Costa, "Pattern Recognition for Cache Management in Distributed Medical Imaging Environments", International Journal of Computer Assisted Radiology and Surgery, under review.

3. Tiago Godinho, Carlos Viana-Ferreira, Luís Silva, and Carlos Costa, "A Routing Mechanism for Cloud Outsourcing of Medical Imaging Repositories", IEEE Journal of Biomedical and Health Informatics, 2014.

4. Maria Feio, Carlos Viana-Ferreira, and Carlos Costa, "Testing a Multiple Machine Learning Tool (HYDRA) for the Bioassessment of Fresh Waters", Freshwater Science, vol. 33, no. 4, pp. 1286-1296, Dec. 2014.

5. Maria João Feio, Carlos Viana-Ferreira, and Carlos Costa, "Combining Multiple Machine Learning Algorithms to Predict Taxa Under Reference Conditions for Streams Bio-assessment", River Research and Applications, vol. 30, no. 9, pp. 1157-1165, Nov. 2014.

6. Carlos Viana-Ferreira, Luís Ribeiro, and Carlos Costa, "A Framework for Integration of Heterogeneous Medical Imaging Networks", The Open Medical Informatics Journal, vol. 8, pp. 20-32, Sep. 2014.

7. Luís Ribeiro, Carlos Viana-Ferreira, José Luís Oliveira, and Carlos Costa, "XDS-I Outsourcing Proxy: Ensuring Confidentiality While Preserving Interoperability", IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, pp. 1404-1412, July 2014.

8. Frederico Valente, Carlos Viana-Ferreira, Carlos Costa, and José Luís Oliveira, "A RESTful Image Gateway for Multiple Medical Image Repositories", IEEE Transactions on Information Technology in Biomedicine, vol. 16, no. 3, pp. 356-364, May 2012.

1.4.2 Book Chapters

9. Carlos Viana-Ferreira, and Carlos Costa, "Challenges of Using Cloud Computing in Medical Imaging", in Advances in Cloud Computing Research, Muthu Ramachandran, Ed., Nova Publishers, 2014.


1.4.3 International Conferences

10. Eduardo Pinho, Carlos Viana-Ferreira, and Carlos Costa, "Simulation of DICOM Traffic in PACS Networks Using Behavior Profiles", 29th International Congress and Exhibition on Computer Assisted Radiology and Surgery, Barcelona, Spain, Jun. 2015 (accepted).

11. Carlos Viana-Ferreira, Sérgio Matos, and Carlos Costa, "Long-term Prefetching for Cloud Medical Imaging Repositories", 26th Medical Informatics Europe Conference, Madrid, Spain, May 2015 (accepted).

12. Carlos Viana-Ferreira, Sérgio Matos, and Carlos Costa, "Incremental Learning versus Batch Learning for Classification of User's Behaviour in Medical Imaging", 8th International Conference on Health Informatics, Lisbon, Portugal, Jan. 2015.

13. Carlos Viana-Ferreira, and Carlos Costa, "DICOM Traffic Generator based on Behavior Profiles", International Conference on Biomedical and Health Informatics (BHI), Valencia, Spain, Jun. 2014.

14. Carlos Viana-Ferreira, and Carlos Costa, "A Cloud based architecture for medical imaging services", IEEE 15th International Conference on e-Health Networking, Applications and Services (Healthcom), Lisbon, Portugal, Sep. 2013.

15. Carlos Viana-Ferreira, Carlos Costa, and José Luís Oliveira, "A Multi-Domain Platform for Medical Imaging", IEEE Symposium on Computer-Based Medical Systems, Grid and Cloud Computing in Biomedicine and Life Sciences Special Track, Porto, Portugal, Jun. 2013.

16. Tiago Godinho, Luís Bastião Silva, Carlos Viana-Ferreira, Carlos Costa, and José Luís Oliveira, "Enhanced regional network for medical imaging repositories", 8.ª Conferência Ibérica de Sistemas e Tecnologias de Informação, Lisbon, Portugal, Jun. 2013.

17. Carlos Viana-Ferreira, Daniel Ferreira, Frederico Valente, Eriksson Monteiro, Carlos Costa, and José Luís Oliveira, "Dicoogle Mobile: a medical imaging platform for Android", 24th Conference of the European Federation for Medical Informatics, Pisa.


18. Carlos Viana-Ferreira, Carlos Costa, and Jos´e Lu´ıs Oliveira, “Dicoogle Relay a Cloud Communications Bridge for Medical Imaging”, IEEE Symposium on Computer-Based Medical System, in Grid and Cloud Computing in Biomedicine and Life Sciences Special Track, Rome, Italy, Jun. 2012.

19. Maria João Feio, Carlos Viana-Ferreira, and Carlos Costa, “Prediction of River Invertebrate Taxa Through a Multiple Machine-Learning Tool (Hydra)”, Annual Meeting of the Society for Freshwater Science, Louisville, KY, USA, May 2012.

1.5 Contribution of the Publications in the Thesis

Much of the content of this thesis is based on published work. Table 1.1 shows which publications contributed to which chapters. The identification of each publication is the same as the one found in section 1.4.

Table 1.1: Contributions of the publications in the thesis by chapters.

[Rows: publications 1–19; columns: Chapters 2–5 and Appendices A and B; an X marks each chapter or appendix to which the publication contributed.]


Chapter 2

Distributed Medical Imaging Environments

Medical imaging is an important tool for medical diagnosis and treatment support [1, 22]. Historically, this tool has taken advantage of advances in computer technology, and it has become heavily supported by computers. Namely, they enabled the appearance of numerous imaging modalities whose acquisition is based on computational processes, for instance CT, SPECT and MRI [5]. Moreover, with the appearance of PACS, computers have revolutionized how medical imaging data is visualized, stored and distributed [6, 7, 8].

More recently, advances in distributed computing and high-bandwidth communications have been revolutionizing medical imaging processes, promoting new use-cases such as inter-institutional collaboration and infrastructure outsourcing. These technological advances could bring numerous advantages to the provision of healthcare services [20]. Nevertheless, some obstacles still hinder their wide adoption.

This chapter describes: (1) standards and concepts related to medical imaging; (2) distributed repository technologies; (3) use-cases for distributed repositories to support medical imaging services; (4) the obstacles that hinder the adoption of distributed repositories; and (5) some approaches to the problem of communication latency.

2.1 Medical Imaging Laboratories

A set of technologies and standards is responsible for supporting and providing medical imaging services. The following subsections describe some of them.


2.1.1 PACS

The PACS concept appeared in the 1980s in the shape of small islands responsible for the printing, display and storage of the medical imaging data produced by a single imaging acquisition device, also known as a modality. Currently, however, PACS embraces a wide set of technologies and devices, being responsible for all the main medical imaging services of a healthcare institution [6].

Regarding the architecture of this kind of system, there is no fixed design shared by all PACS instances: it varies from institution to institution while meeting the requirements associated with the patient and institutional workflows. As an example, the patient workflow varies from one imaging department to another but, in general, it comprises the following steps [6]:

1. The patient arrives at the healthcare center.

2. If the patient is new, he or she is registered.

3. The imaging exam is ordered.

4. The technologists receive the exam requisition and call the patient.

5. The technologist performs the imaging exam.

6. The imaging exam data is produced and stored.

7. The technician previews the exam and reports.

8. The report is stored and made available for clinician viewing.

Taking this example of a patient workflow, the architecture of a typical PACS instance is shown in figure 2.1. It is a three-tier architecture: at the top are the acquisition devices; the second tier is constituted by the servers and the printer; and the third tier is responsible for making the information available to the users.

In general, a PACS is composed of the image acquisition equipment, a PACS server and archive, and display workstations linked by a network. Some years ago, there was also an image and data acquisition gateway, which bridged the imaging modality (image acquisition equipment) and the PACS server. This component was responsible for translating between the communication protocols of medical imaging devices and those supported by the PACS server. Nowadays, most image acquisition devices and PACS servers already speak the same language, i.e. DICOM. Thus, the image acquisition equipment uses the data received from the Radiology Information System (RIS) through the modality worklist, attaching that data to the image and sending it to the PACS server.

Figure 2.1: A typical PACS architecture instance (adapted from [23]).

The PACS server is a key component of a PACS, since it is the management agent of the system, receiving data from the image acquisition equipment, the Hospital Information System (HIS) and the RIS. It has two main components, the database server and the archive system (which, in turn, can be divided into short-term, long-term and permanent storage). The PACS server is responsible for several processes, such as: (1) receiving images from the imaging modality (or the acquisition gateway); (2) updating a database management system; (3) checking data integrity; and (4) compressing the image data and providing the query/retrieve service [6].


Another fundamental part of a PACS is the display workstations, which have a database, a display and processing software. These workstations can request images from the PACS archive and store them locally. Commonly, they hold only the limited set of images needed for a patient examination. Their usability is increased by the inclusion of reporting and measurement tools to support diagnosis. Moreover, some equipment also provides image reconstruction services.

2.1.2 DICOM

In the beginning, PACS were developed in an ad-hoc way, most of them composed mainly of the modality acquisition equipment, the display workstation and the storage component [6]. Soon, numerous healthcare institutions had more than one of those solutions working simultaneously, which brought the need to merge them into a major system able to deal with all the imaging data of the institution. However, interoperability was restricted to equipment from a single manufacturer, so it became critical to make the devices of distinct providers interoperable. A normalization effort resulted in the international DICOM standard [9]. Among other things, DICOM defines the network communication layers, the service commands, the coding of persistent objects, the media exchange structure and the documentation that must accompany an implementation [10]. The standard was well accepted, being implemented in most current medical imaging equipment. As a consequence, during the last years, much DICOM data has been created and stored in repositories.

Information Object Definition

DICOM standardizes a way to represent real-world objects through the Information Object Definition (IOD) concept. An IOD is a way to encode information about digital medical images and related data. In the real world, objects are not isolated islands; instead, they are related to each other. DICOM takes this into account by modeling relationships between objects. An IOD that includes information about other real-world objects is called a composite information object. DICOM normalizes an entity-relationship model that relates the components, or Information Entities (IE), of the IOD, instead of aggregating them arbitrarily.

Figure 2.2 shows a simplified diagram of the DICOM composite IOD information model, pointing out the existence of a hierarchy in the way the objects are associated with each other. The Patient is at the top, being the subject of the study (i.e. the definition of the characteristics of a medical study). The study contains one or more series (i.e. aggregations of IOD composites into distinct logical sets), which contain image raw data, Structured Report (SR) documents, waveforms, spectroscopy, measurements, presentation states, images and/or surfaces. Either way, each series must contain one or more of the following: Presentation State IE, SR Document IE or Image IE [24].

Figure 2.2: A simplified diagram of the DICOM composite instance IOD information model.
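The hierarchy described above (Patient, Study, Series and, within each series, individual instances) can be sketched with plain nested data structures. This is a toy illustration only; the class and field names below are ours, not part of the standard’s data dictionary.

```python
# A minimal sketch of the DICOM composite IOD hierarchy (Patient > Study >
# Series > Instance), using plain dataclasses. Class and field names are
# illustrative, not taken from the standard.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:          # e.g. an Image IE or an SR Document IE
    sop_instance_uid: str

@dataclass
class Series:
    series_instance_uid: str
    modality: str
    instances: List[Instance] = field(default_factory=list)

@dataclass
class Study:
    study_instance_uid: str
    series: List[Series] = field(default_factory=list)

@dataclass
class Patient:
    patient_id: str
    name: str
    studies: List[Study] = field(default_factory=list)

# One patient with one CT study holding a two-image series (UIDs are made up).
patient = Patient("P001", "John Smith", [
    Study("1.2.840.999.1", [
        Series("1.2.840.999.1.1", "CT",
               [Instance("1.2.840.999.1.1.1"), Instance("1.2.840.999.1.1.2")])
    ])
])
assert len(patient.studies[0].series[0].instances) == 2
```

Note how every instance is reachable only through its series and study, mirroring the hierarchical identifiers used in DICOM queries.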

DICOM Persistent Objects

The DICOM standard also defines the way IOD instances are encapsulated into files [25]. According to the standard, these files must begin with the DICOM file meta information block, followed by the stream representing the data set (see figure 2.3) [26].


The file meta information starts with a 128-byte preamble (available for specific implementations, usually padded with zeros), followed by the string “DICM”, the size in bytes and other relevant data. Except for the preamble and the string “DICM”, all fields of a DICOM file are represented in a Tag, Length and Value (TLV) structure; that is, each element of the file is divided into three fields: the tag (the identification of the field), the length (the number of bytes of the value field) and the value [26]. For instance, the tag 00100010 (in hexadecimal notation) represents the “patient name” field and, if the value contains “John Smith”, the length field will be 10. This means that a DICOM file containing this example relates to a patient whose name is “John Smith”.

Actually, DICOM adds an additional field to the TLV structure: the Value Representation (VR). The VR is optional, depending on the transfer syntax, and it defines how data in the Value field is represented. It occupies two bytes (for some VRs, two additional reserved bytes follow and the length field grows to four bytes) and its possible values are defined in the DICOM dictionary [27]. In the case of an implicit transfer syntax (i.e. when the VR is omitted), the DICOM dictionary must be used to decode the value, taking the tag into account.
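The TLV structure with an explicit VR can be illustrated with a short parsing sketch. This is a minimal illustration assuming an explicit-VR little-endian element with a short-form VR such as PN; a real parser must also handle the 128-byte preamble and “DICM” marker, implicit VR, and the long-form VRs (OB, OW, SQ, among others).

```python
# Minimal sketch of parsing one explicit-VR little-endian data element,
# e.g. Patient Name (tag 0010,0010). Only short-form VRs are handled here.
import struct

def parse_short_element(buf: bytes):
    group, element = struct.unpack_from("<HH", buf, 0)   # tag: two 16-bit words
    vr = buf[4:6].decode("ascii")                        # e.g. "PN"
    (length,) = struct.unpack_from("<H", buf, 6)         # 16-bit length
    value = buf[8:8 + length]
    return (group, element), vr, length, value

# Encode "John Smith" as a Patient Name element (0010,0010), VR=PN, length=10.
raw = (struct.pack("<HH", 0x0010, 0x0010)
       + b"PN" + struct.pack("<H", 10) + b"John Smith")
tag, vr, length, value = parse_short_element(raw)
assert tag == (0x0010, 0x0010) and vr == "PN" and length == 10
assert value == b"John Smith"
```

The round trip reproduces the example from the text: tag 00100010, length 10, value “John Smith”.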

In figure 2.3, a diagram of a non-exhaustive example of the DICOM file structure is shown, where some headers precede the image data. As previously described, these files contain a sequence of elements with data about the patient, the study, the series and the image itself.

DIMSE-C

DICOM also defines a set of basic operations. One of the classes of services defined is the DICOM Message Service Element-Composite (DIMSE-C) [28]. This class of services comprises several commands, namely:

• C-Store: for pushing a DICOM object into a repository.

• C-Get: to request objects from a PACS, using their unique identifiers.

• C-Move: to instruct a node to send an object to a third location. This service also relies on C-Store and can replace C-Get.

• C-Find: to query the PACS for objects that match a query.

• C-Echo: works like a DICOM ping; it checks the connectivity of an element of the DICOM network.
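The matching semantics behind C-Find can be illustrated with a toy in-memory “archive”. This sketch models only attribute matching, where an empty query value returns all records (mirroring DICOM universal matching); it does not model the actual DIMSE-C network messages, and the record fields are illustrative.

```python
# Toy illustration of C-Find-style matching against an in-memory "archive".
# Empty query values act as wildcards, mirroring DICOM universal matching;
# this models only the matching semantics, not the DIMSE-C wire protocol.
archive = [
    {"PatientName": "John Smith", "Modality": "CT", "StudyDate": "20150101"},
    {"PatientName": "Jane Doe",   "Modality": "MR", "StudyDate": "20150215"},
    {"PatientName": "John Smith", "Modality": "MR", "StudyDate": "20150301"},
]

def c_find(query):
    """Return records whose attributes match every non-empty query value."""
    return [rec for rec in archive
            if all(rec.get(key) == val for key, val in query.items() if val)]

matches = c_find({"PatientName": "John Smith", "Modality": ""})
assert len(matches) == 2          # both of John Smith's studies are returned
```

A real C-Find query would carry the same kind of key/value identifier, encoded as a DICOM data set and answered with one response message per match.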


Figure 2.3: Diagram of a non-exhaustive example of the structure of a DICOM image file.

Web Compliant

Web Access to DICOM Persistent Objects (WADO) is the part of the DICOM standard oriented towards the distribution of images and other medical data via the Web [29].

This protocol allows remote access to DICOM files through HyperText Transfer Protocol (HTTP) or Hypertext Transfer Protocol Secure (HTTPS) channels. Basically, with an implementation of this standard, the client must send an HTTP GET request with a URL/URI identifying the wanted object. DICOM defines the response as an HTTP response whose message body is encoded in one of the following formats: DICOM, JPEG, GIF, PNG, JP2, MPEG, plain text or HTML. Besides, it is recommended that the server also support XML, PDF and RTF, without discarding the possibility of supporting other file formats.
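A URI-based WADO request therefore reduces to building a URL with a handful of query parameters. The sketch below assembles such a request; the server address and UID values are placeholders.

```python
# Sketch of a URI-based WADO request: the object is identified by its study,
# series and object UIDs, and contentType selects the returned format.
# The host/path and the UID values are placeholders.
from urllib.parse import urlencode, urlparse, parse_qs

def wado_url(base, study_uid, series_uid, object_uid,
             content_type="application/dicom"):
    params = {
        "requestType": "WADO",
        "studyUID": study_uid,
        "seriesUID": series_uid,
        "objectUID": object_uid,
        "contentType": content_type,
    }
    return base + "?" + urlencode(params)

url = wado_url("https://pacs.example.org/wado",
               "1.2.840.999.1", "1.2.840.999.1.1", "1.2.840.999.1.1.1")
query = parse_qs(urlparse(url).query)
assert query["requestType"] == ["WADO"]
assert query["contentType"] == ["application/dicom"]
```

An HTTP GET on this URL would return the object itself in the requested format; asking for image/jpeg instead of application/dicom yields a rendered image.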


This standard is very simple to implement and use, giving the possibility of receiving DICOM objects without needing to “speak” DICOM: the client has only to specify the wanted DICOM object through its unique identifiers. This allows retrieving images over the Web, but there is a gap in storage and search mechanisms [30]. Until very recently, it was the only part of the standard compliant with Web 2.0. However, in 2013, the standard’s organization also published two new extensions based on Representational State Transfer (REST) Web services: STore Over the Web by Representational State transfer (STOW-RS) [31], to support the storage of DICOM objects, and Query based on ID for DICOM Objects by Representational State transfer (QIDO-RS) [32], to support queries based on DICOM object IDs. However, the huge majority of current equipment does not support these new extensions.
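In contrast with the URI-based service, a QIDO-RS search is a plain HTTP GET on a resource path such as /studies, with DICOM attributes passed as query parameters. The service base URL below is a placeholder.

```python
# Sketch of a QIDO-RS study-level search request: a GET on /studies with
# DICOM attribute keywords as query parameters. The base URL is a placeholder;
# PatientID and ModalitiesInStudy are standard attribute keywords.
from urllib.parse import urlencode

base = "https://pacs.example.org/dicom-web"
params = urlencode({"PatientID": "P001", "ModalitiesInStudy": "CT"})
url = f"{base}/studies?{params}"
assert url.endswith("/studies?PatientID=P001&ModalitiesInStudy=CT")
```

The matching study descriptions are returned as structured metadata (typically JSON), from which the client picks the UIDs needed to retrieve the objects.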

DICOM-Related Challenges

The DICOM standard solves some problems related to interoperability between systems of different manufacturers. Besides, it is a flexible standard that can be readjusted to changes in medical imaging equipment and technologies [33].

The DICOM standard normalizes the object structure, coding and communication protocol. However, a problem of inter-institutional interoperability remains. For instance: (1) the patient ID can vary from one institution to another; (2) a patient whose name is “John Doe Smith” can be known as “John D. Smith” in one hospital and “John Smith” in another. DICOM actually solves the second problem, but our experience shows that such details are commonly neglected. Consequently, the filling of the DICOM file fields can change from hospital to hospital. In any case, the DICOM standard is not prepared for the first situation, since it is mainly focused on intra-institutional processes. We consider as intra-institutional processes all transactions between sites, nearby or distant, that share the same governance. Another huge disadvantage is the standard’s weak file searching capability: it only supports searching over some mandatory DICOM fields and it does not support content-based searches.

2.1.3 IHE

As previously described, in the beginning, healthcare information systems were like the “Tower of Babel”, where each manufacturer defined its own processes: communication protocol, data representation, and so on. Therefore, only systems of the same manufacturer were interoperable. In this context, standards like DICOM and Health Level 7 (HL7) (i.e. a standard for the integration, exchange and retrieval of electronic health information) were (and still are) of great usefulness for the healthcare service. They standardized numerous processes, making equipment of different manufacturers interoperable, at least to a certain level.

In fact, DICOM and HL7 solve many interoperability issues inside one institution, but much remained to be done for interoperability between systems of different healthcare institutions. Moreover, for a better healthcare service, as much data as possible should be made available to physicians. As such, in 1998, a collaboration between the Healthcare Information and Management Systems Society (HIMSS), the Radiological Society of North America (RSNA) and healthcare enterprise professionals resulted in the Integrating the Healthcare Enterprise (IHE) [34, 35, 36, 37] initiative, which aimed to promote collaboration between institutions [34, 38].

IHE is not a standard per se. Rather, it adopts existing standards, such as DICOM and HL7, and defines integration profiles. These are specifications of how the standards must be implemented in healthcare institutions so that a system is interoperable with others [39]. For instance, IHE defines how different actors must cooperate to run a specific task [38].

Although IHE covers different application domains (e.g. IT infrastructure, radiology and laboratory), it is most extensively used in the radiology area. In general, this framework is well accepted in the healthcare environment. In fact, it is so well accepted that a “connectathon” is held yearly: a kind of contest in which manufacturers try to obtain a certificate of IHE compatibility for their products.

IHE defines numerous integration profiles [40], each one for a specific set of tasks. For instance:

Retrieve Information for Display: This integration profile provides access to remote patient-centric information. For instance, it defines the presentation formats of that information.

Cross-Enterprise User Authentication: This profile is used for user authentication in transactions with other healthcare institutions. It defines how the identification of the user must be provided.

Notification of Document Availability: It is an extension of the Cross-Enterprise Document Sharing profile. It allows notifications about changes in the availability of documents by e-mail.

Audit Trail and Node Authentication: It defines security measures to provide authenticity, privacy and data integrity.

Consistent Time: The consistent time profile defines mechanisms for time synchronization among the systems of the multiple healthcare institutions.

Document Digital Signature Content Profile: It defines how digital signatures must be handled and stored in a cross-enterprise document sharing repository.

Patient Identifier Cross-Referencing: One of the flaws of standards oriented towards a single institution is that a patient has a different identification in each institution. This integration profile provides a mechanism to associate the different identifiers of the same patient with each other.

Special emphasis goes to Cross-Enterprise Document Sharing (XDS), a profile created by IHE for sharing clinical documents among institutions. This integration profile introduces the concept of Clinical Affinity Domain, which is the group of institutions that agree to share their documents. These institutions must agree on some policies, such as patient identification and coding [40]. XDS does not specifically define any kind of document; instead, it only specifies metadata for document location. This integration profile defines the interaction of the following system actors (figure 2.4):

Document Source: This actor represents any healthcare institution that belongs to the Clinical Affinity Domain. In order to share documents with the Clinical Affinity Domain, this actor must send the documents to the Document Repository.

Document Registry: The Document Registry is the actor that handles the metadata of the Clinical Affinity Domain’s shared documents. This actor also supports the searching mechanism.

Document Repository: This actor is the one actually responsible for the storage and retrieval of shared documents. Nevertheless, it does not store the documents’ metadata and therefore cannot be queried. For document retrieval, the client must know the document’s unique identifier.


Document Consumer: The Document Consumer is the client application in which the images are visualized. In order to search, this actor must query the Document Registry to obtain the unique identifiers of the wanted documents and then retrieve them from the Document Repository.

Patient Identity Source: This actor is responsible for managing and assigning the identifiers of the patients of the different institutions.

Figure 2.4: Diagram of the actors and the transactions between them in the XDS architecture (Patient Identity Feed, Provide and Register Document Set, Register Document Set, Registry Stored Query and Retrieve Document Set).

The XDS integration profile has an extension especially focused on the sharing of medical imaging documents, called Cross-Enterprise Document Sharing for Imaging (XDS-I) [41].

2.2 Distributed Architectures

2.2.1 Enabling Technologies

As previously stated, medical imaging has taken advantage of computing advances; one of them is the capability of establishing distributed processes to support data sharing and remote access. For that reason, this section briefly describes three technologies at this level: Peer-to-Peer (P2P), Grid and Cloud.

Peer-to-Peer

A P2P network is a distributed system where the participants (peers) share a part of their own resources, for instance files, Central Processing Unit (CPU) cycles, storage and printers. The set of these shared resources provides the service that the network was made for [42]. In this kind of system, each peer acts as a “servent” (server + client), i.e. each peer has the capability of acting as client and server at the same time, providing and consuming resources.

In fact, the usage of the Internet infrastructure is growing in volume and services, for instance file sharing, instant messaging, Voice over Internet Protocol (VoIP), Cloud computing, social networking and crypto-currencies. Many of those services’ implementations are supported by P2P technology.

Stepping back in time, the birth of this concept is wrongly associated with the appearance of Napster in 1999. Actually, it was already being used in the Ethernet protocol [43], which treats all machines equally. Another example of the usage of the P2P concept is an ancestor of the Internet called the Advanced Research Projects Agency Network (ARPANET), which was built as a P2P network that linked some United States universities as equal computing peers [44], where each university shared its own resources. The novelty of Napster was thus not the P2P concept but its ascension in the Open Systems Interconnection (OSI) model from the bottom layers to the application layer. Nevertheless, until Napster, home computers on the Internet were treated as dumb machines with no resources. For numerous reasons, the use of this technology in the design of a system may bring valuable assets, such as:

1. Efficient usage of computational resources, i.e. with a P2P network, resources shared by other machines can be used, which would otherwise be wasted.

2. Resiliency and scalability. In comparison with the Client/Server (C/S) architecture, a system built over P2P technology is less likely to be unavailable and, at the same time, it may easily grow according to the number of users of the system.

3. Common machines have the opportunity to share their own resources in order to collaborate in the provision of a service and, at the same time, use the resources of the other machines.

Nevertheless, the shift of the responsibility for resource provision from servers to peers also brings some problems, for instance:

1. Peers are less reliable than servers, i.e. peers can join or leave the network at any time.

2. The question “How does a new peer know which address it must contact in order to join the network?” is a common problem in P2P architecture design, known as the bootstrap problem. It is caused by the dispersion of resources and the volatility of peers in the network, which makes accessing them difficult. While it may be assumed that a server is almost always available, no assumptions can be made when dealing with, for instance, home computers.

3. Some peers may be malicious; in other words, they may intend to damage the network, other peers or the service provided by the network.

4. The management and control of many machines may be difficult. As an example, changing the P2P protocol may imply the update of the software of each peer.

5. Peers with fewer computational capabilities should not be burdened with certain responsibilities, so that the network performance is not compromised.

Grid

Grid computing is a computing and data management platform that supports a global society, mainly focused on resource sharing in a coordinated, flexible and secure way [45, 46, 47]. According to Andrew Grimshaw, “a grid is a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service” [48].

This concept arose from the need for computer processing that a single machine was not able to provide [46]. Therefore, in resemblance to P2P, grid computing aims to connect numerous machines so they can share resources within a virtual community. Nevertheless, unlike P2P, grid computing is supported by trusted relationships and the control of Quality of Service (QoS) [49].

One important characteristic of this computing paradigm is the Virtual Organization (VO). A VO is a group of individuals or organizations defined by a set of rules about what is shared, when and by whom [49]. Usually, these VOs group users and resources that share common interests and goals. This characteristic gives the grid computing system the possibility of managing the system’s resources and users.

Another characteristic of grid computing is QoS control [47, 48, 50], which includes the following processes: (i) the negotiation between the service requester and the provider about the service level; (ii) the monitoring of service quality, in other words the estimation, planning and adjustment of resource usage; and (iii) the migration of executing services for load balancing [51]. This control is needed to ensure the attribution of resources to those who need (or deserve) them the most. In this way, it is possible to give some guarantees about the service that is provided to the VO members.

Therefore, the target communities of grid computing are different from the P2P ones. The former is more suitable for communities whose members are usually linked by trust relationships, with means to sanction inappropriate behavior [49]. Instead, P2P systems are typically focused on reaching the masses, ideally with no assumptions whatsoever about the stability and trustworthiness of the members. Therefore, grid computing systems are usually not especially designed to be scalable, even though the developers of such systems have been paying attention to this matter [52].

Cloud

Although Cloud is a trendy word, there is no consensus about its definition [53]. In fact, there is not even agreement among the scientific community about its novelty: for some, it is a new computing paradigm; for others, it is nothing more than a new name for old technologies [15]. Nevertheless, the common points of the Cloud definitions describe it as a distributed system technology that consists in the aggregation of distributed resources into one single system, aiming at virtualization (in other words, decoupling the business service from the needed infrastructure) and scalability (i.e. the capability of the system to grow when needed) [14, 15].

As Jeff Bezos once said: “You don’t generate your own electricity. Why generate your own computing?” It was a big old dream to have computing as a utility like water, electricity, gas and telephony. This is finally possible thanks to the Cloud, because this technology allows the provision of resources (for instance CPU, storage, applications and files) as a general utility through the Internet [54, 55, 56]. As such, with the Cloud, resources shift their location from the users’ machines to the Cloud’s servers, freeing users from the responsibility of maintaining the infrastructures needed to execute a task. Therefore, it is now possible to perform tasks of high resource consumption on equipment with little computational power, for instance PDAs, mobile phones and cars [16]. It must be highlighted that this is possible due to the usual web interface of Cloud services, which enables their easy access and portability [53].

As previously mentioned, consensus is not abundant where the Cloud is concerned. So, not surprisingly, there is no agreement about the architecture of Cloud systems, and several Cloud architecture designs are documented [57, 58].


The main reason for choosing the Cloud paradigm is its business model and the partial release from the IT infrastructure and the associated maintenance costs. In order to evaluate the operational costs, several aspects must be taken into account, for instance server hardware, network equipment, software licenses, energy, air conditioning, maintenance and technological obsolescence. Having the whole IT infrastructure indoors forces institutions to size the infrastructure according to the demands of the highest-activity moments, which causes the misuse of resources at lower-activity times, for instance at night, on weekends and in the holiday season. With the Cloud, it is possible to pay only for the resources used, because the allocation of resources may be done according to the needs, leaving the worries of scaling the infrastructure to the Cloud service provider [55].
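The pay-per-use argument can be made concrete with a toy cost model. All figures below (the hourly demand profile and the unit cost) are made-up assumptions for illustration, not measurements.

```python
# Toy cost comparison between provisioning for peak demand on-premises and
# paying per use in the Cloud. All numbers are made-up assumptions.
peak_servers = 10                     # capacity sized for the busiest hour
cost_per_server_hour = 0.50           # assumed unit cost, same in both models

# Assumed demand: 10 servers needed during 8 working hours, 2 otherwise.
demand = [10] * 8 + [2] * 16          # servers needed per hour of one day

on_premises = peak_servers * 24 * cost_per_server_hour   # always-on peak capacity
pay_per_use = sum(demand) * cost_per_server_hour         # only what is actually used

assert on_premises == 120.0
assert pay_per_use == 56.0            # less than half, under these assumptions
```

The gap between the two figures widens as the demand profile becomes burstier, which is precisely the situation described above for nights, weekends and holidays.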

Some public Cloud computing services have a geo-distributed infrastructure. This means that the quality of service is more uniformly distributed around the world than if the service were hosted in only one location. Besides, this characteristic makes the service more disaster-proof: even if a disaster (for instance a fire or an earthquake) occurs in one of the places where the Cloud service is hosted, other servers running in other parts of the world will keep the service available.

In the e-health scenario, hospitals spend a significant budget on local IT infrastructure [59]. Even at the physical level, reducing the IT infrastructure in a healthcare center may result in a reduction of the space needed to host the servers and other machines, leaving more useful space for the provision of healthcare services, which is the main concern of this kind of institution.

Summing up, the Cloud’s characteristics make it stand out among the other technologies described in this document, considering that both grid computing and P2P use the resources made available by the institutions. Another reason for choosing the Cloud paradigm is that it is a recent concept with plenty of room for research and innovation. Besides, the IT industry is increasingly adopting this paradigm, which leads us to the belief that, in the future, thanks to the Cloud, we will actually have computing as a utility, like electricity and telecommunications.

2.2.2 Practical Cases

Nowadays, inter-institutional exchange of imaging data is done through traditional channels, including mail, e-mail, the patient himself or private solutions over a Virtual Private Network (VPN). However, all these solutions have drawbacks that can make data exchange unfeasible. For instance: (1) a VPN can be hard to set up and manage, for bureaucratic and technical reasons; (2) traditional mail can take too long; (3) the patient can forget, lose or damage the exams; (4) through e-mail, it is necessary to know beforehand where the data will be needed, or to wait until the person responsible for the data answers the request. Thus, efforts have been made to solve this problem, and the need to develop systems with distributed repositories has been addressed in the literature in recent years.

DIPACS [60] is a P2P tele-radiography system that uses Java Remote Method Invocation (RMI) for communication between different institutions. Nevertheless, the architecture of this system has some drawbacks, such as the fact that each institution needs to set up routers and firewalls to allow communication via RMI. Moreover, it has no automatic discovery mechanism: the nodes of the system must be previously registered in a server with a public address.

In [61], Bian et al. described a distributed file system based on the P2P paradigm. This system is especially focused on the distribution and security of data among the nodes that belong to the system. However, node identification and discovery are carried out through the broadcast of messages, which is not possible over the Internet. Therefore, the nodes of this system are only able to communicate with each other if they are inside the same LAN.

Other P2P systems were also developed, such as the ones described in [62, 63]. Those two systems are based on JXTA [64], and both use a rendezvous peer for promoting communications between peers from different institutions. Nevertheless, the use of only one node as rendezvous compromises the system’s scalability. Besides, this architecture does not support important scenarios such as, for instance, IT infrastructure outsourcing.

Solutions based on the grid paradigm are also present in the literature, such as the ones described in [65, 66, 67, 68, 69, 70]. However, the greatest disadvantages of those systems are some bureaucratic and technical issues related to joining and setting up a grid [71].

With the advent of Cloud services, some systems were also developed to take advantage of such technology, like the ones described in [17, 18]. In these two examples, the Cloud is used to maintain a central repository of the medical imaging data. In this way, the repository is endowed with the scalability and resiliency inherent to Cloud technology. However, moving the repository from the institution to the Cloud raises latency problems, since accessing data stored somewhere on the Internet is usually slower than accessing data stored inside the institution.
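The latency gap named above is, in essence, the difference between a WAN round-trip and a local read, and it is also why local caching helps. The sketch below uses an assumed 50 ms round-trip (an illustrative figure, not a measurement) and a plain dictionary as the local cache:

```python
# Illustrative sketch of Cloud-access latency and of how a local cache
# hides it. The 50 ms WAN round-trip is an assumption for the example.
import time

CLOUD_RTT = 0.05  # assumed 50 ms round-trip to the Cloud repository

cloud_store = {"study-001": b"<dicom bytes>"}  # hypothetical remote repository
local_cache = {}

def fetch(study_uid):
    if study_uid in local_cache:   # cache hit: no WAN access at all
        return local_cache[study_uid]
    time.sleep(CLOUD_RTT)          # cache miss: pay the WAN latency
    data = cloud_store[study_uid]
    local_cache[study_uid] = data
    return data

t0 = time.perf_counter(); fetch("study-001"); cold = time.perf_counter() - t0
t0 = time.perf_counter(); fetch("study-001"); warm = time.perf_counter() - t0
print(f"cold: {cold*1000:.1f} ms, warm: {warm*1000:.3f} ms")
```

The first access pays the round-trip while the repeated access is served locally, which anticipates the latency-handling strategies explored in this thesis.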

Silva et al. also developed a solution that ensures interoperability between the DICOM universe and the Cloud [17]. As shown in figure 2.5, apart from the modalities and workstations,


this system is mainly constituted by the following components:

• The master index is responsible for implementing the access control policies, being the element with access to the keys with which data stored in the Cloud services is encrypted. This component is also responsible for maintaining an index of all resources stored in the system. Because of these duties, it must be hosted in a trustworthy place, for instance, a private server or a private Cloud.

• The Cloud slave is the component responsible for storing encrypted chunks of the DICOM data. Since the data is encrypted, this component can be hosted in a public Cloud service.

• The gateway is a component that translates between DICOM communications and the Web services interface of the master index and Cloud slaves. In this way, the gateway serves as an intermediary between any DICOM-compliant PACS and the other components of the system.
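The division of responsibilities above can be sketched as follows. This is not the system's actual code: the class names and the chunking are invented, and the XOR keystream cipher is a toy stand-in for a real cipher such as AES. What it shows is the trust boundary: the key and the resource index stay on the master index, so the Cloud slave only ever holds opaque encrypted chunks.

```python
# Sketch of the master-index / Cloud-slave split: the key never leaves
# the trusted host, and the public Cloud stores only encrypted chunks.
# The keystream cipher is a toy placeholder for a real one (e.g. AES).
import hashlib
import itertools

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a SHA-256-derived keystream.
    Applying it twice with the same key recovers the plaintext."""
    stream = itertools.chain.from_iterable(
        hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        for i in itertools.count())
    return bytes(b ^ k for b, k in zip(data, stream))

class MasterIndex:
    def __init__(self, key):
        self.key = key    # kept only on the trusted host
        self.index = {}   # resource id -> ordered list of chunk ids

class CloudSlave:
    def __init__(self):
        self.chunks = {}  # chunk id -> encrypted bytes (opaque to the Cloud)

master, slave = MasterIndex(b"secret-key"), CloudSlave()

def store(resource_id, data, chunk_size=4):
    chunk_ids = []
    for i in range(0, len(data), chunk_size):
        cid = f"{resource_id}/{i}"
        slave.chunks[cid] = keystream_xor(master.key, data[i:i + chunk_size])
        chunk_ids.append(cid)
    master.index[resource_id] = chunk_ids

def retrieve(resource_id):
    return b"".join(keystream_xor(master.key, slave.chunks[cid])
                    for cid in master.index[resource_id])

store("study-001", b"DICOMDATA")
assert retrieve("study-001") == b"DICOMDATA"
```

In the real architecture the gateway would sit between a DICOM-compliant PACS and these two components, performing the protocol translation.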

As the authors concede, this architecture has communication latency problems that hinder its adoption in the real world [17].

Figure 2.5: Architecture of the system proposed by Silva et al. (figure from [17]).

Another system described in the literature is MIAPS [72], a web-based system to access and present remote medical images. Its architecture is depicted in figure 2.6. It has some points in common with the system proposed by Silva et al. [17]: the query/retrieve service is divided into two components, the image index and the transmission server. However, the main concerns regarding this architecture are related to the direct access to the DICOM


repository and to not considering how images are sent to the repository. Besides, since the repository stores the DICOM files and the server must communicate with it directly through Fiber or SATA, such a system cannot be deployed in a public Cloud service, otherwise data privacy would not be ensured.
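One concrete technique from the MIAPS transmission server is tile transmission: the server cuts a large image into fixed-size tiles and sends the viewer only the tiles covering the requested region, instead of the whole file. The sketch below is an illustration under assumed dimensions (a 1024x1024 slice with 256-px tiles), not MIAPS code:

```python
# Illustrative tile-transmission sketch: compute which tiles of a large
# image overlap the viewer's requested region, so only those are sent.

def tiles_for_region(width, height, tile, x0, y0, x1, y1):
    """Return (col, row) indices of the tiles overlapping the region
    [x0, x1) x [y0, y1) of a width x height image split into tile-px tiles."""
    cols = range(x0 // tile, min((x1 - 1) // tile + 1, -(-width // tile)))
    rows = range(y0 // tile, min((y1 - 1) // tile + 1, -(-height // tile)))
    return [(c, r) for r in rows for c in cols]

# A 1024x1024 slice, 256-px tiles, viewer requests a 300x300 window:
needed = tiles_for_region(1024, 1024, 256, 100, 100, 400, 400)
print(needed)  # only 4 of the 16 tiles need to be transmitted
```

Serving a small viewport this way transmits a fraction of the study, which is why on-demand transmission matters in high-latency environments.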

[Figure 2.6 shows the MIAPS front-end (user login, medical image retrieval, image display, image maintenance, image download) communicating over HTTP with a retrieval server, which builds the image index (adjacency matrix, basic attributes, semantic attributes, low-level features), and a transmission server, which performs on-demand transmission (DICOM visualization, multi-resolution, region and tile transmission); both servers access the DICOM repository through Fiber/SATA.]

Figure 2.6: Architecture of MIAPS proposed by Shen et al. (adapted from [72]).

Another system architecture that uses the Cloud to store medical imaging data is the one described by Yang et al. in [73]. This architecture is especially designed for reliability and fault tolerance. For that purpose, it uses co-allocation, which allows parallel downloading of images. Nevertheless, as the authors indicate, file access performance still needs to be improved.
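The co-allocation idea can be sketched briefly: when the same study is replicated on several servers, distinct byte ranges can be fetched from different replicas in parallel and reassembled locally. The replica names, the in-memory "file", and the stand-in for a range request below are all invented for the example:

```python
# Sketch of co-allocated downloading: split a file into ranges, fetch
# each range from a different replica in parallel, reassemble in order.
from concurrent.futures import ThreadPoolExecutor

REPLICAS = ["cloud-a", "cloud-b", "cloud-c"]  # hypothetical replica names
DATA = b"0123456789ABCDEF"                    # the file every replica holds

def fetch_range(replica, start, end):
    # Stand-in for an HTTP range request to one replica.
    return start, DATA[start:end]

def coallocated_download(size, n_chunks):
    bounds = [(i * size // n_chunks, (i + 1) * size // n_chunks)
              for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        futures = [pool.submit(fetch_range, REPLICAS[i % len(REPLICAS)], s, e)
                   for i, (s, e) in enumerate(bounds)]
        parts = sorted(f.result() for f in futures)  # reorder by offset
    return b"".join(chunk for _, chunk in parts)

assert coallocated_download(len(DATA), 4) == DATA
```

Because each replica serves only part of the file, the transfer is bounded by the slowest chunk rather than by a single server's full-file throughput, which is what gives co-allocation its fault-tolerance and performance appeal.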

In conclusion, the literature lacks a system for medical imaging sharing and outsourcing of IT infrastructure with the following characteristics: (1) easy to deploy and maintain; (2) able to grow according to the needs; (3) ensuring data privacy from the IT infrastructure providers; and (4) addressing the problem of latency, one of the most critical issues in this kind of system, as described in [19], since the Wide Area Network (WAN) is a high-latency environment.

2.3 Cloud Usage in Medical Imaging Services

Two main use-case scenarios in which using the Cloud is advantageous for medical imaging were identified during this doctorate. They are described in the following sub-sections.
