An approach for evaluating and mitigating intra-application I/O performance variability over parallel file systems



Thesis submitted to the Graduate Program in Computer Science for the degree of Doctor in Computer Science.

Advisor: Prof. Mario Antonio Ribeiro Dantas, Ph.D.

Florianópolis 2019


Catalog record prepared by the author through the Automatic Generation Program of the UFSC University Library.

Inacio, Eduardo Camilo

An approach for evaluating and mitigating intra-application I/O performance variability over parallel file systems / Eduardo Camilo Inacio ; advisor, Mario Antonio Ribeiro Dantas, 2019.

156 p.

Thesis (doctorate) - Federal University of Santa Catarina, Technological Center, Graduate Program in Computer Science, Florianópolis, 2019.

Includes references

1. Computer Science. 2. I/O performance variability. 3. Parallel file systems. 4. System modeling. 5. Workload generator. I. Dantas, Mario Antonio Ribeiro. II. Federal University of Santa Catarina. Graduate Program in Computer Science. III. Title.



This thesis was judged adequate for obtaining the title of “Doctor in Computer Science” and approved in its final form by the Graduate Program in Computer Science.

Florianópolis, June 24, 2019.

Prof. Dr. José Luís Almada Güntzel, Coordinator

Examining Committee:

Prof. Mario Antonio Ribeiro Dantas, Ph.D., Advisor

Prof. Felipe Maia Galvão França, Ph.D. (by videoconference), Federal University of Rio de Janeiro

Prof. Maurício Aronne Pillon, Ph.D. (by videoconference), Santa Catarina State University

Prof. Dr. Ronaldo dos Santos Mello, Federal University of Santa Catarina


I would like to express my gratitude to my advisor, Prof. Mario Dantas, who guided me through this challenging task. Many of the opportunities that have made me the researcher that I am today, I owe to him, and for that, I will always be grateful.

I am grateful for research ideas exchanged with Professors Alex Pinto, Frank Siqueira, Juliana Eyng, Márcio Castro, Patrícia Plentz, Pedro Barbetta, and Sergio Petters from the Department of Informatics and Statistics (INE) at the Federal University of Santa Catarina (UFSC), and Prof. Douglas Macedo from the Department of Information Science (CIN) at UFSC.

I wish to express my gratitude to Prof. Kenji Ono and researchers Jorji Nonaka, Kazunori Mikami, and Tomohiro Kawanabe, from the Advanced Visualization Research Team at RIKEN Advanced Institute for Computational Science (AICS), for the opportunity to participate, for three months, in the RIKEN AICS International HPC Computational Science Internship Program, in Japan. I would also like to mention Aya Motohashi, from RIKEN AICS, for her competent administrative support.

I would like to thank my colleagues from the Distributed Systems Research Laboratory at UFSC, Alexis Huf, Alyson Deives, Bruno Oliveira, Eliza Gomes, Felipe Volpato, Ivan Salvadori, and Pedro Penna, for continuous discussions on varying research topics.

I wish to express my appreciation to the Graduate Program in Computer Science (PPGCC) at UFSC, represented by all its faculty members. I also wish to acknowledge Katiana Castro and Ivan Lemke for their competent administrative support.

This research work was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Experiments presented in this thesis were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).


Leto Atreides, in Frank Herbert’s novel Children of Dune, 1976.


To meet the growing capacity and performance requirements of data-intensive applications, highly distributed, multilayered storage systems have been employed in high performance computing (HPC) environments. One of the main components of these infrastructures is the parallel file system (PFS), a system designed to absorb bulk data transfers from applications with thousands of processes. Load distribution on PFS data servers is an important source of input/output (I/O) performance variability. Although reducing this variability is desirable, since it is known to harm application-perceived performance, understanding and dealing with variability in these complex environments remains a challenge. This research proposes a differentiated approach for evaluating and mitigating intra-application I/O performance variability in PFSs. An analytical model, named DTSMaxLoad, provides estimates for the maximum load on a data server. To complement DTSMaxLoad by modeling conditions and mechanisms that are hard to represent analytically, the Parallel I/O and Storage System (PIOSS) simulation model was proposed. For experimental evaluation in real environments, a flexible and distributed I/O performance evaluation tool, called IOR-Extended (IORE), was proposed. Finally, a high-level file distribution approach for PFSs, called N-N Round-Robin (N2R2), was proposed to reduce I/O performance variability for applications in which each process accesses an independent file. An extensive experimental effort was carried out in this research work to evaluate each of the proposed approaches. In summary, this evaluation indicated that the DTSMaxLoad and PIOSS modeling proposals can represent load distribution behavior in PFSs with significant fidelity. Additionally, the results demonstrated that N2R2 successfully reduced intra-application I/O performance variability in 270 distinct experimental scenarios, which ultimately translated into overall improvements in application I/O performance.

Keywords: I/O performance variability, parallel file systems, analytical modeling, simulation, workload generator, file distribution


Introduction

As we enter the era of data-driven science, significant growth has been observed in the volume of data produced and consumed by scientific and engineering applications. To meet the growing capacity and performance demands imposed by such applications, modern high performance computing (HPC) environments have employed highly specialized and distributed storage infrastructures composed of multiple layers. At the core of these infrastructures are parallel file systems (PFSs), such as Lustre and OrangeFS, in which files are partitioned and distributed among multiple data servers to deliver high throughput.

Owing to the complexity and the large number of elements that make up such environments, numerous factors can affect the input/output (I/O) performance perceived by large-scale, data-intensive distributed applications. Previous research has investigated the effect of workload characteristics, system architecture and topology, software configurations, and hardware properties on the I/O performance of PFSs. However, most of these studies focused on absolute performance, partially or totally ignoring the effects of performance variability.

Performance variability refers to the fluctuation of the values measured for a given performance metric. I/O performance variability can be caused by inter-application interference, when multiple applications compete for a shared resource, or by intra-application interference, when processes of the same application share a resource. With the advent of extreme-scale computational science applications, which use all the resources made available by the HPC environment, intra-application interference becomes particularly relevant.

An important source of intra-application I/O performance variability in PFSs is the method used to distribute files among data servers, since it directly influences the load balance of the system. Reducing this variability therefore has the potential to improve the I/O performance perceived by an application. However, information about the magnitude of this variability, and about how it is affected by the multiple factors that influence the I/O performance of the system, is still scarce, which makes dealing with it a major challenge.

Objectives

The general objective of this research work is to devise an approach for evaluating parallel I/O performance in PFSs that takes intra-application variability into account and makes it possible to reduce its impact on the performance of data-intensive distributed computational science applications.

Proposal

This research work proposes a comprehensive performance evaluation approach, combining complementary methods with different levels of detail and accuracy, and an approach for reducing intra-application I/O performance variability in PFSs. The main contributions of this research are: an analytical model for estimating the maximum load on a data server, named DTSMaxLoad; a simulation model for fast evaluation of I/O performance in PFSs, called Parallel I/O and Storage System (PIOSS); a flexible and distributed approach for evaluating parallel I/O performance in real environments, called IOR-Extended (IORE); and a high-level file distribution approach for PFSs, aimed at distributed applications in which each process accesses an individual and independent file, called N-N Round-Robin (N2R2).

A distinguishing feature of DTSMaxLoad is that it considers the stripe width parameter, present in most PFSs, and splits the problem according to the application's process-to-file mapping: N-1, when all processes of the application share a single file, and N-N, when each process accesses an individual and independent file. Using a small set of input parameters, DTSMaxLoad supports preliminary analyses of the load balance of a PFS at low computational cost for a wide range of conditions and usage scenarios.
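To make the notion of maximum load on a data server concrete, the sketch below enumerates by brute force the scenario described in the caption of Figure 14: two files with nine and eight units of data, three data servers, a stripe fragment size of three units, and a stripe width of two. It is not DTSMaxLoad itself. The function names are hypothetical, and the placement rule (each file starts at an arbitrary server and stripes round-robin over `width` consecutive servers) is an assumption made only for illustration.

```python
from itertools import product


def file_load(size, start, n_servers, fragment, width):
    """Units of one file stored on each server.

    Assumption: fragment k of the file goes to the (k mod width)-th of
    `width` consecutive servers, beginning at server `start`.
    """
    load = [0] * n_servers
    for k, pos in enumerate(range(0, size, fragment)):
        server = (start + k % width) % n_servers
        load[server] += min(fragment, size - pos)
    return load


def max_loads(sizes, n_servers, fragment, width):
    """Map every combination of starting servers to the resulting maximum load."""
    result = {}
    for starts in product(range(n_servers), repeat=len(sizes)):
        total = [0] * n_servers
        for size, start in zip(sizes, starts):
            per_server = file_load(size, start, n_servers, fragment, width)
            for server, units in enumerate(per_server):
                total[server] += units
        result[starts] = max(total)
    return result


loads = max_loads(sizes=[9, 8], n_servers=3, fragment=3, width=2)
for starts, peak in sorted(loads.items()):
    print(starts, peak)
```

Under these assumptions, the most loaded server holds between 8 and 11 of the 17 units depending only on where each file starts, which is the kind of spread an analytical estimate of the maximum load is meant to capture without enumeration.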

The PIOSS simulation model was proposed to complement DTSMaxLoad by modeling conditions and mechanisms that are hard to represent analytically. Combining a moderate level of detail with a distributed simulation architecture, PIOSS allows the evaluation of temporal and spatial aspects related to the I/O performance of a PFS, as well as its variability. Additionally, the modularity of its implementation favors extension, allowing the […]


[…] providing a differentiated approach for the experimental evaluation of the I/O performance of existing storage infrastructures. More than a synthetic I/O workload generator, IORE was designed to support complete experimental workflows, from the definition and execution of the experiment to making the results available for analysis. With regard to performance variability, IORE allows experiments to be replicated with randomization of the execution order of the experimental scenarios, in order to satisfy the assumptions of most statistical data analysis methods.

Finally, the N2R2 file distribution approach was designed to operate between the application and the PFS. Through distributed coordination of the file creation process, N2R2 proposes a heuristic method for balancing the load on PFS data servers. In this way, I/O performance variability for applications with N-N mapping is reduced and, ultimately, the average I/O performance is potentially improved.

Results and Discussion

The approaches proposed in this research were extensively evaluated through a large number of experiments involving real environments. In these evaluations, statistical analysis and design of experiments methods were employed to increase the reproducibility of the results and the soundness of the conclusions.

The accuracy of the proposed PIOSS simulation model was evaluated by comparing the predicted load distribution with the one observed in a real environment, considering different file distribution methods, numbers of processes, process-to-file mappings, and PFS configurations. An analysis of variance (ANOVA) applied to the results of 144 experimental scenarios found no statistical evidence of a difference between simulated and measured results at a 95% confidence level. This result indicates the high accuracy of PIOSS in representing load distribution in PFSs.

Once the accuracy of PIOSS with respect to real environments had been verified, it was used to evaluate the accuracy of the proposed analytical model, DTSMaxLoad. In this evaluation, the maximum load on a data server estimated by DTSMaxLoad is compared with the load estimated by PIOSS and by the approach used by other state-of-the-art methods. The results demonstrated that DTSMaxLoad is accurate for scenarios with N-1 mapping, while a mean squared percentage error (MSPE) of 0.5% was observed for scenarios with N-N mapping. For comparison, the state-of-the-art approach presented MSPEs of 27% and 12% for the same scenarios, respectively.

Finally, the N2R2 file distribution approach was evaluated with respect to different metrics and compared with approaches traditionally found in PFSs, namely random and round-robin. Considering 270 experimental scenarios, N2R2 was found to considerably reduce the variability of the write time per process. The mean coefficient of variation (CV) observed in the results with N2R2 was 28%, while with the other approaches the mean CV was 45%. The load distribution on the data servers was also examined in this evaluation. The mean CV of the number of bytes per data server was 3% for N2R2, 19% for round-robin, and 28% for random. The I/O performance perceived by the application improved significantly with N2R2. At a 95% confidence level, N2R2 provided a gain in parallel I/O throughput between 52% and 70% when compared with the random and round-robin methods. Although the total file creation time with N2R2 was between 20% and 31% higher than with the other methods, a reduction of the total application execution time between 20% and 27%, on average, was observed.

Final Remarks

Based on the results of the evaluations carried out, it is possible to conclude that this research successfully achieved its stated objectives. By proposing complementary evaluation methods, including analytical modeling, simulation, and measurement, this research provides a better understanding of the impact of intra-application I/O performance variability in PFSs. Additionally, a file distribution approach for PFSs was presented that, through effective load balancing, considerably reduces I/O performance variability for applications with complex process-to-file mappings.

Keywords: I/O performance variability, parallel file systems, analytical modeling, simulation, workload generator, file distribution


To meet the ever-increasing capacity and performance requirements of emerging data-intensive applications, highly distributed and multilayered back-end storage systems have been employed in large-scale high performance computing (HPC) environments. A main component of these storage infrastructures is the parallel file system (PFS), a file system specially designed to absorb bulk data transfers from applications with thousands of concurrent processes. Load distribution on PFS data servers is a major source of intra-application input/output (I/O) performance variability. Although mitigating variability is desirable, as it is known to harm application-perceived performance, understanding and dealing with I/O performance variability in such complex environments remains a challenging task. In this research, a differentiated approach for evaluating and mitigating intra-application I/O performance variability over PFSs is proposed. More specifically, from the evaluation perspective, a comprehensive approach combining complementary methods is proposed. An analytical model, named DTSMaxLoad, provides estimates for the maximum load on a PFS data server. To complement DTSMaxLoad by modeling conditions and mechanisms that are hard to represent analytically, the Parallel I/O and Storage System (PIOSS) simulation model was proposed. Finally, for experimental evaluation in real environments, a flexible and distributed I/O performance evaluation tool, named IOR-Extended (IORE), was proposed. Furthermore, a high-level file distribution approach for PFSs, called N-N Round-Robin (N2R2), was proposed, focusing on mitigating I/O performance variability for distributed applications in which each process accesses an individual and independent file. An extensive experimental effort, including measurements in real environments, was conducted in this research work to evaluate each of the proposed approaches. In summary, this evaluation indicated that both the DTSMaxLoad and PIOSS modeling proposals can represent load distribution behavior on PFSs with significant fidelity. Moreover, results demonstrated that N2R2 successfully reduced intra-application I/O performance variability in 270 distinct experimental scenarios, which ultimately translated into overall application I/O performance improvements.

Keywords: I/O performance variability, parallel file system, analytical modeling, simulation, workload generator, file distribution


Figure 1 Demand shifting in the computational science evolution cycle.
Figure 2 Pareto chart with the ANOVA mean square and the cumulative percentage of the variability explained by considered factors.
Figure 3 Conceptual framework considered in this research work.
Figure 4 Overview of an infrastructure model and I/O software stack typically observed in HPC environments.
Figure 5 Data access parallelism at the application level.
Figure 6 The NetCDF file format.
Figure 7 An example of an HDF5 file with its main components.
Figure 8 File partitioning among processes through MPI file view.
Figure 9 A ROMIO collective write example.
Figure 10 A simple example of the file striping technique in a PFS.
Figure 11 OrangeFS main components and data access flow.
Figure 12 Lustre main components in a basic cluster.
Figure 13 Proposed approach components in the context of the conceptual framework considered in this research work.
Figure 14 Graphical view of all possible load distributions for a scenario with two files, having nine and eight units of data, and a PFS with three data servers, a stripe fragment size of three units of data, and a stripe width of two.
Figure 15 Simplified tree of possibilities for the maximum number of times a data server can be selected for a scenario with three files, and a PFS with three data servers, and a stripe width of two.
Figure 16 Components of the PIOSS simulation model.
Figure 17 Event flow of a file opening in PIOSS.
Figure 18 Event flow of a file write in PIOSS.
Figure 19 PIOSS distributed simulation architecture.
Figure 20 Overview of PIOSS internal software structure.
Figure 21 Layout of LPs mapping to PEs in PIOSS.
Figure 22 IORE experiment conceptual structure.
Figure 23 Comparison between IORE and IOR parameters for offset-based workloads.
Figure 24 IORE Cartesian dataset workload generator.
Figure 25 Comparison between IORE and IOR on generating workloads based on data sets with block-block partitioning.
Figure 26 Overview of IORE main software components.
Figure 27 Overview of the N2R2 file distribution approach.
Figure 28 Algorithm for selecting PFS data servers for a file in N2R2.
Figure 29 Load distribution on PFS data servers simulated with PIOSS and observed on an OrangeFS environment, considering applications with N-N process-to-file mapping.
Figure 30 Load distribution on PFS data servers simulated with PIOSS and observed on an OrangeFS environment, considering applications with N-1 process-to-file mapping.
Figure 31 Maximum load in a data server estimated with DTSMaxLoad, state-of-the-art analytical models, and PIOSS, considering applications with N-1 process-to-file mapping.
Figure 32 Maximum load in a data server estimated with DTSMaxLoad, state-of-the-art analytical models, and PIOSS, considering applications with N-N process-to-file mapping.
Figure 33 Maximum load in a data server estimated with state-of-the-art analytical models and PIOSS, considering applications with N-N process-to-file mapping.
Figure 34 Write time per process using N2R2, random, and round-robin file distribution approaches.
Figure 35 File sizes generated by IORE in experimental scenarios with heterogeneous workloads.
Figure 36 Load distribution on PFS data servers using N2R2, random, and round-robin file distribution approaches.
Figure 37 Parallel I/O throughput using N2R2, random, and round-robin file distribution approaches.
Figure 38 Total file creation time using N2R2, random, and round-robin file distribution approaches.
Figure 39 Application I/O wallclock time using N2R2, random, and round-robin file distribution approaches.


Table 1 List of main symbols, and their related meaning, considered in the DTSMaxLoad analytical modeling proposal.
Table 2 Comparison between DTSMaxLoad and state-of-the-art analytical models for high performance I/O and storage systems in terms of response variable and a selected set of context-relevant parameters.
Table 3 Comparison between DTSMaxLoad and state-of-the-art analytical models for high performance I/O and storage systems in terms of process-to-file mapping and variability considerations.
Table 4 Comparison between PIOSS and state-of-the-art PFS simulation proposals in terms of detailing level, MPI support for distributed execution, simulation of general PFS designs, and code availability.
Table 5 Comparison between IORE and state-of-the-art synthetic I/O workload generators in terms of workload flexibility and heterogeneity, support for different I/O libraries, export of performance results, and execution of experiment runs in randomized order.
Table 6 Comparison between N2R2 and state-of-the-art file distribution approaches focusing on I/O performance variability and load balance in PFS data servers in terms of I/O layer integration, process-to-file mapping, decision method used for driving data server selection, and code availability.
Table 7 Overview of hardware characteristics of Grid’5000 clusters used in the experimental evaluation.
Table 8 Parameters and values considered in PIOSS evaluation.
Table 9 Parameters and values considered in DTSMaxLoad evaluation.
Table 10 Parameters and values considered in N2R2 evaluation.
Table 11 Mean coefficient of variation of the number of bytes per data server using N2R2, random, and round-robin file distribution approaches.


1 INTRODUCTION
1.1 HYPOTHESIS
1.2 OBJECTIVES
1.3 SCOPE AND KEY ASSUMPTIONS
1.4 DOCUMENT STRUCTURE
2 HIGH PERFORMANCE I/O AND STORAGE
2.1 INFRASTRUCTURE MODEL
2.2 APPLICATION I/O WORKLOAD
2.3 HIGH-LEVEL I/O LIBRARIES
2.3.1 NetCDF
2.3.2 HDF5
2.4 I/O MIDDLEWARE
2.5 PARALLEL FILE SYSTEMS
2.5.1 OrangeFS
2.5.2 Lustre
2.6 CONCLUDING REMARKS
3 RELATED WORKS
3.1 I/O PERFORMANCE MODELING
3.2 PARALLEL FILE SYSTEM SIMULATION
3.3 I/O WORKLOAD GENERATORS
3.4 FILE DISTRIBUTION IN PARALLEL FILE SYSTEMS
3.5 CHALLENGES AND OPPORTUNITIES
4 PROPOSAL
4.1 MODELING LOAD BALANCING IN DATA SERVERS
4.1.1 General Problem Definition
4.1.2 Model for N-1 Mapping-based Applications
4.1.3 Model for N-N Mapping-based Applications
4.1.3.1 Single Process Application
4.1.3.2 Distribution Over All Data Servers
4.1.3.3 At Least One Data Server for All Files
4.1.3.4 One Data Server per File
4.1.3.5 Summary of N-N Mapping-focused Approaches
4.1.4 DTSMaxLoad Considerations
4.2 PIOSS: PARALLEL I/O AND STORAGE SYSTEM DISTRIBUTED SIMULATION MODEL
4.2.1 System Modeling
4.2.2 Architecture and Implementation
4.2.3 PIOSS Considerations
4.3 IORE: A FLEXIBLE AND DISTRIBUTED I/O PERFORMANCE EVALUATION APPROACH
4.3.1 Experiment-driven Execution
4.3.2 Heterogeneous Offset-based Workloads
4.3.3 Dataset-based Workloads
4.3.4 Performance Metrics Collection and Output
4.3.5 Implementation Aspects
4.3.6 IORE Considerations
4.4 N2R2: A HIGH-LEVEL FILE DISTRIBUTION APPROACH FOR PARALLEL FILE SYSTEMS
4.4.1 Selecting Data Servers of a File
4.4.2 Implementation Aspects
4.4.3 N2R2 Considerations
5 EXPERIMENTAL ENVIRONMENTS AND RESULTS
5.1 ENVIRONMENTS DESCRIPTION
5.2 PIOSS EVALUATION
5.3 DTSMAXLOAD EVALUATION
5.4 N2R2 EVALUATION
6 CONCLUSIONS AND FUTURE WORKS
6.1 FUTURE WORKS
REFERENCES
APPENDIX A -- Publications


1 INTRODUCTION

Digital data production and consumption are increasing at unprecedented rates. By 2020, approximately 44 zettabytes (ZB) of data, or $44 \times 10^{12}$ gigabytes (GB), is expected to be created and copied annually on a global scale (TURNER et al., 2014). Projections for emerging Big Data and Internet of Things (IoT) applications are even more impressive, surpassing the yottabyte scale, which refers to an amount of 1000 ZB (ACKX, 2014; RADHA et al., 2015). In scientific and engineering domains, a similar phenomenon has been observed.

Historically, the high performance computing (HPC) field has been mainly focused on processing-related topics, mostly due to the increasing processing demands of large-scale computational science applications. Such demands have motivated a great deal of research, resulting in the development of novel methods and techniques aimed at improving the processing power available for these applications. At the same time, the availability of more powerful computing infrastructures has encouraged the investigation of even larger and more complex problems, resulting in the cycle illustrated in Figure 1.

Figure 1: Demand shifting in the computational science evolution cycle. [Cycle diagram with the stages: Larger Problems, Increasing Processing Demand, Research & Development, Processing Power Improvement, Increasing Data Volumes.]

In the era of data-intensive scientific discovery (HEY; TANSLEY; TOLLE, 2009), however, as computational science applications grow in scale, a significant increase in the data volumes produced and consumed has been observed (RAICU; FOSTER; BECKMAN, 2011). Powerful instruments employed in experimental and observational research, such as particle accelerators, telescopes, satellites, and genome sequencers, can produce terabytes of data per second and, even after aggressive compression, store petabytes of data annually (BELL; HEY; SZALAY, 2009; CERN, 2016; GUZMAN et al., 2016). Computer simulation applications in cosmology, high-energy physics, chemistry, earth sciences, and oil & gas, to name a few, can generate terabyte to petabyte-scale output in a single run (HABIB et al., 2016; ROTEN et al., 2016; SATOH et al., 2017). These huge data sets are usually processed by visualization and data analytics applications to extract meaningful information to support scientific discoveries and breakthroughs (MITCHELL; AHRENS; WANG, 2011; NONAKA; ONO; FUJITA, 2014; DORIER et al., 2016).

This data deluge (HEY; TREFETHEN, 2003) poses great challenges to storage systems (GORTON et al., 2008). Although storage devices have increased greatly in capacity, they have failed to keep pace with processing performance improvements in recent decades (DENG, 2011; LÜTTGAU et al., 2018). Consequently, most systems show a widening gap between their processing performance, measured in floating point operations per second (flops), and the maximum input/output (I/O) bandwidth they offer (LIU et al., 2014). This I/O bottleneck heavily affects the performance of data-intensive applications, because precious processing cycles are wasted while data sets are moved to or from memory (SONG et al., 2011; NGUYEN; TAN; ZHANG, 2017). These observations provide compelling arguments for current and future research on storage systems based on distributed architectures.

Modern data-intensive computational science applications are usually executed on highly distributed computing environments, such as clusters, grids, clouds, and supercomputers, with complex and deep file I/O paths (PRABHAT; KOZIOL, 2014). In most petascale systems, i.e., large-scale computers capable of executing $10^{15}$ flops, there are separate resources for compute and storage purposes. Application processes run on thousands of specialized compute nodes, communicating through high-speed interconnects with a remote back-end storage system. The remote back-end storage system is usually multilayered, consisting of intermediate I/O nodes, a scratch file system, and an archival storage (LOCKWOOD et al., 2017). The archival storage maintains long-term data (LÜTTGAU; KUNKEL, 2017), such as users' home and project directories, while the scratch file system is intended for absorbing bulk data transfers from concurrent applications and processes running on compute nodes (KOGGE et al., 2008). Intermediate I/O nodes can alleviate the pressure on the scratch file system by aggregating requests from multiple compute nodes (ALI et al., 2009), and hide the I/O latency of the scratch file system from applications, acting as a burst buffer (LIU et al., 2012).

The scratch file system is traditionally implemented through a parallel file system (PFS), such as OrangeFS (MOORE et al., 2011), Lustre (BRAAM; SCHWAN, 2002), Ceph (WEIL et al., 2006), IBM Spectrum Scale (QUINTERO et al., 2015), and Panasas File System (PanFS) (NAGLE; SERENYI; MATTHEWS, 2004). In most PFSs, files are divided into multiple contiguous chunks, which are distributed across a cluster of data servers (CARNS et al., 2000; OPENSFS; EOFS, 2019). This approach provides a high data rate, especially for large read and write requests, as they can be served by multiple data servers in parallel. Despite relying on such highly specialized storage infrastructures, most computational science applications rarely attain more than a fraction of the peak I/O bandwidth in HPC environments (SONG et al., 2012; LÜTTGAU et al., 2018).
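The chunk distribution described above can be illustrated with a minimal sketch. It assumes plain round-robin striping, in which chunk k of a file is stored on server k modulo the number of servers; actual PFSs add parameters and policies not modeled here. The function name, the server labels, and the 64 KiB stripe size are hypothetical.

```python
from collections import defaultdict


def stripe_layout(offset, length, stripe_size, servers):
    """Bytes of the request [offset, offset + length) that land on each server.

    Assumption: chunk k of the file lives on servers[k % len(servers)].
    """
    if stripe_size <= 0 or not servers:
        raise ValueError("stripe_size must be positive and servers non-empty")
    load = defaultdict(int)
    end = offset + length
    pos = offset
    while pos < end:
        chunk = pos // stripe_size
        chunk_end = (chunk + 1) * stripe_size
        n = min(end, chunk_end) - pos  # bytes of the request inside this chunk
        load[servers[chunk % len(servers)]] += n
        pos += n
    return dict(load)


# A 640 KiB write split into 64 KiB chunks over four hypothetical data servers.
request = stripe_layout(offset=0, length=10 * 64 * 1024,
                        stripe_size=64 * 1024,
                        servers=["ds0", "ds1", "ds2", "ds3"])
print(request)
```

In this example the ten chunks are spread as 3, 3, 2, and 2 chunks per server, so a single large request is served by all four servers in parallel, although not perfectly evenly.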

In such complex computing environments, many factors can affect the I/O performance perceived by an application: workload characteristics, system architecture and topology model, configurations in the many layers of the parallel I/O software stack, properties of hardware components, interference and system noises, to name a few. A number of research works investigated workload and performance characteristics of specific HPC applications and environments (YU; VETTER; ORAL, 2008; LANG et al., 2009; CARNS et al., 2009b; KIM et al., 2010; CARNS et al., 2011; SAINI et al., 2012; GUNASEKARAN et al., 2015; LUU et al., 2015; WANG et al., 2017; LOCKWOOD et al., 2018; INACIO et al., 2018). In previous works, we provided a more general and granular perspective of the impact of different factors in the performance of write operations in PFSs (INACIO et al., 2015; INACIO; DANTAS; MACEDO, 2015; INACIO; PILLA; DANTAS, 2015). It is noteworthy that these research works have mainly focused on the absolute I/O performance, partially or totally ignoring the effects of I/O performance variability.

Performance variability refers to the fluctuation of values measured for a performance metric. In a general sense, performance variability can be caused by many factors (KRAMER; RYAN, 2003), even though it usually results from some sort of contention due to concurrent accesses to a shared resource (SKINNER; KRAMER, 2005). Therefore, mitigating performance variability normally results in performance improvements. From an application perspective, both internal and external interferences can contribute to the I/O performance variability (LOFSTEAD et al., 2010). External, or inter-application, interferences relate to effects perceived by an application due to other applications accessing the remote back-end storage system at the same time. Internal, or intra-application, interferences, on the other hand, denote contention caused by processes from the application itself.
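As a minimal illustration of this definition, the sketch below quantifies the fluctuation of a set of measured values with the coefficient of variation (sample standard deviation divided by the mean). Both the choice of this metric and the bandwidth values are assumptions made for illustration only; they are not measurements or definitions taken from this research work.

```python
import statistics


def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean


# Hypothetical write bandwidths (MiB/s) from repeated runs of one experiment.
stable_runs = [980.0, 1000.0, 1020.0, 1000.0]
noisy_runs = [600.0, 1000.0, 1400.0, 1000.0]

cv_stable = coefficient_of_variation(stable_runs)
cv_noisy = coefficient_of_variation(noisy_runs)
print(f"stable: {cv_stable:.3f}  noisy: {cv_noisy:.3f}")
```

Both hypothetical series share the same mean, so an analysis restricted to the absolute performance would not tell them apart, whereas the variability metric does.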

[...] variability propose inter-application coordination approaches focusing on reducing the impact of multiple applications simultaneously executing on the HPC system (LOFSTEAD et al., 2010; DORIER et al., 2012; KUO et al., 2014; DORIER et al., 2014; YILDIZ et al., 2016; WAN et al., 2017b, 2017a). However, on extreme-scale computational science applications, denoting applications with dedicated access to all resources of the HPC environment for execution, I/O performance variability can be mostly attributed to intra-application interferences. One major, yet intrinsic, source of intra-application interference in this context relates to the load balance on PFS’s data servers.

In a previous research work (INACIO; BARBETTA; DANTAS, 2017), we conducted a statistical analysis of the impact of nine factors on the I/O performance variability of read and write operations over files in a PFS. A main result of this study, presented in the Pareto chart of Figure 2, demonstrates that 99.32% of the performance variability in these experiments is explained by four main and two interaction effects, according to the analysis of variance (ANOVA) performed. Apart from the massive impact of the evaluated experimental environments (71.02%), mostly attributed to considerable differences in their interconnection networks, the stripe width and the process-to-file mapping, and their interaction, appear as the most impacting factors, accounting for 25.68% of the I/O performance variability. These results confirm load balancing effects on intra-application I/O variability, given that both factors are closely related to the file distribution in PFSs.
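The 25.68% share quoted above can be recovered from the cumulative percentages reported in Figure 2. The short sketch below differences the cumulative values to obtain the share of each effect; the dictionary keys simply reuse the factor labels of the chart, and the variable names are illustrative.

```python
# Cumulative percentage of variability explained, in Pareto order
# (values as reported in Figure 2).
cumulative = {
    "Environment": 71.02,
    "Stripe Width": 87.14,
    "File Mapping": 92.26,
    "Stripe Width x File Mapping": 96.70,
    "Operation": 98.28,
    "Operation x Environment": 99.32,
}

# Individual share of each effect = difference between consecutive values.
individual = {}
previous = 0.0
for factor, value in cumulative.items():
    individual[factor] = round(value - previous, 2)
    previous = value

# Share of the two file-distribution factors and their interaction.
distribution_share = round(
    individual["Stripe Width"]
    + individual["File Mapping"]
    + individual["Stripe Width x File Mapping"],
    2,
)
print(individual)
print(distribution_share)
```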

Figure 2: Pareto chart with the ANOVA mean square and the cumulative percentage of the variability explained by considered factors

[Figure 2 data, as mean square (cumulative percentage): Environment, 1202.01 (71.02%); Stripe Width, 272.97 (87.14%); File Mapping, 86.64 (92.26%); Stripe Width x File Mapping, 75.03 (96.7%); Operation, 26.87 (98.28%); Operation x Environment, 17.5 (99.32%).]

Source: (INACIO; BARBETTA; DANTAS, 2017).

Without any information about an application workload, current PFSs distribute file chunks across data servers in a random or semi-random fashion (CARNS et al., 2000; OPENSFS; EOFS, 2019). In such conditions, especially when the application adopts an N-N process-to-file mapping, in which each process accesses an individual and independent file, the data server selections for different files may overlap, resulting in some data servers being more loaded than others. This is particularly concerning in computational science applications, as many parallel I/O operations are synchronized, which means that computation continues only when all processes finalize the I/O operation (SONG et al., 2011b). Aside from our previous experimental effort (INACIO; BARBETTA; DANTAS, 2017), information about the real magnitude of the intra-application I/O performance variability over PFSs and how it can be related to relevant application and PFS factors is scarce. Moreover, although some previous research works proposed file distribution approaches aiming at balancing the load on PFS data servers from an intra-application perspective (LOFSTEAD et al., 2010; WANG et al., 2014; NEUWIRTH et al., 2016, 2017; SON et al., 2017), all of them rely on some sort of up-to-date and dynamic information about the system load. A drawback of these approaches is that maintaining such a global view of the system normally requires a centralized control or a communication overhead in order to distribute the load information. In both cases, I/O performance scalability can be impaired as the number of resources increases. Therefore, addressing intra-application I/O performance variability, both in terms of evaluation and optimization, remains a challenging task.
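The overlap effect of random file distribution under an N-N mapping can be sketched with a small simulation. The code below is a simplified model, not the distribution algorithm of any particular PFS: it assumes each file is placed on `stripe_width` distinct servers chosen uniformly at random, and all names and parameter values are hypothetical.

```python
import random
from collections import Counter


def random_distribution(num_files, num_servers, stripe_width, seed=0):
    """Pick `stripe_width` distinct servers per file, uniformly at random."""
    rng = random.Random(seed)
    # Start every server at zero so idle servers also appear in the result.
    load = Counter({server: 0 for server in range(num_servers)})
    for _ in range(num_files):
        for server in rng.sample(range(num_servers), stripe_width):
            load[server] += 1
    return load


# Hypothetical scenario: 32 processes, one file each (N-N), 16 data servers.
load = random_distribution(num_files=32, num_servers=16, stripe_width=4)
mean_load = sum(load.values()) / len(load)
max_load = max(load.values())
print(f"mean files per server: {mean_load:.1f}, busiest server: {max_load}")
```

A perfectly balanced placement would put exactly the mean number of files on every server. Any excess on the busiest server matters because, with synchronized I/O, the most loaded server tends to determine when the whole operation finishes.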

1.1 HYPOTHESIS

This research work explores the hypothesis that to mitigate intra-application I/O performance variability over PFSs, a better understanding regarding this variability and its interplay with the load balance across data servers of a PFS is required. Furthermore, in order to obtain such understanding, an effective approach should combine complementary performance evaluation methods, including modeling and measurement approaches, as illustrated in Figure 3.


Figure 3: Conceptual framework considered in this research work.

[Figure 3 elements: Evaluation (Modeling: Analytical, Simulation; Measurement), Understanding, Optimization.]

It is worth mentioning that the evaluation methods presented in this conceptual framework can be considered the main methods for general performance evaluation (JAIN, 1991; OBAIDAT; BOUDRIGA, 2010). Analytical models, which stand for models built upon mathematical functions, can provide an approximate perspective of the expected performance behavior of either existing or envisioned systems, and are particularly useful for preliminary analysis. Simulation models, on the other hand, can provide a more detailed representation of a system, which usually results in higher accuracy. Finally, measurement methods are intended for assessing performance behavior during the operation of existing systems.

1.2 OBJECTIVES

The main objective of this research work is to devise a comprehensive performance evaluation approach that leads to a better understanding of the impact of load balance in PFS data servers on intra-application I/O performance variability, and to address this impact focusing on data-intensive computational science applications. This objective can be described in terms of the following specific objectives:

1. To identify the main application and PFS factors with significant effect on the intra-application I/O performance variability;
2. To model parallel I/O performance variability caused by load imbalance in PFS data servers in terms of its impacting factors;
3. To design and implement an I/O performance evaluation tool that incorporates features to support variability considerations into experimental efforts;
4. To design, implement, and evaluate a file distribution approach that mitigates the I/O performance variability perceived by data-intensive computational science applications accessing data in PFSs.

1.3 SCOPE AND KEY ASSUMPTIONS

This research work focuses on I/O performance variability issues perceived by data-intensive computational science applications running over HPC back-end storage systems based on PFSs, like OrangeFS and Lustre. A main feature that defines the target storage environments is the parallel access to multiple parts of a file across a cluster of remote data servers. Therefore, other types of distributed storage systems, such as Hadoop File System (HDFS) and NoSQL and NewSQL distributed databases, which do not conform with the described target environment, are not investigated in this research work.

Additionally, this research work focuses mainly on the performance of data movement operations, namely, reads and writes, under the assumption that these operations are more relevant regarding the I/O performance of data-intensive applications accessing large files. Metadata I/O operations (e.g., open, close, stat), which are more likely to impact applications accessing multiple small files, are not explored in this research work.

Finally, the analysis of the influence of features considered orthogonal to applications’ I/O activity, such as data security and dependability, is also out of the scope of this research work.

1.4 DOCUMENT STRUCTURE

This document is structured as follows. Chapter 2 presents an overview of high performance I/O and storage systems. A discussion about research works focusing on I/O performance variability issues is presented in Chapter 3. In Chapter 4, the proposal of this research work is presented, detailing its main differences to the state of the art. Experimental methods and results are presented and discussed in Chapter 5. In Chapter 6, a summary of the contributions of this research work is presented, followed by directions for future works. Finally, Appendix A concludes this document, listing publications resulting from this research work in journals, conference proceedings, and book chapters.


2 HIGH PERFORMANCE I/O AND STORAGE

Present-day high performance computing (HPC) systems, such as clusters, grids, and supercomputers, consist of a very specialized and complex combination of hardware and software elements. Their design mainly focuses on providing high processing power for large-scale distributed and parallel applications, even though low communication and data access latency are considered increasingly important requirements (LUCAS et al., 2014; CAPPELLO et al., 2016). The complexity of most modern HPC systems is abstracted from users by resource management systems (RMSs), such as PBS (HENDERSON, 1995), SLURM (YOO; JETTE; GRONDONA, 2003), HTCondor (THAIN; TANNENBAUM; LIVNY, 2005), or meta-schedulers, such as Globus (FOSTER, 2005), GridWay (HUEDO; MONTERO; LLORENTE, 2007), and Dédalo (INACIO; SIQUEIRA; DANTAS, 2016).

Users interact with such large-scale HPC environments by submitting jobs. A job specifies one or more applications, input and output parameters, the expected total execution time, the required computing resources, and, eventually, other information that describes the desired task and environment (GREHANT; DEMEURE; JARP, 2013). Submitted jobs are preprocessed and enqueued by the RMS (or meta-scheduler) according to varying criteria, including priority and computing resource demands. Once a job is ready to be serviced, it is dequeued, requested computing resources are allocated, and specified applications are launched for execution in the HPC system. It is worth noticing that, except for small to medium-scale systems, users seldom interact directly with HPC system resources. This approach focuses, among other things, on optimizing resource utilization. Furthermore, it alleviates the burden on users regarding resource allocation and setup, data movement, and application execution on systems with hundreds to hundreds of thousands of nodes.

Figure 4 presents an overview of a general infrastructure model and an input/output (I/O) software stack commonly observed in many modern HPC environments. A main characteristic observed in these environments is the separation of compute and storage resources, which results in massive data movements through layers of the I/O path during the execution of a data-intensive application. Even in this simplified illustration, the complexity of such environments is apparent. The infrastructure model is detailed in the next section, while subsequent sections discuss I/O software-related topics.


Figure 4: Overview of an infrastructure model and I/O software stack typically observed in HPC environments.

[Figure 4 elements. Infrastructure model: Compute Nodes, File System Network, File System Servers, Storage Interconnection, Storage Devices. I/O software stack: Application, High-Level I/O Library, I/O Middleware, POSIX, File System Client, File System Server.]

2.1 INFRASTRUCTURE MODEL

Applications are executed on compute nodes, which are nodes specifically designed for improved processing power. Compute nodes are commonly equipped with multi-core or many-core processors, accelerators (e.g., GPUs, FPGAs, co-processors), and volatile random-access memories (RAMs). High-speed interconnection fabrics, usually of proprietary technology, provide low latency communication among application’s processes running on compute nodes. In large-scale HPC systems, especially in supercomputers, compute nodes are diskless, and persistent storage service is provided by a remote back-end storage system (MIYAZAKI et al., 2012; XU et al., 2014; LUU et al., 2015; XIE et al., 2017a). This separation facilitates scaling processing and storage resources individually and independently. Furthermore, embedding storage devices on compute nodes is known to decrease the mean time to failure (MTTF) of the system (BENT et al., 2012).

Compute nodes communicate with the remote back-end storage system through a file system network, which is generally built on commodity technologies, such as Gigabit Ethernet (GbE) and InfiniBand. The remote back-end storage system can comprise multiple tiers. Traditionally, the first storage tier, also known as the scratch file system, is implemented through a parallel file system (PFS) (PRABHAT; KOZIOL, 2014). Such high performance distributed file systems are designed to support highly concurrent data accesses and to provide high data rate, especially for large I/O requests, by distributing files across multiple file system servers.

Data on file system servers are persisted into storage devices. In spite of the availability of newer and faster non-volatile storage technologies, such as solid-state drives (SSDs) and other non-volatile random-access memories (NVRAMs), hard disk drives (HDDs) are still the most used in HPC scratch file systems (LÜTTGAU et al., 2018). The higher durability and lower cost per byte of modern HDDs, compared to the aforementioned alternative technologies, are among the main reasons for their preference. For improved performance and reliability, a popular approach is to combine multiple HDDs into a redundant array of independent disks (RAID) (PATTERSON; GIBSON; KATZ, 1988).

File system servers access storage devices through a storage interconnection. The type of interconnection depends on the storage architecture, among which direct-attached storage (DAS) and storage area network (SAN) are the most observed in HPC systems (TROPPENS et al., 2009). Both DAS and SAN provide block-based access to storage devices. In the DAS architecture, storage devices are connected directly to the I/O bus of the file system server, while in the SAN architecture, a switched interconnection fabric allows multiple file system servers to communicate with multiple storage devices (MESNIER; GANGER; RIEDEL, 2003).

Other storage tiers, not depicted in Figure 4, can be found composing the back-end storage of modern HPC environments. While the scratch file system is mostly intended for absorbing bulk data transfers from concurrent applications and processes running on compute nodes, an archival storage tier, commonly implemented over tape technologies, offers large capacity for long-term files (e.g., project data, users’ home directories) (KOGGE et al., 2008). Dedicated I/O nodes, equipped with low latency storage devices (e.g., SSDs, NVRAMs), also known as burst buffers, can be interposed between compute nodes and the back-end storage system, focusing on alleviating the concurrency and latency on traditional HDD-based file system servers (LIU et al., 2012; LÜTTGAU et al., 2018).

2.2 APPLICATION I/O WORKLOAD

Large-scale HPC environments have been traditionally employed for executing distributed parallel scientific and engineering applications from a wide variety of areas, such as cosmology, high-energy physics, chemistry, earth sciences, oil & gas, to name a few (CARNS et al., 2011; BYNA et al., 2012; BAKER et al., 2014; ROTEN et al., 2016). Most of these applications alternate between very distinguishable execution phases (YU et al., 2017). During the compute phase, application’s processes are busy processing data and communicating with each other, whereas the I/O phase is dominated by processes reading or writing data from or to the back-end storage system. For most applications, the I/O phase is synchronous, in the sense that each process blocks until all processes complete the I/O phase (SONG et al., 2011b). Reasons include enforcing that required data is available at all processes and guaranteeing that each process outputs its data before the application enters the compute phase.
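A direct consequence of a synchronous I/O phase is that its duration is set by the slowest process rather than by the average one. The sketch below illustrates this with hypothetical per-process times; the function name and the values are assumptions used only for illustration.

```python
def io_phase_duration(per_process_seconds):
    """A synchronous I/O phase ends when its slowest process finishes."""
    return max(per_process_seconds)


# Hypothetical per-process I/O times (seconds) for one I/O phase.
times = [2.0, 2.1, 1.9, 6.5]
phase = io_phase_duration(times)
average = sum(times) / len(times)
print(f"phase lasts {phase:.1f} s although the average is {average:.3f} s")
```

In this hypothetical case, a single straggler roughly doubles the time that every other process spends blocked, which is why variability among processes translates directly into lost execution time.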

The I/O activity in HPC applications can be segregated into two groups: productive and defensive I/O (SUBRAMANIYAN et al., 2008; ARUNAGIRI; DALY; TELLER, 2009). Productive I/O considers data accesses contributing to the actual purposes of the application, including loading input data and outputting intermediate and final results. Defensive I/O comprises data accesses originated by fault tolerance mechanisms. The checkpoint/restart approach is the most popular fault tolerance mechanism in the context of HPC applications, and consists of an application’s processes periodically writing out their memory states to the back-end storage system for recovery in the event of system failures (HERAULT et al., 2019). In large-scale HPC systems, defensive I/O is the main source of I/O activity, accounting for about 75% of the overall I/O of applications (ARUNAGIRI; DALY; TELLER, 2009; NEUWIRTH et al., 2017). Furthermore, a recent literature survey (BOITO et al., 2018) indicated that almost a fourth of the considered research works proposing I/O performance optimization techniques used checkpoint/restart as motivation.

In large-scale distributed parallel applications, the workload is usually distributed across a huge number of processes as evenly as possible, focusing on reducing the total execution time (CHEN et al., 2009; CUI et al., 2010; SATOH et al., 2017). From the I/O perspective, this potentially translates into a huge number of processes concurrently generating I/O requests to the back-end storage system (DONG et al., 2012). An I/O request, in the context of this research work, denotes an operation over a file. This operation can involve data transferring, such as read and write operations, or act over the directory structure of the file system, i.e., a metadata operation, such as creating or opening a file (TANENBAUM, 2007). Each data transfer request issued by the application can be fundamentally described in terms of three attributes: the file, the offset, which specifies a position within the file, and the size of the contiguous data block to be transferred.

Different data access parallelism levels can be achieved depending on how an application coordinates its data transfer requests. Based upon approaches observed in distributed applications (BENT et al., 2009; YANG et al., 2014; LANL; NERSC; SNL, 2016; INACIO et al., 2017), and considering the aforementioned attributes, data access parallelism can be mainly categorized into six levels, as illustrated in Figure 5.

Figure 5: Data access parallelism at the application level.

[Figure 5 panels: (a) No parallelism, 1-1; (b) Inter-file parallelism, N-N; (c) Inter-block parallelism, N-1; (d) Intra-block parallelism, N-1; (e) Inter-file, inter-block parallelism, N-M; (f) Inter-file, intra-block parallelism, N-M. Legend: process, data block, file.]

The baseline level, depicted in Figure 5a, actually describes a condition with no data access parallelism. In this condition, a single process carries out I/O on behalf of others. This approach can be interesting for avoiding multiple small data transfers to the back-end storage system, a pattern known to produce poor I/O performance (HILDEBRAND; WARD; HONEYMAN, 2006; CARNS et al., 2009a; LI et al., 2011). In contrast, concentrating the I/O activity on a single node can result in degraded performance due to network congestion and limited memory space in the aggregator node.

Figure 5b illustrates the inter-file parallelism level, in which each process transfers data from/to an individual and independent file. The N-N process-to-file mapping, also known as file-per-process, is easier to program and avoids performance degradation due to consistency enforcing mechanisms required when multiple processes access the same file (YU et al., 2007; LIU et al., 2014; SON et al., 2017). A drawback of this approach may arise in extreme-scale applications, in which the huge number of files might put great pressure on the metadata service of the back-end storage system (CARNS et al., 2011; LIU et al., 2014).

In Figures 5c and 5d, two data access parallelism levels relying on an N-1 process-to-file mapping approach, also known as shared-file, are presented. In the former, called inter-block parallelism, the file is segmented and each process transfers its data block to a contiguous file region. In the intra-block parallelism level, on the other hand, the contiguous data block in each process’ memory is mapped onto a noncontiguous region in the file, following a strided access pattern (NIEUWEJAAR et al., 1996). Inter-block or intra-block parallelism mostly results from the geometry of the problem domain, its partition among application’s processes, and the format of the data set file. Parallel accesses to a shared file may be serialized in the back-end storage system in order to enforce consistency, which in general translates into a performance degradation.
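The difference between the two shared-file levels can be made concrete by listing the (offset, size) pairs each process would issue. The sketch below assumes a raw file with no header, equally sized blocks, a block size that is a multiple of the piece size, and a regular stride; the function names and sizes are hypothetical.

```python
def interblock_requests(num_procs, block_size):
    """Each process writes one contiguous region of the shared file."""
    return {rank: [(rank * block_size, block_size)] for rank in range(num_procs)}


def intrablock_requests(num_procs, block_size, piece_size):
    """Each process writes its block as pieces interleaved with the others."""
    stride = num_procs * piece_size
    pieces = block_size // piece_size
    return {
        rank: [(i * stride + rank * piece_size, piece_size) for i in range(pieces)]
        for rank in range(num_procs)
    }


# Hypothetical sizes: 4 processes, 8-byte blocks, 2-byte strided pieces.
contiguous = interblock_requests(4, 8)
strided = intrablock_requests(4, 8, 2)
print(contiguous[1])  # (offset, size) pairs issued by process 1
print(strided[1])
```

Both patterns cover the same 32-byte file, but the strided one turns a single request per process into several smaller ones.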

Finally, Figures 5e and 5f present data access parallelism levels that explore a trade-off between N-1 and N-N process-to-file mappings. In the N-M mapping, N processes transfer data from/to M files, aiming at exploring the benefits of inter-file along with inter-block and intra-block parallelisms. In general, when M < N, files are distributed to groups of processes, as illustrated in Figures 5e and 5f. This condition is common, for instance, when a visualization system executes with a smaller number of processes than the simulation application that generated the data set (INACIO et al., 2017). With M > N, on the other hand, each process accesses multiple files. An example of such a situation is when a simulation system generates one file per process and per time step.
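For the M < N case, one simple way of distributing files to groups of processes is to assign consecutive ranks to the same file. The sketch below is only one possible grouping, assuming N is a multiple of M; the function name and the grouping rule are hypothetical and not taken from any specific application.

```python
def file_for_process(rank, num_procs, num_files):
    """Map a process rank to a file index by grouping consecutive ranks."""
    if num_procs % num_files != 0:
        raise ValueError("this sketch assumes num_procs is a multiple of num_files")
    group_size = num_procs // num_files
    return rank // group_size


# Hypothetical case: N = 8 processes sharing M = 2 files.
mapping = [file_for_process(rank, 8, 2) for rank in range(8)]
print(mapping)
```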

It is important to observe that some characteristics of the application I/O workload, such as the number of I/O processes and the process-to-file mapping, can be specified during application design and development. Others, such as the access pattern (e.g., sequential or strided) or the request size, will mostly emerge naturally according to the data set structure and the file organization. As data sets become increasingly complex, specialized libraries have been proposed to help application developers to specify and coordinate their applications’ I/O activities.

2.3 HIGH-LEVEL I/O LIBRARIES

High-level I/O libraries provide data set abstractions and enhanced programming interfaces to allow application developers to become more productive and to obtain improved I/O performance. Moreover, high-level I/O libraries commonly provide their own file formats, which add cross-application and cross-platform portability to data sets. The most popular high-level I/O libraries for general-purpose computational science applications are NetCDF (REW; DAVIS, 1990) and HDF5 (The HDF Group, 1997).

2.3.1 NetCDF

The Network Common Data Form (NetCDF) provides an interface for accessing multidimensional arrays of typed data in a portable and self-describing file format (REW; DAVIS, 1990). A NetCDF file comprises two parts, namely, file header and raw data. Figure 6 depicts the NetCDF file format. The file header maintains metadata about data arrays in the raw data section, including information on dimensions, variables, and attributes. Fixed-size arrays are stored in contiguous space, following a predefined linear order. NetCDF also supports variable-size arrays with one unlimited dimension through record-oriented I/O, in which records are stored interleaved in a regular pattern (REW; DAVIS, 1990; LATHAM, 2014).

To write a data set with NetCDF, the developer must first define the dimensions of the data set, the variables and their association with dimensions, and, optionally, annotate (i.e., add attributes to) data set components. Originally, the NetCDF I/O library allowed only serial access to data sets. Up to version 3, to perform parallel I/O to NetCDF files, users had to rely on third-party libraries, such as the Parallel-NetCDF (PnetCDF) (LI et al., 2003; LATHAM, 2014). Native parallel I/O support in NetCDF started in version 4, initially built on top of the HDF5 library (UNIDATA, 2019).


Figure 7: An example of an HDF5 file with its main components.

[Figure 7 legend: HDF5 Group, HDF5 Attribute, HDF5 Dataset, HDF5 Link; nodes labeled A to H.]

Source: Adapted from Koziol et al. (2014).

2.4 I/O MIDDLEWARE

The POSIX I/O application programming interface (API), while portable, was not designed for parallel I/O in distributed applications. For instance, POSIX provides limited features to express noncontiguous data access patterns, common in computational science applications, and lacks features for coordinating global data transfers between processes and files (KIMPE; ROSS, 2014). To overcome such limitations and to offer more efficient parallel data accesses, I/O middlewares have been developed.

An I/O middleware, in the context of this research work, is a software layer that provides parallel data access services to distributed applications. As depicted in Figure 4, the I/O middleware is usually built on top of POSIX or file system-specific APIs. Applications and high-level I/O libraries can, in turn, exploit features and optimizations provided by I/O middlewares while carrying out their parallel file data accesses. In current HPC environments, predominantly consisting of distributed architectures (DONGARRA et al., 2019), the I/O middleware is mainly provided by MPI-IO implementations.

The MPI-IO specification was first defined as part of the second version of the Message Passing Interface (MPI) standard to provide a comprehensive API, focusing on I/O portability, parallelism, and [...]

[...] referred to as collective I/O (THAKUR; GROPP; LUSK, 1999b). In ROMIO, a two-phase I/O method is employed for collective I/O (THAKUR; GROPP; LUSK, 1999a). An example of a collective write in ROMIO is presented in Figure 9. In this example, three processes write each to a noncontiguous region of the file. First, processes broadcast information on their required data regions, and a file domain is defined for each process. Then, in the communication phase, processes exchange data according to the file domain. This data is maintained in a temporary collective buffer. Lastly, in the I/O phase, each process writes its collective buffer content to the file. The method for collective reads is identical, except that the I/O phase happens before the communication phase.

Figure 9: A ROMIO collective write example.

[Figure 9 elements: processes P1, P2, and P3; application buffer; communication phase; temporary collective buffer; file domains of P1, P2, and P3; I/O phase; file.]
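The two phases of a collective write can be mimicked with a small in-memory simulation. The sketch below is a simplified illustration and not ROMIO’s implementation: it assumes evenly sized, contiguous file domains, pieces that do not cross file-domain boundaries, and a known total file size in place of the initial exchange of access information. All names are hypothetical.

```python
def two_phase_write(requests, file_size, num_procs):
    """Simulate a two-phase collective write.

    `requests[rank]` is a list of (offset, data) pairs; pieces are assumed
    not to cross file-domain boundaries.
    """
    domain_size = -(-file_size // num_procs)  # ceiling division
    # Temporary collective buffer of each process, covering its file domain.
    buffers = [
        bytearray(max(0, min(domain_size, file_size - rank * domain_size)))
        for rank in range(num_procs)
    ]

    # Communication phase: route every piece to the owner of its file domain.
    for pieces in requests.values():
        for offset, data in pieces:
            owner = offset // domain_size
            start = offset - owner * domain_size
            buffers[owner][start:start + len(data)] = data

    # I/O phase: each process issues one contiguous write for its domain.
    file_content = bytearray(file_size)
    writes = []
    for owner, buffer in enumerate(buffers):
        start = owner * domain_size
        file_content[start:start + len(buffer)] = buffer
        writes.append((start, len(buffer)))
    return bytes(file_content), writes


# Three processes, each writing two noncontiguous 2-byte pieces.
requests = {
    0: [(0, b"AA"), (6, b"AA")],
    1: [(2, b"BB"), (8, b"BB")],
    2: [(4, b"CC"), (10, b"CC")],
}
content, writes = two_phase_write(requests, file_size=12, num_procs=3)
print(content, writes)
```

In this example, six small noncontiguous pieces reach the file as three contiguous 4-byte writes, which is the effect the two-phase method aims for.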

Users can pass additional information to I/O functions through what MPI-IO refers to as hints (MPI Forum, 2015). MPI-IO provides some predefined hints for defining access patterns or file system specifics, although MPI-IO implementations are free to define custom hints (THAKUR; GROPP; LUSK, 1999b). The main purpose of user-supplied hints is to allow an MPI-IO implementation to perform further I/O performance optimization.

2.5 PARALLEL FILE SYSTEMS

Parallel file systems (PFSs), such as OrangeFS (MOORE et al., 2011), Lustre (BRAAM; SCHWAN, 2002), Ceph (WEIL et al., 2006), IBM Spectrum Scale (QUINTERO et al., 2015), and Panasas File System (PanFS) (NAGLE; SERENYI; MATTHEWS, 2004), are specialized distributed file systems whose design focuses on providing high data rate to large-scale distributed parallel applications. Although differences in design and implementation can be observed, in general, PFSs present a cluster architecture composed of three main components: clients, data servers, and metadata servers. Clients, usually located at compute nodes (or intermediate I/O nodes), provide the interface to the file system namespace. Data servers are responsible for storing file contents, while metadata servers keep up-to-date information about files (e.g., location, name, timestamps, owner, permissions).

In order to provide high data rate, especially for large I/O requests, most PFSs stripe file data across multiple data servers, a technique known as file striping (CORBETT; FEITELSON, 1996). An example of the file striping technique in a typical PFS is presented in Figure 10. In this example, a write request of 400 KiB is carried out by the application. The PFS client, upon receiving the request, divides the request data into fixed-size stripe fragments, and distributes them, in a round-robin fashion, across a subset of the available data servers. This way, a PFS can serve a single I/O request from multiple data servers in parallel, leveraging the aggregated throughput of their network interfaces and disks to improve the data transfer rate.

Figure 10: A simple example of the file striping technique in a PFS.

[Figure 10 elements: an application issues a 400 KiB write request; the parallel file system client divides it into stripe fragments of 64 KiB; with a stripe width of 4, the fragments are distributed over parallel file system servers (Server #1 to Server #5).]

The file distribution method, i.e., the method employed for selecting the subset of data servers for storing stripe fragments from a file, can also vary among implementations. Nevertheless, the main methods employ some random component focusing on evenly distributing the amount of data across all available data servers. The number of data servers per file, known as the stripe width or stripe count, and the stripe fragment size are usually parameters of the PFS. In many PFSs, these parameters can be set on a per-directory or per-file basis, and are often considered at file creation time.

It is worth noticing that, while a file is always aligned with the stripe fragment, i.e., the first stripe fragment always begins at the first offset of the file, this is not necessarily true for every I/O request. Also, only the last fragment of a file can be incomplete, in the sense that it is smaller than the stripe fragment size parameter. This can be observed in Figure 10 as well, supposing the write request transfers the entire file.
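The striping arithmetic described above can be sketched as a function that splits a request into per-server pieces. This is a minimal model of round-robin striping, not the code of any specific PFS; the function name is hypothetical, and the particular subset of servers used in the example is an assumption, since only the stripe width and the number of servers are given in Figure 10.

```python
def stripe_request(offset, size, fragment_size, servers):
    """Split a request into (server, file_offset, length) pieces.

    `servers` is the ordered subset of data servers holding the file;
    fragments are assigned to them round-robin by fragment index.
    """
    pieces = []
    end = offset + size
    while offset < end:
        index = offset // fragment_size
        # A piece never extends past the end of its stripe fragment.
        length = min(end, (index + 1) * fragment_size) - offset
        pieces.append((servers[index % len(servers)], offset, length))
        offset += length
    return pieces


KIB = 1024
# Figure 10 scenario: 400 KiB write, 64 KiB fragments, stripe width 4.
# The chosen server subset is hypothetical.
pieces = stripe_request(0, 400 * KIB, 64 * KIB, servers=[1, 2, 3, 5])
for server, file_offset, length in pieces:
    print(f"server #{server}: offset {file_offset // KIB} KiB, {length // KIB} KiB")

# A request that is not aligned with the stripe fragment boundaries.
unaligned = stripe_request(100 * KIB, 60 * KIB, 64 * KIB, servers=[1, 2, 3, 5])
```

The 400 KiB request yields six full fragments and a final 16 KiB fragment, so the servers do not all receive the same amount of data. The unaligned request shows that a request smaller than one fragment can still touch two servers.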

In the high-performance parallel I/O research community, according to a recent literature survey (BOITO et al., 2018), the most used PFSs are Lustre and OrangeFS. Details about these very popular PFS implementations are provided in the following sections.

2.5.1 OrangeFS

OrangeFS (MOORE et al., 2011) is an open-source PFS that runs entirely at user level, with an additional kernel module that allows mounting it like any other file system under Linux (LIGON; WILSON, 2014). As an evolution of the Parallel Virtual File System (PVFS) (Ligon, III; ROSS, 1996; CARNS et al., 2000), OrangeFS is mainly designed for I/O performance, relaxing POSIX consistency semantics where necessary to eliminate bottlenecks. Compared to PVFS, main extensions made by OrangeFS include support for a larger collection of interfaces (e.g., Windows, WebDAV, Hadoop JNI Client), distributed file metadata and directories, and improved security.

OrangeFS is an object-based file system. Each file or directory in the file system is composed of one metadata object and one or more data objects. The number of data objects created for a file is configurable, and is equivalent to the stripe width in Figure 10, as each data object is stored in a single server. The set of servers into which data objects from a file are stored is defined at file creation according to a distribution layout. As of version 2.9 of OrangeFS, five distribution layouts are available (LIGON, 2019):

• PVFS_SYS_LAYOUT_NONE: servers are selected following the order of the configuration file, starting from the first one;

2.5.2 Lustre

Lustre (BRAAM; SCHWAN, 2002) is an open-source PFS designed for massive scalability, high performance, and high availability. Implemented entirely in the Linux kernel, Lustre provides a global POSIX-compliant namespace, being capable of handling extremely large volumes of data and huge numbers of files with strong coherence of both data and metadata (BARTON; DILGER, 2014). According to the June 2018 edition of the Top500 list (DONGARRA et al., 2018), 77 of the 100 most powerful supercomputers in the world use Lustre, or a Lustre-derived file system, for their back-end storage system.

The Lustre architecture is founded upon distributed object-based storage, in which two types of objects are used: data objects for storing a file’s contents, and index objects, which are key-value stores used for implementing POSIX directories (BARTON; DILGER, 2014). These objects are implemented by the object storage device (OSD), which is an abstraction for different local file systems, such as ext4 and ZFS. Figure 12 presents a basic Lustre cluster with its main components.

Figure 12: Lustre main components in a basic cluster.

Source: (OPENSFS; EOFS, 2019).

An instance of an OSD can be exported as either a metadata target (MDT), for metadata operations, or an object storage target (OST), for data operations, by, respectively, metadata servers (MDSs) and object storage servers (OSSs). An important service provided by storage targets is the Lustre Distributed Lock Manager (LDLM), which is responsible for serializing conflicting data and metadata operations on objects managed by the target, and for ensuring distributed cache coherency (OPENSFS; EOFS, 2019). A management server (MGS) is responsible for maintaining system configuration information and for sharing this information with other system components. Lustre clients communicate with servers through a remote procedure call (RPC) layer built on top of Lustre Networking (LNet), an abstraction for physical and logical networks (OPENSFS; EOFS, 2019). Prior to version 2.4, a Lustre file system could have only a single MDT, while a pair of MDSs could be configured for active-passive failover (BARTON; DILGER, 2014). More recent versions support multiple MDTs and, thereby, multiple MDSs with active-active failover configuration in a single file system namespace.

Data objects are allocated at file creation by the MDS. Each data object is stored in a single OST, and the number of data objects is defined by a parameter called stripe_count. In Lustre, file stripe fragments are assigned to data objects using one of two allocation methods (OPENSFS; EOFS, 2019):

• Round-robin allocator: alternates stripe fragments between OSTs on different OSSs, in a round-robin manner;

• Weighted allocator: randomly chooses OSTs based on size and location, focusing on filling emptier OSTs faster.

Currently, the round-robin allocator is used by default in Lustre. The weighted allocator is employed when the free space among OSTs differs by more than a configurable threshold (17% by default).
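The switch between the two allocators can be sketched as follows. This is an illustrative simplification and not Lustre's actual implementation: the imbalance is assumed to be the relative difference between the largest and smallest free space among OSTs, the round-robin branch ignores the placement of OSTs on different OSSs, and the weighted branch draws OSTs with probability proportional to free space only. The function names `pick_allocator` and `allocate` are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def pick_allocator(free_space: Sequence[float],
                   threshold: float = 0.17) -> str:
    """Return 'round-robin' or 'weighted' from the free-space imbalance.

    Assumption: imbalance = (max - min) / max over all OSTs.
    """
    most, least = max(free_space), min(free_space)
    if most <= 0:
        return "round-robin"
    imbalance = (most - least) / most
    return "weighted" if imbalance > threshold else "round-robin"


def allocate(free_space: Sequence[float], stripe_count: int,
             start: int = 0,
             rng: Optional[random.Random] = None) -> List[int]:
    """Choose stripe_count distinct OST indices for a new file."""
    n = len(free_space)
    if not 0 < stripe_count <= n:
        raise ValueError("stripe_count must be in 1..number of OSTs")
    if pick_allocator(free_space) == "round-robin":
        # Consecutive OSTs, wrapping around, starting from 'start'.
        return [(start + i) % n for i in range(stripe_count)]
    rng = rng or random.Random()
    candidates = list(range(n))
    chosen: List[int] = []
    for _ in range(stripe_count):
        weights = [free_space[i] for i in candidates]
        if sum(weights) > 0:
            # Emptier OSTs (more free space) are more likely to be drawn.
            pick = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            pick = rng.choice(candidates)
        candidates.remove(pick)
        chosen.append(pick)
    return chosen
```

For example, with free space values of 100, 95, and 90 units the assumed imbalance is 10%, so the sketch keeps the round-robin allocator, whereas values of 100, 50, and 90 give an imbalance of 50% and trigger the weighted choice.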

2.6 CONCLUDING REMARKS

In this chapter, basic concepts and topics related to high performance I/O and storage systems were presented. Even in this brief exposition, the complexity of such storage infrastructures is evident, not only because of their scale but also due to the variety of components involved. Moreover, this conceptual overview gives insight into how difficult it is to evaluate the parallel I/O performance and the performance variability perceived by a large-scale data-intensive application in such HPC environments, given the large number of characteristics and parameters that can significantly affect these metrics. The challenges involved in understanding these relationships have motivated a great deal of research, discussed in the next chapter, ultimately aimed at providing effective I/O performance optimization approaches.

3 RELATED WORKS

An extensive literature review, in the context of high performance I/O and storage systems, was conducted in this research work. It started with the proposal and development of a literature survey, carried out in collaboration with other research colleagues, considering research works published in the most important journals and conference proceedings in the field between 2010 and 2014 (BOITO et al., 2018). This survey provides a comprehensive discussion of the state of the art, accompanied by a bibliometric analysis, and presents our conclusions on the main current and future research topics in the field.

The publication corpus resulting from the literature survey was continuously augmented during the development of the present research work. Papers published from 2015 up to March 2019 (i.e., the time when this document was written) were included in the corpus. Moreover, several other research works referenced by already collected papers, when found relevant for this research work, were also considered in this literature review regardless of their year of publication. From this extensive corpus of published research works on high performance I/O and storage systems, a selected set of research works is discussed in the following sections.

3.1 I/O PERFORMANCE MODELING

Performance modeling, along with performance and workload characterization, is a technique widely adopted to obtain a better understanding of the behavior of a system (INACIO; DANTAS, 2014; CALZAROSSA; MASSARI; TESSERA, 2016). Regarding high performance I/O and storage systems, performance models have been explored by a number of research works, mainly focusing on driving more effective I/O performance improvements (BOITO et al., 2018). Typical I/O performance modeling methods observed in the literature include theoretical analytical models, henceforth referred to simply as analytical models, regression models, and simulation models. This section focuses on analytical and regression model proposals, while simulation models are discussed in the next section.

Analytical models were explored in some research works for driving different I/O performance improvement approaches, including optimizations for: per-file data layout selection (SUN; CHEN; YIN, 2009;