BIO-DIA: web-based tool for data and algorithms integration

INSTITUTO METRÓPOLE DIGITAL

GRADUATE PROGRAM IN BIOINFORMATICS

THIAGO DANTAS SOARES

BIO-DIA: Web-based tool for data and algorithms integration

NATAL RN 2019


THIAGO DANTAS SOARES

Master's thesis presented to the Graduate Program in Bioinformatics of the Universidade Federal do Rio Grande do Norte.

Area of concentration: Bioinformatics

Research line: Product and Process Development

Advisor: Prof. Dr. Wilfredo Blanco Figuerola

NATAL-RN 2019


Soares, Thiago Dantas.

BIO-DIA: web-based tool for data and algorithms integration / Thiago Dantas Soares. - 2020.

45 leaves: ill.

Dissertation (master's) - Universidade Federal do Rio Grande do Norte, Instituto Metrópole Digital, Graduate Program in Bioinformatics, Natal, RN, 2020.

Advisor: Prof. Dr. Wilfredo Blanco Figuerola.

1. Big data - Dissertation. 2. Integration - Dissertation. 3. Reproducibility - Dissertation. 4. Project reusability - Dissertation. 5. Web tool - Dissertation. I. Figuerola, Wilfredo Blanco. II. Title.

RN/UF/BCZM CDU 004.65

Cataloging in Publication at Source. UFRN - Biblioteca Central Zila Mamede


Natal, December 19, 2019.

EXAMINING COMMITTEE

___________________________________________
Prof. Dr. Wilfredo Blanco Figuerola
Universidade Federal do Rio Grande do Norte (Chair)

___________________________________________
Prof. Dr. Rodrigo Juliani Siqueira Dalmolin
Universidade Federal do Rio Grande do Norte (Examiner External to the Program)

___________________________________________
Prof. Dr. Alberto Signoretti
Universidade Estadual do Rio Grande do Norte (Examiner External to the Institution)


ACKNOWLEDGMENTS

Above all, I would like to thank יהוה for everything.

My relatives and family, especially Carlos, Rose, and Priscilla.

My girlfriend, Raiany.

My graduate colleagues Daniela, Éden, Themístocles, Diego, M. Eduarda, Felipe, Raul, Rafael, and Karla.

My advisor, Wilfredo, for supporting and guiding me during these 2 years.

I thank CAPES for the scholarship at PPg-Bioinfo/UFRN, where my


ABSTRACT

Data science is historically a complex field, not only because of the huge amount of data and its variety of formats, but also because of the need for collaboration among several specialists to retrieve valuable information. In this context, we created Bio-DIA, an online software tool for building data science workflows focused on the integration of data and algorithms. Bio-DIA also facilitates the reuse of information/results obtained in previous processes without requiring specific skills from the computer science field. The software was created with Angular at the front-end and Django at the back-end, together with Spark to handle and process a variety of big data formats. The workflow/project is specified through an XML file. The Bio-DIA application facilitates collaboration among users, allowing groups of researchers to share data, scripts, and information.

Availability: https://ucrania.imd.ufrn.br/biodia-app/. Login: bioguest, password: welcome123.

Keywords: Big data, integration, reproducibility, project reusability, web tool.


LIST OF FIGURES

Figure 1. Visual representation of Bio-DIA. Bio-DIA is the main node of this graph and is responsible for providing the user, in a new project/pipeline, with scripts, data, and projects already created by other users, thus enabling knowledge sharing and reuse among all users.


LIST OF ABBREVIATIONS

Bio-DIA   Biology Data Integration Analysis
IOT       Internet of Things
NGS       Next-Generation Sequencing
XML       Extensible Markup Language
MYSQL     Structured Query Language
NOSQL     Not Only Structured Query Language
TSV       Tab Separated Values
CSV       Comma Separated Values
GUI       Graphical User Interface


TABLE OF CONTENTS

1 INTRODUCTION
1.1 Data Science
1.2 The Data Science Process
1.3 Problems in Data Science
1.4 Related Work
1.5 Justification of the Study
2 OBJECTIVES
2.1 General
2.2 Specific
CHAPTER I
3 DISCUSSION
3.1 Bio-DIA
4 CONCLUSION
4.1 Future Work
REFERENCES


1 INTRODUCTION

1.1 Data Science

In recent years, data science has emerged as a new and important discipline. It can be seen as a fusion of classical disciplines such as statistics, data mining, databases, and distributed systems (VAN DER AALST, Wil., 2016). In the digital era, data are created daily: in 2018 alone, 33 zettabytes of data were created, according to Statista, one of the largest data business companies in the world. This growth is driven by the emergence of the Internet of Things (IoT), a more globalized world, and fast internet access; these and many other factors generate data for companies and for individual use. Approaches therefore need to be combined to turn this abundance of data into something of value (VAN DER AALST, Wil., 2016).

The mission of a data scientist is to transform data into new insights, providing new options for decision-making (KIM, Miryung et al., 2016). Reaching new discoveries often requires multidisciplinary teams whose members have different backgrounds, which makes it possible to find new questions and answers that were often not known before. Until not long ago, data scientists were found in teams where data were most intensive, such as internet search engines or marketing (KIM, Miryung et al., 2016). This has changed over time: many research areas have been growing and producing many kinds of data, with different formats and standards. Bioinformatics is one example; in this field it is necessary to deal daily with files in different formats and standards, databases, and scripts.

1.2 The Data Science Process

The data science process is a sequence of steps (pipeline) for reaching a final goal and thereby obtaining value from the data (insights). These steps can be defined as: Acquire, Prepare, Analyze, Report, and Act.

Acquire: In this first step, it is necessary to consider which data will be needed to reach the expected final result, as well as the types of data that will be used, such as images, videos, tables, SQL and NoSQL databases, and so on. For this part of the process, a team member with experience in obtaining such data is recommended, since data can be acquired in different ways, such as web services for remote data, a database management system (DBMS), or a programming language.

Prepare: Once the data have been acquired, this step consists of two sub-steps: exploring and pre-processing. Exploring consists of getting to know the data obtained and examining, for example, the columns of a table, which makes it possible to identify columns with missing values, duplicates, and so on. After exploration comes pre-processing, in which data cleaning is performed: anomalies found during exploration can be removed or adapted, according to the team's choice.

Analyze: With the data prepared as well as possible, it is time to actually analyze them. This step involves choosing analytical techniques and mathematical and computational models for the analysis and validation of results. Among these models, several choices can be considered to obtain the desired answer: classification, clustering, association, regression, and many others.

Report: This step concerns evaluating the results of the previous step (analyze) and presenting them to the whole team, usually through graphical presentations that help the team better understand the results. Questions are also asked at this step to verify whether the result obtained is really valid, for example: What are the main results? What added value do these results offer? Reporting both good and bad results is another aspect to consider, as domain experts can suggest how to improve and treat the data to obtain better results in the future.


Act: Since the results may or may not meet expectations, this is the moment to return to the starting point and recall the purposes and objectives of the whole pipeline.

It is worth noting that this whole sequence is iterative: each step, or the whole process, can be repeated N times. The process starts from an initial question (hypothesis) and is carried out to find, or fail to find, an answer to that hypothesis.
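As a concrete illustration of the Prepare step, the following Python sketch explores a small table for missing values and duplicated rows and then cleans it. It is only a minimal example using the standard library: the table contents, the column names, and the policy of dropping problematic rows are assumptions made for illustration, not part of Bio-DIA.

```python
import csv
import io

# Hypothetical sample table; names and values are illustrative only.
RAW = """gene,sample,expression
GENE_A,s1,12.5
GENE_A,s1,12.5
GENE_B,s2,
GENE_C,s3,7.1
"""

def explore(rows):
    """Exploration: count rows with missing values and exact duplicate rows."""
    missing = sum(1 for r in rows if any(v == "" for v in r.values()))
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(r.items())
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}

def clean(rows):
    """Pre-processing: drop rows with missing values and repeated rows."""
    seen, kept = set(), []
    for r in rows:
        key = tuple(r.items())
        if key in seen or any(v == "" for v in r.values()):
            continue
        seen.add(key)
        kept.append(r)
    return kept

rows = list(csv.DictReader(io.StringIO(RAW)))
report = explore(rows)
cleaned = clean(rows)
print(report)
print([r["gene"] for r in cleaned])
```

Whether anomalous rows are dropped, as here, or adapted instead is the team's choice, as noted in the description of the Prepare step.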

1.3 Problems in Data Science

One of the problems is how to automate data cleaning. Models can now be built and patterns found quickly, but 90 percent of the effort is spent on pre-processing (data integration, data cleaning, etc.) (YANG, Qiang; WU, Xindong, 2006). The time spent understanding a new dataset, determining what is actually valid, and trusting its integrity can be very long. Reducing this cost would pay off more than reducing the cost of modeling and pattern finding itself (YANG, Qiang; WU, Xindong, 2006).

Difficulty in accessing data or running a third-party pipeline should not occur. Just as an article is reviewed by other researchers, which makes it easier to understand, data and pipelines should be reviewed to remove bugs and improve readability (WILSON, Greg et al., 2014). Other problems that can be addressed include software dependencies across different operating systems, duplication of the same dataset among several users in the same environment, and poorly designed databases lacking normalization or integrity.

As also noted in Section 1.2, going through the whole data science process (acquire, prepare, analyze, report, and act) is a large and laborious undertaking that requires collaboration across several areas. In the acquire step, for example, a programming specialist is recommended to handle the different ways in which data can be obtained. In the report step, the ideal scenario would include a computer graphics specialist, able to present the data obtained in the analysis in the graphical form best suited to that dataset.

Specialists in the field to which data science and its whole process are being applied are also of great importance, since a good domain expert will know how to handle the acquired data and advise on how to obtain new data, prepare them, and so on.

1.4 Related Work

We studied software that tries to solve one or all of the problems discussed in the previous section and found four tools: Splunk, Galaxy, Kepler, and Knime. Practically all of them share the same purpose, a focus on data integration and analysis, and some focus on the field of biology. A summary of these programs follows.

Splunk (Stearley, Corwell and Lord, 2010) is an online platform that can connect to all types of data (sensors, structured data, NoSQL) and uses the Hadoop ecosystem. However, the Splunk license with all features available is not free, and the free version is very limited for big data analysis because it provides only 500 MB of data storage.

Galaxy (Goecks, Nekrutenko and Taylor, 2010) is an open-source tool developed for biomedical research. Any user can create new features to meet a personal need, and it is also possible to use common functions such as sort, select, and join, or even specific functions such as those for NGS sequencing. Galaxy's main problem is the lack of documentation: because any user can create new features, managing both new and old features has become complex.

Kepler (Ludäscher et al. 2006) is an open-source, multi-platform desktop application whose goal is to create data science workflows for the research community in general. Because it is open source, users can create their own modules; BioKepler (Altintas, 2011) is an example. Its main shortcoming is that it is an offline tool, which hinders integration and collaboration among researchers who may be working remotely.

KNIME (Beisken et al., 2013) is also an open-source desktop tool, whose goal is to find, organize, and mine data. It was built on Hadoop and Spark and has more than 1500 modules and many basic examples. The desktop version is free, so personal features can be created. The online Knime license is not available for free.

1.5 Justification of the Study

The points raised above show how broad the field of data science is and how many discoveries and advances it can bring, as well as the difficulties of practicing it: the need for large teams, work with many types of data from different sources, version control problems, among others. Existing tools try to work around these problems with solutions aimed at data integration or data analysis, but none of them focuses on collaboration among groups.

With these problems in mind, and considering how to provide data, scripts, and projects to other users while promoting their reuse, Figure 1 shows how Bio-DIA behaves and provides these resources.

Figure 1. Visual representation of Bio-DIA. Bio-DIA is the main node of this graph and is responsible for providing the user, in a new project/pipeline, with scripts, data, and projects already created by other users, thus enabling knowledge sharing and reuse among all users.

Bio-DIA is a tool focused on data science, as well as on data integration and analysis, but what sets it apart is collaboration and the sharing of pipelines, datasets, and scripts among users. Bio-DIA enables the reuse of pipelines, data, and scripts already developed by other users in a simple way and requires no prior knowledge of a programming language, only knowledge of the XML standard developed to execute commands in Bio-DIA.


2 OBJECTIVES

2.1 General

The main objective of this work is to implement a web tool that integrates data, scripts, and pipelines and facilitates sharing them among its users.

2.2 Specific

● Integrate data with different file extensions (tsv, csv) and databases (MySQL);

● Allow the execution of other languages (R, Shell, Python, and Perl);

● Allow users to run scripts and pipelines from other users;

● Allow data sharing;

● Perform data analyses without prior knowledge of programming languages.


CHAPTER I

Article: BIO-DIA: Web-based tool for data and algorithms integration

Written by: Thiago D. Soares, Vandeclécio Lira da Silva, André Fonseca, Diego Morais, Alberto Signoretti, Sandro José de Souza, Wilfredo Blanco

Thiago Dantas^a, Vandeclécio Lira da Silva^a, André Faustino Fonseca^a, Diego A A Morais^a, Alberto Signoretti^b, Wilfredo Blanco^{b,a,*}

^a Bioinformatics Multidisciplinary Environment (BioME), Digital Metropolis Institute, Federal University of Rio Grande do Norte, Natal, RN, Brazil

^b Computer Science Department, State University of Rio Grande do Norte, Natal, RN, Brazil

* Corresponding author: wilfredoblanco@uern.br, wblancof@gmail.com

Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte (UFRN), Av. Odilon Gomes de Lima 1722, Natal/RN, CEP: 59078-400, Phone: 55 (84) 3342-2216, e-mail: bioinfo@imd.ufrn.br, biome@imd.ufrn.br

Computer Science Department, State University of Rio Grande do Norte (UERN – Natal), Av. Dr. João Medeiros Filho, 3419 - Natal/RN CEP: 59.120-200, Phone: 55 (84) 3207-8789/3207-2889, e-mail: natal@uern.br


Data science is historically a complex field, not only because of the huge amount of data and its variety of formats, but also because of the need for collaboration among several specialists to retrieve valuable information. In this context, we created Bio-DIA, an online software tool for building data science workflows focused on the integration of data and algorithms. Bio-DIA also facilitates the reuse of information/results obtained in previous processes without requiring specific skills from the computer science field. The software was created with Angular at the front-end and Django at the back-end, together with Spark to handle and process a variety of big data formats. The workflow/project is specified through an XML file. The Bio-DIA application facilitates collaboration among users, allowing groups of researchers to share data, scripts, and information.

Availability: https://ucrania.imd.ufrn.br/biodia-app/. Login: bioguest, password: welcome123.

Keywords:


A data science process relies on a sequence of steps (pipeline) in which data is explored, integrated, analyzed, and processed (modeled, classified, etc.), producing results that support (or not) a given question/hypothesis. Several scientific fields are considered data-intensive, including biomedical sciences (Holzinger, Dehmer, and Jurisica 2014), an area where data science processes deal with a huge volume of data coming from numerous sources and in distinct formats and levels of complexity. These issues have become a problem in biomedical research as well. Data from different formats and origins, such as symptoms, medical record information, clinical and laboratory data, medical images, genetic information, and electrophysiological data, tend to be poorly structured and treated mainly individually; as a result, studies are usually unable to integrate them and find possible relations.

To execute a given data science process in biomedical sciences, a group of professionals is usually composed of members from different areas (e.g., biologists, mathematicians, computer scientists, statisticians, among others). They must collaborate to accomplish their goals and, among other responsibilities, must choose, within a vast domain of computing environments, languages and frameworks, those that suit their needs. This decision is usually based on individual knowledge or preference, which most of the time is not the best for the group or for the tasks to be executed. Moreover, building upon previous computational experiments is almost mandatory in a collaborative environment. Research teams from all over the globe share source code, data, and results in order to continue developing new projects. Therefore, reusability and reproducibility are essential building blocks in this type of scientific enterprise.

Under the described scenario, some issues are important to carry out a collaborative project, including: (1) data integration, (2) algorithms integration and (3) reusability.

This study proposes a web-based tool, called Bio-DIA, to facilitate the construction of data science process pipelines that combine big data exploration, integration and analysis. The results obtained by the built pipelines can be reused to replicate studies, facilitating the testing of new hypotheses and questions. This platform will facilitate the sharing of data, scripts and knowledge among the scientific community.

We studied several software tools that address the three major points just mentioned and found four: Splunk, Galaxy, Kepler, and Knime. All of them have basically the same purpose: to build a data science process, focusing on data integration and analysis. Below is a brief summary of these tools.

Splunk (Stearley, Corwell, and Lord 2010) is an online platform that can connect to all types of data (sensors, structured data, NoSQL) through the Hadoop ecosystem. However, a subscription is required to use all features available in Splunk, and the free version is very limited for big data analysis because it provides only 500MB for data storage.

Galaxy (Goecks, Nekrutenko, and Taylor 2010) is an open-source tool developed for biomedical research. Any user can build new features to fill some personal need, and it is also possible to use common functions like sort, select, join, or even specific functions, like the ones for Next Generation Sequencing (NGS). The major problem with Galaxy is the lack of documentation: since any user can build new features, managing the system becomes quite complex.

Kepler (Ludäscher et al. 2006) is an open-source, multi-platform desktop application whose main goal is the creation of data science workflows for the research community in general. Because it is open source, users can create their own modules (BioKepler (Altintas 2011) is an example). As Kepler is an offline tool, collaboration among researchers is more difficult.

Knime (Beisken et al. 2013) is also an open-source desktop tool (available off and online) whose objective is to find, organize and mine data. The tool is built using Hadoop and Spark and contains more than 1500 modules and plenty of examples for beginners. The desktop version is free to use, allowing the creation of personal features.

Bio-DIA was compared with these tools and some advantages were noticed: (1) it is totally free to use, in contrast to Splunk and Knime; (2) the use of XML to build a data science process simplifies the manipulation and sharing of information/results, compared with the steep learning curve needed to use Galaxy; (3) since Bio-DIA is an online tool, only an internet browser is needed to execute a data science process, with no need to install any additional software on the local machine, as Kepler requires.

Basic Structure of the Application

Bio-DIA is a web-based tool whose main goal is to provide an efficient platform for the integration of data and algorithms. Its front-end was developed using Angular 4, and the back-end uses the Python programming language with the Django framework as well as Spark. The use of Angular and Django allowed software scalability without loss of performance, and Spark gave Bio-DIA good scalability on the hardware side while also making it capable of working with multiple types of data (Figure 1).

The Bio-DIA application is able to read and process (selection, projection, filters, etc.) several structured data types, such as formatted text files (tsv, csv) and tables in databases (MySQL and PostgreSQL). The application is also able to load and run scripts (source code) written in different programming languages, such as Python, R, Perl and Shell.
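The selection and projection operations mentioned above can be sketched in a few lines of Python. This is a standard-library stand-in, not Bio-DIA's Spark-based implementation; the function names, the sample tables, and their values are hypothetical.

```python
import csv
import io

def read_table(text, delimiter):
    """Load delimited text (tsv or csv) into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection: keep only the requested columns."""
    return [{c: r[c] for c in columns} for r in rows]

# Hypothetical inputs: the same kind of record stored in two file formats.
TSV = "gene\ttissue\tlevel\nGENE_A\ttestis\t95\nGENE_B\tliver\t40\n"
CSV = "gene,tissue,level\nGENE_C,testis,88\n"

# Integrate both sources, then filter and reduce the columns.
rows = read_table(TSV, "\t") + read_table(CSV, ",")
testis = select(rows, lambda r: r["tissue"] == "testis")
result = project(testis, ["gene", "level"])
print(result)
```

Reading both formats into the same row structure is what lets one selection or projection apply to data from either source.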

Bio-DIA uses the XML format to specify a data science workflow. Users create an XML file (a “project” from now on) that defines the workflow steps using several tags, such as <data>, <select> and <save> to load, process and save data, respectively; the <algorithm> tag allows the loading and execution of scripts (see Supplementary Section 1: Tags and attributes definition and description). The data saved for a project can be used in another project using the <sub_project> tag (Supplementary Section 6: Figure S9).
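To give a rough idea of what such a project looks like to a program, the Python sketch below parses a small project and lists its steps. The exact schema is defined in the supplementary material, so the <project> root element, the name attribute, and the script parameters used here are assumptions for illustration; only the tag names, the id attribute and the db_name attribute appear in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical project; the root element and the "name" attribute are assumptions.
PROJECT_XML = """
<project>
  <data id="data1" db_name="geneList.tsv"/>
  <algorithm id="algo1" name="gene_id_mapping.py">
    <inter_data>data1</inter_data>
    <value>symbol,ensembl</value>
  </algorithm>
</project>
"""

def list_steps(xml_text):
    """Return (tag, id) for every top-level workflow step, in document order."""
    root = ET.fromstring(xml_text)
    return [(step.tag, step.get("id")) for step in root]

def undefined_references(xml_text):
    """Return <inter_data> references that point to an id not defined earlier."""
    root = ET.fromstring(xml_text)
    defined, missing = set(), []
    for step in root:
        for ref in step.iter("inter_data"):
            target = (ref.text or "").split("::")[0].strip()
            if target not in defined:
                missing.append(target)
        if step.get("id"):
            defined.add(step.get("id"))
    return missing

steps = list_steps(PROJECT_XML)
problems = undefined_references(PROJECT_XML)
print(steps, problems)
```

A check like undefined_references is one way a tool could reject a project before running it, since a step can only consume data that an earlier step has produced.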

<algorithm> tags, Bio-DIA is able to send parameters to the script in two different ways. The first is the <value> tag (Supplementary Figure S8, line 7), which is used to pass static values such as numbers, strings or lists of them (e.g.: a,b,c or 1,2,3); the second is the <inter_data> tag (Supplementary Figure S8, line 6), used to reference data that was previously created/generated in the current project. This is done through the value of the id attribute, usually defined in the <data>, <select> or <sub_project> tags. This tag structure creates a standard scripting interface that allows the same code to be used by different researchers without requiring them to know how to program.

A script executed inside a Bio-DIA project usually reads/writes (input/output) data or files. To control these events, Bio-DIA automatically adds four (4) hidden parameters that determine where the data will be read or written. The scripts have access to these parameters, which are named DIR, ID, XMLID and PROJECT and are sent in this order (for more details, see Supplementary Section 2).
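A script can separate its own parameters from the hidden ones along the following lines. This Python sketch assumes that the four hidden parameters arrive as the last four command-line arguments; the text only states their names and order, so that position, the function name and the example values are hypothetical (Supplementary Section 2 describes the actual convention).

```python
HIDDEN_NAMES = ("DIR", "ID", "XMLID", "PROJECT")

def split_arguments(argv):
    """Split script arguments into user parameters and the four hidden ones.

    Assumption: the hidden parameters are the last four arguments.
    """
    count = len(HIDDEN_NAMES)
    if len(argv) < count:
        raise ValueError("expected at least four hidden parameters")
    user = list(argv[:-count])
    hidden = dict(zip(HIDDEN_NAMES, argv[-count:]))
    return user, hidden

# Hypothetical argument list, as a script might receive it in sys.argv[1:].
user, hidden = split_arguments(
    ["geneList.tsv", "symbol,ensembl", "/tmp/biodia/run", "42", "algo1", "CTs"]
)
print(user)
print(hidden["DIR"], hidden["PROJECT"])
```

With this split, a script would write its outputs under hidden["DIR"] instead of choosing its own location.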

Bio-DIA was tested on an HP Proliant DL386-G6 server (AMD Opteron 243 with 12 cores) with 32 GB of RAM and 292 GB of hard-disk space.

Figure 1: Bio-DIA architecture and technologies. The front-end was developed with Angular. A simple XML editor was developed, allowing users to create and submit their XML projects. For more details about the user interface, see Supplementary Section 5. After the XML is created, the front-end sends the XML file to the back-end (server) to be processed. If the project requires any data, script or other project, Bio-DIA carries out its execution. Once the project is finished, outputs are usually generated for further analysis. The entire back-end was developed with the Python/Django framework, using Spark libraries for big data processing.

In this section, we present 3 projects in order to verify the results of the Bio-DIA software against the original pipelines. Although data from any scientific field could be used, we focused on case studies from the Bioinformatics field. First, we present 2 projects that are very commonly used in Bioinformatics, and finally we reproduce a previous work published by our group (Silva et al 2015), which also used the two pipelines previously mentioned. We checked all the uploading, processing and writing events, comparing the results with those of the original source code.

Case 01: Identifying conflicting gene names (Gene identifiers standardization)

This case was built to address inconsistencies in gene identifier (id) values across different databases, such as id conversions between Entrez (Maglott et al. 2005), Ensembl (Flicek et al. 2013), HUGO Symbol (Eyre et al. 2006) and RefSeq (Pruitt, Tatusova, and Maglott 2005). The project was created to check and identify value discrepancies, a task that can be time-consuming and significantly delay projects during data integration and cross-database analysis. This bottleneck is often related to large data volumes and data versions, producing false associations and possible misinterpretations. Nowadays, several tools are available to solve this problem, like MyGene (Xin et al. 2015) and Biomart (Durinck et al. 2005). In this context, we illustrate a protocol to bypass this problem using Bio-DIA XML syntax (Figure 2).

Figure 2: Gene identifiers standardization protocol. (A) Workflow schema of the pipeline, in which a Python script takes two external inputs, a list of gene identifiers (geneList.tsv) and the subset of columns, and produces a TSV file (Output). A local database file is consulted by the script to show the distinct ID types (such as HUGO Symbol, Ensembl Gene, Transcript RefSeq, and NCBI Entrez) for the selected inputs (geneList.tsv). The red box delimits an internal input that the script is able to access. (B) XML script to execute the pipeline shown in A. Line 2, using the <data> tag, loads the gene list input (the geneList.tsv file seen in A). Line 3 defines a Python-type script named “gene_id_mapping.py”. Lines 4 to 7 define the parameters for the algorithm. The first parameter (line 5) is the gene list, referenced by ‘data1’, which was previously defined in line 2 as the id. The second parameter (line 6), passed as values, is the subset of columns.

Case 02: Genes related to a tumor and platform

In the present case, the goal is to search and retrieve a specific list of genes from TCGA (https://tcga-data.nci.nih.gov), an external database that stores data from roughly eleven thousand patients comprising 32 tumor types. The R script gets 3 parameters: the list of desired genes, the tumor name (in this case GBM), and the type of filter (platform), which is ‘mrna’ in the present case. The output is a text file containing 5 columns (sample, gbm_tcga_rna_seq_v2_mrna, gbm_tcga_mrna, gbm_tcga_mrna_U133 and gene name), whose names change according to the parameters passed. A workflow and the corresponding XML file for this case are shown in Figure 3.

Figure 3: Pipeline to retrieve a list of genes related to a tumor and platform. (A) The workflow schema with three input parameters: gene list, tumor name and platform. The R script (cbio_downloader.R) connects to the TCGA external database and retrieves a filtered list of the genes related to the tumor GBM (Glioblastoma), with mrna as the platform. The red box delimits the context of the R algorithm, which is in charge of connecting to and accessing TCGA. (B) XML script to execute the pipeline shown in A. Line 1 defines the <data> tag to load the gene list file (db_name = “geneList.tsv”). In lines 3-10 the R script is executed with three parameters (lines 4 to 7): in lines 5, 6 and 8, through the <value> tag, ‘GENOMICS’, ‘GBM’ and ‘MRNA_EXPRESSION’ are passed as string values, respectively; in line 7, through the <inter_data> tag, data1 is passed, which points to the list of genes referenced by the id created in line 2.

Case 03: Cancer/testis (CT) project

The CT study was published by our group (da Silva et al. 2017); in it, the integration of gene expression and clinical data guided us to detect some CT genes that are associated with prognosis in different types of cancer. This study executed a genome-wide screen for CT genes using data from several databases; hence, we reproduced the original pipeline (Figure 4A) using the Bio-DIA XML script (Figure 4B, C) in order to replicate the results.

Figure 4: CT project built with Bio-DIA XML script. (A) Schematic pipeline of the original CT project, divided into left and right sides to better explain its implementation as Bio-DIA projects. Red labels correspond to the XML lines in charge of executing each part (modified from Da Silva et al, 2017). (B) XML script lines that reproduce the first project (A, left side), called CTs. It starts by accessing 3 databases using <data> tags (lines 2, 3 and 4 for the HBM, GTEX and HPA databases, respectively). In line 5, a shell script named “copy_protein_code.sh” is called using the <algorithm> tag. This script copies to the Bio-DIA folder (defined by the DIR parameter) a list of genes from the human genome that are validated by the NCBI, making it possible to get only protein-coding genes from these 3 databases. Subsequently (lines 6-26), a Perl source code named “expression_anaysis.pl” is called 3 times to filter the databases. Each call has three parameters: the database (using the <inter_data> tag), the tissue and the cutoff. The last two arguments are passed by value (using the <value> tag). In the last step, an R source code is called with 3 parameters, which are the second output of each previously executed Perl algorithm (using <inter_data>algo_perl_1::2</inter_data>). (C) XML script lines that reproduce the second project, called CTs_2. Through the <sub_project> tag (line 2) and its output attribute, Bio-DIA is able to load the specific output 7 (listGeneName.txt) from the already executed project, CTs. Then, a Python script (case 1) is executed, receiving the <sub_project id=’sub1’> and the parameters ‘-s’ for the subset columns and ‘-m’ for the mapping type. After that, a new script, ‘expression_table.py’, is executed, receiving as its first parameter the first output (geneList) generated by the previous script, ‘gene_id_expression.py’, and as its second parameter ‘-n’, passing the tumor name (BLCA). The next script, in lines 17-30, applies case 2 to get genomic and clinical data for the tumor (SKCM). Finally, the algorithm in line 31 receives as parameters the data collected by the scripts (algo_r_1 and algo_r_2); with these outputs, it is possible to make plots to check which genes are associated with longer patient survival.

The methodology applied in the CT study (da Silva et al. 2017) focuses on data integration from distinct sources. Three RNA-Seq databases from normal tissues were used to identify genes with expression biased toward testis: the Human Body Map (HBM) (GEO accession: GSE30611), Genotype-Tissue Expression (GTEx), and the Human Protein Atlas (HPA) [DOI:10.1126/science.1260419]. The normalized transcript level in each tissue was converted to a proportional score (the transcript level in a tissue divided by the sum of the levels in all tissues), and a threshold of at least 0.9 was used to select genes preferentially or exclusively expressed in testis (a stricter threshold of 0.99 was also applied).
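The proportional score described above can be illustrated with a minimal Python sketch. It is not the code used in the CT study: the function names, the gene names and the expression values are hypothetical and serve only to show the arithmetic (level in one tissue divided by the sum over all tissues, compared with a threshold of 0.9 or 0.99).

```python
def proportional_scores(levels):
    """Return each tissue's share of a gene's total normalized expression."""
    total = sum(levels.values())
    if total == 0:
        # A gene with no expression anywhere gets a score of 0 in every tissue.
        return {tissue: 0.0 for tissue in levels}
    return {tissue: value / total for tissue, value in levels.items()}


def is_testis_biased(levels, threshold=0.9, tissue="testis"):
    """True if the testis share of expression reaches the threshold."""
    return proportional_scores(levels).get(tissue, 0.0) >= threshold


# Hypothetical normalized transcript levels for two invented genes.
GENE_A = {"testis": 95.0, "lung": 3.0, "liver": 2.0}
GENE_B = {"testis": 40.0, "lung": 35.0, "liver": 25.0}

if __name__ == "__main__":
    for name, levels in (("GENE_A", GENE_A), ("GENE_B", GENE_B)):
        score = proportional_scores(levels)["testis"]
        print(name, round(score, 3), is_testis_biased(levels))
```

With these invented values, GENE_A passes the 0.9 threshold but not the stricter 0.99 one, while GENE_B passes neither.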

In the next step of the pipeline, RNA-Seq data from TCGA, comprising 6,221 tumor samples across 15 tumor types, were used to identify genes significantly expressed in a given tumor. A gene was considered a putative CT gene if it was expressed (cutoff threshold of RSEM > 1) in at least 10% or 15% of all informative samples for a given tumor. Finally, the integration of gene expression, abundance of infiltrating CD8+ cells and clinical data led us to identify dozens of CT genes associated with either good or poor prognosis, based on the survival curves.
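The expression criterion for putative CT genes can be sketched in the same way. This is only an illustration under stated assumptions: the function names and the sample values are hypothetical, and a non-informative sample is represented here by None, which is a choice made for this sketch rather than something taken from the original pipeline.

```python
def expressed_fraction(rsem_values, cutoff=1.0):
    """Fraction of informative samples whose RSEM value exceeds the cutoff."""
    informative = [value for value in rsem_values if value is not None]
    if not informative:
        return 0.0
    expressed = sum(1 for value in informative if value > cutoff)
    return expressed / len(informative)


def is_putative_ct(rsem_values, min_fraction=0.10, cutoff=1.0):
    """True if the gene is expressed in at least min_fraction of the samples."""
    return expressed_fraction(rsem_values, cutoff) >= min_fraction


# Hypothetical RSEM values of one gene in eleven samples of one tumor type;
# None marks a sample treated as non-informative in this sketch.
SAMPLES = [0.0, 0.2, 5.1, None, 0.9, 0.0, 0.3, 0.1, 0.0, 0.4, 2.5]

if __name__ == "__main__":
    print(expressed_fraction(SAMPLES))
    print(is_putative_ct(SAMPLES, min_fraction=0.10))
    print(is_putative_ct(SAMPLES, min_fraction=0.15))
```

Running the criterion with both min_fraction=0.10 and min_fraction=0.15 mirrors the two alternative cutoffs mentioned in the text.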

Conclusions

We implemented an application to help the scientific community overcome issues related to data/algorithm integration and replicability, with the main goal of providing a web-based tool to improve scientific collaboration. Bio-DIA projects/pipelines, specified in XML, are intended to ease access to big data in different formats, execute algorithms (scripts) developed in different languages (R, Python, Shell and Perl) and reuse/share the results among projects.

Bio-DIA can be accessed via the public site at https://ucrania.imd.ufrn.br/biodia-app/. We developed a simple graphical user interface (GUI) that provides the functionality needed to manage (create, retrieve, edit, remove and execute) projects and their results. More details on the GUI can be found in Supplementary Section 5. In future versions, a new GUI will make it possible to manage and automatically generate projects simply by dragging and dropping graphic shapes, icons and their connections. Knowledge of XML will then no longer be required to use Bio-DIA, making it even simpler to use for a heterogeneous data science team.

References

Holzinger, Andreas, Matthias Dehmer, and Igor Jurisica. 2014. "Knowledge discovery and interactive data mining in bioinformatics-state-of-the-art, future challenges and research directions." BMC bioinformatics 15 (6):I1.

Altintas, Ilkay. 2011. "Distributed workflow-driven analysis of large-scale biological data using biokepler." Proceedings of the 2nd international workshop on Petascal data analytics: challenges and opportunities.

Beisken, Stephan, Thorsten Meinl, Bernd Wiswedel, Luis F de Figueiredo, Michael Berthold, and Christoph Steinbeck. 2013. "KNIME-CDK: Workflow-driven cheminformatics." BMC bioinformatics 14 (1):257.

da Silva, Vandeclecio Lira, André Faustino Fonseca, Marbella Fonseca, Thayna Emilia da Silva, Ana Carolina Coelho, José Eduardo Kroll, Jorge Estefano Santana de Souza, Beatriz Stransky, Gustavo Antonio de Souza, and Sandro José de Souza. 2017. "Genome-wide identification of cancer/testis genes and their association with prognosis in a pan-cancer analysis." Oncotarget 8 (54):92966.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21 (16):3439-3440.

Eyre, Tina A., Fabrice Ducluzeau, Tam P. Sneddon, Sue Povey, Elspeth A. Bruford, and Michael J. Lush. 2006. "The HUGO Gene Nomenclature Database, 2006 updates." Nucleic Acids Research 34 (suppl_1):D319-D321. doi: 10.1093/nar/gkj147.

Flicek, Paul, M. Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, Laurent Gil, Carlos García Girón, Leo Gordon, Thibaut Hourlier, Sarah Hunt, Nathan Johnson, Thomas Juettemann, Andreas K. Kähäri, Stephen Keenan, Eugene Kulesha, Fergal J. Martin, Thomas Maurel, William M. McLaren, Daniel N. Murphy, Rishi Nag, Bert Overduin, Miguel Pignatelli, Bethan Pritchard, Emily Pritchard, Harpreet S. Riat, Magali Ruffier, Daniel Sheppard, Kieron Taylor, Anja Thormann, Stephen J. Trevanion, Alessandro Vullo, Steven P. Wilder, Mark Wilson, Amonida Zadissa, Bronwen L. Aken, Ewan Birney, Fiona Cunningham, Jennifer Harrow, Javier Herrero, Tim J.P. Hubbard, Rhoda Kinsella, Matthieu Muffato, Anne Parker, Giulietta Spudich, Andy Yates, Daniel R. Zerbino, and Stephen M.J. Searle. 2013. "Ensembl 2014." Nucleic Acids Research 42 (D1):D749-D755. doi: 10.1093/nar/gkt1196.

Goecks, Jeremy, Anton Nekrutenko, and James Taylor. 2010. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome biology 11 (8):R86.

Ludäscher, Bertram, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A Lee, Jing Tao, and Yang Zhao. 2006. "Scientific workflow management and the Kepler system." Concurrency and Computation: Practice and Experience 18 (10):1039-1065.

Maglott, Donna, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. 2005. "Entrez Gene: gene-centered information at NCBI." Nucleic Acids Research 33 (suppl_1):D54-D58. doi: 10.1093/nar/gki031.

Pruitt, Kim D., Tatiana Tatusova, and Donna R. Maglott. 2005. "NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins." Nucleic Acids Research 33 (suppl_1):D501-D504. doi: 10.1093/nar/gki025.

Stearley, Jon, Sophia Corwell, and Ken Lord. 2010. "Bridging the Gaps: Joining Information Sources with Splunk." SLAML.

Xin, Jiwen, Adam Mark, Cyrus Afrasiabi, Ginger Tsueng, Moritz Juchler, Nikhil Gopal, Gregory S. Stupp, Timothy E. Putman, Benjamin J. Ainscough, Obi L. Griffith, Ali Torkamani, Patricia L. Whetzel, Christopher J. Mungall, Sean D. Mooney, Andrew I. Su, and Chunlei Wu. 2015. "MyGene.info and MyVariant.info: Gene and Variant Annotation Query Services." bioRxiv:035667. doi: 10.1101/035667.


Bio-DIA: A web-based tool for data and algorithms integration

Thiago Dantas Soares (a), Vandeclécio Lira da Silva (a), André Faustino Fonseca (a), Diego Morais (a), Alberto Signoretti (b), Wilfredo Blanco (a, b, *)

(a) Bioinformatics Multidisciplinary Environment (BioME), Digital Metropolis Institute, Federal University of Rio Grande do Norte, Natal, RN, Brazil

(b) Computer Science Department, State University of Rio Grande do Norte, Natal, RN, Brazil

* Corresponding author wilfredoblanco@uern.br wblancof@gmail.com + 55 (84) 98729-0888


Supplementary Section 1: XML tags and attributes

| Tag | Required attributes | Not required attributes | Text | Child | Definition |
|---|---|---|---|---|---|
| <project> | name | None | None | All tags below | Used to create a project. |
| <data> | id, db_table, type_data, db_name | sep, comment, header | None | No | Used to access data formats such as csv, txt and tsv files, and tables from MySQL and PostgreSQL databases. |
| <sub_project> | id, user, project, output | None | None | No | Used to load output data from another project and use it in the current project. |
| <algorithm> | id, lang, name | None | None | Yes: <params> | Used to call an algorithm developed in a specific scripting language, such as Python, R, etc. |
| <params> | None | None | None | Yes: <inter_data> or <value> | Informs the arguments needed to execute the script. |
| <inter_data> | None | None | True | No | Used to link to data that exist inside the project. |
| <value> | None | None | True | No | Informs a value that the script needs in order to execute. |
| <select> | id, from, get | filter, limit | None | No | Spark function that performs a select on the data. |
| <save> | store, folder | None | None | No | Saves data produced during the execution of the project to your project folder. |


| Attribute | Definition | Example |
|---|---|---|
| id | Used to identify and reference a tag inside the XML. Its value is a string and must be unique within the whole XML project. | <select from="myDATA"></select> |
| db_table | Used only inside the <data> tag to connect to either a table from a database or a text file. | |
| db_name | Identifies the name of the database to connect to. To open a table, the db_table attribute must also be specified. | |
| type_data | Informs which data type will be used: txt, sql or pg. The txt value is for text files (txt, csv and tsv), sql is for MySQL (Community Edition) and pg is for PostgreSQL. | |
| sep | File separator, either a comma or \t (tab). | <data db_name="myFile.txt" sep="\t"></data>; <data db_name="myFile.csv" sep=","></data> |
| comment | If the file has comment lines to ignore, the comment character must be specified (#, //, ;). | <data db_name="myFile.txt" comment="#"></data>; <data db_name="myFile.txt" comment=";"></data> |
| header | For a text file, informs whether it has a header: True if it exists, False otherwise. | <data db_name="myFile.txt" header="True"></data>; <data db_name="myFile.txt" header="False"></data> |
| user | Name of the user providing the project outputs. | <sub_project user="userName_id"></sub_project> |
| project | Name of the project that the user has set to public for all users. | <sub_project user="userName_id" project="projectName"></sub_project> |
| output | Output number that tells which output the user wants to get for their own project. | <sub_project user="userName_id" project="projectName" output="outputNumber"></sub_project> |
| from | ... this attribute must be the id of a <data> or <sub_project> tag. | |
| get | Informs which columns should be returned by the query. | |
| filter | The filter condition passed to the query. | <select id="select1" from="data_1" get="column_a, column_b" filter="column_a like 'test'"></select> |
| limit | Number of rows to be returned; if the attribute is missing, all rows are returned. | <select id="select1" from="data_1" get="column_a, column_b" filter="column_a like 'test'" limit="50"></select> |
| inplace | If True, the original data referenced by the from attribute is overwritten by the result of the select query. If False, the original data is not changed and a new file is created; to save this file, pass the id of the <select> to the <save> tag. | <select id="select1" from="data_1" get="column_a, column_b" filter="column_a like 'test'" limit="50" inplace="True"></select>; <select id="select1" from="data_1" get="column_a, column_b" filter="column_a like 'test'" limit="50" inplace="False"></select> |
| lang | Specifies the programming language to be used (Rscript, python or perl). | |
| name | Name of the script. | <algorithm lang="perl" name="myCode.pl"></algorithm> |
| flag | Can only be used in the <inter_data> or <value> tags; it passes an argument to the script. | <inter_data flag="-x">select1</inter_data>; <value flag="-y">23,22</value> |
| store | ... your project folder. | <select id="select1" from="data_1" get="column_a, column_b" filter="column_a like 'test'" limit="50" inplace="False"></select> <save store="select1"></save> |
| folder | Name of the folder where the output will be saved; by default, the folder name is the same as the project name. | <project name="projectName"> <select id="select1" from="data_1" get="column_a, column_b" filter="column_a like 'test'" limit="50" inplace="False"></select> <save store="select1" folder="projectName"></save> </project> |

Scripts usually need parameters to customize their execution, and the <params> tag is used to define them. Bio-DIA has two ways to send parameters to a script. The first is the <value> tag (Supplementary Figure S1, line 8), which passes static values such as numbers, strings, or lists of them (e.g., a,b,c or 1,2,3). The second is the <inter_data> tag (Supplementary Figure S1, line 7), which references data previously created or generated in the current project. The reference is made through the value of the id attribute, usually defined in the <data>, <select> or <sub_project> tags.

Supplementary Section 2: How Bio-DIA executes scripts

Figure S1 shows that Bio-DIA can execute scripts in several languages (Shell, Python, Rscript or Perl) using the <algorithm> tag. To execute scripts on the server, Bio-DIA adds four (4) hidden parameters (DIR, ID, XMLID, PROJECT), mainly to control reading and writing events (Figure S1). The scripts therefore have access to these parameters. If the script does not use flags for its parameters, the hidden arguments are appended at the end; if the script uses flags, the developer must reserve the flag '-c' to receive the four hidden parameters.
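The two ways of receiving the hidden parameters can be sketched in Python as follows. This is an illustration only, under several assumptions: that the four values arrive in the order DIR, ID, XMLID, PROJECT, that with '-c' they follow the flag as four separate tokens, and that the script has one flag of its own ('-s', chosen here purely as an example). The function and field names are hypothetical.

```python
import argparse
from typing import NamedTuple


class HiddenParams(NamedTuple):
    """The four hidden parameters, in the order assumed for this sketch."""
    directory: str
    user_id: str
    xml_id: str
    project: str


def split_positional(argv):
    """Script without flags: the last four arguments are the hidden ones."""
    if len(argv) < 4:
        raise ValueError("expected at least the four hidden parameters")
    return list(argv[:-4]), HiddenParams(*argv[-4:])


def parse_flagged(argv):
    """Script with flags: '-c' is reserved for the four hidden parameters."""
    parser = argparse.ArgumentParser()
    # '-s' stands for any flag that belongs to the script itself (assumption).
    parser.add_argument("-s", dest="subset")
    parser.add_argument("-c", dest="hidden", nargs=4, required=True,
                        metavar=("DIR", "ID", "XMLID", "PROJECT"))
    namespace = parser.parse_args(argv)
    return namespace, HiddenParams(*namespace.hidden)


if __name__ == "__main__":
    # Hypothetical argument lists, not values produced by Bio-DIA.
    own, hidden = split_positional(["10", "20", "/tmp/biodia", "u1", "algo_py_1", "demo"])
    print(own, hidden)
    namespace, hidden = parse_flagged(["-s", "a,b", "-c", "/tmp/biodia", "u1", "algo_py_1", "demo"])
    print(namespace.subset, hidden)
```

Keeping the hidden values in one named tuple makes it explicit in the rest of the script which directory and which project prefix are being used.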

The DIR parameter defines the directory path where Bio-DIA will load or save files; in other words, any access to the server's disk is done in this directory. A script can read any file passed as <inter_data> from DIR. The ID parameter stores the user identification value, which is used to create a log file recording everything that happens in the script. The log file is a way to know what occurred inside the script. It is always named ID_log.tsv and has two columns, STATUS and MSG. The first column specifies what is happening inside the script, for example [START], [RUN], [OUTPUTS], [ERROR] and [CONNECT]; the second column is the message related to the status, for example "[START] The script just started" or "[CONNECT] Connection made with success with the database".


procedure is: the output must be described in the log file with the status [OUTPUTS], and the message for this status must follow the template XMLID::PROJECT_myOutput.csv.

The fourth parameter, PROJECT, is concatenated with the name of every output created by the script, allowing Bio-DIA to recognize and control the outputs of the projects. The name pattern looks like PROJECT_myOutput.csv.

Together with the log file, these parameters allow Bio-DIA to know which outputs the projects generated, so that other users can reuse these outputs in their own projects. The figure below shows an example with values for the four parameters.


Figure S1: Log file and hidden parameters example. (A) Table with the four hidden parameters and their values, which every script receives before being executed; the values shown are an example of what the script in Figure S1, line 4, would receive. (B) How the log file is built, how the parameters are used and where they must be placed, following the procedure described in Supplementary Section 2.
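The logging procedure described in this section can be sketched from the script's side. The sketch below is not Bio-DIA code. It assumes that the log file lives inside DIR, that it starts with a STATUS/MSG header row, and that the status is written in square brackets in the first column; the helper names and the example values ("u1", "algo_py_1", "demo") are hypothetical.

```python
import csv
import os
import tempfile


def start_log(directory, user_id):
    """Create ID_log.tsv with the STATUS and MSG header; return its path."""
    path = os.path.join(directory, f"{user_id}_log.tsv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["STATUS", "MSG"])
    return path


def log(path, status, message):
    """Append one row with the status in brackets and its message."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([f"[{status}]", message])


def save_output(directory, log_path, xml_id, project, name, text):
    """Write PROJECT_name into DIR and report it as XMLID::PROJECT_name."""
    filename = f"{project}_{name}"
    with open(os.path.join(directory, filename), "w") as handle:
        handle.write(text)
    log(log_path, "OUTPUTS", f"{xml_id}::{filename}")
    return filename


if __name__ == "__main__":
    # A temporary directory stands in for the DIR value sent by Bio-DIA.
    with tempfile.TemporaryDirectory() as workdir:
        log_path = start_log(workdir, "u1")
        log(log_path, "START", "The script just started")
        save_output(workdir, log_path, "algo_py_1", "demo", "myOutput.csv", "sum\n6\n")
        with open(log_path) as handle:
            print(handle.read())
```

Routing every output through one helper keeps the file name prefix and the [OUTPUTS] log entry consistent with each other.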


Supplementary Section 3: Passing outputs from an algorithm to another one within the same project.

It is possible to pass an output from one algorithm to another (as a parameter) in the same XML project. Figure S2 shows an example of how to do this.

Figure S2: Example of how to access outputs among algorithms. The algorithm 'algo_py_1' receives a list of numbers as parameters to be added, and saves their sum (Sum) in an output file. The second algorithm, 'algo_py_2', needs two parameters. The first is the output created by 'algo_py_1'; the syntax to access it is 'id_of_the_algorithm::number_of_output'. This syntax is used in line 9, algo_py_1::1, meaning that the first output created by algo_py_1 is passed as a parameter. The second parameter, in line 10, is the power to which the number is raised.
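The 'id_of_the_algorithm::number_of_output' syntax can be illustrated with a small Python sketch of how such a reference might be split and resolved. This is not how Bio-DIA is known to implement it: the function names and the dictionary of outputs are hypothetical, and the sketch assumes that output numbers start at 1, as suggested by algo_py_1::1 denoting the first output.

```python
def parse_output_reference(text):
    """Split 'algorithm_id::output_number' into (id, number).

    A plain id without '::' is returned with number None.
    """
    identifier, separator, number = text.strip().partition("::")
    if not separator:
        return identifier, None
    if not number.isdigit() or int(number) < 1:
        raise ValueError(f"invalid output number in reference: {text!r}")
    return identifier, int(number)


def resolve_output(text, outputs):
    """Return the file name that a reference such as 'algo_py_1::1' points to."""
    identifier, number = parse_output_reference(text)
    if number is None:
        raise ValueError(f"reference has no output number: {text!r}")
    return outputs[identifier][number - 1]


if __name__ == "__main__":
    # Hypothetical registry of outputs already produced in the project.
    produced = {"algo_py_1": ["demo_sum.csv"]}
    print(parse_output_reference("algo_py_1::1"))
    print(resolve_output("algo_py_1::1", produced))
```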


Supplementary Section 5: Bio-DIA user interface.

Access Bio-DIA through the link https://ucrania.imd.ufrn.br/biodia-app/ to reach the login page (Figure S3A). Users who do not have an account can create one by clicking the "subscribe" hyperlink, which leads to a registration form (Figure S3B).


Figure S4: Bio-DIA main page. Creating a project. The braces group graphic elements of the interface to better explain them. Brace 1 marks the controls for uploading database files (such as csv or tsv) or pipeline projects (XML files). After uploading files, the user can check the box indicated by brace 2 to see and select all the XML files shown in the red rectangle. Brace 3 points to a frame that shows real-time feedback messages about each execution step of the project shown in brace 4. Brace 5 shows the results generated by the project.


Figure S5: Bio-DIA project management page. This page allows users to manage their projects. The first column, "Project Name", lists the project names; the second column, "Status", shows whether each project is public or private; in the next column, "Check content", clicking the eye icon shows all the files that belong to the project; the column "Change status" switches the project from public to private and vice versa; the last column, "Delete", removes the project and deletes all the outputs it created.

Figure S6: Bio-DIA public project page. This page lists all public projects in the system, so all users can use and reproduce them. The project list shows each project's name and owner. Clicking the eye icon under "Check output" displays all the outputs created by the project above the list; the user only needs to copy the "sub_project" tag (gray text box) and paste it into their own pipeline to reuse a specific output.


Supplementary Section 6: Project examples

Log into Bio-DIA at https://ucrania.imd.ufrn.br/biodia-app/ using login: bioguest, password: welcome123. Copy the pipeline below and press the submit button.


Figure S7: Simple pipeline to get started with Bio-DIA. The pipeline loads the file 'GTEX.txt' (line 2), which is a public file for all users. Line 3 performs a select on the file, getting three columns (vagina, lung and Gene Name) and retrieving only the records where the lung value equals 19. Line 4 saves the result of this select in the project folder 'pipe1'. After the pipeline finishes, panel B shows the result page.


Figure S8: Project example using Bio-DIA XML tags. The parameters have generic values to better explain the XML tags. Line 1 uses the <project> tag to create the project; for Bio-DIA, this is always the root tag. Line 2 connects to data, in this case a table in a database. Line 3, through a <select> tag, executes a selection and projection on data1 (which could be a text file or a table in a relational database). Line 4, through the <algorithm> tag, executes scripts (Shell, Python, Rscript or Perl). An embedded <params> tag is used to pass parameters to the algorithm. Lines 6 and 7 show the two types of parameters: line 6 uses the <inter_data> tag with a value that references an id (data1) created previously in the project (line 2) as an attribute of a <data> or <select> tag; line 7 uses the <value> tag to pass static values. Line 10, through the <save> tag, saves data generated previously in the project to a specific folder; here it saves the content referenced by select1 (store="select1").
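As an illustration of the structure described in the Figure S8 caption, the sketch below embeds a project of that shape in a Python string and reads it with the standard xml.etree.ElementTree module. The XML is a reconstruction from the caption with generic, hypothetical attribute values (myDatabase, myTable, myCode.py), not a copy of the figure, and the summarize function is not part of Bio-DIA.

```python
import xml.etree.ElementTree as ET

# Hypothetical project following the line order given in the caption.
PROJECT_XML = """\
<project name="projectName">
  <data id="data1" type_data="sql" db_name="myDatabase" db_table="myTable"></data>
  <select id="select1" from="data1" get="column_a, column_b" filter="column_a like 'test'"></select>
  <algorithm id="algo_py_1" lang="python" name="myCode.py">
    <params>
      <inter_data>data1</inter_data>
      <value>1,2,3</value>
    </params>
  </algorithm>
  <save store="select1" folder="projectName"></save>
</project>
"""


def summarize(xml_text):
    """Return the project name and one dictionary per top-level step."""
    root = ET.fromstring(xml_text)
    if root.tag != "project":
        raise ValueError("the root tag must be <project>")
    steps = []
    for node in root:
        entry = {"tag": node.tag, **node.attrib}
        if node.tag == "algorithm":
            # Collect (tag, flag, text) for every parameter of the algorithm.
            entry["params"] = [
                (param.tag, param.get("flag"), (param.text or "").strip())
                for param in node.iter()
                if param.tag in ("inter_data", "value")
            ]
        steps.append(entry)
    return root.get("name"), steps


if __name__ == "__main__":
    name, steps = summarize(PROJECT_XML)
    print(name)
    for step in steps:
        print(step)
```

Listing the steps in document order reflects the fact that later tags refer back to ids defined earlier, such as data1 and select1.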

Figure S9: How to reuse other projects' outputs. The <sub_project> tag allows loading results/outputs from other, previously executed projects. Line 1 loads the project="projectName" (created in Fig. S1), which is owned by the user="userName_2"; point 3 is the ID of the output generated by the project in point 2. For each output needed from another project, a <sub_project> tag must be created.


3 DISCUSSION

3.1 Bio-DIA

The back-end was implemented in Python using the Django web development framework, with Spark as the tool responsible for data integration and analysis. The front-end uses Angular, a framework for building web interfaces that was created and is maintained by Google.

With its XML standard, Bio-DIA proved able to access and handle a variety of files transparently for the user, making it possible to access and manipulate files (.tsv, .csv) and database tables (MySQL, PostgreSQL) in the same way. In addition to data access, scripts in several languages (R, Perl, Shell and Python) were executed transparently; the only requirement for running any script is knowing the parameters it needs. A pipeline built entirely in Bio-DIA could be reused by another user, and data from another user could also be used on their own; both could be shared among the users of the system, enabling greater collaboration and faster knowledge transfer among users.

Bio-DIA therefore proves to be a tool capable of handling various types of data, executing scripts in several languages, and sharing scripts, data and pipelines among its users.


4 CONCLUSION

An application was implemented to help the scientific community overcome problems related to the integration and replicability of data and algorithms, with the main goal of providing an easy-to-use online tool to improve scientific collaboration. The projects and pipelines specified in XML in Bio-DIA are intended to ease access to data science by integrating different types of formats, executing algorithms (scripts) developed in different programming languages (R, Python, Shell and Perl), and reusing and sharing the results among projects.

Bio-DIA was also able to manage projects, allowing a project owner to create new projects, update projects, read the projects created and delete existing projects.

4.1 Future Work

In future versions, a graphical user interface (GUI) will be developed that makes it possible to create projects (pipelines) visually by dragging and dropping graphic shapes, icons and their connections. Knowledge of XML will then no longer be needed to use our tool: the GUI will abstract away all prior knowledge of the XML standard from the end user, further simplifying the usability of the software for a heterogeneous data science team.

We will also add more security features, such as account management, in which each user will have a storage limit for their account, and user session control. Other planned features are: greater control over the execution of third-party scripts; the ability to invite one or more users to take part in a private project; connecting to and accessing remote data from within Bio-DIA; and a ranking of the most used projects and scripts, so that new users can discover the projects and scripts that are most shared and reused and assess whether they are relevant to the projects they may start.


REFERENCES

Altintas, Ilkay. 2011. "Distributed workflow-driven analysis of large-scale biological data using biokepler." Proceedings of the 2nd international workshop on Petascal data analytics: challenges and opportunities.

Beisken, Stephan, Thorsten Meinl, Bernd Wiswedel, Luis F de Figueiredo, Michael Berthold, and Christoph Steinbeck. 2013. "KNIME-CDK: Workflow-driven cheminformatics." BMC bioinformatics 14 (1):257.

da Silva, Vandeclecio Lira, André Faustino Fonseca, Marbella Fonseca, Thayna Emilia da Silva, Ana Carolina Coelho, José Eduardo Kroll, Jorge Estefano Santana de Souza, Beatriz Stransky, Gustavo Antonio de Souza, and Sandro José de Souza. 2017. "Genome-wide identification of cancer/testis genes and their association with prognosis in a pan-cancer analysis." Oncotarget 8 (54):92966.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics 21 (16):3439-3440.

Eyre, Tina A., Fabrice Ducluzeau, Tam P. Sneddon, Sue Povey, Elspeth A. Bruford, and Michael J. Lush. 2006. "The HUGO Gene Nomenclature Database, 2006 updates." Nucleic Acids Research 34 (suppl_1):D319-D321. doi: 10.1093/nar/gkj147.

Flicek, P., Amode, M. R., Barrell, D., Beal, K., Billis, K., Brent, S., ... & Gil, L. (2013). Ensembl 2014. Nucleic Acids Research, 42(D1), D749-D755.

Goecks, Jeremy, Anton Nekrutenko, and James Taylor. 2010. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome biology 11 (8):R86.

Holzinger, Andreas, Matthias Dehmer, and Igor Jurisica. 2014. "Knowledge discovery and interactive data mining in bioinformatics-state-of-the-art, future challenges and research directions." BMC bioinformatics 15 (6):I1.

Pruitt, Kim D., Tatiana Tatusova, and Donna R. Maglott. 2005. "NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins." Nucleic Acids Research 33 (suppl_1):D501-D504. doi: 10.1093/nar/gki025.

Kim, M., Zimmermann, T., DeLine, R., & Begel, A. (2016, May). The emerging role of data scientists on software development teams. In Proceedings of the 38th International Conference on Software Engineering (pp. 96-107). ACM.

Ludäscher, Bertram, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A Lee, Jing Tao, and Yang Zhao. 2006. "Scientific workflow management and the Kepler system." Concurrency and Computation: Practice and Experience 18 (10):1039-1065.

Maglott, Donna, Jim Ostell, Kim D. Pruitt, and Tatiana Tatusova. 2005. "Entrez Gene: gene-centered information at NCBI." Nucleic Acids Research 33 (suppl_1):D54-D58. doi: 10.1093/nar/gki031.

Stearley, Jon, Sophia Corwell, and Ken Lord. 2010. "Bridging the Gaps: Joining Information Sources with Splunk." SLAML.

Van Der Aalst, W. (2016). Data science in action. In Process Mining (pp. 3-23). Springer, Berlin, Heidelberg.

Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., ... & Waugh, B. (2014). Best practices for scientific computing. PLoS biology, 12(1), e1001745.

Yang, Q., & Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(04), 597-604.

Xin, Jiwen, Adam Mark, Cyrus Afrasiabi, Ginger Tsueng, Moritz Juchler, Nikhil Gopal, Gregory S. Stupp, Timothy E. Putman, Benjamin J. Ainscough, Obi L. Griffith, Ali Torkamani, Patricia L. Whetzel, Christopher J. Mungall, Sean D. Mooney, Andrew I. Su, and Chunlei Wu. 2015. "MyGene.info and MyVariant.info: Gene and Variant Annotation Query Services."