Keyword Search over COVID-19 Data
Yenier T. Izquierdo
1,2, Grettel M. García
2, Melissa Lemos
1,2, Alexandre Novello
1, Bruno Novelli
2, Cleber Damasceno
2,
Luiz André P.P. Leme
3, Marco A. Casanova
1,22
Agenda
• Motivation
• DANKE
• CovidKeys
• Conclusions
Motivation
• Examples of COVID-19 data
• FAPESP / USP / Instituto Fleury, H. Sírio-Libanês and H. Israelita Albert Einstein
• The COVID-19 Data Sharing / BR initiative - open data with demographic, clinical and laboratory data about patients tested for COVID-19 in the State of São Paulo
• Ministry of Health/Brazil
• NSG (Notificações de Síndrome Gripal) - notifications of suspected COVID-19 cases
• SRAG 2020 - Acute Respiratory Syndrome (SARS) and COVID-19 data
• The Center for Systems Science and Engineering (CSSE) at Johns Hopkins University
• COVID-19 Data Repository
Motivation
• COVID-19 data
• Data is typically available
• as downloadable CSV files
• as datasets that can be queried through a user interface (e.g., Elastic Search)
• as datasets available through a "sandbox" (e.g., Google Cloud services)
• Files can be quite large, with hundreds of columns and millions of lines
• Data can be at different levels of granularity
4
Motivation
• COVID-19 data
• Cumbersome to process with the usual desktop spreadsheet tools
• A perhaps more robust approach:
• Download and store the data in a standard DBMS
• Define a query that retrieves the data the user is interested in
• Export the query results to a data analysis tool
Motivation
• COVID-19 data
• Cumbersome to process with the usual desktop spreadsheet tools
• A perhaps more robust approach:
• Download and store the data in a standard DBMS
• Define a query that retrieves the data the user is interested in
• Export the query results to a data analysis tool
6
Keyword search
Motivation
• Keyword search over databases
• the user specifies a few terms, called keywords
• A keyword query is simply a list of keywords
• the system must retrieve the data that best match the list of keywords
• Keyword search has been expanded to relational databases (and RDF datasets)
• Keyword queries are typically compiled into SQL (or SPARQL)
8
Agenda
• Motivation
• DANKE
• CovidKeys
• Conclusions
DANKE
• What is DANKE?
• a platform for data and knowledge retrieval
• Main Components
• Data Ingestion and Indexing Services
• Keyword Query Processing Service
• Compiles a keyword query into an SQL (or SPARQL) query
• Uses the database schema to guide the process
10
name A1
P1
M1 act
title
name descr
P2
“Richard Burton”
act A2
win descr
rel
“Liz Taylor”
“Oscar Best Actress”
"Oscar Best Cinematography"
"Who's Afraid of Virginia Woolf"
win
A3
descr
name win
“John Hurt”
P3 Query: Taylor Burton Hurt
award
?v1 name “Taylor”
?v2 name “Burton”
?v3 name “Hurt”
?v1 rel ?v2
?v2 win ?v4
?v3 win ?v4
"Valladolid Best Actor
(1984)"
...
Taylor Burton Hurt
12
Agenda
• Motivation
• DANKE
• CovidKeys
• Conclusions
Experiments with CovidKeys
• CovidKeyS
• a Web Application that uses the DANKE services to query COVID-19 data
Experiments with CovidKeys
• Scenario 1 - The NSG (Notificações de Síndrome Gripal) dataset
• stores notifications of suspected COVID-19 cases
• data allow relating demographic, clinical, and laboratory data
• data is organized in DANKE as a relational database with three tables:
• Paciente – patient data
• Teste – COVID-19 test data
• Desfecho – how the case evolved
14
Experiments with CovidKeys
• Scenario 1 - The NSG (Notificações de Síndrome Gripal) dataset
• NL Query: “Quais foram os casos em Saquarema de gestantes, com sintoma de febre, e idade menor do que 40 anos?” ( “Which were the cases in
Saquarema involving pregnant women, with fever symptoms, and age less than 40 years?”)
Keyword query: Saquarema gestante febre idade < 40
• NL Query: “Quais enfermeiros, e em que municípios, fizeram teste rápido e
tiveram resultado positivo?” (“Which nurses in which counties had a rapid
test with positive result?”)
Experiments with CovidKeys
• Scenario 1 - The NSG (Notificações de Síndrome Gripal) dataset
• Keyword query: enfermeiro município “teste rápido” resultado positivo
• enfermeiro
• partially matches values of attribute Paciente.profissao
• generates the restriction “Paciente.profissao = '*enfermeiro*' ”
• (the keyword “profissão” is not required, since attribute profissao is inferred from the match)
• “teste rápido” (a single term due to the use of double quotes)
• partially matches values of attribute Teste.tipo
• generates the restriction “Teste.tipo='*teste rápido*' ”
• município, resultado and positivo
• (processed likewise)
• requires joining two tables, Paciente and Teste
• discovered from the matches and the schema
16
Experiments with CovidKeys
• Scenario 1 - The NSG (Notificações de Síndrome Gripal) dataset
• Keyword query: enfermeiro município “teste rápido” resultado positivo
• Compiled SQL query:
select Paciente.id_paciente, Paciente.profissao,
Paciente.municipio, Teste.tipo, Teste.Resultado from Paciente, Teste
where Paciente.profissao = '*enfermeiro*' and Teste.tipo = = '*teste rapido*'
and Teste.resultado = '*positivo'
Experiments with CovidKeys
• Scenario 2 - Global Data (from the Google COVID-19 Public Datasets)
• The John Hopkins University (JHU)
• global confirmed COVID-19 cases, deaths, and recoveries
• Oxford Policy Tracker
• COVID-19 policies at the country level
• Google Mobility Report
• movement trends over time by geography across different categories of places
• (some) World Bank global data
Experiments with CovidKeys
• Scenario 2 - Global Data (from the Google COVID-19 Public Datasets)
• NL Query: “List the total number of deaths and the recreation mobility index on the same date for Belarus. ”
Keyword query: Belarus deaths recreation date
20
Experiments with CovidKeys
• Scenario 2 - Global Data (from the Google COVID-19 Public Datasets)
• Keyword query: Belarus deaths recreation date
• Belarus
• matches a value of attribute “Country Name” of table Country (from the World Bank global data about population)
• deaths and date
• deaths partially matches the name of attribute “Total Deaths” and date matches the name of attribute Date of table Contamination (from the global synthesis of COVID-19 cases by country)
• recreation
Agenda
• Motivation
• DANKE
• Architecture
• Keyword Query Processing
• Experiments with CovidKeys
• Conclusions
Conclusions
• Summary
• DANKE - a platform for data and knowledge retrieval
• concentrated on the keyword search component
• CovidKeyS
• uses DANKE to implement keyword queries over three COVID-19 data scenarios
• two scenarios described in the paper
• The NSG (Notificações de Síndrome Gripal) dataset
• Global Data (from the Google COVID-19 Public Datasets)
• third scenario:
• The COVID-19 Data Sharing / BR initiative
24
Conclusions
• Future work
• extend DANKE to operate over federated databases
• extend keyword query processing to...
• support aggregation operations in natural language-like syntax
• support specific join clauses (e.g., same date)
• ... beyond keyword query
• (see Tutorial 1 – "Palavras, apenas*: Métodos e Técnicas para Interfaces de Linguagem Natural em Bancos de Dados")
26