Selected Topics on the Web-of-Data and Data Integration

Academic year: 2022

(1)

Selected Topics on the Web-of-Data and Data Integration

Marco A. Casanova

http://www.inf.puc-rio.br/~casanova/

(2)

Topics

§ Selected Topics on the Web-of-Data

§ Introduction

§ Basic Technology

§ RDF Keyword Search

§ Selected Topics on Data Integration

§ Introduction

§ Classic Data Integration

§ Data Lake Management

§ Web-of-Data and Data Integration

(3)

Selected Topics on the Web-of-Data

Introduction

§ Traditional Web

§ Adherence to Standards

URLs – Web Page Ids

HTML

HTTP

...

Internet protocols

In March 1989, Tim Berners-Lee submitted a proposal for an information management system to his boss, Mike Sendall.

‘Vague, but exciting’ were the words that Sendall wrote on the proposal.

(4)

Selected Topics on the Web-of-Data

Introduction

§ Traditional Web search engines

§ Coverage

data captured by Web crawlers

§ Search capabilities

Keyword search

Ranking

Iceberg Photo

Judith Currelly, Diane Farris Gallery

(5)
(6)
(7)

Selected Topics on the Web-of-Data

Introduction

§ Traditional Web search engines

§ Coverage

Just the data captured by Web crawlers

§ Search capabilities

✓ Keyword search

✓ Ranking

§ Web page semantics

“Everything is text”

Iceberg Photo

Judith Currelly, Diane Farris Gallery

(8)

Selected Topics on the Web-of-Data

Introduction

§ Traditional Web search engines

§ Coverage

Just the data captured by Web crawlers

§ Search capabilities

✓ Keyword search

✓ Ranking

§ Web page semantics

“Everything is text” (really?)

other types of data

Iceberg Photo

Judith Currelly, Diane Farris Gallery

(9)
(10)

Topics

§ Selected Topics on the Web-of-Data

§ Introduction

§ Basic Technology

RDF

Adding Semantics to Web Pages

Ontologies

§ RDF Keyword Search

§ Selected Topics on Data Integration

§ Web-of-Data and Data Integration

(11)

Selected Topics on the Web-of-Data

What is the Web-of-Data

(12)

Selected Topics on the Web-of-Data

What is the Web-of-Data

http://lattes.cnpq.br/0807511237795775

(13)

Selected Topics on the Web-of-Data

What is the Web-of-Data

http://lattes.cnpq.br/0807511237795775

http://purl.org/dc/elements/1.1/creator

(14)

Selected Topics on the Web-of-Data

What is the Web-of-Data

http://lattes.cnpq.br/0807511237795775

http://purl.org/dc/elements/1.1/creator

https://doi.org/10.17771/PUCRio.acad.5181

(15)

Selected Topics on the Web-of-Data

RDF

§ URI – Uniform Resource Identifier

§ a compact sequence of characters

that identifies an abstract or physical resource

§ Examples

http://lattes.cnpq.br/0807511237795775 (a person)

http://purl.org/dc/elements/1.1/creator (a term)

https://doi.org/10.17771/PUCRio.acad.5181 (a document)

(16)

Subject                                    Predicate                                 Object
https://doi.org/10.17771/PUCRio.acad.5181  http://purl.org/dc/elements/1.1/date      2004/07/13
https://doi.org/10.17771/PUCRio.acad.5181  http://purl.org/dc/elements/1.1/language  Portuguese
https://doi.org/10.17771/PUCRio.acad.5181  http://purl.org/dc/elements/1.1/creator   http://lattes.cnpq.br/0807511237795775
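A minimal sketch, not part of the original slides, of how the three statements above could be held in memory: each RDF statement becomes a (subject, predicate, object) tuple and a pattern with wildcards selects the matching ones. The names `DC`, `DOC`, `PERSON` and `match` are introduced here only for illustration.

```python
# Illustrative sketch: RDF statements as (subject, predicate, object) tuples.
DC = "http://purl.org/dc/elements/1.1/"            # namespace of the predicates on the slide
DOC = "https://doi.org/10.17771/PUCRio.acad.5181"  # the document
PERSON = "http://lattes.cnpq.br/0807511237795775"  # the person

triples = [
    (DOC, DC + "date", "2004/07/13"),
    (DOC, DC + "language", "Portuguese"),
    (DOC, DC + "creator", PERSON),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Who created the document?
creators = [o for _, _, o in match(triples, s=DOC, p=DC + "creator")]
```

Because the object of the `creator` statement is itself a URI, it can in turn be the subject of further statements, which is what turns a set of triples into a graph.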

(17)

Subject                                    Predicate                                 Object
https://doi.org/10.17771/PUCRio.acad.5181  http://purl.org/dc/elements/1.1/date      2004/07/13
https://doi.org/10.17771/PUCRio.acad.5181  http://purl.org/dc/elements/1.1/language  Portuguese
https://doi.org/10.17771/PUCRio.acad.5181  http://purl.org/dc/elements/1.1/creator   http://lattes.cnpq.br/0807511237795775

[Figure: the same three triples drawn as a graph. The node https://doi.org/10.17771/PUCRio.acad.5181 has arcs labeled with the date, language and creator predicates, pointing to 2004/07/13, Portuguese and http://lattes.cnpq.br/0807511237795775.]

(18)

Selected Topics on the Web-of-Data

Ontologies

§ Ontology

§ Defines classes and properties, and their constraints

§ Typically written in RDFS or OWL 2

§ Examples

§ DCMI Metadata Terms (“Dublin Core”)

§ FOAF – The Friend of a Friend Vocabulary

§ FRBR – Functional Requirements for Bibliographic Records

§ CIDOC CRM – Conceptual Reference Model for cultural heritage documentation

(19)

Selected Topics on the Web-of-Data

Ontologies

§ Remarks about ontology design

§ “Same-old-conceptual-design”

§ “Lavoisier Principle” applies

Reuse known ontologies as much as possible

§ Ontology = Vocabulary + Axioms

Axioms (or constraints) capture the semantics of the terms

§ Example

§ The Music ontology uses the FOAF, FRBR, Event and Timeline ontologies
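To make "Ontology = Vocabulary + Axioms" concrete, here is a toy sketch with a three-term vocabulary and one domain/range axiom. Every name in it (`ex:Document`, `ex:Person`, `ex:creator`, `violations`) is hypothetical and not taken from any of the ontologies listed above. Note that RDFS itself treats domain and range as inference rules rather than integrity checks; the sketch uses them as checks only to show how axioms constrain the use of terms.

```python
# Toy ontology: a vocabulary plus domain/range axioms (all names hypothetical).
vocabulary = {"ex:Document", "ex:Person", "ex:creator"}
axioms = {"ex:creator": {"domain": "ex:Document", "range": "ex:Person"}}

# Class membership of each resource, asserted by hand for this example.
types = {"ex:thesis1": "ex:Document", "ex:alice": "ex:Person"}

def violations(triples):
    """List the triples that break a declared domain or range axiom."""
    bad = []
    for s, p, o in triples:
        axiom = axioms.get(p)
        if axiom is None:
            continue  # no constraint declared for this property
        if types.get(s) != axiom["domain"] or types.get(o) != axiom["range"]:
            bad.append((s, p, o))
    return bad

ok = [("ex:thesis1", "ex:creator", "ex:alice")]
wrong = [("ex:alice", "ex:creator", "ex:thesis1")]  # subject and object swapped
```

The vocabulary alone would accept both statements; only the axiom tells them apart, which is the sense in which axioms capture the semantics of the terms.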

(20)

Topics

§ Selected Topics on the Web-of-Data

§ Introduction

§ Basic Technology

§ RDF Keyword Search

§ Selected Topics on Data Integration

§ Web-of-Data and Data Integration

(21)

Selected Topics on the Web-of-Data

RDF Keyword Search

RDF Graph

[Figure: example RDF graph. Nodes: P1, P2, M1, A1, A2. Arc labels: name, act, title, win, award, descr, rel. Literals: "Richard Burton", "Liz Taylor", "Who's Afraid of Virginia Woolf", "Oscar Best Actress", "Oscar Best Cinematography".]

(22)

Selected Topics on the Web-of-Data

RDF Keyword Search

§ The RDF Keyword Search Problem

How to implement keyword search over RDF graphs?

... with good performance and proper ranking of the results

(23)

Selected Topics on the Web-of-Data

RDF Keyword Search

§ Schema-based approach

§ Requires... a schema!

§ Locates keyword matches in the graph

§ Compiles the keywords into a SPARQL query using schema information

§ Graph summary approach

§ Computes "graph summaries"

§ Locates keyword matches in the graph

§ Compiles the keywords into a SPARQL query using the graph summaries

§ Graph traversal approach

§ Locates keyword matches in the graph

§ Directly traverses the graph from the matching nodes or arcs

(24)

Query: "Taylor" "Oscar"

?p name "Taylor"
?a descr "Oscar*"
?p act ?m
?m award ?a

?p name "Taylor"
?a descr "Oscar*"
?p win ?a

[Figure: the example RDF graph from slide (21), repeated.]
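The two pattern sets above can be produced mechanically once the keyword matches and a connecting path are known. The sketch below is one hedged way to do it: `compile_keywords`, the `http://example.org/` prefix and the use of `CONTAINS` for matching are assumptions made for illustration, not the method used in the slides, and keywords are not escaped.

```python
def compile_keywords(matches, join_path, prefix="http://example.org/"):
    """Build a SPARQL query string.

    matches:   list of (variable, property, keyword); the keyword was found
               in literal values of `property`.
    join_path: list of (subject_var, property, object_var) triple patterns
               that connect the matched variables.
    """
    lines = [f"PREFIX : <{prefix}>", "SELECT * WHERE {"]
    for i, (var, prop, kw) in enumerate(matches):
        lit = f"?v{i}"  # fresh variable holding the literal value
        lines.append(f"  ?{var} :{prop} {lit} .")
        lines.append(f'  FILTER(CONTAINS(LCASE(STR({lit})), "{kw.lower()}"))')
    for s, prop, o in join_path:
        lines.append(f"  ?{s} :{prop} ?{o} .")
    lines.append("}")
    return "\n".join(lines)

matches = [("p", "name", "Taylor"), ("a", "descr", "Oscar")]
query_direct = compile_keywords(matches, [("p", "win", "a")])
query_via_movie = compile_keywords(matches, [("p", "act", "m"), ("m", "award", "a")])
```

Choosing between the join paths (person wins award, or person acts in a movie that received an award) is where ranking comes in.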

(26)

Query: "Taylor" "Oscar"

No schema

Graph summary approach (via KMV-synopses)

§ the domain of name is similar to the domain of win

§ the domain of descr is similar to the range of win

Graph traversal approach

[Figure: the example RDF graph from slide (21), repeated.]
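The slide does not spell out how the synopses are built or compared. As a hedged illustration, the sketch below shows one standard construction: a KMV synopsis keeps the k smallest hash values of a set, and two synopses yield an estimate of the Jaccard similarity of the underlying sets. The function names and the sample domains are assumptions.

```python
import hashlib

def h(value):
    """Map a value to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv(values, k):
    """KMV synopsis: the k smallest distinct hash values of a set."""
    return sorted({h(v) for v in values})[:k]

def jaccard_estimate(syn_a, syn_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two KMV synopses built with the same k."""
    merged = sorted(set(syn_a) | set(syn_b))[:k]  # k smallest hashes of A ∪ B
    if not merged:
        return 0.0
    common = set(syn_a) & set(syn_b)
    return sum(1 for x in merged if x in common) / len(merged)

# Hypothetical domains: the subjects that occur with each property.
domain_name = {"P1", "P2"}
domain_win = {"P2"}
similarity = jaccard_estimate(kmv(domain_name, 8), kmv(domain_win, 8), 8)
```

When both sets are smaller than k the synopses hold every hash value and the estimate is exact; for large domains only k values per property need to be stored.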

(27)

Selected Topics on the Web-of-Data

RDF Keyword Search

§ Problem: RDF keyword search

How to locate, in the RDF graph, the entities the keywords refer to?

How to discover, in the RDF graph, relationships between such entities?

§ Solution I: compile the keywords into a SPARQL query, using

Information extracted from the RDF schema, if available

Information extracted from the RDF graph

§ Solution II: traverse the RDF graph from the keyword matches
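Solution II can be sketched as a breadth-first search that starts at a node matching one keyword and stops at the nearest node matching the other. The edge list below is only one plausible reading of the example graph (the `rel` arc is left out), so treat it as hypothetical data; the helper names are likewise introduced here.

```python
from collections import deque

# Hypothetical edges: one plausible reading of the example graph.
edges = [
    ("P1", "name", "Richard Burton"),
    ("P2", "name", "Liz Taylor"),
    ("P1", "act", "M1"),
    ("P2", "act", "M1"),
    ("M1", "title", "Who's Afraid of Virginia Woolf"),
    ("M1", "award", "A1"),
    ("A1", "descr", "Oscar Best Cinematography"),
    ("P2", "win", "A2"),
    ("A2", "descr", "Oscar Best Actress"),
]

def neighbours(node):
    """Adjacent nodes, ignoring arc direction."""
    for s, _p, o in edges:
        if s == node:
            yield o
        elif o == node:
            yield s

def keyword_nodes(keyword):
    """Object nodes whose text contains the keyword (case-insensitive)."""
    return [o for _, _, o in edges if keyword.lower() in o.lower()]

def shortest_path(start, goals):
    """Breadth-first search from start to the nearest goal node."""
    goals = set(goals)
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node in goals:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbours(node):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

path = shortest_path(keyword_nodes("Taylor")[0], keyword_nodes("Oscar"))
```

Under these assumed edges the shortest connection runs through `win`, so the direct answer is found before the longer one through the movie; path length is a natural first ranking signal.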

(28)

Summary

§ Selected Topics on the Web-of-Data

§ Basic Technology

RDF

Adding Semantics to Web Pages

Ontologies

§ RDF Keyword Search

Schema-based approach

Graph summary approach

Graph traversal approach

(29)

Topics

§ Selected Topics on the Web-of-Data

§ Selected Topics on Data Integration

§ Introduction

§ Classic Data Integration

§ Data Lake Management

§ Web-of-Data and Data Integration

(30)

Selected Topics on Data Integration

Introduction

§ The Data Integration Problem:

How to integrate data from different sources?

Poço (Portuguese for "well")
Anp-Id  Lat  Long
P1      x1   y1
P2      x2   y2

Well
Id  X   Y
Q1  x1  y1
Q2  t2  u2

(31)

The Data Integration Problem:

"Older than the Cathedral of Braga" (Portuguese saying: "Mais velho que a Sé de Braga")

(32)

Selected Topics on Data Integration

Introduction

§ Recent challenges

... Big Data applications ... Data Science applications

"... Merck has approximately 4,000 Oracle databases, a large data lake, uncountable individual files and the company is interested in public data from the Web. Merck data scientists spend at least 90% of their time finding data sets relevant to their task at hand and then performing data integration on the result. ... ".

Stonebraker, M. "Data Integration: The Current Status and the Way Forward", Bulletin of the IEEE Comp. Society Tech. Committee on Data Eng., 2018.

(33)

Selected Topics on Data Integration

Introduction

§ Percentage of time spent on a Data Science project (source: Crowdflower)

§ Building training sets: 3%

§ Cleaning and organizing data: 60%

§ Collecting data sets: 19%

§ Mining data for patterns: 9%

§ Refining algorithms: 4%

§ Other: 5%

https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

(34)

Topics

§ Selected Topics on the Web-of-Data

§ Selected Topics on Data Integration

§ Introduction

§ Classic Data Integration

Schema Alignment

Entity Linkage

Data Extraction

Data Fusion

§ Data Lake Management

§ Web-of-Data and Data Integration

(35)

Selected Topics on Data Integration

Schema Alignment

§ The Schema Alignment Problem

How to map classes and attributes with the same semantics?

Poço
Anp-Id  Lat  Long
P1      x1   y1
P2      x2   y2

Well
Id  X   Y
Q1  x1  y1
Q2  t2  u2

(36)

Selected Topics on Data Integration

Schema Alignment

§ Approaches to Schema Alignment

§ Syntactical approach:

"syntactical proximity" implies "semantic proximity"

§ Semantical approach:

"things" with similar data have the same meaning

Bruegel, Pieter. The Tower of Babel. c 1563 Oil on oak panel. 114 x 155 cm

Kunsthistorisches Museum Wien, Vienna
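Both approaches to schema alignment can be tried on the Poço/Well example. In this hedged sketch the syntactical score is a string similarity between attribute names (Python's `difflib.SequenceMatcher`) and the semantical score is the Jaccard overlap of attribute values; the scoring choices and the `best_match` helper are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Columns of the two example tables; values are the symbolic ones on the slide.
poco = {"Anp-Id": ["P1", "P2"], "Lat": ["x1", "x2"], "Long": ["y1", "y2"]}
well = {"Id": ["Q1", "Q2"], "X": ["x1", "t2"], "Y": ["y1", "u2"]}

def name_similarity(a, b):
    """Syntactical approach: similarity of attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_similarity(col_a, col_b):
    """Semantical approach: Jaccard overlap of attribute values."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_match(source, target, score):
    """For each source attribute, pick the highest-scoring target attribute.

    Ties are broken by target order, so a zero score still yields a match.
    """
    return {s: max(target, key=lambda t: score(s, t)) for s in source}

by_name = best_match(poco, well, name_similarity)
by_value = best_match(poco, well, lambda s, t: value_similarity(poco[s], well[t]))
```

The names `Lat` and `X` share no characters, so the name-based score is zero, while the value overlap pairs `Lat` with `X` and `Long` with `Y`. That second result depends entirely on the two sources sharing the row (x1, y1).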

(37)

Selected Topics on Data Integration

Schema Alignment

§ Problems

§ Syntactical approach:

Presupposes that "syntactical proximity" implies "semantic proximity"

§ Semantical approach:

Requires overlapping data sources

Works for simple conceptual schemas

Bruegel, Pieter. The Tower of Babel. c 1563 Oil on oak panel. 114 x 155 cm

Kunsthistorisches Museum Wien, Vienna

(38)

Selected Topics on Data Integration

Schema Alignment

§ Problem

§ ... Too many alignments!

(39)

Selected Topics on Data Integration

Schema Alignment

§ Mediated schema

§ Shared vocabulary (with whom?)

§ Difficult to construct

§ ... impossible dream!

§ Domain ontology

§ Different name for mediated schema

§ Different formalism

§ ... same problems!

Picasso, Pablo. 'Don Quixote', c. 1955 Lithography. 28 x 36 cm

"Schema First, Query Later" Paradigm

(40)

Selected Topics on Data Integration

Entity Linkage

§ The Entity Linkage Problem

How to identify the same entity in distinct data sources?

Poço
Anp-Id  Lat  Long
P1      x1   y1
P2      x2   y2

Well
Id  X   Y
Q1  x1  y1
Q2  t2  u2

?

(41)

Selected Topics on Data Integration

Entity Linkage

§ The Entity Linkage Problem

How to identify the same entity in distinct data sources?

§ Solution

§ Align the data source schemas

§ Discover similar classes from different data sources

§ Discover which attributes from similar classes identify entities

§ Locate entities with the same (or similar) attribute values
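The last step of the solution can be sketched on the Poço/Well example, assuming the earlier steps have already produced an alignment (`Lat` with `X`, `Long` with `Y`) and decided that the coordinates identify a well. The `link` function and the exact-equality test are illustrative assumptions; real coordinates would need a tolerance.

```python
poco = [{"Anp-Id": "P1", "Lat": "x1", "Long": "y1"},
        {"Anp-Id": "P2", "Lat": "x2", "Long": "y2"}]
well = [{"Id": "Q1", "X": "x1", "Y": "y1"},
        {"Id": "Q2", "X": "t2", "Y": "u2"}]

# Assumed output of schema alignment: identifying attributes and their counterparts.
alignment = {"Lat": "X", "Long": "Y"}

def link(left, right, alignment, left_id, right_id):
    """Pair rows whose aligned identifying attributes carry equal values."""
    index = {}
    for row in right:
        key = tuple(row[alignment[a]] for a in alignment)
        index.setdefault(key, []).append(row[right_id])
    links = []
    for row in left:
        key = tuple(row[a] for a in alignment)
        for rid in index.get(key, []):
            links.append((row[left_id], rid))
    return links

links = link(poco, well, alignment, "Anp-Id", "Id")
```

Indexing one side by its key avoids comparing every pair of rows, which matters once the sources hold more than a handful of entities.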

(42)

Selected Topics on Data Integration

Data Extraction

§ The Data Extraction Problem

How to extract structured data from semi-structured or unstructured data?

§ Two Types of Data Extraction (DE)

§ Closed-world DE: align to known entities and attributes

§ Open-world DE: not limited to known entities or attributes

(43)

Selected Topics on Data Integration

Data Extraction

§ Document Interpretation Problems

How to recognize the entities a document refers to?

(Entity recognition)

How to recognize relationships in a document?

(Relation extraction)

§ An approach...

§ Use data from the database to train algorithms to

Recognize entities represented in the database ("CWA", the closed-world assumption)

Recognize relationships represented in the database ("distant learning", usually called distant supervision)
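A hedged sketch of the idea behind the second item: facts already in the database are used to label sentences automatically, and the labeled sentences then serve as training data for a relation extractor. The facts, sentences, relation name `acted_in` and the plain substring matching are all simplifying assumptions.

```python
# Known facts from the database: (subject, relation, object). Hypothetical sample.
facts = [("Liz Taylor", "acted_in", "Who's Afraid of Virginia Woolf")]

sentences = [
    "Liz Taylor starred in Who's Afraid of Virginia Woolf in 1966.",
    "Richard Burton was born in Wales.",
]

def label(sentences, facts):
    """Label a sentence with a relation when it mentions both arguments of a known fact."""
    labeled = []
    for text in sentences:
        for subj, rel, obj in facts:
            if subj in text and obj in text:
                labeled.append((text, subj, rel, obj))
    return labeled

training_examples = label(sentences, facts)
```

The labels are noisy by construction: a sentence can mention both entities without expressing the relation, which is why this is only a starting point for training.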

(44)

Selected Topics on Data Integration

Data Fusion

§ The Data Fusion Problem

How to resolve conflicts and obtain correct values?

§ Challenge

§ Reasoning about source quality works only for easy cases
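As an illustration of the "easy cases", the sketch below fuses conflicting claims by majority vote and then by a vote weighted with an assumed accuracy per source. The claims, source names and accuracy figures are hypothetical; in practice accuracies are unknown and must themselves be estimated.

```python
from collections import Counter, defaultdict

# Hypothetical claims: (source, entity, attribute, value).
claims = [
    ("S1", "Q1", "X", "x1"),
    ("S2", "Q1", "X", "x1"),
    ("S3", "Q1", "X", "x9"),
]

def majority_vote(claims):
    """Pick, for each (entity, attribute), the value most sources agree on."""
    votes = defaultdict(Counter)
    for _source, entity, attr, value in claims:
        votes[(entity, attr)][value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}

def weighted_vote(claims, accuracy):
    """Same as majority_vote, but each vote is weighted by the source's assumed accuracy."""
    votes = defaultdict(Counter)
    for source, entity, attr, value in claims:
        votes[(entity, attr)][value] += accuracy.get(source, 0.5)
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}

fused = majority_vote(claims)
fused_weighted = weighted_vote(claims, {"S1": 0.3, "S2": 0.3, "S3": 0.9})
```

The two strategies disagree here: two weak sources outvote one strong source under majority voting but not under weighting. Neither handles sources that copy from each other, one of the harder cases.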

(45)

Summary

§ Classic Data Integration

§ Schema Alignment

§ Entity Linkage

§ Data Extraction

§ Data Fusion

(46)

Topics

§ Selected Topics on the Web-of-Data

§ Selected Topics on Data Integration

§ Introduction

§ Classic Data Integration

§ Data Lake Management

§ Web-of-Data and Data Integration

(47)

Selected Topics on Data Integration

Data Lake Management

"Load First, Schema Later" Paradigm

Ungerer, G. (2018) Cleaning Up the Data Lake with an Operational Data Hub. O'Reilly Media, Inc. ISBN 978-1-492-02735-5

(48)

Selected Topics on Data Integration

Data Lake Management

§ Data Lakes...

§ data can be in large volumes

§ data can have many formats

§ data can be dirty

§ data can change over time

§ ...

(49)

Selected Topics on Data Integration

Data Lake Management

§ Common Tasks in Data Lakes

§ Ingestion

§ Extraction (Type Inference)

§ Cleaning

§ Metadata Management

§ Discovery

§ Integration

§ Versioning
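Of the tasks above, type inference is easy to sketch: under "load first, schema later" the values arrive as strings and a type is guessed afterwards. The function below is a minimal illustration; the candidate types, their order and the `%Y/%m/%d` date format are assumptions.

```python
from datetime import datetime

def infer_type(values):
    """Return the first of int, float, date, str that fits every non-empty value."""
    def fits(parse):
        for v in values:
            if v == "":
                continue  # treat empty strings as missing values
            try:
                parse(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "float"
    if fits(lambda v: datetime.strptime(v, "%Y/%m/%d")):
        return "date"
    return "str"

column_types = {
    "count": infer_type(["1", "2", ""]),
    "depth": infer_type(["1", "2.5"]),
    "date": infer_type(["2004/07/13"]),
    "id": infer_type(["P1", "P2"]),
}
```

A single dirty value is enough to push a column down to `str`, which is one reason cleaning and type inference tend to be interleaved.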

(50)

Topics

§ Selected Topics on the Web-of-Data

§ Selected Topics on Data Integration

§ Web-of-Data and Data Integration

(51)

Web-of-Data and Data Integration

§ "Unique identifiers avoid Entity Linkage"

§ Use URIs as universal unique IDs

Google Knowledge Graph

https://g.co/kgs/8FEXUu

(52)

Web-of-Data and Data Integration

§ "Unique identifiers avoid Entity Linkage"

§ Use URIs as universal unique IDs

§ Use "sameAs" to account for multiple unique identifiers

ISNI Google Knowledge Graph
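When several identifiers denote the same entity, "sameAs" links can be closed transitively so that each group shares one representative. The sketch below does this with a union-find structure; the identifiers are hypothetical placeholders, not real URIs, and `canonical_ids` is a name introduced here.

```python
def canonical_ids(same_as):
    """Group identifiers linked by sameAs pairs; return {identifier: representative}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # deterministic: smaller string wins

    return {x: find(x) for x in parent}

# Hypothetical identifiers; real ones would be URIs from different datasets.
pairs = [("ex:kg/123", "ex:isni/456"), ("ex:isni/456", "ex:lattes/789")]
ids = canonical_ids(pairs)
```

The linkage problem does not vanish with this: someone still has to assert the "sameAs" links, but once asserted they can be reused by every consumer of the data.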

(53)

Web-of-Data and Data Integration

§ "Ontologies simplify Schema Alignment"

§ Align your schema with a well-known ontology

(54)

Web-of-Data and Data Integration

§ "Ontologies simplify Schema Alignment"

§ Align your schema with a well-known ontology

§ "Meaning is alignment"

Quine, W. V. (1968). Ontological relativity. The Journal of Philosophy, 65(7):185–212.

(55)

Web-of-Data and Data Integration

§ "Ontologies simplify Schema Alignment"

§ Align your schema with a well-known ontology

§ "Meaning is alignment"

§ "Run away from schema alignment!"

§ Publish your data using a well-known ontology

Charles M. Schulz, The Complete Peanuts, 1959-1962
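Publishing with a well-known ontology amounts to mapping local field names to its terms once, at the source. In this hedged sketch a local record is serialized as N-Triples using the three Dublin Core terms seen earlier in the slides; the local field names and the `to_ntriples` helper are hypothetical.

```python
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical local field names mapped to Dublin Core terms.
mapping = {"author_uri": DC + "creator", "lang": DC + "language", "defended": DC + "date"}

def to_ntriples(subject, record, mapping):
    """Serialize a local record as N-Triples lines using the mapped predicates."""
    lines = []
    for field, value in record.items():
        predicate = mapping.get(field)
        if predicate is None:
            continue  # unmapped fields are not published
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"  # URI object
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'  # plain literal
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return lines

record = {
    "author_uri": "http://lattes.cnpq.br/0807511237795775",
    "lang": "Portuguese",
    "defended": "2004/07/13",
    "internal_code": "n/a",
}
nt = to_ntriples("https://doi.org/10.17771/PUCRio.acad.5181", record, mapping)
```

Consumers who already understand the shared terms can read the output with no alignment step on their side.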

(56)

Summary

§ Selected Topics on the Web-of-Data

§ Basic Technology – RDF, Ontologies

§ RDF Keyword Search

§ Selected Topics on Data Integration

§ Classic Data Integration

§ Data Lake Management

§ Web-of-Data and Data Integration

§ "Unique identifiers avoid Entity Linkage"

§ "Ontologies simplify Schema Alignment"

Bruegel, Pieter. The Tower of Babel. c 1563 Oil on panel. 60 × 74.5 cm

Museum Boijmans Van Beuningen, Rotterdam

(57)

Opportunities

§ ... beyond keyword search

§ Databases

Natural Language Interfaces

Q&A Interfaces

(58)

Opportunities

§ ... beyond keyword search

§ Databases

Natural Language Interfaces

Q&A Interfaces

§ Domain-specific search

Real-time video search

Query by image content

...

(59)

Opportunities

§ Data Integration and Machine Learning

§ ML for effective DI:

Automating DI tasks using ML techniques

(60)

Opportunities

§ Data Integration and Machine Learning

§ ML for effective DI:

Automating DI tasks using ML techniques

§ DI for effective ML:

Automating the creation of large training datasets from different sources

(61)

Opportunities

§ Data Lakes

§ Real-time, scalable ingestion of unstructured data

§ Real-time, scalable data cleaning

§ On-demand scalable metadata inference and querying

§ On-demand scalable integration of heterogeneous data

(63)

Opportunities

§ ... beyond keyword search

§ Data Integration and Machine Learning

§ Data Lakes: Real-time, scalable, on-demand

(64)

Thank You

(65)

References for this presentation...

casanova puc-rio
