
Academic year: 2022


(1)

Selected Topics on the Web-of-Data and Data Integration

Marco A. Casanova

http://www.inf.puc-rio.br/~casanova/

(2)

Topics

§ Selected Topics on the Web-of-Data

§ Introduction

§ Basic Technology

§ Advanced Topics

§ Selected Topics on Data Integration

§ Introduction

§ Basic Technology

§ Use of Ontologies for Data Integration

(3)

Selected Topics on the Web-of-Data

Introduction

§ Traditional

Web search engines

§ Coverage

• data captured by Web crawlers

§ Search capabilities

• Keyword search

• Ranking

Iceberg Photo

Judith Currelly, Diane Farris Gallery

(4)

Paul Gauguin, French, 1848–1903

Where Do We Come From? What Are We? Where Are We Going?

1897–1898 Oil on canvas

Image: 139.1 x 374.6 cm (54 3/4 x 147 1/2 in.)

Framed: 171.5 x 406.4 x 8.9 cm (67 1/2 x 160 x 3 1/2 in.) Wildenstein 561

Museum of Fine Arts, Boston: Tompkins Collection 36.270 http://www.mfa.org/artemis/fullrecord.asp?oid=32558&did=500

Description: In 1891, Gauguin left France for Tahiti, seeking in the South Seas a society that was simpler and more elemental than that of his homeland. In Tahiti, he created paintings that express a highly personal mythology. He considered this work, created in 1897 at a time of great personal crisis, to be his masterpiece and the summation of his ideas. Gauguin's letters suggest that the fresco-like painting should be read from right to left, beginning with the sleeping infant. He describes the various figures as pondering the questions of human existence given in the title; the blue idol represents "the Beyond." The old woman at the far left, "close to death," accepts her fate with resignation.

(5)

Cicero Dias, Brazil 2003

Eu vi o mundo… Ele começava no Recife (I Saw the World… It Began in Recife), Rio de Janeiro, 1926-1929

Gouache and mixed media on paper, mounted on canvas, 1.94 x 12 m. Artist's collection, Paris

http://www.estadao.com.br/divirtaseonline/galeria/cicerodias/painel/index.frm

The Scandal Panel (1931 Salon)

...Nothing comparable had been produced in Brazilian avant-garde art until then, neither in scale nor in boldness of conception. It measured fifteen meters wide by two and a half meters high. It was steeped in the uncontrollable and mysterious forces of the unconscious. Cícero Dias made a telluric composition, full of delirium and animated by a subjective convulsion of enormous intensity.

Figures fly overhead. It showed the universe as seen from Pernambuco, or from Brazil. So much so that its title was: Eu vi o mundo... ele começava no Recife. A name at once regional, national and

(6)

Selected Topics on the Web-of-Data

Introduction

§ Traditional

Web search engines

§ Coverage

• Just the data captured by Web crawlers

§ Search capabilities

✓ Keyword search

✓ Ranking

§ Data Semantics

• “Everything is text”

Iceberg Photo

Judith Currelly, Diane Farris Gallery

(7)

Selected Topics on the Web-of-Data

Introduction

§ Coverage:

§ Deep Web = databases, dynamic pages, multimedia data,…

• Web crawlers do not capture Deep Web data

(8)
(9)

Selected Topics on the Web-of-Data

Introduction

§ Data Semantics

§ data lack proper object IDs

§ data lack semantics

(10)
(11)
(12)
(13)
(14)

Topics

• Selected Topics on the Web-of-Data

– Introduction

– Basic Technology

• What is the Web-of-Data?

• RDF

• Adding Semantics to Web Pages

• Publishing Data on the Web

• Ontologies

– Advanced Topics

(15)

Selected Topics on the Web-of-Data

What is the Web-of-Data?

(16)

Selected Topics on the Web-of-Data

RDF

§ URI – Uniform Resource Identifier

§ a compact sequence of characters that identifies an abstract or physical resource

§ Examples

http://lattes.cnpq.br/0400232298849115

http://purl.org/dc/elements/1.1/creator

http://www-di.inf.puc-rio.br/~casanova/

(17)

Selected Topics on the Web-of-Data

RDF

§ RDF – Resource Description Framework

ex:index.html exterms:creation-date "August 16, 1999" .

ex:index.html exterms:language "English" .

ex:index.html dc:creator <http://www.example.org/staffid/85740> .

<?xml version="1.0"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:dc="http://purl.org/dc/elements/1.1/"

xmlns:exterms="http://www.example.org/terms/">

<rdf:Description rdf:about="http://www.example.org/index.html">

<exterms:creation-date>August 16, 1999</exterms:creation-date>

<exterms:language>English</exterms:language>

<dc:creator rdf:resource="http://www.example.org/staffid/85740"/>

</rdf:Description>

</rdf:RDF>
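
To make the correspondence between the RDF/XML document and the three triples concrete, here is a minimal Python sketch that reads the document above with the standard library and lists its triples. It is only an illustration under a strong assumption: it handles flat `rdf:Description` blocks like this one and is not a general RDF/XML parser. The names `RDF_XML` and `extract_triples` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
    <exterms:language>English</exterms:language>
    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) triples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells a qualified name as "{namespace}local";
            # concatenating the two parts gives the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            predicate = namespace + local
            # An rdf:resource attribute means the object is a URI,
            # otherwise the element text is a literal value.
            resource = prop.get(f"{{{RDF_NS}}}resource")
            obj = resource if resource is not None else (prop.text or "")
            triples.append((subject, predicate, obj))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```

The sketch returns URIs and literals as plain strings; a real RDF library would keep the two kinds of object apart.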

(18)


(19)

Selected Topics on the Web-of-Data

Adding Semantics to Web Pages

§ Problem 1

§ Web pages are just text, with no semantics

§ Solution 1

§ Complement Web pages with tags (in RDF)

(20)

Selected Topics on the Web-of-Data

Publishing Data on the Web

§ Problem 2

§ Data lack proper object IDs and lack semantics

§ Solution 2

§ Publish data on the Web as RDF and follow the "Linked Data Principles"

(21)

Selected Topics on the Web-of-Data

Publishing Data on the Web

(22)
(23)

§ ‘Linked Data Principles’ (in plain terms)

1. Use URIs to identify the “things” in your data

2. Use http:// URIs so people (and machines) can look them up on the Web

3. When a URI is looked up, return a description of the “thing” (in RDF format)

4. Include links to related “things”
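
Principles 2 and 3 amount to an ordinary HTTP lookup in which the client asks for RDF. The following Python sketch shows one way to build such a request using content negotiation. The function names and the chosen `Accept` value are assumptions made for illustration, and whether a given server honors the header depends on the publisher.

```python
import urllib.request


def build_rdf_request(uri, accept="text/turtle, application/rdf+xml;q=0.9"):
    """Build (without sending) an HTTP GET that asks for an RDF description."""
    if not uri.startswith(("http://", "https://")):
        raise ValueError("Linked Data principle 2 expects an HTTP URI")
    return urllib.request.Request(uri, headers={"Accept": accept})


def fetch_description(uri, timeout=10):
    """Dereference the URI; returns (content type, body). Requires network access."""
    with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as resp:
        return resp.headers.get("Content-Type"), resp.read().decode("utf-8", "replace")


# Building the request does not touch the network.
request = build_rdf_request("http://www.example.org/staffid/85740")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

`fetch_description` is left uncalled here because `www.example.org` is only a placeholder; with a real Linked Data URI, the returned document would also contain the links to related “things” that principle 4 asks for.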

Selected Topics on the Web-of-Data

Publishing Data on the Web

(24)

Use http:// URIs so people (and machines)

can look them up on the Web

(25)

1,224 datasets

with 16,113 links

(26)

Selected Topics on the Web-of-Data

Publishing Data on the Web

§ Open Government Data

§ public government information (such as government records) that is shared with the public digitally, over the Internet, in open raw formats and in ways that make it accessible and readily available to all, to promote analysis and allow reuse, such as the creation of data mashups

(27)

Selected Topics on the Web-of-Data

Publishing Data on the Web

§ Open Government Data in Brazil

§ Federal Law 131 - 27/05/2009

• Establishes norms for publishing, in real time, detailed information about budget and financial resources

§ Action Plan - 15/09/2011

• Stimulates the use of new technologies in the management and provision of public services and access to public information

• Related to the participation of Brazil as a member of the Open Government Partnership

(28)

§ Ontology

§ Defines classes and properties, and their constraints

§ Typically written in RDFS or OWL 2

§ Examples

§ DCMI Metadata Terms (“Dublin Core”)

§ FOAF – The Friend of a Friend Vocabulary

§ FRBR – Functional Requirements for Bibliographic Records

§ CIDOC CRM – Conceptual Reference Model for cultural heritage documentation

Selected Topics on the Web-of-Data

Ontologies

(29)
(30)

§ Examples (cont.)

§ Energistics Standards

• WITSML drilling and wellsite data standards

• PRODML production and data standards

• RESQML reservoir and structural interpretation data standards

§ Pipeline Open Data Standard (PODS) data model

• an industry standard, used by pipeline operators to provide a “single master source of information,” and to eliminate “localized silos of information that are often unconnected.”

Selected Topics on the Web-of-Data

Ontologies

(31)

§ Remarks about ontology design

§ “Same-old-conceptual-design”

§ “Lavoisier Principle” applies

• Reuse known ontologies as much as possible

§ Ontology = Vocabulary + Axioms

• Axioms (or constraints) capture the semantics of the terms

§ Example

§ The Music ontology uses the FOAF, FRBR, Event and Timeline ontologies

Selected Topics on the Web-of-Data

Ontologies

(32)

Topics

§ Selected Topics on the Web-of-Data

§ Introduction

§ Basic Technology

§ Selected Topics

• Surfacing Deep Web Data

• Keyword search

• Entity Relatedness

§ Selected Topics on Data Integration

(33)

Selected Topics on the Web-of-Data

Surfacing Deep Web Data

§ Problem: How to make Deep Web data visible to search engines?

§ ... with proper IDs and proper semantics

(34)

Selected Topics on the Web-of-Data

Surfacing Deep Web Data

§ Non-standard publication of opaque data

(35)
(36)

Selected Topics on the Web-of-Data

Surfacing Deep Web Data

(37)

Selected Topics on the Web-of-Data

Keyword Search

§ How to implement keyword search over RDF datasets?

§ ... with good performance and proper ranking of the results

(38)

Selected Topics on the Web-of-Data

Keyword Search

§ Schema-based approach

§ Requires... a schema!

§ Probably faster for large graphs

§ Each query is compiled from the schema

• a query defines a fixed subgraph pattern

§ Graph-based approach

§ Does not require a schema!

§ May be prohibitively slow for large graphs

§ Depends on a graph-traversal algorithm

• a result is an approximation of a Steiner tree

(39)

[Figure: example RDF graph with nodes P1, P2, M1, A1 and A2; edges labeled name, title, acted and win; literals “Annie Hall”, “Oscar Best Comedy” and “Oscar Best Director”]

Query: “Allen” “Oscar”

Candidate query 1:

?p name “Allen”

?a name “Oscar*”

?p acted ?m

?m win ?a

Candidate query 2:

?p name “Allen”

?a name “Oscar*”

?p win ?a
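
The two triple-pattern queries for the keywords “Allen” and “Oscar” can be evaluated with a small pattern matcher, sketched below in Python. The triples are hypothetical: the figure does not survive in this text, so the person names and the exact edges are assumptions, as is the reading of a plain keyword as a substring test and of a trailing `*` as a prefix wildcard.

```python
# Hypothetical triples, loosely modeled on the example graph.
# The person names and the edge layout are assumptions.
TRIPLES = [
    ("P1", "name", "Woody Allen"),
    ("P1", "acted", "M1"),
    ("P1", "win", "A2"),
    ("P2", "name", "Diane Keaton"),
    ("P2", "acted", "M1"),
    ("M1", "title", "Annie Hall"),
    ("M1", "win", "A1"),
    ("A1", "name", "Oscar Best Comedy"),
    ("A2", "name", "Oscar Best Director"),
]

QUERY_1 = [("?p", "name", "Allen"), ("?a", "name", "Oscar*"),
           ("?p", "acted", "?m"), ("?m", "win", "?a")]
QUERY_2 = [("?p", "name", "Allen"), ("?a", "name", "Oscar*"),
           ("?p", "win", "?a")]


def bind(term, value, binding, keyword=False):
    """Unify one pattern term with one triple value; return a binding or None."""
    if term.startswith("?"):
        if binding.get(term, value) != value:
            return None  # variable already bound to something else
        return {**binding, term: value}
    if keyword:
        ok = value.startswith(term[:-1]) if term.endswith("*") else term in value
    else:
        ok = term == value
    return binding if ok else None


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    (s, p, o), rest = patterns[0], patterns[1:]
    for ts, tp, to in triples:
        if tp != p:
            continue
        b = bind(s, ts, binding)
        if b is not None:
            b = bind(o, to, b, keyword=True)
        if b is not None:
            yield from match(rest, triples, b)


print(list(match(QUERY_1, TRIPLES)))
print(list(match(QUERY_2, TRIPLES)))
```

Each query is a fixed subgraph pattern, so the same keywords lead to different answers depending on which pattern is chosen: on this toy data the first query binds the award reached through the movie, and the second binds the award won directly by the person.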

(40)

Example: Schema-Based RDF-KwS Tool

(41)

Selected Topics on the Web-of-Data

Entity Relatedness

Albert Einstein

Kurt Gödel

§ Problem: How are two entities, described in an RDF dataset, related?

(42)

Selected Topics on the Web-of-Data

Entity Relatedness

Albert Einstein

Friendship

Kurt Gödel

§ Problem: almost ten thousand paths in DBpedia...

(43)

Selected Topics on the Web-of-Data

Entity Relatedness

§ Main challenges

§ How to find relationship paths between a given pair of entities in a knowledge base?

§ How to group the relationship paths?

§ What characteristics of the groups of paths must be selected to generate a connectivity profile?
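
The first two challenges can be illustrated with a short Python sketch: a depth-first search enumerates the simple paths between two entities up to a length bound, and the paths are then grouped by their sequence of predicates. The triples are a hand-made toy example, not DBpedia data, and grouping by predicate sequence is only one possible criterion, assumed here for illustration.

```python
from collections import defaultdict

# Toy triples written by hand for illustration; not taken from DBpedia.
TRIPLES = [
    ("Albert_Einstein", "workedAt", "Institute_for_Advanced_Study"),
    ("Kurt_Gödel", "workedAt", "Institute_for_Advanced_Study"),
    ("Albert_Einstein", "livedIn", "Princeton"),
    ("Kurt_Gödel", "livedIn", "Princeton"),
    ("Albert_Einstein", "friendOf", "Kurt_Gödel"),
]


def build_adjacency(triples):
    """Index triples in both directions; '^p' marks an edge followed backwards."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
        adj[o].append(("^" + p, s))
    return adj


def find_paths(adj, source, target, max_len):
    """Enumerate simple paths from source to target with at most max_len edges."""
    paths = []

    def dfs(node, visited, path):
        if node == target:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return
        for predicate, neighbor in adj[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                path.append((predicate, neighbor))
                dfs(neighbor, visited, path)
                path.pop()
                visited.remove(neighbor)

    dfs(source, {source}, [])
    return paths


def group_by_template(paths):
    """Group paths by their sequence of predicates."""
    groups = defaultdict(list)
    for path in paths:
        groups[tuple(predicate for predicate, _ in path)].append(path)
    return dict(groups)


adj = build_adjacency(TRIPLES)
paths = find_paths(adj, "Albert_Einstein", "Kurt_Gödel", max_len=2)
groups = group_by_template(paths)
for template, members in groups.items():
    print(template, len(members))
```

The length bound matters: the number of simple paths grows quickly with `max_len`, which is why a real knowledge base can yield thousands of paths for a single pair of entities.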

(44)

Topics

§ Selected Topics on the Web-of-Data

§ Selected Topics on Data Integration 1

§ Introduction

§ Basic Technology

• Entity Linkage

• Data Extraction

• Data Fusion

• Schema Alignment

§ Use of Ontologies for Data Integration

1 Dong, X.L. and Rekatsinas, T. "Data Integration and Machine Learning: A Natural Synergy". Tutorial at VLDB 2018

(45)

Selected Topics on Data Integration

Introduction

§ Problem: How to integrate data from different sources?

§ ... when the number of sources is very large

§ ... when the sources change over time

§ Data integration

§ In conventional business:

• Data warehouse

• Virtual integration

§ In the Web-of-Data:

• Knowledge graph ~ Data warehouse

• Linked data ~ Virtual integration

(46)

Selected Topics on Data Integration

Introduction

§ Machine Learning for Data Integration

§ Transition from logic to probabilities revolutionized Data Integration

• Probabilities allow reasoning about noisy data

• Machine Learning techniques help Data Integration

(47)

Selected Topics on Data Integration

Introduction

§ Data Integration for Data Science

§ Cleaning and organizing data comprises 60% of the time spent on a Data Science project (source: Crowdflower):

• Building training sets: 3%

• Cleaning and organizing data: 60%

• Collecting data sets: 19%

• Mining data for patterns: 9%

• Refining algorithms: 4%

• Other: 5%

(48)

Selected Topics on Data Integration

Introduction

"... we expect a new application area to be commercialized that of supporting the work of data scientists. For example, Merck has approximately 4000 Oracle databases, a large data lake, uncountable individual files and the company is interested in public data from the web. Merck data scientists spend at least 90% of their time finding data sets relevant to their task at hand and then performing data integration on the result.

... "

Mike Stonebraker, "Data Integration: The Current Status and the Way Forward", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2018

(49)


(50)

Selected Topics on Data Integration

Introduction

§ Data Integration processes

§ Entity linkage: necessary for identifying the same instances in different data sources

§ Data extraction: important for integrating non-structured data

§ Data fusion: necessary in presence of erroneous data

§ Schema alignment: helpful for integrating data descriptions of different data sources

(51)

Selected Topics on Data Integration

Entity Linkage

§ Entity linkage (EL)

§ Partition a given set R of records such that each block of the partition corresponds to a distinct real-world entity
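
As a rough illustration of this definition, the Python sketch below partitions a handful of name records: it compares every pair with a token-based Jaccard similarity and merges records whose similarity reaches a threshold, using a union-find structure so that the result is a proper partition. The records, the similarity measure and the threshold are all assumptions chosen for the example; practical systems add blocking to avoid comparing all pairs, and often learn the matching function.

```python
import re
from itertools import combinations

# Hypothetical records; spelling variants of two names.
RECORDS = [
    "Marco A. Casanova",
    "Casanova, Marco A.",
    "Marco Antonio Casanova",
    "Kurt Gödel",
    "Kurt Goedel",
]


def tokens(record):
    """Lowercased word tokens, ignoring punctuation and order."""
    return set(re.findall(r"\w+", record.lower()))


def similarity(a, b):
    """Jaccard similarity between the token sets of two records."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def link_entities(records, threshold):
    """Partition records: similar pairs end up in the same block."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    blocks = {}
    for i, record in enumerate(records):
        blocks.setdefault(find(i), []).append(record)
    return list(blocks.values())


for block in link_entities(RECORDS, threshold=0.3):
    print(block)
```

The threshold controls the trade-off: raising it to 0.5 keeps the three Casanova variants together but splits the two Gödel spellings into separate blocks.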

(52)

Selected Topics on Data Integration

Entity Linkage

(53)

Selected Topics on Data Integration

Data Extraction

§ Data Extraction

§ Extract structured data from semi-structured or unstructured data

§ Three Types of Data Extraction (DE)

• Closed-world DE: align to existing entities and attributes

• Closed DE: align to existing attributes, but extract new entities

• Open DE: not limited by existing entities or attributes
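
A very small example of extraction from semi-structured text, in the spirit of aligning to attributes that are fixed in advance: the Python sketch below pulls four values out of a museum-style caption with regular expressions. The attribute names (`height_cm`, `width_cm`, `birth_year`, `death_year`) and the reading of the first dimension as the height are assumptions; hand-written patterns like these are only a baseline.

```python
import re

# Caption text in the style of a museum catalogue record.
TEXT = (
    "Paul Gauguin, French, 1848–1903\n"
    "1897–1898 Oil on canvas\n"
    "Image: 139.1 x 374.6 cm (54 3/4 x 147 1/2 in.)"
)


def extract_record(text):
    """Extract a structured record with a fixed, hypothetical set of attributes."""
    record = {}
    # Dimensions: assumed to be given as height x width, in centimeters.
    size = re.search(r"Image:\s*([\d.]+)\s*x\s*([\d.]+)\s*cm", text)
    if size:
        record["height_cm"] = float(size.group(1))
        record["width_cm"] = float(size.group(2))
    # Life span: the first pair of four-digit years joined by a dash.
    life = re.search(r"(\d{4})[–-](\d{4})", text)
    if life:
        record["birth_year"] = int(life.group(1))
        record["death_year"] = int(life.group(2))
    return record


print(extract_record(TEXT))
```

The second pattern already shows how brittle this is: it relies on the artist's life span appearing before the creation date of the work.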

(54)

Selected Topics on Data Integration

Data Extraction

(55)

Selected Topics on Data Integration

Data Fusion

§ Data Fusion

§ Resolve conflicts and obtain correct values

§ Challenges in Data Fusion

§ Reasoning about source quality works only for easy cases

§ Machine Learning techniques have potential
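
To see why source quality matters when resolving conflicts, the Python sketch below compares plain majority voting with a vote weighted by source accuracy. The claims and the accuracy figures are invented for illustration, and the log-odds weighting is one simple choice assumed here, not a method taken from the slides; in practice the accuracies are unknown and have to be estimated.

```python
import math
from collections import Counter, defaultdict

# Hypothetical claims: each source reports one value for the same attribute.
CLAIMS = {"source_a": "Boston", "source_b": "Paris", "source_c": "Paris"}

# Hypothetical accuracies (probability that a source reports the correct value).
ACCURACY = {"source_a": 0.95, "source_b": 0.55, "source_c": 0.55}


def majority_vote(claims):
    """Pick the value reported by the largest number of sources."""
    return Counter(claims.values()).most_common(1)[0][0]


def weighted_vote(claims, accuracy):
    """Pick the value with the highest sum of log-odds weights of its sources."""
    scores = defaultdict(float)
    for source, value in claims.items():
        a = accuracy[source]
        scores[value] += math.log(a / (1 - a))
    return max(scores, key=scores.get)


print(majority_vote(CLAIMS))
print(weighted_vote(CLAIMS, ACCURACY))
```

Two mediocre sources outvote one reliable source under majority voting, while the weighted vote sides with the reliable one.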

(56)

Selected Topics on Data Integration

Data Fusion

(57)

Selected Topics on Data Integration

Schema Alignment

§ Schema Alignment

§ Map classes and attributes with the same semantics
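
One simple way to look for attributes with the same semantics is to compare the values they hold. The Python sketch below maps each attribute of one source to the attribute of another source whose value set overlaps most, using Jaccard similarity. The two toy sources and the threshold are assumptions for illustration; real matchers combine several kinds of evidence, such as names, data types and value distributions.

```python
# Two hypothetical sources describing movies with different attribute names.
SOURCE = {
    "title": {"Annie Hall", "Manhattan"},
    "year": {"1977", "1979"},
    "director": {"Woody Allen"},
}
TARGET = {
    "name": {"Annie Hall", "Manhattan", "Zelig"},
    "released": {"1977", "1979", "1983"},
}


def jaccard(a, b):
    """Overlap of two value sets, between 0 and 1."""
    return len(a & b) / len(a | b) if a | b else 0.0


def align(source, target, threshold=0.5):
    """Map each source attribute to its best-overlapping target attribute."""
    mapping = {}
    for s_attr, s_values in source.items():
        best_attr, best_score = None, 0.0
        for t_attr, t_values in target.items():
            score = jaccard(s_values, t_values)
            if score > best_score:
                best_attr, best_score = t_attr, score
        if best_score >= threshold:
            mapping[s_attr] = best_attr
    return mapping


print(align(SOURCE, TARGET))
```

The `director` attribute stays unmapped because no target attribute shares any of its values.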

(58)

Selected Topics on Data Integration

Schema Alignment

(59)

Selected Topics on Data Integration

Use of Ontologies for Data Integration

§ Schema alignment:

§ An ontology as a reference schema

§ Entity linkage:

§ Ontology instances as reference or training instances

§ Data extraction:

§ Ontology terms as reference or training terms

§ Data fusion:

§ Ontology attribute values as reference or training values

(60)

Selected Topics on Data Integration

Use of Ontologies for Data Integration

§ "Unique identifiers are elusive"

§ Use URIs as universal unique IDs

§ Use "sameAs" to account for multiple unique identifiers

§ "Meaning is alignment"

§ Align your schema with a well-known ontology

§ Expand schema alignment using reference ontologies

§ "Run away from schema alignment!"

§ Publish your data using a well-known ontology
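
The "sameAs" remark can be made concrete: if identity links are treated as an equivalence relation, every URI can be mapped to one representative of its class. The Python sketch below does this with a breadth-first search over the links. The URIs are placeholders, and picking the lexicographically smallest URI as the representative is an arbitrary choice made for the example. In published RDF, such links are normally stated with the `owl:sameAs` property.

```python
from collections import defaultdict, deque

# Hypothetical identity links between placeholder URIs.
SAME_AS = [
    ("http://example.org/people/einstein", "http://example.com/id/Albert_Einstein"),
    ("http://example.com/id/Albert_Einstein", "http://example.net/entity/E42"),
    ("http://example.org/people/goedel", "http://example.com/id/Kurt_Goedel"),
]


def canonical_ids(same_as_links):
    """Map every URI to one representative of its sameAs equivalence class."""
    neighbors = defaultdict(set)
    for a, b in same_as_links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    canonical = {}
    for start in sorted(neighbors):
        if start in canonical:
            continue
        # Breadth-first search collects the connected component of `start`.
        component, queue = {start}, deque([start])
        while queue:
            for nxt in neighbors[queue.popleft()]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        representative = min(component)
        for uri in component:
            canonical[uri] = representative
    return canonical


for uri, representative in sorted(canonical_ids(SAME_AS).items()):
    print(uri, "->", representative)
```

Taking the closure this way assumes every link is correct; a single wrong link merges two entities, so links from noisy sources deserve the same scrutiny as any other entity-linkage decision.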

(61)

Final Remarks

§ Web-of-Data

§ RDF

§ Adding Semantics to Web Pages (with RDF)

§ Publishing Data on the Web (with RDF)

§ Ontologies

§ Data Integration

§ Entity Linkage

§ Data Extraction

§ Data Fusion

(62)

References for this presentation...

Search for "casanova puc-rio"
