Selected Topics on the Web-of-Data
and
Data Integration
Marco A. Casanova
http://www.inf.puc-rio.br/~casanova/
Topics
§ Selected Topics on the Web-of-Data
§ Introduction
§ Basic Technology
§ Advanced Topics
§ Selected Topics on Data Integration
§ Introduction
§ Basic Technology
§ Use of Ontologies for Data Integration
Selected Topics on the Web-of-Data
Introduction
§ Traditional Web search engines
§ Coverage
• data captured by Web crawlers
§ Search capabilities
• Keyword search
• Ranking
Iceberg Photo
Judith Currelly, Diane Farris Gallery
Paul Gauguin, French, 1848–1903
Where Do We Come From? What Are We? Where Are We Going?
1897–1898 Oil on canvas
Image: 139.1 x 374.6 cm (54 3/4 x 147 1/2 in.)
Framed: 171.5 x 406.4 x 8.9 cm (67 1/2 x 160 x 3 1/2 in.) Wildenstein 561
Museum of Fine Arts, Boston: Tompkins Collection 36.270 http://www.mfa.org/artemis/fullrecord.asp?oid=32558&did=500
Description: In 1891, Gauguin left France for Tahiti, seeking in the South Seas a society that was simpler and more elemental than that of his homeland. In Tahiti, he created paintings that express a highly personal mythology. He considered this work, created in 1897 at a time of great personal crisis, to be his masterpiece and the summation of his ideas. Gauguin's letters suggest that the fresco-like painting should be read from right to left, beginning with the sleeping infant. He describes the various figures as pondering the questions of human existence given in the title; the blue idol represents "the Beyond." The old woman at the far left, "close to death," accepts her fate with resignation.
Cícero Dias, Brazil, 2003
Eu vi o mundo… Ele começava no Recife (I Saw the World… It Began in Recife), Rio de Janeiro, 1926-1929
Gouache and mixed media on paper, mounted on canvas, 1.94 x 12 m. Artist's collection, Paris
http://www.estadao.com.br/divirtaseonline/galeria/cicerodias/painel/index.frm
The Scandal Panel (Salon of 1931)
...Nothing similar had been made in Brazilian avant-garde art until then, neither in scale nor in boldness of conception. It measured fifteen meters wide by two and a half meters high. It was imbued with the uncontrollable and mysterious forces of the unconscious. Cícero Dias made a telluric composition, full of delirium and animated by a subjective convulsion of enormous intensity.
Figures fly overhead. It showed the universe as seen from Pernambuco or from Brazil, so much so that its title was: Eu vi o mundo... ele começava no Recife. A designation at once regional, national and
Selected Topics on the Web-of-Data
Introduction
§ Traditional Web search engines
§ Coverage
• Just the data captured by Web crawlers
§ Search capabilities
• Keyword search
• Ranking
§ Data Semantics
• “Everything is text”
Selected Topics on the Web-of-Data
Introduction
§ Coverage:
§ Deep Web = databases, dynamic pages, multimedia data,…
• Web crawlers do not capture Deep Web data
Selected Topics on the Web-of-Data
Introduction
§ Data Semantics
§ data lack proper object IDs
§ data lack semantics
Topics
• Selected Topics on the Web-of-Data
– Introduction
– Basic Technology
• What is the Web-of-Data?
• RDF
• Adding Semantics to Web Pages
• Publishing Data on the Web
• Ontologies
– Advanced Topics
Selected Topics on the Web-of-Data
What is the Web-of-Data?
Selected Topics on the Web-of-Data
RDF
§ URI – Uniform Resource Identifier
§ a compact sequence of characters
that identifies an abstract or physical resource
§ Examples
http://lattes.cnpq.br/0400232298849115
http://purl.org/dc/elements/1.1/creator
http://www-di.inf.puc-rio.br/~casanova/
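As a rough illustration of what "a compact sequence of characters" contains, the sketch below splits the three example URIs into their generic components using Python's standard `urllib.parse`. The helper name `describe` is an assumption made for this example only.

```python
from urllib.parse import urlsplit

# The three example URIs listed on the slide.
EXAMPLE_URIS = [
    "http://lattes.cnpq.br/0400232298849115",
    "http://purl.org/dc/elements/1.1/creator",
    "http://www-di.inf.puc-rio.br/~casanova/",
]

def describe(uri: str) -> dict:
    """Split a URI into scheme, authority and path (hypothetical helper)."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}

if __name__ == "__main__":
    for uri in EXAMPLE_URIS:
        print(describe(uri))
```

Note that a URI identifies a resource whether or not anything can be retrieved from it; the split above only looks at the character sequence.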
Selected Topics on the Web-of-Data
RDF
§ RDF – Resource Description Framework
ex:index.html exterms:creation-date "August 16, 1999" .
ex:index.html exterms:language "English" .
ex:index.html dc:creator <http://www.example.org/staffid/85740> .
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:exterms="http://www.example.org/terms/">
<rdf:Description rdf:about="http://www.example.org/index.html">
<exterms:creation-date>August 16, 1999</exterms:creation-date>
<exterms:language>English</exterms:language>
<dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
</rdf:Description>
</rdf:RDF>
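To make the correspondence between the RDF/XML document and the three triples concrete, here is a minimal sketch that reads the document with Python's standard `xml.etree.ElementTree` and lists (subject, predicate, object) tuples. It assumes only the flat `rdf:Description` pattern used in this example and is not a general RDF/XML parser; the function name `triples` is made up for the illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The RDF/XML document from the slide.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
    <exterms:language>English</exterms:language>
    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>
"""

def triples(rdf_xml: str) -> list:
    """Extract triples from flat rdf:Description elements (simplified sketch)."""
    root = ET.fromstring(rdf_xml)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes tags as "{namespace}local"; join them into one URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # An rdf:resource attribute means the object is a URI, otherwise a literal.
            obj = resource if resource is not None else (prop.text or "")
            result.append((subject, predicate, obj))
    return result

if __name__ == "__main__":
    for triple in triples(RDF_XML):
        print(triple)
```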
Selected Topics on the Web-of-Data
Adding Semantics to Web Pages
§ Problem 1
§ Web pages are just text, with no semantics
§ Solution 1
§ Complement Web pages with tags (in RDF)
Selected Topics on the Web-of-Data
Publishing Data on the Web
§ Problem 2
§ Data lack proper object IDs and semantics
§ Solution 2
§ Publish data on the Web as RDF and
follow the "Linked Data Principles"
Selected Topics on the Web-of-Data
Publishing Data on the Web
§ ‘Linked Data Principles’ (in plain terms)
1. Use URIs to identify the “things” in your data
2. Use http:// URIs so people (and machines) can look them up on the Web
3. When a URI is looked up,
return a description of the “thing” (in RDF format)
4. Include links to related “things”
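The four principles can be sketched without any network access by simulating the Web as a dictionary from URIs to descriptions. Everything in the block is an assumption made for illustration: the `DESCRIPTIONS` table, the made-up staff name, and the helpers `look_up`, `links`, and `crawl`. A real deployment would serve the descriptions over HTTP.

```python
# Hypothetical in-memory "Web": each http:// URI (principles 1 and 2)
# maps to the RDF triples that describe the thing it identifies.
DESCRIPTIONS = {
    "http://www.example.org/index.html": [
        ("http://www.example.org/index.html",
         "http://purl.org/dc/elements/1.1/creator",
         "http://www.example.org/staffid/85740"),
    ],
    "http://www.example.org/staffid/85740": [
        ("http://www.example.org/staffid/85740",
         "http://xmlns.com/foaf/0.1/name",
         "A. Person (made-up name)"),
    ],
}

def look_up(uri: str) -> list:
    """Principle 3: looking up a URI returns a description of the thing."""
    return DESCRIPTIONS.get(uri, [])

def links(uri: str) -> list:
    """Principle 4: objects that are http:// URIs are links to related things."""
    return [o for (_s, _p, o) in look_up(uri) if o.startswith("http://")]

def crawl(start: str) -> list:
    """Follow links breadth-first, returning URIs in the order visited."""
    seen, todo = [], [start]
    while todo:
        uri = todo.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        todo.extend(links(uri))
    return seen

if __name__ == "__main__":
    print(crawl("http://www.example.org/index.html"))
```

The point of the sketch is that a client needs no prior knowledge of the dataset: starting from one URI, the links in each description lead to the next.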
Selected Topics on the Web-of-Data
Publishing Data on the Web
Use http:// URIs so people (and machines)
can look them up on the Web
1,224 datasets
with 16,113 links
Selected Topics on the Web-of-Data
Publishing Data on the Web
§ Open Government Data
§ public government information, such as government records,
that is shared with the public digitally, over the Internet, in open raw formats
and in ways that make it accessible and readily available to all, to promote analysis and allow reuse,
such as the creation of data mashups
Selected Topics on the Web-of-Data
Publishing Data on the Web
§ Open Government Data in Brazil
§ Federal Complementary Law 131 - 27/05/2009
• Establishes norms for publishing, in real time,
detailed information about budget and financial resources
§ Action Plan - 15/09/2011
• Stimulates the use of new technologies in the management and provision of public services and access to public information
• Related to the participation of Brazil as a member of
the Open Government Partnership
§ Ontology
§ Defines classes and properties, and their constraints
§ Typically written in RDFS or OWL 2
§ Examples
§ DCMI Metadata Terms (“Dublin Core”)
§ FOAF – The Friend of a Friend Vocabulary
§ FRBR – Functional Requirements for Bibliographic Records
§ CIDOC CRM – Conceptual Reference Model for cultural heritage documentation
Selected Topics on the Web-of-Data
Ontologies
§ Examples (cont.)
§ Energistics Standards
• WITSML drilling and wellsite data standards
• PRODML production data standards
• RESQML reservoir and structural interpretation data standards
§ Pipeline Open Data Standard (PODS) data model
• an industry standard, used by pipeline operators to provide a “single master source of information,”
and to eliminate “localized silos of information that are often unconnected.”
Selected Topics on the Web-of-Data
Ontologies
§ Remarks about ontology design
§ “Same-old-conceptual-design”
§ “Lavoisier Principle” applies
• Reuse known ontologies as much as possible
§ Ontology = Vocabulary + Axioms
• Axioms (or constraints) capture the semantics of the terms
§ Example
§ The Music ontology uses the FOAF, FRBR, Event and Timeline ontologies
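The remark "Ontology = Vocabulary + Axioms" can be illustrated with a toy example: a set of terms plus domain and range declarations that are checked against data. All names below are hypothetical, and treating domain and range as integrity constraints is a simplification (in RDFS they license inferences rather than reject data).

```python
# Vocabulary: the terms of a toy ontology (hypothetical names).
CLASSES = {"Document", "Person"}
PROPERTIES = {"creator"}

# Axioms: each property maps to (domain class, range class).
DOMAIN_RANGE = {"creator": ("Document", "Person")}

# Sanity check: axioms may only mention classes from the vocabulary.
assert all(c in CLASSES for pair in DOMAIN_RANGE.values() for c in pair)

# Instance data: the class of each resource (assumed for the example).
TYPES = {"ex:index.html": "Document", "ex:staff85740": "Person"}

def violations(triples: list) -> list:
    """Return a message for every triple that breaks the vocabulary or axioms."""
    problems = []
    for s, p, o in triples:
        if p not in PROPERTIES:
            problems.append(f"{p}: not in the vocabulary")
            continue
        domain, rng = DOMAIN_RANGE[p]
        if TYPES.get(s) != domain:
            problems.append(f"{s}: subject of {p} must be a {domain}")
        if TYPES.get(o) != rng:
            problems.append(f"{o}: object of {p} must be a {rng}")
    return problems

if __name__ == "__main__":
    print(violations([("ex:index.html", "creator", "ex:staff85740")]))
    print(violations([("ex:staff85740", "creator", "ex:index.html")]))
```

Without the axioms, both triples in the usage lines would be equally acceptable; the constraints are what distinguish a creator of a document from an arbitrary pair of resources.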
Topics
§ Selected Topics on the Web-of-Data
§ Introduction
§ Basic Technology
§ Selected Topics
• Surfacing Deep Web Data
• Keyword search
• Entity Relatedness
§ Selected Topics on Data Integration
Selected Topics on the Web-of-Data
Surfacing Deep Web Data
§ Problem: How to make Deep Web data visible to search engines?
§ ... with proper IDs and proper semantics
Selected Topics on the Web-of-Data
Surfacing Deep Web Data
§ Non-standard publication of opaque data
Selected Topics on the Web-of-Data
Keyword Search
§ How to implement keyword search over RDF datasets?
§ ... with good performance and proper ranking of the results
Selected Topics on the Web-of-Data
Keyword Search
§ Schema-based approach
§ Requires... a schema!
§ Probably faster for large graphs
§ Each query is compiled from the schema
• a query defines a fixed subgraph pattern
§ Graph-based approach
§ Does not require a schema!
§ May be prohibitively slow for large graphs
§ Depends on a graph-traversal algorithm
• a result is an approximation of a Steiner tree
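A minimal sketch of the graph-based approach, under stated assumptions: the graph is a handful of made-up triples, keywords are matched against literal values only, and the graph is traversed as if it were undirected. For two keywords, a breadth-first search returns a shortest connecting path, which is the simplest case of the Steiner-tree idea; connecting more than two keywords requires a genuine approximation algorithm, which is not shown here.

```python
from collections import deque

# Assumed toy data (not taken from a real dataset).
TRIPLES = [
    ("P1", "name", "Allen"),
    ("P1", "acted", "M1"),
    ("M1", "title", "Annie Hall"),
    ("M1", "win", "A1"),
    ("A1", "name", "Oscar Best Comedy"),
    ("P1", "win", "A2"),
    ("A2", "name", "Oscar Best Director"),
]

def neighbours(node: str) -> list:
    """Nodes adjacent to `node`, ignoring edge direction."""
    out = []
    for s, _p, o in TRIPLES:
        if s == node:
            out.append(o)
        elif o == node:
            out.append(s)
    return out

def matches(keyword: str) -> list:
    """Literal values (objects of name/title) containing the keyword."""
    literals = {o for _s, p, o in TRIPLES if p in ("name", "title")}
    return sorted(v for v in literals if keyword.lower() in v.lower())

def connect(kw1: str, kw2: str):
    """Breadth-first search for a shortest path joining the two keywords."""
    targets = set(matches(kw2))
    queue = deque([m] for m in matches(kw1))
    seen = set(matches(kw1))
    while queue:
        path = queue.popleft()
        if path[-1] in targets:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

if __name__ == "__main__":
    print(connect("Allen", "Oscar"))
```

No schema is consulted anywhere, which is the appeal of the approach; the cost is that the traversal may touch a large part of the graph before it finds a connection.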
[Figure: example RDF graph with nodes P1, P2, M1, A1, and A2; literals “Annie Hall”, “Oscar Best Comedy”, and “Oscar Best Director”; edge labels name, title, acted, and win]
Query: “Allen” “Oscar”
Subgraph pattern 1:
?p name “Allen”
?a name “Oscar*”
?p acted ?m
?m win ?a
Subgraph pattern 2:
?p name “Allen”
?a name “Oscar*”
?p win ?a
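The two fixed subgraph patterns for the query “Allen” “Oscar” can be evaluated with a small conjunctive pattern matcher. This is only a sketch: the triples are assumed for illustration, a trailing `*` is read as a prefix match, and `term_matches` and `evaluate` are hypothetical helper names, not part of any tool mentioned in the slides.

```python
# Assumed toy data (not taken from a real dataset).
TRIPLES = [
    ("P1", "name", "Allen"),
    ("P1", "acted", "M1"),
    ("M1", "title", "Annie Hall"),
    ("M1", "win", "A1"),
    ("A1", "name", "Oscar Best Comedy"),
    ("P1", "win", "A2"),
    ("A2", "name", "Oscar Best Director"),
]

PATTERN_1 = [("?p", "name", "Allen"), ("?a", "name", "Oscar*"),
             ("?p", "acted", "?m"), ("?m", "win", "?a")]
PATTERN_2 = [("?p", "name", "Allen"), ("?a", "name", "Oscar*"),
             ("?p", "win", "?a")]

def term_matches(term: str, value: str, binding: dict):
    """Return the (possibly extended) binding, or None if term and value clash."""
    if term.startswith("?"):                      # variable
        if term in binding:
            return binding if binding[term] == value else None
        return {**binding, term: value}
    if term.endswith("*"):                        # prefix match on a literal
        return binding if value.startswith(term[:-1]) else None
    return binding if term == value else None     # exact match

def evaluate(pattern: list, binding: dict = None) -> list:
    """Return every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not pattern:
        return [binding]
    (s, p, o), rest = pattern[0], pattern[1:]
    results = []
    for ts, tp, to in TRIPLES:
        if tp != p:
            continue
        b = term_matches(s, ts, binding)
        if b is not None:
            b = term_matches(o, to, b)
        if b is not None:
            results.extend(evaluate(rest, b))
    return results

if __name__ == "__main__":
    print(evaluate(PATTERN_1))
    print(evaluate(PATTERN_2))
```

Each pattern yields a different reading of the same keywords: in the assumed data, the first finds a movie that connects the person to an award, the second an award won by the person directly.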
Example: Schema-Based RDF-KwS Tool
Selected Topics on the Web-of-Data
Entity Relatedness
§ Problem: How are two entities, described in an RDF dataset, related?
§ Example: Albert Einstein and Kurt Gödel
Selected Topics on the Web-of-Data
Entity Relatedness
Albert Einstein
Friendship
Kurt Gödel
§ Problem: almost ten thousand paths in DBpedia...
Selected Topics on the Web-of-Data
Entity Relatedness
§ Main challenges
§ How to find relationship paths between a given pair of entities in a knowledge base?
§ How to group the relationship paths?
§ What characteristics of the groups of paths
must be selected to generate a connectivity profile?
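The first two challenges can be sketched on a toy knowledge base: enumerate the simple paths between two entities up to a length bound, then group them by their sequence of predicates. The triples and predicate names below are illustrative assumptions, not extracts from DBpedia, and counting paths per group is only a crude stand-in for a connectivity profile.

```python
from collections import defaultdict

# Assumed toy knowledge base with hypothetical predicate names.
TRIPLES = [
    ("Albert_Einstein", "workplace", "Institute_for_Advanced_Study"),
    ("Kurt_Gödel", "workplace", "Institute_for_Advanced_Study"),
    ("Albert_Einstein", "residence", "Princeton"),
    ("Kurt_Gödel", "residence", "Princeton"),
    ("Albert_Einstein", "friend", "Kurt_Gödel"),
]

def edges(node: str):
    """Yield (label, neighbour); '^' marks a triple traversed backwards."""
    for s, p, o in TRIPLES:
        if s == node:
            yield p, o
        if o == node:
            yield "^" + p, s

def paths(source: str, target: str, max_len: int) -> list:
    """Depth-first enumeration of simple paths with at most max_len edges.

    Each path is returned as its tuple of edge labels.
    """
    found = []

    def walk(node, visited, labels):
        if node == target:
            found.append(tuple(labels))
            return
        if len(labels) == max_len:
            return
        for label, nxt in edges(node):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, labels + [label])

    walk(source, {source}, [])
    return found

def group_by_signature(found: list) -> dict:
    """Count how many paths share each sequence of edge labels."""
    groups = defaultdict(int)
    for signature in found:
        groups[signature] += 1
    return dict(groups)

if __name__ == "__main__":
    result = paths("Albert_Einstein", "Kurt_Gödel", max_len=2)
    print(group_by_signature(result))
```

The length bound matters in practice: the number of simple paths grows quickly with it, which is one reason a real knowledge base can yield thousands of paths for a single pair of entities.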
Topics
§ Selected Topics on the Web-of-Data
§ Selected Topics on Data Integration
§ Introduction
§ Basic Technology
• Entity Linkage
• Data Extraction
• Data Fusion
• Schema Alignment
§ Use of Ontologies for Data Integration