Selected Topics on the Web-of-Data and Data Integration
Marco A. Casanova
http://www.inf.puc-rio.br/~casanova/
Topics
§ Selected Topics on the Web-of-Data
§ Introduction
§ Basic Technology
§ RDF Keyword Search
§ Selected Topics on Data Integration
§ Introduction
§ Classic Data Integration
§ Data Lake Management
§ Web-of-Data and Data Integration
Selected Topics on the Web-of-Data
Introduction
§ Traditional Web
§ Adherence to Standards
• URLs – Web Page Ids
• HTML
• HTTP
• ...
• Internet protocols
In March 1989, Tim Berners-Lee submitted a proposal for an information management system to his boss, Mike Sendall.
‘Vague, but exciting’ were the words that Sendall wrote on the proposal.
Selected Topics on the Web-of-Data
Introduction
§ Traditional Web search engines
§ Coverage
• Just the data captured by Web crawlers
§ Search capabilities
✓ Keyword search
✓ Ranking
§ Web page semantics
• “Everything is text” (really?)
– other types of data
Iceberg Photo
Judith Currelly, Diane Farris Gallery
Topics
§ Selected Topics on the Web-of-Data
§ Introduction
§ Basic Technology
• RDF
• Adding Semantics to Web Pages
• Ontologies
§ RDF Keyword Search
§ Selected Topics on Data Integration
§ Web-of-Data and Data Integration
Selected Topics on the Web-of-Data
What is the Web-of-Data
http://lattes.cnpq.br/0807511237795775
http://purl.org/dc/elements/1.1/creator
https://doi.org/10.17771/PUCRio.acad.5181
Selected Topics on the Web-of-Data
RDF
§ URI – Uniform Resource Identifier
§ a compact sequence of characters
that identifies an abstract or physical resource
§ Examples
http://lattes.cnpq.br/0807511237795775 (a person)
http://purl.org/dc/elements/1.1/creator (a term)
https://doi.org/10.17771/PUCRio.acad.5181 (a document)
Subject Predicate Object
https://doi.org/10.17771/PUCRio.acad.5181 http://purl.org/dc/elements/1.1/date 2004/07/13
https://doi.org/10.17771/PUCRio.acad.5181 http://purl.org/dc/elements/1.1/language Portuguese
https://doi.org/10.17771/PUCRio.acad.5181 http://purl.org/dc/elements/1.1/creator http://lattes.cnpq.br/0807511237795775
[Figure: the same three triples drawn as a graph, with https://doi.org/10.17771/PUCRio.acad.5181 as the subject node and arcs labeled with the date, language and creator predicates]
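A minimal sketch of how the three triples above could be held and queried in plain Python. The tuple representation and the helper name `match` are illustrative assumptions, not part of the original slides; a real application would more likely use an RDF library and SPARQL.

```python
# Illustrative only: RDF triples as (subject, predicate, object) tuples.
DC = "http://purl.org/dc/elements/1.1/"
DOC = "https://doi.org/10.17771/PUCRio.acad.5181"
PERSON = "http://lattes.cnpq.br/0807511237795775"

triples = [
    (DOC, DC + "date", "2004/07/13"),
    (DOC, DC + "language", "Portuguese"),
    (DOC, DC + "creator", PERSON),
]


def match(triples, s=None, p=None, o=None):
    """Hypothetical helper: keep triples that agree with every non-None position."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Who is the creator of the document?
creators = [o for _, _, o in match(triples, s=DOC, p=DC + "creator")]
print(creators)
```

Leaving a position as `None` plays the role of a variable in a triple pattern.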
§ Ontology
§ Defines classes and properties, and their constraints
§ Typically written in RDFS or OWL 2
§ Examples
§ DCMI Metadata Terms (“Dublin Core”)
§ FOAF – The Friend of a Friend Vocabulary
§ FRBR – Functional Requirements for Bibliographic Records
§ CIDOC CRM – Conceptual Reference Model for cultural heritage documentation
Selected Topics on the Web-of-Data
Ontologies
§ Remarks about ontology design
§ “Same-old-conceptual-design”
§ “Lavoisier Principle” applies
• Reuse known ontologies as much as possible
§ Ontology = Vocabulary + Axioms
• Axioms (or constraints) capture the semantics of the terms
§ Example
§ The Music ontology uses the FOAF, FRBR, Event and Timeline ontologies
Topics
§ Selected Topics on the Web-of-Data
§ Introduction
§ Basic Technology
§ RDF Keyword Search
§ Selected Topics on Data Integration
§ Web-of-Data and Data Integration
Selected Topics on the Web-of-Data
RDF Keyword Search
RDF Graph
[Figure: example RDF graph. Resource nodes: P1, P2, M1, A1, A2. Arc labels: name, act, title, award, win, descr, rel. Literals: “Richard Burton”, “Liz Taylor”, "Who's Afraid of Virginia Woolf", “Oscar Best Actress”, "Oscar Best Cinematography"]
Selected Topics on the Web-of-Data
RDF Keyword Search
§ The RDF Keyword Search Problem
How to implement keyword search over RDF graphs?
... with good performance and proper ranking of the results
Selected Topics on the Web-of-Data
RDF Keyword Search
§ Schema-based approach
§ Requires... a schema!
§ Locates keyword matches in the graph
§ Compiles the keywords into a SPARQL query using schema information
§ Graph summary approach
§ Computes "graph summaries"
§ Locates keyword matches in the graph
§ Compiles the keywords into a SPARQL query using the graph summaries
§ Graph traversal approach
§ Locates keyword matches in the graph
§ Directly traverses the graph from the matching nodes or arcs
Query: “Taylor” “Oscar”
Candidate query 1:
?p name “Taylor”
?a descr “Oscar*”
?p act ?m
?m award ?a
Candidate query 2:
?p name “Taylor”
?a descr “Oscar*”
?p win ?a
[Figure: example RDF graph with resource nodes P1, P2, M1, A1, A2, arcs name, act, title, award, win, descr, rel, and the literals “Richard Burton”, “Liz Taylor”, "Who's Afraid of Virginia Woolf", “Oscar Best Actress”, "Oscar Best Cinematography"]
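The two candidate patterns for the query “Taylor” “Oscar” can be generated mechanically once the keyword matches and a join path are known. The sketch below only illustrates that compilation step: the `ex:` prefix, the function name `compile_query` and the way join paths are supplied are assumptions, and keyword escaping is deliberately left out.

```python
# Hypothetical sketch: compile keyword matches plus a join path into SPARQL text.
PREFIX = "PREFIX ex: <http://example.org/>"  # assumed namespace for the toy graph


def compile_query(matches, join_path):
    """matches: list of (variable, property, keyword).
    join_path: list of (subject_var, property, object_var) connecting the matches."""
    patterns, filters = [], []
    for i, (var, prop, keyword) in enumerate(matches):
        lit = f"?v{i}"
        patterns.append(f"  ?{var} ex:{prop} {lit} .")
        # Case-insensitive substring match on the literal value.
        filters.append(
            f'  FILTER(CONTAINS(LCASE(STR({lit})), "{keyword.lower()}"))'
        )
    for s, prop, o in join_path:
        patterns.append(f"  ?{s} ex:{prop} ?{o} .")
    body = "\n".join(patterns + filters)
    select_vars = " ".join(f"?{v}" for v in dict.fromkeys(m[0] for m in matches))
    return f"{PREFIX}\nSELECT {select_vars} WHERE {{\n{body}\n}}"


matches = [("p", "name", "Taylor"), ("a", "descr", "Oscar")]
via_movie = compile_query(matches, [("p", "act", "m"), ("m", "award", "a")])
via_win = compile_query(matches, [("p", "win", "a")])
print(via_movie)
print(via_win)
```

Choosing which join path to use (here supplied by hand) is exactly where schema information or graph summaries come in.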
Query: “Taylor” “Oscar”
No schema
Graph summary approach (via KMV-synopses)
§ the domain of name is similar to the domain of win
§ the domain of descr is similar to the range of win
Graph traversal approach
[Figure: example RDF graph with resource nodes P1, P2, M1, A1, A2, arcs name, act, title, award, win, descr, rel, and the literals “Richard Burton”, “Liz Taylor”, "Who's Afraid of Virginia Woolf", “Oscar Best Actress”, "Oscar Best Cinematography"]
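As a rough illustration of the graph summary idea, a KMV (k minimum values) synopsis keeps only the k smallest hash values of a set and still allows the Jaccard similarity of two sets to be estimated. The sketch below is a generic KMV estimator, not the specific method behind the slides; the node sets assigned to each domain and range are assumptions about the example graph.

```python
import hashlib


def h(value):
    """Hash a value to a pseudo-uniform number in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv(values, k):
    """KMV synopsis: the k smallest distinct hash values of the set."""
    return sorted({h(v) for v in values})[:k]


def jaccard_estimate(syn_a, syn_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two KMV synopses built with the same k."""
    union_k = sorted(set(syn_a) | set(syn_b))[:k]
    if not union_k:
        return 0.0
    both = set(syn_a) & set(syn_b)
    return sum(1 for x in union_k if x in both) / len(union_k)


K = 64
# Assumed node sets for the toy graph (the original figure is not recoverable).
name_domain, win_domain = {"P1", "P2"}, {"P2"}
descr_domain, win_range = {"A1", "A2"}, {"A1"}

sim_subjects = jaccard_estimate(kmv(name_domain, K), kmv(win_domain, K), K)
sim_objects = jaccard_estimate(kmv(descr_domain, K), kmv(win_range, K), K)
print(sim_subjects, sim_objects)
```

When a set has fewer than k elements the estimate is exact; for large sets only k hash values per property need to be stored, which is what makes the summary cheap.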
Selected Topics on the Web-of-Data
RDF Keyword Search
§ Problem: RDF keyword search
How to locate, in the RDF graph, the entities the keywords refer to?
How to discover, in the RDF graph, relationships between such entities?
§ Solution I: compile the keywords into a SPARQL query, using
• Information extracted from the RDF schema, if available
• Information extracted from the RDF graph
§ Solution II: traverse the RDF graph from the keyword matches
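A minimal sketch of Solution II: find the literals matching each keyword, then run a breadth-first search over the graph to connect them. The arc endpoints below are assumed for illustration, since the exact topology of the example graph is not recoverable, and the helper names are hypothetical.

```python
from collections import deque

# Assumed toy graph; arc endpoints are illustrative, not taken from the figure.
triples = [
    ("P1", "name", "Richard Burton"),
    ("P2", "name", "Liz Taylor"),
    ("P1", "act", "M1"),
    ("P2", "act", "M1"),
    ("M1", "title", "Who's Afraid of Virginia Woolf"),
    ("M1", "award", "A2"),
    ("A2", "descr", "Oscar Best Cinematography"),
    ("P2", "win", "A1"),
    ("A1", "descr", "Oscar Best Actress"),
]


def neighbours(node):
    """Nodes adjacent to `node`, ignoring arc direction."""
    for s, _, o in triples:
        if s == node:
            yield o
        elif o == node:
            yield s


def keyword_matches(keyword):
    """Objects whose text contains the keyword (case-insensitive)."""
    return [o for _, _, o in triples if keyword.lower() in o.lower()]


def shortest_path(start, goals):
    """Breadth-first search from a keyword match to any node in `goals`."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] in goals:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


taylor = keyword_matches("Taylor")
oscar = set(keyword_matches("Oscar"))
path = shortest_path(taylor[0], oscar)
print(path)
```

Under these assumed arcs the shortest connection runs through the `win` arc, which corresponds to the shorter of the two candidate queries.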
Summary
§ Selected Topics on the Web-of-Data
§ Basic Technology
• RDF
• Adding Semantics to Web Pages
• Ontologies
§ RDF Keyword Search
• Schema-based approach
• Graph summary approach
• Graph traversal approach
Topics
§ Selected Topics on the Web-of-Data
§ Selected Topics on Data Integration
§ Introduction
§ Classic Data Integration
§ Data Lake Management
§ Web-of-Data and Data Integration
Selected Topics on Data Integration
Introduction
§ The Data Integration Problem:
How to integrate data from different sources?
Table Poço (Portuguese for "well")
Anp-Id  Lat  Long
P1      x1   y1
P2      x2   y2
Table Well
Id  X   Y
Q1  x1  y1
Q2  t2  u2
The Data Integration Problem:
"Older than the Cathedral of Braga" (Portuguese saying, "Mais velho que a Sé de Braga": the problem is a very old one)
Selected Topics on Data Integration
Introduction
§ Recent challenges
• Big Data applications
• Data Science applications
"... Merck has approximately 4,000 Oracle databases, a large data lake, uncountable individual files and the company is interested in public data from the Web. Merck data scientists spend at least 90% of their time finding data sets relevant to their task at hand and then performing data integration on the result. ..."
Stonebraker, M. "Data Integration: The Current Status and the Way Forward", Bulletin of the IEEE Comp. Society Tech. Committee on Data Eng., 2018.
Selected Topics on Data Integration
Introduction
§ Percentage of time spent on a Data Science project (source: Crowdflower)
§ Building training sets: 3%
§ Cleaning and organizing data: 60%
§ Collecting data sets: 19%
§ Mining data for patterns: 9%
§ Refining algorithms: 4%
§ Other: 5%
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
Topics
§ Selected Topics on the Web-of-Data
§ Selected Topics on Data Integration
§ Introduction
§ Classic Data Integration
• Schema Alignment
• Entity Linkage
• Data Extraction
• Data Fusion
§ Data Lake Management
§ Web-of-Data and Data Integration
Selected Topics on Data Integration
Schema Alignment
§ The Schema Alignment Problem
How to map classes and attributes with the same semantics?
Table Poço (Portuguese for "well")
Anp-Id  Lat  Long
P1      x1   y1
P2      x2   y2
Table Well
Id  X   Y
Q1  x1  y1
Q2  t2  u2
Selected Topics on Data Integration
Schema Alignment
§ Approaches to Schema Alignment
§ Syntactical approach:
• "syntactical proximity" implies "semantic proximity"
§ Semantical approach:
• "things" with similar data have the same meaning
Bruegel, Pieter. The Tower of Babel. c 1563 Oil on oak panel. 114 x 155 cm
Kunsthistorisches Museum Wien, Vienna
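The two approaches can be contrasted on the Poço / Well example. The sketch below is only an illustration: the numeric coordinates stand in for the symbolic values x1, y1, ... of the slide, and the scoring functions (string similarity from `difflib`, Jaccard overlap of values) are assumed stand-ins for real matchers.

```python
from difflib import SequenceMatcher

# Toy columns; the numbers are assumed values replacing x1, y1, x2, y2, t2, u2.
poco = {"Anp-Id": ["P1", "P2"], "Lat": [-22.9, -23.5], "Long": [-43.2, -46.6]}
well = {"Id": ["Q1", "Q2"], "X": [-22.9, -10.0], "Y": [-43.2, -38.5]}


def name_similarity(a, b):
    """Syntactical approach: string similarity between attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_similarity(xs, ys):
    """Semantical approach: Jaccard overlap between the sets of values."""
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0


def align(left, right, score):
    """For each left attribute, pick the best-scoring right attribute.
    Ties (including all-zero scores) fall back to the first right attribute."""
    return {a: max(right, key=lambda b: score(a, b)) for a in left}


by_name = align(poco, well, name_similarity)
by_value = align(poco, well, lambda a, b: value_similarity(poco[a], well[b]))
print(by_name)
print(by_value)
```

Attribute names such as Lat and X share no characters, so the name-based score gives no useful signal for them, while the value-based score only works because the two toy tables share a well.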
Selected Topics on Data Integration
Schema Alignment
§ Problems
§ Syntactical approach:
• Presupposes that "syntactical proximity" implies "semantic proximity"
§ Semantical approach:
• Requires overlapping data sources
• Works for simple conceptual schemas
Bruegel, Pieter. The Tower of Babel. c 1563 Oil on oak panel. 114 x 155 cm
Kunsthistorisches Museum Wien, Vienna
Selected Topics on Data Integration
Schema Alignment
§ Problem
§ ... Too many alignments!
Selected Topics on Data Integration
Schema Alignment
§ Mediated schema
§ Shared vocabulary (with whom?)
§ Difficult to construct
§ ... impossible dream!
§ Domain ontology
§ Different name for mediated schema
§ Different formalism
§ ... same problems!
Picasso, Pablo. 'Don Quixote', c. 1955 Lithography. 28 x 36 cm
"Schema First, Query Later" Paradigm
Selected Topics on Data Integration
Entity Linkage
§ The Entity Linkage Problem
How to identify the same entity in distinct data sources?
Table Poço (Portuguese for "well")
Anp-Id  Lat  Long
P1      x1   y1
P2      x2   y2
Table Well
Id  X   Y
Q1  x1  y1
Q2  t2  u2
?
§ Solution
§ Align the data source schemas
§ Discover similar classes from different data sources
§ Discover which attributes from similar classes identify entities
§ Locate entities with the same (or similar) attribute values
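The last step of the solution can be sketched as follows, assuming the schema alignment (Lat with X, Long with Y) has already been found. The numeric coordinates and the tolerance are assumptions used in place of the symbolic values on the slide.

```python
import math

# Assumed numeric values standing in for x1, y1, x2, y2, t2, u2.
pocos = [
    {"Anp-Id": "P1", "Lat": -22.9, "Long": -43.2},
    {"Anp-Id": "P2", "Lat": -23.5, "Long": -46.6},
]
wells = [
    {"Id": "Q1", "X": -22.9, "Y": -43.2},
    {"Id": "Q2", "X": -10.0, "Y": -38.5},
]

# Output of schema alignment, taken as given here.
alignment = {"Lat": "X", "Long": "Y"}


def same_entity(p, w, tol=1e-3):
    """Two records are linked when all aligned attributes agree within `tol`."""
    return all(
        math.isclose(p[a], w[b], abs_tol=tol) for a, b in alignment.items()
    )


links = [
    (p["Anp-Id"], w["Id"]) for p in pocos for w in wells if same_entity(p, w)
]
print(links)
```

The nested loop compares every pair of records, which is fine for a toy example but would need blocking or indexing on real data sources.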
Selected Topics on Data Integration
Data Extraction
§ The Data Extraction Problem
How to extract structured data from semi-structured or unstructured data?
§ Two Types of Data Extraction (DE)
§ Closed-world DE: align to known entities and attributes
§ Open-world DE: not limited to known entities or attributes
Selected Topics on Data Integration
Data Extraction
§ Document Interpretation Problems
How to recognize the entities a document refers to?
(Entity recognition)
How to recognize relationships in a document?
(Relation extraction)
§ An approach...
§ Use data from the database to train algorithms to
– Recognize entities represented in the database (closed-world assumption, "CWA")
– Recognize relationships represented in the database ("distant supervision")
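A toy sketch of this approach: facts already stored in the database are used to label sentences automatically, producing noisy training examples for a relation extractor. The facts, sentences and function names below are hypothetical, and plain substring matching stands in for a real entity recognizer.

```python
# Hypothetical database content and documents.
facts = {("Liz Taylor", "Who's Afraid of Virginia Woolf"): "act"}
sentences = [
    "Liz Taylor starred in Who's Afraid of Virginia Woolf in 1966.",
    "Richard Burton was born in Wales.",
]

# Closed world: the only entities that can be recognized are those in the database.
entities = sorted({e for pair in facts for e in pair})


def recognize(sentence):
    """Return the database entities mentioned in the sentence."""
    return [e for e in entities if e in sentence]


def label(sentence):
    """A sentence mentioning both arguments of a known fact is taken as a
    (possibly noisy) training example for that relation."""
    found = recognize(sentence)
    return [
        (s, o, rel) for (s, o), rel in facts.items() if s in found and o in found
    ]


training = [(sentence, label(sentence)) for sentence in sentences]
for sentence, labels in training:
    print(labels, "<-", sentence)
```

The second sentence gets no label because Richard Burton is not in the assumed database, which shows the limit of the closed-world setting.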
Selected Topics on Data Integration
Data Fusion
§ The Data Fusion Problem
How to resolve conflicts and obtain correct values?
§ Challenge
§ Reasoning about source quality works only for easy cases
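One easy case can be sketched as weighted voting: each source votes for the value it reports, weighted by an accuracy that is assumed to be known in advance. The sources, values and accuracies below are hypothetical; estimating such accuracies reliably is the hard part this slide alludes to.

```python
from collections import defaultdict


def fuse(claims, accuracy):
    """claims: {source: reported value}; accuracy: {source: assumed weight}.
    Returns the value with the highest total weight."""
    score = defaultdict(float)
    for source, value in claims.items():
        score[value] += accuracy.get(source, 0.5)
    return max(score, key=score.get)


# Hypothetical: three sources report a value for the same attribute.
claims = {"A": 10.5, "B": 10.5, "C": 11.0}

uniform = fuse(claims, {"A": 0.5, "B": 0.5, "C": 0.5})   # plain majority vote
weighted = fuse(claims, {"A": 0.3, "B": 0.3, "C": 0.9})  # C is trusted more
print(uniform, weighted)
```

The two calls disagree, which shows how strongly the fused value depends on the assumed source quality.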
Summary
§ Classic Data Integration
§ Schema Alignment
§ Entity Linkage
§ Data Extraction
§ Data Fusion
Topics
§ Selected Topics on the Web-of-Data
§ Selected Topics on Data Integration
§ Introduction
§ Classic Data Integration
§ Data Lake Management
§ Web-of-Data and Data Integration
Selected Topics on Data Integration
Data Lake Management
"Load First, Schema Later" Paradigm
Ungerer, G. (2018) Cleaning Up the Data Lake with an Operational Data Hub. O'Reilly Media, Inc. ISBN 978-1-492-02735-5
Selected Topics on Data Integration
Data Lake Management
§ Data Lakes...
§ data can be in large volumes
§ data can have many formats
§ data can be dirty
§ data can change over time
§ ...
Selected Topics on Data Integration
Data Lake Management
§ Common Tasks in Data Lakes
§ Ingestion
§ Extraction (Type Inference)
§ Cleaning
§ Metadata Management
§ Discovery
§ Integration
§ Versioning
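As one concrete instance of the Extraction (Type Inference) task, a column loaded as raw strings can be assigned the most specific type that all of its values parse as. This is only a sketch under simple assumptions: the candidate types, their order, the date format and the function name `infer_type` are illustrative choices.

```python
from datetime import datetime


def infer_type(values):
    """Return the first type in a fixed order that every value parses as.
    Note: an empty column trivially parses as 'integer'."""

    def fits(parse):
        try:
            for v in values:
                parse(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "float"
    if fits(lambda v: datetime.strptime(v, "%Y/%m/%d")):  # assumed date format
        return "date"
    return "string"


columns = {
    "count": ["1", "2", "3"],
    "depth": ["1.5", "2"],
    "date": ["2004/07/13"],
    "language": ["Portuguese"],
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)
```

A single dirty value downgrades the whole column to "string", which is one reason type inference and cleaning tend to be handled together.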
Topics
§ Selected Topics on the Web-of-Data
§ Selected Topics on Data Integration
§ Web-of-Data and Data Integration
Web-of-Data and Data Integration
§ "Unique identifiers avoid Entity Linkage"
§ Use URIs as universal unique IDs
Google Knowledge Graph
https://g.co/kgs/8FEXUu
§ Use "sameAs" to account for multiple unique identifiers
ISNI Google Knowledge Graph
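When several identifiers denote the same entity, "sameAs" links can be closed transitively to obtain one cluster per entity. The sketch below uses a small union-find structure; the identifiers are made-up placeholders under example.org, not real ISNI or Knowledge Graph IDs.

```python
def same_as_clusters(links):
    """Group identifiers connected by sameAs links (treated as an equivalence)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Placeholder URIs, purely illustrative.
links = [
    ("http://example.org/isni/1", "http://example.org/kg/1"),
    ("http://example.org/kg/1", "http://example.org/lattes/1"),
    ("http://example.org/isni/2", "http://example.org/kg/2"),
]
clusters = same_as_clusters(links)
print(clusters)
```

Each resulting cluster can then be treated as a single entity, so no attribute-based entity linkage is needed for identifiers that are already linked.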
Web-of-Data and Data Integration
§ "Ontologies simplify Schema Alignment"
§ Align your schema with a well-known ontology
§ "Meaning is alignment"
Quine, W. V. (1968). Ontological relativity. The Journal of Philosophy, 65(7):185–212.
§ "Run away from schema alignment!"
§ Publish your data using a well-known ontology
Charles M. Schulz, The Complete Peanuts, 1959-1962
Summary
§ Selected Topics on the Web-of-Data
§ Basic Technology – RDF, Ontologies
§ RDF Keyword Search
§ Selected Topics on Data Integration
§ Classic Data Integration
§ Data Lake Management
§ Web-of-Data and Data Integration
§ "Unique identifiers avoid Entity Linkage"
§ "Ontologies simplify Schema Alignment"
Bruegel, Pieter. The Tower of Babel. c 1563 Oil on panel. 60 × 74.5 cm
Museum Boijmans Van Beuningen, Rotterdam
Opportunities
§ ... beyond keyword search
§ Databases
• Natural Language Interfaces
• Q&A Interfaces
§ Domain-specific search
• Real-time video search
• Query by image content
• ...
Opportunities
§ Data Integration and Machine Learning
§ ML for effective DI:
• Automating DI tasks using ML techniques
§ DI for effective ML:
• Automating the creation of large training datasets from different sources
Opportunities
§ Data Lakes
§ Real-time, scalable ingestion of unstructured data
§ Real-time, scalable data cleaning
§ On-demand scalable metadata inference and querying
§ On-demand scalable integration of heterogeneous data
Opportunities
§ ... beyond keyword search
§ Data Integration and Machine Learning
§ Data Lakes: Real-time, scalable, on-demand
Thank You
References for this presentation...
casanova puc-rio