[PENDING] University of Crete Library

This work is the result of the constant and endless love of my parents, Markos and Anastasia, and my sister Nantia. Consequently, there is a need for an interoperability model that will allow the specification of the entities of interest as well as of the related and useful semantic data.

List of Tables

Introduction

General Objective
Motivation and Vision
Research Questions
The Approach
a list of search results, we introduce a probabilistic (Random Walk-based) ranking model
Contributions of this Dissertation
Outline of Dissertation

We present the results of a comparative evaluation with other ranking methods that illustrate the effectiveness (and also quantify the difference) of the proposed ranking scheme. In Appendix B, the formulas of the metrics used in the evaluation of the proposed method of reranking the results are given.

Figure 1.1: An example of a top-5 semantic graph.

Motivation and Context

Motivating Scenarios
Possible Approaches
is that if a user submits a query that does not belong to the index then the system
Basic Concepts
The Considered Search Process
retrieve the incoming and outgoing properties of each entity URI.
The Considered Configuration Process

Then, for each incoming query, the entities of the top-L hits of the answer are retrieved from the index and given to the user. This functionality can also be offered on demand as a supplementary representation of the identified entities.

Figure 2.1: Search form of a marine-related search application.

Background and Related Work

Semantic Web and Semantic Search

The Semantic Web (or the Web of Data)

The principles of Linked Data were first described by Berners-Lee in 2006 and provide broad guidelines that data publishers have begun to follow in order to realize the Web of Data. The Web of Data can be accessed through Linked Data browsers, just as the traditional Web of Documents is accessed through HTML browsers [37].

Figure 3.1: Associations among the several concepts, areas and topics involved in this dissertation.

Semantic Search

Semantic data search includes crawling, storing and indexing semantic data, searching and arranging semantic data, searching in multi-data source and multi-repository scenarios, handling vague or incomplete semantic data, infrastructure for querying semantic data on the web, etc. Below we discuss the most important works related to both Semantic Data Search and Semantic-driven IR.

aggregated statistical metadata about the indexed Semantic Web documents and Seman-

GKG tries to understand the submitted query and presents a semantic description (in a right panel) of the entity the user might be looking for. For example (at the time of writing), for the query "Barack Obama and Honolulu", GKG returns no semantic information, even though Honolulu is the birthplace of Barack Obama, i.e. the two entities are highly related.

identified in the list of results (i.e., relationships that do not exist in the underlying knowl-

The final output of the system consists of a set of ontology elements that answer the user's question and a complementary list of semantically ordered relevant documents. In this work, the authors introduce an entity ranking algorithm, called LDRANK, which combines the biased likelihood method we present in Section 5.3 and a variation of the Singular Value Decomposition (SVD) algorithm that mines textual data associated with the identified entities.

Our Approach

Named Entity Extraction

LOD-based NEE Tools of General Purpose
Life Sciences-tailored Annotation Tools
Our Approach

The result of entity extraction can be saved in different forms (HTML, XML, JSON or XHTML+RDFa). Configurations can be shared by users/communities, e.g., for annotating different document corpora using the same configuration, i.e., the same categories, entity lists, SKBs, etc.

that accepts and uses such configurations, while the result of the annotation process

Categorizing/Grouping Search Results

Automatically Improving Search Results

Automatic Query Expansion

query entities and their links to knowledge bases, including structured attributes and text

Pseudo-Relevance Feedback
Re-ranking
Our approach

Link Analysis

the underlying hyperlink graph. A node is created for every Web page and a directed edge

Link Analysis Techniques

Tomlin [196] proposes a generalization that computes flow values for the edges of the web graph and a TrafficRank value for each page. The hub role captures the quality of the page as a reference to useful resources, while the authority role captures the quality of the page as a resource itself.
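
The hub and authority roles described above are the two weights maintained by Kleinberg's HITS algorithm. The following is a minimal sketch of the mutually reinforcing update, not code from this dissertation; the function names `hits` and `normalize`, the fixed iteration count and the toy edge list are assumptions made for illustration.

```python
from math import sqrt


def normalize(weights):
    """Scale a weight dictionary to unit Euclidean length."""
    norm = sqrt(sum(w * w for w in weights.values())) or 1.0
    return {node: w / norm for node, w in weights.items()}


def hits(edges, iterations=50):
    """Compute hub and authority weights for a directed graph.

    `edges` is a list of (source, target) pairs (an assumed input format).
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages linking to it.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub weight: sum of the authority weights of the pages it links to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy graph: p1 links to a and b, p2 links only to a.
hub, auth = hits([("p1", "a"), ("p2", "a"), ("p1", "b")])
```

In this toy graph, page `a` receives the larger authority weight because two hubs point to it, and `p1` receives the larger hub weight because it points to both authorities.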

associated every page with a hub and an authority weight. Following the mutual reinforcing

When at the node on the hub side, the algorithm chooses one of the outgoing links uniformly at random and passes to an authority. The parameters of the algorithm are (a) a firing threshold F (a real number in the interval [0,1]), and (b) a decay factor D (a real number in the interval [0,1]).

Applying Link Analysis on the Web of Data
Our approach
Comparing Link Analysis Techniques

Let $A_1$ and $A_2$ be two link analysis ranking algorithms. We can then define the distance between $A_1$ and $A_2$ on a graph $G$ as $d(A_1(G), A_2(G))$, where $d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is some function that maps two real $n$-dimensional weight vectors $a_1, a_2$ to a real number $d(a_1, a_2)$. Authority vectors can be viewed as points in an $n$-dimensional space, so we can use common geometric measures of distance (e.g. the Manhattan distance).
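
As an illustration of the geometric view above, the Manhattan distance between two weight vectors can be computed as follows. This is a sketch under stated assumptions: the function name `manhattan_distance` and the example vectors are hypothetical, and both vectors are assumed to list the weights of the same nodes in the same order.

```python
def manhattan_distance(a1, a2):
    """Return the Manhattan (L1) distance between two weight vectors.

    Both vectors must have the same length n and refer to the same nodes
    in the same order (an assumption of this sketch).
    """
    if len(a1) != len(a2):
        raise ValueError("weight vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a1, a2))


# Hypothetical authority vectors produced by two algorithms on one graph.
weights_a1 = [0.5, 0.3, 0.2]
weights_a2 = [0.2, 0.3, 0.5]
distance = manhattan_distance(weights_a1, weights_a2)
```

For vectors that are normalized to sum to one, this distance lies between 0 (identical weights) and 2 (disjoint support).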

by two different algorithms. The motivation is that the ordinal ranking is the usual end-

Evaluating Link Analysis Techniques

By clicking on the query, the user is shown a combination of the top 10 results of all algorithms. This is the proportion of results in the top 10 ranking positions that are relevant to the query.
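
The proportion of relevant results in the top 10 positions can be computed as in the sketch below. The function name `relevant_ratio_at_k`, the input format and the example identifiers are assumptions for illustration; the sketch divides by `k` even when fewer than `k` results are returned, which is one common convention and may differ from the one used in the cited evaluation.

```python
def relevant_ratio_at_k(ranked_results, relevant, k=10):
    """Return the fraction of the top-k ranking positions holding relevant results.

    `ranked_results` is a list of result identifiers in ranking order and
    `relevant` is the set of identifiers judged relevant (assumed inputs).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    hits = sum(1 for result in top_k if result in relevant)
    return hits / k


# Hypothetical ranking of 12 results, 4 of the first 10 being relevant.
ranking = [f"r{i}" for i in range(1, 13)]
judged_relevant = {"r1", "r3", "r4", "r9", "r12"}
ratio = relevant_ratio_at_k(ranking, judged_relevant)
```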

the average is taken over the intersection over the top-1, top-2, up to top-10)

Each result was described by the human-readable label and the URI of the entity. The evaluation compared the proposed approach with two relevant baselines and the results showed that the proposed approach outperforms the baselines, and the improvement is large for the domain-specific dataset.

problem of retrieving and ranking entities from a structured knowledge repository based

Configuring Named Entity Extraction

Notions and Notations

For an entity URI $u$, let $Descr(u)$ be a set of RDF triples expressing information about $u$ in the RDF graph. $Graph(doc) = \bigcup_{u \in U(doc)} Descr(u)$ is the set of triples about these URIs, which actually defines the RDF graph.
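
The definition of Graph(doc) as a union of entity descriptions can be sketched as follows. This is an illustrative assumption, not the dissertation's implementation: triples are modelled as plain string tuples, the function name `graph_of_document` is hypothetical, and the `ex:` identifiers are placeholders rather than real data.

```python
def graph_of_document(uris, descr):
    """Return the union of Descr(u) over all entity URIs u of a document.

    `uris` plays the role of U(doc) and `descr` maps each URI to its set of
    (subject, predicate, object) triples (assumed representations).
    """
    graph = set()
    for uri in uris:
        # URIs without a known description contribute no triples.
        graph |= descr.get(uri, set())
    return graph


# Placeholder descriptions of two entity URIs.
descriptions = {
    "ex:u1": {("ex:u1", "ex:p", "ex:o1")},
    "ex:u2": {("ex:u2", "ex:p", "ex:o2"), ("ex:u1", "ex:q", "ex:u2")},
}
graph = graph_of_document(["ex:u1", "ex:u2"], descriptions)
```

Using sets means that a triple appearing in the description of more than one URI is stored only once in the resulting graph.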

The Proposed Configuration Model

For an entity name e, let U(e) denote the URIs associated with e that exist in one or more RDF graphs, e.g. U(Chum Salmon) = {http://dbpedia.org/resource/Chum_salmon, https://. In general, we can identify entities of different categories in a given document; each of these entities is associated with URIs, and each of these URIs with triples that describe it.

SPARQL queries. This template query is also associated with a name that can be used for

However, a feature of this trivial disambiguation is that we already know the category of the corresponding entity and thus can design the query template accordingly (e.g., we can compare the name of the entity with the names of entities belonging to a given RDF class, as in the template queries in Figures 4.4 and 4.5). To enrich entities, KBM1 can be connected to the template query shown in Figure 4.6, which retrieves the outgoing properties of the entity URI.

The Open NEE Configuration Model

For example, we can define that the category Species is a broader concept of the categories Fish Species and Bird Species.

is a well-known thesaurus related to the marine domain, we can define that the category

Exporting/Exchanging the Annotation Results

SELECT ?tool ?name WHERE {

Finally, the property oae:hasMatchedURI is used to represent the URIs that match an entity name.

Figure 4.14: The extension of the Open Annotation Data Model.

The X-Link Framework

Functionality
Configurability

Regarding the connectivity of the entity URIs, X-Link computes a graph to make clearer how the entity URIs are associated. In that example, X-Link supports 7 categories of entities (line 1), that is, the entity names of these categories are retrieved and stored in GATE ANNIE.

Evaluation

Task-based User Study

We recorded whether they succeeded in completing each task of the above scenario, as well as the time to successfully complete each task.

Figure 4.23: The configuration options of the X-Link evaluation prototype.

Q4: How easy was it to specify how to enrich the identified entities?

82% of the participants found the overall setup (Q0) an "easy" task, while 18% found it "very easy". Regarding V6, a few participants mentioned a problem in understanding the idea of the SPARQL template queries (one also suggested providing a user-friendly interface for constructing them).

Table 4.3: Evaluation of the difficulty in performing the scenario (results from 11 users).

Case Study: Querying Online DBpedia

The time is highly dependent on the total number of subjects belonging to the respective category. For each resource class, we randomly selected 10 entity tags belonging to that class and measured the average time to execute the SPARQL query shown in Figure 4.26 ([URI_OF_RES_CLASS] corresponds to the URI of the resource class, while [ENTITY] corresponds to a randomly selected tag).

Figure 4.25: Time for adding a new category.

Other Aspects

Epilogue

Stochastic Ranking of Entities, Properties and Search Results

Notions and Notations
Entity Importance
consider that the entities detected in the title of the article are more important than those
Ranking of Entities and Properties

The Semantically-Enriched Graph of Identified Entities

blank node, node “d” (Thunnus atlanticus@en) is a literal (specifically a string in English),

The State Transition Graph (STG)
Analyzing the STG

where $J[e_i] = Jump(e_i)$ and $T$ is the transition matrix
return r

Promoting the Important Entities
Top-K Semantic Graphs
construction a semantic graph does not contain blank nodes). However, some resources

Consider the following example in the marine domain (which will be our running example from now on): a user submits a query to a marine-related search engine, e.g., the query "bonito". Also let props(e, e') ⊆ o(e) be the set of (directed) edges connecting e to e' in X (i.e., the properties connecting the two entities).

Figure 5.1: Stochastic ranking of entities and properties: a marine-related example.

(Re-)Ranking of Search Results

Modeling a Random Walker

We also report experimental results regarding the distribution of vertices in these clusters for several top-K semantic graphs produced by a prototype system. In the latter case, the user can now either i) open one or more of the displayed results, or ii) click on some other entities and update the displayed list of results accordingly, or iii) clear the selection (reset) and view the results again.

The Semantic Graph of Documents and Entities

The STG

We notice that the transition probabilities are influenced by the “importance” of the identified entities. For simplicity and understandability, the graph contains only the outgoing edges of the gray entity nodes.

Figure 5.4: An example of a SGODE graph.

Analyzing the STG

First, we need to determine the value of the decay factor, i.e. the probability that a random walker will make a random jump. As for p3, i.e. the probability of selecting a document node rather than a related entity/property node when leaving an entity node, we believe that a walker in an entity node is more likely to move to a document node (one that contains/refers to that entity), since the end user's goal is to find one or more documents that satisfy their information need.

Exploiting the Outcome

Since we want to prioritize documents related to important (highly ranked) entities, we can define a small value, e.g. d < 0.4. If we want the walker to select only document nodes when leaving an entity node, we can set p3 = 1.0.
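
The effect of the random-jump probability d can be illustrated with a generic biased random walk computed by power iteration. This is a minimal sketch under stated assumptions, not the dissertation's algorithm: the function name `random_walk_scores`, the input format, the handling of nodes without outgoing edges and the toy graph are all hypothetical, and p3 is not modelled separately because the transition probabilities are supplied directly.

```python
def random_walk_scores(transitions, jump, d=0.2, iterations=100):
    """Score nodes with a random walk that jumps with probability d.

    `transitions[src][dst]` is the probability of moving from src to dst
    (each row is assumed to sum to 1) and `jump[node]` is the probability
    of landing on node after a random jump (assumed to sum to 1).
    """
    nodes = list(jump)
    scores = dict(jump)  # start from the jump distribution
    for _ in range(iterations):
        nxt = {node: d * jump[node] for node in nodes}
        for src in nodes:
            out = transitions.get(src, {})
            if out:
                for dst, prob in out.items():
                    nxt[dst] += (1.0 - d) * scores[src] * prob
            else:
                # Assumption: a node without outgoing edges always jumps.
                for node in nodes:
                    nxt[node] += (1.0 - d) * scores[src] * jump[node]
        scores = nxt
    return scores


# Toy graph: entity e1 is considered more important than e2 (larger jump
# probability); document doc1 mentions e1 and document doc2 mentions e2.
jump = {"e1": 0.4, "e2": 0.1, "doc1": 0.25, "doc2": 0.25}
transitions = {
    "e1": {"doc1": 1.0},
    "e2": {"doc2": 1.0},
    "doc1": {"e1": 1.0},
    "doc2": {"e2": 1.0},
}
scores = random_walk_scores(transitions, jump, d=0.2)
```

In this toy setting the document attached to the more important entity ends up with the higher score, which is the behaviour the small value of d is meant to preserve.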

Evaluation

When searching the web, PageRank is usually initialized with the same value for all web pages.

LOD-enriched ranking approach with the plain NER method, and we present the evalu-

Usefulness of Top-K Semantic Graphs

The participant can choose one of the following options: Yes; Maybe Yes (it depends on the interaction model and the quality of the visualization of the graph); Maybe No; No. For example, a query like "tuna species" looks for instances of a class of entities, while a query like "yellowfin tuna" looks for information about one specific entity, in this case a certain tuna species.

Figure 5.8: Indicative screenshot of the questionnaire in which the participant must answer the first question for the query “yellowfin tuna”.

Effectiveness of Entity and Property Ranking

We also notice that for the last query (a query of type other), which is a fairly common query, a high percentage of participants (26%) chose MAYBE NO or NO. Regarding Q1, the results depicted in the respective column of Table 5.1 show that we can reject the null hypothesis for the first four queries with a Type-I statistical error of 5%, while we cannot reject it for the last one, i.e. the query "fishing in Hawaii".

the semantic information (entities and properties) that better characterizes these results

- Fish species related to the detected species, but not detected in the search results
- Properties (literal and numeric values) related to the detected species

These types of entities correspond to the five clusters described in Section 5.3.5 ($V_{ans}$, $V_{rel}$, $V_{lit}$, $V_{ctg}$, $V_{web}$).

Figure 5.9: Indicative screenshot of the evaluation prototype showing the guidelines and the three top-10 lists for the query “tuna species”.

We note that the highest score is for the literal properties ($V_{lit}$), which correspond to the properties of the detected entities. This is justified by the fact that most of the submitted queries are entity queries, the purpose of which is to find information about a specific entity.

Figure 5.11: The scores given to the three ranking schemes.

Regarding the difference in the top elements, we compared the algorithms using the Jaccard similarity coefficient. As for the difference in the order of the elements, we compared the algorithms using the normalized Kendall tau distance measure as described in 3.5.4.
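
The two comparison measures named above can be sketched as follows. The function names and the example lists are assumptions for illustration, and the Kendall tau sketch assumes that both rankings order exactly the same items, which may differ in detail from the definition given in Section 3.5.4.

```python
from itertools import combinations


def jaccard(items_a, items_b):
    """Return the Jaccard similarity coefficient of two collections of items."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen for this sketch
    return len(set_a & set_b) / len(set_a | set_b)


def normalized_kendall_tau_distance(rank_a, rank_b):
    """Return the fraction of item pairs that the two rankings order differently.

    Both rankings must contain the same distinct items (assumed here).
    """
    if len(set(rank_a)) != len(rank_a) or set(rank_a) != set(rank_b):
        raise ValueError("rankings must order the same distinct items")
    n = len(rank_a)
    if n < 2:
        return 0.0
    position_b = {item: index for index, item in enumerate(rank_b)}
    # A pair is discordant when rank_a places x before y but rank_b does not.
    discordant = sum(
        1 for x, y in combinations(rank_a, 2) if position_b[x] > position_b[y]
    )
    return discordant / (n * (n - 1) / 2)


# Hypothetical top-3 lists of two ranking algorithms.
overlap = jaccard(["a", "b", "c"], ["b", "c", "d"])
disorder = normalized_kendall_tau_distance(["a", "b", "c"], ["a", "c", "b"])
```

A distance of 0 means that the two rankings agree on every pair, while a distance of 1 means that one ranking is the reverse of the other.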

Figure 5.13: Distribution of vertices in the top-K semantic graphs.

Effectiveness of Results (Re-)Ranking

A topic is actually a medical case narrative that serves as an idealized representation of an actual medical record and describes information such as the patient's medical history, the patient's current symptoms, the tests performed by a doctor to diagnose the patient's condition, the eventual diagnosis of the patient, as well as the steps taken by a doctor to treat the patient. For each topic offered, an effective IR system should find documents that can help the physician answer a common general clinical question, such as what is the patient's diagnosis or what tests should the patient receive based on the medical report.

The results showed that the improvement is statistically significant for most cases (the test fails when the improvement is small, less than 14%). For L = 250 and d = 0.0, the results are statistically significant only for the metrics nDCG' and P@10', while for d = 0.2 the results are statistically significant for all metrics except Q'.

Efficiency of the Entire Semantic Analysis Process

This means that the specific semantic information about the identified entities (the DBpedia subject property), although it may be quite useful in another context (e.g. for entity-based faceted search), misleads the random walker and negatively affects the re-ranking of the retrieved results. The time to perform entity mining in a set of search results depends on many parameters, such as the number of results we want to analyze, the size of the text in each result/document, the efficiency of the underlying NER algorithm, etc.

Figure 5.16 depicts the results. We notice that entity enrichment did not improve the top-100 list and this is clear for all evaluation metrics

The time to generate a SEGIE depends on the number of entities detected, the performance of the underlying SKBs, and the categories of entities detected (as some entity categories may contain many incoming or outgoing properties). The time to generate the STG depends on the number of triples in the SEGIE, while the time to run PageRank depends on the number of iterations and on the number of edges in the STG.

Other Aspects

On average, the top-5 entities have a Kendall tau distance of 0.08, which means that there are often (but not always) a few entities in the top-5 list with different orders in the two approaches. The top-10 entities have a Kendall tau distance of 0.10 (meaning there are about 4-5 pairs of entities in the top-10 list with different orders), while the top-15 and top-20 entities have a Kendall tau distance of around 0.20.

Epilogue

In the same experiments, we also compared the linear order of the top-5, top-10, top-15 and top-20 entities (not of all the returned entities, as before), ignoring entities that do not exist in the corresponding top-K entities of both approaches, in order to clarify the positions in which there are differences in the linear order. The authors benchmarked (using a crowdsourced dataset): i) the biased PageRank algorithm we propose in this thesis (they call it HIT), ii) a modified biased PageRank algorithm with prior knowledge based on SVD (called SVD), iii) an unmodified PageRank algorithm which applies an equal probability distribution (called EQUI), and iv) LDRANK, which combines the three previous approaches using a group-opinion consensus algorithm [50].

Experimental results on the 2014 and 2015 datasets of the TREC Clinical Decision Support track showed that the proposed reranking approach can improve the list of results returned by a classic IR system, specifically by moving relevant but low-ranked hits to higher positions. The results of the experimental evaluation carried out showed that for up to 100 detected entities (which is the case for snippet mining) we can provide the proposed functionality in real time (in less than 4 seconds), even if we have access to an online SKB such as DBpedia.

Interaction Model and Applications

Faceted Search and Graph Exploitation
6.3) allowing thereby to control the amount of information that she/he wants to display
On-Demand Entity Enrichment and Exploration
SELECT ?name ?uri ?genus WHERE {
Assembling the Pieces: The Search System “X-Search”

X-Search for Patent Search

Theophrastus: Entity-based Automatic Annotation of Web Documents
Link

Weaknesses and Limitations

For example, in Figure 6.15, the user requested to view the properties of the DBpedia resource associated with the "pink salmon" entity. The user can also review identified entities and metadata (grouped into categories) (D) and search space aggregation (E).

Figure 6.1: Identified entities grouped in facets (left) vs. list of identified entities (right).

Conclusion

Synopsis of Contributions
We should stress that, due to the lack of standards related to entity extraction, it would
Directions for Future Work and Research
needed. It is also very interesting to compare the proposed approach with other methods
Interacting with Top-K Semantic Graphs

Another interesting aspect regarding the proposed re-ranking method is the study of the extreme cases. Another related interesting direction is to study approaches on how to automatically (on demand) select the NEE services to be used from a list of available descriptions (expressed using the proposed Open NEE configuration model), for example based on the submitted query or the context of the retrieved results.

Bibliography

ACM, 2005

Proceedings of the 8th Joint ACM/IEEE-CS Conference on Digital Libraries, JCDL ’08, pages 52–56. Proceedings of the 31st annual ACM SIGIR international conference on Research and development in information retrieval, pages 603–610.

Appendix A

Configuration Forms of X-Link Evaluation Prototype

Appendix B

Evaluation Metrics for Incomplete Relevance Judgements

Note that r′−numrel(r′) is the number of judged irrelevant documents ranked above the document at rank r′. This metric favors aggregated lists that contain many judged relevant documents over judged irrelevant documents at the top of the sorted list.

Appendix C

Publications, Systems and Models

Publications

In (1), we began leveraging LOD to enrich keyword-based web search with entity mining performed at query time. In (4), we elaborate on the configurability of a NEE system and we present the X-Link framework.

Systems and Models

Appendix D

Acronyms