Schema quality analysis in a data integration system


Universidade Federal de Pernambuco, Centro de Informática, Pós-Graduação em Ciências da Computação. Schema Quality Analysis in a Data Integration System, by Maria da Conceição Moraes Batista. Doctoral Thesis. Recife, June 2008.

Submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy. Supervisor: Profa. Ana Carolina Salgado.

Batista, Maria da Conceição Moraes. Schema quality analysis in a data integration system / Maria da Conceição Moraes Batista. Recife: The Author, 2008. xi, 112 leaves: ill., fig., tab. Thesis (doctorate), Universidade Federal de Pernambuco, CIn, Computer Science, 2008. Includes bibliography. 1. Databases. 2. Data integration. I. Title. 025.04 CDD (22nd ed.). MEI2008-109.


Acknowledgments

This work is the result of a great effort, and it would not have been accomplished without the support of the people mentioned here. First, I would like to thank my supervisor, Ana Carolina Salgado, for her continuous support, discussions, questions and comments. Always offered with enormous patience, friendship and dedication, her critical comments played a major role in the quality of my discussion and arguments.

Special thanks go to my examiners, Professors Claudia Bauzer Medeiros, Maria Luiza Machado Campos, Alexandre Vasconcelos, Fernando Fonseca and Décio Fonseca. Their comments and suggestions were extremely important and useful.

My gratitude goes to my mother Helena, an eternal source of loving care and a strong support in all my steps. To my siblings, Maria Helena and Juarez Junior, I also send a big thank you. I also thank my father Juarez (in memoriam), wherever he is. I am sure that he is extremely happy, and without him this work would not have been possible.

My singular thanks go to my friend Bernadette Lóscio from the database research group. She was essential, directly helping the development of this thesis and always giving me motivation.

Finally, my most special thank you goes to Cris, my guardian angel and dear friend, who supported me in every step of this work, giving me strength and coffee, and always reminding me to trust in the axiom “Everything is gonna be alright!”. I will always be grateful to you.

Abstract

Information Quality (IQ) has become a critical topic in organizations and, consequently, in Information Systems research. Poor information quality can have a severe impact on the overall effectiveness of an organization. The growth of data warehouses and the direct access to information from various sources by managers and information users have increased the need for, and awareness of, high-quality information in organizations. The notion of IQ has emerged in recent years and is attracting steadily increasing interest. There is no commonly agreed definition or measure of Information Quality apart from the general and classical notion of “fitness for use”. Information is considered appropriate for use from the perspective of users' requirements, i.e., the value of the information depends on its utility when being used.

In data integration systems, access to information that is spread over multiple, distributed and heterogeneous sources is an important problem in many domains. Typically there are many ways to obtain answers to a global query, using data from different sources in different combinations, but in general it is prohibitively expensive to obtain all answers. While much work has been done on query processing and on choosing plans under cost criteria, very little is known about the important problem of measuring Information Quality aspects in the global schemas of data integration systems.

In this work, we propose IQ analysis in a data integration system, mainly related to the system schemas. Our main goal is to minimize query processing time. Our hypothesis is that an acceptable way to decrease query execution time is to construct good schemas, with high quality scores, and we have based our approach on this hypothesis. We focused on developing IQ analysis mechanisms to address schema quality, especially that of the integrated schema. Initially we built a list of IQ criteria related to data integration aspects and chose to focus on formally specifying the algorithms and definitions of three schema IQ criteria: minimality, schema completeness and type consistency. We also defined an algorithm to carry out schema minimality improvements and algorithms for measuring type consistency. With our experiments we have shown that the query execution time in a data integration system may decrease if the query is submitted to a schema with high scores of minimality and type consistency.

Keywords: Information Quality; Data Quality; Data Integration.

Resumo

Information Quality (IQ) has become a critical aspect in organizations and in research in the area of information systems. Poor-quality information can have negative impacts on the effectiveness of an organization. The growing use of data warehouses and the direct access of managers and users to information obtained from various sources have contributed to the growing need for quality in corporate information. The notion of IQ in information systems has emerged in recent years and has been the target of ever-increasing interest. There is still no common agreement on a definition of IQ, only a consensus that it is a concept of “fitness for use”. Information is considered appropriate for use from the perspective of a user's requirements and needs, i.e., the quality of information depends on its utility.

Integrated access to information spread over multiple heterogeneous, distributed and autonomous data sources is an important problem to be solved in many application domains. Typically there are several ways to obtain answers to global queries over data in different sources with different combinations; however, it is quite costly to obtain all possible answers. While much research has been done on query processing and plan selection with cost criteria, little is known about the problem of incorporating IQ aspects into the global schemas of data integration systems.

In this work, we propose the analysis of IQ in a data integration system, more specifically the quality of the system's schemas. Our main objective is to improve the quality of query execution. Our proposal is based on the hypothesis that one alternative for optimizing query processing is the construction of schemas with high IQ scores. Thus, the focus of this work is on developing IQ analysis mechanisms aimed at data integration schemas, especially the global schema. Initially, we built a list of IQ criteria and related these criteria to the elements found in data integration systems. Next, we turned our focus to the integrated schema and formally specified schema quality criteria: minimality, schema completeness and type consistency. We also specified an algorithm that performs adjustments to improve minimality, and algorithms to measure type consistency in schemas. With our experiments we were able to show that the execution time of a query in a data integration system may decrease if the query is submitted to a schema with high scores of minimality and type consistency.

Keywords: Information Quality; Data Quality; Data Integration.

Contents

Chapter 1. Introduction
  1.1. Information Quality
  1.2. Quality in Data Integration Systems
  1.3. Motivation
  1.4. Objectives
  1.5. Summary of Contributions
  1.6. Document Plan
Chapter 2. Data Integration
  2.1. Introduction
  2.2. Data Integration Approaches
    2.2.1. Data Warehouse Architecture
    2.2.2. Mediators Architecture
    2.2.3. Hybrid Architecture
  2.3. Final Considerations
Chapter 3. Information Quality for Data Integration Systems
  3.1. Introduction
  3.2. Scenarios for Information Quality Analysis
  3.3. Comparative Analysis of IQ Approaches
  3.4. Relevant IQ Criteria for Data Integration
    3.4.1. Reputation
    3.4.2. Completeness
    3.4.3. Timeliness
    3.4.4. Verifiability
    3.4.5. Accuracy
    3.4.6. Availability
    3.4.7. Response Time
    3.4.8. Representational Consistency
    3.4.9. Minimality
  3.5. Analysis of IQ in Data Integration Processes
    3.5.1. Data Source Schema
    3.5.2. Mediation Queries
    3.5.3. Integrated Schema
    3.5.4. Source Selection
    3.5.5. Query Processing
    3.5.6. Data Materialization
    3.5.7. Data Integration
    3.5.8. Summary of IQ Criteria and Data Integration Elements
  3.6. Information Quality in Data Integration Systems
    3.6.1. IQ Criteria for Selective Data Materialization
    3.6.2. Data Freshness in a Data Integration System
    3.6.3. Source Quality in Data Integration
  3.7. Final Considerations
Chapter 4. Schema Quality in a Data Integration System
  4.1. Introduction
  4.2. Schema Representation and Mappings
    4.2.1. X-Entity Model
    4.2.2. Schema Mappings
  4.3. Data Integration IQ Criteria Classification
    4.3.1. Relevant IQ Criteria
  4.4. Minimality
    4.4.1. Minimality Definitions
    4.4.2. Minimality Specification
    4.4.3. Minimality Example
    4.4.4. Improvements over Schema Minimality
    4.4.5. Redundant Entities Elimination
    4.4.6. Redundant Relationships Elimination
    4.4.7. Redundant Attributes Elimination
  4.5. Type Consistency
    4.5.1. Type Consistency Definitions
    4.5.2. Type Consistency Specification
    4.5.3. Type Consistency Example
  4.6. Schema Completeness
    4.6.1. Definitions
    4.6.2. Schema Completeness Specification
    4.6.3. Example
  4.7. Final Considerations
Chapter 5. Practical Experiments
  5.1. Introduction
  5.2. The Integra System
  5.3. IQ Analysis Space
  5.4. Implementation
    5.4.1. Evaluate Schema Minimality
    5.4.2. Improve Minimality
    5.4.3. Evaluate Schema Type Consistency
  5.5. Practical Experimentation
    5.5.1. Experiment Definitions
    5.5.2. Minimality Results
    5.5.3. Type Consistency Results
  5.6. Final Considerations
Chapter 6. Conclusion
  6.1. Thesis Overview
  6.2. Research Contribution
  6.3. Future Works
Bibliography

List of Figures

Figure 2.1 – Data warehouse architecture for data integration [Batista 2003]
Figure 2.2 – Mediator architecture for data integration [Batista 2003]
Figure 3.2 – Data integration components and IQ criteria for data integration purposes
Figure 3.3 – Workflow representation of a data integration system [Peralta et al. 2004]
Figure 4.1 – Integrated schema (Smed) and data source schemas (S1 and S2)
Figure 4.2 – Schema with redundant elements
Figure 4.3 – Calculation of schema minimality score
Figure 4.4 – Redundant entity elimination
Figure 4.5 – Schema after redundant entity elimination
Figure 4.6 – Redundant relationship detection
Figure 4.7 – Redundant relationship elimination
Figure 4.8 – Redundant attribute detection
Figure 4.9 – Integrated schema (Sm) and data source schemas (S1 and S2)
Figure 4.10 – Data source schemas and schema mappings in Ð
Figure 4.11 – Integrated schema Sm
Figure 4.12 – Data source schema S1
Figure 4.13 – Data source schema S2
Figure 4.14 – Data source schema S3
Figure 5.1 – Integra System Architecture
Figure 5.2 – Schema maintenance in the Integra system
Figure 5.3 – Query execution in the Integra system
Figure 5.4 – Use cases diagram
Figure 5.5 – IQ Manager class diagram
Figure 5.6 – Sequence diagram for use case UC01 – Evaluate Schema Minimality
Figure 5.7 – Sequence diagram for use case UC02 – Improve Minimality
Figure 5.8 – Sequence diagram for use case UC03 – Evaluate Schema Type Consistency
Figure 5.9 – Schema of a public hospital data source (S1)
Figure 5.10 – Schema of a telemedicine data source (S2)
Figure 5.11 – Redundant integrated schema (Sm)
Figure 5.12 – Minimal integrated schema (Sm)
Figure 5.13 – Execution times of a three entities join (UQ1)
Figure 5.14 – Execution times of a merge (UQ2)
Figure 5.15 – Execution times of projection (UQ3)
Figure 5.16 – Execution times of a join with refer relationship (UQ4)
Figure 5.17 – Summary of execution times of UQ in the minimality experiment
Figure 5.18 – X-Entity of S1 with attributes data types
Figure 5.19 – X-Entity of S2 with attributes data types
Figure 5.20 – X-Entity of Sm with attributes data types
Figure 5.21 – Execution times of a simple selection user query (UQ5)
Figure 5.22 – Execution times of a selection with condition (UQ6)
Figure 5.23 – Execution times of a nested join (UQ7)
Figure 5.24 – Execution times of a join (UQ8)
Figure 5.25 – Summary of execution times of UQ in the type consistency experiment

List of Tables

Table 3.4 – Criteria in the different proposals (extended from [Scannapieco 2005])
Table 3.4 – Classification of IQ criteria [Naumann & Rolker 2000]
Table 3.5 – Data integration processes and IQ criteria
Table 4.1 – Schema mappings between the integrated schema Smed and schemas S1 and S2
Table 4.2 – Data integration IQ criteria classification
Table 4.3 – Data integration IQ Criteria assessment
Table 4.4 – Schema improvement algorithm
Table 4.5 – Schema mappings between the integrated schema Sm and schemas S1 and S2
Table 4.6 – Example of attributes data types
Table 4.7 – Schema mappings between the integrated schema Sm and the source schemas S1, S2 and S3
Table 5.1 – Schema mappings between the redundant schema Sm and the source schemas S1 and S2
Table 5.2 – Redundancy scores for entities in Sm
Table 5.3 – Schema mappings between the minimal schema Sm and the source schemas S1 and S2
Table 5.4 – Summary of query execution times
Table 5.5 – Summary of query execution times

Chapter 1 Introduction

This chapter introduces our work and our proposal for the analysis of Information Quality (IQ) in the schemas of a data integration system. In the rest of the chapter, we present the concepts of Information Quality and data integration, an overview of the motivation and objectives of this work, and a summary of the main contributions of the proposed approach. The chapter concludes with a summary of the remaining chapters of the document.

1.1 Information Quality

Information Quality (IQ) has become a critical aspect in organizations and, consequently, in Information Systems research [Ge & Helfert 2007, Stvilia et al. 2007]. The notion of IQ has emerged only in recent years and is attracting steadily increasing interest. IQ is multidimensional: it is based on a set of dimensions or criteria, each of which assesses and measures a specific IQ aspect [Wang & Strong 1996, Tayi & Ballou 1998, Angeles & Mackinnon 2005, Ballou & Pazer 1985, Wang 1998, Jarke & Vassiliou 1997, Lee et al. 2002]. All these IQ works assume that there exist some shared norms of quality, or quality expectations, and ways of measuring the extent to which those norms and expectations are met. For our purposes, however, we will use the general definition of IQ, ‘fitness for use’, which encompasses these aspects of quality.

Following Keeney & Raiffa [Keeney & Raiffa 1976], the measurability of an IQ dimension is defined as the ability to assess the variation along a dimension within a reasonable cost. Measuring is defined as the process of mapping the attribute-level distributions of real-world entities to score values in an objective and systematic way. Accordingly, a measure is defined as a relation associating the attribute-level distributions of real-world entities or processes with numbers. We define a measurement as a score value characterizing a particular IQ attribute or criterion in an objective way.

There is a discussion regarding the two concepts of Data Quality and Information Quality. Information Quality (IQ) is a term describing the quality of any element or content of information systems [Wang & Strong 1996], not only the data. Information Quality assurance is the certainty that particular information meets some quality requirements. This leads to a service-based perspective of quality, which focuses on the information consumer’s response to his or her task-based interactions with the information system. The use of the term information rather than data implies that the use and delivery of the data must be considered in any quality judgement, i.e., the quality of delivered data represents its value to information consumers [Price and Shanks 2004]. We define Information Quality as a set of criteria indicating the overall quality degree associated with the information in the system [Pipino et al. 2002]. The term Data Quality is similar to the accuracy IQ criterion, i.e., it covers only one characteristic or aspect of the broader Information Quality concept [English 1999].

1.2 Quality in Data Integration Systems

Data integration is the process of extracting and merging data from multiple heterogeneous sources [Batini et al. 1986]. Resolving structural, syntactic and semantic heterogeneities between source and target data has been a complex problem for data integration systems for a number of years [Sheth & Larson 1990, Batini et al. 1986, MacKinnon et al. 1998, Williams et al. 2000].
One solution to this problem is a single global schema that represents the integrated information, with mappings from the global schema to the local schemas; each query to the global schema is translated into queries to the local databases using these mappings [Batini et al. 1986]. Thus, a data integration system provides users with a unified view of several data sources, called the integrated schema. In this kind of system, where data is spread over multiple, distributed and heterogeneous sources, query execution is an essential feature.

The propagation of low-quality data is a real problem in data integration and, in some cases, the integration step cannot be executed unless IQ problems are fixed [Bouzeghoub & Lenzerini 2001, Naumann 2001, Gertz et al. 2004]. In data integration systems it is reasonable to regard data quality as the overall quality of the results generated for a user query, while IQ can also cover, for example, schema quality aspects. We treat Data Quality, or the accuracy IQ criterion, as a component of the broader Information Quality concept.

1.3 Motivation

IQ evaluation in data integration systems is based on methods that investigate the quality scores of query answers and query processing. Generally, the multiple IQ dimensions or factors describe certain properties of the information that flows through the data integration system and of the processes that generate this information. Consequently, IQ evaluation consists of calculating several quality attributes, each one describing a specific quality aspect of the data.

When a user submits a query, he or she is implicitly demanding certain levels of completeness, consistency and confidence in the results. These characteristics may in turn be associated with a good response time in obtaining the query answer. The quality of the delivered data depends on the contents of the data sources and on the query processing. As the data sources may be numerous, distributed and autonomous, query processing can suffer from prohibitive response times. Additionally, information may be of poor quality because it does not reflect real-world conditions or because it is not easily used and understood by users.

There are some works on IQ issues in the context of data integration scenarios, in particular on the use of IQ in query formulation, processing (mediation) and optimization [Gertz 1998, Mihaila et al. 2000, Naumann & Leser 1999, Pernici & Scannapieco 2002, Scannapieco & Catarci 2003]. Important aspects investigated in data integration are how to assess and measure the quality of data derived from different, heterogeneous and dynamic sources, and how to represent the IQ dimensions to users and applications. However, we did not find specific works dealing with data integration schemas and IQ measurements. Thus, we believe that the effective use of IQ resources and the analysis of the components of data integration systems is a problem to be addressed, and that the presented approach will improve query processing time, as we confirm throughout this work.

1.4 Objectives

Based on the problems previously discussed, we state that data integration systems are almost always at risk of generating poor answers, even incorrect ones, to user queries. Thus, our main goal is to improve the quality of query execution; more specifically, we intend to decrease query processing time by applying IQ resources to the schemas of data integration systems. Our hypothesis is that an acceptable way to do this is to construct good schemas, with high quality scores, and we have based the presented approach on this hypothesis [Batista & Salgado 2007c].

We focused our work on developing IQ analysis mechanisms to address schema quality, especially for the integrated schema. We have compiled a list of IQ criteria related to several data integration aspects, but we focus on formally specifying the algorithms and definitions related only to schema IQ criteria: minimality [Batista & Salgado 2007a, Batista & Salgado 2008], schema completeness and type consistency [Batista & Salgado 2007b].

In database systems, the presence of redundant elements in a schema may be useful for minimizing performance bottlenecks [Heuser 1998]. In data integration systems, however, where there are a number of heterogeneous and distributed data sources and queries are processed over their data, we believe that redundant schema elements may put system performance at risk. Therefore, we based our minimality approach on minimizing the occurrence of redundant elements in the integrated schema.
Thus, we also defined an algorithm to perform schema minimality improvements through detecting and eliminating the redundant elements. and. transforming a redundant schema into a minimal one [Batista & Salgado 2007c], in order to achieve better query execution times.. -4-.
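As a rough illustration of what detecting and eliminating redundant elements can look like, the following Python sketch treats an integrated schema as a list of element names and takes a hypothetical set of equivalence pairs as input. The function names, the equivalence input and the score (one minus the fraction of redundant elements) are assumptions made for this example only; they are not the formal definitions or the algorithm specified later in this thesis.

```python
def redundant_elements(elements, equivalences):
    """Return the elements that duplicate an earlier, equivalent element.

    `equivalences` is a hypothetical input: pairs of element names that are
    assumed to carry the same information.
    """
    # Each element starts as its own representative.
    representative = {e: e for e in elements}

    def find(e):
        # Follow the chain until reaching an element that represents itself.
        while representative[e] != e:
            e = representative[e]
        return e

    for a, b in equivalences:
        # Ignore pairs that mention elements absent from this schema.
        if a not in representative or b not in representative:
            continue
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the element that appears first in the schema.
            first, second = sorted((ra, rb), key=elements.index)
            representative[second] = first

    return [e for e in elements if find(e) != e]


def minimality_score(elements, equivalences):
    """Illustrative score in [0, 1]; 1.0 means no redundant element was found."""
    if not elements:
        return 1.0
    redundant = redundant_elements(elements, equivalences)
    return 1.0 - len(redundant) / len(elements)


def minimize(elements, equivalences):
    """Drop the redundant elements, keeping one representative per group."""
    redundant = set(redundant_elements(elements, equivalences))
    return [e for e in elements if e not in redundant]
```

For instance, with the elements `patient`, `client`, `doctor`, `physician` and `exam`, and the assumed equivalences (`patient`, `client`) and (`doctor`, `physician`), `minimize` keeps `patient`, `doctor` and `exam`, and the score of the resulting schema rises to 1.0.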

In summary, our main objective is to decrease the query processing time in a data integration system. We show that improving the scores of three aspects of an integrated schema (minimality, schema completeness and type consistency) accomplishes this objective, and we confirm this statement with the practical experiments executed with minimality and type consistency [Batista & Salgado 2008].

1.5 Summary of Contributions

The primary contribution of this thesis is the proposal of IQ criteria analysis in a data integration system, mainly related to the system schemas. We can summarize the contributions in the following topics:

i. The compilation of a set of criteria that specifically address IQ problems that may occur in data integration systems;
ii. The proposal of a new IQ criteria classification to address the main components of a data integration system;
iii. The formal specification of three relevant schema IQ criteria, i.e. minimality, schema completeness and type consistency, mainly considering the integrated schema in a mediator-based data integration system;
iv. The analysis of the system's schemas according to the specified IQ criteria. We present specifically the IQ analysis of the integrated schema and an algorithm for minimality improvement;
v. Practical experiments whose results showed that a query submitted over minimal (or with high minimality scores) and consistent (or with high type consistency scores) schemas performs better than one submitted over redundant schemas and schemas with inconsistent data types.

The proposed approach was experimentally validated through the specification, implementation and testing of a software module, the IQ Manager, as part of a data integration system with a real health care application.

1.6 Document Plan

The rest of this document is organized in the following chapters:

• Chapter 2 discusses the Information Quality aspects in many relevant contexts (the users' perspective, query processing, data warehouses and schemas) and makes a comparative analysis of some of the IQ approaches;

• Chapter 3 is dedicated to IQ criteria issues related to data integration. The approaches of data integration systems are discussed and a set of IQ criteria is detailed. We present a new IQ criteria classification oriented to data integration. Then, we define some components of a data integration system and associate each one with a group of the defined data integration IQ criteria, to make the quality evaluation of the component possible. At the end of this chapter we compile the most recent relevant works in this area;

• Chapter 4 presents our IQ approach. Initially, the adopted schema representation is detailed, and then we present the final data integration IQ criteria classification and the specifications of the following schema IQ criteria: minimality, type consistency and schema completeness. We also present an algorithm to improve minimality scores;

• Chapter 5 presents the practical experiments executed to validate the proposed approach. Implementation details and the results obtained for query execution improvements are also presented in this chapter;

• Chapter 6 discusses our conclusions and final considerations about the topics discussed in this thesis and presents some suggestions for future research in IQ and data integration.

Chapter 2
Data Integration

In this chapter we describe the overall functioning, the main approaches and the architectures of data integration systems. The topics are organized as follows: section 2.1 introduces the chapter; section 2.2 explains the data integration approaches and their respective implementation architectures; section 2.3 presents some reflections about the data integration area.

2.1 Introduction

Data integration systems are tools that offer uniform access to distributed and heterogeneous web data sources. This is done by resolving the heterogeneities and giving the disparate sources an integrated view, i.e., a collection of views over the data sources reflecting the users' requirements. Users submit queries over the integrated view without having to spend a lot of time searching and browsing the web.

The main feature of a data integration system is to free the user from having to know about specific data sources and to interact with each one. Instead, the user submits queries to a global or integrated schema, which is a set of views, over a number of data sources, designed for a particular data integration application.

Commonly, the query processing tasks, which involve query submission, planning, decomposition and results integration, are performed by a software module called the mediator [Wiederhold 1992]. Each source publishes a data source schema with the representation of its contents. The mediator reformulates a user query into queries that refer directly to the schemas of the sources.

To successfully reformulate a query, the mediator uses a set of correspondences, called schema mappings. Thus, the main role of a data integration system is to answer queries that may require extracting and combining data from multiple web sources. These sources are heterogeneous, distributed and autonomous [Levy 2000, Florescu et al. 1998]. One of the most difficult tasks is handling data heterogeneities. Some of the most common heterogeneities are summarized in the following list:

• Data models: data from the data sources may be modeled with distinct and incompatible data models, such as relational, object-oriented, semi-structured, multidimensional, and so on;

• Schemata: a schema may restrict objects, relationships, attributes and types. Some data sources may have highly structured and restrictive schemas, while others may impose few restrictions on their data;

• Data values: values may be coded differently. For example, a product price may be stored in US$ in the hypothetical data source S1 and the same item may have its price coded in £ in the data source S2;

• Query language: some data sources, e.g. DBMS, may process query languages (SQL, OQL, XQuery, etc.), and other sources may have data stored in files with no structured access methods;

• DBMS or file systems: some sources have resources provided by database management systems, while others may not have any data management engine;

• Hardware and operating systems;

• Semantic heterogeneities concerning the meanings of an object in an application domain [Florescu et al. 1998].

Besides the above heterogeneities, other aspects of data integration are also relevant: the autonomy and the distribution of the data sources. The autonomy is due to the fact that the data sources are independent and may modify their data and schema without regard to the data integration system. The distribution is a consequence of the topology of the network or Web.

The next section discusses the classical data integration approaches and the main architectures that implement them.
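The data value heterogeneity listed above (a price stored in US$ in S1 and in £ in S2) can be illustrated with a small Python sketch that renames source attributes to those of an integrated schema and converts prices to a single currency. The attribute names, the mapping table and the exchange rate are hypothetical values chosen for the example; in a real system the correspondences would come from the schema mappings and the rate from an external service.

```python
from decimal import Decimal

# Hypothetical exchange rate, for illustration only.
GBP_TO_USD = Decimal("2.00")

# Hypothetical schema mappings: source attribute -> integrated attribute,
# plus the currency in which each source stores its prices.
SOURCE_MAPPINGS = {
    "S1": {
        "attributes": {"prod_name": "product", "price_usd": "price"},
        "currency": "USD",
    },
    "S2": {
        "attributes": {"item": "product", "cost": "price"},
        "currency": "GBP",
    },
}


def to_integrated(source, record):
    """Rename the attributes of a source record and express its price in US$."""
    mapping = SOURCE_MAPPINGS[source]
    result = {mapping["attributes"][name]: value for name, value in record.items()}
    price = Decimal(str(result["price"]))
    if mapping["currency"] == "GBP":
        price = price * GBP_TO_USD
    result["price"] = price
    return result
```

With these assumed mappings, a record from S1 and a record from S2 describing the same item end up with the same attribute names and comparable prices, which is the precondition for combining them in one answer.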

2.2 Data Integration Approaches

Several research works [Baru et al. 1999, Chawathe et al. 1994, Ambite et al. 1998, Levy 2000, Ashish et al. 1999] consider the issues of data integration. Classical data integration systems can be classified according to the adopted approach: the virtual approach [Chawathe et al. 1994] and the materialized approach [Labio et al. 1997].

In the virtual approach, user queries are processed on demand, i.e. data remains in the sources and queries submitted to the data integration system are decomposed at run time into queries addressed directly to the sources. In the materialized approach, data are accessed, cleaned, integrated and stored in advance in a previously built repository [Widom 1995]. The queries submitted to the integration system are evaluated in this repository without direct access to the data sources.

Each of the approaches, virtual and materialized, is well suited to different situations. The virtual approach, for example, is better suited to cope with dynamic information sources, since data is guaranteed to be fresh at query time. On the other hand, it is disadvantageous with respect to the following factors: (i) some data sources may be unavailable; and (ii) the query response time is often very high, mainly because a large number of data sources may be accessed over the network to answer most queries.

The main disadvantage of the materialized approach is the problem of maintaining the consistency of the replicated data stored in the data warehouse. On the other hand, this approach offers a number of advantages in the ease and effectiveness of access, since the data are stored in a repository. This approach is well suited when:

• Specific and predictable queries are submitted to the data integration system;

• High performance is necessary, regardless of the timeliness of the information in the repository;

• The users need to derive information from the data in the data sources, such as historical and aggregated information.

The materialized approach must not be adopted when the query results must be as fresh as possible. The decision of which approach to use depends on the kind of application that will adopt it. Thus, for complex and large applications, it may be

necessary to combine resources from both approaches. Some information may be obtained from the sources in advance and loaded into a data warehouse, while other information is obtained only by on-demand queries.

Depending on the approach used, data integration systems may be built with the following architectures:

• The mediator architecture implements the virtual approach [Chawathe et al. 1994, Baru et al. 1999], where a software module, called the mediator, receives the user query and decomposes it into sub-queries over the data sources [Wiederhold 1992]. The results of these queries on the local data sources are translated, filtered and merged, and then the final answer is returned either to the user or to the application.

• The data warehouse architecture implements the materialized approach, where the data are accessed, cleaned, integrated and stored in advance in a data warehouse [Widom 1995, Gupta & Mumick 1995] and all queries are applied to the stored data, without access to the original data sources.

In order to minimize the impact of the most common problems presented by the mentioned approaches, the portions of data that are more frequently unavailable or more static may be materialized in a data warehouse, while the more dynamic data are accessed by virtual queries. These features characterize a hybrid data integration architecture [Batista et al. 2003].

2.2.1 Data Warehouse Architecture

The materialized approach to data integration is implemented by data warehousing systems. The data warehouse is the materialized view of the data sources, and the user queries are submitted and processed in this repository. Usually a data warehousing system is implemented in a relational DBMS. In the relational model, a view is a relation, or a query, derived from others. A view can be materialized by storing its tuples in a database.

Rundensteiner [Rundensteiner et al. 2000] shows that, in contrast to the on-demand approach to information integration, the approach of tailored information repository construction, commonly referred to as data warehousing, is characterized by the following properties:

• At setup time, relevant information is extracted from different information sources on the network, transformed and cleaned as necessary, merged with information from other sources, and then loaded into the data warehouse;

• During query processing time, the queries posed against the system are directly evaluated in the data warehouse without further interaction with the original sources;

• During operation time, the modifications of the sources may be filtered for relevance and then propagated in some manner to update the data warehouse.

Figure 2.1 depicts the data warehouse architecture for a data integration system. The bottom of the figure shows the data sources. These sources may include nontraditional data such as flat files, HTML, XML and SGML documents, knowledge bases, legacy systems, and so on. A wrapper/monitor is connected to each data source. The wrapper component of this module is responsible for translating information from the native format of the source into the format and data model used by the warehousing system, while the monitor component is responsible for automatically detecting changes of interest in the source data and reporting them to the integrator.

[Figure: User Query / Results; Data Warehouse; Integrator; one Wrapper/Monitor per source; Source 1, Source 2, ..., Source n]

Figure 2.1 – Data warehouse architecture for data integration [Batista 2003].
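The division of labour between the wrapper, the monitor and the integrator in Figure 2.1 can be sketched in Python as follows. This is a toy model under stated assumptions: each source is an in-memory dictionary keyed by a record identifier, the monitor detects changes by comparing snapshots, and all class and method names are invented for the illustration.

```python
class Wrapper:
    """Translates rows from the native source format into the warehouse format."""

    def __init__(self, field_map):
        # Hypothetical mapping: native field name -> warehouse field name.
        self.field_map = field_map

    def translate(self, row):
        return {self.field_map[k]: v for k, v in row.items() if k in self.field_map}


class Monitor:
    """Detects changes of interest by comparing the source with its last snapshot."""

    def __init__(self):
        self.snapshot = {}

    def detect_changes(self, current):
        # New or modified records since the previous call.
        upserts = {k: v for k, v in current.items() if self.snapshot.get(k) != v}
        # Records that disappeared from the source.
        deletes = [k for k in self.snapshot if k not in current]
        self.snapshot = {k: dict(v) for k, v in current.items()}
        return upserts, deletes


class Integrator:
    """Installs the reported changes in the warehouse."""

    def __init__(self):
        self.warehouse = {}

    def install(self, source_name, wrapper, upserts, deletes):
        for key, row in upserts.items():
            self.warehouse[(source_name, key)] = wrapper.translate(row)
        for key in deletes:
            self.warehouse.pop((source_name, key), None)
```

In this sketch only the differences reported by the monitor reach the warehouse, so a second call to `detect_changes` over an unchanged source produces no work for the integrator.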

When a new information source is attached to the warehousing system, or when relevant information at a source changes, the new or modified data is propagated to the integrator. The integrator is responsible for installing the information in the warehouse, which may include filtering the information, summarizing it, or merging it with information from other sources. In order to properly integrate new information into the warehouse, it may be necessary for the integrator to obtain further information from the same or different data sources.

An important issue is the prompt and correct propagation of updates at the sources to the views at the warehouse. Thus, a relevant problem in data warehousing systems is the maintenance of data warehouse consistency [Labio et al. 1997, Gupta & Mumick 1995]. The data sources are independent and autonomous, and their data may be updated without informing the data integration system. This can lead the data warehouse to an inconsistent state.

Usually, there are two approaches to handle the consistency maintenance problem. The first is to ignore the change detection issue altogether and simply propagate entire copies of relevant data from the sources to the warehouse. The integrator can combine this data with existing warehouse data from other sources, or it can request complete information from all sources and recompute the warehouse data from scratch. Ignoring change detection may be acceptable in certain scenarios, for example when it is not important for the warehouse data to be current and it is acceptable for the warehouse to be offline occasionally. The second approach, in which changes at the sources are detected and propagated to the warehouse, can be used if currency, efficiency, and continuous access are required.

The data warehouse architecture for data integration presents a relevant issue: the system efficiency decreases when, most of the time, the users need the information to be fresh and up to date. On the other hand, this approach is well suited to the following situations: (i) the queries submitted by users do not vary and may be defined in advance; (ii) an adequate performance must be guaranteed at query time; and (iii) some queries require typical data warehouse information, e.g. historical and aggregated values.

2.2.2 Mediators Architecture

According to one of the earliest and most cited works in the data integration area [Wiederhold 1992], mediators are software components that exploit the knowledge

embedded at the data set level, in order to generate and provide knowledge to a higher-level application. Data integration systems may implement the virtual approach in a mediator architecture, as in Figure 2.2:

[Figure: User Query / Results; Mediator with Global View and Sources Description; four Wrappers over Flat Files, HTML, Relational Database and OO Database]

Figure 2.2 – Mediator architecture for data integration [Batista 2003].

The mediator offers a global and integrated view over the data sources and provides a schema for this view. The mediator uses the source descriptions to know where the data is stored. As Figure 2.2 shows, the mediator serves as a middle layer that provides data access and data integration to a user application, so that the user application does not need to distinguish among the data sources. Instead, the user application perceives a uniform or global view provided by the mediator, called the integrated schema.

When a mediator receives a user query, it decomposes the query into sub-queries (if necessary) and forwards the sub-queries to the correct wrapper(s). A wrapper provides the mapping from the mediator's integrated common model to its specific data source model. A wrapper receives queries from a mediator and translates them into the source-specific query language and terminology. The query results are returned from the wrapper(s) to the mediator. The mediator then integrates all the results and returns a single response to the user application.

An integrated schema is a set of views, which are designed for a particular application. The views over the integrated schema are not actually stored anywhere. As a consequence, the data integration system must first reformulate a user query into a query that refers directly to the schemas of the sources. In order to perform the

reformulation step, the data integration system requires a set of source descriptions. A description of an information source specifies the contents of the source (e.g., data source S1 contains information about movies and actors), the attributes that can be found in the source (e.g., genre, cast), constraints on the contents of the source (e.g., data source S2 contains only American movies), and the query processing capabilities of the source (e.g., it can only perform selections, or it can answer arbitrary SQL queries).

In the mediator architecture for data integration, the data remains in the data sources, and queries to the data integration system are decomposed at run time into appropriate queries to the sources. In this approach, data is not replicated, and it is guaranteed to be fresh at query time. On the other hand, because the data sources are autonomous, more sophisticated query optimization and execution methods are needed to guarantee adequate performance. The mediator architecture is more appropriate for building systems where the data changes frequently and there is little control over a large number of data sources.

2.2.3 Hybrid Architecture

At first sight, one may state that the best alternative for a data integration environment is the virtual approach. In this case, at least, the systems guarantee that the query results are always fresh. However, some problems may arise: for example, some data sources may become unavailable, and the execution time of a virtual query may become prohibitive. The implementation of data materialization together with a mediator architecture can be used to minimize these problems. This is called a hybrid architecture for data integration systems.

In data materialization applications, the data are materialized in terms of views obtained from the data sources. A view is a derived relation defined in terms of base relations. Thus, a view defines a function from a set of base tables to a derived table, and this function is recomputed every time the view is referenced [Gupta & Mumick 1995]. A view can be materialized by storing the tuples of the view in a repository. Consequently, accesses to the materialized view can be much faster than recomputing the view. The amount of data in a materialized view can be large, and care must be taken with the physical design of the view storage, including, for example,

fragmentation mechanisms. A materialized view is thus like a cache: a copy of the data that can be quickly accessed. Like a cache, a materialized view provides faster access to the data. The speed difference may be critical in applications where the query rate is high and the views are complex, so that it is not possible to recompute the view for each query. Materialized views can contribute to the optimization of query execution and, consequently, they are very useful in data integration systems.

The use of materialized views has been recognized as an important technique for data integration through the concept of integration views [Gupta & Mumick 1995]. Integration views are defined over the data of multiple data sources. Usually, in a data integration system with a hybrid approach, a critical portion of the data may be materialized in a local data warehouse and the rest of the data is accessed on demand directly from the data sources.

In this work we adopted a data integration system with a hybrid architecture as a case study and experimentation environment. It is detailed in Chapter 5 together with the results of the practical experiments.

2.3 Final Considerations

In this chapter we discussed the importance, main characteristics and issues of the data integration environments that are the object of study of the proposed IQ approach. Data integration systems commonly follow two classical approaches: virtual and materialized. Each one has its advantages and disadvantages. In the virtual approach, the user query is reformulated into sub-queries that address the data sources directly. This approach is adequate when the user needs the query answers to be as fresh as possible, even at the expense of response time. The materialized approach provides good response times for the queries, although the answers may not be as fresh, since the queries are processed in a local repository and not in the original data sources.

The implementation of data materialization together with a mediator architecture can be used to minimize these problems. This is called a hybrid architecture for data integration systems. The data integration environment that we use to validate our IQ approach is built on the hybrid architecture and is presented in Chapter 5. The next chapter discusses the main concepts and issues related to Information Quality.
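To make the flow of a hybrid architecture concrete, the Python sketch below shows a mediator that forwards a sub-query to one wrapper per source and falls back to a materialized copy when a source is unavailable. It is a simplified illustration, not the system described in Chapter 5: the class and function names, the representation of wrappers as callables and the use of `ConnectionError` to signal an unavailable source are all assumptions.

```python
class HybridMediator:
    """Answers a query virtually, using materialized data only as a fallback."""

    def __init__(self, wrappers, materialized=None):
        # Hypothetical structures:
        #   wrappers:     source name -> callable(predicate) -> list of rows
        #   materialized: source name -> list of rows stored locally
        self.wrappers = wrappers
        self.materialized = materialized or {}

    def query(self, predicate):
        results = []
        for name, wrapper in self.wrappers.items():
            try:
                # Virtual access: the sub-query is sent to the source.
                rows = wrapper(predicate)
            except ConnectionError:
                # Materialized access: evaluate over the local copy, if any.
                local_rows = self.materialized.get(name, [])
                rows = [row for row in local_rows if predicate(row)]
            results.extend(rows)
        return results


def make_wrapper(rows, available=True):
    """Build a toy wrapper over an in-memory source."""

    def wrapper(predicate):
        if not available:
            raise ConnectionError("source unavailable")
        return [row for row in rows if predicate(row)]

    return wrapper
```

The design choice illustrated here is the one discussed in section 2.2.3: fresh data is preferred whenever the source answers, and the materialized portion only protects the query against unavailable sources. A system that also wants to reduce response time would consult the materialized data first for the portions known to be static.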

Chapter 3
Information Quality for Data Integration Systems

We have already mentioned that any system or study of IQ should define and/or deal with a set of dimensions or criteria. The role of each dimension is to assess and measure a specific IQ aspect. This chapter discusses the criteria that we believe are important for the IQ analysis of a data integration system.

Usually, the overall IQ concept is associated with the expression “fitness for use” and its usage is related to a set of quality criteria [Wang et al. 1996, Ge & Helfert 2007, Gertz et al. 2004]. These criteria represent the aspects that must be analyzed and measured in order to determine the quality of a service or a product.

We start this chapter by discussing the IQ characteristics that are relevant and related to the presented approach. Some of the most relevant IQ approaches are described and their important points are discussed. In the rest of the chapter we define the set of IQ criteria adopted in our work, which we consider adequate to be attached to the elements composing a data integration system.

3.1 Introduction

As in [Pipino et al. 2002], we use the term Information Quality or IQ as a multicriteria definition that indicates the overall quality degree obtained for any information in the system. This chapter presents some issues related to Information Quality that are important to the development of the proposed approach.

The topics are organized as follows: section 3.2 discusses some scenarios that are suitable for IQ analysis; in section 3.3 we make a comparative analysis of several relevant IQ approaches; in section 3.4 we compile the IQ criteria to be used for data integration purposes; section 3.5 details the data integration scenarios in which IQ analysis may be important; in section 3.6 we describe related works concerning IQ in data integration; and section 3.7 presents some reflections about the chapter contents.

3.2 Scenarios for Information Quality Analysis

To better characterize IQ aspects or dimensions, it is important to recognize that IQ must be studied in a number of different contexts, not just in a single application. This section discusses some relevant scenarios in which it is useful to apply IQ aspects [Gertz et al. 2004, Naumann 2001, Ge & Helfert 2007].

The management of data typically relies on complex underlying processes. Therefore, it is more suitable to verify IQ aspects for the entire data management process, which can be seen as having three components: (1) data producers, (2) data custodians (entities that provide and manage resources for processing and storing data), and (3) data consumers [Gertz et al. 2004]. Complex information system infrastructures comprise many producers, custodians, and consumers. In such environments, the analysis and characterization of IQ aspects naturally become more difficult because of the complex data processes underlying such systems. Thus, in terms of IQ analysis, some scenarios are relevant and, consequently, should be taken into account [Gertz et al. 2004].

Some aspects may be observed when analyzing the IQ of information systems and, in our case, of data integration systems:

(1) IQ assessment: it is related to the measurement of IQ criteria values. Some works, like [Naumann & Rolker 2000] and [Pipino et al. 2002], address the problem of IQ assessment.

(2) IQ metadata: ideally, data should have associated metadata describing their main properties. In this context, the notion of data lineage (or data provenance) is of particular interest. The term data provenance is used to refer to a description of the

data origin and the process by which they are included in a database [Buneman et al. 2000].

(3) IQ life-cycle: in this context, the particular interest is the development of models that provide some IQ measurements and mechanisms to improve the quality of information through feedback. Formalizing and modeling such IQ life-cycles for several application scenarios might help to better communicate IQ requirements and issues among producers, custodians, and consumers.

(4) The usage of IQ aspects at the consumer side: this issue is related to providing applications with facilities to explicitly formulate IQ requirements. For example, a query processing system may have resources for the users to indicate their query answer preferences in terms of some IQ aspects.

(5) IQ improvement: it is important to identify appropriate means to improve the quality of information.

Most of the works focusing on IQ have dealt mainly with general application settings, primarily in the context of information management systems, Web-based information systems, or data integration. Several studies [Angeles & Mackinnon 2005, Gertz et al. 2004, Naumann & Leser 1999] have shown that the concern with IQ aspects is an important issue for integrated information systems.

We can say that data integration systems are collections of structured, semi-structured, or unstructured information items. Usually, the information is gathered from multiple, possibly autonomous information sources. Users of such an information system have no control over the information it provides. In information systems, the activities involved in IQ reasoning are even more complex than in database systems. If the system is running in the Web environment, the complexity is even greater. According to [Naumann 2001], there are some scenarios where data usage can be enhanced through IQ reasoning:

• Data integration: IQ reasoning may enhance the integration of incoming query results in the following ways: (i) conflict resolution and (ii) ranking query results by quality. A data conflict may occur when two sources report different data values about the same real-world object. Resolution functions may be employed

to resolve these conflicts by deciding which value is the final result. Quality-dependent resolution functions enhance the query result by favoring high quality information over low quality information. The presentation of the final integrated results also profits from IQ reasoning. Quality scores may be determined for a source, a part of a query plan, or an entire plan, and they may represent the quality of the data generated by the query. This information can be used to rank the query results.

• Data mining: data mining techniques are sensitive to data of poor quality [Lee et al. 1999]. Commonly, any data mining method is preceded by a data cleaning technique to improve quality [Pyle 1999]. Some aspects of IQ are important for data mining: the completeness of the data, and the reputation and objectivity of the source.

As we discuss later in this work, we focused our research on IQ evaluation in data integration systems.

3.3 Comparative Analysis of IQ Approaches

Some of the IQ approaches presented in this chapter were aggregated and compared in the study discussed in [Scannapieco & Catarci 2003] and [Scannapieco 2005]. In this section, we describe this comparative study of some of the relevant IQ criteria classifications presented in the previous sections.

Scannapieco [Scannapieco & Catarci 2003, Scannapieco 2005] focused specifically on the IQ definitions created in the computer science field in recent years. The study gives background on several different definitions, outlines some correspondences among them, and suggests a classification of the different proposals. The authors based this study on the following assertions:

• In the literature, there is no agreement on the set of dimensions characterizing IQ. Many proposals have been made, but none has emerged as a standard.

• Even if some dimensions are universally considered important, there is no agreement on their meanings. In different proposals, the same name is often used to indicate semantically different things (as well as different names are

(33) used for the same thing). The authors try to approximate some of the similar criteria into a unique definition. They selected a number of approaches to be compared, each one is representative in a specific context. Some of them are summarized in the following, specifically the ones not discussed in the previous sections. In these cases, we have indicated their sections in this document: •. WandWang96 [Wand & Wang 1996]: IQ criteria are defined by considering mapping functions from the real world to an information system. For example, inaccuracy means that the information system represents a real world state different from the one of real world. The completeness criterion is defined as a missing mapping from real world states to the information system states. Five dimensions are proposed: accuracy, completeness, consistency, timeliness, and reliability.. •. WangStrong96 [Wang & Strong 1996]. Wang and Strong in have conceived one of the first set of structured and classified IQ dimensions which has been a strong reference for most of the studies in IQ area. They empirically identified fifteen IQ criteria. To achieve this, various IQ attributes were analyzed from the perspective of a set of users. An empirical approach analyzed the information collected from the users and determined the characteristics of useful data for their tasks. The aspects were grouped into four broad information quality classes: intrinsic, contextual, representational, and accessibility. Intrinsic data quality denotes the quality of data itself. Contextual data quality highlights the requirement that data quality must be considered within the context of a task at hand, i.e., data must be relevant, timely, complete and appropriate in terms of amount. The Representational data quality category is related to the format and the meaning of data. Accessibility defines if data are available or obtainable for the user.. •. Redman96 [Redman 1996]. 
The proposal groups data quality dimensions into three categories, corresponding to the conceptual view of data, the data values and the data format respectively. Five dimensions are proposed for the

conceptual view, four dimensions for the data values and eight dimensions for the data format.

• Jarke99 [Jarke & Vassiliou 1997, Jarke et al. 1999]. This work handles the problem of quality in data warehousing. The objective is to establish foundations of data warehouse quality by linking semantic models of the data warehouse architecture to explicit models of data quality. To achieve this, a general multi-tier DW architecture modeling framework was produced, with three levels: source, data warehouse and client. Between the levels, there are views or schema mappings from one level to another. Some quality criteria support this DW architecture, and they are categorized according to the stakeholders that are typically interested in them. There are different user roles in a data warehouse environment. Based on the argument that different user roles imply different collections of quality dimensions, the authors compiled a set of IQ criteria to support .

• Bovee01 [Bovee et al. 2001]. Following the concept of IQ as “fitness for use”, the proposal includes four dimensions (with some sub-dimensions). Data are “fit for use” whenever a user: 1) is able to get the information (Accessibility); 2) is able to understand it (Interpretability); 3) finds it applicable to a specific domain and purpose of interest (Relevance); 4) believes it to be credible (Credibility).

• Naumann02 [Naumann 2002, Naumann & Leser 1999]. Naumann defines an IQ framework that directly addresses query processing in a data integration system with a mediator-based architecture. This work proposes interleaving query planning with quality considerations. The mediator knows the mappings from the source schemas to the integrated schema. This knowledge is encoded in sets of query correspondence assertions (QCAs), which are equations between relations of a source and relations of the mediator’s global schema.

The approach distinguishes three classes of quality aspects, each treated differently: source-specific criteria, which determine the overall quality of a data source; QCA-specific criteria, which determine quality aspects of specific queries that are computable by a source; and user query-specific criteria, which denote the users’ preferences.

• Herden01 [Herden 2001]. In this work, quality assurance is carried out across a three-level design methodology: conceptual modeling, logical database design and physical database design. The author states that, to meet the users’ requirements and to build a maintainable system, a quality assurance process in the form of an explicit review of the conceptual database schema is necessary. Thus, given a set of schema quality criteria, the schema must be manually reviewed by a specialist against each of the criteria. The criteria and the respective reviews are presented in the next section. We added this approach to the original comparative study [Scannapieco & Catarci 2003] because of the importance of schema IQ criteria to our work and to extend the original comparative discussion to schema issues.

Table 3.1 summarizes the criteria presented in the mentioned approaches. Only consistency and completeness are dimensions defined in all proposals. Besides these two specific dimensions, consistency-related dimensions and time-related dimensions are also taken into account by all proposals, except the schema-related approach defined by Herden [Herden 2001]. Specifically, consistency is typically considered at the instance level (consistency dimension) or at the format level (representational consistency). Time-related quality aspects are mainly represented by the timeliness criterion. Interpretability is also considered by most of the proposals, both at the format and the schema level. Each of the remaining dimensions is included by only a minority of proposals.
In some cases there is complete disagreement on the definition of a specific dimension.

Table 3.1 – Criteria in the different proposals (extended from [Scannapieco 2005])

Proposals (columns): WandWang 1996, WangStrong 1996, Redman 1996, Jarke 1999, Bovee 2001, Herden 2001, Naumann 2002.
Criteria (rows): Accessibility, Accuracy, Amount of Data, Availability, Believability, Completeness, Concise Representation, Consistency, Correctness, Credibility, Interpretability, Level of detail, Minimality, Objectivity, Price, Portability, Relevance, Reliability, Reputation, Response Time, Scope, Security, Time-related aspects, Traceability, Understandability, Value-added.

3.4 Relevant IQ Criteria for Data Integration

As part of our approach, we define a three-class classification of the IQ criteria, with the aspects we consider applicable to data integration processes. Table 3.2 shows the complete set of IQ criteria in each class. This classification is based on the criteria sets of [Ballou et al. 1985, Naumann & Rolker 2000]. These criteria form the basis of our quality analysis and evaluation for data integration systems.

Table 3.2 – Classification of IQ criteria [Naumann & Rolker 2000]

Class               IQ Criteria
Subject Criteria    Reputation
Object Criteria     Schema Completeness, Data Completeness, Timeliness, Verifiability
Process Criteria    Accuracy, Availability, Response Time, Representational Consistency, Minimality

In the following, we detail each of the adopted criteria and the assessment methods that can be applied in their evaluation.

3.4.1 Reputation

Wang and Strong define reputation as “the extent to which data are trusted or highly regarded in terms of their source or content” [Wang & Strong 1996]. It is the degree to which the information or its source is in high standing. Reputation increases with a higher level of awareness among the users. Thus, older, long-established information sources typically have a higher reputation. As an example, people tend to trust data from their own institute more than external data. Over time, information about the accuracy of different sources leads to a poor reputation for the less accurate ones. As a reputation for poor-quality data becomes common knowledge, those data sources are not expected to be requested in users’ queries. Reputation is highly subjective, and the integration system must take the users’ preferences into account. Scores for this criterion can be measured as a grade from 1 (bad reputation) to 10 (very good reputation) assigned by users [Naumann & Rolker 2000].

3.4.2 Completeness

Wang & Strong [Wang & Strong 1996] state that completeness is the degree to which data are of sufficient breadth, depth, and scope for the task at hand. Completeness is seen as a crucial IQ factor, and it is viewed as the ability of a set of information to represent reality with all the required descriptive elements. The completeness dimension can be viewed from many perspectives, leading to different metrics. Pipino et al. [Pipino et al. 2002] introduce different types of completeness metrics. At the most abstract level, there is the concept of schema completeness, which is the degree to which entities and attributes of the application domain are not missing from the schema. At the data level, they define column completeness as a function of the missing values in a column of a table [Scannapieco & Batini 2004].
Each of the types – schema and data completeness – can be measured by taking the ratio of the number of incomplete items to the total number of items and subtracting it from 1. In a data integration system, completeness can be measured as the percentage of the real-world objects modeled in the mediation schema that can be found in the

sources. Increasing completeness is always desirable. For example, suppose that a user query UQ1 must be decomposed into subqueries over the data sources S1 and S2. Querying only source S1 gives only one part of the result; the data source S2 provides another part. In this example, the more sources are accessed to compose the result of the original query, the more complete the result will be. Sometimes it is necessary to select the sources to be queried from among a number of candidate data sources. A completeness measure is needed to determine which sources are to be preferred over others.

One problem related to completeness in data integration systems is that of merging multiple tuples identified as being about the same object [Naumann & Haeussler 2002]. Regarding data about some object, sources can enforce each other, complement each other, or conflict. Sources enforce each other if they store the same data for the same attribute of an object. Sources complement each other if they store data about different attributes of the same object. Sources conflict if they store different values for the same attribute of the same object. Ensuring data completeness is, in effect, the problem of managing the merge operations that combine different source objects into a final merged object.

Naumann & Freytag [Naumann & Freytag 2000] divide the measure of data completeness into two aspects: (i) coverage, a measure of the number of tuples a source stores, and (ii) density, a measure of how well the attributes stored at a source are filled with up-to-date non-null values. In essence, the coverage measure describes how many objects an information source can provide; the density measure describes how many attributes the source can provide for each of those objects.
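The coverage and density notions can be illustrated with a minimal Python sketch. It is not the formulation of [Naumann & Freytag 2000]: the normalization of coverage by an assumed world size, the function names `coverage` and `density`, and the sample source `source_s1` are all hypothetical choices made for illustration. The "up-to-date" condition on density is not checked, since it would require timestamps that the example does not model.

```python
from typing import Mapping, Optional, Sequence

# A tuple stored at a source: attribute name -> value (None stands for a null).
Row = Mapping[str, Optional[object]]


def coverage(rows: Sequence[Row], world_size: int) -> float:
    """Share of the (assumed) world objects for which the source stores a tuple."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    return len(rows) / world_size


def density(rows: Sequence[Row], attributes: Sequence[str]) -> float:
    """Share of attribute slots that hold a non-null value."""
    slots = len(rows) * len(attributes)
    if slots == 0:
        return 0.0
    filled = sum(1 for row in rows for a in attributes if row.get(a) is not None)
    return filled / slots


# Hypothetical source S1 storing two tuples about books.
source_s1 = [
    {"title": "A", "year": 2001, "publisher": None},
    {"title": "B", "year": None, "publisher": None},
]
attrs = ["title", "year", "publisher"]

print(coverage(source_s1, world_size=8))  # 2 of 8 assumed objects -> 0.25
print(density(source_s1, attrs))          # 3 of 6 slots filled    -> 0.5
```

A source may score well on one aspect and poorly on the other, which is why the two are kept as separate measures in this sketch.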
In this work we adopt schema completeness as the number of entity types provided by the source with respect to the user requirements, and data completeness as the number of attribute values per entity returned by a data source.
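The ratio-based measure described in this section (one minus the ratio of incomplete items to total items) can be sketched as follows. This is one possible reading under stated assumptions, not the assessment procedure of this work: the function names, the set-based treatment of required and provided entity types, and the sample values are hypothetical.

```python
def completeness(incomplete_items: int, total_items: int) -> float:
    """1 minus the ratio of incomplete items to the total number of items."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    if not 0 <= incomplete_items <= total_items:
        raise ValueError("incomplete_items must lie between 0 and total_items")
    return 1 - incomplete_items / total_items


def schema_completeness(required: set[str], provided: set[str]) -> float:
    """Assumed reading: required entity types that the source fails to provide
    count as the incomplete items."""
    missing = required - provided
    return completeness(len(missing), len(required))


def column_completeness(values: list) -> float:
    """Null values in a column count as the incomplete items."""
    nulls = sum(1 for v in values if v is None)
    return completeness(nulls, len(values))


# Hypothetical user requirements and source schema.
required_entities = {"Book", "Author", "Publisher", "Review"}
source_entities = {"Book", "Author", "Journal"}

print(schema_completeness(required_entities, source_entities))  # 1 - 2/4 = 0.5
print(column_completeness([2001, None, 1999, 2005]))            # 1 - 1/4 = 0.75
```

Entity types that the source provides but the user does not require (here, `Journal`) do not affect the score in this sketch; they would matter for a criterion such as minimality rather than completeness.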