Evaluative Measures of Search Engines
Jitendra Nath Singh 1, Dr. S.K. Dwivedi 21
Research Scholar, Department of Computer Science, BBAU, Lucknow
2Assoc.Professor. Department of Computer Science, BBAU, Lucknow
ABSTRACT: The ability to search and retrieve information from the web efficiently and effectively is great challenge of search engine. Information retrieval on the Web is very different from retrieval in traditional indexed databases because it’s hyper-linked character, the heterogeneity of document types and authoring styles. Thus, since Web retrieval is substantially different from information retrieval, new or revised evaluative measures are required to assess retrieval performance using search engines. In this paper we suggested a number of evaluative measures to evaluate the effectiveness of search engines. The motivation behind each of these measures is presented, along with their descriptions and definitions.
Keywords: Information retrieval, Evaluation, Search Engines, Precision, Recall
1. INTRODUCTION
Web search engines were compared and evaluated in terms of their search capabilities and retrieval performance using sample queries drawn from real reference. The Web has rapidly gained popularity and become the second most widely used application of the Internet [3]. In order to overcome this difficulty in retrieving information from web, more than two dozen companies and institutions quickly developed various search engine [4] .However, since there are usually only one or two search engine for other Internet applications, why have at least two dozen search engines been developed for the Web so far. The sheer number invites research. For instance, what features do various web search engines offer? How do they differ from one another in performance? Is there a single Web search engine that out-performs all others in information retrieval? As a result, the web has become a sea of all kinds of data, making any query into the huge information reservoir extremely difficult.
1.1 Why Evaluate
Evaluation is the key to making progress in building better search engines. It is also essential to understanding if a search engine is being used effectively in a specific application. One of the primary distinctions made in the evaluation of search engines is between effectiveness and efficiency. Effectiveness, loosely speaking, measures [5] the ability of the search engine to find the right information and efficiency measures how quickly this is done. For a given query, and a specific definition of relevance, we can more precisely define effectiveness as a measure of how well the ranking produced by the search engine corresponds to a ranking based on user relevance judgments. Efficiency is defined in terms of the time and space requirements for the algorithm that produces the ranking. In this environment, effectiveness and efficiency will be affected by many factors such as the interface used to display search results and techniques such as query suggestion and relevance feedback. Carrying out this type of holistic evaluation of effectiveness and efficiency, while important, is very difficult because of the many factors that must be controlled. For this reason, evaluation is more typically done in tightly defined experimental settings and this is the type of evaluation we have to focus.
In addition to efficiency and effectiveness, the other significant consideration in search engine design is cost. We may know how to implement a particular search technique efficiently, but to do so may require a huge investment in processors, memory disk and networking. An important point about terminology is that “optimization” is frequently discussed in the context of evaluation. The retrieval and indexing techniques in a search engine have many parameters that can be adjusted to optimize performance, both in terms of effectiveness and efficiency. Typically the best values for these parameters are determined using training data and a cost function. Training data is a sample of the real data, and the cost function is the quantity based on the data that is being maximized (or minimized).
1.2 Evaluation Corpus
IJECSE,Volume1,Number 2
Jitendra Nath Singh and Dr. S.K. Dwivedi
ISSN-2277-1956/V1N2-603-607
is used for statistical analysis of various kinds. The test collection, or evaluation corpus, in information retrieval is unique in that the queries and relevance judgments for a particular search task are gathered in addition to the documents.
1.3 Logging
Query logs that capture user interactions with a search engine have become an extremely important resource for Web search engine development. From an evaluation perspective, these logs provide large amounts of data showing how users browse the results that a search engine provides for a query. In a general Web search application, the number of users and queries represented can number in the tens of millions. Compared to the hundreds of queries used in typical TREC collections, query log data can potentially support a much more extensive and realistic evaluation. The main drawback with this data is that it is not as precise as explicit relevance judgments. An additional concern is maintaining the privacy of the users. This is particularly an issue when query logs are shared, distributed for research, or used to construct user profiles. Various techniques can be used to anonymize the logged data, such as removing identifying information or queries that may contain personal data, although this can reduce the utility of the log for some purposes.
2. Effectiveness Metrics
Information retrieval systems have been evaluated and compared for many years. Cleverdon (1966) listed six criteria that could be used to evaluate an information retrieval system: (1) recall (2) precision, (3) coverage, (4) time lag, (5) presentation and (6) user effort. Of these criteria, recall and precision have most frequently been applied in measuring information retrieval. However, in spite of their deficiencies, recall and precision are continued to be used widely. Since Cleverdon listed six broad evaluative criteria in 1966, recall and precision have received the lion's share of the attention. However, presentation and user effort are two other criteria that assess information retrieval in user terms. Unlike recall and precision, these criteria have not been studied in depth, and there is no consensus agreement on how they should be measured. Most of the discussions of measurement of information retrieval found in the literature are general and do not include aspects of particular significance to the web.
2.1 Recall and Precision
The two most common effectiveness measures, recall and precision, were introduced in the Cranfield studies to summarize and compare search results. Intuitively, recall measures how well the search engine is doing at finding all the relevant documents for a query [1], and precision measures how well it is doing at rejecting non-relevant documents.
The definition of these measures assumes that, for a given query, there is a set of documents that is retrieved and a set that is not retrieved (the rest of the documents).This obviously applies to the results of a Boolean search, but the same definition can also be used with a ranked search If, in addition, relevance is assumed to be binary, then the results for a query can be summarized as shown in Table 1.1.
In this table, A is the relevant set of documents for the query, Ā is the non-relevant set, B is the set of retrieved documents, and B is the set of documents that are not retrieved. The operator ̅ ∩ gives the intersection of two sets.
For example, A ∩ B is the set of documents that are both relevant and retrieved.
Relevant Non-Relevant
Retrieved A ∩ B Ā∩ B
Not Retrieved A ∩̅B Ā∩̅B
Table 1.1 Sets of documents defined by a simple search with binary relevance.
A number of effectiveness measures can be defined using this table. The two we are particularly interested in are:
Recall = |A ∩ B|
|A|
Precision = |A ∩ B|
retrieved documents that are relevant. There is an implicit assumption in using these measures that the task involves retrieving as many of the relevant documents as possible, and minimizing the number of non-relevant documents retrieved. In words, even if there are 500 relevant documents for a query, the user is interested in finding them all. When a document is retrieved, it is the same as making a prediction that the document is relevant. From this perspective, there are two types of errors that can be made in prediction (or retrieval). These errors are called false positives (a non-relevant document is retrieved) and false negatives (a relevant document is not retrieved). Recall is related to one type of error (the false negatives), but precision is not related directly to the other type of error. Instead, another measure known as fallout, which is the proportion of non-relevant documents that are retrieved, is related to the false positive errors:
Fallout = | Ā∩ B| Ā
Given that fallout and recall together characterize the effectiveness of a search as a classifier, why do we use precision instead? The answer is simply that precision is more meaningful to the user of a search engine. If 20 documents were retrieved for a query, a precision value of 0.7 means that 14 out of the 20 retrieved documents will be relevant. Fallout, on the other hand, will always be very small because there are so many non-relevant documents. If there were 1,000,000 non-relevant documents for the query used in the precision example, fallout would be 6/1000000 = 0.000006. If precision fell to 0.5, which would be noticeable to the user, fallout would be 0.00001. The skewed nature of the search task, where most of the corpus is not relevant to any given query, also means that evaluating a search engine as a classifier can lead to counter-intuitive results. A search engine trained to minimize classification errors would tend to retrieve nothing, since classifying a document as non-relevant is always a good decision! The F measure is an effectiveness measure based on recall and precision that is used for evaluating classification performance and also in some search applications. It has the advantage of summarizing effectiveness in a single number. It is defined as the harmonic mean of recall and precision, which is:
F = 1 = 2RP 1 (1/R +1/P) (R+P) 2
Why use the harmonic mean instead of the usual arithmetic mean or average?
The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is more affected by values that are unusually large (outliers). A search result that returned nearly the entire document database, for example, would have a recall of 1.0 and a precision near 0. The arithmetic mean of these values is 0.5, but the harmonic mean will be close to 0. The harmonic mean is clearly a better summary of the effectiveness of this retrieved set. Most of the retrieval models we have discussed produce ranked output. To use recall and precision measures, retrieved sets of documents must be defined based on the ranking. One possibility is to calculate recall and precision values at every rank position. Figure 1.2 shows the top ten documents of two possible rankings, together with the recall and precision values calculated at every rank position for a query that has six relevant documents. These rankings might correspond to, for example, the output of different retrieval algorithms or search engines. At rank position 10 (i.e. when ten documents are retrieved), the two rankings have the same effectiveness as measured by recall and precision. Recall is 1.0 because all the relevant documents have been retrieved, and precision is 0.6 because both rankings contain 6 relevant documents in the retrieved set of 10 documents. At higher rank positions, however, the first ranking is clearly better. For example, at rank position 4 (4 documents retrieved), the first ranking has a recall of 0.5 (3 out of 6 relevant documents retrieved) and a precision of 0.75 (3 out of 4 retrieved documents are relevant). The second ranking has a recall of 0.17 (1/6) and a precision of 0.25 (1/4).
Ranking #1
Recall 0.17 0.17 0.33 0.5 0.67 0.83 0.83 0.83 0.83 1.0
Precision 1.0 0.5 0.67 0.75 0.8 0.83 0.71 0.63 0.56 0.6
Ranking #2
Recall 0.0 0.17 0.17 0.17 0.33 0.5 0.67 0.67 0.83 1.0
Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.5 0.56 0.6
IJECSE,Volume1,Number 2
Jitendra Nath Singh and Dr. S.K. Dwivedi
ISSN-2277-1956/V1N2-603-607
If there are a large number of relevant documents for a query, or if the relevant documents are widely distributed in the ranking, a list of recall-precision values for every rank position will be long and unwieldy. Instead, a number of techniques have been developed to summarize the effectiveness of a ranking. The first of these is simply to calculate recall-precision values at a small number of predefined rank positions. In fact, to compare two or more rankings for a given query, only the precision at the predefined rank positions needs to be calculated. If the precision for a ranking at rank position p is higher than the precision for another ranking, the recall will be higher as well. This can be seen by comparing the corresponding recall-precision values in Figure 1.2. This effectiveness measure is known as precision at rank p. There are many possible values for the rank position p, but this measure is typically used to compare search output at the top of the ranking, since that is what many users care about. Consequently, the most common versions are precision at 10, and precision at 20. Note that if these measures are used, the implicit search task has changed to finding the most relevant documents at a given rank, rather than finding as many relevant documents as possible. Differences in search output further down the ranking than position 20 will not be considered. This measure also does not distinguish between differences in the rankings at positions 1 to p, which may be considered important for some tasks. For example, the two rankings in Figure 1.2 will be the same when measured using precision at 10.
2.2 Coverage
Calculating overall coverage of search engines is in itself a complicated task. A statistical sampling method for measuring overlap among search engines and their relative coverage has been developed by Bharat and Broder (1998). The two metrics proposed here are used to measure relative ratio of coverage and overlap only for results of a given query. The metrics are: number of unique hits, which measures overlap, and the relative number of returned hits to total hits in a given domain, which measures a relative ratio of coverage for a given query. These measures are comparative, in that they are applicable to comparisons of web search engines.
2.3 Time lag
Time lag is obviously a very important factor for users of information retrieval from the Web. While it is important, it is also highly dependent on the level of Internet traffic at the time of measurement. This introduces a large amount of error variation into the measure, because the level of Internet traffic varies widely over both time and location. Thus, at the current stage of Internet technology, it is difficult to use time lag as a discriminative measure.
3. Evaluation Parameter
Performance of search engines include four aspects in which ‘Quality of result’ is divided into two more subset. The search engine performance revolves around these four factors and used to evaluate the efficiency of search engines.
3.1 Response Time
This is the time period that starts from the point of query submission till user gets a huge list of responded results. Response time is directly related to activeness of search engines and kept minimum as much as possible.
3.2 Total numbers of results and its ordering
To find the relevance result it is essential to cover the whole area of web and select to best among them. User does not prefer huge list of results. It is better for quality page that fulfills the need of user’s query to be listed first.
3.3 Quality of Result
‘Result Quality’ is important parameter to justify the search engines quality. WebPages containing the most suitable solutions of user’s problem are considered ‘relevance result’. Every search engine is expected to provide only relevance result in its Responded list.
3.4 User Efforts
Web is continuously changing t should be aware about what searchi affect the information retrieval proce
(1) Query formulation
(2) User feedback to a webpage
(3) W3 rules
(4) Web Developer’s Fake Techn
Search engine has become new r and efficient evaluation system. In search engine and performance.
The measures presented in this p carry out such an exercise it is fir representing the views and interests retrieval evaluation. While many dif some initial how well a number of m
In the future, we plan to apply the will truly enable Web users to select engine developers design even better
[1] [1] H. Chu and M. Rosenthal, Search Proc. 59th ASIS Annual Meeting, Medfo
[2] [2] G. Salton, Automatic Text Process Series in Computer Science, Addison-W [3] [3] Courtois, Martin P., Baer, William M [4] [4] Harman, Donna. (1995). Overview o [5] [5] Tague-Sutcliffe, J.M. (1996). Some [6] L. Wishard, Precison among Internet
1998.
g the way people locate information as it is a world c ching trends are being used. We identified and examine
cess directly or indirectly.
chnique
Conclusion
research and development field. At present there is an u n this paper we suggested a number of measures to ev
is paper are only a few of many such measures that co first necessary to define a wide range of possible ev sts of a broad range of researches in the area of searc different measures can be defined, they must be used a measures work in practice.
the proposed methodology to a wider scope with the hop ect a search engine appropriate to their specific search n ter ones for the Internet community.
REFERENCES
rch engines for the World Wide Web: A comparative study and evalua dford, NJ, Information Today, Inc., 1996.
essing: The Transformation, Analysis, and Retrieval of Information Wesley Longman, Reading, MA, 1989.
m M., and Stark, Marcella. (November/December 1995).
w of the second Text REtrieval Conference (TREC-2). Information Pr e Perspectives on the Evaluation of Information Retrieval. net search engines: An earth sciences case study, Issues in Science
coverage phenomenon, one ined following factors which
n urgent need for an accurate evaluate the effectiveness of
could be defined. In order to evaluative measures, ideally arch engine and information d appropriately. We provided
ope that our research findings needs, and help Web search
luation methodology, in ASIS''96:
ion by Computer, Addison-Wesley
Processing & Management, 31(3).