Online edition (c) 2009 Cambridge UP
An
Introduction
to
Information
Retrieval
Online edition (c) 2009 Cambridge UP
An
Introduction
to
Information
Retrieval
Christopher D. Manning
Prabhakar Raghavan
Hinrich Schütze
Online edition (c) 2009 Cambridge UP
©
2009 Cambridge University PressBy Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
Printed on April 1, 2009
Website: http://www.informationretrieval.org/
Comments, corrections, and other feedback most welcome at:
Online edition (c) 2009 Cambridge UP
Brief Contents
1 Boolean retrieval 1
2 The term vocabulary and postings lists 19
3 Dictionaries and tolerant retrieval 49 4 Index construction 67
5 Index compression 85
6 Scoring, term weighting and the vector space model 109 7 Computing scores in a complete search system 135 8 Evaluation in information retrieval 151
9 Relevance feedback and query expansion 177 10 XML retrieval 195
11 Probabilistic information retrieval 219
12 Language models for information retrieval 237
13 Text classification and Naive Bayes 253
14 Vector space classification 289
15 Support vector machines and machine learning on documents 319
16 Flat clustering 349 17 Hierarchical clustering 377
18 Matrix decompositions and latent semantic indexing 403 19 Web search basics 421
Online edition (c) 2009 Cambridge UP
Contents
Table of Notation xv
Preface xix
1 Boolean retrieval 1
1.1 An example information retrieval problem 3 1.2 A first take at building an inverted index 6 1.3 Processing Boolean queries 10
1.4 The extended Boolean model versus ranked retrieval 14 1.5 References and further reading 17
2 The term vocabulary and postings lists 19
2.1 Document delineation and character sequence decoding 19 2.1.1 Obtaining the character sequence in a document 19 2.1.2 Choosing a document unit 20
2.2 Determining the vocabulary of terms 22 2.2.1 Tokenization 22
2.2.2 Dropping common terms: stop words 27
2.2.3 Normalization (equivalence classing of terms) 28 2.2.4 Stemming and lemmatization 32
2.3 Faster postings list intersection via skip pointers 36 2.4 Positional postings and phrase queries 39
2.4.1 Biword indexes 39 2.4.2 Positional indexes 41 2.4.3 Combination schemes 43 2.5 References and further reading 45
3 Dictionaries and tolerant retrieval 49
3.1 Search structures for dictionaries 49 3.2 Wildcard queries 51
Online edition (c) 2009 Cambridge UP
3.2.2 k-gram indexes for wildcard queries 54 3.3 Spelling correction 56
3.3.1 Implementing spelling correction 57 3.3.2 Forms of spelling correction 57 3.3.3 Edit distance 58
3.3.4 k-gram indexes for spelling correction 60
3.3.5 Context sensitive spelling correction 62 3.4 Phonetic correction 63
3.5 References and further reading 65
4 Index construction 67
4.1 Hardware basics 68
4.2 Blocked sort-based indexing 69 4.3 Single-pass in-memory indexing 73 4.4 Distributed indexing 74
4.5 Dynamic indexing 78 4.6 Other types of indexes 80
4.7 References and further reading 83
5 Index compression 85
5.1 Statistical properties of terms in information retrieval 86 5.1.1 Heaps’ law: Estimating the number of terms 88 5.1.2 Zipf’s law: Modeling the distribution of terms 89 5.2 Dictionary compression 90
5.2.1 Dictionary as a string 91 5.2.2 Blocked storage 92 5.3 Postings file compression 95
5.3.1 Variable byte codes 96 5.3.2 γcodes 98
5.4 References and further reading 105
6 Scoring, term weighting and the vector space model 109
6.1 Parametric and zone indexes 110 6.1.1 Weighted zone scoring 112 6.1.2 Learning weights 113 6.1.3 The optimal weightg 115 6.2 Term frequency and weighting 117
6.2.1 Inverse document frequency 117 6.2.2 Tf-idf weighting 118
6.3 The vector space model for scoring 120 6.3.1 Dot products 120
Online edition (c) 2009 Cambridge UP
6.4 Variant tf-idf functions 126 6.4.1 Sublinear tf scaling 126
6.4.2 Maximum tf normalization 127
6.4.3 Document and query weighting schemes 128 6.4.4 Pivoted normalized document length 129 6.5 References and further reading 133
7 Computing scores in a complete search system 135
7.1 Efficient scoring and ranking 135
7.1.1 Inexact topKdocument retrieval 137
7.1.2 Index elimination 137 7.1.3 Champion lists 138
7.1.4 Static quality scores and ordering 138 7.1.5 Impact ordering 140
7.1.6 Cluster pruning 141
7.2 Components of an information retrieval system 143 7.2.1 Tiered indexes 143
7.2.2 Query-term proximity 144
7.2.3 Designing parsing and scoring functions 145 7.2.4 Putting it all together 146
7.3 Vector space scoring and query operator interaction 147 7.4 References and further reading 149
8 Evaluation in information retrieval 151
8.1 Information retrieval system evaluation 152 8.2 Standard test collections 153
8.3 Evaluation of unranked retrieval sets 154 8.4 Evaluation of ranked retrieval results 158 8.5 Assessing relevance 164
8.5.1 Critiques and justifications of the concept of relevance 166
8.6 A broader perspective: System quality and user utility 168 8.6.1 System issues 168
8.6.2 User utility 169
8.6.3 Refining a deployed system 170 8.7 Results snippets 170
8.8 References and further reading 173
9 Relevance feedback and query expansion 177
9.1 Relevance feedback and pseudo relevance feedback 178 9.1.1 The Rocchio algorithm for relevance feedback 178 9.1.2 Probabilistic relevance feedback 183
Online edition (c) 2009 Cambridge UP
9.1.4 Relevance feedback on the web 185
9.1.5 Evaluation of relevance feedback strategies 186 9.1.6 Pseudo relevance feedback 187
9.1.7 Indirect relevance feedback 187 9.1.8 Summary 188
9.2 Global methods for query reformulation 189
9.2.1 Vocabulary tools for query reformulation 189 9.2.2 Query expansion 189
9.2.3 Automatic thesaurus generation 192 9.3 References and further reading 193
10 XML retrieval 195
10.1 Basic XML concepts 197
10.2 Challenges in XML retrieval 201
10.3 A vector space model for XML retrieval 206 10.4 Evaluation of XML retrieval 210
10.5 Text-centric vs. data-centric XML retrieval 214 10.6 References and further reading 216
10.7 Exercises 217
11 Probabilistic information retrieval 219
11.1 Review of basic probability theory 220 11.2 The Probability Ranking Principle 221
11.2.1 The 1/0 loss case 221
11.2.2 The PRP with retrieval costs 222 11.3 The Binary Independence Model 222
11.3.1 Deriving a ranking function for query terms 224 11.3.2 Probability estimates in theory 226
11.3.3 Probability estimates in practice 227
11.3.4 Probabilistic approaches to relevance feedback 228 11.4 An appraisal and some extensions 230
11.4.1 An appraisal of probabilistic models 230
11.4.2 Tree-structured dependencies between terms 231 11.4.3 Okapi BM25: a non-binary model 232
11.4.4 Bayesian network approaches to IR 234 11.5 References and further reading 235
12 Language models for information retrieval 237
12.1 Language models 237
12.1.1 Finite automata and language models 237 12.1.2 Types of language models 240
Online edition (c) 2009 Cambridge UP
12.2.1 Using query likelihood language models in IR 242 12.2.2 Estimating the query generation probability 243 12.2.3 Ponte and Croft’s Experiments 246
12.3 Language modeling versus other approaches in IR 248 12.4 Extended language modeling approaches 250
12.5 References and further reading 252
13 Text classification and Naive Bayes 253
13.1 The text classification problem 256 13.2 Naive Bayes text classification 258
13.2.1 Relation to multinomial unigram language model 262 13.3 The Bernoulli model 263
13.4 Properties of Naive Bayes 265
13.4.1 A variant of the multinomial model 270 13.5 Feature selection 271
13.5.1 Mutual information 272 13.5.2 χ2Feature selection 275
13.5.3 Frequency-based feature selection 277 13.5.4 Feature selection for multiple classifiers 278 13.5.5 Comparison of feature selection methods 278 13.6 Evaluation of text classification 279
13.7 References and further reading 286
14 Vector space classification 289
14.1 Document representations and measures of relatedness in vector spaces 291
14.2 Rocchio classification 292 14.3 knearest neighbor 297
14.3.1 Time complexity and optimality of kNN 299 14.4 Linear versus nonlinear classifiers 301
14.5 Classification with more than two classes 306 14.6 The bias-variance tradeoff 308
14.7 References and further reading 314 14.8 Exercises 315
15 Support vector machines and machine learning on documents 319
15.1 Support vector machines: The linearly separable case 320 15.2 Extensions to the SVM model 327
15.2.1 Soft margin classification 327 15.2.2 Multiclass SVMs 330
15.2.3 Nonlinear SVMs 330 15.2.4 Experimental results 333
Online edition (c) 2009 Cambridge UP
15.3.1 Choosing what kind of classifier to use 335 15.3.2 Improving classifier performance 337
15.4 Machine learning methods in ad hoc information retrieval 341 15.4.1 A simple example of machine-learned scoring 341 15.4.2 Result ranking by machine learning 344
15.5 References and further reading 346
16 Flat clustering 349
16.1 Clustering in information retrieval 350 16.2 Problem statement 354
16.2.1 Cardinality – the number of clusters 355 16.3 Evaluation of clustering 356
16.4 K-means 360
16.4.1 Cluster cardinality inK-means 365
16.5 Model-based clustering 368 16.6 References and further reading 372 16.7 Exercises 374
17 Hierarchical clustering 377
17.1 Hierarchical agglomerative clustering 378 17.2 Single-link and complete-link clustering 382
17.2.1 Time complexity of HAC 385 17.3 Group-average agglomerative clustering 388 17.4 Centroid clustering 391
17.5 Optimality of HAC 393 17.6 Divisive clustering 395 17.7 Cluster labeling 396 17.8 Implementation notes 398
17.9 References and further reading 399 17.10 Exercises 401
18 Matrix decompositions and latent semantic indexing 403
18.1 Linear algebra review 403
18.1.1 Matrix decompositions 406 18.2 Term-document matrices and singular value
decompositions 407
18.3 Low-rank approximations 410 18.4 Latent semantic indexing 412 18.5 References and further reading 417
19 Web search basics 421
19.1 Background and history 421 19.2 Web characteristics 423
Online edition (c) 2009 Cambridge UP
19.2.2 Spam 427
19.3 Advertising as the economic model 429 19.4 The search user experience 432
19.4.1 User query needs 432 19.5 Index size and estimation 433 19.6 Near-duplicates and shingling 437 19.7 References and further reading 441
20 Web crawling and indexes 443
20.1 Overview 443
20.1.1 Features a crawlermustprovide 443
20.1.2 Features a crawlershouldprovide 444
20.2 Crawling 444
20.2.1 Crawler architecture 445 20.2.2 DNS resolution 449 20.2.3 The URL frontier 451 20.3 Distributing indexes 454 20.4 Connectivity servers 455
20.5 References and further reading 458
21 Link analysis 461
21.1 The Web as a graph 462
21.1.1 Anchor text and the web graph 462 21.2 PageRank 464
21.2.1 Markov chains 465
21.2.2 The PageRank computation 468 21.2.3 Topic-specific PageRank 471 21.3 Hubs and Authorities 474
21.3.1 Choosing the subset of the Web 477 21.4 References and further reading 480
Bibliography 483
Online edition (c) 2009 Cambridge UP
Table of Notation
Symbol Page Meaning
γ p. 98 γcode
γ p. 256 Classification or clustering function:γ(d)isd’s class
or cluster
Γ p. 256 Supervised learning method in Chapters 13 and 14: Γ(D) is the classification function γ learned from
training setD
λ p. 404 Eigenvalue
~µ(.) p. 292 Centroid of a class (in Rocchio classification) or a
cluster (inK-means and centroid clustering)
Φ p. 114 Training example
σ p. 408 Singular value
Θ(·) p. 11 A tight bound on the complexity of an algorithm
ω,ωk p. 357 Cluster in clustering
Ω p. 357 Clustering or set of clusters{ω1, . . . ,ωK}
arg maxxf(x) p. 181 The value ofxfor whichf reaches its maximum arg minxf(x) p. 181 The value ofxfor whichf reaches its minimum c,cj p. 256 Class or category in classification
cft p. 89 The collection frequency of termt(the total number
of times the term appears in the document collec-tion)
C p. 256 Set{c1, . . . ,cJ}of all classes
C p. 268 A random variable that takes as values members of
Online edition (c) 2009 Cambridge UP
C p. 403 Term-document matrix
d p. 4 Index of thedthdocument in the collectionD
d p. 71 A document
~
d,~q p. 181 Document vector, query vector D p. 354 Set{d1, . . . ,dN}of all documents Dc p. 292 Set of documents that is in classc
D p. 256 Set{hd1,c1i, . . . ,hdN,cNi}of all labeled documents
in Chapters 13–15
dft p. 118 The document frequency of termt(the total number
of documents in the collection the term appears in)
H p. 99 Entropy
HM p. 101 Mth harmonic number
I(X;Y) p. 272 Mutual information of random variablesXandY
idft p. 118 Inverse document frequency of termt J p. 256 Number of classes
k p. 290 Topkitems from a set, e.g., knearest neighbors in kNN, topkretrieved documents, topkselected fea-tures from the vocabularyV
k p. 54 Sequence ofkcharacters
K p. 354 Number of clusters
Ld p. 233 Length of documentd(in tokens)
La p. 262 Length of the test document (or application docu-ment) in tokens
Lave p. 70 Average length of a document (in tokens)
M p. 5 Size of the vocabulary (|V|)
Ma p. 262 Size of the vocabulary of the test document (or ap-plication document)
Mave p. 78 Average size of the vocabulary in a document in the collection
Md p. 237 Language model for documentd
N p. 4 Number of documents in the retrieval or training
collection
Online edition (c) 2009 Cambridge UP
O(·) p. 11 A bound on the complexity of an algorithm
O(·) p. 221 The odds of an event
P p. 155 Precision
P(·) p. 220 Probability
P p. 465 Transition probability matrix
q p. 59 A query
R p. 155 Recall
si p. 58 A string
si p. 112 Boolean values for zone scoring
sim(d1,d2) p. 121 Similarity score for documentsd1,d2
T p. 43 Total number of tokens in the document collection Tct p. 259 Number of occurrences of word tin documents of
classc
t p. 4 Index of thetthterm in the vocabularyV t p. 61 A term in the vocabulary
tft,d p. 117 The term frequency of termtin documentd(the
to-tal number of occurrences oftind)
Ut p. 266 Random variable taking values 0 (termtis present)
and 1 (tis not present)
V p. 208 Vocabulary of terms{t1, . . . ,tM}in a collection (a.k.a.
the lexicon)
~v(d) p. 122 Length-normalized document vector
~
V(d) p. 120 Vector of documentd, not length-normalized
wft,d p. 125 Weight of termtin documentd
w p. 112 A weight, for example for zones or terms
~
wT~x=b p. 293 Hyperplane; ~w is the normal vector of the
hyper-plane andwicomponentiof~w
~x p. 222 Term incidence vector~x = (x1, . . . ,xM); more
gen-erally: document feature representation
X p. 266 Random variable taking values inV, the vocabulary (e.g., at a given positionkin a document)
X p. 256 Document space in text classification
Online edition (c) 2009 Cambridge UP
|si| p. 58 Length in characters of stringsi |~x| p. 139 Length of vector~x
|~x−~y| p. 131 Euclidean distance of~xand~y(which is the length of
Online edition (c) 2009 Cambridge UP
Preface
As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval sys-tems. Of course, in that time period, most people also used human travel agents to book their travel. However, during the last decade, relentless opti-mization of information retrieval effectiveness has driven web search engines to new quality levels where most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows 2004) found that “92% of Internet users say the Internet is a good place to go for getting everyday information.” To the surprise of many, the field of information re-trieval has moved from being a primarily academic discipline to being the basis underlying most people’s preferred means of information access. This book presents the scientific underpinnings of this field, at a level accessible to graduate students as well as advanced undergraduates.
Information retrieval did not begin with the Web. In response to various challenges of providing information access, the field of information retrieval evolved to give principled approaches to searching various forms of con-tent. The field began with scientific publications and library records, but soon spread to other forms of content, particularly those of information pro-fessionals, such as journalists, lawyers, and doctors. Much of the scientific research on information retrieval has occurred in these contexts, and much of the continued practice of information retrieval deals with providing access to unstructured information in various corporate and governmental domains, and this work forms much of the foundation of our book.
Online edition (c) 2009 Cambridge UP
the whole Web would rapidly become impossible, due to the Web’s expo-nential growth in size. But major scientific innovations, superb engineering, the rapidly declining price of computer hardware, and the rise of a commer-cial underpinning for web search have all conspired to power today’s major search engines, which are able to provide high-quality results within subsec-ond response times for hundreds of millions of searches a day over billions of web pages.
Book organization and course development
This book is the result of a series of courses we have taught at Stanford Uni-versity and at the UniUni-versity of Stuttgart, in a range of durations including a single quarter, one semester and two quarters. These courses were aimed at early-stage graduate students in computer science, but we have also had enrollment from upper-class computer science undergraduates, as well as students from law, medical informatics, statistics, linguistics and various en-gineering disciplines. The key design principle for this book, therefore, was to cover what we believe to be important in a one-term graduate course on information retrieval. An additional principle is to build each chapter around material that we believe can be covered in a single lecture of 75 to 90 minutes. The first eight chapters of the book are devoted to the basics of informa-tion retrieval, and in particular the heart of search engines; we consider this material to be core to any course on information retrieval. Chapter 1 in-troduces inverted indexes, and shows how simple Boolean queries can be processed using such indexes. Chapter 2 builds on this introduction by de-tailing the manner in which documents are preprocessed before indexing and by discussing how inverted indexes are augmented in various ways for functionality and speed. Chapter 3 discusses search structures for dictionar-ies and how to process querdictionar-ies that have spelling errors and other imprecise matches to the vocabulary in the document collection being searched. Chap-ter 4 describes a number of algorithms for constructing the inverted index from a text collection with particular attention to highly scalable and dis-tributed algorithms that can be applied to very large collections. Chapter 5 covers techniques for compressing dictionaries and inverted indexes. These techniques are critical for achieving subsecond response times to user queries in large search engines. The indexes and queries considered in Chapters 1–5 only deal withBoolean retrieval, in which a document either matches a query,
Online edition (c) 2009 Cambridge UP
relevance of the documents it retrieves, allowing us to compare the relative performances of different systems on benchmark document collections and queries.
Chapters 9–21 build on the foundation of the first eight chapters to cover a variety of more advanced topics. Chapter 9 discusses methods by which retrieval can be enhanced through the use of techniques like relevance feed-back and query expansion, which aim at increasing the likelihood of retriev-ing relevant documents. Chapter 10 considers information retrieval from documents that are structured with markup languages like XML and HTML. We treat structured retrieval by reducing it to the vector space scoring meth-ods developed in Chapter 6. Chapters 11 and 12 invoke probability theory to compute scores for documents on queries. Chapter 11 develops traditional probabilistic information retrieval, which provides a framework for comput-ing the probability of relevance of a document, given a set of query terms. This probability may then be used as a score in ranking. Chapter 12 illus-trates an alternative, wherein for each document in a collection, we build a language model from which one can estimate a probability that the language model generates a given query. This probability is another quantity with which we can rank-order documents.
Chapters 13–17 give a treatment of various forms of machine learning and numerical methods in information retrieval. Chapters 13–15 treat the prob-lem of classifying documents into a set of known categories, given a set of documents along with the classes they belong to. Chapter 13 motivates sta-tistical classification as one of the key technologies needed for a successful search engine, introduces Naive Bayes, a conceptually simple and efficient text classification method, and outlines the standard methodology for evalu-ating text classifiers. Chapter 14 employs the vector space model from Chap-ter 6 and introduces two classification methods, Rocchio and kNN, that op-erate on document vectors. It also presents the bias-variance tradeoff as an important characterization of learning problems that provides criteria for se-lecting an appropriate method for a text classification problem. Chapter 15 introduces support vector machines, which many researchers currently view as the most effective text classification method. We also develop connections in this chapter between the problem of classification and seemingly disparate topics such as the induction of scoring functions from a set of training exam-ples.
Online edition (c) 2009 Cambridge UP
clusterings (instead of flat clusterings) in many applications in information retrieval and introduces a number of clustering algorithms that produce a hierarchy of clusters. The chapter also addresses the difficult problem of automatically computing labels for clusters. Chapter 18 develops methods from linear algebra that constitute an extension of clustering, and also offer intriguing prospects for algebraic methods in information retrieval, which have been pursued in the approach of latent semantic indexing.
Chapters 19–21 treat the problem of web search. We give in Chapter 19 a summary of the basic challenges in web search, together with a set of tech-niques that are pervasive in web information retrieval. Next, Chapter 20 describes the architecture and requirements of a basic web crawler. Finally, Chapter 21 considers the power of link analysis in web search, using in the process several methods from linear algebra and advanced probability the-ory.
This book is not comprehensive in covering all topics related to informa-tion retrieval. We have put aside a number of topics, which we deemed outside the scope of what we wished to cover in an introduction to infor-mation retrieval class. Nevertheless, for people interested in these topics, we provide a few pointers to mainly textbook coverage here.
Cross-language IR (Grossman and Frieder 2004, ch. 4) and (Oard and Dorr
1996).
Image and Multimedia IR (Grossman and Frieder 2004, ch. 4), (Baeza-Yates and Ribeiro-Neto 1999, ch. 6), (Baeza-Yates and Ribeiro-Neto 1999, ch. 11), (Baeza-Yates and Ribeiro-Neto 1999, ch. 12), (del Bimbo 1999), (Lew 2001), and (Smeulders et al. 2000).
Speech retrieval (Coden et al. 2002).
Music Retrieval (Downie 2006) andhttp://www.ismir.net/.
User interfaces for IR (Baeza-Yates and Ribeiro-Neto 1999, ch. 10).
Parallel and Peer-to-Peer IR (Grossman and Frieder 2004, ch. 7), (Baeza-Yates and Ribeiro-Neto 1999, ch. 9), and (Aberer 2001).
Digital libraries (Baeza-Yates and Ribeiro-Neto 1999, ch. 15) and (Lesk 2004).
Information science perspective (Korfhage 1997), (Meadow et al. 1999), and
(Ingwersen and Järvelin 2005).
Logic-based approaches to IR (van Rijsbergen 1989).
Online edition (c) 2009 Cambridge UP
Prerequisites
Introductory courses in data structures and algorithms, in linear algebra and in probability theory suffice as prerequisites for all 21 chapters. We now give more detail for the benefit of readers and instructors who wish to tailor their reading to some of the chapters.
Chapters 1–5 assume as prerequisite a basic course in algorithms and data structures. Chapters 6 and 7 require, in addition, a knowledge of basic lin-ear algebra including vectors and dot products. No additional prerequisites are assumed until Chapter 11, where a basic course in probability theory is required; Section 11.1 gives a quick review of the concepts necessary in Chap-ters 11–13. Chapter 15 assumes that the reader is familiar with the notion of nonlinear optimization, although the chapter may be read without detailed knowledge of algorithms for nonlinear optimization. Chapter 18 demands a first course in linear algebra including familiarity with the notions of matrix rank and eigenvectors; a brief review is given in Section 18.1. The knowledge of eigenvalues and eigenvectors is also necessary in Chapter 21.
Book layout
✎
Worked examples in the text appear with a pencil sign next to them in the left margin. Advanced or difficult material appears in sections or subsections indicated with scissors in the margin. Exercises are marked in the margin✄
with a question mark. The level of difficulty of exercises is indicated as easy (⋆), medium (⋆⋆), or difficult (⋆ ⋆ ⋆).?
Acknowledgments
We would like to thank Cambridge University Press for allowing us to make the draft book available online, which facilitated much of the feedback we have received while writing the book. We also thank Lauren Cowles, who has been an outstanding editor, providing several rounds of comments on each chapter, on matters of style, organization, and coverage, as well as de-tailed comments on the subject matter of the book. To the extent that we have achieved our goals in writing this book, she deserves an important part of the credit.
Online edition (c) 2009 Cambridge UP
Fuhr, Vignesh Ganapathy, Elmer Garduno, Xiubo Geng, David Gondek, Ser-gio Govoni, Corinna Habets, Ben Handy, Donna Harman, Benjamin Haskell, Thomas Hühn, Deepak Jain, Ralf Jankowitsch, Dinakar Jayarajan, Vinay Kakade, Mei Kobayashi, Wessel Kraaij, Rick Lafleur, Florian Laws, Hang Li, David Losada, David Mann, Ennio Masi, Sven Meyer zu Eissen, Alexander Murzaku, Gonzalo Navarro, Frank McCown, Paul McNamee, Christoph Müller, Scott Olsson, Tao Qin, Megha Raghavan, Michal Rosen-Zvi, Klaus Rothenhäusler, Kenyu L. Runner, Alexander Salamanca, Grigory Sapunov, Evgeny Shad-chnev, Tobias Scheffer, Nico Schlaefer, Ian Soboroff, Benno Stein, Marcin Sydow, Andrew Turner, Jason Utt, Huey Vo, Travis Wade, Mike Walsh, Changliang Wang, Renjing Wang, and Thomas Zeume.
Many people gave us detailed feedback on individual chapters, either at our request or through their own initiative. For this, we’re particularly grate-ful to: James Allan, Omar Alonso, Ismail Sengor Altingovde, Vo Ngoc Anh, Roi Blanco, Eric Breck, Eric Brown, Mark Carman, Carlos Castillo, Junghoo Cho, Aron Culotta, Doug Cutting, Meghana Deodhar, Susan Dumais, Jo-hannes Fürnkranz, Andreas Heß, Djoerd Hiemstra, David Hull, Thorsten Joachims, Siddharth Jonathan J. B., Jaap Kamps, Mounia Lalmas, Amy Langville, Nicholas Lester, Dave Lewis, Daniel Lowd, Yosi Mass, Jeff Michels, Alessan-dro Moschitti, Amir Najmi, Marc Najork, Giorgio Maria Di Nunzio, Paul Ogilvie, Priyank Patel, Jan Pedersen, Kathryn Pedings, Vassilis Plachouras, Daniel Ramage, Ghulam Raza, Stefan Riezler, Michael Schiehlen, Helmut Schmid, Falk Nicolas Scholer, Sabine Schulte im Walde, Fabrizio Sebastiani, Sarabjeet Singh, Valentin Spitkovsky, Alexander Strehl, John Tait, Shivaku-mar Vaithyanathan, Ellen Voorhees, Gerhard Weikum, Dawid Weiss, Yiming Yang, Yisong Yue, Jian Zhang, and Justin Zobel.
And finally there were a few reviewers who absolutely stood out in terms of the quality and quantity of comments that they provided. We thank them for their significant impact on the content and structure of the book. We express our gratitude to Pavel Berkhin, Stefan Büttcher, Jamie Callan, Byron Dom, Torsten Suel, and Andrew Trotman.
Parts of the initial drafts of Chapters 13–15 were based on slides that were generously provided by Ray Mooney. While the material has gone through extensive revisions, we gratefully acknowledge Ray’s contribution to the three chapters in general and to the description of the time complexities of text classification algorithms in particular.
The above is unfortunately an incomplete list: we are still in the process of incorporating feedback we have received. And, like all opinionated authors, we did not always heed the advice that was so freely given. The published versions of the chapters remain solely the responsibility of the authors.
Online edition (c) 2009 Cambridge UP
contents were refined. CM thanks his family for the many hours they’ve let him spend working on this book, and hopes he’ll have a bit more free time on weekends next year. PR thanks his family for their patient support through the writing of this book and is also grateful to Yahoo! Inc. for providing a fertile environment in which to work on this book. HS would like to thank his parents, family, and friends for their support while writing this book.
Web and contact information
This book has a companion website athttp://informationretrieval.org. As well as
links to some more general resources, it is our intent to maintain on this web-site a set of slides for each chapter which may be used for the corresponding lecture. We gladly welcome further feedback, corrections, and suggestions
Online edition (c) 2009 Cambridge UP
1
Boolean retrieval
The meaning of the terminformation retrievalcan be very broad. Just getting
a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study,
information retrievalmight be defined thus:
INFORMATION RETRIEVAL
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar pro-fessional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email.1 Information retrieval is fast becoming
the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: “I’m sorry, I can only look up your order if you can give me your Order ID”).
IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to main-tain product inventories and personnel records. In reality, almost no data are truly “unstructured”. This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings and paragraphs and footnotes, which is commonly repre-sented in documents by explicit markup (such as the coding underlying web
Online edition (c) 2009 Cambridge UP
pages). IR is also used to facilitate “semistructured” search such as finding a document where the title containsJavaand the body containsthreading.
The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved doc-uments. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to ar-ranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically.
Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. Inweb search, the system has to provide search over billions of documents
stored on millions of computers. Distinctive issues are needing to gather documents for indexing, being able to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers manip-ulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web. We focus on all these issues in Chapters 19–21. At the other extreme is personal information retrieval. In
the last few years, consumer operating systems have integrated information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). Email programs usually not only provide search but also text clas-sification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders. Distinctive issues here include han-dling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner. In between is the space ofenterprise, institutional, and domain-specific search, where retrieval might be provided for
Online edition (c) 2009 Cambridge UP
In this chapter we begin with a very simple example of an information retrieval problem, and introduce the idea of a term-document matrix (Sec-tion 1.1) and the central inverted index data structure (Sec(Sec-tion 1.2). We will then examine the Boolean retrieval model and how Boolean queries are pro-cessed (Sections 1.3 and 1.4).
1.1
An example information retrieval problem
A fat book which many people own is Shakespeare’s Collected Works. Sup-pose you wanted to determine which plays of Shakespeare contain the words
BrutusANDCaesarAND NOT Calpurnia. One way to do that is to start at the
beginning and to read through all the text, noting for each play whether it containsBrutusandCaesarand excluding it from consideration if it
con-tainsCalpurnia. The simplest form of document retrieval is for a computer
to do this sort of linear scan through documents. This process is commonly referred to asgreppingthrough text, after the Unix commandgrep, which GREP
performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expres-sions. With modern computers, for simple querying of modest collections (the size of Shakespeare’s Collected Works is a bit under one million words of text in total), you really need nothing more.
But for many purposes, you do need more:
1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to perform the queryRomans NEARcountrymenwith grep, whereNEAR
might be defined as “within 5 words” or “within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words. The way to avoid linearly scanning the texts for each query is toindexthe
INDEX
documents in advance. Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-documentincidence
INCIDENCE MATRIX
matrix, as in Figure 1.1. Termsare the indexed units (further discussed in TERM
Online edition (c) 2009 Cambridge UP
Antony Julius The Hamlet Othello Macbeth . . . and Caesar Tempest
Cleopatra
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
. . .
◮Figure 1.1 A term-document incidence matrix. Matrix element(t,d) is 1 if the
play in columndcontains the word in rowt, and is 0 otherwise.
them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhapsI-9orHong Kongare not usually
thought of as words. Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.2
To answer the queryBrutusAND CaesarAND NOT Calpurnia, we take the
vectors forBrutus,CaesarandCalpurnia, complement the last, and then do a
bitwiseAND:
110100AND110111AND101111 = 100100
The answers for this query are thus Antony and CleopatraandHamlet
(Fig-ure 1.2).
TheBoolean retrieval modelis a model for information retrieval in which we BOOLEAN RETRIEVAL
MODEL can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operatorsAND,OR, andNOT.
The model views each document as just a set of words.
Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have
N = 1 million documents. Bydocumentswe mean whatever units we have
DOCUMENT
decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document)collection. It is sometimes also referred to as acorpus(abodyof COLLECTION
CORPUS texts). Suppose each document is about 1000 words long (2–3 book pages). If
Online edition (c) 2009 Cambridge UP
Antony and Cleopatra, Act III, Scene iiAgrippa [Aside to Domitius Enobarbus]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
◮Figure 1.2 Results from Shakespeare for the queryBrutusANDCaesarAND NOT
Calpurnia.
we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be aboutM = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 86).
Our goal is to develop a system to address thead hoc retrievaltask. This is
AD HOC RETRIEVAL
the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An
information needis the topic about which the user desires to know more, and
INFORMATION NEED
is differentiated from aquery, which is what the user conveys to the
com-QUERY
puter in an attempt to communicate the information need. A document is
relevantif it is one that the user perceives as containing information of value
RELEVANCE
with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of par-ticular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as
pipeline rupture. To assess theeffectivenessof an IR system (i.e., the quality of
EFFECTIVENESS
its search results), a user will usually want to know two key statistics about the system’s returned results for a query:
Precision: What fraction of the returned results are relevant to the informa-PRECISION
tion need?
Recall: What fraction of the relevant documents in the collection were re-RECALL
Online edition (c) 2009 Cambridge UP
Detailed discussion of relevance and evaluation measures including preci-sion and recall is found in Chapter 8.
We now cannot build a term-document matrix in a naive way. A 500K× 1M matrix has half-a-trillion 0’s and 1’s – too many to fit in a computer’s memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1’s, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions.
This idea is central to the first major concept in information retrieval, the
inverted index. The name is actually redundant: an index always maps back INVERTED INDEX
from terms to the parts of a document where they occur. Nevertheless, in-verted index, or sometimesinverted file, has become the standard term in infor-mation retrieval.3The basic idea of an inverted index is shown in Figure 1.3.
We keep adictionaryof terms (sometimes also referred to as avocabularyor
DICTIONARY
VOCABULARY lexicon; in this book, we usedictionaryfor the data structure andvocabulary LEXICON for the set of terms). Then for each term, we have a list that records which
documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the docu-ment) – is conventionally called aposting.4 The list is then called apostings
POSTING
POSTINGS LIST list(or inverted list), and all the postings lists taken together are referred to as thepostings. The dictionary in Figure 1.3 has been sorted alphabetically and
POSTINGS
each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).
1.2
A first take at building an inverted index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:
1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So . . .
3. Some information retrieval researchers prefer the terminverted file, but expressions like in-dex constructionandindex compressionare much more common thaninverted file constructionand inverted file compression. For consistency, we use(inverted) indexthroughout this book.
Online edition (c) 2009 Cambridge UP
Brutus −→ 1 2 4 11 31 45 173 174
Caesar −→ 1 2 4 5 6 16 57 132 . . .
Calpurnia −→ 2 31 54 101
.. .
| {z } | {z }
Dictionary Postings
◮Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept
in memory, with pointers to each postings list, which is stored on disk.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted in-dex, consisting of a dictionary and postings.
We will define and discuss the earlier stages of processing, that is, steps 1–3, in Section 2.2 (page 22). Until then you can think oftokensand normalized tokens as also loosely equivalent to words. Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted index by sort-based indexing.
Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID). During index
con-DOCID
struction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 1.4. The core indexing step issortingthis list
SORTING
so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same document are then merged.5 Instances of the same term are then grouped,
and the result is split into adictionary and postings, as shown in the right column of Figure 1.4. Since a term generally occurs in a number of docu-ments, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (thedocument frequency, which is here DOCUMENT
FREQUENCY also the length of each postings list). This information is not vital for a ba-sic Boolean search engine, but it allows us to improve the efficiency of the
Online edition (c) 2009 Cambridge UP
Doc 1 Doc 2
I did enact Julius Caesar: I was killed
i’ the Capitol; Brutus killed me. So let it be with Caesar. The noble Brutushath told you Caesar was ambitious:
term docID I 1 did 1 enact 1 julius 1 caesar 1 I 1 was 1 killed 1 i’ 1 the 1 capitol 1 brutus 1 killed 1 me 1 so 2 let 2 it 2 be 2 with 2 caesar 2 the 2 noble 2 brutus 2 hath 2 told 2 you 2 caesar 2 was 2 ambitious 2 =⇒ term docID ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 hath 1 I 1 I 1 i’ 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 =⇒
term doc. freq. → postings lists
ambitious 1 → 2
be 1 → 2
brutus 2 → 1 → 2
capitol 1 → 1
caesar 2 → 1 → 2
did 1 → 1
enact 1 → 1
hath 1 → 2
I 1 → 1
i’ 1 → 1
it 1 → 2
julius 1 → 1
killed 1 → 1
let 1 → 2
me 1 → 1
noble 1 → 2
so 1 → 2
the 2 → 1 → 2
told 1 → 2
you 1 → 2
was 2 → 1 → 2
with 1 → 2
◮Figure 1.4 Building an index by sorting and grouping. The sequence of terms
Online edition (c) 2009 Cambridge UP
search engine at query time, and it is a statistic later used in many ranked re-trieval models. The postings are secondarily sorted by docID. This provides the basis for efficient query processing. This inverted index structure is es-sentially without rivals as the most efficient structure for supporting ad hoc text search.
In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk, so the size of each is important, and in Chapter 5 we will examine how each can be optimized for storage and access efficiency. What data structure should be used for a postings list? A fixed length array would be wasteful as some words occur in many documents, and others in very few. For an in-memory postings list, two good alternatives are singly linked lists or variable length arrays. Singly linked lists allow cheap insertion of documents into postings lists (following updates, such as when recrawling the web for updated doc-uments), and naturally extend to more advanced indexing strategies such as skip lists (Section 2.3), which require additional pointers. Variable length ar-rays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with memory caches. Extra pointers can in practice be encoded into the lists as offsets. If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for each term. When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without explicit pointers (as in Figure 1.3), so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.
?
Exercise 1.1 [⋆]Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)
Doc 1 new home sales top forecasts Doc 2 home sales rise in july Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Exercise 1.2 [⋆]
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia Doc 4 new hopes for schizophrenia patients
Online edition (c) 2009 Cambridge UP
Brutus −→ 1 →2 →4 →11→31→45→173→174 Calpurnia −→ 2 →31→54→101
Intersection =⇒ 2 →31
◮Figure 1.5 Intersecting the postings lists forBrutusandCalpurniafrom Figure 1.3.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Exercise 1.3 [⋆]
For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a. schizophreniaANDdrug
b. forAND NOT(drugORapproach)
1.3
Processing Boolean queries
How do we process a query using an inverted index and the basic Boolean retrieval model? Consider processing thesimple conjunctive query:
SIMPLE CONJUNCTIVE QUERIES
(1.1) BrutusANDCalpurnia
over the inverted index partially shown in Figure1.3(page7). We:
1. LocateBrutusin the Dictionary
2. Retrieve its postings
3. LocateCalpurniain the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure1.5.
Theintersectionoperation is the crucial one: we need to efficiently intersect POSTINGS LIST
INTERSECTION postings lists so as to be able to quickly find documents that contain both
terms. (This operation is sometimes referred to as merging postings lists: POSTINGS MERGE
this slightly counterintuitive name reflects using the termmerge algorithmfor
a general family of algorithms that combine multiple sorted lists by inter-leaved advancing of pointers through each; here we are merging the lists with a logicalANDoperation.)
Online edition (c) 2009 Cambridge UP
INTERSECT(p1,p2)
1 answer← h i
2 whilep16=NILandp26=NIL
3 do ifdocID(p1) =docID(p2)
4 thenADD(answer,docID(p1))
5 p1←next(p1)
6 p2←next(p2)
7 else ifdocID(p1)<docID(p2)
8 thenp1←next(p1) 9 else p2←next(p2) 10 returnanswer
◮Figure 1.6 Algorithm for the intersection of two postings listsp1andp2.
and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists arex and y, the intersection takesO(x+y) operations. Formally, the complexity of querying is Θ(N), whereNis the number of documents in the collection.6
Our indexing methods gain us just a constant, not a difference in Θ time
complexity compared to a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this.
We can extend the intersection operation to process more complicated queries like:
(1.2) (BrutusORCaesar)AND NOTCalpurnia
Query optimizationis the process of selecting how to organize the work of an-QUERY OPTIMIZATION
swering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Con-sider a query that is anANDoftterms, for instance:
(1.3) BrutusANDCaesarANDCalpurnia
For each of thetterms, we need to get its postings, thenANDthem together.
The standard heuristic is to process terms in order of increasing document
Online edition (c) 2009 Cambridge UP
INTERSECT(ht1, . . . ,tni)
1 terms←SORTBYINCREASINGFREQUENCY(ht1, . . . ,tni)
2 result←postings(f irst(terms)) 3 terms←rest(terms)
4 whileterms6=NILandresult6=NIL
5 doresult←INTERSECT(result,postings(f irst(terms))) 6 terms←rest(terms)
7 returnresult
◮Figure 1.7 Algorithm for conjunctive queries that returns the set of documents
containing each term in the input list of terms.
frequency: if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure1.3(page7), we execute the above query as:
(1.4) (CalpurniaANDBrutus)ANDCaesar
This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list.
Consider now the optimization of more general queries, such as:
(1.5) (maddingORcrowd)AND(ignobleORstrife)AND(killedORslain)
As before, we will get the frequencies for all terms, and we can then (con-servatively) estimate the size of eachORby the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term.
Online edition (c) 2009 Cambridge UP
intersection can still be done by the algorithm in Figure 1.6, but when the difference between the list lengths is very large, opportunities to use alter-native techniques open up. The intersection can be calculated in place by destructively modifying or marking invalid items in the intermediate results list. Or the intersection can be done as a sequence of binary searches in the long postings lists for each posting in the intermediate results list. Another possibility is to store the long postings list as a hashtable, so that membership of an intermediate result item can be calculated in constant rather than linear or log time. However, such alternative techniques are difficult to combine with postings list compression of the sort discussed in Chapter5. Moreover, standard postings list intersection operations remain necessary when both terms of a query are very common.
?
Exercise 1.4 [⋆]For the queries below, can we still run through the intersection in timeO(x+y), wherexandyare the lengths of the postings lists forBrutusandCaesar? If not, what
can we achieve?
a. BrutusAND NOTCaesar
b. BrutusOR NOTCaesar
Exercise 1.5 [⋆]
Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider:
c. (BrutusORCaesar)AND NOT(AntonyORCleopatra)
Can we always merge in linear time? Linear in what? Can we do better than this?
Exercise 1.6 [⋆⋆]
We can use distributive laws forANDandORto rewrite queries.
a. Show how to rewrite the query in Exercise1.5into disjunctive normal form using the distributive laws.
b. Would the resulting query be more or less efficiently evaluated than the original form of this query?
c. Is this result true in general or does it depend on the words and the contents of the document collection?
Exercise 1.7 [⋆]
Recommend a query processing order for
d. (tangerineORtrees)AND(marmaladeORskies)AND(kaleidoscopeOReyes)
Online edition (c) 2009 Cambridge UP
Term Postings size
eyes 213312
kaleidoscope 87009
marmalade 107913
skies 271658
tangerine 46653
trees 316812
Exercise 1.8 [⋆]
If the query is:
e. friendsANDromansAND(NOTcountrymen)
how could we use the frequency ofcountrymenin evaluating the best query evaluation
order? In particular, propose a way of handling negation in determining the order of query processing.
Exercise 1.9 [⋆⋆]
For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn’t.
Exercise 1.10 [⋆⋆]
Write out a postings merge algorithm, in the style of Figure1.6(page11), for anxORy query.
Exercise 1.11 [⋆⋆]
How should the Boolean queryxAND NOTybe handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.
1.4
The extended Boolean model versus ranked retrieval
The Boolean retrieval model contrasts withranked retrieval modelssuch as the
RANKED RETRIEVAL
MODEL vector space model (Section6.3), in which users largely usefree text queries, FREE TEXT QUERIES that is, just typing one or more words rather than using a precise language
with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic re-search on the advantages of ranked retrieval, systems implementing the Boo-lean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (ap-proximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND,OR, andNOT)
which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity oper-ators. Aproximity operatoris a way of specifying that two terms in a query
Online edition (c) 2009 Cambridge UP
must occur close to each other in a document, where closeness may be mea-sured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.
✎
Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/)is the largest commercial legal search service (in terms of the number of paying sub-scribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw:
Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.Query:"trade secret" /s disclos! /s prevent /s employe!
Information need:Requirements for disabled people to be able to access a work-place.
Query:disab! /p access! /s work-site work-place (employment /3 place)
Information need:Cases about a host’s responsibility for drunk guests. Query:host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest bind-ing operator), & isANDand /s, /p, and /kask for matches in the same sentence, same paragraph or withinkwords respectively. Double quotes give aphrase search (consecutive words); see Section2.4(page39). The exclamation mark (!) gives a trail-ing wildcard query (see Section3.2, page51); thusliab!matches all words starting
withliab. Additionallywork-sitematches any ofworksite, work-siteorwork site; see Section2.2.1(page22). Typical expert queries are usually carefully defined and incre-mentally developed until they obtain what look to be good results to the user.
Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This of-fers the user greater control and transparency over what is retrieved. And some do-mains, such as legal materials, allow an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to rec-ommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, ex-perimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw’s own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that usingANDoperators tends to produce high pre-cision but low recall searches, while usingORoperators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground.
Online edition (c) 2009 Cambridge UP
inverted index, comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and examined how to do efficient retrieval via linear time merges and simple query optimization. In Chapters2–7we will consider in detail richer query models and the sort of augmented index struc-tures that are needed to handle them efficiently. Here we just mention a few of the main additional things we would like to be able to do:
1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words.
2. It is often useful to search for compounds or phrases that denote a concept such as“operating system”. As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.
3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we needterm frequencyinformation (the number of times TERM FREQUENCY
a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.
With these additional ideas, we will have seen most of the basic technol-ogy that supports ad hoc searching over unstructured information. Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind the large eCommerce websites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologies of indexing and querying remain the same, as we will see in later chapters. Moreover, over time, web search engines have added at least partial imple-mentations of some of the most popular operators from extended Boolean models: phrase search is especially popular and most have a very partial implementation of Boolean operators. Nevertheless, while these options are liked by expert searchers, they are little used by most people and are not the main focus in work on trying to improve web search engine performance.