
This chapter has introduced conditions that PRF models should satisfy. These conditions are based on standard IR constraints, with the addition of a Document Frequency (DF) constraint which we have experimentally validated. We have also partially validated the IDF condition, and we have then investigated standard PRF models with respect to these constraints.

This theoretical study has revealed several important points.

First, the simple mixture and the divergence minimization models are deficient: one does not satisfy the DF constraint while the other does not sufficiently enforce the IDF effect. Second, relevance models and their geometric variant do not satisfy the IDF condition either. Note that smoothed language models do enforce the IDF effect for ad hoc retrieval, but we showed here that this is not always the case for PRF. Only two models satisfy all the PRF constraints except the one related to the document score (DS): the log-logistic and the smoothed power law models of the information-based family. Moreover, all these theoretical findings were experimentally illustrated. Overall, this chapter provides an explanation of why the information-based models perform better than other models in PRF settings.

Chapter 7

Conclusion

We have studied probabilistic models for word frequencies and for information retrieval.

Our goal was to link these probabilistic models in order to obtain, at the same time, a good model of word frequencies and a good IR model.

First of all, we have studied the problem of modeling word frequencies in a collection, with a major emphasis on the burstiness phenomenon, and we have reviewed state-of-the-art probabilistic models such as the 2-Poisson mixture model, the Negative Binomial, the Beta Binomial and its multivariate extensions DCM and EDCM.

Even if the burstiness phenomenon is often mentioned in studies dealing with probabilistic models of word frequencies, its precise meaning is vague and it is often poorly defined.

This is why we returned to the roots of the burstiness characterization as proposed by Katz [45]. We then summarized the significant studies of Church [13, 12], who introduced and validated empirically the notion of adaptation for word frequencies.

Overall, burstiness addresses the fact that, for a given word, its occurrences in a document are far from being independent of each other. Even if burstiness has been studied extensively in several studies, each of which proposed a different probabilistic model, our approach differs from others by tackling this phenomenon with a formal definition of burstiness. This definition translates into a property of probability distributions, related to the log-convexity of the survival function P(X > x). This formal definition makes it possible to test whether or not a distribution can account for burstiness, and it can guide the design of new probability distributions.
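To make this criterion concrete, the following minimal sketch (an illustration, not code from the thesis) checks the log-convexity of log P(X > x) numerically through its discrete second differences, using scipy's fisk distribution as the log-logistic; the parameters are purely illustrative.

import numpy as np
from scipy import stats

def log_sf_second_differences(sf_values):
    """Second differences of log P(X > x); non-negative values indicate log-convexity."""
    return np.diff(np.log(sf_values), n=2)

x = np.arange(0, 50)

# Exponential: log-survival is linear in x, hence not strictly log-convex (not bursty).
expo = stats.expon(scale=2.0)
# Log-logistic with shape 1 (scipy's "fisk"): survival b / (b + x), log-convex (bursty).
llog = stats.fisk(c=1.0, scale=2.0)

print(log_sf_second_differences(expo.sf(x)).min())  # approximately 0
print(log_sf_second_differences(llog.sf(x)).min())  # strictly positive

A distribution whose log-survival curves upward in this way assigns an increasing conditional probability to further occurrences once a word has already appeared, which is in line with the adaptation behavior described by Church [13, 12].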

This formal definition of burstiness led us to consider the family of power law distributions as candidates for modeling word frequencies. We introduced two novel models of word frequencies: the Beta Negative Binomial distribution [14], a discrete model, and the Log-Logistic distribution [16, 15], a continuous one. The Beta Negative Binomial builds on the Negative Binomial proposed by Church [13], as it can be viewed as an infinite mixture of Negative Binomial distributions, which is itself an infinite mixture of Poisson distributions. We then showed how particular instances of the Log-Logistic distribution can be viewed as a continuous counterpart of the Beta Negative Binomial. For both distributions, and contrary to the DCM and EDCM models, we provided constant-time estimation procedures, based on a generalized method of moments that can be approximated with the mean document frequency.

The Beta Negative Binomial and the Log-Logistic were compared to Poisson distributions and to the Katz mixture on several IR collections. We stressed that the burstiness phenomenon is related to a variance problem. Experiments have validated the ability of the BNB and Log-Logistic distributions to capture word burstiness, which furthermore suggests that the definition of burstiness we proposed is indeed appropriate. In a nutshell, the Beta Negative Binomial and the Log-Logistic distributions are sound models of word frequencies: they enjoy good theoretical properties, as bursty distributions, and they fit word frequencies well empirically.

Our second goal was to design an IR model compatible with word burstiness, as state-of-the-art IR models do not rely on bursty distributions according to our definition. We then studied probabilistic information retrieval models and reviewed the Probability Ranking framework, the language modeling approach to IR and the Divergence From Randomness models. Although these three probabilistic IR families differ from each other, either by their underlying theoretical framework or by a distinct choice of word frequency distribution, all these well-performing models share common properties that allow one to describe them in a single framework. This is the approach advocated by Fang [33], referred to as the axiomatic approach to IR. The main conditions that an IR model should satisfy are the TF, Concave, Document Length and IDF conditions. We gave an analytical version of these axiomatic constraints in order to ease their application to the analysis of IR models.
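For reference, a paraphrased analytical sketch of these conditions is given below (the precise formulation in the thesis may differ in detail). Writing RSV(q,d) for the retrieval score, x_w^d for the frequency of term w in d, l_d for the document length and z_w for a collection statistic such as N_w/N, the conditions roughly require:

$$
\begin{aligned}
\text{(TF)} &\quad \frac{\partial\, \mathrm{RSV}(q,d)}{\partial x_w^d} > 0, &
\text{(Concave)} &\quad \frac{\partial^2 \mathrm{RSV}(q,d)}{\partial (x_w^d)^2} < 0, \\
\text{(Document Length)} &\quad \frac{\partial\, \mathrm{RSV}(q,d)}{\partial l_d} < 0, &
\text{(IDF)} &\quad \frac{\partial\, \mathrm{RSV}(q,d)}{\partial z_w} < 0 .
\end{aligned}
$$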

Among the three main families of probabilistic IR models, the Divergence From Randomness framework seemed to be the best fit for the Beta Negative Binomial and Log-Logistic requirements, and this is why we analyzed Divergence From Randomness models with the axiomatic conditions [15, 19]. Above all, this analysis revealed a link between the first normalization principle of DFR models and a particular property of IR models, namely the concavity in term frequencies. Overall, it seems that there is no direct alignment between the concavity of IR models and the burstiness property of the underlying probability distributions: most state-of-the-art models are concave functions of the term frequency, but most of the distributions used are not bursty. However, burstiness and the concavity of the IR model in the term frequency seem to be two sides of the same coin, which suggested that current paradigms for IR models are not fully compatible with the bursty distributions we wanted to use. This is what seeded the idea of a novel IR family: information-based models.

Information models [17, 19, 18] draw their inspiration from a long-standing hypothesis in IR, namely that the difference between the behavior of a word at the document level and at the collection level brings information on the significance of the word for the document.

Shannon information can be used to capture how much a word deviates from its average behavior, and we showed how to design IR models based on this information. Above all, these models have a remarkable property: a direct relationship between the burstiness of the probability distribution used and the concavity of the resulting IR model.

In addition, information-based models enjoy good theoretical properties: they satisfy most retrieval heuristic constraints when they rely on bursty distributions, and they can be understood as a simplification of DFR models in the light of retrieval constraints and burstiness.

We then proposed two effective IR models within this family: the log-logistic and the smoothed power law models. The experiments we conducted on different collections illustrate the good behavior of these models: they yield state-of-the-art performance without pseudo-relevance feedback, and significantly outperform state-of-the-art models with pseudo-relevance feedback. We tested these models with different term frequency normalizations and extended them with the beneficial use of the q-logarithm.
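For concreteness, here is a minimal sketch of the log-logistic (LGD) scoring function of this family, following the form given in the corresponding publications; the variable names, the dictionary-based inputs and the default value of the normalization constant c are illustrative assumptions, not the exact experimental setup.

import math

def lgd_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freq, n_docs, c=1.0):
    """Log-logistic (LGD) retrieval score for one (query, document) pair.

    query_tf: {term: frequency in the query}
    doc_tf:   {term: raw frequency in the document}
    doc_freq: {term: number of documents containing the term}
    """
    score = 0.0
    for term, qtf in query_tf.items():
        x = doc_tf.get(term, 0)
        if x == 0:
            continue
        # DFR-style term frequency normalization (second normalization principle)
        t = x * math.log(1.0 + c * avg_doc_len / doc_len)
        # lambda_w estimated from the document frequency ratio N_w / N
        lam = doc_freq[term] / n_docs
        # -log P(X >= t) for the log-logistic distribution: log((lam + t) / lam)
        score += qtf * math.log((lam + t) / lam)
    return score

The bursty (log-convex) survival function of the log-logistic is what makes the contribution log((lam + t)/lam) concave in t, which is the relationship between burstiness and concavity mentioned above.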

Furthermore, the good performance of information models for pseudo-relevance feedback led us to analyze theoretically and empirically several pseudo-relevance feedback algorithms [20, 21]. As a result, a list of pseudo-feedback constraints was drawn up to better characterize valid pseudo-feedback algorithms. In particular, we formulated heuristic constraints for PRF similar to the TF, Concave, Document Length and IDF effects for IR models, with an additional constraint referred to as the Document Frequency effect. We analyzed the terms chosen by several PRF models, validated the DF condition experimentally, and reviewed several models according to the PRF conditions. The theoretical study we conducted reveals that several standard PRF models fail to enforce either the IDF effect or the DF effect. For example, the simple mixture in the language modelling family is deficient as it does not satisfy the DF constraint, and relevance models do not satisfy the IDF condition. On the contrary, the log-logistic and the smoothed power law models satisfy all the PRF properties. This theoretical analysis thus provides an explanation of why the information-based models perform better than other models in PRF settings. All in all, we have proposed:

1. new probabilistic models of word frequencies: the Beta Negative Binomial and the Log-Logistic distributions.

2. new probabilistic models for IR: the information-based models.

3. a theoretical analysis of PRF models.

These new models were analyzed thoroughly and were proved to be theoretically and empirically sound.
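As an illustration of the pseudo-relevance feedback analysis summarized above, here is a minimal sketch (an illustration under assumed names and defaults, not the exact algorithm evaluated in the thesis) of expansion-term weighting in which the log-logistic information measure is summed over the pseudo-relevant documents.

import math

def feedback_term_weights(feedback_docs, doc_freq, n_docs, avg_doc_len, c=1.0, top_k=10):
    """Score candidate expansion terms over a set of pseudo-relevant documents.

    feedback_docs: list of ({term: raw frequency}, document length) pairs
    Returns the top_k terms with the highest accumulated weight.
    """
    weights = {}
    for tf, doc_len in feedback_docs:
        for term, x in tf.items():
            t = x * math.log(1.0 + c * avg_doc_len / doc_len)
            lam = doc_freq[term] / n_docs
            # A term absent from a document contributes nothing, so terms occurring in
            # many feedback documents are favoured: this is the Document Frequency effect.
            weights[term] = weights.get(term, 0.0) + math.log(1.0 + t / lam)
    return sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]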

Future Work

There are several interesting questions and directions in which to pursue our study of word frequencies and IR models. Are there properties other than burstiness characterizing word frequencies that would be relevant to capture and to use in an IR model? In other words, as word frequency data and IR models have distinct and common properties, which general properties should be preserved in an IR setting? For example, grouped word frequencies exhibit a Zipfian distribution [4] and Heaps' law [5] characterizes the relation between the size of the vocabulary and the collection length. Does the log-logistic IR model fully account for these phenomena? Should these features be taken into account in an IR setting?

Another question deals with the axiomatic approach to IR. The axiomatic constraints describe general conditions which are not sufficient on their own, as it is possible to design an IR model that meets these constraints and yet performs poorly empirically. Overall, the three mainstream families of IR models perform similarly, and this is why, ideally, we would like to have a mathematical notion of equivalence between IR models, or a notion of capacity of a ranking function. For example, it would be interesting to formalize the fact that the InL2 DFR model performs similarly to the log-logistic model on a given neighborhood of queries and documents, which could be written as $\mathrm{InL2} \sim_{V(q,d)} \mathrm{LGD}$. The axiomatic constraints could then be understood as a weak form of equivalence between IR models.

The last direction we will mention would extend our study to a supervised setting with learning-to-rank methods. Could we infer new axiomatic conditions from training data, which could then be used more generally for several IR tasks? Conversely, could or should we learn an IR model compatible with some axiomatic constraints?

Appendices


A1 Preprocessing

Many applications require a preprocessing step in order to extract the relevant information from the raw content of a document. For most IR scenarios, the preprocessing of documents consists in filtering out overly frequent words (stopwords) like articles, in standardizing, if required, the surface form of the observed words (removing conjugations and plurals), and in counting, for each term, its number of occurrences in a document. For example, let us examine the following (famous) French verses from La Fontaine:

Maître Corbeau, sur un arbre perché, Tenait en son bec un fromage.

Maître Renard, par l'odeur alléché, Lui tint à peu près ce langage : Hé ! bonjour, Monsieur du Corbeau.

Que vous êtes joli ! que vous me semblez beau !

Initially, the preprocessing step filters out stopwords like "ce", "un", etc. Then, the occurrences of the words are counted: the term Corbeau has a number of occurrences of 2 in this document. In the same way, fromage occurs once. One can thus represent a document by a vector in which each dimension contains the frequency of a particular term. For example, a preprocessing of the previous text can lead to the following vectorial representation:

$$\vec{d} = (\text{maitre}: 2,\ \text{corbeau}: 2,\ \text{arbre}: 1,\ \text{perche}: 1,\ \text{tenait}: 1,\ \text{bec}: 1,\ \text{fromage}: 1,\ \text{tint}: 1,\ \text{langage}: 1,\ \text{bonjour}: 1,\ \text{joli}: 1,\ \text{semblez}: 1,\ \text{beau}: 1,\ \ldots,\ \text{cigale}: 0,\ \text{fourmi}: 0,\ \ldots)$$

Each dimension in the vector corresponds to a given index term, and the coordinate corresponds here to the number of occurrences of the word in the document. All non-occurring terms have 0 occurrences and are not explicitly represented in the previous vector. The frequencies of the different words are assumed to be statistically independent. For example, it will be assumed that the random variable for occurrences of fromage is independent of the random variable for Corbeau. After preprocessing, the corpus of documents can then be represented as a matrix $X = (x_{wd})$, where rows stand for words and columns for documents.
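As an illustration of this preprocessing step, here is a minimal Python sketch; the stopword list is a toy set, and, unlike the vector above, no stemming or accent removal is performed.

import re
from collections import Counter

STOPWORDS = {"ce", "un", "en", "son", "sur", "par", "lui", "que", "vous", "me", "du", "la", "le"}

def preprocess(text):
    """Tokenize, filter out stopwords and count term occurrences."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

verse = ("Maître Corbeau, sur un arbre perché, Tenait en son bec un fromage. "
         "Maître Renard, par l'odeur alléché, Lui tint à peu près ce langage : "
         "Hé ! bonjour, Monsieur du Corbeau. "
         "Que vous êtes joli ! que vous me semblez beau !")

d = preprocess(verse)
print(d["corbeau"])   # 2
print(d["fromage"])   # 1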

A2 Estimation of the 2-Poisson Model

The moment generating function for the 2-Poisson is given by:

$$
\begin{aligned}
m(t) = E(e^{tX}) &= \alpha \sum_{x=0}^{+\infty} \frac{e^{-\lambda_E}\lambda_E^x}{x!}\, e^{tx} + (1-\alpha) \sum_{x=0}^{+\infty} \frac{e^{-\lambda_G}\lambda_G^x}{x!}\, e^{tx} \\
&= \alpha\, e^{-\lambda_E} \sum_{x=0}^{+\infty} \frac{(e^{t}\lambda_E)^x}{x!} + (1-\alpha)\, e^{-\lambda_G} \sum_{x=0}^{+\infty} \frac{(e^{t}\lambda_G)^x}{x!} \\
&= \alpha\, e^{\lambda_E(e^t-1)} + (1-\alpha)\, e^{\lambda_G(e^t-1)}
\end{aligned}
$$

The first three derivatives of m(t) are computed and t is then set to 0, which gives the first three moments of the distribution:

$$
\begin{aligned}
R_1 &= \alpha\lambda_E + (1-\alpha)\lambda_G &\quad (1)\\
R_2 &= \alpha(\lambda_E^2+\lambda_E) + (1-\alpha)(\lambda_G^2+\lambda_G) &\quad (2)\\
R_3 &= \alpha(\lambda_E^3+3\lambda_E^2+\lambda_E) + (1-\alpha)(\lambda_G^3+3\lambda_G^2+\lambda_G) &\quad (3)
\end{aligned}
$$

Let $M = R_1$, $L = R_2 - R_1$ and $K = R_3 + 2R_1 - 3R_2$. Then:

$$
\begin{aligned}
M &= \alpha\lambda_E + (1-\alpha)\lambda_G &\quad (4)\\
L &= \alpha\lambda_E^2 + (1-\alpha)\lambda_G^2 &\quad (5)\\
K &= \alpha\lambda_E^3 + (1-\alpha)\lambda_G^3 &\quad (6)
\end{aligned}
$$

These equations show that $\lambda_E$ and $\lambda_G$ are the roots of:

$$(M^2 - L)\,\lambda^2 + (K - LM)\,\lambda + (L^2 - MK) = 0 \qquad (7)$$

Finally, $\alpha$ can be estimated with:

$$\alpha = \frac{M - \lambda_G}{\lambda_E - \lambda_G} \qquad (8)$$
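The procedure above translates directly into a short estimation routine; the following sketch assumes the sample moments lead to real roots, which may fail for samples that are not well described by a 2-Poisson mixture.

import numpy as np

def fit_two_poisson(x):
    """Method-of-moments estimation of (alpha, lambda_E, lambda_G) from a sample of counts."""
    x = np.asarray(x, dtype=float)
    r1, r2, r3 = x.mean(), (x**2).mean(), (x**3).mean()
    m, l, k = r1, r2 - r1, r3 + 2 * r1 - 3 * r2
    # lambda_E and lambda_G are the roots of (M^2 - L) y^2 + (K - LM) y + (L^2 - MK) = 0
    lam_e, lam_g = np.roots([m**2 - l, k - l * m, l**2 - m * k])
    alpha = (m - lam_g) / (lam_e - lam_g)
    return alpha, lam_e, lam_g

rng = np.random.default_rng(0)
sample = np.where(rng.random(10_000) < 0.3, rng.poisson(5.0, 10_000), rng.poisson(0.5, 10_000))
print(fit_two_poisson(sample))  # roughly (0.3, 5.0, 0.5), up to the ordering of the roots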

Publications

Here is the list of my publications, where the symbol ? refers to publications prior to this thesis.

International Journal Papers

• S. Clinchant and E. Gaussier. Retrieval constraints and word frequency distributions: a log-logistic model for IR. Information Retrieval, 14(1), 2010.

? J. Ah-Pine, M. Bressan, S. Clinchant, G. Csurka, Y. Hoppenot, and J.-M. Renders. Crossing textual and visual content in different application scenarios. Multimedia Tools Appl., 42(1):31–56, 2009.

Book Chapter

• J. Ah-Pine, S. Clinchant, G. Csurka, F. Perronnin, and J.-M. Renders. Leveraging Image, Text and Cross-media Similarities for Diversity-focused Multimedia Retrieval in ImageCLEF. In Experimental Evaluation in Visual Information Retrieval, Springer Series: The Information Retrieval Series, 2010.

National Journal Papers

• S. Clinchant and E. Gaussier. Modèles de RI fondés sur l'information. Document Numérique, June 2011, to appear.

International Conferences

• S. Clinchant and E. Gaussier. Is document frequency important for PRF? In Proceedings of the International Conference on the Theory of Information Retrieval, ICTIR 2011, to appear.

• S. Clinchant, J. Ah-Pine, and G. Csurka. Semantic combination of textual and visual information in multimedia retrieval. In International Conference on Multimedia Retrieval, 2011.

• S. Clinchant and E. Gaussier. Information-based models for ad hoc IR. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 234–241, New York, NY, USA, 2010. ACM. Nominated for Best Paper award.

• S. Clinchant and E. Gaussier. Bridging language modeling and divergence from randomness models: A log-logistic model for IR. In Proceedings of the International Conference on the Theory of Information Retrieval, pages 54–65, 2009.


• S. Clinchant and E. Gaussier. The BNB distribution for text modeling. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, and R. W. White, editors, European Conference in Information Retrieval ECIR, volume 4956 of Lecture Notes in Computer Science, pages 150–161. Springer, 2008.

? S. Clinchant, C. Goutte, and E. Gaussier. Lexical entailment for information retrieval. In ECIR, pages 217–228, 2006.

National Conferences

• S. Clinchant and E. Gaussier. A document frequency constraint for pseudo-relevance feedback models. In CORIA, pages 73–88. Éditions Universitaires d'Avignon, 2011.

• S. Clinchant and E. Gaussier. Modèles de RI fondés sur l'information. In CORIA, pages 99–114, 2010. Prix du meilleur article.

Posters

• S. Clinchant and E. Gaussier. Do IR models satisfy the TDC constraint? In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, to appear.

• G. Csurka, S. Clinchant, and G. Jacquet. Medical Image Modality Classification and Retrieval. In 9th International Workshop on Content-Based Multimedia Indexing, 2011.

• S. Clinchant and E. Gaussier. Retrieval constraints and word frequency distributions: a log-logistic model for IR. In CIKM, pages 1975–1978, 2009.

Index

2-Poisson, 30, 67
adaptation, 29
Axiomatic Approach, 79
Beta Binomial Model, 35
Beta Negative Binomial, 43
Binary Independence Retrieval, 66
Binomial Distribution, 23
BM25, 67
Burstiness, 27, 38, 113
Burstiness Characterization, 39
Chi Square Test, 56
Concave Effect, 81, 89, 91, 96, 113
DFR, 74, 80, 88
Dirichlet Compound Multinomial, 36, 69
Dirichlet Smoothing, 70
Divergence From Randomness, 74, 88, 90, 97
Document Frequency Constraint, 125
Document Length Effect, 83
EDCM, 37, 70
Generalized Method of Moments, 33
Goodness of fit, 48
Heuristic Constraints, 79, 96, 113
IDF Effect, 84
Information Models, 94
Jelinek-Mercer Smoothing, 70
K-mixture, 34
Language Models, 70, 80, 98
Latent Dirichlet Allocation, 26
Log-Logistic, 47, 98, 113
Multinomial Distribution, 23, 69, 70
Negative Binomial, 32
Okapi, 67
Pólya's Urn, 35
PLSA, 26
Probability Ranking Principle, 64
Pseudo Relevance Feedback, 115
q-logarithm, 111
Shannon Information, 74, 94
Smoothed Power Law, 100, 113
Smoothing, 70
Term Frequency Effect, 81
term frequency normalization, 68, 95, 108
Topics Models, 24
Word Frequencies, 21

Bibliography

[1] Giambattista Amati, Claudio Carpineto, Giovanni Romano, and Fondazione Ugo Bordoni. Fondazione Ugo Bordoni at TREC 2003: robust and web track, 2003.

[2] Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, 2002.

[3] Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[4] R.H. Baayen. Word Frequency Distributions. Kluwer Academic, 2001.

[5] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Publishing Company, USA, 2nd edition, 2008.

[6] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, October 1999.

[7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[8] Wray Buntine and Aleks Jakulin. Applying discrete PCA in data analysis. In AUAI '04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 59–66, Arlington, Virginia, United States, 2004. AUAI Press.

[9] Chris Burges, Tal Shaked, Erin Renshaw, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.

[10] Deepayan Chakrabarti and Christos Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1):2, 2006.

[11] Kenneth Church and William A. Gale. Inverse document frequency (idf): A measure of deviations from Poisson. In Proceedings of the Third Workshop on Very Large Corpora, pages 121–130, 1995.

[12] Kenneth W. Church. Empirical estimates of adaptation: the chance of two noriegas is closer to p/2 than p². In Proceedings of the 18th Conference on Computational Linguistics, pages 180–186, Morristown, NJ, USA, 2000. Association for Computational Linguistics.

[13] Kenneth W. Church and William A. Gale. Poisson mixtures. Natural Language Engineering, 1:163–190, 1995.


[14] Stéphane Clinchant and Éric Gaussier. The BNB distribution for text modeling. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, ECIR, volume 4956 of Lecture Notes in Computer Science, pages 150–161. Springer, 2008.

[15] Stéphane Clinchant and Éric Gaussier. Bridging language modeling and divergence from randomness models: A log-logistic model for IR. In ICTIR, pages 54–65, 2009.

[16] Stéphane Clinchant and Éric Gaussier. Retrieval constraints and word frequency distributions: a log-logistic model for IR. In CIKM, pages 1975–1978, 2009.

[17] Stéphane Clinchant and Éric Gaussier. Information-based models for ad hoc IR. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 234–241, New York, NY, USA, 2010. ACM.

[18] Stéphane Clinchant and Éric Gaussier. Modèles de RI fondés sur l'information. In CORIA, pages 99–114, 2010.

[19] Stéphane Clinchant and Éric Gaussier. Retrieval constraints and word frequency distributions: a log-logistic model for IR. Information Retrieval, 14(1), 2010.

[20] Stéphane Clinchant and Éric Gaussier. A document frequency constraint for pseudo-relevance feedback models. In CORIA, pages 73–88. Éditions Universitaires d'Avignon, 2011.

[21] Stéphane Clinchant and Éric Gaussier. Is document frequency important for PRF? In ICTIR, to appear, 2011.

[22] K. Collins-Thompson. Estimating robust query models with convex optimization. In NIPS, pages 329–336, 2008.

[23] K. Collins-Thompson. Reducing the risk of query expansion via robust constrained optimization. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 837–846, New York, NY, USA, 2009. ACM.

[24] Kevyn Collins-Thompson and Jamie Callan. Estimation and use of uncertainty in pseudo-relevance feedback. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’07, pages 303–310, New York, NY, USA, 2007. ACM.

[25] D. W. Crabtree, P. Andreae, and X. Gao. Exploiting underrepresented query aspects for automatic query expansion. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 191–200, New York, NY, USA, 2007. ACM.

[26] Ronan Cummins and Colm O'Riordan. An axiomatic comparison of learned term-weighting schemes in information retrieval: clarifications and extensions. Artif. Intell. Rev., 28:51–68, June 2007.

[27] Scott Deerwester. Improving Information Retrieval with Latent Semantic Indexing. In Christine L. Borgman and Edward Y. H. Pai, editors, Proceedings of the 51st ASIS Annual Meeting (ASIS '88), volume 25, Atlanta, Georgia, October 1988. American Society for Information Science.