

5.4 Theoretical properties of the GDT estimators

– with Assumption 2, the GDT solution is asymptotically optimal for a particular $0 < \lambda_n < 1$, which decreases to 0 as the training sample size $n$ increases.
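Throughout this section, $\hat\theta_\lambda^{(n)}$ denotes the GDT estimator introduced earlier in the chapter; consistently with condition (i) of Theorem 1 and with (5.16) below, it can be written as the maximizer of the $\lambda$-weighted combination of the joint and conditional log-likelihoods of the training sample,

$$\hat\theta_\lambda^{(n)} = \arg\max_{\theta \in \Theta} \; \sum_{i=1}^{n} \Bigl[ \lambda \log p(x_i, y_i; \theta) + (1-\lambda) \log p(y_i \mid x_i; \theta) \Bigr], \qquad \lambda \in [0,1],$$

so that $\lambda = 1$ recovers the generative (joint ML) estimator and $\lambda = 0$ the discriminative (conditional ML) estimator.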

The first statement is very similar to a result given in Chapter 2. We refer the reader to Proposition 1 on page 37, in which the discriminative estimator can be replaced by any GDT estimator with $\lambda < 1$. This result is intuitive because the ML estimator $\hat\theta_1$ estimates the true model parameters $\theta_0$, which also minimize the classification loss since the Bayes rule is optimal at $\theta_0$. The second case is more interesting, since it justifies the use of GDT estimation when the conditional distribution is true but the joint distribution is false. We now give the main result of this chapter:

Theorem 1. If the following conditions are satisfied:

(i) for all $\varepsilon > 0$, the solution of $\max_{\theta \in \Theta} \bigl[\log p(Y|X;\theta) + \varepsilon \log p(X,Y;\theta)\bigr]$ is unique,

(ii) the conditional model belongs to the parametric family, i.e. there exists $\theta_0 \in \Theta$ such that $\log p(Y|X;\theta_0) = \log p(Y|X)$,

(iii) the asymptotic generative solution $\theta_1 = \lim_{n \to \infty} \hat\theta_1^{(n)}$ has a higher classification loss than the asymptotic discriminative solution $\theta_0 = \lim_{n \to \infty} \hat\theta_0^{(n)}$, i.e. $\mathbb{E}[\log p(Y|X;\theta_1)] < \mathbb{E}[\log p(Y|X;\theta_0)]$,

(iv) the model pdf $p(X,Y;\theta)$ is twice differentiable with respect to the parameter $\theta$ around $\theta_0$, and the second derivative is bounded,

then the value $\lambda_n = \arg\min_{\lambda \in [0,1]} \eta_n(\lambda)$ tends to 0 as $n$ tends to $\infty$. Moreover, for $n$ sufficiently large, there exists a constant $C > 0$ such that

$$\frac{C}{n} \;\le\; \lambda_n \;<\; 1. \qquad (5.14)$$

The proof of this theorem is given in Appendix B. The main point is that the optimal value of $\lambda$ is neither 0 nor 1: there are intermediate estimates ($\lambda \in \,]0,1[$) that have better generalization performance than the standard generative and discriminative estimators. In addition, the theorem gives a lower bound on the speed at which $\lambda_n$ converges to 0.

We apply this theorem to the LinearGDT estimator, i.e. when we use multivariate Gaussian distributions with a shared covariance matrix to model the class-conditional pdfs. This characterizes a class of distributions for which the GDT estimate has better generalization performance than LDA and linear logistic regression.

Corollary 1. If the sample distribution satisfies the following conditions (i) and (ii):

(i) there exists $\beta \in \mathbb{R}^d$ such that $p(Y=1|X) = \exp(\beta^\top X)\, p(Y=2|X)$,

(ii) the class-conditional distribution $p(X|Y=1)$ is not Gaussian,

then, for a sufficiently large training size $n$, the expected loss of the LinearGDT estimator is strictly smaller than the expected losses of LDA and of linear logistic regression.

In other words, Corollary 1 states that the optimal tuning parameter $\lambda_n = \arg\min_{\lambda \in [0,1]} \eta_n(\lambda)$ satisfies

$$\eta_n(\lambda_n) < \eta_n(0) \quad \text{(better than linear logistic regression)}$$

and

$$\eta_n(\lambda_n) < \eta_n(1) \quad \text{(better than the linear discriminant solution).}$$

Hence, for any non-Gaussian distribution for which the linear logistic model is true, the GDT estimate with $\lambda = \lambda_n$ has better prediction performance than both LDA and logistic regression.
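To make the LinearGDT case concrete, the following sketch writes out the $\lambda$-weighted criterion for two Gaussian class-conditionals sharing a covariance matrix. The function name, the parameterization (class priors, class means, shared covariance) and the use of scipy are illustrative only, not the implementation used in this chapter.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def lineargdt_objective(priors, means, cov, X, y, lam):
    """Lambda-weighted GDT criterion for the LinearGDT model (two Gaussian
    class-conditionals with a shared covariance matrix).

    priors : (2,) class probabilities, means : (2, d), cov : (d, d) shared.
    X : (n, d) inputs, y : (n,) labels in {0, 1}, lam : weight in [0, 1].
    lam = 1 gives the generative (LDA-style) log-likelihood,
    lam = 0 the conditional (logistic-style) log-likelihood.
    """
    y = np.asarray(y)
    # log p(x_i, Y = c; theta) for both classes, shape (n, 2)
    log_joint = np.stack(
        [np.log(priors[c]) + multivariate_normal.logpdf(X, means[c], cov)
         for c in (0, 1)], axis=1)
    # log p(y_i | x_i; theta), obtained by normalizing the joint over classes
    log_cond = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
    idx = np.arange(y.shape[0])
    return lam * log_joint[idx, y].sum() + (1 - lam) * log_cond[idx, y].sum()
```

Maximizing this criterion over (priors, means, cov), numerically for $0 < \lambda < 1$ and in closed form for $\lambda = 1$, would yield the LinearGDT estimate referred to in Corollary 1; the code only illustrates the criterion, not the optimization.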

5.4.2 Choice of λ

In practice, the value $\lambda_n$ is not directly available, as it requires knowledge of the sample distribution $p(X,Y)$.

The tuning parameter $\lambda$ functions like the smoothing parameter in regularization methods. $\lambda$ cannot be set on the basis of the minimum classification loss on the training set, since by definition $\lambda = 0$ gives the optimal $\theta$ for training-set classification. Instead, $\lambda$ is set to the value $\hat\lambda$ that minimizes the cross-validated classification loss

$$\hat\lambda = \arg\min_{\lambda \in \Lambda} \; -\sum_{i=1}^{\nu} \log p\bigl(y^{(i)} \mid x^{(i)}; \hat\theta_\lambda^{(i)}\bigr) \qquad (5.15)$$

where $\Lambda$ is a set of candidate values of $\lambda$ (between 0 and 1), $\hat\theta_\lambda^{(i)}$ is the GDT parameter estimate based on the $i$th training set, and $(x^{(i)}, y^{(i)})$ is the $i$th validation set. It might seem more natural to minimize the 0-1 classification loss at this point, but experimentally we find that it leads to less stable estimates of $\lambda$.
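A minimal sketch of this selection rule, assuming hypothetical helpers fit_gdt(X, y, lam), which returns the GDT estimate $\hat\theta_\lambda$ on a training split, and cond_log_lik(theta, X, y), which returns $\sum_i \log p(y_i \mid x_i; \theta)$; the grid of candidate values and the number of folds are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import KFold

def select_lambda(X, y, fit_gdt, cond_log_lik,
                  grid=np.linspace(0.0, 1.0, 11), n_folds=5):
    """Return the lambda in `grid` minimizing the cross-validated
    classification loss of (5.15), i.e. the negative held-out
    conditional log-likelihood accumulated over the folds."""
    cv_loss = np.zeros(len(grid))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X):
        for j, lam in enumerate(grid):
            # GDT fit on the training part of the fold
            theta = fit_gdt(X[train_idx], y[train_idx], lam)
            # add the negative conditional log-likelihood of the validation part
            cv_loss[j] -= cond_log_lik(theta, X[val_idx], y[val_idx])
    return grid[int(np.argmin(cv_loss))]
```

As noted above, the held-out conditional log-likelihood is used rather than the 0-1 loss, which was found experimentally to give less stable estimates of $\lambda$.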

If the optimal $\hat\lambda$ is close to one, the generative classifier is preferred. This suggests that the bias in $p_\theta(x,y)$ (if any) does not affect the discrimination ability of the model too much. Similarly, if $\hat\lambda$ is close to zero, it suggests that the model $p_\theta(x,y)$ does not fit the data well, and that the bias of the generative classifier is too high to provide good classification results. In this case, a more complex model (i.e. one with more parameters, or with less constrained ones) may be needed to reduce the bias. For intermediate $\lambda$, there is an equilibrium between the bias and the variance, meaning that the model complexity is well adapted to the amount of training data.

It would be useful to find computationally efficient alternatives to cross-validation for estimating a value of $\lambda$ that gives good classification performance on test data.

5.4.3 Bias-variance decomposition

We now give some theoretical insight into the optimal value of $\lambda$. The classification error can be decomposed into bias and variance terms, and the relative importance of these two terms is controlled by the GDT parameter $\lambda$. This explains why the generalization performance of the method is related to the choice of $\lambda$.

We denote by $\theta_\lambda$ the parameter values minimizing the expected GDT loss function:

$$\theta_\lambda = \arg\min_{\theta \in \Theta} \; \mathbb{E}\bigl[-\lambda \log p(X, Y; \theta) - (1-\lambda) \log p(Y|X; \theta)\bigr]. \qquad (5.16)$$

By the law of large numbers, these are the limiting values of the corresponding estimators $\hat\theta_\lambda^{(n)}$: we assume that the parameter estimates are consistent¹⁹, i.e. that $\hat\theta_\lambda^{(n)} \to \theta_\lambda$ almost surely as $n$ tends to infinity,

$$\theta_\lambda = \lim_{n \to \infty} \hat\theta_\lambda^{(n)}.$$

¹⁹ Sometimes the discriminative solution is not unique, but the GDT estimators for $\lambda > 0$ are consistent. In this case we define $\theta_0 = \lim_{\lambda \to 0} \theta_\lambda$. This occurs, for example, in the NB classifier and in the LDA model.

The “best” parameter in terms of classification loss is $\theta_C = \theta_0$, for which the discriminative estimator $\hat\theta_0$ is unbiased. However, this unbiasedness is associated with a high estimation variance. The GDT estimator allows some bias in return for a lower variance. Without loss of generality, we look for the estimator $\hat\theta_\lambda$ minimizing the quantity $\eta(\lambda)$ defined in (5.12). Let $L_C(\theta) = -\mathbb{E}[\log p(Y|X;\theta)]$ denote the expected classification loss. The overall classification loss $\eta(\lambda)$ can be split into bias and variance terms:

$$\eta(\lambda) \;=\; \underbrace{\mathbb{E}\bigl[L_C(\hat\theta_\lambda) - L_C(\theta_\lambda)\bigr]}_{\text{variance}(\lambda)} \;+\; \underbrace{L_C(\theta_\lambda) - L_C(\theta_C)}_{\text{bias}(\lambda)} \;+\; \text{bias}_0^2 \qquad (5.17)$$

where $\text{bias}_0^2$ is the irreducible minimal model bias $L_C(\theta_C) + \mathbb{E}[\log p(Y|X)]$. Since the generative estimator is the minimum-variance estimator of $\theta$, $\text{variance}(\lambda)$ is minimal for $\lambda = 1$ (the proof is given in the appendix). Conversely, $\text{bias}(0) = 0$, i.e. the bias term is minimal for the discriminative estimator. As for many other learning methods, the estimator should be chosen so as to balance these two quantities.
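For completeness: assuming that $\eta(\lambda)$ defined in (5.12) is the excess expected classification loss of $\hat\theta_\lambda$ over the Bayes-optimal loss $L_C^{\ast} = \mathbb{E}[-\log p(Y|X)]$ (an assumption consistent with the terms above), the decomposition (5.17) is simply the telescoping identity obtained by adding and subtracting $L_C(\theta_\lambda)$ and $L_C(\theta_C)$:

$$\eta(\lambda) = \mathbb{E}\bigl[L_C(\hat\theta_\lambda)\bigr] - L_C^{\ast} = \underbrace{\mathbb{E}\bigl[L_C(\hat\theta_\lambda) - L_C(\theta_\lambda)\bigr]}_{\text{variance}(\lambda)} + \underbrace{L_C(\theta_\lambda) - L_C(\theta_C)}_{\text{bias}(\lambda)} + \underbrace{L_C(\theta_C) - L_C^{\ast}}_{\text{bias}_0^2}.$$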