Part 2 - Gatsby Computational Neuroscience Unit

(1)

Kernel Methods for Comparing Distributions and Training Generative Models: Part 2

Arthur Gretton

Gatsby Computational Neuroscience Unit, University College London

OAMLS, 2022

1/34

(2)

Training generative models

Have: One collection of samplesX from unknown distributionP.

Goal: generate samplesQ that look like P

LSUN bedroom samples P Generated Q, MMD GAN

Role of divergence D ( P

^;

Q )?

2/34

(3)

Outline

-divergences (f-divergences) and a variational lower bound (KL) Generalized energy-based models

“Like a GAN” but incorporatecriticinto sample generation

Perform better than usinggeneratoralone

Arbel, Zhou, G., Generalized Energy Based Models (ICLR 2021)

3/34

(4)

Divergences

4/34

(5)

Divergences

5/34

(6)

The Integral Probability Metrics

6/34

(7)

Maximum mean discrepancy

Ahelpful critic witness:

MMD

(

^P^;^Q

) = sup

^k^f^k^F1E_Pf

(

^X

)

^E^Q^f

(

^Y

)

^.

MMD=1.8

7/34

(8)

Maximum mean discrepancy

Ahelpful critic witness:

MMD

(

^P^;^Q

) = sup

^k^f^k^F1E_Pf

(

^X

)

^E^Q^f

(

^Y

)

MMD=1.1

7/34

(9)

The

-divergences

8/34

(10)

The

-divergences

Define the -divergence(f-divergence):

D

(

^P^;^Q

) =

^Z ^p

(

^z

)

q

(

^z

)

q

(

^z

)

^dz

where is convex, lower-semicontinuous,

(

¹

) =

^0.

Example:

(

^u

) =

^u

log(

^u

)

gives KL divergence, D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

=

^Z ^p

(

^z

)

q

(

^z

)

log

^p

(

^z

)

q

(

^z

)

q

(

^z

)

^dz

9/34

(11)

The

-divergences

Define the -divergence(f-divergence):

D

(

^P^;^Q

) =

^Z ^p

(

^z

)

q

(

^z

)

q

(

^z

)

^dz

where is convex, lower-semicontinuous,

(

¹

) =

^0.

Example:

(

^u

) =

^u

log(

^u

)

gives KL divergence, D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

=

^Z ^p

(

^z

)

q

(

^z

)

log

^p

(

^z

)

q

(

^z

)

q

(

^z

)

^dz

9/34

(12)

Are

-divergences good critics?

Simple example: disjoint support.

Goodfellow et al. (NeurIPS 2014), Arjovsky and Bottou [ICLR 2017]

DKL

(

^P^;^Q

) =

¹ ^D^JS

(

^P^;^Q

) = log

²

10/34

(13)

Are

-divergences good critics?

Simple example: disjoint support.

Goodfellow et al. (NeurIPS 2014), Arjovsky and Bottou [ICLR 2017]

D_KL

(

^P^;^Q

) =

¹ ^D^JS

(

^P^;^Q

) = log

²

10/34

(14)

-divergences in practice

Background: the conjugate (Fenchel) dual

(

^v

) = sup

u^2R

fuv

(

^u

)

^g^:

(

^v

)

is negative intercept of tangent towith slopev

11/34

(15)

-divergences in practice

(

^v

) = sup

u^2R

fuv

(

^u

)

^g^:

For a convex l.s.c. we have

(

^x

) =

(

^x

) = sup

v^2R

fxv

(

^v

)

^g

11/34

(16)

-divergences in practice

(

^v

) = sup

u^2R

fuv

(

^u

)

^g^:

For a convex l.s.c. we have

(

^x

) =

(

^x

) = sup

v^2R

fxv

(

^v

)

^g

KL divergence:

(^x) =^xlog(^x) (^v) = exp(^v ¹)

11/34

(17)

A variational lower bound

A lower-bound -divergence approximation:

D

(

^P^;^Q

) =

^Z ^q

(

^z

)

^p

(

^z

)

q

(

^z

)

dz

Nguyen, Wainwright, Jordan, IEEE Transactions on Information Theory (2010);

Nowozin, Cseke, Tomioka, NeurIPS (2016)

12/34

(18)

A variational lower bound

D

(

^P^;^Q

) =

^Z ^q

(

^z

)

^p

(

^z

)

q

(

^z

)

dz

=

^Z ^q

(

^z

)sup

fz

p

(

^z

)

q

(

^z

)

^f^z

(

^f^z

)

| {z }

p⁽z⁾ q⁽z⁾

(^v)^{is dual of}(^x)^:

12/34

(19)

A variational lower bound

D

(

^P^;^Q

) =

^Z ^q

(

^z

)

^p

(

^z

)

q

(

^z

)

dz

=

^Z ^q

(

^z

)sup

fz

p

(

^z

)

q

(

^z

)

^f^z

(

^f^z

)

sup

f^2H

E_Pf

(

^X

)

^E^Q

(

^f

(

^Y

))

(restrict the function class)

Nowozin, Cseke, Tomioka, NeurIPS (2016) ^12/34

(20)

A variational lower bound

D

(

^P^;^Q

) =

^Z ^q

(

^z

)

^p

(

^z

)

q

(

^z

)

dz

=

^Z ^q

(

^z

)sup

fz

p

(

^z

)

q

(

^z

)

^f^z

(

^f^z

)

sup

f^2H

E_Pf

(

^X

)

^E^Q

(

^f

(

^Y

))

(restrict the function class) Bound tight when:

f(^z) =

p(^z) q(^z)

if ratio defined.

(21)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

13/34

(22)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

sup

f^2H

E_Pf

(

^X

) +

¹ ^E^Q

exp(

^f

(

^Y

))

| {z }

( ^f(^Y)+¹)

13/34

(23)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

sup

f^2H

E_Pf

(

^X

) +

¹ ^E^Q

exp(

^f

(

^Y

))

Bound tight when:

f(^z) = log^p(^z) q(^z) if ratio defined.

13/34

(24)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

sup

f^2H

E_Pf

(

^X

) +

¹ ^E^Q

exp(

^f

(

^Y

))

sup

f^2H

2

4

1 n

n

X

j=¹^f

(

^xⁱ

)

_n¹ ^Xⁿ

i=¹

exp(

^f

(

^yⁱ

))

3

5

+

¹

x_i ⁱ^:ⁱ^:^d^:P y_i ⁱ^:ⁱ^:^d^:Q

13/34

(25)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

sup

f^2H

E_Pf

(

^X

) +

¹ ^E^Q

exp(

^f

(

^Y

))

sup

f^2H

2

4

1 n

n

X

j=¹^f

(

^xⁱ

)

_n¹ ^Xⁿ

i=¹

exp(

^f

(

^yⁱ

))

3

5

+

¹

This is a KL

Approximate Lower-bound Estimator.

(26)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

sup

f^2H

E_Pf

(

^X

) +

¹ ^E^Q

exp(

^f

(

^Y

))

sup

f^2H

2

4

1 n

n

X

j=¹^f

(

^xⁱ

)

_n¹ ^Xⁿ

i=¹

exp(

^f

(

^yⁱ

))

3

5

+

¹

This is a K A L E

(27)

Case of the KL

D_KL

(

^P^;^Q

) =

^Z

log

^p

(

^z

)

q

(

^z

)

p

(

^z

)

^dz

sup

f^2H

E_Pf

(

^X

) +

¹ ^E^Q

exp(

^f

(

^Y

))

sup

f^2H

2

4

1 n

n

X

j=¹^f

(

^xⁱ

)

_n¹ ^Xⁿ

i=¹

exp(

^f

(

^yⁱ

))

3

5

+

¹

The KALE divergence

13/34

(28)

Empirical properties of KALE

KALE

(

^P^;^Q

;

^H

) = sup

f^2H

E_Pf

(

^X

)

^E^Q

exp(

^f

(

^Y

)) +

¹

f

=

^h^w^;

(

^x

)

ⁱ^H ^H^{an RKHS}

kw^k²

H penalized

:

Glaser, Arbel, G. “KALE Flow: A Relaxed KL Gradient Flow for Probabilities with

Disjoint Support,” (NeurIPS, 2021, Section 2) ^14/34

(29)

Empirical properties of KALE

KALE

(

^P^;^Q

;

^H

) = sup

f^2H

E_Pf

(

^X

)

^E^Q

exp(

^f

(

^Y

)) +

¹

f

=

^h^w^;

(

^x

)

ⁱ^H ^H^{an RKHS}

kw^k²

H penalized

:

KALE smoothie

(30)

Empirical properties of KALE

KALE

(

^P^;^Q

;

^H

) = sup

f^2H

E_Pf

(

^X

)

^E^Q

exp(

^f

(

^Y

)) +

¹

f

=

^h^w^;

(

^x

)

ⁱ^H ^H^{an RKHS}

kw^k²

H penalized

:

KALE smoothie KALE

(

^Q^;^P

;

^H

)

^=0.18

(31)

Empirical properties of KALE

KALE

(

^P^;^Q

;

^H

) = sup

f^2H

E_Pf

(

^X

)

^E^Q

exp(

^f

(

^Y

)) +

¹

f

=

^h^w^;

(

^x

)

ⁱ^H ^H^{an RKHS}

kw^k²

H penalized

:

KALE smoothie KALE

(

^Q^;^P

;

^H

)

^=0.12

(32)

The KALE smoothie and “mode collapse”

Two Gaussians with same means, different variance

Example thanks to M. Arbel and M. Rosca _15/34

(33)

Topological properties of KALE (1)

Key requirements on ^H and ^X: Compact domain^X,

Hdense in the space C

(

^X

)

of continuous functions on^X wrt^k^k1. Iff ²^Hthen f ²^H and cf ²^Hfor 0c Cmax^.

Theorem: KALE

(

^P^;^Q

;

^H

)

^{0 and}^KALE

(

^P^;^Q

;

^H

) =

^{0 iff}^P

=

^Q.

Zhang, Liu, Zhou, Xu, and He. “On the Discrimination-Generalization Tradeoff in GANs”

(ICLR 2018, Corollary 2.4; Theorem B.1) Arbel, Liang, G. (ICLR 2021, Proposition 1)

16/34

(34)

Topological properties of KALE (1)

Key requirements on ^H and ^X: Compact domain^X,

Hdense in the space C

(

^X

)

of continuous functions on^X wrt^k^k1. Iff ²^Hthen f ²^H and cf ²^Hfor 0c Cmax^.

Theorem: KALE

(

^P^;^Q

;

^H

)

^{0 and}^KALE

(

^P^;^Q

;

^H

) =

^{0 iff}^P

=

^Q.

H dense inC

(

^X

)

^for^X ^R^d ^when:

H

=

^span^f

(

^w^>^x

+

^b

) : [

^w^;^b

]

²

Θ

^g

(

^u

) = max

^f^u^;⁰^g^;²^N^{, and} ^f

:

⁰^;²

Θ

^g

=

^R^d⁺¹^.

Zhang, Liu, Zhou, Xu, and He. “On the Discrimination-Generalization Tradeoff in GANs”

(ICLR 2018, Corollary 2.4; Theorem B.1) Arbel, Liang, G. (ICLR 2021, Proposition 1)

16/34

(35)

Topological properties of KALE (2)

Additional requirement: all functions in ^H Lipschitz in their inputs with constant L

Theorem: KALE

(

^P^;^Qⁿ

;

^H

)

^!^{0 iff}^Qⁿ ^! ^P under the weak topology.

Liu, Bousquet, Chaudhuri. “Approximation and Convergence Properties of Generative Adversarial Learning” (NeurIPS 2017); Arbel, Liang, G. (ICLR 2021, Proposition 1) ^17/34

(36)

Topological properties of KALE (2)

Additional requirement: all functions in ^H Lipschitz in their inputs with constant L

Theorem: KALE

(

^P^;^Qⁿ

;

^H

)

^!^{0 iff}^Qⁿ ^! ^P under the weak topology.

Partial proof idea:

KALE

(

^P^;^Q

;

^H

) =

^Z ^f^dP ^Z

exp(

^f

)

^d^Q

+

¹

=

^Z ^f

(

^x

)

^d^Q

(

^x

)

^f

(

^x⁰

)

^dP

(

^x⁰

)

Z

(exp(

^f

) +

^f ¹

)

| {z }

0

dQ

Z

f

(

^x

)

^d^Q

(

^x

)

^f

(

^x⁰

)

^dP

(

^x⁰

)

^LW¹

(

^P^;^Q

)

Liu, Bousquet, Chaudhuri. “Approximation and Convergence Properties of Generative Adversarial Learning” (NeurIPS 2017); Arbel, Liang, G. (ICLR 2021, Proposition 1) ^17/34

(37)

How to train your GAN

Generalized Energy-Based Model

18/34

(38)

Visual notation: GAN setting

19/34

(39)

Visual notation: GAN setting

19/34

(40)

Reminder: the generator

Radford, Metz, Chintala, ICLR 2016

20/34

(41)