generative models: Part 1

(1)

generative models: Part 1

Arthur Gretton

Gatsby Computational Neuroscience Unit, University College London

DeepLearn, 2022

(2)

Goal: doP and Q differ?

(3)

CIFAR 10 samples Cifar 10.1 samples

Significant difference?

(4)

(5)

Goal: AreX and Y independent?

Their noses guide them through life, and they're never happier than when following an interesting scent.

A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose.

A responsive, interactive pet, one that will blow in

Y

X

(6)

A third task: testing goodness of fit

Given: A modelP and samplesQ.

Goal: isP a good fit forQ?

Chicago crime data

model?

(7)

Goal: isP a good fit forQ?

Chicago crime data

Model is Gaussian mix- ture with two compo- nents. Is this a good model?

(8)

...as an integral probability metric(not just a technicality!)

Astatistical test based on the MMD learn adaptive NN features

Training GANs generative adversarial networks with MMD learn adaptive NN features

Next parts:

(9)

(10)

(11)

(12)

(13)

(14)

(15)

(16)

(17)

Simple example: 2 Gaussians with different means Answer: t-test

−6 −4 −2 0 2 4 6

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

Two Gaussians with different means

X

Prob. density

P_X Q_X

(18)

Idea: look at difference inmeans of featuresof the RVs In Gaussian case: second order features of form'(x) =x²

0.1 0.15 0.2 0.25 0.3 0.35 0.4

Two Gaussians with different variances

Prob. density

P_X Q_X

(19)

Two Gaussians with same means, different variance Idea: look at difference inmeans of featuresof the RVs In Gaussian case: second order features of form'(x) =x²

−6 −4 −2 0 2 4 6

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

Two Gaussians with different variances

Prob. density

X

P_X Q_X

10⁻¹ 10⁰ 10¹ 10²

0 0.2 0.4 0.6 0.8 1 1.2 1.4

Densities of feature X²

X²

Prob. density

P_X Q_X

(20)

Same meanandsame variance

Difference in means usinghigher order features...RKHS

0.2 0.3 0.4 0.5 0.6 0.7

Gaussian and Laplace densities

Prob. density

P_X Q_X

(21)

Kernels: dot products of features

Feature map'(x) 2 F, '(x)= [: : : 'i(x) : : :] 2 `2

Forpositive definitek, k(x;x⁰) = h'(x); '(x⁰)i_F Infinitely many features '(x), dot product in

Exponentiated quadratic kernel k(x;x⁰) = exp

kx x⁰k²

(22)

GivenP a Borel probability measure on X, definefeature map of probabilityP,

P = [: : :E_P['i(X)] : : :]

Forpositive definitek(x;x⁰),

hP; QiF =EP;Qk(x;y) forx P andy Q.

(23)

GivenP a Borel probability measure on X, definefeature map of probabilityP,

P = [: : :E_P['i(X)] : : :]

Forpositive definitek(x;x⁰),

hP; QiF =EP;Qk(x;y) forx P andy Q.

Fine print:feature map'(x)must be Bochner integrable for all probability measures considered.

Always true if kernel bounded.

(24)

The maximum mean discrepancy

Themaximum mean discrepancy is the distance betweenfeature means:

MMD²(P;Q) = kP Qk²_F

= hP Q;P Qi_F

(a)= within distrib. similarity, (b)= cross-distrib. similarity.

(25)

The maximum mean discrepancy

= hP Q;P Qi_F

= hP;Pi_F+ hQ;Qi_F 2hP;Qi_F

(a)= within distrib. similarity, (b)= cross-distrib. similarity.

(26)

The maximum mean discrepancy

= hP Q;P Qi_F

=EPk(X;X⁰)

| {z }

(a)

+EQk(Y;Y⁰)

| {z }

(a)

2EP;Qk(X;Y)

| {z }

(b)

(27)

Each entry is one of k(dog_i;dog_j),k(dog_i;fishj), ork(fishi;fishj)

(28)

MMD\ =

n(n 1)_i₆₌_j k(dog_i;dog_j) +

n(n 1)_i₆₌_j k(fish_i;fish_j)

2 n²

X

i;j

k(dog_i;fishj)

(29)

Find a"well behaved function"f(x)tomaximize E_Pf(X) E_Qf(Y)

-1 -0.5 0 0.5 1

f(x)

Smooth function

(30)

E_Pf(X) E_Qf(Y)

0 0.5 1

f(x)

Smooth function

(31)

MMD(P;Q;F) := sup

kfk_F1

[EPf(X) EQf(Y)]

(F =unit ballin RKHSF)

(32)

MMD(P;Q;F) := sup

kfk_F1

[EPf(X) EQf(Y)]

Functions are linear combinations of features:

(33)

MMD(P;Q;F) := sup

kfk_F1

[EPf(X) EQf(Y)]

−0.2 0 0.2 0.4 0.6 0.8

Witness f for Gauss and Laplace densities

Prob. density and f

f Gauss Laplace

(34)

MMD(P;Q;F) := sup

kfk_F1

[EPf(X) EQf(Y)]

ForcharacteristicRKHSF,MMD(P;Q;F) =0 iffP =Q Other choices forwitness function class:

(35)

MMD(P;Q;F) := sup

kfk_F1

[EPf(X) EQf(Y)]

Expectations offunctions are linear combinations of expected features

E_P(f(X)) = hf;E_P'(X)i_F = hf;Pi_F

(36)

The MMD:

MMD(P;Q;F)

= sup

kfk1

[E_Pf(X) E_Qf(Y)]

0 0.2 0.4 0.6 0.8 1

x -1

-0.5 0 0.5 1

f(x)

Smooth function

(37)

The MMD:

MMD(P;Q;F)

= sup

kfk1[EPf(X) EQf(Y)]

= sup

kfk1

hf;P Qi_F

use

E_Pf(X) = hP;fi_F

(38)

The MMD:

MMD(P;Q;F)

= sup

kfk1

[EPf(X) EQf(Y)]

= sup

kfk1

hf;P Qi_F

f

(39)

The MMD:

MMD(P;Q;F)

= sup

kfk1

[EPf(X) EQf(Y)]

= sup

kfk1

hf;P Qi_F

f

(40)

The MMD:

MMD(P;Q;F)

= sup

kfk1

[EPf(X) EQf(Y)]

= sup

kfk1

hf;P Qi_F

f*

(41)

The MMD:

MMD(P;Q;F)

= sup

kfk1

[EPf(X) EQf(Y)]

= sup

kfk1

hf;P Qi_F

= kP Qk_F

IPM view equivalent to feature mean

difference (kernel case only)

(42)

Observe X= fx₁; : : : ;x_ngP

Observe Y= fy₁; : : : ;y_ngQ

(43)

(44)

v

(45)

v

witness(v)

(46)

f /P Q

(47)

f/P Q

The empirical feature mean forP bP := 1

n

X

i=1

'(xi)

(48)

f /P Q

n

X

i=1

'(xi)

The empirical witness function atv

f(v) = hf; '(v)i_F

(49)

f/P Q

n

X

i=1

'(xi)

The empirical witness function atv f(v) = hf; '(v)i_F

/ hbP bQ; '(v)i_F

(50)

f /P Q

n

X

i=1

'(xi)

The empirical witness function atv f(v) = hf; '(v)i_F

/ hb ; '( )i

(51)

(52)

A statistical test using MMD

The empirical MMD:

MMD\²= 1 n(n 1)

X

i6=j

k(x_i;x_j) + 1 n(n 1)

X

i6=j

k(y_i;y_j)

2 n²

X

i;j

k(xi;yj)

How does this help decide whetherP =Q?

Alternative hypothesisH1 whenP 6=Q should seeMMD\² “far from zero”

WantThreshold c forMMD\² to get false positive rate

(53)

A statistical test using MMD

The empirical MMD:

MMD\²= 1 n(n 1)

X

i6=j

k(x_i;x_j) + 1 n(n 1)

X

i6=j

k(y_i;y_j)

2 n²

X

i;j

k(xi;yj)

Perspective fromstatistical hypothesis testing:

Null hypothesisH0 whenP =Q should seeMMD\² “close to zero”.

Alternative hypothesisH1 whenP 6=Q should seeMMD\² “far from zero”

(54)

MMD\ =

n(n 1) _i₆₌_j k(x_i;x_j) +

n(n 1) _i₆₌_j k(y_i;y_j)

2 n²

X

i;j

k(xi;yj)

Perspective fromstatistical hypothesis testing:

Null hypothesisH0 whenP =Q should seeMMD\² “close to zero”.

Alternative hypothesisH1 whenP 6=Q

(55)

Drawn =200 i.i.d samples fromP and Q Laplace with different y-variance.

pn \MMD² =1:2

-2 0 2

-10 -8 -6 -4 -2 0 2 4 6 8 10

P Q

(56)

Laplace with different y-variance.

pn \MMD² =1:2

3 4 5 6 7 8

-4 -2 0 2 4 6 8 10

P Q

(57)

Drawn =200newsamples from P and Q Laplace with different y-variance.

pn \MMD² =1:5

1 1.5 2 2.5 3 3.5 4

-8 -6 -4 -2 0 2 4 6 8 10

P Q

(58)

0.6 0.8 1 1.2 1.4 1.6 1.8

(59)

Repeat this 300 times: : :

0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8

(60)

0.6 0.8 1 1.2 1.4 1.6 1.8

(61)

WhenP 6=Q, statistic is asymptotically normal, MMD\² MMD²(P;Q)

pVn(P;Q)

D! N (0;1);

wherevarianceV_n(P;Q) =O n ¹ .

0.5 1 1.5

Empirical PDF Gaussian fit

−6 −4 −2 0 2 4 6

0 0.5 1 1.5

Two Laplace distributions with different variances

X

Prob. density

PX QX

(62)

What happens whenP and Q are the same?

(63)

Case ofP =Q= N (0;1)

0.1 0.2 0.3 0.4 0.5 0.6 0.7

(64)

0.2 0.3 0.4 0.5 0.6 0.7

(65)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

(66)

0.2 0.3 0.4 0.5 0.6 0.7

(67)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

(68)

nMMD\² X¹

l=1

l

z_l² 2

0.4 0.6

where

i i(x⁰) =Z

X

~k(x;x⁰)

| {z }

centred

i(x)dP(x)

z_l N (0;2) i:i:d:

(69)

0.1 0.2 0.3 0.4 0.5 0.6 0.7

(70)

0.2 0.3 0.4 0.5 0.6

(71)

MMD\²= 1 n(n 1)

X

i6=j

k(xi;xj)

+ 1

n(n 1) X

i6=j

k(yi;yj)

2 n²

Xk(x_i;y_j)

k(xi,yj) k(xi,xj)

k(y_i,y_j)

(72)

(73)

MMD\²= 1 n(n 1)

X

i6=j

k(~xi;x~j)

+ 1

n(n 1) X

i6=j

k(~yi;~yj)

2 n²

X

i;j

k(~x_i;~y_j)

k(˜xi,y˜j) k(˜xi,x˜j)

k(˜yi,y˜j)

(74)

Exact level (upper bound on false positive rate) at finite n and number of permutations

(when unpermuted statistic

k(˜xi,y˜j) k(˜xi,x˜j)

(75)

optimising the kernel parameters

(76)

The best test for the job

A test’s power depends on k(x;x⁰);P;andQ (andn)

With characteristic kernel, MMD test has power!1 as n ! 1for any (fixed) problem

But, for manyP andQ, will have terrible power with reasonablen!

“No one test can have all that power”

(77)

A test’s power depends on k(x;x⁰);P;andQ (andn)

With characteristic kernel, MMD test has power!1 as n ! 1for any (fixed) problem

But, for manyP andQ, will have terrible power with reasonablen! Youcan choose a good kernel for a given problem

Youcan’t get one kernel that has good finite-sample power for all problems

“No one test can have all that power”

(78)

Choosing a kernel for the test

Simple choice: exponentiated quadratic k(x;y) = exp

1

2²kx yk²

Characteristic: for any : for any P and Q, power!1 as n ! 1

(79)

Choosing a kernel for the test

1

2²kx yk²

Characteristic: for any : for any P and Q, power!1 as n ! 1 But choice of is very important for finiten. . .

(80)

Choosing a kernel for the test

1

2²kx yk²

(81)

Choosing a kernel for the test

1

2²kx yk²

(82)

Choosing a kernel for the test

1

2²kx yk²

(83)

k(x;y) = exp 1

2²kx yk²

. . . and some problems (e.g. images) might have no good choice for

(84)

0.2 0.3 0.4 0.5 0.6

(85)

The power of our test (Pr1 denotes probability under P 6=Q):

Pr1

nMMD\² > ^c

(86)

Pr1

nMMD\² > ^c

! MMD²(P;Q) pV_n(P;Q)

c np

V_n(P;Q)

!

where

is the CDF of the standard normal distribution.

(87)

The power of our test (Pr1 denotes probability under P 6=Q):

Pr1

nMMD\² > ^c

! MMD²(P;Q) pV_n(P;Q)

| {z }

O(n¹⁼²)

c np

V_n(P;Q)

| {z }

O(n ¹⁼²)

!

For largen, second term negligible!

(88)

Pr1

nMMD\² > ^c

! MMD²(P;Q) pVn(P;Q)

c np

Vn(P;Q)

!

To maximize test power, maximize

(89)

X P Y Q

Choose a kernel k maximizing ^MMD^\

p 2

^ Vn(P;Q)

Use chosen k for MMD test

(90)

Learning a kernel helps a lot

Kernel with deep learned features:

k(x;y) = [(1 )((x); (y)) + ]q(x;y) and q are Gaussian kernels

(91)

CIFAR-10 vs CIFAR-10.1,null rejected 75% of time

(92)

CIFAR-10 vs CIFAR-10.1,null rejected 75% of time

(93)

(94)

Code: https://github.com/antoninschrab/mmdagg-paper

(95)

(96)

(97)

(98)

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribu- tionZis projected to a small spatial extent convolutional representation with many feature maps.

A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a64⇥64pixel image. Notably, no fully connected or pooling layers are used.

suggested value of 0.9 resulted in training oscillation and instability while reducing it to 0.5 helped stabilize training.

4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link be- tween how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.

4.1.1 DEDUPLICATION

To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized RELU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication . Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.

4.2 FACES

Radford, Metz, Chintala, ICLR 2016

53/75

(99)

W₁(P;Q) = sup_kfkL1E_Pf(X) E_Qf(Y).

kfkL:= supx6=yjf(x) f(y)j =kx yk

W1=0.88

(100)

kfkL:= supx6=yjf(x) f(y)j =kx yk

W1=0.65

(101)

MMD(P;Q) = sup_kfkF1E_Pf(X) E_Qf(Y).

MMD=1.8

(102)

MMD=1.1

(103)

An unhelpfulcritic witness:

MMD(P;Q) with a narrow kernel.

MMD=0.64

(104)

MMD(P;Q) with a narrow kernel.

MMD=0.64

(105)

From ICML 2015:

From UAI 2015:

(106)

(107)

The critic(teacher) also needs to be trained.

K(x;y) =h ^>(x)h (y) whereh (x) is a CNN map:

Wasserstein GANArjovsky et al. [ICML 2017]

WGAN-GP

K(x;y) =k(h (x);h (y)) where h (x) is a CNN map,

k is e.g. an exponentiated quadratic kernel

MMD Li et al., [NeurIPS 2017]

Cramer

(108)

K(x;y) =h ^>(x)h (y) whereh (x) is a CNN map:

Wasserstein GANArjovsky et al.

K(x;y) =k(h (x);h (y)) where h (x) is a CNN map,

k is e.g. an exponentiated quadratic

(109)

(110)

Challenges for learned critic features

Learned critic features:

MMD with kernelk(h (x);h (y)) must giveuseful “gradient” to generator.

(111)

MMD with kernelk(h (x);h (y)) must giveuseful “gradient” to generator.

Relation with test power?

If the MMD with kernelk(h (x);h (y))gives a powerful test, will it be a good critic?

(112)

generator.

Relation with test power?

If the MMD with kernelk(h (x);h (y))gives a powerful test, will it be a good critic?

(113)

Samples from targetP and model Q target model

(114)

target model

(115)

What the kernelsk(x;y) look like MMD Gaussian

target model

(116)

Use kernelsk(h (x);h (y))with features h (x) =L₃

x L2(L1(x))

whereL₁;L₂;L₃ are fully connected with quadratic nonlinearity.

(117)

to learn h (x) fork(h (x);h (y))

(118)

(119)

Also related toSobolev GAN Mroueh et al. [ICLR 2018]

On gradient regularizers for MMD GANs

Michael Arbel Gatsby Computational Neuroscience Unit

University College London [email protected]

Dougal J. Sutherland Gatsby Computational Neuroscience Unit

University College London [email protected] Mikołaj Bi´nkowski

Department of Mathematics Imperial College London [email protected]

Arthur Gretton Gatsby Computational Neuroscience Unit

University College London [email protected]

Abstract

We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers. The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the art methods on160⇥160 CelebA and64⇥64unconditional ImageNet.

1 Introduction

There has been an explosion of interest inimplicit generative models(IGMs) over the last few years, especially after the introduction of generative adversarial networks (GANs) [16]. These models allow approximate samples from a complex high-dimensional target distributionP, using a model distributionQ^✓, where estimation of likelihoods, exact inference, and so on are not tractable. GAN- type IGMs have yielded very impressive empirical results, particularly for image generation, far beyond the quality of samples seen from most earlier generative models [e.g.18,21,22,23,37].

These excellent results, however, have depended on adding a variety of methods of regularization and other tricks to stabilize the notoriously difficult optimization problem of GANs [37,41]. Some of this difficulty is perhaps because when a GAN is viewed as minimizing a discrepancyD^GAN(P,Q^✓), its gradientr^✓D^GAN(P,Q✓)does not provide useful signal to the generator if the target and model distributions are not absolutely continuous, as is nearly always the case [2].

An alternative set of losses are the integral probability metrics (IPMs) [35], which can give credit to modelsQ✓“near” to the target distributionP[3,8, Section 4 of15]. IPMs are defined in terms of a critic function: a “well behaved” function with large amplitude wherePandQ^✓differ most. The IPM is the difference in the expected critic underPandQ✓, and is zero when the distributions agree. The Wasserstein IPMs, whose critics are made smooth via a Lipschitz constraint, have been particularly successful in IGMs [3,14,18]. But the Lipschitz constraint must hold uniformly, which can be hard to enforce. A popular approximation has been to apply a gradient constraint only in expectation [18]:

the critic’s gradient norm is constrained to be small on points chosen uniformly betweenPandQ.

63/75

(120)

Gradient regulariserArbel, Sutherland, Binkowski, G. [NeurIPS 2018]

Also related toSobolev GAN Mroueh et al. [ICLR 2018]

Maximise scaled MMD over critic features:

SMMD(P; ) =P; MMD where

²P; = + Z

k(h (x);h (x))dP(x)+

d

X

i=1

Z

@i@i+dk(h (x);h (x))dP(x)

(121)

Our empirical observations

Data-dependent gradient regularizer of critic Similar regularization strategies apply in:

WGAN-GP Gulrajani et al. [NeurIPS 2017]

“Witness function” in f-GANs (next talk!)Roth et al [NeurIPS 2017, eq. 19 and 20]

Incomplete training of the criticis also a regularisation strategy

(122)

Data-dependent gradient regularizer of critic Similar regularization strategies apply in:

WGAN-GP Gulrajani et al. [NeurIPS 2017]

“Witness function” in f-GANs (next talk!)Roth et al [NeurIPS 2017, eq. 19 and 20]

Alternate critic and generator training:

(123)

Entropicregularizer (avoid mode collapse):

(124)

(125)

(126)

KID scores:

BGAN:

47

SN-GAN:

44

SMMD GAN:

35

ILSVRC2012 (ImageNet) dataset, 1 281 167 images,

(127)

KID scores:

BGAN:

47

SN-GAN:

44

SMMD GAN:

35

ILSVRC2012 (ImageNet) dataset, 1 281 167 images, resized to 64 × 64. 1000 classes.

(128)

KID scores:

BGAN:

47

SN-GAN:

44

SMMD GAN:

35

ILSVRC2012 (ImageNet) dataset, 1 281 167 images,

(129)

GAN critics rely on two sources of regularisation Regularisation by incomplete training

Data-dependent gradient regulariser

Some advantages of hybrid kernel/neural features:

MMD loss still a valid critic when features not optimal (unlike WGAN-GP)

Kernel features do some of the “work”, so simplerh features possible.

“Demystifying MMD GANs,” including KID score, ICLR 2018:

https://github.com/mbinkowski/MMD-GAN Gradient regularised MMD, NeurIPS 2018:

https://github.com/MichaelArbel/Scaled-MMD-GAN

(130)

(131)

convolutional filters in layers 1-4. LSUN 6464.

(132)

Based on the classification outputp(yjx) of the inception modelSzegedy et al. [ICLR 2014],

E_XexpKL(P(yjX)kP(y)):

High when:

predictive label distributionP(yjx) has low entropy (good quality images)

label entropyP(y) is high (good variety).

(133)

Based on the classification outputp(yjx) of the inception modelSzegedy et al. [ICLR 2014],

EXexpKL(P(yjX)kP(y)):

High when:

predictive label distributionP(yjx) has low entropy (good quality images)

label entropyP(y) is high (good variety).

Problem: relies on a trained classifier! Can’t be used on new categories (celeb, bedroom...)

(134)

FID(P;Q) = kP Qk²+tr(P) +tr(Q) 2tr

(PQ)¹² whereP and P are the feature mean and covariance ofP

(135)

FitsGaussians to features in the inception architecture (pool3 layer):

FID(P;Q) = kP Qk²+tr(P) +tr(Q) 2tr

(PQ)¹² whereP and P are the feature mean and covariance ofP

Problem: bias. For finite samples can consistently give incorrect answer.

Bias demo,

10 20 30 40 50

FID

(136)

Given two alternatives:

P₁ N (0; (1 m ¹)²) P₂ N (0;1) Q N (0;1):

Clearly,

FID(P1;Q) = 1

m² >FID(P2;Q) =0 Givenm samples from P₁ andP₂,

(137)

Assumem samples from P andn ! 1 samples from Q.

P₁ N (0; (1 m ¹)²) P₂ N (0;1) Q N (0;1):

Clearly,

FID(P1;Q) = 1

FID(Pc1;Q) <FID(Pc2;Q):

(138)

P₁ N (0; (1 m ¹)²) P₂ N (0;1) Q N (0;1):

Clearly,

FID(P1;Q) = 1

(139)

Assumem samples from P andn ! 1 samples from Q.

P₁ N (0; (1 m ¹)²) P₂ N (0;1) Q N (0;1):

Clearly,

FID(P1;Q) = 1

FID(Pc1;Q) <FID(Pc2;Q):

(140)

P1 =relu(N (0;Id)) P2=relu(N (1; :8+:2Id)) Q=relu(N (1;Id)) where = _d⁴CC^T, with C a dd matrix with iid standard normal entries.

For a random draw ofC:

FID(P1;Q) 1123:0>1114:8FID(P2;Q) Withm =50 000 samples,

(141)

Letd =2048, and define

FID(Pc1;Q) 1133:7<1136:2FID(Pc2;Q)

(142)

(143)

Letd =2048, and define

FID(Pc1;Q) 1133:7<1136:2FID(Pc2;Q)

(144)

The Kernel inception distanceBinkowski, Sutherland, Arbel, G. [ICLR 2018]

Measures similarity of the samples’ representations in the inception architecture (pool3 layer)

MMDwith kernel k(x;y) =

1

dx^>y+1 3

: Checks match for feature means, variances, skewness

Unbiased: eg CIFAR-10 ^0.003

0.002 0.001 0.000 0.001 0.002 0.003 0.004

KID

(145)

The Kernel inception distanceBinkowski, Sutherland, Arbel, G. [ICLR 2018]

1

dx^>y+1 3

: Checks match for feature means, variances, skewness Unbiased: eg CIFAR-10

train/test ⁰ ²⁵⁰ ⁵⁰⁰ ⁷⁵⁰ 1000 1250 1500 1750 2000

n

0.003 0.002 0.001 0.000 0.001 0.002 0.003 0.004

KID

...“but isn’t KID is computationally costly?”

(146)

architecture (pool3 layer) MMDwith kernel

k(x;y) = 1

dx^>y+1 3

train/test ⁰ ²⁵⁰ ⁵⁰⁰ ⁷⁵⁰ 1000 1250 1500 1750 2000

n

0.003 0.002 0.001 0.000 0.001 0.002 0.003 0.004

KID

(147)

1

dx^>y+1 3

train/test ⁰ ²⁵⁰ ⁵⁰⁰ ⁷⁵⁰ 1000 1250 1500 1750 2000

n

0.003 0.002 0.001 0.000 0.001 0.002 0.003 0.004

KID

Also used for automatic learning rate adjustment: ifKID(Pbt+1;Q) not significantly better thanKID(Pb_t;Q) then reduce learning rate.