Ferenc: When the goal is to train a model that can generate natural-looking samples, maximum likelihood is not a desirable training objective. Under model misspecification and finite data (that is, in pretty much every practically interesting scenario), it has a tendency to produce models that overgeneralise.


Train two neural networks \(D\) and \(G\). \(P\) generates real data, and \(G=\wh P\) generates fake data. Suppose either \(P\) or \(\wh P\) is chosen with probability \(\rc2\) each and then one sample \(x\) is drawn from it. (I.e., consider the mixture \(\rc2 (P+\wh P)\).)

\(G\) tries to minimize the objective and \(D\) tries to maximize it: \[ V(D, G) = \E_P \ln D + \E_G \ln (1-D). \] (The second expectation is over \(G(z)\), \(z\sim p_z\) a fixed distribution.) If the data was real, there is a loss \(\ln \wh\Pj\pat{real}\), and if the data was fake, there is a loss \(\ln \wh \Pj\pat{fake}\). (Penalize for mistakes.)

Note that given \(G\), the optimal \(D\) is \(D^*=\fc{p}{p+\wh p}\), so \[\begin{align} \max_D V(D,G) &= \EE_p \ln \fc{p}{p+\wh p} + \EE_{\wh p} \ln \fc{\wh p}{p+\wh p}\\ &= -2\ln 2 + KL\ba{p||\fc{p+\wh p}2} + KL\ba{\wh p ||\fc{p+\wh p}2}. \end{align}\]
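A quick numerical sanity check of the identity above, on a finite sample space. The two distributions are made-up numbers, and `value`, `kl`, `d_star` are names chosen for this sketch only.

```python
import math

def kl(a, b):
    """KL[a||b] for discrete distributions given as lists of probabilities."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def value(d, p, p_hat):
    """V(D,G) = E_p ln D + E_{p_hat} ln(1-D) on a finite sample space."""
    return sum(pi * math.log(di) + qi * math.log(1 - di)
               for di, pi, qi in zip(d, p, p_hat))

p = [0.5, 0.3, 0.2]       # hypothetical real distribution
p_hat = [0.2, 0.2, 0.6]   # hypothetical generator distribution

# Pointwise optimal discriminator D* = p / (p + p_hat).
d_star = [pi / (pi + qi) for pi, qi in zip(p, p_hat)]
mix = [(pi + qi) / 2 for pi, qi in zip(p, p_hat)]

v_star = value(d_star, p, p_hat)
closed_form = -2 * math.log(2) + kl(p, mix) + kl(p_hat, mix)

# Any other discriminator should score no higher than D*.
d_other = [min(max(di + 0.1, 0.01), 0.99) for di in d_star]
v_other = value(d_other, p, p_hat)

print(v_star, closed_form, v_other)
```

The first two printed numbers should agree up to rounding, and the third should be smaller.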

Interpretation of objective

\[ KL[Q||P] := \EE_Q\ln \fc{Q}{P} = -H(Q) - \EE_Q\ln P = -H(Q) + CE(Q,P) \] where \(H(Q) = -\EE_Q\ln Q\) is entropy and \(CE(Q,P) = -\EE_Q \ln P\) is cross-entropy. \[\begin{align} JSD[P||Q] &= \rc 2 KL\ba{P||\fc{P+Q}2} + \rc 2 KL\ba{Q||\fc{P+Q}2}\\ JSD_\pi[P||Q] &= \pi KL\ba{P||\pi P+(1-\pi)Q} + (1-\pi) KL\ba{Q||\pi P+(1-\pi)Q} \end{align}\]
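A small sketch of these quantities for discrete distributions, taking \(H(Q) = -\EE_Q \ln Q\) so that KL is cross-entropy minus entropy. The function names and example numbers are assumptions for illustration; `jsd` takes the mixing weight \(\pi\) as `pi`, with `pi=0.5` giving the symmetric case.

```python
import math

def entropy(q):
    """H(Q) = -E_Q ln Q."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """CE(Q,P) = -E_Q ln P (assumes p > 0 wherever q > 0)."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def kl(q, p):
    """KL[Q||P] = E_Q ln(Q/P)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def jsd(p, q, pi=0.5):
    """pi * KL[P||M] + (1-pi) * KL[Q||M] with M = pi*P + (1-pi)*Q."""
    m = [pi * a + (1 - pi) * b for a, b in zip(p, q)]
    return pi * kl(p, m) + (1 - pi) * kl(q, m)

P = [0.5, 0.3, 0.2]   # hypothetical "natural" distribution
Q = [0.2, 0.2, 0.6]   # hypothetical estimate

print(kl(Q, P), cross_entropy(Q, P) - entropy(Q))  # should agree
print(jsd(P, Q), jsd(Q, P))                        # symmetric at pi = 1/2
print(jsd(P, Q, pi=0.9), jsd(Q, P, pi=0.1))        # swapping P, Q flips pi
```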

Let \(P\) be the natural distribution and \(Q\) be the estimated one.

Another interpretation

Let \(s\in\{0,1\}\) with probability \(\rc2\) each, determining whether we see real or fake data. Let \(\wt p\) be the combined distribution of \((s,x)\). \[I[s;x] = KL[\wt p(s,x)||\wt p(s) \wt p(x)].\] This is intractable to estimate. Subtract out a KL divergence: \[ L[p;q] := I[s;x] - \E_{\wt p(x)} [KL[\wt p(s|x)||q(s|x)]] = H[s] + \E_{\wt p(s,x)} [\ln q(s|x)] \] (This holds because \(I[s;x] = H[s] + \E_{\wt p(s,x)}[\ln \wt p(s|x)]\), and the \(\ln \wt p(s|x)\) term cancels against the KL term.) The second term is the GAN objective function. “GAN can be viewed as an augmented generative model which is trained by minimizing mutual information.” YZL
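The bound can be checked numerically on a finite sample space: \(H[s] + \E_{\wt p(s,x)}[\ln q(s|x)]\) should never exceed \(I[s;x]\), with equality when \(q\) is the true posterior. The distributions and the names `bound`, `mutual_information`, `q_real` below are assumptions made for this sketch.

```python
import math

p = [0.5, 0.3, 0.2]       # hypothetical real distribution, s = real
p_hat = [0.2, 0.2, 0.6]   # hypothetical fake distribution, s = fake

def bound(q_real):
    """H[s] + E_{p~(s,x)} ln q(s|x), where q_real[i] = q(s=real | x_i)."""
    h_s = math.log(2)  # s is a fair coin
    expected_log_q = sum(0.5 * pi * math.log(qi) + 0.5 * ri * math.log(1 - qi)
                         for pi, ri, qi in zip(p, p_hat, q_real))
    return h_s + expected_log_q

def mutual_information():
    """I[s;x] = sum over (s,x) of p~(s,x) ln( p~(s,x) / (p~(s) p~(x)) )."""
    total = 0.0
    for pi, ri in zip(p, p_hat):
        px = 0.5 * (pi + ri)              # marginal p~(x)
        for joint in (0.5 * pi, 0.5 * ri):
            if joint > 0:
                total += joint * math.log(joint / (0.5 * px))
    return total

posterior = [pi / (pi + ri) for pi, ri in zip(p, p_hat)]  # p~(s=real | x)
guess = [0.5, 0.5, 0.5]                                   # an uninformative q

mi = mutual_information()
print(mi, bound(posterior), bound(guess))
```

With the uninformative `guess` the bound is exactly zero, while the true posterior recovers the mutual information.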



Table p. 8

Improved techniques.

“GANs are typically trained using gradient descent techniques that are designed to find a low value of a cost function, rather than to find the Nash equilibrium of a game.”

Semi-supervised learning

Label generated samples with “generated” class \(K+1\). For unlabeled data, maximize \(\ln p_{\text{model}}(y\in [K]|x)\).
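A minimal sketch of how these two terms could be computed from a classifier with \(K+1\) logits, the last one being the “generated” class. The logit values and function names are hypothetical; this is an illustration of the labelling scheme, not any particular implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def log_prob_real(logits):
    """ln p_model(y in [K] | x): log of the total mass on the K real classes."""
    real = log_softmax(logits)[:-1]
    m = max(real)
    return m + math.log(sum(math.exp(v - m) for v in real))

def log_prob_generated(logits):
    """ln p_model(y = K+1 | x)."""
    return log_softmax(logits)[-1]

def unsupervised_loss(unlabeled_logits, generated_logits):
    """Negative log-likelihood of 'some real class' on an unlabeled example
    plus that of class K+1 on a generated example."""
    return -log_prob_real(unlabeled_logits) - log_prob_generated(generated_logits)

# Hypothetical logits for K = 3 real classes plus the "generated" class.
logits_unlabeled = [2.0, 0.5, -1.0, 0.3]
logits_generated = [-0.5, 0.1, 0.0, 1.5]

print(unsupervised_loss(logits_unlabeled, logits_generated))
```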


\[L_{infoGAN}(\te) = I[x,y] - \la I[x_{\text{fake}},c].\]

Want \(c\) to effectively explain most variation in fake data.

Variational bound on MI: \[I[X,Y] = \max_q \bc{H[Y] + \EE_{x,y} \ln q(y|x)}.\]

Thus, \[ I[x_{\text{fake}},c]\ge \EE_{x_{\text{fake}}\sim G(z,c), c\sim C|x}[\ln Q(c|x)] + H(c) \] Sample using Monte Carlo.
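A toy Monte Carlo estimate of this lower bound, sampling \(c\) from its prior and then \(x\) from the generator. Everything concrete here is an assumption for illustration: a binary code \(c\), a one-dimensional "generator" \(x = (2c-1) + 0.5z\), and a logistic \(Q(c=1|x) = \sigma(ax)\) with slope `a`.

```python
import math
import random

def log_sigmoid(t):
    """Stable ln(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def mi_lower_bound(a, n=20000, seed=0):
    """Monte Carlo estimate of H(c) + E[ln Q(c|x)] for the toy model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        c = rng.randrange(2)             # c ~ Uniform{0, 1}
        z = rng.gauss(0.0, 1.0)          # z ~ N(0, 1)
        x = (2 * c - 1) + 0.5 * z        # hypothetical generator G(z, c)
        # ln Q(c|x): sigma(a x) for c = 1, sigma(-a x) for c = 0
        total += log_sigmoid(a * x if c == 1 else -a * x)
    h_c = math.log(2)                    # entropy of the uniform binary code
    return h_c + total / n

# a = 8 matches the true posterior of this toy model; a = 1 is a poorer Q.
tight = mi_lower_bound(8.0)
loose = mi_lower_bound(1.0)
print(tight, loose)
```

Since \(\ln Q \le 0\), the estimate can never exceed \(H(c)\), and a better \(Q\) gives a tighter (larger) bound.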

GANs use the variational bound in the wrong direction \(I(s;x) \ge H(s) + V\). InfoGANs use it twice. \[ \ub{I(s;x)}{\ge H(s)+V} - \la \ub{I[x_{\text{fake}}, c]}{\ge ...} \]

Example: for MNIST, have 10 known labels and 2 latent variables which turn out to represent slant and width.

AdaGAN: Boosting Generative Models

Background on boosting

GANs suffer from missing modes: the model doesn’t produce examples in certain regions.

Idea: combine multiple generative models into a mixture. Each step, focus on examples that the mixture has not been able to properly generate, and add another model addressing those.

This is a meta-algorithm which can be used with any implementation of generative models (e.g. GANs).
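A toy sketch of the boosting idea on a discrete sample space. It is not AdaGAN's actual reweighting rule: the "weak generator" is a hypothetical learner that collapses onto a single mode, the mixture weight \(\beta_t = 1/t\) is an arbitrary choice, and the reweighting simply targets the data mass the current mixture does not yet cover.

```python
def weak_generator(weights):
    """Hypothetical mode-collapsing learner: a point mass on the heaviest bin."""
    k = max(range(len(weights)), key=lambda i: weights[i])
    return [1.0 if i == k else 0.0 for i in range(len(weights))]

def boost_mixture(p_data, steps):
    """Greedily build a mixture of weak generators.

    Each step fits a weak generator to reweighted data, mixes it in with
    weight beta = 1/t, then reweights towards uncovered data mass.
    """
    mix = [0.0] * len(p_data)
    weights = list(p_data)
    for t in range(1, steps + 1):
        q = weak_generator(weights)
        beta = 1.0 / t
        mix = [(1 - beta) * m + beta * qi for m, qi in zip(mix, q)]
        # Heuristic reweighting (an assumption): residual mass not yet covered.
        residual = [max(pd - m, 0.0) for pd, m in zip(p_data, mix)]
        total = sum(residual)
        if total <= 1e-12:
            break
        weights = [r / total for r in residual]
    return mix

p_data = [0.4, 0.1, 0.0, 0.3, 0.0, 0.2]   # made-up data with four modes

mix1 = boost_mixture(p_data, 1)   # a single generator misses three modes
mix4 = boost_mixture(p_data, 4)   # four boosting steps cover all of them
print(mix1)
print(mix4)
```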

Minimizing f-divergence with additive mixtures

The \(f\)-divergence is \[ D_f(Q||P):=\int f\pa{\dd QP (x)}\,dP(x) \] Note \(D_f(P||Q) = D_{f^{\circ}}(Q||P)\) where \(f^{\circ}(x) = xf(1/x)\). Adding a multiple of \(x-1\) to \(f\) doesn’t change \(D_f\), since \(\int \pa{\dd QP - 1}\,dP = 0\).
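Both facts are easy to check numerically for discrete distributions with full support. In this sketch `f_div`, `conjugate` and the example numbers are assumptions; \(f(x) = x\ln x\) is used because under the definition above it gives \(D_f(Q||P) = KL[Q||P]\).

```python
import math

def f_div(f, q, p):
    """D_f(Q||P) = sum_x p(x) f(q(x)/p(x)), assuming p(x) > 0 everywhere."""
    return sum(pi * f(qi / pi) for qi, pi in zip(q, p))

def conjugate(f):
    """f°(x) = x f(1/x)."""
    return lambda x: x * f(1.0 / x)

def f_kl(x):
    """Generator of KL[Q||P] under the definition above."""
    return x * math.log(x)

def f_kl_shifted(x):
    """Same generator plus a multiple of (x - 1)."""
    return f_kl(x) + 3.0 * (x - 1.0)

P = [0.5, 0.3, 0.2]   # made-up distributions with full support
Q = [0.2, 0.2, 0.6]

kl_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(Q, P))

print(f_div(f_kl, Q, P), kl_direct)                    # D_f(Q||P) = KL[Q||P]
print(f_div(f_kl, P, Q), f_div(conjugate(f_kl), Q, P)) # swap via f°
print(f_div(f_kl_shifted, Q, P))                       # unchanged by the shift
```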

\(f\) is Hilbertian if \(\sqrt{D_f(P||Q)}\) satisfies the triangle inequality (e.g. JSD, Hellinger, TV).


AdaGAN derivation

An alternate analysis:


Two views of GANs

  1. Divergence minimization: minimize divergence between real data and implicit generative model \(q_\te\). Problems:
    • GAN algorithms minimize lower bounds. The discriminator must be powerful.
    • Degenerate distributions
  2. Contrast function view: learn a function that takes low values on the data manifold and high values everywhere else. The generator is a smart way of generating contrastive points.

Adversarial training with maximum mean discrepancy