GANs

Posted: 2016-12-28 , Modified: 2016-12-28

Tags: neural nets

References

GANs

Ferenc: When the goal is to train a model that can generate natural-looking samples, maximum likelihood is not a desirable training objective. Under model misspecification and finite data (that is, in pretty much every practically interesting scenario), it has a tendency to produce models that overgeneralise.

Objective

Train 2 neural networks \(D\) and \(G\). \(P\) generates real data, and \(G=\wh P\) generates fake data. Suppose either \(P\) or \(\wh P\) is chosen with probability \(\rc2\), and then a sample \(x\) is drawn from the chosen distribution. (I.e., consider the mixture \(\rc2 (P+\wh P)\).)

\(G\) tries to minimize the objective and \(D\) tries to maximize it: \[ V(D, G) = \E_P \ln D + \E_G \ln (1-D). \] (The second expectation is over \(G(z)\), \(z\sim p_z\) a fixed noise distribution.) If the data is real, the term is \(\ln \wh\Pj\pat{real} = \ln D(x)\); if the data is fake, it is \(\ln \wh \Pj\pat{fake} = \ln(1-D(x))\). (Maximizing rewards correct classification and penalizes mistakes.)
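As a concrete illustration of this minimax game, here is a minimal training-loop sketch, assuming PyTorch, a toy 1-D Gaussian standing in for \(P\), and small MLPs for \(D\) and \(G\); all names and hyperparameters are illustrative, not taken from any particular paper.

```python
# Minimal sketch of alternating updates on V(D, G) (illustrative, not a reference implementation).
import torch
import torch.nn as nn

D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # outputs P(real | x)
G = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))                # maps z ~ p_z to a sample
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
eps = 1e-8  # numerical safety inside the logs

for step in range(10000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # stand-in for samples from P
    z = torch.randn(64, 4)                  # z ~ p_z, a fixed noise distribution

    # D ascends V = E_P[ln D(x)] + E_G[ln(1 - D(G(z)))]
    v = torch.log(D(real) + eps).mean() + torch.log(1 - D(G(z).detach()) + eps).mean()
    opt_D.zero_grad(); (-v).backward(); opt_D.step()

    # G descends the same V (the original minimax form; in practice the
    # non-saturating variant maximizes E[ln D(G(z))] instead)
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```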

Note that given \(G\), the optimal \(D\) is \(D^* = \fc{p}{p+\wh p}\), so \[\begin{align} \max_D V(D,G) &= \EE_p \ln \fc{p}{p+\wh p} + \EE_{\wh p} \ln \fc{\wh p}{p+\wh p}\\ &= -2\ln 2 + KL\ba{p||\fc{p+\wh p}2} + KL\ba{\wh p ||\fc{p+\wh p}2}. \end{align}\]
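A quick numeric sanity check of this identity on discrete distributions (a NumPy sketch; the distributions are random stand-ins for \(p\) and \(\wh p\)):

```python
# Check: V at the optimal discriminator equals -2 ln 2 + KL(p || m) + KL(q || m), m = (p+q)/2.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(10); p /= p.sum()   # stand-in for the data distribution p
q = rng.random(10); q /= q.sum()   # stand-in for the model distribution \hat p

kl = lambda a, b: np.sum(a * np.log(a / b))
d_star = p / (p + q)                                             # optimal discriminator D*
v_star = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star))
m = (p + q) / 2
print(np.isclose(v_star, -2 * np.log(2) + kl(p, m) + kl(q, m)))  # True
```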

Interpretation of objective

\[ KL[Q||P] := \EE_Q\ln \fc{Q}{P} = -H(Q) - \EE_Q\ln P = CE(Q,P) - H(Q) \] where CE is cross-entropy. \[\begin{align} JSD[P||Q] &= \rc 2 KL\ba{P||\fc{P+Q}2} + \rc 2 KL\ba{Q||\fc{P+Q}2}\\ JSD_\pi[P||Q] &= \pi KL\ba{P||\pi P+(1-\pi)Q} + (1-\pi) KL\ba{Q||\pi P+(1-\pi)Q} \end{align}\] (the symmetric JSD is the case \(\pi=\rc2\)).
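A small check of these identities on discrete distributions (NumPy sketch with random stand-ins for \(P\) and \(Q\)):

```python
# Check KL[Q||P] = CE(Q,P) - H(Q), and that the pi-JSD at pi = 1/2 is the symmetric JSD.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(8); p /= p.sum()
q = rng.random(8); q /= q.sum()

kl = lambda a, b: np.sum(a * np.log(a / b))
h  = lambda a: -np.sum(a * np.log(a))        # entropy H
ce = lambda a, b: -np.sum(a * np.log(b))     # cross-entropy CE

print(np.isclose(kl(q, p), ce(q, p) - h(q)))                                      # True
jsd = lambda pi: pi * kl(p, pi * p + (1 - pi) * q) + (1 - pi) * kl(q, pi * p + (1 - pi) * q)
print(np.isclose(jsd(0.5), 0.5 * kl(p, (p + q) / 2) + 0.5 * kl(q, (p + q) / 2)))  # True
```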

Let \(P\) be the natural distribution and \(Q\) be the estimated one.

Another interpretation

Let \(s\in\{0,1\}\), each with probability \(\rc2\), determine whether we see real or fake data, and let \(\wt p\) be the joint distribution over \((s,x)\). \[I[s;x] = KL[\wt p(s,x)||\wt p(s) \wt p(x)].\] This is intractable to estimate. Subtract out the KL divergence to a tractable approximation \(q(s|x)\): \[ L[p;q] := I[s;x] - \E_{\wt p(x)} [KL[\wt p(s|x)||q(s|x)]] = H[s] + \E_{\wt p(s,x)} [\ln q(s|x)]. \] The second term is the GAN objective function. “GAN can be viewed as an augmented generative model which is trained by minimizing mutual information.” YZL
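To see the second equality, expand both terms; the \(\ln \wt p(s|x)\) contributions cancel: \[\begin{align} I[s;x] &= H[s] - H[s|x] = H[s] + \E_{\wt p(s,x)}[\ln \wt p(s|x)],\\ \E_{\wt p(x)} [KL[\wt p(s|x)||q(s|x)]] &= \E_{\wt p(s,x)}[\ln \wt p(s|x)] - \E_{\wt p(s,x)}[\ln q(s|x)], \end{align}\] so subtracting gives \(L[p;q] = H[s] + \E_{\wt p(s,x)}[\ln q(s|x)]\).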

Training

Notes

Table p. 8

Improved techniques.

“GANs are typically trained using gradient descent techniques that are designed to find a low value of a cost function, rather than to find the Nash equilibrium of a game.”

Semi-supervised learning

Label generated samples with a new “generated” class \(K+1\). For unlabeled real data, maximize \(\ln p_{\text{model}}(y\in [K]|x)\); for generated data, maximize \(\ln p_{\text{model}}(y=K+1|x)\).
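A sketch of these loss terms, assuming PyTorch and a classifier with \(K+1\) output logits (0-indexed, so logit index \(K\) plays the role of class \(K+1\)); the function names are illustrative:

```python
import torch
import torch.nn.functional as F

K = 10  # number of real classes; logit index K is the "generated" class

def supervised_loss(logits_labeled, y):
    # labeled real data: ordinary cross-entropy over the K+1 classes
    return F.cross_entropy(logits_labeled, y)

def unsupervised_loss(logits_unlabeled, logits_fake):
    # unlabeled real data: maximize ln p(y in [K] | x) = logsumexp of the first K log-probabilities
    log_p_real = F.log_softmax(logits_unlabeled, dim=1)
    real_term = -torch.logsumexp(log_p_real[:, :K], dim=1).mean()
    # generated data: maximize ln p(y = "generated" | x)
    fake_term = -F.log_softmax(logits_fake, dim=1)[:, K].mean()
    return real_term + fake_term
```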

InfoGAN

\[L_{\text{InfoGAN}}(\te) = I[x,y] - \la I[x_{\text{fake}},c],\] where \(y\) is the real/fake label (the \(s\) above) and \(c\) is the latent code.

We want \(c\) to explain most of the variation in the fake data.

Variational bound on MI: \[I[X,Y] = \max_q \bc{H[Y] + \EE_{x,y} \ln q(y|x)}.\]

Thus, \[ I[x_{\text{fake}},c]\ge \EE_{c\sim P(c),\, x_{\text{fake}}\sim G(z,c)}[\ln Q(c|x_{\text{fake}})] + H(c). \] Estimate the expectation by Monte Carlo sampling.
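A minimal sketch of this Monte Carlo estimate for a uniform categorical code, assuming PyTorch and hypothetical networks `G(z, c_onehot)` (generator) and `Q(x)` (recognition network returning logits over codes):

```python
import torch
import torch.nn.functional as F

def mi_lower_bound(G, Q, batch_size=64, num_codes=10, noise_dim=62):
    """Monte Carlo estimate of E_{c ~ p(c), x ~ G(z,c)}[ln Q(c|x)] + H(c)
    for a uniform categorical code c (G and Q are hypothetical networks)."""
    c = torch.randint(num_codes, (batch_size,))              # c ~ uniform categorical
    z = torch.randn(batch_size, noise_dim)                   # z ~ p_z
    x_fake = G(z, F.one_hot(c, num_codes).float())
    log_q = F.log_softmax(Q(x_fake), dim=1)                  # log Q(c | x_fake)
    h_c = torch.log(torch.tensor(float(num_codes)))          # entropy of the uniform code
    return log_q[torch.arange(batch_size), c].mean() + h_c   # lower bound on I[x_fake; c]
```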

GANs use the variational bound in the wrong direction: \(I(s;x) \ge H(s) + V\), so maximizing \(V\) only pushes up a lower bound. InfoGAN uses it twice: \[ \ub{I(s;x)}{\ge H(s)+V} - \la \ub{I[x_{\text{fake}}, c]}{\ge ...} \]

Example: for MNIST, have 10 known labels and 2 latent variables which turn out to represent slant and width.

AdaGAN: Boosting Generative Models

Background on boosting

GANs suffer from missing modes: the model doesn’t produce examples in certain regions.

Idea: combine multiple generative models into a mixture. At each step, focus on the examples that the current mixture has not been able to generate properly, and add another component addressing those.

This is a meta-algorithm which can be used with any implementation of generative models (e.g. GANs).
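A schematic sketch of the meta-algorithm, assuming hypothetical helpers `train_generator` (any GAN trainer that accepts per-example weights) and `reweight` (which upweights points the current mixture covers poorly); `beta` is the mixture weight given to each new component:

```python
import numpy as np

def adagan(data, T=5, beta=0.3, train_generator=None, reweight=None):
    """Schematic AdaGAN loop (illustrative only): each round fits a new generator
    to reweighted data and adds it to the mixture."""
    n = len(data)
    weights = np.full(n, 1.0 / n)                      # start from the uniform data distribution
    mixture = []                                       # list of (component, mixture weight)
    for t in range(T):
        g = train_generator(data, weights)             # fit a generative model to the reweighted data
        b = 1.0 if not mixture else beta               # the first component takes all the mass
        mixture = [(comp, w * (1 - b)) for comp, w in mixture] + [(g, b)]
        weights = reweight(data, mixture)              # emphasize examples the mixture misses
    return mixture
```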

Minimizing f-divergence with additive mixtures

The \(f\)-divergence is \[ D_f(Q||P):=\int f\pa{\dd QP (x)}\,dP(x). \] Note \(D_f(P||Q) = D_{f^{\circ}}(Q||P)\) where \(f^{\circ}(x) = xf(1/x)\). Adding a multiple of \(x-1\) to \(f\) does not change \(D_f\).
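A quick check of the \(f^{\circ}\) identity on discrete distributions (NumPy sketch), using \(f(x)=x\ln x\), for which \(D_f(Q||P)=KL(Q||P)\) and \(f^{\circ}(x)=-\ln x\):

```python
# Verify D_f(P||Q) = D_{f°}(Q||P) for f(x) = x ln x on random discrete distributions.
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

d = lambda f, a, b: np.sum(f(a / b) * b)         # D_f(A||B) = E_B[ f(dA/dB) ]
f      = lambda x: x * np.log(x)
f_circ = lambda x: x * f(1 / x)                  # f°(x) = x f(1/x) = -ln x
print(np.isclose(d(f, p, q), d(f_circ, q, p)))   # True
```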

\(f\) is Hilbertian if \(\sqrt{D_f(P||Q)}\) satisfies the triangle inequality (e.g. JSD, Hellinger, total variation).

Examples:

AdaGAN derivation

An alternate analysis:

Remarks:

Two views of GANs

  1. Divergence minimization: minimize a divergence between the real data distribution and the implicit generative model \(q_\te\). Problems:
    • GAN algorithms only minimize a lower bound on the divergence (tight only for an optimal discriminator), so the discriminator must be powerful.
    • Degenerate distributions: when the two distributions have (nearly) disjoint supports, the divergence saturates and gives little training signal.
  2. Contrast function view: learn a function that takes low values on the data manifold and high values everywhere else. The generator is a smart way of generating contrastive points.

Adversarial training with maximum mean discrepancy