See GANs.
AdaGAN algorithm
Loop:
- \(D\leftarrow DGAN(S_N,G_{t-1})\): train a discriminator between the unweighted sample \(S_N\) and the current mixture \(G_{t-1}\).
- \(\la^*\leftarrow \la(\be_t,D)\): compute the constant \(\la^*\) used in the weight update from \(\be_t\) and \(D\).
- \(W_t^i \leftarrow ...\): update the weights on the data points (using \(\la^*\), \(\be_t\), and \(D\)).
- \(G_t^c=GAN(S_N,W_t)\): train a new component generator on the weighted sample.
- \(G_t = (1-\be_t)G_{t-1} + \be_t G_t^c\): mix it into the running mixture.
Here, assume the discriminator is good enough that each step forces the mixture a constant multiplicative factor closer to the data distribution.
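A minimal sketch of this loop, assuming hypothetical `train_gan`, `train_discriminator`, and `update_weights` routines (the exact AdaGAN rule for computing the weights from \(\la^*\), \(\be_t\), \(D\) is left abstract):

```python
import numpy as np

def adagan(data, T, betas, train_gan, train_discriminator, update_weights):
    """Sketch of the AdaGAN loop: boost a mixture of generators.

    train_gan(data, weights)            -> generator g with g.sample(n)
    train_discriminator(data, mixture)  -> function D(x) in [0, 1]
    update_weights(D, data, beta)       -> new weights on the data points
    (all three are placeholders for whatever GAN machinery is used)
    """
    n = len(data)
    weights = np.ones(n) / n
    generators, alphas = [], []            # mixture components and their weights

    for t in range(T):
        beta = betas[t]
        if generators:
            # D <- DGAN(S_N, G_{t-1}): discriminate unweighted data vs. current mixture.
            D = train_discriminator(data, (generators, alphas))
            # W_t <- ...: reweight toward the part of the data the mixture misses.
            weights = update_weights(D, data, beta)
        # G_t^c = GAN(S_N, W_t): train a new component on the weighted sample.
        g = train_gan(data, weights)
        # G_t = (1 - beta_t) G_{t-1} + beta_t G_t^c, kept as an explicit mixture.
        alphas = [a * (1 - beta) for a in alphas] + [beta if generators else 1.0]
        generators.append(g)

    return generators, alphas

def sample_mixture(generators, alphas, n, rng=np.random):
    """Draw n samples from the mixture by choosing a component for each sample."""
    counts = rng.multinomial(n, alphas)
    return np.concatenate([g.sample(k) for g, k in zip(generators, counts) if k > 0])
```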
Question
Can you reanalyze this with a more stable version of GAN, based not on KL divergence but on
- Wasserstein distance?
- Neural net MMD?
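For the MMD direction, the standard unbiased estimator of squared kernel MMD between two samples is easy to write down; here with an RBF kernel (a "neural-net MMD" would compute the same quantity on learned features). The function name and kernel bandwidth are just illustrative.

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples X and Y (rows are points),
    with RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def k(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2 * sigma ** 2))

    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    m, n = len(X), len(Y)
    # Drop the diagonal for the unbiased within-sample averages.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

# e.g., real data vs. samples from the current mixture of generators:
# gap = mmd2_unbiased(data, sample_mixture(generators, alphas, len(data)))
```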
Ideas
From known results: (see MW)
- MU for game theory: maintain probability distribution, get best column response.
- This means maintaining the probability distribution generated by G and doing MU on it. No!
- MU for boosting: keep weights on data points, response is weak learner.
- Here, response is generator. At end, take mixture of generators, cf. weighted majority of experts.
- How to compute the loss? In boosting it was just the probability of the weak learner being wrong. Here the loss is itself a maximization over discriminators, not a linear function of the weights!
- This is odd: in the game matrix, the rows and columns are not the two players; rather, one axis is the data and the other is one player (the generator), and the other player (the discriminator) is missing from the table.
- You can't fix the discriminator for this step: e.g., on the part of the data where this discriminator can't tell real from generated, there might be another discriminator that can. (You might not want to include the first generator at all???)
- Dense model theorem, regularity theorems
- Label with \((0,x\sim X)\), \((1,x\sim G)\). But this is the formulation for boosting the discriminator, not the generator.
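To make the last bullet concrete, labeling \((0,x\sim X)\), \((1,x\sim G)\) and boosting on that labeled set is exactly boosting the discriminator; a toy sketch with scikit-learn's AdaBoost and Gaussian stand-ins for \(X\) and \(G\):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
x_real = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # stand-in for x ~ X
x_gen = rng.normal(loc=1.0, scale=1.5, size=(500, 2))    # stand-in for x ~ G

X = np.vstack([x_real, x_gen])
y = np.concatenate([np.zeros(500), np.ones(500)])          # labels (0, x~X), (1, x~G)

# The boosted classifier is a weighted vote of weak discriminators.
disc = AdaBoostClassifier(n_estimators=50)
disc.fit(X, y)
print("training accuracy of the boosted discriminator:", disc.score(X, y))
```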
Can you do better than a simple mixture, e.g., decide whether to regenerate according to some criterion, like \(\mathcal F_k \mathcal T\) for a different class \(\mathcal F\)?
Note that MU/boosting doesn't run into the rock-paper-scissors (cycling) problem if you mix/average; see the sketch below.
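A small demonstration of both points (the MU-for-game-theory setup in the list above, and why averaging sidesteps RPS cycling): multiplicative updates for the row player against a best-responding column on rock-paper-scissors. The last iterate cycles, but the time-averaged strategy approaches the uniform equilibrium.

```python
import numpy as np

# Rock-paper-scissors payoff to the row player.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

eta, T = 0.05, 10000
p = np.ones(3) / 3                      # row player's maintained distribution
p_avg = np.zeros(3)

for t in range(T):
    p_avg += p / T
    j = int(np.argmin(p @ A))           # column best response (minimizes row's payoff)
    p = p * np.exp(eta * A[:, j])       # multiplicative update against that response
    p /= p.sum()

print("last iterate (cycles):      ", np.round(p, 3))
print("time average (near uniform):", np.round(p_avg, 3))
```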
More concrete questions
- Suppose you want to match \(f\) using a combination of functions in \(\mathcal H\). What is the right formulation of boosting here?
- Convex combination: this is just projection onto a polytope in a subspace (see the sketch at the end of this section).
- What are interesting \(\mathcal F\) here?
- In MU for game theory: if the column player only plays a \((\rc2+\ep)\)-good strategy, do you get anything?
- What is the guarantee of mixture of generators?
We're actually training to completion at each step… What does "good" mean here? Is there an analogue of "over half can't be distinguished", e.g. the MMD staying bounded away from 0?
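For the convex-combination bullet above, a minimal sketch (names and the random-feature class are illustrative): projecting \(f\) onto the polytope \(\mathrm{conv}(\mathcal H)\) with Frank-Wolfe on the simplex. Each step picks the single best \(h\), which is the boosting-flavored version of this projection.

```python
import numpy as np

def fw_convex_match(H_vals, f_vals, T=500):
    """Fit f by a convex combination of the functions in H, on a fixed sample of points.

    H_vals : (k, n) array, H_vals[i, j] = h_i(x_j)
    f_vals : (n,)   array, f_vals[j]    = f(x_j)
    Minimizes ||alpha @ H_vals - f_vals||^2 over the simplex by Frank-Wolfe,
    i.e. projects f onto the polytope conv{h_1, ..., h_k}.
    """
    k, _ = H_vals.shape
    alpha = np.ones(k) / k
    for t in range(T):
        residual = alpha @ H_vals - f_vals      # current error on the sample
        grad = 2 * H_vals @ residual            # gradient w.r.t. alpha
        i = int(np.argmin(grad))                # best single h: linear minimization over the simplex
        gamma = 2.0 / (t + 2)                   # standard Frank-Wolfe step size
        vertex = np.zeros(k); vertex[i] = 1.0
        alpha = (1 - gamma) * alpha + gamma * vertex
    return alpha

# Toy usage: H = 50 random ReLU features on [0, 1]; f is a point in conv(H).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
H_vals = np.maximum(0, rng.normal(size=(50, 1)) * x + rng.normal(size=(50, 1)))
f_vals = rng.dirichlet(np.ones(50)) @ H_vals
alpha = fw_convex_match(H_vals, f_vals)
print("mean squared error:", ((alpha @ H_vals - f_vals) ** 2).mean())
```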
Ideas 2
- Simpler, boolean formulation. At step \(t\), keep weights on the data points and increase a point's weight if the discriminator successfully separated it from the generated samples. (How did they calculate the weights in AdaGAN?) At the end, let \(D\) be the best discriminator against the uniform (unweighted) data. What's \(\E_{x\sim X} D(x) - \sumo tT \al_t \E_{x\sim G_t} D(x)\)? (See the sketch after this list.)
- Is there something more sophisticated/suitable than a simple mixture? E.g., pick a data point and then try to generate something close to it? (Not this, actually…) Something like running the generators together and picking the best point among their outputs?
- If you take two finite samples from the same distribution, a powerful enough discriminator can distinguish between them (at least without regularization). So you have to choose the discriminator to be less powerful than this, or you will end up memorizing. Is there an architecture/regularization for the discriminator such that it will NEVER learn to discriminate between two samples from the same distribution?
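One way to instantiate the boolean reweighting from the first bullet above, purely as a sketch: `train_generator` and `train_discriminator` are hypothetical placeholders, the \((1+\eta)\) factor is an arbitrary choice (this is not AdaGAN's actual weight rule), and `final_gap` computes the gap asked about above for a given final discriminator.

```python
import numpy as np

def boolean_boosted_generators(data, T, train_generator, train_discriminator,
                               eta=0.5, n_gen=1000):
    """Boolean reweighting sketch: multiply a data point's weight by (1 + eta)
    whenever the round-t discriminator still separates it from the generated
    samples, i.e. the generators so far fail to cover it.

    train_generator(data, weights)   -> generator g with g.sample(n)
    train_discriminator(real, fake)  -> function D(x) in [0, 1], higher = "real"
    """
    n = len(data)
    w = np.ones(n) / n
    generators = []
    for t in range(T):
        g = train_generator(data, w)
        generators.append(g)
        D_t = train_discriminator(data, g.sample(n_gen))
        separated = D_t(data) > 0.5                 # boolean: still distinguishable as real
        w = w * np.where(separated, 1.0 + eta, 1.0)
        w = w / w.sum()
    return generators

def final_gap(D, data, generators, alphas, n_gen=1000):
    """E_{x~X} D(x) - sum_t alpha_t E_{x~G_t} D(x) for a final discriminator D."""
    real_term = np.mean(D(data))
    gen_term = sum(a * np.mean(D(g.sample(n_gen))) for a, g in zip(alphas, generators))
    return real_term - gen_term
```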