# NIPS 2017 Deep learning panel

Summary

Posted: 2017-12-22, Modified: 2017-12-22

This is a summary of the panel discussion at the NIPS 2017 deep learning (bridging theory and practice) workshop.

I’ve paraphrased the conversation; any misrepresentations are due to me.

Host: Sanjeev Arora (SA)

Panelists:

• Ruslan Salakhutdinov (RS)
• Percy Liang (PL)
• Yoshua Bengio (YB)
• Peter Bartlett (PB)

Audience (A)

## SA: Name one aspect of DL where you think theory can make a deep impact.

SK: Model-based solution concepts.

PL: Inductive bias. This is important in language understanding. Right now, model creators are mostly guided by intuition. Attention mechanisms are an example of inductive bias.

PB: Optimization and statistical properties, and generalization. Even though models are hugely overparametrized, the optimization leads to a solution that generalizes.

YB: In the past, neural networks have been in the realm of theory. But now, people doing theory should look in many other places as well. For example, to figure out how to formalize questions about uniqueness of representations, we can investigate when we get unique solutions to nonlinear ICA.

## SA: Is the presence of adversarial examples simply a puzzle, or is it a fundamental problem which will give deep insight?

SK: This is important in reinforcement learning (RL) because in some sense errors in RL always look like worst-case errors. Making progress on robustness in RL will be like making progress on adversarial learning. We don’t understand how error in RL compounds.

PL: When I first learned about adversarial examples I thought they were cute, but over the years I found that they led to many other ideas. They relate to interpretability, teaching, generalizing models to new distributions, robustness (is it learning the right model?), and having the right inductive bias. Adversarial examples are a good way to stress-test a model.

PB: Defending against adversarial attacks is important, though adversarial examples may be a phenomenon of many systems (even humans).

YB: It’s not specific to neural nets, but a broader problem. My intuition from the beginning is that it has a lot to do with the fact that humans use a lot of extra information about the data distribution $$p(x)$$, and this influences our classifiers in a strong way - we mostly learn $$p(x,y)$$, not $$p(y|x)$$. Attempts to jointly learn the input probability distribution and the classifier were not very convincing, so this early idea was forgotten.
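As an editorial aside (not part of the panel), the distinction YB draws can be written out. The notation $$p_\theta$$ and the training pairs $$(x_i, y_i)$$ are assumptions introduced here for illustration. A purely discriminative learner fits only the conditional, while a joint learner also has to explain the inputs themselves:

$$
\begin{aligned}
p(x, y) &= p(x)\, p(y \mid x), \\
\text{discriminative:}\quad & \max_\theta \sum_i \log p_\theta(y_i \mid x_i), \\
\text{joint:}\quad & \max_\theta \sum_i \big[ \log p_\theta(y_i \mid x_i) + \log p_\theta(x_i) \big].
\end{aligned}
$$

The extra $$\log p_\theta(x_i)$$ term is where knowledge about the data distribution enters. A perturbed input that is very unlikely under $$p(x)$$ could in principle be flagged by a model trained this way, whereas the discriminative objective alone never asks whether the input is plausible.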

RS: Building good generative models would help. But they are hard to build.

YB: If we capture 3-D models of images, a small perturbation won’t disturb them.

PL: Wouldn’t they? You can still attack the generative model.

## SA: Are generative models the correct approach to unsupervised learning?

SK: Even if density estimation does extremely well, what we care about is the last bit of error. The generative model can be accurate in most natural senses, yet still get fill-in-the-blank questions like “Jane is going to work. Her car broke down. She was _ (late).” wrong most of the time. It won’t do well until it completely learns the distribution.

PL: How do we set up the objective functions to explain the high-level information and semantics?

PB: A central issue is, what’s the appropriate objective function? We can use user interaction to pin it down.

YB: Consider: if an algorithm models acoustics with an information-theoretic objective, it has a hard time modeling speech, because the relevant abstract information is only a few bits out of thousands of bits of signal. Instead of considering the problem as a generative problem (with objectives like reconstruction error, entropy), look instead in the representation space: which directions matter? (See consciousness prior.)

RS: I believe in generative models that try to learn about the environment by interacting with it.

YB: Don’t predict in pixel space - predict in the abstract space.

SK: I believe in generative models for language, but we’re still missing concepts.

RS: 7-8 years ago, people thought dealing with pixels was a disaster, so they made features using HOG and SIFT. But deep nets work directly on pixels! The input is pixels, but deep nets go from that to a higher-level space.

PL: Learning to generate text isn’t the same as coming up with a generative model for text. We don’t need to cover the space; RL can just go to a mode.

## SA: Does RL need deep learning other than to sense the environment?

SK: In the vanilla control setting, I’m not convinced it does. For harder questions, I’m still not convinced current methods are doing anything more than just building a bridge between ways to represent the environment and sequential decision-making solutions. Harder questions involve logic and planning - is there currently an impressive demo where we’re seeing that? I think the deep learning part is just giving better representations of the environment.

PL: In the long term, RL will need sensing, the ability to hold large memory about the world, do reasoning, and express a complicated control policy.

PB: It would be surprising if you could solve complex control problems aside from sensing without having rich nonparametric classes - if there are not additional barriers. In most current examples you can get by with linear systems.

RS: Deep learning is helpful here because it learns the right representation for the task; it goes from the raw sensing to the representation and the policy, and you can just backprop. There are more things we can do beyond extracting features: have a separate architecture for storing information, attention, etc.

PL: We need to have baselines. There are cases (in dialog and in control) where simple nondeep models with the right inductive bias do better than everything with a deep neural net. We need to carefully measure where the value of deep learning is coming in.

## SA: Generalization - is it just a puzzle or does it lead to fundamental new insights?

On a practical level, we can just hold out some data for testing.
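As an editorial illustration (not from the panel), "holding out data" amounts to something like the minimal sketch below. The helper names `holdout_split` and `accuracy`, the toy data, and the threshold "model" are all hypothetical and chosen only to keep the example self-contained.

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, examples):
    """Fraction of (x, y) pairs for which predict(x) equals y."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

# Toy data (an assumption for illustration): the label is 1 when x is positive.
data = [(x, int(x > 0)) for x in range(-50, 50)]
train, test = holdout_split(data)

# Hypothetical "model": a threshold fitted on the training set only,
# halfway between the largest 0-labelled x and the smallest 1-labelled x.
largest_negative = max(x for x, y in train if y == 0)
smallest_positive = min(x for x, y in train if y == 1)
threshold = (largest_negative + smallest_positive) / 2

def predict(x):
    return int(x > threshold)

# The held-out estimate of generalization: the test set was never used to fit.
test_accuracy = accuracy(predict, test)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

This only verifies a trained model after the fact. It says nothing about why the model generalizes, which is the question the panelists take up.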

PB: There is more to understand. Most interesting is the connection between optimization and generalization. At the qualitative level, we understand a lot of measures of complexity. What are the implications for regularization? Why is SGD effective in finding solutions that generalize well? When we get that insight, we will find better algorithms, in particular for very noisy cases.

PL: With a practitioner’s hat, I ask: if I understand the generalization/optimization story, how would that change my behavior? I care about choosing families of functions with good inductive biases (that will make me have low approximation error) - this guides the first thing I do when I’m going into a new domain and figuring out what works. Is there a more end-to-end way of understanding approximation of functions and generalization at the same time, something that allows jointly optimizing them?

RS: It’s hard to beat the bag of tricks (SGD + dropout + batch normalization + momentum).

SK: The question is a proxy for thinking about algorithms which generalize. If all we care about is verification, just hold out data. We hope that the theory of generalization can help. On a fixed problem it’s hard to beat SGD with dropout.

But there’s a richer set of questions from thinking about generalization more broadly: generalization between problems (transfer learning), multitask learning, domain adaptation - all frustrations of ML.

PL: If you train on sequences of length 5, can you generalize to length 10? Images of different size? Extrapolation is important, and necessitates learning the right concepts.

MR: Does a neural net learn sensible abstractions and high-level concepts? A cognitive neuroscience perspective may help. How do we rethink architecture (e.g. Geoff Hinton’s capsules)?

RS: Design new architectures with notions of invariance - this adds in prior knowledge. A CNN learns things that a non-convolutional net finds hard; a CNN has lots of prior knowledge!

Given a face with 3 eyes, a neural network still thinks it’s a face. How do we incorporate priors? Memory is exciting (e.g., NTMs can write and read). It’s not clear how well they will work. How do we design new architectures with inductive biases?

PL: Inductive biases are important for improving sample efficiency, but they are even more necessary for learning the right thing. For images, classifiers aren’t robust to changes in the Fourier spectrum of the images.

I feel pessimistic that an architecture can magically learn the right thing, and have concepts that generalize well. But in practice, we can get good performance and useful products.

## A: Do we need to go beyond first order methods?

SK: In RL, a lot of solution concepts aren’t just first order. Many use trust region methods. How does Monte Carlo tree search fit in? There are issues of discrete vs. continuous optimization.

RS: It’s hard to beat SGD+momentum+BN. But there should be something going beyond it.

## A: Can we use generative models to prevent adversarial examples?

PL: Text adversarial examples cannot be detected by generative models. For adversarial examples, we need to think beyond the statistical worst case notion; $$L^\infty$$ is the tip of the iceberg.
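As an editorial illustration (not from the panel), here is a minimal sketch of the $$L^\infty$$ worst-case notion for the simplest possible model, a linear classifier. For a linear score $$w \cdot x + b$$ and a label $$y \in \{-1, +1\}$$, the perturbation within the budget $$\|\delta\|_\infty \le \epsilon$$ that hurts the margin most is $$\delta = -\epsilon\, y\, \mathrm{sign}(w)$$, which lowers the margin by exactly $$\epsilon \|w\|_1$$. The weights, input, and budget below are made-up values, and the function names are hypothetical.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def score(w, b, x):
    """Linear score w.x + b; the predicted label is the sign of the score."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def linf_worst_case(w, x, y, eps):
    """Move every coordinate by eps in the direction that lowers y * score."""
    return [xi - eps * y * sign(wi) for wi, xi in zip(w, x)]

# Made-up 100-dimensional example (an assumption for illustration).
d = 100
signs = [1 if i % 2 == 0 else -1 for i in range(d)]
w = [0.1 * s for s in signs]          # hypothetical weights
b = 0.0
x = [0.5 + 0.02 * s for s in signs]   # "pixel-like" input with values near 0.5
y = 1                                 # true label
eps = 0.05                            # per-coordinate perturbation budget

x_adv = linf_worst_case(w, x, y, eps)
clean_score = score(w, b, x)          # positive: classified correctly
adv_score = score(w, b, x_adv)        # negative: the label flips
l1_norm = sum(abs(wi) for wi in w)

print(f"clean score: {clean_score:+.3f}, adversarial score: {adv_score:+.3f}")
print(f"margin drop: {clean_score - adv_score:.3f} (eps * ||w||_1 = {eps * l1_norm:.3f})")
```

No coordinate moves by more than 0.05, yet the margin drop grows with the dimension through $$\|w\|_1$$. This covers only the norm-ball threat model, which is the one PL describes as the tip of the iceberg; it does not address text or other structured perturbations.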

SK: Theory has a valuable contribution to make, on robustness (to adversarial perturbation).

PL: Argmax is hard.

SA: There’s something off about an information theoretic approach to unsupervised learning. When I imagine a scene with people, it’s not clear that it’s coming out of a distribution.

PL: To generate language, it suffices to have a policy. Individual humans don’t have a distribution over sentences. We all have different distributions, yet can still get high rewards.

SA: Is there a distribution?

PB: This is like the determinism vs. free will question… You generate sentences in a particular way. There’s a distribution; I don’t see that as a problem.

SA: This could be a philosophical or a practical issue. On the practical side, the most important thing could be the last bit which is not learned.

PL: How do we fit language? Machine learning often learns generic responses, which are not interesting. We could flip the KL, and maximize the expectation with respect to the policy as opposed to the data distribution. We don’t have to cover the space, just find one solution out of the model, which is easier than modeling the whole distribution.
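As an editorial aside (not part of the panel), "flipping the KL" can be written out as follows. The symbols are assumptions introduced here: $$p$$ is the data distribution, $$q_\theta$$ is the model or policy, and $$H$$ is entropy.

$$
\begin{aligned}
\mathrm{KL}(p \,\|\, q_\theta) &= \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right], \\
\mathrm{KL}(q_\theta \,\|\, p) &= \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right]
= -\,\mathbb{E}_{x \sim q_\theta}\big[\log p(x)\big] - H(q_\theta).
\end{aligned}
$$

Minimizing the first (forward) form is maximum likelihood: the expectation is over data, so $$q_\theta$$ is penalized wherever the data has mass that the model misses, and it must cover the space. In the second (reverse) form the expectation is over the model's own samples, so $$q_\theta$$ can concentrate on a single mode at little cost. Replacing $$\log p(x)$$ with a general reward $$r(x)$$ and dropping the entropy term gives the policy objective $$\max_\theta \mathbb{E}_{x \sim q_\theta}[r(x)]$$, which is one reading of "maximize the expectation with respect to the policy".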

SA: My favorite research problem is to combine differentiable techniques with what introspection tells us - old AI logic. Can we come up with differentiable reasoning?

PL: I work a lot on semantic parsing. Logic allows you to move big pieces of things. Taking the maximum works reliably. You get extrapolation for free.

On the other hand, logic is a straitjacket, and has sharp cliffs.

We can use logic by pragmatically encoding the primitives and having a good inductive bias; having a logical backbone is like growing a vine using trellises.

RS: How do we combine logical rules with deep learning, so that if the logical rules make sense then the model picks them up, and if not, it gets rid of them?

For David Silver’s work on Go, he noted that Go players had specific rules. The neural network discovered some of them, and kept the important ones.

We can put logical rules into the prior, but then the net should figure out what makes sense.

SK: In RL, planning seems logic-based. AlphaGo is an interesting example: it leans on MCTS heavily.

Here deep learning is fitting a continuous approximation of the world.

PL: I want to decouple the idea of logic as a representation vs. as a replacement for learning.

It’s more useful to think of it as a representation: certain parts of a problem have structure. Think of logic as making things digital vs. analog. For long-horizon situations, if you want to not forget, you do error correction, and logic is doing something similar.

SA: Will there be a grand synthesis of deep learning and 60’s/70’s (good-old-fashioned) AI?