AAML workshop
Posted: 2017-04-01 , Modified: 2017-04-01
Tags: ai safety, machine learning
See
What’s the problem with KWIK?
Dimensions to vary
Filter class \(F\) and hypothesis class \(H\). Suppose there exist \(f,h\) such that \(fh(x)=y, 1-y, \perp\) with probabilities \(p, 0, 1-p\). We want to find \(\wh f\wh h\) with \(\wh f \wh h(x)=y,1-y, \perp\) with probabilities \(p+\ep, \ep', 1-p-\ep-\ep'\), where \(\ep'\ll \ep\).
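A minimal sketch of the filter-plus-hypothesis interface, assuming `f` and `h` are hypothetical callables (the filter decides whether to commit to a prediction; the hypothesis produces the label when the filter fires):

```python
from typing import Optional

class FilteredHypothesis:
    """Combined predictor f∘h that either outputs a label or abstains (⊥)."""
    def __init__(self, f, h):
        self.f = f  # filter: decides whether to commit to a prediction
        self.h = h  # hypothesis: produces the label when the filter accepts

    def predict(self, x) -> Optional[int]:
        """Return a label y when the filter accepts x, else None (abstain)."""
        return self.h(x) if self.f(x) else None

def error_and_abstention_rates(model, data):
    """Empirical estimates of the rates (p + eps, eps', 1 - p - eps - eps')."""
    correct = wrong = abstain = 0
    for x, y in data:
        pred = model.predict(x)
        if pred is None:
            abstain += 1
        elif pred == y:
            correct += 1
        else:
            wrong += 1
    n = len(data)
    return correct / n, wrong / n, abstain / n
```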
Neural net anomaly detection:
2-3 levels of problem.
(Lump 2 and 3 together.)
This is easiest to formalize and tackle as a theory problem. Assume there is a distinguishing function (conservative concept?) that excludes all bad outcomes.
Under the human distribution of actions, the reward given corresponds to value.
But after gaming (e.g., realizing the human only checks in one place to verify the room is clean), the agent leaves the human distribution of actions, and the inferred/represented reward stops corresponding to value. The agent can realize this either by maintaining multiple hypotheses about the reward function or through human feedback (a rough sketch of such a disagreement check is below).
Going back and forth, it can continuously improve. Can it generalize over human pushback?
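One way to make the "multiple hypotheses" detection concrete, as a rough sketch: keep an ensemble of inferred reward functions (`reward_models` is a hypothetical name) and flag actions on which they disagree, since high disagreement suggests the agent has drifted off the human distribution:

```python
import numpy as np

def gaming_alert(reward_models, state, action, disagreement_threshold=0.5):
    """Flag when an action leaves the region where reward hypotheses agree.

    `reward_models` is a hypothetical ensemble of inferred reward functions;
    large disagreement suggests the inferred reward may no longer track value.
    """
    estimates = np.array([r(state, action) for r in reward_models])
    return estimates.std() > disagreement_threshold, estimates.mean()
```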
Alternative: have separate agent/part of agent that acts as predictor - holds model of world, job is to predict, e.g., whether there’s a strawberry.
Question: can’t you just embed the “plagiarism” problem in here?
Maybe the problem considered here is more concrete: There’s a better notion of what a strawberry/plate is than a good story.
Note that “putting self in simulation” is a relative term. It means “fooling all its sensors.” If it has a world-model, this means the harder task of fooling the world-model. (Think of world-model itself as a sensor system.) (Maybe the world-model asks for proofs?)
Why can’t you just have a world model or adversarial predictor? Problem if there is no good evaluator.
This contains the conservative concept problem and the reward hacking problem. (I think.) Solving the informed oversight problem is sufficient.
It is impossible to refer to the physical world. Our mapping from physical actions to mental representations is many-to-one; many ways of moving our arm all get translated into a mental story of “picking up the strawberry.” There are many ways to execute this task.
We only live in our own conceptual space. This space is highly bound/coupled to actual physics. (There's no glitch in the universe such that if I move my arm in some specific way, Konami-code style, I shut down the universe.) Any way I move my arm is roughly the same.
The AI solves tasks within its own conceptual space. We can evaluate that the AI is doing the right thing insofar as it is transparent, we can look at its world model, point at the concept of “strawberry” and see that it’s close enough to our own. We can solve “environmental goals” if the intersection of ontologies is nonempty, and the goal is within that intersection.
Measure 1: get back to what would have happened under the null policy \(\phi\). \[ I(s) = \min_{\pi} D_{KL}(P_{t_1}^{\phi}(\bullet | s)\,\|\, P_{t_2}^\pi(\bullet | s')) \]
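A Monte Carlo sketch of Measure 1, under assumptions not in the notes: a hypothetical `simulate(policy, state, horizon)` that returns a (hashable) outcome at the horizon, a null policy \(\phi\), and a small candidate set of policies to minimize over:

```python
import numpy as np
from collections import Counter

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) * np.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
               for k in keys)

def rollout_dist(simulate, policy, state, horizon, n_samples=1000):
    """Monte Carlo estimate of the distribution over outcomes at the horizon."""
    counts = Counter(simulate(policy, state, horizon) for _ in range(n_samples))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def impact(simulate, null_policy, candidate_policies, s, s_prime, horizon):
    """min over candidate policies of D_KL(null rollout from s || rollout from s')."""
    p_null = rollout_dist(simulate, null_policy, s, horizon)
    return min(kl(p_null, rollout_dist(simulate, pi, s_prime, horizon))
               for pi in candidate_policies)
```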
Measure 2: stay similar to a trusted region \(R\). Let \(f:X\to Y\) be a mapping to feature space. \[ I(s) = d(R, f(s)). \] For example, \(R=\{f(s_0)\}\) and \(d\) the \(\ell_2\) distance.
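A minimal sketch of Measure 2 with \(R\) a finite set of trusted states and \(d\) the \(\ell_2\) distance in feature space; `feature_map` is a hypothetical stand-in for \(f\):

```python
import numpy as np

def impact_measure(feature_map, trusted_states, s):
    """I(s) = min over trusted states of the l2 distance in feature space.

    `feature_map` plays the role of f: X -> Y (returns a vector);
    `trusted_states` plays the role of R.
    """
    fs = feature_map(s)
    return min(np.linalg.norm(fs - feature_map(s0)) for s0 in trusted_states)
```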
Measure 3: train \(I\) on examples, conservatively (e.g., RBF): good examples \((s_i,0)\) and bad examples \((s_j,>0)\). Prior information can also be encoded, e.g., about things that are neutral.
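One possible reading of Measure 3, sketched with an RBF kernel: smooth the labeled examples, and fall back to a default penalty away from all known examples (the "conservative" part). The specific construction and names like `default_penalty` are assumptions, not from the workshop notes:

```python
import numpy as np

def rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / (2 * sigma ** 2))

def fit_conservative_impact(good, bad, default_penalty=1.0, sigma=1.0):
    """Kernel-smoothed impact estimate that stays high away from known-good states.

    `good` is a list of states labeled 0; `bad` is a list of (state, penalty > 0)
    pairs; states far from all examples fall back to `default_penalty`.
    """
    def I(s):
        w_good = sum(rbf(s, g, sigma) for g in good)
        w_bad = sum(rbf(s, b, sigma) for b, _ in bad)
        val_bad = sum(rbf(s, b, sigma) * c for b, c in bad)
        w_prior = max(1.0 - w_good - w_bad, 0.0)  # mass left for the "unknown" prior
        total = w_good + w_bad + w_prior
        return (val_bad + w_prior * default_penalty) / total
    return I
```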
Probability of \((x_i,y_i)\) under \(p_{\te_1}\) and \(p_{\te_2}\), where \(p_i=1\) if \((x_i,y_i)\sim D_1\) and \(p_i=0\) if \((x_i,y_i)\sim D_2\): \[ \sum_{i=1}^n p_i\, p_{\te_1}(y_i|x_i) + \sum_{i=1}^n (1-p_i)\, p_{\te_2}(y_i|x_i). \] Maximize the (log) likelihood. Do EM, alternating between the \(p_i\)'s and the parameters \(\te_1,\te_2\).
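A sketch of the EM loop for this two-model mixture, assuming hypothetical callbacks `loglik1`/`loglik2` (per-example log-likelihoods) and `fit1`/`fit2` (weighted maximum-likelihood fits for \(\te_1,\te_2\)):

```python
import numpy as np

def em_two_conditional_models(xs, ys, loglik1, loglik2, fit1, fit2, n_iters=20):
    """EM sketch: soft assignments p_i of each (x_i, y_i) to model 1 vs model 2."""
    n = len(xs)
    p = np.full(n, 0.5)                      # responsibilities p_i
    theta1 = fit1(xs, ys, p)
    theta2 = fit2(xs, ys, 1 - p)
    for _ in range(n_iters):
        # E-step: posterior that each example came from model 1 (uniform prior)
        l1 = np.array([loglik1(theta1, x, y) for x, y in zip(xs, ys)])
        l2 = np.array([loglik2(theta2, x, y) for x, y in zip(xs, ys)])
        p = 1.0 / (1.0 + np.exp(l2 - l1))    # sigmoid of log-likelihood ratio
        # M-step: refit each model on weighted examples
        theta1 = fit1(xs, ys, p)
        theta2 = fit2(xs, ys, 1 - p)
    return p, theta1, theta2
```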
Version 2: use IRL: keep track of best-guess nets, or sets of valid hypotheses. Keep track of the posterior probability of each net (updated in an online fashion). Update posterior probabilities assuming Markovian switching (cf. the DP in HMMs, sleeping experts), and do gradient descent on parameters.
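One step of the online posterior update under Markovian switching (an HMM-style forward step), sketched with an assumed uniform switching kernel; the gradient-descent part is omitted:

```python
import numpy as np

def update_posterior(posterior, logliks, switch_prob=0.01):
    """One online step of the posterior over K hypothesis nets with Markovian switching.

    `posterior` is the current distribution over K hypotheses; `logliks` holds
    the log-likelihood each hypothesis assigns to the newest observation; with
    probability `switch_prob` the active hypothesis switches uniformly
    (the HMM / sleeping-experts style prediction step).
    """
    posterior = np.asarray(posterior, dtype=float)
    logliks = np.asarray(logliks, dtype=float)
    K = len(posterior)
    # Prediction step: mix toward uniform to allow for switching
    predicted = (1 - switch_prob) * posterior + switch_prob / K
    # Correction step: reweight by the likelihood of the new observation
    weights = predicted * np.exp(logliks - np.max(logliks))
    return weights / weights.sum()
```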