Game theory and decision theory
Posted: 2017-02-12 , Modified: 2017-02-12
Tags: ai safety, game theory, decision theory
Papers on game theory/decision theory:
But what are the available actions? And what are the counterfactual universes corresponding to what “would happen” if an action “were taken”?
(A deterministic agent could only have taken one action.)
To fully describe the problem faced by intelligent agents making decisions, it is necessary to provide an idealized procedure which takes a description of an environment and one of the agents within, and identifies the best action available to that agent.
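A minimal sketch (my own notation, nothing here is from the paper) of the shape of such an idealized procedure: score every available action under some notion of counterfactual and return the best one. What `counterfactual` should mean is exactly the point of contention between CDT, EDT and UDT below.

```python
def best_action(environment, agent, actions, counterfactual, utility):
    # counterfactual(environment, agent, a) = the outcome "if the agent took a"
    return max(actions, key=lambda a: utility(counterfactual(environment, agent, a)))
```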
Evidential blackmail: an AI researcher knows whether the AI will lose $150 mil from an investment (or a scandal). If either is true, the researcher sends the info and asks for $100 mil:
Conditioned on refusing, the agent loses $150 mil; so EDT pays up.
Counterfactual blackmail: the researcher has developed a computer virus which would cause both to lose $150 mil. Once deployed, the only way to prevent activation one day later is to wire $100 mil. The researcher would only deploy it if quite sure the agent will pay.
CDT prescribes that an agent resist certain attempts to improve its decision procedures.
Retro blackmail: AI researcher has access to original source code. Can self-modify after researcher acquires original source code but before researcher decides whether to deploy.
Self-modify to not give in to demands.
But CDT and any decision procedure to which CDT would self-modify would lose money.
CDT: Alas, the virus has been deployed. I would have preferred that the virus not be deployed, but since it has been, I must now decide whether or not to pay up. Paying up is bad, but refusing is worse, so I’ll pay up.
Policy selection: The optimal policy is to refuse to pay up upon observing that the virus has been deployed. I now observe that the virus has been deployed. Therefore, I refuse to pay.
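A quick check with the numbers above (the deployment probability \(d\) is my notation, not the paper’s): once the virus is deployed, paying loses $100 mil and refusing loses $150 mil, so CDT pays; but evaluated as a policy chosen before the researcher acts,
\[
\E[\text{loss}\mid \text{policy: pay}] \approx 100\ \text{mil} \quad (d\approx 1), \qquad
\E[\text{loss}\mid \text{policy: refuse}] = 150\,d \approx 0 \quad (d\approx 0),
\]
since the researcher only deploys when quite sure the agent will pay.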
Consider the Prisoner’s Dilemma when the opponent runs (nearly) the same decision algorithm as the agent.
However, CDT evaluates actions according to a physical counterfactual where the action is changed but everything causally separated from the action is held constant. It is not the physical output of the agent’s hardware which must be modified to construct a counterfactual, it is the logical output of the agent’s decision algorithm.
Cf. \(10 \E a - a\).
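One way to read this (my gloss, not the source’s): with \(a\in\{0,1\}\) and payoff \(10\,\E a - a\), where \(\E a\) is a prediction of (or a twin’s copy of) the agent’s own action, CDT holds \(\E a\) fixed and sees only the \(-a\) term, so it picks \(a=0\); recognizing the logical tie \(\E a = a\) gives payoff \(9a\), so \(a=1\) is better.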
UDT chooses the best action according to a world-model which represents not only causal relationships in the world, but also the logical effects of algorithms upon the world.
How to encode logical relations in graph? Underspecified: constructing graph is difficult. Graph for UDT further requires some ability to identify and separate “algorithms” from the physical processes that implement them. How is UDT supposed to recognize that the agent and its opponent implement the same algorithm?
To illustrate, consider UDT identifying the best action available to an agent playing a Prisoner’s Dilemma against an opponent that does exactly the same thing as the agent 80% of the time, and takes the opposite action otherwise.
(Problem seems to be identifying what something is doing logically - it might be obfuscated. Also, graph loses info from algorithm.)
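The expected utilities in that example, computed directly (the payoff matrix is an assumption on my part, a standard PD with T=5, R=3, P=1, S=0; the point is only that the 80% logical correlation makes cooperation come out ahead):

```python
# (my_action, their_action) -> my utility
payoff = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
p_same = 0.8  # probability the opponent copies my action

def expected_utility(my_action):
    other = {"C": "D", "D": "C"}[my_action]
    return p_same * payoff[(my_action, my_action)] + (1 - p_same) * payoff[(my_action, other)]

print(expected_utility("C"))  # 0.8*3 + 0.2*0 = 2.4
print(expected_utility("D"))  # 0.8*1 + 0.2*5 = 1.8  -> cooperating is better
```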
Evaluate logical implications of the agent’s algorithm selecting the policy \(\pi\).
Graph is unnecessary: the environment itself is an algorithm.
Proof-based UDT evaluates policies by searching for formal proofs, using some mathematical theory such as Peano Arithmetic (PA), of how much utility is attained in the world-model if A() selects the policy \(\pi\).
But: this requires a halting oracle, and it can only identify the best policy if there exists a proof that executing that policy leads to a good outcome.
Problem in stochastic environments (? if can model prior, seems ok)
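A rough sketch of the proof-search procedure (my own rendering, not the paper’s formalism): `provable` stands in for an idealized, uncomputable proof-search oracle over PA, and `policies`/`utilities` are finite lists of candidate policies and outcome values. For each policy, find the highest utility \(u\) such that “A() = \(\pi\) → U() = \(u\)” is a theorem, and pick the policy whose provable outcome is best.

```python
def proof_based_udt(policies, utilities, provable):
    best_policy, best_utility = None, float("-inf")
    for pi in policies:
        for u in sorted(utilities, reverse=True):
            if provable(f"A() = {pi} -> U() = {u}"):
                if u > best_utility:
                    best_policy, best_utility = pi, u
                break  # highest provable utility for this policy found
    return best_policy
```

Note that this is exactly where the spurious-counterfactual issue discussed below bites: a vacuously provable implication can win the search.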
Ex. an agent uses UDT to play a game with a human: each writes down a number, and if the numbers written sum to \(\$10\) or less, each is paid the amount they wrote, else both get \(\$0\).
UDT misidentifies best policy:
Human: I don’t quite know how UDT works, but I remember hearing that it’s a very powerful predictor. So if I decide to write down 9, then it will predict this, and it will decide to write 1. Therefore, I can write down 9 without fear.
But agent with superior predictive power loses to the dumber agent! Human’s lack of power to predict UDT gives an advantage!
Problem: not guaranteed to work. As soon as proof-based UDT proves that an agent will not take a certain policy, it concludes that taking that policy leads to the best possible outcome (because from a contradiction, anything follows).
(I don’t get this…) \[ A() = \mathrm{UDT}(\ulcorner E \urcorner, \ulcorner A \urcorner), \] i.e. UDT is handed the quoted source code of the environment and of the agent itself.
(What is an embedding of an agent?)
Why Lob? The short answer is that this theorem illustrates the basic kind of self-reference involved when an algorithm considers its own output as part of the universe, and it is thus germane to many kinds of research involving self-modifying agents, especially when formal verification is involved or when we want to cleanly prove things in model problems.
Problem: How can Deep Thought 1.0 build Deep Thought 2.0 with guarantee of good consequences?
DT1 can’t actually figure out what actions DT2 is going to take. Naively, it seems as if it should be enough for DT1 to know that DT2 has the same goals as DT1, that DT2’s deductions are reliable, and that DT2 only takes actions that it deduces to have good consequences on balance.
If we try and model DT1 and DT2 as proving statements in two formal systems (one stronger than the other), then the only way that DT1 can make such a statement about DT2’s reliability is if DT1 (and thus both) are in fact unreliable!
One (anachronistic) way of stating Godel’s key insight is that you can use computer programs to search for proofs, and you can prove statements about computer programs.
This makes it more intuitive that we can embed self-reference in mathematics, because we can embed self-reference in computer code!
Quines: programs can be made which perform arbitrary tasks using their own source code. So we can have a program G which refers to itself in this way, and searches for proofs in arithmetic related to its own source code.
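The quine trick itself fits in two lines of Python (my own toy example; G uses the same idea, but hands its reconstructed source to a proof search instead of printing it):

```python
# The two code lines below print themselves exactly; replace the print with any
# other use of `s % s` and the program is operating on its own source code.
s = 's = %r\nprint(s %% s)'
print(s % s)
```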
G searches for a proof of “G runs forever”, and halts if it finds one. G never finds a proof (so it really does run forever), but we can never prove that G runs forever.
We can add “G never halts” or “G halts” as a new axiom.
First incompleteness. If the theory of arithmetic is consistent, then there exist undecidable statements in the theory of arithmetic.
Lob’s Theorem: if \(\square \phi \to \phi\) is provable, then \(\phi\) is provable (formalized as the GL axiom below, \(\square(\square\phi\to\phi)\to\square\phi\)).
Intuition: if \(L(X) := \square X \to X\) were provable for every \(X\), then taking \(X=\perp\) you could prove consistency, contradicting the second incompleteness theorem.
There is also a finite version of Lob’s theorem, which controls the length of a proof of \(\phi\) in terms of (the proof of) \(\square \phi\to \phi\).
DT1 trusts DT2: if the action is deduced to be good, then it must actually be good.
(It can’t even trust itself?)
a simple agent with a utility function will only create successors whose formal systems are strictly weaker than its own, since only those are fully trusted by the current system.
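In symbols (my own paraphrase of the obstacle): the trust DT1 needs in DT2’s theory \(T_2\) is the soundness schema
\[
T_1 \vdash \square_{T_2}\phi \to \phi \quad \text{for the relevant sentences } \phi.
\]
If \(T_2\) is at least as strong as \(T_1\), then taking \(\phi = \perp\) gives \(T_1 \vdash \neg\square_{T_1}\perp\), i.e. \(T_1\) proves its own consistency, which by the second incompleteness theorem it can only do if it is inconsistent. Hence the restriction to strictly weaker successors.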
References: Tiling Agents for Self-Modifying AI, and the Lobian Obstacle [18] and Problems of self-reference in self-improving space-time embedded intelligence [9].
Setup: your algorithm and theirs get to read the opponent’s source code, calculate for as long as they like, and then play only once.
Cooperate iff source code identical: fragile.
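A minimal sketch of this bot (sometimes called CliqueBot; the function name and string encoding of actions are my own):

```python
# Cooperate only against an exact textual copy of our own source; defect otherwise.
# Fragile: a functionally identical program with one renamed variable gets defected against.
def clique_bot(my_source: str, opponent_source: str) -> str:
    return "C" if opponent_source == my_source else "D"
```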
FairBot: search through all proofs of length \(\le N\) to see if any is a valid proof of \(X(\mathrm{FairBot})=C\). If yes, output \(C\); otherwise output \(D\).
Intuitively, it seems like both mutual cooperation and mutual defection are stable fixed points of the situation. However, a Lobian statement breaks the deadlock in favor of cooperation!
Proof: \(L(\text{“FairBot(FairBot)=C”})\) follows: if cooperation is provable, FairBot finds the proof and does cooperate. By Lob, there must therefore be a proof of FairBot(FairBot)=C. (! This is a case where Lob actually helps prove something!)
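In modal form (writing \(p\) for “FairBot(FairBot)=C” and idealizing away the proof-length bound \(N\)): FairBot’s definition gives \(\vdash \square p \to p\), since any proof of cooperation would be found and acted on, and Lob then gives \(\vdash p\).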
See Program Equilibrium in the Prisoner’s Dilemma via Lob’s Theorem.
Note on model:
When the agent and universe are quined, you can’t “run the universe” on \(A\) or \(A'\). You search for proofs of \(A()=x\to U()=c\). Problem: ordering matters. If you prove \(x\) is better you choose \(x\), even though it could be worse, because if the agent in fact takes \(x\), then \(A()=\neg x\to U()=c\) is vacuously true for any \(c\).
(Why can’t extensional work? Search for proofs in universe quined with \(A'\), then choose \(A'\).)
One fix is a careful ordering of which proofs the agent pays attention to; that agent can be shown to make the correct decisions (given enough deductive power). The idea was originally due to Benja Fallenstein; see Tsvi Benson-Tilsen’s paper UDT with Known Search Order.
The same theory can have many models, some of them not at all what you were thinking of when you made the axioms. Notes: to get Peano Arithmetic you need the first-order induction schema on top of the basic axioms for successor, addition, and multiplication (Robinson arithmetic).
There are models of Peano arithmetic where G holds, and other models where G fails to hold. In Robinson arithmetic a nonstandard element (“Bob”) might satisfy the formula G is checking. In PA the nonstandard models are weirder.
The key to understanding these is that G never halts at any finite number, but we can’t actually define in our formal language what “finite” means. The nonstandard models of Peano Arithmetic are those which have all the usual numbers but also lots of extra numbers that are “too large” to ever be written as lots of S’s followed by an O, but which nonetheless are swept along in any inductive statement.
Remark: nonstandard analysis
Ponder for a moment the formal system which has all the axioms of PA, plus the axiom that PA is consistent, plus the axiom that “the system which has all the axioms of PA, plus the axiom that PA is consistent” is inconsistent. As it turns out, this is a perfectly consistent system. (What happens if you take the recursively axiomatizable “consistency\({}^n\)”, \(n\in \N\)?)
We might want to endorse some particular model as the “true” one (for instance, our standard model of the natural numbers, without all of those weird nonstandard numbers), and say that logical statements are true if they hold in that model and false if they don’t. This truth predicate exists outside the language, and so the logical statements can’t talk about the truth predicate, only about weaker notions like provability.
The trouble comes when we try to construct a language that contains its own truth predicate such that \(T(\phi)\lra \phi\): diagonalization gives a liar sentence \(X\lra \neg T(X)\), from which \(T(X)\lra T(\neg X)\) (and indeed \(T(X)\lra \neg T(X)\)), a contradiction.
if P isn’t allowed to make exact statements about its own values, but only arbitrarily precise approximations, then everything can work out consistently.
P can’t rule out the possibility that reciprocals of nonstandard natural numbers (infinitesimals) exist.
Follow-up paper by Christiano on computationally tractable approximations of probabilistic logic: Non-Omniscience, Probabilistic Inference, and Metamathematics.
We’re interested in a particular modal logic that constitutes a reflection of Lobian phenomena in PA, etc.
GL axiom: \(\square (\square A \to A)\to \square A\).
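For reference (standard presentation, not from the notes): GL’s theorems are generated from propositional tautologies, the distribution axiom \(\square(A\to B)\to(\square A\to\square B)\), and the Lob axiom above, closed under modus ponens and necessitation (\(\vdash A\) implies \(\vdash \square A\)); transitivity \(\square A\to\square\square A\) is then derivable.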
(What about things like \(\forall x, \square P(x)\)?)
There are some special cases where there are efficient algorithms for deducing provability in GL.
All sorts of formulas that refer to themselves and each other by quining. Formulas \(p\lra \phi(p,q_1,\ldots, q_k)\) modalized in \(p\): every \(p\) in \(\phi\) is in scope of some \(\square\).
When \(p\) is equivalent to a formula modalized in \(p\), then \(p\) is equivalent to something which doesn’t use \(p\). E.g. the Godel-style statement “I am refutable” is equivalent to the inconsistency of arithmetic: \(\square (p\lra \square \neg p) \lra \square (p\lra \square \perp)\).
A formula is provable in the modal logic if and only if a corresponding property holds for every Kripke frame in the associated class (for GL: finite, transitive, irreflexive frames).
Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic.
One embarrassing thing about FairBot is that it doesn’t check whether its potential cooperation would actually make any difference. (ex. it cooperates against a rock.)
Ex. PrudentBot: cooperate iff PA proves the opponent cooperates with PrudentBot and PA + Con(PA) proves the opponent defects against DefectBot (so, unlike FairBot, it defects against CooperateBot).
If we assume infinite computational power (i.e. the ability to consult a halting oracle about proof searches in Peano Arithmetic), then these agents can be written out as simple statements in modal logic.
Find what happens using an efficient algorithm (see the sketch below)!
Modal agents of rank 0: \(p\lra \phi(p,q)\), fully modalized in both \(p\) and \(q\) (they don’t run the opponent, they only prove things about it). By the fixed-point theorem this is equivalent to some \(p\lra \psi(q)\) that doesn’t mention \(p\).
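A minimal sketch of that algorithm (my own code and naming; it assumes the standard trick of solving the agents’ self-referential equations bottom-up on a finite linear GL Kripke frame and reading the answer off a world deeper than the formulas’ modal depth, with True standing for cooperation):

```python
# Evaluate a pair of modal agents on a finite linear GL Kripke frame.
# World n "sees" exactly the worlds 0..n-1, so box(f) at world n means
# "f holds at every lower world".  This lets us solve the self-referential
# equations  p <-> phi_A(p, q),  q <-> phi_B(q, p)  world by world, since the
# agents' formulas are fully modalized (p and q only occur under boxes).
#
# Formulas: True / False, "p", "q", ("not", f), ("and", f, g), ("or", f, g), ("box", f)

def ev(f, w, p_vals, q_vals):
    """Truth value of formula f at world w, given p/q already solved below w."""
    if f is True or f is False:
        return f
    if f == "p":
        return p_vals[w]
    if f == "q":
        return q_vals[w]
    op = f[0]
    if op == "not":
        return not ev(f[1], w, p_vals, q_vals)
    if op == "and":
        return ev(f[1], w, p_vals, q_vals) and ev(f[2], w, p_vals, q_vals)
    if op == "or":
        return ev(f[1], w, p_vals, q_vals) or ev(f[2], w, p_vals, q_vals)
    if op == "box":  # "provable": true at every world the current world sees
        return all(ev(f[1], v, p_vals, q_vals) for v in range(w))
    raise ValueError(f"unknown formula: {f!r}")

def play(phi_A, phi_B, depth=10):
    """Solve p <-> phi_A and q <-> phi_B bottom-up; return the top-world values
    (True = Cooperate).  `depth` just needs to exceed the formulas' modal depth."""
    p_vals, q_vals = [], []
    for w in range(depth):
        p_vals.append(ev(phi_A, w, p_vals, q_vals))
        q_vals.append(ev(phi_B, w, p_vals, q_vals))
    return p_vals[-1], q_vals[-1]

fairbot_as_p = ("box", "q")  # "I cooperate iff I can prove you cooperate"
fairbot_as_q = ("box", "p")

print(play(fairbot_as_p, fairbot_as_q))  # FairBot vs FairBot      -> (True, True): Lobian cooperation
print(play(fairbot_as_p, False))         # FairBot vs DefectBot    -> (False, False)
print(play(fairbot_as_p, True))          # FairBot vs CooperateBot -> (True, True): the "rock" issue above
```

Any rank-0 agent built from these connectives can be dropped in as phi_A or phi_B.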