Bandit convex optimization

Posted: 2016-10-11 , Modified: 2016-10-11

Tags: convex optimization

Settings of increasing complexity:

Multi-armed bandit

\(\de\)-greedy algorithm

The \(\wh \ell_t\) were chosen so \[ \E \wh \ell_t(i) = \ell_t(i).\] We have \[\E R_T \le 3 GD\sqrt T + \de T,\] where \(G\le \fc n\de\).
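To see why the estimates are unbiased, assuming the standard importance-weighted estimator \(\wh\ell_t(i) = \fc{\ell_t(i)}{x_t(i)} \mathbf 1[i_t = i]\) (the exact form isn't written above, so this is an assumption), where \(i_t\sim x_t\) is the arm actually pulled:

\[ \E_{i_t\sim x_t} \wh\ell_t(i) = x_t(i)\cdot \fc{\ell_t(i)}{x_t(i)} + (1-x_t(i))\cdot 0 = \ell_t(i). \]

Because the \(\de\)-exploration keeps \(x_t(i)\ge \fc{\de}{n}\), the estimates are bounded by \(\fc n\de\), which is where \(G\le \fc n\de\) comes from.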

EXP3

Exponential weights for exploration and exploitation

Adversarial setting.

Blog post.

We can explore and exploit at the same time if we (a) keep track of a probability distribution over arms and update it, and (b) reweight the observed losses so that the estimated losses have the correct expected value.

Idea: Use the MWU (hedge) algorithm.

Updates \[\begin{align} y_{t+1}(i) & = x_t(i) e^{-\ep \wh\ell_t(i)}\\ x_{t+1}&= \pa{1-\rc{\sqrt T}} \fc{y_{t+1}}{\ve{y_{t+1}}_1} + \rc{n \sqrt T}\one. \end{align}\]

Question: what’s the purpose of mixing in the uniform distribution in the \(x_t\) update? To make sure every arm is played with at least some probability? (E.g., each arm will be played infinitely many times.)

Get \(R_T\le O(\sqrt{Tn \ln n})\).
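A minimal Python sketch of the loop above, under stated assumptions: losses in \([0,1]\), the importance-weighted estimator, mixing weight \(\rc{\sqrt T}\) with the uniform distribution, and a tunable step size \(\ep\).

```python
import numpy as np

def exp3(losses, eps):
    """EXP3-style sketch.  losses: T x n array of adversarial losses in [0, 1];
    only the loss of the pulled arm is revealed to the learner."""
    T, n = losses.shape
    x = np.full(n, 1.0 / n)                  # x_t: distribution over arms
    total_loss = 0.0
    for t in range(T):
        i = np.random.choice(n, p=x)         # pull arm i_t ~ x_t
        total_loss += losses[t, i]
        ell_hat = np.zeros(n)
        ell_hat[i] = losses[t, i] / x[i]     # importance-weighted loss estimate
        y = x * np.exp(-eps * ell_hat)       # multiplicative-weights step
        x = (1 - 1 / np.sqrt(T)) * y / y.sum() + 1 / (n * np.sqrt(T))  # mix in uniform
        x /= x.sum()                         # guard against floating-point drift
    return total_loss
```

One effect of the uniform mixing is that \(x_t(i)\ge \rc{n\sqrt T}\) for every arm, so the estimates satisfy \(\wh\ell_t(i)\le n\sqrt T\) and stay bounded.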

UCB1

Stochastic setting.

“Optimism in the face of uncertainty.”

Blog post.

Algorithm: first play each action once; then at each step play the action with the highest upper confidence bound \[j = \amax_j \ol x_j + \ub{\sfc{2\ln T}{n_j}}{a(n_j, T)}. \]
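A minimal Python sketch of this rule with a fixed horizon \(T\); treating each arm as a reward-sampling function with values in \([0,1]\) is an assumption for illustration.

```python
import numpy as np

def ucb1(arms, T):
    """UCB1 sketch.  arms: list of K callables, each returning a reward in [0, 1]."""
    K = len(arms)
    counts = np.zeros(K)          # n_j: number of plays of arm j
    means = np.zeros(K)           # empirical mean reward of arm j
    for t in range(T):
        if t < K:
            j = t                 # first play each arm once
        else:
            ucb = means + np.sqrt(2 * np.log(T) / counts)  # upper confidence bounds
            j = int(np.argmax(ucb))
        r = arms[j]()
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]             # running-mean update
    return means, counts

# e.g. ucb1([lambda p=p: float(np.random.rand() < p) for p in (0.3, 0.5, 0.7)], 10000)
```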

UCB on a MAB with \(K\) actions, where the rewards \(X_{i,t}\in [0,1]\) are independent with \(X_{i,t}\sim D_i\), and \(\ol x_{i,s}\) denotes the empirical mean after playing arm \(i\) \(s\) times, has expected cumulative regret \[ \E R(T) = O(\sqrt{KT \ln T}). \]

Proof.

  1. Let \(\De_i = \mu^*-\mu_i\), let \(P_i(T)\) be the number of times \(i\) is picked by time \(T\), let \(I_t\) be the \(t\)th choice, and let \(a(s,t)\) be the width of the confidence interval at time \(t\) after \(s\) observations. Then \[\begin{align} \E G_A(T) &= \sum_i \mu_i \E P_i(T)\\ \E P_i(T) &= 1+\sum_{t=K}^T (I_t=i)\\ &\le m + \sum_{t=K}^T (I_t=i\wedge P_i(t-1)\ge m)\\ &\le m + \sum_{t=K}^T (U_i(t-1) \ge U^*(t-1), P_i(t-1)\ge m)\\ &\le m + \sumo t{\iy} \sum_{s=m}^{t-1}\sum_{s'=1}^{t-1} (\ol x_{i,s} + a(s,t-1) \ge \ol x_{s'}^* + a(s',t-1))\\ &\le \ff{8\ln T}{\De_i^2} + \sumo t{\iy} \sum_{s=m}^t \sum_{s'=1}^t 2t^{-4} = \fc{8\ln T}{\De_i^2} + 1 + \fc{\pi^2}{3}. \end{align}\] Here we chose \(m = \left\lceil\fc{8\ln T}{\De_i^2}\right\rceil\) so that \(\mu^* \ge \mu_i +2a(m,t)\) (whence \(m\le \fc{8\ln T}{\De_i^2}+1\)). With this choice of \(m\) and \(s\ge m\), the event \(\ol x_{i,s} + a(s,t-1) \ge \ol x_{s'}^* + a(s',t-1)\) implies that at least one of \[\begin{align} \ol x_{s'}^* &\le \mu^* - a(s',t)\\ \ol x_{i,s} & \ge \mu_i + a(s,t) \end{align}\] holds, and each of these happens with probability at most \(t^{-4}\) (Chernoff–Hoeffding).
  2. This gives regret \[\E R(T) \le \min(8 \sum_{i:\mu_i <\mu^*} \fc{\ln T}{\De_i} + \pa{1+\fc{\pi^2}3}\pa{\sumo jK\De_j}, \max \De_i T).\]
  3. The minimum is largest when \(\De_i = \Theta\pa{\sqrt{K\ln T/T}}\), which recovers the \(O(\sqrt{KT\ln T})\) bound.

The idea is to upper bound the probability of the bad event by a union of events that cover it and that we can estimate more easily; this is why we sum over all pairs \((s,s')\). The threshold \(m\) is introduced so that by that point the confidence intervals around \(\mu^*\) and \(\mu_i\) are far enough apart.

Thompson sampling

(from 229T p. 191)

For each time \(t = 1,\ldots, T\): sample parameters from the posterior given the history so far, play the action that is optimal under the sampled parameters, observe its reward, and update the posterior.

Thompson sampling outperforms UCB in many settings.

\[ \E[\text{Regret}] \le (1+\ep) \sum_{j:\De_j>0} \fc{\De_j\ln T}{KL(\mu_j||\mu^*)} +O\pf{d}{\ep^2}. \]
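A minimal Beta–Bernoulli instantiation in Python; Bernoulli rewards with \(\mathrm{Beta}(1,1)\) priors are an assumption here for concreteness (the general scheme works with any posterior).

```python
import numpy as np

def thompson_bernoulli(arms, T):
    """Thompson sampling sketch.  arms: list of K Bernoulli success probabilities."""
    K = len(arms)
    alpha = np.ones(K)                        # Beta posterior: 1 + #successes
    beta = np.ones(K)                         # Beta posterior: 1 + #failures
    for t in range(T):
        theta = np.random.beta(alpha, beta)   # sample a mean for each arm from its posterior
        j = int(np.argmax(theta))             # play the arm that is best under the sample
        r = np.random.binomial(1, arms[j])    # observe a Bernoulli reward
        alpha[j] += r                         # conjugate posterior update
        beta[j] += 1 - r
    return alpha, beta
```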

Gittins index?

BLO (bandit linear optimization)

SCRIBLE

SCRIBLE (based on a self-concordant barrier regularizer) attains \(O(\sqrt T\ln T)\) regret.

BCO (bandit convex optimization)

FKM (Flaxman–Kalai–McMahan)

Generic reduction from BCO to (first-order) OCO by using gradient estimators. FKM is an instantiation of the algorithm with regret \(O(T^{\fc 34})\).
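A sketch of the one-point spherical gradient estimator behind FKM, in Python; the projection onto the shrunk feasible set and the choices of \(\de\) and \(\eta\) are left abstract (assumptions for illustration).

```python
import numpy as np

def fkm_step(f_t, x, delta, eta, project):
    """One round of an FKM-style update.

    f_t:     loss oracle for this round (only the value at the played point is observed)
    x:       current iterate, kept inside a shrunk feasible set
    delta:   smoothing radius; eta: step size
    project: Euclidean projection onto the shrunk feasible set
    """
    n = len(x)
    u = np.random.randn(n)
    u /= np.linalg.norm(u)             # uniform direction on the unit sphere
    y = x + delta * u                  # the point actually played
    loss = f_t(y)                      # single bandit observation
    g_hat = (n / delta) * loss * u     # one-point estimate of the smoothed gradient
    x_next = project(x - eta * g_hat)  # first-order (OGD) step on the estimate
    return x_next, y, loss
```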

Contextual bandits

Concrete Q: In the MAB setting, what if we compare against the best policy in a given set? I’m confused: can the policies have memory (i.e., depend on past results)? In that case you can’t start following a policy midway, so let’s say they don’t. Then several policies may recommend the same action and some actions may not be recommended at all, so the action probabilities (and hence the loss estimates) have to be scaled accordingly.

A: Yes. See [ACFS02] (the EXP4 algorithm); they get regret \(O((\ln N)^{\rc 2}\sqrt{T})\) where \(N\) is the number of strategies.
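A hedged Python sketch of the exponential-weights-over-policies idea (EXP4-style), assuming memoryless policies that each output a distribution over the \(K\) actions and rewards in \([0,1]\); the parameter `gamma` controls exploration and is an illustrative choice.

```python
import numpy as np

def exp4(policy_advice, rewards, gamma):
    """EXP4-style sketch.

    policy_advice: T x N x K array; policy_advice[t, j] is policy j's action distribution at time t.
    rewards:       T x K array of rewards in [0, 1]; only the played action's reward is revealed.
    gamma:         exploration / learning rate in (0, 1].
    """
    T, N, K = policy_advice.shape
    w = np.ones(N)                                        # weights over the N policies
    total = 0.0
    for t in range(T):
        xi = policy_advice[t]                             # N x K advice matrix
        p = (1 - gamma) * (w @ xi) / w.sum() + gamma / K  # mixed distribution over actions
        a = np.random.choice(K, p=p)
        r = rewards[t, a]
        total += r
        r_hat = np.zeros(K)
        r_hat[a] = r / p[a]                               # importance-weighted reward estimate
        y_hat = xi @ r_hat                                # estimated reward of each policy
        w *= np.exp(gamma * y_hat / K)                    # exponential-weights update
        w /= w.max()                                      # rescale for numerical stability
    return total
```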