(TPGB17) The space of transferable adversarial examples

Posted: 2017-05-24 , Modified: 2017-05-24

Tags: neural nets, aaml

Adversarial subspaces

Recall the fast gradient method (FGSM): \[ x^* = x + \ep \nv{\nb_x J(x,y)}, \] where \(\nv{\cdot}\) denotes normalization to unit norm (the \(\ell_\infty\) variant takes the sign of the gradient instead).
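As a minimal sketch (the toy logistic model standing in for the network's loss gradient is hypothetical):

```python
import numpy as np

def fast_gradient(x, grad, eps):
    # One step along the normalized loss gradient; the l_inf variant
    # (FGSM proper) uses np.sign(grad) instead.
    return x + eps * grad / np.linalg.norm(grad)

# Hypothetical stand-in model: logistic loss J(x,y) = log(1 + exp(-y * w @ x)).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, -0.2])
y = 1.0
grad = -y * w / (1.0 + np.exp(y * (w @ x)))  # closed-form dJ/dx
x_adv = fast_gradient(x, grad, eps=0.1)
```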

Techniques for finding multiple orthogonal adversarial directions: random orthogonal directions, and the Gradient Aligned Adversarial Subspace (GAAS), which finds orthogonal directions maximally aligned with the gradient.

For the DNN, this finds 44 orthogonal adversarial directions, 25 of which transfer to the other model. For the CNN, it finds 15 directions, 2 of which transfer.
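One way to realize such gradient-aligned orthogonal directions is via a Householder reflection mapping a fixed \(k\)-sparse unit vector onto the normalized gradient; this is an assumed construction in the spirit of GAAS (it attains alignment \(1/\sqrt k\) per direction), not necessarily the paper's exact one:

```python
import numpy as np

def gaas_directions(grad, k):
    # k orthonormal directions r_1..r_k, each with <r_i, g_hat> = 1/sqrt(k):
    # reflect z = (1,..,1,0,..,0)/sqrt(k) onto g_hat with a Householder map H
    # (symmetric, orthogonal), then take r_i = H e_i.
    d = grad.size
    g_hat = grad / np.linalg.norm(grad)
    z = np.zeros(d)
    z[:k] = 1.0 / np.sqrt(k)
    u = z - g_hat
    if np.linalg.norm(u) < 1e-12:       # z already equals g_hat
        H = np.eye(d)
    else:
        u /= np.linalg.norm(u)
        H = np.eye(d) - 2.0 * np.outer(u, u)
    return H[:, :k].T                   # row i is r_i = H e_i

g = np.random.randn(10)
rs = gaas_directions(g, k=4)
assert np.allclose(rs @ rs.T, np.eye(4))                          # orthonormal
assert np.allclose(rs @ (g / np.linalg.norm(g)), 1 / np.sqrt(4))  # aligned
```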

Decision boundaries

Adversarial training does not significantly displace the decision boundary.

Define unit norm directions \[ d(f,x) := \fc{x'-x}{\ve{x'-x}} \] where \(x'\) is defined differently in 3 cases:

  1. Legitimate direction \(d_{leg}\): \(x'\) is the closest data point with a different class label.
  2. Adversarial direction \(d_{adv}\): \(x'\) is an adversarial example generated from \(x\).
  3. Random direction \(d_{rand}\): \(x'\) is a random point in the input domain that is classified differently.

Define minimum distance \[ MD_d(f,x) = \min\{\ep>0 : f(x+\ep \cdot d) \ne f(x)\} \] and interboundary distance as \[ ID_d(f_1,f_2,x) = |MD_d(f_1,x) - MD_d(f_2,x)| \]
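A crude sketch of measuring these quantities by grid search along a direction (the linear classifiers below are hypothetical):

```python
import numpy as np

def min_distance(f, x, d, eps_max=10.0, steps=1000):
    # Smallest eps > 0 with f(x + eps*d) != f(x), by grid search:
    # a crude stand-in for a proper line search.
    y0 = f(x)
    for eps in np.linspace(eps_max / steps, eps_max, steps):
        if f(x + eps * d) != y0:
            return eps
    return np.inf   # no boundary reached within eps_max

def inter_boundary_distance(f1, f2, x, d, **kw):
    # ID_d(f1, f2, x) = |MD_d(f1, x) - MD_d(f2, x)|
    return abs(min_distance(f1, x, d, **kw) - min_distance(f2, x, d, **kw))

# Hypothetical linear classifiers:
f1 = lambda x: int(x @ np.array([1.0, 1.0]) > 0)
f2 = lambda x: int(x @ np.array([1.0, 0.9]) > 0)
x0 = np.array([-1.0, -1.0])
d = np.array([1.0, 1.0]) / np.sqrt(2)
print(inter_boundary_distance(f1, f2, x0, d))
```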

Experiments

Adversarial examples still transfer from undefended models to adversarially trained ones: defenses prevent only white-box attacks, by reducing the reliability of first-order approximations (gradient masking), not by moving the decision boundary.

Limits of transferability

The following hypothesis is false: if two models achieve low error while exhibiting low robustness, then adversarial examples transfer between them.

Ex. Adversarial examples on MNIST don’t transfer between linear and quadratic models.

Model-agnostic perturbation: For a fixed feature mapping \(\phi\), define \(\de_\phi\) as half the difference between the class means in feature space, and the adversarial direction \(r_\phi\) for \((x,y)\): \[\begin{align} \de_\phi:&=\rc 2 (\E_{\mu_{+1}} [\phi(x)] - \E_{\mu_{-1}}[\phi(x)])\\ r_\phi:&= - \ep y \wh \de_\phi. \end{align}\]

If \(f(x) = w^T\phi(x)+b\), the alignment \(\De:=\wh w^T \wh\de_\phi\) is large, and \(\phi\) is “pseudo-linear” (\(\phi(x+r)-\phi(x)\approx r_\phi\)), then the perturbation transfers to \(f\), i.e., \(x+r\) is misclassified by \(f\).

TLDR: shift points in the direction of the difference of the class means; this perturbation transfers well.
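A minimal sketch with \(\phi\) the identity map (the two-Gaussian toy data is hypothetical):

```python
import numpy as np

def mean_difference_perturbation(X, y, x, label, eps):
    # phi = identity: shift x along the normalized difference of class
    # means, i.e. r = -eps * y * delta_hat.
    delta = 0.5 * (X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0))
    delta_hat = delta / np.linalg.norm(delta)
    return x - eps * label * delta_hat

# Hypothetical two-Gaussian data:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, (100, 5)), rng.normal(-1.0, 1.0, (100, 5))])
y = np.array([+1] * 100 + [-1] * 100)
x_adv = mean_difference_perturbation(X, y, X[0], y[0], eps=0.5)
```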

Can models with access to the same set of input features learn representations that don’t transfer?

There’s a simple (but not very informative…) example where this works: MNIST with XOR artifacts, with linear and quadratic models, as sketched below.
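A hedged reconstruction of the XOR-artifact idea (an assumption, not necessarily the paper’s exact construction): append two \(\pm 1\) features whose product encodes the label, so a quadratic model can read the label off the product while each feature alone carries no linear signal.

```python
import numpy as np

def add_xor_artifact(X, y, rng):
    # Append two +/-1 features a, b with a * b = y: each is marginally
    # uniform (no linear signal), but a quadratic model can use a*b.
    a = rng.choice([-1.0, 1.0], size=len(y))
    b = a * y
    return np.hstack([X, a[:, None], b[:, None]])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))          # stand-in for MNIST pixels
y = rng.choice([-1.0, 1.0], size=200)
X_aug = add_xor_artifact(X, y, rng)
```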