(TPGB17) The space of transferable adversarial examples
Posted: 2017-05-24 , Modified: 2017-05-24
Tags: neural nets, aaml
Recall that FGSM is \[ x^* = x + \ep \cdot \operatorname{sign}(\nb_x J(x,y)). \]
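A minimal sketch of this step in PyTorch, assuming a differentiable `model` and a cross-entropy-style `loss_fn` (all names here are hypothetical placeholders, not from the paper):

```python
import torch

def fgsm(model, loss_fn, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(grad_x J(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)   # J(x, y)
    loss.backward()               # populates x.grad with grad_x J(x, y)
    return (x + eps * x.grad.sign()).detach()
```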
Techniques for estimating the dimensionality of the space of adversarial perturbations (finding multiple orthogonal adversarial directions around an input):
For a DNN, they find 44 orthogonal directions, 25 of which transfer. For a CNN, they find 15 directions, 2 of which transfer.
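The paper builds these directions with a gradient-aligned construction; the sketch below substitutes a simpler stand-in (Gram-Schmidt over the gradient plus random vectors) and only illustrates the bookkeeping: count which candidate orthogonal directions fool the source model and which of those also fool a target model. All names (`src`, `tgt`, `grad`, `eps`) are assumptions for illustration.

```python
import torch

def orthogonal_candidates(grad, k):
    """Return up to k orthonormal directions, the first aligned with the
    gradient, via Gram-Schmidt over [grad, random vectors]. This is a
    stand-in for the paper's gradient-aligned construction."""
    vecs = [grad.flatten()] + [torch.randn_like(grad.flatten()) for _ in range(k - 1)]
    basis = []
    for v in vecs:
        for b in basis:
            v = v - (v @ b) * b          # remove components along earlier directions
        if v.norm() > 1e-8:
            basis.append(v / v.norm())
    return torch.stack(basis).reshape(len(basis), *grad.shape)

def count_adversarial_and_transferring(src, tgt, x, y, dirs, eps):
    """Count directions that flip src's prediction on x (adversarial), and
    how many of those also flip tgt's prediction (transfer). x is a single
    example; y is its integer label."""
    n_adv = n_transfer = 0
    for d in dirs:
        x_adv = x + eps * d
        if src(x_adv).argmax(-1).item() != y:
            n_adv += 1
            if tgt(x_adv).argmax(-1).item() != y:
                n_transfer += 1
    return n_adv, n_transfer
```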
Adversarial training does not significantly displace the decision boundary.
Define unit-norm directions \[ d(f,x) := \fc{x'-x}{\ve{x'-x}} \] where \(x'\) is defined differently in 3 cases: an adversarial direction (\(x'\) is an adversarial example for \(x\)), a legitimate direction (\(x'\) is a test example of a different class), and a random direction.
Define the minimum distance \[ MD_d(f,x) = \min\{\ep>0 : f(x+\ep \cdot d) \ne f(x)\} \] and the interboundary distance \[ ID_d(f_1,f_2,x) = |MD_d(f_1,x) - MD_d(f_2,x)|. \]
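A coarse grid-search approximation of \(MD_d\) and \(ID_d\) (a sketch; `f`, `x`, `d`, and `eps_grid`, e.g. an increasing list of step sizes, are hypothetical):

```python
def min_distance(f, x, d, eps_grid):
    """Approximate MD_d(f, x): the smallest eps in the (increasing) grid for
    which f's predicted label at x + eps*d differs from the label at x.
    Returns None if the label never flips on the grid."""
    y0 = f(x).argmax(-1).item()
    for eps in eps_grid:
        if f(x + eps * d).argmax(-1).item() != y0:
            return eps
    return None

def interboundary_distance(f1, f2, x, d, eps_grid):
    """Approximate ID_d(f1, f2, x) = |MD_d(f1, x) - MD_d(f2, x)|."""
    md1 = min_distance(f1, x, d, eps_grid)
    md2 = min_distance(f2, x, d, eps_grid)
    return None if md1 is None or md2 is None else abs(md1 - md2)
```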
Transfer from
Defenses only prevent white-box attacks by reducing the reliability of first-order approximations (gradient masking).
The following hypothesis is false: if two models achieve low error while exhibiting low robustness, then adversarial examples transfer between them.
Example: adversarial examples on MNIST (with XOR artifacts, see below) do not transfer between linear and quadratic models.
Model-agnostic perturbation: for a fixed feature mapping \(\phi\), define \(\de_\phi\) as (half) the difference of the intra-class means, and the adversarial direction \(r_\phi\) for \((x,y)\): \[\begin{align} \de_\phi:&=\rc 2 (\E_{\mu_{+1}} [\phi(x)] - \E_{\mu_{-1}}[\phi(x)])\\ r_\phi:&= - \ep y \wh \de_\phi. \end{align}\] If \(f(x) = w^T\phi(x)+b\), \(\De:=\wh w^T \wh\de_\phi\) is large, and \(\phi\) is “pseudo-linear” (\(\phi(x+r)-\phi(x)\approx r_\phi\)), then \(x+r\) transfers to \(f\).
TLDR: shift points in the direction of the difference of class means; this transfers well.
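A sketch of this perturbation with \(\phi\) taken to be the identity map on raw inputs (an illustrative assumption); `X_pos` and `X_neg` are hypothetical tensors of class \(+1\) and class \(-1\) training examples, and labels are \(\pm 1\):

```python
import torch

def mean_difference_direction(X_pos, X_neg):
    """hat(delta) = normalized 1/2 * (mean of class +1 - mean of class -1),
    with phi = identity (raw input features)."""
    delta = 0.5 * (X_pos.mean(dim=0) - X_neg.mean(dim=0))
    return delta / delta.norm()

def model_agnostic_perturbation(x, y, delta_hat, eps):
    """r_phi = -eps * y * hat(delta): shift x toward the other class's mean."""
    return x + (-eps * y) * delta_hat
```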
Can models with access to the same set of input features learn representations between which adversarial examples do not transfer?
There is a simple (but not very informative…) example where this works: MNIST with XOR artifacts, trained with linear and quadratic models.