Adversarial examples in neural networks

Posted: 2017-02-21 , Modified: 2017-02-21

Tags: neural nets, uncertainty, aaml

See my experiments.

See also confidence.

Introduction

Statement

Neural networks can be easily fooled: e.g., an adversary adding a small amount of noise can change the classification from “dog” to “cat” with high confidence. They can be fooled even by a weak adversary with only black-box access!

Related to making NNs resistant: have NNs give a confidence bound.

Ideas:

Blog posts

Literature

Experiments

Theory

[LCLS17] Delving into Transferable Adversarial Examples

  1. Model, training data, training process, test label set unknown to attacker.
  2. Large dataset (ImageNet)
  3. Do not construct substitute model

What is the difference between targeted and non-targeted transferability?

  1. Non-targeted: \(x^*\approx x\), \(f_\te(x^*)\ne f_\te(x) = y\). (Constrain \(d(x,x^*)\le B\).)
  2. Targeted: \(x^*\approx x\), \(f_\te(x^*)=y^*\).

3 approaches (the two gradient-based ones are sketched in code after this list): Suppose \(f_\te(x) = \arg\max_i J_\te(x)_i\), where \(J_\te(x)\) is the vector of predicted probabilities.

  1. Optimization \(\amin_{x^*} \la d(x,x^*) - \ell(\one_y, J_\te(x^*))\). Ex. \(\ell(u,v) = \ln (1-u\cdot v)\).
  2. Fast gradient \(x^* \leftarrow \text{clamp}\pa{x+B\nv{\nb_x\ell(\one_y, J_\te(x))}}\).
  3. Fast gradient sign \(x^* \leftarrow \text{clamp}(x+B\sign (\nb_x \ell(\one_y, J_\te(x))))\).
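
A minimal sketch of the two fast-gradient variants above, assuming a PyTorch `model` whose forward pass returns the probability vector \(J_\te(x)\) for a single image with a batch dimension; the function name and the \([0,1]\) pixel range are my assumptions, not from the paper.

```python
import torch

def fast_gradient(model, x, y, B, use_sign=False):
    """One-step non-targeted attack on a single image x of shape (1, C, H, W).

    use_sign=False: FG  step along the L2-normalized gradient.
    use_sign=True : FGS step along the sign of the gradient (an L_inf step).
    """
    x_adv = x.clone().detach().requires_grad_(True)
    probs = model(x_adv)                          # J_theta(x), shape (1, num_classes)
    # ell(1_y, J(x)) = ln(1 - 1_y . J(x)); increasing it pushes probability away from y
    loss = torch.log(1.0 - probs[0, y] + 1e-12)
    loss.backward()
    g = x_adv.grad
    step = g.sign() if use_sign else g / (g.norm() + 1e-12)
    return torch.clamp(x + B * step, 0.0, 1.0)    # clamp back to a valid pixel range
```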

Approaches for targeted attacks (sketched in code after this list): replace the constraint with \(f_\te(x^*)=y^*\).

  1. Optimization \(\amin_{x^*} \la d(x,x^*) \redd{+} \redd{\ell'(\one_{y^*}}, J_\te(x^*))\). Ex. \(\ell'(u,v) = \redd{-\sum_i u_i \lg v_i}\).
  2. Fast gradient \(x^* \leftarrow \text{clamp}\pa{x\redd{-}B\nv{\nb_x\redd{\ell'(\one_{y^*}}, J_\te(x))}}\).
  3. Fast gradient sign \(x^* \leftarrow \text{clamp}(x\redd{-}B\sign (\nb_x \redd{\ell'(\one_{y^*}}, J_\te(x))))\).
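
The targeted counterparts just flip the sign and use the cross-entropy against the target label; continuing the sketch above (same assumptions):

```python
def targeted_fast_gradient(model, x, y_star, B, use_sign=False):
    """One-step targeted attack: move toward the target label y_star."""
    x_adv = x.clone().detach().requires_grad_(True)
    probs = model(x_adv)                               # (1, num_classes)
    # ell'(1_{y*}, J(x)) = -sum_i u_i log v_i reduces to -log p_{y*} here
    loss = -torch.log(probs[0, y_star] + 1e-12)
    loss.backward()
    g = x_adv.grad
    step = g.sign() if use_sign else g / (g.norm() + 1e-12)
    return torch.clamp(x - B * step, 0.0, 1.0)         # minus sign: descend toward the target
```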

Experiments

Choose 100 images (ILSVRC2012 dataset) which can be correctly classified by all 5 models.

Non-targeted transferability: accuracy = percentage of adversarial examples generated for one model that are correctly classified by the other model. (For the NN to be robust, want this to be high.)

Targeted transferability: matching rate = percentage of adversarial examples classified as the target label by the other model. (Want this to be low.)

Root mean square deviation \(d(x^*,x) = \sfc{\sum_i (x_i^*-x_i)^2}{N}\).
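
A quick sketch of these three evaluation quantities in numpy (the array names are mine, not the paper's):

```python
import numpy as np

def accuracy(preds_on_adv, true_labels):
    """Non-targeted metric: fraction of adversarial images the other model
    still classifies correctly (high = little transfer)."""
    return np.mean(preds_on_adv == true_labels)

def matching_rate(preds_on_adv, target_labels):
    """Targeted metric: fraction classified as the attacker's target label
    by the other model (high = targeted attack transfers)."""
    return np.mean(preds_on_adv == target_labels)

def rmsd(x_adv, x):
    """d(x*, x) = sqrt(sum_i (x_i* - x_i)^2 / N)."""
    diff = np.asarray(x_adv, dtype=np.float64) - np.asarray(x, dtype=np.float64)
    return np.sqrt(np.mean(diff ** 2))
```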

Q: isn’t the optimizer using gradient information? (We can estimate it by sampling though!)

Use small learning rate to generate images with RMSD<2. Actually can set \(\la=0\).

(Accuracy is low. But what is the confidence?)

Find the minimal transferable RMSD by linear search.
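
For instance (a sketch under the same assumptions as above; `attack` is one of the one-step attacks with the white-box model bound in, `blackbox` is the held-out model, CPU tensors assumed, and `rmsd` is the helper defined earlier):

```python
import torch

def minimal_transferable_rmsd(attack, blackbox, x, y, B_values):
    """Linearly search over the distortion bound B until the adversarial image
    fools the black-box model; return the RMSD at that point."""
    for B in B_values:                               # e.g. B_values = range(1, 41)
        x_adv = attack(x, y, B)
        with torch.no_grad():
            pred = blackbox(x_adv).argmax(dim=1).item()
        if pred != y:                                # non-targeted success on the black box
            return rmsd(x_adv.numpy(), x.numpy()), B
    return None                                      # no transfer within the search range
```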

Note FGS minimizes distortion’s \(L_\iy\) norm while FG minimizes \(L_2\) norm.

Target labels do not transfer. Fast gradient-based approaches don't do well because they only search in a 1-D subspace.

Ensemble-based approaches

These do better! If an adversarial image remains adversarial for multiple models, it is more likely to transfer to other models. \[ \amin_{x^*} -\ln \pa{\pa{\sumo ik \al_i J_i(x^*)}\cdot \one_{y^*}} + \la d(x,x^*) \] For each of the five models, treat it as the black-box model to attack and generate adversarial images against the ensemble of the remaining four, which is treated as white-box. This attack does well!
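
A hedged sketch of this ensemble objective in PyTorch (the optimizer, step count, and learning rate are my choices, not the paper's; `models` return probability vectors and `alphas` are the ensemble weights):

```python
import torch

def ensemble_targeted_attack(models, alphas, x, y_star, lam=0.0, steps=100, lr=0.01):
    """Minimize  -ln((sum_i alpha_i J_i(x*)) . 1_{y*}) + lambda * d(x, x*)."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # weighted average of the ensemble's predicted probabilities, shape (1, num_classes)
        avg_probs = sum(a * m(x_adv) for a, m in zip(alphas, models))
        rms = (x_adv - x).pow(2).mean().sqrt()             # d(x, x*) as RMSD
        loss = -torch.log(avg_probs[0, y_star] + 1e-12) + lam * rms
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0.0, 1.0)                         # keep pixels in a valid range
    return x_adv.detach()
```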

Non-targeted adversarial images have almost perfect transferability!

Fast gradient doesn't work well with the ensemble.

Geometry

Questions