(CW17) Adversarial Examples Are Not Easily Detected - Bypassing Ten Detection Methods

Posted: 2018-04-05 , Modified: 2018-04-05

Tags: neural nets, adversarial examples

Carlini and Wagner defeat 10 proposed methods of detecting adversarial examples for neural networks.

It seems easier to produce a detector for adversarial examples (where you don’t have to classify them correctly) than to classify them correctly, so this is a strong result.

Takeaways

Recommendations for defenses

They recommend:

Attack models

They consider 3 attack models, in order of increasing power:

  1. Zero-knowledge (black box): the attacker has no knowledge of what the detector is. (Attack: take another neural net trained on the same data and make adversarial examples against it.)
  2. Limited-knowledge: the attacker knows the architecture of the detector, but not the weights/the examples that it was trained on. (The attacker can train the same architecture on another sample from the same distribution, and craft adversarial examples against that.)
  3. Perfect-knowledge (white box): the attacker knows everything (all the parameters, so they can take gradients, etc.).

Carlini and Wagner’s attack

Known as “C&W’s attack”.

In words: given \(x\) and a target class \(t\), find an \(x'\) that is (1) close to \(x\) (small \(\ell_2\) distortion) and (2) classified by the network as the target class \(t\).

There is a tradeoff between these two objectives, since you can weight them differently (do a binary search to find the best tradeoff).

In symbols (with parameters \(c\) and \(\kappa\), and writing \(Z(x')\) for the logits, i.e. the pre-softmax outputs): \[\begin{align} &\min_{x'}\left(\|x'-x\|_2^2 + c\,\ell(x')\right)\\ \ell(x') &:= \max \left( \max_{i\ne t} Z(x')_i - Z(x')_t,\ -\kappa \right) \end{align}\]
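A minimal PyTorch sketch of this objective and the attack loop is below. The model, pixel range \([0,1]\), and optimizer settings are assumptions on my part, and the crude clamp stands in for the tanh change of variables the actual attack uses to enforce the box constraint.

```python
import torch

def cw_loss(model, x_adv, x, target, c, kappa):
    """C&W objective: squared L2 distortion plus c times a hinge on the logit margin."""
    logits = model(x_adv)                       # Z(x'): pre-softmax logits, shape [B, num_classes]
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[:, target] = True
    target_logit = logits[:, target]
    other_logit = logits.masked_fill(mask, float('-inf')).max(dim=1).values
    ell = torch.clamp(other_logit - target_logit, min=-kappa)   # max(max_{i != t} Z_i - Z_t, -kappa)
    dist = ((x_adv - x) ** 2).flatten(1).sum(dim=1)             # ||x' - x||_2^2
    return dist + c * ell

def cw_attack(model, x, target, c=1.0, kappa=0.0, steps=1000, lr=1e-2):
    """Minimize the objective above with gradient descent, starting from the clean input."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        cw_loss(model, x_adv, x, target, c, kappa).sum().backward()
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0, 1)   # crude box constraint on pixel values
    return x_adv.detach()
```

The binary search over \(c\) would wrap `cw_attack`: increase \(c\) when the attack fails to reach the target class, decrease it when it succeeds, and keep the successful \(x'\) with the smallest distortion.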

The 10 defenses

There are 3 defenses that use neural nets, 3 that use PCA, 2 that are statistical, and 2 that rely on randomization or smoothing (dropout and mean blur).

The table records how the attacks do. (A means attack succeeds, D means defense succeeds.)

| Defense (citation number in paper) | Description | Zero-Knowledge | Limited-Knowledge | Perfect-Knowledge |
| --- | --- | --- | --- | --- |
| [15] (NN) | Train for \((N+1)\)-way classification, with an additional class for adversarial examples. | D | A | A |
| [12] (NN) | Train a NN to classify between real/adversarial. (Reduces to [15].) | D | A | A |
| [18] (NN) | Train a detection network that takes input from intermediate layers. | D (mostly) | A | A |
| [19] (PCA) | The authors noted that adversarial examples have unusually large coefficients on the small principal components, but this is an artifact of MNIST: border pixels are often 0 for real examples and nonzero for adversarial examples. Attack: restrict the adversary to the top \(k\) PCs. | - | - | A |
| [4] (PCA) | Dimensionality reduction: project to the top \(k\) PCs, then feed through the rest of the network. Attack: restrict the adversary to the top \(k\) PCs. | - | - | A |
| [24] (PCA) | Do PCA on inner convolutional layers; each layer has a detector, an SVM on the coefficients in the PCA basis. Accept only if ALL detectors accept. Result: when attacked with C&W’s attack, the SVM fails to separate real and adversarial examples well. | - | - | A |
| [11] (Stat) | Model final-layer activations as a Gaussian mixture model; reject if the probability density at the data point is small. Attack: add a clamped “negative log likelihood” term to the attacker’s objective. | A (CIFAR), D (MNIST) | A | A |
| [15] (Stat) | Detect whether the test distribution differs from the training distribution (e.g. because it is adversarial) using the MMD statistical test (see more below). Defeated even by a black-box attack! | A | A | A |
| [11] (Randomization) | Apply dropout and repeat many times with different randomness. Idea: on real inputs the trials tend to agree; if the trials disagree too much, reject (see the sketch after the table). | D | D/A (attack needs huge distortion) | A (but needs +80% distortion) |
| [24] | Apply a mean blur on 3x3 patches. | D | - | A |
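To make the dropout-randomization defense ([11], Randomization) concrete, here is a rough sketch; using the variance of the softmax outputs as the disagreement score and a threshold tuned on clean data are my simplifications, not necessarily the exact statistic of [11].

```python
import torch

def dropout_disagreement(model, x, n_trials=30):
    """Run the network with dropout active many times and measure how much
    the softmax outputs vary across trials (one score per example)."""
    model.train()                       # keep dropout layers stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_trials)])
    model.eval()
    return probs.var(dim=0).sum(dim=1)  # high value = trials disagree

def detect_adversarial(model, x, threshold):
    """Reject inputs whose dropout trials disagree more than a threshold
    (the threshold would be tuned on clean validation data)."""
    return dropout_disagreement(model, x) > threshold
```

A white-box attack then has to find an \(x'\) that fools the network consistently across many random dropout masks, which fits the large distortion noted in the table.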

More on MMD defense

(See [15] (Stat) in table above.)

Note that this defense takes a set of examples and tests whether it is drawn from the same distribution as the training data; it doesn’t operate on a per-example basis. (So it also doesn’t “produce” a discriminator; it’s really a hypothesis test on distributions.)

The test considered in the paper operates only on the inputs, not on hidden-layer activations. (Considering the hidden layers could potentially help, but it doesn’t seem like this was done.)

The MMD test operates as follows. Given a class of functions (they use a ball in a Gaussian kernel space), find the function \(f\) for which \(\hat{\mathbb{E}}[f(x_i)] - \hat{\mathbb{E}}[f(y_i)]\) is largest, where \(\{x_i\}\) and \(\{y_i\}\) are the two samples and \(\hat{\mathbb{E}}\) denotes the sample average. If this maximum is significantly larger than when \(x_i, y_i\) are drawn from the same distribution (the training distribution), reject the hypothesis that the two samples come from the same distribution; otherwise, fail to reject.
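For a Gaussian kernel this supremum has a closed form in terms of kernel evaluations, so a sketch of the test looks roughly as follows. The bandwidth, permutation count, and the biased estimator are my choices here, not necessarily those of [15].

```python
import torch

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel:
    mean k(x,x') + mean k(y,y') - 2 * mean k(x,y). Inputs are [n, d] tensors."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd_permutation_test(x, y, n_perm=500, bandwidth=1.0):
    """p-value for the null 'x and y come from the same distribution':
    compare the observed MMD to MMDs of random re-splits of the pooled sample."""
    observed = mmd2(x, y, bandwidth)
    pooled = torch.cat([x, y])
    n = x.shape[0]
    hits = 0
    for _ in range(n_perm):
        idx = torch.randperm(pooled.shape[0])
        if mmd2(pooled[idx[:n]], pooled[idx[n:]], bandwidth) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)    # small p-value => samples look different
```

Here \(x\) and \(y\) would be flattened inputs (e.g. `images.view(len(images), -1)`), matching the note above that the test operates on raw inputs rather than hidden-layer activations.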

The authors found the defense successful against FGSM and JSMA, but Carlini and Wagner broke it with C&W’s attack.