(CW17) Adversarial Examples Are Not Easily Detected - Bypassing Ten Detection Methods

Posted: 2018-04-05 , Modified: 2018-04-05

Tags: neural nets, adversarial examples

Carlini and Wagner defeat 10 proposed methods of detecting adversarial examples for neural networks.

It seems easier to produce a detector for adversarial examples (where you don’t have to classify them correctly) than to classify them correctly, so this is a strong result.

Takeaways

Recommendations for defenses

They recommend:

Attack models

They consider 3 attack models, in order of increasing power:

  1. Zero-knowledge (black box): the attacker has no knowledge of what the detector is. (Attack: take another neural net trained on the same data and make adversarial examples against it.)
  2. Limited-knowledge: the attacker knows the architecture of the detector, but not the weights/the examples that it was trained on. (The attacker can train the same architecture on another sample from the same distribution, and craft adversarial examples against that.)
  3. Perfect-knowledge (white box): the attacker knows everything (all the parameters, so they can take gradients, etc.).

Carlini and Wagner’s attack

Known as “C&W’s attack”.

In words: given \(x\) and a target class \(t\), find an \(x'\) that is (1) close to \(x\) (small \(\ell_2\) distortion) and (2) classified by the network as the target class \(t\).

There is a tradeoff between these two objectives, since you can weight them differently (do a binary search to find the best tradeoff).

In symbols (with parameters \(c\) and \(\kappa\), and writing \(Z(x')\) for the logits, i.e. the pre-softmax outputs): \[\begin{align} &\min_{x'}\left(\|x'-x\|_2^2 + c\,\ell(x')\right)\\ \ell(x') &:= \max \left( \max_{i\ne t} Z(x')_i - Z(x')_t,\ -\kappa \right) \end{align}\]
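A minimal PyTorch sketch of this objective and the attack loop is below. The model, pixel range \([0,1]\), and optimizer settings are assumptions on my part, and the crude clamp stands in for the tanh change of variables the actual attack uses to enforce the box constraint.

```python
import torch

def cw_loss(model, x_adv, x, target, c, kappa):
    """C&W objective: squared L2 distortion plus c times a hinge on the logit margin."""
    logits = model(x_adv)                       # Z(x'): pre-softmax logits, shape [B, num_classes]
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[:, target] = True
    target_logit = logits[:, target]
    other_logit = logits.masked_fill(mask, float('-inf')).max(dim=1).values
    ell = torch.clamp(other_logit - target_logit, min=-kappa)   # max(max_{i != t} Z_i - Z_t, -kappa)
    dist = ((x_adv - x) ** 2).flatten(1).sum(dim=1)             # ||x' - x||_2^2
    return dist + c * ell

def cw_attack(model, x, target, c=1.0, kappa=0.0, steps=1000, lr=1e-2):
    """Minimize the objective above with gradient descent, starting from the clean input."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        cw_loss(model, x_adv, x, target, c, kappa).sum().backward()
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0, 1)   # crude box constraint on pixel values
    return x_adv.detach()
```

The binary search over \(c\) would wrap `cw_attack`: increase \(c\) when the attack fails to reach the target class, decrease it when it succeeds, and keep the successful \(x'\) with the smallest distortion.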

The 10 defenses

There are 3 defenses that use neural nets, 3 that use PCA, 2 that are statistical, and 2 that rely on randomization or smoothing (dropout and mean blur).

The table records how the attacks do. (A means attack succeeds, D means defense succeeds.)

| Defense (citation number in paper) | Description | Zero-Knowledge | Limited-Knowledge | Perfect-Knowledge |
| --- | --- | --- | --- | --- |
| [15] (NN) | Train for \((N+1)\)-way classification, with an additional class for adversarial examples. | D | A | A |
| [12] (NN) | Train a NN to classify between real/adversarial. (Reduces to [15].) | D | A | A |
| [18] (NN) | Train a detection network that takes input from intermediate layers. | D (mostly) | A | A |
| [19] (PCA) | The authors noted that adversarial examples have unusually large coefficients on the small principal components, but this is an artifact of MNIST: border pixels are often 0 for real examples and nonzero for adversarial examples. Attack: restrict the adversary to the top \(k\) PCs. | - | - | A |
| [4] (PCA) | Dimensionality reduction: project to the top \(k\) PCs, then feed through the rest of the network. Attack: restrict the adversary to the top \(k\) PCs. | - | - | A |
| [24] (PCA) | Do PCA on inner convolutional layers; each layer has a detector, an SVM on the coefficients in the PCA basis. Accept only if ALL detectors accept. Result: when attacked with C&W’s attack, the SVM fails to separate real and adversarial examples well. | - | - | A |
| [11] (Stat) | Model final-layer activations as a Gaussian mixture model; reject if the probability density at the data point is small. Attack: add a clamped “negative log likelihood” term to the attacker’s objective. | A (CIFAR), D (MNIST) | A | A |
| [15] (Stat) | Detect whether the test distribution differs from the training distribution (e.g. because it is adversarial) using the MMD statistical test (see more below). Defeated even by a black-box attack! | A | A | A |
| [11] (Randomization) | Apply dropout and repeat many times with different randomness. Idea: on real inputs the trials tend to agree; if the trials disagree too much, reject (see the sketch after the table). | D | D/A (attack needs huge distortion) | A (but needs +80% distortion) |
| [24] | Apply a mean blur on 3x3 patches. | D | - | A |
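To make the dropout-randomization defense ([11], Randomization) concrete, here is a rough sketch; using the variance of the softmax outputs as the disagreement score and a threshold tuned on clean data are my simplifications, not necessarily the exact statistic of [11].

```python
import torch

def dropout_disagreement(model, x, n_trials=30):
    """Run the network with dropout active many times and measure how much
    the softmax outputs vary across trials (one score per example)."""
    model.train()                       # keep dropout layers stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_trials)])
    model.eval()
    return probs.var(dim=0).sum(dim=1)  # high value = trials disagree

def detect_adversarial(model, x, threshold):
    """Reject inputs whose dropout trials disagree more than a threshold
    (the threshold would be tuned on clean validation data)."""
    return dropout_disagreement(model, x) > threshold
```

A white-box attack then has to find an \(x'\) that fools the network consistently across many random dropout masks, which fits the large distortion noted in the table.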

More on MMD defense

(See [15] (Stat) in table above.)

Note that this defense takes a set of examples and tests whether it is drawn from the same distribution as the training data; it doesn’t operate on a per-example basis. (So it also doesn’t “produce” a discriminator; it’s really a hypothesis test on distributions.)

The test considered in the paper operates only on the inputs, not on hidden-layer activations. (Considering the hidden layers could potentially help, but it doesn’t seem like this was done.)

The MMD test operates as follows. Given a class of functions (they use a ball in a Gaussian kernel space), find the function \(f\) for which \(\hat{\mathbb{E}}[f(x_i)] - \hat{\mathbb{E}}[f(y_i)]\) is largest, where \(\{x_i\}\) and \(\{y_i\}\) are the two samples and \(\hat{\mathbb{E}}\) denotes the sample average. If this maximum is significantly larger than when \(x_i, y_i\) are drawn from the same distribution (the training distribution), reject the hypothesis that the two samples come from the same distribution; otherwise, fail to reject.
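For a Gaussian kernel this supremum has a closed form in terms of kernel evaluations, so a sketch of the test looks roughly as follows. The bandwidth, permutation count, and the biased estimator are my choices here, not necessarily those of [15].

```python
import torch

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel:
    mean k(x,x') + mean k(y,y') - 2 * mean k(x,y). Inputs are [n, d] tensors."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd_permutation_test(x, y, n_perm=500, bandwidth=1.0):
    """p-value for the null 'x and y come from the same distribution':
    compare the observed MMD to MMDs of random re-splits of the pooled sample."""
    observed = mmd2(x, y, bandwidth)
    pooled = torch.cat([x, y])
    n = x.shape[0]
    hits = 0
    for _ in range(n_perm):
        idx = torch.randperm(pooled.shape[0])
        if mmd2(pooled[idx[:n]], pooled[idx[n:]], bandwidth) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)    # small p-value => samples look different
```

Here \(x\) and \(y\) would be flattened inputs (e.g. `images.view(len(images), -1)`), matching the note above that the test operates on raw inputs rather than hidden-layer activations.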

The authors found the defense successful against FGSM and JSMA, but Carlini and Wagner broke it with C&W’s attack.