Adversarial attack

Papers

θ is the model (its parameters), ε is the perturbation (disturbance), x is the input, y is the true label, and L is the loss function.

Attack: find a perturbation ε ∈ S (kept as small as possible) that maximizes the loss:

$$\varepsilon^{*} = \arg\max_{\varepsilon \in S} L(\theta,\, x + \varepsilon,\, y)$$
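
As a concrete instance of the inner maximization, here is a minimal one-step (FGSM-style) sketch in PyTorch; `model`, `loss_fn`, and the budget `eps` are placeholder names, and the single gradient step is only an approximation of the true argmax.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """One-step approximation of argmax_{eps in S} L(theta, x + eps, y)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)      # L(theta, x, y)
    loss.backward()                      # gradient of the loss w.r.t. the input
    step = eps * x_adv.grad.sign()       # move along the gradient sign to increase L
    return (x_adv + step).detach()       # x + eps * sign(grad_x L)
```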

For defense, we minimize ρ(θ), the expectation of the maximized (worst-case) loss.

Defense:

$$\min_{\theta} \rho(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\max_{\varepsilon \in S} L(\theta,\, x+\varepsilon,\, y)\Big], \qquad \theta^{*} = \arg\inf_{\theta}\, \sup_{\varepsilon \in S} L(\theta,\, x+\varepsilon,\, y)$$
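
A hedged sketch of one adversarial-training step for this min-max objective, using a multi-step PGD-style inner loop; the step size `alpha`, the number of steps, and the `eps` budget are illustrative assumptions rather than values from any of the papers listed below.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps):
    """Inner maximization: projected gradient ascent on L inside an L-infinity ball of radius eps."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)   # random start inside the ball
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss_fn(model(x_adv), y).backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()          # ascend the loss
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)      # project back onto the ball
    return x_adv.detach()

def adversarial_training_step(model, loss_fn, optimizer, x, y, eps=8/255, alpha=2/255, steps=7):
    """Outer minimization: update theta on the attacked batch."""
    x_adv = pgd_attack(model, loss_fn, x, y, eps, alpha, steps)
    optimizer.zero_grad()                # clear gradients accumulated during the attack
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```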

Intriguing properties of neural networks

Intriguing properties of neural networks 2014

  1. Individual units do not carry distinct semantic information; the semantic information lives in the whole activation space rather than in single units (features).

FGSM

Explaining and Harnessing Adversarial Examples (FGSM) 2015

FGM

Adversarial Training Methods for Semi-Supervised Text Classification (FGM) 2016

PGD

Towards Deep Learning Models Resistant to Adversarial Attacks (PGD) 2017

FreeAT

Adversarial Training for Free! (FreeAT) 2019

YOPO

You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle (YOPO) 2019

FreeLB

Enhanced Adversarial Training for Natural Language Understanding (FreeLB) 2019

SMART

Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (SMART) 2019

MIM

Boosting Adversarial Attacks with Momentum (MIM) 2017

CW

Towards Evaluating the Robustness of Neural Networks (CW) 2016

Deeply Supervised Discriminative Learning for Adversarial Defense 2020, 6

Previous defense methods fall into two groups:

Methods that modify the inputs at test time:

  1. The JPEG compression operation is equivalent to selective blurring of the image, helping remove additive perturbations (a minimal sketch follows this list). Keeping the Bad Guys Out: Protecting and Vaccinating Deep Learning with JPEG Compression 2017
  2. Random resizing, which resizes the input images to a random size. Mitigating Adversarial Effects Through Randomization 2017, 570
  3. Deep image restoration networks learn mapping functions that can bring off-the-manifold adversarial samples onto the natural image manifold. Image Super-Resolution as a Defense Against Adversarial Attacks 2019
  4. When the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. Foveation-based Mechanisms Alleviate Adversarial Examples 2015
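
A minimal sketch of the JPEG round-trip mentioned in item 1; the quality setting and the helper name are illustrative, not taken from the cited paper. The output would be fed to the classifier in place of the raw input.

```python
import io

import numpy as np
from PIL import Image

def jpeg_compress(image: np.ndarray, quality: int = 75) -> np.ndarray:
    """Round-trip an HxWx3 uint8 image through JPEG, which acts as selective blurring."""
    buffer = io.BytesIO()
    Image.fromarray(image).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return np.array(Image.open(buffer))
```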

Proactive defenses, which alter the underlying model's architecture or learning procedure:

  1. On ImageNet, Ensemble Adversarial Training yields models with strong robustness to black-box attacks. Ensemble Adversarial Training: Attacks and Defenses 2017, 1531

  2. It finds that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera. Adversarial examples in the physical world 2016, 3092

  3. Papernot used distillation to improve the model’s robustness by retraining it with soft labels. Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks 2015, 2081

  4. Parseval Networks restrict the Lipschitz constant of each layer of the model. Parseval networks: Improving robustness to adversarial examples 2017, 513

  5. With HGD as a defense, the target model is more robust to either white-box or black-box adversarial attacks. Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser 2017, 418

  6. This paper presents a new method that explores the interaction among individual networks to improve robustness for ensemble models. Improving Adversarial Robustness via Promoting Ensemble Diversity 2019, 152

  7. Min-Max Optimization is one of the strongest defense methods; it augments the training data with first-order attacked samples. Towards deep learning models resistant to adversarial attacks 2017, 4165

  8. It introduces enhanced defenses using a technique called logit pairing, which encourages the logits for pairs of examples to be similar. Adversarial Logit Pairing 2018, 370

  9. The current defenses are successfully circumvented under white-box settings. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples 2018, 1751

  10. It studies applying image transformations such as bit-depth reduction, JPEG compression, total variance minimization, and image quilting before feeding the image to a convolutional network classifier. Countering Adversarial Images using Input Transformations 2017, 733

Preliminary work: Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks 2019, 57

Efficiently computes perturbations that fool deep networks. DeepFool: a simple and accurate method to fool deep neural networks 2015, 2870

The training objective is inspired by center loss, which clusters penultimate-layer features. A Discriminative Feature Learning Approach for Deep Face Recognition 2016, 2542

Think

Limit affine

First, consider an affine function. If the matrix A contains a very large entry, say $a_{ij}$, then a small perturbation of x can cause a large change in y:

$$y = Ax$$

If we limit the largest entry of A, can we defend against adversarial attacks?

$$\max_{i,j} |a_{ij}| \le \xi$$
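
A minimal sketch of enforcing this bound by clipping, assuming we simply clamp a layer's weight matrix after every optimizer step; the layer type and the threshold name `xi` are my own choices.

```python
import torch

@torch.no_grad()
def clamp_weights(layer: torch.nn.Linear, xi: float) -> None:
    """Enforce max_{i,j} |a_ij| <= xi by clipping the weight matrix in place (call after optimizer.step())."""
    layer.weight.clamp_(-xi, xi)
```

With every entry of A clamped to [-ξ, ξ], the layer's ℓ∞-induced norm is at most nξ (n being the input dimension), which bounds, but does not remove, the amplification of small input perturbations.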

Random disturbance - disrupting disturbance distribution

$$x^{(i)} \in \mathbb{R}^{n}, \quad y^{(i)} \in \mathbb{R}, \quad r(\cdot) \in [-0.5,\, 0.5]^{n}, \quad \varepsilon \in \mathbb{R}, \quad c \in \mathbb{R}_{+}$$

$$\begin{aligned} \max\ & \varepsilon \\ \arg\inf_{\varepsilon}\ & \sum_{i=1}^{m}\Big[y^{(i)} - h\big(x^{(i)} + r(\cdot)\,\varepsilon\big)\Big] \\ \text{s.t.}\ & \varepsilon \le c \end{aligned}$$
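
A minimal sketch of this test-time randomization, with r(·) drawn uniformly from [-0.5, 0.5]^n and a fixed scale `eps` (≤ c); averaging the logits over several random draws is my own addition, not part of the formulation above.

```python
import torch

def randomized_predict(model, x, eps, num_samples=1):
    """Add uniform noise r(.) * eps, r(.) ~ U[-0.5, 0.5]^n, before prediction."""
    outputs = []
    for _ in range(num_samples):
        r = torch.rand_like(x) - 0.5          # r(.) in [-0.5, 0.5)^n
        outputs.append(model(x + r * eps))
    return torch.stack(outputs).mean(dim=0)   # average the logits over the random draws
```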

Some studies show that this does not work well even when ε is very large, so I think it may be related to the disturbance distribution. An experiment verified this: disrupting the disturbance distribution gives results that are not bad.

But the method is too weak and can be cracked because it is reversible. Is there a method that can disrupt the disturbance distribution while being irreversible?

New idea:

We can make three verifications: classify the original image, the left-shifted image, and the right-shifted image, which gives three answers (a minimal sketch follows the scenario list below).

There are four scenarios:

  1. All three answers are the same: we treat it as a normal image.
  2. The two shifted images agree on a value a2 but the original image does not: we treat it as an adversarial image whose right class is a2.
  3. The original image agrees with only one shifted image on a value a2: we treat it as an adversarial image, but the right class may not be a2.
  4. No two answers are the same: we treat it as an image without meaning.
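
A minimal sketch of the three-way check; the one-pixel circular shift via `torch.roll` and the single-image batch are my own assumptions, and the sketch simply reports the majority class whenever exactly two answers agree, without trying to distinguish scenarios 2 and 3.

```python
import torch

def shift_vote(model, x):
    """Classify the original, left-shifted, and right-shifted image (x has shape [1, C, H, W])."""
    views = [x, torch.roll(x, shifts=-1, dims=-1), torch.roll(x, shifts=1, dims=-1)]
    answers = [model(v).argmax(dim=-1).item() for v in views]
    unique = set(answers)
    if len(unique) == 1:
        return "normal", answers[0]            # all three answers agree
    if len(unique) == 2:
        majority = max(unique, key=answers.count)
        return "adversarial", majority         # two answers agree: report the majority class
    return "meaningless", None                 # no agreement at all
```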

Pros:

  1. It detects adversarial examples.
  2. It can recover the right class even when there is an adversarial disturbance.
  3. It is irreversible and simple.

We need a transformation T with the following properties:

$$\nexists\, T^{-1} \ (\text{irreversible}), \qquad T(a+b) = T(a) + T(b)$$

$$\varepsilon^{*} = \arg\max_{\varepsilon \in S} L(\theta,\, x+\varepsilon,\, y), \qquad \text{random: } \varepsilon_{r}$$

$$\tfrac{1}{2}\big(L(\theta, x, y) - L(\theta, T(x), y)\big)^{2} \le \varepsilon_{0}$$

$$\tfrac{1}{2}\big(L(\theta, x, y) - L(\theta, x + \varepsilon_{r}, y)\big)^{2} \le \varepsilon_{0}$$

$$\tfrac{1}{2}\big(L(\theta, x + \varepsilon_{r}, y) - L(\theta, T(x) + T(\varepsilon^{*}), y)\big)^{2} \le \varepsilon_{0}$$
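
A small sketch that checks these conditions empirically for a candidate T; the callable `T`, the adversarial perturbation `eps_adv` (e.g. produced by the FGSM sketch above), the random noise `eps_r`, and the squared-difference form of the third bound are all my assumptions about the intended reading.

```python
import torch

def check_transformation(model, loss_fn, T, x, y, eps_adv, eps_r, eps0=1e-2):
    """Check whether a candidate transformation T satisfies the three loss conditions above."""
    with torch.no_grad():
        l_clean = loss_fn(model(x), y)
        l_T     = loss_fn(model(T(x)), y)                 # T should barely change the clean loss
        l_rand  = loss_fn(model(x + eps_r), y)            # random noise should barely change the loss
        l_T_adv = loss_fn(model(T(x) + T(eps_adv)), y)    # T should make eps_adv act like random noise
    cond1 = 0.5 * (l_clean - l_T) ** 2    <= eps0
    cond2 = 0.5 * (l_clean - l_rand) ** 2 <= eps0
    cond3 = 0.5 * (l_rand - l_T_adv) ** 2 <= eps0
    return bool(cond1), bool(cond2), bool(cond3)
```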

Perturbation proof

$$\varepsilon^{*} = \arg\max_{\varepsilon \in S} L(\theta,\, x+\varepsilon,\, y) \;\Longrightarrow\; \frac{\partial L(\theta,\, x+\varepsilon,\, y)}{\partial \varepsilon} = 0$$

Expand L

$$y = f(x), \qquad L\big(\theta,\, x^{(i)}+\varepsilon^{(i)},\, y\big) = \frac{1}{2}\sum_{i=1}^{m}\Big(h\big(x^{(i)}+\varepsilon^{(i)},\, \theta\big) - f\big(x^{(i)}\big)\Big)^{2}$$

$$\frac{\partial}{\partial \varepsilon}\sum_{i=1}^{m}\Big(h\big(x^{(i)}+\varepsilon^{(i)},\, \theta\big) - f\big(x^{(i)}\big)\Big)^{2} = 0$$

For convenience, we set m=1

$$m = 1: \qquad \big(h(x+\varepsilon,\, \theta) - f(x)\big)\,\frac{\partial h(x+\varepsilon,\, \theta)}{\partial \varepsilon} = 0 \;\Longrightarrow\; \frac{\partial h(x+\varepsilon,\, \theta)}{\partial \varepsilon} = 0$$

Conclusion: assuming h(x+ε, θ) ≠ f(x) (the attack actually changes the prediction), the condition reduces to ∂h/∂ε = 0, which involves nothing related to y. Why?

$$\frac{\partial}{\partial \varepsilon}\sum_{i=1}^{m}\Big(h\big(x^{(i)}+\varepsilon^{(i)},\, \theta\big) - f\big(x^{(i)}\big)\Big)^{2} = 0 \;\Longleftrightarrow\; \sum_{i=1}^{m}\Big(h\big(x^{(i)}+\varepsilon^{(i)},\, \theta\big) - f\big(x^{(i)}\big)\Big)\,\frac{\partial h\big(x^{(i)}+\varepsilon^{(i)},\, \theta\big)}{\partial \varepsilon} = 0$$

Consider the simple situation in which every term vanishes individually:

$$\frac{\partial h\big(x^{(i)}+\varepsilon^{(i)},\, \theta\big)}{\partial \varepsilon} = 0, \qquad i = 1, \dots, m$$

Conclusion

Models trained with defenses against adversarial attacks have improved robustness.

Using this approach to provide examples for adversarial training reduces the test-set error of a maxout network on the MNIST dataset. [Explaining and Harnessing Adversarial Examples, Goodfellow, 2015]