This project explores adversarial attacks on ImageNet models such as ResNet. Some of the key results and discoveries are briefly presented below. The full report can be found in the notebook, and can be viewed online here.
- FGSM
- Targeted attacks
- Blackbox attacks
- Targeted blackbox attacks
- Universal targeted attacks
- References
By using the fast gradient sign method (FGSM) described on page 3 of Goodfellow et al. (2014), we can quickly perturb an image of a dog to fool ResNet-50 into predicting something completely different.
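A minimal PyTorch sketch of this FGSM step is shown below; the image path, `eps` value and preprocessing settings are illustrative assumptions rather than the exact values used in the notebook.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-50 in evaluation mode
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

x = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # hypothetical input image, pixels in [0, 1]
x.requires_grad_(True)

logits = model(normalize(x))
y = logits.argmax(dim=1)                            # use the clean prediction as the label
loss = torch.nn.functional.cross_entropy(logits, y)
loss.backward()

eps = 0.01                                          # assumed perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0, 1)       # FGSM: a single step *up* the loss gradient

print(model(normalize(x_adv)).argmax(dim=1))        # prediction on the perturbed image
```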
To do targeted attacks, we use gradient descent on the cross-entropy loss

$$L\big(f(x + \delta),\, y_{\text{target}}\big)$$

where $f$ is the model, $x$ is the input image, $y_{\text{target}}$ is the label we want the model to predict, and $\delta$ is the perturbation we optimise, kept small (e.g. $\lVert \delta \rVert_\infty \le \epsilon$).
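A sketch of this targeted attack under the same assumptions about preprocessing, with an illustrative step size, budget and iteration count:

```python
import torch

def targeted_attack(model, normalize, x, target_idx, eps=0.05, step=0.005, iters=100):
    """Gradient descent on the cross-entropy towards `target_idx`, within an L-inf budget."""
    delta = torch.zeros_like(x, requires_grad=True)
    target = torch.tensor([target_idx])
    for _ in range(iters):
        loss = torch.nn.functional.cross_entropy(model(normalize(x + delta)), target)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()   # move towards the target class
            delta.clamp_(-eps, eps)             # keep the perturbation small
            delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)
```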
We can also perform blackbox attacks on the model (i.e. only using the forward pass of the network) by iteratively choosing a random direction, adding it to the image, and keeping the change whenever it increases the model's loss.
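A sketch of this random-search procedure; the noise scale and iteration count are assumptions:

```python
import torch

def blackbox_attack(model, normalize, x, y_true, sigma=0.01, iters=2000):
    """Query-only attack: keep a random perturbation only if it increases the loss."""
    x_adv = x.clone()
    with torch.no_grad():
        best_loss = torch.nn.functional.cross_entropy(model(normalize(x_adv)), y_true)
        for _ in range(iters):
            candidate = (x_adv + sigma * torch.randn_like(x_adv)).clamp(0, 1)
            loss = torch.nn.functional.cross_entropy(model(normalize(candidate)), y_true)
            if loss > best_loss:                # the step hurt the model, so keep it
                x_adv, best_loss = candidate, loss
    return x_adv
```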
To do targeted blackbox attacks, we use a similar method to the above, but optimise towards a target label instead of just maximising the loss on the true label. As an example, I have done this with the dog image and the target label *sea slug*.
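The only change from the sketch above is the objective: we minimise the loss towards the target label rather than maximising it on the true label. The class index for *sea slug* and the iteration count below are assumptions:

```python
import torch

def targeted_blackbox_attack(model, normalize, x, target_idx=115, sigma=0.01, iters=20_000):
    """Query-only attack: keep a random perturbation only if it lowers the loss on the target label."""
    target = torch.tensor([target_idx])
    x_adv = x.clone()
    with torch.no_grad():
        best_loss = torch.nn.functional.cross_entropy(model(normalize(x_adv)), target)
        for _ in range(iters):
            candidate = (x_adv + sigma * torch.randn_like(x_adv)).clamp(0, 1)
            loss = torch.nn.functional.cross_entropy(model(normalize(candidate)), target)
            if loss < best_loss:                # lower loss = closer to the target class
                x_adv, best_loss = candidate, loss
    return x_adv
```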
On the left, I have plotted the prediction on the final image (20 000 iterations), and on the right we see how the prediction converges to *sea slug* while the probability of the correct class decreases.
Prediction | Convergence |
---|---|
A universal attack is when we find a single noise vector and apply it to several different images to mislead the model. To achieve this, I downloaded a subset of the ImageNet dataset and optimised a single noise vector $\delta$ to minimise the expected loss over the subset:

$$\mathbb{E}_{x}\big[\,L\big(f(x + \delta),\, y_{\text{target}}\big)\big]$$

where $f$ is the model, $y_{\text{target}}$ is the target label, and $\delta$ is constrained to be small.
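A sketch of how such a universal perturbation could be trained; the dataset path, target class, budget and optimiser settings are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader
import torchvision.transforms as T
from torchvision import datasets, models

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

dataset = datasets.ImageFolder("imagenet_subset/", transform=preprocess)  # hypothetical path
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)

eps, target_idx = 0.1, 115                               # assumed budget and target class
delta = torch.zeros(1, 3, 224, 224, requires_grad=True)  # one noise vector shared by all images
optimizer = torch.optim.Adam([delta], lr=0.01)

for epoch in range(5):
    for x, _ in loader:
        target = torch.full((x.size(0),), target_idx, dtype=torch.long)
        logits = model(normalize((x + delta).clamp(0, 1)))
        loss = torch.nn.functional.cross_entropy(logits, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                      # project back onto the L-inf ball
```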
Model | Loss | Top-1 Success rate | Top-5 Success rate |
---|---|---|---|
ResNet-50 | 15.683664 | 0.915417 | 0.958417 |
We can also visualise what this universal noise looks like by applying it to a dog and a panda:
Dog | Panda |
---|---|
Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy (2014). "Explaining and Harnessing Adversarial Examples". arXiv:1412.6572. https://doi.org/10.48550/arxiv.1412.6572