- Our stealing attacker can also strip the watermark from LLM outputs even in challenging settings (85%
- success rate, 1% before our work), concealing misuse such as plagiarism.
+ Our attacker can also strip the watermark from LLM outputs even in challenging settings (>80%
+ success, below 25% before our work), concealing misuse such as plagiarism.
@@ -246,10 +246,10 @@ What are scrubbing attacks?
- We show that this is not the case under the threat of watermark stealing. Our attacker can apply its
partial knowledge of the watermark rules to significantly boost the success rate of scrubbing
- on long texts with no need for additional queries to the server. Notably, we boost scrubbing success
- from 1% to 85% for the KGW2-SelfHash scheme. Similar results are obtained for several other schemes, as we
- show in our experimental evaluation in
+ on long texts with no need for additional queries to the server. Notably, we boost scrubbing success
+ from 1% to 85% for the KGW2-SelfHash scheme. The best baseline we are aware of achieves below 25%.
+ Similar results are obtained for several other schemes, as we show in our experimental evaluation in the paper.
Below, we also show several examples.
- Our results challenge the common belief that robustness to spoofing
@@ -522,10 +522,8 @@
Citation
@article{jovanovic2024watermarkstealing,
title = {Watermark Stealing in Large Language Models},
author = {Jovanović, Nikola and Staab, Robin and Vechev, Martin},
- year = {2024},
- eprint={2402.19361},
- archivePrefix={arXiv},
- primaryClass={cs.LG}
+ journal = {{ICML}},
+ year = {2024}
}