- Our stealing attacker can also strip the watermark from LLM outputs even in challenging settings (85%
- success rate, 1% before our work), concealing misuse such as plagiarism.
+ Our attacker can also strip the watermark from LLM outputs even in challenging settings (>80%
+ success, below 25% before our work), concealing misuse such as plagiarism.
@@ -246,10 +246,10 @@ What are scrubbing attacks?
- We show that this is not the case under the threat of watermark stealing. Our attacker can apply its
partial knowledge of the watermark rules (
) to significantly boost the success rate of scrubbing
- on long texts with no need for additional queries to the server. Notably, we boost scrubbing success
- from 1% to 85% for the KGW2-SelfHash scheme. Similar results are obtained for several other schemes, as we
- show in our experimental evaluation in
+ from 1% to 85% for the KGW2-SelfHash scheme. The best baseline we are aware of achieves below 25%.
+ Similar results are obtained for several other schemes, as we show in our experimental evaluation in the paper.
Below, we also show several examples.
- Our results challenge the common belief that robustness to spoofing
@@ -522,10 +522,8 @@
Citation
@article{jovanovic2024watermarkstealing,
title = {Watermark Stealing in Large Language Models},
author = {Jovanović, Nikola and Staab, Robin and Vechev, Martin},
- year = {2024},
- eprint={2402.19361},
- archivePrefix={arXiv},
- primaryClass={cs.LG}
+ journal = {{ICML}},
+ year = {2024}
}