Description: We challenge the assumption that LLM watermarks are ready for deployment by showing that prominent schemes can be stolen for under $50, enabling realistic spoofing and scrubbing attacks at scale.
Tags: security, safety, large language models (LLM), attacks, spoofing, scrubbing, watermarking
The most promising line of LLM watermarking schemes works by altering the generation process of the LLM based on unique watermark rules, determined by a secret key known only to the server. Without knowledge of the secret key, the watermarked text looks unremarkable, but with it, the server can detect the unusually high usage of so-called green tokens, mathematically proving that a piece of text was watermarked. Recent work posits that current schemes may be fit for deployment, but we provide evidence to the contrary.
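To make the detection side concrete, below is a minimal sketch of a green-token detector, assuming a scheme where each token's green list is derived by hashing the secret key together with the previous token (a common one-token-context design). The function names, the SHA-256 hashing, and the parameter values are our illustrative choices, not those of any specific deployed scheme.

```python
import hashlib
import math

GAMMA = 0.25  # assumed fraction of the vocabulary that is "green" at each step

def is_green(secret_key: str, prev_token: int, token: int) -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the secret
    key and the previous token (a common one-token watermark context)."""
    digest = hashlib.sha256(f"{secret_key}|{prev_token}|{token}".encode()).digest()
    return digest[0] < GAMMA * 256  # uniform byte -> green with probability GAMMA

def detection_z_score(secret_key: str, tokens: list[int]) -> float:
    """Count green tokens in `tokens`. Unwatermarked text should contain
    roughly GAMMA green tokens per position; watermarked text far more,
    which the z-statistic turns into statistical evidence of watermarking."""
    n = len(tokens) - 1  # number of (previous token, token) pairs scored
    greens = sum(is_green(secret_key, p, t) for p, t in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)
```

A z-score above roughly 4 corresponds to a negligible false positive rate, which is why detection is statistically sound for the key holder, and why spoofing should be hard for anyone without the key.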
In a realistic spoofing attack, the attacker generates high-quality text on arbitrary topics that is confidently detected as watermarked by the detector. This should be impossible for parties that do not know the secret key. A spoofing attack applied at scale discredits the watermark, as the server is unable to distinguish between truly watermarked and spoofed texts. Further, releasing harmful or toxic texts at scale that are falsely attributed to a specific LLM provider can lead to reputational damage. We demonstrate that such spoofing attacks become practical once the watermark rules have been approximately stolen.
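As a sketch of why stealing makes this feasible: suppose the attacker has queried the watermarked model enough to estimate, for each (previous token, token) pair, the probability that the pair is green. They can then bias an ordinary open model's logits toward likely-green tokens. The `green_prob` table and the `delta` strength below are hypothetical stand-ins for the stolen rules, not our exact attack.

```python
import torch

def spoofed_logits(logits: torch.Tensor, prev_token: int,
                   green_prob: dict[tuple[int, int], float],
                   delta: float = 4.0) -> torch.Tensor:
    """Bias an open (unwatermarked) model's next-token logits toward tokens
    the attacker believes are green under the stolen watermark rules, so the
    generated text triggers the victim's detector without the secret key."""
    biased = logits.clone()
    for token in range(logits.shape[-1]):
        # 0.5 = no evidence either way for this (context, token) pair
        p = green_prob.get((prev_token, token), 0.5)
        biased[token] += delta * (2 * p - 1)  # boost likely-green, demote likely-red
    return biased
```

Sampling from these biased logits yields fluent text with an inflated green-token rate, which the detector then confidently, but wrongly, attributes to the watermarked model.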
In a scrubbing attack, the attacker removes the watermark, i.e., tweaks the watermarked server response in a quality-preserving way such that the resulting text is not watermarked. If scrubbing is viable, misuse of powerful LLMs can be concealed, making it impossible to detect malicious use cases such as plagiarism or automated spamming and disinformation campaigns. Researchers have studied the threat of scrubbing attacks before, concluding that current state-of-the-art schemes are robust to this threat for realistic attackers. We show that this conclusion no longer holds once the attacker has stolen an approximation of the watermark rules.
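Scrubbing can reuse the same stolen estimates in reverse: among several quality-preserving paraphrase candidates (e.g., produced by a paraphrasing model), pick the one the stolen rules score as least green. Again, `green_prob` and the candidate list are hypothetical; this is a sketch of the idea, not our exact procedure.

```python
def estimated_green_count(tokens: list[int],
                          green_prob: dict[tuple[int, int], float]) -> float:
    """Expected number of green tokens under the attacker's stolen
    (approximate) watermark rules; lower means harder to detect."""
    return sum(
        green_prob.get((prev, tok), 0.5)
        for prev, tok in zip(tokens, tokens[1:])
    )

def scrub(candidates: list[list[int]],
          green_prob: dict[tuple[int, int], float]) -> list[int]:
    """Among quality-preserving paraphrases of the watermarked response,
    return the candidate the stolen rules score as least watermarked."""
    return min(candidates, key=lambda c: estimated_green_count(c, green_prob))
```

Minimizing the estimated green count should drive the true detector's z-score toward the unwatermarked baseline, concealing the text's origin.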