LLM Jailbreak
Last reviewed by Moderation API
An LLM jailbreak is a prompt or sequence of prompts crafted to bypass a language model's built-in safety rules and elicit content the model is trained to refuse. Techniques include role-play framing, obfuscated instructions, and multi-turn manipulation that gradually erodes the model's guardrails.
