Even with extensive safety precautions in place, a concerning vulnerability has been discovered in Meta’s Llama 3. Researchers at Haize Labs have found that the model can be circumvented through a straightforward jailbreak technique, allowing it to generate potentially dangerous or harmful content despite the implemented safeguards.
Highlights:
- Researchers at Haize Labs found a simple jailbreak technique to bypass Llama 3’s safeguards and make it generate harmful content.
- The study also showed that the length of a harmful prefix significantly impacts Llama’s likelihood of generating dangerous outputs when primed.
- The jailbreak exposes Llama 3’s lack of true self-reflection and inability to prevent generating nonsensical or abhorrent text.
Security Measures Taken by Meta
Below are the key measures employed by Meta to make its LLM more secure and safe:
- Addressing risks in training:
a. Expanded the training dataset to be 7 times larger than Llama 2, with more code and non-English data from over 30 languages.
b. Used previous Llama models to identify high-quality data and build text quality classifiers for Llama 3.
c. Leveraged synthetic data for training in areas like coding, reasoning, and long context.
d. Excluded data from sources known to contain a high volume of personal information. - Safety evaluations and tuning:
a. Conducted automated and manual evaluations to understand the model’s performance in risk areas like weapons, cyber attacks, and child exploitation.
b. Performed extensive red teaming exercises with internal and external experts to stress test the model.
c. Evaluated Llama 3 using benchmarks like CyberSecEval, a cybersecurity safety evaluation suite.
d. Implemented techniques like supervised fine-tuning and reinforcement learning with human feedback to address vulnerabilities.
e. Iteratively tested and addressed remaining risks after implementing safety measures. - Lowering benign refusals:
a. Improved fine-tuning approach to reduce false refusals for innocuous prompts.
b. Used high-quality data to train the model to recognize language nuances and respond appropriately. - Model transparency:
a. Published a model card with detailed information on Llama 3’s architecture, parameters, and evaluations.
b. Expanded the model card to include additional details about the responsibility and safety approach.
c. Included results for Llama 3 on standard benchmarks like general knowledge, reasoning, math, coding, and reading comprehension.
However, despite these measures in place, the researchers at Haize Labs found an approach to trivially get around these safety efforts to make Llama 3 produce harmful content.
Study on Llama 3’s Jailbreak Approach
Initially, because of Meta’s safety measures, their AI model would refuse to generate a dangerous response to a harmful prompt. The image below shows what a classic conversation with Llama 3 would look like for a malicious prompt:
Llama 3 will politely inform the user that it cannot provide assistance or engage with the given prompt due to its safety constraints and ethical guidelines.
However, to tackle this, the researchers said that by simply priming the Llama 3 Assistant role with a harmful prefix, Llama 3 will often generate a coherent, harmful continuation of that prefix.
They trivially bypassed the safety training by inserting a harmful prefix in the Assistant role to induce a harmful completion. The image below shows how Llama 3 continues to generate harmful text after using this approach.
Haize Labs claims that Llama 3 is so good at being helpful that its learned safeguards are not effective in such scenarios.
The official code is modified by the editing encode_dialog_prompt function in llama3_tokenizer.py.
The following change is made in the code to jailbreak Llama 3.
def encode_dialog_prompt(self, dialog: Dialog, add_generation_prompt: bool = True, allow_continue: bool = False) -> List[int]:
tokens = []
tokens.append(self.tokenizer.special_tokens["<|begin_of_text|>"])
for i, message in enumerate(dialog):
if i == len(dialog) - 1:
tokens.extend(self.encode_message(message, not allow_continue))
else:
tokens.extend(self.encode_message(message))
# Add the start of an assistant message for the model to complete.
if add_generation_prompt:
assert not allow_continue
tokens.extend(self.encode_header({"role": "assistant", "content": ""}))
return tokens
def format_dialog_prompt(self, dialog: Dialog, add_generation_prompt: bool = True, allow_continue: bool = False) -> str:
tokens = self.encode_dialog_prompt(dialog, add_generation_prompt, allow_continue)
return self.tokenizer.decode(tokens)
There is also no need to craft harmful prompts manually to bypass Llama 3’s safeguards. Users can exploit a straightforward technique by leveraging a naive AI model, such as Mistral Instruct to generate a harmful response. This response can then be passed to Llama 3 as a prefix to continue the generation of dangerous content.
The length of the malicious prefix significantly impacts whether Llama 3 generates harmful content.
If the prefix is too short, it may refuse generation. If the prefix is long, it warns about excessive text with an EOT token and then rejects generation. Prefixes of an optimal length can successfully bypass Llama 3’s safeguards and trick it into producing undesirable outputs, exposing a critical vulnerability in its defences against misuse.
The table below shows how the Attack Success Rate (ASR) changes by increasing harmful prefix length on the AdvBench Subset.
Llama 3 can partially recover and refuse shorter harmful prefixes. However, when it comes to longer prefixes, it is thrown off its aligned distribution and thus generates harmful content.
Through this simple jailbreak technique, Haize Labs presents a more fundamental question: “For all the capabilities LLMs possess and all the hype they receive, are they really capable of understanding what they’re saying?“
They say that it’s unsurprising that through training on refusals, Meta has enabled Llama 3 to refuse harmful instructions.
However, this simple experiment demonstrates that Llama cannot essentially prevent itself from generating nonsensical and abhorrent text if coaxed into doing so. It appears to lack the capacity for self-reflection, to analyze and understand the content it is producing as it generates the output.
Another user on Reddit performed a similar study to jailbreak Llama 3.
According to the Reddit user, Llama 3 doesn’t possess true self-reflection, and a simple trick can bypass its training safeguards. The user claims that by editing the refusal message and prefixing it with a positive response to a query, such as “Step 1,” the model will continue to generate content, even if it involves a harmful or unethical subject.
Thus, different techniques that could be used to jailbreak Llama 3 include editing responses, adding harmful prefixes, or including custom prompts.
Conclusion
The existence of such a simple jailbreak method calls into question the robustness of the safety measures employed for this high-profile open-source release and highlights the ongoing challenges in developing truly secure and reliable AI systems.