Skip to main content

Tree Jailbreaking

Multi-turn
LLM-simulated

The TreeJailbreaking attack enhances a base attack — a harmful prompt targeting a specific vulnerability — in multiple parallel ways, it forms a tree by exploring different variations of the attack to bypass the target model's safeguards. Instead of refining a single prompt, it branches out by evaluating and expanding only the most promising paths to increase the chances of a successful jailbreak.

Tree Jailbreaking Intro

IMPORTANT

Pruning is critical in Tree Jailbreaking, as it ensures the system focuses resources on the most effective branches.

Usage

main.py
from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import TreeJailbreaking
from somewhere import your_callback

tree_jalbreaking = TreeJailbreaking(
weight=5,
max_depth=7,
turn_level_attacks=[Roleplay()]
)

red_team(
attacks=[tree_jalbreaking],
vulnerabilities=[Bias()],
model_callback=your_callback
)

There are FOUR optional parameters when creating a TreeJailbreaking attack:

  • [Optional] weight: an integer that determines this attack method's selection probability, proportional to the total weight sum of all attacks during red teaming. Defaulted to 1.
  • [Optional] max_depth: an integer that specifies the maximum depth the branches of trees can reach until it enhances a final attack. Defaulted to 5.
  • [Optional] simulator_model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o-mini'.
  • [Optional] turn_level_attacks: a list of single-turn attacks that will be randomly sampled to enhance an attack inside a turn.

As a standalone

You can try to jailbreak your model on a single vulnerability using the progress method:

from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import TreeJailbreaking
from deepteam.vulnerabilities import Bias
from somewhere import your_callback

bias = Bias()
tree_jalbreaking = TreeJailbreaking(
weight=5,
max_depth=7,
turn_level_attacks=[Roleplay()]
)

result = tree_jalbreaking.progress(vulnerability=bias, model_callback=your_callback)
print(result)

How It Works

TreeJailbreaking is a branching attack strategy that explores multiple parallel paths to push a target LLM toward generating restricted or harmful outputs. It works in the following steps:

  1. Start from a base vulnerability — a prompt crafted to exploit a known weakness in the model (e.g. refusal behavior, prompt handling).
  2. Generate multiple branches, each representing a different variation of the base attack.
  3. Get responses from the target LLM for each branch.
  4. Score and select the most promising branches based on how close they get to a successful jailbreak.
  5. Expand only the top-performing branches with new variations.
  6. Repeat this process for a fixed number of iterations or until a harmful output is produced or max depth is reached.

Instead of refining a single prompt, TreeJailbreaking diverges into multiple candidates at each step — retaining only the most effective ones and pruning weaker paths — creating a dynamic, exploratory attack tree.

Tree Jailbreaking

If the turn_level_attacks is supplied with a list of single-turn attacks, TreeJailbreaking adds controlled variation:

  • At each branch expansion step, there's a 50% chance of enhancing the prompt with a randomly selected attack from the turn_level_attacks list (e.g., Roleplay, PromptInjection).
  • These injected attacks introduce diverse angles within each branch, increasing coverage without breaking the overall branching structure.
note

This hybrid approach combines broad parallel exploration with targeted refinement, boosting the likelihood of bypassing safety filters through multiple diverse jailbreak attempts.

Example

Here's an example of how TreeJailbreaking attack works from a vulnerability:

Tree Jailbreaking

  1. Start: Launch with one adversarial user prompt aimed at a known vulnerability.
  2. Observe: Get the model’s reply — it might refuse, give a neutral answer, or reveal something risky.
  3. Vary: Create a handful of alternative phrasings or contexts of the original prompt to try different angles.
  4. Probe: Send those variants to the model and note which ones produce more revealing or risky content.
  5. Prune: Discard variants that clearly fail (refuse or produce nothing useful).
  6. Amplify: Take the most promising variants and refine them further (rephrase, add context, change persona) to push the model more.
  7. Iterate: Repeat probing and pruning until one chain of prompts reliably elicits the desired (unsafe) output.
  8. Select: Choose the best sequence of prompts that produced the most effective breach and use that as the successful attack path.