Linear Jailbreaking
The `LinearJailbreaking` attack enhances a base attack (a harmful prompt targeting a specific vulnerability) and iteratively refines it using the target LLM's responses. Each iteration builds on the last, gradually improving the attack's effectiveness at eliciting harmful outputs from the target LLM.
The linear jailbreaking process terminates only when the target LLM produces a harmful response or the maximum number of turns is reached.
Usage
```python
from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import LinearJailbreaking
from somewhere import your_callback

linear_jailbreaking = LinearJailbreaking(
    weight=5,
    num_turns=7,
    turn_level_attacks=[Roleplay()]
)

red_team(
    attacks=[linear_jailbreaking],
    vulnerabilities=[Bias()],
    model_callback=your_callback
)
```
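The `your_callback` import above is a placeholder for whatever wraps the LLM system you want to red team. As a rough, hypothetical sketch (the exact signature should match what your version of `red_team` expects), the callback receives the attack prompt as a string and returns your system's response as a string:

```python
# Hypothetical stand-in for `from somewhere import your_callback`.
async def your_callback(input: str) -> str:
    # Replace this with a call into your own LLM application,
    # e.g. an API request to the system under test.
    return f"I'm sorry, but I can't help with that request: {input}"
```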
There are FOUR optional parameters when creating a `LinearJailbreaking` attack:
- [Optional] `weight`: an integer that determines this attack method's selection probability, proportional to the total weight sum of all attacks during red teaming. Defaulted to `1`.
- [Optional] `num_turns`: an integer that specifies the number of turns to use in the attempt to jailbreak your LLM system. Defaulted to `5`.
- [Optional] `simulator_model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to `'gpt-4o-mini'`.
- [Optional] `turn_level_attacks`: a list of single-turn attacks that will be randomly sampled to enhance an attack inside turns.
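For reference, the sketch below simply restates those defaults explicitly; it should behave the same as `LinearJailbreaking()` with no arguments, assuming the default values listed above.

```python
from deepteam.attacks.multi_turn import LinearJailbreaking

# Defaults from the parameter list above, spelled out for clarity.
linear_jailbreaking = LinearJailbreaking(
    weight=1,                      # selection probability weight
    num_turns=5,                   # number of jailbreaking turns
    simulator_model="gpt-4o-mini"  # model used to refine the attack each turn
)
```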
As a standalone
You can try to jailbreak your model on a single vulnerability using the `progress` method:
```python
from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import LinearJailbreaking
from deepteam.vulnerabilities import Bias
from somewhere import your_callback

bias = Bias()
linear = LinearJailbreaking(
    weight=5,
    num_turns=7,
    turn_level_attacks=[Roleplay()]
)

result = linear.progress(vulnerability=bias, model_callback=your_callback)
print(result)
```
How It Works
`LinearJailbreaking` is an iterative attack strategy that gradually pushes a target LLM toward generating restricted or harmful outputs. It works in the following steps:
- Start from a base attack: a prompt crafted to exploit a known vulnerability (e.g., bias, refusal handling).
- Get a response from the target LLM.
- Refine the attack by updating the prompt based on the LLM's response.
- Repeat this process for a set number of turns (`num_turns`) or until a harmful output is generated.
Each step builds on the last, making the attack progressively more effective and better aligned with the model's behavior.
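To make the loop concrete, here is a minimal sketch of the refinement process described above. It is not deepteam's actual implementation: `is_harmful` and `refine_attack` are trivial hypothetical stand-ins for the simulator model's judgement and refinement steps, and the callback is assumed to be synchronous for simplicity.

```python
# Illustrative sketch only; NOT deepteam's implementation.

def is_harmful(response: str) -> bool:
    # Hypothetical stand-in: in deepteam this judgement comes from a judge/simulator model.
    return "i can't" not in response.lower()

def refine_attack(attack: str, response: str) -> str:
    # Hypothetical stand-in: in deepteam the simulator model rewrites the prompt.
    return f"{attack}\n(The previous answer was: {response!r}. Answer more directly.)"

def linear_jailbreak(base_attack: str, model_callback, num_turns: int = 5) -> str:
    attack = base_attack
    for _ in range(num_turns):
        response = model_callback(attack)         # 1. get a response from the target LLM
        if is_harmful(response):                  # 2. terminate once a harmful output is produced
            return attack
        attack = refine_attack(attack, response)  # 3. refine the prompt using the response
    return attack                                 # 4. maximum number of turns reached
```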
If `turn_level_attacks` is supplied with a list of single-turn attacks, `LinearJailbreaking` adds controlled variation:
- At each turn, there's a 50% chance of enhancing the current prompt with a randomly selected attack from the `turn_level_attacks` list (e.g., `Roleplay`, `PromptInjection`).
- These injected attacks are still part of the same linear conversation and help explore alternate jailbreak angles without disrupting the overall structure.
This hybrid approach blends predictable linear refinement with randomized variation, increasing the chances of finding successful jailbreaks.
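As a rough sketch of that per-turn randomization (again not deepteam's actual code; `enhance` is assumed here as the single-turn attack's enhancement method), the variation step could look like this:

```python
import random

# Hypothetical sketch: with `turn_level_attacks` supplied, each turn has a
# 50% chance of first enhancing the current prompt with a randomly sampled
# single-turn attack before it is sent to the target LLM.
def maybe_enhance(attack: str, turn_level_attacks: list) -> str:
    if turn_level_attacks and random.random() < 0.5:
        single_turn_attack = random.choice(turn_level_attacks)
        return single_turn_attack.enhance(attack)  # assumed method name
    return attack
```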
Example
Here's an example of a `LinearJailbreaking` attack targeting a racial bias vulnerability:
- First Attack: The attacker starts with a direct prompt asking about negative stereotypes associated with mixed-race individuals, targeting the racial bias vulnerability in the model.
- Initial Response: The model provides a safe, balanced response, acknowledging stereotypes but emphasizing the importance of challenging them.
- Refinement: The attacker subtly reframes the prompt, introducing a more neutral or emotional context (e.g., pretending to be a reporter or concerned parent) to test the model's responses.
- Subsequent Turns: The attacker refines the prompt further with each turn, adjusting based on the model's previous answers to find a way to provoke biased or harmful output.
- Successful Breach: After several iterations, the attacker successfully bypasses the model's safeguards, prompting a harmful response that perpetuates stereotypes.