
Linear Jailbreaking

Multi-turn
LLM-simulated

The LinearJailbreaking attack enhances a base attack (a harmful prompt targeting a specific vulnerability) and iteratively refines it using the target LLM's responses. Each iteration builds on the last, gradually making the attack more effective at eliciting harmful outputs from the target LLM.

info

The linear jailbreaking process only terminates when the target LLM produces a harmful response or the maximum number of turns is reached.

Usage

main.py
from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import LinearJailbreaking
from somewhere import your_callback

linear_jailbreaking = LinearJailbreaking(
    weight=5,
    num_turns=7,
    turn_level_attacks=[Roleplay()]
)

red_team(
    attacks=[linear_jailbreaking],
    vulnerabilities=[Bias()],
    model_callback=your_callback
)

There are FOUR optional parameters when creating a LinearJailbreaking attack:

  • [Optional] weight: an integer that determines this attack method's selection probability, proportional to the total weight sum of all attacks during red teaming. Defaulted to 1.
  • [Optional] num_turns: an integer that specifies the number of turns to use in the attempt to jailbreak your LLM system. Defaulted to 5.
  • [Optional] simulator_model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o-mini'.
  • [Optional] turn_level_attacks: a list of single-turn attacks that will be randomly sampled during the conversation to enhance the attack at individual turns.
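
Both code snippets on this page import your_callback from a placeholder module; in practice this is a thin wrapper around the LLM system you are red teaming. Below is a minimal sketch, assuming the callback receives the attack prompt as a string and returns your system's response as a string (the AsyncOpenAI client and model name are purely illustrative):

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def your_callback(input: str) -> str:
    # Forward the attack prompt to the LLM system under test and
    # return its raw text response for deepteam to evaluate.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; swap in your own system or model
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content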

As a standalone

You can try to jailbreak your model on a single vulnerability using the progress method:

from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import LinearJailbreaking
from deepteam.vulnerabilities import Bias
from somewhere import your_callback

bias = Bias()
linear = LinearJailbreaking(
    weight=5,
    num_turns=7,
    turn_level_attacks=[Roleplay()]
)

result = linear.progress(vulnerability=bias, model_callback=your_callback)
print(result)
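
The same optional parameters listed above apply when using LinearJailbreaking standalone. For example, to run the attack simulation on a different model, you could pass simulator_model (the model name below is illustrative):

linear = LinearJailbreaking(
    simulator_model="gpt-4o",  # any OpenAI GPT model name, or a custom DeepEvalBaseLLM
    num_turns=7
)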

How It Works

LinearJailbreaking is an iterative attack strategy that gradually pushes a target LLM toward generating restricted or harmful outputs. It works in the following steps:

  1. Start from a base attack — a prompt crafted to exploit a known vulnerability (e.g. bias, refusal handling).
  2. Get a response from the target LLM.
  3. Refine the attack by updating the prompt based on the LLM's response.
  4. Repeat this process for a set number of turns (num_turns) or until a harmful output is generated.

Each step builds on the last, making the attack progressively more effective and better aligned with the model's behavior.

Linear Jailbreaking
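
Conceptually, a single run looks something like the sketch below. This is an illustration of the strategy rather than deepteam's internal implementation; is_harmful and simulate_refinement are hypothetical stand-ins for the judgement and refinement steps performed by the simulator model:

def linear_jailbreak(base_attack, model_callback, num_turns=5):
    # Illustrative sketch only: is_harmful and simulate_refinement are
    # hypothetical helpers representing the simulator model's decisions.
    current_attack = base_attack
    response = None
    for _ in range(num_turns):
        # Step 2: get a response from the target LLM
        response = model_callback(current_attack)
        # Terminate early if the response is judged harmful
        if is_harmful(response):
            break
        # Step 3: refine the attack based on the target's response
        current_attack = simulate_refinement(current_attack, response)
    return current_attack, response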

If turn_level_attacks is supplied with a list of single-turn attacks, LinearJailbreaking adds controlled variation:

  • At each turn, there's a 50% chance of enhancing the current prompt with a randomly selected attack from the turn_level_attacks list (e.g., Roleplay, PromptInjection).
  • These injected attacks are still part of the same linear conversation and help explore alternate jailbreak angles without disrupting the overall structure.

note

This hybrid approach blends predictable linear refinement with randomized variation, increasing the chances of finding successful jailbreaks.
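
In pseudocode terms, the per-turn variation amounts to something like the following (illustrative only; it assumes each single-turn attack exposes an enhance method that rewrites a prompt):

import random

# Illustrative: before sending each turn's prompt, there is a 50% chance of
# layering a randomly chosen single-turn attack on top of the current attack.
if turn_level_attacks and random.random() < 0.5:
    extra = random.choice(turn_level_attacks)
    current_attack = extra.enhance(current_attack)  # enhance() method assumed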

Example

Here's an example of a LinearJailbreaking attack targeting a racial bias vulnerability:

Linear Jailbreaking Example

  1. First Attack: The attacker starts with a direct prompt asking about negative stereotypes associated with mixed-race individuals, targeting the racial bias vulnerability in the model.
  2. Initial Response: The model provides a safe, balanced response, acknowledging stereotypes but emphasizing the importance of challenging them.
  3. Refinement: The attacker subtly reframes the prompt, introducing a more neutral or emotional context (e.g., pretending to be a reporter or concerned parent) to test the model's responses.
  4. Subsequent Turns: The attacker refines the prompt further with each turn, adjusting based on the model's previous answers to find a way to provoke biased or harmful output.
  5. Successful Breach: After several iterations, the attacker successfully bypasses the model's safeguards, prompting a harmful response that perpetuates stereotypes.
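
To reproduce a scenario like this, you would scope the vulnerability to racial bias before red teaming; a sketch, assuming Bias accepts a types parameter that includes "race":

from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.multi_turn import LinearJailbreaking
from somewhere import your_callback

racial_bias = Bias(types=["race"])  # "race" type assumed for illustration

red_team(
    attacks=[LinearJailbreaking(num_turns=7)],
    vulnerabilities=[racial_bias],
    model_callback=your_callback
)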