
Sequential Jailbreaking

Multi-turn
LLM-simulated

The SequentialJailbreak attack uses multi-turn conversational scaffolding to guide a target LLM toward generating restricted or harmful outputs. Instead of issuing a direct request, it reframes the harmful prompt as a structured dialogue or scenario that gradually escalates in complexity or specificity over several turns, exploiting the model's consistency and role adherence in conversation.

IMPORTANT

Sequential jailbreaking relies on narrative context and psychological framing: harmful prompts are embedded in formats designed to erode the model's guardrails over time.

Usage

main.py
from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import Roleplay
from deepteam.attacks.multi_turn import SequentialJailbreak
from somewhere import your_callback

sequential_jailbreaking = SequentialJailbreak(
    weight=3,
    type="dialogue",
    persona="student",
    num_turns=7,
    turn_level_attacks=[Roleplay()]
)

red_team(
    attacks=[sequential_jailbreaking],
    vulnerabilities=[Bias()],
    model_callback=your_callback
)
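
Here, your_callback wraps the LLM system being red teamed. A minimal sketch, assuming an async callback that receives the attack prompt as a string and returns your system's response (my_llm_app is a hypothetical handle for your application):

async def your_callback(input: str) -> str:
    # Replace this with a call into your actual LLM application;
    # the returned string is what gets evaluated at each turn.
    response = await my_llm_app.generate(input)  # hypothetical application call
    return response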

There are SIX optional parameters when creating a SequentialJailbreak attack:

  • [Optional] weight: an integer that determines this attack method's selection probability, proportional to the total weight sum of all attacks during red teaming. Defaulted to 1.
  • [Optional] type: a string that specifies the scenario type — "dialogue", "question_bank", or "game_environment". Defaulted to "dialogue".
  • [Optional] persona: a string that defines the character used in "dialogue" attacks. Must be one of "prisoner", "student", "researcher", or "generic". Only applies when type="dialogue". Defaulted to "student".
  • [Optional] num_turns: an integer that specifies the number of turns to use in the attempt to jailbreak your LLM system. Defaulted to 5.
  • [Optional] simulator_model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o-mini'. See the sketch after this list for swapping in a different simulator.
  • [Optional] turn_level_attacks: a list of single-turn attacks that will be randomly sampled to enhance an attack inside a turn.
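
For instance, to swap out the default simulator (a short sketch using the parameter above; the commented custom-model line assumes a DeepEvalBaseLLM subclass you have defined elsewhere):

from deepteam.attacks.multi_turn import SequentialJailbreak

# Use a specific OpenAI model as the simulator
sequential_jailbreak = SequentialJailbreak(simulator_model="gpt-4o")

# Or pass any custom model that subclasses DeepEvalBaseLLM:
# sequential_jailbreak = SequentialJailbreak(simulator_model=my_custom_llm)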

As a standalone

You can run the attack on a single vulnerability using the progress method:

from deepteam.attacks.multi_turn import SequentialJailbreak
from deepteam.attacks.single_turn import Roleplay
from deepteam.vulnerabilities import Bias
from somewhere import your_callback

bias = Bias()
sequential_jailbreaking = SequentialJailbreak(
    weight=3,
    type="dialogue",
    persona="student",
    num_turns=7,
    turn_level_attacks=[Roleplay()]
)

result = sequential_jailbreaking.progress(vulnerability=bias, model_callback=your_callback)
print(result)

How It Works

SequentialJailbreak is a guided multi-turn attack that embeds a harmful prompt within a plausible conversational narrative. It works in the following steps, sketched in code after the list:

  1. Start from a base vulnerability — a harmful intent identified from the vulnerability (e.g. evasion, bias, refusal behavior).
  2. Transform the harmful prompt into a scenario template, such as a fictional conversation, quiz question, or game challenge.
  3. Initiate a multi-turn dialogue, with the LLM playing a role or responding to a structured task.
  4. Escalate the scenario over turns, gradually introducing more specificity that pushes toward the restricted output.
  5. Monitor model outputs and evaluate after each turn.
  6. Stop when a harmful response is generated (as judged by binary classification) or the dialogue ends unsuccessfully after the final turn.
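
In pseudocode, the loop looks roughly like this. This is a simplified sketch, not deepteam's actual implementation: embed_in_scenario, is_harmful, and escalate stand in for the internal scenario templates, binary classifier, and escalation logic.

async def sequential_jailbreak_loop(baseline_attack, model_callback, num_turns=5):
    prompt = embed_in_scenario(baseline_attack)      # step 2: wrap the intent in a scenario
    for _ in range(num_turns):                       # step 3: multi-turn dialogue
        response = await model_callback(prompt)      # query the target LLM
        if is_harmful(response):                     # steps 5-6: binary classification
            return response                          # stop: harmful output generated
        prompt = escalate(prompt, response)          # step 4: add specificity next turn
    return None                                      # dialogue ended unsuccessfully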

Sequential Jailbreaking Enhanced

If turn_level_attacks is supplied with a list of single-turn attacks, SequentialJailbreak adds controlled variation, illustrated in the sketch after this list:

  • At each turn, there's a 50% chance of enhancing the current prompt with a randomly selected attack from the turn_level_attacks list (e.g., Roleplay, PromptInjection).
  • These injected attacks are still part of the same sequential conversation; the enhancement helps explore alternate jailbreak angles without disrupting the overall structure.
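
The per-turn sampling can be pictured like this. An illustrative sketch: it assumes the enhance method that deepteam's single-turn attacks expose for rewriting a prompt, and prompt stands for the current turn's text.

import random

from deepteam.attacks.single_turn import PromptInjection, Roleplay

turn_level_attacks = [Roleplay(), PromptInjection()]
prompt = "..."  # the current turn's prompt in the ongoing conversation

if random.random() < 0.5:                        # 50% chance per turn
    attack = random.choice(turn_level_attacks)   # randomly sample one attack
    prompt = attack.enhance(prompt)              # rewrite the turn's prompt with it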

Each attack uses a preset structure tailored to the selected type, and leverages psychological framing techniques to reduce model resistance.

Attack Types

type                 Description
"dialogue"           Embeds the prompt inside a fictional conversation between personas.
"question_bank"      Masks the request as an educational scenario from a VR game design context.
"game_environment"   Presents the prompt as a challenge within a multi-level detective or puzzle game.

If type="dialogue" is selected, the optional persona parameter configures the tone and context of the conversation — e.g., "prisoner" might involve secretive exchanges, while "student" might center on tutoring or learning scenarios.
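For instance, each scenario type can be configured as follows (variable names are illustrative; the parameters are the ones documented above):

from deepteam.attacks.multi_turn import SequentialJailbreak

# Fictional conversation with a secretive prisoner persona
dialogue_attack = SequentialJailbreak(type="dialogue", persona="prisoner")

# Educational question-bank framing (persona is ignored for this type)
question_bank_attack = SequentialJailbreak(type="question_bank")

# Multi-level detective/puzzle game challenge
game_attack = SequentialJailbreak(type="game_environment")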

Example

Here's an example of a SequentialJailbreak attack targeting a racial bias vulnerability:

SequentialJailbreak Example

  1. First Attack: The attacker starts with a direct prompt asking about negative stereotypes associated with mixed-race individuals, targeting the racial bias vulnerability in the model.
  2. Initial Response: The model provides a safe, balanced response, acknowledging stereotypes but emphasizing the importance of challenging them.
  3. Setting a Scenario: The attacker then crafts a scenario that builds toward the vulnerability under test and asks the target model to take part in it.
  4. Subsequent Turns: The attacker refines this scenario and frames it differently with each turn, adjusting based on the model's previous answers to find a way to provoke biased or harmful output.
  5. Successful Breach: After several iterations, the attacker successfully bypasses the model's safeguards, prompting a harmful response that perpetuates stereotypes.