Skip to main content

Red Teaming Anthropic Claude

Anthropic's Claude models (Sonnet, Opus, Haiku) are trained with Constitutional AI (CAI) — a self-critique alignment method where the model evaluates its own outputs against a set of principles before responding. This produces safety behavior that is structurally different from RLHF-only models: Claude tends to reason about why a request is harmful rather than pattern-matching on surface-level triggers. The result is stronger resistance to direct jailbreaks but a different set of weaknesses — particularly around multi-turn consistency manipulation and appeals to the model's own helpfulness principles.

This guide covers red teaming Anthropic Claude with DeepTeam.

note

For general model red teaming methodology, vulnerability selection guidance, and framework-based assessments, see the foundational models guide.

YAML Quickstart

Claude requires a custom callback since DeepTeam's native YAML target supports OpenAI models. Create the callback first:

my_model.py
import anthropic

client = anthropic.Anthropic()

async def model_callback(input: str) -> str:
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": input}],
)
return message.content[0].text
red-team-claude.yaml
target:
purpose: "A financial advisory assistant for retail banking customers"
callback:
file: "my_model.py"
function: "model_callback"

default_vulnerabilities:
- name: "Toxicity"
- name: "Bias"
types: ["race", "gender", "religion"]
- name: "PromptLeakage"
- name: "IllegalActivity"

attacks:
- name: "Roleplay"
weight: 3
- name: "PromptInjection"
weight: 2
- name: "Leetspeak"
- name: "LinearJailbreaking"
num_turns: 5
deepteam run red-team-claude.yaml

Python Assessment

from deepteam import red_team
from deepteam.vulnerabilities import Toxicity, Bias, PromptLeakage, IllegalActivity, Misinformation
from deepteam.attacks.single_turn import PromptInjection, Roleplay, Leetspeak
from deepteam.attacks.multi_turn import LinearJailbreaking, CrescendoJailbreaking
import anthropic

client = anthropic.Anthropic()

async def model_callback(input: str) -> str:
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": input}],
)
return message.content[0].text

red_team(
model_callback=model_callback,
target_purpose="A financial advisory assistant for retail banking customers",
vulnerabilities=[
Toxicity(), Bias(), PromptLeakage(),
IllegalActivity(), Misinformation(),
],
attacks=[
PromptInjection(), Roleplay(), Leetspeak(),
LinearJailbreaking(), CrescendoJailbreaking(),
],
attacks_per_vulnerability_type=5,
)

Multi-Turn Testing

Claude's Constitutional AI training makes it particularly interesting to test across multiple turns. The model's self-critique mechanism can be worn down through gradual context manipulation — where each turn subtly shifts the model's assessment of what constitutes a helpful response.

from deepteam.test_case import RTTurn

async def model_callback(input: str, turns: list[RTTurn] = None) -> str:
messages = []
for turn in (turns or []):
messages.append({"role": turn.role, "content": turn.content})
messages.append({"role": "user", "content": input})

message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=messages,
)
return message.content[0].text

Claude-Specific Considerations

Safety Training Characteristics

Claude's Constitutional AI alignment produces distinctive behaviors that affect red teaming strategy:

  • Principled refusal — Claude refuses harmful requests by reasoning about principles rather than pattern-matching keywords. This makes it more resistant to simple encoding attacks (Leetspeak, ROT13) than RLHF-only models, because the model evaluates semantic intent after decoding.
  • Helpfulness tension — Claude's training heavily weights being helpful, thorough, and honest. Attacks that frame harmful requests as educational, research-oriented, or safety-focused exploit this tension — the model's helpfulness principle can override its harm-avoidance principle.
  • Multi-turn consistency pressure — Once Claude commits to a line of reasoning, it tends to remain consistent. An attacker who establishes a benign framing in early turns can gradually escalate, leveraging the model's own prior responses as implicit permission.
  • System prompt adherence — Claude treats system prompts as strong constraints relative to other models. PromptLeakage attacks are less likely to succeed through direct elicitation but can still work through indirect approaches (summarization requests, translation tasks).

Testing with System Prompts

Claude's system prompt is passed as a separate parameter, not a message:

async def model_callback(input: str) -> str:
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system="You are a financial advisor. Never provide specific investment recommendations.",
messages=[{"role": "user", "content": input}],
)
return message.content[0].text

Test with your production system prompt to evaluate real-world safety boundaries.

AttackEffectiveness Against ClaudeWhy
RoleplayHighExploits helpfulness-vs-safety tension through fictional framing
LinearJailbreakingHighGradual multi-turn escalation wears down principled refusal
CrescendoJailbreakingHighContext accumulation exploits consistency pressure
PromptInjectionMediumClaude evaluates intent semantically, not just pattern-matching
Leetspeak / ROT13LowerCAI self-critique catches encoded harmful content more reliably
Use CasePriority VulnerabilitiesWhy
Customer-facingToxicity, Bias, PIILeakageBrand safety, user trust
Regulated industryIllegalActivity, Misinformation, PersonalSafetyCompliance, liability
Internal toolingPromptLeakage, SQLInjection, ShellInjectionInfrastructure boundaries
Content generationBias, Misinformation, IntellectualPropertyOutput quality, legal risk

What to Do Next

  • Prioritize multi-turn attacks — Claude's primary weakness is gradual escalation, not single-shot jailbreaks. Run LinearJailbreaking and CrescendoJailbreaking with higher turn counts (num_turns: 5+).
  • Compare Sonnet vs. Haiku — Smaller models (Haiku) may have weaker safety boundaries than larger ones (Sonnet, Opus). Run the same assessment across model sizes.
  • Test against frameworks — Use OWASPTop10() for standardized coverage. See the safety frameworks guide.
  • Move to application-level testing — Once the model passes, test it in context with the AI agents, conversational agents, or agentic RAG guides.
Confident AI
Try DeepTeam on Confident AI Enterprise
Run red teaming assessments against safety frameworks, view risk reports, schedule recurring audits, and deploy guardrails in production.
Book a demo