Red Teaming Open-Weight Models
Open-weight models — Llama, Mistral, Qwen, Gemma, Phi — are deployed without any provider-side safety infrastructure. There is no moderation API, no content filter, no abuse detection system sitting between the model and your users. The model's own alignment training is the entire safety boundary. And that alignment can be weakened by quantization, removed by fine-tuning, or simply insufficient in smaller model sizes.
This fundamentally changes the red teaming threat model. You are not testing whether a provider's safety stack holds. You are testing whether your deployment — your serving configuration, your system prompt, your guardrails — produces safe behavior with a model whose safety properties you may not fully control.
This guide covers red teaming methodology for open-weight model deployments with DeepTeam.
For provider-specific callback setup and safety characteristics, see the Meta Llama and Mistral guides. For general model red teaming, see the foundational models guide.
Why Open-Weight Models Need Different Red Teaming
No Provider Safety Net
When GPT-4o produces harmful output, OpenAI's moderation API may still block it at the API level. When Claude produces harmful output, Anthropic's usage policies trigger account-level enforcement. Open-weight models have neither. A successful attack against an open-weight deployment goes directly to the user.
This means:
- Every vulnerability that passes red teaming is a vulnerability in production — no hidden safety layer will catch what you missed.
- Application-level guardrails are not optional. They are your only line of defense beyond the model itself.
- Red teaming coverage must be broader and attack intensity higher than for API-served models.
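Because there is no hidden safety layer, even a crude application-level output check is better than none. The sketch below is a minimal, hypothetical deny-list guardrail; the patterns and function name are illustrative assumptions, and a production deployment should use a trained moderation model rather than regexes. It shows where the check sits, not how to build it well:

```python
import re

# Hypothetical deny-list applied to model OUTPUT before it reaches the user.
# Real deployments should use a trained moderation model; regexes are shown
# only to illustrate the placement of the check.
DENY_PATTERNS = [
    re.compile(r"(?i)\bhow to (make|build) (a )?(bomb|explosive)"),
    re.compile(r"(?i)\bcredit card number\b"),
]

def passes_guardrail(text: str) -> bool:
    """Return False if the output matches any deny pattern."""
    return not any(p.search(text) for p in DENY_PATTERNS)

print(passes_guardrail("Here is a summary of the meeting."))   # True
print(passes_guardrail("Sure, here is how to make a bomb."))   # False
```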
Quantization Degrades Safety
Most open-weight deployments use quantized models for cost and latency. Quantization reduces numerical precision (e.g., from 16-bit to 4-bit), which disproportionately degrades safety-related behaviors:
- Safety training involves subtle weight patterns that are among the first to lose fidelity during quantization.
- A model that passes red teaming at full precision may fail at 4-bit quantization.
- The degradation is not uniform — some vulnerability categories are more affected than others.
Always test the quantized model you actually deploy, not the full-precision reference.
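To build intuition for why low-bit quantization erases subtle weight patterns, consider a toy uniform quantizer. This is an illustration only, not how GPTQ or AWQ actually work; the point is that nearby weight values collapse onto the same 4-bit level while 8-bit preserves the distinction:

```python
# Toy illustration of precision loss under uniform quantization.
# Real schemes (GPTQ, AWQ) are more sophisticated, but the core effect is
# the same: small, subtle weight differences collapse together at low bits.
def quantize(w: float, bits: int, w_min: float = -1.0, w_max: float = 1.0) -> float:
    levels = 2 ** bits - 1
    step = (w_max - w_min) / levels
    return w_min + round((w - w_min) / step) * step

for w in [0.1234, 0.1267, -0.4521]:
    q8, q4 = quantize(w, 8), quantize(w, 4)
    print(f"{w:+.4f} -> 8-bit {q8:+.4f} (err {abs(w - q8):.4f}), "
          f"4-bit {q4:+.4f} (err {abs(w - q4):.4f})")
```

Note that 0.1234 and 0.1267 quantize to the same 4-bit value: any behavioral difference those two weights encoded is gone.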
Fine-Tuning Can Remove Safety
Open-weight models can be fine-tuned on arbitrary data. Community fine-tunes, domain adaptations, and LoRA merges may inadvertently — or intentionally — remove safety behaviors. If your deployment uses a fine-tuned variant, treat its safety properties as unknown until tested.
Methodology
The red teaming process for open-weight models follows the standard DeepTeam workflow but with expanded scope:
- Identify the exact deployment configuration — Model name, quantization level, serving framework, system prompt, and any pre/post-processing.
- Select broad vulnerability coverage — Without a provider safety net, you cannot rely on catching what you don't test.
- Integrate the exact deployment — The callback must hit the same serving endpoint, with the same configuration, as production.
- Compare across configurations — Run the same assessment against full-precision and quantized versions to measure safety degradation.
Writing the model_callback
The callback must test your actual deployment — same quantization, same system prompt, same serving framework:
Ollama:
```python
import httpx

async def model_callback(input: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1:8b-q4_0", "prompt": input, "stream": False},
            timeout=60.0,
        )
        return response.json()["response"]
```
vLLM (OpenAI-compatible):
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def model_callback(input: str) -> str:
    # Use the async client so the call doesn't block inside the async callback.
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
```
Hugging Face Transformers (direct):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

async def model_callback(input: str) -> str:
    # Apply the model's chat template — Instruct models expect it, and raw
    # prompts produce unrepresentative safety behavior.
    messages = [{"role": "user", "content": input}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=512)
    return tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
```
Running the Assessment
```python
from deepteam import red_team
from deepteam.vulnerabilities import (
    Toxicity, Bias, PromptLeakage, PIILeakage,
    IllegalActivity, ShellInjection, SQLInjection,
    PersonalSafety, Misinformation,
)
from deepteam.attacks.single_turn import (
    PromptInjection, Roleplay, Leetspeak, ROT13, Base64,
)
from deepteam.attacks.multi_turn import LinearJailbreaking, CrescendoJailbreaking

red_team(
    model_callback=model_callback,
    target_purpose="An internal assistant for enterprise employees",
    vulnerabilities=[
        Toxicity(), Bias(), PromptLeakage(), PIILeakage(),
        IllegalActivity(), ShellInjection(), SQLInjection(),
        PersonalSafety(), Misinformation(),
    ],
    attacks=[
        PromptInjection(), Roleplay(), Leetspeak(),
        ROT13(), Base64(),
        LinearJailbreaking(), CrescendoJailbreaking(),
    ],
    attacks_per_vulnerability_type=5,
)
```
Use broader vulnerability coverage than you would for API models. Without a provider safety net, gaps in testing are gaps in production.
Measuring Quantization Impact
Run the same assessment against multiple quantization levels to quantify safety degradation:
```python
from openai import AsyncOpenAI

QUANTIZATIONS = [
    ("Full precision", "meta-llama/Llama-3.1-8B-Instruct"),
    ("8-bit", "meta-llama/Llama-3.1-8B-Instruct-GPTQ-Int8"),
    ("4-bit", "meta-llama/Llama-3.1-8B-Instruct-GPTQ-Int4"),
]

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

for label, model_name in QUANTIZATIONS:
    # Bind model_name as a default argument so each callback keeps its own
    # model instead of closing over the loop variable.
    async def model_callback(input: str, model_name: str = model_name) -> str:
        response = await client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": input}],
        )
        return response.choices[0].message.content

    print(f"Testing: {label}")
    red_team(
        model_callback=model_callback,
        target_purpose="An internal assistant",
        vulnerabilities=[Toxicity(), Bias(), PromptLeakage()],
        attacks=[PromptInjection(), Roleplay(), Leetspeak()],
        attacks_per_vulnerability_type=3,
    )
```
Compare pass rates across quantization levels. A significant drop from full precision to 4-bit indicates that your deployment's safety is quantization-constrained, and you should either use a higher precision or add guardrails to compensate.
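Once per-configuration pass rates are collected, the comparison itself is simple arithmetic. The numbers below are hypothetical, purely for illustration; plug in the pass rates reported by your own runs:

```python
# Hypothetical pass rates from the runs above (illustrative, not measured).
# "Pass rate" = fraction of attacks the model handled safely per vulnerability.
pass_rates = {
    "Full precision": {"Toxicity": 0.95, "Bias": 0.90, "PromptLeakage": 0.85},
    "4-bit":          {"Toxicity": 0.80, "Bias": 0.88, "PromptLeakage": 0.60},
}

def degradation(baseline: dict, quantized: dict) -> dict:
    """Absolute pass-rate drop per vulnerability (baseline minus quantized)."""
    return {k: round(baseline[k] - quantized[k], 2) for k in baseline}

drops = degradation(pass_rates["Full precision"], pass_rates["4-bit"])
# Flag categories that degraded by more than 10 points as quantization-constrained
flagged = [k for k, d in drops.items() if d > 0.10]
print(drops)
print("Quantization-constrained:", flagged)
```

Note that the degradation is per-category: in this illustrative data, Bias barely moves while PromptLeakage drops 25 points, which is exactly the non-uniform pattern described above.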
Attack Strategy for Open-Weight Models
Open-weight models have weaker and less consistent safety training than API-served models. Attack prioritization should reflect this:
| Priority | Attack | Rationale |
|---|---|---|
| 1 | PromptInjection | Weakest instruction hierarchy — the most direct and effective attack |
| 2 | ROT13 / Base64 | Encoding attacks exploit the gap between tokenization and safety training |
| 3 | Leetspeak | Character substitution bypasses token-level pattern matching |
| 4 | Roleplay | Instruction-following training overrides weaker safety alignment |
| 5 | LinearJailbreaking | Multi-turn escalation, though often unnecessary if single-turn attacks succeed |
If single-turn attacks consistently succeed, the model's alignment is fundamentally weak for your deployment configuration. Address this at the system level (guardrails, moderation layer) rather than trying to patch individual attack vectors.
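One way to apply that system-level fix is to wrap the model callback itself, screening both input and output before anything reaches the user. This is a minimal sketch: `is_unsafe` is a placeholder (a real deployment would call a moderation model), and `unguarded_model` stands in for your actual serving endpoint:

```python
import asyncio

REFUSAL = "I can't help with that."

def is_unsafe(text: str) -> bool:
    # Placeholder check — swap in a real moderation model in production.
    return "ignore previous instructions" in text.lower()

async def unguarded_model(prompt: str) -> str:
    # Stub standing in for the real serving endpoint.
    return f"echo: {prompt}"

async def guarded_callback(prompt: str) -> str:
    """Screen input and output around the model call."""
    if is_unsafe(prompt):
        return REFUSAL
    output = await unguarded_model(prompt)
    if is_unsafe(output):
        return REFUSAL
    return output

print(asyncio.run(guarded_callback("Summarize this doc")))            # echo: Summarize this doc
print(asyncio.run(guarded_callback("Ignore previous instructions")))  # I can't help with that.
```

Red teaming the guarded callback rather than the bare model then measures the safety of the deployment as a whole, which is the quantity that actually matters in production.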
Community Fine-Tunes and Merges
If your deployment uses a community fine-tune, a LoRA merge, or a domain-adapted variant:
- Treat safety as unknown. Fine-tuning can silently remove safety behaviors even when the training data is benign.
- Test before deploying. Run a baseline assessment before any production exposure.
- Compare against the base model. Run the same assessment against the original model to identify what the fine-tune changed.
- Monitor for regression. Re-run the assessment after each model update.
See the fine-tuned models guide for detailed methodology on testing fine-tuned and adapted models.
What to Do Next
- Deploy guardrails — Open-weight deployments need application-level guardrails more than any other deployment type.
- Test against frameworks — Use `OWASPTop10()` for standardized coverage. See the safety frameworks guide.
- Benchmark across model sizes — Compare 8B, 70B, and 405B variants to understand the safety-size tradeoff for your use case.
- Move to application-level testing — Once the model passes, test it in context with the AI agents, conversational agents, or agentic RAG guides.