Red Teaming Fine-Tuned Models
Fine-tuning a model changes its safety properties. This is true whether you fine-tuned intentionally for safety, fine-tuned on domain data without thinking about safety at all, or used a community adapter that someone else trained. The base model's alignment is a product of its original safety training — and fine-tuning overwrites, dilutes, or disrupts that training in ways that are not always predictable.
A model that passes red teaming before fine-tuning may fail afterward. Conversely, safety-focused fine-tuning may introduce new refusal patterns that create a different set of edge cases. Either way, the fine-tuned model's safety properties are not the same as the base model's, and must be tested independently.
This guide covers red teaming methodology for fine-tuned models with DeepTeam.
For general model red teaming methodology, see the foundational models guide. For deployment-level concerns with open-weight models, see the open-weight models guide.
How Fine-Tuning Affects Safety
Safety Degradation from Domain Fine-Tuning
The most common scenario: you fine-tune a model on domain-specific data (medical records, legal documents, customer transcripts, code) to improve task performance. The training data is benign. But the fine-tuning process still degrades safety:
- Catastrophic forgetting — Safety training occupies specific weight patterns. Fine-tuning on new data shifts these weights, and safety behaviors are disproportionately affected because they are often in tension with the model's primary objective (being helpful and following instructions).
- Distribution shift — The model learns to prioritize the fine-tuning domain's language patterns. If the fine-tuning data includes content that the base model would have refused (e.g., medical descriptions of injuries, legal descriptions of crimes), the model learns that this content is acceptable to produce.
- Refusal boundary erosion — Each fine-tuning step that rewards the model for producing domain content without refusal weakens the model's general ability to refuse. The model learns that "not refusing" is correct behavior.
Safety Changes from LoRA and Adapters
Parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning) modifies fewer weights than full fine-tuning, but safety degradation still occurs:
- LoRA adapters can override safety-critical weight patterns even with low rank.
- Merged adapters (where the LoRA weights are folded into the base model) produce unpredictable safety interactions.
- Stacking multiple adapters compounds safety degradation.
Intentional Safety Fine-Tuning
Fine-tuning specifically for safety (adding refusal examples, RLHF, DPO with safety-focused preference data) can improve safety — but introduces its own edge cases:
- Over-refusal on benign content the model incorrectly classifies as harmful.
- Inconsistent refusal boundaries where the model refuses some phrasings but not semantic equivalents.
- Brittleness — safety fine-tuning may work for the training distribution but fail on adversarial inputs that differ from the safety training examples.
Methodology
Fine-tuned model red teaming requires a comparative approach:
- Establish a baseline — Run the same assessment against the base model to understand its safety properties.
- Test the fine-tuned model — Run the identical assessment against the fine-tuned model.
- Compare results — Identify which vulnerabilities degraded, which improved, and which are unchanged.
- Investigate regressions — For each vulnerability that got worse, determine whether the degradation is from training data influence, catastrophic forgetting, or distribution shift.
- Fix and retest — Apply guardrails, adjust fine-tuning data, or add safety examples, then retest.
Running a Comparative Assessment
Base Model Callback
from openai import OpenAI
base_client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
async def base_callback(input: str) -> str:
response = base_client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": input}],
)
return response.choices[0].message.content
Fine-Tuned Model Callback
ft_client = OpenAI(base_url="http://localhost:8001/v1", api_key="dummy")
async def finetuned_callback(input: str) -> str:
response = ft_client.chat.completions.create(
model="my-org/Llama-3.1-8B-medical-v2",
messages=[{"role": "user", "content": input}],
)
return response.choices[0].message.content
Run Both Assessments
from deepteam import red_team
from deepteam.vulnerabilities import (
Toxicity, Bias, PromptLeakage, PIILeakage,
IllegalActivity, Misinformation, PersonalSafety
)
from deepteam.attacks.single_turn import PromptInjection, Roleplay, Leetspeak
from deepteam.attacks.multi_turn import LinearJailbreaking
VULNERABILITIES = [
Toxicity(), Bias(), PromptLeakage(), PIILeakage(),
IllegalActivity(), Misinformation(), PersonalSafety(),
]
ATTACKS = [
PromptInjection(), Roleplay(), Leetspeak(), LinearJailbreaking(),
]
print("=== Base Model ===")
red_team(
model_callback=base_callback,
target_purpose="A medical information assistant",
vulnerabilities=VULNERABILITIES,
attacks=ATTACKS,
attacks_per_vulnerability_type=5,
)
print("=== Fine-Tuned Model ===")
red_team(
model_callback=finetuned_callback,
target_purpose="A medical information assistant",
vulnerabilities=VULNERABILITIES,
attacks=ATTACKS,
attacks_per_vulnerability_type=5,
)
Compare pass rates per vulnerability. A drop in any category indicates fine-tuning-induced safety regression.
Fine-Tuning-Specific Vulnerabilities
Beyond standard vulnerabilities, fine-tuned models have additional failure modes:
Training Data Leakage
Fine-tuned models can memorize and reproduce training data. If the fine-tuning dataset contains sensitive information (PII, proprietary data, internal documents), the model may leak it:
from deepteam.vulnerabilities import PIILeakage, PromptLeakage
red_team(
model_callback=finetuned_callback,
target_purpose="A medical information assistant trained on patient records",
vulnerabilities=[PIILeakage(), PromptLeakage()],
attacks=[PromptInjection(), Roleplay()],
attacks_per_vulnerability_type=10,
)
Use higher attacks_per_vulnerability_type for training data leakage — memorization is inconsistent and may only surface with specific prompt patterns.
Domain-Induced Permissiveness
A model fine-tuned on medical data may become more willing to discuss symptoms, medications, and procedures in ways that cross into providing unauthorized medical advice. A model fine-tuned on legal data may produce content that constitutes legal counsel. Test whether the model maintains appropriate boundaries for its deployment context:
from deepteam.vulnerabilities import PersonalSafety, Misinformation, IllegalActivity
red_team(
model_callback=finetuned_callback,
target_purpose="A medical information assistant (not a licensed provider)",
vulnerabilities=[PersonalSafety(), Misinformation(), IllegalActivity()],
attacks=[PromptInjection(), Roleplay()],
attacks_per_vulnerability_type=5,
)
Refusal Regression
Test whether the fine-tuned model refuses content that the base model would refuse:
from deepteam.vulnerabilities import Toxicity, Bias, IllegalActivity
REGRESSION_VULNS = [Toxicity(), Bias(), IllegalActivity()]
REGRESSION_ATTACKS = [PromptInjection(), Roleplay(), Leetspeak()]
print("=== Base Model Refusal ===")
red_team(
model_callback=base_callback,
target_purpose="A general assistant",
vulnerabilities=REGRESSION_VULNS,
attacks=REGRESSION_ATTACKS,
attacks_per_vulnerability_type=5,
)
print("=== Fine-Tuned Model Refusal ===")
red_team(
model_callback=finetuned_callback,
target_purpose="A general assistant",
vulnerabilities=REGRESSION_VULNS,
attacks=REGRESSION_ATTACKS,
attacks_per_vulnerability_type=5,
)
Testing LoRA and Adapter Configurations
Single Adapter
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
async def model_callback(input: str) -> str:
response = client.chat.completions.create(
model="base-model+my-lora-adapter",
messages=[{"role": "user", "content": input}],
)
return response.choices[0].message.content
Merged Adapter
If you've merged a LoRA adapter into the base weights, test the merged model directly:
async def model_callback(input: str) -> str:
response = client.chat.completions.create(
model="my-org/llama-3.1-8b-merged-medical",
messages=[{"role": "user", "content": input}],
)
return response.choices[0].message.content
Compare results between the adapter-applied and merged versions — merging can produce different safety properties than runtime adapter application.
Continuous Testing
Fine-tuning is iterative. Each training run, data update, or adapter change can shift safety properties. Integrate red teaming into your fine-tuning pipeline:
name: Fine-Tune Safety Check
on:
workflow_dispatch:
push:
paths:
- "models/**"
- "training/**"
jobs:
safety-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install deepteam
- run: deepteam run safety-check.yaml -o results
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- uses: actions/upload-artifact@v4
with:
name: safety-results
path: results/
What to Do Next
- Always compare against the base model — Fine-tuning safety regressions are only visible in comparison.
- Test after every training run — Safety properties change with each fine-tuning iteration.
- Deploy guardrails — Fine-tuned models have less predictable safety boundaries. Application-level guardrails compensate for fine-tuning-induced regressions.
- Test against frameworks — Use
OWASPTop10()for standardized coverage. See the safety frameworks guide. - Increase attack density for PII — Fine-tuned models may memorize training data. Use higher
attacks_per_vulnerability_typeforPIILeakageandPromptLeakage.