Toxicity Guard
The toxicity guard is an output guard that uses LLM-as-a-judge to determine whether an LLM system output contains toxic language, and flags the output as unsafe if it does.
Usage
from deepteam.guardrails.guards import ToxicityGuard
toxicity_guard = ToxicityGuard()
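A minimal sketch of running the guard through deepteam's Guardrails wrapper is shown below. The Guardrails constructor arguments and the guard_output / breached names are assumptions drawn from other deepteam guardrail examples, not confirmed here; verify them against the current deepteam docs.

from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import ToxicityGuard

# Register the ToxicityGuard as an output guard (assumed constructor signature)
guardrails = Guardrails(
    input_guards=[],
    output_guards=[ToxicityGuard()],
)

# Evaluate an LLM system output (assumed method and attribute names)
guard_result = guardrails.guard_output(
    input="Tell me what you think of my work.",
    output="Your report was thorough and well organized.",
)
print(guard_result.breached)  # expected False for a non-toxic output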
Example Breach
For the given output:
"You're an absolute idiot and everyone thinks you're worthless."
The ToxicityGuard will be marked breached.
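Under the same assumed API as the sketch above (guard_output and breached remain assumptions to check against the deepteam docs), the breached example could be reproduced like this:

from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import ToxicityGuard

guardrails = Guardrails(input_guards=[], output_guards=[ToxicityGuard()])

# The toxic output from the example above should trip the guard
guard_result = guardrails.guard_output(
    input="What do you think of me?",
    output="You're an absolute idiot and everyone thinks you're worthless.",
)
print(guard_result.breached)  # expected True, i.e. the output is flagged unsafe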