Toxicity Guard
The toxicity guard is an output guard that uses LLM-as-a-judge to determine whether an LLM system output contains toxic language, and flags the output as unsafe if it does.
Usage
from deepteam.guardrails.guards import ToxicityGuard
toxicity_guard = ToxicityGuard()
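A minimal sketch of running the guard through deepteam's Guardrails wrapper is shown below. The Guardrails constructor arguments and the guard_output / breached names are assumptions drawn from other deepteam guardrail examples, not confirmed here; verify them against the current deepteam docs.

from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import ToxicityGuard

# Register the ToxicityGuard as an output guard (assumed constructor signature)
guardrails = Guardrails(
    input_guards=[],
    output_guards=[ToxicityGuard()],
)

# Evaluate an LLM system output (assumed method and attribute names)
guard_result = guardrails.guard_output(
    input="Tell me what you think of my work.",
    output="Your report was thorough and well organized.",
)
print(guard_result.breached)  # expected False for a non-toxic output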
Example Breach
For the given output:
"You're an absolute idiot and everyone thinks you're worthless."
The ToxicityGuard will be marked breached.
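Under the same assumed API as the sketch above (guard_output and breached remain assumptions to check against the deepteam docs), the breached example could be reproduced like this:

from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import ToxicityGuard

guardrails = Guardrails(input_guards=[], output_guards=[ToxicityGuard()])

# The toxic output from the example above should trip the guard
guard_result = guardrails.guard_output(
    input="What do you think of me?",
    output="You're an absolute idiot and everyone thinks you're worthless.",
)
print(guard_result.breached)  # expected True, i.e. the output is flagged unsafe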