Introduction to LLM Guardrails
deepteam
enables anyone to safeguard LLM applications against vulnerable inputs and outputs uncovered during red teaming.
While deepeval
's metrics focus on accuracy and precision, deepteam
's guardrails focus on speed and reliability.
deepteam
's comprehensive suite of guardrails acts as binary metrics to evaluate end-to-end LLM system inputs and outputs for malicious intent, unsafe behavior, and security vulnerabilities.
Types of Guardrails
There are two types of guardrails:
- Input guardrails - Protect against malicious inputs before they reach your LLM
- Output guardrails - Evaluate an LLM's outputs before they reach your users.
An input/output guardrail is composed of one or more guards, and an input/output should only be let in/out if all guards have passed their checks. Together, they ensure that nothing unsafe ever enters or leaves your LLM system.
The number of guards you choose to set up, and whether you decide to utilize both types of guards, depend on your priorities regarding latency, cost, and LLM safety.
While most guards are for either input or output guarding only, some, such as the cybersecurity guard, offer both input and output guarding capabilities.
Guard Your First LLM Interaction
First, import and initialize the guards you desire from deepteam.guardrails.guards
and pass them to an instantiated Guardrails
object:
from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import PromptInjectionGuard, ToxicityGuard
# Initialize guardrails
guardrails = Guardrails(
    input_guards=[PromptInjectionGuard()],
    output_guards=[ToxicityGuard()]
)
There are TWO mandatory and TWO optional parameters when creating a Guardrails
:
- input_guards: a list of input guards for protecting against malicious inputs
- output_guards: a list of output guards for evaluating LLM outputs
- [Optional] evaluation_model: a string specifying which OpenAI model to use for guard evaluation. Defaulted to gpt-4.1.
- [Optional] sample_rate: a float between 0.0 and 1.0 that determines what fraction of requests are guarded. Defaulted to 1.0.
# Advanced configuration
guardrails = Guardrails(
    input_guards=[PromptInjectionGuard()],
    output_guards=[ToxicityGuard()],
    evaluation_model="gpt-4o",
    sample_rate=0.5  # Guard 50% of requests
)
Your list of input_guards
and output_guards
must not be empty if you wish to guard the respective inputs and outputs.
Guard an Input
Then, simply call the guard_input()
method and supply the input
.
...
res = guardrails.guard_input(input="Is the earth flat")
# Reject if true
print(res.breached)
The guard_input
method will invoke every single guard within your Guardrails
to check whether a particular input has breached any of the defined criteria.
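For example, you can use the breached flag to reject a request before it ever reaches your model. The sketch below assumes a hypothetical generate_response() function standing in for however your application calls its LLM:
...

def answer(user_input: str) -> str:
    # Screen the input before spending any tokens on generation
    res = guardrails.guard_input(input=user_input)
    if res.breached:
        return "Sorry, I can't help with that request."
    # generate_response() is a hypothetical stand-in for your LLM call
    return generate_response(user_input)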
Guard an Output
Similarly, call the guard_output
method wherever appropriate in your code to guard against an LLM output:
...
res = guardrails.guard_output(input="Is the earth flat", output="I bet it is")
print(res.breached)
It is extremely common to re-generate LLM outputs a number of times until they pass.
When using the guard_output()
method, you must provide both the input
and the LLM output
.
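One rough sketch of that re-generation pattern is to loop until the output passes all output guards, capped at a maximum number of attempts; generate_response() and max_retries below are illustrative assumptions, not part of deepteam's API:
...

def safe_answer(user_input: str, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        output = generate_response(user_input)  # hypothetical LLM call
        res = guardrails.guard_output(input=user_input, output=output)
        if not res.breached:
            return output
    # Every attempt breached an output guard, so fall back to a refusal
    return "Sorry, I can't provide a safe answer to that."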
Run Async Guardrails
deepteam
's Guardrails
also support asynchronous execution through the a_guard_input()
and a_guard_output()
methods.
...
res = await guardrails.a_guard_output(input, output)
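For instance, an async request handler might await both checks around the model call. This is a minimal sketch assuming a hypothetical async generate_response() coroutine:
import asyncio

...

async def guarded_answer(user_input: str) -> str:
    # Guard the input asynchronously before generating
    input_res = await guardrails.a_guard_input(input=user_input)
    if input_res.breached:
        return "Sorry, I can't help with that request."
    output = await generate_response(user_input)  # hypothetical async LLM call
    output_res = await guardrails.a_guard_output(input=user_input, output=output)
    return output if not output_res.breached else "Sorry, I can't share that."

# asyncio.run(guarded_answer("Is the earth flat?"))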
Available Guards
deepteam
offers a robust selection of input and output guards for protection against vulnerabilities you've detected during red teaming:
Input
Input guards protect your LLM system by screening user inputs before they reach your model, preventing malicious prompts and unwanted content from being processed:
- Prompt Injection Guard - Detects and blocks prompt injection and jailbreaking attempts that try to manipulate the AI's behavior or bypass safety measures
- Topical Guard - Restricts conversations to specific allowed topics, preventing off-topic or unauthorized discussions
- Cybersecurity Guard - (When configured for input) Protects against cybersecurity threats and malicious technical instructions
Output
Output guards evaluate your LLM's responses before they reach users, ensuring safe and appropriate content delivery:
- Toxicity Guard - Prevents toxic, harmful, abusive, or discriminatory content from being shared with users
- Privacy Guard - Detects and blocks personally identifiable information (PII) and sensitive data from being exposed
- Illegal Guard - Prevents the sharing of illegal activity instructions or content that violates laws
- Hallucination Guard - Detects and prevents fabricated, inaccurate, or hallucinated information in responses
- Cybersecurity Guard - (When configured for output) Ensures responses don't contain dangerous cybersecurity information or instructions
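Combining guards is just a matter of passing more of them to Guardrails. In the sketch below, only PromptInjectionGuard and ToxicityGuard were shown earlier on this page; the other class names are assumptions inferred from the guard names above, so check deepteam.guardrails.guards for the exact spelling:
from deepteam.guardrails import Guardrails
from deepteam.guardrails.guards import (
    PromptInjectionGuard,
    ToxicityGuard,
    PrivacyGuard,  # assumed class name for the Privacy Guard
    IllegalGuard,  # assumed class name for the Illegal Guard
)

guardrails = Guardrails(
    input_guards=[PromptInjectionGuard()],
    output_guards=[ToxicityGuard(), PrivacyGuard(), IllegalGuard()],
)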
3-Tier Safety System
All of deepteam
's guards use a 3-tier safety assessment system:
- safe: The content is clearly safe and poses no risk
- uncertain: The content is borderline or ambiguous, requiring human review
- unsafe: The content clearly violates safety guidelines
...
for verdict in res.verdicts:
    # Access detailed safety assessment
    print(verdict.safety_level)  # "safe", "uncertain", or "unsafe"
    print(verdict.reason)  # Explanation for the assessment
The reason deepteam does this is that it allows you to control the strictness of each and every guard, and to decide what constitutes a "breached" or "mitigated" event.
A Guardrails
is considered breached if any guard returns unsafe
or uncertain
.
Guardrails
used against malicious inputs are the best way to avoid wasting tokens on text you shouldn't even be generating in the first place.
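If the default breach rule is stricter than you need, you can derive your own decision from the individual verdicts instead, for example blocking only on unsafe and escalating uncertain results. The helpers block_response() and route_to_human_review() below are hypothetical placeholders:
...

res = guardrails.guard_output(input=user_input, output=output)

unsafe = [v for v in res.verdicts if v.safety_level == "unsafe"]
uncertain = [v for v in res.verdicts if v.safety_level == "uncertain"]

if unsafe:
    block_response()  # hypothetical: refuse outright
elif uncertain:
    route_to_human_review(res)  # hypothetical: escalate borderline cases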
Data Models
There are two data models you should be aware of to access guardrail results after each run.
GuardResult
The GuardResult
data model is what you'll receive after guard_input()
or guard_output()
. It tells you whether the I/O has breached any of the criteria laid out in your guards, along with an individual breakdown of every guard:
class GuardResult:
    breached: bool
    verdicts: list[GuardVerdict]
GuardVerdict
You can access each individual guard's results through the GuardVerdict
data model in the verdicts
field of a GuardResult
.
class GuardVerdict(BaseModel):
    name: str
    safety_level: Literal["safe", "unsafe", "uncertain"]
    latency: Optional[float] = None
    reason: Optional[str] = None
    score: Optional[float] = None
    error: Optional[str] = None
You should decide what retry logic to run, if any, depending on the results you're getting after each guardrails invocation.
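As one possible pattern, you could inspect each GuardVerdict to decide whether re-generating is even worthwhile, for example skipping retries when a guard errored rather than flagged the content. This is only a sketch, not prescribed deepteam behavior:
...

res = guardrails.guard_output(input=user_input, output=output)

should_retry = False
for verdict in res.verdicts:
    if verdict.error is not None:
        # The guard itself failed to run, so log it rather than re-generating
        print(f"{verdict.name} errored: {verdict.error}")
    elif verdict.safety_level in ("unsafe", "uncertain"):
        print(f"{verdict.name} flagged the output: {verdict.reason}")
        should_retry = True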