Aegis

The Aegis framework integrates the NVIDIA Aegis AI Content Safety Dataset — an open-source dataset aligned with NVIDIA’s Content Safety Taxonomy across 13 critical harm categories. Aegis enables DeepTeam to perform dataset-driven red teaming using real human-labeled safety violations to validate model robustness against harmful or unsafe user inputs.

Overview

Aegis focuses on real-world unsafe content drawn from public conversations. This makes it possible to evaluate a model's refusal behavior and to tune safety filters against authentic, human-labeled harm categories.

from deepteam.frameworks import Aegis
from deepteam import red_team
from somewhere import your_model_callback  # replace with your own model callback

# Sample 10 harmful test cases from the Aegis dataset
aegis = Aegis(num_attacks=10)

# Red team your model against the sampled attacks
risk_assessment = red_team(
    model_callback=your_model_callback,
    framework=aegis,
)

print(risk_assessment)

The Aegis framework accepts FOUR optional parameters:

  • [Optional] num_attacks: Number of harmful test cases to sample from the dataset. Defaulted to 15.
  • [Optional] evaluation_model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
  • [Optional] async_mode: a boolean which when set to True, enables concurrent execution. Defaulted to True.
  • [Optional] verbose_mode: a boolean which when set to True, prints intermediate steps used to evaluate model responses. Defaulted to False.
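
For example, here is the same framework configured with all four optional parameters set explicitly (the values shown are purely illustrative):

from deepteam.frameworks import Aegis

aegis = Aegis(
    num_attacks=10,          # sample 10 test cases instead of the default 15
    evaluation_model="gpt-4o",
    async_mode=True,
    verbose_mode=True,       # print intermediate evaluation steps
)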

How It Works

  1. Loads the Aegis dataset from Hugging Face.
  2. Filters unsafe samples labeled under NVIDIA’s harm taxonomy.
  3. Randomly samples num_attacks test cases.
  4. Evaluates how well the model identifies, refuses, or mitigates these harmful prompts.
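
For illustration, here is a minimal sketch of what steps 1–3 might look like using the Hugging Face datasets library. The dataset id and the prompt_label column name are assumptions made for this sketch; DeepTeam performs the actual loading and filtering internally:

import random
from datasets import load_dataset

# Hypothetical dataset id (assumption); DeepTeam resolves the actual dataset itself
dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

# Keep only samples labeled unsafe under NVIDIA's harm taxonomy
# (the "prompt_label" column name is an assumption)
unsafe = [row for row in dataset if row.get("prompt_label") == "unsafe"]

# Randomly sample num_attacks test cases (using the default of 15 here)
num_attacks = 15
attacks = random.sample(unsafe, k=min(num_attacks, len(unsafe)))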

Categories & Filtering

Aegis covers 13 harm categories, all of which are supported by DeepTeam's filtering.

Example categories:

  • sexual_content
  • violence
  • hate_speech
  • self_harm
  • misinformation
  • privacy_violation
  • child_exploitation
  • drugs
  • terrorism
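
To see how these categories are distributed in the raw data, here is a hedged, self-contained sketch; the dataset id and the violated_categories column name are assumptions for illustration and are not part of DeepTeam's API:

from collections import Counter
from datasets import load_dataset

dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

# Count occurrences of each labeled harm category across the samples
category_counts = Counter(
    category.strip()
    for row in dataset
    for category in str(row.get("violated_categories") or "").split(",")
    if category.strip()
)
print(category_counts.most_common(13))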

References

NVIDIA Aegis AI Content Safety Dataset (Hugging Face)