Guardrails & Responsible AI

Guardrails are your safety net when all else fails. Implement them at every stage of the agent lifecycle to catch harmful inputs, behaviors, and outputs.

9.1 Three-Phase Guardrail Model

Apply guardrails at three points in every agent task: before input reaches the model, while the agent is executing, and before output leaves the system.

1. Pre-Input Guardrails

Run before user input reaches the LLM:

  • PII detection and optional redaction
  • Toxicity / abuse filters
  • Basic prompt injection pattern detection
  • Domain scoping (e.g., "answer only about product X")
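The pre-input checks above could be sketched as a small standalone function. This is a minimal illustration, not a production detector: the regular expressions, the `check_input` name, the `PreInputResult` type, and the keyword-based scoping are all assumptions made for the example, and toxicity filtering is omitted because it normally needs a trained classifier.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only (assumptions); real systems would use a
# dedicated PII detector and a maintained injection-pattern list.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
INJECTION_RES = [
    re.compile(r"ignore (?:all |any )?(?:previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (?:your |the )?system prompt", re.IGNORECASE),
]


@dataclass
class PreInputResult:
    allowed: bool                  # False means the input never reaches the LLM
    text: str                      # input with detected PII redacted
    reasons: list[str] = field(default_factory=list)


def check_input(text: str, allowed_topics: set[str]) -> PreInputResult:
    """Hypothetical pre-input guardrail: redact, then block or allow."""
    reasons: list[str] = []

    # 1. PII detection with optional redaction (emails only in this sketch).
    redacted = EMAIL_RE.sub("[EMAIL]", text)
    if redacted != text:
        reasons.append("pii_redacted")

    # 2. Basic prompt injection pattern detection.
    if any(pattern.search(text) for pattern in INJECTION_RES):
        return PreInputResult(False, redacted, reasons + ["prompt_injection"])

    # 3. Domain scoping via naive keyword match (lowercase topics expected).
    lowered = text.lower()
    if not any(topic in lowered for topic in allowed_topics):
        return PreInputResult(False, redacted, reasons + ["out_of_scope"])

    return PreInputResult(True, redacted, reasons)
```

A caller would pass only `result.text` to the model when `result.allowed` is true, and log `result.reasons` either way.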

2. Mid-Execution Guardrails

Run during reasoning/tool loops:

  • Max iterations/tool calls
  • Privilege escalation detection
  • Resource limits (tokens, time, API calls)
  • Suspicious-pattern detection (e.g., repeated attempts to bypass policies)
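One way to enforce these limits is a budget object that the agent loop updates on every step. The sketch below assumes a hypothetical `ExecutionBudget` class and `GuardrailViolation` exception; the default limits are placeholders, and treating repeated calls to non-allowlisted tools as a bypass attempt is one possible heuristic, not a prescribed one.

```python
import time
from dataclasses import dataclass, field


class GuardrailViolation(Exception):
    """Raised when the agent loop must be stopped."""


@dataclass
class ExecutionBudget:
    # Placeholder limits (assumptions); tune per use case.
    max_iterations: int = 10
    max_tool_calls: int = 20
    max_tokens: int = 50_000
    max_seconds: float = 60.0
    max_denied_calls: int = 3
    # Running counters.
    iterations: int = 0
    tool_calls: int = 0
    tokens: int = 0
    denied_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record_iteration(self, tokens_used: int) -> None:
        """Call once per reasoning step."""
        self.iterations += 1
        self.tokens += tokens_used
        self._check_limits()

    def record_tool_call(self, tool: str, allowed_tools: set[str]) -> None:
        """Call before each tool invocation."""
        self.tool_calls += 1
        if tool not in allowed_tools:
            # A tool outside the allowlist is a privilege escalation attempt.
            self.denied_calls += 1
            if self.denied_calls >= self.max_denied_calls:
                raise GuardrailViolation("repeated attempts to use unauthorized tools")
            raise PermissionError(f"tool {tool!r} is not allowed for this task")
        self._check_limits()

    def _check_limits(self) -> None:
        if self.iterations > self.max_iterations:
            raise GuardrailViolation("iteration limit exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise GuardrailViolation("tool call limit exceeded")
        if self.tokens > self.max_tokens:
            raise GuardrailViolation("token limit exceeded")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise GuardrailViolation("time limit exceeded")
```

In this design a single denied call raises `PermissionError`, which the loop can report back to the model, while `GuardrailViolation` is meant to end the task.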

3. Post-Output Guardrails

Run before output is shown to users or sent downstream:

  • PII and secret leakage detection
  • Content moderation (toxicity, hate, self-harm, sexual content, child safety)
  • Grounding/hallucination checks for workflows that rely on RAG:
    • If confidence is low, respond conservatively or ask user for clarification.
  • Bias detection on sensitive axes where relevant.
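A post-output check might look like the sketch below. It covers only secret leakage and a crude grounding test; content moderation and bias detection are left out because they typically rely on classifiers. The function names, the fallback message, the 0.6 threshold, and the word-overlap score are assumptions for illustration. Word overlap is a rough proxy for grounding, not a real hallucination detector.

```python
import re

# Illustrative secret patterns (assumptions); extend for your environment.
SECRET_RES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

FALLBACK = (
    "I could not verify that against the available sources. "
    "Could you clarify or rephrase the question?"
)


def find_secrets(text: str) -> list[str]:
    """Return the names of secret patterns found in the text."""
    return [name for name, pattern in SECRET_RES.items() if pattern.search(text)]


def grounding_score(answer: str, sources: list[str]) -> float:
    """Fraction of the answer's longer words that also occur in the sources."""
    words = {w for w in re.findall(r"[a-z0-9]+", answer.lower()) if len(w) > 3}
    if not words:
        return 0.0
    source_words = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    return len(words & source_words) / len(words)


def check_output(answer: str, sources: list[str], min_grounding: float = 0.6) -> str:
    """Hypothetical post-output guardrail returning the text that may be shown."""
    if find_secrets(answer):
        return "[response withheld: possible secret detected]"
    if grounding_score(answer, sources) < min_grounding:
        # Low confidence: respond conservatively and ask for clarification.
        return FALLBACK
    return answer
```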

Important: Implement guardrails as separate services or modules, not merely as additional instructions in the prompt.

9.2 Core RAI Harm Categories

For each application, consider at least:

  • Toxicity, hate, harassment
  • Violence, self-harm, extremist content
  • Sexual content and child safety
  • Misinformation and hallucinations
  • Bias and fairness (especially in hiring, lending, or other high-stakes domains)
  • Privacy violations (unexpected personal data handling)
  • IP/copyright infringement

For each category, define:

  • Risk level for the use case
  • Mitigation strategy:
    • Policy constraints (what the agent will and won't do)
    • Model choice/fine-tuning
    • Guardrail thresholds
    • Human review where necessary
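The per-category definitions can be captured in a small policy table that guardrail code consults at runtime. The sketch below is one possible shape, assuming a hypothetical hiring assistant; the `HarmPolicy` fields, the category keys, and every threshold value are invented for illustration. Scores are assumed to come from some upstream classifier returning a value between 0 and 1.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class HarmPolicy:
    risk: Risk               # risk level for this use case
    block_threshold: float   # classifier score at or above which we intervene
    human_review: bool       # route to a human instead of blocking outright


# Hypothetical values for a hiring assistant; not recommendations.
POLICIES = {
    "toxicity": HarmPolicy(Risk.MEDIUM, block_threshold=0.8, human_review=False),
    "bias": HarmPolicy(Risk.HIGH, block_threshold=0.5, human_review=True),
    "privacy": HarmPolicy(Risk.HIGH, block_threshold=0.6, human_review=True),
}


def decide(category: str, score: float) -> str:
    """Return "allow", "block", or "review" for a classifier score."""
    policy = POLICIES.get(category)
    if policy is None:
        # Categories without a defined policy fail closed to human review.
        return "review"
    if score >= policy.block_threshold:
        return "review" if policy.human_review else "block"
    return "allow"
```

Keeping the table in code or configuration, rather than in a prompt, makes thresholds reviewable and testable alongside the rest of the system.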

9.3 Domain-Specific Constraints

Certain domains require explicit, stricter rules:

Financial Services

  • No unsupervised fund movements.
  • Strong disclaimers on investment or tax advice.
  • Conservative language; emphasize risks.
  • Breakpoints (pauses for human review) and multi-party approval for every financial transaction.
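As a rough sketch of the multi-approval rule, a transfer can be held in a pending state until enough distinct humans have signed off. `PendingTransfer`, `execute_transfer`, and the default of two approvers are assumptions for the example; a real system would also authenticate reviewers and write an audit log.

```python
from dataclasses import dataclass, field


@dataclass
class PendingTransfer:
    amount: float
    destination: str
    required_approvals: int = 2
    approvers: set[str] = field(default_factory=set)

    def approve(self, reviewer_id: str) -> None:
        # A set ensures the same reviewer cannot be counted twice.
        self.approvers.add(reviewer_id)

    @property
    def approved(self) -> bool:
        return len(self.approvers) >= self.required_approvals


def execute_transfer(transfer: PendingTransfer) -> str:
    """Breakpoint: the agent cannot move funds until approvals are complete."""
    if not transfer.approved:
        missing = transfer.required_approvals - len(transfer.approvers)
        raise PermissionError(f"{missing} more approval(s) required")
    # Placeholder for the real payment call.
    return f"transfer of {transfer.amount:.2f} to {transfer.destination} released"
```

The agent may create a `PendingTransfer`, but only human reviewers should be able to call `approve`.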

Healthcare

  • Agents provide education/triage, not diagnosis or treatment.
  • Always encourage consultation with a healthcare professional.
  • Detect and handle crisis/emergency cues with appropriate escalation.

Legal

  • Do not present output as legal advice.
  • Do not autonomously file legal documents or make binding commitments.
  • Include disclaimers and route complex issues to human lawyers.

Security/Cyber

  • Restrict exploit generation and attack planning scenarios.
  • Focus on defensive guidance and best practices.
  • Monitor for malicious-intent prompts and block or escalate.