Guardrails & Responsible AI

Guardrails are the safety net that catches what prompting and model behavior alone miss. Implement them at every stage of the agent lifecycle to catch harmful inputs, behaviors, and outputs.

9.1 Three-Phase Guardrail Model

Apply guardrails at three points in an agent task: before input reaches the model, during execution, and before output is released.

1. Pre-Input Guardrails

Run before user input reaches the LLM:

PII detection and optional redaction
Toxicity / abuse filters
Basic prompt injection pattern detection
Domain scoping (e.g., "answer only about product X")
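
The pre-input checks above can be sketched as a single function that runs before the model is called. This is a minimal illustration, not a production filter: `check_input`, `InputCheck`, `ALLOWED_TOPICS`, and the regex patterns are hypothetical names and values chosen for the example, PII detection covers only email addresses, and the toxicity filter is omitted because it would normally call a dedicated classifier.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only; real deployments need much broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
# Hypothetical scope for a "product X" assistant.
ALLOWED_TOPICS = {"product x", "billing", "shipping"}


@dataclass
class InputCheck:
    allowed: bool
    text: str  # the input after redaction
    reasons: list[str] = field(default_factory=list)


def check_input(text: str) -> InputCheck:
    reasons = []
    # 1. Redact PII (emails only in this sketch) instead of rejecting the input.
    redacted = EMAIL_RE.sub("[EMAIL]", text)
    if redacted != text:
        reasons.append("pii_redacted")
    # 2. Block basic prompt injection phrasing.
    if any(p.search(redacted) for p in INJECTION_PATTERNS):
        reasons.append("prompt_injection")
        return InputCheck(False, redacted, reasons)
    # 3. Domain scoping: require at least one allowed topic keyword.
    lowered = redacted.lower()
    if not any(topic in lowered for topic in ALLOWED_TOPICS):
        reasons.append("out_of_scope")
        return InputCheck(False, redacted, reasons)
    return InputCheck(True, redacted, reasons)
```

Only the redacted `text` of an allowed result should be forwarded to the LLM; the `reasons` list is intended for logging and monitoring.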

2. Mid-Execution Guardrails

Run during reasoning/tool loops:

Max iterations/tool calls
Privilege escalation detection
Resource limits (tokens, time, API calls)
Suspicious-pattern detection (e.g., repeated attempts to bypass policies)
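
As a rough sketch of the limits listed above, a budget object can be updated once per loop iteration and abort the run as soon as any limit is exceeded. `ExecutionBudget`, `BudgetExceeded`, and the default limits are assumptions made for this example; counting policy denials is one simple proxy for repeated bypass attempts, and privilege escalation detection is outside the scope of the sketch.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds one of its configured limits."""


@dataclass
class ExecutionBudget:
    # Hypothetical defaults; tune per use case.
    max_iterations: int = 10
    max_tool_calls: int = 20
    max_tokens: int = 50_000
    max_seconds: float = 60.0
    max_policy_denials: int = 3
    iterations: int = 0
    tool_calls: int = 0
    tokens: int = 0
    policy_denials: int = 0
    started: float = field(default_factory=time.monotonic)

    def record(self, *, tool_calls: int = 0, tokens: int = 0,
               denied: bool = False) -> None:
        """Call once per loop iteration; raises when any limit is exceeded."""
        self.iterations += 1
        self.tool_calls += tool_calls
        self.tokens += tokens
        self.policy_denials += int(denied)
        elapsed = time.monotonic() - self.started
        exceeded = {
            "iterations": self.iterations > self.max_iterations,
            "tool_calls": self.tool_calls > self.max_tool_calls,
            "tokens": self.tokens > self.max_tokens,
            "seconds": elapsed > self.max_seconds,
            "policy_denials": self.policy_denials > self.max_policy_denials,
        }
        for name, over in exceeded.items():
            if over:
                raise BudgetExceeded(name)
```

The agent loop would wrap each step in `budget.record(...)` and treat `BudgetExceeded` as a signal to stop and return a safe fallback response.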

3. Post-Output Guardrails

Run before output is shown to users or sent downstream:

PII and secret leakage detection
Content moderation (toxicity, hate, self-harm, sexual content, child safety)
Grounding/hallucination checks for workflows that rely on RAG:
- If confidence is low, respond conservatively or ask the user for clarification.
Bias detection on sensitive attributes where relevant
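
A minimal sketch of two of these post-output checks follows: secret leakage detection and a grounding check with a conservative fallback. All names (`check_output`, `grounding_score`, `FALLBACK`) and the `0.6` threshold are assumptions for illustration. The grounding score here is a crude word-overlap proxy standing in for a real grounding or entailment model, and the secret patterns cover only two well-known formats.

```python
import re

# Illustrative patterns; a real scanner would cover many more secret formats.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}
FALLBACK = "I'm not confident in that answer. Could you clarify or rephrase?"


def find_secrets(text: str) -> list[str]:
    """Return the names of all secret patterns found in the text."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]


def grounding_score(answer: str, passages: list[str]) -> float:
    """Fraction of answer words (4+ letters) that appear in the retrieved passages."""
    words = set(re.findall(r"[a-z]{4,}", answer.lower()))
    if not words:
        return 0.0
    source = set(re.findall(r"[a-z]{4,}", " ".join(passages).lower()))
    return len(words & source) / len(words)


def check_output(answer: str, passages: list[str], min_score: float = 0.6) -> str:
    """Return the answer if it passes both checks, otherwise a safe replacement."""
    if find_secrets(answer):
        return "[response withheld: possible secret leakage]"
    if grounding_score(answer, passages) < min_score:
        return FALLBACK  # low confidence: respond conservatively
    return answer
```

For example, with a single retrieved passage about a warranty, an answer about shipping times shares no words with the source and is replaced by `FALLBACK`.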

Important: Implement guardrails as separate services or modules, not just as extra instructions in the prompt. Prompt-only rules can be overridden by the very inputs they are meant to stop.
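
One way to keep guardrails outside the prompt is to treat each one as a plain function and compose them around the model call. The sketch below assumes a guardrail is any callable that returns `True` when the text passes; `Verdict`, `run_guardrails`, and `guarded_call` are hypothetical names, and `model` stands in for whatever LLM client the application uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# A guardrail takes text and returns True when the text passes.
Guardrail = Callable[[str], bool]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    guardrail: Optional[str] = None  # name of the guardrail that blocked, if any


def run_guardrails(text: str,
                   guardrails: Iterable[tuple[str, Guardrail]]) -> Verdict:
    """Run each named guardrail in order; stop at the first failure."""
    for name, check in guardrails:
        if not check(text):
            return Verdict(False, name)
    return Verdict(True)


def guarded_call(user_input: str,
                 model: Callable[[str], str],
                 pre: list[tuple[str, Guardrail]],
                 post: list[tuple[str, Guardrail]]) -> str:
    """Wrap a model call with pre-input and post-output guardrails."""
    verdict = run_guardrails(user_input, pre)
    if not verdict.allowed:
        return f"Request blocked by {verdict.guardrail}."
    output = model(user_input)
    verdict = run_guardrails(output, post)
    if not verdict.allowed:
        return f"Response withheld by {verdict.guardrail}."
    return output
```

Because the checks live in ordinary code, they can be tested, versioned, and deployed independently of the prompt, and the same functions could later be moved behind a service boundary.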

9.2 Core RAI Harm Categories

For each application, consider at least:

Toxicity, hate, harassment
Violence, self-harm, extremist content
Sexual content and child safety
Misinformation and hallucinations
Bias and fairness (especially in hiring, lending, or other high-stakes domains)
Privacy violations (unexpected personal data handling)
IP/copyright infringement

For each category, define:

Risk level for the use case
Mitigation strategy:
- Policy constraints (what the agent will and won't do)
- Model choice/fine-tuning
- Guardrail thresholds
- Human review where necessary
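
The per-category definitions above can be captured as data, so that risk levels and thresholds are reviewed and changed without touching guardrail logic. The sketch below is illustrative only: `HarmPolicy`, `POLICIES`, `decide`, and every number in it are hypothetical values for an imagined hiring-assistant use case, and the scores are assumed to come from some external classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class HarmPolicy:
    risk: Risk
    block_threshold: float  # act when the classifier score is >= this value
    human_review: bool      # route to a human instead of blocking outright


# Hypothetical values for a hiring-assistant use case.
POLICIES = {
    "toxicity": HarmPolicy(Risk.MEDIUM, 0.80, False),
    "bias_fairness": HarmPolicy(Risk.HIGH, 0.50, True),
    "privacy": HarmPolicy(Risk.HIGH, 0.60, True),
}


def decide(category: str, score: float) -> str:
    """Map a classifier score for one category to 'allow', 'review', or 'block'."""
    policy = POLICIES[category]
    if score >= policy.block_threshold:
        return "review" if policy.human_review else "block"
    return "allow"
```

Higher-risk categories get lower thresholds here, so they trigger earlier, and categories flagged for human review are escalated rather than silently blocked.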

9.3 Domain-Specific Constraints

Certain domains require explicit, stricter rules:

Financial Services

No unsupervised fund movements.
Strong disclaimers on investment or tax advice.
Conservative language; emphasize risks.
Breakpoints and multi-party approval for any financial transaction.
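
The breakpoint and multi-approval rule can be sketched as a gate that refuses to move funds until enough distinct approvers have signed off. `TransferRequest`, `execute_transfer`, and the default of two approvals are assumptions for this example; the payment call itself is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class TransferRequest:
    amount: float
    destination: str
    required_approvals: int = 2  # hypothetical default
    approvers: set[str] = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # A set ignores repeated approvals from the same person.
        self.approvers.add(approver)

    @property
    def approved(self) -> bool:
        return len(self.approvers) >= self.required_approvals


def execute_transfer(request: TransferRequest) -> str:
    """Breakpoint: refuse to move funds until enough distinct approvers sign off."""
    if not request.approved:
        missing = request.required_approvals - len(request.approvers)
        raise PermissionError(f"{missing} more approval(s) required")
    # Placeholder: a real system would call the payment backend here.
    return f"transferred {request.amount:.2f} to {request.destination}"
```

The agent can create and populate a `TransferRequest`, but only human approvers should be able to call `approve`, which keeps fund movements supervised.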

Healthcare

Agents provide education/triage, not diagnosis or treatment.
Always encourage consultation with a healthcare professional.
Detect and handle crisis/emergency cues with appropriate escalation.

Legal

Do not present output as legal advice.
Do not autonomously file legal documents or make binding commitments.
Include disclaimers and route complex issues to human lawyers.

Security/Cyber

Restrict exploit generation and attack-planning scenarios.
Focus on defensive guidance and best practices.
Monitor for malicious-intent prompts and block or escalate.