Guardrails & Responsible AI
Guardrails are your safety net when all else fails. Implement them at every stage of the agent lifecycle to catch harmful inputs, behaviors, and outputs.
9.1 Three-Phase Guardrail Model
Apply guardrails at three points in an agent task: before input reaches the model, during execution, and before output leaves the system.
1. Pre-Input Guardrails
Run before user input reaches the LLM:
- PII detection and optional redaction
- Toxicity / abuse filters
- Basic prompt injection pattern detection
- Domain scoping (e.g., "answer only about product X")
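The pre-input checks above can be sketched as a single function that runs before any text reaches the model. This is a minimal illustration, not a production filter: the regular expressions, the `check_input` name, and the keyword-based topic check are all assumptions made for the example. A toxicity filter would normally call a dedicated classifier and is omitted here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set

# Illustrative patterns only (assumptions); real coverage must be far broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous|prior) .*instructions", re.IGNORECASE),
    re.compile(r"reveal .*system prompt", re.IGNORECASE),
]


@dataclass
class PreInputResult:
    allowed: bool
    text: str  # the input, possibly redacted
    reasons: List[str] = field(default_factory=list)


def check_input(text: str, allowed_topics: Set[str]) -> PreInputResult:
    """Hypothetical pre-input guardrail: redact, then block or allow."""
    reasons: List[str] = []

    # 1. PII: redact email addresses instead of rejecting the request.
    redacted = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    if redacted != text:
        reasons.append("pii_redacted")

    # 2. Basic prompt-injection patterns: block outright.
    if any(p.search(redacted) for p in INJECTION_PATTERNS):
        return PreInputResult(False, redacted, reasons + ["prompt_injection"])

    # 3. Domain scoping: require at least one allowed topic keyword.
    lowered = redacted.lower()
    if not any(topic in lowered for topic in allowed_topics):
        return PreInputResult(False, redacted, reasons + ["out_of_scope"])

    return PreInputResult(True, redacted, reasons)
```

Calling `check_input("My email is a@b.com, how do I reset my widget?", {"widget"})` allows the request but passes the redacted text downstream, so the model never sees the address.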
2. Mid-Execution Guardrails
Run during reasoning/tool loops:
- Max iterations/tool calls
- Privilege escalation detection
- Resource limits (tokens, time, API calls)
- Detect suspicious patterns (e.g., repeated attempts to bypass policies)
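One way to enforce the mid-execution limits is a budget object that the agent loop must consult on every tool call. The sketch below is hypothetical: `ExecutionBudget`, `GuardrailViolation`, and the default limits are illustrative names and values, and "privilege escalation" is approximated as a request for a tool outside the granted set.

```python
import time
from dataclasses import dataclass, field
from typing import FrozenSet


class GuardrailViolation(Exception):
    """Raised when the agent loop must stop."""


@dataclass
class ExecutionBudget:
    allowed_tools: FrozenSet[str]
    max_tool_calls: int = 10      # illustrative defaults, tune per use case
    max_tokens: int = 20_000
    max_seconds: float = 60.0
    max_denials: int = 3
    tool_calls: int = 0
    tokens: int = 0
    denials: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record_tool_call(self, tool: str, tokens_used: int) -> None:
        # Privilege check: the tool must be in the granted set.
        if tool not in self.allowed_tools:
            self.denials += 1
            # Repeated denied attempts are treated as a suspicious pattern.
            if self.denials >= self.max_denials:
                raise GuardrailViolation("repeated attempts to use ungranted tools")
            raise PermissionError(f"tool not granted: {tool}")

        # Resource limits: calls, tokens, wall-clock time.
        self.tool_calls += 1
        self.tokens += tokens_used
        if self.tool_calls > self.max_tool_calls:
            raise GuardrailViolation("tool-call limit exceeded")
        if self.tokens > self.max_tokens:
            raise GuardrailViolation("token budget exceeded")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise GuardrailViolation("time limit exceeded")
```

In this design the loop can catch `PermissionError`, report the denial to the model, and continue, while `GuardrailViolation` ends the task and can trigger escalation.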
3. Post-Output Guardrails
Run before output is shown to users or sent downstream:
- PII and secret leakage detection
- Content moderation (toxicity, hate, self-harm, sexual content, child safety)
- Grounding/hallucination checks for workflows that rely on RAG; if confidence is low, respond conservatively or ask the user for clarification.
- Bias detection on sensitive axes where relevant.
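Two of the post-output checks can be illustrated with a short sketch: secret detection and a conservative fallback when an answer is poorly grounded. Everything here is an assumption made for illustration. The word-overlap `grounding_score` is a deliberately crude stand-in for a real grounding or entailment check, and the 0.6 threshold is arbitrary. Content moderation and bias detection would normally call dedicated classifiers and are not shown.

```python
import re
from typing import List

# Illustrative secret patterns (assumptions, not an exhaustive list).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

FALLBACK = ("I could not verify that against the available sources. "
            "Could you clarify or rephrase the question?")
WITHHELD = "[Response withheld: possible secret detected]"


def find_secrets(text: str) -> List[str]:
    """Return the names of every secret pattern found in the text."""
    return sorted(name for name, p in SECRET_PATTERNS.items() if p.search(text))


def grounding_score(answer: str, sources: List[str]) -> float:
    """Crude proxy: share of longer answer words that occur in the sources."""
    words = {w for w in re.findall(r"[a-z0-9]+", answer.lower()) if len(w) > 3}
    if not words:
        return 1.0
    source_words = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    return len(words & source_words) / len(words)


def check_output(answer: str, sources: List[str], min_grounding: float = 0.6) -> str:
    if find_secrets(answer):
        return WITHHELD
    # Low confidence: respond conservatively instead of returning the answer.
    if grounding_score(answer, sources) < min_grounding:
        return FALLBACK
    return answer
```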
Important: Guardrails should be implemented as separate services or modules, not just extra words in prompts.
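To make the separation concrete, the sketch below treats each guardrail as an ordinary function with a fixed signature and wraps the model call in code that runs them. The names (`Guardrail`, `run_guardrails`, `guarded_call`) and the two toy checks are hypothetical. The point is only that the checks run outside the prompt and cannot be talked out of running.

```python
from typing import Callable, List, Optional

# A guardrail takes text and returns None (pass) or a reason string (flag).
Guardrail = Callable[[str], Optional[str]]


def run_guardrails(text: str, guardrails: List[Guardrail]) -> List[str]:
    """Return the reasons from every guardrail that flagged the text."""
    return [reason for g in guardrails if (reason := g(text)) is not None]


# Two toy guardrails, for illustration only.
def too_long(text: str) -> Optional[str]:
    return "too_long" if len(text) > 2000 else None


def mentions_password(text: str) -> Optional[str]:
    return "possible_secret" if "password" in text.lower() else None


def guarded_call(user_input: str, model: Callable[[str], str]) -> str:
    """Run input guardrails, call the model, then run output guardrails."""
    if run_guardrails(user_input, [too_long]):
        return "Sorry, I can't process that request."
    output = model(user_input)
    if run_guardrails(output, [mentions_password]):
        return "Sorry, I can't share that."
    return output
```

Because `model` is just a callable, the same wrapper can be unit-tested with a stub and reused across agents without touching any prompt text.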
9.2 Core RAI Harm Categories
For each application, consider at least:
- Toxicity, hate, harassment
- Violence, self-harm, extremist content
- Sexual content and child safety
- Misinformation and hallucinations
- Bias and fairness (especially in hiring, lending, or other high-stakes domains)
- Privacy violations (unexpected personal data handling)
- IP/copyright infringement
For each category, define:
- Risk level for the use case
- Mitigation strategy:
  - Policy constraints (what the agent will and won't do)
  - Model choice/fine-tuning
  - Guardrail thresholds
  - Human review where necessary
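The per-category definitions above can be captured as data so that thresholds and review rules are explicit and auditable. The following sketch assumes a hypothetical hiring assistant. The category keys, thresholds, and the "review at half the block threshold" rule are invented for illustration and are not recommended values.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class HarmPolicy:
    risk: Risk
    policy_constraint: str   # what the agent will and won't do
    block_threshold: float   # classifier score at or above which output is blocked
    human_review: bool       # whether borderline cases go to a person


# Hypothetical register for a hiring assistant; all values are assumptions.
POLICIES = {
    "bias_fairness": HarmPolicy(
        Risk.HIGH, "Never rank candidates by protected attributes", 0.4, True),
    "toxicity": HarmPolicy(
        Risk.MEDIUM, "No abusive language in candidate feedback", 0.8, False),
}


def decide(category: str, score: float) -> str:
    """Map a classifier score to 'block', 'review', or 'allow'."""
    policy = POLICIES[category]
    if score >= policy.block_threshold:
        return "block"
    # Borderline scores go to human review where the policy requires it.
    if policy.human_review and score >= policy.block_threshold / 2:
        return "review"
    return "allow"
```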
9.3 Domain-Specific Constraints
Certain domains require explicit, stricter rules:
Financial Services
- No unsupervised fund movements.
- Strong disclaimers on investment or tax advice.
- Conservative language; emphasize risks.
- Approval breakpoints (pauses for human sign-off) and multi-party approval for every financial transaction.
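The rule against unsupervised fund movements can be enforced in code rather than in the prompt. The sketch below is a hypothetical illustration: `TransferRequest` and `execute_transfer` are invented names, the two-approver default is an assumption, and a real system would verify approver identity and persist an audit trail.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TransferRequest:
    amount: float
    destination: str
    requested_by: str                 # the agent or user that proposed it
    required_approvals: int = 2       # assumed default for illustration
    approvers: Set[str] = field(default_factory=set)

    def approve(self, approver_id: str) -> None:
        self.approvers.add(approver_id)


def execute_transfer(req: TransferRequest) -> str:
    """Refuse to move funds until enough independent approvals exist."""
    # The requester can never count as one of its own approvers.
    independent = req.approvers - {req.requested_by}
    if len(independent) < req.required_approvals:
        raise PermissionError("transfer paused: awaiting human approval")
    return f"transferred {req.amount:.2f} to {req.destination}"
```

The agent can create a `TransferRequest`, but the only code path that moves money raises until two other parties have approved, which is what makes the breakpoint hard to bypass.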
Healthcare
- Agents provide education/triage, not diagnosis or treatment.
- Always encourage consultation with a healthcare professional.
- Detect and handle crisis/emergency cues with appropriate escalation.
Legal
- Do not present output as legal advice.
- Do not autonomously file legal documents or make binding commitments.
- Include disclaimers and route complex issues to human lawyers.
Security/Cyber
- Restrict exploit generation and attack planning scenarios.
- Focus on defensive guidance and best practices.
- Monitor for malicious-intent prompts and block or escalate.