Guardrails, Safety, and Responsible AI
Deploying AI in production carries responsibility. Guardrails prevent your agent from causing harm, embarrassing your brand, or violating user trust.
Input Validation
Validate and sanitize user input before it reaches Claude:
import re
import anthropic
client = anthropic.Anthropic()
MAX_INPUT_LENGTH = 10_000 # characters
BLOCKED_PATTERNS = [
r"ignore (?:all )?(?:previous|prior|above) instructions",
r"system prompt",
r"jailbreak",
r"DAN mode",
]
class InputValidationError(Exception):
pass
def validate_input(user_input: str) -> str:
# Length check
if len(user_input) > MAX_INPUT_LENGTH:
raise InputValidationError(f"Input too long: {len(user_input)} chars (max {MAX_INPUT_LENGTH})")
# Prompt injection detection
lower_input = user_input.lower()
for pattern in BLOCKED_PATTERNS:
if re.search(pattern, lower_input):
raise InputValidationError("Input contains disallowed content.")
return user_input.strip()
Output Moderation
Screen Claude's responses before returning them to users:
def moderate_output(response_text: str) -> dict:
"""Use Claude to check its own output for policy violations."""
moderation = client.messages.create(
model="claude-haiku-4-5",
max_tokens=64,
system="""You are a content moderator. Review the text and respond with JSON only.
Schema: {"safe": true/false, "reason": "brief explanation if unsafe"}""",
messages=[{
"role": "user",
"content": f"Review this text:\n\n{response_text}"
}],
)
import json
result = json.loads(moderation.content[0].text)
return result
def safe_agent_response(user_input: str) -> str:
# Validate input
try:
clean_input = validate_input(user_input)
except InputValidationError as e:
return f"I can't process that request: {e}"
# Generate response
response = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": clean_input}],
)
response_text = response.content[0].text
# Moderate output
moderation = moderate_output(response_text)
if not moderation["safe"]:
return "I'm unable to provide that response."
return response_text
Claude's Constitutional AI Foundation
Claude is trained with Constitutional AI (CAI) — a process where the model is trained to be helpful, harmless, and honest. Claude will refuse genuinely harmful requests by default. Your guardrails supplement this:
- Input guardrails — Catch prompt injection, validate format, enforce length limits
- Output guardrails — Catch PII in responses, verify citations, check format compliance
- Business logic guardrails — Enforce domain restrictions (e.g., "only answer about cooking")
# Domain restriction via system prompt
COOKING_ASSISTANT_SYSTEM = """
You are a cooking assistant for RecipeHub. You ONLY answer questions about:
- Recipes and cooking techniques
- Ingredient substitutions
- Nutrition information
- Kitchen equipment
For ANY other topic, respond: "I'm a cooking assistant and can only help with food and cooking questions."
"""
Principle of Minimal Capability
Give agents only the tools and permissions they need for their specific task:
# Bad: give agent access to everything
all_tools = [read_file, write_file, delete_file, run_code, send_email, call_api, ...]
# Good: minimal tool set for the specific task
summarizer_tools = [read_file] # Only needs to read, not write/delete/execute
Responsible AI deployment is not a checklist — it is an ongoing commitment to understanding how your system behaves in the real world and correcting it when it falls short.