Guardrails, Safety, and Responsible AI

22 min read Lesson 28 / 28

Deploying AI in production carries responsibility. Guardrails prevent your agent from causing harm, embarrassing your brand, or violating user trust.

Input Validation

Validate and sanitize user input before it reaches Claude:

import re
import anthropic

client = anthropic.Anthropic()

MAX_INPUT_LENGTH = 10_000  # characters
BLOCKED_PATTERNS = [
    r"ignore (?:all )?(?:previous|prior|above) instructions",
    r"system prompt",
    r"jailbreak",
    r"DAN mode",
]

class InputValidationError(Exception):
    pass

def validate_input(user_input: str) -> str:
    # Length check
    if len(user_input) > MAX_INPUT_LENGTH:
        raise InputValidationError(f"Input too long: {len(user_input)} chars (max {MAX_INPUT_LENGTH})")

    # Prompt injection detection
    lower_input = user_input.lower()
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, lower_input):
            raise InputValidationError("Input contains disallowed content.")

    return user_input.strip()

Output Moderation

Screen Claude's responses before returning them to users:

def moderate_output(response_text: str) -> dict:
    """Use Claude to check its own output for policy violations."""
    moderation = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=64,
        system="""You are a content moderator. Review the text and respond with JSON only.
Schema: {"safe": true/false, "reason": "brief explanation if unsafe"}""",
        messages=[{
            "role": "user",
            "content": f"Review this text:\n\n{response_text}"
        }],
    )

    import json
    result = json.loads(moderation.content[0].text)
    return result

def safe_agent_response(user_input: str) -> str:
    # Validate input
    try:
        clean_input = validate_input(user_input)
    except InputValidationError as e:
        return f"I can't process that request: {e}"

    # Generate response
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": clean_input}],
    )
    response_text = response.content[0].text

    # Moderate output
    moderation = moderate_output(response_text)
    if not moderation["safe"]:
        return "I'm unable to provide that response."

    return response_text

Claude's Constitutional AI Foundation

Claude is trained with Constitutional AI (CAI) — a process where the model is trained to be helpful, harmless, and honest. Claude will refuse genuinely harmful requests by default. Your guardrails supplement this:

Input guardrails — Catch prompt injection, validate format, enforce length limits
Output guardrails — Catch PII in responses, verify citations, check format compliance
Business logic guardrails — Enforce domain restrictions (e.g., "only answer about cooking")

# Domain restriction via system prompt
COOKING_ASSISTANT_SYSTEM = """
You are a cooking assistant for RecipeHub. You ONLY answer questions about:
- Recipes and cooking techniques
- Ingredient substitutions
- Nutrition information
- Kitchen equipment

For ANY other topic, respond: "I'm a cooking assistant and can only help with food and cooking questions."
"""

Principle of Minimal Capability

Give agents only the tools and permissions they need for their specific task:

# Bad: give agent access to everything
all_tools = [read_file, write_file, delete_file, run_code, send_email, call_api, ...]

# Good: minimal tool set for the specific task
summarizer_tools = [read_file]  # Only needs to read, not write/delete/execute

Responsible AI deployment is not a checklist — it is an ongoing commitment to understanding how your system behaves in the real world and correcting it when it falls short.