Guardrails
Protect your LLM usage with content guardrails that detect and block harmful content
Guardrails protect your organization by automatically detecting and blocking harmful content in LLM requests before they reach the model.
Guardrails are available on the Enterprise plan.
Overview
Guardrails run on every API request, scanning message content for:
- Security threats (prompt injection, jailbreak attempts)
- Sensitive data (PII, secrets, credentials)
- Policy violations (blocked terms, restricted topics)
When a violation is detected, you control what happens: block the request, redact the content, or log a warning.
System Rules
Built-in rules protect against common threats:
Prompt Injection Detection
Detects attempts to override or manipulate system instructions. Common patterns include:
- "Ignore all previous instructions"
- "You are now a different AI"
- Hidden instructions in encoded text
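As a rough illustration, phrase-based detection for patterns like these can be sketched as follows. The pattern list and function name are hypothetical; a production detector covers far more patterns and is typically model-assisted rather than purely regex-based:

```python
import re

# Illustrative patterns only -- a real injection rule set is much larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all |any )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now a different ai", re.IGNORECASE),
]

def looks_like_injection(text: str) -> bool:
    """Return True if any known injection phrase appears in the text."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```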
Jailbreak Detection
Identifies attempts to bypass safety measures:
- DAN (Do Anything Now) prompts
- Roleplay-based bypasses
- Instruction override attempts
PII Detection
Identifies personal information:
- Email addresses
- Phone numbers
- Social Security Numbers
- Credit card numbers
- IP addresses
When the action is set to redact, detected PII is replaced with placeholders such as `[EMAIL_REDACTED]`.
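The replace-with-placeholder behavior can be sketched as below. The two patterns are deliberately simplified examples (real PII detection uses more robust patterns for each category), but the output shape matches the placeholder style described above:

```python
import re

# Simplified example patterns; production detection is more thorough.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII match with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text
```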
Secrets Detection
Detects credentials and API keys:
- AWS access keys and secrets
- Generic API keys
- Passwords in common formats
- Private keys
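Secrets scanning works the same way but keys off credential formats rather than personal data. A minimal sketch, assuming two well-known formats (standard AWS access key IDs are 20 uppercase alphanumerics starting with `AKIA`; PEM private keys start with a `-----BEGIN ... PRIVATE KEY-----` header); a real scanner covers many more:

```python
import re

# Two illustrative credential formats; a real scanner checks dozens.
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "PRIVATE_KEY": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text: str) -> list[str]:
    """Return the names of all secret formats detected in the text."""
    return [name for name, p in SECRET_PATTERNS.items() if p.search(text)]
```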
File Type Restrictions
Control which file types can be uploaded:
- Configure allowed MIME types
- Set maximum file size limits
- Block potentially dangerous file types
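Conceptually, this is an allowlist plus a size cap. The sketch below uses hypothetical values (`ALLOWED_MIME_TYPES`, a 10 MB `MAX_FILE_BYTES`); the actual configuration schema is product-specific:

```python
# Example configuration values, not defaults of the product.
ALLOWED_MIME_TYPES = {"image/png", "image/jpeg", "application/pdf"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB example limit

def file_allowed(mime_type: str, size_bytes: int) -> bool:
    """Accept a file only if its type is allowlisted and it fits the size cap."""
    return mime_type in ALLOWED_MIME_TYPES and size_bytes <= MAX_FILE_BYTES
```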
Document Leakage Prevention
Detects attempts to extract confidential documents or internal data.
Configurable Actions
For each rule, choose how to respond:
| Action | Behavior |
|---|---|
| Block | Reject the request with a content policy error |
| Redact | Remove or mask the sensitive content, then continue |
| Warn | Log the violation but allow the request to proceed |
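The three behaviors in the table map to a simple dispatch, sketched below. The function and its signature are illustrative, not the product's API:

```python
def apply_action(action: str, original: str, redacted: str) -> str:
    """Dispatch on the rule's configured action (block / redact / warn)."""
    if action == "block":
        # Reject the request with a content policy error.
        raise PermissionError("content policy violation")
    if action == "redact":
        return redacted   # masked content continues to the model
    if action == "warn":
        return original   # violation is logged; request proceeds unchanged
    raise ValueError(f"unknown action: {action}")
```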
Custom Rules
Create organization-specific rules for your use case:
Blocked Terms
Prevent specific words or phrases from being used:
- Match type: exact, contains, or regex
- Case-sensitive matching option
- Multiple terms per rule
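The three match types behave differently, which matters when tuning false positives. A minimal sketch of the semantics (function name and signature are hypothetical; real rules also accept multiple terms per rule):

```python
import re

def term_matches(text: str, term: str, match_type: str,
                 case_sensitive: bool = False) -> bool:
    """Check one blocked term against text using exact/contains/regex semantics."""
    haystack = text if case_sensitive else text.lower()
    needle = term if case_sensitive else term.lower()
    if match_type == "exact":
        return haystack == needle
    if match_type == "contains":
        return needle in haystack
    if match_type == "regex":
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.search(term, text, flags) is not None
    raise ValueError(f"unknown match type: {match_type}")
```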
Custom Regex
Match patterns unique to your organization:
- Internal project codenames
- Customer identifiers
- Domain-specific sensitive data
Topic Restrictions
Block content related to specific topics:
- Define restricted topics
- Keyword-based detection
Security Events Dashboard
Monitor all guardrail violations with a dedicated dashboard:
- Total violations — Overall count and trends
- By action — Breakdown of blocked, redacted, and warned
- By category — Which rules are being triggered
- Detailed logs — Individual violations with timestamps and matched patterns
How It Works
Request → Guardrails Check → Action Based on Rules → Forward to Model (if allowed)
(any violation found along the way is logged)

- Request received — API request comes in with messages
- Content scanned — All text content is checked against enabled rules
- Violations detected — Matches are identified and logged
- Action taken — Based on rule configuration (block/redact/warn)
- Request proceeds — If not blocked, the (potentially redacted) request continues
Best Practices
- Start with warnings — Enable rules in warn mode first to understand your traffic patterns
- Review violations — Check the Security Events dashboard regularly
- Tune custom rules — Adjust blocked terms and regex patterns based on false positives
- Layer defenses — Use multiple rule types together for comprehensive protection
Get Started
Guardrails are an Enterprise feature. Contact us to enable Enterprise for your organization.