Guardrails
Protect your LLM usage with content guardrails that detect and block harmful content
Guardrails protect your organization by automatically detecting and blocking harmful content in LLM requests before they reach the model.
Guardrails are available on the Enterprise plan.
Overview
Guardrails run on every API request, scanning message content for:
- Security threats (prompt injection, jailbreak attempts)
- Sensitive data (PII, secrets, credentials)
- Policy violations (blocked terms, restricted topics)
When a violation is detected, you control what happens: block the request, redact the content, or log a warning.
System Rules
Built-in rules protect against common threats:
Prompt Injection Detection
Detects attempts to override or manipulate system instructions. Common patterns include:
- "Ignore all previous instructions"
- "You are now a different AI"
- Hidden instructions in encoded text
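As a rough illustration, phrase-based detection for patterns like these can be sketched as follows. The pattern list and function name are hypothetical; a production detector covers far more patterns and is typically model-assisted rather than purely regex-based:

```python
import re

# Illustrative patterns only -- a real injection rule set is much larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all |any )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now a different ai", re.IGNORECASE),
]

def looks_like_injection(text: str) -> bool:
    """Return True if any known injection phrase appears in the text."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```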
Jailbreak Detection
Identifies attempts to bypass safety measures:
- DAN (Do Anything Now) prompts
- Roleplay-based bypasses
- Instruction override attempts
PII Detection
Identifies personal information:
- Email addresses
- Phone numbers
- Social Security Numbers
- Credit card numbers
- IP addresses
When the action is set to redact, detected PII is replaced with placeholders such as `[EMAIL_REDACTED]`.
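The replace-with-placeholder behavior can be sketched as below. The two patterns are deliberately simplified examples (real PII detection uses more robust patterns for each category), but the output shape matches the placeholder style described above:

```python
import re

# Simplified example patterns; production detection is more thorough.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII match with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text
```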
Secrets Detection
Detects credentials and API keys:
- AWS access keys and secrets
- Generic API keys
- Passwords in common formats
- Private keys
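Secrets scanning works the same way but keys off credential formats rather than personal data. A minimal sketch, assuming two well-known formats (standard AWS access key IDs are 20 uppercase alphanumerics starting with `AKIA`; PEM private keys start with a `-----BEGIN ... PRIVATE KEY-----` header); a real scanner covers many more:

```python
import re

# Two illustrative credential formats; a real scanner checks dozens.
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "PRIVATE_KEY": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text: str) -> list[str]:
    """Return the names of all secret formats detected in the text."""
    return [name for name, p in SECRET_PATTERNS.items() if p.search(text)]
```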
File Type Restrictions
Control which file types can be uploaded:
- Configure allowed MIME types
- Set maximum file size limits
- Block potentially dangerous file types
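Conceptually, this is an allowlist plus a size cap. The sketch below uses hypothetical values (`ALLOWED_MIME_TYPES`, a 10 MB `MAX_FILE_BYTES`); the actual configuration schema is product-specific:

```python
# Example configuration values, not defaults of the product.
ALLOWED_MIME_TYPES = {"image/png", "image/jpeg", "application/pdf"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB example limit

def file_allowed(mime_type: str, size_bytes: int) -> bool:
    """Accept a file only if its type is allowlisted and it fits the size cap."""
    return mime_type in ALLOWED_MIME_TYPES and size_bytes <= MAX_FILE_BYTES
```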
Document Leakage Prevention
Detects attempts to extract confidential documents or internal data.
Configurable Actions
For each rule, choose how to respond:
| Action | Behavior |
|---|---|
| Block | Reject the request with a content policy error |
| Redact | Remove or mask the sensitive content, then continue |
| Warn | Log the violation but allow the request to proceed |
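The three behaviors in the table map to a simple dispatch, sketched below. The function and its signature are illustrative, not the product's API:

```python
def apply_action(action: str, original: str, redacted: str) -> str:
    """Dispatch on the rule's configured action (block / redact / warn)."""
    if action == "block":
        # Reject the request with a content policy error.
        raise PermissionError("content policy violation")
    if action == "redact":
        return redacted   # masked content continues to the model
    if action == "warn":
        return original   # violation is logged; request proceeds unchanged
    raise ValueError(f"unknown action: {action}")
```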
Custom Rules
Create organization-specific rules for your use case:
Blocked Terms
Prevent specific words or phrases from being used:
- Match type: exact, contains, or regex
- Case-sensitive matching option
- Multiple terms per rule
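The three match types behave differently, which matters when tuning false positives. A minimal sketch of the semantics (function name and signature are hypothetical; real rules also accept multiple terms per rule):

```python
import re

def term_matches(text: str, term: str, match_type: str,
                 case_sensitive: bool = False) -> bool:
    """Check one blocked term against text using exact/contains/regex semantics."""
    haystack = text if case_sensitive else text.lower()
    needle = term if case_sensitive else term.lower()
    if match_type == "exact":
        return haystack == needle
    if match_type == "contains":
        return needle in haystack
    if match_type == "regex":
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.search(term, text, flags) is not None
    raise ValueError(f"unknown match type: {match_type}")
```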
Custom Regex
Match patterns unique to your organization:
- Internal project codenames
- Customer identifiers
- Domain-specific sensitive data
Topic Restrictions
Block content related to specific topics:
- Define restricted topics
- Keyword-based detection
Security Events Dashboard
Monitor all guardrail violations with a dedicated dashboard:
- Total violations — Overall count and trends
- By action — Breakdown of blocked, redacted, and warned
- By category — Which rules are being triggered
- Detailed logs — Individual violations with timestamps and matched patterns
How It Works
Request → Guardrails Check → Action Based on Rules → Forward to Model (if allowed)
(any violation found along the way is logged)

- Request received — API request comes in with messages
- Content scanned — All text content is checked against enabled rules
- Violations detected — Matches are identified and logged
- Action taken — Based on rule configuration (block/redact/warn)
- Request proceeds — If not blocked, the (potentially redacted) request continues
Best Practices
- Start with warnings — Enable rules in warn mode first to understand your traffic patterns
- Review violations — Check the Security Events dashboard regularly
- Tune custom rules — Adjust blocked terms and regex patterns based on false positives
- Layer defenses — Use multiple rule types together for comprehensive protection
Get Started
Guardrails are an Enterprise feature. Contact us to enable Enterprise for your organization.