AI Guardrails
AI Guardrails provide real-time safety controls for AI models accessed through the AI Gateway. Guardrails inspect both input (prompts) and output (responses) to enforce content policies, detect sensitive data, prevent prompt injection attacks, and control costs.
How Guardrails Work
Guardrails are applied on every API request to the AI Gateway's chat/completions endpoint:
```
User Request
      ↓
┌─────────────────────────┐
│ Input Guardrails        │ ← Check prompt before sending to model
│ (PII, injection,        │
│  content filter)        │
├─────────────────────────┤
│ BLOCK  → 400 error      │ ← If triggered, request is rejected
│ MODIFY → cleaned        │ ← If PII found, mask and continue
│ PASS   → original       │ ← No issues found
└─────────────────────────┘
      ↓
┌─────────────────────────┐
│ AI Model (LLM)          │ ← Model generates response
└─────────────────────────┘
      ↓
┌─────────────────────────┐
│ Output Guardrails       │ ← Check response before returning
│ (content filter,        │
│  PII, validation)       │
├─────────────────────────┤
│ BLOCK  → error          │ ← Response blocked
│ MODIFY → cleaned        │ ← Response sanitized
│ PASS   → original       │ ← No issues found
└─────────────────────────┘
      ↓
Response (with violation metadata)
```
All guardrail evaluations are logged for monitoring and compliance auditing.
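The flow above can be sketched in Python. This is an illustrative model of the pipeline, not the gateway's actual internals: the GuardrailResult shape, the rule-function convention, and the mask_ssn rule are assumptions for the example.

```python
# Minimal sketch of the guardrail pipeline: input rules run before the
# model call, output rules after, and the first "block" short-circuits.
import re
from dataclasses import dataclass, field

@dataclass
class GuardrailResult:
    action: str                 # "pass", "modify", or "block"
    content: str                # original or cleaned content
    triggered: list = field(default_factory=list)

def evaluate(content, rules):
    """Run each rule in order; a block short-circuits, modifications accumulate."""
    triggered = []
    for rule in rules:
        action, content = rule(content)
        if action != "pass":
            triggered.append((rule.__name__, action))
        if action == "block":
            return GuardrailResult("block", content, triggered)
    final = "modify" if any(a == "modify" for _, a in triggered) else "pass"
    return GuardrailResult(final, content, triggered)

def mask_ssn(text):
    """Example MODIFY rule: mask SSNs but keep the last four digits."""
    cleaned = re.sub(r"\b\d{3}-\d{2}-(\d{4})\b", r"***-**-\1", text)
    return ("modify" if cleaned != text else "pass"), cleaned

def handle_request(prompt, call_model, input_rules, output_rules):
    pre = evaluate(prompt, input_rules)
    if pre.action == "block":
        return {"status": 400, "detail": f"blocked: {pre.triggered}"}
    response = call_model(pre.content)       # model sees the cleaned prompt
    post = evaluate(response, output_rules)
    if post.action == "block":
        return {"status": 400, "detail": f"blocked: {post.triggered}"}
    return {"status": 200, "content": post.content,
            "violations": pre.triggered + post.triggered}
```

Note that the model only ever sees the already-masked prompt, which is why MODIFY-mode PII detection prevents sensitive data from leaving your boundary at all.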
Available Guardrails
Content Filter
Type: Input & Output | Category: Content Safety
Blocks harmful, offensive, or inappropriate content including adult/explicit material, hate speech, and NSFW content.
| Setting | Options | Description |
|---|---|---|
| Filtering Level | low, medium, high, strict | Sensitivity threshold for content blocking |
- Low — Blocks only explicitly harmful content
- Medium — Blocks harmful and suggestive content
- High — Blocks harmful, suggestive, and borderline content
- Strict — Maximum filtering, may produce false positives
PII Detection
Type: Input & Output | Category: Security
Detects and masks personally identifiable information (PII) in prompts and responses. Prevents sensitive data from being sent to or returned by AI models.
| PII Type | Description | Example |
|---|---|---|
| SSN | Social Security numbers | ***-**-1234 |
| Credit Card | Credit card numbers | ****-****-****-5678 |
| Email | Email addresses | [EMAIL] |
| Phone | Phone numbers | [PHONE] |
| Passport | Passport numbers | [PASSPORT] |
| Driver's License | Driver's license numbers | [DRIVERS_LICENSE] |
Configuration options:
| Setting | Options | Description |
|---|---|---|
| Action | mask, block | Mask detected PII or block the entire request |
| PII Types | Multi-select | Which PII types to detect |
| Mask Mode | char, uuid, labeled_uuid, partial, full | How masked values appear |
Masking modes explained:
| Mode | Example Output | Description |
|---|---|---|
| char | ****@example.com | Character masking with configurable mask character |
| uuid | a1b2c3d4-e5f6-7890-abcd-ef1234567890 | Consistent hash-based UUID (same input = same UUID) |
| labeled_uuid | EMAIL_a1b2c3d4 | Type prefix + UUID for tracking across masking operations |
| partial | ****user@example.com | Masks the value but keeps trailing characters visible for reference |
| full | [EMAIL] | Complete replacement with type label |
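The five modes could be implemented along these lines. The function name, the 8-character label suffix, and the number of visible trailing characters in partial mode are illustrative assumptions:

```python
# Sketch of the masking modes from the table above, applied to a detected
# PII span. uuid.uuid5 gives the "same input = same UUID" property.
import uuid

def mask(value: str, pii_type: str, mode: str, mask_char: str = "*") -> str:
    if mode == "char":
        # mask the local part of an email, keep the domain (assumption)
        local, _, domain = value.partition("@")
        return mask_char * len(local) + "@" + domain
    if mode == "uuid":
        # name-based UUIDv5: deterministic, so repeats mask identically
        return str(uuid.uuid5(uuid.NAMESPACE_OID, value))
    if mode == "labeled_uuid":
        return f"{pii_type.upper()}_{uuid.uuid5(uuid.NAMESPACE_OID, value).hex[:8]}"
    if mode == "partial":
        # keep a short visible tail for reference
        return mask_char * (len(value) - 4) + value[-4:]
    if mode == "full":
        return f"[{pii_type.upper()}]"
    raise ValueError(f"unknown mask mode: {mode}")
```

The deterministic modes (uuid, labeled_uuid) matter when the same person appears multiple times in a conversation: every occurrence masks to the same token, so the model can still reason about "the same entity" without seeing the real value.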
Prompt Injection Prevention
Type: Input | Category: Security
Detects and blocks prompt injection attacks where users attempt to override the system prompt or manipulate model behavior through specially crafted inputs.
Detection covers:
- System prompt override attempts
- Instruction injection patterns
- Role-play manipulation
- Context window exploitation
- Delimiter injection
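A rule-based layer for the categories above might look like the following sketch. The patterns are illustrative only; production detection combines far more patterns with ML classifiers (see ML-Based Detection below in this doc):

```python
# Illustrative regex patterns for common injection categories.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) (instructions|prompts)",  # override attempt
    r"you are now\b",                                           # role-play manipulation
    r"system\s*prompt",                                         # system prompt probing
    r"<\|?(system|im_start)\|?>",                               # delimiter injection
]

def looks_like_injection(prompt: str) -> bool:
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```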
Rate Limiting
Type: Input | Category: Performance
Enforces request frequency limits per model to prevent abuse and control costs.
| Setting | Description |
|---|---|
| Requests per Minute | Maximum API calls per minute |
| Requests per Day | Maximum API calls per day |
When limits are exceeded, subsequent requests receive a 429 Too Many Requests response with a Retry-After header.
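The per-minute limit behaves roughly like the following sketch. This is a single-process fixed-window version for illustration; real gateways typically use sliding windows or token buckets shared across instances:

```python
# Fixed-window per-minute rate limiter returning the Retry-After value
# a 429 response would carry.
import time

class RateLimiter:
    def __init__(self, requests_per_minute: int):
        self.limit = requests_per_minute
        self.window_start = 0.0
        self.count = 0

    def check(self, now=None):
        """Return (allowed, retry_after_seconds)."""
        now = time.time() if now is None else now
        if now - self.window_start >= 60:
            self.window_start, self.count = now, 0   # new window
        if self.count >= self.limit:
            return False, int(60 - (now - self.window_start))
        self.count += 1
        return True, 0
```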
Cost Control
Type: Input | Category: Cost Management
Enforces spending limits per model with budget alerts and automatic cutoff.
| Setting | Description |
|---|---|
| Monthly Budget | Maximum monthly spend (USD) |
| Alert Threshold | Percentage at which to send budget alerts (e.g., 80%) |
| Auto-Cutoff | Whether to block requests when budget is exceeded |
Output Validation
Type: Output | Category: Quality
Validates model responses against schemas or patterns to ensure consistent, well-formatted output.
| Setting | Description |
|---|---|
| Validation Type | json_schema or regex |
| Schema/Pattern | JSON Schema definition or regex pattern |
| Action on Failure | block or log |
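Both validation types can be sketched as below. The function name is an assumption, and the json_schema branch checks only required keys as a stand-in for a full validator (a real implementation would use a library such as jsonschema):

```python
# Sketch of regex and JSON-schema-style output validation.
import json
import re

def validate_output(text: str, validation_type: str, schema_or_pattern) -> bool:
    if validation_type == "regex":
        # the whole response must match the pattern
        return re.fullmatch(schema_or_pattern, text) is not None
    if validation_type == "json_schema":
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        # minimal structural check: required keys present
        # (stand-in for full JSON Schema validation)
        return isinstance(data, dict) and all(
            k in data for k in schema_or_pattern.get("required", [])
        )
    raise ValueError(f"unknown validation type: {validation_type}")
```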
Data Retention Policy
Type: Input & Output | Category: Compliance
Controls how long request and response data is retained for compliance with data protection regulations.
| Setting | Description |
|---|---|
| Retention Period | How long to retain request/response logs |
| Auto-Delete | Whether to automatically purge after retention period |
Model Version Lock
Type: Configuration | Category: Operational
Prevents automatic model version upgrades, ensuring consistent behavior across deployments.
| Setting | Description |
|---|---|
| Locked Version | Specific model version to pin to |
| Allow Minor Updates | Whether to allow patch/minor version changes |
Configuring Guardrails
From the Guardrails Dashboard
- Navigate to AI Gateway > Guardrails in the sidebar
- The dashboard shows all models with their guardrail status:
  - Protected Models — Models with active guardrails
  - Total Rules — Total active guardrail rules across all models
  - Blocked Today — Requests blocked by guardrails in the last 24 hours
  - Modified Today — Requests modified (e.g., PII masked) in the last 24 hours
- Click Configure on any model to open the guardrail editor
Guardrail Configuration Editor
The configuration editor has multiple sections:
Predefined Rules Tab
Select from the built-in guardrail types listed above. For each selected guardrail:
- Toggle the guardrail on/off
- Configure guardrail-specific settings (filtering level, PII types, rate limits, etc.)
- Set the action: block, modify, log, or alert
- Set priority order (higher priority rules are evaluated first)
Custom Rules Tab
Create custom content rules with conditions and actions:
Condition operators:
| Operator | Description |
|---|---|
| contains | Content contains the specified substring |
| equals | Content exactly matches the value |
| matches | Pattern match |
| regex | Regular expression match |
| greater_than | Numeric comparison (for token counts, etc.) |
| less_than | Numeric comparison |
Action types:
| Action | Description |
|---|---|
| block | Reject the request entirely |
| modify | Transform the content (e.g., replace matched text) |
| log | Log the violation but allow the request |
| alert | Send an alert notification |
Example custom rule:
```json
{
  "name": "Block Competitor Mentions",
  "description": "Prevent discussion of competitor products",
  "type": "input",
  "enabled": true,
  "priority": 5,
  "conditions": [
    {
      "field": "content",
      "operator": "regex",
      "value": "\\b(CompetitorA|CompetitorB)\\b",
      "case_sensitive": false
    }
  ],
  "actions": [
    {
      "type": "block",
      "config": {
        "message": "Content references competitor products"
      }
    }
  ]
}
```
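A rule in this shape could be evaluated with logic along these lines. This is a sketch, not the gateway's evaluator: the function names are assumptions, matches is interpreted as a glob-style pattern, and word count stands in for a token count in the numeric operators:

```python
# Sketch of custom-rule evaluation over the operator and action tables above.
import fnmatch
import re

def condition_matches(cond: dict, content: str) -> bool:
    value, op = cond["value"], cond["operator"]
    if not cond.get("case_sensitive", True) and op != "regex":
        content, value = content.lower(), value.lower()
    if op == "contains":
        return value in content
    if op == "equals":
        return content == value
    if op == "matches":
        return fnmatch.fnmatch(content, value)   # glob-style (assumption)
    if op == "regex":
        flags = 0 if cond.get("case_sensitive", True) else re.IGNORECASE
        return re.search(value, content, flags) is not None
    if op == "greater_than":
        return len(content.split()) > value      # word count as token proxy
    if op == "less_than":
        return len(content.split()) < value
    raise ValueError(f"unsupported operator: {op}")

def apply_rule(rule: dict, content: str):
    """Return the first action type if all conditions match, else None."""
    if not rule.get("enabled", True):
        return None
    if all(condition_matches(c, content) for c in rule["conditions"]):
        return rule["actions"][0]["type"]
    return None
```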
Code Editor Tab
For advanced users, the full guardrail configuration can be edited as JSON using the built-in code editor (CodeMirror). This allows precise control over all guardrail settings.
Test Tab
Test guardrails against sample input before deploying:
- Enter test content in the input field
- Select direction (input or output)
- Click Run Test
- View results showing:
  - Which rules were triggered
  - What action was taken (block, modify, pass)
  - Modified content (if applicable)
  - Confidence scores for ML-based detection
From Governance Policies
Guardrail requirements can also be enforced through governance policies:
- Create a policy with aiGatewayModel as an applicable resource type
- Add a stage with guardrail requirement fields
- Select which guardrails must be enabled (required) and which are recommended
- The Enforcement Engine verifies guardrail compliance during pre-deployment checks
This integrates guardrails into the broader governance workflow, ensuring models meet organizational safety standards before deployment.
Guardrail Actions
When a guardrail is triggered, one of these actions is taken:
| Action | Description | HTTP Response |
|---|---|---|
| Block | Request is rejected entirely | 400 Bad Request with triggered rule details |
| Modify | Content is cleaned/masked and request continues | 200 OK with modified content |
| Log | Violation is logged but request proceeds unchanged | 200 OK with violation in metadata |
| Alert | Notification sent, request proceeds | 200 OK with alert in metadata |
| Redirect | Request redirected to alternative handler | Varies |
Block Response Format
When a request is blocked by guardrails:
```json
{
  "detail": "Request blocked by guardrails: pii-filter, content-filter"
}
```
ML-Based Detection
The guardrails system supports ML-based detection for enhanced accuracy (when ML models are available on the gateway):
| Detection Type | Model/Library | Description |
|---|---|---|
| Toxicity | Detoxify | Toxic content detection with confidence scores |
| PII (ML) | Microsoft Presidio | Named entity recognition for PII |
| Prompt Injection | DistilBERT | ML-based injection pattern detection |
| Hallucination | BART zero-shot | Factual consistency checking |
| Language Detection | langdetect | Detect content language |
| Named Entity Recognition | spaCy | Entity extraction for sensitive data |
ML-based detection runs in parallel using a thread pool for minimal latency impact. Results are cached for 5 minutes to avoid re-computation on similar content.
When ML models are not available, the system falls back to rule-based pattern matching using regex and keyword detection.
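The cache-plus-fallback behavior can be sketched as below. The function and detector names are illustrative, and the keyword fallback is a deliberately crude stand-in for the real rule-based path:

```python
# Sketch: results keyed by a content hash with a 5-minute TTL, falling
# back to regex matching when no ML detector is available.
import hashlib
import re
import time

_CACHE = {}        # digest -> (expires_at, score)
CACHE_TTL = 300    # 5 minutes

def toxicity_score(text: str, ml_detector=None, now=None) -> float:
    now = time.time() if now is None else now
    digest = hashlib.sha256(text.encode()).hexdigest()
    cached = _CACHE.get(digest)
    if cached and cached[0] > now:
        return cached[1]                      # fresh cache hit
    if ml_detector is not None:
        score = ml_detector(text)             # e.g. a Detoxify-style pipeline
    else:
        # rule-based fallback: crude keyword check for illustration
        score = 1.0 if re.search(r"\b(hate|kill)\b", text.lower()) else 0.0
    _CACHE[digest] = (now + CACHE_TTL, score)
    return score
```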
Monitoring Guardrails
Activity Logs
Navigate to the guardrail configuration for any model and open the Logs tab to view recent activity:
| Field | Description |
|---|---|
| Timestamp | When the evaluation occurred |
| Direction | input or output |
| Triggered Rules | Which guardrails fired |
| Action Taken | Block, modify, or log |
| Confidence | ML confidence score (if applicable) |
| Content Info | Original vs. modified content length |
Guardrail logs are retained for 15 days (TTL index) and are automatically deleted after that period.
Statistics
Each model's guardrail configuration tracks aggregate statistics:
| Metric | Description |
|---|---|
| Total Evaluations | Total input/output checks performed |
| Block Rate | Percentage of requests blocked |
| Modification Rate | Percentage of requests modified |
| Top Triggered Rules | Most frequently triggered guardrails |
| Requests per Minute/Day | Current request volume |
Duplicating Configurations
To apply the same guardrail configuration to multiple models:
- Open the guardrail config for the source model
- Click Duplicate
- Select the target model(s)
- The configuration is copied to each target model
Exporting and Importing
Guardrail configurations can be exported as JSON and imported to other models or environments, enabling governance-as-code workflows.
Integration with Governance Policies
Guardrails integrate with the governance policy system at two levels:
1. Policy-Level Requirements
Governance policies can require specific guardrails on AI Gateway models:
- A policy stage can specify required guardrails (e.g., "PII detection must be enabled")
- The enforcement engine verifies these requirements before deployment
- Non-compliant models are blocked from deployment
2. Enforcement Checks
The enforcement engine includes guardrail compliance as part of its pre-deployment evaluation:
- Checks which guardrails are configured on the model
- Compares against policy requirements
- Reports missing guardrails as hard blocks, soft blocks, or warnings based on the policy's enforcement rules
See Enforcement for the complete enforcement workflow.
Best Practices
Start with Essentials
Enable these guardrails on all production AI models as a baseline:
- PII Detection (mask mode) — Protect sensitive data
- Content Filter (medium level) — Block harmful content
- Prompt Injection Prevention — Protect against manipulation
- Rate Limiting — Prevent abuse
Layer Security
- Use input guardrails to protect data sent to models (PII, injection)
- Use output guardrails to protect users from harmful responses (content filter, validation)
- Use both for comprehensive protection where applicable
Test Before Deploying
Always use the Test feature to validate guardrail behavior with representative content before enabling on production models. Check for:
- False positives (legitimate content being blocked)
- False negatives (harmful content passing through)
- PII masking accuracy
Monitor and Tune
- Review guardrail logs regularly to identify patterns
- Adjust filtering levels based on block rates
- High false-positive rates suggest the filtering level is too strict
- High false-negative rates suggest more guardrails or stricter settings are needed
- Export statistics for compliance reporting
When first deploying guardrails, start with the log action to understand the impact before switching to block or modify. This lets you tune thresholds without disrupting service.