AI Guardrails
AI Guardrails provide real-time safety controls for AI models accessed through the AI Gateway. Guardrails inspect both input (prompts) and output (responses) to enforce content policies, detect sensitive data, prevent prompt injection attacks, and control costs.
How Guardrails Work
Guardrails are applied on every API request to the AI Gateway's chat/completions endpoint:
```
User Request
      ↓
┌─────────────────────────┐
│ Input Guardrails        │ ← Check prompt before sending to model
│ (PII, injection,        │
│  content filter)        │
├─────────────────────────┤
│ BLOCK  → 400 error      │ ← If triggered, request is rejected
│ MODIFY → cleaned        │ ← If PII found, mask and continue
│ PASS   → original       │ ← No issues found
└─────────────────────────┘
      ↓
┌─────────────────────────┐
│ AI Model (LLM)          │ ← Model generates response
└─────────────────────────┘
      ↓
┌─────────────────────────┐
│ Output Guardrails       │ ← Check response before returning
│ (content filter,        │
│  PII, validation)       │
├─────────────────────────┤
│ BLOCK  → error          │ ← Response blocked
│ MODIFY → cleaned        │ ← Response sanitized
│ PASS   → original       │ ← No issues found
└─────────────────────────┘
      ↓
Response (with violation metadata)
```
All guardrail evaluations are logged for monitoring and compliance auditing.
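The flow above can be sketched in Python. This is an illustrative model of the pipeline, not the gateway's actual internals: the GuardrailResult shape, the rule-function convention, and the mask_ssn rule are assumptions for the example.

```python
# Minimal sketch of the guardrail pipeline: input rules run before the
# model call, output rules after, and the first "block" short-circuits.
import re
from dataclasses import dataclass, field

@dataclass
class GuardrailResult:
    action: str                 # "pass", "modify", or "block"
    content: str                # original or cleaned content
    triggered: list = field(default_factory=list)

def evaluate(content, rules):
    """Run each rule in order; a block short-circuits, modifications accumulate."""
    triggered = []
    for rule in rules:
        action, content = rule(content)
        if action != "pass":
            triggered.append((rule.__name__, action))
        if action == "block":
            return GuardrailResult("block", content, triggered)
    final = "modify" if any(a == "modify" for _, a in triggered) else "pass"
    return GuardrailResult(final, content, triggered)

def mask_ssn(text):
    """Example MODIFY rule: mask SSNs but keep the last four digits."""
    cleaned = re.sub(r"\b\d{3}-\d{2}-(\d{4})\b", r"***-**-\1", text)
    return ("modify" if cleaned != text else "pass"), cleaned

def handle_request(prompt, call_model, input_rules, output_rules):
    pre = evaluate(prompt, input_rules)
    if pre.action == "block":
        return {"status": 400, "detail": f"blocked: {pre.triggered}"}
    response = call_model(pre.content)       # model sees the cleaned prompt
    post = evaluate(response, output_rules)
    if post.action == "block":
        return {"status": 400, "detail": f"blocked: {post.triggered}"}
    return {"status": 200, "content": post.content,
            "violations": pre.triggered + post.triggered}
```

Note that the model only ever sees the already-masked prompt, which is why MODIFY-mode PII detection prevents sensitive data from leaving your boundary at all.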
Available Guardrails
Content Filter
Type: Input & Output | Category: Content Safety
Blocks harmful, offensive, or inappropriate content including adult/explicit material, hate speech, and NSFW content.
| Setting | Options | Description |
|---|---|---|
| Filtering Level | low, medium, high, strict | Sensitivity threshold for content blocking |
- Low — Blocks only explicitly harmful content
- Medium — Blocks harmful and suggestive content
- High — Blocks harmful, suggestive, and borderline content
- Strict — Maximum filtering, may produce false positives
PII Detection
Type: Input & Output | Category: Security
Detects and masks personally identifiable information (PII) in prompts and responses. Prevents sensitive data from being sent to or returned by AI models.
| PII Type | Description | Example |
|---|---|---|
| SSN | Social Security numbers | ***-**-1234 |
| Credit Card | Credit card numbers | ****-****-****-5678 |
| Email | Email addresses | [EMAIL] |
| Phone | Phone numbers | [PHONE] |
| Passport | Passport numbers | [PASSPORT] |
| Driver's License | Driver's license numbers | [DRIVERS_LICENSE] |
Configuration options:
| Setting | Options | Description |
|---|---|---|
| Action | mask, block | Mask detected PII or block the entire request |
| PII Types | Multi-select | Which PII types to detect |
| Mask Mode | char, uuid, labeled_uuid, partial, full | How masked values appear |
Masking modes explained:
| Mode | Example Output | Description |
|---|---|---|
| char | ****@example.com | Character masking with configurable mask character |
| uuid | a1b2c3d4-e5f6-7890-abcd-ef1234567890 | Consistent hash-based UUID (same input = same UUID) |
| labeled_uuid | EMAIL_a1b2c3d4 | Type prefix + UUID for tracking across masking operations |
| partial | ****user@example.com | Masks the value but keeps trailing characters visible for reference |
| full | [EMAIL] | Complete replacement with type label |
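The five modes could be implemented along these lines. The function name, the 8-character label suffix, and the number of visible trailing characters in partial mode are illustrative assumptions:

```python
# Sketch of the masking modes from the table above, applied to a detected
# PII span. uuid.uuid5 gives the "same input = same UUID" property.
import uuid

def mask(value: str, pii_type: str, mode: str, mask_char: str = "*") -> str:
    if mode == "char":
        # mask the local part of an email, keep the domain (assumption)
        local, _, domain = value.partition("@")
        return mask_char * len(local) + "@" + domain
    if mode == "uuid":
        # name-based UUIDv5: deterministic, so repeats mask identically
        return str(uuid.uuid5(uuid.NAMESPACE_OID, value))
    if mode == "labeled_uuid":
        return f"{pii_type.upper()}_{uuid.uuid5(uuid.NAMESPACE_OID, value).hex[:8]}"
    if mode == "partial":
        # keep a short visible tail for reference
        return mask_char * (len(value) - 4) + value[-4:]
    if mode == "full":
        return f"[{pii_type.upper()}]"
    raise ValueError(f"unknown mask mode: {mode}")
```

The deterministic modes (uuid, labeled_uuid) matter when the same person appears multiple times in a conversation: every occurrence masks to the same token, so the model can still reason about "the same entity" without seeing the real value.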
Prompt Injection Prevention
Type: Input | Category: Security
Detects and blocks prompt injection attacks where users attempt to override the system prompt or manipulate model behavior through specially crafted inputs.
Detection covers:
- System prompt override attempts
- Instruction injection patterns
- Role-play manipulation
- Context window exploitation
- Delimiter injection
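A rule-based layer for the categories above might look like the following sketch. The patterns are illustrative only; production detection combines far more patterns with ML classifiers (see ML-Based Detection below in this doc):

```python
# Illustrative regex patterns for common injection categories.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) (instructions|prompts)",  # override attempt
    r"you are now\b",                                           # role-play manipulation
    r"system\s*prompt",                                         # system prompt probing
    r"<\|?(system|im_start)\|?>",                               # delimiter injection
]

def looks_like_injection(prompt: str) -> bool:
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```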
Rate Limiting
Type: Input | Category: Performance
Enforces request frequency limits per model to prevent abuse and control costs.
| Setting | Description |
|---|---|
| Requests per Minute | Maximum API calls per minute |
| Requests per Day | Maximum API calls per day |
When limits are exceeded, subsequent requests receive a 429 Too Many Requests response with a Retry-After header.
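The per-minute limit behaves roughly like the following sketch. This is a single-process fixed-window version for illustration; real gateways typically use sliding windows or token buckets shared across instances:

```python
# Fixed-window per-minute rate limiter returning the Retry-After value
# a 429 response would carry.
import time

class RateLimiter:
    def __init__(self, requests_per_minute: int):
        self.limit = requests_per_minute
        self.window_start = 0.0
        self.count = 0

    def check(self, now=None):
        """Return (allowed, retry_after_seconds)."""
        now = time.time() if now is None else now
        if now - self.window_start >= 60:
            self.window_start, self.count = now, 0   # new window
        if self.count >= self.limit:
            return False, int(60 - (now - self.window_start))
        self.count += 1
        return True, 0
```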
Cost Control
Type: Input | Category: Cost Management
Enforces spending limits per model with budget alerts and automatic cutoff.
| Setting | Description |
|---|---|
| Monthly Budget | Maximum monthly spend (USD) |
| Alert Threshold | Percentage at which to send budget alerts (e.g., 80%) |
| Auto-Cutoff | Whether to block requests when budget is exceeded |
Output Validation
Type: Output | Category: Quality
Validates model responses against schemas or patterns to ensure consistent, well-formatted output.
| Setting | Description |
|---|---|
| Validation Type | json_schema or regex |
| Schema/Pattern | JSON Schema definition or regex pattern |
| Action on Failure | block or log |
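Both validation types can be sketched as below. The function name is an assumption, and the json_schema branch checks only required keys as a stand-in for a full validator (a real implementation would use a library such as jsonschema):

```python
# Sketch of regex and JSON-schema-style output validation.
import json
import re

def validate_output(text: str, validation_type: str, schema_or_pattern) -> bool:
    if validation_type == "regex":
        # the whole response must match the pattern
        return re.fullmatch(schema_or_pattern, text) is not None
    if validation_type == "json_schema":
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return False
        # minimal structural check: required keys present
        # (stand-in for full JSON Schema validation)
        return isinstance(data, dict) and all(
            k in data for k in schema_or_pattern.get("required", [])
        )
    raise ValueError(f"unknown validation type: {validation_type}")
```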
Data Retention Policy
Type: Input & Output | Category: Compliance
Controls how long request and response data is retained for compliance with data protection regulations.
| Setting | Description |
|---|---|
| Retention Period | How long to retain request/response logs |
| Auto-Delete | Whether to automatically purge after retention period |
Model Version Lock
Type: Configuration | Category: Operational
Prevents automatic model version upgrades, ensuring consistent behavior across deployments.
| Setting | Description |
|---|---|
| Locked Version | Specific model version to pin to |
| Allow Minor Updates | Whether to allow patch/minor version changes |
Configuring Guardrails
From the Guardrails Dashboard
- Navigate to AI Gateway > Guardrails in the sidebar
- The dashboard shows all models with their guardrail status:
  - Protected Models — Models with active guardrails
  - Total Rules — Total active guardrail rules across all models
  - Blocked Today — Requests blocked by guardrails in the last 24 hours
  - Modified Today — Requests modified (e.g., PII masked) in the last 24 hours
- Click Configure on any model to open the guardrail editor
Guardrail Configuration Editor
The configuration editor has multiple sections:
Predefined Rules Tab
Select from the built-in guardrail types listed above. For each selected guardrail:
- Toggle the guardrail on/off
- Configure guardrail-specific settings (filtering level, PII types, rate limits, etc.)
- Set the action: block, modify, log, or alert
- Set priority order (higher priority rules are evaluated first)
Custom Rules Tab
Create custom content rules with conditions and actions:
Condition operators:
| Operator | Description |
|---|---|
| contains | Content contains the specified substring |
| equals | Content exactly matches the value |
| matches | Pattern match |
| regex | Regular expression match |
| greater_than | Numeric comparison (for token counts, etc.) |
| less_than | Numeric comparison |
Action types:
| Action | Description |
|---|---|
| block | Reject the request entirely |
| modify | Transform the content (e.g., replace matched text) |
| log | Log the violation but allow the request |
| alert | Send an alert notification |
Example custom rule:
```json
{
  "name": "Block Competitor Mentions",
  "description": "Prevent discussion of competitor products",
  "type": "input",
  "enabled": true,
  "priority": 5,
  "conditions": [
    {
      "field": "content",
      "operator": "regex",
      "value": "\\b(CompetitorA|CompetitorB)\\b",
      "case_sensitive": false
    }
  ],
  "actions": [
    {
      "type": "block",
      "config": {
        "message": "Content references competitor products"
      }
    }
  ]
}
```
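A rule in this shape could be evaluated with logic along these lines. This is a sketch, not the gateway's evaluator: the function names are assumptions, matches is interpreted as a glob-style pattern, and word count stands in for a token count in the numeric operators:

```python
# Sketch of custom-rule evaluation over the operator and action tables above.
import fnmatch
import re

def condition_matches(cond: dict, content: str) -> bool:
    value, op = cond["value"], cond["operator"]
    if not cond.get("case_sensitive", True) and op != "regex":
        content, value = content.lower(), value.lower()
    if op == "contains":
        return value in content
    if op == "equals":
        return content == value
    if op == "matches":
        return fnmatch.fnmatch(content, value)   # glob-style (assumption)
    if op == "regex":
        flags = 0 if cond.get("case_sensitive", True) else re.IGNORECASE
        return re.search(value, content, flags) is not None
    if op == "greater_than":
        return len(content.split()) > value      # word count as token proxy
    if op == "less_than":
        return len(content.split()) < value
    raise ValueError(f"unsupported operator: {op}")

def apply_rule(rule: dict, content: str):
    """Return the first action type if all conditions match, else None."""
    if not rule.get("enabled", True):
        return None
    if all(condition_matches(c, content) for c in rule["conditions"]):
        return rule["actions"][0]["type"]
    return None
```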
Code Editor Tab
For advanced users, the full guardrail configuration can be edited as JSON using the built-in code editor (CodeMirror). This allows precise control over all guardrail settings.
Test Tab
Test guardrails against sample input before deploying:
- Enter test content in the input field
- Select direction (input or output)
- Click Run Test
- View results showing:
  - Which rules were triggered
  - What action was taken (block, modify, pass)
  - Modified content (if applicable)
  - Confidence scores for ML-based detection
From Governance Policies
Guardrail requirements can also be enforced through governance policies:
- Create a policy with aiGatewayModel as an applicable resource type
- Add a stage with guardrail requirement fields
- Select which guardrails must be enabled (required) and which are recommended
- The Enforcement Engine verifies guardrail compliance during pre-deployment checks
This integrates guardrails into the broader governance workflow, ensuring models meet organizational safety standards before deployment.
Guardrail Actions
When a guardrail is triggered, one of these actions is taken:
| Action | Description | HTTP Response |
|---|---|---|
| Block | Request is rejected entirely | 400 Bad Request with triggered rule details |
| Modify | Content is cleaned/masked and request continues | 200 OK with modified content |
| Log | Violation is logged but request proceeds unchanged | 200 OK with violation in metadata |
| Alert | Notification sent, request proceeds | 200 OK with alert in metadata |
| Redirect | Request redirected to alternative handler | Varies |
Block Response Format
When a request is blocked by guardrails:
```json
{
  "detail": "Request blocked by guardrails: pii-filter, content-filter"
}
```
ML-Based Detection
The guardrails system supports ML-based detection for enhanced accuracy (when ML models are available on the gateway):
| Detection Type | Model/Library | Description |
|---|---|---|
| Toxicity | Detoxify | Toxic content detection with confidence scores |
| PII (ML) | Microsoft Presidio | Named entity recognition for PII |
| Prompt Injection | DistilBERT | ML-based injection pattern detection |
| Hallucination | BART zero-shot | Factual consistency checking |
| Language Detection | langdetect | Detect content language |
| Named Entity Recognition | spaCy | Entity extraction for sensitive data |
ML-based detection runs in parallel using a thread pool for minimal latency impact. Results are cached for 5 minutes to avoid re-computation on similar content.
When ML models are not available, the system falls back to rule-based pattern matching using regex and keyword detection.
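The cache-plus-fallback behavior can be sketched as below. The function and detector names are illustrative, and the keyword fallback is a deliberately crude stand-in for the real rule-based path:

```python
# Sketch: results keyed by a content hash with a 5-minute TTL, falling
# back to regex matching when no ML detector is available.
import hashlib
import re
import time

_CACHE = {}        # digest -> (expires_at, score)
CACHE_TTL = 300    # 5 minutes

def toxicity_score(text: str, ml_detector=None, now=None) -> float:
    now = time.time() if now is None else now
    digest = hashlib.sha256(text.encode()).hexdigest()
    cached = _CACHE.get(digest)
    if cached and cached[0] > now:
        return cached[1]                      # fresh cache hit
    if ml_detector is not None:
        score = ml_detector(text)             # e.g. a Detoxify-style pipeline
    else:
        # rule-based fallback: crude keyword check for illustration
        score = 1.0 if re.search(r"\b(hate|kill)\b", text.lower()) else 0.0
    _CACHE[digest] = (now + CACHE_TTL, score)
    return score
```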
Monitoring Guardrails
Activity Logs
Navigate to the guardrail configuration for any model and open the Logs tab to view recent activity:
| Field | Description |
|---|---|
| Timestamp | When the evaluation occurred |
| Direction | input or output |
| Triggered Rules | Which guardrails fired |
| Action Taken | Block, modify, or log |
| Confidence | ML confidence score (if applicable) |
| Content Info | Original vs. modified content length |
Guardrail logs are retained for 15 days (TTL index) and are automatically deleted after that period.
Statistics
Each model's guardrail configuration tracks aggregate statistics:
| Metric | Description |
|---|---|
| Total Evaluations | Total input/output checks performed |
| Block Rate | Percentage of requests blocked |
| Modification Rate | Percentage of requests modified |
| Top Triggered Rules | Most frequently triggered guardrails |
| Requests per Minute/Day | Current request volume |
Duplicating Configurations
To apply the same guardrail configuration to multiple models:
- Open the guardrail config for the source model
- Click Duplicate
- Select the target model(s)
- The configuration is copied to each target model
Exporting and Importing
Guardrail configurations can be exported as JSON and imported to other models or environments, enabling governance-as-code workflows.
Integration with Governance Policies
Guardrails integrate with the governance policy system at two levels:
1. Policy-Level Requirements
Governance policies can require specific guardrails on AI Gateway models:
- A policy stage can specify required guardrails (e.g., "PII detection must be enabled")
- The enforcement engine verifies these requirements before deployment
- Non-compliant models are blocked from deployment
2. Enforcement Checks
The enforcement engine includes guardrail compliance as part of its pre-deployment evaluation:
- Checks which guardrails are configured on the model
- Compares against policy requirements
- Reports missing guardrails as hard blocks, soft blocks, or warnings based on the policy's enforcement rules
See Enforcement for the complete enforcement workflow.
Best Practices
Start with Essentials
Enable these guardrails on all production AI models as a baseline:
- PII Detection (mask mode) — Protect sensitive data
- Content Filter (medium level) — Block harmful content
- Prompt Injection Prevention — Protect against manipulation
- Rate Limiting — Prevent abuse
Layer Security
- Use input guardrails to protect data sent to models (PII, injection)
- Use output guardrails to protect users from harmful responses (content filter, validation)
- Use both for comprehensive protection where applicable
Test Before Deploying
Always use the Test feature to validate guardrail behavior with representative content before enabling on production models. Check for:
- False positives (legitimate content being blocked)
- False negatives (harmful content passing through)
- PII masking accuracy
Monitor and Tune
- Review guardrail logs regularly to identify patterns
- Adjust filtering levels based on block rates
- High false-positive rates suggest the filtering level is too strict
- High false-negative rates suggest more guardrails or stricter settings are needed
- Export statistics for compliance reporting
When first deploying guardrails, start with the log action to understand the impact before switching to block or modify. This lets you tune thresholds without disrupting service.