
AI Guardrails

AI Guardrails provide real-time safety controls for AI models accessed through the AI Gateway. Guardrails inspect both input (prompts) and output (responses) to enforce content policies, detect sensitive data, prevent prompt injection attacks, and control costs.

How Guardrails Work

Guardrails are applied to every API request made to the AI Gateway's chat/completions endpoint:

```
User Request
     ↓
┌─────────────────────────┐
│ Input Guardrails        │ ← Check prompt before sending to model
│ (PII, injection,        │
│  content filter)        │
├─────────────────────────┤
│ BLOCK  → 400 error      │ ← If triggered, request is rejected
│ MODIFY → cleaned        │ ← If PII found, mask and continue
│ PASS   → original       │ ← No issues found
└─────────────────────────┘
     ↓
┌─────────────────────────┐
│ AI Model (LLM)          │ ← Model generates response
└─────────────────────────┘
     ↓
┌─────────────────────────┐
│ Output Guardrails       │ ← Check response before returning
│ (content filter,        │
│  PII, validation)       │
├─────────────────────────┤
│ BLOCK  → error          │ ← Response blocked
│ MODIFY → cleaned        │ ← Response sanitized
│ PASS   → original       │ ← No issues found
└─────────────────────────┘
     ↓
Response (with violation metadata)
```

All guardrail evaluations are logged for monitoring and compliance auditing.
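The flow above can be sketched as a simple pipeline. The function names, verdict strings, and the example rule below are illustrative, not the gateway's actual API:

```python
# Illustrative sketch of the guardrail pipeline described above.
# Rule and function names are hypothetical, not the gateway's real API.
import re

def mask_ssn(text):
    """Example MODIFY rule: mask all but the last 4 digits of an SSN."""
    masked = re.sub(r"\b\d{3}-\d{2}-(\d{4})\b", r"***-**-\1", text)
    return ("MODIFY" if masked != text else "PASS"), masked

def run_with_guardrails(prompt, input_rules, output_rules, model):
    for rule in input_rules:             # input guardrails check the prompt
        verdict, prompt = rule(prompt)
        if verdict == "BLOCK":
            return {"error": 400, "detail": "Request blocked by guardrails"}
    response = model(prompt)             # model generates a response
    for rule in output_rules:            # output guardrails check the response
        verdict, response = rule(response)
        if verdict == "BLOCK":
            return {"error": 400, "detail": "Response blocked by guardrails"}
    return {"content": response}

result = run_with_guardrails(
    "My SSN is 123-45-6789",
    input_rules=[mask_ssn],
    output_rules=[],
    model=lambda p: "Echo: " + p,
)
```

Here the input rule returns MODIFY, so the masked prompt continues to the (stubbed) model.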

Available Guardrails

Content Filter

Type: Input & Output | Category: Content Safety

Blocks harmful, offensive, or inappropriate content including adult/explicit material, hate speech, and NSFW content.

| Setting | Options | Description |
| --- | --- | --- |
| Filtering Level | `low`, `medium`, `high`, `strict` | Sensitivity threshold for content blocking |

  • Low — Blocks only explicitly harmful content
  • Medium — Blocks harmful and suggestive content
  • High — Blocks harmful, suggestive, and borderline content
  • Strict — Maximum filtering, may produce false positives

PII Detection

Type: Input & Output | Category: Security

Detects and masks personally identifiable information (PII) in prompts and responses. Prevents sensitive data from being sent to or returned by AI models.

| PII Type | Description | Example |
| --- | --- | --- |
| SSN | Social Security numbers | `***-**-1234` |
| Credit Card | Credit card numbers | `****-****-****-5678` |
| Email | Email addresses | `[EMAIL]` |
| Phone | Phone numbers | `[PHONE]` |
| Passport | Passport numbers | `[PASSPORT]` |
| Driver's License | Driver's license numbers | `[DRIVERS_LICENSE]` |

Configuration options:

| Setting | Options | Description |
| --- | --- | --- |
| Action | `mask`, `block` | Mask detected PII or block the entire request |
| PII Types | Multi-select | Which PII types to detect |
| Mask Mode | `char`, `uuid`, `labeled_uuid`, `partial`, `full` | How masked values appear |

Masking modes explained:

| Mode | Example Output | Description |
| --- | --- | --- |
| `char` | `****@example.com` | Character masking with a configurable mask character |
| `uuid` | `a1b2c3d4-e5f6-7890-abcd-ef1234567890` | Consistent hash-based UUID (same input = same UUID) |
| `labeled_uuid` | `EMAIL_a1b2c3d4` | Type prefix + UUID for tracking across masking |
| `partial` | `****user@example.com` | Shows last 4 characters for reference |
| `full` | `[EMAIL]` | Complete replacement with type label |
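A minimal sketch of these modes applied to an email address; the exact output formats the gateway produces may differ, and the hashing scheme here is an assumption:

```python
# Illustrative implementation of the masking modes above for an email value.
# The gateway's actual formats and hash scheme may differ.
import hashlib

def mask_email(value, mode, mask_char="*"):
    local, _, domain = value.partition("@")
    if mode == "char":
        # replace every character of the local part
        return mask_char * len(local) + "@" + domain
    if mode == "partial":
        # keep the last 4 characters of the local part for reference
        return mask_char * max(len(local) - 4, 0) + local[-4:] + "@" + domain
    if mode == "full":
        return "[EMAIL]"
    # hash-based modes: the same input always yields the same digest
    digest = hashlib.sha256(value.encode()).hexdigest()
    if mode == "labeled_uuid":
        return "EMAIL_" + digest[:8]
    if mode == "uuid":
        d = digest[:32]
        return f"{d[:8]}-{d[8:12]}-{d[12:16]}-{d[16:20]}-{d[20:]}"
    raise ValueError(f"unknown mode: {mode}")
```

The hash-based modes are deterministic, so the same email maps to the same token across requests, which preserves referential consistency in conversations.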

Prompt Injection Prevention

Type: Input | Category: Security

Detects and blocks prompt injection attacks where users attempt to override the system prompt or manipulate model behavior through specially crafted inputs.

Detection covers:

  • System prompt override attempts
  • Instruction injection patterns
  • Role-play manipulation
  • Context window exploitation
  • Delimiter injection
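Rule-based detection for the categories above can be sketched with regular expressions; the patterns below are illustrative samples, and a production deployment would pair them with ML-based detection and much richer pattern sets:

```python
# Illustrative rule-based injection patterns; real coverage would be broader
# and combined with ML-based classification.
import re

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",  # system prompt override
    r"disregard (the )?(system|above) prompt",       # system prompt override
    r"you are now\b",                                # role-play manipulation
    r"(^|\n)\s*system\s*:",                          # delimiter/role injection
]

def looks_like_injection(prompt: str) -> bool:
    """Return True if any known injection pattern matches the prompt."""
    text = prompt.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)
```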

Rate Limiting

Type: Input | Category: Performance

Enforces request frequency limits per model to prevent abuse and control costs.

| Setting | Description |
| --- | --- |
| Requests per Minute | Maximum API calls per minute |
| Requests per Day | Maximum API calls per day |

When limits are exceeded, subsequent requests receive a 429 Too Many Requests response with a Retry-After header.
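The per-minute limit can be pictured as a sliding window over recent request timestamps. This is only a sketch; the gateway's actual algorithm (e.g., token bucket vs. sliding window) is not specified here:

```python
# Sliding-window sketch of the per-minute rate limit described above.
# The gateway's real algorithm may differ.
import time
from collections import deque

class RateLimiter:
    def __init__(self, per_minute):
        self.per_minute = per_minute
        self.timestamps = deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # drop calls that fell outside the 60-second window
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.per_minute:
            return False  # caller would return 429 with a Retry-After header
        self.timestamps.append(now)
        return True
```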

Cost Control

Type: Input | Category: Cost Management

Enforces spending limits per model with budget alerts and automatic cutoff.

| Setting | Description |
| --- | --- |
| Monthly Budget | Maximum monthly spend (USD) |
| Alert Threshold | Percentage at which to send budget alerts (e.g., 80%) |
| Auto-Cutoff | Whether to block requests when budget is exceeded |
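The decision logic implied by these settings can be sketched as follows; the function and return values are illustrative, not the gateway's API:

```python
# Illustrative cost-control decision based on the settings above.
def budget_status(spent, monthly_budget, alert_threshold=0.8, auto_cutoff=True):
    if auto_cutoff and spent >= monthly_budget:
        return "block"  # budget exceeded: requests rejected
    if spent >= monthly_budget * alert_threshold:
        return "alert"  # past the alert threshold: notify, but allow
    return "allow"
```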

Output Validation

Type: Output | Category: Quality

Validates model responses against schemas or patterns to ensure consistent, well-formatted output.

| Setting | Description |
| --- | --- |
| Validation Type | `json_schema` or `regex` |
| Schema/Pattern | JSON Schema definition or regex pattern |
| Action on Failure | `block` or `log` |

Data Retention Policy

Type: Input & Output | Category: Compliance

Controls how long request and response data is retained for compliance with data protection regulations.

| Setting | Description |
| --- | --- |
| Retention Period | How long to retain request/response logs |
| Auto-Delete | Whether to automatically purge after the retention period |

Model Version Lock

Type: Configuration | Category: Operational

Prevents automatic model version upgrades, ensuring consistent behavior across deployments.

| Setting | Description |
| --- | --- |
| Locked Version | Specific model version to pin to |
| Allow Minor Updates | Whether to allow patch/minor version changes |
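Assuming semver-style version strings, the lock check can be sketched as:

```python
# Illustrative version-lock check assuming semver-style versions ("2.1.0").
def upgrade_allowed(locked, candidate, allow_minor_updates=False):
    if candidate == locked:
        return True
    if not allow_minor_updates:
        return False  # hard pin: only the locked version is permitted
    l_major, *_ = locked.split(".")
    c_major, *_ = candidate.split(".")
    return l_major == c_major  # minor/patch changes within the same major
```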

Configuring Guardrails

From the Guardrails Dashboard

  1. Navigate to AI Gateway > Guardrails in the sidebar
  2. The dashboard shows all models with their guardrail status:
    • Protected Models — Models with active guardrails
    • Total Rules — Total active guardrail rules across all models
    • Blocked Today — Requests blocked by guardrails in the last 24 hours
    • Modified Today — Requests modified (e.g., PII masked) in the last 24 hours
  3. Click Configure on any model to open the guardrail editor

Guardrail Configuration Editor

The configuration editor has multiple sections:

Predefined Rules Tab

Select from the built-in guardrail types listed above. For each selected guardrail:

  1. Toggle the guardrail on/off
  2. Configure guardrail-specific settings (filtering level, PII types, rate limits, etc.)
  3. Set the action: block, modify, log, or alert
  4. Set priority order (higher-priority rules are evaluated first)

Custom Rules Tab

Create custom content rules with conditions and actions:

Condition operators:

| Operator | Description |
| --- | --- |
| `contains` | Content contains the specified substring |
| `equals` | Content exactly matches the value |
| `matches` | Pattern match |
| `regex` | Regular expression match |
| `greater_than` | Numeric comparison (for token counts, etc.) |
| `less_than` | Numeric comparison |

Action types:

| Action | Description |
| --- | --- |
| `block` | Reject the request entirely |
| `modify` | Transform the content (e.g., replace matched text) |
| `log` | Log the violation but allow the request |
| `alert` | Send an alert notification |

Example custom rule:

```json
{
  "name": "Block Competitor Mentions",
  "description": "Prevent discussion of competitor products",
  "type": "input",
  "enabled": true,
  "priority": 5,
  "conditions": [
    {
      "field": "content",
      "operator": "regex",
      "value": "\\b(CompetitorA|CompetitorB)\\b",
      "case_sensitive": false
    }
  ],
  "actions": [
    {
      "type": "block",
      "config": {
        "message": "Content references competitor products"
      }
    }
  ]
}
```
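Evaluating such a rule's conditions against request content could look like the sketch below; only a subset of operators is shown, and the evaluator itself is illustrative:

```python
# Illustrative evaluator for a custom-rule condition like the example above.
import re

def condition_matches(condition, content):
    value = condition["value"]
    op = condition["operator"]
    if op == "regex":
        flags = 0 if condition.get("case_sensitive", True) else re.IGNORECASE
        return re.search(value, content, flags) is not None
    if op == "contains":
        return value in content
    if op == "equals":
        return content == value
    raise NotImplementedError(op)

cond = {
    "field": "content",
    "operator": "regex",
    "value": r"\b(CompetitorA|CompetitorB)\b",
    "case_sensitive": False,
}
```

Because `case_sensitive` is false, the regex also matches lowercase mentions such as "competitora".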

Code Editor Tab

For advanced users, the full guardrail configuration can be edited as JSON using the built-in code editor (CodeMirror). This allows precise control over all guardrail settings.

Test Tab

Test guardrails against sample input before deploying:

  1. Enter test content in the input field
  2. Select direction (input or output)
  3. Click Run Test
  4. View results showing:
    • Which rules were triggered
    • What action was taken (block, modify, pass)
    • Modified content (if applicable)
    • Confidence scores for ML-based detection

From Governance Policies

Guardrail requirements can also be enforced through governance policies:

  1. Create a policy with aiGatewayModel as an applicable resource type
  2. Add a stage with guardrail requirement fields
  3. Select which guardrails must be enabled (required) and which are recommended
  4. The Enforcement Engine verifies guardrail compliance during pre-deployment checks

This integrates guardrails into the broader governance workflow, ensuring models meet organizational safety standards before deployment.

Guardrail Actions

When a guardrail is triggered, one of these actions is taken:

| Action | Description | HTTP Response |
| --- | --- | --- |
| Block | Request is rejected entirely | `400 Bad Request` with triggered rule details |
| Modify | Content is cleaned/masked and request continues | `200 OK` with modified content |
| Log | Violation is logged but request proceeds unchanged | `200 OK` with violation in metadata |
| Alert | Notification sent, request proceeds | `200 OK` with alert in metadata |
| Redirect | Request redirected to alternative handler | Varies |

Block Response Format

When a request is blocked by guardrails:

```json
{
  "detail": "Request blocked by guardrails: pii-filter, content-filter"
}
```
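A client could recover the triggered rule names from that `detail` string as sketched below; this assumes the exact message format shown above:

```python
# Illustrative client-side parsing of a guardrail block response, assuming
# the "detail" format shown above.
def triggered_rules(error_body):
    detail = error_body.get("detail", "")
    prefix = "Request blocked by guardrails:"
    if not detail.startswith(prefix):
        return []
    return [r.strip() for r in detail[len(prefix):].split(",") if r.strip()]
```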

ML-Based Detection

The guardrails system supports ML-based detection for enhanced accuracy (when ML models are available on the gateway):

| Detection Type | Model/Library | Description |
| --- | --- | --- |
| Toxicity | Detoxify | Toxic content detection with confidence scores |
| PII (ML) | Microsoft Presidio | Named entity recognition for PII |
| Prompt Injection | DistilBERT | ML-based injection pattern detection |
| Hallucination | BART zero-shot | Factual consistency checking |
| Language Detection | langdetect | Detect content language |
| Named Entity Recognition | spaCy | Entity extraction for sensitive data |

ML-based detection runs in parallel using a thread pool for minimal latency impact. Results are cached for 5 minutes to avoid re-computation on similar content.

When ML models are not available, the system falls back to rule-based pattern matching using regex and keyword detection.
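The caching and fallback behavior can be sketched as below; the cache keying, eviction, and detector signatures are illustrative, not the gateway's internals:

```python
# Illustrative 5-minute result cache with rule-based fallback, mirroring the
# behavior described above. Keying and detector signatures are assumptions.
import time

class DetectionCache:
    def __init__(self, ttl=300):  # 5-minute TTL
        self.ttl = ttl
        self.store = {}  # content -> (result, timestamp)

    def detect(self, content, ml_detector=None, fallback=None, now=None):
        now = time.monotonic() if now is None else now
        cached = self.store.get(content)
        if cached and now - cached[1] < self.ttl:
            return cached[0]  # reuse recent result, skip re-computation
        # prefer the ML detector; fall back to rule-based matching
        detector = ml_detector if ml_detector is not None else fallback
        result = detector(content)
        self.store[content] = (result, now)
        return result
```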

Monitoring Guardrails

Activity Logs

Navigate to the guardrail configuration for any model and open the Logs tab to view recent activity:

| Field | Description |
| --- | --- |
| Timestamp | When the evaluation occurred |
| Direction | `input` or `output` |
| Triggered Rules | Which guardrails fired |
| Action Taken | Block, modify, or log |
| Confidence | ML confidence score (if applicable) |
| Content Info | Original vs. modified content length |

Guardrail logs are retained for 15 days (TTL index) and are automatically deleted after that period.

Statistics

Each model's guardrail configuration tracks aggregate statistics:

| Metric | Description |
| --- | --- |
| Total Evaluations | Total input/output checks performed |
| Block Rate | Percentage of requests blocked |
| Modification Rate | Percentage of requests modified |
| Top Triggered Rules | Most frequently triggered guardrails |
| Requests per Minute/Day | Current request volume |

Duplicating Configurations

To apply the same guardrail configuration to multiple models:

  1. Open the guardrail config for the source model
  2. Click Duplicate
  3. Select the target model(s)
  4. The configuration is copied to each target model

Exporting and Importing

Guardrail configurations can be exported as JSON and imported to other models or environments, enabling governance-as-code workflows.

Integration with Governance Policies

Guardrails integrate with the governance policy system at two levels:

1. Policy-Level Requirements

Governance policies can require specific guardrails on AI Gateway models:

  • A policy stage can specify required guardrails (e.g., "PII detection must be enabled")
  • The enforcement engine verifies these requirements before deployment
  • Non-compliant models are blocked from deployment

2. Enforcement Checks

The enforcement engine includes guardrail compliance as part of its pre-deployment evaluation:

  • Checks which guardrails are configured on the model
  • Compares against policy requirements
  • Reports missing guardrails as hard blocks, soft blocks, or warnings based on the policy's enforcement rules

See Enforcement for the complete enforcement workflow.

Best Practices

Start with Essentials

Enable these guardrails on all production AI models as a baseline:

  1. PII Detection (mask mode) — Protect sensitive data
  2. Content Filter (medium level) — Block harmful content
  3. Prompt Injection Prevention — Protect against manipulation
  4. Rate Limiting — Prevent abuse

Layer Security

  • Use input guardrails to protect data sent to models (PII, injection)
  • Use output guardrails to protect users from harmful responses (content filter, validation)
  • Use both for comprehensive protection where applicable

Test Before Deploying

Always use the Test feature to validate guardrail behavior with representative content before enabling on production models. Check for:

  • False positives (legitimate content being blocked)
  • False negatives (harmful content passing through)
  • PII masking accuracy

Monitor and Tune

  • Review guardrail logs regularly to identify patterns
  • Adjust filtering levels based on block rates
  • High false-positive rates suggest the filtering level is too strict
  • High false-negative rates suggest more guardrails or stricter settings are needed
  • Export statistics for compliance reporting
Tip: When first deploying guardrails, start with the log action to understand the impact before switching to block or modify. This lets you tune thresholds without disrupting service.