AI Inference
Run AI inference through the Strongly AI Gateway. All inference endpoints proxy to the configured AI Gateway backend and support both synchronous and streaming responses. These endpoints are OpenAI-compatible, allowing drop-in replacement for existing OpenAI SDK integrations.
POST /api/v1/ai/chat/completions
Create a chat completion
Generates a model response for the given conversation. Compatible with the OpenAI Chat Completions API format.
Scope: ai-gateway:inference
Headers Forwarded:
| Header | Description |
|---|---|
| X-User-Id | Authenticated user ID |
| X-Request-Id | Request trace ID |
| X-Organization-ID | Organization context for multi-tenancy |
| Authorization | Bearer token passed to the upstream provider |
Request Body:
{
"model": "gpt-4",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Explain Kubernetes in one sentence." }
],
"stream": false,
"maxTokens": 1024,
"temperature": 0.7,
"topP": 1.0,
"stop": ["\n"]
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or name registered in the AI Gateway |
| messages | array | Yes | Conversation messages array |
| messages[].role | string | Yes | Message role: system, user, or assistant |
| messages[].content | string | Yes | Message content |
| stream | boolean | No | Enable SSE streaming (default: false) |
| maxTokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |
| topP | number | No | Nucleus sampling threshold (0.0 - 1.0) |
| stop | string or string[] | No | Stop sequence(s) |
Response: 200 OK
{
"id": "chatcmpl-abc123",
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Kubernetes is an open-source container orchestration platform."
},
"finishReason": "stop"
}
],
"usage": {
"promptTokens": 25,
"completionTokens": 12,
"totalTokens": 37
},
"created": 1706000000
}
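A minimal Python sketch of calling this endpoint, using only the standard library. The base URL, token variable, and helper names are illustrative, not part of the API; the payload fields match the request body table above.

```python
import json
import urllib.request

def build_chat_request(model, messages, stream=False, max_tokens=None, temperature=None):
    """Build a chat completion payload matching the documented request body."""
    body = {"model": model, "messages": messages, "stream": stream}
    if max_tokens is not None:
        body["maxTokens"] = max_tokens
    if temperature is not None:
        body["temperature"] = temperature
    return body

def chat_completion(base_url, token, payload):
    """POST the payload to /api/v1/ai/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/ai/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The Authorization header is forwarded to the upstream provider.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    "gpt-4",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Kubernetes in one sentence."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
# result = chat_completion("https://gateway.example.com", token, payload)
```

Because the endpoint is OpenAI-compatible, existing OpenAI SDK clients can also be pointed at the gateway by overriding their base URL.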
POST /api/v1/ai/completions
Create a text completion
Generates a completion for the given prompt. Useful for non-conversational text generation tasks.
Scope: ai-gateway:inference
Headers Forwarded:
| Header | Description |
|---|---|
| X-User-Id | Authenticated user ID |
| X-Request-Id | Request trace ID |
| X-Organization-ID | Organization context for multi-tenancy |
| Authorization | Bearer token passed to the upstream provider |
Request Body:
{
"model": "gpt-3.5-turbo-instruct",
"prompt": "Write a SQL query that selects all users where",
"stream": false,
"maxTokens": 256,
"temperature": 0.5
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or name registered in the AI Gateway |
| prompt | string | Yes | Text prompt to complete |
| stream | boolean | No | Enable SSE streaming (default: false) |
| maxTokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |
Response: 200 OK
{
"id": "cmpl-abc123",
"model": "gpt-3.5-turbo-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " status = 'active' ORDER BY created_at DESC;"
},
"finishReason": "stop"
}
],
"usage": {
"promptTokens": 12,
"completionTokens": 14,
"totalTokens": 26
},
"created": 1706000000
}
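A small helper for pulling the generated text and token usage out of the response shape documented above. The function name is illustrative; the sample object mirrors the example response.

```python
def extract_completion(response):
    """Return (content, total_tokens) from a completion response."""
    choice = response["choices"][0]
    content = choice["message"]["content"]
    total = response["usage"]["totalTokens"]
    return content, total

# Sample response, copied from the documented 200 OK body.
sample = {
    "id": "cmpl-abc123",
    "model": "gpt-3.5-turbo-instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " status = 'active' ORDER BY created_at DESC;",
            },
            "finishReason": "stop",
        }
    ],
    "usage": {"promptTokens": 12, "completionTokens": 14, "totalTokens": 26},
    "created": 1706000000,
}

content, total = extract_completion(sample)
# content → " status = 'active' ORDER BY created_at DESC;", total → 26
```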
POST /api/v1/ai/embeddings
Generate embeddings
Creates an embedding vector for the given input text. Supports single strings or batches of strings.
Scope: ai-gateway:inference
Request Body:
{
"model": "text-embedding-ada-002",
"input": "Kubernetes pod scheduling explained"
}
Or batch input:
{
"model": "text-embedding-ada-002",
"input": [
"Kubernetes pod scheduling explained",
"Docker container networking basics"
]
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Embedding model ID or name registered in the AI Gateway |
| input | string or string[] | Yes | Text to embed (single string or array of strings) |
Response: 200 OK
{
"data": [
{
"embedding": [0.0023064255, -0.009327292, 0.015797347, "..."],
"index": 0
}
],
"model": "text-embedding-ada-002",
"usage": {
"promptTokens": 6,
"totalTokens": 6
}
}
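Embeddings are typically compared with cosine similarity. The sketch below builds a batch request payload for this endpoint and shows the similarity computation; the helper names are illustrative, and in a real call the vectors would come back in `data`, ordered by `index`.

```python
import math

def build_embedding_request(model, texts):
    """Payload for POST /api/v1/ai/embeddings; input may be a string or a list."""
    return {"model": model, "input": texts}

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

payload = build_embedding_request(
    "text-embedding-ada-002",
    [
        "Kubernetes pod scheduling explained",
        "Docker container networking basics",
    ],
)

# Toy vectors for illustration; real embeddings are much longer.
sim = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])  # identical vectors → 1.0
```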
SSE Streaming Format
When stream is set to true on the chat completions or text completions endpoints, the response uses Server-Sent Events (SSE) instead of returning a single JSON object.
Response Headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Stream Chunks:
Each chunk is delivered as a data: line containing a JSON object, terminated by a blank line (two newlines):
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Kubernetes"},"finishReason":null}],"created":1706000000}
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" is"},"finishReason":null}],"created":1706000000}
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" an"},"finishReason":null}],"created":1706000000}
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{},"finishReason":"stop"}],"created":1706000000}
data: [DONE]
| Field | Description |
|---|---|
| delta.role | Present only in the first chunk |
| delta.content | Token content (may be empty in the final chunk) |
| finishReason | null during streaming, stop or length on the final content chunk |
| [DONE] | Signals the end of the stream |
When consuming the stream, concatenate all delta.content values to reconstruct the full response. The usage field is not included in streamed chunks.
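The consumption rule above can be sketched as a small parser over the `data:` lines. The function name is illustrative; the sample stream is taken from the chunks shown above.

```python
import json

def accumulate_sse(lines):
    """Reassemble the full response text from 'data:' SSE lines, stopping at [DONE]."""
    parts = []
    finish = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choice = chunk["choices"][0]
        # delta.content may be absent in the final chunk, so default to "".
        parts.append(choice["delta"].get("content", ""))
        if choice["finishReason"] is not None:
            finish = choice["finishReason"]
    return "".join(parts), finish

stream = [
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Kubernetes"},"finishReason":null}],"created":1706000000}',
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" is"},"finishReason":null}],"created":1706000000}',
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{},"finishReason":"stop"}],"created":1706000000}',
    "data: [DONE]",
]

text, finish = accumulate_sse(stream)
# text → "Kubernetes is", finish → "stop"
```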