
AI Inference

Run AI inference through the Strongly AI Gateway. All inference endpoints proxy to the configured AI Gateway backend and support both synchronous and streaming responses. These endpoints are OpenAI-compatible, allowing drop-in replacement for existing OpenAI SDK integrations.


POST /api/v1/ai/chat/completions

Create a chat completion

Generates a model response for the given conversation. Compatible with the OpenAI Chat Completions API format.

Scope: ai-gateway:inference

Headers Forwarded:

| Header | Description |
| --- | --- |
| X-User-Id | Authenticated user ID |
| X-Request-Id | Request trace ID |
| X-Organization-ID | Organization context for multi-tenancy |
| Authorization | Bearer token passed to the upstream provider |

Request Body:

```json
{
  "model": "gpt-4",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Explain Kubernetes in one sentence." }
  ],
  "stream": false,
  "maxTokens": 1024,
  "temperature": 0.7,
  "topP": 1.0,
  "stop": ["\n"]
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID or name registered in the AI Gateway |
| messages | array | Yes | Conversation messages array |
| messages[].role | string | Yes | Message role: system, user, or assistant |
| messages[].content | string | Yes | Message content |
| stream | boolean | No | Enable SSE streaming (default: false) |
| maxTokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |
| topP | number | No | Nucleus sampling threshold (0.0 - 1.0) |
| stop | string or string[] | No | Stop sequence(s) |

Response: 200 OK

```json
{
  "id": "chatcmpl-abc123",
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Kubernetes is an open-source container orchestration platform."
      },
      "finishReason": "stop"
    }
  ],
  "usage": {
    "promptTokens": 25,
    "completionTokens": 12,
    "totalTokens": 37
  },
  "created": 1706000000
}
```
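As a sketch, a request can be assembled and sent with Python's `requests` library. The gateway host and bearer token below are placeholders for illustration, not values defined by this document:

```python
def build_chat_request(model, messages, stream=False, **options):
    """Assemble a request body matching the fields documented above."""
    payload = {"model": model, "messages": messages, "stream": stream}
    payload.update(options)  # e.g. maxTokens, temperature, topP, stop
    return payload

payload = build_chat_request(
    "gpt-4",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Kubernetes in one sentence."},
    ],
    maxTokens=1024,
    temperature=0.7,
)

# Hypothetical deployment URL and token -- substitute your own values.
# import requests
# resp = requests.post(
#     "https://gateway.example.com/api/v1/ai/chat/completions",
#     json=payload,
#     headers={"Authorization": "Bearer <token>"},
# )
# print(resp.json()["choices"][0]["message"]["content"])
```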

POST /api/v1/ai/completions

Create a text completion

Generates a completion for the given prompt. Useful for non-conversational text generation tasks.

Scope: ai-gateway:inference

Headers Forwarded:

| Header | Description |
| --- | --- |
| X-User-Id | Authenticated user ID |
| X-Request-Id | Request trace ID |
| X-Organization-ID | Organization context for multi-tenancy |
| Authorization | Bearer token passed to the upstream provider |

Request Body:

```json
{
  "model": "gpt-3.5-turbo-instruct",
  "prompt": "Write a SQL query that selects all users where",
  "stream": false,
  "maxTokens": 256,
  "temperature": 0.5
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID or name registered in the AI Gateway |
| prompt | string | Yes | Text prompt to complete |
| stream | boolean | No | Enable SSE streaming (default: false) |
| maxTokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |

Response: 200 OK

```json
{
  "id": "cmpl-abc123",
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " status = 'active' ORDER BY created_at DESC;"
      },
      "finishReason": "stop"
    }
  ],
  "usage": {
    "promptTokens": 12,
    "completionTokens": 14,
    "totalTokens": 26
  },
  "created": 1706000000
}
```
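The response envelope mirrors the chat endpoint, so the generated text and token counts can be pulled out with a small helper. A sketch against the sample response above:

```python
import json

# The sample response body from this section.
sample = json.loads("""{
  "id": "cmpl-abc123",
  "model": "gpt-3.5-turbo-instruct",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": " status = 'active' ORDER BY created_at DESC;"},
     "finishReason": "stop"}
  ],
  "usage": {"promptTokens": 12, "completionTokens": 14, "totalTokens": 26},
  "created": 1706000000
}""")

def extract_completion(resp):
    """Return (generated text, total token count) from a completion response."""
    choice = resp["choices"][0]
    return choice["message"]["content"], resp["usage"]["totalTokens"]

text, tokens = extract_completion(sample)
print(tokens)  # -> 26
```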

POST /api/v1/ai/embeddings

Generate embeddings

Creates an embedding vector for the given input text. Supports single strings or batches of strings.

Scope: ai-gateway:inference

Request Body:

```json
{
  "model": "text-embedding-ada-002",
  "input": "Kubernetes pod scheduling explained"
}
```

Or batch input:

```json
{
  "model": "text-embedding-ada-002",
  "input": [
    "Kubernetes pod scheduling explained",
    "Docker container networking basics"
  ]
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Embedding model ID or name registered in the AI Gateway |
| input | string or string[] | Yes | Text to embed (single string or array of strings) |

Response: 200 OK

```json
{
  "data": [
    {
      "embedding": [0.0023064255, -0.009327292, 0.015797347, "..."],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "promptTokens": 6,
    "totalTokens": 6
  }
}
```
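Batch embeddings are commonly compared with cosine similarity. A minimal, dependency-free sketch; the vectors below are toy values standing in for rows of the `data` array, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors in place of the two batch inputs above.
v1 = [0.1, 0.3, -0.2]
v2 = [0.1, 0.3, -0.2]
v3 = [-0.3, 0.1, 0.4]

print(cosine_similarity(v1, v2))  # identical vectors give a value of (approximately) 1.0
print(cosine_similarity(v1, v3))  # dissimilar vectors give a lower value
```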

SSE Streaming Format

When stream is set to true on the chat completions or text completions endpoints, the response uses Server-Sent Events (SSE) instead of returning a single JSON object.

Response Headers:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

Stream Chunks:

Each chunk is delivered as a data: line containing a JSON object, terminated by two newlines:

```
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Kubernetes"},"finishReason":null}],"created":1706000000}

data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" is"},"finishReason":null}],"created":1706000000}

data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" an"},"finishReason":null}],"created":1706000000}

data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{},"finishReason":"stop"}],"created":1706000000}

data: [DONE]
```

| Field | Description |
| --- | --- |
| delta.role | Present only in the first chunk |
| delta.content | Token content (may be empty in the final chunk) |
| finishReason | null during streaming; stop or length on the final content chunk |
| [DONE] | Signals the end of the stream |
> **Tip:** When consuming the stream, concatenate all delta.content values to reconstruct the full response. The usage field is not included in streamed chunks.
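The accumulation described above can be sketched as a small parser over raw SSE lines, shown here against sample chunks in the documented format:

```python
import json

def accumulate_stream(lines):
    """Concatenate delta.content across SSE data lines until [DONE]."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))  # final chunk has an empty delta
    return "".join(parts)

# Sample chunks in the documented stream format.
stream = [
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Kubernetes"},"finishReason":null}],"created":1706000000}',
    "",
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" is"},"finishReason":null}],"created":1706000000}',
    "",
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{},"finishReason":"stop"}],"created":1706000000}',
    "",
    "data: [DONE]",
]

print(accumulate_stream(stream))  # -> Kubernetes is
```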