AI Inference
Run AI inference through the Strongly AI Gateway. All inference endpoints proxy to the configured AI Gateway backend and support both synchronous and streaming responses. These endpoints are OpenAI-compatible, allowing drop-in replacement for existing OpenAI SDK integrations.
POST /api/v1/ai/chat/completions
Create a chat completion
Generates a model response for the given conversation. Compatible with the OpenAI Chat Completions API format.
Scope: ai-gateway:inference
Headers Forwarded:
| Header | Description |
|---|---|
| X-User-Id | Authenticated user ID |
| X-Request-Id | Request trace ID |
| X-Organization-ID | Organization context for multi-tenancy |
| Authorization | Bearer token passed to the upstream provider |
Request Body:
{
"model": "gpt-4",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Explain Kubernetes in one sentence." }
],
"stream": false,
"maxTokens": 1024,
"temperature": 0.7,
"topP": 1.0,
"stop": ["\n"]
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or name registered in the AI Gateway |
| messages | array | Yes | Conversation messages array |
| messages[].role | string | Yes | Message role: system, user, or assistant |
| messages[].content | string | Yes | Message content |
| stream | boolean | No | Enable SSE streaming (default: false) |
| maxTokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |
| topP | number | No | Nucleus sampling threshold (0.0 - 1.0) |
| stop | string or string[] | No | Stop sequence(s) |
Response: 200 OK
{
"id": "chatcmpl-abc123",
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Kubernetes is an open-source container orchestration platform."
},
"finishReason": "stop"
}
],
"usage": {
"promptTokens": 25,
"completionTokens": 12,
"totalTokens": 37
},
"created": 1706000000
}
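A minimal Python sketch of calling this endpoint, using only the standard library. The base URL, token variable, and helper names are illustrative, not part of the API; the payload fields match the request body table above.

```python
import json
import urllib.request

def build_chat_request(model, messages, stream=False, max_tokens=None, temperature=None):
    """Build a chat completion payload matching the documented request body."""
    body = {"model": model, "messages": messages, "stream": stream}
    if max_tokens is not None:
        body["maxTokens"] = max_tokens
    if temperature is not None:
        body["temperature"] = temperature
    return body

def chat_completion(base_url, token, payload):
    """POST the payload to /api/v1/ai/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/ai/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The Authorization header is forwarded to the upstream provider.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    "gpt-4",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Kubernetes in one sentence."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
# result = chat_completion("https://gateway.example.com", token, payload)
```

Because the endpoint is OpenAI-compatible, existing OpenAI SDK clients can also be pointed at the gateway by overriding their base URL.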
POST /api/v1/ai/completions
Create a text completion
Generates a completion for the given prompt. Useful for non-conversational text generation tasks.
Scope: ai-gateway:inference
Headers Forwarded:
| Header | Description |
|---|---|
| X-User-Id | Authenticated user ID |
| X-Request-Id | Request trace ID |
| X-Organization-ID | Organization context for multi-tenancy |
| Authorization | Bearer token passed to the upstream provider |
Request Body:
{
"model": "gpt-3.5-turbo-instruct",
"prompt": "Write a SQL query that selects all users where",
"stream": false,
"maxTokens": 256,
"temperature": 0.5
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or name registered in the AI Gateway |
| prompt | string | Yes | Text prompt to complete |
| stream | boolean | No | Enable SSE streaming (default: false) |
| maxTokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0.0 - 2.0) |
Response: 200 OK
{
"id": "cmpl-abc123",
"model": "gpt-3.5-turbo-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " status = 'active' ORDER BY created_at DESC;"
},
"finishReason": "stop"
}
],
"usage": {
"promptTokens": 12,
"completionTokens": 14,
"totalTokens": 26
},
"created": 1706000000
}
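A small helper for pulling the generated text and token usage out of the response shape documented above. The function name is illustrative; the sample object mirrors the example response.

```python
def extract_completion(response):
    """Return (content, total_tokens) from a completion response."""
    choice = response["choices"][0]
    content = choice["message"]["content"]
    total = response["usage"]["totalTokens"]
    return content, total

# Sample response, copied from the documented 200 OK body.
sample = {
    "id": "cmpl-abc123",
    "model": "gpt-3.5-turbo-instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " status = 'active' ORDER BY created_at DESC;",
            },
            "finishReason": "stop",
        }
    ],
    "usage": {"promptTokens": 12, "completionTokens": 14, "totalTokens": 26},
    "created": 1706000000,
}

content, total = extract_completion(sample)
# content → " status = 'active' ORDER BY created_at DESC;", total → 26
```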
POST /api/v1/ai/embeddings
Generate embeddings
Creates an embedding vector for the given input text. Supports single strings or batches of strings.
Scope: ai-gateway:inference
Request Body:
{
"model": "text-embedding-ada-002",
"input": "Kubernetes pod scheduling explained"
}
Or batch input:
{
"model": "text-embedding-ada-002",
"input": [
"Kubernetes pod scheduling explained",
"Docker container networking basics"
]
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Embedding model ID or name registered in the AI Gateway |
| input | string or string[] | Yes | Text to embed (single string or array of strings) |
Response: 200 OK
{
"data": [
{
"embedding": [0.0023064255, -0.009327292, 0.015797347, "..."],
"index": 0
}
],
"model": "text-embedding-ada-002",
"usage": {
"promptTokens": 6,
"totalTokens": 6
}
}
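Embeddings are typically compared with cosine similarity. The sketch below builds a batch request payload for this endpoint and shows the similarity computation; the helper names are illustrative, and in a real call the vectors would come back in `data`, ordered by `index`.

```python
import math

def build_embedding_request(model, texts):
    """Payload for POST /api/v1/ai/embeddings; input may be a string or a list."""
    return {"model": model, "input": texts}

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

payload = build_embedding_request(
    "text-embedding-ada-002",
    [
        "Kubernetes pod scheduling explained",
        "Docker container networking basics",
    ],
)

# Toy vectors for illustration; real embeddings are much longer.
sim = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])  # identical vectors → 1.0
```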
SSE Streaming Format
When stream is set to true on the chat completions or text completions endpoints, the response uses Server-Sent Events (SSE) instead of returning a single JSON object.
Response Headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
Stream Chunks:
Each chunk is delivered as a data: line containing a JSON object, terminated by a blank line (two newlines):
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Kubernetes"},"finishReason":null}],"created":1706000000}
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" is"},"finishReason":null}],"created":1706000000}
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" an"},"finishReason":null}],"created":1706000000}
data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{},"finishReason":"stop"}],"created":1706000000}
data: [DONE]
| Field | Description |
|---|---|
| delta.role | Present only in the first chunk |
| delta.content | Token content (may be empty in the final chunk) |
| finishReason | null during streaming, stop or length on the final content chunk |
| [DONE] | Signals the end of the stream |
When consuming the stream, concatenate all delta.content values to reconstruct the full response. The usage field is not included in streamed chunks.
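The consumption rule above can be sketched as a small parser over the `data:` lines. The function name is illustrative; the sample stream is taken from the chunks shown above.

```python
import json

def accumulate_sse(lines):
    """Reassemble the full response text from 'data:' SSE lines, stopping at [DONE]."""
    parts = []
    finish = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choice = chunk["choices"][0]
        # delta.content may be absent in the final chunk, so default to "".
        parts.append(choice["delta"].get("content", ""))
        if choice["finishReason"] is not None:
            finish = choice["finishReason"]
    return "".join(parts), finish

stream = [
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"role":"assistant","content":"Kubernetes"},"finishReason":null}],"created":1706000000}',
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{"content":" is"},"finishReason":null}],"created":1706000000}',
    'data: {"id":"chatcmpl-abc123","model":"gpt-4","choices":[{"index":0,"delta":{},"finishReason":"stop"}],"created":1706000000}',
    "data: [DONE]",
]

text, finish = accumulate_sse(stream)
# text → "Kubernetes is", finish → "stop"
```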