
Self-Hosted Models

Deploy 169 pre-configured open-source AI models on your own infrastructure. Each model comes with a production-ready Dockerfile and is routed through the AI Gateway's standardized API endpoints — the same endpoints used by third-party providers like OpenAI, Anthropic, and Google.

Model Catalogue

| Category | Count | Model Types |
| --- | --- | --- |
| LLM | 83 | Chat, code generation, reasoning |
| Multimodal | 32 | Vision-language models (VLMs) |
| Embedding | 12 | Text embeddings for RAG/search |
| Audio | 34 | Text-to-speech (16), speech-to-text (18) |
| Video | 6 | Text-to-video generation |
| Image | 1 | Image generation (Stable Diffusion) |
| NLP | 1 | Specialized NLP tasks |

Provider & Engine Architecture

All self-hosted models use provider: self_hosted. The engine field determines how the AI Gateway routes requests — one provider maps to many engines.

Provider vs Engine

| Concept | What It Means | Examples |
| --- | --- | --- |
| Provider | Who provides the model (the vendor) | self_hosted, openai, anthropic, custom |
| Engine | How to communicate with the inference server | vllm, transformers, whisper, tts-engine, video-engine, custom |

Engine Routing

| Engine | Internal Handler | Used For | API Format |
| --- | --- | --- | --- |
| vllm | VLLMProvider | LLM chat, multimodal, embedding | OpenAI-compatible /v1/chat/completions |
| transformers | VLLMProvider | Custom transformers servers | OpenAI-compatible /v1/chat/completions |
| custom | CustomProvider | Generic custom servers | Configurable endpoints |
| whisper | CustomProvider | Speech-to-text models | /v1/audio/transcriptions |
| tts-engine | CustomProvider | Text-to-speech models | /v1/audio/speech |
| video-engine | CustomProvider | Video generation models | /v1/video/generations |

How Routing Works

  1. A user sends a request (e.g., POST /v1/chat/completions).
  2. The AI Gateway receives the request with a model ID.
  3. The gateway looks up the model config: provider=self_hosted, engine=vllm.
  4. The Provider Factory sees provider=self_hosted and reads the engine field:
     • engine vllm or transformers → VLLMProvider
     • engine custom, whisper, tts-engine, or video-engine → CustomProvider
  5. The provider forwards the request to the self-hosted model's internal endpoint.
  6. The response is normalized to the standard format and returned.
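The engine-to-handler rule described above can be sketched in a few lines of Python. VLLMProvider and CustomProvider are the handler names from the routing table; the resolver function itself is illustrative, not the platform's actual code:

```python
# Illustrative sketch of the engine-routing rule (not the platform's real code).
VLLM_ENGINES = {"vllm", "transformers"}
CUSTOM_ENGINES = {"custom", "whisper", "tts-engine", "video-engine"}

def resolve_handler(provider: str, engine: str) -> str:
    """Map a (provider, engine) pair to the internal handler name."""
    if provider != "self_hosted":
        # Third-party providers (openai, anthropic, ...) have their own handlers.
        return provider
    if engine in VLLM_ENGINES:
        return "VLLMProvider"
    if engine in CUSTOM_ENGINES:
        return "CustomProvider"
    raise ValueError(f"unknown engine: {engine}")
```

Note that the engine only matters when provider is self_hosted; for third-party providers the provider name alone selects the handler.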

Engine Capabilities

VLLMProvider (engines: vllm, transformers):

| Capability | Supported | Endpoint |
| --- | --- | --- |
| Chat completions | Yes | /v1/chat/completions |
| Streaming | Yes | /v1/chat/completions (SSE) |
| Embeddings | Yes | /v1/embeddings |
| Image generation | Yes | /v1/images/generations |
| Vision/Multimodal | Yes | /v1/chat/completions with image content |
| Tool calling | Yes | Via --tool-call-parser vLLM flag |

CustomProvider (engines: custom, whisper, tts-engine, video-engine):

| Capability | Supported | Endpoint |
| --- | --- | --- |
| Chat completions | Yes | /v1/chat/completions |
| Streaming | Yes | /v1/chat/completions (SSE) |
| Embeddings | Yes | /v1/embeddings |
| Text-to-speech | Yes | /v1/audio/speech |
| Speech-to-text | Yes | /v1/audio/transcriptions |
| Image generation | Yes | /v1/images/generations |
| Video generation | Yes | /v1/video/generations |

Inference Engines

vLLM Engine (113 models)

The majority of models use the vLLM inference engine — a high-performance serving framework featuring PagedAttention, continuous batching, and an OpenAI-compatible API.

Base Image: vllm/vllm-openai:v0.16.0

Key vLLM flags:

  • --tensor-parallel-size N — Distribute model across N GPUs
  • --gpu-memory-utilization 0.9 — GPU memory fraction to use
  • --max-model-len N — Maximum context length
  • --served-model-name name — Model name exposed via API
  • --trust-remote-code — Required for some HuggingFace models
  • --tool-call-parser hermes — Enable function calling support
  • --enable-reasoning — Enable chain-of-thought reasoning
  • --limit-mm-per-prompt image=4 — Limit multimodal inputs per request

Example: Llama 3.1 8B

FROM vllm/vllm-openai:v0.16.0
ENV MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--served-model-name", "llama-3.1-8b", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "32768"]

Transformers Engine (24 models)

Custom Python servers using HuggingFace Transformers directly. Used when a model isn't supported by vLLM or needs custom preprocessing.

The server must expose a standard OpenAI-compatible /v1/chat/completions endpoint accepting JSON:

{
  "model": "model-name",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 512
}

For multimodal transformers models, the server must handle both text-only and image+text messages in OpenAI format:

{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What do you see?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
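For the image_url part, vision servers accept base64 data URLs as shown above. A small helper for building that content part from raw image bytes (the helper name is ours, not part of the gateway API):

```python
import base64

def image_part(png_bytes: bytes) -> dict:
    """Build an OpenAI-format image content part from raw PNG bytes."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    }
```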

TTS Engine (16 models)

Text-to-speech models that expose /v1/audio/speech endpoint.

Request format:

{
  "input": "Text to speak",
  "voice": "default",
  "model": "model-name",
  "response_format": "mp3",
  "speed": 1.0
}

Response: Binary audio stream.

Whisper Engine (5 models)

Speech-to-text models using OpenAI Whisper-compatible API at /v1/audio/transcriptions.

Request format: multipart/form-data with audio file upload.

Response:

{
  "text": "Transcribed text content"
}
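A hand-rolled multipart/form-data encoder shows what the upload body looks like on the wire. In practice an HTTP client library handles this; the helper below is illustrative, and the field names (model, file) follow the OpenAI transcription convention:

```python
import io
import uuid

def multipart_audio(file_name: str, audio: bytes, model: str):
    """Encode a minimal multipart/form-data body for an STT upload."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def write_part(headers: str, payload: bytes) -> None:
        body.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        body.write(payload + b"\r\n")

    write_part('Content-Disposition: form-data; name="model"', model.encode())
    write_part(
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream",
        audio,
    )
    body.write(f"--{boundary}--\r\n".encode())
    content_type = f"multipart/form-data; boundary={boundary}"
    return content_type, body.getvalue()
```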

Video Engine (6 models)

Text-to-video generation models that expose /v1/video/generations endpoint.

Models: CogVideoX-2B, CogVideoX-5B, HunyuanVideo, Wan2.1-T2V-1.3B, Wan2.1-T2V-14B, LTX-Video

Request format:

{
  "prompt": "A ball bouncing in slow motion",
  "duration": 5,
  "resolution": "1080p",
  "aspect_ratio": "16:9",
  "fps": 24
}

Response: Either JSON with job ID (async) or binary video data (sync), depending on the model.
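Because the response shape varies by model, clients should branch on the Content-Type header. A minimal sketch (the function name is ours):

```python
import json

def parse_video_response(content_type: str, body: bytes):
    """Return ('job', dict) for async JSON responses, ('video', bytes) for sync."""
    if content_type.startswith("application/json"):
        return "job", json.loads(body)
    return "video", body
```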


Model Types & API Endpoints

Each model type maps to a specific AI Gateway endpoint:

| Model Type | Gateway Endpoint | Proxy Path |
| --- | --- | --- |
| chat | POST /v1/chat/completions | /api/v1/ai/chat/completions |
| multimodal | POST /v1/chat/completions | /api/v1/ai/chat/completions |
| embedding | POST /v1/embeddings | /api/v1/ai/embeddings |
| text_to_speech | POST /v1/audio/speech | /api/v1/ai/audio/speech |
| speech_to_text | POST /v1/audio/transcriptions | /api/v1/ai/audio/transcriptions |
| image_generation | POST /v1/generations/images | /api/v1/ai/generations/images |
| video_generation | POST /v1/generations/videos | /api/v1/ai/generations/videos |

Important: Multimodal models use the same /v1/chat/completions endpoint as chat models. The gateway accepts both chat and multimodal model types for chat completions.


Deployment Lifecycle

1. Create Deployment

When you deploy a self-hosted model, the platform:

  1. Reads the model's Dockerfile from the catalogue
  2. Builds the Docker image using a K8s build job (Docker-in-Docker)
  3. Pushes the image to ECR
  4. Creates a Kubernetes Deployment + Service in the org namespace
  5. Registers the model in the AI Gateway with provider: self_hosted and the model's engine field

2. Model Status Flow

building → deploying → active → running
(any stage may transition to error on build/deploy failure)
  • building: Docker image is being built
  • deploying: K8s resources are being created
  • active: K8s Deployment exists, pod is starting
  • running: Model is serving inference requests
  • error: Build or deployment failed

3. Inference Readiness

After a model reaches "active" status, the inference server still needs time to:

  • Pull the Docker image
  • Download model weights from HuggingFace
  • Load weights into GPU memory
  • Start the HTTP server

The gateway handles this transparently — requests return 503 until the model is ready.
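Client-side, that means inference calls made right after deployment should be retried rather than treated as fatal. A polling sketch, assuming a send_request callable that returns an HTTP status code (both names are ours, for illustration):

```python
import time

def wait_until_ready(send_request, max_wait: float = 600, interval: float = 5) -> int:
    """Poll until the endpoint stops returning 503 (model still loading)."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = send_request()
        if status != 503:
            return status
        time.sleep(interval)
    raise TimeoutError("model did not become ready in time")
```

For large models the weight download alone can take many minutes, so choose max_wait accordingly.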


GPU Requirements

| GPU Count | Instance Types | Example Models |
| --- | --- | --- |
| 0 (CPU) | t3.large, c5.2xlarge | Whisper (tiny/base/small), smaller TTS models |
| 1 | g5.xlarge, g5.2xlarge, g4dn.xlarge | 7B-8B LLMs, most multimodal, video gen |
| 2 | g5.12xlarge | 13B LLMs, larger multimodal models |
| 4 | g5.12xlarge, p3.8xlarge | 34B-72B LLMs, large vision models |
| 8 | Custom | 70B+ LLMs, frontier models |

Adding Custom Models

To add a new self-hosted model to the catalogue, add an entry to self-hosted-models.js:

{
  id: 'my-custom-model',
  name: 'My Custom Model',
  description: 'Description of the model',
  category: 'llm', // llm, multimodal, embedding, audio, video, image
  provider: 'vendor-name',
  modelType: 'chat', // chat, multimodal, embedding, text_to_speech, etc.
  engine: 'vllm', // vllm, transformers, custom, whisper, tts-engine, video-engine
  dockerfile: `...`, // Full Dockerfile content
  defaultPort: 8000,
  defaultResources: { cpu: '4000m', memory: '16Gi', gpu: 1, disk: '50Gi' },
  recommendedInstance: 'g5.xlarge',
  tags: ['llm', 'chat', '7b'],
  modelSize: '14GB',
  documentation: '# My Model\n\nModel documentation here.'
}

Engine Selection Guide

| Choose This Engine | When |
| --- | --- |
| vllm | Model is supported by vLLM (most LLMs, VLMs, embeddings) |
| transformers | Custom HuggingFace model not in vLLM, or needs custom preprocessing |
| custom | Non-standard inference server with custom API |
| whisper | OpenAI Whisper-compatible STT model |
| tts-engine | Text-to-speech model exposing /v1/audio/speech |
| video-engine | Video generation model exposing /v1/video/generations |

Server Requirements

All self-hosted model servers must expose:

  1. Health endpoint: GET /health returning {"status": "ok"}
  2. Inference endpoint: The appropriate endpoint for the model type (see table above)
  3. Port 8000: Default port (configurable via defaultPort)

The inference endpoint must accept and return standard JSON — not multipart form data or custom formats (except STT which uses multipart for file upload).