Self-Hosted Models
Deploy 169 pre-configured open-source AI models on your own infrastructure. Each model comes with a production-ready Dockerfile and is routed through the AI Gateway's standardized API endpoints — the same endpoints used by third-party providers like OpenAI, Anthropic, and Google.
Model Catalogue
| Category | Count | Model Types |
|---|---|---|
| LLM | 83 | Chat, code generation, reasoning |
| Multimodal | 32 | Vision-language models (VLMs) |
| Embedding | 12 | Text embeddings for RAG/search |
| Audio | 34 | Text-to-speech (16), speech-to-text (18) |
| Video | 6 | Text-to-video generation |
| Image | 1 | Image generation (Stable Diffusion) |
| NLP | 1 | Specialized NLP tasks |
Provider & Engine Architecture
All self-hosted models use `provider: self_hosted`. The `engine` field determines how the AI Gateway routes requests — one provider maps to many engines.
Provider vs Engine
| Concept | What It Means | Examples |
|---|---|---|
| Provider | Who provides the model (the vendor) | self_hosted, openai, anthropic, custom |
| Engine | How to communicate with the inference server | vllm, transformers, whisper, tts-engine, video-engine, custom |
Engine Routing
| Engine | Internal Handler | Used For | API Format |
|---|---|---|---|
| vllm | VLLMProvider | LLM chat, multimodal, embedding | OpenAI-compatible /v1/chat/completions |
| transformers | VLLMProvider | Custom transformers servers | OpenAI-compatible /v1/chat/completions |
| custom | CustomProvider | Generic custom servers | Configurable endpoints |
| whisper | CustomProvider | Speech-to-text models | /v1/audio/transcriptions |
| tts-engine | CustomProvider | Text-to-speech models | /v1/audio/speech |
| video-engine | CustomProvider | Video generation models | /v1/video/generations |
How Routing Works
```
User Request (e.g., POST /v1/chat/completions)
    ↓
AI Gateway receives request with model ID
    ↓
Looks up model config: provider=self_hosted, engine=vllm
    ↓
Provider Factory sees provider=self_hosted, reads engine field:
  - engine: vllm/transformers → VLLMProvider
  - engine: custom/whisper/tts-engine/video-engine → CustomProvider
    ↓
Provider forwards request to self-hosted model's internal endpoint
    ↓
Response is normalized to standard format and returned
```
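The factory lookup above can be sketched in a few lines. This is an illustrative Python sketch, not the gateway's actual code — only the engine and handler names come from the routing table:

```python
# Engine-to-handler routing as described in the tables above.
# Handler names are the internal handlers; the function itself is hypothetical.
ENGINE_TO_HANDLER = {
    "vllm": "VLLMProvider",
    "transformers": "VLLMProvider",
    "custom": "CustomProvider",
    "whisper": "CustomProvider",
    "tts-engine": "CustomProvider",
    "video-engine": "CustomProvider",
}

def resolve_handler(provider: str, engine: str) -> str:
    """Pick the internal handler for a self-hosted model config."""
    if provider != "self_hosted":
        raise ValueError(f"not a self-hosted model: {provider}")
    try:
        return ENGINE_TO_HANDLER[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine}") from None
```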
Engine Capabilities
VLLMProvider (engines: vllm, transformers):
| Capability | Supported | Endpoint |
|---|---|---|
| Chat completions | Yes | /v1/chat/completions |
| Streaming | Yes | /v1/chat/completions (SSE) |
| Embeddings | Yes | /v1/embeddings |
| Image generation | Yes | /v1/images/generations |
| Vision/Multimodal | Yes | /v1/chat/completions with image content |
| Tool calling | Yes | Via --tool-call-parser vLLM flag |
CustomProvider (engines: custom, whisper, tts-engine, video-engine):
| Capability | Supported | Endpoint |
|---|---|---|
| Chat completions | Yes | /v1/chat/completions |
| Streaming | Yes | /v1/chat/completions (SSE) |
| Embeddings | Yes | /v1/embeddings |
| Text-to-speech | Yes | /v1/audio/speech |
| Speech-to-text | Yes | /v1/audio/transcriptions |
| Image generation | Yes | /v1/images/generations |
| Video generation | Yes | /v1/video/generations |
Inference Engines
vLLM Engine (113 models)
The majority of models use the vLLM inference engine — a high-performance serving framework with PagedAttention, continuous batching, and an OpenAI-compatible API.
Base Image: vllm/vllm-openai:v0.16.0
Key vLLM flags:
- `--tensor-parallel-size N` — Distribute model across N GPUs
- `--gpu-memory-utilization 0.9` — GPU memory fraction to use
- `--max-model-len N` — Maximum context length
- `--served-model-name name` — Model name exposed via API
- `--trust-remote-code` — Required for some HuggingFace models
- `--tool-call-parser hermes` — Enable function calling support
- `--enable-reasoning` — Enable chain-of-thought reasoning
- `--limit-mm-per-prompt image=4` — Limit multimodal inputs per request
Example: Llama 3.1 8B
```dockerfile
FROM vllm/vllm-openai:v0.16.0
ENV MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--served-model-name", "llama-3.1-8b", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "32768"]
```
Transformers Engine (24 models)
Custom Python servers using HuggingFace Transformers directly. Used when a model isn't supported by vLLM or needs custom preprocessing.
Must expose: a standard OpenAI-compatible `/v1/chat/completions` endpoint accepting JSON:
```json
{
  "model": "model-name",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 512
}
```
For multimodal transformers models, the server must handle both text-only and image+text messages in OpenAI format:
```json
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What do you see?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
```
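Assembling such a mixed message by hand is mostly base64 plumbing. A minimal Python helper — the function name is ours; the message shape follows the JSON above:

```python
import base64

def image_message(text: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build an OpenAI-format user message mixing text and an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }
```

The resulting dict drops straight into the `messages` array of a chat completion request.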
TTS Engine (16 models)
Text-to-speech models that expose the `/v1/audio/speech` endpoint.
Request format:
```json
{
  "input": "Text to speak",
  "voice": "default",
  "model": "model-name",
  "response_format": "mp3",
  "speed": 1.0
}
```
Response: Binary audio stream.
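A request body can be built and sanity-checked before sending. An illustrative Python sketch — the accepted `response_format` values and the `speed` bounds here are assumptions, not documented limits:

```python
def tts_payload(text: str, model: str, voice: str = "default",
                fmt: str = "mp3", speed: float = 1.0) -> dict:
    """Build a /v1/audio/speech request body (field names from the format above)."""
    if not text:
        raise ValueError("input text is required")
    if fmt not in {"mp3", "wav", "opus", "flac"}:  # assumed format list
        raise ValueError(f"unsupported response_format: {fmt}")
    if not 0.25 <= speed <= 4.0:  # assumed bounds
        raise ValueError("speed out of range")
    return {"input": text, "voice": voice, "model": model,
            "response_format": fmt, "speed": speed}
```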
Whisper Engine (5 models)
Speech-to-text models using the OpenAI Whisper-compatible API at `/v1/audio/transcriptions`.
Request format: multipart/form-data with audio file upload.
Response:
```json
{
  "text": "Transcribed text content"
}
```
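Python's standard library has no multipart encoder, so the upload body must be assembled by hand. A minimal sketch — the `file` and `model` field names mirror OpenAI's transcription API and may differ on a given server:

```python
import uuid

def encode_multipart(file_name: str, audio: bytes, model: str):
    """Encode a multipart/form-data body with a 'model' field and a 'file' part."""
    boundary = uuid.uuid4().hex
    headers = "\r\n".join([
        f"--{boundary}",
        'Content-Disposition: form-data; name="model"',
        "",
        model,
        f"--{boundary}",
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"',
        "Content-Type: application/octet-stream",
        "",
    ])
    body = (headers.encode() + b"\r\n" + audio + b"\r\n"
            + f"--{boundary}--\r\n".encode())
    return body, f"multipart/form-data; boundary={boundary}"
```

Send the returned body with the returned `Content-Type` header; many clients instead delegate this to a library such as `requests`.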
Video Engine (6 models)
Text-to-video generation models that expose the `/v1/video/generations` endpoint.
Models: CogVideoX-2B, CogVideoX-5B, HunyuanVideo, Wan2.1-T2V-1.3B, Wan2.1-T2V-14B, LTX-Video
Request format:
```json
{
  "prompt": "A ball bouncing in slow motion",
  "duration": 5,
  "resolution": "1080p",
  "aspect_ratio": "16:9",
  "fps": 24
}
```
Response: Either JSON with job ID (async) or binary video data (sync), depending on the model.
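Client code therefore has to branch on the response type. One way to do that in Python, assuming JSON replies carry a job `id` field (the field name is an assumption):

```python
import json

def parse_video_response(content_type: str, body: bytes):
    """Distinguish an async job-ID reply from sync binary video, per the note above."""
    if content_type.startswith("application/json"):
        payload = json.loads(body)
        return ("job", payload.get("id"))  # caller should poll this job (field name assumed)
    return ("video", body)  # raw video bytes, ready to write to disk
```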
Model Types & API Endpoints
Each model type maps to a specific AI Gateway endpoint:
| Model Type | Gateway Endpoint | Proxy Path |
|---|---|---|
| chat | POST /v1/chat/completions | /api/v1/ai/chat/completions |
| multimodal | POST /v1/chat/completions | /api/v1/ai/chat/completions |
| embedding | POST /v1/embeddings | /api/v1/ai/embeddings |
| text_to_speech | POST /v1/audio/speech | /api/v1/ai/audio/speech |
| speech_to_text | POST /v1/audio/transcriptions | /api/v1/ai/audio/transcriptions |
| image_generation | POST /v1/generations/images | /api/v1/ai/generations/images |
| video_generation | POST /v1/generations/videos | /api/v1/ai/generations/videos |
Important: Multimodal models use the same /v1/chat/completions endpoint as chat models. The gateway accepts both chat and multimodal model types for chat completions.
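The table above is effectively a static lookup. A Python sketch of that mapping, including the chat/multimodal overlap:

```python
# Model type to gateway endpoint, copied from the table above.
GATEWAY_ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "multimodal": "/v1/chat/completions",  # shares the chat endpoint
    "embedding": "/v1/embeddings",
    "text_to_speech": "/v1/audio/speech",
    "speech_to_text": "/v1/audio/transcriptions",
    "image_generation": "/v1/generations/images",
    "video_generation": "/v1/generations/videos",
}

def endpoint_for(model_type: str) -> str:
    """Resolve the gateway endpoint for a model type."""
    try:
        return GATEWAY_ENDPOINTS[model_type]
    except KeyError:
        raise ValueError(f"unknown model type: {model_type}") from None
```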
Deployment Lifecycle
1. Create Deployment
When you deploy a self-hosted model, the platform:
- Reads the model's Dockerfile from the catalogue
- Builds the Docker image using a K8s build job (Docker-in-Docker)
- Pushes the image to ECR
- Creates a Kubernetes Deployment + Service in the org namespace
- Registers the model in the AI Gateway with `provider: self_hosted` and the model's `engine` field
2. Model Status Flow
```
building → deploying → active → running
              ↘ error (build/deploy failure)
```
- building: Docker image is being built
- deploying: K8s resources are being created
- active: K8s Deployment exists, pod is starting
- running: Model is serving inference requests
- error: Build or deployment failed
3. Inference Readiness
After a model reaches "active" status, the inference server still needs time to:
- Pull the Docker image
- Download model weights from HuggingFace
- Load weights into GPU memory
- Start the HTTP server
The gateway handles this transparently — requests return 503 until the model is ready.
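A client can treat 503 as "still warming up" and retry with backoff. An illustrative sketch — the backoff schedule is our choice, not gateway behavior, and `probe` stands in for a real HTTP request to the model:

```python
import time

def wait_until_ready(probe, max_wait: float = 600.0, base_delay: float = 2.0) -> int:
    """Poll `probe()` (returns an HTTP status code) until it stops returning 503.

    Uses capped exponential backoff; gives up after roughly `max_wait` seconds.
    """
    delay, waited = base_delay, 0.0
    while waited < max_wait:
        status = probe()
        if status != 503:
            return status
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 30.0)
    raise TimeoutError("model did not become ready in time")
```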
GPU Requirements
| GPU Count | Instance Types | Example Models |
|---|---|---|
| 0 (CPU) | t3.large, c5.2xlarge | Whisper (tiny/base/small), smaller TTS models |
| 1 | g5.xlarge, g5.2xlarge, g4dn.xlarge | 7B-8B LLMs, most multimodal, video gen |
| 2 | g5.12xlarge | 13B LLMs, larger multimodal models |
| 4 | g5.12xlarge, p3.8xlarge | 34B-72B LLMs, large vision models |
| 8 | Custom | 70B+ LLMs, frontier models |
Adding Custom Models
To add a new self-hosted model to the catalogue, add an entry to self-hosted-models.js:
```javascript
{
  id: 'my-custom-model',
  name: 'My Custom Model',
  description: 'Description of the model',
  category: 'llm',        // llm, multimodal, embedding, audio, video, image
  provider: 'vendor-name',
  modelType: 'chat',      // chat, multimodal, embedding, text_to_speech, etc.
  engine: 'vllm',         // vllm, transformers, custom, whisper, tts-engine, video-engine
  dockerfile: `...`,      // Full Dockerfile content
  defaultPort: 8000,
  defaultResources: { cpu: '4000m', memory: '16Gi', gpu: 1, disk: '50Gi' },
  recommendedInstance: 'g5.xlarge',
  tags: ['llm', 'chat', '7b'],
  modelSize: '14GB',
  documentation: '# My Model\n\nModel documentation here.'
}
```
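A quick validation pass catches malformed entries before deployment. A hedged Python sketch — the set of required fields is inferred from the example above, not a documented schema:

```python
# Required fields inferred from the catalogue entry example (an assumption).
REQUIRED_FIELDS = {"id", "name", "category", "modelType", "engine",
                   "dockerfile", "defaultPort"}
VALID_ENGINES = {"vllm", "transformers", "custom", "whisper",
                 "tts-engine", "video-engine"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems with a catalogue entry (empty means OK)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())]
    if entry.get("engine") not in VALID_ENGINES:
        problems.append(f"unknown engine: {entry.get('engine')!r}")
    return problems
```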
Engine Selection Guide
| Choose This Engine | When |
|---|---|
| vllm | Model is supported by vLLM (most LLMs, VLMs, embeddings) |
| transformers | Custom HuggingFace model not in vLLM, needs custom preprocessing |
| custom | Non-standard inference server with custom API |
| whisper | OpenAI Whisper-compatible STT model |
| tts-engine | Text-to-speech model exposing /v1/audio/speech |
| video-engine | Video generation model exposing /v1/video/generations |
Server Requirements
All self-hosted model servers must expose:
- Health endpoint: `GET /health` returning `{"status": "ok"}`
- Inference endpoint: The appropriate endpoint for the model type (see table above)
- Port 8000: Default port (configurable via `defaultPort`)
The inference endpoint must accept and return standard JSON — not multipart form data or custom formats (except STT which uses multipart for file upload).
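A server satisfying the health-check requirement can be written with Python's standard library alone; a real model server would add its inference endpoint alongside it. A minimal sketch:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler exposing GET /health as required above."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # silence per-request logging

# To run standalone on the default port (blocking):
#   HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()
```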