
Self-Hosted Models

Deploy 169 pre-configured open-source AI models on your own infrastructure. Each model comes with a production-ready Dockerfile and is routed through the AI Gateway's standardized API endpoints — the same endpoints used by third-party providers like OpenAI, Anthropic, and Google.

Model Catalogue

| Category | Count | Model Types |
| --- | --- | --- |
| LLM | 83 | Chat, code generation, reasoning |
| Multimodal | 32 | Vision-language models (VLMs) |
| Embedding | 12 | Text embeddings for RAG/search |
| Audio | 34 | Text-to-speech (16), speech-to-text (18) |
| Video | 6 | Text-to-video generation |
| Image | 1 | Image generation (Stable Diffusion) |
| NLP | 1 | Specialized NLP tasks |

Provider & Engine Architecture

All self-hosted models use provider: self_hosted. The engine field determines how the AI Gateway routes requests — one provider maps to many engines.

Provider vs Engine

| Concept | What It Means | Examples |
| --- | --- | --- |
| Provider | Who provides the model (the vendor) | self_hosted, openai, anthropic, custom |
| Engine | How to communicate with the inference server | vllm, transformers, whisper, tts-engine, video-engine, custom |

Engine Routing

| Engine | Internal Handler | Used For | API Format |
| --- | --- | --- | --- |
| vllm | VLLMProvider | LLM chat, multimodal, embedding | OpenAI-compatible /v1/chat/completions |
| transformers | VLLMProvider | Custom transformers servers | OpenAI-compatible /v1/chat/completions |
| custom | CustomProvider | Generic custom servers | Configurable endpoints |
| whisper | CustomProvider | Speech-to-text models | /v1/audio/transcriptions |
| tts-engine | CustomProvider | Text-to-speech models | /v1/audio/speech |
| video-engine | CustomProvider | Video generation models | /v1/video/generations |

How Routing Works

  1. A user sends a request (e.g., POST /v1/chat/completions).
  2. The AI Gateway receives the request with a model ID.
  3. The gateway looks up the model config: provider=self_hosted, engine=vllm.
  4. The Provider Factory sees provider=self_hosted and reads the engine field:
     • engine vllm or transformers → VLLMProvider
     • engine custom, whisper, tts-engine, or video-engine → CustomProvider
  5. The provider forwards the request to the self-hosted model's internal endpoint.
  6. The response is normalized to the standard format and returned.
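The engine-to-handler rule described above can be sketched in a few lines of Python. VLLMProvider and CustomProvider are the handler names from the routing table; the resolver function itself is illustrative, not the platform's actual code:

```python
# Illustrative sketch of the engine-routing rule (not the platform's real code).
VLLM_ENGINES = {"vllm", "transformers"}
CUSTOM_ENGINES = {"custom", "whisper", "tts-engine", "video-engine"}

def resolve_handler(provider: str, engine: str) -> str:
    """Map a (provider, engine) pair to the internal handler name."""
    if provider != "self_hosted":
        # Third-party providers (openai, anthropic, ...) have their own handlers.
        return provider
    if engine in VLLM_ENGINES:
        return "VLLMProvider"
    if engine in CUSTOM_ENGINES:
        return "CustomProvider"
    raise ValueError(f"unknown engine: {engine}")
```

Note that the engine only matters when provider is self_hosted; for third-party providers the provider name alone selects the handler.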

Engine Capabilities

VLLMProvider (engines: vllm, transformers):

| Capability | Supported | Endpoint |
| --- | --- | --- |
| Chat completions | Yes | /v1/chat/completions |
| Streaming | Yes | /v1/chat/completions (SSE) |
| Embeddings | Yes | /v1/embeddings |
| Image generation | Yes | /v1/images/generations |
| Vision/Multimodal | Yes | /v1/chat/completions with image content |
| Tool calling | Yes | Via --tool-call-parser vLLM flag |

CustomProvider (engines: custom, whisper, tts-engine, video-engine):

| Capability | Supported | Endpoint |
| --- | --- | --- |
| Chat completions | Yes | /v1/chat/completions |
| Streaming | Yes | /v1/chat/completions (SSE) |
| Embeddings | Yes | /v1/embeddings |
| Text-to-speech | Yes | /v1/audio/speech |
| Speech-to-text | Yes | /v1/audio/transcriptions |
| Image generation | Yes | /v1/images/generations |
| Video generation | Yes | /v1/video/generations |

Inference Engines

vLLM Engine (113 models)

The majority of models use the vLLM inference engine — a high-performance serving framework featuring PagedAttention, continuous batching, and an OpenAI-compatible API.

Base Image: vllm/vllm-openai:v0.16.0

Key vLLM flags:

  • --tensor-parallel-size N — Distribute model across N GPUs
  • --gpu-memory-utilization 0.9 — GPU memory fraction to use
  • --max-model-len N — Maximum context length
  • --served-model-name name — Model name exposed via API
  • --trust-remote-code — Required for some HuggingFace models
  • --tool-call-parser hermes — Enable function calling support
  • --enable-reasoning — Enable chain-of-thought reasoning
  • --limit-mm-per-prompt image=4 — Limit multimodal inputs per request

Example: Llama 3.1 8B

FROM vllm/vllm-openai:v0.16.0
ENV MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--served-model-name", "llama-3.1-8b", \
     "--gpu-memory-utilization", "0.9", \
     "--max-model-len", "32768"]

Transformers Engine (24 models)

Custom Python servers using HuggingFace Transformers directly. Used when a model isn't supported by vLLM or needs custom preprocessing.

The server must expose a standard OpenAI-compatible /v1/chat/completions endpoint accepting JSON:

{
  "model": "model-name",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 512
}

For multimodal transformers models, the server must handle both text-only and image+text messages in OpenAI format:

{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What do you see?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
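For the image_url part, vision servers accept base64 data URLs as shown above. A small helper for building that content part from raw image bytes (the helper name is ours, not part of the gateway API):

```python
import base64

def image_part(png_bytes: bytes) -> dict:
    """Build an OpenAI-format image content part from raw PNG bytes."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    }
```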

TTS Engine (16 models)

Text-to-speech models that expose /v1/audio/speech endpoint.

Request format:

{
  "input": "Text to speak",
  "voice": "default",
  "model": "model-name",
  "response_format": "mp3",
  "speed": 1.0
}

Response: Binary audio stream.

Whisper Engine (5 models)

Speech-to-text models using OpenAI Whisper-compatible API at /v1/audio/transcriptions.

Request format: multipart/form-data with audio file upload.

Response:

{
  "text": "Transcribed text content"
}
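A hand-rolled multipart/form-data encoder shows what the upload body looks like on the wire. In practice an HTTP client library handles this; the helper below is illustrative, and the field names (model, file) follow the OpenAI transcription convention:

```python
import io
import uuid

def multipart_audio(file_name: str, audio: bytes, model: str):
    """Encode a minimal multipart/form-data body for an STT upload."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def write_part(headers: str, payload: bytes) -> None:
        body.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        body.write(payload + b"\r\n")

    write_part('Content-Disposition: form-data; name="model"', model.encode())
    write_part(
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream",
        audio,
    )
    body.write(f"--{boundary}--\r\n".encode())
    content_type = f"multipart/form-data; boundary={boundary}"
    return content_type, body.getvalue()
```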

Video Engine (6 models)

Text-to-video generation models that expose /v1/video/generations endpoint.

Models: CogVideoX-2B, CogVideoX-5B, HunyuanVideo, Wan2.1-T2V-1.3B, Wan2.1-T2V-14B, LTX-Video

Request format:

{
  "prompt": "A ball bouncing in slow motion",
  "duration": 5,
  "resolution": "1080p",
  "aspect_ratio": "16:9",
  "fps": 24
}

Response: Either JSON with job ID (async) or binary video data (sync), depending on the model.
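Because the response shape varies by model, clients should branch on the Content-Type header. A minimal sketch (the function name is ours):

```python
import json

def parse_video_response(content_type: str, body: bytes):
    """Return ('job', dict) for async JSON responses, ('video', bytes) for sync."""
    if content_type.startswith("application/json"):
        return "job", json.loads(body)
    return "video", body
```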


Model Types & API Endpoints

Each model type maps to a specific AI Gateway endpoint:

| Model Type | Gateway Endpoint | Proxy Path |
| --- | --- | --- |
| chat | POST /v1/chat/completions | /api/v1/ai/chat/completions |
| multimodal | POST /v1/chat/completions | /api/v1/ai/chat/completions |
| embedding | POST /v1/embeddings | /api/v1/ai/embeddings |
| text_to_speech | POST /v1/audio/speech | /api/v1/ai/audio/speech |
| speech_to_text | POST /v1/audio/transcriptions | /api/v1/ai/audio/transcriptions |
| image_generation | POST /v1/generations/images | /api/v1/ai/generations/images |
| video_generation | POST /v1/generations/videos | /api/v1/ai/generations/videos |

Important: Multimodal models use the same /v1/chat/completions endpoint as chat models. The gateway accepts both chat and multimodal model types for chat completions.


Deployment Lifecycle

1. Create Deployment

When you deploy a self-hosted model, the platform:

  1. Reads the model's Dockerfile from the catalogue
  2. Builds the Docker image using a K8s build job (Docker-in-Docker)
  3. Pushes the image to ECR
  4. Creates a Kubernetes Deployment + Service in the org namespace
  5. Registers the model in the AI Gateway with provider: self_hosted and the model's engine field

2. Model Status Flow

building → deploying → active → running
(any stage may transition to error on build/deploy failure)
  • building: Docker image is being built
  • deploying: K8s resources are being created
  • active: K8s Deployment exists, pod is starting
  • running: Model is serving inference requests
  • error: Build or deployment failed

3. Inference Readiness

After a model reaches "active" status, the inference server still needs time to:

  • Pull the Docker image
  • Download model weights from HuggingFace
  • Load weights into GPU memory
  • Start the HTTP server

The gateway handles this transparently — requests return 503 until the model is ready.
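Client-side, that means inference calls made right after deployment should be retried rather than treated as fatal. A polling sketch, assuming a send_request callable that returns an HTTP status code (both names are ours, for illustration):

```python
import time

def wait_until_ready(send_request, max_wait: float = 600, interval: float = 5) -> int:
    """Poll until the endpoint stops returning 503 (model still loading)."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = send_request()
        if status != 503:
            return status
        time.sleep(interval)
    raise TimeoutError("model did not become ready in time")
```

For large models the weight download alone can take many minutes, so choose max_wait accordingly.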


GPU Requirements

| GPU Count | Instance Types | Example Models |
| --- | --- | --- |
| 0 (CPU) | t3.large, c5.2xlarge | Whisper (tiny/base/small), smaller TTS models |
| 1 | g5.xlarge, g5.2xlarge, g4dn.xlarge | 7B-8B LLMs, most multimodal, video gen |
| 2 | g5.12xlarge | 13B LLMs, larger multimodal models |
| 4 | g5.12xlarge, p3.8xlarge | 34B-72B LLMs, large vision models |
| 8 | Custom | 70B+ LLMs, frontier models |

Adding Custom Models

To add a new self-hosted model to the catalogue, add an entry to self-hosted-models.js:

{
  id: 'my-custom-model',
  name: 'My Custom Model',
  description: 'Description of the model',
  category: 'llm', // llm, multimodal, embedding, audio, video, image
  provider: 'vendor-name',
  modelType: 'chat', // chat, multimodal, embedding, text_to_speech, etc.
  engine: 'vllm', // vllm, transformers, custom, whisper, tts-engine, video-engine
  dockerfile: `...`, // Full Dockerfile content
  defaultPort: 8000,
  defaultResources: { cpu: '4000m', memory: '16Gi', gpu: 1, disk: '50Gi' },
  recommendedInstance: 'g5.xlarge',
  tags: ['llm', 'chat', '7b'],
  modelSize: '14GB',
  documentation: '# My Model\n\nModel documentation here.'
}

Engine Selection Guide

| Choose This Engine | When |
| --- | --- |
| vllm | Model is supported by vLLM (most LLMs, VLMs, embeddings) |
| transformers | Custom HuggingFace model not in vLLM, or needs custom preprocessing |
| custom | Non-standard inference server with custom API |
| whisper | OpenAI Whisper-compatible STT model |
| tts-engine | Text-to-speech model exposing /v1/audio/speech |
| video-engine | Video generation model exposing /v1/video/generations |

Server Requirements

All self-hosted model servers must expose:

  1. Health endpoint: GET /health returning {"status": "ok"}
  2. Inference endpoint: The appropriate endpoint for the model type (see table above)
  3. Port 8000: Default port (configurable via defaultPort)

The inference endpoint must accept and return standard JSON — not multipart form data or custom formats (except STT which uses multipart for file upload).