Deploying AI Models

Deploy open-source models from Hugging Face on your own infrastructure with full control over scaling, costs, and data privacy. Deployments are scoped to organization namespaces for multi-tenant isolation.

Environment

Self-hosted models run on vLLM, a high-performance inference engine optimized for LLM serving. vLLM provides PagedAttention, continuous batching, and optimized CUDA kernels for maximum throughput.

Deployments use the NVIDIA PyTorch base image (nvcr.io/nvidia/pytorch:23.10-py3) with vLLM and dependencies installed on top.

Multi-Tenant Deployment

All model deployments are scoped to organization namespaces:

  • Organization context (X-Organization-ID header) is required for deployment
  • Each deployment runs in the organization's Kubernetes namespace
  • Resources are isolated between organizations
  • The deployment ID is used as the Kubernetes service name
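As a sketch, the required organization header can be attached when calling the deploy endpoint. The base URL, organization ID, and payload fields below are placeholders, not platform guarantees; only the path and the X-Organization-ID header come from this page:

```python
from urllib.request import Request
import json

BASE_URL = "https://api.example.com"  # hypothetical host -- use your platform instance
ORG_ID = "org-1234"                   # your organization ID

# Build (but do not send) a deploy request with the organization context header.
payload = json.dumps({"model_name": "mistralai/Mistral-7B-Instruct-v0.2"}).encode()
req = Request(
    f"{BASE_URL}/api/v1/models/deployment/deploy",
    data=payload,
    headers={"X-Organization-ID": ORG_ID, "Content-Type": "application/json"},
    method="POST",
)

print(req.get_method(), req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) will deploy into the organization's Kubernetes namespace.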

Deployment Options

Control when and how your model is available:

Always On

Behavior: Model stays running 24/7

  • Use Case: Production applications with consistent traffic
  • Cost: Fixed compute costs regardless of usage
  • Latency: Instant response (no cold start)
  • Configuration: Set min replicas to 1 or more, no scale-to-zero

Best for Production: Always On deployment ensures instant response times and is ideal for production applications with consistent traffic patterns.

On Demand (Auto-Shutdown)

Behavior: Automatically shuts down after a configurable idle period

  • Use Case: Development, testing, low-traffic applications
  • Cost: Pay only for active usage time
  • Latency: Cold start when model restarts after shutdown
  • Configuration: Set auto_shutdown_minutes to define the idle timeout
  • Savings: Significant cost reduction for intermittent workloads

Cost Optimization: On Demand deployment with auto-shutdown can significantly reduce costs for intermittent workloads by automatically stopping the model during idle periods.
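To make the savings concrete, here is a rough comparison using the g5.xlarge rate from the instance table in this page. The workload profile (6 active hours per weekday, plus a 30-minute idle buffer before auto-shutdown triggers) is an assumption for illustration:

```python
HOURLY_COST = 1.006      # g5.xlarge rate from the instance table
HOURS_PER_MONTH = 730    # average hours in a month

always_on = HOURLY_COST * HOURS_PER_MONTH

# Assumed workload: ~6 active hours per weekday across 22 working days,
# plus a 30-minute idle buffer per day (auto_shutdown_minutes=30).
active_hours = 6 * 22
idle_buffer = 0.5 * 22
on_demand = HOURLY_COST * (active_hours + idle_buffer)

savings_pct = 100 * (1 - on_demand / always_on)
print(f"Always On: ${always_on:,.2f}/month")
print(f"On Demand: ${on_demand:,.2f}/month")
print(f"Savings:   {savings_pct:.0f}%")
```

Under these assumptions the on-demand configuration costs roughly a fifth of the always-on one; actual savings depend entirely on your traffic pattern and idle timeout.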

Scheduled

Behavior: Runs during specified time windows

  • Use Case: Batch processing, business hours only, regional availability
  • Cost: Only charged during active schedule
  • Latency: Instant during scheduled hours, unavailable outside schedule
  • Configuration: Cron expressions (e.g., "0 9-17 * * 1-5" for weekdays 9am-5pm)
  • Examples:
    • 0 8 * * * - Daily at 8am
    • 0 */4 * * * - Every 4 hours
    • 0 0 * * 0 - Sunday midnight
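The weekday window above can be approximated in code. This hand-rolled check is illustrative only, assuming "0 9-17 * * 1-5" means hourly starts from 09:00 through 17:00, Monday to Friday; a real scheduler would parse the cron expression itself:

```python
from datetime import datetime

def in_business_hours_window(dt: datetime) -> bool:
    """Rough equivalent of the schedule "0 9-17 * * 1-5":
    weekdays (Mon-Fri), hours 9 through 17 inclusive."""
    return dt.weekday() < 5 and 9 <= dt.hour <= 17

print(in_business_hours_window(datetime(2024, 1, 3, 10, 30)))  # Wednesday 10:30
print(in_business_hours_window(datetime(2024, 1, 6, 10, 30)))  # Saturday 10:30
```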

Resource Configuration

Choose the right hardware for your model size:

| Model Size | GPU | VRAM | Recommended For |
| --- | --- | --- | --- |
| Small (7B parameters) | 1x V100 / A10G | 16-24GB | Llama-2-7B, Mistral-7B |
| Medium (13B parameters) | 1x A10G | 24GB | Llama-2-13B, Vicuna-13B |
| Large (70B parameters) | 4x A100 (40GB) | 160GB total | Llama-2-70B with tensor parallelism |
| Quantized (4-bit) | 1x V100 | 8-12GB | 7B-13B models quantized to 4-bit |
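A quick way to sanity-check these pairings is to estimate weight memory from parameter count and precision. The sketch below counts weights only; KV cache, activations, and CUDA overhead add more on top (vLLM reserves headroom via --gpu-memory-utilization):

```python
def estimated_weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Back-of-the-envelope VRAM needed just for model weights:
    parameters x bits per parameter / 8 bits per byte."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"7B  @ fp16:  {estimated_weight_gb(7, 16):.1f} GB")   # fits a 24GB A10G
print(f"70B @ fp16:  {estimated_weight_gb(70, 16):.1f} GB")  # needs 4x A100-40GB
print(f"13B @ 4-bit: {estimated_weight_gb(13, 4):.1f} GB")   # fits a 16GB V100
```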

Available Instance Types

The deployment system offers these EC2 instance types:

GPU Instances:

| Instance | GPU | GPU Memory | vCPUs | RAM | Hourly Cost |
| --- | --- | --- | --- | --- | --- |
| g5.xlarge | 1x A10G | 24GB | 4 | 16GB | $1.006 |
| g5.2xlarge | 1x A10G | 24GB | 8 | 32GB | $1.212 |
| g5.12xlarge | 4x A10G | 96GB | 48 | 192GB | $7.09 |
| p3.2xlarge | 1x V100 | 16GB | 8 | 61GB | $3.06 |
| p3.8xlarge | 4x V100 | 64GB | 32 | 244GB | $12.24 |

CPU-Only Instances:

| Instance | vCPUs | RAM | Hourly Cost |
| --- | --- | --- | --- |
| c5.2xlarge | 8 | 16GB | $0.34 |
| c5.4xlarge | 16 | 32GB | $0.68 |

Model Operations

Start and Stop

You can start and stop deployed models without deleting them:

  • Start: POST /api/v1/models/deployment/{model_id}/start - Resume a stopped model
  • Stop: POST /api/v1/models/deployment/{model_id}/stop - Stop a running model (preserves configuration)

Deployment Status

Check the status of any deployment:

GET /api/v1/models/deployment/{model_id}/status

Returns current status, replica counts, resource usage (CPU, memory, GPU), requests per second, and average latency.
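A client will typically poll this endpoint until the deployment is ready. The sketch below assumes the response JSON carries a "status" field (field names are illustrative, not confirmed by this page); the fetcher is injected so the polling logic can be exercised without a live cluster:

```python
import time

def wait_until_running(fetch_status, timeout_s=300, interval_s=5):
    """Poll the status endpoint until the deployment reports "running".

    fetch_status: any callable returning the parsed JSON from
    GET /api/v1/models/deployment/{model_id}/status.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("status") == "running":
            return status
        time.sleep(interval_s)
    raise TimeoutError("deployment did not reach 'running' in time")

# Example with a stubbed status sequence (no network needed):
responses = iter([{"status": "pending"}, {"status": "running", "replicas": 1}])
result = wait_until_running(lambda: next(responses), interval_s=0)
print(result["status"])
```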

Logs

View logs for a deployment with different log types:

GET /api/v1/models/deployment/{model_id}/logs?log_type=runtime&lines=100

Log Types:

  • build - Docker image build output
  • deployment - Kubernetes deployment events
  • runtime - Application runtime logs

Customizing Dockerfile

For advanced use cases, customize the Docker image for your model deployment. The platform provides Dockerfile templates as starting points:

# Example: Custom vLLM Dockerfile (based on platform template)
FROM nvcr.io/nvidia/pytorch:23.10-py3

# Install vLLM and dependencies
RUN pip install vllm transformers accelerate

# Set environment variables
ENV MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2
ENV PORT=8000
ENV GPU_MEMORY_UTILIZATION=0.95
ENV MAX_MODEL_LEN=8192

# Create model directory
RUN mkdir -p /models
WORKDIR /models

# Expose port
EXPOSE 8000

# Health check endpoint
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1

# Start vLLM server
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --port $PORT \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
    --max-model-len $MAX_MODEL_LEN \
    --trust-remote-code

Dockerfile Templates

The platform provides pre-built templates accessible via GET /api/v1/models/deployment/dockerfiles/templates:

  • Mistral 7B with vLLM - Deploy Mistral 7B with high-performance inference (min 16GB GPU)
  • Llama 2 7B with vLLM - Deploy Llama 2 7B (min 16GB GPU)
  • Custom vLLM Model - Template for deploying custom models (min 8GB GPU)

Common customizations:

  • Additional Python packages
  • Custom tokenizers
  • Preprocessing scripts
  • Model quantization
  • Multi-GPU configuration

Deployment API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/v1/models/deployment/deploy | POST | Deploy a new model |
| /api/v1/models/deployment/{model_id}/status | GET | Get deployment status |
| /api/v1/models/deployment/{model_id}/start | POST | Start a stopped model |
| /api/v1/models/deployment/{model_id}/stop | POST | Stop a running model |
| /api/v1/models/deployment/{model_id}/logs | GET | Get deployment logs |
| /api/v1/models/deployment/instance-types | GET | List available instance types |
| /api/v1/models/deployment/instance-pricing/{type} | GET | Get instance pricing |
| /api/v1/models/deployment/dockerfiles/templates | GET | Get Dockerfile templates |
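For illustration, a tiny helper can assemble these URLs from the table; the base host is a placeholder, and only the path segments come from this page:

```python
BASE = "https://api.example.com/api/v1/models/deployment"  # hypothetical host

def deployment_url(action, model_id=None):
    """Build a deployment endpoint URL. Per-model actions
    (status, start, stop, logs) require a model_id; the rest do not."""
    per_model = {"status", "start", "stop", "logs"}
    if action in per_model:
        if model_id is None:
            raise ValueError(f"'{action}' requires a model_id")
        return f"{BASE}/{model_id}/{action}"
    return f"{BASE}/{action}"

print(deployment_url("status", "abc123"))
print(deployment_url("instance-types"))
```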

Next Steps