Deploying AI Models
Deploy open-source models from Hugging Face on your own infrastructure with full control over scaling, costs, and data privacy.
Environment Selection
Choose the runtime environment for your self-hosted model:
| Environment | Best For | Features |
|---|---|---|
| vLLM | High-throughput inference | PagedAttention, continuous batching, optimized CUDA kernels |
| Text Generation Inference (TGI) | Production deployments | Token streaming, tensor parallelism, quantization support |
| Ollama | Local development, smaller models | Simple setup, CPU support, model library |
| Custom | Specialized requirements | Bring your own Docker image and configuration |
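Both vLLM and TGI expose an OpenAI-compatible HTTP API, so a standard client can talk to a self-hosted endpoint. A minimal sketch using the openai Python package; the base URL and model ID are placeholders for your own deployment:

```python
# Minimal sketch: querying a self-hosted vLLM or TGI endpoint through its
# OpenAI-compatible API. base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your endpoint URL
    api_key="not-needed",                 # self-hosted servers often ignore the key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```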
Deployment Options
Control when and how your model is available:
Always On
- Behavior: Model stays running 24/7
- Use Case: Production applications with consistent traffic
- Cost: Fixed compute costs regardless of usage
- Latency: Instant response (no cold start)
- Configuration: Set min replicas ≥ 1, no scale-to-zero
Always On deployment ensures instant response times and is ideal for production applications with consistent traffic patterns.
On Demand
- Behavior: Scales to zero when idle, starts on first request
- Use Case: Development, testing, low-traffic applications
- Cost: Pay only for actual usage time
- Latency: 30-60 second cold start for first request
- Configuration: Set scale-to-zero timeout (default: 5 minutes idle)
- Savings: 70-90% cost reduction for intermittent workloads
On Demand deployment can reduce costs by 70-90% for intermittent workloads by scaling to zero during idle periods.
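As a rough illustration (the exact job spec depends on your platform and is not shown in this guide), the Always On and On Demand knobs described above might be expressed as follows; every field name here is hypothetical:

```python
# Hypothetical sketch only: illustrates how the Always On and On Demand
# settings described above relate to each other.
always_on = {
    "min_replicas": 1,          # keep at least one replica warm -> no cold start
    "max_replicas": 4,
    "scale_to_zero": False,
}

on_demand = {
    "min_replicas": 0,          # allow scale-to-zero when idle
    "max_replicas": 2,
    "scale_to_zero": True,
    "idle_timeout_minutes": 5,  # default idle window before scaling down
}
```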
Scheduled
- Behavior: Runs during specified time windows
- Use Case: Batch processing, business hours only, regional availability
- Cost: Only charged during active schedule
- Latency: Instant during scheduled hours, unavailable outside schedule
- Configuration: Cron expressions (e.g., "0 9-17 * * 1-5" for weekdays 9am-5pm)
- Examples:
  - "0 8 * * *" - Daily at 8am
  - "0 */4 * * *" - Every 4 hours
  - "0 0 * * 0" - Sunday at midnight
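Before saving a schedule, it can help to preview when a cron expression will actually fire. A small sketch assuming the third-party croniter package (pip install croniter):

```python
# Preview the next trigger time for each cron expression listed above.
from datetime import datetime
from croniter import croniter

schedules = {
    "weekdays 9am-5pm": "0 9-17 * * 1-5",
    "daily at 8am":     "0 8 * * *",
    "every 4 hours":    "0 */4 * * *",
    "sunday midnight":  "0 0 * * 0",
}

now = datetime.now()
for label, expr in schedules.items():
    next_run = croniter(expr, now).get_next(datetime)
    print(f"{label}: next run at {next_run}")
```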
Resource Configuration
Choose the right hardware for your model size:
| Model Size | GPU | VRAM | Recommended For |
|---|---|---|---|
| Small (7B parameters) | 1x T4 / L4 | 16GB | Llama-2-7B, Mistral-7B |
| Medium (13B parameters) | 1x A10G / L40 | 24GB | Llama-2-13B, Vicuna-13B |
| Large (70B parameters) | 4x A100 (40GB) | 160GB total | Llama-2-70B with tensor parallelism |
| Quantized (4-bit) | 1x T4 | 8-12GB | 7B-13B models quantized to 4-bit |
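As a sanity check on the table, weight memory alone is roughly parameters x bytes per parameter; the KV cache, activations, and CUDA context add several more GB on top, which is why a 7B FP16 model (~14 GB of weights) is already a tight fit on a 16 GB card. A back-of-the-envelope sketch:

```python
# Lower-bound VRAM needed just to hold the weights; runtime overhead
# (KV cache, activations, CUDA context) comes on top of this.
def weights_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for name, params, bpp in [
    ("7B, FP16", 7, 2.0),     # ~14 GB
    ("70B, FP16", 70, 2.0),   # ~140 GB -> needs multi-GPU tensor parallelism
    ("13B, 4-bit", 13, 0.5),  # ~6.5 GB -> fits a single T4
]:
    print(f"{name}: ~{weights_vram_gb(params, bpp):.1f} GB of weights")
```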
Customizing the Dockerfile
For advanced use cases, customize the Docker image for your model deployment:
```dockerfile
# Example: Custom vLLM Dockerfile
FROM vllm/vllm-openai:latest

# Install additional dependencies
RUN pip install transformers==4.36.0 accelerate

# Copy custom configuration
COPY model_config.json /app/config.json

# Set environment variables
ENV CUDA_VISIBLE_DEVICES=0,1
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Custom entrypoint for preprocessing
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
```
Common customizations:
- Additional Python packages
- Custom tokenizers
- Preprocessing scripts
- Model quantization
- Multi-GPU configuration
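For example, multi-GPU configuration maps to vLLM's tensor_parallel_size option; the equivalent flag is also available on the server image used in the Dockerfile above. A sketch using vLLM's offline Python API with a placeholder model ID:

```python
# Shard a large model across 4 GPUs with vLLM tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder; any HF model ID
    tensor_parallel_size=4,                  # split weights across 4 GPUs
    dtype="float16",
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```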
Fine-Tuning
Train models on your custom data to improve performance for domain-specific tasks. Fine-tuning creates a specialized version of a base model.
Fine-Tuning Methods
LoRA (Low-Rank Adaptation)
- How it works: Adds small trainable matrices to model layers while freezing original weights
- Memory: 90% less than full fine-tuning
- Speed: 2-3x faster training
- Quality: Near full fine-tuning performance
- Storage: 1-10MB adapter files vs. multi-GB full models
- Best for: Most use cases, limited GPU resources
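Conceptually, LoRA leaves the pretrained weight untouched and learns a low-rank correction on top of it. A minimal PyTorch sketch of the forward pass for a single projection layer:

```python
# LoRA sketch: frozen weight W plus a low-rank update B @ A scaled by alpha/r.
# For a 4096x4096 projection with r=8, that is ~65K trainable parameters
# instead of ~16.8M.
import torch

d, r, alpha = 4096, 8, 16
W = torch.randn(d, d)             # frozen pretrained weight
A = torch.randn(r, d) * 0.01      # trainable, small random init
B = torch.zeros(d, r)             # trainable, zero init -> no change at start

x = torch.randn(d)
h = W @ x + (alpha / r) * (B @ (A @ x))   # LoRA forward pass

print(f"frozen params: {W.numel():,}, trainable params: {A.numel() + B.numel():,}")
```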
QLoRA (Quantized LoRA)
- How it works: LoRA with 4-bit quantization of base model
- Memory: 95% less than full fine-tuning
- GPU Requirement: Fine-tune 70B models on single 48GB GPU
- Quality: Minimal degradation vs. standard LoRA
- Best for: Large models with limited hardware
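A sketch of the QLoRA loading step, assuming the Hugging Face transformers, bitsandbytes, and peft packages (the model ID is a placeholder): the base model is loaded in 4-bit NF4 and prepared for k-bit training, then a LoRA adapter is attached on top as in the LoRA config sketch further below.

```python
# Load the base model in 4-bit NF4 and prepare it for k-bit (QLoRA) training.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, as in the QLoRA paper
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",          # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
```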
Full Fine-Tuning
- How it works: Updates all model weights
- Memory: Requires 4-8x model size in VRAM
- Quality: Maximum customization potential
- Storage: Full model copy (10-140GB)
- Best for: Maximum performance, ample resources, significant domain shift
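The 4-8x figure comes from gradient and optimizer state rather than the weights themselves. A rough mixed-precision AdamW accounting (an illustration, not a measurement) lands at the upper end of that range:

```python
# Rough arithmetic: FP16 weights (2 B/param) + FP16 gradients (2 B/param)
# + FP32 optimizer states and master weights (~12 B/param) for AdamW.
params = 7e9                       # e.g. a 7B model (~14 GB in FP16)
bytes_per_param = 2 + 2 + 12       # weights + gradients + optimizer state
print(f"~{params * bytes_per_param / 1e9:.0f} GB before activations")  # ~112 GB, ~8x model size
```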
Fine-Tuning Configuration
Base Model Selection
- Model: Choose from Llama-2, Mistral, Flan-T5, GPT-J, or upload custom base model
- Version: Specific model checkpoint (7B, 13B, 70B variants)
- License: Ensure compliance with model license for fine-tuned derivative
Dataset Configuration
- Format: JSONL, CSV, Parquet (structured training examples)
- Structure: Input-output pairs, instruction-response, prompt-completion
- Size: Minimum 100 examples, recommended 1,000-10,000
- Split: Automatic train/validation split (default 90/10) or manual
- Preprocessing: Tokenization, truncation, padding strategies
- Data Validation: Automatic checks for format errors, duplicates, length issues
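A sketch of the instruction-response JSONL layout with a trivial validation pass; the field names "instruction" and "response" are an assumption, so match whatever schema your training job expects:

```python
# Write a tiny instruction-response dataset as JSONL and run basic checks.
import json

examples = [
    {"instruction": "Classify the sentiment: 'Great battery life.'", "response": "positive"},
    {"instruction": "Classify the sentiment: 'Arrived broken.'", "response": "negative"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Checks: each line is valid JSON, required keys exist, no empty fields.
with open("train.jsonl") as f:
    for i, line in enumerate(f, 1):
        row = json.loads(line)
        assert row.get("instruction") and row.get("response"), f"bad row at line {i}"
```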
Training Hyperparameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| Learning Rate | 2e-4 | 1e-5 to 5e-4 | Step size for weight updates. Higher = faster learning but risk of instability. Lower for large models. |
| Batch Size | 4 | 1 to 32 | Examples processed per step. Larger = more stable gradients but more memory. Limited by GPU VRAM. |
| Epochs | 3 | 1 to 20 | Complete passes through dataset. More epochs = more learning but risk of overfitting. Monitor validation loss. |
| Warmup Steps | 100 | 0 to 500 | Gradual learning rate increase at start. Prevents early training instability. Set to 5-10% of total steps. |
| Weight Decay | 0.01 | 0 to 0.1 | L2 regularization strength. Prevents overfitting by penalizing large weights. Increase for small datasets. |
| Max Sequence Length | 512 | 128 to 4096 | Maximum tokens per example. Longer = more context but more memory. Truncates longer sequences. |
| Gradient Accumulation Steps | 1 | 1 to 16 | Accumulate gradients over multiple steps before update. Simulates larger batch size with limited memory. |
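One common way a training backend might consume these values is Hugging Face TrainingArguments; the mapping below is a sketch, not the platform's actual job spec (max sequence length is applied at tokenization time rather than here):

```python
# The table defaults expressed as Hugging Face TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    warmup_steps=100,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```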
LoRA-Specific Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| LoRA Rank (r) | 8 | 4 to 64 | Dimensionality of adapter matrices. Higher = more capacity but more parameters. 8-16 typical. |
| LoRA Alpha | 16 | 8 to 128 | Scaling factor for adapter weights. Typically 2x rank. Controls strength of fine-tuned behavior. |
| LoRA Dropout | 0.05 | 0 to 0.3 | Dropout probability for adapter layers. Prevents overfitting. Higher for small datasets. |
| Target Modules | q_proj, v_proj | Various | Which layers get adapters. q_proj/v_proj (query/value) most common. Add k_proj/o_proj for more capacity. |
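Expressed with the Hugging Face peft package, the defaults in this table correspond to a LoraConfig like the following (a sketch; attach it to a loaded base model with get_peft_model):

```python
# LoRA adapter configuration matching the defaults above.
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # add "k_proj", "o_proj" for more capacity
    task_type=TaskType.CAUSAL_LM,
)

# model = get_peft_model(model, lora_config)   # attach adapters to a loaded base model
# model.print_trainable_parameters()           # verify only adapter weights are trainable
```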
Optimization Settings
- Optimizer: AdamW (default), SGD, Adafactor. AdamW best for most cases.
- Learning Rate Schedule: Linear (default), cosine, constant, polynomial. Cosine often best for long training.
- Mixed Precision: FP16 or BF16. Reduces memory 2x, faster training. BF16 more stable for large models.
- Gradient Checkpointing: Saves memory by recomputing activations. 40% memory reduction, 20% slower.
- Early Stopping: Stop if validation loss doesn't improve for N epochs. Prevents overfitting.
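Continuing the TrainingArguments sketch above, these settings map onto common Hugging Face Trainer options; exact flag availability depends on your transformers version:

```python
# Optimizer, schedule, precision, checkpointing, and early stopping options.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./checkpoints",
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="cosine",       # cosine learning rate schedule
    bf16=True,                        # mixed precision (use fp16=True on pre-Ampere GPUs)
    gradient_checkpointing=True,      # trade ~20% speed for less activation memory
    evaluation_strategy="epoch",      # evaluate each epoch so early stopping can act
    save_strategy="epoch",            # must match evaluation_strategy for best-model loading
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stop = EarlyStoppingCallback(early_stopping_patience=2)  # pass via Trainer(callbacks=[...])
```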
Evaluation Metrics
Track these metrics during training:
- Training Loss: How well model fits training data. Should decrease consistently.
- Validation Loss: Performance on held-out data. If increases while train loss decreases = overfitting.
- Perplexity: Measure of prediction uncertainty. Lower = better. Exp(loss).
- Learning Rate: Current learning rate (changes with schedule). Monitor for stability.
- Gradient Norm: Magnitude of gradients. Spikes indicate instability, may need lower learning rate.
- Examples/Second: Training throughput. Monitor for hardware efficiency.
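Perplexity can always be recovered from a logged loss value, since it is just the exponential of the average cross-entropy loss:

```python
# Convert a validation loss into perplexity.
import math

eval_loss = 2.1                                    # example value from a training log
print(f"perplexity: {math.exp(eval_loss):.2f}")    # ~8.17
```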
Fine-Tuning Workflow
1. Prepare Dataset: Upload training data in JSONL format with input-output pairs
2. Configure Job: Select base model, method (LoRA/QLoRA/Full), set hyperparameters
3. Start Training: Job runs on GPU cluster, typically 1-6 hours depending on model and dataset size
4. Monitor Progress: Real-time loss curves, validation metrics, sample outputs
5. Evaluate Results: Test on validation set, compare to base model, check for overfitting
6. Deploy Model: One-click deployment as inference endpoint with same API as base model
7. Iterate: Adjust hyperparameters, add more data, or try different base models
Next Steps
- Configure autoscaling to handle varying inference loads
- Set up monitoring to track model performance
- Optimize costs with right-sized resources