Deploying AI Models
Deploy open-source models from Hugging Face on your own infrastructure with full control over scaling, costs, and data privacy. Deployments are scoped to organization namespaces for multi-tenant isolation.
Environment
Self-hosted models run on vLLM, a high-performance inference engine optimized for LLM serving. vLLM provides PagedAttention, continuous batching, and optimized CUDA kernels for maximum throughput.
Deployments use the NVIDIA PyTorch base image (nvcr.io/nvidia/pytorch:23.10-py3) with vLLM and dependencies installed on top.
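Because deployments serve models through vLLM's OpenAI-compatible API server (see the Dockerfile template later in this section), a deployed model can be queried with a plain HTTP completion request. This is a minimal sketch assuming a hypothetical service URL and the Mistral model used in that template:

```python
# Minimal sketch: query a deployed model through vLLM's OpenAI-compatible API.
# The base URL is a placeholder -- substitute your deployment's service address.
import requests

BASE_URL = "http://my-deployment.internal:8000"  # hypothetical service address

response = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "Explain continuous batching in one sentence.",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```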
Multi-Tenant Deployment
All model deployments are scoped to organization namespaces:
- Organization context (the `X-Organization-ID` header) is required for deployment
- Each deployment runs in the organization's Kubernetes namespace
- Resources are isolated between organizations
- The deployment ID is used as the Kubernetes service name
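Every deployment API call therefore carries the organization context. A minimal sketch, assuming a hypothetical platform URL and bearer-token authentication (the `X-Organization-ID` header is the documented requirement):

```python
# Minimal sketch: include the organization context on every deployment API call.
# PLATFORM_URL, the API key, and model_id are placeholders.
import requests

PLATFORM_URL = "https://platform.example.com"
HEADERS = {
    "Authorization": "Bearer <API_KEY>",   # hypothetical auth scheme
    "X-Organization-ID": "org_12345",      # required organization context
}

model_id = "my-model-deployment"           # placeholder deployment ID
status = requests.get(
    f"{PLATFORM_URL}/api/v1/models/deployment/{model_id}/status",
    headers=HEADERS,
    timeout=30,
)
print(status.json())
```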
Deployment Options
Control when and how your model is available (a configuration sketch covering all three options follows at the end of this section):
Always On
- Behavior: Model stays running 24/7
- Use Case: Production applications with consistent traffic
- Cost: Fixed compute costs regardless of usage
- Latency: Instant response (no cold start)
- Configuration: Set min replicas to 1 or more and leave scale-to-zero disabled
Always On deployment ensures instant response times and is ideal for production applications with consistent traffic patterns.
On Demand (Auto-Shutdown)
- Behavior: Automatically shuts down after a configurable idle period
- Use Case: Development, testing, low-traffic applications
- Cost: Pay only for active usage time
- Latency: Cold start when model restarts after shutdown
- Configuration: Set `auto_shutdown_minutes` to define the idle timeout
- Savings: Significant cost reduction for intermittent workloads
On Demand deployment with auto-shutdown can significantly reduce costs for intermittent workloads by automatically stopping the model during idle periods.
Scheduled
- Behavior: Runs during specified time windows
- Use Case: Batch processing, business hours only, regional availability
- Cost: Only charged during active schedule
- Latency: Instant during scheduled hours, unavailable outside schedule
- Configuration: Cron expressions (e.g., `0 9-17 * * 1-5` for weekdays 9am-5pm)
- Examples:
  - `0 8 * * *` - Daily at 8am
  - `0 */4 * * *` - Every 4 hours
  - `0 0 * * 0` - Sunday at midnight
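The three options above translate into different deployment settings. Here is a sketch of how they might look in a deployment configuration; `auto_shutdown_minutes` is the field documented above, while `min_replicas`, `scale_to_zero`, and `schedule` are illustrative names only:

```python
# Availability settings for the three deployment options (sketch).
# Only `auto_shutdown_minutes` is a documented field name; the others are
# illustrative placeholders.

always_on = {
    "min_replicas": 1,             # keep at least one replica running 24/7
    "scale_to_zero": False,        # never scale down to zero
}

on_demand = {
    "auto_shutdown_minutes": 30,   # shut down after 30 idle minutes
}

scheduled = {
    "schedule": "0 9-17 * * 1-5",  # weekdays, 9am-5pm (cron expression)
}
```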
Resource Configuration
Choose the right hardware for your model size:
| Model Size | GPU | VRAM | Recommended For |
|---|---|---|---|
| Small (7B parameters) | 1x V100 / A10G | 16-24GB | Llama-2-7B, Mistral-7B |
| Medium (13B parameters) | 1x A10G | 24GB | Llama-2-13B, Vicuna-13B |
| Large (70B parameters) | 4x A100 (40GB) | 160GB total | Llama-2-70B with tensor parallelism |
| Quantized (4-bit) | 1x V100 | 8-12GB | 70B models quantized to 4-bit |
Available Instance Types
The deployment system offers these EC2 instance types:
GPU Instances:
| Instance | GPU | GPU Memory | vCPUs | RAM | Hourly Cost |
|---|---|---|---|---|---|
| g5.xlarge | 1x A10G | 24GB | 4 | 16GB | $1.006 |
| g5.2xlarge | 1x A10G | 24GB | 8 | 32GB | $1.212 |
| g5.12xlarge | 4x A10G | 96GB | 48 | 192GB | $7.09 |
| p3.2xlarge | 1x V100 | 16GB | 8 | 61GB | $3.06 |
| p3.8xlarge | 4x V100 | 64GB | 32 | 244GB | $12.24 |
CPU-Only Instances:
| Instance | vCPUs | RAM | Hourly Cost |
|---|---|---|---|
| c5.2xlarge | 8 | 16GB | $0.34 |
| c5.4xlarge | 16 | 32GB | $0.68 |
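Instance availability and pricing can also be queried through the API (both endpoints appear in the endpoint table at the end of this section). A minimal sketch, reusing the placeholder platform URL and headers from earlier:

```python
# Minimal sketch: list available instance types and look up pricing for one.
import requests

PLATFORM_URL = "https://platform.example.com"  # placeholder
HEADERS = {"Authorization": "Bearer <API_KEY>", "X-Organization-ID": "org_12345"}

instance_types = requests.get(
    f"{PLATFORM_URL}/api/v1/models/deployment/instance-types",
    headers=HEADERS, timeout=30,
).json()

g5_pricing = requests.get(
    f"{PLATFORM_URL}/api/v1/models/deployment/instance-pricing/g5.xlarge",
    headers=HEADERS, timeout=30,
).json()

print(instance_types)
print(g5_pricing)
```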
Model Operations
Start and Stop
You can start and stop deployed models without deleting them:
- Start: `POST /api/v1/models/deployment/{model_id}/start` - Resume a stopped model
- Stop: `POST /api/v1/models/deployment/{model_id}/stop` - Stop a running model (preserves configuration)
Deployment Status
Check the status of any deployment:
`GET /api/v1/models/deployment/{model_id}/status`
Returns current status, replica counts, resource usage (CPU, memory, GPU), requests per second, and average latency.
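A minimal sketch that stops a model, checks its status, and starts it again (the response schema is not specified here, so the raw JSON is printed):

```python
# Minimal sketch: stop a deployment, inspect its status, then start it again.
import requests

PLATFORM_URL = "https://platform.example.com"  # placeholder
HEADERS = {"Authorization": "Bearer <API_KEY>", "X-Organization-ID": "org_12345"}
model_id = "my-model-deployment"               # placeholder

base = f"{PLATFORM_URL}/api/v1/models/deployment/{model_id}"

requests.post(f"{base}/stop", headers=HEADERS, timeout=30)    # configuration is preserved
print(requests.get(f"{base}/status", headers=HEADERS, timeout=30).json())
requests.post(f"{base}/start", headers=HEADERS, timeout=30)   # resume serving
```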
Logs
View logs for a deployment with different log types:
`GET /api/v1/models/deployment/{model_id}/logs?log_type=runtime&lines=100`
Log Types:
- `build` - Docker image build output
- `deployment` - Kubernetes deployment events
- `runtime` - Application runtime logs
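For example, to pull the last 100 runtime log lines (a sketch with the same placeholder URL and headers as above):

```python
# Minimal sketch: fetch the most recent runtime logs for a deployment.
import requests

PLATFORM_URL = "https://platform.example.com"  # placeholder
HEADERS = {"Authorization": "Bearer <API_KEY>", "X-Organization-ID": "org_12345"}
model_id = "my-model-deployment"               # placeholder

logs = requests.get(
    f"{PLATFORM_URL}/api/v1/models/deployment/{model_id}/logs",
    params={"log_type": "runtime", "lines": 100},
    headers=HEADERS,
    timeout=30,
)
print(logs.text)
```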
Customizing Dockerfile
For advanced use cases, customize the Docker image for your model deployment. The platform provides Dockerfile templates as starting points:
```dockerfile
# Example: Custom vLLM Dockerfile (based on platform template)
FROM nvcr.io/nvidia/pytorch:23.10-py3

# Install vLLM and dependencies
RUN pip install vllm transformers accelerate

# Set environment variables
ENV MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2
ENV PORT=8000
ENV GPU_MEMORY_UTILIZATION=0.95
ENV MAX_MODEL_LEN=8192

# Create model directory
RUN mkdir -p /models
WORKDIR /models

# Expose port
EXPOSE 8000

# Health check endpoint
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1

# Start vLLM server
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --port $PORT \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
    --max-model-len $MAX_MODEL_LEN \
    --trust-remote-code
```
Dockerfile Templates
The platform provides pre-built templates, accessible via `GET /api/v1/models/deployment/dockerfiles/templates`:
- Mistral 7B with vLLM - Deploy Mistral 7B with high-performance inference (min 16GB GPU)
- Llama 2 7B with vLLM - Deploy Llama 2 7B (min 16GB GPU)
- Custom vLLM Model - Template for deploying custom models (min 8GB GPU)
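The templates can be fetched programmatically and used as the starting point for a custom Dockerfile. A minimal sketch (the response is assumed to be JSON; its exact structure is not specified here):

```python
# Minimal sketch: list the built-in Dockerfile templates.
import requests

PLATFORM_URL = "https://platform.example.com"  # placeholder
HEADERS = {"Authorization": "Bearer <API_KEY>", "X-Organization-ID": "org_12345"}

templates = requests.get(
    f"{PLATFORM_URL}/api/v1/models/deployment/dockerfiles/templates",
    headers=HEADERS,
    timeout=30,
).json()
print(templates)
```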
Common customizations:
- Additional Python packages
- Custom tokenizers
- Preprocessing scripts
- Model quantization
- Multi-GPU configuration
Deployment API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/models/deployment/deploy | POST | Deploy a new model |
| /api/v1/models/deployment/{model_id}/status | GET | Get deployment status |
| /api/v1/models/deployment/{model_id}/start | POST | Start a stopped model |
| /api/v1/models/deployment/{model_id}/stop | POST | Stop a running model |
| /api/v1/models/deployment/{model_id}/logs | GET | Get deployment logs |
| /api/v1/models/deployment/instance-types | GET | List available instance types |
| /api/v1/models/deployment/instance-pricing/{type} | GET | Get instance pricing |
| /api/v1/models/deployment/dockerfiles/templates | GET | Get Dockerfile templates |
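Putting it together, a deployment request posts to the deploy endpoint with the organization header. The endpoint and `auto_shutdown_minutes` are documented above; the remaining body fields (`model`, `instance_type`) are illustrative assumptions about the request schema:

```python
# Sketch of a deployment request. Field names other than auto_shutdown_minutes
# are illustrative placeholders, not a documented schema.
import requests

PLATFORM_URL = "https://platform.example.com"  # placeholder
HEADERS = {"Authorization": "Bearer <API_KEY>", "X-Organization-ID": "org_12345"}

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model ID
    "instance_type": "g5.xlarge",                   # 1x A10G, 24GB (see tables above)
    "auto_shutdown_minutes": 30,                    # on-demand: stop after 30 idle minutes
}

resp = requests.post(
    f"{PLATFORM_URL}/api/v1/models/deployment/deploy",
    json=payload,
    headers=HEADERS,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # response schema is not specified here; print the raw JSON
```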
Next Steps
- Fine-tune a model on your custom data
- Configure autoscaling to handle varying inference loads
- Set up monitoring to track model performance
- Optimize costs with right-sized resources