Deploying AI Models
Deploy open-source models from Hugging Face on your own infrastructure with full control over scaling, costs, and data privacy.
Environment Selection
Choose the runtime environment for your self-hosted model:
| Environment | Best For | Features |
|---|---|---|
| vLLM | High-throughput inference | PagedAttention, continuous batching, optimized CUDA kernels |
| Text Generation Inference (TGI) | Production deployments | Token streaming, tensor parallelism, quantization support |
| Ollama | Local development, smaller models | Simple setup, CPU support, model library |
| Custom | Specialized requirements | Bring your own Docker image and configuration |
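Both vLLM and TGI expose an OpenAI-compatible HTTP API, so a standard client can talk to a self-hosted endpoint. A minimal sketch using the openai Python package; the base URL and model ID are placeholders for your own deployment:

```python
# Minimal sketch: querying a self-hosted vLLM or TGI endpoint through its
# OpenAI-compatible API. base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your endpoint URL
    api_key="not-needed",                 # self-hosted servers often ignore the key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```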
Deployment Options
Control when and how your model is available:
Always On
- Behavior: Model stays running 24/7
- Use Case: Production applications with consistent traffic
- Cost: Fixed compute costs regardless of usage
- Latency: Instant response (no cold start)
- Configuration: Set min replicas ≥ 1, no scale-to-zero
Always On deployment ensures instant response times and is ideal for production applications with consistent traffic patterns.
On Demand
- Behavior: Scales to zero when idle, starts on first request
- Use Case: Development, testing, low-traffic applications
- Cost: Pay only for actual usage time
- Latency: 30-60 second cold start for first request
- Configuration: Set scale-to-zero timeout (default: 5 minutes idle)
- Savings: 70-90% cost reduction for intermittent workloads
On Demand deployment can reduce costs by 70-90% for intermittent workloads by scaling to zero during idle periods.
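As a rough illustration (the exact job spec depends on your platform and is not shown in this guide), the Always On and On Demand knobs described above might be expressed as follows; every field name here is hypothetical:

```python
# Hypothetical sketch only: illustrates how the Always On and On Demand
# settings described above relate to each other.
always_on = {
    "min_replicas": 1,          # keep at least one replica warm -> no cold start
    "max_replicas": 4,
    "scale_to_zero": False,
}

on_demand = {
    "min_replicas": 0,          # allow scale-to-zero when idle
    "max_replicas": 2,
    "scale_to_zero": True,
    "idle_timeout_minutes": 5,  # default idle window before scaling down
}
```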
Scheduled
- Behavior: Runs during specified time windows
- Use Case: Batch processing, business hours only, regional availability
- Cost: Only charged during active schedule
- Latency: Instant during scheduled hours, unavailable outside schedule
- Configuration: Cron expressions (e.g., "0 9-17 * * 1-5" for weekdays 9am-5pm)
- Examples:
  - "0 8 * * *" - Daily at 8am
  - "0 */4 * * *" - Every 4 hours
  - "0 0 * * 0" - Sunday at midnight
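Before saving a schedule, it can help to preview when a cron expression will actually fire. A small sketch assuming the third-party croniter package (pip install croniter):

```python
# Preview the next trigger time for each cron expression listed above.
from datetime import datetime
from croniter import croniter

schedules = {
    "weekdays 9am-5pm": "0 9-17 * * 1-5",
    "daily at 8am":     "0 8 * * *",
    "every 4 hours":    "0 */4 * * *",
    "sunday midnight":  "0 0 * * 0",
}

now = datetime.now()
for label, expr in schedules.items():
    next_run = croniter(expr, now).get_next(datetime)
    print(f"{label}: next run at {next_run}")
```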
Resource Configuration
Choose the right hardware for your model size:
| Model Size | GPU | VRAM | Recommended For |
|---|---|---|---|
| Small (7B parameters) | 1x T4 / L4 | 16GB | Llama-2-7B, Mistral-7B |
| Medium (13B parameters) | 1x A10G / L40 | 24GB | Llama-2-13B, Vicuna-13B |
| Large (70B parameters) | 4x A100 (40GB) | 160GB total | Llama-2-70B with tensor parallelism |
| Quantized (4-bit) | 1x T4 | 8-12GB | 7B-13B models quantized to 4-bit |
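As a sanity check on the table, weight memory alone is roughly parameters x bytes per parameter; the KV cache, activations, and CUDA context add several more GB on top, which is why a 7B FP16 model (~14 GB of weights) is already a tight fit on a 16 GB card. A back-of-the-envelope sketch:

```python
# Lower-bound VRAM needed just to hold the weights; runtime overhead
# (KV cache, activations, CUDA context) comes on top of this.
def weights_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for name, params, bpp in [
    ("7B, FP16", 7, 2.0),     # ~14 GB
    ("70B, FP16", 70, 2.0),   # ~140 GB -> needs multi-GPU tensor parallelism
    ("13B, 4-bit", 13, 0.5),  # ~6.5 GB -> fits a single T4
]:
    print(f"{name}: ~{weights_vram_gb(params, bpp):.1f} GB of weights")
```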
Customizing the Dockerfile
For advanced use cases, customize the Docker image for your model deployment:
```dockerfile
# Example: Custom vLLM Dockerfile
FROM vllm/vllm-openai:latest

# Install additional dependencies
RUN pip install transformers==4.36.0 accelerate

# Copy custom configuration
COPY model_config.json /app/config.json

# Set environment variables
ENV CUDA_VISIBLE_DEVICES=0,1
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Custom entrypoint for preprocessing
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
```
Common customizations:
- Additional Python packages
- Custom tokenizers
- Preprocessing scripts
- Model quantization
- Multi-GPU configuration
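For example, multi-GPU configuration maps to vLLM's tensor_parallel_size option; the equivalent flag is also available on the server image used in the Dockerfile above. A sketch using vLLM's offline Python API with a placeholder model ID:

```python
# Shard a large model across 4 GPUs with vLLM tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder; any HF model ID
    tensor_parallel_size=4,                  # split weights across 4 GPUs
    dtype="float16",
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```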
Fine-Tuning
Train models on your custom data to improve performance for domain-specific tasks. Fine-tuning creates a specialized version of a base model.
Fine-Tuning Methods
LoRA (Low-Rank Adaptation)
- How it works: Adds small trainable matrices to model layers while freezing original weights
- Memory: 90% less than full fine-tuning
- Speed: 2-3x faster training
- Quality: Near full fine-tuning performance
- Storage: 1-10MB adapter files vs. multi-GB full models
- Best for: Most use cases, limited GPU resources
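Conceptually, LoRA leaves the pretrained weight untouched and learns a low-rank correction on top of it. A minimal PyTorch sketch of the forward pass for a single projection layer:

```python
# LoRA sketch: frozen weight W plus a low-rank update B @ A scaled by alpha/r.
# For a 4096x4096 projection with r=8, that is ~65K trainable parameters
# instead of ~16.8M.
import torch

d, r, alpha = 4096, 8, 16
W = torch.randn(d, d)             # frozen pretrained weight
A = torch.randn(r, d) * 0.01      # trainable, small random init
B = torch.zeros(d, r)             # trainable, zero init -> no change at start

x = torch.randn(d)
h = W @ x + (alpha / r) * (B @ (A @ x))   # LoRA forward pass

print(f"frozen params: {W.numel():,}, trainable params: {A.numel() + B.numel():,}")
```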
QLoRA (Quantized LoRA)
- How it works: LoRA with 4-bit quantization of base model
- Memory: 95% less than full fine-tuning
- GPU Requirement: Fine-tune 70B models on single 48GB GPU
- Quality: Minimal degradation vs. standard LoRA
- Best for: Large models with limited hardware
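A sketch of the QLoRA loading step, assuming the Hugging Face transformers, bitsandbytes, and peft packages (the model ID is a placeholder): the base model is loaded in 4-bit NF4 and prepared for k-bit training, then a LoRA adapter is attached on top as in the LoRA config sketch further below.

```python
# Load the base model in 4-bit NF4 and prepare it for k-bit (QLoRA) training.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, as in the QLoRA paper
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",          # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
```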
Full Fine-Tuning
- How it works: Updates all model weights
- Memory: Requires 4-8x model size in VRAM
- Quality: Maximum customization potential
- Storage: Full model copy (10-140GB)
- Best for: Maximum performance, ample resources, significant domain shift
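The 4-8x figure comes from gradient and optimizer state rather than the weights themselves. A rough mixed-precision AdamW accounting (an illustration, not a measurement) lands at the upper end of that range:

```python
# Rough arithmetic: FP16 weights (2 B/param) + FP16 gradients (2 B/param)
# + FP32 optimizer states and master weights (~12 B/param) for AdamW.
params = 7e9                       # e.g. a 7B model (~14 GB in FP16)
bytes_per_param = 2 + 2 + 12       # weights + gradients + optimizer state
print(f"~{params * bytes_per_param / 1e9:.0f} GB before activations")  # ~112 GB, ~8x model size
```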
Fine-Tuning Configuration
Base Model Selection
- Model: Choose from Llama-2, Mistral, Flan-T5, GPT-J, or upload custom base model
- Version: Specific model checkpoint (7B, 13B, 70B variants)
- License: Ensure compliance with model license for fine-tuned derivative
Dataset Configuration
- Format: JSONL, CSV, Parquet (structured training examples)
- Structure: Input-output pairs, instruction-response, prompt-completion
- Size: Minimum 100 examples, recommended 1,000-10,000
- Split: Automatic train/validation split (default 90/10) or manual
- Preprocessing: Tokenization, truncation, padding strategies
- Data Validation: Automatic checks for format errors, duplicates, length issues
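A sketch of the instruction-response JSONL layout with a trivial validation pass; the field names "instruction" and "response" are an assumption, so match whatever schema your training job expects:

```python
# Write a tiny instruction-response dataset as JSONL and run basic checks.
import json

examples = [
    {"instruction": "Classify the sentiment: 'Great battery life.'", "response": "positive"},
    {"instruction": "Classify the sentiment: 'Arrived broken.'", "response": "negative"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Checks: each line is valid JSON, required keys exist, no empty fields.
with open("train.jsonl") as f:
    for i, line in enumerate(f, 1):
        row = json.loads(line)
        assert row.get("instruction") and row.get("response"), f"bad row at line {i}"
```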
Training Hyperparameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| Learning Rate | 2e-4 | 1e-5 to 5e-4 | Step size for weight updates. Higher = faster learning but risk of instability. Lower for large models. |
| Batch Size | 4 | 1 to 32 | Examples processed per step. Larger = more stable gradients but more memory. Limited by GPU VRAM. |
| Epochs | 3 | 1 to 20 | Complete passes through dataset. More epochs = more learning but risk of overfitting. Monitor validation loss. |
| Warmup Steps | 100 | 0 to 500 | Gradual learning rate increase at start. Prevents early training instability. Set to 5-10% of total steps. |
| Weight Decay | 0.01 | 0 to 0.1 | L2 regularization strength. Prevents overfitting by penalizing large weights. Increase for small datasets. |
| Max Sequence Length | 512 | 128 to 4096 | Maximum tokens per example. Longer = more context but more memory. Truncates longer sequences. |
| Gradient Accumulation Steps | 1 | 1 to 16 | Accumulate gradients over multiple steps before update. Simulates larger batch size with limited memory. |
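One common way a training backend might consume these values is Hugging Face TrainingArguments; the mapping below is a sketch, not the platform's actual job spec (max sequence length is applied at tokenization time rather than here):

```python
# The table defaults expressed as Hugging Face TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    warmup_steps=100,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```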
LoRA-Specific Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| LoRA Rank (r) | 8 | 4 to 64 | Dimensionality of adapter matrices. Higher = more capacity but more parameters. 8-16 typical. |
| LoRA Alpha | 16 | 8 to 128 | Scaling factor for adapter weights. Typically 2x rank. Controls strength of fine-tuned behavior. |
| LoRA Dropout | 0.05 | 0 to 0.3 | Dropout probability for adapter layers. Prevents overfitting. Higher for small datasets. |
| Target Modules | q_proj, v_proj | Various | Which layers get adapters. q_proj/v_proj (query/value) most common. Add k_proj/o_proj for more capacity. |
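Expressed with the Hugging Face peft package, the defaults in this table correspond to a LoraConfig like the following (a sketch; attach it to a loaded base model with get_peft_model):

```python
# LoRA adapter configuration matching the defaults above.
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # add "k_proj", "o_proj" for more capacity
    task_type=TaskType.CAUSAL_LM,
)

# model = get_peft_model(model, lora_config)   # attach adapters to a loaded base model
# model.print_trainable_parameters()           # verify only adapter weights are trainable
```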
Optimization Settings
- Optimizer: AdamW (default), SGD, Adafactor. AdamW best for most cases.
- Learning Rate Schedule: Linear (default), cosine, constant, polynomial. Cosine often best for long training.
- Mixed Precision: FP16 or BF16. Reduces memory 2x, faster training. BF16 more stable for large models.
- Gradient Checkpointing: Saves memory by recomputing activations. 40% memory reduction, 20% slower.
- Early Stopping: Stop if validation loss doesn't improve for N epochs. Prevents overfitting.
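Continuing the TrainingArguments sketch above, these settings map onto common Hugging Face Trainer options; exact flag availability depends on your transformers version:

```python
# Optimizer, schedule, precision, checkpointing, and early stopping options.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./checkpoints",
    optim="adamw_torch",              # AdamW optimizer
    lr_scheduler_type="cosine",       # cosine learning rate schedule
    bf16=True,                        # mixed precision (use fp16=True on pre-Ampere GPUs)
    gradient_checkpointing=True,      # trade ~20% speed for less activation memory
    evaluation_strategy="epoch",      # evaluate each epoch so early stopping can act
    save_strategy="epoch",            # must match evaluation_strategy for best-model loading
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stop = EarlyStoppingCallback(early_stopping_patience=2)  # pass via Trainer(callbacks=[...])
```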
Evaluation Metrics
Track these metrics during training:
- Training Loss: How well model fits training data. Should decrease consistently.
- Validation Loss: Performance on held-out data. If increases while train loss decreases = overfitting.
- Perplexity: Measure of prediction uncertainty. Lower = better. Exp(loss).
- Learning Rate: Current learning rate (changes with schedule). Monitor for stability.
- Gradient Norm: Magnitude of gradients. Spikes indicate instability, may need lower learning rate.
- Examples/Second: Training throughput. Monitor for hardware efficiency.
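Perplexity can always be recovered from a logged loss value, since it is just the exponential of the average cross-entropy loss:

```python
# Convert a validation loss into perplexity.
import math

eval_loss = 2.1                                    # example value from a training log
print(f"perplexity: {math.exp(eval_loss):.2f}")    # ~8.17
```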
Fine-Tuning Workflow
1. Prepare Dataset: Upload training data in JSONL format with input-output pairs
2. Configure Job: Select base model, method (LoRA/QLoRA/Full), set hyperparameters
3. Start Training: Job runs on GPU cluster, typically 1-6 hours depending on model and dataset size
4. Monitor Progress: Real-time loss curves, validation metrics, sample outputs
5. Evaluate Results: Test on validation set, compare to base model, check for overfitting
6. Deploy Model: One-click deployment as inference endpoint with same API as base model
7. Iterate: Adjust hyperparameters, add more data, or try different base models
Next Steps
- Configure autoscaling to handle varying inference loads
- Set up monitoring to track model performance
- Optimize costs with right-sized resources