Deploying AI Models

Deploy open-source models from Hugging Face on your own infrastructure with full control over scaling, costs, and data privacy.

Environment Selection

Choose the runtime environment for your self-hosted model:

Environment | Best For | Features
vLLM | High-throughput inference | PagedAttention, continuous batching, optimized CUDA kernels
Text Generation Inference (TGI) | Production deployments | Token streaming, tensor parallelism, quantization support
Ollama | Local development, smaller models | Simple setup, CPU support, model library
Custom | Specialized requirements | Bring your own Docker image and configuration
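
For example, the vLLM environment can be exercised locally before deploying. A minimal sketch, assuming the vllm package and a CUDA GPU are available (the model name is only an example):

# Example: vLLM offline inference sketch
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)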

Deployment Options

Control when and how your model is available:

Always On

Behavior: Model stays running 24/7

  • Use Case: Production applications with consistent traffic
  • Cost: Fixed compute costs regardless of usage
  • Latency: Instant response (no cold start)
  • Configuration: Set min replicas ≥ 1, no scale-to-zero
Best for Production

Always On deployment ensures instant response times and is ideal for production applications with consistent traffic patterns.

On Demand

Behavior: Scales to zero when idle, starts on first request

  • Use Case: Development, testing, low-traffic applications
  • Cost: Pay only for actual usage time
  • Latency: 30-60 second cold start for first request
  • Configuration: Set scale-to-zero timeout (default: 5 minutes idle)
  • Savings: 70-90% cost reduction for intermittent workloads
Cost Optimization

On Demand deployment can reduce costs by 70-90% for intermittent workloads by scaling to zero during idle periods.
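
The savings figure follows from simple arithmetic. A rough sketch, assuming a placeholder GPU price of $1.20/hour and about 4 busy hours per day:

# Back-of-envelope On Demand savings (GPU rate and usage pattern are assumptions)
rate_per_hour = 1.20                  # placeholder USD per GPU-hour
always_on = 24 * 30 * rate_per_hour   # ~$864/month running 24/7
on_demand = 4 * 30 * rate_per_hour    # ~$144/month for ~4 active hours/day
print(1 - on_demand / always_on)      # ~0.83, i.e. ~83% saved, within the 70-90% range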

Scheduled

Behavior: Runs during specified time windows

  • Use Case: Batch processing, business hours only, regional availability
  • Cost: Only charged during active schedule
  • Latency: Instant during scheduled hours, unavailable outside schedule
  • Configuration: Cron expressions (e.g., "0 9-17 * * 1-5" for weekdays 9am-5pm)
  • Examples:
    • 0 8 * * * - Daily at 8am
    • 0 */4 * * * - Every 4 hours
    • 0 0 * * 0 - Sunday midnight
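
Schedule strings use standard five-field cron syntax, so they can be sanity-checked before saving. A small sketch using the third-party croniter package (an assumption; it is not required by the platform):

# Example: preview the next activations of the weekday business-hours schedule
from datetime import datetime
from croniter import croniter

it = croniter("0 9-17 * * 1-5", datetime(2024, 1, 1))  # Mon-Fri, each hour 9am-5pm
for _ in range(3):
    print(it.get_next(datetime))  # upcoming activation times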

Resource Configuration

Choose the right hardware for your model size:

Model Size | GPU | VRAM | Recommended For
Small (7B parameters) | 1x T4 / L4 | 16GB | Llama-2-7B, Mistral-7B
Medium (13B parameters) | 1x A10G / L40 | 24GB | Llama-2-13B, Vicuna-13B
Large (70B parameters) | 4x A100 (40GB) | 160GB total | Llama-2-70B with tensor parallelism
Quantized (4-bit) | 1x T4 | 8-12GB | 7B-13B models quantized to 4-bit
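
A quick way to reason about these rows is to multiply parameter count by bytes per parameter; this covers weights only and ignores the KV cache and activation overhead:

# Rough VRAM needed for model weights alone
def weight_gib(params_billions, bytes_per_param):
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(weight_gib(7, 2))     # ~13 GiB: 7B in FP16 fits a 16GB T4/L4
print(weight_gib(70, 2))    # ~130 GiB: 70B in FP16 needs 4x A100 40GB
print(weight_gib(13, 0.5))  # ~6 GiB: 13B at 4-bit fits well under 12GB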

Customizing the Dockerfile

For advanced use cases, customize the Docker image for your model deployment:

# Example: Custom vLLM Dockerfile
FROM vllm/vllm-openai:latest

# Install additional dependencies
RUN pip install transformers==4.36.0 accelerate

# Copy custom configuration
COPY model_config.json /app/config.json

# Set environment variables
ENV CUDA_VISIBLE_DEVICES=0,1
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Custom entrypoint for preprocessing
COPY entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh

ENTRYPOINT ["/app/entrypoint.sh"]

Common customizations:

  • Additional Python packages
  • Custom tokenizers
  • Preprocessing scripts
  • Model quantization
  • Multi-GPU configuration

Fine-Tuning

Train models on your custom data to improve performance for domain-specific tasks. Fine-tuning creates a specialized version of a base model.

Fine-Tuning Methods

LoRA (Low-Rank Adaptation)

How it works: Adds small trainable matrices to model layers while freezing original weights

  • Memory: 90% less than full fine-tuning
  • Speed: 2-3x faster training
  • Quality: Near full fine-tuning performance
  • Storage: 1-10MB adapter files vs. multi-GB full models
  • Best for: Most use cases, limited GPU resources
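
A minimal sketch of this idea using the Hugging Face peft library (the model name is only an example; the values match the LoRA parameter table further below):

# Example: wrap a frozen base model with small LoRA adapters
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor, typically 2x rank
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # original weights stay frozen
model.print_trainable_parameters()    # adapters are well under 1% of total parameters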

QLoRA (Quantized LoRA)

How it works: LoRA with 4-bit quantization of base model

  • Memory: 95% less than full fine-tuning
  • GPU Requirement: Fine-tune 70B models on a single 48GB GPU
  • Quality: Minimal degradation vs. standard LoRA
  • Best for: Large models with limited hardware
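
A sketch of the same recipe with a 4-bit base model via bitsandbytes (the model name is only an example; a CUDA GPU is assumed):

# Example: QLoRA = 4-bit quantized base model + LoRA adapters
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", quantization_config=bnb, device_map="auto"
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))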

Full Fine-Tuning

How it works: Updates all model weights

  • Memory: Requires 4-8x model size in VRAM
  • Quality: Maximum customization potential
  • Storage: Full model copy (10-140GB)
  • Best for: Maximum performance, ample resources, significant domain shift
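
The 4-8x figure comes largely from optimizer state: with AdamW in mixed precision, each parameter costs roughly 16 bytes before activations are counted. A quick sketch for a 7B model:

# Back-of-envelope memory for full fine-tuning with AdamW (activations excluded)
params = 7e9
bytes_per_param = 2 + 2 + 4 + 8  # fp16 weights + fp16 grads + fp32 master copy + Adam m and v
print(params * bytes_per_param / 1024**3)  # ~104 GiB, vs ~13 GiB for the fp16 weights alone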

Fine-Tuning Configuration

Base Model Selection

  • Model: Choose from Llama-2, Mistral, Flan-T5, GPT-J, or upload custom base model
  • Version: Specific model checkpoint (7B, 13B, 70B variants)
  • License: Ensure compliance with model license for fine-tuned derivative

Dataset Configuration

  • Format: JSONL, CSV, Parquet (structured training examples)
  • Structure: Input-output pairs, instruction-response, prompt-completion
  • Size: Minimum 100 examples, recommended 1,000-10,000
  • Split: Automatic train/validation split (default 90/10) or manual
  • Preprocessing: Tokenization, truncation, padding strategies
  • Data Validation: Automatic checks for format errors, duplicates, length issues
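
A sketch of a JSONL file of prompt-completion pairs with a manual 90/10 split (the field names are an assumption; use whatever schema your training job expects):

# Example: write train/validation JSONL files from a list of examples
import json, random

examples = [
    {"prompt": "Summarize: The quarterly report shows ...", "completion": "Revenue grew 12% ..."},
    {"prompt": "Translate to French: Good morning", "completion": "Bonjour"},
    # ... at least a few hundred examples in practice
]
random.shuffle(examples)
split = int(len(examples) * 0.9)
for path, rows in [("train.jsonl", examples[:split]), ("valid.jsonl", examples[split:])]:
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")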

Training Hyperparameters

Parameter | Default | Range | Description
Learning Rate | 2e-4 | 1e-5 to 5e-4 | Step size for weight updates. Higher = faster learning but risk of instability. Use lower values for large models.
Batch Size | 4 | 1 to 32 | Examples processed per step. Larger = more stable gradients but more memory. Limited by GPU VRAM.
Epochs | 3 | 1 to 20 | Complete passes through the dataset. More epochs = more learning but risk of overfitting. Monitor validation loss.
Warmup Steps | 100 | 0 to 500 | Gradual learning rate increase at the start. Prevents early training instability. Set to 5-10% of total steps.
Weight Decay | 0.01 | 0 to 0.1 | L2 regularization strength. Prevents overfitting by penalizing large weights. Increase for small datasets.
Max Sequence Length | 512 | 128 to 4096 | Maximum tokens per example. Longer = more context but more memory. Longer sequences are truncated.
Gradient Accumulation Steps | 1 | 1 to 16 | Accumulates gradients over multiple steps before an update. Simulates a larger batch size with limited memory.
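
As a reference point, the defaults above map onto the Hugging Face Trainer API roughly as follows (a sketch, not the platform's internal configuration):

# Example: transformers TrainingArguments mirroring the table defaults
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-out",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    warmup_steps=100,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
# Max sequence length is applied at tokenization time (e.g. truncation to 512 tokens),
# not through TrainingArguments.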

LoRA-Specific Parameters

Parameter | Default | Range | Description
LoRA Rank (r) | 8 | 4 to 64 | Dimensionality of adapter matrices. Higher = more capacity but more parameters. 8-16 is typical.
LoRA Alpha | 16 | 8 to 128 | Scaling factor for adapter weights. Typically 2x rank. Controls strength of fine-tuned behavior.
LoRA Dropout | 0.05 | 0 to 0.3 | Dropout probability for adapter layers. Prevents overfitting. Use higher values for small datasets.
Target Modules | q_proj, v_proj | Various | Which layers get adapters. q_proj/v_proj (query/value) are most common. Add k_proj/o_proj for more capacity.

Optimization Settings

  • Optimizer: AdamW (default), SGD, Adafactor. AdamW best for most cases.
  • Learning Rate Schedule: Linear (default), cosine, constant, polynomial. Cosine often best for long training.
  • Mixed Precision: FP16 or BF16. Reduces memory 2x, faster training. BF16 more stable for large models.
  • Gradient Checkpointing: Saves memory by recomputing activations. 40% memory reduction, 20% slower.
  • Early Stopping: Stop if validation loss doesn't improve for N epochs. Prevents overfitting.
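
Expressed against the same Trainer API, these settings look roughly like the sketch below (argument names are from transformers; eval_strategy is called evaluation_strategy in older releases):

# Example: optimization settings as Trainer options
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="finetune-out",
    optim="adamw_torch",            # AdamW optimizer
    lr_scheduler_type="cosine",     # cosine schedule, often best for long runs
    bf16=True,                      # mixed precision (use fp16=True on GPUs without BF16)
    gradient_checkpointing=True,    # ~40% less activation memory, ~20% slower
    eval_strategy="epoch",          # evaluate each epoch so early stopping can trigger
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]  # stop after 3 evals without improvement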

Evaluation Metrics

Track these metrics during training:

  • Training Loss: How well model fits training data. Should decrease consistently.
  • Validation Loss: Performance on held-out data. If it rises while training loss keeps falling, the model is overfitting.
  • Perplexity: Measure of prediction uncertainty, computed as exp(loss). Lower is better.
  • Learning Rate: Current learning rate (changes with schedule). Monitor for stability.
  • Gradient Norm: Magnitude of gradients. Spikes indicate instability, may need lower learning rate.
  • Examples/Second: Training throughput. Monitor for hardware efficiency.
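
The perplexity relationship is simple enough to verify directly (the loss value below is just an illustration):

# Perplexity is the exponential of the mean cross-entropy loss
import math

eval_loss = 1.85             # example validation loss
print(math.exp(eval_loss))   # ~6.36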

Fine-Tuning Workflow

  1. Prepare Dataset: Upload training data in JSONL format with input-output pairs
  2. Configure Job: Select base model, method (LoRA/QLoRA/Full), set hyperparameters
  3. Start Training: Job runs on GPU cluster, typically 1-6 hours depending on size
  4. Monitor Progress: Real-time loss curves, validation metrics, sample outputs
  5. Evaluate Results: Test on validation set, compare to base model, check for overfitting
  6. Deploy Model: One-click deployment as inference endpoint with same API as base model
  7. Iterate: Adjust hyperparameters, add more data, or try different base models
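
Once deployed, the fine-tuned endpoint is queried like any other model. A sketch using the OpenAI-compatible API shape that vLLM and TGI expose (the base URL, API key, and deployment name are placeholders):

# Example: call a deployed fine-tuned model through an OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")
resp = client.chat.completions.create(
    model="my-finetuned-llama",  # placeholder deployment name
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
print(resp.choices[0].message.content)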

Next Steps