Fine-Tuning
Train models on your custom data to improve performance on domain-specific tasks. The platform supports four training methods (LoRA, QLoRA, full fine-tuning, and prompt tuning), with real-time monitoring, cost tracking, and one-click deployment.
Overview
Fine-tuning creates a specialized version of a base model by training on your data. The platform handles the full lifecycle:
- Configure - Select base model, training method, dataset, and hyperparameters
- Train - Dedicated GPU nodes run your training job with real-time metrics
- Monitor - Track loss curves, resource usage, and costs in real-time
- Deploy - One-click deployment as an inference endpoint
Base Models
Choose from 30+ open-source models across major families:
| Family | Models | Best For |
|---|---|---|
| Llama | 3.3 70B, 3.2 (1B/3B/11B/90B), 3.1 (8B/70B), CodeLlama | General purpose, coding |
| Mistral | 7B, v0.3, Mixtral 8x7B, Nemo, Codestral | Multilingual, code |
| Qwen | 2.5 (0.5B-72B), Coder, VL variants | Multilingual, vision |
| Phi | 3.5 Mini, 3 Medium, 4 | Small efficient models |
| Gemma | 2 (9B/27B), 7B, CodeGemma | Compact, code |
| DeepSeek | R1 (1.5B-70B), Coder, Math | Reasoning, code, math |
Models can be sourced from HuggingFace, S3, or local storage.
Training Methods
LoRA (Low-Rank Adaptation)
Adds small trainable matrices to model layers while freezing original weights.
- Memory: ~50% of base model VRAM
- GPU: 16GB minimum (T4 recommended)
- Quality: Near full fine-tuning performance
- Storage: 1-50MB adapter files
LoRA Parameters:
| Parameter | Default | Range | Description |
|---|---|---|---|
| Rank (r) | 16 | 1-256 | Dimensionality of adapter matrices. Higher = more capacity |
| Alpha | 32 | 1-512 | Scaling factor. Typically 2x rank |
| Dropout | 0.1 | 0-0.5 | Prevents overfitting. Higher for small datasets |
| Target Modules | q_proj, v_proj | Various | Which layers get adapters |
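As a rough sketch of what these defaults imply for adapter size, the parameter count can be computed directly (the 4096-dim projections and 32-layer depth below are assumed Llama-7B-style shapes, not platform values):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out projection:
    a rank x d_in down-projection plus a d_out x rank up-projection."""
    return rank * (d_in + d_out)

# Assumed Llama-7B-style shapes: 4096-dim q_proj/v_proj across 32 layers
per_layer = lora_param_count(4096, 4096, rank=16)   # 131,072 per projection
total = per_layer * 2 * 32                          # q_proj + v_proj, 32 layers
size_mb = total * 2 / 1e6                           # fp16 adapter: ~16.8 MB
```

This lands comfortably inside the 1-50MB adapter range quoted above; doubling the rank roughly doubles both capacity and adapter size.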
QLoRA (Quantized LoRA)
LoRA with 4-bit quantization of the base model for maximum memory efficiency.
- Memory: ~25% of base model VRAM
- GPU: 8GB minimum (T4 recommended)
- Quality: Minimal degradation vs standard LoRA
- Best for: Large models (70B) on limited hardware
QLoRA Parameters:
| Parameter | Default | Options | Description |
|---|---|---|---|
| Bits | 4 | 2, 3, 4, 5, 6, 8 | Quantization precision |
| Quant Type | nf4 | nf4, int4, fp4 | Quantization algorithm |
| Double Quant | true | true/false | Quantize the quantization constants |
| Compute Dtype | float16 | float16, bfloat16, float32 | Compute precision |
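The memory savings follow directly from the bit width. A back-of-the-envelope footprint for the quantized base weights alone (ignoring activations, adapters, and quantization constants):

```python
def quantized_weight_gb(params_billions: float, bits: int = 4) -> float:
    """Approximate VRAM for the quantized base weights alone:
    1e9 params x (bits / 8) bytes, expressed in GB."""
    return params_billions * bits / 8

print(quantized_weight_gb(7))    # 3.5 GB  (the VRAM table below lists ~4GB)
print(quantized_weight_gb(70))   # 35.0 GB (matches ~35GB for 70B QLoRA)
```

Double quantization shaves a little more by compressing the per-block quantization constants themselves.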
Full Fine-Tuning
Updates all model weights for maximum customization.
- Memory: ~4x base model weights in VRAM (gradients and optimizer states dominate)
- GPU: 40GB+ (A100 recommended)
- Quality: Maximum customization potential
- Storage: Full model copy (10-140GB)
- Best for: Significant domain shift, ample resources
Prompt Tuning
Learns soft prompt embeddings prepended to inputs, keeping the full model frozen.
- Memory: Minimal additional VRAM
- GPU: 16GB+ recommended
- Quality: Good for simple task adaptation
- Storage: Very small adapter (< 1MB)
- Best for: Quick experiments, task-specific tuning with minimal resources
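The tiny adapter size follows from what is actually learned: a handful of embedding vectors. A quick sanity check (20 virtual tokens and a 4096-dim model are illustrative assumptions):

```python
def soft_prompt_bytes(num_virtual_tokens: int, hidden_dim: int,
                      bytes_per_param: int = 2) -> int:
    """Storage for learned soft-prompt embeddings (fp16 by default)."""
    return num_virtual_tokens * hidden_dim * bytes_per_param

# Assumed: 20 virtual tokens prepended to a 4096-dim model
print(soft_prompt_bytes(20, 4096) / 1024)  # 160.0 KB, well under 1MB
```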
Dataset Configuration
Supported Formats
| Format | Structure | Best For |
|---|---|---|
| JSONL | One JSON object per line | General purpose |
| Alpaca | {"instruction", "input", "output"} | Instruction following |
| ShareGPT | {"conversations": [...]} | Multi-turn chat |
| OpenAI | {"messages": [...]} | Chat format |
| CSV | Columnar data | Classification, simple tasks |
| Parquet | Compressed columnar | Large datasets |
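Since the formats in the table carry the same information, converting between them is mechanical. A minimal sketch mapping one Alpaca record to the OpenAI chat structure (field names are taken from the table above, not from a platform API):

```python
def alpaca_to_openai(record: dict) -> dict:
    """Convert one Alpaca-format record ({"instruction", "input", "output"})
    into the OpenAI chat format ({"messages": [...]})."""
    user = record["instruction"]
    if record.get("input"):
        # Alpaca's optional context is appended to the user turn
        user += "\n\n" + record["input"]
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["output"]},
    ]}
```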
Dataset Parameters
| Parameter | Default | Description |
|---|---|---|
| Train/Val/Test Split | 80/15/5 | Automatic dataset splitting |
| Max Samples | unlimited | Limit training examples |
| Cutoff Length | 2048 | Maximum tokens per example (128-4096) |
| Streaming | false | Stream large datasets from storage |
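The default 80/15/5 split can be reproduced locally to inspect what the platform will see (a sketch of the ratio arithmetic, not the platform's actual splitting code):

```python
import random

def split_dataset(examples, train=0.80, val=0.15, seed=42):
    """Shuffle and split with the default 80/15/5 ratios; whatever remains
    after train and validation becomes the test set."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

tr, va, te = split_dataset(range(1000))
# len(tr), len(va), len(te) -> 800, 150, 50
```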
Data Sources
- File Upload - Drag and drop JSONL/CSV/Parquet files (multipart form upload)
- JSON API - Submit configuration via JSON with S3 path or HuggingFace dataset ID
- S3 Path - Reference datasets stored in S3
Dataset Requirements
- Minimum: 100 examples (1,000-10,000 recommended)
- Validation: Automatic checks for format errors, duplicates, length issues
Hyperparameters
Training Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| Learning Rate | 2e-4 | 1e-6 to 1e-2 | Step size for weight updates |
| Batch Size | 4 | 1-32 | Examples per gradient step |
| Epochs | 3 | 1-50 | Complete passes through dataset |
| Gradient Accumulation | 4 | 1-32 | Simulates larger batch size with less memory |
| Warmup Steps | 100 | 0+ | Gradual learning rate increase at start |
| Weight Decay | 0.01 | 0-1 | L2 regularization strength |
| Max Sequence Length | 2048 | 128-4096 | Maximum tokens per example |
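Two of these parameters interact in ways worth making explicit: gradient accumulation multiplies the effective batch size without the memory cost of holding more examples at once, and warmup ramps the learning rate linearly from zero. A sketch of both (the post-warmup schedule is assumed constant here; real trainers often decay it):

```python
def effective_batch_size(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return batch_size * grad_accum * num_gpus

def lr_at_step(step: int, base_lr: float = 2e-4, warmup_steps: int = 100) -> float:
    """Linear warmup to base_lr, then constant (schedules vary by trainer)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

print(effective_batch_size(4, 4))  # 16 with the defaults above
print(lr_at_step(50))              # halfway through warmup: 1e-4
```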
Advanced Configuration
| Setting | Default | Description |
|---|---|---|
| Early Stopping | true | Stop training when validation loss plateaus |
| Save Strategy | steps | When to save checkpoints |
| Evaluation Strategy | steps | When to run evaluation |
| Logging Steps | 10 | Steps between log entries |
| Save Steps | 500 | Steps between checkpoint saves |
| Eval Steps | 500 | Steps between evaluations |
| Dataloader Workers | 4 | Parallel data loading threads (0-16) |
| FP16 | true | Half-precision training for memory efficiency |
| Gradient Checkpointing | true | Trade compute for memory |
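Early stopping watches validation loss and halts once it stops improving. A minimal patience-based sketch, assuming the common patience/min-delta formulation (the platform's exact trigger logic isn't documented here):

```python
class EarlyStopping:
    """Stop once validation loss fails to improve for `patience` evaluations."""
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0       # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
results = [stopper.should_stop(l) for l in [1.0, 0.8, 0.81, 0.82, 0.83]]
# stops at 0.82, the second consecutive non-improving eval
```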
Hardware Selection
GPU Types
| GPU | VRAM | Valid gpu_type Value | Best For |
|---|---|---|---|
| NVIDIA T4 | 16GB | nvidia-t4 | LoRA/QLoRA on 7B models |
| NVIDIA L4 | 24GB | nvidia-l4 | LoRA on 13B models |
| NVIDIA A10G | 24GB | nvidia-a10g | LoRA/Full on 7-13B models |
| NVIDIA A100 40GB | 40GB | nvidia-a100-40gb | Full fine-tuning 13B+, LoRA 70B |
| NVIDIA A100 80GB | 80GB | nvidia-a100-80gb | Full fine-tuning 70B |
| NVIDIA H100 | 80GB | nvidia-h100 | Maximum performance |
Hardware Configuration Options
| Parameter | Default | Range | Description |
|---|---|---|---|
| GPU Type | nvidia-t4 | See table above | GPU for Karpenter node selection |
| GPU Count | 1 | 1-8 | Number of GPUs |
| Storage (GB) | 100 | 50-1000 | Ephemeral storage |
| Instance Type | (derived) | Optional | Legacy field, derived from gpu_type |
VRAM Requirements by Method
| Model Size | QLoRA | LoRA | Full |
|---|---|---|---|
| 7B | ~4GB | ~8GB | ~28GB |
| 13B | ~7GB | ~13GB | ~52GB |
| 30B | ~16GB | ~30GB | ~120GB |
| 70B | ~35GB | ~70GB | ~280GB |
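The table reduces to a simple rule of thumb: roughly 0.5, 1, and 4 GB per billion parameters for QLoRA, LoRA, and full fine-tuning respectively. A sketch of that heuristic (real usage also depends on sequence length, batch size, and gradient checkpointing):

```python
GB_PER_BILLION = {"qlora": 0.5, "lora": 1.0, "full": 4.0}  # rough heuristic

def vram_estimate_gb(params_billions: float, method: str) -> float:
    """Rule-of-thumb VRAM need consistent with the table above."""
    return params_billions * GB_PER_BILLION[method]

print(vram_estimate_gb(7, "qlora"))  # 3.5   (table: ~4GB)
print(vram_estimate_gb(13, "lora"))  # 13.0  (table: ~13GB)
print(vram_estimate_gb(70, "full"))  # 280.0 (table: ~280GB)
```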
Monitoring
Real-Time Metrics
During training, monitor:
- Training Loss - Should decrease consistently
- Validation Loss - Rising validation loss while training loss keeps falling indicates overfitting
- Learning Rate - Current rate per schedule
- Training Speed - Samples/second throughput
- Gradient Norm - Spikes indicate instability
Resource Monitoring
- GPU Utilization - GPU compute usage %
- GPU Memory - VRAM used/total
- GPU Temperature - Thermal monitoring
- CPU/Memory - System resource usage
Logs
Filter logs by source:
- Build - Docker image build output
- Deployment - Kubernetes deployment events
- Training - Training loop output
Log levels: debug, info, warning, error, critical
Cost Tracking
- Current Cost - Real-time spend
- Estimated Total - Projected final cost
- Hourly Rate - Based on instance type
- Compute Hours - Total GPU time
Job Lifecycle
Status States
| Status | Description |
|---|---|
| pending | Job created, waiting to start |
| preparing | Uploading dataset, building Docker image |
| building | Docker image being built and pushed to ECR |
| deploying | Kubernetes node provisioning |
| training | Model training in progress |
| evaluating | Running validation metrics |
| completed | Training finished successfully |
| failed | Error occurred (check logs) |
| stopping | Stop requested, shutting down |
| cancelled | Stopped by user |
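A client can drive this state machine by polling until a terminal state is reached. A minimal sketch; in practice `fetch_status` would GET `/api/v1/fine-tuning/jobs/{job_id}` and read the status field (the stubbed progression below is purely illustrative):

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_status, interval_s: float = 0.0) -> str:
    """Poll a status-returning callable until the job reaches a terminal state."""
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_s)

# Stubbed status progression for illustration
states = iter(["pending", "building", "training", "completed"])
print(wait_for_job(lambda: next(states)))  # completed
```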
Infrastructure
Each fine-tuning job:
- Builds a custom Docker image (PyTorch 2.1.0 + CUDA 11.8 + HuggingFace)
- Pushes to AWS ECR
- Provisions a dedicated GPU node (AWS EKS) in the organization's namespace
- Runs as a Kubernetes Job with exclusive node access
- Uploads model artifacts to S3 on completion
- Cleans up node and resources
Post-Training
After a job completes:
- Deploy - One-click deployment as a vLLM inference endpoint (uses the same API as any other model)
- Download - Generate a presigned S3 URL (1-hour expiration) to download model weights
- Delete - Remove job, logs, metrics, and S3 artifacts
API Endpoints
Job Management
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/fine-tuning/jobs | POST | Create new fine-tuning job (multipart form with file upload) |
| /api/v1/fine-tuning/jobs/json | POST | Create new fine-tuning job (JSON body, no file upload) |
| /api/v1/fine-tuning/jobs | GET | List all jobs |
| /api/v1/fine-tuning/jobs/{job_id} | GET | Get job details |
| /api/v1/fine-tuning/jobs/{job_id}/stop | POST | Stop running job |
| /api/v1/fine-tuning/jobs/{job_id} | DELETE | Delete job |
| /api/v1/fine-tuning/jobs/{job_id}/logs | GET | Get training logs (log_type: build, deployment, training) |
| /api/v1/fine-tuning/jobs/{job_id}/metrics | GET | Get training metrics |
| /api/v1/fine-tuning/jobs/{job_id}/download | GET | Get model download URL |
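Creating a job via the JSON endpoint might look like the sketch below. The payload field names are illustrative guesses (the request schema isn't shown on this page); only the URL path and the use of a bearer token are taken from the docs:

```python
import json
import urllib.request

def build_create_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /api/v1/fine-tuning/jobs/json."""
    return urllib.request.Request(
        f"{base_url}/api/v1/fine-tuning/jobs/json",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

payload = {
    # hypothetical field names for illustration
    "base_model": "meta-llama/Llama-3.1-8B",
    "method": "lora",
    "dataset_s3_path": "s3://my-bucket/train.jsonl",
    "hyperparameters": {"learning_rate": 2e-4, "epochs": 3},
    "hardware": {"gpu_type": "nvidia-t4", "gpu_count": 1},
}
req = build_create_request("https://api.example.com", "YOUR_API_KEY", payload)
# submit with urllib.request.urlopen(req)
```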
Configuration
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/fine-tuning/base-models | GET | List available base models |
| /api/v1/fine-tuning/hardware-options | GET | List hardware configurations |
| /api/v1/fine-tuning/models/{model_id}/requirements | GET | Get hardware requirements for a specific model |
| /api/v1/fine-tuning/validate-hardware | POST | Validate hardware config against model requirements |
| /api/v1/fine-tuning/estimate-cost | POST | Estimate training cost |
| /api/v1/fine-tuning/stats | GET | Get user statistics |
Cost Estimation
Training cost is calculated based on:
- Instance hourly rate - GPU type determines cost
- Estimated training hours - Based on dataset size, epochs, and method
- Method efficiency - QLoRA/LoRA faster than full fine-tuning
- Overhead - 20-30% for setup, evaluation, and export
Example estimates:
| Scenario | GPU | Hours | Cost |
|---|---|---|---|
| 7B LoRA, 1K examples, 3 epochs | T4 | ~1h | ~$1 |
| 13B QLoRA, 5K examples, 5 epochs | A10G | ~3h | ~$5 |
| 70B QLoRA, 10K examples, 3 epochs | A100 40GB | ~6h | ~$20 |
| 7B Full, 10K examples, 5 epochs | A100 40GB | ~4h | ~$13 |
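The components above combine into a simple formula: GPU-hours times hourly rate, plus overhead. A sketch with a hypothetical T4 on-demand rate (the platform's actual rates come from the hardware-options endpoint):

```python
def estimate_cost(hourly_rate: float, training_hours: float,
                  overhead: float = 0.25) -> float:
    """Cost = GPU-hours x hourly rate, plus 20-30% overhead for setup,
    evaluation, and export (0.25 assumed here)."""
    return hourly_rate * training_hours * (1 + overhead)

# Hypothetical $0.80/hr T4 rate for the ~1h 7B LoRA scenario above
print(round(estimate_cost(0.80, 1.0), 2))  # ~$1, in line with the table
```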
Best Practices
Dataset Quality
- Minimum 1,000 examples for meaningful fine-tuning
- Consistent format - Ensure all examples follow the same structure
- Balanced distribution - Avoid class imbalance (>10:1 ratio triggers warnings)
- Remove duplicates - Platform warns if >10% duplicates detected
- Appropriate length - Set max_sequence_length based on your data distribution
Training Configuration
- Start with LoRA - Most cost-effective for initial experiments
- Use QLoRA for large models - Fine-tune 70B on a single A100 40GB
- Monitor validation loss - Stop early if it increases (overfitting)
- Lower learning rate for larger models - 1e-5 for 70B vs 2e-4 for 7B
- Increase rank for complex tasks - r=16-64 for specialized domains
Cost Optimization
- Start small - Test with subset of data before full training
- Use QLoRA - 75% less GPU cost than full fine-tuning
- Set early stopping - Avoid unnecessary training epochs
- Right-size GPU - Don't use A100 for a 7B LoRA job
Next Steps
- Deploy your fine-tuned model as an inference endpoint
- Configure autoscaling for your deployed model
- Monitor performance of your fine-tuned model