
Fine-Tuning

Train models on your custom data to improve performance on domain-specific tasks. The platform supports four training methods (LoRA, QLoRA, full fine-tuning, and prompt tuning), along with real-time monitoring, cost tracking, and one-click deployment.

Overview

Fine-tuning creates a specialized version of a base model by training on your data. The platform handles the full lifecycle:

  1. Configure - Select base model, training method, dataset, and hyperparameters
  2. Train - Dedicated GPU nodes run your training job with real-time metrics
  3. Monitor - Track loss curves, resource usage, and costs in real-time
  4. Deploy - One-click deployment as an inference endpoint

Base Models

Choose from 30+ open-source models across major families:

| Family | Models | Best For |
|---|---|---|
| Llama | 3.3 70B, 3.2 (1B/3B/11B/90B), 3.1 (8B/70B), CodeLlama | General purpose, coding |
| Mistral | 7B, v0.3, Mixtral 8x7B, Nemo, Codestral | Multilingual, code |
| Qwen | 2.5 (0.5B-72B), Coder, VL variants | Multilingual, vision |
| Phi | 3.5 Mini, 3 Medium, 4 | Small efficient models |
| Gemma | 2 (9B/27B), 7B, CodeGemma | Compact, code |
| DeepSeek | R1 (1.5B-70B), Coder, Math | Reasoning, code, math |

Models can be sourced from HuggingFace, S3, or local storage.


Training Methods

LoRA (Low-Rank Adaptation)

Adds small trainable matrices to model layers while freezing original weights.

  • Memory: ~50% of base model VRAM
  • GPU: 16GB minimum (T4 recommended)
  • Quality: Near full fine-tuning performance
  • Storage: 1-50MB adapter files

LoRA Parameters:

| Parameter | Default | Range | Description |
|---|---|---|---|
| Rank (r) | 16 | 1-256 | Dimensionality of adapter matrices. Higher = more capacity |
| Alpha | 32 | 1-512 | Scaling factor. Typically 2x rank |
| Dropout | 0.1 | 0-0.5 | Prevents overfitting. Higher for small datasets |
| Target Modules | q_proj, v_proj | Various | Which layers get adapters |
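To get a sense of scale: each adapted d×d projection gains two low-rank matrices (d×r and r×d), so trainable parameters grow linearly with rank. A minimal sketch (the model dimensions below are illustrative Llama-7B-like values, not platform settings):

```python
def lora_trainable_params(d_model: int, rank: int, n_layers: int,
                          n_target_modules: int = 2) -> int:
    """Trainable parameters added by LoRA: each adapted d x d projection
    gains an A (d x r) and a B (r x d) matrix, i.e. 2 * d * r params."""
    per_module = 2 * d_model * rank
    return per_module * n_target_modules * n_layers

# Example: d_model=4096, 32 layers, rank 16, adapters on q_proj and v_proj.
print(lora_trainable_params(4096, 16, 32))  # 8388608 (~0.1% of a 7B model)
```

This is why LoRA adapter files stay in the 1-50MB range even for multi-billion-parameter base models.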

QLoRA (Quantized LoRA)

LoRA with 4-bit quantization of the base model for maximum memory efficiency.

  • Memory: ~25% of base model VRAM
  • GPU: 8GB minimum (T4 recommended)
  • Quality: Minimal degradation vs standard LoRA
  • Best for: Large models (70B) on limited hardware

QLoRA Parameters:

| Parameter | Default | Options | Description |
|---|---|---|---|
| Bits | 4 | 2, 3, 4, 5, 6, 8 | Quantization precision |
| Quant Type | nf4 | nf4, int4, fp4 | Quantization algorithm |
| Double Quant | true | true/false | Quantize the quantization constants |
| Compute Dtype | float16 | float16, bfloat16, float32 | Compute precision |

Full Fine-Tuning

Updates all model weights for maximum customization.

  • Memory: 150% of base model VRAM (4-8x model size total)
  • GPU: 40GB+ (A100 recommended)
  • Quality: Maximum customization potential
  • Storage: Full model copy (10-140GB)
  • Best for: Significant domain shift, ample resources

Prompt Tuning

Learns soft prompt embeddings prepended to inputs, keeping the full model frozen.

  • Memory: Minimal additional VRAM
  • GPU: 16GB+ recommended
  • Quality: Good for simple task adaptation
  • Storage: Very small adapter (< 1MB)
  • Best for: Quick experiments, task-specific tuning with minimal resources

Dataset Configuration

Supported Formats

| Format | Structure | Best For |
|---|---|---|
| JSONL | One JSON object per line | General purpose |
| Alpaca | {"instruction", "input", "output"} | Instruction following |
| ShareGPT | {"conversations": [...]} | Multi-turn chat |
| OpenAI | {"messages": [...]} | Chat format |
| CSV | Columnar data | Classification, simple tasks |
| Parquet | Compressed columnar | Large datasets |
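The instruction-style and chat-style formats are interconvertible. A sketch of turning one Alpaca record into the OpenAI messages format (the field handling here is an assumption; check it against your data before converting in bulk):

```python
def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca-format record to the OpenAI chat format.
    The instruction and optional input are joined into a single user turn."""
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
print(alpaca_to_messages(example))
```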

Dataset Parameters

| Parameter | Default | Description |
|---|---|---|
| Train/Val/Test Split | 80/15/5 | Automatic dataset splitting |
| Max Samples | unlimited | Limit training examples |
| Cutoff Length | 2048 | Maximum tokens per example (128-4096) |
| Streaming | false | Stream large datasets from storage |
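The default 80/15/5 split can be reproduced locally to sanity-check set sizes before upload (a sketch; the platform's exact shuffling and rounding are not specified here):

```python
import random

def split_dataset(examples: list, ratios=(0.80, 0.15, 0.05), seed=42):
    """Shuffle and split examples into train/val/test by the given ratios.
    The test set takes the remainder, so every example lands in exactly one set."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * ratios[0])
    n_val = int(len(data) * ratios[1])
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test_set = split_dataset(list(range(1000)))
print(len(train), len(val), len(test_set))  # 800 150 50
```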

Data Sources

  • File Upload - Drag and drop JSONL/CSV/Parquet files (multipart form upload)
  • JSON API - Submit configuration via JSON with S3 path or HuggingFace dataset ID
  • S3 Path - Reference datasets stored in S3

Dataset Requirements

  • Minimum: 100 examples (1,000-10,000 recommended)
  • Validation: Automatic checks for format errors, duplicates, length issues

Hyperparameters

Training Parameters

| Parameter | Default | Range | Description |
|---|---|---|---|
| Learning Rate | 2e-4 | 1e-6 to 1e-2 | Step size for weight updates |
| Batch Size | 4 | 1-32 | Examples per gradient step |
| Epochs | 3 | 1-50 | Complete passes through dataset |
| Gradient Accumulation | 4 | 1-32 | Simulates larger batch size with less memory |
| Warmup Steps | 100 | 0+ | Gradual learning rate increase at start |
| Weight Decay | 0.01 | 0-1 | L2 regularization strength |
| Max Sequence Length | 2048 | 128-4096 | Maximum tokens per example |
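Batch size and gradient accumulation interact: the optimizer sees an effective batch of batch size × accumulation steps (× GPU count for data-parallel training). A quick check:

```python
def effective_batch_size(batch_size: int, grad_accum: int, gpu_count: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return batch_size * grad_accum * gpu_count

# With the defaults above (batch size 4, accumulation 4) on one GPU:
print(effective_batch_size(4, 4))  # 16
```

This is why raising gradient accumulation is the usual fix when a desired batch size does not fit in VRAM.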

Advanced Configuration

| Setting | Default | Description |
|---|---|---|
| Early Stopping | true | Stop training when validation loss plateaus |
| Save Strategy | steps | When to save checkpoints |
| Evaluation Strategy | steps | When to run evaluation |
| Logging Steps | 10 | Steps between log entries |
| Save Steps | 500 | Steps between checkpoint saves |
| Eval Steps | 500 | Steps between evaluations |
| Dataloader Workers | 4 | Parallel data loading threads (0-16) |
| FP16 | true | Half-precision training for memory efficiency |
| Gradient Checkpointing | true | Trade compute for memory |
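Early stopping as described can be sketched as a patience counter on validation loss (the patience value below is an illustrative assumption, not a documented platform default):

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` evaluations."""
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1       # no improvement this evaluation
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.85, 0.9, 0.7]
print([stopper.should_stop(l) for l in losses])  # [False, False, False, True, False]
```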

Hardware Selection

GPU Types

| GPU | VRAM | Valid gpu_type Value | Best For |
|---|---|---|---|
| NVIDIA T4 | 16GB | nvidia-t4 | LoRA/QLoRA on 7B models |
| NVIDIA L4 | 24GB | nvidia-l4 | LoRA on 13B models |
| NVIDIA A10G | 24GB | nvidia-a10g | LoRA/Full on 7-13B models |
| NVIDIA A100 40GB | 40GB | nvidia-a100-40gb | Full fine-tuning 13B+, LoRA 70B |
| NVIDIA A100 80GB | 80GB | nvidia-a100-80gb | Full fine-tuning 70B |
| NVIDIA H100 | 80GB | nvidia-h100 | Maximum performance |

Hardware Configuration Options

| Parameter | Default | Range | Description |
|---|---|---|---|
| GPU Type | nvidia-t4 | See table above | GPU for Karpenter node selection |
| GPU Count | 1 | 1-8 | Number of GPUs |
| Storage (GB) | 100 | 50-1000 | Ephemeral storage |
| Instance Type | (derived) | Optional | Legacy field, derived from gpu_type |

VRAM Requirements by Method

| Model Size | QLoRA | LoRA | Full |
|---|---|---|---|
| 7B | ~4GB | ~8GB | ~28GB |
| 13B | ~7GB | ~13GB | ~52GB |
| 30B | ~16GB | ~30GB | ~120GB |
| 70B | ~35GB | ~70GB | ~280GB |
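The figures above roughly follow per-parameter rules of thumb: about 0.5 GB, 1 GB, and 4 GB per billion parameters for QLoRA, LoRA, and full fine-tuning respectively. A sketch of that approximation (fitted to this table, not an official formula):

```python
# Rough GB of VRAM per billion parameters, by training method
# (an approximation derived from the table above, not an official formula).
GB_PER_BILLION_PARAMS = {"qlora": 0.5, "lora": 1.0, "full": 4.0}

def estimate_vram_gb(params_billions: float, method: str) -> float:
    """Rough VRAM estimate in GB; real usage also depends on sequence
    length, batch size, and optimizer state."""
    return params_billions * GB_PER_BILLION_PARAMS[method]

print(estimate_vram_gb(70, "qlora"))  # 35.0 (matches the ~35GB row above)
print(estimate_vram_gb(7, "full"))    # 28.0
```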

Monitoring

Real-Time Metrics

During training, monitor:

  • Training Loss - Should decrease consistently
  • Validation Loss - Rising validation loss while training loss falls indicates overfitting
  • Learning Rate - Current rate per schedule
  • Training Speed - Samples/second throughput
  • Gradient Norm - Spikes indicate instability

Resource Monitoring

  • GPU Utilization - GPU compute usage %
  • GPU Memory - VRAM used/total
  • GPU Temperature - Thermal monitoring
  • CPU/Memory - System resource usage

Logs

Filter logs by source:

  • Build - Docker image build output
  • Deployment - Kubernetes deployment events
  • Training - Training loop output

Log levels: debug, info, warning, error, critical

Cost Tracking

  • Current Cost - Real-time spend
  • Estimated Total - Projected final cost
  • Hourly Rate - Based on instance type
  • Compute Hours - Total GPU time

Job Lifecycle

Status States

| Status | Description |
|---|---|
| pending | Job created, waiting to start |
| preparing | Uploading dataset, building Docker image |
| building | Docker image being built and pushed to ECR |
| deploying | Kubernetes node provisioning |
| training | Model training in progress |
| evaluating | Running validation metrics |
| completed | Training finished successfully |
| failed | Error occurred (check logs) |
| stopping | Stop requested, shutting down |
| cancelled | Stopped by user |
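When polling job status via the API, only three of these states are terminal. A small helper (state names are taken from the table; how you poll is up to your client):

```python
TERMINAL_STATES = {"completed", "failed", "cancelled"}
ACTIVE_STATES = {"pending", "preparing", "building", "deploying",
                 "training", "evaluating", "stopping"}

def is_terminal(status: str) -> bool:
    """True once a job can no longer change state on its own."""
    if status not in TERMINAL_STATES | ACTIVE_STATES:
        raise ValueError(f"unknown status: {status}")
    return status in TERMINAL_STATES

print(is_terminal("training"))   # False
print(is_terminal("completed"))  # True
```

A polling loop would typically sleep between GET requests to the job-details endpoint and exit once is_terminal returns True.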

Infrastructure

Each fine-tuning job:

  1. Builds a custom Docker image (PyTorch 2.1.0 + CUDA 11.8 + HuggingFace)
  2. Pushes to AWS ECR
  3. Provisions a dedicated GPU node (AWS EKS) in the organization's namespace
  4. Runs as a Kubernetes Job with exclusive node access
  5. Uploads model artifacts to S3 on completion
  6. Cleans up node and resources

Post-Training

After a job completes:

  • Deploy - One-click deployment as a vLLM inference endpoint (uses the same API as any other model)
  • Download - Generate a presigned S3 URL (1-hour expiration) to download model weights
  • Delete - Remove job, logs, metrics, and S3 artifacts

API Endpoints

Job Management

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/fine-tuning/jobs | POST | Create new fine-tuning job (multipart form with file upload) |
| /api/v1/fine-tuning/jobs/json | POST | Create new fine-tuning job (JSON body, no file upload) |
| /api/v1/fine-tuning/jobs | GET | List all jobs |
| /api/v1/fine-tuning/jobs/{job_id} | GET | Get job details |
| /api/v1/fine-tuning/jobs/{job_id}/stop | POST | Stop running job |
| /api/v1/fine-tuning/jobs/{job_id} | DELETE | Delete job |
| /api/v1/fine-tuning/jobs/{job_id}/logs | GET | Get training logs (log_type: build, deployment, training) |
| /api/v1/fine-tuning/jobs/{job_id}/metrics | GET | Get training metrics |
| /api/v1/fine-tuning/jobs/{job_id}/download | GET | Get model download URL |
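A sketch of a JSON body for the /api/v1/fine-tuning/jobs/json endpoint, assembled from the parameters documented on this page. The field names below are assumptions inferred from this doc, not a confirmed schema, so verify them against the API reference before use:

```python
import json

# Hypothetical payload; field names are inferred from this page, not confirmed.
payload = {
    "base_model": "meta-llama/Llama-3.1-8B",  # HuggingFace model ID (assumed form)
    "method": "qlora",
    "dataset": {"s3_path": "s3://my-bucket/train.jsonl", "format": "alpaca"},
    "hyperparameters": {
        "learning_rate": 2e-4,
        "batch_size": 4,
        "epochs": 3,
        "lora_rank": 16,
        "lora_alpha": 32,
    },
    "hardware": {"gpu_type": "nvidia-t4", "gpu_count": 1, "storage_gb": 100},
}
print(json.dumps(payload, indent=2))
# POST this body to /api/v1/fine-tuning/jobs/json with your auth headers, e.g.
# requests.post(f"{BASE_URL}/api/v1/fine-tuning/jobs/json", json=payload, headers=...)
```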

Configuration

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/fine-tuning/base-models | GET | List available base models |
| /api/v1/fine-tuning/hardware-options | GET | List hardware configurations |
| /api/v1/fine-tuning/models/{model_id}/requirements | GET | Get hardware requirements for a specific model |
| /api/v1/fine-tuning/validate-hardware | POST | Validate hardware config against model requirements |
| /api/v1/fine-tuning/estimate-cost | POST | Estimate training cost |
| /api/v1/fine-tuning/stats | GET | Get user statistics |

Cost Estimation

Training cost is calculated based on:

  • Instance hourly rate - GPU type determines cost
  • Estimated training hours - Based on dataset size, epochs, and method
  • Method efficiency - QLoRA/LoRA faster than full fine-tuning
  • Overhead - 20-30% for setup, evaluation, and export

Example estimates:

| Scenario | GPU | Hours | Cost |
|---|---|---|---|
| 7B LoRA, 1K examples, 3 epochs | T4 | ~1h | ~$1 |
| 13B QLoRA, 5K examples, 5 epochs | A10G | ~3h | ~$5 |
| 70B QLoRA, 10K examples, 3 epochs | A100 40GB | ~6h | ~$20 |
| 7B Full, 10K examples, 5 epochs | A100 40GB | ~4h | ~$13 |
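These estimates follow roughly from rate × hours plus overhead. A sketch with an assumed hourly rate (actual instance rates vary and are not published on this page):

```python
def estimate_cost(hourly_rate_usd: float, training_hours: float,
                  overhead: float = 0.25) -> float:
    """Projected cost: GPU-hours at the instance rate, plus the
    documented 20-30% overhead for setup, evaluation, and export."""
    return round(hourly_rate_usd * training_hours * (1 + overhead), 2)

# With an assumed T4 rate of ~$0.80/hour, the 7B LoRA scenario above:
print(estimate_cost(0.80, 1.0))  # 1.0
```

The /api/v1/fine-tuning/estimate-cost endpoint returns the platform's own projection before you commit to a job.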

Best Practices

Dataset Quality

  • Minimum 1,000 examples for meaningful fine-tuning
  • Consistent format - Ensure all examples follow the same structure
  • Balanced distribution - Avoid class imbalance (>10:1 ratio triggers warnings)
  • Remove duplicates - Platform warns if >10% duplicates detected
  • Appropriate length - Set max_sequence_length based on your data distribution
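The duplicate threshold above is easy to check before upload (a sketch; the platform's exact normalization is unspecified, so case-insensitive exact matching is an assumption here):

```python
def duplicate_rate(examples: list[str]) -> float:
    """Fraction of examples that are exact duplicates of an earlier one,
    after trimming whitespace and lowercasing."""
    seen, dupes = set(), 0
    for ex in examples:
        key = ex.strip().lower()
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(examples) if examples else 0.0

data = ["What is LoRA?", "what is lora?", "Explain QLoRA", "What is LoRA?"]
print(duplicate_rate(data))  # 0.5 (above the 10% warning threshold)
```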

Training Configuration

  • Start with LoRA - Most cost-effective for initial experiments
  • Use QLoRA for large models - Fine-tune a 70B model on a single A100 40GB
  • Monitor validation loss - Stop early if it increases (overfitting)
  • Lower learning rate for larger models - 1e-5 for 70B vs 2e-4 for 7B
  • Increase rank for complex tasks - r=16-64 for specialized domains

Cost Optimization

  • Start small - Test with subset of data before full training
  • Use QLoRA - 75% less GPU cost than full fine-tuning
  • Set early stopping - Avoid unnecessary training epochs
  • Right-size GPU - Don't use A100 for a 7B LoRA job

Next Steps