Fine-Tuning
Train models on your custom data to improve performance on domain-specific tasks. The platform supports four training methods (LoRA, QLoRA, full fine-tuning, and prompt tuning), with real-time monitoring, cost tracking, and one-click deployment.
Overview
Fine-tuning creates a specialized version of a base model by training on your data. The platform handles the full lifecycle:
- Configure - Select base model, training method, dataset, and hyperparameters
- Train - Dedicated GPU nodes run your training job with real-time metrics
- Monitor - Track loss curves, resource usage, and costs in real-time
- Deploy - One-click deployment as an inference endpoint
Base Models
Choose from 30+ open-source models across major families:
| Family | Models | Best For |
|---|---|---|
| Llama | 3.3 70B, 3.2 (1B/3B/11B/90B), 3.1 (8B/70B), CodeLlama | General purpose, coding |
| Mistral | 7B, v0.3, Mixtral 8x7B, Nemo, Codestral | Multilingual, code |
| Qwen | 2.5 (0.5B-72B), Coder, VL variants | Multilingual, vision |
| Phi | 3.5 Mini, 3 Medium, 4 | Small efficient models |
| Gemma | 2 (9B/27B), 7B, CodeGemma | Compact, code |
| DeepSeek | R1 (1.5B-70B), Coder, Math | Reasoning, code, math |
Models can be sourced from HuggingFace, S3, or local storage.
Training Methods
LoRA (Low-Rank Adaptation)
Adds small trainable matrices to model layers while freezing original weights.
- Memory: ~50% of base model VRAM
- GPU: 16GB minimum (T4 recommended)
- Quality: Near full fine-tuning performance
- Storage: 1-50MB adapter files
LoRA Parameters:
| Parameter | Default | Range | Description |
|---|---|---|---|
| Rank (r) | 16 | 1-256 | Dimensionality of adapter matrices. Higher = more capacity |
| Alpha | 32 | 1-512 | Scaling factor. Typically 2x rank |
| Dropout | 0.1 | 0-0.5 | Prevents overfitting. Higher for small datasets |
| Target Modules | q_proj, v_proj | Various | Which layers get adapters |
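As a rough sketch of what these defaults imply for adapter size, the parameter count can be computed directly (the 4096-dim projections and 32-layer depth below are assumed Llama-7B-style shapes, not platform values):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out projection:
    a rank x d_in down-projection plus a d_out x rank up-projection."""
    return rank * (d_in + d_out)

# Assumed Llama-7B-style shapes: 4096-dim q_proj/v_proj across 32 layers
per_layer = lora_param_count(4096, 4096, rank=16)   # 131,072 per projection
total = per_layer * 2 * 32                          # q_proj + v_proj, 32 layers
size_mb = total * 2 / 1e6                           # fp16 adapter: ~16.8 MB
```

This lands comfortably inside the 1-50MB adapter range quoted above; doubling the rank roughly doubles both capacity and adapter size.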
QLoRA (Quantized LoRA)
LoRA with 4-bit quantization of the base model for maximum memory efficiency.
- Memory: ~25% of base model VRAM
- GPU: 8GB minimum (T4 recommended)
- Quality: Minimal degradation vs standard LoRA
- Best for: Large models (70B) on limited hardware
QLoRA Parameters:
| Parameter | Default | Options | Description |
|---|---|---|---|
| Bits | 4 | 2, 3, 4, 5, 6, 8 | Quantization precision |
| Quant Type | nf4 | nf4, int4, fp4 | Quantization algorithm |
| Double Quant | true | true/false | Quantize the quantization constants |
| Compute Dtype | float16 | float16, bfloat16, float32 | Compute precision |
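The memory savings follow directly from the bit width. A back-of-the-envelope footprint for the quantized base weights alone (ignoring activations, adapters, and quantization constants):

```python
def quantized_weight_gb(params_billions: float, bits: int = 4) -> float:
    """Approximate VRAM for the quantized base weights alone:
    1e9 params x (bits / 8) bytes, expressed in GB."""
    return params_billions * bits / 8

print(quantized_weight_gb(7))    # 3.5 GB  (the VRAM table below lists ~4GB)
print(quantized_weight_gb(70))   # 35.0 GB (matches ~35GB for 70B QLoRA)
```

Double quantization shaves a little more by compressing the per-block quantization constants themselves.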
Full Fine-Tuning
Updates all model weights for maximum customization.
- Memory: ~4x base model weights in VRAM (gradients and optimizer states dominate)
- GPU: 40GB+ (A100 recommended)
- Quality: Maximum customization potential
- Storage: Full model copy (10-140GB)
- Best for: Significant domain shift, ample resources
Prompt Tuning
Learns soft prompt embeddings prepended to inputs, keeping the full model frozen.
- Memory: Minimal additional VRAM
- GPU: 16GB+ recommended
- Quality: Good for simple task adaptation
- Storage: Very small adapter (< 1MB)
- Best for: Quick experiments, task-specific tuning with minimal resources
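The tiny adapter size follows from what is actually learned: a handful of embedding vectors. A quick sanity check (20 virtual tokens and a 4096-dim model are illustrative assumptions):

```python
def soft_prompt_bytes(num_virtual_tokens: int, hidden_dim: int,
                      bytes_per_param: int = 2) -> int:
    """Storage for learned soft-prompt embeddings (fp16 by default)."""
    return num_virtual_tokens * hidden_dim * bytes_per_param

# Assumed: 20 virtual tokens prepended to a 4096-dim model
print(soft_prompt_bytes(20, 4096) / 1024)  # 160.0 KB, well under 1MB
```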
Dataset Configuration
Supported Formats
| Format | Structure | Best For |
|---|---|---|
| JSONL | One JSON object per line | General purpose |
| Alpaca | {"instruction", "input", "output"} | Instruction following |
| ShareGPT | {"conversations": [...]} | Multi-turn chat |
| OpenAI | {"messages": [...]} | Chat format |
| CSV | Columnar data | Classification, simple tasks |
| Parquet | Compressed columnar | Large datasets |
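Since the formats in the table carry the same information, converting between them is mechanical. A minimal sketch mapping one Alpaca record to the OpenAI chat structure (field names are taken from the table above, not from a platform API):

```python
def alpaca_to_openai(record: dict) -> dict:
    """Convert one Alpaca-format record ({"instruction", "input", "output"})
    into the OpenAI chat format ({"messages": [...]})."""
    user = record["instruction"]
    if record.get("input"):
        # Alpaca's optional context is appended to the user turn
        user += "\n\n" + record["input"]
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["output"]},
    ]}
```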
Dataset Parameters
| Parameter | Default | Description |
|---|---|---|
| Train/Val/Test Split | 80/15/5 | Automatic dataset splitting |
| Max Samples | unlimited | Limit training examples |
| Cutoff Length | 2048 | Maximum tokens per example (128-4096) |
| Streaming | false | Stream large datasets from storage |
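The default 80/15/5 split can be reproduced locally to inspect what the platform will see (a sketch of the ratio arithmetic, not the platform's actual splitting code):

```python
import random

def split_dataset(examples, train=0.80, val=0.15, seed=42):
    """Shuffle and split with the default 80/15/5 ratios; whatever remains
    after train and validation becomes the test set."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

tr, va, te = split_dataset(range(1000))
# len(tr), len(va), len(te) -> 800, 150, 50
```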
Data Sources
- File Upload - Drag and drop JSONL/CSV/Parquet files (multipart form upload)
- JSON API - Submit configuration via JSON with S3 path or HuggingFace dataset ID
- S3 Path - Reference datasets stored in S3
Dataset Requirements
- Minimum: 100 examples (1,000-10,000 recommended)
- Validation: Automatic checks for format errors, duplicates, length issues
Hyperparameters
Training Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| Learning Rate | 2e-4 | 1e-6 to 1e-2 | Step size for weight updates |
| Batch Size | 4 | 1-32 | Examples per gradient step |
| Epochs | 3 | 1-50 | Complete passes through dataset |
| Gradient Accumulation | 4 | 1-32 | Simulates larger batch size with less memory |
| Warmup Steps | 100 | 0+ | Gradual learning rate increase at start |
| Weight Decay | 0.01 | 0-1 | L2 regularization strength |
| Max Sequence Length | 2048 | 128-4096 | Maximum tokens per example |
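Two of these parameters interact in ways worth making explicit: gradient accumulation multiplies the effective batch size without the memory cost of holding more examples at once, and warmup ramps the learning rate linearly from zero. A sketch of both (the post-warmup schedule is assumed constant here; real trainers often decay it):

```python
def effective_batch_size(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return batch_size * grad_accum * num_gpus

def lr_at_step(step: int, base_lr: float = 2e-4, warmup_steps: int = 100) -> float:
    """Linear warmup to base_lr, then constant (schedules vary by trainer)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

print(effective_batch_size(4, 4))  # 16 with the defaults above
print(lr_at_step(50))              # halfway through warmup: 1e-4
```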
Advanced Configuration
| Setting | Default | Description |
|---|---|---|
| Early Stopping | true | Stop training when validation loss plateaus |
| Save Strategy | steps | When to save checkpoints |
| Evaluation Strategy | steps | When to run evaluation |
| Logging Steps | 10 | Steps between log entries |
| Save Steps | 500 | Steps between checkpoint saves |
| Eval Steps | 500 | Steps between evaluations |
| Dataloader Workers | 4 | Parallel data loading threads (0-16) |
| FP16 | true | Half-precision training for memory efficiency |
| Gradient Checkpointing | true | Trade compute for memory |
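Early stopping watches validation loss and halts once it stops improving. A minimal patience-based sketch, assuming the common patience/min-delta formulation (the platform's exact trigger logic isn't documented here):

```python
class EarlyStopping:
    """Stop once validation loss fails to improve for `patience` evaluations."""
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0       # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
results = [stopper.should_stop(l) for l in [1.0, 0.8, 0.81, 0.82, 0.83]]
# stops at 0.82, the second consecutive non-improving eval
```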
Hardware Selection
GPU Types
| GPU | VRAM | Valid gpu_type Value | Best For |
|---|---|---|---|
| NVIDIA T4 | 16GB | nvidia-t4 | LoRA/QLoRA on 7B models |
| NVIDIA L4 | 24GB | nvidia-l4 | LoRA on 13B models |
| NVIDIA A10G | 24GB | nvidia-a10g | LoRA/Full on 7-13B models |
| NVIDIA A100 40GB | 40GB | nvidia-a100-40gb | Full fine-tuning 13B+, LoRA 70B |
| NVIDIA A100 80GB | 80GB | nvidia-a100-80gb | Full fine-tuning 70B |
| NVIDIA H100 | 80GB | nvidia-h100 | Maximum performance |
Hardware Configuration Options
| Parameter | Default | Range | Description |
|---|---|---|---|
| GPU Type | nvidia-t4 | See table above | GPU for Karpenter node selection |
| GPU Count | 1 | 1-8 | Number of GPUs |
| Storage (GB) | 100 | 50-1000 | Ephemeral storage |
| Instance Type | (derived) | Optional | Legacy field, derived from gpu_type |
VRAM Requirements by Method
| Model Size | QLoRA | LoRA | Full |
|---|---|---|---|
| 7B | ~4GB | ~8GB | ~28GB |
| 13B | ~7GB | ~13GB | ~52GB |
| 30B | ~16GB | ~30GB | ~120GB |
| 70B | ~35GB | ~70GB | ~280GB |
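The table reduces to a simple rule of thumb: roughly 0.5, 1, and 4 GB per billion parameters for QLoRA, LoRA, and full fine-tuning respectively. A sketch of that heuristic (real usage also depends on sequence length, batch size, and gradient checkpointing):

```python
GB_PER_BILLION = {"qlora": 0.5, "lora": 1.0, "full": 4.0}  # rough heuristic

def vram_estimate_gb(params_billions: float, method: str) -> float:
    """Rule-of-thumb VRAM need consistent with the table above."""
    return params_billions * GB_PER_BILLION[method]

print(vram_estimate_gb(7, "qlora"))  # 3.5   (table: ~4GB)
print(vram_estimate_gb(13, "lora"))  # 13.0  (table: ~13GB)
print(vram_estimate_gb(70, "full"))  # 280.0 (table: ~280GB)
```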
Monitoring
Real-Time Metrics
During training, monitor:
- Training Loss - Should decrease consistently
- Validation Loss - Rising validation loss while training loss keeps falling indicates overfitting
- Learning Rate - Current rate per schedule
- Training Speed - Samples/second throughput
- Gradient Norm - Spikes indicate instability
Resource Monitoring
- GPU Utilization - GPU compute usage %
- GPU Memory - VRAM used/total
- GPU Temperature - Thermal monitoring
- CPU/Memory - System resource usage
Logs
Filter logs by source:
- Build - Docker image build output
- Deployment - Kubernetes deployment events
- Training - Training loop output
Log levels: debug, info, warning, error, critical
Cost Tracking
- Current Cost - Real-time spend
- Estimated Total - Projected final cost
- Hourly Rate - Based on instance type
- Compute Hours - Total GPU time
Job Lifecycle
Status States
| Status | Description |
|---|---|
| pending | Job created, waiting to start |
| preparing | Uploading dataset, building Docker image |
| building | Docker image being built and pushed to ECR |
| deploying | Kubernetes node provisioning |
| training | Model training in progress |
| evaluating | Running validation metrics |
| completed | Training finished successfully |
| failed | Error occurred (check logs) |
| stopping | Stop requested, shutting down |
| cancelled | Stopped by user |
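A client can drive this state machine by polling until a terminal state is reached. A minimal sketch; in practice `fetch_status` would GET `/api/v1/fine-tuning/jobs/{job_id}` and read the status field (the stubbed progression below is purely illustrative):

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_status, interval_s: float = 0.0) -> str:
    """Poll a status-returning callable until the job reaches a terminal state."""
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_s)

# Stubbed status progression for illustration
states = iter(["pending", "building", "training", "completed"])
print(wait_for_job(lambda: next(states)))  # completed
```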
Infrastructure
Each fine-tuning job:
- Builds a custom Docker image (PyTorch 2.1.0 + CUDA 11.8 + HuggingFace)
- Pushes to AWS ECR
- Provisions a dedicated GPU node (AWS EKS) in the organization's namespace
- Runs as a Kubernetes Job with exclusive node access
- Uploads model artifacts to S3 on completion
- Cleans up node and resources
Post-Training
After a job completes:
- Deploy - One-click deployment as a vLLM inference endpoint (uses the same API as any other model)
- Download - Generate a presigned S3 URL (1-hour expiration) to download model weights
- Delete - Remove job, logs, metrics, and S3 artifacts
API Endpoints
Job Management
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/fine-tuning/jobs | POST | Create new fine-tuning job (multipart form with file upload) |
| /api/v1/fine-tuning/jobs/json | POST | Create new fine-tuning job (JSON body, no file upload) |
| /api/v1/fine-tuning/jobs | GET | List all jobs |
| /api/v1/fine-tuning/jobs/{job_id} | GET | Get job details |
| /api/v1/fine-tuning/jobs/{job_id}/stop | POST | Stop running job |
| /api/v1/fine-tuning/jobs/{job_id} | DELETE | Delete job |
| /api/v1/fine-tuning/jobs/{job_id}/logs | GET | Get training logs (log_type: build, deployment, training) |
| /api/v1/fine-tuning/jobs/{job_id}/metrics | GET | Get training metrics |
| /api/v1/fine-tuning/jobs/{job_id}/download | GET | Get model download URL |
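Creating a job via the JSON endpoint might look like the sketch below. The payload field names are illustrative guesses (the request schema isn't shown on this page); only the URL path and the use of a bearer token are taken from the docs:

```python
import json
import urllib.request

def build_create_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /api/v1/fine-tuning/jobs/json."""
    return urllib.request.Request(
        f"{base_url}/api/v1/fine-tuning/jobs/json",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

payload = {
    # hypothetical field names for illustration
    "base_model": "meta-llama/Llama-3.1-8B",
    "method": "lora",
    "dataset_s3_path": "s3://my-bucket/train.jsonl",
    "hyperparameters": {"learning_rate": 2e-4, "epochs": 3},
    "hardware": {"gpu_type": "nvidia-t4", "gpu_count": 1},
}
req = build_create_request("https://api.example.com", "YOUR_API_KEY", payload)
# submit with urllib.request.urlopen(req)
```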
Configuration
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/fine-tuning/base-models | GET | List available base models |
| /api/v1/fine-tuning/hardware-options | GET | List hardware configurations |
| /api/v1/fine-tuning/models/{model_id}/requirements | GET | Get hardware requirements for a specific model |
| /api/v1/fine-tuning/validate-hardware | POST | Validate hardware config against model requirements |
| /api/v1/fine-tuning/estimate-cost | POST | Estimate training cost |
| /api/v1/fine-tuning/stats | GET | Get user statistics |
Cost Estimation
Training cost is calculated based on:
- Instance hourly rate - GPU type determines cost
- Estimated training hours - Based on dataset size, epochs, and method
- Method efficiency - QLoRA/LoRA faster than full fine-tuning
- Overhead - 20-30% for setup, evaluation, and export
Example estimates:
| Scenario | GPU | Hours | Cost |
|---|---|---|---|
| 7B LoRA, 1K examples, 3 epochs | T4 | ~1h | ~$1 |
| 13B QLoRA, 5K examples, 5 epochs | A10G | ~3h | ~$5 |
| 70B QLoRA, 10K examples, 3 epochs | A100 40GB | ~6h | ~$20 |
| 7B Full, 10K examples, 5 epochs | A100 40GB | ~4h | ~$13 |
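The components above combine into a simple formula: GPU-hours times hourly rate, plus overhead. A sketch with a hypothetical T4 on-demand rate (the platform's actual rates come from the hardware-options endpoint):

```python
def estimate_cost(hourly_rate: float, training_hours: float,
                  overhead: float = 0.25) -> float:
    """Cost = GPU-hours x hourly rate, plus 20-30% overhead for setup,
    evaluation, and export (0.25 assumed here)."""
    return hourly_rate * training_hours * (1 + overhead)

# Hypothetical $0.80/hr T4 rate for the ~1h 7B LoRA scenario above
print(round(estimate_cost(0.80, 1.0), 2))  # ~$1, in line with the table
```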
Best Practices
Dataset Quality
- Minimum 1,000 examples for meaningful fine-tuning
- Consistent format - Ensure all examples follow the same structure
- Balanced distribution - Avoid class imbalance (>10:1 ratio triggers warnings)
- Remove duplicates - Platform warns if >10% duplicates detected
- Appropriate length - Set max_sequence_length based on your data distribution
Training Configuration
- Start with LoRA - Most cost-effective for initial experiments
- Use QLoRA for large models - Fine-tune 70B on a single A100 40GB
- Monitor validation loss - Stop early if it increases (overfitting)
- Lower learning rate for larger models - 1e-5 for 70B vs 2e-4 for 7B
- Increase rank for complex tasks - r=16-64 for specialized domains
Cost Optimization
- Start small - Test with subset of data before full training
- Use QLoRA - 75% less GPU cost than full fine-tuning
- Set early stopping - Avoid unnecessary training epochs
- Right-size GPU - Don't use A100 for a 7B LoRA job
Next Steps
- Deploy your fine-tuned model as an inference endpoint
- Configure autoscaling for your deployed model
- Monitor performance of your fine-tuned model