# Data Forge
Generate high-quality synthetic training data from your existing documents. Data Forge automates the full pipeline from raw documents to fine-tuning-ready datasets.
## Overview
Data Forge creates specialized training datasets through a six-step pipeline:
- Upload - Add source documents (PDF, DOCX, Markdown, TXT, HTML, CSV)
- Parse & Chunk - Automatically split documents into semantic chunks
- Generate - Use a teacher LLM to create Q&A training pairs from each chunk
- Validate - Score quality, check grounding against source material, deduplicate
- Review - Accept, reject, or edit generated pairs
- Export - Download as JSONL in ChatML or Alpaca format for fine-tuning
All processing runs as Kubernetes Jobs launched by the AI Gateway. No GPU is required for Data Forge — LLM calls go through the AI Gateway inference API.
## Getting Started
### 1. Create a Project
Navigate to AI Gateway > Data Forge in the sidebar and click Create Project. Provide a name and optional description.
### 2. Upload Documents
In the Documents tab, drag and drop files or click to browse. Supported formats:
| Format | Extensions | Parser |
|---|---|---|
| PDF | .pdf | PyMuPDF |
| Word | .docx | python-docx |
| Markdown | .md | markdown |
| Plain Text | .txt | Direct read |
| HTML | .html, .htm | BeautifulSoup |
| CSV | .csv | pandas |
Files upload directly to S3 via presigned URLs — they never pass through the server.
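On the client side, the two-step flow (request a presigned URL, then PUT the file to S3) can be sketched as follows; the request-body field names are assumptions for illustration, not the documented schema.

```python
import mimetypes
from pathlib import Path

# Extensions accepted by Data Forge (see the table above).
SUPPORTED = {".pdf", ".docx", ".md", ".txt", ".html", ".htm", ".csv"}

def upload_payload(path: str) -> dict:
    """Build a request body for the presigned-URL endpoint.
    Field names here are illustrative assumptions."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported format: {p.suffix}")
    ctype, _ = mimetypes.guess_type(p.name)
    return {"filename": p.name, "content_type": ctype or "application/octet-stream"}

# With the presigned URL returned by the API, the file then goes
# straight to S3, e.g.:
#   requests.put(url, data=open(path, "rb"),
#                headers={"Content-Type": payload["content_type"]})
```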
### 3. Parse Documents
Click Parse All Documents to start the parsing job. This creates a K8s Job that:
- Downloads each document from S3
- Extracts text content using format-specific parsers
- Splits content into chunks by headings, paragraphs, or sliding window
- Stores chunks in MongoDB with position, heading, and content type metadata
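A much-simplified version of the paragraph strategy shows the idea; the worker's real splitter and metadata schema are richer than this sketch.

```python
def chunk_by_paragraphs(text: str, max_chars: int = 1000) -> list[dict]:
    """Greedily pack paragraphs into chunks of at most max_chars,
    recording each chunk's position. Simplified illustration of the
    paragraph-based splitting the parse job performs."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Start a new chunk when adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append({"position": len(chunks), "content": current})
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append({"position": len(chunks), "content": current})
    return chunks
```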
Progress appears in real time in the Generation History section.
### 4. Review Chunks
Switch to the Chunks tab to review parsed chunks. You can:
- Preview chunk content
- Edit chunk text before generation
- Exclude irrelevant chunks from generation
- Filter by document
### 5. Configure Generation
In the Generate tab, configure the generation run:
Teacher Model — Select any model available in your AI Gateway. Larger models produce higher quality pairs but cost more.
Generation Type — Controls the type of data generated:
| Type | Description |
|---|---|
| `qa` | Question and answer pairs |
| `instruction` | Instruction-following pairs |
| `conversation` | Multi-turn conversations |
| `summary` | Summarization pairs |
| `classification` | Classification examples |
Temperature — Controls randomness (0.0-2.0, default: 0.7). Lower values produce more focused, deterministic pairs. Higher values produce more diverse, creative pairs.
System Prompt — Custom instructions for the teacher LLM. Use this to control the style, domain terminology, and format of generated pairs.
Additional Configuration (passed via the config field):
| Parameter | Default | Description |
|---|---|---|
| `pairs_per_chunk` | 3 | Number of Q&A pairs to generate per chunk |
| `difficulty_distribution` | Equal | Distribution weights: `{"easy": 0.3, "medium": 0.5, "hard": 0.2}` |
| `style_template` | `mixed` | Predefined style: `mixed`, `how-to`, `troubleshooting`, `conceptual`, `api-code` |
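Putting the options above together, a generation request body might look like the following; the exact field names are assumptions based on the options described here, not a documented schema.

```python
import json

# Assumed field names; check your deployment's actual request schema.
generation_request = {
    "teacher_model": "gpt-4o",   # any model available in your AI Gateway
    "generation_type": "qa",     # qa, instruction, conversation, summary, classification
    "temperature": 0.7,          # 0.0-2.0
    "system_prompt": "Write concise Q&A pairs using our product terminology.",
    "config": {
        "pairs_per_chunk": 3,
        "difficulty_distribution": {"easy": 0.3, "medium": 0.5, "hard": 0.2},
        "style_template": "mixed",
    },
}

body = json.dumps(generation_request, indent=2)
```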
### 6. Monitor Generation
The Generation History table shows all generation runs with:
- Status — `pending`, `parsing`, `generating`, `validating`, `completed`, `failed`, `cancelled`
- Progress — Percentage completion with an animated progress bar
- Results — Chunks processed, pairs generated, average quality score
- Duration — Elapsed time
- Logs — Expand any row to view real-time streaming logs
Active generations poll every 5 seconds for status and progress updates.
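The same polling loop can be run from a script; here `fetch_status` stands in for a GET to the generation-details endpoint, and the status names match the table above.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed", "cancelled"}

def poll_generation(fetch_status: Callable[[], dict],
                    interval: float = 5.0,
                    max_polls: int = 1000) -> dict:
    """Poll until the generation reaches a terminal status.
    fetch_status would GET /api/v1/data-forge/generations/{gen_id}."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("generation did not reach a terminal status")
```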
### 7. Review & Curate
Switch to the Review tab to curate generated pairs:
- Accept — Include pair in exported dataset
- Reject — Exclude from export
- Edit — Modify question or answer text, then accept
- Bulk Actions — Select multiple pairs and accept/reject/reset in bulk
Filter pairs by:
- Review status (pending, accepted, rejected, edited)
- Quality score threshold
- Generation run
- Full-text search across questions and answers
### 8. Export Dataset
In the Export tab:
- Select export format:
  - ChatML — `{"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}`
  - Alpaca — `{"instruction": "...", "input": "", "output": "..."}`
- Optionally set a minimum quality score filter
- Choose whether to include system prompts
- Click Export
Each export creates a versioned JSONL file in S3. Previous versions remain available for download.
Send to Fine-Tuning — One click creates a new fine-tuning job pre-populated with the exported dataset S3 key.
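For reference, converting an accepted pair into either export format is straightforward; the `question`/`answer` field names below are assumptions about the pair records, while the output shapes match the two formats shown above.

```python
import json

def to_chatml(pair: dict, system_prompt: str = "") -> str:
    """One ChatML JSONL line from an accepted pair."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": pair["question"]})
    messages.append({"role": "assistant", "content": pair["answer"]})
    return json.dumps({"messages": messages})

def to_alpaca(pair: dict) -> str:
    """One Alpaca JSONL line from an accepted pair."""
    return json.dumps(
        {"instruction": pair["question"], "input": "", "output": pair["answer"]}
    )
```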
## Quality Scoring
Each generated pair receives two scores:
### Quality Score (0.0 - 1.0)
Overall pair quality based on:
- Answer completeness and accuracy
- Question clarity and specificity
- Formatting consistency
- Appropriate difficulty level
### Grounding Score (0.0 - 1.0)
How well the answer is grounded in the source chunk:
- 1.0 = Fully supported by source material
- 0.5 = Partially supported
- 0.0 = Hallucinated/unsupported content
Pairs with low grounding scores are flagged for review.
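As a rough illustration of grounding (not the scorer Data Forge actually uses), a lexical-overlap heuristic captures the idea: answers whose content appears in the source chunk score high, unsupported answers score low.

```python
import re

def grounding_overlap(answer: str, source_chunk: str) -> float:
    """Fraction of answer tokens that also appear in the source chunk.
    Purely lexical; an illustrative stand-in for the real grounding check."""
    def tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(source_chunk)) / len(answer_tokens)
```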
## Question Types
The generator classifies each pair by question type:
| Type | Description |
|---|---|
| `factual` | Direct fact retrieval questions |
| `how-to` | Step-by-step procedural questions |
| `troubleshooting` | Problem diagnosis and resolution |
| `conceptual` | Explanation and understanding |
| `api-code` | Code examples and API usage |
| `analytical` | Analysis and comparison |
| `comparative` | Side-by-side comparisons |
The Analytics tab shows question type distribution to help ensure diverse training coverage.
## Analytics Dashboard
The Analytics tab provides:
- Document Coverage — Documents parsed vs total, chunks per document
- Quality Distribution — Histogram of quality scores across all pairs
- Question Type Distribution — Diversity of generated question types
- Topic Distribution — Top topics from chunk classification
- Difficulty Distribution — Balance across easy, medium, and hard pairs
- Training Readiness — Accepted pair count, estimated training time, dataset quality assessment
## Infrastructure
### Worker Image
Data Forge uses a dedicated CPU-only worker image (data-forge-worker) for all processing. The same image runs in 3 modes:
| Mode | K8s Job Name | Purpose |
|---|---|---|
| `parse` | `df-parse-{id}` | Document parsing and chunking |
| `generate` | `df-generate-{id}` | Q&A pair generation via AI Gateway |
| `validate` | `df-validate-{id}` | Quality scoring and deduplication |
### K8s Job Configuration
- Service Account: `operator-service-account` (for S3 IRSA)
- Node Selector: `workload-type: general` (CPU nodes only)
- Backoff Limit: 2 retries
- Active Deadline: 2 hours
- TTL After Finished: 10 minutes cleanup
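Assembled from the settings above, the resulting Job spec looks roughly like this, expressed as a Python dict; the image tag, container args, and `restartPolicy` are placeholders, not documented values.

```python
def parse_job_manifest(generation_id: str) -> dict:
    """Sketch of the K8s Job implied by the configuration above.
    Image tag and container args are illustrative assumptions."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"df-parse-{generation_id}"},
        "spec": {
            "backoffLimit": 2,                     # 2 retries
            "activeDeadlineSeconds": 2 * 60 * 60,  # 2-hour deadline
            "ttlSecondsAfterFinished": 600,        # cleanup after 10 minutes
            "template": {
                "spec": {
                    "serviceAccountName": "operator-service-account",
                    "nodeSelector": {"workload-type": "general"},
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": "data-forge-worker:latest",  # placeholder tag
                        "args": ["--mode", "parse"],          # assumed flag
                    }],
                }
            },
        },
    }
```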
### S3 Storage
- `$S3_BUCKET/data-forge/<user_id>/<project_id>/sources/` → uploaded documents
- `$S3_BUCKET/data-forge/<user_id>/<project_id>/exports/` → exported datasets
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/data-forge/projects` | POST | Create project |
| `/api/v1/data-forge/projects` | GET | List projects |
| `/api/v1/data-forge/projects/{id}` | GET | Get project details |
| `/api/v1/data-forge/projects/{id}` | PUT | Update project |
| `/api/v1/data-forge/projects/{id}` | DELETE | Delete project |
| `/api/v1/data-forge/projects/{id}/upload-url` | POST | Get presigned upload URL |
| `/api/v1/data-forge/projects/{id}/documents` | POST | Register uploaded document |
| `/api/v1/data-forge/projects/{id}/documents` | GET | List documents |
| `/api/v1/data-forge/projects/{id}/documents/{doc_id}` | DELETE | Delete document |
| `/api/v1/data-forge/projects/{id}/chunks` | GET | List chunks (paginated) |
| `/api/v1/data-forge/projects/{id}/chunks/{chunk_id}` | PUT | Update chunk |
| `/api/v1/data-forge/projects/{id}/parse` | POST | Start parse job |
| `/api/v1/data-forge/projects/{id}/generate` | POST | Start generation job |
| `/api/v1/data-forge/generations/{gen_id}/cancel` | POST | Cancel generation |
| `/api/v1/data-forge/projects/{id}/generations` | GET | List generations |
| `/api/v1/data-forge/generations/{gen_id}` | GET | Get generation details |
| `/api/v1/data-forge/generations/{gen_id}/logs` | GET | Get generation logs |
| `/api/v1/data-forge/projects/{id}/pairs` | GET | List pairs (paginated, filterable) |
| `/api/v1/data-forge/pairs/{pair_id}` | PUT | Update pair |
| `/api/v1/data-forge/projects/{id}/pairs/bulk-action` | POST | Bulk accept/reject/delete/reset |
| `/api/v1/data-forge/projects/{id}/export` | POST | Export dataset |
| `/api/v1/data-forge/projects/{id}/export/{version}` | GET | Get export download URL |
| `/api/v1/data-forge/projects/{id}/analytics` | GET | Get project analytics |
| `/api/v1/data-forge/available-models` | GET | List available teacher models |
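A thin client for the first two workflow calls might look like this; only the paths come from the table above, while the bearer-token auth and response shapes are assumptions.

```python
class DataForgeClient:
    """Minimal wrapper over the project/parse endpoints.
    Bearer-token auth is an assumption; adjust to your gateway."""

    def __init__(self, base_url: str, token: str, session=None):
        if session is None:
            import requests  # any object with .post() and .headers also works
            session = requests.Session()
        self.base_url = base_url.rstrip("/")
        self.session = session
        self.session.headers["Authorization"] = f"Bearer {token}"

    def create_project(self, name: str, description: str = "") -> dict:
        resp = self.session.post(
            f"{self.base_url}/api/v1/data-forge/projects",
            json={"name": name, "description": description},
        )
        resp.raise_for_status()
        return resp.json()

    def start_parse(self, project_id: str) -> dict:
        resp = self.session.post(
            f"{self.base_url}/api/v1/data-forge/projects/{project_id}/parse"
        )
        resp.raise_for_status()
        return resp.json()
```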
## Best Practices
### Document Quality
- Structured content works best — Documents with headings, sections, and clear organization produce better chunks
- Remove boilerplate — Headers, footers, legal text, and table of contents add noise
- Minimum 5 pages per document for meaningful pair generation
- Domain-specific terminology is preserved — the teacher model learns from your content
### Generation Configuration
- Start with defaults — The default configuration produces good results for most use cases
- Use large teacher models — GPT-4o or Claude 3.5 Sonnet produce significantly higher quality pairs than smaller models
- Set difficulty distribution — Ensure your training data covers easy, medium, and hard questions
- Custom system prompts — Tailor the generation to your domain language and style
### Review Workflow
- Filter by quality score — Review lowest-quality pairs first (sort ascending)
- Use bulk actions — Accept all pairs above 0.9 quality, then manually review the rest
- Check grounding scores — Low grounding scores indicate potential hallucination
- Aim for 1,000+ accepted pairs for effective fine-tuning
### Export
- ChatML format for OpenAI-compatible models and most modern training frameworks
- Alpaca format for instruction-following fine-tuning with frameworks like LLaMA Factory
- Set `min_quality_score` on export to automatically filter low-quality pairs
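Before handing an exported file to a trainer, a quick sanity check of the ChatML JSONL can catch malformed records; this helper is an illustration, not part of Data Forge.

```python
import json

def validate_chatml_file(lines: list) -> int:
    """Count JSONL lines that parse and end with an assistant turn.
    Raises AssertionError on the first malformed record."""
    valid = 0
    for line in lines:
        record = json.loads(line)
        roles = [m.get("role") for m in record.get("messages", [])]
        assert roles and roles[-1] == "assistant", f"bad record: {roles}"
        valid += 1
    return valid
```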