
Data Forge

Generate high-quality synthetic training data from your existing documents. Data Forge automates the full pipeline from raw documents to fine-tuning-ready datasets.

Overview

Data Forge creates specialized training datasets through a six-step pipeline:

  1. Upload - Add source documents (PDF, DOCX, Markdown, TXT, HTML, CSV)
  2. Parse & Chunk - Automatically split documents into semantic chunks
  3. Generate - Use a teacher LLM to create Q&A training pairs from each chunk
  4. Validate - Score quality, check grounding against source material, deduplicate
  5. Review - Accept, reject, or edit generated pairs
  6. Export - Download as JSONL in ChatML or Alpaca format for fine-tuning

All processing runs as Kubernetes Jobs launched by the AI Gateway. No GPU is required for Data Forge — LLM calls go through the AI Gateway inference API.


Getting Started

1. Create a Project

Navigate to AI Gateway > Data Forge in the sidebar and click Create Project. Provide a name and optional description.

2. Upload Documents

In the Documents tab, drag and drop files or click to browse. Supported formats:

Format       Extensions    Parser
PDF          .pdf          PyMuPDF
Word         .docx         python-docx
Markdown     .md           markdown
Plain Text   .txt          Direct read
HTML         .html, .htm   BeautifulSoup
CSV          .csv          pandas

Files upload directly to S3 via presigned URLs — they never pass through the server.
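The direct-to-S3 flow can be sketched in a few lines of Python. The helper name, the placeholder URL, and the use of `urllib` are illustrative; the real presigned URL comes from the project's `upload-url` endpoint, and any HTTP client works:

```python
import urllib.request

def build_presigned_put(presigned_url: str, data: bytes, content_type: str) -> urllib.request.Request:
    """Build the HTTP PUT that sends file bytes straight to S3."""
    return urllib.request.Request(
        presigned_url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )

# The presigned URL below is a placeholder; fetch a real one via
# POST /api/v1/data-forge/projects/{id}/upload-url first.
req = build_presigned_put(
    "https://example-bucket.s3.amazonaws.com/doc.pdf?X-Amz-Signature=PLACEHOLDER",
    b"%PDF-1.7 ...",
    "application/pdf",
)
# urllib.request.urlopen(req) would perform the actual upload.
```

Because the PUT goes to S3 directly, file bytes never transit the application server, which keeps uploads fast and the server stateless.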

3. Parse Documents

Click Parse All Documents to start the parsing job. This creates a K8s Job that:

  • Downloads each document from S3
  • Extracts text content using format-specific parsers
  • Splits content into chunks by headings, paragraphs, or sliding window
  • Stores chunks in MongoDB with position, heading, and content type metadata
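Of the splitting strategies above, the sliding-window case can be sketched as follows. The function name and the size/overlap defaults are illustrative, not the worker's actual parameters:

```python
def sliding_window_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of roughly `size` characters.

    Assumes size > overlap; the overlap preserves context across chunk
    boundaries so no sentence is stranded without its surroundings.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break  # the final window already covers the tail
    return chunks
```

Heading- and paragraph-based splitting would instead cut at structural boundaries, which is why well-organized documents chunk better.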

Progress appears in real time in the Generation History section.

4. Review Chunks

Switch to the Chunks tab to review parsed chunks. You can:

  • Preview chunk content
  • Edit chunk text before generation
  • Exclude irrelevant chunks from generation
  • Filter by document

5. Configure Generation

In the Generate tab, configure the generation run:

Teacher Model — Select any model available in your AI Gateway. Larger models produce higher quality pairs but cost more.

Generation Type — Controls the type of data generated:

Type             Description
qa               Question and answer pairs
instruction      Instruction-following pairs
conversation     Multi-turn conversations
summary          Summarization pairs
classification   Classification examples

Temperature — Controls randomness (0.0-2.0, default: 0.7). Lower values produce more focused, deterministic pairs. Higher values produce more diverse, creative pairs.

System Prompt — Custom instructions for the teacher LLM. Use this to control the style, domain terminology, and format of generated pairs.

Additional Configuration (passed via the config field):

Parameter                 Default   Description
pairs_per_chunk           3         Number of Q&A pairs to generate per chunk
difficulty_distribution   Equal     Distribution weights, e.g. {"easy": 0.3, "medium": 0.5, "hard": 0.2}
style_template            mixed     Predefined style: mixed, how-to, troubleshooting, conceptual, api-code
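Putting the options together, a generation request body might look like the sketch below. The field names mirror the parameters above, but the exact JSON envelope is an assumption, not the authoritative schema:

```python
# Illustrative generation request; the top-level field names are assumed.
generation_request = {
    "teacher_model": "gpt-4o",  # any model available in your AI Gateway
    "generation_type": "qa",
    "temperature": 0.7,
    "system_prompt": "Write concise Q&A pairs using our product's terminology.",
    "config": {
        "pairs_per_chunk": 3,
        "difficulty_distribution": {"easy": 0.3, "medium": 0.5, "hard": 0.2},
        "style_template": "mixed",
    },
}

# Sanity check: difficulty weights should sum to 1.0.
weights = generation_request["config"]["difficulty_distribution"].values()
assert abs(sum(weights) - 1.0) < 1e-9
```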

6. Monitor Generation

The Generation History table shows all generation runs with:

  • Status — pending, parsing, generating, validating, completed, failed, cancelled
  • Progress — Percentage completion with animated progress bar
  • Results — Chunks processed, pairs generated, average quality score
  • Duration — Elapsed time
  • Logs — Expand any row to view real-time streaming logs

Active generations poll every 5 seconds for status and progress updates.
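The same polling behavior can be reproduced in a script. The sketch below injects the status-fetching function so it runs without a live gateway; a real client would instead GET `/api/v1/data-forge/generations/{gen_id}` on each iteration:

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_for_generation(fetch_status, interval: float = 5.0, timeout: float = 7200.0) -> str:
    """Poll fetch_status() every `interval` seconds until a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("generation did not reach a terminal state in time")

# Stub that completes on the third poll, standing in for a real HTTP call.
states = iter(["pending", "generating", "completed"])
final = wait_for_generation(lambda: next(states), interval=0.01)
```

The 2-hour default timeout mirrors the K8s Job's active deadline described under Infrastructure.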

7. Review & Curate

Switch to the Review tab to curate generated pairs:

  • Accept — Include pair in exported dataset
  • Reject — Exclude from export
  • Edit — Modify question or answer text, then accept
  • Bulk Actions — Select multiple pairs and accept/reject/reset in bulk

Filter pairs by:

  • Review status (pending, accepted, rejected, edited)
  • Quality score threshold
  • Generation run
  • Full-text search across questions and answers

8. Export Dataset

In the Export tab:

  1. Select export format:
    • ChatML — {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
    • Alpaca — {"instruction": "...", "input": "", "output": "..."}
  2. Optionally set a minimum quality score filter
  3. Choose whether to include system prompts
  4. Click Export

Each export creates a versioned JSONL file in S3. Previous versions remain available for download.

Send to Fine-Tuning — One click creates a new fine-tuning job pre-populated with the exported dataset S3 key.


Quality Scoring

Each generated pair receives two scores:

Quality Score (0.0 - 1.0)

Overall pair quality based on:

  • Answer completeness and accuracy
  • Question clarity and specificity
  • Formatting consistency
  • Appropriate difficulty level

Grounding Score (0.0 - 1.0)

How well the answer is grounded in the source chunk:

  • 1.0 = Fully supported by source material
  • 0.5 = Partially supported
  • 0.0 = Hallucinated/unsupported content

Pairs with low grounding scores are flagged for review.
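A minimal flagging pass over reviewed pairs might look like this. The 0.5 threshold and the dict field names are illustrative choices, not the product's built-in cutoff or schema:

```python
def flag_low_grounding(pairs: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return pairs whose grounding_score falls below the threshold."""
    return [p for p in pairs if p.get("grounding_score", 0.0) < threshold]

pairs = [
    {"question": "Q1", "grounding_score": 0.95, "quality_score": 0.9},
    {"question": "Q2", "grounding_score": 0.2,  "quality_score": 0.8},  # likely hallucinated
]
flagged = flag_low_grounding(pairs)
```

Note that a pair can score high on quality yet low on grounding: a fluent, well-formed answer can still invent facts absent from the source chunk, which is exactly what the grounding score catches.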


Question Types

The generator classifies each pair by question type:

Type              Description
factual           Direct fact retrieval questions
how-to            Step-by-step procedural questions
troubleshooting   Problem diagnosis and resolution
conceptual        Explanation and understanding
api-code          Code examples and API usage
analytical        Analysis and comparison
comparative       Side-by-side comparisons

The Analytics tab shows question type distribution to help ensure diverse training coverage.


Analytics Dashboard

The Analytics tab provides:

  • Document Coverage — Documents parsed vs total, chunks per document
  • Quality Distribution — Histogram of quality scores across all pairs
  • Question Type Distribution — Diversity of generated question types
  • Topic Distribution — Top topics from chunk classification
  • Difficulty Distribution — Balance across easy, medium, and hard pairs
  • Training Readiness — Accepted pair count, estimated training time, dataset quality assessment

Infrastructure

Worker Image

Data Forge uses a dedicated CPU-only worker image (data-forge-worker) for all processing. The same image runs in 3 modes:

Mode       K8s Job Name        Purpose
parse      df-parse-{id}       Document parsing and chunking
generate   df-generate-{id}    Q&A pair generation via AI Gateway
validate   df-validate-{id}    Quality scoring and deduplication

K8s Job Configuration

  • Service Account: operator-service-account (for S3 IRSA)
  • Node Selector: workload-type: general (CPU nodes only)
  • Backoff Limit: 2 retries
  • Active Deadline: 2 hours
  • TTL After Finished: 10 minutes cleanup

S3 Storage

$S3_BUCKET/data-forge/<user_id>/<project_id>/sources/  → uploaded documents
$S3_BUCKET/data-forge/<user_id>/<project_id>/exports/  → exported datasets

API Endpoints

Endpoint                                              Method   Description
/api/v1/data-forge/projects                           POST     Create project
/api/v1/data-forge/projects                           GET      List projects
/api/v1/data-forge/projects/{id}                      GET      Get project details
/api/v1/data-forge/projects/{id}                      PUT      Update project
/api/v1/data-forge/projects/{id}                      DELETE   Delete project
/api/v1/data-forge/projects/{id}/upload-url           POST     Get presigned upload URL
/api/v1/data-forge/projects/{id}/documents            POST     Register uploaded document
/api/v1/data-forge/projects/{id}/documents            GET      List documents
/api/v1/data-forge/projects/{id}/documents/{doc_id}   DELETE   Delete document
/api/v1/data-forge/projects/{id}/chunks               GET      List chunks (paginated)
/api/v1/data-forge/projects/{id}/chunks/{chunk_id}    PUT      Update chunk
/api/v1/data-forge/projects/{id}/parse                POST     Start parse job
/api/v1/data-forge/projects/{id}/generate             POST     Start generation job
/api/v1/data-forge/generations/{gen_id}/cancel        POST     Cancel generation
/api/v1/data-forge/projects/{id}/generations          GET      List generations
/api/v1/data-forge/generations/{gen_id}               GET      Get generation details
/api/v1/data-forge/generations/{gen_id}/logs          GET      Get generation logs
/api/v1/data-forge/projects/{id}/pairs                GET      List pairs (paginated, filterable)
/api/v1/data-forge/pairs/{pair_id}                    PUT      Update pair
/api/v1/data-forge/projects/{id}/pairs/bulk-action    POST     Bulk accept/reject/delete/reset
/api/v1/data-forge/projects/{id}/export               POST     Export dataset
/api/v1/data-forge/projects/{id}/export/{version}     GET      Get export download URL
/api/v1/data-forge/projects/{id}/analytics            GET      Get project analytics
/api/v1/data-forge/available-models                   GET      List available teacher models
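As a sketch of calling these endpoints programmatically, the snippet below builds the POST that starts a parse job. The base URL, bearer-token auth scheme, and empty request body are assumptions; substitute whatever your gateway deployment actually requires:

```python
import json
import urllib.request

BASE = "https://gateway.example.com"  # placeholder for your AI Gateway host

def start_parse_job(project_id: str, token: str) -> urllib.request.Request:
    """Build the POST request that kicks off a parse job for a project."""
    return urllib.request.Request(
        f"{BASE}/api/v1/data-forge/projects/{project_id}/parse",
        data=json.dumps({}).encode(),  # assumed empty body
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = start_parse_job("proj-123", "TOKEN")
# urllib.request.urlopen(req) would submit the job.
```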

Best Practices

Document Quality

  • Structured content works best — Documents with headings, sections, and clear organization produce better chunks
  • Remove boilerplate — Headers, footers, legal text, and table of contents add noise
  • Minimum 5 pages per document for meaningful pair generation
  • Domain-specific terminology is preserved — the teacher model learns from your content

Generation Configuration

  • Start with defaults — The default configuration produces good results for most use cases
  • Use large teacher models — GPT-4o or Claude 3.5 Sonnet produce significantly higher quality pairs than smaller models
  • Set difficulty distribution — Ensure your training data covers easy, medium, and hard questions
  • Custom system prompts — Tailor the generation to your domain language and style

Review Workflow

  • Filter by quality score — Review lowest-quality pairs first (sort ascending)
  • Use bulk actions — Accept all pairs above 0.9 quality, then manually review the rest
  • Check grounding scores — Low grounding scores indicate potential hallucination
  • Aim for 1,000+ accepted pairs for effective fine-tuning

Export

  • ChatML format for OpenAI-compatible models and most modern training frameworks
  • Alpaca format for instruction-following fine-tuning with frameworks like LLaMA Factory
  • Set min_quality_score on export to automatically filter low-quality pairs