
Data Forge

Generate high-quality synthetic training data from your existing documents. Data Forge automates the full pipeline from raw documents to fine-tuning-ready datasets.

Overview

Data Forge creates specialized training datasets through a six-step pipeline:

  1. Upload - Add source documents (PDF, DOCX, Markdown, TXT, HTML, CSV)
  2. Parse & Chunk - Automatically split documents into semantic chunks
  3. Generate - Use a teacher LLM to create Q&A training pairs from each chunk
  4. Validate - Score quality, check grounding against source material, deduplicate
  5. Review - Accept, reject, or edit generated pairs
  6. Export - Download as JSONL in ChatML or Alpaca format for fine-tuning

All processing runs as Kubernetes Jobs launched by the AI Gateway. No GPU is required for Data Forge — LLM calls go through the AI Gateway inference API.


Getting Started

1. Create a Project

Navigate to AI Gateway > Data Forge in the sidebar and click Create Project. Provide a name and optional description.

2. Upload Documents

In the Documents tab, drag and drop files or click to browse. Supported formats:

Format       Extensions    Parser
PDF          .pdf          PyMuPDF
Word         .docx         python-docx
Markdown     .md           markdown
Plain Text   .txt          Direct read
HTML         .html, .htm   BeautifulSoup
CSV          .csv          pandas

Files upload directly to S3 via presigned URLs — they never pass through the server.
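The direct-to-S3 flow can be sketched in a few lines of Python. The helper name, the placeholder URL, and the use of `urllib` are illustrative; the real presigned URL comes from the project's `upload-url` endpoint, and any HTTP client works:

```python
import urllib.request

def build_presigned_put(presigned_url: str, data: bytes, content_type: str) -> urllib.request.Request:
    """Build the HTTP PUT that sends file bytes straight to S3."""
    return urllib.request.Request(
        presigned_url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )

# The presigned URL below is a placeholder; fetch a real one via
# POST /api/v1/data-forge/projects/{id}/upload-url first.
req = build_presigned_put(
    "https://example-bucket.s3.amazonaws.com/doc.pdf?X-Amz-Signature=PLACEHOLDER",
    b"%PDF-1.7 ...",
    "application/pdf",
)
# urllib.request.urlopen(req) would perform the actual upload.
```

Because the PUT goes to S3 directly, file bytes never transit the application server, which keeps uploads fast and the server stateless.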

3. Parse Documents

Click Parse All Documents to start the parsing job. This creates a K8s Job that:

  • Downloads each document from S3
  • Extracts text content using format-specific parsers
  • Splits content into chunks by headings, paragraphs, or sliding window
  • Stores chunks in MongoDB with position, heading, and content type metadata
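Of the splitting strategies above, the sliding-window case can be sketched as follows. The function name and the size/overlap defaults are illustrative, not the worker's actual parameters:

```python
def sliding_window_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of roughly `size` characters.

    Assumes size > overlap; the overlap preserves context across chunk
    boundaries so no sentence is stranded without its surroundings.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break  # the final window already covers the tail
    return chunks
```

Heading- and paragraph-based splitting would instead cut at structural boundaries, which is why well-organized documents chunk better.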

Progress appears in real time in the Generation History section.

4. Review Chunks

Switch to the Chunks tab to review parsed chunks. You can:

  • Preview chunk content
  • Edit chunk text before generation
  • Exclude irrelevant chunks from generation
  • Filter by document

5. Configure Generation

In the Generate tab, configure the generation run:

Teacher Model — Select any model available in your AI Gateway. Larger models produce higher quality pairs but cost more.

Generation Type — Controls the type of data generated:

Type             Description
qa               Question and answer pairs
instruction      Instruction-following pairs
conversation     Multi-turn conversations
summary          Summarization pairs
classification   Classification examples

Temperature — Controls randomness (0.0-2.0, default: 0.7). Lower values produce more focused, deterministic pairs. Higher values produce more diverse, creative pairs.

System Prompt — Custom instructions for the teacher LLM. Use this to control the style, domain terminology, and format of generated pairs.

Additional Configuration (passed via the config field):

Parameter                 Default   Description
pairs_per_chunk           3         Number of Q&A pairs to generate per chunk
difficulty_distribution   Equal     Distribution weights, e.g. {"easy": 0.3, "medium": 0.5, "hard": 0.2}
style_template            mixed     Predefined style: mixed, how-to, troubleshooting, conceptual, api-code
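Putting the options together, a generation request body might look like the sketch below. The field names mirror the parameters above, but the exact JSON envelope is an assumption, not the authoritative schema:

```python
# Illustrative generation request; the top-level field names are assumed.
generation_request = {
    "teacher_model": "gpt-4o",  # any model available in your AI Gateway
    "generation_type": "qa",
    "temperature": 0.7,
    "system_prompt": "Write concise Q&A pairs using our product's terminology.",
    "config": {
        "pairs_per_chunk": 3,
        "difficulty_distribution": {"easy": 0.3, "medium": 0.5, "hard": 0.2},
        "style_template": "mixed",
    },
}

# Sanity check: difficulty weights should sum to 1.0.
weights = generation_request["config"]["difficulty_distribution"].values()
assert abs(sum(weights) - 1.0) < 1e-9
```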

6. Monitor Generation

The Generation History table shows all generation runs with:

  • Status — pending, parsing, generating, validating, completed, failed, cancelled
  • Progress — Percentage completion with animated progress bar
  • Results — Chunks processed, pairs generated, average quality score
  • Duration — Elapsed time
  • Logs — Expand any row to view real-time streaming logs

Active generations poll every 5 seconds for status and progress updates.
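The same polling behavior can be reproduced in a script. The sketch below injects the status-fetching function so it runs without a live gateway; a real client would instead GET `/api/v1/data-forge/generations/{gen_id}` on each iteration:

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_for_generation(fetch_status, interval: float = 5.0, timeout: float = 7200.0) -> str:
    """Poll fetch_status() every `interval` seconds until a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("generation did not reach a terminal state in time")

# Stub that completes on the third poll, standing in for a real HTTP call.
states = iter(["pending", "generating", "completed"])
final = wait_for_generation(lambda: next(states), interval=0.01)
```

The 2-hour default timeout mirrors the K8s Job's active deadline described under Infrastructure.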

7. Review & Curate

Switch to the Review tab to curate generated pairs:

  • Accept — Include pair in exported dataset
  • Reject — Exclude from export
  • Edit — Modify question or answer text, then accept
  • Bulk Actions — Select multiple pairs and accept/reject/reset in bulk

Filter pairs by:

  • Review status (pending, accepted, rejected, edited)
  • Quality score threshold
  • Generation run
  • Full-text search across questions and answers

8. Export Dataset

In the Export tab:

  1. Select export format:
    • ChatML — {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
    • Alpaca — {"instruction": "...", "input": "", "output": "..."}
  2. Optionally set a minimum quality score filter
  3. Choose whether to include system prompts
  4. Click Export

Each export creates a versioned JSONL file in S3. Previous versions remain available for download.

Send to Fine-Tuning — One click creates a new fine-tuning job pre-populated with the exported dataset S3 key.


Quality Scoring

Each generated pair receives two scores:

Quality Score (0.0 - 1.0)

Overall pair quality based on:

  • Answer completeness and accuracy
  • Question clarity and specificity
  • Formatting consistency
  • Appropriate difficulty level

Grounding Score (0.0 - 1.0)

How well the answer is grounded in the source chunk:

  • 1.0 = Fully supported by source material
  • 0.5 = Partially supported
  • 0.0 = Hallucinated/unsupported content

Pairs with low grounding scores are flagged for review.
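A minimal flagging pass over reviewed pairs might look like this. The 0.5 threshold and the dict field names are illustrative choices, not the product's built-in cutoff or schema:

```python
def flag_low_grounding(pairs: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return pairs whose grounding_score falls below the threshold."""
    return [p for p in pairs if p.get("grounding_score", 0.0) < threshold]

pairs = [
    {"question": "Q1", "grounding_score": 0.95, "quality_score": 0.9},
    {"question": "Q2", "grounding_score": 0.2,  "quality_score": 0.8},  # likely hallucinated
]
flagged = flag_low_grounding(pairs)
```

Note that a pair can score high on quality yet low on grounding: a fluent, well-formed answer can still invent facts absent from the source chunk, which is exactly what the grounding score catches.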


Question Types

The generator classifies each pair by question type:

Type              Description
factual           Direct fact retrieval questions
how-to            Step-by-step procedural questions
troubleshooting   Problem diagnosis and resolution
conceptual        Explanation and understanding
api-code          Code examples and API usage
analytical        Analysis and comparison
comparative       Side-by-side comparisons

The Analytics tab shows question type distribution to help ensure diverse training coverage.


Analytics Dashboard

The Analytics tab provides:

  • Document Coverage — Documents parsed vs total, chunks per document
  • Quality Distribution — Histogram of quality scores across all pairs
  • Question Type Distribution — Diversity of generated question types
  • Topic Distribution — Top topics from chunk classification
  • Difficulty Distribution — Balance across easy, medium, and hard pairs
  • Training Readiness — Accepted pair count, estimated training time, dataset quality assessment

Infrastructure

Worker Image

Data Forge uses a dedicated CPU-only worker image (data-forge-worker) for all processing. The same image runs in 3 modes:

Mode       K8s Job Name        Purpose
parse      df-parse-{id}       Document parsing and chunking
generate   df-generate-{id}    Q&A pair generation via AI Gateway
validate   df-validate-{id}    Quality scoring and deduplication

K8s Job Configuration

  • Service Account: operator-service-account (for S3 IRSA)
  • Node Selector: workload-type: general (CPU nodes only)
  • Backoff Limit: 2 retries
  • Active Deadline: 2 hours
  • TTL After Finished: 10 minutes cleanup

S3 Storage

$S3_BUCKET/data-forge/<user_id>/<project_id>/sources/  → uploaded documents
$S3_BUCKET/data-forge/<user_id>/<project_id>/exports/  → exported datasets

API Endpoints

Endpoint                                              Method   Description
/api/v1/data-forge/projects                           POST     Create project
/api/v1/data-forge/projects                           GET      List projects
/api/v1/data-forge/projects/{id}                      GET      Get project details
/api/v1/data-forge/projects/{id}                      PUT      Update project
/api/v1/data-forge/projects/{id}                      DELETE   Delete project
/api/v1/data-forge/projects/{id}/upload-url           POST     Get presigned upload URL
/api/v1/data-forge/projects/{id}/documents            POST     Register uploaded document
/api/v1/data-forge/projects/{id}/documents            GET      List documents
/api/v1/data-forge/projects/{id}/documents/{doc_id}   DELETE   Delete document
/api/v1/data-forge/projects/{id}/chunks               GET      List chunks (paginated)
/api/v1/data-forge/projects/{id}/chunks/{chunk_id}    PUT      Update chunk
/api/v1/data-forge/projects/{id}/parse                POST     Start parse job
/api/v1/data-forge/projects/{id}/generate             POST     Start generation job
/api/v1/data-forge/generations/{gen_id}/cancel        POST     Cancel generation
/api/v1/data-forge/projects/{id}/generations          GET      List generations
/api/v1/data-forge/generations/{gen_id}               GET      Get generation details
/api/v1/data-forge/generations/{gen_id}/logs          GET      Get generation logs
/api/v1/data-forge/projects/{id}/pairs                GET      List pairs (paginated, filterable)
/api/v1/data-forge/pairs/{pair_id}                    PUT      Update pair
/api/v1/data-forge/projects/{id}/pairs/bulk-action    POST     Bulk accept/reject/delete/reset
/api/v1/data-forge/projects/{id}/export               POST     Export dataset
/api/v1/data-forge/projects/{id}/export/{version}     GET      Get export download URL
/api/v1/data-forge/projects/{id}/analytics            GET      Get project analytics
/api/v1/data-forge/available-models                   GET      List available teacher models
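As a sketch of calling these endpoints programmatically, the snippet below builds the POST that starts a parse job. The base URL, bearer-token auth scheme, and empty request body are assumptions; substitute whatever your gateway deployment actually requires:

```python
import json
import urllib.request

BASE = "https://gateway.example.com"  # placeholder for your AI Gateway host

def start_parse_job(project_id: str, token: str) -> urllib.request.Request:
    """Build the POST request that kicks off a parse job for a project."""
    return urllib.request.Request(
        f"{BASE}/api/v1/data-forge/projects/{project_id}/parse",
        data=json.dumps({}).encode(),  # assumed empty body
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = start_parse_job("proj-123", "TOKEN")
# urllib.request.urlopen(req) would submit the job.
```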

Best Practices

Document Quality

  • Structured content works best — Documents with headings, sections, and clear organization produce better chunks
  • Remove boilerplate — Headers, footers, legal text, and table of contents add noise
  • Minimum 5 pages per document for meaningful pair generation
  • Domain-specific terminology is preserved — the teacher model learns from your content

Generation Configuration

  • Start with defaults — The default configuration produces good results for most use cases
  • Use large teacher models — GPT-4o or Claude 3.5 Sonnet produce significantly higher quality pairs than smaller models
  • Set difficulty distribution — Ensure your training data covers easy, medium, and hard questions
  • Custom system prompts — Tailor the generation to your domain language and style

Review Workflow

  • Filter by quality score — Review lowest-quality pairs first (sort ascending)
  • Use bulk actions — Accept all pairs above 0.9 quality, then manually review the rest
  • Check grounding scores — Low grounding scores indicate potential hallucination
  • Aim for 1,000+ accepted pairs for effective fine-tuning

Export

  • ChatML format for OpenAI-compatible models and most modern training frameworks
  • Alpaca format for instruction-following fine-tuning with frameworks like LLaMA Factory
  • Set min_quality_score on export to automatically filter low-quality pairs