Data Forge
Generate synthetic fine-tuning datasets from your documents. Data Forge lets you upload source documents, parse and chunk them, generate Q&A training pairs using a teacher LLM, review and curate the output, and export it for fine-tuning.
All endpoints require authentication via the X-API-Key header and organization context via the X-Organization-ID header.
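As a minimal sketch of the header requirement above, the helper below builds the two required headers. The function name `build_headers` and the `Content-Type` value are assumptions made for illustration; this reference only specifies `X-API-Key` and `X-Organization-ID`.

```python
def build_headers(api_key: str, organization_id: str) -> dict[str, str]:
    """Return the headers that every Data Forge request carries."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    if not organization_id:
        raise ValueError("organization_id must not be empty")
    return {
        "X-API-Key": api_key,                  # authentication
        "X-Organization-ID": organization_id,  # organization context
        "Content-Type": "application/json",    # assumption: JSON request bodies
    }


headers = build_headers("example-api-key", "org_xyz")
```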
DataForgeProject Object
{
"projectId": "proj_abc123",
"name": "Customer Support FAQ",
"description": "Generate training data from support documentation",
"status": "active",
"sourceType": "documents",
"userId": "user_456",
"organizationId": "org_xyz",
"s3Prefix": "data-forge/user_456/proj_abc123",
"stats": {
"total_documents": 12,
"total_chunks": 340,
"total_pairs": 1020,
"accepted_pairs": 850,
"rejected_pairs": 45,
"pending_pairs": 125,
"avg_quality_score": 0.87
},
"defaultConfig": {},
"currentVersion": 2,
"versions": [],
"createdAt": "2025-06-15T10:00:00Z",
"updatedAt": "2025-06-15T14:30:00Z"
}
Projects
POST /api/v1/data-forge/projects
Create a new Data Forge project.
Request Body
{
"name": "Customer Support FAQ",
"description": "Generate training data from support documentation",
"source_type": "documents",
"config": {}
}
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Project name (1-200 characters) |
| description | string | No | Project description (max 2000 characters) |
| source_type | string | No | Source type: documents, urls, or raw_text (default: documents) |
| config | object | No | Project-level configuration overrides |
Response 201 Created
Returns the full DataForgeProject object.
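The field constraints for project creation can be checked on the client before the request is sent. The following is a sketch only: `build_create_project_body` is a hypothetical helper, and the server remains the authority on validation.

```python
from typing import Optional

SOURCE_TYPES = {"documents", "urls", "raw_text"}


def build_create_project_body(
    name: str,
    description: Optional[str] = None,
    source_type: str = "documents",
    config: Optional[dict] = None,
) -> dict:
    """Validate the documented limits and return the JSON body as a dict."""
    if not 1 <= len(name) <= 200:
        raise ValueError("name must be 1-200 characters")
    if description is not None and len(description) > 2000:
        raise ValueError("description must be at most 2000 characters")
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"source_type must be one of {sorted(SOURCE_TYPES)}")
    body = {"name": name, "source_type": source_type}
    # Optional fields are only included when the caller supplies them.
    if description is not None:
        body["description"] = description
    if config is not None:
        body["config"] = config
    return body


body = build_create_project_body(
    "Customer Support FAQ",
    description="Generate training data from support documentation",
)
```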
GET /api/v1/data-forge/projects
List all Data Forge projects for the authenticated user, scoped to the organization.
Response 200 OK
Returns a list of DataForgeProject objects.
GET /api/v1/data-forge/projects/:project_id
Get details of a specific Data Forge project.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Response 200 OK
Returns the full DataForgeProject object.
PUT /api/v1/data-forge/projects/:project_id
Update a Data Forge project.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Request Body
{
"name": "Updated Project Name",
"description": "Updated description",
"config": {}
}
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | No | Updated project name (1-200 characters) |
| description | string | No | Updated description (max 2000 characters) |
| config | object | No | Updated configuration |
Response 200 OK
Returns the updated DataForgeProject object.
DELETE /api/v1/data-forge/projects/:project_id
Delete a Data Forge project and clean up associated S3 objects, documents, chunks, and pairs.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Response 200 OK
{
"message": "Project proj_abc123 deleted successfully",
"deleted": true
}
Documents
POST /api/v1/data-forge/projects/:project_id/upload-url
Get a presigned PUT URL for uploading a document directly to S3.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Request Body
{
"filename": "product-manual.pdf",
"content_type": "application/pdf",
"file_size": 2048576
}
| Field | Type | Required | Description |
|---|---|---|---|
| filename | string | Yes | Name of the file to upload (1-500 characters) |
| content_type | string | No | MIME type (default: application/octet-stream) |
| file_size | integer | No | File size in bytes for validation |
Response 200 OK
{
"upload_url": "https://s3.amazonaws.com/bucket/data-forge/...",
"s3_key": "data-forge/user_456/proj_abc123/sources/product-manual.pdf"
}
POST /api/v1/data-forge/projects/:project_id/documents
Register a document that has been uploaded to S3. Call this after the client (for example, the browser) finishes uploading to the presigned URL.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Request Body
{
"filename": "product-manual.pdf",
"s3_key": "data-forge/user_456/proj_abc123/sources/product-manual.pdf",
"content_type": "application/pdf",
"file_size": 2048576,
"metadata": {}
}
| Field | Type | Required | Description |
|---|---|---|---|
| filename | string | Yes | Name of the uploaded file (1-500 characters) |
| s3_key | string | Yes | S3 object key where the file was uploaded |
| content_type | string | No | MIME type (default: application/octet-stream) |
| file_size | integer | Yes | File size in bytes |
| metadata | object | No | Additional document metadata |
Response 201 Created
{
"documentId": "doc_xyz789",
"projectId": "proj_abc123",
"name": "product-manual.pdf",
"mimeType": "application/pdf",
"fileSize": 2048576,
"s3Key": "data-forge/user_456/proj_abc123/sources/product-manual.pdf",
"parsingStatus": "pending",
"chunkCount": 0,
"createdAt": "2025-06-15T10:05:00Z"
}
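The upload flow spans three requests: obtain a presigned URL, PUT the file to S3, then register the document. The sketch below only builds the requests with the standard library and does not send them. `BASE_URL`, the helper names, and the idea of sending the same `Content-Type` on the S3 PUT are assumptions, not confirmed by this reference.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, not from this reference


def _post_json(path: str, body: dict, headers: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={**headers, "Content-Type": "application/json"},
        method="POST",
    )


def upload_url_request(project_id, filename, content_type, file_size, headers):
    """Step 1: ask the API for a presigned PUT URL."""
    body = {"filename": filename, "content_type": content_type, "file_size": file_size}
    return _post_json(f"/api/v1/data-forge/projects/{project_id}/upload-url", body, headers)


def s3_put_request(upload_url: str, payload: bytes, content_type: str):
    """Step 2: PUT the file bytes to the presigned URL (no API headers needed)."""
    return urllib.request.Request(
        upload_url, data=payload, headers={"Content-Type": content_type}, method="PUT"
    )


def register_request(project_id, filename, s3_key, content_type, file_size, headers):
    """Step 3: register the uploaded object as a project document."""
    body = {
        "filename": filename,
        "s3_key": s3_key,
        "content_type": content_type,
        "file_size": file_size,
    }
    return _post_json(f"/api/v1/data-forge/projects/{project_id}/documents", body, headers)


# Each request would be sent with urllib.request.urlopen(request).
auth = {"X-API-Key": "example-api-key", "X-Organization-ID": "org_xyz"}
step1 = upload_url_request("proj_abc123", "product-manual.pdf", "application/pdf", 2048576, auth)
```

In a real client, the `upload_url` and `s3_key` values returned by step 1 feed steps 2 and 3 respectively.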
GET /api/v1/data-forge/projects/:project_id/documents
List all documents in a Data Forge project.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Response 200 OK
Returns a list of document objects.
DELETE /api/v1/data-forge/projects/:project_id/documents/:document_id
Delete a document from a project and remove it from S3.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
| document_id | string | Yes | Document ID |
Response 200 OK
{
"message": "Document doc_xyz789 deleted successfully"
}
Chunks
GET /api/v1/data-forge/projects/:project_id/chunks
Get paginated chunks for a Data Forge project. Chunks are created when documents are parsed.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| page | integer | No | Page number, 1-based (default: 1) |
| page_size | integer | No | Items per page, 1-500 (default: 50) |
| document_id | string | No | Filter by document ID |
Response 200 OK
{
"chunks": [
{
"chunkId": "chunk_001",
"projectId": "proj_abc123",
"documentId": "doc_xyz789",
"content": "To reset your password, navigate to Settings > Security...",
"heading": "Password Reset",
"pageNumber": 3,
"position": 12,
"contentType": "paragraph",
"topic": "account-management",
"difficulty": "easy",
"pairsGenerated": 3,
"createdAt": "2025-06-15T10:10:00Z"
}
],
"total": 340,
"page": 1,
"page_size": 50
}
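Because the response carries `total`, `page`, and `page_size`, a client can walk every page until the total is reached. The sketch below assumes a caller-supplied `fetch_page(page, page_size)` function (hypothetical) that returns one decoded response body; it works for any paginated list that uses this envelope.

```python
import math
from typing import Callable, Iterator


def page_count(total: int, page_size: int) -> int:
    """Number of pages needed to cover `total` items."""
    return math.ceil(total / page_size)


def iter_items(
    fetch_page: Callable[[int, int], dict], key: str, page_size: int = 50
) -> Iterator[dict]:
    """Yield every item under `key`, requesting pages until `total` is covered."""
    if not 1 <= page_size <= 500:
        raise ValueError("page_size must be between 1 and 500")
    page = 1  # pages are 1-based
    while True:
        body = fetch_page(page, page_size)
        yield from body[key]
        # Stop once the pages seen so far cover the total, or a page is empty.
        if page * body["page_size"] >= body["total"] or not body[key]:
            break
        page += 1


# Stand-in for the HTTP call, serving 120 fake chunks.
_fake = [{"chunkId": f"chunk_{i:03d}"} for i in range(120)]


def fake_fetch(page: int, page_size: int) -> dict:
    start = (page - 1) * page_size
    return {"chunks": _fake[start:start + page_size], "total": len(_fake),
            "page": page, "page_size": page_size}


all_chunks = list(iter_items(fake_fetch, "chunks"))
```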
PUT /api/v1/data-forge/projects/:project_id/chunks/:chunk_id
Update a specific chunk's content or metadata.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
| chunk_id | string | Yes | Chunk ID |
Request Body
{
"content": "Updated chunk text content",
"metadata": {},
"excluded": false
}
| Field | Type | Required | Description |
|---|---|---|---|
| content | string | No | Updated chunk text content |
| metadata | object | No | Updated chunk metadata |
| excluded | boolean | No | Whether to exclude this chunk from generation |
Response 200 OK
Returns the updated chunk object.
Pipeline
POST /api/v1/data-forge/projects/:project_id/parse
Start a document parsing job. The job runs as a Kubernetes Job and parses the uploaded documents into text chunks.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Response 200 OK
{
"generation_id": "gen_parse_001",
"status": "pending",
"message": "Parse job started"
}
POST /api/v1/data-forge/projects/:project_id/generate
Start a data generation job. Uses an AI model to generate training pairs from parsed chunks.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Request Body
{
"model_id": "gpt-4o",
"generation_type": "qa",
"num_pairs": 500,
"temperature": 0.7,
"max_tokens": 4096,
"system_prompt": "Generate high-quality Q&A pairs from the provided context.",
"chunk_ids": null,
"config": {
"pairs_per_chunk": 3,
"difficulty_distribution": {"easy": 0.3, "medium": 0.5, "hard": 0.2},
"style_template": "mixed"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| model_id | string | Yes | AI model ID to use as the teacher LLM |
| generation_type | string | No | Type of generation: qa, instruction, conversation, summary, classification (default: qa) |
| num_pairs | integer | No | Target number of pairs to generate (1-100,000) |
| temperature | float | No | Sampling temperature (0.0-2.0, default: 0.7) |
| max_tokens | integer | No | Max tokens per generation response (1-32,768) |
| system_prompt | string | No | Custom system prompt for the teacher LLM (max 10,000 characters) |
| chunk_ids | array | No | Specific chunk IDs to generate from (all chunks if omitted) |
| config | object | No | Additional generation configuration; its keys are merged into the generation's config |
Response 200 OK
{
"generationId": "gen_abc123",
"projectId": "proj_abc123",
"config": {
"model_id": "gpt-4o",
"generation_type": "qa",
"temperature": 0.7,
"pairs_per_chunk": 3
},
"status": "pending",
"progress": 0,
"createdAt": "2025-06-15T11:00:00Z"
}
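The documented ranges for the generation request lend themselves to a client-side check. The following sketch uses a hypothetical `build_generate_body` helper and omits optional fields that the caller leaves unset; it does not validate the contents of `config`, whose keys are not fully specified here.

```python
from typing import Optional

GENERATION_TYPES = {"qa", "instruction", "conversation", "summary", "classification"}


def build_generate_body(
    model_id: str,
    generation_type: str = "qa",
    num_pairs: Optional[int] = None,
    temperature: float = 0.7,
    max_tokens: Optional[int] = None,
    system_prompt: Optional[str] = None,
    chunk_ids: Optional[list] = None,
    config: Optional[dict] = None,
) -> dict:
    """Check the documented ranges and return the request body as a dict."""
    if not model_id:
        raise ValueError("model_id is required")
    if generation_type not in GENERATION_TYPES:
        raise ValueError(f"generation_type must be one of {sorted(GENERATION_TYPES)}")
    if num_pairs is not None and not 1 <= num_pairs <= 100_000:
        raise ValueError("num_pairs must be between 1 and 100,000")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if max_tokens is not None and not 1 <= max_tokens <= 32_768:
        raise ValueError("max_tokens must be between 1 and 32,768")
    if system_prompt is not None and len(system_prompt) > 10_000:
        raise ValueError("system_prompt must be at most 10,000 characters")
    body = {"model_id": model_id, "generation_type": generation_type,
            "temperature": temperature}
    optional = {"num_pairs": num_pairs, "max_tokens": max_tokens,
                "system_prompt": system_prompt, "chunk_ids": chunk_ids,
                "config": config}
    # Leaving chunk_ids out means "generate from all chunks".
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


generate_body = build_generate_body("gpt-4o", num_pairs=500, config={"pairs_per_chunk": 3})
```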
POST /api/v1/data-forge/generations/:generation_id/cancel
Cancel a running generation job.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| generation_id | string | Yes | Generation job ID |
Response 200 OK
{
"generation_id": "gen_abc123",
"status": "cancelled",
"message": "Generation cancelled"
}
Generations
GET /api/v1/data-forge/projects/:project_id/generations
List all generation jobs for a project.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Response 200 OK
Returns a list of generation objects:
[
{
"generationId": "gen_abc123",
"projectId": "proj_abc123",
"config": {
"model_id": "gpt-4o",
"generation_type": "qa",
"temperature": 0.7
},
"status": "completed",
"progress": 100,
"jobName": "df-generate-gen_abc1",
"results": {
"chunks_processed": 340,
"pairs_generated": 1020,
"pairs_valid": 980,
"avg_quality_score": 0.87,
"tokens_used": 450000
},
"startedAt": "2025-06-15T11:00:30Z",
"completedAt": "2025-06-15T11:45:00Z",
"createdAt": "2025-06-15T11:00:00Z"
}
]
GET /api/v1/data-forge/generations/:generation_id
Get details of a specific generation job.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| generation_id | string | Yes | Generation job ID |
Response 200 OK
Returns the full generation object.
GET /api/v1/data-forge/generations/:generation_id/logs
Get logs for a specific generation job. While a job is active, logs are streamed in real time.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| generation_id | string | Yes | Generation job ID |
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| tail | integer | No | Number of most recent log entries to return (1-10,000) |
Response 200 OK
{
"generation_id": "gen_abc123",
"logs": [
{
"timestamp": "2025-06-15T11:00:30Z",
"level": "info",
"stage": "generation",
"message": "Starting generation for 340 chunks"
},
{
"timestamp": "2025-06-15T11:01:00Z",
"level": "info",
"stage": "generation",
"message": "Processing chunk 1/340: Password Reset"
},
{
"timestamp": "2025-06-15T11:01:05Z",
"level": "info",
"stage": "generation",
"message": "Generated 3 pairs for chunk 1 (avg quality: 0.92)"
}
]
}
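Log entries share a fixed shape (`timestamp`, `level`, `stage`, `message`), so they are easy to render or summarize on the client. This sketch formats entries as single lines and measures the time between the first and last entry; the helper names are hypothetical, and it assumes entries arrive in chronological order as in the sample.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse the ISO 8601 timestamps used in log entries."""
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def format_entry(entry: dict) -> str:
    """Render one log entry as a single text line."""
    return (f'{entry["timestamp"]} [{entry["level"].upper()}] '
            f'{entry["stage"]}: {entry["message"]}')


def elapsed_seconds(logs: list) -> float:
    """Seconds between the first and last entry (0.0 for an empty list)."""
    if not logs:
        return 0.0
    first = parse_timestamp(logs[0]["timestamp"])
    last = parse_timestamp(logs[-1]["timestamp"])
    return (last - first).total_seconds()


sample_logs = [
    {"timestamp": "2025-06-15T11:00:30Z", "level": "info", "stage": "generation",
     "message": "Starting generation for 340 chunks"},
    {"timestamp": "2025-06-15T11:01:05Z", "level": "info", "stage": "generation",
     "message": "Generated 3 pairs for chunk 1 (avg quality: 0.92)"},
]
lines = [format_entry(entry) for entry in sample_logs]
```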
Pairs
GET /api/v1/data-forge/projects/:project_id/pairs
Get paginated training pairs for a project with optional filtering.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| page | integer | No | Page number, 1-based (default: 1) |
| page_size | integer | No | Items per page, 1-500 (default: 50) |
| status | string | No | Filter by review status: pending, accepted, rejected, edited |
| generation_id | string | No | Filter by generation run ID |
| search | string | No | Search within pair input/output text |
Response 200 OK
{
"pairs": [
{
"pairId": "pair_001",
"projectId": "proj_abc123",
"chunkId": "chunk_001",
"documentId": "doc_xyz789",
"generationId": "gen_abc123",
"type": "single_turn",
"systemPrompt": "You are a helpful customer support assistant.",
"messages": [
{"role": "system", "content": "You are a helpful customer support assistant."},
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, navigate to Settings > Security..."}
],
"question": "How do I reset my password?",
"answer": "To reset your password, navigate to Settings > Security...",
"qualityScore": 0.92,
"groundingScore": 0.95,
"complexityLevel": "easy",
"questionType": "how-to",
"isDuplicate": false,
"status": "accepted",
"reviewerNotes": null,
"createdAt": "2025-06-15T11:01:05Z"
}
],
"total": 1020,
"page": 1,
"page_size": 50
}
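The pair filters are plain query parameters, so a request path can be assembled with the standard library. The sketch below uses a hypothetical `pairs_url` helper that validates the documented ranges and leaves out any filter that is not set.

```python
from typing import Optional
from urllib.parse import urlencode

REVIEW_STATUSES = {"pending", "accepted", "rejected", "edited"}


def pairs_url(
    project_id: str,
    page: int = 1,
    page_size: int = 50,
    status: Optional[str] = None,
    generation_id: Optional[str] = None,
    search: Optional[str] = None,
) -> str:
    """Return the request path and query string for listing pairs."""
    if page < 1:
        raise ValueError("page is 1-based")
    if not 1 <= page_size <= 500:
        raise ValueError("page_size must be between 1 and 500")
    if status is not None and status not in REVIEW_STATUSES:
        raise ValueError(f"status must be one of {sorted(REVIEW_STATUSES)}")
    params = {"page": page, "page_size": page_size, "status": status,
              "generation_id": generation_id, "search": search}
    # Unset filters are dropped so the server applies its defaults.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"/api/v1/data-forge/projects/{project_id}/pairs?{query}"


pending_url = pairs_url("proj_abc123", status="pending", search="reset password")
```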
PUT /api/v1/data-forge/pairs/:pair_id
Update a specific training pair. Use this to edit content or change review status.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| pair_id | string | Yes | Pair ID |
Request Body
{
"input_text": "Updated question text",
"output_text": "Updated answer text",
"status": "accepted",
"metadata": {
"reviewer_notes": "Good quality pair"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| input_text | string | No | Updated input/question text |
| output_text | string | No | Updated output/answer text |
| status | string | No | Review status: pending, accepted, rejected, edited |
| metadata | object | No | Updated pair metadata |
Response 200 OK
Returns the updated pair object.
POST /api/v1/data-forge/projects/:project_id/pairs/bulk-action
Perform a bulk action on training pairs. Specify either pair_ids for specific pairs or filters for criteria-based selection.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Request Body
{
"action": "accept",
"pair_ids": ["pair_001", "pair_002", "pair_003"],
"filters": null
}
| Field | Type | Required | Description |
|---|---|---|---|
| action | string | Yes | Action to perform: accept, reject, delete, reset |
| pair_ids | array | No | Specific pair IDs to act on |
| filters | object | No | Filter criteria for selecting pairs (alternative to pair_ids) |
Response 200 OK
{
"action": "accept",
"affected_count": 3,
"message": "Bulk action completed"
}
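Since a bulk action targets either explicit IDs or a filter, a small builder can enforce that choice before the request is sent. This sketch treats the two selectors as mutually exclusive, which is an assumption; the reference does not say how the server handles both at once. The filter key used in the example is also hypothetical, because the accepted filter keys are not listed here.

```python
from typing import Optional

BULK_ACTIONS = {"accept", "reject", "delete", "reset"}


def build_bulk_action_body(
    action: str,
    pair_ids: Optional[list] = None,
    filters: Optional[dict] = None,
) -> dict:
    """Return a bulk-action body that selects pairs by ID or by filter."""
    if action not in BULK_ACTIONS:
        raise ValueError(f"action must be one of {sorted(BULK_ACTIONS)}")
    # Assumption: exactly one selector should be supplied.
    if (pair_ids is None) == (filters is None):
        raise ValueError("provide exactly one of pair_ids or filters")
    if pair_ids is not None:
        if not pair_ids:
            raise ValueError("pair_ids must not be empty")
        return {"action": action, "pair_ids": list(pair_ids)}
    return {"action": action, "filters": filters}


by_ids = build_bulk_action_body("accept", pair_ids=["pair_001", "pair_002", "pair_003"])
by_filter = build_bulk_action_body("reject", filters={"status": "pending"})  # hypothetical key
```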
Export
POST /api/v1/data-forge/projects/:project_id/export
Export the dataset as a JSONL file in the specified format. Only accepted pairs are included.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Request Body
{
"format": "chatml",
"include_system_prompt": true,
"min_quality_score": 0.8
}
| Field | Type | Required | Description |
|---|---|---|---|
| format | string | No | Export format: chatml or alpaca (default: chatml) |
| include_system_prompt | boolean | No | Include system prompt in ChatML output (default: true) |
| min_quality_score | float | No | Minimum quality score filter for pairs (0.0-1.0) |
Export Formats
ChatML (OpenAI-compatible chat format):
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
Alpaca (instruction-following format):
{"instruction": "...", "input": "", "output": "..."}
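To illustrate how the two formats relate to a pair object, the sketch below converts one pair into a JSONL line of each kind. It is not the service's exporter: the mapping of `question` to `instruction` and `answer` to `output`, and dropping the system message when `include_system_prompt` is false, are assumptions based on the field names in this reference.

```python
import json


def to_chatml(pair: dict, include_system_prompt: bool = True) -> str:
    """Serialize a pair's messages as one ChatML JSONL line."""
    messages = [
        message for message in pair["messages"]
        if include_system_prompt or message["role"] != "system"
    ]
    return json.dumps({"messages": messages})


def to_alpaca(pair: dict) -> str:
    """Serialize a pair as one Alpaca JSONL line with an empty input."""
    return json.dumps(
        {"instruction": pair["question"], "input": "", "output": pair["answer"]}
    )


sample_pair = {
    "messages": [
        {"role": "system", "content": "You are a helpful customer support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "To reset your password, navigate to Settings > Security..."},
    ],
    "question": "How do I reset my password?",
    "answer": "To reset your password, navigate to Settings > Security...",
}
chatml_line = to_chatml(sample_pair, include_system_prompt=False)
alpaca_line = to_alpaca(sample_pair)
```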
Response 200 OK
{
"version": 3,
"format": "chatml",
"s3Key": "data-forge/user_456/proj_abc123/exports/v3/dataset.jsonl",
"pairCount": 850,
"fileSize": 4521890,
"downloadUrl": "https://s3.amazonaws.com/bucket/data-forge/...",
"includeSystemPrompt": true,
"minQualityScore": 0.8
}
GET /api/v1/data-forge/projects/:project_id/export/:version
Get a download URL for a specific export version.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
| version | integer | Yes | Export version number |
Response 200 OK
Returns the export object with a presigned download URL.
Analytics
GET /api/v1/data-forge/projects/:project_id/analytics
Get analytics and statistics for a Data Forge project.
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | Project ID |
Response 200 OK
{
"projectId": "proj_abc123",
"projectName": "Customer Support FAQ",
"documents": {
"total": 12,
"parsed": 12,
"failed": 0
},
"chunks": {
"total": 340,
"by_type": {"paragraph": 280, "heading": 60}
},
"pairs": {
"total": 1020,
"accepted": 850,
"rejected": 45,
"pending": 125,
"avg_quality_score": 0.87
},
"qualityDistribution": [
{"range": "0.9-1.0", "count": 420},
{"range": "0.8-0.9", "count": 380},
{"range": "0.7-0.8", "count": 150},
{"range": "0.0-0.7", "count": 70}
],
"generations": {
"total": 3,
"completed": 2,
"running": 1,
"failed": 0
},
"exportsCount": 2
}
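The analytics payload contains enough to derive simple review metrics on the client. The sketch below computes an acceptance rate, the fraction of pairs already reviewed, and the share of pairs in the top quality buckets. The `review_summary` helper and the choice of which buckets count as high quality are assumptions made for illustration.

```python
def review_summary(analytics: dict, high_ranges=("0.8-0.9", "0.9-1.0")) -> dict:
    """Derive review ratios from an analytics response body."""
    pairs = analytics["pairs"]
    total = pairs["total"]
    if total == 0:
        return {"acceptance_rate": 0.0, "reviewed_fraction": 0.0,
                "high_quality_fraction": 0.0}
    reviewed = pairs["accepted"] + pairs["rejected"]
    # Sum the counts of the quality buckets named in high_ranges.
    high = sum(bucket["count"] for bucket in analytics["qualityDistribution"]
               if bucket["range"] in high_ranges)
    return {
        "acceptance_rate": pairs["accepted"] / total,
        "reviewed_fraction": reviewed / total,
        "high_quality_fraction": high / total,
    }


sample_analytics = {
    "pairs": {"total": 1020, "accepted": 850, "rejected": 45, "pending": 125},
    "qualityDistribution": [
        {"range": "0.9-1.0", "count": 420},
        {"range": "0.8-0.9", "count": 380},
        {"range": "0.7-0.8", "count": 150},
        {"range": "0.0-0.7", "count": 70},
    ],
}
summary = review_summary(sample_analytics)
```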
GET /api/v1/data-forge/available-models
Get available AI models that can be used as teacher models for data generation.
Response 200 OK
[
{
"id": "gpt-4o",
"name": "GPT-4o",
"provider": "openai"
},
{
"id": "claude-3-5-sonnet",
"name": "Claude 3.5 Sonnet",
"provider": "anthropic"
},
{
"id": "llama-3.1-70b",
"name": "Llama 3.1 70B",
"provider": "self-hosted"
}
]
Generation Status Values
| Status | Description |
|---|---|
| pending | Job created, waiting to start |
| parsing | Parsing documents into chunks |
| generating | Generating Q&A pairs from chunks |
| validating | Running quality validation and deduplication |
| completed | Job finished successfully |
| failed | Error occurred (check logs) |
| cancelled | Stopped by user |
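A client that waits for a job typically polls the generation until it reaches a final state. The sketch below treats `completed`, `failed`, and `cancelled` as terminal, which is an inference from the descriptions above rather than a documented guarantee. `fetch_generation` is a hypothetical caller-supplied function that returns the decoded generation object.

```python
import time
from typing import Callable

# Assumption: these three statuses never transition further.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_generation(
    fetch_generation: Callable[[], dict],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the generation reaches a terminal status, then return it."""
    for _ in range(max_polls):
        generation = fetch_generation()
        if generation["status"] in TERMINAL_STATUSES:
            return generation
        sleep(poll_seconds)
    raise TimeoutError(f"generation still running after {max_polls} polls")


# Stand-in for the HTTP call: replays a fixed sequence of statuses.
_states = iter([{"status": "pending"}, {"status": "generating"},
                {"status": "validating"}, {"status": "completed"}])
final = wait_for_generation(lambda: next(_states), sleep=lambda _seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.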
Pair Review Status Values
| Status | Description |
|---|---|
| pending | Not yet reviewed |
| accepted | Approved for export |
| rejected | Excluded from export |
| edited | Modified by reviewer, approved for export |
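To estimate how many pairs an export would contain, a client can apply the same kind of filter locally. This sketch follows the table above and counts both `accepted` and `edited` as exportable, while the export endpoint description mentions only accepted pairs; the status set is therefore a parameter. The helper name is hypothetical, and the server's actual selection logic may differ.

```python
EXPORTABLE_STATUSES = frozenset({"accepted", "edited"})


def exportable_pairs(pairs, min_quality_score=0.0, statuses=EXPORTABLE_STATUSES):
    """Return pairs whose status and quality score qualify for export."""
    return [
        pair for pair in pairs
        if pair["status"] in statuses and pair["qualityScore"] >= min_quality_score
    ]


sample_pairs = [
    {"pairId": "pair_001", "status": "accepted", "qualityScore": 0.92},
    {"pairId": "pair_002", "status": "edited", "qualityScore": 0.85},
    {"pairId": "pair_003", "status": "accepted", "qualityScore": 0.60},
    {"pairId": "pair_004", "status": "rejected", "qualityScore": 0.95},
    {"pairId": "pair_005", "status": "pending", "qualityScore": 0.90},
]
selected = exportable_pairs(sample_pairs, min_quality_score=0.8)
accepted_only = exportable_pairs(sample_pairs, 0.8, statuses={"accepted"})
```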