Data Forge

Generate synthetic fine-tuning datasets from your documents. Data Forge lets you upload source documents, parse and chunk them, generate Q&A training pairs using a teacher LLM, review and curate the output, and export it for fine-tuning.

All endpoints require authentication via the X-API-Key header and organization context via the X-Organization-ID header.
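
A minimal sketch of attaching both headers to a request, using only Python's standard library. The base URL and the helper name `build_request` are assumptions for illustration and are not part of this reference.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the host of your deployment.
BASE_URL = "https://api.example.com"


def build_request(method, path, api_key, organization_id, body=None):
    """Build (but do not send) a request carrying both required headers."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    request.add_header("X-API-Key", api_key)
    request.add_header("X-Organization-ID", organization_id)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


# Example: a request that lists projects. Nothing is sent until it is
# passed to urllib.request.urlopen(request).
request = build_request("GET", "/api/v1/data-forge/projects", "my-api-key", "org_xyz")
```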


DataForgeProject Object

{
"projectId": "proj_abc123",
"name": "Customer Support FAQ",
"description": "Generate training data from support documentation",
"status": "active",
"sourceType": "documents",
"userId": "user_456",
"organizationId": "org_xyz",
"s3Prefix": "data-forge/user_456/proj_abc123",
"stats": {
"total_documents": 12,
"total_chunks": 340,
"total_pairs": 1020,
"accepted_pairs": 850,
"rejected_pairs": 45,
"pending_pairs": 125,
"avg_quality_score": 0.87
},
"defaultConfig": {},
"currentVersion": 2,
"versions": [],
"createdAt": "2025-06-15T10:00:00Z",
"updatedAt": "2025-06-15T14:30:00Z"
}

Projects

POST /api/v1/data-forge/projects

Create a new Data Forge project.

Request Body

{
"name": "Customer Support FAQ",
"description": "Generate training data from support documentation",
"source_type": "documents",
"config": {}
}

Field | Type | Required | Description
name | string | Yes | Project name (1-200 characters)
description | string | No | Project description (max 2000 characters)
source_type | string | No | Source type: documents, urls, or raw_text (default: documents)
config | object | No | Project-level configuration overrides
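
The field limits above can be checked on the client before sending the request. The sketch below is one way to do that; the function name is hypothetical, and it assumes the server counts characters the same way Python's `len` does.

```python
ALLOWED_SOURCE_TYPES = {"documents", "urls", "raw_text"}


def build_create_project_body(name, description=None, source_type="documents", config=None):
    """Validate the documented field limits and return the request body."""
    if not 1 <= len(name) <= 200:
        raise ValueError("name must be 1-200 characters")
    if description is not None and len(description) > 2000:
        raise ValueError("description must be at most 2000 characters")
    if source_type not in ALLOWED_SOURCE_TYPES:
        raise ValueError(f"source_type must be one of {sorted(ALLOWED_SOURCE_TYPES)}")
    body = {"name": name, "source_type": source_type}
    # Optional fields are left out when not provided.
    if description is not None:
        body["description"] = description
    if config is not None:
        body["config"] = config
    return body
```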

Response 201 Created

Returns the full DataForgeProject object.


GET /api/v1/data-forge/projects

List all Data Forge projects for the authenticated user, scoped to the organization.

Response 200 OK

Returns a list of DataForgeProject objects.


GET /api/v1/data-forge/projects/:project_id

Get details of a specific Data Forge project.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Response 200 OK

Returns the full DataForgeProject object.


PUT /api/v1/data-forge/projects/:project_id

Update a Data Forge project.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Request Body

{
"name": "Updated Project Name",
"description": "Updated description",
"config": {}
}

Field | Type | Required | Description
name | string | No | Updated project name (1-200 characters)
description | string | No | Updated description (max 2000 characters)
config | object | No | Updated configuration

Response 200 OK

Returns the updated DataForgeProject object.


DELETE /api/v1/data-forge/projects/:project_id

Delete a Data Forge project and clean up associated S3 objects, documents, chunks, and pairs.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Response 200 OK

{
"message": "Project proj_abc123 deleted successfully",
"deleted": true
}

Documents

POST /api/v1/data-forge/projects/:project_id/upload-url

Get a presigned PUT URL for uploading a document directly to S3.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Request Body

{
"filename": "product-manual.pdf",
"content_type": "application/pdf",
"file_size": 2048576
}

Field | Type | Required | Description
filename | string | Yes | Name of the file to upload (1-500 characters)
content_type | string | No | MIME type (default: application/octet-stream)
file_size | integer | No | File size in bytes for validation

Response 200 OK

{
"upload_url": "https://s3.amazonaws.com/bucket/data-forge/...",
"s3_key": "data-forge/user_456/proj_abc123/sources/product-manual.pdf"
}

POST /api/v1/data-forge/projects/:project_id/documents

Register a document that has been uploaded to S3. Call this after the browser finishes uploading to the presigned URL.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Request Body

{
"filename": "product-manual.pdf",
"s3_key": "data-forge/user_456/proj_abc123/sources/product-manual.pdf",
"content_type": "application/pdf",
"file_size": 2048576,
"metadata": {}
}

Field | Type | Required | Description
filename | string | Yes | Name of the uploaded file (1-500 characters)
s3_key | string | Yes | S3 object key where the file was uploaded
content_type | string | No | MIME type (default: application/octet-stream)
file_size | integer | Yes | File size in bytes
metadata | object | No | Additional document metadata

Response 201 Created

{
"documentId": "doc_xyz789",
"projectId": "proj_abc123",
"name": "product-manual.pdf",
"mimeType": "application/pdf",
"fileSize": 2048576,
"s3Key": "data-forge/user_456/proj_abc123/sources/product-manual.pdf",
"parsingStatus": "pending",
"chunkCount": 0,
"createdAt": "2025-06-15T10:05:00Z"
}
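
Uploading a document therefore takes three steps: request a presigned URL, PUT the bytes to S3, then register the document. The sketch below wires those steps together. The `send` and `put_bytes` callables are hypothetical transport helpers supplied by the caller, and sending the same content type on the PUT as in the URL request is an assumption, not something this reference states.

```python
def upload_document(send, put_bytes, project_id, filename, content,
                    content_type="application/octet-stream"):
    """Run the three documented steps: request a URL, upload, register.

    send(method, path, body) -> decoded JSON response (hypothetical helper)
    put_bytes(url, data, content_type) -> None (hypothetical helper)
    """
    base = f"/api/v1/data-forge/projects/{project_id}"
    # Step 1: ask for a presigned PUT URL.
    ticket = send("POST", f"{base}/upload-url", {
        "filename": filename,
        "content_type": content_type,
        "file_size": len(content),
    })
    # Step 2: upload the bytes directly to S3.
    put_bytes(ticket["upload_url"], content, content_type)
    # Step 3: register the uploaded object with the project.
    return send("POST", f"{base}/documents", {
        "filename": filename,
        "s3_key": ticket["s3_key"],
        "content_type": content_type,
        "file_size": len(content),
    })
```

Passing the transport in as callables keeps the flow easy to test without network access and leaves the choice of HTTP client open.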

GET /api/v1/data-forge/projects/:project_id/documents

List all documents in a Data Forge project.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Response 200 OK

Returns a list of document objects.


DELETE /api/v1/data-forge/projects/:project_id/documents/:document_id

Delete a document from a project and remove it from S3.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID
document_id | string | Yes | Document ID

Response 200 OK

{
"message": "Document doc_xyz789 deleted successfully"
}

Chunks

GET /api/v1/data-forge/projects/:project_id/chunks

Get paginated chunks for a Data Forge project. Chunks are created when documents are parsed.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Query Parameters

Parameter | Type | Required | Description
page | integer | No | Page number, 1-based (default: 1)
page_size | integer | No | Items per page, 1-500 (default: 50)
document_id | string | No | Filter by document ID

Response 200 OK

{
"chunks": [
{
"chunkId": "chunk_001",
"projectId": "proj_abc123",
"documentId": "doc_xyz789",
"content": "To reset your password, navigate to Settings > Security...",
"heading": "Password Reset",
"pageNumber": 3,
"position": 12,
"contentType": "paragraph",
"topic": "account-management",
"difficulty": "easy",
"pairsGenerated": 3,
"createdAt": "2025-06-15T10:10:00Z"
}
],
"total": 340,
"page": 1,
"page_size": 50
}
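
Because the response reports `total` alongside `page` and `page_size`, a client can walk every page until it has seen `total` items. A minimal sketch, where `fetch_page` is a hypothetical callable that performs the GET request and returns the decoded JSON:

```python
def iter_chunks(fetch_page, page_size=50):
    """Yield every chunk, requesting pages until `total` items have been seen."""
    page = 1
    seen = 0
    while True:
        response = fetch_page(page, page_size)
        chunks = response["chunks"]
        for chunk in chunks:
            yield chunk
        seen += len(chunks)
        # Stop once everything is consumed; an empty page is a safety net
        # in case `total` changes while iterating.
        if not chunks or seen >= response["total"]:
            break
        page += 1
```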

PUT /api/v1/data-forge/projects/:project_id/chunks/:chunk_id

Update a specific chunk's content or metadata.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID
chunk_id | string | Yes | Chunk ID

Request Body

{
"content": "Updated chunk text content",
"metadata": {},
"excluded": false
}

Field | Type | Required | Description
content | string | No | Updated chunk text content
metadata | object | No | Updated chunk metadata
excluded | boolean | No | Whether to exclude this chunk from generation

Response 200 OK

Returns the updated chunk object.


Pipeline

POST /api/v1/data-forge/projects/:project_id/parse

Start a document parsing job. The job runs as a Kubernetes Job and parses the uploaded documents into text chunks.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Response 200 OK

{
"generation_id": "gen_parse_001",
"status": "pending",
"message": "Parse job started"
}

POST /api/v1/data-forge/projects/:project_id/generate

Start a data generation job. Uses an AI model to generate training pairs from parsed chunks.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Request Body

{
"model_id": "gpt-4o",
"generation_type": "qa",
"num_pairs": 500,
"temperature": 0.7,
"max_tokens": 4096,
"system_prompt": "Generate high-quality Q&A pairs from the provided context.",
"chunk_ids": null,
"config": {
"pairs_per_chunk": 3,
"difficulty_distribution": {"easy": 0.3, "medium": 0.5, "hard": 0.2},
"style_template": "mixed"
}
}

Field | Type | Required | Description
model_id | string | Yes | AI model ID to use as the teacher LLM
generation_type | string | No | Type of generation: qa, instruction, conversation, summary, classification (default: qa)
num_pairs | integer | No | Target number of pairs to generate (1-100,000)
temperature | float | No | Sampling temperature (0.0-2.0, default: 0.7)
max_tokens | integer | No | Max tokens per generation response (1-32,768)
system_prompt | string | No | Custom system prompt for the teacher LLM (max 10,000 characters)
chunk_ids | array | No | Specific chunk IDs to generate from (all chunks if omitted)
config | object | No | Additional generation configuration (merged into config)

Response 200 OK

{
"generationId": "gen_abc123",
"projectId": "proj_abc123",
"config": {
"model_id": "gpt-4o",
"generation_type": "qa",
"temperature": 0.7,
"pairs_per_chunk": 3
},
"status": "pending",
"progress": 0,
"createdAt": "2025-06-15T11:00:00Z"
}
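
The documented ranges for the generation request can be enforced before the job is submitted. The following sketch is illustrative only: the function name is hypothetical, and leaving unset optional fields out of the body (rather than sending null) is an assumption.

```python
GENERATION_TYPES = {"qa", "instruction", "conversation", "summary", "classification"}


def build_generate_body(model_id, generation_type="qa", num_pairs=None,
                        temperature=0.7, max_tokens=None, system_prompt=None,
                        chunk_ids=None, config=None):
    """Check the documented ranges and return the request body."""
    if not model_id:
        raise ValueError("model_id is required")
    if generation_type not in GENERATION_TYPES:
        raise ValueError(f"generation_type must be one of {sorted(GENERATION_TYPES)}")
    if num_pairs is not None and not 1 <= num_pairs <= 100_000:
        raise ValueError("num_pairs must be 1-100,000")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be 0.0-2.0")
    if max_tokens is not None and not 1 <= max_tokens <= 32_768:
        raise ValueError("max_tokens must be 1-32,768")
    if system_prompt is not None and len(system_prompt) > 10_000:
        raise ValueError("system_prompt must be at most 10,000 characters")
    body = {
        "model_id": model_id,
        "generation_type": generation_type,
        "temperature": temperature,
    }
    optional = {
        "num_pairs": num_pairs,
        "max_tokens": max_tokens,
        "system_prompt": system_prompt,
        "chunk_ids": chunk_ids,
        "config": config,
    }
    # Only include optional fields the caller actually set.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body
```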

POST /api/v1/data-forge/generations/:generation_id/cancel

Cancel a running generation job.

Path Parameters

Parameter | Type | Required | Description
generation_id | string | Yes | Generation job ID

Response 200 OK

{
"generation_id": "gen_abc123",
"status": "cancelled",
"message": "Generation cancelled"
}

Generations

GET /api/v1/data-forge/projects/:project_id/generations

List all generation jobs for a project.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Response 200 OK

Returns a list of generation objects:

[
{
"generationId": "gen_abc123",
"projectId": "proj_abc123",
"config": {
"model_id": "gpt-4o",
"generation_type": "qa",
"temperature": 0.7
},
"status": "completed",
"progress": 100,
"jobName": "df-generate-gen_abc1",
"results": {
"chunks_processed": 340,
"pairs_generated": 1020,
"pairs_valid": 980,
"avg_quality_score": 0.87,
"tokens_used": 450000
},
"startedAt": "2025-06-15T11:00:30Z",
"completedAt": "2025-06-15T11:45:00Z",
"createdAt": "2025-06-15T11:00:00Z"
}
]

GET /api/v1/data-forge/generations/:generation_id

Get details of a specific generation job.

Path Parameters

Parameter | Type | Required | Description
generation_id | string | Yes | Generation job ID

Response 200 OK

Returns the full generation object.


GET /api/v1/data-forge/generations/:generation_id/logs

Get logs for a specific generation job. While a job is active, logs are streamed in real time.

Path Parameters

Parameter | Type | Required | Description
generation_id | string | Yes | Generation job ID

Query Parameters

Parameter | Type | Required | Description
tail | integer | No | Number of most recent log entries to return (1-10,000)

Response 200 OK

{
"generation_id": "gen_abc123",
"logs": [
{
"timestamp": "2025-06-15T11:00:30Z",
"level": "info",
"stage": "generation",
"message": "Starting generation for 340 chunks"
},
{
"timestamp": "2025-06-15T11:01:00Z",
"level": "info",
"stage": "generation",
"message": "Processing chunk 1/340: Password Reset"
},
{
"timestamp": "2025-06-15T11:01:05Z",
"level": "info",
"stage": "generation",
"message": "Generated 3 pairs for chunk 1 (avg quality: 0.92)"
}
]
}

Pairs

GET /api/v1/data-forge/projects/:project_id/pairs

Get paginated training pairs for a project with optional filtering.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Query Parameters

Parameter | Type | Required | Description
page | integer | No | Page number, 1-based (default: 1)
page_size | integer | No | Items per page, 1-500 (default: 50)
status | string | No | Filter by review status: pending, accepted, rejected, edited
generation_id | string | No | Filter by generation run ID
search | string | No | Search within pair input/output text
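
As an illustration of combining these filters, the sketch below builds the request path with its query string using the standard library's `urlencode`. The function name is hypothetical, and unset filters are simply left out of the query.

```python
from urllib.parse import urlencode

PAIR_STATUSES = {"pending", "accepted", "rejected", "edited"}


def pairs_path(project_id, page=1, page_size=50, status=None,
               generation_id=None, search=None):
    """Return the request path and query string for listing pairs."""
    if page < 1:
        raise ValueError("page is 1-based")
    if not 1 <= page_size <= 500:
        raise ValueError("page_size must be 1-500")
    if status is not None and status not in PAIR_STATUSES:
        raise ValueError(f"status must be one of {sorted(PAIR_STATUSES)}")
    params = {
        "page": page,
        "page_size": page_size,
        "status": status,
        "generation_id": generation_id,
        "search": search,
    }
    # Drop filters that were not set, then percent-encode the rest.
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"/api/v1/data-forge/projects/{project_id}/pairs?{query}"
```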

Response 200 OK

{
"pairs": [
{
"pairId": "pair_001",
"projectId": "proj_abc123",
"chunkId": "chunk_001",
"documentId": "doc_xyz789",
"generationId": "gen_abc123",
"type": "single_turn",
"systemPrompt": "You are a helpful customer support assistant.",
"messages": [
{"role": "system", "content": "You are a helpful customer support assistant."},
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, navigate to Settings > Security..."}
],
"question": "How do I reset my password?",
"answer": "To reset your password, navigate to Settings > Security...",
"qualityScore": 0.92,
"groundingScore": 0.95,
"complexityLevel": "easy",
"questionType": "how-to",
"isDuplicate": false,
"status": "accepted",
"reviewerNotes": null,
"createdAt": "2025-06-15T11:01:05Z"
}
],
"total": 1020,
"page": 1,
"page_size": 50
}

PUT /api/v1/data-forge/pairs/:pair_id

Update a specific training pair. Use this to edit content or change review status.

Path Parameters

Parameter | Type | Required | Description
pair_id | string | Yes | Pair ID

Request Body

{
"input_text": "Updated question text",
"output_text": "Updated answer text",
"status": "accepted",
"metadata": {
"reviewer_notes": "Good quality pair"
}
}

Field | Type | Required | Description
input_text | string | No | Updated input/question text
output_text | string | No | Updated output/answer text
status | string | No | Review status: pending, accepted, rejected, edited
metadata | object | No | Updated pair metadata

Response 200 OK

Returns the updated pair object.


POST /api/v1/data-forge/projects/:project_id/pairs/bulk-action

Perform a bulk action on training pairs. Specify either pair_ids for specific pairs or filters for criteria-based selection.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Request Body

{
"action": "accept",
"pair_ids": ["pair_001", "pair_002", "pair_003"],
"filters": null
}

Field | Type | Required | Description
action | string | Yes | Action to perform: accept, reject, delete, reset
pair_ids | array | No | Specific pair IDs to act on
filters | object | No | Filter criteria for selecting pairs (alternative to pair_ids)

Response 200 OK

{
"action": "accept",
"affected_count": 3,
"message": "Bulk action completed"
}
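
Since the endpoint expects either pair_ids or filters, a client can reject ambiguous requests before sending them. This sketch reads "either ... or" strictly and refuses both-or-neither; how the server treats those cases is not stated here, so that strictness is an assumption, as is the function name.

```python
BULK_ACTIONS = {"accept", "reject", "delete", "reset"}


def build_bulk_action_body(action, pair_ids=None, filters=None):
    """Return a bulk-action body with exactly one selection method set."""
    if action not in BULK_ACTIONS:
        raise ValueError(f"action must be one of {sorted(BULK_ACTIONS)}")
    # Exactly one of pair_ids / filters must be provided.
    if (pair_ids is None) == (filters is None):
        raise ValueError("provide exactly one of pair_ids or filters")
    return {"action": action, "pair_ids": pair_ids, "filters": filters}
```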

Export

POST /api/v1/data-forge/projects/:project_id/export

Export the dataset as a JSONL file in the specified format. Only accepted pairs are included.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Request Body

{
"format": "chatml",
"include_system_prompt": true,
"min_quality_score": 0.8
}

Field | Type | Required | Description
format | string | No | Export format: chatml or alpaca (default: chatml)
include_system_prompt | boolean | No | Include system prompt in ChatML output (default: true)
min_quality_score | float | No | Minimum quality score filter for pairs (0.0-1.0)

Export Formats

ChatML — OpenAI-compatible chat format:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Alpaca — Instruction-following format:

{"instruction": "...", "input": "", "output": "..."}
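
To make the two record shapes concrete, the sketch below builds one record of each kind from a question and answer and serializes records as JSONL. It is not the exporter's implementation: mapping the question to `instruction` and omitting the system message when no system prompt is given are assumptions.

```python
import json


def to_chatml(question, answer, system_prompt=None):
    """Build one ChatML-style record; the system message is optional."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


def to_alpaca(question, answer):
    """Build one Alpaca-style record with an empty `input` field."""
    return {"instruction": question, "input": "", "output": answer}


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)
```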

Response 200 OK

{
"version": 3,
"format": "chatml",
"s3Key": "data-forge/user_456/proj_abc123/exports/v3/dataset.jsonl",
"pairCount": 850,
"fileSize": 4521890,
"downloadUrl": "https://s3.amazonaws.com/bucket/data-forge/...",
"includeSystemPrompt": true,
"minQualityScore": 0.8
}

GET /api/v1/data-forge/projects/:project_id/export/:version

Get a download URL for a specific export version.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID
version | integer | Yes | Export version number

Response 200 OK

Returns the export object with a presigned download URL.


Analytics

GET /api/v1/data-forge/projects/:project_id/analytics

Get analytics and statistics for a Data Forge project.

Path Parameters

Parameter | Type | Required | Description
project_id | string | Yes | Project ID

Response 200 OK

{
"projectId": "proj_abc123",
"projectName": "Customer Support FAQ",
"documents": {
"total": 12,
"parsed": 12,
"failed": 0
},
"chunks": {
"total": 340,
"by_type": {"paragraph": 280, "heading": 60}
},
"pairs": {
"total": 1020,
"accepted": 850,
"rejected": 45,
"pending": 125,
"avg_quality_score": 0.87
},
"qualityDistribution": [
{"range": "0.9-1.0", "count": 420},
{"range": "0.8-0.9", "count": 380},
{"range": "0.7-0.8", "count": 150},
{"range": "0.0-0.7", "count": 70}
],
"generations": {
"total": 3,
"completed": 2,
"running": 1,
"failed": 0
},
"exportsCount": 2
}
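
The counts in this response lend themselves to simple derived metrics. As a sketch, the hypothetical helper below computes the acceptance rate and the fraction of pairs already reviewed; for the sample above that is 850 / 1020 ≈ 0.83 accepted and (850 + 45) / 1020 ≈ 0.88 reviewed.

```python
def review_summary(analytics):
    """Derive review ratios from the `pairs` block of an analytics response."""
    pairs = analytics["pairs"]
    total = pairs["total"]
    if total == 0:
        # Avoid dividing by zero for projects with no generated pairs.
        return {"acceptance_rate": 0.0, "reviewed_fraction": 0.0}
    reviewed = pairs["accepted"] + pairs["rejected"]
    return {
        "acceptance_rate": pairs["accepted"] / total,
        "reviewed_fraction": reviewed / total,
    }
```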

GET /api/v1/data-forge/available-models

Get available AI models that can be used as teacher models for data generation.

Response 200 OK

[
{
"id": "gpt-4o",
"name": "GPT-4o",
"provider": "openai"
},
{
"id": "claude-3-5-sonnet",
"name": "Claude 3.5 Sonnet",
"provider": "anthropic"
},
{
"id": "llama-3.1-70b",
"name": "Llama 3.1 70B",
"provider": "self-hosted"
}
]

Generation Status Values

Status | Description
pending | Job created, waiting to start
parsing | Parsing documents into chunks
generating | Generating Q&A pairs from chunks
validating | Running quality validation and deduplication
completed | Job finished successfully
failed | Error occurred (check logs)
cancelled | Stopped by user
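
A client waiting on a job can poll the generation until it reaches a status it will not leave. The sketch below treats completed, failed, and cancelled as terminal, which is a reading of the table above rather than a documented guarantee. `get_generation` is a hypothetical callable that fetches the generation object, and the interval and timeout defaults are arbitrary.

```python
import time

# Assumed terminal statuses, based on the descriptions in the status table.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_generation(get_generation, generation_id, interval=5.0,
                        timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout passes."""
    deadline = clock() + timeout
    while True:
        generation = get_generation(generation_id)
        if generation["status"] in TERMINAL_STATUSES:
            return generation
        if clock() >= deadline:
            raise TimeoutError(f"generation {generation_id} did not finish in time")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.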

Pair Review Status Values

Status | Description
pending | Not yet reviewed
accepted | Approved for export
rejected | Excluded from export
edited | Modified by reviewer, approved for export