Doc Merge
Last Update: 05/02/2025
Overview
Doc Merge allows users to merge multiple QA files into a single, consolidated JSONL file for streamlined processing, analysis, or AI model training.
Use Case
Combine QA datasets generated from various runs of the DocQA Generator or SynthQA Generator, or sourced from different contributors, into one unified dataset to simplify model input and training workflows.
Functionality Overview
Input
Upload multiple QA files in any supported format:
- DocSynth Single-Turn
- QA Format
- OpenAI Format
Accepted file types: .jsonl
Processing
- The tool automatically normalizes differing input formats when needed.
- All QA pairs are appended to a single JSONL file.
- An optional target format can be selected to ensure uniform structure before merging (e.g., convert all files to OpenAI Format before merging).
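The processing steps above can be sketched in Python. This is only an illustration, not the tool's actual implementation: the names `to_qa` and `merge_lines` are hypothetical, the field names are taken from the example files on this page, and QA Format is assumed as the target format. OpenAI Format input is not handled here because its layout is not shown on this page.

```python
import json


def to_qa(record):
    """Normalize one record to QA Format (hypothetical helper)."""
    if "question" in record and "answer" in record:
        # Already QA Format: keep only the two expected fields.
        return {"question": record["question"], "answer": record["answer"]}
    if "conversations" in record:
        # DocSynth Single-Turn: one human turn followed by one assistant turn.
        turns = record["conversations"]
        question = next(t["value"] for t in turns if t["from"] == "human")
        answer = next(t["value"] for t in turns if t["from"] == "assistant")
        return {"question": question, "answer": answer}
    raise ValueError(f"unrecognized record layout: {sorted(record)}")


def merge_lines(*sources):
    """Merge several iterables of JSONL lines into one list of QA Format lines."""
    merged = []
    for lines in sources:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            merged.append(json.dumps(to_qa(json.loads(line))))
    return merged
```

Normalizing each record before appending it keeps the merged file uniform even when the inputs arrive in different formats.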
Output
A single JSONL file where each line contains a standalone QA pair (or message object, depending on the format), formatted consistently across merged inputs.
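One way to confirm that a merged file meets this description is to parse every line and compare the top-level keys. The sketch below is an assumption-laden illustration rather than part of the tool; `record_shapes` is a hypothetical helper name.

```python
import json


def record_shapes(lines):
    """Return the set of distinct top-level key tuples found in JSONL lines."""
    shapes = set()
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number} is not valid JSON: {exc}") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {number} is not a JSON object")
        shapes.add(tuple(sorted(record)))
    return shapes
```

A consistently formatted file yields exactly one shape; more than one suggests that some inputs were not normalized before merging.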
Example: Merging Files
File 1 (QA Format)
{
"question": "What is AI?",
"answer": "AI is..."
}
File 2 (DocSynth Single-Turn Format)
{
"conversations": [
{"from": "human", "value": "Why use AI?"},
{"from": "assistant", "value": "AI improves efficiency..."}
]
}
Merged Output (QA Format - JSONL)
{"question": "What is AI?", "answer": "AI is..."}
{"question": "Why use AI?", "answer": "AI improves efficiency..."}
This tool is especially useful for building large, diverse QA datasets from modular sources while maintaining format consistency.