Doc Merge

Last Update: 05/02/2025

Overview

Doc Merge combines multiple QA files into a single JSONL file for streamlined processing, analysis, or AI model training.

Use Case

Combine QA datasets generated from various runs of the DocQA Generator or SynthQA Generator, or sourced from different contributors, into one unified dataset to simplify model input and training workflows.

Functionality Overview

Input

Upload multiple QA files in any supported format:

DocSynth Single-Turn
QA Format
OpenAI Format

Accepted file types: .jsonl
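
Because the three formats can arrive in the same upload, each record has to be identified before it can be normalized. The following is a minimal sketch of such a check, not the tool's actual implementation. The function name detect_format and the returned labels are hypothetical, and the "messages" key for OpenAI Format is an assumption based on the common OpenAI chat layout, since this page does not show that schema.

```python
import json

def detect_format(record: dict) -> str:
    """Guess which supported format a parsed JSONL record uses."""
    if "question" in record and "answer" in record:
        return "qa"
    if "conversations" in record:
        return "docsynth_single_turn"
    if "messages" in record:  # assumed OpenAI chat layout, not confirmed by this page
        return "openai"
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")

# One line of a .jsonl file is one JSON object.
line = '{"question": "What is AI?", "answer": "AI is..."}'
print(detect_format(json.loads(line)))  # prints "qa"
```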

Processing

If the uploaded files use different formats, the tool automatically normalizes them to a common structure.
All QA pairs are appended into a single JSONL file.
An optional target format can be selected so that every file is converted to the same structure before merging (e.g., convert all files to OpenAI Format).
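
The processing steps above can be sketched roughly as follows: reduce every record to a question/answer pair, re-emit it in the target format, and append it to one list of JSONL lines. This is an illustration under stated assumptions, not the tool's own code. The function names, the format labels, the default target of "qa", and the "messages"/"role"/"content" layout for OpenAI Format are all assumptions.

```python
import json

def to_qa(record):
    """Reduce a record in any supported format to a (question, answer) tuple."""
    if "question" in record:
        return record["question"], record["answer"]
    if "conversations" in record:
        turns = {t["from"]: t["value"] for t in record["conversations"]}
        return turns["human"], turns["assistant"]
    if "messages" in record:  # assumed OpenAI chat layout
        turns = {m["role"]: m["content"] for m in record["messages"]}
        return turns["user"], turns["assistant"]
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")

def from_qa(question, answer, target):
    """Build a record in the requested target format from one QA pair."""
    if target == "qa":
        return {"question": question, "answer": answer}
    if target == "docsynth_single_turn":
        return {"conversations": [
            {"from": "human", "value": question},
            {"from": "assistant", "value": answer},
        ]}
    if target == "openai":
        return {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
    raise ValueError(f"Unknown target format: {target}")

def merge_records(sources, target="qa"):
    """Merge several sequences of JSONL lines into one list of JSONL lines."""
    merged = []
    for lines in sources:
        for line in lines:
            if line.strip():  # skip blank lines
                question, answer = to_qa(json.loads(line))
                merged.append(json.dumps(from_qa(question, answer, target),
                                         ensure_ascii=False))
    return merged

file_1 = ['{"question": "What is AI?", "answer": "AI is..."}']
file_2 = ['{"conversations": [{"from": "human", "value": "Why use AI?"}, '
          '{"from": "assistant", "value": "AI improves efficiency..."}]}']
for merged_line in merge_records([file_1, file_2], target="openai"):
    print(merged_line)
```

Routing every record through a single intermediate question/answer pair keeps the number of converters at two per format rather than one per pair of formats.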

Output

A single JSONL file in which each line contains one standalone QA pair (or message object, depending on the format), with the same structure on every line regardless of which input file it came from.
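
One way to confirm that a merged file meets this description is to parse every line and compare the top-level keys. The sketch below is only an illustrative check; the helper names record_shapes and is_consistent are hypothetical and not part of Doc Merge.

```python
import json

def record_shapes(lines):
    """Return the distinct sets of top-level keys found across JSONL lines."""
    shapes = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)  # raises ValueError on a malformed line
        if not isinstance(record, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        shapes.add(frozenset(record))
    return shapes

def is_consistent(lines):
    """True when every record shares the same top-level keys."""
    return len(record_shapes(lines)) <= 1

merged = [
    '{"question": "What is AI?", "answer": "AI is..."}',
    '{"question": "Why use AI?", "answer": "AI improves efficiency..."}',
]
print(is_consistent(merged))  # prints "True"
```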

Example: Merging Files

The input records below are pretty-printed for readability; in an actual .jsonl file, each record occupies a single line.

File 1 (QA Format)

{
  "question": "What is AI?",
  "answer": "AI is..."
}

File 2 (DocSynth Single-Turn Format)

{
  "conversations": [
    {"from": "human", "value": "Why use AI?"},
    {"from": "assistant", "value": "AI improves efficiency..."}
  ]
}

Merged Output (QA Format - JSONL)

{"question": "What is AI?", "answer": "AI is..."}
{"question": "Why use AI?", "answer": "AI improves efficiency..."}
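
For readers who want to reproduce this example on disk, the following sketch writes the two input records to temporary .jsonl files and merges them into QA Format. It only handles the two formats used in this example, and the file names and the merge_files helper are hypothetical; the actual tool may work differently.

```python
import json
import tempfile
from pathlib import Path

def to_qa(record):
    """Convert a QA Format or DocSynth Single-Turn record to QA Format."""
    if "conversations" in record:
        turns = {t["from"]: t["value"] for t in record["conversations"]}
        return {"question": turns["human"], "answer": turns["assistant"]}
    return {"question": record["question"], "answer": record["answer"]}

def merge_files(input_paths, output_path):
    """Append every record from the inputs to output_path; return the count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if line.strip():  # skip blank lines
                        record = to_qa(json.loads(line))
                        out.write(json.dumps(record, ensure_ascii=False) + "\n")
                        count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    file_1 = folder / "file1.jsonl"
    file_2 = folder / "file2.jsonl"
    file_1.write_text('{"question": "What is AI?", "answer": "AI is..."}\n',
                      encoding="utf-8")
    file_2.write_text(json.dumps({"conversations": [
        {"from": "human", "value": "Why use AI?"},
        {"from": "assistant", "value": "AI improves efficiency..."},
    ]}) + "\n", encoding="utf-8")
    merged_count = merge_files([file_1, file_2], folder / "merged.jsonl")
    merged_text = (folder / "merged.jsonl").read_text(encoding="utf-8")

print(merged_text, end="")
```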

This tool is especially useful for building large, diverse QA datasets from modular sources while maintaining format consistency.