Doc Splitter

Last Update: 05/02/2025

Overview

Doc Splitter allows users to divide a large QA dataset into smaller, manageable files. This is especially useful for distributed processing, modular testing, or collaborative annotation and review.

Use Case

Split a large QA file into smaller chunks for:

Parallel model training or evaluation.
Sharing subsets with different teams or collaborators.
Managing memory and processing efficiency during development.

Functionality Overview

Input

Upload a QA dataset in .jsonl format.
Supported formats: DocSynth Single-Turn, DocSynth Multi-Turn, QA Format, OpenAI Format.
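
Before uploading, it can help to confirm that the file really is line-delimited JSON. The following is a minimal sketch of such a check, not part of Doc Splitter itself: the function name read_jsonl_records is hypothetical, and the check only verifies that each non-empty line parses as JSON, since the exact fields of each supported format are not described here.

```python
import json


def read_jsonl_records(path):
    """Return the non-empty lines of a .jsonl file, checking each is valid JSON.

    Hypothetical helper for illustration; it does not validate any
    format-specific fields, only that every line parses as JSON.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines between records
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"Line {line_number} is not valid JSON: {exc}"
                ) from exc
            # Keep the raw text so the record can later be written back unchanged.
            records.append(line)
    return records
```

Keeping each record as its original text, rather than as a parsed object, is a deliberate choice in this sketch: it makes it easy to write records back out without altering their format.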

Configuration Options

Number of Output Files: Specify how many smaller files to generate.
Maximum QA Pairs per File: Alternatively, define the maximum number of QA pairs allowed in each file.
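
The two options are alternative ways of arriving at a file count. As a rough sketch of how they might relate, assuming a per-file maximum is turned into a file count by rounding up (the tool's actual rule is not documented here, and resolve_num_files is a hypothetical name):

```python
import math


def resolve_num_files(total_pairs, num_files=None, max_pairs_per_file=None):
    """Return how many output files a split would produce (illustrative only).

    Exactly one of num_files or max_pairs_per_file must be given.
    """
    if (num_files is None) == (max_pairs_per_file is None):
        raise ValueError("Specify exactly one of num_files or max_pairs_per_file")
    if num_files is not None:
        if num_files < 1:
            raise ValueError("num_files must be at least 1")
        return num_files
    if max_pairs_per_file < 1:
        raise ValueError("max_pairs_per_file must be at least 1")
    # Assumption: round up so that no file exceeds the maximum.
    return max(1, math.ceil(total_pairs / max_pairs_per_file))
```

Under this assumption, 100 QA pairs with a maximum of 30 per file would need 4 files.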

Processing

The tool automatically and evenly distributes QA pairs across the output files.
The original format of the QA pairs is preserved in all splits.
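
One way to picture the even distribution is the sketch below. It assumes records are kept in their original order and that any remainder is spread one record at a time over the first files; Doc Splitter may handle the remainder differently. The name split_evenly is hypothetical.

```python
def split_evenly(records, num_files):
    """Split a list of records into num_files contiguous chunks of near-equal size.

    Illustrative only: chunk sizes differ by at most one, and the larger
    chunks come first. Records are passed through unchanged, which is how
    this sketch preserves the original format.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    base, extra = divmod(len(records), num_files)
    chunks = []
    start = 0
    for index in range(num_files):
        # The first `extra` chunks each take one additional record.
        size = base + (1 if index < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks
```

With 100 records and 4 files this yields four chunks of 25. With 102 records, the sizes under this assumption would be 26, 26, 25, and 25. If there are fewer records than files, the trailing chunks are empty.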

Output

Multiple .jsonl files, each containing a subset of the original data.

Files are named sequentially: split_1.jsonl, split_2.jsonl, etc.
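
The sequential naming can be reproduced locally with a few lines. This is a sketch under the assumption that each chunk is a list of raw JSON lines; write_splits and the output_dir parameter are hypothetical and are not part of the tool.

```python
from pathlib import Path


def write_splits(chunks, output_dir):
    """Write each chunk to split_1.jsonl, split_2.jsonl, ... and return the paths.

    Illustrative only: each chunk is expected to be a list of JSON strings,
    one per record, which are written back one per line.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunks, start=1):  # numbering starts at 1
        path = out / f"split_{index}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for record in chunk:
                f.write(record.rstrip("\n") + "\n")
        paths.append(path)
    return paths
```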

Example

Input: 100 QA pairs

Split Configuration: 4 files

Result:

split_1.jsonl: 25 QA pairs
split_2.jsonl: 25 QA pairs
split_3.jsonl: 25 QA pairs
split_4.jsonl: 25 QA pairs

All output files maintain the original QA format (e.g., DocSynth Multi-Turn).