Doc Splitter

Last Update: 05/02/2025

Overview

Doc Splitter allows users to divide a large QA dataset into smaller, manageable files. This is especially useful for distributed processing, modular testing, or collaborative annotation and review.

Use Case

Split a large QA file into smaller chunks for:

  • Parallel model training or evaluation.
  • Sharing subsets with different teams or collaborators.
  • Managing memory and processing efficiency during development.

Functionality Overview

Input

  • Upload a QA dataset in .jsonl format.
  • Supported formats: DocSynth Single-Turn, DocSynth Multi-Turn, QA Format, OpenAI Format.

Configuration Options

  • Number of Output Files: Specify how many smaller files to generate.
  • Maximum QA Pairs per File: Alternatively, define the maximum number of QA pairs allowed in each file.

Processing

  • The tool automatically and evenly distributes QA pairs across the output files.
  • The original format of the QA pairs is preserved in all splits.
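
As a rough illustration of how the two configuration options could translate into file sizes, the sketch below computes how many QA pairs land in each output file. The function name chunk_sizes and its parameters are hypothetical, and the handling of remainders (the first files receive one extra pair) is an assumption; this document does not specify how Doc Splitter treats counts that do not divide evenly.

```python
from typing import List, Optional


def chunk_sizes(total: int,
                num_files: Optional[int] = None,
                max_per_file: Optional[int] = None) -> List[int]:
    """Return the number of QA pairs to place in each output file.

    Hypothetical helper: exactly one of num_files or max_per_file
    must be given, mirroring the two configuration options.
    """
    if (num_files is None) == (max_per_file is None):
        raise ValueError("set exactly one of num_files or max_per_file")
    if max_per_file is not None:
        if max_per_file < 1:
            raise ValueError("max_per_file must be at least 1")
        # Fewest files that keep every file at or under the cap
        # (integer ceiling division; at least one file).
        num_files = max(1, -(-total // max_per_file))
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    base, extra = divmod(total, num_files)
    # Assumption: the first `extra` files each get one additional pair.
    return [base + 1 if i < extra else base for i in range(num_files)]
```

Under these assumptions, chunk_sizes(100, num_files=4) yields four files of 25 pairs each, and a cap such as max_per_file=4 on 10 pairs yields three files that never exceed the cap.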

Output

  • Multiple .jsonl files, each containing a subset of the original data.
  • Files are named sequentially: split_1.jsonl, split_2.jsonl, etc.
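
For readers who want to reproduce a similar split locally, here is a minimal sketch of writing sequentially named .jsonl files. It assumes each non-empty line of the input holds one record (one QA pair or one conversation) and copies lines verbatim so the record format is untouched. The function split_jsonl and its parameters are hypothetical and are not part of Doc Splitter itself.

```python
from pathlib import Path
from typing import List


def split_jsonl(src: Path, out_dir: Path, num_files: int) -> List[Path]:
    """Split a .jsonl file into num_files files named split_1.jsonl, ..."""
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    # Keep each non-empty line verbatim; no JSON parsing is needed.
    with src.open(encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]

    out_dir.mkdir(parents=True, exist_ok=True)
    base, extra = divmod(len(lines), num_files)

    written: List[Path] = []
    start = 0
    for i in range(num_files):
        # Assumption: any remainder is spread over the first files.
        size = base + (1 if i < extra else 0)
        path = out_dir / f"split_{i + 1}.jsonl"
        with path.open("w", encoding="utf-8") as out:
            for line in lines[start:start + size]:
                out.write(line + "\n")
        written.append(path)
        start += size
    return written
```

A call such as split_jsonl(Path("qa.jsonl"), Path("splits"), 4) would write splits/split_1.jsonl through splits/split_4.jsonl, where qa.jsonl and splits are placeholder names.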

Example

Input: 100 QA pairs

Split Configuration: 4 files

Result:

  • split_1.jsonl: 25 QA pairs
  • split_2.jsonl: 25 QA pairs
  • split_3.jsonl: 25 QA pairs
  • split_4.jsonl: 25 QA pairs

All output files maintain the original QA format (e.g., DocSynth Multi-Turn).