Doc Splitter
Last Update: 05/02/2025
Overview
Doc Splitter allows users to divide a large QA dataset into smaller, manageable files. This is especially useful for distributed processing, modular testing, or collaborative annotation and review.
Use Case
Split a large QA file into smaller chunks for:
- Parallel model training or evaluation.
- Sharing subsets with different teams or collaborators.
- Managing memory and processing efficiency during development.
Functionality Overview
Input
- Upload a QA dataset in .jsonl format.
- Supported formats: DocSynth Single-Turn, Multi-Turn, QA Format, OpenAI Format.
Configuration Options
- Number of Output Files: Specify how many smaller files to generate.
- Maximum QA Pairs per File: Alternatively, define the maximum number of QA pairs allowed in each file.
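The two options are related by simple arithmetic. As a rough sketch, a per-file cap can be converted into a file count with a ceiling division. The function name `files_needed` is hypothetical, and how Doc Splitter itself resolves this setting is not documented here, so treat this as an assumption rather than the tool's actual behavior.

```python
def files_needed(total_pairs: int, max_pairs_per_file: int) -> int:
    """Return the smallest number of files such that no file exceeds the cap.

    Hypothetical helper for illustration; not part of Doc Splitter.
    """
    if total_pairs < 0:
        raise ValueError("total_pairs must not be negative")
    if max_pairs_per_file < 1:
        raise ValueError("max_pairs_per_file must be at least 1")
    # Ceiling division using integers only, to avoid floating-point rounding.
    return (total_pairs + max_pairs_per_file - 1) // max_pairs_per_file
```

For instance, 100 QA pairs with a cap of 30 pairs per file would need 4 files under this calculation.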
Processing
- The tool automatically and evenly distributes QA pairs across the output files.
- The original format of the QA pairs is preserved in all splits.
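One way to picture an even distribution is the sketch below. It assumes the records are split into contiguous chunks in their original order, with any remainder going to the first few chunks. The tool's exact strategy (contiguous or round-robin, and where the remainder lands) is not stated, and `split_evenly` is a hypothetical name used only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(records: Sequence[T], num_files: int) -> List[List[T]]:
    """Split records into num_files contiguous chunks whose sizes differ by at most one.

    Illustrative sketch only; Doc Splitter's real algorithm may differ.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    base, extra = divmod(len(records), num_files)
    chunks: List[List[T]] = []
    start = 0
    for index in range(num_files):
        # The first `extra` chunks each take one additional record.
        size = base + (1 if index < extra else 0)
        chunks.append(list(records[start:start + size]))
        start += size
    return chunks
```

Because each record is passed through untouched, whatever format the QA pairs arrived in is the format they leave in.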
Output
- Multiple .jsonl files, each containing a subset of the original data.
- Files are named sequentially: split_1.jsonl, split_2.jsonl, etc.
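The naming scheme can be reproduced locally with a short script. The sketch below assumes each QA pair has already been parsed into a Python dictionary and that the chunks are held in a list of lists. `write_splits` and its parameters are hypothetical names, not part of Doc Splitter.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_splits(chunks: Iterable[Iterable[dict]], out_dir: str) -> List[Path]:
    """Write each chunk to split_1.jsonl, split_2.jsonl, ... inside out_dir.

    Illustrative sketch only; returns the paths of the files written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, chunk in enumerate(chunks, start=1):
        target = out_path / f"split_{index}.jsonl"
        with target.open("w", encoding="utf-8") as handle:
            for record in chunk:
                # One JSON object per line, as the .jsonl format requires.
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(target)
    return written
```

Note that this re-serializes each record, so fields and values are kept but the original whitespace within a line may not be.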
Example
Input: 100 QA pairs
Split Configuration: 4 files
Result:
- split_1.jsonl: ~25 QA pairs
- split_2.jsonl: ~25 QA pairs
- split_3.jsonl: ~25 QA pairs
- split_4.jsonl: ~25 QA pairs
All output files maintain the original QA format (e.g., DocSynth Multi-Turn).