Doc Validation

Last Update: 05/02/2025

Overview

Doc Validation enables users to split a QA dataset into training and validation sets, a critical step in preparing data for machine learning model development and evaluation.

Use Case

Prepare a dataset for supervised learning by dividing it into two subsets:

A training set (e.g., 80%) used to train the model.
A validation set (e.g., 20%) used to assess the model's performance on unseen data.

Evaluating on held-out data gives a fairer estimate of model performance and helps detect overfitting.

Functionality Overview

Input

Upload a QA file in .jsonl format (one JSON record per line).
Supported QA schemas: QA Format, DocSynth Single-Turn, Multi-Turn, OpenAI Format.

Configuration Options

Split Ratio: Define a ratio such as 80/20 or 70/30 for train/validation.
Fixed Count: Alternatively, specify a fixed number of QA pairs to include in the validation set.
Shuffle (Optional): Randomly shuffle the dataset before splitting to avoid order bias.
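
The options above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name split_records, its parameters, the seed argument, and the use of round() to turn a ratio into a count are all assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T],
    validation_fraction: Optional[float] = None,
    validation_count: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T]]:
    """Return (train, validation) lists. Hypothetical helper, for illustration only."""
    # Exactly one of the two sizing options must be given (Split Ratio or Fixed Count).
    if (validation_fraction is None) == (validation_count is None):
        raise ValueError("Specify exactly one of validation_fraction or validation_count")

    items = list(records)  # copy so the caller's data is never reordered
    if shuffle:
        # A seeded generator makes the shuffle reproducible (an assumption, not a documented option).
        random.Random(seed).shuffle(items)

    if validation_count is None:
        if not 0.0 <= validation_fraction <= 1.0:
            raise ValueError("validation_fraction must be between 0 and 1")
        # Rounding rule is assumed; the real tool may round differently.
        validation_count = round(len(items) * validation_fraction)

    if not 0 <= validation_count <= len(items):
        raise ValueError("validation_count must be between 0 and the dataset size")

    cut = len(items) - validation_count
    return items[:cut], items[cut:]
```

For example, split_records(pairs, validation_fraction=0.2, shuffle=True, seed=42) corresponds to an 80/20 split with shuffling, while split_records(pairs, validation_count=50) holds out a fixed 50 QA pairs. Without shuffling, the validation set is simply the tail of the file, which is why shuffling is worth enabling when the records are ordered by topic or source.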

Processing

The tool preserves the original format of the uploaded file.
It generates two output files corresponding to the defined split.

Output

Training File: e.g., train.jsonl – Contains the training portion of the QA pairs.
Validation File: e.g., validation.jsonl – Contains the QA pairs held out for evaluation (the validation share of the split ratio, or the fixed count if one was specified).
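
One way to picture how the format is preserved: if each line of the .jsonl file is treated as an opaque record and copied unchanged, the output files necessarily keep the input schema. The sketch below assumes exactly that; the function name split_jsonl_file and its parameters are hypothetical and do not describe the tool's internals.

```python
from typing import Tuple


def split_jsonl_file(
    input_path: str,
    train_path: str,
    validation_path: str,
    validation_count: int,
) -> Tuple[int, int]:
    """Copy JSONL records into two files without parsing or rewriting them.

    Returns (number of training records, number of validation records).
    Hypothetical helper, for illustration only.
    """
    with open(input_path, "r", encoding="utf-8") as source:
        # Keep each non-empty line verbatim; blank lines are skipped.
        records = [line.rstrip("\n") for line in source if line.strip()]

    if not 0 <= validation_count <= len(records):
        raise ValueError("validation_count must be between 0 and the number of records")

    cut = len(records) - validation_count
    for path, subset in ((train_path, records[:cut]), (validation_path, records[cut:])):
        with open(path, "w", encoding="utf-8") as target:
            target.writelines(record + "\n" for record in subset)

    return cut, validation_count
```

Because the records are never decoded, the same code works for any of the supported schemas, as long as the file holds one record per line.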

Example

Input File: 10 QA pairs

Split Ratio: 80/20

Result:

train.jsonl: 8 QA pairs
validation.jsonl: 2 QA pairs

Both files retain the original format (e.g., OpenAI format or QA format), ensuring compatibility with your training pipeline.
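
The arithmetic behind this example can be checked with a short sketch. The helper parse_split_ratio is hypothetical, and rounding the validation size with round() is an assumption; the tool may handle sizes that do not divide evenly in a different way.

```python
def parse_split_ratio(ratio: str) -> float:
    """Turn a ratio string such as "80/20" into the validation fraction (0.2).

    Hypothetical helper, for illustration only.
    """
    # Unpacking raises ValueError if the string does not have exactly two parts.
    train_part, validation_part = (float(part) for part in ratio.split("/"))
    total = train_part + validation_part
    if train_part < 0 or validation_part < 0 or total <= 0:
        raise ValueError("Ratio parts must be non-negative and sum to a positive number")
    return validation_part / total


total_pairs = 10
validation_size = round(total_pairs * parse_split_ratio("80/20"))  # assumed rounding rule
train_size = total_pairs - validation_size
print(train_size, validation_size)  # prints "8 2"
```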