Doc Validation

Last Update: 05/02/2025

Overview

Doc Validation enables users to split a QA dataset into training and validation sets, a critical step in preparing data for machine learning model development and evaluation.

Use Case

Prepare a dataset for supervised learning by dividing it into two subsets:

  • A training set (e.g., 80%) used to train the model.
  • A validation set (e.g., 20%) used to assess the model's performance on unseen data.

Evaluating on held-out data gives a fair estimate of how well the model generalizes and helps detect overfitting.

Functionality Overview

Input

  • Upload a QA file in .jsonl format.
  • Supported record formats: QA Format, DocSynth Single-Turn, Multi-Turn, OpenAI Format.

Configuration Options

  • Split Ratio: Define a ratio such as 80/20 or 70/30 for train/validation.
  • Fixed Count: Alternatively, specify a fixed number of QA pairs to include in the validation set.
  • Shuffle (Optional): Randomly shuffle the dataset before splitting to avoid order bias.
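
The three options above can be pictured with a minimal Python sketch. This is not the tool's implementation: the function name `split_records`, its parameters, and the rounding rule (Python's built-in `round`) are assumptions made for illustration, and the tool may size the validation set differently when the ratio does not divide the dataset evenly.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T],
    val_ratio: Optional[float] = None,
    val_count: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T]]:
    """Return (train, validation) lists. Hypothetical helper, not the tool's API."""
    # Exactly one sizing option is expected: a ratio or a fixed count.
    if (val_ratio is None) == (val_count is None):
        raise ValueError("specify exactly one of val_ratio or val_count")

    items = list(records)
    if shuffle:
        # A seeded generator makes the shuffle reproducible.
        random.Random(seed).shuffle(items)

    if val_ratio is not None:
        if not 0.0 < val_ratio < 1.0:
            raise ValueError("val_ratio must be strictly between 0 and 1")
        # Assumed rounding rule: Python's built-in round().
        n_val = round(len(items) * val_ratio)
    else:
        if not 0 <= val_count <= len(items):
            raise ValueError("val_count must be between 0 and len(records)")
        n_val = val_count

    # The validation set is taken from the end of the (optionally shuffled) list.
    n_train = len(items) - n_val
    return items[:n_train], items[n_train:]
```

With 10 records and `val_ratio=0.2`, this sketch returns 8 training records and 2 validation records. Without shuffling, the validation records are simply the last ones in the file, which is why shuffling matters when the file is ordered by topic or source.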

Processing

  • The tool preserves the original format of the uploaded file.
  • It generates two output files, one for each part of the defined split.

Output

  • Training File: e.g., train.jsonl – Contains the specified portion of QA pairs for training.
  • Validation File: e.g., validation.jsonl – Contains the remainder of QA pairs for evaluation.
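
One way to keep the original format intact is to treat each JSONL line as an opaque record and copy it unchanged into one of the two outputs. The sketch below assumes that approach. The function name `write_split` and its parameters are hypothetical, and the tool itself may work differently.

```python
from typing import Tuple


def write_split(
    input_path: str,
    n_val: int,
    train_path: str = "train.jsonl",
    val_path: str = "validation.jsonl",
) -> Tuple[int, int]:
    """Copy the last n_val records to val_path and the rest to train_path.

    Hypothetical helper: lines are copied verbatim, so whatever record
    format the input uses is carried over to both outputs.
    """
    # Read non-empty lines without parsing them, so no field is altered.
    with open(input_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]

    if not 0 <= n_val <= len(lines):
        raise ValueError("n_val must be between 0 and the number of records")

    n_train = len(lines) - n_val
    train, val = lines[:n_train], lines[n_train:]

    # Write one record per line, each terminated by a newline.
    with open(train_path, "w", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in train)
    with open(val_path, "w", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in val)

    return len(train), len(val)
```

Because the lines are never parsed or re-serialized, key order, whitespace, and any extra fields in each record survive the split unchanged.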

Example

Input File: 10 QA pairs

Split Ratio: 80/20

Result:

  • train.jsonl: 8 QA pairs
  • validation.jsonl: 2 QA pairs

Both files retain the original format (e.g., OpenAI format or QA format), ensuring compatibility with your training pipeline.
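
To check a result like this one before starting a training run, a short script can count the records in each output file and confirm that every line is valid JSON. `count_records` is a hypothetical helper written for this illustration and is not part of the tool.

```python
import json


def count_records(path: str) -> int:
    """Count non-empty lines in a JSONL file; raise if a line is not valid JSON."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            count += 1
    return count
```

For the example above, `count_records("train.jsonl")` should return 8 and `count_records("validation.jsonl")` should return 2.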