Doc Validation
Last Update: 05/02/2025
Overview
Doc Validation enables users to split a QA dataset into training and validation sets, a critical step in preparing data for machine learning model development and evaluation.
Use Case
Prepare a dataset for supervised learning by dividing it into two subsets:
- A training set (e.g., 80%) used to train the model.
- A validation set (e.g., 20%) used to assess the model's performance on unseen data.
Evaluating on data the model has not seen during training gives a fairer performance estimate and helps detect overfitting.
Functionality Overview
Input
- Upload a QA file in any supported format: .jsonl.
- Supported formats: QA Format, DocSynth Single-Turn, Multi-Turn, OpenAI Format.
Configuration Options
- Split Ratio: Define a ratio such as 80/20 or 70/30 for train/validation.
- Fixed Count: Alternatively, specify a fixed number of QA pairs to include in the validation set.
- Shuffle (Optional): Randomly shuffle the dataset before splitting to avoid order bias.
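As a rough illustration of how these options could interact, the Python sketch below splits an in-memory list of QA pairs. The function name `split_records`, its parameters, and the round-to-nearest rule for ratios are assumptions made for this example; they are not the tool's actual interface or documented behavior.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T],
    train_ratio: Optional[float] = None,
    validation_count: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: return (train, validation) lists."""
    # Exactly one of the two sizing options must be given.
    if (train_ratio is None) == (validation_count is None):
        raise ValueError("Specify exactly one of train_ratio or validation_count")

    items = list(records)
    if shuffle:
        # A seeded generator keeps the shuffle reproducible.
        random.Random(seed).shuffle(items)

    if validation_count is None:
        if not 0.0 < train_ratio < 1.0:
            raise ValueError("train_ratio must be between 0 and 1")
        # Assumption: the training size is rounded to the nearest integer.
        n_train = round(len(items) * train_ratio)
    else:
        if not 0 <= validation_count <= len(items):
            raise ValueError("validation_count is out of range")
        n_train = len(items) - validation_count

    return items[:n_train], items[n_train:]


# 80/20 split ratio on 10 placeholder QA pairs, no shuffle
pairs = [f"qa-{i}" for i in range(10)]
train, validation = split_records(pairs, train_ratio=0.8)
print(len(train), len(validation))  # prints "8 2"
```

Passing `validation_count=2` instead of `train_ratio=0.8` yields the same sizes here; the fixed count is mainly useful when the validation set should stay the same size as the dataset grows.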
Processing
- The tool preserves the original format of the uploaded file.
- Generates two output files corresponding to the defined split.
Output
- Training File (e.g., train.jsonl): Contains the specified portion of QA pairs for training.
- Validation File (e.g., validation.jsonl): Contains the remainder of QA pairs for evaluation.
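The Processing and Output steps can be sketched in Python as follows. This is a minimal sketch, assuming each QA pair occupies one line of the .jsonl file; `write_split` and its parameters are hypothetical names, not part of the tool, and the `question`/`answer` fields in the demo are placeholders. Lines are copied verbatim, which is one simple way to keep the original record format intact.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def write_split(input_path: str, out_dir: str, n_validation: int) -> Tuple[Path, Path]:
    """Hypothetical helper: copy JSONL lines into train.jsonl and validation.jsonl."""
    # Read non-empty lines without parsing them, so each record stays unchanged.
    with open(input_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]

    if not 0 <= n_validation <= len(lines):
        raise ValueError("n_validation is out of range")
    n_train = len(lines) - n_validation

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    train_path = out / "train.jsonl"
    validation_path = out / "validation.jsonl"

    # The first n_train lines go to training, the remainder to validation.
    train_path.write_text("".join(l + "\n" for l in lines[:n_train]), encoding="utf-8")
    validation_path.write_text("".join(l + "\n" for l in lines[n_train:]), encoding="utf-8")
    return train_path, validation_path


# Demo: 10 placeholder records written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "qa.jsonl"
    source.write_text(
        "".join(
            json.dumps({"question": f"q{i}", "answer": f"a{i}"}) + "\n"
            for i in range(10)
        ),
        encoding="utf-8",
    )
    train_path, validation_path = write_split(str(source), tmp, n_validation=2)
    train_lines = train_path.read_text(encoding="utf-8").splitlines()
    validation_lines = validation_path.read_text(encoding="utf-8").splitlines()
    print(len(train_lines), len(validation_lines))  # prints "8 2"
```

Because the lines are never re-serialized, a file in OpenAI format stays in OpenAI format and a file in QA format stays in QA format.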
Example
Input File: 10 QA pairs
Split Ratio: 80/20
Result:
- train.jsonl: 8 QA pairs
- validation.jsonl: 2 QA pairs
Both files retain the original format (e.g., OpenAI format or QA format), ensuring compatibility with your training pipeline.