Dataset Preparation
This first stage takes the timdettmers/openassistant-guanaco training split and rewrites each conversation into a strict ChatML format.
Why This Dataset
The rationale is straightforward:
- Reproducibility: the tutorial does not depend on private chat logs.
- Quality: Guanaco is a common baseline in QLoRA examples.
- Integration speed: the data is already close to a conversational format.
That choice matters because every later step assumes a single, stable representation of the conversation.
Why ChatML Is The Contract
The tutorial is built around a base model, not a chat model with a built-in prompt template. Because of that, the code has to mark speaker boundaries manually.
<|im_start|>user
What is a distributed system?<|im_end|>
<|im_start|>assistant
It is a collection of autonomous computers that work together...<|im_end|>
The same structure appears in preparation, tokenization, training, and inference. If one stage drifts away from this format, the pipeline stops being coherent.
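To make the contract concrete, here is a minimal sketch of a renderer that produces the layout shown above from a list of turns. The helper name `render_chatml` and the `(role, content)` tuple shape are assumptions made for illustration; they do not come from the tutorial's source files.

```python
# Hypothetical helper (not part of the tutorial's code): renders
# (role, content) pairs into the ChatML layout shown above.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(turns):
    """Return one ChatML block per turn, joined by newlines."""
    blocks = [
        f"{IM_START}{role}\n{content.strip()}{IM_END}"
        for role, content in turns
    ]
    return "\n".join(blocks)


example = render_chatml([
    ("user", "What is a distributed system?"),
    ("assistant", "It is a collection of autonomous computers that work together..."),
])
print(example)
```

Keeping the tag strings in one place like this is one way to stop later stages from drifting away from the format.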
Core Transformation
The key function in the source file is the format rewrite below.
def format_chatml(example):
    """
    Converts the original Human/Assistant format to ChatML.
    The 'timdettmers/openassistant-guanaco' dataset usually has a 'text' field
    with '### Human: ... ### Assistant: ...'
    """
    text = example['text']
    # Replace the markers with ChatML tags
    # Guanaco format is: ### Human: {prompt}### Assistant: {response}
    text = text.replace("### Human:", "<|im_start|>user\n")
    text = text.replace(
        "### Assistant:", "<|im_end|>\n<|im_start|>assistant\n")
    text += "<|im_end|>"
    return {"text": text}
This is intentionally simple. The tutorial does not add extra metadata, roles, or filtering logic. It only standardizes the text into the exact conversation shape needed later.
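One consequence of the plain string replacement is worth noting: the closing tag is emitted only before an assistant marker and once at the very end, so a record that holds more than one exchange would leave intermediate assistant turns unclosed, and the space that follows each marker's colon is carried into the content. If that matters for your data, a turn-aware variant could look like the sketch below. The function name `format_chatml_multiturn` is hypothetical and is not part of the tutorial's script; it is an alternative under stated assumptions, not the author's method.

```python
import re

# Assumption: turns are delimited by the literal markers "### Human:" and
# "### Assistant:", as described in the docstring of format_chatml.
MARKER = re.compile(r"### (Human|Assistant):")
ROLES = {"Human": "user", "Assistant": "assistant"}


def format_chatml_multiturn(example):
    """Hypothetical variant: close every turn, including intermediate ones."""
    # re.split with one capture group yields:
    # [preamble, marker_1, content_1, marker_2, content_2, ...]
    parts = MARKER.split(example["text"])
    turns = zip(parts[1::2], parts[2::2])
    # Any text before the first marker is discarded; a record without
    # markers therefore produces an empty string.
    blocks = [
        f"<|im_start|>{ROLES[marker]}\n{content.strip()}<|im_end|>"
        for marker, content in turns
    ]
    return {"text": "\n".join(blocks)}


sample = {"text": "### Human: A### Assistant: B### Human: C### Assistant: D"}
print(format_chatml_multiturn(sample)["text"])
```

The trade-off is a little more logic in exchange for balanced tags on every turn; the original one-pass replacement stays easier to read and audit.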
What The Script Does End To End
- Loads the train split from Hugging Face.
- Maps the format_chatml transform across the dataset.
- Prints the first three examples for visual inspection.
- Saves the result to the prepared dataset directory.
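The visual inspection step covers only the first three examples. As a complement, a programmatic check over the saved dataset could be sketched as follows. The helper names `is_well_formed` and `find_malformed` are assumptions for illustration; `load_from_disk` is the `datasets` function that reads back what `save_to_disk` wrote.

```python
def is_well_formed(text):
    """True when the text opens with a turn and every turn tag is closed."""
    return (
        text.startswith("<|im_start|>")
        and text.count("<|im_start|>") == text.count("<|im_end|>")
    )


def find_malformed(path):
    """Hypothetical helper: return the indices of records that fail the check."""
    # Imported lazily so the pure check above works without `datasets` installed.
    from datasets import load_from_disk

    dataset = load_from_disk(path)
    return [i for i, row in enumerate(dataset) if not is_well_formed(row["text"])]


print(is_well_formed("<|im_start|>user\nHi<|im_end|>"))
```

Calling `find_malformed` on the output directory after the script finishes would list any records that need a closer look.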
Run Command
python 1_Dataset/prepare_dataset.py
By default, the script writes to prepared_dataset_chatml, resolved relative to the repository root. The path can be overridden by setting the prepared_dataset_dir key in the [dataset] section of config.ini.
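The lookup can be illustrated in isolation. The sketch below mirrors the `config.get` call used in the script but reads the configuration from a string instead of a file; the function name `resolve_output_dir` and the value `my_chatml_data` are made up for the example.

```python
import configparser


def resolve_output_dir(ini_text):
    """Return the configured output directory, or the default when unset."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    # Same section, key, and fallback as the script's main() function.
    return config.get("dataset", "prepared_dataset_dir",
                      fallback="prepared_dataset_chatml")


# No [dataset] section at all: the fallback applies.
default_dir = resolve_output_dir("")

# Hypothetical override, as it might appear in config.ini.
custom_dir = resolve_output_dir("[dataset]\nprepared_dataset_dir = my_chatml_data\n")

print(default_dir)
print(custom_dir)
```

Because `fallback` covers both a missing section and a missing key, an empty or absent config.ini leaves the default in place.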
Repository References
- Dataset script on GitHub: runnable source file for this stage.
- Dataset stage README: stage-level execution notes.
Reference Implementation
The excerpt above covers the core rewrite. The full dataset preparation script, including the preview and save-to-disk steps, follows.
import configparser
import os

from datasets import load_dataset

BASE_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
CONFIG_PATH = os.path.join(BASE_DIR, "config.ini")

config = configparser.ConfigParser()
config.read(CONFIG_PATH)


def format_chatml(example):
    """
    Converts the original Human/Assistant format to ChatML.
    The 'timdettmers/openassistant-guanaco' dataset usually has a 'text' field
    with '### Human: ... ### Assistant: ...'
    """
    text = example['text']
    # Replace the markers with ChatML tags
    # Guanaco format is: ### Human: {prompt}### Assistant: {response}
    text = text.replace("### Human:", "<|im_start|>user\n")
    text = text.replace(
        "### Assistant:", "<|im_end|>\n<|im_start|>assistant\n")
    text += "<|im_end|>"
    return {"text": text}


def preview_dataset(dataset, num_samples=3):
    """Prints the first few examples of the dataset for visual inspection."""
    print(f"\n--- Previewing {num_samples} samples from the dataset ---\n")
    for i in range(num_samples):
        print(f"--- Example {i+1} ---")
        print(dataset[i]['text'])
        print("-" * 20 + "\n")


def main():
    output_dir = config.get("dataset", "prepared_dataset_dir",
                            fallback="prepared_dataset_chatml")
    output_path = os.path.join(BASE_DIR, output_dir)

    print("Loading 'timdettmers/openassistant-guanaco' dataset from Hugging Face...")
    # Load the training split
    dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

    print("🛠️ Transforming dataset to ChatML format...")
    # Apply the formatting function
    dataset = dataset.map(format_chatml)

    # Visual inspection
    preview_dataset(dataset)

    print(f"💾 Saving prepared dataset to {output_path}...")
    dataset.save_to_disk(output_path)

    print("✅ Dataset preparation complete. 100% of examples are now in ChatML format.")
    print(f"Saved to: {output_path}")


if __name__ == "__main__":
    main()

Continue with Tokenization & ChatML.