Dataset Preparation

This first stage takes the timdettmers/openassistant-guanaco training split and rewrites each conversation into a strict ChatML format.

Why This Dataset

The rationale is straightforward:

Reproducibility: the tutorial does not depend on private chat logs.
Quality: Guanaco is a common baseline in QLoRA examples.
Integration speed: the data is already close to a conversational format.

That choice matters because every later step assumes a single, stable representation of the conversation.

Why ChatML Is The Contract

The tutorial is built around a base model, not a chat model with a built-in prompt template. Because of that, the code has to mark speaker boundaries manually: each turn opens with <|im_start|> and the role name on one line, and closes with <|im_end|>.

<|im_start|>user
What is a distributed system?<|im_end|>
<|im_start|>assistant
It is a collection of autonomous computers that work together...<|im_end|>

The same structure appears in preparation, tokenization, training, and inference. If one stage drifts away from this format, the pipeline stops being coherent.
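One way to keep later stages honest about this contract is a small structural check. The sketch below is not part of the tutorial's source; the names TURN_PATTERN and parse_chatml are hypothetical, and the sketch assumes only the two roles shown above (user and assistant).

```python
import re

# Hypothetical helper, not from the tutorial repository.
# Assumption: only the "user" and "assistant" roles are in use.
TURN_PATTERN = re.compile(
    r"<\|im_start\|>(user|assistant)\n(.*?)<\|im_end\|>\n?", re.DOTALL
)


def parse_chatml(text):
    """Return a list of (role, content) pairs.

    Raises ValueError if the text is not a clean sequence of ChatML turns.
    """
    turns = []
    position = 0
    for match in TURN_PATTERN.finditer(text):
        # A gap between consecutive matches means stray text outside a turn.
        if match.start() != position:
            raise ValueError(f"unexpected text at offset {position}")
        role, content = match.group(1), match.group(2)
        # An opening tag inside the content means the turn was never closed.
        if "<|im_start|>" in content:
            raise ValueError(f"unclosed {role} turn at offset {match.start()}")
        turns.append((role, content))
        position = match.end()
    # Trailing text, or no turns at all, also breaks the contract.
    if position != len(text) or not turns:
        raise ValueError(f"unexpected text at offset {position}")
    return turns
```

Calling parse_chatml on a prepared example returns the turns in order, and a record with a missing <|im_end|> raises ValueError instead of passing silently into tokenization.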

Core Transformation

The key function in the source file is the format rewrite below.

def format_chatml(example):
    """
    Converts the original Human/Assistant format to ChatML.
    The 'timdettmers/openassistant-guanaco' dataset usually has a 'text' field
    with '### Human: ... ### Assistant: ...'
    """
    text = example['text']
    # Replace the markers with ChatML tags
    # Guanaco format is: ### Human: {prompt}### Assistant: {response}
    text = text.replace("### Human:", "<|im_start|>user\n")
    text = text.replace(
        "### Assistant:", "<|im_end|>\n<|im_start|>assistant\n")
    text += "<|im_end|>"
    return {"text": text}

This is intentionally simple. The tutorial does not add extra metadata, roles, or filtering logic. It only standardizes the text into the exact conversation shape needed later.
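Two side effects of the plain string replacement are worth seeing on a concrete record. The space that follows each marker stays in the output, and if a record contains more than one Human/Assistant exchange, an assistant turn is not closed before the next user turn opens. The sketch below repeats format_chatml from the excerpt and adds a hypothetical variant, format_chatml_multiturn, which is an illustration and not the tutorial's code. It assumes that stripping whitespace around each turn and dropping any text before the first marker are acceptable.

```python
import re


def format_chatml(example):
    """Copy of the tutorial's rewrite, repeated here for comparison."""
    text = example["text"]
    text = text.replace("### Human:", "<|im_start|>user\n")
    text = text.replace("### Assistant:", "<|im_end|>\n<|im_start|>assistant\n")
    text += "<|im_end|>"
    return {"text": text}


# Everything below is a hypothetical variant, not part of the source file.
MARKER = re.compile(r"### (Human|Assistant):")
ROLES = {"Human": "user", "Assistant": "assistant"}


def format_chatml_multiturn(example):
    """Close every turn explicitly, including in multi-turn records."""
    # re.split with one capture group yields:
    # [preamble, label_1, content_1, label_2, content_2, ...]
    parts = MARKER.split(example["text"])
    turns = []
    for label, content in zip(parts[1::2], parts[2::2]):
        # Assumption: surrounding whitespace carries no meaning.
        turns.append(f"<|im_start|>{ROLES[label]}\n{content.strip()}<|im_end|>")
    # Assumption: text before the first marker (parts[0]) can be dropped.
    return {"text": "\n".join(turns)}


if __name__ == "__main__":
    record = {"text": "### Human: A### Assistant: B### Human: C### Assistant: D"}
    print(format_chatml(record)["text"])
    print("-" * 20)
    print(format_chatml_multiturn(record)["text"])
```

Running the file prints both outputs for the same two-exchange record, which makes the unclosed assistant turn in the first one easy to spot. Whether the difference matters depends on how later stages tokenize and mask the turns.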

What The Script Does End To End

Loads the train split from Hugging Face.
Maps the format_chatml transform across the dataset.
Prints the first three examples for visual inspection.
Saves the result to the prepared dataset directory.

Run Command

python 1_Dataset/prepare_dataset.py

By default, the script writes to prepared_dataset_chatml under the repository root. To change the location, set prepared_dataset_dir in the [dataset] section of config.ini; the script joins that value onto the repository root.
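The lookup can be sketched in isolation as follows. The section name, key name, and fallback value come from the script; the function name resolve_output_path, the example directory name, and reading the configuration from a string instead of a file are assumptions made to keep the example self-contained.

```python
import configparser
import os

# Hypothetical config.ini contents. Only the section and key names
# are taken from the script; the value is made up for illustration.
EXAMPLE_INI = """
[dataset]
prepared_dataset_dir = my_custom_dataset_dir
"""


def resolve_output_path(ini_text, base_dir):
    """Mirror the script's lookup: config value if present, else the default."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    # fallback= covers both a missing [dataset] section and a missing key.
    output_dir = config.get(
        "dataset", "prepared_dataset_dir", fallback="prepared_dataset_chatml"
    )
    return os.path.join(base_dir, output_dir)


if __name__ == "__main__":
    print(resolve_output_path(EXAMPLE_INI, "repo_root"))  # overridden path
    print(resolve_output_path("", "repo_root"))           # default path
```

Because the fallback applies when the section or key is absent, an empty or missing config.ini still produces the default directory.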

Repository References

Dataset script on GitHub: runnable source file for this stage.
Dataset stage README: stage-level execution notes.

Reference Implementation

Dataset preparation script: prepare_dataset.py

The excerpt above covers the core rewrite. The full implementation below adds the configuration lookup, the preview, and the save-to-disk steps.

import configparser
import os

from datasets import load_dataset


BASE_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
CONFIG_PATH = os.path.join(BASE_DIR, "config.ini")

config = configparser.ConfigParser()
config.read(CONFIG_PATH)


def format_chatml(example):
    """
    Converts the original Human/Assistant format to ChatML.
    The 'timdettmers/openassistant-guanaco' dataset usually has a 'text' field
    with '### Human: ... ### Assistant: ...'
    """
    text = example['text']
    # Replace the markers with ChatML tags
    # Guanaco format is: ### Human: {prompt}### Assistant: {response}
    text = text.replace("### Human:", "<|im_start|>user\n")
    text = text.replace(
        "### Assistant:", "<|im_end|>\n<|im_start|>assistant\n")
    text += "<|im_end|>"
    return {"text": text}


def preview_dataset(dataset, num_samples=3):
    """Prints the first few examples of the dataset for visual inspection."""
    print(f"\n--- Previewing {num_samples} samples from the dataset ---\n")
    for i in range(num_samples):
        print(f"--- Example {i+1} ---")
        print(dataset[i]['text'])
        print("-" * 20 + "\n")


def main():
    output_dir = config.get("dataset", "prepared_dataset_dir",
                            fallback="prepared_dataset_chatml")
    output_path = os.path.join(BASE_DIR, output_dir)

    print("🚀 Loading 'timdettmers/openassistant-guanaco' dataset from Hugging Face...")
    # Load the training split
    dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

    print("🛠️ Transforming dataset to ChatML format...")
    # Apply the formatting function
    dataset = dataset.map(format_chatml)

    # Visual inspection
    preview_dataset(dataset)

    print(f"💾 Saving prepared dataset to {output_path}...")
    dataset.save_to_disk(output_path)

    print("✅ Dataset preparation complete. 100% of examples are now in ChatML format.")
    print(f"📁 Saved to: {output_path}")


if __name__ == "__main__":
    main()

Continue with Tokenization & ChatML.