Tokenization & ChatML
This stage turns prepared ChatML strings into tensors and makes a crucial training choice: only the assistant side should contribute to the loss.
Adding Special Tokens
Because the tutorial fine-tunes a base Mistral model, the tokenizer must learn that <|im_start|> and <|im_end|> are real structural tokens rather than plain text fragments.
special_tokens = {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
tokenizer.add_special_tokens(special_tokens)
added_special_tokens = tokenizer.special_tokens_map.get(
    "additional_special_tokens",
    special_tokens["additional_special_tokens"],
)
tokenizer.pad_token = "<|im_end|>"
tokenizer.padding_side = "right"
This is the bridge between the textual ChatML format from step 1 and the numerical representation used during training.
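To see why registration matters, here is a toy stand-in (pure Python, not the actual Hugging Face tokenizer): a marker that is not registered gets shattered by subword-style splitting, while a registered one survives as a single atomic token. The `naive_split` and `split_with_specials` helpers are illustrative inventions, not library APIs.

```python
import re

SPECIALS = ["<|im_start|>", "<|im_end|>"]

def naive_split(text):
    # Stand-in for subword tokenization: split on every non-word character.
    return [t for t in re.split(r"(\W)", text) if t.strip()]

def split_with_specials(text):
    # Protect registered markers so they survive as atomic tokens.
    pattern = "(" + "|".join(map(re.escape, SPECIALS)) + ")"
    out = []
    for chunk in re.split(pattern, text):
        out.extend([chunk] if chunk in SPECIALS else naive_split(chunk))
    return out

text = "<|im_start|>user\nHi<|im_end|>"
print(naive_split(text))          # the markers shatter into '<', '|', 'im_start', ...
print(split_with_specials(text))  # ['<|im_start|>', 'user', 'Hi', '<|im_end|>']
```

This is exactly what add_special_tokens guarantees for the real tokenizer: each ChatML marker maps to one token id instead of a run of fragments.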
Label Masking
The most important logic in this file is not the tokenization itself; it is the masking step that prevents the model from spending learning capacity on the user prompt.
def preprocess_function(example):
    """
    Applies label masking to a dataset that was already standardized to ChatML.
    Labels are set to -100 for the user prompt so the model only learns from assistant responses.
    """
    chatml_text = example["text"]
    if "<|im_start|>assistant\n" not in chatml_text:
        return {"input_ids": [], "attention_mask": [], "labels": []}
    tokenized = tokenizer(
        chatml_text,
        truncation=True,
        max_length=MAX_LENGTH,
        add_special_tokens=False
    )
    input_ids = list(tokenized["input_ids"])
    labels = list(input_ids)
    assistant_start_tag = tokenizer.encode(
        "<|im_start|>assistant\n", add_special_tokens=False)
    # The "+ 1" keeps the final alignment in range, so a tag ending at the last token is still found.
    for i in range(len(input_ids) - len(assistant_start_tag) + 1):
        if input_ids[i:i + len(assistant_start_tag)] == assistant_start_tag:
            for j in range(i + len(assistant_start_tag)):
                labels[j] = -100
            break
    return {
        "input_ids": input_ids,
        "attention_mask": tokenized["attention_mask"],
        "labels": labels
    }
In practice, that means the model is rewarded only for predicting the assistant answer, not for echoing the prompt structure it was given.
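The effect of the masking loop is easiest to see with made-up token ids (no tokenizer needed): here ids 1-3 stand in for the user turn, [7, 8] for the tokenized "<|im_start|>assistant\n" tag, and 9-11 for the assistant answer.

```python
# Toy walk-through of the masking idea with invented token ids.
input_ids = [1, 2, 3, 7, 8, 9, 10, 11]
assistant_start_tag = [7, 8]

labels = list(input_ids)
# "+ 1" so a tag that ends exactly at the last token is still matched.
for i in range(len(input_ids) - len(assistant_start_tag) + 1):
    if input_ids[i:i + len(assistant_start_tag)] == assistant_start_tag:
        # Everything up to and including the tag is excluded from the loss.
        for j in range(i + len(assistant_start_tag)):
            labels[j] = -100
        break

print(labels)  # -> [-100, -100, -100, -100, -100, 9, 10, 11]
```

Positions labeled -100 are ignored by the cross-entropy loss, so gradient signal comes only from the assistant tokens 9, 10, 11.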
Inputs And Outputs
- Input directory: prepared_dataset_chatml
- Output directory: tokenized_dataset_chatml
- Config dependency: model_name, max_length, and output paths are read from config.ini
As noted in the overview, the configuration file is not present in this documentation workspace snapshot, but it does exist in the public project repository and the source code depends on it.
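For orientation, the keys the script reads would look roughly like this in config.ini. The section and key names come from the source code below; the model name and max length values are illustrative placeholders, not the repository's actual settings.

```ini
[tokenizer]
model_name = mistralai/Mistral-7B-v0.1   ; placeholder; see the repository config.ini
max_length = 1024                        ; placeholder
output_dir = tokenized_dataset_chatml

[dataset]
prepared_dataset_dir = prepared_dataset_chatml
```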
Run Command
python 2_Tokenizer/tokenizer.py
Repository References
- Tokenizer script on GitHub: runnable source file for this stage.
- Tokenizer stage README: stage-level execution notes.
- Repository config.ini: model name, max length, and output paths.
Reference Implementation
Tokenization and masking script
Use the excerpts above for the key ideas: ChatML special tokens and assistant-only loss. Expand the panel for the complete tokenization flow and save step.
import configparser
import os
from datasets import load_from_disk
from transformers import AutoTokenizer
BASE_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
CONFIG_PATH = os.path.join(BASE_DIR, "config.ini")
# Load configuration
config = configparser.ConfigParser()
config.read(CONFIG_PATH)
try:
    model_name = config.get("tokenizer", "model_name")
    MAX_LENGTH = int(config.get("tokenizer", "max_length"))
    prepared_dataset_dir = config.get(
        "dataset", "prepared_dataset_dir", fallback="prepared_dataset_chatml")
    tokenized_dataset_dir = config.get(
        "tokenizer", "output_dir", fallback="tokenized_dataset_chatml")
except Exception as e:
    print(f"Error loading configuration: {e}")
    exit(1)
prepared_dataset_path = os.path.join(BASE_DIR, prepared_dataset_dir)
tokenized_dataset_path = os.path.join(BASE_DIR, tokenized_dataset_dir)
# 1. Load the Tokenizer
print(f"Loading tokenizer for {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 2. Configure Special Tokens (ChatML)
# We add <|im_start|> and <|im_end|> as special tokens so they are treated as atomic units.
special_tokens = {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
tokenizer.add_special_tokens(special_tokens)
added_special_tokens = tokenizer.special_tokens_map.get(
    "additional_special_tokens",
    special_tokens["additional_special_tokens"],
)
# Define pad_token (Standard practice for Mistral is using eos_token or the new im_end)
tokenizer.pad_token = "<|im_end|>"
tokenizer.padding_side = "right" # Standard for training causal LMs
# 3. Load Dataset (Prepared in Step 1)
print(f"Loading prepared dataset from {prepared_dataset_path}...")
raw_dataset = load_from_disk(prepared_dataset_path)
def preprocess_function(example):
    """
    Applies label masking to a dataset that was already standardized to ChatML.
    Labels are set to -100 for the user prompt so the model only learns from assistant responses.
    """
    chatml_text = example["text"]
    if "<|im_start|>assistant\n" not in chatml_text:
        return {"input_ids": [], "attention_mask": [], "labels": []}
    # Tokenize the full text
    tokenized = tokenizer(
        chatml_text,
        truncation=True,
        max_length=MAX_LENGTH,
        add_special_tokens=False  # We handle special tokens manually in the text
    )
    input_ids = list(tokenized["input_ids"])
    labels = list(input_ids)
    # --- LABEL MASKING LOGIC ---
    # We want to find where the assistant response starts.
    assistant_start_tag = tokenizer.encode(
        "<|im_start|>assistant\n", add_special_tokens=False)
    # Find the start index of the assistant response in input_ids
    # ("+ 1" so a tag that ends exactly at the last token is still matched)
    for i in range(len(input_ids) - len(assistant_start_tag) + 1):
        if input_ids[i:i + len(assistant_start_tag)] == assistant_start_tag:
            # Mask everything before the actual response starts (including the tag)
            for j in range(i + len(assistant_start_tag)):
                labels[j] = -100
            break
    return {
        "input_ids": input_ids,
        "attention_mask": tokenized["attention_mask"],
        "labels": labels
    }
print("Tokenizing and applying label masking...")
tokenized_dataset = raw_dataset.map(
    preprocess_function,
    remove_columns=raw_dataset.column_names,
    desc="Tokenizing with ChatML Masking"
)
# Save the tokenized dataset
tokenized_dataset.save_to_disk(tokenized_dataset_path)
print("\nTokenization complete!")
print(f"Saved to: {tokenized_dataset_path}")
print(f"Special tokens added: {added_special_tokens}")
print(f"Max length: {MAX_LENGTH}")
print(f"Total examples: {len(tokenized_dataset)}")

Continue with Fine-Tuning with QLoRA.