Fine-Tuning Glossary

This page keeps the original structure of the source glossary: each term has a practical explanation first and a more internal explanation second.

Glossary Source

This glossary is a documentation adaptation of the original training glossary in the public repository. Use the source file when you want the raw upstream wording alongside the runnable training assets.

bitsandbytes

Simple explanation

bitsandbytes helps you run large models with much less GPU memory.

In this project, it enables 4-bit quantization so a 7B model can fit on consumer hardware.

What it does internally

It stores weights in low precision and uses optimized CUDA kernels for quantized math. That reduces VRAM use while keeping most of the model quality.
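A minimal sketch of how this is configured in practice, assuming the Transformers + bitsandbytes integration; the dtypes shown are a typical choice, not the only valid one:

```python
# Hedged sketch: a 4-bit quantization config for loading a model with
# Transformers + bitsandbytes.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 storage format
    bnb_4bit_use_double_quant=True,     # also compress the quantization scales
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Passed to loading as:
# AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb_config)
```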

QLoRA

Simple explanation

QLoRA is a way to fine-tune a big model cheaply.

Instead of retraining the whole model, you keep the base model compressed and only train small adapter layers.

What it does internally

It combines low-bit quantization for the frozen base model with LoRA adapters for the trainable part. This keeps memory use low while still allowing the model to learn new behavior.
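The memory math behind "only train small adapter layers" can be sketched with back-of-the-envelope numbers. The layer sizes below (32 blocks, hidden size 4096, MLP size 11008, rank 16) are illustrative assumptions for a hypothetical 7B model, not measurements of a specific checkpoint:

```python
# Rough trainable-parameter count for QLoRA on a hypothetical 7B model.
r = 16
n_layers = 32
d_model = 4096
d_ffn = 11008

def lora_params(d_in, d_out, r):
    # A LoRA adapter on a d_in -> d_out linear layer adds A (r x d_in)
    # and B (d_out x r), i.e. r * (d_in + d_out) parameters.
    return r * (d_in + d_out)

attn = 4 * lora_params(d_model, d_model, r)     # q, k, v, o projections
mlp = (2 * lora_params(d_model, d_ffn, r)       # gate, up projections
       + lora_params(d_ffn, d_model, r))        # down projection
adapter_total = n_layers * (attn + mlp)

base_total = 7_000_000_000                      # frozen, quantized base model
print(adapter_total, adapter_total / base_total)
```

Under these assumptions the trainable adapter is roughly 40M parameters, well under 1% of the frozen base.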

NF4

Simple explanation

NF4 is a smart 4-bit format for storing model weights.

What it does internally

NF4 stands for NormalFloat4. It is designed for weight values that roughly follow a normal distribution, so it usually preserves model quality better than simpler 4-bit formats.
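To illustrate block-wise 4-bit quantization, here is a toy quantizer with evenly spaced levels. This is deliberately not the real NF4 codebook: NF4's improvement is precisely that it places its 16 levels at quantiles of a normal distribution instead of spacing them evenly.

```python
import random

# Toy block-wise 4-bit quantizer (evenly spaced levels, unlike real NF4).
def quantize_block(weights):
    scale = max(abs(w) for w in weights) or 1.0
    codes = [round(w / scale * 7) for w in weights]   # integer codes in [-7, 7]
    return codes, scale

def dequantize_block(codes, scale):
    return [c / 7 * scale for c in codes]

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # weight-like values
codes, scale = quantize_block(block)
recon = dequantize_block(codes, scale)
err = max(abs(a - b) for a, b in zip(block, recon))
print(err, scale)   # reconstruction error is bounded by half a step
```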

Double Quantization

Simple explanation

Double quantization is compression for the quantization metadata itself.

What it does internally

When weights are quantized, extra scaling values are needed to decode them. Double quantization compresses those scaling values too, which reduces memory overhead.
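The savings can be estimated with the block sizes described in the QLoRA paper (64 weights per quantization block, 256 first-level scales per second-level block):

```python
# Per-weight memory overhead of the quantization constants.
block_size = 64

# Without double quantization: one FP32 scale per block of 64 weights.
plain_overhead = 32 / block_size                     # bits per weight

# With double quantization: scales stored in 8 bits, plus one shared
# FP32 second-level scale for every 256 first-level scales.
dq_overhead = 8 / block_size + 32 / (block_size * 256)

print(plain_overhead, dq_overhead)   # 0.5 vs ~0.127 bits per weight
```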

LoRA rank (r)

Simple explanation

Rank controls the size of the LoRA adapter’s learning capacity.

  • Lower rank means less memory and less adaptation power.
  • Higher rank means more memory and more adaptation power.

What it does internally

LoRA replaces a full weight update with two small matrices. The rank r is the bottleneck dimension that limits how much change the adapter can represent.
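The bottleneck is easy to see by counting parameters. For one hypothetical 4096 x 4096 projection:

```python
# Parameter cost of a full update vs. a rank-r LoRA update.
d = 4096
full_update = d * d                               # dense delta_W: ~16.8M values
lora_sizes = {r: 2 * r * d for r in (4, 16, 64)}  # A is r x d, B is d x r

for r, n in lora_sizes.items():
    print(r, n, n / full_update)                  # fraction grows linearly in r
```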

Alpha (lora_alpha)

Simple explanation

Alpha controls how strongly LoRA updates affect the base model.

What it does internally

LoRA applies a scaled low-rank update of roughly:

    delta_W = (alpha / r) * B * A

so increasing alpha increases the effective magnitude of the adapter update.
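The scaling can be demonstrated on tiny matrices in pure Python (no frameworks, values chosen arbitrarily):

```python
# Effect of lora_alpha on the adapter update delta_W = (alpha / r) * B @ A.
def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def scaled_delta(B, A, alpha, r):
    return [[(alpha / r) * x for x in row] for row in matmul(B, A)]

A = [[0.1, 0.2]]        # r x d_in, with r = 1
B = [[0.5], [1.0]]      # d_out x r
d1 = scaled_delta(B, A, alpha=8, r=1)
d2 = scaled_delta(B, A, alpha=16, r=1)
print(d1, d2)           # doubling alpha doubles every entry of delta_W
```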

Projections

Simple explanation

Projections are core linear layers in the transformer. Targeting them means LoRA edits the most important places where information is transformed.

What it does internally

In attention, q_proj, k_proj, v_proj, and o_proj create and remap the attention flow. In feed-forward blocks, gate_proj, up_proj, and down_proj shape the hidden-state expansion and compression.
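A minimal sketch of targeting these layers with PEFT; `r` and `lora_alpha` below are common starting values, not requirements:

```python
# Hedged sketch: a PEFT LoraConfig targeting the projection layers named above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```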

BF16 vs FP16

Simple explanation

Both are 16-bit formats that save memory and speed up training, but BF16 is usually more stable on modern GPUs.

What it does internally

BF16 keeps the wider exponent range of FP32, which greatly reduces overflow and underflow issues. FP16 has a smaller exponent range, so activations and gradients are more fragile.
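The range difference can be shown with the standard library alone. FP16 is supported directly by `struct`'s `e` format; BF16 is simulated here by truncating an FP32 encoding to its top 16 bits, which is exactly what the format stores:

```python
import struct

# FP16's largest finite value is 65504, so a value like 70000.0 overflows.
# BF16 keeps FP32's exponent range and only loses mantissa precision.
def to_bf16(x):
    raw = struct.pack(">f", x)                       # FP32, big-endian
    return struct.unpack(">f", raw[:2] + b"\x00\x00")[0]

fp16_max = struct.unpack("e", struct.pack("e", 65504.0))[0]
try:
    struct.pack("e", 70000.0)                        # out of FP16 range
    overflowed = False
except OverflowError:
    overflowed = True

print(fp16_max, overflowed, to_bf16(70000.0))        # 70000 survives in BF16,
                                                     # just less precisely
```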

Flash Attention

Simple explanation

Flash Attention is a faster and more memory-efficient attention implementation.

What it does internally

It computes attention in tiled blocks and avoids storing large intermediate matrices in full precision, which cuts memory traffic and improves throughput.

SDPA

Simple explanation

SDPA is PyTorch’s built-in attention engine and the safe fallback when Flash Attention is not available.

What it does internally

SDPA stands for Scaled Dot-Product Attention. It uses PyTorch’s optimized attention kernels instead of an external package.
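A hedged sketch of the fallback pattern when loading with Transformers; the model name is a placeholder, and the exception type assumes the error Transformers raises when the flash-attn package is missing:

```python
# Try Flash Attention first, fall back to PyTorch's built-in SDPA.
from transformers import AutoModelForCausalLM

try:
    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-7b-model",                    # placeholder name
        attn_implementation="flash_attention_2",
    )
except ImportError:                                  # flash-attn not installed
    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-7b-model",
        attn_implementation="sdpa",
    )
```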

device_map="auto"

Simple explanation

This tells Transformers to decide automatically where model parts should go.

What it does internally

Transformers inspects the available hardware and places each layer on a suitable device, usually spreading the model across one or more GPUs and offloading to CPU if it does not fit.
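A minimal sketch, with a placeholder model name; after loading, the chosen layout is visible on the model's `hf_device_map` attribute:

```python
# Hedged sketch: automatic device placement with Transformers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",    # placeholder name
    device_map="auto",
)
print(model.hf_device_map)       # mapping of module names to devices
```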

AdamW

Simple explanation

AdamW is the optimizer: the rule that updates the weights after each batch.

What it does internally

It combines adaptive moments with decoupled weight decay, which usually behaves better than classic Adam plus L2 regularization.
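The "decoupled" part is visible in a single-parameter version of the update, written in pure Python; the hyperparameters are common defaults:

```python
import math

# One AdamW step for a scalar weight: adaptive moments, then weight decay
# applied directly to the weight rather than mixed into the gradient.
def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g        # second moment (mean of squares)
    m_hat = m / (1 - b1 ** t)            # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # Adam update
    w = w - lr * wd * w                  # decoupled weight decay
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=t)
print(w)   # the weight shrinks under a positive gradient plus decay
```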

Cosine Scheduler

Simple explanation

A cosine scheduler lowers the learning rate smoothly over time.

What it does internally

It applies a cosine-shaped decay curve to the learning rate across training steps, which often helps convergence by reducing update noise near the end.
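The decay curve itself is one line of math; the sketch below omits the linear warmup phase that real schedules often prepend, and the peak learning rate is an arbitrary example value:

```python
import math

# Cosine decay from lr_max to lr_min over total_steps.
def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=0.0):
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 100
lrs = [cosine_lr(s, total) for s in range(total + 1)]
print(lrs[0], lrs[50], lrs[100])   # starts at lr_max, ends near lr_min
```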

Data Collator

Simple explanation

The data collator packs training examples into a batch with matching tensor shapes.

What it does internally

It pads sequences to a common length, builds tensors, and prepares keys such as input_ids, attention_mask, and labels.
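Those three steps can be sketched as a minimal collator in pure Python (real collators return framework tensors; plain lists are used here to keep the sketch self-contained):

```python
# Pad input_ids with a pad token, mark real tokens in attention_mask,
# and pad labels with -100 so the loss skips the padded positions.
def collate(examples, pad_token_id=0):
    max_len = max(len(e) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        batch["labels"].append(ids + [-100] * pad)
    return batch

batch = collate([[5, 6, 7], [8, 9]])
print(batch)
```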

label_pad_token_id = -100

Simple explanation

This tells the loss function to ignore the padded label positions.

What it does internally

PyTorch loss functions commonly ignore targets with value -100, so padded label positions do not affect the gradients.
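The ignore behavior can be mirrored in a pure-Python cross-entropy, where positions whose target is -100 simply contribute nothing to the average:

```python
import math

# Cross-entropy that skips targets equal to -100, mirroring the
# ignore_index behavior of common PyTorch loss functions.
def cross_entropy(logits, targets, ignore_index=-100):
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue                       # padded position: no loss, no gradient
        log_z = math.log(sum(math.exp(x) for x in row))
        total += log_z - row[target]       # -log softmax[target]
        count += 1
    return total / count                   # averaged over real positions only

logits = [[2.0, 0.0], [0.0, 2.0], [9.9, 9.9]]   # last row is padding
loss = cross_entropy(logits, targets=[0, 1, -100])
print(loss)   # identical whatever the padded row's logits are
```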

Adapter

Simple explanation

An adapter is a small extra set of trained weights that teaches the base model a new behavior.

What it does internally

With LoRA, the adapter contains the low-rank update matrices that are added to selected base-model layers at inference time. Loading the base model plus the adapter recreates the fine-tuned behavior.
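A minimal reload sketch with PEFT; the model name and adapter path are placeholders:

```python
# Hedged sketch: recreate the fine-tuned model as base + adapter.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("some-org/some-7b-model")
model = PeftModel.from_pretrained(base, "path/to/adapter")

# Optionally fold the adapter into the base weights for deployment:
# model = model.merge_and_unload()
```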