Fine-Tuning Glossary
This page keeps the original structure of the source glossary: each term has a practical explanation first and a more internal explanation second.
Glossary Source
This glossary is a documentation adaptation of the original training glossary in the public repository. Use the source file when you want the raw upstream wording alongside the runnable training assets.
- Fine-tuning glossary on GitHub: the original glossary file.
- Fine-tuning stage README: stage context for the training terms below.
- Fine-tuning script: the implementation that uses these concepts.
bitsandbytes
Simple explanation
bitsandbytes helps you run large models with much less GPU memory.
In this project, it enables 4-bit quantization so a 7B model can fit on consumer hardware.
What it does internally
It stores weights in low precision and uses optimized CUDA kernels for quantized math. That reduces VRAM use while keeping most of the model quality.
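A hedged sketch of what enabling this looks like in code, assuming recent versions of transformers and bitsandbytes are installed; the model id is a placeholder, not one from this project:

```python
# Sketch of the 4-bit loading path via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (see NF4 below)
    bnb_4bit_use_double_quant=True,         # compress quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize into BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",  # hypothetical model id
    quantization_config=bnb_config,
)
```

The four `bnb_4bit_*` options here correspond directly to the NF4, double quantization, and BF16 entries later in this glossary.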
QLoRA
Simple explanation
QLoRA is a way to fine-tune a big model cheaply.
Instead of retraining the whole model, you keep the base model compressed and only train small adapter layers.
What it does internally
It combines low-bit quantization for the frozen base model with LoRA adapters for the trainable part. This keeps memory use low while still allowing the model to learn new behavior.
NF4
Simple explanation
NF4 is a smart 4-bit format for storing model weights.
What it does internally
NF4 stands for NormalFloat4. It is designed for weight values that roughly follow a normal distribution, so it usually preserves model quality better than simpler 4-bit formats.
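The idea can be illustrated in pure Python. This is a toy, not the real bitsandbytes kernel: it builds 16 code levels from normal-distribution quantiles (the real NF4 table is constructed slightly differently but has the same shape) and snaps absmax-scaled weights to the nearest level:

```python
# Toy NF4-style quantization: normal-quantile code levels + per-block
# absmax scaling. Illustrative only, not the bitsandbytes implementation.
from statistics import NormalDist

def nf4_like_levels(n=16):
    """Build n code levels from evenly spaced quantiles of N(0, 1),
    rescaled so the extremes land on -1 and +1."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in qs)
    return [q / scale for q in qs]

def quantize_block(weights, levels):
    """Scale one block by its absmax and snap each value to the
    nearest code level; return (codes, absmax) for decoding."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(len(levels)),
                 key=lambda i: abs(w / absmax - levels[i]))
             for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    return [levels[c] * absmax for c in codes]

levels = nf4_like_levels()
block = [0.31, -0.02, 0.9, -0.45]
codes, absmax = quantize_block(block, levels)
approx = dequantize_block(codes, absmax, levels)
```

Because the levels are dense near zero, where most normally distributed weights live, the reconstruction error stays small for typical values.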
Double Quantization
Simple explanation
Double quantization is compression for the quantization metadata itself.
What it does internally
When weights are quantized, extra scaling values are needed to decode them. Double quantization compresses those scaling values too, which reduces memory overhead.
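A back-of-the-envelope version of that saving, using the block sizes typical of QLoRA-style 4-bit schemes (one FP32 absmax per 64 weights, re-quantized to 8 bits in groups of 256):

```python
# Overhead of quantization constants, with and without double quantization.
BLOCK = 64    # weights per quantization block
GROUP = 256   # absmax values per second-level block

# Single quantization: one 32-bit absmax for every 64 weights.
single_bits_per_weight = 32 / BLOCK                       # 0.5 bits/weight

# Double quantization: 8-bit absmax codes, plus one 32-bit scale
# for every group of 256 absmax values.
double_bits_per_weight = 8 / BLOCK + 32 / (BLOCK * GROUP)

saved = single_bits_per_weight - double_bits_per_weight
print(f"overhead: {single_bits_per_weight:.3f} -> "
      f"{double_bits_per_weight:.3f} bits/weight (saves {saved:.3f})")
```

Roughly 0.37 bits per parameter saved, which adds up to hundreds of megabytes on a 7B model.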
LoRA rank (r)
Simple explanation
Rank controls the size of the LoRA adapter’s learning capacity.
- Lower rank means less memory and less adaptation power.
- Higher rank means more memory and more adaptation power.
What it does internally
LoRA replaces a full weight update with two small matrices. The rank r is the bottleneck dimension that limits how much change the adapter can represent.
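A quick parameter count makes the bottleneck concrete. The dimensions below are illustrative (a square projection of a typical 7B-class model); a full update to a `d_out x d_in` weight has `d_out * d_in` entries, while LoRA trains only `B` (`d_out x r`) and `A` (`r x d_in`):

```python
# Trainable parameters of a LoRA adapter vs. the full weight update.
d_out, d_in = 4096, 4096   # illustrative projection size

def lora_params(r):
    return d_out * r + r * d_in   # entries of B plus entries of A

full = d_out * d_in
for r in (8, 16, 64):
    pct = 100 * lora_params(r) / full
    print(f"r={r:3d}: {lora_params(r):,} trainable params ({pct:.2f}% of full)")
```

Even at r=64, the adapter is only a few percent of the full matrix, which is why LoRA training fits in so much less memory.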
Alpha (lora_alpha)
Simple explanation
Alpha controls how strongly LoRA updates affect the base model.
What it does internally
LoRA applies a scaled low-rank update:
delta_W = (alpha / r) * B * A
Increasing alpha therefore increases the effective magnitude of the adapter updates, while dividing by r keeps that magnitude comparable across different ranks.
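A tiny numeric check of that scaling rule, with made-up 2x2 matrices: doubling alpha doubles every entry of the update.

```python
# delta_W = (alpha / r) * B @ A on toy matrices, in pure Python.
r, alpha = 2, 16
B = [[1.0, 0.0],
     [0.5, 1.0]]          # d_out x r
A = [[0.2, -0.1],
     [0.0,  0.3]]         # r x d_in

def delta_w(alpha, r, B, A):
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(A[0]))]
            for i in range(len(B))]

d1 = delta_w(alpha, r, B, A)       # update at alpha = 16
d2 = delta_w(2 * alpha, r, B, A)   # update at alpha = 32
```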
Projections
Simple explanation
Projections are core linear layers in the transformer. Targeting them means LoRA edits the most important places where information is transformed.
What it does internally
In attention, q_proj, k_proj, v_proj, and o_proj create and remap the attention flow. In feed-forward blocks, gate_proj, up_proj, and down_proj shape the hidden-state expansion and compression.
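Targeting all seven projections might look like the following sketch, assuming peft is installed and a LLaMA-style architecture whose modules use these names; the r, alpha, and dropout values are placeholders, not this project's settings:

```python
# Hedged sketch: a LoraConfig that targets the projection layers.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```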
BF16 vs FP16
Simple explanation
Both are 16-bit formats that save memory and speed up training, but BF16 is usually more stable on modern GPUs.
What it does internally
BF16 keeps the wider exponent range of FP32, which greatly reduces overflow and underflow issues. FP16 has a smaller exponent range, so activations and gradients are more fragile.
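The difference falls straight out of the bit layout. Both formats spend 16 bits, but split them differently between exponent and mantissa:

```python
# Largest finite value of an IEEE-style float, derived from its
# exponent/mantissa split (sign bit excluded from the counts).
def max_finite(exp_bits, mant_bits):
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mant_bits) * 2.0 ** bias

fp16_max = max_finite(5, 10)   # FP16: 5 exponent bits, 10 mantissa bits
bf16_max = max_finite(8, 7)    # BF16: 8 exponent bits (like FP32), 7 mantissa bits

print(f"FP16 max ~ {fp16_max:.5g}, BF16 max ~ {bf16_max:.5g}")
```

FP16 tops out at 65504, so an activation of 70000 already overflows to infinity; BF16 reaches about 3.4e38, the same range as FP32, at the cost of less precision per value.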
Flash Attention
Simple explanation
Flash Attention is a faster and more memory-efficient attention implementation.
What it does internally
It computes attention in tiled blocks and avoids storing large intermediate matrices in full precision, which cuts memory traffic and improves throughput.
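The core trick, the online softmax, can be shown in pure Python for a single query. This toy ignores everything that makes the real kernel fast (tiling in SRAM, fused CUDA ops), but it demonstrates that processing key/value blocks one at a time, without ever holding the full score row, gives the same answer as the naive computation:

```python
# Naive attention vs. a flash-attention-style online-softmax version.
import math

def naive_attention(q, keys, values):
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Process keys/values block by block, keeping only a running
    max (m), normalizer (z), and weighted accumulator (acc)."""
    dim = len(values[0])
    m, z, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        corr = math.exp(m - m_new) if m != float("-inf") else 0.0
        z = z * corr + sum(math.exp(s - m_new) for s in scores)
        acc = [a * corr + sum(math.exp(s - m_new) * v[d]
                              for s, v in zip(scores, vb))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]

q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values, block=2)
```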
SDPA
Simple explanation
SDPA is PyTorch’s built-in attention engine and the safe fallback when Flash Attention is not available.
What it does internally
SDPA stands for Scaled Dot-Product Attention, exposed as torch.nn.functional.scaled_dot_product_attention. PyTorch dispatches it to an optimized backend, including a Flash-Attention-style kernel when the hardware and inputs allow it, without requiring an external package.
device_map="auto"
Simple explanation
This tells Transformers to decide automatically where model parts should go.
What it does internally
Transformers (via Accelerate) inspects available memory and assigns each layer to a device: GPUs first, then CPU offload if the model still does not fit.
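In code this is a single argument; the model id below is a placeholder:

```python
# Sketch: let transformers/accelerate decide layer placement.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",  # hypothetical model id
    device_map="auto",         # place layers across available GPUs/CPU
)
```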
AdamW
Simple explanation
AdamW is the optimizer, the rule that updates weights after each batch.
What it does internally
It combines adaptive moments with decoupled weight decay, which usually behaves better than classic Adam plus L2 regularization.
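A minimal scalar version of one AdamW step makes the "decoupled" part visible: the weight decay is applied directly to the weight at the end, rather than being folded into the gradient. Hyperparameter defaults here are illustrative:

```python
# One AdamW update for a single scalar weight, in pure Python.
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g          # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)             # bias correction, step t >= 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive update
    w = w - lr * weight_decay * w            # decoupled weight decay
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                        # three steps, constant gradient
    w, m, v = adamw_step(w, 0.5, m, v, t)
```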
Cosine Scheduler
Simple explanation
A cosine scheduler lowers the learning rate smoothly over time.
What it does internally
It applies a cosine-shaped decay curve to the learning rate across training steps, which often helps convergence by reducing update noise near the end.
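The shape is easy to reproduce. This sketch follows the curve typically produced by schedulers like transformers' get_cosine_schedule_with_warmup (linear warmup, then cosine decay to zero); the step counts and base rate are made up:

```python
# Learning rate at a given step under warmup + cosine decay.
import math

def cosine_lr(step, total_steps, base_lr=2e-4, warmup_steps=10):
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 100
lrs = [cosine_lr(s, total) for s in range(total + 1)]
```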
Data Collator
Simple explanation
The data collator packs training examples into a batch with matching tensor shapes.
What it does internally
It pads sequences to a common length, builds tensors, and prepares keys such as input_ids, attention_mask, and labels.
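A stripped-down pure-Python analogue of a seq2seq-style collator (the real one returns tensors; this returns nested lists to stay self-contained). The pad token id of 0 is an assumption, not this project's tokenizer setting:

```python
# Minimal collator: right-pad input_ids, build attention_mask,
# and pad labels with -100 so the loss ignores those positions.
PAD_TOKEN_ID = 0        # assumption: tokenizer's pad token id
LABEL_PAD_ID = -100     # positions the loss should ignore

def collate(examples):
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        pad = max_len - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [PAD_TOKEN_ID] * pad)
        batch["attention_mask"].append([1] * len(e["input_ids"]) + [0] * pad)
        batch["labels"].append(e["labels"] + [LABEL_PAD_ID] * pad)
    return batch

batch = collate([
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": [8, 9],    "labels": [8, 9]},
])
```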
label_pad_token_id = -100
Simple explanation
This tells the loss function to skip the padded label positions entirely.
What it does internally
PyTorch's cross-entropy loss uses -100 as its default ignore_index, so targets with that value contribute nothing to the loss or the gradients.
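A toy check of the effect, with made-up per-token loss values: the average is taken only over positions whose label is not -100, so the garbage values at padded positions never matter.

```python
# Masked averaging: -100 labels drop out of the loss.
labels     = [ 12,   7,  -100, -100]
token_loss = [0.5, 0.3,   9.9,  9.9]   # made-up losses; padded positions
                                       # hold garbage values on purpose

kept = [l for lab, l in zip(labels, token_loss) if lab != -100]
mean_loss = sum(kept) / len(kept)
```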
Adapter
Simple explanation
An adapter is a small extra set of trained weights that teaches the base model a new behavior.
What it does internally
With LoRA, the adapter contains the low-rank update matrices that are added to selected base-model layers at inference time. Loading the base model plus the adapter recreates the fine-tuned behavior.