Inference Glossary
This page keeps the same two-layer structure as the source glossary: what you notice in practice, then what is happening internally.
Glossary Source
This glossary is a documentation adaptation of the original inference glossary in the public repository. Use the source file when you want the raw upstream wording together with the runnable chat loop.
- Inference glossary on GitHub: the original glossary file.
- Inference stage README: stage context for the generation terms below.
- Inference script: the implementation that uses these parameters.
max_new_tokens
Simple explanation
This is the maximum response length. If it is set to 256, the response cannot grow beyond 256 new tokens.
What it does internally
The model generates one token at a time in a loop. This parameter stops that loop when the limit is reached, even if the end token has not appeared.
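This loop can be sketched in a few lines of pure Python. The token chooser `next_token_fn` and the ID values are placeholders for illustration, not the tutorial's actual code:

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, eos_id):
    # Decode one token at a time; stop at the end token or at the cap,
    # whichever comes first.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token_fn(ids)
        ids.append(tok)
        if tok == eos_id:
            break
    return ids
```

If `next_token_fn` never produces `eos_id`, the response is truncated at exactly `max_new_tokens` new tokens.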
do_sample
Simple explanation
This enables sampling.
False means the model almost always picks the most likely next token. True allows variation and less rigid outputs.
What it does internally
With do_sample=False, decoding is deterministic. With do_sample=True, the next token is sampled from a probability distribution.
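The difference between the two modes is just argmax versus a random draw. A minimal sketch over an already-computed probability list (not the library's internal implementation):

```python
import random

def pick_next(probs, do_sample, rng=random):
    # do_sample=False: deterministic greedy decoding (argmax).
    if not do_sample:
        return max(range(len(probs)), key=probs.__getitem__)
    # do_sample=True: draw one index according to the distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Greedy decoding always returns the same token for the same distribution; sampling can return any token with nonzero probability.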
temperature
Simple explanation
This is the creativity thermostat.
- Low values produce more conservative responses.
- Medium values balance coherence and variation.
- High values allow riskier responses.
What it does internally
Before converting logits to probabilities, the model divides them by the temperature. If T < 1, the distribution becomes sharper; if T > 1, it becomes flatter.
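The scaling step above can be demonstrated with a temperature-aware softmax, a sketch of the math rather than the library's code:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

With the same logits, a lower temperature gives the top token a larger share of the probability mass, and a higher temperature spreads the mass out.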
top_p (Nucleus Sampling)
Simple explanation
The model samples only from the smallest set of most likely tokens whose cumulative probability reaches p.
What it does internally
It sorts tokens by probability, accumulates them until reaching p, discards the rest, and samples from the remaining subset.
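The filtering step described above can be sketched directly; this is an illustration of the idea, not the Transformers implementation:

```python
def top_p_filter(probs, p):
    # Sort token indices by probability (descending), keep the smallest
    # prefix whose cumulative mass reaches p, then renormalize the survivors.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

For example, with probabilities [0.5, 0.3, 0.15, 0.05] and p=0.9, the first three tokens are kept (0.5 + 0.3 + 0.15 = 0.95 ≥ 0.9) and the last is discarded.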
stopping_criteria
Simple explanation
This is the stop condition. In this tutorial, generation stops when <|im_end|> appears.
What it does internally
After each generated token, Transformers calls the criterion. If it returns True, generation stops immediately.
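The per-step check can be sketched as a plain function; in Transformers the same logic lives in a StoppingCriteria subclass, but the shape of the decision is just this:

```python
def should_stop(generated_ids, stop_id):
    # Called after every new token: True ends generation immediately.
    # stop_id stands in for the tokenizer's ID of <|im_end|>.
    return len(generated_ids) > 0 and generated_ids[-1] == stop_id
```

Because the check runs after every token, generation can end well before max_new_tokens is reached.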
pad_token_id
Simple explanation
This defines which token is used as padding when sequence lengths need alignment.
What it does internally
It is used in batching and alignment operations and helps avoid warnings or ambiguous behavior when a tokenizer does not already define a padding token.
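Padding exists so that sequences of different lengths can be stacked into one rectangular batch. A minimal sketch of right-padding (the real tokenizer handles this along with attention masks):

```python
def pad_batch(sequences, pad_token_id):
    # Right-pad every sequence to the longest length in the batch
    # so they can be stacked into a single rectangular tensor.
    max_len = max(len(s) for s in sequences)
    return [s + [pad_token_id] * (max_len - len(s)) for s in sequences]
```

A common pattern when a tokenizer defines no padding token is to reuse the end-of-sequence token as the pad token, which silences the related warning.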
Reading temperature And top_p Together
The source glossary summarizes the interaction this way:
temperature decides how much freedom exists inside the candidate set. top_p decides which candidates are allowed in the first place.
The current setup in the tutorial is temperature=0.8 and top_p=0.9, which aims for technical responses that still feel natural rather than rigid.
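Putting the two together, one decoding step looks like: scale logits by temperature, convert to probabilities, restrict to the nucleus, then sample. A self-contained sketch of that pipeline (an illustration, not the tutorial's actual code):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_p=0.9, rng=random):
    # 1. Temperature controls how much freedom exists inside the
    #    candidate set: divide logits by T before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # 2. top_p decides which candidates are allowed at all: keep the
    #    most likely tokens until their cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    # 3. Sample from the renormalized nucleus.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With a very small top_p only the single most likely token survives the filter, so the call becomes effectively greedy regardless of temperature.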