Inference Glossary

This page keeps the same two-layer structure as the source glossary: what you notice in practice, then what is happening internally.

Glossary Source

This glossary is a documentation adaptation of the original inference glossary in the public repository. Use the source file when you want the raw upstream wording together with the runnable chat loop.

max_new_tokens

Simple explanation

This is the maximum response length. If it is set to 256, the response cannot grow beyond 256 new tokens.

What it does internally

The model generates one token at a time in a loop. This parameter stops that loop once the limit is reached, even if the end-of-sequence token has not yet appeared.
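A minimal sketch of that loop in plain Python. The `next_token_fn` and `eos_id` names here are hypothetical stand-ins for the model and tokenizer, not Transformers API:

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, eos_id):
    """Toy decoding loop: emits one token at a time and stops either
    when eos_id appears or when max_new_tokens is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = next_token_fn(ids)
        ids.append(tok)
        if tok == eos_id:  # natural stop: the end token was generated
            break
    return ids
```

Either exit path can fire first; `max_new_tokens` only guarantees an upper bound, not the actual length.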

do_sample

Simple explanation

This enables sampling.

  • False means the model always picks the most likely next token (greedy decoding).
  • True allows variation and less rigid outputs.

What it does internally

With do_sample=False, decoding is deterministic. With do_sample=True, the next token is sampled from a probability distribution.
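A sketch of the difference, assuming a toy probability distribution over token ids (the function name is illustrative, not a Transformers API):

```python
import random

def pick_next(probs, do_sample, rng=None):
    """Greedy argmax when do_sample=False; otherwise a categorical
    draw weighted by the probability distribution."""
    if not do_sample:
        return max(range(len(probs)), key=lambda i: probs[i])
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `do_sample=False` the same prompt yields the same token every time; with `do_sample=True` low-probability tokens occasionally win, which is where the variation comes from.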

temperature

Simple explanation

This is the creativity thermostat.

  • Low values produce more conservative responses.
  • Medium values balance coherence and variation.
  • High values allow riskier responses.

What it does internally

Before the logits are converted to probabilities, they are divided by the temperature. If T < 1, the distribution becomes sharper; if T > 1, it becomes flatter.
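That division can be shown in a few lines, using a numerically stable softmax over toy logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Running it with the same logits at T = 0.5 and T = 2.0 shows the sharpening effect: the top token's probability grows as T shrinks.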

top_p (Nucleus Sampling)

Simple explanation

The model samples only from the smallest set of most likely tokens whose cumulative probability reaches p.

What it does internally

It sorts tokens by probability, accumulates their probabilities until the running total reaches p, discards the rest, renormalizes the surviving subset, and samples from it.
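The steps above can be sketched directly; this is a simplified standalone version, not the Transformers implementation:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then renormalize that subset."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # the nucleus is complete
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

For probabilities [0.5, 0.3, 0.15, 0.05] and p=0.9, the first three tokens cover 0.95 of the mass, so the 0.05 tail token is never sampled.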

stopping_criteria

Simple explanation

This is the stop condition. In this tutorial, generation stops when <|im_end|> appears.

What it does internally

After each generated token, Transformers calls the criterion. If it returns True, generation stops immediately.
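A pure-Python analogue of that callback. Transformers' real StoppingCriteria receives tensors of token ids and scores; here `decode_fn` is a hypothetical stand-in for the tokenizer:

```python
class StopOnSubstring:
    """Returns True once the decoded text contains the stop marker,
    mirroring how a stopping criterion is polled after every token."""

    def __init__(self, stop_str, decode_fn):
        self.stop_str = stop_str
        self.decode_fn = decode_fn

    def __call__(self, token_ids):
        # evaluated after each new token; True halts generation
        return self.stop_str in self.decode_fn(token_ids)
```

In the tutorial's setup the marker would be <|im_end|>, so generation halts the moment that string appears in the decoded output.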

pad_token_id

Simple explanation

This defines which token is used as padding when sequence lengths need alignment.

What it does internally

It is used in batching and alignment operations and helps avoid warnings or ambiguous behavior when a tokenizer does not already define a padding token.
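A sketch of the alignment itself, assuming right-padding of token id lists (a simplification of what batched tokenization does):

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad every token id sequence to the longest one in the
    batch so they can be stacked into a rectangular array."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_token_id] * (width - len(seq)) for seq in sequences]
```

Without an agreed pad token, the shorter sequences in a batch have no well-defined filler, which is exactly the ambiguity the warning points at.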

Reading temperature and top_p Together

The source glossary summarizes the interaction this way:

  • temperature decides how much freedom exists inside the candidate set.
  • top_p decides which candidates are allowed in the first place.

The current setup in the tutorial is temperature=0.8 and top_p=0.9, which aims for technical responses that still feel natural rather than rigid.
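The two knobs can be read as one pipeline: temperature shapes the distribution, top_p prunes it, and the final draw happens inside the surviving nucleus. A self-contained sketch with the tutorial's values, where the function name is illustrative rather than a Transformers API:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Apply temperature, build the top_p nucleus, then sample."""
    # 1) temperature: scale logits, then stable softmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2) top_p: smallest prefix of sorted tokens reaching the mass p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 3) renormalize the nucleus and draw one token from it
    nucleus_total = sum(probs[i] for i in kept)
    rng = rng or random.Random(0)
    weights = [probs[i] / nucleus_total for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Raising temperature flattens step 1, which tends to widen the nucleus in step 2; lowering top_p clamps the candidate set regardless of how flat the distribution is.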