What is a Model’s “Temperature,” Really?
The temperature dial is often used to set the randomness of AI models. Under the hood, it’s just math.
One way to make a model’s outputs more random is to raise its temperature. But what does this actually mean, and what is happening under the hood?
Here’s how I think about it:
First, remember that language models generate text one piece at a time. Those pieces are called tokens. A token isn’t always a full word; it might be a word (“cat”), part of a word (“ing”), or sometimes even punctuation. The important thing to note is that models don’t “think” in terms of sentences or paragraphs, but rather in terms of which token should come next.
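If you'd like to see tokenization in action, here's a quick sketch using OpenAI's open-source tiktoken library (just one tokenizer among many; other models split text differently, so treat the exact pieces as illustrative):

```python
import tiktoken  # pip install tiktoken

# Load one example tokenizer (an encoding used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

text = "The cat is jumping."
token_ids = enc.encode(text)                   # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in token_ids]  # decode each ID back to its text piece

print(token_ids)  # one integer per token
print(pieces)     # the text split into pieces: some full words, some fragments
```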
Each time you type something into a model, it reads all the tokens you gave it and assigns a score to every token in its vocabulary, rating how likely that token is to come next.
Those scores are called logits. A logit is just a raw number the model produces for each possible token before turning it into a probability. A bigger logit means a token is more likely to come next, while a smaller logit means it's less likely. But a logit is not itself a probability, which is a bit confusing!
To turn logits into actual probabilities, the model first runs them through a function called softmax, which basically says: “exponentiate all the logits, then divide each by the sum so they all add up to 1.” At that point, you have a list of probabilities for every possible next token.
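In code, softmax is only a few lines. Here's a minimal sketch in Python with NumPy (the logits below are made-up numbers for illustration):

```python
import numpy as np

def softmax(logits):
    """Turn a vector of logits into probabilities that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([3.0, 1.0, 0.2])
print(softmax(logits))  # roughly [0.84, 0.11, 0.05] -- bigger logit, bigger probability
```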
I think the best way to truly grasp this is to look at an example. Let’s say a model is writing a paragraph of text, and it thinks the next token could be either “cat,” “dog,” or “banana,” with logits of 2.0, 1.0, and 0.1. The highest logit is for cat, and so that’s the word we would expect to come next. But this is where temperature comes in…
Temperature is a way to mess with the conversion between logits and probabilities. The gist is that, before the model applies the softmax function to convert logit → probability, it first divides each logit by the set temperature value.
If the temperature is less than 1, the differences between logits get bigger, so the most likely token is assigned a higher-than-expected probability. The least likely tokens nearly disappear, and the model becomes more conservative or predictable.
If the temperature is greater than 1, the differences between logits get smaller, and the probabilities flatten out. That means the model will pick less obvious tokens more often. (If you'd like to see the actual math for temperature, check out footnote 1 at the end of this post.)
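In code, temperature is a one-line change to the softmax above (a sketch, with a function name of my choosing rather than any standard API):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = np.asarray(logits) / temperature  # the whole temperature trick is this division
    exps = np.exp(scaled - np.max(scaled))     # stabilized softmax, as before
    return exps / exps.sum()
```

Dividing by a small temperature stretches the gaps between logits before they get exponentiated; dividing by a large one shrinks them.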
Let’s go back to our example with cats, dogs, and bananas. Remember that these have logits of 2.0, 1.0, and 0.1.
Now, if we set temperature = 1.0, then basically nothing happens because we're dividing each logit by 1.0. Softmax turns these logits into probabilities of about 66% for “cat” (that's e^2.0 divided by the sum e^2.0 + e^1.0 + e^0.1), 24% for “dog,” and 10% for “banana.” The model will usually say “cat” but sometimes “dog” or “banana.”
If we set temperature = 0.5, then we are dividing each logit by 0.5 and thus doubling its value. Now the gap between logits becomes bigger! The probability of “cat” jumps to about 86%, “dog” falls to 12%, and “banana” is almost gone at 2%. The model becomes much more predictable.
And finally, if we set temperature = 2.0, we're dividing each logit by 2.0, which shrinks the differences between them. Now the probability of the next token being “cat” drops to about 50%, “dog” climbs to 30%, and “banana” rises to nearly 20%. The model's outputs become more varied.
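You can verify all three cases with the sketch from earlier (repeated here so the snippet runs on its own):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]  # "cat", "dog", "banana"
for t in (1.0, 0.5, 2.0):
    print(t, softmax_with_temperature(logits, temperature=t).round(2))

# Prints roughly:
# 1.0 [0.66 0.24 0.1 ]
# 0.5 [0.86 0.12 0.02]
# 2.0 [0.5  0.3  0.19]
```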
Temperature isn’t unique to language models. This general concept can be used with any model that outputs a probability distribution over possible actions. In image generation, it can change how adventurous the model is with details. In speech synthesis, it can affect how much pronunciation varies. In reinforcement learning, it can make an agent more exploratory. The principle is always the same.
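To make that concrete, here's a hedged sketch of sampling from a temperature-scaled distribution; the same few lines apply whether the choices are tokens, pixels, phonemes, or actions:

```python
import numpy as np

rng = np.random.default_rng()

def sample_with_temperature(logits, temperature=1.0):
    """Pick one index at random, weighted by temperature-scaled softmax probabilities."""
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))
    probs = exps / exps.sum()
    return rng.choice(len(probs), p=probs)

choices = ["cat", "dog", "banana"]
logits = [2.0, 1.0, 0.1]
picks = [choices[sample_with_temperature(logits, temperature=2.0)] for _ in range(10)]
print(picks)  # a mix of all three, since high temperature flattens the distribution
```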
I think the key to understanding temperature, though, is to stop thinking of it as a creativity dial and start thinking of it as a way of reshaping probabilities. It doesn't make the model smarter or dumber; it simply changes how much it sticks to the safe bet versus taking a chance.
If you want to learn more about AI, join our Discord community! Or visit us at minibase.ai.
— Minibase Engineers
1. The formula is p_i = exp(z_i / T) / Σ_j exp(z_j / T), where z_i is the logit for token i and T is the set temperature, which can be any positive number. If T < 1, logits get bigger in magnitude, the probability distribution gets sharper, and the next token is more predictable. If T = 1, there is no change from the model's raw distribution. As T goes above 1, the logits get smaller in magnitude, the distribution gets flatter, and there is more randomness in the next token.