The Intuition (Skip the Formulas)
When an LLM generates text, it doesn’t just pick the most likely next word. At each step, it has a probability distribution over possible next words.
Most of the time, the top word is way more likely than the others. But sometimes, you want variety. That’s where temperature and top_p come in.
Temperature = “How boring or creative is the model?”
top_p = “How much of the probability distribution do I care about?”
Both control the same thing from different angles.
Temperature: The Knob Everyone Knows
Temperature ranges from 0 to ~2 (though values above 1 are rare).
Temperature = 0
├─ Always picks the most likely word (deterministic)
└─ Output: Repetitive, predictable, sometimes dull

Temperature = 0.7 (default for most APIs)
├─ Picks likely words, but allows some variation
└─ Output: Natural, conversational, slight randomness

Temperature = 1.2
├─ The distribution flattens; unlikely words get a real chance
└─ Output: Creative but sometimes nonsensical

Temperature = 2.0
├─ Chaos — essentially random
└─ Output: Gibberish

Practical Examples
Task: Code generation

# Temperature = 0
# Output: Always the same code pattern, might miss better solutions

# Temperature = 0.5
# Output: Same structure, slightly varied variable names

Task: Creative writing

# Temperature = 0
# Output: "The sun rose over the hills. It was a sunny day."
# (Boring, but coherent)

# Temperature = 0.9
# Output: "The sun erupted like liquid gold, shattering shadows."
# (More interesting, still coherent)

# Temperature = 1.5
# Output: "Sun purple-blazed quantum elephant mountains."
# (Creative but nonsensical)

Rule of thumb:
- 0–0.3: Analytical tasks (code, math, factual)
- 0.7: Default (chatting, writing)
- 1.0+: Creative tasks (brainstorming, art)
- >1.5: You’re probably doing it wrong
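Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch in pure Python — the logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # hypothetical scores for three candidate words

print(softmax_with_temperature(logits, 0.5))  # sharper: top word dominates
print(softmax_with_temperature(logits, 1.0))  # distribution unchanged
print(softmax_with_temperature(logits, 2.0))  # flatter: more variety
```

Dividing by a temperature below 1 exaggerates the gap between scores; dividing by one above 1 shrinks it. That is the whole mechanism behind "boring vs. creative."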
top_p: The Advanced Knob
top_p (nucleus sampling) filters the probability distribution differently.
Instead of tweaking how sharp/flat the distribution is, top_p says: “Include the most likely words until you’ve covered P% of the probability mass.”
Imagine the model predicts:

"the" — 30% likely
"a" — 25% likely
"some" — 20% likely
"an" — 15% likely
"one" — 5% likely
"that" — 3% likely
"something" — 2% likely
...

top_p = 0.9 (cover 90% of probability)
├─ Include: "the", "a", "some", "an" (90% total)
└─ Exclude: "one", "that", "something"...

top_p = 0.5 (cover 50% of probability)
├─ Include: "the", "a" (55% total)
└─ Exclude: "some", "an", etc.

Temperature vs. top_p: When to Use Each
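The cutoff works like a simple filter over the ranked word list. A sketch using the (hypothetical) probabilities from the example above:

```python
def nucleus_filter(word_probs, top_p):
    """Keep the most likely words until cumulative probability reaches top_p."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append(word)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.30, "a": 0.25, "some": 0.20, "an": 0.15,
         "one": 0.05, "that": 0.03, "something": 0.02}

print(nucleus_filter(probs, 0.9))  # ['the', 'a', 'some', 'an']
print(nucleus_filter(probs, 0.5))  # ['the', 'a']
```

The model then samples only from the surviving words (after renormalizing their probabilities), which is what cuts off the weird long tail.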
Temperature:
- Use when you want simple, intuitive control
- “Make it more creative” → increase temperature
- Works everywhere
top_p:
- Use when you want to eliminate tail words (weird outputs)
- “Prevent nonsense but allow variety” → set top_p = 0.9
- More sophisticated, less intuitive
Common patterns:
Analytical (code, facts):
temperature = 0.2
top_p = 0.9

Conversational (chatbots):
temperature = 0.7
top_p = 0.95

Creative (brainstorming):
temperature = 0.95
top_p = 0.9

Interestingly, using both together works better than either alone. top_p keeps you from generating garbage, while temperature adds variation.
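Putting the two knobs together, a full sampling step looks like this: temperature reshapes the distribution first, then top_p truncates the tail, and only then is a word drawn. A self-contained sketch — the vocabulary and logits are invented for illustration:

```python
import math
import random

def sample(logits, temperature=0.7, top_p=0.95):
    """Temperature-scale logits, truncate to the top_p nucleus, then sample."""
    words = list(logits)
    scaled = [logits[w] / temperature for w in words]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(zip(words, (e / total for e in exps)),
                    key=lambda wp: wp[1], reverse=True)
    # keep words until cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for word, p in ranked:
        nucleus.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # renormalize the surviving words and draw one
    norm = sum(p for _, p in nucleus)
    return random.choices([w for w, _ in nucleus],
                          weights=[p / norm for _, p in nucleus])[0]

logits = {"the": 4.0, "a": 3.5, "some": 3.0, "zxqv": -2.0}  # hypothetical
print(sample(logits, temperature=0.7, top_p=0.9))
```

With these numbers, the junk token "zxqv" carries so little probability that the nucleus cutoff removes it before sampling, while temperature still varies which of the plausible words gets picked.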
How to Set Them in Practice
Ollama:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain quantum computing",
  "options": {
    "temperature": 0.7,
    "top_p": 0.9
  },
  "stream": false
}'

Note that Ollama expects sampling parameters inside an "options" object; set at the top level, they are ignored.

Python (with Claude API):
import anthropic
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0.8,
    top_p=0.95,
    messages=[
        {"role": "user", "content": "Write a creative story about robots"}
    ]
)

The Testing Workflow
Don’t guess. Test with your actual use case.
#!/bin/bash
temps=(0.3 0.7 1.0)
ps=(0.8 0.9 0.95)

prompt="Write a short poem about code"

for temp in "${temps[@]}"; do
  for p in "${ps[@]}"; do
    echo "=== Temperature: $temp, top_p: $p ==="
    curl -s http://localhost:11434/api/generate -d "{
      \"model\": \"mistral\",
      \"prompt\": \"$prompt\",
      \"options\": {
        \"temperature\": $temp,
        \"top_p\": $p
      },
      \"stream\": false
    }" | python3 -c "import sys, json; print(json.load(sys.stdin)['response'])"
    echo ""
  done
done

Run this, see which combo produces output you like, then hardcode those values.
The Gotcha: Interaction Effects
Temperature and top_p interact in non-obvious ways.
temperature = 0 (deterministic)
+ top_p = 0.5 (filter tail words)
= Same result as temperature = 0 alone (top_p doesn't matter)

temperature = 1.0 (distribution unchanged)
+ top_p = 0.5 (cover 50% of the probability mass)
= Only the most likely words adding up to ~50% are available

High temperature + low top_p is a weird interaction: temperature flattens the distribution, which changes how many words fit under the top_p cutoff. Test if you use both.
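One way to see the interaction concretely: at low temperature the nucleus collapses to a single word, so top_p has nothing left to filter, while at high temperature the same top_p admits most of the vocabulary. A self-contained sketch with made-up logits:

```python
import math

def nucleus_size(logits, temperature, top_p):
    """How many words survive the top_p cutoff at a given temperature?"""
    scaled = sorted((score / temperature for score in logits), reverse=True)
    exps = [math.exp(s - scaled[0]) for s in scaled]
    total = sum(exps)
    kept, cumulative = 0, 0.0
    for e in exps:
        kept += 1
        cumulative += e / total
        if cumulative >= top_p:
            break
    return kept

logits = [4.0, 3.5, 3.0, 2.5, 2.0]  # hypothetical scores for five words

print(nucleus_size(logits, 0.1, 0.9))  # low temperature: nucleus is 1 word
print(nucleus_size(logits, 2.0, 0.9))  # high temperature: all 5 words survive
```

Same top_p, wildly different effective vocabularies — which is why the two parameters should be tuned together, not in isolation.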
Real-World Advice
- Start with temperature = 0.7, top_p = 0.95. That’s the sensible default.
- For critical tasks (customer support, technical docs), use temperature = 0.2 or lower.
- For creative tasks, push temperature to 0.8–1.0.
- If outputs are gibberish, lower temperature or top_p. One of them is letting too many bad words through.
- If outputs are boring, increase temperature. It’s the easiest tuning knob.
Don’t overthink it. Most people optimize for the wrong thing. Spend time on your prompt first, then tweak these dials if you need to.