The Intuition (Skip the Formulas)
When an LLM generates text, it doesn’t just pick the most likely next word. At each step, it has a probability distribution over possible next words.
Most of the time, the top word is way more likely than the others. But sometimes, you want variety. That’s where temperature and top_p come in.
Temperature = “How boring or creative is the model?”
top_p = “How much of the probability distribution do I care about?”
Both control the same thing from different angles.
Temperature: The Knob Everyone Knows
Temperature ranges from 0 to ~2 (though values above 1 are rare).
Temperature = 0
├─ Always picks the most likely word (deterministic)
└─ Output: Repetitive, predictable, sometimes dull

Temperature = 0.7 (default for most APIs)
├─ Picks likely words, but allows some variation
└─ Output: Natural, conversational, slight randomness

Temperature = 1.2
├─ The distribution flattens; unlikely words get a real chance
└─ Output: Creative but sometimes nonsensical

Temperature = 2.0
├─ Chaos — essentially random
└─ Output: Gibberish

Practical Examples
Task: Code generation

# Temperature = 0
# Output: Always the same code pattern, might miss better solutions

# Temperature = 0.5
# Output: Same structure, slightly varied variable names

Task: Creative writing

# Temperature = 0
# Output: "The sun rose over the hills. It was a sunny day."
# (Boring, but coherent)

# Temperature = 0.9
# Output: "The sun erupted like liquid gold, shattering shadows."
# (More interesting, still coherent)

# Temperature = 1.5
# Output: "Sun purple-blazed quantum elephant mountains."
# (Creative but nonsensical)

Rule of thumb:
- 0–0.3: Analytical tasks (code, math, factual)
- 0.7: Default (chatting, writing)
- 1.0+: Creative tasks (brainstorming, art)
- >1.5: You’re probably doing it wrong
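Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch in pure Python — the logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # hypothetical scores for three candidate words

print(softmax_with_temperature(logits, 0.5))  # sharper: top word dominates
print(softmax_with_temperature(logits, 1.0))  # distribution unchanged
print(softmax_with_temperature(logits, 2.0))  # flatter: more variety
```

Dividing by a temperature below 1 exaggerates the gap between scores; dividing by one above 1 shrinks it. That is the whole mechanism behind "boring vs. creative."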
top_p: The Advanced Knob
top_p (nucleus sampling) filters the probability distribution differently.
Instead of tweaking how sharp/flat the distribution is, top_p says: “Include the most likely words until you’ve covered P% of the probability mass.”
Imagine the model predicts:

"the" — 30% likely
"a" — 25% likely
"some" — 20% likely
"an" — 15% likely
"one" — 5% likely
"that" — 3% likely
"something" — 2% likely
...

top_p = 0.9 (cover 90% of probability)
├─ Include: "the", "a", "some", "an" (90% total)
└─ Exclude: "one", "that", "something"...

top_p = 0.5 (cover 50% of probability)
├─ Include: "the", "a" (55% total)
└─ Exclude: "some", "an", etc.

Temperature vs. top_p: When to Use Each
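The cutoff works like a simple filter over the ranked word list. A sketch using the (hypothetical) probabilities from the example above:

```python
def nucleus_filter(word_probs, top_p):
    """Keep the most likely words until cumulative probability reaches top_p."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append(word)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.30, "a": 0.25, "some": 0.20, "an": 0.15,
         "one": 0.05, "that": 0.03, "something": 0.02}

print(nucleus_filter(probs, 0.9))  # ['the', 'a', 'some', 'an']
print(nucleus_filter(probs, 0.5))  # ['the', 'a']
```

The model then samples only from the surviving words (after renormalizing their probabilities), which is what cuts off the weird long tail.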
Temperature:
- Use when you want simple, intuitive control
- “Make it more creative” → increase temperature
- Works everywhere
top_p:
- Use when you want to eliminate tail words (weird outputs)
- “Prevent nonsense but allow variety” → set top_p = 0.9
- More sophisticated, less intuitive
Common patterns:
Analytical (code, facts):
temperature = 0.2
top_p = 0.9

Conversational (chatbots):
temperature = 0.7
top_p = 0.95

Creative (brainstorming):
temperature = 0.95
top_p = 0.9

Interestingly, using both together works better than either alone. top_p keeps you from generating garbage, while temperature adds variation.
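Putting the two knobs together, a full sampling step looks like this: temperature reshapes the distribution first, then top_p truncates the tail, and only then is a word drawn. A self-contained sketch — the vocabulary and logits are invented for illustration:

```python
import math
import random

def sample(logits, temperature=0.7, top_p=0.95):
    """Temperature-scale logits, truncate to the top_p nucleus, then sample."""
    words = list(logits)
    scaled = [logits[w] / temperature for w in words]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(zip(words, (e / total for e in exps)),
                    key=lambda wp: wp[1], reverse=True)
    # keep words until cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for word, p in ranked:
        nucleus.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # renormalize the surviving words and draw one
    norm = sum(p for _, p in nucleus)
    return random.choices([w for w, _ in nucleus],
                          weights=[p / norm for _, p in nucleus])[0]

logits = {"the": 4.0, "a": 3.5, "some": 3.0, "zxqv": -2.0}  # hypothetical
print(sample(logits, temperature=0.7, top_p=0.9))
```

With these numbers, the junk token "zxqv" carries so little probability that the nucleus cutoff removes it before sampling, while temperature still varies which of the plausible words gets picked.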
How to Set Them in Practice
Ollama:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain quantum computing",
  "options": {
    "temperature": 0.7,
    "top_p": 0.9
  },
  "stream": false
}'

Note that Ollama expects sampling parameters inside an "options" object; set at the top level, they are ignored.

Python (with Claude API):
import anthropic
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0.8,
    top_p=0.95,
    messages=[
        {"role": "user", "content": "Write a creative story about robots"}
    ]
)

The Testing Workflow
Don’t guess. Test with your actual use case.
#!/bin/bash
temps=(0.3 0.7 1.0)
ps=(0.8 0.9 0.95)

prompt="Write a short poem about code"

for temp in "${temps[@]}"; do
  for p in "${ps[@]}"; do
    echo "=== Temperature: $temp, top_p: $p ==="
    curl -s http://localhost:11434/api/generate -d "{
      \"model\": \"mistral\",
      \"prompt\": \"$prompt\",
      \"options\": {
        \"temperature\": $temp,
        \"top_p\": $p
      },
      \"stream\": false
    }" | python3 -c "import sys, json; print(json.load(sys.stdin)['response'])"
    echo ""
  done
done

Run this, see which combo produces output you like, then hardcode those values.
The Gotcha: Interaction Effects
Temperature and top_p interact in non-obvious ways.
temperature = 0 (deterministic)
+ top_p = 0.5 (filter tail words)
= Same result as temperature = 0 alone (top_p doesn't matter)

temperature = 1.0 (distribution unchanged)
+ top_p = 0.5 (cover 50% of the probability mass)
= Only the most likely words adding up to ~50% are available

High temperature + low top_p is a weird interaction: temperature flattens the distribution, which changes how many words fit under the top_p cutoff. Test if you use both.
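One way to see the interaction concretely: at low temperature the nucleus collapses to a single word, so top_p has nothing left to filter, while at high temperature the same top_p admits most of the vocabulary. A self-contained sketch with made-up logits:

```python
import math

def nucleus_size(logits, temperature, top_p):
    """How many words survive the top_p cutoff at a given temperature?"""
    scaled = sorted((score / temperature for score in logits), reverse=True)
    exps = [math.exp(s - scaled[0]) for s in scaled]
    total = sum(exps)
    kept, cumulative = 0, 0.0
    for e in exps:
        kept += 1
        cumulative += e / total
        if cumulative >= top_p:
            break
    return kept

logits = [4.0, 3.5, 3.0, 2.5, 2.0]  # hypothetical scores for five words

print(nucleus_size(logits, 0.1, 0.9))  # low temperature: nucleus is 1 word
print(nucleus_size(logits, 2.0, 0.9))  # high temperature: all 5 words survive
```

Same top_p, wildly different effective vocabularies — which is why the two parameters should be tuned together, not in isolation.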
Real-World Advice
- Start with temperature = 0.7, top_p = 0.95. That’s the sensible default.
- For critical tasks (customer support, technical docs), use temperature = 0.2 or lower.
- For creative tasks, push temperature to 0.8–1.0.
- If outputs are gibberish, lower temperature or top_p. One of them is letting too many bad words through.
- If outputs are boring, increase temperature. It’s the easiest tuning knob.
Don’t overthink it. Most people optimize for the wrong thing. Spend time on your prompt first, then tweak these dials if you need to.