Prompt caching automatically reuses previously processed prompt content, reducing latency and costs for requests with repeated or similar context.
When you send a request to Freddy, the system checks if parts of your prompt have been processed recently. If cached content exists, the model reuses those computations instead of reprocessing them, resulting in:
- Faster responses - Up to 80% reduction in latency for cached prompts
- Lower costs - 50% discount on cached input neurons
- Automatic optimization - No code changes required
```
Request 1: [System Prompt + Long Context] + User Query
           └─────────────────────────────┘
                  Processed & Cached

Request 2: [Same System Prompt + Long Context] + Different User Query
           └────────────────────────────────┘
                  Retrieved from Cache

Result: Faster response, lower cost
```

Caching is automatic and transparent - you don't need to change your code.
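For example, sending the same request twice in quick succession should show the cache kicking in on the second call. A minimal sketch (the prompt text is illustrative):

```python
import requests

payload = {
    "model": "gpt-4.1",
    "inputs": [
        {"role": "system", "texts": [{"text": "You are a helpful assistant."}]},
        {"role": "user", "texts": [{"text": "Summarize our refund policy."}]},
    ],
}

# First call processes the full prompt; the repeat reuses the identical
# prefix, so its usage should report cachedNeurons > 0.
first = requests.post('/v1/model/response', json=payload).json()
second = requests.post('/v1/model/response', json=payload).json()
print(first['usage']['cachedNeurons'], second['usage']['cachedNeurons'])
```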
Prompt caching is available on:
- GPT-4.1 and newer
- GPT-4.1-mini and newer
- o3-preview and o3-mini
- All fine-tuned versions of the above models
Cached neurons are discounted by 50%:
| Model | Uncached Input | Cached Input | Savings |
|---|---|---|---|
| GPT-4.1 | $2.50/1M neurons | $1.25/1M neurons | 50% |
| GPT-4.1-mini | $0.15/1M neurons | $0.075/1M neurons | 50% |
| o3-preview | $15.00/1M neurons | $7.50/1M neurons | 50% |
Output synapses are priced normally regardless of caching.
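To see what the discount means in dollars, here is a worked example using the GPT-4.1 rates from the table above (a sketch; only input neurons are discounted):

```python
# Cost of a 10,000-neuron GPT-4.1 prompt where 9,000 neurons hit the cache.
UNCACHED_RATE = 2.50 / 1_000_000   # $ per uncached input neuron
CACHED_RATE = 1.25 / 1_000_000     # $ per cached input neuron

input_neurons = 10_000
cached_neurons = 9_000

cost_without_cache = input_neurons * UNCACHED_RATE
cost_with_cache = (cached_neurons * CACHED_RATE
                   + (input_neurons - cached_neurons) * UNCACHED_RATE)

print(f"Without cache: ${cost_without_cache:.6f}")  # $0.025000
print(f"With cache:    ${cost_with_cache:.6f}")     # $0.013750
# Total savings here is 45%, not 50%, because 1,000 neurons were uncached.
```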
Every response includes cache metrics in the usage field:
```json
{
"usage": {
"inputNeurons": 1500,
"cachedNeurons": 1200,
"outputSynapses": 150,
"totalNeurons": 1500,
"totalSynapses": 150
}
}
```

`cachedNeurons` shows how many input neurons were retrieved from cache.

Cache hit rate: `cachedNeurons` / `inputNeurons` = 1200 / 1500 = 80% in this example.
Cache lifetime:

- Active cache: expires after 5-10 minutes of inactivity
- Maximum lifetime: 1 hour from last use
- Automatic cleanup: expired caches are removed; no action is required
Caches match when:
✅ Prompt prefix is identical - Same text, same order
✅ Same model - Must use the exact same model ID
✅ Within time window - Cache hasn't expired
Caches DON'T match when:
❌ Any text changes - Even minor edits invalidate cache
❌ Different order - Rearranged content won't match
❌ Different model - Model changes reset cache
❌ Cache expired - Past the lifetime window
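One way to keep the prefix identical across requests is to build it from a fixed constant and append anything dynamic at the end. A minimal sketch (the `build_inputs` helper and `SYSTEM_PROMPT` constant are illustrative, not part of the API):

```python
# Keep the cacheable prefix byte-for-byte identical across requests by
# defining it once and appending dynamic content only at the end.
SYSTEM_PROMPT = "You are an expert Python developer..."  # never changes

def build_inputs(user_text, timestamp):
    return [
        # Identical prefix on every request -> eligible for a cache hit.
        {"role": "system", "texts": [{"text": SYSTEM_PROMPT}]},
        # Dynamic values go in the final user message, after the prefix:
        {"role": "user", "texts": [{"text": f"[{timestamp}] {user_text}"}]},
    ]
```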
Put static content first, dynamic content last:
```jsonc
// ✅ Good - Static content cached
{
"inputs": [
{
"role": "system",
"texts": [{"text": "You are an expert Python developer..."}] // Cached
},
{
"role": "user",
"texts": [{"text": "How do I fix this error: {{user_error}}"}] // Dynamic
}
]
}
// ❌ Bad - Dynamic content breaks cache
{
"inputs": [
{
"role": "system",
"texts": [{"text": "Current time: {{timestamp}}"}] // Changes every request
},
{
"role": "system",
"texts": [{"text": "You are an expert Python developer..."}] // Won't cache
}
]
}
```

If you have documentation, examples, or guidelines, put them at the start:

```jsonc
{
"inputs": [
{
"role": "system",
"texts": [
{"text": "# API Documentation\n\n... (5000 tokens of docs) ..."}
]
},
{
"role": "system",
"texts": [
{"text": "# Code Style Guide\n\n... (3000 tokens) ..."}
]
},
{
"role": "user",
"texts": [{"text": "Help me write a function"}] // Only this part changes
}
]
}
```

The 8,000 neurons of documentation get cached; only the user query (about 10 neurons) is processed fresh.
Threads automatically cache conversation history:
```python
# First message in thread - builds cache
response1 = requests.post('/v1/model/response', json={
"model": "gpt-4.1",
"thread": "thread_user123",
"inputs": [{"role": "user", "texts": [{"text": "Hello"}]}]
})
# Follow-up - conversation history cached
response2 = requests.post('/v1/model/response', json={
"model": "gpt-4.1",
"thread": "thread_user123", # Previous messages cached
"inputs": [{"role": "user", "texts": [{"text": "Tell me more"}]}]
})
```

Use prompt templates for reusable, cacheable instructions:

```jsonc
{
"model": "gpt-4.1",
"prompt": {
"id": "prompt_code_reviewer", // Template content gets cached
"variables": {
"language": "Python", // Only variables change
"focus": "security"
}
},
"inputs": [
{"role": "user", "texts": [{"text": "Review this code: ..."}]}
]
}
```

For a documentation assistant, embedding the large static docs in the system prompt means every call after the first reuses the cache:

```python
# Embed large documentation in system prompt
DOCS = """
[5000 tokens of product documentation]
"""
def answer_question(question):
return requests.post('/v1/model/response', json={
"model": "gpt-4.1",
"inputs": [
{"role": "system", "texts": [{"text": DOCS}]}, # Cached after first call
{"role": "user", "texts": [{"text": question}]}
]
})
# First call: Full processing
answer_question("What is feature X?") # cachedNeurons: 0
# Subsequent calls: Cached docs
answer_question("How do I use feature Y?") # cachedNeurons: 5000
answer_question("What's the pricing?") # cachedNeurons: 5000# Provide codebase context once
CODEBASE_CONTEXT = """
[Large codebase structure and key files]
"""
def analyze_code(code_snippet):
return requests.post('/v1/model/response', json={
"model": "gpt-4.1",
"inputs": [
{"role": "system", "texts": [{"text": CODEBASE_CONTEXT}]}, # Cached
{"role": "user", "texts": [{"text": f"Analyze: {code_snippet}"}]}
]
})
```

For multi-turn support, the thread caches the customer context after the first message:

```javascript
// First message - establishes cache
await createResponse({
model: 'gpt-4.1',
thread: 'support_ticket_123',
inputs: [
{
role: 'system',
texts: [{ text: customerContext }] // Customer history cached
},
{
role: 'user',
texts: [{ text: 'I need help' }]
}
]
});
// Follow-ups automatically benefit from cache
await createResponse({
model: 'gpt-4.1',
thread: 'support_ticket_123', // Previous context cached
inputs: [
{ role: 'user', texts: [{ text: 'Can you clarify?' }] }
]
});
```

To measure how well caching is working, read the `usage` field on each response:

```python
import requests
response = requests.post('/v1/model/response', json={
"model": "gpt-4.1",
"inputs": [...]
})
data = response.json()
usage = data['usage']
cache_hit_rate = usage['cachedNeurons'] / usage['inputNeurons'] * 100
cost_savings = usage['cachedNeurons'] * 2.50 * 0.50 / 1_000_000  # 50% off GPT-4.1's $2.50/1M rate
print(f"Cache hit rate: {cache_hit_rate}%")
print(f"Neurons saved: {usage['cachedNeurons']}")
print(f"Cost savings: ${cost_savings / 1_000_000:.4f}")const logCacheMetrics = (response) => {
const { usage } = response;
const hitRate = (usage.cachedNeurons / usage.inputNeurons) * 100;
console.log({
timestamp: new Date(),
model: response.model,
inputNeurons: usage.inputNeurons,
cachedNeurons: usage.cachedNeurons,
cacheHitRate: `${hitRate.toFixed(1)}%`,
estimatedSavings: (usage.cachedNeurons * 0.0000025 * 0.5).toFixed(4) // GPT-4.1 pricing
});
};
```

Do:

- Place static content first - System prompts, docs, examples before user input
- Reuse prompts - Same prefix across requests maximizes cache hits
- Use threads - Automatic caching of conversation history
- Monitor metrics - Track `cachedNeurons` to measure effectiveness
- Batch similar requests - Process related queries while the cache is hot (a sketch follows the Don't list below)
- Keep prompts stable - Minor edits break caching
Don't:

- Add timestamps - Dynamic content at the start breaks caching
- Randomize order - Shuffling prompt structure prevents cache hits
- Overthink it - Caching is automatic, don't over-engineer
- Cache very short prompts - Overhead may exceed benefits (<100 neurons)
- Rely on long cache lifetimes - Caches expire after 5-10 minutes
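As a sketch of the batching idea: running related queries back-to-back keeps later calls inside the 5-10 minute cache window (the `ask` helper is illustrative):

```python
import requests

DOCS = "..."  # large static context, identical for every query

def ask(question):
    # Same prefix every call; only the user question changes.
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": DOCS}]},
            {"role": "user", "texts": [{"text": question}]},
        ],
    }).json()

# Back-to-back calls stay inside the cache window: the first pays full
# price, and the rest should report high cachedNeurons.
for q in ["What is feature X?", "How does billing work?", "Is there an SLA?"]:
    print(q, '->', ask(q)['usage']['cachedNeurons'])
```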
Limitations to keep in mind:

- Cache expiration: 5-10 minutes of inactivity, 1 hour maximum
- Exact matching: Even minor changes invalidate cache
- No cross-model caching: Different models have separate caches
- No manual control: Can't explicitly force or clear caches
- Prefix-only: Only the beginning of prompts can be cached, not middle or end
Caching is a good fit when you have:

✅ Large static context (>1000 neurons)
✅ Frequent requests (multiple per minute)
✅ Similar prompts (same documentation/instructions)
✅ Multi-turn conversations (thread-based)
Caching helps less with:

⚠️ Unique prompts every time
⚠️ Very short prompts (<100 neurons)
⚠️ Infrequent requests (minutes apart)
⚠️ Highly dynamic content (always changing)
Q: Do I need to enable caching?
A: No, it's automatic on supported models.
Q: Can I disable caching?
A: No, but dynamic content naturally prevents caching.
Q: How long do caches last?
A: 5-10 minutes of inactivity, maximum 1 hour.
Q: Why isn't my prompt caching?
A: Check for dynamic content (timestamps, IDs) at the start, ensure exact prompt matching, and verify the cache hasn't expired. A quick runtime check is sketched after this FAQ.
Q: Do threads help with caching?
A: Yes! Threads automatically cache conversation history across requests.
Q: Is cached content less accurate?
A: No, cached responses are identical to non-cached ones.
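To catch misses in code rather than by eye, here is a minimal diagnostic sketch (the `check_cache` helper and its 50% threshold are illustrative, not part of the API):

```python
def check_cache(response_json, threshold=0.5):
    """Warn when a response's cache hit rate falls below the threshold."""
    usage = response_json["usage"]
    hit_rate = usage["cachedNeurons"] / max(usage["inputNeurons"], 1)
    if hit_rate < threshold:
        print(f"Low cache hit rate ({hit_rate:.0%}). Check for dynamic "
              "content at the start of the prompt, a changed model ID, "
              "or an expired cache (5-10 minutes of inactivity).")
```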
Related:
- Synapses and Neurons - Understanding usage metrics
- Prompt Templates - Reusable prompts that cache well
- Threads - Automatic caching for conversations
- Best Practices - Optimization strategies