# Prompt Caching

Prompt caching automatically reuses previously processed prompt content, reducing latency and costs for requests with repeated or similar context.

## What is Prompt Caching?

When you send a request to Freddy, the system checks whether parts of your prompt have been processed recently. If cached content exists, the model reuses those computations instead of reprocessing them, resulting in:

- **Faster responses** - Up to 80% reduction in latency for cached prompts
- **Lower costs** - 50% discount on cached input neurons
- **Automatic optimization** - No code changes required

## How It Works

```
Request 1: [System Prompt + Long Context] + User Query
           └────────────────────────────┘
                 Processed & Cached

Request 2: [Same System Prompt + Long Context] + Different User Query
           └──────────────────────────────────┘
                 Retrieved from Cache

Result: Faster response, lower cost
```

Caching is automatic and transparent - you don't need to change your code.

## Supported Models

Prompt caching is available on:

- **GPT-4.1** and newer
- **GPT-4.1-mini** and newer
- **o3-preview** and **o3-mini**
- **All fine-tuned versions** of the above models

## Pricing

Cached neurons are discounted by **50%**:

| Model | Uncached Input | Cached Input | Savings |
| --- | --- | --- | --- |
| GPT-4.1 | $2.50/1M neurons | $1.25/1M neurons | 50% |
| GPT-4.1-mini | $0.15/1M neurons | $0.075/1M neurons | 50% |
| o3-preview | $15.00/1M neurons | $7.50/1M neurons | 50% |

Output synapses are priced normally regardless of caching.

## Monitoring Cache Usage

Every response includes cache metrics in the `usage` field:

```json
{
  "usage": {
    "inputNeurons": 1500,
    "cachedNeurons": 1200,
    "outputSynapses": 150,
    "totalNeurons": 1500,
    "totalSynapses": 150
  }
}
```

**`cachedNeurons`** shows how many input neurons were retrieved from cache.

**Cache hit rate**: `cachedNeurons / inputNeurons = 80%` (in this example)

## Cache Behavior

### Cache Lifetime

- **Active cache**: expires after 5-10 minutes of inactivity
- **Maximum lifetime**: 1 hour from last use
- **Automatic cleanup**: expired caches are removed automatically

### Cache Matching

Caches match when:

- ✅ **Prompt prefix is identical** - Same text, same order
- ✅ **Same model** - Must use the exact same model ID
- ✅ **Within time window** - Cache hasn't expired

Caches DON'T match when:

- ❌ **Any text changes** - Even minor edits invalidate the cache
- ❌ **Different order** - Rearranged content won't match
- ❌ **Different model** - Changing models resets the cache
- ❌ **Cache expired** - Past the lifetime window

## Optimizing for Cache Hits

### 1. Consistent Prompt Structure

Put **static content first** and **dynamic content last**:

```json
// ✅ Good - Static content cached
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Cached
    },
    {
      "role": "user",
      "texts": [{"text": "How do I fix this error: {{user_error}}"}]  // Dynamic
    }
  ]
}

// ❌ Bad - Dynamic content breaks cache
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "Current time: {{timestamp}}"}]  // Changes every request
    },
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Won't cache
    }
  ]
}
```
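As a rough illustration of this rule, the sketch below assembles the `inputs` array with the static system prompt first and the per-request user text last, then posts it to the `/v1/model/response` endpoint shown throughout this page. The base URL, API-key header, and helper names are assumptions for the example, not part of the API.

```python
import requests

BASE_URL = "https://api.example.com"   # assumption - substitute your Freddy endpoint
API_KEY = "YOUR_API_KEY"               # assumption - auth details depend on your setup

# Identical on every call, so it forms a cacheable prefix
STATIC_SYSTEM_PROMPT = "You are an expert Python developer..."

def build_cache_friendly_request(user_text: str) -> dict:
    """Keep the static prefix first; put per-request text last."""
    return {
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": STATIC_SYSTEM_PROMPT}]},  # cacheable prefix
            {"role": "user", "texts": [{"text": user_text}]},               # dynamic suffix
        ],
    }

def send(user_text: str) -> dict:
    response = requests.post(
        f"{BASE_URL}/v1/model/response",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=build_cache_friendly_request(user_text),
    )
    response.raise_for_status()
    return response.json()

# Repeated calls share the same prefix, so after the first request the system
# prompt should be served from cache; only the user text is processed fresh.
print(send("How do I fix this error: NameError: name 'x' is not defined")["usage"])
```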
### 2. Large Static Context First

If you have documentation, examples, or guidelines, put them at the start:

```json
{
  "inputs": [
    {
      "role": "system",
      "texts": [
        {"text": "# API Documentation\n\n... (5000 neurons of docs) ..."}
      ]
    },
    {
      "role": "system",
      "texts": [
        {"text": "# Code Style Guide\n\n... (3000 neurons) ..."}
      ]
    },
    {
      "role": "user",
      "texts": [{"text": "Help me write a function"}]  // Only this part changes
    }
  ]
}
```

The 8,000 neurons of documentation are cached; only the user query (about 10 neurons) is processed fresh.

### 3. Use Threads for Conversations

Threads automatically cache conversation history:

```python
# First message in thread - builds cache
response1 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",
    "inputs": [{"role": "user", "texts": [{"text": "Hello"}]}]
})

# Follow-up - conversation history cached
response2 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",  # Previous messages cached
    "inputs": [{"role": "user", "texts": [{"text": "Tell me more"}]}]
})
```

### 4. Prompt Templates with Caching

Use prompt templates for reusable, cacheable instructions:

```json
{
  "model": "gpt-4.1",
  "prompt": {
    "id": "prompt_code_reviewer",  // Template content gets cached
    "variables": {
      "language": "Python",  // Only variables change
      "focus": "security"
    }
  },
  "inputs": [
    {"role": "user", "texts": [{"text": "Review this code: ..."}]}
  ]
}
```

## Use Cases

### Documentation Q&A

```python
# Embed large documentation in the system prompt
DOCS = """
[5000 neurons of product documentation]
"""

def answer_question(question):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": DOCS}]},  # Cached after first call
            {"role": "user", "texts": [{"text": question}]}
        ]
    })

# First call: full processing
answer_question("What is feature X?")        # cachedNeurons: 0

# Subsequent calls: cached docs
answer_question("How do I use feature Y?")   # cachedNeurons: 5000
answer_question("What's the pricing?")       # cachedNeurons: 5000
```

### Code Analysis

```python
# Provide codebase context once
CODEBASE_CONTEXT = """
[Large codebase structure and key files]
"""

def analyze_code(code_snippet):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": CODEBASE_CONTEXT}]},  # Cached
            {"role": "user", "texts": [{"text": f"Analyze: {code_snippet}"}]}
        ]
    })
```

### Multi-Turn Conversations

```javascript
// First message - establishes cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',
  inputs: [
    {
      role: 'system',
      texts: [{ text: customerContext }]  // Customer history cached
    },
    { role: 'user', texts: [{ text: 'I need help' }] }
  ]
});

// Follow-ups automatically benefit from cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',  // Previous context cached
  inputs: [
    { role: 'user', texts: [{ text: 'Can you clarify?' }] }
  ]
});
```
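A simple way to confirm these patterns are working is to send two requests that share the same prefix and compare their `cachedNeurons` values, which leads into the analytics below. This is a minimal sketch, assuming a reachable base URL and bearer-token auth; the helper name and placeholder documentation string are hypothetical.

```python
import requests

BASE_URL = "https://api.example.com"                 # assumption - substitute your Freddy endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumption - auth may differ in your setup

STATIC_DOCS = "# Product documentation\n..."         # large, unchanging prefix

def ask(question: str) -> dict:
    """Send a request with the static docs first and return the usage block."""
    payload = {
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": STATIC_DOCS}]},
            {"role": "user", "texts": [{"text": question}]},
        ],
    }
    resp = requests.post(f"{BASE_URL}/v1/model/response", headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()["usage"]

first = ask("What is feature X?")
second = ask("How do I use feature Y?")

# The first call should report cachedNeurons == 0; the second should report a
# value close to the size of the shared documentation prefix.
print("first call :", first["cachedNeurons"], "of", first["inputNeurons"], "cached")
print("second call:", second["cachedNeurons"], "of", second["inputNeurons"], "cached")
```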
## Cache Analytics

### Tracking Cache Performance

```python
import requests

response = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "inputs": [...]
})

data = response.json()
usage = data['usage']

cache_hit_rate = usage['cachedNeurons'] / usage['inputNeurons'] * 100

# GPT-4.1 uncached input is $2.50 per 1M neurons; cached neurons are 50% off,
# so each cached neuron saves $1.25 per 1M.
cost_savings = usage['cachedNeurons'] * 2.50 * 0.50 / 1_000_000

print(f"Cache hit rate: {cache_hit_rate:.1f}%")
print(f"Neurons saved: {usage['cachedNeurons']}")
print(f"Cost savings: ${cost_savings:.4f}")
```

### Logging Cache Metrics

```javascript
const logCacheMetrics = (response) => {
  const { usage } = response;
  const hitRate = (usage.cachedNeurons / usage.inputNeurons) * 100;

  console.log({
    timestamp: new Date(),
    model: response.model,
    inputNeurons: usage.inputNeurons,
    cachedNeurons: usage.cachedNeurons,
    cacheHitRate: `${hitRate.toFixed(1)}%`,
    estimatedSavings: (usage.cachedNeurons * 0.0000025 * 0.5).toFixed(4)  // GPT-4.1 pricing
  });
};
```

## Best Practices

### ✅ DO

- **Place static content first** - System prompts, docs, and examples before user input
- **Reuse prompts** - The same prefix across requests maximizes cache hits
- **Use threads** - Conversation history is cached automatically
- **Monitor metrics** - Track `cachedNeurons` to measure effectiveness
- **Batch similar requests** - Process related queries while the cache is hot
- **Keep prompts stable** - Minor edits break caching

### ❌ DON'T

- **Add timestamps** - Dynamic content at the start breaks caching
- **Randomize order** - Shuffling prompt structure prevents cache hits
- **Overthink it** - Caching is automatic; don't over-engineer
- **Expect savings on very short prompts** - Below roughly 100 neurons, the overhead may exceed the benefit
- **Rely on long cache lifetimes** - Caches expire after 5-10 minutes of inactivity

## Limitations

- **Cache expiration**: 5-10 minutes of inactivity, 1 hour maximum
- **Exact matching**: Even minor changes invalidate the cache
- **No cross-model caching**: Different models have separate caches
- **No manual control**: You can't explicitly force or clear caches
- **Prefix-only**: Only the beginning of a prompt can be cached, not the middle or end

## Cost-Benefit Analysis

### When Caching Provides High Value

- ✅ **Large static context** (>1,000 neurons)
- ✅ **Frequent requests** (multiple per minute)
- ✅ **Similar prompts** (same documentation/instructions)
- ✅ **Multi-turn conversations** (thread-based)

### When Caching Provides Low Value

- ⚠️ **Unique prompts every time**
- ⚠️ **Very short prompts** (<100 neurons)
- ⚠️ **Infrequent requests** (minutes apart)
- ⚠️ **Highly dynamic content** (always changing)

## FAQ

**Q: Do I need to enable caching?**
A: No, it's automatic on supported models.

**Q: Can I disable caching?**
A: No, but dynamic content naturally prevents caching.

**Q: How long do caches last?**
A: 5-10 minutes of inactivity, maximum 1 hour.

**Q: Why isn't my prompt caching?**
A: Check for dynamic content (timestamps, IDs) at the start of the prompt, make sure the prefix matches exactly, and verify the cache hasn't expired.

**Q: Do threads help with caching?**
A: Yes! Threads automatically cache conversation history across requests.

**Q: Is cached content less accurate?**
A: No, cached responses are identical to non-cached ones.

**Related:**

- [Synapses and Neurons](/docs/documentation/core-concepts/synapses-and-tokens) - Understanding usage metrics
- [Prompt Templates](/docs/documentation/core-concepts/prompt-templates) - Reusable prompts that cache well
- [Threads](/docs/documentation/core-concepts/threads) - Automatic caching for conversations
- [Best Practices](/docs/documentation/best-practices) - Optimization strategies