
Prompt caching automatically reuses previously processed prompt content, reducing latency and costs for requests with repeated or similar context.

What is Prompt Caching?

When you send a request to Freddy, the system checks if parts of your prompt have been processed recently. If cached content exists, the model reuses those computations instead of reprocessing them, resulting in:

  • Faster responses - Up to 80% reduction in latency for cached prompts
  • Lower costs - 50% discount on cached input neurons
  • Automatic optimization - No code changes required

How It Works

Request 1: [System Prompt + Long Context] + User Query
           └─────────────────────────────┘
                 Processed & Cached
                 
Request 2: [Same System Prompt + Long Context] + Different User Query
           └────────────────────────────────┘
                   Retrieved from Cache
                   
Result: Faster response, lower cost

Caching is automatic and transparent - you don't need to change your code.

Supported Models

Prompt caching is available on:

  • GPT-4.1 and newer
  • GPT-4.1-mini and newer
  • o3-preview and o3-mini
  • All fine-tuned versions of the above models

Pricing

Cached neurons are discounted by 50%:

Model         Uncached Input       Cached Input         Savings
GPT-4.1       $2.50/1M neurons     $1.25/1M neurons     50%
GPT-4.1-mini  $0.15/1M neurons     $0.075/1M neurons    50%
o3-preview    $15.00/1M neurons    $7.50/1M neurons     50%

Output synapses are priced normally regardless of caching.
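
As a quick worked example (a rough per-request estimate using the GPT-4.1 rates above; the neuron counts are illustrative, not measured):

# Cost of a 10,000-neuron prompt on GPT-4.1 where 8,000 neurons hit the cache
UNCACHED_RATE = 2.50 / 1_000_000   # $ per uncached input neuron
CACHED_RATE   = 1.25 / 1_000_000   # $ per cached input neuron (50% discount)

cached, fresh = 8_000, 2_000
with_cache    = cached * CACHED_RATE + fresh * UNCACHED_RATE   # $0.0150
without_cache = (cached + fresh) * UNCACHED_RATE               # $0.0250

print(f"with cache: ${with_cache:.4f}, without cache: ${without_cache:.4f}")
# Input cost drops by 40% for this request; output synapses are unaffected.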

Monitoring Cache Usage

Every response includes cache metrics in the usage field:

{
  "usage": {
    "inputNeurons": 1500,
    "cachedNeurons": 1200,
    "outputSynapses": 150,
    "totalNeurons": 1500,
    "totalSynapses": 150
  }
}

cachedNeurons shows how many input neurons were retrieved from cache.

Cache hit rate: cachedNeurons / inputNeurons = 1200 / 1500 = 80% in this example.

Cache Behavior

Cache Lifetime

  • Active cache: entries expire after 5-10 minutes of inactivity
  • Maximum lifetime: 1 hour from last use
  • Automatic cleanup: expired caches are removed automatically

Cache Matching

Caches match when:

  • Prompt prefix is identical - Same text, same order
  • Same model - Must use the exact same model ID
  • Within time window - Cache hasn't expired

Caches DON'T match when:

  • Any text changes - Even minor edits invalidate the cache (see the sketch after this list)
  • Different order - Rearranged content won't match
  • Different model - Switching models resets the cache
  • Cache expired - Past the lifetime window
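
A minimal sketch of the most common mismatch: placing anything dynamic before the static prefix. The payload fields mirror the JSON examples later in this guide; the timestamp is just an illustration of dynamic content.

from datetime import datetime, timezone

SYSTEM_PROMPT = "You are an expert Python developer..."  # long, static instructions

# ❌ The prefix changes on every request, so it never matches a cached entry.
bad_inputs = [
    {"role": "system", "texts": [{"text": f"Current time: {datetime.now(timezone.utc)}"}]},
    {"role": "system", "texts": [{"text": SYSTEM_PROMPT}]},
]

# ✅ Static prefix first; dynamic details ride along with the user turn instead.
good_inputs = [
    {"role": "system", "texts": [{"text": SYSTEM_PROMPT}]},
    {"role": "user", "texts": [{"text": f"(asked at {datetime.now(timezone.utc)}) How do I fix this error?"}]},
]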

Optimizing for Cache Hits

1. Consistent Prompt Structure

Put static content first, dynamic content last:

// ✅ Good - Static content cached
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Cached
    },
    {
      "role": "user",
      "texts": [{"text": "How do I fix this error: {{user_error}}"}]  // Dynamic
    }
  ]
}

// ❌ Bad - Dynamic content breaks cache
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "Current time: {{timestamp}}"}]  // Changes every request
    },
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Won't cache
    }
  ]
}

2. Large Static Context First

If you have documentation, examples, or guidelines - put them at the start:

{
  "inputs": [
    {
      "role": "system",
      "texts": [
        {"text": "# API Documentation\n\n... (5000 tokens of docs) ..."}
      ]
    },
    {
      "role": "system",
      "texts": [
        {"text": "# Code Style Guide\n\n... (3000 tokens) ..."}
      ]
    },
    {
      "role": "user",
      "texts": [{"text": "Help me write a function"}]  // Only this part changes
    }
  ]
}

The 8,000 neurons of documentation are cached after the first request; only the user query (roughly 10 neurons) is processed fresh on each subsequent call.

3. Use Threads for Conversations

Threads automatically cache conversation history:

# First message in thread - builds cache
response1 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",
    "inputs": [{"role": "user", "texts": [{"text": "Hello"}]}]
})

# Follow-up - conversation history cached
response2 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",  # Previous messages cached
    "inputs": [{"role": "user", "texts": [{"text": "Tell me more"}]}]
})

4. Prompt Templates with Caching

Use prompt templates for reusable, cacheable instructions:

{
  "model": "gpt-4.1",
  "prompt": {
    "id": "prompt_code_reviewer",  // Template content gets cached
    "variables": {
      "language": "Python",  // Only variables change
      "focus": "security"
    }
  },
  "inputs": [
    {"role": "user", "texts": [{"text": "Review this code: ..."}]}
  ]
}
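
A sketch of what repeated calls might look like, assuming the same payload shape as the JSON above (the template id and variable names are taken from that example; the helper and code snippets are hypothetical):

import requests

def review(code_snippet: str):
    # The template's rendered instructions form a stable prefix that can be
    # cached; only the user turn changes from call to call.
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "prompt": {
            "id": "prompt_code_reviewer",
            "variables": {"language": "Python", "focus": "security"}
        },
        "inputs": [
            {"role": "user", "texts": [{"text": f"Review this code: {code_snippet}"}]}
        ]
    })

review("def load(path): ...")   # first call: template content processed and cached
review("def save(path): ...")   # later calls within the window: template prefix cached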

Use Cases

Documentation Q&A

# Embed large documentation in system prompt
DOCS = """
[5000 neurons of product documentation]
"""

def answer_question(question):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": DOCS}]},  # Cached after first call
            {"role": "user", "texts": [{"text": question}]}
        ]
    })

# First call: Full processing
answer_question("What is feature X?")  # cachedNeurons: 0

# Subsequent calls: Cached docs
answer_question("How do I use feature Y?")  # cachedNeurons: 5000
answer_question("What's the pricing?")  # cachedNeurons: 5000

Code Analysis

# Provide codebase context once
CODEBASE_CONTEXT = """
[Large codebase structure and key files]
"""

def analyze_code(code_snippet):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": CODEBASE_CONTEXT}]},  # Cached
            {"role": "user", "texts": [{"text": f"Analyze: {code_snippet}"}]}
        ]
    })

Multi-Turn Conversations

// First message - establishes cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',
  inputs: [
    {
      role: 'system',
      texts: [{ text: customerContext }]  // Customer history cached
    },
    {
      role: 'user',
      texts: [{ text: 'I need help' }]
    }
  ]
});

// Follow-ups automatically benefit from cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',  // Previous context cached
  inputs: [
    { role: 'user', texts: [{ text: 'Can you clarify?' }] }
  ]
});

Cache Analytics

Tracking Cache Performance

import requests

response = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "inputs": [...]
})

data = response.json()
usage = data['usage']

cache_hit_rate = usage['cachedNeurons'] / usage['inputNeurons'] * 100
# 50% discount on GPT-4.1's $2.50/1M uncached input rate
cost_savings = usage['cachedNeurons'] * 2.50 * 0.50 / 1_000_000

print(f"Cache hit rate: {cache_hit_rate:.1f}%")
print(f"Neurons saved: {usage['cachedNeurons']}")
print(f"Cost savings: ${cost_savings:.4f}")

Logging Cache Metrics

const logCacheMetrics = (response) => {
  const { usage } = response;
  const hitRate = (usage.cachedNeurons / usage.inputNeurons) * 100;
  
  console.log({
    timestamp: new Date(),
    model: response.model,
    inputNeurons: usage.inputNeurons,
    cachedNeurons: usage.cachedNeurons,
    cacheHitRate: `${hitRate.toFixed(1)}%`,
    estimatedSavings: (usage.cachedNeurons * 0.0000025 * 0.5).toFixed(4)  // GPT-4.1 pricing
  });
};

Best Practices

✅ DO

  • Place static content first - System prompts, docs, examples before user input
  • Reuse prompts - Same prefix across requests maximizes cache hits
  • Use threads - Automatic caching of conversation history
  • Monitor metrics - Track cachedNeurons to measure effectiveness
  • Batch similar requests - Process related queries while the cache is hot (see the sketch after this list)
  • Keep prompts stable - Minor edits break caching
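
One way to batch by shared prefix, as referenced above: group queries that reuse the same static context and send each group back to back, so follow-ups land inside the 5-10 minute cache window. A minimal sketch reusing the request shape from the examples in this guide; the workload, prefix keys, and helper are hypothetical.

import requests
from collections import defaultdict

PREFIXES = {
    "docs_v1": "# Product documentation\n...",      # large static contexts,
    "style_guide": "# Code style guide\n...",       # each cached after first use
}

def send_request(prefix_key: str, question: str):
    # Same request shape as the examples in this guide; the shared prefix goes
    # first so every request in a group has an identical, cacheable prefix.
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": PREFIXES[prefix_key]}]},
            {"role": "user", "texts": [{"text": question}]}
        ]
    })

# Hypothetical mixed workload: group by shared prefix and send each group
# back to back, so follow-ups land inside the 5-10 minute cache window.
jobs = [("docs_v1", "What is feature X?"), ("style_guide", "Review this function"),
        ("docs_v1", "How do I enable feature Y?"), ("style_guide", "Is this naming OK?")]

grouped = defaultdict(list)
for prefix_key, question in jobs:
    grouped[prefix_key].append(question)

for prefix_key, questions in grouped.items():
    for question in questions:
        send_request(prefix_key, question)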

❌ DON'T

  • Add timestamps - Dynamic content at the start breaks caching
  • Randomize order - Shuffling prompt structure prevents cache hits
  • Overthink it - Caching is automatic, don't over-engineer
  • Count on caching very short prompts - Below ~100 neurons, the overhead may exceed the benefit
  • Rely on long cache lifetimes - Caches expire after 5-10 minutes

Limitations

  • Cache expiration: 5-10 minutes of inactivity, 1 hour maximum
  • Exact matching: Even minor changes invalidate cache
  • No cross-model caching: Different models have separate caches
  • No manual control: Can't explicitly force or clear caches
  • Prefix-only: Only the beginning of prompts can be cached, not middle or end

Cost-Benefit Analysis

When Caching Provides High Value

✅ Large static context (>1000 neurons)
✅ Frequent requests (multiple per minute)
✅ Similar prompts (same documentation/instructions)
✅ Multi-turn conversations (thread-based)

When Caching Provides Low Value

⚠️ Unique prompts every time
⚠️ Very short prompts (<100 neurons)
⚠️ Infrequent requests (minutes apart)
⚠️ Highly dynamic content (always changing)
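
For a rough back-of-the-envelope estimate of whether restructuring prompts for caching is worth it (GPT-4.1 rates from the Pricing section; the traffic figures below are assumptions, not measurements):

# Estimated input-neuron savings for a shared static context, assuming every
# request after the first in a burst gets a full cache hit on that context.
context_neurons  = 5_000                 # shared documentation / instructions
requests_per_day = 2_000                 # assumed traffic reusing the same prefix
uncached_rate    = 2.50 / 1_000_000      # $ per uncached input neuron (GPT-4.1)
cached_rate      = 1.25 / 1_000_000      # $ per cached input neuron

daily_savings   = context_neurons * requests_per_day * (uncached_rate - cached_rate)
monthly_savings = daily_savings * 30
print(f"~${daily_savings:.2f}/day, ~${monthly_savings:.2f}/month saved on input neurons")
# ~$12.50/day, ~$375.00/month under these assumptions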

FAQ

Q: Do I need to enable caching?
A: No, it's automatic on supported models.

Q: Can I disable caching?
A: No, but dynamic content naturally prevents caching.

Q: How long do caches last?
A: 5-10 minutes of inactivity, maximum 1 hour.

Q: Why isn't my prompt caching?
A: Check for dynamic content (timestamps, IDs) at the start, ensure exact prompt matching, and verify cache hasn't expired.

Q: Do threads help with caching?
A: Yes! Threads automatically cache conversation history across requests.

Q: Is cached content less accurate?
A: No, cached responses are identical to non-cached ones.

