
Prompt caching automatically reuses previously processed prompt content, reducing latency and costs for requests with repeated or similar context.

What is Prompt Caching?

When you send a request to Freddy, the system checks if parts of your prompt have been processed recently. If cached content exists, the model reuses those computations instead of reprocessing them, resulting in:

  • Faster responses - Up to 80% reduction in latency for cached prompts
  • Lower costs - 50% discount on cached input neurons
  • Automatic optimization - No code changes required

How It Works

Request 1: [System Prompt + Long Context] + User Query
           └─────────────────────────────┘
                 Processed & Cached
                 
Request 2: [Same System Prompt + Long Context] + Different User Query
           └────────────────────────────────┘
                   Retrieved from Cache
                   
Result: Faster response, lower cost

Caching is automatic and transparent - you don't need to change your code.

Supported Models

Prompt caching is available on:

  • GPT-4.1 and newer
  • GPT-4.1-mini and newer
  • o3-preview and o3-mini
  • All fine-tuned versions of the above models

Pricing

Cached neurons are discounted by 50%:

Model         Uncached Input       Cached Input         Savings
GPT-4.1       $2.50/1M neurons     $1.25/1M neurons     50%
GPT-4.1-mini  $0.15/1M neurons     $0.075/1M neurons    50%
o3-preview    $15.00/1M neurons    $7.50/1M neurons     50%

Output synapses are priced normally regardless of caching.
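
As a quick worked example (a rough per-request estimate using the GPT-4.1 rates above; the neuron counts are illustrative, not measured):

# Cost of a 10,000-neuron prompt on GPT-4.1 where 8,000 neurons hit the cache
UNCACHED_RATE = 2.50 / 1_000_000   # $ per uncached input neuron
CACHED_RATE   = 1.25 / 1_000_000   # $ per cached input neuron (50% discount)

cached, fresh = 8_000, 2_000
with_cache    = cached * CACHED_RATE + fresh * UNCACHED_RATE   # $0.0150
without_cache = (cached + fresh) * UNCACHED_RATE               # $0.0250

print(f"with cache: ${with_cache:.4f}, without cache: ${without_cache:.4f}")
# Input cost drops by 40% for this request; output synapses are unaffected.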

Monitoring Cache Usage

Every response includes cache metrics in the usage field:

{
  "usage": {
    "inputNeurons": 1500,
    "cachedNeurons": 1200,
    "outputSynapses": 150,
    "totalNeurons": 1500,
    "totalSynapses": 150
  }
}

cachedNeurons shows how many input neurons were retrieved from cache.

Cache hit rate: cachedNeurons / inputNeurons = 1200 / 1500 = 80% in this example.

Cache Behavior

Cache Lifetime

  • Active cache: entries expire after 5-10 minutes of inactivity
  • Maximum lifetime: 1 hour from last use
  • Automatic cleanup: expired caches are removed automatically

Cache Matching

Caches match when:

  • Prompt prefix is identical - Same text, same order
  • Same model - Must use the exact same model ID
  • Within time window - Cache hasn't expired

Caches DON'T match when:

  • Any text changes - Even minor edits invalidate the cache (see the sketch after this list)
  • Different order - Rearranged content won't match
  • Different model - Switching models resets the cache
  • Cache expired - Past the lifetime window
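
A minimal sketch of the most common mismatch: placing anything dynamic before the static prefix. The payload fields mirror the JSON examples later in this guide; the timestamp is just an illustration of dynamic content.

from datetime import datetime, timezone

SYSTEM_PROMPT = "You are an expert Python developer..."  # long, static instructions

# ❌ The prefix changes on every request, so it never matches a cached entry.
bad_inputs = [
    {"role": "system", "texts": [{"text": f"Current time: {datetime.now(timezone.utc)}"}]},
    {"role": "system", "texts": [{"text": SYSTEM_PROMPT}]},
]

# ✅ Static prefix first; dynamic details ride along with the user turn instead.
good_inputs = [
    {"role": "system", "texts": [{"text": SYSTEM_PROMPT}]},
    {"role": "user", "texts": [{"text": f"(asked at {datetime.now(timezone.utc)}) How do I fix this error?"}]},
]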

Optimizing for Cache Hits

1. Consistent Prompt Structure

Put static content first, dynamic content last:

// ✅ Good - Static content cached
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Cached
    },
    {
      "role": "user",
      "texts": [{"text": "How do I fix this error: {{user_error}}"}]  // Dynamic
    }
  ]
}

// ❌ Bad - Dynamic content breaks cache
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "Current time: {{timestamp}}"}]  // Changes every request
    },
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Won't cache
    }
  ]
}

2. Large Static Context First

If you have documentation, examples, or guidelines - put them at the start:

{
  "inputs": [
    {
      "role": "system",
      "texts": [
        {"text": "# API Documentation\n\n... (5000 tokens of docs) ..."}
      ]
    },
    {
      "role": "system",
      "texts": [
        {"text": "# Code Style Guide\n\n... (3000 tokens) ..."}
      ]
    },
    {
      "role": "user",
      "texts": [{"text": "Help me write a function"}]  // Only this part changes
    }
  ]
}

The 8,000 neurons of documentation are cached after the first request; only the user query (roughly 10 neurons) is processed fresh on each subsequent call.

3. Use Threads for Conversations

Threads automatically cache conversation history:

# First message in thread - builds cache
response1 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",
    "inputs": [{"role": "user", "texts": [{"text": "Hello"}]}]
})

# Follow-up - conversation history cached
response2 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",  # Previous messages cached
    "inputs": [{"role": "user", "texts": [{"text": "Tell me more"}]}]
})

4. Prompt Templates with Caching

Use prompt templates for reusable, cacheable instructions:

{
  "model": "gpt-4.1",
  "prompt": {
    "id": "prompt_code_reviewer",  // Template content gets cached
    "variables": {
      "language": "Python",  // Only variables change
      "focus": "security"
    }
  },
  "inputs": [
    {"role": "user", "texts": [{"text": "Review this code: ..."}]}
  ]
}
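
A sketch of what repeated calls might look like, assuming the same payload shape as the JSON above (the template id and variable names are taken from that example; the helper and code snippets are hypothetical):

import requests

def review(code_snippet: str):
    # The template's rendered instructions form a stable prefix that can be
    # cached; only the user turn changes from call to call.
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "prompt": {
            "id": "prompt_code_reviewer",
            "variables": {"language": "Python", "focus": "security"}
        },
        "inputs": [
            {"role": "user", "texts": [{"text": f"Review this code: {code_snippet}"}]}
        ]
    })

review("def load(path): ...")   # first call: template content processed and cached
review("def save(path): ...")   # later calls within the window: template prefix cached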

Use Cases

Documentation Q&A

# Embed large documentation in system prompt
DOCS = """
[5000 neurons of product documentation]
"""

def answer_question(question):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": DOCS}]},  # Cached after first call
            {"role": "user", "texts": [{"text": question}]}
        ]
    })

# First call: Full processing
answer_question("What is feature X?")  # cachedNeurons: 0

# Subsequent calls: Cached docs
answer_question("How do I use feature Y?")  # cachedNeurons: 5000
answer_question("What's the pricing?")  # cachedNeurons: 5000

Code Analysis

# Provide codebase context once
CODEBASE_CONTEXT = """
[Large codebase structure and key files]
"""

def analyze_code(code_snippet):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": CODEBASE_CONTEXT}]},  # Cached
            {"role": "user", "texts": [{"text": f"Analyze: {code_snippet}"}]}
        ]
    })

Multi-Turn Conversations

// First message - establishes cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',
  inputs: [
    {
      role: 'system',
      texts: [{ text: customerContext }]  // Customer history cached
    },
    {
      role: 'user',
      texts: [{ text: 'I need help' }]
    }
  ]
});

// Follow-ups automatically benefit from cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',  // Previous context cached
  inputs: [
    { role: 'user', texts: [{ text: 'Can you clarify?' }] }
  ]
});

Cache Analytics

Tracking Cache Performance

import requests

response = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "inputs": [...]
})

data = response.json()
usage = data['usage']

cache_hit_rate = usage['cachedNeurons'] / usage['inputNeurons'] * 100
# 50% discount on GPT-4.1's $2.50/1M uncached input rate
cost_savings = usage['cachedNeurons'] * 2.50 * 0.50 / 1_000_000

print(f"Cache hit rate: {cache_hit_rate:.1f}%")
print(f"Neurons saved: {usage['cachedNeurons']}")
print(f"Cost savings: ${cost_savings:.4f}")

Logging Cache Metrics

const logCacheMetrics = (response) => {
  const { usage } = response;
  const hitRate = (usage.cachedNeurons / usage.inputNeurons) * 100;
  
  console.log({
    timestamp: new Date(),
    model: response.model,
    inputNeurons: usage.inputNeurons,
    cachedNeurons: usage.cachedNeurons,
    cacheHitRate: `${hitRate.toFixed(1)}%`,
    estimatedSavings: (usage.cachedNeurons * 0.0000025 * 0.5).toFixed(4)  // GPT-4.1 pricing
  });
};

Best Practices

✅ DO

  • Place static content first - System prompts, docs, examples before user input
  • Reuse prompts - Same prefix across requests maximizes cache hits
  • Use threads - Automatic caching of conversation history
  • Monitor metrics - Track cachedNeurons to measure effectiveness
  • Batch similar requests - Process related queries while the cache is hot (see the sketch after this list)
  • Keep prompts stable - Minor edits break caching
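
One way to batch by shared prefix, as referenced above: group queries that reuse the same static context and send each group back to back, so follow-ups land inside the 5-10 minute cache window. A minimal sketch reusing the request shape from the examples in this guide; the workload, prefix keys, and helper are hypothetical.

import requests
from collections import defaultdict

PREFIXES = {
    "docs_v1": "# Product documentation\n...",      # large static contexts,
    "style_guide": "# Code style guide\n...",       # each cached after first use
}

def send_request(prefix_key: str, question: str):
    # Same request shape as the examples in this guide; the shared prefix goes
    # first so every request in a group has an identical, cacheable prefix.
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": PREFIXES[prefix_key]}]},
            {"role": "user", "texts": [{"text": question}]}
        ]
    })

# Hypothetical mixed workload: group by shared prefix and send each group
# back to back, so follow-ups land inside the 5-10 minute cache window.
jobs = [("docs_v1", "What is feature X?"), ("style_guide", "Review this function"),
        ("docs_v1", "How do I enable feature Y?"), ("style_guide", "Is this naming OK?")]

grouped = defaultdict(list)
for prefix_key, question in jobs:
    grouped[prefix_key].append(question)

for prefix_key, questions in grouped.items():
    for question in questions:
        send_request(prefix_key, question)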

❌ DON'T

  • Add timestamps - Dynamic content at the start breaks caching
  • Randomize order - Shuffling prompt structure prevents cache hits
  • Overthink it - Caching is automatic, don't over-engineer
  • Count on caching very short prompts - Below ~100 neurons, the overhead may exceed the benefit
  • Rely on long cache lifetimes - Caches expire after 5-10 minutes

Limitations

  • Cache expiration: 5-10 minutes of inactivity, 1 hour maximum
  • Exact matching: Even minor changes invalidate cache
  • No cross-model caching: Different models have separate caches
  • No manual control: Can't explicitly force or clear caches
  • Prefix-only: Only the beginning of prompts can be cached, not middle or end

Cost-Benefit Analysis

When Caching Provides High Value

✅ Large static context (>1000 neurons)
✅ Frequent requests (multiple per minute)
✅ Similar prompts (same documentation/instructions)
✅ Multi-turn conversations (thread-based)

When Caching Provides Low Value

⚠️ Unique prompts every time
⚠️ Very short prompts (<100 neurons)
⚠️ Infrequent requests (minutes apart)
⚠️ Highly dynamic content (always changing)
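
For a rough back-of-the-envelope estimate of whether restructuring prompts for caching is worth it (GPT-4.1 rates from the Pricing section; the traffic figures below are assumptions, not measurements):

# Estimated input-neuron savings for a shared static context, assuming every
# request after the first in a burst gets a full cache hit on that context.
context_neurons  = 5_000                 # shared documentation / instructions
requests_per_day = 2_000                 # assumed traffic reusing the same prefix
uncached_rate    = 2.50 / 1_000_000      # $ per uncached input neuron (GPT-4.1)
cached_rate      = 1.25 / 1_000_000      # $ per cached input neuron

daily_savings   = context_neurons * requests_per_day * (uncached_rate - cached_rate)
monthly_savings = daily_savings * 30
print(f"~${daily_savings:.2f}/day, ~${monthly_savings:.2f}/month saved on input neurons")
# ~$12.50/day, ~$375.00/month under these assumptions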

FAQ

Q: Do I need to enable caching?
A: No, it's automatic on supported models.

Q: Can I disable caching?
A: No, but dynamic content naturally prevents caching.

Q: How long do caches last?
A: 5-10 minutes of inactivity, maximum 1 hour.

Q: Why isn't my prompt caching?
A: Check for dynamic content (timestamps, IDs) at the start, ensure exact prompt matching, and verify cache hasn't expired.

Q: Do threads help with caching?
A: Yes! Threads automatically cache conversation history across requests.

Q: Is cached content less accurate?
A: No, cached responses are identical to non-cached ones.

