# Prompt Caching

Prompt caching automatically reuses previously processed prompt content, reducing latency and costs for requests with repeated or similar context.

## What is Prompt Caching?

When you send a request to Freddy, the system checks whether parts of your prompt have been processed recently. If cached content exists, the model reuses those computations instead of reprocessing them, resulting in:

- **Faster responses** - Up to 80% reduction in latency for cached prompts
- **Lower costs** - 50% discount on cached input neurons
- **Automatic optimization** - No code changes required

## How It Works

```
Request 1: [System Prompt + Long Context] + User Query
           └────────────────────────────┘
                 Processed & Cached

Request 2: [Same System Prompt + Long Context] + Different User Query
           └──────────────────────────────────┘
                 Retrieved from Cache

Result: Faster response, lower cost
```

Caching is automatic and transparent - you don't need to change your code.

## Supported Models

Prompt caching is available on:

- **GPT-4.1** and newer
- **GPT-4.1-mini** and newer
- **o3-preview** and **o3-mini**
- **All fine-tuned versions** of the above models

## Pricing

Cached neurons are discounted by **50%**:

| Model | Uncached Input | Cached Input | Savings |
| --- | --- | --- | --- |
| GPT-4.1 | $2.50/1M neurons | $1.25/1M neurons | 50% |
| GPT-4.1-mini | $0.15/1M neurons | $0.075/1M neurons | 50% |
| o3-preview | $15.00/1M neurons | $7.50/1M neurons | 50% |

Output synapses are priced normally regardless of caching.

## Monitoring Cache Usage

Every response includes cache metrics in the `usage` field:

```json
{
  "usage": {
    "inputNeurons": 1500,
    "cachedNeurons": 1200,
    "outputSynapses": 150,
    "totalNeurons": 1500,
    "totalSynapses": 150
  }
}
```

**`cachedNeurons`** shows how many input neurons were retrieved from cache.

**Cache hit rate**: `cachedNeurons / inputNeurons = 80%` (in this example)

## Cache Behavior

### Cache Lifetime

- **Active cache**: expires after 5-10 minutes of inactivity
- **Maximum lifetime**: 1 hour from last use
- **Automatic cleanup**: expired caches are removed automatically

### Cache Matching

Caches match when:

- ✅ **Prompt prefix is identical** - Same text, same order
- ✅ **Same model** - Must use the exact same model ID
- ✅ **Within time window** - Cache hasn't expired

Caches DON'T match when:

- ❌ **Any text changes** - Even minor edits invalidate the cache
- ❌ **Different order** - Rearranged content won't match
- ❌ **Different model** - Changing models resets the cache
- ❌ **Cache expired** - Past the lifetime window

## Optimizing for Cache Hits

### 1. Consistent Prompt Structure

Put **static content first** and **dynamic content last**:

```json
// ✅ Good - Static content cached
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Cached
    },
    {
      "role": "user",
      "texts": [{"text": "How do I fix this error: {{user_error}}"}]  // Dynamic
    }
  ]
}

// ❌ Bad - Dynamic content breaks cache
{
  "inputs": [
    {
      "role": "system",
      "texts": [{"text": "Current time: {{timestamp}}"}]  // Changes every request
    },
    {
      "role": "system",
      "texts": [{"text": "You are an expert Python developer..."}]  // Won't cache
    }
  ]
}
```
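As a rough illustration of this rule, the sketch below assembles the `inputs` array with the static system prompt first and the per-request user text last, then posts it to the `/v1/model/response` endpoint shown throughout this page. The base URL, API-key header, and helper names are assumptions for the example, not part of the API.

```python
import requests

BASE_URL = "https://api.example.com"   # assumption - substitute your Freddy endpoint
API_KEY = "YOUR_API_KEY"               # assumption - auth details depend on your setup

# Identical on every call, so it forms a cacheable prefix
STATIC_SYSTEM_PROMPT = "You are an expert Python developer..."

def build_cache_friendly_request(user_text: str) -> dict:
    """Keep the static prefix first; put per-request text last."""
    return {
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": STATIC_SYSTEM_PROMPT}]},  # cacheable prefix
            {"role": "user", "texts": [{"text": user_text}]},               # dynamic suffix
        ],
    }

def send(user_text: str) -> dict:
    response = requests.post(
        f"{BASE_URL}/v1/model/response",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=build_cache_friendly_request(user_text),
    )
    response.raise_for_status()
    return response.json()

# Repeated calls share the same prefix, so after the first request the system
# prompt should be served from cache; only the user text is processed fresh.
print(send("How do I fix this error: NameError: name 'x' is not defined")["usage"])
```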
### 2. Large Static Context First

If you have documentation, examples, or guidelines, put them at the start:

```json
{
  "inputs": [
    {
      "role": "system",
      "texts": [
        {"text": "# API Documentation\n\n... (5000 neurons of docs) ..."}
      ]
    },
    {
      "role": "system",
      "texts": [
        {"text": "# Code Style Guide\n\n... (3000 neurons) ..."}
      ]
    },
    {
      "role": "user",
      "texts": [{"text": "Help me write a function"}]  // Only this part changes
    }
  ]
}
```

The 8,000 neurons of documentation are cached; only the user query (about 10 neurons) is processed fresh.

### 3. Use Threads for Conversations

Threads automatically cache conversation history:

```python
# First message in thread - builds cache
response1 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",
    "inputs": [{"role": "user", "texts": [{"text": "Hello"}]}]
})

# Follow-up - conversation history cached
response2 = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "thread": "thread_user123",  # Previous messages cached
    "inputs": [{"role": "user", "texts": [{"text": "Tell me more"}]}]
})
```

### 4. Prompt Templates with Caching

Use prompt templates for reusable, cacheable instructions:

```json
{
  "model": "gpt-4.1",
  "prompt": {
    "id": "prompt_code_reviewer",  // Template content gets cached
    "variables": {
      "language": "Python",  // Only variables change
      "focus": "security"
    }
  },
  "inputs": [
    {"role": "user", "texts": [{"text": "Review this code: ..."}]}
  ]
}
```

## Use Cases

### Documentation Q&A

```python
# Embed large documentation in the system prompt
DOCS = """
[5000 neurons of product documentation]
"""

def answer_question(question):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": DOCS}]},  # Cached after first call
            {"role": "user", "texts": [{"text": question}]}
        ]
    })

# First call: full processing
answer_question("What is feature X?")        # cachedNeurons: 0

# Subsequent calls: cached docs
answer_question("How do I use feature Y?")   # cachedNeurons: 5000
answer_question("What's the pricing?")       # cachedNeurons: 5000
```

### Code Analysis

```python
# Provide codebase context once
CODEBASE_CONTEXT = """
[Large codebase structure and key files]
"""

def analyze_code(code_snippet):
    return requests.post('/v1/model/response', json={
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": CODEBASE_CONTEXT}]},  # Cached
            {"role": "user", "texts": [{"text": f"Analyze: {code_snippet}"}]}
        ]
    })
```

### Multi-Turn Conversations

```javascript
// First message - establishes cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',
  inputs: [
    {
      role: 'system',
      texts: [{ text: customerContext }]  // Customer history cached
    },
    { role: 'user', texts: [{ text: 'I need help' }] }
  ]
});

// Follow-ups automatically benefit from cache
await createResponse({
  model: 'gpt-4.1',
  thread: 'support_ticket_123',  // Previous context cached
  inputs: [
    { role: 'user', texts: [{ text: 'Can you clarify?' }] }
  ]
});
```
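A simple way to confirm these patterns are working is to send two requests that share the same prefix and compare their `cachedNeurons` values, which leads into the analytics below. This is a minimal sketch, assuming a reachable base URL and bearer-token auth; the helper name and placeholder documentation string are hypothetical.

```python
import requests

BASE_URL = "https://api.example.com"                 # assumption - substitute your Freddy endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumption - auth may differ in your setup

STATIC_DOCS = "# Product documentation\n..."         # large, unchanging prefix

def ask(question: str) -> dict:
    """Send a request with the static docs first and return the usage block."""
    payload = {
        "model": "gpt-4.1",
        "inputs": [
            {"role": "system", "texts": [{"text": STATIC_DOCS}]},
            {"role": "user", "texts": [{"text": question}]},
        ],
    }
    resp = requests.post(f"{BASE_URL}/v1/model/response", headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()["usage"]

first = ask("What is feature X?")
second = ask("How do I use feature Y?")

# The first call should report cachedNeurons == 0; the second should report a
# value close to the size of the shared documentation prefix.
print("first call :", first["cachedNeurons"], "of", first["inputNeurons"], "cached")
print("second call:", second["cachedNeurons"], "of", second["inputNeurons"], "cached")
```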
## Cache Analytics

### Tracking Cache Performance

```python
import requests

response = requests.post('/v1/model/response', json={
    "model": "gpt-4.1",
    "inputs": [...]
})

data = response.json()
usage = data['usage']

cache_hit_rate = usage['cachedNeurons'] / usage['inputNeurons'] * 100

# GPT-4.1 uncached input is $2.50 per 1M neurons; cached neurons are 50% off,
# so each cached neuron saves $1.25 per 1M.
cost_savings = usage['cachedNeurons'] * 2.50 * 0.50 / 1_000_000

print(f"Cache hit rate: {cache_hit_rate:.1f}%")
print(f"Neurons saved: {usage['cachedNeurons']}")
print(f"Cost savings: ${cost_savings:.4f}")
```

### Logging Cache Metrics

```javascript
const logCacheMetrics = (response) => {
  const { usage } = response;
  const hitRate = (usage.cachedNeurons / usage.inputNeurons) * 100;

  console.log({
    timestamp: new Date(),
    model: response.model,
    inputNeurons: usage.inputNeurons,
    cachedNeurons: usage.cachedNeurons,
    cacheHitRate: `${hitRate.toFixed(1)}%`,
    estimatedSavings: (usage.cachedNeurons * 0.0000025 * 0.5).toFixed(4)  // GPT-4.1 pricing
  });
};
```

## Best Practices

### ✅ DO

- **Place static content first** - System prompts, docs, and examples before user input
- **Reuse prompts** - The same prefix across requests maximizes cache hits
- **Use threads** - Conversation history is cached automatically
- **Monitor metrics** - Track `cachedNeurons` to measure effectiveness
- **Batch similar requests** - Process related queries while the cache is hot
- **Keep prompts stable** - Minor edits break caching

### ❌ DON'T

- **Add timestamps** - Dynamic content at the start breaks caching
- **Randomize order** - Shuffling prompt structure prevents cache hits
- **Overthink it** - Caching is automatic; don't over-engineer
- **Expect savings on very short prompts** - Below roughly 100 neurons, the overhead may exceed the benefit
- **Rely on long cache lifetimes** - Caches expire after 5-10 minutes of inactivity

## Limitations

- **Cache expiration**: 5-10 minutes of inactivity, 1 hour maximum
- **Exact matching**: Even minor changes invalidate the cache
- **No cross-model caching**: Different models have separate caches
- **No manual control**: You can't explicitly force or clear caches
- **Prefix-only**: Only the beginning of a prompt can be cached, not the middle or end

## Cost-Benefit Analysis

### When Caching Provides High Value

- ✅ **Large static context** (>1,000 neurons)
- ✅ **Frequent requests** (multiple per minute)
- ✅ **Similar prompts** (same documentation/instructions)
- ✅ **Multi-turn conversations** (thread-based)

### When Caching Provides Low Value

- ⚠️ **Unique prompts every time**
- ⚠️ **Very short prompts** (<100 neurons)
- ⚠️ **Infrequent requests** (minutes apart)
- ⚠️ **Highly dynamic content** (always changing)

## FAQ

**Q: Do I need to enable caching?**
A: No, it's automatic on supported models.

**Q: Can I disable caching?**
A: No, but dynamic content naturally prevents caching.

**Q: How long do caches last?**
A: 5-10 minutes of inactivity, maximum 1 hour.

**Q: Why isn't my prompt caching?**
A: Check for dynamic content (timestamps, IDs) at the start of the prompt, make sure the prefix matches exactly, and verify the cache hasn't expired.

**Q: Do threads help with caching?**
A: Yes! Threads automatically cache conversation history across requests.

**Q: Is cached content less accurate?**
A: No, cached responses are identical to non-cached ones.

**Related:**

- [Synapses and Neurons](/docs/documentation/core-concepts/synapses-and-tokens) - Understanding usage metrics
- [Prompt Templates](/docs/documentation/core-concepts/prompt-templates) - Reusable prompts that cache well
- [Threads](/docs/documentation/core-concepts/threads) - Automatic caching for conversations
- [Best Practices](/docs/documentation/best-practices) - Optimization strategies