Streaming mode delivers model responses incrementally as they're generated, enabling real-time user experiences without waiting for complete responses.

What is Streaming?

Instead of waiting for the entire response to complete, streaming delivers partial outputs as server-sent events (SSE) while the model generates them:

Non-Streaming:
[Wait...] → "The capital of France is Paris."

Streaming:
"The" → " capital" → " of" → " France" → " is" → " Paris" → "."

Enabling Streaming

Set stream: true in your request:

curl https://api.aitronos.com/v1/model/response \
  -H "X-API-Key: $FREDDY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "inputs": [
      {
        "role": "user",
        "content": "Write a short story"
      }
    ]
  }'

Response Format

Streaming responses use server-sent events (SSE):

event: response.created
data: {"id":"resp_abc123","status":"in_progress"}

event: response.output_item.added
data: {"index":0,"item":{"type":"message","role":"assistant"}}

event: response.output_text.delta
data: {"index":0,"delta":"Once"}

event: response.output_text.delta
data: {"index":0,"delta":" upon"}

event: response.output_text.delta
data: {"index":0,"delta":" a"}

event: response.output_text.done
data: {"index":0,"text":"Once upon a time..."}

event: response.completed
data: {"status":"completed","usage":{"outputSynapses":156}}
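
Each event is delimited by a blank line, so a parser should accumulate `event:` and `data:` fields until one arrives. A minimal, framework-independent sketch (`parse_sse` is an illustrative helper following the SSE format above, not part of any SDK):

```python
import json

def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of decoded SSE lines."""
    event, data = None, []
    for line in lines:
        if line.startswith("event: "):
            event = line[len("event: "):]
        elif line.startswith("data: "):
            data.append(line[len("data: "):])
        elif line == "":  # a blank line terminates the current event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = None, []
```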

Implementation Examples

Python

import json

import requests

response = requests.post(
    "https://api.aitronos.com/v1/model/response",
    headers={
        "X-API-Key": api_key,
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4o",
        "stream": True,
        "inputs": [
            {"role": "user", "content": "Tell me a story"}
        ]
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        # Parse SSE format
        if line.startswith(b'data: '):
            data = json.loads(line[6:])
            if 'delta' in data:
                print(data['delta'], end='', flush=True)

JavaScript

const response = await fetch('https://api.aitronos.com/v1/model/response', {
  method: 'POST',
  headers: {
    'X-API-Key': apiKey,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    stream: true,
    inputs: [
      { role: 'user', texts: [{ text: 'Tell me a story' }] }
    ]
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value);
  const lines = chunk.split('\n');

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      if (data.delta) {
        process.stdout.write(data.delta);
      }
    }
  }
}

React Component

import { useState } from 'react';

function StreamingChat() {
  const [output, setOutput] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  const handleSubmit = async (message) => {
    setIsStreaming(true);
    setOutput('');

    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'gpt-4o',
        stream: true,
        inputs: [{ role: 'user', texts: [{ text: message }] }]
      })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      const chunk = decoder.decode(value);
      const lines = chunk.split('\n');

      for (const line of lines) {
        if (line.startsWith('data: ')) {
          const data = JSON.parse(line.slice(6));
          if (data.delta) {
            setOutput(prev => prev + data.delta);
          }
        }
      }
    }

    setIsStreaming(false);
  };

  return (
    <div>
      <div className="output">{output}</div>
      {isStreaming && <div>Generating...</div>}
    </div>
  );
}

Stream Obfuscation

Stream obfuscation adds random padding to normalize payload sizes, mitigating timing-based side-channel attacks:

{
  "stream": true,
  "streamOptions": {
    "includeObfuscation": false  // Disable for bandwidth optimization
  }
}

When to disable obfuscation:

  • Trusted network environment
  • Bandwidth-constrained connections
  • High-volume streaming applications

Keep enabled (default) when:

  • Handling sensitive data
  • Untrusted network paths
  • Security is prioritized over bandwidth
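
A request that disables obfuscation differs only in the `streamOptions` field. The helper below is an illustrative sketch (not SDK code) that builds the JSON body using the field names shown above:

```python
def build_streaming_request(prompt, obfuscation=True):
    """Build a streaming request body; set obfuscation=False to save bandwidth."""
    return {
        "model": "gpt-4o",
        "stream": True,
        "streamOptions": {"includeObfuscation": obfuscation},
        "inputs": [{"role": "user", "content": prompt}],
    }
```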

Event Types

response.created

Response has been created and processing started.

{"id":"resp_abc123","status":"in_progress"}

response.output_item.added

A new output item (a message or a tool call) has been added.

{"index":0,"item":{"type":"message","role":"assistant"}}

response.output_text.delta

Incremental text content generated.

{"index":0,"delta":"Hello"}

response.output_text.done

Text output for an item is complete.

{"index":0,"text":"Hello, how can I help you?"}

response.tool_call.delta

Tool call arguments are being generated.

{"index":1,"delta":"{\"query\":\""}

response.completed

Response generation has finished.

{"status":"completed","usage":{"outputSynapses":245}}

response.failed

Response generation encountered an error.

{"status":"failed","error":{"code":"rate_limit_exceeded"}}
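
These events can be routed through a simple dispatch function. The sketch below is an illustrative pattern, not SDK code: it accumulates text deltas, treats the `done` text as authoritative, raises on failure, and signals completion by returning False:

```python
def handle_event(event, data, state):
    """Route one parsed SSE event; returns False once the response completes."""
    if event == "response.output_text.delta":
        state["text"] += data["delta"]
    elif event == "response.output_text.done":
        state["text"] = data["text"]  # final text replaces the accumulated deltas
    elif event == "response.failed":
        raise RuntimeError(f"stream failed: {data['error']['code']}")
    elif event == "response.completed":
        return False
    return True
```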

Use Cases

Interactive Chat

Show responses token by token for a natural conversational feel:

def stream_chat(user_message):
    response = requests.post(
        api_url,
        json={"model": "gpt-4o", "stream": True, "inputs": [...]},
        stream=True
    )

    for line in response.iter_lines():
        if line.startswith(b'data: '):
            data = json.loads(line[6:])
            if 'delta' in data:
                yield data['delta']

Content Generation

Display articles, stories, or documentation as they're written:

async function generateBlogPost(topic) {
  const stream = await fetch('/api/generate', {
    method: 'POST',
    body: JSON.stringify({
      stream: true,
      inputs: [{ role: 'user', texts: [{ text: `Write about ${topic}` }] }]
    })
  });

  // Update UI in real-time
  for await (const chunk of streamResponse(stream)) {
    document.getElementById('preview').textContent += chunk;
  }
}

Code Generation

Show code being written line-by-line:

import time

def stream_code_generation(prompt):
    for chunk in stream_response(prompt):        # your streaming helper
        syntax_highlight_and_display(chunk)      # your rendering function
        time.sleep(0.01)  # brief pause for a smooth typing animation

Best Practices

DO

  • Buffer incomplete events - SSE chunks may split across packets
  • Handle reconnection - Implement retry logic for network issues
  • Parse incrementally - Process deltas as they arrive
  • Show loading indicators - Indicate streaming is in progress
  • Implement timeouts - Don't wait indefinitely

DON'T

  • Assume complete JSON - Chunks may contain partial data
  • Block UI thread - Process streams asynchronously
  • Ignore error events - Handle response.failed appropriately
  • Forget to close streams - Clean up connections when done
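
For the timeout rule in particular, a per-stream deadline can be wrapped around any chunk iterator. A minimal sketch (`with_deadline` is a hypothetical helper, independent of any HTTP client; note it aborts between chunks, so a hard read timeout still belongs on the client itself):

```python
import time

def with_deadline(chunks, max_seconds):
    """Yield chunks from a stream, aborting once total time exceeds max_seconds."""
    deadline = time.monotonic() + max_seconds
    for chunk in chunks:
        if time.monotonic() > deadline:
            raise TimeoutError("stream exceeded deadline")
        yield chunk
```

Wrap it around `response.iter_lines()` (or any generator of deltas) so a slow stream fails fast instead of hanging the UI indefinitely.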

Performance Considerations

Latency:

  • First token: ~200-500ms
  • Subsequent tokens: ~20-50ms each
  • Time to first visible output: dramatically lower than waiting for the full response

Bandwidth:

  • Streaming uses ~20% more bandwidth due to SSE overhead
  • Disable includeObfuscation if bandwidth is critical

User Experience:

  • Users perceive streaming as 50-70% faster
  • Engagement increases with real-time feedback

Troubleshooting

Chunks Not Arriving

Disable curl's output buffering with -N and set the SSE Accept header:

curl -N -H "Accept: text/event-stream" ...

Incomplete JSON

Buffer until newlines:

buffer = ""
for chunk in response.iter_content():
    buffer += chunk.decode()
    while '\n' in buffer:
        line, buffer = buffer.split('\n', 1)
        process_line(line)

Connection Drops

Implement exponential backoff:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function streamWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await streamRequest(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await sleep(2 ** i * 1000);  // 1s, 2s, 4s...
    }
  }
}
