Audio and Speech

Freddy supports audio input in multimodal requests, allowing models to understand spoken content, transcribe audio, and respond to voice messages.

Audio Input

Send audio as part of your request inputs using the audio_url content type:

{
  "organization_id": "org_your_org_id",
  "model": "gpt-4o",
  "inputs": [
    {
      "role": "user",
      "content": [
        {
          "type": "audio_url",
          "audio_url": {
            "url": "https://storage.example.com/audio/recording.mp3"
          }
        },
        {
          "type": "text",
          "text": "Transcribe and summarize this audio."
        }
      ]
    }
  ]
}

The model processes the audio and responds with text.
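
A request body of this shape can also be assembled programmatically. The following is a minimal sketch: `build_audio_request` is a hypothetical helper name, not part of any Freddy SDK, and its field names simply mirror the JSON example above.

```python
import json


def build_audio_request(audio_url, prompt, model="gpt-4o",
                        organization_id="org_your_org_id"):
    """Return a request body with one audio part followed by one text part.

    Hypothetical helper: the field names mirror the JSON example in this page.
    """
    return {
        "organization_id": organization_id,
        "model": model,
        "inputs": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


if __name__ == "__main__":
    payload = build_audio_request(
        "https://storage.example.com/audio/recording.mp3",
        "Transcribe and summarize this audio.",
    )
    print(json.dumps(payload, indent=2))
```

Keeping the payload construction in one function makes it easier to reuse the same structure across several audio files or prompts.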

Supported Formats

Format	Extension	Notes
MP3	`.mp3`	Most common, recommended
WAV	`.wav`	Uncompressed, larger files
M4A	`.m4a`	Apple format
OGG	`.ogg`	Open source format
WebM	`.webm`	Web streaming format

Audio URL Requirements

The URL must be publicly accessible, or hosted on a Freddy-accessible storage endpoint
Maximum audio duration: 25 minutes per request
Maximum file size: 25 MB
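
These limits can be checked on the client before a request is sent. The sketch below is illustrative only: `check_audio` is a hypothetical helper, the extension set is copied from the Supported Formats table, and it assumes that "25 MB" means 25 × 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Values taken from the format table and the limits listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg", ".webm"}
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_DURATION_SECONDS = 25 * 60


def check_audio(path, size_bytes, duration_seconds=None):
    """Return a list of problems; an empty list means every check passed.

    duration_seconds is optional because the duration is not always known
    without decoding the file.
    """
    problems = []
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 25 MB")
    if duration_seconds is not None and duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio exceeds 25 minutes")
    return problems
```

For a local file, `size_bytes` can be obtained with `os.path.getsize(path)`. The extension check is only a heuristic, since it does not inspect the actual encoding.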

For long audio files, split them into segments and process each separately.
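
One way to split a recording is sketched below using Python's standard `wave` module. It applies only to uncompressed WAV files; compressed formats such as MP3 or M4A would need an external audio tool instead. `split_wav` and its parameters are hypothetical names chosen for this example, and the caller is responsible for picking a segment length that stays within the limits above.

```python
import wave


def split_wav(src_path, segment_seconds, out_prefix):
    """Split a WAV file into consecutive segments of at most segment_seconds.

    Writes files named <out_prefix>_000.wav, <out_prefix>_001.wav, ...
    and returns the list of paths written.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_segment = int(src.getframerate() * segment_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break  # end of the source file
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width and rate; the frame count in
                # the header is corrected when the frames are written.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

Each resulting segment can then be uploaded and sent in its own request. Note that a fixed-length split may cut a word in half at a segment boundary, so the transcripts of neighboring segments may need manual review.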

Uploading Audio Files

Upload audio files to Freddy's file storage, then reference them by URL:

import os

import requests

# Upload the file
with open("audio.mp3", "rb") as f:
    upload_response = requests.post(
        "https://api.aitronos.com/v1/files",
        headers={"X-API-Key": os.environ["FREDDY_API_KEY"]},
        files={"file": f},
        data={"organization_id": "org_your_org_id", "purpose": "assistants"},
    )
upload_response.raise_for_status()  # stop early if the upload failed

file_data = upload_response.json()
audio_url = file_data["url"]

# Use in a request
response = requests.post(
    "https://api.aitronos.com/v1/model/response",
    headers={"X-API-Key": os.environ["FREDDY_API_KEY"]},
    json={
        "organization_id": "org_your_org_id",
        "model": "gpt-4o",
        "inputs": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                    {"type": "text", "text": "What is being discussed?"},
                ],
            }
        ],
    },
)
response.raise_for_status()  # surface HTTP errors instead of failing silently

Use Cases

Transcription — Convert spoken audio to text
Meeting summaries — Summarize recorded discussions
Voice Q&A — Answer questions about spoken content
Audio analysis — Identify topics, sentiment, or key points

Model Support

Audio input is supported by models with multimodal capabilities. Check that your selected model supports audio by reviewing Available Models.

Inputs and Outputs — Full multimodal input reference
Images and Vision — Image input and generation
Files API — Uploading files for use in requests
Available Models — Model capability guide
