Source check
Every listed row should trace back to an official provider pricing, docs, model, or API page before it becomes a comparison row.
Speech and audio APIs use different billing units. Keep characters, transcription hours, voice minutes, and reconnect overhead separate before choosing a provider.
Price rows: 8
Providers: 7
Official sources: 7
Last updated: 2026-07-04T06:36:11.407Z
Last source check: 2026-07-04
What changed in this update
Refreshed 8 official audio price rows.
Grouped rows across 7 providers and 7 official source pages.
Kept workload guidance tied to launch checks, real usage units, and official-source verification.
What the price row misses
AI API pricing pages often look simple because they compare one published row at a time. This page keeps the official row visible, then adds the messy assumptions that show up in products: retries, long outputs, cache misses, batches, and review loops.
Source check
Every listed row should trace back to an official provider pricing, docs, model, or API page before it becomes a comparison row.
Unit check
Rows are kept in their original billing units when conversion would hide an important difference, such as per-second video or per-image generation.
Workload check
The calculator starts from product behavior: retries, cache hits, long prompts, output length, batch jobs, and rejected generations.
Launch check
Before a production rollout, reopen the official source because provider prices, cache rules, model names, and eligibility can change quickly.
Pricing validation playbook
Official rows are the starting point. The production decision comes from measuring the unit your users actually complete, the retries they create, and the quality gates you need before an output is accepted.
Define the unit
Separate transcript hours, TTS characters, dubbed minutes, realtime voice minutes, and completed conversations before comparing providers.
Instrument retries
Audio workflows can repeat speech, transcription, diarization, voice turns, and failed sessions. Log these separately from clean unit prices.
Compare quality gates
Evaluate accuracy, speaker separation, voice quality, turn latency, noise handling, and post-processing before choosing the cheapest unit.
Review privacy
Before launch, verify provider data retention, region, consent, logging, and voice cloning or speaker-identification policy constraints.
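The "Instrument retries" step above can be sketched as a small usage log that counts retried units apart from accepted ones. This is a minimal illustration, not a provider SDK: the `UsageLog` class, its method names, and the workload labels are all hypothetical, and the sample quantities are made up.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UsageLog:
    """Tracks billable units per workload, split into accepted work and retries."""
    accepted: Counter = field(default_factory=Counter)
    retried: Counter = field(default_factory=Counter)

    def record(self, workload: str, units: float, is_retry: bool = False) -> None:
        # Retries go to their own counter so they never blend into the clean unit count.
        bucket = self.retried if is_retry else self.accepted
        bucket[workload] += units

    def retry_overhead(self, workload: str) -> float:
        """Retried units as a fraction of accepted units (0.0 when nothing was accepted)."""
        accepted = self.accepted[workload]
        return self.retried[workload] / accepted if accepted else 0.0


# Hypothetical sample data for two workloads with different billing units.
log = UsageLog()
log.record("transcription_hours", 10.0)
log.record("transcription_hours", 0.5, is_retry=True)  # a failed file processed again
log.record("tts_characters", 40_000)
log.record("tts_characters", 6_000, is_retry=True)     # regenerated for a pronunciation fix

for workload in ("transcription_hours", "tts_characters"):
    print(f"{workload}: {log.retry_overhead(workload):.0%} retry overhead")
```

Keeping one counter per workload means a retry rate measured in characters is never averaged with one measured in hours.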
Audio workflow cheat sheet
Audio API cost depends on the unit your workflow actually delivers: characters, transcription hours, realtime voice minutes, dubbed minutes, or completed voice conversations. Keep those units separate before choosing a provider.
| Workload | What moves the bill | Measure first |
|---|---|---|
| Text to speech | Characters, voice tier, regeneration, pronunciation fixes, and accepted takes. | Accepted spoken characters per month. |
| Transcription | Audio hours, diarization, language support, noise, and failed files. | Processed audio hours per month. |
| Realtime voice | Voice minutes, reconnects, turn-taking, STT, LLM calls, and TTS output. | Cost per completed conversation. |
| Dubbing | Source audio length, target languages, voice matching, review passes, and re-renders. | Approved dubbed minutes per language. |
| Batch call processing | Recorded hours, queue timing, extraction prompts, and post-processing. | Calls processed per nightly batch. |
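One way to keep units separate in practice is to normalize prices only within a workload and never across workloads. The sketch below assumes hypothetical workload labels (`"tts"`, `"transcription"`) and a hypothetical `PriceRow` type; the prices are copied from the published rows on this page and should be rechecked against the official source before use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceRow:
    provider: str
    workload: str     # hypothetical labels: "tts", "transcription"
    usd: float        # published price
    per_units: float  # base units covered by that price (characters or hours)


def normalized_by_workload(rows, scale):
    """Group rows by workload and express each price per scale[workload] base units."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row.workload][row.provider] = row.usd / row.per_units * scale[row.workload]
    return dict(grouped)


rows = [
    PriceRow("ElevenLabs", "tts", 0.05, 1_000),        # $0.05 per 1K characters
    PriceRow("Mistral", "tts", 0.016, 1_000),          # $0.016 per 1K characters
    PriceRow("MiniMax", "tts", 60.0, 1_000_000),       # $60 per 1M characters
    PriceRow("ElevenLabs", "transcription", 0.22, 1),  # $0.22 per hour
]
# TTS is shown per 1M characters, transcription per hour; the two are never mixed.
scale = {"tts": 1_000_000, "transcription": 1}

table = normalized_by_workload(rows, scale)
for workload, prices in table.items():
    for provider, usd in prices.items():
        print(f"{workload:14s} {provider:11s} ${usd:,.2f}")
```

The normalized figure is only a common denominator inside one workload. Voice tier, latency, and quality gates still decide whether two rows are comparable.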
Set expected usage; each row estimates monthly cost from official unit prices.
8 official USD rows - checked 2026-07-04 - sorted by official source order.
| Provider | Published price | Estimated monthly cost | Region | Source | Notes |
|---|---|---|---|---|---|
| OpenAI | $0.034 per minute, realtime speech translation | $35.02 | Global | OpenAI API pricing (checked 2026-07-04) | OpenAI also lists gpt-realtime-whisper at $0.017/minute. |
| ElevenLabs | $0.05 per 1K characters, ultra-low latency text to speech | $51.50 | ElevenAPI | ElevenAPI pricing (checked 2026-07-04) | Multilingual v2/v3 is listed at $0.10 per 1K characters. |
| ElevenLabs | $0.22 per hour, speech to text, bulk transcription | $22.66 | ElevenAPI | ElevenAPI pricing (checked 2026-07-04) | Realtime Scribe v2 is listed at $0.39 per hour. |
| xAI | $0.05 per minute, realtime voice API | $51.50 | xAI API | xAI API pricing (checked 2026-07-04) | xAI also lists TTS at $15 per 1M characters and STT at $0.10/hour REST. |
| Mistral | $0.016 per 1K characters, text-to-speech generation and voice cloning | $16.48 | Mistral API | Mistral pricing (checked 2026-07-04) | Available on /v1/audio/speech. |
| MiniMax | $60 per 1M characters, text to audio | $61.80 | MiniMax API | MiniMax pay-as-you-go pricing (checked 2026-07-04) | HD speech model is listed at $100 per 1M characters. |
| Alibaba Cloud | $0.10 per 10K input characters; output is not billed | $10.30 | International deployment | Alibaba Cloud Model Studio pricing (checked 2026-07-04) | TTS voice cloning and realtime variants have separate character prices. |
| Z.AI | $0.03 / MTok, approximately $0.0024 per minute, speech recognition | $2.47 | Z.AI API | Z.AI pricing (checked 2026-07-04) | The per-minute figure is the provider's equivalent value. |
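The dollar estimate in each row above can be reproduced with simple arithmetic. The sketch below is an inference, not the page's documented calculator: it assumes monthly usage of 1,000 minutes, 100 hours, or 1M characters, plus a 3% overhead factor, because those values happen to match the listed estimates. The function name and parameters are hypothetical.

```python
# Assumed overhead factor; inferred from the row estimates, not a documented default.
OVERHEAD = 1.03


def monthly_estimate(usd: float, per_units: float, monthly_units: float,
                     overhead: float = OVERHEAD) -> float:
    """Published price scaled to monthly usage, with an overhead multiplier applied."""
    return round(usd / per_units * monthly_units * overhead, 2)


# (label, published USD, units covered by that price, assumed monthly units)
examples = [
    ("Realtime speech translation, 1,000 minutes", 0.034, 1, 1_000),
    ("TTS at $0.05 per 1K characters, 1M characters", 0.05, 1_000, 1_000_000),
    ("Bulk transcription, 100 hours", 0.22, 1, 100),
    ("TTS at $0.10 per 10K input characters, 1M characters", 0.10, 10_000, 1_000_000),
]
for label, usd, per_units, monthly_units in examples:
    print(f"{label}: ${monthly_estimate(usd, per_units, monthly_units):.2f}")
```

Under these assumptions the four examples come out to $35.02, $51.50, $22.66, and $10.30, matching the corresponding rows. Replace the usage and overhead with measured values before relying on the result.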
Audio production cost traps
Real audio products combine speech recognition, generated speech, realtime sessions, reconnects, moderation, and sometimes an LLM in the middle. Use this section to keep those costs visible.
TTS, transcription, and realtime voice can look similar in a table, but they bill different product units and should not be flattened into one average price.
Reconnects, silence, overlap buffers, interruptions, and failed turns can add cost even when the useful spoken answer is short.
A voice agent may use STT, an LLM, TTS, tool calls, and safety checks. The audio row is only one part of the full user action cost.
Higher-quality voices, lower latency, dubbing, diarization, or premium transcription can change which provider still fits more than the base row suggests.
Voice cost workflow
A better audio pricing decision starts with the user action that succeeds: a narrated file, a transcribed hour, a dubbed minute, or a finished voice-agent conversation.
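For a voice agent, that idea can be expressed as total spend across every started session divided by the conversations that actually finished. The sketch below is illustrative only: the `VoiceSessionStats` fields, the session counts, and the per-session LLM spend are hypothetical, and the $0.05 per-minute rate is taken from the realtime voice row on this page.

```python
from dataclasses import dataclass


@dataclass
class VoiceSessionStats:
    sessions_started: int       # includes sessions that failed or were abandoned
    sessions_completed: int     # sessions where the user action succeeded
    avg_billed_minutes: float   # per started session, including silence and reconnect time
    llm_usd_per_session: float  # LLM and tool-call spend attached to one session


def cost_per_completed_conversation(stats: VoiceSessionStats,
                                    usd_per_voice_minute: float) -> float:
    """Total spend across all started sessions divided by completed conversations."""
    if stats.sessions_completed == 0:
        raise ValueError("no completed conversations to divide by")
    voice = stats.sessions_started * stats.avg_billed_minutes * usd_per_voice_minute
    llm = stats.sessions_started * stats.llm_usd_per_session
    return (voice + llm) / stats.sessions_completed


# Hypothetical month: 1,200 sessions started, 1,000 completed.
stats = VoiceSessionStats(sessions_started=1_200, sessions_completed=1_000,
                          avg_billed_minutes=4.0, llm_usd_per_session=0.02)
per_conversation = cost_per_completed_conversation(stats, usd_per_voice_minute=0.05)
print(f"${per_conversation:.3f} per completed conversation")
```

With these made-up inputs the result is about $0.264 per completed conversation, against $0.20 if only the clean per-minute row were multiplied by four minutes.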
Audio API pricing FAQ
Audio providers bill by different units: characters, audio hours, realtime minutes, voice minutes, or bundles. Compare the unit your workflow actually delivers, not the cleanest-looking row.
Estimate completed conversations, average voice minutes, reconnects, failed turns, and any LLM or tool calls attached to the voice session. Realtime voice is usually a workflow cost, not one audio row.
TTS character pricing matters most for narration, voiceovers, tutoring, dubbing, and products that regenerate speech for quality or pronunciation fixes.
Check billing unit, model availability, voice tier, latency, streaming support, diarization, language support, retention policy, and whether retries or reconnects are billable.
Yes. A voice agent usually includes STT, an LLM, tools, safety checks, and TTS or realtime voice. Estimate the completed conversation, not only the audio provider row.
Retries can mean regenerated speech, repeated transcription, reconnects, or failed voice turns. Add them only where your workflow creates them, then compare the cost per accepted audio output.
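The cost per accepted audio output can be sketched as spend on every attempt, retries included, divided by the outputs that pass review. The function name and the attempt counts below are hypothetical; the $0.10 per 1K characters rate is the Multilingual figure quoted in the table notes.

```python
def cost_per_accepted_output(unit_price_usd: float, units_per_attempt: float,
                             attempts: int, accepted: int) -> float:
    """Spend on every attempt (including retries) divided by accepted outputs."""
    if accepted <= 0:
        raise ValueError("need at least one accepted output")
    if attempts < accepted:
        raise ValueError("attempts cannot be fewer than accepted outputs")
    return unit_price_usd * units_per_attempt * attempts / accepted


# Hypothetical narration job: clips of 2K characters at $0.10 per 1K characters,
# 130 generations needed to get 100 accepted takes.
clean = cost_per_accepted_output(0.10, 2.0, attempts=100, accepted=100)
with_retries = cost_per_accepted_output(0.10, 2.0, attempts=130, accepted=100)
print(f"clean: ${clean:.2f} per clip, with retries: ${with_retries:.2f} per clip")
```

Here the accepted clip costs about $0.26 instead of $0.20, a difference that never appears in the published row.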