Official prices plus workload math

Speech and Audio API Pricing Calculator

Speech and audio APIs use different billing units. Keep characters, transcription hours, voice minutes, and reconnect overhead separate before choosing a provider.

Price rows: 8

Providers: 7

Official sources: 7

Last updated: 2026-07-04T06:36:11.407Z

Last source check: 2026-07-04

What changed in this update

Refreshed 8 official audio price rows.

Grouped rows across 7 providers and 7 official source pages.

Kept workload guidance tied to launch checks, real usage units, and official-source verification.

What the price row misses

The useful number is the cost of a successful workflow, not the cleanest API row.

AI API pricing pages often look simple because they compare one published row at a time. This page keeps the official row visible, then adds the messy assumptions that show up in products: retries, long outputs, cache misses, batches, and review loops.

Source check

Every listed row should trace back to an official provider pricing, docs, model, or API page before it becomes a comparison row.

Unit check

Rows are kept in their original billing units when conversion would hide an important difference, such as per-second video or per-image generation.

Workload check

The calculator starts from product behavior: retries, cache hits, long prompts, output length, batch jobs, and rejected generations.

Launch check

Before a production rollout, reopen the official source because provider prices, cache rules, model names, and eligibility can change quickly.

Pricing validation playbook

Validate the bill with your product workflow before choosing a provider.

Official rows are the starting point. The production decision comes from measuring the unit your users actually complete, the retries they create, and the quality gates you need before an output is accepted.

Define the unit

Cost per completed voice job

Separate transcript hours, TTS characters, dubbed minutes, realtime voice minutes, and completed conversations before comparing providers.
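
One way to make "cost per completed voice job" concrete is to divide everything billed in a period by the jobs users actually finished. The sketch below is a minimal Python illustration; the `PeriodUsage` fields and the example figures are hypothetical and are not provider prices.

```python
from dataclasses import dataclass


@dataclass
class PeriodUsage:
    """Billed spend for one period, kept in separate buckets (hypothetical fields)."""
    stt_cost: float        # transcription spend, USD
    tts_cost: float        # text-to-speech spend, USD
    realtime_cost: float   # realtime voice spend, USD
    completed_jobs: int    # voice jobs that users actually finished


def cost_per_completed_job(usage: PeriodUsage) -> float:
    """Total billed audio spend divided by completed jobs."""
    if usage.completed_jobs <= 0:
        raise ValueError("no completed jobs in this period")
    total = usage.stt_cost + usage.tts_cost + usage.realtime_cost
    return total / usage.completed_jobs


# Illustrative numbers only, not taken from any provider.
example = PeriodUsage(stt_cost=22.0, tts_cost=50.0, realtime_cost=28.0, completed_jobs=400)
print(f"${cost_per_completed_job(example):.2f} per completed job")
```

Keeping the three spend buckets as separate fields means the per-job figure can be recomputed per workload later without re-deriving the totals.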

Instrument retries

Count reconnects and regenerations

Audio workflows can repeat speech, transcription, diarization, voice turns, and failed sessions. Log these separately from clean unit prices.
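
Logging repeated work apart from clean work can be as simple as a counter keyed by usage kind. This is a hedged sketch: the kind names (`voice_seconds`, `reconnect_seconds`, and so on) are placeholders invented for illustration, not fields from any provider's usage report.

```python
from collections import Counter

# Hypothetical usage kinds; rename them to match your own pipeline.
CLEAN = {"tts_chars", "stt_seconds", "voice_seconds"}
REPEATED = {"tts_regen_chars", "stt_retry_seconds", "reconnect_seconds"}


class UsageLog:
    """Accumulates billable units, keeping repeated work apart from clean work."""

    def __init__(self) -> None:
        self.units = Counter()  # usage kind -> accumulated units

    def record(self, kind: str, amount: float) -> None:
        if kind not in CLEAN | REPEATED:
            raise ValueError(f"unknown usage kind: {kind}")
        self.units[kind] += amount

    def repeated_share(self, clean_kind: str, repeated_kind: str) -> float:
        """Repeated units as a fraction of clean units for one workload."""
        clean = self.units[clean_kind]
        return self.units[repeated_kind] / clean if clean else 0.0


log = UsageLog()
log.record("voice_seconds", 600)      # ten minutes of normal conversation
log.record("reconnect_seconds", 30)   # audio replayed after reconnects
print(log.repeated_share("voice_seconds", "reconnect_seconds"))
```

The measured share is the number to feed into a retries or overlap percentage, instead of guessing one.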

Compare quality gates

Measure words, speakers, and latency

Evaluate accuracy, speaker separation, voice quality, turn latency, noise handling, and post-processing before choosing the cheapest unit.

Review privacy

Check retention and consent requirements

Before launch, verify provider data retention, region, consent, logging, and voice cloning or speaker-identification policy constraints.

Audio workflow cheat sheet

Separate speech, transcription, and realtime voice before comparing rows.

Audio API cost depends on the unit your workflow actually delivers: characters, transcription hours, realtime voice minutes, dubbed minutes, or completed voice conversations. Keep those units separate before choosing a provider.

Workload	What moves the bill	Measure first
Text-to-speech narration	Characters, voice tier, regeneration, pronunciation fixes, and accepted takes.	Accepted spoken characters per month.
Speech-to-text transcription	Audio hours, diarization, language support, noise, and failed files.	Processed audio hours per month.
Realtime voice agents	Voice minutes, reconnects, turn-taking, STT, LLM calls, and TTS output.	Cost per completed conversation.
Dubbing or translation	Source audio length, target languages, voice matching, review passes, and re-renders.	Approved dubbed minutes per language.
Batch call analysis	Recorded hours, queue timing, extraction prompts, and post-processing.	Calls processed per nightly batch.

Official API price calculator

Audio APIs

Covers text-to-speech (TTS), speech-to-text (STT), realtime voice agents, translation, dubbing, and sound workflows.

Workload assumptions

Set expected usage; each row estimates monthly cost from official unit prices.

Inputs: TTS characters, STT hours, voice minutes, and retries / overlap (%).

Official price table

8 official USD rows - checked 2026-07-04 - sorted by official source order.

Daily source checks

Provider	Published price	Estimated monthly cost	Region	Source	Notes
OpenAI	$0.034 per minute (realtime speech translation)	$35.02	Global	OpenAI API pricing (checked 2026-07-04)	OpenAI also lists gpt-realtime-whisper at $0.017/minute.
ElevenLabs	$0.05 per 1K characters (ultra-low latency text to speech)	$51.50	ElevenAPI	ElevenAPI pricing (checked 2026-07-04)	Multilingual v2/v3 is listed at $0.10 per 1K characters.
ElevenLabs	$0.22 per hour (speech to text, bulk transcription)	$22.66	ElevenAPI	ElevenAPI pricing (checked 2026-07-04)	Realtime Scribe v2 is listed at $0.39 per hour.
xAI	$0.05 per minute (realtime voice API)	$51.50	xAI API	xAI API pricing (checked 2026-07-04)	xAI also lists TTS at $15 per 1M characters and STT at $0.10/hour REST.
Mistral	$0.016 per 1K characters (text-to-speech generation and voice cloning)	$16.48	Mistral API	Mistral pricing (checked 2026-07-04)	Available on /v1/audio/speech.
MiniMax	$60 per 1M characters (text to audio)	$61.80	MiniMax API	MiniMax pay-as-you-go pricing (checked 2026-07-04)	HD speech model is listed at $100 per 1M characters.
Alibaba Cloud	$0.10 per 10K input characters (output is not billed)	$10.30	International deployment	Alibaba Cloud Model Studio pricing (checked 2026-07-04)	TTS voice cloning and realtime variants have separate character prices.
Z.AI	$0.03 / MTok, approximately $0.0024 per minute (speech recognition)	$2.47	Z.AI API	Z.AI pricing (checked 2026-07-04)	The per-minute figure is the provider's equivalent value.
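
The dollar figure in each row above can be reproduced from the published unit price. The page does not state the calculator inputs, so the sketch below assumes 1,000,000 TTS characters, 100 STT hours, 1,000 voice minutes, and a 3% retries / overlap buffer, which are values that happen to match every row. The row labels and the bucket each row is charged against are likewise inferred, not documented.

```python
# Assumed calculator inputs (not stated on the page; chosen because they
# reproduce the dollar figure shown in each table row).
USAGE = {
    "tts_characters": 1_000_000,
    "stt_hours": 100,
    "voice_minutes": 1_000,
}
OVERHEAD = 0.03  # retries / overlap, as a fraction

# (label, price in USD, units covered by that price, usage bucket)
ROWS = [
    ("OpenAI realtime speech translation", 0.034, 1, "voice_minutes"),
    ("ElevenLabs ultra-low latency TTS", 0.05, 1_000, "tts_characters"),
    ("ElevenLabs bulk transcription", 0.22, 1, "stt_hours"),
    ("xAI realtime voice", 0.05, 1, "voice_minutes"),
    ("Mistral TTS", 0.016, 1_000, "tts_characters"),
    ("MiniMax text to audio", 60.0, 1_000_000, "tts_characters"),
    ("Alibaba Cloud input characters", 0.10, 10_000, "tts_characters"),
    # Inference: the table's $2.47 matches the per-minute equivalent applied
    # to voice minutes, not to STT hours.
    ("Z.AI speech recognition", 0.0024, 1, "voice_minutes"),
]


def monthly_estimate(price, per_units, bucket, usage=USAGE, overhead=OVERHEAD):
    """Usage in billing units x unit price x (1 + overhead), rounded to cents."""
    return round(usage[bucket] / per_units * price * (1 + overhead), 2)


for label, price, per_units, bucket in ROWS:
    print(f"{label}: ${monthly_estimate(price, per_units, bucket):.2f}")
```

Under these assumptions the Z.AI row is the only speech-recognition row charged against voice minutes. If the same per-minute figure were applied to 100 STT hours instead, the estimate would be about $14.83, so confirm which bucket applies before comparing it with the ElevenLabs transcription row.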

Audio production cost traps

A voice feature is usually more than one audio row.

Real audio products combine speech recognition, generated speech, realtime sessions, reconnects, moderation, and sometimes an LLM in the middle. Use this section to keep those costs visible.

Characters, hours, and minutes are different businesses

TTS, transcription, and realtime voice can look similar in a table, but they bill different product units and should not be flattened into one average price.

Realtime voice creates session overhead

Reconnects, silence, overlap buffers, interruptions, and failed turns can add cost even when the useful spoken answer is short.
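
A rough way to see session overhead is to compare billed minutes with useful minutes. The sketch assumes the provider bills the whole open session, including silence and replayed audio; whether that holds is something to verify on the official source. The function and parameter names are illustrative.

```python
def effective_price_per_useful_minute(
    price_per_minute: float,
    useful_min: float,
    silence_min: float = 0.0,
    reconnect_overlap_min: float = 0.0,
    failed_turn_min: float = 0.0,
) -> float:
    """Published per-minute price scaled by billed minutes / useful minutes.

    Assumes every minute the session is open is billable.
    """
    if useful_min <= 0:
        raise ValueError("useful_min must be positive")
    billed = useful_min + silence_min + reconnect_overlap_min + failed_turn_min
    return price_per_minute * billed / useful_min


# Example: a $0.05/minute realtime row, 2 useful minutes inside a 4-minute session.
rate = effective_price_per_useful_minute(
    0.05, useful_min=2.0, silence_min=1.0, reconnect_overlap_min=0.5, failed_turn_min=0.5
)
print(f"${rate:.3f} per useful minute")
```

In this made-up session the effective price is double the published row, even though no retry in the usual sense occurred.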

Voice agents often pay twice

A voice agent may use STT, an LLM, TTS, tool calls, and safety checks. The audio row is only one part of the full user action cost.
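
The stacked cost of one voice-agent turn sequence can be sketched as a sum of parts. The audio prices below reuse two rows from the table ($0.22 per hour for transcription, $0.05 per 1K characters for TTS); the LLM token prices are hypothetical placeholders, and tool calls and safety checks are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ConversationProfile:
    user_audio_minutes: float  # audio sent to speech-to-text
    llm_input_tokens: int
    llm_output_tokens: int
    tts_characters: int        # characters spoken back to the user


@dataclass
class Prices:
    stt_per_hour: float
    llm_input_per_mtok: float   # hypothetical placeholder
    llm_output_per_mtok: float  # hypothetical placeholder
    tts_per_1k_chars: float


def conversation_cost(profile: ConversationProfile, prices: Prices) -> dict:
    """Cost of one completed conversation, itemized by component."""
    parts = {
        "stt": profile.user_audio_minutes / 60 * prices.stt_per_hour,
        "llm": profile.llm_input_tokens / 1e6 * prices.llm_input_per_mtok
        + profile.llm_output_tokens / 1e6 * prices.llm_output_per_mtok,
        "tts": profile.tts_characters / 1000 * prices.tts_per_1k_chars,
    }
    parts["total"] = sum(parts.values())
    return parts


profile = ConversationProfile(
    user_audio_minutes=3, llm_input_tokens=4000, llm_output_tokens=800, tts_characters=1200
)
prices = Prices(stt_per_hour=0.22, llm_input_per_mtok=1.0,
                llm_output_per_mtok=4.0, tts_per_1k_chars=0.05)
cost = conversation_cost(profile, prices)
print({name: round(value, 4) for name, value in cost.items()})
```

Returning the itemized parts rather than a single total makes it obvious which component dominates for a given conversation shape.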

Quality and latency tiers matter

Higher-quality voices, lower latency, dubbing, diarization, or premium transcription can change which provider fits by more than the base row suggests.

Voice cost workflow

Estimate the completed conversation, not one audio call.

A better audio pricing decision starts with the user action that succeeds: a narrated file, a transcribed hour, a dubbed minute, or a finished voice-agent conversation.

1. Split the workload into TTS, STT, realtime voice, dubbing, and batch transcription.

2. Estimate the unit your workflow delivers: spoken characters, audio hours, voice minutes, or completed conversations.

3. Add retries, reconnects, overlap buffers, silence, failed turns, and moderation passes only where they apply.

4. Open the official source before you rely on the estimate because model availability, voice tiers, and realtime rules can change.

Audio API pricing FAQ

Why are audio API prices hard to compare?

Audio providers bill by different units: characters, audio hours, realtime minutes, voice minutes, or bundles. Compare the unit your workflow actually delivers, not the cleanest-looking row.

How should I estimate realtime voice cost?

Estimate completed conversations, average voice minutes, reconnects, failed turns, and any LLM or tool calls attached to the voice session. Realtime voice is usually a workflow cost, not one audio row.

When does TTS character pricing matter most?

TTS character pricing matters most for narration, voiceovers, tutoring, dubbing, and products that regenerate speech for quality or pronunciation fixes.

What should I verify before choosing an audio API?

Check billing unit, model availability, voice tier, latency, streaming support, diarization, language support, retention policy, and whether retries or reconnects are billable.

Should voice agent cost include the LLM bill?

Yes. A voice agent usually includes STT, an LLM, tools, safety checks, and TTS or realtime voice. Estimate the completed conversation, not only the audio provider row.

How do retries affect speech and audio pricing?

Retries can mean regenerated speech, repeated transcription, reconnects, or failed voice turns. Add them only where your workflow creates them, then compare the cost per accepted audio output.
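
Cost per accepted audio output follows from attempt counts: every attempt is paid for, but only accepted ones count. The sketch assumes each attempt costs the same and that rejected attempts are billed in full. The attempt counts are made up, and the price per take is derived from the $0.016 per 1K characters row at an assumed 1,500 characters per take.

```python
def cost_per_accepted_output(cost_per_attempt: float, attempts: int, accepted: int) -> float:
    """Spend on all attempts divided by the outputs that were accepted."""
    if accepted <= 0:
        raise ValueError("no accepted outputs")
    if attempts < accepted:
        raise ValueError("attempts cannot be fewer than accepted outputs")
    return cost_per_attempt * attempts / accepted


# Assumed: 1,500 characters per narration take at $0.016 per 1K characters.
take_cost = 1_500 / 1_000 * 0.016
per_accepted = cost_per_accepted_output(take_cost, attempts=130, accepted=100)
print(f"clean take: ${take_cost:.4f}, accepted take: ${per_accepted:.4f}")
```

Here 30 regenerations per 100 accepted takes raise the effective price by 30%, which is the adjustment to apply before comparing providers on this workload.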