Generate speech (text-to-speech) using ElevenLabs, Gemini, OpenAI, and Qwen models through the OpenAI-compatible audio API

Speech Generation

LLMGateway supports text-to-speech (TTS) through the OpenAI-compatible /v1/audio/speech endpoint, powered by ElevenLabs, Google Gemini, OpenAI, and Alibaba Qwen speech models.

Want to hear the voices before writing code? The Audio Studio in Lounge generates speech from up to three models side by side, with per-model voice, format, and speed controls.

Available Models

Browse all speech generation models, with up-to-date pricing, on the models page.

Billing varies by model family. Some models are billed on the token usage reported by the provider (input text tokens and output audio tokens). Others return audio bytes without usage data, so they are billed on input character count instead. See the models page for each model's exact pricing.
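To make the two billing modes concrete, here is a minimal cost-estimation sketch. The `Pricing` shape, the `estimateCost` helper, and every rate used with it are hypothetical placeholders, not gateway APIs or real prices. It also assumes characters are counted as JavaScript string length, which may differ from how the gateway counts them.

```typescript
// Hypothetical pricing shapes; real rates are listed on the models page.
type Pricing =
  | { kind: "per_token"; inputPerMillion: number; outputPerMillion: number }
  | { kind: "per_character"; perMillionCharacters: number };

interface Usage {
  inputTokens: number; // input text tokens reported by the provider
  outputTokens: number; // output audio tokens reported by the provider
}

function estimateCost(pricing: Pricing, text: string, usage?: Usage): number {
  if (pricing.kind === "per_character") {
    // Assumption: one character = one UTF-16 code unit (text.length).
    return (text.length / 1e6) * pricing.perMillionCharacters;
  }
  if (usage === undefined) {
    throw new Error("token-billed models need provider-reported usage");
  }
  return (
    (usage.inputTokens / 1e6) * pricing.inputPerMillion +
    (usage.outputTokens / 1e6) * pricing.outputPerMillion
  );
}
```

For a character-billed model the estimate can be computed before sending the request. For a token-billed model it is only known once the provider reports usage.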

Parameters

Parameter	Type	Default	Description
`model`	string	required	The speech model to use
`input`	string	required	The text to synthesize into speech
`voice`	string	varies by model	A prebuilt voice. Defaults to `Kore` (Gemini), `alloy` (OpenAI), `Sarah` (ElevenLabs), or the model's first voice on Qwen (`longanlingxin` on Plus, `longanhuan_v3.6` on Flash)
`response_format`	string	varies by model	Audio format. OpenAI: `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`. ElevenLabs: `mp3` (default), `wav`, `pcm`, `opus`. Gemini: `wav` (default), `pcm`. Qwen: `wav`
`instructions`	string	—	Optional style/delivery directive prepended to the input (e.g. `"Say cheerfully"`)
`speed`	number	—	Accepted for OpenAI compatibility, but not applied by Gemini speech models
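As a rough illustration of how these parameters fit together, the sketch below builds a request for the endpoint without sending it. `SpeechParams` and `buildSpeechRequest` are hypothetical names introduced for this example. The URL, headers, and field names are the ones used elsewhere on this page.

```typescript
// Request fields from the parameter table; only model and input are required.
interface SpeechParams {
  model: string;
  input: string;
  voice?: string;
  response_format?: string;
  instructions?: string;
  speed?: number;
}

// Hypothetical helper: returns the URL and options to hand to fetch().
function buildSpeechRequest(params: SpeechParams, apiKey: string) {
  if (!params.model || !params.input) {
    throw new Error("model and input are required");
  }
  return {
    url: "https://api.llmgateway.io/v1/audio/speech",
    init: {
      method: "POST",
      headers: {
        Authorization: "Bearer " + apiKey,
        "Content-Type": "application/json",
      },
      // JSON.stringify drops undefined properties, so optional fields that
      // are left unset are omitted and the model's defaults apply.
      body: JSON.stringify(params),
    },
  };
}
```

A call such as `buildSpeechRequest({ model: "gemini-2.5-flash-preview-tts", input: "Hello", instructions: "Say cheerfully" }, apiKey)` yields a `url` and `init` pair that can be passed to `fetch(url, init)`.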

Gemini speech models return raw PCM audio. LLMGateway wraps it in a WAV container by default (response_format: "wav"), or returns the raw 16-bit little-endian PCM at 24 kHz when response_format: "pcm" is requested. Compressed formats such as mp3 are not available on Gemini models. They are offered by the OpenAI and ElevenLabs models, which return the audio already encoded in the requested format.
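If you request response_format: "pcm" and later need a playable file, you can add the WAV header yourself. The sketch below writes a standard 44-byte RIFF/WAVE header in front of the samples. `pcmToWav` is a hypothetical helper, and the mono channel count is an assumption, since only the bit depth and sample rate are stated above. It is not necessarily how LLMGateway performs the wrapping internally.

```typescript
// Wrap raw 16-bit little-endian PCM in a minimal 44-byte WAV (RIFF) header.
// Sample rate and bit depth follow the text above; mono is an assumption.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1, // assumption: channel count is not stated in the docs
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second of audio
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) out[offset + i] = tag.charCodeAt(i);
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk body
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeTag(36, "data");
  view.setUint32(40, pcm.length, true); // size of the sample data in bytes
  out.set(pcm, 44); // samples follow the header unchanged
  return out;
}
```

Under the same mono assumption, the clip duration in seconds is the PCM byte length divided by 48,000 (24,000 samples per second at 2 bytes per sample).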

curl

curl -X POST "https://api.llmgateway.io/v1/audio/speech" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-preview-tts",
    "input": "Hello, welcome to LLM Gateway!",
    "voice": "Kore"
  }' \
  --output speech.wav

OpenAI SDK

The endpoint works with the standard OpenAI client library. Set the client's base URL to LLMGateway and use your LLMGateway API key.

import OpenAI from "openai";
import { writeFileSync } from "fs";

const openai = new OpenAI({
	apiKey: process.env.LLM_GATEWAY_API_KEY,
	baseURL: "https://api.llmgateway.io/v1",
});

const response = await openai.audio.speech.create({
	model: "gemini-2.5-flash-preview-tts",
	voice: "Kore",
	input: "Hello, welcome to LLM Gateway!",
});

const buffer = Buffer.from(await response.arrayBuffer());
writeFileSync("speech.wav", buffer);

Streaming

Streaming speech responses (chunked audio or stream_format: "sse") are not supported yet. The endpoint always returns the complete audio file in a single response, so there is no low-latency, play-as-you-go output for now.

Voices

Gemini exposes 30 prebuilt voices. A few common ones: Kore, Puck, Zephyr, Charon, Fenrir, Leda, Orus, Aoede. When voice is omitted on a Gemini model, Kore is used.

OpenAI voices include alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, and verse. When voice is omitted on an OpenAI model, alloy is used.

ElevenLabs models accept 20 named voices, including Sarah, Aria, Roger, Laura, Charlie, George, Charlotte, Jessica, Brian, and Lily. When voice is omitted on an ElevenLabs model, Sarah is used. A raw ElevenLabs voice id is also accepted directly.

Qwen-Audio-3.0-TTS voices are model-specific and cannot be mixed between models: longanlingxin and longanlufeng on Plus (default longanlingxin), and longanhuan_v3.6, longjielidou_v3.6, loongeva_v3.6, and loongjohn on Flash (default longanhuan_v3.6).
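The defaults and the Qwen restriction described above can be captured in a small client-side lookup. This is only a sketch: the family keys (`"qwen-plus"`, `"qwen-flash"`, and so on) are hypothetical labels rather than real model ids, and `resolveVoice` is a helper invented for this example. The voice names are the ones listed in this section.

```typescript
// Default voice per model family, as listed above.
// The keys are hypothetical labels, not real model ids.
const DEFAULT_VOICE = {
  gemini: "Kore",
  openai: "alloy",
  elevenlabs: "Sarah",
  "qwen-plus": "longanlingxin",
  "qwen-flash": "longanhuan_v3.6",
} as const;

// Qwen voices are tied to one model and cannot be mixed between models.
const QWEN_VOICES = {
  "qwen-plus": ["longanlingxin", "longanlufeng"],
  "qwen-flash": ["longanhuan_v3.6", "longjielidou_v3.6", "loongeva_v3.6", "loongjohn"],
} as const;

type Family = keyof typeof DEFAULT_VOICE;

function resolveVoice(family: Family, voice?: string): string {
  // No voice given: fall back to the family's documented default.
  if (voice === undefined) return DEFAULT_VOICE[family];
  if (family === "qwen-plus" || family === "qwen-flash") {
    const allowed: readonly string[] = QWEN_VOICES[family];
    if (allowed.indexOf(voice) === -1) {
      throw new Error("voice " + voice + " is not available on " + family);
    }
  }
  // Other families pass the voice through unchecked (ElevenLabs, for
  // example, also accepts a raw voice id).
  return voice;
}
```

Checking the voice before sending the request gives a clearer error than a failed API call, but the lists here have to be kept in sync with the models page by hand.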

ElevenLabs

The four ElevenLabs models are billed per input character (see the models page for rates):

eleven-multilingual-v2 — most lifelike, rich emotional expression, 29 languages
eleven-v3 — most expressive and human-like, 70+ languages
eleven-flash-v2-5 — ultra-low latency, 32 languages
eleven-turbo-v2-5 — fast and balanced, 32 languages

curl -X POST "https://api.llmgateway.io/v1/audio/speech" \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "eleven-multilingual-v2",
    "input": "Hello, welcome to LLM Gateway!",
    "voice": "Sarah"
  }' \
  --output speech.mp3
