The IBM Text to Speech service provides an Application Programming Interface (API) that uses IBM's speech-synthesis capabilities to convert written text to natural-sounding speech in a variety of languages and voices. The service streams the results back to the client with minimal delay. You can customize and control the pronunciation of specific words to deliver a seamless voice interaction that caters to your audience.
The service offers the following features:
- HTTP and WebSocket interfaces: Supports speech synthesis via both HTTP REST and WebSocket interfaces. Both interfaces enable the use of SSML for all supported languages. The WebSocket interface also supports the SSML <mark> element and can return optional timing information for every word of the input text, so that a client can synchronize the audio with the input, for example, for use with robots.
- Audio formats: Produces Ogg format with the Opus or Vorbis codec, Waveform Audio File Format (WAV), Free Lossless Audio Codec (FLAC), Web Media (WebM) format with the Opus or Vorbis codec, Linear 16-bit Pulse-Code Modulation (PCM), mu-law (u-law), or basic audio.
- Voices: Synthesizes text to audio in a variety of languages, including English, French, German, Italian, Japanese, Spanish, and Brazilian Portuguese. The service offers at least one male or female voice, sometimes both, for each language. It also covers different dialects, such as US and UK English and Castilian, Latin American, and North American Spanish. The audio uses appropriate cadence and intonation.
- SSML: Accepts plain text or text that is tagged with the Speech Synthesis Markup Language (SSML), an XML-based markup language that provides annotations of text for speech synthesis applications.
- Expressiveness: Augments SSML with an expressive element that lets you indicate a speaking style of GoodNews, Apology, or Uncertainty. This feature is currently available only for the US English Allison voice.
- Voice transformation: Extends SSML by adding a voice transformation element that lets you expand the range of possible voices by controlling aspects such as pitch, rate, and timbre. The service also offers two built-in virtual voices, Young and Soft. This feature is currently available only for US English voices.
- Customization: Provides a customization interface that lets you specify how it pronounces unusual words that occur in your input. You can define pronunciations with the International Phonetic Alphabet (IPA) or IBM Symbolic Phonetic Representation (SPR).
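To make the SSML and expressiveness features above more concrete, the following minimal Python sketch builds an SSML string that requests one of the three speaking styles. It only constructs the text and does not call the service. The element name `express-as` and its `type` attribute are assumptions that are not stated in this overview, and the helper names `express_as` and `speak` are hypothetical, so check the service's API reference before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr

# Speaking styles named in the feature list above.
EXPRESSIVE_STYLES = {"GoodNews", "Apology", "Uncertainty"}


def express_as(text: str, style: str) -> str:
    """Wrap text in an expressive SSML element.

    Assumption: the element is <express-as type="...">; this overview
    does not give the element name.
    """
    if style not in EXPRESSIVE_STYLES:
        raise ValueError(f"unsupported speaking style: {style!r}")
    # escape() protects characters such as & and < in the spoken text.
    return f"<express-as type={quoteattr(style)}>{escape(text)}</express-as>"


def speak(*fragments: str) -> str:
    """Join already-built SSML fragments inside a <speak> root element."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = speak(express_as("Your order has shipped & is on its way.", "GoodNews"))
print(ssml)
```

The resulting string could then be sent as the input text over either the HTTP or the WebSocket interface. Escaping the text before embedding it keeps the markup well-formed when the input contains characters that are reserved in XML.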