If you've discovered this blog, chances are you are already familiar with the concept of AI generated speech. Usually, synthetic speech is generated from text (text-to-speech). These days, we often hear about this type of speech when discussing apps like Google Maps. The naturalness and audio quality of synthetic speech have made considerable gains in the past few years due to a revolution in artificial intelligence (deep learning).
Google Maps' voice is light years ahead of the Stephen Hawking voice, but it still struggles with unusual words and puts emphasis in the wrong places. The problem gets even worse when a dramatic, emotional performance is required. Imagine watching an entire movie with the characters voiced by Google Maps.
Unfortunately, artificial intelligence (AI) won't be able to completely solve this problem until it learns to perform method acting and take a director's notes.
Here at Respeecher, we're developing a different kind of technology. We use artificial intelligence to synthesize speech, but we don't use text at all. Our software does speech-to-speech voice conversion: instead of replacing a human being, it allows a person to speak in a different voice.
You can read more about it in our article Respeecher Explained: The Speech Synthesis Software for Content Creators. But for now, let's dive into the most common problems found in synthetic speech. These problems affect all AI voices, whether generated from text or from actual speech, though not always equally.
There are two main types of pronunciation errors made by synthetic speech systems. Text-to-speech (TTS) systems often simply don't know how to pronounce a word. Think how tricky this can be for unusual words spelled in strange ways, or for homographs, words that are spelled the same but pronounced differently, like "putting" (from the very common verb "put") and "putting" (from the less common verb "putt" that has to do with golf).
Speech-to-speech (STS) systems almost entirely avoid this kind of pronunciation error, and if it does happen, it is generally the fault of the source speaker, not of the system. The other type of pronunciation mistake has to do with pronouncing a sound unclearly or substituting one sound for another.
Actually, the result can be the same as not knowing how to pronounce a word: the system might substitute the "u" of one "putting" for the "u" of the other. But the origin is different. Some older text-to-speech systems are hand-constructed. They are incredibly consistent in how they pronounce words, so although they may sound very unnatural (like Stephen Hawking's voice), they are essentially immune to this mistake.
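To make the homograph problem concrete, here is a minimal, hypothetical sketch of the pronunciation-lexicon lookup a TTS front end performs. The mini-lexicon and its simplified ARPAbet entries are illustrative stand-ins for a real dictionary such as CMUdict, not any vendor's actual front end:

```python
# A minimal, hypothetical sketch of why homographs trip up TTS front ends.
# "mini_lexicon" is a toy stand-in for a real pronunciation dictionary
# such as CMUdict; the ARPAbet transcriptions are simplified.

mini_lexicon = {
    # "putting" has two valid pronunciations depending on meaning:
    "putting": [
        ["P", "UH1", "T", "IH0", "NG"],  # from "put": placing something
        ["P", "AH1", "T", "IH0", "NG"],  # from "putt": the golf stroke
    ],
    "red": [["R", "EH1", "D"]],  # most words have exactly one entry
}

def pronounce(word: str) -> list[str]:
    """Return the first listed pronunciation, the naive TTS choice."""
    variants = mini_lexicon.get(word.lower())
    if variants is None:
        raise KeyError(f"'{word}' not in lexicon; grapheme-to-phoneme guessing needed")
    if len(variants) > 1:
        # Without understanding the sentence, a TTS front end can only
        # guess which variant is intended. This is where it goes wrong.
        print(f"warning: '{word}' is ambiguous ({len(variants)} variants)")
    return variants[0]

print(pronounce("putting"))  # may well pick the wrong sense of the word
```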
Sometimes, synthetic speech has the wrong sound (substituting one sound for another) or pronounces a sound unclearly. Paradoxically, this issue hits the most up-to-date systems, the ones that rely most heavily on artificial intelligence, hardest. And although it affects both TTS and STS, it affects STS more, because textual input is very consistent (the same word always appears the same way on the page), while the same word can be spoken in all kinds of ways.
At Respeecher, we use many different proprietary algorithmic techniques to fight mispronunciation in voice AI, and we are always looking for new ones. But one thing that usually helps is more data. Currently, for the best results, we ask customers for one hour of recordings each from the target speaker (the person whose voice we are cloning) and the source speaker (the person whose voice will be converted).
While modern TTS systems have good audio quality, they have difficulty pronouncing uncommon words, and probably the worst problem they suffer from is unnatural prosody. "Prosody" is a catch-all term for rhythm, intonation, and, in general, features of speech that span multiple words.
Prosody is difficult for TTS because to really nail it, a system needs to understand the meaning of what it is saying. There is an infinite variety of ways to say something at the prosodic level, unlike at the phonetic level, where there is typically just one way to pronounce a word (in a given dialect).
Speech-to-speech has a natural advantage in prosody over TTS because it excels at duplicating the source speaker's prosody (and the source speaker, hopefully, does understand the text). Respeecher's technology produces far more natural-sounding prosody than TTS systems, offering content creators an infinite prosodic palette.
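To see what this means in practice, here is a minimal sketch, assuming librosa is installed, that extracts two prosodic signals from a toy utterance: the pitch (F0) contour and the loudness envelope. These are exactly the kinds of word-spanning features a speech-to-speech system can carry over from the source actor's performance:

```python
# A toy "utterance" whose pitch glides up one octave over two seconds;
# we then measure the prosodic features a human performance would carry.
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 2.0, int(sr * 2.0), endpoint=False)
f0_true = 120 * 2 ** (t / 2.0)            # pitch rising from 120 to 240 Hz
y = 0.5 * np.sin(2 * np.pi * np.cumsum(f0_true) / sr)

# Intonation: frame-by-frame F0 estimated with the pYIN algorithm
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=80, fmax=400, sr=sr)

# Loudness contour: RMS energy per frame, another prosodic cue
rms = librosa.feature.rms(y=y)[0]

print(f"F0 rises from {np.nanmin(f0):.0f} Hz to {np.nanmax(f0):.0f} Hz")
print(f"{len(f0)} prosody frames summarize {len(y)} audio samples")
```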
On the other hand, even if TTS could solve the problem of producing natural prosody, it still would not be able to produce the perfect performance for a given directorial intent. And a big part of a performance (though by no means all of it) is prosody. TTS is, and will remain, unsuitable for many applications because of this fundamental limitation.
Compared to pronunciation and prosody errors, vocoding and audio quality issues are technical problems that continue to be resolved as the technology improves, at least in cases where high-quality training data and data to convert are available.
We all have an intuitive understanding of audio quality, but what exactly is a vocoder? What does it have to do with audio quality? Both TTS and STS systems often work internally with signals that vary much more slowly than a waveform.
This makes intuitive sense: a high-quality waveform needs to be sampled about 44,000 times per second, but the physical parameters of speech change only about 100 times per second. The control signal that the human brain supplies to produce speech has an even lower timing precision, especially considering how often we tend to change the sound we are producing.
Working with a signal that varies too quickly is computationally inefficient. It can also obscure the true nature of the underlying control signals that produce speech. The vocoder is the component that converts this slowly varying internal representation back into an audible waveform, and that conversion is where audio quality can suffer.
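As a rough illustration of that gap, the sketch below, assuming librosa is available, computes a mel spectrogram, one typical slowly varying internal representation, from two seconds of stand-in audio:

```python
# Contrast the waveform's sample rate with the frame rate of the
# internal representation a vocoder later has to invert.
import numpy as np
import librosa

sr = 44100                              # about 44,000 samples per second
y = 0.1 * np.random.randn(sr * 2)       # two seconds of stand-in "audio"

hop = 441                               # one analysis frame every 10 ms
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=hop, n_mels=80)

print(f"waveform:   {sr} values per second")
print(f"mel frames: {sr // hop} frames per second")  # about 100, as above
print(f"mel shape:  {mel.shape}")                    # (80 bands, ~201 frames)
# The vocoder's job is to turn this ~100 Hz representation back into
# a 44,100 Hz waveform; that inversion is where clicks and artifacts
# tend to creep in.
```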
Some of the most common issues here are noises, clicks, and other sound artifacts that shouldn't be present in AI generated speech. In fact, it's impossible to catalog all of the vocoding issues. Many of them are hard to describe in words because they represent a variety of sound distortions.
The degree to which an AI generated voice sounds like the voice of the target speaker is called speaker identity. Speaker identity problems are common for both TTS and STS technologies.
The issue usually lies in a lack of original audio data for the speech synthesis system to learn from. Assuming we have an hour-long audio recording of the original voice, this should almost completely eliminate the problem. The more variety a recording contains, including different intonations, emotions, and tempos, the more accurate the AI generated speech will be.
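As an aside, "how much it sounds like the target" is commonly quantified, across the industry rather than by any one vendor, as the cosine similarity between fixed-size speaker embeddings. The vectors in this sketch are random stand-ins for embeddings that a trained speaker-verification model would extract from real audio:

```python
# Speaker identity as embedding similarity (toy, self-contained version).
import numpy as np

rng = np.random.default_rng(0)
emb_target = rng.standard_normal(256)                      # target voice
emb_cloned = emb_target + 0.3 * rng.standard_normal(256)   # a decent clone

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The closer to 1.0, the harder the clone is to tell from the target.
print(f"target vs clone similarity: {cosine(emb_target, emb_cloned):.3f}")
```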
But even when the client doesn't have high-resolution sources available, Respeecher has built an audio analogue of the super-resolution algorithm to deliver the highest resolution audio across the board. You can learn more about increasing audio resolution with Respeecher.
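Respeecher's super-resolution algorithm itself is proprietary, but the baseline it improves on is easy to demonstrate. The sketch below, using scipy, performs plain polyphase upsampling; this interpolates between samples but cannot restore frequency content above the original recording's Nyquist limit, which is exactly the content a learned super-resolution model is trained to predict:

```python
# Naive upsampling: the starting point that audio super-resolution beats.
import numpy as np
from scipy.signal import resample_poly

sr_low, sr_high = 16000, 48000
y_low = np.random.randn(sr_low)        # one second of 16 kHz stand-in audio

# 3x polyphase upsampling: the output is 48 kHz only nominally, since
# nothing above the original 8 kHz Nyquist limit is recovered.
y_up = resample_poly(y_low, up=3, down=1)

print(len(y_low), "samples ->", len(y_up), "samples")
# A learned super-resolution model instead predicts the missing
# high-frequency detail from context, yielding genuinely fuller audio.
```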
At Respeecher, we are continually working to gain more control over which aspects of a voice can be transferred and converted. This helps not only with reproducing speaker identity, but with accent as well.
Respeecher can help with dubbing in a foreign language using the voice of the original actor, or with letting people speak in their own voices with foreign accents. Imagine hearing someone speak a language you don't know using your voice.
Now that we've taken a quick glance at the main issues of AI generated speech, you are well equipped to choose the best solution for your needs.
If you need an excellent generic TTS solution, Google is one of the best options available. With its Cloud Text-to-Speech, you can expect some of the best quality on the market. However, its output will still contain prosody and vocoding issues, and you cannot use it to mimic a particular person's voice the way you can with STS technology.
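For reference, a basic request to Cloud Text-to-Speech from Python looks roughly like this; it follows Google's published client-library quickstart and assumes you have a Google Cloud project with credentials configured:

```python
# Synthesize a short phrase with Google Cloud Text-to-Speech.
# Requires: pip install google-cloud-texttospeech, plus GCP credentials.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Hello from the cloud!")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)  # playable MP3 of the spoken text
```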
Keep in mind that other TTS providers may offer systems that sound more natural than Google's in some cases, though they may be less robust to specific phonetic issues or have worse vocoding.
Nevertheless, a major advantage of other text-to-speech providers is that they can offer voices and speaking styles different from Google's. These sometimes include custom voices, just like Respeecher provides with its speech-to-speech technology.
For additional dialogue recording (ADR) or any other use case where you need to re-create a particular voice, speech-to-speech voice conversion is a game-changer. With an hour-long speech sample from the consenting original speaker, Respeecher can help you create unlimited speech content.
Contact us today and see for yourself why Hollywood studios and sound engineers are so excited about Respeecher's AI voice generator technology.