
Ask Me Anything (AMA) with Respeecher CEO Alex Serdiuk, Part III

Written by Alex Serdiuk | Aug 7, 2023 2:30:43 PM

Alex Serdiuk, CEO of Respeecher, answered questions from the audience in a live AMA session we recorded on YouTube. This interview is part of a series of four interviews covering topics such as the ethics of deepfakes, the synthetic media industry, voice resurrection for Hollywood movies, Respeecher in the context of war, and more.

If you haven’t read the first parts, you can do so here:

Watch the full video of the AMA session here:

[Q] Why is permission needed? 

[AS] Permission is necessary when you reproduce someone's voice with voice cloning technology. The voice belongs to a particular person, and that person, or whoever owns the rights to that intellectual property, must give you permission and consent before you replicate the voice. This step is fundamental to respecting and safeguarding personal identity in the era of generative AI tools.

[Q] Who owns the AI? 

[AS] Content created with voices from the Voice Marketplace Library is owned by the content maker who uses the voice. The Marketplace is licensed, so a content maker can use any of the voices there, which turns it into a versatile voice-making tool. Whatever they produce using our library comes with a global license, meaning they own the particular piece of content created with a specific voice.

In all other cases, when we create specific models for clients, the voice owner or their representative owns the voice. That's the only way it works so far. This approach keeps the ethics straight by recognizing the voice as an individual's property. We don't own any voices aside from those in the Voice Marketplace Library, and we license those to creators responsibly.

[Q] Why don't you make it available to everyone again?

[AS] Because we protect the technology from potential misuse, we don't want it used without those boundaries in place. This is a crucial aspect of AI ethics, and I mentioned it before in another section about future applications. Our stance reflects our commitment to making sure our voice AI and voice cloning technologies are used responsibly.

[Q] Can you tell me more about your new healthcare direction? How is Respeecher’s technology helpful? 

[AS] We want to help people with speech disorders improve their quality of life. With speech disorders caused by various medical conditions, people can often speak, but it's really hard to understand them. We hear of cases where patients say they have to repeat whatever they say four or five times in order to be understood on a phone call, including calls with their doctor.

So early this year, in February and March, our technology finally reached the necessary level of robustness, and we got some promising initial results. We started trials with real patients at two universities, one in the UK and one in the US. This application of our AI voice generator shows the potential of AI voices in healthcare.

And now we've found out what the limitations are and what needs to be improved. Luckily, everything that needs improvement in the healthcare direction correlates closely with our general scope of improvements. We are introducing a real-time system, which means the conversion can happen closer to the patient's device, removing the need for an internet connection and the cloud. This is a step forward for generative AI tools in healthcare.
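To make that concrete, here is a minimal sketch of what chunk-based, on-device conversion could look like. It's an illustration only: the LocalModel class and its convert method are hypothetical placeholders, not Respeecher's actual API.

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_MS = 20                              # short frames keep latency low
CHUNK = SAMPLE_RATE * CHUNK_MS // 1000     # samples per 20 ms frame

class LocalModel:
    """Hypothetical on-device voice-conversion model (placeholder)."""
    def convert(self, frame: np.ndarray) -> np.ndarray:
        # A real model would map impaired speech to a clear voice here.
        return frame

def convert_stream(mic_frames, model: LocalModel):
    """Convert microphone frames one by one; there is no cloud
    round-trip, so it works without an internet connection."""
    for frame in mic_frames:
        yield model.convert(frame)

# Example: push one second of silence through the pipeline in 20 ms chunks.
frames = (np.zeros(CHUNK, dtype=np.float32) for _ in range(SAMPLE_RATE // CHUNK))
converted = list(convert_stream(frames, LocalModel()))
```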

Our team is very excited about this direction, and we will invest more in it. We are still defining what we will do next.

So we started with patients with particular diseases, but the tech would be helpful in various cases, and we are currently exploring that. We have started cooperating with people from the community who have built different products to help those patients.

In a week or two, you will see a very interesting case study on our website, and feel free to subscribe to our newsletter to be notified about it. We send a newsletter once per month or even once every two months. 

And another interesting direction is voice banking. There are medical conditions where patients know that they will soon lose their voice, and we get many requests like that. We consult people on what exactly needs to be recorded and what kind of dataset they need to put together now in order to keep access to their voice. We are currently working on a couple of projects like this, and we will be doing more where we train models for people who are losing their voice.

That way they can keep using their voice through text-to-speech (TTS) or even speech-to-speech (STS), for example by giving their model to someone they trust, like a relative. In some cases they can still speak or whisper, and that can be converted into a healthy voice.
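To illustrate the voice-banking flow (record a dataset while the voice is healthy, train a personal model, then use it later via TTS or STS), here is a rough sketch. Every name in it is a hypothetical placeholder, not an actual product interface.

```python
class BankedVoice:
    """Hypothetical personal voice model, trained from recordings
    made while the patient can still speak clearly."""

    def __init__(self, recordings: list[bytes]):
        # The curated dataset is recorded up front; model training
        # would happen here (details out of scope for this sketch).
        self.recordings = recordings

    def tts(self, text: str) -> bytes:
        """Speak arbitrary text in the banked voice."""
        ...

    def sts(self, whisper_audio: bytes) -> bytes:
        """Convert whispered or impaired speech into the healthy voice."""
        ...

# Later: a trusted relative types a message, voice.tts("..."),
# or the patient whispers and it is converted, voice.sts(whisper).
```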

[Q] Have you ever done a voiceover for books? 

[AS] Yeah, we did audiobooks. We recently did an interesting audiobook for a YouTube channel, Jolly. That may be the first audiobook done using speech-to-speech synthetic voice technology.

We have a couple of projects in long-form content, including audiobooks. That's a very exciting direction because text-to-speech (TTS) has its limitations when it comes to voicing an audiobook so that it sounds natural to the listener. Those limitations are not just about prosody; they are also about particular inflections and the human ability to change style depending on the character, something voiceover actors do when they voice audiobooks.

What our speech-to-speech technology brings to the table is the ability to change the voice of the actor voicing the audiobook into many different voices.

Some of us might want to listen to an audiobook voiced by Tom Hanks, but he cannot record every audiobook himself. Instead, a publisher could get permission from Tom Hanks and use his voice to convert an audiobook library into Tom Hanks's voice. That would be amazing, and then we would not be limited to just one voice in audiobooks.

[Q] Can you please explain how your technology is better and different from text-to-speech? 

[AS] There are basically two ways you can synthesize speech. The most popular is text-to-speech (TTS), and we hear it everywhere: Alexa and Google smart speakers and chatbots all use text-to-speech.

The thing is, text-to-speech technology is constrained by its language models. It works within vocabularies and word domains, and it often struggles with unusual names.

The bigger issue with text-to-speech is that it's limited in terms of performance. If you look for the best text-to-speech software out there, you might find a system that is biased toward advertising: it does very good advertising prosody and sounds very natural, but you don't have control over the full range of vocal emotions. You can make text-to-speech sound sad, excited, or calm, but that's it. That's very limited compared to what we can do with the vocal apparatus we were born with.

All those tiny inflections, all those things we produce naturally, can be done only by humans. And that keeps the human touch. 

And that's where speech-to-speech technology comes in, because the performance can be supplied by humans. The director of a movie can say to the voice actor: “Just say it with a bit more warmth. Can you add some violet notes to this particular line?” And the human would understand and do it.

You cannot say that to text-to-speech. And even if you could imagine a text-to-speech system able to introduce all those tiny things we just naturally have in our voice, it wouldn't be handy to use: you would need a huge soundboard with many buttons to control all those tiny inflections, and it would be extremely time-consuming.

Other things are singing, whispering, crying, and emotions; those are not usually covered by text-to-speech. So speech-to-speech always keeps humans in the loop. Humans are in charge of the performance, and the technology is in charge of changing their voice to sound exactly like another voice. That gives you the ability to control emotions, to go wider in terms of emotion, and to say a line in the exact way, with the exact prosody, that you envision, instead of relying on a text-to-speech AI to guess how it should sound and approximate a human saying it. In this case, a human is saying it, and you can work with that human.
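Put schematically, the difference comes down to where the performance originates. Here is a hedged sketch with hypothetical function names; neither reflects a real API:

```python
def text_to_speech(text: str, voice) -> bytes:
    """TTS: the system has to guess prosody, emotion, and timing
    from the text alone; the performance is machine-generated."""
    return voice.synthesize(text)      # hypothetical call

def speech_to_speech(actor_take: bytes, voice) -> bytes:
    """STS: a human actor supplies the performance (prosody, emotion,
    singing, whispering); the model only swaps the voice identity."""
    return voice.convert(actor_take)   # hypothetical call

# To re-direct a line with STS, the actor simply performs it again;
# with TTS, you can only hope the model guesses the new reading.
```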

Subscribe to our newsletter to be notified when the next part is published.