
Ask Me Anything (AMA) with Alex Serdiuk, CEO of Respeecher, Part I

Written by Alex Serdiuk | May 31, 2023 2:50:57 PM

Alex Serdiuk, CEO of Respeecher, answered questions from the audience in a live Ask Me Anything (AMA) session we recorded on YouTube. This article is part of a series of four interviews that will cover topics like the ethics of deepfakes, the synthetic media industry, voice resurrection for Hollywood movies, Respeecher in the context of war, and more. 

Watch the full video of the AMA session here:


[AS] Alex Serdiuk: Hi, everyone. This is the very first time we've done something live - an Ask Me Anything session. Hopefully, we'll do this again in the future with our technical co-founders, the people in charge of ethics and partnerships, and maybe our healthcare direction - not just me. So it's going to be interesting. We have some interesting questions from the audience.

So what we basically did was group those questions into four categories: technology, ethics, future applications, and company-related questions.

And I guess I'll start right away. Thank you so much again for submitting such great questions and thank you for your interest in us.

[Question] What are some new features, updates, and improvements we can look forward to from Respeecher's Voice Synthesis? 

[AS] Thank you. Yeah, there are a lot. The thing is, in the domain of voice synthesis, we have basically just scratched the surface. It's only getting started. We are very early in terms of the technology adoption curve and in terms of awareness of generative AI technology among different content makers. Initially, this technology started with optimizing quality and speaker identity. Our goal from the very first days was to create a synthetic voice, using our voice maker, that would be indistinguishable from the original speaker. Initially, we traded off a lot of applications.

We used to require very complicated data sets and a lot of effort from clients for them to be able to use our voice cloning technology in their productions. It might have taken about two months to deliver a project, a lot of GPU power to train the system, and a lot of manual work to prepare the data sets and make it happen.

But once we hit a bar of quality that let us get past sound engineers and Hollywood studios and land our generative AI technology in some big productions, we started to optimize for usability and speed, and we started to develop additional features based on the core technology.

So now we are working on pushing it to the next level. One direction we are very interested in right now is high-level voice conversion. The core system we have right now works by taking the entire performance from a human source performer - someone who drives the voice that needs to be reproduced and replicated using our technology.

We take everything in terms of emotions and style of speaking from that performer. That's great, because you can manage all those aspects by working directly with a human, who is usually a good voice actor. But it would be cool to be able to add something beyond the vocal timbre to conversions. If you want to convert into, let's say, Tom Hanks' voice, you would also want to carry over Tom Hanks' accent, his style of speech, and his inflections. That's something we call high-level voice conversion.

We have some very good initial results with accent. We just listened to some samples at our Demo Day last Friday, and they were very exciting. So hopefully soon we will release a new feature that adds the accent of the target voice to conversions. That's quite exciting for us.

Also, we're improving the technology for speed, so it takes us less time to train the voice synthesizer model and introduce the target voice to it, and also less time to convert.

Now we have two basic AI voice generator models. One of them is the offline, or core, system. The other is a real-time voice cloning system that works with a small delay of about half a second, maybe 800 milliseconds.
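For readers curious about what that half-second budget means in practice, here is a minimal, purely illustrative Python sketch of a chunked real-time conversion loop. The chunk size, sample rate, and the convert_chunk placeholder are assumptions made for illustration only, not Respeecher's actual implementation.

import time

SAMPLE_RATE = 16_000        # assumed microphone sample rate (Hz)
CHUNK_MS = 200              # assumed amount of audio processed per step
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000

def convert_chunk(samples):
    # Hypothetical stand-in for the real-time voice conversion model;
    # a real system would return the converted audio for this chunk.
    return samples

def stream_convert(mic_chunks):
    # Consume short chunks of microphone audio and yield converted audio,
    # tracking how much end-to-end delay each chunk accumulated.
    for chunk in mic_chunks:
        start = time.perf_counter()
        converted = convert_chunk(chunk)
        model_ms = (time.perf_counter() - start) * 1000
        # Buffering (CHUNK_MS) plus model time has to stay within the
        # roughly 500-800 ms budget mentioned above.
        yield converted, CHUNK_MS + model_ms

# Example: feed five chunks of silence through the loop.
# for audio, delay_ms in stream_convert([[0.0] * CHUNK_SAMPLES] * 5):
#     print(f"chunk converted, ~{delay_ms:.0f} ms end-to-end delay")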

We really envision ourselves requiring far fewer resources than we need now. Right now we rely on heavy GPUs both to train and to do the voice conversion itself. But hopefully, in the foreseeable future, we will even be able to do the computation on the edge or on-device. That would increase the range of use cases we cover.

We're constantly improving our AI voice generator system toward a better emotional range. That means that in areas or emotions where our technology might have a hard time, like crying or screaming, it will work better and be more robust. We already have some good results in that direction compared to, say, a year ago, when crying and screaming were just not possible. Now, if we have proper data sets and include some manual work to deliver it, we can do quite a decent conversion of emotions, and we want to cover all the emotions our voice actors can produce.

Another big thing we are optimizing for is robustness - and by robustness, we mean the ability of our voice cloning technology not to make mistakes, mispronunciations, or any of those annoying errors that synthetic speech and synthetic voices can have, but also to be less sensitive to recording conditions.

Right now, to squeeze the most out of our voice cloning system, you should use a silent room and a very decent microphone. But we are already hearing results coming from cheaper microphones - maybe not a studio, just a regular room works well. And the pace of improvement keeps increasing.

A year ago, it seemed like we would introduce new voice AI features and improvements once per week; now it would be once every two weeks, and there is always something new in the picture. We are very excited about new creative applications of our technology, like using non-human voices in our Voice Marketplace.

These would be character voices - being able to take some attribute X from a donor voice and, for example, apply X to a particular voice.

We are also paying a lot of attention right now to our healthcare direction. That is the product - the piece of voice AI technology - that helps people with speech disorders live a better life and actually improves their quality of living. We have other questions about healthcare further on, so we will focus on them later.

[Q] What are the limitations of the generative AI technology? Regarding emotions, speech impediments, super high and super low registers, and specific accents (like the Canadian accent).

[AS] We might have some trouble with very extreme emotions, and that's something we are working on. We are still limited by the source performance - the voice synthesis technology alone would not help much there, but hopefully soon we will be able to add some additional features to fix that.

Regarding accent, as I said, it's at quite an early stage, though the samples are quite promising, and hopefully in November we will show you a new demo of accent conversion. For now, if you use our Voice Marketplace and you speak English, it will convert well and preserve your English accent, whether it's British or Australian.

But if you are using another language, say Ukrainian, you would hear some American or general English accent in the output. The thing is, this part of our voice AI technology, the Voice Marketplace product, is basically biased towards English accents. On one hand that's a limitation, but on the other hand it means models can be biased toward other languages too. That means we might have even more control over accents, because we would be able to use different models that have different biases - so you can use accent in a creative way.

[Q] How do you envision that in practice? How long does it take and how much dialogue needs to be recorded to build a cloned voice?

[AS] Currently we ask for about 30-40 minutes of good, clean recordings of the target voice, though it's not like we cannot work with less. In many projects, we use less data.

I guess the most extreme project was when we had just 88 seconds of the target voice. That was something in the voice resurrection domain. So we can work with less data - it's harder, it's more complicated - but if you have half an hour of voice recorded in good conditions, we are good from a technology standpoint.

The only thing is, we would like the data set to have an emotional range similar to the one we expect in the output.

So if you want your character to scream or whisper, it's better to have that in the data set. The same goes for singing: if you train the system on just speech, we can still produce it, but our singing voice is usually very different from our speaking voice.

So if you want to reproduce a singing voice properly, it's better to have a separate data set of a cappella singing, so you can train a separate speaker on that data and just treat the singing voice as a different voice.

We have projects where, in a film, you need a character at three different ages. We have recordings of this person from 40 years ago, from 20 years ago, and from five years ago, and in that case, we would also train three separate models and just treat those voices as different speakers in order to convert into that exact vocal timbre.
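As a concrete illustration of the "separate speakers" idea above, here is a small, hypothetical Python sketch of how such a project's training data might be organized. The paths, labels, and the SpeakerDataset structure are invented for this example; they are not Respeecher's actual tooling.

from dataclasses import dataclass

@dataclass
class SpeakerDataset:
    speaker_id: str   # the label the model is trained against
    audio_dir: str    # folder of clean recordings for this "speaker"
    notes: str        # what makes this subset distinct

# Same actor, but each era and the singing voice are treated as separate
# speakers, each driving its own target-voice model.
datasets = [
    SpeakerDataset("actor_1980s", "data/actor/archive_1980s/", "recordings from ~40 years ago"),
    SpeakerDataset("actor_2000s", "data/actor/archive_2000s/", "recordings from ~20 years ago"),
    SpeakerDataset("actor_recent", "data/actor/recent/", "recordings from ~5 years ago"),
    SpeakerDataset("actor_singing", "data/actor/acapella/", "a cappella singing only"),
]

for ds in datasets:
    # In practice each entry would drive a separate training run.
    print(f"train a separate target model for {ds.speaker_id} from {ds.audio_dir}")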

[Q] Is there a demo available on GitHub?

[AS] No, we don't have a demo available there. If you want to try our technology, you can use the Voice Marketplace. With the recent update to the Voice Marketplace, you don't even need to enter your credit card, and you can do 15 minutes of voice conversion.

You cannot introduce your own target voice into the system. The Voice Marketplace product is limited to the Library of Voices, and we do not plan on putting the full AI voice generator technology in it, where it would be accessible to everyone, because that basically contradicts our ethics statement. When technology like ours becomes available with no control over how it's being used, it might, and most probably will, fall into the wrong hands, and people would (or could) use it for malicious purposes.

Whenever we take on a project, the first question our team asks is whether the client has permission and consent from the target voice owner - or their estate and family if they're deceased - before we even start exploring the project. That's the very basic part of our ethics statement. And we say no to many projects because there is no clear chain of rights or permission.

We've been investing a lot into bringing this technology - which has been used only by Hollywood studios, because it's heavy and expensive - to small creators. It's been a roller coaster to get there, and now we've found a way to do that via the Voice Marketplace.

But if you have an exciting project, you are a small creator, and you don't think you can allocate funds close to a studio budget, we have a special program called the Respeecher Program for Small Content Creators, and you can apply to get your project approved there.

Usually, we do one or two projects like this per month. We can drop our fees or meet your budget and basically invest in the project with you if we like it - and, of course, if it's ethical and, ideally, showcases some of the amazing features we've just introduced in our AI voice cloning product.

[Q] For many people like me, psychological and medical conditions affect our ability to use our natural voice online, which creates major drawbacks in modern-day life, both recreationally and professionally. I feel like this technology will be a real benefit to people like me. However, regional accents are extremely important to self-identity. For example, if I were to speak with a non-British accent, it wouldn't feel like I was a British national. What are your goals and prospects for implementing voice variation and accent variety for local, small-scale budget use cases or free speech?


[AS] There are basically two questions here: a medical condition and how our technology can help, and accent conversion. And we have answers for both of them.

So in terms of medical conditions, it really depends on what particular condition you have. But currently, we have good results in helping people with speech disorders. For example, people who have had throat cancer can still speak - they can produce sounds with their voice - but they do not sound good. By applying our voice synthesis technology, including the real-time version, we can basically make them sound better. And we are very happy to hear warm feedback from patients we've worked with who said that our voice synthesis technology changed their lives.

You have most probably tried text-to-speech (TTS) to synthesize speech, and text-to-speech might have some limitations in terms of the accents that are available.

In our case, as I said, you can use text-to-speech in our Voice Marketplace. You would also be limited in terms of accents, but hopefully soon we will introduce voices, or versions of our voices, in the Voice Marketplace with different regional accents. Then you would be able to speak into the system and apply the accent you'd like in the output. It's also a very interesting idea to apply just vanilla text-to-speech and use a personalized accent from a voice we have available in the system. I don't think we have discussed that internally, but it's something quite interesting.

So if you don't feel like producing content with your own voice and you would like to use a text-to-speech system to just type and get audio from the text, you can select a good text-to-speech system you like. Then you can feed the output of that text-to-speech system into our system and apply a voice from our library that has the particular accent you're interested in. Thank you. That's a great question, and it's actually something we should discuss and add to our pipeline.
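As a rough illustration of that two-step workflow - any TTS engine first, then a library voice applied on top - here is a hypothetical Python sketch. Both functions are placeholders standing in for whatever TTS engine and voice conversion service you use; they are not a real Respeecher or TTS API.

from pathlib import Path

def synthesize_tts(text, out_path):
    # Hypothetical: call whatever TTS engine you prefer and write a WAV file.
    ...
    return Path(out_path)

def convert_voice(source_wav, target_voice, out_path):
    # Hypothetical: apply a library voice (with its accent) to the source audio.
    ...
    return Path(out_path)

def text_to_accented_voice(text, target_voice):
    # Step 1: plain text-to-speech with any engine you like.
    tts_audio = synthesize_tts(text, "tts_output.wav")
    # Step 2: feed that audio through voice conversion to pick up the
    # library voice and its regional accent.
    return convert_voice(tts_audio, target_voice, "converted_output.wav")

# Example call (the voice name and file names are made up):
# text_to_accented_voice("Hello from the Voice Marketplace.", "library_voice_british")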

Subscribe to our newsletter to be notified when the next part is published.