Our own Alex Serdiuk, the CEO of Respeecher, had the pleasure of talking to Bret Kinsella of Voicebot in a podcast episode about Respeecher’s current business situation, voice cloning for media, how speech-to-speech voice conversion is different from text-to-speech, deepfakes, ethical considerations, and a lot more.
Listen to the podcast episode above or on the Voicebot AI website, or read it below.
Bret Kinsella: This is episode 250 of the Voicebot Podcast. My guest today is Alex Serdiuk, CEO of Respeecher.
Hello, Voicebot Nation. This is your host, Bret Kinsella. I have a special episode for you today. It is the first time I am interviewing a guest who is operating a business in the midst of an act of war in their country. Respeecher is one of the leading voice cloning solutions on the market today, and it is also headquartered in Ukraine. In my recent update with Alex, the CEO, I learned that despite the war in Ukraine, Respeecher was in fact still operating and fulfilling customer contracts. So we agreed to catch up on Respeecher's current business, and also let everyone know that the company is still operating despite the war.
You're really going to like this discussion. We talk about voice cloning for the media, how it's different from text-to-speech, deepfakes, ethical considerations, and a whole lot more. Before we get started, I want to give a special shout out to Deepgram. They came forward and asked to sponsor the Voicebot Podcast for the next several weeks.
So you're going to be hearing more about them, but to give you a quick introduction if you're not familiar with Deepgram: the founder, Scott Stephenson, was on the podcast in 2020, so almost two years ago. We talked about the origin story there, and what first caught my attention about Deepgram is the end-to-end deep learning architecture.
Other solutions typically have separate automated speech recognition, that's ASR, which hands off to a natural language understanding model, an NLU. So it's a discrete two-step process to determine intent. Deepgram includes an integrated model with both ASR and NLU operating in concert. This enhances speed and accuracy, and can also reduce costs in many situations.
Some applications of Deepgram include speech analytics, call transcription, and conversational AI bots for contact centers that can be trained, as they say, to 90% accuracy. To learn more, visit deepgram.com/voicebotAI, deepgram.com/voicebotAI. Okay. Our guest is Alex Serdiuk, CEO of Respeecher.
Alex comes from the big data industry and founded Respeecher in 2017. Respeecher has since worked on several high-profile movie and TV projects, including The Mandalorian, where they recreated the voice of a young Luke Skywalker, an Emmy award-winning piece of work. Beyond film and television, Respeecher provides AI voice cloning for games, podcasts, advertising, content localization, and a whole lot more.
They also have a way for celebrities to put their voice where their tweets are in support of Ukraine and the Ukrainian position in the war. Celebrities record a message of encouragement in their own native language, and Respeecher translates it into Ukrainian so that Ukrainians can listen to it in their native language, but in the voice of that famous person.
So I think that's a fascinating idea, using their voice AI technology to boost morale in the country. It's also a fascinating technology in itself, and a great conversation. So you're going to like this. Next up: voice cloning, war, Emmy awards, and deep learning. Let's get started.
Alex Serdiuk, welcome to the Voicebot Podcast.
[00:03:40] Alex Serdiuk: Hi, thank you for having me.
[00:03:40] Bret Kinsella: I'm very excited to have you here today. It's under trying circumstances, because some people may not know you're headquartered in Ukraine and you're in the middle of a war. This just happened to be where things converged. I've been interested in Respeecher for some time. I just spoke with Chris Ume of Metaphysic, we talked about you, and it was good timing for this. And I thought it was really interesting to talk to you now because you're fully operational despite the war. You've also had a lot of momentum recently. So Alex, I'm thrilled to have you on.
And I have to say that I'm impressed by anybody who's an entrepreneur, because it's very hard to start a business, and even harder to grow a business than to start one. You've done those things, but now you're doing it in the midst of a war, so it's that much harder. So how are things going for you right now? How should people think about what you're doing, and maybe how you've changed how you operate over the last few weeks?
[00:04:48] Alex Serdiuk: Yeah. I mean, I wouldn't say that this war was unexpected, because the war actually started eight years ago. At Respeecher, we had our contingency plans in place for several months before the full-blown invasion started, and we executed on those plans.

Our team is 36 now; 27 of them are Ukrainians, and 24 of them are in Ukraine. We relocated a big part of the team before this crazy invasion. And then, once the first bombs started to hit Kyiv, I took my wife and kids and also headed towards the western regions of Ukraine, where I am right now. So given that we were prepared, we did not have any significant problems with our operations.
We actually delivered one of the biggest projects in this quarter on the day when the missiles started to hit our big cities.
[00:05:45] Bret Kinsella: Wow, very challenging. I really do want to focus on the business, and that's what we should probably transition to. I don't know when people will be listening to this or what the resolution will be, but great respect goes to you, for the product you've built and also for what you're enduring right now.
And, you know, my understanding is that you are fully operational because of your contingency planning, you're serving customers just as you had, you're signing new customers, people are using your marketplace, correct?
[00:06:20] Alex Serdiuk: Yeah, that's correct. And also, COVID turned out to be a good thing for being operational in this time, because we all learned how to work remotely.
And now that the team is very distributed and not in the office, we keep doing the same stuff in the same way we were doing it a year ago.
[00:06:36] Bret Kinsella: Yeah. Have some of your employees gone across the border to other locations? Or has everybody regrouped in western Ukraine?
[00:06:47] Alex Serdiuk: Yeah. Some of the people were outside of Ukraine before the full-blown invasion.

Some of the people on our team are not Ukrainians. Now you cannot leave Ukraine if you're male. But the girls we have at Respeecher who are Ukrainians have not left Ukraine yet.
[00:07:05] Bret Kinsella: Got it. Got it. Well, I will tell you, from a personal standpoint, I was involved in what I would call an NGO, a non-governmental organization, in the early 1990s.

One of the things we did was run programs with Eastern European nations. Sometimes we would bring either high-ranking government officials, journalists, or people from the judiciary to the US, and they'd run through programs to see how the US system works.

We would also send experts from the US to help with different types of programs that were going on as governments were restructuring. One of the programs I was involved with actually sent a US appellate court judge to Ukraine, and he participated in the drafting of the Ukrainian constitution.
[00:07:57] Alex Serdiuk: Oh, that's amazing. So cool to hear.
[00:07:59] Bret Kinsella: Yes. And my colleague, whose family was from Ukraine, was leading the program. We had a wonderful time. We met a lot of really wonderful people, and I'm so happy to have seen it then, and to see how things have gone since. So best of luck with everything. We're praying for you here in the States that everything goes well and is resolved quickly.
Okay. Yes. You want to say something?
[00:08:23] Alex Serdiuk: Yeah. I just want to say thank you. We feel all that support. It's tremendous, it's huge. All this support from the whole world is very big, and it's very important for Ukrainians, who are united. And we showed the world that we can beat this enemy, this threat to peace in the world.

And regardless of when the podcast goes out, I can still say that I am convinced that Ukraine has already won this war.
[00:08:58] Bret Kinsella: Excellent. Excellent. Okay, well, let's talk about what you've done, because what you've done is actually really significant. Uh, you have a great reputation in the industry for the technology that you've built.
You've actually won an Emmy award. Why don't you tell people, just at the top now, what Respeecher does?
[00:09:19] Alex Serdiuk: Yeah, so basically Respeecher is a voice cloning technology that works in the acoustic domain, in the speech-to-speech domain. In simple words, we enable a human to speak in the voice of another particular human, keeping all the performance untouched and changing only the vocal timbre to match the desired voice, using our voice cloning software.
[00:09:41] Bret Kinsella: Right. And so the applications I've seen, at least so far, are predominantly in the entertainment industry. Is that your sole focus? Are you looking at other markets as well?
[00:09:51] Alex Serdiuk: Yeah. This started with the entertainment industry because our initial goal was to make synthetic speech of a quality that would get past the pickiest sound engineers in Hollywood studios.

But now that we have improved the usability and scalability of our technology, we're actually looking into a few more markets. One of them is healthcare: we can help people with impaired speech improve the way they communicate. Another direction would be call centers and the customer service industry, where real-time voice conversion could change the way people communicate with call centers.

But for the time being, our primary focus has been media and entertainment, leveraging AI voice synthesis for groundbreaking results.
[00:10:36] Bret Kinsella: Yeah. It's interesting that you mention that you thought that would be a good place to start. I believe it's tougher, because you have a customer that is more focused on quality than maybe any other industry.
[00:10:51] Alex Serdiuk: Yeah, that was our intentional choice because we understood that there's no synthetic speech in the market that would meet those requirements, those standards, those benchmarks that Hollywood studios have. And we wanted to be the first in order to develop this quality of the synthetic sound, no matter what that takes from the technology perspective.
If it's a very heavy system - OK. If it is very expensive to operate - OK. It doesn't matter; we just needed to meet the quality standard. And after we met that bar of quality, we could work downhill in terms of improving usability and other things, or keep the quality and focus solely on improving other technical aspects of our systems.

So that has been our strategy from the very beginning: leveraging voice AI to meet the highest standards of synthetic speech quality.
[00:11:36] Bret Kinsella: Yeah. And I think everyone in the voice cloning and speech synthesis space does have a certain guiding light around quality. However, many of the other companies in this space have focused on affordability, and so they've made some other decisions where they're trading off quality.

And it sounds like for you, it was quality first; once you met the quality threshold, then you could figure out what the cost basis and pricing were going to be.
[00:12:04] Alex Serdiuk: Correct. Yeah. That was our bet. And it's also worth mentioning that most of the companies that do synthetic speech use a different approach, and that would be text-to-speech.

Text-to-speech is very scalable from the very beginning, and it's less resource-intensive. But it has some holistic limitations in terms of quality, control over emotions, and language dependency. Our goal was to make something that would actually fit industry standards for use in very big creative productions, not just from the quality perspective, but also in the ability to keep the performance untouched, to be used in any language, and all those aspects of speech that were important and reliant on humans as performers.
[00:12:52] Bret Kinsella: Right. So since you're not using text-to-speech, why don't we talk a little bit about the technology and how you apply it? Could you start to outline that for us?
[00:13:00] Alex Serdiuk: Yeah, so basically the AI voice synthesis technology is a deep neural net that is pre-trained to understand differences in voices, or just to understand vocal timbres.

And there are two stages when we apply our technology to a project. The first stage is to prepare the system. We call it training: basically, we introduce a new voice to the system, and once it's trained, which takes a week or two, it's ready to do conversions from another human voice into this particular voice we trained on, using our voice cloning software. The second phase is the inference phase. It takes way less time: minutes if you use our offline system, or even 500 milliseconds if you use our real-time system.
[00:13:45] Bret Kinsella: So it's a deep neural net, and it sounds like in this case, instead of converting to text and then converting back to speech, you're just modifying the waveform in order to match the new vocal pattern, the new voice that you want to convert it into. Is that correct?
[00:14:05] Alex Serdiuk: Yeah, we can say it that way. So basically, in the acoustic domain, we don't have any language or text engines inside. We just convert speech, or whatever a vocal apparatus produces, into another vocal timbre.
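To make the shape of that workflow concrete, here is a minimal sketch of the two-phase speech-to-speech pipeline as Alex describes it: a slow, one-time training step that registers a target voice, then fast conversion calls that never touch text. Every name here is hypothetical and invented for illustration; this is not Respeecher's actual API.

```python
# Hypothetical two-phase speech-to-speech client, sketched from the
# interview: phase 1 "training" takes a week or two and wants roughly
# 30-40 minutes of clean target-voice audio; phase 2 "inference" is
# fast and purely audio-to-audio (no text or language model involved).

import wave
from pathlib import Path
from typing import List


def clip_minutes(path: Path) -> float:
    """Duration of a WAV clip in minutes (stdlib only)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate() / 60.0


class SpeechToSpeechClient:
    def train_target_voice(self, voice_id: str, clips: List[Path]) -> str:
        # Phase 1: introduce a new target voice to the system. The data
        # must also cover the emotional range needed later (whispering,
        # screaming, singing), not just hit a duration threshold.
        total = sum(clip_minutes(c) for c in clips)
        if total < 30.0:
            raise ValueError(f"only {total:.1f} min of audio; ~30-40 min is comfortable")
        # ... upload clips and start training (takes a week or two) ...
        return voice_id

    def convert(self, voice_id: str, source_wav: Path, out_wav: Path) -> Path:
        # Phase 2: convert a source performance into the trained timbre.
        # Offline this takes minutes; the real-time system runs at about
        # 500 ms of latency. Performance (emotion, timing) is preserved;
        # only the vocal timbre changes.
        # ... call the conversion endpoint and write the result ...
        return out_wav
```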
[00:14:21] Bret Kinsella: Right. And where did this idea come from? Because, as you mentioned, most other people are doing text-to-speech, which has been around for decades now in some form or another, and is much better now than it was two decades ago.

What was it that inspired you to use this approach?
[00:14:42] Alex Serdiuk: Yeah, the idea started with a hackathon back in 2016 in Kyiv, Ukraine, where Dmytro, the founder and CTO at Respeecher, picked the idea of applying machine learning to the voice domain. All the other teams in that hackathon were using machine learning for analyzing or generating images, while acoustics and speech were something people were hesitant to work on.

He picked this idea of speech-to-speech conversion, which turned out to be quite an interesting direction. We won that hackathon. And then, a few years later, we met our third co-founder, Grant, who was focused on what we call accent conversion. That's when you change or eliminate a person's accent; it's also in the acoustic domain.
And we understood that there are plenty of text-to-speech companies in the space, and there was no reason for us to compete with them, because in our opinion text-to-speech has some holistic limitations. You cannot control emotions, or the control over emotions is limited. You're very tied to the language.

So if you need to produce something that is not in the vocabulary, or even sing, whisper, or cry, you couldn't do that with text-to-speech. That made us think that for high-quality productions, text-to-speech wouldn't do. And we wanted to be focused on high-quality productions, as we saw a niche there we wanted to get into.
[00:16:16] Bret Kinsella: Got it. So when you were building the initial solution for the hackathon, which you won, congratulations, that led you to think there was a significant opportunity here. What significant changes have you made to the technology over the last couple of years? Has there been a point where you ran up against a limitation and then had to change your architecture approach?

At some point you must have created a new model. I'm interested in that journey from a technical standpoint, and I assume you're not just using the model that was originally built during the hackathon.
[00:16:54] Alex Serdiuk: Oh, we're not. Actually, in the hackathon we didn't use any deep neural nets. If I recall correctly, we were just using basic machine learning techniques.
I mean, our AI voice synthesis technology has developed a lot over the last four years we've been doing this as a company. When we just started, we had a model that required very complicated datasets. Basically, we had to get the data for a target voice, the voice we impersonate; then we had to get a particular source speaker and have them record the same dataset with the same emotions we have in the target voice.

And that was quite a process. It could take several days for someone to record the dataset. And then, once we trained the system, we had a lot of limitations in terms of quality, of course, but the biggest challenge was that our models used to make some phonetic errors. In some cases it could be fine if one out of 10 lines converted well, because a studio will make several takes for each line, but it didn't allow us to grow.
So we were focused on solving this robustness issue, so that every take would convert well. Another big thing we accomplished over the last year: we got rid of the requirement to have the source speaker record their own data. We now have a many-to-one model, so anyone can speak in a particular voice, which simplifies use of the system a lot.

And a lot of those changes were about the ability of the model to perform in different emotional ranges. Some time ago we had huge problems with whispering, and we were not able to cry or scream. So we increased this emotional range, and we also focused on increasing the quality and resolution of the files we deliver.

For quite some time, we had to deliver 22 kHz audio to our clients. And they complained a lot about that, because that's not an industry standard.
[00:19:04] Bret Kinsella: What are the industry standards?
[00:19:06] Alex Serdiuk: Starting from 44-48 kHz onwards.
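For readers who want to see what that delivery detail means in practice, here is a small resampling sketch; the use of the librosa and soundfile packages, and the filenames, are assumptions for illustration. Note that upsampling a 22.05 kHz render to 48 kHz does not restore detail that was never captured, which is why clients pushed for native high-rate delivery.

```python
# Checking and converting a deliverable's sample rate. Assumes the
# librosa and soundfile packages; filenames are placeholders.

import librosa
import soundfile as sf

y, sr = librosa.load("converted_line.wav", sr=None)  # sr=None keeps the native rate
print(f"source rate: {sr} Hz")  # e.g. 22050 for the old 22 kHz deliverables

if sr < 44100:
    # Meet the 44.1-48 kHz floor Alex mentions. This only changes the
    # container rate; it cannot recreate high frequencies that the
    # 22 kHz render never contained.
    y = librosa.resample(y, orig_sr=sr, target_sr=48000)
    sr = 48000

sf.write("converted_line_48k.wav", y, sr)
```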
[00:19:10] Bret Kinsella: Got it. All right. So you said a lot there; I'm really interested in some of these things.

Okay, so along those lines, some definitional points. The target voice is the voice you want, the ultimate output, is that correct?
[00:19:22] Alex Serdiuk: Yeah, that's correct.
[00:19:23] Bret Kinsella: And the source speaker is some baseline voice. So tell me, why do you need both?
[00:19:32] Alex Serdiuk: Yeah, we don't need both for training anymore. So our current system requires only a target voice to be presented as a data set for training.
But the speech-to-speech system would require someone to perform and that would be a source speaker.
[00:19:47] Bret Kinsella: Got it. And so the target voice right now is basically an asset in your system, or someone creates it as a new asset, and then you can have other people come in as the source, speak whatever lines or script they need to, and you just convert it to the target.
[00:20:06] Alex Serdiuk: Yeah, that's correct.
[00:20:08] Bret Kinsella: Okay. So I think I understand that now. For the target voice, how much content do you need from them to create the clone, or the twin, of their voice? Is it speaking for 10 minutes, five hours, three days?
[00:20:27] Alex Serdiuk: Yeah. We feel very comfortable if we have around 30-40 minutes of high-quality recordings of a target voice. But another important point: it should have the emotional range we will need in the conversions.

So if you need the character to scream or whisper or sing, you would need that data to be in the dataset, so sometimes the amount increases. Also, our team is used to working with old recordings, and we have never required a particular script to be recorded for us - we can work with whatever is available.

We just need recordings of the voice saying or performing something; we don't need particular words to be said.
[00:21:09] Bret Kinsella: When you're working with archival audio files, there tends to be significant variance in quality. How do you account for that if you're working from those old files? Or do you not do that anymore?
[00:21:21] Alex Serdiuk: We do that, and we have had plenty of projects where we had to work with old files. It's always challenging, because we take something that is not perfect as input for training, and then the output we produce needs to sound right and good for a modern production. A good example would be The Mandalorian, where we used old recordings of young Luke Skywalker; some of them were from tape, with all the tape hiss and so on. Those are the particular challenges of particular projects that our delivery team needs to work through. And that's actually why we have a delivery team: it consists of sound people who are familiar with our voice cloning technology, how it behaves, and how it reacts to particular data.

Vince Lombardi was extremely hard, because we had just a few minutes of clean speech from Vince, and the other data was not good at all: a lot of noise in the background, or a lot of reverb. So it was quite a project, maybe one of the hardest projects for us.
[00:22:25] Bret Kinsella: Yeah. For people who don't know who Vince Lombardi is, he was probably the most famous American football coach of the 20th century, famous not only for winning NFL (National Football League) championships before the Super Bowl existed, but also for winning the first two Super Bowls as coach of the Green Bay Packers. He was also really well known for his aphorisms and as a paragon of leadership principles. The piece I mentioned to you earlier, I believe, was for the 2021 Super Bowl.

And I thought that was the best piece of all the media they did for the Super Bowl, because it was so well done. There was a visual element too, which was done really well, but the speech was excellent.
[00:23:21] Alex Serdiuk: Yeah, it was a very hard project, and it was done very well from a creative standpoint. Several very talented teams worked on it: 72andSunny, and Digital Domain, who did the hologram and avatar.

The source performer, the actor who actually performed for Vince, was amazing. So it was a very exciting journey for us to be part of, though we had an extremely tight schedule and time pressure; some of our team members had not slept for quite a while back then.
[00:24:01] Bret Kinsella: Well, I should clarify for some people who are listening: when I say the 20th century, I'm not talking about the eighties and nineties. I'm guessing the voice recordings you were using were from the 1950s, sixties, and maybe the seventies.
[00:24:18] Alex Serdiuk: Yeah, it was, it was very old data.
[00:24:20] Bret Kinsella: Yeah. Vintage. Okay. Which makes using some of the files of young Luke Skywalker from the late 1970s and early 1980s probably seem much, much higher quality in comparison.
[00:24:36] Alex Serdiuk: Yeah. And not just quality, but also quantity. Another project where we had quite a lot of challenges was the project with MIT, where we had to resurrect the voice of Richard Nixon.

We had plenty of recordings of Richard Nixon, but back then our voice cloning technology was not robust enough, and what we produced was not that good. Also, those recordings were quite old; they were taken from archives.
[00:25:13] Bret Kinsella: Right. And so again, with Richard Nixon there would have been a copious amount of audio recordings of him from the fifties, sixties, and seventies.

But then the question is what the quality is. So why don't you tell people what that project was? Because there's an interesting backstory there, and it's an interesting application of the AI voice synthesis technology.
[00:25:33] Alex Serdiuk: Yeah, it was a very exciting project for us. It was one of the first big projects we did as a company.
So we met the MIT team, who was pitching us the idea of making Richard Nixon deliver the speech that was written in case the moon landing went wrong, if the Apollo 11 mission failed. He obviously had two speeches prepared, because no one knew how it would go. That was a contingency plan.

The speech for if it went wrong is a very powerful piece of speechwriting in general, but no one knew about it; it stayed in the archives. MIT wanted to achieve a few goals. The first was showing what the technologies are capable of in general in terms of synthesizing something. And the second was showing general society that synthetic media technologies can actually change our history, and can be dangerous at some point.

Both of those goals were interesting for us, because from the very beginning we built a very strong ethics statement, and a big part of our ethics statement is actually showing general society what synthetic speech technologies could do if they fall into the wrong hands.
[00:27:08] Bret Kinsella: And that sounds very similar to a conversation I had with Chris Ume of Metaphysic recently.

Chris and Metaphysic are known for Deep Tom Cruise; if anybody hasn't looked that up, go to YouTube and you'll see what they do. They have really excellent face-swap technology. And one of the things he talked about was that he has actually collaborated with you, because he was always relying on actors and their ability to do impersonations.

Now, for some of the other things they're working on, they can use some of the synthetic speech you provide. We had a long conversation on our podcast, which came out just a couple of weeks ago, about ethics. Deepfake technology has generated a lot of headlines, mostly negative, mostly raising concerns and fears about what it might do.

And he, from the beginning, had a strong sense of what the ethical guidelines could be. So could you summarize how Respeecher thinks about this in terms of ethical guidelines? This has come up a number of times: there was a controversy when Anthony Bourdain's voice was used in a recent documentary, and there was no controversy when Andy Warhol's voice was used, because they got all the rights. How do you come out on this?
[00:28:26] Alex Serdiuk: Yeah, so our ethics statement started with Respeecher requiring permissions in all cases, in all the projects we work on. Basically, we need a copy of a permission from the target voice owner in order to replicate their voice. With permission in place, IP rights are not violated, the project is legit, and it would not cause harm in that regard. We also take a lot of actions to protect our technology from any potential misuse. We contribute to several directions, such as developing a watermark for audio, so you can tell Respeecher-generated content from other content. We are also launching a big initiative with one of our corporate clients: a Kaggle competition to create the best synthetic speech detector, which could then be used by gatekeepers like Facebook and YouTube. It would be open source.
But another big part of our ethics, as I mentioned, is speaking as much as possible to the general audience, telling people that there is another tool out there that can manipulate facts and can be used for wrong. And this point might be one of the most important, because technology is basically neutral, neither good nor bad, but at some point various technologies, like our technology or Metaphysic's technology, will fall into the wrong hands. That will inevitably happen, because technology commoditizes in a year or two. And that means we should be prepared, in terms of how we treat the information we receive.

It's not a new concept. It's no different from the printing press; it's no different from Photoshop. We just need to adapt to new tools, including the ways they can be misused, way faster than we had to adapt to the printing press.
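As a deliberately toy illustration of what such a Kaggle competition would solicit: a synthetic-speech detector is, at bottom, a binary classifier over audio features. The sketch below uses mean MFCCs and logistic regression, which is nowhere near gatekeeper grade; the data directories and any accuracy it reports are placeholders, and the library choices (librosa, scikit-learn) are assumptions.

```python
# Toy synthetic-speech detector: embed each clip as mean MFCCs, then
# fit a linear classifier. Real detectors in this "arms race" are far
# more sophisticated; this only shows the basic framing.

from glob import glob

import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def clip_features(path: str) -> np.ndarray:
    """Mean MFCCs as a crude fixed-length embedding of one clip."""
    audio, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)


real = glob("data/real/*.wav")    # genuine human recordings -> label 0
synth = glob("data/synth/*.wav")  # converted/synthesized clips -> label 1

X = np.stack([clip_features(f) for f in real + synth])
y = np.array([0] * len(real) + [1] * len(synth))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```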
[00:30:42] Bret Kinsella: Right. Right. Well, let's talk about permission.
So you mentioned Vince Lombardi, who is no longer living. Does Respeecher require permission from, say, Vince Lombardi's estate or his heirs, or is it sufficient to have permission from the NFL, the National Football League, showing you they already have the rights?
[00:31:06] Alex Serdiuk: Yeah, it actually works on a case-by-case basis, because for people who are no longer with us, their rights can be owned by different persons or institutions. In the case of Vince Lombardi, the NFL owned all the rights; they confirmed that they owned the rights and gave permission for the project.

And they were very deeply involved in the whole project, ensuring compliance with our ethics requirements.
[00:31:37] Bret Kinsella: Oh, right. So actually that case is interesting, because NFL Films owns a significant amount of footage. So you could probably just use the assets they already owned; you didn't have to go to some third party to get additional audio recordings, correct?
[00:31:55] Alex Serdiuk: Yeah. When we talk about rights, it's a bit more complicated, because existing footage is covered by copyright, and that's just one part of the equation. We also talk about rights of private life, rights of publicity, and other IP rights to the likeness of a particular person, which should not be violated. That's why permission should be in place, especially for ethical voice synthesis.
[00:32:18] Bret Kinsella: Got it. So for someone like Mark Hamill, for young Luke Skywalker, what does the rights regime look like there? Do you get a release from him for the specific use, or does the studio take care of it?
[00:32:37] Alex Serdiuk: Yeah, that particular project is not something I can comment on in detail.

But we do have some cases where the studio is obliged to get the rights, and that's applicable only for big studios. When we work with smaller studios, we always require permission in writing, a copy of the permission, so we can store it.
[00:33:06] Bret Kinsella: Yeah. Right. And then, just rounding this out, let's talk about Richard Nixon, who at the time you did this project was deceased.

But he was also a public figure, with a lot of recordings of him out there. So in that instance, how did you handle it when you were working with MIT? Did you have rights from his estate or his heirs, or is that not required because he was a national political figure?
[00:33:35] Alex Serdiuk: In that particular case, MIT cleared this up. In the case of American presidents, the rights to their likeness are usually owned by presidential libraries.

So that's the institution you should work with if you want to clone the voice of an American president who is deceased, or do any manipulations with visual deepfake techniques. But there are some cases where we can think about figures who belong to the public domain.

I can say that we have not worked on such cases yet, but it might happen, because there are plenty of voices where there are no heirs, or they are obviously in the public domain. Though if there is even a tiny chance to get permission cleared, even for a voice that is obviously in the public domain, we would still push our client to go that way, because that's the best way to do business, especially for technologies in synthetic media. Because of all those deepfakes, people have come to see a lot of threat in this technology, even though the technology, again, is neutral; it's all about how we apply it.
We also tend to be biased towards the bad cases when we talk about deepfake technology, because we don't understand the technology and feel threatened by it. That's one of the reasons that projects other companies did - some of them you mentioned before - that were not cleared from a rights perspective and were not using a particular voice ethically attract so much more attention than great projects like the ones Respeecher is doing. The Anthony Bourdain project received so many headlines in the news, and that might not be fair, because if you listen to the quality of the sound, it's very far from perfect.

And we still don't know who exactly did that.
[00:35:51] Bret Kinsella: Right. Well, eventually I think they did resolve the issue, but it became a big public controversy - that's probably the best way to characterize it. I'm interested in the concept of deepfakes. You mentioned this idea of watermarking.
I think there are two different approaches. Most people have been saying that the best way to address this is to have some other type of solution that can run an analysis - it has its own detector that does matching, and it will tell you whether a recording is genuine or not. Now, my view is that these are going to lead to a lot of false positives, and probably a lot of false negatives as well.

Because we have a wide variety of speech from different people, and little things, like having a cold, change it. But we also know that the voice pattern and vocal resonance change as we age. In fact, they change very quickly. I remember talking to a founder of Pindrop, which does voice printing for security and banking.

And he said that it can change as frequently as every three months, and that if you have a highly sensitive system, it will reject somebody who is obviously the right person, just because of that natural aging. So it strikes me that we should still do detection, because you're not always going to have watermarking technology. But audio watermarking seems like a wonderful solution for anything that's a commercial project.
[00:37:33] Alex Serdiuk: Yeah. I mean, there have been several approaches to getting some control over synthetic media, especially in audio, and this approach of having a general synthesized-speech detector that would be able to flag that a particular piece was synthesized using voice AI technology.

It's very promising, though we should understand that there will always be an arms race between synthesizers and detectors, right? Synthesizers get better, then detectors need to get better, and in this arms race we'll have some leakage. So we cannot rely on that alone. When we talk about a watermark, a watermark is applicable to a particular producer of synthetic content.
So if Respeecher applies a watermark, it means we can tell Respeecher-generated content from any other content. It is therefore much harder to make an industry-standard watermark, because there are plenty of players in the synthetic media field. But the watermark has also been quite a challenging project for us, because with our AI voice cloning technology you can produce very small chunks of data.

You can produce just three seconds of conversion, and to put a watermark in such a short recording, the watermark has to be very reliable. It's always a balance between a watermark being easy to remove and a watermark being audible, and that balance is extremely hard to find when you need to apply a watermark to very short recordings.
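To make that trade-off concrete, here is a toy spread-spectrum watermark in plain numpy; it is emphatically not Respeecher's scheme. A secret pseudo-random sequence is mixed in at low amplitude and detected by correlation: raising the mixing strength makes the mark more robust but more audible, and a three-second clip leaves few samples to correlate over, which is exactly why short conversions are the hard case.

```python
# Toy additive spread-spectrum watermark: embed a keyed +/-1 sequence
# at low amplitude, detect it by correlating against the same sequence.

import numpy as np

SR = 48_000   # sample rate
KEY = 1234    # secret seed shared by embedder and detector


def signature(n: int) -> np.ndarray:
    """Pseudo-random +/-1 sequence derived from the secret key."""
    rng = np.random.default_rng(KEY)
    return rng.choice([-1.0, 1.0], size=n)


def embed(audio: np.ndarray, alpha: float = 0.005) -> np.ndarray:
    # alpha sets the audibility/robustness balance: louder marks are
    # harder to strip but easier to hear.
    return audio + alpha * signature(len(audio))


def detect(audio: np.ndarray) -> float:
    # Normalized correlation with the secret sequence: near zero for
    # unmarked audio, near alpha for marked audio. Shorter clips give
    # noisier scores, which is why 3-second conversions are hard.
    w = signature(len(audio))
    return float(np.dot(audio, w) / len(audio))


clip = np.random.default_rng(7).normal(0.0, 0.1, 3 * SR)  # stand-in for 3 s of audio
print("unmarked score:", detect(clip))         # ~0.000
print("marked score:  ", detect(embed(clip)))  # ~0.005
```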
When we talk about long recordings, Sony has a watermark they have used in cinemas for quite a while, and they can embed it across something like 20 minutes of audio. It's very legit and very reliable. But in general, when we talk about protecting ourselves from misuse of synthetic media, one of the best ways is still to teach ourselves how to treat information, right?

Because there will be many more things like synthetic media in the future. It's all about how we accept information, how we treat information, how we check facts - and that might be one of the weakest points in terms of information distribution.
[00:40:03] Bret Kinsella: Well, that emphasizes a point I think is very important: the rise of deepfake technology is teaching people to be more skeptical of the information they consume.
And I know some people don't like that. They say, oh, this is terrible, that we don't live in a world where we can trust the things that are put in front of us. But the world has always been filled with fraud. It shows up everywhere, because this is one of the things humans do.

And the fact that we have fake news, or deepfakes, or all those things, has made people a little more aware that what they're looking at might not be the original source, and that they need to do a little more investigation. I know a lot of our world has been about removing responsibility from individuals over the last couple of decades.

This is one case where I think it makes a lot of sense to put it squarely back on the individual: they have some responsibility themselves to be good consumers of information, to understand that some things might be fraudulent, and to take on the responsibility of doing some investigation if it's important to them.
[00:41:18] Alex Serdiuk: Yeah, exactly. I totally agree with that. And actually, it's not a new concept at all, because as humans we have lived with, say, rumors from the very beginning of humanity: someone tells someone that someone else did something they never did. It's the same thing you're doing with deepfakes if you misuse them. So it's not a new concept at all; there are just new tools.
[00:41:46] Bret Kinsella: Yeah, absolutely. Okay. Well, let's move back to the products. I want to talk a little more about how it works. You have a really nice video on your homepage - at least it's there now; maybe you'll update your page and it won't be there later. It's a YouTube video, and it shows an actor using your technology to update the audio track of some film clips, showcasing the capabilities of your voice cloning software.
And I really liked the way this was done. She is the source speaker, and she has access to a number of different target voices that she can change into. There are a few things she demonstrates. The first is the ability to add spoken audio to a track that doesn't have it, in the first case with a couple of different voices. So she, a woman, actually says what she wants to say in terms of these additional lines, using the right emotional content and inflection for those characters. She shows how she does this with two different voices that are seamlessly integrated into a scene, demonstrating the capabilities of AI voice synthesis.

She does another version of that, where she does the vocal inflection and it converts to a man's voice. And then she takes existing dialogue and replaces it. So there are those different scenarios. The interesting thing I found along those lines is that you've got this bank of target voices, and any trained actor can then go in and represent multiple characters, in a way they might have done in the past by changing the sound of their voice.

Right? This is what actors do. But in this case, they're just using their standard voice, and Respeecher is modifying it to sound like a completely different person. Could you talk a little more about that? What did I get wrong? What am I missing in this process?
[00:43:51] Alex Serdiuk: Yeah, that's basically right. So there are two ways Respeecher operates. The first type of service we provide is when a studio needs to create a particular voice, to clone that particular voice using our voice cloning software. We do that all basically manually, in terms of training the system and delivering a result that meets the creative expectations. But there is also a self-service tool, and Abby is not just a sound person.

She is also an actress you might have seen in Orange Is the New Black. She just uses our Voice Marketplace. That video was recorded about a year ago, so she is using the very first version of the Voice Marketplace. The idea behind it is a democratized version of our AI voice cloning technology.

So we have a bank of voices, pre-trained voices in the system, and anyone can register and use those copyright-free voices for whatever content needs they have. The system was initially designed for small creators: when you're making a game in your basement, or doing animation, and you don't have the budget to hire 20 voice actors, you can do everything yourself, or just hire one voice actor and get their voice converted into all the variety of timbres we have in the system.
But then the Voice Marketplace started getting some good interest from sound people, because they could do things like a line for someone who is just crossing the street and should say hello - you don't need to call a guy to come to your studio and say hello as part of the loop group. You can just do it yourself in minutes. We have also grown the Voice Marketplace in a direction where sound people can use it not just for converting human voices, but also for voices that are not human. We have dogs there, and we have cats. So if you bark and convert it into a dog, it will sound like natural dog barking.

And that means you have control over barking in your post-production to an extent you never had before; you don't need to download a huge package of barking-dog samples, you can just bark yourself. Such tools will keep growing in the Voice Marketplace: we are looking into vehicle sounds, gunshots, and plenty of other things that could be driven by your voice.
And the third category of users who are excited about the Voice Marketplace at the moment would be professional voice actors, because this voice AI technology is an enabler for them. They're not limited to their own vocal timbre anymore. They're no longer hired because of the particular vocal timbre they were born with; they're hired because of their performance abilities, facilitated by voice cloning software.

Now they can perform in whatever voice we have in the library. That means that if you are good as an actor but you're a 30-year-old male, you can speak like a 70-year-old female, and it will sound legit. It will sound like a 70-year-old female actress.
[00:46:58] Bret Kinsella: Are people using that feature right now?
[00:47:02] Alex Serdiuk: Yeah. Yeah. We see growing interest from voice actors at the moment.
[00:47:08] Bret Kinsella: Oh, that's really interesting. So for your library of target voices, how does that work? Do you source those from voice actors, just so that you have a bank of generic voices people can use?
[00:47:24] Alex Serdiuk: Yeah. We tried to create voices from scratch and it turned out to be quite a challenging task.
So what we eventually did was hire plenty of people - just ordinary people, not voice actors of any kind. We recorded them in the studio for a dataset, paid them money, and received a full release. We anonymized them in the tool, and now their voices can be used on the platform as target voices you convert into. But that's just the first stage of the Voice Marketplace. We envision a second stage, which should happen in the foreseeable future, where we have a full-blown voice marketplace and invite voice actors to be part of it.

First of all, they would be able to provide their datasets and sell their voice on the platform, earning transactional income; but they would also be able to get jobs from the platform, because on the other end we have plenty of creators who need professional actors to voice the things they are making.
[00:48:38] Bret Kinsella: Right. So the difference is that your existing library is something where you brought in people to give you voice data and you just paid them, and then your relationship with them is essentially over, because you own the rights to whatever they recorded for you.

And you're moving toward this idea that people, whether voice actors or even celebrities, could be able to monetize their voice. In that case, as you said, it's transactional income: essentially they can sell access to their voice, or you can sell it on their behalf.
Now, this goes back to an earlier question I was going to ask you, so it's related. When you have rights from somebody - let's say it's for something pre-packaged like The Mandalorian, where the rights would come through Mark Hamill, or you have rights from a voice actor - how do you bake in restrictions on what they say?

Because it's one thing to utilize their voice; it's another to utilize their voice to say things they would never want uttered in their voice.
[00:49:54] Alex Serdiuk: Yeah, that's a very good question. And that's one of the reasons we have not yet launched the Voice Marketplace in full-blown mode: people who make a living from their voice, like voice actors, would need to have transactional income in a tool like the Voice Marketplace.

We would need to build this layer of compensation, and that's not even the hardest part of the equation. The hardest part is actually giving them the right control over the content being created with their voice. That's something we should not do alone, so we are bringing together a group of voice providers, professional voice actors, to build this approval layer, so they can embed their own rules for how their voice should or shouldn't be used in the Voice Marketplace. And that's not an easy task, because one voice actor I speak with is fine with his voice being used for whatever is needed.

Another says he doesn't want his voice used to promote cigarettes or alcohol. Yet another says there should be no hate speech in their voice. So different people have different thoughts about how they should control what is produced with their voice, and we need to embed all those options in the platform accordingly.

So that's quite a task we are going to go through, and a very interesting problem to solve. At the moment, users of the Voice Marketplace have some limitations: they cannot use it for anything abusive, for hate speech, or for porn content. We don't allow them to do that.

But in the future, we will need to embed a far more sophisticated system of granular control over each particular target voice, because we will work with voice actors, and all of them have different requirements.
[00:52:14] Bret Kinsella: So I'm interested in how you control that. As a user, I go in and I can use the marketplace.

I can select voices. I can record my own voice, or use someone else's voice, which you then modify using the voice cloning software. I assume all the processing is done in the cloud. Is that correct?
[00:52:33] Alex Serdiuk: Yeah, it's in the cloud. And basically you cannot misuse the system much from a rights standpoint, because all those voices are copyright-free. But you could misuse the system by doing something that Respeecher does not allow our users to do.

And we do have some detection systems in place to prevent that.
[00:52:56] Bret Kinsella: I understand. So that's what I was wondering. So once it does the processing, the users get the output, they download a file, and then they have that to use.
[00:53:08] Alex Serdiuk: Yeah, that's correct. They can download the file or a package of files from their project in the voice marketplace.
[00:53:14] Bret Kinsella: Right. And that's after you would have run any type of analysis on it to determine that it doesn't violate your terms.
[00:53:22] Alex Serdiuk: Yeah. It could even happen earlier, but I won't go into more detail about how we do this analysis, because it's at an early stage and we keep making it better. I would avoid the details of how it works now, so that fewer people can abuse it.
[00:53:42] Bret Kinsella: Yeah. Well, I think it's harder for you as well, because if you were just doing this for corporations, then you could probably put a standard set of controls in place and that would cover most of the potential problems that come up, and if you needed exceptions, there would be very few of them. But since you're in entertainment, things like hate speech and abusive language are very common in scripts. So you would potentially run into these flags and prohibitions coming up a lot. This sounds like a very difficult process.
[00:54:21] Alex Serdiuk: Yeah, it is. And it's also the essence of the technology, because it's not text-to-speech technology. With text-to-speech, you can just have a vocabulary of stop words you don't allow people to use. With speech-to-speech, where the same thing can be said in very different forms, in whatever language you can speak, it's a much harder problem. A rough illustration of that contrast follows below.

That's why our users, when they enter the Voice Marketplace, need to accept our terms. And if they see that they have content that might violate our rules, they need to reach out and tell us a bit more about their content so we understand it better. At that point we decide, on our end, whether we can allow them to use the Voice Marketplace for their project or not.
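To make that contrast concrete, here is a toy illustration: moderation for text-to-speech can run on the input text before any audio exists, while a speech-to-speech system only ever sees audio, so no such shortcut is available. The word list below is a placeholder, not Respeecher's actual policy.

```python
# Toy stop-word filter of the kind text-to-speech moderation can use:
# the request arrives as text, so it can be screened before synthesis.
# A speech-to-speech system receives only audio, in any language, so
# moderation has to work on the sound itself. Placeholder word list.

BLOCKED = {"badword1", "badword2"}  # hypothetical blocklist


def tts_request_allowed(script: str) -> bool:
    words = {w.strip(".,!?\"'").lower() for w in script.split()}
    return not (words & BLOCKED)


print(tts_request_allowed("A perfectly ordinary line."))   # True
print(tts_request_allowed("A line containing badword1."))  # False
```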
[00:55:11] Bret Kinsella: So essentially you then have to have a moderator function in your company, which is different, but not that much different from what social media companies deal with.
[00:55:23] Alex Serdiuk: Yeah, that's correct. And in general, it's very hard to draw a strict line between what is allowed and what is not, just because it's really hard.

There is violent speech in video games and in films that is absolutely fine as part of a creative decision, right? It would be really hard to detect that kind of context just by applying algorithms, so for quite some time I believe we will need to have moderators in place.
[00:55:58] Bret Kinsella: Right. And one other thing we haven't talked about yet, but I did want to get to before we close our conversation today, is this idea of taking a voice and enabling it to speak another language. I'm very interested in that. So you have a target voice, which has, as you say, a timbre and probably other qualities baked into it.

You then have a voice actor speak a different language, maybe Ukrainian, for example, for a native English speaker. And then it's as if that target voice is speaking Ukrainian with native fluency.
[00:56:35] Alex Serdiuk: Yeah, that's correct. And that was a feature we were always excited about, because you can make Tom Hanks speak whatever language the content is in, in the original Tom Hanks timbre, and that's something very cool. But for quite some time our technology was not capable of doing that in the right way: we had a very heavy accent, or it introduced additional artifacts. Just recently we were able to get rid of that. We posted a new demo maybe a week ago or so.

The demo was actually created in bomb shelters in Ukraine. It has someone speaking in cross-lingual mode, from French to English and from English to French, and it sounds very good; we don't have the heavy accent issues anymore. We also have one project now where we do not just cross-lingual voice conversion, but cross-lingual singing voice conversion.
That means we have a target voice, a well-known American singer, and then we have other singers who sing in Italian, Chinese, and other languages, and we convert those singers into the timbre of the original American singer. And given that we now have this technology in place, and it works well in terms of quality, we launched a campaign to help Ukrainians in these hard times our country is going through. The campaign is designed around the idea of letting famous people say whatever they want to say to Ukrainians - words of encouragement and support - in our native language, in Ukrainian. It's also very important for Ukrainians, because our language has been a target of Russian attacks for decades.

They wanted to destroy the Ukrainian language, or to divide us based on language. So it's a very meaningful campaign, not just for Respeecher, but for Ukrainians in general. We already have a few celebrities we're working with to enable them to speak perfect Ukrainian, and we're calling for more.

So if you know some famous people who want to be part of it, they just need to record a short video. We will translate it, have it voiced over by a Ukrainian actor, convert it into the target voice, send it for approval, and post it. It would be very meaningful for our community, for our nation.
[00:59:13] Bret Kinsella: Got it. And so in this case you would be creating, for these celebrities, a limited-use target voice.
[00:59:21] Alex Serdiuk: Yeah, that's correct. We need their approval to train a model of their voice. They might send us data, or just let us use data from their interviews to train the model. Then, after the model is trained, we get a video message from them where they say what they want to say. We translate it, hire a Ukrainian to perform it in Ukrainian, convert it into the target voice, and send the whole piece back for approval. It will have English subtitles, of course, so the international audience can see what's going on.
[00:59:59] Bret Kinsella: Yeah, that's great. I think about all the tweets that have been sent out, and the Instagram posts that have been published over the last several weeks.

And, you know, it almost feels like those are disposable, and this is something I think would have a lot more resonance. So it's a great challenge out there to celebrities to put their mouth where their tweets are, in this case, and to do it in such a way that not only is their voice being heard, it's being heard in the native language.

I think it's a really interesting program you're putting on, and I hope more actors sign up for it.
[01:00:40] Alex Serdiuk: Yeah, no one has ever done that. It was not possible from a technological perspective, at the quality our company provides. So it's very exciting.
[01:01:02] Bret Kinsella: Yeah, absolutely. All right, Alex, thank you so much for coming on today. Where would you direct listeners to learn more about what Respeecher is doing and stay in contact with developments at the company?
[01:01:15] Alex Serdiuk: Yeah. Thank you so much for having me. In terms of learning more, you can just visit Respeecher.com.

We have case studies and a blog there, and you can also use our Voice Marketplace if you want to try the system itself.
[01:01:29] Bret Kinsella: Excellent. Okay, folks, I'm Bret Kinsella, host of the Voicebot Podcast. I'm here each week with innovators, designers, engineers - all the people who are shaping the future of the voice and AI community.

Alex was a great example of that today, and I'm so happy I was able to bring him to you. If you read voicebot.ai, we have stories there occasionally about Respeecher and about this emerging area of synthetic speech, which has undergone a lot of transformation recently. I was happy to be able to bring this to you today.

You can find me on Twitter @bretkinsella. Let me know what you thought about today's interview, and definitely check out Respeecher. Thanks so much, Alex.
[01:02:06] Alex Serdiuk: Thank you.