What Is Singing Voice Synthesis and Is It Even Possible?
With advancements in voice cloning, the ability to synthesize vocals to sound like another person, or sing with perfect pitch in different languages, is no longer science fiction. It is now possible to vocalize text in any tone of voice, including that of a child. But what if you want to synthesize… singing? Is AI singing possible? Let’s find out.
What is singing voice synthesis?
Singing voice synthesis (SVS) is a method of generating a singing voice from musical scores with lyrics using computer models.
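To make the "musical score with lyrics" input more concrete, here is a minimal Python sketch of how a score might be turned into a frame-by-frame target pitch curve before any voice model gets involved. The names `Note`, `midi_to_hz`, and `score_to_f0`, the 10 ms frame length, and the three-note example are all assumptions made for illustration; they do not describe any particular SVS system. The only standard piece is the equal-temperament formula that maps a MIDI note number to a frequency.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Note:
    syllable: str               # lyric syllable sung on this note ("" for a rest)
    midi_pitch: Optional[int]   # MIDI note number, or None for a rest
    duration_s: float           # note length in seconds


def midi_to_hz(midi_pitch: int) -> float:
    """Equal-temperament conversion with A4 (MIDI note 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def score_to_f0(notes: List[Note], frame_s: float = 0.01) -> List[float]:
    """Expand a score into one target pitch per frame (0.0 marks silence)."""
    contour: List[float] = []
    for note in notes:
        n_frames = round(note.duration_s / frame_s)
        hz = 0.0 if note.midi_pitch is None else midi_to_hz(note.midi_pitch)
        contour.extend([hz] * n_frames)
    return contour


# A made-up three-note phrase with a short rest, not taken from any real song.
score = [
    Note("la", 69, 0.5),
    Note("", None, 0.25),
    Note("la", 71, 0.5),
    Note("la", 74, 1.0),
]
f0 = score_to_f0(score)
```

A real system would go on to add vibrato, note transitions, and timing variations on top of this flat curve, which is a large part of what the models discussed in this article learn from recordings.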
Singing synthesis has been developing since the 1950s and, like text-to-speech, revolves around two paradigms: statistical parametric synthesis, which uses statistical models to reproduce the features of a voice, and unit selection, in which snippets of vocal recordings are recombined on the fly. Thanks to recent advances in voice AI technology, maestros can listen to a song immediately after composing it, no recording necessary.
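The unit selection idea can be illustrated with a toy Python sketch. It assumes a small inventory of recorded snippets per syllable and picks the sequence that minimizes a target cost (how far a snippet is from the pitch we want) plus a join cost (how badly two neighboring snippets fit together). The `Unit` class, both cost functions, and the tiny inventory are hypothetical and heavily simplified; production systems compare many more acoustic features than pitch.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Unit:
    unit_id: str      # hypothetical label of a recorded snippet
    start_hz: float   # pitch at the start of the snippet
    end_hz: float     # pitch at the end of the snippet


def target_cost(unit: Unit, wanted_hz: float) -> float:
    """Distance between the snippet's average pitch and the wanted pitch."""
    return abs((unit.start_hz + unit.end_hz) / 2.0 - wanted_hz)


def join_cost(prev: Unit, cur: Unit) -> float:
    """Pitch jump at the boundary between two consecutive snippets."""
    return abs(prev.end_hz - cur.start_hz)


def select_units(targets: List[Tuple[str, float]],
                 inventory: Dict[str, List[Unit]]) -> List[str]:
    """Dynamic-programming search for the cheapest snippet sequence."""
    if not targets:
        return []
    # best[i][j] = (cheapest total cost ending in candidate j, back pointer)
    best: List[List[Tuple[float, int]]] = []
    for i, (syllable, wanted_hz) in enumerate(targets):
        row = []
        for unit in inventory[syllable]:
            tc = target_cost(unit, wanted_hz)
            if i == 0:
                row.append((tc, -1))
            else:
                prev_units = inventory[targets[i - 1][0]]
                cost, back = min(
                    (best[i - 1][k][0] + join_cost(p, unit), k)
                    for k, p in enumerate(prev_units)
                )
                row.append((cost + tc, back))
        best.append(row)
    # Walk the back pointers from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(inventory[targets[i][0]][j])
        j = best[i][j][1]
    return [u.unit_id for u in reversed(path)]


inventory = {
    "la": [Unit("la_1", 440, 440), Unit("la_2", 436, 494)],
    "li": [Unit("li_1", 494, 494), Unit("li_2", 440, 500)],
}
targets = [("la", 440.0), ("li", 494.0)]
chosen = select_units(targets, inventory)  # ['la_1', 'li_2']
```

In this example the snippet that matches the second target pitch best on its own (`li_1`) is not chosen, because joining it to the preceding snippet would create a large pitch jump. Balancing those two costs is the core of the unit selection paradigm.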
Modern SVS models can generate the natural singing voice of a singer in any language using vocals from the original score and recordings of singers in the target languages. This is called cross-lingual singing voice synthesis, which produces remarkably realistic AI voices.
In recent years, the following technologies have been used to achieve SVS:
- generic deep neural networks (DNN)
- convolutional neural networks
- recurrent neural networks with long short-term memory (LSTM)
- generative adversarial networks (GAN)
Use cases for singing voice synthesis
Singing voice synthesis technology, powered by AI-generated voices, allows musicians and singers to instantly know how their written music will sound. It’s no longer necessary to go through the process of recording a piece of music, investing all the time, money, and resources that go into it. And no need to hire a team to assist with recording sessions.
Another critical use case is creating music for games and other projects that demand high degrees of audio support. Recording songs with real artists is extremely expensive for video game producers. Singing voice synthesis, powered by gen AI, allows smaller indie devs to produce songs from musical scores and text using existing voices.
Artists who want to reach a global audience with their message and support different groups of people around the world can also benefit from cross-lingual singing voice synthesis. With the assistance of AI singers, they now have an inexpensive means of distributing their message in any language.
How does cross-lingual singing voice synthesis work? Respeecher’s example
When synthesizing the singing voice of a particular performer, specialists begin by using samples of their vocals.
In total, about an hour of an individual’s vocals is needed to construct an initial model, and 10-15 minutes of recording are used for the synthesizing process. This meticulous approach ensures the creation of a realistic AI voice that accurately reflects the nuances and characteristics of the original performer's singing style.
These recordings are loaded into a neural network, which then generates a voice, taking into account all possible nuances. The result is a synthesized voice that is almost indistinguishable from the original.
This is how Respeecher implements cross-lingual singing voice synthesis:
On the fourth anniversary of the death of famous Swedish musician Tim Bergling, known professionally as Avicii, one of his best-known collaborators, Aloe Blacc, paid tribute to the artist. He performed and recorded Avicii’s hit “Wake Me Up” in English, Mandarin, Spanish, Italian, and French using AI voice synthesis. In doing so, his aim was to allow more people all around the world to appreciate Avicii’s talent in a deeper way.
Since Aloe’s aim was to sing the song flawlessly, not only in English but also in Mandarin, Spanish, Italian, and French, he was going to need some technological help from singing voice synthesis experts.
In order to facilitate the accuracy of the lyrics while also correctly following the natural beat of the song, Aloe Blacc turned to Respeecher and Metaphysic.ai.
Firstly, Aloe Blacc recorded a video of himself singing “Wake Me Up” in English. In order for him to also sing in Mandarin, Spanish, Italian, and French, the Respeecher team took recordings of other singers performing the song in these languages and applied them to Blacc’s voice using gen AI technology.
Then, Metaphysic.ai was tasked with lip-syncing Blacc’s vocal movements, making his mouth appear natural when singing in various languages. This synchronization process, combined with the use of AI-generated voice technology, ensured a seamless and authentic performance across different linguistic renditions of the song.
In a Nutshell
Thanks to singing voice synthesis technology, artists can “sing” in as many languages as they want. AI speech-to-speech technology clones a performer’s voice and reproduces it in such a way that the same material can be performed in a foreign language using the same voice. All you need is at least one native speaker of each language you want your content reproduced in.
We encourage you to get in touch with Respeecher for a brief consultation regarding the use of our technology and scaling singing voice synthesis to meet the demands of your use case.