On March 29th, 2024, OpenAI stepped up its generative AI efforts by unveiling a new voice cloning tool, Voice Engine. The tool can clone a voice from a single 15-second audio sample.
Highlights:
- OpenAI unveils Voice Engine, an AI that can clone any user’s voice.
- Comes with several features such as translation and assistance with reading.
- Currently in preview mode and only rolled out to a few companies, keeping safety guidelines in mind.
We're sharing our learnings from a small-scale preview of Voice Engine, a model which uses text input and a single 15-second audio sample to generate natural-sounding speech that closely resembles the original speaker. https://t.co/yLsfGaVtrZ
— OpenAI (@OpenAI) March 29, 2024
OpenAI has been moving quickly in the generative AI industry. After Sora, its state-of-the-art video generation model, Voice Engine is another major release that AI enthusiasts and developers will be watching closely.
What is OpenAI’s Voice Engine, and how can developers make the most of it? What features does it come with? Let’s look at them in depth.
What is Voice Engine from OpenAI?
The artificial intelligence firm OpenAI has entered the voice cloning space with Voice Engine, its most recent model. With just 15 seconds of recorded speech from the subject, it can closely mimic an individual’s voice.
The development of Voice Engine began in late 2022, and OpenAI has utilized it to power ChatGPT Voice and Read Aloud, in addition to the preset voices that are available in the text-to-speech API.
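Voice Engine's cloning capability is not publicly available, but the preset voices mentioned above can be tried through the text-to-speech API. Below is a minimal sketch using only the Python standard library. The endpoint, the `tts-1` model name, and the six voice names reflect the public API documentation at the time of writing and may change; the helper names `build_speech_request` and `synthesize` are illustrative and not part of any OpenAI library.

```python
import json
import urllib.request

# Preset voices and endpoint as documented at the time of writing (assumption:
# these may change; check the current API reference before relying on them).
PRESET_VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, voice="alloy", api_key="YOUR_API_KEY", model="tts-1"):
    """Build (but do not send) a text-to-speech request for a preset voice."""
    if voice not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {voice!r}")
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and save the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Calling `synthesize("Hello from a preset voice.", "hello.mp3", voice="nova", api_key=...)` with a valid key would write the generated audio to disk. Note that the request accepts only a preset voice name: there is no parameter here for supplying a reference audio sample, which is the part Voice Engine adds and which remains limited to preview partners.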
All Voice Engine needs is a short recording of your speaking voice and some text to read; from these it generates speech in a copy of your voice. The resulting voices are highly realistic and convey emotion convincingly.
How was Voice Engine trained?
OpenAI’s Voice Engine model was trained on a combination of licensed and openly accessible data sets. Like other generative models, the one behind Voice Engine learns from a very large number of examples, in this case speech recordings, typically drawn from public websites and data sets.
Jeff Harris, a member of the product staff at OpenAI, told TechCrunch in an interview that the generative AI model behind Voice Engine has been in use behind the scenes for some time. Because training data and related information are valuable assets, many generative AI vendors tend to keep them confidential.
Another reason vendors disclose little about training data is that it could become the subject of IP-related disputes. This is one of the main reasons so little has been shared about how Voice Engine’s model was trained. OpenAI may yet publish a more detailed technical report covering the model’s architecture and data sets.
Notably, Voice Engine is not trained or fine-tuned on user data. This is partly due to the ephemeral way the model, which combines a diffusion process and a transformer, generates speech. It analyzes the reference speech sample and the text to be read aloud at the same time, and produces a matching voice without building a custom model for each speaker.
We take a small audio sample and text and generate realistic speech that matches the original speaker. The audio that’s used is dropped after the request is complete.
Harris told TechCrunch in the interview regarding Voice Engine.
Looking Into Voice Engine’s Features
OpenAI’s Voice Engine comes with several features, all built around realistic cloning of a speaker’s voice. Let’s look at them in detail:
1. Assisting With Reading
Voice Engine’s audio cloning can benefit children and students because it produces realistic, expressive voices that convey a greater variety of speech than preset voices can. The tool has strong potential to support interactive learning and reading sessions and so improve the quality of education.
Age of Learning, an education technology company, has been using GPT-4 and Voice Engine to improve the learning and reading experience for a much wider audience.
In the tweet below, you can see how the reference audio is being cloned by Voice Engine to teach a variety of subjects such as Biology, Reading, Chemistry, Math, and Physics.
OpenAI has introduced its voice cloning tool, Voice Engine.
— BPT (@bpthaber) March 30, 2024
With a short 15-second audio clip, it can realistically copy human voices and convert written text to speech. pic.twitter.com/6yNhhEGvxe
2. Translating Audio
Voice Engine can take a user’s voice input and translate it into multiple languages, so that content can reach a wider range of audiences and communities.
Voice Engine maintains the original speaker’s native accent when translating; for instance, if English is generated using an audio sample from a Spanish speaker, the result would be Spanish-accented speech.
HeyGen, an AI visual storytelling company, is currently using OpenAI’s Voice Engine to translate audio inputs into multiple languages for a variety of content and demos.
In the tweet below, you can see how the input reference voice in English is translated into Spanish, Mandarin, and more.
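OpenAI announces its speech generation model: Voice Engine
— 小互 (@imxiaohu) March 30, 2024
Given text input and a single 15-second audio sample, it can generate natural-sounding speech that closely resembles the original speaker's voice.
Voice Engine was first developed in late 2022 and has been made available to a small number of companies, including Heygen, for testing.
Main features
1. Natural-sounding speech generation: using a single 15-second audio sample, Voice… pic.twitter.com/AjP2wAYr4N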
3. Connecting with Communities throughout the World
With Voice Engine and GPT-4, workers can receive interactive feedback in their native tongue, such as Swahili, or in more colloquial languages like Sheng, a code-mixed language widely used in Kenya. This can help improve service delivery in remote settings.
@OpenAI text to voice Engine 🔥🫨 https://t.co/5rgQMbW7wR pic.twitter.com/XnWyIDj8Oj
— Patrick Assalé (@patrickassale) March 29, 2024
Voice Engine makes it possible to improve quality of life and services in remote regions that have long lacked access to the latest generative AI models.
4. Helping Non-Verbal People
People who are non-verbal can use Voice Engine to help with day-to-day communication. Livox, an AI alternative communication app, powers AAC (Augmentative & Alternative Communication) devices, which facilitate communication for people with disabilities. With Voice Engine, Livox can offer non-verbal people distinct, human-sounding voices in a variety of languages.
Users who speak more than one language can select the speech that most accurately reflects them, and they can keep their voice consistent in all spoken languages.
Voice Engine
— سعيد الكلباني (@smalkalbani) March 29, 2024
OpenAI's revolution in smart voice technology
OpenAI announced the launch of a new voice model called “Voice Engine”, which can generate natural voices resembling a person's voice from just a 15-second audio sample. This model has already been used by major partners such as HeyGen.
▪️Key points about Voice… pic.twitter.com/TxrVPQPYw4
5. Assisting Patients in Regaining Voice
Voice Engine can also help people who suffer from sudden or degenerative voice conditions. As part of a pilot program, the Norman Prince Neurosciences Institute at Lifespan, a not-for-profit health system that is the main teaching affiliate of Brown University’s medical school, is offering Voice Engine to patients whose speech impairment has neurologic or oncologic causes.
Because Voice Engine requires only a brief audio sample, doctors Fatima Mirza, Rohaid Ali, and Konstantina Svokos were able to restore the voice of a young patient who had lost her fluent speech to a vascular brain tumor, using audio from a video recorded for a school project.
My favorite from @OpenAI new voice engine :#voiceEngine #openAi #aiforgood
— Qaisar Roonjha (@QRoonjha) March 29, 2024
OpenAI's Voice Engine is changing lives by restoring voices to those who've lost them! Check out this video of its impact on a patient who lost her speech to a brain tumor. pic.twitter.com/Qed1Z2ezgj
Overall, Voice Engine’s cloning capabilities extend well beyond simple audio generation, covering use cases that benefit students, diverse communities, non-verbal people, and patients with speech impairments. OpenAI has made a bold move in developing a tool that could be useful to people worldwide.
Is Voice Engine Accessible?
OpenAI’s announcement of Voice Engine follows its filing of a trademark application for the name, which had already hinted at its plans for voice-related technology. Despite the technology’s potential, the company has chosen to restrict Voice Engine’s availability to a small number of early testers for now, citing concerns over misuse and the accompanying risks.
In line with our approach to AI safety and our voluntary commitments, we are choosing to preview but not widely release this technology at this time. We hope this preview of Voice Engine both underscores its potential and also motivates the need to bolster societal resilience against the challenges brought by ever more convincing generative models.
OpenAI wrote in its blog post, explaining the limited release of Voice Engine.
Only a small group of companies have access to Voice Engine so far, and they are using it to help several groups of people, some of which we discussed above. A public rollout could follow in the months to come.
How is OpenAI tackling the misuse of “Deepfakes” with Voice Engine?
Recognizing the serious risks associated with voice mimicry, especially around sensitive events like elections, OpenAI stresses the need to use this technology responsibly. Recent incidents, such as robocalls that imitate political figures with AI-generated voices, show why vigilance is critical.
Given the serious consequences of producing speech that closely resembles real people’s voices, especially during election season, the company described the preventative measures it is taking to mitigate these dangers.
We recognize that generating speech that resembles people’s voices has serious risks, which are especially top of mind in an election year. We are engaging with U.S. and international partners from across government, media, entertainment, education, civil society, and beyond to ensure we are incorporating their feedback as we build.
OpenAI
The company also announced a set of safety measures, including watermarking to trace the origin of any audio generated by Voice Engine and monitoring of how that audio is being used. The companies currently using Voice Engine must also adhere to OpenAI’s usage policies, which require obtaining consent from the person whose voice is being used and informing the audience that the audio is AI-generated.
Conclusion
Voice Engine from OpenAI has the potential to change the landscape of audio generation. As OpenAI continues to advance in artificial intelligence, technologies like Voice Engine, which bring both unprecedented opportunities and new difficulties, are likely to shape the direction of human-computer interaction. Only time will tell how the public will receive the tool.