The tool, named VALL-E 2, is a text-to-speech generator capable of mimicking a voice based on just a few seconds of audio. Despite its impressive capabilities, the tech giant has decided not to share it with the public, citing “potential risks” of misuse.
VALL-E 2 can clone the voice of a speaker it never encountered during training, working only from a short audio prompt — a scenario called zero-shot learning. According to Microsoft, VALL-E 2 is the first of its kind to achieve “human parity,” meaning it meets or surpasses benchmarks for human likeness. It follows the original VALL-E system, which was announced in January 2023.
Developers at Microsoft Research claim that VALL-E 2 can produce “accurate, natural speech in the exact voice of the original speaker, comparable to human performance.” It can synthesize complex sentences as well as short phrases. To achieve this, the tool utilizes two key features: Repetition Aware Sampling and Grouped Code Modeling.
Repetition Aware Sampling addresses the issue of repetitive tokens — the smallest units of data a language model can process. In a text model these correspond to words or parts of words; in a speech model like VALL-E 2, they represent short snippets of encoded audio. The feature prevents recurring sounds or phrases during the decoding process, helping to vary the system’s speech and make it sound more natural.
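Conceptually, the idea can be sketched in a few lines of Python: sample normally, but if the chosen token already dominates the recent decoding history, resample from the full distribution to break the loop. The function names, window size, and threshold below are illustrative assumptions, not Microsoft’s actual implementation.

```python
import random

def nucleus_sample(probs, top_p=0.9, rng=random):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then sample from that set proportionally.
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def repetition_aware_sample(probs, history, window=10, threshold=0.5,
                            top_p=0.9, rng=random):
    # Default path: ordinary nucleus sampling.
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    # If this token already makes up most of the recent history,
    # fall back to sampling from the full distribution instead.
    if recent and recent.count(token) / len(recent) > threshold:
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With an empty history the function behaves like plain nucleus sampling; only when the decoder starts looping does the fallback kick in, which is what keeps repeated sounds from stalling the output.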
Grouped Code Modeling bundles consecutive tokens into groups that the model handles in a single step, shortening the sequence it must work through and generating results faster.
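A minimal sketch of that grouping step, under the assumption that the model’s output is a flat sequence of integer codec tokens: partitioning the sequence into fixed-size groups means a sequence of length N takes roughly N / group_size decoding steps instead of N.

```python
def group_codes(codes, group_size):
    # Partition a flat token sequence into fixed-size groups; the model
    # would then predict one whole group per decoding step.
    assert len(codes) % group_size == 0, "pad to a multiple of group_size first"
    return [tuple(codes[i:i + group_size])
            for i in range(0, len(codes), group_size)]

def ungroup_codes(groups):
    # Flatten the groups back into the original token sequence.
    return [token for group in groups for token in group]
```

Halving or quartering the number of decoding steps this way is what lets the system handle long, complex sentences without a proportional slowdown.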
The researchers compared VALL-E 2 against audio samples from LibriSpeech and VCTK, two English-language databases. They also borrowed a set of particularly difficult sentences from ELLA-V, another zero-shot text-to-speech system, to assess how well VALL-E 2 handled more complex tasks. According to a June 17 paper summarizing the results, the system ultimately outperformed its competitors “in speech robustness, naturalness, and speaker similarity.”
Microsoft claims VALL-E 2 will remain a research project and will not be released to the public anytime soon. “Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public,” the company wrote on its website. “It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”
The tech behemoth notes that suspected abuse of the tool can be reported using an online portal.
Microsoft’s concerns are well-founded. This year, cybersecurity experts have seen a surge in the use of AI tools by malicious actors, including those that replicate speech. “Vishing,” a combination of “voice” and “phishing,” is an attack where scammers pose as friends, family, or other trusted parties on the phone. Voice spoofing could even pose a national security risk. In January, a robocall using President Joe Biden’s voice urged Democrats not to vote in New Hampshire primaries. The man behind the plot was later indicted on charges of voter suppression and impersonation of a candidate.
Microsoft has faced increased scrutiny over its implementation of AI, particularly regarding antitrust and data privacy concerns. Regulators have voiced concerns about the tech giant’s $13 billion partnership with OpenAI and its resulting control over the startup. The company has also faced backlash from its users.
For instance, Recall, an “AI assistant” that takes screen captures of a device every few seconds, saw its release indefinitely postponed last month. Microsoft faced a deluge of criticism from consumers and from data privacy watchdogs such as the UK’s Information Commissioner’s Office. In a statement to The U.S. Sun, a company spokesperson said Recall would shift “from a preview experience broadly available for Copilot+ PCs…to a preview available first in the Windows Insider Program.”