Recently, there has been a rising tide of concern surrounding voice manipulation software. In short, this technology lets a person take recordings of someone’s speech and fabricate new utterances that the individual never actually said. Synthetic speech technology has been around for decades, but the entry of a new player in this space, or the publication of a research paper on the topic, tends to create a frenzy of excitement and anxiety about the implications of such capabilities for our society.
So what is all of the fuss about? Just as photo and video editing software enables people to create images and videos that blend reality with fiction, voice manipulation software can be used for the same purpose. Have you ever been fooled by a manipulated photograph? I personally like this edited picture of a killer whale attacking a bear. Does it look real? Absolutely. Did it happen? Clearly not. Most of us have been fooled a few times by fake pictures and videos and have learned that any picture or video can be easily manipulated. I would argue, in fact, that this happens every day in most lifestyle magazines, as photo retouching has ensured that most of the photographs we see in print are altered versions of the original image. Seeing is not always believing. Manipulating videos has become just as easy; take a look at this video of a hawk supposedly dropping a snake on a family BBQ. In the 20th century, this video could have fooled millions. In the 21st century, most of us know better.
[aditude-amp id="flyingcarpet" targeting='{"env":"staging","page_type":"article","post_id":2134088,"post_type":"guest","post_chan":"none","tags":null,"ai":true,"category":"none","all_categories":"ai,bots,","session":"C"}']So it should come as no surprise that voice manipulation is possible, as well. But what can be done with the software tools that are readily available? You can, for instance, create an audio file of someone supposedly speaking words that they never actually uttered. All you need is about 20 minutes of net-speech from the person whose voice you’d like to manipulate. Then, with voice synthesis, you can create an audio file of the person saying virtually anything. As with photo and video manipulation, it works best and is most convincing when taking a real phrase and changing maybe a word or two, instead of fully synthesizing an entire sentence, which could be easily detected. For example, if you have a recording of someone saying “I love coffee,” manipulating the audio to voice “I love Lucy,” will be more convincing than creating a brand new sentence such as “the sky is blue.”
As you can imagine, voice manipulation can be used for nefarious purposes. Creating audio recordings of individuals speaking sentences that they never actually spoke could be used to discredit someone’s reputation, or worse, implicate someone in a criminal act that they didn’t actually commit. So does voice manipulation have ethical implications? Yes, it does, just like photo and video manipulation.
Beyond ethical concerns, voice manipulation has raised security concerns as well. With the growing popularity of voice biometrics deployed by banks, telecom providers, insurance companies, and government agencies, questions have arisen regarding how voice manipulation could be used to potentially defeat voice biometric security layers. A research paper published by researchers at the University of Alabama raised this specific concern.
Organizations have been using voice biometrics for many reasons. Voice biometrics today allow consumers to log into (or authenticate) mobile apps without having to type in a password or PIN. Simply speaking a short passphrase, such as “my voice is my password,” can validate a person’s identity with a high degree of confidence.
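Under the hood, passphrase verification typically boils down to comparing a stored voiceprint against the new attempt. The Python sketch below is a simplified illustration of that scoring step, under stated assumptions: the embedding vectors are assumed to come from some speaker-embedding model (not shown), and the 0.75 threshold is purely illustrative, since production systems tune thresholds carefully against false accepts and false rejects.

```python
# A simplified, illustrative scoring step for voice biometric
# verification. The embeddings are assumed to come from a
# speaker-embedding model trained elsewhere; the threshold is made up.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two voiceprint vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray, attempt: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Accept the caller if their attempt is close enough to enrollment."""
    return cosine_similarity(enrolled, attempt) >= threshold
```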
The same technology is also used to authenticate customers in contact centers, eliminating the need for hard-to-remember PINs or, worse, a series of security questions such as “what was the name of your best childhood friend?” or “what was your most recent transaction?” A primary driver for organizations deploying voice biometrics is to improve the customer experience by moving past these outdated authentication methods: voice biometrics reduces both the time it takes to authenticate and the rate of authentication failures. We all hate to fail, and we hate to waste our time; organizations know that eliminating these two irritants significantly benefits customer retention and overall satisfaction. Beyond improving the customer experience, a second key driver for implementing voice biometrics has been to improve security, replacing or enhancing legacy authentication methods and driving down fraud losses. Here is where voice manipulation software creates a question mark: does it undermine these security benefits?
While no security technology is impenetrable, and voice biometrics is no exception, real-world experience has shown that the technology can effectively detect and prevent voice manipulation attacks that use voice synthesis. Opus Research, a leading analyst firm in the industry, wrote about this very topic back in 2015, pointing out that voice biometric technologies include anti-spoofing mechanisms to detect these types of attacks, such as those made using voice recordings or voice synthesis.
Voice manipulation as I’ve described it above relies on a combination of voice recordings and voice synthesis, both of which can be detected thanks to the audio artifacts that each process generates. To the human ear, a poor recording or a clunky voice synthesis is easy to detect: you may not be able to describe why, but you can tell that the voice quality doesn’t sound right or that the voice sounds artificial in some way. Anti-spoofing algorithms operate in a similar fashion, but with greater accuracy than the human ear; they can pick up minute audio discrepancies caused by recordings or voice synthesis, aberrations that are undetectable to humans.
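As a toy illustration of that idea, the Python sketch below summarizes an audio file into spectral statistics and hands them to a classifier trained elsewhere, on labeled genuine and spoofed audio, to flag the artifacts mentioned above. The feature choice and the classifier interface are assumptions for illustration; commercial anti-spoofing uses far richer features and models.

```python
# A toy anti-spoofing sketch: summarize spectral features and score
# them with a pre-trained classifier. Both the features and the model
# interface are illustrative, not a production design.
import numpy as np
import librosa

def spoof_features(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    # Per-coefficient mean and spread over time; recording and
    # synthesis artifacts tend to shift these statistics
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def looks_spoofed(path: str, spoof_model) -> bool:
    # spoof_model: any classifier with predict_proba (e.g. scikit-learn),
    # trained beforehand on labeled genuine vs. spoofed recordings
    features = spoof_features(path).reshape(1, -1)
    return spoof_model.predict_proba(features)[0, 1] > 0.5
```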
Organizations have been using voice biometrics to secure banking accounts and confidential data since 2001. So far this year, there have been over 1 billion voice biometric verifications performed, and not a single synthetic speech attack has been successful. This is a testament to the effectiveness of the anti-spoofing capabilities that protect such systems from these attacks. Furthermore, many organizations have reported a significant reduction in fraud losses following the deployment of voice biometrics. This shows that the technology that fraudsters have readily available to them does not offer a simple bypass of voice biometric security systems.
[aditude-amp id="medium1" targeting='{"env":"staging","page_type":"article","post_id":2134088,"post_type":"guest","post_chan":"none","tags":null,"ai":true,"category":"none","all_categories":"ai,bots,","session":"C"}']
Clearly, academia and the industry at large need to remain vigilant to ensure that anti-spoofing techniques stay ahead of voice manipulation capabilities. As with all forms of security, it’s imperative that we employ continuous effort and innovation to stay one step ahead of those who seek to commit crimes.