Solo on a synthesizer… of speech

A beautiful voice is a rare thing, and its owners consider it a generous gift of nature and a valuable tool. Pleasant tone and educated speech are a marvelous combo, rare luck, and a real superpower when it comes to working with people, especially in sales or public performances.

A machine speaks

We often hear robot voices and automatically compiled or generated audio recordings in public transport, contact centers, public announcements, car navigation systems, and smart radio. From the first words, we understand that this is not human speech. Intonation, pauses, and word order give the robot away immediately. The phrase may be correct and the meaning clear, but the speech still sounds lifeless. Most people subconsciously find it annoying, and an annoyed person (for example, a customer) is less disposed to negotiate, understand, or buy.

Meet the natural “synthetics”

Digital transformation, a process that implies continuous analysis of customer needs, is a rising trend in the production of video and audio content of all kinds. There are start-ups developing software that synthesizes a pleasant voice. It sounds so genuine that few people will realize that their interlocutor is not a pretty girl but a robot.

People have already mastered AI technologies that create "synthetic" voices that sound very pleasant to most ears. These technologies have been trained on the voices of many actors. The software analyzes their manner of speaking and can then “dictate” any text. The AI voice sounds not just natural but sometimes even relaxed. The secret is the absence of strict rules: pauses between words do not have to be strictly defined, and the speed of speech does not need to be constant. The main “trick” of the program is improvisation. The result is expressive and realistic.

However, Alexa, Siri, Google Assistant, and other popular mobile assistants still tend to have "metallic," robotic voices. A notable exception is Google Duplex, whose impressive AI-powered voice sounds very human.

Emotions of artificial intelligence

We do not know for sure how the voice assistant market will change, but we can assume that its development trends will be similar to those of other innovative technologies. Beautiful AI-powered voices will be sold to companies involved in advertising, marketing, and e-learning courses. There will be no need to hire professional announcers anymore; all you will need is a well-written text.

The good thing is that you will have plenty of choices! Based on the expected effect on the audience, the customer will be able to choose the most appropriate synthetic voice tone, one that can convey the original meaning in all its subtleties. The voice can be encouraging, energetic, satisfied, soothing, "maternal," young or old, fast or calm. What is the sound of a "rich woman," a "confident young professional," a "repair worker," or a "genius child"? We all know it on a subconscious level, and artificial intelligence can confirm it. Just like photos from stock photo banks, voices will soon become available in online marketplaces.

Interestingly, the AI itself will guess which words should be emphasized. If you run the same text through the speech synthesizer twice, the outputs will sound a little different. This is how synthesizer speech improvisation works.

So far, AI speech synthesizers cannot generate convincing long monologues. They can produce them, but the results are not persuasive enough. Short texts of one or two sentences, however, work perfectly. It takes a program about 4 seconds to produce one phrase. To synthesize a larger fragment, say, a paragraph, you need to slice it into pieces and give the AI more time for analysis.
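The slicing step described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular synthesizer: the function names `split_into_chunks` and `estimate_seconds` are hypothetical, the sentence splitting is deliberately naive, and the sketch assumes that each chunk of one or two sentences counts as one "phrase" taking roughly 4 seconds.

```python
import re

# Rough per-phrase figure quoted in the article (an approximation, not a benchmark).
SECONDS_PER_PHRASE = 4


def split_into_chunks(text, sentences_per_chunk=2):
    """Split text into chunks of at most `sentences_per_chunk` sentences.

    Naive rule: a sentence ends with '.', '!' or '?' followed by whitespace.
    Abbreviations such as "e.g." will be split incorrectly.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def estimate_seconds(chunks, seconds_per_phrase=SECONDS_PER_PHRASE):
    """Estimate total synthesis time, assuming one chunk equals one phrase."""
    return len(chunks) * seconds_per_phrase


if __name__ == "__main__":
    paragraph = "A machine speaks. Does it sound alive? Not yet! It will soon."
    chunks = split_into_chunks(paragraph)
    for chunk in chunks:
        print(chunk)
    print("Estimated seconds:", estimate_seconds(chunks))
```

Each chunk would then be passed to the synthesizer separately, and the resulting audio fragments joined in order.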

It is difficult to give the speech synthesizer the right amount of information so that it can express the necessary feelings. Actors have to read a vast number of texts, including Wikipedia articles, for example. But isn't it a miracle when an AI generates a phrase that can't be distinguished from a real human voice?!

Real or not real?

There is another concern to be taken into account: the ethical one. Is it acceptable if people cannot tell whether a robot or a human is talking to them?

In 2018, Google gave an impressive demonstration of Duplex's capabilities. The AI made a phone call and booked a table in a Bay Area restaurant. The corporation was criticized for the experiment, but the developers of robotic voices believe that disclosing the AI nature of a voice is not necessary when a positive result is achieved. At least when it comes to advertising.
