We all become accustomed to the tone and pattern of human speech at an early age, and any deviations from what we have come to accept as “normal” are immediately recognizable. That’s why it has been so difficult to develop text-to-speech (TTS) that sounds authentically human. Google’s DeepMind AI research arm has turned its machine learning model on the problem, and the resulting “WaveNet” platform has produced some amazing (and slightly creepy) results.
Google and other companies have made huge advances in making human speech understandable to machines, but making the reply sound realistic has proven more challenging. Most TTS systems are based on so-called concatenative synthesis, which relies on a database of short speech fragments that are stitched together to form words. The result tends to sound uneven, with odd inflections. There is also some work being done on parametric TTS, which uses a data model to generate words, but this sounds even less natural.
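The idea behind concatenative synthesis can be shown in a few lines. This is a deliberately minimal sketch, not any production system: the fragment names and waveform values below are invented for illustration, and real systems select among thousands of recorded units and smooth the joins.

```python
# Toy sketch of concatenative TTS: stitch pre-recorded speech fragments
# together to form an utterance. All names and sample values are made up.

# Pretend database: unit name -> pre-recorded waveform samples
fragment_db = {
    "HH": [0.1, 0.2, 0.1],
    "EH": [0.3, 0.4, 0.3],
    "L":  [0.2, 0.1, 0.2],
    "OW": [0.5, 0.4, 0.5],
}

def synthesize(units):
    """Concatenate stored fragments in order to form an utterance."""
    samples = []
    for unit in units:
        samples.extend(fragment_db[unit])  # no smoothing at the joins,
    return samples                         # hence the uneven sound

utterance = synthesize(["HH", "EH", "L", "OW"])
print(len(utterance))  # 12 samples: 4 fragments of 3 samples each
```

The abrupt joins between fragments are exactly where the characteristic unevenness comes from.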
DeepMind is changing the way speech synthesis is handled by directly modeling the raw waveform of human speech. Because WaveNet operates at such a low level, one audio sample at a time, it can conceivably generate any kind of speech or even music. Listen above for an example of WaveNet’s voice synthesis. There’s an almost uncanny valley quality to it.
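To make predicting raw samples tractable, the WaveNet paper compresses each audio sample to one of 256 levels with mu-law companding, so the network outputs a probability over 256 classes per sample. The sketch below shows that companding step in isolation; it is an illustration of the encoding, not DeepMind's code.

```python
import math

MU = 255  # mu-law parameter: yields 256 quantization levels

def mu_law_encode(x):
    """Map a sample in [-1, 1] to an integer level in 0..255."""
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((compressed + 1) / 2 * MU + 0.5)  # round to nearest level

def mu_law_decode(level):
    """Invert the encoding back to an approximate sample in [-1, 1]."""
    y = 2 * (level / MU) - 1
    return math.copysign((1 / MU) * ((1 + MU) ** abs(y) - 1), y)

# Round-trip check: companding keeps quiet samples precise, which matters
# because speech spends most of its time near zero amplitude.
for x in [0.0, 0.01, -0.5, 1.0]:
    assert abs(mu_law_decode(mu_law_encode(x)) - x) < 0.05
```

The non-linear spacing of the levels is the point: it trades resolution on loud samples for fidelity on the quiet ones that dominate speech.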