Researchers at Microsoft have published details of new speech recognition technology that they say transcribes conversational speech as well as a human does. “We’ve reached human parity,” says Microsoft’s chief speech scientist Xuedong Huang in a statement. “This is an historic achievement.”
The system’s reported word error rate is 5.9 percent, which Microsoft says is “about equal” to that of professional transcriptionists asked to transcribe speech from the same Switchboard corpus of conversations. It uses neural language models that group similar words together, allowing for efficient generalization. Microsoft plans to use the technology in Cortana, its personal voice assistant for Windows and the Xbox One, as well as in speech-to-text transcription software.
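For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The short Python sketch below illustrates that standard definition only; it is not Microsoft's scoring pipeline, the function name `word_error_rate` is made up for this example, and it assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and exact,
    case-sensitive word matching.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "we have reached human parity"
    hypothesis = "we reached human parody"
    # One deleted word plus one substituted word over five reference words.
    print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 40.0%
```

On this scale, a rate of 5.9 percent corresponds to roughly six word errors for every hundred words in the reference transcript.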
Although the results are impressive, they are far from an endgame for speech recognition. Microsoft still needs to tune the technology to work as well on conversations in a wider range of more challenging real-life situations and with a broader selection of voices. And for use cases such as Cortana, much of the difficulty comes from teaching the artificial intelligence to understand the meaning of words and act on them, not just to hear them accurately.