What is voice recognition? How it works & what it’s used for
Voice recognition is the process of converting a voice into digital data. The technology first appeared in the early 1950s, but it has become widely popular only in recent years. In this article, we look at what this technology is and how it works. We also describe how it is used in several industries and introduce some well-known voice and speech recognition solutions.
Table of Contents
- What Is Voice Recognition?
- What Is the Difference Between Voice Recognition and Speech Recognition?
- Types of Voice Recognition Systems
- Types of Speech Recognition Systems
- Brief History of Speech Recognition
- How Speech Recognition Works
- Recording Your Voice
- Enrollment
- Speech Recognition Tools
- How Speech Recognition Is Used
- Why Is Speech Recognition Good?
- Speech Recognition Advantages and Disadvantages
- Speech Recognition Advantages
- Speech Recognition Disadvantages
- Speech Recognition Technology Applications
- Healthcare
- Military
- Usage in Education
- People With Disabilities
- In-Car Systems
- Voice-Controlled Video Games
- Different Speech Recognition (Virtual Assistant) Software
- Apple's Siri
- Amazon Alexa
- Microsoft's Cortana
- Google Assistant
- Nuance's Dragon Assistant and Dragon NaturallySpeaking
- Does Speech Recognition Need Training?
- Future Uses of Speech Recognition Technology
- Is It Worth It to Use MFA (Multi-Factor Authentication)?
- FAQ
- How do I use Google voice recognition?
- What is voice recognition used for?
- What are the advantages of a speech recognition system?
- How is speech recognition used in healthcare?
- How reliable is speech recognition?
- What is the difference between voice recognition and speech recognition?
What Is Voice Recognition?
Voice recognition, also called speaker recognition, is the ability of a program to identify a person based on their unique voiceprint. It works by analyzing the speech and matching it against a stored voiceprint. The development of AI opened up extensive opportunities for this subfield of computer science. It enables us to interact with machines without touching them. The field is growing rapidly, and developers keep finding new ways to apply it.
What Is the Difference Between Voice Recognition and Speech Recognition?
It is essential to understand the differences between these two disciplines. The purpose of voice recognition is to identify the voice owner. Speech recognition's purpose is to identify the words of the speaker. In the first case, the program needs a unique voiceprint of the speaker for comparison. In the second case, the program needs a huge dictionary to identify the speaker's words.
Types of Voice Recognition Systems
Voice recognition systems fall into two categories:
- Text-Dependent: The system is trained to recognize predetermined passphrases spoken by the speaker;
- Text-Independent: The system doesn't require predetermined passphrases. It analyzes conversational speech.
Types of Speech Recognition Systems
Automatic Speech Recognition (ASR) systems can be classified in several ways. The first is by how they depend on the speaker, which gives two types:
- Speaker Dependent: The program is trained to recognize a specific voice, similar to voice recognition. The speaker must “talk” to the program so that it can analyze the voice. Such systems are easier to implement and provide high recognition accuracy for the trained speaker;
- Speaker Independent: This type of speech recognition software is more widely used. It doesn't require training on a particular voice. The emphasis is on recognizing the speaker's words. Typical examples of such programs are IVR (interactive voice response) systems.
The other method of categorization is based on how the user speaks:
- Discrete Speech Recognition: ASR applications have used this method since the earliest versions. The speaker must pronounce each word separately, inserting pauses between them. Such programs are harder to work with, because it isn't easy to keep up a natural flow of speech;
- Continuous Speech Recognition: This is a relatively new method of ASR and requires more effort to develop. The speaker can talk at a rate close to normal.
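To make the pause requirement of discrete speech recognition more concrete, here is a minimal Python sketch that splits a signal into word-like segments wherever the loudness stays low for a while. The function name `split_on_pauses`, its parameters, and the tiny hand-made signal are assumptions made for illustration; real systems use far more robust voice activity detection.

```python
def split_on_pauses(samples, frame_len=4, threshold=0.1, min_pause_frames=2):
    """Return (start, end) sample indices of segments separated by pauses.

    Illustrative sketch only: loudness is the mean absolute amplitude of a frame.
    """
    n_frames = len(samples) // frame_len
    loud = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        loud.append(sum(abs(s) for s in frame) / frame_len > threshold)

    segments = []
    start = None      # index of the first loud frame of the current segment
    quiet_run = 0     # number of consecutive quiet frames seen so far
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_pause_frames:
                end = i - quiet_run + 1  # first quiet frame after the segment
                segments.append((start * frame_len, end * frame_len))
                start = None
                quiet_run = 0
    if start is not None:
        # The signal ended before a long enough pause was seen.
        end = n_frames - quiet_run
        segments.append((start * frame_len, end * frame_len))
    return segments


# Two "words" separated by a pause of two frames (eight samples).
signal = [0.5] * 8 + [0.0] * 8 + [0.4] * 8 + [0.0] * 4
print(split_on_pauses(signal))  # [(0, 8), (16, 24)]
```

A pause shorter than `min_pause_frames` does not split a segment, which is why speakers had to separate their words so deliberately.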
Another related technology in the world of AI voice recognition is Natural Language Processing (NLP). The task of a speech recognition system is to recognize words. The task of an NLP system is to understand the speaker and respond, which imitates communication between a human and a machine. NLP is close to voice and speech recognition but is based on different algorithms.
Brief History of Speech Recognition
The first significant steps of this technology were taken at Bell Laboratories. In 1952, Bell Labs introduced Audrey, the first documented speech recognizer. Audrey was a fully analog system that understood single digits spoken with pauses in between. Ten years later, IBM introduced Shoebox, capable of recognizing 16 English words, including the digits 0 to 9. In the early 1970s, there was a leap in the development of this technology. This was mostly due to DARPA, the R&D agency of the U.S. Department of Defense. Five years of research gave birth to Harpy by Carnegie Mellon, a machine capable of understanding 1,011 words. Harpy also differed significantly from its predecessors: it could understand sentences. In the early 1980s, the vocabulary of speech recognition systems grew to several thousand words. This was mainly achieved thanks to the Hidden Markov Model (HMM), a statistical model. Speech recognition switched from pattern-based signal processing to predicting words from unknown sounds using statistical models.
Machines also became more accurate in recognizing words. In the mid-1980s, the Speech Recognition Group at IBM introduced Tangora, an experimental transcription system capable of recognizing 20,000 words. Starting in the 1990s, speech recognition products such as DragonDictate became available to consumers thanks to personal computers. In the last two decades, many tech giants have invested in this technology. Later in this article, you will get acquainted with their products.
How Speech Recognition Works
Modern ASR systems are based on three models: acoustic, pronunciation, and language.
- The acoustic model maps the voice signal to phonemes (units of sound). The Hidden Markov Model (HMM) is a common acoustic modeling approach. Other approaches use deep neural networks, convolutional neural networks, etc.;
- The pronunciation model defines how phonemes can be combined to make words;
- The language model helps distinguish between words and phrases that sound the same by estimating which word sequences are likely.
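To illustrate how a language model can choose between words that sound the same, here is a toy Python sketch based on bigram counts (pairs of neighboring words). The four-sentence corpus, the add-one smoothing, and the names `bigram_probability` and `pick_homophone` are assumptions chosen for illustration; production systems train on far larger corpora and often use neural language models.

```python
from collections import Counter

# Tiny hypothetical training corpus.
CORPUS = [
    "i want to go to the store",
    "i want two apples",
    "it is too late to go",
    "two people want to go",
]

bigram_counts = Counter()   # how often word B follows word A
context_counts = Counter()  # how often word A is followed by anything
vocabulary = set()
for sentence in CORPUS:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    vocabulary.update(words)
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def bigram_probability(previous, word):
    """P(word | previous) with add-one smoothing."""
    return (bigram_counts[(previous, word)] + 1) / (
        context_counts[previous] + len(vocabulary)
    )


def pick_homophone(previous, candidates):
    """Choose the candidate that is most likely to follow `previous`."""
    return max(candidates, key=lambda word: bigram_probability(previous, word))


homophones = ["to", "two", "too"]
print(pick_homophone("want", homophones))  # to
print(pick_homophone("is", homophones))    # too
print(pick_homophone("<s>", homophones))   # two
```

The acoustic model alone cannot separate "to", "two", and "too"; only the surrounding words make one of them more plausible.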
After the speech is recorded, noise is removed and the useful signal is filtered out of the recording. The recording is divided into small fragments, and each fragment is passed through the acoustic model, a statistical model built in advance that describes how each sound in speech is pronounced. The fragments are matched to phonemes, and words are then assembled from the matched phonemes. The efficiency of finding words strongly depends on the size of the pre-prepared phoneme database.
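The steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how a production recognizer works: the two-dimensional phoneme templates, the tiny lexicon, and all function names are hypothetical, and the feature extraction step (real systems typically compute spectral features such as MFCCs for each fragment) is skipped.

```python
import math

# Hypothetical "acoustic model": one reference feature vector per phoneme.
PHONEME_TEMPLATES = {
    "k": (0.9, 0.1),
    "ae": (0.2, 0.8),
    "t": (0.7, 0.6),
}

# Hypothetical pronunciation dictionary: phoneme sequence -> word.
LEXICON = {
    ("k", "ae", "t"): "cat",
    ("t", "ae", "k"): "tack",
}


def split_into_frames(samples, frame_len, hop):
    """Divide a recording into small, possibly overlapping fragments."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


def nearest_phoneme(features):
    """Match one fragment's features to the closest phoneme template."""
    return min(
        PHONEME_TEMPLATES,
        key=lambda p: math.dist(features, PHONEME_TEMPLATES[p]),
    )


def collapse_repeats(labels):
    """Merge consecutive identical labels: k k ae ae t -> k ae t."""
    merged = []
    for label in labels:
        if not merged or merged[-1] != label:
            merged.append(label)
    return merged


def decode(feature_frames):
    """Assemble a word from the phonemes matched to each fragment."""
    phonemes = collapse_repeats([nearest_phoneme(f) for f in feature_frames])
    return LEXICON.get(tuple(phonemes), "<unknown>")


# Step 1: framing a recording of 10 samples into fragments of 4, hop of 2.
print(len(split_into_frames(list(range(10)), frame_len=4, hop=2)))  # 4

# Steps 2 and 3: made-up per-fragment features, matched and assembled.
features = [(0.88, 0.12), (0.91, 0.08), (0.25, 0.75), (0.2, 0.82), (0.68, 0.61)]
print(decode(features))  # cat
```

A phoneme sequence that is missing from the lexicon yields `<unknown>`, which mirrors the point above that results depend on how much has been prepared in advance.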
Recording Your Voice
On any device, recording is carried out using a microphone. If the device doesn't have one, you need to connect a microphone headset or a professional microphone. To record, you can use pre-installed applications such as Voice Recorder on Windows 10, Voice Memos on Apple products, etc. There is also a wide range of applications with advanced functionality. They let you choose the recording quality, bitrate, or format in which to save the recording. Some are based on AI and allow you to remove unwanted noise from the recording.
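To show what recording quality, bitrate, and format mean in practice, here is a small Python sketch that uses only the standard library. It does not record from a microphone (that requires a third-party library); instead it writes a synthetic tone as an uncompressed WAV file in memory and reads the settings back. The helper names `pcm_bitrate` and `write_tone` and the chosen values (16 kHz, 16-bit, mono) are assumptions for illustration.

```python
import io
import math
import struct
import wave


def pcm_bitrate(sample_rate, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate * bits_per_sample * channels


def write_tone(target, sample_rate=16000, seconds=1.0, freq=440.0):
    """Write a mono 16-bit sine tone to `target` (a path or file-like object)."""
    n_samples = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack(
            "<h",  # little-endian signed 16-bit sample
            int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sample_rate)),
        )
        for i in range(n_samples)
    )
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


buffer = io.BytesIO()
write_tone(buffer)
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    print(wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels())  # 16000 16 1
print(pcm_bitrate(16000, 16, 1))  # 256000
```

A higher sample rate or bit depth raises the bitrate and the file size, which is the trade-off recording applications expose through their quality settings.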
Enrollment
Enrollment is the first phase in any speaker recognition software: the speaker's voice is recorded, and a unique voiceprint is extracted from it. The next phase is verification. A newly recorded voice is compared either with a specific stored voiceprint or with a database of different voices to find the best match.
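The two phases can be sketched in Python as follows. This is a minimal illustration under strong assumptions: a voiceprint is represented as a short list of numbers (real systems extract much larger feature vectors from audio), similarity is measured with cosine similarity, and the class name `VoiceprintStore`, the threshold of 0.9, and the sample values are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceprintStore:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.voiceprints = {}

    def enroll(self, speaker, samples):
        """Enrollment: average several feature vectors into one voiceprint."""
        dims = len(samples[0])
        self.voiceprints[speaker] = [
            sum(sample[d] for sample in samples) / len(samples)
            for d in range(dims)
        ]

    def verify(self, speaker, features):
        """Verification against one specific enrolled voice."""
        if speaker not in self.voiceprints:
            return False
        score = cosine_similarity(self.voiceprints[speaker], features)
        return score >= self.threshold

    def identify(self, features):
        """Search the whole database for the best match above the threshold."""
        best_name, best_score = None, self.threshold
        for name, voiceprint in self.voiceprints.items():
            score = cosine_similarity(voiceprint, features)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name


store = VoiceprintStore()
store.enroll("alice", [(1.0, 0.1, 0.0), (0.9, 0.2, 0.1)])
store.enroll("bob", [(0.1, 1.0, 0.3), (0.0, 0.9, 0.4)])

probe = (0.92, 0.18, 0.04)           # features from a new recording
print(store.verify("alice", probe))  # True
print(store.verify("bob", probe))    # False
print(store.identify(probe))         # alice
```

The threshold controls the trade-off between rejecting the real speaker and accepting an impostor, so in practice it has to be tuned on real data.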
Speech Recognition Tools
If you don't want to build your own speech recognition system from scratch, there are various open-source tools. Among them are:
- CMU Sphinx: A speaker-independent, continuous-speech recognition system developed at Carnegie Mellon University. CMU Sphinx includes a group of products designed for different purposes. It's available to download from GitHub, where you can also find the user documentation. It supports many popular programming languages, such as C/C++, C#, Java, and Python;
- HTK Toolkit: A toolkit for working with Hidden Markov Models. Developed by the Machine Intelligence Laboratory at Cambridge University, it is primarily used for speech recognition research. It's not fully open source. Users can find information on using the product on the official HTK website. Supported programming languages are C and Python;
- Kaldi: An open-source toolkit for speech recognition and signal processing. The toolkit itself is available to download from its GitHub repo, and the documentation is available on the official website. Supported programming languages are C++ and Python.
How Speech Recognition Is Used
Thanks to personal computers, smartphones, and the rapid development of AI, voice and speech recognition software has entered our everyday life. It lets us command our devices just by talking. The first product worth mentioning is the virtual assistant. Google and Apple ship their operating systems with built-in virtual assistants. Microsoft has added its virtual assistant Cortana to Windows. Smart speakers are integrated with virtual assistants. Examples of such devices are the Amazon Echo with Alexa and the Apple HomePod with Siri. Speech recognition is implemented in call center IVR systems and medical devices. It is used in security systems with voice biometrics. This technology can be helpful wherever a human needs to interact with a machine.
Why Is Speech Recognition Good?
Speech recognition technology increases the productivity of the user. It captures human speech much faster than we can type. Besides, you can talk to your device while your hands are busy with other work, performing two actions simultaneously. It is essential for people with disabilities who can't use their hands. Voice recognition also adds an extra layer of security, because a unique voiceprint is not easy to fake.
Speech Recognition Advantages and Disadvantages
Speech recognition is a relatively new science. It has gone from simple programs able to identify dozens of words in a single language to complex systems based on AI. Over several decades it has developed greatly and has begun to solve a wider range of tasks. Despite this, there is still a lot to do to improve it. Let's sum up its advantages and disadvantages.
Speech Recognition Advantages
- Increases the productivity of businesses;
- Automates the interaction between the businesses and customers;
- Adds an extra security level;
- Captures speech faster than a human can type;
- Helps people with disabilities;
- Helps control your home devices;
- Assists drivers with in-car ASR systems and more.
Speech Recognition Disadvantages
- Systems can't fully recognize speech if the speaker talks quickly or unclearly;
- Large vocabularies are required to improve recognition accuracy;
- Each language requires separate training for ASR;
- Businesses can collect and use the user's voice data without their permission;
- Time and financial costs are high;
- ASR software is resource-intensive and requires a large amount of RAM.
Speech Recognition Technology Applications
We have talked about the widespread use of voice recognition systems. Let's see what applications they have in specific areas.
Healthcare
In medicine, speech recognition is mainly used to write patient documentation. Two different methods of the documentation process exist.
- Front-end documentation is when speech is transcribed into text in real time. In this case, the system is more likely to make mistakes, and doctors have to fix the text themselves, so it is better suited to taking personal notes;
- Back-end documentation does the same but also attaches the recording of the speaker's voice to the text. The system provides a draft of the text so doctors can fix errors.
Military
In the military, speech recognition is primarily used for command and control of machines and devices. Voice commands are much faster than manual input. In combat, this speed can play a key role.
Usage in Education
Students can check their pronunciation while learning languages. Speech recognition can help them avoid grammar and punctuation errors. Writing large texts also becomes less challenging, because students can dictate a long text without getting tired.
People With Disabilities
Students with hand disabilities and blind students can write without limitations. ASR enables them to keep up with their studies.
In-Car Systems
Speech recognition in a car can reduce the risk of an accident on the road. Actions such as dialing a number or operating an MP3 player or radio can be performed without taking your hands off the steering wheel.
Voice-Controlled Video Games
Voice control can help you learn a game. A player normally needs time to memorize the control keys. With voice commands, they can start playing without memorizing them first.
Different Speech Recognition (Virtual Assistant) Software
Virtual assistant systems are quite complex and expensive. Solutions from tech giants mainly dominate the market. Let's get to know them.
Apple's Siri
This personal assistant is available only for Apple users. It first appeared on the iPhone 4S and has become an integral part of newer Apple products. Siri can post on Twitter or Facebook, solve complex math problems, save notes, make reservations, etc.
Amazon Alexa
Amazon ships its smart speakers with Alexa. It was first presented in 2014. Unlike Siri, it can be integrated into third-party devices. It's capable of voice interaction, managing online shopping, and music playback. It can also control several smart devices.
Microsoft's Cortana
Cortana is a virtual assistant by Microsoft, released in 2014. It is mainly used by Windows users but is also available for Android and iOS users. Cortana allows you to manage your calendar, join a meeting in Microsoft Teams, set reminders, and open apps on the computer.
Google Assistant
Google began its journey of creating virtual assistants with Google Now. It was a feature of Google Search that allowed users to search for information using speech. Several years later, Google stopped developing this project and announced Google Assistant in 2016. It was originally integrated into Google Home smart speakers and Google Pixel smartphones.
Nuance's Dragon Assistant and Dragon NaturallySpeaking
Dragon NaturallySpeaking is speech recognition software developed by Nuance Communications. Earlier in this article, we mentioned the DragonDictate application. Over the years, it has improved and is now known as Dragon NaturallySpeaking. The company also provides a personal assistant for PCs, the Dragon Assistant.
Does Speech Recognition Need Training?
To use a speech recognition system, you do not need long training sessions. There is a lot of information on the Internet on how to enable and use them. They can be found either on the official websites of manufacturers or other platforms. Here are some useful links.
- An article by Apple on how to use voice control on Mac, and videos on YouTube;
- An article about how to use voice control on Windows, and videos on YouTube;
- An online university for Nuance Communications products.
Future Uses of Speech Recognition Technology
The future of speech recognition is very promising. ASR systems are expected to recognize not only the words but also the emotions of a person. Speech recognition will be applied in fields such as the aerospace industry, home automation, robotics, telematics, and video games.
Is It Worth It to Use MFA (Multi-Factor Authentication)?
MFA significantly increases the level of data security. If the second authentication factor is a voice, an attacker needs both the first factor and a matching voiceprint, which adds a strong extra layer of security to your systems.
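As a rough illustration of how a voice can serve as the second factor, here is a minimal Python sketch in which access is granted only when both the password and the voice check pass. It is a simplified assumption, not a production design: the function names, the iteration count, and the threshold are hypothetical, and `voice_score` stands for a similarity value that a separate speaker verification step would supply.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    """Derive a password hash with PBKDF2 (first factor: something you know)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def authenticate(password, voice_score, stored_hash, salt, voice_threshold=0.9):
    """Grant access only if BOTH factors pass.

    `voice_score` is assumed to come from a speaker verification system
    (second factor: something you are), e.g. a similarity between 0 and 1.
    """
    password_ok = hmac.compare_digest(hash_password(password, salt), stored_hash)
    voice_ok = voice_score >= voice_threshold
    return password_ok and voice_ok


salt = os.urandom(16)
stored = hash_password("correct horse", salt)

print(authenticate("correct horse", 0.95, stored, salt))  # True
print(authenticate("correct horse", 0.50, stored, salt))  # False: voice mismatch
print(authenticate("wrong guess", 0.95, stored, salt))    # False: wrong password
```

Because both checks must succeed, a stolen password alone or a similar-sounding voice alone is not enough.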
FAQ
How do I use Google voice recognition?
To use Google's virtual assistant, you need to say the phrase "Ok Google" or "Hey Google". But before that, you need to activate this feature in the Google app settings.
What is voice recognition used for?
Voice recognition is widely used for security purposes to identify the speaker.
What are the advantages of a speech recognition system?
There are many advantages. In short, it increases the productivity of its users.
How is speech recognition used in healthcare?
It helps doctors write patient documentation.
How reliable is speech recognition?
These days, word recognition accuracy is high, but it will take time to achieve 100 percent accuracy.
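One common way to put a number on recognition accuracy is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into the correct text, divided by the number of words in the correct text. The Python sketch below is a minimal illustration; the function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic edit-distance dynamic programming, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (word missing in output)
                current[j - 1] + 1,      # insertion (extra word in output)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# One word dropped ("the") and one misheard ("lights" -> "light"): 2 errors / 5 words.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

A WER of 0.0 would mean a perfect transcript, so the 100 percent accuracy mentioned above corresponds to a WER of zero.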
What is the difference between voice recognition and speech recognition?
The purpose of voice recognition is security. It identifies the person by a unique voiceprint. The purpose of speech recognition is to identify the words spoken by a person to understand voice commands.