Before starting, consider using the Unity plug-in for the Cognitive Speech Services SDK. The plug-in has better speech accuracy results and gives easy access to speech-to-text decoding, as well as advanced speech features such as dialog, intent-based interaction, translation, text-to-speech synthesis, and natural language speech recognition. To get started, check out the sample and documentation.
Unity exposes three ways to add voice input to your Unity application, the first two of which are types of PhraseRecognizer:
The KeywordRecognizer listens for an array of string commands that your app supplies
The GrammarRecognizer listens for a specific grammar that your app defines in an SRGS file
The DictationRecognizer lets your app listen for any word and provide the user with a note or other display of their speech
Note
Dictation and phrase recognition can't be handled at the same time. If a GrammarRecognizer or KeywordRecognizer is active, a DictationRecognizer can't be active and vice versa.
Enabling the capability for Voice
The Microphone capability must be declared for an app to use voice input.
In the Unity Editor, navigate to Edit > Project Settings > Player
Select the Windows Store tab
In the Publishing Settings > Capabilities section, check the Microphone capability
Grant permissions to the app for microphone access on your HoloLens device
You'll be asked to do this on device startup, but if you accidentally clicked "no" you can change the permissions in the device settings
Phrase Recognition
To enable your app to listen for specific phrases spoken by the user and then take some action, you need to:
Specify which phrases to listen for using a KeywordRecognizer or GrammarRecognizer
Handle the OnPhraseRecognized event and take action corresponding to the phrase recognized
private void KeywordRecognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
{
    System.Action keywordAction;
    // If the recognized keyword is in our dictionary, call that Action.
    if (keywords.TryGetValue(args.text, out keywordAction))
    {
        keywordAction.Invoke();
    }
}
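For context, the handler above might be wired up as in the following minimal sketch. The class name `VoiceCommands` and the two example phrases are hypothetical and chosen only for illustration; the `keywords` dictionary is assumed to be the one the handler reads from.

```csharp
using System.Collections.Generic;
using System.Linq;
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical component name; attach it to any GameObject in the scene.
public class VoiceCommands : MonoBehaviour
{
    private KeywordRecognizer keywordRecognizer;

    // Maps each spoken phrase to the action it should trigger.
    private readonly Dictionary<string, System.Action> keywords =
        new Dictionary<string, System.Action>();

    private void Start()
    {
        // Example phrases (assumptions for illustration only).
        keywords.Add("activate", () => Debug.Log("Heard: activate"));
        keywords.Add("reset", () => Debug.Log("Heard: reset"));

        // The recognizer listens only for the phrases passed to its constructor.
        keywordRecognizer = new KeywordRecognizer(keywords.Keys.ToArray());
        keywordRecognizer.OnPhraseRecognized += KeywordRecognizer_OnPhraseRecognized;
        keywordRecognizer.Start();
    }

    private void KeywordRecognizer_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        System.Action keywordAction;
        // If the recognized keyword is in our dictionary, call that Action.
        if (keywords.TryGetValue(args.text, out keywordAction))
        {
            keywordAction.Invoke();
        }
    }

    private void OnDestroy()
    {
        // Unsubscribe and release the recognizer's resources.
        if (keywordRecognizer != null)
        {
            keywordRecognizer.OnPhraseRecognized -= KeywordRecognizer_OnPhraseRecognized;
            keywordRecognizer.Dispose();
        }
    }
}
```

Keeping the phrases and their actions in one dictionary means the recognizer's phrase list and the handler's lookup table cannot drift apart.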
The GrammarRecognizer is used if you're specifying your recognition grammar using SRGS. This can be useful if your app has more than just a few keywords, if you want to recognize more complex phrases, or if you want to easily turn sets of commands on and off. See Create Grammars Using SRGS XML for file format information.
You'll get a callback containing information specified in your SRGS grammar, which you can handle appropriately. Most of the important information will be provided in the semanticMeanings array.
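As a rough sketch of that callback, the example below loads a grammar and logs each entry in `semanticMeanings`. The class name `GrammarCommands` and the file location `SRGS/myGrammar.xml` under StreamingAssets are assumptions for illustration, not paths the article prescribes.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical component name.
public class GrammarCommands : MonoBehaviour
{
    private GrammarRecognizer grammarRecognizer;

    private void Start()
    {
        // Assumed location of the SRGS file; adjust to wherever your grammar lives.
        string grammarPath = Application.streamingAssetsPath + "/SRGS/myGrammar.xml";

        grammarRecognizer = new GrammarRecognizer(grammarPath);
        grammarRecognizer.OnPhraseRecognized += Grammar_OnPhraseRecognized;
        grammarRecognizer.Start();
    }

    private void Grammar_OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        SemanticMeaning[] meanings = args.semanticMeanings;
        if (meanings == null)
        {
            return;
        }

        // Each SemanticMeaning pairs a key from the grammar with its values.
        foreach (SemanticMeaning meaning in meanings)
        {
            Debug.Log(meaning.key + ": " + string.Join(", ", meaning.values));
        }
    }

    private void OnDestroy()
    {
        if (grammarRecognizer != null)
        {
            grammarRecognizer.OnPhraseRecognized -= Grammar_OnPhraseRecognized;
            grammarRecognizer.Dispose();
        }
    }
}
```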
Use the DictationRecognizer to convert the user's speech to text. The DictationRecognizer exposes dictation functionality and supports registering and listening for hypothesis and phrase completed events, so you can give feedback to your user both while they speak and afterwards. Start() and Stop() methods respectively enable and disable dictation recognition. Once done with the recognizer, it should be disposed using Dispose() to release the resources it uses. It will release these resources automatically during garbage collection at an extra performance cost if they aren't released before that.
There are only a few steps needed to get started with dictation:
Create a new DictationRecognizer
Handle Dictation events
Start the DictationRecognizer
Enabling the capability for dictation
The Internet Client and Microphone capabilities must be declared for an app to use dictation:
In the Unity Editor, go to Edit > Project Settings > Player
Select the Windows Store tab
In the Publishing Settings > Capabilities section, check the InternetClient capability
Optionally, if you didn't already enable the microphone, check the Microphone capability
Grant permissions to the app for microphone access on your HoloLens device if you haven't already
You'll be asked to do this on device startup, but if you accidentally clicked "no" you can change the permissions in the device settings
DictationRecognizer
Create a DictationRecognizer like so:
dictationRecognizer = new DictationRecognizer();
There are four dictation events that can be subscribed to and handled to implement dictation behavior.
DictationResult
DictationComplete
DictationHypothesis
DictationError
DictationResult
This event is fired after the user pauses, typically at the end of a sentence. The full recognized string is returned here.
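The following minimal sketch shows one way to subscribe to all four dictation events and start the recognizer. The class name `DictationExample` and the handler names are hypothetical, and the handlers only log what they receive.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical component name.
public class DictationExample : MonoBehaviour
{
    private DictationRecognizer dictationRecognizer;

    private void Start()
    {
        dictationRecognizer = new DictationRecognizer();

        dictationRecognizer.DictationResult += OnDictationResult;
        dictationRecognizer.DictationHypothesis += OnDictationHypothesis;
        dictationRecognizer.DictationComplete += OnDictationComplete;
        dictationRecognizer.DictationError += OnDictationError;

        dictationRecognizer.Start();
    }

    // Fired after the user pauses; carries the full recognized string.
    private void OnDictationResult(string text, ConfidenceLevel confidence)
    {
        Debug.Log("Result (" + confidence + "): " + text);
    }

    // Fired while the user is still speaking, with the recognizer's current guess.
    private void OnDictationHypothesis(string text)
    {
        Debug.Log("Hypothesis: " + text);
    }

    // Fired when the dictation session ends, whether normally or not.
    private void OnDictationComplete(DictationCompletionCause cause)
    {
        Debug.Log("Dictation completed: " + cause);
    }

    // Fired when the recognizer reports an error.
    private void OnDictationError(string error, int hresult)
    {
        Debug.LogError("Dictation error: " + error + " (HRESULT " + hresult + ")");
    }

    private void OnDestroy()
    {
        if (dictationRecognizer != null)
        {
            dictationRecognizer.DictationResult -= OnDictationResult;
            dictationRecognizer.DictationHypothesis -= OnDictationHypothesis;
            dictationRecognizer.DictationComplete -= OnDictationComplete;
            dictationRecognizer.DictationError -= OnDictationError;
            dictationRecognizer.Dispose();
        }
    }
}
```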
Start() and Stop() methods respectively enable and disable dictation recognition.
Once you're done with the recognizer, dispose of it with Dispose() to release the resources it uses. If you don't, the resources are still released automatically during garbage collection, but at an extra performance cost.
Timeouts occur after a set period of time. You can check for these timeouts in the DictationComplete event. There are two timeouts to be aware of:
If the recognizer starts and doesn't hear any audio for the first five seconds, it will time out.
If the recognizer has given a result, but then hears silence for 20 seconds, it will time out.
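One possible way to react to these timeouts is sketched below. It assumes that a timeout is reported as `DictationCompletionCause.TimeoutExceeded` and that the five-second and 20-second limits correspond to the recognizer's `InitialSilenceTimeoutSeconds` and `AutoSilenceTimeoutSeconds` properties. The class name `DictationTimeoutHandler` and the choice to restart on timeout are illustrative, not a recommendation from the article.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical component name.
public class DictationTimeoutHandler : MonoBehaviour
{
    private DictationRecognizer dictationRecognizer;

    private void Start()
    {
        dictationRecognizer = new DictationRecognizer();

        // Assumed mapping: these properties control the two timeouts described above.
        Debug.Log("Initial silence timeout: " + dictationRecognizer.InitialSilenceTimeoutSeconds + " s");
        Debug.Log("Auto silence timeout: " + dictationRecognizer.AutoSilenceTimeoutSeconds + " s");

        dictationRecognizer.DictationComplete += OnDictationComplete;
        dictationRecognizer.Start();
    }

    private void OnDictationComplete(DictationCompletionCause cause)
    {
        if (cause == DictationCompletionCause.TimeoutExceeded)
        {
            // Illustrative policy: listen again after a timeout.
            Debug.LogWarning("Dictation timed out; restarting.");
            dictationRecognizer.Start();
        }
        else if (cause != DictationCompletionCause.Complete)
        {
            Debug.LogWarning("Dictation ended unexpectedly: " + cause);
        }
    }

    private void OnDestroy()
    {
        if (dictationRecognizer != null)
        {
            dictationRecognizer.DictationComplete -= OnDictationComplete;
            dictationRecognizer.Dispose();
        }
    }
}
```

Whether to restart automatically depends on the app; a note-taking feature might instead stop listening and prompt the user.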
Using both Phrase Recognition and Dictation
If you want to use both phrase recognition and dictation in your app, you'll need to fully shut one down before you can start the other. If you have multiple KeywordRecognizers running, you can shut them all down at once with:
PhraseRecognitionSystem.Shutdown();
You can call Restart() to restore all recognizers to their previous state after the DictationRecognizer has stopped:
PhraseRecognitionSystem.Restart();
You could also just start a KeywordRecognizer, which will restart the PhraseRecognitionSystem as well.
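The switch between the two modes can be sketched as a pair of coroutines that wait for one system to stop before starting the other. The class name `SpeechModeSwitcher` and its method names are hypothetical. Polling `Status` each frame is an assumption about how to tell that shutdown has finished.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical component name.
public class SpeechModeSwitcher : MonoBehaviour
{
    private DictationRecognizer dictationRecognizer;

    // Call with StartCoroutine(SwitchToDictation()).
    public IEnumerator SwitchToDictation()
    {
        // Shut down all phrase recognizers at once.
        if (PhraseRecognitionSystem.Status == SpeechSystemStatus.Running)
        {
            PhraseRecognitionSystem.Shutdown();
        }

        // Wait until phrase recognition has actually stopped.
        while (PhraseRecognitionSystem.Status == SpeechSystemStatus.Running)
        {
            yield return null;
        }

        if (dictationRecognizer == null)
        {
            dictationRecognizer = new DictationRecognizer();
        }
        dictationRecognizer.Start();
    }

    // Call with StartCoroutine(SwitchToPhraseRecognition()).
    public IEnumerator SwitchToPhraseRecognition()
    {
        if (dictationRecognizer != null &&
            dictationRecognizer.Status == SpeechSystemStatus.Running)
        {
            dictationRecognizer.Stop();

            // Wait until dictation has actually stopped.
            while (dictationRecognizer.Status == SpeechSystemStatus.Running)
            {
                yield return null;
            }
        }

        // Restore all phrase recognizers to their previous state.
        PhraseRecognitionSystem.Restart();
    }

    private void OnDestroy()
    {
        if (dictationRecognizer != null)
        {
            dictationRecognizer.Dispose();
        }
    }
}
```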
Voice input in Mixed Reality Toolkit
You can find MRTK examples for voice input in the following demo scenes:
If you're following the Unity development checkpoint journey we've laid out, your next task is exploring the Mixed Reality platform capabilities and APIs: