Speech Synthesis & Speech Recognition Using SAPI 5.1
Brian Long (www.blong.com)
Click here to download the files associated with this article.
Introduction
This article looks at adding support for speech capabilities to Microsoft Windows
applications written in Delphi, using the Microsoft Speech API version 5.1 (SAPI
5.1). For an overview on the subject of speech technology please click
here.
There is also coverage on using SAPI 4 to build speech-enabled applications.
Information on using the SAPI 4 high level interfaces can be found by clicking
here, whilst discussion of the low level interfaces can be found by clicking
here.
SAPI 5.1 exposes most of the important interfaces, types and constants through
a registered type library (SAPI 5.0 did not do this, making it difficult to
use in Delphi without someone writing the equivalent of the JEDI import unit
for SAPI 5). This means that you can access SAPI 5.1 functionality through late
bound or early bound Automation. We will focus our attention on early bound
Automation, which requires you to import the type library.
Choose Project | Import Type Library...
and locate the type library described as Microsoft Speech Object Library
(Version 5.1) in the list. Now ensure the Generate
Component Wrapper checkbox is checked so the type library import unit
will include component wrapper classes for each exposed Automation object. These
components will go on the ActiveX page of the Component Palette by default,
but you may wish to specify a more appropriate page, such as SAPI 5.1.
Now press Install... so the
type library will be imported and the generated components will be installed
onto the Component Palette (pressing Create
Unit would also generate the type library import unit, but would require
us to install it manually).
The generated import unit is called SpeechLib_TLB.pas and will be installed
in a package. You can either select the default package offered (the Borland
User Components package by default), choose to open a different package
or even create a new one. When the package is compiled and installed you will
get a whopping set of 19 new components on the SAPI 5.1 page of the Component
Palette.
Each component is named after the primary interface it implements. So for example,
the TSpVoice component implements
the SpVoice interface. You can
find abundant documentation on all these interfaces in the SAPI 5.1 SDK documentation.
Ready made SAPI 5.1 packages containing Automation components for Delphi 5,
6 and 7 can be found in appropriately named subdirectories under SAPI 5.1 in
the accompanying files.
If you are using Delphi 6 you will encounter a problem
that is still present even with Update Pack 2 installed. The type library importer
has a bug where the parameters to Automation events are incorrectly dispatched
(they are sent in reverse order) meaning that all the Automation events operate
incorrectly (if at all). You can avoid this by importing the type library in
Delphi 5 or 7 and using the generated type library import unit in Delphi 6.
A Delphi 6 compatible package is supplied with this article's
files (it uses a Delphi 5 generated type library import unit).
The Delphi 7 type library importer has been improved to
produce more accurate Pascal representations of items in the type library than
Delphi 5 did (and than Delphi 6 tried to). As a result of this, the event handlers
will often have different parameter lists in the Delphi 7 imported type library.
This means that the sample programs won't compile with Delphi 7 with the true
Delphi 7 SAPI type library import unit.
If you wish, you can write late bound Automation that calls CreateOleObject
to instantiate the Automation objects. In the case of the SpVoice
interface, you would execute:
uses
ComObj; //declares CreateOleObject
var
SpVoice: Variant;
...
SpVoice := CreateOleObject('SAPI.SpVoice');
Speech Synthesis
At its simplest level, all you need to do to get your program to speak is to
use a TSpVoice Automation object
and call the Speak method. A
trivial application that does this can be found in the TextToSpeechSimple.dpr
project in the files associated with this article.
The code looks like this:
procedure TfrmTextToSpeech.Button1Click(Sender: TObject);
begin
SpVoice1.Speak(memText.Text, SVSFDefault)
end;
And there you have it: a speaking application. The call to Speak takes a number
of parameters that we should examine:
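- The first is the text to speak, passed as a WideString.
Because the second parameter is SVSFDefault, this call will be synchronous and so will
not return until the text has been spoken.
- The second parameter represents some flags that indicate how to use the
first parameter (you can combine multiple flags with the or
operator). For example, SVSFlagsAsync makes the call return immediately while the
text is spoken in the background, and SVSFIsFilename indicates that the first
parameter is the name of a file to speak.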
When the program executes it lets you type in some text in a memo and a button
renders it into the spoken word.
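To make the flag combination concrete, here is a minimal sketch of building the second argument to Speak with the or operator. BuildSpeakFlags is a hypothetical helper, not part of SAPI or the sample projects, and the constants are redeclared (with the values the SpeechVoiceSpeakFlags enumeration is assumed to have in SpeechLib_TLB.pas) only so that the program compiles on its own.

```pascal
program SpeakFlagsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  // Assumed to mirror SpeechVoiceSpeakFlags in SpeechLib_TLB.pas;
  // redeclared here so the sketch does not need the import unit
  SVSFDefault          = 0;
  SVSFlagsAsync        = 1;
  SVSFPurgeBeforeSpeak = 2;
  SVSFIsFilename       = 4;

// Hypothetical helper: builds the Flags argument for Speak
function BuildSpeakFlags(Async, PurgePending, TextIsFileName: Boolean): Integer;
begin
  Result := SVSFDefault;
  if Async then
    Result := Result or SVSFlagsAsync;        // return before speech finishes
  if PurgePending then
    Result := Result or SVSFPurgeBeforeSpeak; // discard anything still queued
  if TextIsFileName then
    Result := Result or SVSFIsFilename;       // first parameter names a file
end;

begin
  Writeln(BuildSpeakFlags(True, True, False)); // prints 3
end.
```

In the simple sample form the result could be passed straight through, for example SpVoice1.Speak(memText.Text, BuildSpeakFlags(True, True, False)).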

That's the simple example out of the way, but what can we achieve if we dig
a little deeper and get our hands a little dirtier? The next project, which
holds the answers to these questions, can be found as TextToSpeech.dpr in this
article's files. You can see it running in the screenshot below. Notice
that as the text is spoken, the current sentence is italicised, the current
word is selected and the phonemes spoken are written to a memo.

The following sections describe the important parts of the code from this project.
Enumerating Voices
The first thing the program does is to add a list of all the available voices
to the combobox and set the rate and volume track bar positions. The latter
part of this is trivial as the voice rate and volume are always within predetermined
ranges (the volume is in the range 0 to 100 and the rate is in the range -10
to 10).
procedure TfrmTextToSpeech.FormCreate(Sender: TObject);
var
I: Integer;
SOToken: ISpeechObjectToken;
SOTokens: ISpeechObjectTokens;
begin
SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
//Ensure all events fire
SpVoice.EventInterests := SVEAllEvents;
Log('About to enumerate voices');
SOTokens := SpVoice.GetVoices('', '');
for I := 0 to SOTokens.Count - 1 do
begin
//For each voice, store the descriptor in the TStrings list
SOToken := SOTokens.Item(I);
cbVoices.Items.AddObject(SOToken.GetDescription(0), TObject(SOToken));
//Increment descriptor reference count to ensure it's not destroyed
SOToken._AddRef;
end;
if cbVoices.Items.Count > 0 then
begin
cbVoices.ItemIndex := 0; //Select 1st voice
cbVoices.OnChange(cbVoices); //& ensure OnChange triggers
end;
Log('Enumerated voices');
Log('About to check attributes');
tbRate.Position := SpVoice.Rate;
lblRate.Caption := IntToStr(tbRate.Position);
tbVolume.Position := SpVoice.Volume;
lblVolume.Caption := IntToStr(tbVolume.Position);
Log('Checked attributes');
end;
The SpVoice object's GetVoices
method returns a collection object that allows access to each voice as an ISpeechObjectToken.
In this code, both parameters are passed as empty strings, but the first can
be used to specify required parameters of the returned voices and the second
for optional parameters. So a call to GetVoices('Gender
= male', '') would return only male voices.
In order to keep track of the voices, these ISpeechObjectToken
interfaces are added, along with a description, to the combobox's Items
(the description in the Strings
array and the interfaces in the Objects
array).
Storing an interface reference in an object reference is possible as long as
we remember exactly what we stored, and we don't make the mistake of accessing
it as an object reference. Also, since the interface reference is stored using
an inappropriate type, it is important to manually increment its reference count
to stop it being destroyed when the RTL code decrements the reference count
at the end of the method.
The OnDestroy event handler
tidies up these descriptor objects by decrementing their reference counts, thereby
allowing them to be destroyed.
procedure TfrmTextToSpeech.FormDestroy(Sender: TObject);
var
I: Integer;
begin
//Release all the voice descriptors
for I := 0 to cbVoices.Items.Count - 1 do
ISpeechObjectToken(Pointer(cbVoices.Items.Objects[I]))._Release;
end;
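The reference counting trick can be tried without SAPI installed. The sketch below is an illustration only: IVoiceToken and TVoiceToken are hypothetical stand-ins for ISpeechObjectToken, and LiveTokens is a counter added purely to show when the object is destroyed. It casts through Pointer when storing the interface, which is worth keeping if the code is ever compiled with a much newer Delphi, where a direct cast from an interface to TObject may behave differently.

```pascal
program IntfInObjectsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  // Hypothetical stand-in for ISpeechObjectToken
  IVoiceToken = interface
    function Description: string;
  end;

  TVoiceToken = class(TInterfacedObject, IVoiceToken)
  private
    FDescription: string;
  public
    constructor Create(const ADescription: string);
    destructor Destroy; override;
    function Description: string;
  end;

var
  LiveTokens: Integer = 0;      // number of token objects currently alive
  AliveWhileStored: Integer;    // filled in by RunDemo
  AliveAfterRelease: Integer;   // filled in by RunDemo

constructor TVoiceToken.Create(const ADescription: string);
begin
  inherited Create;
  FDescription := ADescription;
  Inc(LiveTokens);
end;

destructor TVoiceToken.Destroy;
begin
  Dec(LiveTokens);
  inherited Destroy;
end;

function TVoiceToken.Description: string;
begin
  Result := FDescription;
end;

procedure AddSampleVoice(List: TStrings; const Description: string);
var
  Token: IVoiceToken;
begin
  Token := TVoiceToken.Create(Description);
  // Store the raw interface pointer; TStrings knows nothing about ref counts
  List.AddObject(Token.Description, TObject(Pointer(Token)));
  Token._AddRef; // extra count held on behalf of the list
end; // Token goes out of scope here and the RTL decrements the count

procedure ReleaseTokens(List: TStrings);
var
  I: Integer;
begin
  // Give back the counts taken in AddSampleVoice
  for I := 0 to List.Count - 1 do
    IVoiceToken(Pointer(List.Objects[I]))._Release;
  List.Clear;
end;

procedure RunDemo;
var
  Voices: TStringList;
begin
  Voices := TStringList.Create;
  try
    AddSampleVoice(Voices, 'Sample voice');
    AliveWhileStored := LiveTokens;   // kept alive by the manual _AddRef
    ReleaseTokens(Voices);
    AliveAfterRelease := LiveTokens;  // the matching _Release frees it
  finally
    Voices.Free;
  end;
end;

begin
  RunDemo;
  Writeln('Alive while stored in the list: ', AliveWhileStored);
  Writeln('Alive after ReleaseTokens: ', AliveAfterRelease);
end.
```

The expectation is that one object is alive while the list holds it and none after ReleaseTokens; removing the _AddRef call would let the object be destroyed as soon as AddSampleVoice returns, leaving a dangling pointer in the list.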
When the user selects a different voice from the combobox, the OnChange
event handler selects the new voice and displays the voice attributes (including
the path in the Windows registry where the voice attributes are stored).
procedure TfrmTextToSpeech.cbVoicesChange(Sender: TObject);
var
SOToken: ISpeechObjectToken;
begin
with lstEngineInfo.Items do
begin
Clear;
SOToken := ISpeechObjectToken(Pointer(
cbVoices.Items.Objects[cbVoices.ItemIndex]));
SpVoice.Voice := SOToken;
Add(Format('Name: %s', [SOToken.GetAttribute('Name')]));
Add(Format('Vendor: %s', [SOToken.GetAttribute('Vendor')]));
Add(Format('Age: %s', [SOToken.GetAttribute('Age')]));
Add(Format('Gender: %s', [SOToken.GetAttribute('Gender')]));
Add(Format('Language: %s', [SOToken.GetAttribute('Language')]));
Add(Format('Reg key: %s', [SOToken.Id]));
end
end;
Making Your Computer Talk
There are different calls to start speech and to continue paused speech, so
a helper flag is employed to record whether pause has been pressed. This allows
the play button to start a fresh speech stream as well as continue a paused
speech stream. The text to speak is taken from a richedit control and is spoken
asynchronously thanks to the SVSFlagsAsync
flag being used.
procedure TfrmTextToSpeech.btnPlayClick(Sender: TObject);
begin
if not BeenPaused then
SpVoice.Speak(reText.Text, SVSFlagsAsync)
else
begin
SpVoice.Resume;
BeenPaused := False
end
end;
procedure TfrmTextToSpeech.btnPauseClick(Sender: TObject);
begin
SpVoice.Pause;
BeenPaused := True
end;
procedure TfrmTextToSpeech.btnStopClick(Sender: TObject);
begin
SpVoice.Skip('Sentence', MaxInt)
end;
There is another speech demo in the same directory
in the project TextToSpeechReadWordDoc.dpr. As the name suggests, this sample
reads out loud from a Word document. It uses Automation to control Microsoft
Word (as well as the SAPI voice object).
type
TfrmTextToSpeechReadWordDoc = class(TForm)
...
private
MSWord: Variant;
end;
...
procedure TfrmTextToSpeechReadWordDoc.FormCreate(Sender: TObject);
begin
MSWord := CreateOleObject('Word.Application');
end;
procedure TfrmTextToSpeechReadWordDoc.btnReadDocClick(Sender: TObject);
const
// Constants for enum WdUnits
wdCharacter = $00000001;
wdParagraph = $00000004;
// Constants for enum WdMovementType
wdExtend = $00000001;
var
Moved: Integer;
Txt: String;
begin
(Sender as TButton).Enabled := False;
Stopped := False;
if dlgOpenDoc.Execute then
begin
MSWord.Documents.Open(FileName := dlgOpenDoc.FileName);
Moved := 2;
while (Moved > 1) and not Stopped do
begin
//Select next paragraph
Moved := MSWord.Selection.EndOf(Unit:=wdParagraph, Extend:=wdExtend);
if Moved > 1 then
begin
Txt := Trim(MSWord.Selection.Text);
if Length(Txt) > 0 then
SpVoice.Speak(Txt, SVSFlagsAsync);
Application.ProcessMessages;
//Move to start of next paragraph
MSWord.Selection.MoveRight(Unit := wdCharacter);
end
end;
//Only close the document if one was actually opened
MSWord.ActiveDocument.Close;
end;
TButton(Sender).Enabled := True;
end;
procedure TfrmTextToSpeechReadWordDoc.btnStopClick(Sender: TObject);
begin
SpVoice.Skip('Sentence', Maxint);
Stopped := True;
end;
procedure TfrmTextToSpeechReadWordDoc.FormDestroy(Sender: TObject);
begin
btnStop.Click;
MSWord.Quit;
MSWord := Unassigned;
end;
Voice Events
The SpVoice object has a variety
of events that fire during speech. Each block of speech starts with an OnStartStream
event and ends with OnEndStream.
OnStartStream identifies the
speech stream, and all the other events pass the stream number to which they
pertain. As each sentence is started an OnSentence
event fires and there is also an OnWord
event that triggers at the start of each spoken word.
Additionally (among others) an OnAudioLevel
event allows a progress bar to be used as a VU meter for the spoken text. However
it is important to note that for some events to fire you must set the EventInterests
property accordingly; to receive the OnAudioLevel
event you should set EventInterests
to SVEAudioLevel or SVEAllEvents.
const
Phonemes: array[1..49] of String = (
'-', '!', '&', ',', '.', '?', '_',
'1', '2', 'aa', 'ae', 'ah', 'ao', 'aw',
'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh',
'er', 'ey', 'f', 'g', 'h', 'ih', 'iy',
'jh', 'k', 'l', 'm', 'n', 'ng', 'ow',
'oy', 'p', 'r', 's', 'sh', 't', 'th',
'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'
);
procedure TfrmTextToSpeech.SpVoicePhoneme(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; Duration: Integer;
NextPhoneId: Smallint; Feature: TOleEnum; CurrentPhoneId: Smallint);
begin
if CurrentPhoneId <> 7 then //Display phonemes, except silence
memEnginePhonemes.Text :=
memEnginePhonemes.Text + Phonemes[CurrentPhoneId] +'-'
end;
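The handler above indexes the Phonemes array directly, so a phone ID outside 1..49 would be out of range. As a defensive variation, the lookup can be wrapped in a small function. PhonemeText and SilencePhoneId are hypothetical names introduced for this sketch, and whether a given engine ever reports IDs outside the table is not something this article establishes.

```pascal
program PhonemeLookupDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  SilencePhoneId = 7; // the '_' entry in the table below
  Phonemes: array[1..49] of string = (
    '-', '!', '&', ',', '.', '?', '_',
    '1', '2', 'aa', 'ae', 'ah', 'ao', 'aw',
    'ax', 'ay', 'b', 'ch', 'd', 'dh', 'eh',
    'er', 'ey', 'f', 'g', 'h', 'ih', 'iy',
    'jh', 'k', 'l', 'm', 'n', 'ng', 'ow',
    'oy', 'p', 'r', 's', 'sh', 't', 'th',
    'uh', 'uw', 'v', 'w', 'y', 'z', 'zh'
  );

// Returns the phoneme text for an ID, or an empty string for silence
// and for any ID that falls outside the table
function PhonemeText(PhoneId: Integer): string;
begin
  if (PhoneId < Low(Phonemes)) or (PhoneId > High(Phonemes)) or
     (PhoneId = SilencePhoneId) then
    Result := ''
  else
    Result := Phonemes[PhoneId];
end;

begin
  Writeln('ID 10 -> "', PhonemeText(10), '"');
  Writeln('ID 7 (silence) -> "', PhonemeText(7), '"');
  Writeln('ID 200 (out of range) -> "', PhonemeText(200), '"');
end.
```

The OnPhoneme handler could then append PhonemeText(CurrentPhoneId) only when the result is not empty.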
Animating Speech
An OnViseme event is triggered
for each recognised viseme (a portion of speech requiring the mouth to move
into a visibly different position); there are 22 different visemes generated
by English speech and these are based on the Disney 13 visemes (cartoons have
less granularity and Disney animators discovered many years ago that only 13
cartoon mouth shapes are required to represent all English phonemes).
If you have some artistic flair and can draw a mouth in each position represented
by the visemes you could use this event to provide a simple animated representation
of speech.
The SAPI 5.1 SDK comes with a C++ example called TTSApp, which displays an
animated cartoon microphone whose mouth is drawn to represent each viseme. The
microphone is made up from a number of separate images that can all be loaded
into an image list. The additional demo program TextToSpeechAnimated.dpr
makes use of these images to show how the effect can be achieved.
const
Visemes: array[0..21] of Byte = (
0, // SP_VISEME_0 = 0, // Silence
11, // SP_VISEME_1, // AE, AX, AH
11, // SP_VISEME_2, // AA
11, // SP_VISEME_3, // AO
10, // SP_VISEME_4, // EY, EH, UH
11, // SP_VISEME_5, // ER
9, // SP_VISEME_6, // y, IY, IH, IX
2, // SP_VISEME_7, // w, UW
13, // SP_VISEME_8, // OW
9, // SP_VISEME_9, // AW
12, // SP_VISEME_10, // OY
11, // SP_VISEME_11, // AY
9, // SP_VISEME_12, // h
3, // SP_VISEME_13, // r
6, // SP_VISEME_14, // l
7, // SP_VISEME_15, // s, z
8, // SP_VISEME_16, // SH, CH, JH, ZH
5, // SP_VISEME_17, // TH, DH
4, // SP_VISEME_18, // f, v
7, // SP_VISEME_19, // d, t, n
9, // SP_VISEME_20, // k, g, NG
1 // SP_VISEME_21, // p, b, m
);
procedure TfrmTextToSpeech.SpVoiceViseme(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; Duration: Integer;
NextVisemeId, Feature, CurrentVisemeId: TOleEnum);
const
EyesNarrow = 14;
EyesClosed = 15;
begin
imgsMic.Draw(pbMic.Canvas, 0, 0, Visemes[CurrentVisemeId]);
if Visemes[CurrentVisemeId] mod 6 = 2 then
imgsMic.Draw(pbMic.Canvas, 0, 0, EyesNarrow)
else
if Visemes[CurrentVisemeId] mod 6 = 5 then
imgsMic.Draw(pbMic.Canvas, 0, 0, EyesClosed);
end;
procedure TfrmTextToSpeech.pbMicPaint(Sender: TObject);
begin
imgsMic.Draw(pbMic.Canvas, 0, 0, 0);
end;
The OnViseme event gets the
image list to draw on a paint box component and the image to draw is identified
from a simple lookup table. There are 22 different visemes, but only 13 images
(as in the Disney approach). Occasionally the code also draws narrowed or closed
eyes, but whenever the silence viseme is received (at the start and end of each
sentence) the default microphone (the first image in the image list) is drawn.

You can take this idea further if you need, by using images of a person's face
saying each of the 22 visemes (for real people it seems to work best if you
use 22 images, rather than 13). This way you can animate a real person's face
in sync with the spoken text quite trivially.

Keeping Track Of Spoken Text
We can use OnWord and OnSentence
to highlight the currently spoken word or sentence, as the events provide the
character offset and length of the pertinent characters in the text. So when
a sentence is started, the OnSentence
event tells you which character in the text is the start of the sentence, and
also how long the sentence is.
procedure TfrmTextToSpeech.SetTextHilite(FirstChar, Len: Integer);
begin
reText.SelStart := FirstChar; //highlight word
reText.SelLength := Len;
end;
procedure TfrmTextToSpeech.SetTextStyle(FirstChar, Len: Integer; Styles: TFontStyles);
begin
with reText do
begin
Lines.BeginUpdate;
try
SelStart := FirstChar; //highlight word
SelLength := Len;
SelAttributes.Style := Styles; //apply requested style
SelLength := 0; //unhighlight word
finally
Lines.EndUpdate
end
end
end;
procedure TfrmTextToSpeech.SpVoiceSentence(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; CharacterPosition,
Length: Integer);
begin
Log('OnSentence: stream %d, position: %s, char. pos. %d, length %d',
[StreamNumber, String(StreamPosition), CharacterPosition, Length]);
SetTextStyle(OldSentencePos, OldSentenceLen, []);
if Length > 0 then
begin
SetTextStyle(CharacterPosition, Length, [fsItalic]);
OldSentencePos := CharacterPosition;
OldSentenceLen := Length;
end;
if not StreamJustStarted then
memEnginePhonemes.Text := memEnginePhonemes.Text + #13#10;
StreamJustStarted := False;
end;
procedure TfrmTextToSpeech.SpVoiceWord(Sender: TObject;
StreamNumber: Integer; StreamPosition: OleVariant; CharacterPosition,
Length: Integer);
begin
Log('OnWord: stream %d, position: %s, char. pos. %d, length %d',
[StreamNumber, String(StreamPosition), CharacterPosition, Length]);
SetTextHilite(CharacterPosition, Length);
end;
Each sentence that gets spoken is italicised through the SetTextStyle
helper routine (which records the position details so the sentence can be set
back to non-italic when the next sentence starts). Similarly, each spoken word
is highlighted using the SetTextHilite
helper routine.
Note that the last OnSentence
event for some text has the character position set to the last character and
the length set to the negative equivalent, which is why the OnSentence
event handler only applies the italic style when Length is greater than zero.
This gives an opportunity to reset
all the text formatting back to the default styles. However it is only true
if the text ends with a full stop; if not you can use the OnEndStream
event for tidying up.
Speaking Dialogs
As an example of using speech synthesis you can make all your VCL dialogs talk
to you using this small piece of code.
uses
ComObj;
var
Voice: Variant;
procedure TForm1.FormCreate(Sender: TObject);
begin
Screen.OnActiveFormChange := ScreenFormChange;
end;
procedure TForm1.ReadVCLDialog(Form: TCustomForm);
var
I: Integer;
ButtonCaptions, LabelCaption, DialogText: string;
const
SVSFlagsAsync = 1;
begin
try
if VarType(Voice) <> varDispatch then
Voice := CreateOleObject('SAPI.SpVoice');
for I := 0 to Form.ComponentCount - 1 do
if Form.Components[I] is TLabel then
LabelCaption := TLabel(Form.Components[I]).Caption
else
if Form.Components[I] is TButton then
ButtonCaptions := Format('%s%s, ',
[ButtonCaptions, TButton(Form.Components[I]).Caption]);
ButtonCaptions := StringReplace(ButtonCaptions,'&','', [rfReplaceAll]);
DialogText := Format('%s.%s%s.%s%s',
[Form.Caption, sLineBreak, LabelCaption, sLineBreak, ButtonCaptions]);
Memo1.Text := DialogText;
Voice.Speak(DialogText, SVSFlagsAsync)
except
//pretend everything is okay
end
end;
procedure TForm1.ScreenFormChange(Sender: TObject);
begin
if Assigned(Screen.ActiveForm) and
(Screen.ActiveForm.ClassName = 'TMessageForm') then
ReadVCLDialog(Screen.ActiveForm)
end;
The form's OnCreate event handler
sets up an OnActiveFormChange
event handler for the screen object. This is triggered each time a new form
is displayed, which includes VCL dialogs. Any call to ShowMessage,
MessageDlg or related routines
causes a TMessageForm to be displayed
so the code checks for this. If the form type is found, a textual version of
what's on the dialog is built up and then spoken through the SAPI Automation
component.
A statement such as:
MessageDlg('Save changes?', mtConfirmation, mbYesNoCancel, 0)
causes the ReadVCLDialog routine
to build up and say this text:
Confirm.
Save changes?.
Yes, No, Cancel,
Notice the full stops at the end of each line to briefly pause the speech engine
at that point before moving on.
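If you want to check the text that will be spoken without involving a speech engine, the string building can be separated from ReadVCLDialog. The following is a minimal sketch under that assumption; BuildDialogText is a hypothetical helper, and unlike the original it does not leave a trailing comma after the last button caption.

```pascal
program DialogTextDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Builds the text to speak: caption, message and button captions,
// each line ending in a full stop so the engine pauses briefly
function BuildDialogText(const Caption, LabelCaption: string;
  const ButtonCaptions: array of string): string;
var
  I: Integer;
  Buttons: string;
begin
  Buttons := '';
  for I := Low(ButtonCaptions) to High(ButtonCaptions) do
  begin
    if I > Low(ButtonCaptions) then
      Buttons := Buttons + ', ';
    // Strip the accelerator marker so '&Yes' is spoken as 'Yes'
    Buttons := Buttons +
      StringReplace(ButtonCaptions[I], '&', '', [rfReplaceAll]);
  end;
  Result := Format('%s.%s%s.%s%s.',
    [Caption, sLineBreak, LabelCaption, sLineBreak, Buttons]);
end;

begin
  Writeln(BuildDialogText('Confirm', 'Save changes?',
    ['&Yes', '&No', 'Cancel']));
end.
```

ReadVCLDialog would then only need to collect the captions from the form and pass the resulting string to Voice.Speak.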
Speech Recognition
Continuous dictation is easy to set up as no specific grammar is required,
but Command and Control recognition will need a grammar to educate the recogniser
as to the permissible commands.
When you need SR you can either use a shared recogniser (TSpSharedRecognizer)
or an in-process recogniser (TSpInprocRecognizer).
The in-process recogniser is more efficient (it resides in your process address
space) but means that no other SR applications can receive input from the microphone
until it is closed down. On the other hand the shared recogniser can be used
by multiple applications, and each one can access the microphone. It is more
common to use the shared recogniser in typical SAPI applications.
The recogniser uses the notion of a recognition context to identify
when it will be active (not to be confused with the use of context in a context-free
grammar or CFG). A context is represented by the SpInprocRecoContext
or SpSharedRecoContext interfaces.
An application may use one context for each form that will use SR, or several
contexts for different application modes (Office XP has a dictation mode for
adding text to a document and a control mode for executing menu commands).
Recognition contexts enable you to start and stop recognition, set up the grammar
and receive important recognition notifications.
Grammars
Part of the process of speech recognition involves deciding what words have
actually been spoken. Recognisers use a grammar to decide what has been said,
where possible. SAPI 5.x represents grammars in XML.
In the case of dictation, a grammar can be used to indicate some words that
are likely to be spoken. It is not feasible to try and represent the entire
spoken English language as a grammar so the recogniser uses its own rules and
context analysis, with any help from a grammar you might supply.
With Command and Control, the words that are understood are limited to the
supported commands defined in the grammar. The grammar defines various rules
that dictate what will be said and this makes the recogniser's job much easier.
Rather than trying to understand anything spoken, it only needs to recognise
speech that follows the supplied rules. A simple grammar that recognises three
colours might look like this:
<GRAMMAR LANGID="809">
<!-- "Constant" definitions -->
<DEFINE>
<ID NAME="RID_start" VAL="1"/>
</DEFINE>
<!-- Rule definitions -->
<RULE NAME="start" ID="RID_start" TOPLEVEL="ACTIVE">
<L>
<P>red</P>
<P>blue</P>
<P>green</P>
</L>
</RULE>
</GRAMMAR>
The GRAMMAR root node defines the language as British English ($809; American
English is $409). Note that the start rule, which lists the colours, is a top level rule and has
been marked to be active by default, meaning it will be active whenever speech
recognition is enabled for this context.
Grammars support lists to make implementing many similar commands easy and
also support optional sections. For example this grammar will recognise any
of the following:
- colour red
- colour red please
- colour blue
- colour blue please
- colour green
- colour green please
<GRAMMAR LANGID="809">
<DEFINE>
<ID NAME="RID_start" VAL="1"/>
</DEFINE>
<RULE NAME="start" ID="RID_start" TOPLEVEL="ACTIVE">
<P>colour</P>
<RULEREF NAME="colour" />
<O>please</O>
</RULE>
<RULE NAME="colour">
<L>
<P>red</P>
<P>blue</P>
<P>green</P>
</L>
</RULE>
</GRAMMAR>
You can find more details about the supported grammar syntax in the SAPI documentation.
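To make the effect of the RULEREF and the optional O element concrete, the six phrases accepted by the second grammar can be modelled with a plain string check. This is only an illustration of which word sequences the grammar permits; it says nothing about how the SR engine actually matches speech, and IsAccepted is a hypothetical function, not a SAPI call.

```pascal
program GrammarPhrasesDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // The alternatives listed in the "colour" rule
  Colours: array[0..2] of string = ('red', 'blue', 'green');

// True if Phrase is: "colour", then one listed colour, then optionally "please"
function IsAccepted(const Phrase: string): Boolean;
var
  I: Integer;
  Base: string;
begin
  Result := False;
  for I := Low(Colours) to High(Colours) do
  begin
    Base := 'colour ' + Colours[I];                  // required word + rule reference
    if (CompareText(Phrase, Base) = 0) or
       (CompareText(Phrase, Base + ' please') = 0) then // optional trailing word
    begin
      Result := True;
      Exit;
    end;
  end;
end;

begin
  if IsAccepted('colour blue please') then
    Writeln('"colour blue please" is accepted');
  if not IsAccepted('blue') then
    Writeln('"blue" is not accepted: this grammar requires the word colour');
end.
```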
Continuous Dictation Recognition
Thankfully this is quite straightforward to use. We need to set up a recognition
context object for the shared recogniser so drop a TSpSharedRecoContext
component on the form.
The recogniser will implicitly be set up if we do not create
one specifically. This means you do not need to drop a TSpSharedRecognizer
or a TSpInprocRecognizer on the
form unless you need to use them directly.
The code below shows how you create a simple grammar that will satisfy the
SR engine for continuous dictation. The grammar is represented by an ISpeechRecoGrammar
interface and is used to start the dictation session. The code comes from the
ContinuousDictation.dpr sample project.
SRGrammar: ISpeechRecoGrammar;
...
procedure TfrmContinuousDictation.FormCreate(Sender: TObject);
begin
//OnAudioLevel event is not fired by default - this changes that
SpSharedRecoContext.EventInterests := SREAllEvents;
SRGrammar := SpSharedRecoContext.CreateGrammar(0);
SRGrammar.DictationSetState(SGDSActive)
end;
Grammar Notifications
As the SR engine does its work it calls notification methods when certain things
happen, such as a phrase having been finished and recognised. These notifications
are available as standard Delphi events in the Delphi Automation object component
wrappers. This greatly simplifies the job of responding to the notifications.
The main event is OnRecognition,
which is called when the SR engine has decided what has been spoken. Whilst
working it out, it will likely call the OnHypothesis
event several times. Finished phrases are added to a memo on the form and whilst
a phrase is being worked out the hypotheses are added to a list box so you can
see how the SR engine made its decision. Each time a new phrase is started,
the hypothesis list is cleared.
You can see the list of hypotheses building up in this screenshot of the program
running.

Both OnRecognition and OnHypothesis
are passed a Result parameter;
this is an ISpeechRecoResult
results object. In Delphi 7 this is declared using the correct ISpeechRecoResult
interface type, but in earlier versions this was just declared as an OleVariant
(which contained the ISpeechRecoResult
interface).
This code can be used in Delphi 6 and earlier to access the text that was recognised:
procedure TfrmContinuousDictation.SpSharedRecoContextHypothesis(
Sender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
var Result: OleVariant);
begin
lstHypotheses.Items.Add(Result.PhraseInfo.GetText);
lstHypotheses.ItemIndex := lstHypotheses.Items.Count - 1
end;
procedure TfrmContinuousDictation.SpSharedRecoContextRecognition(
Sender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
RecognitionType: TOleEnum; var Result: OleVariant);
begin
memText.SelText := Result.PhraseInfo.GetText + #32
end;
This code uses late bound Automation on the results object
(so no Code Completion or Code Parameters), but you could use early bound Automation
with:
procedure TfrmContinuousDictation.SpSharedRecoContextHypothesis(
Sender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
var Result: OleVariant);
var
SRResult: ISpeechRecoResult;
begin
SRResult := IDispatch(Result) as ISpeechRecoResult;
lstHypotheses.Items.Add(SRResult.PhraseInfo.GetText(0, -1, True));
lstHypotheses.ItemIndex := lstHypotheses.Items.Count - 1
end;
procedure TfrmContinuousDictation.SpSharedRecoContextRecognition(
Sender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
RecognitionType: TOleEnum; var Result: OleVariant);
var
SRResult: ISpeechRecoResult;
begin
SRResult := IDispatch(Result) as ISpeechRecoResult;
memText.SelText := SRResult.PhraseInfo.GetText(0, -1, True) + #32
end;
The code here does not check if a valid IDispatch
reference is in the Variant but probably should.
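One way to add that check is sketched below. HoldsDispatch is a hypothetical helper, not part of the sample projects; it assumes Delphi 6 or later for the Variants unit and only tests that the Variant carries a non-nil IDispatch pointer, leaving the cast to ISpeechRecoResult to the caller.

```pascal
program VariantDispatchCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Variants;

// True only when V is of type varDispatch and the pointer it holds is not nil
function HoldsDispatch(const V: Variant): Boolean;
begin
  Result := (VarType(V) = varDispatch) and (TVarData(V).VDispatch <> nil);
end;

var
  V: Variant;
begin
  V := Unassigned;
  if not HoldsDispatch(V) then
    Writeln('Unassigned: no IDispatch reference');
  V := 42;
  if not HoldsDispatch(V) then
    Writeln('Integer: no IDispatch reference');
end.
```

In the event handlers the guard would precede the cast, for example: if HoldsDispatch(Result) then SRResult := IDispatch(Result) as ISpeechRecoResult.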
In Delphi 7 the code should look like this:
procedure TfrmContinuousDictation.SpSharedRecoContextHypothesis(
ASender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
const Result: ISpeechRecoResult);
begin
lstHypotheses.Items.Add(Result.PhraseInfo.GetText(0, -1, True));
lstHypotheses.ItemIndex := lstHypotheses.Items.Count - 1
end;
procedure TfrmContinuousDictation.SpSharedRecoContextRecognition(
ASender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
RecognitionType: TOleEnum; const Result: ISpeechRecoResult);
begin
memText.SelText := Result.PhraseInfo.GetText(0, -1, True) + #32
end;
Engine Dialogs
The buttons on the form allow various engine dialogs to be invoked (if supported).
This support is all attained through a couple of methods of the recogniser object.
const
SPDUI_EngineProperties = 'EngineProperties';
SPDUI_AddRemoveWord = 'AddRemoveWord';
SPDUI_UserTraining = 'UserTraining';
SPDUI_MicTraining = 'MicTraining';
SPDUI_RecoProfileProperties = 'RecoProfileProperties';
SPDUI_AudioProperties = 'AudioProperties';
SPDUI_AudioVolume = 'AudioVolume';
procedure TfrmContinuousDictation.btnEnginePropsClick(Sender: TObject);
begin
InvokeUI(SPDUI_EngineProperties, 'Engine Properties')
end;
procedure TfrmContinuousDictation.btnUserSettingsClick(Sender: TObject);
begin
InvokeUI(SPDUI_RecoProfileProperties, 'User Settings')
end;
procedure TfrmContinuousDictation.btnLexiconClick(Sender: TObject);
begin
InvokeUI(SPDUI_AddRemoveWord, 'Add/Remove Word')
end;
procedure TfrmContinuousDictation.btnTrainGeneralClick(Sender: TObject);
begin
InvokeUI(SPDUI_UserTraining, 'Speaker Training')
end;
procedure TfrmContinuousDictation.btnTrainMicClick(Sender: TObject);
begin
InvokeUI(SPDUI_MicTraining, 'Microphone Setup')
end;
procedure TfrmContinuousDictation.btnAudioPropsClick(Sender: TObject);
begin
InvokeUI(SPDUI_AudioProperties, 'Audio Properties')
end;
procedure TfrmContinuousDictation.btnAudioVolClick(Sender: TObject);
begin
InvokeUI(SPDUI_AudioVolume, 'Audio Volume')
end;
procedure TfrmContinuousDictation.InvokeUI(const TypeOfUI, Caption: WideString);
var
U: OleVariant;
begin
U := Unassigned;
if SpSharedRecoContext.Recognizer.IsUISupported(TypeOfUI, U) then
SpSharedRecoContext.Recognizer.DisplayUI(Handle, Caption, TypeOfUI, U)
end;
Command and Control Recognition
For Command and Control (C&C) recognition we will need a grammar to give the SR engine rules
by which to recognise the commands. The following grammar is used by a sample project
called CommandAndControl.dpr in the files that accompany
this article.
<GRAMMAR LANGID="809">
<!-- "Constant" definitions -->
<DEFINE>
<ID NAME="RID_start" VAL="1"/>
<ID NAME="PID_chosencolour" VAL="2"/>
<ID NAME="PID_colourvalue" VAL="3"/>
</DEFINE>
<!-- Rule definitions -->
<RULE NAME="start" ID="RID_start" TOPLEVEL="ACTIVE">
<O>colour</O>
<RULEREF NAME="colour" PROPNAME="chosencolour" PROPID="PID_chosencolour" />
<O>please</O>
</RULE>
<RULE NAME="colour">
<L PROPNAME="colourvalue" PROPID="PID_colourvalue">
<P VAL="1">red</P>
<P VAL="2">blue</P>
<P VAL="3">green</P>
</L>
</RULE>
</GRAMMAR>
After defining some constants the rules are laid out next. The top level rule
(start, which is just an arbitrarily chosen name) is defined as the optional
word colour, a value from another rule (also called colour) and
the optional word please. The value from the colour rule can be identified
programmatically (rather than by scanning the recognised text) thanks to it
being defined as a property (chosencolour).
The colour rule defines one of three colours that can be spoken, each of which
has a value defined for it. Again, this value will be accessible thanks to the
list being defined as a property (colourvalue).
This grammar is stored in an XML file and loaded in the OnCreate
event handler.
procedure TfrmCommandAndControl.FormCreate(Sender: TObject);
begin
//OnAudioLevel event is not fired by default - this changes that
SpSharedRecoContext.EventInterests := SREAllEvents;
SRGrammar := SpSharedRecoContext.CreateGrammar(0);
SRGrammar.CmdLoadFromFile('C and C Grammar.xml', SLODynamic);
SRGrammar.CmdSetRuleIdState(0, SGDSActive)
end;
Notice that two different ISpeechRecoGrammar
methods are used to instigate command and control recognition. CmdLoadFromFile
loads a grammar from an XML file and CmdSetRuleIdState
activates all top level rules when the first parameter is zero (you can activate
individual rules by passing their rule ID).
The OnRecognition event handler
does the work of locating the chosencolour property and then finding
the nested colourvalue property. Its value is used to change the form
colour at the user's request, for example with phrases such as:
- red please
- colour green
- colour blue please
- red
procedure TfrmCommandAndControl.SpSharedRecoContextRecognition(
  ASender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; const Result: ISpeechRecoResult);
begin
  with Result.PhraseInfo do
  begin
    Log('OnRecognition: %s', [GetText(0, -1, True)]);
    case GetPropValue(Result, ['chosencolour', 'colourvalue']) of
      1: Color := clRed;
      2: Color := clBlue;
      3: Color := clGreen;
    end
  end
end;
This code uses a helper routine, GetPropValue, whose task is to locate the
appropriate property in the result object by following the property path
specified in the string array parameter. The code for GetPropValue and its
own helper routine, GetProp, looks like this:
//Searches one level of properties for a name (case-insensitive);
//returns nil if no property of that name is found
function GetProp(Props: ISpeechPhraseProperties;
  const Name: String): ISpeechPhraseProperty; overload;
var
  I: Integer;
  Prop: ISpeechPhraseProperty;
begin
  Result := nil;
  for I := 0 to Props.Count - 1 do
  begin
    Prop := Props.Item(I);
    if CompareText(Prop.Name, Name) = 0 then
    begin
      Result := Prop;
      Break
    end
  end
end;

//Follows Path down through nested properties and returns the value of the
//last one, or Unassigned if any step of the path cannot be found
function GetPropValue(SRResult: ISpeechRecoResult;
  const Path: array of String): OleVariant;
var
  Prop: ISpeechPhraseProperty;
  PathLoop: Integer;
begin
  for PathLoop := Low(Path) to High(Path) do
  begin
    if PathLoop = Low(Path) then //top level property
      Prop := GetProp(SRResult.PhraseInfo.Properties, Path[PathLoop])
    else //nested property
      Prop := GetProp(Prop.Children, Path[PathLoop]);
    if not Assigned(Prop) then
    begin
      Result := Unassigned;
      Exit;
    end
  end;
  Result := Prop.Value
end;
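Since GetPropValue returns Unassigned when the property path is not found, a caller may prefer to handle that case explicitly. The unit below is a minimal sketch of one way to do so; the unit name ColourMapping and the function ColourFromValue are hypothetical and not part of the article's sample code.

```pascal
unit ColourMapping;

interface

uses
  Variants, Graphics;

// Returns the colour for a colourvalue from the grammar, or Default when the
// value is missing (Unassigned or Null) or is not one of the known values
function ColourFromValue(const Value: OleVariant; Default: TColor): TColor;

implementation

function ColourFromValue(const Value: OleVariant; Default: TColor): TColor;
var
  ColourValue: Integer;
begin
  Result := Default;
  if VarIsEmpty(Value) or VarIsNull(Value) then
    Exit; // the property path was not found in the recognition result
  ColourValue := Value; // variant to integer conversion
  case ColourValue of
    1: Result := clRed;
    2: Result := clBlue;
    3: Result := clGreen;
  end;
end;

end.
```

In the OnRecognition handler this could be used as Color := ColourFromValue(GetPropValue(Result, ['chosencolour', 'colourvalue']), Color); so the form keeps its current colour whenever no colour value is available.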
This is what the application looks like when running.

Speech Recognition Troubleshooting
If speech recognition stops (or fails to start) unexpectedly, or behaves oddly
in other ways, check that the microphone is enabled in your recording settings.
- Double-click the Volume icon in your Task Bar's System Tray. If no Volume
icon is present, choose Start | Programs
| Accessories | Entertainment | Volume Control.
- If you see a Microphone column,
ensure it has its Mute checkbox
checked (this stops the microphone input being played back through the speakers).
- Choose Options | Properties,
click Recording, ensure the
Microphone option is checked
and press OK.
- Now ensure the Microphone
column has its Select checkbox
enabled, if it has one, or that its Mute
checkbox is unchecked, if it has one.
SAPI 5.1 Deployment
When distributing SAPI 5.1 applications you will need to get hold of the redistributable
components package, available as SpeechSDK51MSM.exe from http://www.microsoft.com/speech/download/SDK51.
This package (a colossal file, weighing in at 132 MB) contains Windows Installer merge modules
for all the SAPI 5.1 components (the main DLLs, the TTS and SR engines, the
Control Panel applet). The SDK documentation includes a white paper on how
to use these merge modules from within a Windows Installer compatible installation
building tool.

Summary
Adding various speech capabilities into a Delphi application does not take
an awful lot of work, particularly if you do the background work to understand
the SAPI concepts.
There is much to the Speech API that we have not looked at in this paper but hopefully
the areas covered will be enough to whet your appetite and get you exploring
further on your own.
References/Further Reading
The following is a list of useful articles and papers that I found on SAPI
5.1 development during my research on this subject.
- Speech Part 1 - How to Add "Text to Speech" (Speech Synthesis) to your Delphi Apps
by Alec Bergamini, Delphi 3000.
This discusses installing the SAPI 5.1 SDK and getting simple speech.
- Speech Part 2 - How to Add Simple Dictation speech recognition to your Delphi Apps
by Alec Bergamini, Delphi 3000.
This looks at simple dictation SR.
About Brian Long
Brian Long used to work at Borland
UK, performing a number of duties including Technical Support on all the programming
tools. Since leaving in 1995, Brian has been providing training and consultancy
on Borland's RAD products, and is now moving into the .NET world.
Besides authoring a
Borland Pascal problem-solving book published in 1994, Brian is a regular
columnist in The
Delphi Magazine and has had numerous articles published in Developer's Review,
Computing, Delphi
Developer's Journal and EXE Magazine. He was nominated for the Spirit
of Delphi 2000 award and was voted Best Speaker at Borland's BorCon
2002 conference in Anaheim, California by the conference delegates.
There are a growing number of conference papers and articles available on Brian's
Web site, so feel free to have a browse.
In his spare time (and while waiting for his C++ programs to compile) Brian has learnt
the art of juggling and making inflatable origami paper frogs.