A General Survey of Speech Recognition Programs

(Updated August 4, 2009)

 

Itamar Even-Zohar

For a comparison of the SR applications see Table #1.

For a list of available languages see Table #2.

For other tables see Comparative Tables.

 

Table of Contents

Languages served by Speech Recognition

Manufacturers of Speech Recognition

A Historical Note

Microsoft Speech Recognition: Office 2003, Vista and Windows 7

Discussion Forums / Support Groups for MS Speech

A Comparative Evaluation

Factors involved with SR Functionality

Concluding Remarks

Information

 

Languages served by Speech Recognition

Speech Recognition (“SR”) is available for a limited number of languages. Competing programs exist for the following languages:

 

For Windows

English (US, UK, Australian, Southeast Asian, Indian), Spanish, French, German, Dutch, Italian, Mandarin Chinese (Simplified Chinese and Traditional Chinese), Japanese, and Swedish.

 

SR applications for Brazilian Portuguese, Catalan, Austrian German, and Arabic are no longer available.

 

(For a complete list of, and details about, the available languages for Windows see Table #2.)

 

For Macintosh

English (US and UK), German, Italian, and Spanish.

 

Manufacturers of Speech Recognition

For Windows

Only two companies now provide major applications for speech recognition for Windows: Nuance and Microsoft.  Nuance provides Dragon NaturallySpeaking (DNS for short; latest version: 10), which works under Windows 98, 2000, XP, Vista, and Windows 7. Microsoft provides Office SR in Office 2003 under Windows XP, and “Windows Speech Recognition” (WSR) as a built-in feature of Windows Vista and Windows 7 (all versions).

Various companies offer secondary SR applications, i.e. ones that are built on an engine developed by a major company and add various features to it. Among those, Linguatec seems to be the most professional and promising. Its Voice Pro, formerly based on IBM’s ViaVoice engine, has been based on the Windows Speech Recognition engine since version 12. It currently offers an SR application only for German, but the company is considering other languages, too.

 

For Macintosh

The Mac is served by two programs: ViaVoice, which is sold by Nuance, and iListen. iListen offers downloadable language packs for US and UK English, German, Italian, and Spanish. For more about iListen, go to http://www.macspeech.com (I am grateful to Jeff Neal for this information).

 

A Historical Note

Four major companies have produced speech recognition programs since the early 1990s: IBM, Dragon, L&H, and Philips. Dragon’s product, “Dragon NaturallySpeaking”, is the only one still developed and marketed, currently by Nuance. Nuance took over ScanSoft, which had acquired Lernout & Hauspie (L&H), which in turn had earlier purchased both Dragon and Kurzweil. Nuance has also taken over support for IBM’s ViaVoice, but it offers no upgrade for Windows Vista and recommends DNS instead. When ScanSoft bought L&H, it made Dragon NaturallySpeaking its main SR product, transferring features from L&H’s VoiceXpress into it (since Version 6). With the discontinuation of Philips FreeSpeech 2000, several of its 14 supported languages, notably Austrian German and Catalan, no longer have any SR support. Swedish, however, is now served by a Swedish company, Voxit (http://www.voxit.se), which maintains and sells VoiceXpress (I am grateful to Oscar Abrahamsson for this information). Its latest version, 5.3, was issued on December 2, 2005, but the company claims it works under Vista (with certain configuration tweaks).

 

Microsoft Speech Recognition: Office 2003, Vista and Windows 7

Microsoft Speech entered the SR arena when it provided speech recognition for Office 2002 under Windows XP, and then much improved it in Office 2003 (October 2003).

 

Windows Speech Recognition (known as WSR; released January 2007) took a bolder step by making Speech Recognition a built-in feature of the Operating System. This is an unacknowledged revolution, in the sense that SR is now available on an unprecedented scale to any Windows user whose language is supported (see below). The User Interface, the repertoire of commands, and the command-and-control features (controlling the entire computer by voice) have all been much improved. A beta Macro Feature was released in spring 2008, followed by an initial release in January 2009; it makes it possible to create text macros, command substitutions, and sophisticated add-ons, making WSR considerably more flexible and customizable.

 

WSR supports various languages, all available to users of Windows Vista Ultimate through Language Packs downloaded directly from Microsoft’s server via the Windows Update feature. The following languages are currently supported: English US, English UK, French, Spanish, German, Chinese (Simplified and Traditional), and Japanese.

 

WSR is also a built-in feature of Windows 7. In its current state (January-July 2009), no new UI features have been added. It works the same way as under Vista, including the Macro Feature add-on, but according to Eric Brown of the Microsoft SR Team, various basic recognition functions have been greatly improved. See here for details.

 

For a list of desiderata for WSR see here.

 

Discussion Forums / Support Groups for MS Speech

In April 2003 I set up a Discussion Forum for Microsoft SR on Yahoo (http://groups.yahoo.com/groups/ms-speech) and a Speech Website with various materials (http://speech.even-zohar.com). The Speech Computing forum (http://www.speechcomputing.com/forum), which is dedicated to all applications, also offers a great deal of valuable information.

 

SR applications based on WSR

Dr. Reinhard Busch, of Linguatec Sprachtechnologien, has provided me with a brief description of Voice Pro 12. Voice Pro 12 is an SR application based on the WSR engine, but with extended acoustic models, new language models for various domains (general, medical, legal), and extended functionality. For example, like DNS 10, it allows full transcription of WAV and MP3 files. The transcribed text is output together with the original dictated audio, aligned with the text, so one can make corrections based on the original recording and add unknown words to the speaker profile. In addition, it includes the Macro Tool and approximately 1,000 built-in macro commands (called “SmartCommands”) that allow more intuitive interaction. Because VP 12 relies on the German WSR engine, the user needs a fully German version of Windows.

For more information see: http://www.linguatec.de/products/stt/voice_pro

 

A Comparative Evaluation

There are no accepted criteria for an “objective” evaluation of SR, mostly because people have different experiences and appreciate different features. Most users value a combination of high accuracy and a reasonable User Interface, though people naturally differ in how they weigh these features and in what they take “high accuracy” to mean. While most users consider SR not worthwhile at anything below 98% accuracy, there is no agreement about how to calculate that figure.
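To illustrate why the figure depends on how it is calculated, here is one common convention borrowed from speech research, not necessarily the one used by any program discussed here: word accuracy defined as 1 minus the word error rate (WER). The WER counts the minimum number of word substitutions, deletions, and insertions needed to turn the recognized text into what was actually said, divided by the number of words actually said. The following is a minimal Python sketch under that assumption; the function names `word_errors` and `word_accuracy` are hypothetical, and choices such as ignoring punctuation or capitalization would change the result.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance: minimum substitutions + deletions + insertions."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far
    # (one row earlier) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # reference word deleted
                          curr[j - 1] + 1,     # extra word inserted
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER (can drop below zero if there are many insertions)."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference text must contain at least one word")
    return 1.0 - word_errors(reference, hypothesis) / n


if __name__ == "__main__":
    said = "the quick brown fox jumps over the lazy dog today"
    recognized = "the quick brown fox jumps over a lazy dog today"
    print(word_accuracy(said, recognized))  # one substitution in ten words
```

By this measure, 98% accuracy means roughly two errors per hundred dictated words; counting only substitutions, or counting errors per character rather than per word, would yield a different percentage for the same dictation.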

 

Prospective and new users should take into account that speech recognition, like speech itself, is a very personal matter. A program that works well for one person may not work as well for another.

 

The various SR discussion forums are constantly full of contradictory testimonies of failure and success. Some people are sworn supporters of one application and bitterly complain about another. This is to be expected, given the enormous complexity of speech and the wide range of voices and language varieties. In short, as with selecting any other product, much depends on personal preferences regarding the various features the programs provide.

 

The following is therefore based on my personal experience and some currently accepted evaluations among veteran users.

 

  1. DNS still seems to maintain its reputation as the leading program in terms of accuracy, User Interface, and various important features, and it has the advantage of working under different Windows versions. However, with Windows Speech Recognition constantly improving, DNS will need to offer more advanced capabilities than those in its latest version, 10, to justify its leading position.

 

  2. Microsoft Speech 2003 is very accurate, but it suffers from User Interface deficiencies and lacks basic features. It offers SR only for US English, Chinese, and Japanese. Moreover, Microsoft has stopped developing it, as part of its overall policy to stop selling and supporting Windows XP.

 

  3. Windows Speech Recognition (WSR) has an improved UI, a superb set of commands and extended command-and-control, a powerful macro tool (currently in beta but fully functional for all languages), and seven supported languages. Its greatest advantage is, of course, that it is a built-in feature of the OS. Its disadvantage is that it currently works only under Windows Vista and Windows 7. In addition, it lacks several indispensable features, such as transcription and the preservation of the speech data (the audio element of SR). The latter, incidentally, is offered by Microsoft Office 2003 speech recognition.

 

A toolkit written by Brad Trott and sold by Martin Markoe now offers basic transcription and basic macro capabilities for WSR. It can be purchased from here. Brad Trott has also written a Guidebook for Macro Writing, available from the same site.

 

For a summary comparison of the features supported by the various SR applications see Table #1.

 

Factors involved with SR Functionality

 

The functionality of a Speech Recognition application depends on various factors, such as the quality of the computer, the quality of the microphone, and the quality of one’s voice and speech.

 

Computer

Even the best application will not perform very well on a weak machine, despite the so-called “minimal requirements” advertised by the manufacturer. A powerful machine, on the other hand, can improve speech recognition dramatically. A powerful machine combines at least the following: a fast CPU, a large amount of random access memory (RAM), a good sound card, and a large disk. For XP, at least 1 GB of RAM is necessary; for Vista, at least 2 GB.

 

Microphone

A good microphone can make a big difference. There is a fair variety of brands, and good advice from experienced people can make your SR a success story. For information, consultation and products I recommend Martin Markoe’s Website:

https://www.emicrophones.com.

 

Voice and manner of dictation

As Martin Markoe puts it, “the other factor besides a microphone is actually the quality of a person's voice and if they're actually going to use a dictation style of talking as opposed to a conversational style.”

 

Concluding Remarks

 

On the whole, as Chuck Runquist once wrote, “The whole issue of accuracy is a combination of hardware, software, and user skill.” In its current state, no one should expect SR to deliver excellent results without some learning and adjustment. Although the quality of SR has improved remarkably, one must still be determined to succeed. Just as the program must learn how people use a language, speakers must also learn something about themselves and the way they speak. Sometimes, because of the peculiar phonetic clashes characteristic of a given language, one must change one’s speech habits. For example, short pauses must be introduced between certain words, and some words must be pronounced a bit differently than in ordinary everyday speech. In short, novices must pay attention to the instances where things go wrong and find strategies to handle them, much like people who speak in public and must make themselves heard and understood. People who want to succeed with speech recognition are strongly advised to consult not only the manuals for their applications but also the various discussion forums, which provide further documentation. (See links on my speech website: http://speech.even-zohar.com.)

 

Information

Information about the various programs can be found on the Websites of their respective companies:

 

Nuance (DNS and ViaVoice)

http://www.nuance.com/naturallyspeaking

Microsoft

http://www.microsoft.com/speech

 

Linguatec

http://www.linguatec.de/products/stt/voice_pro

Voxit

http://www.voxit.se (Swedish)

 

 

Comparative Tables
