MULTILINGUALISM IN SPEECH RECOGNITION

(April 2003, October 2004, April 2007)

Itamar Even-Zohar

 

Multilingualism in speech recognition can mean different things:

 

1. Producing speech recognition applications for several languages.

2. Bundling together speech recognition applications for several languages.

3. Allowing the use of several speech recognition applications under one operating system without conflicts.

4. Producing a speech recognition application that would allow switching between several languages in one and the same document.

 

Let me now briefly discuss these topics.

 

1. Producing speech recognition applications for several languages.

Between 1990 and 2007, all four companies engaged in producing SR applications offered speech recognition for several languages. However, two of the four have discontinued their SR applications (VoiceXpress and Philips), and IBM no longer develops its ViaVoice, which is currently sold by Nuance, the company that has taken over DNS. The main languages have been U.S. and UK English, Chinese, French, Spanish, German, Italian, Japanese, and Dutch. DNS (the latest: Version 9) additionally offers Australian, Indian, and Southeast Asian English. ViaVoice (the latest: Version 10.5) no longer offers Brazilian Portuguese and Arabic. Swedish is offered by a Swedish company that continues VoiceXpress. Philips, which offered fourteen languages (among them Austrian German, Catalan, and Swedish), has discontinued its application.

 

As for the new contender, Microsoft, its speech recognition is fully integrated into its new OS, Windows Vista. Currently, the following languages have full SR in Vista: U.S. and UK English, Chinese, French, German, Japanese, and Spanish. As for speech recognition for Office 2003 under Windows XP, only U.S. English, Japanese, and Simplified Chinese are available.

 

For more details see Languages Supported.

 

2. Bundling together speech recognition applications for several languages.

With the exception of Philips (which offered all of its 14 languages in one single bundle, or made it possible to get additional languages on request), all other speech recognition companies offered their various languages only in those countries where the relevant languages are (believed to be) used. In this supposedly globalized age, such international companies as Microsoft, Nuance, and IBM still think in terms of language provinces cut off from each other and serving only local communities. It seems to be inconceivable to the designers of these applications that people may need more than one single language. This has not prevented IBM, by the way, from creating an architecture for ViaVoice that actually does not allow using all of its language applications on the same computer and under the same operating system if one of the languages is installed in a different version. DNS, since version 7, has introduced a similar architecture that makes it impossible to keep different versions for different languages and run them alternately under the same operating system.

 

Although DNS never offered any multilingual bundle before version 5, it seems that, in view of the growing international need for English, all DNS speech recognition languages are now bundled at least with English. On the other hand, this information is not always made explicit on the various Internet pages of their respective electronic stores. For version 9, Spanish, German, and Italian are each bundled with all the English variants, while the Dutch package includes Dutch, German, English, and French.

 

Purchasing the various language applications of both DNS and ViaVoice is not a simple operation. As there is no international store for these products, one must locate, in each of the relevant countries, some Internet store willing not only to sell the product but also to ship it abroad. For example, Amazon UK is not prepared to ship an upgrade for ViaVoice outside of the European Union. Of course, for each language package purchased from a different store in a different country you get a new headset microphone, and you pay separately for shipping, handling, customs, and brokerage. It is a very time- and energy-consuming operation.

 

As for Windows Vista SR, the additional languages can be downloaded from the Microsoft website as “Language Packages” if you have purchased Windows Vista Ultimate or Enterprise.

 

3. Allowing the use of several speech recognition applications under one operating system without conflicts.

This capability, as much as it might be taken for granted, is not implemented by all manufacturers of SR, as described above for IBM and DNS. From the point of view of programming aesthetics, I readily grant that the architecture of ViaVoice (and currently DNS) is beautiful. One single engine is used for all installed languages, creating both aesthetic neatness and economical functionality. However, commercial factors have greatly contributed to destroying the advantage of this unique architecture. Due to the costs involved in manufacturing upgrades, and due to the fact that most languages other than English produce high-quality results (and therefore do not require far-reaching modifications as badly as English does), IBM has not found it necessary to offer upgrades for most of these languages. As a result, those who needed a better English version, and who promptly upgraded from version 7 to versions 8, 9, and 10, simply were not able to go on using the other languages they had purchased for quite a significant sum of money.

 

By the way, the beautiful architecture of ViaVoice (and DNS) has not been accompanied by any explicit information. In none of the various language manuals for these programs has there ever been any mention of multilingualism, or an explanation of that particular architecture and what it entailed. A bug that created language confusion in the HELP file of ViaVoice was never fixed or explained, and for a long time it was not even acknowledged.

 

This architecture makes it possible to install more than one language and then switch between the various languages more quickly than by unloading and loading a standalone application. This is done simply by switching between users.

 

4. Producing a speech recognition application that would allow switching between several languages in one and the same document.

Having a speech recognition application that allows switching between languages, even in the same document, without the lengthy procedures of unloading and loading different modules, users, or even different standalone applications, is not utopian. As a matter of fact, Philips offered this possibility in the now discontinued speech recognition application FreeSpeech 2000. As described above, FreeSpeech 2000 was shipped with one main language and an additional set of thirteen languages. If you did not get all of these at once, as far as I recall, you could request them from Philips headquarters. As far as I remember, this applied to English or French, or perhaps German and Spanish, as main languages. It became more complicated when I looked in vain for the combination of those fourteen languages with Swedish as the main language, for a friend of mine, a journalist working for the Swedish press.

 

FreeSpeech 2000 made it possible to work with as many of the fourteen languages as you liked. For many international offices, I believe such a bundle must have been most efficient. Even if they did not need to create multilingual documents, they could still switch quickly and easily between languages in order to create documents in various languages. This arrangement beautifully liberated any prospective user from the need to wander between countries and department stores all over the world desperately looking for unheard-of products. I still believe it is also the best solution from the marketing point of view.

 

However, FreeSpeech 2000 offered more than just a bundle of languages. It offered the ability to switch between languages even in the middle of a sentence. The trouble was that the implementation of this procedure was far from problem-free. It often collapsed under Windows 98 and caused all sorts of damage to the operating system (which was anyway quite shaky). On the whole, it was rather crude, requiring, for instance, constant switching between dictation and voice commands. The levels of accuracy achieved were unstable; personally, I got the worst results with English and the best with Italian, Spanish, Swedish, and even Catalan. Nevertheless, none of these languages reached results comparable to any of the competitors', except for the languages where there were no competitors (such as Swedish and Catalan, or Austrian German). On the other hand, in many other respects it was more advanced than the currently available applications. For example, it offered very extensive and versatile tools for creating alternative commands. The program was designed to work under Windows 95, 98, or NT, and had to be abandoned if one switched over to Windows 2000 and later XP. Philips decided to discontinue it rather than upgrade it to work with these new versions of the Windows operating system.

 

In spite of all its deficiencies, however, FreeSpeech 2000 was at least moving in the right direction, and it offered an array of solutions to various problems connected with multilingualism. When I was recently asked by David Mowatt of the Microsoft Corporation how I viewed multilingual usage, I responded that I basically believe FreeSpeech 2000 could still serve as a model, naturally to be improved and brought forward to the level of the current state of the art.

 

Basically, the design could be as follows:

1. User interface (UI)
The user decides what their main UI language will be. Suppose English is selected; then the entire UI will be in English. This solves the issue of the UI. The user can switch to any language, but the UI will still be in English. That is compatible with Microsoft Office policy. For example, when I work with the MS Office UI in English, I can still create documents in Hebrew, Arabic, or Swedish. I could also opt for a Hebrew UI, in which case I would still have no problems creating documents in English. I am thinking of the TEXTUAL UI, but if you include voice commands as part of it, that is a somewhat more difficult matter.

It is indeed thinkable that all voice commands remain in the main UI language ("go to end of line", "go to top", etc.), but on the other hand, some languages are so INCOMPATIBLE with one another's sound patterns that when your mouth is tuned to one language, moving to another in the middle of using it often feels awkward and sometimes does not even come out right. For example, pronouncing English commands like "New Paragraph" or "FULL STOP/PERIOD" in the middle of a French or Spanish dictation does not seem to fit. As much as I would have liked to go on using only ONE set of voice commands for all possible languages, this does not seem to be practical.

Of course there is one more possibility, which I have used extensively in ViaVoice for many commands that did not quite work. For example, instead of "open quote" and "close quote" I had BQ (pronounced BEE-KEW) and SQ; instead of "PERIOD--NEW PARAGRAPH" I had PNP; I had GTB for "Go to bottom", etc. Such commands are not language-dependent, but they are perhaps less easy to memorize. Their advantage, however, is that they could stay the same in any language (though the letter names are pronounced differently).

This suggests that a solution could be modular, as follows:

[1] The textual UI (buttons, icons, etc.) remains in the main language;
[2] You allow voice commands in EITHER the main language OR the language switched to;
[3] You allow extensive customization of commands.
(The USER can then solve the problem according to their preferences. If, for example, the user prefers having ONE single set of voice commands for ALL languages, such a set can be created via substitution commands, as sketched below.)
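
To make the third point a little more concrete, here is a minimal sketch in Python of how such substitution commands could sit on top of per-language built-in commands. Everything in it is hypothetical: the command tables, the resolver function, and its parameters are my own illustrative assumptions, not the API of any existing speech recognition product.

# A minimal sketch of the modular command scheme outlined above.
# All names here are hypothetical and only illustrate the idea;
# they do not correspond to any real SR engine's API.

# [3] Language-independent substitution commands defined by the user,
#     in the spirit of BQ, PNP, and GTB as used with ViaVoice.
USER_SUBSTITUTIONS = {
    "BQ": "open quote",
    "SQ": "close quote",
    "PNP": "period, new paragraph",
    "GTB": "go to bottom",
}

# [2] Built-in voice commands per language (a tiny illustrative subset).
BUILTIN_COMMANDS = {
    "en": {"new paragraph": "NEW_PARAGRAPH", "full stop": "PERIOD"},
    "fr": {"à la ligne": "NEW_PARAGRAPH", "point": "PERIOD"},
}

def resolve_command(utterance, main_lang, active_lang, commands_follow_dictation):
    """Map a spoken command to an action.

    User substitutions are checked first, so one personal set of commands
    works for every language.  Built-in commands are taken either from the
    main UI language or from the language currently being dictated,
    depending on the user's preference.
    """
    if utterance.upper() in USER_SUBSTITUTIONS:
        return USER_SUBSTITUTIONS[utterance.upper()]
    lang = active_lang if commands_follow_dictation else main_lang
    return BUILTIN_COMMANDS.get(lang, {}).get(utterance.lower())

# Example: dictating in French with English as the main UI language,
# but keeping the main-language built-in commands.
print(resolve_command("PNP", "en", "fr", commands_follow_dictation=False))
print(resolve_command("new paragraph", "en", "fr", commands_follow_dictation=False))

Whichever option the user chooses, the point is that the choice is made once, in one place, and the rest of the application does not need to care which language is currently being dictated.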

David Mowatt's Comment: "That’s an interesting idea, although it reduces discoverability. Customisation is something that more advanced users might be able to master with sufficient help files, but not something that would work for all users. It is simply a really hard problem to solve!"

 

2. Language Switching
Language switching in FreeSpeech 2000 was relatively simple. You had a Language button on the Speechbar with all the languages you had selected to install (out of the 14 available). When you wanted to switch to some other language, even in the middle of a sentence, you simply clicked on that button, selected the relevant language, waited for about 20 seconds, and continued your dictation in the newly selected language (if the application did not collapse in the meantime).

This is indeed the simplest thinkable procedure, and it is also compatible with MS Office at large, where you switch languages via buttons, icons, or customized key combinations.

As to Mowatt's question "How do you select text in a document that is half in English, half in French?", I believe the answer is: you select with the voice command that is functional at the moment. If you happen to be working in English, you say "select…". But it would actually make no sense, unfortunately, to select a multilingual text, because you would not be able to correct it with the module currently loaded. That is an unavoidable limitation. To overcome it we must wait for more powerful computers that would allow all modules to be loaded at one and the same time.

David Mowatt's Comment: "Yes. That was the conclusion that I was being drawn towards".

This will also allow us to dictate, say, in English, then say "switch over to French", and immediately dictate in French. But perhaps this particular detail is feasible even now.
David Mowatt's Comment:  "I haven’t seen it yet in any commercial product either unfortunately."
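
For what it is worth, here is a minimal sketch, again in Python, of how such a "switch over to …" command might be handled: one model per installed language, the textual UI staying in the main language, and the switch command merely changing which model receives the dictation. The class and method names below are my own assumptions for illustration, not any existing product's interface.

# A minimal sketch of a mid-dictation "switch over to ..." command.
# The names below (MultilingualDictation and its methods) are
# illustrative assumptions, not an actual SR engine API.

class MultilingualDictation:
    def __init__(self, main_lang, installed_langs):
        self.main_lang = main_lang      # the textual UI stays in this language
        self.active_lang = main_lang    # the language currently receiving dictation
        self.installed = set(installed_langs)

    def handle_utterance(self, text):
        """Treat "switch over to X" as a command; pass everything else on as dictation."""
        words = text.lower().split()
        if words[:3] == ["switch", "over", "to"] and words[3:]:
            target = " ".join(words[3:])
            if target in self.installed:
                self.active_lang = target
                return f"[now dictating in {target}; UI still in {self.main_lang}]"
            return f"[{target} is not installed]"
        # In a real engine this would go to the recognition model for self.active_lang.
        return f"({self.active_lang}) {text}"

# Example session: English as main language, with French and Spanish also installed.
session = MultilingualDictation("english", ["english", "french", "spanish"])
print(session.handle_utterance("the meeting is on Tuesday"))
print(session.handle_utterance("switch over to french"))
print(session.handle_utterance("la réunion est mardi"))

The hard part, of course, is not the switch itself but keeping all the language models resident in memory at once, which is precisely the limitation discussed above.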

As it turns out, no easy switching mechanism is provided for Windows Speech Recognition under Vista. If you wish to dictate in a different language, you must change the computer's main language ("Display Language"), and the entire UI then switches over to that language. This does not look like a particularly attractive solution if you wish to go back and forth between dictation in various languages.