2.2 Multimodal Interfaces
2.2.1 Advantages of Multimodal Interfaces
The most noticeable advantages of multimodal interfaces are their flexibility, their stability and robustness, their efficiency gains, and the fact that users nearly universally prefer to interact multimodally. Each of these aspects is discussed in more detail in the following paragraphs.
2.2.1.1 Flexibility
Unlike traditional keyboard and mouse interfaces or unimodal recognition-based interfaces, multimodal interfaces permit a flexible use of modes. This lets the user choose which modality to use for conveying different types of information, decide when to use one modality alone or combined with others, and alternate between modalities at any time. Since individual modalities are well suited to some situations, and less ideal or even inappropriate in others, modality choice is an important issue in a multimodal system. It can be very advantageous to allow diverse user groups to exercise selection and control over how they interact with the computer (Fell et al., 1994). In this respect, multimodal interfaces have the potential to accommodate a broader range of users, tasks and environments than traditional interfaces (Oviatt, 1999a; Oviatt et al., 2000). Since there can be large individual differences in people’s abilities and preferences to use different modes of communication, multimodal interfaces can increase the accessibility of computing for users of different ages, skill levels, cognitive styles, sensory and motor impairments, native languages, or even temporary illnesses. For example, a visually impaired user, or a user with impaired upper limbs, may prefer speech input. In contrast, a user with a hearing impairment, or with phonetic problems, may prefer pen input. The natural alternation between modes that is permitted by a multimodal interface can also be effective in preventing overuse and physical damage to any single modality, especially during extended periods of computer use (Oviatt & Cohen, 2000).
Multimodal interfaces also provide the adaptability that is needed to naturally accommodate the continuously changing conditions of mobile use settings (Oviatt, 2003; Oviatt et al., 2000). Systems involving speech, pen or touch input, and graphical or speech output, are suitable for mobile tasks, and when these modalities are combined, users can shift among them as environmental conditions change (Holzman, 1999; Oviatt, 2000a,c). For example, the user of an in-vehicle application may frequently be unable to use manual or gaze input and graphical output, whereas speech remains relatively available for both input and output. A multimodal interface permits users to switch between modalities as needed under such changing usage conditions.
2.2.1.2 Stability and Robustness
Another major reason for developing multimodal interfaces is to improve the performance stability and robustness of recognition-based systems (Oviatt, 1999a).
From a usability standpoint, multimodal systems offer a flexible interface in which people can exercise intelligence about how to use input modes effectively so that errors are avoided.
One particularly advantageous feature of multimodal interfaces is their superior error handling, both in terms of error avoidance and graceful recovery from errors (Oviatt, 1999a; Oviatt & VanGent, 1996; Oviatt et al., 1998; Rudnicky & Hauptmann, 1992; Tomlinson et al., 1996). There are user-centered and system-centered reasons why multimodal systems facilitate error recovery when compared with unimodal recognition-based interfaces. First, in a multimodal interface users may select the input mode that is less error prone for particular lexical content, which tends to lead to error avoidance (Oviatt & VanGent, 1996). For example, users may prefer faster speech input, but will switch to pen input to communicate a foreign surname. Secondly, when users combine modalities in their input commands, the information carried by each modality is simplified, which can substantially reduce the complexity of the work required of the recognizers and thereby reduce recognition errors (Oviatt & Kuhn, 1998). For example, while in a unimodal system a user may say “select the house near the lakeshore”, in a multimodal system the user might point to the house and utter “select this house”. Thirdly, users have a strong tendency to switch modes after system recognition errors, which facilitates error recovery (Oviatt et al., 2000).
In addition to these user-centered reasons for better error avoidance and resolution, there are also system-centered reasons for superior error handling. A well-designed multimodal architecture with two semantically rich input modes can support mutual disambiguation of input signals. Mutual disambiguation involves recovery from unimodal recognition errors within a multimodal architecture, because semantic information from each input mode supplies partial disambiguation of the other mode, thereby leading to more stable and robust overall system performance (Oviatt, 1999a, 2000b). For example, if a user says “ditches” but the speech recognizer returns the singular “ditch” as its best guess, then parallel recognition of several graphic marks can result in recovery of the correct plural interpretation. To achieve optimal error handling, a multimodal interface ideally should be designed to include complementary input modes, and so that the alternative input modes provide duplicate functionality, such that users can accomplish their goals using either mode.
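The “ditches” example can be illustrated with a minimal sketch. It is not the architecture described in the cited work; it only assumes that each recognizer returns an n-best list of scored hypotheses, that both modes carry a shared feature (here, grammatical number), and that joint interpretations are ranked by the product of the two scores. The names Hypothesis and mutually_disambiguate, as well as all scores, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hypothesis:
    """One entry of a recognizer's n-best list (hypothetical structure)."""
    label: str    # recognized word or gesture description
    number: str   # "singular" or "plural": the feature both modes constrain
    score: float  # recognizer confidence in [0, 1] (illustrative values)


def mutually_disambiguate(
    speech: List[Hypothesis], gesture: List[Hypothesis]
) -> Optional[Tuple[Hypothesis, Hypothesis]]:
    """Return the best-scoring pair of semantically compatible hypotheses.

    Assumption: a pair is compatible when both hypotheses agree on number,
    and the joint score is the product of the two unimodal scores.
    """
    best: Optional[Tuple[Hypothesis, Hypothesis]] = None
    best_score = 0.0
    for s, g in product(speech, gesture):
        if s.number != g.number:
            continue  # incompatible pair, e.g. singular noun with several marks
        joint = s.score * g.score
        if best is None or joint > best_score:
            best, best_score = (s, g), joint
    return best


# The speech recognizer wrongly ranks the singular form first.
speech_nbest = [
    Hypothesis("ditch", "singular", 0.55),
    Hypothesis("ditches", "plural", 0.45),
]
# The gesture recognizer is fairly sure that several marks were drawn.
gesture_nbest = [
    Hypothesis("three line marks", "plural", 0.80),
    Hypothesis("one line mark", "singular", 0.20),
]

top_speech_alone = max(speech_nbest, key=lambda h: h.score)
result = mutually_disambiguate(speech_nbest, gesture_nbest)
print(top_speech_alone.label)
print(result[0].label if result else None)
```

In this sketch the speech recognizer alone would select “ditch”, whereas the joint ranking favors the plural reading because the gesture evidence outweighs the small difference between the two speech scores. A real system would use a richer compatibility test than a single number feature.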
2.2.1.3 Efficiency
Multimodal interfaces sometimes support improved efficiency, especially when manipulating graphical information. In simulation research comparing speech-only with multimodal pen/voice interaction, empirical work demonstrated that multimodal interaction yielded 10% faster task completion time during visual-spatial tasks, but no significant efficiency advantage in verbal or quantitative task domains (Oviatt, 1997; Oviatt et al., 1994). Likewise, users’ efficiency improved when they combined speech and gestures multimodally to manipulate 3D objects, compared with unimodal input (Hauptmann, 1989). In another study, multimodal speech and mouse input improved efficiency in a drawing task (Leatherby & Pausch, 1992). Finally, in a study that compared task completion times for a graphical interface versus a multimodal pen/voice interface, military domain experts were on average four times faster at setting up complex simulation scenarios on a map when they were able to interact multimodally (Cohen et al., 2000).
This study was based on testing of a fully functional multimodal system, and it included time required to correct recognition errors.
Interestingly, multimodal systems demonstrate a relatively greater performance advantage precisely for those users and usage contexts in which unimodal systems fail. For example, recognition rates for unimodal spoken language systems are known to degrade rapidly for children or nonnative accented speakers, and in noisy field environments or while users are mobile. However, research has shown that a multimodal architecture can be designed to close the recognition gap for these kinds of challenging users and usage contexts (Oviatt, 1999a,b). In addition, systems that process multiple modes aim to give users a more powerful interface for accessing and manipulating information, as well as increasingly sophisticated visualization and output capabilities (Oviatt, 1997). One study found multimodal interaction to be nine times faster when a user interacted with a pen/voice system than when using a more familiar graphical interface for initializing simulation exercises (Cohen et al., 1998).
2.2.1.4 User Preferences
Finally, a large body of data documents that multimodal interfaces satisfy higher levels of user preference when interacting with simulated or real computer systems. Users have a strong preference to interact multimodally, rather than unimodally, across a wide variety of different application domains, although this preference is most pronounced in spatial domains (Hauptmann, 1989; Oviatt, 1997). During pen/voice multimodal interaction, users preferred speech input for describing objects and events, sets and subsets of objects, out-of-view objects, conjoined information, and past and future temporal states, and for issuing commands for actions or iterative actions (Cohen & Oviatt, 1995; Oviatt & Cohen, 1991). However, their preference for pen input increased when conveying digits, symbols, and graphic content, and especially when conveying the location and form of spatially oriented information on a dense graphic display such as a map (Oviatt, 1997; Oviatt & Olsen, 1994). Likewise, 71% of users combined speech and manual gestures multimodally, rather than using one input mode, when manipulating graphic objects on a CRT screen (Hauptmann, 1989).