The past three decades have seen dramatic changes in the way we live and work. We have created a society powered by information delivered almost instantaneously, almost anywhere. We rely on information technology (IT) systems to acquire these data, and to store, process and transmit them ever more efficiently, cheaply and quickly. The sheer volume of data is only one challenge; another is coping with the increasing range of formats, or modalities. Ten years ago, most digital content came in the form of text; today it also includes speech, audio, images, video and other forms.
Modern personal computers have multimedia capabilities, and many more electronic tools are now intelligent or multi-purpose, including wearable computers, smartphones, and intelligent sensors and displays, which add to the global volume and array of digital data as well as the number of people who can generate and access it. The challenge is to organize, understand and search this multi-modal information in a robust, efficient and intelligent way.
Human communication and cognition are inherently multi-modal — people perceive the world through five primary senses and express themselves in various ways, including with voice, gestures, gaze, facial expressions, body posture, touch and motion. Computerized systems are a long way behind humans in their ability to handle all these inputs. Computers are efficient at processing large, well-structured data sets, but are currently unable to cope with tasks that are easy for humans, for example creating and understanding natural language or interpreting visual information such as facial expression.
The goal of multi-modal interaction is to use all the different types of information contained within human communication efficiently to enable a more natural, real-time interaction between machines and people.
The counterpart of multi-modal interaction is multi-modal computing, which enhances the ability of computer systems to acquire, process and present different modes of data efficiently and robustly. Such systems have several aims: to analyse and interpret multi-modal information even when it is large, scattered, noisy and possibly incomplete; to organize the gathered knowledge to enable powerful querying; and to produce convincing visual output to display complex information in real time.
Designing systems that can interpret multi-modal information is a task with many component parts.
Acquiring, organizing and retrieving multi-modal information
Searching digital documents today relies on the use of keywords and simple text descriptions. Media including video, image and audio files are searchable only through the use of manually created annotations, which is restrictive and can create bias for certain types of search. Although many types of online resources are available for both professional and casual users, there is little integration among the different sources and formats.
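The keyword search over manual annotations described above can be sketched as a minimal inverted index. Everything here — the file names, annotation texts and function names — is a hypothetical illustration, not a real system; it shows why such search finds only what the annotator happened to write down.

```python
from collections import defaultdict

# Hypothetical manually written annotations for a few media files.
annotations = {
    "clip_01.mp4": "lecture on speech recognition and audio processing",
    "img_07.png":  "photo of a lecture hall audience",
    "clip_02.mp4": "interview about image search engines",
}

# Build an inverted index: keyword -> set of media files whose
# annotation contains that keyword.
index = defaultdict(set)
for media, text in annotations.items():
    for word in text.lower().split():
        index[word].add(media)

def search(*keywords):
    """Return media whose annotations contain every given keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

print(search("lecture"))          # both annotated "lecture" items
print(search("speech", "audio"))  # only the annotated audio clip
```

Note that a video whose annotation omits the word "lecture" is invisible to this query, however relevant its content — exactly the restriction, and the annotator's bias, that automatic content analysis aims to remove.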
In the future, knowledge will be automatically acquired, categorized and continuously maintained by a suite of methods that can process natural language1, and recognize and analyse video content2. These systems will also be able to perform other functions to improve organization, such as inferring relationships between pieces of information, and using context to extract the meanings of ambiguous words (semantic disambiguation; Fig. 1)3. Science and engineering, most notably medicine and the life sciences4, will particularly benefit from these applications as the number and range of scientific publications grow.
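To make semantic disambiguation concrete, here is a toy sketch in the spirit of the simplified Lesk algorithm: each candidate sense carries a short gloss, and the sense whose gloss shares the most words with the surrounding context wins. The word, senses and glosses are made up for illustration; real systems use large lexical resources and statistical models rather than two hand-written glosses.

```python
# Toy sense inventory: one ambiguous word with two hand-written glosses.
SENSES = {
    "bank": {
        "finance": "institution that accepts deposits and lends money",
        "river":   "sloping land beside a body of water",
    },
}

def disambiguate(word, context):
    """Pick the sense whose gloss overlaps most with the context words."""
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(SENSES[word][sense].split()))
    return max(SENSES[word], key=overlap)

print(disambiguate("bank", "she opened an account to deposit money"))
print(disambiguate("bank", "sloping land near the water"))
```

The first context shares "money" with the finance gloss and nothing with the river gloss, so the finance sense is chosen; the second context overwhelmingly matches the river gloss.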
Realistic virtual environments
The goal is to create virtual environments for enhanced presentation of multi-modal data. The visual aspect can be programmed from first principles5 or can incorporate sophisticated processing of existing footage such as static images, video or three-dimensional scans6. These methods require techniques from computer graphics, image processing, computer vision, and combinatorial and geometric computing7 to generate large-scale, integrated, physically accurate and visually rich virtual environments.
A related requirement is for the creation of human-like virtual characters that look and speak realistically, show convincing emotions and mimic the behaviour of real people. Virtual characters provide a powerful and intuitive interface through which to present complex multi-modal data, and can be used to populate virtual reality environments.
Mirroring human-to-human communication
One approach to accessing stored information is to design a system that interacts with users in a way that mirrors human behaviour and dialogue. A system that recreates natural, daily person-to-person communication, in which both the system and the human user combine the same spectrum of modalities for input and output, is said to be symmetric8. A good example involves drivers and passengers travelling in a car: rather than diverting their attention to access advanced car services (for example, satellite navigation, entertainment or four-wheel drive), a natural interface would allow easy access using voice commands combined with predictive algorithms. These technologies would create computational models of the current task combined with context, such as the user’s state and cognitive load, to understand the user’s needs and provide appropriate multi-modal responses.
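The idea of weighing a voice command against the user's context can be sketched with a few illustrative rules. The field names, thresholds and modality labels below are assumptions invented for this sketch, not a description of any real in-car system; they merely show how cognitive load might steer the choice of response modality.

```python
# Hedged sketch of context-aware response selection in a car.
# All rules, thresholds and keys are illustrative assumptions.
def respond(command, context):
    """Choose an output modality for a voice command, given driving context."""
    load = context.get("cognitive_load", 0.0)   # 0.0 (idle) .. 1.0 (overloaded)
    driving = context.get("driving", True)
    if driving and load > 0.7:
        # Heavy traffic: do not add to the driver's load now.
        return ("defer", "Will remind you when you stop: " + command)
    if driving:
        # Eyes on the road: answer by speech, not on a screen.
        return ("speech", "Spoken answer to: " + command)
    # Parked: the visual display is safe to use.
    return ("screen", "Visual answer to: " + command)

print(respond("find the nearest petrol station", {"driving": False}))
print(respond("read my messages", {"driving": True, "cognitive_load": 0.9}))
```

The point of the sketch is the asymmetry: the same command yields different modalities — or is deferred entirely — depending on a model of the user's state rather than on the command alone.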
Autonomous infrastructure
Designing dependable, autonomous multi-modal systems is only part of the task: they must also be supported by platforms that are self-organizing and able to operate independently on a range of infrastructures, thereby providing reliable computing and communication anytime and anywhere. Manual input to these systems should be limited to the installation and replacement of hardware components.
Such systems will be capable of delivering personalized, relevant and timely information and communication, but they must respect users’ legitimate privacy concerns while holding users accountable for their actions. These platforms are a necessary foundation for the objectives stated above.
The multi-modal future is already around us in the form of smartphones, global positioning systems and even hyper-realistic computer games; going forwards, these will be even more commonplace — available anytime and anywhere. In our vision, these systems will be self-organizing and autonomous, using natural interfaces to provide personalized information quickly and accurately, yet they must also respect users’ legitimate privacy concerns9,10. The priority is to develop principles for the design and operation of such systems that manage the huge amounts of multi-modal information safely and securely.
Researchers at the Max Planck Institute for Informatics have recently developed a marker-less approach to capturing complex human performances (spatio-temporally coherent geometry, motion and texture) from multi-view video (de Aguiar, E. et al. ACM TOG 27 (3), 2008). They have also proposed an approach to building comprehensive knowledge bases that tap the deepest online information sources and their relationships, to address questions beyond the reach of today’s keyword-based search engines (Weikum, G. et al. Comm. ACM 52 (4), 2009).