The past three decades have seen dramatic changes in the way we live and work. We have created a society powered by information delivered almost instantaneously, almost anywhere. We rely upon information technology (IT) systems to acquire these data and to store, process and transmit them ever more efficiently, cheaply and quickly. The sheer amount of data is only one challenge; another is coping with the growing range of formats, or modalities, in which they arrive. Ten years ago, most digital content came in the form of text; today it also includes speech, audio, images, video and other forms.
Modern personal computers have multimedia capabilities, and many more electronic tools are now intelligent or multi-purpose, including wearable computers, smartphones, and intelligent sensors and displays. These devices add to the global volume and variety of digital data, and to the number of people who can generate and access them. The challenge is to organize, understand and search this multi-modal information in a robust, efficient and intelligent way.
Human communication and cognition are inherently multi-modal: people perceive the world through five primary senses and express themselves in various ways, including voice, gesture, gaze, facial expression, body posture, touch and motion. Computerized systems are a long way behind humans in their ability to handle all these inputs. Computers are efficient at processing large, well-structured data sets, but are currently unable to cope with tasks that are easy for humans, for example, creating and understanding natural language or interpreting visual information such as facial expressions.
The goal of multi-modal interaction is to make efficient use of all the different types of information contained in human communication, enabling more natural, real-time interaction between machines and people.