Lip-syncing thanks to artificial intelligence

A new piece of software adapts the facial expressions of people in videos to match an audio track dubbed over the film

Dubbing films could become significantly easier in the future. A team led by researchers from the Max Planck Institute for Informatics in Saarbrücken has developed a software package that can adapt actors’ mouth movements and whole facial expressions to match the film’s translation. The technique uses methods based on artificial intelligence and could save the film industry a considerable amount of time and money when it comes to dubbing films. The software can also correct the gaze and head pose of participants in a video conference to boost the impression of a natural conversation setting.

Synchronized facial expressions: a person’s facial expression, gaze direction, and head pose (input) can be transposed onto another individual (output) using the Deep Video Portraits technique, which works using 3D face models (centre).

Film translators and dubbing actors work within a set of rigid limitations. After all, they must ensure that the words they put into actors’ mouths not only accurately reproduce what was said but also correspond to the actors’ lip movements and facial expressions. Now, an international team led by researchers from the Max Planck Institute for Informatics has presented a technique known as Deep Video Portraits at the SIGGRAPH computer graphics conference in Vancouver. This technique does away with the need to synchronize the translated audio track with the facial expressions in the video footage. Instead, the software can adapt the actors’ facial expressions – and above all their lip movements – to match the translation.

The software was developed by a team involving not only the Max Planck researchers in Saarbrücken but also scientists from the University of Bath, Technicolor, the Technical University of Munich (TUM), and Stanford University. In contrast to existing methods, which can only animate the facial expressions found in videos, the new technique also adapts the head pose, gaze, and eye blinking. It can even synthesize a plausible static video background if the head moves.

The technique could transform the visual entertainment industry

In order to reproduce features realistically, the researchers use a model of the face in conjunction with methods based on artificial intelligence. “We work with model-based 3D face performance capture to record the detailed movements of the eyebrows, mouth, nose, and head position of the dubbing actor in a video,” explains Hyeongwoo Kim, a researcher at the Max Planck Institute for Informatics.
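The core idea of this transfer step can be sketched schematically: each face is described by a small set of model parameters, and the dubbing actor’s expression, head pose, and gaze are copied onto the target actor while the target’s identity is kept fixed. The `FaceParams` fields and the `transfer` function below are illustrative assumptions for the sake of the sketch, not the authors’ actual code.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FaceParams:
    """Hypothetical low-dimensional parameters of a 3D face model."""
    identity: tuple     # person-specific shape coefficients (kept fixed)
    expression: tuple   # expression coefficients (e.g. mouth, eyebrows)
    head_pose: tuple    # rotation/translation of the head
    gaze: tuple         # eye direction

def transfer(source: FaceParams, target: FaceParams) -> FaceParams:
    """Copy expression, head pose, and gaze from the source (dubbing)
    actor onto the target actor, preserving the target's identity."""
    return replace(target,
                   expression=source.expression,
                   head_pose=source.head_pose,
                   gaze=source.gaze)

# Example: the dubbing actor's expression and head turn are applied to
# the on-screen actor; identity coefficients stay those of the target.
src = FaceParams(identity=(0.9,), expression=(0.7,), head_pose=(15.0,), gaze=(0.1,))
tgt = FaceParams(identity=(0.2,), expression=(0.0,), head_pose=(0.0,), gaze=(0.0,))
out = transfer(src, tgt)
print(out.identity, out.expression)  # identity from target, expression from source
```

In the actual system, the resulting parameters drive a rendering of the target’s 3D face model, which a neural network then converts into a photorealistic video frame.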

For the time being, the research merely demonstrates a new concept, and the method has yet to be put into practice. However, the researchers believe that the technique could completely transform sections of the visual entertainment industry. “Despite extensive post-production manipulation, dubbing films into foreign languages always presents a mismatch between the actor on screen and the dubbed voice,” says Christian Theobalt, who leads a research group at the Max Planck Institute for Informatics and played a key role in the current work. “Our new Deep Video Portraits approach enables us to modify the appearance of a target actor by transferring head pose, facial expressions, and eye motion with a high level of realism.”

More natural conversation settings in video conferencing

As well as a realistic rendering of films into other languages, the method also has a range of other applications in film production. “This technique could also be used for post-production in the film industry, where computer graphics editing of faces is already widely used in today’s feature films,” says Christian Richardt, who participated in the project on behalf of the University of Bath’s motion capture research centre CAMERA. One example of this type of editing is The Curious Case of Benjamin Button, where Brad Pitt’s face was replaced with a modified computer graphics version in nearly every frame of the film. Until now, interventions such as this often required many weeks of work by trained artists. “Deep Video Portraits shows how such a visual effect could be created with less effort in the future,” says Richardt. With the new approach, the positioning of an actor’s head and their facial expression could easily be edited in order to subtly alter the camera angle or framing of a scene and thus to tell the story better.

In addition, the new technique could also be used in video and VR teleconferencing, for example, where people typically look at the screen and not into the camera. As a result, they don’t appear to be looking into the eyes of their conversation partners on the other end of the video link. With Deep Video Portraits, the gaze and head pose could be corrected to create a more natural conversation setting.

Neural networks detect videos that have been edited

The software paves the way for a host of new creative applications in visual media production, but the authors are also aware of the potential for misuse of modern video editing technology. Whereas the media industry has been editing photos for many years, it is now increasingly easy to edit videos – and with increasingly convincing results. “Given the constant improvements in video editing technology, we must also start being more critical about video content, just as we already are about photos, especially if there is no proof of origin,” says Michael Zollhöfer from Stanford University. “We believe that the field of digital forensics should and will receive a lot more attention in the future to develop approaches that can automatically prove the authenticity of a video clip.”

Zollhöfer is convinced that, with better methods, it will be possible to spot modifications of this kind in future, even if we humans might not be able to spot them with our own eyes. This issue is also being addressed by the researchers who presented the new video editing software. They are developing neural networks that are trained to detect synthetically generated or edited video with high precision in order to make it much easier to spot forgeries.

The scientists currently have no plans to make the video modification software publicly available. Moreover, they say that any such software should leave watermarks in videos in order to clearly mark modifications.
