Where machines go to school

February 11, 2005

The objects of their study are computer programs and algorithms – machines designed to solve complex problems, whose strength lies in processing large and highly intricate quantities of data. That is why the learning theorists working under Bernhard Schölkopf, Director at the Max Planck Institute for Biological Cybernetics in Tübingen, focus not on school and pedagogy, but on the question of how machines can learn.

The machines that Bernhard Schölkopf’s scientists work with have nothing to do with levers and pistons or conveyor belts and axle grease. Nor is their research devoted to thermodynamic efficiency, as one might think upon hearing the word machine. In this case, machines are computer programs or, to be more precise, calculation methods, known as algorithms, performed by a robot or a program – often numerous times in succession. Accordingly, these machines require neither technicians nor mechanics.

At first glance, however, one would not necessarily suspect that they fall within the field of activity of mathematicians, computer scientists, engineers and cognition researchers either. The relevant department at the Tübingen-based Max Planck Institute for Biological Cybernetics goes by the name Statistical Learning Theory and Empirical Inference. And when Bernhard Schölkopf talks about his area of expertise, the talk is of training the machines, of the decisions they have to make, and of many other things that sound more as if they belong to the realm of pedagogy.

Nevertheless, the fundamental object of learning theory can be formulated in very matter-of-fact scientific terms: the aim is to infer laws from a series of observations, such as an array of measurements or data. The laws must not only explain how the given observations are related, but also reliably predict future observations. It’s like a mental exercise: what number is the logical continuation of the series 3 – 4 – 6 – 10 – 18? A little thought and mental arithmetic provide the answer: take the difference between two consecutive numbers, double it, and add it to the larger of the two. Applied to the last pair, 10 and 18, this gives 18 + 2 × 8 = 34.
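For readers who like to see such a rule spelled out, here is a minimal sketch in Python (purely illustrative; the function name is invented for this example):

```python
def next_term(seq):
    """Apply the inferred law: double the difference between the last
    two terms and add it to the larger one."""
    a, b = seq[-2], seq[-1]
    return b + 2 * (b - a)

series = [3, 4, 6, 10, 18]
print(next_term(series))  # 18 + 2 * (18 - 10) = 34
```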

Learning theory can also be compared with forecasting the weather: stations from Sylt to the Zugspitze collect such data as temperature, barometric pressure and precipitation. Meteorologists use this information to calculate the prevailing weather conditions at a given location and attempt (with varying degrees of success) to predict the future weather trend. An algorithm in a computer program, given the appropriate commands, takes an almost human approach. Just like humans, machines learn from examples – although their tasks can, of course, be very different.

Traditionally, the natural sciences deal with induction problems. This entails examining a specific, experimentally accessible model case in order to deduce the general law on which it is based. The problems can be solved by studying a system in sufficiently extensive detail to discover the key processes that take place within it. This is how scientists establish and verify a model. Not even the most teachable machines have managed this – they cannot derive connections. However, machines are frequently suited to describing systems and phenomena that are far too complex for a mechanistic model.

The task assigned to a machine might consist in diagnosing a medical condition: the machine infers the presence of a certain disease from a range of symptoms. Bioinformaticians also use machines to analyze the human proteome in order to determine the structure and function of the proteins that the body produces from its DNA blueprint. Or they might teach a machine to identify a voice or to recognize a pattern among a number of pixels.

The law that a machine is to learn is tied to a number of important conditions: it must explain – at least approximately – the given examples; in other words, the empirical risk must be low. Moreover, the explanation should be applicable to future observations. Scientists refer to this as generalizability. Low empirical risk, however, is not synonymous with high generalizability. Picture a set of dots symbolizing the examples and two candidate laws: a wildly winding curve that passes through every single dot, and a straight line that just barely misses them all. Intuitively, researchers are nevertheless inclined to place greater trust in the straight line. If a new dot appears (a new observation or a new measurement), it should lie as close as possible to the function being sought.
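This trade-off can be made concrete with a small numerical experiment. The following sketch (the data and the choice of polynomial degrees are assumptions made purely for illustration) fits the same noisy sample once with a straight line and once with a high-degree polynomial: the polynomial reaches the lower empirical risk on the training points, but will typically generalize worse to fresh points drawn from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # observations scattered around a true linear law y = 2x + 1
    x = rng.uniform(0, 1, n)
    return x, 2 * x + 1 + rng.normal(0, 0.2, n)

x_train, y_train = sample(10)   # training examples
x_test, y_test = sample(100)    # future observations

for degree in (1, 7):  # straight line vs. wildly winding curve
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: empirical risk {train_err:.3f}, "
          f"risk on new points {test_err:.3f}")
```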

There is, of course, no way of knowing for certain where the new measurement dot will appear – only a probability distribution that governs where it is likely to fall. Again, a weather forecast is a practical comparison: the probability of measuring a temperature below freezing or above 100 degrees Celsius in Cologne in August is practically zero. The probability is only slightly higher for temperatures of 5 or 75 degrees Celsius. The temperature will most probably lie between 10 and 40 degrees Celsius. The aim is to find the law that predicts the location of the new example dot as accurately as possible.

The machine, which has used training examples to deduce the initially unknown law, is then tested on test examples for its ability to predict future observations. This generalizability is the key quality criterion – just as in the case of weather forecasts. Only meteorologists care whether the centre of the low-pressure system that caused the last storm was located above Iceland or the Skagerrak; farmers and mountain climbers want to know how it is going to affect the weather in the days ahead. As Bernhard Schölkopf notes, “Natural scientists expect models to provide insight into the underlying phenomena. With complex, high-dimensional problems this is not always possible – so the quality of learning theory models is judged primarily by their generalizability. In other words, it is judged by a quantity that a priori has nothing to do with insight.”

Computers for Face Detection

Generally speaking, a machine must be able to classify future observations – that is, it must assign the values it observes to classes. In binary pattern recognition, the prediction is limited to a decision function that can assume only the two values “yes” and “no.” A machine that has been taught to recognize a certain pattern must deliver a correct classification for unfamiliar test examples: it must decide whether or not the test example depicts the pattern in question – for example, a triangle, an object or a particular letter.
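In code, such a yes/no decision function can be sketched very simply; the weights below are invented for illustration and have nothing to do with the Tübingen detector:

```python
import numpy as np

def decision(x, w, b):
    """Binary decision function: 'yes' if the pattern is judged present,
    'no' otherwise (a toy linear rule with made-up parameters)."""
    return "yes" if np.dot(w, x) + b > 0 else "no"

w = np.array([0.8, -0.3, 0.5])   # hypothetical learned weights
b = -0.2

print(decision(np.array([1.0, 0.2, 0.6]), w, b))  # -> yes
print(decision(np.array([0.1, 0.9, 0.0]), w, b))  # -> no
```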

This task has very practical applications, such as automatic character recognition, or the one in which Schölkopf and his co-workers have achieved a major advance: computer-aided face detection. Surveillance cameras are an important aid in terror defence; however, analyzing the images they record requires a lot of time and concentration. Schölkopf and his research colleagues have developed a new method that goes a long way toward transferring this work to computers.

In a paper in the British journal Proceedings of the Royal Society, the researchers describe a method that enables computers to find faces in photos or in footage from Internet cameras significantly faster than before. It is based on a method frequently used in statistical data analysis, the so-called support vector method. The same method is used in medicine to analyze patient tissue samples: based on the genetic activity, it should be possible to decide reliably whether the patient suffers from a given disease, such as leukemia or another type of cancer. In this case, the training examples are the gene expression profiles of patients with known diagnoses. The aim is to arrive at a correct diagnosis from the expression profile (the gene activity) of a new patient whose diagnosis is not yet known.
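A minimal sketch of this kind of classification, using the support vector machine implementation in the scikit-learn library on invented data (the numbers of patients and genes, and the data themselves, are pure illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# synthetic "expression profiles": 40 patients x 50 genes,
# with known diagnoses 0 (healthy) and 1 (ill) as training labels
healthy = rng.normal(0.0, 1.0, (20, 50))
ill = rng.normal(0.5, 1.0, (20, 50))
X_train = np.vstack([healthy, ill])
y_train = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear")        # a support vector machine
clf.fit(X_train, y_train)         # learn from diagnosed patients

new_patient = rng.normal(0.5, 1.0, (1, 50))   # not yet diagnosed
print(clf.predict(new_patient))   # predicted diagnosis, e.g. [1]
```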

Developmental biologists use the support vector method to discover the details of the gene activity during the development of Drosophila embryos. In terror defence, efforts are aimed at determining whether and where a large and complex camera image contains faces. To answer this question, the image is divided into individual sections. Then, for each of these sections, it must be decided whether the pixels being examined contain the image of a face. Of course, that is only part of the task of automatically analyzing images. The procedure that mathematicians use merely finds faces, but does not identify them. Nevertheless, this greatly facilitates and accelerates image evaluation.

The support vectors are significant training examples that the computer program calculates according to certain rules. They effectively define a line of separation between faces and non-faces. With their aid, a computer program that evaluates images – a support vector machine – can be taught, using a mathematical method, whether an image section really shows a face or, for example, just a bright patch in a patterned background. Faces can appear at many different sizes – from detailed portraits to faces that form only a tiny part of a full-length shot. Furthermore, a face must be found whether the person shown is wearing a moustache or glasses, and whether they are looking to the right, to the left or into the camera. The similarity between two image sections can be calculated using a mathematical trick.

Schölkopf and his colleagues succeeded in improving the support vector method in two respects. First, the scientists found a way to manage with substantially fewer support vectors. This simplifies the decision as to whether an image patch contains a face, yet the decision must, of course, remain precise: no face may be overlooked, and there may be no false detections. Using even a single reduced set of support vectors resulted in a significantly lower computational load; compared with evaluating all support vectors, this cut the image analysis time by a factor of 30.
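The “mathematical trick” mentioned above is a kernel function. The following sketch (with invented support vectors, weights and kernel parameter) shows how such a similarity measure can be computed and how the face/non-face decision can be written as a weighted sum of similarities to a small set of support vectors:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=0.1):
    """Similarity between two image patches, flattened to vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def is_face(patch, support_vectors, weights, bias):
    """Decision as a weighted sum of kernel similarities to a (here
    invented) reduced set of support vectors."""
    score = sum(w * gaussian_kernel(patch, sv)
                for w, sv in zip(weights, support_vectors))
    return score + bias > 0

# toy 8x8 "patches" flattened to length-64 vectors
rng = np.random.default_rng(2)
support_vectors = [rng.normal(size=64) for _ in range(3)]
weights = [1.0, -0.5, 0.8]
bias = -0.1
print(is_face(rng.normal(size=64), support_vectors, weights, bias))
```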

Second, the support vectors are not calculated in the same amount of detail for all parts of the image. Instead, the scientists apply an “evaluation cascade,” calculating multiple reduced sets of support vectors. One or two support vectors are sufficient to classify an unspecific background, such as a section of wall, the clothing of the persons shown or a large window through which the sky appears. Image sections that resemble faces require a more complex evaluation, corresponding to several dozen support vectors. In addition, each evaluation step uses the support vectors from the previous step, in a sort of recycling process.

Thus, the evaluation cascade makes it possible to focus the computational effort where it is needed: the search for faces is restricted to those portions of an image in which, according to the initial evaluations, faces could actually occur. The cascade – that is, the use of a sequence of multiple reduced sets of support vectors – allows images to be evaluated 900 times faster than with the original method. But the scientists are not yet satisfied; they hope to increase the speed even further, for example by optimizing the selection of the reduced sets of support vectors and by comparing them more quickly with the image patches of interest. Faster image evaluation (which, incidentally, is not the only everyday application of statistical learning theory) will come not only from optimized calculation methods, but also from faster computer processors.
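The cascade principle can be sketched in a few lines. In the sketch below, the individual stages are crude stand-ins (simple brightness thresholds rather than reduced-set support vector evaluations), but the control flow is the point being illustrated: cheap stages reject obvious non-faces early, and only surviving patches reach the expensive stages.

```python
import numpy as np

def make_stage(threshold):
    """A stand-in stage: here just a brightness threshold; in the real
    method each stage is a reduced-set support vector evaluation of
    increasing size."""
    return lambda patch: patch.mean() > threshold

def cascade_classify(patch, stages):
    """Reject a patch as soon as any stage says 'no'; only patches that
    survive every stage are treated as face candidates."""
    return all(stage(patch) for stage in stages)

stages = [make_stage(0.2), make_stage(0.4), make_stage(0.5)]  # cheap -> costly
rng = np.random.default_rng(3)
patches = [rng.uniform(0, 1, (19, 19)) for _ in range(5)]
print([i for i, p in enumerate(patches) if cascade_classify(p, stages)])
```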

Another application could be to make life easier for critically ill people who are completely paralyzed following a stroke or as a result of the rare neurological disease amyotrophic lateral sclerosis (ALS). Such patients can neither speak nor use other muscles, such as those that move the eyes, to control auxiliary devices. In general, however, their hearing still works well. One approach consists in recording patients’ brain waves while they focus on what they hear. The brain waves would then be classified by means of a support vector method at the interface between the brain and the computer – and finally, a communication aid would transmit the word “yes” or “no”.

Stefanie Hense
