Looking at data through the lens of algebraic geometry
Algebraic methods yield insight into the geometry of datasets
The constant production of information is one of the distinctive features of our time. Advances in science and technology require us to make sense of this information and to collect and handle vast amounts of it efficiently.
Mathematical structures naturally emerge in data analysis: very often data sets used in real-life applications possess an intrinsic geometric structure. This is the case for data coming from medical imaging, image and video recognition, computer vision, mathematical biology, and chemistry. One of the biggest challenges in the mathematics of data today is that of being able to identify any such underlying geometric structure.
A group of researchers affiliated with the MPI for Mathematics in the Sciences, led by Prof. Bernd Sturmfels, has presented a new method to study geometrically distributed datasets. Their results have recently appeared in the Revista Matemática Complutense, a first-class mathematical journal that specialises in applied and computational mathematics.
In the article “Learning algebraic varieties from samples”, the authors explain how to retrieve geometric information from a sufficiently large data set and argue that it is possible to improve on the mathematical analysis by looking at data through the lens of algebraic geometry.
One of the first issues they address is that of recognising low-dimensional datasets. In their article, the authors argue that “the mathematics of data science is concerned with finding low-dimensional needles in high-dimensional haystacks.”
In practical applications, the geometric shape along which the data is distributed depends on a fixed number of parameters, typically smaller than the number needed to describe the ambient space. Determining the dimension of a data set is, therefore, a crucial step in the mathematical approach to data analysis. The authors propose to recognise the underlying shape as an algebraic variety, i.e. as the set of points on which a given collection of polynomials vanishes, and then to use the geometric properties of this variety to extract further information on the data, including the dimension.
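To make the idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not the authors' Julia package: the function names (monomial_exponents, vandermonde, vanishing_polynomials, estimate_dimension), the tolerances and the toy data (noise-free points on the unit circle) are all assumptions chosen for this example. The sketch evaluates every monomial up to a chosen degree at the sample points, looks for coefficient vectors that send all samples to zero, and then estimates the dimension as the ambient dimension minus the rank of the Jacobian of the equations found.

```python
import itertools
import numpy as np


def monomial_exponents(n_vars, max_degree):
    """Exponent tuples of all monomials of total degree <= max_degree."""
    return [e for e in itertools.product(range(max_degree + 1), repeat=n_vars)
            if sum(e) <= max_degree]


def vandermonde(points, exponents):
    """Matrix whose (i, j) entry is the j-th monomial evaluated at the i-th point."""
    return np.stack(
        [np.prod(points ** np.array(e), axis=1) for e in exponents], axis=1
    )


def vanishing_polynomials(points, max_degree, tol=1e-8):
    """Coefficient vectors of polynomials that (numerically) vanish on the samples.

    Returns the monomial exponents and a basis of the numerical null space
    of the Vandermonde matrix, one polynomial per row.
    """
    exponents = monomial_exponents(points.shape[1], max_degree)
    matrix = vandermonde(points, exponents)
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=True)
    # Pad with zeros in case there are fewer samples than monomials.
    padded = np.zeros(len(exponents))
    padded[:len(singular_values)] = singular_values
    null_mask = padded <= tol * padded.max()
    return exponents, vt[null_mask]


def gradient(coefficients, exponents, point):
    """Gradient at `point` of the polynomial with the given coefficients."""
    grad = np.zeros(len(point))
    for c, e in zip(coefficients, exponents):
        for i in range(len(point)):
            if e[i] > 0:
                lowered = np.array(e)
                lowered[i] -= 1
                grad[i] += c * e[i] * np.prod(point ** lowered)
    return grad


def estimate_dimension(basis, exponents, point, tol=1e-6):
    """Ambient dimension minus the Jacobian rank of the equations at `point`."""
    if len(basis) == 0:
        return len(point)
    jacobian = np.array([gradient(c, exponents, point) for c in basis])
    return len(point) - np.linalg.matrix_rank(jacobian, tol=tol)


# Toy data (an assumption for this sketch): 50 exact samples on the unit circle.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=50)
points = np.column_stack([np.cos(angles), np.sin(angles)])

exponents, basis = vanishing_polynomials(points, max_degree=2)
dimension = estimate_dimension(basis, exponents, points[0])

# Print each recovered equation as (monomial exponents -> coefficient) pairs.
for coefficients in basis:
    terms = {e: round(float(c), 6) for e, c in zip(exponents, coefficients)
             if abs(c) > 1e-6}
    print("vanishing polynomial:", terms)
print("estimated dimension:", dimension)
```

On this toy input the only quadratic relation among the samples is a multiple of x² + y² − 1, so the estimated dimension is 1, a curve inside the plane. Real data is noisy, so in practice the tolerances would need tuning, and the choice of degree and of a numerically stable polynomial basis matters considerably more than this sketch suggests.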
The idea of using algebraic techniques is an innovative aspect of their work. Indeed, while current approaches to geometrically distributed data tend to discard the underlying algebraic structure, the MPI MiS researchers were able to show that by exploiting the additional information contained in the polynomial equations, it is possible to improve on the quality, accuracy and efficiency of the analysis.
In parallel to the more theoretical aspects of their study, the authors have also developed a software package that implements their procedure, written in the open-source programming language Julia, freely available to anyone working in the field. The use of the software package is discussed amply in the article, with an abundance of examples.
By putting their algorithm to the test on datasets coming from a chemistry database, as in the example of the cyclooctane molecule, a simple hydrocarbon used in the production of plastic and other fibers, the authors were able to demonstrate that their technique can precisely identify the molecule using fewer data points than are required by other widely used approaches.
The presence of different, complementary mathematical expertise within the group of authors, ranging from computational algebraic geometry to applied topology, has been vital for the success of the collaboration. This paper constitutes a first step towards the development of a new line of research within the mathematics of data, as exemplified by the workshop “TAGS: Linking Topology to Algebraic Geometry and Statistics” organised at the institute earlier this year.