FASTlab: GT Fundamental Algorithmic and Statistical Tools Laboratory

School of Computational Science and Engineering, Georgia Institute of Technology

The FASTlab performs research on practical yet theoretically rigorous tools for fundamental algorithmic and statistical problems arising in challenging applications. Our main focus is state-of-the-art data analysis (machine learning/data mining/statistics) on massive datasets -- we are the largest academic lab devoted to this topic. Modern datasets are increasingly "massive" either in the number of objects (rows of a data table) or the number of dimensions/measurements describing each object (columns of a data table), and we are concerned with both. The research questions we pursue are motivated by collaborations with domain experts on high-impact science and engineering problems. Our goal is to develop practical high-accuracy methods and fast algorithms for them, then if possible generalize them with a theory allowing many future practical methods and algorithms to be developed. We strive to ensure practicality by developing real software and co-publishing in the domain area -- thus our work cycle takes us from mathematics to code to scientific results to mathematics.

To achieve our end goals we are multidisciplinary. Work on the computational foundations of machine learning is ultimately intertwined with the fundamental statistical issues of data analysis -- a long-term goal is to unify them theoretically. Our statistical work (as well as our algorithmic work) incorporates concepts from geometry/topology, which we believe represents a long-term bridge between statistics and computation. Within our computational work, we utilize the tools of both discrete computer science (yielding powerful strategies for efficiency and analysis) and continuous numerical analysis (yielding, for example, rigorous bounding of approximation errors), and rare confluences of both can be found in our work. Our work bleeds into general contributions to basic applied mathematics and scientific computing problems.

Algorithmics

Fast Algorithms for Large Datasets
Our goal: To make all textbook state-of-the-art machine learning methods computationally tractable on big datasets. Of particular concern here are nonparametric methods, which are the most powerful but tend to result in the most expensive computations. Most existing approaches to the analysis of massive datasets consist of using simple methods, using a small subsample of the data, or using MapReduce/Hadoop, all of which result in significant limitations. Our strategy is to first design (or adopt) the fastest possible algorithms for the key computational bottlenecks, then consider their extension to out-of-core and parallel settings. Decomposing our goal, five main types of computations can be identified: generalized N-body problems, matrix computations, optimizations, graphical model inferences, and integrations. (This landscape will be outlined in a forthcoming survey paper on large-scale machine learning.) Of these, graphical model inferences and integrations have received much attention in the machine learning literature and therefore have not been a major focus of ours so far. We have so far developed the overall-fastest-while-accurate practical algorithms for the following methods:

Our software for scalable machine learning also includes the best algorithms from the research literature for other methods.

Nearest-Neighbor and Computational Geometry
Many of these fastest algorithms are achieved using multidimensional trees, which are also key to basic problems beyond data analysis. These include kd-trees, ball/metric trees, and cover trees. We have: done extensive experimentation with virtually all known data structures of this kind; introduced new types of trees, including spill trees, subspace trees, and cosine trees; shown how to use trees (specifically our cosine trees) for linear algebra computations; shown how to formally analyze algorithms for pairwise problems based on cover trees; and shown how to create out-of-core versions of trees in relational databases. We are currently working on more insightful data-dependent analyses of tree-based algorithms, as well as parallel trees and tree-based algorithms.

Nearest-neighbor search is important in its own right, but also as a prototype problem for others that are well-treated with trees. We have: shown that all-nearest-neighbors can be done more efficiently than by using repeated single-query algorithms, using a dual-tree traversal; shown that nearest-neighbor classification can be done more efficiently than by actually finding the nearest neighbors; developed the fastest/most accurate method for (1+epsilon) distance-approximate nearest-neighbor search using our spill trees; and introduced a more powerful and natural paradigm of approximation which is more effective for high-dimensional nearest-neighbor search than distance approximation. (More)
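To give a flavor of the dual-tree idea, here is a minimal illustrative sketch in Python: two recursions over the same tree, with node-pair pruning based on bounding boxes. The tree type, leaf size, and pruning rule are simplified assumptions for exposition, not our production algorithms or data structures.

```python
import numpy as np

class Node:
    """Axis-aligned bounding-box tree node over a subset of point indices."""
    def __init__(self, idx, lo, hi):
        self.idx, self.lo, self.hi = idx, lo, hi
        self.left = self.right = None

def build(X, idx, leaf_size=16):
    node = Node(idx, X[idx].min(axis=0), X[idx].max(axis=0))
    if len(idx) > leaf_size:
        d = int(np.argmax(node.hi - node.lo))          # split the widest dimension
        order = idx[np.argsort(X[idx, d])]
        mid = len(order) // 2
        node.left = build(X, order[:mid], leaf_size)
        node.right = build(X, order[mid:], leaf_size)
    return node

def min_box_dist(a, b):
    """Smallest possible distance between any point in box a and any point in box b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    return float(np.sqrt((gap ** 2).sum()))

def all_nearest_neighbors(X, leaf_size=16):
    n = len(X)
    best = np.full(n, np.inf)      # current upper bound on each point's NN distance
    nn = np.full(n, -1)

    def recurse(q, r):
        if min_box_dist(q, r) > best[q.idx].max():     # no reference here can improve q
            return
        if q.left is None and r.left is None:          # leaf-leaf: brute force
            diff = X[q.idx][:, None, :] - X[r.idx][None, :, :]
            d = np.sqrt((diff ** 2).sum(axis=-1))
            d[np.equal.outer(q.idx, r.idx)] = np.inf   # exclude self-matches
            j, dmin = d.argmin(axis=1), d.min(axis=1)
            better = dmin < best[q.idx]
            best[q.idx[better]] = dmin[better]
            nn[q.idx[better]] = r.idx[j[better]]
            return
        for qq in ([q] if q.left is None else [q.left, q.right]):
            for rr in ([r] if r.left is None else [r.left, r.right]):
                recurse(qq, rr)

    root = build(X, np.arange(n), leaf_size)
    recurse(root, root)
    return nn, best
```

On low-dimensional data this traversal visits far fewer node pairs than the naive O(N^2) loop; the algorithms described above achieve their speedups with sharper bounds, better trees, and stronger approximation guarantees than this sketch.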

Kernel Summations and Multipole Methods
Large summations over kernel functions such as Gaussians occur at the heart of kernel estimators, RKHS kernel machines, and manifold learning methods. While similar to some computational physics problems, statistical problems may be in arbitrary dimension and require a wider array of possible kernels. We have: shown how to develop computational geometry (tree-based) methods for kernel summations which yield user-specifiable error guarantees; shown how to use multipole-like hierarchical series expansions with general multidimensional trees, achieving the first true (hierarchical) multipole-like method for statistical problems; and shown how to achieve efficiency in high-dimensional kernel summation problems by using our subspace trees and incorporating Monte Carlo techniques. (More)
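The core pruning idea behind tree-based kernel summation can be sketched as follows: bound the kernel value over a reference node, and approximate the node's whole contribution when that bound is tight enough. This is a simplified single-tree Gaussian summation in Python; the per-point tolerance `eps_per_point` and the kd-tree construction are illustrative assumptions, not the series-expansion machinery of the published methods.

```python
import numpy as np

def gaussian(d2, h):
    """Gaussian kernel evaluated on squared distances, with bandwidth h."""
    return np.exp(-d2 / (2.0 * h * h))

class Node:
    def __init__(self, idx, lo, hi):
        self.idx, self.lo, self.hi = idx, lo, hi
        self.left = self.right = None

def build(X, idx, leaf_size=32):
    node = Node(idx, X[idx].min(axis=0), X[idx].max(axis=0))
    if len(idx) > leaf_size:
        d = int(np.argmax(node.hi - node.lo))
        order = idx[np.argsort(X[idx, d])]
        mid = len(order) // 2
        node.left = build(X, order[:mid], leaf_size)
        node.right = build(X, order[mid:], leaf_size)
    return node

def kernel_sum(q, node, X, h, eps_per_point):
    """Approximate sum_j K(q, x_j) over the node's points; each point contributes
    at most eps_per_point of absolute error, so the total error is <= N * eps_per_point."""
    near = np.maximum(0.0, np.maximum(node.lo - q, q - node.hi))
    far = np.maximum(np.abs(q - node.lo), np.abs(q - node.hi))
    k_hi = gaussian(float((near ** 2).sum()), h)       # largest possible kernel value
    k_lo = gaussian(float((far ** 2).sum()), h)        # smallest possible kernel value
    if (k_hi - k_lo) / 2.0 <= eps_per_point:           # kernel nearly constant here: prune
        return len(node.idx) * (k_hi + k_lo) / 2.0
    if node.left is None:                              # leaf: evaluate exactly
        return float(gaussian(((X[node.idx] - q) ** 2).sum(axis=1), h).sum())
    return (kernel_sum(q, node.left, X, h, eps_per_point) +
            kernel_sum(q, node.right, X, h, eps_per_point))

# usage: tree = build(X, np.arange(len(X))); s = kernel_sum(query, tree, X, h=0.5, eps_per_point=1e-4)
```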

Generalized N-Body Problems
Generalized N-body problems (GNPs), roughly speaking, are computations involving comparisons between points (e.g. distances, dot products, and kernels), including all nearest-neighbor-like and kernel summation-like problems -- not surprisingly, many of the bottleneck computations in statistics are of this form. A common form of such problems involves comparisons between all pairs of points, a naively quadratic-time computation. We have: coined and formally (algebraically) defined the class, which subsumes the map-reduce class, as well as shown that a single abstract algebraic algorithm, a broad generalization of the famous Fast Multipole Method, solves all such pairwise problems for the first time; shown for the first time that such pairwise problems, including the original N-body problems of computational physics, are solvable in provably linear time; shown for the first time that the Euclidean minimum spanning tree (EMST) problem of computational geometry can be solved in O(N log N) time; shown that higher-order (beyond pairs, to n-tuples) problems such as n-point correlations and more-than-pairwise kernels can also be computed efficiently; and created a programming model called THOR (Tree-based Higher-Order Reduce) for GNPs. (More)
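Abstractly, a pairwise GNP is specified by a prune test over a pair of tree nodes plus a leaf-level base case. The sketch below is an informal Python rendition of that shared dual-tree recursion, which both the nearest-neighbor and kernel-summation examples above instantiate; it is not the algebraic formalism or the THOR system itself, and the `TreeNode` fields are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class TreeNode:
    points: Any                       # data (or indices) owned by this node
    bound: Any                        # geometric summary, e.g. a bounding box
    children: List["TreeNode"] = field(default_factory=list)

def dual_tree(q: TreeNode, r: TreeNode,
              can_prune: Callable[[TreeNode, TreeNode], bool],
              base_case: Callable[[TreeNode, TreeNode], None]) -> None:
    """Generic dual-tree traversal: choosing `can_prune` and `base_case`
    yields one member of the pairwise generalized N-body family."""
    if can_prune(q, r):               # bounds certify this node pair cannot matter
        return
    if not q.children and not r.children:
        base_case(q, r)               # leaf-leaf: do the exact pairwise work
        return
    for qq in (q.children or [q]):    # expand whichever sides still have children
        for rr in (r.children or [r]):
            dual_tree(qq, rr, can_prune, base_case)
```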

Matrix Computations and Optimization
Matrix computations are the main bottleneck in several methods, both basic and state-of-the-art. The idea of approximate linear algebra is arguably more sensible in machine learning than in many other areas of scientific computing. A matrix structure which is largely (though not entirely) particular to machine learning is the kernel matrix. We have: developed a fast algorithm for SVD (full SVD, not just low-rank factorization) using Monte Carlo and a new kind of tree, cosine trees, which automatically determines the rank needed to achieve a specified approximation error; and shown how to use tree-based N-body solvers in kernel matrix problems. We are working on fast, stable algorithms for Gaussian process regression and kernel PCA.

Every statistical method results in an optimization, whether implicit or explicit. Recent methods have required increasingly difficult optimizations, and while generic off-the-shelf optimizers exist, custom algorithm designs are needed for scalability. We have: developed a fast solver for the kinds of semidefinite programs that occur in manifold learning methods; developed the fastest online algorithm for nonlinear SVMs, based on the classic Frank-Wolfe approach; and developed more efficient optimization formulations for L_0 SVMs using perspective-cut mixed integer nonlinear programming and for L_q SVMs with 0 < q < 1 using DC (difference of convex functions) programming. We are working on online algorithms for all machine learning methods. (More)
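The flavor of error-driven rank selection can be illustrated with a generic randomized low-rank SVD that grows its subspace until a target reconstruction error is met. This is only a sketch using Gaussian probes; the cosine-tree sampling of the algorithm described above is a different (tree-guided) way of choosing directions, and the tolerance and block size here are arbitrary assumptions.

```python
import numpy as np

def adaptive_randomized_svd(A, tol=1e-2, block=10, max_rank=None, seed=0):
    """Grow a subspace until the relative Frobenius reconstruction error drops
    below `tol`, then return the SVD restricted to that subspace. A generic
    randomized sketch; it illustrates automatic rank selection only, not the
    cosine-tree sampling of the algorithm referenced in the text."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    max_rank = max_rank or min(m, n)
    norm_A = np.linalg.norm(A)
    Q = np.empty((m, 0))
    while Q.shape[1] < max_rank:
        Y = A @ rng.standard_normal((n, block))     # probe new random directions
        Y -= Q @ (Q.T @ Y)                          # orthogonalize against current basis
        Qb, _ = np.linalg.qr(Y)
        Q = np.hstack([Q, Qb])[:, :max_rank]
        if np.linalg.norm(A - Q @ (Q.T @ A)) <= tol * norm_A:
            break                                   # specified approximation error reached
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U, s, Vt                             # A is approximately (Q @ U) * diag(s) @ Vt
```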

Truly Massive Data
Truly massive datasets may fit only on disk, rather than in RAM, and may even fit only on multiple computers' disks. At this point it is difficult to separate the issues of efficient machine learning from those of data management. We have shown for the first time how to use out-of-core versions of multidimensional trees within existing relational databases, with a demonstration using Microsoft SQL Server; we are working on new ideas for large-scale data management, aka data warehousing, and also on the problem of efficient data transfer to allow large-scale machine learning via cloud computing. Our THOR (Tree-based Higher-Order Reduce) programming model and system for generalized N-body problems can achieve certain kinds of parallelism for tree-based algorithms, but in general we are working on efficient parallelizations of the fastest algorithms for all machine learning methods, for all types of modern hardware setups (clusters, multicore, GPUs). (More)
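As a toy illustration of the out-of-core setting, the sketch below streams a disk-resident dataset through a memory map one chunk at a time. The file path, float32 layout, and brute-force scan are assumptions for the example; the database-resident trees described above avoid scanning the whole dataset in the first place.

```python
import numpy as np

def chunked_nearest_neighbors(queries, data_path, n_points, dim, chunk=100_000):
    """Exact nearest neighbors over a dataset too large for RAM, streamed from
    disk through a memory map one chunk at a time. A toy out-of-core baseline:
    `data_path` is assumed to hold float32 rows, and no tree/index is used."""
    X = np.memmap(data_path, dtype=np.float32, mode="r", shape=(n_points, dim))
    best = np.full(len(queries), np.inf)
    nn = np.full(len(queries), -1)
    for start in range(0, n_points, chunk):
        block = np.asarray(X[start:start + chunk])                  # load one chunk into RAM
        d2 = ((queries[:, None, :] - block[None, :, :]) ** 2).sum(axis=-1)
        j, dmin = d2.argmin(axis=1), d2.min(axis=1)
        better = dmin < best
        best[better], nn[better] = dmin[better], start + j[better]
    return nn, np.sqrt(best)
```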
Statistics

Methods for High-Dimensional/Complex Data
Our goal: To efficiently estimate, compute, and derive insight in the presence of many variables with possibly complex interdependencies. Of particular concern here are nonparametric methods, which are the most accurate but tend to have the most difficulty in high dimensions. In general, in pragmatically choosing between statistical methods, the tradeoff space is mostly defined by three axes: accuracy, scalability, and interpretability. We believe the accuracy/scalability tradeoff can be characterized mathematically, a long-term theoretical goal of ours, by utilizing tools from geometry/topology. One approach to achieving scalability is to consider new methods which result in easier computations yet aim to preserve most of the modeling power of more expensive methods, making the line between statistical method and computational method blurry -- our examples of this include density estimation trees, Markov matrix factorization, and local kernel machines. Interpretability, which can be captured by defining more interpretable objective functions (as in density-preserving maps) or by using more interpretable model classes (as in density estimation trees and non-negative SVMs), is more difficult to cast mathematically, though the number of variables in the final model might be a useful proxy. We have so far developed the following new statistical methods:

Our machine learning software includes these and other state-of-the-art machine learning methods.

Curses of Dimensionality and Geometry
There are various curses of dimensionality, which we believe can ultimately be related to each other. There are statistical curses of dimensionality, such as the exponential dependence of the estimation error of kernel estimators on the dimension of the data, and computational curses of dimensionality, such as the exponential dependence of the runtime of nearest-neighbor algorithms on the dimension of the data. We have: shown that kernel estimators are sensitive to the dimension of the data's submanifold, rather than the full dimension of the ambient space, thus showing that kernel estimation in high dimensionality is possible; and developed more refined variants of the expansion constant, which can be regarded as a worst-case form of intrinsic dimension in the context of some computational geometry computations. We are working on rigorous nonparametric estimators of the intrinsic dimension of datasets, and on distribution-dependent analyses of algorithms which apply the mathematical tools of statistics to algorithm analysis. (More)
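For intuition, here is a crude empirical estimate of the expansion constant from a sample, using scipy's kd-tree for ball counts. The choice of radii and the naive maximization are assumptions for the illustration; the refined variants mentioned above are defined and analyzed differently.

```python
import numpy as np
from scipy.spatial import cKDTree

def empirical_expansion_constant(X, radii):
    """Crude sample estimate of the expansion constant
    c = max over points p and radii r of |B(p, 2r)| / |B(p, r)|;
    log2(c) then behaves like a worst-case notion of intrinsic dimension.
    This is only a naive empirical proxy, not the refined variants in the text."""
    tree = cKDTree(X)
    c = 1.0
    for r in radii:
        small = np.array([len(s) for s in tree.query_ball_point(X, r)], dtype=float)
        big = np.array([len(s) for s in tree.query_ball_point(X, 2.0 * r)], dtype=float)
        c = max(c, float((big / small).max()))   # every ball contains its center, so small >= 1
    return c

# usage: c = empirical_expansion_constant(X, radii=[0.1, 0.2, 0.5]); intrinsic dimension is roughly log2(c)
```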

Dimension Reduction and Feature Selection
The identification of the fundamental axes of variation in data, in principle, has implications for all statistical tasks, and has undergone a revival with the introduction of geometry-inspired ideas in recent years. We are interested in bringing more powerful geometry/topology ideas to this area, in blending them with basic statistical principles of estimation, and in creating new methods using formulations made possible by modern optimization capabilities. We have: shown for the first time how to perform non-negative matrix factorization in a convex fashion; shown how to perform non-negative matrix factorization with isometric (manifold learning-like) constraints, achieving a more compact spectrum; shown a stable way to perform manifold learning without distances, using only an ordering on the distances between points; and demonstrated a new paradigm for dimension reduction which preserves densities rather than distances, which is shown to always be possible while distance preservation is well-known not to be always possible, and which is well-motivated by the most common use of high-dimensional visualization -- identification of clusters and outliers. See also our work on isometric separation maps, which bridge manifold learning and SVMs.

The identification of subsets of the original features (rather than linear or nonlinear combinations of all of the features) which are relevant to a particular learning task has both interpretive and statistical value, and much recent attention has surrounded the issues of sparsity and regularization. We have shown how to perform fast but provably nonparametric density estimation using density estimation trees, the analog of classification and regression trees, yielding the interpretable rules and automatic feature selection capabilities of decision trees. See also our work showing how to achieve the expected aggressive feature selection capabilities of the relatively little-studied regularization norms below L_1, in particular L_0 SVMs using certain relaxations of the true combinatorial objective function and L_q SVMs for 0 < q < 1. We are working on new ways to perform feature selection in multi-class learning problems and in the future will enter causality inference. Dimension reduction and feature selection can often be seen as special cases of learning the metric, or similarity or kernel function, at the root of a statistical task -- rather than, for example, taking the Euclidean distance (and functions of it) as given. See our work on this here. (More)
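For readers unfamiliar with the baseline these variants build on, here is the classic multiplicative-update NMF of Lee and Seung in Python. It is a plain sketch only; the convex and isometric formulations described above impose additional structure on top of this basic factorization, and the iteration count and initialization here are arbitrary.

```python
import numpy as np

def nmf_multiplicative(V, rank, iters=200, eps=1e-9, seed=0):
    """Classic multiplicative-update NMF minimizing ||V - W H||_F^2 with W, H >= 0.
    A baseline sketch only; the convex and isometric NMF variants mentioned in
    the text are different formulations built on this idea."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```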

Kernel Machines and Estimators
Toward general methodology for complex data, we are working on extending the capabilities of the best nonparametric methods, which revolve around notions of similarity, or kernels, between data points. Support vector machines and their relatives are more or less the most accurate classifiers available, but require improvements in scalability and extensions for making them applicable to certain kinds of problems. We have: shown the utility of local kernel machines, which retain some of the advantages of SVMs while enjoying the greater scalability of other nonparametric methods; shown how to create non-negative SVMs which only find models with non-negative coefficients; and proven the empirical and theoretical advantages of generative mean map kernels, a new kind of kernel for probability distributions, allowing complex generative models to be used within SVMs with increased accuracy.

Kernel estimators (or smoothers) constitute another powerful class of nonparametric methods. We are working on ways to perform kernel estimation on data with measurement errors. See also our work on high-dimensional kernel estimation. We are working on the fundamental relationships between kernel estimators and kernel machines.

We have also explored various questions regarding learning the kernel or similarity measure between points -- we have: shown how to learn the kernel in kernel density estimation via semidefinite programming, yielding behavior similar to variable-bandwidth methods but bypassing the idea of bandwidth selection; shown how to learn the metric for a max-margin-based nearest-neighbor classifier; and developed a way to learn the kernel in support vector machines such that linear separability in the kernel space is guaranteed rather than hoped for. (More)
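One simple rendition of the local kernel machine idea, sketched with scikit-learn: fit a small RBF-SVM on each query's k nearest training points and predict locally. The neighborhood size `k` and the SVM settings are assumptions, and the published method may differ in its details.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def local_svm_predict(X_train, y_train, X_test, k=50, **svm_kwargs):
    """Predict each test point with an RBF-SVM fit only on its k nearest training
    points -- a simple illustration of a 'local kernel machine'. Scalability comes
    from each SVM seeing only k points instead of the full training set."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, neighbors in enumerate(idx):
        yi = y_train[neighbors]
        if np.unique(yi).size == 1:            # neighborhood is pure: no SVM needed
            preds[i] = yi[0]
        else:
            clf = SVC(kernel="rbf", **svm_kwargs).fit(X_train[neighbors], yi)
            preds[i] = clf.predict(X_test[i:i + 1])[0]
    return preds
```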

Structured and Complex Data
Structured data are those for which there are known types of dependencies between the variables. Common types of structure include temporal and spatial dependence. We have: developed a form of independent component analysis (ICA) for functional data (such as time series) which accounts for the shape similarity between time series rather than requiring that they be aligned (as with the common use of iid ICA for time series); demonstrated a way to effectively learn fast approximations of hidden Markov models by casting the problem as a certain matrix factorization, which we call Markov matrix factorization; and developed a way to simultaneously learn transformations of each data point while estimating models, in the context of an affine non-negative matrix factorization for images such as faces. See also our work on new kernels for parametric models, which can allow structured (non-iid) data objects such as time series or images to be used within kernel machines. We are also working on ways to ensure generalization in dependent data. (More)
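As a loose illustration of why factorizations are a natural fit for Markov dynamics, the sketch below estimates an empirical first-order transition matrix from discrete state sequences and then takes a truncated-SVD low-rank approximation of it. This is only a toy; the Markov matrix factorization referenced above is a specific, different formulation.

```python
import numpy as np

def transition_matrix(sequences, n_states):
    """Empirical first-order transition matrix from discrete state sequences
    (rows with no observed outgoing transitions are left as zeros)."""
    T = np.zeros((n_states, n_states))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            T[a, b] += 1
    T /= np.maximum(T.sum(axis=1, keepdims=True), 1)
    return T

def low_rank_markov(T, rank):
    """Rank-r approximation of the transition matrix via truncated SVD -- a loose
    illustration of compressing Markov dynamics with a factorization, not the
    Markov matrix factorization method described in the text."""
    U, s, Vt = np.linalg.svd(T)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```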
Tools

High-Impact Applications and Software
Our goal: To make significant impact in today's outstanding, large-scale problems, while using their underlying issues to drive our long-term research questions. When we can find the right collaborators, we try to focus on the problems which have the most long-term impact. This has often led us to long-standing fundamental science problems, but we have also begun to consider difficult problems occurring in industry. We believe that starting with problems in real-world domains, as opposed to academic ones, forces us to consider the truly hardest and most fundamental mathematical/computational problems, rather than simplify and abstract them. We embrace, rather than seek to avoid, the "non-ideal" aspects of real data analysis problems and regard them as opportunities for research. With each applied problem we tackle, we strive to develop not only an improved solution for it but also a general method (computational or statistical) which can be used for a much larger class of problems. First-of-a-kind application results our tools have enabled include:

We are working on customized wrappers of our general software which are tailored to each of the user communities we work with.

Astroinformatics Tools
Though in fact arguably very old, astroinformatics is just beginning to congeal as a community, in the way that bioinformatics has. Among the sciences, astronomy/astrophysics is the poster child of massive datasets, with the coming generation of sky surveys generating billions of data points per month and an unusually successful history of managing such influxes of data. We have worked in this area since 1993, beginning with contributions to the SKICAT star-galaxy separation system for the DPOSS survey. More recently we have been working mainly in the area of observational cosmology, or large-scale structure, with the SDSS survey.

We have enabled the largest and most accurate quasar catalog to date, consisting of nearly 1 million quasars with the largest redshift and luminosity range yet, using our large-scale algorithm for kernel discriminant analysis -- this has led to the first large-scale verification of cosmic magnification (article in Nature). Our algorithm for n-point correlation functions, the first efficient exact algorithm for this critical family of statistics, helped to enable the first large-scale evidence for dark energy using the WMAP data (article in Science), the most accurate 2-point function of AGNs, and the largest 3-point function ever computed. We are still actively working on extensions to this algorithm. Our fast algorithm for kernel density estimation enabled an influential study of the ecology of galaxies. We are working on what we believe will be the most accurate redshift estimator to date using our fast algorithms for kernel regression and kernel conditional density estimation, coupled with our work on kernel estimation accounting for measurement errors. We are also working on the problems of decomposing astrophysical spectra using our advanced methods of non-negative matrix factorization, cross-matching catalogs, deblending overlapped objects, and friends-of-friends clustering in astrophysical simulations. (More)
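To make the n-point computation concrete, here is the raw ingredient of a 2-point correlation function: pair counts in distance bins, computed with scipy's kd-tree. This is a simple counting baseline, not our multi-tree n-point algorithm, and a full estimator (e.g. Landy-Szalay) would also need counts against random catalogs; the bin edges below are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

def pair_counts(points, r_edges):
    """DD(r): number of unordered point pairs falling in each distance bin
    defined by the increasing radii `r_edges`."""
    tree = cKDTree(points)
    cumulative = tree.count_neighbors(tree, np.asarray(r_edges, dtype=float)).astype(float)
    cumulative -= len(points)            # drop the self-pairs counted at every radius
    return np.diff(cumulative) / 2.0     # ordered pairs -> unordered, per bin

# usage: dd = pair_counts(galaxy_xyz, r_edges=np.linspace(0.0, 50.0, 26))
```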

Bioinformatics and Medical Tools
Bioinformatics remains the model success story among the sciences of the multidisciplinary union of a science domain with algorithmic and statistical methods. Our work on various bioinformatics problems goes back many years, and includes some of the early work on microarray analysis attempting to learn genetic circuits from microarray data, as well as a more recent foray into integrating microarray data and gene ontologies to identify cancer biomarkers. We have also worked on segmentation of cell images toward cancer detection. More recently we have focused on mass spectrometry for metabolomics, and have developed functional support vector machine (fSVM) models yielding the most accurate early detector of ovarian cancer to date (article in Horizons). We are working on using our work on feature-selecting SVMs to identify a smaller set of biomarkers, and on the difficult calibration and noise removal issues with mass spectrometry data. We are also now working on identification of salient temporal patterns in EEG data for brain-computer interfaces for the disabled and for clinical diagnosis of sleep disorders.

Molecular modeling encompasses fundamental capabilities for the design and discovery of new drugs and materials, including tertiary structure prediction by protein folding or molecular dynamics simulations. The energy function used in such simulations lies at the center of these activities and is a source of constant experimentation and tuning, within the large space of possible approximations to the true (quantum mechanical) energy function. We have formalized the problem of learning energy functions via non-negative ranking SVMs, and shown the ability to automatically learn superior energy functions. It has been found that traditional pairwise interaction potentials are not able to capture all of the salient physics in molecules, which has opened up the area of empirical and multi-body (more than pairwise) potentials. Using our framework of generalized N-body methods, we have developed the first fast algorithm for multi-body (general-order) potentials, enabling the largest-scale Axilrod-Teller molecular simulation to date. The next higher level of simulation fidelity lies at the quantum level, but long-standing approaches such as Hartree-Fock simulation are prohibitive as they result in quartic runtime in the number of basis functions used -- however, our generalized N-body framework points to a more efficient approach than has existed so far, and we are exploring whether quantum-level simulation can be done on much larger molecules than previously considered possible. (More)
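For reference, this is the kind of computation the fast multi-body algorithm accelerates: the naive O(N^3) Axilrod-Teller three-body energy, summed over all triples of atoms. The coefficient `C9` and the units are placeholders; this brute-force sum is only the baseline, not the tree-based method described above.

```python
import numpy as np
from itertools import combinations

def axilrod_teller_energy(X, C9=1.0):
    """Naive O(N^3) Axilrod-Teller three-body energy over all atom triples:
    E = C9 * sum (1 + 3 cos(a) cos(b) cos(c)) / (r_ij * r_ik * r_jk)^3,
    where a, b, c are the interior angles of each triangle of atoms."""
    E = 0.0
    for i, j, k in combinations(range(len(X)), 3):
        rij, rik, rjk = X[j] - X[i], X[k] - X[i], X[k] - X[j]
        dij, dik, djk = np.linalg.norm(rij), np.linalg.norm(rik), np.linalg.norm(rjk)
        cos_i = np.dot(rij, rik) / (dij * dik)     # angle at atom i
        cos_j = np.dot(-rij, rjk) / (dij * djk)    # angle at atom j
        cos_k = np.dot(-rik, -rjk) / (dik * djk)   # angle at atom k
        E += C9 * (1.0 + 3.0 * cos_i * cos_j * cos_k) / (dij * dik * djk) ** 3
    return E
```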

Industrial and Other Scientific Tools
Man-made data, such as internet data, has also become massive and lies at the heart of important societal problems. We have developed an initial system to automatically flag emails as spam based on the sending patterns of their source rather than their content, using a data source of 300 million emails per day, and are working on a more advanced system using our fast online SVM and fast HMM algorithms. We are also working on a large system for classification and Netflix-like recommendation of documents. In high-energy physics, we have developed fast methods for particle event triggering in the Large Hadron Collider. (More)

Software and Systems
We believe the ultimate incarnation of our research is robust software. There is a void in the space of today's machine learning software, which we are working to fill. Existing software is either comprehensive (covering a large set of statistical methods) but non-scalable, such as Weka, R, or Matlab, or algorithmically sophisticated and tightly-coded but limited in scope, such as SVMlight. Our model is the world of linear algebra, in which LAPACK represents both comprehensiveness and state-of-the-art algorithms, within robust and widely usable software. Our attempt to create the analog for machine learning is called MLPACK, a collection of machine learning methods based on the Fastlib C++ library -- a pre-release is available for the curious, though a much-improved version 1.0 release is scheduled for June 2010. Longer-term, we are also working on a comprehensive system for data visualization and analysis, to make machine learning usable by non-experts. (More)

Algorithm Derivation and Code Generation
Ultimately, we would like to be able to develop customized models for new domains/clients orders of magnitude more quickly than is currently possible. Perhaps our most ambitious project is to encode, over time, all the statistical and computational principles needed to develop the algorithms and code that an expert could develop if given enough time. A modeler could then declaratively specify models, have state-of-the-art algorithms derived for them and highly efficient code generated for those algorithms, and finally have the models automatically tested and compared to each other. We have, based on our earlier involvement with the AutoBayes system, developed steps toward type-theoretic formalization of optimization and formalization of probability and statistics which will form a base for a new system we call Algorithmica. (More)