Truly Massive Data
Truly massive data does not fit in RAM, and may even require multiple disks to store.

Fast Database-Resident Multivariate Statistics: Extending the Sloan Digital Sky Survey SQL Server Database
Ryan Riegel, Abhimanyu Adita, Praveen Krishnaiah, Prasad Jakka, Nikolaos Vasiloglou, Dongryeol Lee, Alexander Gray, Tamas Budavari, Alexander Szalay
Georgia Institute of Technology Technical Report, 2009

Most of the world's data is business data, which mostly lives in relational databases. We demonstrate how to put fast algorithms based on trees into real databases, for the first time allowing many machine learning methods to scale to out-of-core data effectively. [pdf]

Abstract: We develop, for the first time, fast DBMS-resident algorithms for multivariate statistical operations, including all-nearest-neighbors, kernel density estimation, and the 2-point correlation, based on efficient multi-tree traversals. We implement these methods within a commercial DBMS, Microsoft SQL Server, and demonstrate their performance on real scientific data, the Sloan Digital Sky Survey. Empirical results suggest dramatic asymptotic speed-up over naive SQL implementations, with many orders of magnitude improvement for datasets containing millions of rows. This work demonstrates the scalability of multi-tree methods to datasets that cannot fit in RAM.

@techreport{riegel2009mldb,
  title = "{Fast Database-Resident Multivariate Statistics: Extending the Sloan Digital Sky Survey SQL Server Database}",
  author = "Ryan Riegel and Abhimanyu Adita and Praveen Krishnaiah and Prasad Jakka and Nikolaos Vasiloglou and Dongryeol Lee and Alexander Gray and Tamas Budavari and Alexander Szalay",
  institution = "{Georgia Institute of Technology}",
  series = "{College of Computing Technical Report}",
  year = "2009"
}
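
To make the multi-tree idea in the abstract concrete, here is a minimal in-memory sketch of a dual-tree pair count, the core of a 2-point correlation estimate. It is not the report's SQL Server implementation: the `Node` class, the function names, and the `leaf_size` parameter are all hypothetical, and a plain kd-tree stands in for whatever tree the authors use. The point it illustrates is pruning: whole pairs of tree nodes are accepted or rejected from their bounding boxes alone, so most point pairs are never examined individually.

```python
import random
from itertools import combinations
from math import dist


class Node:
    """kd-tree node: a bounding box plus either two children or a point bucket."""

    def __init__(self, points, leaf_size=8):
        dims = range(len(points[0]))
        self.count = len(points)
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.points, self.left, self.right = points, None, None
        widths = [h - l for l, h in zip(self.lo, self.hi)]
        if len(points) > leaf_size and max(widths) > 0:
            d = widths.index(max(widths))  # split along the widest dimension
            ordered = sorted(points, key=lambda p: p[d])
            mid = len(ordered) // 2
            self.left = Node(ordered[:mid], leaf_size)
            self.right = Node(ordered[mid:], leaf_size)
            self.points = None  # internal nodes keep no point bucket


def min_dist(a, b):
    """Smallest possible distance between a point in box a and one in box b."""
    gaps = [max(a.lo[d] - b.hi[d], b.lo[d] - a.hi[d], 0.0) for d in range(len(a.lo))]
    return sum(g * g for g in gaps) ** 0.5


def max_dist(a, b):
    """Largest possible distance between a point in box a and one in box b."""
    spans = [max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d]) for d in range(len(a.lo))]
    return sum(s * s for s in spans) ** 0.5


def count_ordered(a, b, r):
    """Count ordered pairs (p, q), p in a and q in b, with dist(p, q) <= r."""
    if min_dist(a, b) > r:
        return 0  # prune: no pair can be close enough
    if max_dist(a, b) <= r:
        return a.count * b.count  # prune: every pair is close enough
    if a.points is not None and b.points is not None:
        return sum(1 for p in a.points for q in b.points if dist(p, q) <= r)
    # Otherwise split a (if b is a leaf, or a is the larger internal node), else split b.
    if b.points is not None or (a.points is None and a.count >= b.count):
        return count_ordered(a.left, b, r) + count_ordered(a.right, b, r)
    return count_ordered(a, b.left, r) + count_ordered(a, b.right, r)


def tree_pair_count(points, r, leaf_size=8):
    """Unordered pairs within distance r (r >= 0), via a dual-tree traversal."""
    root = Node(points, leaf_size)
    # The traversal counts each pair twice plus every point with itself.
    return (count_ordered(root, root, r) - len(points)) // 2


def naive_pair_count(points, r):
    """Quadratic baseline: test every unordered pair directly."""
    return sum(1 for p, q in combinations(points, 2) if dist(p, q) <= r)


if __name__ == "__main__":
    random.seed(0)
    sample = [(random.random(), random.random()) for _ in range(500)]
    print(tree_pair_count(sample, 0.05), naive_pair_count(sample, 0.05))
```

Both functions should report the same count; the tree version simply avoids most of the pairwise distance tests. A database-resident version would additionally have to keep the tree in tables or pages on disk, which this sketch does not attempt.
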
In preparation

Tree-Based Higher-Order Reduce
We have developed THOR, an analog of MapReduce for generalized N-body problems that automatically constructs parallel algorithms. It currently parallelizes only over the queries; a more general version is in progress.
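
As a rough illustration of what parallelizing over the queries can mean, the sketch below splits the query set of an all-nearest-neighbors problem into chunks, solves each chunk independently against the full reference set, and concatenates the results. This is not THOR's interface, which is not described here: the function names and the `workers` parameter are hypothetical, brute-force search stands in for a tree-based solver, and a thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist


def nearest_for_chunk(query_chunk, references):
    """Per-chunk work: brute-force nearest reference point for each query."""
    return [min(references, key=lambda ref: dist(q, ref)) for q in query_chunk]


def all_nearest_neighbors(queries, references, workers=4):
    """Split the queries into chunks, solve each chunk independently, and
    concatenate the per-chunk answers in the original query order."""
    size = max(1, -(-len(queries) // workers))  # ceiling division
    chunks = [queries[i:i + size] for i in range(0, len(queries), size)]
    # Threads keep the sketch simple; CPU-bound pure-Python work would need
    # a process pool or separate machines to see a real speed-up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(nearest_for_chunk, chunks, [references] * len(chunks)))
    return [neighbor for chunk_result in results for neighbor in chunk_result]


if __name__ == "__main__":
    refs = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
    qs = [(1.0, 1.0), (9.0, 9.0), (4.0, 6.0)]
    print(all_nearest_neighbors(qs, refs, workers=2))
```

Because each query's answer depends only on the reference set, the chunks need no communication with one another, which is what makes this the easy direction to parallelize. Splitting the reference set as well would require merging partial answers across workers.
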

Data-Intensive Computing and Networking
We are developing approaches to data-intensive computing that account for a current bottleneck: massive datasets must be sent over the network to clusters or clouds of computers.