Industrial Tools
Man-made data pose difficult challenges in both representation and scalability.

Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine
Shuang Hao, Nick Feamster, Alexander Gray, Nadeem Syed, and Sven Krasser
USENIX Security Symposium 2009

We demonstrated the ability to perform automatic spam blacklisting without examining email content at all -- instead, looking at senders' spatio-temporal activities.

Abstract: Users and network administrators need ways to filter email messages based primarily on the reputation of the sender. Unfortunately, conventional mechanisms for sender reputation -- notably, IP blacklists -- are cumbersome to maintain and easy to evade. This paper investigates ways to infer the reputation of an email sender based solely on network-level features, without looking at the contents of a message. First, we study first-order properties of network-level features that may help distinguish spammers from legitimate senders. We examine features that can be ascertained without ever looking at a packet's contents, such as the distance in IP space to other email senders or the geographic distance between sender and receiver. We derive features that are lightweight, since they do not require seeing a large amount of email from a single IP address and can be gleaned without looking at an email's contents -- many such features are apparent from even a single packet. Second, we incorporate these features into a classification algorithm and evaluate the classifier's ability to automatically classify email senders as spammers or legitimate senders. We build an automated reputation engine, SNARE, based on these features using labeled data from a deployed commercial spam-filtering system. We demonstrate that SNARE can achieve comparable accuracy to existing static IP blacklists: about a 70% detection rate for less than a 0.3% false positive rate. Third, we show how SNARE can be integrated into existing blacklists, essentially as a first-pass filter.
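
The two example features named in the abstract, distance in IP space to other senders and geographic distance between sender and receiver, can be illustrated with a minimal sketch. This is not SNARE's implementation: the function names, the choice of averaging over the k nearest senders, and the use of the haversine formula are all assumptions made here for illustration, and the paper's exact feature definitions may differ.

```python
import ipaddress
import math


def ip_space_distance(sender_ip, other_ips, k=3):
    """Mean numeric distance from sender_ip to its k nearest other senders.

    Hypothetical helper: treats IPv4 addresses as 32-bit integers and
    averages the k smallest absolute differences (an assumed definition).
    """
    if not other_ips:
        raise ValueError("need at least one other sender address")
    sender = int(ipaddress.IPv4Address(sender_ip))
    distances = sorted(
        abs(sender - int(ipaddress.IPv4Address(other))) for other in other_ips
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def geo_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres using the haversine formula.

    Hypothetical helper: assumes sender and receiver coordinates are
    already known (for example, from an IP geolocation lookup).
    """
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


if __name__ == "__main__":
    # Made-up addresses and coordinates, for illustration only.
    neighbours = ["10.0.0.1", "10.0.0.9", "10.0.1.5"]
    print(ip_space_distance("10.0.0.5", neighbours, k=2))
    print(geo_distance_km(33.75, -84.39, 37.77, -122.42))
```

Both values can be computed from connection metadata alone, which is the sense in which the abstract calls such features lightweight; a classifier would take them as inputs alongside other network-level features.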

@incollection{hao2009snare,
  title = "{Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine}",
  author = "Shuang Hao and Nick Feamster and Alexander Gray and Nadeem Syed and Sven Krasser",
  booktitle = "Proceedings of the Eighteenth USENIX Security Symposium",
  year = "2009"
}
See elsewhere
Machine Learning in Relational Databases
Most of the world's data is business data, which mostly lives in relational databases. We have developed the first scheme for performing scalable machine learning analyses inside relational databases.
In progress
A Research Document Search Engine
We are developing new methods for text analysis, including topic modeling, in the context of a system for retrieval and visualization of research papers.
Nonlinear Recommendation Systems
Recommendation systems are mostly based on linear dimension reduction methods. We are developing an approach to recommender systems based on more powerful machine learning methods.
Data-Intensive Computing and Networking
We are developing approaches to data-intensive computing which account for the fact that massive datasets must be sent over the network to clusters or clouds of computers, a current bottleneck.