Privacy Breaches in Privacy-Preserving Data Mining

Johannes Gehrke
Cornell University

Friday, April 2, 2004
EGRC 313 -- NCSU Centennial Campus

The exponential growth in the amount of digital data has resulted in the creation of databases of unprecedented scale. At the same time, concerns about the privacy of personal information have emerged globally. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? This talk will survey recent results on privacy-preserving data mining, concentrating on the class of solutions where each party randomizes its data before sending it to a central server for building the model. We show that simple randomization methods can be exploited to find privacy breaches, and we analyze the nature of these privacy breaches. We then propose a class of randomization operators with strong privacy guarantees and introduce a general property for any randomization operator that limits privacy breaches. This is joint work with Rakesh Agrawal, Alexandre Evfimievski, and Ramakrishnan Srikant.

About the speaker: Johannes Gehrke is an Assistant Professor in the Department of Computer Science at Cornell University. His research interests are in the areas of data mining, data stream processing, and sensor networks. He has received an NSF CAREER Award, an Alfred P. Sloan Fellowship, an IBM Faculty Award, and the Cornell College of Engineering James and Mary Tien Excellence in Teaching Award. He co-authored the undergraduate textbook Database Management Systems (McGraw-Hill, 2002), and he serves as the Program co-Chair for the SIGKDD 2004 conference.

The talk is sponsored by the E-Commerce @ NC State initiative.

Please send your comments to Rada Chirkova