IAS Seminar Series on Big Data

Making Sense of Big Data with the Berkeley Data Analytics Stack

Abstract

The Berkeley AMPLab is creating a new approach to data analytics. Launched in early 2011, the lab aims to seamlessly integrate the three main resources available for making sense of data at scale: Algorithms (machine learning and statistical techniques), Machines (in the form of scalable clusters and elastic cloud computing), and People (both individually as analysts and in crowds). The lab is realizing its ideas through the development of a freely available open source software stack called BDAS: the Berkeley Data Analytics Stack. In the nearly four years the lab has been in operation, the speaker and his research group have released major components of BDAS. Several of these components have gained significant traction in industry and elsewhere: the Mesos cluster resource manager, the Spark in-memory computation framework, and the Shark query processing system. BDAS features prominently in many industry discussions of the future of the Big Data analytics ecosystem – a rare degree of impact for an ongoing academic project.

Given this initial success, the lab is continuing on its research path, moving "up the stack" to better integrate and support advanced analytics and to make people a full-fledged resource for making sense of data. In this talk, the speaker will first outline the motivation and insights behind his research approach and describe how the research group has organized to address the cross-disciplinary nature of Big Data challenges. He will then describe the current state of BDAS with an emphasis on the group’s newest efforts, including some or all of: the GraphX graph processing system, the MLBase machine learning platform, and the SampleClean framework for combining sampling and hybrid human/computer data cleaning. Finally, he will present his current views of how all the pieces will fit together to form a system that can adaptively bring the right resources to bear on a given data-driven question to meet time, cost, and quality requirements throughout the analytics lifecycle.


About the speaker

Prof. Michael Franklin received his PhD in Computer Sciences from the University of Wisconsin-Madison in 1993. He was on the faculty of the University of Maryland from 1993 to 2001. He is currently the Thomas M. Siebel Professor of Computer Science, Chair of the Computer Science Division, and Director of the Algorithms, Machines and People Lab (AMPLab) at the University of California at Berkeley.

Prof. Franklin works primarily in the Database and Operating Systems and Networking Technology areas. The AMPLab, which he directs, specializes in data management, cloud computing, statistical machine learning, and other topics necessary for making sense of vast amounts of varied and unruly data. It currently works with 23 industrial sponsors, including founding sponsors Amazon Web Services, Google, and SAP, and has received a National Science Foundation CISE "Expeditions in Computing" Award, which was announced as part of the White House Big Data Research initiative in 2012.

Prof. Franklin has received numerous awards, including the ACM SIGMOD "Test of Time" Award, the IBM Faculty Award, the Siemens Faculty Development Award, and the US National Science Foundation CAREER Award. He is a Fellow of the Association for Computing Machinery.
