Compact System


Friday, 15 July 2011

Google Americas Faculty Summit Day 1: Cluster Management

Posted on 11:32 by Unknown
Posted by John Wilkes, Principal Software Engineer

On July 14 and 15, we held our seventh annual Faculty Summit for the Americas, with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

At this year’s Faculty Summit, I had the opportunity to provide a glimpse into the world of cluster management at Google. My goal was to brief the audience on the challenges of this complex system and explain a few of the research opportunities that these kinds of systems provide.

First, a little background. Google’s fleet of machines is spread across many data centers, each of which consists of a number of clusters (a cluster is a set of machines with a high-speed network between them). Each cluster is managed as one or more cells. A user (in this case, a Google engineer) submits jobs to a cell for it to run. A job could be a service that runs for an extended period, or a batch job such as a MapReduce that updates an index.
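The hierarchy described above (data center, cluster, cell, job) can be pictured as a small data model. This is only an illustrative sketch: the class names, fields and the `submit` method are assumptions made for the example, not Google's actual data structures or API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class JobKind(Enum):
    SERVICE = "service"  # runs for an extended period
    BATCH = "batch"      # e.g. a MapReduce updating an index


@dataclass
class Job:
    name: str
    kind: JobKind
    owner: str  # hypothetical field: the engineer who submitted the job


@dataclass
class Cell:
    name: str
    jobs: List[Job] = field(default_factory=list)

    def submit(self, job: Job) -> None:
        """Accept a job for this cell to run (hypothetical method)."""
        self.jobs.append(job)


@dataclass
class Cluster:
    # A set of machines with a high-speed network between them,
    # managed as one or more cells.
    name: str
    cells: List[Cell] = field(default_factory=list)


@dataclass
class DataCenter:
    name: str
    clusters: List[Cluster] = field(default_factory=list)


# Usage: one data center, one cluster, one cell, two jobs.
cell = Cell("cell-a")
dc = DataCenter("dc-1", [Cluster("cluster-1", [cell])])
cell.submit(Job("frontend", JobKind.SERVICE, "alice"))
cell.submit(Job("index-update", JobKind.BATCH, "bob"))
```

The point of the sketch is only the containment relationship: users talk to a cell, and the cell is the unit that decides where a job's work runs.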

Cluster management operates on a very large scale: whereas a storage system that can hold a petabyte of data is considered large by most people, our storage systems will send us an emergency page when they have only a few petabytes of free space remaining. This scale gives us opportunities (e.g., a single job may use several thousand machines at a time), but also many challenges (e.g., we constantly need to worry about the effects of failures). The cluster management system juggles the needs of a large number of jobs in order to achieve good utilization, trying to strike a balance between a number of conflicting goals.

To complicate things, data centers can have multiple types of machines, different network and power-distribution topologies, a range of OS versions and so on. We also need to handle changes, such as rolling out a software or a hardware upgrade, while the system is running.

Our current cluster management system is about seven years old now (several generations for most Google software) and, although it has been a huge success, it is beginning to show its age. We are currently prototyping a new system that will replace it; most of my talk was about the challenges we face in building this system. We are building it to handle larger cells, to look into the future (by means of a calendar of resource reservations) to provide predictable behavior, to support failures as a first-class concept, to unify a number of today’s disjoint systems and to give us the flexibility to add new features easily. A key goal is that it should provide predictable, understandable behavior to users and system administrators. For example, the latter want to know answers to questions like “Are we in trouble? Are we about to be in trouble? If so, what should we do about it?”
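To make the idea of a calendar of resource reservations concrete, here is a minimal sketch of an admission check: a new reservation is accepted only if the cell's capacity is never exceeded while it is active. The `Reservation` type, the integer time units, the single CPU dimension and the function names are all assumptions for illustration; the post does not describe how the real system does this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reservation:
    start: int  # inclusive, in arbitrary time units
    end: int    # exclusive
    cpus: int   # resources held for the whole interval


def peak_usage(reservations: List[Reservation], start: int, end: int) -> int:
    """Return the highest total CPU demand at any instant in [start, end)."""
    # Demand only rises at the start of a reservation, so it is enough to
    # sample at `start` and at every reservation start inside the window.
    points = {start} | {r.start for r in reservations if start <= r.start < end}
    return max(
        sum(r.cpus for r in reservations if r.start <= t < r.end)
        for t in points
    )


def fits(calendar: List[Reservation], new: Reservation, capacity: int) -> bool:
    """True if adding `new` never pushes demand above `capacity`."""
    return peak_usage(calendar + [new], new.start, new.end) <= capacity


# Usage: a 100-CPU cell with two existing reservations.
calendar = [Reservation(0, 10, 60), Reservation(5, 15, 30)]
print(fits(calendar, Reservation(8, 12, 20), capacity=100))   # overlaps both
print(fits(calendar, Reservation(10, 20, 20), capacity=100))  # overlaps one
```

Because the check looks at future intervals rather than current load, a request can be rejected or deferred before it ever starts, which is one way to get the predictable behavior mentioned above.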

Putting all this together requires advances in a great many areas. I touched on a few of them, including scheduling and ways of representing and reasoning about user intentions. One of the areas that I think doesn’t receive nearly enough attention is system configuration: describing how systems should behave, how they should be set up, how those setups should change, and so on. Systems at Google typically rely on dozens of other services and systems. It’s vital to simplify the process of making controlled changes to configurations, so that they result in predictable outcomes every time, even in the face of heterogeneous infrastructure environments and constant flux.
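One simple way to picture a "controlled change" to a configuration is to compute an explicit diff between the current and the desired state, review it, and only then apply it. The sketch below assumes a configuration is a flat dictionary of settings; the function names and example keys are hypothetical and are not taken from any Google system.

```python
from typing import Any, Dict


def diff_config(current: Dict[str, Any], desired: Dict[str, Any]) -> Dict[str, dict]:
    """Describe what must change to turn `current` into `desired`."""
    added = {k: desired[k] for k in desired.keys() - current.keys()}
    removed = {k: current[k] for k in current.keys() - desired.keys()}
    changed = {
        k: (current[k], desired[k])
        for k in current.keys() & desired.keys()
        if current[k] != desired[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def apply_diff(current: Dict[str, Any], diff: Dict[str, dict]) -> Dict[str, Any]:
    """Return a new configuration with `diff` applied; `current` is not mutated."""
    result = dict(current)
    for key in diff["removed"]:
        del result[key]
    result.update(diff["added"])
    for key, (_old, new) in diff["changed"].items():
        result[key] = new
    return result


# Usage: inspect the planned change before applying it.
current = {"replicas": 3, "os_version": "v1", "debug": True}
desired = {"replicas": 5, "os_version": "v1", "region": "us-east"}
plan = diff_config(current, desired)
print(plan)
updated = apply_diff(current, plan)
```

Keeping the plan as data separates deciding what to change from carrying it out, so the same plan can be reviewed, logged or dry-run against different environments before anything is touched.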

We’ll be taking steps toward these goals ourselves, but the intent of today’s discussion was to encourage people in the academic community to think about some of these problems and come up with new and better solutions, thereby raising the level for us all.



