February 2011 ~ Compact System

Monday, 28 February 2011

Slicing and dicing data for interactive visualization

Posted on 13:30 by Benjamin Yolken, Google Public Data Product Manager

A year ago, we introduced the Google Public Data Explorer, a tool that allows users to interactively explore public-interest datasets from a variety of influential sources like the World Bank, IMF, Eurostat, and the US Census Bureau. Today, users can visualize over 300 metrics across 31 datasets, including everything from labor productivity (OECD) to Internet speed (Ookla) to gender balance in parliaments (UNECE) to government debt levels (IMF) to population density by municipality (Statistics Catalonia), with more data being added every week.

Last week, as part of the launch of our dataset upload interface, we released one of the key pieces of technology behind the product: the Dataset Publishing Language (DSPL). We created this format to address a key problem in the Public Data Explorer and other, similar tools, namely, that existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users.

DSPL addresses this by adding a layer of metadata on top of the raw, tabular data in a dataset. This metadata, expressed in XML, describes the concepts in the dataset, for instance “country”, “gender”, “population”, and “unemployment”, giving descriptions, URLs, formatting properties, etc. for each. These concepts are then referenced in slices, which partition the concepts into dimensions (i.e., categories) and metrics (i.e., quantitative values) and link them with the underlying data tables (provided in CSV format). This structure, along with some additional metadata, is what allows us to provide rich, interactive dataset visualizations in the Public Data Explorer.
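To make the concept and slice idea concrete, here is a minimal sketch in Python. It is not the DSPL XML schema or any Google API: the class names `Concept` and `Slice`, the helper functions, and the tiny CSV table with made-up numbers are all assumptions chosen for illustration. It only shows how tagging columns as dimensions or metrics lets a tool aggregate a table without knowing anything else about it.

```python
from dataclasses import dataclass
import csv
import io


@dataclass
class Concept:
    """A named idea in the dataset, e.g. "country" or "population" (hypothetical class)."""
    id: str
    description: str


@dataclass
class Slice:
    """Groups concept ids into dimensions and metrics and points at a CSV table."""
    id: str
    dimensions: list  # concept ids used as categories
    metrics: list     # concept ids used as quantitative values
    table_csv: str    # raw CSV text standing in for the data table

    def rows(self):
        """Yield each CSV row split into (dimension values, metric values)."""
        for record in csv.DictReader(io.StringIO(self.table_csv)):
            dims = {d: record[d] for d in self.dimensions}
            vals = {m: float(record[m]) for m in self.metrics}
            yield dims, vals


concepts = {
    "country": Concept("country", "Country name"),
    "year": Concept("year", "Calendar year"),
    "population": Concept("population", "Total population"),
}

# Made-up numbers, for illustration only.
population_slice = Slice(
    id="population_by_country",
    dimensions=["country", "year"],
    metrics=["population"],
    table_csv="country,year,population\nA,2009,100\nA,2010,110\nB,2010,50\n",
)


def undefined_concepts(slice_, known):
    """Return concept ids the slice references but the dataset never defines."""
    return [c for c in slice_.dimensions + slice_.metrics if c not in known]


def total(slice_, metric, **fixed):
    """Sum a metric over all rows whose dimensions match the fixed values."""
    return sum(
        vals[metric]
        for dims, vals in slice_.rows()
        if all(dims[k] == v for k, v in fixed.items())
    )
```

For example, `total(population_slice, "population", year="2010")` sums the population metric over every country for that year, using only the dimension and metric labels to interpret the table.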

With the release of DSPL, we hope to accelerate the process of making the world’s datasets searchable, visualizable, and understandable, without requiring a PhD in statistics. We encourage you to read more about the format and try it yourself, both in the Public Data Explorer and in your own software. Stay tuned for more DSPL extensions and applications in the future!

Posted in datasets, Public Data Explorer

Friday, 25 February 2011

Where does my data live?

Posted on 14:45 by Daniel Ford, Senior Mathematician

Have you ever wondered what happens when you upload a photo to Picasa, or where all your Gmail messages or YouTube videos are stored? How is it that you can read or watch them from anywhere at any time?

If you stored your data on a single hard disk, like the one in your personal computer, then the disk would eventually fail and your data would be lost forever. If you want to protect your data from the possibility of such a failure, you can store copies across many different disks so that if any one fails then you just access the data from another.

However, once storage systems get large enough, anything and everything can and does go wrong. You have to plan not just for disk failures but for server, network, and entire datacenter failures. Add to this software bugs and maintenance operations and you have a whole lot more failures.

Using measurements from dozens of Google data centers, we found that almost-simultaneous failure of many servers in a data center has the greatest impact on availability. On the other hand, disk failures have relatively little impact because our systems are specifically designed to cope with these failures.

Once you have a model of failures, you can also look at the impact of various design choices. Where exactly should you place your data replicas? How fast do you need to recover from losing a disk or server? What encoding scheme or number of replicas of the data is enough, given a desired level of availability? For example, we found that storing data across multiple data centers reduces data unavailability by many orders of magnitude compared to having the same number of replicas in a single data center. The added complexity and potential for slower recovery times are worth it to get better availability, or use less storage space, or even both at the same time.
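The intuition behind the multiple data center result can be sketched with a toy calculation. This is not the statistical model from the paper: it assumes each node fails independently with probability `p_node`, each data center fails as a whole with probability `p_dc`, and the example probabilities are invented purely for illustration.

```python
def unavailability_single_dc(n, p_node, p_dc):
    """All n replicas sit in one data center.

    Data is unavailable if the whole data center is down, or if it is up
    but every replica's node has failed independently.
    """
    return p_dc + (1 - p_dc) * p_node ** n


def unavailability_multi_dc(n, p_node, p_dc):
    """Each of the n replicas sits in a different data center.

    A replica is unreachable if its data center is down or its node has
    failed; data is unavailable only if all n replicas are unreachable.
    """
    p_replica = p_dc + (1 - p_dc) * p_node
    return p_replica ** n


if __name__ == "__main__":
    # Hypothetical failure probabilities, not measured values.
    n, p_node, p_dc = 3, 0.01, 0.001
    print("single data center:", unavailability_single_dc(n, p_node, p_dc))
    print("one replica per data center:", unavailability_multi_dc(n, p_node, p_dc))
```

In the single data center case the correlated failure term `p_dc` dominates no matter how many replicas are added, whereas spreading replicas out requires several data centers to be unreachable at once.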

As you can see, something as simple as storing your photos, mail, or videos becomes a lot more involved when you want to be sure it's always available.

In our paper, Availability in Globally Distributed Storage Systems, we characterize the availability of cloud storage systems, based on extensive monitoring of Google's main storage infrastructure, and the sources of failure which affect availability. We also present statistical models for reasoning about the impact of design choices such as data placement, recovery speed, and replication strategies, including replication across multiple data centers.

Posted in Publications

A Runtime Solution for Online Contention Detection and Response

Posted on 07:45 by Jason Mars, Software Engineering Intern

In our recent paper, Contention Aware Execution: Online Contention Detection and Response, we have made a big step forward in addressing an important and pressing problem in the field of Computer Science today. This work appears in the 2010 Proceedings of the International Symposium on Code Generation and Optimization (CGO) and was awarded the CGO 2010 Best Presentation Award at the conference.

One of the greatest challenges when using multicore processors arises when critical resources, such as the on-chip caches, are shared by multiple executing programs. If these programs simultaneously place heavy demands on shared resources, they may be forced to "take turns," and as a result, unpredictable and abrupt slowdowns may occur. This unexpected "cross-core interference" is especially problematic for the latency-sensitive applications found in Google's datacenters, such as web search. The commonly used solution is to dedicate separate machines to each application; however, this leaves the processing capabilities of multicore processors underutilized. In our work, we present the Contention Aware Execution Runtime (CAER) environment, a lightweight runtime solution that minimizes cross-core interference while maximizing utilization. CAER leverages the ubiquitous performance monitoring capabilities present in current state-of-the-art multicore processors to infer and respond to cross-core interference and requires no added hardware support. Our experiments show that when using our CAER system, we are able to increase the utilization of the multicore CPU by 58% on average. Meanwhile, CAER brings the performance penalty due to allowing co-location from 17% down to just 4% on average.
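The general detect-and-respond loop can be illustrated with a small simulation. This is not the heuristic used in CAER: the function name, the fixed threshold, and the cooldown rule are assumptions made for this sketch, and the miss counts are a plain list standing in for values a real runtime would read from hardware performance counters.

```python
def throttle_decisions(miss_samples, threshold, cooldown=2):
    """Decide, after each sampling period, whether to pause the batch job.

    miss_samples: shared-cache miss counts observed for the
        latency-sensitive application in each period (hypothetical input;
        a real system would read hardware performance counters).
    threshold: miss count above which contention is assumed.
    cooldown: number of periods to keep the batch job paused after the
        most recent detection.
    Returns a list of booleans, True meaning "batch job paused".
    """
    decisions = []
    pause_left = 0
    for misses in miss_samples:
        if misses > threshold:
            # Contention suspected: (re)start the pause window.
            pause_left = cooldown
        paused = pause_left > 0
        decisions.append(paused)
        if paused:
            pause_left -= 1
    return decisions


if __name__ == "__main__":
    # Invented miss counts with a burst in the middle.
    samples = [10, 12, 95, 90, 11, 10, 9]
    print(throttle_decisions(samples, threshold=50))
```

The batch job runs freely while the latency-sensitive application's miss counts stay low, is paused when they spike, and resumes once the counts have stayed below the threshold for the cooldown window.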

Posted in Publications, resource optimization

Tuesday, 22 February 2011

Congratulations to Ken Thompson

Posted on 15:00 by Bill Coughran, Senior Vice President of Engineering

I’m happy to share that Ken Thompson has been chosen as the recipient of the prestigious Japan Prize. The Japan Prize is bestowed for achievements in science and technology that promote the peace and prosperity of mankind.

Ken was awarded the prize along with Dennis Ritchie for their development of the UNIX operating system in 1969 while at Bell Labs. UNIX changed the direction of computing as a whole and paved the way for the development of personal computers and the server systems that power the Internet.

It’s an enormous source of pride for us to have such amazing talent working here and Ken continues to serve as an inspiration to the rest of us. We’re excited to see what Ken will come up with next.

You can read the full press release here.

Posted in Awards, UNIX

Thursday, 17 February 2011

Query Language Modeling for Voice Search

Posted on 14:15 by Ciprian Chelba, Research Scientist

About three years ago we set a goal to enable speaking to the Google Search engine on smartphones. On the language modeling side, the motivation was that we had access to large amounts of typed text data from our users. At the same time, that meant that the users also had a clear expectation for how they would interact with a speech-enabled version of the Google Search application.

The challenge lay in the scale of the problem and the perceived sparsity of the query data. Our paper, Query Language Modeling for Voice Search, describes the approach we took, and the empirical findings along the way.

Besides data availability, the project succeeded due to our excellent computational platform, the culture built around teams that wholeheartedly tackle such challenges with the conviction that they will set a new bar, and a collaborative mindset that leverages resources across the company. In this case we used training data made available by colleagues working in query spelling correction, query stream sampling procedures devised for search quality evaluation, the open finite state tools, and distributed language modeling infrastructure built for machine translation.

Perhaps the most satisfying part of this research project was its impact on the end-user: when presenting the poster at SLT 2010 in Berkeley I offered to demo Google Voice Search, and often got the answer “Thanks, I already use it!”.

Posted in Publications, Voice Search
