Compact System


Thursday, 22 December 2011

Academic Successes in Cluster Computing

Posted by Alfred Spector, VP of Research

Access to massive computing resources is foundational to research and development. Fifteen awardees of the National Science Foundation (NSF) Cluster Exploratory Service (CLuE) program have been applying large-scale computational resources donated by Google and IBM.

Overall, 1,328 researchers have used the cluster to perform over 120 million computing tasks and, in the process, have published 49 scientific papers, educated thousands of students in parallel computing and supported numerous post-doctoral candidates in their academic careers. Researchers have applied the program to fields as diverse as astronomy, oceanography and linguistics. Besides validating MapReduce as a useful tool in academic research, the program has also generated significant scientific knowledge.

Three years later, there are many viable, affordable alternatives to the Academic Cloud Computing Initiative, so we have decided to bring our part of the program to a close. The program was state of the art four years ago when it started; today, academic cloud computing is a worldwide phenomenon, with many low-cost options available. It has been a great opportunity to collaborate with IBM, the NSF and the many universities involved in this program.

Friday, 9 December 2011

Measuring Ad Effectiveness Using Geo Experiments

Posted by Lizzy Van Alstine and Jon Vaver, Quantitative Analysis Team

Advertisers want to be able to measure the effectiveness of their advertising. Many methods have been used to address this need, but the most rigorous and trusted of these are randomized experiments, which involve randomly assigning experimental units to control and test conditions. At Google, we have found that randomized geo experiments are a powerful approach to measuring the effectiveness of advertising.

Many advertising platforms allow advertising to be targeted by geographical region. In these experiments, we first assign geographic regions to test or control conditions and employ AdWords’ geo-targeted advertising capabilities to increase or decrease the regional advertising spend accordingly. The use of randomized assignments guards against potential hidden test/control biases that could impact the measurements. Our approach also accounts for seasonal changes that impact the volume and cost of advertising across the length of the experiment.
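This randomize-then-regress approach can be illustrated with a toy simulation (this is not the team's actual model; the region count, spend levels and the true return-on-ad-spend value below are all invented, and NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

n_regions = 100
# Pre-period sales for each geo region (synthetic, arbitrary scale).
pre_sales = rng.lognormal(mean=10, sigma=0.5, size=n_regions)

# Randomized assignment guards against hidden test/control biases.
is_test = rng.permutation(np.repeat([True, False], n_regions // 2))

# Test regions receive an incremental ad spend proportional to their size.
delta_spend = np.where(is_test, 0.01 * pre_sales, 0.0)

# Ground truth for the simulation: each extra ad dollar returns $3 in sales.
true_roas = 3.0
test_sales = (1.05 * pre_sales + true_roas * delta_spend
              + rng.normal(scale=0.01 * pre_sales))

# Regress test-period sales on pre-period sales (absorbing regional size
# and seasonal level shifts) and on the spend change; the spend
# coefficient estimates the return on ad spend.
X = np.column_stack([np.ones(n_regions), pre_sales, delta_spend])
coef, *_ = np.linalg.lstsq(X, test_sales, rcond=None)
print(f"estimated ROAS: {coef[2]:.2f}")  # close to 3.0
```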

In this paper, we describe the application of geo experiments for measuring the impact of advertising on consumer behavior (e.g. clicks, conversions and downloads). This description includes the results of a geo experiment that our research team ran for a Google advertiser.

Thursday, 8 December 2011

ACM Fellows for 2011

Posted by Alfred Spector, Google Research

Cross-posted with the Official Google Blog

Congratulations to three Googlers elected ACM Fellows

It gives me great pleasure to share that the Association for Computing Machinery (ACM) has announced that three Googlers have been elected ACM Fellows in 2011. The ACM is the world’s largest educational and scientific computing society, and the Fellows Program celebrates the exceptional contributions of leaders in the computing field. This year the society has selected Amit Singhal, Peter S. Magnusson and Amin Vahdat for their outstanding work, which has provided fundamental knowledge to the field.

The recently-named Fellows join 14 prior Googler ACM Fellows and other professional society honorees in exemplifying our extraordinarily talented people. On behalf of Google, I congratulate our colleagues. They embody Google’s commitment to innovation with impact, and I hope that they’ll serve as inspiration to students as well as the broader community of computer scientists.

You can read more detailed summaries of their achievements below, including the official citations from the ACM.

Dr. Amit Singhal, Google Fellow

For contributions to search and information retrieval

Since 2000, Dr. Amit Singhal has been pioneering search as the technical lead for Google's core search algorithms. He is credited with most of the information retrieval design decisions in Google Search – a massive system that has responded to hundreds of billions of queries. More than anyone, Amit has a deep understanding of Google’s entire algorithmic system. He is responsible for prioritization and has overseen the development of numerous algorithmic signals and their progression over time. He is the clear thought and managerial leader who has led critically important initiatives at the company. Among many other things, Amit catalyzed Universal Search, which returns multi-modal results from all available corpora; he was the force behind Realtime Search, which returns results from dynamic corpora with low latency; and he championed Google Instant, which returns search results as the user types.

Prior to joining Google, Amit had a prolific publication record, averaging five publications per year from 1996 to 1999 while at AT&T Labs. Since then, you could say Google Search has been one long, sustained publication demonstrating constant advancement in the state of the art of information retrieval.


Peter S. Magnusson, Engineering Director

For contributions to full-system simulation

Peter has made a tremendous impact by driving full-system simulation. His approach was so advanced that it could be used in real-world production of commercial CPUs and in prototyping of system software. Starting in 1991, Peter began to challenge the notion that simulators could not be made fast enough to run large workloads, nor accurate enough to run commercial operating systems. His innovations in simulator design culminated in Simics, the first academic simulator that could boot and run commercial multiprocessor workloads. Simics saw huge academic success and has been used to run simulations for research presented in several hundred subsequent publications.

Peter founded Virtutech in 1998 to commercially develop Simics, and he ultimately forged and became the leader in a new market segment for software tools. With Peter at the helm, Virtutech pushed Simics beyond several performance barriers to make it the first simulator to exceed 1 billion instructions per second and the first simulator to model over 1,000 processors. Peter joined Google in 2010 to work on cloud computing.


Dr. Amin Vahdat, Principal Engineer

For contributions to data center scalability and management

Amin’s work made an impact at Google long before he arrived here. Amin is known for conducting research through bold, visionary projects that combine creativity with careful consideration of the engineering constraints needed to make them viable in real-world settings. Amin’s infrastructure ideas have underpinned the shift in the computing field from the pure client-server paradigm to a landscape in which major web services are hosted “in the cloud” across multiple data centers. In addition to pioneering “third-party cloud computing” through his work on WebOS and Rent-A-Server in the mid-90s, Amin has made important advancements in managing wide-area consistency between data centers, scalable modeling of data center applications, and building scalable data center networks.

Amin’s innovations have penetrated and broadly influenced the networking community within academia and industry, including Google, and his research has been recapitulated and expanded upon in a number of publications. Conferences that formerly did not even cover data centers now have multiple sessions covering variants of what Amin and his team have proposed. At Google, Amin continues to drive next-generation data center infrastructure focusing on Software Defined Networking and new opportunities from optical technologies. This is emblematic of Amin’s ability to build real systems, and perhaps more significantly, convince people of their value.
Posted in ACM, Awards

Tuesday, 6 December 2011

Our second round of Google Research Awards for 2011

Posted by Maggie Johnson, Director of Education & University Relations

We’ve just finished the review process for the latest round of the Google Research Awards, which provide funding to full-time faculty working on research in areas of mutual interest with Google. We are delighted to be funding 119 awards across 21 different focus areas for a total of $6 million. The subject areas that received the highest level of support this time were systems and infrastructure, human-computer interaction, social, and mobile. In addition, 24% of the funding was awarded to universities outside the U.S.

One way in which we measure the impact of the research award program is through surveys of Principal Investigators (PIs) and their Google sponsors (a Googler with whom grantees can discuss research directions, provide progress updates, engage in knowledge transfer, etc.). Here are some highlights from our most recent survey, covering projects funded over the last two years:

  • 433 papers were published as a result of a Google research award
  • 126 projects made data sets or software publicly available
  • 63 research talks were given by sponsored PIs at Google offices

An important aspect of the program is that it often gives early career academics a head start on their research agenda. Many new PIs commented on how a Google research award allowed them to explore their initial ideas and build a foundation for obtaining more significant funding from other sources. This type of seed funding is especially hard to get in the current economic environment.

The goal of the research award program is to initiate and sustain strong collaborations with our academic colleagues. The collaborations take many forms, from working on a project together, to co-writing a paper, to coming to Google to give a research talk. Whatever the form, the most important aspect is building strong relationships that last. Case in point, many of our focused awards (multi-year, unrestricted grants that include access to Google’s tools, technology and expertise) started as Google research awards.

Congratulations to the well-deserving recipients of this round’s awards, and if you are interested in applying for the next round (deadline is April 15), please visit our website for more information.
Posted in Research Awards, University Relations

Friday, 2 December 2011

2011 Google China Faculty Summit in Hangzhou

Posted by Aimin Zhu, University Relationship Manager, Google China

We just wrapped up a highly successful 2011 Google China Faculty Summit in Hangzhou, China. On November 17 and 18, Googlers from China and the U.S. gathered with more than 80 faculty members representing more than 45 universities and institutes, including Tsinghua University, Peking University and The Chinese Academy of Sciences. The two-day event revolved around the theme of “Communication, Exploration and Expansion,” with day one covering research and day two focusing on academic development.

The summit provided a unique setting for both sides to share the results of their research and exchange ideas. Speakers included:

  • Maggie Johnson, director of education and university relations at Google, presenting on innovation in Google research and global university relations programs,
  • Dr. Boon-Lock Yeo, head of engineering and research for Google China, providing an overview of innovation in China engineering and corporate social responsibility efforts and accomplishments, and
  • Prof. Edward Chang, director of research for Google China, delivering a keynote on mobile information management and retrieval.

The discussions on November 17 focused on two tracks, mobile computing and natural language processing, while discussions on November 18 focused on curriculum development with a special focus on Android app development. The attendees also spent time discussing joint research and development between universities and industry.

This summit is part of a continuing effort to collaborate with Chinese universities in order to support education in China. Click here for a list of the variety of education programs we have launched there in recent years. We look forward to expanding partnership opportunities in the future.
Posted in China, Education, University Relations

Tuesday, 29 November 2011

More Google Cluster Data

Posted by John Wilkes, Principal Software Engineer

Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.

In support of this we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (research blog on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.

Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:

  • the original resource requests, to permit scheduling experiments
  • request constraints and machine attributes
  • machine availability and failure events
  • some of the reasons for task exits
  • (obfuscated) job and job-submitter names, to help identify repeated or related jobs
  • more types of usage information
  • CPI (cycles per instruction) and memory traffic for some of the machines


Note that this trace primarily provides data about resource requests and usage. It contains no information about end users, their data, or access patterns to storage systems and other services.
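To give a flavor of how such a trace might be analyzed, here is a sketch that aggregates per-job CPU requests from a simplified, made-up slice of a task-events table. The real trace ships as compressed CSV shards with a richer schema described on the access site; the column names and values below are illustrative only:

```python
import csv
import io
from collections import defaultdict

# A hypothetical, simplified slice of a task-events table.
sample = """timestamp,job_id,task_index,event_type,cpu_request,memory_request
600000000,6251,0,SUBMIT,0.125,0.0596
600000000,6251,1,SUBMIT,0.125,0.0596
601000000,6252,0,SUBMIT,0.250,0.1192
605000000,6251,0,SCHEDULE,0.125,0.0596
"""

# Sum the requested CPU across a job's tasks, counting each task once
# at submission time (SCHEDULE events repeat the request fields).
requested_cpu = defaultdict(float)
for row in csv.DictReader(io.StringIO(sample)):
    if row["event_type"] == "SUBMIT":
        requested_cpu[row["job_id"]] += float(row["cpu_request"])

for job, cpu in sorted(requested_cpu.items()):
    print(job, round(cpu, 3))
# prints:
# 6251 0.25
# 6252 0.25
```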

More information can be found via this link, which will (after a short questionnaire) take you to a site that provides access instructions, a description of the data schema, and information about how the data was derived and its meaning.

We hope this data will facilitate a range of research in cluster management. Let us know if you find it useful, are willing to share tools that analyze it, or have suggestions for how to improve it.
Posted in datasets

Wednesday, 2 November 2011

Discovering Talented Musicians with Acoustic Analysis

Posted by Charles DuHadway, YouTube Slam Team, Google Research

In an earlier post we talked about the technology behind Instant Mix for Music Beta by Google. Instant Mix uses machine hearing to characterize music attributes such as its timbre, mood and tempo. Today we would like to talk about acoustic and visual analysis -- this time on YouTube. A fundamental part of YouTube's mission is to allow anyone anywhere to showcase their talents -- occasionally leading to life-changing success -- but many talented performers are never discovered. Part of the problem is the sheer volume of videos: forty-eight hours of video are uploaded to YouTube every minute (that’s eight years of content every day). We wondered if we could use acoustic analysis and machine learning to pore over these videos and automatically identify talented musicians.

First we analyzed audio and visual features of videos being uploaded. We wanted to find “singing at home” videos -- often correlated with features such as ambient indoor lighting, head-and-shoulders view of a person singing in front of a fixed camera, few instruments and often a single dominant voice. Here’s a sample set of videos we found.



Then we estimated the quality of singing in each video. Our approach is based on acoustic analysis similar to that used by Instant Mix, coupled with a small set of singing quality annotations from human raters. Given these data we used machine learning to build a ranker that predicts if an average listener would like a performance.
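As a rough illustration of this kind of ranker (not our actual features or model, which are not public), one can train a simple logistic-regression scorer on synthetic acoustic features and synthetic rater labels; NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative acoustic features per video (e.g. pitch stability, timbre,
# tempo regularity) -- the real feature set is different and not public.
n, d = 200, 3
X = rng.normal(size=(n, d))

# Synthetic human ratings: raters in this toy world like stable pitch
# (feature 0) and dislike erratic tempo (feature 2).
true_w = np.array([2.0, 0.0, -1.5])
y = (X @ true_w + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Plain gradient-descent logistic regression as a stand-in "ranker".
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * (X.T @ (p - y)) / n

scores = X @ w                       # higher score = more likely to be liked
top = np.argsort(-scores)[:5]        # candidate videos to surface for voting
print("top candidates:", top)
print("training accuracy:", np.mean((scores > 0) == (y > 0.5)))
```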

While machines are useful for weeding through thousands of not-so-great videos to find potential stars, we know they alone can't pick the next great star. So we turn to YouTube users to help us identify the real hidden gems by playing a voting game called YouTube Slam. We're putting an equal amount of effort into the game itself -- how do people vote? What makes it fun? How do we know when we have a true hit? We're looking forward to your feedback to help us refine this process: give it a try*. You can also check out singer and voter leaderboards. Toggle “All time” to “Last week” to find emerging talent in fresh videos or all-time favorites.

Our “Music Slam” has only been running for a few weeks and we have already found some very talented musicians. Many of the videos have fewer than 100 views when we find them.



And while we're excited about what we've done with music, there's as much undiscovered potential in almost any subject you can think of. Try our other slams: cute, bizarre, comedy, and dance*. Enjoy!

Related work by Google Researchers:
“Video2Text: Learning to Annotate Video Content”, Hrishikesh Aradhye, George Toderici, Jay Yagnik, ICDM Workshop on Internet Multimedia Mining, 2009.

* Music and dance slams are currently available only in the US.
Posted in Machine Hearing, YouTube

Wednesday, 28 September 2011

Fresh Perspectives about People and the Web from Think Quarterly

Posted by Allison Mooney, Christina Park, and Caroline McCarthy, The Think Quarterly Team

There’s a lot of research, analysis and insights—from inside and outside Google—that we use in building our products and making decisions. To share what we’ve learned with our partners, we created Think Quarterly. It’s intended to be a snapshot of what Google and other industry leaders are talking about and inspired by right now.

Today we’re launching our second edition, the “People” issue, exploring the latest technologies connecting us and the big ideas driving society forward. It also includes some of the research and analysis that helps us shape our strategies.

For those who love data as much as we do, here are a few articles worth reading:

  • “Following Generation Z,” in which Google research scientist Ed Chi details what he’s learned from monitoring the course of digital innovation and mapping patterns of digital technology use in the future
  • “Predicting the Present,” by chief economist Hal Varian, about how publicly available search tools can help anyone gain valuable insights into the behavior of web users and predict what they might do next
  • “Power to the People,” by Meg Pickard, anthropologist turned head of digital engagement at Guardian News and Media, about tracking the influence and power of online communities
  • “From Cash to Contentment,” about the use of happiness as a measurable metric of success, with insights coming from Nobel Prize winner Joseph Stiglitz

Click here to read all the articles, and if you have a suggestion for our next issue please tell us here. We hope you enjoy (and +1) it!

Tuesday, 27 September 2011

Trying on the new Dynamic Views from Blogger

Posted by Alison Powell, Google Research Team

As you may have noticed, the Google Research blog looks a lot different today. That’s because we—along with a few other Google blogs—are trying out a new set of Blogger templates called Dynamic Views.

Launched today, Dynamic Views is a unique browsing experience that makes it easier and faster for readers to explore blogs in interactive ways. We’re using the Magazine view, but you can also preview this blog in any of the other six new views by using the view selection bar at the top left of the screen.



We’re eager to hear what you think about the new Dynamic Views. You can submit feedback using the “Send feedback” link on the bottom right of this page.

If you like what you see here, and we hope you do, we encourage you to try out the new look(s) on your own blog—read the Blogger Buzz post for more info.

Wednesday, 7 September 2011

Sorting Petabytes with MapReduce - The Next Episode

Posted by Grzegorz Czajkowski, Marián Dvorský, Jerry Zhao, and Michael Conley, Systems Infrastructure

Almost three years ago we announced results of the first ever "petasort" (sorting a petabyte-worth of 100-byte records, following the Sort Benchmark rules). It completed in just over six hours on 4000 computers. Recently we repeated the experiment using 8000 computers. The execution time was 33 minutes, an order of magnitude improvement.

Our sorting code is based on MapReduce, which is a key framework for running multiple processes simultaneously at Google. Thousands of applications, supporting most services offered by Google, have been expressed in MapReduce. While not many MapReduce applications operate at a petabyte scale, some do. Their scale is likely to continue growing quickly. The need to help such applications scale motivated us to experiment with data sets larger than one petabyte. In particular, sorting a ten petabyte input set took 6 hours and 27 minutes to complete on 8000 computers. We are not aware of any other sorting experiment successfully completed at this scale.
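The standard recipe for a distributed sort of this kind -- sample the input to pick partition boundaries, route each record to the reducer that owns its key range, then sort each shard locally -- can be sketched in miniature. This is a single-process illustration of the idea, not Google's MapReduce code, and the record count and reducer count are arbitrary:

```python
import random
from bisect import bisect_left

random.seed(0)
# Stand-in for the benchmark's 100-byte records: random 32-bit integers.
records = [random.getrandbits(32) for _ in range(100_000)]
n_reducers = 8

# Step 1: sample the input to choose boundary keys, so each reducer
# receives a roughly equal-sized, disjoint key range.
sample = sorted(random.sample(records, 1024))
splits = [sample[(i + 1) * len(sample) // n_reducers]
          for i in range(n_reducers - 1)]

# Step 2 ("map"): route each record to the reducer owning its key range.
shards = [[] for _ in range(n_reducers)]
for r in records:
    shards[bisect_left(splits, r)].append(r)

# Step 3 ("reduce"): each reducer sorts its shard independently; the
# concatenation of the shards in reducer order is globally sorted.
output = []
for shard in shards:
    output.extend(sorted(shard))

print("globally sorted:", output == sorted(records))  # prints True
```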

We are excited by these results. While internal improvements to the MapReduce framework contributed significantly, a large part of the credit goes to numerous advances in Google's hardware, cluster management system, and storage stack.

What would it take to scale MapReduce by further orders of magnitude and make processing of such large data sets efficient and easy? One way to find out is to join Google’s systems infrastructure team. If you have a passion for distributed computing, are an expert or plan to become one, and feel excited about the challenges of exascale then definitely consider applying for a software engineering position with Google.
Posted in MapReduce

Monday, 22 August 2011

Google at the Joint Statistical Meetings in Miami

Posted by Marianna Dizik, Statistician

The Joint Statistical Meetings (JSM) were held in Miami, Florida, this year. Nearly 5,000 participants from academia and industry came to present and discuss the latest in statistical research, methodology, and applications. Similar to previous years, several Googlers shared expertise in large-scale experimental design and implementation, statistical inference with massive datasets and forecasting, data mining, parallel computing, and much more.

Our session "Statistics: The Secret Weapon of Successful Web Giants" attracted over one hundred people, a surprising turnout for an 8:30 AM session! Revolution Analytics reviewed it in their official blog post, "How Google uses R to make online advertising more effective".

The following talks were given by Googlers at JSM 2011. Please check the upcoming Proceedings of the JSM 2011 for the full papers.

  • Statistical Plumbing: Effective use of classical statistical methods for large scale applications
    • Author(s): Ni Wang, Yong Li, Daryl Pregibon, and Rachel Schutt
  • Parallel Computations in R, with Applications for Statistical Forecasting
    • Author(s): Murray Stokely and Farzan Rohani and Eric Tassone
  • Conditional Regression Models
    • Author(s): William D. Heavlin
  • The Effectiveness of Display Ads
    • Author(s): Tim Hesterberg and Diane Lambert and David X. Chan and Or Gershony and Rong Ge
  • Measuring Ad Effectiveness Using Continuous Geo Experiments
    • Author(s): Jon Vaver and Deepak Kumar and Jim Koehler
  • Post-Stratification and Network Sampling
    • Author(s): Rachel Schutt and Andrew Gelman and Tyler McCormick

Google has participated at JSM each year since 2004. We have been increasing our involvement significantly by providing sponsorship, organizing and giving talks at sessions and roundtables, teaching courses and workshops, hosting a booth with demos of new Google products, submitting posters, and more. This year Googlers participated in sessions sponsored by the ASA sections for Statistical Learning and Data Mining, Statistics and Marketing, Statistical Computing, Bayesian Statistical Science, Health Policy Statistics, Statistical Graphics, Quality and Productivity, Physical and Engineering Sciences, and Statistical Education.

We also hosted the Google faculty reception, which was well-attended by faculty and their promising students. Google hires a growing number of statisticians and we were happy to participate in JSM again this year. People had a chance to talk to Googlers, ask about working here, encounter elements of Google culture (good food! T-shirts! 3D puzzles!), meet old and make new friends, and just have fun!

Thanks to everyone who presented, attended, or otherwise engaged with the statistical community at JSM this year. We’re looking forward to seeing you in San Diego next year.
Posted in conferences, jsm, jsm2011, statistics

Tuesday, 16 August 2011

A new MIT center for mobile learning, with support from Google

Posted by Hal Abelson, Professor of Computer Science and Engineering, MIT

MIT and Google have a long-standing relationship based on mutual interests in education and technology. Today, we took another step forward in our shared goals with the establishment of the MIT Center for Mobile Learning, which will strive to transform learning and education through innovation in mobile computing. The new center will be actively engaged in studying and extending App Inventor for Android, which Google recently announced it will be open sourcing.

The new center, housed at MIT’s Media Lab, will focus on designing and studying new mobile technologies that enable people to learn anywhere, anytime, with anyone. The center was made possible in part by support from Google University Relations and will be run by me and two distinguished MIT colleagues: Professors Eric Klopfer (science education) and Mitchel Resnick (media arts and sciences).

App Inventor for Android—a programming system that makes it easy for learners to create mobile apps for Android smartphones—currently supports a community of about 100,000 educators, students and hobbyists. Through the new initiatives at the MIT Center for Mobile Learning, App Inventor will be connected to MIT’s premier research in educational technology and MIT’s long track record of creating and supporting open software.

Google first launched App Inventor internally in order to move it forward with speed and focus, and then developed it to a point where it started to gain critical mass. Now, its impact can be amplified by collaboration with a top academic institution. At MIT, App Inventor will adopt an enriched research agenda with increased opportunities to influence the educational community. In a way, App Inventor has now come full circle, as I actually initiated App Inventor at Google by proposing it as a project during my sabbatical with the company in 2008. The core code for App Inventor came from Eric Klopfer’s lab, and the inspiration came from Mitch Resnick’s Scratch project. The new center is a perfect example of how industry and academia can collaborate effectively to create change enabled by technology, and we look forward to seeing what we can do next, together.

Friday, 12 August 2011

Our Faculty Institute brings faculty back to the drawing board

Posted by Nina Kim Schultz, Google Education Research

Cross-posted with the Official Google Blog

School may still be out for summer, but teachers remain hard at work. This week, we hosted Google’s inaugural Faculty Institute at our Mountain View, Calif. headquarters. The three-day event was created for esteemed faculty from schools of education and math and science to explore teaching paradigms that leverage technology in K-12 classrooms. Selected via a rigorous nomination and application process, the 39 faculty members hail from 19 California State Universities (CSUs), as well as Stanford and UC Berkeley, and teach high school STEM (Science, Technology, Engineering and Math) teachers currently getting their teaching credentials. CSU programs credential 60 percent of California’s teachers—or 10 percent of all U.S. K-12 teachers—and one CSU campus alone can credential around 1,000 new teachers in a year. The purpose of gathering together at the Institute was to ensure our teachers’ teachers have the support they need to help educators adjust to a changing landscape.

There is so much technology available to educators today, but unless they learn how to use it effectively, it does little to change what is happening in our classrooms. Without the right training and inspiration, interactive displays become merely expensive projection screens, and laptops simply replace paper rather than shifting the way teachers teach and students learn. Although the possibilities for technology use in schools are endless, teacher preparation for the 21st century classroom also has many constraints. For example: beyond the expense involved, there’s the time it costs educators to match a technological innovation to the improvement of pedagogy and curriculum; there’s a distinct shift in thinking that needs to take place to change classrooms; and there’s an essential challenge to help teachers develop the dispositions and confidence to be lifelong evaluators, learners and teachers of technology, instead of continuing to rely on traditional skill sets that will soon be outdated.

The Institute featured keynote addresses from respected professors from Stanford and Berkeley, case studies from distinguished high school teachers from across California, hands-on technology workshops with a variety of Google and non-Google tools, and panels with professionals in the tech-education industry. Notable guests included representatives from Teach for America, The New Teacher Project, the Department of Education and Edutopia. Topics covered the ability to distinguish learning paths, how to use technology to transform classrooms into project-based, collaborative spaces and how to utilize a more interactive teaching style rather than the traditional lecture model.

On the last day of the Institute, faculty members were invited to submit grant proposals to scale best practices outside of the meeting. Deans of the participating universities will convene at the end of the month to further brainstorm ways to scale new ideas in teacher preparation programs. Congratulations to all of the faculty members who were accepted into the inaugural Institute, and thank you for all that you do to help bring technology and new ways of thinking into the classroom.



This program is a part of Google’s continued commitment to supporting STEM education. Details on our other programs can be found on www.google.com/education.
Posted in Education

Wednesday, 10 August 2011

Culturomics, Ngrams and new power tools for Science

Posted by Erez Lieberman Aiden and Jean-Baptiste Michel, Visiting Faculty at Google

Four years ago, we set out to create a research engine that would help people explore our cultural history by statistically analyzing the world’s books. In January 2011, the resulting method, culturomics, was featured on the cover of the journal Science. More importantly, Google implemented and launched a web-based version of our prototype research engine, the Google Books Ngram Viewer.

Now scientists, scholars, and web surfers around the world can take advantage of the Ngram Viewer to study a vast array of phenomena. And that's exactly what they've done. Here are a few of our favorite examples.

Poverty
Martin Ravallion, head of the Development Research Group at the World Bank, has been using the ngrams to study the history of poverty. In a paper published in the journal Poverty and Public Policy, he argues for the existence of two ‘poverty enlightenments’ marked by increased awareness of the problem: one towards the end of the 18th century, and another in the 1970s and 80s. But he makes the point that only the second of these enlightenments brought with it a truly enlightened idea: that poverty can be and should be completely eradicated.



The Science Hall of Fame
Adrian Veres and John Bohannon wondered who the most famous scientists of the past two centuries were. But there was no hall of fame for scientists, or a committee that determines who deserves to get into such a hall. So they used the ngrams data to define a metric for celebrity – the milliDarwin – and algorithmically created a Science Hall of Fame listing the most famous scientists born since 1800. They found that things like a popular book or a major controversy did more to increase discussion of a scientist than, for instance, winning a Nobel Prize.

(Other users have been exploring the history of particular sciences with the Ngram Viewer, covering everything from neuroscience to the nuclear age.)


The History of Typography
When we introduced the Ngram Viewer, we pointed out some potential pitfalls with the data. For instance, the ‘medial s’ ( Å¿ ), an older form of the letter s that looked like an integral sign and appeared in the beginning or middle of words, tends to be classified as an instance of the letter ‘f’ by the OCR algorithm used to create our version of the data. Andrew West, blogging at Babelstone, found a clever way to exploit this error: using queries like ‘husband’ and ‘hufband’ to study the history of medial s typography, he pinned down the precise moment when the medial s disappeared from English (around 1800), French (1780), and Spanish (1760).
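The medial-s trick lends itself to a simple sketch. The counts below are invented for illustration (a real analysis would use per-year frequencies from the downloadable ngram datasets); the idea is just to find the first year in which the OCR form "hufband" falls below a small fraction of the combined total.

```python
# Toy version of the medial-s analysis with invented counts; a real
# study would pull per-year frequencies for "hufband" and "husband"
# from the Ngram data. We report the first year in which the old,
# OCR-mangled form drops below 10% of the combined total.
counts = {  # year -> (hufband occurrences, husband occurrences)
    1740: (90, 10),
    1770: (60, 40),
    1790: (20, 80),
    1800: (5, 95),
}
for year in sorted(counts):
    huf, hus = counts[year]
    if huf / (huf + hus) < 0.10:
        print(year)  # the medial s has effectively vanished by this year
        break
```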

People are clearly having a good time with the Ngram Viewer, and they have been learning a few things about science and history in the process. Indeed, the tool has proven so popular and so useful that Google recently announced its imminent graduation from Google Labs to become a permanent part of Google Books.

Similar ‘big data’ approaches can also be applied to a wide variety of other problems. From books to maps to the structure of the web itself, 'the world's information' is one amazing dataset.

Erez Lieberman Aiden is Visiting Faculty at Google and a Fellow of the Harvard Society of Fellows. Jean-Baptiste Michel is Visiting Faculty at Google and a Postdoctoral Fellow in Harvard's Department of Psychology.
Posted in Google Books, Ngram

Thursday, 28 July 2011

President's Council Recommends Open Data for Federal Agencies

Posted on 10:58 by Unknown
Posted by Alon Halevy, Senior Staff Research Scientist

Cross-posted with the Public Sector and Elections Lab Blog

One of the things I most enjoy about working on data management is the ability to work on a variety of problems, both in the private sector and in government. I recently had the privilege of serving on a working group of the President’s Council of Advisors on Science and Technology (PCAST) studying the challenges of conserving the nation’s ecosystems. The report, titled “Sustaining Environmental Capital: Protecting Society and the Economy” was presented to President Obama on July 18th, 2011. The full report is now available to the public.

The press release announcing the report summarizes its recommendations:
The Federal Government should launch a series of efforts to assess thoroughly the condition of U.S. ecosystems and the social and economic value of the services those ecosystems provide, according to a new report by the President’s Council of Advisors on Science and Technology (PCAST), an independent council of the Nation’s leading scientists and engineers. The report also recommends that the Nation apply modern informatics technologies to the vast stores of biodiversity data already collected by various Federal agencies in order to increase the usefulness of those data for decision- and policy-making.

One of the key challenges we face in assessing the condition of ecosystems is that a lot of the data pertaining to these systems is locked up in individual databases. Even though this data is often collected using government funds, it is not always available to the public, and when it is, it is often not in a usable format. This is a classic example of a data integration problem that occurs in many other domains.

The report calls for creating an ecosystem, EcoINFORMA, around data. The crucial piece of this ecosystem is to make the relevant data publicly available in a timely manner and, most importantly, in a machine-readable form. Publishing data embedded in a PDF file is a classic example of what does not count as machine readable. For example, if you are publishing a tabular data set, then a computer program should be able to directly access the meta-data (e.g., column names, date collected) and the data rows without having to heuristically extract them from surrounding text.
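As a sketch of the distinction, consider publishing a table as a CSV with an explicit metadata record rather than as text inside a PDF. All names and values below are hypothetical; the point is only that a program can read the column names and data rows directly.

```python
import csv
import io
import json

# Hypothetical ecosystem observations published in machine-readable form:
# the rows live in a CSV and the metadata travels alongside as JSON, so a
# consumer can access both directly, with no heuristic text extraction.
metadata = {
    "columns": ["site", "species_count", "date_collected"],
    "date_format": "YYYY-MM-DD",
    "agency": "Example Agency",
}
rows = [
    ["wetland_a", 42, "2011-06-01"],
    ["wetland_b", 17, "2011-06-02"],
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(metadata["columns"])  # header row carries the column names
writer.writerows(rows)

# A downstream program reads the meta-data and the data rows directly.
published_metadata = json.loads(json.dumps(metadata))
records = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(published_metadata["columns"], records[0]["species_count"])
```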

Once the data is published, it can be discovered by search engines. Data from multiple sources can be combined to provide additional insight, and the data can be visualized and analyzed by sophisticated tools. The main point is that innovation should be pursued by many parties (academics, commercial, government), each applying their own expertise and passions.

There is a subtle point about how much meta-data should be provided before publishing the data. Unfortunately, requiring too much meta-data (e.g., standard schemas) often stymies publication. When meta-data exists, that’s great, but when it’s not there or is not complete, we should still publish the data in a timely manner. If the data is valuable and discoverable, there will be someone in the ecosystem who will enhance the data in an appropriate fashion.

I look forward to seeing this ecosystem evolve, and I am excited that Google Fusion Tables, our own cloud-based service for visualizing, sharing and integrating structured data, can contribute to its development.
Posted in Fusion Tables, Government, Policy, Structured Data

Thursday, 21 July 2011

Studies Show Search Ads Drive 89% Incremental Traffic

Posted on 08:10 by Unknown
Posted by David Chan and Lizzy Van Alstine, Quantitative Management Team

Advertisers often wonder whether search ads cannibalize their organic traffic. In other words, if search ads were paused, would clicks on organic results increase, and make up for the loss in paid traffic? Google statisticians recently ran over 400 studies on paused accounts to answer this question.

In what we call “Search Ads Pause Studies”, our group of researchers observed organic click volume in the absence of search ads. Then they built a statistical model to predict the click volume for given levels of ad spend using spend and organic impression volume as predictors. These models generated estimates for the incremental clicks attributable to search ads (IAC), or in other words, the percentage of paid clicks that are not made up for by organic clicks when search ads are paused.
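The essence of the approach can be sketched with a deliberately tiny model. The numbers below are made up, and the real studies used richer predictors (spend and organic impression volume), but the arithmetic of estimating IAC is the same: predict organic click volume at zero spend and see what fraction of paid clicks that would replace.

```python
# Toy illustration of an ads-pause estimate (not the actual model): fit a
# least-squares line for organic clicks as a function of ad spend, predict
# organic volume at zero spend, and compute the share of paid clicks that
# organic traffic would fail to replace.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical weekly observations.
spend = [0, 100, 200, 300]       # ad spend levels
organic = [1000, 990, 985, 975]  # organic clicks observed at each level
paid = 1200                      # paid clicks at the highest spend level

slope, intercept = fit_line(spend, organic)
organic_if_paused = intercept                     # predicted clicks at zero spend
organic_at_spend = slope * spend[-1] + intercept  # predicted clicks at full spend
made_up = organic_if_paused - organic_at_spend    # organic clicks recovered by pausing
iac = 1 - made_up / paid                          # fraction of paid clicks not replaced
print(round(iac, 2))
```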

The results were surprising. On average, the incremental ad clicks percentage across verticals is 89%. This means that a full 89% of the traffic generated by search ads is not replaced by organic clicks when ads are paused. This number was consistently high across verticals. The full study can be found here.
Posted in search ads

Wednesday, 20 July 2011

Faculty from across the Americas meet in New York for the Faculty Summit

Posted on 11:08 by Unknown
Posted by Maggie Johnson, Director of Education & University Relations

(Cross-posted from the Official Google Blog)

Last week, we held our seventh annual Computer Science Faculty Summit. For the first time, the event took place at our New York City office; nearly 100 faculty members from universities in the U.S., Canada and Latin America attended. The two-day Summit focused on systems, artificial intelligence and mobile computing. Alfred Spector, VP of research and special initiatives, hosted the conference and led lively discussions on privacy, security and Google’s approach to research.

Google’s Internet evangelist, Vint Cerf, opened the Summit with a talk on the challenges involved in securing the “Internet of things”—that is, uniquely identifiable objects (“things”) and their virtual representations. With almost 2 billion international Internet users and 5 billion mobile devices out there in the world, Vint expounded upon the idea that Internet security is not just about technology, but also about policy and global institutions. He stressed that our new digital ecosystem is complex and large in scale, and includes both hardware and software. It also has multiple stakeholders, diverse business models and a range of legal frameworks. Vint argued that making and keeping the Internet secure over the next few years will require technical innovation and global collaboration.

After Vint kicked things off, faculty spent the two days attending presentations by Google software engineers and research scientists, including John Wilkes on the management of Google's large hardware infrastructure, Andrew Chatham on the self-driving car, Johan Schalkwyk on mobile speech technology and Andrew Moore on the research challenges in commerce services. Craig Nevill-Manning, the engineering founder of Google’s NYC office, gave an update on Google.org, particularly its recent work in crisis response. Other talks covered the engineering work behind products like Ad Exchange and Google Docs, and the range of engineering projects taking place across 35 Google offices in 20 countries. For a complete list of the topics and sessions, visit the Faculty Summit site. Also, a few of our attendees heeded Alfred’s call to recap their breakout sessions in verse—download a PDF of one of our favorite poems, about the future of mobile computing, penned by NYU professor Ken Perlin.

A highlight of this year’s Summit was Bill Schilit’s presentation of the Library Wall, a Chrome OS experiment featuring an eight-foot tall full-color virtual display of ebooks that can be browsed and examined individually via touch screen. Faculty members were invited to play around with the digital-age “bookshelf,” which is one of the newest additions to our NYC office.

We’ve already posted deeper dives on a few of the talks—including cluster management, mobile search and commerce. We also collected some interesting faculty reflections. For more information on all of our programs, visit our University Relations website. The Faculty Summit is meant to connect forerunners across the computer science community—in business, research and academia—and we hope all our attendees returned home feeling informed and inspired.
Posted in conference, University Relations

Tuesday, 19 July 2011

Google Americas Faculty Summit: Reflections from our attendees

Posted on 14:39 by Unknown
Posted by Alfred Spector, Vice President, Research

Last week, we held our seventh annual Americas Computer Science Faculty Summit at our New York City office. About 100 faculty members from universities in the Western Hemisphere attended the two-day Summit, which focused on systems, artificial intelligence and mobile computing. To finish up our series of Summit recaps, I asked four faculty members to provide their perspectives on the Summit, thinking their views would complement our own posts: Jeannette Wing from Carnegie Mellon, Rebecca Wright from Rutgers, Andrew Williams from Spelman and Christos Kozyrakis from Stanford.

Jeannette M. Wing, Carnegie Mellon University
Fun, cool, edgy and irreverent. Those words describe my impression of Google after attending the Google Faculty Summit, held for the first time at its New York City location. Fun and cool: The Library Wall prototype, which attendees were privileged to see, is a peek at the future where e-books have replaced physical books, but where physical space, equipped with wall-sized interactive displays, still encourages the kind of serendipitous browsing we enjoy in the grand libraries of today. Cool and edgy: Being in the immense old Port Authority building in the midst of the Chelsea district of Manhattan is just plain cool and adds an edgy character to Google not found at the corporate campuses of Silicon Valley. Edgy, or more precisely “on the edge,” is Google as it explores new directions: social networking (Google+), mobile voice search (check out the microphone icon in your search bar) and commerce (e.g. selling soft goods on-line). Why these directions? Some are definitely for business reasons, but some are also simply because Google can (self-driving cars) and because it’s good for society (e.g., emergency response in Haiti, Chile, New Zealand and Japan). “Irreverent” is Alfred Spector’s word and sums it up—Google is a fun place to work, where smart people can be creative, build cool products and make a difference in untraditional ways.

But the one word that epitomizes Google is “scale.” How do you manage clusters on the order of hundreds of thousands of processors where the focus is faults, not performance or power? What knowledge about humanity can machine learning discover from 12 million scanned books in 400 languages, totaling five billion digitized pages and two trillion words? Beyond Google, how do you secure the Internet of Things when eventually everything from light bulbs to pets will all be Internet-enabled and accessible?

One conundrum. Google’s hybrid model of research clearly works for Google and for Googlers. It is producing exciting advances in technology and having an immeasurable impact on society. As was evident from our open and intimate breakout sessions, Google stays abreast of cutting-edge academic research, often by hiring our Ph.D. students. The challenge for computer science research is: how can academia build on the shoulders of Google’s scientific results?

Academia does not have access to the scale of data or the complexity of system constraints found within Google. For the good of the entire industry-academia-government research ecosystem, I hope that Google continues to maintain an open dialogue with academia—through faculty summits, participation and promotion of open standards, robust university relations programs and much more.
-----

Rebecca Wright, Rutgers University
This was my first time attending a Google Faculty Summit. It was great to see it held in my "backyard," which emphasized the message that much of Google's work takes place outside their Mountain View campus. There was a broad variety of excellent talks, each of which only addressed the tip of the iceberg of the particular problem area. The scope and scale of the work being done at Google is really mind-boggling. It both drives Google’s need for new solutions and allows the company to consider new approaches. At Google’s scale, automation is critical and almost everything requires research advances, engineering advances, considerable development effort and engagement of people outside Google (including academics, the open source community, policymakers and "the crowd").

A unifying theme in much of Google’s work is the use of approaches that leverage its scale rather than fight it (such as MapMaker, which combines Google's data and computational resources with people's knowledge about and interest in their own geographic areas). In addition to hearing presentations, the opportunity to interact with the broad variety of Googlers present as well as other faculty was really useful and interesting. As a final thought, I would like to see Google get more into education, particularly in terms of advancing hybrid in-class/on-line technologies that take advantage of the best features of each.
-----

Andrew Williams, Spelman College
At the 2011 Google Faculty Summit in New York, the idea that we are moving past the Internet of computers to an "Internet of Things" became a clear theme. After hearing presentations by Googlers, such as Vint Cerf dapperly dressed in a three piece suit, I realized that we are in fact moving to an Internet of Things and People. The pervasiveness of connected computing devices and very large systems for cloud computing all interacting with socially connected people were expounded upon both in presentations and in informal discussions with faculty from around the world. The "Internet of people" aspect was also evident in emerging policies we touched on, involving security, privacy and social networks (like the Google+ project). I also enjoyed the demonstration of the Google self-driving car as an advanced application of artificial intelligence that integrates computer vision, localization and decision making in a real world transportation setting. I was impressed with how Google volunteers its talent, technology and time to help people, as it did with its crisis response efforts in Haiti, Japan and other parts of the world.

As an educator and researcher in humanoid robotics and AI at a historically black college for women in Atlanta, the Google Faculty Summit motivated me to improve how I educate our students to eventually tackle the grand challenges posed by the Internet of Things and People. It was fun to learn how Google is actively seeking to solve these grand challenges on a global scale.
-----

Christos Kozyrakis, Stanford University
What makes the Google Faculty Summit a unique event to attend is its wide-reaching focus. Our discipline-focused conferences facilitate in-depth debates over a narrow set of challenges. In contrast, the Faculty Summit is about bringing together virtually all disciplines of computer science to turn information into services with an immediate impact on our everyday lives. It is fascinating to discuss how large data centers and distributed software systems allow us to use machine learning algorithms on massive datasets and get voice-based search, tailored shopping recommendations or driverless cars. Apart from the general satisfaction of seeing these applications in action, one of the important takeaways for me is that specifying and managing the behavior of large systems in an end-to-end manner is currently a major challenge for our field. Now is probably the best time to be a computer scientist, and I am leaving with a better understanding of which advances in my area of expertise can have the biggest overall impact.

I also enjoyed having the summit at the New York City office, away from Google headquarters in Silicon Valley. It’s great to see in practice how the products of our field (networking, video-conferencing and online collaboration tools) allow for technology development anywhere in the world.
-----

As per Jeannette Wing’s comments about Google being “irreverent,” I own up to using the term—initially about a subject on which Aristophanes once wrote (I’ll leave that riddle open). As long as you take my usage in the right way (that is, we’re very serious about the work we do, but perhaps not about all the things one would expect of a large company), I’m fine with it. There’s so much in the future of computer science and its potential impact that we should always be coming at things in new ways, with the highest aspirations and with joy at the prospects.
Posted in Education

Monday, 18 July 2011

Google Americas Faculty Summit Day 2: Shopping, Coupons and Data

Posted on 14:01 by Unknown
Posted by Andrew W. Moore, Director, Google Commerce and Site Director, Pittsburgh

On July 14 and 15, we held our seventh annual Faculty Summit for the Americas with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

Google is ramping up its commitment to making shopping and commerce fun, convenient and useful. As a computer scientist with a background in algorithms and large scale artificial intelligence, what's most interesting to me is the breadth of fundamental new technologies needed in this area. They range from the computer vision technology that recognizes fashion styles and visually similar items of clothing, to a deep understanding of (potentially) all goods for sale in the world, to new and convenient payments technologies, to the intelligence that can be brought to the mobile shopping experience, to the infrastructure needed to make these technologies work on a global scale.

At the Faculty Summit this week, I took the opportunity to engage faculty in some of the fascinating research questions that we are working on within Google Commerce. For example, consider the processing flow required to present a user with an appropriate set of shoes from which to choose, given the input of an image of a high heel shoe. First, we need to segment or identify the object of interest in the input image. If the input is an image of a high heel with the Alps in the background, we don’t want to find images of different types of shoes with the Alps in the background; we want images of high heels.

The second step is to extract the object’s “visual signature” and build an index using color, shape, pattern and metadata. Then, a search is performed using a variety of similarity measures. The implementation of this processing flow raises several research challenges. For example, the calculations required to determine similar shoes could be slow due to the number of factors that must be considered. Segmentation can also pose a difficult problem because of the complexity of the feature extraction algorithms.
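A stripped-down sketch of the signature-and-similarity step might look like the following. The feature names and vectors are invented (the production system uses far richer color, shape, pattern and metadata features, plus an index built for scale), but the matching logic, nearest neighbor under a similarity measure, is the same in spirit.

```python
import math

# Illustrative "visual signature" matching: each catalog item gets a
# feature vector, and a query image is matched against the index by
# cosine similarity. Feature names and values are made up.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical signatures: [redness, heel_height, shininess]
index = {
    "red_high_heel": [0.9, 0.8, 0.6],
    "black_flat":    [0.1, 0.0, 0.2],
    "red_sneaker":   [0.8, 0.1, 0.3],
}
query = [0.85, 0.75, 0.5]  # signature extracted from the query image

# Return the catalog item most similar to the query signature.
best = max(index, key=lambda name: cosine(query, index[name]))
print(best)
```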

Another important consideration is personalization. Consumers want items that correspond to their interests, so we should include results based on historical search and shopping data for a particular person (who has opted in to such features). More importantly, we want to downweight styles that the shopper has indicated they do not like. Finally, we also need to include some creative items to simulate the serendipitous connections one makes when shopping in a store. This is a new kind of search experience, which requires a new kind of architecture and new ways to infer shopper satisfaction. As a result, we find ourselves exploring new kinds of statistical models and the underlying infrastructure to support them.


Posted in Education

Friday, 15 July 2011

Google Americas Faculty Summit Day 1: Cluster Management

Posted on 11:32 by Unknown
Posted by John Wilkes, Principal Software Engineer

On July 14 and 15, we held our seventh annual Faculty Summit for the Americas with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

At this year’s Faculty Summit, I had the opportunity to provide a glimpse into the world of cluster management at Google. My goal was to brief the audience on the challenges of this complex system and explain a few of the research opportunities that these kinds of systems provide.

First, a little background. Google’s fleet of machines are spread across many data centers, each of which consists of a number of clusters (a set of machines with a high-speed network between them). Each cluster is managed as one or more cells. A user (in this case, a Google engineer) submits jobs to a cell for it to run. A job could be a service that runs for an extended period, or a batch job that runs, for example, a MapReduce updating an index.

Cluster management operates on a very large scale: whereas a storage system that can hold a petabyte of data is considered large by most people, our storage systems will send us an emergency page when they have only a few petabytes of free space remaining. This scale gives us opportunities (e.g., a single job may use several thousand machines at a time), but also many challenges (e.g., we constantly need to worry about the effects of failures). The cluster management system juggles the needs of a large number of jobs in order to achieve good utilization, trying to strike a balance between a number of conflicting goals.

To complicate things, data centers can have multiple types of machines, different network and power-distribution topologies, a range of OS versions and so on. We also need to handle changes, such as rolling out a software or a hardware upgrade, while the system is running.

Our current cluster management system is about seven years old now (several generations for most Google software) and, although it has been a huge success, it is beginning to show its age. We are currently prototyping a new system that will replace it; most of my talk was about the challenges we face in building this system. We are building it to handle larger cells, to look into the future (by means of a calendar of resource reservations) to provide predictable behavior, to support failures as a first-class concept, to unify a number of today’s disjoint systems and to give us the flexibility to add new features easily. A key goal is that it should provide predictable, understandable behavior to users and system administrators. For example, the latter want to know answers to questions like “Are we in trouble? Are we about to be in trouble? If so, what should we do about it?”

Putting all this together requires advances in a great many areas. I touched on a few of them, including scheduling and ways of representing and reasoning with user intentions. One of the areas that I think doesn’t receive nearly enough attention is system configuration—describing how systems should behave, how they should be set up, how those setups should change, etc. Systems at Google typically rely on dozens of other services and systems. It’s vital to simplify the process of making controlled changes to configurations that result in predictable outcomes, every time, even in the face of heterogeneous infrastructure environments and constant flux.

We’ll be taking steps toward these goals ourselves, but the intent of today’s discussion was to encourage people in the academic community to think about some of these problems and come up with new and better solutions, thereby raising the level for us all.

Posted in Education

Google Americas Faculty Summit Day 1: Mobile Search

Posted on 10:29 by Unknown
Posted by Johan Schalkwyk, Software Engineer

On July 14 and 15, we held our seventh annual Faculty Summit for the Americas with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

Google’s mobile speech team has a lofty goal: recognize any search query spoken in English and return the relevant results. Regardless of whether your accent skews toward a Southern drawl, a Boston twang, or anything in between, spoken searches like “navigate to the Metropolitan Museum,” “call California Pizza Kitchen” or “weather, Scarsdale, New York” should provide immediate responses with a map, the voice of the hostess at your favorite pizza place or an online weather report. The responses must be fast and accurate or people will stop using the tool, and—given that the number of speech queries has more than doubled over the past year—the team is clearly succeeding.

As a software engineer on the mobile speech team, I took the opportunity of the Faculty Summit this week to present some of the interesting challenges surrounding developing and implementing mobile search. One of the immediate puzzles we have to solve is how to train a computer system to recognize speech queries. There are two aspects to consider: the acoustic model, or the sound of letters and words in a language; and the language model, which in English is essentially grammar, or what allows us to predict words that follow one another. The language model we can put together using a huge amount of data gathered from our query logs. The acoustic model, however, is more challenging.

To build our acoustic model, we could conduct “supervised learning” where we collect 100+ hours of audio data from search queries and then transcribe and label the data. We use this data to translate a speech query into a written query. This approach works fairly well, but it doesn’t improve as we collect more audio data. Thus, we use an “unsupervised model” where we continuously add more audio data to our training set as users do speech queries.
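The language-model half of the system can be illustrated with a toy bigram model. The queries below are invented, and real voice search trains vastly larger n-gram models over query logs, but the principle is the same: count adjacent word pairs, then predict the likeliest next word.

```python
from collections import Counter, defaultdict

# Toy bigram language model in the spirit described above (real voice
# search uses far larger n-gram models trained on query logs): count
# word pairs in a few made-up queries, then predict the most likely
# word to follow a given word.
queries = [
    "weather scarsdale new york",
    "weather new york",
    "navigate to new york",
]
bigrams = defaultdict(Counter)
for q in queries:
    words = q.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

# The model predicts the most frequent continuation of "new".
prediction = bigrams["new"].most_common(1)[0][0]
print(prediction)
```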

Given the scale of this system, another interesting challenge is testing accuracy. The traditional approach is to have human testers run assessments. Over the past year, however, we have determined that our automated system has the same or better level of accuracy as our human testers, so we’ve decided to create a new method for automated testing at scale, a project we are working on now.

The current voice search system is trained on over 230 billion words and has a one million word vocabulary, meaning it understands all the different contexts in which those one million words can be used. It requires multiple CPU decades for training and data processing, plus a significant amount of storage, so this is an area where Google’s large infrastructure is essential. It’s exciting to be a part of such cutting edge research, and the Faculty Summit was an excellent opportunity to share our latest innovations with people who are equally inspired by this area of computer science.
Posted in Education, Voice Search

Tuesday, 12 July 2011

What You Capture Is What You Get: A New Way for Task Migration Across Devices

Posted on 14:45 by Unknown
Posted by Yang Li, Research Scientist

We constantly move from one device to another while carrying out everyday tasks. For example, we might find an interesting article on a desktop computer at work, then bring the article with us on a mobile phone during the commute and keep reading it on a laptop or a TV when we get home. Cloud computing and web applications have made it possible to access the same data and applications on different devices and platforms. However, there are not many ways to easily move tasks across devices that are as intuitive as drag-and-drop in a graphical user interface.

Last year, our research team started developing new technologies that let users easily migrate their tasks across devices. In a project named Deep Shot, we demonstrated how a user can easily move web pages and applications, such as Google Maps directions, between a laptop and an Android phone by using the phone camera. With Deep Shot, a user can simply take a picture of their monitor with a phone camera, and the captured content automatically shows up and becomes instantly interactive on the mobile phone.

This project was inspired by our observations that many people tend to take a picture of map directions on the monitor using their mobile phone camera, rather than using other approaches such as email. Taking pictures feels more direct and convenient, and fits well with our everyday activities, which are often opportunistic. Instead of just capturing raw pixels, Deep Shot recovers the actual contents and applications on the mobile phone based on these pixels. You can find out how Deep Shot keeps user interaction simple and what happens behind the scenes here. Similar to WYSIWYG—What You See Is What You Get—for graphical user interfaces, Deep Shot demonstrates WYCIWYG—What You Capture Is What You Get—for cross-device interaction. We are exploring this interaction style for various task migration situations in our everyday life.



Deep Shot remains a research project at Google. With increasing capabilities of mobile phones and fast growing web applications, we hope to explore more exciting ways to help users carry out their everyday activities.
Posted in Android

Thursday, 7 July 2011

Languages of the World (Wide Web)

Posted on 17:15 by Unknown
Posted by Daniel Ford and Josh Batson

The web is vast and infinite. Its pages link together in a complex network, containing remarkable structures and patterns. Some of the clearest patterns relate to language.

Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there are a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.

To see the connections between languages, we start with the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between them. The particular choice of pages in our corpus reflects decisions about what is 'important': in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based, for example, on PageRank.

We can use our corpus to draw a very simple graph of the web, with a node for each language and an edge between two languages if more than one percent of the off-site links in the first language land on pages in the second. To make things a little clearer, we show only languages that have at least a hundred thousand pages and a strong link with another language, meaning more than one percent of their off-site links go to that language. We also leave out English, which we'll discuss in a moment. (Figure 1)
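As a toy illustration, the edge rule can be sketched as follows. The link counts below are invented (the real corpus is Google-scale and internal); the code simply applies the one-percent rule described above, skipping self-loops and English.

```python
# A minimal sketch of the language-graph edge rule, on hypothetical data.

# offsite_links[a][b] = number of off-site links from pages in language a
# landing on pages in language b (toy counts, not real measurements).
offsite_links = {
    "uk": {"uk": 700, "ru": 250, "en": 50},
    "ms": {"ms": 500, "id": 300, "en": 200},
    "fi": {"fi": 900, "en": 95, "sv": 5},
}

def language_graph(links, threshold=0.01, exclude=("en",)):
    """Draw an edge a -> b when more than `threshold` of a's off-site
    links land on pages in language b."""
    edges = []
    for a, targets in links.items():
        total = sum(targets.values())
        for b, count in targets.items():
            if b == a or b in exclude:
                continue  # skip self-loops and excluded languages
            if count / total > threshold:
                edges.append((a, b))
    return edges

print(sorted(language_graph(offsite_links)))  # → [('ms', 'id'), ('uk', 'ru')]
```

With these toy counts, Ukrainian links strongly to Russian and Malay to Indonesian, while Finnish's links to Swedish fall below the threshold.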

Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia. The language linkages invite explanations around geopolitics, linguistics, and historical associations.


Figure 1: Language links on the web. 

The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, suggesting geographic rather than purely linguistic associations.

Examining links between other languages, it seems that many are explained by people and communities which speak both languages.

The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.

The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France.

Another strong link is between the Malay/Malaysian and Indonesian webs. Malaysia and Indonesia share a border, but more importantly the languages are nearly eighty percent cognate, meaning speakers of one can easily understand the other.

What about the sizes of each language web? Both the number of sites in each language and the number of URLs seen by Google's crawler follow an exponential distribution, although the ordering for each is slightly different (Figure 2). The exact number of pages in each language in 2008 is unknown, since multiple URLs may point to the same page and some pages may not have been seen at all. However, the language of an uncrawled URL can be guessed from the dominant language of its site. In fact, calendar pages and other infinite spaces mean that there really are an unlimited number of pages on the web, though some are more useful than others.
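The guessing rule mentioned above (assign an uncrawled URL the dominant language of its site) can be sketched in a few lines; the URLs and labels below are hypothetical.

```python
# A sketch of guessing an uncrawled URL's language from the dominant
# language of its site (all data here is made up for illustration).
from collections import Counter
from urllib.parse import urlparse

# Languages of crawled pages, keyed by URL (toy data).
crawled = {
    "http://example.fi/a": "fi",
    "http://example.fi/b": "fi",
    "http://example.fi/c": "sv",
}

def site_of(url):
    return urlparse(url).netloc

def guess_language(url, crawled):
    """Majority vote over the languages of crawled pages on the same site."""
    langs = Counter(lang for u, lang in crawled.items()
                    if site_of(u) == site_of(url))
    return langs.most_common(1)[0][0] if langs else None

print(guess_language("http://example.fi/uncrawled", crawled))  # → fi
```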

Figure 2: The number of sites and seen URLs per language are roughly exponentially distributed. 

The largest language on the web, in terms of size and centrality, has always been English, but where is it on our map?

Every language on the web has strong links to English, usually with around twenty percent of off-site links and occasionally over forty-five percent, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan (Figure 3). In both countries, English is one of the two official languages.

Figure 3: Language links to and from English 

You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically send sixty to eighty percent of their out-language links to English pages. However, only 38 percent of the pages and 42 percent of the sites in our set are English, while English attracts 79 percent of all out-language links from other languages.

Chinese and Japanese also seem unusual in that there are relatively few links from pages in these languages to pages in English, despite the fact that Japanese and Chinese sites are the most popular non-English sites for English sites to link to. However, the number of sites in a language is a strong predictor of its 'introversion', or the fraction of off-site links that go to pages in the same language. Taking this into account shows that the Chinese and Japanese webs are not unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content. (Figure 4)

Figure 4: Language size vs introversion. 

There is a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs which are unusually extroverted.
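The correlation described above is an ordinary Pearson correlation between log-transformed web size and introversion. The four data points below are invented purely to illustrate the computation; the real analysis used the full 2008 corpus.

```python
# A toy illustration of measuring the relationship between a language
# web's size and its 'introversion' (fraction of off-site links that
# stay in-language). All values are hypothetical.
import math

# (number of sites, fraction of off-site links staying in-language).
webs = {
    "small":  (2_000,      0.55),
    "medium": (80_000,     0.70),
    "large":  (1_500_000,  0.85),
    "huge":   (40_000_000, 0.95),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

log_sizes = [math.log10(sites) for sites, _ in webs.values()]
introversion = [frac for _, frac in webs.values()]
r = pearson(log_sizes, introversion)
print(round(r, 3))
```

On this toy data the log-linear fit is nearly perfect; the blog reports a correlation of 0.9 on the real data once English is removed.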

We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked to (Figure 5). This time, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.

Figure 5: Unexpected connections, given the size of each language. 
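The "unexpected connections" rule can be sketched as follows: estimate the expected number of links from language a to language b from b's overall share of the web, and draw an edge when the actual count exceeds 50 times that expectation. All counts and shares below are invented for illustration.

```python
# A toy sketch of the 50x-expected-links rule, on hypothetical data.

# out_links[a][b]: out-of-language links from a to b (made-up counts).
out_links = {
    "eo": {"pl": 120, "en": 800, "de": 80},
    "sw": {"tl": 150, "en": 800, "fr": 50},
}

# share[b]: b's fraction of all sites on the web (made-up values).
share = {"pl": 0.002, "en": 0.40, "de": 0.05, "tl": 0.002, "fr": 0.04}

def unexpected_edges(links, share, factor=50):
    """Link a -> b when a's links to b exceed `factor` times what b's
    share of the web would predict."""
    edges = []
    for a, targets in links.items():
        total = sum(targets.values())
        for b, count in targets.items():
            expected = total * share[b]
            if count > factor * expected:
                edges.append((a, b))
    return edges

print(unexpected_edges(out_links, share))  # → [('eo', 'pl'), ('sw', 'tl')]
```

With these toy numbers, Esperanto→Polish and Swahili→Tagalog stand out, while the large but unsurprising flows to English do not.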

What's happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked. We hope that tools like Google page translation, voice translation, and other services will accelerate this process and bring more people in the world closer together, whichever languages they speak.


UPDATE 9 July 2011: As has been pointed out in the comments, in both the Philippines and Pakistan, English is one of the two official languages; however, the Philippines was not a British colony.

Tuesday, 21 June 2011

Google Translate welcomes you to the Indic web

Posted on 09:30 by Unknown
Posted by Ashish Venugopal, Research Scientist

(Cross-posted on the Translate Blog and the Official Google Blog)



 

Beginning today, you can explore the linguistic diversity of the Indian sub-continent with Google Translate, which now supports five new experimental alpha languages: Bengali, Gujarati, Kannada, Tamil and Telugu. In India and Bangladesh alone, more than 500 million people speak these five languages. Since 2009, we’ve launched a total of 11 alpha languages, bringing the current number of languages supported by Google Translate to 63.

Indic languages differ from English in many ways, presenting several exciting challenges when developing their respective translation systems. Indian languages often use the Subject Object Verb (SOV) ordering to form sentences, unlike English, which uses Subject Verb Object (SVO) ordering. This difference in sentence structure makes it harder to produce fluent translations; the more words that need to be reordered, the more chance there is to make mistakes when moving them. Tamil, Telugu and Kannada are also highly agglutinative, meaning a single word often includes affixes that represent additional meaning, like tense or number. Fortunately, our research to improve Japanese (an SOV language) translation helped us with the word order challenge, while our work translating languages like German, Turkish and Russian provided insight into the agglutination problem.

You can expect translations for these new alpha languages to be less fluent and include many more untranslated words than some of our more mature languages—like Spanish or Chinese—which have much more of the web content that powers our statistical machine translation approach. Despite these challenges, we release alpha languages when we believe that they help people better access the multilingual web. If you notice incorrect or missing translations for any of our languages, please correct us; we enjoy learning from our mistakes and your feedback helps us graduate new languages from alpha status. If you’re a translator, you’ll also be able to take advantage of our machine translated output when using the Google Translator Toolkit.

Since these languages each have their own unique scripts, we’ve enabled a transliterated input method for those of you without Indian language keyboards. For example, if you type in the word “nandri,” it will generate the Tamil word நன்றி (see what it means). To see all these beautiful scripts in action, you’ll need to install fonts* for each language.
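A transliterated input method can be sketched as greedy longest-prefix matching against a syllable table. The two-entry mapping below is a hypothetical toy table that happens to cover the example word "nandri"; it is not Google's actual input method.

```python
# A toy greedy transliterator (illustrative only; the mapping table is
# hypothetical and covers just the example word).
mapping = {"nan": "நன்", "dri": "றி"}

def transliterate(text, mapping):
    """Greedy longest-prefix match of `text` against the mapping table."""
    out, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            chunk = text[i:i + length]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += length
                break
        else:
            out.append(text[i])  # pass unmapped characters through
            i += 1
    return "".join(out)

print(transliterate("nandri", mapping))  # → நன்றி
```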

We hope that the launch of these new alpha languages will help you better understand the Indic web and encourage the publication of new content in Indic languages, taking us five alpha steps closer to a web without language barriers.

*Download the fonts for each language: Tamil, Telugu, Bengali, Gujarati and Kannada.
Posted in Translate

Monday, 20 June 2011

Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths

Posted on 08:00 by Unknown
Posted by Matthias Grundmann, Vivek Kwatra, and Irfan Essa, Research Team

Earlier this year, we announced the launch of new features on the YouTube Video Editor, including stabilization for shaky videos, with the ability to preview them in real-time. The core technology behind this feature is detailed in this paper, which will be presented at the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011).

Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. Professionally shot videos, on the other hand, use carefully designed camera configurations and specialized equipment such as tripods or camera dollies, and employ ease-in and ease-out for transitions. Our goal was to devise a completely automatic method for converting casual shaky footage into more pleasant, professional-looking videos.



Our technique mimics the cinematographic principles outlined above by automatically determining the best camera path using a robust optimization technique. The original, shaky camera path is divided into a set of segments, each approximated by constant, linear, or parabolic motion. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm.
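The partitioning idea can be illustrated with a much-simplified sketch. The paper's method solves a robust L1 optimization; the toy version below instead fits each candidate segment of a 1D camera path by least squares with a polynomial of degree at most two (constant, linear, or parabolic) and uses dynamic programming with a hypothetical per-segment penalty to pick the best partition.

```python
# A simplified pure-Python sketch of partitioning a shaky 1D camera path
# into constant/linear/parabolic segments (NOT the paper's L1 solver).

def polyfit_error(xs, ys, degree):
    """Least-squares polynomial fit; returns the sum of squared residuals.
    Solves the normal equations by Gaussian elimination."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # forward elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j]
                                for j in range(i + 1, n))) / A[i][i]
    fit = lambda x: sum(c * x ** i for i, c in enumerate(coeffs))
    return sum((fit(x) - y) ** 2 for x, y in zip(xs, ys))

def best_partition(path, penalty=1.0):
    """DP over all partitions: cost[i] = min cost of explaining path[:i],
    where each segment pays `penalty` plus its best polynomial fit error."""
    n = len(path)
    cost = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            seg_x = list(range(j, i))
            deg = min(2, len(seg_x) - 1)  # constant/linear/parabolic
            c = cost[j] + penalty + polyfit_error(seg_x, path[j:i], deg)
            if c < cost[i]:
                cost[i], cut[i] = c, j
    bounds, i = [], n  # recover segment boundaries
    while i > 0:
        bounds.append((cut[i], i))
        i = cut[i]
    return list(reversed(bounds)), cost[n]

bounds, total = best_partition([float(i) for i in range(8)])
print(bounds)  # a perfectly linear path collapses to one segment: [(0, 8)]
```

The penalty term plays the role of the paper's preference for few, long smooth segments; the real system also optimizes in 2D and enforces proximity to the original path.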

To achieve real-time performance on the web, we distribute the computation across multiple machines in the cloud. This enables us to provide users with a real-time preview and interactive control of the stabilized result. Above we provide a video demonstration of how to use this feature on the YouTube Editor. We will also demo this live at Google’s exhibition booth in CVPR 2011.

For further details, please read our paper.
Posted in conference, CVPR, Vision Research, YouTube

Thursday, 16 June 2011

Google at CVPR 2011

Posted on 14:00 by Unknown
Posted by Mei Han and Sergey Ioffe, Research Team

The computer vision community will get together in Colorado Springs the week of June 20th for the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2011). This year will see a record number of people attending the conference, along with 27 co-located workshops and tutorials; registration was closed at 1,500 attendees even before the conference started.

Computer Vision is at the core of many Google products, such as Image Search, YouTube, Street View, Picasa, and Goggles, and as always, Google is involved in several ways with CVPR. Andrew Senior is serving as an area chair of CVPR 2011 and many Googlers are reviewers. Googlers also co-authored these papers:

  • Where's Waldo: Matching People in Images of Crowds by Rahul Garg, Deva Ramanan, Steve Seitz, Noah Snavely
  • Visual and Semantic Similarity in ImageNet by Thomas Deselaers, Vittorio Ferrari
  • Multicore Bundle Adjustment by Changchang Wu, Sameer Agarwal, Brian Curless, Steve Seitz
  • A Hierarchical Conditional Random Field Model for Labeling and Segmenting Images of Street Scenes by Qixing Huang, Mei Han, Bo Wu, Sergey Ioffe
  • Kernelized Structural SVM Learning for Supervised Object Segmentation by Luca Bertelli, Tianli Yu, Diem Vu, Salih Gokturk
  • Discriminative Tag Learning on YouTube Videos with Latent Sub-tags by Weilong Yang, George Toderici
  • Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths by Matthias Grundmann, Vivek Kwatra, Irfan Essa
  • Image Saliency: From Local to Global Context by Meng Wang, Janusz Konrad, Prakash Ishwar, Yushi Jing, Henry Rowley

If you are attending the conference, stop by Google’s exhibition booth. In addition to talking with Google researchers, you will get to see examples of exciting computer vision research that has made it into Google products including, among others, the following:

  • Google Earth Facade Shadow Removal by Mei Han, Vivek Kwatra, and Shengyang Dai
    We will demonstrate our technique for removing shadows and other lighting/texture artifacts from building facades in Google Earth. We obtain cleaner, clearer, and more uniform textures which provide users with an improved visual experience.
  • Video Stabilization on YouTube Editor by Matthias Grundmann, Vivek Kwatra, and Irfan Essa
    Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. In contrast, professionally shot video usually relies on stabilization equipment such as tripods or camera dollies and employs ease-in and ease-out for transitions. Our technique mimics these cinematographic principles by optimally dividing the original, shaky camera path into a set of segments and approximating each with constant, linear, or parabolic motion, using a computationally efficient and stable algorithm. We will showcase a live version of our algorithm, featuring real-time performance and interactive control, which is publicly available at youtube.com/editor.
  • Tag Suggest for YouTube by George Toderici and Mehmet Emre Sargin
    YouTube offers millions of users the opportunity to upload videos and share them with their friends. Many users would love to have their videos discoverable but don't annotate them properly. One new feature on YouTube that seeks to address this problem is tag prediction based on video content and independently based on text metadata.

6/17/2011 UPDATE: "Posted by" was changed to include Sergey Ioffe.
Posted in conference, CVPR, Vision Research
