May 2013 ~ Compact System

Thursday, 30 May 2013

Distributing the Edit History of Wikipedia Infoboxes

Posted on 10:15 by Unknown

Posted by Enrique Alfonseca, Google Research

Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.

Figure 1. Infobox for the Republic of Palau in 2006 and 2013 showing the capital change.

Many attributes vary over time. These include the presidents of countries, the spouses of people, the populations of cities and the number of employees of companies. Every Wikipedia page has an associated history from which users can view and compare past versions. Having the historical values of infobox entries available would provide an overview of the changes affecting each entry, and would help us understand which attributes are more likely to change over time or to change at regular intervals, and which ones attract more user interest and are actually updated in a timely fashion. We believe that such a resource will also be useful in training systems to learn to extract data from documents, as it will allow us to collect more training examples by matching old values of an attribute inside old pages.

For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing the full edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset makes it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication in the Language Resources and Evaluation journal.

What kind of information can be learned from this data? Some examples from preliminary analyses include the following:

Every country in the world has a population attribute in its Wikipedia infobox, which is updated at least yearly for more than 90% of them. The average error rate with respect to the yearly World Bank estimates is between two and three percent, mostly due to rounding.
50% of deaths are updated into Wikipedia infoboxes within a couple of days... but for scientists it takes 31 days to reach 50% coverage!
For the last episode of TV shows, the airing date is updated for 50% of them within 9 days; for the first episode of TV shows, it takes 106 days.
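
Figures such as "50% within 9 days" can be read as a coverage quantile over update lags. The sketch below is only an illustration of that reading: the (event date, infobox update date) pairs, the function name `days_to_coverage` and all the dates are invented, and this is not the layout of the released dataset.

```python
from datetime import date
import math

def days_to_coverage(events, fraction=0.5):
    """Return the smallest lag (in days) by which `fraction` of the
    events had been reflected in the infobox.

    `events` is a list of (event_date, infobox_update_date) pairs; this
    layout is hypothetical and is not the released dataset's format.
    """
    lags = sorted((updated - happened).days for happened, updated in events)
    # Index of the event that pushes coverage to at least `fraction`.
    k = max(0, math.ceil(fraction * len(lags)) - 1)
    return lags[k]

# Toy data, invented for illustration only.
events = [
    (date(2012, 1, 1), date(2012, 1, 2)),    # updated after 1 day
    (date(2012, 3, 10), date(2012, 3, 12)),  # 2 days
    (date(2012, 5, 5), date(2012, 6, 5)),    # 31 days
    (date(2012, 7, 1), date(2012, 10, 9)),   # 100 days
]
print(days_to_coverage(events))        # lag needed to cover half the events
print(days_to_coverage(events, 0.75))  # lag needed to cover three quarters
```

Comparing this number across groups of entities (for example, all deaths versus deaths of scientists) would give the kind of contrast reported above.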

While infobox attribute updates will be much easier to process as they transition into the Wikidata project, we are not there yet and we believe that the availability of this dataset will facilitate the study of changing attribute values. We are looking forward to the results of those studies.

Thanks to Googler Jean-Yves Delort, and to Guillermo Garrido and Anselmo Peñas from UNED, for putting this dataset together, and to Angelika Mühlbauer and Kai Nissen from Wikimedia Deutschland for their support. Thanks also to Thomas Hofmann and Fernando Pereira for making this data release possible.

Posted in wikipedia | No comments

Wednesday, 29 May 2013

Open Access for Publications

Posted on 12:00 by Unknown

Posted by Alfred Spector, Vice President, Engineering

The Association for Computing Machinery (ACM) recently announced a new option for publication rights management, wherein researchers can choose to pay for the public to have perpetual open access to the publication. Google applauds this new option, and today we are announcing that we will pay the open access fees for all articles by Google researchers that are published in ACM journals. IEEE also has an open access option for some of its publications, and we also pay the open access fee for them and for publications in similar organizations.

Google has always believed that by improving access to the world’s knowledge, we can help improve everyone’s lives. When it comes to scientific research, we have consistently said that open access to publications speeds up research, accelerates innovation, and helps grow the global economy.

Policies like ACM’s continue to demonstrate the sustainability of open access publishing. They will also provide better access to the papers that we write at Google. We encourage researchers everywhere to pursue open access options whenever publishing articles, and to continue to make publications available as widely as possible, within their rights.

Posted in ACM, Publications | No comments

Tuesday, 28 May 2013

Explore more with Mapping with Google

Posted on 09:00 by Unknown

Posted by Tina Ornduff, Program Manager

In September 2012 we launched Course Builder, an open source learning platform for educators or anyone with something to teach, to create online courses. This was our experimental first step in the world of online education, and since then the features of Course Builder have continued to evolve. Mapping with Google, our latest MOOC, showcases new features of the platform.

From your own backyard all the way to Mount Everest, Google Maps and Google Earth are here to help you explore the world. You can learn to harness the world’s most comprehensive and accurate mapping tools by registering for Mapping with Google.

Mapping with Google is a self-paced, online course developed to help you better navigate the world around you by improving your use of the new Google Maps, Maps Engine Lite, and Google Earth. All registrants will receive an invitation to preview the new Google Maps.

Through a combination of video and text lessons, activities, and projects, you’ll learn to do much more than look up directions or find your house from outer space. Tell a story of your favorite locations with rich 3D imagery, or plot sights to see on your upcoming trip and share with your travel buddies. During the course, you’ll have the opportunity to learn from Google experts and collaborate with a worldwide community of participants, via Google+ Hangouts and a course forum.

Mapping with Google will be offered from June 10 to June 24, and you can choose whether to explore the features of Google Maps, Google Earth, or both. In addition, you’ll have the option to complete a project, applying the skills you’ve learned to earn a certificate. Visit g.co/mappingcourse to learn more and register today.

The world is a big place; we like to think that you can make it a bit more manageable and adventurous with Google’s mapping tools.

Posted in Education, MOOC | No comments

Thursday, 23 May 2013

Syntactic Ngrams over Time

Posted on 13:00 by Unknown

Posted by Yoav Goldberg, Professor at Bar Ilan University & Post-doc at Google 2011-2013

We are proud to announce the release of a very large dataset of counted dependency tree fragments from the English Books Corpus. This resource will help researchers, among other things, to model the meaning of English words over time and create better natural-language analysis tools. The resource is based on information derived from a syntactic analysis of the text of millions of English books.

Sentences in languages such as English have structure. This structure is called syntax, and knowing the syntax of a sentence is a step towards understanding its meaning. The process of taking a sentence and transforming it into a syntactic structure is called parsing. At Google, we parse a lot of text every day, in order to better understand it and be able to provide better results and services in many of our products.

There are many kinds of syntactic representations (you may be familiar with sentence diagramming), and at Google we've been focused on a certain type of syntactic representation called "dependency trees". The dependency-tree representation is centered around words and the relations between them. Each word in a sentence can either modify or be modified by other words. The various modifications can be represented as a tree, in which each node is a word.

For example, the sentence "we really like syntax" is analyzed as:

The verb "like" is the main word of the sentence. It is modified by a subject (denoted nsubj) "we", a direct object (denoted dobj) "syntax", and an adverbial modifier "really".
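
As a rough illustration, the analysis just described can be written down as a small data structure. This is a minimal sketch, not the parser's actual output format: the `Token` layout and the `modifiers` helper are invented here, and "advmod" is an assumed label for the adverbial modifier, which the text does not name.

```python
from collections import namedtuple

# One token of a dependency tree: the word, the index of the word it
# modifies (-1 for the root), and the label of that relation.
Token = namedtuple("Token", ["word", "head", "label"])

# "we really like syntax"; "advmod" is an assumed label for the
# adverbial modifier, and "root" marks the main word.
sentence = [
    Token("we", 2, "nsubj"),
    Token("really", 2, "advmod"),
    Token("like", -1, "root"),
    Token("syntax", 2, "dobj"),
]

def modifiers(tokens, head_index):
    """Return (label, word) pairs for the words modifying tokens[head_index]."""
    return [(t.label, t.word) for t in tokens if t.head == head_index]

# Find the main word (the token with no head) and list its modifiers.
root = next(i for i, t in enumerate(sentence) if t.head == -1)
print(sentence[root].word, modifiers(sentence, root))
```

Storing only a head index and a label per word is enough to recover the whole tree, because every word except the root modifies exactly one other word.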

An interesting property of syntax is that, in many cases, one could recover the structure of a sentence without knowing the meaning of most of the words. For example, consider the sentence "the krumpets gnorked the koof with a shlap". We bet you could infer its structure, and tell that a group of things called "krumpets" did something called "gnorking" to something called a "koof", and that they did so with a "shlap".

This property, by which you could infer the structure of the sentence based on various hints without knowing the actual meaning of the words, is very useful. For one, it suggests that even a computer could do a reasonable job at such an analysis, and indeed it can! While still not perfect, parsing algorithms these days can analyze sentences with impressive speed and accuracy. For instance, our parser correctly analyzes the made-up sentence above.

Let's try a more difficult example. Something rather long and literary, like the opening sentence of One Hundred Years of Solitude by Gabriel García Márquez, as translated by Gregory Rabassa:

Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.

Pretty good for an automatic process, eh?

And it doesn’t end here. Once we know the structure of many sentences, we can use these structures to infer the meaning of words, or at least find words which have a similar meaning to each other.

For example, consider the fragments:
"order a XYZ"
"XYZ is tasty"
"XYZ with ketchup"
"juicy XYZ"

By looking at the words modifying XYZ and their relations to it, you could probably infer that XYZ is a kind of food. And even if you are a robot and don't really know what a "food" is, you could probably tell that the XYZ must be similar to other unknown concepts such as "steak" or "tofu".

But maybe you don't want to infer anything. Maybe you already know what you are looking for, say "tasty food". In order to find such tasty food, one could collect the list of words which are objects of the verb "ate", and are commonly modified by the adjectives "tasty" and "juicy". This should provide you with a large list of yummy foods.
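
The "tasty food" query described above can be sketched on a handful of counted fragments. Everything in this example is an assumption made for illustration: the (head, relation, modifier, count) tuple layout, the labels "dobj" and "amod", the function name `tasty_foods`, and the counts themselves, which are invented.

```python
from collections import defaultdict

# Toy counted fragments as (head, relation, modifier, count).
# The layout, the labels and the counts are invented for illustration.
fragments = [
    ("ate", "dobj", "steak", 120),
    ("ate", "dobj", "tofu", 40),
    ("ate", "dobj", "lunch", 300),
    ("steak", "amod", "juicy", 35),
    ("steak", "amod", "tasty", 20),
    ("tofu", "amod", "tasty", 8),
    ("lunch", "amod", "early", 50),
    ("gossip", "amod", "juicy", 15),
]

def tasty_foods(fragments, verb="ate", adjectives=("tasty", "juicy")):
    """Rank words that are objects of `verb` and are modified by `adjectives`."""
    # Step 1: every word seen as a direct object of the verb.
    eaten = {mod for head, rel, mod, _ in fragments
             if head == verb and rel == "dobj"}
    # Step 2: add up how often those words carry one of the adjectives.
    score = defaultdict(int)
    for head, rel, mod, count in fragments:
        if rel == "amod" and mod in adjectives and head in eaten:
            score[head] += count
    return sorted(score, key=score.get, reverse=True)

print(tasty_foods(fragments))
```

Requiring both conditions is what filters the list: "gossip" is juicy but never eaten, and "lunch" is eaten but never described as tasty or juicy in this toy data.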

Imagine what you could achieve if you had hundreds of millions of such fragments. The possibilities are endless, and we are curious to know what the research community may come up with. So we parsed a lot of text (over 3.5 million English books, or roughly 350 billion words), extracted such tree fragments, counted how many times each fragment appeared, and put the counts online for everyone to download and play with.

350 billion words is a lot of text, and the resulting dataset of fragments is very, very large. The resulting datasets, each representing a particular type of tree fragment, contain billions of unique items, and each dataset’s compressed files take tens of gigabytes. Some coding and data analysis skills will be required to process it, but we hope that with this data amazing research will be possible, by experts and non-experts alike.

The dataset is based on the English Books corpus, the same dataset behind the ngram-viewer. This time there is no easy-to-use GUI, but we still retain the time information, so for each syntactic fragment, you know not only how many times it appeared overall, but also how many times it appeared in each year. You could, for example, look at the objects of the verb “drank” in each decade from 1900 to 2000 and learn how drinking habits changed over time: much more ‘beer’ and ‘coffee’, and somewhat less ‘wine’ and ‘glass’ (probably ‘of wine’). There’s also a drop in ‘whisky’ and an increase in ‘alcohol’. ‘Brandy’ catches on around the 1930s and starts dropping around the 1980s. There is an increase in ‘juice’ and, thankfully, some decrease in ‘poison’.
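
To make the per-year counts concrete, here is a minimal sketch of grouping one fragment's yearly counts by decade. The tab-separated record layout, the word/tag/label/head-index token notation and the sample record are all assumptions made for illustration; the dataset's own documentation is the authority on the real format.

```python
from collections import Counter

def counts_by_decade(line):
    """Parse one record and sum its yearly counts per decade.

    Assumed layout (check the dataset's own documentation):
    head <TAB> fragment <TAB> total <TAB> year,count <TAB> year,count ...
    """
    head, fragment, total, *yearly = line.rstrip("\n").split("\t")
    decades = Counter()
    for item in yearly:
        year, count = item.split(",")
        decades[int(year) // 10 * 10] += int(count)
    return head, fragment, int(total), dict(decades)

# Invented record: "beer" attached to "drank" as its direct object.
line = "drank\tdrank/VBD/ROOT/0 beer/NN/dobj/1\t12\t1901,2\t1905,3\t1950,7"
head, fragment, total, decades = counts_by_decade(line)
print(head, decades)
```

Summing these per-decade totals over all fragments that share a head word and relation would give the kind of trend described for "drank".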

The dataset is described in detail in this scientific paper, and is available for download here.

Posted in NLP | No comments

Thursday, 16 May 2013

Launching the Quantum Artificial Intelligence Lab

Posted on 02:00 by Unknown

Posted by Hartmut Neven, Director of Engineering

We believe quantum computing may help solve some of the most challenging computer science problems, particularly in machine learning. Machine learning is all about building better models of the world to make more accurate predictions. If we want to cure diseases, we need better models of how they develop. If we want to create effective environmental policies, we need better models of what’s happening to our climate. And if we want to build a more useful search engine, we need to better understand spoken questions and what’s on the web so you get the best answer.

So today we’re launching the Quantum Artificial Intelligence Lab. NASA’s Ames Research Center will host the lab, which will house a quantum computer from D-Wave Systems, and the USRA (Universities Space Research Association) will invite researchers from around the world to share time on it. Our goal: to study how quantum computing might advance machine learning.

Machine learning is highly difficult. It’s what mathematicians call an “NP-hard” problem. That’s because building a good model is really a creative act. As an analogy, consider what it takes to architect a house. You’re balancing lots of constraints -- budget, usage requirements, space limitations, etc. -- but still trying to create the most beautiful house you can. A creative architect will find a great solution. Mathematically speaking the architect is solving an optimization problem and creativity can be thought of as the ability to come up with a good solution given an objective and constraints.

Classical computers aren’t well suited to these types of creative problems. Solving such problems can be imagined as trying to find the lowest point on a surface covered in hills and valleys. Classical computing might use what’s called “gradient descent”: start at a random spot on the surface, look around for a lower spot to walk down to, and repeat until you can’t walk downhill anymore. But all too often that gets you stuck in a “local minimum” -- a valley that isn’t the very lowest point on the surface.
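
The walk-downhill procedure can be sketched in a few lines. This is a toy illustration only: the one-dimensional surface `f`, the step size and the two starting points are arbitrary choices made here, and fixed starting points stand in for the random spot so that the run is repeatable.

```python
def f(x):
    # A toy one-dimensional "surface" with two valleys of different depth.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of f, i.e. the local slope.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, step=0.01, iterations=2000):
    """Repeatedly take a small step downhill from the starting point x."""
    for _ in range(iterations):
        x -= step * grad_f(x)
    return x

# Two different starting spots end up in two different valleys.
left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print(round(left, 2), round(f(left), 2))
print(round(right, 2), round(f(right), 2))
```

The run that starts on the right settles in the shallower valley and never finds the deeper one on the left, because every small step away from where it sits leads uphill.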

That’s where quantum computing comes in. It lets you cheat a little, giving you some chance to “tunnel” through a ridge to see if there’s a lower valley hidden beyond it. This gives you a much better shot at finding the true lowest point -- the optimal solution.

We’ve already developed some quantum machine learning algorithms. One produces very compact, efficient recognizers -- very useful when you’re short on power, as on a mobile device. Another can handle highly polluted training data, where a high percentage of the examples are mislabeled, as they often are in the real world. And we’ve learned some useful principles: e.g., you get the best results not with pure quantum computing, but by mixing quantum and classical computing.

Can we move these ideas from theory to practice, building real solutions on quantum hardware? Answering this question is what the Quantum Artificial Intelligence Lab is for. We hope it helps researchers construct more efficient and more accurate models for everything from speech recognition, to web search, to protein folding. We actually think quantum machine learning may provide the most creative problem-solving process under the known laws of physics. We’re excited to get started with NASA Ames, D-Wave, the USRA, and scientists from around the world.
