
Thursday, 25 July 2013

Under the hood of Croatian, Filipino, Ukrainian, and Vietnamese in Google Voice Search

Posted by Eugene Weinstein and Pedro Moreno, Google Speech Team

Although we’ve been working on speech recognition for several years, every new language requires our engineers and scientists to tackle unique challenges. Our most recent additions - Croatian, Filipino, Ukrainian, and Vietnamese - required creative solutions to reflect how each language is used across devices and in everyday conversations.

For example, since Vietnamese is a tonal language, we had to explore how to take tones into consideration. One simple technique is to model the tone and vowel combinations (tonemes) directly in our lexicons. This, however, has the side effect of a larger phonetic inventory. As a result, we had to come up with special algorithms to handle the increased complexity. Additionally, Vietnamese is a heavily diacritized language, with tone markers on a majority of syllables. Since Google Search is very good at returning valid results even when diacritics are omitted, our Vietnamese users frequently omit the diacritics when typing their queries. This creates difficulties for the speech recognizer, which selects its vocabulary from typed queries. To address this, we created a special diacritic restoration algorithm that enables us to present properly formatted text to our users in the majority of cases.
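
To make the idea concrete, here is a minimal sketch of dictionary-based diacritic restoration, assuming we simply map each undiacritized token to the diacritized form it most often takes in well-formed text. This is a toy illustration only, not the production algorithm; the helper names and the tiny example corpus are made up for the example.

```python
# Toy illustration of diacritic restoration (not the production algorithm):
# map each undiacritized token to its most frequent diacritized form,
# estimated from a small corpus of properly written Vietnamese text.
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word):
    """Remove combining marks, e.g. 'tiếng' -> 'tieng'."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Vietnamese 'đ' does not decompose into a base letter plus combining mark.
    return stripped.replace("đ", "d").replace("Đ", "D")

def build_restoration_table(corpus_sentences):
    """Count how often each bare form maps to each diacritized form."""
    counts = defaultdict(Counter)
    for sentence in corpus_sentences:
        for word in sentence.split():
            counts[strip_diacritics(word.lower())][word.lower()] += 1
    # Keep only the most frequent diacritized form for each bare form.
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}

def restore_diacritics(query, table):
    """Replace each bare token with its most likely diacritized form, if known."""
    return " ".join(table.get(tok, tok) for tok in query.lower().split())

# Example with a tiny hypothetical corpus.
corpus = ["tiếng việt có thanh điệu", "tìm kiếm bằng giọng nói"]
table = build_restoration_table(corpus)
print(restore_diacritics("tieng viet", table))  # -> 'tiếng việt'
```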

Filipino also presented interesting challenges. Much like people in other multilingual societies such as Hong Kong, India, and South Africa, Filipinos often mix several languages in their daily lives. This is called code switching. Code switching complicates the design of pronunciation, language, and acoustic models, and it effectively presents speech scientists with a dilemma: should we build one system per language, or should we combine all languages into one?

In such situations we prefer to model the reality of daily language use in our speech recognizer design. If users mix several languages, our recognizers should do their best to model this behavior. Hence our Filipino voice search system, while mainly focused on the Filipino language, also allows users to mix in English terms.
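
As a rough illustration of the combined-system option, the sketch below merges a Filipino pronunciation lexicon with selected English entries so that a single recognizer vocabulary covers mixed queries. The entries, phoneme symbols, and function name are hypothetical and much simplified.

```python
# Minimal sketch of the "one combined system" option for code switching:
# merge a Filipino pronunciation lexicon with selected English entries so a
# single recognizer vocabulary covers mixed queries. Entries are hypothetical.
filipino_lexicon = {
    "kumusta": ["k", "u", "m", "u", "s", "t", "a"],
    "magandang": ["m", "a", "g", "a", "n", "d", "a", "ng"],
}
english_lexicon = {
    "traffic": ["t", "r", "ae", "f", "ih", "k"],
    "update": ["ah", "p", "d", "ey", "t"],
}

def build_mixed_lexicon(primary, secondary):
    """Primary-language pronunciations win when the same word appears in both."""
    mixed = dict(secondary)
    mixed.update(primary)  # primary entries override secondary ones
    return mixed

lexicon = build_mixed_lexicon(filipino_lexicon, english_lexicon)
print(sorted(lexicon))  # vocabulary now spans both languages
```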

The algorithms we’re using to model how speech sounds are spoken in each language make use of our distributed large-scale neural network learning infrastructure (yes, the same one that spontaneously discovered cats on YouTube!). By partitioning the gigantic parameter set of the model, and by evaluating each partition on a separate computation server, we’re able to achieve unprecedented levels of parallelism in training acoustic models.
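
The following sketch illustrates the partitioning idea in miniature, assuming a single layer whose weight matrix is split column-wise across several computation servers. It is not the actual distributed training infrastructure; the sizes, names, and the in-process "servers" are illustrative only.

```python
# Sketch of parameter partitioning for model parallelism (illustrative only):
# split a layer's weight matrix column-wise across "servers", let each compute
# its slice of the layer output, then concatenate the partial results.
import numpy as np

def partition_columns(weights, num_servers):
    """Split a weight matrix into roughly equal column blocks, one per server."""
    return np.array_split(weights, num_servers, axis=1)

def parallel_layer_forward(inputs, weight_partitions):
    """Each 'server' evaluates its partition; results are concatenated."""
    partial_outputs = [inputs @ w for w in weight_partitions]  # could run in parallel
    return np.concatenate(partial_outputs, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))          # a batch of acoustic feature frames
full_w = rng.standard_normal((512, 2048))  # one large hidden layer
parts = partition_columns(full_w, num_servers=4)
assert np.allclose(parallel_layer_forward(x, parts), x @ full_w)
```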

The more people use Google speech recognition products, the more accurate the technology becomes. These new neural network technologies will help us bring you lots of improvements and many more languages in the future.
Posted in internationalization, Speech

Wednesday, 17 July 2013

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts

Posted by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard, Google Research

“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato

When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed, or in the concept or entity that string represents? If it’s the latter, then knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.

We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.

These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. Across the two sets, 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs).

Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. We also include confidence levels, so if you need even higher precision you can filter out the lower-confidence annotations that we did include.
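
As a rough illustration, the sketch below keeps only annotations above a confidence threshold. The exact file layout is documented on the download pages; the tab-separated format, column position of the confidence score, and file name assumed here are illustrative only.

```python
# Hypothetical sketch of filtering annotations by confidence. The layout
# assumed here (tab-separated rows with the confidence score in the
# second-to-last column) is an illustration, not the documented format.
import csv

def high_confidence_annotations(path, threshold=0.9):
    """Yield annotation rows whose confidence meets the threshold."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            confidence = float(row[-2])  # assumed position of the confidence score
            if confidence >= threshold:
                yield row

# Example usage (file name is hypothetical):
# for row in high_confidence_annotations("clueweb09_facc_sample.tsv"):
#     print(row)
```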

Based on a review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.

The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.

If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.

You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.

Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.
Posted in Natural Language Processing

Tuesday, 16 July 2013

New research from Google shows that 88% of the traffic generated by mobile search ads is not replaced by traffic originating from mobile organic search

Posted by Shaun Lysen, Statistician at Google

Oftentimes people are presented with two choices after making a search on their devices: they can click either on the organic results for their query or on the ads that appear on the page. Website owners who want to build a strong online presence often wonder how to balance organic search and paid search ads in driving website traffic. But what happens when ads are paused? Would businesses see an increase in organic traffic that could make up for the loss in paid traffic? To answer these questions, we released a “Search Ads Pause” analysis in 2011 showing that 89% of traffic generated by search ads is not replaced by organic clicks.

As smartphones become increasingly important to consumers, we recently conducted the same studies for mobile devices to understand the role of mobile search ads in driving site traffic. From March 2012 to April 2013, we ran 327 unique studies across US-based mobile advertising accounts from 12 key industries.

We selected AdWords accounts that exhibited sharp changes in advertisers’ spending on mobile search (ad spend) and identified stable periods before the spend change (pre-period) and after the spend change (post-period). We observed the number of organic and paid clicks, and the number of times organic results appeared on the first page of search results (impressions), during both the pre-period and post-period. Google then created a proprietary statistical model to predict what the number of organic and paid clicks would have been in the post-period had the ad spend not changed, and compared those figures to the actual number of clicks observed. We were then able to estimate what percentage of paid clicks are incremental, i.e., clicks for which a visit to the advertiser’s site from an ad would not have been replaced by a visit from an organic result.
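
The arithmetic behind the incrementality estimate can be sketched as follows, assuming the model’s counterfactual predictions are already in hand. The function and the numbers are illustrative only, not data from the study.

```python
# Illustrative arithmetic for the incrementality estimate described above.
# The counterfactual ("predicted") click counts would come from the statistical
# model; the numbers below are made up for the example, not study data.
def incremental_click_fraction(observed_paid, observed_organic,
                               predicted_paid, predicted_organic):
    """Fraction of lost paid clicks NOT replaced by extra organic clicks."""
    paid_clicks_lost = predicted_paid - observed_paid
    organic_clicks_gained = observed_organic - predicted_organic
    replaced = max(0.0, organic_clicks_gained) / paid_clicks_lost
    return 1.0 - replaced

# Hypothetical post-period numbers after pausing a mobile campaign:
print(incremental_click_fraction(observed_paid=0, observed_organic=1120,
                                 predicted_paid=1000, predicted_organic=1000))
# -> 0.88, i.e. 88% of the paid clicks were not recovered organically
```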

The final results showed that mobile search ads contribute to a very high proportion of incremental traffic to websites. On average, 88% of mobile paid clicks are not recovered by organic clicks when a mobile search campaign is paused. This figure is consistently high across the 12 key industries, including automotive, travel, retail and more. The full study, including details around the methodology and findings, can be found in the paper ‘Incremental Clicks Impact of Mobile Search Advertising’.

Tuesday, 9 July 2013

Google Databoard: A new way to explore industry research

Posted by Adam Grunewald, Mobile Marketing Manager

It’s important for people to stay up to date about the most recent research and insights related to their work or personal lives. But it can be difficult to keep up with all the new studies and updated data out there. To make life a bit easier, we’re introducing a new take on how research can be presented. The Databoard for Research Insights enables people to explore and interact with some of Google’s recent research in a unique and immersive way. The Databoard uses responsive design to offer an engaging experience across devices. Additionally, the tool is a new venture into data visualization and shareability, with bite-sized charts and stats that can be shared with your friends or coworkers. The Databoard is currently home to several of Google’s market research studies for businesses, but we believe that this way of conveying data can work across all forms of research.



Here are some of the things that make the Databoard different from other ways research is released today:

Easy to use
All of the information in the Databoard is presented in a bite-sized way so that you can quickly find relevant information. You can explore an entire study or jump straight to the topics or data points you care about. The Databoard is also optimized for all devices so you can explore the research on your computer, tablet or smartphone.

Meant to be shared
Most people, when they find a compelling piece of data, want to share it! Whether it’s with a colleague, a client, or a community on a blog or social network, compelling insights and data are meant to be shared. With the Databoard, you can easily share individual charts and insights or collections of data with anyone through email or social networks; just look for the share button at the top of each chart or insight.

Create a cohesive story
Most research studies set out to answer a specific question, like how people use their smartphones in stores, or how a specific type of consumer shops. This means that businesses need to look across multiple pieces of research to craft a comprehensive business or marketing strategy. With this in mind, the Databoard lets you curate a customized infographic out of the charts or data points you find important across multiple Google research studies. Creating an infographic is quick and easy, and you can share the finished product with your friends or colleagues.

The Databoard is currently home to six research studies, including The New Multi-screen World, Mobile In-store shopper research, and Mobile search moments. New studies will be added frequently. To get started creating your own infographic, visit the Databoard now.
Posted in Market Research, Visualization

Wednesday, 3 July 2013

Conference Report: USENIX Annual Technical Conference (ATC) 2013

Posted by Murray Stokely, Google Storage Analytics Team

This year marks Google’s eleventh consecutive year as a sponsor of the USENIX Annual Technical Conference (ATC), just one of the co-located events at USENIX Federated Conference Week (FCW), which combines numerous conferences and workshops covering fields such as Autonomic Computing, Feedback Computing and much more in an intensive week of research, trends, and community interaction.

ATC provides a broad forum for computing systems research with an emphasis on implementations and experimental results. In addition to the Googlers presenting publications, we had two members on the program committee of ATC and several keynote speakers, invited speakers, panelists, committee members, and participants at the other co-located events at FCW.

In the paper Janus: Optimal Flash Provisioning for Cloud Storage Workloads, Googler Christoph Albrecht and co-authors demonstrated a system that allows users to make informed flash memory provisioning and partitioning decisions in cloud-scale distributed file systems that include both flash storage and disk tiers. As flash memory is still expensive, it is best to use it only for workloads that can make good use of it. Janus creates long-term workload characterizations based on RPC samples and file age metadata. It uses these workload characterizations to formulate and solve an optimization problem that maximizes the reads sent to the flash tier. Based on evaluations of workloads using Janus, which has been in use at Google for the past six months, the authors conclude that the recommendation system is quite effective: flash hit rates using the optimized recommendations are 47-76% higher than with the flash used as an unpartitioned tier.
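
As a rough sketch of the underlying idea (not the formulation in the paper), one could greedily fill a fixed flash budget with the workload groups that deliver the most reads per byte. The workload names and numbers below are hypothetical.

```python
# Minimal greedy sketch of the flash-provisioning idea (not Janus itself):
# given per-workload read rates and sizes, fill a fixed flash budget with the
# workload groups that yield the most reads per byte. Numbers are hypothetical.
def allocate_flash(workloads, flash_budget_gb):
    """workloads: list of (name, reads_per_sec, size_gb). Returns chosen names."""
    ranked = sorted(workloads, key=lambda w: w[1] / w[2], reverse=True)
    chosen, used = [], 0.0
    for name, reads, size in ranked:
        if used + size <= flash_budget_gb:
            chosen.append(name)
            used += size
    return chosen

workloads = [("logs-recent", 5000.0, 200.0),
             ("photos-hot", 1200.0, 150.0),
             ("backups-cold", 10.0, 4000.0)]
print(allocate_flash(workloads, flash_budget_gb=300.0))  # -> ['logs-recent']
```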

In packetdrill: Scriptable Network Stack Testing, from Sockets to Packets, Google’s Neal Cardwell and co-authors showcased a portable, open-source scripting tool that enables testing the correctness and performance of network protocols. Despite their importance in modern computer systems, network protocols often undergo only ad hoc testing before their deployment, in large part due to their complexity. Furthermore, new algorithms have unforeseen interactions with other features, so testing has only become more daunting as TCP has evolved. The packetdrill tool was instrumental in the development of three new features for Linux TCP—Early Retransmit, Fast Open, and Loss Probes—and allowed the authors to find and fix 10 bugs in Linux. Furthermore, the team uses packetdrill in all phases of the development process for the kernel used in one of the world’s largest Linux installations. In the hope that sharing packetdrill with the community will make the process of improving Internet protocols an easier one, the source code and test scripts for packetdrill have been made freely available.

There were also additional refereed publications with Google co-authors at some of the co-located events at FCW, notably NicPic: Scalable and Accurate End-Host Rate Limiting, which outlines a system that enables accurate network traffic scheduling in a scalable fashion, and AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service, a system that efficiently handles dynamic application workloads, reducing both penalties and user dissatisfaction.

Google is proud to support the academic community through conference participation and sponsorship. In particular, we are happy to mention one of the other interesting papers from this year’s USENIX FCW, co-authored by former Google PhD fellowship recipient Ashok Anand, MiG: Efficient Migration of Desktop VM Using Semantic Compression.

USENIX is a supporter of open access, so the papers and videos from the talks are available on the conference website.
Posted in conference, Publications

Tuesday, 2 July 2013

Natural Language Understanding-focused awards announced

Posted by Massimiliano Ciaramita, Research Scientist, and David Harper, Head of University Relations (EMEA)

Some of the biggest challenges for the scientific community today involve understanding the principles and mechanisms that underlie natural language use on the Web. An example of a long-standing problem is language ambiguity: when somebody types the word “Rio” in a query, do they mean the city, a movie, a casino, or something else? Understanding the difference can be crucial to helping users get the answer they are looking for. In the past few years, a significant effort in industry and academia has focused on disambiguating language with respect to Web-scale knowledge repositories such as Wikipedia and Freebase. These resources are used primarily as canonical, although incomplete, collections of “entities”. As entities are often connected in multiple ways, e.g., explicitly via hyperlinks and implicitly via factual information, such resources can be naturally thought of as (knowledge) graphs. This work has provided the first breakthroughs towards anchoring language on the Web to interpretable, albeit initially shallow, semantic representations. Google has brought the vision of semantic search directly to millions of users via the adoption of the Knowledge Graph. This massive change to search technology has also been called a shift “from strings to things”.
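
As a toy illustration of the disambiguation problem, the sketch below scores candidate entities for “Rio” by how well their associated context words overlap the rest of the query. The candidate entities and context words are made up for the example; real systems use far richer signals.

```python
# Toy sketch of context-based disambiguation for an ambiguous query term such
# as "Rio". Candidate entities and their context words are illustrative only.
CANDIDATES = {
    "Rio de Janeiro (city)": {"city", "brazil", "beach", "carnival", "travel"},
    "Rio (2011 film)":       {"movie", "film", "animated", "trailer", "cast"},
    "Rio Casino":            {"casino", "vegas", "hotel", "poker", "rooms"},
}

def disambiguate(query):
    """Pick the candidate whose context words overlap the query the most."""
    words = set(query.lower().split())
    scores = {name: len(words & context) for name, context in CANDIDATES.items()}
    return max(scores, key=scores.get)

print(disambiguate("rio movie trailer"))  # -> 'Rio (2011 film)'
print(disambiguate("rio travel beach"))   # -> 'Rio de Janeiro (city)'
```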

Understanding natural language is at the core of Google's work to help people get the information they need as quickly and easily as possible. At Google we work hard to advance the state of the art in natural language processing, to improve the understanding of fundamental principles, and to solve the algorithmic and engineering challenges to make these technologies part of everyday life. Language is inherently productive; an infinite number of meaningful new expressions can be formed by systematically combining the meanings of their components. The logical next step is the semantic modeling of structured meaningful expressions -- in other words, “what is said” about entities. We envision that knowledge graphs will support the next leap forward in language understanding towards scalable compositional analyses, by providing a universe of entities, facts and relations upon which semantic composition operations can be designed and implemented.

So we’ve just awarded over $1.2 million in natural language understanding research awards to university research groups doing work in this area. Research topics range from semantic parsing to statistical models of life stories to novel compositional inference and representation approaches for modeling relations and events in the Knowledge Graph.

These awards went to researchers at nine universities and institutions worldwide, selected after a rigorous internal review:

  • Mark Johnson and Lan Du (Macquarie University) and Wray Buntine (NICTA) for “Generative models of Life Stories”
  • Percy Liang and Christopher Manning (Stanford University) for “Tensor Factorizing Knowledge Graphs”
  • Sebastian Riedel (University College London) and Andrew McCallum (University of Massachusetts, Amherst) for “Populating a Knowledge Base of Compositional Universal Schema”
  • Ivan Titov (University of Amsterdam) for “Learning to Reason by Exploiting Grounded Text Collections”
  • Hans Uszkoreit (Saarland University and DFKI), Feiyu Xu (DFKI and Saarland University) and Roberto Navigli (Sapienza University of Rome) for “Language Understanding cum Knowledge Yield”
  • Luke Zettlemoyer (University of Washington) for “Weakly Supervised Learning for Semantic Parsing with Knowledge Graphs”

We believe the results will be broadly useful to product development and will further scientific research. We look forward to working with these researchers, and we hope we will jointly push the frontier of natural language understanding research to the next level.
Posted in NLP, University Relations