Monday, 14 June 2010

Google Search by Voice now available in France, Italy, Germany and Spain

Posted on 16:00 by Unknown
Posted by Thad Hughes, Martin Jansche, and Pedro Moreno, Google Research

Google’s speech team is composed of people from many different cultural backgrounds. Indeed, if we count the languages spoken by our teammates, the number comes to well over a dozen. Given our own backgrounds and interests, we are naturally excited to extend our software to work with many different languages and dialects. After testing the waters with English, Mandarin Chinese, and Japanese, we decided to tackle four major European languages, often referred to as FIGS: French, Italian, German, and Spanish.

Developing Voice Search systems in each of these languages presented its own challenges. French and Spanish required special work to deal with diacritics and accent marks (e.g. ç in French, ñ in Spanish). When we develop a system for a new language, we tune our dictionaries based on user-generated content. To our surprise, we found that a lot of this content in French and Spanish uses non-standard orthography. For example, a French speaker might type “francoise” into a search engine and still expect it to return results for “Françoise”. Likewise, a Spanish user might type “espana” and expect results for the term “España”. Much of this has to do with the fact that, until recently, domain names (like www.elpais.es) did not allow diacritics, and that entering special characters is often painful, whereas omitting diacritics is usually not an obstacle to communication. However, non-standard spellings distort the intended pronunciations. For example, if “francoise” were a real French word, one would expect it to be pronounced “franquoise”. To capture the intended pronunciation of the non-standard spellings, we automatically fixed the orthography in our dictionaries for Spanish and French. While this is not perfect, it deals with many of the offending cases.
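The post does not say how the orthography was fixed, so the following is only a minimal sketch of one plausible approach, not the method actually used. It assumes a hypothetical table of word counts from a source with careful spelling (`trusted`), and maps each diacritic-free spelling to its most frequent correctly spelled form. All function names here are illustrative.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word: str) -> str:
    """Return the word with combining marks removed (e.g. 'españa' -> 'espana')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_restoration_table(trusted_counts: Counter) -> dict:
    """Map each diacritic-free key to its most frequent correctly spelled form."""
    candidates = defaultdict(Counter)
    for word, count in trusted_counts.items():
        candidates[strip_diacritics(word)][word] += count
    return {key: forms.most_common(1)[0][0] for key, forms in candidates.items()}


def restore(word: str, table: dict) -> str:
    """Replace a word by its preferred spelling; leave unknown words unchanged."""
    return table.get(strip_diacritics(word), word)


# Hypothetical counts, invented for this example.
trusted = Counter({"françoise": 120, "españa": 300, "ano": 5, "año": 400})
table = build_restoration_table(trusted)

print(restore("francoise", table))  # françoise
print(restore("espana", table))     # españa
print(restore("ano", table))        # año
```

The last call shows why such a fix cannot be perfect: “ano” and “año” are both real Spanish words, and a frequency-based table always picks the more common one.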

Since our Voice Search systems typically understand more than a million different words in each language, developing pronunciation dictionaries is one of the most critical tasks. We need the dictionary to match what the user said with the written form. Not surprisingly, we found dictionary development for languages like Spanish and Italian to be extremely easy, as they have very regular orthographies. In fact, the core of our Spanish pronunciation module consists of fewer than 100 lines of source code. Other languages, like German and French, have more complex orthographies. For example, in French “au”, “eaux” and “hauts” are all pronounced “o”.
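To give a feel for why a regular orthography keeps such a module small, here is a heavily simplified, rule-based spelling-to-sound sketch for Castilian Spanish. It is not the module described above: the rule list, the IPA-like output symbols, and the name `spanish_g2p` are all assumptions made for illustration, and it ignores stress, syllable structure, and allophonic detail.

```python
import re
import unicodedata

# Ordered rules: earlier (more specific) patterns win over later ones.
# The right-hand symbols are a rough IPA-like notation chosen for this sketch.
RULES = [
    (r"ch", "tʃ"),
    (r"ll", "ʎ"),
    (r"rr", "r"),                  # trilled r
    (r"qu(?=[eiéí])", "k"),
    (r"gü(?=[eiéí])", "gw"),
    (r"gu(?=[eiéí])", "g"),
    (r"c(?=[eiéí])", "θ"),
    (r"g(?=[eiéí])", "x"),
    (r"c", "k"),
    (r"z", "θ"),
    (r"j", "x"),
    (r"h", ""),                    # silent
    (r"v", "b"),
    (r"ñ", "ɲ"),
    (r"x", "ks"),
    (r"y(?=[aeiouáéíóú])", "ʝ"),
    (r"y", "i"),
    (r"á", "a"), (r"é", "e"), (r"í", "i"), (r"ó", "o"), (r"[úü]", "u"),
    (r"^r", "r"),                  # word-initial r is trilled
    (r"r", "ɾ"),                   # single tap elsewhere
]
COMPILED = [(re.compile(pattern), output) for pattern, output in RULES]


def spanish_g2p(word: str) -> str:
    """Convert a Spanish spelling to a rough phoneme string, left to right."""
    word = unicodedata.normalize("NFC", word.lower())
    phones = []
    pos = 0
    while pos < len(word):
        for pattern, output in COMPILED:
            match = pattern.match(word, pos)
            if match:
                phones.append(output)
                pos = match.end()
                break
        else:
            # No rule applies: the letter stands for itself.
            phones.append(word[pos])
            pos += 1
    return "".join(phones)


for w in ["españa", "queso", "guerra", "cielo", "hijo", "pingüino"]:
    print(w, "->", spanish_g2p(w))
```

About two dozen ordered rules cover most of the spelling system. A comparable table for French would need many more entries and exceptions, as the “au”, “eaux”, “hauts” example suggests.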

A notable aspect of German (especially “Internet German”) is that a lot of English words are in common usage. We do our best to recognize thousands of English words, even though English contains some sounds that don’t exist in German, like “th” in “the”. One of the trickiest examples we came across was when one of our volunteers read “nba playoffs 2009”, saying “nba playoffs” in English followed by “zwei tausend neun” in German. So go ahead and search for “Germany’s Next Topmodel” or “Postbank Online” and see if it works for you.

German is also notorious for having long, complex words. Our favorite examples include:
  • Berufskraftfahrerqualifikationsgesetz (or shorter: BKrFQG)
  • Eierschalensollbruchstellenverursacher
  • Verkehrsinfrastrukturfinanzierungsgesellschaft
  • Stichpimpulibockforcelorum
  • Hypothalamus-Hypophysen-Nebennierenrinde-Achse

Just for fun, compare how long it takes you to say these to Voice Search vs. typing them.

Even though a vocabulary of one million words sounds large, each of these languages has even more words, so we need a procedure to select which ones to model. We obviously do not do this manually; instead, we use statistical procedures to identify the list of words we will allow, looking at word frequencies across many sources of data. It is therefore sometimes surprising to find really weird terms selected by our algorithms. For example, in Spanish we found these unusual words:
  • supercalifragilisticoespialidoso
  • chiripitiflautico
  • esternocleidomastoideo

So, in the unlikely event that you ever try a Spanish voice search query like “imágenes del músculo supercalifragilisticoespialidoso chiripitiflautico esternocleidomastoideo”, you may be surprised to see that it works.
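The post only says that the word list comes from statistical procedures over word frequencies in many data sources. As a rough illustration, and assuming a plain count cut-off that the post does not confirm, frequency-based vocabulary selection might look like this. The function name, parameters, and sample texts are all hypothetical.

```python
from collections import Counter


def select_vocabulary(sources, max_words, min_count=1):
    """Return up to max_words words, most frequent first, across all sources."""
    totals = Counter()
    for text in sources:
        # Whitespace tokenization is a simplification for this sketch.
        totals.update(text.lower().split())
    ranked = [word for word, count in totals.most_common() if count >= min_count]
    return ranked[:max_words]


# Tiny made-up "data sources"; real ones would be far larger.
sources = [
    "fotos de madrid",
    "mapa de madrid",
    "hoteles en madrid esternocleidomastoideo",
]
print(select_vocabulary(sources, max_words=2))  # ['madrid', 'de']
```

With a cut-off like this, any word that is frequent enough in the data makes the list, however odd it looks to a human reader.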

French, Italian, German, and Spanish are spoken in many parts of the world. In this first release of Google Search by Voice in these languages, we support only the varieties spoken in France, Italy, Germany, and Spain, respectively. The reason is that almost all aspects of a Voice Search system are affected by regional variation: French speakers from different regions have slightly different accents, use a number of different words, and will want to search for different things. Eventually, we plan to support other regions as well, and we will work hard to make sure our systems work well for all of you.

So, we hope you find these new voice search systems useful and fun to use. We definitely had a “supercalifragilisticoespialidoso chiripitiflautico” time developing them.