Compact System

  • Subscribe to our RSS feed.
  • Twitter
  • StumbleUpon
  • Reddit
  • Facebook
  • Digg

Thursday, 2 December 2010

Google Launches Cantonese Voice Search in Hong Kong

Posted on 10:46 by Unknown
Posted by Posted by Yun-hsuan Sung (宋雲軒) and Martin Jansche, Google Research

On November 30th 2010, Google launched Cantonese Voice Search in Hong Kong. Google Search by Voice has been available in a growing number of languages since we launched our first US English system in 2008. In addition to US English, we already support Mandarin for Mainland China, Mandarin for Taiwan, Japanese, Korean, French, Italian, German, Spanish, Turkish, Russian, Czech, Polish, Brazilian Portuguese, Dutch, Afrikaans, and Zulu, along with special recognizers for English spoken with British, Indian, Australian, and South African accents.

Cantonese is widely spoken in Hong Kong, where it is written using traditional Chinese characters, similar to those used in Taiwan. Chinese script is much harder to type than the Latin alphabet, especially on mobile devices with small or virtual keyboards. People in Hong Kong typically use either “Cangjie” (倉頡) or “Handwriting” (手寫輸入) input methods. Cangjie (倉頡) has a steep learning curve and requires users to break the Chinese characters down into sequences of graphical components. The Handwriting (手寫輸入) method is easier to learn, but slow to use. Neither is an ideal input method for people in Hong Kong trying to use Google Search on their mobile phones.

Speaking is generally much faster and more natural than typing. Moreover, some Chinese characters – like “滘” in “滘西州” (Kau Sai Chau) and “砵” in “砵典乍街” (Pottinger Street) – are so rarely used that people often know only the pronunciation, and not how to write them. Our Cantonese Voice Search begins to address these situations by allowing Hong Kong users to speak queries instead of entering Chinese characters on mobile devices. We believe our development of Cantonese Voice Search is a step towards solving the text input challenge for devices with small or virtual keyboards for users in Hong Kong.

There were several challenges in developing Cantonese Voice Search, some unique to Cantonese, some typical of Asian languages and some universal to all languages. Here are some examples of problems that stood out:
  • Data Collection: In contrast to English, there are few existing Cantonese datasets that can be used to train a recognition system. Building a recognition system requires both audio and text data so it can recognize both the sounds and the words. For audio data, our efficient DataHound collection technique uses smartphones to record and upload large numbers of audio samples from local Cantonese-speaking volunteers. For text data, we sample from anonymized search query logs from http://www.google.com.hk to obtain the large amounts of data needed to train language models.
  • Chinese Word Boundaries: Chinese writing doesn’t use spaces to indicate word boundaries. To limit the size of the vocabulary for our speech recognizer and to simplify lexicon development, we use characters, rather than words, as the basic units in our system and allow multiple pronunciations for each character.
  • Mixing of Chinese Characters and English Words: We found that Hong Kong users mix more English into their queries than users in Mainland China and Taiwan. To build a lexicon for both Chinese characters and English words, we map English words to a sequence of Cantonese pronunciation units.
  • Tone Issues: Linguists disagree on the best count of the number of tones in Cantonese – some say 6, some say 7, or 9, or 10. In any case, it’s a lot. We decided to model tone-plus-vowel combinations as single units. In order to limit the complexity of the resulting model, some rarely-used tone-vowel combinations are merged into single models.
  • Transliteration: We found that some users use English words while others use the Cantonese transliteration (e.g.,: “Jordan” vs. “佐敦­”). This makes it challenging to develop and evaluate the system, since it’s often impossible for the recognizer to distinguish between an English word and its Cantonese transliteration. During development we use a metric that simply checks whether the correct search results are returned.
  • Different Accents and Noisy Environment: People speak in different styles with different accents. They use our systems in a variety of environments, including offices, subways, and shopping malls. To make our system work in all these different conditions, we train it using data collected from many different volunteers in many different environments.
Cantonese is Google’s third spoken language for Voice Search in the Chinese linguistic family, after Mandarin for Mainland China and Mandarin for Taiwan. We plan to continue to use our data collection and language modeling technologies to help speakers of Chinese languages easily input text and look up information.
Email ThisBlogThis!Share to XShare to Facebook
Posted in Cantonese, Voice Search | No comments
Newer Post Older Post Home

0 comments:

Post a Comment

Subscribe to: Post Comments (Atom)

Popular Posts

  • Towards Energy-Proportional Datacenters
    Posted by Dennis Abts, Michael R. Marty, Philip M. Wells, Peter Klausler, and Hong Liu This is part of the series highlighting some notable...
  • CDC Birth Vital Statistics in BigQuery
    Posted by Dan Vanderkam, Software Engineer Google’s BigQuery Service lets enterprises and developers crunch large-scale data sets quickly...
  • Market Algorithms and Optimization Meeting
    Posted by  Vahab S. Mirrokni and Muthu Muthukrishnan Google auctions ads, and enables a market with millions of advertisers and users.  This...
  • International Conference on Machine Learning (ICML 2009) in Montreal
    Posted by Eyal Even Dar and Vahab Mirrokni , Google Research, NY The 26th International Conference on Machine Learning ( ICML 2009 ) was re...
  • Site Reliability Engineers: “solving the most interesting problems”
    Posted by Chris Reid, Sydney Staffing team I recently sat down with Ben Appleton, a Senior Staff Software Engineer, to talk about his recent...
  • Two Views from the 2009 Google Faculty Summit
    Posted by Alfred Spector, Vice President of Research and Special Initiatives [cross-posted with the Official Google Blog ] We held our fifth...
  • Focusing on Our Users: The Google Health Redesign
    Posted by Hendrik Mueller, User Experience Researcher When I relocated to New York City a few years ago, some of the most important health i...
  • Supporting computer science education with CS4HS
    Posted by Terry Ednacot, Education Program Manager Recent statistics have shown a decline in the number of U.S. students taking computer sc...
  • Large-scale graph computing at Google
    Posted by Grzegorz Czajkowski, Systems Infrastructure Team If you squint the right way, you will notice that graphs are everywhere. For exam...
  • Our Faculty Institute brings faculty back to the drawing board
    Posted by Nina Kim Schultz, Google Education Research Cross-posted with the Official Google Blog School may still be out for summer, but tea...

Categories

  • accessibility
  • ACL
  • ACM
  • Acoustic Modeling
  • ads
  • adsense
  • adwords
  • Africa
  • Android
  • API
  • App Engine
  • App Inventor
  • Audio
  • Awards
  • Cantonese
  • China
  • Computer Science
  • conference
  • conferences
  • correlate
  • crowd-sourcing
  • CVPR
  • datasets
  • Deep Learning
  • distributed systems
  • Earth Engine
  • economics
  • Education
  • Electronic Commerce and Algorithms
  • EMEA
  • EMNLP
  • entities
  • Exacycle
  • Faculty Institute
  • Faculty Summit
  • Fusion Tables
  • gamification
  • Google Books
  • Google+
  • Government
  • grants
  • HCI
  • Image Annotation
  • Information Retrieval
  • internationalization
  • Interspeech
  • jsm
  • jsm2011
  • K-12
  • Korean
  • Labs
  • localization
  • Machine Hearing
  • Machine Learning
  • Machine Translation
  • MapReduce
  • market algorithms
  • Market Research
  • ML
  • MOOC
  • NAACL
  • Natural Language Processing
  • Networks
  • Ngram
  • NIPS
  • NLP
  • open source
  • operating systems
  • osdi
  • osdi10
  • patents
  • ph.d. fellowship
  • PiLab
  • Policy
  • Public Data Explorer
  • publication
  • Publications
  • renewable energy
  • Research Awards
  • resource optimization
  • Search
  • search ads
  • Security and Privacy
  • SIGMOD
  • Site Reliability Engineering
  • Speech
  • statistics
  • Structured Data
  • Systems
  • Translate
  • trends
  • TV
  • UI
  • University Relations
  • UNIX
  • User Experience
  • video
  • Vision Research
  • Visiting Faculty
  • Visualization
  • Voice Search
  • Wiki
  • wikipedia
  • WWW
  • YouTube

Blog Archive

  • ►  2013 (51)
    • ►  December (3)
    • ►  November (9)
    • ►  October (2)
    • ►  September (5)
    • ►  August (2)
    • ►  July (6)
    • ►  June (7)
    • ►  May (5)
    • ►  April (3)
    • ►  March (4)
    • ►  February (4)
    • ►  January (1)
  • ►  2012 (59)
    • ►  December (4)
    • ►  October (4)
    • ►  September (3)
    • ►  August (9)
    • ►  July (9)
    • ►  June (7)
    • ►  May (7)
    • ►  April (2)
    • ►  March (7)
    • ►  February (3)
    • ►  January (4)
  • ►  2011 (51)
    • ►  December (5)
    • ►  November (2)
    • ►  September (3)
    • ►  August (4)
    • ►  July (9)
    • ►  June (6)
    • ►  May (4)
    • ►  April (4)
    • ►  March (5)
    • ►  February (5)
    • ►  January (4)
  • ▼  2010 (44)
    • ▼  December (7)
      • More researchers dive into the digital humanities
      • Robot hackathon connects with Android, browsers an...
      • Find out what’s in a word, or five, with the Googl...
      • Letting everyone do great things with App Inventor
      • $6 million to faculty in Q4 Research Awards
      • Four Googlers elected ACM Fellows this year
      • Google Launches Cantonese Voice Search in Hong Kong
    • ►  November (2)
    • ►  October (9)
    • ►  September (7)
    • ►  August (2)
    • ►  July (7)
    • ►  June (3)
    • ►  May (2)
    • ►  April (1)
    • ►  March (1)
    • ►  February (1)
    • ►  January (2)
  • ►  2009 (44)
    • ►  December (8)
    • ►  November (4)
    • ►  August (4)
    • ►  July (5)
    • ►  June (5)
    • ►  May (4)
    • ►  April (6)
    • ►  March (3)
    • ►  February (1)
    • ►  January (4)
  • ►  2008 (11)
    • ►  December (1)
    • ►  November (1)
    • ►  October (1)
    • ►  September (1)
    • ►  July (1)
    • ►  May (3)
    • ►  April (1)
    • ►  March (1)
    • ►  February (1)
  • ►  2007 (9)
    • ►  October (1)
    • ►  September (2)
    • ►  August (1)
    • ►  July (1)
    • ►  June (2)
    • ►  February (2)
  • ►  2006 (15)
    • ►  December (1)
    • ►  November (1)
    • ►  September (1)
    • ►  August (1)
    • ►  July (1)
    • ►  June (2)
    • ►  April (3)
    • ►  March (4)
    • ►  February (1)
Powered by Blogger.

About Me

Unknown
View my complete profile