Friday, 18 May 2012

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Posted by Valentin Spitkovsky and Peter Norvig, Research Team



Yet in each word some concept there must be...
— from Goethe's Faust (Part I, Scene III)

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google's core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories.

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association. For example, the top two entries for football indicate that it is an ambiguous term, which is almost twice as likely to refer to what we in the US call soccer:



text = football

    url                      count
1.  Association football    44,984
2.  American football       23,373
⋮
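
To make the triple format concrete, here is a minimal sketch in Python that groups (text, url, count) triples by string and ranks the concepts for football. The tuple layout, the underscore-style URL titles and the helper names are assumptions made for illustration; the README accompanying the data describes the actual file format.

```python
from collections import defaultdict

# (text, url, count) triples. The two counts are the ones quoted in the table above;
# the in-memory tuple layout is an assumption, not the release's file format.
TRIPLES = [
    ("football", "Association_football", 44984),
    ("football", "American_football", 23373),
]

def build_dictionary(triples):
    """Group triples into a words-to-concepts map: text -> {url: count}."""
    dictionary = defaultdict(dict)
    for text, url, count in triples:
        dictionary[text][url] = dictionary[text].get(url, 0) + count
    return dictionary

def top_concepts(dictionary, text, k=2):
    """Return the k highest-count (url, count) pairs for a string."""
    ranked = sorted(dictionary.get(text, {}).items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:k]

words_to_concepts = build_dictionary(TRIPLES)
ranked = top_concepts(words_to_concepts, "football")

# Ratio of the top count to the runner-up: the "almost twice as likely" claim.
ratio = ranked[0][1] / ranked[1][1]
print(ranked, round(ratio, 2))
```

The ratio is simply the top count divided by the second one, which is where "almost twice as likely" comes from.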

An inverted index can be used to perform reverse look-ups, identifying salient terms for each concept. Some of the highest-scoring strings — including synonyms and translations — for both sports, are listed below:




concept: “soccer”
football and Football
Soccer and soccer
Association football
fútbol and Fútbol
footballer
Futbol and futbol
Fußball
futebol
futbolista
サッカー
축구
footballeur
Fußballspieler
sepak bola
足球
فوتبال
футболист
כדורגל
piłkarz
voetbalclub
ฟุตบอล
bóng đá
voetbal
Foutbaal
futebolista
لعبة كرة القدم
fotbal

concept: “football”
American football
football and Football
fútbol americano
football américain
アメリカンフットボール
American football rules
futebol americano
فوتبال آمریکایی
美式足球
football americano
Amerikan futbolu
Le Football Américain
football field
อเมริกันฟุตบอล
פוטבול
كرة القدم الأمريكية
Futbol amerykański
미식축구
futbolu amerykańskiego
football team
американского футбола
Amerikai futball
sepak bola Amerika
football player
američki fudbal
反則
كرة القدم الأميركية
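
The reverse look-up can be sketched by inverting the same triples so that each concept maps to its strings. This is only an illustration: the function names are hypothetical, and apart from the two football counts quoted earlier in the post, the counts below are made-up placeholders rather than values from the data set.

```python
from collections import defaultdict

# (text, url, count) triples. Only the two "football" counts come from the post;
# the remaining counts are placeholders so that there is something to rank.
TRIPLES = [
    ("football", "Association_football", 44984),
    ("football", "American_football", 23373),
    ("soccer", "Association_football", 300),         # placeholder count
    ("Fußball", "Association_football", 200),        # placeholder count
    ("fútbol americano", "American_football", 100),  # placeholder count
]

def invert(triples):
    """Build a concepts-to-words index: url -> {text: count}."""
    index = defaultdict(dict)
    for text, url, count in triples:
        index[url][text] = index[url].get(text, 0) + count
    return index

def salient_terms(index, url, k=3):
    """Return up to k strings most often linked to the given concept."""
    ranked = sorted(index.get(url, {}).items(),
                    key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:k]]

concepts_to_words = invert(TRIPLES)
soccer_terms = salient_terms(concepts_to_words, "Association_football")
print(soccer_terms)
```

Because the same string can appear under several concepts, "football" shows up in the ranked list for both sports, just as in the lists above.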

Associated counts can easily be turned into percentages. The following table illustrates the concept-to-words dictionary direction — which may be useful for paraphrasing, summarization and topic modeling — for the idea of soft drink, restricted to English (and normalized for punctuation, pluralization and capitalization differences):



url = Soft_drink

     text                                     %
 1.  soft drink (and soft-drinks)          28.6
 2.  soda (and sodas)                       5.5
 3.  soda pop                               0.9
 4.  fizzy drinks                           0.6
 5.  carbonated beverages (and beverage)    0.3
 6.  non-alcoholic                          0.2
 7.  soft                                   0.1
 8.  pop                                    0.1
 9.  carbonated soft drink (and drinks)     0.1
10.  aerated water                          0.1
11.  non-alcoholic drinks (and drink)       0.1
12.  soft drink controversy                 0.0
13.  citrus-flavored soda                   0.0
14.  carbonated                             0.0
15.  soft drink topics                      0.0
⋮
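
One way to turn counts into percentages while merging surface variants is sketched below. The normalization rule here (lower-casing, replacing punctuation with spaces, dropping a trailing "s") is a crude stand-in assumed for illustration, not the procedure used to build the table, and the counts are toy values rather than figures from the data set.

```python
import re
from collections import Counter

def normalize(text):
    """Crude key for merging variants: lower-case, punctuation to spaces,
    and strip one trailing 's' from longer words (an assumed heuristic)."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    stripped = [
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
        for w in words
    ]
    return " ".join(stripped)

def to_percentages(counts):
    """Merge variant strings, then express each merged count as a share of the total."""
    merged = Counter()
    for text, count in counts.items():
        merged[normalize(text)] += count
    total = sum(merged.values())
    return {text: 100.0 * count / total for text, count in merged.most_common()}

# Toy counts for a single concept, chosen so the arithmetic is easy to check.
toy_counts = {"soft drink": 50, "Soft-drinks": 10, "soda": 25, "Sodas": 5, "pop": 10}
shares = to_percentages(toy_counts)
print(shares)
```

With these toy numbers, "soft drink" and "Soft-drinks" collapse into one entry holding 60 of the 100 links, so its share is 60%.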

The words-to-concepts dictionary direction can disambiguate senses and link entities, which are often highly ambiguous, since people, places and organizations can (nearly) all be named after each other. The next table shows the top concepts meant by the string Stanford, which refers to all three (and other) types:



text = Stanford

     url                                     %   type
 1.  Stanford University                  50.3   ORGANIZATION
 2.  Stanford (disambiguation)             7.7   a disambiguation page
 3.  Stanford, California                  7.5   LOCATION
 4.  Stanford Cardinal football            5.7   ORGANIZATION
 5.  Stanford Cardinal                     4.1   multiple athletic programs
 6.  Stanford Cardinal men's basketball    2.0   ORGANIZATION
 7.  Stanford prison experiment            2.0   a famous psychology experiment
 8.  Stanford, Kentucky                    1.7   LOCATION
 9.  Stanford, Norfolk                     1.0   LOCATION
10.  Bank of the West Classic              1.0   a recurring sporting event
11.  Stanford, Illinois                    0.9   LOCATION
12.  Leland Stanford                       0.9   PERSON
13.  Charles Villiers Stanford             0.8   PERSON
14.  Stanford, New York                    0.8   LOCATION
15.  Stanford, Bedfordshire                0.8   LOCATION
⋮
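
A simple way to use such a distribution for entity linking is a most-frequent-sense baseline: pick the highest-scoring concept, optionally restricted to an expected entity type. The sketch below uses a few rows copied from the Stanford table; the function name, the tuple layout and the choice to skip the disambiguation page are assumptions for illustration, and this is a generic baseline rather than a method described in the post.

```python
# A few senses of "Stanford" as (url title, percent, type), copied from the table above.
# None marks the row that is a disambiguation page rather than a typed entity.
STANFORD = [
    ("Stanford University", 50.3, "ORGANIZATION"),
    ("Stanford (disambiguation)", 7.7, None),
    ("Stanford, California", 7.5, "LOCATION"),
    ("Stanford Cardinal football", 5.7, "ORGANIZATION"),
    ("Leland Stanford", 0.9, "PERSON"),
]

def most_likely(senses, wanted_type=None):
    """Return the highest-percentage typed sense, optionally filtered by type.

    Returns None when no candidate matches.
    """
    candidates = [
        sense for sense in senses
        if sense[2] is not None and (wanted_type is None or sense[2] == wanted_type)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda sense: sense[1])[0]

print(most_likely(STANFORD))              # no type constraint
print(most_likely(STANFORD, "LOCATION"))  # context suggests a place
```

When surrounding context suggests a type, for instance a sentence about someone born in Stanford, the type filter changes the answer from the university to the town.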

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data.

We hope that this release will fuel numerous creative applications that haven't been previously thought of!


Produced by Angel X. Chang and Valentin I. Spitkovsky; parts of this work are descended from an earlier collaboration between University of Basque Country's Ixa Group's Eneko Agirre and Stanford's NLP Group, including Eric Yeh, presently of SRI International, and our Ph.D. advisors, Christopher D. Manning and Daniel Jurafsky.
