Compact System

Friday, 8 March 2013

Learning from Big Data: 40 Million Entities in Context

Posted at 10:30 by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research

When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.

To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.

Dataset                            Number of Mentions   Number of Entities
Bentivogli et al. (data) (2008)    43,704               709
Day et al. (2008)                  less than 55,000     3,660
Artiles et al. (data) (2010)       57,357               300
Wikilinks Corpus                   40,323,863           2,933,659

What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:
  • Look into coreference (when different mentions refer to the same entity) or entity resolution (matching a mention to the underlying entity)
  • Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
  • Learn things about entities by aggregating information across all the documents they’re mentioned in
  • Work on type tagging, which tries to assign types to entities; the types can be broad, like person or location, or specific, like amusement park ride. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
  • Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
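
As a rough illustration of the aggregation and slicing ideas above, here is a minimal Python sketch that groups mentions by the Wikipedia page they link to. The triple layout and the `group_by_entity` helper are assumptions made for this example, not part of the released tools; the first two rows reuse values from the sample record shown under Gory Details, and the third row is invented.

```python
from collections import defaultdict

# Hypothetical (page_url, mention_text, wikipedia_target) triples.
# The first two rows come from the sample record in this post;
# the third row is made up for illustration.
mentions = [
    ("http://1967mercurycougar.blogspot.com/2009_10_01_archive.html",
     "Buick Riviera", "http://en.wikipedia.org/wiki/Buick_Riviera"),
    ("http://1967mercurycougar.blogspot.com/2009_10_01_archive.html",
     "1975 MGB roadster", "http://en.wikipedia.org/wiki/MG_MGB"),
    ("http://example.com/cars.html",
     "MGB", "http://en.wikipedia.org/wiki/MG_MGB"),
]


def group_by_entity(triples):
    """Map each Wikipedia target to the pages and surface forms that mention it."""
    grouped = defaultdict(lambda: {"pages": set(), "surface_forms": set()})
    for page_url, text, target in triples:
        grouped[target]["pages"].add(page_url)
        grouped[target]["surface_forms"].add(text)
    return dict(grouped)


by_entity = group_by_entity(mentions)
mgb = by_entity["http://en.wikipedia.org/wiki/MG_MGB"]
print(sorted(mgb["surface_forms"]))  # the different ways pages refer to this entity
print(len(mgb["pages"]))             # how many distinct pages mention it
```

Keying on the link target treats each Wikipedia page as one entity, so a subset such as "only musicians" would amount to filtering the keys of `by_entity` against whatever list of pages you care about.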

Gory Details

How do you actually get the data? It’s right here: Google’s Wikilinks Corpus. Tools and data with extra context can be found on our partners’ page: UMass Wiki-links. Understanding the corpus, however, is a little bit involved.

For copyright reasons, we cannot distribute actual annotated web pages. Instead, we’re providing an index of URLs, and the tools to create the dataset, or whichever slice of it you care about, yourself. Specifically, we’re providing:
  • The URLs of all the pages that contain labeled mentions, which are links to English Wikipedia
  • The anchor text of the link (the mention string), the Wikipedia link target, and the byte offset of the link for every page in the set
  • The byte offsets of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn’t changed -- think of this as a version, or fingerprint, of the page
  • Software tools (on the UMass site) to: download the web pages; extract the mentions, with ways to recover if the byte offsets don’t match; select the text around the mentions as local context; and compute evaluation metrics over predicted entities.
The format looks like this:

URL http://1967mercurycougar.blogspot.com/2009_10_01_archive.html
MENTION Lincoln Continental Mark IV 40110 http://en.wikipedia.org/wiki/Lincoln_Continental_Mark_IV
MENTION 1975 MGB roadster 41481 http://en.wikipedia.org/wiki/MG_MGB
MENTION Buick Riviera 43316 http://en.wikipedia.org/wiki/Buick_Riviera
MENTION Oldsmobile Toronado 43397 http://en.wikipedia.org/wiki/Oldsmobile_Toronado
TOKEN seen 58190
TOKEN crush 63118
TOKEN owners 69290
TOKEN desk 59772
TOKEN relocate 70683
TOKEN promote 35016
TOKEN between 70846
TOKEN re 52821
TOKEN getting 68968
TOKEN felt 41508
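
To make the record layout concrete, here is a minimal Python sketch that reads records like the one above and checks the TOKEN fingerprint against downloaded page bytes. It is not the UMass tooling. It assumes that fields are separated by whitespace, that the last two fields of a MENTION line are the byte offset and the link target, and that each offset points at the first byte of the word in a UTF-8 page. `PageRecord`, `parse_index`, and `fingerprint_matches` are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    url: str
    mentions: list = field(default_factory=list)  # (anchor_text, byte_offset, target)
    tokens: list = field(default_factory=list)    # (word, byte_offset)


def parse_index(lines):
    """Parse URL / MENTION / TOKEN lines into PageRecord objects."""
    records = []
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines between records
        kind = parts[0]
        if kind == "URL" and len(parts) >= 2:
            current = PageRecord(url=parts[1])
            records.append(current)
        elif kind == "MENTION" and current is not None and len(parts) >= 4:
            # Assumed layout: the last field is the target, the one before it
            # is the offset, and everything in between is the anchor text.
            anchor = " ".join(parts[1:-2])
            current.mentions.append((anchor, int(parts[-2]), parts[-1]))
        elif kind == "TOKEN" and current is not None and len(parts) == 3:
            current.tokens.append((parts[1], int(parts[2])))
    return records


def fingerprint_matches(page_bytes, tokens, encoding="utf-8"):
    """Return True if every rare word still sits at its recorded byte offset."""
    for word, offset in tokens:
        expected = word.encode(encoding)
        if page_bytes[offset:offset + len(expected)] != expected:
            return False
    return True


sample = """\
URL http://1967mercurycougar.blogspot.com/2009_10_01_archive.html
MENTION Lincoln Continental Mark IV 40110 http://en.wikipedia.org/wiki/Lincoln_Continental_Mark_IV
MENTION 1975 MGB roadster 41481 http://en.wikipedia.org/wiki/MG_MGB
TOKEN seen 58190
TOKEN crush 63118
""".splitlines()

record = parse_index(sample)[0]
print(record.mentions[1])
print(record.tokens)
```

If `fingerprint_matches` returns False for a freshly downloaded page, the page has probably changed since the index was built, and the mention offsets should not be trusted without the recovery steps the UMass tools provide.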


We’d love to hear what you’re working on, and look forward to what you can do with 40 million mentions across over 10 million web pages!

Thanks to our collaborators at UMass Amherst: Sameer Singh and Andrew McCallum.

Posted in Natural Language Processing, wikipedia