WWW 2016 Tutorial: Automatic Entity Recognition and Typing in Massive Text Corpora

In today's information-based society, we are inundated with natural language text data, ranging from news articles, product reviews, and advertisements to a wide variety of user-generated content from social media. To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of entities and the relationships between them. In this tutorial, we introduce data-driven methods for recognizing typed entities of interest in different kinds of text corpora, especially massive, domain-specific ones. These methods automatically identify token spans as entity mentions in text and label their types (e.g., person, product, organization) in a scalable way. Using real datasets, including news articles and Yelp reviews, we demonstrate how these typed entities aid knowledge discovery and management.

Xiang Ren¹, Ahmed El-Kishky¹, Chi Wang², Jiawei Han¹

¹University of Illinois at Urbana-Champaign, ²Microsoft Research

Outline

  1. Introduction to entity recognition and typing
  2. Entity recognition: An overview and phrase-mining approaches
  3. Entity typing: An overview and network-mining approaches
  4. Trends and research problems

Slides

[PDF][PPT]

Code

[ClusType][SegPhrase][TopMine][PLE]

Publications

Presenters


Xiang Ren, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on knowledge acquisition from text data and mining linked data. In 2016, he received a Google PhD Fellowship for his work in Structured Data and Database Management. He received the C. L. and Jane W.-S. Liu Award and the Yahoo!-DAIS Research Excellence Gold Award in 2015, and a Microsoft Young Fellowship from Microsoft Research Asia in 2012.

Ahmed El-Kishky, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research interests include mining large unstructured data, text mining, and network mining. He is the recipient of both the National Science Foundation Graduate Research Fellowship and the National Defense Science and Engineering Fellowship.

Chi Wang (Ph.D., UIUC, 2014), researcher at Microsoft Research, Redmond, Washington. His research focuses on discovering knowledge from unstructured and linked data, such as topics, concepts, relations, communities, and social influence. His book Mining Latent Entity Structures was published by Morgan & Claypool Publishers in 2015, in the series Synthesis Lectures on Data Mining and Knowledge Discovery. He is a recipient of the Microsoft Research Graduate Research Fellowship.

Jiawei Han, Abel Bliss Professor, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research areas encompass data mining, data warehousing, information network analysis, and database systems, with over 600 conference and journal publications. He is a Fellow of ACM and a Fellow of IEEE, and received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), and the IEEE Computer Society W. Wallace McDowell Award (2009). His co-authored textbook "Data Mining: Concepts and Techniques", 3rd ed. (Morgan Kaufmann, 2011), has been widely adopted worldwide.