Mining Text Data Chapter Two: Information Extraction from Text (1)

Information extraction in this chapter mainly focuses on named entity recognition and relation extraction.

Introduction

E.g., in the sentence "In 1998, Larry Page and Sergey Brin founded Google Inc.", there are relations between the named entities, such as FounderOf(Larry Page, Google Inc.), FounderOf(Sergey Brin, Google Inc.), and FoundedIn(Google Inc., 1998).
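To make the relation notation concrete, here is a minimal Python sketch of how such facts could be stored and queried. The `Relation` type and the `founders_of` helper are names made up for this note, and the facts are written down by hand from the example sentence rather than extracted automatically.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One extracted fact of the form name(arg1, arg2). Hypothetical type."""
    name: str
    arg1: str
    arg2: str


# Facts read off the example sentence by hand (no extraction happens here).
facts = [
    Relation("FounderOf", "Larry Page", "Google Inc."),
    Relation("FounderOf", "Sergey Brin", "Google Inc."),
    Relation("FoundedIn", "Google Inc.", "1998"),
]


def founders_of(company, relations):
    """Return every arg1 that stands in a FounderOf relation to the company."""
    return [r.arg1 for r in relations
            if r.name == "FounderOf" and r.arg2 == company]


founders = founders_of("Google Inc.", facts)
print(founders)
```

Storing each fact as a (relation, argument, argument) triple is only one possible representation, but it mirrors the FounderOf(...) notation directly.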

Some example applications:

1, Biomedical researchers often need to sift through a large amount of scientific publications to look for discoveries related to particular genes, proteins or other biomedical entities.

2, Financial professionals often need to seek specific pieces of information from news articles to help their day-to-day decision making.

3, Intelligence analysts review large amounts of text to search for information such as people involved in terrorism events, the weapons used and the targets of the attacks.

4, With the fast growth of the Web, search engines have become an integral part of people’s daily lives, and users’ search behaviors are much better understood now.

The Message Understanding Conference (MUC) series focused on named entity recognition and relation extraction. Early MUCs defined information extraction as filling a predefined template that contains a set of predefined slots. Rule-based systems then participated in MUC. Later, as information extraction systems were decomposed into components such as named entity recognition, many information extraction subtasks could be transformed into classification problems, which can be solved by standard supervised learning algorithms such as support vector machines and maximum entropy models.

Another, newer direction is open information extraction, where the system is expected to extract all useful entity relations from a large, diverse corpus such as the Web. Recent advances in this direction include systems such as TEXTRUNNER, WOE and REVERB.

Named Entity Recognition

Named entities have two characteristics: they form an open set, and they are context dependent.

Rule-based Approach

A rule consists of a pattern and an action. It is possible for a sequence of tokens to match multiple rules. To handle such conflicts, a set of policies has to be defined to control how rules should be fired. Manually creating the rules for named entity recognition requires human expertise and is labor intensive.
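The pattern/action/policy idea can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the three regular-expression rules are hypothetical, the "action" is simply attaching an entity type to the matched span, and the conflict policy (longest match wins, ties go to the rule listed first) is just one common choice, not the policy of any particular system.

```python
import re

# Each rule pairs a pattern with an action; here the action is simply
# "label the matched span with this entity type". All rules are hypothetical.
RULES = [
    (re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+"), "PERSON"),
    (re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)* (?:Inc|Corp|Ltd)\."),
     "ORGANIZATION"),
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+"), "PERSON"),
]


def apply_rules(text):
    """Fire every rule, then resolve overlapping matches.

    Policy (an assumption): the longest match wins; on equal length,
    the rule that appears earlier in RULES wins.
    """
    candidates = []
    for priority, (pattern, label) in enumerate(RULES):
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label, priority))
    # Longest span first, then lowest rule index.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[3]))
    chosen = []
    for start, end, label, _priority in candidates:
        # Keep a candidate only if it overlaps no already chosen span.
        if all(end <= s or start >= e for s, e, _label in chosen):
            chosen.append((start, end, label))
    return [(text[s:e], label) for s, e, label in sorted(chosen)]


entities = apply_rules("Larry Page founded Google Inc. in 1998.")
print(entities)
```

In this sentence both the ORGANIZATION rule and the generic two-word PERSON rule match around "Google Inc", and the longest-match policy is what keeps the organization reading.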

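When rules are learned automatically, a top-down approach starts from general rules, which tend to have low precision, while a bottom-up approach starts from specific rules, which tend to have low coverage.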

Statistical Learning Approach

This approach treats named entity recognition as a sequence labeling task. Available methods include HMMs, MEMMs and linear-chain CRFs.
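To see what "sequence labeling" means here, each token gets one label. The sketch below assumes the common BIO convention (B- for the first token of an entity, I- for the following tokens, O for everything else); that convention and the `bio_encode` helper are not from these notes, they are just a widely used way to turn entity spans into per-token labels.

```python
def bio_encode(tokens, spans):
    """Convert entity spans into one BIO label per token.

    spans: list of (start_token, end_token_exclusive, entity_type).
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # remaining tokens of the entity
    return labels


tokens = ["Larry", "Page", "founded", "Google", "Inc.", "in", "1998"]
spans = [(0, 2, "PER"), (3, 5, "ORG")]        # hand-annotated for this example
labels = bio_encode(tokens, spans)
for token, label in zip(tokens, labels):
    print(token, label)
```

A sequence model such as an HMM, MEMM or CRF is then trained to predict the label column from the token column.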

In an HMM, one way to model the joint probability of the observations and labels is to assume a Markov process, where the generation of a label or an observation depends only on one or a few previous labels and/or observations.
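As a rough illustration, the sketch below assumes the simplest case, a first-order HMM: each label depends only on the previous label, and each word depends only on its own label. The two-label tag set and all probabilities are toy values picked by hand, not estimates from any corpus. The most likely label sequence is found with the standard Viterbi algorithm.

```python
import math

STATES = ["O", "PER"]
# Hypothetical toy parameters, chosen by hand for illustration only.
START = {"O": 0.7, "PER": 0.3}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.5, "PER": 0.5}}
EMIT = {
    "O": {"Larry": 0.05, "Page": 0.15, "founded": 0.8},
    "PER": {"Larry": 0.6, "Page": 0.35, "founded": 0.05},
}


def joint_log_prob(words, labels):
    """log p(words, labels) under the first-order Markov assumption."""
    lp = math.log(START[labels[0]]) + math.log(EMIT[labels[0]][words[0]])
    for i in range(1, len(words)):
        lp += math.log(TRANS[labels[i - 1]][labels[i]])
        lp += math.log(EMIT[labels[i]][words[i]])
    return lp


def viterbi(words):
    """Return the label sequence with the highest joint probability."""
    best = [{s: math.log(START[s]) + math.log(EMIT[s][words[0]])
             for s in STATES}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            prev = max(STATES,
                       key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (best[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][w]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


words = ["Larry", "Page", "founded"]
print(viterbi(words))
```

Because the model is generative, the same tables define a probability for any (words, labels) pair, which is what `joint_log_prob` computes.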

An MEMM is a maximum entropy model coupled with a Markovian assumption; it marks a shift from generative models to discriminative models.

The CRF is a popular discriminative model for sequence labeling. The difference from the MEMM is that in a CRF the label of the current observation can depend not only on previous labels but also on future labels. However, long-range features cannot be defined in a linear-chain CRF, so a semi-Markov CRF can perform better, as reported in one work.
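One way to see why future labels matter in a CRF is that the score of a whole label sequence is normalized over all possible label sequences at once. The sketch below is a toy, brute-force version of a linear-chain CRF: the feature weights in `score` are set by hand (a real CRF learns them from labeled data), and the normalizer is computed by enumerating every sequence, which is only feasible for very short inputs.

```python
import itertools
import math

LABELS = ["O", "PER"]


def score(words, labels):
    """Unnormalized score: sum of hand-set (hypothetical) feature weights."""
    total = 0.0
    for i, (w, y) in enumerate(zip(words, labels)):
        if w[0].isupper() and y == "PER":
            total += 1.5      # capitalized word labeled as a person
        if not w[0].isupper() and y == "O":
            total += 1.0      # lowercase word labeled as outside
        if i > 0 and labels[i - 1] == "PER" and y == "PER":
            total += 0.5      # adjacent PER labels
    return total


def crf_prob(words, labels):
    """p(labels | words): exp(score) divided by the sum over ALL sequences."""
    z = sum(math.exp(score(words, y))
            for y in itertools.product(LABELS, repeat=len(words)))
    return math.exp(score(words, labels)) / z


words = ["Larry", "Page", "founded"]
best = max(itertools.product(LABELS, repeat=len(words)),
           key=lambda y: crf_prob(words, y))
print(best, crf_prob(words, best))
```

Real implementations replace the enumeration with dynamic programming (forward-backward for the normalizer, Viterbi for decoding), which works because linear-chain features only look at adjacent labels.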

posted @ 2014-05-17 15:35  LeonCrash