[IR] Extraction-based Text Summarization

Automatic Text Summarization - Reading Notes


 

The problem that automatic summarization tackles is simple to state: condense the gist of a whole article into a few concise sentences, so that a reader can learn what the original text means just by reading the summary.

There are two broad lines of attack:

  • Extractive: pick key sentences out of the original text and assemble them into a summary. [The most mainstream, most widely used, and easiest approach.]
  • Abstractive: the computer must actually understand the original content and express it in its own words. [Comparatively, this has much more of the flavor of genuine artificial intelligence.]

 

Single-Document Extractive Summarization

Extractive methods rest on one assumption: the core idea of a document can be summed up by one or a few of its sentences. Summarization then becomes the task of finding the most important sentences in the document, that is, a sentence-ranking problem.

Different problems call for different ranking metrics: some applications care about relevance, some about timeliness, some about novelty, and so on; each choice leads to a different model at this level.
A typical extractive summarization problem considers two metrics: relevance and novelty. Relevance means the chosen sentences best represent the meaning of the document; novelty means the candidate sentences should carry little redundant information, with each sentence ideally expressing its own independent idea.

Some of the main ideas are sketched below.


1. Preprocessing
The first step in any standard NLP pipeline is preprocessing: splitting the text into sentences. There are two common options: split on periods (or other end-of-sentence punctuation), or use commas as delimiters to obtain shorter clause-level units.
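As a minimal sketch of the first option, assuming NLTK with its punkt model (the same tokenizer family that TextTeaser's parser loads from a pickle below):

import nltk

nltk.download('punkt', quiet=True)  # fetch the punkt sentence model if missing

def split_sentences(text):
    # punkt handles abbreviations such as "Dr." better than splitting on '.' alone
    return nltk.sent_tokenize(text)

print(split_sentences("Dr. Smith stayed late. He finished the draft."))
# ['Dr. Smith stayed late.', 'He finished the draft.']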


2. Word and Sentence Representation
The idea in this step is to represent words and sentences as quantities a computer can understand, then compute some metric over them and rank. This is also where the various algorithms and models differ most (a small sketch follows the list):
(1) Bag of Words. The bag-of-words model treats each word as a dimension, so a sentence becomes a high-dimensional sparse vector in the space spanned by the full vocabulary.
(2) TF-IDF. Essentially a weighted bag-of-words model: compute each word's TF-IDF value and use it as that word's weight.
(3) LDA/LSI. Represent the whole document as a TF-IDF matrix and apply SVD for dimensionality reduction, yielding two matrices: a document-topic matrix and a word-topic matrix. From the word-topic matrix, a sentence-topic matrix can then be derived.
(4) Word Embedding. Word2Vec, proposed by Tomas Mikolov, uses a battery of tricks and approximations to cheaply represent each word as a low-dimensional dense vector, which works well in many settings. Once words are vectors, there are also many ways to compose a sentence into a vector.
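A minimal sketch of (2) and (3), assuming scikit-learn: each sentence becomes one sparse TF-IDF row, and a truncated SVD of that matrix gives the sentence-topic representation that the LSI idea relies on.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # one sparse TF-IDF row per sentence

svd = TruncatedSVD(n_components=2)       # LSI: SVD over the TF-IDF matrix
topics = svd.fit_transform(X)            # dense sentence-topic matrix, shape (3, 2)

print(X.shape, topics.shape)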


3. Ranking
Two common approaches are introduced here.
(1) Graph-based ranking
Build a graph with each sentence of the document as a node and the similarity between sentences as edge weights, then run the PageRank algorithm to obtain a score for every sentence.
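A minimal sketch of this idea (essentially TextRank), assuming networkx and scikit-learn, with cosine similarity over TF-IDF vectors standing in for whatever sentence similarity one prefers:

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_n=2):
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)                  # sentence-by-sentence similarity matrix
    graph = nx.from_numpy_array(sim)            # nodes = sentences, edge weights = similarity
    scores = nx.pagerank(graph)                 # dict: node index -> PageRank score
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # emit in original document order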

(2) Feature-based ranking
Before deep learning took off, feature engineering was the go-to remedy for domain-specific problems. The features used here include:
1) Sentence length: a certain length is treated as ideal, and each sentence is scored by how close it comes to that ideal.
2) Sentence position: a score based on where the sentence sits in the document (for example, the first sentence of a paragraph is the topic sentence roughly 70% of the time).
3) Title words: a sentence is scored by how many of the title's words it contains.
4) Keywords: after preprocessing, the top 10 keywords by term frequency are extracted, and each sentence is scored by which keywords it contains and how they are distributed.
The representative algorithm is TextTeaser; its implementation is analyzed in full below.


4. Post-processing
The ranking so far accounts only for relevance, not novelty, so it is quite possible that the top-ranked sentences all express similar meanings. A penalty factor is therefore introduced to take novelty into account, and every sentence is rescored with the following formula:

score(i) = a·score(i) - (1 - a)·similarity(i, i-1), i = 2, 3, …, N

Here i is the position in the ranked order. Starting from the second sentence, each one is penalized by its similarity to the sentence ranked just before it; the top-ranked sentence keeps its original score.
This algorithm is known as MMR (Maximal Marginal Relevance).
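A minimal greedy sketch of this idea, with hypothetical inputs: `scores` is the per-sentence relevance score from the ranking step and `sim` a sentence-similarity matrix. Note that full MMR penalizes similarity against all sentences already selected, not only the immediately preceding one as in the formula above.

def mmr_select(scores, sim, k=3, a=0.7):
    # Greedily pick the sentence maximizing a*relevance - (1-a)*redundancy.
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return a * scores[i] - (1 - a) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen sentences, in selection order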


5. Output
The output is usually the top N sentences after ranking. This raises a crucial issue, and the one for which automatic summaries have long been criticized: readability. Because the selected sentences come from different paragraphs, stitching them together mechanically rarely yields smooth transitions between sentences; guaranteeing readability is genuinely hard.
One simple trick helps: output the selected sentences in the order in which they appear in the original text, which preserves at least a little coherence.


References:
[1] TextRank source-code reading notes: https://gist.github.com/rsarxiv/11470a8d763b2845f671061c21230435
[2] TextTeaser source-code reading notes: https://gist.github.com/rsarxiv/4e949264b3bda98828b84cf2991e57e4

 

For the abstractive approach, see the original post: http://blog.csdn.net/lu839684437/article/details/71600410


 

Code analysis: https://github.com/DataTeaser/textteaser

main.py

from summarizer import Summarizer

def getInput():
    with open('input.txt') as file:
        content = file.readlines()

    # drop blank lines and strip trailing newlines
    content = [c.replace('\n', '') for c in content if c != '\n']

    title = content[0]
    text = content[1:]  # everything after the title line

    return {'title': title, 'text': ' '.join(text)}

# #####################

input = getInput()

# Python 2 idiom from the upstream repo: drop any non-ASCII bytes
input['text'] = input['text'].decode("ascii", "ignore")

# collapse all whitespace (including newlines) into single spaces
input['text'] = " ".join(input['text'].replace("\n", " ").split())

summarizer = Summarizer()
# source and category are unused placeholders in this version of TextTeaser
result = summarizer.summarize(input['text'], input['title'], 'Undefined', 'Undefined')
result = summarizer.sortScore(result)           # highest totalScore first
result = summarizer.sortSentences(result[:30])  # keep top 30, restore original order

print 'Summary:'

for r in result:
    print r['sentence']
    # print r['totalScore']
    # print r['order']

summarizer.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from parser import Parser

class Summarizer:
    def __init__(self):
        self.parser = Parser()

    def summarize(self, text, title, source, category):
        sentences  = self.parser.splitSentences(text)
        titleWords = self.parser.removePunctations(title)
        titleWords = self.parser.splitWords(titleWords)  # split the cleaned title, not the raw one (bug fix)
        (keywords, wordCount) = self.parser.getKeywords(text)

        topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)
        result      = self.computeScore(sentences, titleWords, topKeywords)
        result      = self.sortScore(result)

        return result

    def getTopKeywords(self, keywords, wordCount, source, category):
        # Add getting top keywords in the database here
        for keyword in keywords:
            articleScore = 1.0 * keyword['count'] / wordCount
            keyword['totalScore'] = articleScore * 1.5

        return keywords

    def sortScore(self, dictList):
        return sorted(dictList, key=lambda x: -x['totalScore'])

    def sortSentences(self, dictList):
        return sorted(dictList, key=lambda x: x['order'])

    def computeScore(self, sentences, titleWords, topKeywords):
        keywordList = [keyword['word'] for keyword in topKeywords]
        summaries = []

        for i, sentence in enumerate(sentences):
            sent  = self.parser.removePunctations(sentence)
            words = self.parser.splitWords(sent)

            sbsFeature = self.sbs(words, topKeywords, keywordList)
            dbsFeature = self.dbs(words, topKeywords, keywordList)

            titleFeature     = self.parser.getTitleScore(titleWords, words)
            sentenceLength   = self.parser.getSentenceLengthScore(words)
            sentencePosition = self.parser.getSentencePositionScore(i, len(sentences))
            keywordFrequency = (sbsFeature + dbsFeature) / 2.0 * 10.0
            totalScore       = (titleFeature * 1.5 + keywordFrequency * 2.0 +
                                sentenceLength * 0.5 + sentencePosition * 1.0) / 4.0

            summaries.append({
                # 'titleFeature': titleFeature,
                # 'sentenceLength': sentenceLength,
                # 'sentencePosition': sentencePosition,
                # 'keywordFrequency': keywordFrequency,
                'totalScore': totalScore,
                'sentence': sentence,
                'order': i
            })

        return summaries

    def sbs(self, words, topKeywords, keywordList):
        score = 0.0

        if len(words) == 0:
            return 0

        for word in words:
            word = word.lower()
            index = -1

            if word in keywordList:
                index = keywordList.index(word)

            if index > -1:
                score += topKeywords[index]['totalScore']

        return 1.0 / abs(len(words)) * score

    def dbs(self, words, topKeywords, keywordList):
        k = len(list(set(words) & set(keywordList))) + 1
        summ = 0.0
        firstWord  = {}
        secondWord = {}

        for i, word in enumerate(words):
            if word in keywordList:
                index = keywordList.index(word)

                if firstWord == {}:
                    firstWord = {'i': i, 'score': topKeywords[index]['totalScore']}
                else:
                    secondWord = firstWord
                    firstWord  = {'i': i, 'score': topKeywords[index]['totalScore']}
                    distance   = firstWord['i'] - secondWord['i']

                    summ += (firstWord['score'] * secondWord['score']) / (distance ** 2)

        return (1.0 / k * (k + 1.0)) * summ

parser.py: for the law-case database, we should implement our own parser.

# !/usr/bin/python
# -*- coding: utf-8 -*-
import nltk.data
import os


class Parser:
    def __init__(self):
        self.ideal = 20.0
        self.stopWords = self.getStopWords()

    def getKeywords(self, text):
        text  = self.removePunctations(text)
        words = self.splitWords(text)
        words = self.removeStopWords(words)
        uniqueWords = list(set(words))

        keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords]
        keywords = sorted(keywords, key=lambda x: -x['count'])

        return (keywords, len(words))

    def getSentenceLengthScore(self, sentence):
        # `sentence` is actually a list of words; the score peaks at the ideal length (20 words)
        return (self.ideal - abs(self.ideal - len(sentence))) / self.ideal

    # Jagadeesh, J., Pingali, P., & Varma, V. (2005). Sentence Extraction Based Single Document Summarization. International Institute of Information Technology, Hyderabad, India, 5.
    def getSentencePositionScore(self, i, sentenceCount):
        normalized = i / (sentenceCount * 1.0)

        if normalized > 0 and normalized <= 0.1:
            return 0.17
        elif normalized > 0.1 and normalized <= 0.2:
            return 0.23
        elif normalized > 0.2 and normalized <= 0.3:
            return 0.14
        elif normalized > 0.3 and normalized <= 0.4:
            return 0.08
        elif normalized > 0.4 and normalized <= 0.5:
            return 0.05
        elif normalized > 0.5 and normalized <= 0.6:
            return 0.04
        elif normalized > 0.6 and normalized <= 0.7:
            return 0.06
        elif normalized > 0.7 and normalized <= 0.8:
            return 0.04
        elif normalized > 0.8 and normalized <= 0.9:
            return 0.04
        elif normalized > 0.9 and normalized <= 1.0:
            return 0.15
        else:
            return 0

    def getTitleScore(self, title, sentence):
        titleWords = self.removeStopWords(title)
        sentenceWords = self.removeStopWords(sentence)
        matchedWords = [word for word in sentenceWords if word in titleWords]

        return len(matchedWords) / (len(title) * 1.0)

    def splitSentences(self, text):
        # load the pickled punkt sentence tokenizer bundled with the repo (Python 2: note the .decode)
        tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file__)).decode('utf-8') + '/trainer/english.pickle')

        return tokenizer.tokenize(text)

    def splitWords(self, sentence):
        return sentence.lower().split()

    def removePunctations(self, text):
        return ''.join(t for t in text if t.isalnum() or t == ' ')

    def removeStopWords(self, words):
        return [word for word in words if word not in self.stopWords]

    def getStopWords(self):
        with open(os.path.dirname(os.path.abspath(__file__)) + '/trainer/stopWords.txt') as file:
            words = file.readlines()

        return [word.replace('\n', '') for word in words]

 
