[IR] Extraction-based Text Summarization

Automatic Text Summarization - Reading Notes


 

The problem that automatic summarization tackles is simple to state: condense the gist of a whole article into a few concise sentences, so that a reader can learn what the original text means just by reading the summary.

There are two broad lines of attack:

  • Extractive: pick key sentences out of the original text and assemble them into a summary. [The most mainstream, most widely used, and easiest approach.]
  • Abstractive: the computer must actually understand the original content and express it in its own words. [Comparatively, this has much more of the flavor of genuine artificial intelligence.]

 

Single-Document Extractive Summarization

Extractive methods rest on one assumption: the core idea of a document can be summed up by one or a few of its sentences. Summarization then becomes the task of finding the most important sentences in the document, that is, a sentence-ranking problem.

Different problems call for different ranking metrics: some applications care about relevance, some about timeliness, some about novelty, and so on; each choice leads to a different model at this level.
A typical extractive summarization problem considers two metrics: relevance and novelty. Relevance means the chosen sentences best represent the meaning of the document; novelty means the candidate sentences should carry little redundant information, with each sentence ideally expressing its own independent idea.

Some of the main ideas are sketched below.


1. Preprocessing
The first step in any standard NLP pipeline is preprocessing: splitting the text into sentences. There are two common options: split on periods (or other end-of-sentence punctuation), or use commas as delimiters to obtain shorter clause-level units.
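As a minimal sketch of the first option, assuming NLTK with its punkt model (the same tokenizer family that TextTeaser's parser loads from a pickle below):

import nltk

nltk.download('punkt', quiet=True)  # fetch the punkt sentence model if missing

def split_sentences(text):
    # punkt handles abbreviations such as "Dr." better than splitting on '.' alone
    return nltk.sent_tokenize(text)

print(split_sentences("Dr. Smith stayed late. He finished the draft."))
# ['Dr. Smith stayed late.', 'He finished the draft.']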


2. Word and Sentence Representation
The idea in this step is to represent words and sentences as quantities a computer can understand, then compute some metric over them and rank. This is also where the various algorithms and models differ most (a small sketch follows the list):
(1) Bag of Words. The bag-of-words model treats each word as a dimension, so a sentence becomes a high-dimensional sparse vector in the space spanned by the full vocabulary.
(2) TF-IDF. Essentially a weighted bag-of-words model: compute each word's TF-IDF value and use it as that word's weight.
(3) LDA/LSI. Represent the whole document as a TF-IDF matrix and apply SVD for dimensionality reduction, yielding two matrices: a document-topic matrix and a word-topic matrix. From the word-topic matrix, a sentence-topic matrix can then be derived.
(4) Word Embedding. Word2Vec, proposed by Tomas Mikolov, uses a battery of tricks and approximations to cheaply represent each word as a low-dimensional dense vector, which works well in many settings. Once words are vectors, there are also many ways to compose a sentence into a vector.
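A minimal sketch of (2) and (3), assuming scikit-learn: each sentence becomes one sparse TF-IDF row, and a truncated SVD of that matrix gives the sentence-topic representation that the LSI idea relies on.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # one sparse TF-IDF row per sentence

svd = TruncatedSVD(n_components=2)       # LSI: SVD over the TF-IDF matrix
topics = svd.fit_transform(X)            # dense sentence-topic matrix, shape (3, 2)

print(X.shape, topics.shape)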


3. Ranking
Two common approaches are introduced here.
(1) Graph-based ranking
Build a graph with each sentence of the document as a node and the similarity between sentences as edge weights, then run the PageRank algorithm to obtain a score for every sentence.
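A minimal sketch of this idea (essentially TextRank), assuming networkx and scikit-learn, with cosine similarity over TF-IDF vectors standing in for whatever sentence similarity one prefers:

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_n=2):
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)                  # sentence-by-sentence similarity matrix
    graph = nx.from_numpy_array(sim)            # nodes = sentences, edge weights = similarity
    scores = nx.pagerank(graph)                 # dict: node index -> PageRank score
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # emit in original document order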

(2) Feature-based ranking
Before deep learning took off, feature engineering was the go-to remedy for domain-specific problems. The features used here include:
1) Sentence length: a certain length is treated as ideal, and each sentence is scored by how close it comes to that ideal.
2) Sentence position: a score based on where the sentence sits in the document (for example, the first sentence of a paragraph is the topic sentence roughly 70% of the time).
3) Title words: a sentence is scored by how many of the title's words it contains.
4) Keywords: after preprocessing, the top 10 keywords by term frequency are extracted, and each sentence is scored by which keywords it contains and how they are distributed.
The representative algorithm is TextTeaser; its implementation is analyzed in full below.


4. Post-processing
The ranking so far accounts only for relevance, not novelty, so it is quite possible that the top-ranked sentences all express similar meanings. A penalty factor is therefore introduced to take novelty into account, and every sentence is rescored with the following formula:

score(i) = a·score(i) - (1 - a)·similarity(i, i-1), i = 2, 3, …, N

Here i is the position in the ranked order. Starting from the second sentence, each one is penalized by its similarity to the sentence ranked just before it; the top-ranked sentence keeps its original score.
This algorithm is known as MMR (Maximal Marginal Relevance).
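A minimal greedy sketch of this idea, with hypothetical inputs: `scores` is the per-sentence relevance score from the ranking step and `sim` a sentence-similarity matrix. Note that full MMR penalizes similarity against all sentences already selected, not only the immediately preceding one as in the formula above.

def mmr_select(scores, sim, k=3, a=0.7):
    # Greedily pick the sentence maximizing a*relevance - (1-a)*redundancy.
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return a * scores[i] - (1 - a) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen sentences, in selection order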


5. Output
The output is usually the top N sentences after ranking. This raises a crucial issue, and the one for which automatic summaries have long been criticized: readability. Because the selected sentences come from different paragraphs, stitching them together mechanically rarely yields smooth transitions between sentences; guaranteeing readability is genuinely hard.
One simple trick helps: output the selected sentences in the order in which they appear in the original text, which preserves at least a little coherence.


References:
[1] TextRank source-code reading notes: https://gist.github.com/rsarxiv/11470a8d763b2845f671061c21230435
[2] TextTeaser source-code reading notes: https://gist.github.com/rsarxiv/4e949264b3bda98828b84cf2991e57e4

 

For the abstractive approach, see the original post: http://blog.csdn.net/lu839684437/article/details/71600410


 

Code analysis: https://github.com/DataTeaser/textteaser

main.py

from summarizer import Summarizer

def getInput():
    with open('input.txt') as file:
        content = file.readlines()

    # drop blank lines and strip trailing newlines
    content = [c.replace('\n', '') for c in content if c != '\n']

    title = content[0]
    text = content[1:]  # everything after the title line

    return {'title': title, 'text': ' '.join(text)}

# #####################

input = getInput()

# Python 2 idiom from the upstream repo: drop any non-ASCII bytes
input['text'] = input['text'].decode("ascii", "ignore")

# collapse all whitespace (including newlines) into single spaces
input['text'] = " ".join(input['text'].replace("\n", " ").split())

summarizer = Summarizer()
# source and category are unused placeholders in this version of TextTeaser
result = summarizer.summarize(input['text'], input['title'], 'Undefined', 'Undefined')
result = summarizer.sortScore(result)           # highest totalScore first
result = summarizer.sortSentences(result[:30])  # keep top 30, restore original order

print 'Summary:'

for r in result:
    print r['sentence']
    # print r['totalScore']
    # print r['order']

summarizer.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from parser import Parser

class Summarizer:
    def __init__(self):
        self.parser = Parser()

    def summarize(self, text, title, source, category):
        sentences  = self.parser.splitSentences(text)
        titleWords = self.parser.removePunctations(title)
        titleWords = self.parser.splitWords(titleWords)  # split the cleaned title, not the raw one (bug fix)
        (keywords, wordCount) = self.parser.getKeywords(text)

        topKeywords = self.getTopKeywords(keywords[:10], wordCount, source, category)
        result      = self.computeScore(sentences, titleWords, topKeywords)
        result      = self.sortScore(result)

        return result

    def getTopKeywords(self, keywords, wordCount, source, category):
        # Add getting top keywords in the database here
        for keyword in keywords:
            articleScore = 1.0 * keyword['count'] / wordCount
            keyword['totalScore'] = articleScore * 1.5

        return keywords

    def sortScore(self, dictList):
        return sorted(dictList, key=lambda x: -x['totalScore'])

    def sortSentences(self, dictList):
        return sorted(dictList, key=lambda x: x['order'])

    def computeScore(self, sentences, titleWords, topKeywords):
        keywordList = [keyword['word'] for keyword in topKeywords]
        summaries = []

        for i, sentence in enumerate(sentences):
            sent  = self.parser.removePunctations(sentence)
            words = self.parser.splitWords(sent)

            sbsFeature = self.sbs(words, topKeywords, keywordList)
            dbsFeature = self.dbs(words, topKeywords, keywordList)

            titleFeature     = self.parser.getTitleScore(titleWords, words)
            sentenceLength   = self.parser.getSentenceLengthScore(words)
            sentencePosition = self.parser.getSentencePositionScore(i, len(sentences))
            keywordFrequency = (sbsFeature + dbsFeature) / 2.0 * 10.0
            totalScore       = (titleFeature * 1.5 + keywordFrequency * 2.0 +
                                sentenceLength * 0.5 + sentencePosition * 1.0) / 4.0

            summaries.append({
                # 'titleFeature': titleFeature,
                # 'sentenceLength': sentenceLength,
                # 'sentencePosition': sentencePosition,
                # 'keywordFrequency': keywordFrequency,
                'totalScore': totalScore,
                'sentence': sentence,
                'order': i
            })

        return summaries

    def sbs(self, words, topKeywords, keywordList):
        score = 0.0

        if len(words) == 0:
            return 0

        for word in words:
            word = word.lower()
            index = -1

            if word in keywordList:
                index = keywordList.index(word)

            if index > -1:
                score += topKeywords[index]['totalScore']

        return 1.0 / abs(len(words)) * score

    def dbs(self, words, topKeywords, keywordList):
        k = len(list(set(words) & set(keywordList))) + 1
        summ = 0.0
        firstWord  = {}
        secondWord = {}

        for i, word in enumerate(words):
            if word in keywordList:
                index = keywordList.index(word)

                if firstWord == {}:
                    firstWord = {'i': i, 'score': topKeywords[index]['totalScore']}
                else:
                    secondWord = firstWord
                    firstWord  = {'i': i, 'score': topKeywords[index]['totalScore']}
                    distance   = firstWord['i'] - secondWord['i']

                    summ += (firstWord['score'] * secondWord['score']) / (distance ** 2)

        return (1.0 / k * (k + 1.0)) * summ

parser.py: for the law-case database, we should implement our own parser.

# !/usr/bin/python
# -*- coding: utf-8 -*-
import nltk.data
import os


class Parser:
    def __init__(self):
        self.ideal = 20.0
        self.stopWords = self.getStopWords()

    def getKeywords(self, text):
        text  = self.removePunctations(text)
        words = self.splitWords(text)
        words = self.removeStopWords(words)
        uniqueWords = list(set(words))

        keywords = [{'word': word, 'count': words.count(word)} for word in uniqueWords]
        keywords = sorted(keywords, key=lambda x: -x['count'])

        return (keywords, len(words))

    def getSentenceLengthScore(self, sentence):
        # `sentence` is actually a list of words; the score peaks at the ideal length (20 words)
        return (self.ideal - abs(self.ideal - len(sentence))) / self.ideal

    # Jagadeesh, J., Pingali, P., & Varma, V. (2005). Sentence Extraction Based Single Document Summarization. International Institute of Information Technology, Hyderabad, India, 5.
    def getSentencePositionScore(self, i, sentenceCount):
        normalized = i / (sentenceCount * 1.0)

        if normalized > 0 and normalized <= 0.1:
            return 0.17
        elif normalized > 0.1 and normalized <= 0.2:
            return 0.23
        elif normalized > 0.2 and normalized <= 0.3:
            return 0.14
        elif normalized > 0.3 and normalized <= 0.4:
            return 0.08
        elif normalized > 0.4 and normalized <= 0.5:
            return 0.05
        elif normalized > 0.5 and normalized <= 0.6:
            return 0.04
        elif normalized > 0.6 and normalized <= 0.7:
            return 0.06
        elif normalized > 0.7 and normalized <= 0.8:
            return 0.04
        elif normalized > 0.8 and normalized <= 0.9:
            return 0.04
        elif normalized > 0.9 and normalized <= 1.0:
            return 0.15
        else:
            return 0

    def getTitleScore(self, title, sentence):
        titleWords = self.removeStopWords(title)
        sentenceWords = self.removeStopWords(sentence)
        matchedWords = [word for word in sentenceWords if word in titleWords]

        return len(matchedWords) / (len(title) * 1.0)

    def splitSentences(self, text):
        # load the pickled punkt sentence tokenizer bundled with the repo (Python 2: note the .decode)
        tokenizer = nltk.data.load('file:' + os.path.dirname(os.path.abspath(__file__)).decode('utf-8') + '/trainer/english.pickle')

        return tokenizer.tokenize(text)

    def splitWords(self, sentence):
        return sentence.lower().split()

    def removePunctations(self, text):
        return ''.join(t for t in text if t.isalnum() or t == ' ')

    def removeStopWords(self, words):
        return [word for word in words if word not in self.stopWords]

    def getStopWords(self):
        with open(os.path.dirname(os.path.abspath(__file__)) + '/trainer/stopWords.txt') as file:
            words = file.readlines()

        return [word.replace('\n', '') for word in words]

 
