Python Web Scraping 8: Natural Language Processing

When you type "cute kitten" into Google Image Search, how does Google know what you are looking for? The phrase happens to be closely associated with adorable kittens. And when you type "dead parrot" into the YouTube search box, how does YouTube know to recommend sketches by the Monty Python comedy troupe? Because every uploaded video carries a title and a description.

Summarizing Data

In Chapter 7 we looked at how to break text into n-grams, or phrases n words long. At the most basic level, this collection can be used to determine which words and phrases occur most often in a passage. In addition, you can pull out the sentences surrounding the most frequent phrases and stitch them together into a plausible-sounding summary of the original text (a sketch of this appears after the second code listing below).

The text sample we will summarize comes from the inaugural address of William Henry Harrison, the ninth president of the United States. Harrison's presidency set two records: the longest inaugural address and the shortest term in office, at 32 days.

We will use the full text of his inaugural address (http://pythonscraping.com/files/inaugurationSpeech.txt) as the data source for many of the code samples in this chapter.

With a small modification to the n-gram model from Chapter 7, we can produce frequency data for 2-grams, and then sort the resulting frequency dictionary with Python's operator module:


from urllib.request import urlopen
import re
import string
import operator

def cleanInput(input):
    # Normalize whitespace, strip citation markers, and drop non-ASCII characters
    input = re.sub('\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        # Keep multi-character words, plus the one-letter words "a" and "i"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    # Count every sequence of n consecutive words
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
    urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
    'utf-8')
ngrams = ngrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNGrams)

Output:

[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41),
 ('the constitution', 34), ('of our', 29), ('to be', 26), ('from the', 24),
 ('the people', 24), ('and the', 23), ('it is', 23), ('that the', 23),
 ('of a', 22), ('of their', 19), ...]

2-grams such as "of the", "in the", and "to the" do not look particularly interesting or useful.

Lists of the 5,000 most frequently used English words are freely available, and they are more than enough to serve as a basic filter for weeding out the most common 2-grams. In fact, just the first 100 words dramatically improve the results. We add an isCommon function to implement the filter:


from urllib.request import urlopen
import re
import string
import operator

def isCommon(ngram):
    # Return True if any word of the n-gram is one of the 100 most common English words
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that", "for", "you", "he", "with", "on", "do", "say", "this", "they", "is", "an", "at", "but","we", "his", "from", "that", "not", "by", "she", "or", "as", "what", "go", "their","can", "who", "get", "if", "would", "her", "all", "my", "make", "about", "know", "will","as", "up", "one", "time", "has", "been", "there", "year", "so", "think", "when", "which", "them", "some", "me", "people", "take", "out", "into", "just", "see", "him", "your", "come", "could", "now", "than", "like", "other", "how", "then", "its", "our", "two", "more", "these", "want", "way", "look", "first", "also", "new", "because", "day", "more", "use", "no", "man", "find", "here", "thing", "give", "many", "well"]
    for word in ngram:
        if word in commonWords:
            return True
    return False

def cleanText(input):
    # Normalize whitespace, strip citation markers, and drop non-ASCII characters
    input = re.sub('\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = re.sub(r"u\.s\.", "us", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        # Skip 2-grams that contain any of the most common words
        if isCommon(input[i:i+n]):
            continue
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

def getFirstSentenceContaining(ngram, content):
    # Return the first sentence of the original text that contains the given n-gram
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNGrams)



Output:

[('united states', 10), ('executive department', 4), ('general government', 4),
 ('called upon', 3), ('government should', 3), ('whole country', 3),
 ('mr jefferson', 3), ('chief magistrate', 3), ('same causes', 3),
 ('legislative body', 3)]

Markov Models

These text generators are built on a Markov model, an approach often used to analyze large sets of random events, where one discrete event is followed by another discrete event with a certain probability that depends only on the outcome of the first.

In a weather system modeled this way (sketched in code after the list below), if today is sunny, there is a 70% chance that tomorrow will also be sunny, a 20% chance of clouds, and a 10% chance of rain. If today is rainy, there is a 50% chance of rain tomorrow, a 25% chance of sun, and a 25% chance of clouds.
A few points are worth noting:

  • The probabilities leading out of any single node must add up to 100%. However complicated the system, exactly one of the possible next events must happen at every step.
  • Although the weather system has only three possible states at any moment, you can use the model to generate an endless list of weather-state transitions.
  • Only the current node affects the next day's state. If you are on the "sunny" node, it makes no difference whether the previous 100 days were all sunny or all rainy; the chance of sun tomorrow is still 70%.
  • Some nodes may be harder to reach than others. The math behind this gets fairly involved, but intuitively, at any point in this system the next day is much less likely to be "rainy" (the probabilities of the arrows pointing to it add up to less than 100%) than "sunny" or "cloudy".
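
As a minimal sketch of the weather model above, the transition table can be written as a dictionary and sampled with the standard library's random.choices; note that the "cloudy" row is an assumption, since the text only gives the probabilities for sunny and rainy days:

import random

# Transition probabilities for the weather model described above.
# Each row must sum to 1.0 (i.e., 100%).
weather = {
    "sunny":  {"sunny": 0.70, "cloudy": 0.20, "rainy": 0.10},
    "cloudy": {"sunny": 0.25, "cloudy": 0.50, "rainy": 0.25},  # assumed row, not given in the text
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.50},
}

state = "sunny"
forecast = []
for day in range(10):
    nextStates = list(weather[state].keys())
    probabilities = list(weather[state].values())
    # Pick the next state, weighted by its transition probability
    state = random.choices(nextStates, weights=probabilities)[0]
    forecast.append(state)
print(forecast)

The listing that follows applies the same idea to text: each word of the inaugural address becomes a node, and the transition weights are simply counts of which word follows it in the speech: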
from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    sum = 0
    for word, value in wordList.items():
        sum += value
    return sum

def retrieveRandomWord(wordList):
    # Pick a random word from the dictionary, weighted by how often it occurs
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    #Remove newlines and quotes
    text = text.replace("\n", " ")
    text = text.replace("\"", "")

    #Make sure punctuation marks are treated as their own "word," so they will be
    #included in the Markov chain
    punctuation = [',','.',';',':']
    for symbol in punctuation:
        text = text.replace(symbol, " "+symbol+" ")

    words = text.split(" ")
    #Filter out empty words
    words = [word for word in words if word != ""]

    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            #Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1

    return wordDict

text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)

#Generate a Markov chain of length 100
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord+" "
    #print(wordDict[currentWord])
    currentWord = retrieveRandomWord(wordDict[currentWord])

print(chain)







Output:

I sincerely believe in Chief Magistrate to make all necessary sacrifices and
oppression of the remedies which we may have occurred to me in the arrangement
and disbursement of the democratic claims them , consolatory to have been best
political power in fervently commending every other addition of legislation , by
the interests which violate that the Government would compare our aboriginal
neighbors the people to its accomplishment . The latter also susceptible of the
Constitution not much mischief , disputes have left to betray . The maxim which
may sometimes be an impartial and to prevent the adoption or

Six Degrees of Wikipedia

The idea behind a breadth-first search is to search all the links that connect directly to the starting page first (rather than drilling deeper as soon as one link is found). If none of those links contains the target page (the article you are looking for), the search moves to the second level: all the links found on the pages linked from the starting page. The process repeats until the depth limit is reached (6 in this example) or the target page is found.


import pymysql

# Connect to the local MySQL database that holds the scraped Wikipedia link graph
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='root', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("USE wikipedia")

def getUrl(pageId):
    # Look up the URL of a page by its id
    cur.execute("SELECT url FROM pages WHERE id = %s", (int(pageId),))
    if cur.rowcount == 0:
        return None
    return cur.fetchone()[0]

def getLinks(fromPageId):
    # Return the ids of all pages linked to from the given page
    cur.execute("SELECT toPageId FROM links WHERE fromPageId = %s", (int(fromPageId),))
    if cur.rowcount == 0:
        return None
    return [x[0] for x in cur.fetchall()]

def searchBreadth(targetPageId, currentPageId, depth, nodes):
    if nodes is None or len(nodes) == 0:
        return None
    if depth <= 0:
        # At the depth limit, just check whether the target is among these nodes
        for node in nodes:
            if node == targetPageId:
                return [node]
        return None
    #depth is greater than 0 -- go one level deeper for each node
    for node in nodes:
        found = searchBreadth(targetPageId, node, depth-1, getLinks(node))
        if found is not None:
            # Build the path back up from the target toward the starting page
            found.append(currentPageId)
            return found
    return None

nodes = getLinks(1)
targetPageId = 123428
for i in range(0, 4):
    found = searchBreadth(targetPageId, 1, i, nodes)
    if found is not None:
        print(found)
        for node in found:
            print(getUrl(node))
        break
    else:
        print("No path found")


Here is the link path between the Kevin Bacon article (page ID 1 in the database) and the Eric Idle article (page ID 78520 in the database):

TARGET 134951 FOUND!
PAGE: 156224
PAGE: 155545
PAGE: 3
PAGE: 1

The corresponding article titles are: Kevin Bacon → San Diego Comic Con International → Brian Froud → Terry Jones → Eric Idle.

Natural Language Toolkit

Installation and Setup

The NLTK module is installed like any other Python module: either download the package directly from the NLTK website, or use any of several third-party installers with the keyword "nltk". For detailed installation instructions, see the NLTK website (http://www.nltk.org/install.html).

Once the module is installed, you can download NLTK's bundled text corpora so that you can experiment with its features easily. Enter the following at the Python command line:

>>> import nltk
>>> nltk.download()
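
This opens NLTK's interactive downloader. Individual packages can also be fetched by name instead; the identifiers below are standard NLTK package names, listed here as examples:

>>> nltk.download('punkt')                       # tokenizer models used by word_tokenize and sent_tokenize
>>> nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag
>>> nltk.download('book')                        # the sample texts used by nltk.book below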

Statistical Analysis with NLTK

NLTK is good at generating statistics about a body of text, including word counts, word frequencies, and the parts of speech of words. If all you need is a simple, straightforward calculation (say, the number of unique words in a passage), importing NLTK is overkill: it is a very large module. But if you need deeper analysis of the text, it has functions that will compute just about any statistic you need.


from nltk import word_tokenize
from nltk import Text

tokens = word_tokenize("Here is some not very interesting text")
text = Text(tokens)


Statistical analysis with NLTK usually begins with a Text object, which can be created from a plain Python string as shown above. NLTK also ships with nine preloaded sample texts, which can be imported as follows:

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Text objects can be manipulated much like ordinary Python lists, as if they were a list of all the words in the text. Using this property, you can count the unique words in a text and compare that against the total number of words:




>>> words = set(text6)
>>> len(text6)/len(words)
7.833333333333333


This shows that each word in the script is used about eight times on average. You can also put the text object into a frequency-distribution object, FreqDist, to see which words are most common and what the frequencies of particular words are:

>>> from nltk import FreqDist
>>> fdist = FreqDist(text6)
>>> fdist.most_common(10)
[(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319),
 (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)]
>>> fdist["Grail"]
34
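
FreqDist is not limited to single words: it accepts any sequence of hashable items, so the 2-gram counting from earlier in this chapter can be redone in a couple of lines with NLTK's bigrams helper. This is a small illustrative sketch; the ('Sir', 'Robin') lookup is just an example key:

from nltk import bigrams, FreqDist
from nltk.book import text6

# Count every pair of adjacent tokens in Monty Python and the Holy Grail
bigramDist = FreqDist(bigrams(text6))
print(bigramDist.most_common(5))       # the five most frequent 2-grams
print(bigramDist[('Sir', 'Robin')])    # how often the pair ('Sir', 'Robin') appears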

Part-of-Speech Analysis with NLTK

Web scraping often involves searching through the text you collect. After scraping a site's text, you might want to search it for the word "google", but only where it is used as a verb, not as the proper noun Google. Or you might want to find every mention of the company Google without relying on capitalization (people may forget to capitalize and simply write "google"). This is where the pos_tag function comes in handy:


from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
nouns = ['NN', 'NNS', 'NNP', 'NNPS']

for sentence in sentences:
    if "google" in sentence.lower():
        # Tag each word in the sentence with its part of speech
        taggedWords = pos_tag(word_tokenize(sentence))
        for word in taggedWords:
            # Print the sentence only when "google" is used as a noun
            if word[0].lower() == "google" and word[1] in nouns:
                print(sentence)
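
With NLTK's default tagger, this should print only the first sentence, where "Google" is tagged as a proper noun (NNP); in the second sentence "google" is used as a verb and is skipped.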
                    
                    