Scrapy basics, crawler primer: running a few crawlers with urllib2 first

1. Scraping Qiushibaike (糗事百科)

Overview: Qiushibaike serves plain HTML pages, so the HTML text can be fetched directly and then filtered with regular expressions.

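Requests to Qiushibaike need to carry a browser identification string, i.e. the User-Agent request header (this is the client's identity, not a proxy setting).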
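How the User-Agent ends up on a request can be checked without touching the network. The following is a minimal sketch, assuming Python 3, where the `Request` class from urllib2 lives in `urllib.request`; the helper name `build_request` is hypothetical and only serves the illustration.

```python
from urllib.request import Request

# Browser identification string; any current browser's value can be used here
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
)


def build_request(page):
    """Build (but do not send) a request for one page of the hot listing."""
    url = "http://www.qiushibaike.com/hot/page/" + str(page)
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request(1)
print(request.full_url)
# Request stores header names capitalized, so the lookup key is "User-agent"
print(request.get_header("User-agent"))
```

Nothing is sent until the request object is passed to `urlopen`, so the headers can be inspected this way before any traffic reaches the site.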

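import re
import urllib2    # Python 2 only; in Python 3 these calls live in urllib.request


def pachong(page):
    # URL of one page of the "hot" listing
    url = "http://www.qiushibaike.com/hot/page/" + str(page)
    # Browser identification string; yours can be read from the browser's developer tools (F12)
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    # The User-Agent is sent as a request header
    headers = {'User-Agent': user_agent}
    try:
        # Passing headers to Request sends them along with the request
        request = urllib2.Request(url, headers=headers)
        # Open the request and get the server's response
        response = urllib2.urlopen(request)
        # The page is UTF-8 encoded; without decoding, the text does not come out correctly
        content = response.read().decode('utf-8')
        # Capture the text inside each <span>; non-greedy so two spans on one line are not merged
        pattern = re.compile(r'<span>\s*(.*?)\s*</span>')
        # findall returns a list with the captured text of every match
        items = re.findall(pattern, content)
        for i in items:
            # Skip entries that contain an image
            haveimg = re.search('img', i)
            if not haveimg:
                print i, '\n'
    except Exception as e:
        # Print the error (network failure, bad encoding) instead of crashing
        print e


if __name__ == '__main__':
    # Fetch pages 1 and 2
    for i in range(1, 3):
        pachong(i)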
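The filtering step can also be tried offline. The sketch below runs the same kind of `<span>` pattern over a made-up HTML fragment; `SAMPLE_HTML` and the function name `extract_text_items` are hypothetical, and the real page markup may differ. The pattern uses a non-greedy group so that two spans on one line are not merged, and it checks for `<img` rather than the bare substring `img` so that ordinary text containing those letters is kept.

```python
import re

# Made-up fragment that imitates the shape of a joke listing
SAMPLE_HTML = """
<div class="content">
<span>

First joke text

</span>
</div>
<div class="content">
<span><img src="pic.jpg" alt="picture"/></span>
</div>
<div class="content">
<span>Second joke text</span>
</div>
"""

# \s* absorbs the newlines around the text; (.*?) captures the text itself
SPAN_PATTERN = re.compile(r"<span>\s*(.*?)\s*</span>")


def extract_text_items(html):
    """Return the text of every <span>, skipping spans that hold an image."""
    items = SPAN_PATTERN.findall(html)
    return [item for item in items if "<img" not in item]


print(extract_text_items(SAMPLE_HTML))
# ['First joke text', 'Second joke text']
```

Because `.` does not match a newline by default, this only captures text that sits on a single line inside the span; passing `re.DOTALL` to `re.compile` would be one way to handle multi-line content.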

posted @ 2018-08-27 16:57 0点0度
