Web Scraping Basics (1)

Basic crawler concepts

What is a crawler?

----- An automated program that requests websites and extracts data from them -----

The urllib module

How it is called in Python 2 vs. Python 3

#################################### in py2:

'''
import urllib2

# urlopen returns a file-like response object
data = urllib2.urlopen("http://www.baidu.com")
f = open("baidu.html", "w")
f.write(data.read())
f.close()
'''

#################################### in py3:

# urllib2 was merged into urllib.request in Python 3
import urllib.request

data = urllib.request.urlopen("http://www.baidu.com")
# read() returns bytes in Python 3, so open the file in binary mode
f = open("baidu.html", "wb")
f.write(data.read())
f.close()
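A slightly more idiomatic variant of the Python 3 snippet uses with blocks, so both the response and the output file are closed automatically even if an error occurs:

import urllib.request

# same download as above, with context managers handling the cleanup
with urllib.request.urlopen("http://www.baidu.com") as response:
    with open("baidu.html", "wb") as f:
        f.write(response.read())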

Introduction to the urllib module
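In Python 3, urllib is a package whose main submodules are urllib.request (opening URLs), urllib.parse (building and splitting URLs) and urllib.error (the exceptions raised by urllib.request). A minimal sketch that builds a query string and sends a custom User-Agent header (the search URL and header value are only examples):

import urllib.request
import urllib.parse

# build the query string safely with urllib.parse
params = urllib.parse.urlencode({"wd": "python"})
url = "http://www.baidu.com/s?" + params

# a Request object lets us attach headers before opening the URL
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as response:
    print(response.status)                    # HTTP status code
    html = response.read().decode("utf-8")    # decode bytes to str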

 

 

The requests module
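requests is a third-party library (installed with pip install requests) that offers a much friendlier API than urllib. A minimal sketch of a GET request with query parameters and a custom header (all values here are just examples):

import requests

# requests builds the query string and attaches the headers for us
response = requests.get(
    "https://movie.douban.com/top250",
    params={"start": 0, "filter": ""},
    headers={"User-Agent": "Mozilla/5.0"},
)
print(response.status_code)   # HTTP status code
print(response.encoding)      # encoding guessed from the response headers
html = response.text          # body decoded to str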

 

A simple example combining the requests module and regular expressions

import requests
import re
import json

def getPage(url):
    # download one page of the Douban Top 250 list and return the HTML as text
    response = requests.get(url)
    return response.text

def parsePage(s):
    # re.S lets "." match newlines; named groups pick out the fields we want
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

    ret = com.finditer(s)
    for i in ret:
        # yield one dict per movie instead of building the whole list in memory
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }

def main(num):
    # num is the offset of the first movie on the page (0, 25, 50, ...)
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)

    # append one JSON object per line to the output file
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")

if __name__ == '__main__':
    # the Top 250 list spans 10 pages of 25 movies each
    count = 0
    for i in range(10):
        main(count)
        count += 25
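One thing to be aware of: Douban may reject requests that do not carry a browser-like User-Agent, in which case getPage() returns a page with no <div class="item"> blocks and the script writes nothing. A minimal sketch of passing headers to requests.get (the header value is just an example, not part of the original script):

import requests

def getPage(url):
    # send a browser-like User-Agent so the request is less likely to be blocked
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    return response.text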

 
