Scrape the Douban Top 250 movies, as a way to get familiar with Python HTTP requests, writing data to a database, and regular expressions.

1. Identify the tags to scrape
Get the ranking:   x.find('em')
Get the title:     x.find('span', class_='title').text
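For example, a minimal sketch of extracting those tags with BeautifulSoup (the HTML fragment below is a hand-written stand-in for one of the page's div class="item" blocks, not a verbatim copy of Douban's markup):

from bs4 import BeautifulSoup

# hand-written stand-in for one Douban Top 250 item block
html = '''
<div class="item">
  <em>1</em>
  <a href="https://movie.douban.com/subject/1292052/" class="">
    <span class="title">肖申克的救赎</span>
  </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', class_='item')
print(item.find('em').text)                      # ranking: 1
print(item.find('a', class_='').get('href'))     # detail-page URL
print(item.find('span', class_='title').text)    # title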
 
Problems:
  • Three years ago this was done with urllib2; Python 3 no longer uses it (the same functionality now lives in urllib.request).
  • itemlist = soup.find_all("div",class_="item") finds all tags whose class is item and stores them in a list.
  • url = i.find('a',class_='').get('href') gets the content of the a tag's href attribute.
  • title = i.find('span',class_='title').text gets the title text.
  • Splitting with a regular expression: from aa11a123/bbb2(sjd, extract every run of digits followed by / and every run of digits followed by ( (see the sketch after this list).
        [0-9] matches one digit in the range 0-9; [0-9]+ matches any number of digits, since + means "one or more".
        [0-9]+/      n digits followed by /, e.g. 123/
        [0-9]+[/,(]  n digits followed by / or ( , e.g. 123/ or 2( (note: the comma inside [...] is itself matched as a literal comma)

        The Python convention is to write the pattern as a raw string, r'pattern':
         patt = r'[0-9]+[/,(]'
        # one or more digits followed by a single / or ( character
        match = re.findall(patt,infoTmp)    # infoTmp is the string to process

  • The HTML served is not always what the browser shows: on the first page scraped, the information inside the li elements is loaded dynamically, so the current method cannot pull it down (it would need a JavaScript-capable approach such as a headless browser).
  • The machine has phpstudy installed, which bundles its own MySQL and conflicts with the locally installed one. Go to the bin folder under the local MySQL install directory and, in an administrator cmd window, run: cd C:\D\mysql-8.0.19-winx64\bin then mysqld.exe -install. It prints: Service successfully installed. Then start MySQL with: net start mysql
  • Navicat Premium threw error 2059 (MySQL 8's caching_sha2_password authentication plugin is not supported by older clients); following https://www.cnblogs.com/uncle-kay/p/9751805.html I changed the MySQL password.
  • Navicat expired and needed registering: download a keygen, making sure it targets Navicat for MySQL; disconnect from the network while activating, and close 360 before running the keygen. See "navicat15 for mysql activation".
  • SQL for inserting into the database:
     tumpe = (movie_range,title,infoTmp,rating_num,inq,url)
        sql='insert into movie(movie_range,title,info,rating_num,inq,url) values("%s","%s","%s","%s","%s","%s")'%tumpe
        # "%s" cannot be written as bare %s, or the statement fails: the quotes are what turn each value into a SQL string literal
  • SQL for updating the database (see the parameterized sketch after this list):
    sql='''update movie set country="%s",year="%s",
                film_Genres="%s",director_actor="%s"
               where movie_range = "%s"
        '''%tumpe
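A quick sketch of that pattern in action on the sample string from the notes (expected output shown in comments):

import re

infoTmp = 'aa11a123/bbb2(sjd'
patt = r'[0-9]+[/,(]'               # digit run followed by /, comma, or (
match = re.findall(patt, infoTmp)
print(match)                        # ['123/', '2(']
print(infoTmp.split(match[0]))      # ['aa11a', 'bbb2(sjd'] -- how saveyear splits on the first match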
 
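For reference, the quoting headache goes away entirely if the values are handed to pymysql as parameters instead of being %-formatted into the string. This is a minimal sketch, not the code the script below uses; the sample row and the movie schema in the comment are assumptions, since the post never shows the CREATE TABLE:

import pymysql

# assumed schema (not shown in the post):
# create table movie(movie_range varchar(8), title varchar(128), info text,
#                    rating_num varchar(8), inq varchar(256), url varchar(256),
#                    country varchar(64), year varchar(8),
#                    film_Genres varchar(64), director_actor varchar(256));

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='password', db='douban', charset='utf8mb4')
cursor = conn.cursor()

tumpe = ('1', '肖申克的救赎', '导演: ...1994/美国/犯罪 剧情', '9.7',
         '希望让人自由', 'https://movie.douban.com/subject/1292052/')  # hypothetical row
# %s here is a pymysql placeholder, not Python %-formatting: the driver
# quotes and escapes each value itself, so no hand-written "%s" is needed
# and SQL injection is avoided.
sql = ('insert into movie(movie_range,title,info,rating_num,inq,url) '
       'values(%s,%s,%s,%s,%s,%s)')
cursor.execute(sql, tumpe)
conn.commit()
cursor.close()
conn.close()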
#!/usr/bin/python
#coding: UTF-8
from urllib.request import Request,urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import pymysql
import json
import re
 
 
def getConn():
    conn= pymysql.connect(
            host='localhost',
            port = 3306,
            user='root',
            passwd='password',
            db ='douban',
            )
    return conn
 
 
 
 
def getContent(url):
    req = Request(url)
    # add a User-Agent header so the request looks like a browser
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36')
    html = None  # stays None if the request fails
    try:
        response = urlopen(req)
        buff = response.read()
        html = buff.decode("utf8")
        response.close()
    except HTTPError as e:
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
    except URLError as e:
        print('reason: %s' % e.reason)
    return html
 
def saveContent(content, url):
    soup = BeautifulSoup(content,"html.parser")
    # print (soup)
    itemlist = soup.find_all("div",class_="item")
    conn = getConn()
    cursor = conn.cursor()
    num = 0
    for i in itemlist:
        num += 1
        print (num)
        movie_range = i.find('em').text
        url = i.find('a',class_='').get('href')
        title = i.find('span',class_='title').text
        bd = i.find('div',class_='bd')
 
 
        info = bd.find('p',class_='').text
        infoTmp = info.replace('\xa0','').replace('  ','').replace('\n','')
        rating_num = i.find('span',class_='rating_num').text
        inq = i.find('span',class_='inq').text
 
 
        
        # -------- insert into the database --------
        tumpe = (movie_range,title,infoTmp,rating_num,inq,url)
        sql='insert into movie(movie_range,title,info,rating_num,inq,url) values("%s","%s","%s","%s","%s","%s")'%tumpe
        # "%s" cannot be written as bare %s, or the statement fails
        cursor.execute(sql)
        conn.commit()
    cursor.close() # close the cursor
    conn.close()   # close the connection
 
 
#--------- For rows already inserted, use the ranking (movie_range) as the key to add year, country, film genre, etc.
def saveyear(content, url):
    soup = BeautifulSoup(content,"html.parser")
    itemlist = soup.find_all("div",class_="item")
 
 
    conn = getConn()
    cursor = conn.cursor()
 
 
    num = 0
    for i in itemlist:
        num += 1
        print (num)
 
 
        movie_range = i.find('em').text
        # used as the primary key
        
        bd = i.find('div',class_='bd')
        info = bd.find('p',class_='').text
        infoTmp = info.replace('\xa0','').replace('  ','').replace('\n','')
        patt = r'[0-9]+[/,(]'
        # one or more digits followed by a single /, comma, or ( character
        match = re.findall(patt,infoTmp)
        # infoTmp looks like: 导演: 奥利维·那卡什  / 艾力克·托兰达  Toledano主...2011/法国/剧情 喜剧
        director_actor = infoTmp.split(match[0])[0]   # everything before the year
        year = match[0].replace('/','')               # e.g. '2011/' -> '2011'
        country_type = infoTmp.split(match[0])[1]     # e.g. '法国/剧情 喜剧'
        country = country_type.split('/')[0]          # country
        film_Genres = country_type.split('/')[1]      # genres
        tumpe = (country,year,film_Genres,director_actor,movie_range)
    
        # -------- write the update to the database --------
        sql='''update movie set country="%s",year="%s",
                film_Genres="%s",director_actor="%s"
               where movie_range = "%s"
        '''%tumpe
        # "%s" cannot be written as bare %s, or the statement fails
        cursor.execute(sql)
        conn.commit()
    cursor.close() # close the cursor
    conn.close()   # close the connection
def geturl():
    # the Top 250 is served 25 movies per page: start = 0, 25, ..., 225
    for i in range(0,10):
        j = 25*i
        print (j)
        url = 'https://movie.douban.com/top250?start=%s&filter='%j
        print(url)
        content = getContent(url)
        # saveContent(content, url)    # first pass: insert the base rows
        saveyear(content,url)          # second pass: fill in year/country/genre via update
geturl()
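Since the notes mention that urllib2 is obsolete, here is a sketch of the same fetch done with the third-party requests library; this is an alternative, not what the script above uses, getContentRequests is a made-up name, and requests has to be installed separately with pip:

import requests

def getContentRequests(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/62.0.3202.75 Safari/537.36'}
    try:
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()          # raise on 4xx/5xx status codes
        resp.encoding = 'utf-8'          # the Top 250 pages are UTF-8
        return resp.text
    except requests.RequestException as e:
        print('request failed: %s' % e)
        return None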
 
 
 