Douban: Notes on Crawling a Doulist
I recently compiled the data from the Douban 2020 movie calendar into a Douban doulist, but a doulist offers no convenient way to filter or query its entries. So I decided to crawl the data and do the filtering myself, which led to this note. The implementation is in Python 3.
Code Implementation
The crawler itself is very simple, built on the requests_html library for Python 3. Open the doulist page and inspect its HTML: each movie's information sits inside a div[@class='bd doulist-subject'] element, so we just need to pull out the title, rating, genre, cast, director, year, and so on.
The one fiddly part is that the director, cast, genre, and country/region of production all live in a single div[@class='abstract'], with the fields separated only by line breaks. So we strip the line breaks and then slice out each field separately.
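As a minimal sketch of that step (the sample string below is made up for illustration; in the real crawler it comes from the div[@class='abstract'] node with the newlines removed):

import re

# Hypothetical flattened abstract text; the real value is
# element.text.strip().replace('\n', '') from the crawled page.
regstr = '导演: 弗兰克·德拉邦特 主演: 蒂姆·罗宾斯 类型: 剧情 制片国家/地区: 美国 年份: 1994'

director = re.findall(r'导演:(.*?)主演', regstr)[0].strip()
starring = re.findall(r'主演:(.*?)类型', regstr)[0].strip()
movietype = re.findall(r'类型:(.*?)制片国家/地区', regstr)[0].strip()
state = re.findall(r'制片国家/地区:(.*?)年份', regstr)[0].strip()
created = int(regstr.split('年份:')[-1])  # int() tolerates the leading space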
After crawling, I chose to insert the data into MySQL. That part is also straightforward, so I won't describe it in detail. The complete code follows:
MySQL:
CREATE TABLE movie(
    mid int PRIMARY KEY AUTO_INCREMENT,
    title varchar(200),
    rate decimal(5,1),
    type varchar(200),
    director varchar(200),
    starring varchar(200),
    state varchar(200),
    created int
) AUTO_INCREMENT = 1;
Python:
import math
import re

import pymysql
import pyodbc
from requests_html import HTMLSession


class MSSQL:
    """SQL Server writer (kept for reference; the entry point below uses MySQL)."""

    def __init__(self):
        self.server = 'mssqlserver'
        self.database = 'douban'
        self.username = 'admin'
        self.password = 'password'
        self.driver = '{ODBC Driver 13 for SQL Server}'

    def connect(self):
        connection = pyodbc.connect(
            'DRIVER=' + self.driver + ';SERVER=' + self.server +
            ';PORT=1433;DATABASE=' + self.database +
            ';UID=' + self.username + ';PWD=' + self.password)
        return connection

    def execquery(self, sqltext):
        connection = self.connect()
        cursor = connection.cursor()
        return cursor.execute(sqltext)

    def execscalar(self, sqltext):
        connection = self.connect()
        cursor = connection.cursor()
        cursor.execute(sqltext)
        connection.commit()

    def insert_douban_movie(self, title, rate, director, starring,
                            movietype, countrystate, releasetime):
        sqltext = ("insert into douban_movie(title,rate,director,starring,"
                   "movietype,countrystate,releasetime) values(?,?,?,?,?,?,?)")
        connection = self.connect()
        cursor = connection.cursor()
        cursor.execute(sqltext, title, rate, director, starring,
                       movietype, countrystate, releasetime)
        connection.commit()


class MYSQL:
    def __init__(self):
        self.server = 'mysqlinstance'
        self.database = 'douban'
        self.username = 'username'
        self.password = 'password'

    def connect(self):
        # pymysql 1.0+ only accepts keyword arguments.
        return pymysql.connect(host=self.server, user=self.username,
                               password=self.password, database=self.database)

    def insert_movie(self, title, rate, movietype, director, starring, state, created):
        conn = self.connect()
        cursor = conn.cursor()
        try:
            cursor.execute(
                "INSERT INTO movie(title,rate,type,director,starring,state,created) "
                "VALUES(%s,%s,%s,%s,%s,%s,%s);",
                (title, rate, movietype, director, starring, state, created))
            conn.commit()
        except Exception as e:
            conn.rollback()
            print("insert error:{error}".format(error=e))
        finally:
            cursor.close()
            conn.close()

    def test(self):
        db = self.connect()
        cursor = db.cursor()
        cursor.execute("SELECT VERSION()")
        data = cursor.fetchone()
        print("Database version : %s " % data)
        db.close()


class HtmlCrawler:
    def __init__(self):
        self.session = HTMLSession()

    def get_doulist(self, doulist_url):
        # Read the total item count from the filter bar (rendered like
        # "(360)"), then visit every page of 25 items.
        r = self.session.get(doulist_url)
        page_size = 25
        total_number = int(r.html.xpath(
            "//div[@class='doulist-filter']/a/span")[0].text
            .replace('(', '').replace(')', ''))
        total_page = math.ceil(total_number / page_size)
        for i in range(total_page):
            self.get_movies(doulist_url + '/?start=' + str(i * page_size))

    def get_movies(self, doulist_url):
        r = self.session.get(doulist_url)
        movies_title = r.html.xpath(
            "//div[@class='bd doulist-subject']//div[@class='title']/a")
        movies_rate = r.html.xpath(
            "//div[@class='bd doulist-subject']"
            "//div[@class='rating']/span[@class='rating_nums']")
        movies_abstract = r.html.xpath(
            "//div[@class='bd doulist-subject']//div[@class='abstract']")
        for i in range(len(movies_title)):
            # Flatten the abstract block, then slice out each field.
            regstr = movies_abstract[i].text.strip().replace('\n', '')
            director = re.findall(r'导演:(.*?)主演', regstr)[0].strip()
            starring = re.findall(r'主演:(.*?)类型', regstr)[0].strip()
            movietype = re.findall(r'类型:(.*?)制片国家/地区', regstr)[0].strip()
            state = re.findall(r'制片国家/地区:(.*?)年份', regstr)[0].strip()
            created = int(regstr.split('年份:')[-1])
            MYSQL().insert_movie(movies_title[i].text, float(movies_rate[i].text),
                                 movietype, director, starring, state, created)


if __name__ == "__main__":
    url = 'https://www.douban.com/doulist/122330446'
    HtmlCrawler().get_doulist(url)
Statistics and Analysis
I had planned to just write a few ad-hoc queries and leave it at that, but even that didn't suit my lazy style, so I loaded the data into Power BI and built a simple query page instead. Here is what the finished page looks like:
The search box at the top is a Text Filter. I had also planned to add a Hierarchy Slicer, but that would have meant splitting and deduplicating the cast, director, region, and genre fields in the source data, which felt like too much trouble, so I skipped it.
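For reference, here is roughly the kind of quick filter I had in mind before settling on Power BI. It's a sketch in Python against the movie table above; the connection values are the placeholders from the crawler code, and the rating cutoff and genre keyword are arbitrary examples:

import pymysql

# Illustrative only: pull highly rated dramas, newest first.
conn = pymysql.connect(host='mysqlinstance', user='username',
                       password='password', database='douban')
try:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT title, rate, created FROM movie "
            "WHERE rate >= %s AND type LIKE %s "
            "ORDER BY created DESC",
            (8.5, '%剧情%'))
        for title, rate, created in cursor.fetchall():
            print(title, rate, created)
finally:
    conn.close()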
Rant
Barcode scanning on the Xiaomi Mi Mix 3 is hopeless. While using Douban's scanner to log movie info, the camera would freeze after ten or so scans, and nothing short of rebooting the phone would fix it. I tried the same workflow on an iPhone 7 and a Huawei Mate 30 with no such problem, so this is not a Douban issue.
Opinions matter less than the facts and logic behind them.