Scraping the data you want from a web page with Python

1. Source code

import urllib.request
from urllib.request import Request

from bs4 import BeautifulSoup

url = 'http://movie.douban.com/top250?format=text'
# Send a browser-like User-Agent; without it the request is rejected with HTTP 418 (see section 3).
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36'}
ret = Request(url, headers=headers)
page = urllib.request.urlopen(ret)
contents = page.read()
# print(contents)  # uncomment to inspect the raw HTML
soup = BeautifulSoup(contents, "html.parser")
print("Douban Movie Top 250" + "\n" + " Title              Rating       Number of ratings     Link ")
# Each movie entry sits in a <div class="info"> block.
for tag in soup.find_all('div', class_='info'):
    m_name = tag.find('span', class_='title').get_text()
    m_rating_score = float(tag.find('span', class_='rating_num').get_text())
    m_people = tag.find('div', class_="star")
    m_span = m_people.find_all('span')
    # The fourth <span> inside the star block holds the number of ratings.
    m_peoplecount = m_span[3].contents[0]
    m_url = tag.find('a').get('href')
    print(m_name + "        " + str(m_rating_score) + "           " + m_peoplecount + "    " + m_url)

2. Installing bs4

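Go to File - Settings - Python Project, search for bs4 (not "ps4"), and click Install. A message confirms that the installation succeeded once it finishes.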
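
To confirm that the package is actually visible to the interpreter the project uses, a small check like the one below may help. It is only a sketch: the helper name `is_installed` is made up for this example. Note that the package is imported as `bs4`, while its distribution name on PyPI is `beautifulsoup4`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report whether bs4 can be imported by the current interpreter.
status = "installed" if is_installed("bs4") else "missing"
print("bs4: " + status)
```

If this prints "missing" even though the IDE reported success, the script is probably running under a different interpreter than the one the package was installed into.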

 

3. urllib.error.HTTPError: HTTP Error 418

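The script has to mimic a browser; a plain request is blocked by the site. To find a usable value, open a browser, press F12, visit any website, select one of the requests, and open its Headers tab. Scroll down to user-agent, which identifies the browser that sent the request.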
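
The fix can be sketched as follows. This is a minimal illustration, not the only way to do it: the names `BROWSER_UA`, `build_request`, and `fetch` are made up for this example, and the User-Agent string is the one used in the source code of section 1. The `opener` parameter exists only so the function can be tried without touching the network.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# User-Agent copied from the browser's developer tools (value from section 1).
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36')


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that carries a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the server rejects the request."""
    try:
        with opener(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        # A 418 here usually means the server did not accept the request headers.
        print("Request failed with HTTP status " + str(err.code))
        return None
```

Calling `fetch('http://movie.douban.com/top250?format=text')` would then return the page bytes, which can be passed to `BeautifulSoup` as in section 1. If it still returns None, the User-Agent value is the first thing to re-check.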

posted @ 2022-04-07 16:39  一级退堂鼓表演艺术家