Python Web Scraping | Several Ways to Open a Web Page and Get Its Source Code
1. Open a page and get its source code
1.1 The urllib module
import urllib.request
# Download the raw bytes of the page, then decode them as UTF-8 text.
html = urllib.request.urlopen('https://movie.douban.com/subject/3168101/?from=showing').read()
html = html.decode('utf-8')
print(html)
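Some sites reject requests that carry urllib's default User-Agent, and a page is not always encoded as UTF-8. The following is a minimal sketch of one way to handle both, not a definitive implementation. The User-Agent string, the 10-second timeout, and the helper names `build_request` and `fetch_html` are assumptions made for illustration; none of them appear in the original snippet.

```python
import urllib.request

# Illustrative browser-like User-Agent; the exact string is an assumption.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html('https://movie.douban.com/subject/3168101/?from=showing')` would then return the decoded page text, provided the site accepts the request.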
1.2 The requests module
import requests
# .text returns the response body already decoded to a string.
html = requests.get('https://movie.douban.com/subject/3168101/?from=showing').text
print(html)
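The one-line requests call has no timeout and does not check the HTTP status. A hedged sketch of a slightly more defensive version follows. The header value, the timeout, and the helper name `fetch_html` are illustrative assumptions rather than part of the original.

```python
import requests

# Illustrative values; neither appears in the original post.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_html(url, timeout=10):
    """Fetch a page with requests and return its text, raising on HTTP errors."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    # When requests could not determine an encoding from the headers,
    # fall back to the one it detects from the body.
    if resp.encoding is None:
        resp.encoding = resp.apparent_encoding
    return resp.text
```

With this helper, a blocked or missing page surfaces as an exception instead of being printed as if it were the real page.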
2. Extract the information you need
2.1 Regular expressions with re
import re
# Capture the text of every <span class="short"> element (non-greedy match).
get_re = re.findall(r'<span class="short">(.*?)</span>', html)
print(get_re)
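By default `.` does not match a newline, so a comment that spans several lines would be skipped by the pattern above. The sketch below shows the effect of the `re.S` flag on a small, made-up HTML snippet; the sample markup and comment text are hypothetical and stand in for the downloaded page.

```python
import re

# Hypothetical snippet imitating the comment markup the pattern targets.
sample = '''
<p><span class="short">Great pacing.</span></p>
<p><span class="short">A bit long,
but worth it.</span></p>
'''

# Without re.S, "." stops at newlines, so the two-line comment is missed.
single_line_only = re.findall(r'<span class="short">(.*?)</span>', sample)

# re.S lets "." match newlines, so comments that span several lines are kept.
pattern = re.compile(r'<span class="short">(.*?)</span>', re.S)
comments = [c.strip() for c in pattern.findall(sample)]

print(single_line_only)
print(comments)
```

Regular expressions work for simple, stable markup like this; for nested or irregular HTML a parser is usually the safer choice.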
2.2 BeautifulSoup
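from bs4 import BeautifulSoup
# Parse the page with the lxml parser (requires the lxml package).
get = BeautifulSoup(html, 'lxml')
# find() returns the first match; find_all() returns every match as a list.
b = get.find(attrs={'class': "short"})
a = get.find_all(attrs={'class': "short"})
print(b)
print(a)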
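`find` and `find_all` return tag objects, so printing them shows the surrounding `<span>` markup as well. A minimal sketch of pulling out only the text is given below. It runs on a made-up HTML snippet and uses the standard-library `html.parser` backend instead of lxml; both choices are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
sample = """
<div id="hot-comments">
  <span class="short">Great pacing.</span>
  <span class="short">A bit long, but worth it.</span>
</div>
"""

# "html.parser" ships with Python, so lxml is not required here.
soup = BeautifulSoup(sample, "html.parser")

# class_ is the keyword bs4 uses because "class" is reserved in Python.
first = soup.find("span", class_="short")
spans = soup.find_all("span", class_="short")

# get_text(strip=True) drops the tag and trims surrounding whitespace.
comments = [span.get_text(strip=True) for span in spans]
print(first.get_text(strip=True))
print(comments)
```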
2.3 XPath
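from lxml import etree
# Parse the HTML string into an element tree that supports XPath queries.
html = etree.HTML(html)
# Select the comment text in the fifth <div> under the element with id "hot-comments".
get = html.xpath('//*[@id="hot-comments"]/div[5]/div/p/span/text()')
print(get)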
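The positional step `div[5]` selects a single comment and breaks if the page layout shifts. The sketch below contrasts a positional query with one that matches on the `class` attribute and so returns every comment. The sample markup, including the `comment-item` class name, is hypothetical and only imitates the nesting implied by the XPath above.

```python
from lxml import etree

# Hypothetical markup; the nesting mirrors the div/div/p/span path used above.
sample = """
<div id="hot-comments">
  <div class="comment-item"><div><p><span class="short">Great pacing.</span></p></div></div>
  <div class="comment-item"><div><p><span class="short">A bit long, but worth it.</span></p></div></div>
</div>
"""

tree = etree.HTML(sample)

# Positional query: only the comment inside the second child <div>.
second = tree.xpath('//*[@id="hot-comments"]/div[2]/div/p/span/text()')

# Attribute query: every <span class="short"> anywhere under #hot-comments.
comments = tree.xpath('//*[@id="hot-comments"]//span[@class="short"]/text()')

print(second)
print(comments)
```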