Python Web Scraping | Several Ways to Open a Page and Get Its Source Code

1. Opening a page and getting its source code

1.1 The urllib module

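import urllib.request
# urlopen() returns a response object; read() gives the body as raw bytes.
html = urllib.request.urlopen('https://movie.douban.com/subject/3168101/?from=showing').read()
# Decode the bytes into a str before printing or parsing.
html = html.decode('utf-8')
print(html)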
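
The one-liner above sends urllib's default User-Agent and assumes the page is UTF-8. Some sites may reject the default User-Agent, so a slightly more defensive version is sketched below. The `USER_AGENT` string and the helper names `build_request` and `fetch_html` are assumptions made for illustration, not part of the original post.

```python
import urllib.request

# Hypothetical User-Agent string, chosen for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler/0.1)"


def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download url and return the body decoded to a str."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header,
        # falling back to UTF-8 when the server does not declare one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html('https://movie.douban.com/subject/3168101/?from=showing')` would then return the page as a string, or raise `urllib.error.URLError` if the request fails.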

1.2 The requests module

import requests
# .text is the response body already decoded to a str.
html = requests.get('https://movie.douban.com/subject/3168101/?from=showing').text
print(html)
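
As with urllib, the short form has no timeout and does not check the HTTP status, so an error page would be returned as if it were the real content. A minimal sketch with those checks follows. The `USER_AGENT` value and the helpers `make_session` and `fetch_html` are hypothetical names introduced here.

```python
import requests

# Hypothetical User-Agent string, chosen for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler/0.1)"


def make_session():
    """Return a Session that sends the User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch url and return the body as text, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Reusing one session across several pages keeps the headers in one place and lets requests reuse the underlying connection.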

2. Extracting the information you need

2.1 Regular expressions (re)

import re
# Capture the text inside every <span class="short"> element.
get_re = re.findall(r'<span class="short">(.*?)</span>', html)
print(get_re)
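
One thing to watch with this pattern: by default `.` does not match a newline, so a comment that spans several lines is silently skipped. The sketch below runs the same pattern with the `re.S` flag on a small hand-written sample. `SAMPLE_HTML` is made-up markup for illustration, not the real Douban page.

```python
import re

# Made-up markup that imitates the structure of the comment section.
SAMPLE_HTML = """
<div id="hot-comments">
  <p><span class="short">Great film.</span></p>
  <p><span class="short">Too long,
but worth it.</span></p>
</div>
"""

# re.S lets "." match newlines, so multi-line comments are captured too.
pattern = re.compile(r'<span class="short">(.*?)</span>', re.S)
comments = [text.strip() for text in pattern.findall(SAMPLE_HTML)]
print(comments)

# Without re.S only the single-line comment matches.
single_line_only = re.findall(r'<span class="short">(.*?)</span>', SAMPLE_HTML)
print(len(single_line_only))
```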

2.2 BeautifulSoup

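from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# find() returns the first matching tag; find_all() returns a list of every match.
first = soup.find(attrs={'class': 'short'})
all_matches = soup.find_all(attrs={'class': 'short'})
print(first)
print(all_matches)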
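
The calls above print whole tags, markup included. To get only the comment text, `get_text()` can be called on each tag. The sketch below uses a hand-written `SAMPLE_HTML` (an assumption, not the real page) and the built-in `html.parser` so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure of the comment section.
SAMPLE_HTML = """
<div id="hot-comments">
  <p><span class="short">Great film.</span></p>
  <p><span class="short">Slow start, strong ending.</span></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute.
first = soup.find("span", class_="short")
all_spans = soup.find_all("span", class_="short")

# get_text(strip=True) drops the tags and surrounding whitespace.
texts = [span.get_text(strip=True) for span in all_spans]
print(first.get_text())
print(texts)
```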

2.3 XPath

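from lxml import etree
# Parse into a separate variable so the html string is not overwritten.
tree = etree.HTML(html)
# Text of the comment in the fifth div under the element with id "hot-comments".
result = tree.xpath('//*[@id="hot-comments"]/div[5]/div/p/span/text()')
print(result)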
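
The positional path above (`div[5]`) picks out a single comment and stops working if the page layout shifts. Selecting by attribute is usually less brittle. The sketch below shows that variant on a hand-written `SAMPLE_HTML`, which is an assumption for illustration and not the real page.

```python
from lxml import etree

# Made-up markup that imitates the structure of the comment section.
SAMPLE_HTML = """
<div id="hot-comments">
  <div><p><span class="short">Great film.</span></p></div>
  <div><p><span class="short">Slow start, strong ending.</span></p></div>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# Match every span with class "short" anywhere under #hot-comments,
# instead of addressing one comment by its position.
comments = tree.xpath('//div[@id="hot-comments"]//span[@class="short"]/text()')
print(comments)
```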

Posted 2018-11-29 22:25 by FalsePlus
