Scraping Soul web page content with Selenium

1. Install Python.

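2. Install Selenium with pip by running pip install selenium. To check the installed version, run pip show selenium. The script in step 4 also imports lxml, so install it the same way with pip install lxml.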
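
The same version check can also be done from Python. This is only a minimal sketch: installed_version is a hypothetical helper name, not part of Selenium, and it relies on the standard library module importlib.metadata (Python 3.8 or later).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # selenium drives the browser; lxml is imported by the script in step 4.
    for name in ("selenium", "lxml"):
        print(name, installed_version(name) or "not installed")
```

If either package is reported as not installed, install it with pip before continuing.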

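3. Disable automatic updates in Google Chrome, then download the Chrome browser driver (ChromeDriver) from http://chromedriver.storage.googleapis.com/index.html. Note that the driver version must match the browser version.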

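Extract the driver into the Python directory. The driver needs to be somewhere on PATH so that webdriver.Chrome() can find it.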
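
Before running the script, it may help to confirm that the driver is discoverable and that the versions line up. The sketch below is illustrative only: major_version, versions_match and find_driver are hypothetical helpers, the version strings in the usage lines are made-up examples, and comparing only the major version number is an assumption about what "matching" means in practice.

```python
import shutil


def major_version(version_string):
    """Return the leading (major) component of a dotted version string."""
    return version_string.strip().split(".")[0]


def versions_match(browser_version, driver_version):
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    # Example version strings (assumed); read the real ones from
    # chrome://version and from running chromedriver --version.
    print(versions_match("103.0.5060.134", "103.0.5060.53"))
    print(find_driver() or "chromedriver was not found on PATH")
```

If find_driver() returns None, the driver was probably extracted to a directory that is not on PATH.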

4. Create a new Python project with the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import lxml.html
import re

driver = webdriver.Chrome()
driver.get('https://cftweb.3g.qq.com/privacy/agreement?appid=42357646')
time.sleep(2)

#driver.switch_to.frame('iframeResult')  # a string argument must be the iframe's id or name
iframe = driver.find_element(By.TAG_NAME, 'iframe')  # use this approach when the iframe has no id or name
driver.switch_to.frame(iframe)
text = driver.find_element(By.XPATH,'/html/body/p[2]').text
print(text)
time.sleep(5)

# Get the page source
html_source = driver.page_source
html = lxml.html.fromstring(html_source)

# Get all text nodes under the <p> tags
items = html.xpath("/html/body/p/text()")
# Regex: ^\s+ matches leading whitespace, \s+$ trailing whitespace, \n a newline
pattern = re.compile(r"^\s+|\s+$|\n")
 
clause_text = ""
for item in items:
    # Replace the matched text with an empty string, i.e. strip it and keep only the text
    line = re.sub(pattern, "", item)
    if len(line) > 0:
        clause_text += line + "\n"

print(clause_text)
time.sleep(5)
driver.quit()
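
The text-cleaning step at the end of the script can be tried on its own, without a browser. The following is a minimal sketch using the same regular expression. clean_lines and the sample list are hypothetical, invented for illustration. Unlike the script, it joins the lines with "\n" instead of appending "\n" after each one.

```python
import re

# Same pattern as in the script, written as a raw string.
PATTERN = re.compile(r"^\s+|\s+$|\n")


def clean_lines(items):
    """Strip leading/trailing whitespace and newlines; drop empty results."""
    cleaned = []
    for item in items:
        line = PATTERN.sub("", item)
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Made-up sample standing in for the text nodes returned by html.xpath(...).
sample = ["  First clause \n", "\n   ", "Second\nclause"]
print(clean_lines(sample))
```

Note that a newline in the middle of a text node is removed without inserting a space, so "Second\nclause" becomes "Secondclause". If that is not wanted, replacing the interior newline with a space is one possible adjustment.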

posted @ 2022-07-19 15:20 最萌小胡胡
