Python Automation 1: Getting Information from the Web

1. webbrowser: part of the Python standard library; opens a browser at a given page. (open)

import webbrowser
webbrowser.open('URL')    # open URL in the default browser

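2. requests: downloads files and web pages from the Internet. (get status_code text raise_for_status iter_content)

import requests
res = requests.get('URL')   # fetch a web page or file
res.status_code             # HTTP status code
res.text                    # the downloaded HTML as a string
res.raise_for_status()      # checks whether the download succeeded: raises an exception on failure, does nothing on success
res.iter_content(num)       # yields one piece of the content per loop iteration; each piece is a bytes value, and num sets how many bytes a piece holds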
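The calls above can be combined into a small download helper. The following is a minimal sketch rather than the author's code: the names save_response and download, the 100,000-byte chunk size, and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def save_response(res, dest_path, chunk_size=100_000):
    """Write a response body to dest_path in chunks; return the byte count."""
    res.raise_for_status()  # stop early if the server reported an error
    written = 0
    with open(dest_path, 'wb') as f:  # binary mode: iter_content yields bytes
        for chunk in res.iter_content(chunk_size):
            written += f.write(chunk)
    return written


def download(url, dest_path):
    """Fetch url and stream it to dest_path (hypothetical helper)."""
    # stream=True defers the body download until iter_content is consumed
    with requests.get(url, stream=True, timeout=10) as res:
        return save_response(res, dest_path)
```

Writing in chunks keeps memory use flat even for large files, which is why iter_content takes a size instead of returning the whole body at once.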

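3. Beautiful Soup: parses HTML. (bs4.BeautifulSoup select getText attrs get)

Parsing HTML directly with regular expressions is a bad idea.
In Edge, right-click -> View page source shows the complete HTML source; right-click a specific element -> Inspect shows the HTML for that element. (Use this to study the features of the HTML you want to parse.)

pip install beautifulsoup4
import bs4
***Soup = bs4.BeautifulSoup(HTML text, e.g. requests.get(URL).text, or the File object of an opened HTML file)   # *** stands for any name prefix, e.g. testSoup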

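The select method

elems = ***Soup.select('div')                    # all elements named <div>
elems = ***Soup.select('#author')                # the element whose id attribute is author
elems = ***Soup.select('.notice')                # all elements whose CSS class attribute is notice
elems = ***Soup.select('div span')               # all <span> elements anywhere inside a <div> element
elems = ***Soup.select('div > span')             # all <span> elements directly inside a <div> element, with no other element in between
elems = ***Soup.select('input[name]')            # all <input> elements that have a name attribute with any value
elems = ***Soup.select('input[type="button"]')   # all <input> elements whose type attribute has the value button
elems = ***Soup.select('p #author')              # the element whose id attribute is author, provided it is inside a <p> element
# Summary (the following can be combined freely)
# 1. > means direct nesting; a space means nesting at any depth
# 2. # is followed by an id value and . by a class name; a bare name is a tag name
# 3. [] holds an attribute of the tag and, optionally, its value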
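'''select() returns a list of Tag objects, one for each match in the BeautifulSoup object's HTML. A Tag can be passed to str() to show the HTML it represents. A Tag also has an attrs attribute, which holds all of the tag's HTML attributes as a dictionary.'''
elems = ***Soup.select('...')
elems[0]                                  # the HTML element (element = opening tag + content + closing tag)
elems[0].getText()                        # the text inside the element
elems[0].attrs                            # dictionary of the attributes in the element's tag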

'''An example; enter it in the interactive Python shell'''
a = '''
<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
'''
testSoup = bs4.BeautifulSoup(a, 'html.parser')   # name the parser explicitly to avoid a warning
elems = testSoup.select('#author')
elems                  # [<span id="author">Al Sweigart</span>]
len(elems)             # 1
elems[0]               # <span id="author">Al Sweigart</span>
elems[0].getText()     # 'Al Sweigart'
elems[0].attrs         # {'id': 'author'}

The get method

'''get() reads an attribute value from an element. Pass it the attribute name as a string and it returns that attribute's value.'''
a = '''
<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
'''
testSoup = bs4.BeautifulSoup(a, 'html.parser')
elem = testSoup.select('span')[0]
elem.get('id')         # 'author'
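The selector rules and get() can be exercised together on the same sample HTML. A short sketch follows; the variable names are illustrative, and 'html.parser' is simply the parser bundled with Python, chosen here so that nothing beyond beautifulsoup4 needs installing.

```python
import bs4

HTML = '''
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from
<a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
'''

soup = bs4.BeautifulSoup(HTML, 'html.parser')

paragraphs = soup.select('p')             # tag name: every <p>
slogan = soup.select('.slogan')[0]        # class selector
author = soup.select('p > #author')[0]    # id, directly inside a <p>
link = soup.select('a[href]')[0]          # <a> that has an href attribute

print(len(paragraphs))      # 3
print(slogan.getText())     # Learn Python the easy way!
print(author.attrs)         # {'id': 'author'}
print(link.get('href'))     # http://inventwithpython.com
print(link.get('title'))    # None: get() returns None for a missing attribute
```

Multi-valued attributes such as class come back as a list, so slogan.attrs is {'class': ['slogan']}.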

An automation tool: parse the command-line argument, search with the Bing search engine, and open the first ten links

#! python3
# lucky.py - open several Bing search results

import sys
import requests
import bs4
import webbrowser
import logging
logging.basicConfig(level=logging.DEBUG, format=' %(asctime)s - %(levelname)s - %(message)s')
logging.disable(logging.DEBUG)

# get the search terms from the command line
arg = ' '.join(sys.argv[1:])

# get the Bing search results
res = requests.get('https://www.bing.com/search', params={'q': arg})   # params URL-encodes the query
res.raise_for_status()
logging.info('======')
logging.info(res.url)
logging.info('======' + '\n')
logging.info('======')
logging.info(res)
logging.info(res.text)
logging.info('======' + '\n')

# parse the html text
searchSoup = bs4.BeautifulSoup(res.text, 'html.parser')
logging.info('======')
logging.info(searchSoup)
logging.info('======' + '\n')

# get the <a> tags of the result links
elems = searchSoup.select('h2 > a')
logging.info('======')
logging.info([elem.get('href') for elem in elems[:10]])
logging.info('======' + '\n')

# open the URLs (at most ten; slicing avoids an IndexError when there are fewer results)
for elem in elems[:10]:
    webbrowser.open(elem.get('href'))
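Search terms typed on the command line may contain spaces or characters such as & and +, which must be percent-encoded before they go into a URL. The sketch below uses the standard library's urllib.parse.urlencode to build the Bing URL; the function name bing_search_url is hypothetical.

```python
from urllib.parse import urlencode


def bing_search_url(terms):
    """Build a Bing search URL from a list of command-line words."""
    query = ' '.join(terms)
    # urlencode escapes reserved characters and turns spaces into '+'
    return 'https://www.bing.com/search?' + urlencode({'q': query})


print(bing_search_url(['python', 'web', 'scraping']))
# https://www.bing.com/search?q=python+web+scraping
print(bing_search_url(['C++', '&', 'Rust']))
# https://www.bing.com/search?q=C%2B%2B+%26+Rust
```

A URL built this way can be passed to webbrowser.open() or requests.get() unchanged.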

An automation tool: download XKCD comics automatically

#! python3
# downloadXkcd - download images from the xkcd site

import requests
import bs4
import logging
import os
from urllib.parse import urljoin

# init logging, make the directory that stores the images, and init the index that numbers them
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.disable(logging.WARNING)   # silence WARNING and below; only ERROR messages are shown
os.makedirs('xkcd', exist_ok=True)
index = 0
url = 'https://xkcd.com/'

while not url.endswith('#') and index <= 10:
    # get the html from the xkcd site and parse it
    res = requests.get(url)
    res.raise_for_status()
    logging.info(res.text)
    xkcdSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    logging.info(xkcdSoup)

    # find the image urls and download the images
    imgElems = xkcdSoup.select('#comic img')
    logging.warning(imgElems)
    for img in imgElems:
        src = img.get('src')
        logging.warning(src)
        # src is typically protocol-relative (//imgs.xkcd.com/...), so resolve it against the page url instead of concatenating
        imgRes = requests.get(urljoin(url, src))
        imgRes.raise_for_status()
        logging.warning(imgRes)
        # save the image to ./xkcd
        with open(os.path.join('xkcd', f'{index}.jpg'), 'wb') as f:
            print(f'\033[31m downloading {index}.jpg \033[0m')
            for chunk in imgRes.iter_content(100000):   # 100,000 bytes per chunk instead of the 1-byte default
                f.write(chunk)
        index += 1

    # get the previous page's URL (a prev link of '#' ends the loop)
    prevHref = xkcdSoup.select('ul[class=comicNav] a[rel=prev]')[0].get('href')
    url = 'https://xkcd.com' + prevHref
    logging.error(prevHref)
    logging.error(url)
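Image and navigation links in a page are often relative, so they have to be resolved against the page URL before they can be fetched. A minimal sketch with the standard library's urljoin follows; the sample page URL, src, and href values are assumptions made for illustration, not values read from xkcd.

```python
from urllib.parse import urljoin

page_url = 'https://xkcd.com/2530/'   # hypothetical comic page

# protocol-relative: keeps the page's scheme, replaces host and path
print(urljoin(page_url, '//imgs.xkcd.com/comics/example.png'))
# https://imgs.xkcd.com/comics/example.png

# root-relative: keeps scheme and host, replaces the path
print(urljoin(page_url, '/2529/'))
# https://xkcd.com/2529/

# plain concatenation yields a URL on the wrong host
print(page_url + '//imgs.xkcd.com/comics/example.png')
# https://xkcd.com/2530///imgs.xkcd.com/comics/example.png
```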

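4. selenium: launches and controls a web browser. (filling in forms, logging in, simulating mouse clicks, etc.) (webdriver EdgeOptions Edge get find_element_* find_elements_*)

The selenium module lets Python control the browser directly: it actually clicks links and fills in login information, almost as if a human user were interacting with the page. Compared with Requests and Beautiful Soup, Selenium lets you interact with web pages in a far more capable way. But because it launches a web browser, it is somewhat slow and hard to run in the background if all you want is to download a few files from the web.
selenium + the Edge browser (check what name webdriver.py gives the browser driver)
selenium can do far more than what is described below. It can modify the browser's cookies, take screenshots of pages, and run custom JavaScript. For more on these features, see the documentation: https://selenium-python.readthedocs.io/
Related: Python + Selenium + Microsoft Edge: setting up the runtime environment and configuring headless mode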

Launching the browser

from selenium import webdriver
browser = webdriver.Edge()
browser.get('URL')

'''If you get the error USB: usb_device_handle_win.cc:1048 Failed to read descriptor from node connection'''
from selenium import webdriver
options = webdriver.EdgeOptions()
# work around SSL certificate errors
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# suppress useless log output
options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
browser = webdriver.Edge(options=options)
browser.get('URL')

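Finding elements

WebDriver objects have quite a few methods for finding elements on a page. They fall into find_element_* and find_elements_* methods. A find_element_* method returns a single WebElement object representing the first element on the page that matches the query. A find_elements_* method returns a list of WebElement objects, one for every matching element on the page. These helper methods were deprecated in Selenium 4.0 and removed in 4.3; on newer versions call find_element(By.CLASS_NAME, name) and the like, with By imported from selenium.webdriver.common.by.

browser.find_element_by_class_name(name)                          elements that use the CSS class name
browser.find_elements_by_class_name(name)
browser.find_element_by_css_selector(selector)                    elements that match the CSS selector
browser.find_elements_by_css_selector(selector)
browser.find_element_by_id(id)                                    elements whose id attribute matches
browser.find_elements_by_id(id)
browser.find_element_by_link_text(text)                           <a> elements whose text fully matches the given text
browser.find_elements_by_link_text(text)
browser.find_element_by_partial_link_text(text)                   <a> elements whose text contains the given text
browser.find_elements_by_partial_link_text(text)
browser.find_element_by_name(name)                                elements whose name attribute matches
browser.find_elements_by_name(name)
browser.find_element_by_tag_name(name)                            elements whose tag name matches
browser.find_elements_by_tag_name(name)                           (case-insensitive; an <a> element is matched by 'a' and 'A')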
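'''WebElement attributes and methods'''
tag_name                     the tag name, such as 'a' for an <a> element
get_attribute(name)          the value of the element's attribute called name
text                         the text within the element, such as 'hello' in <span>hello</span>
clear()                      for text field or text area elements, clears the text typed into it
is_displayed()               returns True if the element is visible; otherwise returns False
is_enabled()                 for input elements, returns True if the element is enabled; otherwise returns False
is_selected()                for checkbox or radio button elements, returns True if the element is selected; otherwise returns False
location                     a dictionary with keys 'x' and 'y' for the position of the element in the page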
'''An example'''
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
options = webdriver.EdgeOptions()
# work around SSL certificate errors
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# suppress useless log output
options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
browser = webdriver.Edge(options=options)
browser.get('http://inventwithpython.com')
try:
    elem = browser.find_element_by_class_name('bookcover')
    print('Found <%s> element with that class name!' % (elem.tag_name))
except NoSuchElementException:   # raised when no element matches
    print('Was not able to find an element with that name.')

Clicking the page

The WebElement objects returned by the find_element_* and find_elements_* methods have a click() method that simulates a mouse click on that element. This method can be used to follow a link, select a radio button, click a Submit button, or trigger whatever else happens when the element is clicked with the mouse.

'''An example'''
from selenium import webdriver
options = webdriver.EdgeOptions()
# work around SSL certificate errors
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# suppress useless log output
options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
browser = webdriver.Edge(options=options)
browser.get('http://inventwithpython.com')
linkElem = browser.find_element_by_link_text('Read It Online')
linkElem.click()

Filling in and submitting forms

To send keystrokes to a text field on a web page, find the <input> or <textarea> element for that text field and then call its send_keys() method.

'''An example'''
from selenium import webdriver
options = webdriver.EdgeOptions()
# work around SSL certificate errors
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# suppress useless log output
options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
browser = webdriver.Edge(options=options)
browser.get('http://gmail.com')
emailElem = browser.find_element_by_id('Email')
emailElem.send_keys('not_my_real_email@gmail.com')
passwordElem = browser.find_element_by_id('Passwd')
passwordElem.send_keys('12345')
passwordElem.submit()

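Sending special keys

selenium has a module for keyboard keys that cannot be typed as string values; they work much like escape characters. The values are stored as attributes of the selenium.webdriver.common.keys module. Because that module name is so long, it is much easier to run from selenium.webdriver.common.keys import Keys at the top of the program. If you do, you can write just Keys wherever you would otherwise have to write selenium.webdriver.common.keys.

Keys.DOWN, Keys.UP, Keys.LEFT, Keys.RIGHT           the keyboard arrow keys
Keys.ENTER, Keys.RETURN                             the Enter and Return keys
Keys.HOME, Keys.END, Keys.PAGE_DOWN, Keys.PAGE_UP   the Home, End, Page Down and Page Up keys
Keys.ESCAPE, Keys.BACK_SPACE, Keys.DELETE           the Esc, Backspace and Delete keys
Keys.F1, Keys.F2, . . . , Keys.F12                  the F1 to F12 keys at the top of the keyboard
Keys.TAB                                            the Tab key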
'''An example'''
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
options = webdriver.EdgeOptions()
# work around SSL certificate errors
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# suppress useless log output
options.add_experimental_option("excludeSwitches", ['enable-automation', 'enable-logging'])
browser = webdriver.Edge(options=options)
browser.get('http://nostarch.com')
htmlElem = browser.find_element_by_tag_name('html')
htmlElem.send_keys(Keys.END) # scrolls to bottom
htmlElem.send_keys(Keys.HOME) # scrolls to top

Clicking browser buttons

browser.back() clicks the Back button.
browser.forward() clicks the Forward button.
browser.refresh() clicks the Refresh button.
browser.quit() clicks the Close Window button.

Playing 2048 automatically

#! python3
# 2048player - an automatic 2048 game player; if it loses it tries again. Maybe you can train an agent to get a higher score~

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# open the address of the 2048 game
browser = webdriver.Edge()
browser.get('https://gabrielecirulli.github.io/2048/')
htmlElem = browser.find_element_by_tag_name('html')

# candidate actions
candidateOperation = [Keys.DOWN, Keys.UP, Keys.LEFT, Keys.RIGHT]

for i in range(10000):
    # get the board condition
    try:
        boardCondition = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
        board = browser.find_element_by_css_selector('.tile-container')
        chessNum = board.text.split('\n')       # the numbers on all tiles
        for chess in range(len(chessNum)):
            # get the locations one by one: the third class token ends in <column>-<row>
            chessLocation = browser.find_element_by_xpath(f"/html/body/div[1]/div[3]/div[3]/div[{chess+1}]").get_attribute("class").split()[2].split("-")[2:]
            boardCondition[int(chessLocation[1])-1][int(chessLocation[0])-1] = int(chessNum[chess])
    except Exception:   # the page may be mid-animation; skip this round
        continue
    print(f'\033[32m {boardCondition} \033[0m')
    # get the score
    score = browser.find_element_by_css_selector('.score-container')
    print(f'\033[31m score:{score.text} \033[0m')
    # play after decision making
    '''decision making'''
    operation = candidateOperation[i%4]
    htmlElem.send_keys(operation)
    # retry after losing
    try:
        browser.find_element_by_class_name('retry-button').click()
    except Exception:   # no clickable retry button means the game is still running
        pass
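The board-reading step depends on the position of each class token and on the tile order matching the text order. A sturdier way to read one tile is to parse its class attribute by prefix. This is a sketch under the assumption, taken from how the script above indexes the tokens, that a tile's class looks like 'tile tile-2 tile-position-1-3'; the function names are hypothetical.

```python
def parse_tile_class(class_attr):
    """Return (value, (row, col)) for one tile, using 0-based indexes."""
    value = position = None
    for token in class_attr.split():
        if token.startswith('tile-position-'):
            # tile-position-<column>-<row>, both counted from 1
            col, row = token[len('tile-position-'):].split('-')
            position = (int(row) - 1, int(col) - 1)
        elif token.startswith('tile-') and token[len('tile-'):].isdigit():
            value = int(token[len('tile-'):])
    return value, position


def build_board(class_attrs, size=4):
    """Fill a size x size grid from the class attributes of all tiles."""
    board = [[0] * size for _ in range(size)]
    for class_attr in class_attrs:
        value, position = parse_tile_class(class_attr)
        if value is None or position is None:
            continue  # not a tile this sketch understands
        row, col = position
        # several tiles may report the same cell (for example right
        # after a merge); keep the largest value
        board[row][col] = max(board[row][col], value)
    return board


print(build_board(['tile tile-2 tile-position-1-3 tile-new',
                   'tile tile-4 tile-position-4-1']))
# [[0, 0, 0, 4], [0, 0, 0, 0], [2, 0, 0, 0], [0, 0, 0, 0]]
```

A board built this way could feed the decision-making step in place of the fixed i%4 rotation.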

Posted 2021-10-23 17:29 by tensor_zhang
