9 Chrome can open Qunar's flight page, but Chrome launched by Python Selenium cannot 2

https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-21&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null

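The URL above opens fine when pasted into Chrome or into other browsers such as QQ Browser. Only in a Chrome instance launched by Selenium does it turn into the following page, which fails to display the flight information for 2021-10-21.

 

My first suspicion was that Selenium did not support my current Chrome version. (Update, 2021-10-17 10:47:22: I later downgraded Chrome to 92.x.x.107, a version Selenium supports, and downgraded chromedriver as well. It still did not work.)

The symptom: Chrome started directly can search Qunar and display next-day flights from Shanghai to Beijing, but Chrome launched by Selenium from a Python program simply cannot open the page. Copying the address out into any other browser, including a Chrome instance that is not controlled by Selenium, opens the page without problems.

The exact cause is still to be investigated.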

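Why use Selenium

First, the flight page is rendered dynamically by JavaScript from data fetched via XHR, so the raw page source contains no flight information at all. Only with Selenium is what you see what you can get.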
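To illustrate why the raw page source is not enough, here is a minimal sketch using a made-up HTML snippet. It is not Qunar's real markup, and the `flight-list` id is hypothetical. The container that would hold the flights stays empty until a script fills it, so parsing the static source finds nothing.

```python
from html.parser import HTMLParser

# Hypothetical page source: the list container is empty until a script
# fetches the data via XHR and renders it.
STATIC_HTML = """
<html><body>
<div id="flight-list"></div>
<script>/* fetches flight data via XHR and renders it into #flight-list */</script>
</body></html>
"""

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ContainerText(HTMLParser):
    """Collects the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def text_in(html, target_id):
    """Return the text chunks inside the element whose id is target_id."""
    parser = ContainerText(target_id)
    parser.feed(html)
    return parser.chunks


print(text_in(STATIC_HTML, "flight-list"))  # → []
```

A browser driven by Selenium runs the script first, which is why `driver.page_source` can contain rows that a plain HTTP fetch of the same URL does not.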

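Some will say: then just simulate the XHR request.

I simulated the XHR request in Postman.

 

 The request did succeed, but the response contained no data.

The response to the same request in the browser, however, does contain flight information.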

 

 

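Update:

By completing the headers, I later found that the request does return data. The completed headers are shown below.

Below are the GET parameters that are passed, i.e. the ones appended directly to the URL.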
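As a small illustration of "GET parameters appended to the URL", the sketch below decodes a query string with the standard library and rebuilds the URL with one value changed. The actual XHR endpoint and its parameters are not shown in this post, so the search-page URL from the top of the post is used as a stand-in; `query_params` and `with_params` are names made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# The search-page URL quoted at the top of this post (stand-in for the XHR URL).
SEARCH_URL = (
    "https://flight.qunar.com/site/oneway_list.htm"
    "?searchDepartureAirport=%E4%B8%8A%E6%B5%B7"
    "&searchArrivalAirport=%E5%8C%97%E4%BA%AC"
    "&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-21"
    "&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS"
    "&from=qunarindex&lowestPrice=null"
)


def query_params(url):
    """Decode the query string into a plain dict (first value per key)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


def with_params(url, **changes):
    """Return the same URL with some query parameters replaced or added."""
    parts = urlsplit(url)
    params = query_params(url)
    params.update(changes)
    return urlunsplit(parts._replace(query=urlencode(params)))


params = query_params(SEARCH_URL)
print(params["fromCode"], params["toCode"])  # → SHA BJS
print(with_params(SEARCH_URL, searchDepartureTime="2021-10-19"))
```

Keeping the parameters in a dict rather than in a hand-edited string makes it easier to change the date or the airports between requests.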

 

 

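The returned data is:

 It also includes the data for the later pages, that is, the data the page only displays after clicking page 2 or "next page".

Also, once a day has passed, I have to reopen the browser and update the pre parameter in the headers. pre appears to be some kind of verification mechanism: if it is not updated, the earlier request can no longer fetch flight data a day later.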
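A minimal sketch of how such a request could be assembled, assuming the values are copied by hand from the browser's developer tools. The endpoint URL, the placeholder values and the function name `build_xhr_request` are all hypothetical; only the existence of a `pre` header comes from the observation above. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_xhr_request(api_url, params, pre_token, cookie, user_agent):
    """Build (but do not send) a GET request carrying browser-like headers."""
    headers = {
        "User-Agent": user_agent,
        "Cookie": cookie,
        # The value that has to be refreshed roughly once a day.
        "pre": pre_token,
    }
    return Request(api_url + "?" + urlencode(params), headers=headers, method="GET")


req = build_xhr_request(
    api_url="https://example.invalid/flight-search",  # hypothetical endpoint
    params={"fromCode": "SHA", "toCode": "BJS"},
    pre_token="PASTE-FROM-DEVTOOLS",                  # placeholder
    cookie="PASTE-FROM-DEVTOOLS",                     # placeholder
    user_agent="Mozilla/5.0",
)
print(req.full_url)  # → https://example.invalid/flight-search?fromCode=SHA&toCode=BJS
# urllib stores header names capitalized ("pre" becomes "Pre");
# HTTP header names are case-insensitive, so this does not matter on the wire.
print(req.get_header("Pre"))  # → PASTE-FROM-DEVTOOLS
```

The standard library is used here only to keep the sketch free of dependencies; the same headers dict can be passed to `requests.get(url, headers=...)`. Keeping `pre_token` as a separate argument makes the daily refresh a one-line change.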

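So can Selenium, where what you see is what you can get, retrieve the data that the JavaScript-rendered page only shows on the next page?

 Not only Chrome: Firefox runs into the same problem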

import time
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import requests,json
options = Options()
# options.binary_location = "C:\\Users\\xiaojie\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe"
options.binary_location = "C:\\Program Files\\Mozilla Firefox\\firefox.exe"
binary= FirefoxBinary("C:\\Program Files\\Mozilla Firefox\\firefox.exe")
caps = DesiredCapabilities.FIREFOX.copy()
caps['marionette'] = True
# options.add_experimental_option('excludeSwitches', ['enable-automation'])
# options.add_argument('--incognito')
# options.add_argument('disable-infobars')
# options.add_argument('log-level=3')
driver =Firefox(firefox_binary=binary,capabilities=caps, executable_path="geckodriver.exe")
url="https://www.qunar.com/"
# url = "https://diannao.jd.com/"

driver.get(url)
time.sleep(1)

url="https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
# Search for flights
# Two approaches
driver.get(url)
with open('items.jl','w',encoding='UTF-8') as file:
    file.write(driver.page_source)


url = "https://diannao.jd.com/"
# Set the request headers
header={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
    ,"Referer":"http://m.611.com/Match/Index"
    ,"Connection":"keep-alive"
    ,"Cookie":"__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"

    }
#response = requests.get("http://m.611.com/Match/Index",headers=header)
response = requests.get(url,headers=header)
if response.status_code == 200:
    print(response.text)
    with open('items2.jl','w',encoding='UTF-8') as file:
        file.write(response.text)
    # data = json.loads(response.text)
    # token = data["Data"]
time.sleep(4)
# Product information
item__info=driver.find_elements_by_class_name('goods-item')
for item in item__info:
    name=item.find_element_by_class_name('goods-item__info').text
    price=item.find_element_by_class_name('goods-item__price').text
    print(name)
    print(price)
    print("-----------")
driver.delete_all_cookies()
# driver.quit()

Launching Firefox through Selenium to access the flight information makes the page hang in the same way.

 

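 The likely cause is that the Qunar flight site has anti-scraping measures aimed at WebDriver-based automation tools, which block the crawl. The same link can be scraped when Firefox or Chrome is started directly; only a browser launched through Selenium cannot load the page.

 Experiments finally confirmed that the cause is anti-scraping protection aimed at Selenium.

The approach suggested online, setting up a proxy with mitmproxy, is fairly complicated.

In the end I added two arguments that hide Selenium's fingerprints, after which Selenium could scrape the page normally.

The complete code is as follows: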

import time
# from selenium.webdriver import Firefox
# from selenium.webdriver.firefox.options import Options

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

# from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
# from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import requests,json
options = Options()
options.binary_location = "C:\\Users\\xiaojie\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe"
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_argument('--incognito')
options.add_argument('disable-infobars')
options.add_argument('log-level=3')
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")

# options.binary_location = "C:\\Program Files\\Mozilla Firefox\\firefox.exe"
# binary= FirefoxBinary("C:\\Program Files\\Mozilla Firefox\\firefox.exe")
# caps = DesiredCapabilities.FIREFOX.copy()
# caps['marionette'] = True

# driver =Firefox(firefox_binary=binary,capabilities=caps, executable_path="geckodriver.exe")
driver =Chrome(options=options,executable_path="D:\\webdriver\\chromedriver_win32\\chromedriver.exe")
# script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
# driver.execute_script(script)
url="https://www.qunar.com/"
# url = "https://diannao.jd.com/"

driver.get(url)
time.sleep(1)

url="https://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=%E4%B8%8A%E6%B5%B7&searchArrivalAirport=%E5%8C%97%E4%BA%AC&searchDepartureTime=2021-10-18&searchArrivalTime=2021-10-22&nextNDays=0&startSearch=true&fromCode=SHA&toCode=BJS&from=qunarindex&lowestPrice=null"
# Search for flights
# Two approaches
# script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
# driver.execute_script(script)
driver.get(url) 
# script = 'Object.defineProperty(navigator,"webdriver",{get:()=>false,});'
# driver.execute_script(script)
with open('items.jl','w',encoding='UTF-8') as file:
    file.write(driver.page_source)


url = "https://diannao.jd.com/"
# Set the request headers
header={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
    ,"Referer":"http://m.611.com/Match/Index"
    ,"Connection":"keep-alive"
    ,"Cookie":"__jdu=1961246440; shshshfpa=9d816ce7-9076-04a7-ea7d-8018bc6eadf2-1623898299; shshshfpb=sdus9TfEminz9dWA7KKGYgw%3D%3D; pinId=pjKhXUE59i7LjgxtUlkd_A; pin=zyj183247166; unick=jdzyj183; _tp=LdLet8T0koyg96E1dqQafA%3D%3D; _pst=zyj183247166; areaId=2; TrackID=1CmX15GEs1MOTf99XeZl5eebqfftNFiLEiMsS8vvKBTMwCAfRlKqXu7YCjQn__C2-mqlg-FJxPlEwiA79snSf04SU1xTtsZOoj5aQk_Cb5mu1XN52nGptNsMI-kJjYqCV; user-key=03e76034-3bcd-4674-a834-cf34a4b960a6; ipLocation=%u4e0a%u6d77; cn=76; ipLoc-djd=2-2824-61056-0.3405425761; unpl=V2_ZzNtbUVQQhV1DhJXLxhfV2IFFV0RAkoSJltEAHsRWwc1AUdbclRCFnUURlVnGVsUZgsZXkJcQxFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsRWwVkBhVYR1ZzJXI4dmR%2fHV4BbwciXHJWc1chVEBTeRBdByoDGlpCVEYScA1HZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_660117e2e02c4761bd86bb3e1963c3d7|1634347276899; PCSYCityID=CN_310000_310100_310101; shshshfp=26c4ea20d915727cb0bd98f196346b4d; __jdc=122270672; wlfstk_smdl=7bqz560g1rixn5kg0i84v5y6xvvrqkhx; __jda=122270672.1961246440.1623769658.1634347277.1634382193.23; __jdb=122270672.1.1961246440|23.1634382193; o2-webp=true; 3AB9D23F7A4B3C9B=TJ6FIT6PTK3N32QTINQHHUBRA4J4MDPCZWEHCIIXVS6J5H3LSD75C3RMTC2RIBLHQDLJOCOMXWMJ2LSD6IEMJMV66M"

    }
#response = requests.get("http://m.611.com/Match/Index",headers=header)
response = requests.get(url,headers=header)
if response.status_code == 200:
    print(response.text)
    with open('items2.jl','w',encoding='UTF-8') as file:
        file.write(response.text)
    # data = json.loads(response.text)
    # token = data["Data"]
time.sleep(4)
# Product information
item__info=driver.find_elements_by_class_name('goods-item')
for item in item__info:
    name=item.find_element_by_class_name('goods-item__info').text
    price=item.find_element_by_class_name('goods-item__price').text
    print(name)
    print(price)
    print("-----------")
driver.delete_all_cookies()
# driver.quit()

Two lines were added to it:

options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
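One way to check whether such switches take effect is to ask the page what it sees in `navigator.webdriver`. The sketch below is only an illustration: `build_options` and `looks_automated` are names made up here, and it assumes that the `=AutomationControlled` form of the argument is the one doing the work. I would expect the check to return a falsy value with the switches and `True` without them, but that may vary across Chrome versions.

```python
# Switches taken from the working script above.
EXCLUDED_SWITCHES = ["enable-automation"]
EXTRA_ARGS = ["--disable-blink-features=AutomationControlled"]


def build_options():
    """Return Chrome options with the automation hints switched off."""
    # Imported here so looks_automated() can be used without Selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_experimental_option("excludeSwitches", EXCLUDED_SWITCHES)
    for arg in EXTRA_ARGS:
        options.add_argument(arg)
    return options


def looks_automated(driver):
    """True if the page sees navigator.webdriver as a truthy value."""
    return bool(driver.execute_script("return navigator.webdriver"))
```

Usage would look like `driver = Chrome(options=build_options())`, followed by `driver.get(url)` and `print(looks_automated(driver))`. Running the same check once with and once without the extra arguments shows whether they change what the site can detect.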

 

 

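I also found that Selenium can only scrape the data displayed on the current page; unlike the simulated XHR request described earlier, it cannot return all the flight data at once.

As for the text obfuscation the site applies, that has to be solved by other means.

That said, simulating the XHR request and working with the data returned straight from the server, before it is rendered into the page, is the best way to deal with the text obfuscation.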

 

 

 

 


posted @ 2021-10-16 20:03  秦皇汉武