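Dynamic scraping examples
Before writing a crawler, we need to understand Ajax (asynchronous requests). Its value is that a page can update itself asynchronously while exchanging only a small amount of data with the backend.
There are two ways to scrape a dynamic page that loads its content through Ajax:
- Use the browser's Inspect Element tool to find the real request URL.
- Use Selenium to simulate a browser.
Scraping by resolving the real URL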
| import json |
| |
| import requests |
| |
| headers = { |
|     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' |
| } |
| # Request offsets 1 through 10 |
| for e in range(1, 11): |
|     # Build the URL from adjacent string literals so it contains no stray newlines |
|     link = ( |
|         "https://api-zero.livere.com/v1/comments/list?callback=jQuery112400364209957301318_1640670329077&limit=10&offset=" |
|         + str(e) |
|         + "&repSeq=4272904&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&code=&_=1640670329079" |
|     ) |
|     r = requests.get(link, headers=headers) |
|     json_string = r.text |
|     # The response is JSONP: keep only the JSON object inside the callback wrapper |
|     json_string = json_string[json_string.find('{'):json_string.rfind('}') + 1] |
|     json_data = json.loads(json_string) |
|     comment_list = json_data['results']['parents'] |
|     # Print each comment with a running number |
|     for i, eachone in enumerate(comment_list, start=1): |
|         message = eachone['content'] |
|         print(i) |
|         print(message) |
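The URL building and JSONP unwrapping above can also be written as two small helpers. This is only a sketch: the function names `build_comment_url` and `strip_jsonp` and the short callback name `cb` are hypothetical, the parameter values are copied from the example URL, and the `_` timestamp parameter is left out on the assumption that the server does not require it.

```python
import json
import re
from urllib.parse import urlencode

BASE_URL = "https://api-zero.livere.com/v1/comments/list"


def build_comment_url(offset, callback="cb"):
    """Build the comment-list URL for one offset (values copied from the example above)."""
    params = {
        "callback": callback,
        "limit": 10,
        "offset": offset,
        "repSeq": 4272904,
        "requestPath": "/v1/comments/list",
        "consumerSeq": 1020,
        "livereSeq": 28583,
        "smartloginSeq": 5154,
        "code": "",
    }
    # urlencode percent-encodes the values, e.g. "/" becomes "%2F"
    return BASE_URL + "?" + urlencode(params)


def strip_jsonp(text):
    """Parse the JSON payload wrapped in a JSONP callback such as cb({...});"""
    match = re.search(r"\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))
```

With these helpers, the loop body reduces to `strip_jsonp(requests.get(build_comment_url(e), headers=headers).text)`. Matching the outermost parentheses avoids depending on exactly how many characters follow the JSON object.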
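Scraping by simulating a browser with Selenium: the browser renders the dynamic page, so it can then be scraped like a static page
Installing Selenium
Install the package with pip install selenium. The following code then opens Firefox and loads the target page: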
from selenium import webdriver
fp = webdriver.FirefoxOptions()
# Firefox preference intended to block stylesheet loading
fp.set_preference("permissions.default.stylesheet", 2)
# Open the browser
driver = webdriver.Firefox(options=fp)
# Load the target URL
driver.get("http://www.santostang.com/2018/07/04/hello-world/")
A browser driver also has to be downloaded:
chromedriver download: http://chromedriver.storage.googleapis.com/index.html
geckodriver download: https://github.com/mozilla/geckodriver/releases
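After downloading a driver, it can help to confirm that its executable is discoverable before starting Selenium. The sketch below uses only the standard library; the helper name `find_driver` is hypothetical, and it assumes the driver was placed in a directory listed in the PATH environment variable.

```python
import shutil


def find_driver(names=("chromedriver", "geckodriver")):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_driver().items():
        print(name, "->", path or "not found on PATH")
```

If a driver is reported as not found, either add its directory to PATH or pass its full path to Selenium explicitly.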
Fetching all comments of a post with Selenium
| import time |
| |
| from selenium import webdriver |
| from selenium.webdriver.common.by import By |
| |
| options = webdriver.ChromeOptions() |
| driver = webdriver.Chrome(options=options) |
| driver.get("http://www.santostang.com/2018/07/04/hello-world/") |
| |
| # Scroll to the bottom of the page, then switch into the comment iframe |
| driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") |
| driver.switch_to.frame(driver.find_element(By.CSS_SELECTOR, "iframe[title='livere-comment']")) |
| time.sleep(1) |
| # Click the pagination button three times, pausing after each click |
| for i in range(0, 3): |
|     load_more = driver.find_element(By.CSS_SELECTOR, 'button.page-last-btn') |
|     load_more.click() |
|     time.sleep(1) |
| |
| # Print the text of every comment currently on the page |
| comments = driver.find_elements(By.CSS_SELECTOR, 'div.reply-content') |
| for cm in comments: |
|     content = cm.find_element(By.TAG_NAME, 'p') |
|     print(content.text) |
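The fixed `time.sleep(1)` pauses above either waste time or turn out too short on a slow connection. One alternative is to poll until the content appears. The helper below is a minimal, library-independent sketch; the name `wait_until` and its default timings are assumptions, not part of the original example.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

With a live driver this could be used as `wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, 'div.reply-content'))`, since `find_elements` returns an empty (falsy) list while nothing matches. Selenium's own `WebDriverWait(driver, timeout).until(...)` provides the same kind of polling.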
Optimizing Selenium
| import random |
| import time |
| |
| from selenium import webdriver |
| from selenium.webdriver.chrome.options import Options |
| from selenium.webdriver.chrome.service import Service |
| from selenium.webdriver.common.by import By |
| |
| options = Options() |
| # Randomize the WebKit version number in the User-Agent string |
| num = str(float(random.randint(500, 600))) |
| options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/{}" |
|                      " (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/{}".format(num, num)) |
| |
| # Do not load images (2 = block) |
| prefs = {"profile.managed_default_content_settings.images": 2} |
| options.add_experimental_option("prefs", prefs) |
| |
| # Selenium 4 takes the driver path through a Service object |
| service = Service('E:\\DownLoad\\python\\Scripts\\chromedriver.exe') |
| driver = webdriver.Chrome(service=service, options=options) |
| driver.get('https://www.ly.com/') |
| time.sleep(5) |
| html = driver.find_element(By.XPATH, "//body").get_attribute("innerHTML") |
| print(html) |
Selenium scraping practice: Shenzhen short-term rental data
| from selenium import webdriver |
| from selenium.webdriver.common.by import By |
| |
| option = webdriver.FirefoxOptions() |
| option.set_preference( |
| option.set_preference( |
| firefox = webdriver.Firefox(options=option) |
| # Give up on page loads and scripts after 5 seconds |
| firefox.set_page_load_timeout(5) |
| firefox.set_script_timeout(5) |
| try: |
|     # The place name encoded in this URL path is 青岛 (Qingdao) |
|     firefox.get( |
|         "https://www.airbnb.cn/s/%E9%9D%92%E5%B2%9B/homes?host_promotion_type_ids[]=0&host_promotion_type_ids[]=1&host_promotion_type_ids[]=8&checkin=2021-12-30&checkout=2021-12-31") |
|     divs = firefox.find_elements(By.CSS_SELECTOR, "div._8ssblpx") |
|     for div in divs: |
|         # Price, review count, name, room type, number of beds, number of guests |
|         str_fangwu = div.find_element(By.CSS_SELECTOR, "span._faldii7").text |
|         jg = div.find_element(By.CSS_SELECTOR, "span._185kh56").text |
|         pjs = div.find_element(By.CSS_SELECTOR, "span._1clmxfj").text |
|         name = div.find_element(By.CSS_SELECTOR, "div._qrfr9x5").text |
|         print(str_fangwu, jg, pjs, name) |
| except Exception as e: |
|     # Report the failure instead of silently ignoring it |
|     print("Scrape failed:", e) |
| finally: |
|     firefox.quit() |