行之间 - 博客园

January 2, 2020

Summary:

```python
# -*- coding: utf-8 -*-
"""
@author: Dell
Created on Thu Jan 2 11:16:08 2020
"""
import gevent
from gevent import monkey
monkey.patch_all()
from lxml import etree
from selenium import webdri…
```

posted @ 2020-01-02 13:02 行之间 Views(197) Comments(0) Recommended(0)

January 1, 2020

Coroutine framework

Summary:

```python
import requests
import gevent
from gevent import monkey
monkey.patch_all()

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    '…
```

posted @ 2020-01-01 23:03 行之间 Views(283) Comments(0) Recommended(0)

December 29, 2019

Multithreaded email scraping

Summary:

```python
# -*- coding: utf-8 -*-
"""
@author: Dell
Created on Sun Dec 29 17:26:43 2019
"""
import re
import time
import queue
import threading
import requests

def getpagesource(url):
    """Fetch the page source."""
    try:
        ...
```

posted @ 2019-12-29 21:56 行之间 Views(277) Comments(0) Recommended(0)
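The excerpt above stops right after its imports, so the post's actual worker logic is not visible here. As a rough illustration of the general pattern those imports suggest (a `queue.Queue` of URLs drained by several `threading.Thread` workers), here is a minimal sketch. The names `crawl_emails`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, the email pattern is a simplified assumption, and the fetch step is replaced by an in-memory stand-in so the example runs offline.

```python
import queue
import re
import threading

# Simplified email pattern (an assumption, not taken from the post).
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def crawl_emails(urls, fetch, worker_count=4):
    """Fetch every URL with a pool of threads and return the set of emails found."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    found = set()
    lock = threading.Lock()  # guards `found`, which all workers write to

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread finish
            try:
                emails = EMAIL_RE.findall(fetch(url))
            except Exception:
                emails = []  # a failed page should not kill the worker
            with lock:
                found.update(emails)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return found


# Hypothetical in-memory pages standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.com/a": "<p>Contact alice@example.com</p>",
    "http://example.com/b": "<li>bob@example.org</li><li>alice@example.com</li>",
    "http://example.com/c": "<p>No contact here</p>",
}


def fake_fetch(url):
    """Stand-in for a real download such as requests.get(url).text."""
    return FAKE_PAGES[url]


result = crawl_emails(list(FAKE_PAGES), fake_fetch)
print(sorted(result))
```

Passing the fetch function in as a parameter keeps the threading logic testable without network access; a real crawler would pass a function that downloads the page instead of `fake_fetch`.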

December 24, 2019

Handling drop-down lists and page alert boxes with Selenium

Summary:

```python
import time
from selenium import webdriver
from selenium.webdriver.support.select import Select  # handle drop-down lists
from selenium.webdriver.support.ui import WebDriverWait  # wait for an element to finish loading
from selenium.webdriv…
```

posted @ 2019-12-24 20:35 行之间 Views(366) Comments(0) Recommended(0)

Summary:

```python
# -*- coding: utf-8 -*-
"""
@author: Dell
Created on Tue Dec 24 12:33:56 2019
"""
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait  # wait for an element to finish loading…
```

posted @ 2019-12-24 12:53 行之间 Views(1289) Comments(0) Recommended(0)

December 23, 2019

Scraping Python job postings from Tencent Careers

Summary: # -*- coding: utf-8 -*- """ @author: Dell Created on Mon Dec 23 17:55:06 2019 """ import re import time import requests from lxml import etree from se…

posted @ 2019-12-23 20:11 行之间 Views(400) Comments(0) Recommended(0)

December 20, 2019

Web scraping study notes, part 1

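Summary: Tips: whichever page you crawl, you can add request header information. Using a proxy with requests: import requests url = "http://httpbin.org/ip" # requesting this address returns the caller's IP address proxies = {'http':'119.39.68.252:8118'} resp = r…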

posted @ 2019-12-20 22:05 行之间 Views(491) Comments(0) Recommended(0)
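The two tips in the excerpt above (send request headers, and route requests through a proxy) can be combined on a single `requests.Session`. The sketch below is only an illustration under assumptions: `build_session` is a hypothetical helper, the User-Agent string is made up, and the proxy address is the one quoted in the excerpt, which is very unlikely to still be reachable. For that reason the network call itself is left commented out.

```python
import requests


def build_session(proxy_url, user_agent):
    """Return a Session that sends a custom User-Agent and uses an HTTP proxy."""
    session = requests.Session()
    # Headers set on the session are sent with every request made through it.
    session.headers.update({"User-Agent": user_agent})
    # Only plain-http traffic is proxied here; add an "https" key for HTTPS URLs.
    session.proxies.update({"http": proxy_url})
    return session


# Proxy address taken from the excerpt; the User-Agent value is a placeholder.
session = build_session("http://119.39.68.252:8118", "Mozilla/5.0 (sketch)")

# With a working proxy, the call would look like this:
# resp = session.get("http://httpbin.org/ip", timeout=5)
# print(resp.text)
```

Setting the headers and proxy once on the session avoids repeating the `headers=` and `proxies=` arguments on every call.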

December 19, 2019

Scraping Douban's newly released movies with requests and XPath

Summary:

```python
# -*- coding: utf-8 -*-
"""
Scrape Douban's newly released movies
# ul = etree.tostring(ul, encoding="utf-8").decode("utf-8")
"""
import requests
from lxml import etree

# 1. Fetch the target site's page
def getHtml(url):
    headers = {
        'User-…
```

posted @ 2019-12-19 13:20 行之间 Views(415) Comments(0) Recommended(0)

December 18, 2019

Using XPath to extract the href attribute of every a tag on a page

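Summary:

```python
# -*- coding: utf-8 -*-
# 1. Selecting nodes
# Get all div elements
//div
# A single / selects direct children (starting from the root when it leads the path);
# // selects matching nodes at any depth
# Get all div elements that have an id attribute
//div[@id]
# 2. Predicates (indexes start at 1)
# Get the first / last / first two div elements under body
//body/div[1]
//body/div[last()]
//body/div[position()<3]
```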

posted @ 2019-12-18 22:36 行之间 Views(37961) Comments(0) Recommended(2)
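The excerpt above lists XPath syntax but is cut off before the href extraction named in the title. As a minimal sketch of that idea, the example below uses the standard library's `xml.etree.ElementTree`, which is a swap for the lxml used elsewhere on this blog: ElementTree supports only a subset of XPath and needs well-formed markup. With lxml, the usual form is `etree.HTML(text).xpath('//a/@href')`. The `PAGE` string and the `extract_hrefs` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse sloppy HTML).
PAGE = """<html><body>
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div><a name="anchor-only">No link</a><p><a href="https://example.com/">Example</a></p></div>
</body></html>"""


def extract_hrefs(markup):
    """Return the href value of every <a> element that has one, in document order."""
    root = ET.fromstring(markup)
    # ".//a[@href]" selects <a> elements at any depth that carry an href attribute.
    # ElementTree cannot select attribute values directly, so read them with .get().
    return [a.get("href") for a in root.findall(".//a[@href]")]


hrefs = extract_hrefs(PAGE)
print(hrefs)  # ['/home', '/about', 'https://example.com/']
```

The `[@href]` predicate is what skips the anchor that only has a `name` attribute.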

December 15, 2019

Extracting all email addresses from a web page

Summary:

```python
import re
from urllib import request

# Harvest email addresses
def getEmailsByLine(url):
    """Extract emails line by line."""
    emailregex = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", re.IGNORECASE)  # case-insensitive match
    for …
```

posted @ 2019-12-15 21:38 行之间 Views(724) Comments(0) Recommended(0)
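The excerpt above is cut off at the start of its loop, so the rest of `getEmailsByLine` is not shown. The following is a guess at the general shape of line-by-line extraction, not the post's code: it reuses the regular expression from the excerpt, but it works on a text string rather than a URL so it runs offline, and the helper name `get_emails_by_line`, the deduplication, and the trailing-dot cleanup are all assumptions.

```python
import re

# Pattern copied from the excerpt; re.IGNORECASE makes matching case-insensitive.
EMAIL_REGEX = re.compile(
    r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", re.IGNORECASE
)


def get_emails_by_line(text):
    """Yield each distinct email address, scanning the text one line at a time."""
    seen = set()
    for line in text.splitlines():
        for email in EMAIL_REGEX.findall(line):
            # The pattern allows dots in the domain, so it also swallows a
            # sentence-ending period; strip it (an added cleanup step).
            email = email.rstrip(".")
            if email not in seen:
                seen.add(email)
                yield email


# Hypothetical sample text standing in for a downloaded page.
SAMPLE = """Write to Admin@Example.com.
Sales: sales@example.co.uk, or Admin@Example.com again
No address on this line"""

emails = list(get_emails_by_line(SAMPLE))
print(emails)  # ['Admin@Example.com', 'sales@example.co.uk']
```

To run this against a live page, the text could come from `urllib.request.urlopen(url).read().decode()`, matching the `from urllib import request` import in the excerpt.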