Pitfalls I ran into while using Scrapy
1. At first I wanted to crawl smzdm.com (什么值得买) with Scrapy and Selenium, and ran into a strange problem. Here is the code:
import time

from scrapy.selector import Selector
from selenium import webdriver


# Method of the spider class
def start_requests(self):
    self.logger.info("starting")
    broswer = webdriver.Firefox()
    broswer.get(self.start_url)
    last_height = broswer.execute_script("return document.body.scrollHeight")
    print(last_height)
    count = 0
    # Scroll to the bottom at most twice so that lazily loaded items appear.
    while True:
        print(count)
        if count == 2:
            break
        broswer.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        new_height = broswer.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        time.sleep(1.2)
        count = count + 1
    source = broswer.page_source
    broswer.close()
    scrapy_selector = Selector(text=source)
    items_selector = scrapy_selector.xpath('//div[@class="z-feed-content"]')
    self.logger.info("There's a total of " + str(len(items_selector)) + ' links.')
    try:
        s = 0
        for item_selector in items_selector:
            print(s)
            print(item_selector.getall())
            # Wrong (Keep in mind that if you are nesting selectors and use an XPath that starts with /,
            # that XPath will be absolute to the document and not relative to the Selector you’re calling it from.)
            # url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')
            # Wrong (multiple class should be: *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')])
            # url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')
            url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")
            # assert isinstance(url_selector, scrapy.selector.Selector)
            print(url_selector.extract())
            # self.logger.info("sss" + url)
            url = url_selector.get()
            s = s + 1
            # self.logger.info("sss" + url)
    except Exception as e:
        self.logger.info('Reached last iteration #' + str(e) + str(s))
    return
broswer.page_source is the HTML of the whole page as rendered in the browser, scrapy_selector is a selector built on that whole page, and items_selector is the list of selectors for the div blocks that hold the product information on the home page. All of that is fine. The problem is this line:
url_selector = item_selector.xpath('//h5[@class="feed-block-title has-price"]/a/@href')
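item_selector is the selector for one product div block, and print(item_selector.getall()) shows that it is correct. The problem is url_selector, the selector for the product link inside that div block: print(url_selector.extract()) prints a list of 18 URLs: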
['https://www.smzdm.com/p/20610761/#hfeeds', 'https://www.smzdm.com/p/20601553/#hfeeds', 'https://www.smzdm.com/p/20597500/#hfeeds', 'https://www.smzdm.com/p/20603303/#hfeeds', 'https://www.smzdm.com/p/20613198/#hfeeds', 'https://www.smzdm.com/p/20601438/#hfeeds', 'https://www.smzdm.com/p/20615602/#hfeeds', 'https://www.smzdm.com/p/20596520/#hfeeds', 'https://www.smzdm.com/p/20617429/#hfeeds', 'https://www.smzdm.com/p/20607426/#hfeeds', 'https://www.smzdm.com/p/20615296/#hfeeds', 'https://www.smzdm.com/p/20618224/#hfeeds', 'https://www.smzdm.com/p/20603149/#hfeeds', 'https://www.smzdm.com/p/20604376/#hfeeds', 'https://www.smzdm.com/p/20603224/#hfeeds', 'https://www.smzdm.com/p/20615599/#hfeeds', 'https://www.smzdm.com/p/20615846/#hfeeds', 'https://www.smzdm.com/p/20586712/#hfeeds']
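There are 60 item_selector objects in total, and url_selector for every one of them is the same 18-element list. (I did not find out why it was exactly the first 18; presumably only those h5 elements carry exactly the class value "feed-block-title has-price".)
I later found the cause in this sentence from the official documentation: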
Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.
The cause is clear: if an XPath starts with / or //, it is evaluated against the whole document rather than relative to item_selector. So I changed it to:
url_selector = item_selector.xpath('.//h5[@class="feed-block-title has-price"]/a/@href')
But this returned nothing. Later I found another passage in the official documentation:
Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose: *[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]. If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements that you want, if they have a different class name that shares the string someclass.
This is fairly verbose, but with the change below the URL is extracted correctly:
url_selector = item_selector.xpath(".//h5[contains(concat(' ', normalize-space(@class), ' '), 'feed-block-title')]/a/@href")
Because this is cumbersome, the official documentation recommends selecting by class with CSS first and then chaining XPath, as in:
>>> sel.css('.shout').xpath('./time/@datetime').getall()
2. The page HTML is:
<h1 class="item-name"> <span class="edit_interface"></span> 闲鱼出售全新ipad pro 2018翻车日记 </h1>
goods_scrapy_selector.xpath("//article/h1/text()") returned a list rather than a single string: ['\n ', '\n 闲鱼出售全新ipad pro 2018翻车日记 ']
My first guess was that the text is \n ... \n ..., so each line break produces a separate entry?
The XPath specification says: "text() selects all text node children of the context node." (from https://www.w3.org/TR/1999/REC-xpath-19991116/#section-String-Functions)
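In other words, text() returns each text-node child of h1 as a separate result. Because h1 also contains a span, its text is split into two nodes: the whitespace before the span, and the title "闲鱼出售全新ipad pro 2018翻车日记" that follows the span.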
3. scrapy 2.0.1
The log that Scrapy originally printed to the console was:
2020-05-28 22:56:06,765 - smzdm_jingxuan - INFO - smzdm_jingxuan spider starting
I wanted to change the log format so that the line number is printed. My first idea was to change logging.basicConfig:
import logging

import scrapy


class SmzdmSpider(scrapy.Spider):
    name = 'smzdm_jingxuan'
    allowed_domains = ['spider.smzdm']
    start_urls = ("http://books.toscrape.com/",)

    # logging.basicConfig(level=logging.INFO, format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d \
    # %(levelname)s - %(message)s", "%Y-%m-%d %H:%M')
    logging.basicConfig(
        format='%(asctime)s,%(msecs)d %(levelname)-8s [%(pathname)s:%(lineno)d in function %(funcName)s] %(message)s',
        datefmt='%Y-%m-%d:%H:%M:%S',
        level=logging.INFO)
    logger = logging.getLogger(__name__)
The console log became:
2020-05-28 22:53:25,197 - smzdmCrawler.spiders.smzdm_jingxuan - INFO - smzdm_jingxuan spider starting
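It changed slightly, but the line number was not printed and the result did not match the configured format. The only visible change, the logger name, comes from logger = logging.getLogger(__name__), which uses the module name instead of the spider name. I do not know why the format was ignored; one possibility is that logging.basicConfig does nothing when the root logger already has handlers.
After some searching I found that settings.py accepts logging settings such as LOG_FILE, LOG_ENABLED and LOG_FORMAT, so I set:
IMAGES_STORE = '/Users/gaoxianghu/temp/image'
LOG_FILE = '/Users/gaoxianghu/temp/scrapy_log.log'
LOG_ENABLED = False
LOG_FORMAT = '[%(asctime)s] p%(process)s {%(pathname)s:%(lineno)d} %(levelname)s - %(message)s'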
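LOG_ENABLED = False had no effect: logs still appeared both on the console and in the log file. The entries in the log file followed LOG_FORMAT, but the console output was unchanged (I do not know why).
As for LOG_ENABLED = False not taking effect, someone online suggested that this works: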
logging.getLogger('scrapy').propagate = False
The entries in the log file look like this:
[2020-05-28 22:16:12] p41288 {/Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py:38} INFO - smzdm_jingxuan spider starting
Later I found that putting the logging configuration in settings.py does change the console log.
settings.py
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(pathname)s %(filename)s %(funcName)s %(lineno)d %(levelname)s - %(message)s',
)
Console output:
2020-05-28 23:30:45,008 /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler/spiders/smzdm_jingxuan.py smzdm_jingxuan.py parse 34 INFO - smzdm_jingxuan spider starting
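The following standard-library sketch illustrates two things related to the notes above: a format string that prints file name, line number and function name, and the fact that logging.basicConfig is ignored once the root logger already has a handler. Whether that second point is really what happened inside Scrapy here is an assumption on my part. The logger name "demo_format" and the helper names are made up for the example.

```python
import io
import logging

LINE_FORMAT = "%(asctime)s %(filename)s:%(lineno)d %(funcName)s %(levelname)s - %(message)s"


def make_logger(stream):
    """Return a logger that writes LINE_FORMAT records to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LINE_FORMAT))
    logger = logging.getLogger("demo_format")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output away from the root handlers
    return logger


def parse(logger):
    # %(funcName)s and %(lineno)d refer to this call site.
    logger.info("smzdm_jingxuan spider starting")


def basic_config_ignored():
    """Return True if basicConfig leaves an already configured root logger alone."""
    root = logging.getLogger()
    marker = logging.NullHandler()
    root.addHandler(marker)  # simulate a framework that configured logging first
    try:
        before = list(root.handlers)
        logging.basicConfig(format=LINE_FORMAT, level=logging.DEBUG)
        return root.handlers == before
    finally:
        root.removeHandler(marker)


buffer = io.StringIO()
parse(make_logger(buffer))
line = buffer.getvalue().strip()
print(line)
print("basicConfig ignored:", basic_config_ignored())
```

If the configuration has to win regardless of what ran earlier, attaching a handler with its own Formatter, as make_logger does, avoids depending on import order.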
3. When a Scrapy spider uses a relative import, running scrapy crawl smzdm_jingxuan fails with: attempted relative import with no known parent package. The earlier absolute form, from smzdmCrawler.items import SmzdmItem, worked fine. The relative import in question is:
from .. import items
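The project layout is:
smzdmCrawler
|--model
|--spider
|--|--smzdm_jingxuan.py
|--items.py
|--__init__.py
|--main.py
Because smzdmCrawler already contains __init__.py, it is a package. According to what I found online, whether a module is treated as part of a package is decided by its __name__. In my case the problem depended on the directory from which scrapy crawl smzdm_jingxuan was run: I had been running it from /Users/gaoxianghu/git/cheap/smzdmCrawler/smzdmCrawler, and running it from /Users/gaoxianghu/git/cheap/smzdmCrawler made the error go away.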
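The effect of the import path can be shown without Scrapy. The sketch below builds a throwaway package in a temporary directory (the names demo_crawler and demo_spider are hypothetical) and imports the same file twice: once through its package, once as a top-level module. It only illustrates the general Python rule; it is an assumption that Scrapy loaded the spider in exactly this second way.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_project(root):
    """Create demo_crawler/{__init__,items}.py and demo_crawler/spiders/demo_spider.py."""
    pkg = root / "demo_crawler"
    (pkg / "spiders").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "items.py").write_text("ITEM = 'SmzdmItem'\n")
    (pkg / "spiders" / "__init__.py").write_text("")
    (pkg / "spiders" / "demo_spider.py").write_text(
        "from .. import items\nITEM = items.ITEM\n"
    )


def try_import(search_path, module_name):
    """Import `module_name` with `search_path` on sys.path and describe the outcome."""
    sys.path.insert(0, str(search_path))
    try:
        module = importlib.import_module(module_name)
        return f"ok: {module.ITEM}"
    except ImportError as exc:
        return f"ImportError: {exc}"
    finally:
        sys.path.remove(str(search_path))
        # Forget the demo modules so the next attempt starts clean.
        for name in [n for n in sys.modules if n.split(".")[0] in ("demo_crawler", "demo_spider")]:
            del sys.modules[name]
        importlib.invalidate_caches()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_project(root)
    # Parent of the package on the path: the module knows its package.
    as_package = try_import(root, "demo_crawler.spiders.demo_spider")
    # The spiders directory itself on the path: the module has no parent package.
    as_top_level = try_import(root / "demo_crawler" / "spiders", "demo_spider")

print(as_package)
print(as_top_level)
```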
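4. When deploying with scrapyd, be aware that the program may not be able to read environment variables at that point. After scrapyd-deploy uploads the project, the code is loaded once, and deployment fails if that load raises an error because an environment variable cannot be read. For example, with code like this:
import os
from logging.handlers import TimedRotatingFileHandler

# LOG_FILE and today_str are presumably defined earlier in settings.py (not shown here).
SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)
# LOG_FILE is only passed in production
if LOG_FILE:
    log_file = LOG_FILE
    image_file = '/data/image/' + today_str
else:
    if SCRAPY_ENV is None:
        log_file = '/Users/gaoxianghu/temp/scraping.log'
        image_file = '/Users/gaoxianghu/temp/image/' + today_str
    else:
        log_file = '/data/log/scrapy/scraping.log'
        image_file = '/data/image/' + today_str
logHandler = TimedRotatingFileHandler(log_file, when='midnight', interval=1)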
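Although I had set the SCRAPY_ENV environment variable on the server, it could not be read because the code was not running in a command-line (shell) environment. As a result log_file became '/Users/gaoxianghu/temp/scraping.log', a path that does not exist on the server, and an error was raised. I then ran the spider with:
curl http://david_scrapyd:david_2021@42.192.51.99:6801/schedule.json -d project=smzdmCrawler -d spider=smzdm_single -d setting=LOG_FILE=/data/log/scrapy/scraping.log
Since the setting is passed in here, the code should not set log_file to '/Users/gaoxianghu/temp/scraping.log'. Yet the same error was reported, except that according to the log the code being run is a temporary file. I am not yet sure why it still fails, because when I start the service locally the passed-in LOG_FILE can be read. Perhaps, because the load failed at deploy time, the code is loaded once more before running, and that triggers the error.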
File "/tmp/smzdmCrawler-1614340245-de7610pr.egg/smzdmCrawler/settings.py"
FileNotFoundError: [Errno 2] No such file or directory: '/Users/gaoxianghu/temp/scraping.log'
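One way to make such a settings module tolerant of a missing environment variable is sketched below. It is not the original code: the variable name SCRAPY_LOG_DIR, the helper names and the fallback directory are all hypothetical. The idea is to create the log directory before use and to pass delay=True so the handler does not open the file while the module is merely being loaded.

```python
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def resolve_log_file(environ, fallback_dir):
    """Pick a log path from the environment, falling back to `fallback_dir`."""
    # SCRAPY_LOG_DIR is a hypothetical variable name used only in this sketch.
    log_dir = Path(environ.get("SCRAPY_LOG_DIR", fallback_dir))
    # Create the directory so that opening the file later cannot hit a missing folder.
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir / "scraping.log"


def build_handler(log_file):
    # delay=True postpones opening the file until the first record is emitted,
    # so loading the settings module does not open the log file.
    return TimedRotatingFileHandler(str(log_file), when="midnight", interval=1, delay=True)


fallback = Path(tempfile.gettempdir()) / "scrapy_demo_logs"
log_file = resolve_log_file(os.environ, fallback)
handler = build_handler(log_file)
print(log_file)
handler.close()
```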
5. Another thing to watch for with scrapyd. According to https://github.com/scrapy/scrapyd-client#scrapyd-deploy:
You may want to keep certain settings local and not have them deployed to Scrapyd. To accomplish this you can create a local_settings.py file at the root of your project, where your scrapy.cfg file resides, and add the following to your project's settings:
try:
    from local_settings import *
except ImportError:
    pass
scrapyd-deploy doesn't deploy anything outside of the project module, so the local_settings.py file won't be deployed.
In my own testing, when scrapyd runs locally and the project is deployed to it, local_settings can still be imported even though it is not in the egg (presumably because the file is still reachable on the local machine). When the project is deployed to a remote scrapyd, it cannot be imported. The author's statement presumably refers to the remote case.
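6. On a new server I installed Scrapy with conda install -c conda-forge scrapy, which installed version 2.6.1.
After that, when crawling, the XPath extraction of detail pulled in a large amount of content that should not have been extracted. The values stored in the database became huge, and the front end loaded very slowly because all of that data was returned. After some testing, uninstalling Scrapy and reinstalling it with pip install scrapy fixed the problem.
The code is:
detail_selector = response.css(".txt-detail").xpath(".//p")
detail = detail_selector.getall()
The code should be fine, since it works on another server, but here it extracted the whole page after .txt-detail. It may be an installation issue (perhaps a different build of the underlying HTML parser, though I did not confirm this).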
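When the same code behaves differently on two servers, comparing installed package versions is a cheap first check. The sketch below uses only the standard library; the list of packages is my guess at what is relevant to HTML parsing in a Scrapy install, and the helper name is made up.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed to be the packages most likely to affect selector behaviour.
report = installed_versions(["scrapy", "parsel", "lxml", "cssselect", "w3lib"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this on both servers and comparing the output shows whether the conda and pip environments differ. The command scrapy version -v prints similar information, including the libxml2 version.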
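7. In production, memory kept being exhausted by open Chrome processes. I first suspected that Selenium's driver.quit() was not closing Chrome. On investigation, the cause was that the Scrapy spiders directory contains several spiders; although only one of them runs, every spider class was initialized, and Chrome was created during initialization:
self.broswer = webdriver.Chrome(chrome_options=chrome_options)
So on every crawl, Chrome was opened once for each initialization, and at the end only the Chrome belonging to the spider that actually ran was closed. That is why the server always showed many chrome processes, along with chromedriver processes, in the background.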
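One possible fix is to create the browser lazily, on first use, and to close it in the spider's closed(reason) hook. The sketch below is not the original code: FakeBrowser stands in for webdriver.Chrome so the example runs without Selenium, and LazyBrowserMixin, browser_factory, SpiderA and SpiderB are hypothetical names.

```python
class FakeBrowser:
    """Stand-in for webdriver.Chrome so the sketch runs without Selenium."""

    opened = 0  # counts how many browsers were started

    def __init__(self):
        FakeBrowser.opened += 1
        self.alive = True

    def quit(self):
        self.alive = False


class LazyBrowserMixin:
    # Hypothetical hook; a real spider would point this at something that
    # builds webdriver.Chrome with the desired options.
    browser_factory = FakeBrowser
    _browser = None

    @property
    def browser(self):
        # Start the browser only when a spider actually needs it.
        if self._browser is None:
            self._browser = self.browser_factory()
        return self._browser

    def closed(self, reason):
        # Scrapy calls closed(reason) on the spider when it finishes.
        if self._browser is not None:
            self._browser.quit()
            self._browser = None


class SpiderA(LazyBrowserMixin):
    name = "a"


class SpiderB(LazyBrowserMixin):
    name = "b"


# Initializing every spider no longer starts any browser.
spiders = [SpiderA(), SpiderB()]
opened_after_init = FakeBrowser.opened

# Only the spider that runs opens one, and it closes it at the end.
running = spiders[0]
browser = running.browser
opened_after_use = FakeBrowser.opened
running.closed("finished")

print(opened_after_init, opened_after_use, browser.alive)
```

With this layout, a spider that is initialized but never crawls does not start a browser, so nothing is left behind for it.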