Using pymongo, plus a Tencent recruitment scraping example
I. Working with MongoDB in Python 3
1. Prerequisites
- The pymongo library is installed.
- The MongoDB server is running (if it was started in the foreground, keep that window open; closing the window also shuts the server down).
2. Usage
import pymongo
# Connect with MongoClient; passing MongoDB's host and port is usually enough.
# The first argument is the host, the second is the port (defaults to 27017).
client=pymongo.MongoClient(host='127.0.0.1',port=27017)
# This gives us a client object.
# Alternatively, the first argument of MongoClient accepts a MongoDB connection string
# starting with mongodb://,
# e.g. client = MongoClient('mongodb://localhost:27017/') gives the same connection.
# print(client)
################### Choose a database
db=client.test
# Equivalent form:
# db=client['test']
################## Choose a collection
collections=db.student
# Equivalent form:
# collections=db['student']
################### Insert data
# student={
# 'id':'1111',
# 'name':'xiaowang',
# 'age':20,
# 'sex':'boy',
# }
#
# res=collections.insert(student)
# print(res)
# In MongoDB every document has an _id field that uniquely identifies it.
# If _id is not set explicitly, MongoDB generates an ObjectId-typed _id automatically.
# The return value of insert() is that _id value, e.g. 5c7fb5ae35573f14b85101c0.
# Several documents can also be inserted at once:
# student1={
# 'name':'xx',
# 'age':20,
# 'sex':'boy'
# }
#
# student2={
# 'name':'ww',
# 'age':21,
# 'sex':'girl'
# }
# student3={
# 'name':'xxx',
# 'age':22,
# 'sex':'boy'
# }
#
# result=collections.insert_many([student1,student2,student3])
# print(result)
# Here the return value is no longer the _id values but an InsertManyResult object;
# the generated _id values are available on its inserted_ids attribute.
# There are two recommended insert methods:
# insert_one() inserts a single document; insert_many() inserts several, passed in as a list.
# The legacy insert() also works: pass a single dict, or a list of dicts for several documents.
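# A minimal sketch of the two recommended calls (the example values are made up for illustration):
# one = collections.insert_one({'name': 'yy', 'age': 23, 'sex': 'girl'})
# print(one.inserted_id)          # the ObjectId of the new document
# many = collections.insert_many([{'name': 'zz', 'age': 24}, {'name': 'qq', 'age': 25}])
# print(many.inserted_ids)        # a list of ObjectIds, one per inserted document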
################### Query: a single document
# re=collections.find_one({'name':'xx'})
# print(re)
# print(type(re))
#{'_id': ObjectId('5c7fb8d535573f13f85a6933'), 'name': 'xx', 'age': 20, 'sex': 'boy'}
# <class 'dict'>
##################### Query: multiple documents
# re=collections.find({'name':'xx'})
# print(re)
# print(type(re))
# for r in re:
# print(r)
# The result is a Cursor object (an iterable); loop over it to get each document.
# <pymongo.cursor.Cursor object at 0x000000000A98E630>
# <class 'pymongo.cursor.Cursor'>
# re=collections.find({'age':{'$gt':20}})
# print(re)
# print(type(re))
# for r in re:
# print(r)
# Here the value in the query condition is no longer a plain number but a dict whose key is the
# comparison operator $gt (greater than) and whose value is 20, so the query returns all
# documents whose age is greater than 20.
# The comparison operators are summarised in the table below:
"""
符号含义示例
$lt小于{'age': {'$lt': 20}}
$gt大于{'age': {'$gt': 20}}
$lte小于等于{'age': {'$lte': 20}}
$gte大于等于{'age': {'$gte': 20}}
$ne不等于{'age': {'$ne': 20}}
$in在范围内{'age': {'$in': [20, 23]}}
$nin不在范围内{'age': {'$nin': [20, 23]}}
"""
# Query with a regular expression
# re = collections.find({'name': {'$regex': '^x.*'}})
# print(re)
# print(type(re))
# for r in re:
# print(r)
# Some further query operators are summarised below:
"""
符号含义示例示例含义
$regex匹配正则{'name': {'$regex': '^M.*'}}name以M开头
$exists属性是否存在{'name': {'$exists': True}}name属性存在
$type类型判断{'age': {'$type': 'int'}}age的类型为int
$mod数字模操作{'age': {'$mod': [5, 0]}}年龄模5余0
$text文本查询{'$text': {'$search': 'Mike'}}text类型的属性中包含Mike字符串
$where高级条件查询{'$where': 'obj.fans_count == obj.follows_count'}自身粉丝数等于关注数
"""
################ Counting
# count=collections.find({'age':{'$gt':20}}).count()
# print(count)
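# Note: newer pymongo releases deprecate Cursor.count(); count_documents() on the collection
# is the replacement. A sketch with the same filter:
# count = collections.count_documents({'age': {'$gt': 20}})
# print(count)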
################# Sorting
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING)
# print([re['name'] for re in result])
########### Offset: to take only some of the results, skip() can shift the starting position; skipping 2, for example, ignores the first two results and returns the third and onwards.
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING).skip(1)
# print([re['name'] for re in result])
################## limit() can also be used to cap the number of results returned, for example:
# results = collections.find().sort('age', pymongo.ASCENDING).skip(1).limit(2)
# print([result['name'] for result in results])
# Note that when the collection is very large (tens of millions or hundreds of millions of
# documents), it is best not to query with a large offset, which can easily exhaust memory.
# Instead, remember the last _id returned by the previous query and use something like
# find({'_id': {'$gt': ObjectId('593278c815c2602678bb2b8d')}}); see the sketch below.
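# A minimal sketch of that _id-based paging (the ObjectId is just the example value above):
# from bson.objectid import ObjectId
# last_id = ObjectId('593278c815c2602678bb2b8d')   # the last _id from the previous page
# page = collections.find({'_id': {'$gt': last_id}}).sort('_id', pymongo.ASCENDING).limit(10)
# print([doc['_id'] for doc in page])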
################################ Updating data
# Updates use the update method:
# condition={'name':'xx'}
# student=collections.find_one(condition)
# student['age']=100
# result=collections.update(condition,student)
# print(result)
# Here we update the age of the document whose name is xx: first specify the query condition,
# query the document, modify its age, then call update() with the original condition and the
# modified document to complete the update.
# {'ok': 1, 'nModified': 1, 'n': 1, 'updatedExisting': True}
# The return value is a dict: ok means the call succeeded and nModified is the number of
# documents affected.
# update() itself is no longer the officially recommended method; it is split into update_one()
# and update_many(), which are stricter: the second argument must use a $-operator as the
# dict key. An example:
# condition={'name':'xx'}
# student=collections.find_one(condition)
# print(student)
# student['age']=112
# result=collections.update_one(condition,{'$set':student})
# print(result)
# print(result.matched_count,result.modified_count)
# Another example:
# condition={'age':{'$gt':20}}
# result=collections.update_one(condition,{'$inc':{'age':1}})
# print(result)
# print(result.matched_count,result.modified_count)
# Here the query condition is age greater than 20 and the update document is
# {'$inc': {'age': 1}}, so the first matching document has its age incremented by 1.
# <pymongo.results.UpdateResult object at 0x000000000A99AB48>
# 1 1
# Calling update_many() instead updates every document that matches the condition:
condition = {'age': {'$gt': 20}}
result = collections.update_many(condition, {'$inc': {'age': 1}})
print(result)
print(result.matched_count, result.modified_count)
# This time the number of matched documents is no longer 1; the output is:
# <pymongo.results.UpdateResult object at 0x10c6384c8>
# 3 3
# As shown, every matched document gets updated.
# ############### Deleting
# Deletion is straightforward: call remove() with the condition, and every matching document
# is deleted, for example:
# result = collections.remove({'name': 'Kevin'})
# print(result)
# Output:
# {'ok': 1, 'n': 1}
# As before, there are two newer recommended methods, delete_one() and delete_many():
# result = collections.delete_one({'name': 'Kevin'})
# print(result)
# print(result.deleted_count)
# result = collections.delete_many({'age': {'$lt': 25}})
# print(result.deleted_count)
# Output:
# <pymongo.results.DeleteResult object at 0x10e6ba4c8>
# 1
# 4
# delete_one() deletes the first matching document and delete_many() deletes all matching
# documents. Both return a DeleteResult, whose deleted_count attribute gives the number of
# documents removed.
# More
# PyMongo also offers combined methods such as find_one_and_delete(), find_one_and_replace()
# and find_one_and_update(), i.e. find-then-delete / replace / update. Their usage is largely
# the same as the methods above.
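# A short sketch of find_one_and_update() on the same data; by default it returns the document
# as it looked *before* the update (pass return_document=pymongo.ReturnDocument.AFTER to get
# the updated version instead):
# result = collections.find_one_and_update({'name': 'xx'}, {'$inc': {'age': 1}})
# print(result)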
II. Scraping Tencent recruitment postings
The spider file
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['www.xxx.com']
    # Base url used to build the paginated urls
    base_url = 'http://hr.tencent.com/position.php?&start='
    page_num = 0
    start_urls = [base_url + str(page_num)]

    def parse(self, response):
        # First grab the list of row tags that hold the postings, then loop over them
        tr_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for tr in tr_list:
            name = tr.xpath('./td[1]/a/text()').extract_first()
            url = tr.xpath('./td[1]/a/@href').extract_first()
            # The job category column is sometimes empty, which used to raise an error;
            # in that case give it a placeholder value directly, like this:
            # if len(tr.xpath("./td[2]/text()")):
            #     worktype = tr.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            # else:
            #     worktype = "NULL"
            # If nothing raises, extract_first() is enough:
            worktype = tr.xpath('./td[2]/text()').extract_first()
            num = tr.xpath('./td[3]/text()').extract_first()
            location = tr.xpath('./td[4]/text()').extract_first()
            publish_time = tr.xpath('./td[5]/text()').extract_first()

            item = TencentItem()
            item['name'] = name
            item['worktype'] = worktype
            item['url'] = url
            item['num'] = num
            item['location'] = location
            item['publish_time'] = publish_time
            print('----', name)
            print('----', url)
            print('----', worktype)
            print('----', location)
            print('----', num)
            print('----', publish_time)
            yield item

        # Pagination, approach 1:
        # Use this when the page numbers are known in advance; it suits the case where there is
        # no clickable "next page" link and the url has to be built by concatenation.
        # if self.page_num < 3060:
        #     self.page_num += 10
        #     url = self.base_url + str(self.page_num)
        #     # yield scrapy.Request(url=url, callback=self.parse)
        #     yield scrapy.Request(url, callback=self.parse)

        # Approach 2:
        # Extract the "next page" link directly. If the selector below matches nothing,
        # this is not the last page and we can keep crawling; the next-page url only needs
        # to be joined with the site root.
        if len(response.xpath("//a[@id='next' and @class='noactive']")) == 0:
            next_url = response.xpath('//a[@id="next"]/@href').extract_first()
            url = 'https://hr.tencent.com/' + next_url
            yield scrapy.Request(url=url, callback=self.parse)
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
import json
from redis import Redis
import pymongo


# Store to a local file
class TencentPipeline(object):
    f = None

    def open_spider(self, spider):
        self.f = open('./tencent2.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(item['name'] + ':' + item['url'] + ':' + item['num'] + ':'
                     + item['worktype'] + ':' + item['location'] + ':'
                     + item['publish_time'] + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()


# Store to MySQL
class TencentPipelineMysql(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='123', db='tencent')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(
                'insert into tencent values("%s","%s","%s","%s","%s","%s")'
                % (item['name'], item['worktype'], item['url'], item['num'],
                   item['publish_time'], item['location']))
            self.conn.commit()
        except Exception as e:
            print('insert error:', e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Store to Redis
class TencentPipelineRedis(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        item_dic = dict(item)
        item_json = json.dumps(item_dic)
        self.conn.lpush('tencent', item_json)
        return item


# Store to MongoDB
class TencentPipelineMongo(object):
    client = None

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        self.db = self.client['test']

    def process_item(self, item, spider):
        collection = self.db['tencent']
        item_dic = dict(item)
        collection.insert(item_dic)
        return item

    def close_spider(self, spider):
        self.client.close()
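The MySQL pipeline assumes a tencent database containing a six-column tencent table whose column order matches the INSERT statement. The original project does not show the schema, so the snippet below is only a sketch; the column names are assumptions chosen to match the item fields, and it reuses the same connection settings as the pipeline.

# create_table.py -- one-off helper (hypothetical); only the column order must match the
# pipeline's INSERT statement, the column names themselves are assumed.
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='123', db='tencent')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS tencent (
            name VARCHAR(255),
            worktype VARCHAR(255),
            url VARCHAR(255),
            num VARCHAR(32),
            publish_time VARCHAR(64),
            location VARCHAR(255)
        ) DEFAULT CHARSET=utf8mb4
    """)
conn.commit()
conn.close()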
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentPipelineMysql': 301,
    'Tencent.pipelines.TencentPipelineRedis': 302,
    'Tencent.pipelines.TencentPipelineMongo': 303,
}

# The remaining settings generated by the project template (concurrency, download delay,
# cookies, middlewares, AutoThrottle, HTTP cache, etc.) are left at their commented-out defaults.
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    worktype = scrapy.Field()
    location = scrapy.Field()
    num = scrapy.Field()
    publish_time = scrapy.Field()
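With the four files above in place, the crawl is normally started from the project directory with scrapy crawl tencent. As an optional sketch (the file name run.py is just an assumption), it can also be launched from a small script using Scrapy's CrawlerProcess:

# run.py -- start the 'tencent' spider programmatically using the project's settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('tencent')   # spider name as defined in TencentSpider.name
process.start()            # blocks until the crawl finishes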