Data Acquisition and Fusion Technology Practice: Assignment 3
Gitee repository link: Gitee repository link
102102141 周嘉辉
Assignment ①
- Pick a website and crawl every image on it, for example the China Weather site (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both single-threaded and multi-threaded modes (see the settings sketch below).
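Scrapy is driven by an asynchronous downloader rather than explicit threads, so the single-threaded versus multi-threaded requirement is usually approximated with its concurrency settings. A minimal settings.py sketch (the option names are standard Scrapy settings; the values are only examples):

```python
# settings.py -- sketch of the two crawl modes

# "Single-threaded" style: at most one request in flight at a time
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1              # seconds to wait between requests

# "Multi-threaded" style: allow many requests to run concurrently
# CONCURRENT_REQUESTS = 16
# CONCURRENT_REQUESTS_PER_DOMAIN = 8
```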
Partial code:
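The spider below fills a W0Item that has a single url field. A minimal items.py sketch under that assumption (not part of the original snippet; the project's real file may define more fields):

```python
# items.py -- minimal sketch; only the field used by the spider below
import scrapy


class W0Item(scrapy.Item):
    url = scrapy.Field()
```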
import scrapy
from bs4 import BeautifulSoup

from ..items import W0Item  # adjust to the project's actual items module


class weatherSpider(scrapy.Spider):
    name = "w0"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["http://www.weather.com.cn/"]

    def parse(self, response):
        print('=====================================================================')
        items = []
        soup = BeautifulSoup(response.body, "html.parser")
        # Collect the src attribute of every <img> tag on the page
        for img_tag in soup.find_all("img"):
            url = img_tag.get("src")
            if not url:
                continue
            # Resolve relative image paths against the page URL
            url = response.urljoin(url)
            i = W0Item()
            i['url'] = url
            print(url)
            items.append(i)
        print('=====================================================================')
        return items
import urllib.request
from random import randint


class Img_downloadPipeline:
    def process_item(self, item, spider):
        print('download...')
        print(item)
        url = item['url']
        # Save the image under a random name; the .jpg suffix is hard-coded
        filename = '.\\imgs\\' + str(randint(-9999999999, 9999999999)) + '.jpg'
        urllib.request.urlretrieve(url, filename)
        return item
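For the pipeline to run, it has to be enabled in settings.py. A sketch, assuming the Scrapy project module is called w0 (the real module name may differ):

```python
# settings.py -- 'w0' is an assumed project/module name
ITEM_PIPELINES = {
    "w0.pipelines.Img_downloadPipeline": 300,
}
```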
Result:
Reflections: simply appending a .jpg suffix to every file name is not a great approach.
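One way to avoid the hard-coded suffix would be to take the extension from the image URL itself. A small sketch of that idea (not part of the submitted code; the helper name is made up):

```python
import os
import uuid
from urllib.parse import urlparse


def filename_from_url(url, default_ext=".jpg"):
    """Build a unique file name whose extension is taken from the URL path."""
    ext = os.path.splitext(urlparse(url).path)[1] or default_ext
    return uuid.uuid4().hex + ext
```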
Gitee repository link: Gitee repository link
Assignment ②
- Become familiar with the serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL database storage as the technical route to crawl stock information.
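The spider fills the item fields id, number, name and value. A minimal items.py sketch covering only those fields (an assumption; the real definition may differ):

```python
# items.py -- minimal sketch of the fields used below
import scrapy


class W1Item(scrapy.Item):
    id = scrapy.Field()       # running row number
    number = scrapy.Field()   # stock code (f12)
    name = scrapy.Field()     # stock name (f14)
    value = scrapy.Field()    # latest price (f2)
```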
import json
import re

import scrapy

from ..items import W1Item  # adjust to the project's actual items module


class eastmoneySpider(scrapy.Spider):
    name = "w1"
    allowed_domains = ["eastmoney.com"]
    start_urls = ["http://54.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124015380571520090935_1602750256400&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1602750256401"]
    count = 1

    def parse(self, response):
        print('==============================begin=======================================')
        data = response.text
        # Strip the JSONP wrapper and parse the remaining JSON
        start = data.index('{')
        data = json.loads(data[start:len(data) - 2])
        if data['data']:
            # Pick the fields we need out of every stock record
            for stock in data['data']['diff']:
                item = W1Item()
                item['id'] = str(self.count)
                self.count += 1
                item["number"] = str(stock['f12'])
                item["name"] = stock['f14']
                item["value"] = None if stock['f2'] == "-" else str(stock['f2'])
                yield item
            # Read the current page number from the URL and request the next page
            pn = re.compile("pn=[0-9]*").findall(response.url)[0]
            page = int(pn[3:])
            url = response.url.replace("pn=" + str(page), "pn=" + str(page + 1))
            yield scrapy.Request(url=url, callback=self.parse)
        print('===============================end======================================')
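The parse() method above first strips the JSONP callback wrapper before handing the payload to json.loads. A tiny standalone illustration of the same trick with a made-up payload:

```python
import json

raw = 'jQuery1124_1602750256400({"data": {"diff": [{"f12": "000001", "f14": "平安银行", "f2": 10.5}]}});'
start = raw.index('{')                          # position of the opening brace
payload = json.loads(raw[start:len(raw) - 2])   # drop the trailing ");"
print(payload['data']['diff'][0]['f14'])        # -> 平安银行
```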
import sqlite3


class writeDB:
    def open_spider(self, spider):
        self.fp = sqlite3.connect('test.db')
        # SQL statement that creates the table
        sql_text_1 = '''CREATE TABLE scores
                        (id TEXT,
                        代码 TEXT,
                        名称 TEXT,
                        价格 TEXT);'''
        # Execute the SQL statement
        self.fp.execute(sql_text_1)
        self.fp.commit()

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        # Use a parameterized INSERT instead of string concatenation,
        # which also stores a missing price as NULL instead of failing
        sql_text_1 = "INSERT INTO scores VALUES(?, ?, ?, ?)"
        self.fp.execute(sql_text_1, (item['id'], item['number'], item['name'], item['value']))
        self.fp.commit()
        return item
Result:
Reflections: robots.txt is only a gentlemen's agreement.
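The remark above refers to Scrapy honouring robots.txt by default; the behaviour is controlled by a standard option in settings.py:

```python
# settings.py -- Scrapy obeys robots.txt unless this is switched off
ROBOTSTXT_OBEY = False
```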
Gitee repository link: Gitee repository link
Assignment ③
- Become familiar with the serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL database storage as the technical route to crawl data from a foreign exchange website.
Complete code:
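The spider fills the item fields currency, tbp, cbp, tsp, csp and time. A minimal items.py sketch covering only those fields (an assumption; the real definition may differ):

```python
# items.py -- minimal sketch of the fields used below
import scrapy


class W2Item(scrapy.Item):
    currency = scrapy.Field()  # currency name
    tbp = scrapy.Field()       # telegraphic transfer buying price
    cbp = scrapy.Field()       # cash buying price
    tsp = scrapy.Field()       # telegraphic transfer selling price
    csp = scrapy.Field()       # cash selling price
    time = scrapy.Field()      # publish time
```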
import scrapy
from bs4 import BeautifulSoup

from ..items import W2Item  # adjust to the project's actual items module


class bocSpider(scrapy.Spider):
    name = "w2"
    allowed_domains = ["boc.cn"]
    start_urls = ["https://www.boc.cn/sourcedb/whpj/"]
    count = 0

    def parse(self, response):
        print('==============================begin=======================================')
        bs_obj = BeautifulSoup(response.body, features='lxml')
        # The exchange-rate data sits in the second <table> on the page
        t = bs_obj.find_all('table')[1]
        all_tr = t.find_all('tr')
        all_tr.pop(0)  # drop the header row
        for r in all_tr:
            item = W2Item()
            all_td = r.find_all('td')
            item['currency'] = all_td[0].text
            item['tbp'] = all_td[1].text
            item['cbp'] = all_td[2].text
            item['tsp'] = all_td[3].text
            item['csp'] = all_td[4].text
            item['time'] = all_td[6].text
            print(all_td)
            yield item
        # Follow the next page until five pages have been crawled
        self.count += 1
        url = 'http://www.boc.cn/sourcedb/whpj/index_{}.html'.format(self.count)
        if self.count != 5:
            yield scrapy.Request(url=url, callback=self.parse)
        print('===============================end======================================')
import sqlite3


class writeDB:
    def open_spider(self, spider):
        self.fp = sqlite3.connect('test.db')
        # SQL statement that creates the table
        sql_text_1 = '''CREATE TABLE scores
                        (Currency TEXT,
                        TBP TEXT,
                        CBP TEXT,
                        TSP TEXT,
                        CSP TEXT,
                        TIME TEXT);'''
        # Execute the SQL statement
        self.fp.execute(sql_text_1)
        self.fp.commit()

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        # Use a parameterized INSERT instead of string concatenation
        sql_text_1 = "INSERT INTO scores VALUES(?, ?, ?, ?, ?, ?)"
        self.fp.execute(sql_text_1, (item['currency'], item['tbp'], item['cbp'],
                                     item['tsp'], item['csp'], item['time']))
        self.fp.commit()
        return item
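A quick way to check what the pipeline actually wrote, assuming the test.db file and the scores table created above:

```python
import sqlite3

con = sqlite3.connect('test.db')
for row in con.execute("SELECT * FROM scores LIMIT 5"):
    print(row)
con.close()
```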
Result:
Reflections: I referred to code written by students from previous years; XPath is really hard to use.
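For reference, the same rows could also be selected with Scrapy's built-in XPath selectors instead of BeautifulSoup. A rough sketch, assuming the rate table is still the second <table> on the page:

```python
# Sketch of an XPath version of the extraction inside parse()
rows = response.xpath('//table')[1].xpath('.//tr')[1:]   # skip the header row
for r in rows:
    tds = r.xpath('./td')
    currency = tds[0].xpath('string(.)').get()
    tbp = tds[1].xpath('string(.)').get()
    # ...the remaining columns follow the same pattern as the BeautifulSoup version
```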
Gitee repository link: Gitee repository link