Data Collection and Fusion Technology - Assignment 5

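Assignment ①

1.1 Assignment description
- Requirements:
  - Become proficient with Selenium for locating HTML elements, scraping Ajax-loaded page data, and waiting for HTML elements.
  - Use the Selenium framework to scrape the information and images for one category of products on JD.com.
- Candidate site: http://www.jd.com/
- Output: the data stored in MySQL looks as follows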
  
  mNo	mMark	mPrice	mNote	mFile
  000001	三星Galaxy	9199.00	三星Galaxy Note20 Ultra 5G...	000001.jpg
  000002......
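
The output format above maps naturally onto a five-column table. The following is a minimal sketch, not the assignment's actual database class: it uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server, and the table name `phones` and the helper `insert_row` are hypothetical.

```python
import sqlite3

# Stand-in for the MySQL table (assumption: sqlite3 replaces MySQL here only
# so the sketch runs anywhere). Column names follow the assignment's format.
SCHEMA = """
CREATE TABLE IF NOT EXISTS phones (
    mNo    TEXT PRIMARY KEY,
    mMark  TEXT,
    mPrice TEXT,
    mNote  TEXT,
    mFile  TEXT
)
"""


def insert_row(conn, mNo, mMark, mPrice, mNote, mFile):
    """Insert one product row with a parameterized query (hypothetical helper)."""
    conn.execute(
        "INSERT INTO phones (mNo, mMark, mPrice, mNote, mFile) VALUES (?, ?, ?, ?, ?)",
        (mNo, mMark, mPrice, mNote, mFile),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert_row(conn, "000001", "三星Galaxy", "9199.00",
           "三星Galaxy Note20 Ultra 5G...", "000001.jpg")
rows = conn.execute("SELECT mNo, mPrice, mFile FROM phones").fetchall()
print(rows)
```

With a MySQL driver such as pymysql the placeholder is `%s` rather than `?`; passing the values as parameters instead of formatting them into the SQL string keeps quotes in product titles from breaking the statement.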
1.2 Code and procedure
- 1.2.1 Procedure:

Copy the element's XPath and pass in the search keyword

Click the search button

but = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
but.click()
time.sleep(10)

Scroll down the page in steps

for i in range(33):
  self.driver.execute_script("var a = window.innerHeight;window.scrollBy(0,a*0.5);")
  time.sleep(0.5)
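
The loop above scrolls a fixed 33 times, which may be too few or too many depending on the page length. One possible alternative, sketched below under assumptions, is to keep scrolling until the viewport reaches the bottom of the document. `scroll_to_bottom` is a hypothetical helper; it only relies on the driver's `execute_script` method, and `StubDriver` is a stand-in that imitates a 3000 px page so the sketch runs without a browser.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_steps=100):
    """Scroll half a viewport at a time until the bottom is reached.

    Returns the number of scroll steps performed (hypothetical helper).
    """
    for step in range(1, max_steps + 1):
        driver.execute_script("window.scrollBy(0, window.innerHeight * 0.5);")
        time.sleep(pause)  # give newly loaded items time to render
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight;"
        )
        if at_bottom:
            return step
    return max_steps


class StubDriver:
    """Stand-in for a browser: a 3000 px page seen through a 1000 px viewport."""

    def __init__(self):
        self.y, self.viewport, self.height = 0, 1000, 3000

    def execute_script(self, script):
        if script.startswith("window.scrollBy"):
            # Move down half a viewport, but never past the end of the page
            self.y = min(self.y + self.viewport // 2, self.height - self.viewport)
            return None
        return self.viewport + self.y >= self.height


steps = scroll_to_bottom(StubDriver(), pause=0)
print(steps)
```

The `max_steps` cap keeps the loop from running forever on pages that keep appending content.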

Inspecting the product page shows that each product's information sits in an li tag
html = self.driver.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')  # collect every li tag

Iterate over each li and extract the fields

for li in html:
	try:
		# The first <font> inside the title is used as the brand (mMark)
		mMark = li.find_element_by_xpath('./div//div[@class="p-name"]/a/em/font[1]').text
		print(mMark)
	except Exception:
		# Not every title has a <font> node; fall back to a blank brand
		mMark = " "
	mPrice = li.find_element_by_xpath('./div//div[@class="p-price"]/strong/i').text
	mNote = li.find_element_by_xpath('./div//div[@class="p-name"]/a/em').text
	src = li.find_element_by_xpath('./div//div[@class="p-img"]/a/img').get_attribute('src')
	self.picSave(src)
	self.db.insert(self.count, mMark, mPrice, mNote, str(self.count)+".jpg")
	self.count += 1
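
The sample output shows `mNo` and `mFile` zero-padded to six digits (`000001`, `000001.jpg`), whereas `str(self.count)+".jpg"` yields `1.jpg` if `self.count` is a plain integer. Assuming it is an integer, the padding could be produced as in the sketch below; `product_no` and `image_name` are hypothetical helper names.

```python
def product_no(count, width=6):
    """Format a running counter as a zero-padded product number."""
    return f"{count:0{width}d}"


def image_name(count, ext=".jpg"):
    """Build the image file name that matches the product number."""
    return product_no(count) + ext


print(product_no(1), image_name(1))
```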

Go to the next page

if self.page < 2:
	self.page += 1
	nextPage = self.driver.find_element_by_xpath('//*[@id="J_bottomPage"]/span[1]/a[9]')
	nextPage.click()
	# Run the scraping function again on the new page
	self.Mining()
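
The snippet above turns pages by having the scraping function call itself. An equivalent loop-based structure is sketched below as one possible alternative; `crawl_pages` and its two callbacks are hypothetical names, and the demo uses plain Python state instead of a browser.

```python
def crawl_pages(scrape_page, go_to_next_page, max_pages):
    """Scrape up to max_pages pages in a loop instead of by recursion.

    scrape_page():     scrapes whatever page is currently shown.
    go_to_next_page(): moves forward, returning False when there is no next page.
    """
    pages_done = 0
    while pages_done < max_pages:
        scrape_page()
        pages_done += 1
        if pages_done == max_pages or not go_to_next_page():
            break
    return pages_done


# Demo with an in-memory "site" of 5 pages (stand-in for the browser).
visited = []
state = {"page": 1}


def scrape():
    visited.append(state["page"])


def go_next():
    if state["page"] >= 5:
        return False
    state["page"] += 1
    return True


done = crawl_pages(scrape, go_next, max_pages=3)
print(done, visited)
```

A loop keeps the page limit in one place and avoids growing the call stack when many pages are scraped.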

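1.3 Results:

1.4 Takeaways
- When scraping, first collect the li node of each product, then loop over those nodes to extract the individual fields.
- Learned how to simulate a search.
- Sleep times must be set during scraping so that the page can finish loading.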
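
A fixed sleep either wastes time or is too short on a slow connection. A common alternative is to poll for a condition until it holds or a timeout expires, which is what Selenium's `WebDriverWait` does. The sketch below shows the idea in plain Python; `wait_until` is a hypothetical helper, and the `ready` function merely simulates a page that finishes loading on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Demo: the simulated page becomes "ready" on the third check.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return calls["n"] >= 3


ok = wait_until(ready, timeout=2.0, poll=0.01)
print(ok, calls["n"])
```

In a real scraper the condition would be something like "the product list contains at least one li element".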

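Assignment ②

2.1 Assignment description

Requirements:
- Become proficient with Selenium for locating HTML elements, simulating user login, scraping Ajax-loaded page data, and waiting for HTML elements.
- Use Selenium with MySQL to log in to the MOOC site and save the information for the courses already taken in the student's own account to MySQL (course ID, course name, institution, progress, course status, and course image URL). Also store the images in the imgs folder under the local project root, naming each image after its course.

Candidate site: China University MOOC: https://www.icourse163.org

Output: MySQL storage and output format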

Id	cCourse	cCollege	cSchedule	cCourseStatus	cImgUrl
1	Python网络爬虫与信息提取	北京理工大学	已学3/18课时	2021年5月18日已结束	http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg
2......

2.2 Code and procedure
- 2.2.1 Procedure

      # Login entry
      DL = self.driver.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[3]/div')
      DL.click()
      # Click "other login methods"
      QTDL = self.driver.find_element_by_xpath('//span[@class="ux-login-set-scan-code_ft_back"]')
      QTDL.click()
      # Click phone login
      phoneDL = self.driver.find_element_by_xpath('//ul[@class="ux-tabs-underline_hd"]/li[2]')
      phoneDL.click()
      # Switch into the iframe that holds the login form
      phoneI = self.driver.find_element_by_xpath('//div[@class="ux-login-set-container"][@id="j-ursContainer-1"]/iframe')
      self.driver.switch_to.frame(phoneI)
      # Enter the phone number
      phoneNum = self.driver.find_element_by_xpath('//*[@id="phoneipt"]')
      phoneNum.send_keys(USERNAME)
      # Enter the password
      phonePassword = self.driver.find_element_by_xpath('//div[@class="u-input box"]/input[2]')
      phonePassword.send_keys(PASSWORD)
      # Click log in and wait for the login to complete
      DlClick = self.driver.find_element_by_xpath('//*[@id="submitBtn"]')
      DlClick.click()
      time.sleep(10)

      # Go to "My Courses"
      myClass = self.driver.find_element_by_xpath('//div[@class="_1Y4Ni"]/div')
      myClass.click()
      time.sleep(5)
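
The login code above uses the constants `USERNAME` and `PASSWORD`, whose definition is not shown. One way to supply them without writing the account details into the script is to read them from environment variables, as in the sketch below. The variable names `MOOC_USERNAME` and `MOOC_PASSWORD` and the helper `load_credentials` are assumptions for illustration, and the demo passes a plain dictionary with placeholder values.

```python
import os


def load_credentials(env=os.environ):
    """Read the login phone number and password from environment variables.

    MOOC_USERNAME and MOOC_PASSWORD are hypothetical variable names.
    """
    username = env.get("MOOC_USERNAME")
    password = env.get("MOOC_PASSWORD")
    if not username or not password:
        raise RuntimeError("set MOOC_USERNAME and MOOC_PASSWORD before running")
    return username, password


# Demo with a dictionary standing in for the real environment.
USERNAME, PASSWORD = load_credentials(
    {"MOOC_USERNAME": "13800000000", "MOOC_PASSWORD": "example-password"}
)
print(USERNAME)
```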

Inspecting the course page shows that each course's information sits in its own div tag

  # Collect the div tag of each course
  html = self.driver.find_elements_by_xpath('//*[@id="j-coursewrap"]/div/div/div')

Iterate over each div and extract the fields

# Extract the fields of each course
for item in html:
    print(self.count)
    cCourse = item.find_element_by_xpath('./div//div[@class="text"]/span[@class="text"]').text
    print(cCourse)
    cCollege = item.find_element_by_xpath('./div//div[@class="school"]/a').text
    print(cCollege)
    cSchedule = item.find_element_by_xpath('./div//div[@class="text"]/a/span').text
    print(cSchedule)
    cCourseStatus = item.find_element_by_xpath('./div//div[@class="course-status"]').text
    print(cCourseStatus)
    src = item.find_element_by_xpath('./div//div[@class="img"]/img').get_attribute("src")
    print(src)
    self.picSave(src, cCourse)
    self.db.insert(self.count, cCourse, cCollege, cSchedule, cCourseStatus, src)
    self.count += 1
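
Because each image is named after its course, a course title containing characters such as `/` or `:` would produce an invalid file name on some systems. The body of `picSave` is not shown, so the following is only a sketch of how the name could be cleaned before saving; `safe_filename` is a hypothetical helper.

```python
import re


def safe_filename(course_name, ext=".jpg"):
    """Turn a course title into a file name that common file systems accept."""
    # Replace the characters that Windows rejects in file names
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", course_name).strip()
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or "untitled") + ext


print(safe_filename("Python网络爬虫与信息提取"))
print(safe_filename("C/C++: 入门"))
```

The resulting name can then be joined with the imgs folder, for example with `os.path.join("imgs", safe_filename(cCourse))`.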

Go to the next page

nextPage = self.driver.find_element_by_xpath('//*[@id="j-coursewrap"]/div/div[2]/ul/li[4]/a')
# Turn the page unless the "next" button is disabled
if nextPage.get_attribute('class') != "th-bk-disable-gh":
    print(1)
    nextPage.click()
    time.sleep(5)
    # Run the scraping function again on the new page
    self.Mining()
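
The check above compares the whole `class` attribute with `"th-bk-disable-gh"`, which only works while the button carries exactly that one class. If the element ever has several classes, a token-based test is more robust. The sketch below assumes the attribute is a space-separated class list; `has_class` is a hypothetical helper and `th-bk-main-gh` is a made-up second class used only for the demo.

```python
def has_class(class_attr, name):
    """Return True if `name` is one of the space-separated classes.

    `class_attr` may be None, which is what get_attribute returns
    when the element has no class attribute at all.
    """
    return name in (class_attr or "").split()


DISABLED = "th-bk-disable-gh"

print(has_class("th-bk-disable-gh", DISABLED))
print(has_class("th-bk-main-gh th-bk-disable-gh", DISABLED))
print(has_class("th-bk-main-gh", DISABLED))
```

With this helper the condition would read `if not has_class(nextPage.get_attribute('class'), DISABLED):`.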

2.3 Results:

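Assignment ③

3.1 Assignment description
- Requirements: understand the relevant big data services and become familiar with Xshell
  - Complete the tasks in the document 华为云_大数据实时分析处理实验手册-Flume日志采集实验（部分）v2.docx (the Huawei Cloud lab manual for real-time big data analysis, Flume log collection lab, partial, v2), namely the 5 tasks below; see the document for the detailed steps.
  - Environment setup
    - Task 1: Enable the MapReduce Service
  - Real-time analysis development in practice:
    - Task 1: Generate test data with a Python script
    - Task 2: Configure Kafka
    - Task 3: Install the Flume client
    - Task 4: Configure Flume to collect data
3.2 Results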
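- Task 1: Generate test data with a Python script
  Run the Python file

  View the generated data
- Task 2: Configure Kafka
  Run source

- Task 3: Install the Flume client
  Finally, install Flume

  Restart the service
- Task 4: Configure Flume to collect data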