理解爬虫原理

本次作业要求来自于：https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2881

1. 简单说明爬虫原理

程序通过模拟浏览器请求站点，把站点返回的HTML代码、JSON数据、图片视频数据爬到本地，进而提取需要的数据。

2. 理解爬虫开发过程

1）浏览器工作原理

用户输入URL->解析URL->网络连接->服务器相应请求，返回数据->浏览器加载、渲染界面

2）使用requests库抓取网站数据

import requests
url = 'https://v4.bootcss.com/docs/4.0/getting-started/introduction/'
res = requests.get(url).content.decode('utf-8')
print(res)

3）了解网页

html_sample = '''
<html> 
    <body> 
          <h1 id = 'title'>崩坏3</h1>
          <div id = 'div1'>原罪深渊</div>
          <div id = 'div2'>苦痛深渊</div>
          <div id = 'div3'>红莲深渊</div>
          <a href="https://www.bh3.com/index.html" class="link"> 官网</a>
    </body> 
</html> '''

4）使用Beautiful Soup解析网页

from bs4 import BeautifulSoup
soups = BeautifulSoup(html_sample,'html.parser')
print(soups.h1.text)
print(soups.select('div'))
print(soups.select('div')[0].text)
print(soups.select('#div2')[0].text)
print(soups.select(".link")[0].text)

3.提取一篇校园新闻的标题、发布时间、发布单位、作者、点击次数、内容等信息

import requests
from bs4 import BeautifulSoup
from datetime import datetime
url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0331/11110.html'
res = requests.get(url)
res.encoding = 'utf-8'
#解析html内容
soups = BeautifulSoup(res.text,'html.parser')
#获取新闻标题
title = str(soups.select('title')[0].text).rstrip('- 校园新闻 - 广州商学院新闻网')

info = soups.select('.show-info')[0].text.split()
print(info)
retime = info[0] + ' ' +info[1]
#获取发布时间
retime = retime.lstrip('发布时间:')
#发布时间转换成datetime类型
retime = datetime.strptime(retime,'%Y-%m-%d %H:%M:%S')
#获取新闻作者
author = info[2].split('：')[1]
#获取审核
examine = info[3].split('：')[1]
#获取新闻来源
source = info[4].split('：')[1]
#获取点击次数,转换点击次数为int类型
clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id=11110&modelid=80'
clickStruct = requests.get(clickUrl).text.split('html')[1][2:4]
clickCount = int(clickStruct)
#获取新闻内容
newsContent = soups.select('.show-content')[0].text.strip()
print('标题:',title,'\n发布时间:',retime,'\n审核:',examine,'\n新闻来源:',source,'\n点击次数:',clickCount,'\n内容:',newsContent,)

posted @ 2019-04-01 14:28 梁林森阅读(209) 评论(0) 编辑收藏举报

刷新页面返回顶部

理解爬虫原理

公告