Crawling All the Campus News
Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002
0. Get the click count from a news URL and wrap the steps into a function (a sketch follows this list):
- newsUrl
- newsId (re.search())
- clickUrl (str.format())
- requests.get(clickUrl)
- re.search()/.split()
- str.lstrip(), str.rstrip()
- int
- wrap the steps above into a function
- also wrap getting the news publication time, with its type conversion, into a function
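A minimal sketch of the two helper functions this step asks for, assuming the same click-count API and show-info layout used in the full code at the end of the post; the {} placeholder in clickUrl is an addition, not in the original code:
import re
import requests
from datetime import datetime

def click(newsurl):
    # newsUrl -> newsId: the article id is the digit run just before .html
    newsid = re.search(r'/(\d+).html', newsurl).group(1)
    # newsId -> clickUrl via str.format()
    clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsid)
    res = requests.get(clickurl)
    # the count sits inside ...html('NNN');, so split, strip the wrapper, cast to int
    return int(res.text.split('.html')[-1].lstrip("('").rstrip("');"))

def newsdt(showinfo):
    # showinfo is assumed to start with "发布时间:YYYY-MM-DD HH:MM:SS ..."
    newsdate = showinfo.split()[0].split(':')[1]
    newstime = showinfo.split()[1]
    return datetime.strptime(newsdate + ' ' + newstime, '%Y-%m-%d %H:%M:%S')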
1. Get the news details from a news URL: a dictionary, anews
2. Get the news URLs from a list-page URL: list.append(dict), alist
3. Generate the URLs of all the list pages and fetch all the news: list.extend(list), allnews
* Each student crawls the 10 list pages starting from the last digit of their student ID (see the sketch below).
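A hedged sketch of one way to pick the 10 list pages from the student-ID tail; tail is only an example value, and alist() is the function defined in the full code below:
tail = 6                          # e.g. a student ID ending in 6
allnews = []
for i in range(tail, tail + 10):
    listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    allnews.extend(alist(listurl))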
4. Set a reasonable crawl interval (placement in the loop is shown after the snippet):
import time
import random
time.sleep(random.random()*3)
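Placed at the end of each iteration of the list-page loop from step 3, the random pause keeps requests spaced out, for example:
for i in range(tail, tail + 10):
    listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    allnews.extend(alist(listurl))
    time.sleep(random.random() * 3)   # wait 0-3 seconds between list pages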
5. Do simple data processing with pandas and save the result
Save to a csv or excel file (a short sketch follows the line below)
newsdf.to_csv(r'F:\duym\爬虫\gzccnews.csv')
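A small sketch of the save step; the utf_8_sig encoding and the to_excel call are additions rather than part of the original code (to_excel needs the openpyxl package installed):
import pandas as pd

newsdf = pd.DataFrame(allnews)              # one row per news dictionary
newsdf.to_csv(r'F:\duym\爬虫\gzccnews.csv', encoding='utf_8_sig')
# newsdf.to_excel(r'F:\duym\爬虫\gzccnews.xlsx')   # Excel alternative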
Full code:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime

# Exploratory checks: pull the news id out of a sample article URL
url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0410/11181.html'
re.match(r'http://news.gzcc.cn/html/2019/xiaoyuanxinwen(.*).html', url).groups(0)
re.search(r'/(\d*).html', url).groups(1)
newsid = re.findall(r'(\d+)', url)[-1]
clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsid)

def click(url):
    # Get the click count of one article from the OA counter API
    news_id = re.findall(r'(\d{1,5})', url)[-1]
    clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(news_id)
    resClick = requests.get(clickurl)
    # The response embeds the count as ...html('NNN');, so strip the wrapper
    newsclick = int(resClick.text.split('.html')[-1].lstrip("('").rstrip("');"))
    return newsclick

def newsdt(showinfo):
    # Parse the publication time out of the .show-info text into a datetime
    newsdate = showinfo.split()[0].split(':')[1]
    newstime = showinfo.split()[1]
    dt = datetime.strptime(newsdate + ' ' + newstime, '%Y-%m-%d %H:%M:%S')
    return dt

def anews(url):
    # Fetch one article page and return its details as a dictionary
    newsdetail = {}
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    showinfo = soup.select('.show-info')[0].text
    newsdetail['newstitle'] = soup.select('.show-title')[0].text
    newsdetail['newsdt'] = newsdt(showinfo)
    newsdetail['newsclick'] = click(url)
    return newsdetail

newsurl = 'http://news.gzcc.cn/html/2019/xibusudi_0411/11189.html'
print(anews(newsurl))

def alist(listurl):
    # Fetch one list page and return a list of article dictionaries
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            newsurl = news.select('a')[0]['href']
            newsdesc = news.select('.news-list-description')[0].text
            newsdict = anews(newsurl)
            newsdict['newsurl'] = newsurl
            newsdict['description'] = newsdesc
            newslist.append(newsdict)
    return newslist

listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
alist(listurl)

# Read the total number of list pages from the pager at the bottom of the index page
res = requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
soup.select('#pages')[0].text
int(re.search(r'..(\d+).下', soup.select('#pages')[0].text).groups(1)[0])

# Crawl the unnumbered first list page plus pages 2-11
listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
allnews = alist(listurl)
for i in range(2, 12):
    listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    allnews.extend(alist(listurl))
print(len(allnews))

# Save the crawled news with pandas
newsdf = pd.DataFrame(allnews)
print(newsdf.head())
newsdf.to_csv(r'F:\gzcc.csv')
Result screenshot: