Python Crawler Examples (7): Scraping Sina Military News

Open the Sina military news page and you will see the layout shown below. The first step is to crawl the first-level URLs, the part circled in blue in the screenshot.

[screenshot: the Sina military news roll page, with the first-level list URLs circled in blue]

The second screenshot shows that the list is paginated, so each date has several pages to fetch.

[screenshot: the bottom of the list page, showing the pagination links]
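
Each day's list page URL combines the date and a page number, which is exactly what the spider below builds. A minimal sketch of the pattern (the 4-page loop matches the code):

date = "2018-01-06"
for page in range(1, 5):
    # e.g. http://roll.mil.news.sina.com.cn/col/zgjq/2018-01-06_1.shtml ... _4.shtml
    url = "http://roll.mil.news.sina.com.cn/col/zgjq/%s_%d.shtml" % (date, page)
    print(url)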

Source code:

# coding:utf-8

import re
import datetime
import requests
from bs4 import BeautifulSoup

session = requests.session()

# Python 2 shim: reload(sys) restores setdefaultencoding so UTF-8 becomes the
# default string encoding and Chinese text can be printed without errors
import sys
reload(sys)
sys.setdefaultencoding('utf8')
# generate a year's worth of date strings (start inclusive, end exclusive)
def dateRange(start, end, step=1, format="%Y-%m-%d"):
    strptime, strftime = datetime.datetime.strptime, datetime.datetime.strftime
    days = (strptime(end, format) - strptime(start, format)).days
    return [strftime(strptime(start, format) + datetime.timedelta(i), format) for i in xrange(0, days, step)]
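
# Example: dateRange("2018-01-01", "2018-01-04") returns
# ['2018-01-01', '2018-01-02', '2018-01-03'] (the end date is excluded);
# spider() then reverses the list so crawling starts from the newest date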




def spider():

    date_list = dateRange("2017-01-01", "2018-01-06")[::-1]
    print date_list
    for date in date_list:
        for page in range(1,5):
            # build the list-page URL for this date and page number
            url = "http://roll.mil.news.sina.com.cn/col/zgjq/" + str(date)+"_"+ str(page) +".shtml"
            # spoof browser request headers
            headers = {
                "Host": "roll.mil.news.sina.com.cn",
                "Cache-Control": "max-age=0",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.8",
                "If-Modified-Since": "Sat, 06 Jan 2018 09:57:24 GMT",
            }

            result = session.get(url=url,headers=headers).content
            # the page is served as gb2312; let BeautifulSoup handle the decoding
            soup = BeautifulSoup(result,'html.parser')
            # locate the news list container; some dates have fewer than 4 pages,
            # so skip a page if the list is missing
            result_divs = soup.find_all('div',attrs={"class":"fixList"})
            if not result_divs:
                continue
            result_div = result_divs[0]
            # strip newlines, carriage returns and tabs so the regexes match on one line
            result_replace = str(result_div).replace('\n','').replace('\r','').replace('\t','')
            # pull out each <li> entry
            result_list = re.findall('<li>(.*?)</li>',result_replace)

            for i in result_list:
                # extract the article url, title and publish time from each entry

                news_url = re.findall('<a href="(.*?)" target=',i)[0]
                news_name = re.findall('target="_blank">(.*?)</a>',i)[0]
                news_time = re.findall('<span class="time">\((.*?)\)</span>',i)[0]

                print news_url
                print news_name
                print news_time






spider()
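
The listing above targets Python 2 (print statements, xrange, and the reload(sys) encoding shim). A minimal sketch of a Python 3 equivalent of the date helper is below; the rest of the script only needs print(...) calls and the setdefaultencoding lines removed:

import datetime

def dateRange(start, end, step=1, format="%Y-%m-%d"):
    # same behaviour: start date inclusive, end date exclusive
    strptime = datetime.datetime.strptime
    days = (strptime(end, format) - strptime(start, format)).days
    return [(strptime(start, format) + datetime.timedelta(i)).strftime(format)
            for i in range(0, days, step)]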
