Python crawler: scraping the titles, links, and publication times from my own blog's home page

Code

# -*- coding: utf-8 -*-
"""
-------------------------------------------------
   File Name:     getCnblogs
   Description :
   Author :       神秘藏宝室
   date:          2017-09-21
-------------------------------------------------
   Change Activity:
                   2017-09-21:
-------------------------------------------------
"""
import requests
from bs4 import BeautifulSoup

# Fetch the blog home page and parse it
res = requests.get('http://www.cnblogs.com/Mysterious/')
res.encoding = 'utf-8'

soup = BeautifulSoup(res.text, 'html.parser')

def getBlogWriteTime(url):
    # Fetch a single post page and return the text of its #post-date element
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup.select('#post-date')[0].text

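# Get each post's title and link, then look up its publication time.
# Python 3 print(); sep=' \t' reproduces the spacing of the output shown below.
for num, pt in enumerate(soup.select('.postTitle2'), start=1):
    print(num, pt.text, pt['href'], getBlogWriteTime(pt['href']), sep=' \t')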
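
The script above indexes `soup.select('#post-date')[0]` and `pt['href']` directly, so a page without those elements would raise an exception. As a minimal sketch (not the author's method), the parsing can be separated from the network request so the selectors can be tried on a plain HTML string. The names `extract_posts`, `extract_post_date`, and `SAMPLE_HTML` are hypothetical, and the sample markup is made up; it only imitates the `.postTitle2` class and `#post-date` id that the script relies on.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption): it only imitates the class and id used above.
SAMPLE_HTML = """
<div><a class="postTitle2" href="http://example.com/p/1.html">First post</a></div>
<div><a class="postTitle2" href="http://example.com/p/2.html">  Second post </a></div>
<div><a class="postTitle2">No link here</a></div>
"""
SAMPLE_POST_HTML = '<span id="post-date">2017-09-18 00:10</span>'


def extract_posts(html):
    """Return (title, href) tuples for every .postTitle2 link that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for link in soup.select('.postTitle2'):
        href = link.get('href')  # .get() returns None instead of raising KeyError
        if href:
            posts.append((link.get_text(strip=True), href))
    return posts


def extract_post_date(html):
    """Return the text of #post-date, or None when the element is missing."""
    node = BeautifulSoup(html, 'html.parser').select_one('#post-date')
    return node.get_text(strip=True) if node else None


for num, (title, href) in enumerate(extract_posts(SAMPLE_HTML), start=1):
    print(num, title, href, sep=' \t')
print(extract_post_date(SAMPLE_POST_HTML))
```

With this split, the text returned by `requests.get(...)` can be passed to the same two functions, and a missing date shows up as `None` rather than an `IndexError`.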

Results

1 	Python爬虫:获取新浪网新闻 	http://www.cnblogs.com/Mysterious/p/7538833.html 	2017-09-18 00:10
2 	运行jupyter notebook 出错 Error executing Jupyter command 'notebook' 	http://www.cnblogs.com/Mysterious/p/7538169.html 	2017-09-17 22:10
3 	安装和使用jupyter 	http://www.cnblogs.com/Mysterious/p/7533607.html 	2017-09-17 00:25
4 	windows下python调用c文件流程 	http://www.cnblogs.com/Mysterious/p/7529228.html 	2017-09-16 00:01
5 	python Unable to find vcvarsall.bat 错误 	http://www.cnblogs.com/Mysterious/p/7529142.html 	2017-09-15 23:30
6 	阿里云公网IP不能使用 	http://www.cnblogs.com/Mysterious/p/7523618.html 	2017-09-14 22:36
7 	Python2 socket TCPServer 多线程并发 超时关闭 	http://www.cnblogs.com/Mysterious/p/7523559.html 	2017-09-14 22:27
8 	Python2 socket 多线程并发 ThreadingTCPServer Demo 	http://www.cnblogs.com/Mysterious/p/7507314.html 	2017-09-11 21:50
9 	Python2 socket 多线程并发 TCPServer Demo 	http://www.cnblogs.com/Mysterious/p/7507221.html 	2017-09-11 21:28
10 	Python socket TCPServer Demo 	http://www.cnblogs.com/Mysterious/p/7507042.html 	2017-09-11 20:59
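
The timestamps printed above follow a `YYYY-MM-DD HH:MM` pattern, so they can be turned into `datetime` objects for sorting or filtering. The following is a small sketch under that assumption; `parse_line` is a hypothetical helper, and the two sample lines are copied from the output above.

```python
from datetime import datetime

# Two lines copied from the output above (fields are separated by " \t").
lines = [
    "1 \tPython爬虫:获取新浪网新闻 \thttp://www.cnblogs.com/Mysterious/p/7538833.html \t2017-09-18 00:10",
    "10 \tPython socket TCPServer Demo \thttp://www.cnblogs.com/Mysterious/p/7507042.html \t2017-09-11 20:59",
]


def parse_line(line):
    """Split one output line into (number, title, url, publication datetime)."""
    num, title, url, stamp = [field.strip() for field in line.split('\t')]
    return int(num), title, url, datetime.strptime(stamp, '%Y-%m-%d %H:%M')


# Sort the parsed rows from oldest to newest by publication time.
rows = sorted((parse_line(line) for line in lines), key=lambda row: row[3])
for num, title, url, written in rows:
    print(written.isoformat(), title, url)
```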
posted on 2017-09-21 23:01  神秘藏宝室

 >>>Please credit the source when reposting<<<