Link Crawler

import re
import urllib.request

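def getlink(url):
    # Pretend to be a browser: the header name must be "User-Agent".
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]

    # Install the opener globally so that urlopen() sends the header.
    urllib.request.install_opener(opener)
    file = urllib.request.urlopen(url)
    # Decode the bytes; str(bytes) would give the "b'...'" repr, not the page text.
    # UTF-8 is assumed here, and undecodable bytes are ignored.
    data = file.read().decode("utf-8", errors="ignore")

    # Match http(s) URLs containing a dot; the inner group is non-capturing,
    # so findall() returns plain strings instead of tuples.
    pat = r'https?://[^\s";]+\.(?:\w|/)*'
    link = re.compile(pat).findall(data)
    # Remove duplicate links.
    link = list(set(link))
    return link

url = "http://blog.csdn.net/"
linklist = getlink(url)
for link in linklist:
    print(link)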
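
Note that install_opener changes the default opener for every later urlopen call in the process. If that global side effect is unwanted, the browser-like header can be attached to a single request instead. The following is only a minimal sketch of that alternative, not the method used in this post; the helper name build_request and the shortened User-Agent string are assumptions made for illustration.

```python
import urllib.request

# Hypothetical, shortened User-Agent string used only for illustration.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64)"

def build_request(url):
    """Return a Request object carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

# Only the request object is built here; nothing is sent over the network.
req = build_request("http://blog.csdn.net/")
```

Passing req to urllib.request.urlopen(req) would then fetch the page with that header, leaving the default opener untouched.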

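(1) Decide on the entry link to crawl.

(2) Build a regular expression for link extraction according to your needs.

(3) Disguise the crawler as a browser and fetch the corresponding page.

(4) Use the regular expression from (2) to extract the links contained in that page.

(5) Filter out duplicate links.

(6) Carry out any follow-up operations.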
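
Steps (4) and (5) can be tried without any network access by running the extraction on a fixed string. The sketch below is an illustration under assumptions rather than the post's exact code: the function name extract_links, the sample HTML, and the example.com / example.org URLs are all hypothetical, and the pattern is a non-capturing variant of the one above so that findall returns plain strings. It also uses dict.fromkeys instead of set, which removes duplicates while keeping the links in order of first appearance.

```python
import re

# Non-capturing variant of the link pattern, so findall() returns strings.
LINK_PATTERN = re.compile(r'https?://[^\s";]+\.(?:\w|/)*')

def extract_links(html):
    """Return the unique links found in html, in order of first appearance."""
    # dict.fromkeys drops duplicates but, unlike set, preserves insertion order.
    return list(dict.fromkeys(LINK_PATTERN.findall(html)))

# Hypothetical sample page containing one duplicated link.
sample = (
    '<a href="http://example.com/a.html">A</a> '
    '<a href="http://example.com/a.html">A again</a> '
    '<a href="https://example.org/b/c.php">B</a>'
)
links = extract_links(sample)
print(links)
```

Whether order matters depends on the follow-up operations in step (6); for a simple membership check, a set is sufficient.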

posted @ 2018-01-11 13:16 一只宅男的自我修养
