Python Web Crawling and Information Retrieval (2): Page Content Extraction - 未雨愁眸


1. Extracting hyperlinks

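# Absolute http/https/ftp URLs inside double quotes; the non-capturing group (?:...) makes findall return whole URLs instead of tuples
links = re.findall(rb'"((?:http|ftp)s?://.*?)"', html)
# Every href attribute value, relative or absolute
links = re.findall(rb'href="(.*?)"', html)
- html is the HTML content returned for url; it can be obtained in either of the following ways
  - html = urllib.request.urlopen(url).read()  (returns bytes, which is what the b'...' patterns require)
  - html = requests.get(url).content  (also bytes; requests.get(url).text returns str and needs str patterns without the b prefix)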
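
To make the two patterns concrete, here is a minimal sketch that runs them against a small made-up HTML snippet. The sample markup and the example.com URLs are assumptions for illustration only; they do not come from a real page.

```python
import re

# Hypothetical sample page, used only for illustration
html = b'''
<a href="http://example.com/a.pdf">A</a>
<a href="https://example.com/b.html">B</a>
<a href="notes/c.pdf">C</a>
<img src="ftp://example.com/d.png">
'''

# Quoted absolute URLs; the non-capturing group (?:...) keeps findall
# from returning (url, scheme) tuples
absolute_links = re.findall(rb'"((?:http|ftp)s?://.*?)"', html)

# Every href value, relative or absolute
href_links = re.findall(rb'href="(.*?)"', html)

# The patterns are bytes, so the matches are bytes; decode them to str
absolute_links = [u.decode('utf-8') for u in absolute_links]
href_links = [u.decode('utf-8') for u in href_links]

print(absolute_links)
print(href_links)
```

The first pattern picks up any quoted absolute URL (including the img src here), while the second only looks at href attributes and therefore also returns the relative link.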

2. Downloading specific files to a given path

For example, suppose we want to download every PDF file linked from http://courses.cs.vt.edu/~cs2704/fall01/Notes/:

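# coding: UTF-8
import os
import re
from urllib import request
from urllib.parse import urljoin

import requests

url = 'http://courses.cs.vt.edu/~cs2704/fall01/Notes/'
save_dir = 'D:/data/'
os.makedirs(save_dir, exist_ok=True)  # urlretrieve does not create missing directories

r = requests.get(url)
# Collect every href value on the index page
files = re.findall('href="(.*?)"', r.text)

for file in files:
    # Keep only PDF links; this also skips the parent-directory link
    if not file.lower().endswith('.pdf'):
        continue
    # urljoin builds the download URL; os.path.join is meant for filesystem paths
    request.urlretrieve(urljoin(url, file), os.path.join(save_dir, os.path.basename(file)))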
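
The link-selection step can be tried without touching the network. The sketch below is only an illustration: the helper name pdf_urls and the sample directory listing (including its file names) are hypothetical, not taken from the actual course page.

```python
import re
from urllib.parse import urljoin


def pdf_urls(base_url, page_html):
    """Return absolute URLs for every href in page_html that ends in .pdf."""
    hrefs = re.findall(r'href="(.*?)"', page_html)
    # urljoin resolves relative links against the page URL
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith('.pdf')]


# Made-up directory listing, for illustration only
sample = '''
<a href="?C=N;O=D">Name</a>
<a href="/~cs2704/fall01/">Parent Directory</a>
<a href="lecture01.pdf">lecture01.pdf</a>
<a href="slides/lecture02.PDF">lecture02.PDF</a>
<a href="readme.txt">readme.txt</a>
'''
base = 'http://courses.cs.vt.edu/~cs2704/fall01/Notes/'

for link in pdf_urls(base, sample):
    print(link)
```

Filtering on the extension rather than on list position means the result does not depend on how many navigation links the server puts at the top of the listing.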

posted on 2017-07-31 11:21 by 未雨愁眸
