学习进度02 - 雨过山

python爬虫学习：

https://blog.csdn.net/xtingjie/article/details/73465522

#获得网页中的超链接

import urllib.request
from bs4 import BeautifulSoup#用于解析网页

url="https://book.douban.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=url,headers=headers)
html = urllib.request.urlopen(req)
bsObj = BeautifulSoup(html, 'html.parser')
t1 = bsObj.find_all('a')
for t2 in t1:
    t3 = t2.get('href')
    print(t3)

但是爬取到脏数据：

https://blog.csdn.net/weixin_42427638/article/details/80640817

#出现'NoneType' object has no attribute 'decode'报错

将

'label':repo_dict['description']

改为：

'label': str(repo_dict['description'])


https://www.jianshu.com/p/5a06b123035c
#存储数据-txt、json、csv

file = open('info.txt', 'a', encoding='utf-8')
file.write('Hello world!')
file.write('\n')
file.close()

打开方式总分为三种——r,w,a，r是只读模式，w是写入模式，a是追加模式，每种模式又有4种不同方式，以r为例：r,rb,r+,rb+。

r——只读模式打开（read），打开时指针在文件首部，文件要存在，这是默认的模式
w——写入模式打开（write），若文件存在，则覆盖它，不存在则新建文件
a——追加模式打开（add），若文件存在则，在文件尾部开始写入，不存在则新建文件

而b+\b+
可以这么记：b就是以二进制打开，+就是以读写模式打开，然后在加上r\w\a的特性即可。

以r为例：

rb——二进制只读
r+——读写模式
rb+——二进制读写

保存爬取的内容：

file = open("E:/1.txt", "wb")
for t2 in t1:
    t3 = str(t2.get('href'))
    print(t3)
    file.write(t3.encode("utf-8")+'\n'.encode("utf-8"))

file.close()

结果：

posted on 2020-02-02 22:23 雨过山阅读(93) 评论(0) 编辑收藏举报