Python Web Scraping Notes (Part 3)

Chapter 6: Reading Documents

1. Plain Text

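When you process an HTML page, the site usually declares the encoding it uses in the <head> section. Most sites, especially English-language ones, carry a <meta> tag there that names the character set.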
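As a rough sketch of how that declaration could be read, the snippet below uses the standard library's html.parser to pull the charset out of a <meta> tag. The sample HTML string and the names MetaCharsetParser and find_declared_charset are made up for illustration and do not come from the book. It checks both the short charset attribute and the older http-equiv="Content-Type" form.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared in a <meta> tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # Short form: <meta charset="utf-8">
        if attr_map.get("charset"):
            self.charset = attr_map["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
        if attr_map.get("http-equiv", "").lower() == "content-type":
            for part in attr_map.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip().lower()


def find_declared_charset(html_text):
    """Return the declared charset in lowercase, or None if none is found."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


# Hypothetical page used only for this demonstration.
sample_html = '<html><head><meta charset="utf-8"><title>Demo</title></head></html>'
print(find_declared_charset(sample_html))
```

The value found this way could then be passed to decode() in place of a hard-coded encoding, falling back to a default when the function returns None.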

2. CSV

Read the file straight into a string, then wrap that string in a StringIO object so that Python can treat it as a file:

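from urllib.request import urlopen
from io import StringIO
import csv

# Download the CSV and decode the bytes to text, dropping any non-ASCII bytes
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
# Wrap the string so csv.reader can consume it like an open file
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)
# Each row comes back as a list of strings
for row in csvReader:
    print(row)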

Output:

['Name', 'Year']
["Monty Python's Flying Circus", '1970']
['Another Monty Python Record', '1971']
["Monty Python's Previous Record", '1972']
['The Monty Python Matching Tie and Handkerchief', '1973']
['Monty Python Live at Drury Lane', '1974']
['An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
['Monty Python Live at City Center', '1977']
['The Monty Python Instant Record Collection', '1977']
["Monty Python's Life of Brian", '1979']
["Monty Python's Cotractual Obligation Album", '1980']
["Monty Python's The Meaning of Life", '1983']
['The Final Rip Off', '1987']
['Monty Python Sings', '1989']
['The Ultimate Monty Python Rip Off', '1994']
['Monty Python Sings Again', '2014']
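Because csv.reader returns the header as just another row, it is often convenient to pull it off first and convert the year column to a number. The following is a small sketch of that idea; the inline sample_csv string stands in for the downloaded file and contains only two rows, so that the example runs without network access.

```python
import csv
from io import StringIO

# Inline sample standing in for the downloaded file (illustrative rows only).
sample_csv = (
    "Name,Year\n"
    "\"Monty Python's Flying Circus\",1970\n"
    "Another Monty Python Record,1971\n"
)

reader = csv.reader(StringIO(sample_csv))
header = next(reader)  # the first row holds the column names
# Remaining rows: unpack the two columns and convert the year to int
albums = [(name, int(year)) for name, year in reader]

print(header)
print(albums)
```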

Another option is csv.DictReader, which returns each row as a mapping keyed by the column names:

from urllib.request import urlopen
from io import StringIO
import csv

# Download the CSV and decode the bytes to text, dropping any non-ASCII bytes
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
dataFile = StringIO(data)
dictReader = csv.DictReader(dataFile)

# fieldnames is taken from the first row of the file
print(dictReader.fieldnames)
for row in dictReader:
    print(row)

Output:

['Name', 'Year']
OrderedDict([('Name', "Monty Python's Flying Circus"), ('Year', '1970')])
OrderedDict([('Name', 'Another Monty Python Record'), ('Year', '1971')])
OrderedDict([('Name', "Monty Python's Previous Record"), ('Year', '1972')])
OrderedDict([('Name', 'The Monty Python Matching Tie and Handkerchief'), ('Year', '1973')])
OrderedDict([('Name', 'Monty Python Live at Drury Lane'), ('Year', '1974')])
OrderedDict([('Name', 'An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail'), ('Year', '1975')])
OrderedDict([('Name', 'Monty Python Live at City Center'), ('Year', '1977')])
OrderedDict([('Name', 'The Monty Python Instant Record Collection'), ('Year', '1977')])
OrderedDict([('Name', "Monty Python's Life of Brian"), ('Year', '1979')])
OrderedDict([('Name', "Monty Python's Cotractual Obligation Album"), ('Year', '1980')])
OrderedDict([('Name', "Monty Python's The Meaning of Life"), ('Year', '1983')])
OrderedDict([('Name', 'The Final Rip Off'), ('Year', '1987')])
OrderedDict([('Name', 'Monty Python Sings'), ('Year', '1989')])
OrderedDict([('Name', 'The Ultimate Monty Python Rip Off'), ('Year', '1994')])
OrderedDict([('Name', 'Monty Python Sings Again'), ('Year', '2014')])

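This output differs from what the book shows: each row is printed as an OrderedDict, a dictionary type that keeps its keys in insertion order. Which type you see depends on the Python version. csv.DictReader returns OrderedDict rows in Python 3.6 and 3.7, and plain dict rows from Python 3.8 onward.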
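Whichever type the rows have, they are accessed by column name in the same way. A minimal sketch follows, using an inline sample string in place of the downloaded file; the variable names are illustrative. Calling dict() on each row makes the printed form the same on every Python version.

```python
import csv
from io import StringIO

# Inline sample standing in for the downloaded file (illustrative rows only).
sample_csv = "Name,Year\nMonty Python Sings,1989\nThe Final Rip Off,1987\n"

# Convert every row to a plain dict so the output does not depend on the version
rows = [dict(row) for row in csv.DictReader(StringIO(sample_csv))]

# Look values up by column name and build a name -> year mapping
years_by_name = {row["Name"]: int(row["Year"]) for row in rows}

print(rows[0])
print(years_by_name["The Final Rip Off"])
```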

3. PDF, Word (.docx), and MySQL

Skipped for now; I will come back to these when I need them. MySQL in particular deserves a closer look.

posted @ 2019-07-01 10:45 椰汁软糖
