Python Web - Week 4 - Programs that Surf the Web (Using Python to Access Web Data)

1.Understanding HTML

1. The simplest crawler

import urllib

# Open the page and print it line by line (Python 2)
fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print line.strip()
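
The crawler above is written for Python 2. As a rough sketch only, the same idea in Python 3 might look like the following, where `urllib.request.urlopen` takes the place of `urllib.urlopen`. The helper name `fetch_lines` and the sample page are made up for illustration, and the demo uses a `data:` URL so that it runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_lines(url):
    """Return the body at `url` as a list of stripped text lines."""
    with urlopen(url) as fhand:
        # The response body is bytes in Python 3, so decode it first.
        text = fhand.read().decode("utf-8")
    return [line.strip() for line in text.splitlines()]


# Hypothetical stand-in page, embedded in a data: URL for an offline demo.
sample_html = "<h1>The First Page</h1>\n  <p>Hello</p>\n"
sample_url = "data:text/html," + quote(sample_html)

for line in fetch_lines(sample_url):
    print(line)
```

Passing 'http://www.dr-chuck.com/page1.htm' to `fetch_lines` instead of the sample URL should fetch the real page in the same way.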

2. Fetching a page with Python vs. visiting it directly in a browser

3.Scrape

2.Parsing HTML with BeautifulSoup

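1. This time we go straight to the easy approach: BeautifulSoup

2. Installing BeautifulSoup

1. Download it from http://www.crummy.com/software/BeautifulSoup/#Download

2. Unzip the downloaded file and copy it into the C:\Python27 directory

3. In CMD, cd into that directory and run python setup.py install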
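
One way to confirm that the install steps worked is to ask Python whether the `bs4` package can be found. The sketch below is written for Python 3 (it relies on `importlib.util.find_spec`), and the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the package name that BeautifulSoup 4 installs under.
if is_installed("bs4"):
    print("BeautifulSoup is installed")
else:
    print("bs4 not found; re-run the install step")
```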

3. First try with BeautifulSoup (also a first try at using a Python library)

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags and print each href
tags = soup('a')
for tag in tags:
    print tag.get('href', None)

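Notes:

1. BeautifulSoup takes a second argument after the page content: the parser name, here "html.parser"

2. How BS is imported: from bs4 import BeautifulSoup

More BS tutorials: http://cuiqingcai.com/1319.html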
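
To make it clearer what the anchor-tag loop is doing, here is a rough standard-library stand-in for it. This is not BeautifulSoup and not how BeautifulSoup is implemented; it swaps in Python 3's `html.parser.HTMLParser` to collect the same `href` values. The class name `LinkCollector` and the sample HTML are assumptions made for the demo.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, or None if it has no href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs.
            self.hrefs.append(dict(attrs).get("href"))


# Hypothetical sample page, not the real course page.
sample_html = (
    '<p><a href="http://example.com/page2.htm">Second Page</a></p>'
    '<p><a name="top">No link here</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
for href in collector.hrefs:
    print(href)
```

The second anchor has no `href`, so it yields `None`, which mirrors what `tag.get('href', None)` returns for such a tag.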

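4. raw_input() vs. input()

raw_input() reads the console input directly and returns it as a string, whatever the user types.

input(), by contrast, expects to read a valid Python expression,

so a string must be typed inside quotes; otherwise an error is raised (a NameError for a bare word, a SyntaxError for text that is not a valid expression).

Unless there is a specific need, prefer raw_input().

Because input() evaluates what it reads, typing 1 + 3 at an input() prompt returns the int 4.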
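
The difference can be simulated without a console. The sketch below is Python 3 code that imitates the two Python 2 functions: the hypothetical helpers `py2_raw_input` and `py2_input` take the typed text as a parameter instead of reading stdin, and `py2_input` assumes that Python 2's input() behaves like eval() applied to that text.

```python
def py2_raw_input(typed):
    """Imitate Python 2 raw_input(): hand back the typed text unchanged."""
    return typed


def py2_input(typed):
    """Imitate Python 2 input(): evaluate the typed text as an expression."""
    # eval() runs arbitrary code, one reason raw_input() is the safer default.
    return eval(typed)


print(repr(py2_raw_input("1 + 3")))  # stays a string
print(repr(py2_input("1 + 3")))      # evaluated to an integer
print(repr(py2_input("'hello'")))    # quoted text evaluates to a string

try:
    py2_input("hello")               # bare word is looked up as a name
except NameError as err:
    print("NameError:", err)

try:
    py2_input("hello world")         # not a valid expression
except SyntaxError as err:
    print("SyntaxError:", err)
```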

5. Advanced BS usage (homework assignment 1)

http://python-data.dr-chuck.net/comments_222777.html

Sum the comments in the page at the URL above.

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Find every span whose class is "comments"
sc = soup.select('span[class="comments"]')
Sum = 0
Count = 0
for span in sc:
    # print 'span', span
    # print 'Attr:', span.attrs
    # print 'Contents:', span.contents[0]
    Sum += int(span.contents[0])  # extract the number inside the span
    Count += 1
print 'Count:', Count
print 'Sum:', Sum
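
The counting logic can be tried offline on a small sample. The following is only a sketch in Python 3 that uses the standard library's `HTMLParser` in place of BeautifulSoup; the class name `CommentSummer` and the sample rows are made up, and the real assignment page may be laid out differently.

```python
from html.parser import HTMLParser


class CommentSummer(HTMLParser):
    """Add up the integers found inside <span class="comments"> tags."""

    def __init__(self):
        super().__init__()
        self.in_comment_span = False
        self.count = 0
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "comments":
            self.in_comment_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_comment_span = False

    def handle_data(self, data):
        # Only text inside a matching span is counted.
        if self.in_comment_span and data.strip():
            self.total += int(data)
            self.count += 1


# Hypothetical sample rows; the third span has a different class.
sample_html = """
<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">3</span></td></tr>
<tr><td>Carol</td><td><span class="other">50</span></td></tr>
"""

summer = CommentSummer()
summer.feed(sample_html)
print("Count:", summer.count)
print("Sum:", summer.total)
```

Only the two spans with class "comments" are counted; the third span is skipped, just as the `span[class="comments"]` selector would skip it.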

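PS:

After switching from Python 3 to Python 2, a "Non-ASCII character" error appeared, because Python 2 assumes ASCII source files unless an encoding is declared.

To fix it, add this as the first line of the source file:

#coding:utf-8

or add:

#-*- coding: UTF-8 -*-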
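
Both spellings work because Python only looks for a comment on the first or second line that matches the pattern `coding[=:]\s*([-\w.]+)` given in the language reference. As an illustrative sketch (the helper name `declared_encoding` is hypothetical, and this is not the interpreter's own code), the check can be reproduced with a regular expression:

```python
import re

# Pattern given in the Python language reference for encoding declarations.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(first_line):
    """Return the encoding named in a comment line, or None if there is none."""
    if not first_line.lstrip().startswith("#"):
        return None
    match = CODING_RE.search(first_line)
    return match.group(1) if match else None


print(declared_encoding("#coding:utf-8"))
print(declared_encoding("#-*- coding: UTF-8 -*-"))
print(declared_encoding("# just a comment"))
```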

posted @ 2016-01-08 09:43 只追昭熙