BeautifulSoup loses information when parsing a page

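Today I set out to write a script that automatically fetches updates of the Naruto manga, so that I no longer have to page through them in the browser every time. My scraping scripts used to rely on regular expressions matched against the page source; this time, after some searching online, I decided to try the Beautiful Soup library for extracting page content.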

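While scraping the manga's chapter list, I could never get the data. The chapter list is clearly present in the page source, yet once BeautifulSoup parses the page it simply cannot be found.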

Here is a simple example:

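# coding=utf-8
# Python 2 script using BeautifulSoup 3.

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the raw page bytes.
manhua_url = "http://www.manmankan.com/html/1/index.asp"
manhua = urllib2.urlopen(manhua_url).read()

# The page is GBK-encoded; re-encode it as UTF-8, dropping undecodable bytes.
manhua = unicode(manhua, 'gbk', 'ignore').encode('utf-8', 'ignore')
print manhua

# Parse the same text and print the parsed tree for comparison.
soup = BeautifulSoup(manhua)
print soup.prettify()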

Compare the two print outputs and, well, the soup has clearly lost a lot of information. Why?
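Comparing two long printouts by eye is tedious, so here is a rough sketch of one way to put numbers on the loss. It does not explain why the parser drops content; it only counts opening tags in the raw text and in the parsed output with a plain regex and reports which tags went missing. The helper names `tag_counts` and `missing_tags` are made up for this illustration, and the two sample strings are hypothetical stand-ins for `manhua` and `soup.prettify()`.

```python
import re
from collections import Counter

# Matches an opening tag and captures its name, e.g. "<a href=...>" -> "a".
# Closing tags such as "</a>" are not matched because "/" follows "<".
_OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")


def tag_counts(html):
    """Count opening tags by lower-cased name using a plain regex scan."""
    return Counter(name.lower() for name in _OPEN_TAG.findall(html))


def missing_tags(raw_html, parsed_html):
    """Return the tags that occur more often in raw_html than in parsed_html."""
    # Counter subtraction keeps only the positive differences.
    return tag_counts(raw_html) - tag_counts(parsed_html)


if __name__ == "__main__":
    # Hypothetical sample strings; in the script above these would be
    # `manhua` and `soup.prettify()`.
    raw = '<ul><li><a href="/1.html">1</a></li><li><a href="/2.html">2</a></li></ul>'
    parsed = '<ul><li><a href="/1.html">1</a></li></ul>'
    print(missing_tags(raw, parsed))
```

If the chapter links really are being dropped, a call like `missing_tags(manhua, soup.prettify())` should show a large count for `a` (and probably `li` or `td`), which at least narrows down where in the page the parser gives up.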

-To Be Continued-

posted @ 2010-03-18 16:10 听风
