Tech Geek, fat-man

Knowing your language better helps you avoid writing stupid code


Using DOM tree analysis instead of regex search to find the HTML nodes to process

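Requirement: analyze the HTML source sent from the rich-text editor to the server, find tags of the form <img src="attachment/100" /> or <a href="attachment/101" > ... </a>, and replace them with the internal tag <tn-media  hash = " ... "> before the HTML is stored in the database.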

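Original approach: search for and replace these tags with regular expressions. In practice this turned out to be fairly error-prone, so I am considering locating the nodes by parsing the DOM tree instead of searching with regex.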
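To make the fragility concrete, here is a minimal sketch. It is written for Python 3 with the standard library only, so it is not the BeautifulSoup setup used in the rest of this post. The pattern `IMG_RE`, the class `ImgSrcCollector` and the sample markup are illustrative assumptions. The regex matches the tidy form of the tag but misses the same tag once the editor emits single quotes or puts another attribute first, while the parser-based collector finds both.

```python
import re
from html.parser import HTMLParser

# Hypothetical "naive" pattern: expects src to be the first attribute, in double quotes.
IMG_RE = re.compile(r'<img src="(attachment/\d+)"')


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src is not None:
                self.sources.append(src)


def find_with_regex(html):
    return IMG_RE.findall(html)


def find_with_parser(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return [s for s in collector.sources if s.startswith('attachment/')]


sample = (
    '<p><img src="attachment/19" alt=""></p>'
    "<p><img alt='' src='attachment/20'></p>"
)
print(find_with_regex(sample))   # only the first tag matches the pattern
print(find_with_parser(sample))  # both attachment images are found
```

The regex could of course be loosened, but each new variation in the editor output needs another patch to the pattern, which is the kind of error I want to avoid.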

Environment: Python 2.7, web.py, BeautifulSoup (a third-party DOM parsing library)

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import web

# Routing table: a URL pattern followed by the name of its handler class
urls = (
    '/', 'index',
    '/parse', 'parse'
)

web.config.debug = False
app = web.application(urls, globals())

class index:
    def GET(self):
        # Test page: a form whose textarea is pre-filled with sample editor output
        s = '''
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<form action="/parse" method="post">
<textarea id="s" name="s" style="width:500;height:400">
<font color="red">折腾人的中文</font>
<p><a href="http://www.baidu.com">baidu百度</a><img src="attachment/19" alt="" data-mce-src="attachment/19"></p><p><a href="attachment/20" data-mce-href="attachment/20"><img src="images/home/attType/unknown.png" alt="" data-mce-src="images/home/attType/unknown.png"></a></p><p><span>Bug.txt--0.31KB<br></span></p><p><a href="attachment/21" data-mce-href="attachment/21"><img src="images/home/attType/unknown.png" alt="" data-mce-src="images/home/attType/unknown.png"></a><br><span>BBS--0.23KB</span></p><p><span> </span></p><p> </p>
<p><img src="http://csdnimg.cn/www/images/csdnindex_logo.gif"></p>
</textarea>
<input type="submit" value="post">
</form>
</body>
</html>
'''
        return s

import json
from BeautifulSoup import BeautifulSoup

class parse:
    def POST(self):
        wi = web.input()
        s = wi.s
        soup = BeautifulSoup(s)

        # Collect every link, and separately the links that point at an attachment
        links = soup.findAll('a')
        a = []
        a1 = []
        for link in links:
            # link['href'] raises KeyError for an <a> without href, so use get()
            href = link.get('href')
            if href is None:
                continue
            a.append(href)
            if href.find('attachment/') != -1:
                a1.append(href)
        s1 = "ALL LINK:<br/>%s<br/><br/>WE NEED LINK:<br/>%s" % (json.dumps(a), json.dumps(a1))

        # Same again for images
        imgs = soup.findAll('img')
        imgsAll = []
        imgsNeed = []
        for img in imgs:
            src = img.get('src')
            if src is None:
                continue
            imgsAll.append(src)
            if src.find('attachment/') != -1:
                imgsNeed.append(src)
        s2 = "ALL IMG:<br/>%s<br/><br/>WE NEED IMG:<br/>%s" % (json.dumps(imgsAll), json.dumps(imgsNeed))

        return '''<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body> %s <br/><br/> %s <br/><br/> %s </body></html>''' % (s, s1, s2)


if __name__ == "__main__":
    app.run()

  

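However, I am worried that BeautifulSoup is not stable enough (character-encoding errors, failures on badly formed HTML markup, and so on), so I use only a subset of its functionality: it locates the tags but does not modify the DOM tree. Modifying the HTML remains the job of the program that calls it.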
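The split described above (the parser only locates tags, the program edits the original string) can be sketched as follows. This is a Python 3 illustration using the standard library `html.parser` rather than BeautifulSoup, and it assumes the `attachment/N` form from the requirement. `ImgLocator`, `replace_attachment_images`, the `digest_for` callback and the `digests` table are hypothetical names; the callback stands in for the database lookup.

```python
import re
from html.parser import HTMLParser

ATTACHMENT_SRC = re.compile(r'attachment/(\d+)')


class ImgLocator(HTMLParser):
    """Records (start, end, src) for each <img> tag; the markup is never modified."""

    def __init__(self, source):
        super().__init__()
        # getpos() counts lines by '\n', so line starts are computed the same way.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == '\n']
        self.found = []
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            lineno, column = self.getpos()  # 1-based line, 0-based column
            start = self.line_starts[lineno - 1] + column
            end = start + len(self.get_starttag_text())
            self.found.append((start, end, dict(attrs).get('src') or ''))


def replace_attachment_images(source, digest_for):
    """Replace attachment <img> tags in the original string with <tn-media>."""
    out = source
    # Work backwards so that the offsets of earlier tags stay valid.
    for start, end, src in reversed(ImgLocator(source).found):
        match = ATTACHMENT_SRC.fullmatch(src)
        if match is None:
            continue
        digest = digest_for(int(match.group(1)))
        if digest is None:
            continue
        out = out[:start] + '<tn-media hash="%s" />' % digest + out[end:]
    return out


digests = {19: 'abc123'}  # hypothetical attId -> digest table
html = '<p>中文 <img src="attachment/19" alt=""></p>\n<p><img src="logo.gif"></p>'
print(replace_attachment_images(html, digests.get))
```

Because the replacement is done by source offsets, everything outside the matched tags stays byte-for-byte identical, and it does not depend on the parser re-serializing a tag exactly as it appeared in the input.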

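def convertToMedia(self, db, s):
    '''New version: written and runs, awaiting testing. Requires BeautifulSoup to be installed.'''
    def getDigestOfAttchment(db, attId):
        # Look up the digest stored for this attachment; None if there is no such row
        sql = "select digest from Attachment where attId=$attId"
        rows = db.query(sql, vars=locals()).list()
        if len(rows) > 0:
            return rows[0].digest
        else:
            return None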

    def convertOtherToMedia(db, links, s):
        for link in links:
            # get() avoids a KeyError on an <a> that has no href
            href = link.get('href') or ''
            # The expected string is '/attachment/12', so find must return 0
            if href.find('/attachment/') == 0:
                listPartOfHref = href.split('/')
                # The string is split into ['', 'attachment', '12']
                if len(listPartOfHref) != 3: continue
                # Skip an empty or non-numeric id rather than let int() raise ValueError
                if not listPartOfHref[2].isdigit(): continue
                attId = int(listPartOfHref[2])
                digest = getDigestOfAttchment(db, attId)
                if digest is None: continue
                s = s.replace(str(link), '<tn-media hash="%s" />' % digest)
        return s

    def convertPictureToMedia(db, images, s):
        for image in images:
            # get() avoids a KeyError on an <img> that has no src
            src = image.get('src') or ''
            # The expected string is '/attachment/12', so find must return 0
            if src.find('/attachment/') == 0:
                listPartOfSrc = src.split('/')
                # The string is split into ['', 'attachment', '12']
                if len(listPartOfSrc) != 3: continue
                # Skip an empty or non-numeric id rather than let int() raise ValueError
                if not listPartOfSrc[2].isdigit(): continue
                attId = int(listPartOfSrc[2])
                digest = getDigestOfAttchment(db, attId)
                if digest is None: continue
                s = s.replace(str(image), '<tn-media hash="%s" />' % digest)
        return s

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(s)
    links = soup.findAll('a')
    s = convertOtherToMedia(db, links, s)
    images = soup.findAll('img')
    s = convertPictureToMedia(db, images, s)
    # False is returned only because the original external interface expects it;
    # it can be dropped once the calling code is changed
    return False, s
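The comments in the code above expect '/attachment/12', while the sample markup earlier in the post uses 'attachment/19' without the leading slash. A small helper that accepts both forms and rejects everything else might look like the sketch below (Python 3). The function name `attachment_id` and the decision to accept both forms are my assumptions, not part of the code above.

```python
import re

# Accepts "attachment/12" and "/attachment/12"; anything else is rejected.
_ATTACHMENT_RE = re.compile(r'/?attachment/([0-9]+)')


def attachment_id(path):
    """Return the attachment id as an int, or None if path is not an attachment URL."""
    if not path:
        return None
    match = _ATTACHMENT_RE.fullmatch(path)
    return int(match.group(1)) if match else None


for candidate in ('/attachment/12', 'attachment/19', 'http://www.baidu.com', '/attachment/'):
    print(candidate, attachment_id(candidate))
```

With the validation in one place, the link and image loops no longer need their own split and length checks.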

 

An example of modifying the DOM tree directly with BeautifulSoup

a = soup.findAll('img')
for i in a:
    if i['src'].find('attachment/') > -1:
        i.replaceWith('<tnmedia hash="123456">')
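For comparison, the same tree-editing idea can be tried without any third-party library. The sketch below (Python 3) uses `xml.etree.ElementTree` in place of BeautifulSoup, so it only works when the fragment is well-formed XML, which real editor output often is not. The function name `replace_images_in_tree`, the fixed digest and the sample fragment are placeholders.

```python
import xml.etree.ElementTree as ET


def replace_images_in_tree(fragment, digest='123456'):
    """Parse a well-formed fragment, swap attachment <img> nodes, and serialize it again."""
    root = ET.fromstring(fragment)
    # Take a snapshot of the nodes first, because the tree is edited inside the loop.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == 'img' and 'attachment/' in child.get('src', ''):
                media = ET.Element('tn-media', {'hash': digest})
                media.tail = child.tail  # keep any text that followed the <img>
                parent[index] = media
    return ET.tostring(root, encoding='unicode')


result = replace_images_in_tree(
    '<div><p><img src="attachment/19" alt="" /> caption</p>'
    '<p><img src="logo.gif" /></p></div>'
)
print(result)
```

Here the whole document is re-serialized by the library, so the output may differ from the input in places that were never meant to change. That is the trade-off against doing the string replacement myself.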




Posted by codestyle
