Python: strip all other attributes from HTML, keeping only href and src

https://segmentfault.com/q/1010000010845573

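import re

# Earlier attempt, kept for reference: a character class cannot exclude whole words.
# reg = r'\s+[^(href)]*=\"[^<>]+\"'

# Match leading whitespace plus any quoted attribute whose name is not exactly href or src.
# Consuming the whitespace turns <table class="x"> into <table>, so the replace() calls below can match.
reg = r'\s+(?!(?:href|src)\b)[\w:-]+\s*=\s*(["\']).*?\1'

with open(r'input.txt', 'r', encoding='ISO-8859-15') as f_read:
    html = f_read.read()

result = re.sub(reg, "", html)
result = result.replace('<table>', '<table class="table14_3">')
# result = result.replace('<img>', '<img src="min_images/new_logo.jpg">')
result = result.replace('<span>', '').replace('</span>', '')
print(result)

with open(r'output.txt', 'w', encoding='ISO-8859-15') as f_write:
    f_write.write(result)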
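
To check the pattern without touching any files, it can be run on a small string. The sketch below is only an illustration: the helper name strip_attrs and the sample markup are made up for this example, and the pattern is a variant of the one in the script that also consumes the whitespace in front of each attribute. It assumes attribute values are always quoted; unquoted values and attribute-like text outside of tags are not handled.

```python
import re

# Variant of the script's pattern (an assumption of this sketch):
# whitespace, then an attribute name that is not exactly href or src,
# then a quoted value closed by the same quote character.
ATTR_RE = re.compile(r'\s+(?!(?:href|src)\b)[\w:-]+\s*=\s*(["\']).*?\1')


def strip_attrs(html):
    """Hypothetical helper: drop every quoted attribute except href and src."""
    return ATTR_RE.sub("", html)


sample = ('<a class="nav" href="/home" id="top">Home</a>'
          '<img src="a.png" alt="logo" width="10">')
print(strip_attrs(sample))
# prints: <a href="/home">Home</a><img src="a.png">
```

Because the lookahead ends with \b, names that merely start with src or href (for example srcset) are still removed. For messy real-world HTML, a proper parser is likely to be more reliable than a regular expression.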

posted @ 2018-11-14 14:18 YuQiao0303
