Python: strip all other attributes from HTML, keeping only href and src

https://segmentfault.com/q/1010000010845573

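import re

# Match any quoted attribute other than href/src, together with the whitespace
# before it, so that <table class="x"> collapses to <table> rather than <table >.
# [\w-]+ covers hyphenated names such as data-id; .*? allows empty values.
reg = r'\s+(?!(?:href|src)=)[\w-]+=(["\']).*?\1'

with open(r'input.txt', 'r', encoding='ISO-8859-15') as f_read:
    html = f_read.read()

result = re.sub(reg, "", html)
# Give the now attribute-free tables a class, and drop <span> wrappers.
result = result.replace('<table>', '<table class="table14_3">')
#result = result.replace('<img>','<img src="min_images/new_logo.jpg">')
result = result.replace('<span>', '').replace('</span>', '')
print(result)

with open(r'output.txt', 'w', encoding='ISO-8859-15') as f_write:
    f_write.write(result)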
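
To check the substitution without touching any files, the same idea can be tried on a short string. This is a minimal sketch, not the original script: the names ATTR_RE, strip_attrs and sample are made up for illustration, and the pattern is a variant that also consumes the whitespace in front of each removed attribute, accepts hyphenated names and accepts empty values. It only handles quoted attribute values, and because it is a plain regex it will also rewrite anything in the page text that happens to look like name="value".

```python
import re

# Hypothetical variant of the pattern: removes the leading whitespace as well,
# so tags such as <table border="1"> collapse to <table> and not <table >.
ATTR_RE = re.compile(r'\s+(?!(?:href|src)=)[\w-]+=(["\']).*?\1')

def strip_attrs(html):
    """Remove every quoted attribute except href and src."""
    return ATTR_RE.sub("", html)

sample = '<a class="nav" href="/home" data-id="">Home</a><img src="a.png" alt="logo">'
print(strip_attrs(sample))
# <a href="/home">Home</a><img src="a.png">
```

Because the whitespace is removed along with the attribute, a follow-up exact match such as replace('<table>', ...) can find the stripped tag.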
            

posted @ 2018-11-14 14:18  YuQiao0303