Python 爬虫 数据清洗 去掉 超链接

有时候我们需要清洗数据,里面有超链接,怎么去掉他们,比如下面的问题

复制代码
<div class="lot-page-details"><ul class="info-list"><li class="lot-info-item"><p><strong class="section-header">Provenance</strong></p><p>Brand New
Gallery, Milan<br/>Acquired from the above by the present owner</p></li><li class="lot-info-item"><p><strong class="section-header">Exhibited</strong>
</p><p>Milan, Brand New Gallery, <em>This is the story of America. Everybody's doing what they 
think they
're supposed to do</em>, November 21, 2013
- January 11, 2014</p></li><li class="artist-biography"><p><strong class="section-header">Artist Bio
</strong></p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a><p class="artist-info">American • 1983
</p><div class="follow-artist" data-artist-id="12106"
role="button"
tabindex="0">
<span cl
ass
="icon"></
span><s
pan class=
"toolti
p
">Follow</span></div><div class="artist-bio"><p>

<p>New York-based artist Ethan Cook is known for his abstract paintings on self-produced canvases. More recently, he has used handwoven strips of
cotton and linen to create painterly compositions. Cook's woven canvases are contemporary in their minimalist focus on shape and color while referencing
one of the most traditional art forms, weaving. Cook weaves his own canvases on a
loom and juxtaposes these with
 store-bought canvas sheets
in abstract arrangements.
For the artist,
the surface of th
e canvas itself becomes the foc
us of his practice. Using simple geometric shapes and a l
imited color palate, Cook
's works nurture structural s
implicity.</p></p><a href="/artist/12106/ethan-cook"><div class="lot-essay-button artist"><em>View More Works</em></div></a></div></li></ul></div>
复制代码

 

 

第一种方法:

  用这则替换,把 href 替换为 hre1f 就可以了,

第二种方法:

        result_div_list = re.findall('<(.*?)>',str(result_div))
       
    
if 'href' in str(result_div_list): for ii in result_div_list: if 'href' in ii: item_desc = str(result_div).replace(str(ii) ,'') else: item_desc = result_div

记录下来,供以后学习参考 

 

posted @   淋哥  阅读(4007)  评论(0编辑  收藏  举报
编辑推荐:
· 如何编写易于单元测试的代码
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
阅读排行:
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 地球OL攻略 —— 某应届生求职总结
· 周边上新:园子的第一款马克杯温暖上架
· Open-Sora 2.0 重磅开源!
· 提示词工程——AI应用必不可少的技术
点击右上角即可分享
微信分享提示