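Python web scraping: handling emoji that prevent XPath from parsing a page
Preface
This post is very short; it just records a problem I ran into by chance.
Reproducing the problem
Here is what happened. I was parsing a website with XPath. The site serves plain HTML rather than JSON strings, so I had to parse the DOM. Where a regular expression would do, I used one; where it would not, I used the beautifulsoup library or pyquery. Even so, neither is as general-purpose as XPath, and since I had already written a first version, rewriting it in the limited time available would have been a real chore.
Enough preamble; on to the problem.
First, part of the site's source looks like this: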
<article class="_55wo _5rgr _5gh8 _3drq async_like" data-ft='{"mf_story_key":"10159935560038463","top_level_post_id":"10159935560038463","tl_objid":"10159935560038463","content_owner_id_new":"8245623462","throwback_story_xxid":"10159935560038463","page_id":"8245623462","story_location":4,"story_attachment_style":"video_inline","tds_flgs":3,"ott":"AX90AyHPzJSMfPjF","tn":"-R"}' data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata" data-store='{"linkdata":"mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF","share_id":"10159935560038463","feedback_target":"10159935560038463","feedback_source":0,"action_source":0,"actor_id":100065274592441}' data-xt="2.mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF" data-xt-vimp='{"pixel_in_percentage":0,"duration_in_ms":1,"subsequent_gap_in_ms":60000,"log_initial_nonviewable":false,"should_batch":true,"require_horizontally_onscreen":false}' id="u_0_5_iv">
<div class="story_body_container"> <header class="_7om2 _1o88 _77kd _5qc1"> <div class="_5s61 _2pii _5i2i _52wc"> <div class="_5xu4"> <div class="_67lm _77kc" data-gt='{"tn":"~"}' data-sigil="feed_story_ring8245623462"><a data-click='{"event":"click_post_avatar_image","target_id":"10159935560038463"}' data-gt='{"tn":"~"}' href="/nba/?__tn__=%7E%7E-R"><i aria-label="NBA, profile picture" class="img _1-yc profpic" role="img"></i></a> </div> </div> </div>
<div class="_4g34 _5i2i _52we"> <div class="_5xu4"> <div class="_7om2 _52wc"> <div class="_4g34"><h3 class="_52jd _52jb _52jh _5qc3 _4vc- _3rc4 _4vc-" data-gt='{"tn":"C"}'> <span><strong><a href="/nba/?__tn__=C-R">NBA</a></strong><span aria-label="Verified Page" class="_56_f _5dzy _5dz- _3twv" id="u_0_e_x2" role="img"></span></span> </h3>
<div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle"><a href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=-R"><abbr>6 hrs</abbr></a><span aria-hidden="true"> · </span><span><div class="_7jwi"><span data-sigil="audience-icon"><i aria-label="Public" class="feedAudienceIcon img sp_eXcmc5QyINt_2x sx_e966fc" role="img"></i></span><div class="_7jwh"></div></div></span> </div> </div>
<div class="_5s61"> <div class="_2pir" id="feed_story_fan_8245623462"></div> </div> <div class="_5s61"></div> <div class="_5s61 _2pis"> <div class="_yff" data-sigil="story-popup-causal-init" data-store='{"feedobjectsIdentifiers":"S:_I8245623462:10159935560038463","feedContext":"{\"use_m_feed\":true,\"m_entstream_source\":\"timeline\",\"is_pages_timeline\":true,\"story_node_id\":\"u_0_5_iv\",\"show_attachments\":true,\"is_attached_story\":false}"}' id="u_0_b_35"><a aria-haspopup="true" class="_4s19 sec" data-sigil="touchable" href="#" role="button"></a><i class="img sp_eXcmc5QyINt_2x sx_b9866d" data-sigil="story-popup-context-init"><u>More options</u></i></div> </div> </div> </div> </div> </header>
<div class="_5rgt _5nk5 _5msi" data-ft='{"tn":"*s"}' data-gt='{"tn":"*s"}' style=""> <div><span><p>Watch the BEST DEEP 3'S from the <a href="/LAClippers/?__tn__=%2As-R">L.A. Clippers</a> during the <a class="_5ayv _qdx" href="/hashtag/nbaplayoffs?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span class="_5ayu">NBAPlayoffs</span></a>! </p><p> <a class="_5ayv _qdx" href="/hashtag/thatsgame?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span class="_5ayu">ThatsGame</span></a> <span class="_5mfr"><span class="_6qdm" style='height: 16px; width: 16px; font-size: 16px; background-image: url("https://static.xx.xxcdn.net/images/emoji.php/v9/tdf/2/16/1f4a5.png")'>💥</span></span></p></span> </div> <a aria-label="Open story" class="_5msj" href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=%2As%2As-R"></a></div>
<div class="_5rgu _7dc9 _27x0" data-ft='{"tn":"H"}'> <section class="_2rea _24e1 _412_ _bpa _vyy _5t8z"> <div class="_2zi_ _zgm _2zj0"> <div class="_53mw" data-sigil="inlineVideo" data-store='{"videoID":"4456269257751059","playerFormat":"inline","playerOrigin":"page_timeline","external_log_id":null,"external_log_type":null,"rootID":4456269257751059,"playerSuborigin":"misc","useOzLive":false,"playbackIsLiveStreaming":false,"canUseOffline":null,"playOnClick":true,"videoDebuggerEnabled":false,"videoViewabilityLoggingEnabled":false,"videoViewabilityLoggingPollingRate":-1,"videoScrollUseLowThrottleRate":true,"playInFullScreen":false,"type":"video","src":"https:\/\/video-mad1-1.xx.xxcdn.net\/v\/t42.1790-2\/10000000_540531577146622_2129266242166849959_n.mp4?_nc_cat=111&ccb=1-3&_nc_sid=985c63&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_ohc=CHxlLBnqdg8AX84rJTC&tn=3o-lXXvU9tVtdq6j&_nc_rml=0&_nc_ht=video-mad1-1.xx&oh=5ab243e6a2407a74ed09407f43ad04e9&oe=6107CF3F","width":320,"height":180,"trackingNodes":"FH-R","downloadResources":null,"subtitlesSrc":null,"spherical":false,"sphericalParams":null,"defaultQuality":null,"availableQualities":null,"playStartSec":null,"playEndSec":null,"playMuted":null,"disableVideoControls":false,"loop":false,"numOfLoops":null,"shouldPlayInline":true,"dashManifest":null,"isAdsPreview":false,"iframeEmbedReferrer":null,"adClientToken":null,"audioOnlyVideoSrc":null,"audioOnlyEnabled":false,"permalinkShareID":null,"feedPosition":null,"chainDepth":null,"videoURL":"https:\/\/www.xxxxxx.com\/nba\/videos\/4456269257751059\/","disableLogging":false}'> <i class="img _lt3 _4s0y" data-sigil="playInlineVideo" style=""></i> <div class="_1o0y" data-sigil="m-video-play-button playInlineVideo"><span style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Play Video</span> </div> </div> </div> </section> <div></div> <div></div> </div> </div>
<footer class="_22rc" data-ft='{"tn":"*W"}'> <div class="_2ip_ _4b44" data-sigil="mufi-inline" id="feedback_inline_10159935560038463"> <div class="_34qc _3hxn _3myz _4b45"><a data-sigil="feed-ufi-trigger" href="/story.php?story_xxid=10159935560038463&id=8245623462&anchor_composer=false&__tn__=%2AW-R" role="button"> <div class="_rnk _77ke _2eo- _1e6 _4b44" data-sigil="reactions-bling-bar" id="u_0_f_m4"> <div class="_1w1k" data-sigil="reactions-sentence-container"><span class="_qfz _77kf"><div class="_1g05 _77lc" style="z-index:3"><i class="img sp_eXcmc5QyINt_2x sx_9540f7" role="presentation"><u>Like</u></i></div><div class="_1g05 _77lc" style="z-index:2"><i class="img sp_eXcmc5QyINt_2x sx_2d1286" role="presentation"><u>Love</u></i></div><div class="_1g05 _77lc" style="z-index:1"><i class="img sp_eXcmc5QyINt_2x sx_176208" role="presentation"><u>Wow</u></i></div></span> <div aria-label="567 left reactions including Like, Love and Wow" class="_1g06">567</div> </div> <div class="_1fnt"><span class="_1j-c" data-sigil="comments-token">10 Comments</span><span class="_1j-c">36 Shares</span></div> </div> </a></div>
<div class="_52jh _7om2 _15kk _15ks _15km _4b47 _4b46" data-sigil="ufi-inline-actions"> <div class="_52jj _15kl _3hwk _4g34"><a aria-pressed="false" class="_15ko _77li touchable" data-ft='{"tn":">"}' data-sigil="touchable ufi-inline-like like-reaction-flyout" data-store='{"reaction":0,"feedbackTarget":"10159935560038463","kaiOSReactions":false}' href="/ufi/reaction/?ft_ent_identifier=10159935560038463&reaction_type=1&story_render_location=timeline&feedback_source=0&is_sponsored=0&ext=1628151954&hash=AeQmDqjrKECVo8k9bxk&__tn__=%3E%2AW-R" id="u_0_g_4b" role="button" tabindex="0">Like</a> <div class="_1ekf" data-sigil="screenreader-reactions-trigger" role="link" tabindex="-1">Show more reactions </div> </div> <div class="_52jj _15kl _3hwk _4g34"><a class="_15kq _77li" data-click='{"event":"click_comment_ufi","target_id":"10159935560038463"}' data-ft='{"tn":"S"}' data-sigil="feed-ufi-focus feed-ufi-trigger ufiCommentLink mufi-composer-focus" href="/story.php?story_xxid=10159935560038463&id=8245623462&fs=0&focus_composer=0&__tn__=S%2AW-R">Comment</a> </div> <div class="_52jj _15kl _3hwk _4g34"><a class="_15kr _77li" data-click='{"event":"click_share_ufi","target_id":"10159935560038463"}' data-ft='{"tn":"J"}' data-sigil="share-popup" data-store='{"is_acting_as_page":false,"reshare_post":false,"share_id":"10159935560038463","feedback_source":0,"feedback_referrer":null,"internal_preview_image_id":null,"shareable_uri":"\/story.php?story_xxid=10159935560038463&id=8245623462","user_id":100065274592441,"behavior":"custom"}' href="/sharer.php?fs=0&sid=10159935560038463&__tn__=J%2AW-R">Share</a> </div> </div> </div> </footer> </article>
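My XPath expressions simply could not parse this. I tested with the following code:
This was very strange. After some testing, I found that the emoji were causing it.
Once I deleted the emoji, parsing worked normally:
Pretty wild.
Would you believe I spent a full hour tracking this down? I really did dig the problem out bit by bit, as if I were reverse-engineering JS code one chunk at a time.
Solving the problem
My first idea was to use beautifulsoup to find the nodes containing the emoji and delete them. That did solve the problem:
But this turned out not to be very general: the emoji will not necessarily sit in the node with class _6qdm that I selected, and may appear elsewhere.
So it comes back to regex matching: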
re.compile(u'[\U00010000-\U0010ffff]')
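As a quick check of what this pattern covers, the sketch below runs it over a made-up string (the sample text is an assumption, not taken from the page). The range spans every code point from U+10000 upward, so it catches supplementary-plane emoji such as 💥 (U+1F4A5), but it also catches any other character in those planes, and it misses symbols that live in the Basic Multilingual Plane, such as ❤ (U+2764).

```python
import re

# Same character class as above: every code point from U+10000 to U+10FFFF.
pattern = re.compile('[\U00010000-\U0010ffff]')

# Hypothetical sample text: one supplementary-plane emoji (U+1F4A5)
# and one symbol from the Basic Multilingual Plane (U+2764).
sample = 'ThatsGame \U0001F4A5 love \u2764'

matches = pattern.findall(sample)
print(matches)               # only the supplementary-plane character is listed
print('\u2764' in matches)   # the BMP heart is not among the matches
```

If BMP symbols also turn out to break the parse, the character class would need extra ranges; that case is not covered here.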
Since the pattern matches, replacing with sub is all it takes:
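import re

from lxml import etree

# Read the saved page as UTF-8 text.
with open('profile.html', encoding='utf-8') as f:
    cont = f.read()

try:
    # Every code point above the Basic Multilingual Plane, where most emoji live.
    pattern = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Fallback for narrow (UCS-2) builds: match UTF-16 surrogate pairs instead.
    pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

print(pattern.findall(cont))  # show which characters will be removed
cont = pattern.sub('', cont)  # strip them from the raw HTML

# Earlier, less general approach: delete only the emoji <span> nodes.
# soup = BeautifulSoup(cont, 'html.parser')
# remove_obj = soup.select('span[class="_6qdm"]')
# if remove_obj:
#     [rem.extract() for rem in remove_obj]
# html_xpath = etree.HTML(str(soup))
html_xpath = etree.HTML(cont)
print(html_xpath.xpath('//text()'))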
Running it:
To verify, I swapped in a different HTML structure:
Sure enough, it matches. OK, problem solved.
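If the same cleanup is needed in more than one scraper, it could be wrapped in a small helper. This is only a sketch: the name strip_astral and the sample markup are made up for illustration, and it uses nothing beyond the standard library, so the cleaned string can be handed to etree.HTML afterwards.

```python
import re

# Characters outside the Basic Multilingual Plane (U+10000 and above).
_ASTRAL = re.compile('[\U00010000-\U0010ffff]')


def strip_astral(text: str) -> str:
    """Return text with every supplementary-plane character removed."""
    return _ASTRAL.sub('', text)


# Hypothetical markup with emoji in two different nodes, neither of which
# carries the _6qdm class targeted by the earlier BeautifulSoup approach.
sample = (
    '<ul><li>Game on \U0001F3C0</li>'
    '<li class="x">Final \U0001F525\U0001F525</li></ul>'
)

cleaned = strip_astral(sample)
print(cleaned)
```

Because the substitution runs on the raw string, it does not depend on which element or class the emoji happens to sit in.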