Python crawler -- handling emoji that stop XPath from parsing a page correctly

Preface

This post is very short; it just records a problem I ran into by chance.

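Reproducing the problem

Here is what happened. I was parsing a site with XPath. The site returns plain HTML rather than JSON strings, so the only option was to parse the DOM. Wherever a regular expression would do the job I used one, and where it would not I used BeautifulSoup or pyquery. Still, neither is as general as XPath, and I already had a working XPath version, so rewriting it in the limited time I had would have been a real pain.

Enough talk, here is the problem.

First, part of the site's source looks like this: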

<article class="_55wo _5rgr _5gh8 _3drq async_like"
         data-ft='{"mf_story_key":"10159935560038463","top_level_post_id":"10159935560038463","tl_objid":"10159935560038463","content_owner_id_new":"8245623462","throwback_story_xxid":"10159935560038463","page_id":"8245623462","story_location":4,"story_attachment_style":"video_inline","tds_flgs":3,"ott":"AX90AyHPzJSMfPjF","tn":"-R"}'
         data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata"
         data-store='{"linkdata":"mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF","share_id":"10159935560038463","feedback_target":"10159935560038463","feedback_source":0,"action_source":0,"actor_id":100065274592441}'
         data-xt="2.mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF"
         data-xt-vimp='{"pixel_in_percentage":0,"duration_in_ms":1,"subsequent_gap_in_ms":60000,"log_initial_nonviewable":false,"should_batch":true,"require_horizontally_onscreen":false}'
         id="u_0_5_iv">
    <div class="story_body_container">
        <header class="_7om2 _1o88 _77kd _5qc1">
            <div class="_5s61 _2pii _5i2i _52wc">
                <div class="_5xu4">
                    <div class="_67lm _77kc" data-gt='{"tn":"~"}' data-sigil="feed_story_ring8245623462"><a
                            data-click='{"event":"click_post_avatar_image","target_id":"10159935560038463"}'
                            data-gt='{"tn":"~"}' href="/nba/?__tn__=%7E%7E-R"><i aria-label="NBA, profile picture"
                                                                                 class="img _1-yc profpic" role="img"
                                                                                 ></i></a>
                    </div>
                </div>
            </div>
            <div class="_4g34 _5i2i _52we">
                <div class="_5xu4">
                    <div class="_7om2 _52wc">
                        <div class="_4g34"><h3 class="_52jd _52jb _52jh _5qc3 _4vc- _3rc4 _4vc-" data-gt='{"tn":"C"}'>
                            <span><strong><a href="/nba/?__tn__=C-R">NBA</a></strong><span aria-label="Verified Page"
                                                                                           class="_56_f _5dzy _5dz- _3twv"
                                                                                           id="u_0_e_x2"
                                                                                           role="img"></span></span>
                        </h3>
                            <div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle"><a
                                    href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=-R"><abbr>6
                                hrs</abbr></a><span aria-hidden="true"> · </span><span><div class="_7jwi"><span
                                    data-sigil="audience-icon"><i aria-label="Public"
                                                                  class="feedAudienceIcon img sp_eXcmc5QyINt_2x sx_e966fc"
                                                                  role="img"></i></span><div class="_7jwh"></div></div></span>
                            </div>
                        </div>
                        <div class="_5s61">
                            <div class="_2pir" id="feed_story_fan_8245623462"></div>
                        </div>
                        <div class="_5s61"></div>
                        <div class="_5s61 _2pis">
                            <div class="_yff" data-sigil="story-popup-causal-init"
                                 data-store='{"feedobjectsIdentifiers":"S:_I8245623462:10159935560038463","feedContext":"{\"use_m_feed\":true,\"m_entstream_source\":\"timeline\",\"is_pages_timeline\":true,\"story_node_id\":\"u_0_5_iv\",\"show_attachments\":true,\"is_attached_story\":false}"}'
                                 id="u_0_b_35"><a aria-haspopup="true" class="_4s19 sec" data-sigil="touchable" href="#"
                                                  role="button"></a><i class="img sp_eXcmc5QyINt_2x sx_b9866d"
                                                                       data-sigil="story-popup-context-init"><u>More
                                options</u></i></div>
                        </div>
                    </div>
                </div>
            </div>
        </header>
        <div class="_5rgt _5nk5 _5msi" data-ft='{"tn":"*s"}' data-gt='{"tn":"*s"}' style="">
            <div><span><p>Watch the BEST DEEP 3'S from the <a href="/LAClippers/?__tn__=%2As-R">L.A. Clippers</a> during the <a
                    class="_5ayv _qdx" href="/hashtag/nbaplayoffs?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span
                    class="_5ayu">NBAPlayoffs</span></a>! </p><p> <a class="_5ayv _qdx"
                                                                     href="/hashtag/thatsgame?__tn__=%2As-R"><span
                    class="_5aw4 _qdz">#</span><span class="_5ayu">ThatsGame</span></a> <span class="_5mfr"><span
                    class="_6qdm"
                    style='height: 16px; width: 16px; font-size: 16px; background-image: url("https://static.xx.xxcdn.net/images/emoji.php/v9/tdf/2/16/1f4a5.png")'>💥</span></span></p></span>
            </div>
            <a aria-label="Open story" class="_5msj"
               href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=%2As%2As-R"></a></div>
        <div class="_5rgu _7dc9 _27x0" data-ft='{"tn":"H"}'>
            <section class="_2rea _24e1 _412_ _bpa _vyy _5t8z">
                <div class="_2zi_ _zgm _2zj0">
                    <div class="_53mw" data-sigil="inlineVideo"
                         data-store='{"videoID":"4456269257751059","playerFormat":"inline","playerOrigin":"page_timeline","external_log_id":null,"external_log_type":null,"rootID":4456269257751059,"playerSuborigin":"misc","useOzLive":false,"playbackIsLiveStreaming":false,"canUseOffline":null,"playOnClick":true,"videoDebuggerEnabled":false,"videoViewabilityLoggingEnabled":false,"videoViewabilityLoggingPollingRate":-1,"videoScrollUseLowThrottleRate":true,"playInFullScreen":false,"type":"video","src":"https:\/\/video-mad1-1.xx.xxcdn.net\/v\/t42.1790-2\/10000000_540531577146622_2129266242166849959_n.mp4?_nc_cat=111&ccb=1-3&_nc_sid=985c63&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_ohc=CHxlLBnqdg8AX84rJTC&tn=3o-lXXvU9tVtdq6j&_nc_rml=0&_nc_ht=video-mad1-1.xx&oh=5ab243e6a2407a74ed09407f43ad04e9&oe=6107CF3F","width":320,"height":180,"trackingNodes":"FH-R","downloadResources":null,"subtitlesSrc":null,"spherical":false,"sphericalParams":null,"defaultQuality":null,"availableQualities":null,"playStartSec":null,"playEndSec":null,"playMuted":null,"disableVideoControls":false,"loop":false,"numOfLoops":null,"shouldPlayInline":true,"dashManifest":null,"isAdsPreview":false,"iframeEmbedReferrer":null,"adClientToken":null,"audioOnlyVideoSrc":null,"audioOnlyEnabled":false,"permalinkShareID":null,"feedPosition":null,"chainDepth":null,"videoURL":"https:\/\/www.xxxxxx.com\/nba\/videos\/4456269257751059\/","disableLogging":false}'>
                        <i class="img _lt3 _4s0y" data-sigil="playInlineVideo"
                           style=""></i>
                        <div class="_1o0y" data-sigil="m-video-play-button playInlineVideo"><span
                                style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Play Video</span>
                        </div>
                    </div>
                </div>
            </section>
            <div></div>
            <div></div>
        </div>
    </div>
    <footer class="_22rc" data-ft='{"tn":"*W"}'>
        <div class="_2ip_ _4b44" data-sigil="mufi-inline" id="feedback_inline_10159935560038463">
            <div class="_34qc _3hxn _3myz _4b45"><a data-sigil="feed-ufi-trigger"
                                                    href="/story.php?story_xxid=10159935560038463&id=8245623462&anchor_composer=false&__tn__=%2AW-R"
                                                    role="button">
                <div class="_rnk _77ke _2eo- _1e6 _4b44" data-sigil="reactions-bling-bar" id="u_0_f_m4">
                    <div class="_1w1k" data-sigil="reactions-sentence-container"><span class="_qfz _77kf"><div
                            class="_1g05 _77lc" style="z-index:3"><i class="img sp_eXcmc5QyINt_2x sx_9540f7"
                                                                     role="presentation"><u>Like</u></i></div><div
                            class="_1g05 _77lc" style="z-index:2"><i class="img sp_eXcmc5QyINt_2x sx_2d1286"
                                                                     role="presentation"><u>Love</u></i></div><div
                            class="_1g05 _77lc" style="z-index:1"><i class="img sp_eXcmc5QyINt_2x sx_176208"
                                                                     role="presentation"><u>Wow</u></i></div></span>
                        <div aria-label="567 left reactions including Like, Love and Wow" class="_1g06">567</div>
                    </div>
                    <div class="_1fnt"><span class="_1j-c" data-sigil="comments-token">10 Comments</span><span
                            class="_1j-c">36 Shares</span></div>
                </div>
            </a></div>
            <div class="_52jh _7om2 _15kk _15ks _15km _4b47 _4b46" data-sigil="ufi-inline-actions">
                <div class="_52jj _15kl _3hwk _4g34"><a aria-pressed="false" class="_15ko _77li touchable"
                                                        data-ft='{"tn":">"}'
                                                        data-sigil="touchable ufi-inline-like like-reaction-flyout"
                                                        data-store='{"reaction":0,"feedbackTarget":"10159935560038463","kaiOSReactions":false}'
                                                        href="/ufi/reaction/?ft_ent_identifier=10159935560038463&reaction_type=1&story_render_location=timeline&feedback_source=0&is_sponsored=0&ext=1628151954&hash=AeQmDqjrKECVo8k9bxk&__tn__=%3E%2AW-R"
                                                        id="u_0_g_4b" role="button" tabindex="0">Like</a>
                    <div class="_1ekf" data-sigil="screenreader-reactions-trigger" role="link" tabindex="-1">Show more
                        reactions
                    </div>
                </div>
                <div class="_52jj _15kl _3hwk _4g34"><a class="_15kq _77li"
                                                        data-click='{"event":"click_comment_ufi","target_id":"10159935560038463"}'
                                                        data-ft='{"tn":"S"}'
                                                        data-sigil="feed-ufi-focus feed-ufi-trigger ufiCommentLink mufi-composer-focus"
                                                        href="/story.php?story_xxid=10159935560038463&id=8245623462&fs=0&focus_composer=0&__tn__=S%2AW-R">Comment</a>
                </div>
                <div class="_52jj _15kl _3hwk _4g34"><a class="_15kr _77li"
                                                        data-click='{"event":"click_share_ufi","target_id":"10159935560038463"}'
                                                        data-ft='{"tn":"J"}' data-sigil="share-popup"
                                                        data-store='{"is_acting_as_page":false,"reshare_post":false,"share_id":"10159935560038463","feedback_source":0,"feedback_referrer":null,"internal_preview_image_id":null,"shareable_uri":"\/story.php?story_xxid=10159935560038463&id=8245623462","user_id":100065274592441,"behavior":"custom"}'
                                                        href="/sharer.php?fs=0&sid=10159935560038463&__tn__=J%2AW-R">Share</a>
                </div>
            </div>
        </div>
    </footer>
</article>

  

 

My XPath expression simply could not parse it. I tested with the following code:

That was strange. After some testing, I found that the emoji characters were the cause.

Once I deleted the emoji, the page parsed normally:

Pretty wild.

 

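Do you know I spent a full hour tracking this down? I really did dig the problem out bit by bit; it felt like reverse-engineering JS code, picking it apart one piece at a time.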
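Narrowing a page down by hand is slow. One quick way to check a suspicion like this is to scan the text for code points above U+FFFF, which is where the 💥 (U+1F4A5) in the sample source lives. The following is only a minimal sketch using the standard library; the helper name `find_astral_chars` and the sample string are made up for illustration and are not part of the original script.

```python
import unicodedata


def find_astral_chars(text):
    """Return (index, code point, name) for each character above U+FFFF."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # unicodedata.name() raises ValueError for unnamed code points
            # unless a default is supplied, so pass one.
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, "U+%04X" % ord(ch), name))
    return hits


# Hypothetical sample: a cut-down version of the span that holds the emoji.
sample = '<span class="_6qdm">\U0001F4A5</span>'
for hit in find_astral_chars(sample):
    print(hit)
```

An empty result means the page has no such characters, so the cause is probably elsewhere.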

 

 

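Solving the problem

At first I planned to use BeautifulSoup to find the node containing the emoji and delete it. That did solve the problem:

But it turned out not to be very general, because the emoji is not guaranteed to sit inside the node with class _6qdm that I selected; it can show up elsewhere too.

So it comes back to matching with a regular expression: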

 

 

re.compile(u'[\U00010000-\U0010ffff]')
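This character class covers every code point from U+10000 to U+10FFFF, that is, everything outside the Basic Multilingual Plane (BMP). It is not exactly "emoji": it also matches non-emoji characters in that range, such as rare CJK ideographs, and it does not match symbols that live inside the BMP, such as ☀ (U+2600). A small sketch with made-up sample strings shows the difference:

```python
import re

# Same character class as above: any code point outside the BMP.
astral = re.compile('[\U00010000-\U0010ffff]')

# Hypothetical samples chosen to show what is and is not matched.
samples = {
    "collision emoji": "ThatsGame \U0001F4A5",  # U+1F4A5, outside the BMP
    "sun symbol": "sunny \u2600",               # U+2600, inside the BMP
    "rare CJK ideograph": "\U00020000",         # U+20000, not an emoji
}

for label, text in samples.items():
    matched = astral.findall(text)
    print(label, [hex(ord(ch)) for ch in matched])
```

If BMP symbols also turn out to cause trouble, or astral text must be kept, the range would need adjusting for that page.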

 

 

 

 

Since the pattern matches, all that is left is to remove the matches with sub:

 

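import re

from lxml import etree

# Read the saved page as UTF-8 text; the with block closes the file.
with open('profile.html', encoding='utf-8') as f:
    cont = f.read()

try:
    # Match every code point outside the BMP (always compiles on Python 3).
    pattern = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Fallback for narrow Python 2 builds: match UTF-16 surrogate pairs.
    pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

print(pattern.findall(cont))  # the characters about to be removed
cont = pattern.sub('', cont)  # strip them before parsing

# Earlier BeautifulSoup attempt, kept for reference:
# soup = BeautifulSoup(cont, 'html.parser')
# remove_obj = soup.select('span[class="_6qdm"]')
# if remove_obj:
#     [rem.extract() for rem in remove_obj]
# html_xpath = etree.HTML(str(soup))
html_xpath = etree.HTML(cont)
print(html_xpath.xpath('//text()'))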
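Deleting the characters loses them for good. If the emoji should survive in the extracted text, one possible variation is to replace each match with an HTML numeric character reference instead of an empty string, so the markup handed to the parser contains no characters outside the BMP. This is a sketch under that assumption and has not been tested against the original page; `to_char_refs` is a hypothetical helper name, and `html.unescape` only stands in for the decoding an HTML parser would do.

```python
import html
import re

# Same character class as in the script above.
astral = re.compile('[\U00010000-\U0010ffff]')


def to_char_refs(text):
    """Replace each non-BMP character with a reference such as &#x1f4a5;."""
    return astral.sub(lambda m: "&#x%x;" % ord(m.group()), text)


# Hypothetical sample: a cut-down version of the span that holds the emoji.
markup = '<span class="_6qdm">\U0001F4A5</span>'
escaped = to_char_refs(markup)
print(escaped)
# Decoding the references gives back the original string.
print(html.unescape(escaped) == markup)
```

Note that references are not decoded inside script or style elements, so this only helps for ordinary text and attribute values.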

 

Running it:

To verify, I tried a different HTML structure:

Sure enough, it still matches. OK, problem solved.

 

posted @ 2021-08-04 11:00  Eeyhan