Fetching the Hupu Buxingjie (步行街) hot list as JSON with PowerShell

Fetching the hot list

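# Needs Windows PowerShell 5.1: ParsedHtml is not available in PowerShell 6+.
# (curl is an alias for Invoke-WebRequest there.)
(Invoke-WebRequest "https://bbs.hupu.com/all-gambia").ParsedHtml.getElementsByClassName('t-info') | ForEach-Object {
    # The spans hold, in order: title, lights, replies.
    $texts = $_.getElementsByTagName('span')
    @{
        # hrefs come back prefixed with "about:", so put the site root back.
        url     = $_.getElementsByTagName('a')[0].href.Replace("about:", "https://bbs.hupu.com")
        title   = $texts[0].innerText
        lights  = $texts[1].innerText
        replies = $texts[2].innerText
    }
} | ConvertTo-Json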
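
The `.Replace("about:", ...)` call in the script above is there because the parsed hrefs arrive as `about:/xxx.html`, presumably because the parsed document has no base URL of its own. Below is a minimal sketch of pulling that fix into a helper so absolute links pass through untouched. `Resolve-HupuUrl` is a hypothetical name that is not part of the original script, and the sample path is made up.

```powershell
# Hypothetical helper (not in the original post): turn an href as reported
# by ParsedHtml into an absolute URL.
function Resolve-HupuUrl {
    param(
        [string]$Href,
        [string]$BaseUrl = 'https://bbs.hupu.com'
    )
    if ($Href.StartsWith('about:')) {
        # Drop the "about:" prefix and prepend the site root.
        return $BaseUrl + $Href.Substring('about:'.Length)
    }
    # Anything else (e.g. an already absolute URL) is returned unchanged.
    return $Href
}

# Made-up sample path, for illustration only.
Resolve-HupuUrl 'about:/12345.html'
```

Inside the hot-list loop this would replace the inline `.Replace(...)` call, e.g. `url = Resolve-HupuUrl $_.getElementsByTagName('a')[0].href`.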

Result

Fetching post details

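# Needs Windows PowerShell 5.1 (ParsedHtml); curl is an alias for Invoke-WebRequest there.
$document = (Invoke-WebRequest "https://bbs.hupu.com/56224674.html").ParsedHtml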

#post
$postReplyCount = $document.querySelector(".reply").innerText
$postLikeCount = $document.querySelector(".light").innerText
$postViewCount = $document.querySelector(".read").innerText
$postName = $document.querySelector(".post-user-comp-info-bottom-title").innerText
$postUserName =  $document.querySelector(".post-user-comp-info-top-name").innerText
$postUserAvatar = $document.querySelector("img").src
$postTime = $document.querySelector(".post-user-comp-info-top-time").innerText
$postContent = $document.querySelector(".thread-content-detail").innerHTML

#reply
# Iterate by index rather than piping the collection itself; the reason is explained further down.
$postReplyList=@()
$relpyListCount = $document.getElementsByClassName("post-reply-list").Length
0..($relpyListCount-1) | %{
    $item = $document.getElementsByClassName("post-reply-list")[$_]
    $postReplyList += @{
        replyUserAvatar = $item.querySelector(".reply-list-avatar img").src
        replyUserName = $item.querySelector(".post-reply-list-user-info-top-name").innerText
        replyTime = $item.querySelector(".post-reply-list-user-info-top-time").innerText
        replyQuotUser = $item.querySelector(".quote-text span").innerText
        replyQuotContent = $item.querySelector(".simple-detail-content").innerHTML
        replyContent = $item.querySelector(".thread-content-detail").innerHTML
        replyLike = $item.querySelector(".light").innerText
    }
}
 
@{
    postName=$postName
    postTime=$postTime
    postContent=$postContent
    postReplyCount=$postReplyCount
    postLikeCount=$postLikeCount
    postViewCount=$postViewCount
    postUserName = $postUserName
    postUserAvatar = $postUserAvatar
    postReplyList = $postReplyList
} | ConvertTo-Json

Result

Problems

For the post details I avoided the long-winded getElementsByClassName-style methods and used querySelector instead.
When looping over the reply list, I first wrote it like this:

$document.querySelectorAll(".post-reply-list")| %{
  #
}

But querySelectorAll turned out to make PowerShell hang and crash. A search showed that this is a known bug:

https://stackoverflow.com/questions/37196558/using-queryselectorall-on-an-mshtml-htmldocumentclass-object-in-powershell-cause
https://github.com/PowerShell/PowerShell/issues/3027

So I switched back and wrote this:

$document.getElementsByClassName("post-reply-list") | %{
  # Visit each reply in the loop and look up its fields
  # Form 1
  $_.getElementsByClassName("className")[0].innerText

  # Form 2
  $_.querySelector(".className").innerText
}

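Both forms fail inside the loop with an error saying the method cannot be found.
Looking at the member list of the piped item, the only one of these methods available is getElementsByTagName, which is rather odd.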

Yet writing it like this works fine:

$document.getElementsByClassName("post-reply-list")[0].getElementsByClassName("className")[0].innerText
$document.getElementsByClassName("post-reply-list")[0].querySelector(".className").innerText

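That points to the fix: do not pipe the collection itself into ForEach-Object. Loop over the indices and index into the collection instead.
The final version looks like this: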

0..($relpyListCount-1) | %{
    $item = $document.getElementsByClassName("post-reply-list")[$_]
    #logic
}

It can of course also be written as a plain for loop:

for($i = 0; $i -lt $relpyListCount; $i++)
{
    $item = $document.getElementsByClassName("post-reply-list")[$i]
    #logic
}
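
One difference between the two loop forms is worth knowing. In PowerShell `0..-1` is a descending range that yields 0 and -1, so the `0..($relpyListCount-1)` form still runs its body when a post has no replies, while the `for` form runs zero times. The sketch below uses a plain count (no HTML involved) to show this. `$rangeIndices` and `$forIndices` are names introduced here for illustration only.

```powershell
$relpyListCount = 0   # pretend the post has no replies

# Range form: 0..-1 counts down, so it produces the indices 0 and -1.
$rangeIndices = @(0..($relpyListCount - 1))

# for form: the condition is false on the first check, so no index is produced.
$forIndices = @()
for ($i = 0; $i -lt $relpyListCount; $i++) {
    $forIndices += $i
}

"range: $($rangeIndices.Count) iterations, for: $($forIndices.Count) iterations"
# prints "range: 2 iterations, for: 0 iterations"
```

If the range form is kept, wrapping it in `if ($relpyListCount -gt 0) { ... }` is one way to avoid indexing into an empty list.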
posted @ 2022-11-01 13:25  talentzemin