Getting Hupu Buxingjie (步行街) hot-list data as JSON with PowerShell
Getting the hot list
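# Windows PowerShell 5.1 only: curl is an alias for Invoke-WebRequest there, and .ParsedHtml is not available in PowerShell 6+
(Invoke-WebRequest "https://bbs.hupu.com/all-gambia").ParsedHtml.getElementsByClassName('t-info') | ForEach-Object {
    $texts = $_.getElementsByTagName('span')
    @{
        # relative links come back prefixed with "about:", so swap in the site root
        url     = $_.getElementsByTagName('a')[0].href.Replace("about:", "https://bbs.hupu.com")
        title   = $texts[0].innerText
        lights  = $texts[1].innerText
        replies = $texts[2].innerText
    }
} | ConvertTo-Json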
Result
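A note on the shape of that output: a plain @{} hashtable does not guarantee key order, so url, title, lights and replies may appear in any order in the JSON. A minimal sketch of the same conversion using [ordered], which keeps insertion order. The two rows are made-up placeholders for illustration, not real Hupu data:

```powershell
# Placeholder rows standing in for the scraped hot-list items (hypothetical values).
$rows = @(
    [ordered]@{
        url     = 'https://bbs.hupu.com/example-1.html'
        title   = 'Sample title 1'
        lights  = '10'
        replies = '20'
    },
    [ordered]@{
        url     = 'https://bbs.hupu.com/example-2.html'
        title   = 'Sample title 2'
        lights  = '3'
        replies = '7'
    }
)

# [ordered] creates an OrderedDictionary, so the keys are serialized in the order written above.
$json = $rows | ConvertTo-Json
$json
```

In the scraping snippet the only change would be writing [ordered]@{ ... } instead of @{ ... }.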
Getting post details
# Windows PowerShell 5.1 only (.ParsedHtml); curl is an alias for Invoke-WebRequest there
$document = (Invoke-WebRequest "https://bbs.hupu.com/56224674.html").ParsedHtml
# post
$postReplyCount = $document.querySelector(".reply").innerText
$postLikeCount  = $document.querySelector(".light").innerText
$postViewCount  = $document.querySelector(".read").innerText
$postName       = $document.querySelector(".post-user-comp-info-bottom-title").innerText
$postUserName   = $document.querySelector(".post-user-comp-info-top-name").innerText
$postUserAvatar = $document.querySelector("img").src   # first <img> in the document
$postTime       = $document.querySelector(".post-user-comp-info-top-time").innerText
$postContent    = $document.querySelector(".thread-content-detail").innerHTML
# reply
$postReplyList = @()
$relpyListCount = $document.getElementsByClassName("post-reply-list").Length
0..($relpyListCount-1) | ForEach-Object {
    # index into the collection instead of piping it (see "Problems")
    $item = $document.getElementsByClassName("post-reply-list")[$_]
    $postReplyList += @{
        replyUserAvatar  = $item.querySelector(".reply-list-avatar img").src
        replyUserName    = $item.querySelector(".post-reply-list-user-info-top-name").innerText
        replyTime        = $item.querySelector(".post-reply-list-user-info-top-time").innerText
        replyQuotUser    = $item.querySelector(".quote-text span").innerText
        replyQuotContent = $item.querySelector(".simple-detail-content").innerHTML
        replyContent     = $item.querySelector(".thread-content-detail").innerHTML
        replyLike        = $item.querySelector(".light").innerText
    }
}
@{
    postName       = $postName
    postTime       = $postTime
    postContent    = $postContent
    postReplyCount = $postReplyCount
    postLikeCount  = $postLikeCount
    postViewCount  = $postViewCount
    postUserName   = $postUserName
    postUserAvatar = $postUserAvatar
    postReplyList  = $postReplyList
} | ConvertTo-Json
Result
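One thing to watch with the detail object: ConvertTo-Json only expands nested objects down to its -Depth limit (default 2) and turns anything deeper into a plain string. The following is a minimal sketch, assuming a reply that carries a more deeply nested quote object. The replyQuote and content keys and all values are hypothetical and only serve to illustrate the parameter:

```powershell
# Hypothetical, more deeply nested variant of the post object (placeholder values).
$post = [ordered]@{
    postName      = 'Sample post'
    postReplyList = @(
        [ordered]@{
            replyUserName = 'user-a'
            replyQuote    = [ordered]@{
                user    = 'user-b'
                content = [ordered]@{ text = 'hi' }
            }
        }
    )
}

# Raise -Depth so the innermost objects are serialized as JSON objects
# rather than being flattened to strings.
$json = $post | ConvertTo-Json -Depth 5
$json
```

If fields ever show up in the JSON as type names instead of real values, a too-small -Depth is the first thing to check.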
Problems
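For the post details I used querySelector rather than the long-winded getElementsByClassName family of methods.
To loop over the reply list, I first wrote:
$document.querySelectorAll(".post-reply-list") | %{
    #
}
But querySelectorAll makes PowerShell hang and then crash. A search showed that this is a known bug: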
https://stackoverflow.com/questions/37196558/using-queryselectorall-on-an-mshtml-htmldocumentclass-object-in-powershell-cause
https://github.com/PowerShell/PowerShell/issues/3027
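So I switched back to getElementsByClassName and wrote:
$document.getElementsByClassName("post-reply-list") | %{
    # for each reply in the loop, look up the fields we need
    # form 1
    $_.getElementsByClassName("className")[0].innerText
    # form 2
    $_.querySelector(".className").innerText
}
Both forms fail inside the loop with a "method not found" error.
Checking the member list of the piped item, only getElementsByTagName is there, which is rather odd.
Yet the same calls do work when the element is taken from the collection by index: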
$document.getElementsByClassName("post-reply-list")[0].getElementsByClassName("className")[0].innerText
$document.getElementsByClassName("post-reply-list")[0].querySelector(".className").innerText
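That gives us the fix: instead of piping the collection itself into ForEach-Object, loop over the indexes and fetch each element by index.
So the final solution is: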
0..($relpyListCount-1) | %{
    $item = $document.getElementsByClassName("post-reply-list")[$_]
    # logic
}
It can of course also be written as a plain for loop:
for ($i = 0; $i -lt $relpyListCount; $i++)
{
    $item = $document.getElementsByClassName("post-reply-list")[$i]
    # logic
}
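The two loop forms differ in one edge case: a post with no replies. The range operator counts downward when the end is smaller than the start, so 0..(0-1) yields 0 and -1 instead of an empty list, while the for loop simply does not run. A small sketch of the difference, using a made-up count rather than a real page:

```powershell
# Hypothetical reply count for a post with no replies.
$count = 0

# 0..-1 is a descending range, so this yields two indexes: 0 and -1.
$viaRange = @(0..($count - 1))

# The for loop checks its condition first, so it yields no indexes at all.
$viaFor = @()
for ($i = 0; $i -lt $count; $i++) { $viaFor += $i }

"range: $($viaRange.Count) iterations; for: $($viaFor.Count) iterations"
```

With the range form, guarding the loop with a check such as if ($count -gt 0) avoids the two stray iterations.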