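A Simple Web Scraper in R

R scraping experiment

A simple web-scraping experiment in R. Because I was lazy, I took a shortcut for the JavaScript-driven paging.
The main web-related R package used is {rvest}; the others are all commonly used packages.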

library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)

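The test site is Bilibili. I want to search by keyword and then count how many videos each UP (uploader) has in the results (okay, a bit pointless).

The first step is to type the search term into Bilibili and copy the resulting page URL.

A Chinese search term ends up percent-encoded at the end of the URL. I can't read that string by eye, but it is just the UTF-8 bytes of the keyword (here 多肉, "succulents").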

url <- "https://search.bilibili.com/all?keyword=%E5%A4%9A%E8%82%89"
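The unreadable part of the URL is only the percent-encoded keyword. As a minimal sketch, base R's `URLencode()` and `URLdecode()` (from {utils}) can convert in both directions. The helper name `build_search_url` is made up for illustration and is not part of the original experiment.

```r
# Hypothetical helper: build the search URL from a plain keyword.
build_search_url <- function(keyword) {
  # reserved = TRUE also escapes characters such as "&" and "=" inside the keyword
  encoded <- utils::URLencode(enc2utf8(keyword), reserved = TRUE)
  paste0("https://search.bilibili.com/all?keyword=", encoded)
}

# The search term, written with Unicode escapes so the script stays ASCII-only
keyword <- "\u591a\u8089"
search_url <- build_search_url(keyword)

# Decoding goes the other way and gives back the keyword's UTF-8 bytes as a string
decoded <- utils::URLdecode("%E5%A4%9A%E8%82%89")
```

With this, the keyword can be changed in one place instead of copying a new URL from the browser each time.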

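For paging I took a shortcut: instead of driving the JavaScript pager, I request each page directly through the `page` query parameter. That requires knowing how many pages the results span, which is stored in the page's body tag.

This test had 50 pages, luckily not many.

Extract the maximum page number with a regular expression: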

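# Keep everything up to the first ">" of <body>, i.e. the opening tag with its attributes
body_title <- strsplit(as.character(html_nodes(read_html(url), xpath = 'body')), ">")[[1]][1]
# Pull the digits out of data-num_pages="..."
page_number <- as.numeric(gsub(".*data-num_pages=\\\"([0-9]+)\\\".*", "\\1", body_title, perl = TRUE))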
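One caveat with the `gsub()` approach: when the attribute is missing, `gsub()` returns the whole input unchanged, and `as.numeric()` then yields `NA` with a warning. Below is a slightly more defensive sketch in base R. The function name `extract_num_pages` and the sample tag text are assumptions made for illustration, not markup copied from Bilibili.

```r
# Hypothetical helper: return the page count as an integer, or NA if the attribute is absent.
extract_num_pages <- function(tag_text) {
  # regexec() returns the full match plus the capture group; no match gives character(0)
  m <- regmatches(tag_text, regexec('data-num_pages="([0-9]+)"', tag_text))[[1]]
  if (length(m) < 2) {
    return(NA_integer_)
  }
  as.integer(m[2])
}

# Made-up opening tag in the shape the regex expects
sample_tag <- '<body data-num_pages="50" class="demo"'
pages_found <- extract_num_pages(sample_tag)
pages_missing <- extract_num_pages('<body class="demo"')
```

Checking for `NA` before starting the loop avoids iterating over a nonsensical page range.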

Next comes extracting the UP names, which are stored in nodes matched by:

xpath = '//a[@class="up-name"]'

Then paging is just a matter of following Bilibili's page rule.
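To see what this XPath selects, here is a small sketch that runs it on an inline HTML snippet instead of the live site. The snippet is invented for illustration and only mimics the `up-name` class. It also shows that `@class="up-name"` is an exact string match, so a node carrying a second class is skipped unless the expression tests for the class token instead.

```r
library(rvest)

# Invented markup: two UP links (one with an extra class) and one unrelated link
demo_page <- read_html(paste0(
  '<html><body>',
  '<a class="up-name" href="#"> up_a </a>',
  '<a class="up-name verified" href="#">up_b</a>',
  '<a class="other" href="#">not_an_up</a>',
  '</body></html>'
))

# Exact match on the class attribute, as in the expression above
exact_match <- demo_page %>%
  html_nodes(xpath = '//a[@class="up-name"]') %>%
  html_text(trim = TRUE)

# Token match: also finds nodes that have additional classes
token_match <- demo_page %>%
  html_nodes(xpath = '//a[contains(concat(" ", normalize-space(@class), " "), " up-name ")]') %>%
  html_text(trim = TRUE)
```

Whether the token form is needed depends on the real markup; the exact form was enough for the pages scraped here.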

# Collect the UP names page by page
up_name_vec <- c()
for (i in 1:page_number) {
  new_url <- paste0(url, "&page=", i, "&order=totalrank")
  info <- read_html(new_url) %>%
    html_nodes(xpath = '//a[@class="up-name"]') %>%
    html_text(trim = TRUE)
  up_name_vec <- c(up_name_vec, info)
}
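The loop stops at the first failed request and sends all requests back to back. A possible variant, sketched under assumptions: the helper names `collect_up_names` and `fetch_up_names`, the `pause` argument, and the choice to skip failed pages are all illustrative additions, not part of the original experiment. The fetcher is passed in as a function so the paging logic can be tried without network access.

```r
# Hypothetical helper: fetch_page is any function mapping a URL to a character vector of names.
collect_up_names <- function(base_url, page_number, fetch_page, pause = 1) {
  pages <- seq_len(page_number)
  results <- vector("list", length(pages))
  for (i in pages) {
    page_url <- paste0(base_url, "&page=", i, "&order=totalrank")
    # A failed page is reported and skipped instead of aborting the whole run
    results[[i]] <- tryCatch(
      fetch_page(page_url),
      error = function(e) {
        message("Skipping page ", i, ": ", conditionMessage(e))
        character(0)
      }
    )
    Sys.sleep(pause)  # wait between requests
  }
  unlist(results)
}

# Fetcher equivalent to the loop body; needs {rvest} and network access when called
fetch_up_names <- function(page_url) {
  page <- rvest::read_html(page_url)
  nodes <- rvest::html_nodes(page, xpath = '//a[@class="up-name"]')
  rvest::html_text(nodes, trim = TRUE)
}
```

It would be called as `up_name_vec <- collect_up_names(url, page_number, fetch_up_names)`. Storing each page in a list and calling `unlist()` once also avoids growing the vector inside the loop.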

A simple barplot with {ggplot2}:

# Count videos per UP and keep those with at least 5
up_table <- table(up_name_vec)
need_stat <- up_table[which(up_table >= 5)]
up_df <- data.frame(
  up_name = names(need_stat),
  up_num = as.vector(need_stat)
)
ggplot(data = up_df, aes(up_name, up_num)) +
  geom_bar(stat = "identity", aes(fill = up_name), show.legend = FALSE, width = 0.7) +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    # Tilt the x labels so long UP names do not overlap
    axis.text.x = element_text(angle = 45, size = 9, vjust = 0.58, color = "black")
  )
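The counting step can be checked without scraping anything. The sketch below wraps it in a function and runs it on made-up names. The function name `count_frequent_ups`, the `min_count` argument, and the sorting by count are illustrative additions.

```r
# Hypothetical helper: tabulate names and keep those appearing at least min_count times.
count_frequent_ups <- function(up_names, min_count = 5) {
  up_table <- table(up_names)
  need_stat <- up_table[up_table >= min_count]
  up_df <- data.frame(
    up_name = names(need_stat),
    up_num = as.vector(need_stat),
    stringsAsFactors = FALSE
  )
  # Sort by count, largest first
  up_df <- up_df[order(-up_df$up_num), ]
  rownames(up_df) <- NULL
  up_df
}

# Made-up input: up_a appears 6 times, up_b 5 times, up_c only twice
fake_names <- c(rep("up_b", 5), rep("up_c", 2), rep("up_a", 6))
fake_df <- count_frequent_ups(fake_names)
```

To have the bars follow this order in the plot, `up_name` can be turned into a factor whose levels are in the sorted order, for example `factor(fake_df$up_name, levels = fake_df$up_name)`.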

Posted 2018-03-10 19:17 by PeRl`
