A NetEase Cloud Music Crawler, Explained
My boss asked me to crawl NetEase Cloud Music data so we could run similarity extraction over song comments and turn them into promotional copy for songs. So I built this crawler, and this post is my record of it.
I. Analyzing the NetEase Cloud Music API
To reduce load on its servers, NetEase has anti-crawler measures. When I opened a song page and pressed F12, the data I wanted wasn't in the page source. That made it clear: the page issues follow-up requests to fetch the lyrics and comments. So I went looking online for the APIs I needed.
This article has a good analysis of how the API request parameters are encrypted: https://www.zhanghuanglong.com/detail/csharp-version-of-netease-cloud-music-api-analysis-(with-source-code)
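The gist of that scheme, as commonly documented: the JSON payload is AES-CBC encrypted twice, first with the fixed key `0CoJUm6Qyw8W8jud` (which also appears in the POST helper later in this post) and then with a random 16-character `secKey`, and the `secKey` itself is RSA-encrypted into `encSecKey`. Below is a minimal sketch of the AES half, assuming the commonly documented fixed IV `0102030405060708`; it mirrors the `EncryptUtils.aesEncrypt` helper used later, but is my reconstruction, not the project's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class EncryptUtils {
    // AES-128-CBC with PKCS5 padding and Base64 output — the weapi convention.
    // Both the fixed key "0CoJUm6Qyw8W8jud" and the random secKey are 16 chars,
    // so they can be used directly as AES-128 keys.
    public static String aesEncrypt(String text, String key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec("0102030405060708".getBytes(StandardCharsets.UTF_8)));
        return Base64.getEncoder()
                .encodeToString(cipher.doFinal(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```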
Here are the APIs used in this project:
| Purpose | URL | Method |
| --- | --- | --- |
| Song info (no lyrics) | http://music.163.com/m/song?id=123 | GET |
| Lyrics | http://music.163.com/api/song/lyric?os=pc&lv=-1&kv=-1&tv=-1&id=123 | GET |
| Comments | http://music.163.com/weapi/v1/resource/comments/R_SO_4_123 (123 is the song id) | POST |
II. A Deep-Web Crawler
Because NetEase protects its data, this can't work like a conventional crawler: fetch a page → parse out the useful data → save it → extract links into the task queue → fetch the next page.
Instead, I crawl by id: push a range of ids, say 100000000~200000000, into the task queue. For a given id (say 123), the song info, lyrics, and comments are all tied together by the same song_id.
To speed up crawling, the crawler is multi-threaded on top of a Java thread pool.
Here I'll only walk through the concrete implementation of the crawler; a sketch of the driver follows.
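As that driver sketch: a fixed-size pool consumes one task per song id. `SongInfoTask` is a placeholder name for the task class in the next section, and the pool size of 32 is an assumption to tune against bandwidth and proxy-pool size, not a value from the original project.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerMain {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(32); // tune to bandwidth / proxy pool
        // Demo range; the post crawls roughly 100000000~200000000. For a range that
        // large, prefer a bounded queue so queued tasks don't exhaust memory.
        for (long id = 100_000_000L; id < 100_010_000L; id++) {
            pool.submit(new SongInfoTask(id)); // one task per song id
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.DAYS); // a full crawl runs for a long time
    }
}
```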
III. Custom Task Classes
A Java task class implements the Runnable interface and overrides its run() method; run() does the fetching, parsing, and database writes.
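All three task classes below share the same shape; here is a minimal skeleton (the class name `SongInfoTask` is my label, not a name from the original project):

```java
public class SongInfoTask implements Runnable {
    private final long uid; // song id this task is responsible for

    public SongInfoTask(long uid) {
        this.uid = uid;
    }

    @Override
    public void run() {
        // fetch the song page → parse fields → fetch lyrics → write to DB;
        // the concrete bodies are shown in the snippets below
    }
}
```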
1. Song info task class
```java
@Override
public void run() {
    try {
        Response execute;
        // Alternate between two User-Agent strings to look less like a single client
        if (uid % 2 == 0) {
            execute = Jsoup.connect("http://music.163.com/m/song?id=" + uid)
                    .header("User-Agent",
                            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                    .header("Cache-Control", "no-cache")
                    .timeout(2000000000)
                    // .proxy(IpProxy.ipEntitys.get(i).getIp(), IpProxy.ipEntitys.get(i).getPort())
                    .execute();
        } else {
            execute = Jsoup.connect("http://music.163.com/m/song?id=" + uid)
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; W…) Gecko/20100101 Firefox/56.0")
                    .header("Cache-Control", "no-cache")
                    .timeout(2000000000)
                    .execute();
        }
        String body = execute.body();
        // NetEase's 404 page contains this message ("sorry, the page you are looking for cannot be found")
        if (body.contains("很抱歉,你要查找的网页找不到")) {
            System.out.println("song id:" + uid + " ============= page not found");
            return;
        }
        Document parse = execute.parse();

        // Parse the song title
        Elements elementsByClass = parse.getElementsByClass("f-ff2");
        Element element = elementsByClass.get(0);
        Node childNode = element.childNode(0);
        String song_name = childNode.toString();

        // Get the artist name
        Elements elements = parse.getElementsByClass("s-fc7");
        Element singerElement = elements.get(1);
        Node singerChildNode = singerElement.childNode(0);
        String songer_name = singerChildNode.toString();

        // Get the album name
        Element albumElement = elements.get(2);
        Node albumChildNode = albumElement.childNode(0);
        String album_name = albumChildNode.toString();

        // Song URL
        String song_url = "http://music.163.com/m/song?id=" + uid;

        // Get the lyrics
        String lyric = getSongLyricBySongId(uid);

        // Persist the song
        dbUtils.insert_song(uid, song_name, songer_name, lyric, song_url, album_name);

    } catch (Exception e) {
        // swallow: ids with no song or with unexpected markup are simply skipped
    }
}

/*
 * Fetch the lyrics for a song id
 */
private String getSongLyricBySongId(long id) {
    try {
        Response data = Jsoup.connect("http://music.163.com/api/song/lyric?os=pc&lv=-1&kv=-1&tv=-1&id=" + id)
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                .header("Cache-Control", "no-cache") // .timeout(20000)
                .execute();

        String body = data.body();

        // Response shape: {"lrc":{"lyric":"..."}, ...}
        JsonObject jsonObject = new Gson().fromJson(body, JsonObject.class);
        jsonObject = (JsonObject) jsonObject.get("lrc");

        JsonElement jsonElement = jsonObject.get("lyric");
        String lyric = jsonElement.getAsString();
        // Strip LRC timestamp tags
        // String regex = "\\[\\d{2}\\:\\d{2}\\.\\d{2}\\]";
        String regex = "\\[\\d+\\:\\d+\\.\\d+\\]";
        lyric = lyric.replaceAll(regex, "");
        String regex2 = "\\[\\d+\\:\\d+\\]";
        lyric = lyric.replaceAll(regex2, "");
        lyric = lyric.replaceAll("'", "");
        lyric = lyric.replaceAll("\"", "");

        return lyric;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return "";
}
```
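The two regexes above strip LRC timestamps of the forms `[mm:ss.xx]` and `[mm:ss]`; a quick runnable check of what they do:

```java
public class LyricRegexDemo {
    public static void main(String[] args) {
        String lyric = "[00:12.34]line one\n[01:02]line two";
        lyric = lyric.replaceAll("\\[\\d+\\:\\d+\\.\\d+\\]", "") // [mm:ss.xx]
                     .replaceAll("\\[\\d+\\:\\d+\\]", "");       // [mm:ss]
        System.out.println(lyric); // prints "line one" then "line two"
    }
}
```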
2. Hot-comment task class
A song has roughly 0~20 hot comments, and they all come back from a single POST request, so they're handled separately from regular comments. The params and encSecKey parameters are encrypted; see the article linked in section I.
```java
@Override
public void run() {
    try {
        String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
        String data = CenterUrl.getDataByUrl(url, "{\"offset\":0,\"limit\":10};");
        System.out.println(data);
        JsonParseUtil<CommentBean> commentData = new JsonParseUtil<>();
        CommentBean jsonData = commentData.getJsonData(data, CommentBean.class);
        List<HotComments> hotComments = jsonData.getHotComments();
        for (HotComments comment : hotComments) {
            // Assemble the fields
            Long comment_id = comment.getCommentId();
            String comment_content = comment.getContent();
            comment_content = comment_content.replaceAll("'", "").replaceAll("\"", "");
            Long liked_count = comment.getLikedCount();
            String commenter_name = comment.getUser().getNickname();
            int is_hot_comment = 1;
            Long create_time = comment.getTime();
            // Insert into the database
            dbUtils.insert_hot_comments(uid, comment_id, comment_content, liked_count,
                    commenter_name, is_hot_comment, create_time);
        }
    } catch (Exception e) {
        logger.error(e.getMessage());
    }
}
```
3. Regular-comment task class
Regular comments require paging, so this task loops over pages; you can configure how many regular comments to fetch per song.
```java
@Override
public void run() {
    long pageSize = 0;
    int dynamicPage = 105; // 105 pages ≈ 1050 comments, leaving headroom for failed fetches
    for (long i = 0; i <= pageSize && i < dynamicPage; i++) { // target: ~1000 regular (non-hot) comments
        try {
            String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
            String data = CenterUrl.getDataByUrl(url, "{\"offset\":" + i * 10 + ",\"limit\":" + 10 + "};");

            // "用户的数据无效" is the sentinel getDataByUrl returns for a blacklisted proxy
            if (data.trim().equals("HTTP/1.1 400 Bad Request") || data.contains("用户的数据无效")) {
                // Fetch failed (network or proxy issue): retry this page
                i--;
                if (pageSize == 0) { // the very first request failed...
                    pageSize = dynamicPage;
                }
                System.out.println("~~ song_id = " + uid + ", i(Page)=" + i + ", reason = " + data);
                continue;
            }
            // "网络超时" is the sentinel for an exception in getDataByUrl; skip this page
            if (data.contains("网络超时") || data.equals("")) {
                continue;
            }

            JsonParseUtil<CommentBean> commentData = new JsonParseUtil<>();
            CommentBean jsonData = commentData.getJsonData(data, CommentBean.class);
            long total = jsonData.getTotal();
            pageSize = total / 10;
            List<Comments> comments = jsonData.getComments();
            for (Comments comment : comments) {
                try {
                    // Assemble the fields
                    Long comment_id = comment.getCommentId();
                    String comment_content = comment.getContent();
                    comment_content = comment_content.replaceAll("'", "").replaceAll("\"", "");
                    Long liked_count = comment.getLikedCount();
                    String commenter_name = comment.getUser().getNickname();
                    int is_hot_comment = 0;
                    Long create_time = comment.getTime();
                    // Insert into the database
                    dbUtils.insert_tmp_comments(uid, comment_id, comment_content, liked_count,
                            commenter_name, is_hot_comment, create_time);
                } catch (Exception e) {
                    System.out.println(">>>>>>>> insert failed: " + uid);
                }
            }
        } catch (Exception e) {
            System.err.println("^^^" + e.getMessage());
        }
    }
}
```
4. The POST request
Because the crawl volume is large, my local IP got throttled within minutes: in a browser, NetEase comments stopped loading, and a few minutes later songs wouldn't load either. So NetEase evidently blocks IPs it judges to be crawlers from the relevant endpoints. I therefore built a proxy IP pool: whenever an API response contains markers like "Cheating", the current proxy is dropped and a fresh one is drawn from the pool.
```java
public static String getDataByUrl(String url, String encrypt) {
    try {
        System.out.println("**************************** current proxy IP: " + ip + " ********* port " + port + " **********************");
        String data = "";
        // Encrypt the parameters (see the article linked in section I)
        String secKey = new BigInteger(100, new SecureRandom()).toString(32).substring(0, 16); // limit
        String encText = EncryptUtils.aesEncrypt(EncryptUtils.aesEncrypt(encrypt, "0CoJUm6Qyw8W8jud"), secKey);
        String encSecKey = EncryptUtils.rsaEncrypt(secKey);
        // Build the request
        Response execute = Jsoup.connect(url + "?csrf_token=6b9af67aaac0a2d1deb5683987d059e1")
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.32 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                .header("Cache-Control", "max-age=60").header("Accept", "*/*")
                .header("Accept-Encoding", "gzip, deflate, br")
                .header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8").header("Connection", "keep-alive")
                .header("Referer", "https://music.163.com/song?id=1324447466")
                .header("Origin", "https://music.163.com").header("Host", "music.163.com")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .data("params", encText)
                .data("encSecKey", encSecKey)
                .method(Method.POST).ignoreContentType(true)
                .timeout(1000000)
                .proxy(ip, port)
                .execute();
        data = execute.body().toString();
        // If the current IP has been blacklisted, grab a fresh one from the pool
        if (data.contains("Cheating") || data.contains("指定 product id") || data.contains("无效用户")) {
            // Drop the dead proxy
            if (IpProxy.ipEntitys.contains(ipEntity))
                IpProxy.ipEntitys.remove(ipEntity);
            ipEntity = getIpEntityByRandom();
            ip = ipEntity.getIp();
            port = ipEntity.getPort();
            return "用户的数据无效!!!"; // sentinel ("invalid user data"), checked by the comment tasks
        }
        return data;
    } catch (Exception e) {
        // Drop the dead proxy
        if (IpProxy.ipEntitys.contains(ipEntity))
            IpProxy.ipEntitys.remove(ipEntity);
        ipEntity = getIpEntityByRandom();
        ip = ipEntity.getIp();
        port = ipEntity.getPort();
        System.err.println("timeout cause: " + e.getMessage());
        if (e.getMessage().contains("Connection refused: connect")
                || e.getMessage().contains("No route to host: connect")) {
            IpProxy.ipEntitys.clear();
            IpProxy.getZDaYeProxyIp();
        }
        return "网络超时"; // sentinel ("network timeout"), checked by the comment tasks
    }
}

/*
 * Pick a random ipEntity from the pool
 * Note: may still return null if the pool stays empty after a refresh.
 */
private static IpEntity getIpEntityByRandom() {
    try {
        int size = IpProxy.ipEntitys.size();
        if (size == 0) {
            Thread.sleep(20000);
            IpProxy.getZDaYeProxyIp();
            size = IpProxy.ipEntitys.size(); // re-read after refreshing the pool
        }
        int i = (int) (Math.random() * size);
        if (size > 0 && i < size)
            return IpProxy.ipEntitys.get(i);
    } catch (Exception e) {
        System.err.println("pig!pig! failed to pick a random proxy ip!!!!!!!");
    }
    return null;
}
```
IV. Proxy IP Pool
Of the free proxy sources, 西刺代理 (Xici proxy) works best: its IPs are fresh. The downside is instability; the site itself goes down often.
There is also this one: https://www.ip-adress.com/proxy-list
I parse the Xici proxy page, pull the entries out of the DOM, and put them into the proxy IP pool.
```java
public static List<IpEntity> getProxyIp(String url) throws Exception {
    // Note: the Referer/Origin/Host headers below still point at music.163.com,
    // carried over from the crawler's other requests
    Response execute = Jsoup.connect(url)
            .header("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
            .header("Cache-Control", "max-age=60").header("Accept", "*/*")
            .header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6").header("Connection", "keep-alive")
            .header("Referer", "http://music.163.com/song?id=186016")
            .header("Origin", "http://music.163.com").header("Host", "music.163.com")
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Cookie",
                    "UM_distinctid=15e9863cf14335-0a09f939cd2af9-6d1b137c-100200-15e9863cf157f1; vjuids=414b87eb3.15e9863cfc1.0.ec99d6f660d09; _ntes_nnid=4543481cc76ab2fd3110ecaafd5f1288,1505795231854; _ntes_nuid=4543481cc76ab2fd3110ecaafd5f1288; __s_=1; __gads=ID=6cbc4ab41878c6b9:T=1505795247:S=ALNI_MbCe-bAY4kZyMbVKlS4T2BSuY75kw; usertrack=c+xxC1nMphjBCzKpBPJjAg==; NTES_CMT_USER_INFO=100899097%7Cm187****4250%7C%7Cfalse%7CbTE4NzAzNDE0MjUwQDE2My5jb20%3D; P_INFO=m18703414250@163.com|1507178162|2|mail163|00&99|CA&1506163335&mail163#hun&430800#10#0#0|187250&1|163|18703414250@163.com; vinfo_n_f_l_n3=8ba0369be425c0d2.1.7.1505795231863.1507950353704.1508150387844; vjlast=1505795232.1508150167.11; Province=0450; City=0454; _ga=GA1.2.1044198758.1506584097; _gid=GA1.2.763458995.1508907342; JSESSIONID-WYYY=Zm%2FnBG6%2B1vb%2BfJp%5CJP8nIyBZQfABmnAiIqMM8fgXABoqI0PdVq%2FpCsSPDROY1APPaZnFgh14pR2pV9E0Vdv2DaO%2BKkifMncYvxRVlOKMEGzq9dTcC%2F0PI07KWacWqGpwO88GviAmX%2BVuDkIVNBEquDrJ4QKhTZ2dzyGD%2Bd2T%2BbiztinJ%3A1508946396692; _iuqxldmzr_=32; playerid=20572717; MUSIC_U=39d0b2b5e15675f10fd5d9c05e8a5d593c61fcb81368d4431bab029c28eff977d4a57de2f409f533b482feaf99a1b61e80836282123441c67df96e4bf32a71bc38be3a5b629323e7bf122d59fa1ed6a2; __remember_me=true; __csrf=2032a8f34f1f92412a49ba3d6f68b2db; __utma=94650624.1044198758.1506584097.1508939111.1508942690.40; __utmb=94650624.20.10.1508942690; __utmc=94650624; __utmz=94650624.1508394258.18.4.utmcsr=xujin.org|utmccn=(referral)|utmcmd=referral|utmcct=/")
            .method(Method.GET).ignoreContentType(true)
            .timeout(2099999999).execute();
    Document pageJson = execute.parse();
    Element body = pageJson.body();
    // Walk down to the table rows that hold ip/port
    List<Node> childNodes = body.childNode(11).childNode(3).childNode(5).childNode(1).childNodes();
    // ipEntitys.clear(); // clear before re-adding

    for (int i = 2; i < childNodes.size(); i += 2) {
        IpEntity ipEntity = new IpEntity();
        Node node = childNodes.get(i);
        List<Node> nodes = node.childNodes();
        String ip = nodes.get(3).childNode(0).toString();
        int port = Integer.parseInt(nodes.get(5).childNode(0).toString());
        ipEntity.setIp(ip);
        ipEntity.setPort(port);
        ipEntitys.add(ipEntity);
    }
    return ipEntitys;
}
```
But to save myself the hassle, I eventually bought proxy IP service from 站大爷 (ZhanDaYe): 17 RMB a day, and the service is pretty good.
V. Summary
This crawler took quite a while to build, and I hit plenty of problems along the way. For example, calls to the NetEase API kept returning 460; it turned out my proxy IPs weren't being refreshed. Then the ZhanDaYe API returned nothing but useless IPs, because I hadn't bound my own public IP to the account. The crawler would also stall after ten minutes or so: a scheduled job kept refreshing the thread pool, yet the threads made no progress. My guess is that the retry for-loop in the regular-comment task blocks all the worker threads; that still needs verification, but a bounded-retry sketch follows.
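A sketch of that fix, under my (unverified) assumption about the cause: bound the retries per page so a song whose requests keep failing can't pin a worker thread forever. `looksLikeFailure` is a hypothetical helper wrapping the sentinel strings returned by `getDataByUrl` in section III.4; the rest mirrors the loop in section III.3.

```java
private void crawlComments(long uid) {
    String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
    long pageSize = 0;
    int dynamicPage = 105;
    int retries = 0;
    for (long i = 0; i <= pageSize && i < dynamicPage; i++) {
        String data = CenterUrl.getDataByUrl(url, "{\"offset\":" + i * 10 + ",\"limit\":10};");
        if (looksLikeFailure(data)) {
            if (++retries <= 3) {
                i--;          // retry this page, but at most 3 times
            } else {
                retries = 0;  // give up on this page and move on
            }
            continue;
        }
        retries = 0;
        // ... parse the JSON, update pageSize, persist comments as in section III.3 ...
    }
}

// Hypothetical helper: true for the sentinel strings getDataByUrl returns on failure
private boolean looksLikeFailure(String data) {
    return data == null || data.isEmpty()
            || data.trim().equals("HTTP/1.1 400 Bad Request")
            || data.contains("用户的数据无效") || data.contains("网络超时");
}
```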
A crawler is a simple thing, but building a good one is hard. Keep at it!