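Crawling Free HTTP Proxies with Java code-for-fun
I came across a website that offers free HTTP proxy IPs. It refreshes its list every hour or two, which makes it quite useful. So I wrote a crawler in Java that scrapes the proxy IPs from the site and keeps them for later use.
Page source: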
<!DOCTYPE html> <!-- saved from url=(0035)http://www.swei360.com/free/?page=2 --> <html><head><meta http-equiv="Content-Type" content="text/html; charset=GBK"> <meta name="viewport" content="width=1000"> <title>360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取</title> <meta name="keywords" content="360代理,代理ip购买,代理服务器,高速http代理,代理ip地址,国外代理服务器,免费代理服务器,免费高匿最新代理ip,代理服务器ip,代理地址,代理列表,最新免费代理ip"> <meta name="description" content="全球HTTP代理IP|HTTPS提取购买|SOCKS代理搜索引擎采集器|免费CONNECT代理采集|免费匿名QQ代理|国内外网游加速软件|优质IP代理资源|"> <meta content="index,follow" name="robots"> <meta content="index,follow" name="GOOGLEBOT"> <meta content="360三维IP" name="Author"> <link rel="shortcut icon" href="http://www.swei360.com/img/favicon.ico" type="image/x-icon"> <link rel="stylesheet" href="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/base.min.css" media="screen"> <!--[if lt IE 8]><link rel="stylesheet" href="/css/ie.css" media="screen" /><![endif]--> <link rel="stylesheet" type="text/css" href="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/db.css"><style> body { font-family:"微软雅黑", Helvetica; font-size:13px;line-height:160%; margin:0; padding:0; color:#777; -webkit-tap-highlight-color:rgba(0, 0, 0, 0); } body { margin:0px auto; } form, table, td, h1, h2, h3, h4, ul, ol, li, p { margin:0; padding:0; border:0; list-style:none } h2 { color:#2f2f2f; font-size: 18px;} .header_top { float:right;padding: 0 20px 0 0; line-height:30px; } .header_top .login { display:inline-block; margin-right:40px; } .header_top .login .log { color:#b94a48;} .header_top a { text-decoration:none; color:#49afcd; } .header_top a:hover { text-decoration:underline; } .header_top a img { vertical-align:-25%;} .header_top .splt { color:#bbb; margin: 0 8px;} .header_top .btn {background-color:#49afcd; padding:3px 5px 3px 5px; color:#fff; font-weight:bold; border-radius:3px; transition: background-color 0.2s linear;} .header_top .btn:hover { background-color:#3a87ad; text-decoration:none;} .header_top .faq_btn 
{background-color:#aaa;} .header_top .faq_btn:hover { background-color:#1a1a1a; } #nav{height:66px; text-align:right; margin-left:30px; clear:right; } #nav ul { margin-top: 16px; margin-left: 40px; float:right;} #nav ul li{ float:left; display:inline-block;height:37px; margin-right:15px;} #nav ul li a{display:inline-block;color: #979795; font-size:16px;font-weight:bold; line-height:32px; vertical-align:middle; padding: 0 10px; text-decoration:none; border-bottom:2px solid #c8e6e0;} #nav ul li a:hover {border-bottom:2px solid #49afcd; color:#333;} #nav ul li.active a {background-color:#49afcd;color:#fff; border-bottom:2px solid #49afcd;} #container { width: 960px; min-height:500px; margin: 0 auto; overflow:auto; padding-top:10px;} .taglineWrap { background: #eee; border-bottom: 1px solid #ddd; min-height: 20px; border: 0px solid #eee; border: 0px solid rgba(0, 0, 0, 0.05); } .taglineContent { width:960px; margin: 0 auto; padding: 24px 19px; word-wrap: break-word; } .taglineContent div { display:inline-block; } .taglineContent h1 { margin-bottom:20px;} .taglineContent h2 { margin-bottom:15px;} .taglineContent span { display:inline-block; width:110px; } .taglineContent li { font-size:14px; margin:6px 0px; } .taglineContent p { font-size:14px; margin:3px 0px; } .stat span { font-size:16px;color:#2f2f2f;} .stat strong { font-size:36px;color:#49afcd; font-weight:bold;} .stat .right { float:right;} .stat .hint { font-size:13px;color:#aaa;} .stat_num { color:#49afcd; font-size:24px; font-weight:bold; } #intro { padding: 0 50px;} #intro p{text-indent: 2em;} .col { display:inline-block; width:24.5%; } .col div { display:inline-block; } .col h1 { margin: 10px 0; } .bottom_kw { color:#ffffff;} .bottom_kw a { text-decoration:none; color:#ffffff;} </style> <meta name="baidu-site-verification" content="AO3Q6dKj9R"> <meta name="sogou_site_verification" content="9ELczs5cQc"> </head> <body> <div id="TopTipHolder" style="height: 20px;"> <div id="TopTip"> <font color="#ffffff" 
style="word-spacing:8px; letter-spacing: 2px;"> 迷惘代理IP提取网与本站合作上线-提供最专业稳定的代理IP商!......<a style="color:#1DFF31;text-decoration:none;" href="http://ip.wy96.com/" target="_blank">【立即进入】</a> </font> </div> <div id="TopTipClose" title="关闭"></div> </div> <script> eval(function(p,a,c,k,e,r){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('m k=7(a,c,b,d,e){2.o=j.l(a);2.g=j.l(c);2.8=d;2.n=e;2.9=b;2.4=2.o.s;2.f=v;2.6=!1;2.3=b?0:2.4};k.A={h:7(){5(!(2.6||2.g.s>2.4)){m a=2;2.f=u(7(){a.p()},a.n)}},p:7(){2.6=!0;5(2.9){5(2.3+=2.8,2.3>2.4){5(2.3-2.4>=2.8){2.6=2.9=!1;q(2.f);r}2.3=2.4}}t 5(2.3-=2.8,0>2.3){5(-2.3>=2.8){2.9=!0;2.6=!1;q(2.f);r}2.3=0}2.g.w.x=2.3+"y";2.g.z=2.4-2.3}};m i=B k("C","D",!0,1,E);i.h();j.l("F").G=7(){i.h()};',43,43,'||this|tempH|maxH|if|moving|function|step|expand||||||moveT|holder|play|mytip|document|TopTipEffect|getElementById|var|speed|obj|move|clearInterval|return|offsetHeight|else|setInterval|null|style|height|px|scrollTop|prototype|new|TopTip|TopTipHolder|10|TopTipClose|onclick'.split('|'),0,{})) </script> <div id="header"> <div id="logo"><a href="http://www.swei360.com/"><img width="150" height="59" src="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/kdl_logo.png" alt="360代理"></a></div> <div> <div class="header_top"> <span class="wrap"> <span class="login"> <span class="splt"></span><a id="uc_btn" class="btn" href="http://www.swei360.com/login/"><i class="icon-user icon-white"></i> 会员中心</a> </span> <a id="uc_btn" class="btn faq_btn" href="http://wpa.qq.com/msgrd?v=3&uin=153096341&site=qq&menu=yes" target="_blank"><i class="icon-question-sign icon-white"></i> 我要咨询</a> </span> </div> <div id="nav"> <ul id="menu"> <li id="menu_list"><a href="http://www.swei360.com/">首页</a></li> <li id="menu_free" 
class="active"><a href="http://www.swei360.com/free/">免费代理</a></li> <li id="menu_pricing"><a href="http://www.swei360.com/pricing/">购买代理</a></li> <li id="menu_dist"><a href="http://www.swei360.com/dist/">代理详情</a></li> <li id="menu_fetch"><a href="http://www.swei360.com/fetch/">代理提取</a></li> <li id="menu_apidoc"><a href="http://www.swei360.com/apidoc/">API接口</a></li> <li id="menu_help"><a href="http://www.swei360.com/help/">查询帮助</a></li> </ul> </div> </div> </div> <div id="container"> <div> <div class="tag_area"> <a class="label" href="http://www.swei360.com/free/?stype=1" style="background-color:#468847">国内高匿代理</a> <a class="label" href="http://www.swei360.com/free/?stype=2">国内普通代理</a> <a class="label" href="http://www.swei360.com/free/?stype=3">国外高匿代理</a> <a class="label" href="http://www.swei360.com/free/?stype=4">国外普通代理</a> <span class="buy"><a href="http://www.swei360.com/free/">购买更多代理>></a></span> </div> <div id="list" style="margin-top:15px;"> <table class="table table-bordered table-striped"> <thead> <tr> <th>IP</th> <th>PORT</th> <th>匿名度</th> <th>类型</th> <th>位置</th> <th>响应速度</th> <th>最后验证时间</th> </tr> </thead> <tbody> <tr> <td>59.32.37.225</td> <td>3128</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>广东省河源市</td> <td>7秒</td> <td>2018-08-07 13:36:29</td> </tr> <tr> <td>5.11.70.31</td> <td>53281</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>4秒</td> <td>2018-08-07 13:36:28</td> </tr> <tr> <td>31.131.79.207</td> <td>13090</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>6秒</td> <td>2018-08-07 13:36:26</td> </tr> <tr> <td>41.60.233.98</td> <td>8080</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>5秒</td> <td>2018-08-07 13:06:27</td> </tr> <tr> <td>95.52.138.199</td> <td>8080</td> <td>高匿代理IP</td> <td>HTTP</td> <td>高匿_</td> <td>8秒</td> <td>2018-08-07 13:06:26</td> </tr> <tr> <td>186.233.104.25</td> <td>8080</td> <td>高匿代理IP</td> <td>HTTP</td> <td>高匿_</td> <td>10秒</td> <td>2018-08-07 12:36:32</td> </tr> <tr> <td>186.233.104.25</td> <td>8080</td> 
<td>高匿代理IP</td> <td>HTTP</td> <td>高匿_</td> <td>5秒</td> <td>2018-08-07 12:36:31</td> </tr> <tr> <td>66.251.142.99</td> <td>53281</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>1秒</td> <td>2018-08-07 12:36:31</td> </tr> <tr> <td>106.75.21.174</td> <td>1080</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>6秒</td> <td>2018-08-07 12:36:30</td> </tr> <tr> <td>90.155.148.162</td> <td>53281</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>7秒</td> <td>2018-08-07 12:36:28</td> </tr> <tr> <td>78.186.237.245</td> <td>65103</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>9秒</td> <td>2018-08-07 12:36:27</td> </tr> <tr> <td>36.73.166.60</td> <td>3128</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>7秒</td> <td>2018-08-07 12:36:27</td> </tr> <tr> <td>47.91.237.203</td> <td>808</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>0秒</td> <td>2018-08-07 12:05:30</td> </tr> <tr> <td>217.9.94.201</td> <td>8080</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>4秒</td> <td>2018-08-07 12:05:30</td> </tr> <tr> <td>93.179.70.82</td> <td>8080</td> <td>高匿代理IP</td> <td>HTTPS</td> <td>SSL高匿_</td> <td>0秒</td> <td>2018-08-07 12:05:28</td> </tr> </tbody> </table> <p>注:表中响应速度是中国测速服务器的测试数据,仅供参考。响应速度根据你机器所在的地理位置不同而有差异。</p> <div id="listnav"> <ul><li></li> <a href="http://www.swei360.com/free/?page=1">首页</a> <a href="http://www.swei360.com/free/?page=1">上一页</a> <a href="http://www.swei360.com/free/?page=1">1</a> <font color="#FF0000">[2]</font> <a href="http://www.swei360.com/free/?page=3">3</a> <a href="http://www.swei360.com/free/?page=4">4</a> <a href="http://www.swei360.com/free/?page=5">5</a> <a href="http://www.swei360.com/free/?page=6">6</a> <a href="http://www.swei360.com/free/?page=7">7</a> <a href="http://www.swei360.com/free/?page=3">下一页</a> <a href="http://www.swei360.com/free/?page=7">尾页</a> 页次:<strong><font color="red">2</font>/7</strong>页 共<b><font color="#FF0000">100</font></b>条记录 <li></li> </ul> </div> <div class="btn center"><a 
id="tobuy" href="http://www.swei360.com/free/">购买更多代理</a></div> </div> </div> </div> <style> .tag_area { margin:10px 0 0px 0; } .tag_area .label { background-color:#c1c1bf;text-decoration:none; font-size:13px; padding:3px 5px 3px 5px;} .tag_area .label.active, .tag_area .label.active:hover { background-color:#468847; } .tag_area .label:hover { background-color:#aaa; } tbody a { color:#777; } tbody a:hover { text-decoration:none; } </style> <div id="footer2"> <div class="container"> <div class="copyright">© 2015 Swei360.com 版权所有<script type="text/javascript">var cnzz_protocol = (("https:" == document.location.protocol) ? " https://" : " http://");document.write(unescape("%3Cspan id='cnzz_stat_icon_1000194460'%3E%3C/span%3E%3Cscript src='" + cnzz_protocol + "s96.cnzz.com/z_stat.php%3Fid%3D1000194460%26show%3Dpic' type='text/javascript'%3E%3C/script%3E"));</script><span id="cnzz_stat_icon_1000194460"><a href="http://www.cnzz.com/stat/website.php?web_id=1000194460" target="_blank" title="站长统计"><img border="0" hspace="0" vspace="0" src="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/pic.gif"></a></span><script src="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/z_stat.php" type="text/javascript"></script><script src="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/core.php" charset="utf-8" type="text/javascript"></script></div> <div class="footnav"><a href="http://www.swei360.com/about/">关于我们</a><span>|</span><a href="http://www.swei360.com/help/">帮助中心</a><span>|</span><a href="http://www.swei360.com/privacy/">隐私政策</a><span>|</span><a href="http://www.swei360.com/help/" target="_blank">订单查询</a><span>|</span><a href="http://ip.wy96.com/aboutus.asp" target="_blank">迷惘代理</a></div> <div class="icon"> <a href="http://zhanzhang.anquan.org/physical/report/" target="_blank" title="安全联盟站长平台安全网站"><img width="105" height="40" src="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/zhanzhang.png" alt="安全联盟站长平台"></a> </div> </div> 
</div> <div class="bottom_kw" style="display:none;"> <span>Keywords: <a href="http://www.swei360.com/" title="免费代理ip" target="_blank"><strong><span>免费代理ip</span></strong></a> <a href="http://www.swei360.com/" title="代理ip地址" target="_blank"><strong><span>代理ip地址</span></strong></a> <a href="http://www.swei360.com/" title="免费代理服务器" target="_blank"><strong><span>免费代理服务器</span></strong></a> <a href="http://www.swei360.com/" title="代理服务器地址" target="_blank"><strong><span>代理服务器地址</span></strong></a> </span></div> <a href="http://www.swei360.com/free/?page=2#top" id="top_btn" class="label btt" style="display:none;"><span class="icon-chevron-up icon-white"></span></a> <script language="javascript" type="text/javascript" src="./360三维代理 - 高速http代理ip每天更新https和socks和connect免费匿名长效提取_files/all.js.下载"></script> <script type="text/javascript"> $("#tag_inha").addClass("active") $(document).ready(function() { }); </script> <script type="text/javascript"> var menu = "menu_free"; if(menu) $('#'+menu).addClass('active'); var ucm = ""; if(ucm){ $('#ucm_'+ucm).addClass('active'); $('#ucm_'+ucm+' a i').addClass('icon-white'); } $(document).ready(function () { $(window).scroll(function () { if ($(this).scrollTop() > 100) { $('#top_btn').fadeIn(200); } else { $('#top_btn').fadeOut(200); } }); $('#top_btn').click(function () { $("html, body").animate({ scrollTop: 0 }, 500); return false; }); }); </script> </body></html>
Java source:
package com.star.ott.webcrawler.api.business.test;

import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/**
 * Created by zzydd on 2018/8/7.
 */
public class Swei360ProxyIpCrawler {

    private static final Logger LOGGER = LoggerFactory.getLogger(Swei360ProxyIpCrawler.class);

    private static final String SWEI360URL = "http://www.swei360.com/free/";

    public static void main(String[] args) {
        Set<String> ipSet = getData(SWEI360URL, true);
        ipSet.forEach(ip -> System.out.print(ip + "; "));
    }

    /**
     * Extracts "ip:port" entries from the proxy table on the given page.
     * When isFirst is true, it also follows the pagination links found on that page.
     */
    private static Set<String> getData(String url, boolean isFirst) {
        Set<String> result = new HashSet<String>();
        String htmlStr = getHtmlStrByUrl(url);
        Document doc = Jsoup.parse(htmlStr);

        // Each <tr> in the table body is one proxy; the first two cells are the IP and the port.
        Elements ipElements = doc.select("table[class=table table-bordered table-striped]").select("tbody").select("tr");
        for (Element ipEle : ipElements) {
            Elements detailElements = ipEle.select("td");
            if (detailElements.size() >= 2) {
                String ip = detailElements.get(0).text();
                String port = detailElements.get(1).text();
                result.add(ip + ":" + port);
            }
        }

        if (isFirst) {
            // Collect the distinct pagination links and crawl each of them once (no further recursion).
            Set<String> hrefSet = new HashSet<String>();
            Elements pageElements = doc.select("div[id=listnav]").select("ul").select("a");
            pageElements.forEach(pageElement -> hrefSet.add(pageElement.attr("href")));
            hrefSet.forEach(href -> result.addAll(getData(SWEI360URL + href, false)));
        }
        return result;
    }

    /**
     * Downloads the HTML of the page at the given URL and returns it as a string.
     *
     * @param urlStr the URL to fetch
     * @return the page HTML, or an empty string if the request fails or the status is not 200
     */
    private static String getHtmlStrByUrl(String urlStr) {
        // DefaultHttpClient is deprecated since HttpClient 4.3; it is kept here as in the original code.
        HttpClient client = new DefaultHttpClient();
        HttpGet getMethod = new HttpGet(urlStr);
        String entity = "";
        try {
            HttpResponse response = client.execute(getMethod);
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode == HttpStatus.SC_OK) {
                // The page declares charset=GBK; only the ASCII IP and port cells are read,
                // so decoding as UTF-8 does not affect them.
                entity = EntityUtils.toString(response.getEntity(), "utf-8");
            }
            EntityUtils.consume(response.getEntity());
        } catch (IOException e) {
            LOGGER.error(e.getMessage(), e);
        } catch (Exception e) {
            LOGGER.error(e.getMessage(), e);
        } finally {
            getMethod.abort();
        }
        return entity;
    }
}
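The crawler builds each page URL by concatenating `SWEI360URL` with the link's `href`, which only works if the live page serves relative links such as `?page=3`. That is an assumption on my part: the saved copy of the page shown earlier contains absolute URLs, possibly rewritten by the browser when saving. A minimal sketch of a more tolerant approach, using only `java.net.URI` from the JDK, resolves each `href` against the base URL so that relative and absolute links both come out right. The class and method names here (`PageLinkResolver`, `resolvePageUrls`) are hypothetical and not part of the original crawler.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original crawler.
class PageLinkResolver {

    /**
     * Resolves every href against baseUrl and removes duplicates,
     * keeping the order in which the links first appeared.
     * Throws IllegalArgumentException if an href is not a valid URI.
     */
    static Set<String> resolvePageUrls(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        Set<String> urls = new LinkedHashSet<>();
        for (String href : hrefs) {
            // Relative references are resolved against the base;
            // absolute URLs are returned unchanged.
            urls.add(base.resolve(href).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "?page=1", "?page=1", "?page=3",
                "http://www.swei360.com/free/?page=4");
        resolvePageUrls("http://www.swei360.com/free/", hrefs)
                .forEach(System.out::println);
    }
}
```

With jsoup, a similar result can be had by passing a base URI to `Jsoup.parse(html, baseUri)` and reading links with `absUrl("href")` instead of `attr("href")`.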
Output:
60.184.13.10:53128; 41.60.237.11:8080; 66.251.142.99:53281; 36.99.207.211:61234; 5.11.70.31:53281; 46.146.221.73:37640; 78.186.237.245:65103; 79.143.119.229:53281; 217.9.94.93:8080; 37.144.65.140:37808; 36.6.187.245:63909; 217.171.86.2:53281; 46.146.203.124:42031; 41.60.235.169:8080; 47.254.29.247:808; 43.229.95.248:53281; 221.229.18.28:3128; 217.9.94.201:8080; 36.73.166.60:3128; 201.149.118.166:53281; 78.22.133.172:8080; 37.110.56.181:52771; 5.202.148.116:53281; 211.194.198.73:808; 78.137.229.188:8080; 218.103.42.103:1080; 217.171.86.51:53281; 36.33.25.189:808; 37.113.163.94:53281; 81.163.50.149:41258; 93.179.70.82:8080; 95.9.248.246:1080; 200.6.140.241:53281; 95.56.109.198:8080; 220.191.100.253:6666; 191.206.9.66:8080; 220.191.15.175:6666; 31.131.79.207:13090; 49.71.81.234:3128; 36.33.25.121:808; 42.84.154.199:80; 85.172.175.128:8080; 87.119.246.13:8080; 36.6.189.202:63909; 220.191.102.10:6666; 5.202.151.74:8080; 220.191.12.76:6666; 59.32.37.239:3128; 191.5.79.4:53281; 36.33.25.35:808; 37.29.55.214:53562; 31.59.216.205:8080; 37.235.70.27:50710; 41.60.233.98:8080; 90.155.148.162:53281; 36.99.206.162:61234; 220.191.14.199:6666; 95.52.138.199:8080; 186.233.104.25:8080; 41.60.235.113:8080; 36.33.25.68:808; 31.148.122.140:52096; 41.60.237.37:8080; 220.191.100.155:6666; 223.242.92.229:31588; 46.146.128.225:41061; 217.171.86.1:53281; 91.211.106.21:41041; 87.249.214.82:52727; 50.244.210.68:8082; 59.32.37.225:3128; 106.75.21.174:1080; 83.147.234.131:8080; 46.147.38.86:53723; 85.174.89.171:8080; 36.33.25.156:808; 220.191.13.243:6666; 37.113.150.94:53927; 5.202.158.234:8080; 91.244.77.111:41258; 46.146.232.251:38542; 45.118.204.168:8080; 36.99.206.236:61234; 95.143.109.139:41258; 92.112.37.58:8080; 85.214.89.71:80; 91.109.154.226:53281; 61.6.61.226:53281; 45.7.49.168:53281; 83.147.238.239:53281; 93.185.9.251:8080; 220.191.103.28:6666; 217.171.86.53:53281; 191.22.196.81:8080; 91.210.148.217:53281; 47.91.237.203:808;
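The output above is a single line of `ip:port; ` entries. To actually use them, each entry has to be split back into a host and a port. The following is a small sketch, under the assumption that the output is kept as one string in exactly that format; `ProxyListParser` and `parse` are hypothetical names, and only JDK classes are used.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
class ProxyListParser {

    /** Parses "ip:port; ip:port; ..." into HTTP Proxy objects, skipping malformed entries. */
    static List<Proxy> parse(String output) {
        List<Proxy> proxies = new ArrayList<>();
        for (String entry : output.split(";")) {
            String trimmed = entry.trim();
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                continue; // empty entry, or host/port missing
            }
            try {
                int port = Integer.parseInt(trimmed.substring(colon + 1));
                if (port < 1 || port > 65535) {
                    continue; // not a usable TCP port
                }
                String host = trimmed.substring(0, colon);
                // createUnresolved avoids any name lookup while parsing.
                proxies.add(new Proxy(Proxy.Type.HTTP,
                        InetSocketAddress.createUnresolved(host, port)));
            } catch (NumberFormatException e) {
                // port is not a number: skip this entry
            }
        }
        return proxies;
    }

    public static void main(String[] args) {
        String sample = "60.184.13.10:53128; 41.60.237.11:8080; 66.251.142.99:53281; ";
        for (Proxy p : parse(sample)) {
            InetSocketAddress addr = (InetSocketAddress) p.address();
            System.out.println(addr.getHostString() + " " + addr.getPort());
        }
    }
}
```

Free proxies tend to go stale quickly, so in practice each parsed entry would still need to be checked before use.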
As a bonus, here is how to send a request through an HTTP proxy with RestTemplate (it calls Youdao Translate, Chinese to English ^_^):
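import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestTemplate;

import java.net.InetSocketAddress;
import java.net.Proxy;

// The class and method names are placeholders added so that the snippet compiles.
public class ProxyRestTemplateDemo {

    // proxyIp and proxyPort come from one "ip:port" entry of the crawler output.
    public static void translate(String proxyIp, int proxyPort) {
        RestTemplate restTemplate = new RestTemplate();
        SimpleClientHttpRequestFactory reqfac = new SimpleClientHttpRequestFactory();
        reqfac.setProxy(new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyIp, proxyPort)));
        restTemplate.setRequestFactory(reqfac);

        String q = "中文字符串"; // the Chinese text to translate
        ResponseEntity<String> response = restTemplate.postForEntity("https://aidemo.youdao.com/trans", initAuthToOpEntity(q), String.class);
        System.out.print(response.getBody());
    }

    // Builds the form-encoded request: translate q from Simplified Chinese (zh-CHS) to English.
    private static HttpEntity<MultiValueMap<String, String>> initAuthToOpEntity(String q) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_FORM_URLENCODED);
        MultiValueMap<String, String> map = new LinkedMultiValueMap<String, String>();
        map.add("q", q);
        map.add("from", "zh-CHS");
        map.add("to", "en");
        return new HttpEntity<MultiValueMap<String, String>>(map, headers);
    }
}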