ZeroCrawler V0.1: A Multi-Threaded Crawler
ZeroCrawler V0.1 is a simple multi-threaded crawler. Its basic structure and workflow are as follows:
The program works like this: the Scheduler keeps taking URLs from the Queue, and whenever an idle crawler (a free thread) is available it hands the URL to that crawler. The crawler downloads the page, extracts URLs from it, saves the page, and then returns to the Scheduler (becomes an idle thread again). When the Queue has no URLs left to crawl and all crawlers are idle, the program stops.
The Scheduler's main job is to build the thread pool, take URLs from the Queue, and hand them to threads. The part that is easy to get wrong is the exit condition. Exiting as soon as the Queue is empty is not enough, because some crawlers may still be working and may add newly extracted URLs to the Queue. The correct exit condition is: the Queue is empty and every thread in the pool is idle. The Scheduler is implemented as follows:
public static void Crawl(String url, String savePath) {
    int cnt = 1;
    long startTime = System.currentTimeMillis();
    AtomicInteger numberOfThreads = new AtomicInteger();   // number of crawlers currently working
    ThreadPoolExecutor executor = new ThreadPoolExecutor(m_maxThreads, m_maxThreads, 3,
            TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());   // build the thread pool
    Queue.Add(UrlUtility.Encode(UrlUtility.Normalizer(url)));   // add the seed URL to the Queue

    try {
        while ((url = Queue.Fetch()) != null) {
            executor.execute(new PageCrawler(url, savePath, numberOfThreads));   // hand the URL to a crawler

            // don't exit too early: wait while crawlers are still working on an empty Queue,
            // or while every crawler thread is busy
            while( (Queue.Size() == 0 && numberOfThreads.get() != 0)
                    || (numberOfThreads.get() == m_maxThreads) ) {
                sleep();
            }

            //if( cnt++ > 1000 ) break;

            if( Queue.Size() == 0 && numberOfThreads.get() == 0 )
                break;
        }
    } finally {
        executor.shutdown();
    }

    long useTime = System.currentTimeMillis() - startTime;
    System.out.println("use " + Utility.ToStandardTime((int)(useTime / 1000)) + " to finish " + cnt + " links");
    System.out.println("remain url: " + Queue.Size());
}
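The Scheduler hands each URL to a PageCrawler, whose source is not listed in this article. The following is only a minimal sketch of what such a worker might look like, assuming it simply glues together the helper methods described later (GetEntity, GetContent, ExtractURL, SavePage). The holder class names PageGetter and PageSaver are made up for illustration, and the sketch decodes via the byte-array overload of GetContent; the real class may wire the download and decode steps differently.

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

public class PageCrawler implements Runnable {
    private final String m_url;
    private final String m_savePath;
    private final AtomicInteger m_numberOfThreads;

    public PageCrawler(String url, String savePath, AtomicInteger numberOfThreads) {
        m_url = url;
        m_savePath = savePath;
        m_numberOfThreads = numberOfThreads;
        m_numberOfThreads.incrementAndGet();   // tell the Scheduler one more crawler is busy
    }

    @Override
    public void run() {
        try {
            HttpEntity entity = PageGetter.GetEntity(m_url);          // download (PageGetter is a hypothetical name)
            if( entity != null ) {
                byte[] bytes = EntityUtils.toByteArray(entity);
                String content = PageGetter.GetContent(bytes);        // decode the byte stream
                if( content != null ) {
                    UrlUtility.ExtractURL(m_url, content);            // feed newly found URLs into the Queue
                    PageSaver.SavePage(bytes, content, m_savePath);   // save the raw bytes (PageSaver is hypothetical)
                }
            }
        } catch (Exception e) {
            // ignore and move on; the Scheduler only cares that this thread becomes idle again
        } finally {
            m_numberOfThreads.decrementAndGet();   // back to idle
        }
    }
}

The important detail is the AtomicInteger bookkeeping: incrementing it in the constructor (before the task is even scheduled) and decrementing it in a finally block keeps the counter accurate, which is exactly what the Scheduler's exit condition relies on.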
The Queue stores URLs and decides whether a URL has been seen before. Currently it first uses a HashSet to check whether a URL has already been stored, and then appends the complete URL to a list. URLs are fetched from the Queue in breadth-first order.
public class Queue {
    private static HashSet<String> m_appear = new HashSet<String>();
    private static LinkedList<String> m_queue = new LinkedList<String>();

    public synchronized static void Add(String url) {
        if( !m_appear.contains(url) ) {
            m_appear.add(url);
            m_queue.addLast(url);
        }
    }

    public synchronized static String Fetch() {
        if( !m_queue.isEmpty() ) {
            return m_queue.poll();
        }
        return null;
    }

    public static int Size() {
        return m_queue.size();
    }
}
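As a quick sanity check of the de-duplication and FIFO (breadth-first) behaviour, with made-up URLs:

Queue.Add("http://example.com/a");
Queue.Add("http://example.com/b");
Queue.Add("http://example.com/a");    // duplicate, silently ignored
System.out.println(Queue.Fetch());    // http://example.com/a  (oldest first)
System.out.println(Queue.Fetch());    // http://example.com/b
System.out.println(Queue.Fetch());    // null, the queue is empty

Note that Size() is not synchronized like Add() and Fetch() are; the Scheduler only uses it as a rough signal, but declaring it synchronized as well would be the safer choice.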
Next, the crawler's most important functions are introduced one by one, starting with fetching a page. Fetching a page has two parts: downloading the page, and decoding the byte stream correctly. Downloading is done with httpclient-4.2.2, as follows:
// User-Agent strings used to disguise the crawler
private static String[] m_agent = {
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
};

private static Logger m_debug = LogManager.getLogger("Debuglogger");

// get the entity referred to by "url"
public static HttpEntity GetEntity(String url) {
    HttpClient client = new DefaultHttpClient();
    HttpGet getMethod = new HttpGet(UrlUtility.Encode(url));
    getMethod.getParams().setParameter("http.protocol.cookie-policy",
            CookiePolicy.BROWSER_COMPATIBILITY);

    // pick a random User-Agent
    java.util.Random r = new java.util.Random();
    getMethod.setHeader("User-Agent", m_agent[r.nextInt(m_agent.length)]);

    HttpResponse response = null;
    try {
        response = client.execute(getMethod);
    } catch (Exception e) {
        m_debug.debug("can't get response from " + url);
        m_debug.debug("reason is : " + e.getMessage());
        return null;
    }

    int statusCode = response.getStatusLine().getStatusCode();
    if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY)
            || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
            || (statusCode == HttpStatus.SC_SEE_OTHER)
            || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
        // follow the redirect and fetch the new location
        return GetEntity(response.getLastHeader("Location").getValue());
    } else if( statusCode == HttpStatus.SC_NOT_FOUND ) {
        // page not found
        m_debug.debug(url + " : page was not found");
        response = null;
    }

    if( response != null )
        return response.getEntity();
    else
        return null;
}
Once the site returns the entity, the next step is to decode the byte stream correctly to get the page content. Usually a downloaded page states its charset clearly in the header, but when it doesn't, we have to detect the encoding ourselves. Charset detection is not a trivial job, so the ready-made ICU4J library is used. The implementation is as follows:
// get the page content from "entity"
public static String GetContent(HttpEntity entity) {
    if( entity != null ) {
        byte[] bytes;
        try {
            bytes = EntityUtils.toByteArray(entity);
        } catch (IOException e) {
            m_debug.debug("can't get bytes from entity. Reason is: " + e.getMessage());
            return null;
        }

        String charSet = EntityUtils.getContentCharSet(entity);   // charset declared by the page
        if( charSet != null ) {   // the page declares its own encoding
            try {
                return new String(bytes, charSet);
            } catch (UnsupportedEncodingException e) {
                m_debug.debug("unsupported charset " + charSet);
                return null;
            }
        } else {
            return GetContent(bytes);
        }
    }
    return null;
}

// detect the encoding with ICU4J and return the decoded page content
public static String GetContent(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect();
    try {
        return match.getString();
    } catch (Exception e) {
        m_debug.debug("can't get content. Reason is: " + e.getMessage());
        return null;
    }
}
The second important function is obtaining URLs. It has three steps: extracting URLs, stitching URLs together, and encoding URLs. Extraction is done with a regular expression. If the extracted URL is already complete, so much the better; if not, it has to be stitched. To stitch a URL, it is split into three parts: scheme, host, and path; whichever part the extracted relative URL lacks is filled in from the base URL. Finally, if the URL contains illegal characters such as Chinese characters or spaces, those characters are encoded as UTF-8.
public class UrlUtility {
    private static String m_urlPatternString = "(?i)(?s)<\\s*?a.*?href=\"(.*?)\".*?>";
    private static Pattern m_urlPattern = Pattern.compile(m_urlPatternString);
    private static Logger m_debug = LogManager.getLogger("Debuglogger");

    // extract anchors from "content" and add the refined URLs to the Queue
    public static void ExtractURL(String baseUrl, String content) {
        Matcher matcher = m_urlPattern.matcher(content);
        while( matcher.find() ) {
            String anchor = matcher.group();
            String url = Utility.GetSubString(anchor, "href=\"", "\"");
            if( (url = UrlUtility.Refine(baseUrl, url)) != null ) {
                Queue.Add(url);
            }
        }
    }

    // encode "url" so that it contains only legal URL characters
    public static String Encode(String url) {
        String res = "";
        for(char c : url.toCharArray()) {
            if( !":/.?&#=".contains("" + c) ) {
                try {
                    res += URLEncoder.encode("" + c, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    m_debug.debug("This JVM has no UTF-8 charset. It's strange");
                }
            } else {
                res += c;
            }
        }
        return res;
    }

    public static String Normalizer(String url) {
        url = url.replaceAll("&amp;", "&");   // turn HTML-escaped "&amp;" back into "&"
        if( url.endsWith("/") ) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }

    // stitch a relative URL onto its base URL
    public static String Refine(String baseUrl, String relative) {
        if( baseUrl == null || relative == null ) {
            return null;
        }
        final Url base = Parse(baseUrl), url = Parse(relative);
        if( base == null || url == null ) {
            return null;
        }

        if( url.scheme == null ) {
            url.scheme = base.scheme;
            if( url.host == null ) {
                url.host = base.host;
            }
        }

        if( url.path.startsWith("../") ) {
            String prefix = "";
            int idx = base.path.lastIndexOf('/');
            if( (idx = base.path.lastIndexOf('/', idx - 1)) > 0 )
                prefix = base.path.substring(0, idx + 1);   // keep the trailing '/'
            url.path = prefix + url.path.substring(3);
        }
        return Normalizer(url.ToUrl());
    }

    // split a URL into scheme, host and path
    private static Url Parse(String link) {
        int idx, endIndex;
        final Url url = new Url();

        if( (idx = link.indexOf("#")) >= 0 ) {   // ignore the fragment
            if( idx == 0 )
                return null;
            else
                link = link.substring(0, idx);
        }

        // if( (idx = link.indexOf("?")) > 0 ) {   // ignore query information
        //     link = link.substring(0, idx);
        // }

        if( (idx = link.indexOf(":")) > 0 ) {
            url.scheme = link.substring(0, idx).trim();
            if( IsLegalScheme(url.scheme) ) {
                link = link.substring(idx + 1);
            } else {
                return null;
            }
        }

        if( link.startsWith("//") ) {
            if( (endIndex = link.indexOf('/', 2)) > 0 ) {
                url.host = link.substring(2, endIndex).trim();
                link = link.substring(endIndex + 1);
            } else {
                url.host = link.substring(2).trim();
                link = null;
            }
        }

        if( link != null )
            url.path = link.trim();
        else
            url.path = "";
        return url;
    }

    // check whether the scheme is one we handle
    private static boolean IsLegalScheme(String scheme) {
        if( scheme.equals("http") || scheme.equals("https") || scheme.equals("ftp") )
            return true;
        else
            return false;
    }

    private static class Url {
        public Url() {}

        public String ToUrl() {
            String prefix = null;
            if( path.startsWith("/") )
                prefix = scheme + "://" + host;
            else
                prefix = scheme + "://" + host + "/";
            return prefix + path;
        }

        public String scheme;
        public String host;
        public String path;
    }
}
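To make the stitching and encoding rules concrete, here is how Refine and Encode behave on a few made-up URLs (results traced from the code above):

UrlUtility.Refine("http://example.com/a/b/page.html", "/about.html");
// -> "http://example.com/about.html"          (scheme and host filled in from the base)
UrlUtility.Refine("http://example.com/a/b/page.html", "../img/logo.png");
// -> "http://example.com/a/img/logo.png"      ("../" resolved against the base path)
UrlUtility.Encode("http://example.com/a b");
// -> "http://example.com/a+b"                 (URLEncoder turns a space into "+")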
The last important function is saving the page. The key point here: if you decode the byte stream and then save the result as an HTML file (i.e., with the .html extension), you must specify a charset when saving, and it must be the charset declared in the page header. Otherwise the file will be garbled when opened later. The reason is that the system decodes such files according to the charset declared in their header. When a String is saved without specifying a charset, the platform's default charset is used; if that default differs from the charset in the header, the system still decodes with the header's charset, and garbled text is the result. It is therefore recommended to simply save the original byte stream.
// save the page
public static boolean SavePage(byte[] bytes, String content, String savePath) {
    String name = Utility.GetSubString(content, "<title>", "</title>");   // use the page title as the file name
    if( name != null )
        name = name.trim() + ".html";
    else
        return false;
    name = FixFileName(name);

    try {
        FileOutputStream fos = new FileOutputStream(new File(savePath, name));
        fos.write(bytes);
        fos.close();
    } catch(FileNotFoundException e) {
        m_debug.debug("can't create a file named \"" + name + "\"");
        return false;
    } catch (IOException e) {
        m_debug.debug(e.getMessage());
        return false;
    }
    return true;
}

// replace illegal characters in the file name
public static String FixFileName(String name) {
    String res = "";
    for(char c : name.toCharArray()) {
        if( "/\\:*?\"<>|".contains("" + c) ) {
            res += " ";
        } else {
            res += c;
        }
    }
    return res;
}
That covers the main parts of ZeroCrawler V0.1. The complete code can be downloaded from [1], and the libraries needed to run it from [2].
[1] http://ishare.iask.sina.com.cn/f/34836546.html
[2] http://ishare.iask.sina.com.cn/f/34836710.html