ZeroCrawler V0.1: A Multi-Threaded Crawler
ZeroCrawler V0.1 is a simple multi-threaded crawler. Its basic structure and workflow are as follows:
The program works like this: the Scheduler keeps taking URLs from the Queue, and whenever an idle crawler (a free thread) is available it hands the URL to that crawler. The crawler downloads the page, extracts URLs from it, saves the page, and then returns to the Scheduler (becomes an idle thread again). When the Queue has no URLs left to crawl and all crawlers are idle, the program stops.
The Scheduler's main job is to build the thread pool, take URLs from the Queue, and hand them to threads. The part that is easy to get wrong is the exit condition. Exiting as soon as the Queue is empty is not enough, because some crawlers may still be working and may add newly extracted URLs to the Queue. The correct exit condition is: the Queue is empty and every thread in the pool is idle. The Scheduler is implemented as follows:
public static void Crawl(String url, String savePath) {
    int cnt = 1;
    long startTime = System.currentTimeMillis();
    AtomicInteger numberOfThreads = new AtomicInteger();   // number of crawlers currently working
    ThreadPoolExecutor executor = new ThreadPoolExecutor(m_maxThreads, m_maxThreads, 3,
            TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());   // build the thread pool
    Queue.Add(UrlUtility.Encode(UrlUtility.Normalizer(url)));   // add the seed URL to the Queue

    try {
        while ((url = Queue.Fetch()) != null) {
            executor.execute(new PageCrawler(url, savePath, numberOfThreads));   // hand the URL to a crawler

            // don't exit too early: wait while crawlers are still working on an empty Queue,
            // or while every crawler thread is busy
            while( (Queue.Size() == 0 && numberOfThreads.get() != 0)
                    || (numberOfThreads.get() == m_maxThreads) ) {
                sleep();
            }

            //if( cnt++ > 1000 ) break;

            if( Queue.Size() == 0 && numberOfThreads.get() == 0 )
                break;
        }
    } finally {
        executor.shutdown();
    }

    long useTime = System.currentTimeMillis() - startTime;
    System.out.println("use " + Utility.ToStandardTime((int)(useTime / 1000)) + " to finish " + cnt + " links");
    System.out.println("remain url: " + Queue.Size());
}
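The Scheduler hands each URL to a PageCrawler, whose source is not listed in this article. The following is only a minimal sketch of what such a worker might look like, assuming it simply glues together the helper methods described later (GetEntity, GetContent, ExtractURL, SavePage). The holder class names PageGetter and PageSaver are made up for illustration, and the sketch decodes via the byte-array overload of GetContent; the real class may wire the download and decode steps differently.

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

public class PageCrawler implements Runnable {
    private final String m_url;
    private final String m_savePath;
    private final AtomicInteger m_numberOfThreads;

    public PageCrawler(String url, String savePath, AtomicInteger numberOfThreads) {
        m_url = url;
        m_savePath = savePath;
        m_numberOfThreads = numberOfThreads;
        m_numberOfThreads.incrementAndGet();   // tell the Scheduler one more crawler is busy
    }

    @Override
    public void run() {
        try {
            HttpEntity entity = PageGetter.GetEntity(m_url);          // download (PageGetter is a hypothetical name)
            if( entity != null ) {
                byte[] bytes = EntityUtils.toByteArray(entity);
                String content = PageGetter.GetContent(bytes);        // decode the byte stream
                if( content != null ) {
                    UrlUtility.ExtractURL(m_url, content);            // feed newly found URLs into the Queue
                    PageSaver.SavePage(bytes, content, m_savePath);   // save the raw bytes (PageSaver is hypothetical)
                }
            }
        } catch (Exception e) {
            // ignore and move on; the Scheduler only cares that this thread becomes idle again
        } finally {
            m_numberOfThreads.decrementAndGet();   // back to idle
        }
    }
}

The important detail is the AtomicInteger bookkeeping: incrementing it in the constructor (before the task is even scheduled) and decrementing it in a finally block keeps the counter accurate, which is exactly what the Scheduler's exit condition relies on.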
The Queue stores URLs and decides whether a URL has been seen before. Currently it first uses a HashSet to check whether a URL has already been stored, and then appends the complete URL to a list. URLs are fetched from the Queue in breadth-first order.
public class Queue {
    private static HashSet<String> m_appear = new HashSet<String>();
    private static LinkedList<String> m_queue = new LinkedList<String>();

    public synchronized static void Add(String url) {
        if( !m_appear.contains(url) ) {
            m_appear.add(url);
            m_queue.addLast(url);
        }
    }

    public synchronized static String Fetch() {
        if( !m_queue.isEmpty() ) {
            return m_queue.poll();
        }
        return null;
    }

    public static int Size() {
        return m_queue.size();
    }
}
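As a quick sanity check of the de-duplication and FIFO (breadth-first) behaviour, with made-up URLs:

Queue.Add("http://example.com/a");
Queue.Add("http://example.com/b");
Queue.Add("http://example.com/a");    // duplicate, silently ignored
System.out.println(Queue.Fetch());    // http://example.com/a  (oldest first)
System.out.println(Queue.Fetch());    // http://example.com/b
System.out.println(Queue.Fetch());    // null, the queue is empty

Note that Size() is not synchronized like Add() and Fetch() are; the Scheduler only uses it as a rough signal, but declaring it synchronized as well would be the safer choice.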
Next, the crawler's most important functions are introduced one by one, starting with fetching a page. Fetching a page has two parts: downloading the page, and decoding the byte stream correctly. Downloading is done with httpclient-4.2.2, as follows:
// User-Agent strings used to disguise the crawler
private static String[] m_agent = {
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
};

private static Logger m_debug = LogManager.getLogger("Debuglogger");

// get the entity referred to by "url"
public static HttpEntity GetEntity(String url) {
    HttpClient client = new DefaultHttpClient();
    HttpGet getMethod = new HttpGet(UrlUtility.Encode(url));
    getMethod.getParams().setParameter("http.protocol.cookie-policy",
            CookiePolicy.BROWSER_COMPATIBILITY);

    // pick a random User-Agent
    java.util.Random r = new java.util.Random();
    getMethod.setHeader("User-Agent", m_agent[r.nextInt(m_agent.length)]);

    HttpResponse response = null;
    try {
        response = client.execute(getMethod);
    } catch (Exception e) {
        m_debug.debug("can't get response from " + url);
        m_debug.debug("reason is : " + e.getMessage());
        return null;
    }

    int statusCode = response.getStatusLine().getStatusCode();
    if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY)
            || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
            || (statusCode == HttpStatus.SC_SEE_OTHER)
            || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
        // follow the redirect and fetch the new location
        return GetEntity(response.getLastHeader("Location").getValue());
    } else if( statusCode == HttpStatus.SC_NOT_FOUND ) {
        // page not found
        m_debug.debug(url + " : page was not found");
        response = null;
    }

    if( response != null )
        return response.getEntity();
    else
        return null;
}
Once the site returns the entity, the next step is to decode the byte stream correctly to get the page content. Usually a downloaded page states its charset clearly in the header, but when it doesn't, we have to detect the encoding ourselves. Charset detection is not a trivial job, so the ready-made ICU4J library is used. The implementation is as follows:
// get the page content from "entity"
public static String GetContent(HttpEntity entity) {
    if( entity != null ) {
        byte[] bytes;
        try {
            bytes = EntityUtils.toByteArray(entity);
        } catch (IOException e) {
            m_debug.debug("can't get bytes from entity. Reason is: " + e.getMessage());
            return null;
        }

        String charSet = EntityUtils.getContentCharSet(entity);   // charset declared by the page
        if( charSet != null ) {   // the page declares its own encoding
            try {
                return new String(bytes, charSet);
            } catch (UnsupportedEncodingException e) {
                m_debug.debug("unsupported charset " + charSet);
                return null;
            }
        } else {
            return GetContent(bytes);
        }
    }
    return null;
}

// detect the encoding with ICU4J and return the decoded page content
public static String GetContent(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect();
    try {
        return match.getString();
    } catch (Exception e) {
        m_debug.debug("can't get content. Reason is: " + e.getMessage());
        return null;
    }
}
The second important function is obtaining URLs. It has three steps: extracting URLs, stitching URLs together, and encoding URLs. Extraction is done with a regular expression. If the extracted URL is already complete, so much the better; if not, it has to be stitched. To stitch a URL, it is split into three parts: scheme, host, and path; whichever part the extracted relative URL lacks is filled in from the base URL. Finally, if the URL contains illegal characters such as Chinese characters or spaces, those characters are encoded as UTF-8.
public class UrlUtility {
    private static String m_urlPatternString = "(?i)(?s)<\\s*?a.*?href=\"(.*?)\".*?>";
    private static Pattern m_urlPattern = Pattern.compile(m_urlPatternString);
    private static Logger m_debug = LogManager.getLogger("Debuglogger");

    // extract anchors from "content" and add the refined URLs to the Queue
    public static void ExtractURL(String baseUrl, String content) {
        Matcher matcher = m_urlPattern.matcher(content);
        while( matcher.find() ) {
            String anchor = matcher.group();
            String url = Utility.GetSubString(anchor, "href=\"", "\"");
            if( (url = UrlUtility.Refine(baseUrl, url)) != null ) {
                Queue.Add(url);
            }
        }
    }

    // encode "url" so that it contains only legal URL characters
    public static String Encode(String url) {
        String res = "";
        for(char c : url.toCharArray()) {
            if( !":/.?&#=".contains("" + c) ) {
                try {
                    res += URLEncoder.encode("" + c, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    m_debug.debug("This JVM has no UTF-8 charset. It's strange");
                }
            } else {
                res += c;
            }
        }
        return res;
    }

    public static String Normalizer(String url) {
        url = url.replaceAll("&amp;", "&");   // turn HTML-escaped "&amp;" back into "&"
        if( url.endsWith("/") ) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }

    // stitch a relative URL onto its base URL
    public static String Refine(String baseUrl, String relative) {
        if( baseUrl == null || relative == null ) {
            return null;
        }
        final Url base = Parse(baseUrl), url = Parse(relative);
        if( base == null || url == null ) {
            return null;
        }

        if( url.scheme == null ) {
            url.scheme = base.scheme;
            if( url.host == null ) {
                url.host = base.host;
            }
        }

        if( url.path.startsWith("../") ) {
            String prefix = "";
            int idx = base.path.lastIndexOf('/');
            if( (idx = base.path.lastIndexOf('/', idx - 1)) > 0 )
                prefix = base.path.substring(0, idx + 1);   // keep the trailing '/'
            url.path = prefix + url.path.substring(3);
        }
        return Normalizer(url.ToUrl());
    }

    // split a URL into scheme, host and path
    private static Url Parse(String link) {
        int idx, endIndex;
        final Url url = new Url();

        if( (idx = link.indexOf("#")) >= 0 ) {   // ignore the fragment
            if( idx == 0 )
                return null;
            else
                link = link.substring(0, idx);
        }

        // if( (idx = link.indexOf("?")) > 0 ) {   // ignore query information
        //     link = link.substring(0, idx);
        // }

        if( (idx = link.indexOf(":")) > 0 ) {
            url.scheme = link.substring(0, idx).trim();
            if( IsLegalScheme(url.scheme) ) {
                link = link.substring(idx + 1);
            } else {
                return null;
            }
        }

        if( link.startsWith("//") ) {
            if( (endIndex = link.indexOf('/', 2)) > 0 ) {
                url.host = link.substring(2, endIndex).trim();
                link = link.substring(endIndex + 1);
            } else {
                url.host = link.substring(2).trim();
                link = null;
            }
        }

        if( link != null )
            url.path = link.trim();
        else
            url.path = "";
        return url;
    }

    // check whether the scheme is one we handle
    private static boolean IsLegalScheme(String scheme) {
        if( scheme.equals("http") || scheme.equals("https") || scheme.equals("ftp") )
            return true;
        else
            return false;
    }

    private static class Url {
        public Url() {}

        public String ToUrl() {
            String prefix = null;
            if( path.startsWith("/") )
                prefix = scheme + "://" + host;
            else
                prefix = scheme + "://" + host + "/";
            return prefix + path;
        }

        public String scheme;
        public String host;
        public String path;
    }
}
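To make the stitching and encoding rules concrete, here is how Refine and Encode behave on a few made-up URLs (results traced from the code above):

UrlUtility.Refine("http://example.com/a/b/page.html", "/about.html");
// -> "http://example.com/about.html"          (scheme and host filled in from the base)
UrlUtility.Refine("http://example.com/a/b/page.html", "../img/logo.png");
// -> "http://example.com/a/img/logo.png"      ("../" resolved against the base path)
UrlUtility.Encode("http://example.com/a b");
// -> "http://example.com/a+b"                 (URLEncoder turns a space into "+")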
The last important function is saving the page. The key point here: if you decode the byte stream and then save the result as an HTML file (i.e., with the .html extension), you must specify a charset when saving, and it must be the charset declared in the page header. Otherwise the file will be garbled when opened later. The reason is that the system decodes such files according to the charset declared in their header. When a String is saved without specifying a charset, the platform's default charset is used; if that default differs from the charset in the header, the system still decodes with the header's charset, and garbled text is the result. It is therefore recommended to simply save the original byte stream.
// save the page
public static boolean SavePage(byte[] bytes, String content, String savePath) {
    String name = Utility.GetSubString(content, "<title>", "</title>");   // use the page title as the file name
    if( name != null )
        name = name.trim() + ".html";
    else
        return false;
    name = FixFileName(name);

    try {
        FileOutputStream fos = new FileOutputStream(new File(savePath, name));
        fos.write(bytes);
        fos.close();
    } catch(FileNotFoundException e) {
        m_debug.debug("can't create a file named \"" + name + "\"");
        return false;
    } catch (IOException e) {
        m_debug.debug(e.getMessage());
        return false;
    }
    return true;
}

// replace illegal characters in the file name
public static String FixFileName(String name) {
    String res = "";
    for(char c : name.toCharArray()) {
        if( "/\\:*?\"<>|".contains("" + c) ) {
            res += " ";
        } else {
            res += c;
        }
    }
    return res;
}
That covers the main parts of ZeroCrawler V0.1. The complete code can be downloaded from [1], and the libraries needed to run it from [2].
[1] http://ishare.iask.sina.com.cn/f/34836546.html
[2] http://ishare.iask.sina.com.cn/f/34836710.html