Java Web Crawler Basics (repost)

A web crawler can do more than fetch a site's pages and images; the same techniques underlie ticket-grabbing bots, online flash-sale tools, flight-fare lookups, and so on. I spent the last few days going over the basics and am writing them down here.

     The links between web pages can be viewed as one very large graph, and a graph can be traversed depth-first or breadth-first. Web crawlers use breadth-first traversal, which in outline works as follows:

     Keep two collections: one records pages already visited (Al), the other records pages not yet visited (Un). Suppose page A is the starting point of the crawl. Parse all of A's hyperlinks B, C, D, add them to Un, and move A into Al. Next take B, parse all of its hyperlinks E and F, append them to the end of Un, and move B from Un into Al. Processing the URLs in Un one by one in this way, always appending newly found links to Un, completes a breadth-first traversal. A minimal sketch of this bookkeeping follows.
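      Below is a minimal sketch of that loop, assuming a hypothetical fetchAndExtractLinks helper that downloads a page and returns the URLs found in it (the real versions of these steps are the DownLoadFileV4 and HtmlParserTool classes later in this post):

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {

    // Breadth-first crawl: Al = set of visited URLs, Un = FIFO queue of URLs to visit.
    static void crawl(String seed, int maxPages) {
        Set<String> visited = new HashSet<String>();           // Al
        Queue<String> unvisited = new ArrayDeque<String>();    // Un
        unvisited.add(seed);
        while (!unvisited.isEmpty() && visited.size() < maxPages) {
            String url = unvisited.poll();                     // take the head of Un
            if (!visited.add(url)) {
                continue;                                      // already crawled, skip
            }
            for (String link : fetchAndExtractLinks(url)) {
                if (!visited.contains(link) && !unvisited.contains(link)) {
                    unvisited.add(link);                       // append new links to the tail of Un
                }
            }
        }
    }

    // Hypothetical helper: download the page and return the hyperlinks found in it.
    static List<String> fetchAndExtractLinks(String url) {
        return Collections.emptyList();
    }
}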

     From the above, a hand-written crawler has a few main parts: extracting the links in a page, which is done here with HTMLParser, and downloading pages, which is done here with HttpClient.

     Ready-made crawler tools such as webharvest can also be used directly. Lucene is a full-text search framework: it only builds indexes and searches them, and it cannot crawl the web by itself. Nutch, which is built on top of Lucene, is a system that does provide web crawling and search.

     Download links for the related libraries:

  HTMLParser : http://downloads.sourceforge.net/project/htmlparser/Integration-Builds/2.0-20060923/HTMLParser-2.0-SNAPSHOT-bin.zip

  HttpClient : http://hc.apache.org/downloads.cgi

      HttpClient comes in 3.x and 4.x lines. With 3.x I kept running into cache-setting problems when fetching over HTTP/1.1; with 4.x everything works. When going through a proxy, both HttpClient and HTMLParser need their own proxy configuration, as sketched below.
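      For reference, a sketch of that proxy configuration (the address 172.16.91.109:808 is just the example proxy used throughout this post, and the HTMLParser ConnectionManager calls are the ones assumed by HtmlParserTool below):

import org.apache.http.HttpHost;
import org.apache.http.conn.params.ConnRoutePNames;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.htmlparser.Parser;

public class ProxySetup {

    // HttpClient 4.x: route every request through the proxy, 3 s connect timeout.
    public static DefaultHttpClient newProxiedHttpClient() {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
                new HttpHost("172.16.91.109", 808));
        httpclient.getParams().setIntParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, 3000);
        return httpclient;
    }

    // HTMLParser fetches pages through its own ConnectionManager, so the proxy
    // must be configured there as well.
    public static void configureHtmlParserProxy() {
        Parser.getConnectionManager().setProxyHost("172.16.91.109");
        Parser.getConnectionManager().setProxyPort(808);
    }
}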

 

      The LinkQueue class holds the set of visited URLs (a hash set) and the queue of URLs still to be visited, together with operations such as enqueue, dequeue, and checking whether the unvisited queue is empty.

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class LinkQueue {
    // Set of URLs that have already been visited.
    private static Set<String> visitedUrl = new HashSet<String>();
    // Queue of URLs waiting to be visited. A LinkedList keeps FIFO order, which is
    // what breadth-first traversal needs (a PriorityQueue would reorder the URLs).
    private static Queue<String> unVisitedUrl = new LinkedList<String>();

    // Return the queue of unvisited URLs.
    public static Queue<String> getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Record a URL as visited.
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set.
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue the next unvisited URL (null if the queue is empty).
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.poll();
    }

    // Enqueue a URL only if it has never been seen, so each URL is visited once.
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }

    // Number of URLs visited so far.
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Whether the queue of unvisited URLs is empty.
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}

      The LinkFilter interface, used to filter which URLs get crawled:

public interface LinkFilter {    
    public boolean accept(String url);
}
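
      For example, a filter that restricts the crawl to a single site can be written as an anonymous class (the "csdn" substring check mirrors the filter used later in MyCrawler):

LinkFilter filter = new LinkFilter() {
    // Accept only URLs that contain "csdn"; all other links are skipped.
    public boolean accept(String url) {
        return url != null && url.contains("csdn");
    }
};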

      The DownLoadFile class fetches the content at a given URL and saves it locally. A folder named spider must exist on the F: drive to hold the downloaded pages.
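      If you do not want to create that folder by hand, a one-line sketch that makes sure it exists before crawling (the path matches the one hard-coded in saveToLocal below):

// Create f:\spider (and any missing parent directories) if it does not exist yet.
new java.io.File("f:\\spider").mkdirs();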

import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.params.ConnRoutePNames;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.CoreConnectionPNames;
import org.apache.http.util.EntityUtils;

public class DownLoadFileV4 {

    /* Download the page that url points to and return the local file path. */
    public String downloadFile(String url) throws Exception {
        String filePath = null;
        // Initialisation; the constructor differs from the HttpClient 3.1 API.
        HttpClient httpclient = new DefaultHttpClient();
        // Proxy and timeout settings; comment the proxy lines out when no proxy is used.
        HttpHost proxy = new HttpHost("172.16.91.109", 808);
        httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);
        httpclient.getParams().setIntParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, 3000);


        // Split the URL into host and path so that pages below the site root can be fetched.
        String hostAndPath = url.replace("http://", "");
        int slash = hostAndPath.indexOf('/');
        HttpHost targetHost = new HttpHost(slash == -1 ? hostAndPath : hostAndPath.substring(0, slash));
        HttpGet httpget = new HttpGet(slash == -1 ? "/" : hostAndPath.substring(slash));
        // Show the default request header (nothing has been set yet, so this prints null).
        System.out.println("Accept-Charset:" + httpget.getFirstHeader("Accept-Charset"));
        // Without a User-Agent some servers ignore Accept-Charset and always return gb2312 (observed against google.cn).
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2)");
        // Comma-separated values mean several encodings are acceptable at once.
        httpget.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpget.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        // Verify that the header was actually set.
        System.out.println("Accept-Charset:" + httpget.getFirstHeader("Accept-Charset").getValue());

        // Execute HTTP request
        System.out.println("executing request " + httpget.getURI());

        HttpResponse response = null;
        try {
            response = httpclient.execute(targetHost, httpget);
        
        // HttpResponse response = httpclient.execute(httpget);

        System.out.println("----------------------------------------");
        System.out.println("Location: " + response.getLastHeader("Location"));
        System.out.println(response.getStatusLine().getStatusCode());
        System.out.println(response.getLastHeader("Content-Type"));
        System.out.println(response.getLastHeader("Content-Length"));
        System.out.println("----------------------------------------");

        // Check the status code to decide whether a redirect needs to be followed.
        int statusCode = response.getStatusLine().getStatusCode();
        if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY) || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY) || (statusCode == HttpStatus.SC_SEE_OTHER)
                || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
            // Follow the redirect; this branch has not been fully verified.
            String newUri = response.getLastHeader("Location").getValue();
            httpclient = new DefaultHttpClient();
            httpget = new HttpGet(newUri);
            response = httpclient.execute(httpget);
        }

        // Get hold of the response entity
        HttpEntity entity = response.getEntity();

        // Print all response headers.
        Header headers[] = response.getAllHeaders();
        int ii = 0;
        while (ii < headers.length) {
            System.out.println(headers[ii].getName() + ": " + headers[ii].getValue());
            ++ii;
        }

        // If the response does not enclose an entity, there is no need
        // to bother about connection release
        if (entity != null) {
            // Buffer the response body in a byte array because the stream may be needed twice.
            byte[] bytes = EntityUtils.toByteArray(entity);
            if(response.getLastHeader("Content-Type") != null){
                filePath = "f:\\spider\\" + getFileNameByUrl(url, response.getLastHeader("Content-Type").getValue());
            }else{
                filePath = "f:\\spider\\" + url.substring(url.lastIndexOf("/"), url.length());
            }
            saveToLocal(bytes, filePath);
            
            String charSet = "";

            // If the Content-Type header carries charset information, read it directly.
            charSet = EntityUtils.getContentCharSet(entity);

            System.out.println("In header: " + charSet);
            // If the header has no charset, fall back to scanning the page source for a
            // <meta> charset declaration; this is not bulletproof, since some pages omit it.
            if (charSet == null || charSet.isEmpty()) {
                String regEx = "(?=<meta).*?(?<=charset=[\\'|\\\"]?)([[a-z]|[A-Z]|[0-9]|-]*)";
                Pattern p = Pattern.compile(regEx, Pattern.CASE_INSENSITIVE);
                // Decode with the platform default charset; the pattern contains no Chinese,
                // so any mojibake in the body does not affect the match.
                Matcher m = p.matcher(new String(bytes));
                if (m.find()) {
                    charSet = m.group(1);
                } else {
                    charSet = "";
                }
            }
            System.out.println("Last get: " + charSet);
            // At this point the byte array could be decoded with the detected charset (if one was found):
            //System.out.println("Encoding string is: " + new String(bytes, charSet));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            httpget.abort();
            httpclient.getConnectionManager().shutdown();
        }
        
        return filePath;
    }
    
    /**
     * Build the file name used to save the page, based on the URL and the content type;
     * characters that are not allowed in file names are replaced.
     */
    public String getFileNameByUrl(String url, String contentType) {
        // remove http://
        url = url.substring(7);
        // text/html type
        if (contentType.indexOf("html") != -1 && url.indexOf(".jpg") == -1 && url.indexOf(".gif") == -1) {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        }
        // image URLs: keep their own extension, but sanitise the name
        else if (url.indexOf(".jpg") != -1 || url.indexOf(".gif") != -1) {
            return url.replaceAll("[\\?/:*|<>\"]", "_");
        }
        // other types, e.g. application/pdf
        else {
            return url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
    }

    /**
     * Save the page's byte array to a local file; filePath is the destination path.
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
            out.write(data);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

      The HtmlParserTool class extracts the hyperlinks in a page (a tags, the src of frame tags, and so on), i.e. the URLs of the child nodes. It requires htmlparser.jar on the classpath.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserTool {
    public static List<String> imageURLS = new ArrayList<String>();

    // Extract the links on a page; filter decides which links are kept.
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser();
            // Proxy settings; comment these out when no proxy is used.
            System.getProperties().put("proxySet", "true");
            System.getProperties().put("proxyHost", "172.16.91.109");
            System.getProperties().put("proxyPort", "808");
            Parser.getConnectionManager().setProxyHost("172.16.91.109");

            parser.setURL(url);
            parser.setEncoding("utf-8");
            // Filter that matches <img> tags.
            NodeFilter imgfil = new TagNameFilter("IMG");

            // Filter for <frame> tags, used to extract the link in their src attribute.
            NodeFilter frameFilter = new NodeFilter() {
                private static final long serialVersionUID = -6464506837817768182L;

                public boolean accept(Node node) {
                    if (node.getText().startsWith("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter combining the <a> tag filter and the <frame> filter.
            OrFilter lf = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);

            // Get every tag that matches the link/frame filter or the <img> filter,
            // so that <a>, <frame> and <img> tags are all handled below.
            NodeList list = parser.extractAllNodesThatMatch(new OrFilter(lf, imgfil));
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag)// <a> tag
                {
                    if (tag instanceof ImageTag) {
                        // record the image URL
                        ImageTag link = (ImageTag) tag;
                        String imageUrl = link.getImageURL();// url
                        links.add(imageUrl);
                        imageURLS.add(imageUrl);
                        System.out.println(imageUrl);
                    } else {
                        LinkTag link = (LinkTag) tag;
                        String linkUrl = link.getLink();// url
                        if (filter.accept(linkUrl))
                            links.add(linkUrl);
                    }
                } else// <img> or <frame> tag
                {
                    if (tag instanceof ImageTag) {
                        // record the image URL
                        ImageTag link = (ImageTag) tag;
                        String imageUrl = link.getImageURL();// url
                        links.add(imageUrl);
                        imageURLS.add(imageUrl);
                        System.out.println(imageUrl);
                    } else {
                        // Extract the link from the frame's src attribute, e.g. <frame src="test.html"/>
                        String frame = tag.getText();
                        int start = frame.indexOf("src=");
                        frame = frame.substring(start);
                        int end = frame.indexOf(" ");
                        if (end == -1)
                            end = frame.indexOf(">");
                        String frameUrl = frame.substring(5, end - 1);
                        if (filter.accept(frameUrl))
                            links.add(frameUrl);
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}

      The MyCrawler test class, used to try out the crawl:

import java.util.Set;

public class MyCrawler {
    /**
     * Initialise the URL queue with the seed URLs.
     * 
     * @param seeds
     *            the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            LinkQueue.addUnvisitedUrl(seeds[i]);
    }

    /**
     * The crawl loop.
     * 
     * @param seeds
     *            the seed URLs
     * @throws Exception
     */
    public void crawling(String[] seeds) throws Exception { 
        LinkFilter filter = new LinkFilter() {
            // Only follow links whose URL contains "csdn".
            public boolean accept(String url) {
                return url != null && url.contains("csdn");
            }
        };
        // Initialise the URL queue with the seeds.
        initCrawlerWithSeeds(seeds);
        // Loop while there are unvisited URLs and no more than 1000 pages have been crawled.
        while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue.
            String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownLoadFileV4 downLoader = new DownLoadFileV4();
            // Download the page.
            try {
                downLoader.downloadFile(visitUrl);
                // To download only images instead of pages:
                //if(HtmlParserTool.imageURLS.contains(visitUrl)){
                //    downLoader.downloadFile(visitUrl);
                //}
            } catch (Exception e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            System.out.println();            
            // Record the URL as visited.
            LinkQueue.addVisitedUrl(visitUrl);
            // Extract the URLs from the downloaded page.
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Enqueue the new, unvisited URLs.
            for (String link : links) {
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    // main entry point
    public static void main(String[] args) throws Exception {
        MyCrawler crawler = new MyCrawler();
        crawler.crawling(new String[] {"http://www.csdn.com"});
    }
}


http://www.cnblogs.com/lnlvinso/p/3970865.html

 

 

BLEXBot Crawler

General information about the BLEXBot site crawler

 


What is it

The BLEXBot crawler is an automated robot that visits pages to examine and analyse their content; in this sense it is similar to the robots used by the major search engine companies.

The BLEXBot crawler is identified by having a user-agent of the following form:
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)

If you suspect that requests are being spoofed, first check the IP address of the request and do a reverse DNS lookup on it with an appropriate tool - the resulting name should point to one of the sub-domains of *.webmeup.com.
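
A minimal sketch of that check in Java, assuming the requester's IP address is already known (isBlexBot is a hypothetical helper, not part of any library):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ReverseDnsCheck {

    // Resolve the client IP back to a host name and check whether it belongs to
    // webmeup.com, as suggested above. A stricter check would also resolve the
    // returned host name forward again and compare it with the original IP.
    static boolean isBlexBot(String clientIp) {
        try {
            String host = InetAddress.getByName(clientIp).getCanonicalHostName();
            return host.endsWith(".webmeup.com");
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 203.0.113.10 is a documentation-range IP used purely as a placeholder.
        System.out.println(isBlexBot("203.0.113.10"));
    }
}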

We care about your site's performance and will never hurt it!

BLEXbot is a very site-friendly crawler. We made it as "gentle" as possible when crawling sites: it makes only 1 request every 3 seconds, or even less frequently if a longer crawl delay is specified in your robots.txt file. BLEXbot respects the rules you specify in your robots.txt file.
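
For example, robots.txt rules like the following (the /private/ path is just a placeholder) would slow the crawler down further or keep it out of part of a site:

User-agent: BLEXBot
Crawl-delay: 10
Disallow: /private/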

If any problems arise, they may be due to peculiarities of your particular site, or to a bug on another site linking to you. So if you notice any problem with BLEXbot, please report it to customercare@webmeup.com. We will quickly apply settings specific to your site so that crawling never affects its performance.

http://webmeup-crawler.com/

 
