This little Java crawler does just one thing: it scrapes email addresses from a web page. It uses Java I/O and regular expressions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) throws IOException {
//        List<String> list = getEmail();   // local-file variant
        List<String> list = getEmailFromWeb();
        for (String string : list) {
            System.out.println(string);
        }
    }

    public static List<String> getEmail() throws IOException {
        // 1. Read the local source file
        BufferedReader bufferedReader = new BufferedReader(new FileReader("G:\\index.htm"));
        // 2. Match each line against the email regex
        String regex_email = "\\w+@\\w+(\\.[a-zA-Z]{2,3}){1,3}"; // e.g. xinwenge@vip.qq.com
        Pattern pattern = Pattern.compile(regex_email);
        String line = null;
        List<String> list = new ArrayList<>();
        while ((line = bufferedReader.readLine()) != null) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                list.add(matcher.group());
            }
        }
        bufferedReader.close();
        return list;
    }
    
    public static List<String> getEmailFromWeb() throws IOException {
        // 1. Read the web page source
        URL url = new URL("http://news.qq.com/zt2015/wxghz/index.htm");
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream()));
        // 2. Match each line against the email regex
        String regex_email = "\\w+@\\w+(\\.[a-zA-Z]{2,3}){1,2}";
        Pattern pattern = Pattern.compile(regex_email);
        String line = null;
        List<String> list = new ArrayList<>();
        while ((line = bufferedReader.readLine()) != null) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                list.add(matcher.group());
            }
        }
        bufferedReader.close();
        return list;
    }
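The matching logic can be tried without any network or file access. The sketch below (class and sample string are mine, not from the post) runs the same `Pattern`/`Matcher` loop over an in-memory HTML snippet; it uses the `{1,3}` variant of the regex so multi-label domains like `vip.qq.com` are captured whole:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexDemo {
    // Word chars, '@', word chars, then 1-3 dot-separated labels of 2-3 letters
    static final Pattern EMAIL = Pattern.compile("\\w+@\\w+(\\.[a-zA-Z]{2,3}){1,3}");

    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {          // find() scans for the next non-overlapping match
            found.add(m.group());   // group() returns the whole matched address
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"mailto:xinwenge@vip.qq.com\">contact</a> foo@example.com";
        System.out.println(extract(html));
        // prints [xinwenge@vip.qq.com, foo@example.com]
    }
}
```

Note that `mailto:` is skipped automatically: the `:` is not a word character, so the match can only begin at the address itself.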

output:
xinwenge@vip.qq.com

Ha, it crawled a page from Tencent News.

posted on 2016-04-17 13:08  WesTward