Getting Started with Web Crawlers (Java)

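import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class CrawlerUtil {
    public static void main(String[] args) throws IOException {
        // Create a default HttpClient instance; try-with-resources closes it when done.
        try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
            // Create the GET request.
            HttpGet httpGet = new HttpGet("http://localhost:8080/");
            // Execute the request; closing the response releases the connection.
            try (CloseableHttpResponse response = httpclient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    System.out.println("______________________________________");
                    System.out.println("Response content: " + EntityUtils.toString(entity, "UTF-8"));
                    System.out.println("______________________________________");
                }
            }
        }
    }
}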

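A recent project needed some of its data scraped from another website, which is how I first got into crawlers.

The development language is Java. After picking the brains of more experienced colleagues and searching Baidu, I finally got the requirement done.

The code above is a simple demo.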
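The demo requests http://localhost:8080/, so something has to be listening there. If nothing is, a throwaway page can be served with the JDK's built-in com.sun.net.httpserver package. This is only a sketch for local testing: the class name LocalTestServer and the sample HTML are made up for illustration and are not part of the original project.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class LocalTestServer {
    // Hypothetical sample page; any HTML would do.
    public static final String PAGE =
            "<html><head><title>Test page</title></head>"
            + "<body><a href=\"/next\">next</a></body></html>";

    // Starts a server on the given port (0 lets the OS pick a free one) and returns it.
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080);
        System.out.println("Serving http://localhost:8080/ (press Enter to stop)");
        System.in.read();   // block until Enter is pressed
        server.stop(0);     // shut down immediately
    }
}
```

With this running in one terminal, the CrawlerUtil demo should print the sample page between its two separator lines.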

Maven configuration:

<!--
<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>
-->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3</version>
</dependency>

The libraries used are httpclient (to fetch the data) and jsoup (to parse the HTML).
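The demo only covers the httpclient half. As a rough sketch of the jsoup half, the snippet below parses an HTML string and pulls out the link targets. It assumes the jsoup dependency from the Maven configuration is on the classpath; the class name JsoupParseSketch, the method extractLinks, and the sample HTML are made up for illustration. In a real crawler the input would be the string returned by EntityUtils.toString(...).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupParseSketch {
    // Returns the href attribute of every <a> element that has one, in document order.
    public static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical HTML standing in for a fetched page.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><a href=\"/a\">A</a><a href=\"/b\">B</a></body></html>";
        System.out.println(Jsoup.parse(html).title());
        System.out.println(extractLinks(html));
    }
}
```

The selector a[href] is CSS-style, so the same select call can be pointed at whatever elements hold the data being scraped.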

Note that there are two versions of httpclient:

1. org.apache.commons.httpclient.HttpClient

2. org.apache.http.client.HttpClient

The former (the old Commons HttpClient 3.x) is no longer updated, so I used the latter.

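Put simply, it works like this:

1. A client, which sends the HTTP request (HttpClients.createDefault());

2. A request object (GET, POST, etc., such as the HttpGet in the demo above);

3. A return value: CloseableHttpResponse.

The client object executes (execute) the request object and gets back the return value: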

CloseableHttpResponse response = httpclient.execute(httpGet);

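The demo above is a GET request. For a POST request, put the request parameters into an object (an HttpEntity), put that object into the request object (HttpPost), and only then execute the request.

As follows: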

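// Needs java.util.{ArrayList, List}, org.apache.http.NameValuePair, org.apache.http.message.BasicNameValuePair,
// org.apache.http.client.entity.UrlEncodedFormEntity and org.apache.http.client.methods.HttpPost.
// Placeholder URL (same address as the GET demo); point this at the real form endpoint.
HttpPost post = new HttpPost("http://localhost:8080/");
// Build the form parameters.
List<NameValuePair> formparams = new ArrayList<>();
formparams.add(new BasicNameValuePair("username", "admin"));
formparams.add(new BasicNameValuePair("password", "123456"));
UrlEncodedFormEntity uefEntity = new UrlEncodedFormEntity(formparams, "UTF-8");
// Attach the entity first, then execute the POST request.
post.setEntity(uefEntity);
CloseableHttpResponse response = httpclient.execute(post);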
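To make the "parameters go into an entity" step less abstract, here is a small JDK-only sketch of what a form-encoded body looks like. It does not use httpclient: it swaps in java.net.URLEncoder to build the same kind of application/x-www-form-urlencoded string by hand, so treat it as an illustration of the format rather than of what UrlEncodedFormEntity does internally. The class name FormBodySketch and its methods are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {
    // Joins name=value pairs with '&', URL-encoding each name and value as UTF-8.
    public static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            body.add(enc(e.getKey()) + "=" + enc(e.getValue()));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "admin");
        params.put("password", "123456");
        System.out.println(encode(params)); // prints username=admin&password=123456
    }
}
```

Characters that have a special meaning in the format are escaped, so a value such as "a b&c" is sent as a+b%26c.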


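That's all.

Everything above is API-level material. Different use cases call for different parameter settings that need attention, and I am only a beginner in this area.

That covers how the crawler works in principle. I have not yet dug into the underlying protocols and the like, but I plan to work through them systematically over time.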

posted @ 2017-03-18 21:15 it馅儿包子
