Comprehensive Crawler Case Study (JD Crawler)

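Having learned HttpClient and Jsoup, we know how to fetch data and how to parse it. Next we complete the project case study: scraping mobile phone data from JD (jd.com).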

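I. Requirements Analysis

Requirements:

This crawl collects the mobile phone listings on the JD mall. For each product we capture the name, the price, the product id, the image, and the detail page URL.

Inspecting the search results page with the browser developer tools (F12) shows where this data lives: each product is an li element inside the #J_goodsList container, and the fields above are read from that element's attributes and child nodes.

For the product detail page, analysis shows that its URL is built by concatenating the SKU id: https://item.jd.com/<sku>.html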
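As a small illustration of that URL scheme, the helper below builds a detail page address from an id. It is only a sketch: the class name `DetailUrls` and the sample id in the usage line are placeholders, while the `https://item.jd.com/<id>.html` pattern is the one the crawler code later in this post builds from the `data-sku` attribute.

```java
/** Hypothetical helper; the name DetailUrls is not part of the original project. */
class DetailUrls {

    /** Builds the detail page URL for an id such as the value of the data-sku attribute. */
    static String detailUrl(String sku) {
        if (sku == null || sku.trim().isEmpty()) {
            throw new IllegalArgumentException("sku must not be empty");
        }
        return "https://item.jd.com/" + sku.trim() + ".html";
    }

    public static void main(String[] args) {
        // The id below is a made-up sample value.
        System.out.println(detailUrl("100012043978"));
    }
}
```

Rejecting an empty id early avoids storing a URL such as `https://item.jd.com/.html` when the attribute is missing.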

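1. The difference between SPU and SKU

SPU = Standard Product Unit

An SPU is the smallest unit for aggregating product information: a set of reusable, easily searchable, standardized attributes that describes the characteristics of a product. Put simply, items that share the same attribute values and characteristics belong to one SPU.

For example, iPhone X identifies a product, so it is one SPU.

SKU = Stock Keeping Unit

An SKU is the unit used to count stock moving in and out of inventory; it may be a piece, a box, a pallet, and so on. An SKU is the smallest physically indivisible inventory unit. How it is applied depends on the type of business and the management model; it is used most widely for clothing and footwear.

For example, iPhone X 64G Silver is one SKU.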
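The relationship can be pictured as one SPU grouping several SKUs. The sketch below is purely illustrative: the class name `SpuSkuDemo` and the second variant in the sample data are assumptions made up for the example, not data taken from JD.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: one SPU (a product) groups several SKUs (sellable variants). */
class SpuSkuDemo {

    /** Groups rows of {spu, sku} by SPU, keeping the order in which SPUs first appear. */
    static Map<String, List<String>> groupBySpu(String[][] rows) {
        Map<String, List<String>> bySpu = new LinkedHashMap<>();
        for (String[] row : rows) {
            bySpu.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
        }
        return bySpu;
    }

    public static void main(String[] args) {
        String[][] rows = {
                {"iPhone X", "iPhone X 64G Silver"},
                {"iPhone X", "iPhone X 256G Space Gray"}, // made-up second variant
        };
        // One SPU key mapped to a list holding both SKUs
        System.out.println(groupBySpu(rows));
    }
}
```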

II. Project Preparation

1. Preparing the table structure

Based on the requirements analysis, we create the following table:

CREATE DATABASE `day04_jdspider`;

USE `day04_jdspider`;

CREATE TABLE `jd_item` (
  `id` bigint(10) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `spu` bigint(15) DEFAULT NULL COMMENT 'product group id (SPU)',
  `sku` bigint(15) DEFAULT NULL COMMENT 'smallest product unit id (SKU)',
  `title` varchar(1000) DEFAULT NULL COMMENT 'product title',
  `price` decimal(10,2) DEFAULT NULL COMMENT 'product price',
  `pic` varchar(200) DEFAULT NULL COMMENT 'product image',
  `url` varchar(1500) DEFAULT NULL COMMENT 'product detail URL',
  `created` varchar(100) DEFAULT NULL COMMENT 'creation time',
  `updated` varchar(100) DEFAULT NULL COMMENT 'update time',
  PRIMARY KEY (`id`),
  KEY `sku` (`sku`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1116 DEFAULT CHARSET=utf8 COMMENT='JD products';

2. Project setup

1) Create the project module

2) Add the pom dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <dependency>
        <groupId>com.mchange</groupId>
        <artifactId>c3p0</artifactId>
        <version>0.9.5.2</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.8</version>
        <scope>provided</scope>
    </dependency>

</dependencies>
<build>
    <plugins>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>

    </plugins>

</build>

3) Add the C3P0 configuration file: c3p0-config.xml

<c3p0-config>
    
    <default-config>
        
        <property name="driverClass">com.mysql.jdbc.Driver</property>
        <property name="jdbcUrl">jdbc:mysql://localhost:3306/day04_jdspider</property>
        <property name="user">root</property>
        <property name="password">123456</property>

        
        <property name="initialPoolSize">5</property>
        <property name="maxPoolSize">10</property>
        <property name="checkoutTimeout">3000</property>
    </default-config>
</c3p0-config>

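4) Add the utility class

import com.mchange.v2.c3p0.ComboPooledDataSource;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class C3P0Utils {

    // Shared pool, configured from the default-config section of c3p0-config.xml
    private static ComboPooledDataSource dataSource = new ComboPooledDataSource();

    private C3P0Utils() {
    }

    // Borrow a connection from the pool; returns null if none can be obtained
    public static Connection getConnection() {

        Connection connection = null;
        try {
            connection = dataSource.getConnection();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return connection;
    }

    // Close whichever of the three resources are non-null; closing the connection returns it to the pool
    public static void closeAll(ResultSet resultSet, Statement statement, Connection connection) {
        try {
            if (resultSet != null) {
                resultSet.close();
            }

            if (statement != null) {
                statement.close();
            }

            if (connection != null) {
                connection.close();
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}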

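5) Add the POJO class:

Note: these annotations only work if the Lombok plugin is installed in IDEA and the lombok dependency is declared in the pom. Otherwise, write the getters, setters, toString, and the no-argument and all-arguments constructors by hand.

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class Item {
    // Primary key
    private Long id;
    // Standard product unit (product group)
    private Long spu;
    // Stock keeping unit (smallest product unit)
    private Long sku;
    // Product title
    private String title;
    // Product price
    private Double price;
    // Product image
    private String pic;
    // Product detail URL
    private String url;
    // Creation time
    private String created;
    // Update time
    private String updated;

}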
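For readers without Lombok, a hand-written class along the following lines could stand in for the annotated one. This is a sketch rather than the exact code Lombok generates: the name `PlainItem` is chosen only to keep it apart from `Item`, and the `equals` and `hashCode` methods that `@Data` also produces are left out for brevity.

```java
/** Hand-written stand-in for the Lombok-annotated Item; the name PlainItem is hypothetical. */
class PlainItem {
    private Long id;
    private Long spu;
    private Long sku;
    private String title;
    private Double price;
    private String pic;
    private String url;
    private String created;
    private String updated;

    /** No-argument constructor (what @NoArgsConstructor provides). */
    public PlainItem() {
    }

    /** All-arguments constructor (what @AllArgsConstructor provides), in field order. */
    public PlainItem(Long id, Long spu, Long sku, String title, Double price,
                     String pic, String url, String created, String updated) {
        this.id = id;
        this.spu = spu;
        this.sku = sku;
        this.title = title;
        this.price = price;
        this.pic = pic;
        this.url = url;
        this.created = created;
        this.updated = updated;
    }

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public Long getSpu() { return spu; }
    public void setSpu(Long spu) { this.spu = spu; }
    public Long getSku() { return sku; }
    public void setSku(Long sku) { this.sku = sku; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public Double getPrice() { return price; }
    public void setPrice(Double price) { this.price = price; }
    public String getPic() { return pic; }
    public void setPic(String pic) { this.pic = pic; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getCreated() { return created; }
    public void setCreated(String created) { this.created = created; }
    public String getUpdated() { return updated; }
    public void setUpdated(String updated) { this.updated = updated; }

    @Override
    public String toString() {
        return "PlainItem(id=" + id + ", spu=" + spu + ", sku=" + sku + ", title=" + title
                + ", price=" + price + ", pic=" + pic + ", url=" + url
                + ", created=" + created + ", updated=" + updated + ")";
    }
}
```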

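3. Project development

1) Send the request and fetch the data

This step corresponds to parts 1 through 2.6 of the listing in step 2) below.

2) Parse the data (the code added in this step is part 3, the parsing code)

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

public class JdSpider {

    public static void main(String[] args) throws Exception {
        //1. Determine the URL of the first results page
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

        //2. Send the request and fetch the data with HttpClient
        //2.1: Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        //2.2: Create the request object: HttpGet or HttpPost
        HttpGet httpGet = new HttpGet(indexUrl);
        //2.3: Set request information: request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        //2.4: Send the request and obtain the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);

        //2.5: Read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {

            String html = EntityUtils.toString(response.getEntity(), "UTF-8");

            //2.6: Release resources
            response.close();
            //3. Parse the data with jsoup
            //3.1: Build the Document object from the html
            Document document = Jsoup.parse(html);
            //3.2: Select the product li tags: each li holds the details of one product
            Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");
            List<Item> itemList = new ArrayList<>();
            for (Element li : lis) {
                //3.3: Get the product image URL and download the image
                Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                String imgUrl = "https:" + imgs.attr("src");
                //3.3.1: Request the image URL and read the response as a byte stream
                HttpGet imgGet = new HttpGet(imgUrl);
                CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                HttpEntity imgEntity = imgResponse.getEntity();
                // Do not use EntityUtils.toString here: it decodes text, and an image is binary data
                InputStream inputStream = imgEntity.getContent();

                //3.3.2: Create a local output stream to a file (the target directory must already exist)
                String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                FileOutputStream outputStream = new FileOutputStream(imgFileName);

                //3.3.3: Copy the input stream to the output stream to write the image to disk
                int len;
                byte[] b = new byte[1024];
                while ((len = inputStream.read(b)) != -1) {
                    outputStream.write(b, 0, len);
                }

                //3.3.4: Release resources
                outputStream.close();
                inputStream.close();
                imgResponse.close();
                //3.4: Parse the spu and sku; fall back to the sku when data-spu is missing
                String skuValue = li.attr("data-sku");
                String spuValue = li.attr("data-spu");
                if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;
                //3.5: Parse the product title
                Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                String title = ems.text();
                //3.6: Parse the product price
                Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                String price = priceLiEls.text();
                //3.7: Build the product detail URL
                String itemUrl = "https://item.jd.com/" + skuValue + ".html";
                //3.8: Wrap the data in an Item
                Item item = new Item(null,
                        Long.parseLong(spuValue),
                        Long.parseLong(skuValue),
                        title,
                        Double.parseDouble(price),
                        imgFileName,
                        itemUrl,
                        new Date().toLocaleString(),
                        new Date().toLocaleString()
                );
                //3.9: Add each parsed item to the list
                itemList.add(item);
            }

            System.out.println("Items found: " + itemList.size());
        } else {
            // Non-200 response: release it without parsing
            response.close();
        }

        // Release the client once all requests are done
        httpClient.close();
    }
}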
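One fragile spot in the listing is the file extension: `imgUrl.substring(imgUrl.lastIndexOf("."))` throws when the URL contains no dot, for instance when the `src` attribute is empty and the URL is just `https:`. A possible guard is sketched below; the class name `ImageNames` and the fallback parameter are assumptions introduced for this example, not part of the original project.

```java
/** Hypothetical helper for choosing a local file extension from an image URL. */
class ImageNames {

    /**
     * Returns the extension (including the dot) of the last path segment,
     * or the fallback when the URL has no usable extension.
     */
    static String extensionOf(String imgUrl, String fallback) {
        if (imgUrl == null) {
            return fallback;
        }
        // Ignore a query string or fragment, if present.
        String path = imgUrl.split("[?#]", 2)[0];
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // A usable extension is a dot after the last slash with at least one character after it.
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot);
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg", ".img"));
        System.out.println(extensionOf("https:", ".img"));
    }
}
```

In the crawler, the call could replace the `substring` expression when building `imgFileName`; whether a fallback extension or skipping the item is the better reaction is a design choice left open here.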

3) Save the data

3.1) First build a JDItemDao that performs the save

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class JDItemDao {

    // Save the items
    public void saveItem(List<Item> itemList) throws Exception {

        //1. Get a connection from the pool
        Connection connection = C3P0Utils.getConnection();
        PreparedStatement statement = null;
        try {
            //2. Create the prepared statement from the connection
            String sql = "insert into jd_item VALUES (null,?,?,?,?,?,?,?,?) ";
            statement = connection.prepareStatement(sql);

            //3. Execute the SQL once per item
            for (Item item : itemList) {

                //3.1: Bind the ? placeholders first
                statement.setLong(1, item.getSpu());
                statement.setLong(2, item.getSku());
                statement.setString(3, item.getTitle());
                statement.setDouble(4, item.getPrice());
                statement.setString(5, item.getPic());
                statement.setString(6, item.getUrl());
                statement.setString(7, item.getCreated());
                statement.setString(8, item.getUpdated());

                //3.2: Execute the SQL
                statement.executeUpdate();

            }
        } finally {
            //4. Release resources, even if an insert fails
            C3P0Utils.closeAll(null, statement, connection);
        }
    }
}
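The `created` and `updated` columns are stored as text, and the spider fills them with `new Date().toLocaleString()`, which is deprecated and depends on the machine's locale. As one possible alternative, the sketch below produces a fixed-format timestamp with `java.time`. The class name `Timestamps` and the pattern `yyyy-MM-dd HH:mm:ss` are choices made for this example, not something the original project prescribes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper producing locale-independent timestamp strings. */
class Timestamps {

    // Fixed pattern, so the stored text looks the same on every machine
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats the given time, e.g. 2020-11-13 15:47:00. */
    static String format(LocalDateTime time) {
        return time.format(FORMAT);
    }

    /** Formats the current local time. */
    static String now() {
        return format(LocalDateTime.now());
    }

    public static void main(String[] args) {
        System.out.println(now());
    }
}
```

Calling `Timestamps.now()` once per item and passing the same string for both fields would also guarantee that `created` and `updated` start out identical.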

3.2) Wire the DAO into the spider (the code added in this step is part 4, the save step)

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

public class JdSpider {

    public static void main(String[] args) throws Exception {

        //1. Determine the URL of the first results page
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=1&click=0";

        //2. Send the request and fetch the data with HttpClient
        //2.1: Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        //2.2: Create the request object: HttpGet or HttpPost
        HttpGet httpGet = new HttpGet(indexUrl);
        //2.3: Set request information: request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        //2.4: Send the request and obtain the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);

        //2.5: Read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {

            String html = EntityUtils.toString(response.getEntity(), "UTF-8");

            //2.6: Release resources
            response.close();

            //3. Parse the data with jsoup
            //3.1: Build the Document object from the html
            Document document = Jsoup.parse(html);
            //3.2: Select the product li tags: each li holds the details of one product
            Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");
            List<Item> itemList = new ArrayList<>();
            for (Element li : lis) {
                //3.3: Get the product image URL and download the image
                Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                String imgUrl = "https:" + imgs.attr("src");

                //3.3.1: Request the image URL and read the response as a byte stream
                HttpGet imgGet = new HttpGet(imgUrl);

                CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                HttpEntity imgEntity = imgResponse.getEntity();

                // Do not use EntityUtils.toString here: it decodes text, and an image is binary data
                InputStream inputStream = imgEntity.getContent();

                //3.3.2: Create a local output stream to a file (the target directory must already exist)
                // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                FileOutputStream outputStream = new FileOutputStream(imgFileName);

                //3.3.3: Copy the input stream to the output stream to write the image to disk
                int len;
                byte[] b = new byte[1024];
                while ((len = inputStream.read(b)) != -1) {
                    outputStream.write(b, 0, len);
                }

                //3.3.4: Release resources
                outputStream.close();
                inputStream.close();
                imgResponse.close();

                //3.4: Parse the spu and sku; fall back to the sku when data-spu is missing
                String skuValue = li.attr("data-sku");
                String spuValue = li.attr("data-spu");
                if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                //3.5: Parse the product title
                Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                String title = ems.text();

                //3.6: Parse the product price
                Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                String price = priceLiEls.text();

                //3.7: Build the product detail URL
                String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                //3.8: Wrap the data in an Item
                Item item = new Item(null,
                        Long.parseLong(spuValue),
                        Long.parseLong(skuValue),
                        title,
                        Double.parseDouble(price),
                        imgFileName,
                        itemUrl,
                        new Date().toLocaleString(),
                        new Date().toLocaleString()
                );
                //3.9: Add each parsed item to the list
                itemList.add(item);
            }

            System.out.println("Items found: " + itemList.size());

            //4. Save the data to MySQL
            JDItemDao jdItemDao = new JDItemDao();
            jdItemDao.saveItem(itemList);

        } else {
            // Non-200 response: release it without parsing
            response.close();
        }

        // Release the client once all requests are done
        httpClient.close();
    }
}
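Another spot worth guarding is `Double.parseDouble(price)`: it throws a NumberFormatException when the price element is missing or holds non-numeric text. A minimal sketch of a tolerant parser follows. The class name `Prices`, the decision to return null, and the handling of thousands separators are assumptions made for this example.

```java
import java.math.BigDecimal;

/** Hypothetical helper for turning scraped price text into a number. */
class Prices {

    /** Returns the parsed price, or null when the text is empty or not a plain decimal number. */
    static Double parseOrNull(String text) {
        if (text == null) {
            return null;
        }
        // Strip surrounding whitespace and thousands separators such as "1,299.50".
        String cleaned = text.trim().replace(",", "");
        if (cleaned.isEmpty()) {
            return null;
        }
        try {
            // BigDecimal is stricter than Double.parseDouble (it rejects "NaN" or "1.5f").
            return new BigDecimal(cleaned).doubleValue();
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("5999.00"));
        System.out.println(parseOrNull(""));
    }
}
```

Since `JDItemDao` calls `statement.setDouble(4, item.getPrice())`, a caller using this helper would need to skip items whose price is null or bind the column with `setNull` instead.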

4) Pagination (the code added in this step is the page counter, the while loop, and part 5)

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

public class JdSpider {

    public static void main(String[] args) throws Exception {
        int page = 1;
        //1. Determine the URL of the first results page; page n is requested as page=2n-1
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";

        //2. Send the request and fetch the data with HttpClient
        //2.1: Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        while (page <= 100) {
            System.out.println("Processing page: " + page);
            System.out.println("Page URL: " + indexUrl);

            //2.2: Create the request object: HttpGet or HttpPost
            HttpGet httpGet = new HttpGet(indexUrl);
            //2.3: Set request information: request headers
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

            //2.4: Send the request and obtain the response object
            CloseableHttpResponse response = httpClient.execute(httpGet);

            //2.5: Read the data from the response
            int statusCode = response.getStatusLine().getStatusCode();
            System.out.println("Status code: " + statusCode);
            if (statusCode == 200) {

                String html = EntityUtils.toString(response.getEntity(), "UTF-8");

                //2.6: Release resources
                response.close();

                //3. Parse the data with jsoup
                //3.1: Build the Document object from the html
                Document document = Jsoup.parse(html);
                //3.2: Select the product li tags: each li holds the details of one product
                Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");
                List<Item> itemList = new ArrayList<>();
                for (Element li : lis) {
                    //3.3: Get the product image URL and download the image
                    Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
                    String imgUrl = "https:" + imgs.attr("src");

                    //3.3.1: Request the image URL and read the response as a byte stream
                    HttpGet imgGet = new HttpGet(imgUrl);

                    CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
                    HttpEntity imgEntity = imgResponse.getEntity();

                    // Do not use EntityUtils.toString here: it decodes text, and an image is binary data
                    InputStream inputStream = imgEntity.getContent();

                    //3.3.2: Create a local output stream to a file (the target directory must already exist)
                    // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
                    String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
                    FileOutputStream outputStream = new FileOutputStream(imgFileName);

                    //3.3.3: Copy the input stream to the output stream to write the image to disk
                    int len;
                    byte[] b = new byte[1024];
                    while ((len = inputStream.read(b)) != -1) {
                        outputStream.write(b, 0, len);
                    }

                    //3.3.4: Release resources
                    outputStream.close();
                    inputStream.close();
                    imgResponse.close();

                    //3.4: Parse the spu and sku; fall back to the sku when data-spu is missing
                    String skuValue = li.attr("data-sku");
                    String spuValue = li.attr("data-spu");
                    if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

                    //3.5: Parse the product title
                    Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
                    String title = ems.text();

                    //3.6: Parse the product price
                    Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
                    String price = priceLiEls.text();

                    //3.7: Build the product detail URL
                    String itemUrl = "https://item.jd.com/" + skuValue + ".html";

                    //3.8: Wrap the data in an Item
                    Item item = new Item(null,
                            Long.parseLong(spuValue),
                            Long.parseLong(skuValue),
                            title,
                            Double.parseDouble(price),
                            imgFileName,
                            itemUrl,
                            new Date().toLocaleString(),
                            new Date().toLocaleString()
                    );
                    //3.9: Add each parsed item to the list
                    itemList.add(item);
                }

                System.out.println("Items found: " + itemList.size());

                //4. Save the data to MySQL
                JDItemDao jdItemDao = new JDItemDao();
                jdItemDao.saveItem(itemList);

            } else {
                // Non-200 response: release it and skip this page
                response.close();
            }

            //5. Move on to the next page (done for every status code, so the loop always terminates)
            page++;
            indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";
        }

        //6. Release resources: this must stay outside the while loop
        httpClient.close();

    }
}
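The search URL is assembled twice in the listing above, and the keyword is hard-coded in its percent-encoded form. The following sketch shows how the same URL could be produced from a plain keyword and a logical page number, using the `page * 2 - 1` mapping from the listing. The class name `SearchUrls` is an assumption made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the search URL used by the spider. */
class SearchUrls {

    /** Maps logical page n (starting at 1) to the page parameter 2n-1 and encodes the keyword as UTF-8. */
    static String searchUrl(String keyword, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page starts at 1");
        }
        String encoded;
        try {
            encoded = URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
        return "https://search.jd.com/Search?keyword=" + encoded + "&page=" + (page * 2 - 1) + "&click=0";
    }

    public static void main(String[] args) {
        // The escaped literal below is the keyword for "mobile phone" used throughout this post.
        String keyword = "\u624b\u673a";
        System.out.println(searchUrl(keyword, 1));
        System.out.println(searchUrl(keyword, 2));
    }
}
```

With such a helper the loop only needs to track `page`; the URL for each iteration can be derived on demand instead of being kept in a second variable.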

This completes the basic JD crawler case study.

III. Optimizing the crawler project

Extract the code of each stage into methods.

Extract a method that fetches the HTML for a given URL

public static String getHtml(String indexUrl, CloseableHttpClient httpClient) throws Exception {

    //2.2: Create the request object: HttpGet or HttpPost
    HttpGet httpGet = new HttpGet(indexUrl);
    //2.3: Set request information: request headers
    httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

    //2.4: Send the request and obtain the response object
    CloseableHttpResponse response = httpClient.execute(httpGet);

    //2.5: Read the data from the response
    int statusCode = response.getStatusLine().getStatusCode();
    System.out.println("Status code: " + statusCode);
    if (statusCode == 200) {

        String html = EntityUtils.toString(response.getEntity(), "UTF-8");

        //2.6: Release resources
        response.close();

        return html;
    }

    // Non-200 response: release it and signal the failure with null
    response.close();
    return null;

}
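Because `getHtml` returns null for any non-200 status, the caller has to decide how often to try again before giving up on a page. One way to bound that is a small retry helper like the sketch below. The class `BoundedRetry`, its parameters, and the idea of pausing between attempts are assumptions added for illustration; the original project does not include them.

```java
import java.util.function.Supplier;

/** Hypothetical helper: run an attempt up to maxAttempts times until it yields a non-null result. */
class BoundedRetry {

    static <T> T firstNonNull(Supplier<T> attempt, int maxAttempts, long pauseMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            T result = attempt.get();
            if (result != null) {
                return result;
            }
            if (i < maxAttempts) {
                try {
                    // Wait a little before the next attempt.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        // Every attempt returned null.
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated fetch that fails twice and then succeeds.
        String html = firstNonNull(() -> ++calls[0] < 3 ? null : "<html></html>", 5, 100L);
        System.out.println(html + " after " + calls[0] + " attempts");
    }
}
```

Since `getHtml` declares `throws Exception`, a caller would have to wrap it in a lambda that catches the checked exception (for example returning null on failure) before passing it to this helper.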

Extract a method that parses the data of each page

public static List<Item> parseHtmlToListItem(CloseableHttpClient httpClient, String html) throws IOException {
    //3. Parse the data with jsoup
    //3.1: Build the Document object from the html
    Document document = Jsoup.parse(html);
    //3.2: Select the product li tags: each li holds the details of one product
    Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");
    List<Item> itemList = new ArrayList<>();
    for (Element li : lis) {
        //3.3: Get the product image URL and download the image
        Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
        String imgUrl = "https:" + imgs.attr("src");

        //3.3.1: Request the image URL and read the response as a byte stream
        HttpGet imgGet = new HttpGet(imgUrl);

        CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
        HttpEntity imgEntity = imgResponse.getEntity();

        // Do not use EntityUtils.toString here: it decodes text, and an image is binary data
        InputStream inputStream = imgEntity.getContent();

        //3.3.2: Create a local output stream to a file (the target directory must already exist)
        // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
        String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
        FileOutputStream outputStream = new FileOutputStream(imgFileName);

        //3.3.3: Copy the input stream to the output stream to write the image to disk
        int len;
        byte[] b = new byte[1024];
        while ((len = inputStream.read(b)) != -1) {
            outputStream.write(b, 0, len);
        }

        //3.3.4: Release resources
        outputStream.close();
        inputStream.close();
        imgResponse.close();

        //3.4: Parse the spu and sku; fall back to the sku when data-spu is missing
        String skuValue = li.attr("data-sku");
        String spuValue = li.attr("data-spu");
        if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

        //3.5: Parse the product title
        Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
        String title = ems.text();

        //3.6: Parse the product price
        Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
        String price = priceLiEls.text();

        //3.7: Build the product detail URL
        String itemUrl = "https://item.jd.com/" + skuValue + ".html";

        //3.8: Wrap the data in an Item
        Item item = new Item(null,
                Long.parseLong(spuValue),
                Long.parseLong(skuValue),
                title,
                Double.parseDouble(price),
                imgFileName,
                itemUrl,
                new Date().toLocaleString(),
                new Date().toLocaleString()
        );
        //3.9: Add each parsed item to the list
        itemList.add(item);
    }
    return itemList;
}

The complete code after extraction

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

public class JdSpider {

    public static void main(String[] args) throws Exception {
        int page = 1;
        //1. Determine the URL of the first results page; page n is requested as page=2n-1
        String indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";

        //2. Send the request and fetch the data with HttpClient
        //2.1: Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        while (page <= 100) {
            System.out.println("Processing page: " + page);
            System.out.println("Page URL: " + indexUrl);

            String html = getHtml(indexUrl, httpClient);
            if (html != null) {
                //3. Parse the data with jsoup
                List<Item> itemList = parseHtmlToListItem(httpClient, html);
                System.out.println("Items found: " + itemList.size());
                //4. Save the data to MySQL
                JDItemDao jdItemDao = new JDItemDao();
                jdItemDao.saveItem(itemList);
            }

            //5. Move on to the next page (also when the fetch failed, so the loop always terminates)
            page++;
            indexUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=" + (page * 2 - 1) + "&click=0";
        }

        //6. Release resources: this must stay outside the while loop
        httpClient.close();
    }

    // Parse the data of one results page
    public static List<Item> parseHtmlToListItem(CloseableHttpClient httpClient, String html) throws IOException {
        //3. Parse the data with jsoup
        //3.1: Build the Document object from the html
        Document document = Jsoup.parse(html);
        //3.2: Select the product li tags: each li holds the details of one product
        Elements lis = document.select("#J_goodsList>ul[class=gl-warp clearfix]>li");
        List<Item> itemList = new ArrayList<>();
        for (Element li : lis) {
            //3.3: Get the product image URL and download the image
            Elements imgs = li.select(".gl-i-wrap>.p-img>a>img");
            String imgUrl = "https:" + imgs.attr("src");

            //3.3.1: Request the image URL and read the response as a byte stream
            HttpGet imgGet = new HttpGet(imgUrl);

            CloseableHttpResponse imgResponse = httpClient.execute(imgGet);
            HttpEntity imgEntity = imgResponse.getEntity();

            // Do not use EntityUtils.toString here: it decodes text, and an image is binary data
            InputStream inputStream = imgEntity.getContent();

            //3.3.2: Create a local output stream to a file (the target directory must already exist)
            // Example image URL: http://img10.360buyimg.com/n7/jfs/t1/110811/33/3085/317953/5e8c4bafEf33aaa74/5531debb59f5350c.jpg
            String imgFileName = "E:\\jdImg\\" + UUID.randomUUID().toString() + imgUrl.substring(imgUrl.lastIndexOf("."));
            FileOutputStream outputStream = new FileOutputStream(imgFileName);

            //3.3.3: Copy the input stream to the output stream to write the image to disk
            int len;
            byte[] b = new byte[1024];
            while ((len = inputStream.read(b)) != -1) {
                outputStream.write(b, 0, len);
            }

            //3.3.4: Release resources
            outputStream.close();
            inputStream.close();
            imgResponse.close();

            //3.4: Parse the spu and sku; fall back to the sku when data-spu is missing
            String skuValue = li.attr("data-sku");
            String spuValue = li.attr("data-spu");
            if (spuValue == null || "".equals(spuValue)) spuValue = skuValue;

            //3.5: Parse the product title
            Elements ems = li.select(".gl-i-wrap>div[class=p-name p-name-type-2]>a>em");
            String title = ems.text();

            //3.6: Parse the product price
            Elements priceLiEls = li.select(".gl-i-wrap>.p-price>strong>i");
            String price = priceLiEls.text();

            //3.7: Build the product detail URL
            String itemUrl = "https://item.jd.com/" + skuValue + ".html";

            //3.8: Wrap the data in an Item
            Item item = new Item(null,
                    Long.parseLong(spuValue),
                    Long.parseLong(skuValue),
                    title,
                    Double.parseDouble(price),
                    imgFileName,
                    itemUrl,
                    new Date().toLocaleString(),
                    new Date().toLocaleString()
            );
            //3.9: Add each parsed item to the list
            itemList.add(item);
        }
        return itemList;
    }

    // Fetch the HTML of the given URL; returns null for a non-200 response
    public static String getHtml(String indexUrl, CloseableHttpClient httpClient) throws Exception {

        //2.2: Create the request object: HttpGet or HttpPost
        HttpGet httpGet = new HttpGet(indexUrl);
        //2.3: Set request information: request headers
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36");

        //2.4: Send the request and obtain the response object
        CloseableHttpResponse response = httpClient.execute(httpGet);

        //2.5: Read the data from the response
        int statusCode = response.getStatusLine().getStatusCode();
        System.out.println("Status code: " + statusCode);
        if (statusCode == 200) {

            String html = EntityUtils.toString(response.getEntity(), "UTF-8");

            //2.6: Release resources
            response.close();

            return html;
        }

        // Non-200 response: release it and signal the failure with null
        response.close();
        return null;

    }
}

posted @ 2020-11-13 15:47 十一vs十一