【爬虫】Java爬取省市县行政区域统计数据
前言
网上看了好几个Python爬虫来爬取省市县行政区域统计
官网除了省市县以外,还有区,街道,居委村委层级
1 | https: //zhuanlan.zhihu.com/p/512852193 |
所以自己用Java写一个完整爬取的,之前写过的一版不是很理想
这次换了更轻量的库来重构,逻辑也直观些
依赖库:
Hutool工具库,有Http工具包和DB操作的API
Jsoup解析HTML代码,爬虫标配
Lombok简化PO
放FastJson是考虑可能不用DB存放,直接写JSON文件,在这里没用到
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | < dependencies > < dependency > < groupId >cn.hutool</ groupId > < artifactId >hutool-all</ artifactId > < version >5.8.4</ version > </ dependency > < dependency > < groupId >mysql</ groupId > < artifactId >mysql-connector-java</ artifactId > < version >8.0.15</ version > </ dependency > < dependency > < groupId >com.alibaba</ groupId > < artifactId >fastjson</ artifactId > < version >1.2.62</ version > </ dependency > <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup --> < dependency > < groupId >org.jsoup</ groupId > < artifactId >jsoup</ artifactId > < version >1.13.1</ version > </ dependency > < dependency > < groupId >org.projectlombok</ groupId > < artifactId >lombok</ artifactId > < version >1.18.10</ version > </ dependency > <!-- 连接池https://mvnrepository.com/artifact/com.alibaba/druid --> < dependency > < groupId >com.alibaba</ groupId > < artifactId >druid</ artifactId > < version >1.1.14</ version > </ dependency > </ dependencies > |
HutoolDb需要的配置文件:
## db.setting文件 url = jdbc:mysql://localhost:3308/my?serverTimezone=Asia/Shanghai user = root pass = 123456 ## 可选配置 # 是否在日志中显示执行的SQL showSql = true # 是否格式化显示的SQL formatSql = false # 是否显示SQL参数 showParams = true # 打印SQL的日志等级,默认debug,可以是info、warn、error sqlLevel = debug #---------------------------------------------------------------------------------------------------------------- ## 连接池配置项 #———————————————— #版权声明:本文为CSDN博主「soulCoke」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。 #原文链接:https://blog.csdn.net/qq_36328170/article/details/105687633 ## ---------------------------------------------------- Druid # 初始化时建立物理连接的个数。初始化发生在显示调用init方法,或者第一次getConnection时 initialSize = 1 # 最大连接池数量 maxActive = 8 # 最小连接池数量 minIdle = 0 # 获取连接时最大等待时间,单位毫秒。配置了maxWait之后, 缺省启用公平锁,并发效率会有所下降, 如果需要可以通过配置useUnfairLock属性为true使用非公平锁。 maxWait = 0 # 是否缓存preparedStatement,也就是PSCache。 PSCache对支持游标的数据库性能提升巨大,比如说oracle。 在mysql5.5以下的版本中没有PSCache功能,建议关闭掉。作者在5.5版本中使用PSCache,通过监控界面发现PSCache有缓存命中率记录, 该应该是支持PSCache。 poolPreparedStatements = false # 要启用PSCache,必须配置大于0,当大于0时, poolPreparedStatements自动触发修改为true。 在Druid中,不会存在Oracle下PSCache占用内存过多的问题, 可以把这个数值配置大一些,比如说100 maxOpenPreparedStatements = -1 # 用来检测连接是否有效的sql,要求是一个查询语句。 如果validationQuery为null,testOnBorrow、testOnReturn、 testWhileIdle都不会其作用。 validationQuery = SELECT 1 # 申请连接时执行validationQuery检测连接是否有效,做了这个配置会降低性能。 testOnBorrow = true # 归还连接时执行validationQuery检测连接是否有效,做了这个配置会降低性能 testOnReturn = false # 建议配置为true,不影响性能,并且保证安全性。 申请连接的时候检测,如果空闲时间大于 timeBetweenEvictionRunsMillis,执行validationQuery检测连接是否有效。 testWhileIdle = false # 有两个含义: 1) Destroy线程会检测连接的间隔时间 2) testWhileIdle的判断依据,详细看testWhileIdle属性的说明 timeBetweenEvictionRunsMillis = 60000 # 物理连接初始化的时候执行的sql connectionInitSqls = SELECT 1 # 属性类型是字符串,通过别名的方式配置扩展插件, 常用的插件有: 监控统计用的filter:stat 日志用的filter:log4j 防御sql注入的filter:wall # filters = stat # 类型是List<com.alibaba.druid.filter.Filter>, 如果同时配置了filters和proxyFilters, 是组合关系,并非替换关系 # proxyFilters =
表结构:
1 2 3 4 5 6 7 8 9 10 | CREATE TABLE `region2021` ( `CODE` varchar (24) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT '行政区代码' , `PARENT_CODE` varchar (24) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '上级代码' , ` NAME ` varchar (24) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '名称' , `LINK` varchar (252) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL , ` LEVEL ` int DEFAULT NULL COMMENT '层级' , `TYPE_CODE` varchar (12) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '类型' , `GEN_TIME` datetime DEFAULT NULL COMMENT '创建时间' , PRIMARY KEY (`CODE`) USING BTREE ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE =utf8mb4_general_ci ROW_FORMAT= DYNAMIC ; |
代码部分:
封装的PO类:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | package cn.cloud9.rdp.po; import lombok.*; import java.time.LocalDateTime; /** * 行政区域最小存储单元 * @projectName: 行政区域爬取工具 * @author: Cloud9 * @date: 2022年06月29日 10:30 * @version: 1.0 */ @Data @EqualsAndHashCode @AllArgsConstructor @NoArgsConstructor @Builder public class RegionCell { // 名称 private String name; // 下一层的访问地址 private String url; // 统计用区分代码 private String regionCode; // 统计用区分代码(上级代码) private String parentCode; // 城乡分类代码 private String typeCode; // 行政区的层级 private Integer level; // 创建时间 private LocalDateTime genTime; } |
Main启动类:
本来是想用递归写逻辑的,但是每一层的逻辑不是完全一样,所以不采用递归
层级是可以确认的,最底层到village就没有了
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 | package cn.cloud9.rdp; import cn.cloud9.rdp.po.RegionCell; import cn.hutool.core.collection.CollectionUtil; import cn.hutool.db.Db; import cn.hutool.db.Entity; import cn.hutool.http.HttpRequest; import cn.hutool.http.HttpResponse; import cn.hutool.http.HttpUtil; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; import java.net.HttpCookie; import java.sql.SQLException; import java.time.LocalDateTime; import java.util.List; import java.util.Map; import java.util.Random; import java.util.concurrent.ConcurrentHashMap; /** * @projectName: 行政区域爬取工具 * @author: Cloud9 * @date: 2022年06月29日 09:55 * @version: 1.0 */ public class MainApplication { public static final String HEADER_URL = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2021/" ; public static final String COOKIE_KEY = "SF_cookie_1" ; public static final String COOKIE_HEADER_KEY = "Cookie" ; public static final String USER_AGENT = "User-Agent" ; public static final String[] BROWSER_AGENTS = { "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36" , "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" , "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0" }; public final static Random R = new Random(); public static final String TABLE_NAME = "region2021" ; public static Map<String, String> getNewRequestHeader() { return new ConcurrentHashMap<String, String>(){{ this .put( "Host" , "www.stats.gov.cn" ); // this.put("Server", "CWAP-waf"); // this.put("Content-Length", "2351"); // this.put("Content-Type", "text/html"); // this.put("Accept-Ranges", "bytes"); // this.put("Content-Encoding", "gzip"); // this.put("ETag", "1736-5d4bccabedf00-gzip"); // this.put("Vary", "Accept-Encoding"); // this.put("X-Powered-By", "anyu.qianxin.com"); // this.put("WZWS-RAY", "1129-1656502841.38-w-waf03cdm"); }}; } public static void main(String[] args) throws SQLException { // 创建MySQL连接 Db db = Db.use(); // 清空表记录 db.execute( "TRUNCATE TABLE region2021" , null ); // 第一层请求 HttpRequest getRequest = HttpUtil.createGet(HEADER_URL); HttpResponse httpResponse = getRequest.execute(); HttpCookie cookie = httpResponse.getCookie(COOKIE_KEY); System.out.println(cookie); // 取响应状态 int status = httpResponse.getStatus(); // 取响应data String body = httpResponse.body(); System.out.println( "省 响应状态 " + status); System.out.println( "省 响应状态 " + body); if (status != 200 ) { System.out.println( "爬取异常,程序终止" ); return ; } // 解析HTML文档,封装成文档对象 final Document DOC = Jsoup.parse(body); // 省份是 class="provincetr" 的tr标签, Elements provinceTrList = DOC.getElementsByClass( "provincetr" ); provinceTrList.forEach(tr -> { // 每个tr标签内部嵌套了若干省份 a标签 Elements provinceAList = tr.getElementsByTag( "a" ); provinceAList.forEach(a -> { int provinceTrIndex = provinceTrList.indexOf(tr) + 1 ; int provinceALinkIndex = provinceAList.indexOf(a) + 1 ; RegionCell cell = RegionCell.builder() // 文本就是省份 .name(a.text()) // 下一级的地址是 首页地址 + 标签存的地址 .url(HEADER_URL + a.attr( "href" )) // 第一级没有行政区编号,我自己设定规则是(tr元素下标 + a元素下标)组合 .regionCode(String.valueOf(provinceTrIndex) + provinceALinkIndex) // 省级没有 .parentCode(String.valueOf( 0 )) .genTime(LocalDateTime.now()) .build(); System.out.println(cell); // 插入省份 try { db.insertOrUpdate(Entity .create(TABLE_NAME) .set( "CODE" , cell.getRegionCode()) .set( "PARENT_CODE" , cell.getParentCode()) .set( "NAME" , cell.getName()) .set( "LEVEL" , 1 ) .set( "LINK" , cell.getUrl()) .set( "GEN_TIME" , cell.getGenTime()) ); } catch (SQLException e) { e.printStackTrace(); } // 设置header头, Map<String, String> header = getNewRequestHeader(); // 把首次请求提供的Cookie放进来 header.put(COOKIE_HEADER_KEY, cookie.toString()); // 设置每次请求时,伪装不同的浏览器访问 header.put(USER_AGENT, BROWSER_AGENTS[R.nextInt(BROWSER_AGENTS.length)]); HttpResponse response = HttpUtil.createGet(cell.getUrl()).addHeaders(header).execute(); // 保留本次cookie,给下一级请求时提供 List<HttpCookie> provinceCookies = response.getCookies(); // 不判断状态,默认200 int provinceStatus = response.getStatus(); System.out.println(cell.getName() + " | " + cell.getUrl() + " | 请求状态" + provinceStatus); String provinceBody = response.body(); final Document provinceDOC = Jsoup.parse(provinceBody); // 市级取的 class="citytr" Elements cityTrList = provinceDOC.getElementsByClass( "citytr" ); cityTrList.forEach(cityTr -> { // 每个tr 存放一个城市, 第一个td 放编码 第二个td放城市名 // 有可能这个城市没有下一级,td就不嵌套a元素, 所以这里写了判断逻辑 Elements cityTds = cityTr.getElementsByTag( "td" ); Element codeTd = cityTds.get( 0 ); Element nameTd = cityTds.get( 1 ); Elements aTagInCodeTd = codeTd.getElementsByTag( "a" ); Elements aTagInNameTd = nameTd.getElementsByTag( "a" ); // 取城乡代码 String regionCityCode; String cityNextHref = null ; boolean isLinkTag = !CollectionUtil.isEmpty(aTagInCodeTd); if (isLinkTag) { // 编码a元素和城市a元素都会放href, 这里按编码的来取 regionCityCode = aTagInCodeTd.get( 0 ).text(); cityNextHref = aTagInCodeTd.get( 0 ).attr( "href" ); } else regionCityCode = codeTd.text(); String regionName; if (!CollectionUtil.isEmpty(aTagInNameTd)) regionName = aTagInNameTd.get( 0 ).text(); else regionName = nameTd.text(); RegionCell.RegionCellBuilder cellBuilder = RegionCell.builder(); // 城市的下一级Link 也是一样,用首地址 + href放的地址拼接组成 if (isLinkTag) cellBuilder.url(HEADER_URL + cityNextHref); cellBuilder.name(regionName); cellBuilder.regionCode(regionCityCode); cellBuilder.parentCode(cell.getRegionCode()); cellBuilder.genTime(LocalDateTime.now()).build(); RegionCell cityCell = cellBuilder.build(); System.out.println(cityCell); try { db.insertOrUpdate(Entity .create(TABLE_NAME) .set( "CODE" , cityCell.getRegionCode()) .set( "PARENT_CODE" , cityCell.getParentCode()) .set( "NAME" , cityCell.getName()) .set( "LEVEL" , 2 ) .set( "LINK" , cityCell.getUrl()) .set( "GEN_TIME" , cityCell.getGenTime()) ); } catch (SQLException e) { e.printStackTrace(); } if ( null == cityNextHref) return ; Map<String, String> countyHeader = getNewRequestHeader(); provinceCookies.forEach( provinceCookie -> countyHeader.put(COOKIE_HEADER_KEY, provinceCookie.toString() + ";" )); countyHeader.put(USER_AGENT, BROWSER_AGENTS[R.nextInt(BROWSER_AGENTS.length)]); HttpResponse countyResponse = HttpUtil.createGet(cityCell.getUrl()).addHeaders(countyHeader).execute(); List<HttpCookie> countyCookies = countyResponse.getCookies(); int countyStatus = countyResponse.getStatus(); String countyBody = countyResponse.body(); System.out.println(cityCell.getName() + " | " + cityCell.getUrl() + " | 请求状态" + countyStatus); Document countyDOC = Jsoup.parse(countyBody); Elements countyTrList = countyDOC.getElementsByClass( "countytr" ); countyTrList.forEach(countyTr -> { Elements countyTds = countyTr.getElementsByTag( "td" ); Element countyCodeTd = countyTds.get( 0 ); Element countyNameTd = countyTds.get( 1 ); Elements aTagInCountyCodeTd = countyCodeTd.getElementsByTag( "a" ); Elements aTagInCountyNameTd = countyNameTd.getElementsByTag( "a" ); // 取城乡代码 String regionCountyCode; String countyNextHref = null ; boolean isCountyLinkTag = !CollectionUtil.isEmpty(aTagInCountyCodeTd); if (isCountyLinkTag) { regionCountyCode = aTagInCountyCodeTd.get( 0 ).text(); countyNextHref = aTagInCountyCodeTd.get( 0 ).attr( "href" ); } else regionCountyCode = countyCodeTd.text(); String regionCountyName; if (!CollectionUtil.isEmpty(aTagInCountyNameTd)) regionCountyName = aTagInCountyNameTd.get( 0 ).text(); else regionCountyName = countyNameTd.text(); RegionCell.RegionCellBuilder countyCellBuilder = RegionCell.builder(); if (isCountyLinkTag) { // 县级的href需要截取处理。 /xxxxxx.html -> 首地址/xx/xxxxxx.html这样 int index = countyNextHref.indexOf( '/' ); String provincePath = countyNextHref.substring(index + 1 , index + 3 ) + "/" ; String countyUrl = HEADER_URL + provincePath + countyNextHref; countyCellBuilder.url(countyUrl); } countyCellBuilder.name(regionCountyName); countyCellBuilder.regionCode(regionCountyCode); countyCellBuilder.genTime(LocalDateTime.now()); countyCellBuilder.parentCode(cityCell.getRegionCode()); RegionCell countyCell = countyCellBuilder.build(); System.out.println(countyCell); try { db.insertOrUpdate(Entity .create(TABLE_NAME) .set( "CODE" , countyCell.getRegionCode()) .set( "PARENT_CODE" , countyCell.getParentCode()) .set( "NAME" , countyCell.getName()) .set( "LEVEL" , 3 ) .set( "LINK" , countyCell.getUrl()) .set( "GEN_TIME" , countyCell.getGenTime()) ); } catch (SQLException e) { e.printStackTrace(); } if ( null == countyNextHref) return ; Map<String, String> townHeader = getNewRequestHeader(); countyCookies.forEach( provinceCookie -> townHeader.put(COOKIE_HEADER_KEY, provinceCookie.toString() + ";" )); townHeader.put(USER_AGENT, BROWSER_AGENTS[R.nextInt(BROWSER_AGENTS.length)]); HttpResponse townResponse = HttpUtil.createGet(countyCell.getUrl()).addHeaders(townHeader).execute(); List<HttpCookie> townCookies = townResponse.getCookies(); int townStatus = townResponse.getStatus(); String townBody = townResponse.body(); System.out.println(countyCell.getName() + " | " + countyCell.getUrl() + " | 请求状态" + townStatus); Document townDOC = Jsoup.parse(townBody); Elements townTrList = townDOC.getElementsByClass( "towntr" ); townTrList.forEach(townTr -> { Elements townTds = townTr.getElementsByTag( "td" ); Element townCodeTd = townTds.get( 0 ); Element townNameTd = townTds.get( 1 ); Elements aTagInTownCodeTd = townCodeTd.getElementsByTag( "a" ); Elements aTagInTownNameTd = townNameTd.getElementsByTag( "a" ); // 取城乡代码 String regionTownCode; String townNextHref = null ; boolean isTownLinkTag = !CollectionUtil.isEmpty(aTagInTownCodeTd); if (isTownLinkTag) { regionTownCode = aTagInTownCodeTd.get( 0 ).text(); townNextHref = aTagInTownCodeTd.get( 0 ).attr( "href" ); } else regionTownCode = townCodeTd.text(); String regionTownName; if (!CollectionUtil.isEmpty(aTagInTownNameTd)) regionTownName = aTagInTownNameTd.get( 0 ).text(); else regionTownName = townNameTd.text(); RegionCell.RegionCellBuilder townCellBuilder = RegionCell.builder(); if (isTownLinkTag) { int index = townNextHref.indexOf( "/" ); String provincePath = townNextHref.substring(index + 1 , index + 3 ); String cityPath = "/" + townNextHref.substring(index + 3 , index + 5 ) + "/" ; String url = HEADER_URL + provincePath + cityPath + townNextHref; townCellBuilder.url(url); } townCellBuilder.name(regionTownName); townCellBuilder.regionCode(regionTownCode); townCellBuilder.genTime(LocalDateTime.now()); townCellBuilder.parentCode(countyCell.getRegionCode()); RegionCell townCell = townCellBuilder.build(); System.out.println(townCell); try { db.insertOrUpdate(Entity .create(TABLE_NAME) .set( "CODE" , townCell.getRegionCode()) .set( "PARENT_CODE" , townCell.getParentCode()) .set( "NAME" , townCell.getName()) .set( "LEVEL" , 4 ) .set( "LINK" , townCell.getUrl()) .set( "GEN_TIME" , townCell.getGenTime()) ); } catch (SQLException e) { e.printStackTrace(); } if ( null == townNextHref) return ; Map<String, String> villageHeader = getNewRequestHeader(); townCookies.forEach(townCookie -> villageHeader.put(COOKIE_HEADER_KEY, townCookie.toString() + ";" )); villageHeader.put(USER_AGENT, BROWSER_AGENTS[R.nextInt(BROWSER_AGENTS.length)]); HttpResponse villageResponse = HttpUtil.createGet(townCell.getUrl()).addHeaders(villageHeader).execute(); int villageStatus = villageResponse.getStatus(); String villageBody = villageResponse.body(); System.out.println(townCell.getName() + " | " + townCell.getUrl() + " | 请求状态" + villageStatus); Document villageDOC = Jsoup.parse(villageBody); Elements villageTrList = villageDOC.getElementsByClass( "villagetr" ); villageTrList.forEach(villageTr -> { Elements villageTdList = villageTr.getElementsByTag( "td" ); Element villageCodeTd = villageTdList.get( 0 ); Element villageTypeCodeTd = villageTdList.get( 1 ); Element villageNameTd = villageTdList.get( 2 ); RegionCell.RegionCellBuilder villageCellBuilder = RegionCell.builder(); villageCellBuilder.regionCode(villageCodeTd.text()); villageCellBuilder.name(villageNameTd.text()); villageCellBuilder.typeCode(villageTypeCodeTd.text()); villageCellBuilder.parentCode(townCell.getRegionCode()); villageCellBuilder.genTime(LocalDateTime.now()); RegionCell villageCell = villageCellBuilder.build(); System.out.println(villageCell); try { db.insert(Entity .create(TABLE_NAME) .set( "CODE" , villageCell.getRegionCode()) .set( "PARENT_CODE" , villageCell.getParentCode()) .set( "NAME" , villageCell.getName()) .set( "LEVEL" , 5 ) .set( "TYPE_CODE" , villageCell.getTypeCode()) .set( "LINK" , villageCell.getUrl()) .set( "GEN_TIME" , villageCell.getGenTime()) ); } catch (SQLException e) { e.printStackTrace(); } }); }); }); }); }); }); System.out.println( "爬取完毕" ); } } |
注意项:
在爬取居委村委层级时存在反爬限制,请求会被阻塞10分钟,程序不会报链接超时
这个问题暂时没找到解决办法,就是这样爬取的效率慢很多
我的思路是想,可不可以判断是否阻塞,如果阻塞就直接重新请求尝试
2022年7月6日22点49分更新:
对请求进行封装,使用递归不停止请求
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | private static HttpResponse retryConn(HttpRequest httpRequest) { HttpResponse httpResponse = null ; try { httpResponse = httpRequest .timeout(TIMEOUT) .setConnectionTimeout(TIMEOUT) .setReadTimeout(TIMEOUT) .execute(); } catch (Exception exception) { exception.printStackTrace(); return retryConn(httpRequest); } return httpResponse; } |
所有的请求都这样改成异常递归执行
1 2 | // HttpResponse response = HttpUtil.createGet(cell.getUrl()).addHeaders(header).execute(); HttpResponse response = retryConn(HttpUtil.createGet(cell.getLink()).addHeaders(header)); |
但是发现还是有爆栈的情况:
所以有对应的写了一份补数据的逻辑:
1、补数据直接采用递归实现
2、首先要查询已经爬取的数据,查询那些本该有子节点,但实际为空的记录
3、查询得到之后遍历记录记载的LINK,继续爬取
4、发现44和46两个省份的link规则不一样,要单独做调整(Ctrl + F 输44,搜下面的代码)
5、LEVEL的层级也不能市确定的1 - 2 - 3 - 4 - 5, 所以改用自连接LEFT JOIN,根据上级代码查询
6、逻辑可重复执行
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 | package cn.cloud9.fix; import cn.cloud9.po.RegionCell; import cn.hutool.core.collection.CollUtil; import cn.hutool.db.Db; import cn.hutool.db.Entity; import cn.hutool.http.HttpRequest; import cn.hutool.http.HttpResponse; import cn.hutool.http.HttpUtil; import com.alibaba.druid.util.StringUtils; import lombok.SneakyThrows; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; import java.sql.SQLException; import java.time.LocalDateTime; import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Random; /** * @author OnCloud9 * @description * @project RegionReptile * @date 2022年07月04日 下午 09:21 */ public class DataFixApplication { private static final int TIMEOUT = 3000 ; private static String YEAR; private static String HEADER_URL; private static String TABLE_NAME; public static final String CODE = "CODE" ; public static final String PARENT_CODE = "PARENT_CODE" ; public static final String NAME = "NAME" ; public static final String LEVEL = "LEVEL" ; public static final String TYPE_CODE = "TYPE_CODE" ; public static final String LINK = "LINK" ; public static final String GEN_TIME = "GEN_TIME" ; public static final String REFERER = "Referer" ; public static final String USER_AGENT = "User-Agent" ; public static final String[] BROWSER_AGENTS = { "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36" , "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" , "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0" , "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37" , "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko" }; public final static Random R = new Random(); private static Db db = Db.use(); public static void main(String[] args) { } public static void fixData(String year) { YEAR = year; TABLE_NAME = "region" + YEAR; HEADER_URL = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/" + YEAR + "/" ; List<RegionCell> regionCells = queryLostData(db, TABLE_NAME, 1 ); int provinceLost = regionCells.size(); if (!CollUtil.isEmpty(regionCells)) { // 递归查询 regionCells.forEach(DataFixApplication::readDataRecursive); } regionCells = queryLostData(db, TABLE_NAME, 2 ); int cityLost = regionCells.size(); if (!CollUtil.isEmpty(regionCells)) { // 递归查询 regionCells.forEach(DataFixApplication::readDataRecursive); } regionCells = queryLostData(db, TABLE_NAME, 3 ); int countyLost = regionCells.size(); if (!CollUtil.isEmpty(regionCells)) { // 递归查询 regionCells.forEach(DataFixApplication::readDataRecursive); } regionCells = queryLostData(db, TABLE_NAME, 4 ); int townLost = regionCells.size(); if (!CollUtil.isEmpty(regionCells)) { // 递归查询 regionCells.forEach(DataFixApplication::readDataRecursive); } if (provinceLost + cityLost + countyLost + townLost == 0 ) { System.out.println(YEAR + "年数据补完!" ); } } private static RegionCell readDataRecursive(RegionCell regionCell) { final String regionCellLink = regionCell.getLink(); if (StringUtils.isEmpty(regionCellLink)) return null ; final HttpRequest httpRequest = HttpUtil .createGet(regionCellLink) .timeout(TIMEOUT) .setConnectionTimeout(TIMEOUT) .setReadTimeout(TIMEOUT); if (!CollUtil.isEmpty(regionCell.getCookies())) { Map<String, String> headers = new HashMap<>(); regionCell.getCookies().forEach( cookie -> headers.put( "Cookie" , cookie.toString() + ";" )); headers.put(REFERER, regionCell.getLink()); headers.put(USER_AGENT, BROWSER_AGENTS[R.nextInt(BROWSER_AGENTS.length)]); headers.put( "Host" , "www.stats.gov.cn" ); headers.put( "Upgrade-Insecure-Requests" , "1" ); httpRequest.addHeaders(headers); } final HttpResponse httpResponse = httpRequest.execute(); // 封装cookie给下一次请求使用 final RegionCell.RegionCellBuilder builder = RegionCell.builder(); builder.cookies(httpResponse.getCookies()); final RegionCell newCell = builder.build(); if (!httpResponse.isOk()) return null ; final Document document = Jsoup.parse(httpResponse.body()); readCityData(document, newCell, regionCell); readCountyData(document, newCell, regionCell); readTownData(document, newCell, regionCell); readVillageData(document, newCell, regionCell); return null ; } private static void readVillageData(Document document, RegionCell newCell, RegionCell superCell) { Elements villageTrList = document.getElementsByClass( "villagetr" ); if (CollUtil.isEmpty(villageTrList)) return ; villageTrList.forEach(villageTr -> { Elements villageTdList = villageTr.getElementsByTag( "td" ); Element villageCodeTd = villageTdList.get( 0 ); Element villageTypeCodeTd = villageTdList.get( 1 ); Element villageNameTd = villageTdList.get( 2 ); newCell.setCode(villageCodeTd.text()); newCell.setParentCode(superCell.getCode()); newCell.setName(villageNameTd.text()); newCell.setTypeCode(villageTypeCodeTd.text()); newCell.setGenTime(LocalDateTime.now()); writeDataToDb( 5 , newCell); }); } private static void readTownData(Document document, RegionCell newCell, RegionCell superCell) { Elements townTrList = document.getElementsByClass( "towntr" ); if (CollUtil.isEmpty(townTrList)) return ; townTrList.forEach(townTr -> { Elements townTds = townTr.getElementsByTag( "td" ); Element townCodeTd = townTds.get( 0 ); Element townNameTd = townTds.get( 1 ); Elements aTagsInTownCodeTd = townCodeTd.getElementsByTag( "a" ); Elements aTagsInTownNameTd = townNameTd.getElementsByTag( "a" ); // 取城乡代码 String regionTownCode; String townNextHref = null ; boolean isTownLinkTag = !CollUtil.isEmpty(aTagsInTownCodeTd); if (isTownLinkTag) { regionTownCode = aTagsInTownCodeTd.get( 0 ).text(); townNextHref = aTagsInTownCodeTd.get( 0 ).attr( "href" ); } else regionTownCode = townCodeTd.text(); String regionTownName; if (!CollUtil.isEmpty(aTagsInTownNameTd)) regionTownName = aTagsInTownNameTd.get( 0 ).text(); else regionTownName = townNameTd.text(); final String codePrefix = regionTownCode.substring( 0 , 2 ); boolean condition1 = "44" .equals(codePrefix) || "46" .equals(codePrefix); if (condition1 && isTownLinkTag) { final String link = superCell.getLink(); final String basePath = link.substring( 0 , link.lastIndexOf( "/" ) + 1 ); String url = basePath + townNextHref; newCell.setLink(url); } else if (isTownLinkTag) { int index = townNextHref.indexOf( "/" ); String provincePath = townNextHref.substring(index + 1 , index + 3 ); String cityPath = "/" + townNextHref.substring(index + 3 , index + 5 ) + "/" ; String url = HEADER_URL + provincePath + cityPath + townNextHref; newCell.setLink(url); } newCell.setCode(regionTownCode); newCell.setParentCode(superCell.getCode()); newCell.setName(regionTownName); newCell.setGenTime(LocalDateTime.now()); writeDataToDb( 4 , newCell); readDataRecursive(newCell); }); } private static void readCountyData(Document document, RegionCell newCell, RegionCell superCell) { Elements countyTrList = document.getElementsByClass( "countytr" ); if (CollUtil.isEmpty(countyTrList)) return ; countyTrList.forEach(countyTr -> { Elements countyTds = countyTr.getElementsByTag( "td" ); Element countyCodeTd = countyTds.get( 0 ); Element countyNameTd = countyTds.get( 1 ); Elements aTagsInCountyCodeTd = countyCodeTd.getElementsByTag( "a" ); Elements aTagsInCountyNameTd = countyNameTd.getElementsByTag( "a" ); // 取城乡代码 String regionCountyCode; String countyNextHref = null ; boolean isCountyLinkTag = !CollUtil.isEmpty(aTagsInCountyCodeTd); if (isCountyLinkTag) { regionCountyCode = aTagsInCountyCodeTd.get( 0 ).text(); countyNextHref = aTagsInCountyCodeTd.get( 0 ).attr( "href" ); } else regionCountyCode = countyCodeTd.text(); String regionCountyName; if (!CollUtil.isEmpty(aTagsInCountyNameTd)) regionCountyName = aTagsInCountyNameTd.get( 0 ).text(); else regionCountyName = countyNameTd.text(); if (isCountyLinkTag) { int index = countyNextHref.indexOf( '/' ); String provincePath = countyNextHref.substring(index + 1 , index + 3 ) + "/" ; String countyUrl = HEADER_URL + provincePath + countyNextHref; newCell.setLink(countyUrl); } newCell.setCode(regionCountyCode); newCell.setParentCode(superCell.getCode()); newCell.setName(regionCountyName); newCell.setGenTime(LocalDateTime.now()); writeDataToDb( 3 , newCell); readDataRecursive(newCell); }); } private static void readCityData(Document document, RegionCell newCell, RegionCell superCell) { final Elements citytrList = document.getElementsByClass( "citytr" ); if (CollUtil.isEmpty(citytrList)) return ; citytrList.forEach(cityTr -> { Elements cityTds = cityTr.getElementsByTag( "td" ); Element codeTd = cityTds.get( 0 ); Element nameTd = cityTds.get( 1 ); Elements asTagInCodeTd = codeTd.getElementsByTag( "a" ); Elements asTagInNameTd = nameTd.getElementsByTag( "a" ); String regionCityCode; String cityNextHref = null ; boolean isLinkTag = !CollUtil.isEmpty(asTagInCodeTd); if (isLinkTag) { regionCityCode = asTagInCodeTd.get( 0 ).text(); cityNextHref = asTagInCodeTd.get( 0 ).attr( "href" ); } else regionCityCode = codeTd.text(); String regionName; if (!CollUtil.isEmpty(asTagInNameTd)) regionName = asTagInNameTd.get( 0 ).text(); else regionName = nameTd.text(); if (isLinkTag) newCell.setLink(HEADER_URL + cityNextHref); newCell.setCode(regionCityCode); newCell.setParentCode(superCell.getCode()); newCell.setName(regionName); newCell.setGenTime(LocalDateTime.now()); writeDataToDb( 2 , newCell); readDataRecursive(newCell); }); } private static void writeDataToDb( int level, RegionCell cell) { try { db.insertOrUpdate(Entity .create(TABLE_NAME) .set(CODE, cell.getCode()) .set(PARENT_CODE, cell.getParentCode()) .set(NAME, cell.getName()) .set(LEVEL, level) .set(TYPE_CODE, cell.getTypeCode()) .set(LINK, cell.getLink()) .set(GEN_TIME, cell.getGenTime()), CODE ); } catch (SQLException e) { e.printStackTrace(); } } /** * SELECT * SUPER.* * FROM * (SELECT * FROM region2021 WHERE `LEVEL` = 4) AS SUPER * LEFT JOIN (SELECT * FROM region2021) AS SUB ON SUPER.CODE = SUB.PARENT_CODE * WHERE SUB.PARENT_CODE IS NULL AND SUPER.LINK IS NOT NULL * @param db * @param tableName * @param level * @return */ @SneakyThrows private static List<RegionCell> queryLostData( Db db, String tableName, int level ) { String sql = "SELECT \n" + "\tSUPER.*\n" + "FROM \n" + "\t(SELECT * FROM " + tableName + " WHERE `LEVEL` = ? ) AS SUPER\n" + "\tLEFT JOIN (SELECT * FROM " + tableName + " ) AS SUB ON SUPER.CODE = SUB.PARENT_CODE\n" + "\tWHERE SUB.PARENT_CODE IS NULL AND SUPER.LINK IS NOT NULL" ; return db.query(sql, RegionCell. class , level); } } |
2022年7月9日06点08分更新
通过写补偿逻辑发现可以进一步优化代码结果:
1、常量统一存放
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | package cn.cloud9.constant; import java.util.Random; public interface Constant { int TIMEOUT = 3000 ; String PATH_CHAR = "/" ; String CODE = "CODE" ; String PARENT_CODE = "PARENT_CODE" ; String NAME = "NAME" ; String LEVEL = "LEVEL" ; String TYPE_CODE = "TYPE_CODE" ; String LINK = "LINK" ; String GEN_TIME = "GEN_TIME" ; String REFERER = "Referer" ; String USER_AGENT = "User-Agent" ; String[] BROWSER_AGENTS = { "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Mobile Safari/537.36" , "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" , "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0" , "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37" , "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko" }; String ROOT_PATH = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/${YEAR}/" ; Random R = new Random(); } |
2、调用的方法封装在Util中
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 | package cn.cloud9.util; import cn.cloud9.constant.Constant; import cn.cloud9.po.RegionCell; import cn.hutool.core.collection.CollUtil; import cn.hutool.db.Db; import cn.hutool.db.Entity; import cn.hutool.http.HttpRequest; import cn.hutool.http.HttpResponse; import cn.hutool.http.HttpUtil; import com.alibaba.druid.util.StringUtils; import lombok.SneakyThrows; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; import java.sql.SQLException; import java.time.LocalDateTime; import java.util.HashMap; import java.util.Map; import static cn.cloud9.constant.Constant.*; /** * @author OnCloud9 * @description * @project RegionReptile-Remaster * @date 2022年07月07日 下午 10:01 */ public class MyUtil { private static Db db = Db.use(); /** * 分配新的请求头Header * @return */ public static Map<String, String> getNewRequestHeader() { return new HashMap<String, String>(){{ this .put( "Host" , "www.stats.gov.cn" ); this .put( "Upgrade-Insecure-Requests" , "1" ); }}; } /** * 初始化表空间 * @param tableName */ @SneakyThrows public static void initialTableSpace(String tableName) { String SQL = "CREATE TABLE IF NOT EXISTS " + tableName + " (\n" + " `CODE` varchar(24) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT '行政区代码',\n" + " `PARENT_CODE` varchar(24) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '上级代码',\n" + " `NAME` varchar(24) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '名称',\n" + " `LINK` varchar(252) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT '网页地址',\n" + " `LEVEL` int DEFAULT NULL COMMENT '层级',\n" + " `TYPE_CODE` varchar(12) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '类型',\n" + " `GEN_TIME` datetime DEFAULT NULL COMMENT '创建时间',\n" + " PRIMARY KEY (`CODE`) USING BTREE, \n" + " KEY `IDX_LEVEL` (`LEVEL`) USING BTREE,\n" + " KEY `IDX_PC` (`PARENT_CODE`) USING BTREE \n" + ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci ROW_FORMAT=DYNAMIC;" ; db.execute(SQL, null ); } /** * 连接重试 * @param httpRequest * @return */ public static HttpResponse retryConn(HttpRequest httpRequest) { HttpResponse httpResponse = null ; try { httpResponse = httpRequest .timeout(TIMEOUT) .setConnectionTimeout(TIMEOUT) .setReadTimeout(TIMEOUT) .execute(); } catch (Exception exception) { exception.printStackTrace(); return retryConn(httpRequest); } return httpResponse; } /** * 写入操作封装, 不更新, * 有CODE主键发生冲突,异常后执行下一个 * @param tableName * @param cell */ @SneakyThrows public static void writeDataToDb(String tableName, RegionCell cell) { db.insert(Entity .create(tableName) .set(CODE, cell.getCode()) .set(PARENT_CODE, cell.getParentCode()) .set(NAME, cell.getName()) .set(LEVEL, cell.getLevel()) .set(TYPE_CODE, cell.getTypeCode()) .set(LINK, cell.getLink()) .set(GEN_TIME, cell.getGenTime())); } /** * 村,居委会读取 * @param document * @param newCell * @param superCell * @param tableName */ public static void readVillage( Document document, RegionCell newCell, RegionCell superCell, String tableName ) { Elements villageTrList = document.getElementsByClass( "villagetr" ); if (CollUtil.isEmpty(villageTrList)) return ; villageTrList.forEach(villageTr -> { Elements villageTdList = villageTr.getElementsByTag( "td" ); newCell.setCode(villageTdList.get( 0 ).text()); newCell.setParentCode(superCell.getCode()); newCell.setName(villageTdList.get( 2 ).text()); newCell.setTypeCode(villageTdList.get( 1 ).text()); newCell.setLevel( 5 ); newCell.setGenTime(LocalDateTime.now()); writeDataToDb(tableName, newCell); }); } /** * 街道,镇 读取 * @param BASE_URL * @param document * @param newCell * @param superCell * @param tableName */ public static void readTown ( String BASE_URL, Document document, RegionCell newCell, RegionCell superCell, String tableName ) { Elements townTrList = document.getElementsByClass( "towntr" ); if (CollUtil.isEmpty(townTrList)) return ; townTrList.forEach(townTr -> { Elements townTds = townTr.getElementsByTag( "td" ); Element townCodeTd = townTds.get( 0 ); Element townNameTd = townTds.get( 1 ); Elements aTagsInTownCodeTd = townCodeTd.getElementsByTag( "a" ); Elements aTagsInTownNameTd = townNameTd.getElementsByTag( "a" ); // 取城乡代码 String regionTownCode; String townNextHref = null ; boolean isTownLinkTag = !CollUtil.isEmpty(aTagsInTownCodeTd); if (isTownLinkTag) { regionTownCode = aTagsInTownCodeTd.get( 0 ).text(); townNextHref = aTagsInTownCodeTd.get( 0 ).attr( "href" ); } else regionTownCode = townCodeTd.text(); String regionTownName; if (!CollUtil.isEmpty(aTagsInTownNameTd)) regionTownName = aTagsInTownNameTd.get( 0 ).text(); else regionTownName = townNameTd.text(); final String codePrefix = regionTownCode.substring( 0 , 2 ); boolean condition1 = "44" .equals(codePrefix) || "46" .equals(codePrefix); if (condition1 && isTownLinkTag) { final String link = superCell.getLink(); final String basePath = link.substring( 0 , link.lastIndexOf(Constant.PATH_CHAR) + 1 ); String url = basePath + townNextHref; newCell.setLink(url); } else if (isTownLinkTag) { int index = townNextHref.indexOf(Constant.PATH_CHAR); String provincePath = townNextHref.substring(index + 1 , index + 3 ); String cityPath = Constant.PATH_CHAR + townNextHref.substring(index + 3 , index + 5 ) + Constant.PATH_CHAR; String url = BASE_URL + provincePath + cityPath + townNextHref; newCell.setLink(url); } newCell.setCode(regionTownCode); newCell.setParentCode(superCell.getCode()); newCell.setName(regionTownName); newCell.setGenTime(LocalDateTime.now()); newCell.setLevel( 4 ); writeDataToDb(tableName, newCell); readDataRecursive(newCell, BASE_URL, tableName); }); } /** * 区县读取 * @param BASE_URL * @param document * @param newCell * @param superCell * @param tableName */ public static void readCounty ( String BASE_URL, Document document, RegionCell newCell, RegionCell superCell, String tableName ) { Elements countyTrList = document.getElementsByClass( "countytr" ); if (CollUtil.isEmpty(countyTrList)) return ; countyTrList.forEach(countyTr -> { Elements countyTds = countyTr.getElementsByTag( "td" ); Element countyCodeTd = countyTds.get( 0 ); Element countyNameTd = countyTds.get( 1 ); Elements aTagsInCountyCodeTd = countyCodeTd.getElementsByTag( "a" ); Elements aTagsInCountyNameTd = countyNameTd.getElementsByTag( "a" ); // 取城乡代码 String regionCountyCode; String countyNextHref = null ; boolean isCountyLinkTag = !CollUtil.isEmpty(aTagsInCountyCodeTd); if (isCountyLinkTag) { regionCountyCode = aTagsInCountyCodeTd.get( 0 ).text(); countyNextHref = aTagsInCountyCodeTd.get( 0 ).attr( "href" ); } else regionCountyCode = countyCodeTd.text(); String regionCountyName; if (!CollUtil.isEmpty(aTagsInCountyNameTd)) regionCountyName = aTagsInCountyNameTd.get( 0 ).text(); else regionCountyName = countyNameTd.text(); if (isCountyLinkTag) { int index = countyNextHref.indexOf(Constant.PATH_CHAR); String provincePath = countyNextHref.substring(index + 1 , index + 3 ) + Constant.PATH_CHAR; String countyUrl = BASE_URL + provincePath + countyNextHref; newCell.setLink(countyUrl); } newCell.setCode(regionCountyCode); newCell.setParentCode(superCell.getCode()); newCell.setName(regionCountyName); newCell.setGenTime(LocalDateTime.now()); newCell.setLevel( 3 ); writeDataToDb(tableName, newCell); readDataRecursive(newCell, BASE_URL, tableName); }); } /** * 城市读取 * @param BASE_URL * @param document * @param newCell * @param superCell * @param tableName */ public static void readCity ( String BASE_URL, Document document, RegionCell newCell, RegionCell superCell, String tableName ) { final Elements citytrList = document.getElementsByClass( "citytr" ); if (CollUtil.isEmpty(citytrList)) return ; citytrList.forEach(cityTr -> { Elements cityTds = cityTr.getElementsByTag( "td" ); Element codeTd = cityTds.get( 0 ); Element nameTd = cityTds.get( 1 ); Elements asTagInCodeTd = codeTd.getElementsByTag( "a" ); Elements asTagInNameTd = nameTd.getElementsByTag( "a" ); String regionCityCode; String cityNextHref = null ; boolean isLinkTag = !CollUtil.isEmpty(asTagInCodeTd); if (isLinkTag) { regionCityCode = asTagInCodeTd.get( 0 ).text(); cityNextHref = asTagInCodeTd.get( 0 ).attr( "href" ); } else regionCityCode = codeTd.text(); String regionName; if (!CollUtil.isEmpty(asTagInNameTd)) regionName = asTagInNameTd.get( 0 ).text(); else regionName = nameTd.text(); if (isLinkTag) newCell.setLink(BASE_URL + cityNextHref); newCell.setCode(regionCityCode); newCell.setParentCode(superCell.getCode()); newCell.setName(regionName); newCell.setGenTime(LocalDateTime.now()); newCell.setLevel( 2 ); writeDataToDb(tableName, newCell); readDataRecursive(newCell, BASE_URL, tableName); }); } /** * 递归请求调用 * @param regionCell * @param BASE_URL * @param tableName */ @SneakyThrows public static void readDataRecursive(RegionCell regionCell, String BASE_URL, String tableName) { final String regionCellLink = regionCell.getLink(); if (StringUtils.isEmpty(regionCellLink)) return ; final HttpRequest httpRequest = HttpUtil .createGet(regionCellLink) .timeout(TIMEOUT) .setConnectionTimeout(TIMEOUT) .setReadTimeout(TIMEOUT); if (!CollUtil.isEmpty(regionCell.getCookies())) { final Map<String, String> headers = MyUtil.getNewRequestHeader(); regionCell.getCookies().forEach( cookie -> headers.put( "Cookie" , cookie.toString() + ";" )); headers.put(REFERER, regionCell.getLink()); headers.put(USER_AGENT, BROWSER_AGENTS[Constant.R.nextInt(BROWSER_AGENTS.length)]); headers.put( "Host" , "www.stats.gov.cn" ); headers.put( "Upgrade-Insecure-Requests" , "1" ); httpRequest.addHeaders(headers); } final HttpResponse httpResponse = retryConn(httpRequest); // 封装cookie给下一次请求使用 final RegionCell.RegionCellBuilder builder = RegionCell.builder(); builder.cookies(httpResponse.getCookies()); final RegionCell newCell = builder.build(); if (!httpResponse.isOk()) { System.out.println(httpResponse.body()); return ; } final Document document = Jsoup.parse(httpResponse.body()); readCity(BASE_URL, document, newCell, regionCell, tableName); readCounty(BASE_URL, document, newCell, regionCell, tableName); readTown(BASE_URL, document, newCell, regionCell, tableName); readVillage(document, newCell, regionCell, tableName); return ; } /** * 读取省份数据 * @param BASE_URL * @param tableName */ public static void readProvinceData(String BASE_URL, String tableName) { final String s = HttpUtil.get(BASE_URL); final Elements provincetrs = Jsoup.parse(s).getElementsByClass( "provincetr" ); provincetrs.forEach(tr -> { Elements provinceas = tr.getElementsByTag( "a" ); provinceas.forEach(a -> { RegionCell cell = RegionCell.builder() .name(a.text()) .code( a.attr( "href" ).replace( ".html" , "" )) .link(BASE_URL + a.attr( "href" )) .parentCode(String.valueOf( 0 )) .genTime(LocalDateTime.now()) .level( 1 ) .build(); MyUtil.writeDataToDb(tableName, cell); readDataRecursive(cell, BASE_URL, tableName); }); }); } } |
启动类就不用写那么多东西了
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | package cn.cloud9; import cn.cloud9.constant.Constant; import cn.cloud9.util.MyUtil; /** * @author OnCloud9 * @description * @project RegionReptile-Remaster * @date 2022年07月07日 下午 09:35 */ public class MainApplication { public static void main(String[] args) { final String year = 0 == args.length ? "2009" : args[ 0 ]; String tableName = "region" + year; String BASE_URL = Constant.ROOT_PATH.replace( "${YEAR}" , year); MyUtil.initialTableSpace(tableName); MyUtil.readProvinceData(BASE_URL, tableName); } } |
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 单元测试从入门到精通
· 上周热点回顾(3.3-3.9)
· Vue3状态管理终极指南:Pinia保姆级教程