利用php爬虫分析南京房价
前些天看到csdn上的一篇文章,利用python写爬虫来分析上海房价的。感觉挺有意思的。正好最最近写snake后台也写到了文章采集,我也来用php的爬虫来分析一下南京的房价。说干就开始吧。
本次爬虫的依赖文件: 首先是ares333大神的CURL类。我用的是初期的版本,这是[url]https://github.com/ares333/php-curlmulti[/url]大神的github项目地址,他写的curl确实很牛!
采集用的是phpQuery,不知道这个类的朋友,可以自行百度吧。
至于数据的来源,我选择安居客,数据量还是可以的,打开安居客选到南京的频道。开始分析他们的页面结构,至于怎么用phpQuery分析页面结构采集的方法,这里就不做详细的介绍了。分析好结构,好,开始建立数据表。首先建立区域表,房屋交易都是分版块的,版块表结构如下:
1 CREATE TABLE `area` ( 2 `id` int(11) NOT NULL AUTO_INCREMENT, 3 `name` varchar(155) NOT NULL COMMENT '南京市区', 4 `url` varchar(155) NOT NULL COMMENT '房源区域连接', 5 `pid` int(2) NOT NULL COMMENT '分类', 6 PRIMARY KEY (`id`) 7 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
我是首先自己添加的一些区服的数据,其实可以采集这些,因为就那几个区,地址有限,就直接添加了。添加了是14条数据:
初始数据准备好了,就可以开始采集所有的区域版块入口地址了。贴上代码
area.php
1 <?php 2 // +---------------------------------------------------------------------- 3 // | 采集区域脚本 4 // +---------------------------------------------------------------------- 5 // | Author: NickBai <1902822973@qq.com> 6 // +---------------------------------------------------------------------- 7 set_time_limit(0); 8 require 'init.php'; 9 10 //根据大区信息前往抓取 11 $sql = "select * from `area`"; 12 $area = $db->query( $sql )->fetchAll( PDO::FETCH_ASSOC ); 13 14 foreach($area as $key=>$vo){ 15 16 $url = $vo['url']; 17 $result = $curl->read($url); 18 19 $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result['content'], $temp) ? strtolower( $temp[1] ) : ""; 20 phpQuery::$defaultCharset = $charset; //设置默认编码 21 22 $html = phpQuery::newDocumentHTML( $result['content'] ); 23 24 $span = $html['.items .sub-items a']; 25 26 $st = $db->prepare("insert into area(name,url,pid) values(?,?,?)"); 27 foreach($span as $v){ 28 $v = pq( $v ); 29 30 //为方便分页抓取,先加入分页规则 31 $href = trim( $v->attr('href') ) . 'p*/#filtersort'; 32 $st->execute([ trim( $v->text() ), $href, $vo['id']]); 33 } 34 }
采集出的单条数据如下:
15 百家湖 http://nanjing.anjuke.com/sale/baijiahu/p*/#filtersort 1
数据地址都有了,而且页面地址我加了*,这样就可以替换了,打开程序就能开始采集每个模块下的其他分页的书据了。最重要的主程序就要开始了;
新建hdetail表来记录采集来的房屋数信息:
1 CREATE TABLE `hdetail` ( 2 `id` int(11) NOT NULL AUTO_INCREMENT, 3 `pid` int(5) NOT NULL COMMENT '区域id', 4 `square` int(10) DEFAULT NULL COMMENT '面积', 5 `housetype` varchar(55) DEFAULT '' COMMENT '房屋类型', 6 `price` int(10) DEFAULT '0' COMMENT '单价', 7 `allprice` int(10) DEFAULT '0' COMMENT '总价', 8 `name` varchar(155) DEFAULT '' COMMENT '小区名称', 9 `addr` varchar(155) DEFAULT '' COMMENT '小区地址', 10 PRIMARY KEY (`id`) 11 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
数据库有了,那么主程序奉上。house.php
1 <?php 2 // +---------------------------------------------------------------------- 3 // | 采集各区具体房源信息 4 // +---------------------------------------------------------------------- 5 // | Author: NickBai <1902822973@qq.com> 6 // +---------------------------------------------------------------------- 7 set_time_limit(0); 8 require 'init.php'; 9 10 //查询各板块数据 11 $sql = "select * from `area` where id > 14"; 12 $allarea = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 13 //http://nanjing.anjuke.com/sale/页面不存在时,会跳转到首页 14 15 foreach($allarea as $key=>$vo){ 16 17 $url = $vo['url']; 18 $i = 1; 19 20 while ( true ){ 21 22 $urls = str_replace( "*" , $i, $url); 23 $result = $curl->read( $urls ); 24 if( "http://nanjing.anjuke.com/sale/" == $result['info']['url'] ){ 25 break; 26 } 27 28 $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result['content'], $temp) ? strtolower( $temp[1] ) : ""; 29 phpQuery::$defaultCharset = $charset; //设置默认编码 30 31 $html = phpQuery::newDocumentHTML( $result['content'] ); 32 33 $div = $html['#houselist-mod li .house-details']; 34 $isGet = count( $div->elements ); //未采集到内容跳出,视为结束 35 if( 0 == $isGet ){ 36 break; 37 } 38 39 foreach($div as $v){ 40 41 $sql = "insert into hdetail(pid,square,housetype,price,allprice,name,addr) "; 42 $pid = $vo['id']; 43 $square = rtrim( trim( pq($v)->find("div:eq(1) span:eq(0)")->text() ), "平方米"); 44 $htype = trim( pq($v)->find("div:eq(1) span:eq(1)")->text() ); 45 $price = rtrim ( trim( pq($v)->find("div:eq(1) span:eq(2)")->text() ), "元/m²"); 46 $area = explode(" ", trim( pq($v)->find("div:eq(2) span")->text() ) ); 47 48 $name = str_replace( chr(194) . chr(160), "", array_shift($area) ); //utf-8中的空格无法用trim去除,所以采用此方法 49 $addr = rtrim( ltrim (trim( array_pop($area) ) , "["), "]" ); 50 $allprice = trim( pq($v)->siblings(".pro-price")->find("span strong")->text() ); 51 52 $sql .= " value( ". $pid .",". $square .", '". $htype ."' ,". $price .",". $allprice .", '". $name ."' ,'". $addr ."' )"; 53 $db->query($sql); 54 } 55 56 echo mb_convert_encoding($vo['name'], "gbk", "utf-8") . " PAGE : ". $i . PHP_EOL; 57 $i++; 58 59 } 60 61 } 62 [/code] 63 跳过前面的大区,逐个采集。建议用cmd模式运行这个脚本。因为时间较长,所以用浏览器会导致卡死现象。至于不知道怎么用cmd命令执行php的,自己百度吧。 64 [img]/Uploads/editor/2016-08-17/57b3d927dd54e.jpg[/img] 65 如果觉得慢的话,你们可以复制几分house.php文件,修改 66 [code] 67 $sql = "select * from `area` where id > 14"; 68 [/code] 69 根据id进行截取,多打开几个cmd执行,就变成多进程模式了。 70 [img]/Uploads/editor/2016-08-17/57b3d978afae8.jpg[/img] 71 下面就是等待了,我是8.16号采集的,一共采集了311226条数据。好了现在数有了,就可以开始分析了。我分析的代码如下: 72 [code] 73 <?php 74 require "init.php"; 75 76 $data = unserialize( file_get_contents('./data/nj.data') ); 77 if( empty( $data ) ){ 78 79 //全南京 80 $sql = "select avg(price) price from hdetail"; 81 $nanjing = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )['price'] ); 82 83 //其余数据 84 $data = [ 85 $nanjing, 86 getOtherPrice('2,3,4,5,6,7,8,10'), 87 getOtherPrice('1'), 88 getOtherPrice('2'), 89 getOtherPrice('3'), 90 getOtherPrice('4'), 91 getOtherPrice('5'), 92 getOtherPrice('6'), 93 getOtherPrice('7'), 94 getOtherPrice('8'), 95 getOtherPrice('9'), 96 getOtherPrice('10'), 97 getOtherPrice('11'), 98 getOtherPrice('12'), 99 getOtherPrice('13') 100 ]; 101 102 //添加缓存 103 file_put_contents('./data/nj.data', serialize( $data )); 104 } 105 106 //均价最高TOP10 107 $sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price desc limit 10"; 108 $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 109 $x = ""; 110 $y = ""; 111 foreach($res as $vo){ 112 $x .= "'" . $vo['name'] . "',"; 113 $y .= intval( $vo['price'] ). ","; 114 } 115 116 //均价最低TOP10 117 $sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price asc limit 10"; 118 $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 119 $xl = ""; 120 $yl = ""; 121 foreach($res as $vo){ 122 $xl .= "'" . $vo['name'] . "',"; 123 $yl .= intval( $vo['price'] ). ","; 124 } 125 126 //交易房型数据 127 $sql = "select count(0) allnum, housetype from hdetail GROUP BY housetype order by allnum desc"; 128 $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 129 $htype = ""; 130 foreach($res as $vo){ 131 $htype .= "[ '" . $vo['housetype'] . "', " .$vo['allnum']. "],"; 132 } 133 134 $htype = rtrim($htype, ','); 135 136 //交易的房屋面积数据 137 $square = ['50平米以下', '50-70平米', '70-90平米', '90-120平米', '120-150平米', '150-200平米', '200-300平米', '300平米以上']; 138 $sql = "select count(0) allnum, square from hdetail GROUP BY square"; 139 $squ = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 140 141 $p50 = 0; 142 $p70 = 0; 143 $p90 = 0; 144 $p120 = 0; 145 $p150 = 0; 146 $p200 = 0; 147 $p250 = 0; 148 $p300 = 0; 149 150 foreach($squ as $key=>$vo){ 151 if( $vo['square'] < 50 ){ 152 $p50 += $vo['allnum']; 153 } 154 if( $vo['square'] >= 50 && $vo['square'] < 70 ){ 155 $p70 += $vo['allnum']; 156 } 157 if( $vo['square'] >= 70 && $vo['square'] < 90 ){ 158 $p90 += $vo['allnum']; 159 } 160 if( $vo['square'] >= 90 && $vo['square'] < 120 ){ 161 $p120 += $vo['allnum']; 162 } 163 if( $vo['square'] >= 120 && $vo['square'] < 150 ){ 164 $p150 += $vo['allnum']; 165 } 166 if( $vo['square'] >= 150 && $vo['square'] < 200 ){ 167 $p200 += $vo['allnum']; 168 } 169 if( $vo['square'] >= 200 && $vo['square'] < 300 ){ 170 $p250 += $vo['allnum']; 171 } 172 if( $vo['square'] >= 300 ){ 173 $p300 += $vo['allnum']; 174 } 175 } 176 177 $num = [ $p50, $p70, $p90, $p120, $p150, $p200, $p250, $p300 ]; 178 179 $sqStr = ""; 180 foreach($square as $key=>$vo){ 181 $sqStr .= "[ '" . $vo . "', " .$num[$key]. "],"; 182 } 183 184 185 //根据获取ids字符串获取对应的均价信息 186 function getOtherPrice($str){ 187 global $db; 188 189 $sql = "select id from area where pid in(" . $str . ")"; 190 $city = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); 191 $ids = ""; 192 foreach($city as $v){ 193 $ids .= $v['id'] . ","; 194 } 195 $sql = "select avg(price) price from hdetail where pid in (".rtrim($ids, ",").")"; 196 $price = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )['price'] ); 197 198 return $price; 199 } 200 201 ?> 202 <!DOCTYPE html> 203 <html> 204 <head> 205 <meta charset="utf-8"> 206 <meta name="viewport" content="width=device-width, initial-scale=1.0"> 207 <title>南京房价分析</title> 208 <link rel="shortcut icon" href="favicon.ico"> <link href="css/bootstrap.min.css?v=3.3.6" rel="stylesheet"> 209 <link href="css/font-awesome.min.css?v=4.4.0" rel="stylesheet"> 210 <link href="css/animate.min.css" rel="stylesheet"> 211 <link href="css/style.min.css?v=4.1.0" rel="stylesheet"> 212 </head> 213 <body class="gray-bg"> 214 <div class="wrapper wrapper-content"> 215 <div class="row"> 216 <div class="col-sm-12"> 217 <div class="row"> 218 <div class="col-sm-12"> 219 <div class="ibox float-e-margins"> 220 <div class="ibox-title"> 221 <h5>全南京以及各区二手房均价</h5> 222 <div class="ibox-tools"> 223 <a class="collapse-link"> 224 <i class="fa fa-chevron-up"></i> 225 </a> 226 <a class="close-link"> 227 <i class="fa fa-times"></i> 228 </a> 229 </div> 230 </div> 231 <div class="ibox-content"> 232 <div id="container"></div> 233 </div> 234 </div> 235 </div> 236 </div> 237 </div> 238 </div> 239 <div class="row"> 240 <div class="col-sm-6"> 241 <div class="row"> 242 <div class="col-sm-12"> 243 <div class="ibox float-e-margins"> 244 <div class="ibox-title"> 245 <h5>均价最高的小区TOP10</h5> 246 <div class="ibox-tools"> 247 <a class="collapse-link"> 248 <i class="fa fa-chevron-up"></i> 249 </a> 250 <a class="close-link"> 251 <i class="fa fa-times"></i> 252 </a> 253 </div> 254 </div> 255 <div class="ibox-content"> 256 <div id="avgpriceh"></div> 257 </div> 258 </div> 259 </div> 260 </div> 261 </div> 262 <div class="col-sm-6"> 263 <div class="row"> 264 <div class="col-sm-12"> 265 <div class="ibox float-e-margins"> 266 <div class="ibox-title"> 267 <h5>均价最低的小区TOP10</h5> 268 <div class="ibox-tools"> 269 <a class="collapse-link"> 270 <i class="fa fa-chevron-up"></i> 271 </a> 272 <a class="close-link"> 273 <i class="fa fa-times"></i> 274 </a> 275 </div> 276 </div> 277 <div class="ibox-content"> 278 <div id="avgpricel"></div> 279 </div> 280 </div> 281 </div> 282 </div> 283 </div> 284 </div> 285 286 <div class="row"> 287 <div class="col-sm-6"> 288 <div class="row"> 289 <div class="col-sm-12"> 290 <div class="ibox float-e-margins"> 291 <div class="ibox-title"> 292 <h5>交易房型比例</h5> 293 <div class="ibox-tools"> 294 <a class="collapse-link"> 295 <i class="fa fa-chevron-up"></i> 296 </a> 297 <a class="close-link"> 298 <i class="fa fa-times"></i> 299 </a> 300 </div> 301 </div> 302 <div class="ibox-content"> 303 <div id="htype"></div> 304 </div> 305 </div> 306 </div> 307 </div> 308 </div> 309 <div class="col-sm-6"> 310 <div class="row"> 311 <div class="col-sm-12"> 312 <div class="ibox float-e-margins"> 313 <div class="ibox-title"> 314 <h5>交易房屋面积比例</h5> 315 <div class="ibox-tools"> 316 <a class="collapse-link"> 317 <i class="fa fa-chevron-up"></i> 318 </a> 319 <a class="close-link"> 320 <i class="fa fa-times"></i> 321 </a> 322 </div> 323 </div> 324 <div class="ibox-content"> 325 <div id="square"></div> 326 </div> 327 </div> 328 </div> 329 </div> 330 </div> 331 </div> 332 333 </div> 334 <script type="text/javascript" src="js/jquery.min.js?v=2.1.4"></script> 335 <script type="text/javascript" src="js/bootstrap.min.js?v=3.3.6"></script> 336 <script type="text/javascript" src="http://cdn.hcharts.cn/highcharts/highcharts.js"></script> 337 <script type="text/javascript"> 338 $(function () { 339 $('#container').highcharts({ 340 chart: { 341 type: 'column' 342 }, 343 title: { 344 text: '全南京以及各区二手房均价' 345 }, 346 subtitle: { 347 text: '来源于安居客8.16的数据' 348 }, 349 xAxis: { 350 categories: ['全南京','江南八区','江宁区','鼓楼区','白下区','玄武区','建邺区','秦淮区','下关区','雨花台区','浦口区','栖霞区','六合区', 351 '溧水区','高淳区','大厂'], 352 crosshair: true 353 }, 354 yAxis: { 355 min: 0, 356 title: { 357 text: '元/m²' 358 } 359 }, 360 tooltip: { 361 headerFormat: '<span style="font-size:10px">{point.key}</span><table>', 362 pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' + 363 '<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>', 364 footerFormat: '</table>', 365 shared: true, 366 useHTML: true 367 }, 368 plotOptions: { 369 column: { 370 pointPadding: 0.2, 371 borderWidth: 0, 372 dataLabels:{ 373 enabled:true// dataLabels设为true 374 } 375 } 376 }, 377 series: [{ 378 name: '平均房价', 379 data: [<?php echo implode(',', $data); ?>] 380 }] 381 }); 382 383 //均价最高top10 384 $('#avgpriceh').highcharts({ 385 chart: { 386 type: 'column' 387 }, 388 title: { 389 text: '均价最高的小区TOP10' 390 }, 391 subtitle: { 392 text: '来源于安居客8.16的数据' 393 }, 394 xAxis: { 395 categories: [<?=$x; ?>], 396 crosshair: true 397 }, 398 yAxis: { 399 min: 0, 400 title: { 401 text: '元/m²' 402 } 403 }, 404 tooltip: { 405 headerFormat: '<span style="font-size:10px">{point.key}</span><table>', 406 pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' + 407 '<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>', 408 footerFormat: '</table>', 409 shared: true, 410 useHTML: true 411 }, 412 plotOptions: { 413 column: { 414 pointPadding: 0.2, 415 borderWidth: 0, 416 dataLabels:{ 417 enabled:true// dataLabels设为true 418 } 419 } 420 }, 421 series: [{ 422 name: '平均房价', 423 data: [<?=$y; ?>] 424 }] 425 }); 426 427 //均价最低top10 428 $('#avgpricel').highcharts({ 429 chart: { 430 type: 'column' 431 }, 432 title: { 433 text: '均价最低的小区TOP10' 434 }, 435 subtitle: { 436 text: '来源于安居客8.16的数据' 437 }, 438 xAxis: { 439 categories: [<?=$xl; ?>], 440 crosshair: true 441 }, 442 yAxis: { 443 min: 0, 444 title: { 445 text: '元/m²' 446 } 447 }, 448 tooltip: { 449 headerFormat: '<span style="font-size:10px">{point.key}</span><table>', 450 pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' + 451 '<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>', 452 footerFormat: '</table>', 453 shared: true, 454 useHTML: true 455 }, 456 plotOptions: { 457 column: { 458 pointPadding: 0.2, 459 borderWidth: 0, 460 dataLabels:{ 461 enabled:true// dataLabels设为true 462 } 463 } 464 }, 465 series: [{ 466 name: '平均房价', 467 data: [<?=$yl; ?>] 468 }] 469 }); 470 471 // Radialize the colors 472 Highcharts.getOptions().colors = Highcharts.map(Highcharts.getOptions().colors, function (color) { 473 return { 474 radialGradient: { cx: 0.5, cy: 0.3, r: 0.7 }, 475 stops: [ 476 [0, color], 477 [1, Highcharts.Color(color).brighten(-0.3).get('rgb')] // darken 478 ] 479 }; 480 }); 481 //房型类型 482 $('#htype').highcharts({ 483 chart: { 484 plotBackgroundColor: null, 485 plotBorderWidth: null, 486 plotShadow: false 487 }, 488 title: { 489 text: '交易的二手房型比例' 490 }, 491 tooltip: { 492 pointFormat: '{series.name}: <b>{point.percentage:.1f}%</b>' 493 }, 494 plotOptions: { 495 pie: { 496 allowPointSelect: true, 497 cursor: 'pointer', 498 dataLabels: { 499 enabled: true, 500 format: '<b>{point.name}</b>: {point.percentage:.1f} %', 501 style: { 502 color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || 'black' 503 }, 504 connectorColor: 'silver' 505 } 506 } 507 }, 508 series: [{ 509 type: 'pie', 510 name: 'Browser share', 511 data: [ 512 <?=$htype; ?> 513 ] 514 }] 515 }); 516 517 //房型面积类型 518 $('#square').highcharts({ 519 chart: { 520 plotBackgroundColor: null, 521 plotBorderWidth: null, 522 plotShadow: false 523 }, 524 title: { 525 text: '交易的二手房面积比例' 526 }, 527 tooltip: { 528 pointFormat: '{series.name}: <b>{point.percentage:.1f}%</b>' 529 }, 530 plotOptions: { 531 pie: { 532 allowPointSelect: true, 533 cursor: 'pointer', 534 dataLabels: { 535 enabled: true, 536 format: '<b>{point.name}</b>: {point.percentage:.1f} %', 537 style: { 538 color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || 'black' 539 }, 540 connectorColor: 'silver' 541 } 542 } 543 }, 544 series: [{ 545 type: 'pie', 546 name: 'Browser share', 547 data: [ 548 <?=$sqStr; ?> 549 ] 550 }] 551 }); 552 553 }); 554 </script> 555 </body> 556 </html>
具体的截图效果,大家可以参考我的http://www.thinkphp.cn/code/2239.html这篇文章,源码在那边页游呈现
声明:本文内容仅是本人学习的记录,不保证在项目中可用,若引用此代码导致了严重后果,本人不承担任何法律责任。