编程

看山是山 看水是水

导航

Linux下利用Shell使PHP并发采集淘宝产品

上次项目中用到<<PHP采集淘宝商品>>

此方法有一个缺点,就是执行效率问题.一个商品采集平均需要0.8秒.那10000个商品采集完需要2个半小时.

首先想到的解决办法是并发.

但是PHP不支持并发(这里是指通过PHP命令执行PHP文件,如果通过apache或nginx等做服务器是可以并发的,是并发访问,不能在程序中实现并发).

通过Shell把对php命令推到后台执行,以达到并发的效果.

整体思路:

  1.在Shell中连接数据库,取出需要更新的产品

  2.Shell中对数据进行循环,把商品id,price,url传递给PHP执行,将执行过程推到后台

  3.每循环20条使程序暂停5秒,达到控制并发数的目的

  4.php得到id,price,url参数后,通过URL进行采集,并返回现价

  5.将现价和原数据库中的价格进行比较,如果有变化或下架则更新.

  6.将执行结果写入日志文件.

Shell

#!/bin/sh
#updateprice.sh
j=0
currcyline=0;
#查询数据库
for i in `/usr/local/bin/mysql -uroot -pshop123 shop -e"SELECT id,price,url FROM s_goods WHERE url!='' AND keyid like 'taobao%' AND is_off_sale=0 ORDER BY id DESC  "` 
do
        if [ $j -gt 2 ]; then
        #前3个循环分别为id,price,url这相当于表头,不需要进行操作,所以从第3开始.
                line=$(($j%3))
                case $line in
                0)
                        currcyline=$(($currcyline+1))

                        s=$(($currcyline%20))

                        if [ $s -eq 0 ]; then
                                sleep 5  #每循环20次休息5秒,以此来控制避免产生过多的后台进行,使服务器压力过大或死机.
                        fi
                        id=$i;;
                1)
                        price=$i;;

                2)
                        url=$i;
                        #echo id:${id}  price:${price}  url:${url}

                        {
                                                                #调用php命令执行PHP文件.
                                re=`/usr/local/bin/php /var/www/9384shop/cron/goodsupdate.php ${id} ${price} "${url}"` 
                        }&
                        #此&为推到后台执行, 关键
                esac


        fi
        j=$(($j+1))
done
wait
#等待后台进行执行完成

 

PHP:

<?php
/*
 ============================================================================
 Name        : goodsupdate.php
 Author      : 风飘无痕
 Version     :
 Copyright   : Your copyright notice
 Description : Collect taobao goods
 ============================================================================
 */

//$argv为获取命令中的参数
$s=microtime(1);
$id=$argv[1];
$oldprice=$argv[2];
$price=getPrice($argv[3]);
$t=microtime(1)-$s;
$r=array();
$r[]=date('Y-m-d H:i:s');
$r[]=$id;
$r[]=ceil($t*1000)/1000;
if($price=='soldout'){
        $r[]="OutStock";
        $con=mysql_connect('localhost','shop','shop123');
        mysql_select_db("shop", $con);
        mysql_query("UPDATE s_goods SET is_off_sale=1 WHERE id=".$id);
        mysql_close($con);
}
elseif($price===false) $r[]= 'FALSE';
elseif($price==$oldprice) $r[]='EQUAL';
else{
        $r[]="UPDATE";
        $r[]=$oldprice;
        $r[]=$price;
        $con=mysql_connect('localhost','shop','shop123');
        mysql_select_db("shop", $con);
        mysql_query("UPDATE s_goods SET price=".$price." WHERE id=".$id);
        mysql_close($con);
}
//以日志的形式保存执行过程
$h=fopen('/home/staff/www/9384shop/log/goodsUpdate.log','a+');

fputcsv($h,$r);
fclose($h);

function getPrice($url,$time=1){
        $des_url='';
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0');
        curl_setopt($ch, CURLOPT_REFERER,'http://www.tmall.com/');//采集淘宝商品必须设置此项
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);//设置输出方式, 0为自动输出返回的内容, 1为返回输出的内容,但不自动输出.
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30); //timeout on connect
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout on response
        curl_setopt($ch, CURLOPT_HEADER, 1);//是否输出头信息,0为不输出,非零则输出
        curl_setopt($ch, CURLOPT_MAXREDIRS, 10 );
        curl_setopt($ch, CURLOPT_URL, $url);
        $content = curl_exec($ch);
        curl_close($ch);
        if(preg_match('/noitem\.htm/',$content)){
                return 'soldout';
        }elseif(preg_match("/'reservePrice'\s*:\s*'([\d\.]+?)',/",$content,$price)){
                $price = (float)$price[1];
        }elseif(preg_match('/price:([\d\.]+?),/',$content,$price)){
                $price = (float)$price[1];
        }
        if(!$price){
                preg_match('/id=(\d+)+/',$url,$temp);
                $url2="http://mdskip.taobao.com/core/initItemDetail.htm?itemId=".$temp[1];
                $ch = curl_init();
                curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
                curl_setopt( $ch, CURLOPT_URL, $url2 );
                curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
                curl_setopt( $ch, CURLOPT_ENCODING, "" );
                curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
                curl_setopt( $ch, CURLOPT_REFERER, 'http://www.tmall.com' );
                curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
                curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );   
                curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 10 );
                curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );
                curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
                $price_content = curl_exec( $ch );
                $response = curl_getinfo( $ch );
                curl_close ( $ch );
                $price_content=json_decode(iconv('gbk','utf-8',preg_replace('/(\d{10,}):/','"${1}":',$price_content)),true);
                $priceinfo=$price_content['defaultModel']['itemPriceResultDO']['priceInfo'];
                $price=array();
                if(is_array($priceinfo)){
                foreach ($priceinfo as $v){
                        $price[]=$v['price'];
                        if(is_array($v['promotionList'])){
                                foreach ($v['promotionList'] as $v2){
                                        $price[]=$v2['extraPromPrice']?$v2['extraPromPrice']:$v2['price'];
                                }
                        }
                        if(is_array($v['suggestivePromotionList'])){
                                foreach ($v['suggestivePromotionList'] as $v2){
                                        $price[]=$v2['extraPromPrice']?$v2['extraPromPrice']:$v2['price'];
                                }
                        }
                }
                }
                $price=count($price)>0?min($price):false;

        }
        if($price) return $price;
        elseif($time<3) return getPrice($url,$time+1);//如果没有取到价格递归执行,最多执行3次.
        else return false;
}

执行结果:

tail -10 goodsUpdate.log
"2014-03-21 13:45:34",13357,0.273,EQUAL
"2014-03-21 13:45:35",13380,5.883,EQUAL
"2014-03-21 13:45:35",13343,0.914,EQUAL
"2014-03-21 13:45:35",13344,0.923,EQUAL
"2014-03-21 13:45:35",13347,0.927,UPDATE,599.00,181.00
"2014-03-21 13:45:35",13339,0.908,EQUAL
"2014-03-21 13:45:35",13342,0.93,EQUAL
"2014-03-21 13:45:35",13348,0.933,EQUAL
"2014-03-21 13:45:35",13349,0.946,UPDATE,1547.00,1877.00
"2014-03-21 13:45:35",13338,0.947,EQUAL

此方法比只用PHP更新大大节约了时间,更新2万个商品大约是半小时.但是线程数非常不好控制

posted on 2014-03-21 14:44  风飘无痕  阅读(1478)  评论(0编辑  收藏  举报