【laravel5】详解laravel的chunk和chunkById函数

1、laravel的chunk和chunkById主要处理比较大的数据，通过分块来处理。

优缺点：

1）chunk的话，底层原理是通过分页page参数处理，update的时候回存在漏一半数据情况（并且MySQL的分页在数据量大时，严重影响查询效率）

2）因为2，出现了chunkById的优化版本，通过主键priKey（或者其他column，下面会讲）来处理数据，完美的避过update和分页效率低下问题

3、参考文档：

https://www.lqwang.net/13.html

https://segmentfault.com/a/1190000015284897

https://learnku.com/articles/15655/the-use-of-offset-limit-or-skip-and-take-in-laravel

4、本文具体讲解chunkById使用和底层运行原理，chunk不讲。先上干货代码：

<?php
namespace App\Console\Commands;

use Illuminate\Console\Command;
use App\Model\OrderReturnTrace;
use App\Dao\UpPayOrderDao;
use App\Model\OrderPayTrace;
use App\Model\OrderTrace;

class OmsOrderDeliveryQuery extends Command
{
    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'oms:odquery {order_id?}';
    
    protected $order_id;

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = '待发货单物流查询及更新';

    /**
     * Create a new command instance.
     *
     * @return void
     */
    public function __construct()
    {
        parent::__construct();
    }

    /**
     * Execute the console command.
     *
     * @return mixed
     */
    public function handle()
    {
        //
        $this->order_id = $this->argument('order_id') ?? '';
        try{
            if(!empty($this->order_id)) {
                $orderM = OrderTrace::where(['order_id'=>$this->order_id, 'pay_status'=>2, 'order_status'=>1, 'delivery_status'=>1])->first();
                if(empty($orderM)) return ;

                if(! $res = UpPayOrderDao::omsDeliveryQuery($orderM->order_id) ){
                    return false;
                }
                # 更新物流信息
                UpPayOrderDao::orderDeliveryRelease($this->order_id, $res);
                return true;
            } else {
                # 执行oms物流查询
                $echoCsv = function($item){
                    if(! $res = UpPayOrderDao::omsDeliveryQuery($item->order_id) ){
                        log_write("定时任务退款单refundId{$item->refund_id}状态查询有误");
                        return false;
                    }
                    # 更新物流信息
                    UpPayOrderDao::orderDeliveryRelease($item->order_id, $res);
                };
                # 批量查询字段涉及update，使用chunkById($count, $callback, $coulumn, $alias)处理,避免chunk漏数据
                $refundAll = OrderTrace::where('pay_status', 2)
                        ->where('order_type', 0)
                        ->where('order_status', 1)
                        ->where('pay_channel', 2)
                        ->where('delivery_status', 1)
                        ->select('id','order_id','');
                $refundAll->chunkById(200, function($refundAll) {
                    foreach ($refundAll as $item) {
                        $echoCsv($item);
                    }
                }, 'id');
            }
        }catch(\Exception $e) {
            log_write("定时任务refundOrderUnionPayQuery，异常：msg={$e->getMessage()}");
            UpPayOrderDao::sendEmail('定时任务refundOrderUnionPayQuery，异常', $e->getMessage());
            return false;
        }
        return ;
    }
}

5、chunkById的底层原理&运行：

（或者其他column，下面会讲）

参考这篇文章，讲得很清楚的：https://www.lqwang.net/13.html

Laravel chunk和chunkById的坑
公司中的项目在逐渐的向Laravel框架进行迁移。在编写定时任务脚本的时候，用到了chunk和chunkById的API，记录一下踩到的坑。

一、前言
数据库引擎为innodb。

表结构简述，只列出了本文用到的字段。

字段    类型    注释
id    int(11)    ID
type    int(11)    类型
mark_time    int(10)    标注时间（时间戳）
索引，也只列出需要的部分。

索引名    字段
PRIMARY    id
idx_sid_blogdel_marktime    type
blog_del
mark_time
Idx_marktime    mark_time
二、需求
每天凌晨一点取出昨天标注type为99的所有数据，进行逻辑判断，然后进行其他操作。本文的重点只在于取数据的阶段。

数据按月分表，每个月表中的数据为1000w上下。

三、chunk处理数据
代码如下
$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunk(1000, function ($rows){
    // 业务处理
});
从一个月中的数据，筛选出type为99，并且标注时间在某天的00:00:00-23:59:59的数据。可以使用到mark_time和type的索引。

type为99，一天的数据大概在15-25w上下的样子。使用->get()->toArray()内存会直接炸掉。所以使用chunk方法，每次取出1000条数据。

使用chucnk，不会出现内存不够的情况。但是性能较差。粗略估计，从一月数据中取出最后一天的数据，跑完20w数据大概需要一两分钟。

查看源码，底层的chunk方法，是为sql语句添加了限制和偏移量。

1
select * from `users` asc limit 500 offset 500;
在数据较多的时候，越往后的话效率会越慢，因为Mysql的limit方法底层是这样的。

limit 10000，10

是扫描满足条件的10010行，然后扔掉前面的10000行，返回最后最后20行。在数据较多的时候，性能会非常差。

查了下API，对于这种情况Laraverl提供了另一个API chunkById。

四、chunkById 原理
使用limit和偏移量在处理大量的数据会有性能的明显下降。于是chunkById使用了id进行分页处理。很好理解，代码如下：

1
select * from `users` where `id` > :last_id order by `id` asc limit 500;
API会自动保存最后一个ID，然后通过id > :last_id 再加上limit就可以通过主键索引进行分页。只取出来需要的行数。性能会有明显的提升。

五、chunkById的坑
API显示chunk和chunkById的用法完全相同。于是把脚本的代码换成了chunkById。
$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunkById(1000, function ($rows){
    // 业务处理
});
在执行脚本的时候，1月2号和1月1号的数据没有任何问题。执行速度快了很多。但是在执行12月31号的数据的时候，发现脚本一直执行不完。

在定位后发现是脚本没有进入业务处理的部分，也就是sql一直没有执行完。当时很疑惑，因为刚才执行的没问题，为什么执行12月31号的就出问题了呢。

于是查看sql服务器中的执行情况。

1
show full processlist;
发现了问题。上节说了chunkById的底层是通过id进行order by，然后limie取出一部分一部分的数据，也就是我们预想的sql是这样的。

1
select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date limit 500;
explain出来的情况如下：

select_type    type    key    rows    Extra
SIMPLE    Range    idx_marktime    2370258    Using index condition; Using where
实际上的sql是这样的：

1
select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by id limit 500;
实际explain出来的情况是这样的：

select_type    type    key    rows    Extra
SIMPLE    Index    PRIMARY    4379    Using where
chunkById会自动添加order by id。innodb一定会使用主键索引。那么就不会再使用mark_time的索引了。导致sql执行效率及其缓慢。

六、解决方法
再次查看chunkById的源码。
/**
  * Chunk the results of a query by comparing numeric IDs.
  *
  * @param  int  $count
  * @param  callable  $callback
  * @param  string|null  $column
  * @param  string|null  $alias
  * @return bool
  */
  public function chunkById($count, callable $callback, $column = null, $alias = null)
  {
      $column = is_null($column) ? $this->getModel()->getKeyName() : $column;

      $alias = is_null($alias) ? $column : $alias;

      $lastId = null;

      do {
        $clone = clone $this;

        // We'll execute the query for the given page and get the results. If there are
        // no results we can just break and return from here. When there are results
        // we will call the callback with the current chunk of these results here.
        $results = $clone->forPageAfterId($count, $lastId, $column)->get();

        $countResults = $results->count();

        if ($countResults == 0) {
          break;
        }

        // On each chunk result set, we will pass them to the callback and then let the
        // developer take care of everything within the callback, which allows us to
        // keep the memory low for spinning through large result sets for working.
        if ($callback($results) === false) {
          return false;
        }

        $lastId = $results->last()->{$alias};

      unset($results);
     } while ($countResults == $count);

     return true;
  }
能看到这个方法有四个参数count，callback，column，alias。

默认的column为null，第一行会进行默认赋值。

$column = is_null($column) ? $this->getModel()->getKeyName() : $column;
往下跟:
/**
  * Get the primary key for the model.
  *
  * @return string
  */
  public function getKeyName()
  {
      return $this->primaryKey;
     }
     
/**
  * The primary key for the model.
  *
  * @var string
  */
  protected $primaryKey = 'id';
能看到默认的column为id。

进入forPageAfterId方法。
/**
  * Constrain the query to the next "page" of results after a given ID.
  *
  * @param  int  $perPage
  * @param  int|null  $lastId
  * @param  string  $column
  * @return \Illuminate\Database\Query\Builder|static
  */
  public function forPageAfterId($perPage = 15, $lastId = 0, $column = 'id')
  {
      $this->orders = $this->removeExistingOrdersFor($column);

      if (! is_null($lastId)) {
          $this->where($column, '>', $lastId);
      }

      return $this->orderBy($column, 'asc')
      ->take($perPage);//take取多少条
  }
能看到如果lastId不为0则自动添加where语句，还会自动添加order by column。

看到这里就明白了。上文的chunkById没有添加column参数，所以底层自动添加了order by id。走了主键索引，没有使用上mark_time的索引。导致查询效率非常低。

chunkById的源码显示了我们可以传递一个column字段来让底层使用这个字段来order by。

代码修改如下：
$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunkById(1000, function ($rows){
    // 业务处理
}, 'mark_time');
这样最后执行的sql如下：

1
select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by mark_time limit 500;
再次执行脚本，大概执行一次也就十秒作用了，性能提升显著。

七、总结
chunk和chunkById的区别就是chunk是单纯的通过偏移量来获取数据，chunkById进行了优化，不实用偏移量，使用id过滤，性能提升巨大。在数据量大的时候，性能可以差到几十倍的样子。

而且使用chunk在更新的时候，也会遇到数据会被跳过的问题。详见解决Laravel中chunk方法分块处理数据的坑

同时chunkById在你没有传递column参数时，会默认添加order by id。可能会遇到索引失效的问题。解决办法就是传递column参数即可。

本人感觉chunkById不光是根据Id分块，而是可以根据某一字段进行分块，这个字段是可以指定的。叫chunkById有一些误导性，chunkByColumn可能更容易理解。算是自己提的小小的建议。

本文标题：Laravel chunk和chunkById的坑
本文作者：旺阳
本文链接：https://www.lqwang.net/13.html
发布时间：2020-01-06
版权声明：本博客所有文章除特别声明外，均采用 CC BY-NC-SA 4.0 许可协议。转载请注明出处！

posted @ 2021-01-27 15:42 PHP急先锋阅读(3924) 评论(0) 编辑收藏举报

刷新页面返回顶部

PHP急先锋

【laravel5】详解laravel的chunk和chunkById函数

公告