Quartz任务调度:MisFire策略和源码分析

Quartz是为大家熟知的任务调度框架,先看看官网的介绍:

-------------------------------------------------------------------------------------------------------------------------

What is the Quartz Job Scheduling Library?

Quartz is a richly featured, open source job scheduling library that can be integrated within virtually any Java application - from the smallest stand-alone application to the largest e-commerce system. Quartz can be used to create simple or complex schedules for executing tens, hundreds, or even tens-of-thousands of jobs; jobs whose tasks are defined as standard Java components that may execute virtually anything you may program them to do. The Quartz Scheduler includes many enterprise-class features, such as support for JTA transactions and clustering.

Quartz is freely usable, licensed under the Apache 2.0 license.

-------------------------------------------------------------------------------------------------------------------------

翻译:Quartz是一个功能丰富、开源的任务调度库,它可以集成到几乎任意Java应用中---小到最小的独立应用,大到最大的电子商务系统。Quartz 可以用来创建简单或者复杂的工作计划,同时执行数十、成百、甚至上万的任务。可被定义为标准Java组件的任务,几乎可以执行任意可以编程的任务。Quartz 任务调度包含许多企业级功能特性,比如支持JTA事务和集群。

Quartz可以免费试用,遵循 Apache 2.0 license 许可协议

-------------------------------------------------------------------------------------------------------------------------

公司项目也用的Quartz,最近遇到一些关于Quartz的问题,带着疑问,查阅了部分Quartz源码,与大家分享。

开始是为了研究Quartz的MisFire策略,当任务执行时间过长、服务停机、任务暂停等原因,导致其超过其下次执行的时间点时,就会涉及MisFire(失火,错误任务的触发)处理的策略问题。 Quartz的任务分为SimpleTrigger和CronTrigger,项目中一般使用CronTrigger居多,本文只涉及了CronTrigger的MisFire处理策略(SimpleTrigger的MisFire策略与CronTrigger不同,后续再说)。

MisFire策略常量的定义在类CronTrigger中,列举如下:

  1. MISFIRE_INSTRUCTION_FIRE_ONCE_NOW                 = 1
  2. MISFIRE_INSTRUCTION_DO_NOTHING                       = 2
  3. MISFIRE_INSTRUCTION_SMART_POLICY                    = 0
  4. MISFIRE_INSTRUCTION_IGNORE_MISFIRE_POLICY    = -1

根据JavaDoc介绍和官网文档分析,其对应执行策略如下:

  1. MISFIRE_INSTRUCTION_FIRE_ONCE_NOW:立即执行一次,然后按照Cron定义时间点执行
  2. MISFIRE_INSTRUCTION_DO_NOTHING:什么都不做,等待Cron定义下次任务执行的时间点
  3. MISFIRE_INSTRUCTION_SMART_POLICY:智能的策略,针对不同的Trigger执行不同,CronTrigger时为MISFIRE_INSTRUCTION_FIRE_ONCE_NOW
  4. MISFIRE_INSTRUCTION_IGNORE_MISFIRE_POLICY:将所有错过的执行时间点全都补上,例如,任务15s执行一次,执行的任务错过了4分钟,则执行MisFire时,一次性执行4*(60/15)= 16次任务

但是,我写了例子,实际执行策略1策略2与文档又不太相同,示例任务cron表达式为:0/10 * * * * ?,每10s执行一次。

测试步骤如下:

任务下次执行时间为15:05:10,misFire策略为MISFIRE_INSTRUCTION_FIRE_ONCE_NOW(1)

1.将任务暂停至15:05:35

2.重新启动任务,任务瞬间执行了3次

将misFire策略设置为MISFIRE_INSTRUCTION_DO_NOTHING与上述表现一致。这个实验结果与文档描述不太相符。

于是,翻阅Quartz源码,首先从定时任务本身入手,打断点,找到任务执行工作线程为:WorkerThread对象,工作线程池为:SimpleThreadPool

核心代码如下:

       // WorkerThread.class
// 将任务送入工作线程
    public void run(Runnable newRunnable) { synchronized(lock) { if(runnable != null) { throw new IllegalStateException("Already running a Runnable!"); } runnable = newRunnable; lock.notifyAll(); } }
//循环执行,当有任务送入时执行任务 @Override
public void run() { boolean ran = false; while (run.get()) { try { synchronized(lock) { while (runnable == null && run.get()) { lock.wait(500); } if (runnable != null) { ran = true; runnable.run(); } } } catch (InterruptedException unblock) { // do nothing (loop will terminate if shutdown() was called try { getLog().error("Worker thread was interrupt()'ed.", unblock); } catch(Exception e) { // ignore to help with a tomcat glitch } } catch (Throwable exceptionInRunnable) { try { getLog().error("Error while executing the Runnable: ", exceptionInRunnable); } catch(Exception e) { // ignore to help with a tomcat glitch } } finally { synchronized(lock) { runnable = null; } // repair the thread in case the runnable mucked it up... if(getPriority() != tp.getThreadPriority()) { setPriority(tp.getThreadPriority()); } if (runOnce) { run.set(false); clearFromBusyWorkersList(this); } else if(ran) { ran = false; makeAvailable(this); } } }

可以看到当有任务送入工作线程时,任务将被执行。由此,反向找到线程池代码,代码如下:

    
    // SimpleThreadPool.class
    public boolean runInThread(Runnable runnable) { if (runnable == null) { return false; } synchronized (nextRunnableLock) { handoffPending = true; // Wait until a worker thread is available while ((availWorkers.size() < 1) && !isShutdown) { try { nextRunnableLock.wait(500); } catch (InterruptedException ignore) { } } if (!isShutdown) { WorkerThread wt = (WorkerThread)availWorkers.removeFirst(); busyWorkers.add(wt); wt.run(runnable); } else { // If the thread pool is going down, execute the Runnable // within a new additional worker thread (no thread from the pool). WorkerThread wt = new WorkerThread(this, threadGroup, "WorkerThread-LastJob", prio, isMakeThreadsDaemons(), runnable); busyWorkers.add(wt); workers.add(wt); wt.start(); } nextRunnableLock.notifyAll(); handoffPending = false; } return true; }

可以看到线程池从可用的工作线程队列中取出一个工作线程,将任务送入工作线程(WorkerThread),然后任务会被执行。

由此,反向找到调用方法runInThread的地方,类QuartzSchedulerThread(约398行),QuartzSchedulerThread集成自Thread,又是一个无限循环执行的线程任务,找到类QuartzSchedulerThread.run()方法(由于代码量较大,此处不再全部粘贴),可以看到这个方法干的活大概是:循环找出需要执行的Job,然后送入线程池,再由线程池送入工作线程

列举部分关键代码:

1.找出需要执行的Job的代码
                    try {
              //此处去数据库查询将要执行的任务 triggers
= qsRsrcs.getJobStore().acquireNextTriggers( now + idleWaitTime, Math.min(availThreadCount, qsRsrcs.getMaxBatchSize()), qsRsrcs.getBatchTimeWindow()); acquiresFailed = 0; if (log.isDebugEnabled()) log.debug("batch acquisition of " + (triggers == null ? 0 : triggers.size()) + " triggers"); } catch (JobPersistenceException jpe) { if (acquiresFailed == 0) { qs.notifySchedulerListenersError( "An error occurred while scanning for the next triggers to fire.", jpe); } if (acquiresFailed < Integer.MAX_VALUE) acquiresFailed++; continue; } catch (RuntimeException e) { if (acquiresFailed == 0) { getLog().error("quartzSchedulerThreadLoop: RuntimeException " +e.getMessage(), e); } if (acquiresFailed < Integer.MAX_VALUE) acquiresFailed++; continue; }

关键点在注释处的代码,方法:acquireNextTriggers,继续debug跟进该方法,找到查询SQL,代码如下:

   // StdJDBCDelegate.class
public List<TriggerKey> selectTriggerToAcquire(Connection conn, long noLaterThan, long noEarlierThan, int maxCount) throws SQLException { PreparedStatement ps = null; ResultSet rs = null; List<TriggerKey> nextTriggers = new LinkedList<TriggerKey>(); try { ps = conn.prepareStatement(rtp(SELECT_NEXT_TRIGGER_TO_ACQUIRE)); // Set max rows to retrieve if (maxCount < 1) maxCount = 1; // we want at least one trigger back. ps.setMaxRows(maxCount); // Try to give jdbc driver a hint to hopefully not pull over more than the few rows we actually need. // Note: in some jdbc drivers, such as MySQL, you must set maxRows before fetchSize, or you get exception! ps.setFetchSize(maxCount); ps.setString(1, STATE_WAITING); ps.setBigDecimal(2, new BigDecimal(String.valueOf(noLaterThan))); ps.setBigDecimal(3, new BigDecimal(String.valueOf(noEarlierThan))); rs = ps.executeQuery(); while (rs.next() && nextTriggers.size() <= maxCount) { nextTriggers.add(triggerKey( rs.getString(COL_TRIGGER_NAME), rs.getString(COL_TRIGGER_GROUP))); } return nextTriggers; } finally { closeResultSet(rs); closeStatement(ps); } }

根据debug时实时参数,处理过的SQL为:

SELECT
    TRIGGER_NAME,
    TRIGGER_GROUP,
    NEXT_FIRE_TIME,
    PRIORITY
FROM
    qrtz_TRIGGERS
WHERE
    SCHED_NAME = 'schedulerFactoryBean'
AND TRIGGER_STATE = 'WAITING'
AND NEXT_FIRE_TIME <= (now + idleWaitTime)
AND (
    MISFIRE_INSTR = -1
    OR (
        MISFIRE_INSTR != -1
        AND NEXT_FIRE_TIME >= (now - misfireThreshold)
    )
)
ORDER BY  NEXT_FIRE_TIME ASC, PRIORITY DESC

其中:now为系统当前时间,idleWaitTime为系统线程闲置时间,默认取值为30s,misfireThreshold为配置参数,意思为系统能容忍的misFire的最大阀值,默认为60s(当前系统配置也是60s,之前一直不知道这个值什么意思)。从SQL中看得很清楚了,这个SQL语句是要查询出:未来30s内将要执行的任务,且MISFIRE_INSTR为-1(MISFIRE_INSTRUCTION_IGNORE_MISFIRE_POLICY),或者MISFIRE_INSTR不为-1,但是,NEXT_FIRE_TIME错过的执行时间不能超过阀值60s。至此问题搞清楚了,影响misFire执行策略的另一个参数就是misfireThreshold,配置文件quartz.properties中,对应org.quartz.jobStore.misfireThreshold: 60000,单位毫秒。也就是说:如果【错过时间】不超过60s都不算是misFire,不执行misFire策略,依次执行错过的任务时间点;【错过时间】超过60s按misFire策略执行。

根据上述结论重新进行试验,将任务暂停时间超过60s,这次试验结果与文档描述一致。

 

另外,跟踪启动任务的代码,找到处理misFire的方法,代码位置:org.quartz.impl.triggers.CronTriggerImpl.updateAfterMisfire(Calendar)

    @Override
    public void updateAfterMisfire(org.quartz.Calendar cal) {
        int instr = getMisfireInstruction();

        if(instr == Trigger.MISFIRE_INSTRUCTION_IGNORE_MISFIRE_POLICY)
            return;

        if (instr == MISFIRE_INSTRUCTION_SMART_POLICY) {
            instr = MISFIRE_INSTRUCTION_FIRE_ONCE_NOW;
        }

        if (instr == MISFIRE_INSTRUCTION_DO_NOTHING) {
            Date newFireTime = getFireTimeAfter(new Date());
            while (newFireTime != null && cal != null
                    && !cal.isTimeIncluded(newFireTime.getTime())) {
                newFireTime = getFireTimeAfter(newFireTime);
            }
            setNextFireTime(newFireTime);
        } else if (instr == MISFIRE_INSTRUCTION_FIRE_ONCE_NOW) {
            setNextFireTime(new Date());
        }
    }

可以清楚看到,misFire的执行逻辑。

 

在翻阅源码的同时,对之前比较疑惑的几个问题也做了研究,比如:Quartz的任务执行机制如何实现等等问题,都可以轻松通过翻阅源码找到答案,有兴趣的 童鞋 可以自己去翻阅下代码。 

其实,针对这个问题,上网也可以查询问题的原因,但是,个人感觉由翻阅源码找到问题原因,对问题理解的更透彻,同时也能了解下Quartz的实现逻辑。鼓励大家遇到问题,去翻阅框架的源码,其实没有想象中的那么复杂。

(以上如有错误,还请指正,欢迎留言评论) 

 

posted on 2019-07-25 18:12  一叶落知天下秋  阅读(5231)  评论(4编辑  收藏  举报