Continuing from the previous post. Hadoop schedules the auxiliary tasks first (the job-setup, task-cleanup and job-cleanup tasks), and that is handled by the JobTracker itself; the compute tasks, on the other hand, are assigned by the task scheduler (TaskScheduler), whose default implementation is JobQueueTaskScheduler. The actual assignment happens in its assignTasks() method, which we will walk through piece by piece below.
  public synchronized List<Task> assignTasks(TaskTracker taskTracker)
      throws IOException {
    // Check for JT safe-mode
    if (taskTrackerManager.isInSafeMode()) {
      LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
      return null;
    }

    TaskTrackerStatus taskTrackerStatus = taskTracker.getStatus();
    ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
    final int numTaskTrackers = clusterStatus.getTaskTrackers();
    final int clusterMapCapacity = clusterStatus.getMaxMapTasks();
    final int clusterReduceCapacity = clusterStatus.getMaxReduceTasks();

    Collection<JobInProgress> jobQueue =
      jobQueueJobInProgressListener.getJobQueue();
The method first checks whether the JobTracker is in safe mode; it then fetches the status of this TaskTracker, the cluster status, the number of TaskTrackers in the cluster, and the cluster-wide maximum numbers of map and reduce tasks; finally it takes the job queue, whose jobs will be scheduled below.
    //
    // Get map + reduce counts for the current tracker.
    //
    final int trackerMapCapacity = taskTrackerStatus.getMaxMapSlots();
    final int trackerReduceCapacity = taskTrackerStatus.getMaxReduceSlots();
    final int trackerRunningMaps = taskTrackerStatus.countMapTasks();
    final int trackerRunningReduces = taskTrackerStatus.countReduceTasks();

    // Assigned tasks
    List<Task> assignedTasks = new ArrayList<Task>();
The first four lines fetch this TaskTracker's map and reduce slot counts and the numbers of map and reduce tasks it is currently running; the list on the last line collects the tasks that will be assigned to this TaskTracker.
    //
    // Compute (running + pending) map and reduce task numbers across pool
    //
    int remainingReduceLoad = 0;
    int remainingMapLoad = 0;
    synchronized (jobQueue) {
      for (JobInProgress job : jobQueue) {
        if (job.getStatus().getRunState() == JobStatus.RUNNING) {
          remainingMapLoad += (job.desiredMaps() - job.finishedMaps());
          if (job.scheduleReduces()) {
            remainingReduceLoad +=
              (job.desiredReduces() - job.finishedReduces());
          }
        }
      }
    }
This block computes how many map and reduce tasks in the job queue still have to run. job.desiredMaps() returns the total number of map tasks of a job and job.finishedMaps() returns how many of them have already completed; job.desiredReduces() and job.finishedReduces() do the same for reduce tasks. Note that a job's reduces are only counted once job.scheduleReduces() says they may be scheduled.
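As an aside, job.scheduleReduces() is essentially the reduce "slow-start" gate: a job's reduces only start counting here (and only become schedulable later) once enough of its maps have finished. A minimal sketch of what it checks, written from memory rather than copied from the source (the field names are not guaranteed to match JobInProgress exactly):

    // Sketch only: finishedMapTasks / completedMapsForReduceSlowstart are
    // quoted from memory, not verbatim from the Hadoop 1.x source.
    public synchronized boolean scheduleReduces() {
      return finishedMapTasks >= completedMapsForReduceSlowstart;
    }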
    // Compute the 'load factor' for maps and reduces
    double mapLoadFactor = 0.0;
    if (clusterMapCapacity > 0) {
      mapLoadFactor = (double)remainingMapLoad / clusterMapCapacity;
    }
    double reduceLoadFactor = 0.0;
    if (clusterReduceCapacity > 0) {
      reduceLoadFactor = (double)remainingReduceLoad / clusterReduceCapacity;
    }
These lines compute the map and reduce "load factors": the ratio of the map tasks that still need to run to the total map slot capacity of the cluster (and likewise for reduce). The load factor is used below to cap how many of a single TaskTracker's map (reduce) slots the scheduler is willing to fill, so that each tracker carries roughly the cluster-wide load ratio rather than being packed full.
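To make the numbers concrete, here is a small made-up example (all values hypothetical) of how the load factor caps a tracker's map capacity a few lines further down:

    // Suppose the cluster has 100 map slots and the running jobs still need
    // 60 map tasks, so mapLoadFactor = 60 / 100 = 0.6.
    double mapLoadFactor = 60.0 / 100.0;                      // 0.6
    // A TaskTracker with 4 map slots is then allowed at most
    // min(ceil(0.6 * 4), 4) = 3 concurrently running map tasks,
    // which is exactly the trackerCurrentMapCapacity computed below.
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * 4), 4);      // 3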
    //
    // In the below steps, we allocate first map tasks (if appropriate),
    // and then reduce tasks if appropriate. We go through all jobs
    // in order of job arrival; jobs only get serviced if their
    // predecessors are serviced, too.
    //

    //
    // We assign tasks to the current taskTracker if the given machine
    // has a workload that's less than the maximum load of that kind of
    // task.
    // However, if the cluster is close to getting loaded i.e. we don't
    // have enough _padding_ for speculative executions etc., we only
    // schedule the "highest priority" task i.e. the task from the job
    // with the highest priority.
    //

    final int trackerCurrentMapCapacity =
      Math.min((int)Math.ceil(mapLoadFactor * trackerMapCapacity),
               trackerMapCapacity);
    int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;
    boolean exceededMapPadding = false;
    if (availableMapSlots > 0) {
      exceededMapPadding =
        exceededPadding(true, clusterStatus, trackerMapCapacity);
    }
The first statement uses the map load factor computed above to derive how many map tasks this node is currently allowed to run; the second computes availableMapSlots, the number of slots still free for map tasks. If availableMapSlots is greater than zero there is room to run more map tasks. Hadoop does not hand out every last slot, though: it keeps some slack for failed tasks and for speculative execution, and exceededPadding() is the check that decides whether the cluster is already too close to that limit.
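The body of exceededPadding() is not reproduced in this post, so here is only a rough sketch of the idea, with simplified control flow and constant names such as padFraction quoted from memory; treat it as an illustration of the padding check, not the actual implementation:

    // Rough sketch, NOT the verbatim Hadoop source.
    // Idea: walk the job queue, accumulate how many tasks the jobs need,
    // reserve roughly padFraction of that (capped at one tracker's slot
    // count) as head-room, and report whether the cluster is already so
    // full that this head-room would be eaten up.
    private boolean exceededPadding(boolean isMapTask,
                                    ClusterStatus clusterStatus,
                                    int maxTaskTrackerSlots) {
      int totalTasks    = isMapTask ? clusterStatus.getMapTasks()
                                    : clusterStatus.getReduceTasks();
      int totalCapacity = isMapTask ? clusterStatus.getMaxMapTasks()
                                    : clusterStatus.getMaxReduceTasks();
      int neededTasks = 0;
      for (JobInProgress job : jobQueueJobInProgressListener.getJobQueue()) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING) {
          continue;
        }
        neededTasks += isMapTask ? job.desiredMaps() : job.desiredReduces();
        int padding = Math.min(maxTaskTrackerSlots,
                               (int) (neededTasks * padFraction));
        if (totalCapacity < totalTasks + padding) {
          return true;  // too close to full: only the top job keeps getting tasks
        }
      }
      return false;
    }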
    int numLocalMaps = 0;
    int numNonLocalMaps = 0;
    scheduleMaps:
    for (int i=0; i < availableMapSlots; ++i) {
      synchronized (jobQueue) {
        for (JobInProgress job : jobQueue) {
          if (job.getStatus().getRunState() != JobStatus.RUNNING) {
            continue;
          }

          Task t = null;

          // Try to schedule a Map task with locality between node-local
          // and rack-local
          t =
            job.obtainNewNodeOrRackLocalMapTask(taskTrackerStatus,
                numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
          if (t != null) {
            assignedTasks.add(t);
            ++numLocalMaps;

            // Don't assign map tasks to the hilt!
            // Leave some free slots in the cluster for future task-failures,
            // speculative tasks etc. beyond the highest priority job
            if (exceededMapPadding) {
              break scheduleMaps;
            }

            // Try all jobs again for the next Map task
            break;
          }

          // Try to schedule a node-local or rack-local Map task
          t =
            job.obtainNewNonLocalMapTask(taskTrackerStatus, numTaskTrackers,
                taskTrackerManager.getNumberOfUniqueHosts());

          if (t != null) {
            assignedTasks.add(t);
            ++numNonLocalMaps;

            // We assign at most 1 off-switch or speculative task
            // This is to prevent TaskTrackers from stealing local-tasks
            // from other TaskTrackers.
            break scheduleMaps;
          }
        }
      }
    }
    int assignedMaps = assignedTasks.size();
The code above is the map task assignment loop. obtainNewNodeOrRackLocalMapTask() assigns a node-local or rack-local task, while obtainNewNonLocalMapTask() assigns a non-local one (in my opinion the comment in the Hadoop source right above the obtainNewNonLocalMapTask() call, "Try to schedule a node-local or rack-local Map task", around line 195 of the original file, is wrong: it describes the opposite kind of task). Both methods ultimately call findNewMapTask() to pick a task; the difference is the cache level they pass in. obtainNewNodeOrRackLocalMapTask() passes maxLevel, meaning node-local/rack-local tasks may be scheduled, while obtainNewNonLocalMapTask() passes NON_LOCAL_CACHE_LEVEL, meaning only off-switch/speculative tasks may be scheduled. The widest level, anyCacheLevel, means any task (node-local, rack-local, off-switch or speculative) may be assigned.
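To see where those levels come from, here is a sketch of the two wrappers as I remember them from JobInProgress in Hadoop 1.x (the shared helper name obtainNewMapTaskCommon and the exact constant values are quoted from memory, so check them against the source):

    // NON_LOCAL_CACHE_LEVEL is -1 ("skip the locality caches entirely"),
    // maxLevel is the deepest topology level that has a cache (host, rack, ...),
    // and anyCacheLevel is maxLevel + 1 ("anything goes, including off-switch").

    public synchronized Task obtainNewNodeOrRackLocalMapTask(
        TaskTrackerStatus tts, int clusterSize, int numUniqueHosts)
        throws IOException {
      // only node-local / rack-local candidates are considered
      return obtainNewMapTaskCommon(tts, clusterSize, numUniqueHosts, maxLevel);
    }

    public synchronized Task obtainNewNonLocalMapTask(
        TaskTrackerStatus tts, int clusterSize, int numUniqueHosts)
        throws IOException {
      // only off-switch / speculative candidates are considered
      return obtainNewMapTaskCommon(tts, clusterSize, numUniqueHosts,
                                    NON_LOCAL_CACHE_LEVEL);
    }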
/**
 * Find new map task
 * @param tts The task tracker that is asking for a task
 * @param clusterSize The number of task trackers in the cluster
 * @param numUniqueHosts The number of hosts that run task trackers
 * @param avgProgress The average progress of this kind of task in this job
 * @param maxCacheLevel The maximum topology level until which to schedule
 *                      maps.
 *                      A value of {@link #anyCacheLevel} implies any
 *                      available task (node-local, rack-local, off-switch and
 *                      speculative tasks).
 *                      A value of {@link #NON_LOCAL_CACHE_LEVEL} implies only
 *                      off-switch/speculative tasks should be scheduled.
 * @return the index in tasks of the selected task (or -1 for no task)
 */
private synchronized int findNewMapTask(final TaskTrackerStatus tts,
                                        final int clusterSize,
                                        final int numUniqueHosts,
                                        final int maxCacheLevel,
                                        final double avgProgress) {
  if (numMapTasks == 0) {
    if(LOG.isDebugEnabled()) {
      LOG.debug("No maps to schedule for " + profile.getJobID());
    }
    return -1;
  }

  String taskTracker = tts.getTrackerName();
  TaskInProgress tip = null;

  //
  // Update the last-known clusterSize
  //
  this.clusterSize = clusterSize;

  if (!shouldRunOnTaskTracker(taskTracker)) {
    return -1;
  }

  // Check to ensure this TaskTracker has enough resources to
  // run tasks from this job
  long outSize = resourceEstimator.getEstimatedMapOutputSize();
  long availSpace = tts.getResourceStatus().getAvailableSpace();
  if(availSpace < outSize) {
    LOG.warn("No room for map task. Node " + tts.getHost() +
             " has " + availSpace +
             " bytes free; but we expect map to take " + outSize);

    return -1; //see if a different TIP might work better.
  }


  // When scheduling a map task:
  //  0) Schedule a failed task without considering locality
  //  1) Schedule non-running tasks
  //  2) Schedule speculative tasks
  //  3) Schedule tasks with no location information

  // First a look up is done on the non-running cache and on a miss, a look
  // up is done on the running cache. The order for lookup within the cache:
  //   1. from local node to root [bottom up]
  //   2. breadth wise for all the parent nodes at max level
  // We fall to linear scan of the list ((3) above) if we have misses in the
  // above caches

  // 0) Schedule the task with the most failures, unless failure was on this
  //    machine
  tip = findTaskFromList(failedMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);
    LOG.info("Choosing a failed task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  Node node = jobtracker.getNode(tts.getHost());

  //
  // 1) Non-running TIP :
  //

  // 1. check from local node to the root [bottom up cache lookup]
  //    i.e if the cache is available and the host has been resolved
  //    (node!=null)
  if (node != null) {
    Node key = node;
    int level = 0;
    // maxCacheLevel might be greater than this.maxLevel if findNewMapTask is
    // called to schedule any task (local, rack-local, off-switch or
    // speculative) tasks or it might be NON_LOCAL_CACHE_LEVEL (i.e. -1) if
    // findNewMapTask is to only schedule off-switch/speculative tasks
    int maxLevelToSchedule = Math.min(maxCacheLevel, maxLevel);
    for (level = 0;level < maxLevelToSchedule; ++level) {
      List <TaskInProgress> cacheForLevel = nonRunningMapCache.get(key);
      if (cacheForLevel != null) {
        tip = findTaskFromList(cacheForLevel, tts,
            numUniqueHosts,level == 0);
        if (tip != null) {
          // Add to running cache
          scheduleMap(tip);

          // remove the cache if its empty
          if (cacheForLevel.size() == 0) {
            nonRunningMapCache.remove(key);
          }

          return tip.getIdWithinJob();
        }
      }
      key = key.getParent();
    }

    // Check if we need to only schedule a local task (node-local/rack-local)
    if (level == maxCacheLevel) {
      return -1;
    }
  }

  //2. Search breadth-wise across parents at max level for non-running
  //   TIP if
  //     - cache exists and there is a cache miss
  //     - node information for the tracker is missing (tracker's topology
  //       info not obtained yet)

  // collection of node at max level in the cache structure
  Collection<Node> nodesAtMaxLevel = jobtracker.getNodesAtMaxLevel();

  // get the node parent at max level
  Node nodeParentAtMaxLevel =
    (node == null) ? null : JobTracker.getParentNode(node, maxLevel - 1);

  for (Node parent : nodesAtMaxLevel) {

    // skip the parent that has already been scanned
    if (parent == nodeParentAtMaxLevel) {
      continue;
    }

    List<TaskInProgress> cache = nonRunningMapCache.get(parent);
    if (cache != null) {
      tip = findTaskFromList(cache, tts, numUniqueHosts, false);
      if (tip != null) {
        // Add to the running cache
        scheduleMap(tip);

        // remove the cache if empty
        if (cache.size() == 0) {
          nonRunningMapCache.remove(parent);
        }
        LOG.info("Choosing a non-local task " + tip.getTIPId());
        return tip.getIdWithinJob();
      }
    }
  }

  // 3. Search non-local tips for a new task
  tip = findTaskFromList(nonLocalMaps, tts, numUniqueHosts, false);
  if (tip != null) {
    // Add to the running list
    scheduleMap(tip);

    LOG.info("Choosing a non-local task " + tip.getTIPId());
    return tip.getIdWithinJob();
  }

  //
  // 2) Running TIP :
  //

  if (hasSpeculativeMaps) {
    long currentTime = jobtracker.getClock().getTime();

    // 1. Check bottom up for speculative tasks from the running cache
    if (node != null) {
      Node key = node;
      for (int level = 0; level < maxLevel; ++level) {
        Set<TaskInProgress> cacheForLevel = runningMapCache.get(key);
        if (cacheForLevel != null) {
          tip = findSpeculativeTask(cacheForLevel, tts,
              avgProgress, currentTime, level == 0);
          if (tip != null) {
            if (cacheForLevel.size() == 0) {
              runningMapCache.remove(key);
            }
            return tip.getIdWithinJob();
          }
        }
        key = key.getParent();
      }
    }

    // 2. Check breadth-wise for speculative tasks
    for (Node parent : nodesAtMaxLevel) {
      // ignore the parent which is already scanned
      if (parent == nodeParentAtMaxLevel) {
        continue;
      }

      Set<TaskInProgress> cache = runningMapCache.get(parent);
      if (cache != null) {
        tip = findSpeculativeTask(cache, tts, avgProgress,
            currentTime, false);
        if (tip != null) {
          // remove empty cache entries
          if (cache.size() == 0) {
            runningMapCache.remove(parent);
          }
          LOG.info("Choosing a non-local task " + tip.getTIPId()
                   + " for speculation");
          return tip.getIdWithinJob();
        }
      }
    }

    // 3. Check non-local tips for speculation
    tip = findSpeculativeTask(nonLocalRunningMaps, tts, avgProgress,
        currentTime, false);
    if (tip != null) {
      LOG.info("Choosing a non-local task " + tip.getTIPId()
               + " for speculation");
      return tip.getIdWithinJob();
    }
  }

  return -1;
}
A quick digression on findNewMapTask(), since it is the method that actually picks the task. Its order of preference is:
1) a failed task from failedMaps, scheduled without regard to locality;
2) a task from nonRunningMapCache, chosen with locality in mind, preferring node-local, then rack-local, then off-switch (how locality is realized is explained later in this post);
3) a task from nonLocalMaps;
4) a task from runningMapCache, for which a speculative (backup) attempt is started;
5) a task from nonLocalRunningMaps, for which a speculative (backup) attempt is started.
Finally, if findNewMapTask() returns -1, no suitable map task was found; otherwise the return value is the index of the chosen map task in the JobInProgress maps[] array.
    //
    // Same thing, but for reduce tasks
    // However we _never_ assign more than 1 reduce task per heartbeat
    //
    final int trackerCurrentReduceCapacity =
      Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
               trackerReduceCapacity);
    final int availableReduceSlots =
      Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
    boolean exceededReducePadding = false;
    if (availableReduceSlots > 0) {
      exceededReducePadding = exceededPadding(false, clusterStatus,
                                              trackerReduceCapacity);
As on the map side, this computes the tracker's current reduce capacity and checks whether enough slots are being kept in reserve for failed and speculative reduce tasks. Note that availableReduceSlots is capped at 1: at most one reduce task is assigned per heartbeat.
      synchronized (jobQueue) {
        for (JobInProgress job : jobQueue) {
          if (job.getStatus().getRunState() != JobStatus.RUNNING ||
              job.numReduceTasks == 0) {
            continue;
          }

          Task t =
            job.obtainNewReduceTask(taskTrackerStatus, numTaskTrackers,
                                    taskTrackerManager.getNumberOfUniqueHosts()
                                    );
          if (t != null) {
            assignedTasks.add(t);
            break;
          }

          // Don't assign reduce tasks to the hilt!
          // Leave some free slots in the cluster for future task-failures,
          // speculative tasks etc. beyond the highest priority job
          if (exceededReducePadding) {
            break;
          }
        }
      }
    }
This part assigns the reduce task. Note that unlike the map-side code, which uses a double loop (over the available slots and over the jobs), the reduce side uses a single loop over the jobs, because at most one reduce task is assigned per call. The reduce task is picked in the following order:
1) a task from nonRunningReduces;
2) a task from runningReduces, for which a speculative attempt is started.
Finally, if findNewReduceTask() (called inside obtainNewReduceTask()) returns -1, no suitable reduce task was found; otherwise the return value is the index of the chosen reduce task in the JobInProgress reduces[] array.
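findNewReduceTask() is not quoted in this post; the following is only a condensed sketch of its structure written from memory (the resource check is omitted), just to show the two-step order described above:

    // Condensed sketch, not the verbatim Hadoop source.
    private synchronized int findNewReduceTask(TaskTrackerStatus tts,
                                               int clusterSize,
                                               int numUniqueHosts,
                                               double avgProgress) {
      if (numReduceTasks == 0 || !shouldRunOnTaskTracker(tts.getTrackerName())) {
        return -1;
      }

      // 1. a reduce that has never run yet (reduces have no locality cache)
      TaskInProgress tip =
          findTaskFromList(nonRunningReduces, tts, numUniqueHosts, false);
      if (tip != null) {
        scheduleReduce(tip);
        return tip.getIdWithinJob();
      }

      // 2. otherwise, maybe speculate on a running reduce
      if (hasSpeculativeReduces) {
        tip = findSpeculativeTask(runningReduces, tts, avgProgress,
                                  jobtracker.getClock().getTime(), false);
        if (tip != null) {
          scheduleReduce(tip);
          return tip.getIdWithinJob();
        }
      }
      return -1;
    }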
    if (LOG.isDebugEnabled()) {
      LOG.debug("Task assignments for " + taskTrackerStatus.getTrackerName() + " --> " +
                "[" + mapLoadFactor + ", " + trackerMapCapacity + ", " +
                trackerCurrentMapCapacity + ", " + trackerRunningMaps + "] -> [" +
                (trackerCurrentMapCapacity - trackerRunningMaps) + ", " +
                assignedMaps + " (" + numLocalMaps + ", " + numNonLocalMaps +
                ")] [" + reduceLoadFactor + ", " + trackerReduceCapacity + ", " +
                trackerCurrentReduceCapacity + "," + trackerRunningReduces +
                "] -> [" + (trackerCurrentReduceCapacity - trackerRunningReduces) +
                ", " + (assignedTasks.size()-assignedMaps) + "]");
    }

    return assignedTasks;
  }
Finally the list of tasks assigned to this TaskTracker is returned.
A few JobInProgress data structures that matter for task assignment:
1. Map<Node, List<TaskInProgress>> nonRunningMapCache: maps a Node to the not-yet-running TIPs whose input lives on that node; it is built directly from the job's input split locations (InputFormat).
2. Map<Node, Set<TaskInProgress>> runningMapCache: maps a Node to the TIPs currently running there; as soon as a task is scheduled, its TIP is added here.
3. final List<TaskInProgress> nonLocalMaps: non-local (no input data, i.e. the InputSplit has no locations), not-yet-running TIPs.
4. final SortedSet<TaskInProgress> failedMaps: TIPs sorted by the number of failed task attempts.
5. Set<TaskInProgress> nonLocalRunningMaps: non-local TIPs that are currently running.
6. Set<TaskInProgress> nonRunningReduces: reduce TIPs waiting to run.
7. Set<TaskInProgress> runningReduces: reduce TIPs currently running.
On how map task locality is implemented:
The nonRunningMapCache structure of JobInProgress is what encodes locality: for each node it records the map tasks (TaskInProgress) waiting to run whose input data lives on that node. It is built in JobInProgress.createCache():
  private Map<Node, List<TaskInProgress>> createCache(
                         TaskSplitMetaInfo[] splits, int maxLevel)
                         throws UnknownHostException {
    Map<Node, List<TaskInProgress>> cache =
      new IdentityHashMap<Node, List<TaskInProgress>>(maxLevel);

    Set<String> uniqueHosts = new TreeSet<String>();
    for (int i = 0; i < splits.length; i++) {
      String[] splitLocations = splits[i].getLocations();
      if (splitLocations == null || splitLocations.length == 0) {
        nonLocalMaps.add(maps[i]);
        continue;
      }

      for(String host: splitLocations) {
        Node node = jobtracker.resolveAndAddToTopology(host);
        uniqueHosts.add(host);
        LOG.info("tip:" + maps[i].getTIPId() + " has split on node:" + node);
        for (int j = 0; j < maxLevel; j++) {
          List<TaskInProgress> hostMaps = cache.get(node);
          if (hostMaps == null) {
            hostMaps = new ArrayList<TaskInProgress>();
            cache.put(node, hostMaps);
            hostMaps.add(maps[i]);
          }
          //check whether the hostMaps already contains an entry for a TIP
          //This will be true for nodes that are racks and multiple nodes in
          //the rack contain the input for a tip. Note that if it already
          //exists in the hostMaps, it must be the last element there since
          //we process one TIP at a time sequentially in the split-size order
          if (hostMaps.get(hostMaps.size() - 1) != maps[i]) {
            hostMaps.add(maps[i]);
          }
          node = node.getParent();
        }
      }
    }

    // Calibrate the localityWaitFactor - Do not override user intent!
    if (localityWaitFactor == DEFAULT_LOCALITY_WAIT_FACTOR) {
      int jobNodes = uniqueHosts.size();
      int clusterNodes = jobtracker.getNumberOfUniqueHosts();

      if (clusterNodes > 0) {
        localityWaitFactor =
          Math.min((float)jobNodes/clusterNodes, localityWaitFactor);
      }
      LOG.info(jobId + " LOCALITY_WAIT_FACTOR=" + localityWaitFactor);
    }

    return cache;
  }
For every split, the method resolves each host that holds the split's data into the cluster topology, then walks up from that node (host, rack, and so on, up to maxLevel) and adds the corresponding map task (TaskInProgress) to the list kept for each of those nodes. Later, when an unscheduled map task has to be chosen, simply looking up the requesting TaskTracker's node (and its ancestors) in this cache yields a task with data locality.
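A small, hypothetical example of what this cache ends up holding: suppose one split has replicas on hosts h1 and h2, both under rack /r1, and maxLevel is 2 (host level plus rack level). Conceptually, createCache() then produces:

    nonRunningMapCache:
      /r1/h1 -> [ maps[i] ]   // node-local candidates for a tracker on h1
      /r1/h2 -> [ maps[i] ]   // node-local candidates for a tracker on h2
      /r1    -> [ maps[i] ]   // rack-local candidates (added only once at the rack level)

So a TaskTracker on h1 asking for work lets findNewMapTask() walk bottom-up from /r1/h1 to /r1 and find maps[i] at the node level; a tracker on another host in the same rack would find it at the rack level instead.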
This post is based on Hadoop 1.2.1.
If you spot any mistakes, corrections are welcome.
References: 《Hadoop技术内幕 深入理解MapReduce架构设计与实现原理》, by 董西成 (Dong Xicheng)
http://www.cnblogs.com/lxf20061900/p/3775963.html
Please credit the source when reposting: http://www.cnblogs.com/gwgyk/p/4085627.html