How YARN SLS simulates large clusters and large workloads

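For a very small cluster of a few to a few dozen machines, you can feed SLS an input file in SLS or Rumen format, that is, pass the --input-sls (or --input-rumen) option to slsrun.sh.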

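But to simulate a medium-sized cluster of 1000 nodes, or a large cluster of 5000 nodes or more, editing SLS or Rumen files by hand is clearly impractical.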

The Hadoop community did not consider this problem at first, so up through version 2.6.5 no solution was provided.

The Hadoop 3.0 release published in May, however, implements this feature.

Feature name: Synthetic Load Generator

The official YARN documentation describes the feature as follows:

The Synthetic Load Generator complements the extensive nature of SLS-native and RUMEN traces, by providing a distribution-driven generation of load. The load generator is organized as a JobStoryProducer (compatible with rumen, and thus gridmix for later integration). We seed the Random number generator so that results randomized but deterministic—hence reproducible. We organize the jobs being generated around /workloads/job_class hierarchy, which allow to easily group jobs with similar behaviors and categorize them (e.g., jobs with long running containers, or maponly computations, etc..). The user can control average and standard deviations for many of the important parameters, such as number of mappers/reducers, duration of mapper/reducers, size (mem/cpu) of containers, chance of reservation, etc. We use weighted-random sampling (whenever we pick among a small number of options) or LogNormal distributions (to avoid negative values) when we pick from wide ranges of values—see appendix on LogNormal distributions.

The SYNTH mode of SLS is very convenient to generate very large loads without the need for extensive input files. This allows to easily explore wide range of use cases (e.g., imagine simulating 100k jobs, and in different runs simply tune the average number of mappers, or average task duration), in an efficient and compact way.

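In short: this complements the SLS and Rumen inputs by generating a very large but regular, distribution-driven load, for example 100k jobs, from a small description file.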
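The two sampling strategies named in the quoted description can be illustrated with a minimal Python sketch. This is not the SLS implementation. The conversion from a desired mean and standard deviation to LogNormal parameters is the standard moment-matching formula, and whether SLS parameterizes its distributions in the same way is an assumption. The function names and the class weights used here are hypothetical.

```python
import math
import random


def lognormal_params(mean, stddev):
    """Return (mu, sigma) of the underlying normal distribution so that
    the LogNormal variable has the requested mean and standard deviation."""
    sigma2 = math.log(1.0 + (stddev / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def sample_jobs(seed, num_jobs, class_weights, mtasks_avg, mtasks_stddev):
    """Draw (job_class, mapper_count) pairs from a seeded generator."""
    rng = random.Random(seed)  # fixed seed: randomized but reproducible
    mu, sigma = lognormal_params(mtasks_avg, mtasks_stddev)
    names = list(class_weights)
    weights = [class_weights[name] for name in names]
    jobs = []
    for _ in range(num_jobs):
        # Weighted-random sampling among a small number of options.
        job_class = rng.choices(names, weights=weights, k=1)[0]
        # LogNormal sampling over a wide range; raw samples are never negative.
        mappers = max(1, round(rng.lognormvariate(mu, sigma)))
        jobs.append((job_class, mappers))
    return jobs
```

Calling sample_jobs twice with the same seed returns identical lists, which is the sense in which a seeded run is randomized yet reproducible.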


How do you use it?

slsrun.sh --tracetype=SYNTH --tracelocation=<syn.json> --output-dir=<outdir>

If you would rather not memorize this, run slsrun.sh -help to print the usage.


The syn.json file passed in follows this template:

{
  "description" : "tiny jobs workload",    //description of the meaning of this collection of workloads
  "num_nodes" : 10,  //total nodes in the simulated cluster
  "nodes_per_rack" : 4, //number of nodes in each simulated rack
  "num_jobs" : 10, // total number of jobs being simulated
  "rand_seed" : 2, //the random seed used for deterministic randomized runs

  // a list of "workloads", each of which has job classes, and temporal properties
  "workloads" : [
    {
      "workload_name" : "tiny-test", // name of the workload
      "workload_weight": 0.5,  // used for weighted random selection of which workload to sample from
      "queue_name" : "sls_queue_1", //queue the job will be submitted to

      //different classes of jobs for this workload
      "job_classes" : [
        {
          "class_name" : "class_1", //name of the class
          "class_weight" : 1.0, //used for weighted random selection of class within workload

          //next group controls average and standard deviation of a LogNormal distribution that
          //determines the number of mappers and reducers for the job.
          "mtasks_avg" : 5,
          "mtasks_stddev" : 1,
          "rtasks_avg" : 5,
          "rtasks_stddev" : 1,

          //average and stddev input param of LogNormal distribution controlling job duration
          "dur_avg" : 60,
          "dur_stddev" : 5,

          //average and stddev input param of LogNormal distribution controlling mappers and reducers durations
          "mtime_avg" : 10,
          "mtime_stddev" : 2,
          "rtime_avg" : 20,
          "rtime_stddev" : 4,

          //average and stddev input param of LogNormal distribution controlling memory and cores for map and reduce
          "map_max_memory_avg" : 1024,
          "map_max_memory_stddev" : 0.001,
          "reduce_max_memory_avg" : 2048,
          "reduce_max_memory_stddev" : 0.001,
          "map_max_vcores_avg" : 1,
          "map_max_vcores_stddev" : 0.001,
          "reduce_max_vcores_avg" : 2,
          "reduce_max_vcores_stddev" : 0.001,

          //probability of running this job with a reservation
          "chance_of_reservation" : 0.5,
          //input parameters of LogNormal distribution that determines the deadline slack (as a multiplier of job duration)
          "deadline_factor_avg" : 10.0,
          "deadline_factor_stddev" : 0.001
        }
      ],
      // for each workload determines with what probability each time bucket is picked to choose the job start time.
      // In the example below the jobs have twice as much chance to start in the first minute than in the second minute
      // of simulation, and then zero chance thereafter.
      "time_distribution" : [
        { "time" : 1, "weight" : 66 },
        { "time" : 60, "weight" : 33 },
        { "time" : 120, "jobs" : 0 }
      ]
    }
  ]
}
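For a cluster of thousands of nodes, the appeal of SYNTH mode is that only a few top-level numbers change while the file stays the same size. The following Python sketch writes such a file. The field names are copied from the template above; the concrete values (5000 nodes, 40 nodes per rack, 100000 jobs), the function names and the output file name are hypothetical choices for illustration, not tuned recommendations. The script emits strict JSON without comments.

```python
import json


def build_synth_config(num_nodes, nodes_per_rack, num_jobs, rand_seed=2):
    """Build a SYNTH trace description as a plain dict.

    Field names follow the template in this post; all values are
    illustrative assumptions.
    """
    job_class = {
        "class_name": "class_1",
        "class_weight": 1.0,
        "mtasks_avg": 5, "mtasks_stddev": 1,
        "rtasks_avg": 5, "rtasks_stddev": 1,
        "dur_avg": 60, "dur_stddev": 5,
        "mtime_avg": 10, "mtime_stddev": 2,
        "rtime_avg": 20, "rtime_stddev": 4,
        "map_max_memory_avg": 1024, "map_max_memory_stddev": 0.001,
        "reduce_max_memory_avg": 2048, "reduce_max_memory_stddev": 0.001,
        "map_max_vcores_avg": 1, "map_max_vcores_stddev": 0.001,
        "reduce_max_vcores_avg": 2, "reduce_max_vcores_stddev": 0.001,
        "chance_of_reservation": 0.5,
        "deadline_factor_avg": 10.0, "deadline_factor_stddev": 0.001,
    }
    workload = {
        "workload_name": "large-test",  # hypothetical name
        "workload_weight": 1.0,
        "queue_name": "sls_queue_1",
        "job_classes": [job_class],
        # Same time buckets as in the template above.
        "time_distribution": [
            {"time": 1, "weight": 66},
            {"time": 60, "weight": 33},
            {"time": 120, "jobs": 0},
        ],
    }
    return {
        "description": "large cluster workload",
        "num_nodes": num_nodes,
        "nodes_per_rack": nodes_per_rack,
        "num_jobs": num_jobs,
        "rand_seed": rand_seed,
        "workloads": [workload],
    }


def write_synth_config(path, **kwargs):
    """Serialize the config to `path` as strict JSON and return the dict."""
    config = build_synth_config(**kwargs)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config
```

For example, write_synth_config("syn-large.json", num_nodes=5000, nodes_per_rack=40, num_jobs=100000) produces a file that can be given to slsrun.sh through --tracelocation. Scaling the simulation up or down then means changing three arguments rather than editing a trace by hand.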


For reference, the official YARN JIRA issue for SYNTH:

https://issues.apache.org/jira/browse/YARN-6363


posted on 2018-07-12 22:55  sichenzhao