Hadoop Exporter Open-Source Project Development Guide

The Hadoop Exporter Open-Source Project

The project was last updated in 2018. Its purpose is to monitor the JMX ports of the various components in the cluster; most open-source cluster components, such as HDFS and YARN, expose their key monitoring data through JMX.

Overall the project is solid. Developing Prometheus support for every component ourselves would take a great deal of time, so after wiring it into our Hadoop cluster monitoring, and with future upgrades and extensions in mind, I forked the project. I will keep maintaining the fork and will publish independent version numbers going forward.

Project Overview

Hadoop Exporter can be thought of as an ETL project: it converts the JSON data exposed over JMX into a dimensional model. I understand the current architecture as follows:

[Figure: current architecture]

In the current V1.0 version, the ETL program has to be installed on every node rather than scraping from a central location, which makes maintenance harder when the cluster grows or changes. The upside is that the ETL programs are independent of one another and do not interfere with each other.

Development Language and Dependencies

Development language: Python
Language version: 2.7.x
Dependency libraries: requests, prometheus_client, python-consul, pyyaml

Project Architecture

[Figure: project architecture]

Basic Processing Flow

  1. Parse the command-line arguments
  2. Start the HTTP server that exposes metrics to Prometheus
  3. Send an HTTP request to the cluster JMX configuration center and use the local hostname to obtain the JMX configuration
  4. Create a Collector instance for each process based on the JMX URLs in the configuration center
  5. Register the Collectors with the Prometheus client
  6. When the server is asked to scrape JMX metrics, the collect method of each Collector is called automatically
  7. Fetch the MBeans from the JMX endpoint, then process the labels and the metrics according to the process type
  8. Update the instance-specific metrics with the common metrics
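
Pulling these steps together, a minimal sketch of the flow looks roughly like the following. The collector import path and the argument names are assumptions for illustration; the real hadoop_exporter.py differs in its details.

# Minimal sketch of the exporter flow (collector import path and argument
# names are assumed; the service-discovery lookup is simplified away).
import argparse
import time

from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY

from hdfs_namenode import NameNodeMetricCollector  # assumed path: cmd/hdfs_namenode.py


def main():
    # 1. Parse the command-line arguments
    parser = argparse.ArgumentParser(description='Hadoop exporter sketch')
    parser.add_argument('-c', dest='cluster', default='hadoop-ha')
    parser.add_argument('-hdfs', dest='namenode_jmx', default='http://ha-node1:9870/jmx')
    parser.add_argument('-P', dest='port', type=int, default=9131)
    args = parser.parse_args()

    # 2. Start the HTTP server that exposes the metrics to Prometheus
    start_http_server(args.port)

    # 3-5. The real exporter resolves the JMX URLs through the configuration center;
    #      here a single NameNode collector is registered directly.
    REGISTRY.register(NameNodeMetricCollector(args.cluster, args.namenode_jmx))

    # 6-8. From now on every scrape calls collect() on the registered collectors.
    while True:
        time.sleep(5)


if __name__ == '__main__':
    main()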

Project Source Layout

Entry point: hadoop_exporter.py
cmd: core code that processes each category of JMX metrics
config: configuration files
test: test data

Development Conventions [work in progress]

Class Names

  • Class names use UpperCamelCase
  • One class per file

Methods

  • Public method names start with a lowercase letter, with words separated by underscores
  • Public methods must carry docstrings
  • Private method names start with an underscore
  • Initialization logic goes in __init__

Application Development

  • Each new category of component requires a new Collector implementation
  • Each new category of component gets its own directory, with metrics grouped by MBean
  • Every node on the core code path must log with logger.info; to ease troubleshooting later, provide additional logger.debug messages

Development Guide

JMX Metric Structure

All JMX metrics are wrapped in an array named beans.

  • name is the name of the MBean
  • every other field is the corresponding monitoring data
{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=JvmMetrics",
    "modelerType" : "JvmMetrics",
    "tag.Context" : "jvm",
    "tag.ProcessName" : "NameNode",
    "tag.SessionId" : null,
    "tag.Hostname" : "ha-node1",
    "MemNonHeapUsedM" : 69.99187,
    "MemNonHeapCommittedM" : 71.84375,
    "MemNonHeapMaxM" : -1.0,
    "MemHeapUsedM" : 189.16359,
    "MemHeapCommittedM" : 323.5,
    "MemHeapMaxM" : 1287.5,
    "MemMaxM" : 1287.5,
    "GcCount" : 22,
    "GcTimeMillis" : 3708,
    "GcNumWarnThresholdExceeded" : 0,
    "GcNumInfoThresholdExceeded" : 1,
    "GcTotalExtraSleepTime" : 10874,
    "ThreadsNew" : 0,
    "ThreadsRunnable" : 6,
    "ThreadsBlocked" : 0,
    "ThreadsWaiting" : 11,
    "ThreadsTimedWaiting" : 39,
    "ThreadsTerminated" : 0,
    "LogFatal" : 0,
    "LogError" : 0,
    "LogWarn" : 11,
    "LogInfo" : 3922
  },
  ...
  ]
}

Because Prometheus uses a dimensional model, the JMX data has to be converted into one: roughly, string-typed fields become dimensions (labels) and numeric fields become metrics. Which metrics to export, and which labels each metric should carry, is something we have to define and implement ourselves.

Target Dimensional Metric Structure

The data above is flat fact data and has to be converted into a dimensional structure. The first step is to decide which fields are metric values and which are dimensions. This differs from MBean to MBean, so the conversion has to be implemented in the application, for example separately for the NameNode and the DataNode. Metrics that cannot be grouped into a category are exported directly, using predefined dimensions such as cluster and instance.

The target dimensional metric structure is the metric data that Prometheus can understand. In the project it is stored as a two-level dictionary, for example:

_hadoop_namenode_metrics[metric category (usually the MBean name)][metric name]
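
A minimal illustration of that structure (the entries are only an example):

from prometheus_client.core import GaugeMetricFamily

# Outer key: metric category (usually the MBean name); inner key: metric name.
_hadoop_namenode_metrics = {
    'NameNodeActivity': {
        'MethodNumOps': GaugeMetricFamily('hadoop_hdfs_namenode_nnactivity_method_ops_total',
                                          'Total number of the times the method is called.',
                                          labels=['cluster', 'method']),
    },
    'FSNamesystem': {},
}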

The MetricCol Class

The following uses hdfs_namenode.py as an example.

Create and Initialize the Metrics Collector Base Class

class MetricCol(object):
    '''
    MetricCol is the superclass of all metrics collectors. It sets up the common
    parameters such as cluster, url, component and service.
    '''
    def __init__(self, cluster, url, component, service):
        '''
        @param cluster: cluster name, set in the configuration file or via the command line.
        @param url: the URL on which each component exposes its metrics. For example,
                    http://ip:9870/jmx returns the HDFS metrics and
                    http://ip:8088/jmx returns the ResourceManager metrics.
        @param component: component name, e.g. "hdfs", "resourcemanager", "mapreduce", "hive", "hbase".
        @param service: service name, e.g. "namenode", "resourcemanager", "mapreduce".
        '''
        self._cluster = cluster
        # Strip the trailing /
        self._url = url.rstrip('/')
        self._component = component
        # Metric prefix, named hadoop_<component>_<service>
        self._prefix = 'hadoop_{0}_{1}'.format(component, service)
        # List of JSON file names under the directory named after the service,
        # e.g. for "namenode" every JSON file in the namenode directory is loaded.
        self._file_list = utils.get_file_list(service)
        # All JSON files under the common directory
        self._common_file = utils.get_file_list("common")
        # Merge both lists of JSON files
        self._merge_list = self._file_list + self._common_file
        # Holds the metric objects
        self._metrics = {}
        for i in range(len(self._file_list)):
            # For each file name, read the corresponding metric configuration (JSON) file
            self._metrics.setdefault(self._file_list[i], utils.read_json_file(service, self._file_list[i]))
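
The utils helpers used above (get_file_list, read_json_file) are not reproduced in this guide. A rough sketch of what they might look like is shown below; it only assumes that the metric definitions live under a config/<service>/ directory, and the project's actual implementation may differ.

# utils.py (sketch; assumes config/<service>/*.json holds the metric definitions)
import json
import os

_CONFIG_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'config')


def get_file_list(service):
    '''Return the names (without .json) of all JSON files under config/<service>.'''
    path = os.path.join(_CONFIG_DIR, service)
    return [f[:-len('.json')] for f in os.listdir(path) if f.endswith('.json')]


def read_json_file(service, name):
    '''Load config/<service>/<name>.json and return its content as a dict.'''
    with open(os.path.join(_CONFIG_DIR, service, name + '.json')) as fp:
        return json.load(fp)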

The JSON metric configuration files use the following format:

{
    "MetricName": "Metric description."
}

For example:

{
    "MemNonHeapUsedM": "Current non-heap memory used in MB.",
    "MemNonHeapCommittedM": "Current non-heap memory committed in MB.",
    "MemNonHeapMaxM": "Max non-heap memory size in MB.",
    "MemHeapUsedM": "Current heap memory used in MB.",
    "MemHeapCommittedM": "Current heap memory committed in MB.",
    "MemHeapMaxM": "Max heap memory size in MB.",
    "MemMaxM": "Max memory size in MB.",
    "ThreadsNew": "Current number of NEW threads.",
    "ThreadsRunnable": "Current number of RUNNABLE threads.",
    "ThreadsBlocked": "Current number of BLOCKED threads.",
    "ThreadsWaiting": "Current number of WAITING threads.",
    "ThreadsTimedWaiting": "Current number of TIMED_WAITING threads.",
    "ThreadsTerminated": "Current number of TERMINATED threads.",
    "GcCount": "Total number of Gc count",    
    "GcTimeMillis": "Total GC time in msec.",
    "GcCountParNew": "ParNew GC count.",
    "GcTimeMillisParNew": "ParNew GC time in msec.",
    "GcCountConcurrentMarkSweep": "ConcurrentMarkSweep GC count.",
    "GcTimeMillisConcurrentMarkSweep": "ConcurrentMarkSweep GC time in msec.",
    "GcNumWarnThresholdExceeded": "Number of times that the GC warn threshold is exceeded.",
    "GcNumInfoThresholdExceeded": "Number of times that the GC info threshold is exceeded.",
    "GcTotalExtraSleepTime": "Total GC extra sleep time in msec.",
    "LogFatal": "Total number of FATAL logs.",
    "LogError": "Total number of ERROR logs.",
    "LogWarn": "Total number of WARN logs.",
    "LogInfo": "Total number of INFO logs."
}

Create and Initialize the NameNodeMetricCollector Class

class NameNodeMetricCollector(MetricCol):

    def __init__(self, cluster, url):
        # Explicitly call the superclass initializer with the cluster name,
        # JMX URL, component name and service name.
        # Note: the service name must match the folder name of the JSON configuration.
        MetricCol.__init__(self, cluster, url, "hdfs", "namenode")
        self._hadoop_namenode_metrics = {}
        for i in range(len(self._file_list)):
            # For each JSON configuration file, prepare a dict for the exported metric objects
            self._hadoop_namenode_metrics.setdefault(self._file_list[i], {})

Note:

  • Every metric to be exported should be listed in a JSON file under the directory named after the service
  • The component name forms the metric prefix: hadoop_hdfs_

Implementing collect

The collect method is the core of the metric conversion for a component. Its implementation consists of the following steps:

  1. Request the JMX endpoint
  2. Set up the labels
  3. Set the metric values
    def collect(self):
        # Send an HTTP request to the JMX URL to fetch the metric data.
        # The response contains the JSON array of beans.

        try:
            # Request the JMX JSON data over HTTP
            beans = utils.get_metrics(self._url)
        except:
            logger.info("Can't scrape metrics from url: {0}".format(self._url))
            pass
        else:
            # For every MBean of interest, set up the metrics with their labels and descriptions
            self._setup_metrics_labels(beans)

            # Set the value of every metric
            self._get_metrics(beans)

            # Merge the common metrics into the NameNode metrics
            common_metrics = common_metrics_info(self._cluster, beans, "hdfs", "namenode")
            self._hadoop_namenode_metrics.update(common_metrics())

            # Iterate over every metric category (NameNode-specific and common)
            # and yield each metric with its labels
            for i in range(len(self._merge_list)):
                service = self._merge_list[i]
                for metric in self._hadoop_namenode_metrics[service]:
                    yield self._hadoop_namenode_metrics[service][metric]
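
For a quick sanity check the collector can also be driven by hand, without an HTTP server (the import path below is an assumption):

# Call collect() directly and print every sample it yields.
from hdfs_namenode import NameNodeMetricCollector  # assumed path: cmd/hdfs_namenode.py

collector = NameNodeMetricCollector('hadoop-ha', 'http://ha-node1:9870/jmx')
for family in collector.collect():
    for sample in family.samples:
        print(sample)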

Implementing _setup_metrics_labels

_setup_metrics_labels sets up the labels based on the MBeans present.

    def _setup_metrics_labels(self, beans):
        # The metrics we want to export.
        # Iterate over every MBean and set up the labels of interest
        for i in range(len(beans)):
            # Only handle the MBeans of interest
            if 'NameNodeActivity' in beans[i]['name']:
                self._setup_nnactivity_labels()
    
            if 'StartupProgress' in beans[i]['name']:
                self._setup_startupprogress_labels()
                    
            if 'FSNamesystem' in beans[i]['name']:
                self._setup_fsnamesystem_labels()
    
            if 'FSNamesystemState' in beans[i]['name']:
                self._setup_fsnamesystem_state_labels()
    
            if 'RetryCache' in beans[i]['name']:
                self._setup_retrycache_labels()

Implementing the Label Handling

The code below iterates over the metrics defined in the JSON configuration file and processes each matched metric. The logic is:

  1. Iterate over every metric defined in NameNodeActivity.json for the NameNode

  2. Define the labels (here only cluster and method)

  3. The NameNodeActivity metrics fall into the following categories:

    • metrics ending in NumOps
    • metrics ending in AvgTime
    • everything else (grouped under Operations)

    The program abstracts these into dimensions.

  4. Create the metrics with their dimension labels

    def _setup_nnactivity_labels(self):
        # Flags recording whether a category still needs to be created (1 = create, 0 = already created)
        num_namenode_flag,avg_namenode_flag,ops_namenode_flag = 1,1,1
        # Iterate over the metrics of the NameNodeActivity MBean
        for metric in self._metrics['NameNodeActivity']:
            # Normalize the metric name (CamelCase to snake_case)
            snake_case = re.sub('([a-z0-9])([A-Z])', r'\1_\2', metric).lower()
            # Predefine the labels (here cluster and method)
            label = ["cluster", "method"]
            # Classify by the suffix of the metric name and build the metric name and labels,
            # e.g. hadoop_hdfs_namenode_nnactivity_method_avg_time_milliseconds{cluster="hadoop-ha",method="BlockReport"}
            if "NumOps" in metric:
                if num_namenode_flag:
                    key = "MethodNumOps"
                    # Build a Gauge-type metric family:
                    # first argument: metric name
                    # second argument: metric description
                    # third argument: labels
                    self._hadoop_namenode_metrics['NameNodeActivity'][key] = GaugeMetricFamily("_".join([self._prefix, "nnactivity_method_ops_total"]),
                                                                                               "Total number of the times the method is called.",
                                                                                               labels=label)
                    # Set to 0 so that further metrics of the same category are skipped
                    num_namenode_flag = 0
                else:
                    continue
            elif "AvgTime" in metric:
                if avg_namenode_flag:
                    key = "MethodAvgTime"
                    self._hadoop_namenode_metrics['NameNodeActivity'][key] = GaugeMetricFamily("_".join([self._prefix, "nnactivity_method_avg_time_milliseconds"]),
                                                                                               "Average turn around time of the method in milliseconds.",
                                                                                               labels=label)
                    avg_namenode_flag = 0
                else:
                    continue
            else:
                # Metrics not classified above are all stored under nnactivity_operations_total
                if ops_namenode_flag:
                    ops_namenode_flag = 0
                    key = "Operations"
                    self._hadoop_namenode_metrics['NameNodeActivity'][key] = GaugeMetricFamily("_".join([self._prefix, "nnactivity_operations_total"]),
                                                                                               "Total number of each operation.",
                                                                                               labels=label)
                else:
                    continue

Implementing _get_metrics

This method sets the values on the metrics created earlier.

    def _get_metrics(self, beans):
        # Iterate over every MBean
        for i in range(len(beans)):
            # Dispatch to the appropriate handler for this MBean
            if 'NameNodeActivity' in beans[i]['name']:
                self._get_nnactivity_metrics(beans[i])
            if 'StartupProgress' in beans[i]['name']:
                self._get_startupprogress_metrics(beans[i])
            if 'FSNamesystem' in beans[i]['name'] and 'FSNamesystemState' not in beans[i]['name']:
                self._get_fsnamesystem_metrics(beans[i])
            if 'FSNamesystemState' in beans[i]['name']:
                self._get_fsnamesystem_state_metrics(beans[i])
            if 'RetryCache' in beans[i]['name']:
                self._get_retrycache_metrics(beans[i])

Implementing the Metric Value Handling

Loading the metric values simply means reading them from the MBean and setting them on the metrics. Note: the values must be loaded after the labels have been set up.

    def _get_nnactivity_metrics(self, bean):
        # Iterate over all metrics of this category
        for metric in self._metrics['NameNodeActivity']:
            # Handle each metric category differently
            if "NumOps" in metric:
                # Derive the method label from the metric name
                method = metric.split('NumOps')[0]
                # Set the label values
                label = [self._cluster, method]
                key = "MethodNumOps"
            elif "AvgTime" in metric:
                method = metric.split('AvgTime')[0]
                label = [self._cluster, method]
                key = "MethodAvgTime"
            else:
                if "Ops" in metric:
                    method = metric.split('Ops')[0]
                else:
                    method = metric
                label = [self._cluster, method]                    
                key = "Operations"

            # Call the Prometheus client's add_metric to set the label values and the metric value
            self._hadoop_namenode_metrics['NameNodeActivity'][key].add_metric(label,
                                                                              bean[metric] if metric in bean else 0)
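
To make the label derivation concrete, here is the same splitting logic applied standalone to a few NameNodeActivity metric names:

# How a NameNodeActivity metric name maps to a metric key and a method label.
for metric in ['BlockReportNumOps', 'BlockReportAvgTime', 'CreateFileOps']:
    if 'NumOps' in metric:
        key, method = 'MethodNumOps', metric.split('NumOps')[0]
    elif 'AvgTime' in metric:
        key, method = 'MethodAvgTime', metric.split('AvgTime')[0]
    else:
        key, method = 'Operations', metric.split('Ops')[0] if 'Ops' in metric else metric
    print('{0} -> {1}'.format(key, method))
# MethodNumOps -> BlockReport
# MethodAvgTime -> BlockReport
# Operations -> CreateFile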

Implementing the Common Metrics

Many metrics are shared across components, for example those related to the JVM, the operating system, RPC, UGI and the runtime. These metrics are defined in the JSON files under the common directory.

Extracting their handling into shared code lets us reuse it; after all, many components in a big-data cluster run on the JVM.

def common_metrics_info(cluster, beans, component, service):
    '''
    A closure shared by all services for processing the common metric data.
    @return a closure named get_metrics that returns all common metrics from the
            given beans after dimensional processing.
    '''
    tmp_metrics = {}
    common_metrics = {}
    _cluster = cluster
    _prefix = 'hadoop_{0}_{1}'.format(component, service)
    # Read every JSON metric configuration file under common
    _metrics_type = utils.get_file_list("common")

    for i in range(len(_metrics_type)):
        common_metrics.setdefault(_metrics_type[i], {})
        # Load all metric definitions into a dict.
        # Named tmp because it is always merged into the concrete component implementation.
        tmp_metrics.setdefault(_metrics_type[i], utils.read_json_file("common", _metrics_type[i]))

    ...
    def get_metrics():
        '''
        Assign values to the metrics prepared by setup_labels: fetch the data
        from the URL and set the metrics one by one.
        '''
        common_metrics = setup_labels(beans)
        for i in range(len(beans)):
            if 'name=JvmMetrics' in beans[i]['name']:
                get_jvm_metrics(beans[i])

            if 'OperatingSystem' in beans[i]['name']:
                get_os_metrics(beans[i])

            if 'RpcActivity' in beans[i]['name']:
                get_rpc_metrics(beans[i])

            if 'RpcDetailedActivity' in beans[i]['name']:
                get_rpc_detailed_metrics(beans[i])

            if 'UgiMetrics' in beans[i]['name']:
                get_ugi_metrics(beans[i])

            if 'MetricsSystem' in beans[i]['name'] and "sub=Stats" in beans[i]['name']:
                get_metric_system_metrics(beans[i])

            if 'Runtime' in beans[i]['name']:
                get_runtime_metrics(beans[i])                

        return common_metrics

    return get_metrics

As you can see, the common metrics are handled much like the component-specific ones: first the labels, then the values.

Implementing the Common Metric Labels

Depending on the MBean, a different label-setup method is called.

    def setup_labels(beans):
        '''
        Pre-processing: analyze each module, classify the metrics and add the labels.
        '''
        for i in range(len(beans)):
            if 'name=JvmMetrics' in beans[i]['name']:
                setup_jvm_labels()

            if 'OperatingSystem' in beans[i]['name']:
                setup_os_labels()

            if 'RpcActivity' in beans[i]['name']:
                setup_rpc_labels()

            if 'RpcDetailedActivity' in beans[i]['name']:
                setup_rpc_detailed_labels()

            if 'UgiMetrics' in beans[i]['name']:
                setup_ugi_labels()

            if 'MetricsSystem' in beans[i]['name'] and "sub=Stats" in beans[i]['name']:
                setup_metric_system_labels()   

            if 'Runtime' in beans[i]['name']:
                setup_runtime_labels()
                
        return common_metrics

The JVM metrics serve as an example below. As before, metrics are grouped into dimensions by type and then given labels.

    def setup_jvm_labels():
        for metric in tmp_metrics["JvmMetrics"]:
            '''
            Processing module JvmMetrics
            '''
            snake_case = "_".join(["jvm", re.sub('([a-z0-9])([A-Z])', r'\1_\2', metric).lower()])
            if 'Mem' in metric:
                name = "".join([snake_case, "ebibytes"])
                label = ["cluster", "mode"]
                if "Used" in metric:
                    key = "jvm_mem_used_mebibytes"
                    descriptions = "Current memory used in mebibytes."
                elif "Committed" in metric:
                    key = "jvm_mem_committed_mebibytes"
                    descriptions = "Current memory committed in mebibytes."
                elif "Max" in metric:
                    key = "jvm_mem_max_mebibytes"
                    descriptions = "Current max memory in mebibytes."
                else:
                    key = name
                    label = ["cluster"]
                    descriptions = tmp_metrics['JvmMetrics'][metric]
            elif 'Gc' in metric:
                label = ["cluster", "type"]
                if "GcCount" in metric:
                    key = "jvm_gc_count"
                    descriptions = "GC count of each type GC."
                elif "GcTimeMillis" in metric:
                    key = "jvm_gc_time_milliseconds"
                    descriptions = "Each type GC time in milliseconds."
                elif "ThresholdExceeded" in metric:
                    key = "jvm_gc_exceeded_threshold_total"
                    descriptions = "Number of times that the GC threshold is exceeded."
                else:
                    key = snake_case
                    label = ["cluster"]
                    descriptions = tmp_metrics['JvmMetrics'][metric]
            elif 'Threads' in metric:
                label = ["cluster", "state"]
                key = "jvm_threads_state_total"
                descriptions = "Current number of different threads."
            elif 'Log' in metric:
                label = ["cluster", "level"]
                key = "jvm_log_level_total"
                descriptions = "Total number of each level logs."
            else:
                label = ["cluster"]
                key = snake_case
                descriptions = tmp_metrics['JvmMetrics'][metric]
            common_metrics['JvmMetrics'][key] = GaugeMetricFamily("_".join([_prefix, key]),
                                                                  descriptions,
                                                                  labels=label)
        return common_metrics

Implementing the Common Metric Values

This works the same way as the component-specific value handling.

    def get_jvm_metrics(bean):
        for metric in tmp_metrics['JvmMetrics']:
            name = "_".join(["jvm", re.sub('([a-z0-9])([A-Z])', r'\1_\2', metric).lower()])
            if 'Mem' in metric:
                if "Used" in metric:
                    key = "jvm_mem_used_mebibytes"
                    mode = metric.split("Used")[0].split("Mem")[1]
                    label = [_cluster, mode]
                elif "Committed" in metric:
                    key = "jvm_mem_committed_mebibytes"
                    mode = metric.split("Committed")[0].split("Mem")[1]
                    label = [_cluster, mode]
                elif "Max" in metric:
                    key = "jvm_mem_max_mebibytes"
                    if "Heap" in metric:
                        mode = metric.split("Max")[0].split("Mem")[1]
                    else:
                        mode = "max"
                    label = [_cluster, mode]
                else:
                    key = "".join([name, 'ebibytes'])
                    label = [_cluster]
                
            elif 'Gc' in metric:
                if "GcCount" in metric:
                    key = "jvm_gc_count"
                    if "GcCount" == metric:
                        typo = "total"
                    else:
                        typo = metric.split("GcCount")[1]
                    label = [_cluster, typo]
                elif "GcTimeMillis" in metric:
                    key = "jvm_gc_time_milliseconds"
                    if "GcTimeMillis" == metric:
                        typo = "total"
                    else:
                        typo = metric.split("GcTimeMillis")[1]
                    label = [_cluster, typo]
                elif "ThresholdExceeded" in metric:
                    key = "jvm_gc_exceeded_threshold_total"
                    typo = metric.split("ThresholdExceeded")[0].split("GcNum")[1]
                    label = [_cluster, typo]
                else:
                    key = name
                    label = [_cluster]
            elif 'Threads' in metric:
                key = "jvm_threads_state_total"
                state = metric.split("Threads")[1]
                label = [_cluster, state]
            elif 'Log' in metric:
                key = "jvm_log_level_total"
                level = metric.split("Log")[1]
                label = [_cluster, level]
            else:
                key = name
                label = [_cluster]
            common_metrics['JvmMetrics'][key].add_metric(label,
                                                         bean[metric] if metric in bean else 0)
        return common_metrics
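
As a concrete example of how the mode label is derived from the memory metric names above (standalone illustration):

# Deriving the "mode" label from a JvmMetrics memory metric name.
for metric in ['MemHeapUsedM', 'MemNonHeapCommittedM']:
    if 'Used' in metric:
        mode = metric.split('Used')[0].split('Mem')[1]
    else:
        mode = metric.split('Committed')[0].split('Mem')[1]
    print(mode)
# Heap
# NonHeap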

Registering the Component with Prometheus

Once the Collector is implemented, it has to be wired into register_prometheus in hadoop_exporter.py. For example:

if 'NAMENODE' in k:
    if namenode_flag:
        namenode_url = v['jmx']
        logger.info("namenode_url = {0}, start to register".format(namenode_url))
        # Register the collector
        REGISTRY.register(NameNodeMetricCollector(cluster, namenode_url))
        namenode_flag = 0
        continue
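
For context, here is a rough sketch of how such a register_prometheus loop can walk the whole service-discovery JSON for the local host. The hostname matching, the iteration over the JSON and the collector import are assumptions; the real function registers the remaining roles in the same way.

# Sketch of register_prometheus (hostname matching and import path are assumed).
import logging
import socket

import requests
from prometheus_client.core import REGISTRY

from hdfs_namenode import NameNodeMetricCollector  # assumed path: cmd/hdfs_namenode.py

logger = logging.getLogger(__name__)


def register_prometheus(cluster, rest_url):
    # Fetch the service-discovery JSON served by Tomcat (see the deployment section)
    url = 'http://{0}/cluster_config.json'.format(rest_url)
    config = requests.get(url).json()

    hostname = socket.gethostname()
    namenode_flag = 1
    for node in config[cluster]:
        for host, services in node.items():
            if host != hostname:
                continue
            for k, v in services.items():
                if 'NAMENODE' in k and namenode_flag:
                    logger.info("namenode_url = {0}, start to register".format(v['jmx']))
                    REGISTRY.register(NameNodeMetricCollector(cluster, v['jmx']))
                    namenode_flag = 0
                # DATANODE, RESOURCEMANAGER, NODEMANAGER, ... follow the same pattern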

Development Environment Setup

The Hadoop runtime metrics use the Hadoop exporter from GitHub. Project URL: https://github.com/IloveZiHan/hadoop_exporter

Install pip

# Create the directories
mkdir -p /opt/prometheus/exporters/hadoop_exporter/modules /opt/prometheus/exporters/hadoop_exporter/scripts

# Download get-pip.py
curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o /opt/prometheus/exporters/hadoop_exporter/scripts/get-pip.py

cd /opt/prometheus/exporters/hadoop_exporter/scripts

# Download setuptools
wget https://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
# Extract into the current directory
tar -zxvf setuptools-0.6c11.tar.gz

cd setuptools-0.6c11
# Install setuptools
sudo python setup.py install

cd /opt/prometheus/exporters/hadoop_exporter/scripts

python get-pip.py install requests --target=/opt/prometheus/exporters/hadoop_exporter/modules
python get-pip.py install prometheus_client --target=/opt/prometheus/exporters/hadoop_exporter/modules
python get-pip.py install python-consul --target=/opt/prometheus/exporters/hadoop_exporter/modules
python get-pip.py install pyyaml --target=/opt/prometheus/exporters/hadoop_exporter/modules

Deploying Service Discovery for the Hadoop Cluster

1. Deploy Tomcat

# Upload the Tomcat package
[prometheus@ha-node1 exporters]$ ll | grep tomcat
-rw-r--r--  1 root       root       10515248 Mar  9 14:38 apache-tomcat-8.5.63.tar.gz

cd /opt/prometheus/exporters

# Extract Tomcat
[prometheus@ha-node1 exporters]$ tar -xvzf apache-tomcat-8.5.63.tar.gz

# Change the Tomcat port configuration
cd /opt/prometheus/exporters/apache-tomcat-8.5.63/conf

[prometheus@ha-node1 conf]$ vim server.xml
<!-- Modify line 72 -->
    <Connector port="9035" protocol="HTTP/1.1"
                connectionTimeout="20000"
                redirectPort="8443" />

2. Add the service-discovery JSON

# Create the configuration file
vim /opt/prometheus/exporters/apache-tomcat-8.5.63/webapps/ROOT/cluster_config.json
{
    "hadoop-ha": [
        {
            "ha-node1": {
                "NAMENODE": {
                    "jmx": "http://ha-node1:9870/jmx"
                },
                "DATANODE": {
                    "jmx": "http://ha-node1:9864/jmx"
                },
                "HISTORYSERVER": {
                    "jmx": "http://ha-node1:19888/jmx"
                },
                "JOURNALNODE": {
                    "jmx": "http://ha-node1:8480/jmx"
                },
                "RESOURCEMANAGER": {
                    "jmx": "http://ha-node1:8088/jmx"
                },
                "NODEMANAGER": {
                    "jmx": "http://ha-node1:8042/jmx"
                }
            }
        },
        {
            "ha-node2": {
                "NAMENODE": {
                    "jmx": "http://ha-node2:9870/jmx"
                },
                "DATANODE": {
                    "jmx": "http://ha-node2:9864/jmx"
                },
                "JOURNALNODE": {
                    "jmx": "http://ha-node2:8480/jmx"
                },
                "RESOURCEMANAGER": {
                    "jmx": "http://ha-node2:8088/jmx"
                },
                "NODEMANAGER": {
                    "jmx": "http://ha-node2:8042/jmx"
                }
            }
        },
        {
            "ha-node3": {
                "NAMENODE": {
                    "jmx": "http://ha-node3:9870/jmx"
                },
                "DATANODE": {
                    "jmx": "http://ha-node3:9864/jmx"
                },
                "JOURNALNODE": {
                    "jmx": "http://ha-node3:8480/jmx"
                },
                "RESOURCEMANAGER": {
                    "jmx": "http://ha-node3:8088/jmx"
                },
                "NODEMANAGER": {
                    "jmx": "http://ha-node3:8042/jmx"
                }
            }
        },
        {
            "ha-node4": {
                "DATANODE": {
                    "jmx": "http://ha-node4:9864/jmx"
                },
                "JOURNALNODE": {
                    "jmx": "http://ha-node4:8480/jmx"
                },
                "NODEMANAGER": {
                    "jmx": "http://ha-node4:8042/jmx"
                }
            }
        },
        {
            "ha-node5": {
                "NAMENODE": {
                    "jmx": "http://ha-node5:9870/jmx"
                },
                "DATANODE": {
                    "jmx": "http://ha-node5:9864/jmx"
                },
                "HISTORYSERVER": {
                    "jmx": "http://ha-node5:19888/jmx"
                },
                "JOURNALNODE": {
                    "jmx": "http://ha-node5:8480/jmx"
                },
                "NODEMANAGER": {
                    "jmx": "http://ha-node5:8042/jmx"
                }
            }
        }
    ]
}

3. Start Tomcat

After startup, the service-discovery configuration is reachable at:

http://ha-node1:9035/cluster_config.json

Modify the hadoop_exporter.py Source

# Find line 40 and change the URL to
url = 'http://{0}/cluster_config.json'.format(rest_url)

Start the NameNode Monitoring

export PYTHONPATH=${PYTHONPATH}:/opt/prometheus/exporters/hadoop_exporter/modules
python /opt/prometheus/exporters/hadoop_exporter/cmd/hdfs_namenode.py -c "hadoop-ha" -hdfs "http://ha-node1:9870/jmx" -host "ha-node1" -P 9131 -s "ha-node1:9035"

Start the Exporter

export PYTHONPATH=${PYTHONPATH}:/opt/prometheus/exporters/hadoop_exporter/modules
python /opt/prometheus/exporters/hadoop_exporter/hadoop_exporter.py -host "ha-node1" -P 9131 -s "ha-node1:9035"