Ambari Alerts

8.1 Understanding Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
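Ambari predefines a set of alerts that monitor cluster components and hosts. Each alert is specified by an alert definition, which sets the
alert type, the check interval, and the thresholds. When a cluster is created or modified, Ambari reads the alert definitions and creates alert
instances for the specific items to monitor. For example, if the cluster includes the Hadoop Distributed File System (HDFS), there is an alert
definition that monitors "DataNode Process". An instance of that alert definition is created for each DataNode in the cluster.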

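In Ambari Web, click the Alerts tab to browse the list of alert definitions for the cluster. You can search or filter the definitions by
current status, by last status change, and by the service that the alert definition is associated with. Click an alert definition name to view
details about that alert, to modify alert properties (such as the check interval and thresholds), and to see the list of alert instances
associated with that alert definition.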

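Each alert instance reports an alert status, defined by severity. The most common severity levels are OK, WARNING, and CRITICAL; the levels
UNKNOWN and NONE also exist. An alert notification is sent when the alert status changes (for example, from OK to CRITICAL).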

8.1.1 Alert Types
-----------------------------------------------------------------------------------------------------------------------------------------
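Alert thresholds and threshold units depend on the alert type. The following list describes each alert type, its possible statuses, and the
threshold units that can be configured, where thresholds are configurable.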

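WEB Alert Type        : A WEB alert watches a web URL on a given component; the alert status is determined by the HTTP response code.
                        Therefore, you cannot change which HTTP response codes determine the thresholds of a WEB alert. You can
                        customize the response text for each threshold and the overall web connection timeout. A connection timeout is
                        considered a CRITICAL alert. Threshold units are seconds.

                        The response codes map to WEB alert statuses as follows:

                            ● OK status       : the web URL responds with a code below 400.
                            ● WARNING status  : the web URL responds with a code equal to or above 400.
                            ● CRITICAL status : Ambari cannot connect to the web URL.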
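
The status mapping above can be sketched in a few lines of Python. This is an illustration only, not Ambari's implementation: the function
name web_alert_state and the convention of passing None for a failed or timed-out connection are assumptions made for this example.

```python
from typing import Optional


def web_alert_state(http_status: Optional[int]) -> str:
    """Map an HTTP response code to a WEB alert status.

    http_status is None when no connection could be made or the connection
    timed out (a hypothetical convention chosen for this sketch).
    """
    if http_status is None:
        return "CRITICAL"  # Ambari could not connect to the web URL
    if http_status < 400:
        return "OK"        # response code below 400
    return "WARNING"       # response code equal to or above 400
```

For instance, web_alert_state(200) yields OK, while a 503 response yields WARNING rather than CRITICAL, because the URL was still reachable.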


PORT Alert Type       : A PORT alert checks the response time from connecting to a given port. Threshold units are seconds.
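
A PORT-style check could look roughly like the sketch below, which times a TCP connection and compares the elapsed time with two thresholds
in seconds. The function names, the placeholder thresholds (1.5 and 5.0 seconds), the ">=" comparison, and the choice to report a refused
or timed-out connection as CRITICAL are all assumptions for illustration; the text above does not specify them.

```python
import socket
import time


def classify_response_time(elapsed: float, warning: float, critical: float) -> str:
    """Compare a response time in seconds with the two thresholds."""
    if elapsed >= critical:
        return "CRITICAL"
    if elapsed >= warning:
        return "WARNING"
    return "OK"


def port_alert_state(host: str, port: int,
                     warning: float = 1.5, critical: float = 5.0) -> str:
    """Time a TCP connect to host:port and classify the result."""
    start = time.monotonic()
    try:
        # Stop waiting once the critical threshold has been reached.
        with socket.create_connection((host, port), timeout=critical):
            pass
    except OSError:
        # Assumption: refused, unreachable, or timed-out connections are CRITICAL.
        return "CRITICAL"
    return classify_response_time(time.monotonic() - start, warning, critical)
```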

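METRIC Alert Type     : A METRIC alert checks the value of a single metric, or of multiple metrics if a calculation is performed. The
                        metrics are read from a URL endpoint available on a given component. A connection timeout is considered a
                        CRITICAL alert.

                        The thresholds are adjustable, and the unit of each threshold depends on the metric. For example, for CPU
                        utilization alerts the unit is percent; for RPC latency alerts the unit is milliseconds.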

AGGREGATE Alert Type  : An AGGREGATE alert aggregates alert statuses as a percentage of the alerts affected. For example, the Percent
                        DataNode Process alert aggregates the DataNode Process alerts.
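
The aggregation can be pictured with the following sketch. It assumes that an instance counts as "affected" when its status is WARNING or
CRITICAL, and it borrows the 10% / 30% thresholds of the Percent DataNodes with Available Space alert listed in section 8.5.1 as example
defaults. Both function names are hypothetical.

```python
from typing import Iterable


def percent_affected(states: Iterable[str]) -> float:
    """Return the percentage of instance statuses that are WARNING or CRITICAL."""
    states = list(states)
    if not states:
        return 0.0
    affected = sum(1 for state in states if state in ("WARNING", "CRITICAL"))
    return 100.0 * affected / len(states)


def aggregate_state(states: Iterable[str],
                    warning: float = 10.0, critical: float = 30.0) -> str:
    """Classify the aggregate from the share of affected instances."""
    percent = percent_affected(states)
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"
```

With four DataNode Process instances of which one is CRITICAL, 25% of the instances are affected, so this sketch reports WARNING.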

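SCRIPT Alert Type     : A SCRIPT alert runs a script that determines the status, such as OK, WARNING, or CRITICAL. You can customize the
                        response text, the property values, and the thresholds of a SCRIPT alert.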

SERVER Alert Type     : A SERVER alert runs a server-side runnable class that determines the alert status, such as OK, WARNING, or
                        CRITICAL.

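RECOVERY Alert Type   : A RECOVERY alert is handled by the Ambari Agent and monitors process restarts. The alert statuses OK, WARNING,
                        and CRITICAL are based on the number of times a process has been restarted automatically. This is useful for
                        knowing when processes are terminating and being restarted automatically by Ambari.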

8.2 Modifying Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
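The general properties of an alert are its name, description, check interval, and thresholds.

The check interval defines how often Ambari checks the alert status. For example, "1 minute" means that Ambari checks the alert status every
minute.

The configuration options for thresholds depend on the alert type.

To modify the general properties of an alert:

    ①    Browse to the Alerts section in Ambari Web.
    ②    Find the alert definition and click it to view the definition details.
    ③    Click Edit to modify the name, description, check interval, and thresholds (where applicable).
    ④    Click Save.
    ⑤    The changes take effect on all alert instances at the next check interval.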

8.3 Modifying Alert Check Counts
-----------------------------------------------------------------------------------------------------------------------------------------
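Ambari lets you set the number of checks an alert performs before a notification is dispatched. If the alert status changes during a check,
Ambari attempts to check the condition a specified number of times (the check count) before dispatching a notification.

Alert check counts do not apply to AGGREGATE alert types. A status change of an AGGREGATE alert results in a notification being dispatched.

If your environment often has transient issues that cause false alerts, you can increase the check count. In that case, the alert status
change is still recorded, but as a SOFT status change. If the alert condition still holds after the specified number of checks, the status
change is considered HARD and notifications are sent.

You typically set the check count globally for all alerts, but you can also override the global value for individual alerts if one or more
of them experience transient issues in practice.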
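
The SOFT and HARD status changes described above can be modelled with a small state tracker. This is a simplified sketch under stated
assumptions, not Ambari's code: the class name CheckCountTracker is hypothetical, the first check that sees a new status is assumed to
count toward the check count, and a change back to the last notified status is assumed to simply cancel the pending change.

```python
class CheckCountTracker:
    """Track one alert instance and decide when a status change becomes HARD."""

    def __init__(self, check_count: int, initial_state: str = "OK"):
        self.check_count = check_count
        self.hard_state = initial_state  # last status for which a notification went out
        self.pending_state = None        # new status still being confirmed
        self.occurrences = 0             # consecutive checks that reported pending_state

    def record(self, state: str) -> str:
        """Record one check result; return "NONE", "SOFT", or "HARD"."""
        if state == self.hard_state:
            # The transient issue went away before it was confirmed.
            self.pending_state = None
            self.occurrences = 0
            return "NONE"
        if state == self.pending_state:
            self.occurrences += 1
        else:
            self.pending_state = state
            self.occurrences = 1
        if self.occurrences >= self.check_count:
            # Condition held for check_count checks: notify.
            self.hard_state = state
            self.pending_state = None
            self.occurrences = 0
            return "HARD"
        return "SOFT"
```

With a check count of 3, two CRITICAL results are recorded as SOFT changes and only the third produces a HARD change; a single CRITICAL
result followed by OK never triggers a notification.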

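To modify the global alert check count:

    ① Browse to the Alerts section in Ambari Web.
    ② In the Actions menu, click Manage Alert Settings.
    ③ Update the Check Count value.
    ④ Click Save.

    A change to the global alert check count may take a few seconds to appear in the Ambari UI for individual alerts.

To override the global alert check count for an individual alert:

    ① Browse to the Alerts section in Ambari Web.
    ② Select the alert for which you want to set a specific Check Count value.
    ③ On the right, click the Edit icon next to the Check Count property.
    ④ Update the Check Count value.
    ⑤ Click Save.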


8.4 Disabling and Re-enabling Alerts
-----------------------------------------------------------------------------------------------------------------------------------------
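You can disable alerts. When an alert is disabled, no alert instances are in effect and Ambari no longer performs the checks for that alert.
Therefore, no alert status changes are recorded and no notifications are sent.

    ① Browse to the Alerts section in Ambari Web.
    ② Find the alert definition and click the Enabled or Disabled text next to it to enable or disable the alert.
    ③ Alternatively, click the alert to view the definition details, then click Enabled or Disabled to enable or disable the alert.
    ④ Confirm the change when prompted.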



8.5 Tables of Predefined Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

8.5.1 HDFS Service Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: NameNode Blocks Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical
                      threshold.
    Potential causes: Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes.
                      The corrupt or missing blocks are from files with a replication factor of 1. New replicas cannot be created
                      because the only replica of the block is missing.
    Resolution      : For critical data, use a replication factor of 3.
                      Bring up the failed DataNodes with missing or corrupt blocks.
                      Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command.
                      Delete the corrupt files and recover them from backup, if one exists.




    □ Alert name: NFS Gateway Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : PORT
    Description     : This host-level alert is triggered if the NFS Gateway process cannot be confirmed as active.
    Potential causes: NFS Gateway is down.
    Resolution      : Check for a non-operating NFS Gateway in Ambari Web.




    □ Alert name: DataNode Storage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the
                      DataNode JMX Servlet for the Capacity and Remaining properties.
    Potential causes: Cluster storage is full.
                      If cluster storage is not full, the DataNode is full.
    Resolution      : If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.
                      If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more
                      or larger disks to the DataNodes. After adding more storage, run the load balancer.



    □ Alert name: DataNode Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : PORT
    Description     : This host-level alert is triggered if the individual DataNode processes cannot be established to be up and
                      listening on the network for the configured critical threshold, in seconds.
    Potential causes: The DataNode process is down or not responding.
                      The DataNode is not down but is not listening to the correct network port/address.
    Resolution      : Check for non-operating DataNodes in Ambari Web.
                      Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary.
                      Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.




    □ Alert name: DataNode Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the DataNode web UI is unreachable.
    Potential causes: The DataNode process is not running.
    Resolution      : Check whether the DataNode process is running.



    □ Alert name: NameNode Host CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning,
                      250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is available
                      only if you are running JDK 1.7.
    Potential causes: Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the
                      sign of an issue in the daemon.
    Resolution      : Use the top command to determine which processes are consuming excess CPU.
                      Reset the offending process.




    □ Alert name: NameNode Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the NameNode web UI is unreachable.
    Potential causes: The NameNode process is not running.
    Resolution      : Check whether the NameNode process is running.



    □ Alert name: Percent DataNodes with Available Space
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn,
                      30% critical).
    Potential causes: Cluster storage is full.
                      If cluster storage is not full, a DataNode is full.
    Resolution      : If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.
                      If the cluster is full, delete unnecessary data or increase storage by adding either more DataNodes or more or
                      larger disks to the DataNodes. After adding more storage, run the load balancer.



    □ Alert name: Percent DataNodes Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This alert is triggered if the number of non-operating DataNodes in the cluster is greater than the configured
                      critical threshold. It aggregates the DataNode Process alert.
    Potential causes: DataNodes are down.
                      DataNodes are not down but are not listening to the correct network port/address.
    Resolution      : Check for non-operating DataNodes in Ambari Web.
                      Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.
                      Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.




    □ Alert name: NameNode RPC Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical
                      threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average
                      queue wait time to increase for NameNode operations.
    Potential causes: A job or an application is performing too many NameNode operations.
    Resolution      : Review the job or the application for potential bugs causing it to perform too many NameNode operations.



    □ Alert name: NameNode Last Checkpoint
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This alert is triggered if the last time that the NameNode performed a checkpoint was too long ago or if the
                      number of uncommitted transactions is beyond a certain threshold.
    Potential causes: Too much time elapsed since last NameNode checkpoint.
                      Uncommitted transactions beyond threshold.
    Resolution      : Set NameNode checkpoint.
                      Review threshold for uncommitted transactions.



    □ Alert name: Secondary NameNode Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the
                      network. This alert is not applicable when NameNode HA is configured.
    Potential causes: The Secondary NameNode is not running.
    Resolution      : Check that the Secondary NameNode process is running.



    □ Alert name: NameNode Directory Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This alert checks if the NameNode NameDirStatus metric reports a failed directory.
    Potential causes: One or more of the directories are reporting as not healthy.
    Resolution      : Check the NameNode UI for information about unhealthy directories.



    □ Alert name: HDFS Capacity Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This service-level alert is triggered if the HDFS capacity utilization exceeds the configured thresholds
                      (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining
                      properties.
    Potential causes: Cluster storage is full.
    Resolution      : Delete unnecessary data.
                      Archive unused data.
                      Add more DataNodes.
                      Add more or larger disks to the DataNodes.
                      After adding more storage, run the load balancer.
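
As an example of a METRIC alert that performs a calculation, the utilization check for this alert might be sketched as below. The formula
(used divided by used plus remaining) is an assumption about how the two JMX properties are combined, and both function names are
hypothetical; the 80% and 90% defaults are the thresholds quoted in the description.

```python
def capacity_utilization_percent(capacity_used: float, capacity_remaining: float) -> float:
    """Compute HDFS utilization in percent from CapacityUsed and CapacityRemaining."""
    total = capacity_used + capacity_remaining  # assumed definition of total capacity
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * capacity_used / total


def capacity_alert_state(capacity_used: float, capacity_remaining: float,
                         warning: float = 80.0, critical: float = 90.0) -> str:
    """Classify utilization; a threshold is crossed only when it is exceeded."""
    percent = capacity_utilization_percent(capacity_used, capacity_remaining)
    if percent > critical:
        return "CRITICAL"
    if percent > warning:
        return "WARNING"
    return "OK"
```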

    □ Alert name: DataNode Health Summary
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This service-level alert is triggered if there are unhealthy DataNodes.
    Potential causes: A DataNode is in an unhealthy state.
    Resolution      : Check the NameNode UI for the list of non-operating DataNodes.


    □ Alert name: HDFS Pending Deletion Blocks
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured
                      warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.
    Potential causes: A large number of blocks are pending deletion.
    Resolution      :


    □ Alert name: HDFS Upgrade Finalized State
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if HDFS is not in the finalized state.
    Potential causes: The HDFS upgrade is not finalized.
    Resolution      : Finalize any upgrade you have in process.


    □ Alert name: DataNode Unmounted Data Dir
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This host-level alert is triggered if one of the data directories on a host was previously on a mount point and
                      became unmounted.
    Potential causes: If the mount history file does not exist, then report an error if a host has one or more mounted data directories
                      as well as one or more unmounted data directories on the root partition. This may indicate that a data directory
                      is writing to the root partition, which is undesirable.
    Resolution      : Check the data directories to confirm they are mounted as expected.

    □ Alert name: DataNode Heap Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode
                      JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are percentages.
    Potential causes:
    Resolution      :




    □ Alert name: NameNode Client RPC Queue Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if the deviation of RPC queue latency on the client port has grown beyond
                      the specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes:
    Resolution      :


    □ Alert name: NameNode Client RPC Processing Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if the deviation of RPC latency on the client port has grown beyond the
                      specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes:
    Resolution      :


    □ Alert name: NameNode Service RPC Queue Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if the deviation of RPC queue latency on the DataNode port has grown beyond
                      the specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes:
    Resolution      :


    □ Alert name: NameNode Service RPC Processing Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if the deviation of RPC latency on the DataNode port has grown beyond the
                      specified threshold within a given period. This alert monitors Hourly and Daily periods.
    Potential causes:
    Resolution      :


    □ Alert name: HDFS Storage Capacity Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the
                      specified threshold within a given period. This alert monitors Daily and Weekly periods.
    Potential causes:
    Resolution      :


    □ Alert name: NameNode Heap Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified
                      threshold within a given period. This alert monitors Daily and Weekly periods.
    Potential causes:
    Resolution      :



8.5.2 HDFS HA Alerts
-----------------------------------------------------------------------------------------------------------------------------------------


    □ Alert name: JournalNode Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the individual JournalNode process cannot be established to be up and
                      listening on the network for the configured critical threshold, given in seconds.
    Potential causes: The JournalNode process is down or not responding.
                      The JournalNode is not down but is not listening to the correct network port/address.
    Resolution      :


    □ Alert name: NameNode High Availability Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.
    Potential causes: The Active, Standby, or both NameNode processes are down.
    Resolution      : On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode
                      host/process using Ambari Web.
                      On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the
                      correct network port.


    □ Alert name: Percent JournalNodes Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the
                      configured critical threshold (33% warn, 50% critical). It aggregates the results of JournalNode process checks.
    Potential causes: JournalNodes are down.
                      JournalNodes are not down but are not listening to the correct network port/address.
    Resolution      : Check for dead JournalNodes in Ambari Web.


    □ Alert name: ZooKeeper Failover Controller Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : PORT
    Description     : This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening
                      on the network.
    Potential causes: The ZKFC process is down or not responding.
    Resolution      : Check if the ZKFC process is running.


8.5.3 NameNode HA Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: JournalNode Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the individual JournalNode process cannot be established to be up and
                      listening on the network for the configured critical threshold, given in seconds.
    Potential causes: The JournalNode process is down or not responding.
                      The JournalNode is not down but is not listening to the correct network port/address.
    Resolution      : Check if the JournalNode process is running.


    □ Alert name: NameNode High Availability Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.
    Potential causes: The Active, Standby, or both NameNode processes are down.
    Resolution      : On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode
                      host/process using Ambari Web.
                      On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the
                      correct network port.

    □ Alert name: Percent JournalNodes Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the
                      configured critical threshold (33% warn, 50% critical). It aggregates the results of JournalNode process checks.
    Potential causes: JournalNodes are down.
                      JournalNodes are not down but are not listening to the correct network port/address.
    Resolution      : Check for non-operating JournalNodes in Ambari Web.


    □ Alert name: ZooKeeper Failover Controller Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : PORT
    Description     : This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening
                      on the network.
    Potential causes: The ZKFC process is down or not responding.
    Resolution      : Check if the ZKFC process is running.


8.5.4 YARN Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: App Timeline Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the App Timeline Server Web UI is unreachable.
    Potential causes: The App Timeline Server is down.
                      The App Timeline Server is not down but is not listening to the correct network port/address.
    Resolution      : Check for a non-operating App Timeline Server in Ambari Web.


    □ Alert name: Percent NodeManagers Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical
                      threshold. It aggregates the results of NodeManager process alert checks.
    Potential causes: NodeManagers are down.
                      NodeManagers are not down but are not listening to the correct network port/address.
    Resolution      : Check for non-operating NodeManagers.
                      Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes,
                      as necessary.
                      Run the netstat -tuplpn command to check if the NodeManager process is bound to the correct network port.

    □ Alert name: ResourceManager Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the ResourceManager Web UI is unreachable.
    Potential causes: The ResourceManager process is not running.
    Resolution      : Check if the ResourceManager process is running.


    □ Alert name: ResourceManager RPC Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical
                      threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average
                      queue wait time to increase for ResourceManager operations.
    Potential causes: A job or an application is performing too many ResourceManager operations.
    Resolution      : Review the job or the application for potential bugs causing it to perform too many ResourceManager operations.


    □ Alert name: ResourceManager CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds
                      (200% warning, 250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This
                      information is available only if you are running JDK 1.7.
    Potential causes: Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the
                      sign of an issue in the daemon.
    Resolution      : Use the top command to determine which processes are consuming excess CPU.
                      Reset the offending process.



    □ Alert name: NodeManager Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the NodeManager process cannot be established to be up and listening on the
                      network for the configured critical threshold, given in seconds.
    Potential causes: The NodeManager process is down or not responding.
                      The NodeManager is not down but is not listening to the correct network port/address.
    Resolution      : Check if the NodeManager is running.
                      Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary.

    □ Alert name: NodeManager Health Summary
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This host-level alert checks the node health property available from the NodeManager component.
    Potential causes: The NodeManager Health Check script reports issues or is not configured.
    Resolution      : Check the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if
                      necessary.
                      Check the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors.

    □ Alert name: NodeManager Health
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This host-level alert checks the nodeHealthy property available from the NodeManager component.
    Potential causes: The NodeManager process is down or not responding.
    Resolution      : Check the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if
                      necessary.



8.5.5 MapReduce2 Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: History Server Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : WEB
    Description     : This host-level alert is triggered if the HistoryServer Web UI is unreachable.
    Potential causes: The HistoryServer process is not running.
    Resolution      : Check if the HistoryServer process is running.

    □ Alert name: History Server RPC Latency
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical
                      threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average
                      queue wait time to increase for HistoryServer operations.
    Potential causes: A job or an application is performing too many HistoryServer operations.
    Resolution      : Review the job or the application for potential bugs causing it to perform too many HistoryServer operations.

    □ Alert name: History Server CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : METRIC
    Description     : This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured
                      critical threshold.
    Potential causes: Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the
                      sign of an issue in the daemon.
    Resolution      : Use the top command to determine which processes are consuming excess CPU.
                      Reset the offending process.

    □ Alert name: History Server Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : PORT
    Description     : This host-level alert is triggered if the HistoryServer process cannot be established to be up and listening
                      on the network for the configured critical threshold, given in seconds.
    Potential causes: The HistoryServer process is down or not responding.
                      The HistoryServer is not down but is not listening on the correct network port/address.
    Resolution      : Check that the HistoryServer is running.
                      Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart the HistoryServer, if necessary.
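
A PORT-style check of this kind can be approximated with a timed TCP connection attempt. The following Python sketch is an assumption-laden illustration, not Ambari's implementation: `port_alert_state` is a hypothetical helper, and the default thresholds (1.5 s warning, 5.0 s critical) are example values chosen for the sketch.

```python
import socket
import time


def port_alert_state(host, port, warning_s=1.5, critical_s=5.0):
    """Classify a TCP port by how long a connection attempt takes.

    Thresholds are in seconds. A refused or timed-out connection is
    treated as CRITICAL; the threshold defaults are illustrative only.
    """
    start = time.monotonic()
    try:
        # Give up once the critical threshold has elapsed.
        with socket.create_connection((host, port), timeout=critical_s):
            elapsed = time.monotonic() - start
    except OSError:  # covers refused connections and timeouts
        return "CRITICAL"
    return "WARNING" if elapsed >= warning_s else "OK"


# Demonstrate against a throwaway local listener on an OS-chosen port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print("listener up:", port_alert_state("127.0.0.1", demo_port))
listener.close()
print("listener closed:", port_alert_state("127.0.0.1", demo_port))
```

To probe a real HistoryServer, pass its host name and the port configured for it in your cluster instead of the local demo listener.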



8.5.6 HBase Service Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: Percent RegionServers Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This service-level alert is triggered if the configured percentage of RegionServer processes cannot be
                      determined to be up and listening on the network for the configured critical threshold. The default setting
                      is 10% to produce a WARNING alert and 30% to produce a CRITICAL alert. It aggregates the results of the
                      RegionServer process checks.
    Potential causes: Misconfiguration or less-than-ideal configuration caused the RegionServers to crash.
                      Cascading failures brought on by some workload caused the RegionServers to crash.
                      The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS.
                      GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper.
    Resolution      : Check the dependent services to make sure they are operating correctly.
                      Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information.
                      If the failure was associated with a particular workload, try to understand the workload better.
                      Restart the RegionServers.
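
The aggregation described above can be sketched in Python as follows. The 10% and 30% defaults come from the description; everything else is an assumption made for illustration: `aggregate_alert_state` is a hypothetical helper, only CRITICAL child alerts are counted as affected, and a percentage exactly equal to a threshold is treated as reaching it.

```python
def aggregate_alert_state(child_states, warning_pct=10.0, critical_pct=30.0):
    """Aggregate per-host alert states into one service-level state.

    child_states is a list of state strings, one per RegionServer
    process alert. Counting only CRITICAL children and using >= for
    the comparison are assumptions of this sketch.
    """
    if not child_states:
        return "UNKNOWN"
    affected = sum(1 for state in child_states if state == "CRITICAL")
    affected_pct = 100.0 * affected / len(child_states)
    if affected_pct >= critical_pct:
        return "CRITICAL"
    if affected_pct >= warning_pct:
        return "WARNING"
    return "OK"


# 20 RegionServers with 1, 3 and 7 of them down (5%, 15% and 35%).
for down in (1, 3, 7):
    states = ["CRITICAL"] * down + ["OK"] * (20 - down)
    print(down, "down:", aggregate_alert_state(states))
```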

    □ Alert name: HBase Master Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This alert is triggered if the HBase Master processes cannot be confirmed to be up and listening on the
                      network for the configured critical threshold, given in seconds.
    Potential causes: The HBase Master process is down.
                      The HBase Master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.
    Resolution      : Check the dependent services.
                      Look at the master log files (usually /var/log/hbase/*.log) for further information.
                      Look at the configuration files (/etc/hbase/conf).
                      Restart the master.

    □ Alert name: HBase Master CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200%
                      warning, 250% critical). It checks the HBase Master JMX Servlet for the SystemCPULoad property. This
                      information is only available if you are running JDK 1.7.
    Potential causes: Unusually high CPU utilization. This can be caused by a very unusual job/query workload, but it is generally
                      the sign of an issue in the daemon.
    Resolution      : Use the top command to determine which processes are consuming excess CPU.
                      Restart the offending process.

    □ Alert name: RegionServers Health Summary
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This service-level alert is triggered if there are unhealthy RegionServers.
    Potential causes: The RegionServer process is down on the host.
                      The RegionServer process is up and running but not listening on the correct network port (default 60030).
    Resolution      : Check for dead RegionServers in Ambari Web.

    □ Alert name: HBase RegionServer Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening
                      on the network for the configured critical threshold, given in seconds.
    Potential causes: The RegionServer process is down on the host.
                      The RegionServer process is up and running but not listening on the correct network port (default 60030).
    Resolution      : Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web.
                      Run the netstat -tuplpn command to check if the RegionServer process is bound to the correct network port.




8.5.7 Hive Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: HiveServer2 Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This host-level alert is triggered if HiveServer2 cannot be determined to be up and responding to client
                      requests.
    Potential causes: The HiveServer2 process is not running.
                      The HiveServer2 process is not responding.
    Resolution      : Using Ambari Web, check the status of the HiveServer2 component. Stop and then restart it.

    □ Alert name: HiveMetastore Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening
                      on the network for the configured critical threshold, given in seconds.
    Potential causes: The Hive Metastore service is down.
                      The database used by the Hive Metastore is down.
                      The Hive Metastore host is not reachable over the network.
    Resolution      : Using Ambari Web, stop the Hive service and then restart it.

    □ Alert name: WebHCat Server Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This host-level alert is triggered if the WebHCat server cannot be determined to be up and responding to
                      client requests.
    Potential causes: The WebHCat server is down.
                      The WebHCat server is hung and not responding.
                      The WebHCat server is not reachable over the network.
    Resolution      : Restart the WebHCat server using Ambari Web.


8.5.8 Oozie Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: Oozie Server Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This host-level alert is triggered if the Oozie server Web UI is unreachable.
    Potential causes: The Oozie server is down.
                      The Oozie server is not down but is not listening on the correct network port/address.
    Resolution      : Check for a dead Oozie server in Ambari Web.

    □ Alert name: Oozie Server Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to
                      client requests.
    Potential causes: The Oozie server is down.
                      The Oozie server is hung and not responding.
                      The Oozie server is not reachable over the network.
    Resolution      : Restart the Oozie service using Ambari Web.


8.5.9 ZooKeeper Alerts
-----------------------------------------------------------------------------------------------------------------------------------------



    □ Alert name: Percent ZooKeeper Servers Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : AGGREGATE
    Description     : This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be
                      determined to be up and listening on the network for the configured critical threshold, given in seconds.
                      It aggregates the results of the ZooKeeper process checks.
    Potential causes: The majority of your ZooKeeper servers are down and not responding.
    Resolution      : Check the dependent services to make sure they are operating correctly.
                      Check the ZooKeeper logs (/var/log/hadoop/zookeeper.log) for further information.
                      If the failure was associated with a particular workload, try to understand the workload better.
                      Restart the ZooKeeper servers from the Ambari UI.
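
Losing a majority matters because a ZooKeeper ensemble can only serve requests while a strict majority of its servers is up. A minimal Python sketch of that arithmetic follows; `has_quorum` is a hypothetical helper written for this illustration and is unrelated to how Ambari evaluates the alert.

```python
def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """Return True if a strict majority of the ensemble is up."""
    return servers_up > ensemble_size // 2


# Show how many servers each ensemble size needs to keep quorum.
for size in (3, 4, 5):
    needed = size // 2 + 1
    print(f"ensemble of {size}: needs {needed} up,",
          f"quorum with {needed - 1} up? {has_quorum(size, needed - 1)}")
```

Note that an ensemble of four tolerates only one failure, the same as an ensemble of three, which is why odd ensemble sizes are the usual choice.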

    □ Alert name: ZooKeeper Server Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : PORT
    Description     : This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and
                      listening on the network for the configured critical threshold, given in seconds.
    Potential causes: The ZooKeeper server process is down on the host.
                      The ZooKeeper server process is up and running but not listening on the correct network port (default 2181).
    Resolution      : Check for any errors in the ZooKeeper logs (/var/log/hbase/) and restart the ZooKeeper process using Ambari Web.
                      Run the netstat -tuplpn command to check if the ZooKeeper server process is bound to the correct network port.



8.5.10 Ambari Alerts
-----------------------------------------------------------------------------------------------------------------------------------------



    □ Alert name: Host Disk Usage
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SCRIPT
    Description     : This host-level alert is triggered if the amount of disk space used on a host goes above specific thresholds
                      (50% WARNING, 80% CRITICAL).
    Potential causes: The amount of free disk space left is low.
    Resolution      : Check the host for disk space that can be freed, or add more storage.
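
A disk usage check along these lines can be sketched with Python's standard library. The 50% and 80% thresholds are taken from the description above; the function names are hypothetical, the checked path `/` is an assumption, and this is not the script that Ambari itself runs.

```python
import shutil


def classify_disk_usage(used_pct, warning_pct=50.0, critical_pct=80.0):
    """Map a used-space percentage to an alert state.

    "Goes above" is read as a strict comparison, which is an
    assumption of this sketch.
    """
    if used_pct > critical_pct:
        return "CRITICAL"
    if used_pct > warning_pct:
        return "WARNING"
    return "OK"


def disk_usage_state(path="/"):
    """Return (state, used percentage) for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = 100.0 * usage.used / usage.total
    return classify_disk_usage(used_pct), used_pct


state, pct = disk_usage_state("/")
print(f"{state}: {pct:.1f}% used")
```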

    □ Alert name: Ambari Agent Heartbeat
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SERVER
    Description     : This alert is triggered if the server has lost contact with an agent.
    Potential causes: The Ambari Server host is unreachable from the Agent host.
                      The Ambari Agent is not running.
    Resolution      : Check the connection from the Agent host to the Ambari Server.
                      Check that the Agent is running.

    □ Alert name: Ambari Server Alerts
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      : SERVER
    Description     : This alert is triggered if the server detects that there are alerts which have not run in a timely manner.
    Potential causes: Agents are not reporting alert status.
                      Agents are not running.
    Resolution      : Check that all Agents are running and sending heartbeats.


8.5.11 Ambari Metrics Alerts
-----------------------------------------------------------------------------------------------------------------------------------------


    □ Alert name: Metrics Collector Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured
                      port for a number of seconds equal to the threshold.
    Potential causes: The Metrics Collector process is not running.
    Resolution      : Check that the Metrics Collector is running.

    □ Alert name: Metrics Collector - ZooKeeper Server Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This host-level alert is triggered if the Metrics Collector ZooKeeper Server Process cannot be determined to
                      be up and listening on the network.
    Potential causes: The Metrics Collector process is not running.
    Resolution      : Check that the Metrics Collector is running.

    □ Alert name: Metrics Collector - HBase Master Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This alert is triggered if the Metrics Collector HBase Master processes cannot be confirmed to be up and
                      listening on the network for the configured critical threshold, given in seconds.
    Potential causes: The Metrics Collector process is not running.
    Resolution      : Check that the Metrics Collector is running.

    □ Alert name: Metrics Collector - HBase Master CPU Utilization
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This host-level alert is triggered if CPU utilization of the Metrics Collector exceeds certain thresholds.
    Potential causes: Unusually high CPU utilization, which is generally the sign of an issue in the daemon configuration.
    Resolution      : Tune the Ambari Metrics Collector.

    □ Alert name: Metrics Monitor Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Alert type      :
    Description     : This host-level alert is triggered if the Metrics Monitor process cannot be confirmed to be up and running
                      on the network.
    Potential causes: The Metrics Monitor is down.
    Resolution      : Check whether the Metrics Monitor is running on the given host.

    □ Alert name: Percent Metrics Monitors Available
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This is an AGGREGATE alert of the Metrics Monitor Status.
    Potential causes: Metrics Monitors are down.
    Resolution      : Check that the Metrics Monitors are running.

    □ Alert name: Metrics Collector - Auto-Restart Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This alert is triggered if the Metrics Collector has been auto-restarted a number of times equal to the
                      restart threshold within a one-hour timeframe. By default, if it is restarted 2 times in an hour, you will
                      receive a WARNING alert. If it is restarted 4 or more times in an hour, you will receive a CRITICAL alert.
    Potential causes: The Metrics Collector is running but is unstable and causing restarts. This could be due to improper tuning.
    Resolution      : Tune the Ambari Metrics Collector.
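
The restart-count rule above can be sketched as a sliding one-hour window in Python. The defaults (2 restarts for WARNING, 4 for CRITICAL) come from the description; `auto_restart_state` is a hypothetical helper, and treating 3 restarts as WARNING is an assumption, since the description does not cover that case explicitly.

```python
from datetime import datetime, timedelta


def auto_restart_state(restart_times, now, warning_count=2, critical_count=4,
                       window=timedelta(hours=1)):
    """Classify the collector by its number of restarts in the last window.

    restart_times is an iterable of datetime objects, one per restart.
    Restarts older than the window, or dated after now, are ignored.
    """
    recent = [t for t in restart_times if timedelta(0) <= now - t <= window]
    if len(recent) >= critical_count:
        return "CRITICAL"
    if len(recent) >= warning_count:
        return "WARNING"
    return "OK"


now = datetime(2024, 1, 1, 12, 0)
restarts = [
    datetime(2024, 1, 1, 10, 30),  # outside the one-hour window
    datetime(2024, 1, 1, 11, 10),
    datetime(2024, 1, 1, 11, 40),
    datetime(2024, 1, 1, 11, 55),
]
print(auto_restart_state(restarts, now))
```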

    □ Alert name: Grafana Web UI
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This host-level alert is triggered if the AMS Grafana Web UI is unreachable.
    Potential causes: The Grafana process is not running.
    Resolution      : Check whether the Grafana process is running. Restart it if it has gone down.

8.5.12 SmartSense Alerts
-----------------------------------------------------------------------------------------------------------------------------------------

    □ Alert name: SmartSense Server Process
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This alert is triggered if the HST server process cannot be confirmed to be up and listening on the network
                      for the configured critical threshold, given in seconds.
    Potential causes: The HST server is not running.
    Resolution      : Start the HST server process. If startup fails, check hst-server.log.

    □ Alert name: SmartSense Bundle Capture Failure
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This alert is triggered if the last triggered SmartSense bundle failed or timed out.
    Potential causes: Some nodes timed out or failed during data capture. The upload to Hortonworks may also have failed.
    Resolution      : On the "Bundles" page, check the status of the bundle. Next, check which agents have failed or timed out,
                      and review their logs. You can also initiate a new capture.

    □ Alert name: SmartSense Long Running Bundle
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This alert is triggered if the in-progress SmartSense bundle may not complete successfully on time.
    Potential causes: Service components whose data is being collected may not be running, or some agents may be timing out
                      during data collection/upload.
    Resolution      : Restart the services that are not running. Force-complete the bundle and start a new capture.

    □ Alert name: SmartSense Gateway Status
    -------------------------------------------------------------------------------------------------------------------------------------
    Description     : This alert is triggered if the SmartSense Gateway server process is enabled but cannot be reached.
    Potential causes: The SmartSense Gateway is not running.
    Resolution      : Start the gateway. If the gateway fails to start, review hst-gateway.log.

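----------------
Copyright notice: This article is an original work by CSDN blogger 「devalone」, licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/devalone/article/details/80826036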

Posted on 2019-11-15 09:07 by 果冻TD