Ambari Auto Start (automatic process restart)

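   Author: luxianghao

   Source: http://www.cnblogs.com/luxianghao/p/7886850.html  Please credit the source when reposting. Thank you.

   Disclaimer: This article reflects only my personal views. Corrections are welcome.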

   --- 

 

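I. Introduction
As a cluster management tool, Ambari naturally needs an automatic process restart feature. Typical scenarios:

1 When a process dies unexpectedly, Ambari restarts it and restores the service without human intervention;
2 When a machine boots, you no longer have to click through the services one by one to start them, which saves tedious manual work;

......

In short, it works to bring the cluster to the state you expect ^-^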

 

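II. Version History
Ambari has had this feature since its early releases, and versions 2.2, 2.3, and 2.4 kept iterating on it to make it more complete and easier to use. Early on, the settings lived in ambari.properties. Because properties configured that way are static, any change required restarting both Ambari Server and Ambari Agent. The settings were later moved to cluster-env.xml and stored in the database, and the Web UI gained support for them. Changes no longer require a restart: they are sent from Ambari Server to Ambari Agent along with the heartbeat messages. Auto start supports a cluster-level master switch and per-component switches. The relevant properties are:
recovery_enabled: cluster-level switch for the auto start feature
recovery_type: the type of recovery; each type follows different logic, as shown in the table below
recovery_lifetime_max_count: the maximum number of auto start attempts over the agent's lifetime; reset when Ambari Agent restarts
recovery_max_count: the maximum number of auto start attempts within one time window; reset when Ambari Agent restarts
recovery_window_in_minutes: length of the auto start time window, in minutes
recovery_retry_interval: interval between two consecutive retries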
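
To make the interaction of these limits concrete, here is a minimal Python sketch of a retry limiter. It is not the Ambari Agent's RecoveryManager code: the class RecoveryLimiter, its method may_attempt, and the injectable clock are hypothetical names chosen for illustration, and treating recovery_retry_interval as minutes is an assumption. Only the property names come from the list above.

```python
import time


class RecoveryLimiter:
    """Hypothetical limiter driven by the recovery_* properties."""

    def __init__(self, max_count, window_in_minutes, retry_interval_minutes,
                 lifetime_max_count, clock=time.time):
        self.max_count = max_count                            # recovery_max_count
        self.window_seconds = window_in_minutes * 60          # recovery_window_in_minutes
        self.retry_gap_seconds = retry_interval_minutes * 60  # recovery_retry_interval (assumed minutes)
        self.lifetime_max_count = lifetime_max_count          # recovery_lifetime_max_count
        self.clock = clock
        # Counters live only in memory, so they start from zero again
        # whenever the process holding this object restarts.
        self.lifetime_count = 0
        self.window_count = 0
        self.window_start = None
        self.last_attempt = None

    def may_attempt(self):
        """Return True and record an attempt if every limit allows one now."""
        now = self.clock()
        if self.lifetime_count >= self.lifetime_max_count:
            return False
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            # The previous window has expired (or none exists): open a new one.
            self.window_start = now
            self.window_count = 0
        if self.window_count >= self.max_count:
            return False
        if self.last_attempt is not None and now - self.last_attempt < self.retry_gap_seconds:
            return False
        self.window_count += 1
        self.lifetime_count += 1
        self.last_attempt = now
        return True
```

An agent-like loop would call may_attempt() before each recovery command and skip the command when it returns False.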

Attribute: recovery_type | Commands                      | State Transitions
AUTO_START               | Start                         | INSTALLED → STARTED
FULL                     | Install, Start, Restart, Stop | INIT → INSTALLED, INIT → STARTED, INSTALLED → STARTED, STARTED → STARTED, STARTED → INSTALLED
DEFAULT                  | None                          | Auto start feature disabled
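
Read as a lookup, the table above maps a recovery type and a (current state, desired state) pair to a command. The Python sketch below encodes that reading. Note the assumptions: the table lists commands and transitions per type without pairing them one to one, so the pairing here (in particular INSTALL for INIT → STARTED) is my interpretation, and allowed_command is a hypothetical helper, not an Ambari API.

```python
# (current state, desired state) -> command, per recovery_type.
# The command/transition pairing is an assumption read off the table.
RECOVERY_COMMANDS = {
    "AUTO_START": {
        ("INSTALLED", "STARTED"): "START",
    },
    "FULL": {
        ("INIT", "INSTALLED"): "INSTALL",
        ("INIT", "STARTED"): "INSTALL",  # assumed: install first, start on a later pass
        ("INSTALLED", "STARTED"): "START",
        ("STARTED", "STARTED"): "RESTART",
        ("STARTED", "INSTALLED"): "STOP",
    },
    "DEFAULT": {},  # auto start feature disabled
}


def allowed_command(recovery_type, current_state, desired_state):
    """Return the command the table permits for this transition, or None.

    This only says what is permitted; it does not decide whether a
    recovery is actually needed at this moment.
    """
    transitions = RECOVERY_COMMANDS.get(recovery_type, {})
    return transitions.get((current_state, desired_state))
```

With AUTO_START, only a component that should be STARTED but is currently INSTALLED yields a command; every other combination returns None.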

 

 

 

 

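III. Feature Overview and Code Logic
As the Ambari Server architecture diagram in the Ambari overview shows, Ambari Server maintains an FSM (finite state machine) that records each component's desired state (the state Ambari Server expects the component to be in). Ambari Agent continuously checks the current state of the services on its host. When the desired state and the current state differ, recovery is triggered; the possible state transitions are listed in the table above. In version 2.4 we generally set recovery_type to AUTO_START, and the most common scenario is the INSTALLED → STARTED transition. The logic of that event is as follows:

[Figure: logic of the INSTALLED → STARTED recovery event]

 Note: a component is in the STARTED state while it runs normally, and in the INSTALLED state after it crashes or is stopped normally.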

 

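The state transition above only happens when both of the following switches are on:

1 recovery_enabled = True
2 the enabled components include Service A

If you want to leave both switches on but still disable auto start for a component on a particular node, you can use Maintenance mode. Any of the following puts the component in Maintenance mode:
a) the component itself is put into Maintenance mode
b) the host the component runs on is put into Maintenance mode
c) the service the component belongs to is put into Maintenance mode
d) the cluster that the component's host belongs to is put into Maintenance mode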
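
The gating rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Ambari's implementation: MaintenanceFlags and auto_start_allowed are hypothetical names, and the sketch simply treats Maintenance mode at any of the four levels as disabling auto start for the component.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceFlags:
    """Hypothetical record of where Maintenance mode has been switched on."""
    component: bool = False
    host: bool = False
    service: bool = False
    cluster: bool = False

    def effective(self):
        # Cases a) to d): any single level is enough.
        return self.component or self.host or self.service or self.cluster


def auto_start_allowed(recovery_enabled, enabled_components, component_name, maintenance):
    """Return True only if both switches are on and no Maintenance mode applies."""
    if not recovery_enabled:                      # switch 1: cluster level
        return False
    if component_name not in enabled_components:  # switch 2: component level
        return False
    return not maintenance.effective()
```

For example, auto_start_allowed(True, {"Service A"}, "Service A", MaintenanceFlags(host=True)) returns False even though both switches are on.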

 

 

Related source files
1 AmbariManagementControllerImpl.java
2 ServiceComponentDesiredStateEntity.java
3 ServiceComponentRecoveryChangedEvent.java
4 RecoveryConfigHelper.java
5 RecoveryManager.py
6 Controller.py
...

Related service logs
INFO 2017-11-21 12:16:24,210 RecoveryManager.py:243 - Service A needs recovery.
INFO 2017-11-21 12:16:24,209 Controller.py:265 - Heartbeat response received (id = 15)
INFO 2017-11-21 12:16:24,210 RecoveryManager.py:243 - Service A needs recovery.
INFO 2017-11-21 12:16:24,210 RecoveryManager.py:798 - START command cannot be computed as details are not received from Server.
INFO 2017-11-21 12:16:34,210 Heartbeat.py:82 - Building Heartbeat: {responseId = 15, timestamp = 1511237794210, commandsInProgress = False, componentsMapped = True,recoveryTimestamp = 1511237693282}
INFO 2017-11-21 12:16:54,588 Controller.py:310 - Adding recovery command START for component Service A
INFO 2017-11-21 12:16:54,589 ActionQueue.py:117 - Adding AUTO_EXECUTION_COMMAND for role Service A of cluster DRUID to the queue.
INFO 2017-11-21 12:16:54,604 ActionQueue.py:238 - Executing command with id = 1-0 for role = Service A of cluster DRUID.
INFO 2017-11-21 12:16:54,705 Heartbeat.py:82 - Building Heartbeat: {responseId = 18, timestamp = 1511237814704, commandsInProgress = False, componentsMapped = True,recoveryTimestamp = 1511237693282}
INFO 2017-11-21 12:16:54,854 Controller.py:265 - Heartbeat response received (id = 19)
INFO 2017-11-21 12:16:58,982 ActionQueue.py:341 - After EXECUTION_COMMAND (START), current state of Service A to STARTED
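
When reading logs like these, it can help to measure how long a recovery took. The following Python sketch parses lines in the format shown above and reports the time between the first "needs recovery." message and the next "to STARTED" message. The function names and the choice of those two messages as markers are my own assumptions, based only on this excerpt; other agent versions may word the messages differently.

```python
import re
from datetime import datetime

# LEVEL yyyy-mm-dd HH:MM:SS,mmm File.py:line - message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<source>[\w.]+):(?P<lineno>\d+) - "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return (timestamp, source file, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return timestamp, match.group("source"), match.group("message")


def recovery_duration(lines):
    """Seconds from the first 'needs recovery.' line to the next 'to STARTED' line."""
    detected = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, _source, message = parsed
        if detected is None and message.endswith("needs recovery."):
            detected = timestamp
        elif detected is not None and message.endswith("to STARTED"):
            return (timestamp - detected).total_seconds()
    return None  # no complete recovery found in the input
```

Applied to the excerpt above, the first detection is at 12:16:24,210 and the component reaches STARTED at 12:16:58,982, roughly 35 seconds later.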


Related patches
AMBARI-15077: Auto-start services: Backend API and DB changes for component auto start
AMBARI-14983: Auto-start services: Show list of Services/Component with status indicator
AMBARI-14023: Agents should not ask for auto-start command details if it has the details (smohanty)
AMBARI-13463: Auto start should allow selection of components that can be auto-started (smohanty)
AMBARI-13434: Expose Alert Grace Period Setting in Agents (aonishuk)
AMBARI-13954: Enable auto-start with alerting for AMS (dsen)
AMBARI-14182: Recovery alerts do not go away
AMBARI-14865: Auto start - Maintenance mode of components should be respected when handling agent registration
AMBARI-15141: Start all services request aborts in the middle and hosts go into heartbeat-lost state
AMBARI-15230: Auto-start services: Move default values in ambari.properties to cluster-env.xml
AMBARI-15474: Listen for changes to auto-start configuration and send them to the agent during heartbeats.
AMBARI-12517: Don't send install_packages command to hosts without versionable components

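IV. Similar Tools
Automatic process restart can also be handled by process supervision tools such as Supervisor and God. The difference is that those two fork the managed processes as children of their own daemon and obtain process status by monitoring the child processes, whereas Ambari obtains process status by monitoring a pid or a port.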
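
As an illustration of the pid-based approach, the Python sketch below checks whether the process recorded in a pid file is still alive by sending it signal 0. This is a generic POSIX technique, not Ambari's own status code; the function names are hypothetical, and any pid file path passed in is the caller's assumption.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending a signal
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True


def component_is_running(pid_file):
    """Read a pid from pid_file and report whether that process is alive."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False     # missing, unreadable, or malformed pid file
    return pid_is_running(pid)
```

A supervisor that forks its children learns about an exit directly from the operating system, while a check like this has to poll, and a stale pid file can point at an unrelated process that reused the pid.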

V. Related Links
WIKI: https://cwiki.apache.org/confluence/display/AMBARI/Recovery%3A+auto+start+components

Posted on 2017-11-24 20:52 by luxianghao