请求、信息-Cloud Foundry中基于Master/Slave机制的Service Gateway——解决Service Gateway单点故障问题-by小雨

在写这篇文章之前，xxx经已写过了几篇关于改请求、信息-主题的文章,想要懂得的朋友可以去翻一下之前的文章

Cloud Foudnry作为业界最色出的PaaS台平之一，给宽大的互联网开发者和消费者供提色出的验体。自Cloud Foudnry开源以来，有关Cloud Foundry的究研越来越多，这也很好的支持着Cloud Foundry的生态系统。但是作为一个台平，Cloud Foundry仍然会存在一些可靠性，扩展性方面的足不，这也吸引着多众的Cloud Foundry爱好者对其停止更多更深刻的究研。

本文要主述讲Cloud Foundry中Service Gateway的行运机制，Service Gateway存在的单点故障题问，以及处置单点题问的Master/Slave Service Gateway机制。

1. Service Gateway行运机制

关于Cloud Foundry中Service Gateway的分析，可以参考我之前的博文：Cloud Foundry Service Gateway码源分析。

分析该组件行运机制的时候，我望希用使Service Gateway的最用常的功能provision a service 和 bind a service 来扼要述阐，虽然不能代表有所的功能，但是这两个功能可以让我们对Service Gateway行运机制懂得个大概。

1.1 Service Gateway的启动

以上及提的博文中经已分析得很多，关于启动，需要夸大的是fetch_handles的功能，即Service Gateway向Cloud Controller请求得获service instance的信息，并存储在内存中，面后会讲到在provision a service 的时候会用到该内存中的信息；还有send_heartbeat的功能，即Service Gateway向Cloud Controller发送心跳信息，明证自己的存活状态（另外Cloud Controller也会通过这个心跳信息来存储该类型Service Gateway）。

1.2 provision a service

provision a service 的程流很简单了明，要主现实创立一个服务实例，无非是Cloud Controller接收用户请求，用使HTTP式方发送给Service Gateway，Service Gateway接收并处置请求后通过NATS向Service Node发送provision请求，Service Node接收到请求到provision终了后，通过NATS将结果回返给Service Gateway，Service Gateway将结果份备一份当前，通过HTTP式方回返给Cloud Controller，最后Cloud Controller收到结果信息，将其持久化至数据库，并对用户做回应。这里需要意注的是：首先，关于provision出来的service instance的信息终最会被持久化到Cloud Controller的postgres数据库中；然后，关于这些信息，Service Gateway会份备一份（存储在内存中）。这两点在后文中会显得尤为要重。

1.3 bind a service

bind a service 要主是现实为一个应用定绑一个服务实例。程流与通信式方和provision a service 异小同大，不同的是，在Service Gateway在接收到Cloud Controller的时候，会从自己关于service的份备信息中找出来响应的service instance，然后给该instance发送请求。

2. Service Gateway的单点故障题问

在Cloud Foundry中，同种类型service的Service Gateway只有一个，因此在行运过程当中，定肯会现出单点故障题问。一旦Service Gateway地点节点宕机，那么该Service Gateway变得不可用，所以关于该service的provision和bind等多众请求都将得不到响应。需要意注的是，对于经已功成bind服务的应用来讲，是不受影响的，因为app可以直接通过URL问访Service Node，而不必经过Service Gateway。

由于现在的Cloud Foundry乏缺对于组件的状态监控机制，所以Cloud Foundry对于Service Gateway的宕机不会采用任何办法，因此Cloud Foundry的管理员也不会被知告Service Gateway的故障，从而致使该类型service的不可用。当Cloud Foundry的用户感受到该类型service不可用，并且向管理员馈反时，管理员才会去后台去查看Service Gateway的状态，现发状态为关闭状态，从而停止人为的启动。在这一漫长的过程当中，该类型的service均处于不可用状态，大大影响Cloud Foundry的可用性。

在这样的况情下，最理想的处置案方天然可以使得Cloud Foundry中关于某种类型的service可以有具多个Service Gateway ，如此一来，在一个Service Gateway宕机的时候，还可以有其他的Service Gateway继承任务。通过以上的假设，在际实过程当中，我经曾尝试过两种案方：Master/Slave Gateway 以及Multi Gateways。以下是两个案方的简述：

Master/Salve Gateway：该案方在执行过程当中，只有一个Service Gateway接收到Cloud Controller的请求，并做处置；只有当一个Service Gateway宕机的时候，Cloud Controller才会将请求发给另一个请求。看到这里，定肯会有一些题问：那就是Cloud Controller如何获知Gateway是不是宕机的信息，另外Cloud Controller如何策决将请求发送给哪个gateway，还有一个Service Gateway经已provision完一个服务后宕机，则关于该服务的bind请求会发送至另一个Service Gateway，而这个Service Gateway不有具这个服务的任何信息，从而致使请求不能被执行。本文要主述讲这类案方的计设与现实。
Multi Gateways： multi 就是多个，另外有一种多个之间位置等平的念概，也就是说在Cloud Controller在接收到请求当前，分发给有所的该类型service的Service Gateway，从而使当得一个Service Gateway宕机当前，而其他的有所gateway仍能常正任务。当然，该案方也会存在难点：除了上一种案方中的分部难题以外，还有：由于请求为多个，一个Service Node接收到多个Service Gateway的请求后，该如何策决该请求是不是执行过等。

3. 基于Master/Slave式模的Service Gateway

关于Master/Slave式模的Service Gateway框架如下图：

3.1 两个Service Gateway的册注题问

由于该案方中有两个同种类型的Service Gateway与Cloud Controller相连，则必需要处置的题问是：如何让Cloud Controller意识这两个Service Gateway。在这里，我们可以回想在先原的Cloud Foundry中，Cloud Controller如何意识或者存储一个Service Gateway：Cloud Controller接收Service Gateway的heartbeat信息的同时，在自己的postgres数据库中存储该Service Gateway的信息，也相当于实现Service Gateway的册注。

处置案方：

启动同种类型service的两个Service Gateway，他们分别向Cloud Controller发送heartbeat，由Cloud Controller接收这些请求，并策决哪一个是Master，哪一个是Slave ，从而存入数据库。

体具现实：

Cloud Controller的软件框架是基于Rails编写的，关于postgres数据库的数据表都有一个model（MVC式模中的M），而修改数据表的式模则必须修改这个model文件，由于之前的式模会验证service类型的唯一性，当初则必须将/cloudfoundry/cloud_controller/cloud_controller/app/models/service.rb中的验证代码validates_uniqueness_of :label 给注释失落，这样就有了向数据库中加添雷同类型service的可能性。另外我们还需要在该式模中入加一个属性MasterOrSlave，这样可以使得Cloud Controller策决Master/Slave后，将带有Master/Slave记标的service记载存入postgres数据库。

以上操纵只是从理论上，答应在数据库中现出两个雷同label的service记载，关于终最往数据库中存两条记载还是需要Cloud Controller的controller来现实（MVC中的C）。在体具程序行运中，service_controller.rb文件中的create法方会接收来自Service Gateway的heartbeat信息，并做响应的册注。为了现实Master/Slave式模，我们需要改良该法方，入加策决机制，现实Master/Slave的册注。下图是策决机制：

3.2 Cloud Controller策决用使Service Gateway的题问

处置了以上的册注题问，Cloud Foundry还需要处置Cloud Controller如何选择Service Gateway的题问。

处置案方：

Cloud Controller关于某一label的service 请求，首先找到该label的Master Service Gateway，如果该Service Gateway的active属性为TRUE，则示表该Service Gateway存活状态为存活，可以接收请求；若active属性为FALSE，则示表该Service Gateway存活状态为不存活，则找到该label的Slave Service Gateway，若Slave Service Gateway的active属性为TRUE，则示表Slave Service Gateway可以接收请求，否则该类型的Master Slave Service Gateway都不可用，请求不会执行，用户被知告Service Gateway不可用。策决程流如下图：

3.3 Master/Slave Service Gateway之间服务信息的步同题问

在Cloud Foundry中Service Gateway的行运机制中，经已分析了provision和bind的功能，在这两个过程当中，都触及了一个内存空间（体具代码为@prov_svcs），这个内存空间用来寄存这个service类型的service信息。Service Gateway启动的时候会通过fetch_handles法方得获部全的信息，另外在每次经过这个Service Gateway的provision和bind等请求时，会有响应的信息更新。

那如果用使Master/Slave Service Gateway机制的话，为什么要斟酌两个Service Gateway之间@prov_svcs的步同信息呢？关于这个题问，可以测推：要证保Master/Slave Service Gateway的高可用性，那么在一个Service Gateway宕机的时候，另一个Service Gateway仍然可以任务，并接管当前的请求。但是，如果一个请求通过Master Service Gateway 功成provision 一个服务实例后，Master Service Gateway宕机了，那么关于这个服务实例的bind请求，定肯会通过Slave Service Gateway来实现，然而Slave Service Gateway不有具这个服务实例的任何信息，故bind失败。毫无疑问，Master/Slave Service Gateway机制必须处置服务信息的步同题问，也就是在一个Service Gateway实现某一个请求，更新@prov_svcs后，必须将雷同的更新知告另一个Service Gateway，实现信息的步同。

处置案方：

从Service Gateway的请求程流来看，关于service instance的信息在@prov_svcs中更新后，都市存储到Cloud Controller的postgres数据库中。为了做到Master/Slave Service Gateway的服务信息步同，最效有的法方就是一有信息被更新到Cloud Controller的postgres数据库中，就马上把这个更新信息发送给另一个Service Gateway。为了简单起见，本文采用的步同式方是：Service Gateway以一个可以人为设定的率频向Cloud Controller发送得获服务信息的请求。这样的话，从必定水平上说，可以证保两个Service Gateway服务信息的步同。

体具现实:

在上文的Service Gateway启动的时候，经已述讲过，Service Gateway会用使fetch_handles法方来得获服务信息，体具代码见asynchronous_service_gateway.rb：

　　update_callback = Proc.new do |resp|
      @provisioner.update_handles(resp.handles)
      @handle_fetched = true
      event_machine.cancel_timer(@fetch_handle_timer)

      # TODO remove it when we finish the migration
      current_version = @version_aliases && @version_aliases[:current]
      if current_version
        @provisioner.update_version_info(current_version)
      else
        @logger.info("No current version alias is supplied, skip update version in CCDB.")
      end
    end

    @fetch_handle_timer = event_machine.add_periodic_timer(@handle_fetch_interval) { fetch_handles(&update_callback) }
    event_machine.next_tick { fetch_handles(&update_callback) }

从代码中可以看到Service Gateway是首先是通过一个周期性的timer向Cloud Controller发送得获请求，当收到Cloud Controller的响应当前，取消了这个timer。在Master/Slave Service Gateway 中为了不断地发送fetch_handles请求，只需要将代码event_machine.cancel_timer(@fetch_handle_timer)注释失落可即。这样的话，Service Gateway就会以周期@handle_fetch_interval发送fetch_handles请求。只要Cloud Controller关于service的信息有动变，Service Gateway就会在@handle_fetch_interval时间内获知。在此基础上，修改以上代码只是现实周期性发送请求，而Service Gateway在受接周期性响应后，还需要在处置的时候作出响应的修改，体具代码在provision.rb中，可以每次执行的时候都将@prov_svcs置空，然后再将有所的服务信息放入hash对象@prov_svcs。

关于该题问的处置，以上只是述阐了Service Gateway需要作的修改，这括包Service Gateway发送请和求受接请求，但是只有这些还是不敷的，对于通信的Server端，Cloud Controller也需要作必定的修改，要主是如何根据Service Gateway的请求，读取postgres数据库中有所的服务信息。代码址地要主为cloudfoundry/cloud_controller/cloud_controller/app/controllers/service_controller.rb的list_handles法方。由于Cloud Controller都是通过Service Gateway的URL信息去寻觅数据库中数据表service_bindings和service_configs的关于该Service Gateway的有所service instance，而当初是Master/Slave Service Gateway机制，需要将两个Service Gateway的服务信息部全信息找出，并回返给Service Gateway。比如：service instance A是由 Master Service Gateway来provision或者bind的，那么在service_configs或者service_bindings中是属于Master Service Gateway的，因此Slave Service Gateway在得获handles的时候，也需要将Master Service Gateway掏出，并回返给Slave Service Gateway。其中的体具现实可以参考Rails中的model，在这里service_binding blongs_to :service_config，service_config belongs _to :service。

3.4 unprovison, bind和unbind等请求的分部修改

关于这分部的修改，以unprovision请求为例：

Cloud Controller在收到用户关于某service instance的unprovision后，首先会从service_configs数据表中找出这个service instance，然后再掏出Service Gateway的URL信息，并向这个URL发送unprovision请求。但是如果当初这个Service Gateway经已宕机的话，之前的机制将不能响应这个请求。如果要在Master/Slave Service Gateway机制中处置这个题问，只需要在每次执行unprovision请求的时候，不是去找service instance的Service Gateway信息，而是去寻觅前当正在任务的Service Gateway 来实现unprovision任务。

关于bind和unbind操纵，也是如此，在Cloud Controller向Service Gateway发送请求的时候，不是通过service instance来寻觅Service Gateway，而是通过前当存活的Service Gateway来寻觅。

4. 评价

以上关于Master/Slave Service Gateway计设与现实，经已述阐终了，当初可以在评价这个机制。

首先从可用性手入，该机制很好的处置了之前Cloud Foundry中Service Gateway的单点故障题问，一旦Cloud Foundry中的一个Service Gateway宕机的话，该机制下，Cloud Foundry仍然可以用使另一个Service Gateway来实现任务。当然，当两个Service Gateway都宕机的况情下，该机制也不能常正证保Service的任务，但是在两个Service Gateway雷同境环下，现出这类况情的概率要远远低于只有一个Service Gateway的时候。

Master/Slave Service Gateway机制最要重的分部要主是如何现实service信息的步同题问，本计设要主是通过周期性的发送得获请求，来现实终最的步同。但是周期性的请求，其实不能谨严的证保consistency。如果关于一个service instance的porvision和bind操纵在一个周期内需要实现，而在两个请求之间的某个刻时，Master Service Gateway宕机的话，Slave Service Gateway也不可用。当这个周期非常粗粒度的时候，这样的题问是不能容忍的，但是如果这个周期比拟细粒度的话，题问能失掉非常好的解缓，这样的况情几乎可以忽略不计。但是当周期性的请求粒度很细的话，势必会形成Cloud Controller的通信的挤拥，这也是一个在请求很多，系统很大的时候不得不斟酌的题问。

5. 总结

本文要主旨在处置先原Cloud Foundry中关于Service Gateway的单点故障题问。在扼要分析Service Gateway的行运程流后，述讲了Service Gateway可能存在的单点故障题问，并对于该题问提出了两种处置案方。随后，本文大篇幅分析了其中一个处置案方Master/Slave Service Gateway机制，要主括包该案方的体具处置的题问和体具现实。最后，对于该机制停止了扼要的评价，要主着眼于可用性与对系统的性能影响。

转载清注明出处。

这篇文档更多出于我本人的理解，定肯在一些方地存在足不和错误。望希本文可以对开始接触Cloud Foundry中service的人有些帮助，如果你对这方面感兴趣，并有更好的法想和议建，也请联系我。

我的邮箱：shlallen@zju.edu.cn

新浪微博：@莲子弗如清

文章结束给大家分享下程序员的一些笑话语录：打赌
飞机上，一位工程师和一位程序员坐在一起。程序员问工程师是否乐意和他一起玩一种有趣的游戏。工程师想睡觉，于是他很有礼貌地拒绝了，转身要睡觉。程序员坚持要玩并解释说这是一个非常有趣的游戏："我问你一个问题，如果你不知道答案，我付你5美元。然后你问我一个问题，如果我答不上来，我付你5美元。"然而，工程师又很有礼貌地拒绝了，又要去睡觉。　　程序员这时有些着急了，他说："好吧，如果你不知道答案，你付5美元；如果我不知道答案，我付50美元。"果然，这的确起了作用，工程师答应了。程序员就问："从地球到月球有多远？"工程师一句话也没有说，给了程序员5美元。　　现在轮到工程师了，他问程序员："什么上山时有三条腿，下山却有四条腿？"程序员很吃惊地看着工程师，拿出他的便携式电脑，查找里面的资料，过了半个小时，他叫醒工程师并给了工程师50美元。工程师很礼貌地接过钱又要去睡觉。程序员有些恼怒，问："那么答案是什么呢？"工程师什么也没有说，掏出钱包，拿出5美元给程序员，转身就去睡觉了。

posted @ 2013-04-17 13:07 坚固66 阅读(382) 评论(0) 编辑收藏举报

刷新页面返回顶部

坚固66

程序员乐园