Preface
I originally believed that the resync in the informer of the Kubernetes controller pattern was the controller periodically syncing with the api-server to guarantee data consistency. Later I discovered that this is not the case. Let me walk through it below. All quotations in this post come from the book Programming Kubernetes.
The controller does not need to sync periodically with the api-server to guarantee data consistency
Consider the following passage:
Programming Kubernetes, Chapter 3 (client-go), "Informers and Caching":
The resync is purely in-memory and does not trigger a call to the server. This used to be different but was eventually changed because the error behavior of the watch mechanism had been improved enough to make relists unnecessary.
Informers also have advanced error behavior: when the long-running watch connection breaks down, they recover from it by trying another watch request, picking up the event stream without losing any events. If the outage is long, and the API server lost events because etcd purged them from its database before the new watch request was successful, the informer will relist all objects.
Next to relists, there is a configurable resync period for reconciliation between the in-memory cache and the business logic: the registered event handlers will be called for all objects each time this period has passed. Common values are in minutes (e.g., 10 or 30 minutes).
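To make the quoted behavior concrete, here is a small self-contained Go sketch. It is not client-go itself: the `informerCache` and `obj` types and the `classify` helper are simplified stand-ins I invented for illustration. The point it shows is that a resync only walks the in-memory cache and re-invokes the registered UpdateFunc with identical old and new objects, so no request ever goes to the API server, and the ResourceVersion does not change.

```go
package main

import "fmt"

// obj stands in for a cached Kubernetes object; these two fields are the
// only parts of ObjectMeta this sketch needs.
type obj struct {
	Name            string
	ResourceVersion string
}

// informerCache is a minimal stand-in for an informer's in-memory store.
type informerCache struct {
	items    []obj
	onUpdate func(old, cur obj)
}

// resync re-delivers every cached object to the registered UpdateFunc.
// It reads only the in-memory cache: no call is made to the API server,
// and old and cur are the same object, so ResourceVersion is unchanged.
func (c *informerCache) resync() {
	for _, o := range c.items {
		c.onUpdate(o, o)
	}
}

// classify tells a real update apart from a resync by comparing
// ResourceVersion, exactly as the book quote describes.
func classify(old, cur obj) string {
	if old.ResourceVersion == cur.ResourceVersion {
		return "resync"
	}
	return "real update"
}

func main() {
	c := &informerCache{items: []obj{{"nginx", "42"}, {"redis", "7"}}}
	c.onUpdate = func(old, cur obj) {
		fmt.Printf("%s: %s\n", cur.Name, classify(old, cur))
	}
	c.resync() // every cached object is re-delivered as a "resync" update
}
```

In real client-go the resync period is the argument you pass when constructing a shared informer factory; the mechanics above are only a sketch of what firing that timer does.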
I remember reading a blog post about this as well: early Kubernetes controllers really did sync periodically with the api-server to guarantee data consistency, but now that the watch mechanism has been improved, that periodic sync is no longer necessary.
As the passage above says, resync is meant to reconcile the business logic with the in-memory cache (the result of the most recent relist).
If resync reconciles the business logic with the in-memory cache, why do some controllers compare ResourceVersion?
The resync interval of 30 seconds in this example leads to a complete set of events being sent to the registered UpdateFunc such that the controller logic is able to reconcile its state with that of the API server. By comparing the ObjectMeta.resourceVersion field, it is possible to distinguish a real update from a resync.
I figured that if the goal were to reconcile the business logic with the in-memory cache, every event should be put into the workqueue. In practice, however, some controllers compare ResourceVersion first. See the issue I filed: in the Kubernetes source, NewDeploymentController in pkg/controller/deployment/deployment_controller.go (line 101) takes a DeploymentInformer and a ReplicaSetInformer. Of the UpdateFuncs registered through AddEventHandler on these two informers, dc.updateDeployment and dc.updateReplicaSet, why does updateReplicaSet compare ResourceVersion while updateDeployment does not?
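The comparison step in question follows the pattern below. This is a hedged sketch, not the actual deployment controller code: `replicaSet` is a pared-down stand-in for `*apps.ReplicaSet`, and the `enqueue` callback stands in for adding a key to the workqueue. Two genuinely different versions of the same object always carry different ResourceVersions, so equal versions identify a resync event, which this UpdateFunc drops instead of enqueuing.

```go
package main

import "fmt"

// replicaSet is a pared-down stand-in for *apps.ReplicaSet.
type replicaSet struct {
	Name            string
	ResourceVersion string
}

// updateReplicaSet sketches the UpdateFunc pattern from the deployment
// controller: if the ResourceVersion has not changed, the event is a
// periodic resync rather than a real update, so nothing is enqueued.
func updateReplicaSet(old, cur replicaSet, enqueue func(replicaSet)) {
	if cur.ResourceVersion == old.ResourceVersion {
		return // resync: the object did not actually change
	}
	enqueue(cur)
}

func main() {
	var queued []string
	enqueue := func(rs replicaSet) { queued = append(queued, rs.Name) }

	rs := replicaSet{Name: "web-7d4b9", ResourceVersion: "100"}
	updateReplicaSet(rs, rs, enqueue)                                              // resync, skipped
	updateReplicaSet(rs, replicaSet{Name: "web-7d4b9", ResourceVersion: "101"}, enqueue) // real update
	fmt.Println(queued) // only the real update is queued
}
```

The question in the issue, then, is not what the check does but why updateDeployment omits it while updateReplicaSet has it.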
I still don't quite understand how resync reconciles the business logic. I'll dig into it further and update this post once I've figured it out.
Updated on 2020-12-08
I asked Brendan Burns about this, and he replied by email:
Hello,
I'm not super familiar with that code since I wasn't involved in writing it, but my guess from looking at it is that it is an oversight. In both cases it is possible that there will be an "update" where the resource version doesn't change, due to a re-list of the resources. Such an "update" is in fact a no-op and should be ignored. So I think it would be reasonable to add the same check to the updateDeployment code.
Hope that helps.
--brendan
His answer still didn't resolve my doubts, so I also asked this question on discuss.kubernetes.io, but no one has replied yet.
Updated on 2021-04-08
I opened a PR against Kubernetes, "Add compare ResourceVersion process". A group member had suggested that filing a PR would reveal how the people maintaining that code see the question, and I wanted to settle it that way. By now I essentially have the answer: Programming Kubernetes appears to be wrong on this point. Resync does not reconcile the business logic with the in-memory cache (the result of the most recent relist); it handles inconsistencies between the results of the two most recent relists. Without the ResourceVersion comparison, the subsequent reconcile simply does some extra work; with it, the reconcile saves that work. Either way the final result is the same, which is why some places in the Kubernetes source have the compare-ResourceVersion step and others don't.