Leader Election Logic of Controller-Manager and Scheduler in k8s
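The K8s control plane consists of apiserver, controller-manager, scheduler, and etcd. Building a highly available cluster raises the question of leader election for some of these components. etcd stores all of the cluster's state; it handles reads and writes and replicates data between multiple etcd members, so its consistency requirements are strict and it uses the relatively complex raft algorithm to elect the leader through which data is committed. apiserver is the cluster's entry point and is itself a stateless web server, so requests can simply be load-balanced across multiple apiserver instances and no leader election is needed. Controller-Manager and Scheduler are task-style components: the controllers built into controller-manager watch apiserver for the latest change events on each kind of k8s resource object and reconcile the actual state toward the desired state, while the scheduler watches for pods that are not yet bound to a node and selects a node for them. Running several copies of these tasks at the same time is unnecessary, so controller-manager and scheduler also need leader election. Their election logic differs from etcd's, however: it only has to ensure that one instance among the controller-manager replicas, and one among the scheduler replicas, enters the working state, with no need to consider data consistency or synchronization between them.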
Leader election flags as described by kube-scheduler
/ # kube-scheduler -h 2>&1 | grep -i leader
      --leader-elect                               Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. (default true)
      --leader-elect-lease-duration duration       The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
      --leader-elect-renew-deadline duration       The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. (default 10s)
      --leader-elect-resource-lock endpoints       The type of resource object that is used for locking during leader election. Supported options are endpoints (default) and `configmaps`. (default "endpoints")
      --leader-elect-retry-period duration         The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)
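The analysis below is based on the k8s 1.11 source code, with Endpoints as the lock resource.
1. On startup the scheduler first runs leader election, then calls back into the scheduler's run function to enter the scheduling logic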
// https://sourcegraph.com/github.com/kubernetes/kubernetes@release-1.11/-/blob/cmd/kube-scheduler/app/server.go
func Run(c schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
	......
	// Prepare a reusable run function.
	run := func(stopCh <-chan struct{}) {
		sched.Run()
		<-stopCh
	}

	// If leader election is enabled, run via LeaderElector until done and exit.
	if c.LeaderElection != nil {
		c.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				utilruntime.HandleError(fmt.Errorf("lost master"))
			},
		}
		leaderElector, err := leaderelection.NewLeaderElector(*c.LeaderElection)
		leaderElector.Run()
	}
}
2. Run calls the acquire method directly to try to become leader
// Run starts the leader election loop
func (le *LeaderElector) Run() {
	defer func() {
		runtime.HandleCrash()
		le.config.Callbacks.OnStoppedLeading()
	}()
	le.acquire()
	stop := make(chan struct{})
	go le.config.Callbacks.OnStartedLeading(stop)
	le.renew()
	close(stop)
}
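To make the ordering in Run easier to follow, here is a minimal standalone sketch of the same shape: block until leadership is acquired, start the workload in a goroutine, block while renewing, then close the stop channel. The callbacks type, runOnce, and trace are hypothetical stand-ins written for this illustration, and the stub acquire and renew steps do no real election work.

```go
package main

import (
	"fmt"
	"sync"
)

// callbacks is a hypothetical stand-in for leaderelection.LeaderCallbacks.
type callbacks struct {
	OnStartedLeading func(stop <-chan struct{})
	OnStoppedLeading func()
}

// runOnce mirrors the shape of LeaderElector.Run: block in acquire, start the
// workload in a goroutine, block in renew, then signal the workload to stop.
func runOnce(acquire, renew func(), cb callbacks) {
	defer cb.OnStoppedLeading()
	acquire()
	stop := make(chan struct{})
	go cb.OnStartedLeading(stop)
	renew()
	close(stop)
}

// trace runs one lifecycle with stub steps and returns the recorded events.
func trace() []string {
	var mu sync.Mutex
	var events []string
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}
	started := make(chan struct{})
	finished := make(chan struct{})
	cb := callbacks{
		OnStartedLeading: func(stop <-chan struct{}) {
			record("work started")
			close(started)
			<-stop // keep working until leadership is lost
			record("work stopped")
			close(finished)
		},
		OnStoppedLeading: func() { record("lost leadership") },
	}
	acquire := func() { record("acquired") }
	// The stub renew returns once the workload is running, as if a later
	// renewal attempt had failed.
	renew := func() { <-started; record("renew failed") }

	runOnce(acquire, renew, cb)
	<-finished
	mu.Lock()
	defer mu.Unlock()
	return events
}

func main() {
	fmt.Println(trace())
}
```

The first three events always appear in the order acquired, work started, renew failed. The last two ("work stopped" and "lost leadership") can appear in either order, because the stop channel is closed without waiting for the workload goroutine to finish.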
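3. The acquire method calls tryAcquireOrRenew in a loop, at the interval given by leader-elect-retry-period. Here le.config.Lock is of type EndpointsLock. EndpointsLock.Identity() returns this instance's identity, which is its hostname followed by "_" and a UUID, and EndpointsLock.Get asks apiserver for the election record stored in etcd.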
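If the ep election record cannot be fetched from apiserver because it does not exist yet, the instance tries to create it with itself as leader.
Judged by its own observed time, if the lease (15s) has not yet expired and the instance is not the leader, it must not take over leadership, so there is nothing more for it to do in this round.
If the instance is already the leader, then regardless of whether the lease has expired it tries to renew using the current time, keeping acquireTime and the leader transition count unchanged; otherwise the transition count is incremented by 1.
It then sends apiserver a request to update the ep election record. apiserver makes updates from multiple clients atomic: by comparing the resourceVersion (which corresponds to the modifiedindex in etcd), it ensures that only one client's update succeeds and the others receive a 409.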
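The version check in the last step can be sketched in a few lines. The store type below is a hypothetical in-memory stand-in for apiserver plus etcd, not the real API: an update is accepted only if the caller read the latest version, and the loser gets a conflict error playing the role of the 409.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict plays the role of the HTTP 409 returned by apiserver.
var errConflict = errors.New("409 conflict: resourceVersion mismatch")

// object is a hypothetical stored record with a version counter.
type object struct {
	ResourceVersion int
	Holder          string
}

// store is a hypothetical in-memory stand-in for apiserver plus etcd.
type store struct {
	mu  sync.Mutex
	obj object
}

// Get returns a copy of the current object, including its version.
func (s *store) Get() object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// Update succeeds only if the caller's copy carries the latest version.
func (s *store) Update(o object) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if o.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// race has two candidates read the same version and then both try to write.
func race() (errA, errB error, winner string) {
	s := &store{}
	a, b := s.Get(), s.Get()
	a.Holder, b.Holder = "scheduler-a", "scheduler-b"
	errA = s.Update(a)
	errB = s.Update(b)
	return errA, errB, s.Get().Holder
}

func main() {
	errA, errB, winner := race()
	fmt.Println(errA, errB, winner)
	// prints: <nil> 409 conflict: resourceVersion mismatch scheduler-a
}
```

Whichever write reaches the store first bumps the version, so the second write is rejected even though both candidates started from the same read.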
Lock is initialized as an EndpointsLock:

type EndpointsLock struct {
	// EndpointsMeta should contain a Name and a Namespace of an
	// Endpoints object that the LeaderElector will attempt to lead.
	EndpointsMeta metav1.ObjectMeta
	Client        corev1client.EndpointsGetter
	LockConfig    ResourceLockConfig
	e             *v1.Endpoints
}

// Get returns the election record from a Endpoints Annotation
func (el *EndpointsLock) Get() (*LeaderElectionRecord, error) {
	var record LeaderElectionRecord
	var err error
	el.e, err = el.Client.Endpoints(el.EndpointsMeta.Namespace).Get(el.EndpointsMeta.Name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	if recordBytes, found := el.e.Annotations[LeaderElectionRecordAnnotationKey]; found {
		if err := json.Unmarshal([]byte(recordBytes), &record); err != nil {
			return nil, err
		}
	}
	return &record, nil
}

// If this instance is not the leader, try to become leader; if it already is the leader, try to renew the lease
// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
func (le *LeaderElector) tryAcquireOrRenew() bool {
	now := metav1.Now()
	// Identity() returns this instance's hostname + "_" + string(uuid.NewUUID())
	// Prepare a leaderElectionRecord naming this instance as leader, for use if the acquire succeeds
	leaderElectionRecord := rl.LeaderElectionRecord{
		HolderIdentity:       le.config.Lock.Identity(),
		LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
		RenewTime:            now,
		AcquireTime:          now,
	}

	// 1. obtain or create the ElectionRecord
	oldLeaderElectionRecord, err := le.config.Lock.Get()
	// If the ep cannot be fetched from apiserver because it does not exist yet, try to create it with this instance as leader
	if err != nil {
		// Any error other than NotFound fails this attempt
		if !errors.IsNotFound(err) {
			return false
		}
		if err = le.config.Lock.Create(leaderElectionRecord); err != nil {
			return false
		}
		le.observedRecord = leaderElectionRecord
		le.observedTime = le.clock.Now()
		return true
	}

	// 2. Record obtained, check the Identity & Time
	// The leader record in apiserver differs from the locally observed one, so update the local record
	if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
		le.observedRecord = *oldLeaderElectionRecord
		le.observedTime = le.clock.Now()
	}
	// Judged by the locally observed time, if the lease (15s) has not expired and this instance is not the leader, there is nothing more to do
	if le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
		oldLeaderElectionRecord.HolderIdentity != le.config.Lock.Identity() {
		return false
	}

	// 3. We're going to try to update. The leaderElectionRecord is set to it's default
	// here. Let's correct it before updating.
	// Reaching this point means one of: 1. not the leader, but the lease has expired
	// 2. the leader, and the lease has not expired  3. the leader, but the lease has expired
	// If this instance is already the leader (cases 2 and 3), renew with the current time regardless of
	// whether the lease has expired, keeping acquireTime and the transition count; otherwise add 1 to the transition count
	if oldLeaderElectionRecord.HolderIdentity == le.config.Lock.Identity() {
		leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
	} else {
		leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
	}

	// update the lock itself
	// Send apiserver a request to update the ep; apiserver makes updates from multiple clients atomic,
	// and its resourceVersion mechanism ensures that only one client's update succeeds
	if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
		glog.Errorf("Failed to update lock: %v", err)
		return false
	}
	le.observedRecord = leaderElectionRecord
	le.observedTime = le.clock.Now()
	return true
}