Paper Notes (2) — "Adaptive Federated Optimization"

Intuition

The authors argue that the gap between centralized and federated performance is caused by two factors: 1) client drift, and 2) the lack of adaptivity in the server update.

Unlike variance-reduction methods, they extend federated learning with adaptive optimizers such as Adam.

They rewrote the update rule of FedAvg as

\[x_{t+1}=x_t-\frac{1}{\vert \mathcal S\vert}\sum_{i\in\mathcal S}(x_t-x_i^t) \]

Let \(\Delta_i^t=x_{i}^t-x_t\), where \(x_i^t\) denotes the model of client \(i\) after local training.
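Equivalently, the averaged client update \(\Delta_t\) acts as a pseudo-gradient, and the server takes a step along it:

\[\Delta_t=\frac{1}{\vert\mathcal S\vert}\sum_{i\in\mathcal S}\Delta_i^t,\qquad x_{t+1}=x_t+\eta\,\Delta_t.\]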

In FedAvg, the server optimizer is simply SGD with learning rate \(\eta=1\), and \(\Delta_i^t\) plays the role of a pseudo-gradient. They proposed that, besides SGD, the server optimizer could be an adaptive method for updating the server model \(x\). Their framework is as follows:

[Image: the FedOpt framework (generalized FedAvg with a pluggable server optimizer)]
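To make this concrete, below is a minimal Python sketch of one such round (my own illustration, not the paper's pseudocode): each client runs a few local SGD steps on a toy quadratic objective, and the server applies an Adam-like update to the averaged pseudo-gradient, in the spirit of FedAdam. All helper names and the toy objective are assumptions made for illustration.

```python
import numpy as np

def stochastic_gradient(x, center, rng, noise=0.1):
    # Toy client gradient: gradient of 0.5 * ||x - center||^2 plus Gaussian noise.
    return (x - center) + noise * rng.standard_normal(x.shape)

def local_update(x, center, rng, lr_local=0.1, num_steps=5):
    # Client side: a few local SGD steps starting from the server model x;
    # returns the pseudo-gradient Delta_i = x_i - x.
    x_i = x.copy()
    for _ in range(num_steps):
        x_i -= lr_local * stochastic_gradient(x_i, center, rng)
    return x_i - x

def server_round(x, m, v, centers, rng, lr=0.5, beta1=0.9, beta2=0.99, tau=1e-3):
    # Server side: average the client pseudo-gradients and take an Adam-like step.
    delta = np.mean([local_update(x, c, rng) for c in centers], axis=0)
    m = beta1 * m + (1 - beta1) * delta          # first moment of the pseudo-gradient
    v = beta2 * v + (1 - beta2) * delta ** 2     # second moment (elementwise)
    x = x + lr * m / (np.sqrt(v) + tau)          # adaptive server update
    return x, m, v

rng = np.random.default_rng(0)
centers = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([-2.0, 0.5])]  # client optima
x, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for _ in range(50):
    x, m, v = server_round(x, m, v, centers, rng)
print(x)  # should move toward the average of the client optima
```

With an SGD server optimizer and \(\eta=1\), the same loop reduces to plain FedAvg.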

Convergence

Multi-step local updates, concretely the fact that \(E[\Delta_i^t]\neq -\eta_l K\nabla F_i(x_t)\) (with \(\eta_l\) the local learning rate and \(K\) the number of local steps), obstruct the convergence analysis: after the first local step, the later local gradients are evaluated at drifted iterates rather than at \(x_t\). In my opinion, the error bound they obtain is fairly rough.

I'll only give my personal analysis of their proof of Theorem 1; my thoughts on Theorem 2 are similar.

First, we need to build a relationship between \(x_{t+1}\) and \(x_t\). By the update rule of the adaptive method and the \(L\)-smoothness assumption, we have

[Image: the one-step descent inequality from \(L\)-smoothness]
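For readability, here is my rough reconstruction of that step, assuming the FedAdagrad-style server update \(x_{t+1}=x_t+\eta\,\Delta_t/(\sqrt{v_t}+\tau)\) (elementwise division):

\[f(x_{t+1})\leq f(x_t)+\eta\Big\langle\nabla f(x_t),\frac{\Delta_t}{\sqrt{v_t}+\tau}\Big\rangle+\frac{L\eta^2}{2}\Big\Vert\frac{\Delta_t}{\sqrt{v_t}+\tau}\Big\Vert^2.\]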

Furthermore, as in Adagrad, we have

[Image: decomposition of the inner-product term into \(T_1\) and \(T_2\)]
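The exact expression is in the image above; roughly (my reconstruction of its shape, not copied from the paper), the inner-product term is split by comparing the new preconditioner \(\sqrt{v_t}\) with the previous one \(\sqrt{v_{t-1}}\):

\[\Big\langle\nabla f(x_t),\frac{\Delta_t}{\sqrt{v_t}+\tau}\Big\rangle=\underbrace{\Big\langle\nabla f(x_t),\frac{\Delta_t}{\sqrt{v_{t-1}}+\tau}\Big\rangle}_{\sim T_1}+\underbrace{\Big\langle\nabla f(x_t),\frac{\Delta_t}{\sqrt{v_t}+\tau}-\frac{\Delta_t}{\sqrt{v_{t-1}}+\tau}\Big\rangle}_{\sim T_2},\]

so that \(T_2\) only measures how much the preconditioner changed within one round.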

Now we need to bound these two terms, \(T_1\) and \(T_2\).

So far, \(T_2\) does not involve the local training structure, and we can follow the same process as in the Adagrad analysis:

[Images: the Adagrad-style bound on \(T_2\)]

To bound \(T_1\), they tried to link \(\Delta_t\), which contains the local updates, with \(\nabla f(x_t)\).

As mentioned above, \(\Delta_t\) serves as a kind of pseudo-gradient for \(\nabla f(x_t)\):

[Image: rewriting \(T_1\) in terms of the local updates]

Again, note that \(x_{i,k}^t\) is the \(k\)-th iterate of local training on client \(i\), and \(x_t\) is the server model at round \(t\).
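For concreteness, with \(K\) local SGD steps of size \(\eta_l\), the client update unrolls as (my restatement of the definitions above)

\[\Delta_i^t=x_{i,K}^t-x_t=-\eta_l\sum_{k=1}^{K}g_{i,k}^t,\]

so \(\Delta_t\) aggregates stochastic gradients evaluated at the drifted local iterates rather than at \(x_t\), and this mismatch is exactly what has to be controlled.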

[Image: intermediate bound introducing the drift term \(x_{i,k}^t-x_t\)]

In my opinion, how to bound \(x_{i,k}^t-x_t\) is the most impressive part of the whole paper.

[Image: the bound on \(E\Vert x_{i,k}^t-x_t\Vert^2\)]

Honestly, the local gradient \(g_{i,k}^t\) is what builds the bridge between \(x_{i,k}^t\) and \(x_t\), yet \(E[g_{i,k}^t]=E[\nabla F_i(x_{i,k-1}^t)]\neq \nabla F_i(x_t)\) in general.

[Image: bounding the drift via the local stochastic gradients]

The second inequality is quite rough and unclear to me. To handle the accumulated local update \(\eta_l(\cdots)\), they used \(E[\Vert z_1+z_2+\dots+z_r\Vert^2]\leq r\big(E[\Vert z_1\Vert^2]+E[\Vert z_2\Vert^2]+\dots+E[\Vert z_r\Vert^2]\big)\).
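As a sanity check on that step (my own illustration, using the same inequality), the drift after \(k\) local steps can be expanded and bounded as

\[\Vert x_{i,k}^t-x_t\Vert^2=\eta_l^2\Big\Vert\sum_{j=1}^{k}g_{i,j}^t\Big\Vert^2\leq \eta_l^2\,k\sum_{j=1}^{k}\Vert g_{i,j}^t\Vert^2,\]

which is where the extra factors of \(K\) appear and why the local learning rate \(\eta_l\) has to be kept small.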

posted @ 2021-01-22 23:21  Neo_DH