Wasserstein GAN and the Kantorovich-Rubinstein Duality

Wasserstein GAN and the Kantorovich-Rubinstein Duality

From what I can tell, there is much interest in the recent Wasserstein GAN paper. In this post, I don’t want to repeat the justifications, mechanics and promised benefit of WGANs, for this you should read the original paper or this excellent summary. Instead, we will focus mainly on one detail that is only mentioned quickly, but I think lies in some sense at the heart of it: the Kantorovich-Rubinstein duality, or rather a special case of it. This is of course not a new result, but the application is very clever and attractive.

The paper cites the book “Optimal Transport - Old and New” by Fields-Medal winner and french eccentric Cedric Villani, you can download it from his homepage. That’s about a thousand pages targeted at math PhDs and researchers - have fun! Villani also talks about this topic in an accessible way in this lecture, at around the 28 minute mark. Generally though, I found it very hard to find material that gives real explanations but is not bursting with definitions and references to theorems I didn’t know. Maybe this post will help to fill this gap little bit. We will only use basic linear algebra, probability theory and optimization. These will not be rigorous proofs, and we will generously imply many regularity conditions. But I tried to make the chains of reasoning as clear and complete as possible, so it should be enough get some intuition for this subject.

The argument for our case of the Kantorovich-Rubinstein duality is actually not too complicated and stands for itself. It is, however, very abstract, which is why I decided to defer it to the end and start with the nice discrete case and somewhat related problems in Linear Programming.

If you’re interested, you can take a look at the Jupyter notebook that I created to plot some of the graphics in this post.

Earth Mover’s Distance

For discrete probability distributions, the Wasserstein distance is also descriptively called the earth mover’s distance (EMD). If we imagine the distributions as different heaps of a certain amount of earth, then the EMD is the minimal total amount of work it takes to transform one heap into the other. Work is defined as the amount of earth in a chunk times the distance it was moved. Let’s call our discrete distributions Pr and Pθ, each with l possible states x or y respectively, and take two arbitrary distributions as an example.

Fig.1: Probability distribution Pr and Pθ, each with ten states

Calculating the EMD is in itself an optimization problem: There are infinitely many ways to move the earth around, and we need to find the optimal one. We call the transport plan that we are trying to find γ(x,y). It simply states how we distribute the amount of earth from one place x over the domain of y, or vice versa.

To be a valid transport plan, the constraints xγ(x,y)=Pr(y) and yγ(x,y)=Pθ(x) must of course apply. This ensures that following this plan yields the correct distributions. Equivalently, we can call γ a joined probability distribution and require that γΠ(Pr,Pθ), where Π(Pr,Pθ) is the set of all distributions whose marginals are Pr or Pθ respectively. To get the EMD, we have to multiply every value of γ with the Euclidian distance between x and y. With that, the definition of the Earth mover’s distance is:

EMD(Pr,Pθ)=infγΠx,yxyγ(x,y)=infγΠ E(x,y)γxy

If you’re not familiar with the expression inf, it stands for infimum, or greatest lower bound. It is simply a slight mathematical variation of the minimum. The opposite is sup or supremum, roughly meaning maximum, which we will come across later. We can also set Γ=γ(x,y) and D=xy, with Γ,DRl×l. Now we can write

EMD(Pr,Pθ)=infγΠD,ΓF

where ,F is the Frobenius inner product (sum of all the element-wise products).

Fig.2: Transport plan Γ with Pr and Pθ as marginal probabilities, and distances D

Linear Programming

In the picture above you can see the optimal transport plan Γ. It can be calculated using the generic method of Linear Programming (LP). With LP, we can solve problems of a certain canonical form: Find a vector xRn that minimizes the cost z=cTx, cRn. Additionally, x is constrained by the equation Ax=b, ARm×n,bRm and x0.

To cast our problem of finding the EMD into this form, we have to flatten Γ and D:

(1)x=vec(Γ)(2)c=vec(D)

This means n=l2. For the constraints, we concatenate the target distributions, which makes m=2l:

b=[PrPθ]

For A, we have to construct a large sparse binary matrix that picks out the values from x and sums them to get b. This schematic should make it clear:

[Pr(x1)Pr(x2)Pr(xn)Pθ(y1)Pθ(y2)Pθ(yn)]}bTx{[γ(x1,y1)γ(x1,y2)γ(x2,y1)γ(x2,y2)γ(xn,y1)γ(xn,y2)][100100100010010100010010001100001010]}AT

With that, we can call a standard LP routine, for example linprog() from scipy.

import numpy as np
from scipy.optimize import linprog


# We construct our A matrix by creating two 3-way tensors,



and then reshaping and concatenating them



A_r = np.zeros((l, l, l))

A_t = np.zeros((l, l, l))




for i in range(l):

for j in range(l):

A_r[i, i, j] = 1

A_t[i, j, i] = 1




A = np.concatenate((A_r.reshape((l, l2)), A_t.reshape((l, l2))), axis=0)

b = np.concatenate((P_r, P_t), axis=0)

c = D.reshape((l**2))


opt_res = linprog(c, A_eq=A, b_eq=b)

emd = opt_res.fun

gamma = opt_res.x.reshape((l, l))

Now we have our transference plan, as well as the EMD.

Fig.3: Optimal transportation between Pr and Pθ

Dual Form

Unfortunately, this kind of optimization is not practical in many cases, certainly not in domains where GANs are usually used. In our example, we use a one-dimensional random variable with ten possible states. The number of possible discrete states scales exponentially with the number of dimensions of the input variable. For many applications, e.g. images, the input can easily have thousands of dimensions. Even an approximation of γ is then virtually impossible.

But actually we don’t care about γ. We only want a single number, the EMD. Also, we want to use it to train our generator network, which generates the distribution Pθ. To do this, we must be able to calculate the gradient PθEMD(Pr,Pθ). Since Pr and Pθ are only constraints of our optimization, this not possible in any straightforward way.

As it turns out, there is another way of calculating the EMD that is much more convenient. Any LP has two ways in which the problem can be formulated: The primal form, which we just used, and the dual form.

primal form:dual form:minimize z= cTx,so that Ax= band x 0maximize z~= bTy,so that ATy c

By changing the relations between the same values, we can turn our minimization problem into a maximization problem. Here the objective z~ directly depends on b, which contains the our distributions Pr and Pθ. This is exactly that we want. It is easy to see that z~ is a lower bound of z:

z=cTxyTAx=yTb=z~

This is called the Weak Duality theorem. As you might have guessed, there also exists a Strong Duality theorem, which states that, should we find an optimal solution for z~, then z=z~. Proving it is a bit more complicated and requires Farkas theorem as an intermediate result.

Farkas Theorem

We can regard the columns of a matrix A^Rd×n as vectors a1,a2,,anRd. The set of all possible linear combinations of these vectors with nonnegative coefficients is a convex cone with its apex (peak) at the origin (note that a convex cone could also potentially cover the whole Rd space). We can combine these coefficients in a vector xR0n.

For a vector b^Rd, there are now exactly two possibilities: Either b^ is contained in the cone, or not. If b^ is not contained, then we can fit a hyperplane h that goes through the origin between the convex cone and b^. We can define it in terms of only its normal vector y^Rd. If a vector vRd lies on h, then vTy^=0, if v lies in the upper half-space of h (the same side as y^), then vTy^>0 and if v lies in the lower half-space (the opposite side of y^), then vTy^<0. As we specified, if h exists, then b^ lies in a different half-space than all vectors ai.

Fig.4: Geometrical view of Farkas theorem: If b does not lie inside or on the blue cone, then we can fit a hyperplane h between b and the cone

Summarized, exactly one of the following statements is true:

  • (1) There exists xRn, so that A^x=b^ and x0
  • (2) There exists y^Rd, so that A^Ty^0 and b^Ty^>0

This is called Farkas theorem, or Farkas alternatives. There exist slightly different versions and several proofs, but what we showed is sufficient for our purposes.

Strong Duality

The trick for the second part of this proof is to construct a problem that is related to our original LP forms, but with one additional dimension and in such a way that b^ lies right at the edge of the convex cone. Then according to Farkas, for some y^, the corresponding hyperplane comes arbitrarily close to b^. From this, in combination with the Weak Duality theorem, we will proof the Strong Duality.

Let the minimal solution to the primal problem be z=cTx. Then we define

A^=[AcT],   b^ϵ=[bz+ϵ],   y^=[yα]

with ϵ,αR. For ϵ=0, we have Farkas case (1), because A^x=b^0. For ϵ>0, there exists no nonnegative solution (because z is already minimal) and we have Farkas case (2). This means there exist y and α, such that

[AcT]T[yα]0,    [bz+ϵ][yα]>0

or equivalently

ATyαc,  bTy>α(zϵ).

The way we constructed it, we can find y^ so that additionally b^0y^=0. Then changing ϵ to any number greater than 0 has to result in b^0y^>0. In our specific problem, this is only possible if α>0, because z>0. We showed that y^ is simply the normal vector of a hyperplane, and because of that we can freely scale it to any magnitude greater than zero. Particularly, we can scale it so that α=1. This means there exists a vector y, so that

ATyc,  bTy>zϵ.

We see that z~:=zϵ for any ϵ>0 is a feasible value of the objective of our dual problem. From the Weak Duality theorem, we know that z~z. We just showed that z~ can get arbitrarily close to z. This means the optimal (maximal) value of our dual form is also z.

Dual Implementation

Now we can confidently use the dual form to calculate the EMD. As we showed, the maximal value z~=bTy is the EMD. Let’s define

y=[fg]

with f,gRd. This means EMD(Pr,Pθ)=fTPr+gTPθ. Recall the constraints of the dual form: ATyc.

[D1,1D1,2D2,1D2,2Dn,1Dn,2]}cTy{[f(x1)f(x2)f(xn)g(x1)g(x2)g(xn)][110000001100000011101010010101000000]}AT;

We have written the vectors f and g as values of the functions f and g. The constraints can be summarized as f(xi)+g(xj)Di,j. The case i=j yields g(xi)f(xi) for all i, because Di,i=0. Since Pr and Pθ are nonnegative, to maximize our objective, ifi+gi has to be as great as possible. This sum has a maximum value of 0, which we achieve with g=f. This remains true given all additional constraints: Choosing g(xi)<f(xi) would only make sense if it had benefits for values f(xj) and g(xj) with ji. Since f(xj) and g(xj) remain upper-bounded by the other constraints, this is not the case.

Fig.5: Constraints for the dual solution. Blue and red lines depict the upper bounds for f and g respectively. For every fg, there is a net loss of the optimization function.

For g=f the constraints become f(xi)f(xj)Di,j and f(xi)f(xj)Di,j. If we see the values f(xi) as connected with line segments, this means that the upward and the downward slope of these segments is limited. In our case, where we use Euclidian distances, these slope limits are 1 and 1. We call this constraint Lipschitz continuity (with Lipschitz constant 1) and write fL1. With that, our dual form of the EMD is

EMD(Pr,Pθ)=supfL1 ExPrf(x)ExPθf(x).

The implementation is straightforward:

# linprog() can only minimize the cost, because of that
# we optimize the negative of the objective. Also, we are
# not constrained to nonnegative values.
opt_res = linprog(-b, A, c, bounds=(None, None))
emd = -opt_res.fun

f = opt_res.x[0:l]

g = opt_res.x[l:]

Fig.6: Element-wise multiplication of the optimal f and Pr, as well as g=f and Pθ.

As we see, the optimal strategy is of course to set f(x) to high values where Pr(x)>Pθ(x), and to low values where Pθ>Pr.

Wasserstein Distance

Lastly, we have to consider continuous probability distributions. We can of course view them intuitively as discrete distributions with infinitely many states and use a similar reasoning as described so far. But as I mentioned at the beginning, we will try something neater. Let our continuous distributions be pr and pθ, and the set of joined distributions with marginals pr and pθ be π(pr,pθ). Then the Wasserstein distance is defined as

W(pr,pθ)=infγπx,yxyγ(x,y)dxdy=infγπEx,yγ[xy].

If we add suitable terms, we can remove all constraints on the distribution γ. This is done by adding an additional optimization over a function f:xkR, which rules out all γπ as solutions:

W(pr,pθ)=infγπEx,yγ[xy]=infγEx,yγ[xy+supfEspr[f(s)]Etpθ[f(t)](f(x)f(y))]={(3)0,   if γπ(4)+   else=infγ supf Ex,yγ[xy+Espr[f(s)]Etpθ[f(t)](f(x)f(y))]

Now we have bilevel optimization. This means we take the optimal solution of the inner optimization (supf) for every value of the outer optimization (infγ), and from these possible solutions choose the one that is optimal for the outer optimization. To continue, we have to make use of the minimax-principle, which says that in certain cases we can invert the order of inf and sup without changing the solution. But first we have to show that our problem is indeed such a case.

Consider some function g:A,BR. Let g(a^,b^)=infaAsup bBg(a,b) (‘the least of all maxima’) and g(a^,b^)=sup bBinfaAg(a,b) (‘the greatest of all minima’). The argument that g(a^,b^)g(a^,b^) is simple: Any g(a^,b^) is automatically allowed as a candidate for the infimum of sup bBinfaAg(a,b), but not the other way around. This is equivalent to the statement supbg(a,b)g(a,b)infag(a,b)a,b, which is true by definition.

For g(a^,b^)>g(a^,b^), at least one of these statements must be true:

  • g(a^+t,b^)<g(a^,b^) for some t0. This is only possible if sup bBg(a,b) is not convex in a, because g(a^,b^) is already an infimum for a^.
  • g(a^,b^+t)>g(a^,b^) for some t0. This is only possible if inf aAg(a,b) is not concave in b, because g(a^,b^) is already a supremum for b^.

This means of course that, if sup bBg(a,b) is convex and inf aAg(a,b) is concave, then the minimax principle applies and g(a^,b^)=g(a^,b^). In our case, we can above already see from the underbrace that the convexity condition is met. Let’s try changing to supinf:

=,supf infγ Ex,yγ[xy+Espr[f(s)]Etpθ[f(t)](f(x)f(y))]=supf Espr[f(s)]Etpθ[f(t)]+infγ Ex,yγ[xy(f(x)f(y))]={(5)0,   if fL1(6)   else

We see that the infimum is concave, as required. Because all functions f that are Lipschitz continuous produce the same optimal solution for infγ, and only they are feasible solutions of supf, we can turn this condition into a constraint. With that, we have the dual form of the Wasserstein distance:

(7)W(pr,pθ)=supf Espr[f(s)]Etpθ[f(t)]+infγ Ex,yγ[xy(f(x)f(y))](8)=supfL1 Espr[f(s)]Etpθ[f(t)]

This is our case of the Kantorovich-Rubinstein duality. It actually holds for other metrics than just the Euclidian metric we used. But the function f is suitable to be approximated by a neural network, and this version has the advantage that the Lipschitz continuity can simply be achieved by clamping the weights.

Sources

posted @   jasonzhangxianrong  阅读(17)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· CSnakes vs Python.NET:高效嵌入与灵活互通的跨语言方案对比
· 【.NET】调用本地 Deepseek 模型
· Plotly.NET 一个为 .NET 打造的强大开源交互式图表库
· 上周热点回顾(2.17-2.23)
· 如何使用 Uni-app 实现视频聊天(源码,支持安卓、iOS)
历史上的今天:
2021-07-26 conda配置镜像并安装gpu版本pytorch和tensorflow2
点击右上角即可分享
微信分享提示