Wasserstein GAN and the Kantorovich-Rubinstein Duality
From what I can tell, there is much interest in the recent Wasserstein GAN paper. In this post, I don't want to repeat the justifications, mechanics and promised benefits of WGANs; for that you should read the original paper or this excellent summary. Instead, we will focus mainly on one detail that is only mentioned quickly, but I think lies in some sense at the heart of it: the Kantorovich-Rubinstein duality, or rather a special case of it. This is of course not a new result, but the application is very clever and attractive.
The paper cites the book "Optimal Transport - Old and New" by Fields-Medal winner and French eccentric Cédric Villani, which you can download from his homepage. That's about a thousand pages targeted at math PhDs and researchers - have fun! Villani also talks about this topic in an accessible way in this lecture, at around the 28 minute mark. Generally though, I found it very hard to find material that gives real explanations but is not bursting with definitions and references to theorems I didn't know. Maybe this post will help to fill this gap a little bit. We will only use basic linear algebra, probability theory and optimization. These will not be rigorous proofs, and we will generously gloss over many regularity conditions. But I tried to make the chains of reasoning as clear and complete as possible, so it should be enough to get some intuition for this subject.
The argument for our case of the Kantorovich-Rubinstein duality is actually not too complicated and stands on its own. It is, however, very abstract, which is why I decided to defer it to the end and start with the nice discrete case and somewhat related problems in Linear Programming.
If you’re interested, you can take a look at the Jupyter notebook that I created to plot some of the graphics in this post.
Earth Mover’s Distance
For discrete probability distributions, the Wasserstein distance is also descriptively called the earth mover's distance (EMD). If we imagine the distributions as different heaps of a certain amount of earth, then the EMD is the minimal total amount of work it takes to transform one heap into the other. Work is defined as the amount of earth in a chunk times the distance it was moved. Let's call our discrete distributions $P_r$ and $P_t$, each with $l$ possible states $x$ or $y$ respectively. The EMD is then

$$\mathrm{EMD}(P_r, P_t) = \min_{\gamma \in \Pi} \; \mathbb{E}_{(x, y) \sim \gamma} \|x - y\|,$$

where $\Pi$ is the set of all valid transport plans $\gamma$.
Calculating the EMD is in itself an optimization problem: There are infinitely many ways to move the earth around, and we need to find the optimal one. We call the transport plan that we are trying to find $\gamma(x, y)$: it states how much earth gets moved from point $x$ to point $y$.
To be a valid transport plan, the constraints

$$\sum_y \gamma(x, y) = P_r(x) \qquad \text{and} \qquad \sum_x \gamma(x, y) = P_t(y)$$

have to be fulfilled: all the mass moved away from a point $x$ has to equal the amount of earth that $P_r$ puts there, and all the mass arriving at a point $y$ has to equal $P_t(y)$.
If you're not familiar with the expression $\mathbb{E}_{(x, y) \sim \gamma} \|x - y\|$: it is simply the expected distance between a pair of points drawn from the joint distribution $\gamma$,

$$\mathbb{E}_{(x, y) \sim \gamma} \|x - y\| = \sum_{x, y} \gamma(x, y) \, \|x - y\|,$$

where the sum runs over all pairs of states $(x, y)$.
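Before optimizing anything, it may help to check these definitions on a toy example (the three-state distributions and the plan below are my own, not the post's):

```python
import numpy as np

# Two hypothetical distributions over three states located at 0, 1, 2
P_r = np.array([0.5, 0.3, 0.2])
P_t = np.array([0.2, 0.3, 0.5])

# A (not necessarily optimal) transport plan:
# gamma[i, j] is the amount of earth moved from state i to state j
gamma = np.array([[0.2, 0.1, 0.2],
                  [0.0, 0.2, 0.1],
                  [0.0, 0.0, 0.2]])

# The marginal constraints: rows sum to P_r, columns to P_t
print(np.allclose(gamma.sum(axis=1), P_r))
print(np.allclose(gamma.sum(axis=0), P_t))

# The cost of this plan: moved mass times distance, summed over all pairs
D = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :])
print((gamma * D).sum())
```

The EMD is the smallest such cost over all plans that satisfy both marginal checks.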
Linear Programming
In the picture above you can see the optimal transport plan $\gamma$ for our example distributions $P_r$ and $P_t$. Finding it is a task for Linear Programming (LP): both the objective and all constraints are linear in the entries of $\gamma$. A linear program in standard form minimizes a cost $z = c^\top x$ over vectors $x \geq 0$, subject to equality constraints $A x = b$.
To cast our problem of finding the EMD into this form, we have to flatten
This means
For
With that, we can call a standard LP routine, for example linprog() from scipy:
```python
import numpy as np
from scipy.optimize import linprog

# l is the number of states; the distributions P_r and P_t and the
# distance matrix D are assumed to be defined as described above.
# We construct our A matrix by creating two 3-way tensors,
# and then reshaping and concatenating them
A_r = np.zeros((l, l, l))
A_t = np.zeros((l, l, l))
for i in range(l):
    for j in range(l):
        A_r[i, i, j] = 1
        A_t[i, j, i] = 1
A = np.concatenate((A_r.reshape((l, l**2)), A_t.reshape((l, l**2))), axis=0)
b = np.concatenate((P_r, P_t), axis=0)
c = D.reshape((l**2))

opt_res = linprog(c, A_eq=A, b_eq=b)
emd = opt_res.fun
gamma = opt_res.x.reshape((l, l))
```
Now we have our transference plan, as well as the EMD.
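Since the snippet relies on variables defined elsewhere in the post, here is a self-contained variant with hypothetical stand-in data: a uniform heap and a linearly increasing heap on ten states, with the distance between state indices as the metric.

```python
import numpy as np
from scipy.optimize import linprog

l = 10  # ten possible states, as in the post's example

# Hypothetical stand-ins for the post's distributions
P_r = np.ones(l) / l                  # uniform heap
P_t = np.arange(1.0, l + 1)
P_t /= P_t.sum()                      # linearly increasing heap

# Distances between the states 0, 1, ..., 9
D = np.abs(np.arange(l)[:, None] - np.arange(l)[None, :]).astype(float)

A_r = np.zeros((l, l, l))
A_t = np.zeros((l, l, l))
for i in range(l):
    for j in range(l):
        A_r[i, i, j] = 1   # row i of A sums gamma[i, :]
        A_t[i, j, i] = 1   # row l+i of A sums gamma[:, i]
A = np.concatenate((A_r.reshape((l, l**2)), A_t.reshape((l, l**2))), axis=0)
b = np.concatenate((P_r, P_t))
c = D.reshape(l**2)

opt_res = linprog(c, A_eq=A, b_eq=b)
emd = opt_res.fun
gamma = opt_res.x.reshape((l, l))
print(emd)
```

The recovered plan can be checked against the marginal constraints, exactly as in the toy example above.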

Dual Form
Unfortunately, this kind of optimization is not practical in many cases, certainly not in the domains where GANs are usually applied. In our example, we use a one-dimensional random variable with ten possible states. The number of possible discrete states scales exponentially with the number of dimensions of the input variable. For many applications, e.g. images, the input can easily have thousands of dimensions. Even an approximation of $\gamma$ is completely infeasible at this scale.
But actually we don't care about $\gamma$: we are only interested in the resulting value of the EMD, not in the transport plan that achieves it.
As it turns out, there is another way of calculating the EMD that is much more convenient. Any LP has two ways in which the problem can be formulated: the primal form, which we just used, and the dual form. By changing the roles of the same values, we can turn our minimization problem into a maximization problem: the dual form maximizes the objective $\tilde{z} = b^\top y$ over vectors $y$ (not constrained to be nonnegative), subject to $A^\top y \leq c$. Here the objective of every feasible dual solution is a lower bound for every feasible primal solution, because

$$\tilde{z} = b^\top y = (A x)^\top y = x^\top (A^\top y) \leq x^\top c = z,$$

using $A x = b$ and $x \geq 0$.

This is called the Weak Duality theorem. As you might have guessed, there also exists a Strong Duality theorem, which states that, should we find an optimal solution $\tilde{y}^*$ of the dual problem, its value $\tilde{z}^*$ is equal to the optimal value $z^*$ of the primal problem. Proving it takes a short detour.
Farkas Theorem
We can regard the columns of a matrix $A$ as a set of $m$-dimensional vectors $a_1, \dots, a_n$. The set of all their nonnegative linear combinations $\{A x \mid x \geq 0\}$ then forms a convex cone with its apex at the origin (here $x \geq 0$ means that every component of $x$ is nonnegative).
For a vector $b$, there are exactly two possibilities: either $b$ lies inside this cone, or it does not. If it does not, then there is a hyperplane through the origin that separates $b$ from the cone: its normal vector $y$ satisfies $y^\top a_i \leq 0$ for every column $a_i$, but $y^\top b > 0$.
Summarized, exactly one of the following statements is true:

1. There exists $x \geq 0$, so that $A x = b$
2. There exists $y$, so that $A^\top y \leq 0$ and $b^\top y > 0$
This is called Farkas' theorem, or the Farkas alternatives. There exist slightly different versions and several proofs, but what we showed is sufficient for our purposes.
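The two alternatives can be checked numerically. A sketch using scipy's nnls (the matrix and the two test vectors are invented for the illustration): the nonnegative least-squares residual is zero exactly when $b$ lies in the cone, and otherwise a separating $y$ exists:

```python
import numpy as np
from scipy.optimize import nnls

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])

b_inside = np.array([2.0, 3.0])    # inside the cone of A's columns
b_outside = np.array([-1.0, 1.0])  # outside the cone

# nnls solves min ||Ax - b|| subject to x >= 0;
# a zero residual certifies alternative 1
x, res_inside = nnls(A, b_inside)
_, res_outside = nnls(A, b_outside)
print(res_inside)   # ~0: b_inside = Ax for some x >= 0
print(res_outside)  # > 0: alternative 2 must hold for b_outside

# A Farkas certificate for b_outside:
y = np.array([-1.0, 0.0])
print(A.T @ y)        # every component <= 0
print(b_outside @ y)  # > 0
```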
Strong Duality
The trick for the second part of this proof is to construct a problem that is related to our original LP forms, but with one additional dimension and in such a way that Farkas' alternatives tell us on which side of the primal optimum a value lies.

Let the minimal solution to the primal problem be $z^* = c^\top x^*$. For any $\varepsilon > 0$, there can be no $x \geq 0$ with $\hat{A} x = \hat{b}$,

with

$$\hat{A} = \begin{pmatrix} A \\ c^\top \end{pmatrix}, \qquad \hat{b} = \begin{pmatrix} b \\ z^* - \varepsilon \end{pmatrix},$$

because such an $x$ would be a feasible primal solution with cost $z^* - \varepsilon$, smaller than the minimum. By Farkas' alternatives, there must instead exist a vector $\hat{y} = (y, \alpha)$ with $\hat{A}^\top \hat{y} \leq 0$ and $\hat{b}^\top \hat{y} > 0$,

or equivalently

$$A^\top y + \alpha c \leq 0 \qquad \text{and} \qquad b^\top y + \alpha (z^* - \varepsilon) > 0.$$

The way we constructed it, we can find out the sign of $\alpha$: plugging in the optimal $x^*$ gives $b^\top y + \alpha z^* = (x^*)^\top (A^\top y + \alpha c) \leq 0$, and together with the second inequality this forces $\alpha \varepsilon < 0$, i.e. $\alpha < 0$. Dividing by $-\alpha > 0$, the vector $\tilde{y} = y / (-\alpha)$ satisfies $A^\top \tilde{y} \leq c$ and $b^\top \tilde{y} > z^* - \varepsilon$, so it is a feasible solution of the dual problem with an objective above $z^* - \varepsilon$.

We see that the dual objective can come arbitrarily close to $z^*$, while by weak duality it can never exceed it. The optimal values of the two problems are therefore equal, $\tilde{z}^* = z^*$.
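To see strong duality in action, we can solve a small arbitrary LP (the numbers below are invented) in both forms with scipy and compare the optimal values:

```python
import numpy as np
from scipy.optimize import linprog

# An arbitrary small LP in standard form: min c^T x, Ax = b, x >= 0
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 5.0])
c = np.array([2.0, 3.0, 2.0])

# Primal: minimize c^T x subject to Ax = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b)

# Dual: maximize b^T y subject to A^T y <= c, y unbounded,
# written as minimization of -b^T y
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))

print(primal.fun)   # z*
print(-dual.fun)    # z~*, the same value
```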
Dual Implementation
Now we can confidently use the dual form to calculate the EMD. As we showed, the maximal value of the dual problem

$$\text{maximize} \quad \tilde{z} = b^\top y \quad \text{subject to} \quad A^\top y \leq c,$$

with $A$, $b$ and $c$ defined exactly as in the primal form, equals the EMD. We have written the vectors $f$ and $g$ for the first and second half of $y = (f, g)^\top$, matching the split $b = (P_r, P_t)^\top$. Objective and constraints then read

$$\tilde{z} = \sum_i f_i \, P_r(x_i) + \sum_j g_j \, P_t(y_j),$$

$$f_i + g_j \leq D_{ij} \quad \text{for all pairs } (i, j).$$

For $i = j$ we get $f_i + g_i \leq D_{ii} = 0$, so every feasible solution satisfies $g_i \leq -f_i$.
The implementation is straightforward:
```python
# linprog() can only minimize the cost, because of that
# we optimize the negative of the objective. Also, we are
# not constrained to nonnegative values.
opt_res = linprog(-b, A.T, c, bounds=(None, None))
emd = -opt_res.fun
f = opt_res.x[0:l]
g = opt_res.x[l:]
```


As we see, the optimal strategy is of course to set $g = -f$: we already know $g_i \leq -f_i$, and raising $g_i$ all the way up to $-f_i$ can only increase the objective. The dual objective then becomes

$$\tilde{z} = \sum_i f_i \left( P_r(x_i) - P_t(x_i) \right),$$

and the remaining constraints read $f_i - f_j \leq D_{ij}$: between any two states, $f$ may rise at most as fast as the distance grows. In other words, $f$ has to be Lipschitz continuous with constant 1.
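Under the same hypothetical stand-in distributions as before (uniform vs. linearly increasing mass on ten states, my own example data rather than the post's), a quick end-to-end check confirms that primal and dual give the same EMD and that the dual solution is feasible:

```python
import numpy as np
from scipy.optimize import linprog

l = 10
P_r = np.ones(l) / l                 # hypothetical example data
P_t = np.arange(1.0, l + 1)
P_t /= P_t.sum()
D = np.abs(np.arange(l)[:, None] - np.arange(l)[None, :]).astype(float)

# Marginal constraint matrix, built exactly as in the primal section
A_r = np.zeros((l, l, l))
A_t = np.zeros((l, l, l))
for i in range(l):
    for j in range(l):
        A_r[i, i, j] = 1
        A_t[i, j, i] = 1
A = np.concatenate((A_r.reshape((l, l**2)), A_t.reshape((l, l**2))), axis=0)
b = np.concatenate((P_r, P_t))
c = D.reshape(l**2)

primal = linprog(c, A_eq=A, b_eq=b)              # min c^T x
dual = linprog(-b, A.T, c, bounds=(None, None))  # max b^T y

emd_primal = primal.fun
emd_dual = -dual.fun
f = dual.x[:l]
g = dual.x[l:]
print(emd_primal, emd_dual)  # both equal the EMD
```

Note that the solver may return a dual solution shifted by a constant (adding $t$ to every $f_i$ and subtracting it from every $g_i$ changes neither the objective nor the constraints), so $g$ need not be exactly $-f$.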
Wasserstein Distance
Lastly, we have to consider continuous probability distributions. We can of course view them intuitively as discrete distributions with infinitely many states and use a similar reasoning as described so far. But as I mentioned at the beginning, we will try something neater. Let our continuous distributions be $P_r$ and $P_t$, defined over a metric space, for concreteness $\mathbb{R}^n$ with the Euclidean distance. The Wasserstein distance is then

$$W(P_r, P_t) = \inf_{\gamma \in \Pi} \; \mathbb{E}_{(x, y) \sim \gamma} \|x - y\|,$$

where $\Pi$ is again the set of all joint distributions $\gamma$ whose marginals are $P_r$ and $P_t$.
If we add suitable terms, we can remove all constraints on the distribution $\gamma$. The supremum below is $0$ whenever $\gamma$ has the correct marginals and $\infty$ otherwise, so it acts as a penalty that replaces the constraint $\gamma \in \Pi$:

$$W(P_r, P_t) = \inf_{\gamma} \left[ \mathbb{E}_{(x, y) \sim \gamma} \|x - y\| + \sup_{f, g} \left( \mathbb{E}_{s \sim P_r} f(s) + \mathbb{E}_{t \sim P_t} g(t) - \mathbb{E}_{(x, y) \sim \gamma} \left[ f(x) + g(y) \right] \right) \right]$$
Now we have a bilevel optimization. This means we take the optimal solution of the inner optimization (here the supremum over $f$ and $g$) as given while performing the outer optimization over $\gamma$. What we would like to do is swap the two steps:

$$W(P_r, P_t) \overset{?}{=} \sup_{f, g} \left[ \mathbb{E}_{s \sim P_r} f(s) + \mathbb{E}_{t \sim P_t} g(t) + \inf_{\gamma} \mathbb{E}_{(x, y) \sim \gamma} \left( \|x - y\| - f(x) - g(y) \right) \right]$$

In general, swapping infimum and supremum can change the value, so this step needs a justification.
Consider some function $L(x, y)$ that we want to minimize over $x$ and maximize over $y$. In general only the inequality $\sup_y \inf_x L(x, y) \leq \inf_x \sup_y L(x, y)$ holds. For the two sides to differ, the iterated optimization has to get stuck in a point $(x^*, y^*)$ that is not a true saddle point, meaning one of the following:

1. $L(x, y^*) < L(x^*, y^*)$ for some $x$. This is only possible if $L$ is not convex in $x$, because $L(x^*, y^*)$ is already a local infimum in the $x$ direction.
2. $L(x^*, y) > L(x^*, y^*)$ for some $y$. This is only possible if $L$ is not concave in $y$, because $L(x^*, y^*)$ is already a local supremum in the $y$ direction.

This means of course that, if $L$ is convex in $x$ and concave in $y$, every such point is a true saddle point, and we may swap infimum and supremum without changing the value.
We see that the infimum is concave, as required. Because all functions involved are expectations, our objective is linear both in $\gamma$ and in the pair $(f, g)$; linear functions are trivially convex and concave, and an infimum of linear functions is still concave. So we may swap the two optimizations. The inner infimum is now easy to evaluate:

$$\inf_{\gamma} \mathbb{E}_{(x, y) \sim \gamma} \left[ \|x - y\| - f(x) - g(y) \right] = \begin{cases} 0 & \text{if } f(x) + g(y) \leq \|x - y\| \text{ for all } x, y \\ -\infty & \text{otherwise,} \end{cases}$$

because $\gamma$ is free to concentrate all its mass on the single pair $(x, y)$ where the integrand is smallest. Exactly as in the discrete case, the optimal choice is $g = -f$; the constraint then becomes $f(x) - f(y) \leq \|x - y\|$, i.e. $f$ must be Lipschitz continuous with constant 1, and we arrive at

$$W(P_r, P_t) = \sup_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_t} f(x).$$
This is our case of the Kantorovich-Rubinstein duality. It actually holds for other metrics than just the Euclidean metric we used. But the function $f$, called the potential, always has to be Lipschitz continuous with respect to whichever metric we choose. In the WGAN, $f$ takes the role of the critic network, and the weight clipping is a crude way of enforcing this Lipschitz constraint.
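As a final sanity check, scipy already implements the one-dimensional Wasserstein-1 distance; on a made-up pair of discrete distributions (the same hypothetical stand-ins used in the LP examples above) it should agree with the LP-based EMD:

```python
import numpy as np
from scipy.stats import wasserstein_distance

l = 10
positions = np.arange(l, dtype=float)  # the states live at 0, 1, ..., 9

# Hypothetical example distributions
P_r = np.ones(l) / l
P_t = np.arange(1.0, l + 1)
P_t /= P_t.sum()

# Wasserstein-1 distance between the two weighted point sets
w = wasserstein_distance(positions, positions, u_weights=P_r, v_weights=P_t)
print(w)
```

For one-dimensional distributions this reduces to the area between the two cumulative distribution functions, which is why a closed-form routine exists at all; in higher dimensions one has to fall back to the primal LP, the dual LP, or the Lipschitz-potential formulation above.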