Fisher vector fundamentals
The paper "Fisher Kernels on Visual Vocabularies for Image Categorization" states:
Pattern classification techniques can be divided into the classes of generative approaches and discriminative approaches. While the first class focuses on the modeling of class-conditional probability density functions, the second one focuses directly on the problem of interest: classification. This explains the theoretical superiority of discriminative methods over generative ones. However, generative approaches have a number of properties which still make them attractive, including the possibility to handle variable length data.
Within the field of pattern classification, the Fisher kernel is a powerful framework which combines the strengths of generative and discriminative approaches. The idea is to characterize a signal with a gradient vector derived from a generative probability model and to subsequently feed this representation to a discriminative classifier.
(This blog post is original; when reposting, please cite the source: http://www.cnblogs.com/pfli1995/p/4655574.html)
I. The core idea
In essence, the Fisher vector represents an image by the gradient vector of a likelihood function.
II. Background
1. The Gaussian distribution
In everyday life and in nature, many things are approximately Gaussian distributed. Take the grades of a class: the best and the worst students are usually few, while average students are the majority.
Intuitively, the Gaussian distribution is the familiar bell-shaped curve. (Figure of the Gaussian probability density omitted.)
2. Gaussian mixture distributions
The problem is that the grades of a class might instead be distributed like this: very few people below 60 or above 95, many people between 60 and 75, suddenly fewer between 75 and 85, but more again between 85 and 90. (Figure of this bimodal distribution omitted.)
In that case, a weighted sum of two Gaussians clearly fits much better than a single Gaussian! And if there are more than two peaks, it is best to add even more Gaussian components. This is the intuition behind the GMM (Gaussian Mixture Model).
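As a rough illustration, the following minimal sketch (hypothetical "exam scores", illustrative variable names) fits a two-component GMM to bimodal 1-D data with VLFeat's vl_gmm:
% Minimal sketch: fit a 2-component GMM to bimodal 1-D data.
% vl_gmm expects a dimension-by-numData matrix of class single or double.
scores = single([randn(1,400) * 5 + 68, randn(1,150) * 3 + 88]) ;
[means, covariances, priors] = vl_gmm(scores, 2) ;
% 'means' holds the two peak locations, 'covariances' their spreads,
% and 'priors' the mixing weights of the two components.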
3. Gaussian distributions for images
You are probably familiar with the i.i.d. (independent and identically distributed) assumption. For images, it means that the individual dimensions of the features used to describe the image are treated as independent. Take a person: if we describe him by height, weight, and body measurements, those are his features, and we treat them as independent. The same idea applies to an image.
The most important consequence of the i.i.d. assumption is that the probability of a sample (an image) can be written as the product of the probabilities of its individual feature components.
Taking the logarithm turns this product into a sum of log-probabilities, which makes the computation far more tractable.
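One concrete reason the log form is easier to work with: a product of many small probabilities underflows in floating point, while the sum of their logarithms does not. A tiny MATLAB illustration with made-up numbers:
% Product of many small probabilities vs. sum of their logs.
p = 1e-8 * ones(1, 500) ;   % 500 i.i.d. feature probabilities (made up)
prod(p)                     % underflows to 0 in double precision
sum(log(p))                 % about -9210.34, perfectly representable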
4. Manifold learning
A low-dimensional manifold embedded in a high-dimensional space: the most intuitive examples are one- or two-dimensional manifolds embedded in three-dimensional space. Take a piece of cloth: you can view it as a two-dimensional plane, i.e. a two-dimensional Euclidean space. If we now twist it (in three dimensions), it becomes a manifold (of course, it is also a manifold when not twisted; Euclidean space is a special case of a manifold). So, intuitively, a manifold is like a d-dimensional space that has been twisted inside an m-dimensional space (m > d).
For details, see an earlier post: http://www.cnblogs.com/pfli1995/p/4655602.html
5. The essence of the Fisher vector
In essence, the Fisher vector is obtained by taking partial derivatives of the (Gaussian-mixture) log-likelihood with respect to its variables, that is, with respect to the weights, means, and standard deviations, followed by a normalization step. The detailed computation is given below.
6. Why the Fisher vector is more effective than a Gaussian distribution alone
Suppose we approximate an image by a Gaussian distribution and let that distribution represent the image. If we are doing object detection, then whenever we encounter an image with the same Gaussian distribution we would declare it to be the target. In practice this is not necessarily true: imagine two images whose feature points lie in different black regions, yet whose fitted distributions are identical. (Hand-drawn illustration omitted.)
This tells us that if, on top of the Gaussian distribution, we also capture the direction of change, we can represent the image much more accurately!
III. The details
In essence, the Fisher vector represents an image by the gradient vector of a likelihood function; the physical meaning of this gradient vector is that it describes the direction in which parameters should be modified to best fit the data.
From "Fisher Kernels on Visual Vocabularies for Image Categorization":
We propose to apply Fisher kernels on visual vocabularies, where the vocabularies of visual words are represented by means of a GMM.
Let X = {x_t, t = 1, ..., T} denote the set of T local descriptors extracted from an image, and let λ denote the parameters of the generative model. Under the i.i.d. assumption,
p(X|\lambda) = \prod_{t=1}^{T} p(x_t|\lambda)    (1)
Taking the logarithm gives:
\mathcal{L}(X|\lambda) = \log p(X|\lambda) = \sum_{t=1}^{T} \log p(x_t|\lambda)    (2)
We now approximate the distribution of these i.i.d. descriptors by a linear combination of K Gaussians, collecting the mixture parameters in λ = {w_i, μ_i, Σ_i, i = 1, ..., K}. The likelihood that observation x_t was generated by the GMM is:
p(x_t|\lambda) = \sum_{i=1}^{K} w_i \, p_i(x_t|\lambda)    (3)
The mixture weights are subject to the constraint:
\sum_{i=1}^{K} w_i = 1    (4)
The components p_i are the Gaussian densities:
p_i(x|\lambda) = \frac{\exp\left\{ -\tfrac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i) \right\}}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}    (5)
Here D is the dimensionality of the feature vectors, and the covariance matrix describes the relationships between the different dimensions. We assume the covariance matrices are diagonal, i.e. the different feature dimensions are mutually independent, because (i) any distribution can be approximated with arbitrary precision by a weighted sum of Gaussians with diagonal covariances, and (ii) the computational cost of diagonal covariances is much lower than that of full covariances. We use the notation \sigma_i^2 = \mathrm{diag}(\Sigma_i).
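With a diagonal covariance, the component density in (5) factorizes over the dimensions, which is exactly the per-dimension independence mentioned earlier:
p_i(x|\lambda) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_i^d} \exp\left( -\frac{(x^d - \mu_i^d)^2}{2 (\sigma_i^d)^2} \right)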
We now differentiate equation (2) and take the partial derivatives, i.e. the gradient, as the Fisher vector. Before that, define one more quantity: γ_t(i) denotes the occupancy probability, i.e. the probability that observation x_t was generated by the i-th Gaussian. Bayes' formula gives:
\gamma_t(i) = \frac{w_i \, p_i(x_t|\lambda)}{\sum_{j=1}^{K} w_j \, p_j(x_t|\lambda)}    (6)
In other words, γ_t(i) is the soft assignment of the feature x_t to the i-th Gaussian.
Straightforward derivations give the following formulas for the partial derivatives:
\frac{\partial \mathcal{L}(X|\lambda)}{\partial w_i} = \sum_{t=1}^{T} \left[ \frac{\gamma_t(i)}{w_i} - \frac{\gamma_t(1)}{w_1} \right] \quad (i \ge 2),
\frac{\partial \mathcal{L}(X|\lambda)}{\partial \mu_i^d} = \sum_{t=1}^{T} \gamma_t(i) \, \frac{x_t^d - \mu_i^d}{(\sigma_i^d)^2},
\frac{\partial \mathcal{L}(X|\lambda)}{\partial \sigma_i^d} = \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t^d - \mu_i^d)^2}{(\sigma_i^d)^3} - \frac{1}{\sigma_i^d} \right]    (7)
where the superscript d denotes the d-th dimension of a vector.
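As an illustration only (not the implementation from [1] or VLFeat), the following MATLAB sketch computes the occupancy probabilities (6) and the unnormalized gradients (7) for a diagonal-covariance GMM; the variable names X, w, mu, sigma and their shapes are assumptions made here:
% Illustrative sketch of equations (6)-(7), not library code.
% Assumed inputs: X (D x T) features, w (1 x K) weights,
% mu (D x K) means, sigma (D x K) standard deviations.
[D, T] = size(X) ;
K = numel(w) ;
% log of each Gaussian component density, equation (5) with diagonal Sigma_i
logP = zeros(K, T) ;
for i = 1:K
  z = bsxfun(@rdivide, bsxfun(@minus, X, mu(:,i)), sigma(:,i)) ;  % (x_t - mu_i) ./ sigma_i
  logP(i,:) = -0.5 * sum(z.^2, 1) - sum(log(sigma(:,i))) - 0.5 * D * log(2*pi) ;
end
% occupancy probabilities gamma_t(i), equation (6), computed stably in the log domain
logWP = bsxfun(@plus, logP, log(w(:))) ;                % log(w_i * p_i(x_t))
gam   = exp(bsxfun(@minus, logWP, max(logWP, [], 1))) ;
gam   = bsxfun(@rdivide, gam, sum(gam, 1)) ;            % K x T, columns sum to 1
% unnormalized gradients, equation (7)
gw = sum(bsxfun(@minus, bsxfun(@rdivide, gam(2:end,:), w(2:end)'), ...
                gam(1,:) / w(1)), 2) ;                  % (K-1) x 1, weights i >= 2
gmu    = zeros(D, K) ;
gsigma = zeros(D, K) ;
for i = 1:K
  d2 = bsxfun(@minus, X, mu(:,i)) ;                     % x_t - mu_i
  gmu(:,i)    = bsxfun(@rdivide, d2, sigma(:,i).^2) * gam(i,:)' ;
  gsigma(:,i) = bsxfun(@minus, bsxfun(@rdivide, d2.^2, sigma(:,i).^3), ...
                       1 ./ sigma(:,i)) * gam(i,:)' ;
end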
Note that the quantities above form an unnormalized vector; a normalization step is still required. Because we are in a probability space rather than a Euclidean space, this normalization differs from the Euclidean one, and the Fisher information matrix is introduced to carry it out.
For the three kinds of parameters in equation (7), the corresponding (approximate, diagonal) Fisher matrix terms used for normalization are:
f_{w_i} = T \left( \frac{1}{w_i} + \frac{1}{w_1} \right), \qquad f_{\mu_i^d} = \frac{T \, w_i}{(\sigma_i^d)^2}, \qquad f_{\sigma_i^d} = \frac{2 \, T \, w_i}{(\sigma_i^d)^2}    (8)
The final, normalized Fisher vector components are then obtained as:
\frac{1}{\sqrt{f_{w_i}}} \frac{\partial \mathcal{L}(X|\lambda)}{\partial w_i}, \qquad \frac{1}{\sqrt{f_{\mu_i^d}}} \frac{\partial \mathcal{L}(X|\lambda)}{\partial \mu_i^d}, \qquad \frac{1}{\sqrt{f_{\sigma_i^d}}} \frac{\partial \mathcal{L}(X|\lambda)}{\partial \sigma_i^d}    (9)
Since each feature is D-dimensional and a mixture of K Gaussians is used, it follows from equation (7) that the Fisher vector has (2D + 1)K − 1 dimensions: K − 1 weight components (one degree of freedom is lost to the constraint (4)), plus DK mean components and DK standard-deviation components.
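Continuing the sketch above (same assumed variables), the closed-form Fisher matrix entries of equation (8) normalize the gradients, and the result is stacked into a single (2D + 1)K − 1 dimensional vector as in equation (9):
% Illustrative continuation: closed-form normalization (8) and stacking (9).
fw     = T * (1 ./ w(2:end) + 1 / w(1)) ;        % f_{w_i}, i >= 2   (1 x K-1)
fmu    = T * repmat(w(:)', D, 1) ./ sigma.^2 ;   % f_{mu_i^d}        (D x K)
fsigma = 2 * fmu ;                               % f_{sigma_i^d}     (D x K)
fisherVector = [ gw(:)     ./ sqrt(fw(:)) ;
                 gmu(:)    ./ sqrt(fmu(:)) ;
                 gsigma(:) ./ sqrt(fsigma(:)) ] ; % length (2*D+1)*K - 1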
With the Fisher vector in hand, you can do image classification. Papers [2, 3] both describe further improvements to this Fisher vector; they are not repeated here.
IV. The description in VLFeat
Fisher vector fundamentals
The FV is an image representation obtained by pooling local image features. It is frequently used as a global image descriptor in visual classification.
While the FV can be derived as a special, approximate, and improved case of the general Fisher Kernel framework, it is easy to describe directly. Let I=(x1,…,xN) be a set of D dimensional feature vectors (e.g. SIFT descriptors) extracted from an image. Let Θ=(μk,Σk,πk:k=1,…,K) be the parameters of a Gaussian Mixture Model fitting the distribution of descriptors. The GMM associates each vector xi to a mode k in the mixture with a strength given by the posterior probability:
q_{ik} = \frac{\pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{t=1}^{K} \pi_t \, \mathcal{N}(x_i; \mu_t, \Sigma_t)}
For each mode k, consider the mean and covariance deviation vectors
u_{jk} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik} \, \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \qquad
v_{jk} = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ik} \left[ \left( \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}} \right)^2 - 1 \right]
where j=1,2,…,D spans the vector dimensions. The FV of image I is the stacking of the vectors uk and then of the vectors vk for each of the K modes in the Gaussian mixtures:
\Phi(I) = \begin{bmatrix} \vdots \\ u_k \\ \vdots \\ v_k \\ \vdots \end{bmatrix}
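For reference, a minimal sketch of the u_k / v_k computation above (not VLFeat's implementation; the stacking order and the variable names X, mu, sigma, priors, q are assumptions made here):
% Illustrative sketch of the deviation vectors for data X (D x N), GMM means
% mu (D x K), diagonal std. deviations sigma (D x K), priors (1 x K),
% and posteriors q (K x N).
[D, N] = size(X) ;
K = numel(priors) ;
U = zeros(D, K) ;
V = zeros(D, K) ;
for k = 1:K
  z = bsxfun(@rdivide, bsxfun(@minus, X, mu(:,k)), sigma(:,k)) ;  % (x_i - mu_k) ./ sigma_k
  U(:,k) = (z          * q(k,:)') / (N * sqrt(priors(k))) ;       % mean deviations
  V(:,k) = ((z.^2 - 1) * q(k,:)') / (N * sqrt(2 * priors(k))) ;   % covariance deviations
end
Phi = [U(:) ; V(:)] ;   % 2*D*K-dimensional Fisher vector (one possible stacking order)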
Normalization and improved Fisher vectors
The improved Fisher Vector [24] (IFV) improves the classification performance of the representation by using two ideas:
- Non-linear additive kernel. The Hellinger's kernel (or Bhattacharyya coefficient) can be used instead of the linear one at no cost by signed square rooting. This is obtained by applying the function sign(z)√|z| to each dimension of the vector Φ(I). Other additive kernels can also be used at an increased space or time cost.
- Normalization. Before using the representation in a linear model (e.g. a support vector machine), the vector Φ(I) is further normalized by the l2 norm (note that the standard Fisher vector is normalized by the number of encoded feature vectors).
After square-rooting and normalization, the IFV is often used in a linear classifier such as an SVM.
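A minimal sketch of these two steps applied to a Fisher vector phi (a plain MATLAB vector; the variable name is illustrative, and vl_fisher can also perform these steps internally, as noted in the tutorial below):
phi = sign(phi) .* sqrt(abs(phi)) ;   % signed square-rooting (Hellinger map)
phi = phi / norm(phi) ;               % l2 normalization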
Faster computations
In practice, several data-to-cluster assignments q_ik are likely to be very small or even negligible. The fast version of the FV sets to zero all but the largest assignment for each input feature x_i.
V. VLFeat tutorials
This short tutorial shows how to compute Fisher vector and VLAD encodings with the VLFeat MATLAB interface.
These encodings serve a similar purpose: summarizing a set of local feature descriptors (e.g. SIFT) in a vectorial statistic. Similarly to bag of visual words, they assign local descriptors to elements in a visual dictionary, obtained with vector quantization (K-means) in the case of VLAD or with a Gaussian Mixture Model for Fisher vectors. However, rather than storing visual word occurrences only, these representations store statistics of the differences between dictionary elements and the pooled local features.
Fisher encoding
The Fisher encoding uses a GMM to construct a visual word dictionary. To exemplify constructing a GMM, consider a set of 2-dimensional data points (see also the GMM tutorial). In practice, these points would be a collection of SIFT or other local image features. The following code fits a GMM to the points:
numFeatures = 5000 ;
dimension = 2 ;
data = rand(dimension, numFeatures) ;
numClusters = 30 ;
[means, covariances, priors] = vl_gmm(data, numClusters) ;
Next, we create another random set of vectors, which should be encoded using the Fisher Vector representation and the GMM just obtained:
numDataToBeEncoded = 1000 ;
dataToBeEncoded = rand(dimension, numDataToBeEncoded) ;
The Fisher vector encoding of these vectors is obtained by calling the vl_fisher function on the output of the vl_gmm function:
encoding = vl_fisher(dataToBeEncoded, means, covariances, priors);
The encoding vector is the Fisher vector representation of the data dataToBeEncoded.
Note that Fisher Vectors support several normalization options that can affect substantially the performance of the representation.
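For example, assuming the option names documented for vl_fisher (treat these as assumptions and check your VLFeat version), the improved and fast variants can be requested directly:
% Improved Fisher vector: signed square-rooting and l2 normalization built in.
encodingIFV = vl_fisher(dataToBeEncoded, means, covariances, priors, 'Improved') ;
% Fast variant: keep only the largest posterior assignment per descriptor.
encodingFast = vl_fisher(dataToBeEncoded, means, covariances, priors, 'Fast') ;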
VLAD encoding
The Vector of Locally Aggregated Descriptors (VLAD) is similar to the Fisher vector but (i) it does not store second-order information about the features and (ii) it typically uses K-means instead of a GMM to generate the feature vocabulary (although the latter is also an option).
Consider the same 2D data matrix data used in the previous section to train the Fisher vector representation. To compute VLAD, we first need to obtain a visual word dictionary. This time, we use K-means:
numClusters = 30 ;
centers = vl_kmeans(data, numClusters) ;
Now consider the data dataToBeEncoded and use the vl_vlad function to compute the encoding. Differently from vl_fisher, vl_vlad requires the data-to-cluster assignments to be passed in. This allows using a fast vector quantization technique (e.g. a kd-tree) as well as switching from soft to hard assignment.
In this example, we use a kd-tree for quantization:
kdtree = vl_kdtreebuild(centers) ;
nn = vl_kdtreequery(kdtree, centers, dataToBeEncoded) ;
The vector nn now contains, for each vector in the matrix dataToBeEncoded, the index of its nearest center. The next step is to create an assignment matrix:
assignments = zeros(numClusters, numDataToBeEncoded) ;
assignments(sub2ind(size(assignments), nn, 1:length(nn))) = 1 ;
It is now possible to encode the data using the vl_vlad function:
enc = vl_vlad(dataToBeEncoded,centers,assignments);
Note that, similarly to Fisher vectors, VLAD supports several normalization options that can affect substantially the performance of the representation.
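For example, assuming the option names listed in the vl_vlad documentation (treat them as assumptions and verify against your VLFeat version), per-cluster normalization and square-rooting can be requested as:
% Per-cluster l2 normalization of the VLAD blocks ("intra-normalization").
encNorm = vl_vlad(dataToBeEncoded, centers, assignments, 'NormalizeComponents') ;
% Signed square-rooting of the VLAD vector.
encSqrt = vl_vlad(dataToBeEncoded, centers, assignments, 'SquareRoot') ;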
References:
http://blog.csdn.net/ikerpeng/article/details/41644197
http://blog.csdn.net/carrierlxksuper/article/details/28151013
http://www.vlfeat.org/api/fisher-fundamentals.html#fisher-normalization
http://www.vlfeat.org/overview/encodings.html
[1] Florent Perronnin and Christopher Dance. Fisher Kernels on Visual Vocabularies for Image Categorization. CVPR 2007.
[2] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher Kernel for Large-Scale Image Classification. ECCV 2010.
[3] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image Classification with the Fisher Vector: Theory and Practice. IJCV 2013.
[4] Tommi Jaakkola and David Haussler. Exploiting Generative Models in Discriminative Classifiers. NIPS 1998.
The corresponding post on CSDN: http://blog.csdn.net/xuexiyanjiusheng/article/details/46927491