Computer Vision Fundamentals (4)

 

Knowledge and Thinking

Bild, Urbild, and Cobild

A line or point in \(\mathbb{R}^3\) is projected into the image plane, \(\mathbb{R}^3 \Rightarrow \mathbb{R}^2\), through the transform \( \pi_0P \) or \( \pi_0L \). The urbild (preimage) of a point is defined as \( Urbild(P) = \{Q \in \mathbb{R}^3 \,\big|\, \pi_0Q\sim\pi_0P\} \): all points that project to the same image point as \(P\) form the urbild of \(P\). Therefore, the urbild of a point is a line through the origin, and the urbild of a line is a plane through the origin.

The cobild (coimage) of a line or point is the orthogonal complement of its urbild. The relationship between lines, points, urbild, and cobild is illustrated in Figure 1. Because the urbild of a point is a line through the origin, its cobild is a whole plane; the cobild of a line, on the other hand, is a line, since the urbild of a line is a plane through the origin.

Figure 1. Illustration of the urbild and cobild of a line and a point

The linear span of a set of vectors is defined as \( span(A) = \{\sum^m_{i=1}\lambda_ia_i \,\big|\, \lambda_i\in\mathbb{R}\} \). It can be interpreted as the set of all \(b\) with \( Ax = b \), where \( A = \left[ a_1,a_2,a_3 \right] \) has the \(a_i\) as column vectors. In other words, the linear span of a set of vectors is the set of points that can be obtained by the linear map defined by \(A\). The urbild of a point, for example, can also be interpreted as the set of points obtained by such a linear transform of \(P\), which makes it easier to see why the urbild of a line is a plane.
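As a small illustration, here is a minimal numpy sketch (the matrix and test vectors are made-up examples) that tests membership in \(span(A)\) by checking whether appending a vector as a new column increases the rank:

```python
import numpy as np

# Hypothetical example: the columns of A span the z = 0 plane in R^3.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

def in_span(A, v):
    # v lies in span(A) iff appending it as a column does not raise the rank.
    return np.linalg.matrix_rank(np.column_stack([A, v])) == np.linalg.matrix_rank(A)

print(in_span(A, np.array([2.0, 3.0, 0.0])))  # True: lies in the plane
print(in_span(A, np.array([0.0, 0.0, 1.0])))  # False: points out of the plane
```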

The function of a line is \( ax+by+c=0 \), which can be described in homogeneous coordinates as \( l = \begin{pmatrix} a \\ b \\ c \end{pmatrix} \), and a point as \( x = \begin{pmatrix} x \\ y \\ z \end{pmatrix} \). If the point lies on the line, the line equation must be fulfilled, which leads to \( x^Tl = 0 \). If we take the cross product of two lines, \( x = l \times l^\prime \), we can derive from the above that \( l^T(l\times l^\prime) = l^{\prime T}(l\times l^\prime) = 0 \), i.e. \( l^Tx = l^{\prime T}x = 0 \), which can be interpreted as \(x\) being the intersection of the lines \(l\) and \(l^\prime\).
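A minimal numpy sketch of this intersection construction, using two made-up example lines:

```python
import numpy as np

# Two lines ax + by + c = 0, written as homogeneous vectors (a, b, c).
l1 = np.array([1.0, -1.0, 0.0])   # the line y = x
l2 = np.array([0.0, 1.0, -2.0])   # the line y = 2

# Their cross product is the homogeneous intersection point.
x = np.cross(l1, l2)

# Check l^T x = 0 for both lines, then dehomogenize.
assert np.isclose(l1 @ x, 0.0) and np.isclose(l2 @ x, 0.0)
print(x[:2] / x[2])   # -> [2. 2.], i.e. the point (2, 2)
```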

Correspondingly, two points define a line: \( l = x\times x^\prime \). To test whether several points are collinear, we can check \( Rank([x_1,x_2, \ldots ,x_n]) \leq 2 \); we can also check whether the smallest eigenvalue of \( \mathbf{M}=\sum_{i=1}^{n} \omega_{i} \mathbf{x}_{i} \mathbf{x}_{i}^{\top} \) for \( \forall \omega_i>0 \) is zero. For three points there is an even simpler test: \( \det[x_1,x_2,x_3]=0 \).
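All three collinearity tests can be checked numerically; a short sketch with three made-up points on the line \(y = x\):

```python
import numpy as np

# Three homogeneous points on the line y = x, stacked as columns.
pts = np.array([[1.0, 1.0, 1.0],
                [2.0, 2.0, 1.0],
                [3.0, 3.0, 1.0]]).T

# Test 1: the stacked points have rank <= 2.
print(np.linalg.matrix_rank(pts))                    # -> 2

# Test 2: for exactly three points, det[x1, x2, x3] = 0.
print(np.isclose(np.linalg.det(pts), 0.0))           # -> True

# Test 3: the smallest eigenvalue of M = sum_i w_i x_i x_i^T is zero
# (here with all weights w_i = 1, i.e. M = pts @ pts.T).
M = pts @ pts.T
print(np.isclose(np.linalg.eigvalsh(M).min(), 0.0))  # -> True
```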

Epipolar Geometry

Two view geometry is more useful than single view geometry for 3D reconstruction, because the second camera is a rotated and translated version of the first one. If we try to reconstruct the position of a point from a single view, as in Figure 2, we can easily see that the Z coordinate can never be obtained. In other words, different points along the same viewing ray are projected to the same point in the image.

Figure 2. \(x_1\) and \(x_2\) will be projected to the same point in the image.

Only two view geometry can solve this problem, since the two cameras capture different information from different perspectives. Figure 3 shows the general case of two view geometry.

Figure 3. Illustration of two view geometry

There are several definitions: the projection of the second camera's origin into the first image is called the epipole \( e_1 \) of the first image; correspondingly, \( e_2 \) denotes the projection of the first camera's origin into the second image. The object point is \(P\); its image point in the first camera is \( x_1 \), and \( x_2 \) is its image point in the second camera. The line through \( x_1 \) and \( e_1 \) (respectively \( x_2 \) and \( e_2 \)) is called the epipolar line.

Since the second camera is a Euclidean motion of the first camera, the object point in the two coordinate frames can be related by \( P_2 = RP_1+T \).

The epipoles can be obtained as \( e_2 = M_2O_1 \), where \(M_2\) is the second camera matrix and \(O_1\) is the origin of the first camera. The epipole can also be interpreted as \( e_2 = O_1^{(2)} \), the first camera's origin expressed in the second frame. Because \( O_1^{(1)} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \) in camera frame 1, \( O_1^{(2)} = RO_1^{(1)}+T = T \Rightarrow e_2 \sim T \). Correspondingly, \( O_2^{(1)} = R^T(O_2^{(2)}-T) = -R^TT \Rightarrow e_1 \sim R^TT \), since homogeneous coordinates are defined only up to scale.

The point \(x_1\) is the projection of \(P\) by the first camera, a linear transform of \(P\): \( x_1 = M_1P \) and \( P = M_1^+x_1 \), where \( M_1^+ \) is the pseudoinverse of \(M_1\); thus \( x_2 = M_2P = M_2M_1^+x_1 \). The epipolar line in the second image is \( l_2 = e_2\times M_2M_1^+x_1 \Rightarrow l_2 = \hat e_2M_2M_1^+x_1 \), where \(\hat e_2\) is the skew-symmetric matrix of \(e_2\). Since \(x_2\) lies on \(l_2\), \( x_2^T\hat e_2M_2M_1^+x_1 = 0 \), i.e. \( x_2^TEx_1 = 0 \), where \(E = \hat e_2M_2M_1^+\) denotes the essential matrix. Therefore, \( l_1 = E^Tx_2 \) and \( l_2 = Ex_1 \).
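A minimal numerical sketch of the epipolar constraint, assuming calibrated (normalized) cameras so that the essential matrix reduces to \(E = \hat T R\), which is consistent with \(E = \hat e_2M_2M_1^+\) and \(e_2 \sim T\); the pose and the object point below are made-up values:

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix so that hat(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Hypothetical relative pose of camera 2 w.r.t. camera 1: P2 = R @ P1 + T.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])

# For calibrated cameras the essential matrix is E = hat(T) @ R.
E = hat(T) @ R

# Project a made-up object point into both (normalized) cameras.
P1 = np.array([0.5, -0.2, 4.0])
x1 = P1 / P1[2]
P2 = R @ P1 + T
x2 = P2 / P2[2]

# Epipolar lines l2 = E x1 and l1 = E^T x2; the constraint x2^T E x1 = 0 holds.
l2, l1 = E @ x1, E.T @ x2
print(np.isclose(x2 @ E @ x1, 0.0))  # -> True
```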

SVD:

For a real symmetric \(n\times n\) matrix, there is the eigendecomposition \( A = V\Sigma V^{T} \), where \(V\) is a matrix containing the eigenvectors and \(\Sigma\) is a diagonal matrix containing the eigenvalues. A general \(M\times N\) matrix, however, cannot be decomposed in this form. A more general decomposition is the singular value decomposition \( A = U \widetilde\Sigma V^T \), where \(U\) and \(V\) are orthogonal matrices containing the singular vectors and \(\widetilde\Sigma\) contains the singular values. If \(A\) is an \(M\times N\) matrix, then \(U\) is an \(M\times M\) matrix, \(\widetilde\Sigma\) is an \(M\times N\) matrix, and \(V\) is an \(N\times N\) matrix. The columns of \(U\) are the eigenvectors of \(AA^T\) and the columns of \(V\) are the eigenvectors of \(A^TA\). Derivation: \( A = U\widetilde\Sigma V^T \) and \( A^T = (U\widetilde\Sigma V^T)^T = V\widetilde\Sigma^T U^T \), therefore \( A^TA = V\widetilde\Sigma^TU^TU\widetilde\Sigma V^T = V\widetilde\Sigma^T\widetilde\Sigma V^T \), and we can see that \(V\) contains the eigenvectors of \(A^TA\) and the singular values are \( \sqrt \lambda \), where \(\lambda\) are the eigenvalues of \(A^TA\).
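A short numpy sketch verifying these relations on a random matrix (the shape and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # a general M x N matrix, M=4, N=3

U, s, Vt = np.linalg.svd(A)              # U: 4x4, s: singular values, Vt: 3x3

# A^T A = V diag(s^2) V^T: its eigenvalues are the squared singular values.
eigvals = np.linalg.eigvalsh(A.T @ A)    # ascending order
print(np.allclose(np.sort(s**2), eigvals))       # -> True

# Reconstruct A from the decomposition, padding s into a 4x3 Sigma.
Sigma = np.zeros_like(A)
Sigma[:3, :3] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))            # -> True
```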

Since the essential matrix \(E\) has two equal singular values and one zero singular value, it can be decomposed as \( E = \begin{pmatrix} u_1 & u_2 & u_3 \end{pmatrix}\begin{pmatrix}\sigma & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 0\end{pmatrix}\begin{pmatrix} v_1^T \\ v_2^T \\ v_3^T \end{pmatrix} \), and \(U\), \(V\) are orthogonal matrices. We can prove that \(e_1=v_3\) by using the fact \(Ee_1 = 0\): \( Ev_3 = \begin{pmatrix} u_1 & u_2 & u_3 \end{pmatrix}\begin{pmatrix}\sigma & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 0\end{pmatrix}\begin{pmatrix} v_1^T \\ v_2^T \\ v_3^T \end{pmatrix}v_3 = \begin{pmatrix} u_1 & u_2 & u_3 \end{pmatrix}\begin{pmatrix}\sigma & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 0\end{pmatrix} \begin{pmatrix}0 \\ 0 \\ 1\end{pmatrix} = 0 \), and by using \(E^Te_2 = 0\) we can prove analogously that \( e_2 = u_3 \).
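A minimal sketch recovering the epipoles from the SVD of an essential matrix; here \(E\) is built as \(\hat T R\) from a made-up pose, one standard form of the essential matrix:

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Made-up pose; with R = I, both epipoles are parallel to T.
R = np.eye(3)
T = np.array([1.0, 2.0, 3.0])
E = hat(T) @ R

U, s, Vt = np.linalg.svd(E)
print(s)                      # two equal singular values and one zero

e1 = Vt[2]                    # v3: right null vector, so E e1 = 0
e2 = U[:, 2]                  # u3: left null vector, so E^T e2 = 0
print(np.allclose(E @ e1, 0), np.allclose(E.T @ e2, 0))  # -> True True
```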

 

Inspired by Programming

1. Because we take a block around each feature point, we have to be careful about the border effect: feature points close to the image border cannot form a full block. The solutions are to add a border that is larger than the block size, or to drop the feature points that are close to the border. Since the border usually gives less information about the image than the center part, we can simply drop them (see the sketch below).
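A minimal sketch of the dropping strategy (the function and parameter names are my own):

```python
import numpy as np

def drop_border_points(points, image_shape, block_size):
    """Keep only feature points whose block fits entirely inside the image.

    points: (N, 2) array of (row, col) feature locations.
    """
    half = block_size // 2
    rows, cols = points[:, 0], points[:, 1]
    h, w = image_shape
    mask = (rows >= half) & (rows < h - half) & \
           (cols >= half) & (cols < w - half)
    return points[mask]

pts = np.array([[2, 2], [50, 60], [118, 5]])
print(drop_border_points(pts, image_shape=(120, 160), block_size=9))
# -> only [50, 60] survives with a 9x9 block
```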

2. Since the correspondences are determined by a threshold, there can be feature points with more than one correspondence, which means similar features may be falsely matched. A solution to this problem is to keep only the strongest correspondence and cancel the others. The cross-correlation values are saved in a matrix of size [feature_points_im1, feature_points_im2], where the element at position [a, b] denotes the cross-correlation value of feature a in image 1 and feature b in image 2. Therefore, in the second step, whenever we find the best correspondence, we set the corresponding column to 0 so that no further correspondences with that feature occur (a sketch follows below).
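A minimal sketch of this greedy selection, assuming the NCC matrix has already been computed and the threshold is positive; it zeros both the row and the column of every chosen match, slightly extending the column-only cancellation above so that neither feature can be reused:

```python
import numpy as np

def greedy_matches(ncc, threshold):
    """Repeatedly take the strongest remaining correspondence, then zero
    out its row and column so neither feature can match again."""
    ncc = ncc.copy()
    matches = []
    while True:
        a, b = np.unravel_index(np.argmax(ncc), ncc.shape)
        if ncc[a, b] < threshold:
            break
        matches.append((a, b))
        ncc[a, :] = 0.0   # feature a of image 1 is used up
        ncc[:, b] = 0.0   # feature b of image 2 is used up
    return matches

ncc = np.array([[0.90, 0.80, 0.10],
                [0.85, 0.95, 0.20]])
print(greedy_matches(ncc, threshold=0.7))  # -> [(1, 1), (0, 0)]
```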

3. I found that the NCC algorithm has only limited performance. Consider three images, where the second is a translated version of the first and the third is a rotated version of the first. The extracted features are shown in Figure 4.

                                                                                                                   

Figure 4. (a) shows the original image features, (b) shows the features of the translated version, and (c) shows the features of the rotated version. The feature points are extracted using the Harris detector with N = 10. [Original from Xinlei Zhang]

 

We can see that the feature points on the book are extracted well, but if we use NCC to find correspondences, the performance is not sufficient, as Figure 5 shows.

 

                                                                                                                     

Figure 5. (a) shows the correspondences between the translated and the original image and (b) shows the correspondences for the rotated version. [Original from Xinlei]

 

As we can see, the correspondences for the translated version are good, since the corresponding feature points are found almost perfectly. But the algorithm cannot find the correspondences in the rotated version of the original image: NCC is not rotation-invariant.
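To see why, here is a minimal sketch of the patch comparison underlying NCC (the toy patches are made up): subtracting the mean and normalizing makes the score invariant to brightness and contrast changes, but the pixel-wise comparison collapses once the patch content is rotated.

```python
import numpy as np

def ncc(patch1, patch2):
    """Normalized cross-correlation of two equally sized patches."""
    a = patch1 - patch1.mean()
    b = patch2 - patch2.mean()
    return (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b))

patch = np.arange(81, dtype=float).reshape(9, 9)   # a toy gradient patch
shifted = patch + 10.0                             # brightness offset
rotated = np.rot90(patch)                          # rotated content

print(ncc(patch, shifted))   # -> 1.0: NCC is invariant to the offset
print(ncc(patch, rotated))   # -> 0.0 here: the same patch no longer matches
```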

posted @ 2020-06-03 03:17 brass