Defocus Image Deblurring Network With Defocus Map Estimation as Auxiliary Task
Abstract:
Different from object motion blur, defocus blur is caused by the limited depth of field of cameras. The defocus amount can be characterized by the parameter of the point spread function at each pixel and thus forms a defocus map. In this paper, we propose a new network architecture called Defocus Image Deblurring Auxiliary Learning Net (DID-ANet), which is specifically designed for single image defocus deblurring and uses defocus map estimation as an auxiliary task to improve the deblurring result. To facilitate the training of the network, we build a novel large-scale dataset for single image defocus deblurring, which contains the defocus images, the defocus maps and the all-sharp images. To the best of our knowledge, the new dataset is the first large-scale defocus deblurring dataset suitable for training deep networks. Moreover, the experimental results demonstrate that the proposed DID-ANet outperforms the state-of-the-art methods for both defocus image deblurring and defocus map estimation, both quantitatively and qualitatively. The dataset, code, and model are available on GitHub: https://github.com/xytmhy/DID-ANet-Defocus-Deblurring .
SECTION I. Introduction
When an image is captured, there are mainly two causes of image blur, i.e., motion and defocus. On the one hand, relative motion between the camera and the object, regardless of which one actually moves, leads to motion blur. On the other hand, when the depth range of the scene is relatively large but the depth of field of the camera is limited, the captured image can also be blurry due to defocus. An example of defocus blur is shown in Figure 1(a). The defocus amount depends heavily on the distance between the object and the focal plane of the camera and is therefore usually spatially varying. The defocus amount is usually modeled by the parameter of the point spread function (PSF), forming a pixel-wise defocus map, as shown in Figure 1(b). Clear images benefit many computer vision tasks such as detection, identification and segmentation [1]; therefore, in this paper we concentrate on defocus deblurring from a single image.
Fig. 1. Example for the proposed DID-ANet. The input defocus image and the ground truth all-sharp image come from the proposed dataset.
Defocus image deblurring from a single image (Figure 1(c)) is a challenging ill-posed problem [2], as the target all-sharp image (Figure 1(d)) contains much more detail than the input defocus image (Figure 1(a)), especially in the highly defocused areas such as the background, including the people and the chair in Figure 1. Hence more attention should be paid to the highly defocused areas by incorporating external information. Fortunately, with deep neural networks, rich real-world information can be learned from data covering a wide range of environments. Therefore, we propose a deep network approach for defocus image deblurring. Our contributions are four-fold.
We design a new network for defocus image deblurring, with defocus map estimation as an auxiliary learning task. The proposed Defocus Image Deblurring Auxiliary Learning Network (DID-ANet) is a novel end-to-end deep learning architecture specifically designed for defocus deblurring.
We introduce a new defocus image deblurring dataset for training and testing deep networks. This dataset contains partially defocused images, all-sharp (i.e., all-in-focus) images, as well as the corresponding defocus maps. The partially defocused image and the all-sharp image are generated from a single shot captured by a light-field camera. As far as we know, this is the first large-scale defocus deblurring dataset captured in real scenes that can be used to train deep networks.
By using defocus map estimation as guidance for defocus image deblurring, we alleviate the difficulty in network training. Furthermore, we improve the deblurring results by introducing effective loss functions and flexible training strategies.
Our experiments on several benchmarks show that DID-ANet achieves state-of-the-art performance on both defocus map estimation and defocus image deblurring, quantitatively and qualitatively.
SECTION II. Related Work
A. Image Deblurring
For defocus image deblurring, a typical blur model can be expressed as follows [3]:
$$I_{Blur} = k \ast I_{Clear} + N_G, \tag{1}$$
where $I_{Clear}$ is the clear image, $k$ is the blur kernel, $N_G$ is the additive Gaussian noise, and $I_{Blur}$ is the blurry image [4]. Usually, the defocus procedure is modeled as a convolution of the clear image with the PSF; therefore, conventional methods [5]–[7] first estimate the defocus kernels and then deconvolve the defocus image to produce the all-sharp image. This is a direct approach and usually obtains satisfactory results for areas with low or medium defocus, but it cannot produce sharp results for heavily defocused areas, since almost all the high-frequency information has been lost. Additional natural image priors [8]–[10] can improve the deblurring results in certain scenes, but they are also insufficient to fill in the rich real-world information. Moreover, because scenes change across photos taken in different environments, it is usually difficult to obtain natural scene priors with conventional methods.
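As an illustration of Eq. (1), the following minimal Python sketch synthesizes a defocused observation by convolving a clear image with a Gaussian PSF and adding Gaussian noise. It assumes a spatially invariant blur with standard deviation `sigma`, whereas real defocus blur is spatially varying; the function name and parameters are illustrative and not part of the proposed method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_defocus(clear_rgb, sigma=2.0, noise_std=0.01):
    """Eq. (1): I_Blur = k * I_Clear + N_G, with k taken as a Gaussian PSF.

    clear_rgb: float array in [0, 1] of shape (H, W, 3).
    sigma:     PSF parameter (spatially invariant here for simplicity).
    noise_std: standard deviation of the additive Gaussian noise N_G.
    """
    blurred = gaussian_filter(clear_rgb, sigma=(sigma, sigma, 0))  # blur spatial dims only
    noisy = blurred + np.random.normal(0.0, noise_std, size=clear_rgb.shape)
    return np.clip(noisy, 0.0, 1.0)
```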
In recent years, with the development of deep learning, many new approaches have been proposed. In the early stage, Sun et al. [11] estimate the blur kernel with a CNN; Chakrabarti [12], Anwar et al. [13], [14] and Gong et al. [15] also use neural networks to replace parts of the deblurring pipeline. Currently, end-to-end CNN approaches are widely applied. Nah et al. [16] and Tao et al. [4] use a multi-scale structure [17] for dynamic scene deblurring and achieve good visual results. Meanwhile, Ramakrishnan et al. [18] and Kupyn et al. [3], [19] use conditional generative adversarial networks (GANs) [20] for deblurring. Unfortunately, most of these end-to-end methods do not distinguish between types of blur, and some are specially designed for motion blur and are therefore unsuitable for defocus deblurring.
Recently, Abuolaim et al. [21] use a CNN to deblur defocused dual-pixel image pairs, but for single image defocus deblurring their results are unsatisfactory. Moreover, Lee et al. [22] apply a CNN for defocus image deblurring on the dual-pixel image dataset [21]. In this paper, we introduce a novel model and a new light-field-camera dataset for single image defocus deblurring.
B. Defocus Map Estimation
Existing defocus map estimation (DME) methods can be roughly categorized into three classes, i.e., edge-based methods, region-based methods and learning-based methods.
In the edge-based methods, the defocus amount is usually calculated at edge points and then propagated to the whole image. Zhuo and Sim [23], Cao et al. [24], Zhang et al. [25] and Karaali and Jung [26] reblur the input image with Gaussian kernels and use the ratio of the gradients of the reblurred images at edge points to calculate the defocus amount. Liu et al. [27] propose a two-parameter defocus model to better analyze the defocus process and produce more accurate estimations at edge points. Park et al. [28] unify handcrafted and deep features to estimate the defocus amount at edge points. After the defocus amounts are obtained at edge points, a propagation method such as Laplacian matting, KNN matting or guided image filtering is employed to obtain the final defocus map. In this procedure, the input image or a smoothed version of it is used as the guidance; therefore, the final defocus map usually suffers from the textures of the input image. Moreover, for areas far from edge points, the propagated defocus amount estimations are usually not reliable enough.
For region-based methods, the defocus amounts are often directly calculated from local patches centered at the current pixel. Trouvé et al. [29] deblur the input image patch with a set of PSF candidates and take the one that produces the sharpest deblurred result as the estimate of the defocus amount. Shi et al. [30] build a defocus patch dictionary on which they decompose the input image patches and use the sparsity of the decomposition coefficients as a feature. Zhu et al. [31] employ localized 2D frequency analysis to generate the likelihood of the defocus amount and employ coherent labeling to refine it. D’Andrès et al. [5] extract a feature from the likelihood and refine it with regression tree fields to ease the problem of [31]. They also build a realistic dataset for DME in the sense of image deblurring, using a light field camera. This is the first spatially-varying DME dataset; however, it contains only 22 images. Usually, region-based methods are free from the textures of the input image, but they often produce inaccurate estimations for homogeneous areas and cannot capture the defocus discontinuities very well. In another work of ours [32], we extract a region-based feature based on an improved likelihood and combine it with an edge-based basis to produce a texture-free defocus map while capturing the defocus discontinuities well.
Recently, there have been several deep-learning-based attempts. Yan and Shao [33] build a general regression neural network to first classify the blur type and then estimate the blur parameter. The defocus amount of their training dataset is spatially invariant, limiting the applicability of their method. Zhao et al. [34] use a bottom-top-bottom fully convolutional network to detect defocus blur, which can be viewed as a relaxed version of DME. Similarly, Zhang et al. [35] build a smart defocus dataset with three defocus levels and train a deep neural network to estimate the defocus of an input image. Lee et al. [36] build a synthetic dataset on which they develop an end-to-end deep neural network (DME-Net) to generate defocus maps. Domain adaptation is used because there are not enough data with ground truth defocus maps. This is the first truly deep-learning-based DME method, but the lack of real-scene data limits its performance and applications.
C. Auxiliary Learning
Auxiliary learning complements a primary task by training on additional auxiliary tasks alongside it [37]. A direct approach is to use a related task as the auxiliary task. Intermediate representations have been used as auxiliary supervision at lower levels of deep networks to combine the advantages of end-to-end training and more traditional pipeline approaches [38], [39]. Liebel and Körner [40] empirically demonstrate that auxiliary tasks can boost network performance, in terms of both final results and training time. Several different vision auxiliary tasks have been applied to depth estimation in monocular or multiple images [41]–[43]. Jaderberg et al. [44] use unsupervised auxiliary tasks to keep learning in the absence of extrinsic rewards in reinforcement learning.
In this paper, we use defocus map estimation as the auxiliary task due to its close relevance to the primary task: defocus image deblurring. The network architecture is also designed according to the relationship between these two tasks. In short, defocus map estimation provides low-level guidance that is fed into the defocus image deblurring sub-net.
SECTION III. Defocus Image Deblurring
A. Auxiliary Learning Network
Since out-of-focus images and clear ground truth images are available in pairs, a natural idea is to train a single end-to-end network for defocus image deblurring. However, our experience with such a simple structure is that the network output remains very similar to the original input, and the end point error (EPE) does not decrease to a desirably small value. One explanation for this phenomenon is that, because the input and output are very similar at a large scale and the input images contain clear areas, an output similar to the input constitutes a trivial local minimum: the clear areas already attain the smallest loss and the out-of-focus areas obtain a reasonable EPE, making it hard to improve further.
To avoid such a local minimum, we can group the input pixels according to their defocus amount, so that pixels with similar defocus amounts are deblurred with the same network parameters. That is, the defocus map can be used to guide the deblurring process. Hence, we propose a defocus image deblurring network with defocus map estimation as an auxiliary task. As shown by the detailed architecture in Figure 2, the input defocus image is first processed by a simple defocus map estimation sub-net, which contains several convolution layers and estimates a defocus map four times smaller than the input image; the estimated defocus map is then upsampled to the original size and concatenated with the input image as the input of the defocus deblurring sub-net, which contains 12 residual blocks and several convolution layers. With the defocus map as guidance, the defocused areas in the input image are deblurred effectively. The final deblurring result is the predicted residual added to the original input image.
Fig. 2. The architecture of the proposed DID-ANet. The network consists of two main parts: the defocus map estimation sub-net and the defocus image deblurring sub-net. The image is deblurred with the guidance of the estimated defocus map.
The standard residual blocks in our network contain two convolutional layers and two batch normalization layers, with a ReLU layer in the middle [45]. The number of convolution kernels is 128 in the defocus deblurring sub-net. No pooling or sub-sampling is used in the defocus deblurring sub-net, following [46].
In addition, our network pays more attention to the areas with larger defocus, as detailed in Section III-C on loss functions.
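The following PyTorch sketch illustrates the data flow described above and in Figure 2. The layer counts, strides and channel widths of the defocus map estimation sub-net are our own simplifications for illustration and do not reproduce the exact published configuration; only the residual-block structure, the 12-block count, the 128 channels, the 1/4-resolution defocus map and the residual output follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    # Standard residual block: conv-BN-ReLU-conv-BN plus a skip connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class DIDANetSketch(nn.Module):
    def __init__(self, channels=128, num_res_blocks=12):
        super().__init__()
        # Defocus map estimation (DME) sub-net: a few strided convolutions
        # predicting a one-channel defocus map at 1/4 of the input resolution
        # (layer configuration is illustrative).
        self.dme = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        # Deblurring sub-net: the image concatenated with the upsampled defocus
        # map passes through 12 residual blocks; the output is a residual image.
        self.head = nn.Conv2d(3 + 1, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResBlock(channels) for _ in range(num_res_blocks)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, image):
        dm_small = self.dme(image)                        # defocus map, 1/4 resolution
        dm = F.interpolate(dm_small, size=image.shape[-2:],
                           mode='bilinear', align_corners=False)
        feat = self.body(self.head(torch.cat([image, dm], dim=1)))
        return image + self.tail(feat), dm_small          # deblurred image, defocus map
```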
B. Defocus Re-Evaluation
The ultimate goal of defocus image deblurring is to generate an all-in-focus image. It is therefore natural to reuse the defocus map estimation sub-net to evaluate the deblurring quality, as shown in Figure 3. To further enhance the deblurring result after the first round of training, a re-evaluation loss is applied: the deblurring result and the clear ground truth image are separately processed by the frozen defocus map estimation sub-net, and the average pixel difference between the two resulting defocus maps is used as the re-evaluation loss. If the deblurring result is close to the ground truth, the defocus map estimations of the two will be similar as well.
Fig. 3. Illustration of the estimation and re-evaluation of the defocus map, as well as the corresponding training losses employed in our DID-ANet. In the first two training stages, the two sub-nets are optimized jointly. In the last training stage, the defocus map estimation sub-net is frozen, focusing on the improvement of the deblurring sub-net.
C. Loss Functions
To train the proposed network, we carefully adopt several loss functions based on the commonly used $L_1$ and $L_2$ norms. We use the defocus map estimation loss ($Loss_{DME}$) to supervise the estimated defocus map ($DM_{Est}$) with the ground truth defocus map ($DM_{GT}$):
$$Loss_{DME} = \| DM_{Est} - DM_{GT} \|_1. \tag{2}$$
Then, we design a loss to supervise the deblurring result with its paired clear ground truth. Because highly defocused areas are much more difficult to deblur than lightly defocused or non-defocused areas, we take the estimated defocus map as a reference and increase the importance of the difficult areas via the weighted deblur loss ($Loss_{WD}$), which assigns different weights to different positions:
$$Loss_{WD} = \| W_{DME} \times (Deblurred - Clear_{GT}) \|_2, \tag{3}$$
where the weight map $W_{DME}$ is the normalized defocus map with an offset $W_0$:
$$W_{DME} = \frac{DM_{Est}}{\mathrm{mean}(DM_{Est})} + W_0. \tag{4}$$
Here we use $W_0 = 1/9$.
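A minimal sketch of the weight map in Eq. (4) and the weighted deblur loss in Eq. (3), assuming the estimated defocus map has already been upsampled to the image resolution; it is implemented here as a mean of squared weighted differences, and the variable names are illustrative.

```python
import torch

def weighted_deblur_loss(deblurred, clear_gt, dm_est, w0=1.0 / 9.0):
    """Eqs. (3)-(4): squared-error loss weighted by the normalized estimated
    defocus map, so heavily defocused regions contribute more to the loss.

    deblurred, clear_gt: (N, 3, H, W) tensors; dm_est: (N, 1, H, W) tensor.
    """
    # Eq. (4): normalize each defocus map by its mean and add the offset W0.
    weight = dm_est / dm_est.mean(dim=(1, 2, 3), keepdim=True) + w0
    # Eq. (3): weighted squared difference between deblurred output and GT.
    return torch.mean((weight * (deblurred - clear_gt)) ** 2)
```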
As mentioned in Section III-B, we reuse the defocus map estimation sub-net to evaluate, and thereby enhance, the output of the defocus image deblurring network. That is, we compare the defocus map estimations of the deblurred output ($DM_{Deblurred}$) and of the all-sharp ground truth ($DM_{ClearGT}$). The difference is called the re-evaluation loss $Loss_{RE}$:
$$Loss_{RE} = \| DM_{Deblurred} - DM_{ClearGT} \|_1. \tag{5}$$
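A sketch of the re-evaluation loss in Eq. (5): the frozen DME sub-net is applied to both the deblurred output and the all-sharp ground truth, and the $L_1$ difference between the two predicted defocus maps is penalized. This is a simplified illustration; `dme_subnet` stands for the frozen defocus map estimation sub-net (its parameters are assumed to have `requires_grad=False`).

```python
import torch
import torch.nn.functional as F

def re_evaluation_loss(dme_subnet, deblurred, clear_gt):
    """Eq. (5): compare defocus maps predicted for the deblurred result and
    for the clear ground truth, with the DME sub-net kept frozen."""
    with torch.no_grad():
        dm_clear_gt = dme_subnet(clear_gt)   # target: defocus map of the sharp image
    dm_deblurred = dme_subnet(deblurred)     # gradients flow back into the deblurring sub-net
    return F.l1_loss(dm_deblurred, dm_clear_gt)
```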
Accordingly, we also design training strategies to optimize the whole network. Specifically, the training procedure contains three stages (Figure 3). In stages 1 and 2, the two sub-nets are jointly trained for 400 and 200 epochs, respectively. In stage 1, we feed the ground truth defocus map to the defocus image deblurring sub-net to avoid divergence caused by the random output of the untrained defocus map estimation sub-net and to speed up training. In stage 2, we instead use the output of the defocus map estimation sub-net as the input to the defocus image deblurring sub-net and jointly fine-tune the whole network. In stages 1 and 2, we employ $Loss_{DME}$ and $Loss_{WD}$ for supervision:
$$Loss_1 = \lambda_1 \times Loss_{DME} + \lambda_2 \times Loss_{WD}. \tag{6}$$
In stage 3, we add the re-evaluation loss to further fine-tune the defocus image deblurring sub-net for another 400 epochs, with the parameters of the defocus map estimation sub-net frozen. Hence we use $Loss_{WD}$ and $Loss_{RE}$ for supervision:
$$Loss_2 = \lambda_2 \times Loss_{WD} + \lambda_3 \times Loss_{RE}. \tag{7}$$
In this paper, the weights of the loss functions are $\lambda_1 = 0.1$, $\lambda_2 = 0.9$ and $\lambda_3 = 0.2$.
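The stage objectives in Eqs. (6) and (7) can be summarized as below, reusing the `weighted_deblur_loss` and `re_evaluation_loss` sketches defined above; detaching the defocus map inside the weight term is our own implementation choice, not something specified in the paper.

```python
import torch.nn.functional as F

LAMBDA1, LAMBDA2, LAMBDA3 = 0.1, 0.9, 0.2

def stage12_loss(dm_est, dm_gt, deblurred, clear_gt):
    # Eq. (6): Loss_DME (Eq. 2) plus the weighted deblur loss (Eq. 3),
    # used while both sub-nets are trained jointly (stages 1 and 2).
    loss_dme = F.l1_loss(dm_est, dm_gt)
    loss_wd = weighted_deblur_loss(deblurred, clear_gt, dm_est.detach())
    return LAMBDA1 * loss_dme + LAMBDA2 * loss_wd

def stage3_loss(dme_subnet, dm_est, deblurred, clear_gt):
    # Eq. (7): weighted deblur loss plus the re-evaluation loss (Eq. 5),
    # used while the DME sub-net is frozen (stage 3).
    loss_wd = weighted_deblur_loss(deblurred, clear_gt, dm_est.detach())
    loss_re = re_evaluation_loss(dme_subnet, deblurred, clear_gt)
    return LAMBDA2 * loss_wd + LAMBDA3 * loss_re
```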
D. DED Real Scenes Dataset
To the best of our knowledge, in the context of defocus map estimation and defocus image deblurring, there is only one small dataset, called Realistic [5], consisting of 22 image pairs, which is far from enough for training deep neural networks. To fill this gap and facilitate the training of our model, we build the first large-scale realistic dataset for defocus map estimation and defocus image deblurring (termed the DED dataset) with a light field camera.
Usually, it is extremely hard to directly capture an RGB-Defocus dataset using conventional cameras. To build such a dataset, typically two images captured with different camera settings are needed. However, the contents and intensities of these two images would differ more or less, so geometric and photometric alignments are needed, and precise alignments are themselves difficult. Alternatively, one can estimate defocus maps from stereo/RGB-Depth datasets and then manually reblur the all-in-focus images to synthesize partially defocused images. However, the employed kernels might differ from real ones, and consequently the produced defocused images differ from real scenes. To bypass these problems, we use a Lytro Illum light field camera [47], which can generate two differently focused images from a single shot, to build the dataset.
The Lytro company provides software along with the camera to process the captured images, which record the four-dimensional light field. With the help of this software, the all-sharp image $I_s$, a partially defocused image $I_b$ and the corresponding depth map $I_d$ can be easily generated. In principle, $I_s$ and $I_b$ are generated by filtering the four-dimensional light field with specific four-dimensional band-pass filters [48]; based on the Fourier Slice Photography Theorem [49] for light field cameras, they can be viewed as if they were captured twice by a camera with different settings. $I_d$ is generated using the stereo information extracted from the four-dimensional light field.
Then, inspired by [5], we calculate the mean squared error (MSE) between $I_b$ and a reblurred version of $I_s$ in a patch-wise way as follows:
$$d^{(r)}[i] = \frac{\sum_{j \in N_i} \left( I_b[j] - (I_s \otimes k^{(r)})[j] \right)^2}{L^2}, \tag{8}$$
where $i$ is a pixel, $N_i$ is a small window of size $L \times L$ centered at pixel $i$, and $r$ is the radius of the candidate PSF $k^{(r)}$. The defocus amount at pixel $i$ is then obtained by minimizing this MSE:
$$b[i] = r^{*} = \arg\min_{r} d^{(r)}[i]. \tag{9}$$
Next, we detect high-confidence values as in [5] and propagate them to the non-confident pixels via Laplacian matting, with the depth map $I_d$ as the guidance.
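A brute-force sketch of Eqs. (8)-(9): for each candidate PSF radius, the all-sharp image is reblurred and a patch-wise MSE against the partially defocused image is computed; each pixel then takes the radius minimizing this error. The candidate radii, window size, grayscale conversion and Gaussian PSF model are assumptions for illustration; the confidence detection and matting propagation of [5] are not shown.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def estimate_defocus_map(i_sharp, i_blur, radii=np.linspace(0.0, 4.0, 17), window=7):
    """Eqs. (8)-(9): per pixel, pick the PSF radius whose reblurred sharp image
    best matches the defocused image within an L x L window."""
    gray_sharp = i_sharp.mean(axis=2)
    gray_blur = i_blur.mean(axis=2)
    errors = []
    for r in radii:
        reblurred = gaussian_filter(gray_sharp, sigma=r) if r > 0 else gray_sharp
        # Eq. (8): patch-wise mean squared error (local average of squared differences).
        errors.append(uniform_filter((gray_blur - reblurred) ** 2, size=window))
    best = np.argmin(np.stack(errors, axis=0), axis=0)   # Eq. (9): argmin over radii
    return radii[best]                                   # raw defocus map b[i]
```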
In the end, we generate 1,112 image pairs in total for the proposed DED dataset; Figure 4 shows several examples. Some of the images are from the multi-view dataset [50] and the others (over half) were captured by ourselves. Among the image pairs, 100 are randomly selected as the test set and the remaining 1,012 are used for training. The selection is conducted so that the test set contains different scenes from the training set. Both the training set (including the defocus images, the defocus maps and the clear ground truth) and the test set (only the defocus images) of DED are publicly released.
Fig. 4. Examples of the proposed DED dataset.
SECTION IV. Experiments
A. Implementation Details
For training, the Adam solver is used with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$, with the numbers of epochs detailed above. The input images, the corresponding all-sharp ground truths and the defocus maps are randomly cropped to a size of 256 × 256. Other data augmentation strategies, such as random flipping, rotation and color change, are also applied to increase the variability of the dataset [51]. The batch size is set to 16 when training on four Nvidia 1080Ti GPUs, and testing is conducted on a single GPU. The testing time for a single image of size 600 × 400 is 0.27 seconds on average, with about 570 billion FLOPs and 19.2 million parameters.
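A small sketch of the optimizer setup and the paired random cropping described above, assuming a PyTorch implementation. Only the Adam hyper-parameters and the 256 × 256 crop size follow the text; the learning rate is not given in the paper, so the PyTorch default is kept, and the CHW tensor layout and helper names are assumptions.

```python
import random
import torch

def make_optimizer(model):
    # Adam with beta1=0.9, beta2=0.999, eps=1e-8 as stated in the text;
    # the learning rate is not specified in the paper, so the default is kept.
    return torch.optim.Adam(model.parameters(), betas=(0.9, 0.999), eps=1e-8)

def paired_random_crop(blur, sharp, dm, size=256):
    # Crop the defocused image, the all-sharp ground truth and the defocus map
    # at the same location (CHW tensors) so the supervision stays aligned.
    h, w = blur.shape[-2:]
    top, left = random.randint(0, h - size), random.randint(0, w - size)
    sl = (..., slice(top, top + size), slice(left, left + size))
    return blur[sl], sharp[sl], dm[sl]
```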
B. Experimental Results
We evaluate the proposed method on the Realistic dataset [5] and the test set of the proposed DED dataset (DED-test). The results of both defocus map estimation and defocus image deblurring are compared with state-of-the-art methods.
The results for defocus map estimation are compared with the methods of Zhuo and Sim [23], D’Andrès et al. [5], Park et al. [28], Karaali and Jung [26] and the recent deep-learning-based DME-Net [36]. The evaluation metrics are the mean absolute error (MAE) and mean squared error (MSE) with respect to the ground truth defocus map. The quantitative comparison can be found in Table I, where the best results are in bold. We can see that on Realistic the proposed method is comparable with D’Andrès et al. [5] and outperforms the other three methods, while on DED-test the proposed method performs the best, with the lowest MAE and MSE.
TABLE I. Quantitative Comparison for Defocus Map Estimation, Where the Best Results Are in Bold
Several visual examples of defocus map estimation are also shown in Figure 5 (Realistic) and Figure 6 (DED-test). Our results are much closer to the ground truth, and the erroneous areas beyond object boundaries are smaller than those of the other four methods.
Fig. 5. Visual comparison of defocus map estimation on Realistic.
Fig. 6. Visual comparison of defocus map estimation on DED-test.
The results for defocus deblurring are compared with the conventional method of D’Andrès et al. [5]; the deep learning methods SRN-Deblur [4], DeblurGAN [3], [19] and IFAN [22]; and DME-Net [36], which applies a CNN for defocus map estimation and conventional deconvolution [52] for deblurring. The deep learning methods are fine-tuned on the training set of the proposed DED dataset. The evaluation metrics are the Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) with respect to the clear ground truth. The quantitative comparison can be found in Table II, where the best results are in bold. For both the Realistic dataset and the proposed DED test set, the proposed method outperforms all other methods, with the best PSNR and SSIM.
TABLE II. Quantitative Comparison for Defocus Image Deblurring, Where the Best Results Are in Bold
Several visual examples of defocus image deblurring are shown in Figure 7 (Realistic) and Figure 8 (DED-test). As images from the DED dataset are larger, we crop them and zoom in for a better view. The results of the proposed DID-ANet are clearer and more vivid than those of all other methods. In Figure 7 (top), the texture of the rock and the people with a black bag are clearer in our result; in Figure 7 (bottom), our result has a much more colorful number ‘2’ and more vivid boundaries of the locks. In Figure 8, the human faces (top), the pedestrians and background trees (middle), and the cars and the bicycle (bottom) are all more realistic and clearer in our result. Besides, the fuzzy appearance is greatly reduced in our deblurring results.
Fig. 7. Visual comparison of defocus image deblurring on Realistic.
Fig. 8. Visual comparison of defocus image deblurring on DED-test. As the images in the DED dataset are too large, we crop them and zoom in for a better view.
In contrast, there are blocking artifacts in the results of [5], and these methods cannot handle the defocus blur very well. Specifically, for DME-Net [36] with deconvolution deblurring [52], the results are not as clear as ours; the motion deblurring methods [4], [19] cannot deal with the defocus blur very well; and for IFAN [22], although it is specially designed for defocus blur removal and the model is fine-tuned on the proposed DED dataset, the results are still not as clear as those of the proposed method. Moreover, there are noticeable artifacts in the deblurring results of [22], which also lead to lower PSNR and SSIM scores than our method, as shown in Table II.
Moreover, to verify the generalization ability of DID-ANet beyond light-field-camera-generated datasets, we select a small collection of pictures with obvious defocus areas from the COCO dataset (some examples are shown in Figure 9). The higher the value in the defocus map, the larger the defocus amount. After deblurring, the defocused areas (the face of the girl and the wall in the first example, the man sitting behind the desk in the second example, and the upper part of the third example) become clear.
Fig. 9. Examples of the defocus map estimation and defocus deblurring by DID-ANet on the COCO dataset.
C. Ablation Studies
Several ablation studies are conducted on both the Realistic dataset and the test set of the DED dataset. The results can be found in Table III.
TABLE III. Ablation Study on the Realistic and DED Test Sets, Where the Best Results Are in Bold. The Deblurring Results Benefit From the Auxiliary Defocus Map Estimation Task, the Loss Functions and the Flexible Training Strategies
First, to demonstrate the necessity of auxiliary learning, two experiments are conducted. One is termed “Backbone”, the simple structure without the defocus map estimation module. The other is termed “Without AL”, the complete DID-ANet architecture but trained without supervision for the defocus map estimation. As expected, these two variants produce much lower PSNR and SSIM for defocus image deblurring. Furthermore, more training epochs are required for both “Backbone” and “Without AL” to converge, indicating that it is hard to train a network for defocus image deblurring without the auxiliary learning. In addition, we also use the deconvolution method [52] to deblur the input image with the defocus map generated by our DME module. However, the PSNR and SSIM of these results are much lower than even those of the “Backbone” network, implying that deblurring with a network is indeed important for defocus image deblurring.
Then, another ablation experiment is conducted to verify the effectiveness of each training strategy. Specifically, we test the deblurring result after each training stage. As shown in Table III, the performance is already better than that of all other methods after the first training stage (compare with Table II). Moreover, both PSNR and SSIM increase after each training stage. We also study the effectiveness of each loss function in the proposed DID-ANet. The performance with a simple $L_2$ loss is worse than that with the weighted deblur loss ($Loss_{WD}$), showing that $Loss_{WD}$ is useful. Training stage 3 improves the performance compared with training stage 2, showing the effectiveness of $Loss_{RE}$.
We also show the loss curves during training for the different network structures and loss functions mentioned above in Figure 10; only the first 200 training epochs are plotted. We use the mean end point error (EPE) on the validation set (about 10% of the training set held out for validation during training) to measure the performance of all variations. As shown in Figure 10, the proposed DID-ANet with the weighted deblur loss $Loss_{WD}$ achieves the lowest EPE and converges more quickly than all other variations.
Fig. 10. EPE loss curve on the validation set for ablation studies. The proposed DID-ANet with the weighted deblur loss function $Loss_{WD}$ achieves the lowest EPE and converges more quickly than all other variations.
D. Cross Validation on the DED Dataset
A five-fold cross validation experiment is conducted on the proposed DED dataset to demonstrate the robustness of the proposed DID-ANet. The DED dataset is randomly partitioned into 5 folds. Specifically, we first group all the images by scene, where each scene contains 2 to 6 images; then the scenes are randomly assigned to folds; finally, the 5 folds are adjusted appropriately to make sure that each fold contains about 20% of the images, as sketched below.
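A small sketch of the scene-wise fold assignment described above; the greedy balancing step is an illustrative simplification of the "adjusted appropriately" procedure, and the function and argument names are our own.

```python
import random
from collections import defaultdict

def make_folds(image_ids, scene_of, num_folds=5, seed=0):
    """Group images by scene, then assign whole scenes to folds so that no
    scene is split across folds and each fold holds roughly 1/num_folds of the images."""
    scenes = defaultdict(list)
    for img in image_ids:
        scenes[scene_of[img]].append(img)
    groups = list(scenes.values())
    random.Random(seed).shuffle(groups)
    folds = [[] for _ in range(num_folds)]
    for group in sorted(groups, key=len, reverse=True):
        # Greedy balancing: put each scene into the currently smallest fold.
        min(folds, key=len).extend(group)
    return folds
```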
DID-ANet is trained on 4 folds and tested on the remaining fold. Furthermore, the models are also tested on the Realistic dataset. The results are shown in Table IV. It should be noted that the performance of these models on the Realistic dataset is quite similar across folds, meaning that the proposed model is insensitive to the training/test partition.
TABLE IV. Five-Fold Cross Validation of DID-ANet on the Proposed DED Dataset. The Model Is Trained on Four Folds and Tested on the Remaining Fold as Well as the Realistic Dataset. It Should Be Noted That the Performance Does Not Change Very Much for the Realistic Dataset, Meaning That the Model Is Insensitive to the Training Set Partition
SECTION V. Conclusion
In this paper, we propose a novel deep auxiliary learning approach called DID-ANet, with defocus map estimation as the auxiliary task for defocus image deblurring. The guidance provided by the defocus map estimation makes the network easier to train end-to-end and helps to improve the deblurring results. Several effective loss functions and flexible training strategies are also introduced. Furthermore, a new large-scale defocus dataset termed DED is built, which is the first large-scale defocus deblurring dataset captured in real scenes and suitable for training deep networks. Experiments show that our DID-ANet obtains state-of-the-art performance on both defocus map estimation and defocus image deblurring, quantitatively and qualitatively.