Learning scene-aware image priors with high-order Markov random fields
 Dong Gong^{1},
 Yanning Zhang^{1},
 Qingsen Yan^{1} and
 Haisen Li^{1}
Received: 1 August 2017
Accepted: 16 October 2017
Published: 30 October 2017
Abstract
Many methods have been proposed to learn image priors from natural images for ill-posed image restoration tasks. However, many prior learning algorithms assume that a single general prior distribution is suitable for all kinds of images. Since the contents of natural images and the corresponding low-level statistical characteristics vary from scene to scene, we argue that learning a universal generative prior for all natural images may be imperfect. Although a universal generative prior can remove artifacts and preserve natural smoothness in image restoration, it also tends to introduce unrealistic flatness and cluttered textures. To address this issue, we propose to learn a scene-aware image prior based on the high-order Markov random field (MRF) model (SAMRF). With this model, we jointly learn a set of shared low-level features and different potentials for specific scene contents. In prediction, a good prior can be adapted to the given degenerated image through perception of its scene content. Experimental results on image denoising and inpainting tasks demonstrate the effectiveness of the SAMRF in both numerical evaluation and visual comparison.
Introduction
Image restoration tasks, such as denoising (Tappen et al. 2007; Schmidt et al. 2010; Schmidt and Roth 2014), deblurring (Krishnan and Fergus 2009; Krishnan et al. 2011; Levin et al. 2009; Zhang et al. 2013; Gong et al. 2016, 2017) and super-resolution (Tappen and Liu 2012), are all inherently ill-posed. Knowledge of natural images is therefore used as a prior to stabilize the estimation and to recover information lost in non-ideal imaging processes. Recently, many image priors have operated on image gradients for simplicity of modeling and better performance (Fergus et al. 2006; Levin et al. 2007, 2009; Krishnan et al. 2011; Krishnan and Fergus 2009; Xu et al. 2013; Zhang et al. 2013). However, representing the image prior distribution in the gradient domain is fragile given the sophisticated notion of a natural image, as variation in image content and/or scale makes the gradient characteristics unstable for modeling individual clear images.
All of the manually designed priors and learned priors aim to model a universal distribution that represents all real-world natural images (in the specific domain under discussion). Unfortunately, images with different scene contents have varying statistics on common low-level features such as gradients or responses of learned filters in high-order MRF cliques (Fig. 1). Figure 1 shows that images with different contents (left) have different responses to the gradient filter (middle) and to the learned high-order filters of Schmidt et al. (2010) (right). Therefore, relying on a universal generative image prior to recover every specific image is improper.
Considering the gap between a universal image prior and the particular properties of individual images, a series of content-related image priors have been exploited in many image restoration tasks (Tappen et al. 2007; Cho et al. 2010; Sun et al. 2010; Schmidt and Roth 2014; McAuley et al. 2006). In Tappen et al. (2007) and Cho et al. (2010), local features are used to adapt the prior to local areas in restoration tasks. However, because local features such as gradient filter responses (Tappen et al. 2007) and local texture (Cho et al. 2010) are usually not distinctive on weak edges or in regions with ambiguous content, these local-specific models suffer from inaccurate labeling, and the restoration results often contain artifacts. In addition, the models in Tappen et al. (2007) and Schmidt and Roth (2014) can only be learned for a specific state of the degeneration, which limits their range of application. Previous work on content-aware priors mainly focuses on connecting the contents with simple features such as gradient statistics, since connecting complex low-level features (e.g., arbitrary filter responses) with high-level features representing the scene contents is more difficult. Additionally, McAuley et al. (2006) proposed a high-order MRF prior for color images. In Feng et al. (2016), a high-order natural image prior model was proposed for reducing Poisson noise. Ren et al. (2013) introduced the “context-aware” concept into sparse representation for image denoising and super-resolution. Considering the limited expressive power of the classical MRF, Wu et al. (2016) proposed to combine MRFs with deep neural networks.
In a natural image, low-level statistical characteristics are usually generated by the contents of the captured scene (Torralba and Oliva 2003), and the scene perception of an image is usually more robust than its pixel-level (low-level) characteristics. Based on this observation, we focus on developing a scene-aware prior model that adapts to the manifold of the scene-related content in an image globally instead of relying on local structures. In this paper, we propose a scene-aware Markov random field (SAMRF) model to capture the scene-discriminating statistical prior of a whole natural image; the SAMRF model has high-order non-Gaussian potentials conditioned on a scene coefficient extracted from high-level concepts of the observation. This rests on the assumption that high-level contents are fairly well preserved even in degenerated observations. Efficient algorithms for learning and inference are then proposed. Experiments on two image restoration tasks, denoising and inpainting, illustrate that the SAMRF-based scene-aware image prior captures image statistics accurately and improves the quality of restored images effectively.
Scene-aware image prior based on MRFs
The purpose of this paper is to build a system in which (1) a high-order MRF model depending on the scene content of the image is proposed to model the low-level statistical distribution and (2) the observed image can be adapted to a proper specific prior in the restoration procedure. An overview of the system is illustrated in Fig. 2.
Scene-aware MRF model
The distribution of a natural image \(\mathbf {x}\) is formulated as a high-order MRF (Schmidt et al. 2010). To let the scene content information guide the modeling, we introduce an explicit scene coefficient as a parameter of the distribution.
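For orientation, a high-order MRF of the kind used by Schmidt et al. (2010), extended with a scene coefficient, might be written as follows. This is a hedged sketch, not necessarily the exact form of the model: the normalizing constant \(Z\), the expert function \(\phi\), the clique set \(\mathcal{C}\), and the precise way the scene coefficient enters are assumptions, chosen only to agree with the symbols \(\mathbf{F}_i\), \(\mathbf{x}_c\), \(\mathbf{w}_i(f(\mathbf{x}), \varvec{\theta})\) and \(N\) used elsewhere in the text.

\[
p\big(\mathbf{x} \mid f(\mathbf{x}); \varvec{\Theta}\big) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \prod_{i=1}^{N} \phi\big(\mathbf{F}_i^{\mathrm{T}} \mathbf{x}_c;\, \mathbf{w}_i(f(\mathbf{x}), \varvec{\theta})\big)
\]

Under this reading, each clique \(c\) is an image patch \(\mathbf{x}_c\), each \(\mathbf{F}_i\) is a linear filter shared across scenes, and only the potential parameters \(\mathbf{w}_i\) change with the scene coefficient \(f(\mathbf{x})\).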
Potential function conditioned on scene coefficient
Model (1) leaves the form of the potential function unspecified. This section describes how the potential function is modeled as depending on the scene coefficient.
Link the image \(\mathbf{x}\) and the scene perception through \(f(\mathbf{x})\)
Given an image \(\mathbf{x}\), an easy way to represent its scene is to assign discrete labels associated with the content (e.g., objects or scene) in \(\mathbf{x}\), as in many scene understanding works (Li et al. 2009). However, because there is a gap between the high-level perception of the content and low-level features [e.g., SIFT (Lowe 2004) and GIST (Oliva and Torralba 2001)] (Li et al. 2010), even images with the same content labels may have dissimilar low-level statistical distributions. Instead of tackling this issue directly, we take advantage of it. Because our task is rooted in low-level vision, we do not need to assign exact labels to the contents of the scene. We directly use the Bag-of-words (BoW) histogram of SIFT descriptors as the scene perception. Given an image \(\mathbf{x}\), we extract dense SIFT (DSIFT) descriptors from it and generate a DSIFT-BoW histogram \(\mathbf{b}_{\mathbf{x}}\) over a vocabulary of 200 visual words. \(f(\mathbf{x})\) is defined as \(f(\mathbf{x})=\mathbf{b}_{\mathbf{x}}\). To extract dense SIFT, we run the SIFT feature extractor on a dense grid of locations covering the whole image at a fixed scale and orientation. Specifically, in the prediction task, given an observation \(\mathbf{y}\), we first roughly recover a clear image \(\hat{\mathbf {x}}(\mathbf {y})\). For example, for a noisy observation \(\mathbf{y}\), we denoise it with a simple Wiener filter (Sonka et al. 2014) or a Gaussian low-pass filter. We then extract the DSIFT-BoW feature from \(\hat{\mathbf {x}}(\mathbf {y})\) and let \(\mathbf{b}_{\hat{\mathbf {x}}(\mathbf {y})}\) represent the corresponding feature of the latent clear image. The DSIFT-BoW encoder is denoted as \(\mathrm {D}\). Note that such simple methods recover a clear image that is good enough for extracting the BoW feature as an initialization, but not good enough to show many pixel-level details.
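The encoding step of a BoW histogram can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `nearest_word` and `bow_histogram` are hypothetical, the vocabulary is assumed to be given (for instance learned beforehand by clustering), the L1 normalization is an assumption, and toy 2-D vectors stand in for the 128-D dense SIFT descriptors and the 200-word vocabulary.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary word closest to the descriptor."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        # Squared Euclidean distance between descriptor and visual word.
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, vocabulary):
    """Encode local descriptors as an L1-normalised bag-of-words histogram."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


# Toy stand-ins (assumptions): 2-D "descriptors" and a 3-word vocabulary.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
b_x = bow_histogram(descriptors, vocabulary)
print(b_x)
```

In the prediction setting described above, the same encoder would be applied to descriptors extracted from the roughly restored image rather than from the clean one.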
Link the scene coefficient \(f(\mathbf{x})\) and lowlevel statistics through \(\mathbf{w}(f(\mathbf{x}), \varvec{\theta })\)
Learning algorithm
In this section, we introduce an efficient learning algorithm that estimates the model parameters from high-quality training samples, and an inference algorithm for image restoration.
Given a set of training images \(\{\mathbf {x}_t\}_{t=1}^{T}\), the model parameters \(\varvec{\Theta }\) and the low-level features \(\{\mathbf {F}_i\}\) are estimated by maximizing the likelihood of the training data. Maximizing the likelihood is equivalent to minimizing the Kullback–Leibler divergence (KLD) between the empirical distribution of the training data and the model distribution.
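The link between the two objectives can be made explicit with the standard identity below. The notation is an assumption introduced for illustration: \(\tilde{p}\) denotes the empirical distribution of the training images and \(E(\mathbf{x}; \varvec{\Theta})\) an energy such that \(p(\mathbf{x}; \varvec{\Theta}) \propto \exp(-E(\mathbf{x}; \varvec{\Theta}))\).

\[
\mathrm{KL}\big(\tilde{p} \,\|\, p(\cdot\,; \varvec{\Theta})\big) = \mathbb{E}_{\tilde{p}}\big[\log \tilde{p}(\mathbf{x})\big] - \frac{1}{T} \sum_{t=1}^{T} \log p(\mathbf{x}_t; \varvec{\Theta})
\]

The first term does not depend on \(\varvec{\Theta}\), so minimizing the divergence is formally the same as maximizing the average log-likelihood. For an energy-based model, the gradient of that average log-likelihood takes the familiar form (Hinton 2002)

\[
\frac{\partial}{\partial \varvec{\Theta}} \frac{1}{T} \sum_{t=1}^{T} \log p(\mathbf{x}_t; \varvec{\Theta}) = \Big\langle \frac{\partial E}{\partial \varvec{\Theta}} \Big\rangle_{p(\cdot\,; \varvec{\Theta})} - \Big\langle \frac{\partial E}{\partial \varvec{\Theta}} \Big\rangle_{\tilde{p}},
\]

where the first expectation is taken under the model and is generally intractable, which is why samples from the model are needed.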
An auxiliary-variable-based Gibbs sampler (Schmidt et al. 2010) is used to draw samples from the model distribution. The expectation can then be approximated by averaging over the samples. The full learning scheme is given in Algorithm 1.
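The idea of auxiliary-variable Gibbs sampling can be illustrated on a toy scalar model. The sketch below is not the image-level sampler of the paper: it assumes a single Gaussian scale mixture (GSM) expert multiplied by a broad Gaussian factor, and all names and numerical values are hypothetical. Introducing the mixture index `z` as an auxiliary variable makes both conditionals easy to sample: `z` given `x` is discrete, and `x` given `z` is Gaussian.

```python
import math
import random


def gaussian_pdf(x, variance):
    """Density of a zero-mean Gaussian with the given variance."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def sample_scale_index(x, weights, variances, rng):
    """Sample z ~ p(z | x), proportional to w_z * N(x; 0, variance_z)."""
    posteriors = [w * gaussian_pdf(x, v) for w, v in zip(weights, variances)]
    return rng.choices(range(len(weights)), weights=posteriors)[0]


def sample_x(z, variances, base_variance, rng):
    """Sample x ~ p(x | z): product of two zero-mean Gaussians, precisions add."""
    precision = 1.0 / base_variance + 1.0 / variances[z]
    return rng.gauss(0.0, math.sqrt(1.0 / precision))


def gibbs_samples(weights, variances, base_variance, n_samples, burn_in, seed=0):
    """Alternate between the two conditionals and keep samples after burn-in."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for step in range(burn_in + n_samples):
        z = sample_scale_index(x, weights, variances, rng)
        x = sample_x(z, variances, base_variance, rng)
        if step >= burn_in:
            samples.append(x)
    return samples


# Hypothetical toy parameters: a narrow and a wide mixture component.
samples = gibbs_samples([0.7, 0.3], [0.05, 4.0], 100.0, 20000, 100)
# Monte Carlo estimate of E[x^2]: an expectation approximated by a sample average.
mean_square = sum(s * s for s in samples) / len(samples)
print(mean_square)
```

The last two lines show the averaging step: any expectation under the model is replaced by the mean of the corresponding quantity over the drawn samples.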
Applications and experiments
To evaluate the modeling ability of the scene-aware prior on real-world images directly, we evaluate the performance of the learned prior on image denoising and image inpainting. Before the evaluation, we first introduce some implementation details for learning the scene-aware image prior and present the learned model. We then revisit the standard Bayesian restoration formulation and derive an MMSE estimation approach for our scene-aware image prior.
Learning details and learning results of the SAMRF
When we set the number of GMM modes K to 4, the training images are split into four sets. We randomly select several representative samples from each cluster and show them in Fig. 4. As shown in Fig. 4, images within the same cluster have similar appearances; conversely, images in different clusters have different visual properties. Although the clustering result does not follow the contents strictly, it reflects the low-level properties properly. For example, in Fig. 4, the cluster on the left contains many clear and flat background areas, and the bottom-right one has more complex textures and clutter. The clustering provides a useful intermediate result that lets the algorithm learn diverse and meaningful features and distributions. The learned filters and four sets of experts (potential functions) are shown in Fig. 3. Figure 3a shows the learned filters, and Fig. 3b–e shows the learned weights and curves of the potential functions for the four clusters, respectively. Compared with the learning result in Schmidt et al. (2010), our filters have a wider range of variation, and the experts have a spikier peak and heavier tails, which confine the favored images to a narrower region.
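How a fitted GMM with K = 4 modes splits images into four sets can be sketched as follows. This is an illustration under assumptions, not the authors' code: the mixture is assumed to have diagonal covariances and to be already fitted on the scene features, each image is assigned to the component with the largest posterior responsibility, and the toy 2-D features, means and variances are made up.

```python
import math


def log_gaussian_diag(x, mean, variances):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, variances)
    )


def responsibilities(x, mix_weights, means, variances):
    """Posterior probability of each mixture component given the feature x."""
    log_terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(mix_weights, means, variances)
    ]
    peak = max(log_terms)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(t - peak) for t in log_terms]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def hard_cluster(x, mix_weights, means, variances):
    """Index of the component with the largest responsibility."""
    resp = responsibilities(x, mix_weights, means, variances)
    return max(range(len(resp)), key=resp.__getitem__)


# Hypothetical fitted mixture with K = 4 components on 2-D toy features.
mix_weights = [0.25, 0.25, 0.25, 0.25]
means = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
variances = [(0.05, 0.05)] * 4

feature = (0.9, 0.1)
resp = responsibilities(feature, mix_weights, means, variances)
cluster = hard_cluster(feature, mix_weights, means, variances)
print(cluster, resp)
```

The responsibilities also offer a soft alternative to the hard split, although whether the model uses hard or soft assignments is not something this sketch can settle.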
Bayesian image restoration formulation
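As a reminder of the standard setting, and as a sketch under assumptions only, the posterior of the clear image \(\mathbf{x}\) given the observation \(\mathbf{y}\) follows from Bayes' rule. The Gaussian likelihood with noise level \(\sigma\) shown here applies to denoising, and conditioning the prior on the scene feature of the roughly restored image \(\hat{\mathbf{x}}(\mathbf{y})\) is one plausible reading of the text; the sample count \(S\) and the samples \(\mathbf{x}^{(s)}\) are notation introduced for illustration.

\[
p(\mathbf{x} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{x})\, p\big(\mathbf{x} \mid f(\hat{\mathbf{x}}(\mathbf{y})); \varvec{\Theta}\big), \qquad p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; \mathbf{x}, \sigma^2 \mathbf{I})
\]

The MMSE estimate is the posterior mean, which can be approximated by averaging samples drawn from the posterior:

\[
\hat{\mathbf{x}}_{\mathrm{MMSE}} = \mathbb{E}[\mathbf{x} \mid \mathbf{y}] \approx \frac{1}{S} \sum_{s=1}^{S} \mathbf{x}^{(s)}, \qquad \mathbf{x}^{(s)} \sim p(\mathbf{x} \mid \mathbf{y})
\]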
Evaluation on image denoising
When the noise level is low (\(\sigma =10\)), the performance of the scene-aware prior is very similar to that of Schmidt et al. (2010). A possible explanation is that some noise is always hard to remove, and both the prior in Schmidt et al. (2010) and ours reach the inherent limit of this family of algorithms. An example for visual illustration is shown in Fig. 9. As shown in Fig. 9, our result is much closer to the ground truth than the others. Hence, although the scene-aware prior and the MRF model learned in Schmidt et al. (2010) are close to each other in the numerical evaluation, the proposed method achieves more natural and accurate results, which illustrates the power of the scene-aware concept in prior learning.
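To give a feel for what a noise level of \(\sigma =10\) means numerically, the sketch below corrupts a toy image with additive Gaussian noise and measures PSNR. Several points are assumptions: the text does not name the metric used in the numerical evaluation, intensities are taken to lie in the 8-bit range [0, 255], and the function names and the synthetic image are hypothetical.

```python
import math
import random


def add_gaussian_noise(image, sigma, seed=0):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# Synthetic 32x32 test image (an assumption, not data from the paper).
clean = [[float((r * 8 + c) % 256) for c in range(32)] for r in range(32)]
noisy = add_gaussian_noise(clean, sigma=10.0)
noisy_psnr = psnr(clean, noisy)
print(round(noisy_psnr, 2))
```

The PSNR of the noisy input is the baseline that any denoiser has to improve on; comparing two priors then amounts to comparing the PSNR of their restored outputs against the same clean image.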
Evaluation on image inpainting
Image inpainting aims to recover a high-quality image from a degenerated image in which some of the pixels are lost or deteriorated. In addition to the image denoising task, we also test the proposed method on image inpainting in this section. As shown in Fig. 10, part of an image is creased and deteriorated due to folding; this image is used as the input in this experiment. Given a binary mask indicating the deteriorated pixels, the MRF model in Schmidt et al. (2010) and the proposed scene-aware prior both work well in recovering an intact image. The result of the proposed method, however, is more natural, especially at the pixels near the creases in the original image. Since no ground-truth image is available for the real-world deteriorated image, only a visual comparison is shown in Fig. 10.
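The role of the binary mask can be illustrated with a deliberately simple stand-in. The sketch below does not use the SAMRF prior or any learned model: it fills masked pixels by repeated 4-neighbour averaging (a basic diffusion scheme) while keeping all known pixels fixed, which is only meant to show how the mask separates pixels to be estimated from pixels to be trusted. The function name and the toy image are hypothetical.

```python
def inpaint_by_averaging(image, mask, iterations=200):
    """Fill masked pixels by iterated 4-neighbour averaging.

    mask[r][c] is True where the pixel is deteriorated and must be re-estimated;
    pixels with mask False are treated as known and never modified.
    """
    height, width = len(image), len(image[0])
    current = [
        [0.0 if mask[r][c] else float(image[r][c]) for c in range(width)]
        for r in range(height)
    ]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(height):
            for c in range(width):
                if not mask[r][c]:
                    continue  # known pixel: keep the observed value
                neighbours = [
                    current[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < height and 0 <= cc < width
                ]
                updated[r][c] = sum(neighbours) / len(neighbours)
        current = updated
    return current


# Toy example: a flat 5x5 image whose middle column is marked as deteriorated.
damaged = [[100.0 if c != 2 else 0.0 for c in range(5)] for r in range(5)]
mask = [[c == 2 for c in range(5)] for r in range(5)]
restored = inpaint_by_averaging(damaged, mask)
print(restored[2])
```

A learned prior replaces the plain averaging rule with a model of natural image statistics, but the mask plays the same role: it determines which pixels are constrained by the observation.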
Conclusion and future work
The proposed high-order MRF-based scene-aware image prior models the low-level distribution of an image conditioned on the high-level scene characteristics of the observation, and improves the restoration of degenerated observations. Experimental results demonstrate that the proposed method can generate desirable restoration results.

Our proposed model learns low-level features in a small local area and uses the simple DSIFT-BoW to express the high-level scene concept, which restricts the expressive power of the model. Embedding the proposed method in a deep convolutional neural network (Bengio et al. 2013) might enable higher expressive power.

We evaluated the effectiveness of the proposed method on image denoising and inpainting tasks. In the future, we may extend this work to more applications, including image super-resolution, image deblurring, and optical flow.
In model (1), if the \(\mathbf{F}_i\) are \(l\times l\) filters, each \(\mathbf{x}_c\) is the sub-vector of \(\mathbf{x}\) covering an \(l\times l\) patch.
We slightly abuse the notation \(\{\mathbf {w}_i\}_{i=1}^N\) as both the parameters of the potentials and the functions \(\mathbf{w}_i(f(\mathbf {x}), \varvec{\theta })\).
Declarations
Authors' contributions
DG drafted the manuscript. YZ, QY, and HL participated in its design and coordination and/or helped to revise the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was supported in part by National Natural Science Foundation of China (61231016, 61572405), China 863 Project 2015AA016402.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
 Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer, Berlin, pp 177–186
 Chen Y, Yu W, Pock T (2015) On learning optimized reaction diffusion processes for effective image restoration. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5261–5269
 Cho TS, Joshi N, Zitnick CL, Kang SB, Szeliski R, Freeman WT (2010) A content-aware image prior. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Dabov K, Foi A, Katkovnik V, Egiazarian K (2007) Image denoising by sparse 3D transform-domain collaborative filtering. In: IEEE transactions on image processing
 Feng W, Qiao H, Chen Y (2016) Poisson noise reduction with higher-order natural image prior model. SIAM J Imaging Sci 9(3):1502–1524
 Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT (2006) Removing camera shake from a single photograph. In: ACM transactions on graphics (TOG)
 Gong D, Tan M, Zhang Y, van den Hengel A, Shi Q (2016) Blind image deconvolution by automatic gradient activation. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Gong D, Yang J, Liu L, Zhang Y, Reid I, Shen C et al (2017) From motion blur to motion flow: a deep learning solution for removing heterogeneous motion blur. In: The IEEE conference on computer vision and pattern recognition (CVPR)
 Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural computation
 Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT Press, Cambridge
 Krishnan D, Fergus R (2009) Fast image deconvolution using hyper-Laplacian priors. In: NIPS
 Krishnan D, Tay T, Fergus R (2011) Blind deconvolution using a normalized sparsity measure. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Levin A, Fergus R, Durand F, Freeman WT (2007) Image and depth from a conventional camera with a coded aperture. ACM Trans Gr 26:70
 Levin A, Weiss Y, Durand F, Freeman WT (2009) Understanding and evaluating blind deconvolution algorithms. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Li LJ, Socher R, Fei-Fei L (2009) Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In: IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, New York
 Li LJ, Su H, Fei-Fei L, Xing EP (2010) Object bank: a high-level image representation for scene classification & semantic feature sparsification. In: Advances in neural information processing systems
 Lowe DG (2004) Distinctive image features from scale-invariant keypoints. In: IEEE international conference on computer vision (ICCV)
 Martin D, Fowlkes C, Tal D, Malik J (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE international conference on computer vision (ICCV)
 McAuley JJ, Caetano TS, Smola AJ, Franz MO (2006) Learning high-order MRF priors of color images. In: Proceedings of the 23rd international conference on machine learning. ACM, New York, pp 617–624
 Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
 Ren J, Liu J, Guo Z (2013) Context-aware sparse decomposition for image denoising and super-resolution. IEEE Trans Image Process 22(4):1456–1469
 Roth S, Black MJ (2005) Fields of experts: a framework for learning image priors. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Samuel KGG, Tappen MF (2009) Learning optimized MAP estimates in continuously-valued MRF models. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Schmidt U, Gao Q, Roth S (2010) A generative perspective on MRFs in low-level vision. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Schmidt U, Jancsary J, Nowozin S, Roth S, Rother C (2014) Cascades of regression tree fields for image restoration
 Schmidt U, Roth S (2014) Shrinkage fields for effective image restoration. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Sonka M, Hlavac V, Boyle R (2014) Image processing, analysis, and machine vision. Cengage Learning, Boston
 Sun J, Zhu J, Tappen MF (2010) Context-constrained hallucination for image super-resolution. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Tappen MF, Liu C (2012) A Bayesian approach to alignment-based image hallucination. In: ECCV
 Tappen MF, Liu C, Adelson EH, Freeman WT (2007) Learning Gaussian conditional random fields for low-level vision. In: IEEE conference on computer vision and pattern recognition (CVPR)
 Torralba A, Oliva A (2003) Statistics of natural image categories. Netw Comput Neural Syst 14:391–412
 Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. In: IEEE transactions on image processing
 Weiss Y, Freeman WT (2007) What makes a good model of natural images? In: IEEE conference on computer vision and pattern recognition (CVPR)
 Wu Z, Lin D, Tang X (2016) Deep Markov random field for image modeling. In: European conference on computer vision. Springer, Berlin, pp 295–312
 Xu L, Zheng S, Jia J (2013) Unnatural l0 sparse representation for natural image deblurring. In: IEEE conference on computer vision and pattern recognition (CVPR). IEEE, New York, pp 1107–1114
 Zhang H, Wipf D, Zhang Y (2013) Multi-image blind deblurring using a coupled adaptive sparse prior. In: IEEE conference on computer vision and pattern recognition (CVPR)