On the translation-invariance of image distance metric
 Bing Sun^{1},
 Jufu Feng^{1} and
 Guoping Wang^{2}
Received: 30 May 2015
Accepted: 7 October 2015
Published: 25 November 2015
Abstract
An appropriate choice of the distance metric is a fundamental problem in pattern recognition, machine learning and cluster analysis. Methods that are based on the distance between samples, e.g., the k-means clustering algorithm and the k-nearest neighbor classifier, rely crucially on the performance of the distance metric. In this paper, the property of translation invariance for the distance metric of images is especially emphasized. The consideration is twofold. Firstly, some of the commonly used distance metrics, such as the Euclidean and Minkowski distance, are independent of the training set and/or the domain-specific knowledge. Secondly, translation invariance is a necessary property for any intuitively reasonable image metric. The image Euclidean distance (IMED) and generalized Euclidean distance (GED) are image metrics that take the spatial relationship between pixels into consideration. Sun et al. (IEEE Conference on Computer Vision and Pattern Recognition, pp 1398–1405, 2009) showed that IMED is equivalent to a translation-invariant transform and proposed a metric learning algorithm based on the equivalency. In this paper, we provide a complete treatment of this topic and extend the equivalency to the discrete frequency domain. Based on the connection, we show that GED and IMED can be implemented as low-pass filters, which reduces the space and time complexities significantly. The transform domain metric learning proposed in (Sun et al. 2009) is also interpreted as a translation-invariant counterpart of LDA. Experimental results demonstrate improvements in algorithm efficiency and performance boosts on the small sample size problems.
Background
The distance measure of images plays a central role in computer vision and pattern recognition. It can be either learned from a training set or specified according to a priori domain-specific knowledge. The problem of metric learning has gained considerable interest in recent years (Hastie and Tibshirani 1996; Xing et al. 2003; Hertz and Pavel 2002; Bar-Hillel et al. 2003; Goldberger et al. 2005; Shalev-Shwartz et al. 2004; Chopra et al. 2005; Globerson et al. 2006; Weinberger et al. 2005; Lebanon 2006; Davis et al. 2007; Li et al. 2007). On the other hand, the standard Euclidean distance assumes that pixels are spatially independent, which yields counterintuitive results, e.g., a perceptually large distortion can produce a smaller distance (Jean 1990; Wang et al. 2005). By incorporating the spatial correlation of pixels, two classes of image metrics, namely IMED (Wang et al. 2005) and GED (Jean 1990), were designed to deal with the spatial dependencies for image distances, and they have demonstrated consistent performance improvements in many real-world problems (Jean 1990; Wang et al. 2005; Chen et al. 2006; Wang et al. 2006; Zhu et al. 2007).
A key advantage of GED and IMED is that they can be embedded in any classification technique. The calculation of IMED is equivalent to applying a linear transform, called the standardizing transform (ST), followed by the traditional Euclidean distance. Hence, feeding the ST-transformed images to a recognition algorithm automatically embeds IMED (Wang et al. 2005). The analogous transform for GED is referred to as the generalized Euclidean transform (GET) (Jean 1990).
IMED and GED are invariant to image translation, namely, if the same image translation is applied to two images, their IMED remains invariant. However, the associated transforms (ST and GET) are not translation invariant (TI). This left open the question of whether IMED can be implemented by a TI transform. In (Sun et al. 2009), the authors gave a positive answer to this question and provided a proof for simple cases, yet a few technical problems were left unresolved.
We should emphasize the importance of translation invariance. Intuitively, as the distance between images should depend only on their relative position, translation invariance (TI) should be a fundamental requirement for any reasonable image metric. Yet few metric learning or linear subspace methods are aware of the TI property when dealing with images.
In this paper, we extend the theory in (Sun et al. 2009) to the discrete frequency domain to cover the practical cases. Based on the metric-transform connection, we show that both GED and IMED are essentially low-pass filters. The resulting filters lead to fast implementations of GED and IMED, coinciding with the algorithm proposed in (Sun et al. 2008), which reduces the space and time complexities significantly. The transform domain metric learning (TDML) proposed in (Sun et al. 2009) is also interpreted as a translation-invariant counterpart of LDA. Experimental results demonstrate significant improvements in algorithm efficiency and performance boosts on the small sample size problems.
IMED and GED
Given an image X of size \(n_1 \times n_2\), the vectorization of X is the vector \({{\mathbf {x}}}= \mathrm{vec} \left( X \right) \), such that the \(\left( n_2 i_1 + i_2 \right) \)th component of \({{\mathbf {x}}}\) is the intensity at the \(\left( i_1, i_2 \right) \) pixel. This is a common technique to manipulate image data.
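The indexing convention can be made concrete with a short Python sketch. It assumes zero-based indices and row-major storage; the helper name `vec` and the 2 × 3 example image are illustrative only.

```python
def vec(image):
    """Row-major vectorization: pixel (i1, i2) goes to index n2 * i1 + i2."""
    return [value for row in image for value in row]


# A hypothetical 2 x 3 image (n1 = 2 rows, n2 = 3 columns).
X = [[10, 11, 12],
     [20, 21, 22]]
n1, n2 = len(X), len(X[0])
x = vec(X)

# Every pixel lands at the index stated in the text.
for i1 in range(n1):
    for i2 in range(n2):
        assert x[n2 * i1 + i2] == X[i1][i2]
```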
As suggested in (Wang et al. 2005), the calculation of IMED can be simplified by decomposing G into \(A^T A\). The standardizing transform (ST) is the special case when \(A^T = A\), written as \(A = G^{\frac{1}{2}}\). By incorporating the standardizing transform matrix \(G^{\frac{1}{2}}\), IMED can be easily embedded into almost any recognition algorithm. That is, feeding the ST-transformed image \(G^{\frac{1}{2}} {\mathbf {x}}\) to a recognition algorithm automatically embeds IMED. Besides, Wang et al. showed that ST seems to have a smoothing effect (Wang et al. 2005) by illustrating a few eigenvectors associated with the largest eigenvalues of \(G^{\frac{1}{2}}\), and then argued that since IMED is equivalent to a transform domain smoothing, it can tolerate small deformations and noise and hence improve recognition performance.
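The identity behind this embedding is \((\mathbf{x}-\mathbf{y})^T G (\mathbf{x}-\mathbf{y}) = \Vert A\mathbf{x} - A\mathbf{y} \Vert^2\) whenever \(G = A^T A\). The Python sketch below checks it numerically under two loudly stated assumptions: the matrix `G` is a small 1-D Gaussian analogue of the IMED metric matrix (chosen for illustration), and the factor `A` comes from a Cholesky decomposition, which is one valid choice of A but is not the symmetric ST factor \(G^{\frac{1}{2}}\). All function names are hypothetical.

```python
import math


def gaussian_metric(n):
    """Assumed 1-D analogue of the metric matrix: G[a][b] = exp(-(a-b)^2 / 2) / sqrt(2 pi)."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return [[c * math.exp(-((a - b) ** 2) / 2.0) for b in range(n)] for a in range(n)]


def cholesky_upper(G):
    """Return an upper-triangular A with A^T A = G (G symmetric positive definite)."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(G[i][i] - s)
            else:
                L[i][j] = (G[i][j] - s) / L[j][j]
    # A = L^T, so that A^T A = L L^T = G.
    return [[L[j][i] for j in range(n)] for i in range(n)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def metric_distance_sq(G, x, y):
    """Squared distance (x - y)^T G (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    return sum(di * gi for di, gi in zip(d, matvec(G, d)))


def transformed_distance_sq(A, x, y):
    """Squared Euclidean distance between the transformed signals A x and A y."""
    u, v = matvec(A, x), matvec(A, y)
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Two hypothetical 5-sample signals.
x = [0.0, 1.0, 3.0, 2.0, 0.5]
y = [1.0, 0.0, 2.0, 2.5, 1.5]
G = gaussian_metric(len(x))
A = cholesky_upper(G)
d_metric = metric_distance_sq(G, x, y)
d_transform = transformed_distance_sq(A, x, y)
```

The two quantities agree up to rounding, which is why any algorithm that uses Euclidean distances on the transformed data implicitly uses the metric G.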
The translation invariant transform of a translation invariant metric
In (Sun et al. 2009), the authors give a positive answer to the question of whether a translation-invariant metric can be implemented by a translation-invariant transform.
Theorem 1
An essential requirement of Theorem 1 is \(\hat{g}(\omega ) \geqslant 0\). The condition is satisfied when \(G \geqslant 0\) is an infinite-sized matrix, as a consequence of the positive operator theorem (Rudin 1991) or the generalized Bochner’s theorem on groups (Rudin 1990). In practice, G is a positive-definite matrix of finite size \(n \times n\). Gray (2006) proved that as n approaches infinity, \(\hat{g}(\omega )\) converges to a nonnegative value.
Unlike the case of ST for IMED (Wang et al. 2005) and GET for GED (Jean 1990), the constructed translation-invariant transform matrix H is not a square matrix. Specifically, H is of size \((n+2m) \times n\), where \([-m,m)\) is the support of the sequence g[i].
Methods
Computational aspects
Unfortunately, Theorem 1 is presented in the continuous frequency domain only (Sun et al. 2009), which is not easy to apply directly in practical problems because \(\hat{g}(\omega )\) is a continuous function that has to be discretized. A naive extension of Theorem 1 can be constructed by using the circular convolution (Oppenheim et al. 1999) instead of the regular convolution.
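The difference between circular and regular convolution, and the effect of zero padding to a longer period, can be seen in a few lines of Python. This is only a sketch: the signal `x`, the three-tap filter `h` and the helper names are hypothetical, and the filter is stored causally rather than centred, which merely shifts the output.

```python
def circular_convolve(a, b):
    """Circular convolution of two sequences with the same period N."""
    N = len(a)
    assert len(b) == N
    return [sum(a[k] * b[(i - k) % N] for k in range(N)) for i in range(N)]


def linear_convolve(a, b):
    """Regular (full) convolution; the output has len(a) + len(b) - 1 samples."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def zero_pad(seq, N):
    return list(seq) + [0.0] * (N - len(seq))


x = [1.0, 2.0, 3.0, 4.0]   # hypothetical signal, n = 4
h = [0.25, 0.5, 0.25]      # hypothetical filter with 2m + 1 = 3 taps (m = 1)

# Period n: the filter wraps around the ends of the signal.
wrapped = circular_convolve(x, zero_pad(h, len(x)))

# Period n + 2m: the padding absorbs the wrap-around.
N = len(x) + len(h) - 1
padded = circular_convolve(zero_pad(x, N), zero_pad(h, N))
full = linear_convolve(x, h)
```

With period n + 2m the circular result coincides with the regular convolution, whereas with period n the first samples are contaminated by wrapped-around terms.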
Proposition 2
The second problem is even worse: to derive a translation-invariant transform in the discrete frequency domain, the matrix representation \({\mathbf {G}}\) of the metric must be a circulant matrix, which is not the case in common settings, including both IMED and GED.
The above statements assert that, given a finitely supported translation-invariant transform h[x], the induced metric \(\tilde{g}[i]\) constructed from the padded periodic filter \(\tilde{h}[i]\) is also translation invariant.
Hence, the analogous version of Theorem 1 can be given as follows.
Theorem 3
Given the \([-m, m)\) supported metric filter g[i], there exists a circular filter \(\tilde{h} [i]\), such that g[i] is equal to \(\tilde{h} {\circledast }_{n+2m} \tilde{h} [i]\) on its support.
Proof
The results in the discrete frequency domain can be easily extended to multidimensional signal spaces, in the same way as in the continuous frequency domain (Sun et al. 2009). A convenient property of the extension is that multidimensional data (e.g., 2-D images) can be processed without vectorization.
The translation-invariant transforms of IMED and GED

IMED The metric tensor \(\mathbbm {g}\) for IMED is defined in (Wang et al. 2005) by a Gaussian, i.e.,
$$\begin{aligned} \mathbbm {g}_{j_1 j_2}^{i_1 i_2} = \frac{1}{2 \pi } e^{-\frac{d^2}{2}}, \end{aligned}$$
where
$$\begin{aligned} d = \sqrt{(i_1 - j_1)^2 + (i_2 - j_2)^2}. \end{aligned}$$
The metric filter for IMED is separable, i.e.,
$$\begin{aligned} g[i_1, i_2] = \frac{1}{2 \pi } e^{-\frac{i_1^2 + i_2^2}{2}} = \frac{1}{\sqrt{2 \pi }} e^{-\frac{i_1^2}{2}} \cdot \frac{1}{\sqrt{2 \pi }} e^{-\frac{i_2^2}{2}} = g_0 [i_1] g_0 [i_2]. \end{aligned}$$
We choose the support length \(m_1 = m_2 = 4\) (\(g [4, 4] \approx 1.7911 \times 10^{-8}\)), i.e., \(g [i_1, i_2]\) is supported on \([-4, 4] \times [-4, 4]\). For \(52 \times 52\) signals (\(n_1 = n_2 = 52\)), we build the period \(n_1 + 2 m_1 = 60\) sequence
$$\begin{aligned} \widetilde{g_0}[i]={\left\{ \begin{array}{ll} g_0 [i] = \frac{1}{\sqrt{2 \pi }} e^{-\frac{i^2}{2}}, &{} i \in [-4,4] \\ 0, &{} i \in (4,56). \\ \end{array}\right. } \end{aligned}$$
It is easy to validate that \(\widehat{\widetilde{g_0}} [j] \geqslant 0, \forall j\). Thus the separated periodic filter \(\widetilde{h_0} [i]\) can be constructed by
$$\begin{aligned} \widetilde{h_0} [i] =\mathcal {F}^{-1} \left( \sqrt{\widehat{\widetilde{g_0}} [j]} \right) , \end{aligned}$$
and the overall filter is \(\tilde{h} [i_1, i_2] = \widetilde{h_0} [i_1] \widetilde{h_0} [i_2]\).
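The 1-D IMED construction can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it uses a naive \(O(N^2)\) DFT from the standard library instead of an FFT, and the helper names and the test signal `x` are hypothetical. It builds the period-60 sequence, checks that its spectrum is nonnegative, takes the spectral square root, and then verifies both that the filter convolved with itself reproduces the metric sequence and that the filtered, zero-padded signal has the same squared norm as the quadratic form of the metric.

```python
import cmath
import math


def dft(seq):
    N = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


def idft(spec):
    N = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]


def circular_convolve(a, b):
    N = len(a)
    return [sum(a[k] * b[(i - k) % N] for k in range(N)) for i in range(N)]


m, n = 4, 52
N = n + 2 * m  # period 60

# Periodic metric sequence: Gaussian on [-4, 4], zero elsewhere.
# Negative indices are stored modulo N.
g0 = [0.0] * N
for i in range(-m, m + 1):
    g0[i % N] = math.exp(-i * i / 2.0) / math.sqrt(2.0 * math.pi)

# The spectrum is real because g0 is even; it must be nonnegative.
g0_hat = [z.real for z in dft(g0)]
min_spectrum = min(g0_hat)

# Spectral square root gives the periodic filter h0.
h0 = [z.real for z in idft([math.sqrt(max(v, 0.0)) for v in g0_hat])]

# Check 1: h0 circularly convolved with itself reproduces g0.
recon = circular_convolve(h0, h0)
max_error = max(abs(r - g) for r, g in zip(recon, g0))

# Check 2: squared norm of the filtered signal equals x^T G x,
# where G[a][b] = g0[(a - b) mod N] for a, b in 0..n-1.
x = [math.sin(0.3 * i) for i in range(n)]  # hypothetical 1-D signal
u = circular_convolve(h0, x + [0.0] * (2 * m))
transform_norm_sq = sum(v * v for v in u)
metric_norm_sq = sum(x[a] * g0[(a - b) % N] * x[b]
                     for a in range(n) for b in range(n))
```

The 2-D filter follows by applying the same 1-D filter along each axis, as stated by the separability above.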

GED The metric tensor \(\mathbbm {g}\) for GED is defined in (Jean 1990) by a Laplacian, i.e.,
$$\begin{aligned} \mathbbm {g}_{j_1 j_2}^{i_1 i_2} = r^d = e^{d \log r}, \end{aligned}$$
where \(d = \vert i_1 - j_1 \vert + \vert i_2 - j_2 \vert \) is the \(l_1\) distance of the two pixels and \(r = 0.6\) is a decay constant. The metric filter for GED is separable, i.e.,
$$\begin{aligned} g [i_1, i_2] = r^{\vert i_1 \vert + \vert i_2 \vert } = r^{\vert i_1 \vert } \cdot r^{\vert i_2 \vert } = g_0 [i_1] g_0 [i_2]. \end{aligned}$$
We choose the support length \(m_1 = m_2 = 15\) (\(g [15, 15] \approx 2.2107 \times 10^{-7}\)), i.e., \(g [i_1, i_2]\) is supported on \([-15, 15] \times [-15, 15]\). For \(30 \times 30\) signals (\(n_1 = n_2 = 30\)), we build the period \(n_1 + 2 m_1 = 60\) sequence
$$\begin{aligned} \widetilde{g_0}[i]={\left\{ \begin{array}{ll} g_0[i] = r^{\vert i \vert }, &{} i \in [-15,15] \\ 0, &{} i \in (15,45). \\ \end{array}\right. } \end{aligned}$$
We can validate that \(\widehat{\widetilde{g_0}} [j] \geqslant 0, \forall j\). Thus the separated periodic filter \(\widetilde{h_0} [i]\) can be constructed by
$$\begin{aligned} \widetilde{h_0} [i] =\mathcal {F}^{-1} \left( \sqrt{\widehat{\widetilde{g_0}} [j]} \right) , \end{aligned}$$
and the overall filter is \(\tilde{h} [i_1, i_2] = \widetilde{h_0} [i_1] \widetilde{h_0} [i_2]\).
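The nonnegativity of the GED spectrum can also be checked by hand. The following is one possible bound, offered as a sketch; the notation \(\theta _j = 2 \pi j / 60\) is introduced here for illustration. It compares the truncated sum with the infinite Poisson-kernel sum and bounds the discarded tail:
$$\begin{aligned} \widehat{\widetilde{g_0}} [j]&= 1 + 2 \sum _{i=1}^{15} r^{i} \cos (i \theta _j) \\&= \frac{1 - r^2}{1 - 2 r \cos \theta _j + r^2} - 2 \sum _{i=16}^{\infty } r^{i} \cos (i \theta _j) \\&\geqslant \frac{1 - r}{1 + r} - \frac{2 r^{16}}{1 - r} \approx 0.25 - 0.0014 > 0 \quad (r = 0.6). \end{aligned}$$
The first term of the last line is the minimum of the Poisson kernel, attained at \(\theta _j = \pi \), and the second is the geometric bound on the tail, so the truncation to \([-15, 15]\) does not destroy nonnegativity.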
The fast implementation of IMED and GED
The advantages of the filtering decomposition over the GET or ST lie not only in its physical interpretation but also in its time and space complexity. In general, the computational complexity of the filtering decomposition is \(O (n \log n)\), owing to the efficiency of the FFT (Oppenheim et al. 1999).
Since the filter is of fixed size, the fast implementation further reduces the space complexity from \(O(n^2)\) to O(1), and the time complexity from \(O(n^2)\) to O(n).
Transform domain metric learning
Generally, in order to learn a metric G, one can optimize with respect to G. For images of size \(n_1 \times n_2\), G has \(n_1^2 \times n_2^2\) elements, making the optimization intractable. Another problem is that G must satisfy the positive semidefinite constraint, i.e., \(G \geqslant 0\), so it is not easy to find an efficient algorithm to solve a problem with such a constraint.
Results
Experiments on the transform implementations of IMED
In this section, the standardizing transform (ST) and the translation-invariant implementation of IMED are evaluated using the US Postal Service (USPS) and the FERET databases. The USPS database consists of size-normalized 16 by 16 pixel images of handwritten digits, divided into a training set of 7291 prototypes and a test set of 2007 patterns. The FERET database consists of 384 by 256 pixel images of human faces, from which the ’fa’ subset is chosen, including 1762 images.
 1 The ST group

Algorithm 1 \(U = G^{\frac{1}{2}} \mathrm{vec} (X)\), the original ST. It is memory expensive, and sometimes infeasible, e.g., for the FERET database, \(G^{\frac{1}{2}}\) is of size \(98304 \times 98304\), requiring 36 GiB of memory (4 bytes per element).

Algorithm 2 Since G is separable (Wang et al. 2005), it can be shown that \(G_1^{\frac{1}{2}} X G^{\frac{1}{2}}_2\) is equivalent to Algorithm 1. This solves the memory problem. For the FERET database, only a \(384 \times 384\) and a \(256 \times 256\) matrix are needed.

 2 The CST group (translation-invariant transforms)

Algorithm 3 \(({\mathbf {h}}_1 \otimes {\mathbf {h}}_2^{*}) * X\), for which we need only a precomputed \(5 \times 5\) template.

Algorithm 4 Apply the template \({\mathbf {h}}_1\) to each column of X, then \({\mathbf {h}}_2\) to each row of X. This is the separated equivalent of Algorithm 3, in the same way that Algorithm 2 is the separated equivalent of Algorithm 1. Because \({\mathbf {h}}_1 ={\mathbf {h}}_2\), only one copy is kept in memory.
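The equivalence between the single 2-D template (Algorithm 3) and the two 1-D passes (Algorithm 4) can be sketched as below. Several things here are assumptions made for illustration: the 5-tap binomial filter stands in for the actual IMED template, the output is cropped to the image size with zeros assumed outside the image, and all function names are hypothetical.

```python
def convolve_rows(X, h):
    """Filter each row with the centred odd-length template h (zeros outside the image)."""
    m = len(h) // 2
    n2 = len(X[0])
    return [[sum(h[k + m] * row[j - k] for k in range(-m, m + 1) if 0 <= j - k < n2)
             for j in range(n2)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def separable_filter(X, h1, h2):
    """Algorithm 4 style: h1 along each column, then h2 along each row."""
    cols_done = transpose(convolve_rows(transpose(X), h1))
    return convolve_rows(cols_done, h2)


def template_filter(X, h1, h2):
    """Algorithm 3 style: a single pass with the 2-D outer-product template."""
    m1, m2 = len(h1) // 2, len(h2) // 2
    n1, n2 = len(X), len(X[0])
    out = [[0.0] * n2 for _ in range(n1)]
    for i in range(n1):
        for j in range(n2):
            acc = 0.0
            for a in range(-m1, m1 + 1):
                for b in range(-m2, m2 + 1):
                    if 0 <= i - a < n1 and 0 <= j - b < n2:
                        acc += h1[a + m1] * h2[b + m2] * X[i - a][j - b]
            out[i][j] = acc
    return out


# Hypothetical 5-tap filter (binomial weights), used for both directions.
h = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
# Hypothetical 6 x 7 test image.
X = [[float((3 * i + 5 * j) % 7) for j in range(7)] for i in range(6)]

A3 = template_filter(X, h, h)
A4 = separable_filter(X, h, h)
max_diff = max(abs(p - q) for rp, rq in zip(A3, A4) for p, q in zip(rp, rq))
```

Per output pixel, the template version multiplies by 25 coefficients while the separated version uses 5 + 5, which is the source of the constant-factor saving of Algorithm 4.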
Time complexities
Algorithm  USPS (s)  FERET-fa (s)
ST(1)      0.3283    n/a
ST(2)      0.0970    130.13
CST(3)     0.7330    10.26
CST(4)     0.0584    15.34
We also computed the Euclidean distance of the CST-transformed images, which deviates by about \(1\%\) from the IMED of the original images, owing to the approximate nature of the convolution template.
Experiments on the transform domain metric learning
In this section, we conduct several sets of experiments. The experiments are performed on three face data sets (the UMIST, Yale and ORL databases). The images in the UMIST, Yale and ORL data sets are resized to \(28 \times 23\), \(40 \times 30\) and \(28 \times 23\), respectively.^{1} We randomly select two images from each class as the training set, and use the remaining images for testing. We repeat the process 20 times independently and report the average results.
Comparison of image metrics on various databases (%)
Database  ED     IMED   GED    XNZ    TDML
UMIST     60.88  60.90  62.05  60.96  73.92
Yale      71.41  71.41  71.11  67.73  75.26
ORL       81.95  81.63  80.88  81.24  84.06
Another set of experiments was to test whether embedding the learned TI metric in an image recognition technique, e.g., SVM (Vapnik 1998), can improve that algorithm’s accuracy. Embedding a TI metric in an algorithm is simple: first, transform all images by the corresponding TI transform, and then run the algorithm with the transformed images as input data.
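The two-step embedding recipe can be sketched in Python as follows. To keep the example self-contained, a 1-nearest-neighbour rule stands in for the SVM used in the experiments, the three-tap smoothing filter is a hypothetical TI transform rather than a learned one, and the 1-D "images" and all names are made up for illustration.

```python
def smooth(signal, taps=(0.25, 0.5, 0.25)):
    """Hypothetical TI transform: full convolution with a short symmetric filter."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out


def dist_sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def embed_and_classify(train_images, train_labels, test_image, transform):
    """Step 1: transform every image. Step 2: run the classifier (here 1-NN) on the result."""
    train_t = [transform(img) for img in train_images]
    test_t = transform(test_image)
    best = min(range(len(train_t)), key=lambda k: dist_sq(train_t[k], test_t))
    return train_labels[best]


# Toy data: a single bright pixel near the left or the right end.
train = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
labels = ["left", "right"]
test = [0, 1, 0, 0, 0, 0]

# Plain Euclidean distances cannot tell the two training samples apart.
raw_distances = [dist_sq(t, test) for t in train]

# After the transform, the spatially closer sample is also closer in distance.
smoothed_distances = [dist_sq(smooth(t), smooth(test)) for t in train]
predicted = embed_and_classify(train, labels, test, smooth)
```

In this toy case the raw squared distances are equal, while the transformed distances favour the training sample whose bright pixel is spatially nearer, which is the kind of behaviour a spatially aware metric is meant to provide.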
SVM classification performances of the embedded metrics (%)
Database  ED     IMED   GED    TDML
UMIST     60.33  62.02  62.45  69.53
Yale      68.90  69.12  69.23  72.30
ORL       79.25  79.07  79.00  80.38
Conclusion
In this paper, we extend the equivalency in (Sun et al. 2009) to the discrete frequency domain. We show that GED and IMED are low-pass filters, resulting in fast implementations which reduce the space and time complexities significantly. The transform domain metric learning (TDML) proposed in (Sun et al. 2009) is also interpreted as a translation-invariant counterpart of LDA. Experimental results demonstrate significant improvements in algorithm efficiency and performance boosts on small sample size problems.
One possible future direction is the search for more effective metric learning algorithms. TDML is a simple and intuitive attempt, and we expect novel methods that combine the concepts of margins, kernels, locality and nonlinearity.
^{1} Resizing is necessary for traditional subspace and metric learning methods, since they are vulnerable to computational issues and to the small sample size problem arising from the curse of dimensionality. Our method does not suffer from these problems.
Declarations
Authors’ contributions
BS proposed the idea of the translation-invariant metric and proved the main theoretical results. JFF and GPW participated in its design and coordination and helped to revise the manuscript and the presentation of this method. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by NSFC (61333015) and NBRPC (2010CB328002, 2011CB302400).
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Bar-Hillel A, Hertz T, Shental N, Weinshall D (2003) Learning distance functions using equivalence relations. Proc Int Conf Mach Learn 11–18
 Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol. 1, pp 539–546. doi:10.1109/CVPR.2005.202
 Chen J, Wang R, Shan S, Chen X, Gao W (2006) Isomap based on the image euclidean distance. In: 18th International Conference on Pattern Recognition, 2006. ICPR 2006. vol. 2, pp 1110–1113. doi:10.1109/ICPR.2006.729
 Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th International Conference on Machine Learning. ICML ’07, ACM, New York, NY, USA, pp. 209–216. doi:10.1145/1273496.1273523. http://doi.acm.org/10.1145/1273496.1273523. Accessed 15 May 2013
 Duda RO, Hart PE, Stork DG (2000) Pattern Classification, 2nd edn. Wiley-Interscience
 Goldberger J, Roweis S, Hinton G, Salakhutdinov R (2005) Neighbourhood components analysis. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems, vol 17. MIT Press, Cambridge, MA, pp 513–520
 Globerson A, Roweis S (2006) Metric learning by collapsing classes. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, pp 451–458
 Gray RM (2006) Toeplitz and circulant matrices: a review. Found Trends Commun Inform Theory 2(3):155–239
 Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pat Anal Mach Intel 18(6):607–616. doi:10.1109/34.506411
 Jean JSN (1990) A new distance measure for binary images. In: International Conference on Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., pp. 2061–2064. doi:10.1109/ICASSP.1990.115932
 Lebanon G (2006) Metric learning for text documents. IEEE Trans Pat Anal Mach Intel 28(4):497–508. doi:10.1109/TPAMI.2006.77
 Li F, Yang J, Wang J (2007) A transductive framework of distance metric learning by spectral dimensionality reduction. In: Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), pp 513–520
 Oppenheim AV, Schafer RW, Buck JR (1999) Discrete-Time Signal Processing, 2nd edn., Prentice Hall Signal Processing Series, Prentice Hall, Englewood Cliffs
 Rudin W (1991) Functional Analysis, 2nd edn. McGraw-Hill Book Company, New York
 Rudin W (1990) Fourier Analysis on Groups. Wiley, New York
 Shalev-Shwartz S, Singer Y, Ng AY (2004) Online and batch learning of pseudo-metrics. In: Proceedings of the Twenty-first International Conference on Machine Learning. ICML ’04, ACM, New York, p 94. doi:10.1145/1015330.1015376. http://doi.acm.org/10.1145/1015330.1015376. Accessed 11 03 2013
 Shental N, Hertz T, Weinshall D, Pavel M (2002) Adjustment learning and relevant component analysis. In: ECCV ’02: Proceedings of the 7th European Conference on Computer Vision-Part IV, Springer, London, pp. 776–792
 Sun B, Feng J (2008) A fast algorithm for image euclidean distance. In: Chinese Conference on Pattern Recognition, 2008. CCPR ’08, pp 1–5. doi:10.1109/CCPR.2008.32
 Sun B, Feng J, Wang L (2009) Learning IMED via shift-invariant transformation. In: IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009, pp 1398–1405. doi:10.1109/CVPR.2009.5206720
 Vapnik VN (1998) Statistical Learning Theory. Wiley-Interscience
 Weinberger KQ, Blitzer J, Saul LK (2005) Distance metric learning for large margin nearest neighbor classification. In: Advances in Neural Information Processing Systems, vol. 18, pp 1473–1480. http://books.nips.cc/papers/files/nips18/NIPS20050265.pdf
 Wang L, Zhang Y, Feng J (2005) On the euclidean distance of images. IEEE Trans Pat Anal Mach Intel 27(8):1334–1339. doi:10.1109/TPAMI.2005.165
 Wang R, Chen J, Shan S, Chen X, Gao W (2006) Enhancing training set for face detection. In: 18th International Conference on Pattern Recognition, 2006. ICPR 2006. vol. 3, pp 477–480. IEEE Computer Society, Washington, DC. doi:10.1109/ICPR.2006.493
 Xing EP, Ng AY, Jordan MI, Russell S (2003) Distance metric learning, with application to clustering with side-information. In: Advances in Neural Information Processing Systems 15, vol. 15, pp 505–512. http://citeseerx.ist.psu.edu/viewdoc/summary?. doi:10.1.1.58.3667
 Xiang S, Nie F, Zhang C (2008) Learning a mahalanobis distance metric for data clustering and classification. Pat Recogn 41(12):3600–3612. doi:10.1016/j.patcog.2008.05.018
 Zhu S, Song Z, Feng J (2007) Face recognition using local binary patterns with image euclidean distance. In: SPIE, vol. 6790. doi:10.1117/12.750642