Object segmentation by saliency-seeded and spatial-weighted region merging

Li, Junxia; Ding, Jundi; Yang, Jian; Dai, Lingzheng

doi:10.1186/s40535-016-0024-z

Research
Open access
Published: 22 November 2016


Applied Informatics volume 3, Article number: 9 (2016)


Abstract

In this paper, we present a region merging-based method for object segmentation in natural images. The method consists of three separate steps: (1) initial over-segmentation such that pixels in each region are as homogeneous as possible and therefore likely to be from the same object; (2) saliency-seeded interaction to provide proper prior input to guide the segmentation; (3) region merging by an introduced maximal spatially weighted similarity (MSWS) criterion. Saliency-seeded interaction can well reflect the human intention but does not require any manual user editing, which makes our method applicable to increasingly large-scale image databases. The MSWS criterion takes into account both the color similarity and spatial distance of the candidate regions for merging, which allows the region merging-based method to achieve better performance. Extensive experiments show that our method can reliably and automatically segment the objects from a great variety of natural images.

Background

Object segmentation is an important task in the field of image processing (Gollmer et al. 2014, Tavakoli and Amini 2013, Seo et al. 2006). In many applications such as object recognition (Russell et al. 2006) and content-aware image resizing (Avidan and Shamir 2007), one of the core issues is to segment the object(s) of interest out from an image. If the object(s) can be correctly segmented, better application performance can be achieved such as higher recognition rate or lower resizing deformation.

However, this segmentation task is itself a difficult and still open problem. Over the last three decades, a plethora of methods have been proposed: mean shift (Comaniciu and Meer 2002), fuzzy c-means (Cai et al. 2007; Chen and Zhang 2004), normalized cuts (Shi and Malik 2000), the coherence-connected tree algorithm (Ding et al. 2006), etc. As reported, however, they work well only if the assumption of homogeneity in one or more region attributes holds. In other words, these segmentation methods yield good results when the objects are piece-wise smooth or nearly constant in at least one attribute. In commonly encountered complex natural images, they often perform poorly, and the objects tend to be segmented into pieces. In recent years, interactive techniques such as graph cuts (Boykov and Jolly 2001), GrabCut (Rother et al. 2004), and those in Bai and Sapiro (2007), Peng et al. (2011), Xiang et al. (2009) and Li et al. (2004) have received considerable attention. The underlying idea is to utilize some prior user inputs to guide the segmentation.

Experiments have shown that if proper prior input is provided, most of the existing interactive methods can yield satisfactory results for natural images. However, providing the proper input is not straightforward (Yang et al. 2010). Quite often, the user, especially a non-expert, has to edit patiently and carefully across all possibly ‘desired’ locations in the image (Rother et al. 2004; Li et al. 2004). If the user fails to provide effective priors, more interactions are required to correct the segmentation. This is a tedious task, and especially difficult when the object and its background have low contrast (Ning et al. 2010), or the object is camouflaged, or there is clutter in the image (Rother et al. 2004). In such cases, despite the interactive input, the segmentation may not always yield the desired output (see Fig. 1). This can be remedied partly by a second-tier editing of the initial segmentation results (Li et al. 2004). Another option is to employ multiple types of prior user input, including object and background strokes, soft boundary brushes or boxes, hard edge scribbles, and any combination of these (Rother et al. 2004).

Although this effort improves the segmentation results, the whole process is tedious and impractical, especially in view of image databases of increasingly larger sizes (Liu et al. 2011). Manual annotation of these databases is out of the question. This is the main motivation behind our method. Our proposed scheme has two aims: (i) to provide a segmentation method that is effective on a great variety of natural images, where regions are primed by a few background and object seed inputs; and (ii) to acquire every seed input automatically, that is, free from any manual user effort.

Photographs of natural scenes reflect real-world variations and are characterized by large ranges of color, texture, shape, or similar attributes. Image objects are not necessarily homogeneous in their attributes, and consequently even the state-of-the-art methods can fail to segment an object in its entirety, and more often the segmentation yields fragmented objects. This gives us the following idea: we can first over-segment an image into regions that are as homogeneous as possible, and then try to merge the object regions that are adjacent and similar to each other. The rationale is that these regions in all likelihood belong to the same object. To this end, we present a merging-based segmentation method in this paper.

We introduce a novel rule termed ‘maximal spatially weighted similarity’ (MSWS) to aggregate regions. Specifically, our proposed rule is to merge the regions that not only have the highest similarity in color, but that also are the nearest to each other. That is, the MSWS criterion takes into account both the “color similarity” and the “spatial distance” of the candidate regions for merging. Merging methods in the current literature focus on finding neighboring regions whose color similarity is above a threshold (Yang et al. 2010) or is the highest among all neighbors (Ning et al. 2010), without a distance weighting criterion. Disregarding the distance weighting criterion increases the risk that background regions with similar colors will be erroneously merged with object regions (see Fig. 2b).

Furthermore, we adopt an interactive merging strategy as recently proposed in Ning et al. (2010). That is, we first generate image clues to direct the merging, and these clues, in the form of simple strokes, roughly indicate the locations of the object and of the background. However, while in Ning et al. (2010), the object and background seeds are all drawn by the user, in our scheme they are automatically extracted. To generate segmentation priors, we have to take into account the following observations:

From the prior interaction point of view, the locations of pixels which have different attributes but belong to the same object are often good candidates for priors (see the toucan image in Fig. 1). As a case in point, the “toucan” object consists mostly of black pixels, with a minority of orange and white pixels. To be segmented into the same region as the black pixels, the minority pixels have to be marked as prior object seeds.
From the human attention point of view, the locations of pixels which have different attributes but belong to the same object are generally the salient places where human attention is attracted (see Fig. 3 for the toucan again). The orange and white pixels which are highly contrasted to the black pixels have the highest salience, shown as bright regions. At the same time, we can also observe that the pixels with the lowest salience are usually part of background.

From these two observations, we can conclude that the salient parts of an image, which attract more human attention, are also likely to be the locations of prior interactions. Inspired by this conclusion, we build a saliency-seeded interactive scheme that automatically finds good object seeds (those with the highest salience) and good background seeds (those with the lowest salience). A typical result of our automatic interaction for the toucan image is shown in Fig. 3. Clearly, the object marks fall onto a small portion of locations where the pixels are largely orange and white, while the background marks are all located in the background.

A brief overview of our ‘saliency-seeded and spatial-weighted’ (SSaSW) region merging-based method is illustrated in Fig. 3. It consists of three main stages: (A) initial over-segmentation; (B) saliency-seeded interaction; and (C) MSWS-based region merging. First, we run an image segmentation algorithm to divide the input image into many small homogeneous regions. Next, with the aid of a saliency detection method, the prior interactions are determined automatically. Finally, the object is extracted from the background when our MSWS-based merging process ends. Extensive experiments are conducted, and the results show that our method can reliably segment the objects from a wide variety of natural images.

In summary, the contributions of this paper mainly include the following:

1.
We build a saliency-seeded interaction scheme that reflects human intention well but is free of any manual user editing effort. In addition, our interactions can be easily and flexibly embedded into many interactive methods.
2.
We propose a novel rule MSWS to aggregate regions. It takes into account both the color similarity and spatial distance of the candidate regions for merging, which allows the region merging-based method to achieve better performance.

Our merging-based segmentation method

In this section, we detail the three stages of our method.

Initial over-segmentation

There are many low-level homogeneity-based methods which can be used for an initial over-segmentation, such as normalized cuts (Ncuts) (Shi and Malik 2000), k-means (Mignotte 2008), mean shift (Comaniciu and Meer 2002), Otsu’s thresholding (Otsu 1979), and watershed (Vincent and Soille 1991). The initial segmentation we require should make the pixels in each region as homogeneous as possible such that (i) they are from the same object and (ii) the object boundary is well preserved. The results produced by the mean shift algorithm satisfy these two requirements. By contrast, methods like k-means, Ncuts, and Otsu’s thresholding require a preset number of regions, and their computational complexity increases rapidly with this number. The results produced by these three methods usually do not preserve the boundary well. Although the results produced by watershed also satisfy the two requirements, they tend to contain an excessive number of regions, which increases the computational cost. For these reasons, we choose mean shift to produce our required initial over-segmentation. In particular, the mean shift implementation in the EDISON software (http://www.caip.rutgers.edu/riul/research/code.html) is used here.

Saliency-seeded automatic interaction

Most existing interactive segmentation methods can yield satisfactory results if proper user interaction is provided. However, image databases are becoming increasingly large, and manual annotation of them is impractical. Thus, finding an automatic way to determine the prior interaction is very important.

Our motivation of automatic interaction

Saliency detection is a recently developed technique for object extraction (Cheng et al. 2011; Achanta et al. 2009). It seeks to identify the highly informative parts of a scene that attract more human attention. In an image, the regions that are strongly contrasted to their surroundings tend to pop out as salient. To date, many popular salience detection methods have been proposed to identify these regions, such as IT (Itti et al. 1998), MZ (Ma and Zhang 2003), GB (Harel et al. 2007), SR (Hou and Zhang 2007), AC (Achanta et al. 2008), CA (Goferman et al. 2010), FT (Achanta et al. 2009), LC (Zhai and Shah 2006), HC (Cheng et al. 2011), and RC (Cheng et al. 2011). In all of them, the salience values of pixels are represented in gray and normalized to the range [0, 1]. The brighter a pixel is, the higher its salience value is. From their typical results shown in Fig. 4, we can observe that pixels with higher salience (shown as brighter pixels) are near high-contrast positions (e.g., object boundaries), or within some high-contrast regions (e.g., a textured region). Moreover, they are all related to the object of interest in the image. By contrast, pixels in the background tend to have lower salience, shown in black. Interactive methods such as GrabCut (Rother et al. 2004), graph cuts (Boykov and Jolly 2001), or MSRM (maximal similarity-based region merging) (Ning et al. 2010) yield good results when the locations of pixels with higher salience are marked as prior inputs. That is, the high-contrast positions or regions are good candidate places for prior user interaction. Inspired by these observations, we build a saliency-seeded automatic interaction scheme as follows.

Our way of automatic interaction

In particular, we intend to mark pixels with the highest salience as ‘object’ (denoted ‘O’), and pixels with the lowest salience as ‘background’ (denoted ‘B’). That is, we pick the pixels with salience above a threshold $T_O$ as the prior object seeds, and the pixels with salience below a threshold $T_B$ as the prior background seeds ($T_O> T_B$):

$$\begin{aligned} O=\{(x,y)\mid s(x,y)\ge {T}_{O}\} \end{aligned}$$

(1)

$$\begin{aligned} B=\{(x,y)\mid s(x,y)\le {T}_{B}\}, \end{aligned}$$

(2)

where s(x, y) is the salience value of the pixel (x, y). However, it is difficult to find general-purpose values for these two thresholds, because the objects and backgrounds in different images tend to have different salience values.

Thus, we instead specify two alternative thresholds $P_{O}$ and $P_{B}$ that represent the fractions of prior object and background seeds in an image I:

$$\begin{aligned} {\rm Pr}(O)={\rm Pr}(s(x,y)\ge {T}_{O})={P}_{O} \end{aligned}$$

(3)

$$\begin{aligned} {\rm Pr}(B)={\rm Pr}(s(x,y)\le {T}_{B})= {P}_{B}, \end{aligned}$$

(4)

where ${\rm Pr}(\cdot )$ is a probability function defined as

$$\begin{aligned} {\rm Pr}(O)=\frac{|O|}{|I|};\quad {\rm Pr}(B)=\frac{|B|}{|I|}. \end{aligned}$$

(5)

$|\cdot |$ denotes the number of elements in a set. We observe that in each salience map, the probability of pixels with the highest salience is about 2–5%, and the probability of pixels with the lowest salience, shown in black, is near $50\%$ (see Fig. 5c). We therefore select a value for $P_{O}$ in the range [0.02, 0.05] and set $P_{B}$ to 0.5. As shown in Fig. 5d, the object and background seed inputs are well determined.
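The seed selection of Eqs. (1)–(5) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the salience map is a NumPy array with values in [0, 1], and it derives $T_O$ and $T_B$ as quantiles of the salience values so that roughly a fraction $P_O$ of pixels become object seeds and a fraction $P_B$ become background seeds. The function name `saliency_seeds` and its parameters are hypothetical.

```python
import numpy as np

def saliency_seeds(saliency, p_obj=0.05, p_bg=0.5):
    """Return boolean masks (object_seeds, background_seeds).

    saliency: 2-D array of salience values in [0, 1].
    p_obj:    target fraction P_O of pixels marked as object seeds.
    p_bg:     target fraction P_B of pixels marked as background seeds.
    """
    s = np.asarray(saliency, dtype=float)
    # Assumption: T_O is taken as the (1 - P_O) quantile and T_B as the
    # P_B quantile of the salience values in this image.
    t_obj = np.quantile(s, 1.0 - p_obj)
    t_bg = np.quantile(s, p_bg)
    # Eq. (1): O = {s >= T_O};  Eq. (2): B = {s <= T_B}.
    return s >= t_obj, s <= t_bg

# Toy salience map with 100 distinct values.
toy = np.linspace(0.0, 1.0, 100).reshape(10, 10)
obj_seeds, bg_seeds = saliency_seeds(toy, p_obj=0.05, p_bg=0.5)
```

When many pixels share the same salience value, the resulting fractions can exceed $P_O$ or $P_B$, because every pixel tied at the threshold is included.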

However, with this approach, there are still too many marked inputs, especially in the background. To shrink them, we apply the morphological ‘thin’ operation to the marked object and background seeds, respectively. We use the function ‘bwmorph’ from the MATLAB R2010b function library in the form bwmorph(BW, operation, n), which applies a specific morphological operation to the binary image ‘BW’ n times. Specifically, we apply the operation ‘thin’ repeatedly until the image no longer changes, i.e., operation = ‘thin’ and n = Inf. As a result, the ‘thin’ operation removes pixels so that seed regions without holes shrink to a minimally connected stroke, and regions with holes shrink to a ring halfway between the hole and the outer boundary (see Fig. 5e).

Thus, only a small portion of the image pixels are marked as prior interaction inputs, and they reflect human attention well. More importantly, they are all obtained without any manual user effort and adapt to the image content.

MSWS-based region merging

After the above interaction input, some over-segmented regions may contain both object seeds and background seeds. Before the merging step, we first label the regions with more prior object (or background) seeds as object (or background) marker regions, and label the regions with no prior seed input as non-marker regions. The aim of MSWS-based merging is to assign to each non-marker region the correct label ‘O’ or ‘B.’ The whole merging process contains two stages, which are repeatedly executed until no new merging occurs. (i) Merging non-marker regions with background marker regions. For each background marker region, if a non-marker region satisfies the MSWS criterion with it, the two regions are merged and the new region is labeled ‘B.’ (ii) Adaptively merging the non-marker regions remaining from the first stage. For each non-marker region, if another non-marker region satisfies the MSWS criterion with it, the two non-marker regions are merged and form a new non-marker region. In what follows, we briefly review the principle of maximal color similarity (MCS) used in MSRM. Based on it, we explain why the spatial distance between regions is also important for the merging.
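The marker-labeling step that precedes the merging can be sketched as a majority vote over the seeds that fall inside each over-segmented region. This is an illustrative sketch only: the function name `label_marker_regions` is hypothetical, and the handling of ties (equal numbers of object and background seeds) is an assumption, since the text does not specify it.

```python
import numpy as np

def label_marker_regions(regions, obj_seeds, bg_seeds):
    """Map each region id to 'O', 'B', or None (non-marker region).

    regions:   2-D integer array of region ids from the over-segmentation.
    obj_seeds: boolean mask of prior object seed pixels.
    bg_seeds:  boolean mask of prior background seed pixels.
    """
    labels = {}
    for rid in np.unique(regions):
        mask = regions == rid
        n_obj = int(np.count_nonzero(obj_seeds & mask))
        n_bg = int(np.count_nonzero(bg_seeds & mask))
        if n_obj == 0 and n_bg == 0:
            labels[int(rid)] = None      # no seeds: non-marker region
        elif n_obj > n_bg:
            labels[int(rid)] = 'O'       # more object seeds: object marker
        else:
            # Assumption: ties are assigned to the background.
            labels[int(rid)] = 'B'
    return labels
```

The regions mapped to `None` are the ones that the two merging stages subsequently have to resolve.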

Overview of MCS

Color is a simple and effective low-level attribute that is commonly used for image segmentation. The idea is that regions from the same object are more similar in color than regions from different objects. Specifically, MCS is a very useful merging principle described in MSRM. It merges two neighboring regions that have the maximal similarity in color. That is, for one region R, let Q denote an adjacent region of R (i.e., a region with at least one pixel in common with R), if

$$\begin{aligned} \rho _{c} (R,Q^{*})=\max \limits _{Q\in N(R)}\rho _{c}(R,Q) \end{aligned}$$

(6)

$Q^{*}$ is called the most similar region to R and is merged with R, where $\rho _{c}(R,Q)$ denotes the color similarity between R and Q, and N(R) is the set of all regions adjacent to R. With this “max” operator, the merging process avoids a preset similarity threshold. However, the “max” operator may be somewhat sensitive to noise. To avoid this issue, MSRM uses an RGB histogram to represent each region. In the RGB space, each channel is uniformly quantized into 16 levels, and then a color space of $16\times 16\times 16=4096$ bins is used to calculate the histogram of each region. MSRM computes the color similarity of regions as the Bhattacharyya coefficient between two histograms:

$$\begin{aligned} \rho _{c} (R,Q)=\sum _{u=1}^{4096}\sqrt{{\rm Hist}_R^u\cdot {\rm Hist}_Q^u}, \end{aligned}$$

(7)

where ${\rm Hist}_R$ and ${\rm Hist}_Q$ denote the normalized color histograms of R and Q respectively, and the superscript u represents the uth bin.
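As an illustration of Eq. (7), the following sketch builds the 4096-bin quantized RGB histogram of a region and computes the Bhattacharyya coefficient between two such histograms. It assumes each region is given as an (N, 3) array of 8-bit RGB values with N > 0; the function names are hypothetical and not taken from MSRM.

```python
import numpy as np

def rgb_histogram(pixels, levels=16):
    """Normalized histogram with levels**3 bins of an (N, 3) uint8 RGB array."""
    p = np.asarray(pixels, dtype=np.uint8).astype(np.int64)
    q = p * levels // 256                      # quantize each channel to 0..levels-1
    bins = (q[:, 0] * levels + q[:, 1]) * levels + q[:, 2]
    hist = np.bincount(bins, minlength=levels ** 3).astype(float)
    return hist / hist.sum()

def bhattacharyya(hist_r, hist_q):
    """Eq. (7): sum over bins of sqrt(Hist_R^u * Hist_Q^u)."""
    return float(np.sum(np.sqrt(hist_r * hist_q)))

red = np.tile(np.array([[255, 0, 0]], dtype=np.uint8), (10, 1))
blue = np.tile(np.array([[0, 0, 255]], dtype=np.uint8), (10, 1))
mixed = np.vstack([red[:5], blue[:5]])
similarity = bhattacharyya(rgb_histogram(red), rgb_histogram(mixed))
```

The coefficient is 1 for identical normalized histograms and 0 for histograms with no occupied bin in common.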

Our MSWS criterion

It is worthwhile to note that in MCS all neighboring regions are treated equally in the merging, and only color information is used to judge the similarity between regions. This has some limitations. The approach may fail when low-contrast edges and shadows occur. It may also fail when part of the object region is slightly more similar in color to the adjacent background region than to adjacent object regions, or vice versa.

We take the yellow flower shown in Fig. 2 as an example. The flower consists of two parts: petals and stamen. Although both parts are yellow, the stamen is slightly darker than the surrounding petals. In Fig. 2b (first row), only parts of the petals are marked as belonging to the object, and a small portion of the background is present in the segmented object. In Fig. 2b (second row), it can be seen that the prior interactions are well designed, but the segmentation problem remains. The object cannot be reliably extracted from the background by either of these two interaction inputs. This example illustrates that even if the prior interactions are well designed, a satisfactory result cannot be obtained for this image. This is mainly because the object of interest is not piece-wise smooth or nearly constant in color and the contrast between the object and background is low. These problems are relatively common in natural images. Therefore, using only color information cannot ensure good segmentation performance for these natural images.

To solve this problem, we propose a novel rule termed maximal spatially weighted similarity (MSWS) to merge regions. It takes into account both the color similarity and the spatial distance of the candidate regions for merging. The implied idea is that regions of the same object are spatially adjacent and their colors are similar enough to each other. That is, one aims to merge the regions that not only have the highest similarity in color, but that also are the nearest to each other. Specifically, for two regions R and Q, we first define the spatial distance as

$$\begin{aligned} \rho _{s} (R,Q)={\Vert center_{R}-center_{Q}\Vert }_{2} \end{aligned}$$

(8)

where $center_{R}$ and $center_{Q}$ are the center pixel coordinates of the regions R and Q, respectively, and ${\Vert \cdot \Vert }_2$ denotes the Euclidean distance. The lower $\rho _{s} (R,Q)$ is for a pair of regions, the higher the spatial similarity between them. Directly integrating spatial distance into the color similarity computation, the MSWS is defined as

$$\begin{aligned} \rho (R,Q)={\text{ exp }(-{\rho _{s} (R,Q)}/{\sigma ^{2}})}\cdot {\rho _{c} (R,Q)}, \end{aligned}$$

(9)

where $\sigma$ controls the effect of spatial distance in the maximal spatially weighted similarity measure. In our experiments, we use $\sigma ^{2}=1$ empirically. Note that, although we choose the RGB color space and Bhattacharyya coefficient to compute the color similarity as in Ning et al. (2010), other color spaces (e.g., HSI) and distance metrics (e.g., Euclidean distance) can also be used here.
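Eqs. (8) and (9) can be illustrated with a short sketch. The function names are hypothetical, region centers are passed in directly as coordinate pairs, and the color similarities are supplied as plain numbers; the units of the center coordinates are not specified in the text, so the values below are purely illustrative.

```python
import math
import numpy as np

def spatial_distance(center_r, center_q):
    """Eq. (8): Euclidean distance between two region centers."""
    diff = np.asarray(center_r, dtype=float) - np.asarray(center_q, dtype=float)
    return float(np.linalg.norm(diff))

def msws(color_sim, center_r, center_q, sigma2=1.0):
    """Eq. (9): exp(-rho_s / sigma^2) * rho_c."""
    return math.exp(-spatial_distance(center_r, center_q) / sigma2) * color_sim

def most_similar(center_r, candidates, sigma2=1.0):
    """Pick the candidate with maximal MSWS.

    candidates: dict mapping a region name to (color_similarity, center).
    """
    return max(
        candidates,
        key=lambda name: msws(candidates[name][0], center_r, candidates[name][1], sigma2),
    )

# Q1 is slightly more similar in color but farther away than Q2.
candidates = {"Q1": (0.9, (0.0, 5.0)), "Q2": (0.8, (0.0, 1.0))}
choice = most_similar((0.0, 0.0), candidates)
```

In this toy case a purely color-based rule would pick Q1 (0.9 > 0.8), whereas the spatial weighting favors the nearer region Q2.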

We use Fig. 6 as a toy example to explain the rationale behind our MSWS criterion. Fig. 6a is the initial over-segmentation result. It contains four different homogeneous regions, denoted by A, B, C, and D (Fig. 6c). We assume that A and C are object regions, and B and D are background regions. In the MCS-based labeling result (Fig. 6h), regions B and C are both labeled ‘O.’ The labeling result which uses only spatial information is shown in Fig. 6j; in this case, regions B and C are both labeled ‘B.’ However, as shown in Fig. 6i, the corresponding MSWS-based result is consistent with the benchmark (Fig. 6a). This shows that our criterion can improve the performance of region merging-based methods by considering the color similarity and spatial distance of the candidate regions jointly.

Figure 7 shows the segmentation results based on the MCS and MSWS criteria. In the MCS-based results, the objects of interest cannot be segmented accurately. In the person image, parts of the object are merged into the background; in the flower image, a small portion of the background regions is erroneously integrated into the object (see Fig. 7b). Figure 7c shows the segmentation results of our proposed method. Clearly, it can effectively and accurately extract the objects from their backgrounds.

Experiments and comparisons

Experiment setting

Datasets

In this section, we evaluate the performance of our proposed algorithm from multiple perspectives. These extensive experiments are conducted on two public image databases. The first one is the Berkeley Segmentation Database denoted as BSDS300 (Martin et al. 2001). It is an information-rich dataset which contains 300 images along with the ground-truth segmentations. These images are of complex, natural scenes, and have five to ten human hand-labeled segmentations on each one of them. The second database MSRA1000 is provided by Achanta et al. (2009). It consists of 1000 images with obvious salient objects and clean backgrounds with a manually generated segmentation result for each image.

Parameters setting

$P_{O}$ and $P_{B}$ are two important parameters that our method uses to obtain the object and background seeds. To determine their values, we conducted a detailed analysis on the MSRA1000 dataset.

We analyzed this problem mainly from two aspects:

(i)
From the aspect of the accuracy that the saliency detection algorithm brings to our prior interactions, we conducted extensive statistical experiments over ten saliency detection methods with different thresholds $P_{O}$ and $P_{B}$. For each saliency method, we compute the average accuracy-$P_{O}$ curve and accuracy-$P_{B}$ curve on the MSRA1000 dataset, and present all the curves in Fig. 8. From Fig. 8, we can see that when $P_{O}\le 5\%$, the accuracy is above $50\%$ for all ten methods (SR has the worst performance when $P_{O} = 5\%$), and when $P_{O}> 5\%$, the accuracy decreases gradually. In our view, an accuracy below $50\%$ is not acceptable. We therefore choose $5\%$ as the maximum value of $P_{O}$. Fig. 8 also shows that when $P_{B}$ is near $50\%$, the accuracy is higher than $95\%$ for most of the saliency models.
(ii)
From the aspect of the foreground object size, we computed the proportion of the image occupied by the foreground object for every image in the MSRA1000 dataset. Among the 1000 images, only 21 have a foreground proportion of less than $5\%$, and only two of these 21 have a proportion of less than $2\%$. We therefore choose $2\%$ as the minimum value of $P_{O}$. Besides, there are only a few images in which the background occupies less than $50\%$ of the image.

Taking these two aspects into consideration, and to compare fairly with other methods, we select a value for $P_{O}$ in the range [0.02, 0.05] and set $P_{B}$ to 0.5 in this paper.

Qualitative result comparisons

The two main stages of our proposed method are the saliency-seeded automatic interaction and MSWS-based region merging. In order to verify their effectiveness, we conduct extensive experiments on the two test datasets.

Results based on different saliency detection methods

Figure 9 illustrates the corresponding segmentation results of SSaSW based on different saliency detection methods IT (Itti et al. 1998), MZ (Ma and Zhang 2003), GB (Harel et al. 2007), SR (Hou and Zhang 2007), AC (Achanta et al. 2008), CA (Goferman et al. 2010), FT (Achanta et al. 2009), LC (Zhai and Shah 2006), HC (Cheng et al. 2011), and RC (Cheng et al. 2011). These images are from the MSRA1000 database. We can clearly see that SSaSW yields satisfactory segmentation results from most of these methods, except for AC and SR. Therefore, most saliency detection methods except for AC and SR can provide the proper automatic interactions for SSaSW. In the following experiments, the RC saliency map is used to automatically determine prior interactions.

Effectiveness analysis of our MSWS

We compare the performance of our MSWS criterion with that of the MCS criterion. Note that MCS can be seamlessly embedded into our framework. All experiments are conducted on the BSDS300 database. Figure 10 shows the segmentation results of the MCS- and MSWS-based region merging methods. In these images, some objects contain low-contrast edges, or parts of the background are very similar in color to the adjacent object regions. It is difficult to achieve satisfying results in these cases with MCS. However, given the same marking, MSWS achieves much better results than MCS.

Quantitative result evaluations

Evaluations on the BSDS300 database

Until now, the effectiveness of MSWS has been evaluated only visually. However, visual observation is subjective. To demonstrate the performance objectively, quantitative performance measures are needed. We use the following measures to demonstrate the effectiveness of our proposed MSWS: a probabilistic measure, PRI (Unnikrishnan et al. 2007), and two metrics, VoI (Meila 2005) and GCE (Martin et al. 2001). The three measures are described below:

1.
Probabilistic Rand Index (PRI) (higher probability is better): The Rand index proposed in Unnikrishnan et al. (2005) calculates the fraction of pairs of pixels whose labels are consistent between the test segmentation S and the ground-truth segmentation G. PRI proposed in Unnikrishnan et al. (2007) is a simple extension of the Rand index. It allows the comparison of a segmentation algorithm to a set of ground-truth segmentations by averaging the results. Given a set of ground-truth segmentations ${\{G_k\}}$, the PRI is defined as
$$\begin{aligned} \text{ PRI }(S,\{G_k\})=\frac{1}{K}\sum \limits _{i<j}[c_{ij}p_{ij}+(1-c_{ij})(1-p_{ij})], \end{aligned}$$
(10)
where $c_{ij}$ indicates that pixels i and j have the same label, $p_{ij}$ denotes the probability of this event, and K is the number of ground-truth segmentations for an image. Thus, PRI is based on pair-wise relationships and is highly correlated with human hand-labeled segmentation results.
2.
Variation of Information (VoI) (lower distance is better): In contrast to PRI, VoI (Meila 2005) is based on the relationship between a pixel and its own cluster. It views a clustering as an element of a lattice. As a metric, VoI uses conditional entropies to measure the distance between two clusterings, and is defined as
$$\begin{aligned} \text{ VoI }(R_1,R_2)=H(R_1)+H(R_2)-2I(R_1,R_2), \end{aligned}$$
(11)
where H and I denote, respectively, the entropy and the mutual information of the two clusterings $R_1$ and $R_2$. It is a form of ‘external evaluation,’ and measures the amount of information that is lost or gained in changing from one clustering to another.
3.
Global Consistency Error (GCE) (lower distance is better): A supervised evaluation method, GCE, was introduced by Martin et al. (2001) to quantify the consistency between segmentations. Let R(S, p) be the set of pixels which are in the same region R as the pixel p in segmentation S, where $|\cdot |$ denotes the cardinality of a set and $\cdot \setminus \cdot$ set difference. The local refinement error is
$$\begin{aligned} E (S_1, S_2, p) = \frac {| R\,(S_1, p) \setminus R\, (S_2, p)|}{|R\, (S_1, p)|}. \end{aligned}$$
(12)
Then the GCE is defined as
$$\begin{aligned} \text{ GCE }(S_1, S_2)=\frac{1}{n}\min \left\{\sum \limits _{i}E(S_1, S_2, p_i), \sum \limits _{i}E(S_2, S_1, p_i) \right\}. \end{aligned}$$
(13)
Here, n is the number of pixels in the image. Note that GCE forces all local refinements to be in the same direction, and it does not penalize over-segmentation.
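The VoI and GCE measures described above can be sketched from a contingency table of two label maps. This is a minimal illustration under stated assumptions: segmentations are integer label arrays of equal shape, entropies use base-2 logarithms (the text does not fix the base), and GCE takes the minimum of the two summed directional refinement errors. The function names are hypothetical.

```python
import numpy as np

def contingency(s1, s2):
    """Table n[a, b] = number of pixels with label a in s1 and label b in s2."""
    a = np.unique(np.ravel(s1), return_inverse=True)[1]
    b = np.unique(np.ravel(s2), return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)
    return table

def gce(s1, s2):
    """Global Consistency Error built from the local refinement errors of Eq. (12)."""
    n = contingency(s1, s2)
    row = n.sum(axis=1, keepdims=True)   # |R(S1, p)| for pixels in each row
    col = n.sum(axis=0, keepdims=True)   # |R(S2, p)| for pixels in each column
    e12 = np.sum(n * (row - n) / row)    # sum over pixels of E(S1, S2, p)
    e21 = np.sum(n * (col - n) / col)    # sum over pixels of E(S2, S1, p)
    return float(min(e12, e21) / n.sum())

def _entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))   # assumption: base-2 logarithm

def voi(s1, s2):
    """Eq. (11): H(R1) + H(R2) - 2 I(R1, R2)."""
    p = contingency(s1, s2)
    p = p / p.sum()
    h1, h2 = _entropy(p.sum(axis=1)), _entropy(p.sum(axis=0))
    mutual_info = h1 + h2 - _entropy(p.ravel())
    return h1 + h2 - 2.0 * mutual_info
```

For example, comparing a two-region segmentation with a single-region segmentation of the same pixels gives a GCE of 0, because one is a refinement of the other, which reflects the remark that GCE does not penalize over-segmentation.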

Table 1 compares the performance on the images presented in Fig. 10 using the PRI, VoI, and GCE metrics, where ‘NO.’ denotes the ID number of the images. The values of PRI, VoI, and GCE are given side by side in the two columns. MSWS outperforms MCS on all the indices. The average PRI value of MSWS over the 300 images of the BSDS300 dataset is 0.5551, which is higher than the 0.5476 obtained by MCS. The average GCE and VoI values of MSWS on this database are 0.0561 and 2.0146, which are lower than the MCS averages of 0.0646 and 2.0519.

Table 1 Quantitative comparison of the results of our method based on MSWS and MCS on the ten images presented in Fig. 10


Evaluations on the MSRA1000 database

To demonstrate the effectiveness of our method, we run it with six recently proposed saliency detection methods, LR (Shen and Wu 2012), SF (Perazzi et al. 2012), HS (Yan et al. 2013), MR (Yang et al. 2013), DS (Li et al. 2013), and AMC (Jiang et al. 2013), on the MSRA1000 database, and then compare our object segmentation results with their adaptive-thresholding segmentation results. The adaptive threshold, proposed by Achanta et al. (2009), depends on the image saliency. Note that in the adaptive-thresholding segmentation, each saliency map is first over-segmented by mean shift. An average saliency is then calculated for each segment, and an overall mean saliency value over the entire image is obtained as well. If the average saliency of a segment is larger than twice the overall mean saliency value, the segment is marked as foreground; otherwise, it is marked as background. In this way, the binary segmentation map is obtained.
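The adaptive-thresholding baseline described above can be sketched as follows. This is an illustrative reading of the description, not the code of Achanta et al. (2009): the over-segmentation is assumed to be given as an integer label map (the mean shift step is not reproduced here), and the function name is hypothetical.

```python
import numpy as np

def adaptive_threshold_segmentation(saliency, regions):
    """Mark a segment as foreground if its mean salience exceeds twice the image mean.

    saliency: 2-D array of salience values.
    regions:  2-D integer array of segment ids with the same shape.
    """
    s = np.asarray(saliency, dtype=float)
    threshold = 2.0 * s.mean()               # twice the overall mean salience
    foreground = np.zeros(s.shape, dtype=bool)
    for rid in np.unique(regions):
        mask = regions == rid
        if s[mask].mean() > threshold:       # per-segment average salience
            foreground[mask] = True
    return foreground

toy_saliency = np.array([[0.9, 0.9, 0.1, 0.1],
                         [0.1, 0.1, 0.1, 0.1]])
toy_regions = np.array([[0, 0, 1, 1],
                        [1, 1, 1, 1]])
toy_mask = adaptive_threshold_segmentation(toy_saliency, toy_regions)
```

In the toy input the image mean is 0.3, so only the segment with mean salience 0.9 exceeds the threshold of 0.6.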

F-measure is used to assess the consistency of each segmentation result with the ground truth, and is defined as

$$\begin{aligned} {\text F}{\text {-measure}} = \frac{(1+\beta ^{2})\times {\rm Precision}\times {\rm Recall}}{\beta ^{2}\times {\rm Precision} + {\rm Recall}}. \end{aligned}$$

(14)

We use $\beta ^{2}=0.3$ to weight Precision more than Recall. Table 2 shows the F-measure scores of our SSaSW and of the adaptive-thresholding segmentation. Our method consistently performs better than the adaptive-thresholding segmentation. This comparison also demonstrates the effectiveness of the proposed saliency-seeded interaction and maximal spatially weighted similarity criterion.
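For concreteness, Eq. (14) can be evaluated for a binary segmentation mask against a binary ground-truth mask as in the sketch below. The function name `f_measure` and the convention of returning 0 when Precision and Recall are both zero are assumptions made for this illustration.

```python
import numpy as np


def f_measure(pred_mask, gt_mask, beta2=0.3):
    """F-measure of Eq. (14); beta2 = 0.3 weights Precision more than Recall."""
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)

    # True positives: pixels marked foreground in both masks.
    tp = np.logical_and(pred, gt).sum()

    # Precision = TP / predicted foreground, Recall = TP / true foreground.
    precision = tp / pred.sum() if pred.sum() > 0 else 0.0
    recall = tp / gt.sum() if gt.sum() > 0 else 0.0

    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when there is no overlap at all
    return float((1.0 + beta2) * precision * recall / denom)
```

As a worked example, a mask with Precision 0.5 and Recall 1.0 gives $1.3 \times 0.5 / (0.3 \times 0.5 + 1) \approx 0.565$, noticeably closer to the Precision than to the Recall.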

Table 2 F-measure evaluations with different saliency methods on the MSRA1000 database

Furthermore, we compare the segmentation results of SSaSW and RCC (Cheng et al. 2011) with the human segmentation result for each image. RCC is an RC-based cut algorithm that uses the RC saliency map, instead of human input, to initialize GrabCut. Figure 11 compares the segmentation results of SSaSW and RCC. From Fig. 11c and f, we can see that each object of interest is effectively extracted from the background by SSaSW, while RCC has difficulty handling images with cluttered and highly textured objects or backgrounds (see Fig. 11b, e). Table 3 presents the F-measure scores on the test images and shows that our results are highly consistent with the ground truth. The average F-measure score of SSaSW on the MSRA1000 database is 0.8749. These experiments are conducted using the parameters $P_{O}=0.05$, $P_{B}=0.5$, and $\sigma ^{2}=1$ throughout.

Table 3 Precision (P), Recall (R), and F-measure values for test images

$P_{O}$ and $P_{B}$ are two important parameters that our method uses to obtain the object and background seeds. In our experiments, good results are generally obtained with $P_{O}$ in the range [0.02, 0.05] and $P_{B}=0.5$. In some cases, SSaSW obtains better results when $P_{O}$ and $P_{B}$ are adjusted. Such a case is shown in Fig. 12: with the default parameters ($P_{O}=0.05$), the background regions circled in red are merged into the object (see Fig. 12b), because several background pixels have high saliency (see Fig. 12a) and the corresponding regions are erroneously assigned to the object marker regions. In Fig. 12c, SSaSW produces a relatively accurate result with $P_{O}=0.02$.

For a fair comparison with other methods, we further introduce a scheme that uses one unified parameter setting for all images. For each image, using k different $P_{O}$ values $P_{O_i}$ ($i=1, 2, \ldots , k$), we obtain the corresponding segmentation results $Z_{P_{O_i}}$. The average map $\bar{Z}$ is then calculated for each pixel p as

$$\begin{aligned} \bar{Z}(p)=\frac{1}{k}\sum _{i=1}^{k}Z_{P_{O_i}}(p). \end{aligned}$$

(15)

Finally, with $\bar{Z}$ normalized to [0, 1], the object segmentation result M is obtained as

$$\begin{aligned} M(p)=\left\{ \begin{array}{ll} 1,\quad &{}\hbox { if } \bar{Z}(p)\ge 0.5; \\ 0,\quad &{}\hbox {else.} \end{array} \right. \end{aligned}$$

(16)

In this result, $M(p)=1$ indicates that pixel p belongs to the foreground object, and $M(p)=0$ indicates that it belongs to the background.

Specifically, in the experiments we vary $P_{O}$ from 0.02 to 0.05 in steps of 0.01, obtaining four values $P_{O_1}=0.02$, $P_{O_2}=0.03$, $P_{O_3}=0.04$, $P_{O_4}=0.05$. In this way, all the results are obtained using a unified parameter setting. The F-measure obtained by the proposed strategy is 0.91, which is higher than the 0.90 obtained by RCC. Figure 13 shows the F-measure evaluations of SSaSW, RCC, and SSRMf (Li et al. 2011). SSRMf is also a saliency-based object segmentation method. SSaSW has the highest F-measure score of the three, which supports its effectiveness.
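The fusion step of Eqs. (15) and (16) can be sketched as below. The sketch assumes that the k binary segmentation results have already been produced by running the method once per $P_{O}$ value, so it takes them as input. The function name `fuse_segmentations` is hypothetical, and the mean of binary masks is treated as already lying in [0, 1], so no further normalization is applied.

```python
import numpy as np


def fuse_segmentations(masks, threshold=0.5):
    """Average k binary masks (Eq. 15) and threshold the average (Eq. 16)."""
    # Stack the k results along a new leading axis: shape (k, height, width).
    stack = np.stack([np.asarray(m, dtype=np.float64) for m in masks])

    # Per-pixel average map; for binary inputs this is the fraction of
    # parameter settings that labeled the pixel as foreground.
    z_bar = stack.mean(axis=0)

    # A pixel is foreground if at least half of the settings agree.
    return (z_bar >= threshold).astype(np.uint8)
```

With the four $P_{O}$ values used in the experiments, a pixel is kept as foreground when at least two of the four segmentation results mark it as object.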

To further assess the significance of the above comparisons, we report the results of statistical t tests. The corresponding p values are given in Table 4. As expected, the p values are all below 0.05, which indicates that the improvements of our method over RCC and SSRMf are statistically significant.

Table 4 p values of the statistical t tests for evaluations

Comparisons with graph cuts and GrabCut

In this section, we compare our method with two interactive segmentation algorithms: graph cuts (Boykov and Jolly 2001) and GrabCut (Rother et al. 2004). Both methods treat object segmentation as a minimum graph cut problem. For a fair comparison with our region-based algorithm, we extend the classical pixel-based graph cuts and GrabCut segmentation methods to a region-based scheme, taking the regions segmented by mean shift, instead of the pixels, as the nodes of the graph. Both graph cuts and GrabCut require some regions to be labeled in advance as priors, i.e., seeds. In graph cuts, the user is required to mark a few strokes on the object and on the background. In GrabCut, the interaction is a rectangle drawn around the desired object. The prior interactions for graph cuts and GrabCut in Fig. 14 appear to be well designed. Even so, our method achieves segmentation performance comparable to that of these interactive object segmentation methods.

Results on domain specific images

To demonstrate the effectiveness of our method more broadly, in this subsection we conduct experiments on domain-specific images, e.g., shadow images and medical images (here, two vascular images). Figure 15 shows the segmentation results, from which we can see that our method works well on these images.

On the extension to more features

Our method can benefit from the integration of more feature information. Specifically, in this subsection, we add texture information to our model [three textural features, coarseness, contrast, and directionality (Tamura et al. 1978), are used to extract texture information, as done in Dogra et al. (2012)]. That is, we use color similarity, spatial proximity, and texture similarity together to define our similarity measure. Table 5 shows the comparison results. It can be seen that our method yields better results when texture information is integrated.

Table 5 Average F-measure values on the MSRA1000 dataset based on our MSWS and MSWS with texture

Computational complexity of SSaSW

To analyze the efficiency of the proposed method, we discuss its computational complexity and compare it with that of RCC and SSRMf. The running time of our method mainly depends on two parts, the region merging process and the similarity measure. For the region merging process, the time complexity is $O(N^2)$, where N is the number of regions after the initial segmentation. The time complexity of the similarity measure is $O(M_{k})$, where $M_k$ is the number of pixels in the k-th region. So, the worst-case running time complexity of SSaSW is $O(N^2+MN)$, where $M=\max _{k=1,\ldots ,N}{\{M_k\}}$. The running time complexity of SSRMf is approximately equal to that of SSaSW. The RCC method iteratively applies GrabCut (Rother et al. 2004) to refine the segmentation result, and this GrabCut iteration is its most time-consuming step. Thus, the time complexity of RCC is $O(mn^2|C|)$, where n is the number of nodes, m is the number of edges, and |C| is the cost of the minimum cut in the graph. Clearly, $n \gg N$, since n is the total number of pixels in an image and N is the number of regions after over-segmentation. Table 6 shows the average time taken by RCC, SSRMf, and SSaSW on the MSRA1000 database. SSaSW and SSRMf are implemented in Matlab. For RCC, we use the authors’ implementation in C++. Although SSaSW takes longer to run, its time complexity is lower than that of RCC and approximately equal to that of SSRMf. The difference in computation time is mainly due to the different execution environments.

Table 6 Average time required for object segmentation for images in the MSRA1000 database

Failure of SSaSW

So far, we have evaluated the effectiveness of SSaSW on a variety of images. However, it may fail under certain conditions (such cases are summarized and shown in Fig. 16). For the pencil image in Fig. 16, the failure arises from a wrongly connected over-segmentation between pencil regions. If there were no connection between the hole (showing blue sky) and the pencil regions, our region merging rule would not merge them into one region, even though the hole has a blue color similar to that of the nearby pencils. For the bottle image in Fig. 16, the result would be better if the saliency-seeded interactions (i.e., high-level semantics) were all accurate, e.g., if the bottle neck were not marked as background. For the image with the hand in Fig. 16, the failure is due to human ambiguity (i.e., subjective labeling). The pixels with the highest saliency values are all from the ‘hand,’ so they are taken as the foreground interactions. For this image in the dataset, however, the iron handle is the benchmarked foreground object.

Conclusions

This paper proposes a fully automatic framework of saliency-seeded and spatial-weighted region merging for natural object segmentation. With the aid of a saliency detection method, proper prior inputs for the object of interest and the background region are obtained automatically. This labeling reflects human intention without requiring any manual user editing. In addition, we present a maximal spatially weighted similarity criterion for region merging, which merges the regions that are most similar in color and also nearest to each other. By incorporating both the color similarity and the spatial distance of the candidate regions for merging, the region merging-based method achieves better performance. For a wide range of natural images, the salient objects can be reliably segmented from their complex backgrounds. SSaSW involves no user input and is a fully automatic framework for segmentation. Experimental results show that our proposed scheme is comparable to current state-of-the-art automatic segmentation techniques and outperforms the conventional interactive methods. Our future work will focus on how to overcome the failures of SSaSW in some difficult situations and how to improve its speed.

Abbreviations

MSWS: maximal spatially weighted similarity
SSaSW: saliency-seeded and spatial-weighted
MCS: maximal color similarity
PRI: probabilistic rand index
VoI: variation of information
GCE: global consistency error

References

Achanta R, Estrada F, Wils P, Susstrunk S (2008) Salient region detection and segmentation. In: IEEE international conference on computer vision systems. IEEE, New Jersey
Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: IEEE International conference on computer vision and pattern recognition. IEEE, New Jersey
Avidan S, Shamir A (2007) Seam carving for content-aware image resizing. ACM Trans Graphics 26:236–246
Bai X, Sapiro G (2007) A geodesic framework for fast interactive image and video segmentation and matting. In: IEEE international conference on computer vision, pp 1–8
Boykov YY, Jolly MP (2001) Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. IEEE Trans Pattern Anal Mach Intell 1:105–112
Cai W, Chen S, Zhang D (2007) Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognit 40:825–838
Chen S, Zhang D (2004) Robust image segmentation using fcm with spatial constraints based on new kernel-induced distance measure. IEEE Trans Syst Man Cybern 34:1907–1916
Cheng MM, Mitra NJ, Huang X, Torr PH, Hu SM (2011) Global contrast based salient region detection. IEEE Trans Pattern Anal Mach Intell 37:409–416
Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
Ding J, Chen S, Ma R, Wang B (2006) A fast directed tree based neighborhood clustering for image segmentation. In: International conference on neural information processing. Springer, Berlin, pp 369–378
Dogra DP, Majumdar AK, Sural S, Mukherjee J, Mukherjee S, Singh A (2012) Analysis of adductors angle measurement in hammersmith infant neurological examinations using mean shift segmentation and feature point based object tracking. Comput Biol Med 42:925–934
EDISON Software. http://www.caip.rutgers.edu/riul/research/code.html. Accessed 17 June 2013
Goferman S, Zelnik-Manor L, Tal A (2010) Context-aware saliency detection. IEEE Trans Conf Comp Vis Pattern Recogn 34:2376–2383
Gollmer ST, Kirschner M, Buzug TM, Wesarg S (2014) Using image segmentation for evaluating 3D statistical shape models built with groupwise correspondence optimization. Comp Vis Image Underst 125:283–303
Harel J, Koch C, Perona P (2007) Graph-based visual saliency. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge, pp 545–552
Hou X, Zhang L (2007) Saliency detection: a spectral residual approach. In: IEEE international conference on computer vision and pattern recognition. IEEE, New Jersey, pp 1–8
Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20:1254–1259
Jiang B, Zhang L, Lu H, Yang M (2013) Saliency detection via absorbing markov chain. In: IEEE international conference on computer vision. IEEE, New Jersey
Li X, Lu H, Zhang L, Ruan X, Yang M (2013) Saliency detection via dense and sparse reconstruction. In: IEEE international conference on computer vision. IEEE, New Jersey
Li J, Ma R, Ding J (2011) Saliency-seeded region merging: automatic object segmentation. In: Asian conference on pattern recognition. IEEE, New Jersey, p 691
Li Y, Sun JC, Tang SH (2004) Interactive natural image segmentation via spline regression. SIGGRAPH, Los Angeles, pp 303–308
Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, Shum H (2011) Learning to detect a salient object. IEEE Trans Pattern Anal Mach Intell 33:353–367
Ma Y, Zhang H (2003) Contrast-based image attention analysis by using fuzzy growing. ACM, New York, pp 374–381
Martin D, Fowlkes C, Tal D, Malik J (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE international conference on computer vision. IEEE, New Jersey, pp 416–423
Meila M (2005) Comparing clusterings-an axiomatic view. In: IEEE international conference on machine learning. ACM, Los Angeles
Mignotte M (2008) Segmentation by fusion of histogram-based k-means clusters in different color spaces. IEEE Trans Image Process 17:780–787
Ning J, Zhang L, Zhang D, Wu C (2010) Interactive image segmentation by maximal similarity based region merging. Pattern Recogn 43:445–456
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9:62–66
Peng B, Zhang L, Zhang D, Yang J (2011) Image segmentation by iterated region merging with localized graph cuts. Pattern Recogn 44:2527–2538
Perazzi F, Krahenbuhl P, Pritch Y, Hornung A (2012) Saliency filters: Contrast based filtering for salient object detection. In: IEEE international conference on computer vision and pattern recognition. IEEE, New Jersey, pp 733–740
Rother C, Kolmogorov V, Blake A (2004) GrabCut: interactive foreground extraction using iterated graph cuts. SIGGRAPH, Los Angeles, pp 309–314
Russell BC, Freeman WT, Efros AA, Sivic J, Zisserman A (2006) Using multiple segmentations to discover objects and their extent in image collections. IEEE Comp Soc Conf Comp Vis Pattern Recognit 2:1605–1614
Seo K, Shin J, Kim W, Lee J (2006) Real-time object tracking and segmentation using adaptive color snake model. Int J Cont Autom Sys 4:236–246
Shen X, Wu Y (2012) A unified approach to salient object detection via low rank matrix recovery. In: IEEE international conference on computer vision and pattern recognition. IEEE, New Jersey, pp 853–860
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22:888–905
Tamura H, Mori S, Yamawaki T (1978) Textural features corresponding to visual perception. IEEE Trans Syst Man Cybern 8:460–472
Tavakoli V, Amini AA (2013) A survey of shape-based registration and segmentation techniques for cardiac images. Comp Vis Image Underst 117:966–989
Unnikrishnan R, Pantofaru C, Hebert M (2005) A measure for objective evaluation of image segmentation algorithms. In: IEEE international conference on computer vision and pattern recognition workshop on empirical evaluation methods in computer vision. IEEE, New Jersey
Unnikrishnan R, Pantofaru C, Hebert M (2007) Toward objective evaluation of image segmentation algorithms. IEEE Trans Pattern Anal Mach Intell 29:929–944
Vincent L, Soille P (1991) Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans Pattern Anal Mach Intell 13:583–598
Xiang S, Nie F, Zhang C, Zhang C (2009) Interactive natural image segmentation via spline regression. IEEE Trans Image Process 18:1623–1632
Yan Q, Xu L, Shi J, Jia J (2013) Hierarchical saliency detection. In: IEEE international conference on computer vision and pattern recognition. IEEE, New Jersey, pp 1155–1162
Yang W, Cai J, Zheng J, Luo J (2010) User-friendly interactive image segmentation through unified combinatorial user inputs. IEEE Trans Image Process 19:2470–2479
Yang C, Lu L, Ruan X, Yang M (2013) Saliency detection via graph-based manifold ranking. In: IEEE international conference on computer vision and pattern recognition. IEEE, New Jersey, pp 3166–3173
Zhai Y, Shah M (2006) Visual attention detection in video sequences using spatiotemporal cues. ACM Multimedia, New York

Authors' contributions

JL, JD, and JY conceived and designed the study. JL and LD performed the experiments. JD, JY, and LD reviewed and edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported in part by the National Science Fund of China under Grants 91420201, 61472187, 61502235, 61233011, and 61373063, in part by the Key Project of Chinese Ministry of Education under Grant 313030, the 973 Program under Grant 2014CB349303, and in part by the Program for Changjiang Scholars and Innovative Research Team in University Grant IRT13072.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was funded by the National Science Fund of China under Grants 91420201, 61472187, 61502235, 61233011, and 61373063, the Key Project of Chinese Ministry of Education under Grant 313030, the 973 Program under Grant 2014CB349303, and the Program for Changjiang Scholars and Innovative Research Team in University under Grant IRT13072. The funding supported the design of the study and the conduct of the experiments.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei Street, Nanjing, 210094, China
Junxia Li, Jundi Ding, Jian Yang & Lingzheng Dai

Corresponding author

Correspondence to Junxia Li.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Li, J., Ding, J., Yang, J. et al. Object segmentation by saliency-seeded and spatial-weighted region merging. Appl Inform 3, 9 (2016). https://doi.org/10.1186/s40535-016-0024-z

Received: 13 September 2016
Accepted: 02 November 2016
Published: 22 November 2016
DOI: https://doi.org/10.1186/s40535-016-0024-z
