Salient Object Detection with Transformer

Abstract: It is essential to learn effective global features for salient object detection. With convolutional neural networks (CNNs), a larger global receptive field can only be obtained by stacking layers, which loses local information and produces rough object edges. To circumvent this issue, a new attention-based encoder, the vision transformer, is introduced. Compared with a CNN, it can represent global features from shallow to deep layers and establish self-attention relationships among all regions of the image. Specifically, a transformer encoder is used to extract object features; the encoder retains more local edge information in its shallow layers, which helps recover the spatial details of the final saliency map. Moreover, since each transformer layer inherits the global information of the preceding layer, the output features of every layer contribute to the final prediction. On this basis, edge supervision is applied in the shallow layers to obtain rich edge information, which is then combined with global location information. Finally, the decoder fuses high-level and shallow-level information progressively to generate the final saliency map, locating salient objects and their edges more accurately. Experimental results on five widely used datasets show that, without any post-processing, the proposed method outperforms state-of-the-art methods.
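The abstract describes the architecture but gives no code. The following is a minimal, hypothetical PyTorch sketch of the three ideas it names: a ViT-style encoder whose per-layer token features are all kept, an edge head supervised on the shallow features, and a decoder that fuses features progressively from deep to shallow. It is not the authors' implementation; the class name TransformerSODSketch, all layer sizes, and the simple additive fusion are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of: per-layer ViT features,
# shallow edge supervision, and progressive deep-to-shallow fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerSODSketch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = img_size // patch                       # token grid side
        self.patch_embed = nn.Conv2d(3, dim, patch, patch)  # patchify image
        self.pos = nn.Parameter(torch.zeros(1, self.grid**2, dim))
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            for _ in range(depth)
        )
        self.edge_head = nn.Conv2d(dim, 1, 1)  # supervised on shallow features
        self.fuse = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1)
                                  for _ in range(depth - 1))
        self.sal_head = nn.Conv2d(dim, 1, 1)

    def tokens_to_map(self, t):
        b, n, c = t.shape
        return t.transpose(1, 2).reshape(b, c, self.grid, self.grid)

    def forward(self, x):
        t = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        feats = []
        for layer in self.layers:        # each layer inherits the global
            t = layer(t)                 # context of the preceding layer
            feats.append(self.tokens_to_map(t))
        edge = self.edge_head(feats[0])  # shallow layer keeps edge detail
        f = feats[-1]
        for conv, skip in zip(self.fuse, reversed(feats[:-1])):
            f = conv(f + skip)           # progressive deep->shallow fusion
        sal = self.sal_head(f)
        size = x.shape[-2:]
        return (F.interpolate(sal, size, mode="bilinear", align_corners=False),
                F.interpolate(edge, size, mode="bilinear", align_corners=False))

model = TransformerSODSketch()
saliency, edge = model(torch.randn(1, 3, 224, 224))
print(saliency.shape, edge.shape)  # torch.Size([1, 1, 224, 224]) for both
```

Training would then supervise both outputs jointly, e.g. loss = bce(saliency, gt_mask) + bce(edge, gt_edge) with nn.BCEWithLogitsLoss, where gt_edge is assumed to be derived from the boundary of the ground-truth mask.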

     
