Abstract:
Effectively learning global features is essential for salient object detection. In convolutional neural networks, however, a global receptive field can only be obtained by stacking layers, which loses local information and produces rough object edges. To circumvent this issue, a new attention-based vision transformer encoder is introduced. Compared with a convolutional neural network, the proposed model represents features globally from shallow to deep layers and establishes self-attention relationships among all regions of the image. Specifically, we use a transformer encoder to extract object features. Notably, the encoder retains more local edge information in its shallow layers, which is used to recover the spatial details of the final saliency map. Moreover, since each layer of the transformer encoder inherits the global information of the previous layer, every layer's output features contribute to the final prediction. Furthermore, we supervise the salient edges in the shallow layers.
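As a rough sketch of the idea described above (not the authors' implementation), the snippet below shows a ViT-style encoder in which every layer's output tokens are kept and given their own prediction head, so shallow outputs can be supervised with edge maps and the deepest output with the saliency map. The class name, patch size, dimensions, and the use of PyTorch's nn.TransformerEncoderLayer are illustrative assumptions.

import torch
import torch.nn as nn

class SaliencyTransformerSketch(nn.Module):
    """Hypothetical sketch: transformer encoder with per-layer saliency/edge heads."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))      # learned positions
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(depth)]
        )
        # One linear head per layer: shallow heads can be supervised with edge maps,
        # the deepest head with the saliency map, reflecting the abstract's description.
        self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(depth)])

    def forward(self, x):
        b = x.size(0)
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos      # (B, N, dim)
        side = int(self.n_patches ** 0.5)
        preds = []
        for layer, head in zip(self.layers, self.heads):
            tokens = layer(tokens)                     # global self-attention at every depth
            p = head(tokens).transpose(1, 2).reshape(b, 1, side, side)
            preds.append(p)                            # per-layer prediction for supervision
        return preds  # preds[0] ~ shallow/edge output, preds[-1] ~ final saliency

# Usage example (illustrative): four coarse prediction maps, one per encoder layer.
maps = SaliencyTransformerSketch()(torch.randn(2, 3, 224, 224))
print([m.shape for m in maps])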