Science Cast

CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution

Chenghao Li
February 26, 2024 11:36am

Connected to paper. This paper is a preprint and has not been certified by peer review.

Authors

Chenghao Li, Chaoning Zhang

Abstract

The success of the Vision Transformer (ViT) has been reported on a wide range of image recognition tasks. The merit of ViT over CNN has been largely attributed to large training datasets or auxiliary pre-training. Without pre-training, the performance of ViT on small datasets is limited because global self-attention has limited capacity for local modeling. To boost ViT on small datasets without pre-training, this work improves its local modeling by applying a weight mask to the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix is an element-wise learnable weight mask (ELM), for which our preliminary experiments show promising results. However, the element-wise learnable weight mask not only induces a non-trivial additional parameter overhead but also increases the optimization complexity. To this end, this work proposes a novel Gaussian mixture mask (GMM), in which one mask has only two learnable parameters and which can be conveniently used in any ViT variant whose attention mechanism allows the use of masks. Experimental results on multiple small datasets demonstrate the effectiveness of our proposed Gaussian mask for boosting ViTs for free (almost zero additional parameter or computation cost). Our code will be publicly available at https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention.
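
As a rough illustration of the idea in the abstract, the sketch below builds a distance-based Gaussian mixture mask over a small patch grid and applies it to an attention matrix. The abstract does not give the functional form, so everything specific here is an assumption: each Gaussian is parameterized by an amplitude and a width (two parameters per mask), the mixture is a plain sum of such Gaussians, and the mask is multiplied into the softmax attention weights, which are then renormalized per row. The function names `gaussian_mixture_mask` and `masked_attention` are hypothetical and are not taken from the authors' repository.

```python
import math

def gaussian_mixture_mask(grid, amplitudes, sigmas):
    """Build an (N x N) mask over N = grid * grid patches.

    Assumed form: entry [i][j] is a sum of Gaussians of the Euclidean
    distance between patch i and patch j on the 2D patch grid. Each
    Gaussian has two parameters (amplitude, sigma); in training these
    would be learnable, here they are fixed numbers.
    """
    coords = [(r, c) for r in range(grid) for c in range(grid)]
    n = len(coords)
    mask = [[0.0] * n for _ in range(n)]
    for i, (ri, ci) in enumerate(coords):
        for j, (rj, cj) in enumerate(coords):
            d2 = (ri - rj) ** 2 + (ci - cj) ** 2
            mask[i][j] = sum(a * math.exp(-d2 / (2.0 * s * s))
                             for a, s in zip(amplitudes, sigmas))
    return mask

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, mask):
    """Assumed application: softmax each row, weight it element-wise
    by the mask, then renormalize so every row sums to 1."""
    out = []
    for s_row, m_row in zip(scores, mask):
        weighted = [p * w for p, w in zip(softmax(s_row), m_row)]
        total = sum(weighted)
        out.append([w / total for w in weighted])
    return out

# Usage: a 3 x 3 patch grid (9 tokens) and a mixture of two Gaussians.
grid = 3
n = grid * grid
mask = gaussian_mixture_mask(grid, amplitudes=[1.0, 0.5], sigmas=[1.0, 3.0])

# Uniform scores stand in for Q K^T / sqrt(d), so that the output
# shows the effect of the mask alone.
uniform_scores = [[0.0] * n for _ in range(n)]
attn = masked_attention(uniform_scores, mask)
```

Under these assumptions the mixture above carries four scalars in total, whereas an element-wise learnable mask for the same 9 tokens would carry 81. With uniform scores, each token's attention is concentrated on spatially nearby patches, which is the kind of local bias the abstract attributes to convolution.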
