Attention Is Not All You Need Anymore


What is this paper about?

AI: This paper proposes the Extractor, a drop-in replacement for the self-attention mechanism in the Transformer. The paper formulates the sequence prediction problem in text generation using variable-length discrete-time Markov chains, and the Extractor sublayer is designed to address the variable-length issue in that formulation and so improve the Transformer's text generation performance. The paper compares the Extractor with self-attention in terms of both speed and performance. It also discusses the Extractor's computational complexity and estimates the number of operations and trainable parameters for each sublayer.


Zhe Chen


In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer by trading off performance. However, performance is key to the continuing success of the Transformer. In this paper, a drop-in replacement for the self-attention mechanism in the Transformer, called the Extractor, is proposed. Experimental results show that replacing the self-attention mechanism with the Extractor improves the performance of the Transformer. Furthermore, the proposed Extractor has the potential to run faster than self-attention, since it has a much shorter critical path of computation. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on our understanding.
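To make the phrase "variable-length discrete-time Markov chain" more concrete, here is a minimal sketch of the general idea: the next token is predicted from a context whose length varies, up to a fixed maximum. This is a toy count-based model for illustration only, not the paper's formulation and not the Extractor. The names `train`, `predict`, and `max_len`, the sample corpus, and the back-off to shorter contexts are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def train(tokens, max_len):
    """Count next-token occurrences for every context of length 1..max_len."""
    counts = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for n in range(1, max_len + 1):
            if i - n < 0:
                break  # not enough history for a context this long
            context = tuple(tokens[i - n:i])
            counts[context][tokens[i]] += 1
    return counts


def predict(counts, prefix, max_len):
    """Return the most frequent next token for the longest known suffix of prefix.

    Backing off to shorter contexts is an illustrative choice (hypothetical),
    not something stated in the paper.
    """
    for n in range(min(max_len, len(prefix)), 0, -1):
        context = tuple(prefix[-n:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # no suffix of the prefix was seen during training


# Hypothetical toy corpus, used only to exercise the sketch.
corpus = "the cat sat on the mat and the cat ran".split()
model = train(corpus, max_len=3)

next_after_short = predict(model, ["the"], max_len=3)        # one-token context
next_after_long = predict(model, ["on", "the"], max_len=3)   # two-token context
```

With this corpus, the one-token context `["the"]` and the two-token context `["on", "the"]` can yield different predictions, which shows how the amount of available history changes the predicted distribution. A Transformer replaces the count table with learned sublayers, but it faces the same situation: the number of preceding tokens grows from one up to the context window.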
