One of the features of our brain is its modularity. It is characterised by distinct but interacting subsystems that underlie key functions such as memory, language, perceptions, etc. Understanding the complex interplay between these modules requires decomposing them into a set of components. Indeed, this modular approach to modeling the brain is one of the ways neuroscientists are studying the brain. The ‘module’ that I am going to be talking about in this blogpost has its roots in neuroscience but has greatly inspired AI research: attention. I will focus on the neuroscience aspect of attention first before moving on to review some of the most exciting developments using attentional mechanisms in machine learning.
The neuroscientific roots of attention
Neuroscientists have long studied attention as an important cognitive process. It is described as the ability of organisms to ‘select a subset of available information upon which to focus for enhanced processing and integration’ and encompasses three aspects: orienting, filtering and searching. Visual attention, for example, is an active area of research. Our ability to focus on specific area of a visual scene and extract and process the information that is streamed to our brain is thought to be an evolutionary trait that all but guaranteed the survival of our species. This capability to select, process and act upon sensory experience has inspired a whole branch of research in computational modelling of visual attention.
The emergence of a whole suite of sophisticated equipment to scan and study the brain has further fanned the flames of enthusiasm for attention research. In a recent study using eye tracking and fMRI data, Leong et al. demonstrated the bidirectional interaction between attention and learning: attention facilitates learning, and learned values in turn inform attentional selection . The relationship between attention and consciousness is a complex issue, and in many sense both a scientific and a philosophical exploration. The ability to focus one’s thoughts out of several simultaneous objects or trains of thought and take control of one’s own mind in a vivid and conscious manner is not just a delightful and useful perk. It is a quintessential part of our experience of human-ness.
Given their significance, attentional mechanisms have in recent years received increasing attention (pun intended) from the AI community. A detailed explanation of how they are applied in machine learning will require a separate blog post (I highly recommend this excellent article by Olah and Carter) but in essence attention layers provide the functionality of focusing on specific elements to improve the performance of a model. In an image recognition task for example, it does so by taking ‘glimpses’ of the input image at each step, updating the internal state representations, and then selecting the next location to sample. In a cluttered setting or when the input is too big, attention serves a ‘prioritisation’ function to filter out irrelevant elements. It is a powerful technique that can be used when interfacing with a neural network that has a repeating structure in its output. For example, when applied to augment LSTM (a special variant of recurrent neural networks), it lets every step of an RNN select information to look at from a larger body of information. However, attentional mechanisms are not just useful in RNNs, as we will find out below.
State of the art using attention in machine learning
In machine learning, attention is especially useful in sequence prediction problems. Let’s review a few of the major areas where it has been applied successfully.
1. Natural language processing
Attentional mechanisms have been applied in many natural language processing (NLP) related tasks. The seminal work by Bahdanau et al. proposed a neural machine translation model that implements an attention mechanism in the decoder for English-to-French translation . As the system reads the English input (encoder), the decoder outputs French translation whereby the attention mechanism learns by stochastic gradient descent to shift the focus to concentrate on the parts surrounding the word that is being translated. Their RNN-based model has been shown to outperform traditional phrase-based models by huge margins. RNNs are the incumbent architecture for text applications but it does not allow for parallelisation, which limits its potential of using GPU hardware that powers modern machine learning. A team of Facebook AI researchers introduced a novel approach using convolutional neural networks (which are highly parallelisable) and a separate attention module in each decoder layer. As opposed to Bahdanau et al’s ‘single step attention’, theirs is a multi-hop attention module. This means instead of looking at the sentence once and then translating it without looking back, the mechanism takes multiple glimpses at the sentence to determine what it will translate next. Their approach outperformed state of the art results for English-German and English-French translation at an order of magnitude faster speed . Other examples of attentional mechanisms being applied in NLP problems include text classification , language processing (performing tasks described by natural language instructions in a 3D game-play environment)  and text comprehension (answering close-style questions about a document) .
2. Object recognition
Object recognition is one of the hallmarks of machine intelligence. Mnih et al. demonstrated how an attentional mechanism can be used to ignore irrelevant objects in a scene, allowing the model to perform well in challenging object recognition tasks in the presence of clutter . In their Recurrent Attention Model (RAM), the agent receives partial observation of the environment at each step and learns where to focus (i.e. pay attention to) next through training an RNN. Attention is used to produce a ‘glimpse feature vector’ whereby regions around a target pixel is encoded at high-resolution and pixels further from the target pixel uses progressively lower resolution. Using a similar approach, another study used a deep recurrent attention model to both localise and recognise multiple objects in images . Xu et al. trained a model that automatically learns to describe the content of images . Their attention models were trained using a multilayer perceptron that is conditioned on some previous hidden state, meaning where the network looks next depends on the sequence of words that has already been generated. The researchers showed how to use convolutional neural networks to pay attention to images when outputting a sequence, i.e. the image caption. Another advantage of attention in this case is the insights gained by approximately visualising where and what the attention focused on (i.e. what the model ‘sees’).
Google DeepMind’s Deep Q-Network (DQN) represented a significant advance in Reinforcement Learning and a breakthrough in general AI in the sense that it showed a single algorithm can learn to play a wide variety of Atari 2600 games: the agent was able to continually adapt its behaviour without any human intervention. Sorokin et al. added attention to the equation and developed the Deep Attention Recurrent Q-Network (DARQN) . Their model outperformed that of DQN by incorporating what they termed ‘soft’ and ‘hard’ attention mechanisms. The attention network takes the current game state as input and generates a context vector based on the features observed. An LSTM then takes this context vector along with a previous hidden state and memory state to evaluate the action that an agent can take. Choi et al. further improved on DARQN by implementing a multi-focus attention network where the agent is capable of attending to multiple important elements . In contrast to DARQN that uses only one attention layer, the model uses multiple parallel attentions to attend to entities that are relevant to tackling the problem.
4. Generative models
Attention has also proven useful in generative models, systems that can simulate (i.e. generate) values of any variable (inputs and outputs) in the model. Hong et al. developed a deep generative model based on a convolutional neural network for semantic segmentation (the task of assigning class labels to groups of pixels in an image) . By incorporating attention-like mechanisms they were able to capture transferable segmentation knowledge across categories. The attention mechanism adaptively focuses on different areas depending on the input labels. A softmax function is used to encourage the model to pay attention to only a segment of the image. Another example is Google DeepMind’s Deep Recurrent Attentive Writer (DRAW) neural network for image generation . Attention allows the system to build up an image incrementally (shown in the video below). The attention model is fully differentiable (making it possible to train with gradient descent), thus allowing the encoder to focus on only part of the input and the decoder to modify only a part of the canvas. The model achieved impressive results generating images from the MNIST data set and when trained on the Street View House Number data set, it generated images that are almost identical to the real data.
5. Attention alone for NLP tasks
Another exciting line of research focusses on using attentional mechanisms alone for NLP tasks traditionally solved with neural networks. Vaswani et al. developed Transformer, a simple network architecture based solely on a novel multi-head attention mechanism for translation task . They compute the attention function on a set of queries simultaneously using a dot-product attention (each key is multiplied with the query to see how similar they are) with an additional scaling factor. This multi-head approach allows their model to attend to information from different positions at the same time. Their model completely foregoes recurrence and convolutions but still managed to attain state-of-art results for English-to-German and English-to-French translations. Moreover, they achieved this in significantly less training time and their model is highly parallelizable. An earlier work by Parikh et al. experimented with a simple attention-based approach to solve natural language inference tasks . They used attention to deconstruct the problem into subproblems that can be solved individually, hence making the model trivially parallelizable.
Not just a cog in the machine
What we have learned about attention so far tells us it is likely to be an essential component in the development of general AI. Philosophically, it is a key feature of the human psyche, which makes it a natural inclusion in pursuits that concerns the grey matter, while computationally, attention-based mechanisms have helped boost model performance to deliver stunning results in many areas. Attention has also proven to be a versatile technique, as is evident in its ability to replace recurrent layers in machine translation and other NLP related tasks. But it is most powerful when used in conjunction with other components, as Kaiser et al. demonstrated in their study One Model To Learn Them All that presented a model capable of solving a number of problems spanning multiple domains . To be sure, attentional mechanisms are not without weaknesses. As Olah and Carter suggested, their propensity to take every action at every step (albeit to varying extent) could potentially be very costly computationally. Nonetheless, I believe that in a modular approach to develop general AI – IMO our best bet in this quest – attention will be a worthwhile, and perhaps even indispensable, module.
 Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451-463.
 Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122.
 Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical Attention Networks for Document Classification. In HLT-NAACL (pp. 1480-1489).
 Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. (2017). Gated-Attention Architectures for Task-Oriented Language Grounding. arXiv preprint arXiv:1706.07230.
 Dhingra, B., Liu, H., Yang, Z., Cohen, W. W., & Salakhutdinov, R. (2016). Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.
 Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212).
 Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
 Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., … & Bengio, Y. (2016). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.
 Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.
 Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-Focus Attention Network for Efficient Deep Reinforcement Learning. AAAI Publications, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.
 Hong, S., Oh, J., Lee, H., & Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3204-3212).
 Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.
 Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
 Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One Model To Learn Them All. arXiv preprint arXiv:1706.05137.