What is Multi-Head Attention?

Multi-Head Attention

Multi-head attention is a mechanism that runs multiple attention operations in parallel, allowing the model to attend to different aspects of the input simultaneously. It is a core component of the Transformer architecture.

Understanding Multi-Head Attention

Multi-head attention lets a transformer attend to information from different representation subspaces at different positions in the input sequence at the same time. Instead of computing a single attention function, the model projects its input into several smaller sets of queries, keys, and values and runs one attention head on each set in parallel. Each head can learn to focus on a different type of relationship, such as syntactic dependencies, semantic associations, or positional patterns. The outputs of all heads are concatenated and passed through a final linear transformation to produce the result.

This design helps transformers capture diverse, complex patterns in data, which underpins large language models such as GPT and BERT. Because the heads are independent of one another and attention processes every position of a sequence at once rather than step by step, the computation parallelizes efficiently on GPU hardware. That efficiency contributed to the transformer's dominance over earlier sequential architectures such as LSTMs in natural language processing, and to its later adoption in computer vision.
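The steps described above (project, attend per head, concatenate, apply a final linear layer) can be illustrated with a minimal sketch in plain Python. It assumes scaled dot-product attention within each head, randomly initialized weights in place of learned ones, no masking, bias terms, or batching, and a model dimension that divides evenly by the number of heads. All function and parameter names here are hypothetical and chosen for illustration only; a real model would use a tensor library.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; returns (output, weights)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = matmul(q, k_t)               # seq_len x seq_len
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (assumption: uniform random values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def init_params(d_model, num_heads, rng):
    """One query/key/value projection per head plus one output projection."""
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads
    return {
        "wq": [random_matrix(d_model, d_head, rng) for _ in range(num_heads)],
        "wk": [random_matrix(d_model, d_head, rng) for _ in range(num_heads)],
        "wv": [random_matrix(d_model, d_head, rng) for _ in range(num_heads)],
        "wo": random_matrix(num_heads * d_head, d_model, rng),
    }


def multi_head_attention(x, params):
    """x is seq_len x d_model; the result has the same shape."""
    head_outputs = []
    for wq, wk, wv in zip(params["wq"], params["wk"], params["wv"]):
        # Each head attends in its own lower-dimensional subspace.
        out, _ = attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
        head_outputs.append(out)
    # Concatenate the heads' outputs position by position.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    # Final linear transformation mixes information across heads.
    return matmul(concat, params["wo"])


rng = random.Random(0)
x = random_matrix(4, 8, rng)                 # 4 positions, d_model = 8
params = init_params(d_model=8, num_heads=2, rng=rng)
y = multi_head_attention(x, params)
print(len(y), len(y[0]))                     # output shape matches the input shape
```

The loop over heads is written sequentially for readability. Since no head depends on another, the same computation could be batched into a few large matrix multiplications, which is the property that makes it well suited to GPUs.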

Related Deep Learning Terms

Activation Function

Adam Optimizer

Adapter Layers

Attention Mechanism

Autoencoder

Backpropagation

Batch Normalization

Batch Size