Abstract
The self-attention mechanism lies at the heart of modern sequential modeling with deep neural networks, especially in the context of Large Language Models (LLMs). The basic role of self-attention is to capture pairwise interactions between tokens in a sequence. This allows token embeddings to be endowed with context-dependent information before further processing. Despite its widespread use, our theoretical understanding of self-attention is still nascent. In this work, we examine self-attention as a model of binary sequence-to-sequence (seq2seq) functions, which is the simplest non-trivial case to analyze. We find, using VC dimension and the Sauer-Shelah lemma, that in the limit of long sequences, almost all binary seq2seq functions cannot be implemented by self-attention. We further demonstrate that every such function that self-attention cannot implement contains an XOR pattern or a composition of XOR patterns—there is no other fundamental pattern that self-attention cannot implement in the context of binary seq2seq functions. A simple corollary of these results yields a scaling law for the embedding dimension of the self-attention layer in the more general context of k-ary seq2seq functions. Beyond this, we explore specific data distributions and the capability of self-attention to perform standard tasks therein. We begin this foray by finding conditions under which expected values can be passed through the softmax function, allowing us to solve for the parameters that give desired expected attention weights. This result is used to provide evidence that self-attention is capable of cross-entropy-optimal classification of Gaussian distributed data when the decision boundary is linear. We also find that self-attention is capable of accuracy-optimal binary classification in the same setting, but only on a per-sample basis.
It is then demonstrated that self-attention is capable of cross-entropy-optimal Next Token Prediction (NTP) of k-state Markov Chains in the limit of large parameters. Finally, we examine the decoding of Hidden Markov Models (HMMs), and find that a pathological loss landscape emerges, which persists as we upgrade our model from a single self-attention layer to a transformer block with multi-headed attention. We identify critical points in the loss landscape in terms of the level-set geometries of self-attention and its gradient and hypothesize the existence of a long basin in which learning algorithms get stuck on a bad solution.
Degree
MS
College and Department
Computational, Mathematical, and Physical Sciences; Mathematics
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Angerhofer, Dustin, "Theoretical Expressivity of Self-Attention as a Model of Binary Sequence-To-Sequence Functions and Other Explorations" (2026). Theses and Dissertations. 11213.
https://scholarsarchive.byu.edu/etd/11213
Date Submitted
2026-04-21
Document Type
Thesis
Permanent Link
https://arks.lib.byu.edu/ark:/34234/q2b37efd9a
Keywords
Self-Attention, VC Dimension, Hyperplane Arrangements, Hidden Markov Models
Language
English