Abstract
The self-attention mechanism lies at the heart of modern sequential modeling with deep neural networks, especially in the context of Large Language Models (LLMs). The basic role of self-attention is to capture pairwise interactions between tokens in a sequence. This allows token embeddings to be endowed with context-dependent information before further processing. Despite its widespread use, our theoretical understanding of self-attention is still nascent. In this work, we examine self-attention as a model of binary sequence-to-sequence (seq2seq) functions, which is the simplest non-trivial case to analyze. We find, using VC dimension and the Sauer-Shelah lemma, that in the limit of long sequences, almost all binary seq2seq functions cannot be implemented by self-attention. We further demonstrate that every such function that self-attention cannot implement contains an XOR pattern or a composition of XOR patterns—there is no other fundamental pattern that self-attention cannot implement in the context of binary seq2seq functions. A simple corollary of these results yields a scaling law for the embedding dimension of the self-attention layer in the more general context of k-ary seq2seq functions. Beyond this, we explore specific data distributions and the capability of self-attention to perform standard tasks therein. We begin this foray by finding conditions under which expected values can be passed through the softmax function, allowing us to solve for the parameters that give desired expected attention weights. This result is used to provide evidence that self-attention is capable of cross-entropy-optimal classification of Gaussian distributed data when the decision boundary is linear. We also find that self-attention is capable of accuracy-optimal binary classification in the same setting, but only on a per-sample basis.
It is then demonstrated that self-attention is capable of cross-entropy-optimal Next Token Prediction (NTP) of k-state Markov Chains in the limit of large parameters. Finally, we examine the decoding of Hidden Markov Models (HMMs), and find that a pathological loss landscape emerges, which persists as we upgrade our model from a single self-attention layer to a transformer block with multi-headed attention. We identify critical points in the loss landscape in terms of the level-set geometries of self-attention and its gradient and hypothesize the existence of a long basin in which learning algorithms get stuck on a bad solution.
Degree
MS
College and Department
Computational, Mathematical, and Physical Sciences; Mathematics
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Angerhofer, Dustin, "Theoretical Expressivity of Self-Attention as a Model of Binary Sequence-To-Sequence Functions and Other Explorations" (2026). Theses and Dissertations. 11213.
https://scholarsarchive.byu.edu/etd/11213
Date Submitted
2026-04-21
Document Type
Thesis
Permanent Link
https://arks.lib.byu.edu/ark:/34234/q2b37efd9a
Keywords
Self-Attention, VC Dimension, Hyperplane Arrangements, Hidden Markov Models
Language
English