Abstract

Large Language Models (LLMs) are transformer models that generate text autoregressively in two stages: prefill and decode. The prefill stage processes all input tokens in parallel and stores their key and value projections in the KV cache. The decode stage uses the KV cache, processes a single token at a time, and is generally memory-bandwidth constrained. Each decode step produces a probability distribution over the vocabulary, samples a token from it, appends that token's keys and values to the cache, and repeats. This left-to-right, one-token-at-a-time generation paradigm has proven effective for language modeling across many domains; however, it also has limitations, including token permanence, constant compute and memory per generated token, and the lack of a lookahead mechanism. Methods such as Chain-of-Thought (CoT) prompting and reasoning with reinforcement learning (RL) have improved performance by explicitly generating planning representations; however, they still lack an inherent lookahead mechanism. This thesis proposes the Lookahead Transformer, a novel architecture that introduces an explicit lookahead mechanism, enabling autoregressive models to attend to and iteratively refine multiple future latent token representations during generation. The model uses i lookahead tokens, Ψ, encoded at future positions and refined over N recurrent steps, providing a bidirectional latent planning space that can be efficiently reused across generation steps. Experimental results show that the Lookahead Transformer can outperform a comparable baseline on language modeling tasks and offers a test-time control mechanism that scales the number of active lookahead tokens to improve performance. The Lookahead Transformer represents a step toward more flexible and efficient autoregressive transformer models that can “lookahead” to improve performance and better utilize compute during inference.
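
Since the abstract describes the mechanism only at a high level, the following is a minimal PyTorch-style sketch of how i lookahead tokens Ψ with future positional encodings might be refined over N recurrent steps within a single decode step. The module names, shapes, the single shared encoder block, and the absence of a causal mask are illustrative assumptions, not the thesis's implementation.

# Hypothetical sketch (not the thesis's code): one decode step with i lookahead
# tokens Psi refined over N recurrent passes through a shared transformer block.
import torch
import torch.nn as nn

class LookaheadSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, num_lookahead=4, num_refine_steps=2):
        super().__init__()
        self.num_lookahead = num_lookahead          # i lookahead tokens Psi
        self.num_refine_steps = num_refine_steps    # N recurrent refinement steps
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(2048, d_model)      # positions, including future ones
        self.psi_init = nn.Parameter(torch.randn(num_lookahead, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) context generated so far (stands in for the KV cache)
        b, t = token_ids.shape
        ctx = self.embed(token_ids) + self.pos(torch.arange(t))
        # Lookahead tokens Psi receive *future* positions t, t+1, ..., t+i-1
        psi = self.psi_init.expand(b, -1, -1) + self.pos(torch.arange(t, t + self.num_lookahead))
        for _ in range(self.num_refine_steps):
            # Jointly encode context + Psi; with no causal mask, attention among
            # the Psi tokens is bidirectional, giving a latent planning space
            # ahead of position t
            h = self.block(torch.cat([ctx, psi], dim=1))
            psi = h[:, t:, :]                        # keep the refined lookahead states
        # Next-token prediction reads the last context state, which has attended
        # to the refined Psi tokens
        return self.lm_head(h[:, t - 1, :])          # logits for the next token

# Usage: logits = LookaheadSketch()(torch.randint(0, 1000, (1, 10)))

In a full decoder, the refined Ψ states would presumably be cached and reused across generation steps, as the abstract indicates; this sketch omits that reuse for brevity.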

Degree

MS

College and Department

Computational, Mathematical, and Physical Sciences; Computer Science

Rights

https://lib.byu.edu/about/copyright/

Date Submitted

2026-04-21

Document Type

Thesis

Keywords

computer science, transformer, language modeling, planning

Language

English
