Abstract

Large language models (LLMs) show remarkable abilities in both factual and creative tasks, yet the problem of hallucination persists despite advances in training and post-training methods. Recent progress in model interpretability suggests that model behavior can be predicted and influenced by analyzing and manipulating internal activations. In this thesis, we address three questions: (1) Can we identify linear representations of hallucination and creativity in the model’s latent space? (2) Do these representations play a causal role in model behavior when manipulated? (3) Are creativity and hallucination causally intertwined? To answer these, we construct a novel dataset of factual, hallucinated, and creative responses to Python package queries. We analyze the data using PCA and train logistic regression probes to test whether creative and hallucinatory activations are linearly separable. We further use mass-mean probes to extract semantic directions and apply additive and ablative interventions on activations to test causal effects. Our findings show that representations for creativity and hallucination are weakly correlated and largely independent. We also find that post-hoc analysis of generated tokens is more effective than predictive analysis for identifying hallucination representations. Finally, we show that synthetically constructed hallucination activations can serve as suitable proxies for genuine hallucinations when training classifier probes. These results advance our understanding of how LLMs encode and generate hallucinations, and they suggest new directions for interpretability methods that aim to detect and steer model behavior.
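The probing and steering recipe summarized above can be sketched in a few lines. The snippet below is illustrative only and is not code from the thesis: it uses synthetic NumPy arrays in place of real transformer hidden states, and names such as acts_factual, d_model, and alpha are assumptions made for the example. It trains a logistic regression probe to test linear separability, computes a mass-mean direction from class means, and applies additive and ablative interventions to hidden-state vectors.

```python
# Minimal sketch of linear probing, mass-mean directions, and activation
# interventions. Synthetic data stands in for real hidden states; variable
# names are illustrative, not taken from the thesis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 256  # assumed hidden-state width for the example

# Stand-in activations; in practice these would be hidden states collected
# from one model layer on factual vs. hallucinated responses.
acts_factual = rng.normal(0.0, 1.0, size=(500, d_model))
acts_halluc = rng.normal(0.3, 1.0, size=(500, d_model))

X = np.vstack([acts_factual, acts_halluc])
y = np.array([0] * len(acts_factual) + [1] * len(acts_halluc))

# Linear probe: can logistic regression separate the two classes?
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")

# Mass-mean direction: difference of class means, taken as a candidate
# "hallucination" direction in activation space.
direction = acts_halluc.mean(axis=0) - acts_factual.mean(axis=0)
direction /= np.linalg.norm(direction)

def add_direction(h, alpha=4.0):
    """Additive intervention: push hidden states along the direction."""
    return h + alpha * direction

def ablate_direction(h):
    """Ablative intervention: project the direction out of hidden states."""
    return h - np.outer(h @ direction, direction)

h = acts_factual[:5]  # a few example hidden-state rows
print(add_direction(h).shape, ablate_direction(h).shape)
```

In practice, the same operations would typically be applied to activations captured from a chosen layer of the model during generation, with the additive or ablative edit inserted via a forward hook before decoding continues.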

Degree

MS

College and Department

Computer Science; Computational, Mathematical, and Physical Sciences

Rights

https://lib.byu.edu/about/copyright/

Date Submitted

2025-11-07

Document Type

Thesis

Keywords

computer science, deep learning, large language model, hallucination

Language

English
