Abstract
Neural networks typically struggle with reasoning tasks on out-of-domain data, a setting that humans adapt to far more easily. Humans come with prior knowledge of concepts and can segment their environment into building blocks (such as objects) that allow them to reason effectively in unfamiliar situations. Using this intuition, we train a network that utilizes fixed embeddings from the CLIP (Contrastive Language--Image Pre-training) model to perform a simple task that the original CLIP model struggles with. The network learns concepts (such as "collide" and "avoid") in a supervised source domain in such a way that it can adapt and identify similar concepts in a target domain with never-before-seen objects. Without any training in the target domain, we show an 11% accuracy improvement in recognizing concepts compared to the baseline zero-shot CLIP model. When provided with a few labels, this accuracy gap widens to 20%.
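The thesis itself is not reproduced on this record page, but the approach the abstract describes, training a small head on top of frozen CLIP embeddings of source-domain concepts and applying it unchanged to a target domain, can be sketched as follows. This is a minimal illustration, not the author's implementation: the embedding dimension of 512 matches CLIP ViT-B/32, while the head architecture, tensor shapes, and all variable names are assumptions, and random tensors stand in for precomputed CLIP embeddings.

```python
import torch
import torch.nn as nn

# Illustrative sketch: a small trainable head over frozen CLIP embeddings.
# Concept labels (e.g. "collide" vs. "avoid") are supervised in a source
# domain; at test time the same head is applied to embeddings of
# never-before-seen objects in a target domain with no further training.

EMBED_DIM = 512      # CLIP ViT-B/32 embedding size
NUM_CONCEPTS = 2     # hypothetical: "collide" vs. "avoid"

class ConceptHead(nn.Module):
    """Maps a fixed CLIP embedding to concept logits; CLIP itself stays frozen."""
    def __init__(self, embed_dim: int, num_concepts: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_concepts),
        )

    def forward(self, clip_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(clip_embeddings)

# Stand-ins for precomputed, frozen CLIP image embeddings (source domain).
source_embeddings = torch.randn(256, EMBED_DIM)
source_labels = torch.randint(0, NUM_CONCEPTS, (256,))

head = ConceptHead(EMBED_DIM, NUM_CONCEPTS)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):  # supervised concept learning in the source domain
    optimizer.zero_grad()
    loss = loss_fn(head(source_embeddings), source_labels)
    loss.backward()
    optimizer.step()

# Zero-shot-style transfer: the trained head is applied directly to
# target-domain embeddings (new objects, same concepts).
target_embeddings = torch.randn(64, EMBED_DIM)
predicted_concepts = head(target_embeddings).argmax(dim=-1)
```

In the few-shot variant the abstract mentions, the same head would simply be fine-tuned on the handful of labeled target-domain embeddings before evaluation.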
Degree
MS
College and Department
Physical and Mathematical Sciences; Computer Science
Rights
https://lib.byu.edu/about/copyright/
BYU ScholarsArchive Citation
Moody, Jamison M., "Zero and Few-Shot Concept Learning with Pre-Trained Embeddings" (2023). Theses and Dissertations. 10308.
https://scholarsarchive.byu.edu/etd/10308
Date Submitted
2023-04-21
Document Type
Thesis
Handle
http://hdl.lib.byu.edu/1877/etd13146
Keywords
deep learning, backpropagation, transformer models, artificial intelligence
Language
English