Keywords

Textual data, Machine reading, Data extraction, Natural language processing, OntoSoar system

Abstract

Textual data, from manuscripts to publications to website content, contains much of extant human knowledge. Unfortunately, the ability to harvest and effectively use this information beyond simple search/retrieval is greatly hampered by the scale of the “reading” problem: there is too much for any one person to read, and computers are not entirely adept at comprehending all information, explicit and implicit, contained in natural language text. Developing increased capability in this area is the focus of ongoing “machine reading” and “reading the web” research initiatives. Interested parties include businesses, the military, and intelligence-gathering agencies. Our own ongoing work with the Church Family History Department’s vast digitized repository has led us to consider increased participation in this area of research. We propose to unite the efforts of two different BYU research labs to create a sophisticated machine reading system. Each lab has concentrated on specific aspects of machine reading: (1) data extraction, integration and modeling in the BYU Data Extraction Group (DEG) lab, and (2) sophisticated natural language parsing and cognitive modeling in the BYU NL-Soar lab. Both research efforts are mature, having produced many academic and scholarly products; both have also benefited from prior support from on-campus and off-campus funding. Our project will involve designing, implementing, and evaluating a new system, OntoSoar, which integrates our OntoES system with our Soar-based natural language processing systems. Soar is an agent-based cognitive modeling system that has served as an integration platform for several complex multi-task implementations. Our new reading agent will be capable of extracting low-level information in its first-pass treatment of a text; it will then perform a careful re-reading of the text to find more subtle conceptual relationships. OntoSoar will then compare extracted content from both processes and merge or supplement its growing knowledge base accordingly. We will evaluate the system against current research datasets.

Original Publication Citation

Deryle W. Lonsdale, David W. Embley, Stephen W. Liddle (2014). An ontology-driven reading agent. ORCA Mentoring Environment Grant (MEG) Final Report, Journal of Undergraduate Research, BYU.

Document Type

Report

Publication Date

2014

Publisher

Brigham Young University

Language

English

College

Humanities

Department

Linguistics and English Language

University Standing at Time of Publication

Associate Professor

Included in

Linguistics Commons
