Theses and Dissertations

Automating the Extraction of Domain-Specific Information from the Web-A Case Study for the Genealogical Domain

Troy L. Walker, Brigham Young University - Provo

Abstract

Current ways of finding genealogical information within the millions of pages on the Web are inadequate. In an effort to help genealogical researchers find desired information more quickly, we have developed GeneTIQS, a Genealogy Target-based Information Query System. GeneTIQS builds on ontology-based methods of data extraction to allow database-style queries on the Web. This thesis makes two main contributions to GeneTIQS. (1) It builds a framework to do generic ontology-based data extraction. (2) It develops a hybrid record separator based on Vector Space Modeling that uses both formatting clues and data clues to split pages into component records. The record separator allows GeneTIQS to extract data from the complex documents common in genealogy. Experiments show that this approach yields 92% recall and 93% precision on documents from the Web.
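The abstract names Vector Space Modeling as the basis of the record separator but does not spell out the computation. The following is a minimal sketch of the general idea under stated assumptions, not the thesis's implementation: each candidate separator tag (a formatting clue) splits a page into chunks, the fields recognized in each chunk (data clues) form a count vector, and the split whose chunks look most like an expected record wins. The field names, the expected-record vector, the candidate tags, and all function names here are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty chunk matches nothing
    return dot / (norm_a * norm_b)


def score_split(chunks, expected):
    """Average similarity between each chunk's field counts and the expected record."""
    if not chunks:
        return 0.0
    total = sum(cosine_similarity(Counter(chunk), expected) for chunk in chunks)
    return total / len(chunks)


def best_separator(candidates, expected):
    """Return the candidate separator whose chunks best resemble one record each."""
    return max(candidates, key=lambda tag: score_split(candidates[tag], expected))


# Hypothetical expectation: one record holds one name, one birth date, one death date.
EXPECTED = {"name": 1, "birth_date": 1, "death_date": 1}

# Hypothetical page: fields recognized in each chunk when splitting on <hr> versus <br>.
CANDIDATES = {
    "hr": [["name", "birth_date", "death_date"],
           ["name", "birth_date", "death_date"]],
    "br": [["name"], ["birth_date"], ["death_date"],
           ["name"], ["birth_date"], ["death_date"]],
}

if __name__ == "__main__":
    print(best_separator(CANDIDATES, EXPECTED))
```

In this toy input, splitting on the hypothetical "hr" tag yields chunks that each contain a full set of fields, so it scores higher than splitting on "br", which fragments every record into single-field pieces.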

Degree

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

http://lib.byu.edu/about/copyright/

BYU ScholarsArchive Citation

Walker, Troy L., "Automating the Extraction of Domain-Specific Information from the Web-A Case Study for the Genealogical Domain" (2004). Theses and Dissertations. 214.
https://scholarsarchive.byu.edu/etd/214

Date Submitted

2004-11-23

Document Type

Thesis

Handle

http://hdl.lib.byu.edu/1877/etd607

Keywords

Information Extraction, Genealogy

Language

English

Included in

Computer Sciences Commons
