Faculty Publications

Categorizing and Extracting Information from Multilingual HTML Documents

Keywords

HTML documents, multilingual information, languages, Internet users

Abstract

The amount of online information written in different natural languages and the number of non-English-speaking Internet users have increased tremendously during the past decade. To provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that (i) analyzes, identifies, and categorizes the languages used in HTML documents, (ii) extracts information from HTML documents of interest written in different languages, (iii) allows the user to submit queries for retrieving extracted information in the same natural language provided by the query engine of DatAQs using a menu-driven user interface, and (iv) processes the user’s queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, and university catalogs. The average F-measure for correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure for categorizing HTML documents that belong to the car-ads application domain is 94%.

Original Publication Citation

SeungJin Lim and Yiu-Kai Ng, "Categorization and Information Extraction of Multilingual HTML Documents." In Proceedings of the 9th International Database Engineering and Application Symposium (IDEAS'05), IEEE Computer Society, pp. 415-422, July 25, 2005, Montreal, Canada.

BYU ScholarsArchive Citation

Ng, Yiu-Kai D. and Lim, SeungJin, "Categorizing and Extracting Information from Multilingual HTML Documents" (2005). Faculty Publications. 367.
https://scholarsarchive.byu.edu/facpub/367

Document Type

Peer-Reviewed Article

Publication Date

2005-07-01

Permanent URL

http://hdl.lib.byu.edu/1877/2628

Publisher

IEEE

Language

English

College

Physical and Mathematical Sciences

Department

Computer Science

Copyright Status

© 2005 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Copyright Use Information

http://lib.byu.edu/about/copyright/

Included in

Computer Sciences Commons
