HTML documents, multilingual information, languages, Internet users
The amount of online information written in different natural languages and the number of non-English-speaking Internet users have increased tremendously during the past decade. To provide high-performance access to multilingual information on the Internet, we have developed a data analysis and querying system (DatAQs) that (i) analyzes, identifies, and categorizes the languages used in HTML documents, (ii) extracts information from HTML documents of interest written in different languages, (iii) allows the user to submit queries, through a menu-driven user interface, for retrieving the extracted information in the same natural language provided by the query engine of DatAQs, and (iv) processes the user’s queries (as Boolean expressions) to generate the results. DatAQs extracts information from HTML documents that belong to various data-rich, narrow-in-breadth application domains, such as car ads, house rentals, job ads, stocks, and university catalogs. The average F-measure for correctly identifying HTML documents written in a particular natural language is 89%, whereas the F-measure for categorizing HTML documents belonging to the car-ads application domain is 94%.
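For readers unfamiliar with the metric, the F-measure figures above combine precision and recall into a single score. The abstract does not say which variant was used, so the sketch below assumes the common balanced form (F1, the harmonic mean of precision and recall). The function name and the counts in the usage note are illustrative and are not taken from the paper.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure (F1), assumed here; the abstract does not name the variant."""
    predicted = true_positives + false_positives  # documents the system labeled as the class
    actual = true_positives + false_negatives     # documents that truly belong to the class
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0  # no correct predictions at all
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

As a hypothetical example, `f_measure(90, 10, 10)` gives a precision and a recall of 0.90 each, so the score is about 0.90.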
Original Publication Citation
SeungJin Lim and Yiu-Kai Ng, "Categorization and Information Extraction of Multilingual HTML Documents." In Proceedings of the 9th International Database Engineering and Application Symposium (IDEAS'05), IEEE Computer Society, pp. 415-422, July 25, 2005, Montreal, Canada.
BYU ScholarsArchive Citation
Ng, Yiu-Kai D. and Lim, SeungJin, "Categorizing and Extracting Information from Multilingual HTML Documents" (2005). Faculty Publications. 367.
Physical and Mathematical Sciences
© 2005 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.