Faculty Publications

Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach

Keywords

fuzzy set model, web information retrieval, copy detection, HTML document, odds ratio

Abstract

Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes it take longer to filter out unique information and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a new approach for detecting similar Web documents, especially HTML documents. Our detection approach determines the odds ratio of any two documents, which makes use of the degrees of resemblance of the documents, and graphically displays the locations of similar (not necessarily the same) sentences detected in the documents after (i) eliminating non-representative words in the sentences using stopword-removal and stemming algorithms, (ii) computing the degree of similarity of sentences using a fuzzy set information retrieval approach, and (iii) matching the corresponding hierarchical content of the two documents using a simple tree matching algorithm. The proposed method for detecting similar documents handles a wide range of Web pages of varying size and does not require static word lists; it is thus applicable to Web (especially HTML) documents in different subject areas, such as sports, news, and science.
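Steps (i) and (ii) of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the textbook keyword-correlation form of the fuzzy set retrieval model (correlation factor c = n_ab / (n_a + n_b - n_ab), membership mu = 1 - prod(1 - c)), uses a tiny illustrative stopword list, omits stemming, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "and", "to"}


def tokens(sentence):
    """Lowercase, strip punctuation, drop stopwords. Stemming is omitted here."""
    words = (w.strip(".,;:!?()\"'").lower() for w in sentence.split())
    return {w for w in words if w and w not in STOPWORDS}


def correlation_factors(sentences):
    """Build c(a, b) = n_ab / (n_a + n_b - n_ab) from sentence co-occurrence counts."""
    occurs = Counter()    # number of sentences containing each word
    together = Counter()  # number of sentences containing both words of a pair
    for s in sentences:
        ws = tokens(s)
        occurs.update(ws)
        together.update(frozenset(p) for p in combinations(sorted(ws), 2))

    def c(a, b):
        if a == b:
            return 1.0
        n_ab = together[frozenset((a, b))]
        denom = occurs[a] + occurs[b] - n_ab
        return n_ab / denom if denom else 0.0

    return c


def membership(word, sentence_words, c):
    """Fuzzy membership of `word` in a sentence: 1 - prod(1 - c(word, k))."""
    prod = 1.0
    for k in sentence_words:
        prod *= 1.0 - c(word, k)
    return 1.0 - prod


def similarity(s1, s2, c):
    """Average membership of the words of s1 in s2 (asymmetric by construction)."""
    w1, w2 = tokens(s1), tokens(s2)
    if not w1:
        return 0.0
    return sum(membership(w, w2, c) for w in w1) / len(w1)


corpus = ["The cat sat on the mat.", "A cat sat on a mat.", "Dogs bark loudly."]
c = correlation_factors(corpus)
same = similarity(corpus[0], corpus[1], c)       # identical after stopword removal
different = similarity(corpus[0], corpus[2], c)  # no shared or co-occurring words
```

Because the measure is asymmetric, a comparison of two sentences would plausibly evaluate it in both directions before declaring them similar; the threshold and the odds ratio computation are not specified in the abstract and are left out here.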

Original Publication Citation

Rajiv Yerra and Yiu-Kai Ng, "Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach." In Proceedings of IEEE International Conference on Granular Computing (GrC'05), pp. 693-699, July 25, 2005, Beijing, China.

BYU ScholarsArchive Citation

Ng, Yiu-Kai D. and Yerra, Rajiv, "Detecting Similar HTML Documents Using a Fuzzy Set Information Retrieval Approach" (2005). Faculty Publications. 368.
https://scholarsarchive.byu.edu/facpub/368

Document Type

Peer-Reviewed Article

Publication Date

2005-07-01

Permanent URL

http://hdl.lib.byu.edu/1877/2630

Publisher

IEEE

Language

English

College

Physical and Mathematical Sciences

Department

Computer Science

Copyright Status

© 2005 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Copyright Use Information

http://lib.byu.edu/about/copyright/

Included in

Computer Sciences Commons
