Faculty Publications

Using Fuzzy-Word Correlation Factors to Compute Document Similarity Based on Phrase Matching

Keywords

fuzzy-word correlation, document similarity, phrase matching, web information retrieval

Abstract

One of the current problems in Web information retrieval (IR) is identifying redundant information that exists in (replicated) Web documents. Such documents can easily be found in several forms, such as different versions of the same document, small documents combined with others to form a larger document, etc. As the Web becomes more and more popular, the number of documents on the Web increases daily, and filtering redundant ones out of this huge number of documents becomes a more difficult and urgent task. As one solution to this problem, we present a new method that identifies similar documents based on phrase matching, using the fuzzy-word correlation factors among words in phrases. Since phrases can be treated as sequences of words in a sentence in any document, we consider the correlation factors of different words in any two phrases of two different documents to determine the degree of similarity of the phrases, which in turn determines the similarity of the documents based on the number of matched phrases/sentences in the documents. Experimental results show that our phrase-matching approach is accurate and outperforms the word-based similarity matching approach.
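The idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy `CORRELATION` table, the helper names `word_corr`, `phrase_sim`, and `doc_sim`, the averaging rule used to score a pair of phrases, and the `threshold` value. The abstract does not state the authors' actual formulas, so this shows only the general shape of the approach: word-level correlation factors feed a phrase-level score, and the share of matched phrases gives a document-level score.

```python
# Hypothetical word-to-word correlation factors in [0, 1]. A real system
# would derive these from a corpus; these values are invented for the demo.
CORRELATION = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
    frozenset(("car", "vehicle")): 0.7,
}


def word_corr(w1, w2):
    """Correlation factor of two words; identical words count as 1.0."""
    if w1 == w2:
        return 1.0
    return CORRELATION.get(frozenset((w1, w2)), 0.0)


def phrase_sim(p1, p2):
    """Assumed phrase score: each word is paired with its best-correlated
    word in the other phrase, the results are averaged, and the lower of
    the two directions is kept so the score is symmetric."""
    if not p1 or not p2:
        return 0.0

    def directed(a, b):
        return sum(max(word_corr(x, y) for y in b) for x in a) / len(a)

    return min(directed(p1, p2), directed(p2, p1))


def doc_sim(d1, d2, threshold=0.6):
    """Assumed document score: fraction of phrases in d1 that match at
    least one phrase in d2 with a phrase score >= threshold."""
    if not d1:
        return 0.0
    matched = sum(
        1 for p in d1 if any(phrase_sim(p, q) >= threshold for q in d2)
    )
    return matched / len(d1)


doc_a = [["fast", "car"], ["red", "house"]]
doc_b = [["quick", "automobile"], ["green", "tree"]]
print(doc_sim(doc_a, doc_b))
```

In this toy example the two documents share no words, so an exact word-overlap measure would score them as unrelated, while the correlation-based score still matches "fast car" with "quick automobile".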

Original Publication Citation

Jun Won Lee and Yiu-Kai Ng. "Using Fuzzy-Word Correlation Factors to Compute Document Similarity Based on Phrase Matching." In Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'07), pp. 186-191, August 24-27, 2007, Haikou, China.

BYU ScholarsArchive Citation

Lee, Jun Won and Ng, Yiu-Kai D., "Using Fuzzy-Word Correlation Factors to Compute Document Similarity Based on Phrase Matching" (2007). Faculty Publications. 240.
https://scholarsarchive.byu.edu/facpub/240

Document Type

Peer-Reviewed Article

Publication Date

2007-08-24

Permanent URL

http://hdl.lib.byu.edu/1877/2633

Publisher

IEEE

Language

English

College

Physical and Mathematical Sciences

Department

Computer Science

Copyright Status

© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Copyright Use Information

http://lib.byu.edu/about/copyright/

Included in

Computer Sciences Commons
