Web documents whose content is partially or completely duplicated are easy to find on the Internet these days. Not only do these documents create redundant information on the Web, which makes filtering out unique information take longer and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this thesis, we present a new approach for detecting similar (HTML) Web documents and evaluate its performance. To detect similar documents, we first apply our sentence-based copy detection approach to determine whether sentences in any two documents should be treated as the same or different according to their degree of similarity, which is computed using either the three least-frequent 4-gram approach or the fuzzy set information retrieval (IR) approach. These copy detection approaches, which achieve a high success rate in detecting similar (not necessarily identical) sentences, (i) handle a wide range of documents in different subject areas (such as sports, news, and science) and (ii) do not require static word lists, which means that there is no need to look up words in a predefined dictionary or thesaurus to determine the similarity among words. Besides detecting similar sentences in two documents, we can display the relative locations of the similar (not necessarily identical) sentences detected in the documents using dotplot views, a graphical tool. Experimental results show that the fuzzy set IR approach outperforms the three least-frequent 4-gram approach in copy detection. For this reason, we adopt the fuzzy set IR copy detection approach for detecting similar Web documents, especially HTML documents, by computing the degree of resemblance between any two HTML documents, which represents the extent to which the two documents under consideration are similar. We then match the corresponding hierarchical content of the two documents using a simple tree matching algorithm.
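As a rough illustration of the sentence-level comparison described above, the following Python sketch shows one way a "three least-frequent 4-gram" fingerprint could work. It is not the thesis implementation: the use of character-level 4-grams, the normalization (lowercasing and removing whitespace), the alphabetical tie-breaking, and the rule that two sentences match only when their fingerprints are identical are all assumptions made for this example, and the function names are hypothetical.

```python
from collections import Counter


def char_4grams(sentence):
    """Return the character 4-grams of a sentence (assumed normalization:
    lowercase, all whitespace removed)."""
    text = "".join(sentence.lower().split())
    return [text[i:i + 4] for i in range(len(text) - 3)]


def fingerprint(sentence, corpus_counts, k=3):
    """Pick the k 4-grams of the sentence that are rarest in the corpus."""
    grams = set(char_4grams(sentence))
    # Sort by corpus frequency, then alphabetically so ties break the same
    # way for every sentence (an assumption of this sketch).
    ranked = sorted(grams, key=lambda g: (corpus_counts[g], g))
    return frozenset(ranked[:k])


def matching_sentence_pairs(doc_a, doc_b):
    """Return (i, j) index pairs whose fingerprints are identical.

    doc_a and doc_b are lists of sentences. Every sentence of doc_a is
    compared with every sentence of doc_b, so the work grows quadratically
    with the number of sentences.
    """
    counts = Counter(g for s in doc_a + doc_b for g in char_4grams(s))
    prints_b = [fingerprint(s, counts) for s in doc_b]
    pairs = []
    for i, sentence in enumerate(doc_a):
        fp_a = fingerprint(sentence, counts)
        for j, fp_b in enumerate(prints_b):
            if fp_a and fp_a == fp_b:  # skip sentences too short for a 4-gram
                pairs.append((i, j))
    return pairs


doc_a = ["The cat sat on the mat.", "Dogs bark loudly at night."]
doc_b = ["Completely unrelated text here.", "The cat sat on the mat."]
pairs = matching_sentence_pairs(doc_a, doc_b)
```

Each returned pair (i, j) can be read as one point of a dotplot: sentence i of the first document on one axis and sentence j of the second document on the other.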
Our copy detection approach is unique since it is sentence-based, rather than word-based as most existing copy detection approaches are, and can specify the relative positions of the same (or similar) sentences in their corresponding HTML documents graphically, as well as hierarchically, according to the document structures. Our approach also differs from others, since it (i) performs copy detection on HTML documents, instead of plain text documents, (ii) detects HTML documents with similar sentences in addition to exact matches, and (iii) is simple, as it uses the fuzzy set IR model for determining related words in documents and filtering redundant Web documents, and is supported by well-known yet simple mathematical models. Experiments on the detection of similar documents have been performed to assess accuracy using false positives, false negatives, precision, recall, and F-measure values. With an F-measure of over 90%, which indicates that the percentage of error is relatively small, our approach to detecting similar documents performs reasonably well. The time complexity of our copy detection approach is O(n^2), where n is the total number of sentences in an HTML document, whereas the time complexity of detecting similar HTML documents using our copy detection approach is O(n log n). The overall time complexity of our copy detection and similar HTML document detection approach is O(n log n + n^2) = O(n^2).
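For readers unfamiliar with the evaluation measures named above, the following Python sketch computes precision, recall, and F-measure from counts of true positives, false positives, and false negatives, using the standard definitions (with F-measure as the harmonic mean of precision and recall). The counts in the usage lines are hypothetical values chosen for illustration only; they are not results from the thesis.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) for a detection experiment.

    precision = TP / (TP + FP): share of reported matches that are correct.
    recall    = TP / (TP + FN): share of actual matches that were reported.
    F-measure = harmonic mean of precision and recall.
    """
    reported = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts, for illustration only (not taken from the thesis).
precision, recall, f_measure = precision_recall_f(true_pos=90, false_pos=10, false_neg=10)
```

Because the F-measure is a harmonic mean, it stays high only when both false positives and false negatives are few, which is why a high F-measure is read as a small overall error.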
College and Department
Physical and Mathematical Sciences; Computer Science
BYU ScholarsArchive Citation
Yerra, Rajiv, "Detecting Similar HTML Documents Using A Sentence-Based Copy Detection Approach" (2005). All Theses and Dissertations. 626.
html, computer, copy