This work addresses the need for an alternative to keyword-based search for sifting through large collections of medical journal articles in PDF format during literature review. Despite users' best efforts to form precise and accurate queries, it is often difficult to guess the right keywords to find all the related articles while retrieving as few unrelated ones as possible. Failing to find relevant related research during literature review wastes research time and effort and risks missing significant work in the area, which could affect the quality of the research being conducted. The purpose of this work is to explore the benefits of a retrieval system for professional journal articles in PDF format that supports hybrid queries composed of both text and images. PDF medical journal articles contain formatting and layout information that implies the structure and organization of the document. They also contain figures and tables rich with content and meaning. Stripping a PDF down to “full text” for indexing purposes disregards these important features. Specifically, this work investigated the following: (1) what effect incorporating a document's embedded figures into the query (in addition to its text) has on retrieval performance (precision) compared to plain keyword-based search; (2) how current text-based document-query similarity methods can be enhanced by using formatting and font-size information as a model of a PDF document's structure and organization; (3) whether to use the standard Euclidean distance function or the matrix distance function for content-based image retrieval; (4) how to convert a pure-layout PDF document into a structured, formatted, reflowable XML representation; and (5) what document views (such as a term frequency cloud, a document outline, or a document's figures) would help users wade through search results and quickly select those that are worth a closer look.
Although the experimental results were unexpectedly worse than their comparison baselines (see the conclusion for a summary), the experimental methods remain valuable: they show others which directions have already been pursued, why those directions did not work, and what problems remain to be solved in order to improve literature review through a hybrid text and image retrieval system.



Ira A. Fulton College of Engineering and Technology; Electrical and Computer Engineering



content-based image retrieval, content-based document retrieval, evaluation, modified tf-idf, document structure, document understanding, hybrid text and image retrieval system