Faculty Publications

ADtrees for Sequential Data and N-gram Counting

Robert Van Dam
Dan A. Ventura

Keywords

ADtrees, n-gram count, corpora data

Abstract

We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
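For orientation, the two baselines named in the abstract (a naïve table of n-grams and a traditional prefix tree) can be sketched as follows. This is a minimal illustration only, not the ADtree adaptation the paper proposes; the names `ngram_counts`, `TrieNode`, `build_prefix_tree`, and `lookup` are hypothetical and do not come from the paper.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Naive baseline: one table entry per distinct n-gram of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class TrieNode:
    """Prefix-tree node (hypothetical name): a count plus child nodes."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_prefix_tree(tokens, n):
    """Insert every window of up to n tokens; shared prefixes share nodes."""
    root = TrieNode()
    for i in range(len(tokens)):
        node = root
        for tok in tokens[i:i + n]:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root


def lookup(root, gram):
    """Return the stored count for a k-gram with k <= n, or 0 if absent."""
    node = root
    for tok in gram:
        node = node.children.get(tok)
        if node is None:
            return 0
    return node.count


tokens = "the cat sat on the cat mat".split()
table = ngram_counts(tokens, 2)
tree = build_prefix_tree(tokens, 2)
print(table[("the", "cat")], lookup(tree, ("the", "cat")))
```

In this sketch the prefix tree answers queries for every length up to n from a single structure, whereas the naïve table needs a separate entry for each distinct n-gram at each length.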

Original Publication Citation

Rob Van Dam and Dan Ventura, "ADtrees for Sequential Data and N-gram Counting", Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 492-497, 2007.

BYU ScholarsArchive Citation

Van Dam, Robert and Ventura, Dan A., "ADtrees for Sequential Data and N-gram Counting" (2007). Faculty Publications. 943.
https://scholarsarchive.byu.edu/facpub/943

Document Type

Peer-Reviewed Article

Publication Date

2007-10-07

Permanent URL

http://hdl.lib.byu.edu/1877/2517

Publisher

IEEE

Language

English

College

Physical and Mathematical Sciences

Department

Computer Science

Copyright Status

© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Copyright Use Information

http://lib.byu.edu/about/copyright/

Included in

Computer Sciences Commons