As computer system usage grows in our world, system administrators need better visibility into the workings of computer systems, especially when those systems have problems or go down. Most system components, from hardware, through OS, to application server and application, write log files of some sort, be it system-standardized logs such syslog or application specific logs. These logs very often contain valuable clues to the nature of system problems and outages, but their verbosity can make them difficult to utilize. Statistical data mining methods could help in filtering and classifying log entries, but these tools are often out of the reach of administrators. This research tests the effectiveness of three off-the-shelf Bayesian spam email filters (SpamAssassin, SpamBayes and Bogofilter) for effectiveness as log entry classifiers. A simple scoring system, the Filter Effectiveness Scale (FES), is proposed and used to compare these filters. These filters are tested in three stages: 1) the filters were tested with the SpamAssassin corpus, with various manipulations made to the messages, 2) the filters were tested for their ability to differentiate two types of log entries taken from actual production systems, and 3) the filters were trained on log entries from actual system outages and then tested on effectiveness for finding similar outages via the log files. For stage 1, messages were tested with normalized bodies, normalized headers and with each sentence from each message body as a separate message with a standardized message. The impact of each manipulation is presented. For stages 2 and 3, log entries were tested with digits normalized to zeros, with words chained together to various lengths and one or all levels of word chains used together. The impacts of these manipulations are presented. In each of these stages, it was found that these widely available Bayesian content filters were effective in differentiating log entries. Tables of correct match percentages or score graphs, according to the nature of tests and numbers of entries are presented, are presented, and FES scores are assigned to the filters according to the attributes impacting their effectiveness. This research leads to the suggestion that simple, off-the-shelf Bayesian content filters can be used to assist system administrators and log mining systems in sifting log entries to find entries related to known conditions (for which there are example log entries), and to exclude outages which are not related to specific known entry sets.
College and Department
Ira A. Fulton College of Engineering and Technology; Technology
BYU ScholarsArchive Citation
Havens, Russel William, "Naive Bayesian Spam Filters for Log File Analysis" (2011). All Theses and Dissertations. 2814.
Russel Havens, log file analysis, Bayesian content filter, spam filter, SpamAssassin, SpamBayes, Bogofilter, filter effectiveness scale, fes