Journal of Undergraduate Research


machine-learning algorithms, predict biomedical outcomes, gene-expression data


Life Sciences




Biomedical data are increasing in size and complexity. To make sense of these data, biomedical researchers often use “machine-learning” algorithms, which are developed by the computer-science community. Our goal was to systematically compare many of these algorithms across 100 data sets and identify which algorithms perform best for this type of data. To help meet this goal, we also planned to carefully curate data from the public domain for others to use in their own comparisons.

Unlike DNA, which changes little from cell to cell, gene-expression levels vary dramatically across different types of tissues and under different conditions. Because of this variation, gene-expression data can often be used to predict biomedical outcomes such as development of a disease, survival, reaction to a drug, and other medically relevant information. Researchers can choose from a wide range of machine-learning algorithms, and because human health is at stake, prediction accuracy is important. However, because of the overwhelming number of algorithms available, researchers often use whatever algorithm(s) they have used previously, even though there may be a more accurate alternative.1 We hypothesized that certain algorithms would perform better than others overall and that some algorithmic attributes may be best suited to certain data-set characteristics.
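
To make the idea of comparing algorithms across data sets concrete, here is a minimal sketch in Python. It is not the method used in this project: the data are synthetic stand-ins for gene-expression matrices, the two classifiers (a majority-class baseline and a nearest-centroid rule) are toy examples chosen for brevity, and every function name and parameter (`make_dataset`, `compare`, `shift`, and so on) is hypothetical. Each algorithm is scored with k-fold cross-validation on the same folds of each data set, and accuracies are then averaged across data sets.

```python
import random
import statistics


def make_dataset(n_samples, n_genes, shift, rng):
    """Synthetic stand-in for an expression matrix (an assumption, not real data).

    Samples alternate between class 0 and class 1; class 1 values are
    drawn from a normal distribution whose mean is moved by `shift`.
    """
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([rng.gauss(label * shift, 1.0) for _ in range(n_genes)])
        y.append(label)
    return X, y


def majority_class(X_train, y_train, X_test):
    """Baseline: always predict the most common training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


def nearest_centroid(X_train, y_train, X_test):
    """Predict the class whose mean profile is closest (squared Euclidean)."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda lab: dist(x, centroids[lab])) for x in X_test]


def make_folds(n_samples, k, rng):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(classifier, X, y, folds):
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = classifier(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)


def compare(algorithms, datasets, k=5, seed=0):
    """Mean cross-validated accuracy of each algorithm across all data sets."""
    rng = random.Random(seed)
    scores = {name: [] for name in algorithms}
    for X, y in datasets:
        folds = make_folds(len(X), k, rng)  # same folds for every algorithm
        for name, classifier in algorithms.items():
            scores[name].append(cross_val_accuracy(classifier, X, y, folds))
    return {name: statistics.fmean(vals) for name, vals in scores.items()}


ALGORITHMS = {"majority_class": majority_class, "nearest_centroid": nearest_centroid}

if __name__ == "__main__":
    rng = random.Random(42)
    # Three synthetic data sets with increasingly separable classes.
    datasets = [make_dataset(40, 20, shift, rng) for shift in (0.25, 0.5, 1.0)]
    results = compare(ALGORITHMS, datasets)
    for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{name}: mean accuracy {acc:.3f}")
```

Reusing the same folds for every algorithm on a given data set keeps the comparison fair, since differences in accuracy then reflect the algorithms rather than the luck of the split. A real comparison would substitute curated data sets and established algorithm implementations for these placeholders.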

Included in

Biology Commons