A Systematic Review of Datasets that Can Help Elucidate Relationships Among Gene Expression, Race, and Immunohistochemistry-defined Subtypes in Breast Cancer
Abstract
Scholarly requirements have led to a massive increase of transcriptomic data in the public domain, with millions of samples available for secondary research. We identified gene-expression datasets representing 10,214 breast-cancer patients in public databases. We focused on datasets that included patient metadata on race and/or immunohistochemistry (IHC) profiling of the ER, PR, and HER-2 proteins. This review provides a summary of these datasets and describes findings from 32 research articles associated with the datasets. These studies have helped to elucidate relationships between IHC, race, and/or treatment options, as well as relationships between IHC status and the breast-cancer intrinsic subtypes. We have also identified broad themes across the analysis methodologies used in these studies, including breast cancer subtyping, deriving predictive biomarkers, identifying differentially expressed genes, and optimizing data processing. Finally, we discuss limitations of prior work and recommend future directions for reusing these datasets in secondary analyses.