Keywords

gene transcription, clinical data, machine-learning algorithms, human transcriptional biomarker

Abstract

One important use of genome-wide transcriptional profiles is to identify relationships between transcription levels and patient outcomes. These translational insights can guide the development of biomarkers for clinical application. Data from thousands of translational-biomarker studies have been deposited in public repositories, enabling reuse. However, data-reuse efforts require considerable time and expertise because transcriptional data are generated using heterogeneous profiling technologies, preprocessed using diverse normalization procedures, and annotated in non-standard ways. To address this problem, we curated 45 publicly available, translational-biomarker datasets from a variety of human diseases. To increase the data's utility, we reprocessed the raw expression data using a uniform computational pipeline, addressed quality-control problems, mapped the clinical annotations to a controlled vocabulary, and prepared consistently structured, analysis-ready data files. These data, along with scripts we used to prepare the data, are available in a public repository. We believe these data will be particularly useful to researchers seeking to perform benchmarking studies—for example, to compare and optimize machine-learning algorithms' ability to predict biomedical outcomes.

Original Publication Citation

Golightly NP, Bell A, Bischoff AI, Hollingsworth PD, Piccolo SR. Curated compendium of human transcriptional biomarker data. Scientific Data. 2018;5:180066.

Document Type

Peer-Reviewed Article

Publication Date

2018-04-17

Publisher

Nature

Language

English

College

Life Sciences

Department

Biology

University Standing at Time of Publication

Associate Professor

Included in

Biology Commons
