Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning [version 2; peer review: 1 approved, 1 approved with reservations]

Caitlin E. Coombes; Zachary B. Abrams; Samantha Nakayiza; Guy Brock; Kevin R. Coombes

doi:10.12688/F1000RESEARCH.25877.2

Research output: Contribution to journal › Article › peer-review

Abstract

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.

Original language	English (US)
Pages (from-to)	1-28
Number of pages	28
Journal	F1000Research
Volume	9
DOIs	https://doi.org/10.12688/F1000RESEARCH.25877.2
State	Published - 2021
Externally published	Yes

Keywords

clinical data
clinical informatics
clustering
machine learning
mixed data
mixed-type data
supervised machine learning
unsupervised machine learning

ASJC Scopus subject areas

General Immunology and Microbiology
General Pharmacology, Toxicology and Pharmaceutics
General Biochemistry, Genetics and Molecular Biology

Access to Document

10.12688/F1000RESEARCH.25877.2

Cite this

@article{0543a9efbf3f460a92982c112c50287b,

title = "Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning [version 2; peer review: 1 approved, 1 approved with reservations]",

abstract = "The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.",

keywords = "clinical data, clinical informatics, clustering, machine learning, mixed data, mixed-type data, supervised machine learning, unsupervised machine learning",

author = "Coombes, {Caitlin E.} and Abrams, {Zachary B.} and Nakayiza, Samantha and Brock, Guy and Coombes, {Kevin R.}",

note = "Publisher Copyright: {\textcopyright} 2021. Coombes CE et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.",

year = "2021",

doi = "10.12688/F1000RESEARCH.25877.2",

language = "English (US)",

volume = "9",

pages = "1--28",

journal = "F1000Research",

issn = "2046-1402",

publisher = "F1000 Research Ltd.",

}

TY - JOUR

T1 - Umpire 2.0

T2 - Simulating realistic, mixed-type, clinical data for machine learning [version 2; peer review: 1 approved, 1 approved with reservations]

AU - Coombes, Caitlin E.

AU - Abrams, Zachary B.

AU - Nakayiza, Samantha

AU - Brock, Guy

AU - Coombes, Kevin R.

N1 - Publisher Copyright: © 2021. Coombes CE et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PY - 2021

Y1 - 2021

AB - The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed-type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.

KW - clinical data

KW - clinical informatics

KW - clustering

KW - machine learning

KW - mixed data

KW - mixed-type data

KW - supervised machine learning

KW - unsupervised machine learning

UR - http://www.scopus.com/inward/record.url?scp=85117287345&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85117287345&partnerID=8YFLogxK

U2 - 10.12688/F1000RESEARCH.25877.2

DO - 10.12688/F1000RESEARCH.25877.2

M3 - Article

AN - SCOPUS:85117287345

SN - 2046-1402

VL - 9

SP - 1

EP - 28

JO - F1000Research

JF - F1000Research

ER -
