Effect of training-sample size and classification difficulty on the accuracy of genomic predictors.

Popovici, V.; Chen, W.; Gallas, B.G.; Hatzis, C.; Shi, W.; Samuelson, F.W.; Nikolsky, Y.; Tsyganova, M.; Ishkin, A.; Nikolskaya, T.; Hess, K.R.; Valero, V.; Booser, D.; Delorenzi, M.; Hortobagyi, G.N.; Shi, L.; Symmans, W.F.; Pusztai, L.

doi:10.1186/bcr2468

Details

Download: BIB_0C25CFF6BFCD.P001.pdf (534.89 KB)
Status: Public
Version: Author's version

Serval ID

serval:BIB_0C25CFF6BFCD

Type

Article: article in a journal or magazine.

Collection

Publications

Institution

UNIL/CHUV

Title

Effect of training-sample size and classification difficulty on the accuracy of genomic predictors.

Journal

Breast Cancer Research

Authors

Popovici V., Chen W., Gallas B.G., Hatzis C., Shi W., Samuelson F.W., Nikolsky Y., Tsyganova M., Ishkin A., Nikolskaya T., Hess K.R., Valero V., Booser D., Delorenzi M., Hortobagyi G.N., Shi L., Symmans W.F., Pusztai L.

ISSN

1465-542X[electronic], 1465-5411[linking]

Publication status

Published

Publication date

2010

Peer-reviewed

Yes

Volume

Issue

Pages

Language

English

Abstract

Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.
Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
Results: The three classification problems were ranked by difficulty, and the performance of 120 models (40 predictors for each of the three endpoints) was estimated on the training set and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
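
The model counts in the abstract follow from a full factorial design: each feature-selection method is paired with each classifier, and every pair is trained once per endpoint. The sketch below only illustrates that arithmetic. The labels are hypothetical placeholders, since the abstract gives the number of methods, classifiers, and endpoints but not their names.

```python
from itertools import product

# Placeholder labels (assumptions): the abstract states only the counts.
feature_selectors = [f"feature_selection_{i}" for i in range(1, 6)]  # 5 univariate methods
classifiers = [f"classifier_{i}" for i in range(1, 9)]               # 8 classifiers
endpoints = [f"endpoint_{i}" for i in range(1, 4)]                   # 3 clinical endpoints

# One predictor = one (feature-selection method, classifier) pair.
predictors = list(product(feature_selectors, classifiers))

# One model = one predictor trained for one endpoint.
models = [(endpoint, fs, clf) for endpoint in endpoints for fs, clf in predictors]

print(len(predictors))  # 40
print(len(models))      # 120
```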

Keywords

Algorithms; Area Under Curve; Breast Neoplasms/chemistry; Breast Neoplasms/genetics; Female; Gene Expression Profiling/methods; Humans; Receptors, Estrogen/analysis; Sample Size

URN

urn:nbn:ch:serval-BIB_0C25CFF6BFCD7

OAI-PMH

oai:serval.unil.ch:BIB_0C25CFF6BFCD

DOI

10.1186/bcr2468

Pubmed

20064235

Web of science

000276986300011

Open Access

Yes