Combining Mixture Components for Clustering.

Details

Serval ID
serval:BIB_9E08FA638101
Type
Article: article from journal or magazine.
Collection
Publications
Title
Combining Mixture Components for Clustering.
Journal
Journal of computational and graphical statistics
Author(s)
Baudry J.P., Raftery A.E., Celeux G., Lo K., Gottardo R.
ISSN
1061-8600 (Print)
ISSN-L
1061-8600
Publication state
Published
Issued date
01/06/2010
Peer-reviewed
Yes
Volume
19
Number
2
Pages
332-353
Language
English
Notes
Publication types: Journal Article
Publication Status: ppublish
Abstract
Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental Materials are available on the journal Web site and described at the end of the paper.
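The hierarchical combining step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the posterior membership probabilities from a K-component Gaussian mixture (already selected by BIC) are available as a list of rows, and it assumes the entropy criterion is the decrease in classification entropy obtained by merging two components. The function names (`xlogx`, `entropy`, `merge_once`, `combine_hierarchically`) and the toy matrix `tau` are hypothetical and chosen for this example only.

```python
import math


def xlogx(p):
    """Return p * log(p), using the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def entropy(tau):
    """Classification entropy of a soft clustering (rows sum to 1)."""
    return -sum(xlogx(p) for row in tau for p in row)


def merge_once(tau):
    """Merge the pair of components whose union reduces entropy the most.

    Returns the new membership matrix (merged column appended last)
    and the indices of the merged pair.
    """
    k = len(tau[0])
    best = None
    for a in range(k):
        for b in range(a + 1, k):
            # Entropy decrease obtained by replacing columns a and b by their sum.
            gain = sum(
                xlogx(row[a] + row[b]) - xlogx(row[a]) - xlogx(row[b])
                for row in tau
            )
            if best is None or gain > best[0]:
                best = (gain, a, b)
    _, a, b = best
    merged = [
        [p for j, p in enumerate(row) if j not in (a, b)] + [row[a] + row[b]]
        for row in tau
    ]
    return merged, (a, b)


def combine_hierarchically(tau):
    """Return one soft clustering for each number of clusters from K down to 1."""
    solutions = {len(tau[0]): tau}
    while len(tau[0]) > 1:
        tau, _ = merge_once(tau)
        solutions[len(tau[0])] = tau
    return solutions


# Toy memberships (assumed, not from the paper): components 0 and 1 overlap
# heavily, while component 2 is well separated.
tau = [
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]

solutions = combine_hierarchically(tau)
for k in sorted(solutions, reverse=True):
    print(k, round(entropy(solutions[k]), 4))
```

In this toy input the two overlapping components are merged first, which is the behaviour the abstract describes for a single non-Gaussian cluster represented by several Gaussian components. Plotting the entropy against the number of clusters gives the kind of entropy plot the abstract refers to; the rescaling and the piecewise linear fit are not shown here.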
Create date
28/02/2022 11:45
Last modification date
23/03/2024 7:24