Variable selection in model-based clustering using multilocus genotype data.
We propose a variable selection procedure for model-based clustering of multilocus genotype data. Some loci may carry no information for separating individuals into statistically different populations, so they should be excluded from the clustering. Inferring the number $K$ of clusters and the subset $S$ of loci relevant for clustering is treated as a model selection problem. The competing models are compared using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator $(\widehat{K}_n, \widehat{S}_n)$. An associated algorithm, mixture model for genotype data (MixMoGenD), has been implemented in C++ and is available at http://www.math.u-psud.fr/ toussile. To avoid an exhaustive search for the optimum model, we propose a modified backward-stepwise algorithm, which enables a better search for the optimum model among all possible cardinalities of $S$. We present numerical experiments on simulated and real datasets that highlight the benefit of our loci selection procedure.
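The search over models $(K, S)$ described in the abstract can be illustrated with a minimal sketch. This is not the MixMoGenD code and not the modified algorithm of the paper. It is a plain backward-stepwise pass, written under the assumption that a penalized criterion `crit(k, subset)` (lower is better) is supplied by the caller. All names here (`backward_stepwise`, `select_model`, `toy_criterion`) are hypothetical, and the toy criterion only stands in for a real penalized negative log-likelihood.

```python
from typing import Callable, FrozenSet, Iterable, Sequence, Tuple

# Hypothetical signature: penalized criterion for K clusters and a locus subset S.
Criterion = Callable[[int, FrozenSet[int]], float]


def backward_stepwise(
    loci: Iterable[int], k: int, crit: Criterion
) -> Tuple[FrozenSet[int], float]:
    """Greedy backward pass for a fixed k: start from all loci, drop one at a
    time, and remember the best subset seen across every cardinality."""
    current = frozenset(loci)
    best_set, best_score = current, crit(k, current)
    while len(current) > 1:
        # Drop the locus whose removal yields the lowest criterion value.
        # sorted() makes tie-breaking deterministic.
        drop = min(sorted(current), key=lambda locus: crit(k, current - {locus}))
        current = current - {drop}
        score = crit(k, current)
        if score < best_score:
            best_set, best_score = current, score
    return best_set, best_score


def select_model(
    loci: Iterable[int], k_values: Sequence[int], crit: Criterion
) -> Tuple[int, FrozenSet[int], float]:
    """Run the backward pass for each candidate k and keep the best (k, S)."""
    loci = list(loci)
    best = None
    for k in k_values:
        subset, score = backward_stepwise(loci, k, crit)
        if best is None or score < best[2]:
            best = (k, subset, score)
    if best is None:
        raise ValueError("k_values must not be empty")
    return best


def toy_criterion(k: int, subset: FrozenSet[int]) -> float:
    """Stand-in criterion (assumption): minimized at k = 3 and S = {0, 2}."""
    return (k - 3) ** 2 + len(subset ^ frozenset({0, 2}))
```

Calling `select_model(range(4), [1, 2, 3, 4], toy_criterion)` runs four backward passes instead of scoring all 15 non-empty subsets for each value of `k`. A greedy pass like this can miss the optimum, which is presumably the motivation for the modified variant mentioned in the abstract.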
References in zbMATH (referenced in 2 articles)
- Bontemps, Dominique; Toussile, Wilson: Clustering and variable selection for categorical multivariate data (2013)
- Toussile, Wilson; Gassiat, Elisabeth: Variable selection in model-based clustering using multilocus genotype data (2009)