Missing value estimation for DNA microarrays.

Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data.

Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.

Availability: The software is available at
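
To make two of the methods named in the abstract concrete, here is a minimal Python sketch of row-average imputation and a weighted K-nearest-neighbors imputation. It is not the authors' KNNimpute code. The function names, the use of `None` for missing entries, the root-mean-square distance over shared columns, and the inverse-distance weights are all assumptions chosen for illustration. SVDimpute is not sketched.

```python
import math


def row_average_impute(matrix):
    """Fill each missing entry (None) with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption: a row with no observed values falls back to 0.0.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def _distance(a, b):
    """Root-mean-square difference over columns observed in both rows.

    Returns None when the two rows share no observed column.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(matrix, k=2, eps=1e-9):
    """Weighted K-nearest-neighbors imputation over rows (genes).

    For each missing entry, take the k closest rows that have a value in
    that column and average their values, weighting each neighbor by
    1 / (distance + eps). The weighting scheme is an assumption of this sketch.
    """
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue
                d = _distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbor: leave the entry missing
            weights = [1.0 / (d + eps) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled


if __name__ == "__main__":
    # Toy expression matrix: rows are genes, columns are experiments.
    data = [
        [1.0, 2.0, None],
        [1.0, 2.0, 3.0],
        [5.0, 6.0, 7.0],
    ]
    print(row_average_impute(data))
    print(knn_impute(data, k=1))
```

In the toy matrix, the first row matches the second on both observed columns, so the neighbor-based estimate follows the second row's value, while the row average uses only the first row's own entries.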

References in zbMATH (referenced in 100 articles)

Showing results 1 to 20 of 100.
Sorted by year

  1. Ma, Qian; Lee, Wang-Chien; Fu, Tao-Yang; Gu, Yu; Yu, Ge: MIDIA: exploring denoising autoencoders for missing data imputation (2020)
  2. Mazumder, Rahul; Saldana, Diego; Weng, Haolei: Matrix completion with nonconvex regularization: spectral operators and scalable algorithms (2020)
  3. Mozharovskyi, Pavlo; Josse, Julie; Husson, François: Nonparametric imputation by data depth (2020)
  4. Cascone, Marcos H.; Hotta, Luiz K.: Quasi-maximum likelihood estimation of GARCH models in the presence of missing values (2019)
  5. Chen, Xiaolin; Liu, Yi; Wang, Qihua: Joint feature screening for ultra-high-dimensional sparse additive hazards model by the sparsity-restricted pseudo-score estimator (2019)
  6. Ciaramella, Angelo; Staiano, Antonino: On the role of clustering and visualization techniques in gene microarray data (2019)
  7. Mao, Xiaojun; Chen, Song Xi; Wong, Raymond K. W.: Matrix completion with covariate information (2019)
  8. Nengsih, Titin Agustin; Bertrand, Frédéric; Maumy-Bertrand, Myriam; Meyer, Nicolas: Determining the number of components in PLS regression on incomplete data set (2019)
  9. Ben Brahim, Afef; Limam, Mohamed: Ensemble feature selection for high dimensional data: a new method and a comparative study (2018)
  10. Bertsimas, Dimitris; Pawlowski, Colin; Zhuo, Ying Daisy: From predictive methods to missing data imputation: an optimization approach (2018)
  11. Casleton, Emily; Osthus, Dave; van Buren, Kendra: Imputation for multisource data with comparison and assessment techniques (2018)
  12. Chekouo, Thierry; Murua, Alejandro: High-dimensional variable selection with the plaid mixture model for clustering (2018)
  13. Chen, Xiaolin; Chen, Xiaojing; Wang, Hong: Robust feature screening for ultra-high dimensional right censored data via distance correlation (2018)
  14. Datta, Shounak; Bhattacharjee, Supritam; Das, Swagatam: Clustering with missing features: a penalized dissimilarity measure based approach (2018)
  15. Imbert, Alyssa; Vialaneix, Nathalie: Exploring, handling, imputing and evaluating missing data in statistical analyses: a review of existing approaches (2018)
  16. Lee, Jaylen; Ciccarello, Shannon; Acharjee, Mithun; Das, Kumer: Dimension reduction of gene expression data (2018)
  17. Loh, Po-Ling; Tan, Xin Lu: High-dimensional robust precision matrix estimation: cellwise corruption under ε-contamination (2018)
  18. O’Brien, Jonathon J.; Gunawardena, Harsha P.; Paulo, Joao A.; Chen, Xian; Ibrahim, Joseph G.; Gygi, Steven P.; Qaqish, Bahjat F.: The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments (2018)
  19. Singha, Sumanta; Shenoy, Prakash P.: An adaptive heuristic for feature selection based on complementarity (2018)
  20. Faisal, Shahla; Tutz, Gerhard: Missing value imputation for gene expression data by tailored nearest neighbors (2017)
