COOLCAT: an entropy-based algorithm for categorical clustering.
In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, as well as data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable for different sample sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is deeply rooted in the well-known notion of entropy. Most importantly, COOLCAT is well equipped to cluster data streams (continuously arriving streams of data points), since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT by a series of experiments on real and synthetic data sets.
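
The entropy criterion and the incremental assignment step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that attributes are treated as independent, so a cluster's entropy is the sum of its per-attribute entropies, and that a new point goes to the cluster that minimizes the size-weighted total entropy. The names `Cluster` and `assign` are hypothetical, and the clusters are seeded by hand rather than by any initialization procedure from the paper.

```python
from collections import Counter
from math import log2


class Cluster:
    """Keeps per-attribute value counts, so old points are never rescanned."""

    def __init__(self, n_attrs):
        self.size = 0
        self.counts = [Counter() for _ in range(n_attrs)]

    def add(self, point):
        self.size += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1

    def entropy(self, extra=None):
        """Sum of per-attribute entropies in bits (assumes independent attributes).

        If `extra` is given, the entropy is computed as if that point were added.
        """
        n = self.size + (1 if extra is not None else 0)
        if n == 0:
            return 0.0
        total = 0.0
        for j, counter in enumerate(self.counts):
            freq = dict(counter)
            if extra is not None:
                freq[extra[j]] = freq.get(extra[j], 0) + 1
            total -= sum((c / n) * log2(c / n) for c in freq.values())
        return total


def assign(clusters, point):
    """Add `point` to the cluster that minimizes the size-weighted total entropy."""

    def cost(i):
        return sum(
            (c.size + (k == i)) * c.entropy(point if k == i else None)
            for k, c in enumerate(clusters)
        )

    best = min(range(len(clusters)), key=cost)
    clusters[best].add(point)
    return best


# Hand-picked seeds (an assumption made only for this example).
clusters = [Cluster(2), Cluster(2)]
clusters[0].add(("red", "small"))
clusters[1].add(("blue", "large"))

stream = [("red", "small"), ("blue", "large"), ("red", "medium")]
labels = [assign(clusters, p) for p in stream]
print(labels)  # [0, 1, 0]
```

Each assignment only touches the stored value counts, which is what lets a scheme of this kind handle arriving points without revisiting earlier ones.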

References in zbMATH (referenced in 29 articles)

Showing results 1 to 20 of 29.
Sorted by year


  1. Gan, Guojun; Ma, Chaoqun; Wu, Jianhong: Data clustering. Theory, algorithms, and applications (2021)
  2. Yu, Liqin; Cao, Fuyuan; Zhao, Xingwang; Yang, Xiaodan; Liang, Jiye: Combining attribute content and label information for categorical data ensemble clustering (2020)
  3. Śmieja, Marek; Hajto, Krzysztof; Tabor, Jacek: Efficient mixture model for clustering of sparse high dimensional binary data (2019)
  4. Gao, Xuedong; Yang, Minghan: Understanding and enhancement of internal clustering validation indexes for categorical data (2018)
  5. Sangam, Ravi Sankar; Om, Hari: An equi-biased k-prototypes algorithm for clustering mixed-type data (2018)
  6. Zhao, Xingwang; Cao, Fuyuan; Liang, Jiye: A sequential ensemble clusterings generation algorithm for mixed data (2018)
  7. Kim, Kyoungok: A weighted k-modes clustering using new weighting method based on within-cluster and between-cluster impurity measures (2017)
  8. Jiang, Feng; Liu, Guozhu; Du, Junwei; Sui, Yuefei: Initialization of K-modes clustering using outlier detection techniques (2016)
  9. Larose, Chantal; Harel, Ofer; Kordas, Katarzyna; Dey, Dipak K.: Latent class analysis of incomplete data via an entropy-based criterion (2016)
  10. Bai, Liang; Liang, Jiye: Cluster validity functions for categorical data: a solution-space perspective (2015)
  11. Li, Yefeng; Le, Jiajin; Wang, Mei: Improving CLOPE’s profit value and stability with an optimized agglomerative approach (2015)
  12. Cheung, Yiu-ming; Jia, Hong: Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number (2013)
  13. David, Gil; Averbuch, Amir: SpectralCAT: categorical spectral clustering of numerical and nominal data (2012)
  14. Liang, Jiye; Zhao, Xingwang; Li, Deyu; Cao, Fuyuan; Dang, Chuangyin: Determining the number of clusters using information entropy for mixed data (2012)
  15. Liu, Qingbao; Dong, Guozhu: CPCQ: contrast pattern based clustering quality index for categorical data (2012)
  16. Xiong, Tengke; Wang, Shengrui; Mayers, André; Monga, Ernest: DHCC: divisive hierarchical clustering of categorical data (2012)
  17. Bai, Liang; Liang, Jiye; Dang, Chuangyin; Cao, Fuyuan: A novel attribute weighting algorithm for clustering high-dimensional categorical data (2011)
  18. Brouwer, Roelof K.; Groenwold, Albert: Modified fuzzy c-means for ordinal valued attributes with particle swarm for optimization (2010)
  19. Yan, Hua; Chen, Keke; Liu, Ling; Yi, Zhang: SCALE: a scalable framework for efficiently clustering transactional data (2010)
  20. Chen, Keke; Liu, Ling: HE-Tree: a framework for detecting changes in clustering structure for categorical data streams (2009)
