ROCK: a robust clustering algorithm for categorical attributes

Clustering, in data mining, is useful for discovering distribution patterns in the underlying data. Clustering algorithms usually employ a similarity measure based on a distance metric (e.g., Euclidean) in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we study clustering algorithms for data with Boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for Boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm, ROCK, that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but also exhibits good scalability properties.
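
The abstract does not spell out how links are computed. As a minimal sketch, assuming the usual reading in which two points are neighbors when their similarity reaches a threshold and the link count of a pair is the number of neighbors they share, the idea could look as follows. The names `jaccard`, `neighbor_sets`, `link_counts`, and `theta`, the choice of Jaccard similarity, the convention that a point counts as its own neighbor, and the sample baskets are all illustrative assumptions, not taken from this entry.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of categorical items."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def neighbor_sets(points, theta):
    """For each point, collect the indices of points with similarity >= theta.

    Assumption: every point is treated as a neighbor of itself.
    """
    neighbors = [{i} for i in range(len(points))]
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) >= theta:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def link_counts(points, theta):
    """Map each pair (i, j), i < j, to the number of neighbors they share."""
    neighbors = neighbor_sets(points, theta)
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i, j in combinations(range(len(points)), 2)
    }


# Hypothetical market-basket data: two visibly different groups of baskets.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread", "butter"},
    {"milk", "eggs", "butter"},
    {"wine", "cheese"},
    {"wine", "cheese", "olives"},
]
links = link_counts(baskets, theta=0.5)
```

In this toy data, pairs within the same group share neighbors while pairs across groups share none, which is the kind of signal a link-based merge criterion can use in place of a raw distance.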

References in zbMATH (referenced in 69 articles)

Showing results 1 to 20 of 69.

  1. Bury, Marc; Gentili, Michele; Schwiegelshohn, Chris; Sorella, Mara: Polynomial time approximation schemes for all 1-center problems on metric rational set similarities (2021)
  2. Yu, Liqin; Cao, Fuyuan; Zhao, Xingwang; Yang, Xiaodan; Liang, Jiye: Combining attribute content and label information for categorical data ensemble clustering (2020)
  3. D’Urso, Pierpaolo; Massari, Riccardo: Fuzzy clustering of mixed data (2019)
  4. Uglickich, Evženie; Nagy, Ivan; Vlčková, Dominika: Comparing clusterings using combination of the kappa statistic and entropy-based measure (2019)
  5. Boongoen, Tossapon; Iam-On, Natthakan: Cluster ensembles: a survey of approaches with recent extensions and applications (2018)
  6. Sangam, Ravi Sankar; Om, Hari: An equi-biased k-prototypes algorithm for clustering mixed-type data (2018)
  7. Huang, Jinlong; Zhu, Qingsheng; Yang, Lijun; Cheng, Dongdong; Wu, Quanwang: QCC: a novel clustering algorithm based on quasi-cluster centers (2017)
  8. Huerta-Muñoz, Diana L.; Ríos-Mercado, Roger Z.; Ruiz, Rubén: An iterated greedy heuristic for a market segmentation problem with multiple attributes (2017)
  9. Kim, Kyoungok: A weighted k-modes clustering using new weighting method based on within-cluster and between-cluster impurity measures (2017)
  10. Vigneron, V.; Chen, H.: A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting (2016)
  11. Dai, Hanbo; Zhu, Feida; Lim, Ee-Peng; Pang, HweeHwa: Detecting anomaly collections using extreme feature ranks (2015)
  12. Khalid, Shehzad; Razzaq, Shahid: TOBAE: a density-based agglomerative clustering algorithm (2015)
  13. Noorbehbahani, Fakhroddin; Mousavi, Sayyed; Mirzaei, Abdolreza: An incremental mixed data clustering method using a new distance measure (2015)
  14. Kang, Pilsung; Kim, Dongil; Cho, Sungzoon: Evaluating the reliability level of virtual metrology results for flexible process control: a novelty detection-based approach (2014)
  15. Lin, Kawuu W.; Lin, Chun-Hung; Hsiao, Chun-Yuan: A parallel and scalable CAST-based clustering algorithm on GPU (2014)
  16. Saha, Indrajit; Maulik, Ujjwal: Incremental learning based multiobjective fuzzy clustering for categorical data (2014)
  17. D’Enza, Alfonso Iodice; Palumbo, Francesco: Iterative factor clustering of binary data (2013)
  18. Karol, Stuti; Mangat, Veenu: Evaluation of text document clustering approach based on particle swarm optimization (2013)
  19. Sim, Kelvin; Gopalkrishnan, Vivekanand; Zimek, Arthur; Cong, Gao: A survey on enhanced subspace clustering (2013)
  20. Tan, Jingdong; Wang, Rujing: Smooth splicing: a robust SNN-based method for clustering high-dimensional data (2013)
