Xproj: a framework for projected structural clustering of XML documents.

XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes XML data challenging for a variety of data mining problems. One such problem is clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and to guide the algorithm to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
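
The general idea of letting document substructures steer a partition-based clustering can be illustrated with a small sketch. This is not the Xproj algorithm itself: it assumes, purely for illustration, that a "substructure" is a single parent-child tag pair, that a cluster is represented by the pairs occurring in at least a `min_support` fraction of its documents, and that a document's similarity to a cluster is the fraction of those representative pairs it contains. The names `tag_edges`, `frequent_edges`, `similarity`, and `cluster` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_edges(xml_text):
    """Return the set of (parent_tag, child_tag) pairs found in one document."""
    root = ET.fromstring(xml_text)
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


def frequent_edges(edge_sets, min_support):
    """Pairs occurring in at least a min_support fraction of the documents."""
    counts = Counter(edge for edges in edge_sets for edge in edges)
    threshold = min_support * len(edge_sets)
    return {edge for edge, n in counts.items() if n >= threshold}


def similarity(edges, representatives):
    """Fraction of a cluster's representative pairs that the document contains."""
    if not representatives:
        return 0.0
    return len(edges & representatives) / len(representatives)


def cluster(docs, seeds, min_support=0.5, iterations=5):
    """Alternate between assigning documents and recomputing representatives.

    `seeds` holds the indices of the documents used as initial representatives
    (an assumption of this sketch, not a detail taken from the paper).
    """
    edge_sets = [tag_edges(d) for d in docs]
    reps = [edge_sets[i] for i in seeds]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to the cluster whose representatives it matches best.
        labels = [
            max(range(len(reps)), key=lambda c: similarity(e, reps[c]))
            for e in edge_sets
        ]
        # Recompute each cluster's representative pairs from its current members.
        for c in range(len(reps)):
            members = [e for e, lab in zip(edge_sets, labels) if lab == c]
            if members:
                reps[c] = frequent_edges(members, min_support)
    return labels


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<order><item/><price/></order>",
    "<order><item/><item/><price/></order>",
]
labels = cluster(docs, seeds=[0, 2])
```

Here the two `book` documents and the two `order` documents end up in separate clusters because they share no tag pairs. Richer substructures, such as frequent subtrees or tag sequences, would slot into the same assign-and-recompute loop by replacing `tag_edges`.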

This software is also peer reviewed by journal TOMS.

References in zbMATH (referenced in 6 articles)

  1. Tagarelli, Andrea: Exploring dictionary-based semantic relatedness in labeled tree data (2013)
  2. Brzeziński, Dariusz; Leśniewska, Anna; Morzy, Tadeusz; Piernik, Maciej: XCleaner: a new method for clustering XML documents by structure (2011)
  3. Greco, Sergio; Gullo, Francesco; Ponti, Giovanni; Tagarelli, Andrea: Collaborative clustering of XML documents (2011)
  4. Aggarwal, Charu C.; Zhao, Yuchen; Yu, Philip S.: A framework for clustering massive graph streams (2010)
  5. Li, Guoliang; Feng, Jianhua; Wang, Jianyong; Zhou, Lizhu: Incremental sequence-based frequent query pattern mining from XML queries (2009)
  6. Wang, Jianyong; Zhang, Yuzhou; Zhou, Lizhu; Karypis, George; Aggarwal, Charu C.: CONTOUR: an efficient algorithm for discovering discriminating subsequences (2009)