Categorical Data Clustering via Value Order Estimated Distance Metric Learning

  • Yiqun Zhang
  • Mingjie Zhao
  • Hong Jia*
  • Mengke Li
  • Yang Lu
  • Yiu-ming Cheung*
  • *Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

Clustering is a popular machine learning technique for data mining that can process and analyze datasets to automatically reveal sample distribution patterns. Since the ubiquitous categorical data naturally lack a well-defined metric space such as the Euclidean distance space of numerical data, the distribution of categorical data is usually under-represented, and thus valuable information can be easily distorted in clustering. This paper, therefore, introduces a novel order distance metric learning approach that intuitively represents categorical attribute values by learning their optimal order relationship and quantifying their distances along a line, similar to numerical attributes. Since subjectively created qualitative categorical values involve ambiguity and fuzziness, the order distance metric is learned in the context of clustering. Accordingly, a new joint learning paradigm is developed to alternately perform clustering and order distance metric learning with low time complexity and a guarantee of convergence. Due to the clustering-friendly order learning mechanism and the homogeneous ordinal nature of the order distance and Euclidean distance, the proposed method achieves superior clustering accuracy on categorical and mixed datasets. More importantly, the learned order distance metric greatly reduces the difficulty of understanding and managing the non-intuitive categorical data. Experiments including ablation studies, significance tests, and case studies have validated the efficacy of the proposed method. The source code is available at https://github.com/csmjzhao/OCL_Source_Code.
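
The idea of placing categorical values on a line and measuring distances along it can be illustrated with a minimal sketch. It assumes the value order is already given and that adjacent values are equally spaced, which is a simplification: the abstract describes learning the order and the distances jointly with clustering. The function names below are hypothetical and are not taken from the authors' source code.

```python
from typing import Dict, Sequence


def order_positions(order: Sequence[str]) -> Dict[str, float]:
    """Map each categorical value to a position in [0, 1] along a line.

    Assumption: values are equally spaced in the given order.
    """
    if len(order) < 2:
        return {value: 0.0 for value in order}
    step = 1.0 / (len(order) - 1)
    return {value: index * step for index, value in enumerate(order)}


def order_distance(a: str, b: str, positions: Dict[str, float]) -> float:
    """Distance between two values of one attribute: their gap on the line."""
    return abs(positions[a] - positions[b])


def record_distance(
    x: Sequence[str],
    y: Sequence[str],
    positions_per_attribute: Sequence[Dict[str, float]],
) -> float:
    """Sum the per-attribute order distances between two records."""
    return sum(
        order_distance(a, b, positions)
        for a, b, positions in zip(x, y, positions_per_attribute)
    )


# Hypothetical attribute with an assumed order low < medium < high.
positions = order_positions(["low", "medium", "high"])
near = order_distance("low", "medium", positions)
far = order_distance("low", "high", positions)
```

Under this assumed order, "low" is closer to "medium" than to "high", whereas a plain match/mismatch (Hamming) comparison would treat both pairs as equally different.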
Original language: English
Article number: 307
Pages (from-to): 1-24
Number of pages: 24
Journal: Proceedings of the ACM on Management of Data
Volume: 3
Issue number: 6
Publication status: Published - 5 Dec 2025

User-Defined Keywords

  • categorical data
  • cluster analysis
  • distance learning
  • partitional clustering
  • subspace distance structure
