TY - JOUR
T1 - Categorical Data Clustering via Value Order Estimated Distance Metric Learning
AU - Zhang, Yiqun
AU - Zhao, Mingjie
AU - Jia, Hong
AU - Li, Mengke
AU - Lu, Yang
AU - Cheung, Yiu-ming
N1 - This work was supported in part by the National Natural Science Foundation of China (NSFC) under grants: 62476063, 62376233, 61806131 and 62306181, the NSFC/Research Grants Council (RGC) Joint Research Scheme under the grant N_HKBU214/21, the Natural Science Foundation of Guangdong Province under grant: 2025A1515011293, the Natural Science Foundation of Fujian Province under grant: 2024J09001, the National Key Laboratory of Radar Signal Processing under grant: JKW202403, the General Research Fund of RGC under grants: 12201321, 12202622, and 12201323, the RGC Senior Research Fellow Scheme under grant: SRFS2324-2S02, the Shenzhen Science and Technology Program under grant: RCBS20231211090659101, the Guangdong Provincial Key Laboratory under grant: 2023B1212060076, and the Xiaomi Young Talents Program.
Publisher copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/12/5
Y1 - 2025/12/5
AB - Clustering is a popular machine learning technique for data mining that processes and analyzes datasets to automatically reveal sample distribution patterns. Because ubiquitous categorical data naturally lack a well-defined metric space such as the Euclidean distance space of numerical data, their distribution is usually under-represented, and valuable information can easily be distorted during clustering. This paper therefore introduces a novel order distance metric learning approach that intuitively represents categorical attribute values by learning their optimal order relationship and quantifying their distances along a line, as is done for numerical attributes. Since subjectively created qualitative categorical values involve ambiguity and fuzziness, the order distance metric is learned in the context of clustering. Accordingly, a new joint learning paradigm is developed to alternately perform clustering and order distance metric learning with low time complexity and a guarantee of convergence. Owing to the clustering-friendly order learning mechanism and the homogeneous ordinal nature of the order distance and the Euclidean distance, the proposed method achieves superior clustering accuracy on categorical and mixed datasets. More importantly, the learned order distance metric greatly reduces the difficulty of understanding and managing non-intuitive categorical data. Experiments, including ablation studies, significance tests, and case studies, validate the efficacy of the proposed method. The source code is available at https://github.com/csmjzhao/OCL_Source_Code.
KW - categorical data
KW - cluster analysis
KW - distance learning
KW - partitional clustering
KW - subspace distance structure
U2 - 10.1145/3769772
DO - 10.1145/3769772
M3 - Journal article
SN - 2836-6573
VL - 3
SP - 1
EP - 24
JO - Proceedings of the ACM on Management of Data
JF - Proceedings of the ACM on Management of Data
IS - 6
M1 - 307
ER -