TY - JOUR
T1 - Learning from Noisy Pairwise Similarity and Unlabeled Data
AU - Wu, Songhua
AU - Liu, Tongliang
AU - Han, Bo
AU - Yu, Jun
AU - Niu, Gang
AU - Sugiyama, Masashi
N1 - Publisher Copyright:
© 2022 Songhua Wu, Tongliang Liu, Bo Han, Jun Yu, Gang Niu, and Masashi Sugiyama.
Funding Information:
SHW and TLL were supported by Australian Research Council Project No. DE-190101473 and RIKEN Collaborative Research Fund. TLL was also supported by Australian Research Council Projects No. IC-190100031, DP-220102121, and FT-220100318. BH was supported by the RGC Early Career Scheme No. 22200720, NSFC Young Scientists Fund No. 62006202, and Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652. JY was supported by Natural Science Foundation of China No. 62276242, CAAI-Huawei MindSpore Open Fund No. CAAIXSJLJJ-2021-016B and CAAIXSJLJJ-2022-001A, Anhui Province Key Research and Development Program No. 202104a05020007. GN and MS were supported by JST AIP Acceleration Research Grant No. JPMJCR20U3, Japan. MS was also supported by the Institute for AI and Beyond, UTokyo.
PY - 2022/11
Y1 - 2022/11
N2 - SU classification employs similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points to build a classifier, serving as an alternative to standard supervised classifiers that require data points with class labels. SU classification is advantageous because, in the era of big data, increasing attention has been paid to data privacy. Datasets with explicit class labels are often difficult to obtain in real-world classification applications involving privacy-sensitive matters, such as politics and religion, which can be a bottleneck for supervised classification. Fortunately, similarity labels do not reveal explicit label information and inherently protect privacy, e.g., by collecting answers to “With whom do you share the same opinion on issue I?” instead of “What is your opinion on issue I?”. Nevertheless, SU classification still has an obvious limitation: respondents might answer such questions in a manner viewed favorably by others rather than truthfully. Consequently, some dissimilar data pairs are labeled as similar, which significantly degrades the performance of SU classification. In this paper, we study how to learn from noisy similar (nS) data pairs and unlabeled (U) data, which we call nSU classification. Specifically, we carefully model the similarity noise and estimate the noise rate using the mixture proportion estimation technique. A clean classifier can then be learned by minimizing a denoised and unbiased classification risk estimator that involves only the noisy data. Moreover, we derive a theoretical generalization error bound for the proposed method. Experimental results demonstrate the effectiveness of the proposed algorithm on several benchmark datasets.
AB - SU classification employs similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points to build a classifier, serving as an alternative to standard supervised classifiers that require data points with class labels. SU classification is advantageous because, in the era of big data, increasing attention has been paid to data privacy. Datasets with explicit class labels are often difficult to obtain in real-world classification applications involving privacy-sensitive matters, such as politics and religion, which can be a bottleneck for supervised classification. Fortunately, similarity labels do not reveal explicit label information and inherently protect privacy, e.g., by collecting answers to “With whom do you share the same opinion on issue I?” instead of “What is your opinion on issue I?”. Nevertheless, SU classification still has an obvious limitation: respondents might answer such questions in a manner viewed favorably by others rather than truthfully. Consequently, some dissimilar data pairs are labeled as similar, which significantly degrades the performance of SU classification. In this paper, we study how to learn from noisy similar (nS) data pairs and unlabeled (U) data, which we call nSU classification. Specifically, we carefully model the similarity noise and estimate the noise rate using the mixture proportion estimation technique. A clean classifier can then be learned by minimizing a denoised and unbiased classification risk estimator that involves only the noisy data. Moreover, we derive a theoretical generalization error bound for the proposed method. Experimental results demonstrate the effectiveness of the proposed algorithm on several benchmark datasets.
KW - privacy concern
KW - similarity learning
KW - unbiased classifier
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85148056384&origin=inward
M3 - Journal article
SN - 1532-4435
VL - 23
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
IS - 307
M1 - 307
ER -