TY - JOUR
T1 - Equalization ensemble for large scale highly imbalanced data classification
AU - Ren, Jinjun
AU - Wang, Yuping
AU - Mao, Mingqian
AU - Cheung, Yiu-ming
N1 - Funding Information:
This work was supported by the National Natural Science Foundation of China (61872281, 62102304).
Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/4
Y1 - 2022/4
N2 - The class-imbalance problem is widespread across various research fields. The larger the data scale and the higher the imbalance, the more difficult proper classification becomes. For large-scale, highly imbalanced data sets, ensemble methods based on under-sampling are among the most competitive existing techniques. However, they are susceptible to improper sampling strategies, prone to losing useful information from the majority class, and their learned models may generalize poorly. To overcome these limitations, we propose an equalization ensemble method (EASE) with two new schemes. First, we propose an equalization under-sampling scheme that generates a balanced data set for each base classifier, reducing the impact of class imbalance on the base classifiers. Second, we design a weighted integration scheme in which the G-mean scores obtained by the base classifiers on the original imbalanced data set are used as weights. These weights not only let the better-performing base classifiers dominate the final classification decision, but also adapt to imbalanced data sets of different scales while avoiding some extremely bad situations. Experimental results on three metrics show that EASE increases the diversity of the base classifiers and outperforms twelve state-of-the-art methods on imbalanced data sets of different scales.
AB - The class-imbalance problem is widespread across various research fields. The larger the data scale and the higher the imbalance, the more difficult proper classification becomes. For large-scale, highly imbalanced data sets, ensemble methods based on under-sampling are among the most competitive existing techniques. However, they are susceptible to improper sampling strategies, prone to losing useful information from the majority class, and their learned models may generalize poorly. To overcome these limitations, we propose an equalization ensemble method (EASE) with two new schemes. First, we propose an equalization under-sampling scheme that generates a balanced data set for each base classifier, reducing the impact of class imbalance on the base classifiers. Second, we design a weighted integration scheme in which the G-mean scores obtained by the base classifiers on the original imbalanced data set are used as weights. These weights not only let the better-performing base classifiers dominate the final classification decision, but also adapt to imbalanced data sets of different scales while avoiding some extremely bad situations. Experimental results on three metrics show that EASE increases the diversity of the base classifiers and outperforms twelve state-of-the-art methods on imbalanced data sets of different scales.
KW - Ensemble learning
KW - Imbalanced data classification
KW - Large-scale data
KW - Under-sampling
UR - http://www.scopus.com/inward/record.url?scp=85124795674&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2022.108295
DO - 10.1016/j.knosys.2022.108295
M3 - Journal article
AN - SCOPUS:85124795674
SN - 0950-7051
VL - 242
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 108295
ER -