Semi-supervised Classification of Long-tailed Data with Hierarchically Related Labels in a Concept Drift Environment

Project: Research project

Project Details


Background and Motivations: Long-tailed data generally exhibit a power-law distribution, in which a small number of categories (head classes) have many samples (also called instances), and the remaining categories (tail classes) have only a few samples, as illustrated in Figure 1. Such data are widely encountered in real-world scenarios (e.g., the iNaturalist dataset used in practical recognition systems). In the literature, deep neural networks (DNNs) typically perform well in tasks with numerous categories; however, their performance on tail classes is inadequate under vanilla training on long-tailed data. Although efforts have been made over the past few years to address this problem, most existing studies assume that the class labels of all instances are available and that a DNN-based classification model (i.e., a classifier) is trained in a static environment. These assumptions may not hold in practical scenarios such as autopilot frameworks, in which the training data exhibit long-tailed characteristics. Furthermore, unexpected and special events in autopilot applications typically belong to tail classes because very few training samples are available for them, and their misclassification may result in fatal accidents. Therefore, classification accuracy on tail classes is critical and cannot be ignored. Moreover, autopilot systems operate in dynamic environments, which may lead to concept drift that abruptly changes the data distribution and renders the labeling of data difficult or even impossible. Additionally, existing methods perform long-tailed data classification without considering the relationships among class labels, even though the class labels may be hierarchically related, as shown in Figure 2. In fact, this hierarchical information can usually help enhance classifier performance.
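
As a rough illustration of the power-law class-size profile described in the background above, the following Python sketch generates per-class sample counts that decay with class rank and reports the resulting imbalance ratio. The function names, the exponent, and the class counts are hypothetical choices made for illustration only; they are not taken from the project or from any specific dataset.

```python
def long_tailed_class_sizes(num_classes, max_size, exponent=1.0, min_size=1):
    """Return per-class sample counts that decay as a power law of class rank.

    Illustrative assumption: the class at rank r (1 = largest head class)
    holds roughly max_size / r**exponent samples, floored at min_size.
    """
    sizes = []
    for rank in range(1, num_classes + 1):
        size = int(max_size / rank ** exponent)
        sizes.append(max(size, min_size))
    return sizes


def imbalance_ratio(sizes):
    """Ratio between the largest (head) and smallest (tail) class sizes."""
    return max(sizes) / min(sizes)


# Hypothetical example: 10 classes, with 1000 samples in the largest class.
sizes = long_tailed_class_sizes(num_classes=10, max_size=1000)
ratio = imbalance_ratio(sizes)
print(sizes)
print(ratio)
```

In this toy setting the first few classes hold most of the samples, while the last classes hold only a small fraction, which is the head/tail split that the project targets.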

Problem Definition and Challenges: This project aims to study the semi-supervised learning problem of DNNs for long-tailed data classification with hierarchically related labels under concept drift. The problem is meaningful but challenging, and the project will focus on the following questions, which have not yet been well explored: (1) when must additional effort (e.g., data augmentation) be devoted to tail classes, given that the difficulty of classifying the samples of a class depends on the sample features as well as the class size? (2) how can we handle the deformed feature-space distribution caused by class imbalance, which deviates substantially from the actual distribution? (3) how can the changes in the data distribution caused by concept drift be addressed? and (4) how can we classify long-tailed data with hierarchically related labels in a semi-supervised learning environment?

Novelty of This Project: Semi-supervised classification of long-tailed data with hierarchically related labels has yet to be well explored in the literature. This project will study this problem with a focus on the following four objectives: (1) theoretically analyze the characteristics of long-tailed data, specifically the tail classes, and present a new metric to help decide whether additional effort should be devoted to a tail class; (2) calibrate the feature space of long-tailed data to enhance classifier performance on tail classes without sacrificing the classification accuracy of the head classes; (3) design a lightweight DNN with far fewer parameters and establish an effective model parameter adjustment scheme to ensure that the classifier adapts rapidly to concept drift; and (4) develop a semi-supervised learning model for long-tailed data classification with hierarchically related labels.

Long-Term Significance: This project aims to develop DNN-based theories and algorithms to address the problems mentioned above. The expected research outcomes will: (1) provide an in-depth understanding of the characteristics of long-tailed data, (2) establish theoretical foundations for model building and analysis in more general imbalanced data learning, and (3) promote the development of effective and efficient practical recognition systems in applications such as medical diagnosis and self-driving, in which long-tailed data are frequently encountered. In addition, the findings of this project will help advance knowledge in deep learning, pattern recognition, and image scene understanding, which will be of significance to both academia and industry.

Effective start/end date: 1/01/24 to 31/12/26

UN Sustainable Development Goals

In 2015, UN member states agreed to 17 global Sustainable Development Goals (SDGs) to end poverty, protect the planet and ensure prosperity for all. This project contributes towards the following SDG(s):

  • SDG 9 - Industry, Innovation, and Infrastructure

