Trustworthy Deep Learning from Open-set Corrupted Data

Project: Research project

Project Details


Trustworthy learning from corrupted data is a vital research topic in modern machine learning, particularly deep learning, since most real-world data, such as financial, healthcare and social-network data, are imperfect and easily corrupted. To become trustworthy, a deep learning system should behave in a more human-like way: reliable in uncertain cases, robust to adversarial conditions, and adaptive to new situations. This leads us to trustworthy deep learning (TDL), where either the supervised or the unsupervised information (i.e., labels, features and domains) is corrupted (i.e., noisy).

To handle corrupted data comprehensively, trustworthy deep learning should focus on three crucial tasks: 1) Reliability: trustworthy learning with noisy labels; 2) Robustness: trustworthy training with adversarial examples; and 3) Adaptivity: trustworthy adaptation from a corrupted source domain (i.e., label corruption in the source domain) to an unlabeled target domain. Existing works in trustworthy deep learning tend to implicitly assume that corrupted data are closed-set: samples with corrupted labels have true classes that are known in the training data; adversarial examples in the testing phase are crafted from samples of the known classes; and samples in the source domain share the same classes as samples in the target domain. However, such a closed-set assumption may be suitable for initial research but is too restrictive for many real-world applications (e.g., healthcare data).

This motivates us to take a closer look at more realistic open-set corrupted data. Take the Chest X-Ray database as an illustrative scenario; it mainly consists of 8 chest diseases (i.e., atelectasis, infiltration, cardiomegaly, nodule, mass, effusion, pneumothorax and pneumonia). When an effusion image is wrongly labeled as pneumonia, we regard this image as having closed-set label noise. However, when an image of a rare disease (e.g., pulmonary edema) is wrongly labeled as pneumonia, we regard this image as having open-set label noise, since pulmonary edema is not among the major diseases. Similarly, a pulmonary edema image can be crafted into an open-set adversarial example in the testing phase, which can attack a learning system trained on the Chest X-Ray database.
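
The distinction between closed-set and open-set label noise drawn above can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that the true class of each image is available; in practice the true class is unknown, which is what makes the problem hard. The names `KNOWN_CLASSES`, `NoiseType` and `label_noise_type` are hypothetical and are not part of the project.

```python
from enum import Enum

# The 8 major chest diseases listed in the text (the "known" classes).
KNOWN_CLASSES = frozenset({
    "atelectasis", "infiltration", "cardiomegaly", "nodule",
    "mass", "effusion", "pneumothorax", "pneumonia",
})


class NoiseType(Enum):
    CLEAN = "clean"
    CLOSED_SET = "closed-set label noise"
    OPEN_SET = "open-set label noise"


def label_noise_type(true_class, given_label, known_classes=KNOWN_CLASSES):
    """Categorise one annotated sample (hypothetical helper).

    Assumes the true class is known, which is only the case in an
    illustrative or synthetic setting.
    """
    if given_label not in known_classes:
        raise ValueError(f"given label {given_label!r} is not a known class")
    if true_class == given_label:
        return NoiseType.CLEAN
    # Wrong label, but the true class is still one of the known classes.
    if true_class in known_classes:
        return NoiseType.CLOSED_SET
    # Wrong label, and the true class lies outside the known classes.
    return NoiseType.OPEN_SET


if __name__ == "__main__":
    print(label_noise_type("effusion", "pneumonia").value)
    print(label_noise_type("pulmonary edema", "pneumonia").value)
```

With the two examples from the text, the effusion image labeled as pneumonia falls into the closed-set category and the pulmonary edema image labeled as pneumonia falls into the open-set category.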

Thus, the goal of this project is to develop models, algorithms and a prototype system for trustworthy deep learning from open-set corrupted data, and to deploy the system in the finance and healthcare fields. Note that open-set data contain unknown classes in either the training set or the target domain. This project poses four cutting-edge research challenges: (1) It is unknown how to learn robustly with open-set instance-dependent noisy labels. Answering this question will advance TDL in handling label corruption. (2) Adversarial robustness is valuable for TDL. A practical challenge is how to train deep networks robustly with open-set adversarial examples. (3) Adapting from a corrupted source domain is critical for TDL, but existing techniques assume that the source and target domains share the same classes, which makes adaptation to an open-set unlabeled target domain difficult. (4) To provide TDL toolkits for practitioners without a machine learning background, it remains unclear how to automate the proposed TDL algorithms with automated machine learning (AutoML) techniques.

To address the above challenges, we will carry out Task 1 to Task 4 (see 2(b)(i)). The success of this project will greatly advance the analytics of corrupted data and provide a series of new tools for trustworthy deep learning. Our proposed models and algorithms will be automated and integrated into the AutoTDL system. This system will be rigorously tested on real-world corrupted data (e.g., financial data, medical imaging data, and healthcare data from the Hong Kong electronic-Health Record Sharing System). The PI will collaborate with financial companies and public hospitals to deploy the AutoTDL system reliably. As large amounts of corrupted data are ubiquitous in Hong Kong financial companies and public hospitals, the expected project outcomes will assist and improve their knowledge discovery and decision-making processes. For the educational component, the PI will incorporate these research developments into coursework as case studies and small projects.
Effective start/end date: 1/09/20 to 28/02/23

