TY - JOUR
T1 - Enhancing Molecular Property Prediction through Task-Oriented Transfer Learning
T2 - Integrating Universal Structural Insights and Domain-Specific Knowledge
AU - Duan, Yanjing
AU - Yang, Xixi
AU - Zeng, Xiangxiang
AU - Wang, Wenxuan
AU - Deng, Youchao
AU - Cao, Dongsheng
N1 - This work was supported by the National Key Research and Development Program of China (2021YFF1201400), the National Natural Science Foundation of China (22173118, 22220102001), the Hunan Provincial Science Fund for Distinguished Young Scholars (2021JJ10068), the Science and Technology Innovation Program of Hunan Province (2021RC4011), the Natural Science Foundation of Hunan Province (2022JJ80104), and the 2020 Guangdong Provincial Science and Technology Innovation Strategy Special Fund (2020B1212030006, Guangdong-Hong Kong-Macau Joint Lab). We acknowledge Haikun Xu and the High-Performance Computing Center of Central South University for support. The study was approved by the university's review board.
Publisher Copyright:
© 2024 American Chemical Society
PY - 2024/6/13
Y1 - 2024/6/13
N2 - Precisely predicting molecular properties is crucial in drug discovery, but the scarcity of labeled data poses a challenge for applying deep learning methods. While large-scale self-supervised pretraining has proven an effective solution, it often neglects domain-specific knowledge. To tackle this issue, we introduce Task-Oriented Multilevel Learning based on BERT (TOML-BERT), a dual-level pretraining framework that considers both structural patterns and domain knowledge of molecules. TOML-BERT achieved state-of-the-art prediction performance on 10 pharmaceutical datasets. It has the capability to mine contextual information within molecular structures and extract domain knowledge from massive pseudo-labeled data. The dual-level pretraining accomplished significant positive transfer, with its two components making complementary contributions. Interpretive analysis elucidated that the effectiveness of the dual-level pretraining lies in the prior learning of a task-related molecular representation. Overall, TOML-BERT demonstrates the potential of combining multiple pretraining tasks to extract task-oriented knowledge, advancing molecular property prediction in drug discovery.
UR - http://www.scopus.com/inward/record.url?scp=85193520334&partnerID=8YFLogxK
U2 - 10.1021/acs.jmedchem.4c00692
DO - 10.1021/acs.jmedchem.4c00692
M3 - Journal article
C2 - 38748846
AN - SCOPUS:85193520334
SN - 0022-2623
VL - 67
SP - 9575
EP - 9586
JO - Journal of Medicinal Chemistry
JF - Journal of Medicinal Chemistry
IS - 11
ER -