TY - JOUR
T1 - Improved GNNs for Log D7.4 Prediction by Transferring Knowledge from Low-Fidelity Data
AU - Duan, Yan-Jing
AU - Fu, Li
AU - Zhang, Xiao-Chen
AU - Long, Teng-Zhi
AU - He, Yuan-Hang
AU - Liu, Zhao-Qian
AU - Lu, Ai-Ping
AU - Deng, Ya-Feng
AU - Hsieh, Chang-Yu
AU - Hou, Ting-Jun
AU - Cao, Dong-Sheng
N1 - Publisher Copyright:
© 2023 American Chemical Society.
PY - 2023/4/24
Y1 - 2023/4/24
N2 - The n-octanol/buffer solution distribution coefficient at pH = 7.4
(log D7.4) is an indicator of lipophilicity, and it influences a wide
variety of absorption, distribution, metabolism, excretion, and toxicity
(ADMET) properties and druggability of compounds. In log D7.4 prediction, graph neural networks (GNNs) can
uncover subtle structure–property relationships (SPRs) by automatically
extracting features from molecular graphs that facilitate the learning of SPRs,
but their performance is often limited by the small size of available
datasets. Herein, we present a transfer learning strategy called pretraining on
computational data and then fine-tuning on experimental data (PCFE) to fully
exploit the predictive potential of GNNs. PCFE works by pretraining a GNN model
on 1.71 million computed log D values (low-fidelity data) and then fine-tuning it
on 19,155 experimental log D7.4 values (high-fidelity data). Experiments with
three GNN architectures (graph convolutional network (GCN), graph attention
network (GAT), and Attentive FP) demonstrated the effectiveness of PCFE in
improving GNNs for log D7.4 predictions. Moreover, the optimal PCFE-trained
GNN model (cx-Attentive FP, test-set R^2 = 0.909) outperformed four excellent
descriptor-based models (random forest (RF), gradient boosting (GB), support
vector machine (SVM), and extreme gradient boosting (XGBoost)). The robustness
of the cx-Attentive FP model was also confirmed by evaluating the models with
different training data sizes and dataset splitting strategies. Therefore, we
developed a webserver and defined the applicability domain for this model. The
webserver (http://tools.scbdd.com/chemlogd/) provides free log D7.4 prediction services. In addition, the important
descriptors for log D7.4 were identified by the Shapley additive explanations
(SHAP) method, and the substructures most relevant to log D7.4 were identified by the attention mechanism.
Finally, matched molecular pair analysis (MMPA) was performed to summarize
the contributions of common chemical substituents to log D7.4, including a variety of hydrocarbon groups, halogen groups,
heteroatoms, and polar groups. In conclusion, we believe that the cx-Attentive
FP model can serve as a reliable tool to predict log D7.4 and hope that pretraining on low-fidelity data can
help GNNs make accurate predictions of other endpoints in drug discovery.
UR - http://www.scopus.com/inward/record.url?scp=85151832955&partnerID=8YFLogxK
U2 - 10.1021/acs.jcim.2c01564
DO - 10.1021/acs.jcim.2c01564
M3 - Journal article
C2 - 37000044
AN - SCOPUS:85151832955
SN - 1549-9596
VL - 63
SP - 2345
EP - 2359
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 8
ER -