TY - JOUR
T1 - Predicting human intestinal absorption with modified random forest approach
T2 - a comprehensive evaluation of molecular representation, unbalanced data, and applicability domain issues
AU - Wang, Ning Ning
AU - Huang, Chen
AU - Dong, Jie
AU - Yao, Zhi Jiang
AU - Zhu, Min Feng
AU - Deng, Zhen Ke
AU - Lv, Ben
AU - LYU, Aiping
AU - Chen, Alex F.
AU - Cao, Dong Sheng
N1 - Funding Information:
This work is financially supported by the National Key Basic Research Program (2015CB910700), the National Natural Science Foundation of China (grant no. 81402853), the Central South University Innovation Foundation for Postgraduate (2016zzts498), the Hunan Provincial Innovation Foundation for Postgraduate (CX2016B058), the Project of Innovation-driven Plan in Central South University, the Postdoctoral Science Foundation of Central South University, and the Chinese Postdoctoral Science Foundation (2014T70794, 2014M562142). The studies were approved by the university's review board.
PY - 2017
Y1 - 2017
AB - As drug discovery grows more complex and risky, prediction of human intestinal absorption (HIA) has become increasingly important. Several predictive models have been constructed to estimate the HIA of new drug-like compounds with acceptable accuracy, but open issues remain, including the limited and unbalanced HIA data, the performance of different types of descriptors, and the applicability domain of published models. To address these problems, we collected a relatively large dataset of 970 compounds and calculated 9 different types of descriptors for modeling. In all modeling processes, a parameter named samplesize in the random forest (RF) method was used to balance the dataset. Classification models were then established based on different training sets and different combinations of descriptors. By comparing the statistical results of these models, we explored the aforementioned problems, evaluated the reliability of existing HIA classification models, and obtained a robust and applicable model based on a combination of 2D, 3D, N+ and Nrule-of-five (for the training set, SE = 0.892, SP = 0.846; for the test set, SE = 0.877, SP = 0.813). Compared with other published models, our model shows some advantages in data size, model accuracy, and model practicability. This structure-activity relationship model is useful for HIA prediction and could be a convenient tool for virtual screening in the early stages of drug development.
UR - http://www.scopus.com/inward/record.url?scp=85017178494&partnerID=8YFLogxK
U2 - 10.1039/C6RA28442F
DO - 10.1039/C6RA28442F
M3 - Journal article
AN - SCOPUS:85017178494
SN - 2046-2069
VL - 7
SP - 19007
EP - 19018
JO - RSC Advances
JF - RSC Advances
IS - 31
ER -