TY - JOUR
T1 - Identification of useful genes from multiple microarrays for ulcerative colitis diagnosis based on machine learning methods
AU - Zhang, Lin
AU - Mao, Rui
AU - Lau, Chung Tai
AU - Chung, Wai Chak
AU - Chan, Jacky C. P.
AU - Liang, Feng
AU - Zhao, Chenchen
AU - Zhang, Xuan
AU - Bian, Zhaoxiang
N1 - Funding Information:
The authors acknowledge the financial support from Health@InnoHK Initiative Fund of the Hong Kong Special Administrative Region Government (ITC RC/IHK/4/7).
Health@InnoHK Initiative Fund of the Hong Kong Special Administrative Region Government (ITC RC/IHK/4/7), and China Center for Evidence Based Traditional Chinese Medicine, CCEBTM (2020YJSZX-5). The funders had no role in the design of the study, in the collection, analysis, and interpretation of data, nor in the writing of the manuscript.
Publisher Copyright:
© The Author(s) 2022
PY - 2022/6/15
Y1 - 2022/6/15
N2 - Ulcerative colitis (UC) is a chronic relapsing inflammatory bowel disease with an increasing incidence and prevalence worldwide. The diagnosis for UC mainly relies on clinical symptoms and laboratory examinations. As some previous studies have revealed that there is an association between gene expression signature and disease severity, we thereby aim to assess whether genes can help to diagnose UC and predict its correlation with immune regulation. A total of ten eligible microarrays (including 387 UC patients and 139 healthy subjects) were included in this study, specifically with six microarrays (GSE48634, GSE6731, GSE114527, GSE13367, GSE36807, and GSE3629) in the training group and four microarrays (GSE53306, GSE87473, GSE74265, and GSE96665) in the testing group. After the data processing, we found 87 differently expressed genes. Furthermore, a total of six machine learning methods, including support vector machine, least absolute shrinkage and selection operator, random forest, gradient boosting machine, principal component analysis, and neural network were adopted to identify potentially useful genes. The synthetic minority oversampling (SMOTE) was used to adjust the imbalanced sample size for two groups (if any). Consequently, six genes were selected for model establishment. According to the receiver operating characteristic, two genes of OLFM4 and C4BPB were finally identified. The average values of area under curve for these two genes are higher than 0.8, either in the original datasets or SMOTE-adjusted datasets. Besides, these two genes also significantly correlated to six immune cells, namely Macrophages M1, Macrophages M2, Mast cells activated, Mast cells resting, Monocytes, and NK cells activated (P < 0.05). OLFM4 and C4BPB may be conducive to identifying patients with UC. Further verification studies could be conducted.
AB - Ulcerative colitis (UC) is a chronic relapsing inflammatory bowel disease with an increasing incidence and prevalence worldwide. The diagnosis for UC mainly relies on clinical symptoms and laboratory examinations. As some previous studies have revealed that there is an association between gene expression signature and disease severity, we thereby aim to assess whether genes can help to diagnose UC and predict its correlation with immune regulation. A total of ten eligible microarrays (including 387 UC patients and 139 healthy subjects) were included in this study, specifically with six microarrays (GSE48634, GSE6731, GSE114527, GSE13367, GSE36807, and GSE3629) in the training group and four microarrays (GSE53306, GSE87473, GSE74265, and GSE96665) in the testing group. After the data processing, we found 87 differently expressed genes. Furthermore, a total of six machine learning methods, including support vector machine, least absolute shrinkage and selection operator, random forest, gradient boosting machine, principal component analysis, and neural network were adopted to identify potentially useful genes. The synthetic minority oversampling (SMOTE) was used to adjust the imbalanced sample size for two groups (if any). Consequently, six genes were selected for model establishment. According to the receiver operating characteristic, two genes of OLFM4 and C4BPB were finally identified. The average values of area under curve for these two genes are higher than 0.8, either in the original datasets or SMOTE-adjusted datasets. Besides, these two genes also significantly correlated to six immune cells, namely Macrophages M1, Macrophages M2, Mast cells activated, Mast cells resting, Monocytes, and NK cells activated (P < 0.05). OLFM4 and C4BPB may be conducive to identifying patients with UC. Further verification studies could be conducted.
UR - http://www.scopus.com/inward/record.url?scp=85132080383&partnerID=8YFLogxK
U2 - 10.1038/s41598-022-14048-6
DO - 10.1038/s41598-022-14048-6
M3 - Journal article
C2 - 35705632
AN - SCOPUS:85132080383
SN - 2045-2322
VL - 12
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 9962
ER -