Joint Embedding of Sequential Data and Knowledge Graphs with application to Predictive Analytics in Healthcare

Project: Research project

Project Details


Strategic use of data and domain-specific prior knowledge is important for big data predictive analytics. How to effectively integrate a high volume of observed data with prior knowledge from heterogeneous sources, so that analytics results are accurate and reliable, remains an important research issue. This proposal is motivated by that challenge in a clinical setting, where two information sources are explored: (1) a freely accessible intensive care unit (ICU) data set (MIMIC III Critical Care Database), which contains clinical events recorded during patients' ICU stays, and (2) publicly available medical ontologies of international standards, including ICD-9, SNOMED-CT and LOINC. We adopt a representation learning approach so that vector representations of the clinical events found in the ICU data set can first be inferred. By modeling the medical ontologies as knowledge graphs, clinical entities in the knowledge graphs can also be embedded in vector spaces. To facilitate the integration, we propose to learn a joint embedding of the clinical events in the ICU data set (sequential data) and the medical ontologies (knowledge graphs). The proposed joint embedding allows clinical events of heterogeneous types (e.g., diagnosis, medication) to be represented in separate vector spaces, while the alignment between the clinical events and the clinical entities defined in the medical ontologies is inferred simultaneously. We also infer adaptive projections from the heterogeneous clinical event spaces to a common vector space so that healthcare predictive tasks, such as mortality prediction and the prediction of other clinical events, can be supported. While the accuracy of the predictive analytics task is one of our key concerns, we also emphasize the importance of model interpretability, as healthcare is a safety-critical application domain.
We propose different schemes to incorporate sparsity criteria and attention mechanisms into the proposed framework to achieve the interpretability objective. We then apply the proposed framework to high-throughput candidate phenotype generation. For performance comparison, empirical experiments on the MIMIC III database will be carried out. In addition to quantitative evaluation based on well-known performance measures, we will also conduct qualitative evaluation by consulting clinicians regarding the interpretability of the candidate phenotypes generated from the data.
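The data flow described above (separate vector spaces per event type, projections into a common space, attention over the events of a stay, and a risk prediction) can be illustrated with a minimal sketch. It is not the project's model: the event codes, dimensions, and function names are hypothetical, the parameters are random rather than learned, and the alignment with the ontology knowledge graphs and the sparsity criteria are omitted.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy parameters are reproducible

COMMON_DIM = 4                                  # hypothetical common-space size
TYPE_DIMS = {"diagnosis": 3, "medication": 2}   # hypothetical per-type sizes


def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# One embedding table per event type: each type lives in its own vector space.
# The codes (dx_A, med_X, ...) are placeholders, not real ontology codes.
embeddings = {
    "diagnosis": {"dx_A": rand_vec(3), "dx_B": rand_vec(3)},
    "medication": {"med_X": rand_vec(2)},
}

# One projection matrix per type, mapping that type's space to the common space.
projections = {t: rand_mat(COMMON_DIM, d) for t, d in TYPE_DIMS.items()}

query = rand_vec(COMMON_DIM)    # attention query vector (assumed form)
weights = rand_vec(COMMON_DIM)  # logistic-regression weights on the pooled vector
bias = 0.0


def encode_stay(events):
    """Project each (type, code) event to the common space and attention-pool."""
    projected = [matvec(projections[t], embeddings[t][c]) for t, c in events]
    attn = softmax([dot(query, p) for p in projected])
    pooled = [sum(a * p[i] for a, p in zip(attn, projected))
              for i in range(COMMON_DIM)]
    return pooled, attn


def predict_risk(events):
    """Return a probability-like score and the per-event attention weights."""
    pooled, attn = encode_stay(events)
    score = 1.0 / (1.0 + math.exp(-(dot(weights, pooled) + bias)))
    return score, attn


stay = [("diagnosis", "dx_A"), ("medication", "med_X"), ("diagnosis", "dx_B")]
risk, attn = predict_risk(stay)
print(risk, attn)
```

In a sketch like this, the attention weights give one number per event in the stay, which is the kind of signal an attention-based interpretability scheme would surface; with random parameters the values themselves carry no clinical meaning.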
Effective start/end date: 1/01/18 → 30/06/21

