A comparative study of ontology based term similarity measures on PubMed document clustering

Xiaodan Zhang*, Liping Jing, Xiaohua Hu, Kwok Po NG, Xiaohua Zhou

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference contribution › peer-review

51 Citations (Scopus)

Abstract

Recent research shows that an ontology used as background knowledge can improve document clustering quality through its concept hierarchy. Previous studies use term semantic similarity as an important measure for incorporating domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, few studies have examined how different types of term similarity measures affect clustering performance in a given domain. In this paper, we conduct a comparative study of how different term semantic similarity measures, including path-based, information-content-based, and feature-based measures, affect document clustering. We evaluate term re-weighting as an important method for integrating a domain ontology into the clustering process. We apply k-means clustering to one real-world text dataset, our own corpus generated from PubMed. Experimental results on 8 different semantic measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures perform more stably than the others; (3) term re-weighting has a positive effect on medical document clustering, but the effect might not be significant when documents contain few terms.
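The abstract does not give the formulas used in the paper, so the following is only a minimal sketch of the general idea of ontology-based term re-weighting with a path-based similarity. The toy hierarchy, the `1 / (1 + path length)` similarity, the additive re-weighting rule, and the `threshold` parameter are all illustrative assumptions, not the authors' method.

```python
# Toy is-a hierarchy (child -> parent). Hypothetical; not taken from MeSH
# or from the paper.
PARENT = {
    "disease": None,
    "neoplasm": "disease",
    "carcinoma": "neoplasm",
    "sarcoma": "neoplasm",
    "infection": "disease",
}


def ancestors(term):
    """Return the chain from `term` up to the root, `term` first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_similarity(a, b):
    """Assumed path-based measure: 1 / (1 + edges on the shortest path)."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            # First shared ancestor found walking up from `a`.
            return 1.0 / (1 + steps_a + up_b.index(node))
    return 0.0


def reweight(doc, threshold=0.3):
    """Assumed rule: add similarity-scaled weights of related terms.

    `doc` maps term -> weight (for example, TF-IDF). Terms whose
    similarity falls below `threshold` contribute nothing.
    """
    terms = list(PARENT)
    out = {}
    for t in terms:
        extra = 0.0
        for u in terms:
            if u == t:
                continue
            sim = path_similarity(t, u)
            if sim >= threshold:
                extra += sim * doc.get(u, 0.0)
        out[t] = doc.get(t, 0.0) + extra
    return out


if __name__ == "__main__":
    # A document that mentions only "carcinoma" gains weight on nearby concepts.
    print(reweight({"carcinoma": 1.0}))
```

Under these assumptions, the re-weighted vectors would then be passed to a standard k-means implementation in place of the raw term weights. Swapping `path_similarity` for an information-content-based or feature-based measure leaves `reweight` unchanged, which is the kind of comparison the abstract describes.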

Original language: English
Title of host publication: Advances in Databases
Subtitle of host publication: Concepts, Systems and Applications - 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007, Proceedings
Publisher: Springer Verlag
Pages: 115-126
Number of pages: 12
ISBN (Print): 9783540717027
Publication status: Published - 2007
Event: 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007 - Bangkok, Thailand
Duration: 9 Apr 2007 → 12 Apr 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4443 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007
Country/Territory: Thailand
City: Bangkok
Period: 9/04/07 → 12/04/07

Scopus Subject Areas

  • Theoretical Computer Science
  • Computer Science(all)

User-Defined Keywords

  • Document clustering
  • Domain ontology
  • Semantic similarity measure
