Learning the kernel matrix for XML document clustering

Jianwu Yang*, Kwok Wai CHEUNG, Xiaoou Chen

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

18 Citations (Scopus)

Abstract

The rapid growth of XML adoption has created a need for a proper representation of semi-structured documents, one that takes the document's structural information into account so as to support more precise document analysis. In this paper, an XML document representation named "structured link vector model" is adopted, with a kernel matrix included for modeling the similarity between XML elements. Our formulation allows individual XML elements to have their own weighted contribution to the overall document similarity while at the same time allowing the between-element similarity to be captured. An iterative algorithm is derived to learn the kernel matrix. For performance evaluation, the method was tested on the ACM SIGMOD Record dataset as well as the CEDE dataset. Our proposed method significantly outperforms the traditional vector space model and the edit-distance based methods. In addition, the kernel matrix obtained as a by-product provides knowledge about the conceptual relationship between the XML elements.
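The idea of a kernel matrix that weights both per-element and between-element similarity can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no formula, so the bilinear form used here (a kernel-weighted sum of dot products between per-element term vectors), the function name `kernel_similarity`, the two element types, and the toy vectors are all assumptions made for illustration. The iterative learning of the kernel matrix is not shown.

```python
def dot(u, v):
    """Plain dot product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def kernel_similarity(doc_x, doc_y, kernel):
    """Kernel-weighted document similarity (illustrative assumption).

    doc_x, doc_y: one term-weight vector per XML element type.
    kernel: square matrix; kernel[i][j] weights how much element
            type i of doc_x is compared against element type j of doc_y.
    """
    total = 0.0
    for i, row_x in enumerate(doc_x):
        for j, row_y in enumerate(doc_y):
            total += kernel[i][j] * dot(row_x, row_y)
    return total


# Hypothetical documents: two element types (say, title and abstract)
# over a three-term vocabulary.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
doc_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Identity kernel: each element type is compared only with itself.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Off-diagonal entries let terms in one element match terms in another.
coupled = [[1.0, 0.5], [0.5, 1.0]]

sim_identity = kernel_similarity(doc_a, doc_b, identity)
sim_coupled = kernel_similarity(doc_a, doc_b, coupled)
print(sim_identity, sim_coupled)
```

With the identity kernel only same-element matches count; the off-diagonal weights in `coupled` also credit terms that appear in different elements of the two documents, which is the kind of between-element similarity the abstract describes.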

Original language: English
Title of host publication: Proceedings - 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service, EEE-05
Publisher: IEEE
Pages: 353-358
Number of pages: 6
ISBN (Print): 9780769522746, 0769522742
Publication status: Published - Apr 2005
Event: 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service, EEE-05 - Hong Kong, China
Duration: 29 Mar 2005 to 1 Apr 2005
https://ieeexplore.ieee.org/xpl/conhome/9634/proceeding

Publication series

Name: Proceedings - IEEE International Conference on e-Technology, e-Commerce and e-Service

Conference

Conference: 2005 IEEE International Conference on e-Technology, e-Commerce and e-Service, EEE-05
Country/Territory: China
City: Hong Kong
Period: 29/03/05 to 1/04/05

User-Defined Keywords

  • Kernel
  • XML
  • Computer science
  • Text analysis
  • Information analysis
  • Iterative algorithms
  • Testing
  • Fourier transforms
  • Training data
