TY - JOUR
T1 - Learning element similarity matrix for semi-structured document analysis
AU - Yang, Jianwu
AU - CHEUNG, Kwok Wai
AU - Chen, Xiaoou
N1 - Copyright:
Copyright 2009 Elsevier B.V., All rights reserved.
PY - 2009
Y1 - 2009
N2 - Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. Structured Link Vector Mode (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements - the constructing components of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix using the machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to enhance the model accuracy, with the element similarity learnability property preserved. For performance evaluation, we applied the similarity learning to k-nearest neighbors search and similarity-based clustering, and tested the performance using two different XML document collections. The SLVM obtained via learning was found to outperform significantly the conventional Vector Space Model and the edit-distance-based methods. Also, the similarity matrix, obtained as a by-product, can provide higher-level knowledge on the semantic relationships between the XML elements.
AB - Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. Structured Link Vector Mode (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements - the constructing components of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix using the machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to enhance the model accuracy, with the element similarity learnability property preserved. For performance evaluation, we applied the similarity learning to k-nearest neighbors search and similarity-based clustering, and tested the performance using two different XML document collections. The SLVM obtained via learning was found to outperform significantly the conventional Vector Space Model and the edit-distance-based methods. Also, the similarity matrix, obtained as a by-product, can provide higher-level knowledge on the semantic relationships between the XML elements.
KW - Extended Vector Space Model
KW - Learning similarity matrix
KW - Semi-structured document analysis
KW - Similarity-based clustering
UR - http://www.scopus.com/inward/record.url?scp=63649131887&partnerID=8YFLogxK
U2 - 10.1007/s10115-008-0138-2
DO - 10.1007/s10115-008-0138-2
M3 - Journal article
AN - SCOPUS:63649131887
SN - 0219-1377
VL - 19
SP - 53
EP - 78
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 1
ER -