Abstract
The Web is transforming from a mere information dissemination platform into a distributed knowledge-based platform that supports complex problem solving. However, the existing Web contains a large amount of knowledge that is tagged only with layout-related markup, making it hard to discover and use. In this paper, we propose to model semantically rich and self-contained knowledge units embedded in a web site as a mixture of bipartite sub-graphs, and to extract these sub-graphs as the web site abstraction via hyperlink structure and file hierarchy analysis. A recursive algorithm, named ReHITS, is derived which can identify bipartite sub-graphs with a hierarchical organization. Each identified sub-graph contains a set of associated authorities and hubs as its summarized semantic description. The effectiveness of the algorithm has been evaluated using three real web sites (containing approximately 10,000 web pages) with promising results. A detailed interpretation of the experimental results and a qualitative comparison with other related work are also included.
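For readers unfamiliar with the hub and authority terminology, the following is a minimal sketch of the standard HITS iteration named in the keywords, not of ReHITS itself: the abstract does not specify the recursive step or the file hierarchy analysis, so neither is shown. The function names `hits` and `normalize`, the iteration count, and the toy link graph are illustrative assumptions.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Plain HITS on a list of (source, target) hyperlinks.

    Illustrative only: this is the classic iteration, without the
    recursion or file hierarchy analysis described for ReHITS.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy site: two index pages (h1, h2) linking to content pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h2", "a3")]
hub, auth = hits(edges)
top_hub = max(hub, key=hub.get)
top_authorities = sorted(auth, key=auth.get, reverse=True)[:2]
```

In this toy graph the index pages act as hubs and the content pages as authorities, which together form one bipartite sub-graph of the kind the abstract describes.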
Original language | English |
---|---|
Pages (from-to) | 343-355 |
Number of pages | 13 |
Journal | Web Intelligence and Agent Systems |
Volume | 5 |
Issue number | 3 |
Publication status | Published - 2007 |
Scopus Subject Areas
- Software
- Computer Networks and Communications
- Artificial Intelligence
User-Defined Keywords
- HITS algorithm
- Knowledge discovery
- Web site abstraction
- Web structure mining