
Structure-Oriented Algorithms for Comparison of Web Pages Resemblance

Vladimír Bureš(1*), Jiří Štěpánek(2)

(1) University of Hradec Králové, Czech Republic
(2) University of Hradec Králové, Czech Republic
(*) Corresponding author



The astounding growth in the number of web pages that we have experienced during the last decade brings both new opportunities and new threats to the so-called information society. Plagiarism and intellectual property rights represent only a fraction of the issues triggered by this development. Since the need to compare web-based content has become topical, this paper presents three novel algorithms for the structural comparison of web pages: frequency analysis, sequence analysis, and identification of the largest sub-tree. The algorithms are experimentally tested in a purpose-built software application. The results reveal that all three algorithms are usable and reliable. Furthermore, their time characteristics indicate that their execution speed and computational complexity are comparable to those of other available algorithms. Although the approach has certain limitations, its implications are clear, and the results open perspectives for further research.
Copyright © 2014 Praise Worthy Prize - All rights reserved.


Algorithms; Computer Applications; HTML Code Standards; Tree Data Structures
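
The abstract names the three algorithms but does not describe how they work. As a rough illustration only, the sketch below shows one plausible reading of the first two: "frequency analysis" is taken to mean comparing how often each HTML tag occurs, and "sequence analysis" to mean comparing the order in which tags open. Both interpretations, the function names, and the choice of cosine similarity and `difflib.SequenceMatcher` are assumptions made for this example, not the authors' published method. The largest sub-tree algorithm is not sketched.

```python
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser
from math import sqrt


class TagCollector(HTMLParser):
    """Collects opening tag names in document order, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def frequency_similarity(html_a, html_b):
    """Cosine similarity of tag-count vectors (assumed reading of 'frequency analysis')."""
    a = Counter(tag_sequence(html_a))
    b = Counter(tag_sequence(html_b))
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sequence_similarity(html_a, html_b):
    """Ratio of matching tag subsequences (assumed reading of 'sequence analysis')."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


if __name__ == "__main__":
    page_a = "<html><body><div><p>a</p><p>b</p></div></body></html>"
    page_b = "<html><body><p>a</p><div></div><p>b</p></body></html>"
    print(frequency_similarity(page_a, page_b))
    print(sequence_similarity(page_a, page_b))
```

In this sketch the two example pages contain the same tags in a different order, so the frequency measure cannot tell them apart while the sequence measure can. That difference is one reason a structural comparison might combine several measures.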



