
Deep Web Integration: the Tip of the Iceberg




DOI: https://doi.org/10.15866/irecos.v10i10.7755

Abstract


The web is divided into two parts: the surface web, which search engines can access, and the deep web, which they cannot. The deep web is much larger and richer in information than the surface web, and its sources are accessible only through their associated HTML forms. In this paper we present an automatic approach for extracting a relational schema that describes a selected deep web source. A virtual integration system can then use this relational schema to access that source. The approach is based on static and dynamic analysis of the HTML forms that give access to the selected source. It relies on two external knowledge databases: the first is our proprietary knowledge database about deep web domains, called the Identification Tables, and the second is an external ontology. All the information extracted from and through the HTML forms is then used to build the final relational schema describing the deep web source.
Copyright © 2015 Praise Worthy Prize - All rights reserved.
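
As a rough illustration of what a static analysis of an HTML form might yield, the sketch below reads the named fields of a form and proposes one candidate relation with one column per field. It is not the authors' method: it does not use the Identification Tables or an ontology, it performs no dynamic analysis (no queries are submitted), and the names `FormFieldCollector`, `candidate_schema`, `SAMPLE_FORM`, and the `"enum"`/`"text"` domain labels are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Input types that carry no queryable attribute (assumption for this sketch).
IGNORED_INPUT_TYPES = {"submit", "button", "reset", "hidden", "image"}


class FormFieldCollector(HTMLParser):
    """Collect the named fields of the first <form> found in a page."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.fields = []              # each field: {"name", "kind", "options"}
        self._current_select = None   # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            if not self.done:
                self.in_form = True
            return
        if not self.in_form:
            return
        name = attrs.get("name")
        if tag == "input" and name:
            kind = attrs.get("type") or "text"
            if kind not in IGNORED_INPUT_TYPES:
                self.fields.append({"name": name, "kind": kind, "options": []})
        elif tag == "textarea" and name:
            self.fields.append({"name": name, "kind": "textarea", "options": []})
        elif tag == "select" and name:
            self._current_select = {"name": name, "kind": "select", "options": []}
            self.fields.append(self._current_select)
        elif tag == "option" and self._current_select is not None:
            value = attrs.get("value")
            if value:  # skip empty placeholders such as "Any"
                self._current_select["options"].append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None
        elif tag == "form" and self.in_form:
            self.in_form = False
            self.done = True


def candidate_schema(html, relation_name):
    """Return (relation_name, [(column, domain), ...]) for the first form."""
    collector = FormFieldCollector()
    collector.feed(html)
    columns, seen = [], set()
    for field in collector.fields:
        if field["name"] in seen:  # e.g. radio buttons sharing one name
            continue
        seen.add(field["name"])
        # Assumption: a closed option list suggests an enumerated domain,
        # anything else is treated as free text.
        domain = "enum" if field["options"] else "text"
        columns.append((field["name"], domain))
    return relation_name, columns


SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="title">
  <input type="text" name="author">
  <select name="category">
    <option value="">Any</option>
    <option value="fiction">Fiction</option>
    <option value="science">Science</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

if __name__ == "__main__":
    print(candidate_schema(SAMPLE_FORM, "book"))
```

For the sample form, this proposes a relation `book` with the columns `title`, `author`, and `category`. A realistic system would still need to map field names to domain concepts and to probe the source with queries, which is where external knowledge and dynamic analysis would come in.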








