Scrapple: a Flexible Framework to Develop Semi-Automatic Web Scrapers

Alex Mathew; Harish Balakrishnan; Saravanan Palani

doi:10.15866/irecos.v10i5.5864

Alex Mathew^(1*), Harish Balakrishnan^(2), Saravanan Palani^(3)

^(*) Corresponding author

Abstract

The World Wide Web is the largest source of data available to the general public. Students and researchers working on data-related problems need a well-maintained source for their data. Most online services provide APIs, but an API may not expose all the data a user needs, even when that data is present on the website. This paper introduces Scrapple, a framework for creating semi-automatic web scrapers that extract the required data from the Web. Users do not need to write entire scraping programs; they only define a configuration file, which is used to build the required scraper. The configuration supports CSS selectors and XPath expressions.
Copyright © 2015 Praise Worthy Prize - All rights reserved.
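
The configuration-driven approach described in the abstract can be illustrated with a minimal sketch. It is not Scrapple's actual configuration format or API: the `CONFIG` layout, the `scrape` function, and the sample `PAGE` markup are all hypothetical, and the example relies on the limited XPath subset in Python's standard `xml.etree.ElementTree`, which requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration (NOT Scrapple's real file format):
# one XPath expression selects each record, and one per field
# selects the value inside that record.
CONFIG = {
    "row": ".//div[@class='book']",
    "fields": {
        "title": "./h2",
        "price": "./span[@class='price']",
    },
}

# Hypothetical, well-formed sample page used only for illustration.
PAGE = """
<html><body>
  <div class="book"><h2>Dune</h2><span class="price">9.99</span></div>
  <div class="book"><h2>Emma</h2><span class="price">4.50</span></div>
</body></html>
"""


def scrape(markup, config):
    """Apply the configured XPath expressions to well-formed markup."""
    root = ET.fromstring(markup)
    records = []
    for row in root.findall(config["row"]):
        record = {}
        for name, path in config["fields"].items():
            node = row.find(path)
            # Missing nodes or empty text become None instead of raising.
            has_text = node is not None and node.text
            record[name] = node.text.strip() if has_text else None
        records.append(record)
    return records


RECORDS = scrape(PAGE, CONFIG)
print(RECORDS)
```

The point of the sketch is that the extraction logic stays fixed while only the configuration changes from site to site. Real pages are rarely well-formed XML, so a practical tool would need an HTML-tolerant parser.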

Keywords

Web Scraping; Web Wrapper; CSS Selector; XPath Expressions
