OWler is an open-source focused web crawler based on Storm Crawler for feeding the Open Web Index developed in the OpenWebSearch.EU project.
In its first version, OWler reads WARC files, particularly from Common Crawl, and extracts links as seeds. It then filters and pre-processes the content and starts collecting web pages from the extracted links that satisfy specific criteria, e.g., URLs that belong to a given domain or that contain a user-specified topic.
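The seed selection described above can be illustrated with a minimal Python sketch. This is not OWler's actual code (OWler is built on Storm Crawler); the names `LinkExtractor`, `extract_links`, `is_seed`, and `select_seeds` are hypothetical. The sketch assumes that the page HTML has already been read from a WARC record, and that "contains a topic" means the keyword appears in the URL string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                url = urljoin(self.base_url, value)
                if urlparse(url).scheme in ("http", "https"):
                    self.links.append(url)


def extract_links(html, base_url):
    """Return all http(s) links found in the HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def is_seed(url, allowed_domains=(), topic_keywords=()):
    """Keep a URL if its host is (or is under) an allowed domain,
    or if the URL string contains one of the topic keywords (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    in_domain = any(host == d or host.endswith("." + d) for d in allowed_domains)
    on_topic = any(k.lower() in url.lower() for k in topic_keywords)
    return in_domain or on_topic


def select_seeds(html, base_url, allowed_domains=(), topic_keywords=()):
    """Extract links from one page and return the de-duplicated seeds."""
    seen = set()
    seeds = []
    for url in extract_links(html, base_url):
        if url not in seen and is_seed(url, allowed_domains, topic_keywords):
            seen.add(url)
            seeds.append(url)
    return seeds
```

For example, calling `select_seeds(page_html, page_url, allowed_domains=("example.com",), topic_keywords=("climate",))` would keep links on `example.com` and its subdomains, plus any link whose URL mentions "climate". A real crawler would also normalize URLs and might match the topic against page content rather than the URL.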
OWler respects the robots.txt protocol and aims for polite, legally compliant crawling. However, it is a research prototype, so some things might go wrong. If they do, we apologize in advance. If you have feedback or want to inform us about things that go wrong, please write an e-mail to email@example.com or contact the coordinator of the OpenWebSearch.EU project.