OWLer, Resiliparse, CIFF files, … A brief update on the results the ows.eu project achieved so far

OWI stats in numbers

(Status: all figures are daily stats, apart from the number of languages and the number of hosts, which are as of March 2024 | for more daily stats and information, go to the OWLer dashboard)

2.03 Billion

URLs crawled

185

different languages

28 Million

Hosts

265.15 TiB

Total crawled

1 TiB

crawled per day

136

WARC Datasets

9.41 TiB

Size of Open Web Index

45.38 TiB

Size of WARC Datasets

183

Public Datasets

The Open Web Index – Current Status

A web index is a data structure that allows fast content-based access, sorting, and filtering of large collections of web documents, and it forms the core of every web search engine today. Usually an inverted index structure is used, where content units (e.g. words, metadata) point to a list of the web documents they occur in. The quality of a web index stems from the quality of the indexed documents, augmented by additional signals (e.g. usage information, metadata, or link structure) that allow the search rankings to be fine-tuned to user needs. Lewandowski¹ proposed creating such an index openly through a corresponding infrastructure. We follow his idea and aim to create an infrastructure for building the Open Web Index. Unlike Lewandowski, however, we do not aim to provide a corresponding search API. Instead, we share the index as open data so that others can use it to build a search engine. We have detailed our view on an Open Index in our recent JASIST publication².
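To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the data structure described above, not the project's indexing code: the `tokenize`, `build_inverted_index`, and `search` helpers and the three sample documents are assumptions made up for this example, and real indexes add term statistics, positions, and ranking signals.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real text analysis."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical toy collection, keyed by document id.
documents = {
    1: "open web index for europe",
    2: "crawling the open web",
    3: "an inverted index maps words to documents",
}

index = build_inverted_index(documents)
print(search(index, "open web"))  # [1, 2]
print(search(index, "index"))     # [1, 3]
```

Looking up a term is a dictionary access followed by a list intersection, which is why this structure scales to content-based filtering over very large collections.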

A sketch of the Open Web Index infrastructure is shown in the figure below. All index generation steps are on the left, delivering the Open Web Index as a basis for search applications (top right) and data products (bottom right).

Following the figure, we give a brief update on the results achieved so far and the stage to which each applies.

A first running pipeline with ~ 1 TiB/day (download facilities are not yet in place)
Use of the index in standard retrieval libraries
Prototypical index push to OpenSearch, Elasticsearch and S3
Dashboard with crawl and index statistics
Application scenarios in development
Ethical, legal, societal (ELSA) considerations and governance planning

Workflow OpenWebSearch.eu

OWLer – the Open Web Search CrawLer

The first step is crawling web resources and storing them in so-called WARC (Web Archive) files. We have built our crawler OWLer on top of the Apache Storm based StormCrawler and extended it accordingly. You can find more details online:

The OWLer landing page for webmasters on our homepage
The open-source repository at our gitlab including links to the documentation

Resiliparse – Robust and Fast WARC Parsing

Crawled data is preprocessed with the Resiliparse library, running as part of an Apache Spark job. For more details, please take a look at our documentation / source code.
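As a rough illustration of what a WARC parser has to handle, the sketch below reads records from an uncompressed WARC stream using only the Python standard library. It deliberately does not use Resiliparse's API, and it is far less robust than a real parser: the `read_warc_records` and `make_record` helpers are hypothetical names, and the sample records are simplified (real records carry further mandatory headers such as a record id and date, and files are usually compressed).

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is given in bytes by the Content-Length header.
        yield headers, stream.read(int(headers["Content-Length"]))


def make_record(uri, body):
    """Build a simplified WARC record for demonstration purposes only."""
    head = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("https://example.org/a", b"hello web")
    + make_record("https://example.org/b", b"second page")
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

In a Spark job, a function like this would typically run once per input file, with the extracted text and metadata emitted as rows for the later indexing steps.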

CIFF files for sharing the open Index

We again run Apache Spark jobs to create our index. We use CIFF (Common Index File Format)³ as a standard format for storing the index and will make it available alongside the extracted metadata (which is stored in Parquet files). For further information, you can look at the following resources:

Serving a search engine

While our main focus lies on creating and sharing CIFF files and the associated metadata, we also develop tools to import CIFF files into standard retrieval applications. We support the following search engines:

Apache Lucene, which is the basis for Apache Solr and Elasticsearch, can use CIFF files via our Lucene CIFF importer
PyTerrier supports CIFF out of the box.

Stay tuned for more possibilities.
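To give an idea of what an importer has to decode, the following sketch shows gap (delta) encoding of document ids in a postings list. It assumes, based on our reading of the CIFF description³, that each posting holds a document id stored as the difference to the previous id, plus a term frequency. The `Posting` dataclass and both helper functions are hypothetical stand-ins written for this example; they are not the CIFF protobuf classes.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    docid: int  # gap to the previous docid (for the first posting: the docid itself)
    tf: int     # term frequency of the term in that document


def encode_postings(pairs):
    """Turn (docid, tf) pairs, sorted by docid, into gap-encoded postings."""
    postings, previous = [], 0
    for docid, tf in pairs:
        postings.append(Posting(docid - previous, tf))
        previous = docid
    return postings


def decode_postings(postings):
    """Recover absolute (docid, tf) pairs by summing up the gaps."""
    pairs, current = [], 0
    for posting in postings:
        current += posting.docid
        pairs.append((current, posting.tf))
    return pairs


pairs = [(3, 2), (7, 1), (8, 5), (20, 1)]
encoded = encode_postings(pairs)
print([p.docid for p in encoded])  # [3, 4, 1, 12]
print(decode_postings(encoded))    # [(3, 2), (7, 1), (8, 5), (20, 1)]
```

Storing gaps keeps the numbers small, which helps variable-length integer encodings compress long postings lists.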

Search Engine Evaluation

We are also concerned with search engine quality and offer tools for evaluating search engines. In particular, TIREx (https://github.com/tira-io/ir-experiment-platform), which is built on TIRA (https://github.com/tira-io/tira), makes it possible to compare different retrieval pipelines, provides leaderboards for standard IR tasks, and also supports CIFF import. TIREx received the 2023 ACM SIGIR Best Paper award⁴.
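Comparing retrieval pipelines usually comes down to scoring their rankings against relevance judgments. The sketch below computes nDCG@k, one commonly used measure, in plain Python. It is not TIREx or TIRA code, and which measures the leaderboards report is not stated here; the judgments and the two runs are made-up example data.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gains at lower ranks count less."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranking, relevance, k):
    """nDCG@k of a ranked list of doc ids, given graded relevance judgments."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query (higher means more relevant).
relevance = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}

# Two hypothetical pipelines returning the same documents in different orders.
run_a = ["d1", "d2", "d4", "d3"]
run_b = ["d3", "d4", "d2", "d1"]

score_a = ndcg_at_k(run_a, relevance, k=4)
score_b = ndcg_at_k(run_b, relevance, k=4)
print(f"run_a: {score_a:.3f}, run_b: {score_b:.3f}")
```

Here `run_a` orders the documents exactly by relevance and therefore reaches the maximum score of 1.0, while `run_b` scores lower because it places the most relevant document last.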

Sources

[1] Lewandowski, D. The web is missing an essential part of infrastructure: An open web index. Communications of the ACM, 62 (4), 24–24 (2019).

[2] Granitzer, Michael, et al. “Impact and development of an Open Web Index for open web search.” Journal of the Association for Information Science and Technology (2023).

[3] Lin, Jimmy, et al. “Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format”. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020.

[4] Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. The Information Retrieval Experiment Platform. In Hsin-Hsi Chen et al., editors, 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), pages 2826–2836, July 2023. ACM.