OWI Stats
- 3.10 billion URLs crawled
- 185 different languages
- 28 million hosts
- 435.82 TiB crawled in total
- 1 TiB crawled per day
- 147 WARC datasets
- 13.95 TiB size of the Open Web Index
- 50.09 TiB size of the WARC datasets
- 222 public datasets
(Status: all figures are daily statistics, except the number of languages and the number of hosts, which are as of March 2024. For more daily statistics and information, see the OWLer dashboard.)
The Open Web Index – Current Status
A web index is a data structure that allows fast content-based access, sorting, and filtering of large collections of web documents, and it forms the core of every web search engine today. Usually an inverted index is used, in which content units (e.g. words, metadata) point to the list of web documents they occur in. The quality of a web index stems from the quality of the indexed documents, augmented by additional signals such as usage information, metadata, or link structure, which allow search rankings to be fine-tuned to user needs. Lewandowski¹ proposed creating such an index openly through a corresponding infrastructure. We follow his idea and aim to create an infrastructure for building the Open Web Index. However, unlike Lewandowski, we do not aim to provide a corresponding search API, but to share the index as open data, so that others can use it to build a search engine. We have detailed our view on an open index in our recent JASIST publication².
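To make the inverted index idea concrete, here is a minimal sketch in Python. It is an illustration only and not the data structure used by the Open Web Index: the function names, the whitespace tokenizer, and the three toy documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Toy documents, invented for this example.
docs = {
    1: "open web index for europe",
    2: "web search engine",
    3: "open search index",
}
index = build_inverted_index(docs)
hits = search(index, "open index")
```

Looking up a term is a single dictionary access, and a multi-term query is an intersection of the lists for its terms. Ranking signals such as link structure or usage information would be applied on top of the candidate set that this lookup returns.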
A sketch of such an infrastructure is shown in the figure below: all index generation steps are on the left, delivering the Open Web Index as a basis for search applications (top right) and data products (bottom right).
Following the figure, we give a brief update on the results achieved so far and the stage to which each applies.
- A first running pipeline with ~1 TiB/day (download facilities are not yet in place)
- Use of the index in standard retrieval libraries
- Prototypical index push to OpenSearch, Elasticsearch and S3
- Dashboard with crawl and index statistics
- Application scenarios in development
- Ethical, legal, societal (ELSA) considerations and governance planning
Figure: Workflow OpenWebSearch.eu
OWLer – the Open Web Search CrawLer
The first step is crawling web resources and storing them in so-called WARC (Web Archive) files. We have built our crawler OWLer on top of the Apache Storm based StormCrawler and extended it accordingly. More details are available online:
- The OWLer landing page for webmasters on our homepage
- The open-source repository at our gitlab including links to the documentation
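To illustrate what a crawler stores, the sketch below writes and reads simplified, uncompressed WARC-style records using only the Python standard library. It is not OWLer code. The helper names and the example URLs are assumptions, and a real WARC file carries additional mandatory headers (such as a record id and a date) and is usually gzip-compressed, so a dedicated WARC library should be used in practice.

```python
import io


def write_record(out, headers, payload):
    """Write one record: version line, named headers, blank line, payload."""
    out.write(b"WARC/1.1\r\n")
    for name, value in headers.items():
        out.write(f"{name}: {value}\r\n".encode("utf-8"))
    out.write(f"Content-Length: {len(payload)}\r\n\r\n".encode("utf-8"))
    out.write(payload)
    out.write(b"\r\n\r\n")  # two CRLFs terminate a record


def read_records(stream):
    """Yield (headers, payload) for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length comes from the header, not from delimiters.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical crawl results, invented for this example.
buf = io.BytesIO()
write_record(buf, {"WARC-Type": "response",
                   "WARC-Target-URI": "https://example.org/"},
             b"<html>hello</html>")
write_record(buf, {"WARC-Type": "response",
                   "WARC-Target-URI": "https://example.org/about"},
             b"<html>about\r\n\r\nus</html>")
buf.seek(0)
records = list(read_records(buf))
```

Because each payload is read by its declared length, the second record survives even though its body contains blank lines.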
Resiliparse – Robust and Fast WARC Parsing
Crawled data is preprocessed using the Resiliparse library within an Apache Spark job. For more details, please take a look at our documentation and source code.
CIFF files for sharing the open Index
We again run Apache Spark jobs to create our index. We use the CIFF (Common Index File Format)³ as a standard format for storing the index and will make it available alongside the extracted metadata (which is stored in Parquet files). For further information, you can look at the following resources:
Serving a search engine
While our main focus lies on creating and sharing CIFF files and the associated metadata, we also develop tools to import CIFF files into standard retrieval applications. We support the following search engines:
- Apache Lucene, which is the basis for Apache Solr and Elasticsearch, can use CIFF files via our Lucene CIFF importer
- PyTerrier supports CIFF out of the box.
Stay tuned for more possibilities.
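An importer has to turn stored postings back into something a retrieval library can search. As a rough illustration, the sketch below assumes that each term carries a postings list of (docid, term frequency) pairs whose docids are stored as gaps to the previous docid. This is an assumption about the format made for this example. The class and function names are hypothetical and are not part of CIFF or of any importer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PostingsList:
    """Hypothetical in-memory stand-in for one term's entry in an index file."""
    term: str
    df: int                          # number of documents containing the term
    postings: List[Tuple[int, int]]  # (docid gap, term frequency)


def encode_postings(term, doc_tfs):
    """Store sorted (docid, tf) pairs with each docid as a gap to the previous one."""
    postings = []
    previous = 0
    for docid, tf in sorted(doc_tfs):
        postings.append((docid - previous, tf))
        previous = docid
    return PostingsList(term=term, df=len(postings), postings=postings)


def decode_postings(plist):
    """Recover absolute docids by summing the gaps."""
    docid = 0
    decoded = []
    for gap, tf in plist.postings:
        docid += gap
        decoded.append((docid, tf))
    return decoded


# Toy postings for one term, invented for this example.
entry = encode_postings("index", [(3, 2), (10, 1), (12, 4)])
restored = decode_postings(entry)
```

Gap encoding keeps the stored numbers small, which helps compact integer encodings. The cost is that a reader must scan a postings list from the start to recover absolute docids.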
Search Engine Evaluation
We are also concerned with search engine quality and offer tools for evaluating search engines. In particular, TIREx (https://github.com/tira-io/ir-experiment-platform), which is built on TIRA (https://github.com/tira-io/tira), allows comparing different retrieval pipelines, provides leaderboards for standard IR tasks, and also supports CIFF import. TIREx received the 2023 ACM SIGIR Best Paper award⁴.
Sources
[1] Lewandowski, D. "The web is missing an essential part of infrastructure: An open web index". Communications of the ACM, 62(4), 24. 2019.
[3] Lin, J., et al. "Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format". Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020.
[4] Fröbe, M., Reimer, J. H., MacAvaney, S., Deckers, N., Reich, S., Bevendorff, J., Stein, B., Hagen, M., and Potthast, M. "The Information Retrieval Experiment Platform". In Hsin-Hsi Chen et al., editors, 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), pages 2826–2836. ACM. July 2023.