Presenting the first Market Feasibility Study on a European Open Web Index

In light of “Free Web Search Day” on 29 September, OpenWebSearch.eu and its partner Open Search Foundation present the results of a nine-month, in-depth market potential study*, focusing on the economic impact of an Open Web Index for Europe.

The Munich-based consulting firm Mücke Roth & Company, which was selected as an OpenWebSearch.eu third-party partner in 2023, was commissioned to investigate the macro-economic as well as the societal implications of a European Open Web Index, as currently developed by OpenWebSearch.eu.

On 30 September 2024, the final study results will be presented to the public for the first time. The presentation will be hosted in Munich by OpenWebSearch.eu consortium partner Open Search Foundation alongside Mücke Roth & Company, with the kind support of the BMW Foundation Herbert Quandt.

The study is available for download via the following link:

https://openwebsearch.eu/market-potential-study

Methodology and Findings

By employing both top-down and bottom-up analysis methods, the study quantifies the expected benefits and costs, offering a robust framework for transparent decision-making for OpenWebSearch.eu and its stakeholders.

The study derives a broad variety of applications of an Open Web Index across various industries. Use cases were detailed to give a more tangible understanding of the benefits of the Open Web Index for different customer and user segments, helping to showcase the index's potential in specific industries and for the respective stakeholders.

The cost-benefit evaluation specifically shows that an open search infrastructure based on an Open Web Index is expected to amortise within the fourth year. The report forecasts a considerable macro-economic benefit of 4 to 5 billion euros in the first decade.

For anyone interested in learning more, the presentation will be streamed online. Sign up here:

https://gstoo.de/OpenWebSearch

 

*The study was funded by the European Union.
Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the granting authority.
Neither the European Union nor the granting authority can be held responsible for them.

OWS.EU Partner in Focus: University of Passau

The University of Passau coordinates the OpenWebSearch.EU project and is, beyond that, responsible for providing the Open Web Index (OWI), which includes developing the technology for coordinating crawlers, building the OWI, and enabling its download. Building the OWI is one of the key milestones of the OWS.EU project, since it will accelerate further use and research towards open web search.

Prof. Michael Granitzer leads the OWS.EU project and holds the Chair of Data Science at the University of Passau. Together with Jelena Mitrović, Professor of Legal Informatics and Natural Language Processing and leader of the Junior Research Group CAROLL, he supervises the research team working on the Open Web Index. We talked to three researchers from their team about the work they do in the OWS.EU project: Saber Zerhoudi, Mahmoud Istaiti and Mohammed Al-Maamari.

How is the project progressing so far?

Saber: Very well. We have made considerable progress over the past months. Our team has developed scalable, distributed crawling software that is currently deployed across three data centers. To keep users informed about the content being crawled and to provide them with filtering options, we have also created a monitoring dashboard, which can be accessed at https://dashboard.ows.eu/.

Can you explain what the dashboard does?

Saber: One of the key features of the dashboard is its ability to display near real-time information about the crawling process. Users can easily track the progress of the crawling tasks and view statistics on the number of pages crawled. This transparency ensures that users are always informed about the status of our crawling pipeline.

Furthermore, the dashboard offers users the flexibility to filter the crawling content based on various criteria, such as domain, keyword, or date range. This functionality allows users to focus on specific subsets of data that are relevant to their needs, saving time and effort in analyzing the collected information.
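To make the filtering idea more concrete, here is a minimal Python sketch of filtering crawled-page records by domain, keyword, and date range. The record fields (url, text, fetched_at) are assumptions made for this example and not the dashboard's actual data model.

```python
from datetime import date, datetime
from urllib.parse import urlparse

# Hypothetical crawled-page records; the real dashboard exposes its own schema.
records = [
    {"url": "https://example.eu/news/ai", "text": "Open web search and AI",
     "fetched_at": "2024-05-02T10:15:00"},
    {"url": "https://example.org/blog", "text": "Unrelated blog post",
     "fetched_at": "2024-03-11T08:00:00"},
]

def matches(record, domain=None, keyword=None, start=None, end=None):
    """Return True if a record satisfies all supplied filter criteria."""
    fetched = datetime.fromisoformat(record["fetched_at"]).date()
    if domain and urlparse(record["url"]).netloc != domain:
        return False
    if keyword and keyword.lower() not in record["text"].lower():
        return False
    if start and fetched < start:
        return False
    if end and fetched > end:
        return False
    return True

subset = [r for r in records
          if matches(r, domain="example.eu", keyword="search",
                     start=date(2024, 4, 1), end=date(2024, 6, 30))]
print(len(subset), "matching record(s)")
```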

In addition to monitoring and filtering capabilities, the dashboard provides users with the ability to actively contribute to the crawling process. Users can submit lists of URLs they wish to have crawled, expanding the scope of our data collection efforts. This feature enables users to tailor the crawling process to their specific requirements, ensuring that the most relevant and valuable data is collected.
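Before submitting such a list, a user might want to normalize and deduplicate their URLs. The following is a small, hedged sketch of that preparation step in Python; the dashboard's actual submission interface and accepted format are not reproduced here.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lower-case scheme and host and drop fragments so duplicates collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

raw_urls = [
    "https://Example.eu/research#section-2",
    "https://example.eu/research",
    "https://example.eu/data?lang=en",
]

# Deduplicate via a set of normalized URLs, then sort for a stable seed list.
seed_list = sorted({normalize(u) for u in raw_urls})
for url in seed_list:
    print(url)
```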

But how does this look from the perspective of a website owner? Will they have the option to manage their data?

Saber: Yes, to address the important aspects of data privacy and intellectual property rights, we have integrated takedown request and website ownership verification functionalities into the dashboard. Through our third-party partners, users can easily submit takedown requests for content they believe infringes upon their rights. Similarly, website owners can verify their ownership, establishing a clear line of communication and ensuring that any concerns or requests are promptly addressed.

By combining scalable, distributed crawling software with a user-friendly monitoring dashboard, we have created a powerful tool for data collection and management. The ability to monitor, filter, and contribute to the crawling process, along with the integrated takedown request and website ownership verification features, positions our system as a comprehensive solution for users seeking to gather and analyze web data efficiently and responsibly.
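The project's verification flow is handled through its third-party partners, but one common general pattern for proving website ownership is publishing a service-issued token in a DNS TXT record. The sketch below, using the dnspython package, illustrates only that general pattern; the token name and values are made-up placeholders, not the dashboard's real mechanism.

```python
import dns.resolver  # third-party package: dnspython

def has_verification_token(domain: str, expected_token: str) -> bool:
    """Check whether any TXT record on the domain carries the expected token."""
    try:
        answers = dns.resolver.resolve(domain, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    for rdata in answers:
        txt_value = b"".join(rdata.strings).decode("utf-8", errors="ignore")
        if txt_value == expected_token:
            return True
    return False

# Placeholder domain and token for illustration only.
print(has_verification_token("example.org", "ows-verification=abc123"))
```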

What other milestones did you achieve in the project so far?

Mahmoud: My role involves enhancing the crawler process by implementing various filters and features, as well as integrating different data sources into our process. Additionally, I am working on developing machine learning models to extract information from privacy policies.

One major accomplishment is that we can now label crawled websites as either spam or high-quality content by verifying their presence in datasets such as Wikipedia external links or CURLIE.
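As a rough illustration of this labeling idea, the sketch below checks whether a URL's domain appears in a reference set of reputable domains, which in practice would be built from sources such as Wikipedia external links or the CURLIE directory. The domain list and the naive two-label domain cut are purely illustrative; absence from the list alone would not mark a site as spam in the real pipeline.

```python
from urllib.parse import urlparse

# Illustrative allow-list; in practice this would be derived from
# Wikipedia external-link dumps and the CURLIE directory.
reputable_domains = {"wikipedia.org", "europa.eu", "uni-passau.de"}

def quality_label(url: str) -> str:
    """Label a URL 'high-quality' if its registered domain is in the reference set."""
    host = urlparse(url).netloc.lower()
    # Reduce e.g. 'www.uni-passau.de' to 'uni-passau.de' (naive two-label cut).
    registered = ".".join(host.split(".")[-2:])
    return "high-quality" if registered in reputable_domains else "unverified"

print(quality_label("https://www.uni-passau.de/en/research"))   # high-quality
print(quality_label("http://cheap-pills.example/buy-now"))      # unverified
```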

Mohammed: I specialize in Machine Learning and Data Science. My responsibilities include building and processing datasets, training machine learning models (such as URL classification models), and enhancing model modularity.

Key milestones we achieved so far include developing and comparing various URL classification models and building and open-sourcing several datasets.
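As a rough sketch of what a URL classification model can look like, the example below trains a character n-gram logistic-regression classifier with scikit-learn on a tiny, invented labeled set. The project's actual models, features, and datasets differ; this only illustrates the general technique.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: URL -> coarse topic label.
urls = [
    "https://news.example.com/politics/eu-summit",
    "https://news.example.com/sports/league-results",
    "https://shop.example.net/cart/checkout",
    "https://shop.example.net/product/usb-cable",
    "https://blog.example.org/research/open-web-index",
    "https://blog.example.org/research/crawling-at-scale",
]
labels = ["news", "news", "shop", "shop", "research", "research"]

# Character n-grams suit URLs, which have little natural word tokenization.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(urls, labels)

print(model.predict(["https://blog.example.org/research/url-classification"]))
```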

What are the challenges you face in your work?

Saber: Navigating the diverse infrastructure setups, guidelines, and technology stacks unique to each of the three data centers that currently host the OWSI can be a significant challenge. Each data center has its own distinct configuration of hardware, software, and networking components, which requires a deep understanding of the specific environment to manage and maintain it effectively.

Moreover, data centers often have their own set of best practices, policies, and procedures that must be followed to ensure smooth operations and compliance with industry standards and regulations. These guidelines cover various aspects, from physical security and access control to data backup and disaster recovery protocols.

Mahmoud: The same for me: I often encounter infrastructure-related issues, which can be a challenge at times.

Mohammed: For me it’s often challenging to effectively test the trained machine learning models.

What are the next steps from here? 

Saber: In the coming months, our goal is to streamline the crawling process across the various data centers using a centralized control center. This automation will enhance efficiency and consistency in data collection. Additionally, we are exploring methods to integrate embeddings seamlessly into our crawling-preprocessing-indexing pipeline.
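One way to slot embeddings into such a pipeline, sketched here under the assumption of a sentence-transformers model rather than whatever the team ultimately adopts, is to embed each document's cleaned text during preprocessing and store the vector alongside the record for the indexing stage:

```python
from sentence_transformers import SentenceTransformer  # third-party package

# Any sentence-embedding model would do; this one is just a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical output of the preprocessing stage: cleaned page texts.
documents = [
    {"url": "https://example.eu/a", "text": "An open web index for Europe."},
    {"url": "https://example.eu/b", "text": "Distributed crawling across data centers."},
]

# Embed in one batch and attach the vectors, ready for the indexing stage.
vectors = model.encode([d["text"] for d in documents], normalize_embeddings=True)
for doc, vec in zip(documents, vectors):
    doc["embedding"] = vec.tolist()

print(len(documents[0]["embedding"]), "dimensions per document")
```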

Mohammed: In the coming months, I aim to optimize and improve the machine learning models, particularly the URL classification model.

Mahmoud: I plan to finish the integration of data from the Mastodon platform into our process.
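Mastodon exposes its public timeline through a documented REST endpoint, so a minimal, hedged sketch of pulling public posts for later processing could look like the following; the instance name is a placeholder, and the project's actual integration is more involved.

```python
import requests

# Placeholder instance; any Mastodon server with an open public timeline works.
INSTANCE = "https://mastodon.social"

def fetch_public_statuses(limit: int = 20):
    """Fetch recent public statuses from a Mastodon instance's public timeline."""
    response = requests.get(
        f"{INSTANCE}/api/v1/timelines/public",
        params={"limit": limit},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # list of status objects with 'url', 'content', etc.

for status in fetch_public_statuses(limit=5):
    print(status["created_at"], status["url"])
```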

 

Thank you for the interview!

Read more about University of Passau: https://openwebsearch.eu/partners/university-of-passau/