OWS.EU Partner in Focus: University of Passau

The University of Passau coordinates the OpenWebSearch.EU project and is also responsible for providing the Open Web Index (OWI), which includes developing the technology for coordinating crawlers, building the OWI and enabling its download. Building the OWI is one of the key milestones in the OWS.EU project since it will accelerate further use and research towards an open web search.

Prof. Michael Granitzer leads the OWS.EU project and holds the Chair of Data Science at University of Passau. Together with Jelena Mitrović, professor of Legal Informatics and Natural Language Processing and leader of the Junior Research Group CAROLL, he supervises the research team working on the Open Web Index. We talked to three researchers from the team about the work they do in the OWS.EU project: Saber Zerhoudi, Mahmoud Istaiti and Mohammed Al-Maamari.

How is the project progressing so far?

Saber: Very well; we have made considerable progress over the past months. Our team has developed scalable, distributed crawling software that is currently deployed across three datacenters. To keep users informed about the content being crawled and provide them with filtering options, we have also created a monitoring dashboard that can be accessed at https://dashboard.ows.eu/.

Can you explain what the dashboard does?

Saber: One of the key features of the dashboard is its ability to display near real-time information about the crawling process. Users can easily track the progress of the crawling tasks and view statistics on the number of pages crawled. This transparency ensures that users are always informed about the status of our crawling pipeline.

Furthermore, the dashboard offers users the flexibility to filter the crawling content based on various criteria, such as domain, keyword, or date range. This functionality allows users to focus on specific subsets of data that are relevant to their needs, saving time and effort in analyzing the collected information.

In addition to monitoring and filtering capabilities, the dashboard provides users with the ability to actively contribute to the crawling process. Users can submit lists of URLs they wish to have crawled, expanding the scope of our data collection efforts. This feature enables users to tailor the crawling process to their specific requirements, ensuring that the most relevant and valuable data is collected.
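As an illustration of the filtering described above (by domain, keyword, or date range), here is a minimal sketch. It is not the dashboard's actual implementation: the `CrawlRecord` type, its fields, and the `filter_records` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical record of one crawled page (assumed fields)."""
    url: str
    text: str
    crawled_on: date


def filter_records(
    records: Iterable[CrawlRecord],
    domain: Optional[str] = None,
    keyword: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[CrawlRecord]:
    """Keep records matching every criterion that is not None."""
    selected = []
    for record in records:
        host = (urlsplit(record.url).hostname or "").lower()
        if domain is not None:
            wanted = domain.lower()
            # Accept the domain itself and any of its subdomains.
            if host != wanted and not host.endswith("." + wanted):
                continue
        if keyword is not None and keyword.lower() not in record.text.lower():
            continue
        if start is not None and record.crawled_on < start:
            continue
        if end is not None and record.crawled_on > end:
            continue
        selected.append(record)
    return selected
```

Calling `filter_records(records, domain="example.org", start=date(2024, 5, 1))` would keep only pages from that domain (and its subdomains) crawled on or after 1 May 2024. Criteria left as `None` are simply not applied.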

Sounds cool! But how does this look from the perspective of a website owner? Will they have the option to manage their data?

Saber: Yes, to address the important aspects of data privacy and intellectual property rights, we have integrated takedown request and website ownership verification functionalities into the dashboard. Through our third-party partners, users can easily submit takedown requests for content they believe infringes upon their rights. Similarly, website owners can verify their ownership, establishing a clear line of communication and ensuring that any concerns or requests are promptly addressed.

By combining a scalable and distributed crawling software with a user-friendly monitoring dashboard, we have created a powerful tool for data collection and management. The ability to monitor, filter, and contribute to the crawling process, along with the integration of takedown request and website ownership verification features, positions our system as a comprehensive solution for users seeking to gather and analyze web data efficiently and responsibly.

What other milestones did you achieve in the project so far?

Mahmoud: My role involves enhancing the crawler process by implementing various filters and features, as well as integrating different data sources into our process. Additionally, I am working on developing machine learning models to extract information from privacy policies.

One major accomplishment is that we can now label crawled websites as either spam or high-quality content by verifying their presence on datasets like Wikipedia external links or CURLIE.
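The lookup idea mentioned above can be sketched as follows. This is only an assumption about how such a check might work, not the team's code: `normalise_host`, `label_url`, and the in-memory `trusted_hosts` set are hypothetical, and in practice the set would be built from a reference dataset such as Wikipedia external links or Curlie. The interview does not say how spam is decided, so the sketch labels unmatched hosts as "unverified" rather than as spam.

```python
from typing import Set
from urllib.parse import urlsplit


def normalise_host(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def label_url(url: str, trusted_hosts: Set[str]) -> str:
    """Label a URL by checking its host against a reference set.

    Absence from the set is not evidence of spam, so unmatched
    hosts are reported as "unverified" (an assumption of this sketch).
    """
    return "high-quality" if normalise_host(url) in trusted_hosts else "unverified"


# Hypothetical reference set; real data would come from a curated dataset.
trusted = {"example.org"}
label = label_url("https://www.example.org/page", trusted)
```

A set gives constant-time membership checks, which matters when every crawled URL has to be looked up.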

Mohammed: I specialize in Machine Learning and Data Science. My responsibilities include building and processing datasets, training machine learning models (such as URL classification models), and enhancing model modularity.

Key milestones we achieved so far include developing and comparing various URL classification models and building and open-sourcing several datasets.

What are the challenges you face in your work?

Saber: Navigating the diverse infrastructure setups, guidelines, and technology stacks unique to each of the three data centers that currently host the OWI can be a significant challenge. Each data center has its own distinct configuration of hardware, software, and networking components, which requires a deep understanding of the specific environment to effectively manage and maintain.

Moreover, data centers often have their own set of best practices, policies, and procedures that must be followed to ensure smooth operations and compliance with industry standards and regulations. These guidelines cover various aspects, from physical security and access control to data backup and disaster recovery protocols.

Mahmoud: Same here. I often encounter issues related to the infrastructure, which can be a challenge at times.

Mohammed: For me it’s often challenging to effectively test the trained machine learning models.

What are the next steps from here? 

Saber: In the coming months, our goal is to streamline the crawling process across various datacenters using a centralized control center. This automation will enhance efficiency and consistency in data collection. Additionally, we are exploring methods to integrate embeddings seamlessly into our crawling-preprocessing-indexing pipeline.

Mohammed: In the coming months, I aim to optimize and improve the machine learning models, particularly the URL classification model.

Mahmoud: I plan to finish the integration of data from the Mastodon platform into our process.

 

Thanks for the interview and good luck with your tasks!

Read more about University of Passau: https://openwebsearch.eu/partners/university-of-passau/

Update from OWS.EU partner projects: Part 3

Building an Open Web Index does not only include technical challenges, but also legal and societal ones. To extend our R&D activities around Open Web Search, we initiated the OWS.EU Community Programme. In our first Third-party call we asked for contributions on legally compliant data gathering and identifying legal or economic aspects that enable or block the development and maintenance of an Open Web Index. The call opened in March 2023 and ended with the onboarding of six new partner projects in November 2023. This blogpost includes updates from two projects that address legal challenges of providing an Open Web Index: ALMASTIC and LOREN.

ALMASTIC: Legal Evaluation of Technical Aspects of the Open Web Index

The ALMASTIC project aims to legally secure the Open Web Index by subjecting its technical aspects to legal evaluation. Its goal is to identify obstacles and mitigate legal risks in the process of successful global dissemination.

After helping to draft the first version of the Open Web Index License (OWIL 1), a comprehensive analysis of relevant legislation, case law and applicable guidelines and academic literature has been performed, forming a solid basis for the future legal compliance of OpenWebSearch.EU. The examination focused on five key areas:

  1. liability for third-party content,
  2. copyright,
  3. data protection,
  4. cybersecurity, and
  5. data governance.

The team around Prof. Kai Erenli from the University of Applied Sciences BFI Vienna will use the remaining time of the project to finalise their analysis, keeping in mind that a final assessment is not always possible: the legal situation in many relevant areas is currently highly dynamic, and relevant legal acts have yet to be finalised or case law identified.

More information about the ALMASTIC project.

LOREN: Legal Open European Web Index

The LOREN project seeks to provide a comprehensive analysis of the legal constraints and requirements for building and operating an Open Web Index. The project will specifically look into the legal implications of crawling, data storage and sharing as well as provide recommendations for building and operating an Open Web Index that complies with the European laws and regulations.

The team around the two lawyers Paul C. Johannes and Dr. Maxi Nebel compiled and analysed the laws and norms that are relevant to building and maintaining an Open Web Index. The results are currently being compiled into a legal opinion with actionable advice regarding crawling, searching, indexing, sharing of the index and disclosure of data for scientific purposes.

Additionally, the LOREN team has started to work on the implications of the right to de-referencing. Furthermore, they are analysing existing open source and open data licenses with regard to their suitability for use in an Open Web Index. In the coming months the team will concentrate on providing their legal opinion with advice concerning the selection and/or adaptation of open data licenses for the Open Web Index. In order to present a workable license, the LOREN team has worked together with other projects from call #1 of the OWS.EU Community Programme.

More information about the LOREN project.

Update from OWS.EU partner projects: Part 2

The OWS.EU Community Programme is an essential part of our work towards a European Open Web Search. The programme helps us to integrate new third-party project teams into the OWS.EU landscape and future R&D activities.

In November 2023 we successfully onboarded six new partner projects looking into technical, legal and economic research topics in support of a European Open Web Index. The projects were selected via our first Third-party call. Information on successful projects selected from our second and third open call will follow soon. This blogpost provides an update from two of the more technical projects from call #1 – LAW4OSAI and Open Console.

LAW4OSAI: License-Aware Web Crawling for Open Search AI 

The LAW4OSAI (License-Aware Web Crawling for Open Search AI) project deals with legal and technical aspects of content crawling and aims to enable license-aware crawling of web content by automatically identifying and retrieving content licenses. The team successfully developed a browser plugin to annotate a dataset for the detection of content licenses on websites and open-sourced the code (https://github.com/LAW4OSAI/plugin-license-annotation). Furthermore, an algorithm to detect standard open licenses (such as Creative Commons licenses) on websites was created and the annotation of a dataset has started. The size of the dataset will increase in the remaining time of the project.

Currently the LAW4OSAI team calls for contributions to an online workshop series for researchers and practitioners that are interested in legal aspects of generative AI (https://www.utwente.nl/en/bms/law4osai/workshop/).

More information about the LAW4OSAI project

Open Console: Improving Knowledge about Websites

The Open Console project is implemented by Markov Solutions – a freelance business run by Mark Overmeer and Thao Phuong Nguyen. The goal of the project is to build an infrastructure (called Open Console) to share information about websites and thereby improve the availability and quality of produced knowledge.

In the current version of the Open Console, people can already create an account and log in to the console. They are able to generate their personal identities (to define different roles), as well as group identities (for cooperation or association). From that, ownership of websites (or email addresses, or domain names) can be verified.

In the remaining project lifetime, the Open Console team will work on implementing other types of ownership proof and making the website production-ready. Together with the OWS.EU partners University of Passau and SUMA-eV, the first service provided by Open Console will be implemented. This will be a learning path for the Open Web Index logging requirements and the design of the OC-third party interface specification.

More information about the Open Console project.

Update from OWS.EU partner projects: Part 1

In November 2023 OWS.EU successfully onboarded six new partner projects looking into technical, legal and economic research topics in support of the European Open Web Index that is currently in the making. The projects were selected in 2023 following an open call. Projects from the second and third calls are currently being reviewed, with updates following soon.

Market potential assessment by Mücke Roth & Company

One of the endeavours from call #1 was the MRC project, dealing with economic questions related to an Open Web Index. The project was initiated by Mücke Roth & Company (MRC) with the goal of assessing the market potential of OWS.EU.

The analysis has already been completed, with a comprehensive study on the market potential of OWS.EU being the major result of the project. The study, which revealed substantial economic and societal benefits of OWS.EU, will be presented to the public in autumn 2024.

Key achievements of the MRC work include a cost-benefit analysis, the identification of key customer segments and market dynamics through competitor benchmarking and a quantification of the European search engine market potential.

Figure 1: Share of Benefits & Costs on Net Benefit over time (Market Potential Assessment for OWS.EU by Mücke Roth & Company)

Last but not least, the assessment incorporates additional customer feedback and further interviews validating the findings of the MRC project. Strategic recommendations were provided to OWS.EU by the MRC team based on the results of their work.

Currently, the implications of the EU AI Act are being monitored so that the strategy can be adapted if new regulations arise.

More about the MRC project

Legal, Intellectual Property & Cyber Security Aspects of Open Web Search

The OWS.EU project raises a multitude of highly complex legal questions. LISA (Legal, Intellectual Property and Cyber-Security Aspects) is one of the legal projects that has taken on the challenge of determining legal questions, identifying relevant legal risks and adequately addressing them. The goal is to define a legal framework for the development and operation of an Open Web Search Index.

In the first half of the project, the team around Prof. Dr. Matthias Wendland from the University of Oldenburg defined what constitutes illegal content and established the legal duties for operators of an Open Web Index. Legal requirements for takedown requests, including those for criminal content, IP infringements, and data protection were set out. Additionally, the ownership of digital content and of the Open Web Index was clarified and the legal framework necessary for sharing the index was created. Furthermore, the team drafted an End User License Agreement (EULA).

Figure 2: Data Centers & Legal Territoriality in OWS.EU (from the LISA framework)

In the remaining time of the LISA project, the team plans to focus on the design of a comprehensive legal framework for the Open Web Index, including governance structures and guidelines as well as best practices for its operation. Additionally, the End User License Agreement (EULA) to facilitate the sharing and usage of the index will be finalized. European legislative acts that came into force recently will be monitored closely and incorporated into the project’s plans and policies when necessary.

More about the LISA project

 

Second year around: #FreeWebSearch Day – a day for free access to digital information is happening again

Freedom of information and democracy within the digital sphere requires open access to online resources. The #FreeWebSearch Day on 29 September brings this topic to the agenda.

In 2023 the Open Search Foundation (OSF), one of our 14 consortium partners, brought to life the International #FreeWebSearchDay, which on 29 September each year stands for free and transparent internet search. Via the #FWSD website, people are invited to participate and advocate for free, transparent and open web search.

Non-transparent information: A core problem

#FreeWebSearch Day on 29 September is all about raising awareness for the current lack of transparency within internet search.

“Many internet users still think that search results at the top of their results list are good, correct and trustworthy, even though they cannot know the criteria of the rankings,” states Christine Plote.

However, freedom of information is the most important foundation of a functioning democracy. There is still a huge lack of knowledge about how search results come about and are ranked, or how a search engine knows what is in a picture.

“Surprisingly, we seem to accept a high degree of digital illiteracy in this respect. Yet, it is high time that search and the evaluation of search results become part of the curricula of schools or universities, training and further education”, the co-founder of the OSF claims.

In addition, schools and companies should give higher priority to hot topics, such as the impact of artificial intelligence, the new text generators and Large Language Models (LLMs) on online search.

Call for Ideas: Actions, lectures, hackathons wanted

For #FreeWebSearch Day 2024 on 29 September, contributions from many different fields of expertise are welcome: companies, schools, universities and other educational institutions, museums or associations are invited to contribute with (online) lectures, discussions, participatory activities or projects. IT specialists or programmers can contribute their technical know-how and organise hackathons and the like.
Additionally, everyone can help spread the word by downloading and reposting the #FWSD social media graphics.

Information and events on #FreeWebSearch Day on and around 29 September will be continuously updated at: www.FreeWebsearch.org

 

 

Nine projects selected to work with OWS.EU

Nine new projects will support our quest for a better European Web Search from July 2024 onwards. The projects are the winners of the last OWS.EU third-party calls #2 and #3, which opened in February 2024 and closed in April 2024.

The nine winners were selected from 49 submissions by a jury of experts from the OWS.EU project. Researchers, innovators and computing centres submitted their ideas for:

Call #2: Applications of the Open Web Index

or

Call #3: Data Centre on-boarding

The projects will receive funding ranging from 50,000 euros to 150,000 euros for a funding period of up to 12 months. Stay tuned for more information on the winning projects and read more about selected projects from call #1.

New project “Privacy-enhancing digital infrastructures” (PriDI) launches

At the interface between law and business informatics, the PriDI project is researching how an open web index can be designed in conformity with fundamental rights and in such a way that it protects privacy. This includes how values such as privacy and data protection can be anchored in such a web index in the sense of “value-by-design”.

Together with the University of Kassel, one of our consortium partners – namely the Open Search Foundation – has won the PriDI project, funded by the German Federal Ministry of Education and Research.

PriDI (which translates from German as “privacy-enhancing digital infrastructures”) will examine the legal implications of forming an open European web index that complies with fundamental rights and protects privacy. The PriDI project team will accompany the OWS.EU-driven development of the European open search infrastructure over a period of 48 months and ensure that values such as privacy and data protection are anchored in the open web index in the sense of “value-by-design”.

Without search engines, navigation in the digital world is almost unthinkable.

The current web search business models are based on the intentional exploitation of user data for personalised advertising in extensive online advertising networks. User data is a huge income stream for online businesses, with personal data being the “digital currency” of the 21st century. But why are there no real alternatives to the search engine models of the big tech giants?

The answer is simple: search engines require a web index, a kind of table of contents of the World Wide Web. Currently, there are only four search index providers worldwide with comprehensive indexes. This is because market entry barriers such as the enormous costs of setting up and operating a web index make it difficult for new search engines to build their own index and assert themselves on the market. In other words, search engine developers are currently dependent on the proprietary web indexes of the four major platforms, which dictate their access and usage conditions and act as gatekeepers in the search engine market. This makes web search anything but “open,” “privacy-protected,” or “free.”

The OWS.EU open web index, on the other hand, could provide a large number of search engines with a basis for their services. The open web index will also be used by science, research and companies for innovations in the field of artificial intelligence.

Once launched, the open web index will promote diversity and freedom of choice in the area of internet search as well as freedom of information and will be an important step towards digital sovereignty. In addition, opening up the search engine market will strengthen the informational self-determination of citizens.

PriDI will support OWS.EU with pioneering legal design patterns

The PriDI project team will therefore support the development of the open web index with legal and business IT expertise. The aim is to align the resulting search infrastructure with the best possible implementation of fundamental rights and applicable data protection law. With a focus on a particularly privacy-friendly design, the project team will examine legal requirements, translate them into specific catalogues of requirements and have them evaluated by stakeholder groups.

The project website at the Federal Ministry of Education and Research (in German):
https://www.forschung-it-sicherheit-kommunikationssysteme.de/projekte/pridi

Open Calls #2 and #3 are closed

The application deadline for Call #2 and Call #3 has passed.

The 8.5 million euro EU project on Open Web Search launched new third-party calls, inviting researchers, innovators and computing centres to join the quest for a new Internet Search in Europe. More information on the calls: Call #2 and Call #3.

“Funding of up to 150,000 euros“ | Informationsdienst Wissenschaft

The German research outlet idw (Informationsdienst Wissenschaft) is the go-to news platform for staying up to date on cutting-edge science updates, publications, projects and topics. The member-centric platform caters to more than 43,000 subscribers.

As OWS.EU consortium partner University of Passau is an active member, idw reported on the recent OWS.EU Third-Party Open Calls 2 and 3.

The ows.eu project, part of Horizon Europe, is currently calling on third parties to contribute innovations and infrastructure to help further develop the Open Web Index.

https://idw-online.de/en/news828851

Proceedings #ossym23 – 5th international Symposium on Open Search

The proceedings of the 5th International Open Search Symposium #ossym23 have been published:

Vol. 5 (2023): Proceedings 5th International Open Search Symposium #ossym2023, 4–6 October 2023, CERN, Geneva, Switzerland

Volume five of the Proceedings of the International Open Search Symposium 2023 summarises peer-reviewed articles and research results selected and presented at the Open Search Symposium 2023. […]