OWS.EU Partner in Focus: Graz University of Technology

As a partner in the OpenWebSearch.Eu project, Graz University of Technology contributes its interdisciplinary expertise through the Cognitive and Digital Science Lab. CoDiS Lab explores the intersection of computer science, cognitive psychology, and human factors, focusing on digital literacy, decision-making, and human-centered design.

Within the project, Prof. Dr. Christian Guetl, head of the CoDiS Lab, and postdoctoral researcher Dr. Alexander Nussbaumer develop search applications together with their team – Chiara Ruß-Baumann (MSc in psychology), Sebastian Gürtl (PhD student in computer science), Felix Holz (BSc in computer science), and Daniel Scharf (BSc in computer science) – and ensure the integration of ethical and societal principles such as trust, privacy, and quality into the open web search infrastructure.

Thanks to Christian and Alexander for taking the time to share your insights with us.

Portrait of Christian Guetl (@ TU Graz)
Portrait of Alexander Nussbaumer (@ TU Graz)

Please describe your organization’s tasks in the project. What is your field of expertise that you bring to the project?

Our main tasks in the OpenWebSearch.eu project are (a) creating applications using the Open Web Index data, and (b) coordinating work on ethical, legal, and societal aspects related to the creation and operation of the Open Web Index. The search applications should demonstrate how the Open Web Index can be used for special-purpose search applications. The elaboration of ethical, legal, and societal aspects is needed in order to understand and adhere to them.

How is the project progressing? Which major milestones did you achieve?

To support application development, the MOSAIC search framework has been developed: an out-of-the-box search engine that can handle web index data downloaded from the Open Web Index. Furthermore, it can be used as a backbone for creating more complex custom search applications. To take care of the ethical and legal aspects, a framework of technical-organisational measures has been elaborated that advises index creators, operators, and users on how to adhere to ethical and legal standards and mitigate the respective risks.

What are the challenges you have been facing (regarding your tasks)?

The most demanding challenges are the legal and ethical constraints on creating and sharing web data and index shards. The main difficulty arises from the fact that third-party web content is downloaded, processed, and shared with the public. However, web content can contain sensitive and problematic information, such as personal data, copyrighted content, illegal data, or disinformation. Hence, various European laws have to be taken into consideration, such as copyright law, data protection law, and criminal law.

Which milestones do you plan to achieve in the remaining months?

The final milestones mainly include application demonstrators that showcase and document how to make use of and benefit from the Open Web Index. This should stimulate others to create their own applications based on web data and the Open Web Index. Furthermore, summaries will be created that explain the ethical and legal situation related to the creation, operation, sharing, and usage of the Open Web Index.

What makes the OWS project special to you?

Already in the early 2000s we had a first project working on an alternative search engine. Even at that time it was a distributed system that enabled, in a flexible way, crawling websites and building an index to be used by different applications. Since then we have seen great value in an open search index and an open search infrastructure. The OWS project has finally not only realized this idea but also scaled it up into a useful resource for search applications, AI tools, and research. Moreover, developments on the global scale have shown that digital sovereignty at the European level is key for our economic and scientific landscape.

Do you already have plans for the time after the project ends?

As part of the collaborative effort to keep the infrastructure operational and to provide up-to-date web index slices, we want to continue improving the MOSAIC search framework and to work on further search applications. Our main interest lies in science search applications and in applications for Austria's sovereignty over digital infrastructure, in particular web search independence and data infrastructure for emergency management.

logo of TU Graz

Read more about Graz University of Technology: https://openwebsearch.eu/partners/tu-graz/

OWS.EU Partner in Focus: Radboud University

Continuing our partner portrait series, today’s spotlight is set on Radboud University in the Netherlands. Prof.dr.ir. Arjen P. de Vries and Prof.dr.ir. Djoerd Hiemstra lead the Information Retrieval research group at Radboud University, part of the Data Science section in the Institute for Computing and Information Sciences.

In OpenWebSearch.EU, the team, complemented by PhD candidates Gijs Hendriksen and Daria Alexander, has been developing a new architecture for search engines in which many parts of the system are decentralized. The key idea is to separate index construction from the search engines themselves, so that the most expensive step, creating the index shards, can be carried out on large clusters while the search engine itself can be operated locally.

Another vision includes an Open-Web-Search Engine Hub, where companies and individuals can share their specifications of search engines and pre-computed, regularly updated search indices.

Having recently launched the OpenWebIndex pilot, we asked Arjen and Gijs about some key results and learnings thus far while also touching on some next steps for the remaining project time.
Gijs Hendriksen, Radboud University, PhD Candidate

Gijs and Arjen, thank you both for your time today. Could you please describe Radboud University’s tasks in the OpenWebSearch.eu project? What is the field of expertise that you bring to the project?

Arjen: The Radboud University expertise is Information Retrieval, which is the core field of computer science that contributes to the development of search engines. The central question is how computers can establish the relevance of information objects for people’s information needs. We look into a wide range of open questions in the field, covering topics including the mathematical modeling of information (with and without new AI techniques), scalable and resource efficient system architectures, and, perhaps the most difficult one, how to measure the quality of retrieval systems and compare different approaches on their effectiveness.

Sounds like an ongoing tedious process. Have you found any key learnings for what works and what doesn’t in combining or comparing the various approaches?

Gijs: There were many learnings along the way indeed. Without going into too much detail, some of our key learnings are published as research papers and OWS deliverables.

How is the project progressing overall? Which major milestones are you proud of thus far?

Gijs: From our point of view, the project is progressing very well! After 2.5 years of engineering, we are now running daily workflows that produce index shards from crawled content across three European data centers. Now that we are getting the data out there, we can focus on improving the ease of access to these index shards.

Could you elaborate on that a bit more?

Gijs: Sure. We are now working on improving access to the Open Web Index. A main part of that is deciding how we want to ‘shard’ the data, i.e. how we want to distribute the data across logical partitions that can be used to efficiently query a part of the data. Currently, we split the index into language-based shards, but we want to experiment with topic-based shards and even create shards based on frequent access patterns.
We are also actively investigating how we can best integrate shards over time. We are currently producing daily index shards, but have yet to decide how we can best combine these daily subsets, and how we should deal with document updates and deletions. Finally, we recognize that many people want to be able to query our index directly without having to download all our index data. We are working on a way in which we offer direct querying capabilities over an inverted file hosted in a data lake. This should also enable us to efficiently propagate updates to the index.
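The language-based sharding Gijs describes can be pictured as a simple routing step: documents are partitioned by detected language, and a language-restricted query then only touches the matching shard. The sketch below is a minimal illustration with invented document and shard structures, not the actual Open Web Index format:

```python
from collections import defaultdict

# Hypothetical documents with a pre-detected language tag.
documents = [
    {"url": "https://example.de/seite", "lang": "de", "text": "offene websuche"},
    {"url": "https://example.nl/pagina", "lang": "nl", "text": "open zoeken"},
    {"url": "https://example.de/blog", "lang": "de", "text": "index shards"},
]

def build_language_shards(docs):
    """Partition documents into per-language shards (illustrative only)."""
    shards = defaultdict(list)
    for doc in docs:
        shards[doc["lang"]].append(doc)
    return dict(shards)

def search(shards, lang, term):
    """A language-restricted query scans only the shard for that language."""
    return [d["url"] for d in shards.get(lang, []) if term in d["text"]]

shards = build_language_shards(documents)
print(search(shards, "de", "websuche"))  # only the German shard is scanned
```

Topic-based or access-pattern-based shards, which the team wants to experiment with, would follow the same routing idea with a different partition key.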

Sounds promising. What are some of the challenges you are facing?

Arjen: The main technical challenges stem directly from the scale of the Web and the noisiness of Web data. The really big problem, however, remains evaluation. How do you establish the value of innovations in search without continuously running costly user studies? We are looking into mixing ideas from what is known in our field as ‘the Cranfield tradition’ with new developments in LLMs, and user-oriented studies to fill in where machines would fail.

What makes the OWS project special?

Arjen: EU projects are often a way for partner organisations to fund their own interests, resulting in internal project frictions (large or small) about the direction and final objectives. With OpenWebSearch.eu it is nothing like that. Everyone on the team is highly motivated to make a lasting change in the distribution of online powers, and such a broadly shared target is so refreshing!
We are thoroughly enjoying taking part in this enterprise, and we are convinced that OpenWebSearch.eu will produce a lasting impact, sustainable beyond the duration of the project.

Do you already have plans for the time after the project ends?

Arjen: The brief answer is ‘Keep going’. Hopefully we manage to keep the team together, and find funding to even expand by integrating parties that have started to contribute actively to the Open Web Search and Analysis Infrastructure. And we will work hard to make the index a fundamental building block, suitable for others to do Web search research.

Thank you for the insights!

Read more about Radboud University: https://openwebsearch.eu/partners/radboud-university/

OWS.EU Partner in Focus: Leibniz Supercomputing Centre

The Leibniz Supercomputing Centre (LRZ) is the second partner we are introducing, following our portrait of the University of Passau. The LRZ is part of the BADW (Bayerische Akademie der Wissenschaften) and, by providing technical support and supercomputing power, delivers a robust infrastructure for the Open Web Index. The LRZ research team is guided and supported by the chairman of the board of directors, Prof. Dr. Dieter Kranzlmüller. The team includes Research and Information Management team leader Megi Sharikadze, Research Data Management team leader Stephan Hachinger, research managers Shahab Khormali, Jirathana Dittrich and Nana Gratiashvili, research associates Mohamad Hayek and Stuart Gordon, and communications manager Anita Schuffert. The LRZ team has multiple functions: it coordinates the project management and research activities, takes care of the project’s financial support to third parties programme, contributes to the infrastructure work package, and actively participates in dissemination, communication and exploitation measures as well as in topics such as governance and the legal and ethical aspects of OpenWebSearch.eu.

Shahab Khormali is in charge of project management, with a primary focus on Cascade funding, also known as Financial Support for Third Parties (FSTP), activities to be tackled within the project. We asked him about crucial milestones thus far as well as the outlook for the last project period and beyond.

Shahab Khormali, Leibniz Supercomputing Centre of the BAdW (LRZ), Research Manager

Please describe your tasks (LRZ) in the project. What is your field of expertise that you bring to the project?

Shahab: BADW-LRZ, as one of the foremost European computing centers, contributes to the OpenWebSearch.eu project in two key areas. The first is the technical domain, where, along with the other participating data centers in the consortium, we are responsible for providing a state-of-the-art storage and compute infrastructure. In this context, LRZ’s Research Data Management team, led by Dr. Stephan Hachinger, applies its expertise in developing and managing highly scalable, reliable and secure computing infrastructure, and supports the project in executing core services and storing core data products.

The second area concerns project management, coordination, and communication. Within this framework, LRZ’s Research and Information Management (RIME) team, comprising experts in science management and communication and headed by Dr. Megi Sharikadze, leads the “Project Management and Coordination Office (PMCO)” and actively supports the “Open Web Search Ecosystem and Sustainability” and “Outreach and Communication” project goals.

In PMCO, the LRZ team works closely with the project coordinator and partners to oversee the overall project management. Together we ensure that the project remains aligned with its objectives while progressing efficiently and successfully. We are responsible for delivering optimal project performance, managing resources efficiently, overseeing reporting, maintaining effective communication and networking with the relevant stakeholders, supporting international cooperation, and fostering innovation and capacity building. Additionally, we coordinate the entire process of third-party contributions, handling everything from the initial preparation and announcement of open calls to the evaluation, awarding, distribution of funds, and final closure of grants. The PMCO oversees that each step is carried out smoothly and efficiently, and maintains clear processes, especially for continuous communication with all involved third-party partners.

How is the project progressing so far?

Shahab: The project is progressing very well and is on track. We have achieved several key milestones, including the successful completion of the mid-term project review with extremely positive feedback from the reviewers, which also included the approval of all deliverables and milestones in the first periodic report. Furthermore, we conducted all three open calls for third-party contributors as planned, and the selected third parties have been successfully integrated into the project. Moreover, we are introducing the project to key stakeholders across Europe and enhancing its visibility within the relevant networks and circles.

What are the challenges you are facing with regard to your tasks?

Shahab: My main tasks belong to the PMCO activities. In this context, I do not think we have faced any major challenges. However, one important topic for us and the consortium is the continuation of the project in the form of a follow-up project, which relates to our ecosystem and sustainability responsibilities. We are closely following this matter and actively working to address it by applying for and securing new funding.
Another challenge worth pointing out is establishing long-term and productive relationships with industrial stakeholders and policymakers. Engaging effectively with these groups and ensuring ongoing collaboration throughout the project is an area that requires continuous attention and improvement.

Which milestones do you plan to achieve in the next months?

Shahab: A key milestone in the coming months is the completion of the first group of third-party projects, meaning that all partners should finalize their projects on time and submit their project reports. This will allow us to close these projects officially. In addition, we look forward to conducting the mid-term review for the second and third groups of third-party projects in March/April 2025. Another important consideration is that the new end date of the project (due to a six-month prolongation) must be integrated across all project tasks.

What makes the OWS project special?

Shahab: In my opinion, OpenWebSearch.eu is special and important due to its focus on cutting-edge IT and internet search technologies, which are highly dynamic and continuously evolving. It brings together highly specialized partners with state-of-the-art infrastructure and technologies, ensuring innovation and tangible outcomes. Additionally, the involvement of third-party contributors adds diverse expertise, enriching the project’s outcomes.

Do you already have plans for the time after the project ends?

Shahab: Yes, we have plans for the continuation of the project’s themes and objectives beyond its end date, and we are actively pursuing new funding opportunities to continue the work.

Thank you for the interview!

Read more about LRZ: https://openwebsearch.eu/partners/leibniz-supercomputing-centre-lrz/

OWS.EU Partner in Focus: University of Passau

The University of Passau coordinates the OpenWebSearch.EU project and is beyond that responsible for providing the Open Web Index (OWI), which includes the development of technology for coordinating crawlers, building the OWI and enabling its download. Building the OWI is one of the key milestones in the OWS.EU project since it will accelerate further use and research towards an open web search.

Prof. Michael Granitzer leads the OWS.EU project and holds the Chair of Data Science at University of Passau. Together with Jelena Mitrović, Professor of Legal Informatics and Natural Language Processing and leader of the Junior Research Group CAROLL, he supervises the research team working on the Open Web Index. We talked to three researchers from their team about the work they do in the OWS.EU project: Saber Zerhoudi, Mahmoud Istaiti and Mohammed Al-Maamari.

How is the project progressing so far?

Saber: Very well, we have made considerable progress over the past months. Our team has developed scalable, distributed crawling software that is currently deployed across three data centers. To keep users informed about the content being crawled and provide them with filtering options, we have also created a monitoring dashboard that can be accessed at https://dashboard.ows.eu/.

Can you explain what the dashboard does?

Saber: One of the key features of the dashboard is its ability to display near real-time information about the crawling process. Users can easily track the progress of the crawling tasks and view statistics on the number of pages crawled. This transparency ensures that users are always informed about the status of our crawling pipeline.

Furthermore, the dashboard offers users the flexibility to filter the crawling content based on various criteria, such as domain, keyword, or date range. This functionality allows users to focus on specific subsets of data that are relevant to their needs, saving time and effort in analyzing the collected information.

In addition to monitoring and filtering capabilities, the dashboard provides users with the ability to actively contribute to the crawling process. Users can submit lists of URLs they wish to have crawled, expanding the scope of our data collection efforts. This feature enables users to tailor the crawling process to their specific requirements, ensuring that the most relevant and valuable data is collected.
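Conceptually, the filtering Saber describes amounts to applying optional predicates over crawl records. The sketch below uses an invented record structure for illustration, not the dashboard's real data model or API:

```python
from datetime import date

# Hypothetical crawl records; the real dashboard's data model may differ.
records = [
    {"domain": "example.eu", "title": "Open web search", "crawled": date(2024, 5, 2)},
    {"domain": "sample.org", "title": "Unrelated page", "crawled": date(2024, 6, 10)},
]

def filter_records(records, domain=None, keyword=None, start=None, end=None):
    """Apply optional domain, keyword, and date-range filters to crawl records."""
    out = []
    for r in records:
        if domain and r["domain"] != domain:
            continue
        if keyword and keyword.lower() not in r["title"].lower():
            continue
        if start and r["crawled"] < start:
            continue
        if end and r["crawled"] > end:
            continue
        out.append(r)
    return out

print([r["domain"] for r in filter_records(records, keyword="web search")])
```

Each filter is independent, so users can combine any subset of criteria, mirroring how the dashboard lets them narrow down to the data relevant to their needs.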

But how does this look from the perspective of a website owner? Will they have the option to manage their data?

Saber: Yes, to address the important aspects of data privacy and intellectual property rights, we have integrated takedown request and website ownership verification functionalities into the dashboard. Through our third-party partners, users can easily submit takedown requests for content they believe infringes upon their rights. Similarly, website owners can verify their ownership, establishing a clear line of communication and ensuring that any concerns or requests are promptly addressed.

By combining scalable, distributed crawling software with a user-friendly monitoring dashboard, we have created a powerful tool for data collection and management. The ability to monitor, filter, and contribute to the crawling process, along with the integrated takedown request and website ownership verification features, positions our system as a comprehensive solution for users seeking to gather and analyze web data efficiently and responsibly.

What other milestones did you achieve in the project so far?

Mahmoud: My role involves enhancing the crawler process by implementing various filters and features, as well as integrating different data sources into our process. Additionally, I am working on developing machine learning models to extract information from privacy policies.

One major accomplishment is that we can now label crawled websites as either spam or high-quality content by verifying their presence on datasets like Wikipedia external links or CURLIE.
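The labelling heuristic Mahmoud mentions can be sketched as a host lookup against curated allowlists, such as hosts harvested from Wikipedia external links or CURLIE. The hosts below are invented, and the binary spam/high-quality decision is a deliberate simplification of the real pipeline:

```python
from urllib.parse import urlparse

def label_site(url, allowlists):
    """Label a crawled site by checking its host against curated allowlists
    (e.g. hosts from Wikipedia external links or CURLIE).
    Real pipelines use more signals than this single lookup."""
    host = urlparse(url).netloc.lower()
    return "high-quality" if any(host in lst for lst in allowlists) else "spam"

# Hypothetical allowlists; the real ones are derived from the full datasets.
wikipedia_hosts = {"example.org", "acm.org"}
curlie_hosts = {"example.org", "museum.example"}

print(label_site("https://example.org/page", [wikipedia_hosts, curlie_hosts]))
```

Presence in a human-curated directory is a cheap, high-precision quality signal, which is why such allowlists are a common first filter before heavier content-based classifiers.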

Mohammed: I specialize in Machine Learning and Data Science. My responsibilities include building and processing datasets, training machine learning models (such as URL classification models), and enhancing model modularity.

Key milestones we achieved so far include developing and comparing various URL classification models and building and open-sourcing several datasets.
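As an illustration of what URL classification involves, the toy model below trains a tiny naive Bayes classifier on tokens extracted from URLs. This is only a sketch: the team's actual models, features, and training data are not shown here, and the example URLs and labels are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(url):
    """Split a URL into lowercase tokens on common delimiters."""
    return [t for t in re.split(r"[/:.\-_?=&]+", url.lower()) if t]

class TinyURLClassifier:
    """Minimal multinomial naive Bayes over URL tokens (illustrative only)."""

    def fit(self, urls, labels):
        self.counts = defaultdict(Counter)   # token counts per label
        self.label_totals = Counter(labels)  # class priors
        for url, y in zip(urls, labels):
            self.counts[y].update(tokenize(url))
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict(self, url):
        def log_score(y):
            c, n = self.counts[y], sum(self.counts[y].values())
            prior = math.log(self.label_totals[y] / sum(self.label_totals.values()))
            # Laplace-smoothed token likelihoods.
            return prior + sum(
                math.log((c[t] + 1) / (n + len(self.vocab)))
                for t in tokenize(url)
            )
        return max(self.counts, key=log_score)

# Invented training data for the sketch.
urls = ["https://news.example.com/politics/article", "https://shop.example.com/cart/item",
        "https://news.example.com/sports/report", "https://shop.example.com/checkout"]
labels = ["news", "shop", "news", "shop"]

model = TinyURLClassifier().fit(urls, labels)
print(model.predict("https://news.example.com/economy/story"))
```

Classifying by URL alone is attractive at crawl time because it avoids fetching the page first; real models would use richer features and far larger labeled datasets.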

What are the challenges you face in your work?

Saber: Navigating the diverse infrastructure setups, guidelines, and technology stacks unique to each of the three data centers that currently host the OWSI can be a significant challenge. Each data center has its own distinct configuration of hardware, software, and networking components, which requires a deep understanding of the specific environment to effectively manage and maintain.

Moreover, data centers often have their own set of best practices, policies, and procedures that must be followed to ensure smooth operations and compliance with industry standards and regulations. These guidelines cover various aspects, from physical security and access control to data backup and disaster recovery protocols.

Mahmoud: Same here: I often encounter infrastructure-related issues, which can be a challenge at times.

Mohammed: For me it’s often challenging to effectively test the trained machine learning models.

What are the next steps from here? 

Saber: In the coming months, our goal is to streamline the crawling process across various datacenters using a centralized control center. This automation will enhance efficiency and consistency in data collection. Additionally, we are exploring methods to integrate embeddings seamlessly into our crawling-preprocessing-indexing pipeline.

Mohammed: In the coming months, I aim to optimize and improve the machine learning models, particularly the URL classification model.

Mahmoud: I plan to finish the integration of data from the Mastodon platform into our process.


Thank you for the interview!

Read more about University of Passau: https://openwebsearch.eu/partners/university-of-passau/